Here's a clear and detailed explanation of LoRA and QLoRA, two efficient techniques for fine-tuning large language models (LLMs) without the massive computational and memory overhead of full model training.
What is LoRA (Low-Rank Adaptation)?
LoRA is a parameter-efficient fine-tuning (PEFT) method that allows you to adapt pre-trained models without updating all the weights.
How it works:
Instead of modifying all the weights in a large model (which can be billions of parameters), LoRA freezes the original model and adds small trainable layers (low-rank matrices) to certain parts, usually the attention weights. These adapters learn the new task while keeping the original model intact.
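To make the idea concrete, here is a minimal sketch of a LoRA-style linear layer in PyTorch. The class name, initialization scheme, and hyperparameter values are illustrative, not taken from any particular library:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: W' = W + (alpha/r) * A @ B."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pre-trained weights

        d, k = base.out_features, base.in_features
        # A starts small and random, B starts at zero, so Delta W = A @ B
        # is zero at initialization and the adapter begins as a no-op.
        self.A = nn.Parameter(torch.randn(d, r) * 0.01)
        self.B = nn.Parameter(torch.zeros(r, k))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta_w = (self.A @ self.B) * self.scale  # low-rank update Delta W
        return self.base(x) + x @ delta_w.T       # W'x = Wx + (Delta W)x

# Example: wrap a 768 -> 768 projection; only A and B (~12k params) are trainable.
layer = LoRALinear(nn.Linear(768, 768), r=8)
y = layer(torch.randn(4, 768))
```

In practice, adapters like this typically wrap the query and value projections inside attention blocks, and only A and B receive gradients during training.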
Intuition:
Rather than changing a large weight matrix $W$, LoRA adds a low-rank update:

$$W' = W + \Delta W \quad \text{where} \quad \Delta W = A \cdot B$$

- $A \in \mathbb{R}^{d \times r}$
- $B \in \mathbb{R}^{r \times k}$
- $r$ (the rank) is small, e.g., 4, 8, or 16
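To see why a small rank matters, compare trainable parameter counts for a single projection matrix. The sizes below are illustrative (a 4096×4096 attention projection, as in many 7B-class models):

```python
d, k, r = 4096, 4096, 8       # illustrative projection size, small rank
full = d * k                  # trainable params if W itself were updated
lora = d * r + r * k          # trainable params in A and B combined
print(f"{full:,} vs {lora:,} -> {full // lora}x fewer")
# 16,777,216 vs 65,536 -> 256x fewer
```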
What is QLoRA (Quantized LoRA)?
QLoRA is an extension of LoRA that makes fine-tuning even more memory-efficient by combining:
- Quantization: convert the frozen model weights to a 4-bit or 8-bit format to save GPU memory.
- LoRA adapters: trainable low-rank layers, just like in LoRA.
Why it matters:
- Enables training of 65B+ parameter models on a single GPU (48–80 GB).
- Maintains close to full-precision accuracy.
- Huge memory savings plus faster training.
QLoRA introduces:
- 4-bit quantization (NF4) using bitsandbytes
- Double quantization (storage optimization)
- Paged optimizers (better memory management for long sequences)
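Putting the pieces together, a typical QLoRA setup with Hugging Face transformers, peft, and bitsandbytes looks roughly like this. The model ID and target module names are placeholders; adjust them for your architecture:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization with double quantization, per the QLoRA recipe.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NormalFloat4 data type
    bnb_4bit_use_double_quant=True,      # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Placeholder model ID; any causal LM on the Hub works the same way.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention projections; model-specific
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable

# For the paged optimizer, pass optim="paged_adamw_32bit" to
# transformers.TrainingArguments when setting up the Trainer.
```

The base model stays frozen in 4-bit NF4 while gradients flow only through the small bfloat16 LoRA adapters, which is what lets very large models fit on a single GPU.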