Here's a clear and detailed explanation of LoRA and QLoRA, two efficient techniques for fine-tuning large language models (LLMs) without the massive computational and memory overhead of full model training.
What is LoRA (Low-Rank Adaptation)?
LoRA is a parameter-efficient fine-tuning (PEFT) method that allows you to adapt pre-trained models without updating all the weights.
How it works:
Instead of modifying all the weights in a large model (which can be billions of parameters), LoRA freezes the original model and adds small trainable layers (low-rank matrices) to certain parts, usually the attention weights. These adapters learn the new task while keeping the original model intact.
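To make the idea concrete, here is a minimal sketch of a LoRA-style linear layer in PyTorch. The class name, initialization scheme, and hyperparameter values are illustrative, not taken from any particular library:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: W' = W + (alpha/r) * A @ B."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pre-trained weights

        d, k = base.out_features, base.in_features
        # A starts small and random, B starts at zero, so Delta W = A @ B
        # is zero at initialization and the adapter begins as a no-op.
        self.A = nn.Parameter(torch.randn(d, r) * 0.01)
        self.B = nn.Parameter(torch.zeros(r, k))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta_w = (self.A @ self.B) * self.scale  # low-rank update Delta W
        return self.base(x) + x @ delta_w.T       # W'x = Wx + (Delta W)x

# Example: wrap a 768 -> 768 projection; only A and B (~12k params) are trainable.
layer = LoRALinear(nn.Linear(768, 768), r=8)
y = layer(torch.randn(4, 768))
```

In practice, adapters like this typically wrap the query and value projections inside attention blocks, and only A and B receive gradients during training.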
Intuition:
Rather than changing a large weight matrix $W$, LoRA adds a low-rank update:

$$W' = W + \Delta W \quad \text{where} \quad \Delta W = A \cdot B$$

- $A \in \mathbb{R}^{d \times r}$
- $B \in \mathbb{R}^{r \times k}$
- $r$ (the rank) is small, e.g., 4, 8, or 16
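To see why a small rank matters, compare trainable parameter counts for a single projection matrix. The sizes below are illustrative (a 4096×4096 attention projection, as in many 7B-class models):

```python
d, k, r = 4096, 4096, 8       # illustrative projection size, small rank
full = d * k                  # trainable params if W itself were updated
lora = d * r + r * k          # trainable params in A and B combined
print(f"{full:,} vs {lora:,} -> {full // lora}x fewer")
# 16,777,216 vs 65,536 -> 256x fewer
```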
What is QLoRA (Quantized LoRA)?
QLoRA is an extension of LoRA that makes fine-tuning even more memory-efficient by combining:
- Quantization: convert the frozen model weights to a 4-bit or 8-bit format to save GPU memory.
- LoRA adapters: trainable low-rank layers, just like in LoRA.
Why it matters:
- Enables training of 65B+ parameter models on a single GPU (48–80 GB).
- Maintains close to full-precision accuracy.
- Huge memory savings plus faster training.
QLoRA introduces:
- 4-bit quantization (NF4) using bitsandbytes
- Double quantization (storage optimization)
- Paged optimizers (better memory management for long sequences)
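Putting the pieces together, a typical QLoRA setup with Hugging Face transformers, peft, and bitsandbytes looks roughly like this. The model ID and target module names are placeholders; adjust them for your architecture:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization with double quantization, per the QLoRA recipe.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NormalFloat4 data type
    bnb_4bit_use_double_quant=True,      # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Placeholder model ID; any causal LM on the Hub works the same way.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention projections; model-specific
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable

# For the paged optimizer, pass optim="paged_adamw_32bit" to
# transformers.TrainingArguments when setting up the Trainer.
```

The base model stays frozen in 4-bit NF4 while gradients flow only through the small bfloat16 LoRA adapters, which is what lets very large models fit on a single GPU.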