Monday, August 18, 2025

Fine-Tuning LLMs - Two Techniques

LoRA and QLoRA are two efficient techniques for fine-tuning large language models (LLMs) without the massive computational and memory overhead of full model training. Here's a clear and detailed explanation of both.


What is LoRA (Low-Rank Adaptation)?

LoRA is a parameter-efficient fine-tuning (PEFT) method that allows you to adapt pre-trained models without updating all the weights.

How it works:

Instead of modifying all the weights in a large model (which can be billions of parameters), LoRA freezes the original model and adds small trainable layers (low-rank matrices) to certain parts (usually the attention weights). These adapters learn the new task while keeping the original model intact.

Intuition:

Rather than changing a large weight matrix $W$, LoRA adds a low-rank update:

$$W' = W + \Delta W \quad \text{where} \quad \Delta W = A \cdot B$$

  • $A \in \mathbb{R}^{d \times r}$
  • $B \in \mathbb{R}^{r \times k}$
  • $r$ (the rank) is small, e.g., 4, 8, or 16
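
To make the idea concrete, here is a minimal sketch of a LoRA-style linear layer in plain PyTorch. This is an illustrative toy, not the implementation from the peft library; the class and variable names are my own.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update A @ B."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the original W (and bias)

        d, k = base.out_features, base.in_features
        # One factor starts at zero so that delta_W = A @ B is zero at
        # initialization and training begins exactly from the base model.
        self.A = nn.Parameter(torch.randn(d, r) * 0.01)  # d x r
        self.B = nn.Parameter(torch.zeros(r, k))         # r x k
        self.scale = alpha / r  # standard LoRA scaling factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Equivalent to using W' = W + scale * (A @ B), without ever
        # materializing or modifying the full d x k matrix.
        return self.base(x) + self.scale * (x @ self.B.T @ self.A.T)
```

Wrapping, say, the query and value projections of a frozen transformer in such a layer means only $A$ and $B$ are trained: $d \cdot r + r \cdot k$ parameters per layer instead of $d \cdot k$, which for $d = k = 4096$ and $r = 8$ is roughly 65K values instead of about 16.8M.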

What is QLoRA (Quantized LoRA)?

QLoRA is an extension of LoRA that makes fine-tuning even more memory-efficient by combining:

  1. Quantization — Convert model weights to 4-bit or 8-bit format to save GPU memory.
  2. LoRA adapters — Trainable low-rank layers, just like in LoRA.

Why it matters:

  • Enables training of 65B+ parameter models on a single GPU (48–80 GB).
  • Maintains close to full-precision accuracy.
  • Huge memory savings + faster training.
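
A rough back-of-the-envelope makes the first point plausible: at 16-bit precision, a 65B-parameter model needs about 65B × 2 bytes ≈ 130 GB for the weights alone, far beyond a single GPU; at 4 bits that drops to roughly 65B × 0.5 bytes ≈ 32.5 GB, leaving headroom on a 48–80 GB card for activations and the small LoRA adapters.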

QLoRA introduces:

  • 4-bit NormalFloat (NF4) quantization via the bitsandbytes library
  • Double quantization (the quantization constants themselves are quantized, saving additional memory)
  • Paged optimizers (optimizer state is paged between GPU and CPU memory to absorb memory spikes, e.g., from long sequences)
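
Putting the pieces together, here is a minimal sketch of a QLoRA setup using the Hugging Face transformers, peft, and bitsandbytes libraries. The model name and hyperparameters below are placeholder choices for illustration, not values from a specific recipe.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works

# 1. Load the base model in 4-bit NF4 with double quantization.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NormalFloat4 data type
    bnb_4bit_use_double_quant=True,      # quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# 2. Attach LoRA adapters (here, to the attention projections).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # module names vary by architecture
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% trainable
```

The paged optimizer is then selected at training time, e.g., optim="paged_adamw_8bit" in transformers' TrainingArguments.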
