What Is Model Quantization? The 4-Bit Trick

Open-Source AI · zbrandco

TL;DR

  • Quantization shrinks a model by storing its billions of weights at lower precision — 4 bits instead of 16. Size drops by roughly 4x; quality loss is usually small.
  • It’s the technique that turned “needs a data-centre GPU” into “runs on a MacBook.” Without it, the entire local-AI movement wouldn’t exist.
  • GGUF is the file format that packages a quantized model for runtimes like Ollama and llama.cpp to consume.
  • The quality trade-off is real and tolerable at 4-bit; push below 3-bit and the model starts getting noticeably dumber.

Squeezing a frontier model into 8 GB of RAM

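Picture the situation: you want to run Llama 3 70B at home. At full 16-bit floating-point precision, the weights alone occupy roughly 140 GB. Your machine has 8 GB of unified memory. The gap looks unbridgeable until quantization closes most of it: at 4-bit the same weights need roughly 40 GB, and a 7B-class model at 4-bit needs about 4 GB, which fits in the memory you already have.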

That’s the practical starting point for understanding what model quantization is and why it dominates every conversation about running AI locally in 2026. It isn’t primarily a research technique or an optimisation geeks tinker with.

It is the reason the local-AI ecosystem — Ollama, llama.cpp, LM Studio, all of it — is commercially viable. Strip it out and nearly every model people actually care about goes back to needing a data-centre.

The thesis in one sentence: 4-bit quantization cut model memory requirements by roughly 75% with a quality loss most users never notice in practice — and that single fact is what made consumer AI possible.


What the model actually is, in numbers

A language model is a large table of floating-point numbers called weights. At training time these are stored at 16-bit (BF16 or FP16) precision — roughly 2 bytes each. A 7-billion-parameter model therefore occupies about 14 GB at rest; a 70B model occupies 140 GB. That number scales linearly with parameter count, which is why a new SOTA release almost always exceeds the VRAM on any single consumer GPU.

Quantization addresses this by asking: how much precision do we actually need when running the model, as opposed to training it? The answer turns out to be: far less.

Rather than store each weight as one of 65,536 possible 16-bit values, 4-bit quantization stores it as one of just 16 values — plus a shared scaling factor that remembers the range the weights came from.

When the model runs, those 16-level buckets are mapped back to approximate their original values. Information is lost in that round-trip, but neural networks are surprisingly tolerant of numerical noise, which is why the outputs usually stay close to the full-precision original.
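The round-trip described above can be sketched in a few lines. This is a minimal illustration assuming the simplest possible scheme (one shared scale for the whole list, symmetric signed codes from -8 to 7); the function names are hypothetical and real runtimes use more refined layouts.

```python
import random

def quantize_4bit(weights):
    """Map floats to signed 4-bit integer codes (-8..7) plus one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0  # largest weight lands on code +/-7
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_4bit(codes, scale):
    """Recover approximate floats: each code times the shared scale."""
    return [c * scale for c in codes]

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(1024)]  # synthetic stand-in weights

codes, scale = quantize_4bit(weights)
restored = dequantize_4bit(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(f"distinct codes used: {len(set(codes))} (of 16 possible)")
print(f"worst round-trip error: {max_error:.5f} (half a step is {scale / 2:.5f})")
```

The worst-case error is bounded by half of one quantization step, which is the "numerical noise" the network has to tolerate.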

As the Hugging Face quantization documentation frames it, the explicit goal is “lower the memory and compute cost of a model while keeping its behaviour close to the original.”


The VRAM table nobody shows you

Bit-width     | Bytes per weight | 7B model footprint | 70B model footprint | Typical quality vs FP16
FP16 (16-bit) | 2.0              | ~14 GB             | ~140 GB             | Baseline
Q8 (8-bit)    | ~1.0             | ~7 GB              | ~70 GB              | Near-identical
Q5 (5-bit)    | ~0.63            | ~4.5 GB            | ~44 GB              | Very close
Q4 (4-bit)    | ~0.5             | ~4 GB              | ~40 GB              | Small, usually acceptable drop
Q3 (3-bit)    | ~0.38            | ~3 GB              | ~30 GB              | Noticeable degradation
Q2 (2-bit)    | ~0.25            | ~2 GB              | ~20 GB              | Significant loss

The numbers above are approximate (actual footprint varies by implementation and context size), but the pattern matters: going from 16-bit to 4-bit lets the same hardware hold a model with roughly four times as many parameters.
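The arithmetic behind the table is parameter count times bytes per weight. A rough sketch, assuming 1 GB means one billion bytes and counting the weights only; the helper name is hypothetical. The raw result comes out slightly below the table's rounded figures at low bit-widths because real quant formats also store per-block scales.

```python
def weight_footprint_gb(params_billions, bits_per_weight):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    # billions of parameters * bytes per parameter = billions of bytes = GB
    return params_billions * bytes_per_weight

for label, bits in [("FP16", 16), ("Q8", 8), ("Q5", 5), ("Q4", 4), ("Q3", 3), ("Q2", 2)]:
    small = weight_footprint_gb(7, bits)
    large = weight_footprint_gb(70, bits)
    print(f"{label:>4}: 7B ~ {small:5.2f} GB   70B ~ {large:6.2f} GB")
```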


Why naive quantization fails — and why modern methods don’t

The simplest conceivable quantization approach is uniform: divide the weight range into equal-size buckets and map everything into them. It works, but it wastes precision. Most weights in a trained model cluster tightly around zero with rare outliers that are orders of magnitude larger. A uniform grid wastes most of its 16 levels on the sparse tail and handles the dense middle crudely.

Modern quantization schemes, the kind underlying the popular GGUF quant levels, avoid a single global grid. They quantize small blocks of weights independently, so the available levels follow the local range of each block and an outlier only costs precision within its own block.

The llama.cpp project, which pioneered GGUF, implements a family of these approaches: Q4_K, Q5_K, Q6_K and their variants use “K-quants” that group weights into blocks and apply separate scales and minimums per block, sharpening accuracy without adding significant overhead.
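To see why a per-block scale and minimum helps, here is a simplified sketch. It is not llama.cpp's actual K-quant layout (which, among other things, also compresses the scales themselves); the block size of 32, the function names, and the synthetic weights are all assumptions chosen for illustration.

```python
import random

def quantize_block(block, levels=16):
    """Affine 4-bit quantization of one block: code = round((w - minimum) / scale)."""
    lo, hi = min(block), max(block)
    scale = (hi - lo) / (levels - 1) or 1.0  # fall back to 1.0 for a constant block
    codes = [round((w - lo) / scale) for w in block]
    return codes, scale, lo

def dequantize_block(codes, scale, lo):
    return [lo + c * scale for c in codes]

def round_trip(weights, block_size):
    """Quantize and restore the weights block by block."""
    restored = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        restored.extend(dequantize_block(*quantize_block(block)))
    return restored

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

random.seed(1)
weights = [random.gauss(0.0, 0.02) for _ in range(256)]
weights[10] = 1.0  # one large outlier among small weights

# block_size=256 means a single grid for everything; 32 gives eight independent grids.
global_error = mean_abs_error(weights, round_trip(weights, block_size=256))
blockwise_error = mean_abs_error(weights, round_trip(weights, block_size=32))
print(f"one global grid:    mean error {global_error:.5f}")
print(f"blocks of 32:       mean error {blockwise_error:.5f}")
```

With one global grid the outlier stretches the range for every weight; with blocks, only the block containing the outlier pays for it.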

This is the reason a good Q4_K_M GGUF often performs almost indistinguishably from FP16 on everyday tasks — it isn’t just sloppily rounding weights down; it’s being surgical about where the bits matter most.

Here’s an insight that rarely gets stated plainly: the quality difference between FP16 and Q4 isn’t fixed; it shrinks as models get larger. A 70B model at Q4 often outperforms a 7B model at FP16, not just because it’s larger, but because it has more redundancy in its weights to absorb the precision loss gracefully.

Quantization hurts smaller models proportionally more. This is why the advice “run the biggest model that fits at Q4” consistently beats “run a smaller model at higher precision” — and it’s a decision the tooling rarely makes obvious to the user.


GGUF: the file format that made local AI frictionless

Quantization is a technique; GGUF is the packaging. When you pull a model in Ollama or download a file from Hugging Face for llama.cpp, you’re almost always getting a GGUF file — a single self-contained binary that bundles the quantized weights, the model’s architecture metadata, tokeniser vocabulary, and configuration in one place.
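Because everything lives in one binary, a few bytes at the start are enough to identify a GGUF file. The sketch below assumes the version 2+ preamble as I understand it from the GGUF specification (4-byte magic "GGUF", a little-endian uint32 version, then uint64 tensor and metadata counts); check it against the spec before relying on it. The demo writes a synthetic header rather than reading a real model, and its values are made up.

```python
import os
import struct
import tempfile

def read_gguf_header(path):
    """Read the fixed-size preamble of a GGUF file (layout assumed for version 2+)."""
    with open(path, "rb") as f:
        header = f.read(24)
    if len(header) < 24 or header[:4] != b"GGUF":
        raise ValueError("not a GGUF file")
    # "<IQQ": little-endian uint32 version, uint64 tensor count, uint64 metadata count
    version, tensor_count, metadata_kv_count = struct.unpack("<IQQ", header[4:24])
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": metadata_kv_count,
    }

# Demo on a synthetic 24-byte header, not a real model file.
with tempfile.NamedTemporaryFile(suffix=".gguf", delete=False) as tmp:
    tmp.write(b"GGUF" + struct.pack("<IQQ", 3, 291, 24))
    demo_path = tmp.name

demo_header = read_gguf_header(demo_path)
os.remove(demo_path)
print(demo_header)
```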

Before GGUF (and its predecessor GGML), distributing a quantized model meant shipping multiple files, format-converting on the fly, and hoping the runtime version matched the model version. GGUF standardised all of that.

The name you see on Hugging Face — model-Q4_K_M.gguf — decodes as: this model, quantized to 4-bit using the K-quant method, medium variant, in GGUF format. Everything a runtime needs is inside that one file.
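The naming convention can be decoded mechanically. A small sketch, assuming filenames that end in a tag like Q4_K_M or Q8_0 before the .gguf extension; the helper and its output labels are hypothetical, and uploaders are free to name files differently, so unmatched names return None.

```python
import re

# Matches e.g. "model-Q4_K_M.gguf" or "model.Q8_0.gguf".
_QUANT_RE = re.compile(
    r"^(?P<model>.+?)[-.]Q(?P<bits>\d)_(?P<scheme>K|0|1)(?:_(?P<variant>[SML]))?\.gguf$"
)
_VARIANTS = {"S": "small", "M": "medium", "L": "large"}

def describe_gguf_name(filename):
    """Split a GGUF filename into model name, bit-width, method, and variant."""
    match = _QUANT_RE.match(filename)
    if match is None:
        return None
    return {
        "model": match["model"],
        "bits": int(match["bits"]),
        "method": "K-quant" if match["scheme"] == "K" else "non-K block quant",
        "variant": _VARIANTS.get(match["variant"]),  # None when no S/M/L suffix
        "format": "GGUF",
    }

print(describe_gguf_name("model-Q4_K_M.gguf"))
print(describe_gguf_name("model-Q8_0.gguf"))
```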

For users, GGUF made quantization invisible in the best possible way: ollama pull llama3 just works, defaulting to a 4-bit quant appropriate for most hardware. The sophistication is hidden; the result is a model that runs.


Where quality actually breaks — the honest thresholds

Quantization sells itself as nearly free. It usually is. But there are real ceilings:

8-bit (Q8): Perceptually identical to FP16 in nearly all benchmarks. The only reason not to default here is it’s still twice the size of Q4, so hardware often doesn’t allow it.

4-bit (Q4): The inflection point. Casual conversation, summarisation, creative writing — most users genuinely cannot tell the difference. Precise mathematical reasoning, multi-step code generation, and structured data extraction start to show small but real degradation. Not fatal; just present.

3-bit and below: This is where the model starts getting unreliable. Hallucination rates tick up. Reasoning chains shorten or lose coherence. For anything beyond simple queries, 3-bit is a last resort, not a choice.

One underreported fact: tasks that depend on recall — retrieving specific facts accurately from the model’s training data — are more sensitive to quantization than tasks that depend on reasoning. A 2-bit model can sometimes still produce coherent sentences; it will much more readily confuse names, dates, and statistics.


Picking a quant without overthinking it

When you download a model, you’ll face a dropdown or a filename list. The decision tree is short:

  1. Start at Q4 (look for Q4_K_M in GGUF land, or Ollama’s default). For the majority of workloads, this is the answer. Stop here unless something feels wrong.
  2. Try Q5 or Q8 if the task is precision-sensitive — code generation, maths, structured JSON output — and you have the RAM headroom.

If Q4 isn’t fitting or you need to stretch your hardware further:

  1. Drop to Q3 only if you genuinely cannot fit Q4. Expect to notice the difference in anything complex.
  2. Prefer a bigger model at Q4 over a smaller model at Q8 when total VRAM budgets are similar. A Q4 13B almost always beats a Q8 7B. Test both on your actual prompts; benchmarks won’t tell you what your specific task needs.
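The decision tree above can be written down as a small helper. This is a sketch under stated assumptions: the bytes-per-weight values come from the table earlier in the article, and the 1.5 GB headroom reserved for context and runtime overhead is a made-up placeholder you should tune for your own setup.

```python
# Approximate bytes per weight, taken from the table earlier in the article.
BYTES_PER_WEIGHT = {"Q8": 1.0, "Q5": 0.63, "Q4": 0.5, "Q3": 0.38}

def pick_quant(params_billions, memory_gb, precision_sensitive=False, headroom_gb=1.5):
    """Return the suggested quant level, or None if even Q3 does not fit."""
    budget = memory_gb - headroom_gb  # memory left for the weights themselves

    def fits(level):
        return params_billions * BYTES_PER_WEIGHT[level] <= budget

    if precision_sensitive:
        # Code, maths, structured output: take the highest precision that fits.
        for level in ("Q8", "Q5"):
            if fits(level):
                return level
    if fits("Q4"):
        return "Q4"   # the default answer
    if fits("Q3"):
        return "Q3"   # last resort
    return None

print(pick_quant(7, 8))                             # Q4
print(pick_quant(7, 16, precision_sensitive=True))  # Q8
print(pick_quant(70, 8))                            # None
```

It only covers the memory side of the choice; comparing outputs on your actual prompts is still the deciding step.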

As covered in our guide to running a local LLM, the tooling handles the runtime mechanics. If you’re choosing between models for a specific task, our open-source AI model comparison breaks down the trade-offs at each scale. The only decision that’s actually yours is picking the right quant for your memory budget — and now you know how to make it.


The part the marketing skips

Quantization is not magic, and three caveats earn their place in honest coverage:

It doesn’t add intelligence. A quantized small model is still a small model. Shrinking a weaker base model doesn’t recover capabilities the original never had.

Inference speed gains are real but uneven. 4-bit models usually run faster than FP16, but the speedup varies significantly by hardware. On Apple Silicon and modern NVIDIA GPUs, where token generation is typically limited by how fast weights can be read from memory, the gains are substantial. On older hardware, you may see only modest improvements despite the memory savings.

The quality gap is task-specific, not universal. A single benchmark number doesn’t capture how quantization will affect your use case. The only reliable test is running your actual prompts at different precisions and comparing outputs. This takes ten minutes and is worth doing before committing to a workflow.
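A ten-minute comparison like that can be as simple as the sketch below. The two generator functions are stand-ins with canned answers; in practice you would replace them with calls into whatever runtime you use, once per quant level. Text similarity from difflib is only a crude way to surface the prompts worth reading by eye, not a quality metric.

```python
from difflib import SequenceMatcher

def compare_quants(prompts, generate_a, generate_b):
    """Run each prompt through two generators; return (prompt, similarity) pairs,
    least similar first, so the biggest divergences are at the top."""
    report = []
    for prompt in prompts:
        answer_a = generate_a(prompt)
        answer_b = generate_b(prompt)
        similarity = SequenceMatcher(None, answer_a, answer_b).ratio()
        report.append((prompt, round(similarity, 2)))
    return sorted(report, key=lambda row: row[1])

# Hypothetical stand-ins for "same model at two precisions".
def fake_high_precision(prompt):
    return {"2+2?": "4", "Capital of France?": "Paris"}[prompt]

def fake_low_precision(prompt):
    return {"2+2?": "4", "Capital of France?": "Lyon"}[prompt]

report = compare_quants(
    ["2+2?", "Capital of France?"], fake_high_precision, fake_low_precision
)
for prompt, similarity in report:
    print(f"{similarity:.2f}  {prompt}")
```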

For a deeper look at how hardware choices interact with quant levels, see our local AI hardware guide.


Start at Q4, and only move when you feel the trade-off

If you run AI locally in 2026, quantization is the lever that makes it possible. The default answer is Q4_K_M GGUF: roughly a quarter the size of full precision, with quality most users won’t notice losing. Go up to Q5 or Q8 if the task demands it and your RAM allows. Go down to Q3 only as a last resort.

Most importantly: when memory is the constraint, pick the largest model that fits at Q4 over a smaller model at higher precision. The bigger model almost always wins.


Is a 4-bit model much worse than the full-precision original?

Usually only a little. With modern K-quant methods (Q4_K_M), the gap is small enough that most users cannot detect it in everyday use. Precision-sensitive tasks — maths, code, structured output — show a more visible but still modest drop.

Do I need to quantize models myself?

No. Most popular open models are distributed as pre-quantized GGUF files on Hugging Face. Ollama pulls a 4-bit quant by default when you ollama pull a model. You only need to quantize yourself if you’re working with a freshly fine-tuned checkpoint.

What is GGUF, and is it the same as quantization?

They’re different things. Quantization is the technique — fewer bits per weight. GGUF is the file format that packages a quantized model so runtimes like llama.cpp and Ollama can load it efficiently. Most quantized models you download are delivered as GGUF files.

Can quantization make a model run faster, not just smaller?

Yes. In addition to fitting in less RAM, quantized models can run faster because less data needs to move through memory per inference step. The speedup is most pronounced on hardware with fast memory and well-optimised low-bit kernels, such as recent Apple Silicon and NVIDIA Ada GPUs. On older hardware the speed gain is real but smaller.
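A back-of-the-envelope way to see the memory effect: if generating each token requires reading every weight once, memory bandwidth divided by model size gives a rough ceiling on tokens per second. The sketch below assumes exactly that and nothing else; the 100 GB/s bandwidth is an illustrative figure, not a measurement of any particular machine, and real throughput is lower once compute and cache traffic are counted.

```python
def max_tokens_per_second(model_size_gb, memory_bandwidth_gb_s):
    """Rough upper bound, assuming every weight is read once per generated token."""
    return memory_bandwidth_gb_s / model_size_gb

bandwidth = 100.0  # GB/s, hypothetical
for label, size_gb in [("7B at FP16", 14.0), ("7B at Q4", 4.0)]:
    ceiling = max_tokens_per_second(size_gb, bandwidth)
    print(f"{label}: at most ~{ceiling:.1f} tokens/s")
```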


Last verified June 13, 2026. Based on the Hugging Face quantization overview and the llama.cpp project.

Last updated: Jun 14, 2026.
Aira

Founding Editor and Publisher of ZBrandCo, covering artificial intelligence, open-source software, and the developer tools people actually use. Signal over hype: every story starts from a primary source and explains why it matters. ZBrandCo runs no paid reviews and no affiliate links. Tips and corrections: editorial@zbrandco.com.