Ollama’s MLX Mac update boosts LLM inference speed by 20%

Aira Published Jun 14, 2026 · 4 min read

Ollama’s MLX engine just hit its highest performance on Apple Silicon yet. The June 11 update introduces NVIDIA’s NVFP4 quantization format, a new snapshot system that makes multi-agent workflows dramatically faster, and kernel-level optimizations that push output speeds roughly 20% higher than the previous q4_K_M baseline.

TL;DR
– NVFP4 cuts 4-bit quantization quality loss by more than half vs q4_K_M on Gemma 4 12B (perplexity 17.54 vs 17.95, against 17.51 for unquantized BF16)
– Output speed: ~55 tok/s (NVFP4) vs ~46 tok/s (q4_K_M) on the updated engine (10-run average, 8,300-token prompt)
– New snapshot system caches model state at branch points — critical for agent workflows where context gets reprocessed dozens of times per task
– Available now: ollama run gemma4:12b-mlx or ollama launch pi --model gemma4:12b-mlx

The Three Concrete Improvements

Ollama’s MLX backend, first previewed in March 2026, has matured from “fastest way to run local LLMs on Mac” to a production-grade engine with three concrete improvements:

NVFP4 quantization — Ollama now supports NVIDIA’s NVFP4 format, a 4-bit floating-point (E2M1) format with two-level scaling: an FP8 E4M3 scale per 16-value micro-block plus a per-tensor FP32 scale. Perplexity testing on Gemma 4 12B shows NVFP4 at 17.54, q4_K_M at 17.95, and unquantized BF16 at 17.51. On those figures NVFP4 sits 0.03 above the BF16 baseline against 0.44 for q4_K_M, well under half the quality loss of standard 4-bit quantization. NVIDIA originally designed NVFP4 for Blackwell GPUs; Ollama brings it to Apple Silicon via MLX’s flexible quantization support.
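
To make the micro-block scaling idea concrete, here is a minimal Python sketch of NVFP4-style quantization. It is an illustration under stated assumptions, not Ollama's or MLX's implementation: the function names are hypothetical, the per-block scale is kept as an ordinary float instead of being rounded to FP8 E4M3, and the second, per-tensor scale that NVIDIA's format also uses is left out.

```python
# Illustrative NVFP4-style block quantization (hypothetical helper names,
# not code from Ollama or MLX).

# Magnitudes representable by a 4-bit E2M1 float (the sign is a separate bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16  # values that share one scale


def quantize_block(block):
    """Return (scale, codes) for one block; codes are signed grid values."""
    max_abs = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto the top of the grid (6.0).
    # Simplification: real NVFP4 stores this scale in FP8 E4M3.
    scale = max_abs / E2M1_GRID[-1] if max_abs > 0 else 1.0
    codes = []
    for x in block:
        target = abs(x) / scale
        nearest = min(E2M1_GRID, key=lambda g: abs(g - target))
        codes.append(nearest if x >= 0 else -nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate values from a block's scale and codes."""
    return [c * scale for c in codes]


def roundtrip(values):
    """Quantize and dequantize a flat list, one 16-value block at a time."""
    out = []
    for i in range(0, len(values), BLOCK):
        scale, codes = quantize_block(values[i:i + BLOCK])
        out.extend(dequantize_block(scale, codes))
    return out


if __name__ == "__main__":
    weights = [0.37 * i - 3.0 for i in range(32)]
    approx = roundtrip(weights)
    worst = max(abs(a - b) for a, b in zip(weights, approx))
    print(f"worst round-trip error: {worst:.3f}")
```

Because each group of 16 values gets its own scale, one large outlier only coarsens the other 15 values in its block, which is the intuition behind the small perplexity gap reported above.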

Fused Metal kernels — Multiple operations now fuse into single Metal kernels through MLX’s just-in-time compiler. The GPU-backed sampling path has been reworked for efficiency. Result: NVFP4 generates ~55 tokens/second vs ~46 tok/s for q4_K_M on the same hardware (MacBook Pro M5 Max, Gemma 4 12B, 8,300-token input prompt, 10-run average).
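
The throughput figures can be checked with a few lines of arithmetic. The sketch below uses only the two reported speeds; the helper names and the 1,000-token response length are assumptions chosen for illustration.

```python
# Output speeds as reported in the article (tokens per second).
NVFP4_TOK_S = 55.0
Q4_K_M_TOK_S = 46.0


def speedup_percent(new, old):
    """Relative throughput gain of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


def seconds_for(tokens, tok_per_s):
    """Time to generate `tokens` output tokens at a steady rate."""
    return tokens / tok_per_s


print(f"speedup: {speedup_percent(NVFP4_TOK_S, Q4_K_M_TOK_S):.1f}%")  # speedup: 19.6%
print(f"1,000 tokens at NVFP4 speed:  {seconds_for(1000, NVFP4_TOK_S):.1f} s")
print(f"1,000 tokens at q4_K_M speed: {seconds_for(1000, Q4_K_M_TOK_S):.1f} s")
```

These numbers cover output generation only; the article does not report prompt-processing speed for the 8,300-token input.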

Snapshot-based prefix caching for agents — This is the most significant change for developer workflows. Agent sessions reprocess the same system prompt, tool definitions, and ingested files dozens of times per task. Traditional prefix caching breaks when conversations branch, when thinking models drop reasoning tokens, or when subagents hand off. Ollama’s new snapshot system saves model state at predictable return points: branch splits, intervals through long prompts, and just before each response. Multiple agents can resume from their own saved state; shared context (often tens of thousands of tokens) processes once. Thinking models get a snapshot right before response generation. Branching and retries only process the new direction.
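
The idea of resuming from saved state can be sketched in a few lines of Python. This is a toy model, not Ollama's implementation: the class name, its methods, and the use of a plain dictionary in place of real saved model state are all assumptions made for illustration.

```python
class SnapshotCache:
    """Toy prefix-snapshot store: maps a token prefix to saved state."""

    def __init__(self):
        self._snapshots = {}       # tuple(tokens) -> stand-in for model state
        self.tokens_processed = 0  # total simulated prompt-processing work

    def _longest_snapshot(self, tokens):
        """Length of the longest saved prefix of `tokens` (linear scan for clarity)."""
        for n in range(len(tokens), 0, -1):
            if tuple(tokens[:n]) in self._snapshots:
                return n
        return 0

    def process(self, tokens, snapshot_at=()):
        """Simulate prompt processing, resuming from the longest saved prefix.

        `snapshot_at` holds prefix lengths at which to save state, such as
        the end of a shared system prompt. State is also saved at the end of
        the prompt, i.e. just before a response would be generated.
        Returns how many tokens actually had to be processed.
        """
        start = self._longest_snapshot(tokens)
        for n in range(start + 1, len(tokens) + 1):
            self.tokens_processed += 1
            if n in snapshot_at or n == len(tokens):
                self._snapshots[tuple(tokens[:n])] = {"length": n}
        return len(tokens) - start


shared = list(range(1000))  # stands in for system prompt + tool definitions
cache = SnapshotCache()

first = cache.process(shared + [1, 2, 3], snapshot_at={len(shared)})
branch = cache.process(shared + [9, 8])          # a second agent, new direction
follow_up = cache.process(shared + [1, 2, 3, 4])  # continues the first session

print(first, branch, follow_up)  # 1003 2 1
```

In this toy run the 1,000 shared tokens are processed once; the branch and the follow-up each pay only for their new tokens, which mirrors the behavior the article describes for branching and retries.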

For Developers Running Local Agents

If you’re running local models for coding agents on Apple Silicon, this update changes the practical ceiling. The combination of NVFP4 quality and snapshot caching means you can run larger models (Gemma 4 12B, Nemotron 3 Ultra) in agent workflows that previously hit memory or latency walls. The ollama launch command now spins up Claude Code, OpenCode, or Codex with local models in one step, with no env vars or config files.

NVFP4 also enables portability: models optimized for datacenter deployment (Blackwell GPUs) can now be imported and run on Mac via Ollama’s MLX engine. That’s a meaningful bridge between cloud and local inference.

Availability and Requirements

– Ollama version: 0.30+ (June 5 release added GGUF support; June 11 added the MLX performance update)
– Hardware: Apple Silicon Mac (M-series), unified memory architecture required
– Models: gemma4:12b-mlx available now; more MLX-optimized models rolling out
– Command: ollama run gemma4:12b-mlx for chat; ollama launch pi --model gemma4:12b-mlx for coding agents
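
To check output speed on your own machine, one option is to call Ollama's local REST API and read the timing fields in the reply. The sketch below assumes a default local server on port 11434 and that the MLX-backed model reports the usual eval_count and eval_duration fields; the helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint


def build_payload(model, prompt):
    """Request body for one non-streaming generation."""
    # stream=False asks for a single JSON object instead of a stream of chunks.
    return {"model": model, "prompt": prompt, "stream": False}


def tokens_per_second(result):
    """Output speed from eval_count and eval_duration (nanoseconds)."""
    return result["eval_count"] / (result["eval_duration"] / 1e9)


def generate(model, prompt, url=OLLAMA_URL):
    """POST one generation request and return the parsed JSON reply."""
    data = json.dumps(build_payload(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Example (needs a running Ollama server with the model already pulled):
# result = generate("gemma4:12b-mlx", "Explain unified memory in two sentences.")
# print(f"{tokens_per_second(result):.1f} tok/s")
```

A single short prompt will not reproduce the article's setup (8,300-token prompt, 10-run average), so treat one-off readings as a rough indication only.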

NVFP4 support is macOS-only for now: Linux and Windows builds still use the llama.cpp/GGUF backend. The NVIDIA Developer Forums confirm this is an Apple Silicon exclusive in the current release.

What’s Coming for MLX on Mac

– More MLX-optimized models on Ollama Hub (Gemma 4 family is the first batch)
– Potential NVFP4 support expansion to Linux via MLX ports
– How the snapshot system scales with 70B+ models on M3 Ultra / M4 Max unified memory
– Comparison benchmarks: NVFP4 on MLX vs native Blackwell NVFP4 for quality parity

FAQ

Q: Does NVFP4 work on Linux or Windows?
A: No. NVFP4 support is currently macOS-only via Apple’s MLX framework. Linux and Windows Ollama builds use the llama.cpp/GGUF backend.

Q: What models support the new MLX engine?
A: gemma4:12b-mlx is the first MLX-optimized release. More models from the Gemma 4 family and other architectures are rolling out on Ollama Hub.

Q: How does snapshot caching differ from standard prefix caching?
A: Standard prefix caching only works when conversations extend linearly. Snapshots save state at branch points, before thinking tokens drop, and at prompt intervals — so multi-agent handoffs, retries, and branching conversations all resume from saved state instead of reprocessing.

Source: Ollama Blog — MLX Performance Update (June 11, 2026)
Independent verification: NVIDIA Technical Blog — Introducing NVFP4 (Blackwell architecture specification)
Community discussion: NVIDIA Developer Forums — Ollama with NVFP4 support (confirms macOS exclusivity)

Editorially independent: we accept no payment for coverage and currently use no affiliate links. Read our Editorial Standards and Corrections Policy. Published: Jun 14, 2026.
