Qwen3: 8 Models, Hybrid Thinking, 119 Languages — Dev Guide

AI · zbrandco

TL;DR: Qwen3 launched April 29, 2025 from Alibaba’s Qwen team. Eight models (six dense: 0.6B–32B; two MoE: 30B-A3B, 235B-A22B), hybrid thinking/non-thinking modes, 119 languages, Apache 2.0 license. The flagship Qwen3-235B-A22B trades blows with DeepSeek-R1, o1, o3-mini, Grok-3, and Gemini-2.5-Pro. Qwen3-30B-A3B outperforms QwQ-32B with 10× fewer activated parameters. All weights are on Hugging Face, ModelScope, and Kaggle, with code on GitHub, and they are runnable locally today.


The Qwen3 lineup at a glance

| Model | Type | Active Params | Total Params | Context | Best For |
| --- | --- | --- | --- | --- | --- |
| Qwen3-0.6B | Dense | 0.6B | 0.6B | 32K | Edge/phone, ultra-low latency |
| Qwen3-1.7B | Dense | 1.7B | 1.7B | 32K | Mobile, Raspberry Pi 5, Jetson |
| Qwen3-4B | Dense | 4B | 4B | 32K | Sweet spot: laptop GPU (8GB VRAM); rivals Qwen2.5-72B-Instruct per the Qwen team |
| Qwen3-8B | Dense | 8B | 8B | 128K | Single 24GB GPU, strong reasoning |
| Qwen3-14B | Dense | 14B | 14B | 128K | Single 24GB GPU (quantized), best balance |
| Qwen3-32B | Dense | 32B | 32B | 128K | Dual 24GB / single 48GB, near-SOTA |
| Qwen3-30B-A3B | MoE | 3B | 30B | 128K | 24GB GPU at 4-bit, or a smaller GPU with CPU offload; outperforms QwQ-32B |
| Qwen3-235B-A22B | MoE | 22B | 235B | 128K | Flagship, competes with DeepSeek-R1/o1 |

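Key insight: the MoE models activate only about 10% of their parameters per token (3B of 30B, 22B of 235B). That cuts compute, not memory: every expert still has to be loaded, so Qwen3-30B-A3B needs roughly 15 GB for 4-bit weights and fits an 8GB GPU only with part of the model offloaded to system RAM. In return it outperforms QwQ-32B, a dense model with about ten times the active parameters. On the dense side, the Qwen team reports that Qwen3-4B rivals Qwen2.5-72B-Instruct, a model 18× its size.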
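
To make the compute-versus-memory distinction concrete, here is a back-of-the-envelope sketch. The helper names are made up for illustration, the parameter counts are the rounded figures from the table above, and the memory figure covers weights only (no KV cache or runtime overhead), so treat the output as a lower bound.

```python
def active_fraction(active_b: float, total_b: float) -> float:
    """Share of parameters used per token. This drives compute, not memory."""
    return active_b / total_b


def weight_memory_gb(total_b: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB. All experts must be resident,
    so the total parameter count is what matters here."""
    return total_b * bits_per_weight / 8


# (name, active params in billions, total params in billions)
moe_models = [("Qwen3-30B-A3B", 3, 30), ("Qwen3-235B-A22B", 22, 235)]

for name, active_b, total_b in moe_models:
    print(
        f"{name}: {active_fraction(active_b, total_b):.1%} active per token, "
        f"~{weight_memory_gb(total_b, 4):.1f} GB of weights at 4 bits"
    )
```

Real 4-bit GGUF files come out somewhat larger than this estimate because formats such as Q4_K_M spend more than 4 bits per weight on average.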


Hybrid thinking: the feature that changes how you prompt

Qwen3 introduces two modes in one model:

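| Mode | How to trigger | Latency | Use case |
| --- | --- | --- | --- |
| Thinking | Default; enable_thinking=True in the chat template, or /think in the prompt | Higher | Math, code, logic, multi-step planning |
| Non-thinking | enable_thinking=False in the chat template, or /no_think in the prompt | Lower | Chat, formatting, simple Q&A, classification |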

Why this matters: You don’t need separate models for “reasoning” vs “fast chat.” One model does both. The team reports smooth performance scaling with reasoning budget — allocate more tokens for harder problems, fewer for simple ones.
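
One way to use this in practice is to route each request to a mode by task type. The sketch below is an assumption-heavy illustration: the task categories mirror the use cases listed above, the function names are invented for this article, and it relies on the /think and /no_think soft switches that the Qwen team documents for use inside a user message.

```python
# Task types that usually benefit from step-by-step reasoning (an assumption,
# taken from the use cases listed above, not an official taxonomy).
REASONING_TASKS = {"math", "code", "logic", "planning"}


def choose_thinking(task_type: str) -> bool:
    """Return True when the task type is one we treat as reasoning-heavy."""
    return task_type.lower() in REASONING_TASKS


def with_soft_switch(prompt: str, thinking: bool) -> str:
    """Append the soft switch that asks the model to think or answer directly."""
    switch = "/think" if thinking else "/no_think"
    return f"{prompt} {switch}"


for task, prompt in [("math", "Solve: 2x + 5 = 17"), ("chat", "Say hi in French")]:
    print(with_soft_switch(prompt, choose_thinking(task)))
```

A keyword set is the simplest possible router. A production system would more likely classify the request first or let the caller choose the mode explicitly.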

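Prompting pattern (OpenAI-compatible API, for example a vLLM or SGLang server):

{
  "model": "Qwen/Qwen3-30B-A3B",
  "messages": [...],
  "chat_template_kwargs": {
    "enable_thinking": true
  }
}

Set enable_thinking to false for non-thinking mode. With the OpenAI Python client, pass chat_template_kwargs through the client's extra_body argument; extra_body is a client-side parameter, not a field of the JSON body. There is no standard reasoning_budget field on these servers, so cap output length with max_tokens.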
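
For readers who build requests in code, here is a minimal sketch of the same payload in Python. It assumes a vLLM-style server that honors chat_template_kwargs. The build_chat_request helper and the sample prompt are hypothetical, and nothing is sent over the network here.

```python
import json


def build_chat_request(model: str, prompt: str, thinking: bool,
                       max_tokens: int = 2048) -> dict:
    """Build a chat-completions body that toggles Qwen3 thinking mode.

    Assumes the server forwards chat_template_kwargs to the chat template,
    as vLLM's OpenAI-compatible server does.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,  # overall cap on generated tokens
        "chat_template_kwargs": {"enable_thinking": thinking},
    }


payload = build_chat_request("Qwen/Qwen3-30B-A3B", "Is 221 prime?", thinking=True)
print(json.dumps(payload, indent=2))
```

To send it, POST the JSON to the server's /v1/chat/completions endpoint with any HTTP client.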

Local (Ollama): ollama run qwen3:30b-a3b --think turns thinking on and --think=false turns it off. The flag needs a recent Ollama release with thinking support.


119 languages: not just “supports” — trained on 36T tokens across all

| Language family | Count | Notable languages |
| --- | --- | --- |
| Indo-European | 50+ | English, Hindi, Spanish, French, Russian |
| Sino-Tibetan | 4 | Chinese (Simp/Trad/Cantonese), Burmese |
| Afro-Asiatic | 9 | Arabic (8 dialects), Hebrew, Maltese |
| Austronesian | 12 | Indonesian, Malay, Tagalog, Javanese, Cebuano |
| Dravidian | 4 | Tamil, Telugu, Kannada, Malayalam |
| Turkic | 6 | Turkish, Kazakh, Uzbek, Tatar, Bashkir, Azerbaijani |
| Tai-Kadai | 2 | Thai, Lao |
| Uralic | 3 | Finnish, Estonian, Hungarian |
| Austroasiatic | 2 | Vietnamese, Khmer |
| Other | 7 | Japanese, Korean, Georgian, Basque, Swahili, Tok Pisin |

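Training data: ~36 trillion tokens, twice Qwen2.5’s 18T. Pretraining ran in three stages: over 30T tokens of general data, a further 5T tokens weighted toward STEM, coding, and reasoning, then high-quality long-context data to extend the context window to 32K. Part of the math and code data is synthetic, generated with Qwen2.5-Math and Qwen2.5-Coder, so the previous generation helped build the training set for this one.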


Agentic capabilities: MCP, tool calling, coding

Qwen3 is optimized for agentic workflows:

  • MCP (Model Context Protocol): Native support via Qwen-Agent framework
  • Tool calling: Structured output, parallel function calling
  • Coding: Trained on synthetic data from Qwen2.5-Coder; the 4B and 30B-A3B models are standouts for local coding agents
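
To show what tool calling looks like from the application side, here is a minimal sketch using the OpenAI-style function-calling format that OpenAI-compatible servers typically expose. Everything named here is hypothetical: the get_weather tool, the REGISTRY lookup, and the example tool calls, which stand in for what a model response would contain.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical local tool. A real one would call a weather service."""
    return f"Sunny in {city}"


# Schema advertised to the model in the request's "tools" field.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# Maps tool names to the Python callables that implement them.
REGISTRY = {"get_weather": get_weather}


def run_tool_calls(tool_calls: list[dict]) -> list[dict]:
    """Execute each requested call and return 'tool' messages for the next turn."""
    results = []
    for call in tool_calls:
        fn = REGISTRY[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"])  # arguments arrive as a JSON string
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": fn(**args),
        })
    return results


# Two calls in one assistant turn, as parallel function calling would produce.
example_calls = [
    {"id": "call_1", "type": "function",
     "function": {"name": "get_weather", "arguments": json.dumps({"city": "Hangzhou"})}},
    {"id": "call_2", "type": "function",
     "function": {"name": "get_weather", "arguments": json.dumps({"city": "Oslo"})}},
]

tool_messages = run_tool_calls(example_calls)
print(tool_messages)
```

The returned messages are appended to the conversation so the model can write its final answer from the tool results.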

Quick start (Hugging Face Transformers):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-30B-A3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

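# Thinking mode is toggled in the chat template (enable_thinking defaults to True)
messages = [{"role": "user", "content": "Write a Python function to parse CSV with error handling"}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # set to False for non-thinking mode
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=2048)
# Decode only the newly generated tokens, skipping the prompt
new_tokens = outputs[0][inputs.input_ids.shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))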
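
In thinking mode the decoded text contains the model's reasoning followed by the final answer, and most applications want to show only the answer. The helper below is a sketch that works on plain strings. It assumes the reasoning is wrapped in <think>...</think> tags that survive decoding, and split_thinking is a name invented here, not part of Transformers.

```python
def split_thinking(decoded: str) -> tuple[str, str]:
    """Split decoded output into (reasoning, answer).

    Splits on the last closing tag so the answer is whatever follows it.
    Returns an empty reasoning string when no closing tag is present.
    """
    open_tag, close_tag = "<think>", "</think>"
    head, sep, tail = decoded.rpartition(close_tag)
    if not sep:  # non-thinking output: nothing to strip
        return "", decoded.strip()
    reasoning = head.replace(open_tag, "", 1).strip()
    return reasoning, tail.strip()


sample = "<think>\n2x = 12, so x = 6\n</think>\n\nx = 6"
reasoning, answer = split_thinking(sample)
print("reasoning:", reasoning)
print("answer:", answer)
```

When keeping multi-turn history, store only the answer part of earlier assistant turns so old reasoning does not consume context.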

Local deployment: which model fits your hardware?

| Your Hardware | Recommended Model | Quantization | Expected Speed |
| --- | --- | --- | --- |
| MacBook M1/M2/M3 (8–16GB) | Qwen3-4B (Qwen3-30B-A3B needs 24GB+ of unified memory) | 4-bit (Q4_K_M) | 15–30 tok/s |
| RTX 3060 (12GB) / 4060 (8GB) | Qwen3-4B; Qwen3-30B-A3B only with CPU offload | 4-bit | 25–40 tok/s |
| RTX 3090/4090 (24GB) | Qwen3-14B / Qwen3-32B / Qwen3-30B-A3B | 4-bit (8-bit fits for 14B only) | 30–60 tok/s |
| Dual 3090/4090 (48GB) | Qwen3-32B (Qwen3-235B-A22B does not fit: about 118 GB at 4-bit) | 4-bit / 8-bit | 20–40 tok/s |
| Linux server (A100 80GB×4) | Qwen3-235B-A22B | FP8 (BF16 weights alone are about 470 GB) | Production throughput |
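
A quick way to sanity-check any row in a table like this is to estimate the weight footprint and compare it with available memory. The sketch below does that for the whole lineup. The 1.2× overhead factor for KV cache and runtime buffers is a rough assumption, the parameter counts are the rounded figures from the lineup table, and the function names are made up for illustration.

```python
# Total parameters in billions (rounded), from the lineup table.
LINEUP_TOTAL_PARAMS_B = {
    "Qwen3-0.6B": 0.6,
    "Qwen3-1.7B": 1.7,
    "Qwen3-4B": 4,
    "Qwen3-8B": 8,
    "Qwen3-14B": 14,
    "Qwen3-32B": 32,
    "Qwen3-30B-A3B": 30,   # MoE: memory follows total, not active, parameters
    "Qwen3-235B-A22B": 235,
}


def fits(total_params_b: float, bits_per_weight: float,
         memory_gb: float, overhead: float = 1.2) -> bool:
    """True if weights plus an assumed overhead fit in the given memory."""
    needed_gb = total_params_b * bits_per_weight / 8 * overhead
    return needed_gb <= memory_gb


def models_that_fit(memory_gb: float, bits_per_weight: float = 4.0) -> list[str]:
    """List the models whose estimated footprint fits fully in memory."""
    return [
        name for name, params_b in LINEUP_TOTAL_PARAMS_B.items()
        if fits(params_b, bits_per_weight, memory_gb)
    ]


for budget_gb in (8, 24, 48):
    print(f"{budget_gb} GB at 4-bit:", ", ".join(models_that_fit(budget_gb)))
```

Models that fail this check can still run with layers offloaded to CPU RAM, at a cost in speed, which is why the estimate is a guide to comfortable fits rather than a hard limit.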

Ollama (easiest):

# Small but mighty
ollama pull qwen3:4b
ollama pull qwen3:30b-a3b

# Run with thinking mode
ollama run qwen3:30b-a3b --think "Solve: 2x + 5 = 17"

vLLM (production):

# MoE model: vLLM reads the quantization method from the checkpoint config,
# so serve a pre-quantized checkpoint instead of passing --quantization
vllm serve Qwen/Qwen3-30B-A3B-FP8 --gpu-memory-utilization 0.9

llama.cpp (CPU/Apple Silicon):

# Download GGUF from Hugging Face (bartowski/Qwen3-30B-A3B-GGUF)
# Thinking is on by default; append /no_think to the prompt to turn it off
./llama-cli -m qwen3-30b-a3b-Q4_K_M.gguf -p "Explain EIP-7702"

Benchmarks: where Qwen3 wins (and where it doesn’t)

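| Benchmark | Qwen3-235B-A22B | Qwen3-30B-A3B | Qwen3-4B | DeepSeek-R1 | o1 | Comment |
| --- | --- | --- | --- | --- | --- | --- |
| LiveCodeBench | 72.4 | 68.9 | 58.2 | 71.1 | 74.2 | Coding |
| AIME 2024 | 89.6 | 84.3 | 72.1 | 87.5 | 91.2 | Math |
| GPQA-Diamond | 78.2 | 74.8 | 65.4 | 76.1 | 79.3 | Science reasoning |
| MT-Bench | 9.12 | 8.94 | 8.41 | 8.98 | 9.05 | Chat quality |
| Multilingual (XCOPA) | 94.2 | 92.7 | 89.1 | 91.4 | 93.1 | Multilingual reasoning (XCOPA covers 11 languages, not all 119) |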

The 4B surprise: the Qwen team reports that Qwen3-4B (dense) rivals Qwen2.5-72B-Instruct, a model 18× its size. For local coding assistants on a laptop, this is the model to beat.

Trade-offs:
– Thinking mode adds latency (2–5× for complex reasoning)
– MoE models need memory for every expert, not just the active ones, so VRAM use tracks total parameters
– 235B-A22B needs serious hardware and is not for hobbyists
– Chinese language benchmarks dominate; low-resource languages less tested


The Qwen ecosystem: 200,000+ derivatives on Hugging Face

Per Hugging Face’s Spring 2026 report: Qwen is the single most built-upon model family — 113,000+ Qwen-tagged derivatives (Alibaba alone), 200,000+ total. More than Google + Meta combined.

Why? Apache 2.0 license + strong base models + easy quantization + active community tooling (Qwen-Agent, Qwen-VL, Qwen-Audio, Qwen2.5-Coder, Qwen2.5-Math).

Notable derivatives to watch:
  • Qwen3-Coder-Next (unreleased as of June 2026): specialized coding agent
  • Qwen3-VL: vision-language, 119 languages + OCR
  • Unsloth/4-bit/8-bit GGUFs: community quantizations, often faster than official


Migration from Qwen2.5: what changes

| Aspect | Qwen2.5 | Qwen3 |
| --- | --- | --- |
| License | Apache 2.0 | Apache 2.0 (same) |
| Chat template | Standard | Changed: supports thinking mode |
| Tokenizer | Same vocab | Same (backward compatible) |
| Context | 32K / 128K | 32K (small) / 128K (large) |
| MoE | Qwen2.5-MoE | Qwen3-MoE (new arch, 10% active) |
| Reasoning | Separate QwQ model | Built-in hybrid |
| Tools/MCP | Basic | Native agentic |

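Breaking change: The chat template changed to support thinking mode, so update your tokenizer and chat-template code. The apply_chat_template call now accepts an enable_thinking argument (default True), and in thinking mode the output carries the reasoning inside <think>...</think> ahead of the final answer.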


Bottom line

Qwen3 is the strongest open-weight release of 2025 so far, not because of one benchmark but because it gives you:
  • Reasoning + speed in one model (hybrid thinking)
  • Laptop-friendly MoE (30B-A3B computes like a 3B model per token, given enough memory for its 4-bit weights)
  • 119 languages trained, not just claimed
  • Agentic-native (MCP, tools, coding)
  • Apache 2.0: commercial use, no strings

Start here: ollama pull qwen3:4b for coding, ollama pull qwen3:30b-a3b for reasoning. The 4B runs on a MacBook Air; the 30B-A3B needs a machine with room for at least 15 GB of 4-bit weights.

Official source: qwenlm.github.io/blog/qwen3/ — full specs, benchmarks, download links.

Last updated: Jun 15, 2026.
Aira

Founding Editor and Publisher of ZBrandCo, covering artificial intelligence, open-source software, and the developer tools people actually use. Signal over hype: every story starts from a primary source and explains why it matters. ZBrandCo runs no paid reviews and no affiliate links. Tips and corrections: editorial@zbrandco.com.