TL;DR — Qwen3 launched April 29, 2025 from Alibaba’s Qwen team. Eight models (six dense: 0.6B–32B; two MoE: 30B-A3B, 235B-A22B), hybrid thinking/non-thinking modes, 119 languages, Apache 2.0 license. The flagship Qwen3-235B-A22B trades blows with DeepSeek-R1, o1, o3-mini, Grok-3, and Gemini-2.5-Pro. The Qwen3-30B-A3B outperforms QwQ-32B with 10× fewer activated parameters. All weights on Hugging Face, ModelScope, Kaggle, GitHub — runnable locally today.
The Qwen3 lineup at a glance
| Model | Type | Active Params | Total Params | Context | Best For |
|---|---|---|---|---|---|
| Qwen3-0.6B | Dense | 0.6B | 0.6B | 32K | Edge/phone, ultra-low latency |
| Qwen3-1.7B | Dense | 1.7B | 1.7B | 32K | Mobile, Raspberry Pi 5, Jetson |
| Qwen3-4B | Dense | 4B | 4B | 32K | Sweet spot: laptop GPU (8GB VRAM), rivals Qwen2.5-72B-Instruct on coding |
| Qwen3-8B | Dense | 8B | 8B | 128K | Single 24GB GPU, strong reasoning |
| Qwen3-14B | Dense | 14B | 14B | 128K | Single 24GB GPU (quantized), best balance |
| Qwen3-32B | Dense | 32B | 32B | 128K | Dual 24GB / single 48GB, near-SOTA |
| Qwen3-30B-A3B | MoE | 3B | 30B | 128K | Single 24GB GPU or 24GB+ unified memory (quantized); outperforms QwQ-32B |
| Qwen3-235B-A22B | MoE | 22B | 235B | 128K | Flagship, competes with DeepSeek-R1/o1 |
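Key insight: MoE models activate only about 10% of their parameters per token (3B of 30B, 22B of 235B). Qwen3-30B-A3B therefore computes like a 3B model while matching a 32B dense model (QwQ-32B), but all 30B weights must still sit in memory, which is about 15GB at 4-bit. Qwen3-4B (dense) rivals Qwen2.5-72B-Instruct on coding with 18× fewer parameters.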
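The ratios quoted above are simple arithmetic on the parameter counts in the lineup table. A minimal sketch, with function names that are illustrative only:

```python
# Parameter counts in billions, taken from the lineup table.
MOE_MODELS = {
    "Qwen3-30B-A3B": {"active_b": 3, "total_b": 30},
    "Qwen3-235B-A22B": {"active_b": 22, "total_b": 235},
}


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of parameters used for each token in an MoE model."""
    return active_b / total_b


def size_ratio(larger_b: float, smaller_b: float) -> float:
    """How many times more parameters the larger model has."""
    return larger_b / smaller_b


if __name__ == "__main__":
    for name, p in MOE_MODELS.items():
        frac = active_fraction(p["active_b"], p["total_b"])
        print(f"{name}: {frac:.1%} of parameters active per token")
    print(f"Qwen2.5-72B vs Qwen3-4B: {size_ratio(72, 4):.0f}x the parameters")
```

The active fraction describes compute per token only; it says nothing about how much memory the full set of weights occupies.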
Hybrid thinking: the feature that changes how you prompt
Qwen3 introduces two modes in one model:
| Mode | How to trigger | Latency | Use case |
|---|---|---|---|
| Thinking | Default; or add /think to the prompt, or set enable_thinking=True in the chat template | Higher | Math, code, logic, multi-step planning |
| Non-thinking | Add /no_think to the prompt, or set enable_thinking=False in the chat template | Near-instant | Chat, formatting, simple Q&A, classification |
Why this matters: You don’t need separate models for “reasoning” vs “fast chat.” One model does both. The team reports smooth performance scaling with reasoning budget — allocate more tokens for harder problems, fewer for simple ones.
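One way to use a single model for both kinds of work is to choose the mode per request. The sketch below appends Qwen3's in-prompt soft switches, /think and /no_think, based on a task label. The task labels and the function name are hypothetical, not part of any Qwen API.

```python
# Hypothetical task labels; adapt them to your own routing scheme.
THINKING_TASKS = {"math", "code", "logic", "planning"}


def with_mode_switch(prompt: str, task: str) -> str:
    """Append /think for reasoning-heavy tasks and /no_think for everything else."""
    switch = "/think" if task in THINKING_TASKS else "/no_think"
    return f"{prompt} {switch}"


if __name__ == "__main__":
    print(with_mode_switch("Prove that the square root of 2 is irrational.", "math"))
    print(with_mode_switch("Reformat this list as a table.", "formatting"))
```

A soft switch rides along in the prompt text, so it works with any runtime that feeds the prompt through the Qwen3 chat template.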
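Prompting pattern (OpenAI-compatible API, for example a vLLM server, which forwards chat_template_kwargs to the chat template; max_tokens caps the whole completion, thinking included):
{
  "model": "Qwen/Qwen3-30B-A3B",
  "messages": [...],
  "chat_template_kwargs": {"enable_thinking": true},
  "max_tokens": 4096
}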
Local (Ollama): ollama run qwen3:30b-a3b --think (use --think=false to turn thinking off)
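For servers that pass template switches through, the request body can be assembled in code. This is a sketch that assumes a vLLM-style server which forwards a `chat_template_kwargs` field to the Qwen3 chat template. That field and `enable_thinking` come from the vLLM and Qwen documentation rather than from the text above, and `build_chat_request` is a hypothetical helper.

```python
import json


def build_chat_request(model: str, user_prompt: str, thinking: bool,
                       max_tokens: int = 4096) -> dict:
    """Build a chat-completions request body for a vLLM-style server."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_prompt}],
        # Caps the whole completion, including any thinking tokens.
        "max_tokens": max_tokens,
        # Assumption: the server forwards these kwargs to the chat template.
        "chat_template_kwargs": {"enable_thinking": thinking},
    }


if __name__ == "__main__":
    body = build_chat_request("Qwen/Qwen3-30B-A3B", "Solve: 2x + 5 = 17", thinking=True)
    print(json.dumps(body, indent=2))
```

Other servers may expose the switch under a different name, so check the serving stack's documentation before relying on this shape.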
119 languages: not just “supports” — trained on 36T tokens across all
| Language family | Count | Notable |
|---|---|---|
| Indo-European | 50+ | English, Hindi, Spanish, French, Russian |
| Sino-Tibetan | 4 | Chinese (Simp/Trad/Cantonese), Burmese |
| Afro-Asiatic | 10 | Arabic (8 dialects), Hebrew, Maltese |
| Austronesian | 12 | Indonesian, Malay, Tagalog, Javanese, Cebuano |
| Dravidian | 4 | Tamil, Telugu, Kannada, Malayalam |
| Turkic | 6 | Turkish, Kazakh, Uzbek, Tatar, Bashkir, Azerbaijani |
| Tai-Kadai | 2 | Thai, Lao |
| Uralic | 3 | Finnish, Estonian, Hungarian |
| Austroasiatic | 2 | Vietnamese, Khmer |
| Other | 7 | Japanese, Korean, Georgian, Basque, Swahili, Tok Pisin |
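Training data: ~36 trillion tokens, twice Qwen2.5’s 18T. Pretraining ran in three stages: more than 30T tokens of general data, a further 5T weighted toward STEM, coding, and reasoning, and a final stage of high-quality long-context data that extends the context window to 32K. The synthetic code and math data was generated with Qwen2.5-Coder and Qwen2.5-Math, so the previous generation helped train its successor.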
Agentic capabilities: MCP, tool calling, coding
Qwen3 is optimized for agentic workflows:
- MCP (Model Context Protocol): Native support via Qwen-Agent framework
- Tool calling: Structured output, parallel function calling
- Coding: Trained on synthetic data from Qwen2.5-Coder; the 4B and 30B-A3B models are standouts for local coding agents
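To make the tool-calling bullet concrete, here is a minimal sketch of the client side of a tool-call loop in the OpenAI-style function-calling format. The `get_weather` tool, the registry, and `run_tool_calls` are hypothetical. A real agent would send `TOOLS` with the request and receive `tool_calls` back from the model.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical tool; returns canned data so the sketch runs offline."""
    return f"Sunny in {city}"


# Tool description in the OpenAI-style function-calling schema.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# Maps tool names to the Python callables that implement them.
REGISTRY = {"get_weather": get_weather}


def run_tool_calls(tool_calls: list) -> list:
    """Execute each requested call and return tool-role messages for the next turn."""
    results = []
    for call in tool_calls:
        fn = call["function"]
        args = json.loads(fn["arguments"])  # arguments arrive as a JSON string
        output = REGISTRY[fn["name"]](**args)
        results.append({"role": "tool", "tool_call_id": call["id"], "content": output})
    return results
```

Parallel function calling shows up as several entries in `tool_calls` within one assistant turn. The loop handles them in order and returns one tool message per call.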
Quick start (Hugging Face Transformers):
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-30B-A3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Load weights in the checkpoint's dtype and spread them across available devices
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

# Thinking mode is selected in the chat template, not in generate()
messages = [{"role": "user", "content": "Write a Python function to parse CSV with error handling"}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # set to False for non-thinking mode
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=2048)
# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True))
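In thinking mode the model emits its reasoning inside a `<think>...</think>` block before the final answer. A small helper can separate the two. This is a string-based sketch that assumes the decoded text still contains the tags, and `split_thinking` is a hypothetical name.

```python
def split_thinking(text: str) -> tuple:
    """Return (reasoning, answer) from output that may contain a <think> block."""
    open_tag, close_tag = "<think>", "</think>"
    head, sep, tail = text.partition(close_tag)
    if not sep:
        # No closing tag found: treat the whole output as the answer.
        return "", text.strip()
    reasoning = head.replace(open_tag, "", 1).strip()
    return reasoning, tail.strip()


if __name__ == "__main__":
    sample = "<think>\n2x = 12, so x = 6\n</think>\n\nx = 6"
    reasoning, answer = split_thinking(sample)
    print("reasoning:", reasoning)
    print("answer:", answer)
```

Splitting on the closing tag also covers output where the opening tag was consumed by the prompt template and only `</think>` appears in the generated text.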
Local deployment: which model fits your hardware?
| Your Hardware | Recommended Model | Quantization | Expected Speed |
|---|---|---|---|
| MacBook M1/M2/M3 (8–16GB) | Qwen3-4B | 4-bit (Q4_K_M) | 15–30 tok/s |
| RTX 3060 (12GB) / 4060 (8GB) | Qwen3-4B (Qwen3-30B-A3B only with CPU offload) | 4-bit | 25–40 tok/s |
| RTX 3090/4090 (24GB) | Qwen3-14B / Qwen3-32B / Qwen3-30B-A3B | 4-bit (8-bit fits for 14B only) | 30–60 tok/s |
| Dual 3090/4090 (48GB) | Qwen3-32B | 4-bit / 8-bit | 20–40 tok/s |
| Linux server (A100 80GB×4) | Qwen3-235B-A22B | FP8 (BF16 weights are about 470GB and need more GPUs) | Production throughput |
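A rough way to sanity-check a model and hardware pairing is to estimate the memory taken by the weights alone: total parameters times bits per weight, divided by eight. The sketch below does that arithmetic. The 0.8 headroom factor for KV cache and runtime overhead is an assumption, not a measured value, and both function names are illustrative.

```python
def weight_memory_gb(total_params_b: float, bits_per_weight: float) -> float:
    """Approximate memory for weights only, in GB (1 GB = 1e9 bytes)."""
    return total_params_b * bits_per_weight / 8


def fits(total_params_b: float, bits_per_weight: float, memory_gb: float,
         headroom: float = 0.8) -> bool:
    """True if the weights fit in a fraction of memory, leaving room for KV cache."""
    return weight_memory_gb(total_params_b, bits_per_weight) <= memory_gb * headroom


if __name__ == "__main__":
    # (model, total parameters in billions, bits per weight, available memory in GB)
    cases = [
        ("Qwen3-4B", 4, 4, 8),
        ("Qwen3-30B-A3B", 30, 4, 8),
        ("Qwen3-30B-A3B", 30, 4, 24),
        ("Qwen3-235B-A22B", 235, 8, 320),
    ]
    for name, params_b, bits, mem in cases:
        need = weight_memory_gb(params_b, bits)
        verdict = "fits" if fits(params_b, bits, mem) else "does not fit"
        print(f"{name} @ {bits}-bit: ~{need:.1f} GB of weights, {verdict} in {mem} GB")
```

For MoE models the estimate uses total parameters, not active ones, because every expert has to be resident. Real quantization formats such as Q4_K_M store somewhat more than their nominal bits per weight, so treat the result as a lower bound.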
Ollama (easiest):
# Small but mighty
ollama pull qwen3:4b
ollama pull qwen3:30b-a3b
# Run with thinking mode
ollama run qwen3:30b-a3b --think "Solve: 2x + 5 = 17"
vLLM (production):
# MoE model: serve the pre-quantized FP8 checkpoint (vLLM reads the quantization method from the checkpoint config)
vllm serve Qwen/Qwen3-30B-A3B-FP8 --gpu-memory-utilization 0.9
llama.cpp (CPU/Apple Silicon):
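# Download GGUF from Hugging Face (bartowski/Qwen3-30B-A3B-GGUF)
# llama-cli has no --think flag; Qwen3 thinks by default and honors /think and /no_think in the prompt
./llama-cli -m qwen3-30b-a3b-Q4_K_M.gguf --jinja -p "Explain EIP-7702 /think"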
Benchmarks: where Qwen3 wins (and where it doesn’t)
| Benchmark | Qwen3-235B-A22B | Qwen3-30B-A3B | Qwen3-4B | DeepSeek-R1 | o1 | Comment |
|---|---|---|---|---|---|---|
| LiveCodeBench | 72.4 | 68.9 | 58.2 | 71.1 | 74.2 | Coding |
| AIME 2024 | 89.6 | 84.3 | 72.1 | 87.5 | 91.2 | Math |
| GPQA-Diamond | 78.2 | 74.8 | 65.4 | 76.1 | 79.3 | Science reasoning |
| MT-Bench | 9.12 | 8.94 | 8.41 | 8.98 | 9.05 | Chat quality |
| Multilingual (XCOPA) | 94.2 | 92.7 | 89.1 | 91.4 | 93.1 | Multilingual reasoning (XCOPA covers 11 languages, not all 119) |
The 4B surprise: Qwen3-4B (dense) matches Qwen2.5-72B-Instruct on coding — that’s 18× parameter efficiency. For local coding assistants on a laptop, this is the model to beat.
Trade-offs:
- Thinking mode adds latency (2–5× for complex reasoning)
- MoE models must keep every expert in memory, so VRAM scales with total parameters even though compute scales with active ones
- 235B-A22B needs serious hardware and is not for hobbyists
- Chinese language benchmarks dominate; low-resource languages are less tested
The Qwen ecosystem: 200,000+ derivatives on Hugging Face
Per Hugging Face’s Spring 2026 report: Qwen is the single most built-upon model family — 113,000+ Qwen-tagged derivatives (Alibaba alone), 200,000+ total. More than Google + Meta combined.
Why? Apache 2.0 license + strong base models + easy quantization + active community tooling (Qwen-Agent, Qwen-VL, Qwen-Audio, Qwen2.5-Coder, Qwen2.5-Math).
Notable derivatives to watch:
- Qwen3-Coder-Next (unreleased as of June 2026): specialized coding agent
- Qwen3-VL: vision-language, 119 languages + OCR
- Unsloth/4-bit/8-bit GGUFs: community quantizations, often faster than official
Migration from Qwen2.5: what changes
| Aspect | Qwen2.5 | Qwen3 |
|---|---|---|
| License | Apache 2.0 | Apache 2.0 (same) |
| Chat template | Standard | Changed — supports thinking mode |
| Tokenizer | Same vocab | Same (backward compatible) |
| Context | 32K / 128K | 32K (small) / 128K (large) |
| MoE | Qwen2.5-MoE | Qwen3-MoE (new arch, 10% active) |
| Reasoning | Separate QwQ model | Built-in hybrid |
| Tools/MCP | Basic | Native agentic |
Breaking change: The chat template changed to support thinking mode, so update any code that hard-codes the Qwen2.5 template. With Qwen3, tokenizer.apply_chat_template accepts an enable_thinking argument (default True) that switches between the two modes.
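During a migration it can help to route every prompt through one wrapper so the mode switch lives in a single place. A minimal sketch follows, assuming a Hugging Face tokenizer whose chat template understands `enable_thinking`, as the Qwen3 template does. `render_prompt` is a hypothetical helper name.

```python
def render_prompt(tokenizer, messages: list, thinking: bool = True) -> str:
    """Render a chat prompt, toggling Qwen3 thinking mode through the chat template."""
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=thinking,  # read by the Qwen3 chat template
    )
```

Usage would look like `render_prompt(tokenizer, [{"role": "user", "content": "Hi"}], thinking=False)` with a tokenizer loaded from a Qwen3 checkpoint. Extra keyword arguments are handed to the template, so a template that does not reference the variable should simply ignore it.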
Bottom line
Qwen3 is the strongest open-weight release of 2025 so far. Not because of one benchmark, but because it gives you:
- Reasoning + speed in one model (hybrid thinking)
- Laptop-friendly MoE (30B-A3B computes like a 3B model, though its 4-bit weights still need about 15GB or more)
- 119 languages trained, not just claimed
- Agentic-native (MCP, tools, coding)
- Apache 2.0: commercial use, no strings
Start here: ollama pull qwen3:4b for coding, ollama pull qwen3:30b-a3b for reasoning. The 4B runs on a MacBook Air; the 30B-A3B wants a machine with 24GB or more of memory.
Official source: qwenlm.github.io/blog/qwen3/ — full specs, benchmarks, download links.
