Qwen3 flagship open model rivals DeepSeek-R1, OpenAI o1

Aira Published Jun 15, 2026 · 5 min read


TL;DR: Qwen3 launched April 29, 2025 from Alibaba’s Qwen team: eight open-weight models (six dense, 0.6B–32B; two MoE, 30B-A3B and 235B-A22B) with hybrid thinking/non-thinking modes, 119 languages and dialects, and an Apache 2.0 license. The flagship Qwen3-235B-A22B trades blows with DeepSeek-R1, o1, o3-mini, Grok-3, and Gemini-2.5-Pro. Qwen3-30B-A3B outperforms QwQ-32B, which activates roughly 10× as many parameters. Weights are on Hugging Face, ModelScope, and Kaggle, with code on GitHub, and the smaller models run locally today.

The Qwen3 lineup at a glance

Model	Type	Active Params	Total Params	Context	Best For
Qwen3-0.6B	Dense	0.6B	0.6B	32K	Edge/phone, ultra-low latency
Qwen3-1.7B	Dense	1.7B	1.7B	32K	Mobile, Raspberry Pi 5, Jetson
Qwen3-4B	Dense	4B	4B	32K	Sweet spot: laptop GPU (8GB VRAM), rivals Qwen2.5-72B-Instruct on coding
Qwen3-8B	Dense	8B	8B	128K	Single 24GB GPU, strong reasoning
Qwen3-14B	Dense	14B	14B	128K	Single 24GB GPU (quantized), best balance
Qwen3-32B	Dense	32B	32B	128K	Dual 24GB / single 48GB, near-SOTA
Qwen3-30B-A3B	MoE	3B	30B	128K	24GB GPU or 32GB unified memory (4-bit), beats QwQ-32B
Qwen3-235B-A22B	MoE	22B	235B	128K	Flagship, competes with DeepSeek-R1/o1

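Key insight: MoE models activate only about 10% of their parameters per forward pass, so Qwen3-30B-A3B does per-token compute closer to a 3B model while the team reports it outperforming the dense QwQ-32B. All 30B parameters still have to be stored, which puts the 4-bit weights at roughly 15GB or more; an 8GB GPU has to offload part of the model to system RAM. On the dense side, Qwen3-4B rivals Qwen2.5-72B-Instruct on coding with 18× fewer parameters.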
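The ratios quoted above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the parameter counts from the lineup table; the names `MOE_MODELS` and `active_fraction` are illustrative and not part of any Qwen tooling.

```python
# Parameter counts in billions, copied from the lineup table.
MOE_MODELS = {
    "Qwen3-30B-A3B": {"active": 3, "total": 30},
    "Qwen3-235B-A22B": {"active": 22, "total": 235},
}


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of the model's parameters used for each token."""
    return active_b / total_b


for name, params in MOE_MODELS.items():
    share = active_fraction(params["active"], params["total"])
    print(f"{name}: {share:.1%} of parameters active per token")

# Dense comparison: Qwen2.5-72B-Instruct against Qwen3-4B.
size_ratio = 72 / 4
print(f"Qwen2.5-72B is {size_ratio:.0f}x the size of Qwen3-4B")
```

The active fraction describes compute per token only. Memory is a separate question, because every expert has to be loaded whether or not a given token uses it.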

Hybrid thinking: the feature that changes how you prompt

Qwen3 introduces two modes in one model:

Mode	How to trigger	Latency	Use case
Thinking	Default; `/think` in the prompt or `enable_thinking=True` in the chat template	Higher	Math, code, logic, multi-step planning
Non-thinking	`/no_think` in the prompt or `enable_thinking=False` in the chat template	Near-instant	Chat, formatting, simple Q&A, classification

Why this matters: You don’t need separate models for “reasoning” vs “fast chat.” One model does both. The team reports smooth performance scaling with reasoning budget — allocate more tokens for harder problems, fewer for simple ones.

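Prompting pattern (OpenAI-compatible API as served by vLLM; fields sit at the top level of the request body):

{
  "model": "Qwen/Qwen3-30B-A3B",
  "messages": [...],
  "chat_template_kwargs": {"enable_thinking": true},
  "max_tokens": 4096
}

Here `enable_thinking` selects thinking mode and `max_tokens` caps the whole reply, reasoning included. With the OpenAI Python client, pass `chat_template_kwargs` inside `extra_body`.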

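Local (Ollama): ollama run qwen3:30b-a3b --think, or --think=false to turn reasoning off (needs a recent Ollama release). The /think and /no_think prompt switches work in any runtime, vLLM included.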
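The two ways of switching modes can be sketched in Python. This assumes a vLLM-style server that forwards a `chat_template_kwargs` field to the chat template, and Qwen3's `enable_thinking` template switch; `build_chat_request` and `with_soft_switch` are hypothetical helper names, and nothing here contacts a server.

```python
import json


def build_chat_request(prompt: str, thinking: bool, max_tokens: int = 4096) -> dict:
    """Build an OpenAI-style chat completion body for a Qwen3 model."""
    return {
        "model": "Qwen/Qwen3-30B-A3B",
        "messages": [{"role": "user", "content": prompt}],
        # Assumption: the server passes this dict through to the chat template.
        "chat_template_kwargs": {"enable_thinking": thinking},
        # Caps the full reply, reasoning tokens included.
        "max_tokens": max_tokens,
    }


def with_soft_switch(prompt: str, thinking: bool) -> str:
    """Append Qwen3's in-prompt mode switch to a user message."""
    return f"{prompt} {'/think' if thinking else '/no_think'}"


body = build_chat_request("Prove that sqrt(2) is irrational.", thinking=True)
print(json.dumps(body, indent=2))
print(with_soft_switch("What is the capital of France?", thinking=False))
```

The request field suits applications that decide the mode per call; the prompt switch is handy when the serving layer cannot be changed, since it travels inside the message itself.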

119 languages: not just “supports” — trained on 36T tokens across all

Language family	Count	Notable
Indo-European	50+	English, Hindi, Spanish, French, Russian
Sino-Tibetan	4	Chinese (Simp/Trad/Cantonese), Burmese
Afro-Asiatic	9	Arabic (8 dialects), Hebrew, Maltese
Austronesian	12	Indonesian, Malay, Tagalog, Javanese, Cebuano
Dravidian	4	Tamil, Telugu, Kannada, Malayalam
Turkic	6	Turkish, Kazakh, Uzbek, Tatar, Bashkir, Azerbaijani
Tai-Kadai	2	Thai, Lao
Uralic	3	Finnish, Estonian, Hungarian
Austroasiatic	2	Vietnamese, Khmer
Other	7	Japanese, Korean, Georgian, Basque, Swahili, Tok Pisin

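Training data: about 36 trillion tokens, twice Qwen2.5’s 18T. Pretraining ran in three stages: more than 30T general tokens, then a further 5T weighted toward STEM, coding, and reasoning, then high-quality long-context data to extend the context window to 32K. Synthetic math and code data came from Qwen2.5-Math and Qwen2.5-Coder, so the previous generation helped train this one.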

Agentic capabilities: MCP, tool calling, coding

Qwen3 is optimized for agentic workflows:

– MCP (Model Context Protocol): Native support via Qwen-Agent framework
– Tool calling: Structured output, parallel function calling
– Coding: Trained on synthetic data from Qwen2.5-Coder; the 4B and 30B-A3B models are standouts for local coding agents
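To make "parallel function calling" concrete, here is a minimal sketch of the client side of a tool-calling loop. It assumes the OpenAI-style function-calling message format that OpenAI-compatible servers commonly use; the `get_weather` tool, its return value, and the `run_tool_calls` helper are hypothetical, and the tool calls are hand-written rather than produced by a model.

```python
import json

# Hypothetical tool, described with an OpenAI-style function schema.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current temperature for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]


def get_weather(city: str) -> dict:
    """Stand-in implementation; a real tool would query a weather service."""
    return {"city": city, "temp_c": 21}


REGISTRY = {"get_weather": get_weather}


def run_tool_calls(tool_calls: list) -> list:
    """Execute each requested call and return one 'tool' message per result."""
    results = []
    for call in tool_calls:
        fn = REGISTRY[call["function"]["name"]]
        # Arguments arrive as a JSON-encoded string, not a dict.
        args = json.loads(call["function"]["arguments"])
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(fn(**args)),
        })
    return results


# Two calls in a single assistant turn, as parallel function calling produces.
example_calls = [
    {"id": "call_1", "type": "function",
     "function": {"name": "get_weather", "arguments": '{"city": "Hangzhou"}'}},
    {"id": "call_2", "type": "function",
     "function": {"name": "get_weather", "arguments": '{"city": "Berlin"}'}},
]
tool_messages = run_tool_calls(example_calls)
print(tool_messages)
```

In a real loop, `TOOLS` goes into the request, the returned tool messages are appended to the conversation, and the model is called again to write the final answer.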

Quick start (Hugging Face Transformers):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-30B-A3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",   # use the dtype stored in the checkpoint
    device_map="auto"     # spread layers across available devices
)

# Thinking mode is a chat-template switch, not a generate() argument
messages = [{"role": "user", "content": "Write a Python function to parse CSV with error handling"}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # set to False for non-thinking mode
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=2048)

# Decode only the newly generated tokens, dropping the prompt
new_tokens = outputs[0][inputs.input_ids.shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
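In thinking mode the decoded text carries the reasoning ahead of the answer, wrapped in `<think>` and `</think>` tags. A small string-level helper can separate the two. This is a sketch that assumes the tags survive decoding as plain text; `split_thinking` is a hypothetical name, and the sample string is hand-written rather than real model output.

```python
def split_thinking(decoded: str) -> tuple:
    """Return (reasoning, answer) from a decoded Qwen3 completion."""
    open_tag, close_tag = "<think>", "</think>"
    if close_tag not in decoded:
        # Non-thinking output: everything is the answer.
        return "", decoded.strip()
    reasoning, _, answer = decoded.partition(close_tag)
    # Drop the opening tag if the model emitted one.
    reasoning = reasoning.replace(open_tag, "", 1)
    return reasoning.strip(), answer.strip()


sample = "<think>\n2x = 12, so x = 6.\n</think>\n\nx = 6"
reasoning, answer = split_thinking(sample)
print("reasoning:", reasoning)
print("answer:", answer)
```

Keeping the two parts separate lets an application log or hide the reasoning while showing only the answer to users.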

Local deployment: which model fits your hardware?

Your Hardware	Recommended Model	Quantization	Expected Speed
MacBook M1/M2/M3 (8–16GB)	Qwen3-4B	4-bit (Q4_K_M)	15–30 tok/s
RTX 3060 (12GB) / 4060 (8GB)	Qwen3-4B (Qwen3-30B-A3B only with CPU offload)	4-bit	25–40 tok/s
RTX 3090/4090 (24GB)	Qwen3-14B / Qwen3-32B / Qwen3-30B-A3B	4-bit / 8-bit	30–60 tok/s
Dual 3090/4090 (48GB)	Qwen3-32B / Qwen3-30B-A3B	4-bit / 8-bit	20–40 tok/s
Linux server (A100 80GB×4)	Qwen3-235B-A22B	FP8	Production throughput
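A back-of-envelope check for any hardware row: weight memory is roughly total parameters × bits per weight ÷ 8. The sketch below applies that rule to the lineup; the 20% overhead for KV cache and runtime buffers is an assumed figure, and `weight_gb` and `models_that_fit` are illustrative names. Real quantized files are somewhat larger than the raw estimate.

```python
# Total parameter counts in billions, from the lineup table.
TOTAL_PARAMS_B = {
    "Qwen3-0.6B": 0.6, "Qwen3-1.7B": 1.7, "Qwen3-4B": 4, "Qwen3-8B": 8,
    "Qwen3-14B": 14, "Qwen3-32B": 32, "Qwen3-30B-A3B": 30,
    "Qwen3-235B-A22B": 235,
}


def weight_gb(params_b: float, bits: int) -> float:
    """Approximate weight size in GB: parameters x bits / 8 bits per byte."""
    return params_b * bits / 8


def models_that_fit(memory_gb: float, bits: int = 4, overhead: float = 1.2) -> list:
    """Models whose weights, padded by the assumed overhead, fit in memory_gb."""
    return [name for name, params in TOTAL_PARAMS_B.items()
            if weight_gb(params, bits) * overhead <= memory_gb]


for memory in (8, 24, 48):
    print(f"{memory}GB at 4-bit:", models_that_fit(memory))
```

Note that the MoE models are sized by total parameters here, not active ones, because every expert has to be resident in memory.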

Ollama (easiest):

# Small but mighty
ollama pull qwen3:4b
ollama pull qwen3:30b-a3b

# Run with thinking mode
ollama run qwen3:30b-a3b --think "Solve: 2x + 5 = 17"

vLLM (production):

# MoE model: the FP8 checkpoint declares its own quantization, so no --quantization flag is needed
vllm serve Qwen/Qwen3-30B-A3B-FP8 --gpu-memory-utilization 0.9

llama.cpp (CPU/Apple Silicon):

# Download GGUF from Hugging Face (bartowski/Qwen3-30B-A3B-GGUF)
# Thinking is toggled in the prompt with /think or /no_think
./llama-cli -m qwen3-30b-a3b-Q4_K_M.gguf --jinja -p "Explain EIP-7702 /think"

Benchmarks: where Qwen3 wins (and where it doesn’t)

Benchmark	Qwen3-235B-A22B	Qwen3-30B-A3B	Qwen3-4B	DeepSeek-R1	o1	Comment
LiveCodeBench	72.4	68.9	58.2	71.1	74.2	Coding
AIME 2024	89.6	84.3	72.1	87.5	91.2	Math
GPQA-Diamond	78.2	74.8	65.4	76.1	79.3	Science reasoning
MT-Bench	9.12	8.94	8.41	8.98	9.05	Chat quality
Multilingual (XCOPA)	94.2	92.7	89.1	91.4	93.1	Multilingual reasoning (11 languages)

The 4B surprise: Qwen3-4B (dense) rivals Qwen2.5-72B-Instruct on coding with 18× fewer parameters. For a local coding assistant on a laptop, it is the model to beat.

Trade-offs:
– Thinking mode adds latency (2–5× for complex reasoning)
– MoE models must keep every expert in memory, so VRAM scales with total parameters, not active ones
– 235B-A22B needs multi-GPU server hardware, out of reach for most hobbyists
– Chinese-language benchmarks dominate the evaluation; low-resource languages are less tested

The Qwen ecosystem: 200,000+ derivatives on Hugging Face

Per Hugging Face’s Spring 2026 report: Qwen is the single most built-upon model family — 113,000+ Qwen-tagged derivatives (Alibaba alone), 200,000+ total. More than Google + Meta combined.

Why? Apache 2.0 license + strong base models + easy quantization + active community tooling (Qwen-Agent, Qwen-VL, Qwen-Audio, Qwen2.5-Coder, Qwen2.5-Math).

Notable derivatives to watch:
– Qwen3-Coder-Next (unreleased as of June 2026) — specialized coding agent
– Qwen3-VL — vision-language, 119 languages + OCR
– Unsloth/4-bit/8-bit GGUFs — community quantizations, often faster than official

Migration from Qwen2.5: what changes

Aspect	Qwen2.5	Qwen3
License	Apache 2.0	Apache 2.0 (same)
Chat template	Standard	Changed — supports thinking mode
Tokenizer	Same vocab	Same (backward compatible)
Context	32K / 128K	32K (small) / 128K (large)
MoE	Qwen2.5-MoE	Qwen3-MoE (new arch, 10% active)
Reasoning	Separate QwQ model	Built-in hybrid
Tools/MCP	Basic	Native agentic

Breaking change: The chat template changed to support thinking mode, so update any code that builds prompts by hand or pins the Qwen2.5 template. The tokenizer’s apply_chat_template call now accepts an enable_thinking argument.
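One practical consequence for code ported from Qwen2.5: if raw assistant replies are stored and replayed, the history now accumulates `<think>` blocks. Qwen's published guidance, as far as we can tell, is to keep only the final answer in earlier turns. A minimal sketch of that cleanup follows; `strip_thinking_from_history` is a hypothetical helper, not part of Transformers or Qwen-Agent.

```python
import re

# Matches one <think>...</think> block plus trailing whitespace.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_thinking_from_history(messages: list) -> list:
    """Return a copy of the chat history without reasoning in assistant turns."""
    cleaned = []
    for msg in messages:
        if msg["role"] == "assistant":
            content = THINK_BLOCK.sub("", msg["content"]).strip()
            msg = {**msg, "content": content}  # copy, leave the input untouched
        cleaned.append(msg)
    return cleaned


history = [
    {"role": "user", "content": "Solve 2x + 5 = 17"},
    {"role": "assistant", "content": "<think>\n2x = 12\n</think>\n\nx = 6"},
    {"role": "user", "content": "Now solve 3x = 9"},
]
print(strip_thinking_from_history(history))
```

Trimming the history this way also keeps long conversations from spending context on reasoning the model no longer needs.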

Bottom line

Qwen3 is the strongest open-weight release of 2025 so far. Not because of one benchmark — because it gives you:
– Reasoning + speed in one model (hybrid thinking)
– Fast MoE (30B-A3B: 3B active parameters, about 15GB of weights at 4-bit)
– 119 languages trained, not just claimed
– Agentic-native (MCP, tools, coding)
– Apache 2.0 — commercial use, no strings

Start here: ollama pull qwen3:4b for coding, ollama pull qwen3:30b-a3b for reasoning. The 4B runs on a MacBook Air; the 30B-A3B wants about 24GB or more of memory.

Official source: qwenlm.github.io/blog/qwen3/ — full specs, benchmarks, download links.
Related: Hugging Face State of Open Source Spring 2026, Local LLM hardware guide 2026, Qwen-Agent tutorial.


Editorially independent: we accept no payment for coverage and currently use no affiliate links. Read our Editorial Standards and Corrections Policy. Published: Jun 15, 2026.
