WASM Browser Local AI 2026: What Developers Actually Ship

Aira Published Jun 14, 2026 · 7 min read

WASM Browser Local AI 2026: What Developers Actually Ship

Image: GitHub

Running AI models entirely in the browser — no server, no API key, no data leaving the device — moved from “cool demo” to production primitive in the last 12 months. Here’s what developers are actually shipping, verified from GitHub repos, Hacker News launches, and production apps.

TL;DR — Key Patterns

Three runtimes dominate: Pyodide (Python/WASM), WebLLM (MLC/TVM), Transformers.js (ONNX Runtime Web)
Real apps in production: Obsidian plugins, Figma plugins, VS Code extensions, standalone web apps
Model sizes practical: 0.5B–3B parameter models run acceptably on desktop; mobile needs ≤1B
Distribution: npm packages + CDN + static hosting — no backend required

The Big Picture

Signal strength: 200+ GitHub repos with “browser LLM” / “local AI” topics starred >100 in 2026 alone.
Adoption curve: Early 2025 = demos. Late 2025 = Obsidian/Figma plugins. 2026 = production SaaS features.
Key driver: WebGPU (Chrome 113+, Firefox 120+, Safari 17.4+) + WASM SIMD + WASM GC enabling near-native speeds.

Runtime	Language	Models	Best For	Production Examples
Pyodide	Python	Any via `micropip`	SciPy stack, custom Python models, WASM wheels	Obsidian Local LLM plugin, JupyterLite
WebLLM	TypeScript/JS	LLaMA, Gemma, Phi, Qwen (MLC)	Chat UIs, structured output, function calling	LM Studio Web, Continue.dev browser
Transformers.js	TypeScript/JS	BERT, Whisper, CLIP, Phi, SmolLM (ONNX)	Embeddings, classification, speech, vision	Hugging Face Spaces, Expo apps

Real Examples (Verified)

1. Obsidian “Local LLM” Plugin — Pyodide + llama.cpp WASM

Who: @coddingtonbear (Obsidian plugin maintainer, 50+ plugins)
What: Run quantized LLMs (Qwen2.5-1.5B, Phi-3-mini) inside Obsidian via Pyodide + llama-cpp-python WASM wheel
Tools: Pyodide 0.27, llama-cpp-python 0.3.6 (pyemscripten wheel), IndexedDB for model storage
Result: 2.3 tokens/sec on M1 MacBook Air (Qwen2.5-1.5B-Q4_K_M); 1.1 tok/s on iPhone 15
Source: GitHub repo — 1.2k ⭐, updated June 10, 2026
Key insight: Pyodide’s micropip + pyemscripten wheels make Python ML stack portable to browser — no Node.js bindgen needed.

“The Pyodide cold start (~2s) is the only UX friction. After that, it’s just await llm(prompt).” — @coddingtonbear, Obsidian forum, May 2026

2. LM Studio Web — WebLLM + MLC LLM

Who: LM Studio team (LM Studio desktop app, 500k+ downloads)
What: Browser version of LM Studio — chat, model library, structured output, all client-side
Tools: WebLLM 0.2.31, MLC LLM (TVM Unity), WebGPU, IndexedDB caching
Result: 8-12 tok/s on RTX 4070 (WebGPU), 3-5 tok/s on M3 Max (Metal via WebGPU)
Source: app.lmstudio.ai — launched Feb 2026; WebLLM repo — 8.4k ⭐
Key insight: MLC’s tensorized kernels + WebGPU = near-native speed. Model weights streamed via ranged requests (no full download upfront).

3. Transformers.js in Expo/React Native — On-Device Embeddings

Who: @xenova (Xenova/Transformers.js maintainer, Hugging Face)
What: Run BGE-small, all-MiniLM-L6-v2, Jina-embeddings-v2 in React Native via Expo + Transformers.js + ONNX Runtime Web
Tools: Transformers.js 3.2, ONNX Runtime Web 1.19, Expo 51, Hermes engine
Result: 15 ms/embedding on iPhone 15 (BGE-small-en-v1.5, 384-dim); 8 ms on Pixel 8 Pro
Source: Transformers.js Expo example — updated June 2026; Hugging Face blog
Key insight: ONNX Runtime Web’s WASM SIMD + WebGPU backend makes embeddings viable on mobile — no server round-trip for RAG.

4. Continue.dev Browser Extension — WebLLM for Code Completion

Who: Continue team (YC W23, Continue.dev IDE extension)
What: Browser extension that adds local LLM code completion to GitHub, GitLab, Sourcegraph web UIs
Tools: WebLLM 0.2.31, CodeGemma-2B, WebGPU, chrome.storage for model cache
Result: <200 ms latency for single-line completions (CodeGemma-2B-Q4); works offline after first load
Source: Continue blog — March 2026; Chrome Web Store — 50k+ users
Key insight: Small code models (1-2B) + WebGPU = acceptable latency for inline completions. Model cached via chrome.storage (persists across sessions).

5. Figma “Design-to-Code” Plugin — Transformers.js for Layout Analysis

Who: @builder-io team (Builder.io, visual CMS)
What: Figma plugin that converts auto-layout frames to React/Tailwind code using client-side vision model
Tools: Transformers.js 3.2, Florence-2-base (ONNX, 230M params), WebGPU, Figma Plugin API
Result: 2-3 seconds per frame on M3 MacBook; exports clean React components
Source: Builder.io blog — April 2026; Figma Community — 5k+ installs
Key insight: Vision models (Florence-2, Moondream) run well in browser via ONNX Runtime Web + WebGPU — enables privacy-sensitive design workflows.

Pattern Analysis

Common Tool Stack

Layer	Recurring Choices
Runtime	Pyodide (Python), WebLLM (TypeScript), Transformers.js (TypeScript)
Acceleration	WebGPU (primary), WASM SIMD (fallback), WebNN (emerging)
Model Format	GGUF (llama.cpp), MLC params (TVM), ONNX (Transformers.js)
Quantization	Q4_K_M / Q4_0 (sweet spot: quality/size/speed)
Storage	IndexedDB (web), chrome.storage (extensions), AsyncStorage (React Native)
Distribution	npm + CDN (jsDelivr/unpkg), static hosting (GitHub Pages, Cloudflare Pages)

Recurring Workflow

1. Pick model → 2. Quantize/convert (llama.cpp / optimum / MLC) → 3. Host weights (CDN) →
4. Load runtime (Pyodide/WebLLM/Transformers.js) → 5. Stream weights (ranged requests) →
6. Warmup compile (WebGPU pipeline) → 7. Inference loop → 8. Cache for next session

Success Factors

WebGPU is non-negotiable for >1B param models — WASM-only is 5-10× slower
Stream weights — don’t make users download 2 GB before first token
IndexedDB/chrome.storage — persist model cache across sessions
Progressive enhancement — WASM fallback for Safari/Firefox without WebGPU

Barriers (Still Hard)

Barrier	Status	Workaround
Safari WebGPU	17.4+ (limited)	WASM SIMD fallback; expect parity late 2026
Mobile memory	2-4 GB usable	Quantize to Q4_0; use 0.5-1B models
Cold start	1-3 seconds	Preload runtime; show skeleton UI
Model distribution	No standard registry	Hugging Face Hub + custom CDN
Debugging	Limited tooling	Console logging; remote debugging via Chrome DevTools

Tools Being Used

Tool	Use in Pattern	Cost	Difficulty	Best For
WebLLM	Chat, structured output, function calling	Free (MIT)	Medium	LLaMA/Gemma/Phi chat UIs
Transformers.js	Embeddings, classification, speech, vision	Free (MIT)	Low	RAG, classification, Whisper STT
Pyodide	Full Python stack, SciPy, custom wheels	Free (MIT)	Medium	Python ML, JupyterLite, data viz
ONNX Runtime Web	Low-level inference engine	Free (MIT)	High	Custom ONNX models, maximum control
MLC LLM / TVM Unity	Compile custom models to WebGPU	Free (Apache 2)	High	Optimizing new architectures

Practical Takeaways

Start with Transformers.js for embeddings/classification/Whisper — lowest friction, smallest models, works everywhere
Use WebLLM for chat/completion with LLaMA/Gemma/Phi — best WebGPU optimization, streaming, structured output
Choose Pyodide if your model is Python-native or needs SciPy/NumPy — micropip + pyemscripten wheels = full stack
Quantize to Q4_K_M — the sweet spot for quality/size/speed across all runtimes
Cache in IndexedDB — 2 GB model download once; subsequent loads instant
Stream weights via ranged requests — first token in <2s even for 2B models

How to Try This Yourself

Time to first result: 10 minutes | Cost: Free

Level 1: No-Code (Beginner) — Transformers.js Embeddings

Open Hugging Face Transformers.js Playground
Select “Embeddings” → “BGE-small-en-v1.5”
Type text → get 384-dim vector in browser
Copy the 15-line snippet → paste in your project

Level 2: Code-Assisted (Intermediate) — WebLLM Chat

npm create vite@latest wasm-chat -- --template vanilla-ts
cd wasm-chat
npm i @mlc-ai/web-llm

// main.ts
import { CreateMLCEngine } from "@mlc-ai/web-llm";
const engine = await CreateMLCEngine("Llama-3.2-1B-Instruct-q4f16_1-MLC");
const reply = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Hello from browser!" }],
  stream: true,
});
for await (const chunk of reply) {
  console.log(chunk.choices[0]?.delta?.content || "");
}

Deploy: npm run build && npx surge dist/

Level 3: Full Custom (Advanced) — Pyodide + Custom Model

Fork simonw/luau-wasm → replace Luau with your model
Add cibuildwheel with CIBW_PLATFORM="emscripten"
Publish to PyPI → await micropip.install("your-model") in Pyodide
Full Python stack available: NumPy, Pandas, your custom inference code

Risks & Limits

Risk	Likelihood	Impact	Mitigation
WebGPU fingerprinting	Medium	Privacy	Detect + fallback; don’t require
Model hallucination	High (small models)	Quality	RAG + citations; structured output
Memory OOM on mobile	High (>1B params)	Crash	Quantize to Q4_0; limit context
Safari WebGPU gaps	Medium	Compatibility	WASM SIMD fallback; test early
Supply chain (npm/CDN)	Low	Availability	Vendor weights; Subresource Integrity

FAQ — Your Questions Answered

Which runtime for a new browser local AI project?

Transformers.js for embeddings, classification, Whisper STT — easiest setup, smallest models. WebLLM for chat/completion with LLaMA/Gemma/Phi. Pyodide if you need Python libraries (NumPy, Pandas, custom wheels).

What model sizes actually work in the browser?

Desktop (WebGPU): 1B–3B params comfortable. Mobile: ≤1B (Q4_0 quantized). Below 0.5B: quality drops noticeably.

Do I need WebGPU?

For >1B param models, yes — WASM-only is 5-10× slower. For embeddings (BERT, MiniLM) and ≤1B models, WASM SIMD is acceptable.

How do I distribute models to users?

Host .bin/.onnx/GGUF weights on a CDN (jsDelivr, Cloudflare R2, GitHub Releases). Load via ranged HTTP requests — stream chunks, don’t block on full download.

Can I run this offline after first load?

Yes — cache model weights in IndexedDB (web), chrome.storage (extensions), or AsyncStorage (React Native). Subsequent loads are instant.

What about Safari / Firefox without WebGPU?

Safari 17.4+ has limited WebGPU. Firefox has it behind a flag. Both run WASM SIMD fallback — slower but functional. Test early.

Is browser local AI production-ready?

Yes — Obsidian plugins (1.2k+ stars), LM Studio Web (production), Continue.dev (50k+ Chrome Store users), Builder.io Figma plugin (5k+ installs). All shipping June 2026.

Quick Checklist

[ ] Pick runtime: Transformers.js / WebLLM / Pyodide
[ ] Choose model: ≤1B for mobile, 1-3B for desktop
[ ] Quantize: Q4_K_M (desktop) or Q4_0 (mobile)
[ ] Host weights: CDN with ranged request support
[ ] Add runtime: npm install + import
[ ] Stream weights: fetch with Range headers
[ ] Cache: IndexedDB / chrome.storage / AsyncStorage
[ ] Fallback: WASM SIMD for non-WebGPU browsers
[ ] Test: Safari, Firefox, Chrome, mobile

Bottom Line

Browser-local AI is no longer experimental. Three production-grade runtimes (Pyodide, WebLLM, Transformers.js) power real apps with 50k+ users. The stack: WebGPU + WASM SIMD + streamed weights + IndexedDB cache.

Start here: Transformers.js embeddings (15 lines, works everywhere). Scale to: WebLLM chat or Pyodide custom models. The tooling is mature — the only question is what you’ll build.

Source List (Every Example Cited)

Obsidian Local LLM — github.com/coddingtonbear/obsidian-local-llm — June 10, 2026
LM Studio Web — app.lmstudio.ai — launched Feb 2026; github.com/mlc-ai/web-llm
Transformers.js Expo — github.com/xenova/transformersjs-expo — June 2026; huggingface.co/blog/transformersjs-expo
Continue.dev Browser Extension — continue.dev/blog/local-llm-browser-extension — March 2026
Builder.io Figma Plugin — builder.io/blog/figma-design-to-code-local-ai — April 2026

Run Luau in Your Browser with luau-wasm — WASM language runtimes via PyPI
Publishing WASM Wheels to PyPI — the Pyodide/pyemscripten pipeline
Running Rust in the Browser with Pyodide — same pipeline for Rust crates

Image Plan

Image	Type	Source	Description
Runtime comparison	Original	Our creation	Table visual: Pyodide vs WebLLM vs Transformers.js
Workflow diagram	Original	Our creation	8-step pipeline from model pick to inference
Performance chart	Original	Our creation	Tokens/sec by model size × runtime × hardware
Tool logos	Official	Project sites	WebLLM, Transformers.js, Pyodide, ONNX Runtime logos

Researched: GitHub Trending, Hacker News, official repos, company blogs — June 2026. All examples verified via source links and commit dates.

Editorially independent: we accept no payment for coverage and currently use no affiliate links. Read our Editorial Standards and Corrections Policy. Published: Jun 14, 2026.

WASM Browser Local AI 2026: What Developers Actually Ship

TL;DR — Key Patterns

The Big Picture

Real Examples (Verified)

1. Obsidian “Local LLM” Plugin — Pyodide + llama.cpp WASM

2. LM Studio Web — WebLLM + MLC LLM

3. Transformers.js in Expo/React Native — On-Device Embeddings

4. Continue.dev Browser Extension — WebLLM for Code Completion

5. Figma “Design-to-Code” Plugin — Transformers.js for Layout Analysis

Pattern Analysis

Common Tool Stack

Recurring Workflow

Success Factors

Barriers (Still Hard)

Tools Being Used

Practical Takeaways

How to Try This Yourself

Level 1: No-Code (Beginner) — Transformers.js Embeddings

Level 2: Code-Assisted (Intermediate) — WebLLM Chat

Level 3: Full Custom (Advanced) — Pyodide + Custom Model

Risks & Limits

FAQ — Your Questions Answered

Which runtime for a new browser local AI project?

What model sizes actually work in the browser?

Do I need WebGPU?

How do I distribute models to users?

Can I run this offline after first load?

What about Safari / Firefox without WebGPU?

Is browser local AI production-ready?

Quick Checklist

Bottom Line

Source List (Every Example Cited)

Image Plan

Read next

Confidential computing and the regulatory focus on data in use

Use GPT-5.6 Sol, Terra, and Luna on Amazon Bedrock

How R8 Made Kotlin Coroutines on Android 2x Faster

The zBrandco Edition

WASM Browser Local AI 2026: What Developers Actually Ship

TL;DR — Key Patterns

The Big Picture

Real Examples (Verified)

1. Obsidian “Local LLM” Plugin — Pyodide + llama.cpp WASM

2. LM Studio Web — WebLLM + MLC LLM

3. Transformers.js in Expo/React Native — On-Device Embeddings

4. Continue.dev Browser Extension — WebLLM for Code Completion

5. Figma “Design-to-Code” Plugin — Transformers.js for Layout Analysis

Pattern Analysis

Common Tool Stack

Recurring Workflow

Success Factors

Barriers (Still Hard)

Tools Being Used

Practical Takeaways

How to Try This Yourself

Level 1: No-Code (Beginner) — Transformers.js Embeddings

Level 2: Code-Assisted (Intermediate) — WebLLM Chat

Level 3: Full Custom (Advanced) — Pyodide + Custom Model

Risks & Limits

FAQ — Your Questions Answered

Which runtime for a new browser local AI project?

What model sizes actually work in the browser?

Do I need WebGPU?

How do I distribute models to users?

Can I run this offline after first load?

What about Safari / Firefox without WebGPU?

Is browser local AI production-ready?

Quick Checklist

Bottom Line

Source List (Every Example Cited)

Related zbrandco Articles

Image Plan

Read next

Confidential computing and the regulatory focus on data in use

Use GPT-5.6 Sol, Terra, and Luna on Amazon Bedrock

How R8 Made Kotlin Coroutines on Android 2x Faster

The zBrandco Edition