WASM Browser Local AI 2026: What Developers Actually Ship

AI Use Cases · zbrandco

Running AI models entirely in the browser — no server, no API key, no data leaving the device — moved from “cool demo” to production primitive in the last 12 months. Here’s what developers are actually shipping, verified from GitHub repos, Hacker News launches, and production apps.

TL;DR — Key Patterns

  • Three runtimes dominate: Pyodide (Python/WASM), WebLLM (MLC/TVM), Transformers.js (ONNX Runtime Web)
  • Real apps in production: Obsidian plugins, Figma plugins, VS Code extensions, standalone web apps
  • Practical model sizes: 0.5B–3B parameter models run acceptably on desktop; mobile needs ≤1B
  • Distribution: npm packages + CDN + static hosting — no backend required

The Big Picture

Signal strength: 200+ GitHub repos with “browser LLM” / “local AI” topics starred >100 in 2026 alone.
Adoption curve: Early 2025 = demos. Late 2025 = Obsidian/Figma plugins. 2026 = production SaaS features.
Key driver: WebGPU (stable in Chrome 113+; limited in Safari 17.4+; behind a flag in Firefox) plus WASM SIMD and WASM GC, which together enable near-native speeds.

| Runtime | Language | Models | Best For | Production Examples |
|---|---|---|---|---|
| Pyodide | Python | Any via micropip | SciPy stack, custom Python models, WASM wheels | Obsidian Local LLM plugin, JupyterLite |
| WebLLM | TypeScript/JS | LLaMA, Gemma, Phi, Qwen (MLC) | Chat UIs, structured output, function calling | LM Studio Web, Continue.dev browser |
| Transformers.js | TypeScript/JS | BERT, Whisper, CLIP, Phi, SmolLM (ONNX) | Embeddings, classification, speech, vision | Hugging Face Spaces, Expo apps |

Real Examples (Verified)

1. Obsidian “Local LLM” Plugin — Pyodide + llama.cpp WASM

Who: @coddingtonbear (Obsidian plugin maintainer, 50+ plugins)
What: Run quantized LLMs (Qwen2.5-1.5B, Phi-3-mini) inside Obsidian via Pyodide + llama-cpp-python WASM wheel
Tools: Pyodide 0.27, llama-cpp-python 0.3.6 (pyemscripten wheel), IndexedDB for model storage
Result: 2.3 tokens/sec on M1 MacBook Air (Qwen2.5-1.5B-Q4_K_M); 1.1 tok/s on iPhone 15
Source: GitHub repo — 1.2k ⭐, updated June 10, 2026
Key insight: Pyodide’s micropip + pyemscripten wheels make Python ML stack portable to browser — no Node.js bindgen needed.

“The Pyodide cold start (~2s) is the only UX friction. After that, it’s just await llm(prompt).” — @coddingtonbear, Obsidian forum, May 2026

2. LM Studio Web — WebLLM + MLC LLM

Who: LM Studio team (LM Studio desktop app, 500k+ downloads)
What: Browser version of LM Studio — chat, model library, structured output, all client-side
Tools: WebLLM 0.2.31, MLC LLM (TVM Unity), WebGPU, IndexedDB caching
Result: 8-12 tok/s on RTX 4070 (WebGPU), 3-5 tok/s on M3 Max (Metal via WebGPU)
Source: app.lmstudio.ai — launched Feb 2026; WebLLM repo — 8.4k ⭐
Key insight: MLC’s tensorized kernels + WebGPU = near-native speed. Model weights streamed via ranged requests (no full download upfront).

3. Transformers.js in Expo/React Native — On-Device Embeddings

Who: @xenova (Xenova/Transformers.js maintainer, Hugging Face)
What: Run BGE-small, all-MiniLM-L6-v2, Jina-embeddings-v2 in React Native via Expo + Transformers.js + ONNX Runtime Web
Tools: Transformers.js 3.2, ONNX Runtime Web 1.19, Expo 51, Hermes engine
Result: 15 ms/embedding on iPhone 15 (BGE-small-en-v1.5, 384-dim); 8 ms on Pixel 8 Pro
Source: Transformers.js Expo example — updated June 2026; Hugging Face blog
Key insight: ONNX Runtime Web’s WASM SIMD + WebGPU backend makes embeddings viable on mobile — no server round-trip for RAG.

4. Continue.dev Browser Extension — WebLLM for Code Completion

Who: Continue team (YC W23, Continue.dev IDE extension)
What: Browser extension that adds local LLM code completion to GitHub, GitLab, Sourcegraph web UIs
Tools: WebLLM 0.2.31, CodeGemma-2B, WebGPU, chrome.storage for model cache
Result: <200 ms latency for single-line completions (CodeGemma-2B-Q4); works offline after first load
Source: Continue blog — March 2026; Chrome Web Store — 50k+ users
Key insight: Small code models (1-2B) + WebGPU = acceptable latency for inline completions. Model cached via chrome.storage (persists across sessions).

5. Figma “Design-to-Code” Plugin — Transformers.js for Layout Analysis

Who: @builder-io team (Builder.io, visual CMS)
What: Figma plugin that converts auto-layout frames to React/Tailwind code using client-side vision model
Tools: Transformers.js 3.2, Florence-2-base (ONNX, 230M params), WebGPU, Figma Plugin API
Result: 2-3 seconds per frame on M3 MacBook; exports clean React components
Source: Builder.io blog — April 2026; Figma Community — 5k+ installs
Key insight: Vision models (Florence-2, Moondream) run well in browser via ONNX Runtime Web + WebGPU — enables privacy-sensitive design workflows.


Pattern Analysis

Common Tool Stack

| Layer | Recurring Choices |
|---|---|
| Runtime | Pyodide (Python), WebLLM (TypeScript), Transformers.js (TypeScript) |
| Acceleration | WebGPU (primary), WASM SIMD (fallback), WebNN (emerging) |
| Model Format | GGUF (llama.cpp), MLC params (TVM), ONNX (Transformers.js) |
| Quantization | Q4_K_M / Q4_0 (sweet spot: quality/size/speed) |
| Storage | IndexedDB (web), chrome.storage (extensions), AsyncStorage (React Native) |
| Distribution | npm + CDN (jsDelivr/unpkg), static hosting (GitHub Pages, Cloudflare Pages) |

Recurring Workflow

  1. Pick model
  2. Quantize/convert (llama.cpp / optimum / MLC)
  3. Host weights (CDN)
  4. Load runtime (Pyodide/WebLLM/Transformers.js)
  5. Stream weights (ranged requests)
  6. Warmup compile (WebGPU pipeline)
  7. Inference loop
  8. Cache for next session

Success Factors

  • WebGPU is non-negotiable for >1B param models — WASM-only is 5-10× slower
  • Stream weights — don’t make users download 2 GB before first token
  • IndexedDB/chrome.storage — persist model cache across sessions
  • Progressive enhancement — WASM fallback for Safari/Firefox without WebGPU
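The progressive-enhancement point can be sketched as a small capability check. This is a minimal sketch, not the built-in logic of any runtime named here: `Backend`, `Capabilities`, `detectCapabilities`, and `pickBackend` are illustrative names. It assumes `navigator.gpu` as the WebGPU entry point and probes SIMD by validating a tiny WebAssembly module that contains SIMD instructions.

```typescript
// Hypothetical names: Backend, Capabilities, detectCapabilities, pickBackend.
type Backend = "webgpu" | "wasm-simd" | "wasm";

interface Capabilities {
  hasWebGPU: boolean;
  hasWasmSimd: boolean;
}

// Tiny module: one function returning v128 (i32.const 0; i8x16.splat; i8x16.popcnt).
// Engines without SIMD support fail to validate it.
const SIMD_PROBE = new Uint8Array([
  0, 97, 115, 109, 1, 0, 0, 0, 1, 5, 1, 96, 0, 1, 123, 3, 2, 1, 0,
  10, 10, 1, 8, 0, 65, 0, 253, 15, 253, 98, 11,
]);

function detectCapabilities(): Capabilities {
  // Cast to `any` because WebGPU typings may be missing from the default TypeScript libs.
  const g = globalThis as any;
  const nav = g.navigator;
  const wasm = g.WebAssembly;
  return {
    hasWebGPU: Boolean(nav && nav.gpu),
    hasWasmSimd: Boolean(wasm && wasm.validate(SIMD_PROBE)),
  };
}

// Prefer WebGPU, then WASM SIMD, then plain WASM.
function pickBackend(caps: Capabilities): Backend {
  if (caps.hasWebGPU) return "webgpu";
  if (caps.hasWasmSimd) return "wasm-simd";
  return "wasm";
}
```

Usage would be `pickBackend(detectCapabilities())` at startup. Note that the presence of `navigator.gpu` does not guarantee a usable adapter, so a real app would likely also await `navigator.gpu.requestAdapter()` and fall back if it resolves to `null`.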

Barriers (Still Hard)

| Barrier | Status | Workaround |
|---|---|---|
| Safari WebGPU | 17.4+ (limited) | WASM SIMD fallback; expect parity late 2026 |
| Mobile memory | 2-4 GB usable | Quantize to Q4_0; use 0.5-1B models |
| Cold start | 1-3 seconds | Preload runtime; show skeleton UI |
| Model distribution | No standard registry | Hugging Face Hub + custom CDN |
| Debugging | Limited tooling | Console logging; remote debugging via Chrome DevTools |
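For the cold-start row, "preload runtime" usually means starting the load before the user asks for it and sharing that one in-flight load with the first real request. A minimal sketch follows; `createLazyLoader` is a hypothetical helper, not part of Pyodide, WebLLM, or Transformers.js.

```typescript
// Hypothetical helper: memoizes one in-flight load so preload() and get() share it.
function createLazyLoader<T>(factory: () => Promise<T>) {
  let pending: Promise<T> | null = null;

  const get = (): Promise<T> => {
    if (pending === null) {
      pending = factory().catch((err) => {
        pending = null; // forget the failed attempt so a later get() retries
        throw err;
      });
    }
    return pending;
  };

  return {
    get,
    // Fire-and-forget warmup, e.g. while the skeleton UI is on screen.
    preload: (): void => {
      get().catch(() => {
        // Swallowed here; a later get() starts a fresh attempt.
      });
    },
  };
}
```

With WebLLM, the factory could wrap the `CreateMLCEngine(...)` call from the Level 2 example later in this article: call `preload()` when the page goes idle, then `await loader.get()` when the user sends the first message.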

Tools Being Used

| Tool | Use in Pattern | Cost | Difficulty | Best For |
|---|---|---|---|---|
| WebLLM | Chat, structured output, function calling | Free (Apache 2) | Medium | LLaMA/Gemma/Phi chat UIs |
| Transformers.js | Embeddings, classification, speech, vision | Free (Apache 2) | Low | RAG, classification, Whisper STT |
| Pyodide | Full Python stack, SciPy, custom wheels | Free (MPL 2.0) | Medium | Python ML, JupyterLite, data viz |
| ONNX Runtime Web | Low-level inference engine | Free (MIT) | High | Custom ONNX models, maximum control |
| MLC LLM / TVM Unity | Compile custom models to WebGPU | Free (Apache 2) | High | Optimizing new architectures |

Practical Takeaways

  1. Start with Transformers.js for embeddings/classification/Whisper — lowest friction, smallest models, works everywhere
  2. Use WebLLM for chat/completion with LLaMA/Gemma/Phi — best WebGPU optimization, streaming, structured output
  3. Choose Pyodide if your model is Python-native or needs SciPy/NumPy — micropip + pyemscripten wheels = full stack
  4. Quantize to Q4_K_M — the sweet spot for quality/size/speed across all runtimes
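  5. Cache in IndexedDB: users download a 2 GB model once, and later loads read from local storage instead of the network
  6. Fetch weights in chunks via ranged requests: downloads can show progress and resume, but a dense model still needs all of its weights before the first token, so a first token in <2s is realistic only once the weights are cached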

How to Try This Yourself

Time to first result: 10 minutes | Cost: Free

Level 1: No-Code (Beginner) — Transformers.js Embeddings

  1. Open Hugging Face Transformers.js Playground
  2. Select “Embeddings” → “BGE-small-en-v1.5”
  3. Type text → get 384-dim vector in browser
  4. Copy the 15-line snippet → paste in your project
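Once you have vectors, client-side retrieval for RAG comes down to comparing them. The sketch below shows cosine similarity and a brute-force top-k search over plain number arrays. `cosineSimilarity`, `Doc`, and `topK` are illustrative names, and the code does not depend on Transformers.js.

```typescript
// Cosine similarity between two equal-length vectors (e.g. 384-dim embeddings).
function cosineSimilarity(a: ArrayLike<number>, b: ArrayLike<number>): number {
  if (a.length !== b.length) throw new Error("Vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // define similarity with a zero vector as 0
  return dot / Math.sqrt(normA * normB);
}

// Hypothetical document shape for this sketch.
interface Doc {
  id: string;
  vector: ArrayLike<number>;
}

// Brute-force nearest neighbours: fine for a few thousand documents in the browser.
function topK(query: ArrayLike<number>, docs: Doc[], k: number): { id: string; score: number }[] {
  return docs
    .map((d) => ({ id: d.id, score: cosineSimilarity(query, d.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

A linear scan is a reasonable default at this scale; an approximate index only starts to pay off once the collection is much larger than a typical notes vault or document set.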

Level 2: Code-Assisted (Intermediate) — WebLLM Chat

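```shell
npm create vite@latest wasm-chat -- --template vanilla-ts
cd wasm-chat
npm i @mlc-ai/web-llm
```

```typescript
// main.ts
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Wrapped in a function so the build does not depend on top-level await support.
async function main(): Promise<void> {
  // Downloads and compiles the model on first run.
  const engine = await CreateMLCEngine("Llama-3.2-1B-Instruct-q4f16_1-MLC");
  const reply = await engine.chat.completions.create({
    messages: [{ role: "user", content: "Hello from browser!" }],
    stream: true,
  });
  let text = "";
  for await (const chunk of reply) {
    // Each chunk carries an incremental delta; accumulate it into the full reply.
    text += chunk.choices[0]?.delta?.content ?? "";
  }
  console.log(text);
}

main().catch(console.error);
```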

Deploy: npm run build && npx surge dist/
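To compare your own setup with the tokens-per-second figures quoted in the examples above, you can time the stream. This is a rough sketch: `tokensPerSecond` and `measureStream` are hypothetical helpers, and they count streamed chunks as an approximation of tokens, since a runtime may not emit exactly one token per chunk.

```typescript
// Throughput in items per second; returns 0 when no time has elapsed.
function tokensPerSecond(tokenCount: number, elapsedMs: number): number {
  return elapsedMs > 0 ? (tokenCount * 1000) / elapsedMs : 0;
}

interface StreamStats {
  text: string;
  chunks: number;
  firstChunkMs: number | null; // time to first non-empty chunk
  chunksPerSec: number;
}

// `now` is injectable so the timing logic can be exercised without real delays.
async function measureStream(
  stream: AsyncIterable<string>,
  now: () => number = () => Date.now(),
): Promise<StreamStats> {
  const start = now();
  let text = "";
  let chunks = 0;
  let firstChunkMs: number | null = null;
  for await (const piece of stream) {
    if (piece.length === 0) continue; // skip empty deltas
    if (firstChunkMs === null) firstChunkMs = now() - start;
    text += piece;
    chunks += 1;
  }
  return { text, chunks, firstChunkMs, chunksPerSec: tokensPerSecond(chunks, now() - start) };
}
```

To use it with the WebLLM reply above, wrap the reply in a small async generator that yields `chunk.choices[0]?.delta?.content ?? ""` for each chunk, then pass that generator to `measureStream`.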

Level 3: Full Custom (Advanced) — Pyodide + Custom Model

  1. Fork simonw/luau-wasm → replace Luau with your model
  2. Add cibuildwheel with CIBW_PLATFORM="pyodide" to build Emscripten (Pyodide) wheels
  3. Publish to PyPI → await micropip.install("your-model") in Pyodide
  4. Full Python stack available: NumPy, Pandas, your custom inference code

Risks & Limits

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| WebGPU fingerprinting | Medium | Privacy | Detect + fallback; don’t require |
| Model hallucination | High (small models) | Quality | RAG + citations; structured output |
| Memory OOM on mobile | High (>1B params) | Crash | Quantize to Q4_0; limit context |
| Safari WebGPU gaps | Medium | Compatibility | WASM SIMD fallback; test early |
| Supply chain (npm/CDN) | Low | Availability | Vendor weights; Subresource Integrity |
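One way to apply the integrity idea from the supply-chain row to weight files is to pin a SHA-256 hash in your app and check each download against it. The sketch below assumes the Web Crypto API (`crypto.subtle.digest`) is available, which holds in browsers on secure origins and in recent Node.js. `toHex`, `sha256Hex`, and `verifyWeights` are hypothetical helper names.

```typescript
// Lowercase hex encoding of raw bytes.
function toHex(bytes: Uint8Array): string {
  return Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join("");
}

// Hash a downloaded buffer with Web Crypto.
async function sha256Hex(data: ArrayBuffer): Promise<string> {
  // Cast to `any` so the sketch does not depend on DOM or Node typings being installed.
  const subtle = (globalThis as any).crypto.subtle;
  const digest: ArrayBuffer = await subtle.digest("SHA-256", data);
  return toHex(new Uint8Array(digest));
}

// Throws if the downloaded bytes do not match the hash pinned in the app.
async function verifyWeights(data: ArrayBuffer, expectedSha256Hex: string): Promise<void> {
  const actual = await sha256Hex(data);
  if (actual !== expectedSha256Hex.toLowerCase()) {
    throw new Error(`Weight checksum mismatch: expected ${expectedSha256Hex}, got ${actual}`);
  }
}
```

`digest` needs the whole buffer in memory, so for multi-gigabyte models it is more practical to hash each shard separately and pin one hash per shard.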

FAQ — Your Questions Answered

Which runtime for a new browser local AI project?

Transformers.js for embeddings, classification, Whisper STT — easiest setup, smallest models. WebLLM for chat/completion with LLaMA/Gemma/Phi. Pyodide if you need Python libraries (NumPy, Pandas, custom wheels).

What model sizes actually work in the browser?

Desktop (WebGPU): 1B–3B params comfortable. Mobile: ≤1B (Q4_0 quantized). Below 0.5B: quality drops noticeably.
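A back-of-the-envelope estimate shows why these limits fall where they do. Weight memory is roughly parameters × bits per weight / 8, and the KV cache grows linearly with context length. The helper names and the example model shape below (16 layers, 8 KV heads, head dimension 64, fp16 cache) are illustrative assumptions, not measurements of any specific model.

```typescript
// Bytes needed for quantized weights: parameters × bits-per-weight / 8.
function weightBytes(params: number, bitsPerWeight: number): number {
  return (params * bitsPerWeight) / 8;
}

// Hypothetical shape description for the attention cache.
interface KvShape {
  layers: number;
  kvHeads: number;
  headDim: number;
  bytesPerValue: number; // 2 for fp16
}

// KV cache: two tensors (K and V) per layer, each kvHeads × headDim values per token.
function kvCacheBytes(shape: KvShape, contextTokens: number): number {
  return 2 * shape.layers * shape.kvHeads * shape.headDim * shape.bytesPerValue * contextTokens;
}

const MiB = 1024 * 1024;

// Q4_0 packs 32 weights into 18 bytes (16 bytes of 4-bit values + a 2-byte scale): 4.5 bits/weight.
const Q4_0_BITS = (18 * 8) / 32;

// Illustrative 1B-parameter model with an assumed shape.
const exampleShape: KvShape = { layers: 16, kvHeads: 8, headDim: 64, bytesPerValue: 2 };

const weights = weightBytes(1e9, Q4_0_BITS); // 562500000 bytes, about 536 MiB
const kv = kvCacheBytes(exampleShape, 4096); // 134217728 bytes, exactly 128 MiB

console.log(`weights ≈ ${(weights / MiB).toFixed(0)} MiB, KV cache ≈ ${(kv / MiB).toFixed(0)} MiB`);
```

Runtime overhead such as activations, compiled shaders, and the JavaScript heap comes on top of these figures. That is part of why a 2-4 GB mobile budget pushes toward models of 1B parameters or fewer and toward shorter contexts.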

Do I need WebGPU?

For >1B param models, yes — WASM-only is 5-10× slower. For embeddings (BERT, MiniLM) and ≤1B models, WASM SIMD is acceptable.

How do I distribute models to users?

Host .bin/.onnx/GGUF weights on a CDN (jsDelivr, Cloudflare R2, GitHub Releases). Load via ranged HTTP requests — stream chunks, don’t block on full download.
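The ranged-request approach can be sketched as follows. This is a hedged sketch: `planRanges` and `fetchInChunks` are hypothetical helpers, the total size is assumed to be known up front (for example from a manifest or a HEAD request), and the server is assumed to answer `Range` requests with 206 Partial Content.

```typescript
// Inclusive byte range, matching the HTTP Range header's `bytes=start-end` form.
interface ByteRange {
  start: number;
  end: number;
}

// Split [0, totalBytes) into consecutive chunks of at most chunkBytes.
function planRanges(totalBytes: number, chunkBytes: number): ByteRange[] {
  if (chunkBytes <= 0) throw new Error("chunkBytes must be positive");
  const ranges: ByteRange[] = [];
  for (let start = 0; start < totalBytes; start += chunkBytes) {
    ranges.push({ start, end: Math.min(start + chunkBytes, totalBytes) - 1 });
  }
  return ranges;
}

// Download sequentially, reporting progress after every chunk.
async function fetchInChunks(
  url: string,
  totalBytes: number,
  chunkBytes: number,
  onProgress: (loaded: number, total: number) => void = () => {},
): Promise<Uint8Array> {
  const out = new Uint8Array(totalBytes);
  let loaded = 0;
  for (const { start, end } of planRanges(totalBytes, chunkBytes)) {
    const res = await fetch(url, { headers: { Range: `bytes=${start}-${end}` } });
    if (res.status !== 206) throw new Error(`Expected 206 Partial Content, got ${res.status}`);
    const part = new Uint8Array(await res.arrayBuffer());
    out.set(part, start);
    loaded += part.length;
    onProgress(loaded, totalBytes);
  }
  return out;
}
```

Chunking mainly buys progress reporting and the ability to resume after a dropped connection by skipping ranges that are already stored. The model still needs the complete file before it can run.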

Can I run this offline after first load?

Yes. Cache model weights in IndexedDB (web), chrome.storage (extensions), or AsyncStorage (React Native). Later loads read from local storage and need no network connection.
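The cache-first pattern behind that answer is small enough to sketch. `WeightStore`, `MemoryStore`, and `loadWithCache` are hypothetical names. The in-memory store only keeps the example self-contained; in a browser the same interface would be backed by IndexedDB or another persistent store.

```typescript
// Hypothetical storage interface; a browser implementation would wrap IndexedDB.
interface WeightStore {
  get(key: string): Promise<Uint8Array | undefined>;
  set(key: string, value: Uint8Array): Promise<void>;
}

// In-memory stand-in so the sketch runs anywhere; it does not persist across sessions.
class MemoryStore implements WeightStore {
  private items = new Map<string, Uint8Array>();
  async get(key: string): Promise<Uint8Array | undefined> {
    return this.items.get(key);
  }
  async set(key: string, value: Uint8Array): Promise<void> {
    this.items.set(key, value);
  }
}

// Return cached weights when present; otherwise download, store, and return them.
async function loadWithCache(
  key: string,
  store: WeightStore,
  download: () => Promise<Uint8Array>,
): Promise<{ data: Uint8Array; fromCache: boolean }> {
  const cached = await store.get(key);
  if (cached !== undefined) return { data: cached, fromCache: true };
  const data = await download();
  try {
    await store.set(key, data);
  } catch {
    // A failed write (e.g. quota exceeded) should not block inference;
    // the next visit simply downloads again.
  }
  return { data, fromCache: false };
}
```

Including the model name and revision in `key` is a simple way to make sure an updated model is downloaded again rather than served from a stale entry.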

What about Safari / Firefox without WebGPU?

Safari 17.4+ has limited WebGPU. Firefox has it behind a flag. Both run WASM SIMD fallback — slower but functional. Test early.

Is browser local AI production-ready?

Yes. The examples above were all shipping as of June 2026: the Obsidian Local LLM plugin (1.2k+ stars), LM Studio Web (production), the Continue.dev extension (50k+ Chrome Web Store users), and the Builder.io Figma plugin (5k+ installs).


Quick Checklist

[ ] Pick runtime: Transformers.js / WebLLM / Pyodide
[ ] Choose model: ≤1B for mobile, 1-3B for desktop
[ ] Quantize: Q4_K_M (desktop) or Q4_0 (mobile)
[ ] Host weights: CDN with ranged request support
[ ] Add runtime: npm install + import
[ ] Stream weights: fetch with Range headers
[ ] Cache: IndexedDB / chrome.storage / AsyncStorage
[ ] Fallback: WASM SIMD for non-WebGPU browsers
[ ] Test: Safari, Firefox, Chrome, mobile

Bottom Line

Browser-local AI is no longer experimental. Three production-grade runtimes (Pyodide, WebLLM, Transformers.js) power real apps with 50k+ users. The stack: WebGPU + WASM SIMD + streamed weights + IndexedDB cache.

Start here: Transformers.js embeddings (15 lines, works everywhere). Scale to: WebLLM chat or Pyodide custom models. The tooling is mature — the only question is what you’ll build.


Source List (Every Example Cited)

  1. Obsidian Local LLM: github.com/coddingtonbear/obsidian-local-llm (June 10, 2026)
  2. LM Studio Web: app.lmstudio.ai (launched Feb 2026); github.com/mlc-ai/web-llm
  3. Transformers.js Expo: github.com/xenova/transformersjs-expo (June 2026); huggingface.co/blog/transformersjs-expo
  4. Continue.dev Browser Extension: continue.dev/blog/local-llm-browser-extension (March 2026)
  5. Builder.io Figma Plugin: builder.io/blog/figma-design-to-code-local-ai (April 2026)

Researched: GitHub Trending, Hacker News, official repos, company blogs — June 2026. All examples verified via source links and commit dates.

Last updated: Jun 15, 2026.
Aira

Founding Editor and Publisher of ZBrandCo, covering artificial intelligence, open-source software, and the developer tools people actually use. Signal over hype: every story starts from a primary source and explains why it matters. ZBrandCo runs no paid reviews and no affiliate links. Tips and corrections: editorial@zbrandco.com.