TL;DR — Two independent developers indexed 669 GB and a year’s worth of multi-camera footage using local vision-language models on Apple Silicon. One ran Gemma 4 31B on a 2021 M1 Max (peak 50 GB swap). The other used Qwen2.5-VL-7B with WhisperX for semantic search, face recognition, and DaVinci Resolve integration. Both stacked open-source tools — zero API bills.
The Pattern: Local-First Video Indexing Is Having a Moment
In the last two weeks, two detailed case studies landed on Hacker News from builders who solved the same problem: raw footage archives grow faster than you can edit them.
| Builder | Footage | Hardware | Model | Compute Time |
|---|---|---|---|---|
| Ilias Hadad (@iliashad) | 669 GB, 2,207 GoPro clips, 15h runtime | M1 Max, 64 GB RAM | Qwen2.5-VL-7B-Instruct + WhisperX | 67h 40m |
| simbastack (Framedex author) | 1 year, iPhone + DJI + drone + Nikon Z8 + Ray-Ban Meta | M1 Max, 64 GB RAM (2021 model) | Gemma 4 31B Q4 (LM Studio) | Weekend bulk run |
Neither used cloud APIs. Both ran on 5-year-old MacBooks. Both open-sourced their pipelines.
Why this matters: Cloud vision APIs (Gemini, GPT-4V, Claude) cost $0.002–0.01 per frame. At 5 frames per clip × 2,000 clips = $20–200 per indexing run. Local models shift that to electricity and time — a one-time hardware cost.
Example 1: Ilias Hadad — 669 GB GoPro Archive on M1 Max
Who: Ilias Hadad, software engineer, cyclist
What: Indexed 2,207 GoPro videos from cycling trips to find “interesting moments” and send clips straight to DaVinci Resolve
Tools: Qwen2.5-VL-7B-Instruct, WhisperX, insightface, custom pipeline
Source: HN Discussion · Blog Post · GitHub — June 14, 2026
The Pipeline (7 Stages)
1. Grab frames → 2. Downscale to 720p → 3. Classify/Analyze →
4. Combine metadata → 5. Transcribe audio →
6. Convert (text, visual, audio) to embeddings →
7. Save to Vector DB + SQL DB
Models & Roles
| Model | Task | Notes |
|---|---|---|
| Qwen2.5-VL-7B-Instruct | Scene understanding (5 frames @ 720p) | Advanced mode: slower but catches actions like “falling down” |
| WhisperX | Audio transcription + word alignment + speaker diarization | 97 languages; hallucinates on non-speech — trim silence first |
| insightface + ArcFace | Face detection + 512-dim embeddings | Centralized SQLite face DB for cross-clip recognition |
| Custom | Object detection, on-screen text, shot type, color palette | Single vision call captures everything |
Search Capabilities Enabled
- Semantic search (RAG over embeddings)
- Screenshot → find exact moment in video
- Audio file → find matching clip
- Face recognition with custom reference data
- Object detection + on-screen text extraction
NLE Integration
Direct export to DaVinci Resolve timeline. Also works with Premiere Pro, Final Cut Pro.
Hardware Reality Check
| Metric | Value |
|---|---|
| Hardware | M1 Max (ARM SoC), 64 GB unified RAM |
| Frames analyzed | 57,537 |
| Total compute time | 67h 40m |
| M1 Max vs RTX 3060 12GB | “Much faster than M1 Max” per OP |
| Windows ARM (Snapdragon X Elite) | Lower bandwidth (228 vs 400 GB/s), untested |
OP’s take on unified memory: “Apple’s swap is designed for it… a weekend of pushing the machine hard is well within tolerance.”
Example 2: simbastack — Framedex, Gemma 4 31B on a 2021 MacBook
Who: simbastack, splits time between Maasai Mara (Kenya) running an eco-lodge and Silicon Valley building software
What: Built Framedex — a local-first video indexer that writes .description.md sidecars per clip
Tools: Gemma 4 31B Q4 (LM Studio), WhisperX, ffmpeg, exiftool, Nominatim, insightface, Claude CLI fallback
Source: Blog Post · HN Discussion · GitHub — June 2026
Four Constraints That Shaped the Stack
- Local-first — No cloud upload of thousands of multi-GB clips (cost + privacy)
- Sidecars over databases —
.description.mdper clip: plain text, grep-able, survives tool breakage, travels with files - One vision call captures everything — Exhaustive schema on day one (rating, technical quality, lighting, time of day, color palette, audio quality, people count, keywords, faces, location, transcript, prose description)
- Three vision backends — Claude CLI (default, zero marginal cost via Max plan), Anthropic API (speed), LM Studio local (bulk pass)
Per-Clip Pipeline (Python)
# 1. ffprobe for metadata
# 2. exiftool for GPS lat/lon/altitude (works on iPhone, DJI, drone)
# 3. Reverse-geocode via Nominatim (free, rate-limited, no API key)
# 4. ffmpeg extracts 5 evenly-spaced frames at 1920px
# 5. WhisperX transcribes with word-level alignment + pyannote speaker diarization (97 languages)
# 6. insightface detects faces → 512-dim ArcFace embeddings in centralized SQLite face DB
# 7. Vision model reads frames + transcript snippet + folder context → returns YAML + prose
# 8. Sidecar written to disk
Output: IMG_1103.MOV.description.md
---
rating: 8
technical_quality: good
lighting: bright_daylight
time_of_day: midday
color_palette: [warm_green, sand, canvas_white]
audio_quality: clean
people_count: 1
keywords: [safari_tent, deck, savanna, wildlife_lodge, interior_exterior_transition]
faces: [embedding_ref_1]
location: "Mara Hilltop, Maasai Mara, Kenya"
gps: {lat: -1.406, lon: 35.123, alt: 1650}
transcript: "..."
---
## Description
Wide shot panning from interior of luxury safari tent onto deck overlooking savanna at midday.
Camera moves smoothly from canvas interior through doorway onto wooden deck with Ellie standing
at railing, vast grassland stretching to horizon. Natural light floods tent interior.
Shot type: establishing / transition. Suggested use cases: marketing reels, travel-vlog B-roll.
The “Absurdity”: Gemma 4 31B on 5-Year-Old Hardware
| Spec | Value |
|---|---|
| Hardware | 16-inch MacBook Pro M1 Max, 64 GB RAM (bought 2021) |
| Model | Gemma 4 31B Q4 (28.40 GB in memory) |
| Runtime | LM Studio REST API at 127.0.0.1:1234 |
| Peak Swap | 50.89 GB (Activity Monitor) |
| Memory Pressure | Yellow band |
| Duration | Weekend bulk run |
Author: “M1 Max 16-inch is ‘legendary’ — 5 years on, runs 31B models at usable speed. Expects 3-5 more years as local LLMs get more efficient.”
What Broke (And the Fixes)
From Ilias’s Pipeline
| Issue | Fix |
|---|---|
| Whisper hallucinations on wind/silence | Trim non-speech portions before transcription; consider Parakeet ASR |
| VLM frame sampling misses actions | Use Qwen2.5-VL with 5 frames @ 720p for temporal understanding |
From Framedex Pipeline
| Issue | Fix |
|---|---|
| WhisperX 3.8 breaking API changes | Signature introspection: try token= first, fall back to use_auth_token= on TypeError |
| Claude CLI silent permission failures (exit code 0, permission text as “success”) | Use --permission-mode bypassPermissions in non-interactive mode |
| Nominatim rate limits (free, no API key) | Cache geocoded results locally; batch requests |
The Tool Stack (What You Need to Replicate)
| Layer | Tool | Cost | Notes |
|---|---|---|---|
| Frame extraction | ffmpeg | Free | Hardware-accelerated on Apple Silicon |
| Metadata | exiftool + ffprobe | Free | GPS, camera settings, timestamps |
| Geocoding | Nominatim (OpenStreetMap) | Free | Rate-limited; cache locally |
| Transcription | WhisperX + pyannote | Free | Word-level alignment, speaker diarization |
| Face recognition | insightface + ArcFace | Free | Centralized SQLite embedding DB |
| Vision model | LM Studio (local) / Claude CLI / Anthropic API | Free / $0 / Pay-per-use | Swap backends via config |
| Vector storage | FAISS / Chroma / SQLite-vec | Free | Embeddings + metadata |
| NLE export | DaVinci Resolve / Premiere / FCP | Free / Paid | XML/FCPXML timeline injection |
Hardware Requirements (Minimum Viable)
| Component | Minimum | Recommended | Why |
|---|---|---|---|
| RAM | 32 GB unified | 64 GB+ | 31B Q4 needs ~28 GB; 64 GB avoids constant swap |
| GPU/NPU | Apple Neural Engine / 12 GB VRAM | M1/M2/M3 Max / RTX 3060 12GB+ | Local inference speed |
| Storage | 1 TB NVMe | 2 TB+ | 669 GB raw + embeddings + sidecars + swap |
| OS | macOS 14+ / Linux | macOS for unified memory | Apple Silicon swap behavior is uniquely tolerant |
Windows note: RTX 3060 12 GB is “much faster than M1 Max” per Ilias. Snapdragon X Elite untested for this workload.
Cost Comparison: Local vs Cloud
| Approach | 669 GB Indexing Cost | Recurring? |
|---|---|---|
| Cloud Vision API (Gemini 1.5 Flash @ $0.00015/frame) | ~$85 (5 frames × 1,138 clips) | Per run |
| Cloud Vision API (GPT-4V @ $0.01/frame) | ~$5,700 | Per run |
| Local (M1 Max, 67h compute) | ~$2 electricity + hardware amortized | One-time |
Bottom line: If you index more than once a quarter, local pays for itself in month one.
Quick Start: Your First Local Video Index (30 Min)
Time to first result: 30 min | Cost: $0 | Hardware: M1/M2/M3 Mac or Linux + 12 GB+ VRAM
Level 1: Just Search (Beginner)
- Install LM Studio → download Qwen2.5-VL-7B-Instruct (Q4)
- Run
ffmpeg -i input.mp4 -vf "fps=1/10,scale=720:-1" frame_%04d.jpg - In LM Studio chat: drag 5 frames → “Describe what happens in this video clip”
- Save output → you now have a searchable description
Level 2: Structured Pipeline (Intermediate)
- Framedex (simbastack’s project):
git clone https://github.com/asena/framedex - Configure
config.yamlwith your LM Studio endpoint - Run
python index.py /path/to/footage - Get
.description.mdsidecars +_INDEX.jsonrollup
Level 3: Custom NLE Integration (Advanced)
- Fork edit-mind (Ilias’s project):
git clone https://github.com/IliasHad/edit-mind - Add your DaVinci Resolve project path
- Extend schema for your domain (wedding? wildlife? sports?)
- Build timeline XML export for one-click rough cuts
Risks & Limits
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Whisper hallucinations on non-speech | High | False transcripts | Trim silence; use Parakeet ASR |
| Model updates break prompts | Medium | Pipeline drift | Pin model versions; regression test |
| Swap wear on SSD | Low | Hardware degradation | Monitor swap; 67h weekend = negligible |
| Unified memory pressure crashes | Low | Lost progress | Conservative batch sizes; monitor Activity Monitor |
| NLE API changes | Medium | Export breaks | XML/FCPXML is stable; test on update |
Bottom Line
Two indie devs proved you can index 600 GB+ of raw video on a 5-year-old MacBook using open-source vision models — zero API bills, full privacy, total control.
The stack is real: ffmpeg + WhisperX + insightface + Qwen2.5-VL/Gemma 4 + LM Studio. The hardware is already in your bag if you bought a Max-tier MacBook Pro 2021–2023.
Your move: Pick one clip. Run it through LM Studio with Qwen2.5-VL. See the description. Then decide if your archive is worth a weekend of compute.
FAQ
Q: Can I run this on Windows with an NVIDIA GPU?
A: Yes. RTX 3060 12 GB is “much faster than M1 Max” per Ilias. Use LM Studio or Ollama with CUDA. Snapdragon X Elite untested for this workload.
Q: How much electricity does a 67-hour indexing run cost?
A: ~$2 on US residential rates. M1 Max draws 30–60 W under load; 67h × 50 W = 3.35 kWh.
Q: Does WhisperX work for non-English audio?
A: Yes, 97 languages supported. Hallucination risk on non-speech (wind, engine noise) is the main issue — trim silence first or use Parakeet ASR.
Q: Can I use this for wedding/event videography workflows?
A: Absolutely. Extend the schema with your domain fields (ceremony, reception, speeches, first dance). Export DaVinci Resolve timeline XML for rough cuts.
Q: What happens when model updates break my prompts?
A: Pin model versions in LM Studio (Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf). Regression-test sidecar output monthly.
Explore More on zbrandco
Related AI use-case coverage:
– [INTERNAL: researchers-local-llm-media-indexing]
– [INTERNAL: open-source-ai-coding-agents-2026]
– [INTERNAL: how-to-run-llama-3-3-local-mac]
– [INTERNAL: gemma-4-vision-language-model-test]
Sources:
– Ilias Hadad, “I indexed 669 GB of my GoPro videos using my M1 Max computer and local ML models,” Hacker News / iliashaddad.com, June 14, 2026
– simbastack, “While I slept, my 5-year-old MacBook ran Gemma 4 locally and indexed a year of video,” simbastack.com, June 2026
– Simon Willison, “Why AI hasn’t replaced software engineers, and won’t,” simonwillison.net, June 14, 2026 (context on local AI amplification)
– Framedex GitHub: github.com/asena/framedex
– edit-mind GitHub: github.com/IliasHad/edit-mind
