AI

Indie Devs Index Years of Video Locally — Zero API Costs

Indie Devs Index Years of Video Locally — Zero API Costs

AI · zbrandco

TL;DR — Two independent developers indexed 669 GB and a year’s worth of multi-camera footage using local vision-language models on Apple Silicon. One ran Gemma 4 31B on a 2021 M1 Max (peak 50 GB swap). The other used Qwen2.5-VL-7B with WhisperX for semantic search, face recognition, and DaVinci Resolve integration. Both stacked open-source tools — zero API bills.


The Pattern: Local-First Video Indexing Is Having a Moment

In the last two weeks, two detailed case studies landed on Hacker News from builders who solved the same problem: raw footage archives grow faster than you can edit them.

Builder Footage Hardware Model Compute Time
Ilias Hadad (@iliashad) 669 GB, 2,207 GoPro clips, 15h runtime M1 Max, 64 GB RAM Qwen2.5-VL-7B-Instruct + WhisperX 67h 40m
simbastack (Framedex author) 1 year, iPhone + DJI + drone + Nikon Z8 + Ray-Ban Meta M1 Max, 64 GB RAM (2021 model) Gemma 4 31B Q4 (LM Studio) Weekend bulk run

Neither used cloud APIs. Both ran on 5-year-old MacBooks. Both open-sourced their pipelines.

Why this matters: Cloud vision APIs (Gemini, GPT-4V, Claude) cost $0.002–0.01 per frame. At 5 frames per clip × 2,000 clips = $20–200 per indexing run. Local models shift that to electricity and time — a one-time hardware cost.


Example 1: Ilias Hadad — 669 GB GoPro Archive on M1 Max

Who: Ilias Hadad, software engineer, cyclist
What: Indexed 2,207 GoPro videos from cycling trips to find “interesting moments” and send clips straight to DaVinci Resolve
Tools: Qwen2.5-VL-7B-Instruct, WhisperX, insightface, custom pipeline
Source: HN Discussion · Blog Post · GitHub — June 14, 2026

The Pipeline (7 Stages)

1. Grab frames → 2. Downscale to 720p → 3. Classify/Analyze →
4. Combine metadata → 5. Transcribe audio →
6. Convert (text, visual, audio) to embeddings →
7. Save to Vector DB + SQL DB

Models & Roles

Model Task Notes
Qwen2.5-VL-7B-Instruct Scene understanding (5 frames @ 720p) Advanced mode: slower but catches actions like “falling down”
WhisperX Audio transcription + word alignment + speaker diarization 97 languages; hallucinates on non-speech — trim silence first
insightface + ArcFace Face detection + 512-dim embeddings Centralized SQLite face DB for cross-clip recognition
Custom Object detection, on-screen text, shot type, color palette Single vision call captures everything

Search Capabilities Enabled

  • Semantic search (RAG over embeddings)
  • Screenshot → find exact moment in video
  • Audio file → find matching clip
  • Face recognition with custom reference data
  • Object detection + on-screen text extraction

NLE Integration

Direct export to DaVinci Resolve timeline. Also works with Premiere Pro, Final Cut Pro.

Hardware Reality Check

Metric Value
Hardware M1 Max (ARM SoC), 64 GB unified RAM
Frames analyzed 57,537
Total compute time 67h 40m
M1 Max vs RTX 3060 12GB “Much faster than M1 Max” per OP
Windows ARM (Snapdragon X Elite) Lower bandwidth (228 vs 400 GB/s), untested

OP’s take on unified memory: “Apple’s swap is designed for it… a weekend of pushing the machine hard is well within tolerance.”


Example 2: simbastack — Framedex, Gemma 4 31B on a 2021 MacBook

Who: simbastack, splits time between Maasai Mara (Kenya) running an eco-lodge and Silicon Valley building software
What: Built Framedex — a local-first video indexer that writes .description.md sidecars per clip
Tools: Gemma 4 31B Q4 (LM Studio), WhisperX, ffmpeg, exiftool, Nominatim, insightface, Claude CLI fallback
Source: Blog Post · HN Discussion · GitHub — June 2026

Four Constraints That Shaped the Stack

  1. Local-first — No cloud upload of thousands of multi-GB clips (cost + privacy)
  2. Sidecars over databases.description.md per clip: plain text, grep-able, survives tool breakage, travels with files
  3. One vision call captures everything — Exhaustive schema on day one (rating, technical quality, lighting, time of day, color palette, audio quality, people count, keywords, faces, location, transcript, prose description)
  4. Three vision backends — Claude CLI (default, zero marginal cost via Max plan), Anthropic API (speed), LM Studio local (bulk pass)

Per-Clip Pipeline (Python)

# 1. ffprobe for metadata
# 2. exiftool for GPS lat/lon/altitude (works on iPhone, DJI, drone)
# 3. Reverse-geocode via Nominatim (free, rate-limited, no API key)
# 4. ffmpeg extracts 5 evenly-spaced frames at 1920px
# 5. WhisperX transcribes with word-level alignment + pyannote speaker diarization (97 languages)
# 6. insightface detects faces → 512-dim ArcFace embeddings in centralized SQLite face DB
# 7. Vision model reads frames + transcript snippet + folder context → returns YAML + prose
# 8. Sidecar written to disk

Output: IMG_1103.MOV.description.md

---
rating: 8
technical_quality: good
lighting: bright_daylight
time_of_day: midday
color_palette: [warm_green, sand, canvas_white]
audio_quality: clean
people_count: 1
keywords: [safari_tent, deck, savanna, wildlife_lodge, interior_exterior_transition]
faces: [embedding_ref_1]
location: "Mara Hilltop, Maasai Mara, Kenya"
gps: {lat: -1.406, lon: 35.123, alt: 1650}
transcript: "..."
---

## Description
Wide shot panning from interior of luxury safari tent onto deck overlooking savanna at midday.
Camera moves smoothly from canvas interior through doorway onto wooden deck with Ellie standing
at railing, vast grassland stretching to horizon. Natural light floods tent interior.
Shot type: establishing / transition. Suggested use cases: marketing reels, travel-vlog B-roll.

The “Absurdity”: Gemma 4 31B on 5-Year-Old Hardware

Spec Value
Hardware 16-inch MacBook Pro M1 Max, 64 GB RAM (bought 2021)
Model Gemma 4 31B Q4 (28.40 GB in memory)
Runtime LM Studio REST API at 127.0.0.1:1234
Peak Swap 50.89 GB (Activity Monitor)
Memory Pressure Yellow band
Duration Weekend bulk run

Author: “M1 Max 16-inch is ‘legendary’ — 5 years on, runs 31B models at usable speed. Expects 3-5 more years as local LLMs get more efficient.”


What Broke (And the Fixes)

From Ilias’s Pipeline

Issue Fix
Whisper hallucinations on wind/silence Trim non-speech portions before transcription; consider Parakeet ASR
VLM frame sampling misses actions Use Qwen2.5-VL with 5 frames @ 720p for temporal understanding

From Framedex Pipeline

Issue Fix
WhisperX 3.8 breaking API changes Signature introspection: try token= first, fall back to use_auth_token= on TypeError
Claude CLI silent permission failures (exit code 0, permission text as “success”) Use --permission-mode bypassPermissions in non-interactive mode
Nominatim rate limits (free, no API key) Cache geocoded results locally; batch requests

The Tool Stack (What You Need to Replicate)

Layer Tool Cost Notes
Frame extraction ffmpeg Free Hardware-accelerated on Apple Silicon
Metadata exiftool + ffprobe Free GPS, camera settings, timestamps
Geocoding Nominatim (OpenStreetMap) Free Rate-limited; cache locally
Transcription WhisperX + pyannote Free Word-level alignment, speaker diarization
Face recognition insightface + ArcFace Free Centralized SQLite embedding DB
Vision model LM Studio (local) / Claude CLI / Anthropic API Free / $0 / Pay-per-use Swap backends via config
Vector storage FAISS / Chroma / SQLite-vec Free Embeddings + metadata
NLE export DaVinci Resolve / Premiere / FCP Free / Paid XML/FCPXML timeline injection

Hardware Requirements (Minimum Viable)

Component Minimum Recommended Why
RAM 32 GB unified 64 GB+ 31B Q4 needs ~28 GB; 64 GB avoids constant swap
GPU/NPU Apple Neural Engine / 12 GB VRAM M1/M2/M3 Max / RTX 3060 12GB+ Local inference speed
Storage 1 TB NVMe 2 TB+ 669 GB raw + embeddings + sidecars + swap
OS macOS 14+ / Linux macOS for unified memory Apple Silicon swap behavior is uniquely tolerant

Windows note: RTX 3060 12 GB is “much faster than M1 Max” per Ilias. Snapdragon X Elite untested for this workload.


Cost Comparison: Local vs Cloud

Approach 669 GB Indexing Cost Recurring?
Cloud Vision API (Gemini 1.5 Flash @ $0.00015/frame) ~$85 (5 frames × 1,138 clips) Per run
Cloud Vision API (GPT-4V @ $0.01/frame) ~$5,700 Per run
Local (M1 Max, 67h compute) ~$2 electricity + hardware amortized One-time

Bottom line: If you index more than once a quarter, local pays for itself in month one.


Quick Start: Your First Local Video Index (30 Min)

Time to first result: 30 min | Cost: $0 | Hardware: M1/M2/M3 Mac or Linux + 12 GB+ VRAM

Level 1: Just Search (Beginner)

  1. Install LM Studio → download Qwen2.5-VL-7B-Instruct (Q4)
  2. Run ffmpeg -i input.mp4 -vf "fps=1/10,scale=720:-1" frame_%04d.jpg
  3. In LM Studio chat: drag 5 frames → “Describe what happens in this video clip”
  4. Save output → you now have a searchable description

Level 2: Structured Pipeline (Intermediate)

  1. Framedex (simbastack’s project): git clone https://github.com/asena/framedex
  2. Configure config.yaml with your LM Studio endpoint
  3. Run python index.py /path/to/footage
  4. Get .description.md sidecars + _INDEX.json rollup

Level 3: Custom NLE Integration (Advanced)

  1. Fork edit-mind (Ilias’s project): git clone https://github.com/IliasHad/edit-mind
  2. Add your DaVinci Resolve project path
  3. Extend schema for your domain (wedding? wildlife? sports?)
  4. Build timeline XML export for one-click rough cuts

Risks & Limits

Risk Likelihood Impact Mitigation
Whisper hallucinations on non-speech High False transcripts Trim silence; use Parakeet ASR
Model updates break prompts Medium Pipeline drift Pin model versions; regression test
Swap wear on SSD Low Hardware degradation Monitor swap; 67h weekend = negligible
Unified memory pressure crashes Low Lost progress Conservative batch sizes; monitor Activity Monitor
NLE API changes Medium Export breaks XML/FCPXML is stable; test on update

Bottom Line

Two indie devs proved you can index 600 GB+ of raw video on a 5-year-old MacBook using open-source vision models — zero API bills, full privacy, total control.

The stack is real: ffmpeg + WhisperX + insightface + Qwen2.5-VL/Gemma 4 + LM Studio. The hardware is already in your bag if you bought a Max-tier MacBook Pro 2021–2023.

Your move: Pick one clip. Run it through LM Studio with Qwen2.5-VL. See the description. Then decide if your archive is worth a weekend of compute.


FAQ

Q: Can I run this on Windows with an NVIDIA GPU?
A: Yes. RTX 3060 12 GB is “much faster than M1 Max” per Ilias. Use LM Studio or Ollama with CUDA. Snapdragon X Elite untested for this workload.

Q: How much electricity does a 67-hour indexing run cost?
A: ~$2 on US residential rates. M1 Max draws 30–60 W under load; 67h × 50 W = 3.35 kWh.

Q: Does WhisperX work for non-English audio?
A: Yes, 97 languages supported. Hallucination risk on non-speech (wind, engine noise) is the main issue — trim silence first or use Parakeet ASR.

Q: Can I use this for wedding/event videography workflows?
A: Absolutely. Extend the schema with your domain fields (ceremony, reception, speeches, first dance). Export DaVinci Resolve timeline XML for rough cuts.

Q: What happens when model updates break my prompts?
A: Pin model versions in LM Studio (Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf). Regression-test sidecar output monthly.


Explore More on zbrandco

Related AI use-case coverage:
– [INTERNAL: researchers-local-llm-media-indexing]
– [INTERNAL: open-source-ai-coding-agents-2026]
– [INTERNAL: how-to-run-llama-3-3-local-mac]
– [INTERNAL: gemma-4-vision-language-model-test]


Sources:
– Ilias Hadad, “I indexed 669 GB of my GoPro videos using my M1 Max computer and local ML models,” Hacker News / iliashaddad.com, June 14, 2026
– simbastack, “While I slept, my 5-year-old MacBook ran Gemma 4 locally and indexed a year of video,” simbastack.com, June 2026
– Simon Willison, “Why AI hasn’t replaced software engineers, and won’t,” simonwillison.net, June 14, 2026 (context on local AI amplification)
– Framedex GitHub: github.com/asena/framedex
– edit-mind GitHub: github.com/IliasHad/edit-mind

We may earn commission from affiliate links at no extra cost to you. Last updated: Jun 15, 2026.
Aira

Founding Editor and Publisher of ZBrandCo, covering artificial intelligence, open-source software, and the developer tools people actually use. Signal over hype: every story starts from a primary source and explains why it matters. ZBrandCo runs no paid reviews and no affiliate links. Tips and corrections: editorial@zbrandco.com.