Indie Devs Index Years of Video Locally — Zero API Costs

Aira Published Jun 15, 2026 · 7 min read

Indie Devs Index Years of Video Locally — Zero API Costs

AI · zbrandco

TL;DR — Two independent developers indexed 669 GB and a year’s worth of multi-camera footage using local vision-language models on Apple Silicon. One ran Gemma 4 31B on a 2021 M1 Max (peak 50 GB swap). The other used Qwen2.5-VL-7B with WhisperX for semantic search, face recognition, and DaVinci Resolve integration. Both stacked open-source tools — zero API bills.

The Pattern: Local-First Video Indexing Is Having a Moment

In the last two weeks, two detailed case studies landed on Hacker News from builders who solved the same problem: raw footage archives grow faster than you can edit them.

Builder	Footage	Hardware	Model	Compute Time
Ilias Hadad (@iliashad)	669 GB, 2,207 GoPro clips, 15h runtime	M1 Max, 64 GB RAM	Qwen2.5-VL-7B-Instruct + WhisperX	67h 40m
simbastack (Framedex author)	1 year, iPhone + DJI + drone + Nikon Z8 + Ray-Ban Meta	M1 Max, 64 GB RAM (2021 model)	Gemma 4 31B Q4 (LM Studio)	Weekend bulk run

Neither used cloud APIs. Both ran on 5-year-old MacBooks. Both open-sourced their pipelines.

Why this matters: Cloud vision APIs (Gemini, GPT-4V, Claude) cost $0.002–0.01 per frame. At 5 frames per clip × 2,000 clips = $20–200 per indexing run. Local models shift that to electricity and time — a one-time hardware cost.

Example 1: Ilias Hadad — 669 GB GoPro Archive on M1 Max

Who: Ilias Hadad, software engineer, cyclist
What: Indexed 2,207 GoPro videos from cycling trips to find “interesting moments” and send clips straight to DaVinci Resolve
Tools: Qwen2.5-VL-7B-Instruct, WhisperX, insightface, custom pipeline
Source: HN Discussion · Blog Post · GitHub — June 14, 2026

The Pipeline (7 Stages)

1. Grab frames → 2. Downscale to 720p → 3. Classify/Analyze →
4. Combine metadata → 5. Transcribe audio →
6. Convert (text, visual, audio) to embeddings →
7. Save to Vector DB + SQL DB

Models & Roles

Model	Task	Notes
Qwen2.5-VL-7B-Instruct	Scene understanding (5 frames @ 720p)	Advanced mode: slower but catches actions like “falling down”
WhisperX	Audio transcription + word alignment + speaker diarization	97 languages; hallucinates on non-speech — trim silence first
insightface + ArcFace	Face detection + 512-dim embeddings	Centralized SQLite face DB for cross-clip recognition
Custom	Object detection, on-screen text, shot type, color palette	Single vision call captures everything

Search Capabilities Enabled

Semantic search (RAG over embeddings)
Screenshot → find exact moment in video
Audio file → find matching clip
Face recognition with custom reference data
Object detection + on-screen text extraction

NLE Integration

Direct export to DaVinci Resolve timeline. Also works with Premiere Pro, Final Cut Pro.

Hardware Reality Check

Metric	Value
Hardware	M1 Max (ARM SoC), 64 GB unified RAM
Frames analyzed	57,537
Total compute time	67h 40m
M1 Max vs RTX 3060 12GB	“Much faster than M1 Max” per OP
Windows ARM (Snapdragon X Elite)	Lower bandwidth (228 vs 400 GB/s), untested

OP’s take on unified memory: “Apple’s swap is designed for it… a weekend of pushing the machine hard is well within tolerance.”

Example 2: simbastack — Framedex, Gemma 4 31B on a 2021 MacBook

Who: simbastack, splits time between Maasai Mara (Kenya) running an eco-lodge and Silicon Valley building software
What: Built Framedex — a local-first video indexer that writes .description.md sidecars per clip
Tools: Gemma 4 31B Q4 (LM Studio), WhisperX, ffmpeg, exiftool, Nominatim, insightface, Claude CLI fallback
Source: Blog Post · HN Discussion · GitHub — June 2026

Four Constraints That Shaped the Stack

Local-first — No cloud upload of thousands of multi-GB clips (cost + privacy)
Sidecars over databases — .description.md per clip: plain text, grep-able, survives tool breakage, travels with files
One vision call captures everything — Exhaustive schema on day one (rating, technical quality, lighting, time of day, color palette, audio quality, people count, keywords, faces, location, transcript, prose description)
Three vision backends — Claude CLI (default, zero marginal cost via Max plan), Anthropic API (speed), LM Studio local (bulk pass)

Per-Clip Pipeline (Python)

# 1. ffprobe for metadata
# 2. exiftool for GPS lat/lon/altitude (works on iPhone, DJI, drone)
# 3. Reverse-geocode via Nominatim (free, rate-limited, no API key)
# 4. ffmpeg extracts 5 evenly-spaced frames at 1920px
# 5. WhisperX transcribes with word-level alignment + pyannote speaker diarization (97 languages)
# 6. insightface detects faces → 512-dim ArcFace embeddings in centralized SQLite face DB
# 7. Vision model reads frames + transcript snippet + folder context → returns YAML + prose
# 8. Sidecar written to disk

Output: `IMG_1103.MOV.description.md`

---
rating: 8
technical_quality: good
lighting: bright_daylight
time_of_day: midday
color_palette: [warm_green, sand, canvas_white]
audio_quality: clean
people_count: 1
keywords: [safari_tent, deck, savanna, wildlife_lodge, interior_exterior_transition]
faces: [embedding_ref_1]
location: "Mara Hilltop, Maasai Mara, Kenya"
gps: {lat: -1.406, lon: 35.123, alt: 1650}
transcript: "..."
---

## Description
Wide shot panning from interior of luxury safari tent onto deck overlooking savanna at midday.
Camera moves smoothly from canvas interior through doorway onto wooden deck with Ellie standing
at railing, vast grassland stretching to horizon. Natural light floods tent interior.
Shot type: establishing / transition. Suggested use cases: marketing reels, travel-vlog B-roll.

The “Absurdity”: Gemma 4 31B on 5-Year-Old Hardware

Spec	Value
Hardware	16-inch MacBook Pro M1 Max, 64 GB RAM (bought 2021)
Model	Gemma 4 31B Q4 (28.40 GB in memory)
Runtime	LM Studio REST API at `127.0.0.1:1234`
Peak Swap	50.89 GB (Activity Monitor)
Memory Pressure	Yellow band
Duration	Weekend bulk run

Author: “M1 Max 16-inch is ‘legendary’ — 5 years on, runs 31B models at usable speed. Expects 3-5 more years as local LLMs get more efficient.”

What Broke (And the Fixes)

From Ilias’s Pipeline

Issue	Fix
Whisper hallucinations on wind/silence	Trim non-speech portions before transcription; consider Parakeet ASR
VLM frame sampling misses actions	Use Qwen2.5-VL with 5 frames @ 720p for temporal understanding

From Framedex Pipeline

Issue	Fix
WhisperX 3.8 breaking API changes	Signature introspection: try `token=` first, fall back to `use_auth_token=` on TypeError
Claude CLI silent permission failures (exit code 0, permission text as “success”)	Use `--permission-mode bypassPermissions` in non-interactive mode
Nominatim rate limits (free, no API key)	Cache geocoded results locally; batch requests

The Tool Stack (What You Need to Replicate)

Layer	Tool	Cost	Notes
Frame extraction	ffmpeg	Free	Hardware-accelerated on Apple Silicon
Metadata	exiftool + ffprobe	Free	GPS, camera settings, timestamps
Geocoding	Nominatim (OpenStreetMap)	Free	Rate-limited; cache locally
Transcription	WhisperX + pyannote	Free	Word-level alignment, speaker diarization
Face recognition	insightface + ArcFace	Free	Centralized SQLite embedding DB
Vision model	LM Studio (local) / Claude CLI / Anthropic API	Free / $0 / Pay-per-use	Swap backends via config
Vector storage	FAISS / Chroma / SQLite-vec	Free	Embeddings + metadata
NLE export	DaVinci Resolve / Premiere / FCP	Free / Paid	XML/FCPXML timeline injection

Hardware Requirements (Minimum Viable)

Component	Minimum	Recommended	Why
RAM	32 GB unified	64 GB+	31B Q4 needs ~28 GB; 64 GB avoids constant swap
GPU/NPU	Apple Neural Engine / 12 GB VRAM	M1/M2/M3 Max / RTX 3060 12GB+	Local inference speed
Storage	1 TB NVMe	2 TB+	669 GB raw + embeddings + sidecars + swap
OS	macOS 14+ / Linux	macOS for unified memory	Apple Silicon swap behavior is uniquely tolerant

Windows note: RTX 3060 12 GB is “much faster than M1 Max” per Ilias. Snapdragon X Elite untested for this workload.

Cost Comparison: Local vs Cloud

Approach	669 GB Indexing Cost	Recurring?
Cloud Vision API (Gemini 1.5 Flash @ $0.00015/frame)	~$85 (5 frames × 1,138 clips)	Per run
Cloud Vision API (GPT-4V @ $0.01/frame)	~$5,700	Per run
Local (M1 Max, 67h compute)	~$2 electricity + hardware amortized	One-time

Bottom line: If you index more than once a quarter, local pays for itself in month one.

Quick Start: Your First Local Video Index (30 Min)

Time to first result: 30 min | Cost: $0 | Hardware: M1/M2/M3 Mac or Linux + 12 GB+ VRAM

Level 1: Just Search (Beginner)

Install LM Studio → download Qwen2.5-VL-7B-Instruct (Q4)
Run ffmpeg -i input.mp4 -vf "fps=1/10,scale=720:-1" frame_%04d.jpg
In LM Studio chat: drag 5 frames → “Describe what happens in this video clip”
Save output → you now have a searchable description

Level 2: Structured Pipeline (Intermediate)

Framedex (simbastack’s project): git clone https://github.com/asena/framedex
Configure config.yaml with your LM Studio endpoint
Run python index.py /path/to/footage
Get .description.md sidecars + _INDEX.json rollup

Level 3: Custom NLE Integration (Advanced)

Fork edit-mind (Ilias’s project): git clone https://github.com/IliasHad/edit-mind
Add your DaVinci Resolve project path
Extend schema for your domain (wedding? wildlife? sports?)
Build timeline XML export for one-click rough cuts

Risks & Limits

Risk	Likelihood	Impact	Mitigation
Whisper hallucinations on non-speech	High	False transcripts	Trim silence; use Parakeet ASR
Model updates break prompts	Medium	Pipeline drift	Pin model versions; regression test
Swap wear on SSD	Low	Hardware degradation	Monitor swap; 67h weekend = negligible
Unified memory pressure crashes	Low	Lost progress	Conservative batch sizes; monitor Activity Monitor
NLE API changes	Medium	Export breaks	XML/FCPXML is stable; test on update

Bottom Line

Two indie devs proved you can index 600 GB+ of raw video on a 5-year-old MacBook using open-source vision models — zero API bills, full privacy, total control.

The stack is real: ffmpeg + WhisperX + insightface + Qwen2.5-VL/Gemma 4 + LM Studio. The hardware is already in your bag if you bought a Max-tier MacBook Pro 2021–2023.

Your move: Pick one clip. Run it through LM Studio with Qwen2.5-VL. See the description. Then decide if your archive is worth a weekend of compute.

Explore More on zbrandco

Related AI use-case coverage:
–
–
–
–

Sources:
– Ilias Hadad, “I indexed 669 GB of my GoPro videos using my M1 Max computer and local ML models,” Hacker News / iliashaddad.com, June 14, 2026
– simbastack, “While I slept, my 5-year-old MacBook ran Gemma 4 locally and indexed a year of video,” simbastack.com, June 2026
– Simon Willison, “Why AI hasn’t replaced software engineers, and won’t,” simonwillison.net, June 14, 2026 (context on local AI amplification)
– Framedex GitHub: github.com/asena/framedex
– edit-mind GitHub: github.com/IliasHad/edit-mind

#Anthropic #Apple #Claude #Gemini #Meta #RAG

Editorially independent: we accept no payment for coverage and currently use no affiliate links. Read our Editorial Standards and Corrections Policy. Published: Jun 15, 2026.

Indie Devs Index Years of Video Locally — Zero API Costs

The Pattern: Local-First Video Indexing Is Having a Moment

Example 1: Ilias Hadad — 669 GB GoPro Archive on M1 Max

The Pipeline (7 Stages)

Models & Roles

Search Capabilities Enabled

NLE Integration

Hardware Reality Check

Example 2: simbastack — Framedex, Gemma 4 31B on a 2021 MacBook

Four Constraints That Shaped the Stack

Per-Clip Pipeline (Python)

Output: `IMG_1103.MOV.description.md`

The “Absurdity”: Gemma 4 31B on 5-Year-Old Hardware

What Broke (And the Fixes)

From Ilias’s Pipeline

From Framedex Pipeline

The Tool Stack (What You Need to Replicate)

Hardware Requirements (Minimum Viable)

Cost Comparison: Local vs Cloud

Quick Start: Your First Local Video Index (30 Min)

Level 1: Just Search (Beginner)

Level 2: Structured Pipeline (Intermediate)

Level 3: Custom NLE Integration (Advanced)

Risks & Limits

Bottom Line

Explore More on zbrandco

Read next

Confidential computing and the regulatory focus on data in use

Use GPT-5.6 Sol, Terra, and Luna on Amazon Bedrock

How R8 Made Kotlin Coroutines on Android 2x Faster

The zBrandco Edition

Indie Devs Index Years of Video Locally — Zero API Costs

The Pattern: Local-First Video Indexing Is Having a Moment

Example 1: Ilias Hadad — 669 GB GoPro Archive on M1 Max

The Pipeline (7 Stages)

Models & Roles

Search Capabilities Enabled

NLE Integration

Hardware Reality Check

Example 2: simbastack — Framedex, Gemma 4 31B on a 2021 MacBook

Four Constraints That Shaped the Stack

Per-Clip Pipeline (Python)

Output: IMG_1103.MOV.description.md

The “Absurdity”: Gemma 4 31B on 5-Year-Old Hardware

What Broke (And the Fixes)

From Ilias’s Pipeline

From Framedex Pipeline

The Tool Stack (What You Need to Replicate)

Hardware Requirements (Minimum Viable)

Cost Comparison: Local vs Cloud

Quick Start: Your First Local Video Index (30 Min)

Level 1: Just Search (Beginner)

Level 2: Structured Pipeline (Intermediate)

Level 3: Custom NLE Integration (Advanced)

Risks & Limits

Bottom Line

Explore More on zbrandco

Read next

Confidential computing and the regulatory focus on data in use

Use GPT-5.6 Sol, Terra, and Luna on Amazon Bedrock

How R8 Made Kotlin Coroutines on Android 2x Faster

The zBrandco Edition

Output: `IMG_1103.MOV.description.md`