Information-Weighted Video Frame Compressor for Vision LLMs
A preprocessing layer that scores and discards low-information visual tokens from video frames before they reach a vision LLM, cutting prompt length and latency.
Difficulty: 1-week | Stack: Python, OpenCV, Pillow, transformers (LLaVA or InternVL via HuggingFace), scikit-learn, FastAPI
Who this is for
Developers building video Q&A or video summarization features who are hitting context-length limits or latency budgets. This gives them a knob to trade visual detail (token count) for speed without retraining the vision LLM.
Build steps
- Extract frames from an MP4 at 1 fps using OpenCV. Encode each frame into patch tokens with a CLIP ViT (via HuggingFace) and record patch-level attention weights from the last ViT layer (for example, the CLS token's attention to each patch, averaged over heads) as a proxy for ‘information content’.
- Implement three compression strategies to compare: (a) uniform temporal sampling, (b) drop patches with lowest mean attention weight, (c) drop patches where frame-over-frame patch cosine similarity exceeds a threshold (InfoMerge-style temporal redundancy).
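- Build a thin FastAPI endpoint that accepts a video URL and a compression_ratio parameter (0.25–1.0), applies strategy (c) with the similarity cutoff chosen to meet the requested ratio (for example, by keeping the least-redundant patches first), and returns the reduced token list ready for injection into a vision LLM prompt.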
- Wire the compressed tokens into a LLaVA-1.5-7B inference call (via HuggingFace pipeline) and evaluate answer quality on a 20-question benchmark you create manually from 5 diverse YouTube clips.
- Plot token count vs. answer correctness for all three strategies across compression ratios 0.25, 0.5, 0.75 — surface the crossover point where temporal-redundancy pruning outperforms uniform sampling.
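Strategies (b) and (c) from the steps above can be sketched with NumPy alone. This is a minimal sketch under stated assumptions, not a definitive implementation: the function names, the 0.9 default threshold, and the input layout (per-patch mean attention as a 1-D array, patch embeddings as (num_patches, dim) arrays aligned by position across consecutive frames) are all illustrative choices.

```python
import numpy as np


def keep_count(num_tokens: int, compression_ratio: float) -> int:
    """Number of tokens to keep; the ratio is clamped to the 0.25-1.0 API range."""
    ratio = min(max(compression_ratio, 0.25), 1.0)
    return max(1, int(round(num_tokens * ratio)))


def prune_by_attention(attn: np.ndarray, compression_ratio: float) -> np.ndarray:
    """Strategy (b): keep the patches with the highest mean attention weight.

    attn has shape (num_patches,). Returns the kept patch indices in ascending order.
    """
    k = keep_count(attn.shape[0], compression_ratio)
    top = np.argsort(attn)[::-1][:k]  # indices of the k largest weights
    return np.sort(top)


def prune_by_temporal_redundancy(
    prev: np.ndarray, curr: np.ndarray, threshold: float = 0.9
) -> np.ndarray:
    """Strategy (c): drop patches that barely changed since the previous frame.

    prev and curr have shape (num_patches, dim). A patch is dropped when its
    cosine similarity with the same-position patch in prev exceeds threshold.
    Returns the kept patch indices in ascending order.
    """
    dots = (prev * curr).sum(axis=1)
    norms = np.linalg.norm(prev, axis=1) * np.linalg.norm(curr, axis=1) + 1e-8
    similarity = dots / norms
    return np.flatnonzero(similarity <= threshold)
```

For a frame encoded into 576 patch tokens, `keep_count(576, 0.25)` keeps 144. Strategy (c) as written is threshold-driven, so the number of surviving patches depends on how much the scene changes between frames.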
Risks
- ViT attention weights are not a ground-truth measure of semantic importance — some low-attention patches contain critical text or UI elements; you may need to add an edge-detection fallback to protect high-gradient regions.
- LLaVA and InternVL expect visual tokens in a specific interleaved format; injecting a custom subset of patch tokens requires patching the model’s image preprocessing pipeline, which is model-version-specific and fragile.
- Benchmark quality is self-constructed and small (20 questions), making results anecdotal — frame this as a proof-of-concept demo, not a rigorous eval, to avoid over-claiming.
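The edge-detection fallback mentioned in the first risk could look roughly like the sketch below. It uses a plain NumPy gradient magnitude as a stand-in for an edge detector (an OpenCV Sobel or Canny pass would be the natural substitute in this stack). The function names, the 14-pixel patch size, and the `min_grad` threshold are assumptions for illustration and would need tuning on real frames.

```python
import numpy as np


def high_gradient_patches(
    gray: np.ndarray, patch: int = 14, min_grad: float = 10.0
) -> np.ndarray:
    """Return indices (row-major) of patches whose mean gradient magnitude
    exceeds min_grad. gray is a 2-D grayscale frame."""
    h, w = gray.shape
    rows, cols = h // patch, w // patch
    gy, gx = np.gradient(gray.astype(np.float32))
    # Crop to a whole number of patches, then average magnitude per patch.
    mag = np.hypot(gx, gy)[: rows * patch, : cols * patch]
    per_patch = mag.reshape(rows, patch, cols, patch).mean(axis=(1, 3))
    return np.flatnonzero(per_patch.ravel() > min_grad)


def protect_high_gradient(
    kept: np.ndarray, gray: np.ndarray, patch: int = 14, min_grad: float = 10.0
) -> np.ndarray:
    """Add high-gradient patches back to a kept-index set from any pruning strategy."""
    return np.union1d(kept, high_gradient_patches(gray, patch, min_grad))
```

Protected patches count against the token budget, so a frame dense with text or UI elements may end up above the requested compression ratio.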
Business Angle
Drop-in Python middleware that cuts vision LLM costs by pruning low-information frames and patch tokens from video before they hit the model.
Customer: Solo ML engineer or indie dev building a video Q&A / summarization SaaS (e.g. 'ask questions about your Loom recordings') who is self-hosting LLaVA or InternVL and watching their GPU bill climb as video length grows
Pricing: open-core — $800 MRR in 4 months (16 teams × $50/mo for hosted compression API + priority support; core OSS library stays free)