Information-Weighted Video Frame Compressor for Vision LLMs

A preprocessing layer that scores and discards low-information visual tokens from video frames before they reach a vision LLM, cutting prompt length and latency.

Difficulty: 1-week | Stack: Python, OpenCV, Pillow, transformers (LLaVA or InternVL via HuggingFace), scikit-learn, FastAPI

Who this is for

Developers building video Q&A or video summarization features who are hitting context-length limits or latency budgets — this gives them a knob to trade token count for speed without retraining the vision LLM.

Build steps

Extract frames from an MP4 at 1 fps using OpenCV. Encode each frame into patch tokens with a CLIP ViT (via HuggingFace) and record the patch-level attention weights from the last ViT layer as a proxy for ‘information content’.
Implement three compression strategies to compare: (a) uniform temporal sampling, (b) drop patches with lowest mean attention weight, (c) drop patches where frame-over-frame patch cosine similarity exceeds a threshold (InfoMerge-style temporal redundancy).
Build a thin FastAPI endpoint that accepts a video URL and a compression_ratio parameter (0.25–1.0, the fraction of visual tokens to keep), applies strategy (c), and returns the reduced token list ready for injection into a vision LLM prompt.
Wire the compressed tokens into a LLaVA-1.5-7B inference call (via HuggingFace pipeline) and evaluate answer quality on a 20-question benchmark you create manually from 5 diverse YouTube clips.
Plot token count vs. answer correctness for all three strategies across compression ratios 0.25, 0.5, 0.75 — surface the crossover point where temporal-redundancy pruning outperforms uniform sampling.
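
As a rough illustration of strategy (c), the sketch below ranks each frame's patches by cosine similarity to the same patch position in the previous frame and keeps only the least similar ones. It is a dependency-free sketch under stated assumptions, not a reference implementation: the function names `cosine` and `prune_temporal` are hypothetical, compression_ratio is interpreted as the fraction of patches kept per frame, the first frame is kept in full, and plain Python lists stand in for the CLIP patch embeddings (a real build would more likely use NumPy or torch tensors).

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def prune_temporal(frames, compression_ratio):
    """Keep the patches that changed most since the previous frame.

    frames: list of frames, each a list of patch embeddings (lists of floats),
            with the same number of patches in every frame.
    compression_ratio: assumed here to be the fraction of patches kept per
            frame, in (0, 1].
    Returns a list of (frame_index, patch_index) pairs for the kept patches.
    """
    if not 0.0 < compression_ratio <= 1.0:
        raise ValueError("compression_ratio must be in (0, 1]")

    kept = []
    prev = None
    for f_idx, patches in enumerate(frames):
        n = len(patches)
        if prev is None:
            # Assumption: the first frame has no predecessor, so keep it whole.
            kept.extend((f_idx, p) for p in range(n))
        else:
            k = max(1, math.ceil(compression_ratio * n))
            sims = [cosine(patches[p], prev[p]) for p in range(n)]
            # Lowest similarity first: these patches carry the most new content.
            order = sorted(range(n), key=lambda p: sims[p])
            kept.extend((f_idx, p) for p in sorted(order[:k]))
        # Compare against the previous frame's full patch set, not its kept subset.
        prev = patches
    return kept
```

Ranking by similarity instead of applying a fixed threshold is a design choice that makes the output size predictable for a given compression_ratio. A threshold-based variant, as described in the step above, would instead drop every patch whose similarity exceeds the cutoff and let the token count vary with the video.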

Risks

ViT attention weights are not a ground-truth measure of semantic importance — some low-attention patches contain critical text or UI elements; you may need to add an edge-detection fallback to protect high-gradient regions.
LLaVA and InternVL expect visual tokens in a specific interleaved format; injecting a custom subset of patch tokens requires patching the model’s image preprocessing pipeline, which is model-version-specific and fragile.
The benchmark is self-constructed and small (20 questions), so results are anecdotal. Frame this as a proof-of-concept demo, not a rigorous eval, to avoid over-claiming.
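
The edge-detection fallback mentioned in the first risk could look roughly like the sketch below. It scores each patch by its mean absolute pixel gradient and adds high-scoring patches back to whatever set the pruning step kept. Everything here is an assumption for illustration: the function names `patch_edge_scores` and `protect_high_gradient`, the forward-difference gradient, and the threshold value are all hypothetical. A real build would more likely compute gradients with OpenCV (for example `cv2.Sobel` or `cv2.Canny`) on the decoded frame.

```python
def patch_edge_scores(gray, patch):
    """Mean absolute gradient per patch.

    gray: 2D list of grayscale pixel intensities (rows of equal length).
    patch: patch side length in pixels.
    Returns {(patch_row, patch_col): score}; partial patches at the right or
    bottom edge are ignored.
    """
    h, w = len(gray), len(gray[0])
    scores = {}
    for py in range(0, h - patch + 1, patch):
        for px in range(0, w - patch + 1, patch):
            total = 0.0
            for y in range(py, py + patch):
                for x in range(px, px + patch):
                    # Forward differences, clamped at the image border.
                    gx = gray[y][min(x + 1, w - 1)] - gray[y][x]
                    gy = gray[min(y + 1, h - 1)][x] - gray[y][x]
                    total += abs(gx) + abs(gy)
            scores[(py // patch, px // patch)] = total / (patch * patch)
    return scores


def protect_high_gradient(kept, scores, threshold):
    """Return the kept patches plus every patch whose edge score >= threshold."""
    protected = {pos for pos, score in scores.items() if score >= threshold}
    return sorted(set(kept) | protected)
```

The threshold would need tuning per video source. Text overlays and UI elements tend to produce sharp gradients, which is why this kind of fallback can rescue patches that attention-based scoring ranks low.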

Business Angle