Depth-Memory Spatial Q&A
Upload a short phone video of a room, let the app reconstruct a point cloud, then ask spatial questions (‘what is left of the chair?’) answered by querying geometry rather than raw pixels.
Difficulty: 1-week | Stack: Python, FastAPI, Depth Anything v2 (HuggingFace Transformers), Open3D, LangChain tool-use, Claude 3.5 Sonnet or GPT-4o API, Streamlit
Who this is for
Robotics hobbyists and accessibility-tool developers who need reliable answers to ‘where is X relative to Y’ without fine-tuning a spatial model.
Build steps
- Accept a short video (≤30 s) or a set of 5-10 photos from different angles; extract keyframes with OpenCV.
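- Run Depth Anything v2 on each keyframe to produce depth maps (the base checkpoints output relative depth; use a metric fine-tuned checkpoint or calibrate the scale, see Risks); back-project to 3D points using the camera intrinsics and merge into a single Open3D point cloud aligned via simple pairwise ICP.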
- Segment prominent objects in each frame with a lightweight SAM or YOLO model; tag point-cloud regions with object labels.
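- Expose the tagged point cloud as a tool the LLM can call: ‘get_relative_position(obj_a, obj_b)’ returns a structured answer computed from the offset between the two objects' centroids (direction as well as distance), expressed in a fixed reference viewpoint.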
- Build a Streamlit UI: upload media → wait for reconstruction → chat box that routes spatial sub-questions to the geometry tool and descriptive sub-questions directly to the vision-language model (Claude 3.5 Sonnet or GPT-4o with image input).
- Write a small eval script that tests 10 hand-labelled spatial facts from a demo video to give a reproducible accuracy number.
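The geometry tool and the eval script from the steps above can be sketched roughly as follows. This is a minimal, dependency-free illustration, not a definitive implementation. It assumes the tagged point cloud has already been reduced to a dictionary mapping object labels to lists of (x, y, z) points, and that coordinates follow an OpenCV-style reference camera frame (+x right, +y down, +z away from the camera). The `min_offset` threshold, the relation names, and the `evaluate` helper are hypothetical choices made for this example.

```python
from math import sqrt
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]
Fact = Tuple[str, str, str]  # (subject, relation, reference), hand-labelled


def centroid(points: Sequence[Point]) -> Point:
    """Mean position of an object's points."""
    if not points:
        raise ValueError("object has no points")
    n = len(points)
    return (
        sum(p[0] for p in points) / n,
        sum(p[1] for p in points) / n,
        sum(p[2] for p in points) / n,
    )


def get_relative_position(
    cloud: Dict[str, List[Point]],
    obj_a: str,
    obj_b: str,
    min_offset: float = 0.05,  # assumed dead zone, in point-cloud units
) -> dict:
    """Describe where obj_a sits relative to obj_b, seen from the reference camera.

    Assumes +x is right, +y is down, +z is away from the camera.
    """
    for name in (obj_a, obj_b):
        if name not in cloud:
            return {"error": f"unknown object: {name}"}

    ca, cb = centroid(cloud[obj_a]), centroid(cloud[obj_b])
    dx, dy, dz = (a - b for a, b in zip(ca, cb))

    relations = []
    if abs(dx) >= min_offset:
        relations.append("right_of" if dx > 0 else "left_of")
    if abs(dy) >= min_offset:
        relations.append("below" if dy > 0 else "above")
    if abs(dz) >= min_offset:
        relations.append("behind" if dz > 0 else "in_front_of")

    return {
        "subject": obj_a,
        "reference": obj_b,
        "relations": relations,
        "offset": {"x": round(dx, 3), "y": round(dy, 3), "z": round(dz, 3)},
        "distance": round(sqrt(dx * dx + dy * dy + dz * dz), 3),
    }


def evaluate(cloud: Dict[str, List[Point]], facts: Sequence[Fact]) -> float:
    """Fraction of hand-labelled facts that the geometry tool reproduces."""
    if not facts:
        raise ValueError("no facts to evaluate")
    correct = sum(
        relation in get_relative_position(cloud, a, b).get("relations", [])
        for a, relation, b in facts
    )
    return correct / len(facts)


# Toy scene: the mug sits 0.9 units to the left of the kettle, at equal depth.
demo_cloud = {
    "mug": [(-0.5, 0.0, 2.0), (-0.3, 0.0, 2.0)],
    "kettle": [(0.4, 0.0, 2.0), (0.6, 0.0, 2.0)],
}
result = get_relative_position(demo_cloud, "mug", "kettle")
print(result["relations"])  # ['left_of']
```

Returning the raw offset next to the relation labels lets the LLM phrase the answer while the numbers stay auditable. Distances are only in metres if the cloud has been scale-calibrated.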
Risks
- The base Depth Anything v2 checkpoints produce relative, not metric, depth, so room-scale distance estimates will have unknown scale unless you calibrate with a known object size or switch to a metric fine-tuned checkpoint.
- ICP alignment fails on feature-poor scenes (blank walls, uniform floors), where flat surfaces give it little to lock onto, and produces a garbled point cloud; add a quality gate that warns the user.
- Object segmentation errors propagate: a mis-labelled region gives confidently wrong answers, so always return the bounding-box visualisation alongside the text answer.
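The known-object calibration mentioned in the first risk could look something like the sketch below. It assumes the reconstruction is correct up to one global scale factor, which is only an approximation for relative depth, and that the user supplies the real size of one labelled object (for example a standard door height). The function names and the choice of the longest bounding-box side as the measured extent are illustrative assumptions.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def longest_extent(points: Sequence[Point]) -> float:
    """Longest side of the object's axis-aligned bounding box, in cloud units."""
    if not points:
        raise ValueError("reference object has no points")
    return max(
        max(p[axis] for p in points) - min(p[axis] for p in points)
        for axis in range(3)
    )


def scale_factor(reference_points: Sequence[Point], known_size_m: float) -> float:
    """Metres per cloud unit, derived from one object of known real-world size."""
    measured = longest_extent(reference_points)
    if measured <= 0:
        raise ValueError("reference object has zero extent")
    return known_size_m / measured


def rescale(points: Sequence[Point], factor: float) -> List[Point]:
    """Apply a uniform scale so distances read (approximately) in metres."""
    return [(x * factor, y * factor, z * factor) for x, y, z in points]


# Toy example: a door spans 4.0 cloud units but is known to be 2.0 m tall.
door = [(0.0, 0.0, 5.0), (1.6, 0.0, 5.0), (0.0, 4.0, 5.0), (1.6, 4.0, 5.0)]
factor = scale_factor(door, known_size_m=2.0)
print(factor)  # 0.5
```

An axis-aligned box overestimates the size of tilted objects, so a reference that stands upright in the scene (a door, a fridge) is likely a safer pick than a small handheld item.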
Business Angle
Spatial Q&A SaaS for robotics hobbyists — upload a phone video, query the 3D geometry in plain English
Customer: An indie robotics hobbyist building a home assistant or pick-and-place robot: someone who spends weekends on ROS 2 stacks, keeps hitting a wall when the robot needs reliable relative-position answers ('is the mug left or right of the kettle?'), and doesn't want to fine-tune a vision model.
Pricing: freemium, targeting about $600 MRR within 4 months (roughly 20 paying users at $29/mo)