Depth-Memory Spatial Q&A

Upload a short phone video of a room, let the app reconstruct a point cloud, then ask spatial questions (‘what is left of the chair?’) answered by querying geometry rather than raw pixels.

Difficulty: 1-week | Stack: Python, FastAPI, Depth Anything v2 (HuggingFace Transformers), Open3D, LangChain tool-use, Claude 3.5 Sonnet or GPT-4o API, Streamlit

Who this is for

Robotics hobbyists and accessibility-tool developers who need reliable answers to ‘where is X relative to Y’ without fine-tuning a spatial model.

Build steps

Accept a short video (≤30 s) or a set of 5-10 photos from different angles; extract keyframes with OpenCV.
Run Depth Anything v2 on each keyframe to produce depth maps (the base checkpoints output relative depth; see Risks); back-project the pixels to 3D points using the camera intrinsics and merge the frames into a single Open3D point cloud aligned via simple ICP.
Segment prominent objects in each frame with a lightweight SAM or YOLO model; tag point-cloud regions with object labels.
Expose the tagged point cloud as a tool the LLM can call: ‘get_relative_position(obj_a, obj_b)’ returns a structured answer (direction and distance) computed from the offset between the two objects' centroids.
Build a Streamlit UI: upload media → wait for reconstruction → chat box that routes spatial sub-questions to the geometry tool and descriptive sub-questions directly to the vision-capable LLM (Claude 3.5 Sonnet or GPT-4o) along with the relevant keyframe.
Write a small eval script that tests 10 hand-labelled spatial facts from a demo video to give a reproducible accuracy number.
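The back-projection in the depth step can be sketched with plain NumPy. This is a minimal pinhole-camera sketch, not the project's actual code: the function name `backproject_depth` and the intrinsics `fx`, `fy`, `cx`, `cy` are assumptions, and the depth array is assumed to already hold distance along the optical axis (after whatever conversion your chosen checkpoint's raw output requires).

```python
import numpy as np


def backproject_depth(depth, fx, fy, cx, cy):
    """Turn an (H, W) depth map into an (N, 3) array of camera-frame points.

    Hypothetical helper: assumes a pinhole camera with focal lengths fx, fy
    and principal point (cx, cy), all in pixels.
    """
    h, w = depth.shape
    # u indexes columns (image x), v indexes rows (image y).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.astype(np.float64)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    # Drop pixels without a usable depth value.
    valid = np.isfinite(points[:, 2]) & (points[:, 2] > 0)
    return points[valid]


if __name__ == "__main__":
    demo_depth = np.full((4, 6), 2.0)
    cloud = backproject_depth(demo_depth, fx=500.0, fy=500.0, cx=3.0, cy=2.0)
    print(cloud.shape)
```

The resulting array can be handed to Open3D via `open3d.utility.Vector3dVector` to build a point cloud for the ICP step. If the phone's real intrinsics are unavailable, a guessed focal length will distort the cloud, so treat the output as approximate.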
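The geometry tool from the tool-calling step might look roughly like the following. It is a sketch under stated assumptions: the `objects` dictionary (label to list of 3D points), the camera-style axis convention, and the shape of the returned dictionary are all illustrative choices rather than anything fixed by the project. It reports the dominant axis of the centroid offset plus the straight-line distance.

```python
import numpy as np

# Assumed axis convention (camera-style): +x right, +y down, +z away from the viewer.
AXES = (("right of", "left of"), ("below", "above"), ("behind", "in front of"))


def get_relative_position(objects, obj_a, obj_b):
    """Describe where obj_a sits relative to obj_b using point centroids.

    `objects` is a hypothetical mapping from label to an (N, 3) point list.
    """
    for name in (obj_a, obj_b):
        if name not in objects:
            return {"error": f"unknown object: {name}"}
    centroid_a = np.asarray(objects[obj_a], dtype=float).mean(axis=0)
    centroid_b = np.asarray(objects[obj_b], dtype=float).mean(axis=0)
    offset = centroid_a - centroid_b
    # Pick the axis along which the two centroids differ the most.
    axis = int(np.argmax(np.abs(offset)))
    relation = AXES[axis][0] if offset[axis] > 0 else AXES[axis][1]
    return {
        "subject": obj_a,
        "reference": obj_b,
        "relation": relation,
        "offset": [round(float(v), 3) for v in offset],
        "distance": round(float(np.linalg.norm(offset)), 3),
    }


if __name__ == "__main__":
    scene = {
        "lamp": [[-1.0, 0.0, 2.0], [-1.2, 0.0, 2.0]],
        "chair": [[0.0, 0.0, 2.0], [0.2, 0.0, 2.0]],
    }
    print(get_relative_position(scene, "lamp", "chair"))
```

Returning a plain dictionary keeps the result easy to serialise as a tool response. Note that "left of" here is relative to the reference camera frame; a question asked from a different viewpoint would need the offset rotated into that viewpoint first.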
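For the eval step, one possible shape for the script is shown below. The fact format (subject, reference, expected relation) and the `answer_fn` callback are assumptions made for illustration; in practice `answer_fn` would wrap whatever the geometry tool returns.

```python
def evaluate(facts, answer_fn):
    """Score predicted relations against hand-labelled facts.

    `facts` is a list of (obj_a, obj_b, expected_relation) tuples and
    `answer_fn(obj_a, obj_b)` returns the predicted relation string.
    Both are hypothetical interfaces for this sketch.
    """
    failures = []
    correct = 0
    for obj_a, obj_b, expected in facts:
        predicted = answer_fn(obj_a, obj_b)
        if predicted == expected:
            correct += 1
        else:
            failures.append(
                {"pair": (obj_a, obj_b), "expected": expected, "predicted": predicted}
            )
    total = len(facts)
    accuracy = correct / total if total else 0.0
    return {"accuracy": accuracy, "correct": correct, "total": total, "failures": failures}


if __name__ == "__main__":
    labelled = [
        ("lamp", "chair", "left of"),
        ("rug", "table", "below"),
    ]
    # Stand-in predictor that always answers "left of".
    report = evaluate(labelled, lambda a, b: "left of")
    print(report["accuracy"], report["failures"])
```

Keeping the failures alongside the accuracy figure makes it easier to see whether errors come from segmentation, alignment, or the axis convention.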

Risks

The base Depth Anything v2 checkpoints produce relative, not metric, depth, so room-scale distance estimates will have unknown scale unless you calibrate with a known object size or switch to one of the metric fine-tuned checkpoints.
ICP alignment tends to fail on scenes with little geometric structure (blank walls, uniform floors), where flat surfaces can slide against each other, and produces a garbled point cloud; add a quality gate that warns the user.
Object segmentation errors propagate: a mis-labelled region gives confidently wrong answers, so always return the bounding-box visualisation alongside the text answer.
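The known-object calibration mentioned in the first risk can be approximated with a single scale factor. The sketch below assumes the user has labelled one reference object (a door, say) whose real height is known; `estimate_scale` and `apply_scale` are hypothetical helper names. Relative depth can also differ from true depth by an offset, so one multiplier is only a rough correction.

```python
import numpy as np


def estimate_scale(reference_points, known_size_m, axis=1):
    """Return the factor that maps cloud units to metres.

    Compares the extent of a labelled reference object along `axis`
    (default: the vertical axis) with its known real-world size.
    """
    coords = np.asarray(reference_points, dtype=float)[:, axis]
    extent = float(coords.max() - coords.min())
    if extent <= 0:
        raise ValueError("reference object has no extent along the chosen axis")
    return known_size_m / extent


def apply_scale(points, scale):
    """Rescale an (N, 3) point array by a uniform factor."""
    return np.asarray(points, dtype=float) * scale


if __name__ == "__main__":
    door = [[0.0, 0.0, 1.0], [0.0, 4.0, 1.0]]  # 4 cloud units tall
    scale = estimate_scale(door, known_size_m=2.0)
    print(scale)
    print(apply_scale([[2.0, 4.0, 6.0]], scale))
```

Segmentation noise at the top and bottom of the reference object feeds straight into the scale, so a robust extent (for example a percentile range) may be worth swapping in.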
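A quality gate for the ICP risk could be as simple as thresholding the two summary numbers that Open3D's registration result exposes, `fitness` (fraction of points with a correspondence) and `inlier_rmse`. The function below is a sketch: its name and both default thresholds are assumptions that would need tuning on real captures, and the RMSE threshold is in the cloud's own units.

```python
def registration_quality(fitness, inlier_rmse, min_fitness=0.3, max_rmse=0.05):
    """Flag suspicious frame-to-frame alignments.

    `fitness` and `inlier_rmse` are expected to come from an ICP result;
    the thresholds are illustrative guesses, not recommended values.
    """
    warnings = []
    if fitness < min_fitness:
        warnings.append(
            f"low overlap: fitness {fitness:.2f} is below {min_fitness:.2f}"
        )
    if inlier_rmse > max_rmse:
        warnings.append(
            f"poor fit: inlier RMSE {inlier_rmse:.3f} is above {max_rmse:.3f}"
        )
    return {"ok": not warnings, "warnings": warnings}


if __name__ == "__main__":
    print(registration_quality(0.82, 0.012))
    print(registration_quality(0.10, 0.200))
```

In the Streamlit UI, a failed check could surface as a warning suggesting the user re-record with more overlap between frames or with more furniture in view.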
