AI Pulse

Depth-Memory Spatial Q&A

Upload a short phone video of a room, let the app reconstruct a point cloud, then ask spatial questions (‘what is left of the chair?’) answered by querying geometry rather than raw pixels.

Difficulty: 1-week | Stack: Python, FastAPI, Depth Anything v2 (HuggingFace Transformers), Open3D, LangChain tool-use, Claude 3.5 Sonnet or GPT-4o API, Streamlit

Who this is for

Robotics hobbyists and accessibility-tool developers who need reliable answers to ‘where is X relative to Y’ without fine-tuning a spatial model.

Build steps

  1. Accept a short video (≤30 s) or a set of 5-10 photos from different angles; extract keyframes with OpenCV.
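  2. Run Depth Anything v2 on each keyframe to produce depth maps (relative depth with the base checkpoints; use a metric fine-tuned checkpoint or calibrate the scale if you need real-world distances); back-project to 3D points and merge the per-frame clouds into a single Open3D point cloud aligned via pairwise ICP.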
  3. Segment prominent objects in each frame with a lightweight SAM or YOLO model; tag point-cloud regions with object labels.
  4. Expose the tagged point cloud as a tool the LLM can call: ‘get_relative_position(obj_a, obj_b)’ returns a structured answer computed from the offset and distance between the two objects' centroids.
  5. Build a Streamlit UI: upload media → wait for reconstruction → chat box that routes spatial sub-questions to the geometry tool and descriptive sub-questions directly to the multimodal LLM (Claude 3.5 Sonnet or GPT-4o).
  6. Write a small eval script that tests 10 hand-labelled spatial facts from a demo video to give a reproducible accuracy number.
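Steps 2, 4 and 6 can be sketched in plain Python to show the geometry involved. This is a minimal illustration under stated assumptions, not the project's actual implementation: it assumes a pinhole camera with hypothetical intrinsics (fx, fy, cx, cy), a camera-style frame where x points right, y points down and z points away from the viewer, and a tagged cloud stored as a dict from label to list of points. The helper names other than get_relative_position, and the 5 cm tolerance, are illustrative choices.

```python
from math import dist

Point = tuple[float, float, float]


def back_project(depth, fx, fy, cx, cy):
    """Turn one depth map (rows of z values) into camera-frame 3D points."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z > 0:  # skip pixels with missing or invalid depth
                points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def centroid(points):
    """Mean position of a non-empty list of 3D points."""
    if not points:
        raise ValueError("cannot take the centroid of an empty region")
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def get_relative_position(cloud, obj_a, obj_b, tol=0.05):
    """Describe where obj_a sits relative to obj_b.

    Assumed frame: +x right, +y down, +z away from the viewer.
    tol is a hypothetical dead band below which an axis is ignored.
    """
    ca, cb = centroid(cloud[obj_a]), centroid(cloud[obj_b])
    dx, dy, dz = (a - b for a, b in zip(ca, cb))
    relations = []
    if abs(dx) > tol:
        relations.append("right of" if dx > 0 else "left of")
    if abs(dy) > tol:
        relations.append("below" if dy > 0 else "above")
    if abs(dz) > tol:
        relations.append("behind" if dz > 0 else "in front of")
    return {
        "subject": obj_a,
        "reference": obj_b,
        "relations": relations,
        "offset": (dx, dy, dz),
        "distance": dist(ca, cb),
    }


def accuracy(cloud, facts):
    """Fraction of hand-labelled (subject, relation, reference) facts reproduced."""
    hits = sum(
        relation in get_relative_position(cloud, a, b)["relations"]
        for a, relation, b in facts
    )
    return hits / len(facts)


# Toy scene: a lamp roughly one unit to the left of a chair.
demo_cloud = {
    "lamp": [(-1.1, 0.0, 2.0), (-0.9, 0.0, 2.0)],
    "chair": [(0.1, 0.0, 2.0), (-0.1, 0.0, 2.0)],
}
demo_answer = get_relative_position(demo_cloud, "lamp", "chair")
```

Returning the raw offset and distance alongside the relation words lets the LLM phrase the answer while the geometry stays auditable. Note that "left" and "right" here are relative to the assumed reference view; a real app would need to decide whose viewpoint the question refers to.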

Risks

  • Depth Anything v2's base checkpoints produce relative, not metric, depth, so room-scale distance estimates will have unknown scale unless you calibrate with a known object size or switch to a metric fine-tuned checkpoint.
  • ICP needs a reasonable initial alignment and tends to fail on scenes with little distinguishing structure (blank walls, uniform floors), producing a garbled point cloud; add a quality gate that warns the user.
  • Object segmentation errors propagate: a mis-labelled region gives confidently wrong answers, so always return the bounding-box visualisation alongside the text answer.
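The first two mitigations can be sketched as follows. This is a rough illustration under simplifying assumptions: it treats the unknown scale as a single global factor (ignoring any shift ambiguity in relative depth), and the gate thresholds are hypothetical values that would need tuning on real captures. The fitness and inlier_rmse inputs correspond to the fields of the same names on the result returned by Open3D's registration_icp; the function names here are illustrative.

```python
def scale_from_reference(measured_size, known_size_m):
    """Scale factor that maps cloud units to metres.

    measured_size: extent of a reference object as measured in the cloud.
    known_size_m:  the same extent in the real world, in metres.
    Assumes one global scale factor is adequate (a simplification).
    """
    if measured_size <= 0 or known_size_m <= 0:
        raise ValueError("sizes must be positive")
    return known_size_m / measured_size


def rescale(points, scale):
    """Apply a uniform scale to a list of (x, y, z) points."""
    return [(x * scale, y * scale, z * scale) for x, y, z in points]


def alignment_ok(fitness, inlier_rmse, min_fitness=0.6, max_rmse=0.05):
    """Quality gate for one ICP registration.

    fitness:     fraction of points with an inlier correspondence (0 to 1).
    inlier_rmse: RMSE over inlier correspondences, in cloud units.
    The default thresholds are hypothetical placeholders.
    """
    return fitness >= min_fitness and inlier_rmse <= max_rmse


# Example: a door measured as 0.5 cloud units that is known to be 0.9 m wide.
door_scale = scale_from_reference(0.5, 0.9)
scaled_points = rescale([(1.0, 2.0, 3.0)], 2.0)
```

Because inlier_rmse is expressed in the cloud's own units, apply the gate consistently either before or after rescaling, and surface a warning in the UI when alignment_ok returns False rather than answering from a cloud that may be garbled.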

Business Angle

Spatial Q&A SaaS for robotics hobbyists — upload a phone video, query the 3D geometry in plain English

Customer: Indie robotics hobbyist building a home assistant or pick-and-place robot — someone spending weekends on ROS2 stacks who keeps hitting a wall when their robot needs reliable relative-position answers ('is the mug left or right of the kettle?') and doesn't want to fine-tune a vision model

Pricing: freemium with a $29/mo paid tier; target is roughly $600 MRR in 4 months (about 20 paying users)
