Spatial Q&A SaaS for robotics hobbyists — upload a phone video, query the 3D geometry in plain English
Customer: Indie robotics hobbyist building a home assistant or pick-and-place robot — someone spending weekends on ROS2 stacks who keeps hitting a wall when their robot needs reliable relative-position answers (‘is the mug left or right of the kettle?’) and doesn’t want to fine-tune a vision model
Problem: Current VLMs hallucinate on spatial/relational questions because they reason over 2D pixels, not geometry. Hobbyists either accept flaky answers or spend weeks integrating a full depth pipeline (RealSense + Open3D + custom code) that has nothing to do with their actual project
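To make the "geometry, not pixels" point concrete, here is a minimal sketch of how a left/right question becomes a plain comparison once each object has a 3D position. The `Centroid` type, the `left_or_right` helper, the tolerance value, and the example coordinates are all hypothetical, and the sketch assumes a camera frame where +x points right; it is an illustration, not the product's actual logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Centroid:
    """3D centre of a detected object in the camera frame (hypothetical type)."""
    label: str
    x: float  # assumed convention: +x points right in the image
    y: float  # assumed convention: +y points down
    z: float  # assumed convention: +z points away from the camera


def left_or_right(a: Centroid, b: Centroid, tolerance: float = 0.02) -> str:
    """Describe where `a` sits relative to `b` along the camera's x axis.

    `tolerance` is in the same (possibly arbitrary) units as the coordinates.
    """
    dx = a.x - b.x
    if abs(dx) <= tolerance:
        return f"{a.label} is roughly in line with {b.label}"
    side = "left" if dx < 0 else "right"
    return f"{a.label} is {side} of {b.label}"


# Made-up coordinates, purely for illustration.
mug = Centroid("mug", x=-0.15, y=0.02, z=0.80)
kettle = Centroid("kettle", x=0.10, y=0.00, z=0.85)
print(left_or_right(mug, kettle))  # prints "mug is left of kettle"
```

The answer is deterministic given the coordinates, so the hard part moves to estimating those coordinates reliably rather than prompting a model.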
Pricing: freemium with a $29/mo paid plan; target is roughly $600 MRR within 4 months (about 20 paying users at $29/mo, i.e. $580)
Why now
Depth Anything V2 has made monocular depth estimation good enough to run on ordinary consumer video with no special hardware, so the barrier that previously required a RealSense or LiDAR is largely gone. The cluster of papers diagnosing VLM spatial blindness is also creating demand: people now know the root cause and are looking for targeted fixes rather than hoping the next model version improves
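As a rough illustration of why a depth estimate removes the hardware requirement, the sketch below back-projects a pixel with a known depth into a 3D point using the standard pinhole camera model. It assumes a per-pixel depth value in consistent units is already available; how a depth model's raw output is converted to such values is not covered here. The `backproject` helper and the intrinsics (`fx`, `fy`, `cx`, `cy`) are hypothetical values for a 640x480 frame.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Map pixel (u, v) with the given depth to a camera-frame point (x, y, z).

    Standard pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def depth_map_to_points(depth_map, fx, fy, cx, cy):
    """Back-project every pixel of a row-major depth map (list of rows)."""
    points = []
    for v, row in enumerate(depth_map):
        for u, depth in enumerate(row):
            if depth > 0:  # skip pixels with no valid depth
                points.append(backproject(u, v, depth, fx, fy, cx, cy))
    return points


# Hypothetical intrinsics for a 640x480 phone frame.
FX, FY, CX, CY = 500.0, 500.0, 320.0, 240.0

# Two pixels on the same image row at the same depth.
left_point = backproject(220, 240, 2.0, FX, FY, CX, CY)
right_point = backproject(420, 240, 2.0, FX, FY, CX, CY)
print(left_point, right_point)
```

If the depth is only known up to an unknown positive scale, every coordinate scales by the same factor, so a left/right ordering computed from `x` is unaffected, while absolute distances are not meaningful.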
Go-to-market
- Post a demo video to r/robotics and r/ROS showing a phone video → point cloud → spatial Q&A in under 60 seconds; link to a free Streamlit playground with a 5-query trial limit
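- Write one focused blog post titled ‘Why GPT-4o fails at left/right questions and how to fix it with a depth layer’; submit the playground to Hacker News as a Show HN with the post linked (Show HN is meant for things people can try, not standalone blog posts), and cross-post to the Hugging Face community discussion for Depth Anything V2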
- DM 10-15 active members of r/homeassistant and the LeRobot Discord who are visibly struggling with spatial perception; offer free access in exchange for a 20-minute feedback call
- Add a ‘powered by Depth-Memory’ watermark to free-tier outputs so every demo video shared by hobbyists acts as an organic ad; gate the watermark removal behind the $29/mo paid plan
Moat (or lack thereof)
Essentially none: the entire stack is open source (Depth Anything V2, Open3D, LangChain) and a determined dev could replicate it in a weekend. The only real advantage is time-to-value: you’ve already packaged the pipeline so the hobbyist doesn’t have to. Defensibility comes from iteration speed and community goodwill, not technology lock-in. If this gains traction, a well-funded competitor or Hugging Face itself could ship a hosted version quickly.