Spatial Q&A SaaS for robotics hobbyists — upload a phone video, query the 3D geometry in plain English

Customer: Indie robotics hobbyist building a home assistant or pick-and-place robot — someone spending weekends on ROS2 stacks who keeps hitting a wall when their robot needs reliable relative-position answers (‘is the mug left or right of the kettle?’) and doesn’t want to fine-tune a vision model

Problem: Current VLMs hallucinate on spatial/relational questions because they reason over 2D pixels, not geometry. Hobbyists either accept flaky answers or spend weeks integrating a full depth pipeline (RealSense + Open3D + custom code) that has nothing to do with their actual project
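
To make the "geometry, not pixels" point concrete, here is a minimal sketch of how a left/right question becomes a plain coordinate comparison once each object has a 3D centroid. Everything in it is an illustrative assumption rather than part of any existing product: the `Obj3D` record, the `lateral_relation` helper, the camera-frame convention (+x to the camera's right), the 2 cm tolerance, and the mug/kettle coordinates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obj3D:
    """Hypothetical record: an object label plus its 3D centroid in the camera frame."""
    label: str
    x: float  # metres, +x points to the camera's right (assumed convention)
    y: float  # metres, +y points down
    z: float  # metres, +z points away from the camera


def lateral_relation(a: Obj3D, b: Obj3D, tolerance: float = 0.02) -> str:
    """Describe where `a` sits relative to `b` along the camera's x-axis."""
    dx = a.x - b.x
    if abs(dx) <= tolerance:
        # Centroids are too close laterally to call it either way.
        return "aligned with"
    return "left of" if dx < 0 else "right of"


# Made-up centroids standing in for what a depth pipeline would produce.
mug = Obj3D("mug", x=-0.15, y=0.05, z=0.80)
kettle = Obj3D("kettle", x=0.10, y=0.02, z=0.85)

answer = f"The {mug.label} is {lateral_relation(mug, kettle)} the {kettle.label}."
print(answer)
```

With centroids in hand the answer is deterministic; the hard part, and the thing the hobbyist would be paying for, is producing reliable centroids in the first place.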

Pricing: freemium with a $29/mo paid plan; target is about $600 MRR within 4 months (roughly 20 paying users at $29/mo, which comes to $580)
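
A quick sanity check on the revenue target, as a small sketch. The price and target come from the line above; the helper names `mrr` and `users_needed` are just illustrative.

```python
import math

PRICE_PER_MONTH = 29  # USD, from the pricing line
TARGET_MRR = 600      # USD, from the pricing line


def mrr(paying_users: int, price: int = PRICE_PER_MONTH) -> int:
    """Monthly recurring revenue for a given number of paying users."""
    return paying_users * price


def users_needed(target: int, price: int = PRICE_PER_MONTH) -> int:
    """Smallest whole number of paying users whose MRR reaches the target."""
    return math.ceil(target / price)


print(mrr(20))                   # 580
print(users_needed(TARGET_MRR))  # 21
```

So 20 users lands just under the target, and the 21st paying user is the one that actually crosses $600.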

Why now

Depth Anything V2 just made monocular depth estimation good enough to work on ordinary phone video with no special hardware, so the barrier that previously required a RealSense or LiDAR is gone. The cluster of papers diagnosing VLM spatial blindness is also creating demand: people now know the root cause and are looking for targeted fixes rather than hoping the next model version improves
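
For readers who have not built a depth pipeline, the step from a per-pixel depth map to a point cloud is a standard pinhole back-projection. The sketch below is pure Python and rests on assumptions: the `backproject` helper is hypothetical, the intrinsics (`fx`, `fy`, `cx`, `cy`) and the tiny 2x3 depth map are made up, and the depth values are taken to be in consistent units already, which output from a monocular model may not be without extra scale alignment.

```python
def backproject(depth, fx, fy, cx, cy):
    """Turn a depth map (rows of depth values) into a list of (x, y, z) points.

    Uses the pinhole camera model: a pixel (u, v) with depth z maps to
    x = (u - cx) * z / fx and y = (v - cy) * z / fy.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                # Treat non-positive depth as an invalid pixel and skip it.
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Toy 2x3 depth map; the 0.0 entry stands in for a missing measurement.
depth = [
    [1.0, 1.0, 2.0],
    [1.0, 0.0, 2.0],
]
cloud = backproject(depth, fx=2.0, fy=2.0, cx=1.0, cy=0.5)
print(len(cloud))  # 5
print(cloud[0])    # (-0.5, -0.25, 1.0)
```

A real pipeline would do this per frame with array operations and then fuse frames into one cloud; the loop version here is only meant to show the geometry.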

Go-to-market

Post a demo video to r/robotics and r/ROS showing a phone video → point cloud → spatial Q&A in under 60 seconds; link to a free Streamlit playground with a 5-query trial limit
Write one focused blog post titled ‘Why GPT-4o fails at left/right questions and how to fix it with a depth layer’; submit it to Hacker News as a regular link (Show HN is meant for things people can try, so save that tag for the playground) and cross-post to the Hugging Face community forum for Depth Anything v2
DM 10-15 active members of r/homeassistant and the LeRobot Discord who are visibly struggling with spatial perception; offer free access in exchange for a 20-minute feedback call
Add a ‘powered by Depth-Memory’ watermark to free-tier outputs so every demo video shared by hobbyists acts as an organic ad; gate the watermark removal behind the $29/mo paid plan
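
The free-tier mechanics in the list above (5-query trial, watermark removed on the paid plan) could be sketched roughly as follows. This is a hypothetical illustration, not a billing implementation: the `Account` record and `answer_query` function are invented names, usage is kept in memory only, and the watermark is appended to a text answer as a stand-in for stamping a rendered output.

```python
from dataclasses import dataclass

FREE_QUERY_LIMIT = 5                   # trial limit from the go-to-market list
WATERMARK = "powered by Depth-Memory"  # watermark text from the go-to-market list


@dataclass
class Account:
    """Hypothetical user record; a real service would persist this."""
    paid: bool = False
    queries_used: int = 0


def answer_query(account: Account, raw_answer: str) -> str:
    """Apply the free-tier rules to an answer that has already been computed."""
    if not account.paid and account.queries_used >= FREE_QUERY_LIMIT:
        raise PermissionError("Free trial limit reached; upgrade to continue.")
    account.queries_used += 1
    if account.paid:
        return raw_answer  # paid plan: no watermark
    return f"{raw_answer} [{WATERMARK}]"


free_user = Account()
print(answer_query(free_user, "The mug is left of the kettle."))
```

Keeping the limit and the watermark in one place makes it easy to tune either one as feedback from the first users comes in.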

Moat (or lack thereof)

Essentially none: the entire stack is open source (Depth Anything v2, Open3D, LangChain) and a determined dev could replicate it in a weekend. The only real advantage is time-to-value: the pipeline is already packaged, so the hobbyist doesn’t have to build it. Defensibility comes from iteration speed and community goodwill, not technology lock-in. If this gains traction, a well-funded competitor or Hugging Face itself could ship a hosted version quickly.