Privacy-First Desktop Automation Agent
Natural-language task runner for GUI automation using a locally-hosted computer-use model — screen data never leaves the machine.
Difficulty: weekend | Stack: Python, Holo3.1 (via Ollama or HF transformers), PyAutoGUI for mouse and keyboard control (pygetwindow for window management), Pillow (PIL) for screenshots, FastAPI for an optional local REST trigger
Who this is for
Developers and power users who want Copilot-style automation for their desktop but won't send screenshots to a cloud endpoint, a constraint common in finance, law, and healthcare.
Build steps
- Serve Holo3.1 locally via Ollama or transformers pipeline; verify screenshot → action inference works on a simple open-browser task
- Build a screen-capture loop: grab screenshot every N ms, encode to base64, send to local model with a task prompt
- Parse model output into pyautogui calls (click x,y / type text / key combo); add a dry-run mode that prints actions without executing
- Add a simple task queue: user types goal in terminal, agent loops until done or hits max-steps guard
- Wire a stop-hotkey (global keyboard listener) to kill the loop safely
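The capture, parse, dry-run, and max-steps steps above can be sketched roughly as follows. This is a minimal sketch, not Holo3.1's actual interface: the JSON action schema ({"action": "click", "x": ..., "y": ...}) is a hypothetical stand-in for whatever the model really emits, and next_action is a hypothetical callable standing in for the local model call. Pillow and pyautogui are imported lazily, so dry-run mode works on a machine without either installed.

```python
import base64
import io
import json
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Action:
    kind: str                      # "click", "type", "hotkey", or "done"
    x: Optional[int] = None
    y: Optional[int] = None
    text: str = ""
    keys: List[str] = field(default_factory=list)


def capture_base64() -> str:
    """Grab the screen and return it as a base64-encoded PNG string."""
    from PIL import ImageGrab  # lazy import: needs Pillow and a display
    buf = io.BytesIO()
    ImageGrab.grab().save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode("ascii")


def parse_action(raw: str) -> Action:
    """Parse one model reply, ASSUMED to be a JSON object (hypothetical schema)."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    kind = data.get("action")
    if kind == "click":
        return Action("click", x=int(data["x"]), y=int(data["y"]))
    if kind == "type":
        return Action("type", text=str(data["text"]))
    if kind == "hotkey":
        return Action("hotkey", keys=[str(k) for k in data["keys"]])
    if kind == "done":
        return Action("done")
    raise ValueError(f"unsupported action: {kind!r}")


def describe(action: Action) -> str:
    """Human-readable form of an action, used for dry-run output and logs."""
    if action.kind == "click":
        return f"click({action.x}, {action.y})"
    if action.kind == "type":
        return f"type({action.text!r})"
    if action.kind == "hotkey":
        return "hotkey(" + "+".join(action.keys) + ")"
    return action.kind


def execute(action: Action, dry_run: bool = True) -> str:
    """Print the action in dry-run mode; otherwise perform it with pyautogui."""
    line = describe(action)
    if dry_run:
        print("[dry-run]", line)
        return line
    import pyautogui  # lazy import: only needed for real execution
    if action.kind == "click":
        pyautogui.click(x=action.x, y=action.y)
    elif action.kind == "type":
        pyautogui.write(action.text)
    elif action.kind == "hotkey":
        pyautogui.hotkey(*action.keys)
    return line


def run_task(goal: str, next_action: Callable[[str, int], str],
             max_steps: int = 20, dry_run: bool = True) -> List[str]:
    """Loop until the model reports "done" or the max-steps guard trips."""
    log: List[str] = []
    for step in range(max_steps):
        action = parse_action(next_action(goal, step))
        if action.kind == "done":
            break
        log.append(execute(action, dry_run=dry_run))
    return log
```

In a real run, next_action would call capture_base64() and send the result to the local model with the task prompt. Defaulting dry_run to True means the agent has to be switched to live execution explicitly, which seems the safer default while the parser is still being tuned.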
Risks
- Holo3.1's action output format may not map directly onto pyautogui calls; the prompt template and the parser both need to be tuned to its output schema
- Per-step latency (capture plus model inference) on CPU-only machines will make the loop too slow for fast UIs; may need to cap screenshot resolution or run the model on a CUDA GPU
- Runaway agent with no stop condition can destructively click through anything — must ship the kill-switch before testing
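Two of the mitigations above, capping resolution and shipping a kill switch, might look like the sketch below. It assumes the model returns coordinates in the pixel space of the downscaled image (not confirmed for Holo3.1), and it assumes pynput is used for the global hotkey, which the stack list does not name. The helper names and the 1280-pixel cap are illustrative only.

```python
import threading
from typing import Callable, Tuple


def capped_size(width: int, height: int, max_side: int = 1280) -> Tuple[int, int]:
    """Shrink (width, height) so the longer side is at most max_side, keeping aspect ratio."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def to_screen_coords(x: float, y: float,
                     model_size: Tuple[int, int],
                     screen_size: Tuple[int, int]) -> Tuple[int, int]:
    """Map a point predicted on the downscaled image back to real screen pixels."""
    model_w, model_h = model_size
    screen_w, screen_h = screen_size
    return round(x * screen_w / model_w), round(y * screen_h / model_h)


# Shared flag: the hotkey listener sets it, the agent loop checks it every step.
stop_event = threading.Event()


def start_kill_switch(hotkey: str = "<ctrl>+<alt>+q"):
    """Start a global hotkey listener that sets stop_event (ASSUMES pynput is installed)."""
    from pynput import keyboard  # lazy import: optional dependency
    listener = keyboard.GlobalHotKeys({hotkey: stop_event.set})
    listener.start()
    return listener


def guarded_loop(step_fn: Callable[[int], None], max_steps: int = 20) -> int:
    """Run step_fn until max_steps is reached or the kill switch fires; return steps run."""
    steps = 0
    while steps < max_steps and not stop_event.is_set():
        step_fn(steps)
        steps += 1
    return steps
```

The stop flag is only checked between steps, so a step that is already executing will finish before the loop exits. PyAutoGUI's built-in fail-safe (moving the mouse to a screen corner raises pyautogui.FailSafeException while pyautogui.FAILSAFE is True, the default) is a useful second line of defense alongside the hotkey.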
Business Angle
Local GUI automation agent for regulated-industry knowledge workers who can't send screenshots to the cloud
Customer: Solo compliance analyst or paralegal at a 10–50 person firm — owns their own machine, runs repetitive multi-app workflows (copy from court portal → paste into case management → log in spreadsheet), IT won't approve cloud tools, personally accountable if data leaks
Pricing: one-time $99 lifetime license; target of about $1,200 in the first 90 days (12 × $99 = $1,188)