Codebase Context Index
A local semantic search layer over your repo that any AI agent can query via a simple MCP server to get high-quality, ranked context instead of relying on naive file reads.
Difficulty: 1-week | Stack: Python, FastAPI, LanceDB, tree-sitter, sentence-transformers, MCP SDK (Python)
Who this is for
Developers who notice their AI agents hallucinating or fetching irrelevant files — this gives the agent a purpose-built retrieval tool tuned to code structure rather than generic embedding search.
Build steps
- Use tree-sitter to parse the repo into semantic chunks: function bodies, class definitions, and docstrings — not arbitrary line windows — and store each chunk with its file path, symbol name, and start/end lines.
- Embed each chunk with a local sentence-transformers model (e.g. all-MiniLM-L6-v2) and persist vectors plus metadata in LanceDB, which stores everything as a single directory alongside the repo.
- Expose a two-tool MCP server: search_code(query, top_k) returns ranked chunks with file/line references, and get_context(file_path, symbol_name) returns the full chunk for a known symbol.
- Write a file-watcher (watchdog) that re-indexes changed files on save so the index stays current without a full rebuild.
- Register the MCP server in Claude Code’s .mcp.json and smoke-test it by asking the agent to find ‘where authentication tokens are validated’, then verify it returns the right function, not a README mention.
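The chunking step and the get_context lookup can be sketched without any third-party dependencies. This is a stand-in, not the tree-sitter pipeline itself: it uses Python's built-in ast module in place of tree-sitter, so it only handles Python source, and it keeps chunks in a plain list rather than in LanceDB. The Chunk dataclass, chunk_python_source, and the sample auth.py content are hypothetical names chosen for illustration.

```python
import ast
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    """One semantic unit of code plus the metadata the index stores with it."""
    file_path: str
    symbol_name: str
    kind: str                 # "function" or "class"
    start_line: int           # 1-indexed, inclusive
    end_line: int             # 1-indexed, inclusive
    docstring: Optional[str]
    text: str


def chunk_python_source(source: str, file_path: str) -> list[Chunk]:
    """Split Python source into one chunk per function or class definition.

    Assumption: ast stands in for tree-sitter here. Methods are emitted as
    their own chunks in addition to the enclosing class chunk.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            kind = "function"
        elif isinstance(node, ast.ClassDef):
            kind = "class"
        else:
            continue
        chunks.append(Chunk(
            file_path=file_path,
            symbol_name=node.name,
            kind=kind,
            start_line=node.lineno,
            end_line=node.end_lineno,
            docstring=ast.get_docstring(node),
            text="\n".join(lines[node.lineno - 1:node.end_lineno]),
        ))
    # ast.walk does not visit nodes in source order, so sort by position.
    chunks.sort(key=lambda c: c.start_line)
    return chunks


def get_context(chunks: list[Chunk], file_path: str, symbol_name: str) -> Optional[Chunk]:
    """Return the full chunk for a known symbol, or None if it is not indexed."""
    for chunk in chunks:
        if chunk.file_path == file_path and chunk.symbol_name == symbol_name:
            return chunk
    return None


# Hypothetical file content used only to exercise the sketch.
SAMPLE = '''\
def validate_token(token):
    """Return True if the token is well formed."""
    return token.startswith("tok_")


class Session:
    """Holds per-user state."""

    def close(self):
        pass
'''

if __name__ == "__main__":
    for c in chunk_python_source(SAMPLE, "auth.py"):
        print(f"{c.file_path}:{c.start_line}-{c.end_line} {c.kind} {c.symbol_name}")
```

With tree-sitter the same record shape would be filled from each node's start and end positions instead. Those rows are zero-based, so add 1 before storing them as line numbers.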
Risks
- tree-sitter grammar coverage varies by language: TypeScript generics and Rust macros may produce malformed chunks that degrade embedding quality.
- Embedding a large monorepo (200k+ LOC) with a CPU-only model can take 10–30 minutes on first run, which will frustrate users expecting instant results.
- MCP server startup latency (model load) adds 2–5 seconds per agent session cold-start, which some users will perceive as the agent ‘hanging’.
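The first-run estimate depends heavily on chunk size and hardware, so it is worth sanity-checking for a specific repo. The arithmetic below is a rough sketch: the function name is hypothetical, and the example values (20 lines per chunk, 10 chunks embedded per second) are illustrative assumptions rather than measurements. Timing a small batch on the target machine would give a real throughput figure to plug in.

```python
import math


def estimate_first_index_minutes(lines_of_code: int,
                                 avg_lines_per_chunk: float,
                                 chunks_per_second: float) -> float:
    """Back-of-envelope time to embed a whole repo once, in minutes."""
    if avg_lines_per_chunk <= 0 or chunks_per_second <= 0:
        raise ValueError("chunk size and throughput must be positive")
    n_chunks = math.ceil(lines_of_code / avg_lines_per_chunk)
    return n_chunks / chunks_per_second / 60


if __name__ == "__main__":
    # Assumed inputs: 200k LOC, ~20 lines per chunk, ~10 chunks/s on CPU.
    minutes = estimate_first_index_minutes(200_000, 20, 10)
    print(f"~{minutes:.1f} minutes for the first full index")
```

Under these assumed inputs the repo splits into 10,000 chunks and the estimate lands inside the 10–30 minute range quoted above. Halving the throughput doubles the estimate.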