Codebase Context Index

A local semantic search layer over your repo that any AI agent can query via a simple MCP server to get high-quality, ranked context instead of relying on naive file reads.

Difficulty: 1-week | Stack: Python, FastAPI, LanceDB, tree-sitter, sentence-transformers, MCP SDK (Python)

Who this is for

Developers who notice their AI agents hallucinating or fetching irrelevant files. The index gives the agent a purpose-built retrieval tool tuned to code structure rather than generic embedding search.

Build steps

Use tree-sitter to parse the repo into semantic chunks: function bodies, class definitions, and docstrings — not arbitrary line windows — and store each chunk with its file path, symbol name, and start/end lines.
Embed each chunk with a local sentence-transformers model (e.g. all-MiniLM-L6-v2) and persist vectors plus metadata in LanceDB, which stores everything as a single directory alongside the repo.
Expose a two-tool MCP server: search_code(query, top_k) returns ranked chunks with file/line references, and get_context(file_path, symbol_name) returns the full chunk for a known symbol.
Write a file-watcher (watchdog) that re-indexes changed files on save so the index stays current without a full rebuild.
Register the MCP server in Claude Code’s .mcp.json and smoke-test it by asking the agent to find ‘where authentication tokens are validated’ — verify it returns the right function, not a README mention.
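
The chunking step can be sketched without any third-party dependency. The example below is a stand-in, not the tree-sitter approach itself: it uses Python's standard-library ast module, so it only handles Python source, whereas tree-sitter would cover other languages. The Chunk record and the chunk_python_source helper are hypothetical names chosen for illustration; the fields mirror the metadata listed in the first step.

```python
import ast
from dataclasses import dataclass


@dataclass
class Chunk:
    # Hypothetical record; fields follow the metadata named in the build steps.
    file_path: str
    symbol_name: str
    start_line: int
    end_line: int
    text: str


def chunk_python_source(source: str, file_path: str) -> list[Chunk]:
    """Return one chunk per function or class definition, nested ones included."""
    lines = source.splitlines()
    chunks = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno and end_lineno are 1-indexed and inclusive.
            text = "\n".join(lines[node.lineno - 1 : node.end_lineno])
            chunks.append(Chunk(file_path, node.name, node.lineno, node.end_lineno, text))
    return sorted(chunks, key=lambda c: c.start_line)


SAMPLE = '''\
class TokenValidator:
    """Checks bearer tokens."""

    def validate(self, token):
        return token.startswith("tok_")


def helper():
    return 1
'''

chunks = chunk_python_source(SAMPLE, "auth/tokens.py")
for c in chunks:
    print(c.file_path, c.symbol_name, c.start_line, c.end_line)
```

A method appears both as its own chunk and inside its class chunk. Whether to keep that overlap or split classes into per-method chunks is a design choice that affects how results rank.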
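
To show the shape of what search_code hands back, here is a minimal ranking sketch in plain Python. It assumes the query and the chunks have already been embedded; in the real build the vectors would come from the sentence-transformers model and the nearest-neighbour search would be delegated to LanceDB. The rank_chunks function, the three-dimensional toy vectors, and the file names are all hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_chunks(query_vec, index, top_k=5):
    """Return the top_k chunk records most similar to query_vec, best first."""
    scored = [{**meta, "score": cosine(query_vec, vec)} for vec, meta in index]
    scored.sort(key=lambda r: r["score"], reverse=True)
    return scored[:top_k]


# Toy vectors for illustration only; a real embedding model emits far more dimensions.
INDEX = [
    ([0.9, 0.1, 0.0], {"file_path": "auth/tokens.py", "symbol_name": "validate",
                       "start_line": 4, "end_line": 5}),
    ([0.1, 0.9, 0.0], {"file_path": "api/routes.py", "symbol_name": "login",
                       "start_line": 12, "end_line": 40}),
    ([0.0, 0.2, 0.9], {"file_path": "db/pool.py", "symbol_name": "connect",
                       "start_line": 10, "end_line": 30}),
]

results = rank_chunks([1.0, 0.0, 0.0], INDEX, top_k=2)
for r in results:
    print(r["file_path"], r["symbol_name"], r["start_line"], r["end_line"])
```

Each result carries the file path and line range alongside the score, so the agent can cite or open the exact span rather than the whole file.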
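
For the re-indexing step, one possible way to avoid re-embedding files whose content did not actually change is a content-hash manifest. The sketch below uses only the standard library and polls a directory; it is an assumption about how the check could work, not the watchdog integration itself, though a watchdog event handler could call the same comparison. The files_to_reindex helper and the manifest layout are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of the file's bytes, as a hex string."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def files_to_reindex(root: Path, manifest: dict[str, str], pattern: str = "*.py") -> list[str]:
    """Return relative paths whose hash differs from the manifest; update it in place."""
    changed = []
    for path in sorted(root.rglob(pattern)):
        rel = path.relative_to(root).as_posix()
        digest = file_digest(path)
        if manifest.get(rel) != digest:
            manifest[rel] = digest
            changed.append(rel)
    return changed


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.py").write_text("def a(): return 1\n")
    (root / "b.py").write_text("def b(): return 2\n")
    manifest = {}
    first = files_to_reindex(root, manifest)   # both files are new
    (root / "a.py").write_text("def a(): return 42\n")
    second = files_to_reindex(root, manifest)  # only a.py changed
    third = files_to_reindex(root, manifest)   # nothing changed

print(first, second, third)
```

This sketch does not handle deleted or renamed files; a fuller version would also drop manifest entries, and the matching index rows, for paths that no longer exist.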

Risks

tree-sitter grammar coverage varies by language: TypeScript generics and Rust macros may produce malformed chunks that degrade embedding quality.
Embedding a large monorepo (200k+ LOC) with a CPU-only model can take 10–30 minutes on first run, which will frustrate users expecting instant results.
MCP server startup latency (model load) adds 2–5 seconds per agent session cold-start, which some users will perceive as the agent ‘hanging’.