Project Ideas
Buildable projects inspired by the latest AI frontier research.
Agent Behavior Pattern Library (ADRA-Bank Clone)
A personal catalogue of recorded agent trajectories—tagged by failure mode—that you can replay, diff, and query to understand why an agent regressed between versions.
Agent PII Sentinel
A proxy layer that intercepts and redacts PII before an autonomous web agent submits it to any endpoint.
Agent Session Archivist
A CLI tool that captures, tags, and links AI coding-session transcripts to the git commits they produced.
Agentic PR Review Bot
GitHub App that assigns sub-tasks from an open PR to a Codex agent: write missing tests, fix lint errors, suggest refactors — then pushes results as commits.
Agentic Task Runner with Hardware-Aware Model Routing
A local-first agent framework that automatically routes sub-tasks to the largest model your current hardware can run without hitting swap, falling back to API only when necessary.
AGI Takeoff Speed Simulator
An interactive model that lets users tune self-improvement parameters and visualize domain-specific capability growth curves over time.
AI Claim Veracity Auditor
A CLI tool that takes a forwarded AI success/failure story URL and returns a structured evidentiary scorecard — sourcing quality, baseline presence, metric specificity — to fight Slack-channel narrative drift.
AI Policy Tracker & Stance Comparator
Aggregate and diff official AI policy statements from major labs to surface where they agree, diverge, or shift over time.
AI Velocity Ledger
CLI tool that instruments a dev's git history to estimate time saved by AI-assisted commits and generates a weekly velocity report.
Architecture-Aware Model Router
A drop-in OpenAI-compatible proxy that routes each incoming request to the cheapest model that can meet a declared latency SLA, using live throughput telemetry per model backend.
Async Codex Task Dashboard
Local web UI that dispatches parallel Codex agents against a GitHub repo and tracks their async progress in real time.
Backdoor Trigger Generalization Stress-Tester
Red-team harness that probes whether a backdoor defense trained on known triggers fails when trigger surface, position, or paraphrase shifts.
Benchmark Blindspot Detector
A research tool that ingests any LLM benchmark's task descriptions and flags which tasks have weak verification properties — predicting where benchmark scores are least trustworthy.
Branch-Aware Trajectory Sampler for Multi-Turn Agents
A training data generation pipeline that samples branching rollouts from an LLM agent, stores them as a tree, and exports step-level preference pairs for DPO/RLHF fine-tuning.
Code-to-Math Problem Synthesizer
Feed a Python function and get back a set of math olympiad-style problems whose solutions require understanding that function's logic.
Codebase Context Index
A local semantic search layer over your repo that any AI agent can query via a simple MCP server to get high-quality, ranked context instead of relying on naive file reads.
Confidence-Gated Distillation Trainer
A training script that replicates confidence-gated teacher distillation, filtering noisy teacher tokens before they reach the student model.
Constraint-Violation Detector for Robot Trajectory Descriptions
A CLI tool that feeds constraint-sensitive natural-language instructions into an open-source world model and flags predicted outcomes that violate stated physical constraints.
Context Vocabulary Scope Visualizer
Interactive tool that shows, token by token, how small the 'active' vocabulary really is for any given prompt.
Cooperative SFT+RL Interleaving Scheduler
A modular training loop that dynamically schedules SFT and GRPO/RLVR updates per sample based on real-time difficulty estimates, replacing naive loss mixing.
CoT Graph Compressor
A Streamlit app that converts a model's chain-of-thought trace into a Mermaid reasoning graph, lets you prune redundant nodes, and re-injects the compressed graph as a structured prompt prefix.
Counterfactual Consistency Probe for Vision-Language Models
Automatically test whether a VLM used for robot planning produces physically consistent predictions under counterfactual instructions.
Critic-Generator Research Agent
A two-agent loop in which a critic refines search queries and a generator synthesizes the results, with the two roles kept disentangled per the AgentDisCo pattern.
Cross-Problem Failure Memory for Coding Agents
Give a coding agent persistent retrieval of past failure traces so it avoids repeating mistakes across LeetCode-style problems.
Cultural Commonsense Probe Harness
A CLI tool that stress-tests any LLM against culturally grounded commonsense questions you author, then surfaces per-language failure heatmaps.
CulturalBench: Automated Cultural-Knowledge Probe for LLMs
A benchmark harness that generates and scores culture-specific Q&A probes for a chosen language/region, revealing where a model's cultural blind spots are before deployment.
CultureCaptions: Native-Sourced Image-Text Collector
A lightweight web tool that lets native speakers submit and annotate culturally specific image-caption pairs to build WAON-style adaptation datasets.
Data-Residency Compliance Checker for AI Pipelines
A developer tool that statically and dynamically audits Python AI application code to flag any LLM API calls that would send sensitive data off-device, with a report mapped to common compliance frameworks.
Decision Log Weaver
A GitHub Action that automatically generates a structured 'Decision Log' entry from a PR's linked agent transcripts, appended to a DECISIONS.md file in the repo.
Dense Reward Agent Trainer: From Sparse Outcomes to Step Signals
A modular RL fine-tuning harness for open-source LLMs that automatically synthesizes dense, step-level reward signals from sparse end-of-trajectory outcomes using a learned critic.
Depth-Memory Spatial Q&A
Upload a short phone video of a room, let the app reconstruct a point cloud, then ask spatial questions ('what is left of the chair?') answered by querying geometry rather than raw pixels.
Developer Session Productivity Estimator
Capture real coding sessions, prompt devs to estimate counterfactual time-without-AI, and compute a calibrated rlog productivity metric.
Dialect-Adaptive ASR Benchmark Dashboard
A local web dashboard that benchmarks multiple small ASR models across audio clips grouped by dialect/accent tag, surfacing which model has the smallest reality gap for your specific speaker population.
Domain Capability Ceiling Tracker
A dashboard that monitors benchmark progress across AI capability domains and automatically flags when a domain appears to be hitting a data or evaluation bottleneck.
Domain-Specialized Offline Assistant via Synthetic Fine-Tuning
Fine-tune a small open-weight model on a narrow regulated domain using cloud-generated synthetic data, then deploy fully air-gapped.
Enterprise AI Adoption Tracker
A dashboard that aggregates and scores public signals of enterprise AI product-market fit (pricing announcements, contract filings, job postings) to surface inflection trends early.
Evolving-World Memory Probe
A harness that stress-tests an LLM agent's memory by feeding it facts that contradict earlier ones, then measuring recall at write/maintain/retrieve granularity.
Financial-Stakes Agent Eval Harness
Run LLM agents in sandboxed environments with fake-but-realistic dollar constraints and log emergent deceptive behaviors.
Hidden-State Lie Detector
A CLI tool that probes an LLM's internal residual stream to flag when its stated answer contradicts its internal representation.
Hybrid Moderation Queue
A content moderation service that routes high-confidence decisions to an LLM and escalates uncertain cases to a lightweight human review dashboard, with latency and accuracy telemetry.
InfoDensity Reasoning Compressor
Fine-tune a small LLM to produce information-dense CoT traces using an RL reward that combines token efficiency and correctness.
Information-Weighted Video Frame Compressor for Vision LLMs
A preprocessing layer that scores and discards low-information visual tokens from video frames before they reach a vision LLM, cutting prompt length and latency.
Interactive Algorithm Visualizer from Paper Abstract
Paste an arXiv abstract and get a runnable, step-through visualization of the algorithm it describes.
Language-Agnostic SWE Mini-Bench Runner
A local benchmark executor that pulls real GitHub issues across Python, TypeScript, and Go repos, sandboxes each in Docker, runs a code agent, then verifies the patch with the repo's own test suite.
Latent-State Streaming Chat UI
Build a streaming chat interface that shows a 'thinking indicator' driven by real concurrent reasoning tokens, not a spinner hack.
LLM Architecture Throughput Benchmarker
A CLI tool that stress-tests multiple open-weight models under concurrent load and surfaces tokens/sec, latency percentiles, and cost-per-token side by side.
Local Inference Benchmark Dashboard
A cross-platform CLI + web dashboard that benchmarks LLM inference speed, memory bandwidth, and tokens/sec across Apple Silicon, Grace-Blackwell, and CUDA laptops.
Logic Drift Detector
A CLI tool that queries an LLM about a knowledge graph or JSON schema and uses an SMT solver to flag logical inconsistencies in the response.
Long-Context Local RAG Without Chunking
A document Q&A system that exploits a Mamba-2 hybrid model's long-context efficiency to ingest whole files instead of splitting them.
LowResAdapt: Principled LoRA Fine-Tuning CLI for Low-Resource Languages
A command-line toolkit that recommends and executes staged LoRA fine-tuning of a multilingual base model for a target low-resource language, with compute-budget guidance baked in.
Mini RoboTrustBench: Four-Scenario Robustness Suite for Pluggable World Models
A self-contained evaluation harness that runs any video world model through all four RoboTrustBench scenario types and produces a per-category robustness scorecard.
Modality Gap Probe
A tool that stress-tests a VLM by varying font, resolution, and background of rendered text to find the rendering recipe that minimises the pixel-text vs token-text accuracy gap.
Model Edit Reversal Curse Auditor
A testing harness that applies knowledge edits to a model and automatically checks whether the edit propagates to reversed and paraphrased queries.
Multi-Agent Safety Debate Arena
A framework where two specialized LLM agents debate whether a proposed agent action is safe, producing a structured safety verdict without human red-teamers.
Multi-Hop RAG with Evolving Evidence Tracker
A multi-hop question-answering tool that maintains a running 'evidence ledger' across retrieval iterations to avoid contradicting or re-fetching already-established facts.
Multi-Tenant Personalization Sidecar API
A standalone microservice that any app can call to retrieve a compact, cacheable user embedding and an automatically generated system-prompt injection for personalized LLM calls.
Multimodal RAG Evaluator
An evaluation harness that checks whether a RAG pipeline correctly grounds answers in retrieved audio/video content, not just text chunks.
Natural-Language-to-Simulation Scenario Expander for Embodied AI
Give it a plain-English scenario ('robot arm retrieves a tipped-over bottle from a wet countertop') and it outputs fully parameterized simulation configs plus Cosmos-3-validated synthetic observation videos for training embodied agents.
Natural-Language Video Edit Agent
An agent that accepts plain-English editing instructions ('tighten the opening, cut the awkward pause at 0:42, add a zoom on the product') and executes them as real FFmpeg operations.
Negation Ablation Sandbox
An interactive notebook that lets you ablate late-layer attention heads in a transformer and watch negation accuracy change in real time.
Novelty Memory Bot for Your Reading List
A CLI tool that scores each new paper you add against your personal reading history, flagging genuine novelty vs. incremental rehash.
Observational Equivalence Test Generator
Automatically generate pairs of test cases that are surface-identical but semantically different, to catch agents gaming shallow checks.
On-Device Private Code Reviewer with Nemotron Ultra
A git pre-push hook that runs Nemotron Ultra locally via llama.cpp and outputs a structured JSON review of your diff before it leaves your machine.
On-Device Whisper Fine-Tuner for Noisy Telephony Audio
A local CLI tool that continually fine-tunes a quantized Whisper model on your own audio samples without any data leaving the machine.
Ontology-Grounded Agent Compliance Checker
An agent that validates its own tool calls and outputs against a domain ontology before returning results.
Persistent Persona Chatbot with Compressed Session Memory
A FastAPI chatbot that summarizes each session into a codebook-quantized user profile, then retrieves and injects it on the next visit—keeping context costs flat regardless of history length.
Personal Workflow Distiller
Use a frontier model to shadow your daily digital work for a week and produce a fine-tuned system prompt that turns a local Ollama model into your personal assistant.
Physical Plausibility Filter for Synthetic Video Datasets
A pipeline that ingests synthetic video clips, scores each clip's temporal coherence and physical plausibility using a video foundation model, and culls low-quality samples before they enter a training set.
Physics-Regime Gym Wrapper
Wrap multiple physics simulators (MuJoCo, PyBullet, Brax) behind a unified Gymnasium interface so one agent trains across varied physical regimes.
Pipe-level Token Filter for Agent CLIs
A configurable stdin→stdout filter that strips low-signal CLI output before it hits your LLM context.
Privacy-First Desktop Automation Agent
Natural-language task runner for GUI automation using a locally hosted computer-use model; screen data never leaves the machine.
Privacy-Preserving Federated ASR Adapter Aggregator
A minimal federated learning server that aggregates LoRA adapter updates from multiple edge clinic nodes without ever collecting raw audio, then redistributes an improved shared adapter.
Probe-Based Topic Coherence Benchmark Generator
A library and REST API that auto-generates held-out evaluation probes distinguishing thematic vs. taxonomic coherence for any topic model, replacing monolithic NPMI with axis-aware metrics.
Probe Format Confounder Benchmark
Minimal benchmark that tests whether a linear probe is detecting reasoning type or just task format by swapping MCQ/open-ended wrappers around identical logic problems.
RAG Parser Canary Suite
A test harness that stress-tests document parsers (PDFs, HTML, DOCX) for silent extraction failures and measures downstream retrieval factual accuracy.
Regulatory Landscape Briefing Bot
A Slack or web bot that answers 'what does current AI regulation say about X?' by grounding answers in tracked legislative texts across jurisdictions.
Rendering-Aware Document Preprocessor
A drop-in FastAPI microservice that receives a scanned-document image, tries a small set of pre-baked rendering transforms, picks the one that scores highest on a quick VLM confidence probe, and returns the enhanced image.
Repo Pattern Guard
A pre-commit + CI tool that flags bad coding patterns before they become permanent training context for your coding agent.
RL Environment Spec Generator
A web app that takes a natural-language task description and generates a complete reinforcement learning environment specification — reward function, observation space, termination conditions, and verification harness.
Robot Task MCP Server
MCP server that exposes a simulated robot arm (via PyBullet) as tools so any MCP-compatible LLM agent can plan and execute pick-and-place tasks.
Rubric-Driven Creative Quality Scorer
A web app that lets you define structured rubrics for subjective creative tasks and compares rubric-based LLM scoring against pairwise human preference to reveal where each method diverges.
Rule Induction Arena
A text-adventure benchmark harness that generates hidden-rule puzzles, runs multiple LLMs through them, and scores rule-induction capability across difficulty tiers.
Sandboxed Agent Workbench
A local orchestration harness that spins up isolated Docker containers per agent task and tracks memory, tool calls, and session state across runs — a personal Devin-lite infrastructure.
Self-Improvement Loop Sandbox
A small automated research pipeline where a language model iteratively rewrites its own few-shot prompts and measures whether downstream task performance actually improves run-over-run.
Semantic Geometry Side-by-Side Viewer
A Streamlit app that runs LDA and BERTopic on the same uploaded corpus and visually contrasts the thematic vs. taxonomic signature of each model's topics using psycholinguistic benchmark anchors.
Semantic RAG Cache Layer
A drop-in caching middleware for RAG pipelines that reuses retrieval plans for semantically similar queries.
Session Memory Consolidation Service
A background service that compresses and consolidates agent conversation history into a structured memory store, injected as a compact context prefix on next session.
SFT-RL Sample Gating Dashboard
A local tool that labels training samples by difficulty and visualizes which should receive SFT vs RL gradient updates.
Sparse vs. Dense Attention Diff Visualizer
An interactive web app that loads a small transformer, lets you toggle between full and DeepSeek-style sparse attention masks, and shows in real time which token pairs are dropped and how outputs shift.
Speculative Decoding Accelerator with Dynamic Top-K Projection
Prototype that implements the NanoSpec core idea — dynamically shrinking the draft model's vocabulary projection at inference time — and benchmarks the speedup.
Steering Vector Leakage Auditor
CLI tool that measures cross-concept contamination when applying activation steering vectors to a local LLM.
StepPO Visualizer: Agentic Credit Assignment Explorer
An interactive tool that runs a small LLM agent on multi-step tasks and visualizes how step-level vs token-level reward signals differ across a trajectory.
Storyboard-to-Video Agentic Pipeline
Give an LLM a script and let it plan scenes, generate each clip via API, and stitch them into a coherent short video autonomously.
Style-Codebook Writing Assistant
A CLI tool that learns your writing style, compresses it into a codebook embedding, and injects it as a compact prefix into every LLM call—no prompt bloat.
Tool-Teaching Benchmark Harness
An empirical testbed that measures how many examples an agent needs before it can reliably use a tool it has never seen.
Topic Semantic Axis Auditor
A CLI tool that scores each topic from a trained model on a thematic-relatedness vs. taxonomic-similarity axis so researchers know what geometry their model actually learned.
Trace-Level Agent Safety Monitor
A tool that records multi-step agent execution traces and runs heuristic + LLM-based checks to flag dangerous action sequences before they complete.
Unauthorized-Attribution Detector for AI Lab Claims
Monitor news and social media for third-party claims that invoke AI lab authority, and flag ones the labs haven't endorsed.
Unlearning Provenance Probe
A CLI tool that stress-tests whether an unlearning method actually erased pretraining knowledge versus only SFT-injected facts.
Verifiability Scorer for Personal Task Lists
A CLI tool that analyzes your to-do list and scores each task by how automatable it is using Verifier's Law heuristics.
Vertical AI PMF Benchmark Builder
A lightweight survey + analytics app that lets teams inside a specific industry vertical (e.g., health AI) self-report AI deployment metrics, then anonymously benchmarks them against peers to expose where PMF is real vs. aspirational.
Video State Tracker CLI
CLI tool that feeds a video + structured question set to a multimodal LLM and scores its temporal state-tracking accuracy against ground-truth annotations.
Video World-State Agent with Persistent Character Memory
A stateful agent that maintains a structured 'world model' (characters, props, locations, timeline) across a multi-session video project and uses it to enforce continuity in every new generation call.
Vocabulary-Free Sparse Retrieval Experimenter
A benchmarking harness that compares standard BM25 sparse retrieval against learned sparse methods (SPLADE, FLOPS-regularized models) on a custom document corpus to quantify the vocabulary-mismatch problem.
Watermark Robustness Sandbox
An interactive web tool that lets you embed a token-level watermark into LLM output, then attack it with paraphrasing and synonym substitution to measure survival rate.
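To make the embed-attack-measure loop of the Watermark Robustness Sandbox concrete, here is a minimal sketch. The listing does not name a watermarking scheme, so everything below is an assumption: a toy "green-list" watermark keyed on the previous token, a synthetic vocabulary in place of a real LLM, and random token replacement standing in for synonym substitution. All names (`VOCAB`, `GAMMA`, `is_green`, `generate`, `z_score`, `substitute`) are hypothetical.

```python
import hashlib
import math
import random

# Assumption: a toy vocabulary stands in for a real tokenizer/LLM.
VOCAB = [f"w{i}" for i in range(200)]
GAMMA = 0.5  # expected fraction of tokens marked "green" at each step


def is_green(prev_token: str, token: str) -> bool:
    """Hash the (previous token, token) pair to decide green-list membership."""
    digest = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
    return digest[0] < 256 * GAMMA


def generate(n_tokens: int, watermark: bool, seed: int = 0) -> list[str]:
    """Sample tokens uniformly; when watermarking, sample only from green tokens."""
    rng = random.Random(seed)
    tokens = ["<s>"]
    for _ in range(n_tokens):
        candidates = VOCAB
        if watermark:
            green = [t for t in VOCAB if is_green(tokens[-1], t)]
            candidates = green or VOCAB  # fall back if the green list is empty
        tokens.append(rng.choice(candidates))
    return tokens[1:]


def z_score(tokens: list[str]) -> float:
    """Standardized count of green tokens; large values suggest a watermark."""
    prev, hits = "<s>", 0
    for tok in tokens:
        hits += is_green(prev, tok)
        prev = tok
    n = len(tokens)
    return (hits - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))


def substitute(tokens: list[str], rate: float, seed: int = 1) -> list[str]:
    """Crude attack: replace each token with a random one with probability `rate`."""
    rng = random.Random(seed)
    return [rng.choice(VOCAB) if rng.random() < rate else t for t in tokens]


if __name__ == "__main__":
    marked = generate(200, watermark=True)
    plain = generate(200, watermark=False)
    print("watermarked z:", round(z_score(marked), 2))
    print("plain z:      ", round(z_score(plain), 2))
    for rate in (0.1, 0.3, 0.5):
        print(f"attack rate {rate}: z =", round(z_score(substitute(marked, rate)), 2))
```

Sweeping the attack rate and plotting the detector's z-score against it gives a rough "survival curve"; a real sandbox would swap in an actual model's logits and a paraphrasing model for the attack.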