AI Pulse

Project Ideas

Buildable projects inspired by the latest AI frontier research.

1-week

Agent Behavior Pattern Library (ADRA-Bank Clone)

A personal catalogue of recorded agent trajectories—tagged by failure mode—that you can replay, diff, and query to understand why an agent regressed between versions.

Python FastAPI SQLite + SQLAlchemy Pydantic
weekend

Agent PII Sentinel

A proxy layer that intercepts and redacts PII before an autonomous web agent submits it to any endpoint.

Python mitmproxy Playwright spaCy
weekend

Agent Session Archivist

A CLI tool that captures, tags, and links AI coding-session transcripts to the git commits they produced.

Python Click SQLite GitPython
1-month

Agentic PR Review Bot

GitHub App that assigns sub-tasks from an open PR to a Codex agent: write missing tests, fix lint errors, suggest refactors — then pushes results as commits.

TypeScript Node.js Octokit OpenAI Codex API
1-week

Agentic Task Runner with Hardware-Aware Model Routing

A local-first agent framework that automatically routes sub-tasks to the largest model your current hardware can run without hitting swap, falling back to API only when necessary.

Python llama-cpp-python MLX-LM LangGraph
weekend

AGI Takeoff Speed Simulator

An interactive model that lets users tune self-improvement parameters and visualize domain-specific capability growth curves over time.

Python Streamlit NumPy Matplotlib
weekend

AI Claim Veracity Auditor

A CLI tool that takes a forwarded AI success/failure story URL and returns a structured evidentiary scorecard — sourcing quality, baseline presence, metric specificity — to fight Slack-channel narrative drift.

Python Click Claude API (claude-sonnet-4-5) Jinja2
weekend

AI Policy Tracker & Stance Comparator

Aggregate and diff official AI policy statements from major labs to surface where they agree, diverge, or shift over time.

Python FastAPI SQLite BeautifulSoup/feedparser
1-week

AI Velocity Ledger

CLI tool that instruments a dev's git history to estimate time-saved by AI-assisted commits and generates a weekly velocity report.

Python Click GitPython OpenAI API
1-week

Architecture-Aware Model Router

A drop-in OpenAI-compatible proxy that routes each incoming request to the cheapest model that can meet a declared latency SLA, using live throughput telemetry per model backend.

Python FastAPI httpx Redis
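
The routing rule described for this project can be sketched in a few lines. This is a minimal illustration, not a reference design: the `Backend` fields, the per-1k-token pricing unit, and the latency estimate (expected output tokens divided by observed throughput) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Backend:
    """Hypothetical record of one model backend and its live telemetry."""
    name: str
    cost_per_1k_tokens: float  # assumed pricing unit
    tokens_per_sec: float      # most recent observed throughput


def pick_backend(
    backends: Sequence[Backend],
    expected_tokens: int,
    sla_seconds: float,
) -> Optional[Backend]:
    """Return the cheapest backend whose estimated latency meets the SLA.

    Latency is approximated as expected_tokens / tokens_per_sec, which
    ignores queueing and time-to-first-token (a simplifying assumption).
    """
    eligible = [
        b for b in backends
        if b.tokens_per_sec > 0
        and expected_tokens / b.tokens_per_sec <= sla_seconds
    ]
    # default=None covers the case where no backend can meet the SLA.
    return min(eligible, key=lambda b: b.cost_per_1k_tokens, default=None)
```

A real proxy would refresh `tokens_per_sec` from the telemetry store (Redis in the listed stack) on every request and decide separately what to do when `pick_backend` returns `None`.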
weekend

Async Codex Task Dashboard

Local web UI that dispatches parallel Codex agents against a GitHub repo and tracks their async progress in real time.

Python FastAPI OpenAI Codex API WebSockets
1-week

Backdoor Trigger Generalization Stress-Tester

Red-team harness that probes whether a backdoor defense trained on known triggers fails when trigger surface, position, or paraphrase shifts.

Python HuggingFace Transformers PEFT/LoRA BadNL or custom trigger injection scripts
1-week

Benchmark Blindspot Detector

A research tool that ingests any LLM benchmark's task descriptions and flags which tasks have weak verification properties — predicting where benchmark scores are least trustworthy.

Python FastAPI Pandas Claude API
1-week

Branch-Aware Trajectory Sampler for Multi-Turn Agents

A training data generation pipeline that samples branching rollouts from an LLM agent, stores them as a tree, and exports step-level preference pairs for DPO/RLHF fine-tuning.

Python FastAPI SQLite (via SQLModel) Hugging Face Transformers
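
One way to picture the export step: walk the rollout tree and emit a preference pair for every two sibling steps that scored differently. The sketch below assumes each node already carries a scalar `value` (for example, the mean outcome of rollouts through it); the `Node` class and its field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Node:
    """Hypothetical rollout-tree node: one agent step plus its score."""
    action: str
    value: float  # assumed scalar estimate for the subtree under this step
    children: List["Node"] = field(default_factory=list)


def preference_pairs(
    node: Node, prefix: Tuple[str, ...] = ()
) -> Iterator[Tuple[Tuple[str, ...], str, str]]:
    """Yield (context, chosen, rejected) for sibling steps with unequal values."""
    context = prefix + (node.action,)
    kids = node.children
    for i in range(len(kids)):
        for j in range(i + 1, len(kids)):
            a, b = kids[i], kids[j]
            if a.value == b.value:
                continue  # ties carry no preference signal
            chosen, rejected = (a, b) if a.value > b.value else (b, a)
            yield context, chosen.action, rejected.action
    # Recurse so pairs are produced at every depth of the tree.
    for child in kids:
        yield from preference_pairs(child, context)
```

Each yielded tuple maps onto the prompt/chosen/rejected layout that preference-tuning datasets commonly use, with `context` standing in for the shared trajectory prefix.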
weekend

Code-to-Math Problem Synthesizer

Feed a Python function and get back a set of math olympiad-style problems whose solutions require understanding that function's logic.

Python Claude API (claude-sonnet) Streamlit SymPy (optional, for answer verification)
1-week

Codebase Context Index

A local semantic search layer over your repo that any AI agent can query via a simple MCP server to get high-quality, ranked context instead of relying on naive file reads.

Python FastAPI LanceDB tree-sitter
1-week

Confidence-Gated Distillation Trainer

A training script that replicates confidence-gated teacher distillation, filtering noisy teacher tokens before they reach the student model.

Python PyTorch Hugging Face Transformers TRL
1-week

Constraint-Violation Detector for Robot Trajectory Descriptions

A CLI tool that feeds constraint-sensitive natural-language instructions into an open-source world model and flags predicted outcomes that violate stated physical constraints.

Python FastAPI Hugging Face Transformers OpenCV
weekend

Context Vocabulary Scope Visualizer

Interactive tool that shows, token by token, how small the 'active' vocabulary really is for any given prompt.

Python Transformers (HuggingFace) Gradio Plotly
1-month

Cooperative SFT+RL Interleaving Scheduler

A modular training loop that dynamically schedules SFT and GRPO/RLVR updates per sample based on real-time difficulty estimates, replacing naive loss mixing.

Python PyTorch TRL vLLM
1-week

CoT Graph Compressor

A Streamlit app that converts a model's chain-of-thought trace into a Mermaid reasoning graph, lets you prune redundant nodes, and re-injects the compressed graph as a structured prompt prefix.

Python Streamlit Anthropic SDK Mermaid.js (via streamlit-mermaid)
weekend

Counterfactual Consistency Probe for Vision-Language Models

Automatically test whether a VLM used for robot planning produces physically consistent predictions under counterfactual instructions.

Python LiteLLM PIL pandas
weekend

Critic-Generator Research Agent

A two-agent loop in which a critic refines search queries and a generator synthesizes the results, with the two roles disentangled per the AgentDisCo pattern.

Python LangGraph Anthropic Claude API Tavily Search API
weekend

Cross-Problem Failure Memory for Coding Agents

Give a coding agent persistent retrieval of past failure traces so it avoids repeating mistakes across LeetCode-style problems.

Python LangGraph or bare asyncio Claude claude-sonnet-4-5 or GPT-4o via API Chroma (local vector DB)
weekend

Cultural Commonsense Probe Harness

A CLI tool that stress-tests any LLM against culturally-grounded commonsense questions you author, then surfaces per-language failure heatmaps.

Python LiteLLM Pydantic Rich (CLI)
1-week

CulturalBench: Automated Cultural-Knowledge Probe for LLMs

A benchmark harness that generates and scores culture-specific Q&A probes for a chosen language/region, revealing where a model's cultural blind spots are before deployment.

Python OpenAI / Anthropic SDK (for probe generation) LangChain or raw HTTP pandas
weekend

CultureCaptions: Native-Sourced Image-Text Collector

A lightweight web tool that lets native speakers submit and annotate culturally-specific image-caption pairs to build WAON-style adaptation datasets.

Python FastAPI SQLite Sentence-Transformers (multilingual CLIP)
1-week

Data-Residency Compliance Checker for AI Pipelines

A developer tool that statically and dynamically audits Python AI application code to flag any LLM API calls that would send sensitive data off-device, with a report mapped to common compliance frameworks.

Python AST module Presidio (Microsoft) Rich (terminal UI)
1-week

Decision Log Weaver

A GitHub Action that automatically generates a structured 'Decision Log' entry from a PR's linked agent transcripts, appended to a DECISIONS.md file in the repo.

TypeScript GitHub Actions Octokit Anthropic Claude API
1-month

Dense Reward Agent Trainer: From Sparse Outcomes to Step Signals

A modular RL fine-tuning harness for open-source LLMs that automatically synthesizes dense, step-level reward signals from sparse end-of-trajectory outcomes using a learned critic.

Python PyTorch Hugging Face TRL + PEFT vLLM
1-week

Depth-Memory Spatial Q&A

Upload a short phone video of a room, let the app reconstruct a point cloud, then ask spatial questions ('what is left of the chair?') answered by querying geometry rather than raw pixels.

Python FastAPI Depth Anything v2 (HuggingFace Transformers) Open3D
1-month

Developer Session Productivity Estimator

Capture real coding sessions, prompt devs to estimate counterfactual time-without-AI, and compute a calibrated rlog productivity metric.

TypeScript VS Code Extension API Node.js Postgres
1-week

Dialect-Adaptive ASR Benchmark Dashboard

A local web dashboard that benchmarks multiple small ASR models across audio clips grouped by dialect/accent tag, surfacing which model has the smallest reality gap for your specific speaker population.

Python FastAPI Whisper.cpp (Python bindings) SQLite
1-week

Domain Capability Ceiling Tracker

A dashboard that monitors benchmark progress across AI capability domains and automatically flags when a domain appears to be hitting a data or evaluation bottleneck.

Python FastAPI SQLite APScheduler
1-month

Domain-Specialized Offline Assistant via Synthetic Fine-Tuning

Fine-tune a small open-weight model on a narrow regulated domain using cloud-generated synthetic data, then deploy fully air-gapped.

Python Axolotl or TRL for LoRA fine-tuning Claude API or GPT-4o for synthetic data generation (one-time) Qwen-2.5 7B or Mistral 7B as base
1-week

Enterprise AI Adoption Tracker

A dashboard that aggregates and scores public signals of enterprise AI product-market fit (pricing announcements, contract filings, job postings) to surface inflection trends early.

Python FastAPI SQLite HTMX
weekend

Evolving-World Memory Probe

A harness that stress-tests an LLM agent's memory by feeding it facts that contradict earlier ones, then measuring recall at write/maintain/retrieve granularity.

Python LangGraph SQLite pytest
1-week

Financial-Stakes Agent Eval Harness

Run LLM agents in sandboxed environments with fake-but-realistic dollar constraints and log emergent deceptive behaviors.

Python FastAPI SQLite Docker
weekend

Hidden-State Lie Detector

A CLI tool that probes an LLM's internal residual stream to flag when its stated answer contradicts its internal representation.

Python TransformerLens HuggingFace Transformers scikit-learn
1-week

Hybrid Moderation Queue

A content moderation service that routes high-confidence decisions to an LLM and escalates uncertain cases to a lightweight human review dashboard, with latency and accuracy telemetry.

Python FastAPI OpenAI API (or Claude) Redis
1-week

InfoDensity Reasoning Compressor

Fine-tune a small LLM to produce information-dense CoT traces, using an RL reward that combines token efficiency and correctness.

Python PyTorch trl (TRL library) Hugging Face transformers
1-week

Information-Weighted Video Frame Compressor for Vision LLMs

A preprocessing layer that scores and discards low-information visual tokens from video frames before they reach a vision LLM, cutting prompt length and latency.

Python OpenCV Pillow transformers (LLaVA or InternVL via HuggingFace)
1-week

Interactive Algorithm Visualizer from Paper Abstract

Paste an arXiv abstract and get a runnable, step-through visualization of the algorithm it describes.

Python FastAPI Claude API (claude-sonnet) React
1-week

Language-Agnostic SWE Mini-Bench Runner

A local benchmark executor that pulls real GitHub issues across Python, TypeScript, and Go repos, sandboxes each in Docker, runs a code agent, then verifies the patch with the repo's own test suite.

Python Docker SDK GitPython Claude or GPT-4o function-calling
weekend

Latent-State Streaming Chat UI

Build a streaming chat interface that shows a 'thinking indicator' driven by real concurrent reasoning tokens, not a spinner hack.

TypeScript Next.js Vercel AI SDK Claude claude-sonnet-4-5 (extended thinking mode)
weekend

LLM Architecture Throughput Benchmarker

A CLI tool that stress-tests multiple open-weight models under concurrent load and surfaces tokens/sec, latency percentiles, and cost-per-token side by side.

Python asyncio httpx ollama
weekend

Local Inference Benchmark Dashboard

A cross-platform CLI + web dashboard that benchmarks LLM inference speed, memory bandwidth, and tokens/sec across Apple Silicon, Grace-Blackwell, and CUDA laptops.

Python llama.cpp (Python bindings) MLX (Apple) FastAPI
weekend

Logic Drift Detector

A CLI tool that queries an LLM about a knowledge graph or JSON schema and uses an SMT solver to flag logical inconsistencies in the response.

Python NetworkX Z3 (Microsoft SMT solver) Anthropic SDK
1-week

Long-Context Local RAG Without Chunking

Document Q&A system that exploits Mamba-2 hybrid model's long-context efficiency to ingest whole files instead of splitting them.

Python llama.cpp (GGUF) or transformers Nemotron-Ultra or any Mamba-2 hybrid GGUF LangChain or raw inference loop
1-week

LowResAdapt: Principled LoRA Fine-Tuning CLI for Low-Resource Languages

A command-line toolkit that recommends and executes staged LoRA fine-tuning of a multilingual base model for a target low-resource language, with compute-budget guidance baked in.

Python HuggingFace Transformers PEFT (LoRA) datasets
1-month

Mini RoboTrustBench: Four-Scenario Robustness Suite for Pluggable World Models

A self-contained evaluation harness that runs any video world model through all four RoboTrustBench scenario types and produces a per-category robustness scorecard.

Python PyTorch Hugging Face Datasets OpenCV
weekend

Modality Gap Probe

A tool that stress-tests a VLM by varying font, resolution, and background of rendered text to find the rendering recipe that minimises the pixel-text vs token-text accuracy gap.

Python Pillow OpenAI GPT-4o or Claude 3.5 Sonnet API Gradio
1-week

Model Edit Reversal Curse Auditor

A testing harness that applies knowledge edits to a model and automatically checks whether the edit propagates to reversed and paraphrased queries.

Python FastAPI ROME / MEMIT (model editing libraries) HuggingFace Transformers
1-week

Multi-Agent Safety Debate Arena

A framework where two specialized LLM agents debate whether a proposed agent action is safe, producing a structured safety verdict without human red-teamers.

Python Claude API (claude-sonnet-4-5) Pydantic FastAPI
1-week

Multi-Hop RAG with Evolving Evidence Tracker

A multi-hop question-answering tool that maintains a running 'evidence ledger' across retrieval iterations to avoid contradicting or re-fetching already-established facts.

Python LlamaIndex OpenAI API (GPT-4o) Pydantic
1-month

Multi-Tenant Personalization Sidecar API

A standalone microservice that any app can call to retrieve a compact, cacheable user embedding and automatically-generated system-prompt injection for personalized LLM calls.

Python FastAPI PostgreSQL (pgvector) sentence-transformers
1-week

Multimodal RAG Evaluator

An evaluation harness that checks whether a RAG pipeline correctly grounds answers in retrieved audio/video content, not just text chunks.

Python OpenAI Whisper (transcription) LlamaIndex FastAPI
1-month

Natural-Language-to-Simulation Scenario Expander for Embodied AI

Give it a plain-English scenario ('robot arm retrieves a tipped-over bottle from a wet countertop') and it outputs fully parameterized simulation configs plus Cosmos-3-validated synthetic observation videos for training embodied agents.

Python FastAPI Nemotron 3 Ultra (structured-output mode) NVIDIA Isaac Sim Python API or PyBullet
1-week

Natural-Language Video Edit Agent

An agent that accepts plain-English editing instructions ('tighten the opening, cut the awkward pause at 0:42, add a zoom on the product') and executes them as real FFmpeg operations.

Python Claude API with tool use FFmpeg (via subprocess) Whisper (OpenAI) for transcript grounding
weekend

Negation Ablation Sandbox

An interactive notebook that lets you ablate late-layer attention heads in a transformer and watch negation accuracy change in real time.

Python TransformerLens Jupyter / Marimo Plotly
weekend

Novelty Memory Bot for Your Reading List

A CLI tool that scores each new paper you add against your personal reading history, flagging genuine novelty vs. incremental rehash.

Python Claude API (claude-haiku for embeddings/scoring) SQLite + sqlite-vec or ChromaDB arXiv API
weekend

Observational Equivalence Test Generator

Automatically generate pairs of test cases that are surface-identical but semantically different, to catch agents gaming shallow checks.

Python ast hypothesis OpenAI API or Anthropic SDK
weekend

On-Device Private Code Reviewer with Nemotron Ultra

A git pre-push hook that runs Nemotron Ultra locally via llama.cpp and outputs a structured JSON review of your diff before it leaves your machine.

Python llama.cpp (GGUF backend) Nemotron-3-Ultra-GGUF weights Click CLI
weekend

On-Device Whisper Fine-Tuner for Noisy Telephony Audio

A local CLI tool that continually fine-tunes a quantized Whisper model on your own audio samples without any data leaving the machine.

Python faster-whisper PEFT/LoRA torchaudio
1-week

Ontology-Grounded Agent Compliance Checker

An agent that validates its own tool calls and outputs against a domain ontology before returning results.

Python FastAPI owlready2 Claude API (tool use)
1-week

Persistent Persona Chatbot with Compressed Session Memory

A FastAPI chatbot that summarizes each session into a codebook-quantized user profile, then retrieves and injects it on the next visit—keeping context costs flat regardless of history length.

Python FastAPI Claude API (with prompt caching) sentence-transformers
1-week

Personal Workflow Distiller

Use a frontier model to shadow your daily digital work for a week and produce a fine-tuned system prompt that turns a local Ollama model into your personal assistant.

Python Ollama (Llama 3.2 3B or Phi-4-mini) Claude API (claude-opus-4 for distillation) SQLite
1-week

Physical Plausibility Filter for Synthetic Video Datasets

A pipeline that ingests synthetic video clips, scores each clip's temporal coherence and physical plausibility using a video foundation model, and culls low-quality samples before they enter a training set.

Python PyTorch Hugging Face `transformers` (VideoLlava or Cosmos 3 API) FFmpeg-python
1-week

Physics-Regime Gym Wrapper

Wrap multiple physics simulators (MuJoCo, PyBullet, Brax) behind a unified Gymnasium interface so one agent trains across varied physical regimes.

Python Gymnasium MuJoCo PyBullet
weekend

Pipe-level Token Filter for Agent CLIs

A configurable stdin→stdout filter that strips low-signal CLI output before it hits your LLM context.

Python Click regex/AST rules pytest
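
The core of such a filter is small. The sketch below shows one possible rule set: drop lines that match a low-signal pattern and collapse consecutive duplicates. The patterns in `DROP_PATTERNS` are illustrative assumptions, not a recommended default; a real tool would load them from configuration.

```python
import re
from typing import Iterable, Iterator, Sequence

# Hypothetical low-signal patterns, chosen only for illustration.
DROP_PATTERNS = [
    re.compile(r"^\s*$"),                       # blank lines
    re.compile(r"^\s*\d{1,3}%\s*\|"),           # progress bars such as " 42% |###"
    re.compile(r"^(Downloading|Collecting) "),  # package-installer chatter
]


def filter_lines(
    lines: Iterable[str],
    patterns: Sequence[re.Pattern] = DROP_PATTERNS,
) -> Iterator[str]:
    """Yield lines that match no drop pattern, skipping consecutive repeats."""
    previous = None
    for line in lines:
        text = line.rstrip("\n")
        if any(p.search(text) for p in patterns):
            continue
        if text == previous:
            continue  # collapse runs of identical lines
        previous = text
        yield text
```

To use it as a pipe stage, iterate `filter_lines(sys.stdin)` and print each kept line; because the function is a generator, output streams as input arrives.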
weekend

Privacy-First Desktop Automation Agent

Natural-language task runner for GUI automation using a locally-hosted computer-use model — screen data never leaves the machine.

Python Holo3.1 (via Ollama or HF transformers) PyAutoGUI or pygetwindow PIL for screenshots
1-month

Privacy-Preserving Federated ASR Adapter Aggregator

A minimal federated learning server that aggregates LoRA adapter updates from multiple edge clinic nodes without ever collecting raw audio, then redistributes an improved shared adapter.

Python PyTorch PEFT Flower (flwr)
1-month

Probe-Based Topic Coherence Benchmark Generator

A library and REST API that auto-generates held-out evaluation probes distinguishing thematic vs. taxonomic coherence for any topic model, replacing monolithic NPMI with axis-aware metrics.

Python FastAPI PostgreSQL sentence-transformers
1-week

Probe Format Confounder Benchmark

Minimal benchmark that tests whether a linear probe is detecting reasoning type or just task format by swapping MCQ/open-ended wrappers around identical logic problems.

Python HuggingFace Transformers scikit-learn datasets
1-week

RAG Parser Canary Suite

A test harness that stress-tests document parsers (PDFs, HTML, DOCX) for silent extraction failures and measures downstream retrieval factual accuracy.

Python LangChain / LlamaIndex pdfplumber pytest
1-week

Regulatory Landscape Briefing Bot

A Slack or web bot that answers 'what does current AI regulation say about X?' by grounding answers in tracked legislative texts across jurisdictions.

Python LangChain or LlamaIndex pgvector + PostgreSQL FastAPI
weekend

Rendering-Aware Document Preprocessor

A drop-in FastAPI microservice that receives a scanned-document image, tries a small set of pre-baked rendering transforms, picks the one that scores highest on a quick VLM confidence probe, and returns the enhanced image.

Python FastAPI Pillow pdf2image
weekend

Repo Pattern Guard

A pre-commit + CI tool that flags bad coding patterns before they become permanent training context for your coding agent.

Python GitPython Claude API (claude-haiku-4) pre-commit framework
1-week

RL Environment Spec Generator

A web app that takes a natural-language task description and generates a complete reinforcement learning environment specification — reward function, observation space, termination conditions, and verification harness.

Next.js TypeScript Vercel AI SDK shadcn/ui
1-week

Robot Task MCP Server

MCP server that exposes a simulated robot arm (via PyBullet) as tools so any MCP-compatible LLM agent can plan and execute pick-and-place tasks.

Python FastMCP PyBullet Anthropic SDK
weekend

Rubric-Driven Creative Quality Scorer

A web app that lets you define structured rubrics for subjective creative tasks and compares rubric-based LLM scoring against pairwise human preference to reveal where each method diverges.

Python FastAPI Anthropic SDK HTMX
1-week

Rule Induction Arena

A text-adventure benchmark harness that generates hidden-rule puzzles, runs multiple LLMs through them, and scores rule-induction capability across difficulty tiers.

Python Pydantic Anthropic SDK OpenAI SDK
1-month

Sandboxed Agent Workbench

A local orchestration harness that spins up isolated Docker containers per agent task and tracks memory, tool calls, and session state across runs — a personal Devin-lite infrastructure.

Python Docker SDK FastAPI Redis (session memory)
1-week

Self-Improvement Loop Sandbox

A small automated research pipeline where a language model iteratively rewrites its own few-shot prompts and measures whether downstream task performance actually improves run-over-run.

Python OpenAI API (or Anthropic API) LangChain SQLite
1-week

Semantic Geometry Side-by-Side Viewer

A Streamlit app that runs LDA and BERTopic on the same uploaded corpus and visually contrasts the thematic vs. taxonomic signature of each model's topics using psycholinguistic benchmark anchors.

Python gensim BERTopic sentence-transformers
weekend

Semantic RAG Cache Layer

A drop-in caching middleware for RAG pipelines that reuses retrieval plans for semantically similar queries.

Python FastAPI LangChain or LlamaIndex Redis
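
The lookup at the heart of this idea is a nearest-neighbour check against previously seen query embeddings. The sketch below keeps everything in memory and uses plain cosine similarity; the `PlanCache` name, the 0.9 threshold, and the linear scan are assumptions for illustration, and the embedding model is left to the caller.

```python
import math
from typing import Any, List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class PlanCache:
    """Hypothetical in-memory cache mapping query embeddings to retrieval plans."""

    def __init__(self, threshold: float = 0.9) -> None:
        self.threshold = threshold  # minimum similarity counted as a hit
        self._entries: List[Tuple[Tuple[float, ...], Any]] = []

    def put(self, query_vec: Sequence[float], plan: Any) -> None:
        self._entries.append((tuple(query_vec), plan))

    def get(self, query_vec: Sequence[float]) -> Optional[Any]:
        """Return the plan of the most similar cached query, or None on a miss."""
        best_plan, best_score = None, self.threshold
        for vec, plan in self._entries:
            score = cosine(query_vec, vec)
            if score >= best_score:
                best_plan, best_score = plan, score
        return best_plan
```

A production version would likely swap the linear scan for a vector index and add expiry, since a cached plan can go stale when the underlying corpus changes.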
1-week

Session Memory Consolidation Service

A background service that compresses and consolidates agent conversation history into a structured memory store, injected as a compact context prefix on next session.

Python FastAPI SQLite + sqlite-vec Claude API (haiku for compression)
weekend

SFT-RL Sample Gating Dashboard

A local tool that labels training samples by difficulty and visualizes which should receive SFT vs RL gradient updates.

Python Hugging Face Transformers Datasets Gradio
1-week

Sparse vs. Dense Attention Diff Visualizer

An interactive web app that loads a small transformer, lets you toggle between full and DeepSeek-style sparse attention masks, and shows in real time which token pairs are dropped and how outputs shift.

Python PyTorch Transformers (HuggingFace) Gradio
1-week

Speculative Decoding Accelerator with Dynamic Top-K Projection

Prototype that implements the NanoSpec core idea — dynamically shrinking the draft model's vocabulary projection at inference time — and benchmarks the speedup.

Python PyTorch Transformers triton (optional for kernel)
weekend

Steering Vector Leakage Auditor

CLI tool that measures cross-concept contamination when applying activation steering vectors to a local LLM.

Python TransformerLens nnsight Llama-3.2-1B or Gemma-2B
weekend

StepPO Visualizer: Agentic Credit Assignment Explorer

An interactive tool that runs a small LLM agent on multi-step tasks and visualizes how step-level vs token-level reward signals differ across a trajectory.

Python LangGraph Gradio OpenAI API (or local Ollama)
weekend

Storyboard-to-Video Agentic Pipeline

Give an LLM a script and let it plan scenes, generate each clip via API, and stitch them into a coherent short video autonomously.

Python Claude API (claude-3-5-sonnet) Replicate API (Wan2.1 or LTX-Video) FFmpeg
weekend

Style-Codebook Writing Assistant

A CLI tool that learns your writing style, compresses it into a codebook embedding, and injects it as a compact prefix into every LLM call—no prompt bloat.

Python sentence-transformers scikit-learn Anthropic Claude API
1-week

Tool-Teaching Benchmark Harness

An empirical testbed that measures how many examples an agent needs before it can reliably use a tool it has never seen.

Python Claude API OpenAI API pytest
weekend

Topic Semantic Axis Auditor

A CLI tool that scores each topic from a trained model on a thematic-relatedness vs. taxonomic-similarity axis so researchers know what geometry their model actually learned.

Python gensim sentence-transformers NLTK/WordNet
1-week

Trace-Level Agent Safety Monitor

A tool that records multi-step agent execution traces and runs heuristic + LLM-based checks to flag dangerous action sequences before they complete.

Python LangChain SQLite Claude API (claude-sonnet-4-5)
1-week

Unauthorized-Attribution Detector for AI Lab Claims

Monitor news and social media for third-party claims that invoke AI lab authority, and flag ones the labs haven't endorsed.

Python FastAPI PostgreSQL NewsAPI or GDELT
weekend

Unlearning Provenance Probe

A CLI tool that stress-tests whether an unlearning method actually erased pretraining knowledge versus only SFT-injected facts.

Python HuggingFace Transformers datasets rich (CLI)
weekend

Verifiability Scorer for Personal Task Lists

A CLI tool that analyzes your to-do list and scores each task by how automatable it is using Verifier's Law heuristics.

Python Typer Rich Claude API (claude-sonnet-4-5)
1-month

Vertical AI PMF Benchmark Builder

A lightweight survey + analytics app that lets teams inside a specific industry vertical (e.g., health AI) self-report AI deployment metrics, then anonymously benchmarks them against peers to expose where PMF is real vs. aspirational.

Next.js Supabase Postgres Vercel
weekend

Video State Tracker CLI

CLI tool that feeds a video + structured question set to a multimodal LLM and scores its temporal state-tracking accuracy against ground-truth annotations.

Python Claude claude-opus-4-5 API (vision) OpenCV JSONL
1-month

Video World-State Agent with Persistent Character Memory

A stateful agent that maintains a structured 'world model' (characters, props, locations, timeline) across a multi-session video project and uses it to enforce continuity in every new generation call.

Python FastAPI Claude API (extended thinking for scene reasoning) Replicate API (video + image gen)
weekend

Vocabulary-Free Sparse Retrieval Experimenter

A benchmarking harness that compares standard BM25 sparse retrieval against learned sparse methods (SPLADE, FLOPS-regularized models) on a custom document corpus to quantify the vocabulary-mismatch problem.

Python Pyserini SPLADE (HuggingFace) datasets (HuggingFace)
1-week

Watermark Robustness Sandbox

An interactive web tool that lets you embed a token-level watermark into LLM output, then attack it with paraphrasing and synonym substitution to measure survival rate.

Python FastAPI HuggingFace Transformers Next.js