Dialect-Adaptive ASR Benchmark Dashboard
A local web dashboard that benchmarks multiple small ASR models across audio clips grouped by dialect/accent tag, surfacing which model has the smallest reality gap for your specific speaker population.
Difficulty: 1-week | Stack: Python, FastAPI, Whisper.cpp (Python bindings), SQLite, React, Recharts, jiwer, ffmpeg-python
Who this is for
Researchers or clinical informatics teams who need to pick the right base model before investing in on-device adaptation — saves weeks of manual benchmarking across accent/dialect subgroups.
Build steps
- Build an audio ingestion API (FastAPI) that accepts audio uploads with metadata fields: dialect_tag, speaker_id, reference_transcript, channel_type (telephony/studio). Normalize every clip to 16 kHz mono WAV with ffmpeg-python and store the records in SQLite.
- Implement a model registry supporting pluggable backends — Whisper tiny/base/small via whisper.cpp Python bindings, plus optional Vosk offline models — so users can add models without code changes.
- Run each registered model over the dataset and compute per-dialect WER and CER with jiwer, plus RTF (real-time factor: processing time divided by audio duration) from wall-clock timing; store results in a benchmark_runs table keyed by (model_id, dialect_tag, run_timestamp).
- Build a React dashboard with Recharts grouped bar charts showing WER by dialect per model, a scatter plot of WER vs. RTF (the accuracy/speed tradeoff), and a sample viewer that plays audio and shows a hypothesis vs. reference diff.
- Add a ‘continual adaptation simulation’ mode that replays the dataset in chronological order and plots WER over time for each model, visualizing drift vs. stability — the core insight from the paper.
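The registry and per-dialect scoring steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual design: the names MODEL_REGISTRY, register_model, word_errors, and run_benchmark are hypothetical, backends are plain callables that would wrap whisper.cpp or Vosk in a real build, and a pure-Python word-level edit distance stands in for jiwer so the sketch has no third-party dependencies. Only WER is stored here; CER and RTF columns are left out for brevity.

```python
import sqlite3
import time
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Hypothetical registry: model_id -> callable mapping a WAV path to a transcript.
MODEL_REGISTRY: Dict[str, Callable[[str], str]] = {}


def register_model(model_id: str, transcribe: Callable[[str], str]) -> None:
    """Register a backend; a real one would wrap whisper.cpp or Vosk."""
    MODEL_REGISTRY[model_id] = transcribe


def word_errors(reference: str, hypothesis: str) -> Tuple[int, int]:
    """Return (word-level edit distance, reference word count).

    Stand-in for jiwer: whitespace tokenization only, no text normalization.
    """
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1], len(ref)


def run_benchmark(conn: sqlite3.Connection,
                  clips: List[Tuple[str, str, str]]) -> float:
    """Score every registered model on clips of (wav_path, dialect_tag, reference).

    WER is pooled per dialect: total errors divided by total reference words.
    """
    conn.execute(
        """CREATE TABLE IF NOT EXISTS benchmark_runs (
               model_id      TEXT NOT NULL,
               dialect_tag   TEXT NOT NULL,
               run_timestamp REAL NOT NULL,
               wer           REAL NOT NULL,
               PRIMARY KEY (model_id, dialect_tag, run_timestamp)
           )"""
    )
    run_timestamp = time.time()
    for model_id, transcribe in MODEL_REGISTRY.items():
        totals = defaultdict(lambda: [0, 0])  # dialect -> [errors, ref words]
        for wav_path, dialect_tag, reference in clips:
            errors, n_words = word_errors(reference, transcribe(wav_path))
            totals[dialect_tag][0] += errors
            totals[dialect_tag][1] += n_words
        for dialect_tag, (errors, n_words) in totals.items():
            wer = errors / n_words if n_words else 0.0
            conn.execute(
                "INSERT INTO benchmark_runs VALUES (?, ?, ?, ?)",
                (model_id, dialect_tag, run_timestamp, wer),
            )
    conn.commit()
    return run_timestamp
```

Pooling errors and reference words per dialect before dividing keeps short clips from dominating the score, which averaging per-clip WER would not. In the real project the word_errors helper would presumably be replaced by jiwer, with whatever text normalization the reference transcripts require.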
Risks
- Obtaining dialectally diverse audio: without a corpus like Gram Vaani (license-restricted), you may only have one accent, which makes the dialect comparison meaningless. Plan to source Creative Commons (CC0) multilingual speech data from Mozilla Common Voice as a substitute.
- RTF measurement variance: CPU benchmarks are noisy; without pinning CPU affinity and controlling for thermal throttling, RTF numbers can vary by 30–50% between runs, which undermines any ‘on-device feasibility’ conclusion drawn from a single run.
- UI complexity creep: the benchmark logic is the core value — resist building a full experiment management system; a flat SQLite schema with run IDs is sufficient and keeps the project shippable in a week.
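One way to keep the RTF variance risk visible is to time each model several times and report the median alongside the spread, rather than a single number. The sketch below assumes that approach; time_runs and summarize_rtf are hypothetical helper names, and the callable passed in stands for whatever function runs one model over one clip.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_runs(transcribe: Callable[[], object],
              repeats: int = 5, warmup: int = 1) -> List[float]:
    """Time repeated calls with a monotonic clock, discarding warm-up runs."""
    for _ in range(warmup):
        transcribe()  # warm caches and lazy model loading; not timed
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        transcribe()
        timings.append(time.perf_counter() - start)
    return timings


def summarize_rtf(timings: List[float], audio_seconds: float) -> Dict[str, float]:
    """Convert timings to RTF (processing time / audio duration) and summarize."""
    rtfs = sorted(t / audio_seconds for t in timings)
    median = statistics.median(rtfs)
    return {
        "median_rtf": median,
        "min_rtf": rtfs[0],
        "max_rtf": rtfs[-1],
        # (max - min) / median: a rough indicator of run-to-run noise
        "relative_spread": (rtfs[-1] - rtfs[0]) / median if median else 0.0,
    }
```

A large relative_spread is a signal to rerun under more controlled conditions before trusting the number. On Linux, os.sched_setaffinity can pin the benchmark process to fixed cores; it is not available on every platform, so it is left out of the sketch.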