Dialect-Adaptive ASR Benchmark Dashboard

A local web dashboard that benchmarks multiple small ASR models across audio clips grouped by dialect/accent tag, surfacing which model has the smallest reality gap for your specific speaker population.

Difficulty: 1-week | Stack: Python, FastAPI, whisper.cpp (Python bindings), SQLite, React, Recharts, jiwer, ffmpeg-python

Who this is for

Researchers or clinical informatics teams who need to pick the right base model before investing in on-device adaptation. The dashboard can save weeks of manual benchmarking across accent and dialect subgroups.

Build steps

Build an audio ingestion API (FastAPI) that accepts audio uploads with metadata fields: dialect_tag, speaker_id, reference_transcript, channel_type (telephony or studio). Normalize every upload to 16 kHz mono WAV with ffmpeg-python and record it in SQLite.
Implement a model registry with pluggable backends (Whisper tiny/base/small via whisper.cpp Python bindings, plus optional Vosk offline models) so users can register additional models without changing the benchmark code.
Run each registered model over the dataset and compute per-dialect WER and CER with jiwer, plus RTF (real-time factor: processing time divided by audio duration) from wall-clock timing of each transcription, since jiwer only scores text. Store results in a benchmark_runs table keyed by (model_id, dialect_tag, run_timestamp).
Build a React dashboard with Recharts grouped bar charts showing WER by dialect per model, a scatter plot of WER vs. RTF (the accuracy/speed tradeoff), and a sample viewer that plays the audio and shows a diff of the hypothesis against the reference.
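Add a ‘continual adaptation simulation’ mode that replays the dataset in chronological order (this needs a recording or upload timestamp per clip, which the ingestion metadata fields listed above do not include) and plots WER over time for each model, visualizing drift vs. stability, the core insight from the paper.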
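
As a rough illustration of the scoring step, the sketch below aggregates WER per dialect and writes one row per (model_id, dialect_tag, run_timestamp) into benchmark_runs. It is a minimal sketch under stated assumptions, not the project's actual implementation: to stay dependency-free it swaps jiwer for a small word-level edit distance (in the real project jiwer.wer and jiwer.cer would do the scoring), and the function names, the clip tuple layout, and the wer column are hypothetical.

```python
import sqlite3
import time
from collections import defaultdict


def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (word-level edit distance, number of reference words).

    Stand-in for jiwer so the sketch runs without third-party packages."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def per_dialect_wer(clips, hypotheses):
    """clips: iterable of (clip_id, dialect_tag, reference_transcript).
    hypotheses: dict mapping clip_id to one model's output text.

    Errors and reference words are summed per dialect before dividing,
    so long clips are not under-weighted by averaging per-clip WERs."""
    errors, words = defaultdict(int), defaultdict(int)
    for clip_id, dialect, reference in clips:
        e, n = word_errors(reference, hypotheses[clip_id])
        errors[dialect] += e
        words[dialect] += n
    return {d: errors[d] / words[d] for d in words if words[d]}


def store_run(conn, model_id, wer_by_dialect, run_timestamp=None):
    """Write one benchmark_runs row per dialect (column layout is assumed)."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS benchmark_runs (
               model_id TEXT NOT NULL,
               dialect_tag TEXT NOT NULL,
               run_timestamp REAL NOT NULL,
               wer REAL NOT NULL,
               PRIMARY KEY (model_id, dialect_tag, run_timestamp))"""
    )
    ts = time.time() if run_timestamp is None else run_timestamp
    conn.executemany(
        "INSERT INTO benchmark_runs VALUES (?, ?, ?, ?)",
        [(model_id, d, ts, w) for d, w in wer_by_dialect.items()],
    )
    conn.commit()
```

Keeping the table flat, with one row per model and dialect per run, is enough for the grouped bar chart and the WER vs. RTF scatter plot; CER and RTF would be further columns on the same row.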

Risks

Obtaining dialectally diverse audio: without a corpus like Gram Vaani (license-restricted), you may only have one accent, which makes the dialect comparison meaningless. Plan to source CC0-licensed speech from Mozilla Common Voice as a substitute, using its self-reported accent metadata where speakers provided it.
RTF measurement variance: CPU benchmarks are noisy; without pinning CPU affinity and controlling for thermal throttling, RTF numbers can vary by 30–50% between runs, which undermines the ‘on-device feasibility’ conclusion.
UI complexity creep: the benchmark logic is the core value — resist building a full experiment management system; a flat SQLite schema with run IDs is sufficient and keeps the project shippable in a week.
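
One way to soften the RTF variance risk is to repeat each measurement and report the median together with the spread, so a noisy run is visible rather than silently trusted. The following is a minimal sketch under stated assumptions: measure_rtf and its parameters are hypothetical names, transcribe stands for any callable that runs a model on an audio file, and the CPU pinning relies on os.sched_setaffinity, which is only available on Linux. It does nothing about thermal throttling, which has to be handled outside the process.

```python
import os
import statistics
import time


def measure_rtf(transcribe, audio_path, audio_seconds, repeats=5, warmup=1, cpu=None):
    """Median and spread of RTF (processing time / audio duration).

    transcribe is a placeholder callable taking an audio path; it is not
    a real library API."""
    if cpu is not None and hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, {cpu})  # Linux-only: pin this process to one core
    for _ in range(warmup):
        transcribe(audio_path)  # untimed: warms caches and lazy model loading
    rtfs = []
    for _ in range(repeats):
        start = time.perf_counter()
        transcribe(audio_path)
        rtfs.append((time.perf_counter() - start) / audio_seconds)
    median = statistics.median(rtfs)
    return {
        "median_rtf": median,
        "min_rtf": min(rtfs),
        "max_rtf": max(rtfs),
        # (max - min) / median: a quick flag for runs too noisy to trust
        "relative_spread": (max(rtfs) - min(rtfs)) / median if median else 0.0,
    }
```

Storing relative_spread next to the median lets the dashboard grey out or flag points on the WER vs. RTF scatter plot whose timing was unstable.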