On-Device Whisper Fine-Tuner for Noisy Telephony Audio
A local CLI tool that continually fine-tunes a quantized Whisper model on your own audio samples without any data leaving the machine.
Difficulty: weekend | Stack: Python, faster-whisper, PEFT/LoRA, torchaudio, librosa, Click CLI
Who this is for
Healthcare IT developers or researchers who need to adapt a base ASR model to a specific clinic’s audio environment (room noise, accent, equipment) without sending any patient audio to the cloud.
Build steps
- Set up a local pipeline that loads a quantized Whisper-small or Whisper-medium model via faster-whisper, then conditions audio to telephony characteristics (resample to 8 kHz, band-pass 300–3400 Hz) using torchaudio transforms. Whisper expects 16 kHz input, so resample back up to 16 kHz before inference or training.
- Implement a simple on-device data store: each transcription session saves (audio_chunk, corrected_text) pairs as a local SQLite dataset; older samples are pruned with a sliding window to simulate continual learning.
- Apply LoRA adapters (via PEFT) to the Whisper encoder attention layers only, keeping adapter rank low (r=4) so training fits in <8 GB RAM on a CPU/MPS device.
- Write a Click CLI with two commands: transcribe (streams mic or file input through the current adapted model) and adapt (runs a few gradient steps on the local dataset, then checkpoints the adapter weights).
- Add a WER evaluation harness using jiwer against a small held-out set to show before/after adaptation accuracy so the user can see the reality gap closing.
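The on-device data store from the build steps could look roughly like the sketch below. It uses only the standard-library sqlite3 module. The table name, column names, the helper names (open_store, add_sample, load_samples), and the max_samples window size are all assumptions made for illustration, not part of the original design.

```python
import sqlite3
import time

# Hypothetical schema: one row per (audio_chunk, corrected_text) pair.
SCHEMA = """
CREATE TABLE IF NOT EXISTS samples (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    created_at REAL NOT NULL,
    audio BLOB NOT NULL,
    corrected_text TEXT NOT NULL
)
"""


def open_store(path=":memory:"):
    """Open (or create) the local sample store; nothing leaves the machine."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def add_sample(conn, audio_bytes, corrected_text, max_samples=500):
    """Insert one pair, then prune so only the newest max_samples rows remain."""
    with conn:  # commits both statements together, or rolls back on error
        conn.execute(
            "INSERT INTO samples (created_at, audio, corrected_text) "
            "VALUES (?, ?, ?)",
            (time.time(), audio_bytes, corrected_text),
        )
        # Sliding window: delete every row that is not among the newest ones.
        conn.execute(
            "DELETE FROM samples WHERE id NOT IN "
            "(SELECT id FROM samples ORDER BY id DESC LIMIT ?)",
            (max_samples,),
        )


def load_samples(conn):
    """Return (audio_bytes, corrected_text) pairs, oldest first."""
    return conn.execute(
        "SELECT audio, corrected_text FROM samples ORDER BY id"
    ).fetchall()
```

Pruning inside the same transaction as the insert keeps the window size bounded even if the process is interrupted between sessions. A real tool would likely store a file path or compressed audio rather than raw bytes.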
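To make the before/after comparison in the WER harness concrete, here is a dependency-free sketch of what word error rate measures: word-level edit distance divided by the number of reference words. In the actual tool, jiwer.wer(reference, hypothesis) would do this job. The function names below are hypothetical, and text normalization (casing, punctuation) is deliberately left out.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance: substitutions, insertions, deletions."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def corpus_wer(references, hypotheses):
    """Total word errors over total reference words for a held-out set."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    errors = 0
    ref_length = 0
    for reference, hypothesis in zip(references, hypotheses):
        ref_words = reference.split()
        errors += edit_distance(ref_words, hypothesis.split())
        ref_length += len(ref_words)
    if ref_length == 0:
        raise ValueError("references contain no words")
    return errors / ref_length
```

Running corpus_wer once with the base model's transcripts and once with the adapted model's transcripts, against the same held-out references, gives the two numbers the harness is meant to report.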
Risks
- Catastrophic forgetting: LoRA with too high a learning rate will overfit the local audio and degrade recognition of general vocabulary. Mitigate with a conservative learning-rate schedule plus either EWC (elastic weight consolidation) regularization or a replay buffer.
- Telephony simulation fidelity: Without real 8 kHz telephony recordings, simulated degradation (SoX filters) may not reproduce the actual codec artifacts, making WER improvements look better than they are in deployment.
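- Quantized model + PEFT compatibility: faster-whisper runs on CTranslate2, an inference-only runtime that doesn’t support gradient updates. You’ll need to maintain a separate float32 PyTorch model for adaptation and a quantized copy for inference, doubling disk footprint; the adapted weights also have to be merged and re-converted to CTranslate2 format after each adapt run.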
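For the catastrophic-forgetting risk, the replay-buffer mitigation can be sketched as a batch sampler that mixes local clinic samples with a fixed share of general-domain samples. This is only one way to do it. The function name, the replay_fraction default, and the idea of a held-back general-domain set are assumptions, not something the design above specifies.

```python
import random


def mixed_batches(local, replay, batch_size=8, replay_fraction=0.25,
                  steps=10, seed=0):
    """Yield training batches that blend local and replay samples.

    local  -- list of samples from the clinic's own audio
    replay -- list of general-domain samples kept to preserve base behavior
    """
    if not local or not replay:
        raise ValueError("both local and replay sets must be non-empty")
    rng = random.Random(seed)  # seeded so adapt runs are reproducible
    n_replay = round(batch_size * replay_fraction)
    n_local = batch_size - n_replay
    for _ in range(steps):
        # Sample with replacement so small local sets still fill a batch.
        batch = rng.choices(local, k=n_local) + rng.choices(replay, k=n_replay)
        rng.shuffle(batch)
        yield batch
```

Each adapt step would then compute its loss on one mixed batch, so the adapter keeps seeing general speech while it specializes. Raising replay_fraction trades adaptation speed for stability.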