On-Device Whisper Fine-Tuner for Noisy Telephony Audio

A local CLI tool that continually adapts a Whisper model to your own audio samples with LoRA and serves the result as a quantized model, without any audio leaving the machine.

Difficulty: weekend | Stack: Python, faster-whisper, PEFT/LoRA, torchaudio, librosa, Click CLI

Who this is for

Healthcare IT developers or researchers who need to adapt a base ASR model to a specific clinic’s audio environment (room noise, accent, equipment) without sending any patient audio to the cloud.

Build steps

Set up a local pipeline that loads a quantized Whisper-small or Whisper-medium model via faster-whisper, then conditions every input to the telephony band with torchaudio (resample to 8 kHz, band-pass filter 300–3400 Hz) and resamples back to the 16 kHz that Whisper expects as input.
Implement a simple on-device data store: each transcription session saves (audio_chunk, corrected_text) pairs to a local SQLite database; older samples are pruned with a sliding window so the dataset stays bounded and tracks recent recording conditions, which approximates continual learning.
Apply LoRA adapters (via PEFT) to the Whisper encoder attention layers only, keeping adapter rank low (r=4) so training fits in <8 GB RAM on a CPU/MPS device.
Write a Click CLI with two commands: transcribe (streams mic or file input through the current adapted model) and adapt (runs a few gradient steps on the local dataset, then checkpoints the adapter weights).
Add a WER evaluation harness using jiwer against a small held-out set, reporting word error rate before and after adaptation so the user can see whether the gap to their own audio is actually closing.
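The on-device data store from the build steps can be sketched with nothing but the standard library. This is a minimal illustration, not a definitive design: the table name samples, the helper names open_store, add_sample and load_samples, and the max_samples default are all hypothetical, and audio is assumed to arrive as raw bytes.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the local sample store.

    `path` is a hypothetical location; ":memory:" keeps the sketch self-contained.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS samples ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " audio BLOB NOT NULL,"
        " corrected_text TEXT NOT NULL)"
    )
    return conn


def add_sample(conn, audio_bytes, corrected_text, max_samples=500):
    """Insert one (audio, text) pair, then keep only the newest `max_samples` rows."""
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute(
            "INSERT INTO samples (audio, corrected_text) VALUES (?, ?)",
            (audio_bytes, corrected_text),
        )
        # Sliding window: delete every row that is not among the newest ones.
        conn.execute(
            "DELETE FROM samples WHERE id NOT IN ("
            " SELECT id FROM samples ORDER BY id DESC LIMIT ?)",
            (max_samples,),
        )


def load_samples(conn):
    """Return the retained (audio, corrected_text) pairs, oldest first."""
    return conn.execute(
        "SELECT audio, corrected_text FROM samples ORDER BY id"
    ).fetchall()
```

Pruning inside the same transaction as the insert means the store never exceeds the window, even if the process is interrupted between sessions. The adapt command would read its training pairs through load_samples.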

Risks

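Catastrophic forgetting: with too high a learning rate, the LoRA adapter will overfit the local audio and degrade general vocabulary. Mitigate this with a conservative learning-rate schedule plus either EWC (elastic weight consolidation) regularization or a replay buffer of general-domain samples.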
Telephony simulation fidelity: Without real 8 kHz telephony recordings, simulated degradation (SoX filters) may not reproduce the actual codec artifacts, making WER improvements look better than they are in deployment.
Quantized model + PEFT compatibility: faster-whisper runs on CTranslate2, which is inference-only and cannot compute gradients. You’ll need to keep a separate float32 PyTorch model for adaptation, then merge the adapter into it and reconvert to a quantized CTranslate2 copy for inference after each adapt run, which more than doubles the disk footprint.
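The replay buffer mentioned under catastrophic forgetting can be as simple as mixing a fixed share of general-domain samples into every adaptation batch. The sketch below assumes both pools are plain Python lists of training items; the function name build_adapt_batch and the replay_fraction default are hypothetical choices, not tuned values.

```python
import random


def build_adapt_batch(local_samples, replay_samples, batch_size=8,
                      replay_fraction=0.25, seed=None):
    """Mix fresh local samples with general-domain replay samples for one step.

    `replay_fraction` is the share of the batch drawn from the replay pool;
    the rest comes from the local (clinic-specific) pool.
    """
    if not 0.0 <= replay_fraction <= 1.0:
        raise ValueError("replay_fraction must be in [0, 1]")
    rng = random.Random(seed)
    # Never ask for more items than a pool actually holds.
    n_replay = min(len(replay_samples), round(batch_size * replay_fraction))
    n_local = min(len(local_samples), batch_size - n_replay)
    batch = rng.sample(local_samples, n_local) + rng.sample(replay_samples, n_replay)
    rng.shuffle(batch)  # avoid a fixed local-then-replay ordering
    return batch
```

Raising replay_fraction trades adaptation speed for retention of general vocabulary; the held-out WER check is one way to see which side of that trade a given setting lands on.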