How to Reduce Hallucinations in Voice AI Transcription
The Hallucination Problem
You transcribe a 10-second silent audio file, and the model outputs: "Subtitles by Amara.org" or "Thank you for watching."
This is the hallucination problem in ASR (Automatic Speech Recognition). Unlike traditional GMM-HMM systems, which output nothing when no acoustic features match, Transformer-based models (like Whisper) are generative: they are essentially language models conditioned on audio. When the audio signal is weak or silent, the language-model prior takes over, predicting the most probable text it saw during training—often subtitle credits.
Root Causes
- Weak Supervision: Whisper was trained on internet video data. Many training samples contained silence or music while the subtitles displayed "Copyright" or "Subscribe" text. The model learned to associate silence with these tokens.
- Decoder Drift: In autoregressive decoding, if the model predicts one wrong token, it feeds that back into itself, potentially causing a feedback loop of repetition.
Engineering Mitigation Strategies
For researchers and developers, here are proven methods to reduce hallucinations.
1. Voice Activity Detection (VAD) Pre-filtering
The most effective method is to simply not feed silence to the model.
- Implementation: Use a lightweight VAD model (like Silero VAD or WebRTC VAD) to segment the audio.
- Logic: `if is_speech(chunk): transcribe(chunk) else: return ""`
- Result: Eliminates 90% of silence-induced hallucinations.
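The gating logic above can be sketched as follows. This is a minimal illustration: the energy-based `is_speech` check is a simplified stand-in for a trained VAD such as Silero VAD or WebRTC VAD, and `transcribe` is a hypothetical ASR call you would supply.

```python
import math

def is_speech(chunk, threshold=0.01):
    """Crude energy-based voice activity check on a list of float samples.

    A real pipeline would use a trained VAD (e.g. Silero VAD); this RMS
    threshold is only a placeholder to show where the gate sits.
    """
    if not chunk:
        return False
    rms = math.sqrt(sum(s * s for s in chunk) / len(chunk))
    return rms > threshold

def transcribe_with_vad(chunks, transcribe):
    """Forward only speech chunks to the ASR model; silence yields ''."""
    return [transcribe(c) if is_speech(c) else "" for c in chunks]
```

Because silent chunks never reach the decoder, the language-model prior has nothing to hallucinate over.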
2. Log-Probability Thresholding
Whisper outputs a log-probability score for its generated tokens. Hallucinations often (but not always) have lower confidence scores than valid speech.
- Strategy: If the average log-probability of a segment is below a threshold (e.g., -1.0), discard the segment or mark it for review.
- Caveat: Sometimes the model is very confident in its hallucination (e.g., log-prob > -0.5). This metric alone is insufficient.
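A sketch of this filter, assuming segments shaped like Whisper's output dicts with an `avg_logprob` field (the -1.0 threshold is the heuristic from the text; tune it per domain):

```python
def filter_by_logprob(segments, threshold=-1.0):
    """Split segments into confident ones and ones flagged for review.

    `segments` is a list of dicts with an `avg_logprob` key, as produced
    by Whisper-style decoders.
    """
    kept, flagged = [], []
    for seg in segments:
        (kept if seg["avg_logprob"] >= threshold else flagged).append(seg)
    return kept, flagged
```

Per the caveat, treat this as one signal among several rather than a standalone gate.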
3. Compression Ratio Filtering
Hallucinations often manifest as repetitive loops (e.g., "The The The The").
- Metric: Calculate the compression ratio (length of the text / size of its gzip-compressed form).
- Logic: Repetitive text compresses extremely well, so a high ratio suggests a repetition loop. Discard segments with a compression ratio > 2.4 (OpenAI's default heuristic).
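A minimal sketch of the heuristic using the standard-library `zlib` (equivalent to gzip's DEFLATE for this purpose); the 2.4 cutoff mirrors Whisper's default `compression_ratio_threshold`:

```python
import zlib

def compression_ratio(text: str) -> float:
    """Original byte length divided by zlib-compressed byte length."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))

def looks_like_repetition(text: str, threshold: float = 2.4) -> bool:
    """Highly compressible text is likely a degenerate repetition loop."""
    return compression_ratio(text) > threshold
```

Note that very short segments compress poorly due to fixed header overhead, so the check is most meaningful on segments of a sentence or more.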
4. Prompt Engineering / Conditioning
You can condition the decoder by providing the previous segment's text as a prompt.
- Risk: If the previous segment was a hallucination, you poison the context for the next segment.
- Fix: Only use high-confidence previous segments as context prompts.
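The fix can be sketched as a confidence gate on the context prompt. Here `transcribe` is a hypothetical function accepting an optional `prompt` argument, segments carry `avg_logprob` as in Whisper's output, and the -0.7 cutoff is an illustrative assumption:

```python
def transcribe_stream(chunks, transcribe, min_logprob=-0.7):
    """Transcribe chunks in order, conditioning each call on the previous
    segment's text only when that segment was decoded with high confidence."""
    results = []
    prompt = None
    for chunk in chunks:
        seg = transcribe(chunk, prompt=prompt)
        results.append(seg)
        # Gate the context: a low-confidence (possibly hallucinated)
        # segment is dropped rather than fed forward as a prompt.
        prompt = seg["text"] if seg["avg_logprob"] >= min_logprob else None
    return results
```

Resetting the prompt to `None` after a suspect segment trades a little continuity for not poisoning every subsequent decode.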
5. Beam Search with Repetition Penalty
Standard greedy decoding selects the highest probability token. Beam search explores multiple paths.
- Tweak: Apply a penalty to tokens that have already been generated in the current window. This strongly discourages the model from entering a repetition loop.
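The penalty step can be sketched in isolation. This follows the common CTRL-style scheme (divide positive scores, multiply negative ones, for tokens already emitted); the token IDs and the 1.2 penalty value are illustrative, and real decoders apply this over a logits tensor rather than a Python list:

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Down-weight next-token scores for tokens already generated.

    `logits` is a list of raw scores indexed by token ID; `generated_ids`
    are the token IDs emitted so far in the current window.
    """
    out = list(logits)
    for tok in set(generated_ids):
        # Dividing a positive score (or multiplying a negative one) by
        # penalty > 1 always lowers that token's relative probability.
        if out[tok] > 0:
            out[tok] /= penalty
        else:
            out[tok] *= penalty
    return out
```

Combined with beam search, this makes the degenerate "The The The" path lose out to alternative hypotheses.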
Conclusion
Hallucinations are an inherent artifact of generative ASR. They cannot be "trained out" entirely without re-curating the massive training dataset. Therefore, the solution lies in the inference pipeline: wrapping the model in robust pre-processing (VAD) and post-processing (heuristic filtering) logic.
