Voice AI Deep Dive: ASR Decoding, Beam Search, and Why Streaming Partials “Jitter”
Introduction
Two teams can ship the “same” ASR model and users will swear they are different products. The reason is decoding.
In streaming ASR, the model emits probabilities. The decoder decides:
- What text to show now (partial hypothesis)
- What to revise later
- What to finalize (commit)
- When to stop listening (endpoint)
This article explains why partial hypotheses jitter, how beam search interacts with streaming constraints, and what researchers should measure beyond WER.
1. Decoding is Part of the Model
If the acoustic model is (p(\cdot)), the decoder effectively defines the final output distribution through:
- Search algorithm (greedy vs beam)
- Constraints (lexicon, grammar, hotwords)
- Score shaping (length normalization, blank penalties)
- LM fusion / rescoring
In production, you should think of “ASR = model + decoder + endpointing.”
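The pieces above can be sketched as a single scoring function. This is a minimal sketch: the fusion weight, the GNMT-style length-normalization form, and the blank penalty term are illustrative assumptions, not any toolkit's defaults.

```python
def shaped_score(logp_acoustic, logp_lm, n_tokens,
                 lm_weight=0.3, length_alpha=0.6, blank_penalty=0.0):
    """Combine acoustic and LM log-probabilities with score shaping.

    Illustrative only: weights and the normalization form are assumptions.
    """
    # Shallow fusion plus an optional penalty on blank-heavy paths.
    fused = logp_acoustic + lm_weight * logp_lm - blank_penalty
    # Length normalization so longer hypotheses are not unfairly penalized.
    norm = ((5 + n_tokens) / 6) ** length_alpha
    return fused / norm
```

Changing any term here changes which hypothesis wins the beam, which is why two deployments of the same checkpoint can behave differently.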
2. Greedy Decoding: Fast, Surprisingly Strong, Often Unstable
Greedy decoding picks the most likely next token at every step.
Benefits:
- Minimal latency and compute
- Easy to implement
Drawbacks:
- No global correction mechanism
- Susceptible to local mistakes that later become hard to undo
- In streaming, greedy can cause oscillations when token probabilities are close
Greedy is a good baseline for time-to-first-token (TTFT) and throughput, but rarely the best for UX.
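For concreteness, here is greedy decoding in its CTC form: take the argmax token per frame, collapse consecutive repeats, drop blanks. A sketch for illustration; a streaming decoder does the same thing frame by frame as audio arrives.

```python
BLANK = 0  # assume blank has id 0

def greedy_ctc_decode(log_probs):
    """Greedy CTC decoding over a (T x V) list of per-frame scores.

    Argmax per frame, collapse repeats, remove blanks.
    """
    out, prev = [], BLANK
    for frame in log_probs:
        tok = max(range(len(frame)), key=frame.__getitem__)
        if tok != prev and tok != BLANK:
            out.append(tok)
        prev = tok
    return out
```

Note there is no backtracking: once a frame's argmax is taken, nothing downstream can repair it, which is exactly the instability described above.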
3. Beam Search: The Workhorse (and Its Failure Modes)
Beam search keeps the top (B) candidate hypotheses at each step.
What it buys you:
- Recovery from local errors
- Better handling of ambiguous acoustics
- Better integration of language priors (explicit LM or implicit prediction network)
What it costs:
- Compute grows with beam width
- Streaming stability can worsen if beams swap rank frequently
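One expansion step of beam search looks like this. A toy sketch: real streaming decoders also handle blanks, merge equivalent prefixes, and apply score-based pruning thresholds on top of the top-B cut.

```python
import heapq

def beam_search_step(beams, step_log_probs, beam_width=4):
    """Expand each hypothesis by every candidate token, keep the top B.

    beams: list of (score, token_list) pairs.
    step_log_probs: dict mapping token id -> log-probability at this step.
    """
    candidates = []
    for score, toks in beams:
        for tok, lp in step_log_probs.items():
            candidates.append((score + lp, toks + [tok]))
    # Keep the B best hypotheses by cumulative log-probability.
    return heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
```

The compute cost is visible in the nested loop: each step considers (beams x vocabulary) candidates, so widening the beam scales the per-step work linearly.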
Why beams swap rank
In streaming, small future evidence can flip the best path:
- “I want to” vs “I wanted to”
- “recognize speech” vs “wreck a nice beach”
If the decoder displays the current best beam immediately, the UI will show edits as rank flips occur.
4. Pruning and Partial Stability: You Can’t Have Everything
There is a triangle:
- Accuracy: keep many candidates and allow revisions
- Stability: avoid changing what you already showed
- Latency: don’t wait too long to commit
To reduce jitter, systems often add:
- Aggressive pruning (reduces accuracy)
- Commit policies (increase latency)
- Confidence gating (can hide text temporarily)
5. RNN-T Blank Bias: A Latency–Completeness Dial
In RNN-T, the blank token advances time without emitting output. Many implementations include a blank bias:
- Increase blank probability: fewer emissions, more conservative output
- Decrease blank probability: more emissions, faster but riskier partials
Blank bias is effectively a knob for:
- TTFT
- Deletion rate
- Revision rate
Researchers should treat it like a hyperparameter and report it.
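In code, the dial can be as simple as an additive shift on the blank logit before normalization. The additive form is one common formulation, not a universal API; the sign convention here is an assumption.

```python
def apply_blank_bias(logits, blank_id=0, blank_bias=0.0):
    """Add a constant bias to the blank logit before the softmax.

    blank_bias > 0: blank wins more often -> fewer, more conservative
    emissions (higher TTFT, lower revision rate, risk of deletions).
    blank_bias < 0: earlier emissions, riskier partials.
    """
    biased = list(logits)
    biased[blank_id] += blank_bias
    return biased
```

Because the same bias moves TTFT, deletions, and revisions together, sweeping it and plotting all three is usually more informative than reporting any one number.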
6. LM Fusion and Rescoring: Where “Language” Enters
Even if your model has a built-in prior, external LMs are common:
- Shallow fusion: add (\lambda \log p_{LM}(y)) during beam search
- Rescoring: generate N-best and rerank with a stronger LM
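N-best rescoring reduces to a rerank over fused scores. A minimal sketch: `lm_score_fn` is a hypothetical callable returning (\log p_{LM}(y)) for a hypothesis, and the fusion weight here is an assumption to tune.

```python
def rescore_nbest(nbest, lm_score_fn, lam=0.5):
    """Rerank N-best hypotheses with an external LM.

    nbest: list of (asr_logp, text) pairs from beam search.
    Returns hypotheses sorted by fused score, best first.
    """
    scored = [(asr + lam * lm_score_fn(text), text) for asr, text in nbest]
    return sorted(scored, key=lambda s: s[0], reverse=True)
```

When the LM flips the top hypothesis, every word already shown from the old winner becomes a visible rewrite, which is the streaming hazard discussed next.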
In streaming, rescoring can cause:
- Late rewrites (LM changes earlier words once it sees later context)
- Topic-driven rewrites (proper nouns, domain terms)
Mitigations:
- Prefix locking (freeze older tokens)
- Segment-level rescoring (rerank only within current segment)
7. Timestamp Decoding: Another Source of Weirdness
When models output timestamp tokens (or alignments), decoding can change behavior:
- The decoder may delay emission until it can place a boundary.
- Forced segmentation can increase short-term errors near boundaries.
If you see “good offline text but bad streaming text,” check whether timestamp constraints are affecting streaming decode.
8. “Stable Partial” Heuristics That Actually Work
Common production strategies:
A) Prefix agreement
Keep multiple hypotheses (beams). Compute their longest common prefix. Display only that prefix as stable.
This reduces jitter because even if beams swap rank, the shared prefix tends to be stable.
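The prefix computation itself is straightforward; here it operates at the token level over the surviving beams.

```python
def stable_prefix(hypotheses):
    """Longest common prefix of all beam hypotheses (token lists).

    Only this shared prefix is displayed, so rank flips among beams
    do not produce visible edits.
    """
    if not hypotheses:
        return []
    prefix = hypotheses[0]
    for hyp in hypotheses[1:]:
        i = 0
        while i < len(prefix) and i < len(hyp) and prefix[i] == hyp[i]:
            i += 1
        prefix = prefix[:i]
    return prefix
```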
B) N-frame confirmation
Only commit tokens that remain unchanged for N frames/chunks.
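One way to implement this is a small stateful confirmer. A sketch under the assumption that committed tokens are prefix-locked and never revised; the class and method names are illustrative.

```python
class NFrameConfirmer:
    """Commit a token only after it survives N consecutive decode updates."""

    def __init__(self, n=3):
        self.n = n
        self.committed = []  # finalized tokens, never revised (prefix lock)
        self.pending = []    # (token, streak) pairs beyond the commit point

    def update(self, tokens):
        """Take the current best token sequence, return the committed prefix."""
        tail = tokens[len(self.committed):]
        new_pending = []
        for i, tok in enumerate(tail):
            # Extend the streak if this position held the same token last time.
            streak = self.pending[i][1] + 1 if (
                i < len(self.pending) and self.pending[i][0] == tok) else 1
            new_pending.append((tok, streak))
        self.pending = new_pending
        # Commit the leading run of tokens that survived n updates.
        while self.pending and self.pending[0][1] >= self.n:
            self.committed.append(self.pending.pop(0)[0])
        return list(self.committed)
```

Larger N trades latency for stability, which is the triangle from section 4 in miniature.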
C) Confidence thresholding
Display tokens only when their posterior exceeds a threshold. Useful for short words that cause churn.
D) Segment commits
Emit partials freely within a segment, but finalize only when endpointing triggers or a boundary token appears.
9. Metrics: What Researchers Should Report
Beyond WER:
- TTFT (median and p95)
- Word finalization delay: time from acoustic evidence to final word display
- Revision rate: edits per 100 emitted characters
- Stability score: fraction of displayed tokens that never change
- Jitter latency: how long until the displayed text stops changing
These metrics correlate with perceived quality far more than small WER differences.
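As an example of how cheap these metrics are to compute, here is one reasonable definition of the stability score over a log of displayed partials. Papers vary in the exact formula, so treat this as an assumption to state alongside your numbers.

```python
def stability_score(frames):
    """Fraction of displayed token positions that never change once shown.

    frames: successive displayed token lists (partials over time).
    """
    shown = {}       # position -> first token displayed there
    changed = set()  # positions whose token was later revised
    for frame in frames:
        for i, tok in enumerate(frame):
            if i in shown:
                if tok != shown[i]:
                    changed.add(i)
            else:
                shown[i] = tok
    if not shown:
        return 1.0
    return 1 - len(changed) / len(shown)
```

Logging partials with timestamps gives you TTFT, finalization delay, and jitter latency from the same trace.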
10. A Practical Baseline Decoder for Streaming Research
If you want a reference implementation that behaves well:
- Beam search with modest width (e.g., 4–8)
- Optional shallow fusion with a small LM
- Prefix agreement for stable display
- Prefix locking with a short confirmation window
- Conservative endpointing and segment-level commits
Then report WER + streaming metrics.
Conclusion
If your streaming ASR feels unstable, the architecture is only half the story. Decoding decisions—beam width, blank bias, rescoring, commit policies—shape the user experience. Treat decoding as part of the model, measure stability explicitly, and you will learn far more than WER can tell you.
