Voice AI Deep Dive: ASR Decoding, Beam Search, and Why Streaming Partials “Jitter”
Introduction
Two teams can ship the “same” ASR model and users will swear they are different products. The reason is decoding.
In streaming ASR, the model emits probabilities. The decoder decides:
- What text to show now (partial hypothesis)
- What to revise later
- What to finalize (commit)
- When to stop listening (endpoint)
This article explains why partial hypotheses jitter, how beam search interacts with streaming constraints, and what researchers should measure beyond WER.
1. Decoding is Part of the Model
If the acoustic model is (p(\cdot)), the decoder effectively defines the final output distribution through:
- Search algorithm (greedy vs beam)
- Constraints (lexicon, grammar, hotwords)
- Score shaping (length normalization, blank penalties)
- LM fusion / rescoring
In production, you should think of “ASR = model + decoder + endpointing.”
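The pieces above can be sketched as a single scoring function. This is a minimal sketch: the fusion weight, the GNMT-style length-normalization form, and the blank penalty term are illustrative assumptions, not any toolkit's defaults.

```python
def shaped_score(logp_acoustic, logp_lm, n_tokens,
                 lm_weight=0.3, length_alpha=0.6, blank_penalty=0.0):
    """Combine acoustic and LM log-probabilities with score shaping.

    Illustrative only: weights and the normalization form are assumptions.
    """
    # Shallow fusion plus an optional penalty on blank-heavy paths.
    fused = logp_acoustic + lm_weight * logp_lm - blank_penalty
    # Length normalization so longer hypotheses are not unfairly penalized.
    norm = ((5 + n_tokens) / 6) ** length_alpha
    return fused / norm
```

Changing any term here changes which hypothesis wins the beam, which is why two deployments of the same checkpoint can behave differently.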
2. Greedy Decoding: Fast, Surprisingly Strong, Often Unstable
Greedy decoding picks the most likely next token at every step.
Benefits:
- Minimal latency and compute
- Easy to implement
Drawbacks:
- No global correction mechanism
- Susceptible to local mistakes that later become hard to undo
- In streaming, greedy can cause oscillations when token probabilities are close
Greedy is a good baseline for time-to-first-token (TTFT) and throughput, but rarely the best for UX.
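For concreteness, here is greedy decoding in its CTC form: take the argmax token per frame, collapse consecutive repeats, drop blanks. A sketch for illustration; a streaming decoder does the same thing frame by frame as audio arrives.

```python
BLANK = 0  # assume blank has id 0

def greedy_ctc_decode(log_probs):
    """Greedy CTC decoding over a (T x V) list of per-frame scores.

    Argmax per frame, collapse repeats, remove blanks.
    """
    out, prev = [], BLANK
    for frame in log_probs:
        tok = max(range(len(frame)), key=frame.__getitem__)
        if tok != prev and tok != BLANK:
            out.append(tok)
        prev = tok
    return out
```

Note there is no backtracking: once a frame's argmax is taken, nothing downstream can repair it, which is exactly the instability described above.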
3. Beam Search: The Workhorse (and Its Failure Modes)
Beam search keeps the top (B) candidate hypotheses at each step.
What it buys you:
- Recovery from local errors
- Better handling of ambiguous acoustics
- Better integration of language priors (explicit LM or implicit prediction network)
What it costs:
- Compute grows with beam width
- Streaming stability can worsen if beams swap rank frequently
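One expansion step of beam search looks like this. A toy sketch: real streaming decoders also handle blanks, merge equivalent prefixes, and apply score-based pruning thresholds on top of the top-B cut.

```python
import heapq

def beam_search_step(beams, step_log_probs, beam_width=4):
    """Expand each hypothesis by every candidate token, keep the top B.

    beams: list of (score, token_list) pairs.
    step_log_probs: dict mapping token id -> log-probability at this step.
    """
    candidates = []
    for score, toks in beams:
        for tok, lp in step_log_probs.items():
            candidates.append((score + lp, toks + [tok]))
    # Keep the B best hypotheses by cumulative log-probability.
    return heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
```

The compute cost is visible in the nested loop: each step considers (beams x vocabulary) candidates, so widening the beam scales the per-step work linearly.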
Why beams swap rank
In streaming, small future evidence can flip the best path:
- “I want to” vs “I wanted to”
- “recognize speech” vs “wreck a nice beach”
If the decoder displays the current best beam immediately, the UI will show edits as rank flips occur.
4. Pruning and Partial Stability: You Can’t Have Everything
There is a triangle:
- Accuracy: keep many candidates and allow revisions
- Stability: avoid changing what you already showed
- Latency: don’t wait too long to commit
To reduce jitter, systems often add:
- Aggressive pruning (reduces accuracy)
- Commit policies (increase latency)
- Confidence gating (can hide text temporarily)
5. RNN-T Blank Bias: A Latency–Completeness Dial
In RNN-T, the blank token advances time without emitting output. Many implementations include a blank bias:
- Increase blank probability: fewer emissions, more conservative output
- Decrease blank probability: more emissions, faster but riskier partials
Blank bias is effectively a knob for:
- TTFT
- Deletion rate
- Revision rate
Researchers should treat it like a hyperparameter and report it.
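In code, the dial can be as simple as an additive shift on the blank logit before normalization. The additive form is one common formulation, not a universal API; the sign convention here is an assumption.

```python
def apply_blank_bias(logits, blank_id=0, blank_bias=0.0):
    """Add a constant bias to the blank logit before the softmax.

    blank_bias > 0: blank wins more often -> fewer, more conservative
    emissions (higher TTFT, lower revision rate, risk of deletions).
    blank_bias < 0: earlier emissions, riskier partials.
    """
    biased = list(logits)
    biased[blank_id] += blank_bias
    return biased
```

Because the same bias moves TTFT, deletions, and revisions together, sweeping it and plotting all three is usually more informative than reporting any one number.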
6. LM Fusion and Rescoring: Where “Language” Enters
Even if your model has a built-in prior, external LMs are common:
- Shallow fusion: add (\lambda \log p_{LM}(y)) during beam search
- Rescoring: generate N-best and rerank with a stronger LM
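N-best rescoring reduces to a rerank over fused scores. A minimal sketch: `lm_score_fn` is a hypothetical callable returning (\log p_{LM}(y)) for a hypothesis, and the fusion weight here is an assumption to tune.

```python
def rescore_nbest(nbest, lm_score_fn, lam=0.5):
    """Rerank N-best hypotheses with an external LM.

    nbest: list of (asr_logp, text) pairs from beam search.
    Returns hypotheses sorted by fused score, best first.
    """
    scored = [(asr + lam * lm_score_fn(text), text) for asr, text in nbest]
    return sorted(scored, key=lambda s: s[0], reverse=True)
```

When the LM flips the top hypothesis, every word already shown from the old winner becomes a visible rewrite, which is the streaming hazard discussed next.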
In streaming, rescoring can cause:
- Late rewrites (LM changes earlier words once it sees later context)
- Topic-driven rewrites (proper nouns, domain terms)
Mitigations:
- Prefix locking (freeze older tokens)
- Segment-level rescoring (rerank only within current segment)
7. Timestamp Decoding: Another Source of Weirdness
When models output timestamp tokens (or alignments), decoding can change behavior:
- The decoder may delay emission until it can place a boundary.
- Forced segmentation can increase short-term errors near boundaries.
If you see “good offline text but bad streaming text,” check whether timestamp constraints are affecting streaming decode.
8. “Stable Partial” Heuristics That Actually Work
Common production strategies:
A) Prefix agreement
Keep multiple hypotheses (beams). Compute their longest common prefix. Display only that prefix as stable.
This reduces jitter because even if beams swap rank, the shared prefix tends to be stable.
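The prefix computation itself is straightforward; here it operates at the token level over the surviving beams.

```python
def stable_prefix(hypotheses):
    """Longest common prefix of all beam hypotheses (token lists).

    Only this shared prefix is displayed, so rank flips among beams
    do not produce visible edits.
    """
    if not hypotheses:
        return []
    prefix = hypotheses[0]
    for hyp in hypotheses[1:]:
        i = 0
        while i < len(prefix) and i < len(hyp) and prefix[i] == hyp[i]:
            i += 1
        prefix = prefix[:i]
    return prefix
```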
B) N-frame confirmation
Only commit tokens that remain unchanged for N frames/chunks.
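One way to implement this is a small stateful confirmer. A sketch under the assumption that committed tokens are prefix-locked and never revised; the class and method names are illustrative.

```python
class NFrameConfirmer:
    """Commit a token only after it survives N consecutive decode updates."""

    def __init__(self, n=3):
        self.n = n
        self.committed = []  # finalized tokens, never revised (prefix lock)
        self.pending = []    # (token, streak) pairs beyond the commit point

    def update(self, tokens):
        """Take the current best token sequence, return the committed prefix."""
        tail = tokens[len(self.committed):]
        new_pending = []
        for i, tok in enumerate(tail):
            # Extend the streak if this position held the same token last time.
            streak = self.pending[i][1] + 1 if (
                i < len(self.pending) and self.pending[i][0] == tok) else 1
            new_pending.append((tok, streak))
        self.pending = new_pending
        # Commit the leading run of tokens that survived n updates.
        while self.pending and self.pending[0][1] >= self.n:
            self.committed.append(self.pending.pop(0)[0])
        return list(self.committed)
```

Larger N trades latency for stability, which is the triangle from section 4 in miniature.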
C) Confidence thresholding
Display tokens only when their posterior exceeds a threshold. Useful for short words that cause churn.
D) Segment commits
Emit partials freely within a segment, but finalize only when endpointing triggers or a boundary token appears.
9. Metrics: What Researchers Should Report
Beyond WER:
- TTFT (median and p95)
- Word finalization delay: time from acoustic evidence to final word display
- Revision rate: edits per 100 emitted characters
- Stability score: fraction of displayed tokens that never change
- Jitter latency: how long until the displayed text stops changing
These metrics correlate with perceived quality far more than small WER differences.
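As an example of how cheap these metrics are to compute, here is one reasonable definition of the stability score over a log of displayed partials. Papers vary in the exact formula, so treat this as an assumption to state alongside your numbers.

```python
def stability_score(frames):
    """Fraction of displayed token positions that never change once shown.

    frames: successive displayed token lists (partials over time).
    """
    shown = {}       # position -> first token displayed there
    changed = set()  # positions whose token was later revised
    for frame in frames:
        for i, tok in enumerate(frame):
            if i in shown:
                if tok != shown[i]:
                    changed.add(i)
            else:
                shown[i] = tok
    if not shown:
        return 1.0
    return 1 - len(changed) / len(shown)
```

Logging partials with timestamps gives you TTFT, finalization delay, and jitter latency from the same trace.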
10. A Practical Baseline Decoder for Streaming Research
If you want a reference implementation that behaves well:
- Beam search with modest width (e.g., 4–8)
- Optional shallow fusion with a small LM
- Prefix agreement for stable display
- Prefix locking with a short confirmation window
- Conservative endpointing and segment-level commits
Then report WER + streaming metrics.
Conclusion
If your streaming ASR feels unstable, the architecture is only half the story. Decoding decisions—beam width, blank bias, rescoring, commit policies—shape the user experience. Treat decoding as part of the model, measure stability explicitly, and you will learn far more than WER can tell you.
