
Voice AI Deep Dive: VAD, Endpointing, and Turn-Taking (Why Voice Agents Feel Fast or Awkward)

Introduction

If your voice agent feels slow, it is usually not the ASR model that is to blame. It is the decision layer that controls:

  • When to start listening
  • When to stop listening
  • Whether to accept interruptions (barge-in)
  • How quickly to respond without cutting the user off

This article breaks down the modern stack:

  1. VAD (Voice Activity Detection)
  2. Endpointing (end-of-utterance detection)
  3. Turn-taking (predicting conversational boundaries)

1. VAD vs Endpointing vs Turn-Taking

These are related but distinct:

  • VAD: Is there speech in this frame/window?
  • Endpointing: Has the user finished this utterance?
  • Turn-taking: Is the user yielding the floor to the agent?

A system can have perfect VAD and still have terrible turn-taking UX.

2. VAD: The Basic Building Block

VAD outputs a probability of speech vs non-speech per frame or window.

Common approaches:

  • Energy-based thresholds (simple, brittle)
  • Neural VAD (robust, standard)

Key parameters:

  • Frame/window size (10–30 ms vs 100 ms)
  • Smoothing (temporal filtering)
  • Hangover time (keep speech “on” briefly after probability drops)

Why hangover time matters

Without hangover, brief silences in natural speech (stop-closure gaps before plosives, breaths, filler pauses around “uh” and “um”) can be misclassified as non-speech, causing:

  • Fragmented segments
  • Dropped word endings
  • Excessive endpoint triggers
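The hangover mechanism can be sketched as a small post-processing step over per-frame speech probabilities. This assumes an upstream VAD model already produced one probability per frame (e.g. every 20 ms); the threshold and frame counts below are illustrative, not tuned values.

```python
def vad_with_hangover(probs, threshold=0.5, hangover_frames=15):
    """Return per-frame speech/non-speech decisions.

    hangover_frames keeps speech "on" after the probability drops
    below threshold (15 frames * 20 ms = 300 ms of hangover).
    """
    decisions = []
    hang = 0
    for p in probs:
        if p >= threshold:
            decisions.append(True)
            hang = hangover_frames  # reset the hangover timer
        elif hang > 0:
            decisions.append(True)  # still "speech" during hangover
            hang -= 1
        else:
            decisions.append(False)
    return decisions
```

With hangover, a two-frame dip in probability mid-utterance no longer splits the segment; without it, every dip becomes a segment boundary.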

3. Endpointing: The Most UX-Critical Threshold in Voice

Endpointing decides: “I think the user is done speaking.”

The naive strategy:

  • If silence lasts > X ms, stop.

This fails because pauses are meaningful:

  • Users pause to think.
  • Users pause between clauses.
  • Users pause before giving a number or name.

Production endpointing uses hysteresis

Typical heuristics:

  • Require sustained silence longer than X ms, but:
  • Increase X if the user is mid-sentence or was speaking with high energy just before the pause.
  • Decrease X if the user gave a short command-like phrase.

Endpointing becomes a policy conditioned on context.
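One way to make that policy concrete is a function that adjusts the silence threshold from context signals. The base value and adjustments below are illustrative assumptions, not recommended numbers.

```python
def endpoint_threshold_ms(base_ms=600, mid_sentence=False,
                          recent_energy_high=False, short_command=False):
    """Return the silence duration (ms) required to declare end-of-utterance."""
    x = base_ms
    if mid_sentence:
        x += 400          # e.g. ASR partial ends in "and", "so", or a comma
    if recent_energy_high:
        x += 200          # user was speaking strongly just before the pause
    if short_command:
        x = min(x, 300)   # "stop", "yes", "cancel": endpoint quickly
    return x
```

The point is the shape, not the constants: the threshold is no longer a single number but a function of the conversation state.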

4. Barge-In: Interruption Handling as a System Problem

Barge-in means the user can interrupt the agent while it is speaking.

Two requirements:

  1. The system must detect user speech while output audio is playing (hard).
  2. The system must stop playback quickly and restart ASR.

Challenges:

  • Echo cancellation (agent audio leaks into mic)
  • VAD false positives triggered by TTS output
  • Race conditions in stream switching

In research, barge-in is often ignored. In real products, it defines “feels conversational.”
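The control-flow core of barge-in can be sketched as a tiny state machine. Real systems need echo cancellation upstream of the VAD; here `stop_playback` and `start_asr` are hypothetical callbacks standing in for your audio stack, and the high threshold is an assumed guard against TTS leakage that survives echo cancellation.

```python
from enum import Enum, auto

class AgentState(Enum):
    LISTENING = auto()
    SPEAKING = auto()

def on_vad_frame(state, user_speech_prob, stop_playback, start_asr,
                 barge_in_threshold=0.8):
    """Handle one VAD frame; interrupt TTS if the user starts speaking."""
    if state is AgentState.SPEAKING and user_speech_prob >= barge_in_threshold:
        stop_playback()   # cut TTS as fast as possible (barge-in latency)
        start_asr()       # hand the floor back to the user
        return AgentState.LISTENING
    return state
```

Everything hard about barge-in lives outside this function: making `user_speech_prob` trustworthy while the agent's own audio is playing, and making `stop_playback` take milliseconds rather than a buffer's worth of audio.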

5. Turn-Taking Models: Beyond Silence Thresholds

Silence thresholds are crude. Turn-taking predictors use features like:

  • Prosody (intonation contours)
  • Timing patterns
  • Lexical cues (“so…”, “and…”, rising tone)
  • ASR partial hypothesis structure

A turn-taking model can predict whether the user intends to continue speaking even if there is a short pause.

Practical hybrid strategy

  • Use VAD + endpointing heuristics for baseline.
  • Add a lightweight turn-taking model to override endpoint decisions in ambiguous pauses.
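The hybrid can be sketched as a veto: the silence threshold still fires the endpoint, but a high continuation probability holds it back, up to a cap. `p_continue` would come from a learned turn-taking classifier; here it is just an input, and all thresholds are illustrative.

```python
def should_endpoint(silence_ms, threshold_ms=600,
                    p_continue=0.0, continue_veto=0.7, max_extension_ms=1500):
    """Decide end-of-utterance at a pause boundary.

    The turn-taking veto only extends the pause up to max_extension_ms,
    so a confident-but-wrong model cannot stall the agent forever.
    """
    if silence_ms < threshold_ms:
        return False
    if p_continue >= continue_veto and silence_ms < max_extension_ms:
        return False  # model predicts continuation: hold the endpoint
    return True
```

The cap matters: without `max_extension_ms`, a miscalibrated classifier turns a thinking pause into an agent that never responds.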

6. Evaluation: What to Measure

For researchers, add metrics that correlate with UX:

  • Cutoff rate: fraction of utterances cut before completion
  • Endpoint delay: silence duration before system responds
  • Barge-in latency: time to stop TTS after user starts speaking
  • False barge-in rate: agent stops speaking due to VAD false positives
  • Turn-taking accuracy: predict continue/stop at pause boundaries

Report distributions (p50/p90/p95), not only averages.
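Percentile reporting needs nothing beyond the standard library. A minimal sketch using `statistics.quantiles`, where `n=100` yields percentile cut points:

```python
import statistics

def latency_report(samples_ms):
    """Return p50/p90/p95 for a list of latency samples in milliseconds."""
    qs = statistics.quantiles(samples_ms, n=100)  # qs[i] ~ (i+1)th percentile
    return {"p50": qs[49], "p90": qs[89], "p95": qs[94]}
```

A handful of slow outliers barely moves the mean but shows up immediately at p95, which is where users actually feel endpoint delay and barge-in latency.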

7. Tuning Guidance (Rules of Thumb)

  • For command-and-control agents: favor fast endpointing (short silence threshold) with confidence gating.
  • For dictation or long-form: favor conservative endpointing and allow pauses.
  • If you must pick one knob: tune hangover time first, then silence threshold.

8. Failure Modes You’ll See in the Wild

  • Numbers and names get cut off: user pauses before critical entities.
  • Agent interrupts too early: endpointing triggers mid-thought.
  • Agent feels sluggish: endpoint delay too long in short commands.
  • Barge-in doesn’t work in noisy rooms: VAD tuned for quiet conditions.

9. Research Baseline Implementation Sketch

For a reproducible baseline:

  1. Neural VAD producing 20 ms frame probabilities.
  2. Temporal smoothing + hangover of 200–400 ms.
  3. Endpoint rule: stop after 500–900 ms silence, modulated by recent speech energy and ASR confidence.
  4. Optional turn-taking classifier that extends endpoint if continuation probability is high.

This baseline is simple but captures the core tradeoffs.
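The four steps above can be sketched as one streaming loop over 20 ms frames. The VAD model is stubbed out as a probability input, smoothing is an exponential moving average, hangover is 300 ms, and the endpoint rule uses a fixed 700 ms silence threshold for brevity (the energy/confidence modulation from step 3 and the turn-taking classifier from step 4 are omitted).

```python
FRAME_MS = 20  # one VAD probability per 20 ms frame

def stream_endpoint(frame_probs, alpha=0.3, speech_threshold=0.5,
                    hangover_ms=300, silence_ms_to_stop=700):
    """Return the frame index where the endpoint fires, or None."""
    smoothed = 0.0
    hang = 0
    silence_ms = 0
    for i, p in enumerate(frame_probs):
        smoothed = alpha * p + (1 - alpha) * smoothed  # temporal smoothing
        if smoothed >= speech_threshold:
            hang = hangover_ms      # speech: reset hangover, clear silence
            silence_ms = 0
        elif hang > 0:
            hang -= FRAME_MS        # still counted as speech (hangover)
        else:
            silence_ms += FRAME_MS
            if silence_ms >= silence_ms_to_stop:
                return i            # endpoint: user is done
    return None
```

Feeding it a burst of speech followed by silence shows the whole pipeline in order: smoothing delays onset slightly, hangover absorbs the first few silent frames, and only sustained silence fires the endpoint.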

Conclusion

Voice agents feel “fast” when they manage turns well, not when they chase the last 0.2% WER. Treat VAD, endpointing, and barge-in as first-class modeling problems with measurable metrics, and you will unlock the biggest UX wins.
