Voice AI Deep Dive: Speaker Diarization (ECAPA, Clustering, TS-VAD, and Overlap)
Introduction
Diarization answers a deceptively simple question:
“Who spoke when?”
For researchers building voice agents, diarization is not optional if you deal with:
- Meetings and interviews
- Customer support calls
- Multi-speaker WhatsApp voice notes
- Overlapped speech (two people talking at once)
This deep dive covers the modern diarization stack:
- Speaker embeddings (x-vectors → ECAPA)
- Clustering (and why it’s fragile)
- Resegmentation and neural diarization (TS-VAD style)
- Overlap handling and evaluation pitfalls
1. The Classical Pipeline: Embed → Cluster → Segment
Most production diarization systems still follow this shape:
- VAD / segmentation: detect speech regions.
- Embedding extraction: compute a speaker vector per segment.
- Clustering: group embeddings into speaker identities.
- Resegmentation: refine boundaries.
The entire system is only as good as the first step. Bad VAD yields garbage diarization.
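To make step 1 concrete, here is a toy energy-threshold VAD that returns speech regions as (start_s, end_s) pairs. The frame size and threshold are illustrative placeholders, not tuned values; production systems use a trained VAD model:

```python
import numpy as np

def energy_vad(signal, sr, frame_ms=30, threshold=0.01):
    """Toy energy-threshold VAD: returns (start_s, end_s) speech regions."""
    signal = np.asarray(signal, float)
    frame = int(sr * frame_ms / 1000)
    n = len(signal) // frame
    regions, start = [], None
    for i in range(n):
        active = np.mean(signal[i * frame:(i + 1) * frame] ** 2) > threshold
        if active and start is None:
            start = i                # speech onset
        elif not active and start is not None:
            regions.append((start * frame / sr, i * frame / sr))
            start = None
    if start is not None:            # speech runs to end of signal
        regions.append((start * frame / sr, n * frame / sr))
    return regions
```

Everything downstream inherits the quality of this step: a region that gets chopped here can never be recovered by clustering.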
2. Speaker Embeddings: From x-vectors to ECAPA
What an embedding represents
A speaker embedding is a vector that should be:
- Similar for the same speaker across different utterances
- Dissimilar for different speakers
- Robust to channel noise, background noise, and content
ECAPA-style improvements (why they became popular)
ECAPA-like architectures improve embedding quality by:
- Better channel attention and multi-scale feature aggregation
- Stronger pooling strategies (attention pooling)
- Better robustness under short segments
Short segments are the diarization killer. If your speech chunks are 0.5–1.0 seconds, embedding quality matters more than almost anything.
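The desired properties above can be checked directly with cosine similarity, regardless of which toolkit produced the embeddings. A minimal sketch; the 0.5 threshold is a placeholder and real thresholds must be calibrated per model and domain on held-out data:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_speaker(emb_a, emb_b, threshold=0.5):
    """Verification-style decision on two embeddings; the threshold is a
    placeholder and must be calibrated on held-out data."""
    return cosine_similarity(emb_a, emb_b) >= threshold
```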
3. Clustering: The Hidden Source of Instability
After embeddings, clustering assigns segments to speakers. Common clustering approaches:
- Agglomerative hierarchical clustering (AHC)
- Spectral clustering
- Online clustering variants for streaming
The key problem: “How many speakers?”
Real audio does not tell you the speaker count directly. You estimate it via:
- Thresholds on similarity
- Eigen-gap heuristics (spectral)
- Bayesian criteria
Those heuristics are brittle under:
- Background noise
- Heavy overlap
- Short turns
- Channel changes (phone speaker vs mic)
If your diarization flips between 2 and 3 speakers, the issue is often speaker-count estimation, not embeddings.
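As one concrete example of these heuristics, the eigen-gap method estimates speaker count from the spectrum of a graph Laplacian built on the segment-affinity matrix. A minimal sketch of one common convention; normalization details vary across papers, which is part of why the heuristic is brittle:

```python
import numpy as np

def estimate_num_speakers(affinity, max_speakers=8):
    """Eigen-gap heuristic: the largest gap among the smallest eigenvalues
    of the normalized graph Laplacian suggests the number of clusters."""
    A = np.asarray(affinity, float)
    A = (A + A.T) / 2                          # symmetrize
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-10))
    L = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    eigvals = np.sort(np.linalg.eigvalsh(L))
    gaps = np.diff(eigvals[: max_speakers + 1])
    return int(np.argmax(gaps)) + 1
```

With noisy affinities, the largest gap can move by one position, which is exactly the 2-vs-3-speaker flipping described above.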
4. Resegmentation: Fix Boundaries After Clustering
Segment boundaries inherited from the embedding stage are often coarse. Resegmentation refines exactly who spoke when.
Classical resegmentation methods:
- HMM-based smoothing over cluster assignments
- Viterbi decoding with per-speaker models
Modern resegmentation increasingly uses neural models conditioned on speaker profiles.
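A drastically simplified stand-in for HMM/Viterbi resegmentation: decode per-frame, per-speaker log-scores with a constant penalty for switching speakers between adjacent frames, which smooths away isolated misassigned frames. The penalty value here is illustrative:

```python
import numpy as np

def viterbi_smooth(scores, switch_penalty=2.0):
    """Viterbi decoding over per-frame, per-speaker log-scores with a
    constant penalty for changing speaker between adjacent frames."""
    scores = np.asarray(scores, float)
    T, S = scores.shape
    dp = scores[0].copy()
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        # trans[new, prev] = score of arriving in `new` from `prev`
        trans = dp[None, :] - switch_penalty * (1 - np.eye(S))
        back[t] = trans.argmax(axis=1)
        dp = scores[t] + trans.max(axis=1)
    path = [int(dp.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

A single noisy frame that slightly favors another speaker costs two switches, so it is only relabeled when the evidence outweighs the penalty.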
5. TS-VAD Style Neural Diarization (Why It Works)
The idea behind TS-VAD-like approaches:
- Build a neural model that predicts, for each time frame, whether each speaker is active.
- Condition on a set of speaker embeddings representing hypothesized speakers.
Benefits:
- Better handling of overlap (multiple speakers can be active simultaneously)
- Better boundary precision
- Reduced reliance on heuristic clustering thresholds
Limitations:
- Requires a candidate set of speaker profiles (often produced by clustering anyway)
- Compute cost can be higher than pure clustering pipelines
In practice, TS-VAD is a powerful “second stage” that upgrades a clustering baseline.
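The core interface can be sketched in a few lines: scores are computed per frame and per hypothesized speaker, and independent sigmoids allow multiple speakers to be active at once. A real TS-VAD model learns this mapping jointly from the mixture and the profiles; the dot-product scoring below is only a placeholder for that network:

```python
import numpy as np

def ts_vad_scores(frames, profiles):
    """Per-frame, per-speaker activity probabilities. The dot product
    against speaker profiles stands in for a learned network."""
    frames = np.asarray(frames, float)
    profiles = np.asarray(profiles, float)
    logits = frames @ profiles.T              # shape (T, num_speakers)
    return 1.0 / (1.0 + np.exp(-logits))      # independent sigmoids: overlap allowed

def decode_activity(probs, threshold=0.5):
    """Multi-label decoding: any number of speakers may be active per frame."""
    return probs >= threshold
```

The key contrast with clustering is the output shape: a (frames, speakers) activity matrix rather than one label per frame.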
6. Overlap: The Hard Mode
Overlap breaks the assumption that diarization assigns exactly one speaker label per frame. In overlap regions:
- Two speakers are active at once.
- Single-speaker embeddings extracted from mixed audio can drift.
Strategies:
- Overlap-aware diarization: multi-label frame outputs (TS-VAD style).
- Separation first: run speech separation and diarize each separated stream.
- Post-hoc overlap detection: detect overlap and allow multi-speaker labels there.
For research, separation-first can produce impressive overlap DER improvements, but it introduces artifacts and requires careful evaluation.
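Whichever strategy produces the multi-label frame activity, it must eventually be converted into per-speaker time segments (e.g. for RTTM-style output), with overlap appearing as segments from different speakers covering the same span. A minimal sketch:

```python
import numpy as np

def activity_to_segments(activity, frame_dur=0.02):
    """Convert a boolean (T, num_speakers) activity matrix into
    (speaker, start_s, end_s) segments. Overlap shows up as segments
    from different speakers covering the same time span."""
    activity = np.asarray(activity, bool)
    segments = []
    for spk in range(activity.shape[1]):
        start = None
        for t, on in enumerate(activity[:, spk]):
            if on and start is None:
                start = t
            elif not on and start is not None:
                segments.append((spk, start * frame_dur, t * frame_dur))
                start = None
        if start is not None:        # speaker active until the last frame
            segments.append((spk, start * frame_dur, activity.shape[0] * frame_dur))
    return segments
```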
7. Streaming Diarization: Extra Constraints
Streaming diarization adds two constraints:
- You can't look ahead to refine clustering.
- Speaker identities must remain stable over time (no retroactive relabeling).
Online clustering must:
- Adapt to new speakers appearing
- Avoid “identity drift” when a speaker’s channel changes
Researchers should separately report:
- Offline DER
- Online DER with fixed latency constraints
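A greedy online tracker illustrates both constraints: each new embedding is either assigned to the closest existing centroid or opens a new speaker, and past decisions are never revised. This is a skeleton only; real systems add hysteresis, re-scoring, and drift compensation, and the 0.6 threshold is illustrative:

```python
import numpy as np

class OnlineSpeakerTracker:
    """Greedy online clustering: assign to the closest centroid if the
    similarity clears a threshold, otherwise open a new speaker."""
    def __init__(self, threshold=0.6):
        self.threshold = threshold
        self.centroids = []   # running mean embedding per speaker (unit norm)
        self.counts = []

    def assign(self, emb):
        emb = np.asarray(emb, float)
        emb = emb / np.linalg.norm(emb)
        if self.centroids:
            sims = [float(c @ emb) for c in self.centroids]
            best = int(np.argmax(sims))
            if sims[best] >= self.threshold:
                n = self.counts[best]
                c = (self.centroids[best] * n + emb) / (n + 1)
                self.centroids[best] = c / np.linalg.norm(c)
                self.counts[best] = n + 1
                return best
        self.centroids.append(emb)   # new speaker appears
        self.counts.append(1)
        return len(self.centroids) - 1
```

The centroid update is where identity drift lives: if a speaker's channel changes mid-stream, the running mean can wander toward the new channel or a new spurious speaker gets opened.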
8. Evaluation: DER Is Not Enough
The classical metric is Diarization Error Rate (DER), composed of:
- Missed speech
- False alarm speech
- Speaker confusion
But DER can be misleading:
- A strong VAD can reduce miss/false alarm but increase confusion.
- Overlap scoring conventions vary.
- Collar sizes (boundary tolerance) change results significantly.
Also report:
- JER (Jaccard Error Rate) for overlap-aware scoring
- Confusion-only components
- Performance under short-turn conditions
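The DER components are easy to state precisely at the frame level. The sketch below assumes single-label sequences (no overlap), no collar, and that hypothesis labels have already been optimally mapped to reference labels; real scorers handle all three, which is exactly where the convention differences above come from:

```python
import numpy as np

def frame_der(ref, hyp):
    """Frame-level DER: ref/hyp are speaker-id sequences, -1 = non-speech.
    Assumes no overlap, no collar, and hyp ids already mapped to ref ids."""
    ref, hyp = np.asarray(ref), np.asarray(hyp)
    ref_speech, hyp_speech = ref >= 0, hyp >= 0
    miss = np.sum(ref_speech & ~hyp_speech)
    false_alarm = np.sum(~ref_speech & hyp_speech)
    confusion = np.sum(ref_speech & hyp_speech & (ref != hyp))
    total = max(int(np.sum(ref_speech)), 1)   # total reference speech frames
    return float(miss + false_alarm + confusion) / total
```

Reporting the three numerator terms separately (not just their sum) is what reveals the VAD-vs-confusion trade-off described above.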
9. A Practical Research Pipeline (Strong Baseline)
- VAD with conservative settings (avoid chopping speech).
- Extract ECAPA embeddings over sliding windows (1–2 seconds with overlap).
- Perform clustering with robust speaker-count selection.
- Apply TS-VAD resegmentation conditioned on cluster speaker profiles.
- Add overlap-aware scoring and report DER + JER.
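Step 2's sliding windows are simple to generate. A sketch using the 1-2 second range mentioned above; 1.5 s windows at 50% overlap is a common default, not a universal one:

```python
def sliding_windows(start_s, end_s, win=1.5, hop=0.75):
    """Embedding windows covering a speech region; a region shorter than
    one window gets a single window spanning the whole region."""
    windows, t = [], start_s
    while t + win <= end_s:
        windows.append((t, t + win))
        t += hop
    if not windows:
        windows.append((start_s, end_s))
    return windows
```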
10. Common Failure Modes and Fixes
- Short turns: embeddings are noisy.
- Fix: longer windows with overlap + attention pooling.
- Channel shifts: speaker embedding changes across devices.
- Fix: domain-adaptive training or PLDA-like scoring.
- Overtalk: overlap dominates.
- Fix: overlap-aware diarization or separation-first.
Conclusion
Diarization is a systems problem. Embeddings matter, but clustering heuristics and overlap handling often dominate real-world performance. For researchers, a modern baseline is: ECAPA embeddings + robust clustering + TS-VAD resegmentation + overlap-aware evaluation.
