Voice AI Deep Dive: Speaker Diarization (ECAPA, Clustering, TS-VAD, and Overlap)
Introduction
Diarization answers a deceptively simple question:
“Who spoke when?”
For researchers building voice agents, diarization is not optional if you deal with:
- Meetings and interviews
- Customer support calls
- Multi-speaker WhatsApp voice notes
- Overlapped speech (two people talking at once)
This deep dive covers the modern diarization stack:
- Speaker embeddings (x-vectors → ECAPA)
- Clustering (and why it’s fragile)
- Resegmentation and neural diarization (TS-VAD style)
- Overlap handling and evaluation pitfalls
1. The Classical Pipeline: Embed → Cluster → Segment
Most production diarization systems still follow this shape:
- VAD / segmentation: detect speech regions.
- Embedding extraction: compute a speaker vector per segment.
- Clustering: group embeddings into speaker identities.
- Resegmentation: refine boundaries.
The entire system is only as good as the first step. Bad VAD yields garbage diarization.
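To make step 1 concrete, here is a toy energy-threshold VAD that returns speech regions as (start_s, end_s) pairs. The frame size and threshold are illustrative placeholders, not tuned values; production systems use a trained VAD model:

```python
import numpy as np

def energy_vad(signal, sr, frame_ms=30, threshold=0.01):
    """Toy energy-threshold VAD: returns (start_s, end_s) speech regions."""
    signal = np.asarray(signal, float)
    frame = int(sr * frame_ms / 1000)
    n = len(signal) // frame
    regions, start = [], None
    for i in range(n):
        active = np.mean(signal[i * frame:(i + 1) * frame] ** 2) > threshold
        if active and start is None:
            start = i                # speech onset
        elif not active and start is not None:
            regions.append((start * frame / sr, i * frame / sr))
            start = None
    if start is not None:            # speech runs to end of signal
        regions.append((start * frame / sr, n * frame / sr))
    return regions
```

Everything downstream inherits the quality of this step: a region that gets chopped here can never be recovered by clustering.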
2. Speaker Embeddings: From x-vectors to ECAPA
What an embedding represents
A speaker embedding is a vector that should be:
- Similar for the same speaker across different utterances
- Dissimilar for different speakers
- Robust to channel noise, background noise, and content
ECAPA-style improvements (why they became popular)
ECAPA-like architectures improve embedding quality by:
- Better channel attention and multi-scale feature aggregation
- Stronger pooling strategies (attention pooling)
- Better robustness under short segments
Short segments are the diarization killer. If your speech chunks are 0.5–1.0 seconds, embedding quality matters more than almost anything.
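The desired properties above can be checked directly with cosine similarity, regardless of which toolkit produced the embeddings. A minimal sketch; the 0.5 threshold is a placeholder and real thresholds must be calibrated per model and domain on held-out data:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_speaker(emb_a, emb_b, threshold=0.5):
    """Verification-style decision on two embeddings; the threshold is a
    placeholder and must be calibrated on held-out data."""
    return cosine_similarity(emb_a, emb_b) >= threshold
```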
3. Clustering: The Hidden Source of Instability
After embeddings, clustering assigns segments to speakers. Common clustering approaches:
- Agglomerative hierarchical clustering (AHC)
- Spectral clustering
- Online clustering variants for streaming
The key problem: “How many speakers?”
Real audio does not tell you the speaker count directly. You estimate it via:
- Thresholds on similarity
- Eigen-gap heuristics (spectral)
- Bayesian criteria
Those heuristics are brittle under:
- Background noise
- Heavy overlap
- Short turns
- Channel changes (phone speaker vs mic)
If your diarization flips between 2 and 3 speakers, the issue is often speaker-count estimation, not embeddings.
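As one concrete example of these heuristics, the eigen-gap method estimates speaker count from the spectrum of a graph Laplacian built on the segment-affinity matrix. A minimal sketch of one common convention; normalization details vary across papers, which is part of why the heuristic is brittle:

```python
import numpy as np

def estimate_num_speakers(affinity, max_speakers=8):
    """Eigen-gap heuristic: the largest gap among the smallest eigenvalues
    of the normalized graph Laplacian suggests the number of clusters."""
    A = np.asarray(affinity, float)
    A = (A + A.T) / 2                          # symmetrize
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-10))
    L = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    eigvals = np.sort(np.linalg.eigvalsh(L))
    gaps = np.diff(eigvals[: max_speakers + 1])
    return int(np.argmax(gaps)) + 1
```

With noisy affinities, the largest gap can move by one position, which is exactly the 2-vs-3-speaker flipping described above.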
4. Resegmentation: Fix Boundaries After Clustering
Segment boundaries inherited from the embedding stage are often coarse. Resegmentation refines exactly who spoke when.
Classical resegmentation methods:
- HMM-based smoothing over cluster assignments
- Viterbi decoding with per-speaker models
Modern resegmentation increasingly uses neural models conditioned on speaker profiles.
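A drastically simplified stand-in for HMM/Viterbi resegmentation: decode per-frame, per-speaker log-scores with a constant penalty for switching speakers between adjacent frames, which smooths away isolated misassigned frames. The penalty value here is illustrative:

```python
import numpy as np

def viterbi_smooth(scores, switch_penalty=2.0):
    """Viterbi decoding over per-frame, per-speaker log-scores with a
    constant penalty for changing speaker between adjacent frames."""
    scores = np.asarray(scores, float)
    T, S = scores.shape
    dp = scores[0].copy()
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        # trans[new, prev] = score of arriving in `new` from `prev`
        trans = dp[None, :] - switch_penalty * (1 - np.eye(S))
        back[t] = trans.argmax(axis=1)
        dp = scores[t] + trans.max(axis=1)
    path = [int(dp.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

A single noisy frame that slightly favors another speaker costs two switches, so it is only relabeled when the evidence outweighs the penalty.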
5. TS-VAD Style Neural Diarization (Why It Works)
The idea behind TS-VAD-like approaches:
- Build a neural model that predicts, for each time frame, whether each speaker is active.
- Condition on a set of speaker embeddings representing hypothesized speakers.
Benefits:
- Better handling of overlap (multiple speakers can be active simultaneously)
- Better boundary precision
- Reduced reliance on heuristic clustering thresholds
Limitations:
- Requires a candidate set of speaker profiles (often produced by clustering anyway)
- Compute cost can be higher than pure clustering pipelines
In practice, TS-VAD is a powerful “second stage” that upgrades a clustering baseline.
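The core interface can be sketched in a few lines: scores are computed per frame and per hypothesized speaker, and independent sigmoids allow multiple speakers to be active at once. A real TS-VAD model learns this mapping jointly from the mixture and the profiles; the dot-product scoring below is only a placeholder for that network:

```python
import numpy as np

def ts_vad_scores(frames, profiles):
    """Per-frame, per-speaker activity probabilities. The dot product
    against speaker profiles stands in for a learned network."""
    frames = np.asarray(frames, float)
    profiles = np.asarray(profiles, float)
    logits = frames @ profiles.T              # shape (T, num_speakers)
    return 1.0 / (1.0 + np.exp(-logits))      # independent sigmoids: overlap allowed

def decode_activity(probs, threshold=0.5):
    """Multi-label decoding: any number of speakers may be active per frame."""
    return probs >= threshold
```

The key contrast with clustering is the output shape: a (frames, speakers) activity matrix rather than one label per frame.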
6. Overlap: The Hard Mode
Overlap breaks the assumption that diarization assigns exactly one speaker label per frame. In overlap regions:
- Two speakers are active at once.
- Single-speaker embeddings extracted from mixed audio can drift.
Strategies:
- Overlap-aware diarization: multi-label frame outputs (TS-VAD style).
- Separation first: run speech separation and diarize each separated stream.
- Post-hoc overlap detection: detect overlap and allow multi-speaker labels there.
For research, separation-first can produce impressive overlap DER improvements, but it introduces artifacts and requires careful evaluation.
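Whichever strategy produces the multi-label frame activity, it must eventually be converted into per-speaker time segments (e.g. for RTTM-style output), with overlap appearing as segments from different speakers covering the same span. A minimal sketch:

```python
import numpy as np

def activity_to_segments(activity, frame_dur=0.02):
    """Convert a boolean (T, num_speakers) activity matrix into
    (speaker, start_s, end_s) segments. Overlap shows up as segments
    from different speakers covering the same time span."""
    activity = np.asarray(activity, bool)
    segments = []
    for spk in range(activity.shape[1]):
        start = None
        for t, on in enumerate(activity[:, spk]):
            if on and start is None:
                start = t
            elif not on and start is not None:
                segments.append((spk, start * frame_dur, t * frame_dur))
                start = None
        if start is not None:        # speaker active until the last frame
            segments.append((spk, start * frame_dur, activity.shape[0] * frame_dur))
    return segments
```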
7. Streaming Diarization: Extra Constraints
Streaming diarization adds two constraints:
- You can't look ahead to refine clustering.
- Speaker identities must remain stable over time (no retroactive relabeling).
Online clustering must:
- Adapt to new speakers appearing
- Avoid “identity drift” when a speaker’s channel changes
Researchers should separately report:
- Offline DER
- Online DER with fixed latency constraints
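A greedy online tracker illustrates both constraints: each new embedding is either assigned to the closest existing centroid or opens a new speaker, and past decisions are never revised. This is a skeleton only; real systems add hysteresis, re-scoring, and drift compensation, and the 0.6 threshold is illustrative:

```python
import numpy as np

class OnlineSpeakerTracker:
    """Greedy online clustering: assign to the closest centroid if the
    similarity clears a threshold, otherwise open a new speaker."""
    def __init__(self, threshold=0.6):
        self.threshold = threshold
        self.centroids = []   # running mean embedding per speaker (unit norm)
        self.counts = []

    def assign(self, emb):
        emb = np.asarray(emb, float)
        emb = emb / np.linalg.norm(emb)
        if self.centroids:
            sims = [float(c @ emb) for c in self.centroids]
            best = int(np.argmax(sims))
            if sims[best] >= self.threshold:
                n = self.counts[best]
                c = (self.centroids[best] * n + emb) / (n + 1)
                self.centroids[best] = c / np.linalg.norm(c)
                self.counts[best] = n + 1
                return best
        self.centroids.append(emb)   # new speaker appears
        self.counts.append(1)
        return len(self.centroids) - 1
```

The centroid update is where identity drift lives: if a speaker's channel changes mid-stream, the running mean can wander toward the new channel or a new spurious speaker gets opened.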
8. Evaluation: DER Is Not Enough
The classical metric is Diarization Error Rate (DER), composed of:
- Missed speech
- False alarm speech
- Speaker confusion
But DER can be misleading:
- A strong VAD can reduce miss/false alarm but increase confusion.
- Overlap scoring conventions vary.
- Collar sizes (boundary tolerance) change results significantly.
Also report:
- JER (Jaccard Error Rate) for overlap-aware scoring
- Confusion-only components
- Performance under short-turn conditions
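The DER components are easy to state precisely at the frame level. The sketch below assumes single-label sequences (no overlap), no collar, and that hypothesis labels have already been optimally mapped to reference labels; real scorers handle all three, which is exactly where the convention differences above come from:

```python
import numpy as np

def frame_der(ref, hyp):
    """Frame-level DER: ref/hyp are speaker-id sequences, -1 = non-speech.
    Assumes no overlap, no collar, and hyp ids already mapped to ref ids."""
    ref, hyp = np.asarray(ref), np.asarray(hyp)
    ref_speech, hyp_speech = ref >= 0, hyp >= 0
    miss = np.sum(ref_speech & ~hyp_speech)
    false_alarm = np.sum(~ref_speech & hyp_speech)
    confusion = np.sum(ref_speech & hyp_speech & (ref != hyp))
    total = max(int(np.sum(ref_speech)), 1)   # total reference speech frames
    return float(miss + false_alarm + confusion) / total
```

Reporting the three numerator terms separately (not just their sum) is what reveals the VAD-vs-confusion trade-off described above.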
9. A Practical Research Pipeline (Strong Baseline)
- VAD with conservative settings (avoid chopping speech).
- Extract ECAPA embeddings over sliding windows (1–2 seconds with overlap).
- Perform clustering with robust speaker-count selection.
- Apply TS-VAD resegmentation conditioned on cluster speaker profiles.
- Add overlap-aware scoring and report DER + JER.
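Step 2's sliding windows are simple to generate. A sketch using the 1-2 second range mentioned above; 1.5 s windows at 50% overlap is a common default, not a universal one:

```python
def sliding_windows(start_s, end_s, win=1.5, hop=0.75):
    """Embedding windows covering a speech region; a region shorter than
    one window gets a single window spanning the whole region."""
    windows, t = [], start_s
    while t + win <= end_s:
        windows.append((t, t + win))
        t += hop
    if not windows:
        windows.append((start_s, end_s))
    return windows
```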
10. Common Failure Modes and Fixes
- Short turns: embeddings are noisy.
- Fix: longer windows with overlap + attention pooling.
- Channel shifts: speaker embedding changes across devices.
- Fix: domain-adaptive training or PLDA-like scoring.
- Overtalk: overlap dominates.
- Fix: overlap-aware diarization or separation-first.
Conclusion
Diarization is a systems problem. Embeddings matter, but clustering heuristics and overlap handling often dominate real-world performance. For researchers, a modern baseline is: ECAPA embeddings + robust clustering + TS-VAD resegmentation + overlap-aware evaluation.
