
Voice AI Deep Dive: Speaker Diarization (ECAPA, Clustering, TS-VAD, and Overlap)

Introduction

Diarization answers a deceptively simple question:

“Who spoke when?”

For researchers building voice agents, diarization is not optional if you deal with:

  • Meetings and interviews
  • Customer support calls
  • Multi-speaker WhatsApp voice notes
  • Overlapped speech (two people talking at once)

This deep dive covers the modern diarization stack:

  1. Speaker embeddings (x-vectors → ECAPA)
  2. Clustering (and why it’s fragile)
  3. Resegmentation and neural diarization (TS-VAD style)
  4. Overlap handling and evaluation pitfalls

1. The Classical Pipeline: Embed → Cluster → Segment

Most production diarization systems still follow this shape:

  1. VAD / segmentation: detect speech regions.
  2. Embedding extraction: compute a speaker vector per segment.
  3. Clustering: group embeddings into speaker identities.
  4. Resegmentation: refine boundaries.

The entire system is only as good as the first step. Bad VAD yields garbage diarization.
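The four stages compose naturally. A minimal sketch, assuming hypothetical `vad`, `embed`, `cluster`, and `resegment` callables (these names illustrate the stage interfaces, not any real library's API):

```python
# Minimal sketch of the embed → cluster → segment pipeline. Each stage is
# passed in as a callable, so any concrete implementation can be swapped in.

def diarize(audio, vad, embed, cluster, resegment):
    """Run the four classical stages; return (start, end, speaker) turns."""
    segments = vad(audio)                             # 1. speech regions
    embeddings = [embed(audio, s) for s in segments]  # 2. one vector per segment
    labels = cluster(embeddings)                      # 3. speaker identities
    return resegment(segments, labels)                # 4. refined boundaries
```

The point of the structure is that each stage only depends on the previous one's output, which is also why errors cascade: a bad VAD poisons everything downstream.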

2. Speaker Embeddings: From x-vectors to ECAPA

What an embedding represents

A speaker embedding is a vector that should be:

  • Similar for the same speaker across different utterances
  • Dissimilar for different speakers
  • Robust to channel noise, background noise, and content

ECAPA-style improvements (why they became popular)

ECAPA-like architectures improve embedding quality by:

  • Better channel attention and multi-scale feature aggregation
  • Stronger pooling strategies (attention pooling)
  • Better robustness under short segments

Short segments are the diarization killer. If your speech chunks are 0.5–1.0 seconds, embedding quality matters more than almost anything.
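Embeddings are usually compared with cosine similarity: close to 1 for the same speaker, near 0 (or negative) for different speakers. A minimal NumPy sketch (the toy vectors in the usage stand in for real ECAPA outputs, which are typically 192-dimensional):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings:
    near 1 for the same speaker, near 0 for different speakers."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```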

3. Clustering: The Hidden Source of Instability

After embeddings, clustering assigns segments to speakers. Common clustering approaches:

  • Agglomerative hierarchical clustering (AHC)
  • Spectral clustering
  • Online clustering variants for streaming
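AHC with a distance threshold is the most common baseline; a sketch using SciPy's hierarchical clustering (the threshold value is an illustrative stand-in for a tuned one):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def ahc_labels(embeddings: np.ndarray, threshold: float) -> np.ndarray:
    """Agglomerative clustering on cosine distance. The distance threshold
    implicitly sets the speaker count: lower threshold → more speakers."""
    dists = pdist(embeddings, metric="cosine")
    tree = linkage(dists, method="average")
    return fcluster(tree, t=threshold, criterion="distance")
```

Note that the threshold does double duty: it controls both cluster purity and the estimated number of speakers, which is exactly why it is fragile.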

The key problem: “How many speakers?”

Real audio does not tell you the speaker count directly. You estimate it via:

  • Thresholds on similarity
  • Eigen-gap heuristics (spectral)
  • Bayesian criteria

Those heuristics are brittle under:

  • Background noise
  • Heavy overlap
  • Short turns
  • Channel changes (phone speaker vs mic)

If your diarization flips between 2 and 3 speakers, the issue is often speaker-count estimation, not embeddings.
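The eigen-gap heuristic mentioned above can be sketched as follows: a simplified version that operates on a precomputed affinity matrix (real spectral-clustering systems add affinity refinement steps before this):

```python
import numpy as np

def estimate_num_speakers(affinity: np.ndarray, max_speakers: int = 8) -> int:
    """Eigen-gap heuristic: the count of near-zero eigenvalues of the
    normalized Laplacian estimates the number of clusters. We pick the
    count just before the largest gap in the sorted eigenvalues."""
    d = affinity.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-10)))
    laplacian = np.eye(len(affinity)) - d_inv_sqrt @ affinity @ d_inv_sqrt
    eigvals = np.sort(np.linalg.eigvalsh(laplacian))
    gaps = np.diff(eigvals[:max_speakers])
    return int(np.argmax(gaps)) + 1
```

On a clean block-diagonal affinity this is reliable; under noise or heavy overlap, the gaps blur and the estimate flips, which is the 2-vs-3-speaker instability described above.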

4. Resegmentation: Fix Boundaries After Clustering

The segments embeddings are extracted from are often temporally coarse. Resegmentation refines exactly who spoke when.

Classical resegmentation methods:

  • HMM-based smoothing over cluster assignments
  • Viterbi decoding with per-speaker models

Modern resegmentation increasingly uses neural models conditioned on speaker profiles.
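As a crude stand-in for HMM/Viterbi smoothing, majority voting over a sliding window already captures the core idea: remove isolated label flips that are shorter than a plausible speaker turn.

```python
from collections import Counter

def smooth_labels(labels: list, window: int = 5) -> list:
    """Majority-vote smoothing over a sliding window: a simplified
    stand-in for HMM-based resegmentation. Removes isolated one-frame
    label flips while keeping genuine speaker changes."""
    half = window // 2
    out = []
    for i in range(len(labels)):
        ctx = labels[max(0, i - half): i + half + 1]
        out.append(Counter(ctx).most_common(1)[0][0])
    return out
```

A real HMM adds explicit transition penalties and per-speaker emission models, but the effect on short spurious flips is similar.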

5. TS-VAD Style Neural Diarization (Why It Works)

The idea behind TS-VAD-like approaches:

  • Build a neural model that predicts, for each time frame, whether each speaker is active.
  • Condition on a set of speaker embeddings representing hypothesized speakers.

Benefits:

  • Better handling of overlap (multiple speakers can be active simultaneously)
  • Better boundary precision
  • Reduced reliance on heuristic clustering thresholds

Limitations:

  • Requires a candidate set of speaker profiles (often produced by clustering anyway)
  • Compute cost can be higher than pure clustering pipelines

In practice, TS-VAD is a powerful “second stage” that upgrades a clustering baseline.
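The multi-label frame decision can be sketched as independent per-speaker thresholding of the model's posteriors (the `posteriors` matrix here is a placeholder for real TS-VAD outputs):

```python
import numpy as np

def ts_vad_decode(posteriors: np.ndarray, threshold: float = 0.5):
    """Decode TS-VAD-style outputs. `posteriors` is (frames, speakers)
    with per-frame activity probabilities. Each speaker is thresholded
    independently, so several can be active in one frame (overlap)."""
    active = posteriors >= threshold   # boolean (frames, speakers)
    return [tuple(np.flatnonzero(frame)) for frame in active]
```

This is the structural difference from clustering: the output is a set of active speakers per frame, not a single label.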

6. Overlap: The Hard Mode

Overlap breaks the assumption that diarization is a single-label sequence. In overlap regions:

  • Two speakers are active at once.
  • Single-speaker embeddings extracted from mixed audio can drift.

Strategies:

  • Overlap-aware diarization: multi-label frame outputs (TS-VAD style).
  • Separation first: run speech separation and diarize each separated stream.
  • Post-hoc overlap detection: detect overlap and allow multi-speaker labels there.

For research, separation-first can produce impressive overlap DER improvements, but it introduces artifacts and requires careful evaluation.
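The post-hoc strategy can be sketched as: emit the single best speaker per frame, except where an external overlap detector fires, where the top two speakers are both emitted (both inputs below are placeholders for real model outputs):

```python
import numpy as np

def overlap_aware_labels(posteriors: np.ndarray, overlap_mask: np.ndarray):
    """Post-hoc overlap handling: one best speaker per frame, except in
    frames flagged by an external overlap detector, where the top-2
    speakers are both emitted. `overlap_mask` is boolean, one per frame."""
    out = []
    for probs, is_overlap in zip(posteriors, overlap_mask):
        order = np.argsort(probs)[::-1]          # speakers by descending score
        k = 2 if is_overlap else 1
        out.append(tuple(sorted(int(s) for s in order[:k])))
    return out
```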

7. Streaming Diarization: Extra Constraints

Streaming diarization adds two constraints:

  • You can’t see the future to refine clustering.
  • Speaker identity must remain stable over time (no relabeling).

Online clustering must:

  • Adapt to new speakers appearing
  • Avoid “identity drift” when a speaker’s channel changes

Researchers should separately report:

  • Offline DER
  • Online DER with fixed latency constraints
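An online clustering sketch under these constraints: running per-speaker centroids, a similarity threshold for opening a new speaker, and labels that are never revised once issued (the threshold value is illustrative):

```python
import numpy as np

class OnlineSpeakerTracker:
    """Streaming assignment sketch: compare each incoming embedding to
    running speaker centroids; open a new speaker when the best cosine
    similarity falls below `threshold`. Issued labels are never revised."""

    def __init__(self, threshold: float = 0.7):
        self.threshold = threshold
        self.centroids = []   # one running-mean vector per speaker
        self.counts = []

    def assign(self, emb: np.ndarray) -> int:
        emb = emb / np.linalg.norm(emb)
        if self.centroids:
            sims = [float(c @ emb) / np.linalg.norm(c) for c in self.centroids]
            best = int(np.argmax(sims))
            if sims[best] >= self.threshold:
                # Update the running mean so the centroid can track slow
                # channel drift without relabeling past segments.
                self.counts[best] += 1
                self.centroids[best] += (emb - self.centroids[best]) / self.counts[best]
                return best
        self.centroids.append(emb.copy())
        self.counts.append(1)
        return len(self.centroids) - 1
```

The running-mean update is the crude defense against identity drift; the fixed threshold is again the fragile part, since a channel change can push a known speaker below it.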

8. Evaluation: DER Is Not Enough

The classical metric is Diarization Error Rate (DER), composed of:

  • Missed speech
  • False alarm speech
  • Speaker confusion

But DER can be misleading:

  • A strong VAD can reduce miss/false alarm but increase confusion.
  • Overlap scoring conventions vary.
  • Collar sizes (boundary tolerance) change results significantly.

Also report:

  • JER (Jaccard Error Rate) for overlap-aware scoring
  • Confusion-only components
  • Performance under short-turn conditions
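A frame-level sketch of the DER components for the single-speaker-per-frame case (real scoring tools add boundary collars, optimal speaker mapping, and overlap handling on top of this):

```python
def frame_der(ref, hyp):
    """Frame-level DER sketch, single speaker per frame. `ref` and `hyp`
    are per-frame labels, with None meaning silence. Returns
    (miss, false_alarm, confusion, der); DER is normalized by reference
    speech frames, as is conventional."""
    miss = fa = conf = 0
    ref_speech = sum(1 for r in ref if r is not None)
    for r, h in zip(ref, hyp):
        if r is not None and h is None:
            miss += 1
        elif r is None and h is not None:
            fa += 1
        elif r is not None and h is not None and r != h:
            conf += 1
    der = (miss + fa + conf) / ref_speech
    return miss, fa, conf, der
```

Reporting the three components separately (not just the sum) is what exposes the VAD-vs-confusion trade-off described above.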

9. A Practical Research Pipeline (Strong Baseline)

  1. VAD with conservative settings (avoid chopping speech).
  2. Extract ECAPA embeddings over sliding windows (1–2 seconds with overlap).
  3. Perform clustering with robust speaker-count selection.
  4. Apply TS-VAD resegmentation conditioned on cluster speaker profiles.
  5. Add overlap-aware scoring and report DER + JER.
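Step 2's sliding windows can be sketched as follows (the 1.5 s window and 50% overlap are illustrative defaults within the 1–2 s range recommended above):

```python
def sliding_windows(duration: float, window: float = 1.5, step: float = 0.75):
    """Generate (start, end) windows for embedding extraction:
    fixed-length windows with 50% overlap across a speech region."""
    t, out = 0.0, []
    while t + window <= duration:
        out.append((round(t, 6), round(t + window, 6)))
        t += step
    if not out:   # region shorter than one window: keep it whole
        out.append((0.0, duration))
    return out
```

Overlapping windows matter because each frame then gets covered by multiple embeddings, which stabilizes clustering on short turns.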

10. Common Failure Modes and Fixes

  • Short turns: embeddings are noisy.
    • Fix: longer windows with overlap + attention pooling.
  • Channel shifts: speaker embedding changes across devices.
    • Fix: domain-adaptive training or PLDA-like scoring.
  • Overtalk: overlap dominates.
    • Fix: overlap-aware diarization or separation-first.

Conclusion

Diarization is a systems problem. Embeddings matter, but clustering heuristics and overlap handling often dominate real-world performance. For researchers, a modern baseline is: ECAPA embeddings + robust clustering + TS-VAD resegmentation + overlap-aware evaluation.
