Voice Model Deep Dives · 18 min read

Voice AI Deep Dive: Speech-to-Speech Models (Codecs, Prosody, and Full-Duplex Challenges)

Introduction

Speech-to-speech (S2S) models promise something voice agents have wanted for decades:

  • Listen continuously
  • Understand and respond naturally
  • Speak back with low latency

But S2S is not “ASR + LLM + TTS” glued together. Modern systems often use:

  • Learned discrete speech representations (codecs)
  • Unit language models over codec tokens
  • Neural vocoders for waveform synthesis
  • Real-time constraints that force design compromises

This deep dive explains the core architecture patterns and the hardest research problems: prosody control and full-duplex interaction.

1. The Classical Cascade vs Modern S2S

Cascade (ASR → text reasoning → TTS)

Pros:

  • Modular, debuggable
  • Text is an interpretable intermediate representation

Cons:

  • Adds latency (ASR + LLM + TTS)
  • Loses prosody and non-verbal cues
  • Barge-in and overlap handling is awkward

S2S (speech → speech)

Pros:

  • Potentially lower latency
  • Can preserve prosody and paralinguistic information
  • More “human-like” interactions

Cons:

  • Harder to control, debug, and evaluate
  • Harder to guarantee correctness

Many production systems are hybrid: partial text supervision + speech generation.

2. Codec Models: Turning Waveforms into Tokens

A codec model learns to compress speech into a sequence of discrete tokens:

  • Encoder maps waveform to latent representation.
  • Quantizer discretizes latents (vector quantization).
  • Decoder reconstructs waveform from tokens.
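The quantizer step above can be sketched with plain NumPy: each latent frame is snapped to its nearest codebook entry, and the entry's index becomes the token. The encoder and decoder networks are stubbed out, and names like `codebook` are illustrative, not from any specific library.

```python
# Minimal sketch of the quantizer step in a neural codec.
import numpy as np

rng = np.random.default_rng(0)

# Pretend the encoder produced 50 latent frames of dimension 8.
latents = rng.normal(size=(50, 8))

# A learned codebook of 256 entries (random here, for illustration).
codebook = rng.normal(size=(256, 8))

def quantize(latents, codebook):
    """Map each latent frame to the index of its nearest codebook entry."""
    # Squared distance between every frame and every codebook vector.
    d = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)  # one discrete token per frame

tokens = quantize(latents, codebook)
print(tokens.shape)  # (50,) -- the speech stream is now a token stream
```

Real codecs typically stack several such quantizers (residual vector quantization), which is what drives the bitrate-versus-quality tradeoff below.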

Why this is useful:

  • A speech stream becomes a token stream.
  • You can train “speech LMs” over these tokens.
  • Generation becomes feasible at scale.

Key research properties:

  • Bitrate vs quality tradeoff
  • Token rate (tokens per second)
  • Reconstruction fidelity and speaker similarity

3. Unit Language Models: “Text-Like” Modeling over Speech Tokens

Once you have tokens, you can train an autoregressive model:

  • Predict next codec token conditioned on previous tokens (and possibly text or semantics).

But speech tokens encode more than content:

  • Speaker identity
  • Prosody and emotion
  • Background noise

This is both a feature and a risk:

  • Feature: richer conditioning for natural speech.
  • Risk: the model can learn spurious correlations and leak style.
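The autoregressive setup is the same as for a text LM: shift the token sequence by one to form (input, target) pairs, then generate by repeatedly predicting the next token. The sketch below uses a toy bigram counter as a stand-in model; the vocabulary size and token values are made up.

```python
# Sketch of next-token modeling over codec tokens.
import numpy as np

VOCAB = 8  # tiny codec vocabulary for illustration
sequence = np.array([1, 3, 3, 7, 1, 3, 7, 7, 1, 3])

# Next-token pairs used as (input, target) during training.
inputs, targets = sequence[:-1], sequence[1:]

# Stand-in "model": bigram transition counts with add-one smoothing.
counts = np.ones((VOCAB, VOCAB))
for a, b in zip(inputs, targets):
    counts[a, b] += 1
probs = counts / counts.sum(axis=1, keepdims=True)

# Greedy generation: repeatedly predict the most likely next token.
generated = [1]
for _ in range(5):
    generated.append(int(probs[generated[-1]].argmax()))
print(generated)  # [1, 3, 7, 1, 3, 7]
```

In a real system the model is a Transformer and the tokens carry speaker and prosody information, which is exactly why the style-leakage risk above matters.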

4. Prosody Control: The Core Differentiator

Text-based TTS can be monotone without careful engineering. S2S can preserve prosody, but you still need control:

  • Rate and rhythm
  • Intonation contours
  • Emphasis and pause structure
  • Emotion and speaking style

Research approaches:

  • Explicit prosody tokens
  • Separate predictors for pitch/energy/duration
  • Conditioning on reference speech (“style transfer”)
  • Constraints during decoding (avoid drift)
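The "explicit prosody tokens" approach can be sketched as follows: bucket a per-frame pitch estimate into a few discrete bins and interleave the resulting prosody tokens with content tokens, so the unit LM can condition on and generate prosody explicitly. The bin edges, token values, and id offset are all illustrative assumptions.

```python
# Sketch: interleave discrete pitch tokens with content tokens.
import numpy as np

content = [12, 40, 40, 7, 19]                   # codec/content tokens (made up)
pitch_hz = [110.0, 150.0, 220.0, 95.0, 180.0]   # per-frame F0 estimate

PITCH_BINS = np.array([100.0, 160.0, 240.0])    # 4 buckets: <100, 100-160, 160-240, >240

def pitch_token(f0):
    # Offset so prosody tokens occupy their own id range (here: 1000+).
    return 1000 + int(np.searchsorted(PITCH_BINS, f0))

interleaved = []
for c, f0 in zip(content, pitch_hz):
    interleaved += [pitch_token(f0), c]
print(interleaved)  # [1001, 12, 1001, 40, 1002, 40, 1000, 7, 1002, 19]
```

The same pattern extends to energy and duration tokens; the separate-predictor approach instead regresses these values from text or semantic features and feeds them to the decoder.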

If you cannot control prosody, S2S agents sound unpredictable.

5. Vocoders: Turning Tokens Back into Waveforms

Even if your model outputs codec tokens, you still need high-quality audio output.

Families:

  • GAN-based vocoders: fast, can be brittle under domain shift
  • Diffusion-based vocoders: high quality, can be slower (though accelerated variants exist)

In real-time agents, vocoder latency is often the dominant component.
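One common latency-hiding trick is to vocode in short chunks and cross-fade consecutive chunks to mask boundary artifacts. The sketch below assumes illustrative chunk and fade sizes, and `fake_vocoder` is a stand-in for a real GAN or diffusion vocoder call.

```python
# Sketch of streaming vocoding with a short cross-fade between chunks.
import numpy as np

SR = 24000
CHUNK = 1200   # 50 ms of audio per vocoder call
XFADE = 120    # 5 ms cross-fade between consecutive chunks

def fake_vocoder(tokens):
    # Stand-in: real vocoders map codec tokens to waveform samples.
    rng = np.random.default_rng(int(sum(tokens)))
    return rng.normal(scale=0.1, size=CHUNK)

def stream(chunks_of_tokens):
    fade_in = np.linspace(0.0, 1.0, XFADE)
    out, tail = [], None
    for tokens in chunks_of_tokens:
        wav = fake_vocoder(tokens)
        if tail is not None:
            # Blend the previous chunk's tail into this chunk's head.
            wav[:XFADE] = wav[:XFADE] * fade_in + tail * (1 - fade_in)
        out.append(wav[:-XFADE])
        tail = wav[-XFADE:]
    out.append(tail)
    return np.concatenate(out)

audio = stream([[1, 2], [3, 4], [5, 6]])
print(len(audio) / SR)  # total seconds of audio produced
```

Smaller chunks reduce time-to-first-audio but increase the number of vocoder calls, which is exactly the streaming tradeoff to profile.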

6. Full-Duplex: The Hardest Problem in Voice Agents

Full-duplex means:

  • The agent can listen while speaking.
  • The user can interrupt at any time.

This introduces:

  • Acoustic echo cancellation (AEC)
  • Self-interference (agent hears itself)
  • Turn-taking complexity (agent must decide when to yield)

S2S models can also become unstable if they receive their own audio as input.

For robust full-duplex:

  • AEC is non-negotiable.
  • VAD must be calibrated to ignore playback audio.
  • The system must tightly coordinate playback stop/start with inference.
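That coordination can be sketched as a small state machine: the controller stops playback the moment echo-cancelled VAD detects user speech while the agent is speaking. The states and method names here are illustrative; a real controller would also manage the audio device and inference cancellation.

```python
# Sketch of turn-taking coordination for barge-in handling.

class DuplexController:
    def __init__(self):
        self.state = "LISTENING"

    def on_agent_audio_ready(self):
        if self.state == "LISTENING":
            self.state = "SPEAKING"

    def on_vad_frame(self, user_speech: bool):
        # Assumes the VAD input is already echo-cancelled (AEC applied),
        # so the agent's own playback cannot trigger a false barge-in.
        if user_speech and self.state == "SPEAKING":
            self.state = "BARGED_IN"  # stop playback, start listening
        elif not user_speech and self.state == "BARGED_IN":
            self.state = "LISTENING"
        return self.state

ctrl = DuplexController()
ctrl.on_agent_audio_ready()
print(ctrl.on_vad_frame(user_speech=True))   # BARGED_IN
print(ctrl.on_vad_frame(user_speech=False))  # LISTENING
```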

7. Latency Budgeting: Where the Milliseconds Go

For an interactive agent, aim to optimize:

  • Input buffering (how much audio you wait to process)
  • Model compute (unit LM + vocoder)
  • Output buffering (how much audio you generate before playing)

A common mistake:

  • Optimizing the LM while the vocoder dominates total delay.

Researchers should profile end-to-end latency in realistic pipelines.
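A back-of-envelope budget makes the mistake above concrete: sum the per-component delays for time-to-first-audio and find the bottleneck. All numbers below are illustrative assumptions, not measurements.

```python
# Illustrative time-to-first-audio budget (all values in ms).
budget = {
    "input_buffering": 80,       # audio collected before inference starts
    "unit_lm_first_tokens": 120,
    "vocoder_first_chunk": 180,
    "output_buffering": 60,      # audio queued before playback begins
}

total = sum(budget.values())
bottleneck = max(budget, key=budget.get)
print(total)       # 440 ms time-to-first-audio
print(bottleneck)  # vocoder dominates in this illustrative budget
```

With these numbers, halving LM latency saves 60 ms while halving vocoder latency saves 90 ms; only profiling tells you which lever matters in your pipeline.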

8. Evaluation: Beyond MOS

Mean Opinion Score (MOS) is useful but insufficient. For S2S agents, also measure:

  • Conversational latency: time-to-first-audio-response
  • Interruption success: barge-in latency and false barge-in rate
  • Content fidelity: does the response preserve intended semantics?
  • Speaker consistency: identity drift across long interactions
  • Prosody controllability: can you reliably produce emphasis and timing?
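Two of the interruption metrics above are straightforward to compute from event logs: barge-in latency (user speech onset to playback stop) and the false barge-in rate. The event timestamps below are simulated for illustration.

```python
# Sketch: interruption metrics from logged barge-in events.
# Each event: (user_onset_ms, playback_stop_ms, was_real_interruption).
events = [
    (1000, 1180, True),
    (5000, 5120, True),
    (9000, 9090, False),  # playback stopped on non-speech noise
]

latencies = [stop - onset for onset, stop, real in events if real]
mean_latency = sum(latencies) / len(latencies)
false_rate = sum(1 for *_, real in events if not real) / len(events)

print(mean_latency)          # 150.0 ms mean barge-in latency
print(round(false_rate, 2))  # 0.33
```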

Also consider safety:

  • Speech generation can produce convincing but incorrect outputs.

9. A Research Baseline Stack

A defensible baseline for S2S research:

  1. Codec model for tokenization.
  2. Unit LM conditioned on text semantics (or discrete semantic tokens).
  3. Fast vocoder (optimized for streaming).
  4. AEC + barge-in aware VAD for duplex interaction.
  5. Evaluation suite with latency + interruption + fidelity metrics.

Conclusion

S2S is the frontier because it combines speech compression, token modeling, and real-time interactive systems engineering. Researchers who treat duplex interaction, prosody control, and end-to-end latency as first-class evaluation targets will build systems that feel dramatically more human.
