Voice AI Deep Dive: Speech-to-Speech Models (Codecs, Prosody, and Full-Duplex Challenges)
Introduction
Speech-to-speech (S2S) models promise something voice agents have wanted for decades:
- Listen continuously
- Understand and respond naturally
- Speak back with low latency
But S2S is not “ASR + LLM + TTS” glued together. Modern systems often use:
- Learned discrete speech representations (codecs)
- Unit language models over codec tokens
- Neural vocoders for waveform synthesis
- Real-time constraints that force design compromises
This deep dive explains the core architecture patterns and the hardest research problems: prosody control and full-duplex interaction.
1. The Classical Cascade vs Modern S2S
Cascade (ASR → text reasoning → TTS)
Pros:
- Modular, debuggable
- Text is an interpretable intermediate representation
Cons:
- Adds latency (ASR + LLM + TTS)
- Loses prosody and non-verbal cues
- Barge-in and overlap handling is awkward
S2S (speech → speech)
Pros:
- Potentially lower latency
- Can preserve prosody and paralinguistic information
- More “human-like” interactions
Cons:
- Harder to control, debug, and evaluate
- Harder to guarantee correctness
Many production systems are hybrid: partial text supervision + speech generation.
2. Codec Models: Turning Waveforms into Tokens
A codec model learns to compress speech into a sequence of discrete or quantized tokens:
- Encoder maps waveform to latent representation.
- Quantizer discretizes latents (vector quantization).
- Decoder reconstructs waveform from tokens.
Why this is useful:
- A speech stream becomes a token stream.
- You can train “speech LMs” over these tokens.
- Autoregressive generation over speech becomes feasible at scale, using standard language-model tooling.
Key research properties:
- Bitrate vs quality tradeoff
- Token rate (tokens per second)
- Reconstruction fidelity and speaker similarity
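The quantizer step can be sketched as a toy nearest-neighbor lookup. The codebook and latents below are random placeholders, not a trained codec; the point is only the shape of the computation:

```python
import numpy as np

# Toy vector quantizer: map each latent frame to its nearest codebook entry.
# Codebook and latents are random placeholders, not a trained codec.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 64))   # 256 entries, 64-dim latents
latents = rng.normal(size=(100, 64))    # 100 frames of encoder output

# Squared distances between each frame and every codebook entry.
dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
tokens = dists.argmin(axis=1)           # one discrete token per frame

# Decoder-side lookup: tokens back to quantized latents.
quantized = codebook[tokens]
print(tokens.shape, quantized.shape)    # (100,) (100, 64)
```

Real codecs stack several such quantizers (residual vector quantization), which is exactly where the bitrate-vs-quality and token-rate tradeoffs above come from.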
3. Unit Language Models: “Text-Like” Modeling over Speech Tokens
Once you have tokens, you can train an autoregressive model:
- Predict next codec token conditioned on previous tokens (and possibly text or semantics).
But speech tokens encode more than content:
- Speaker identity
- Prosody and emotion
- Background noise
This is both a feature and a risk:
- Feature: richer conditioning for natural speech.
- Risk: the model can learn spurious correlations and leak style.
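A minimal sketch of the autoregressive idea, with a random bigram transition matrix standing in for a trained unit LM (real models condition on a long context, not just the previous token):

```python
import numpy as np

# Toy autoregressive "unit LM": sample codec tokens one at a time.
# The transition matrix stands in for a trained model's next-token distribution.
rng = np.random.default_rng(1)
vocab = 256                                   # codec codebook size
transitions = rng.dirichlet(np.ones(vocab), size=vocab)  # P(next | prev)

tokens = [0]                                  # start token (placeholder choice)
for _ in range(50):
    probs = transitions[tokens[-1]]           # condition on the previous token
    tokens.append(int(rng.choice(vocab, p=probs)))
print(len(tokens))  # 51
```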
4. Prosody Control: The Core Differentiator
Text-based TTS can sound monotone without careful engineering. S2S can preserve prosody, but you still need explicit control over:
- Rate and rhythm
- Intonation contours
- Emphasis and pause structure
- Emotion and speaking style
Research approaches:
- Explicit prosody tokens
- Separate predictors for pitch/energy/duration
- Conditioning on reference speech (“style transfer”)
- Constraints during decoding (avoid drift)
If you cannot control prosody, S2S agents sound unpredictable.
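Explicit prosody tokens can be as simple as binned pitch values interleaved with content tokens. The bin edges below are illustrative, not taken from any particular system:

```python
import numpy as np

# Sketch: quantize a pitch (F0) contour into discrete prosody tokens.
# Bin edges in Hz are illustrative; 0.0 marks an unvoiced frame.
f0 = np.array([110.0, 115.0, 140.0, 180.0, 160.0, 0.0, 120.0])
bins = np.array([80, 100, 125, 150, 200, 300])

# Token 0 is reserved for unvoiced frames; voiced frames map to pitch bins.
prosody_tokens = np.where(f0 > 0, np.digitize(f0, bins), 0)
print(prosody_tokens.tolist())  # [2, 2, 3, 4, 4, 0, 2]
```

The same binning trick applies to energy and duration, giving the separate pitch/energy/duration predictors mentioned above a discrete target to hit.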
5. Vocoders: Turning Tokens Back into Waveforms
Even if your model outputs codec tokens, you still need high-quality audio output.
Families:
- GAN-based vocoders: fast, can be brittle under domain shift
- Diffusion-based vocoders: high quality, can be slower (though accelerated variants exist)
In real-time agents, the vocoder is often the dominant contributor to end-to-end latency.
6. Full-Duplex: The Hardest Problem in Voice Agents
Full-duplex means:
- The agent can listen while speaking.
- The user can interrupt at any time.
This introduces:
- Acoustic echo cancellation (AEC)
- Self-interference (agent hears itself)
- Turn-taking complexity (agent must decide when to yield)
S2S models can also become unstable if they receive their own audio as input.
For robust full-duplex:
- AEC is non-negotiable.
- VAD must be calibrated to ignore playback audio.
- The system must tightly coordinate playback stop/start with inference.
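The playback stop/start coordination reduces to a small state machine. This is a minimal sketch; `user_speaking` is assumed to come from an echo-cancelled, playback-aware VAD, and the action strings are placeholders for real playback controls:

```python
from enum import Enum

class State(Enum):
    LISTENING = 1
    SPEAKING = 2

# Minimal barge-in state machine: the agent yields the floor when user
# speech is detected during its own playback. `user_speaking` is assumed
# to be an echo-cancelled VAD decision on the microphone signal.
def step(state: State, user_speaking: bool) -> tuple[State, str]:
    if state is State.SPEAKING and user_speaking:
        return State.LISTENING, "stop_playback"   # barge-in: yield
    if state is State.LISTENING and not user_speaking:
        return State.SPEAKING, "start_playback"   # user quiet: respond
    return state, "noop"

state, action = step(State.SPEAKING, user_speaking=True)
print(state, action)  # State.LISTENING stop_playback
```

A production version needs hysteresis (debounce timers on both transitions) so transient noise does not flap the state, but the coordination problem is the same.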
7. Latency Budgeting: Where the Milliseconds Go
For an interactive agent, budget and optimize three components:
- Input buffering (how much audio you wait to process)
- Model compute (unit LM + vocoder)
- Output buffering (how much audio you generate before playing)
A common mistake:
- Optimizing the LM while the vocoder dominates total delay.
Researchers should profile end-to-end latency in realistic pipelines.
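Per-stage profiling can be as simple as wall-clock timing around each component. The `time.sleep` calls below are stand-ins for real stages, chosen to illustrate the vocoder-dominated case:

```python
import time

# Sketch: profile each pipeline stage to see where the milliseconds go.
# The stage bodies are simulated delays; swap in real components.
def buffer_input():  time.sleep(0.02)   # 20 ms input buffering (simulated)
def run_unit_lm():   time.sleep(0.05)   # 50 ms model compute (simulated)
def run_vocoder():   time.sleep(0.08)   # 80 ms vocoder (simulated)

budget = {}
for name, stage in [("input", buffer_input), ("unit_lm", run_unit_lm),
                    ("vocoder", run_vocoder)]:
    t0 = time.perf_counter()
    stage()
    budget[name] = (time.perf_counter() - t0) * 1000  # ms

total = sum(budget.values())
for name, ms in budget.items():
    print(f"{name}: {ms:.1f} ms ({100 * ms / total:.0f}% of total)")
```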
8. Evaluation: Beyond MOS
Mean Opinion Score (MOS) is useful but insufficient. For S2S agents, also measure:
- Conversational latency: time-to-first-audio-response
- Interruption success: barge-in latency and false barge-in rate
- Content fidelity: does the response preserve intended semantics?
- Speaker consistency: identity drift across long interactions
- Prosody controllability: can you reliably produce emphasis and timing?
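The interruption metrics reduce to arithmetic over logged event timestamps. The event schema below is illustrative, not a standard format:

```python
# Sketch: compute barge-in metrics from logged event timestamps (seconds).
# Field names are illustrative; `true_barge_in` marks intentional interrupts.
events = [
    {"user_speech_start": 3.20, "playback_stop": 3.45, "true_barge_in": True},
    {"user_speech_start": 7.10, "playback_stop": 7.30, "true_barge_in": True},
    {"user_speech_start": 9.00, "playback_stop": 9.40, "true_barge_in": False},
]

latencies = [e["playback_stop"] - e["user_speech_start"]
             for e in events if e["true_barge_in"]]
false_rate = sum(not e["true_barge_in"] for e in events) / len(events)

print(f"mean barge-in latency: {1000 * sum(latencies) / len(latencies):.0f} ms")
print(f"false barge-in rate: {false_rate:.2f}")
```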
Also consider safety:
- Speech generation can produce convincing but incorrect outputs.
9. A Research Baseline Stack
A defensible baseline for S2S research:
- Codec model for tokenization.
- Unit LM conditioned on text semantics (or discrete semantic tokens).
- Fast vocoder (optimized for streaming).
- AEC + barge-in aware VAD for duplex interaction.
- Evaluation suite with latency + interruption + fidelity metrics.
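One way to pin the stack down is a single configuration object that makes the key tradeoffs explicit. Every default below is an illustrative placeholder, not a recommendation from any particular system:

```python
from dataclasses import dataclass

# Sketch of a baseline-stack configuration; all values are illustrative.
@dataclass
class S2SConfig:
    codec_token_rate_hz: int = 50      # codec tokens per second
    codebook_size: int = 1024          # entries per quantizer codebook
    lm_context_tokens: int = 2048      # unit LM context window
    vocoder_chunk_ms: int = 40         # streaming vocoder chunk size
    vad_hangover_ms: int = 200         # silence before end-of-turn
    aec_enabled: bool = True           # acoustic echo cancellation

cfg = S2SConfig()
# Approximate bitrate from one codebook: token rate × bits per token.
bits_per_token = cfg.codebook_size.bit_length() - 1   # log2(1024) = 10
print(cfg.codec_token_rate_hz * bits_per_token, "bps")  # 500 bps
```

Deriving the bitrate from the config like this keeps the codec's bitrate-vs-quality tradeoff visible alongside the latency knobs it interacts with.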
Conclusion
S2S is the frontier because it combines speech compression, token modeling, and real-time interactive systems engineering. Researchers who treat duplex interaction, prosody control, and end-to-end latency as first-class evaluation targets will build systems that feel dramatically more human.
