Voice AI Deep Dive: Speech-to-Speech Models (Codecs, Prosody, and Full-Duplex Challenges)
Introduction
Speech-to-speech (S2S) models promise something voice agents have wanted for decades:
- Listen continuously
- Understand and respond naturally
- Speak back with low latency
But S2S is not “ASR + LLM + TTS” glued together. Modern systems often use:
- Learned discrete speech representations (codecs)
- Unit language models over codec tokens
- Neural vocoders for waveform synthesis
- Real-time constraints that force design compromises
This deep dive explains the core architecture patterns and the hardest research problems: prosody control and full-duplex interaction.
1. The Classical Cascade vs Modern S2S
Cascade (ASR → text reasoning → TTS)
Pros:
- Modular, debuggable
- Text is an interpretable intermediate representation
Cons:
- Adds latency (ASR + LLM + TTS)
- Loses prosody and non-verbal cues
- Barge-in and overlap handling is awkward
S2S (speech → speech)
Pros:
- Potentially lower latency
- Can preserve prosody and paralinguistic information
- More “human-like” interactions
Cons:
- Harder to control, debug, and evaluate
- Harder to guarantee correctness
Many production systems are hybrid: partial text supervision + speech generation.
2. Codec Models: Turning Waveforms into Tokens
A codec model learns to compress speech into a sequence of discrete or quantized tokens:
- Encoder maps waveform to latent representation.
- Quantizer discretizes latents (vector quantization).
- Decoder reconstructs waveform from tokens.
Why this is useful:
- A speech stream becomes a token stream.
- You can train “speech LMs” over these tokens.
- Autoregressive generation over speech becomes feasible at scale, using standard language-model tooling.
Key research properties:
- Bitrate vs quality tradeoff
- Token rate (tokens per second)
- Reconstruction fidelity and speaker similarity
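The quantizer step can be sketched as a toy nearest-neighbor lookup. The codebook and latents below are random placeholders, not a trained codec; the point is only the shape of the computation:

```python
import numpy as np

# Toy vector quantizer: map each latent frame to its nearest codebook entry.
# Codebook and latents are random placeholders, not a trained codec.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 64))   # 256 entries, 64-dim latents
latents = rng.normal(size=(100, 64))    # 100 frames of encoder output

# Squared distances between each frame and every codebook entry.
dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
tokens = dists.argmin(axis=1)           # one discrete token per frame

# Decoder-side lookup: tokens back to quantized latents.
quantized = codebook[tokens]
print(tokens.shape, quantized.shape)    # (100,) (100, 64)
```

Real codecs stack several such quantizers (residual vector quantization), which is exactly where the bitrate-vs-quality and token-rate tradeoffs above come from.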
3. Unit Language Models: “Text-Like” Modeling over Speech Tokens
Once you have tokens, you can train an autoregressive model:
- Predict next codec token conditioned on previous tokens (and possibly text or semantics).
But speech tokens encode more than content:
- Speaker identity
- Prosody and emotion
- Background noise
This is both a feature and a risk:
- Feature: richer conditioning for natural speech.
- Risk: the model can learn spurious correlations and leak style.
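A minimal sketch of the autoregressive idea, with a random bigram transition matrix standing in for a trained unit LM (real models condition on a long context, not just the previous token):

```python
import numpy as np

# Toy autoregressive "unit LM": sample codec tokens one at a time.
# The transition matrix stands in for a trained model's next-token distribution.
rng = np.random.default_rng(1)
vocab = 256                                   # codec codebook size
transitions = rng.dirichlet(np.ones(vocab), size=vocab)  # P(next | prev)

tokens = [0]                                  # start token (placeholder choice)
for _ in range(50):
    probs = transitions[tokens[-1]]           # condition on the previous token
    tokens.append(int(rng.choice(vocab, p=probs)))
print(len(tokens))  # 51
```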
4. Prosody Control: The Core Differentiator
Text-based TTS can sound monotone without careful engineering. S2S can preserve prosody, but you still need explicit control over:
- Rate and rhythm
- Intonation contours
- Emphasis and pause structure
- Emotion and speaking style
Research approaches:
- Explicit prosody tokens
- Separate predictors for pitch/energy/duration
- Conditioning on reference speech (“style transfer”)
- Constraints during decoding (avoid drift)
If you cannot control prosody, S2S agents sound unpredictable.
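Explicit prosody tokens can be as simple as binned pitch values interleaved with content tokens. The bin edges below are illustrative, not taken from any particular system:

```python
import numpy as np

# Sketch: quantize a pitch (F0) contour into discrete prosody tokens.
# Bin edges in Hz are illustrative; 0.0 marks an unvoiced frame.
f0 = np.array([110.0, 115.0, 140.0, 180.0, 160.0, 0.0, 120.0])
bins = np.array([80, 100, 125, 150, 200, 300])

# Token 0 is reserved for unvoiced frames; voiced frames map to pitch bins.
prosody_tokens = np.where(f0 > 0, np.digitize(f0, bins), 0)
print(prosody_tokens.tolist())  # [2, 2, 3, 4, 4, 0, 2]
```

The same binning trick applies to energy and duration, giving the separate pitch/energy/duration predictors mentioned above a discrete target to hit.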
5. Vocoders: Turning Tokens Back into Waveforms
Even if your model outputs codec tokens, you still need high-quality audio output.
Families:
- GAN-based vocoders: fast, can be brittle under domain shift
- Diffusion-based vocoders: high quality, can be slower (though accelerated variants exist)
In real-time agents, the vocoder is often the dominant contributor to end-to-end latency.
6. Full-Duplex: The Hardest Problem in Voice Agents
Full-duplex means:
- The agent can listen while speaking.
- The user can interrupt at any time.
This introduces:
- Acoustic echo cancellation (AEC)
- Self-interference (agent hears itself)
- Turn-taking complexity (agent must decide when to yield)
S2S models can also become unstable if they receive their own audio as input.
For robust full-duplex:
- AEC is non-negotiable.
- VAD must be calibrated to ignore playback audio.
- The system must tightly coordinate playback stop/start with inference.
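The playback stop/start coordination reduces to a small state machine. This is a minimal sketch; `user_speaking` is assumed to come from an echo-cancelled, playback-aware VAD, and the action strings are placeholders for real playback controls:

```python
from enum import Enum

class State(Enum):
    LISTENING = 1
    SPEAKING = 2

# Minimal barge-in state machine: the agent yields the floor when user
# speech is detected during its own playback. `user_speaking` is assumed
# to be an echo-cancelled VAD decision on the microphone signal.
def step(state: State, user_speaking: bool) -> tuple[State, str]:
    if state is State.SPEAKING and user_speaking:
        return State.LISTENING, "stop_playback"   # barge-in: yield
    if state is State.LISTENING and not user_speaking:
        return State.SPEAKING, "start_playback"   # user quiet: respond
    return state, "noop"

state, action = step(State.SPEAKING, user_speaking=True)
print(state, action)  # State.LISTENING stop_playback
```

A production version needs hysteresis (debounce timers on both transitions) so transient noise does not flap the state, but the coordination problem is the same.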
7. Latency Budgeting: Where the Milliseconds Go
For an interactive agent, budget and optimize three components:
- Input buffering (how much audio you wait to process)
- Model compute (unit LM + vocoder)
- Output buffering (how much audio you generate before playing)
A common mistake:
- Optimizing the LM while the vocoder dominates total delay.
Researchers should profile end-to-end latency in realistic pipelines.
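Per-stage profiling can be as simple as wall-clock timing around each component. The `time.sleep` calls below are stand-ins for real stages, chosen to illustrate the vocoder-dominated case:

```python
import time

# Sketch: profile each pipeline stage to see where the milliseconds go.
# The stage bodies are simulated delays; swap in real components.
def buffer_input():  time.sleep(0.02)   # 20 ms input buffering (simulated)
def run_unit_lm():   time.sleep(0.05)   # 50 ms model compute (simulated)
def run_vocoder():   time.sleep(0.08)   # 80 ms vocoder (simulated)

budget = {}
for name, stage in [("input", buffer_input), ("unit_lm", run_unit_lm),
                    ("vocoder", run_vocoder)]:
    t0 = time.perf_counter()
    stage()
    budget[name] = (time.perf_counter() - t0) * 1000  # ms

total = sum(budget.values())
for name, ms in budget.items():
    print(f"{name}: {ms:.1f} ms ({100 * ms / total:.0f}% of total)")
```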
8. Evaluation: Beyond MOS
Mean Opinion Score (MOS) is useful but insufficient. For S2S agents, also measure:
- Conversational latency: time-to-first-audio-response
- Interruption success: barge-in latency and false barge-in rate
- Content fidelity: does the response preserve intended semantics?
- Speaker consistency: identity drift across long interactions
- Prosody controllability: can you reliably produce emphasis and timing?
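The interruption metrics reduce to arithmetic over logged event timestamps. The event schema below is illustrative, not a standard format:

```python
# Sketch: compute barge-in metrics from logged event timestamps (seconds).
# Field names are illustrative; `true_barge_in` marks intentional interrupts.
events = [
    {"user_speech_start": 3.20, "playback_stop": 3.45, "true_barge_in": True},
    {"user_speech_start": 7.10, "playback_stop": 7.30, "true_barge_in": True},
    {"user_speech_start": 9.00, "playback_stop": 9.40, "true_barge_in": False},
]

latencies = [e["playback_stop"] - e["user_speech_start"]
             for e in events if e["true_barge_in"]]
false_rate = sum(not e["true_barge_in"] for e in events) / len(events)

print(f"mean barge-in latency: {1000 * sum(latencies) / len(latencies):.0f} ms")
print(f"false barge-in rate: {false_rate:.2f}")
```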
Also consider safety:
- Speech generation can produce convincing but incorrect outputs.
9. A Research Baseline Stack
A defensible baseline for S2S research:
- Codec model for tokenization.
- Unit LM conditioned on text semantics (or discrete semantic tokens).
- Fast vocoder (optimized for streaming).
- AEC + barge-in aware VAD for duplex interaction.
- Evaluation suite with latency + interruption + fidelity metrics.
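One way to pin the stack down is a single configuration object that makes the key tradeoffs explicit. Every default below is an illustrative placeholder, not a recommendation from any particular system:

```python
from dataclasses import dataclass

# Sketch of a baseline-stack configuration; all values are illustrative.
@dataclass
class S2SConfig:
    codec_token_rate_hz: int = 50      # codec tokens per second
    codebook_size: int = 1024          # entries per quantizer codebook
    lm_context_tokens: int = 2048      # unit LM context window
    vocoder_chunk_ms: int = 40         # streaming vocoder chunk size
    vad_hangover_ms: int = 200         # silence before end-of-turn
    aec_enabled: bool = True           # acoustic echo cancellation

cfg = S2SConfig()
# Approximate bitrate from one codebook: token rate × bits per token.
bits_per_token = cfg.codebook_size.bit_length() - 1   # log2(1024) = 10
print(cfg.codec_token_rate_hz * bits_per_token, "bps")  # 500 bps
```

Deriving the bitrate from the config like this keeps the codec's bitrate-vs-quality tradeoff visible alongside the latency knobs it interacts with.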
Conclusion
S2S is the frontier because it combines speech compression, token modeling, and real-time interactive systems engineering. Researchers who treat duplex interaction, prosody control, and end-to-end latency as first-class evaluation targets will build systems that feel dramatically more human.
