Voice Model Deep Dives · 17 min read

Voice AI Deep Dive: Multilingual ASR (Tokenization, Routing, LID, and Code-Switching)

Introduction

Multilingual ASR looks like “just train on more languages.” In reality, multilingual performance depends on how you allocate model capacity and how you handle ambiguity:

  • Which language is being spoken?
  • Are there multiple languages in the same utterance (code-switching)?
  • Should the model share subword units across scripts?
  • How do you avoid degrading high-resource languages while adding low-resource ones?

This deep dive focuses on architectural and data decisions researchers make when building multilingual speech models.

1. The First Decision: Shared vs Per-Language Tokenization

Shared vocabulary (global BPE/wordpiece)

Pros:

  • Parameter sharing across languages
  • Better transfer for related languages
  • Simpler model interface

Cons:

  • Scripts compete for vocabulary capacity
  • Rare scripts can get fragmented into long token sequences
  • Code-switching across scripts can be awkward

Per-language vocabularies

Pros:

  • Better efficiency per language
  • Cleaner token boundaries and fewer tokens per word

Cons:

  • Harder to share capacity
  • Requires language routing or language-specific heads
  • Less graceful code-switching

For many multilingual systems, shared vocabulary is simplest, but not always optimal for diverse scripts.
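The fragmentation problem is easy to see with a toy example. The sketch below uses a greedy longest-match segmenter (a stand-in for a real BPE/wordpiece model, with a hypothetical vocabulary): a shared vocabulary trained mostly on Latin-script text covers English words in one or two tokens, while an under-represented script falls back to single characters.

```python
def greedy_tokenize(word, vocab):
    """Greedy longest-match segmentation; unseen spans fall back to characters."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:  # single char is the fallback
                tokens.append(piece)
                i = j
                break
    return tokens

# Hypothetical shared vocab: decent English coverage, no Devanagari merges.
shared_vocab = {"hel", "lo", "hello", "lun", "ch", "lunch"}

print(greedy_tokenize("lunch", shared_vocab))  # one token
print(greedy_tokenize("खाना", shared_vocab))    # fragments into single characters
```

A rare script paying four tokens per word where English pays one is exactly the "tokens per word by script" asymmetry that per-language vocabularies avoid.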

2. Language Identification (LID): The Gatekeeper

If the system picks the wrong language, everything collapses:

  • Tokenization mismatch
  • Wrong pronunciation priors
  • Wrong normalization conventions

LID strategies:

  • Separate LID model upfront
  • Joint LID inside ASR (predict language tokens)
  • Mixture: quick LID + ASR confirmation

Researchers should evaluate LID accuracy under:

  • Short utterances (worst case)
  • Heavy accents
  • Noisy audio
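The mixture strategy can be sketched as follows. All function names and scores here are illustrative stand-ins, not a real API: a fast classifier proposes languages from the first second of audio, and the ASR decodes with the top-2 candidates and keeps the higher-scoring hypothesis, so a marginal LID call does not lock in a bad decision.

```python
def quick_lid(audio_prefix):
    """Stand-in for a lightweight LID model: returns ranked (lang, prob) pairs."""
    return [("hi", 0.55), ("en", 0.40), ("ta", 0.05)]

def asr_decode(audio, lang):
    """Stand-in for language-conditioned decoding: returns (text, log_prob)."""
    scores = {"hi": -42.0, "en": -35.5}  # toy decoder scores
    return f"<{lang} transcript>", scores.get(lang, -100.0)

def recognize(audio):
    # Quick LID on the first second (16 kHz), keep the top-2 candidates.
    candidates = [lang for lang, _ in quick_lid(audio[:16000])][:2]
    # Decode under each candidate and keep the higher-scoring hypothesis.
    decoded = [(asr_decode(audio, lang), lang) for lang in candidates]
    (text, _), lang = max(decoded, key=lambda d: d[0][1])
    return lang, text

lang, text = recognize([0.0] * 32000)  # dummy 2 s of 16 kHz audio
```

In this toy case the quick LID prefers Hindi, but the English-conditioned decode scores higher, so the confirmation step overrides it.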

3. Routing and Adapters: Avoiding “One Model To Rule Them All”

If you train one large model on many languages, capacity gets stretched. Routing helps:

  • Mixture-of-experts (MoE) layers: choose subsets of experts per input
  • Adapters: small language-specific modules in a shared backbone
  • Language-specific output heads

This lets you:

  • Preserve high-resource performance
  • Improve low-resource accuracy
  • Reduce negative transfer

But routing can be unstable:

  • Wrong routing decisions look like hallucinations or severe WER spikes.
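Of these options, adapters are the simplest to sketch. The pure-Python toy below (hypothetical weights, tiny dimensions) shows the standard residual bottleneck shape: a shared backbone activation is down-projected, passed through a nonlinearity, up-projected, and added back, so each language costs only `2 * d_model * d_bottleneck` extra parameters.

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def adapter(x, W_down, W_up):
    h = [max(0.0, v) for v in matvec(W_down, x)]  # ReLU bottleneck
    u = matvec(W_up, h)
    return [xi + ui for xi, ui in zip(x, u)]      # residual connection

# d_model = 4, d_bottleneck = 2 (hypothetical language-specific weights)
W_down = [[0.1, 0.0, 0.0, 0.0],
          [0.0, 0.1, 0.0, 0.0]]
W_up = [[1.0, 0.0],
        [0.0, 1.0],
        [0.0, 0.0],
        [0.0, 0.0]]

x = [1.0, 2.0, 3.0, 4.0]   # backbone layer output
y = adapter(x, W_down, W_up)
```

Because the residual path is preserved, an adapter initialized near zero leaves the backbone's behavior intact, which is why adapters tend to avoid the instability that full routing can introduce.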

4. Data Mixing: The Most Important “Architecture”

Multilingual training is dominated by data:

  • High-resource languages can drown out low-resource ones.
  • Some languages have cleaner transcripts and dominate gradients.

Mixing strategies:

  • Temperature-based sampling
  • Per-language upweighting
  • Curriculum schedules (add languages progressively)

Researchers should report:

  • Sampling policy
  • Per-language hours
  • Transcript quality assumptions
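Temperature-based sampling is small enough to show in full. Languages are drawn with probability proportional to `n_i^(1/T)`, where `n_i` is hours of data: `T = 1` reproduces the natural size-proportional distribution, and larger `T` flattens it toward uniform. The hour counts below are illustrative.

```python
def sampling_probs(hours, T):
    """Per-language sampling probabilities: p_i ∝ n_i^(1/T)."""
    weights = {lang: h ** (1.0 / T) for lang, h in hours.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}

hours = {"en": 50000, "hi": 5000, "sw": 500}     # illustrative per-language hours
natural = sampling_probs(hours, T=1.0)           # en dominates (~90%)
flattened = sampling_probs(hours, T=5.0)         # tail languages upweighted
```

Reporting `T` (and the hour counts it acts on) is usually enough for a reader to reconstruct the effective training distribution.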

5. Code-Switching: Where Systems Break

Code-switching occurs when speakers mix languages:

  • “Let’s go to the dhaba for lunch.”
  • “Call my amma when you’re free.”

Failures in naive systems:

  • LID flips mid-utterance, causing resets
  • Tokenization produces unnatural splits
  • The model “snaps” to one language and corrupts the other

Robust strategies:

  • Allow mixed-language decoding without strict LID locking
  • Use shared vocabulary across common scripts
  • Train explicitly on code-switched datasets
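"Mixed-language decoding without strict LID locking" can be sketched as a tiny Viterbi pass over per-token language tags: each token carries per-language scores, and a small switch penalty discourages, but allows, mid-utterance changes. The scores below are hypothetical.

```python
def decode_langs(token_scores, switch_penalty=1.0):
    """token_scores: list of {lang: log_prob}; returns the best language per token."""
    langs = list(token_scores[0])
    best = {l: (token_scores[0][l], [l]) for l in langs}
    for scores in token_scores[1:]:
        new = {}
        for l in langs:
            # Best predecessor, paying the penalty when the language changes.
            prev_l = max(
                langs,
                key=lambda p: best[p][0] - (switch_penalty if p != l else 0.0),
            )
            cost = best[prev_l][0] - (switch_penalty if prev_l != l else 0.0)
            new[l] = (cost + scores[l], best[prev_l][1] + [l])
        best = new
    return max(best.values(), key=lambda v: v[0])[1]

# "Call my amma when ..." — toy per-token language scores
scores = [
    {"en": -0.1, "hi": -3.0},  # Call
    {"en": -0.2, "hi": -2.5},  # my
    {"en": -4.0, "hi": -0.1},  # amma
    {"en": -0.1, "hi": -3.0},  # when
]
```

With a moderate penalty the decoder switches to Hindi for "amma" and back, instead of snapping the whole utterance to one language the way hard LID locking does.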

6. Pronunciation and Accent Variation

Even within a language, accents matter. Multilingual models must handle:

  • Non-native speech
  • Regional accents
  • Borrowed words pronounced with native phonology

This is an argument for:

  • Larger acoustic front-ends
  • Better augmentation and data diversity
  • Accent-aware adapters or conditioning

7. Streaming Constraints in Multilingual ASR

Streaming introduces additional complications:

  • LID is harder on short context
  • Code-switching can’t be detected until later in the utterance
  • Routing decisions must be stable (no frequent switching)

Researchers should evaluate:

  • LID accuracy as a function of time (after 200 ms, 500 ms, 1 s)
  • Streaming WER per language under fixed latency budgets
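The LID-accuracy-over-time evaluation is mostly bookkeeping: truncate each utterance to a prefix and score the LID model at each cutoff. The sketch below uses a stand-in `lid_predict` (the real model would consume features, not raw samples) to show the harness shape.

```python
SAMPLE_RATE = 16000

def lid_predict(samples):
    """Stand-in LID: in this toy, it needs >= 0.5 s of audio to get 'hi' right."""
    return "hi" if len(samples) >= SAMPLE_RATE // 2 else "en"

def lid_accuracy_by_cutoff(utterances, cutoffs_ms):
    """utterances: list of (audio_samples, true_lang); returns accuracy per cutoff."""
    acc = {}
    for ms in cutoffs_ms:
        n_samples = SAMPLE_RATE * ms // 1000
        correct = sum(
            lid_predict(audio[:n_samples]) == lang for audio, lang in utterances
        )
        acc[ms] = correct / len(utterances)
    return acc

utterances = [([0.0] * SAMPLE_RATE, "hi"), ([0.0] * SAMPLE_RATE, "hi")]
acc = lid_accuracy_by_cutoff(utterances, cutoffs_ms=[200, 500, 1000])
```

Plotting `acc` against the cutoff makes the short-context failure mode visible: accuracy at 200 ms is what a low-latency router actually sees.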

8. Evaluation: Don’t Hide Behind Average WER

Multilingual research often reports a single average WER. That is not useful for product decisions.

Report:

  • Per-language WER/CER
  • Worst-10 language performance
  • Code-switching subsets
  • Accent subsets
  • LID confusion matrices

Also include token efficiency metrics:

  • Tokens per word by script (model efficiency proxy)
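The per-language and worst-k reporting can be sketched directly: word error rate via edit distance over word sequences, then a worst-k slice so tail regressions are not hidden by the average. The per-language numbers at the bottom are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (two-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def wer(refs, hyps):
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return errors / words

def worst_k(per_lang_wer, k):
    return sorted(per_lang_wer.items(), key=lambda kv: -kv[1])[:k]

per_lang = {"en": 0.06, "de": 0.08, "hi": 0.21, "sw": 0.38}  # illustrative
tail = worst_k(per_lang, k=2)  # the languages an average would hide
```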

9. A Strong Research Baseline

If you want a defensible baseline:

  1. Shared vocabulary model with language tokens.
  2. Temperature-based sampling to balance languages.
  3. Lightweight adapters for low-resource languages.
  4. A code-switching evaluation set and explicit training data if available.
  5. Streaming evaluation under fixed chunk/lookahead settings.
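The recipe above, written out as a config sketch (every key and value here is illustrative, not tied to a specific framework):

```python
baseline = {
    "vocab": {"type": "shared_bpe", "size": 16000, "language_tokens": True},
    "sampling": {"strategy": "temperature", "T": 3.0},
    "adapters": {"languages": ["sw", "yo", "ne"], "bottleneck_dim": 64},
    "eval": {
        "code_switching_set": True,
        "streaming": {"chunk_ms": 320, "lookahead_ms": 160},
        "report": ["per_language_wer", "lid_confusion", "tokens_per_word"],
    },
}
```

Pinning these five choices in one place makes the baseline reproducible and makes ablations (drop adapters, vary `T`, change the chunk size) easy to state.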

Conclusion

Multilingual ASR is fundamentally about allocation: tokens, capacity, and data. The best systems do not simply scale model size; they carefully control tokenization, routing, and mixing, and evaluate code-switching and streaming behavior explicitly.
