Voice AI Deep Dive: Inverse Text Normalization, Punctuation, and Casing (Making ASR Output Usable)
Introduction
ASR models are judged on WER, but users judge transcripts on readability:
- “Meet me at five thirty” vs “Meet me at 5:30.”
- “lets eat grandma” vs “Let’s eat, grandma.”
- Proper nouns and sentence casing
Most systems require a post-processing stack:
- Punctuation restoration
- Truecasing
- Inverse Text Normalization (ITN) (numbers, dates, currencies)
This deep dive explains how these components work and how to evaluate them like a researcher.
1. Definitions: Normalization vs ITN
- Text normalization (TN): transform written text into a spoken form (used in TTS and data prep).
- Inverse text normalization (ITN): transform spoken-form text into written form (used in ASR post-processing).
Example:
- Spoken: “one hundred and twenty five dollars”
- ITN: “$125”
ITN is not “cosmetic.” It changes meaning, especially for numbers.
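The spoken-to-written rewrite above can be sketched with a toy converter. The word lists and the single currency rule below are illustrative assumptions, not a production grammar:

```python
# Toy spoken-number -> digits converter for a small English subset.
UNITS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
         "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9}
TEENS = {"ten": 10, "eleven": 11, "twelve": 12, "thirteen": 13,
         "fourteen": 14, "fifteen": 15, "sixteen": 16,
         "seventeen": 17, "eighteen": 18, "nineteen": 19}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}

def spoken_to_number(words):
    """Parse e.g. ['one', 'hundred', 'and', 'twenty', 'five'] -> 125."""
    total = 0      # completed thousand-groups
    current = 0    # group being accumulated
    for w in words:
        if w == "and":
            continue
        elif w in UNITS:
            current += UNITS[w]
        elif w in TEENS:
            current += TEENS[w]
        elif w in TENS:
            current += TENS[w]
        elif w == "hundred":
            current *= 100
        elif w == "thousand":
            total += current * 1000
            current = 0
        else:
            raise ValueError(f"unknown token: {w}")
    return total + current

def itn_currency(spoken):
    """'one hundred and twenty five dollars' -> '$125'."""
    words = spoken.split()
    assert words[-1] in ("dollar", "dollars")
    return f"${spoken_to_number(words[:-1])}"
```

Even this toy version shows why ITN is meaning-changing: a single wrong branch (say, treating “hundred” as additive) silently corrupts the amount.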
2. Why ITN Is Hard
ITN is a structured prediction problem with ambiguity:
- “two to three” could be “2–3” or “2 to 3”
- “may” could be the month or a verb
- “one oh one” could be “101” or “1:01”
ITN also depends on locale:
- 1,000.50 vs 1.000,50
Researchers must define the ITN policy and locale assumptions explicitly.
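One way to make that policy explicit is a small per-locale formatting table. The two locales and their separator rules below are assumptions for illustration:

```python
# Explicit ITN formatting policy per locale (separators only, as a sketch).
LOCALE_POLICY = {
    "en-US": {"group": ",", "decimal": "."},
    "de-DE": {"group": ".", "decimal": ","},
}

def format_number(value, locale):
    """Format a number under an explicit, named locale policy."""
    policy = LOCALE_POLICY[locale]
    base = f"{value:,.2f}"  # US-style separators first, e.g. "1,000.50"
    # Remap separators according to the chosen policy.
    trans = {",": policy["group"], ".": policy["decimal"]}
    return "".join(trans.get(ch, ch) for ch in base)
```

The point is not the formatting code but the contract: the locale is a parameter the caller must state, never an implicit default.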
3. Classical ITN: WFST Grammars
A traditional approach uses grammars (finite-state transducers):
- Hand-crafted rules for numbers, dates, units, currencies
- Deterministic or weighted rewriting
Pros:
- Predictable and controllable
- Easy to enforce formatting constraints
Cons:
- Hard to cover edge cases and multilingual rules
- Requires engineering effort per locale
- Struggles with noisy ASR outputs (“fiv” vs “five”)
WFST ITN remains strong when you need strict formatting guarantees.
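A compiled WFST is beyond a short sketch, but the rule-cascade idea can be shown with an ordered regex rewrite (production systems typically compile such rules into FSTs, e.g. with Pynini). The time-expression rule below is a toy assumption covering a handful of words:

```python
import re

# One rule from a hypothetical cascade: spoken time -> "H:MM".
HOURS = {"one": "1", "two": "2", "three": "3", "four": "4", "five": "5",
         "six": "6", "seven": "7", "eight": "8", "nine": "9",
         "ten": "10", "eleven": "11", "twelve": "12"}
MINUTES = {"fifteen": "15", "thirty": "30", "forty five": "45"}

def apply_time_rule(text):
    """'meet me at five thirty' -> 'meet me at 5:30'."""
    hour_pat = "|".join(HOURS)
    min_pat = "|".join(MINUTES)
    pattern = re.compile(rf"\b({hour_pat}) ({min_pat})\b")
    return pattern.sub(
        lambda m: f"{HOURS[m.group(1)]}:{MINUTES[m.group(2)]}", text)
```

Note the pros and cons above in miniature: the output format is guaranteed by construction, but a noisy input like “fiv thirty” never matches, and every new pattern is another hand-written rule.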
4. Neural ITN: Sequence Tagging and Generative Models
Neural approaches treat ITN as:
- A tagging task (“KEEP”, “DELETE”, “REWRITE”)
- A seq2seq rewrite model
- A constrained decoder that emits structured tokens
Pros:
- Handles noisy inputs better
- Learns patterns not captured by rules
- Easier to extend to new domains (addresses, alphanumerics)
Cons:
- Can hallucinate rewrites
- Harder to guarantee formatting correctness
- Requires domain-specific training data
In practice, hybrid systems are common:
- Use neural ITN for candidates, then validate with rule-based constraints.
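The tagging formulation can be sketched as the decode step over KEEP/DELETE/REWRITE labels. In a real system a trained tagger predicts the tags; here they are supplied by hand to show only how tags map back to text:

```python
def apply_tags(tokens, tags):
    """Apply edit tags to ASR tokens.

    tags: list of ('KEEP',), ('DELETE',), or ('REWRITE', replacement).
    """
    out = []
    for token, tag in zip(tokens, tags):
        if tag[0] == "KEEP":
            out.append(token)
        elif tag[0] == "DELETE":
            continue
        elif tag[0] == "REWRITE":
            out.append(tag[1])
    return " ".join(out)

# Hand-written tags rewriting "one hundred and twenty five dollars" -> "$125":
tokens = ["one", "hundred", "and", "twenty", "five", "dollars"]
tags = [("REWRITE", "$125"), ("DELETE",), ("DELETE",),
        ("DELETE",), ("DELETE",), ("DELETE",)]
```

A convenient property of this formulation is that an all-KEEP tagging is the identity, so an uncertain model can fall back to leaving text untouched rather than hallucinating a rewrite.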
5. Punctuation Restoration: The “Second ASR”
Punctuation models are typically trained as:
- Token-level classification (comma/period/question/none)
- Transformer tagger over ASR tokens
Inputs can include:
- Words/subwords
- Pause durations
- Prosody features (pitch, energy)
Why punctuation fails:
- ASR errors propagate: wrong words produce wrong punctuation.
- The model needs long-range context to decide clause boundaries.
- Streaming constraints reduce available future context.
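Before training a tagger, a pause-only baseline is worth having as a floor. This sketch assumes word-level timestamps from the ASR system; the 0.5 s threshold is an illustrative assumption, not a recommended value:

```python
def punctuate_by_pause(words, start_times, end_times, pause_threshold=0.5):
    """Insert a period wherever the inter-word pause exceeds the threshold."""
    out = []
    for i, w in enumerate(words):
        out.append(w)
        last = i == len(words) - 1
        if last or start_times[i + 1] - end_times[i] > pause_threshold:
            out[-1] += "."  # sentence boundary at long pause or end of input
    return " ".join(out)
```

A learned tagger should beat this baseline comfortably; if it does not, the prosody features are probably doing most of the work.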
Streaming punctuation
Streaming punctuation requires:
- Chunked inference
- Revision policies (punctuation may change after more context arrives)
Researchers should track revision rate the same way they track partial-hypothesis stability in streaming ASR.
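A minimal sketch of such a revision-rate metric, assuming each streaming step emits the punctuated prefix so far:

```python
def revision_rate(history):
    """Fraction of streaming steps whose output revises the previous prefix.

    history: list of punctuated prefixes, one per streaming step.
    A step counts as a revision if its output is not a pure extension
    of the previous step's output.
    """
    revisions = 0
    for prev, cur in zip(history, history[1:]):
        if not cur.startswith(prev):
            revisions += 1
    return revisions / max(len(history) - 1, 1)
```

A high revision rate means the UI will visibly "flicker" punctuation, even if final accuracy is good.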
6. Truecasing and Named Entities
Truecasing is often framed as simple:
- “i met john in paris” → “I met John in Paris”
But for voice agents, truecasing depends on:
- Entity resolution (is “apple” a fruit or Apple?)
- Domain vocabulary (product names)
- User context (contact names)
Modern systems often use:
- A casing tagger
- Entity-aware rewriting (knowledge base / contacts / app context)
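A minimal lexicon-driven sketch of entity-aware truecasing, where user context (a contact list) overrides a static proper-noun lexicon. Both lexicons are illustrative assumptions:

```python
# Static proper-noun lexicon (illustrative).
PROPER = {"john": "John", "paris": "Paris"}

def truecase(text, contacts=()):
    """Lexicon-based truecasing with a user-context override for contacts."""
    contact_map = {c.lower(): c for c in contacts}
    out = []
    for w in text.split():
        if w in contact_map:          # user context wins
            out.append(contact_map[w])
        elif w in PROPER:             # static lexicon
            out.append(PROPER[w])
        elif w == "i":                # English pronoun rule
            out.append("I")
        else:
            out.append(w)
    if out:                           # sentence-initial capitalization
        out[0] = out[0][0].upper() + out[0][1:]
    return " ".join(out)
```

The ambiguity noted above (“apple” the fruit vs Apple) is exactly what a pure lexicon cannot resolve; that decision needs context, which is why modern systems layer a tagger or entity resolver on top.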
7. Evaluation: Beyond WER
For punctuation and ITN, WER is misleading. Use task-specific metrics:
- Punctuation F1 (per punctuation class)
- Sentence boundary accuracy
- ITN exact match for normalized spans (numbers, dates, currencies)
- Semantic accuracy for structured fields (phone numbers, prices)
Also report:
- Error severity buckets (e.g., formatting vs meaning-changing)
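Per-class punctuation F1 is straightforward once reference and hypothesis labels are aligned token by token (one label per token: e.g. COMMA, PERIOD, QUESTION, NONE). A minimal sketch under that alignment assumption:

```python
def punctuation_f1(ref_labels, hyp_labels, cls):
    """F1 for one punctuation class over aligned label sequences."""
    tp = sum(1 for r, h in zip(ref_labels, hyp_labels) if r == h == cls)
    fp = sum(1 for r, h in zip(ref_labels, hyp_labels) if h == cls and r != cls)
    fn = sum(1 for r, h in zip(ref_labels, hyp_labels) if r == cls and h != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Reporting F1 per class matters because classes are imbalanced: a model can score well on periods while being nearly useless on question marks.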
8. A Research Baseline Pipeline
A strong baseline that is easy to reproduce:
- Run ASR to get tokens + timestamps.
- Run punctuation tagger with pause features.
- Run truecasing tagger, optionally conditioned on entity context.
- Run ITN with hybrid approach:
- grammar-based for well-defined patterns
- neural for messy domains
- validation constraints to prevent hallucinations
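The pipeline above can be sketched as composed stages. Every stage interface here is an assumption (real punctuators, truecasers, and ITN models have richer inputs); only a toy currency constraint is implemented, to make the "validation prevents hallucinations" step concrete:

```python
import re

def validate_currency(span):
    """Constraint: a '$' rewrite must look like $<int> or $<int>.<2 digits>."""
    return re.fullmatch(r"\$\d+(\.\d{2})?", span) is not None

def run_pipeline(tokens, punctuator, truecaser, itn):
    """Compose the baseline stages; each argument is a callable stage."""
    text = punctuator(tokens)   # stage 2: punctuation
    text = truecaser(text)      # stage 3: casing
    rewritten = itn(text)       # stage 4: hybrid ITN candidate
    # Stage 4b: reject hallucinated currency rewrites (whole-output
    # validation here; a real system validates span by span).
    for word in rewritten.split():
        if word.startswith("$") and not validate_currency(word.rstrip(".,")):
            return text         # fall back to the unrewritten text
    return rewritten
```

Falling back to the unrewritten text on validation failure is the key design choice: a spelled-out number is a formatting miss, while a malformed amount is a meaning-changing error.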
9. Common Failure Modes
- Currency ITN misfires: “twenty dollars” is left as words where “$20” was wanted, or the rewrite fires where it should not.
- Commas inserted in lists, changing meaning.
- Capitalization errors on brands and names.
- Locale formatting flips (“1,5” vs “1.5”).
Researchers should audit failures by category, not only aggregate metrics.
Conclusion
Readable transcription is a structured NLP problem layered on top of ASR. For researcher-grade systems, treat punctuation, casing, and ITN as first-class components with explicit policies and dedicated evaluation, rather than fragile post-processing scripts.
