Voice AI Deep Dive: Inverse Text Normalization, Punctuation, and Casing (Making ASR Output Usable)
Introduction
ASR models are judged on WER, but users judge transcripts on readability:
- “Meet me at five thirty” vs “Meet me at 5:30.”
- “lets eat grandma” vs “Let’s eat, grandma.”
- Proper nouns and sentence casing
Most systems require a post-processing stack:
- Punctuation restoration
- Truecasing
- Inverse Text Normalization (ITN) (numbers, dates, currencies)
This deep dive explains how these components work and how to evaluate them like a researcher.
1. Definitions: Normalization vs ITN
- Text normalization (TN): transform written text into a spoken form (used in TTS and data prep).
- Inverse text normalization (ITN): transform spoken-form text into written form (used in ASR post-processing).
Example:
- Spoken: “one hundred and twenty five dollars”
- ITN: “$125”
ITN is not “cosmetic.” It changes meaning, especially for numbers.
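The spoken-to-written rewrite above can be sketched with a toy converter. The word lists and the single currency rule below are illustrative assumptions, not a production grammar:

```python
# Toy spoken-number -> digits converter for a small English subset.
UNITS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
         "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9}
TEENS = {"ten": 10, "eleven": 11, "twelve": 12, "thirteen": 13,
         "fourteen": 14, "fifteen": 15, "sixteen": 16,
         "seventeen": 17, "eighteen": 18, "nineteen": 19}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}

def spoken_to_number(words):
    """Parse e.g. ['one', 'hundred', 'and', 'twenty', 'five'] -> 125."""
    total = 0      # completed thousand-groups
    current = 0    # group being accumulated
    for w in words:
        if w == "and":
            continue
        elif w in UNITS:
            current += UNITS[w]
        elif w in TEENS:
            current += TEENS[w]
        elif w in TENS:
            current += TENS[w]
        elif w == "hundred":
            current *= 100
        elif w == "thousand":
            total += current * 1000
            current = 0
        else:
            raise ValueError(f"unknown token: {w}")
    return total + current

def itn_currency(spoken):
    """'one hundred and twenty five dollars' -> '$125'."""
    words = spoken.split()
    assert words[-1] in ("dollar", "dollars")
    return f"${spoken_to_number(words[:-1])}"
```

Even this toy version shows why ITN is meaning-changing: a single wrong branch (say, treating “hundred” as additive) silently corrupts the amount.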
2. Why ITN Is Hard
ITN is a structured prediction problem with ambiguity:
- “two to three” could be “2–3” or “2 to 3”
- “may” could be the month or a verb
- “one oh one” could be “101” or “1:01”
ITN also depends on locale:
- 1,000.50 vs 1.000,50
Researchers must define the ITN policy and locale assumptions explicitly.
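One way to make that policy explicit is a small per-locale formatting table. The two locales and their separator rules below are assumptions for illustration:

```python
# Explicit ITN formatting policy per locale (separators only, as a sketch).
LOCALE_POLICY = {
    "en-US": {"group": ",", "decimal": "."},
    "de-DE": {"group": ".", "decimal": ","},
}

def format_number(value, locale):
    """Format a number under an explicit, named locale policy."""
    policy = LOCALE_POLICY[locale]
    base = f"{value:,.2f}"  # US-style separators first, e.g. "1,000.50"
    # Remap separators according to the chosen policy.
    trans = {",": policy["group"], ".": policy["decimal"]}
    return "".join(trans.get(ch, ch) for ch in base)
```

The point is not the formatting code but the contract: the locale is a parameter the caller must state, never an implicit default.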
3. Classical ITN: WFST Grammars
A traditional approach uses grammars (finite-state transducers):
- Hand-crafted rules for numbers, dates, units, currencies
- Deterministic or weighted rewriting
Pros:
- Predictable and controllable
- Easy to enforce formatting constraints
Cons:
- Hard to cover edge cases and multilingual rules
- Requires engineering effort per locale
- Struggles with noisy ASR outputs (“fiv” vs “five”)
WFST ITN remains strong when you need strict formatting guarantees.
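A compiled WFST is beyond a short sketch, but the rule-cascade idea can be shown with an ordered regex rewrite (production systems typically compile such rules into FSTs, e.g. with Pynini). The time-expression rule below is a toy assumption covering a handful of words:

```python
import re

# One rule from a hypothetical cascade: spoken time -> "H:MM".
HOURS = {"one": "1", "two": "2", "three": "3", "four": "4", "five": "5",
         "six": "6", "seven": "7", "eight": "8", "nine": "9",
         "ten": "10", "eleven": "11", "twelve": "12"}
MINUTES = {"fifteen": "15", "thirty": "30", "forty five": "45"}

def apply_time_rule(text):
    """'meet me at five thirty' -> 'meet me at 5:30'."""
    hour_pat = "|".join(HOURS)
    min_pat = "|".join(MINUTES)
    pattern = re.compile(rf"\b({hour_pat}) ({min_pat})\b")
    return pattern.sub(
        lambda m: f"{HOURS[m.group(1)]}:{MINUTES[m.group(2)]}", text)
```

Note the pros and cons above in miniature: the output format is guaranteed by construction, but a noisy input like “fiv thirty” never matches, and every new pattern is another hand-written rule.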
4. Neural ITN: Sequence Tagging and Generative Models
Neural approaches treat ITN as:
- A tagging task (“KEEP”, “DELETE”, “REWRITE”)
- A seq2seq rewrite model
- A constrained decoder that emits structured tokens
Pros:
- Handles noisy inputs better
- Learns patterns not captured by rules
- Easier to extend to new domains (addresses, alphanumerics)
Cons:
- Can hallucinate rewrites
- Harder to guarantee formatting correctness
- Requires domain-specific training data
In practice, hybrid systems are common:
- Use neural ITN for candidates, then validate with rule-based constraints.
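The tagging formulation can be sketched as the decode step over KEEP/DELETE/REWRITE labels. In a real system a trained tagger predicts the tags; here they are supplied by hand to show only how tags map back to text:

```python
def apply_tags(tokens, tags):
    """Apply edit tags to ASR tokens.

    tags: list of ('KEEP',), ('DELETE',), or ('REWRITE', replacement).
    """
    out = []
    for token, tag in zip(tokens, tags):
        if tag[0] == "KEEP":
            out.append(token)
        elif tag[0] == "DELETE":
            continue
        elif tag[0] == "REWRITE":
            out.append(tag[1])
    return " ".join(out)

# Hand-written tags rewriting "one hundred and twenty five dollars" -> "$125":
tokens = ["one", "hundred", "and", "twenty", "five", "dollars"]
tags = [("REWRITE", "$125"), ("DELETE",), ("DELETE",),
        ("DELETE",), ("DELETE",), ("DELETE",)]
```

A convenient property of this formulation is that an all-KEEP tagging is the identity, so an uncertain model can fall back to leaving text untouched rather than hallucinating a rewrite.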
5. Punctuation Restoration: The “Second ASR”
Punctuation models are typically trained as:
- Token-level classification (comma/period/question/none)
- Transformer tagger over ASR tokens
Inputs can include:
- Words/subwords
- Pause durations
- Prosody features (pitch, energy)
Why punctuation fails:
- ASR errors propagate: wrong words produce wrong punctuation.
- The model needs long-range context to decide clause boundaries.
- Streaming constraints reduce available future context.
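Before training a tagger, a pause-only baseline is worth having as a floor. This sketch assumes word-level timestamps from the ASR system; the 0.5 s threshold is an illustrative assumption, not a recommended value:

```python
def punctuate_by_pause(words, start_times, end_times, pause_threshold=0.5):
    """Insert a period wherever the inter-word pause exceeds the threshold."""
    out = []
    for i, w in enumerate(words):
        out.append(w)
        last = i == len(words) - 1
        if last or start_times[i + 1] - end_times[i] > pause_threshold:
            out[-1] += "."  # sentence boundary at long pause or end of input
    return " ".join(out)
```

A learned tagger should beat this baseline comfortably; if it does not, the prosody features are probably doing most of the work.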
Streaming punctuation
Streaming punctuation requires:
- Chunked inference
- Revision policies (punctuation may change after more context arrives)
Researchers should track revision rate the same way they track partial-hypothesis stability in streaming ASR.
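A minimal sketch of such a revision-rate metric, assuming each streaming step emits the punctuated prefix so far:

```python
def revision_rate(history):
    """Fraction of streaming steps whose output revises the previous prefix.

    history: list of punctuated prefixes, one per streaming step.
    A step counts as a revision if its output is not a pure extension
    of the previous step's output.
    """
    revisions = 0
    for prev, cur in zip(history, history[1:]):
        if not cur.startswith(prev):
            revisions += 1
    return revisions / max(len(history) - 1, 1)
```

A high revision rate means the UI will visibly "flicker" punctuation, even if final accuracy is good.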
6. Truecasing and Named Entities
Truecasing is often framed as simple:
- “i met john in paris” → “I met John in Paris”
But for voice agents, truecasing depends on:
- Entity resolution (is “apple” a fruit or Apple?)
- Domain vocabulary (product names)
- User context (contact names)
Modern systems often use:
- A casing tagger
- Entity-aware rewriting (knowledge base / contacts / app context)
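A minimal lexicon-driven sketch of entity-aware truecasing, where user context (a contact list) overrides a static proper-noun lexicon. Both lexicons are illustrative assumptions:

```python
# Static proper-noun lexicon (illustrative).
PROPER = {"john": "John", "paris": "Paris"}

def truecase(text, contacts=()):
    """Lexicon-based truecasing with a user-context override for contacts."""
    contact_map = {c.lower(): c for c in contacts}
    out = []
    for w in text.split():
        if w in contact_map:          # user context wins
            out.append(contact_map[w])
        elif w in PROPER:             # static lexicon
            out.append(PROPER[w])
        elif w == "i":                # English pronoun rule
            out.append("I")
        else:
            out.append(w)
    if out:                           # sentence-initial capitalization
        out[0] = out[0][0].upper() + out[0][1:]
    return " ".join(out)
```

The ambiguity noted above (“apple” the fruit vs Apple) is exactly what a pure lexicon cannot resolve; that decision needs context, which is why modern systems layer a tagger or entity resolver on top.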
7. Evaluation: Beyond WER
For punctuation and ITN, WER is misleading. Use task-specific metrics:
- Punctuation F1 (per punctuation class)
- Sentence boundary accuracy
- ITN exact match for normalized spans (numbers, dates, currencies)
- Semantic accuracy for structured fields (phone numbers, prices)
Also report:
- Error severity buckets (e.g., formatting vs meaning-changing)
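Per-class punctuation F1 is straightforward once reference and hypothesis labels are aligned token by token (one label per token: e.g. COMMA, PERIOD, QUESTION, NONE). A minimal sketch under that alignment assumption:

```python
def punctuation_f1(ref_labels, hyp_labels, cls):
    """F1 for one punctuation class over aligned label sequences."""
    tp = sum(1 for r, h in zip(ref_labels, hyp_labels) if r == h == cls)
    fp = sum(1 for r, h in zip(ref_labels, hyp_labels) if h == cls and r != cls)
    fn = sum(1 for r, h in zip(ref_labels, hyp_labels) if r == cls and h != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Reporting F1 per class matters because classes are imbalanced: a model can score well on periods while being nearly useless on question marks.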
8. A Research Baseline Pipeline
A strong baseline that is easy to reproduce:
- Run ASR to get tokens + timestamps.
- Run punctuation tagger with pause features.
- Run truecasing tagger, optionally conditioned on entity context.
- Run ITN with hybrid approach:
- grammar-based for well-defined patterns
- neural for messy domains
- validation constraints to prevent hallucinations
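The pipeline above can be sketched as composed stages. Every stage interface here is an assumption (real punctuators, truecasers, and ITN models have richer inputs); only a toy currency constraint is implemented, to make the "validation prevents hallucinations" step concrete:

```python
import re

def validate_currency(span):
    """Constraint: a '$' rewrite must look like $<int> or $<int>.<2 digits>."""
    return re.fullmatch(r"\$\d+(\.\d{2})?", span) is not None

def run_pipeline(tokens, punctuator, truecaser, itn):
    """Compose the baseline stages; each argument is a callable stage."""
    text = punctuator(tokens)   # stage 2: punctuation
    text = truecaser(text)      # stage 3: casing
    rewritten = itn(text)       # stage 4: hybrid ITN candidate
    # Stage 4b: reject hallucinated currency rewrites (whole-output
    # validation here; a real system validates span by span).
    for word in rewritten.split():
        if word.startswith("$") and not validate_currency(word.rstrip(".,")):
            return text         # fall back to the unrewritten text
    return rewritten
```

Falling back to the unrewritten text on validation failure is the key design choice: a spelled-out number is a formatting miss, while a malformed amount is a meaning-changing error.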
9. Common Failure Modes
- Currency ITN misfires: “twenty dollars” is left as words where “$20” was wanted, or the rewrite fires where it should not.
- Commas inserted in lists, changing meaning.
- Capitalization errors on brands and names.
- Locale formatting flips (“1,5” vs “1.5”).
Researchers should audit failures by category, not only aggregate metrics.
Conclusion
Readable transcription is a structured NLP problem layered on top of ASR. For researcher-grade systems, treat punctuation, casing, and ITN as first-class components with explicit policies and dedicated evaluation, rather than fragile post-processing scripts.
