Context
KokoroAne English now has a much better raw-text frontend after the Misaki lexicon work, but raw numeric text still appears to be handled mostly by tokenization + lexicon/G2P fallback.
FluidAudio already has SayAsInterpreter for SSML <say-as>, but the public KokoroAne English raw-text path does not seem to apply a strict text-normalization pass before KokoroAneEnglishPhonemizer tokenization.
Observed / likely affected cases
Common chat-style English text can include:
I am 26 years old.
Today is June 13th.
The score is 3.14.
The current time is 1:49 PM.
In a raw-text KokoroAne path, these can reach the word-level G2P path or punctuation tokenization in shapes that are not ideal for TTS. For example, 3.14 can be split around the "." and then sound closer to "three fourteen" than "three point one four".
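To make the failure mode concrete, here is a minimal Python sketch. It does not call KokoroAne's actual tokenizer: `naive_tokens` is a hypothetical stand-in that simply separates word/digit runs from punctuation, which is the kind of split described above.

```python
import re

def naive_tokens(text):
    """Hypothetical stand-in tokenizer (an assumption, not KokoroAne code):
    runs of word characters become tokens, each punctuation mark is its own token."""
    return re.findall(r"\w+|[^\w\s]", text)

decimal_tokens = naive_tokens("The score is 3.14.")
time_tokens = naive_tokens("The current time is 1:49 PM.")

# The decimal is no longer one unit: "3" and "14" arrive as separate tokens,
# so a word-level fallback has no way to know that "14" was a fractional part.
print(decimal_tokens)
print(time_tokens)
```

With a split like this, any later stage sees "3" and "14" as unrelated numbers, which is why the fix proposed here would have to run before tokenization.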
Constraints / non-goals
This should probably not become a broad, locale-sensitive text-normalization system in the KokoroAne frontend. A conservative pass should avoid rewriting ambiguous or structured strings where caller intent is unclear.
Examples that should likely be left unchanged unless a larger TN design is accepted:
- version-like strings: 1.2.3
- numbers with thousands separators: 1,234
- embedded digits: word26, 26word
- bare colon-separated numbers with no AM/PM marker: 1:49
- invalid times: 1:99 PM
- 24-hour forms if not explicitly supported: 13:49
Conservative idea
A narrow pre-tokenization pass for KokoroAne English raw text could handle only strict standalone forms:
- standalone cardinal integers: 26 -> twenty six
- valid ordinals: 13th -> thirteenth
- leading-zero digit strings: 007 -> zero zero seven
- decimals: 3.14 -> three point one four (or a variant with an explicit pause after "point")
- explicit 12-hour meridiem times: 1:49 PM, 1:49 p.m. -> one forty nine p m
The implementation could reuse or share logic with SayAsInterpreter where appropriate, but keep the raw-text rules stricter than SSML because raw text has no explicit caller annotation.
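The rules above could be sketched roughly as follows. This is an illustration in Python under stated assumptions, not FluidAudio code and not a description of how SayAsInterpreter works: every name here (`normalize_numbers`, `cardinal`, `ordinal`, `TOKEN`, `MAX_DIGITS`) is hypothetical, the six-digit cap is an arbitrary choice, and a real implementation would live in Swift. The key idea is that a form is rewritten only when it is bounded by whitespace on the left and by optional sentence punctuation plus whitespace (or end of text) on the right.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
        "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
        "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]
IRREGULAR = {"one": "first", "two": "second", "three": "third", "five": "fifth",
             "eight": "eighth", "nine": "ninth", "twelve": "twelfth"}
MAX_DIGITS = 6  # assumption: longer digit runs are left for the existing fallback

def cardinal(n):
    """Spell 0..999999 without 'and' or hyphens."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + ("" if rest == 0 else " " + cardinal(rest))
    thousands, rest = divmod(n, 1000)
    return cardinal(thousands) + " thousand" + ("" if rest == 0 else " " + cardinal(rest))

def ordinal(n):
    """Turn the last word of the cardinal into its ordinal form."""
    words = cardinal(n).split(" ")
    last = words[-1]
    if last in IRREGULAR:
        last = IRREGULAR[last]
    elif last.endswith("y"):
        last = last[:-1] + "ieth"
    else:
        last += "th"
    return " ".join(words[:-1] + [last])

def ordinal_suffix(n):
    """Suffix that a well-formed English ordinal must carry."""
    if 10 <= n % 100 <= 20:
        return "th"
    return {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")

def spell_digits(digits):
    return " ".join(ONES[int(d)] for d in digits)

TOKEN = re.compile(r"""
    (?<!\S)                            # left edge is start of text or whitespace
    (?:
        (?P<hour>\d{1,2}):(?P<minute>\d{2})[ ](?P<mer>[AaPp])(?:[Mm]|\.[Mm]\.)
      | (?P<ordinal>[1-9]\d*)(?P<suffix>st|nd|rd|th)
      | (?P<whole>\d+)\.(?P<frac>\d+)
      | (?P<integer>\d+)
    )
    (?=[.,!?;:]*(?:\s|$))              # right edge is optional punctuation, then space or end
""", re.VERBOSE)

def _speak(m):
    original = m.group(0)  # returned untouched whenever a validity check fails
    if m.group("hour") is not None:
        hour, minute = int(m.group("hour")), int(m.group("minute"))
        if not (1 <= hour <= 12 and minute <= 59):
            return original
        if minute == 0:
            spoken_minute = ""
        elif minute < 10:
            spoken_minute = " oh " + cardinal(minute)
        else:
            spoken_minute = " " + cardinal(minute)
        # Note: the final period of "p.m." is consumed as part of the match.
        return f"{cardinal(hour)}{spoken_minute} {m.group('mer').lower()} m"
    if m.group("ordinal") is not None:
        text = m.group("ordinal")
        if len(text) > MAX_DIGITS or m.group("suffix") != ordinal_suffix(int(text)):
            return original
        return ordinal(int(text))
    if m.group("whole") is not None:
        whole = m.group("whole")
        if len(whole) > MAX_DIGITS or (len(whole) > 1 and whole[0] == "0"):
            return original
        return f"{cardinal(int(whole))} point {spell_digits(m.group('frac'))}"
    digits = m.group("integer")
    if len(digits) > 1 and digits[0] == "0":
        return spell_digits(digits)
    if len(digits) > MAX_DIGITS:
        return original
    return cardinal(int(digits))

def normalize_numbers(text):
    """Rewrite only strict standalone numeric forms; leave everything else as-is."""
    return TOKEN.sub(_speak, text)
```

For example, `normalize_numbers("The score is 3.14.")` gives "The score is three point one four.", while `1.2.3`, `1,234`, `word26`, `1:49`, `1:99 PM`, and `13:49` all pass through unchanged because they fail either the boundary check or a validity check. Forms wrapped in other symbols, such as `$26` or `(26)`, are also left alone in this sketch, which errs on the side of not rewriting.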
Possible follow-up
If maintainers agree this belongs in the KokoroAne English raw-text frontend, I can prepare a small PR with tests for the supported forms plus negative tests for the ambiguous forms above.