Add classifier-free guidance support for Chatterbox (fixes multilingual) + MinPLogitsWarper #1705
Open
tibzejoker wants to merge 1 commit into
- ChatterboxModel: when generation_config.guidance_scale > 1, run an internal batch of two sequences (conditional + unconditional with zeroed text token embeddings), matching the reference PyTorch implementation (resemble-ai/chatterbox T3.inference, cfg_weight=0.5). The multilingual checkpoint requires CFG to produce intelligible speech.
- Move ClassifierFreeGuidanceLogitsProcessor to the front of the processor list, matching the python transformers ordering (the cond/uncond batches must be combined before any other processor).
- Add MinPLogitsWarper and the min_p generation config option (used by the official Chatterbox sampling parameters).

Fixes huggingface#1656
Fixes #1656
TL;DR
The Chatterbox multilingual checkpoint was thought to be unsupported ("requires special setup"). After a full investigation (ONNX graph diffing against the English export, tokenizer comparison, and step-by-step replication of the reference PyTorch sampling loop), it turns out the architecture is already fully supported by the existing `ChatterboxModel`. The missing piece in the library is classifier-free guidance (CFG), which the reference implementation (resemble-ai/chatterbox `T3.inference`) always applies with `cfg_weight=0.5`. Without CFG, the multilingual model only produces short unintelligible vocalizations followed by an early EOS; with it, the model produces correct cloned speech (validated in-browser on WebGPU in French and German with a 10s reference voice).

Changes
- `ChatterboxModel`: CFG support, enabled via `guidance_scale` (= `1 + cfg_weight`, i.e. `1.5` for parity with the python defaults). When `guidance_scale > 1`, `forward` internally runs a batch of two sequences: the conditional input and an unconditional copy whose text token embeddings are zeroed (matching `text_emb[1].zero_()` in the reference implementation; speaker conditioning, exaggeration and speech tokens are shared between the two rows). The two rows are recombined by the existing `ClassifierFreeGuidanceLogitsProcessor`, so the batch size visible to `generate()` stays 1. Opt-in: no behavior change when `guidance_scale` is unset.
- Move `ClassifierFreeGuidanceLogitsProcessor` to the front of the list, matching the python `transformers` ordering: the cond/uncond logits must be combined back into a single batch before any other processor (e.g. repetition penalty) runs.
- Add `MinPLogitsWarper` + the `min_p` generation option (full implementation incl. `min_tokens_to_keep`, with unit tests). The official Chatterbox sampling parameters use `min_p=0.05`.

Usage (multilingual)
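As a rough sketch (not the snippet from the PR itself), the two options described above and the recombination that `guidance_scale` triggers could look like the following. The names `generationOptions` and `combineCfg` are hypothetical and only for illustration; the formula assumes the same cond/uncond recombination as the python `transformers` processor, and how the options are passed to `generate()` is left out because it depends on the surrounding setup.

```typescript
// cfg_weight as used by the reference PyTorch implementation.
const cfgWeight = 0.5;

// Generation options named in this PR; the object name is illustrative only.
const generationOptions = {
  guidance_scale: 1 + cfgWeight, // 1.5, parity with the python defaults
  min_p: 0.05,
};

// Hypothetical helper: recombine the conditional and unconditional logits
// rows into one row, assuming the usual CFG formula
//   out = uncond + (cond - uncond) * guidance_scale
// With guidance_scale = 1 + cfg_weight this equals
//   cond + cfg_weight * (cond - uncond).
function combineCfg(cond: number[], uncond: number[], guidanceScale: number): number[] {
  if (cond.length !== uncond.length) {
    throw new Error("cond and uncond must have the same vocabulary size");
  }
  return cond.map((c, i) => uncond[i] + (c - uncond[i]) * guidanceScale);
}
```

Because the two rows collapse into one before any other processor runs, later processors such as repetition penalty only ever see a batch of size 1.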
Note on the multilingual model files (for anyone reproducing)
There is currently no official ONNX repo with complete configs for the multilingual checkpoint. The community mirror's `tokenizer.json` has a broken `post_processor`: its `TemplateProcessing` special tokens are referenced without brackets (`"BOS"`, `"EOS"`, `"START_SPEECH"`, `"EXAGGERATION"`), which don't exist in the vocab, so all five special tokens encode to `[UNK]`. The correct template, identical to the working English export, is: `[EXAGGERATION](6563) [START](255) …text… [STOP](0) [START_SPEECH](6561) [START_SPEECH](6561)`. Happy to help fix/publish corrected model files if useful.

Validation
`pnpm build` ✓ (incl. typegen), `pnpm format:check` ✓, `logits_process` test suite ✓ (incl. 3 new `MinPLogitsWarper` unit tests).
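For readers unfamiliar with min-p sampling, the kind of filtering those unit tests exercise can be sketched roughly as below. This is an illustrative standalone function, not the PR's implementation: `minPFilter` and its parameter names are hypothetical, and it assumes the semantics of the python `transformers` min-p warper (drop tokens whose probability falls below `min_p` times the top probability, while always keeping the `min_tokens_to_keep` most likely tokens).

```typescript
// Hypothetical sketch of min-p filtering over one row of logits.
function minPFilter(logits: number[], minP: number, minTokensToKeep: number = 1): number[] {
  // Softmax, subtracting the max logit for numerical stability.
  const maxLogit = Math.max(...logits);
  const exps = logits.map((x) => Math.exp(x - maxLogit));
  const total = exps.reduce((a, b) => a + b, 0);
  const probs = exps.map((e) => e / total);

  // The cutoff scales with the probability of the most likely token.
  const threshold = minP * Math.max(...probs);

  // Indices sorted by descending probability; the first minTokensToKeep
  // are kept even if they fall below the threshold.
  const order = probs.map((_, i) => i).sort((a, b) => probs[b] - probs[a]);
  const alwaysKeep = new Set<number>(order.slice(0, minTokensToKeep));

  // Filtered tokens get -Infinity so they can never be sampled.
  return logits.map((x, i) => (probs[i] >= threshold || alwaysKeep.has(i) ? x : -Infinity));
}
```

With `minP = 0.05` and a top token at probability 0.6, any token below probability 0.03 is masked out, so the cutoff tightens when the model is confident and loosens when the distribution is flat.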