
Add classifier-free guidance support for Chatterbox (fixes multilingual) + MinPLogitsWarper #1705

Open
tibzejoker wants to merge 1 commit into huggingface:main from tibzejoker:feat/chatterbox-cfg


Fixes #1656

TL;DR

The Chatterbox multilingual checkpoint was thought to be unsupported ("requires special setup"). A full investigation (ONNX graph diffing against the English export, tokenizer comparison, and step-by-step replication of the reference PyTorch sampling loop) shows that the existing ChatterboxModel already supports the architecture. What the library lacks is classifier-free guidance (CFG), which the reference implementation (resemble-ai/chatterbox T3.inference) always applies with cfg_weight=0.5. Without CFG, the multilingual model produces only short, unintelligible vocalizations followed by an early EOS. With CFG, it produces correct cloned speech, validated in-browser on WebGPU in French and German with a roughly 10 s reference voice.

Changes

  1. ChatterboxModel: CFG support, enabled via guidance_scale (= 1 + cfg_weight, i.e. 1.5 for parity with the Python defaults). When guidance_scale > 1, forward internally runs a batch of two sequences: the conditional input and an unconditional copy whose text token embeddings are zeroed. This matches text_emb[1].zero_() in the reference implementation; speaker conditioning, exaggeration and speech tokens are shared between the two rows. The existing ClassifierFreeGuidanceLogitsProcessor recombines the two rows, so the batch size visible to generate() stays 1. The feature is opt-in: behavior is unchanged when guidance_scale is unset.
  2. Processor ordering: move ClassifierFreeGuidanceLogitsProcessor to the front of the list, matching the Python transformers ordering. The cond/uncond logits must be combined back into a single batch before any other processor (e.g. repetition penalty) runs.
  3. Add MinPLogitsWarper + the min_p generation option (full implementation incl. min_tokens_to_keep, with unit tests). The official Chatterbox sampling parameters use min_p=0.05.
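For readers unfamiliar with min-p sampling, the filtering rule can be sketched as below. This is a minimal illustration on a plain array of logits, not the MinPLogitsWarper class added in this PR; the function name minPFilter and its signature are hypothetical. It assumes the usual min-p definition: a token is dropped when its probability is below min_p times the probability of the most likely token, except that the min_tokens_to_keep most likely tokens are always kept.

```javascript
// Hypothetical helper for illustration only; not the library's MinPLogitsWarper.
function minPFilter(logits, minP, minTokensToKeep = 1) {
  // Numerically stable softmax over the logits.
  const maxLogit = Math.max(...logits);
  const exps = logits.map((x) => Math.exp(x - maxLogit));
  const sum = exps.reduce((a, b) => a + b, 0);
  const probs = exps.map((e) => e / sum);

  // The cutoff scales with the probability of the most likely token.
  const threshold = minP * Math.max(...probs);

  // Rank token indices by probability so the top ones can be protected.
  const ranked = probs.map((_, i) => i).sort((a, b) => probs[b] - probs[a]);
  const protectedIdx = new Set(ranked.slice(0, minTokensToKeep));

  // Filtered tokens get -Infinity so they can never be sampled.
  return logits.map((x, i) =>
    probs[i] < threshold && !protectedIdx.has(i) ? -Infinity : x,
  );
}

// With min_p = 0.05, only tokens far less likely than the top token are removed.
console.log(minPFilter([2, 1, -5], 0.05));
```

Because the cutoff is relative to the top token, the filter is strict when the model is confident and permissive when the distribution is flat.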

Usage (multilingual)

const waveform = await model.generate({
    ...inputs, ...speaker_data, exaggeration: 0.5,
    max_new_tokens: 1000,
    do_sample: true, temperature: 0.8, top_k: 0, top_p: 1.0,
    min_p: 0.05, repetition_penalty: 1.2,
    guidance_scale: 1.5, // 1 + cfg_weight(0.5) — required for the multilingual checkpoint
});
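The relation guidance_scale = 1 + cfg_weight follows from simple algebra: cond + w * (cond - uncond) equals uncond + (1 + w) * (cond - uncond). The sketch below illustrates this, along with the two-row batch described under Changes, on plain arrays. All function names here are hypothetical, and any normalization the library's ClassifierFreeGuidanceLogitsProcessor may apply before combining is left out.

```javascript
// Hypothetical helpers for illustration; not the code in this PR.

// Build the two CFG rows: the conditional input and a copy whose
// text-token embedding rows are zeroed. Other positions are shared.
function buildCfgBatch(embeds, isTextPosition) {
  const cond = embeds.map((row) => [...row]);
  const uncond = embeds.map((row, i) =>
    isTextPosition[i] ? row.map(() => 0) : [...row],
  );
  return [cond, uncond];
}

// Reference-style combination, parameterized by cfg_weight.
function combineWithCfgWeight(cond, uncond, cfgWeight) {
  return cond.map((c, i) => c + cfgWeight * (c - uncond[i]));
}

// Guidance-scale-style combination, parameterized by guidance_scale.
function combineWithGuidanceScale(cond, uncond, guidanceScale) {
  return cond.map((c, i) => uncond[i] + guidanceScale * (c - uncond[i]));
}

const condLogits = [2.0, -1.0, 0.5];
const uncondLogits = [1.0, 0.0, 0.5];
// Both calls yield the same guided logits when guidance_scale = 1 + cfg_weight.
console.log(combineWithCfgWeight(condLogits, uncondLogits, 0.5));
console.log(combineWithGuidanceScale(condLogits, uncondLogits, 1.5));
```

With guidance_scale = 1 the second form reduces to the conditional logits alone, which is consistent with CFG being active only when guidance_scale > 1.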

Note on the multilingual model files (for anyone reproducing)

There is currently no official ONNX repo with complete configs for the multilingual checkpoint. The community mirror's tokenizer.json has a broken post_processor (its TemplateProcessing special tokens are referenced without brackets — "BOS", "EOS", "START_SPEECH", "EXAGGERATION" — which don't exist in the vocab, so all five special tokens encode to [UNK]). The correct template, identical to the working English export, is: [EXAGGERATION](6563) [START](255) …text… [STOP](0) [START_SPEECH](6561) [START_SPEECH](6561). Happy to help fix/publish corrected model files if useful.
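A quick way to spot this kind of breakage is to check every special token referenced by the post-processor template against the vocabulary. The sketch below assumes the common tokenizer.json layout (model.vocab as an object, added_tokens as an array of entries with a content field, and post_processor.single / pair / special_tokens for TemplateProcessing). The function name is hypothetical and the sample object is a small mock, not the real file.

```javascript
// Hypothetical checker for illustration; assumes the usual tokenizer.json layout.
function findMissingTemplateTokens(tokenizerJson) {
  // Collect every token string the tokenizer actually knows.
  const known = new Set(Object.keys(tokenizerJson.model?.vocab ?? {}));
  for (const t of tokenizerJson.added_tokens ?? []) known.add(t.content);

  const pp = tokenizerJson.post_processor ?? {};
  const missing = new Set();
  for (const piece of [...(pp.single ?? []), ...(pp.pair ?? [])]) {
    const id = piece.SpecialToken?.id;
    if (id === undefined) continue; // Sequence pieces carry no special token.
    // Resolve the template id to its token strings; fall back to the id itself.
    const tokens = pp.special_tokens?.[id]?.tokens ?? [id];
    for (const tok of tokens) {
      if (!known.has(tok)) missing.add(tok);
    }
  }
  return [...missing];
}

// Mock input: "START" is referenced without brackets and is not in the vocab.
const mockTokenizer = {
  model: { vocab: { "[START]": 255, "[STOP]": 0 } },
  added_tokens: [{ id: 6561, content: "[START_SPEECH]" }],
  post_processor: {
    type: "TemplateProcessing",
    single: [
      { SpecialToken: { id: "START", type_id: 0 } },
      { Sequence: { id: "A", type_id: 0 } },
      { SpecialToken: { id: "[STOP]", type_id: 0 } },
    ],
    special_tokens: {
      START: { id: "START", ids: [255], tokens: ["START"] },
      "[STOP]": { id: "[STOP]", ids: [0], tokens: ["[STOP]"] },
    },
  },
};
console.log(findMissingTemplateTokens(mockTokenizer));
```

An empty result would suggest the template only references tokens that exist; any entries returned are candidates for the bracket fix described above.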

Validation

  • In-browser (WebGPU, Chrome/macOS): multilingual checkpoint + this PR's build → intelligible cloned speech in French and German from a 10.7s reference. Before: about 0.5s of breath noise (RMS ≈ 0.0003). After: about 5–6s of speech (RMS ≈ 0.08). English checkpoint unchanged.
  • pnpm build ✓ (incl. typegen), pnpm format:check ✓, logits_process test suite ✓ (incl. 3 new MinPLogitsWarper unit tests).
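For anyone wanting to reproduce the duration and RMS figures, a check along the following lines should be enough. It assumes the RMS quoted above is the plain root mean square over the output samples and that the waveform is available as a Float32Array; the helper names are hypothetical, and the sample rate must be supplied by the caller.

```javascript
// Hypothetical helpers for illustration; not part of this PR.

// Root mean square of the samples: near zero for silence or faint noise.
function rms(samples) {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (const s of samples) sumSquares += s * s;
  return Math.sqrt(sumSquares / samples.length);
}

// Duration in seconds for a given sample rate (supplied by the caller).
function durationSeconds(samples, sampleRate) {
  return samples.length / sampleRate;
}

// Tiny synthetic signal: a square wave with amplitude 0.5.
const demo = new Float32Array([0.5, -0.5, 0.5, -0.5]);
console.log(rms(demo), durationSeconds(demo, 4));
```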

- ChatterboxModel: when generation_config.guidance_scale > 1, run an
  internal batch of two sequences (conditional + unconditional with
  zeroed text token embeddings), matching the reference PyTorch
  implementation (resemble-ai/chatterbox T3.inference, cfg_weight=0.5).
  The multilingual checkpoint requires CFG to produce intelligible
  speech.
- Move ClassifierFreeGuidanceLogitsProcessor to the front of the
  processor list, matching the python transformers ordering (the
  cond/uncond batches must be combined before any other processor).
- Add MinPLogitsWarper and the min_p generation config option (used by
  the official Chatterbox sampling parameters).

Fixes huggingface#1656

Development

Successfully merging this pull request may close these issues.

Support for chatterbox-multilingual