Add classifier-free guidance support for Chatterbox (fixes multilingual) + MinPLogitsWarper by tibzejoker · Pull Request #1705 · huggingface/transformers.js

tibzejoker · 2026-06-10T15:22:46Z

TL;DR

The Chatterbox multilingual checkpoint was thought to be unsupported ("requires special setup"). A full investigation (ONNX graph diffing against the English export, tokenizer comparison, and step-by-step replication of the reference PyTorch sampling loop) shows that the existing ChatterboxModel already fully supports the architecture. The missing piece in the library is classifier-free guidance (CFG), which the reference implementation (resemble-ai/chatterbox T3.inference) always applies with cfg_weight=0.5. Without CFG, the multilingual model produces only short, unintelligible vocalizations followed by an early EOS; with CFG, it produces correct cloned speech (validated in-browser on WebGPU in French and German with a roughly 10s reference voice).

Changes

ChatterboxModel: CFG support, enabled via guidance_scale (= 1 + cfg_weight, i.e. 1.5 for parity with the Python defaults). When guidance_scale > 1, forward internally runs a batch of two sequences: the conditional input and an unconditional copy whose text token embeddings are zeroed (matching text_emb[1].zero_() in the reference implementation). Speaker conditioning, exaggeration and speech tokens are shared between the two rows. The existing ClassifierFreeGuidanceLogitsProcessor recombines the two rows, so the batch size visible to generate() stays 1. Opt-in: no behavior change when guidance_scale is unset.
Processor ordering: move ClassifierFreeGuidanceLogitsProcessor to the front of the list, matching the Python transformers ordering. The cond/uncond logits must be combined back into a single batch before any other processor (e.g. repetition penalty) runs.
Add MinPLogitsWarper and the min_p generation option (full implementation including min_tokens_to_keep, with unit tests). The official Chatterbox sampling parameters use min_p=0.05.
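
As a rough illustration of the recombination step in the list above, the sketch below combines one conditional and one unconditional logits row on plain arrays. It also checks that the guidance_scale form and the cfg_weight form agree when guidance_scale = 1 + cfg_weight. The helper names combineCfg and combineCfgWeightForm are hypothetical and not part of transformers.js; the library's ClassifierFreeGuidanceLogitsProcessor works on tensors and may normalize the logits first, which this sketch leaves out.

```javascript
// Hypothetical helpers for illustration only (not the transformers.js API).

// guidance_scale form: uncond + scale * (cond - uncond)
function combineCfg(cond, uncond, guidanceScale) {
  if (cond.length !== uncond.length) {
    throw new Error("cond and uncond must have the same vocabulary size");
  }
  const out = new Float32Array(cond.length);
  for (let i = 0; i < cond.length; ++i) {
    out[i] = uncond[i] + (cond[i] - uncond[i]) * guidanceScale;
  }
  return out;
}

// cfg_weight form: cond + weight * (cond - uncond)
// Algebraically equal to combineCfg(cond, uncond, 1 + weight).
function combineCfgWeightForm(cond, uncond, cfgWeight) {
  if (cond.length !== uncond.length) {
    throw new Error("cond and uncond must have the same vocabulary size");
  }
  const out = new Float32Array(cond.length);
  for (let i = 0; i < cond.length; ++i) {
    out[i] = cond[i] + cfgWeight * (cond[i] - uncond[i]);
  }
  return out;
}

// Toy logits over a 3-token vocabulary.
const condRow = Float32Array.from([2, 0, -1]);
const uncondRow = Float32Array.from([1, 1, 1]);
const guided = combineCfg(condRow, uncondRow, 1.5);
const guidedAlt = combineCfgWeightForm(condRow, uncondRow, 0.5);
console.log(Array.from(guided)); // [ 2.5, -0.5, -2 ]
console.log(Array.from(guidedAlt)); // [ 2.5, -0.5, -2 ]
```

Because the two rows collapse into one here, any processor that runs afterwards (repetition penalty, min-p, and so on) sees a single row, which is why the combination has to happen first.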

Usage (multilingual)

const waveform = await model.generate({
    ...inputs, ...speaker_data, exaggeration: 0.5,
    max_new_tokens: 1000,
    do_sample: true, temperature: 0.8, top_k: 0, top_p: 1.0,
    min_p: 0.05, repetition_penalty: 1.2,
    guidance_scale: 1.5, // 1 + cfg_weight(0.5) — required for the multilingual checkpoint
});
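
For readers unfamiliar with the min_p option used in the call above, here is a minimal sketch of min-p filtering on a single row of logits. It follows the general technique (drop tokens whose probability is below min_p times the top probability, while always keeping at least min_tokens_to_keep tokens). The function name minPFilter is hypothetical, and this is not the MinPLogitsWarper code from this PR.

```javascript
// Hypothetical stand-alone sketch of min-p filtering (not the PR's code).
function minPFilter(logits, minP, minTokensToKeep = 1) {
  // Numerically stable softmax.
  let maxLogit = -Infinity;
  for (const x of logits) maxLogit = Math.max(maxLogit, x);
  const exps = Array.from(logits, (x) => Math.exp(x - maxLogit));
  const sum = exps.reduce((a, b) => a + b, 0);
  const probs = exps.map((e) => e / sum);

  // Threshold scales with the probability of the most likely token.
  let topProb = 0;
  for (const p of probs) topProb = Math.max(topProb, p);
  const threshold = minP * topProb;

  // Always keep the minTokensToKeep most likely tokens.
  const order = probs.map((_, i) => i).sort((a, b) => probs[b] - probs[a]);
  const alwaysKeep = new Set(order.slice(0, minTokensToKeep));

  // Filtered tokens get -Infinity so they can never be sampled.
  return Array.from(logits, (x, i) =>
    probs[i] >= threshold || alwaysKeep.has(i) ? x : -Infinity,
  );
}

// Toy distribution with probabilities 0.6, 0.3, 0.09, 0.01.
const toyLogits = [0.6, 0.3, 0.09, 0.01].map(Math.log);
const filtered = minPFilter(toyLogits, 0.05);
// Threshold is 0.05 * 0.6 = 0.03, so only the 0.01 token is removed.
console.log(filtered.map((x) => x === -Infinity)); // [ false, false, false, true ]
```

Unlike a fixed top_p cutoff, the threshold adapts to how confident the model is at each step: a peaked distribution prunes aggressively, a flat one keeps more candidates.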

Note on the multilingual model files (for anyone reproducing)

There is currently no official ONNX repo with complete configs for the multilingual checkpoint. The community mirror's tokenizer.json has a broken post_processor: its TemplateProcessing template references the special tokens without brackets ("BOS", "EOS", "START_SPEECH", "EXAGGERATION"). These names don't exist in the vocab, so all five special-token positions in the template encode to [UNK]. The correct template, identical to the working English export, is: [EXAGGERATION](6563) [START](255) …text… [STOP](0) [START_SPEECH](6561) [START_SPEECH](6561). Happy to help fix/publish corrected model files if useful.
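
One way to catch this kind of breakage before loading a model is to check that every special token named in the template actually exists in the vocabulary. The sketch below assumes the usual tokenizer.json layout (a model.vocab object mapping token to id, an added_tokens array with content fields, and a TemplateProcessing post_processor with single, pair and special_tokens entries). The function name findMissingTemplateTokens and the toy objects are hypothetical, not the real model files.

```javascript
// Hypothetical checker; assumes the common tokenizer.json layout described above.
function findMissingTemplateTokens(tokenizerJson) {
  // Collect every token string the tokenizer knows about.
  const known = new Set(Object.keys(tokenizerJson.model?.vocab ?? {}));
  for (const t of tokenizerJson.added_tokens ?? []) known.add(t.content);

  const post = tokenizerJson.post_processor;
  if (!post || post.type !== "TemplateProcessing") return [];

  const missing = new Set();
  for (const piece of [...(post.single ?? []), ...(post.pair ?? [])]) {
    const id = piece.SpecialToken?.id;
    if (id === undefined) continue; // Sequence pieces carry no special token
    // The special_tokens map may expand an id into one or more token strings.
    const tokens = post.special_tokens?.[id]?.tokens ?? [id];
    for (const tok of tokens) {
      if (!known.has(tok)) missing.add(tok);
    }
  }
  return [...missing];
}

// Toy example mirroring the problem described above (ids taken from the note).
const toyVocab = { "[STOP]": 0, "[START]": 255, "[UNK]": 1 };
const toyAdded = [
  { id: 6561, content: "[START_SPEECH]" },
  { id: 6563, content: "[EXAGGERATION]" },
];
const special = (id) => ({ SpecialToken: { id, type_id: 0 } });
const brokenTokenizer = {
  model: { vocab: toyVocab },
  added_tokens: toyAdded,
  post_processor: {
    type: "TemplateProcessing",
    single: [
      special("EXAGGERATION"), special("BOS"),
      { Sequence: { id: "A", type_id: 0 } },
      special("EOS"), special("START_SPEECH"), special("START_SPEECH"),
    ],
  },
};
console.log(findMissingTemplateTokens(brokenTokenizer));
// [ 'EXAGGERATION', 'BOS', 'EOS', 'START_SPEECH' ]
```

An empty result does not prove the template is semantically right, only that each referenced token resolves to a real vocabulary entry.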

Validation

In-browser (WebGPU, Chrome/macOS): multilingual checkpoint + this PR's build → intelligible cloned speech in French and German from a 10.7s reference (previously: 0.5s of breath noise, RMS ≈ 0.0003 → now ≈ 5–6s of speech, RMS ≈ 0.08). English checkpoint unchanged.
pnpm build ✓ (incl. typegen), pnpm format:check ✓, logits_process test suite ✓ (incl. 3 new MinPLogitsWarper unit tests).
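
For anyone who wants to reproduce the RMS and duration figures above, a minimal sketch of both measurements on a raw sample buffer follows. The function names rms and durationSeconds are hypothetical, and the sampling rate is passed in as a parameter because the value to use depends on the model output (an assumption to verify against the generated audio).

```javascript
// Hypothetical measurement helpers for a mono Float32Array of audio samples.

// Root-mean-square amplitude: sqrt(mean(x^2)).
function rms(samples) {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < samples.length; ++i) {
    sumSquares += samples[i] * samples[i];
  }
  return Math.sqrt(sumSquares / samples.length);
}

// Duration in seconds for a given sampling rate (samples per second).
function durationSeconds(samples, samplingRate) {
  return samples.length / samplingRate;
}

// Toy square wave at amplitude 0.5: its RMS is exactly 0.5.
const toySamples = Float32Array.from([0.5, -0.5, 0.5, -0.5]);
console.log(rms(toySamples)); // 0.5
```

A near-zero RMS combined with a very short duration is a quick signal that generation hit EOS early and produced almost no speech.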

- ChatterboxModel: when generation_config.guidance_scale > 1, run an internal batch of two sequences (conditional + unconditional with zeroed text token embeddings), matching the reference PyTorch implementation (resemble-ai/chatterbox T3.inference, cfg_weight=0.5). The multilingual checkpoint requires CFG to produce intelligible speech.
- Move ClassifierFreeGuidanceLogitsProcessor to the front of the processor list, matching the python transformers ordering (the cond/uncond batches must be combined before any other processor).
- Add MinPLogitsWarper and the min_p generation config option (used by the official Chatterbox sampling parameters).

Fixes huggingface#1656

tibzejoker wants to merge 1 commit into huggingface:main from tibzejoker:feat/chatterbox-cfg
