Feature Proposal: Add FunASR Chinese/Multilingual ASR Benchmark Datasets

The Hugging Face `datasets` library supports audio datasets, but the Hub lacks curated speech recognition benchmark datasets for non-English languages, especially Chinese. FunASR (17.8K+ stars, https://github.com/modelscope/FunASR) provides production-grade ASR models with extensive multilingual support:

- **SenseVoice**: Ultra-fast multilingual ASR (50+ languages, strong CJK + Cantonese)
- **Paraformer**: Production-grade Chinese ASR with timestamps and punctuation

I propose adding FunASR benchmark datasets (e.g., Chinese speech recognition test sets and multilingual ASR evaluation data) to the Hugging Face Hub. This would give the broader ASR research community standardized evaluation benchmarks beyond the current Whisper-centric datasets.
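To make "standardized evaluation" concrete: Chinese ASR output is commonly scored with character error rate (CER) rather than word error rate, because written Chinese has no whitespace word boundaries. The sketch below is only an illustration of how a benchmark's reference transcripts could be compared against model output. The choice of CER and the function names `edit_distance` and `cer` are my assumptions, not part of FunASR or `datasets`.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))  # distance from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance over reference length, ignoring whitespace."""
    ref = [c for c in reference if not c.isspace()]
    hyp = [c for c in hypothesis if not c.isspace()]
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical transcript pair, for illustration only (one substituted character).
    print(cer("今天天气很好", "今天天汽很好"))
```

A real benchmark would also need an agreed text normalization (punctuation, digits, traditional vs. simplified characters) applied before scoring. That is deliberately left out of this sketch.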

FunASR models are also available on the HuggingFace model hub (https://huggingface.co/FunAudioLLM), making dataset-model pairing seamless.

Would adding FunASR evaluation datasets be useful?


Feature Proposal: Add FunASR Chinese/Multilingual ASR Benchmark Datasets #8261
