Feature Proposal: Add FunASR Chinese/Multilingual ASR Benchmark Datasets #8261

Description

@LauraGPT

The HuggingFace datasets library supports audio datasets, but the hub lacks curated speech recognition benchmark datasets for non-English languages, especially Chinese. FunASR (17.8K+ stars, https://github.com/modelscope/FunASR) provides production-grade ASR models with extensive multilingual support:

  • SenseVoice: Ultra-fast multilingual ASR (50+ languages, strong CJK + Cantonese)
  • Paraformer: Production-grade Chinese ASR with timestamps and punctuation

Would it be valuable to add FunASR benchmark datasets (e.g., Chinese speech recognition test sets, multilingual ASR evaluation data) to the HuggingFace datasets hub? This would benefit the broader ASR research community and provide standardized evaluation benchmarks beyond the current Whisper-centric datasets.

FunASR models are also available on the HuggingFace model hub (https://huggingface.co/FunAudioLLM), making dataset-model pairing seamless.

Would adding FunASR evaluation datasets be useful?
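
To make "standardized evaluation" a bit more concrete, here is a minimal sketch of how a Chinese test set could be scored. It assumes character error rate (CER) as the metric, which is the usual choice for Chinese ASR but is not something this proposal has settled on. The function names `edit_distance` and `cer` are hypothetical helpers written for illustration, not part of FunASR or the datasets library, and the sample transcripts are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert/delete/substitute = 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Corpus-level CER: total character edits / total reference characters.

    Assumption: spaces are dropped before scoring, since Chinese text is
    normally compared character by character.
    """
    total_edits = 0
    total_chars = 0
    for ref, hyp in zip(references, hypotheses):
        ref_chars = list(ref.replace(" ", ""))
        hyp_chars = list(hyp.replace(" ", ""))
        total_edits += edit_distance(ref_chars, hyp_chars)
        total_chars += len(ref_chars)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    return total_edits / total_chars


# Made-up example: one substituted character out of six.
refs = ["今天天气很好"]
hyps = ["今天天汽很好"]
score = cer(refs, hyps)
```

A shared benchmark would pair reference transcripts like `refs` with audio, so that any model's output can be scored the same way.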
