A curated list of multimodal datasets and data tooling — image-text, video, audio, and document corpora, plus streaming, curation, labeling, and embedding libraries that operate on them.
Maintained by Backblaze.
- Awesome ML Data Pipelines
- Awesome Agent Infrastructure
- Awesome Physical AI
- Awesome Image Generation
- Awesome Video Generation
- Awesome Audio Generation
- Image-Text Datasets
- Video Datasets
- Audio and Speech Datasets
- Document and Text Datasets
- Data Loading and Streaming
- Data Curation and Labeling
- Embedding and Indexing Models
- Templates and Example Projects
Large-scale image-caption corpora for contrastive and generative training.
- COYO-700M – 747M curated image-alt-text pairs from Kakao Brain. Alternative to LAION for pretraining multimodal models (streaming sketch after this list).
- DataComp – Benchmark and toolkit for building image-text datasets. DataComp-1B is the reference filtered training set. Docs
- Conceptual 12M (CC12M) – 12M image-text pairs from Google, harvested from alt-text with relaxed automated filtering. Widely used as a CLIP pretraining baseline.
- BLIP3-KALE – 218M image-text pairs with knowledge-augmented dense captions from Salesforce. Bridges synthetic caption quality and web-scale alt-text factuality. Apache-2.0. Docs
- Cambrian-10M – 10M multimodal instruction-tuning samples from NYU VisionX. Combines VQA, OCR, knowledge-based, and GPT-generated data. Backbone for Cambrian-1 VLM training. Apache-2.0. Docs
- LAION-5B – 5.85B image-text pairs scraped from Common Crawl. Foundation dataset for Stable Diffusion and many open CLIP models. Docs
- LLaVA-OneVision-Data – 3.9M multimodal instruction-tuning samples across 89 subsets (VQA, OCR, math reasoning, captioning). Training data for the LLaVA-OneVision model family. Apache-2.0. Docs
- Nemotron-VLM-Dataset-v2 – NVIDIA's 8M-sample VLM training set spanning image QA, OCR (10 languages), video QA, and chain-of-thought reasoning. Used to train Nemotron Nano 2 VL. CC-BY-4.0. Docs
- OBELICS – 141M interleaved image-text web documents, 353M images, 115B tokens, extracted from Common Crawl. Training data for IDEFICS. CC-BY-4.0. Docs
- OmniCorpus – 8.6B images interleaved with 1.7T text tokens from Common Crawl, Chinese web, and YouTube. Unified multimodal pretraining corpus. ICLR 2025 Spotlight. CC-BY-4.0. Docs
- Docmatix – 2.4M document images paired with 9.5M generated question-answer pairs, from HuggingFace. Useful for training multimodal document-QA models.
- PixMo-Cap – 717K dense image captions averaging ~200 words, recorded by human annotators and transcribed via LLM. Pretraining data for the Molmo VLM family. ODC-BY-1.0. Docs
- ReCap-DataComp-1B – 1.3B DataComp-1B images recaptioned with LLaVA-1.5-LLaMA3-8B. Longer, more detailed captions improve CLIP and text-to-image training. ICML 2025. Docs
- TextAtlas5M – 5.4M image-text pairs for dense text image generation, spanning synthetic and real-world text-rich images (slides, book covers, papers, styled text). Benchmark included. MIT. Docs
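Most of these corpora ship as URL + caption metadata on the Hugging Face Hub rather than raw pixels, so you can inspect them without downloading anything in full. A minimal sketch using `datasets` in streaming mode; the Hub ID follows the COYO entry above, and column names vary per dataset:

```python
# Peek at an image-text corpus without downloading it in full.
# Column names ("url", "text") are COYO's; other corpora differ.
from datasets import load_dataset

ds = load_dataset("kakaobrain/coyo-700m", split="train", streaming=True)

for sample in ds.take(3):
    # Pixels are fetched later with a downloader such as img2dataset.
    print(sample["url"], sample["text"][:80])
```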
Public video datasets with captions, action labels, or instruction annotations.
- Panda-70M – 70M high-quality short video clips with automatically generated captions. Used in several video LLM pretraining runs. Docs
- WebVid-10M – 10M weakly-captioned web videos. Common pretraining corpus for text-to-video and video-language models.
- Ego4D – 3,670 hours of first-person video from 931 participants across 9 countries. Benchmarks for episodic memory, hands and objects, social interactions, and forecasting. Docs
- HowTo100M – 136M narrated instructional video clips from 1.22M YouTube videos. A staple for video-language pretraining. Docs
- InternVid – 7M videos (~234M clips) with rich captions. Part of the InternVideo foundation-model release. Docs
- OpenVid-1M – 1M curated text-video pairs with aesthetic/motion/consistency scores. Includes OpenVidHD-0.4M subset at 1080p. ICLR 2025. Docs
- PE-Video – Meta's 1M diverse short videos with text descriptions and 120K human-verified captions. Released with Perception Encoder (2025). CC-BY-NC-4.0. Docs
- VideoUFO – 1.09M Creative Commons video clips paired with brief and detailed captions, covering 1,291 user-focused topics derived from real text-to-video prompts. NeurIPS 2025. CC-BY-4.0. Docs
Public speech, music, and sound-event corpora for training and evaluation.
- Mozilla Common Voice – Open speech corpus built from community contributions. 100+ languages and growing. Docs
- AudioSet – Google's 2M+ human-labelled 10-second sound clips across 632 classes. Standard corpus for sound-event classification.
- CapSpeech – 10M+ machine-annotated and 360k human-annotated audio-caption pairs for style-captioned TTS. Covers accent, emotion, sound effects, and agent speech tasks. CC-BY-NC-4.0. Docs
- Emilia – 100k+ hours of multilingual, in-the-wild speech data from the Amphion team. Backbone for modern open TTS training. Docs
- LibriSpeech – 1,000-hour English speech corpus derived from LibriVox. The long-running default ASR benchmark.
- WavCaps – 400k weakly-labelled audio clips with ChatGPT-generated captions from FreeSound, BBC Sound Effects, and AudioSet. Academic use only. Docs
Large-scale web, code, and document corpora used to pretrain multimodal models.
- Dolma – AI2's 3T-token open pretraining corpus with a transparent pipeline. Backbone for OLMo training.
- BigDocs-7.5M – 7.5M permissively licensed document image-text pairs from ServiceNow. Covers OCR, structured parsing, captioning, and QA across scientific papers, tables, and UI screenshots. ICLR 2025. CC-BY-4.0. Docs
- Common Corpus – 2T-token fully open text corpus from PleIAs. Public-domain and permissive text only — usable for commercial training. Docs
- DCLM-Baseline – 4T-token filtered web corpus from 240T-token Common Crawl pool. Trains 7B models to 64% MMLU at 2.6T tokens. NeurIPS 2024. Docs
- FineWeb / FineWeb-Edu – 15T high-quality English tokens filtered from 96 Common Crawl dumps. Edu variant is ~1.3T educational-content tokens. Docs
- The Stack v2 – Permissively licensed source code spanning 600+ programming languages, built from the Software Heritage archive. Training data for StarCoder2. Docs
Formats and libraries for streaming multimodal data from object storage into training jobs.
- HuggingFace Datasets – Unified loading, streaming, and processing of thousands of datasets. Native Arrow/Parquet with memory-mapped access. Docs | SDK: Python (pip install datasets)
- Lance – Columnar format designed for ML. 100x faster random access than Parquet, versioned, zero-copy from object storage. Docs | SDK: Python (pip install pylance), Rust
- WebDataset – PyTorch dataset format packaging samples as tar shards. Streams directly from S3-compatible object storage; see the sketch after this list. Docs | SDK: Python (pip install webdataset)
- MosaicML Streaming – Deterministic, shuffled, resumable streaming of datasets from cloud storage. Shard format: MDS. Docs | SDK: Python (pip install mosaicml-streaming)
- Ray Data – Distributed data loading and preprocessing inside Ray. Native connectors for Parquet on S3-compatible stores. SDK: Python (pip install 'ray[data]')
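A minimal sketch of streaming WebDataset shards from an S3-compatible bucket straight into a PyTorch DataLoader. The endpoint, bucket, and shard names below are placeholders, and the `pipe:` URL assumes a configured AWS CLI:

```python
# Stream tar shards from S3-compatible object storage into PyTorch.
import torchvision.transforms as T
import webdataset as wds
from torch.utils.data import DataLoader

# Brace expansion enumerates shards; "pipe:" shells out to the AWS CLI,
# which works against any S3-compatible endpoint (bucket is a placeholder).
shards = (
    "pipe:aws s3 cp --endpoint-url https://s3.us-west-004.backblazeb2.com "
    "s3://my-bucket/shards/shard-{000000..000099}.tar -"
)

preprocess = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor()])

dataset = (
    wds.WebDataset(shards)
    .shuffle(1000)                      # buffer-based shuffle within the stream
    .decode("pil")                      # decode images to PIL
    .to_tuple("jpg", "txt")             # (image, caption) pairs
    .map_tuple(preprocess, lambda t: t) # image transform, captions unchanged
    .batched(64)
)

# WebDataset handles batching, so the DataLoader just parallelizes workers.
loader = DataLoader(dataset, batch_size=None, num_workers=4)

for images, captions in loader:
    ...  # training step
```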
Tools for deduplication, cleaning, filtering, and annotating multimodal datasets.
- Label Studio – Open-source labeling platform supporting text, image, audio, video, and time-series. Docs | SDK: Python (pip install label-studio-sdk)
- CVAT – Computer vision annotation tool with strong video and 3D support. Originally from Intel, now community-maintained. Docs
- cleanlab – Finds label errors, outliers, and dataset issues automatically with confident-learning methods; see the sketch after this list. Docs | SDK: Python (pip install cleanlab)
- FiftyOne – Open-source dataset curation and model-evaluation toolkit for CV. Rich UI for exploring image/video datasets. Docs | SDK: Python (pip install fiftyone)
- Datatrove – HuggingFace's data-processing library for LLM pretraining. Parallel filters, deduplication, tokenization pipelines. SDK: Python (pip install datatrove)
- fastdup – Unsupervised analysis of large visual datasets. Detects duplicates, outliers, mislabels, and leakage in minutes. Docs | SDK: Python (pip install fastdup)
- Cosmos-Curate – NVIDIA's distributed video curation pipeline for world foundation model training. Splits, annotates, filters, and deduplicates video at scale using Ray. Apache-2.0.
- Data-Juicer – Composable data-processing toolkit with 200+ operators spanning text, image, audio, and video. Scales from a laptop to 1000-node Ray clusters. NeurIPS 2025 Spotlight. Apache-2.0. Docs | SDK: Python (pip install py-data-juicer)
- NeMo Curator – GPU-accelerated curation toolkit from NVIDIA. Handles text, image, video, and audio at scale using RAPIDS and Ray. Apache-2.0. Docs | SDK: Python (pip install nemo-curator)
- SemHash – Lightweight library for semantic deduplication, outlier filtering, and representative sample selection, built on fast static embedding models. SDK: Python (pip install semhash)
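As an example of where these tools fit, a minimal cleanlab sketch that flags likely label errors. `labels.npy` and `pred_probs.npy` are hypothetical files holding your dataset's given labels and out-of-sample predicted probabilities from a cross-validated model:

```python
# Rank samples whose given label disagrees with confident model predictions.
import numpy as np
from cleanlab.filter import find_label_issues

labels = np.load("labels.npy")          # shape (n,), integer class ids
pred_probs = np.load("pred_probs.npy")  # shape (n, k), out-of-sample softmax probs

issue_idx = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",  # worst offenders first
)
print(f"{len(issue_idx)} suspected label errors, e.g. indices {issue_idx[:10]}")
```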
Multimodal embedding models and frameworks for building searchable indices.
- FAISS – Meta's library for efficient similarity search of dense vectors. The reference ANN library for research-scale indexing; usage sketch after this list. Docs | SDK: Python (pip install faiss-cpu), C++
- sentence-transformers – PyTorch framework for training and using dense embedding models. De-facto library for semantic search. Docs | SDK: Python (pip install sentence-transformers)
- OpenCLIP – Open-source reproduction of CLIP with many trained checkpoints (including SigLIP and EVA-CLIP variants). SDK: Python (pip install open-clip-torch)
- BGE (BAAI General Embedding) – Family of multilingual text embeddings from BAAI. Top performers on MTEB; extended to visual and multimodal variants.
- Nomic Embed – Open multimodal embedding family from Nomic. Nomic Embed Vision pairs with Nomic Embed Text for unified image/text search. Docs
- jina-embeddings-v4 – 3.8B-parameter multimodal embedding model for text, images, and visual documents (charts, tables). Supports dense and late-interaction retrieval across 30+ languages. Non-commercial (Qwen Research License). Docs | SDK: Python (pip install transformers)
- MMEB-train – 2.1M training samples for the Massive Multimodal Embedding Benchmark. Covers visual QA, image retrieval, classification, and grounding across 20 datasets. Apache-2.0.
- SigLIP 2 – Google's image-text encoders trained with a sigmoid contrastive loss. Version 2 adds multilingual training and caption-based pretraining objectives. Docs
- VLM2Vec – Framework for training VLMs as dense embedding models. Ships MMEB-V2 benchmark with 78 tasks across images, videos, and visual documents. Docs | SDK: Python
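A minimal sketch combining two entries above: embed images with an OpenCLIP checkpoint and index them in FAISS for text-to-image search. The checkpoint tag is one of OpenCLIP's published pretrained weights; the image paths are placeholders:

```python
# Build an exact cosine-similarity index over OpenCLIP image embeddings.
import faiss
import open_clip
import torch
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

# Embed a batch of images (placeholder paths) and L2-normalize,
# so inner product equals cosine similarity.
images = torch.stack([preprocess(Image.open(p)) for p in ["a.jpg", "b.jpg"]])
with torch.no_grad():
    img_emb = torch.nn.functional.normalize(model.encode_image(images), dim=-1)

index = faiss.IndexFlatIP(img_emb.shape[1])  # exact inner-product index
index.add(img_emb.numpy())

# Query the image index with text.
with torch.no_grad():
    q = torch.nn.functional.normalize(
        model.encode_text(tokenizer(["a dog on a beach"])), dim=-1
    )
scores, ids = index.search(q.numpy(), 2)     # top-2 matches per query
```

For collections beyond a few million vectors, swap `IndexFlatIP` for one of FAISS's approximate index types and train it on a sample of the embeddings.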
Reference implementations, demos, and starter projects.
- img2dataset – Fast, resumable tool to turn image-URL lists into WebDataset shards. Routinely used to download LAION-scale datasets; see the sketch after this list. SDK: Python (pip install img2dataset)
- video2dataset – Download, trim, and package large video datasets into WebDataset shards.
- DataComp Quickstart – Reference workflow for downloading, filtering, and training on DataComp. Good starting point for custom filters.
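A minimal img2dataset sketch turning a URL list into WebDataset shards; the CSV path and column names are placeholders for your own metadata file:

```python
# Download images from a URL/caption CSV into WebDataset tar shards.
from img2dataset import download

download(
    url_list="urls.csv",         # placeholder: CSV with url + caption columns
    input_format="csv",
    url_col="url",
    caption_col="caption",
    output_folder="shards",
    output_format="webdataset",  # tar shards ready for streaming loaders
    image_size=256,              # resize on the fly
    processes_count=8,
    thread_count=32,
)
```

The resulting tar shards can be uploaded to object storage and streamed back with WebDataset as sketched in the Data Loading and Streaming section above.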
Contributions are welcome. See CONTRIBUTING.md. One entry per PR — edit entries.yaml only and let the maintainers regenerate README.md.
Released under CC0 1.0 Universal. You may copy, modify, and redistribute without attribution.
Backblaze B2 Cloud Storage is S3-compatible object storage designed for AI and media workloads. This list is maintained as part of our work making B2 a convenient storage layer for AI workflows.