A curated list of multimodal datasets and data tooling — image-text, video, audio, and document corpora, plus streaming, curation, labeling, and embedding libraries that operate on them.
Maintained by Backblaze.
- Awesome ML Data Pipelines
- Awesome Agent Infrastructure
- Awesome Physical AI
- Awesome Image Generation
- Awesome Video Generation
- Awesome Audio Generation
- Image-Text Datasets
- Video Datasets
- Audio and Speech Datasets
- Document and Text Datasets
- Data Loading and Streaming
- Data Curation and Labeling
- Embedding and Indexing Models
- Templates and Example Projects
Large-scale image-caption corpora for contrastive and generative training.
- COYO-700M – 747M curated image-alt-text pairs from Kakao Brain. Alternative to LAION for pretraining multimodal models (streaming sketch after this list).
- DataComp – Benchmark and toolkit for building image-text datasets. DataComp-1B is the reference filtered training set. Docs
- Conceptual 12M (CC12M) – 12M image-text pairs from Google, harvested from alt-text with relaxed automated filtering. Widely used as a CLIP pretraining baseline.
- BLIP3-KALE – 218M image-text pairs with knowledge-augmented dense captions from Salesforce. Bridges synthetic caption quality and web-scale alt-text factuality. Apache-2.0. Docs
- Cambrian-10M – 10M multimodal instruction-tuning samples from NYU VisionX. Combines VQA, OCR, knowledge-based, and GPT-generated data. Backbone for Cambrian-1 VLM training. Apache-2.0. Docs
- LAION-5B – 5.85B image-text pairs scraped from Common Crawl. Foundation dataset for Stable Diffusion and many open CLIP models. Docs
- LLaVA-OneVision-Data – 3.9M multimodal instruction-tuning samples across 89 subsets (VQA, OCR, math reasoning, captioning). Training data for the LLaVA-OneVision model family. Apache-2.0. Docs
- Nemotron-VLM-Dataset-v2 – NVIDIA's 8M-sample VLM training set spanning image QA, OCR (10 languages), video QA, and chain-of-thought reasoning. Used to train Nemotron Nano 2 VL. CC-BY-4.0. Docs
- OBELICS – 141M interleaved image-text web documents, 353M images, 115B tokens, extracted from Common Crawl. Training data for IDEFICS. CC-BY-4.0. Docs
- OmniCorpus – 8.6B images interleaved with 1.7T text tokens from Common Crawl, Chinese web, and YouTube. Unified multimodal pretraining corpus. ICLR 2025 Spotlight. CC-BY-4.0. Docs
- Docmatix – 2.4M document images paired with 9.5M generated question-answer pairs, from HuggingFace. Useful for training multimodal document-QA models.
- PixMo-Cap – 717K dense image captions averaging ~200 words, recorded by human annotators and transcribed via LLM. Pretraining data for the Molmo VLM family. ODC-BY-1.0. Docs
- ReCap-DataComp-1B – 1.3B DataComp-1B images recaptioned with LLaVA-1.5-LLaMA3-8B. Longer, more detailed captions improve CLIP and text-to-image training. ICML 2025. Docs
- TextAtlas5M – 5.4M image-text pairs for dense text image generation, spanning synthetic and real-world text-rich images (slides, book covers, papers, styled text). Benchmark included. MIT. Docs
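Most of these corpora ship as URL + caption metadata on the Hugging Face Hub rather than raw pixels, so you can inspect them without downloading anything in full. A minimal sketch using `datasets` in streaming mode; the Hub ID follows the COYO entry above, and column names vary per dataset:

```python
# Peek at an image-text corpus without downloading it in full.
# Column names ("url", "text") are COYO's; other corpora differ.
from datasets import load_dataset

ds = load_dataset("kakaobrain/coyo-700m", split="train", streaming=True)

for sample in ds.take(3):
    # Pixels are fetched later with a downloader such as img2dataset.
    print(sample["url"], sample["text"][:80])
```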
Public video datasets with captions, action labels, or instruction annotations.
- Panda-70M – 70M high-quality short video clips with automatically generated captions. Used in several video LLM pretraining runs. Docs
- WebVid-10M – 10M weakly-captioned web videos. Common pretraining corpus for text-to-video and video-language models.
- Ego4D – 3,670 hours of first-person video from 931 participants across 9 countries. Benchmarks for episodic memory, hands and objects, social interactions, and forecasting. Docs
- HowTo100M – 136M narrated instructional video clips from 1.22M YouTube videos. A staple for video-language pretraining. Docs
- InternVid – 7M videos (~234M clips) with rich captions. Part of the InternVideo foundation-model release. Docs
- OpenVid-1M – 1M curated text-video pairs with aesthetic/motion/consistency scores. Includes OpenVidHD-0.4M subset at 1080p. ICLR 2025. Docs
- PE-Video – Meta's 1M diverse short videos with text descriptions and 120K human-verified captions. Released with Perception Encoder (2025). CC-BY-NC-4.0. Docs
- VideoUFO – 1.09M Creative Commons video clips paired with brief and detailed captions, covering 1,291 user-focused topics derived from real text-to-video prompts. NeurIPS 2025. CC-BY-4.0. Docs
Public speech, music, and sound-event corpora for training and evaluation.
- Mozilla Common Voice – Open speech corpus built from community contributions. 100+ languages and growing. Docs
- AudioSet – Google's 2M+ human-labelled 10-second sound clips across 632 classes. Standard corpus for sound-event classification.
- CapSpeech – 10M+ machine-annotated and 360k human-annotated audio-caption pairs for style-captioned TTS. Covers accent, emotion, sound effects, and agent speech tasks. CC-BY-NC-4.0. Docs
- Emilia – 100k+ hours of multilingual, in-the-wild speech data from the Amphion team. Backbone for modern open TTS training. Docs
- LibriSpeech – 1,000-hour English speech corpus derived from LibriVox. The long-running default ASR benchmark.
- WavCaps – 400k weakly-labelled audio clips with ChatGPT-generated captions from FreeSound, BBC Sound Effects, and AudioSet. Academic use only. Docs
Large-scale web, code, and document corpora used to pretrain multimodal models.
- Dolma – AI2's 3T-token open pretraining corpus with a transparent pipeline. Backbone for OLMo training.
- BigDocs-7.5M – 7.5M permissively licensed document image-text pairs from ServiceNow. Covers OCR, structured parsing, captioning, and QA across scientific papers, tables, and UI screenshots. ICLR 2025. CC-BY-4.0. Docs
- Common Corpus – 2T-token fully open text corpus from PleIAs. Public-domain and permissive text only — usable for commercial training. Docs
- DCLM-Baseline – 4T-token filtered web corpus from 240T-token Common Crawl pool. Trains 7B models to 64% MMLU at 2.6T tokens. NeurIPS 2024. Docs
- FineWeb / FineWeb-Edu – 15T high-quality English tokens filtered from 96 Common Crawl dumps. Edu variant is ~1.3T educational-content tokens. Docs
- The Stack v2 – Permissively licensed source code spanning 600+ programming languages, built from the Software Heritage archive. Training data for StarCoder2. Docs
Formats and libraries for streaming multimodal data from object storage into training jobs.
- HuggingFace Datasets – Unified loading, streaming, and processing of thousands of datasets. Native Arrow/Parquet with memory-mapped access. Docs | SDK: Python (pip install datasets)
- Lance – Columnar format designed for ML. 100x faster random access than Parquet, versioned, zero-copy from object storage. Docs | SDK: Python (pip install pylance), Rust
- WebDataset – PyTorch dataset format packaging samples as tar shards. Streams directly from S3-compatible object storage; see the sketch after this list. Docs | SDK: Python (pip install webdataset)
- MosaicML Streaming – Deterministic, shuffled, resumable streaming of datasets from cloud storage. Shard format: MDS. Docs | SDK: Python (pip install mosaicml-streaming)
- Ray Data – Distributed data loading and preprocessing inside Ray. Native connectors for Parquet on S3-compatible stores. SDK: Python (pip install 'ray[data]')
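A minimal sketch of streaming WebDataset shards from an S3-compatible bucket straight into a PyTorch DataLoader. The endpoint, bucket, and shard names below are placeholders, and the `pipe:` URL assumes a configured AWS CLI:

```python
# Stream tar shards from S3-compatible object storage into PyTorch.
import torchvision.transforms as T
import webdataset as wds
from torch.utils.data import DataLoader

# Brace expansion enumerates shards; "pipe:" shells out to the AWS CLI,
# which works against any S3-compatible endpoint (bucket is a placeholder).
shards = (
    "pipe:aws s3 cp --endpoint-url https://s3.us-west-004.backblazeb2.com "
    "s3://my-bucket/shards/shard-{000000..000099}.tar -"
)

preprocess = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor()])

dataset = (
    wds.WebDataset(shards)
    .shuffle(1000)                      # buffer-based shuffle within the stream
    .decode("pil")                      # decode images to PIL
    .to_tuple("jpg", "txt")             # (image, caption) pairs
    .map_tuple(preprocess, lambda t: t) # image transform, captions unchanged
    .batched(64)
)

# WebDataset handles batching, so the DataLoader just parallelizes workers.
loader = DataLoader(dataset, batch_size=None, num_workers=4)

for images, captions in loader:
    ...  # training step
```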
Tools for deduplication, cleaning, filtering, and annotating multimodal datasets.
- Label Studio – Open-source labeling platform supporting text, image, audio, video, and time-series. Docs | SDK: Python (pip install label-studio-sdk)
- CVAT – Computer vision annotation tool with strong video and 3D support. Originally from Intel, now community-maintained. Docs
- cleanlab – Finds label errors, outliers, and dataset issues automatically with confident-learning methods; see the sketch after this list. Docs | SDK: Python (pip install cleanlab)
- FiftyOne – Open-source dataset curation and model-evaluation toolkit for CV. Rich UI for exploring image/video datasets. Docs | SDK: Python (pip install fiftyone)
- Datatrove – HuggingFace's data-processing library for LLM pretraining. Parallel filters, deduplication, tokenization pipelines. SDK: Python (pip install datatrove)
- fastdup – Unsupervised analysis of large visual datasets. Detects duplicates, outliers, mislabels, and leakage in minutes. Docs | SDK: Python (pip install fastdup)
- Cosmos-Curate – NVIDIA's distributed video curation pipeline for world foundation model training. Splits, annotates, filters, and deduplicates video at scale using Ray. Apache-2.0.
- Data-Juicer – Composable data-processing toolkit with 200+ operators spanning text, image, audio, and video. Scales from a laptop to 1000-node Ray clusters. NeurIPS 2025 Spotlight. Apache-2.0. Docs | SDK: Python (pip install py-data-juicer)
- NeMo Curator – GPU-accelerated curation toolkit from NVIDIA. Handles text, image, video, and audio at scale using RAPIDS and Ray. Apache-2.0. Docs | SDK: Python (pip install nemo-curator)
- SemHash – Lightweight library for semantic deduplication, outlier filtering, and representative sample selection, built on fast static embedding models. SDK: Python (pip install semhash)
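As an example of where these tools fit, a minimal cleanlab sketch that flags likely label errors. `labels.npy` and `pred_probs.npy` are hypothetical files holding your dataset's given labels and out-of-sample predicted probabilities from a cross-validated model:

```python
# Rank samples whose given label disagrees with confident model predictions.
import numpy as np
from cleanlab.filter import find_label_issues

labels = np.load("labels.npy")          # shape (n,), integer class ids
pred_probs = np.load("pred_probs.npy")  # shape (n, k), out-of-sample softmax probs

issue_idx = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",  # worst offenders first
)
print(f"{len(issue_idx)} suspected label errors, e.g. indices {issue_idx[:10]}")
```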
Multimodal embedding models and frameworks for building searchable indices.
- FAISS – Meta's library for efficient similarity search of dense vectors. The reference ANN library for research-scale indexing; usage sketch after this list. Docs | SDK: Python (pip install faiss-cpu), C++
- sentence-transformers – PyTorch framework for training and using dense embedding models. De-facto library for semantic search. Docs | SDK: Python (pip install sentence-transformers)
- OpenCLIP – Open-source reproduction of CLIP with many trained checkpoints (including SigLIP and EVA-CLIP variants). SDK: Python (pip install open-clip-torch)
- BGE (BAAI General Embedding) – Family of multilingual text embeddings from BAAI. Top performers on MTEB; extended to visual and multimodal variants.
- Nomic Embed – Open multimodal embedding family from Nomic. Nomic Embed Vision pairs with Nomic Embed Text for unified image/text search. Docs
- jina-embeddings-v4 – 3.8B-parameter multimodal embedding model for text, images, and visual documents (charts, tables). Supports dense and late-interaction retrieval across 30+ languages. Non-commercial (Qwen Research License). Docs | SDK: Python (pip install transformers)
- MMEB-train – 2.1M training samples for the Massive Multimodal Embedding Benchmark. Covers visual QA, image retrieval, classification, and grounding across 20 datasets. Apache-2.0.
- SigLIP 2 – Google's image-text encoders trained with a sigmoid contrastive loss. Version 2 adds multilingual training and caption-based pretraining objectives. Docs
- VLM2Vec – Framework for training VLMs as dense embedding models. Ships MMEB-V2 benchmark with 78 tasks across images, videos, and visual documents. Docs | SDK: Python
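A minimal sketch combining two entries above: embed images with an OpenCLIP checkpoint and index them in FAISS for text-to-image search. The checkpoint tag is one of OpenCLIP's published pretrained weights; the image paths are placeholders:

```python
# Build an exact cosine-similarity index over OpenCLIP image embeddings.
import faiss
import open_clip
import torch
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

# Embed a batch of images (placeholder paths) and L2-normalize,
# so inner product equals cosine similarity.
images = torch.stack([preprocess(Image.open(p)) for p in ["a.jpg", "b.jpg"]])
with torch.no_grad():
    img_emb = torch.nn.functional.normalize(model.encode_image(images), dim=-1)

index = faiss.IndexFlatIP(img_emb.shape[1])  # exact inner-product index
index.add(img_emb.numpy())

# Query the image index with text.
with torch.no_grad():
    q = torch.nn.functional.normalize(
        model.encode_text(tokenizer(["a dog on a beach"])), dim=-1
    )
scores, ids = index.search(q.numpy(), 2)     # top-2 matches per query
```

For collections beyond a few million vectors, swap `IndexFlatIP` for one of FAISS's approximate index types and train it on a sample of the embeddings.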
Reference implementations, demos, and starter projects.
- img2dataset – Fast, resumable tool to turn image-URL lists into WebDataset shards. Routinely used to download LAION-scale datasets; see the sketch after this list. SDK: Python (pip install img2dataset)
- video2dataset – Download, trim, and package large video datasets into WebDataset shards.
- DataComp Quickstart – Reference workflow for downloading, filtering, and training on DataComp. Good starting point for custom filters.
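A minimal img2dataset sketch turning a URL list into WebDataset shards; the CSV path and column names are placeholders for your own metadata file:

```python
# Download images from a URL/caption CSV into WebDataset tar shards.
from img2dataset import download

download(
    url_list="urls.csv",         # placeholder: CSV with url + caption columns
    input_format="csv",
    url_col="url",
    caption_col="caption",
    output_folder="shards",
    output_format="webdataset",  # tar shards ready for streaming loaders
    image_size=256,              # resize on the fly
    processes_count=8,
    thread_count=32,
)
```

The resulting tar shards can be uploaded to object storage and streamed back with WebDataset as sketched in the Data Loading and Streaming section above.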
Contributions are welcome. See CONTRIBUTING.md. One entry per PR — edit entries.yaml only and let the maintainers regenerate README.md.
Released under CC0 1.0 Universal. You may copy, modify, and redistribute without attribution.
Backblaze B2 Cloud Storage is S3-compatible object storage designed for AI and media workloads. This list is maintained as part of our work making B2 a convenient storage layer for AI workflows.