
Awesome Multimodal Data

A curated list of multimodal datasets and data tooling — image-text, video, audio, and document corpora, plus streaming, curation, labeling, and embedding libraries that operate on them.

Maintained by Backblaze.

Contents

  • Image-Text Datasets
  • Video Datasets
  • Audio and Speech Datasets
  • Document and Text Datasets
  • Data Loading and Streaming
  • Data Curation and Labeling
  • Embedding and Indexing Models
  • Templates and Example Projects
  • Contributing
  • License
  • About Backblaze B2

Image-Text Datasets

Large-scale image-caption corpora for contrastive and generative training. A streaming-access sketch follows the list.

  • COYO-700M – 747M curated image-alt-text pairs from Kakao Brain. Alternative to LAION for pretraining multimodal models.
  • DataComp – Benchmark and toolkit for building image-text datasets. DataComp-1B is the reference filtered training set. Docs
  • Conceptual Captions 12M – 12M image-text pairs from Google, harvested from alt-text with automated cleaning. Widely used as a CLIP baseline.
  • BLIP3-KALE – 218M image-text pairs with knowledge-augmented dense captions from Salesforce. Bridges synthetic caption quality and web-scale alt-text factuality. Apache-2.0. Docs
  • Cambrian-10M – 10M multimodal instruction-tuning samples from NYU VisionX. Combines VQA, OCR, knowledge-based, and GPT-generated data. Backbone for Cambrian-1 VLM training. Apache-2.0. Docs
  • LAION-5B – 5.85B image-text pairs scraped from Common Crawl. Foundation dataset for Stable Diffusion and many open CLIP models. Docs
  • LLaVA-OneVision-Data – 3.9M multimodal instruction-tuning samples across 89 subsets (VQA, OCR, math reasoning, captioning). Training data for the LLaVA-OneVision model family. Apache-2.0. Docs
  • Nemotron-VLM-Dataset-v2 – NVIDIA's 8M-sample VLM training set spanning image QA, OCR (10 languages), video QA, and chain-of-thought reasoning. Used to train Nemotron Nano 2 VL. CC-BY-4.0. Docs
  • OBELICS – 141M interleaved image-text web documents, 353M images, 115B tokens, extracted from Common Crawl. Training data for IDEFICS. CC-BY-4.0. Docs
  • OmniCorpus – 8.6B images interleaved with 1.7T text tokens from Common Crawl, Chinese web, and YouTube. Unified multimodal pretraining corpus. ICLR 2025 Spotlight. CC-BY-4.0. Docs
  • Docmatix – Large-scale document-image instruction dataset from HuggingFace. Useful for training multimodal document-QA models.
  • PixMo-Cap – 717K dense image captions averaging ~200 words, recorded by human annotators and transcribed via LLM. Pretraining data for the Molmo VLM family. ODC-BY-1.0. Docs
  • ReCap-DataComp-1B – 1.3B DataComp-1B images recaptioned with LLaVA-1.5-LLaMA3-8B. Longer, more detailed captions improve CLIP and text-to-image training. ICML 2025. Docs
  • TextAtlas5M – 5.4M image-text pairs for dense text image generation, spanning synthetic and real-world text-rich images (slides, book covers, papers, styled text). Benchmark included. MIT. Docs
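
Most of these corpora ship as Hugging Face datasets or WebDataset/Parquet shards, so samples can be streamed rather than downloaded in full. A minimal sketch using the datasets library; the kakaobrain/coyo-700m hub id and its url/text columns are assumptions, so adapt to whichever corpus you pick:

```python
from datasets import load_dataset

# streaming=True iterates shards lazily instead of materializing the
# full corpus on disk (hub id and column names assumed).
ds = load_dataset("kakaobrain/coyo-700m", split="train", streaming=True)

for sample in ds.take(3):
    print(sample["url"], "->", sample["text"][:80])
```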

Video Datasets

Public video datasets with captions, action labels, or instruction annotations.

  • Panda-70M – 70M high-quality short video clips with automatically generated captions. Used in several video LLM pretraining runs. Docs
  • WebVid-10M – 10M weakly-captioned web videos. Common pretraining corpus for text-to-video and video-language models.
  • Ego4D – 3,670 hours of first-person video from 931 participants across 9 countries. Benchmarks for social, hands, memory, forecasting. Docs
  • HowTo100M – 136M narrated instructional video clips from 1.22M YouTube videos. A staple for video-language pretraining. Docs
  • InternVid – 7M videos (~234M clips) with rich captions. Part of the InternVideo foundation-model release. Docs
  • OpenVid-1M – 1M curated text-video pairs with aesthetic/motion/consistency scores. Includes OpenVidHD-0.4M subset at 1080p. ICLR 2025. Docs
  • PE-Video – Meta's 1M diverse short videos with text descriptions and 120K human-verified captions. Released with Perception Encoder (2025). CC-BY-NC-4.0. Docs
  • VideoUFO – 1.09M Creative Commons video clips paired with brief and detailed captions, covering 1,291 user-focused topics derived from real text-to-video prompts. NeurIPS 2025. CC-BY-4.0. Docs

Audio and Speech Datasets

Public speech, music, and sound-event corpora for training and evaluation.

  • Mozilla Common Voice – Open speech corpus built from community contributions. 100+ languages and growing. Docs
  • AudioSet – Google's 2M+ human-labelled 10-second sound clips across 632 classes. Standard corpus for sound-event classification.
  • CapSpeech – 10M+ machine-annotated and 360K human-annotated audio-caption pairs for style-captioned TTS. Covers accent, emotion, sound effects, and agent speech tasks. CC-BY-NC-4.0. Docs
  • Emilia – 100k+ hours of multilingual, in-the-wild speech data from the Amphion team. Backbone for modern open TTS training. Docs
  • LibriSpeech – 1,000-hour English speech corpus derived from LibriVox. The long-running default ASR benchmark.
  • WavCaps – 400K weakly-labelled audio clips with ChatGPT-generated captions from FreeSound, BBC Sound Effects, and AudioSet. Academic use only. Docs

Document and Text Datasets

Large-scale web, code, and document corpora used to pretrain multimodal models.

  • Dolma – AI2's 3T-token open pretraining corpus with a transparent pipeline. Backbone for OLMo training.
  • BigDocs-7.5M – 7.5M permissively licensed document image-text pairs from ServiceNow. Covers OCR, structured parsing, captioning, and QA across scientific papers, tables, and UI screenshots. ICLR 2025. CC-BY-4.0. Docs
  • Common Corpus – 2T-token fully open text corpus from PleIAs. Public-domain and permissive text only — usable for commercial training. Docs
  • DCLM-Baseline – 4T-token web corpus filtered from a 240T-token Common Crawl pool. A 7B model trained on 2.6T of its tokens reaches 64% MMLU. NeurIPS 2024. Docs
  • FineWeb / FineWeb-Edu – 15T high-quality English tokens filtered from 96 Common Crawl dumps. Edu variant is ~1.3T educational-content tokens. Docs
  • The Stack v2 – Permissively licensed source code spanning 600+ programming languages, built on the Software Heritage archive. Training data for StarCoder2. Docs

Data Loading and Streaming

Formats and libraries for streaming multimodal data from object storage into training jobs. A short WebDataset example appears after the list.

  • HuggingFace Datasets – Unified loading, streaming, and processing of thousands of datasets. Native Arrow/Parquet with memory-mapped access. Docs | SDK: Python (pip install datasets)
  • Lance – Columnar format designed for ML. 100x faster random access than Parquet, versioned, zero-copy from object storage. Docs | SDK: Python (pip install pylance), Rust
  • WebDataset – PyTorch dataset format packaging samples as tar shards. Streams directly from S3-compatible object storage. Docs | SDK: Python (pip install webdataset)
  • MosaicML Streaming – Deterministic, shuffled, resumable streaming of datasets from cloud storage. Shard format: MDS. Docs | SDK: Python (pip install mosaicml-streaming)
  • Ray Data – Distributed data loading and preprocessing inside Ray. Native connectors for Parquet on S3-compatible stores. SDK: Python (pip install 'ray[data]')
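
As an example of the pattern these libraries share, WebDataset can stream tar shards straight off an S3-compatible bucket by piping through a CLI copy command. A minimal sketch, assuming a hypothetical bucket and shard layout (the aws CLI must be pointed at your store's endpoint, e.g. with --endpoint-url for Backblaze B2):

```python
import webdataset as wds

# Hypothetical shard URLs; the {000000..000099} brace range enumerates
# 100 tar shards, each holding .jpg/.txt sample pairs.
shards = "pipe:aws s3 cp s3://my-bucket/shards/shard-{000000..000099}.tar -"

dataset = (
    wds.WebDataset(shards)
    .decode("pil")           # decode image members into PIL images
    .to_tuple("jpg", "txt")  # yield (image, caption) pairs
)

image, caption = next(iter(dataset))
print(image.size, caption[:60])
```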

Data Curation and Labeling

Tools for deduplication, cleaning, filtering, and annotating multimodal datasets. See the cleanlab sketch after the list.

  • Label Studio – Open-source labeling platform supporting text, image, audio, video, and time-series. Docs | SDK: Python (pip install label-studio-sdk)
  • CVAT – Computer vision annotation tool with strong video and 3D support. Originally from Intel, now community-maintained. Docs
  • cleanlab – Finds label errors, outliers, and dataset issues automatically with confident-learning methods. Docs | SDK: Python (pip install cleanlab)
  • FiftyOne – Open-source dataset curation and model-evaluation toolkit for CV. Rich UI for exploring image/video datasets. Docs | SDK: Python (pip install fiftyone)
  • Datatrove – HuggingFace's data-processing library for LLM pretraining. Parallel filters, deduplication, tokenization pipelines. SDK: Python (pip install datatrove)
  • fastdup – Unsupervised analysis of large visual datasets. Detects duplicates, outliers, mislabels, and leakage in minutes. Docs | SDK: Python (pip install fastdup)
  • Cosmos-Curate – NVIDIA's distributed video curation pipeline for world foundation model training. Splits, annotates, filters, and deduplicates video at scale using Ray. Apache-2.0.
  • Data-Juicer – Composable data-processing toolkit with 200+ operators spanning text, image, audio, and video. Scales from a laptop to 1000-node Ray clusters. NeurIPS 2025 Spotlight. Apache-2.0. Docs | SDK: Python (pip install py-data-juicer)
  • NeMo Curator – GPU-accelerated curation toolkit from NVIDIA. Handles text, image, video, and audio at scale using RAPIDS and Ray. Apache-2.0. Docs | SDK: Python (pip install nemo-curator)
  • SemHash – Lightweight multimodal library for semantic deduplication, outlier filtering, and representative sample selection across text, images, and audio. SDK: Python (pip install semhash)
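
To make one of these concrete: cleanlab's confident-learning entry point ranks likely label errors from a classifier's out-of-sample predicted probabilities. A minimal sketch with toy arrays (in practice, pred_probs come from cross-validated predictions over your dataset):

```python
import numpy as np
from cleanlab.filter import find_label_issues

labels = np.array([0, 0, 1, 1, 1])        # given (possibly noisy) labels
pred_probs = np.array([                   # toy out-of-sample probabilities
    [0.9, 0.1],
    [0.2, 0.8],   # labeled 0, but the model is confident it's class 1
    [0.1, 0.9],
    [0.8, 0.2],   # labeled 1, but looks like class 0
    [0.3, 0.7],
])

issue_idx = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)
print("Suspected mislabels:", issue_idx)  # expect indices 1 and 3
```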

Embedding and Indexing Models

Multimodal embedding models and frameworks for building searchable indices. A small retrieval example follows the list.

  • FAISS – Meta's library for efficient similarity search of dense vectors. The reference ANN library for research-scale indexing. Docs | SDK: Python (pip install faiss-cpu), C++
  • sentence-transformers – PyTorch framework for training and using dense embedding models. De-facto library for semantic search. Docs | SDK: Python (pip install sentence-transformers)
  • OpenCLIP – Open-source reproduction of CLIP with many trained checkpoints (including SigLIP and EVA-CLIP variants). SDK: Python (pip install open-clip-torch)
  • BGE (BAAI General Embedding) – Family of multilingual text embeddings from BAAI. Top performers on MTEB; extended to visual and multimodal variants.
  • Nomic Embed – Open multimodal embedding family from Nomic. Nomic Embed Vision pairs with Nomic Embed Text for unified image/text search. Docs
  • jina-embeddings-v4 – 3.8B-parameter multimodal embedding model for text, images, and visual documents (charts, tables). Supports dense and late-interaction retrieval across 30+ languages. Non-commercial (Qwen Research License). Docs | SDK: Python (pip install transformers)
  • MMEB-train – 2.1M training samples for the Massive Multimodal Embedding Benchmark. Covers visual QA, image retrieval, classification, and grounding across 20 datasets. Apache-2.0.
  • SigLIP 2 – Google's image-text encoder family trained with a sigmoid contrastive loss. Version 2 adds multilingual and dense-caption training. Docs
  • VLM2Vec – Framework for training VLMs as dense embedding models. Ships MMEB-V2 benchmark with 78 tasks across images, videos, and visual documents. Docs | SDK: Python
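
A minimal retrieval sketch showing how two of these fit together: sentence-transformers produces normalized embeddings and FAISS serves cosine-similarity search over them (the model name and captions are illustrative):

```python
import faiss
from sentence_transformers import SentenceTransformer

captions = [
    "a golden retriever catching a frisbee",
    "aerial view of a container ship at port",
    "close-up of a circuit board under a microscope",
]

model = SentenceTransformer("all-MiniLM-L6-v2")          # illustrative choice
emb = model.encode(captions, normalize_embeddings=True)  # unit-length vectors

# With normalized vectors, inner product equals cosine similarity.
index = faiss.IndexFlatIP(emb.shape[1])
index.add(emb)

query = model.encode(["dog playing fetch"], normalize_embeddings=True)
scores, ids = index.search(query, 2)
print([(captions[i], float(s)) for i, s in zip(ids[0], scores[0])])
```

The same pattern scales up by swapping IndexFlatIP for an approximate index (IVF, HNSW) once the collection outgrows brute-force search.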

Templates and Example Projects

Reference implementations, demos, and starter projects, with an img2dataset sketch after the list.

  • img2dataset – Fast, resumable tool to turn image-URL lists into WebDataset shards. Routinely used to download LAION-scale datasets. SDK: Python (pip install img2dataset)
  • video2dataset – Download, trim, and package large video datasets into WebDataset shards.
  • DataComp Quickstart – Reference workflow for downloading, filtering, and training on DataComp. Good starting point for custom filters.
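
For instance, img2dataset turns a URL/caption table into WebDataset shards that the streaming loaders above can consume directly. A hedged sketch, assuming a hypothetical urls.parquet with url and caption columns:

```python
from img2dataset import download

# Hypothetical input: a parquet file with `url` and `caption` columns,
# e.g. a metadata slice from COYO or DataComp.
download(
    url_list="urls.parquet",
    input_format="parquet",
    url_col="url",
    caption_col="caption",
    image_size=256,               # resize images on the fly
    output_format="webdataset",   # tar shards for WebDataset loaders
    output_folder="shards",
    processes_count=8,
    thread_count=32,
)
```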

Contributing

Contributions are welcome. See CONTRIBUTING.md. One entry per PR — edit entries.yaml only and let the maintainers regenerate README.md.

License

Released under CC0 1.0 Universal. You may copy, modify, and redistribute without attribution.

About Backblaze B2

Backblaze B2 Cloud Storage is S3-compatible object storage designed for AI and media workloads. This list is maintained as part of our work making B2 a convenient storage layer for AI workflows.
