- A modern biomedical knowledge graph with molecular, anatomical, clinical, and environmental modalities.
- Integrates 65 heterogeneous resources grounded with 18 ontologies and controlled vocabularies using the BioCypher framework and the Biolink Model.
- Contains 190,531 nodes across 10 entity types, 21,813,816 edges across 26 relation types, and 67,249,863 property instances encoding 110,276,843 values across 150 distinct property keys.
- Independently validated using PaperQA3, a multimodal agent that retrieves and reasons over scientific literature.
- Reproducible, deterministic, and infrastructure-agnostic data pipeline with parallel execution.
- Distributed as Apache Parquet files and downloadable via the optimuskg Python client.
OptimusKG is developed at the Zitnik Lab, Harvard Medical School.
OptimusKG is available via Harvard Dataverse. The graph can be programmatically accessed using the Python client, available on PyPI:
# With pip.
pip install optimuskg

# Or pipx.
pipx install optimuskg

The client fetches files from the gold layer with local caching, and supports loading the graph either as Polars DataFrames or as a NetworkX MultiDiGraph:
import optimuskg
# Download a specific file and store it locally
local_path = optimuskg.get_file("nodes/gene.parquet")
# Load a single Parquet file as a Polars DataFrame
drugs = optimuskg.load_parquet("nodes/drug.parquet")
# Load nodes and edges as Polars DataFrames
# Set lcc=True to load only the largest connected component
nodes, edges = optimuskg.load_graph(lcc=True)
# Load the graph as a NetworkX MultiDiGraph with metadata
# Set lcc=True to load only the largest connected component
G = optimuskg.load_networkx(lcc=True)

Note
Downloads are cached by default in platformdirs.user_cache_dir("optimuskg") (~/Library/Caches/optimuskg on macOS, ~/.cache/optimuskg on Linux, and C:\Users\<User>\AppData\Local\optimuskg\optimuskg on Windows). The cache location can be overridden via the $OPTIMUSKG_CACHE_DIR environment variable or programmatically with optimuskg.set_cache_dir(path).
Note
To target a different dataset (e.g., a pre-release), set the $OPTIMUSKG_DOI environment variable or use optimuskg.set_doi("doi:10.xxxx/XXXX").
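Once loaded, the returned objects are plain Polars and NetworkX types, so standard operations apply directly. A toy sketch with a hand-built MultiDiGraph standing in for `optimuskg.load_networkx(lcc=True)`; the node identifiers and the `category`/`predicate` attribute names here are invented for illustration and are not OptimusKG's actual schema:

```python
import networkx as nx

# Toy stand-in for G = optimuskg.load_networkx(lcc=True).
# Attribute names ("category", "predicate") are illustrative only.
G = nx.MultiDiGraph()
G.add_node("HGNC:1100", category="gene")
G.add_node("CHEBI:15365", category="drug")
G.add_edge("CHEBI:15365", "HGNC:1100", predicate="targets")

# Filter nodes by attribute, as with any NetworkX graph.
drugs = [n for n, data in G.nodes(data=True) if data["category"] == "drug"]
print(drugs)                 # ['CHEBI:15365']
print(G.number_of_edges())   # 1
```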
The pipeline architecture consists of the following components:
| Component | Description |
|---|---|
| catalog | The single source of truth of all datasets, their schemas, their format, and their metadata. |
| dataset | An abstraction that handles file formats, storage locations, and persistence logic. |
| node | A pure Python function whose output value follows solely from its input values. |
| pipeline | A sequence of nodes wired into a DAG-based workflow, organized by the datasets they consume and produce. |
| layer | Follows the medallion architecture data design pattern to logically organize the data. There are 4 layers: landing, bronze, silver, and gold. |
| parameters | Used to define constants for filtering the data across the construction process. |
| provider | An abstraction that provides versioned, automatic data downloads from different data sources. |
| hook | Mechanism that allows injection of custom behavior into the core execution flow, such as before a node runs. |
| conf | A mechanism that separates code from settings, defining the catalog, parameters, logging configuration, and ontology harmonization across different environments. |
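The node and pipeline abstractions can be illustrated without Kedro: each node is a pure function keyed by the datasets it consumes and produces, and the pipeline executes them in dependency order against a catalog. A minimal stdlib sketch, in which all dataset names, node functions, and sample values are invented:

```python
# Each node is a pure function; the pipeline wires them by the
# datasets they consume and produce (names are illustrative only).
def normalize(raw):            # bronze-style node: clean raw strings
    return [r.strip().lower() for r in raw]

def dedupe(normalized):        # silver-style node: drop duplicates
    return sorted(set(normalized))

pipeline = [
    # (function, input dataset, output dataset), in dependency order
    (normalize, "landing.raw", "bronze.normalized"),
    (dedupe, "bronze.normalized", "silver.deduped"),
]

# In-memory stand-in for the catalog, seeded with a landing dataset.
catalog = {"landing.raw": ["TP53 ", "tp53", "BRCA1"]}
for fn, inp, out in pipeline:
    catalog[out] = fn(catalog[inp])

print(catalog["silver.deduped"])  # ['brca1', 'tp53']
```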
Note
We leverage additional features of the Kedro framework, such as namespaces, kedro-viz, kedro-datasets and catalog injection in Jupyter notebooks.
The pipeline is designed to generate the full knowledge graph and all the intermediate datasets used to generate it in one command:
$ uv run kedro run --to-nodes gold.export_kg --runner=optimuskg.runners.FixedParallelRunner --async
[01/28/25 19:29:07] INFO Using 'conf/logging.yml' as logging configuration. You can change this by setting the KEDRO_LOGGING_CONFIG environment variable accordingly.
[01/28/25 19:29:08] INFO Kedro project optimuskg
[01/28/25 19:29:09] INFO Using synchronous mode for loading and saving data. Use the --async flag for potential performance gains.

This will automatically download all the necessary data, store it in the landing layer, and execute the bronze, silver, and gold layers to finally export the graph inside the data/gold/kg/ directory.
Note
We recommend using optimuskg.runners.FixedParallelRunner to run the nodes within a pipeline concurrently, and the --async flag to reduce load and save times through asynchronous mode. Kedro's default ParallelRunner contains a bug that prevents it from running any validation checks.
Tip
The location, schema, and format of each dataset are specified in the catalog.
Tip
Run make help for a list of available Make commands, and uv run cli --help for additional CLI utilities.
Note
The pipeline automatically downloads public datasets and ingests them into the landing layer.
Place any private datasets under data/loading. If a private dataset is absent, the Origin Hook creates an empty placeholder, allowing dependent nodes to run even when the private data is missing.
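The placeholder behavior can be sketched as a pre-run hook that touches any missing private files so that downstream nodes still find an (empty) input. This is an illustrative sketch of the idea, not the Origin Hook's actual implementation; the function name and file paths are invented:

```python
import tempfile
from pathlib import Path

def ensure_placeholders(root: Path, private_files: list[str]) -> list[Path]:
    """Create empty stand-ins for any missing private datasets
    (illustrative sketch, not the actual Origin Hook)."""
    created = []
    for rel in private_files:
        path = root / rel
        if not path.exists():
            path.parent.mkdir(parents=True, exist_ok=True)
            path.touch()  # empty placeholder file
            created.append(path)
    return created

root = Path(tempfile.mkdtemp())
made = ensure_placeholders(root, ["private/clinical.parquet"])
print([p.name for p in made])  # ['clinical.parquet']
```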
We are passionate about supporting contributors of all levels of experience and would love to see you get involved in the project. See the contributing guide to get started.
If you use OptimusKG in your research, please cite:
@article{vittor2026optimuskg,
title={OptimusKG: Unifying biomedical knowledge in a modern multimodal graph},
author={Vittor, Lucas and Noori, Ayush and Arango, I{\~n}aki and Polonuer, Joaqu{\'\i}n and Rodriques, Sam and White, Andrew and Clifton, David A. and Zitnik, Marinka},
journal={Nature Scientific Data},
year={2026}
}

The OptimusKG codebase is released under the MIT License. OptimusKG integrates multiple primary data resources, each of which is subject to its own license and terms of use. These terms may impose restrictions on redistribution, commercial use, or downstream applications of the resulting knowledge graph or its subsets. Some resources provide data under academic or noncommercial licenses, while others may impose attribution or usage requirements. As a result, use of OptimusKG may be partially restricted depending on the specific data components included in a given instantiation. Users are responsible for reviewing and complying with the license and terms of use of each primary dataset, as specified by the original data providers. OptimusKG does not alter or override these source-specific licensing conditions.
Made with ❤️ at Zitnik Lab, Harvard Medical School