mims-harvard/OptimusKG

uv License: MIT Python 3.12+ GitHub Stars DOI pre-commit Website

Highlights

  • A modern biomedical knowledge graph with molecular, anatomical, clinical, and environmental modalities.
  • Integrates 65 heterogeneous resources grounded with 18 ontologies and controlled vocabularies using the BioCypher framework and the Biolink Model.
  • Contains 190,531 nodes across 10 entity types, 21,813,816 edges across 26 relation types, and 67,249,863 property instances encoding 110,276,843 values across 150 distinct property keys.
  • Independently validated using PaperQA3, a multimodal agent that retrieves and reasons over scientific literature.
  • Reproducible, deterministic and infrastructure-agnostic data pipeline with parallel execution.
  • Distributed as Apache Parquet files and downloadable via the optimuskg Python client.

OptimusKG is developed at the Zitnik Lab, Harvard Medical School.

Using OptimusKG

OptimusKG is available via Harvard Dataverse. The graph can be programmatically accessed using the Python client, available on PyPI:

# With pip.
pip install optimuskg
# Or pipx.
pipx install optimuskg

The client fetches files from the gold layer with local caching, and supports loading the graph either as Polars DataFrames or as a NetworkX MultiDiGraph:

import optimuskg

# Download a specific file and store it locally
local_path = optimuskg.get_file("nodes/gene.parquet")

# Load a single Parquet file as a Polars DataFrame
drugs = optimuskg.load_parquet("nodes/drug.parquet")

# Load nodes and edges as Polars DataFrames
# Set lcc=True to load only the largest connected component
nodes, edges = optimuskg.load_graph(lcc=True)

# Load the graph as a NetworkX MultiDiGraph with metadata
# Set lcc=True to load only the largest connected component
G = optimuskg.load_networkx(lcc=True)
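
To make the lcc option concrete, here is a minimal, standard-library sketch of what restricting a graph to its largest connected component involves. It is an illustration only: the function name, the toy node identifiers, and the choice to ignore edge direction (weak connectivity) are assumptions, not the client's actual implementation.

```python
from collections import defaultdict


def largest_connected_component(edges):
    """Return the node set of the largest component, ignoring edge direction."""
    adjacency = defaultdict(set)
    for src, dst in edges:
        adjacency[src].add(dst)
        adjacency[dst].add(src)  # treat every edge as undirected

    seen, best = set(), set()
    for start in adjacency:
        if start in seen:
            continue
        # Depth-first traversal collecting everything reachable from `start`.
        component, stack = {start}, [start]
        while stack:
            for neighbor in adjacency[stack.pop()]:
                if neighbor not in component:
                    component.add(neighbor)
                    stack.append(neighbor)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


# Toy edge list with hypothetical identifiers: two separate components.
edges = [("gene:A", "disease:X"), ("drug:B", "disease:X"), ("gene:C", "gene:D")]
lcc_nodes = largest_connected_component(edges)
# Keep only edges whose endpoints both lie in the largest component.
lcc_edges = [(s, d) for s, d in edges if s in lcc_nodes and d in lcc_nodes]
```

Filtering to the largest component drops small disconnected fragments, which is often useful before running algorithms that assume every node is reachable from every other.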

Note

Downloads are cached by default in platformdirs.user_cache_dir("optimuskg") (~/Library/Caches/optimuskg on macOS, ~/.cache/optimuskg on Linux, and C:\Users\<User>\AppData\Local\optimuskg\optimuskg on Windows). The cache location can be overridden via the $OPTIMUSKG_CACHE_DIR environment variable or programmatically with optimuskg.set_cache_dir(path).
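
The override behaviour described in this note can be pictured with the small sketch below. The helper resolve_cache_dir is hypothetical and only mirrors the documented rule that $OPTIMUSKG_CACHE_DIR, when set, takes the place of the platform default; how the client actually resolves the directory, and how the variable interacts with optimuskg.set_cache_dir, is not specified here.

```python
import os
from pathlib import Path


def resolve_cache_dir(default: Path, env=None) -> Path:
    """Return the OPTIMUSKG_CACHE_DIR override if set, otherwise the default.

    Hypothetical helper for illustration; not part of the optimuskg client.
    """
    env = os.environ if env is None else env
    override = env.get("OPTIMUSKG_CACHE_DIR")
    return Path(override).expanduser() if override else default


# Example: fall back to a Linux-style default when no override is present.
print(resolve_cache_dir(Path.home() / ".cache" / "optimuskg"))
```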

Note

To target a different dataset (e.g., a pre-release), set the $OPTIMUSKG_DOI environment variable or use optimuskg.set_doi("doi:10.xxxx/XXXX").

Data pipeline

The pipeline architecture consists of the following components:

| Component | Description |
| --- | --- |
| catalog | The single source of truth of all datasets, their schemas, their format, and their metadata. |
| dataset | An abstraction that handles file formats, storage locations, and persistence logic. |
| node | A pure Python function whose output value follows solely from its input values. |
| pipeline | A sequence of nodes wired into a DAG-based workflow, organized by the datasets they consume and produce. |
| layer | Follows the medallion architecture data design pattern to logically organize the data. There are 4 layers: landing, bronze, silver, and gold. |
| parameters | Used to define constants for filtering the data across the construction process. |
| provider | An abstraction that provides versioned, automatic data downloads from different data sources. |
| hook | Mechanism that allows injection of custom behavior into the core execution flow, such as before a node runs. |
| conf | A mechanism that separates code from settings, defining the catalog, parameters, logging configuration, and ontology harmonization across different environments. |
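
As a rough illustration of how nodes, datasets, and a pipeline fit together, the sketch below wires three pure functions into a DAG and runs them in dependency order using only the standard library. It is not Kedro's implementation: the Node tuple, run_pipeline, the dataset names, and the 0.5 threshold are all made up for the example, and a plain dictionary stands in for the catalog.

```python
from graphlib import TopologicalSorter
from typing import Callable, NamedTuple


class Node(NamedTuple):
    func: Callable      # pure function: output depends only on its inputs
    inputs: list[str]   # names of the datasets it consumes
    output: str         # name of the dataset it produces


def run_pipeline(nodes: list[Node], catalog: dict) -> dict:
    """Run nodes in dependency order; `catalog` maps dataset names to data."""
    producer = {n.output: n for n in nodes}
    # Each node depends on the nodes that produce its inputs.
    graph = {n.output: {i for i in n.inputs if i in producer} for n in nodes}
    data = dict(catalog)
    for name in TopologicalSorter(graph).static_order():
        node = producer[name]
        data[name] = node.func(*(data[i] for i in node.inputs))
    return data


# Nodes are deliberately listed out of order; the DAG determines execution.
nodes = [
    Node(lambda rows: [r for r in rows if r["score"] >= 0.5],
         ["silver.edges"], "gold.edges"),
    Node(lambda rows: [{**r, "score": float(r["score"])} for r in rows],
         ["bronze.edges"], "silver.edges"),
    Node(lambda raw: [dict(zip(("src", "dst", "score"), line.split(","))) for line in raw],
         ["landing.edges"], "bronze.edges"),
]
result = run_pipeline(nodes, {"landing.edges": ["g1,d1,0.9", "g2,d1,0.2"]})
print(result["gold.edges"])
```

Because each node is a pure function keyed by the datasets it reads and writes, the execution order can be derived from the data dependencies rather than declared by hand.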

Note

We leverage additional features of the Kedro framework, such as namespaces, kedro-viz, kedro-datasets and catalog injection in Jupyter notebooks.

Running the pipeline

The pipeline is designed to generate the full knowledge graph and all the intermediate datasets used to generate it in one command:

$ uv run kedro run --to-nodes gold.export_kg --runner=optimuskg.runners.FixedParallelRunner --async

[01/28/25 19:29:07] INFO     Using 'conf/logging.yml' as logging configuration. You can change this by setting the KEDRO_LOGGING_CONFIG environment variable accordingly.
[01/28/25 19:29:08] INFO     Kedro project optimuskg
[01/28/25 19:29:09] INFO     Using synchronous mode for loading and saving data. Use the --async flag for potential performance gains.

This will automatically download all the necessary data, store it in the landing layer, and execute the bronze, silver, and gold layers to finally export the graph inside the data/gold/kg/ directory.

Note

We recommend running with optimuskg.runners.FixedParallelRunner, which executes the nodes of a pipeline concurrently, together with the --async flag, which shortens data loading and saving by performing them asynchronously. Kedro's default ParallelRunner contains a bug that prevents it from running any validation checks.

Tip

The location, schema, and format of each dataset are specified in the catalog.

Tip

Run make help for a list of available Make commands, and uv run cli --help for additional CLI utilities.

Note

The pipeline automatically downloads public datasets and ingests them in the landing layer.

Place any private datasets under data/landing. If absent, the Origin Hook will create empty placeholders, allowing dependent nodes to run even if the private data is missing.
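
The placeholder idea can be sketched as follows. This is a simplified guess at the general technique, not the Origin Hook's actual code: ensure_placeholders and the file names are hypothetical, and the real hook may create typed or schema-aware placeholders rather than empty files.

```python
import tempfile
from pathlib import Path


def ensure_placeholders(root: Path, expected: list[str]) -> list[Path]:
    """Create an empty file for every expected dataset that is missing."""
    created = []
    for relative in expected:
        path = root / relative
        if not path.exists():
            path.parent.mkdir(parents=True, exist_ok=True)
            path.touch()  # empty placeholder; existing files are left alone
            created.append(path)
    return created


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "present.csv").write_text("id\n1\n")
    created = ensure_placeholders(root, ["present.csv", "private/missing.csv"])
    created_names = [p.relative_to(root).as_posix() for p in created]
    present_untouched = (root / "present.csv").read_text() == "id\n1\n"
```

Nodes that read a placeholder should expect empty input, so downstream results will simply omit whatever the private dataset would have contributed.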

Contributing

We are passionate about supporting contributors of all levels of experience and would love to see you get involved in the project. See the contributing guide to get started.

Citation

If you use OptimusKG in your research, please cite:

@article{vittor2026optimuskg,
  title={OptimusKG: Unifying biomedical knowledge in a modern multimodal graph},
  author={Vittor, Lucas and Noori, Ayush and Arango, I{\~n}aki and Polonuer, Joaqu{\'\i}n and Rodriques, Sam and White, Andrew and Clifton, David A. and Zitnik, Marinka},
  journal={Nature Scientific Data},
  year={2026}
}

License

The OptimusKG codebase is released under the MIT License. OptimusKG integrates multiple primary data resources, each of which is subject to its own license and terms of use. These terms may impose restrictions on redistribution, commercial use, or downstream applications of the resulting knowledge graph or its subsets. Some resources provide data under academic or noncommercial licenses, while others may impose attribution or usage requirements. As a result, use of OptimusKG may be partially restricted depending on the specific data components included in a given instantiation. Users are responsible for reviewing and complying with the license and terms of use of each primary dataset, as specified by the original data providers. OptimusKG does not alter or override these source-specific licensing conditions.

Made with ❤️ at Zitnik Lab, Harvard Medical School

About

A modern multimodal knowledge graph with type-specific metadata across biomedical domains.
