Warning
Under Development: This library is still under active development and may contain bugs or undergo breaking changes. Use with caution in production or critical research workflows. Please report any bugs or issues using the Issues tab.
ATLAS (Automated Training with Latent-space Aware Sampling) is a unified Python framework for building robust machine learning interatomic potentials (MLIPs). It combines a diversity-aware database generator with a manifold-aware active learning workflow to produce compact, high-quality training datasets. ATLAS supports structure generation for bulk, surface, cluster, and isolated atom configurations across single-, binary-, and ternary (WIP) phase diagrams, with perturbations, vacancies, deformations, and adsorbates.
The active learning engine iteratively trains MACE models, runs molecular dynamics simulations, detects extrapolating structures via descriptor-based or latent-space methods (autoencoder + concave hull), and submits them for DFT labelling, all orchestrated through AiiDA. Additional capabilities include data reduction mode, safeguard checks to prevent premature convergence, test database evaluation, diversity metrics (Vendi Score, Circles Metric), an interactive monitoring dashboard, a desktop GUI (in development), MLIP benchmarking, and comprehensive reporting of model performance and resource usage.
Validated on metals, alloys, and metal oxides, ATLAS produces datasets whose chemical-space coverage exceeds that of foundation model training sets by orders of magnitude while using up to 300x fewer structures.
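The active learning cycle described above can be pictured with a toy sketch. It is not the ATLAS implementation: every function name below is hypothetical, MACE training is replaced by remembering a 1D descriptor range, MD by random sampling, and DFT labelling by a simple formula. Only the control flow (train, explore, flag extrapolating structures, label, repeat) mirrors the description.

```python
import random


def train_model(dataset):
    # Stand-in for MACE training: remember the descriptor range seen so far.
    values = [x for x, _ in dataset]
    return min(values), max(values)


def run_md(n_steps, rng):
    # Stand-in for an MD run: sample 1D "descriptors" from a fixed region.
    return [rng.uniform(-1.0, 1.0) for _ in range(n_steps)]


def is_extrapolating(model, x):
    # Stand-in for descriptor/latent-space detection: outside the known range.
    low, high = model
    return x < low or x > high


def label_with_dft(x):
    # Stand-in for a DFT single-point calculation.
    return x ** 2


def active_learning(dataset, max_iters=20, n_steps=50, seed=0):
    """Grow `dataset` until no sampled structure extrapolates (or max_iters)."""
    rng = random.Random(seed)
    dataset = list(dataset)
    for iteration in range(max_iters):
        model = train_model(dataset)
        trajectory = run_md(n_steps, rng)
        new = [x for x in trajectory if is_extrapolating(model, x)]
        if not new:
            return dataset, iteration  # nothing left to label
        dataset.extend((x, label_with_dft(x)) for x in new)
    return dataset, max_iters


data, iters = active_learning([(0.0, 0.0), (0.1, 0.01)], max_iters=5)
print(len(data), iters)
```

In this toy version the dataset only grows when exploration leaves the region already covered, which is the reason such loops tend to yield compact datasets.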
Note
Preprint Available: The theoretical framework and benchmarking for this project are now available as a working paper on ChemRxiv: Balancing Diversity and Efficiency in Training Datasets for Robust Machine Learning Potentials.
- Installation
- Developer Workflow
- Usage
- Example: Training a MACE MLIP from scratch
- Package Structure
- Implementation Details
- Authors and Maintainers
To install ATLAS, use pip inside a Python virtual environment or a conda environment. ATLAS is developed against Python 3.11, which can be installed through the OS package manager or conda.
First, create a virtual environment and activate it. This can be done in several ways, but we provide some examples using conda, python venv or uv.
# Create a conda environment named atlas which uses python 3.11
conda create -n atlas python=3.11
# Activate the environment
conda activate atlas
An example for an Ubuntu 22.04 system, using python3.11 and venv:
# Install python3.11 and venv
sudo apt install python3.11 python3.11-venv
# Using python venv - create and activate the environment
python3.11 -m venv atlas
source atlas/bin/activate
To use uv instead, first install the uv tool, either with the standalone installer shown below or by following the official uv installation guide for other options.
wget -qO- https://astral.sh/uv/install.sh | sh
Once uv is installed, create an environment named atlas specifically with Python 3.11:
# Create the virtual environment
uv venv atlas --python 3.11
Make sure to navigate to the folder where you would like your Python environment to be located, or specify the desired path. You can activate the newly created environment as follows:
source atlas/bin/activate
With the environment now activated, the library can be installed.
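Before installing, it can be worth confirming that the activated environment really runs Python 3.11, since system-wide `python3` may point to a different version. The helper below is a minimal sketch using only the standard library; the function name `check_python` is illustrative and not part of ATLAS.

```python
import sys


def check_python(required=(3, 11), version=None):
    """Return True when the interpreter's major.minor equals `required`."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) == tuple(required)


if __name__ == "__main__":
    found = f"{sys.version_info.major}.{sys.version_info.minor}"
    status = "OK" if check_python() else "expected 3.11"
    print(f"Python {found}: {status}")
```

Run it with the interpreter of the activated environment (for instance by saving it to a file and calling `python3` on it).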
# Clone the repository
git clone https://github.com/pol-sb/atlas.git
There are several installation mechanisms and several optional dependencies, depending on which packages you want to use. The full list and details of the optional dependencies are in pyproject.toml. Currently, the following are available:
- mace
- dev
Optional dependencies are installed using the following syntax:
python3 -m pip install ./ATLAS['OPTIONAL_DEPENDENCY_NAME']
Some installation examples follow:
# Install the library and the MACE dependencies in the venv using pip
python3 -m pip install ./ATLAS['mace']
# Install the library and the MACE dependencies using uv
uv pip install ./ATLAS['mace']
Finally, initialize the configuration files by running the initial configuration command (atl_init_setup). Then, enter your Materials Project API key in the path displayed in the output to finish the setup process:
# Run the last setup step - configuration initialization
atl_init_setup
Note
If you are only interested in database generation, the setup is complete at this point and the following AiiDA setup can be skipped.
- The active learning (AL) loop uses the AiiDA library to manage the workflow. To run the AL loop on compute clusters, codes and computers must be configured in AiiDA. See the AiiDA installation guide for installation instructions.
- DFT calculations with VASP use the aiida-vasp plugin, which needs additional configuration. Please follow the instructions on their website.
- The steps required to set up the active learning loop with the simplest AiiDA configuration are the following:
  - Set up an AiiDA profile and database with verdi presto.
  - Create the AiiDA computer and code entries for ATLAS and aiida-vasp.
  - Add the potential datasets for aiida-vasp (information here).
Install the development dependencies with pip install -e '.[dev]', which adds pre-commit, pytest, commitizen, and ipdb. After cloning, run pre-commit install to activate the git hooks.
The project uses commitizen (cz commit) for structured conventional commits. It prompts for change type (feat/fix/docs/style/refactor/perf/test/chore), scope (al_loop/core/init_db/md/...), and a summary message. This feeds into automated changelog generation and version bumps (cz bump).
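To illustrate the commit format this produces, the sketch below parses a conventional-commit header of the form `type(scope): summary` using the change types listed above. It is an assumption-laden illustration: the regular expression is written for this example and is not the pattern commitizen itself uses, and `parse_header` is a hypothetical helper.

```python
import re

# Change types taken from the list above.
TYPES = ("feat", "fix", "docs", "style", "refactor", "perf", "test", "chore")

# type, optional (scope), then ": " and a non-empty summary.
HEADER = re.compile(
    r"^(?P<type>" + "|".join(TYPES) + r")"
    r"(?:\((?P<scope>[\w-]+)\))?"
    r": (?P<summary>\S.*)$"
)


def parse_header(line):
    """Return the header fields as a dict, or None if the line does not match."""
    match = HEADER.match(line)
    return match.groupdict() if match else None


print(parse_header("feat(al_loop): add safeguard check"))
print(parse_header("update stuff"))
```

A header without a scope, such as `fix: handle empty database`, still matches and reports `scope` as `None`.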
On every commit, pre-commit runs the following hooks in sequence:
- Schema docs: Regenerates docs/source/input.md when config_schema.yaml changes.
- ruff: Lints all Python files with auto-fix, enforcing pycodestyle, pyflakes, pyupgrade, flake8-bugbear, isort, and numpy-style docstring rules.
- pytest: Runs the full test suite (python -m pytest tests/ -x), stopping at the first failure.
- Miscellaneous: Fixes trailing newlines, validates YAML/TOML, checks for oversized files.
If any hook fails, the commit is blocked; ruff errors must be resolved manually and the tests must pass before proceeding. Run pre-commit run --all-files to check everything without committing, or use cz commit --retry to retry the last commitizen interaction after fixing issues.
The goal of this library is to provide workflows, functions, and utilities that streamline the training of machine learning interatomic potentials (MLIPs) through active learning (AL) loops.
During the library installation, several entry points will be added so that the user can easily run the different utilities:
- atl_init_setup: Run initial configuration steps after installing ATLAS.
- atl_run_dft_database: Run DFT calculations for an ATLAS structure database.
- atl_gen_configuration_file: Generate a .toml template configuration file to be used in any of the different operation modes of the code.
- atl_gen_init_db: Generate a database containing structures for MLIP training.
- atl_active_learning: Launch an AL loop using a configuration file and a labelled initial database.
- atl_monitor_al_loop: Launch a Flask dashboard locally to monitor a running active learning loop. Open http://127.0.0.1:8000 (or the port specified in the launch arguments) in a browser to view the dashboard.
- atl_benchmark_mlip: Evaluate and compare the performance of MLIPs using a suite of benchmarks.
All of the entry points provide usage documentation when launched with the -h/--help argument, e.g.:
$> atl_gen_configuration_file --help
>>> usage: atl_gen_configuration_file [-h] -t TYPE [-p PATH] [-o]
>>>
>>> Generate ATL default configuration files in the TOML format.
>>>
>>> options:
>>>   -h, --help
>>>       show this help message and exit
>>>   -t TYPE, --config_type TYPE
>>>       Type of the configuration file to be generated. Available types are:
>>>         - active_learning: Configuration file for active learning loop.
>>>         - initial_db: Configuration file for initial database generation.
>>>   -p PATH, --path PATH
>>>       Path in which to store the file.
>>>       Will use the CWD by default. Folders will be created if necessary.
>>>   -o, --overwrite
>>>       Whether to overwrite the destination file, if existent.
The utilities for database generation and for running the AL loop take their inputs in the TOML format. Users are advised to use atl_gen_configuration_file to generate a template file, which can then be customized.
A description of all available options and parameters is given in the input file documentation, either online (documentation) or in the local documentation files (Input).
This example showcases the training of a MACE potential on a pure Cu database.
To generate the database, the generation parameters must be listed in a .toml configuration file. Use the atl_gen_configuration_file command to generate a template file with instructions that can be customized easily. The input documentation lists and describes the available options.
# Generate a configuration file for the database generation.
atl_gen_configuration_file -t initial_db
After making any desired changes to the created configuration file, generate the database by running atl_gen_init_db with the path to the configuration file:
# Generate the initial database
atl_gen_init_db -c ./path/to/config_file.toml
This database will be generated as an extxyz file. This file must be labelled in order to be suitable for the AL loop.
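A quick way to see whether an extxyz file already carries labels is to scan the comment line of each frame. The sketch below relies only on the general extended-XYZ layout (an atom count, a comment line, then one line per atom). The key name `energy` is an assumption, since the actual label keys depend on how the file was labelled, and `scan_extxyz` is an illustrative helper rather than an ATLAS function.

```python
import re


def scan_extxyz(lines, label_key="energy"):
    """Return (n_frames, n_labelled) for extended-XYZ text given as lines."""
    # Match the key only as a whole word, e.g. not inside "free_energy=".
    pattern = re.compile(r"(?:^|\s)" + re.escape(label_key) + r"=")
    n_frames = n_labelled = 0
    i = 0
    while i < len(lines):
        if not lines[i].strip():
            i += 1  # tolerate blank lines between frames
            continue
        n_atoms = int(lines[i].split()[0])
        comment = lines[i + 1]
        n_frames += 1
        if pattern.search(comment):
            n_labelled += 1
        i += 2 + n_atoms  # skip count line, comment line, and atom lines
    return n_frames, n_labelled


# Two made-up frames: the first has an energy entry, the second does not.
example = """2
Lattice="3.6 0 0 0 3.6 0 0 0 3.6" Properties=species:S:1:pos:R:3 energy=-7.4
Cu 0.0 0.0 0.0
Cu 1.8 1.8 0.0
1
Properties=species:S:1:pos:R:3
Cu 0.0 0.0 0.0
""".splitlines()

print(scan_extxyz(example))  # (2, 1)
```

For a real database, read the file with `open(path).read().splitlines()` and pass the key name your labelling step writes.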
The structures can be labelled automatically with VASP or, for quick testing, with a pretrained MACE model.
- For MACE labelling, the following command can be used. For more information, check the MACE documentation:
mace_eval_configs --configs ./unlabelled_db.xyz --model /model/path/cu_model_zan.model --output ./labelled_db.xyz --device cpu --batch_size 5
- To use VASP for structure labelling, run the atl_run_dft_database command, providing the path of the database and a configuration file with the input settings (which can be generated with atl_gen_configuration_file -t run_dft_database):
atl_run_dft_database --db_file ./database.xyz -c settings.toml
Generate a settings file, customize it using the documented options, and run the active learning loop:
# Generate a template file for active learning
atl_gen_configuration_file -t active_learning
# Run the active learning loop, piping its output to a file.
# Without the '-c' option, the program will search for 'active_learning_settings.toml'
# in the current directory.
# The gui subcommand launches a GUI on localhost, which can be
# viewed in a browser.
atl_active_learning gui --n_sec 60 2>&1 | tee ./run_atl_al.log
The progress of the AL loop can be monitored by checking its output or by opening the dashboard running at http://127.0.0.1:8000.
After the active learning procedure is completed, a database in the extxyz format and a model file for the potential will be returned.
The main functionalities are organized into the following modules:
- workflows: Contains functions and methods that connect the database with workflow tools, mainly for performing DFT calculations.
- core: Includes core functionalities and utilities used by the library, such as the generation and management of the database.
- active_learning: Contains classes and functions leveraged during the active learning loops.
- examples: Provides example scripts that demonstrate the usage of the library.
The following examples demonstrate the usage of ATLAS:
- launch_sp_calcs_db_aiida.py: This example script demonstrates how to launch single-point calculations for a given set of structures and store the results in the database.
- create_init_db_new.py: This example script showcases how to create and initialize a new database with initial data.
Please refer to the examples in the examples directory for more details on how to utilize the library for your specific needs.
- Pol Sanz Berman (Main Developer) - Predoctoral Researcher, ICIQ
- Lulu Li (Contributor) - Postdoctoral Researcher, ICIQ
- Zan Lian (Contributor) - Postdoctoral Researcher, ICIQ
For technical inquiries or collaborations, please open an issue or contact psanz@iciq.es.