Slips ML Training Pipeline

Supplementary machine learning training pipeline for cybersecurity models used by the SLIPS project from the Stratosphere Laboratory.

This repository is intentionally separate from SLIPS itself, but it is designed to produce models and preprocessing artifacts that remain compatible with SLIPS ML modules.

Overview

The pipeline normalizes labeled network-flow datasets to the SLIPS schema, extracts features, applies preprocessing, trains and evaluates classifiers, and stores reproducible experiment outputs. It also supports Optuna-based hyperparameter optimization for multi-objective model search.

Typical use cases:

Train a baseline model on Zeek-style labeled flow datasets
Evaluate trained models on held-out or unseen datasets
Compare dataset mixing strategies such as sequential, random, balanced, and oversampled training
Run Optuna studies to search classifier and preprocessing settings
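
The mixing strategies named above can be pictured with a small sketch. This is not the code in src/data_selectors.py: the function name mix_batches, the (flow_id, label) record layout, and the exact meaning given to each strategy are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def mix_batches(datasets, strategy, seed=0):
    """Return one training order for records drawn from several datasets.

    `datasets` maps a dataset name to a list of (flow_id, label) records.
    The four strategies are illustrative interpretations, not the project's.
    """
    rng = random.Random(seed)
    records = [rec for name in sorted(datasets) for rec in datasets[name]]

    if strategy == "sequential":
        # Keep datasets in name order, one after another.
        return records
    if strategy == "random":
        # Shuffle all records across dataset boundaries.
        shuffled = records[:]
        rng.shuffle(shuffled)
        return shuffled

    by_label = defaultdict(list)
    for rec in records:
        by_label[rec[1]].append(rec)

    if strategy == "balanced":
        # Downsample every class to the size of the rarest class.
        target = min(len(group) for group in by_label.values())
        picked = [r for group in by_label.values() for r in rng.sample(group, target)]
    elif strategy == "oversampled":
        # Resample rarer classes with replacement up to the largest class.
        target = max(len(group) for group in by_label.values())
        picked = []
        for group in by_label.values():
            picked.extend(group)
            picked.extend(rng.choices(group, k=target - len(group)))
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    rng.shuffle(picked)
    return picked
```

Under these assumptions, "balanced" shrinks the training set to match the rarest class, while "oversampled" grows it to match the most common one, which is the usual trade-off between discarding data and repeating it.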

Key Capabilities

End-to-end training and testing driven by YAML configuration
Dataset normalization from Zeek connection logs to a SLIPS-compatible schema
Modular feature extraction, preprocessing, and classifier wrappers
Support for both scikit-learn and river-based models
Experiment versioning through per-run output folders and effective config snapshots
Built-in plotting utilities for training, testing, and Optuna studies

Repository Structure

run.py: entry point for standard runs and Optuna studies
configs/: example pipeline configurations
src/pipeline.py: runtime orchestration of the full pipeline
src/dataset_wrapper.py: dataset discovery, loading, and caching
src/conn_normalizer.py: Zeek-to-SLIPS normalization
src/features.py: feature extraction
src/preprocessing_wrapper.py: preprocessing pipeline assembly and persistence
src/classifier_wrapper.py: model wrapper abstraction
src/data_selectors.py: dataset mixing and batch selection strategies
src/plot_utils/: plotting scripts and Optuna visualization helpers
docs/OPTUNA.md: detailed Optuna usage and output guide
testing/: automated test suite

Requirements

Python 3
pip, plus a virtual environment tool (recommended)
Labeled datasets in the format expected by the configured dataset loader

Install dependencies:

pip install -r requirements.txt

Main Python dependencies include numpy, pandas, scikit-learn, river, PyYAML, optuna, and matplotlib.

Datasets

By default, the pipeline expects datasets under ./datasets.

You can:

Use your own datasets if they match the expected structure
Adapt feature extraction and loading logic for custom formats
Use Stratosphere's labeled security datasets: https://github.com/stratosphereips/security-datasets-for-testing

To keep a consistent local path while storing the actual data elsewhere, use a symlink:

# Create the link directly: if ./datasets already exists as a directory, ln places the link inside it instead
ln -s /absolute/path/to/your/datasets ./datasets

If you need the resolved absolute path later, for example when mounting the data into Docker, use:

readlink -f datasets

Quick Start

Native Run

Run the default configuration:

python run.py configs/default_config.yaml

Run an Optuna study:

python run.py configs/optuna_conf.yaml --optuna

Docker Run

Build the image:

docker build -t slips-pipeline:latest .

Run with Docker Compose:

docker compose run --rm pipeline python run.py configs/default_config.yaml
docker compose run --rm pipeline python run.py configs/optuna_conf.yaml --optuna

If ./datasets is a symlink, mount the resolved target so the container sees the same data path:

DATA_ROOT=$(readlink -f datasets)
docker compose run --rm \
  -v "$DATA_ROOT":"$DATA_ROOT" \
  pipeline \
  python run.py configs/default_config.yaml

Current resource defaults in docker-compose.yml:

cpus: 4
mem_reservation: 16g
mem_limit: 20g
memswap_limit: 32g

Configuration

Pipeline behavior is defined in YAML configuration files under configs/.

Core sections include:

experiment_name: base name for the run output directory
root: dataset root, typically ./datasets
classes: label set used during training and evaluation
dataset_loader: file discovery, caching, and parsing settings
features: feature extraction behavior
preprocessing: ordered sklearn-style preprocessing steps
model: wrapper type, classifier type, and model parameters
commands: ordered runtime commands such as train and test
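
As a quick sanity check before a run, the core sections listed above can be verified on a loaded configuration. The sketch below assumes the YAML file has already been parsed into a plain dict; the helper name missing_sections and every value in example_config are hypothetical and do not come from configs/default_config.yaml.

```python
# Core top-level sections named in this README.
REQUIRED_SECTIONS = (
    "experiment_name",
    "root",
    "classes",
    "dataset_loader",
    "features",
    "preprocessing",
    "model",
    "commands",
)


def missing_sections(config):
    """Return the core section names that are absent from a loaded config dict."""
    return [name for name in REQUIRED_SECTIONS if name not in config]


# Hypothetical values, shown only to illustrate the overall shape.
example_config = {
    "experiment_name": "baseline",
    "root": "./datasets",
    "classes": ["Benign", "Malicious"],
    "dataset_loader": {},
    "features": {},
    "preprocessing": {"steps": []},
    "model": {},
    "commands": ["train", "test"],
}
```

Calling missing_sections(example_config) returns an empty list; a config that lacks, say, the model section would have "model" reported instead.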

When a run starts, the pipeline creates a unique experiment directory and stores:

config_effective.yaml
config_effective.json
model and preprocessing artifacts
logs, metrics, and plots
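
One way to picture the per-run directory and the JSON snapshot is the sketch below. The timestamp-based naming scheme and the function name create_run_dir are assumptions; the pipeline's actual naming may differ, and only the config_effective.json file name is taken from the list above.

```python
import json
from datetime import datetime
from pathlib import Path


def create_run_dir(experiment_root, experiment_name, config):
    """Create a fresh run directory and snapshot the effective config as JSON."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    base = Path(experiment_root) / f"{experiment_name}_{stamp}"
    run_dir = base
    suffix = 1
    # Add a numeric suffix if a run with the same timestamp already exists.
    while run_dir.exists():
        run_dir = base.with_name(f"{base.name}_{suffix}")
        suffix += 1
    run_dir.mkdir(parents=True)
    snapshot = run_dir / "config_effective.json"
    snapshot.write_text(json.dumps(config, indent=2, sort_keys=True))
    return run_dir
```

Writing the snapshot with sorted keys keeps two runs of the same configuration byte-identical, which makes later diffs between experiments easier to read.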

See configs/default_config.yaml for a baseline example.

Execution Model

At runtime, the pipeline is assembled from the configuration in these stages:

Configuration loading and path resolution
Dataset discovery and normalization into the SLIPS schema
Batch selection through the configured mixer
Feature extraction and preprocessing
Incremental training or testing through the classifier wrapper
Metric calculation, artifact persistence, and plotting

This separation keeps data handling, preprocessing, model execution, and experiment tracking modular and replaceable.

Output

Experiment outputs are written under the configured experiment root, usually ./experiments/<experiment_name>/.

Common artifacts include:

effective runtime configuration snapshots
trained model files
saved preprocessing steps
training and testing logs
metric summaries
plots generated by the configured plotting scripts

Optuna runs additionally create an optuna/ subdirectory with trial-level metrics, trial configs, summaries, and visualization outputs.

For full Optuna details, see docs/OPTUNA.md.

Testing

Run the full test suite with:

pytest

There is also a testing guide in testing/README.md.

Profiling

To profile the pipeline with cProfile:

mkdir -p profiling
python -m cProfile -o profiling/profile.out run.py configs/default_config.yaml

To inspect the profile with snakeviz:

pip install snakeviz
snakeviz profiling/profile.out
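
If installing snakeviz is not an option, the same profile file can be read with the standard-library pstats module. The helper name top_functions and the default limit are illustrative choices, not part of this repository.

```python
import pstats


def top_functions(profile_path, limit=10):
    """Print the functions with the highest cumulative time in a cProfile dump."""
    stats = pstats.Stats(profile_path)
    # Shorten file paths, order by cumulative time, and print the top entries.
    stats.strip_dirs().sort_stats("cumulative").print_stats(limit)
    return stats
```

For the profile written above, the call would be top_functions("profiling/profile.out").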

Development

For local quality checks:

pre-commit install
pre-commit run --all-files
pytest

If secret scanning is part of your workflow:

pip install detect-secrets
detect-secrets scan > .secrets.baseline

Contributions should keep tests passing and preserve compatibility with the existing configuration-driven pipeline structure.

Extending the Pipeline

Add a New Model

Implement or extend a wrapper in src/classifier_wrapper.py
Ensure the wrapper exposes the expected training, prediction, save, and load behavior
Register the wrapper or classifier in src/class_factory.py if needed
Point the model section of the config to the selected wrapper and classifier

Add a New Preprocessing Step

Use an sklearn-style transformer that implements fit (or partial_fit for incremental training) and transform
Add it through the preprocessing.steps config section
Saved preprocessing artifacts are written under the experiment output directory

Example:

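from sklearn.preprocessing import StandardScaler

# Register the scaler under the step name "scaler"
preprocessor.add_step("scaler", StandardScaler())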
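
For a custom step, the fit, partial_fit, and transform contract can be sketched without any dependencies. The class RunningStandardizer below is hypothetical and works on plain lists of rows; whether add_step accepts it unchanged depends on what the preprocessing wrapper requires, which this README does not specify.

```python
import math


class RunningStandardizer:
    """Sklearn-style step: per-column mean and variance updated batch by batch."""

    def __init__(self):
        self.n = 0
        self.mean = None
        self.m2 = None  # running sum of squared deviations per column

    def partial_fit(self, X, y=None):
        for row in X:
            if self.mean is None:
                self.mean = [0.0] * len(row)
                self.m2 = [0.0] * len(row)
            self.n += 1
            for j, value in enumerate(row):
                # Welford's online update for the mean and squared deviations.
                delta = value - self.mean[j]
                self.mean[j] += delta / self.n
                self.m2[j] += delta * (value - self.mean[j])
        return self

    def fit(self, X, y=None):
        # A full fit resets the statistics and then consumes X once.
        self.__init__()
        return self.partial_fit(X)

    def transform(self, X):
        # Population standard deviation; constant columns are scaled by 1.
        scale = [math.sqrt(m / self.n) or 1.0 for m in self.m2]
        return [[(v - mu) / s for v, mu, s in zip(row, self.mean, scale)] for row in X]
```

Because the statistics are accumulated incrementally, calling partial_fit on successive batches gives the same mean and variance as a single fit over all rows, which suits the incremental training described in the Execution Model section.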
