Supplementary machine learning training pipeline for cybersecurity models used by the SLIPS project from the Stratosphere Laboratory.
This repository is intentionally separate from SLIPS itself, but it is designed to produce models and preprocessing artifacts that remain compatible with SLIPS ML modules.
The pipeline trains and evaluates classifiers on labeled network-flow datasets, normalizes them to the SLIPS schema, extracts features, applies preprocessing, and stores reproducible experiment outputs. It also supports Optuna-based hyperparameter optimization for multi-objective model search.
Typical use cases:
- Train a baseline model on Zeek-style labeled flow datasets
- Evaluate trained models on held-out or unseen datasets
- Compare dataset mixing strategies such as sequential, random, balanced, and oversampled training
- Run Optuna studies to search classifier and preprocessing settings
- End-to-end training and testing driven by YAML configuration
- Dataset normalization from Zeek connection logs to a SLIPS-compatible schema
- Modular feature extraction, preprocessing, and classifier wrappers
- Support for both scikit-learn and river-based models
- Experiment versioning through per-run output folders and effective config snapshots
- Built-in plotting utilities for training, testing, and Optuna studies
- run.py: entry point for standard runs and Optuna studies
- configs/: example pipeline configurations
- src/pipeline.py: runtime orchestration of the full pipeline
- src/dataset_wrapper.py: dataset discovery, loading, and caching
- src/conn_normalizer.py: Zeek-to-SLIPS normalization
- src/features.py: feature extraction
- src/preprocessing_wrapper.py: preprocessing pipeline assembly and persistence
- src/classifier_wrapper.py: model wrapper abstraction
- src/data_selectors.py: dataset mixing and batch selection strategies
- src/plot_utils/: plotting scripts and Optuna visualization helpers
- docs/OPTUNA.md: detailed Optuna usage and output guide
- testing/: automated test suite
- Python 3
- pip and a virtual environment tool recommended
- Labeled datasets in the format expected by the configured dataset loader
Install dependencies:
pip install -r requirements.txt
Main Python dependencies include numpy, pandas, scikit-learn, river, PyYAML, optuna, and matplotlib.
By default, the pipeline expects datasets under ./datasets.
You can:
- Use your own datasets if they match the expected structure
- Adapt feature extraction and loading logic for custom formats
- Use Stratosphere's labeled security datasets: https://github.com/stratosphereips/security-datasets-for-testing
To keep a consistent local path while storing the actual data elsewhere, use a symlink:
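ln -s /absolute/path/to/your/datasets ./datasets
Do not create ./datasets beforehand: if it already exists as a directory, ln places the link inside it instead of at ./datasets.
If you need the resolved absolute path later, for example when mounting the data into Docker, use: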
readlink -f datasets
Run the default configuration:
python run.py configs/default_config.yaml
Run an Optuna study:
python run.py configs/optuna_conf.yaml --optuna
Build the image:
docker build -t slips-pipeline:latest .
Run with Docker Compose:
docker compose run --rm pipeline python run.py configs/default_config.yaml
docker compose run --rm pipeline python run.py configs/optuna_conf.yaml --optuna
If ./datasets is a symlink, mount the resolved target so the container sees the same data path:
DATA_ROOT=$(readlink -f datasets)
docker compose run --rm \
-v "$DATA_ROOT":"$DATA_ROOT" \
pipeline \
python run.py configs/default_config.yaml
Current resource defaults in docker-compose.yml:
- cpus: 4
- mem_reservation: 16g
- mem_limit: 20g
- memswap_limit: 32g
Pipeline behavior is defined in YAML configuration files under configs/.
Core sections include:
- experiment_name: base name for the run output directory
- root: dataset root, typically ./datasets
- classes: label set used during training and evaluation
- dataset_loader: file discovery, caching, and parsing settings
- features: feature extraction behavior
- preprocessing: ordered sklearn-style preprocessing steps
- model: wrapper type, classifier type, and model parameters
- commands: ordered runtime commands such as train and test
When a run starts, the pipeline creates a unique experiment directory and stores:
- config_effective.yaml
- config_effective.json
- model and preprocessing artifacts
- logs, metrics, and plots
See configs/default_config.yaml for a baseline example.
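The run-directory and snapshot behavior described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the pipeline's actual code: the function names `create_experiment_dir` and `snapshot_config`, the timestamp-based naming scheme, and the example config values are all assumptions made for the example.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def create_experiment_dir(root: Path, experiment_name: str) -> Path:
    """Create a unique run directory under root (naming scheme is hypothetical)."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
    candidate = root / f"{experiment_name}_{stamp}"
    suffix = 1
    # Append a counter if a run with the same timestamp already exists.
    while candidate.exists():
        candidate = root / f"{experiment_name}_{stamp}_{suffix}"
        suffix += 1
    candidate.mkdir(parents=True)
    return candidate


def snapshot_config(run_dir: Path, config: dict) -> Path:
    """Write the effective configuration next to the run's other artifacts."""
    path = run_dir / "config_effective.json"
    path.write_text(json.dumps(config, indent=2, sort_keys=True))
    return path


# Illustrative values only; the real keys live in configs/default_config.yaml.
example_config = {
    "experiment_name": "baseline",
    "root": "./datasets",
    "classes": ["Benign", "Malicious"],
    "commands": ["train", "test"],
}

with tempfile.TemporaryDirectory() as tmp:
    first = create_experiment_dir(Path(tmp), example_config["experiment_name"])
    second = create_experiment_dir(Path(tmp), example_config["experiment_name"])
    saved = snapshot_config(first, example_config)
    restored = json.loads(saved.read_text())
```

Storing the snapshot inside the run directory means a past run can be reproduced from its own folder, even if the file under configs/ has changed since.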
At runtime, the pipeline is assembled from the configuration in these stages:
- Configuration loading and path resolution
- Dataset discovery and normalization into the SLIPS schema
- Batch selection through the configured mixer
- Feature extraction and preprocessing
- Incremental training or testing through the classifier wrapper
- Metric calculation, artifact persistence, and plotting
This separation keeps data handling, preprocessing, model execution, and experiment tracking modular and replaceable.
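To make the batch-selection stage more concrete, here is a small sketch of how sequential, random, balanced, and oversampled mixing could differ in the order they feed batches to training. The function names and the exact semantics of each strategy are assumptions for illustration; the real strategies live in src/data_selectors.py and may behave differently.

```python
import random
from itertools import chain, zip_longest
from typing import Iterator, Sequence

Batches = Sequence[str]  # each dataset is modeled as an ordered list of batch ids


def sequential(datasets: Sequence[Batches]) -> Iterator[str]:
    """Yield every batch of the first dataset, then the second, and so on."""
    return chain.from_iterable(datasets)


def random_order(datasets: Sequence[Batches], seed: int) -> Iterator[str]:
    """Shuffle all batches together; a fixed seed keeps runs reproducible."""
    batches = list(chain.from_iterable(datasets))
    random.Random(seed).shuffle(batches)
    return iter(batches)


def balanced(datasets: Sequence[Batches]) -> Iterator[str]:
    """Round-robin across datasets, dropping each one as it runs out."""
    sentinel = object()
    for group in zip_longest(*datasets, fillvalue=sentinel):
        for batch in group:
            if batch is not sentinel:
                yield batch


def oversampled(datasets: Sequence[Batches]) -> Iterator[str]:
    """Round-robin, repeating shorter datasets until the longest is exhausted."""
    longest = max((len(d) for d in datasets), default=0)
    for i in range(longest):
        for d in datasets:
            if d:  # skip empty datasets
                yield d[i % len(d)]


datasets = [["a1", "a2", "a3"], ["b1"]]
seq = list(sequential(datasets))
bal = list(balanced(datasets))
over = list(oversampled(datasets))
rnd = list(random_order(datasets, seed=0))
```

With these toy inputs, the sequential order finishes dataset `a` before touching `b`, the balanced order interleaves them, and the oversampled order repeats `b1` so both datasets contribute equally often.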
Experiment outputs are written under the configured experiment root, usually ./experiments/<experiment_name>/.
Common artifacts include:
- effective runtime configuration snapshots
- trained model files
- saved preprocessing steps
- training and testing logs
- metric summaries
- plots generated by the configured plotting scripts
Optuna runs additionally create an optuna/ subdirectory with trial-level metrics, trial configs, summaries, and visualization outputs.
For full Optuna details, see docs/OPTUNA.md.
Run the full test suite with:
pytest
There is also a testing guide in testing/README.md.
To profile the pipeline with cProfile:
mkdir -p profiling
python -m cProfile -o profiling/profile.out run.py configs/default_config.yaml
To inspect the profile with snakeviz:
pip install snakeviz
snakeviz profiling/profile.out
For local quality checks:
pre-commit install
pre-commit run --all-files
pytest
If secret scanning is part of your workflow:
pip install detect-secrets
detect-secrets scan > .secrets.baseline
Contributions should keep tests passing and preserve compatibility with the existing configuration-driven pipeline structure.
- Implement or extend a wrapper in src/classifier_wrapper.py
- Ensure the wrapper exposes the expected training, prediction, save, and load behavior
- Register the wrapper or classifier in src/class_factory.py if needed
- Point the model section of the config to the selected wrapper and classifier
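As a rough idea of what a wrapper with training, prediction, save, and load behavior might look like, the sketch below implements a deliberately trivial majority-class model. The class name `MajorityClassWrapper` and the method names `partial_fit`, `predict`, `save`, and `load` are assumptions chosen for the example; check src/classifier_wrapper.py for the interface the pipeline actually expects.

```python
import pickle
import tempfile
from collections import Counter
from pathlib import Path
from typing import Hashable, List, Sequence


class MajorityClassWrapper:
    """Hypothetical wrapper: predicts the most frequent label seen so far."""

    def __init__(self) -> None:
        self.label_counts: Counter = Counter()

    def partial_fit(self, X: Sequence, y: Sequence[Hashable]) -> "MajorityClassWrapper":
        # Features are ignored; only label counts are updated, so the model
        # can be trained incrementally, one batch at a time.
        self.label_counts.update(y)
        return self

    def predict(self, X: Sequence) -> List[Hashable]:
        if not self.label_counts:
            raise RuntimeError("call partial_fit before predict")
        majority = self.label_counts.most_common(1)[0][0]
        return [majority for _ in X]

    def save(self, path: Path) -> None:
        # Persist only the learned state, not the wrapper object itself.
        with open(path, "wb") as handle:
            pickle.dump(self.label_counts, handle)

    @classmethod
    def load(cls, path: Path) -> "MajorityClassWrapper":
        wrapper = cls()
        with open(path, "rb") as handle:
            wrapper.label_counts = pickle.load(handle)
        return wrapper


model = MajorityClassWrapper()
model.partial_fit([[0.1], [0.2], [0.3]], ["Benign", "Malicious", "Malicious"])

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.pkl"
    model.save(model_path)
    reloaded = MajorityClassWrapper.load(model_path)

predictions = reloaded.predict([[0.5], [0.9]])
```

A save and load round trip like this is a cheap first test for any new wrapper, since a reloaded model should predict exactly as the original did.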
- Use an sklearn-style transformer with fit or partial_fit and transform
- Add it through the preprocessing.steps config section
- Saved preprocessing artifacts are written under the experiment output directory
Example:
from sklearn.preprocessing import StandardScaler
preprocessor.add_step("scaler", StandardScaler())
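For a custom step, a transformer only needs the sklearn-style methods mentioned above. The following is a minimal sketch of an incremental min-max scaler written in plain Python; the class name `RunningMinMaxScaler` is hypothetical, and it assumes the pipeline relies only on `partial_fit` and `transform` being present. It also works on lists of rows rather than arrays, which the real preprocessing wrapper may not accept.

```python
from typing import List, Optional, Sequence

Row = Sequence[float]


class RunningMinMaxScaler:
    """Hypothetical transformer that tracks per-column min and max across batches."""

    def __init__(self) -> None:
        self.mins: Optional[List[float]] = None
        self.maxs: Optional[List[float]] = None

    def partial_fit(self, X: Sequence[Row], y=None) -> "RunningMinMaxScaler":
        for row in X:
            if self.mins is None or self.maxs is None:
                # First row seen: it defines the initial range.
                self.mins = list(row)
                self.maxs = list(row)
            else:
                self.mins = [min(m, v) for m, v in zip(self.mins, row)]
                self.maxs = [max(m, v) for m, v in zip(self.maxs, row)]
        return self

    def transform(self, X: Sequence[Row]) -> List[List[float]]:
        if self.mins is None or self.maxs is None:
            raise RuntimeError("call partial_fit before transform")
        scaled = []
        for row in X:
            scaled.append([
                # Constant columns map to 0.0 to avoid dividing by zero.
                (v - lo) / (hi - lo) if hi > lo else 0.0
                for v, lo, hi in zip(row, self.mins, self.maxs)
            ])
        return scaled


scaler = RunningMinMaxScaler()
scaler.partial_fit([[0.0, 10.0], [4.0, 10.0]])  # first batch
scaler.partial_fit([[2.0, 30.0]])               # second batch widens column 2
scaled = scaler.transform([[1.0, 20.0]])
```

Under those assumptions, such a step could be added in the same way as the scaler in the example above, for instance `preprocessor.add_step("minmax", RunningMinMaxScaler())`.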