Distributed GPU-accelerated MetaTrader 5 strategy backtester and parameter optimizer.
MetaTrader 5's built-in strategy tester runs one parameter combination per CPU
thread. gpu-mt5-bt maps each combination to its own CUDA thread, then shards
the entire parameter grid across a Ray cluster of GPU workers — so optimization
sweeps that take MT5 hours finish in seconds. Results are streamed to Parquet
as shards complete; crashes can be resumed.
Strategies are written in Python, not MQL5. There is no transpilation: you author each strategy as a
`@cuda.jit` kernel using the building blocks in `gpu_mt5_bt.kernels`. See Authoring a strategy below.
```
┌──────────────────┐
│   CLI (Typer)    │
│  gpu-mt5-bt …    │
└────────┬─────────┘
         │
┌────────▼─────────┐
│   Coordinator    │  loads data + grid,
│   (Ray driver)   │  shards work, aggregates results
└────────┬─────────┘
         │ Ray actors
┌───────────────────┼───────────────────┐
┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
│ GPU Worker  │ │ GPU Worker  │ │ GPU Worker  │ … N
│ (1 per GPU) │ │ (1 per GPU) │ │ (1 per GPU) │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
│ CUDA kernel │ │ CUDA kernel │ │ CUDA kernel │
│ N parallel  │ │ N parallel  │ │ N parallel  │
│ backtests   │ │ backtests   │ │ backtests   │
└─────────────┘ └─────────────┘ └─────────────┘
```
- The bar array lives once in each GPU's memory; every thread on that GPU reads the same shared bars.
- Each GPU thread owns one parameter combination and runs a complete sequential backtest. Strategies are not internally parallelized — that would corrupt sequential trade state.
- The coordinator only sees aggregate metrics, not per-bar data, so cross-node traffic stays small.
- Results stream to `results.parquet` as shards complete; a crashed run resumes from the last completed shard.
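The one-combo-per-thread mapping can be sketched on the CPU. This is an illustrative analogue, not the library's API; the function names and the toy MA-crossover logic here are hypothetical:

```python
import numpy as np

def backtest_one_combo(closes: np.ndarray, fast: int, slow: int) -> float:
    """Sequential backtest for ONE parameter combination -- the same work
    a single CUDA thread would do against the shared bar array."""
    balance, position, entry = 10_000.0, 0, 0.0
    for i in range(slow, len(closes)):
        fast_ma = closes[i - fast:i].mean()
        slow_ma = closes[i - slow:i].mean()
        if position == 0 and fast_ma > slow_ma:
            position, entry = 1, closes[i]            # open long
        elif position == 1 and fast_ma < slow_ma:
            balance += (closes[i] - entry) * 100_000  # close long (1 lot)
            position = 0
    return balance

def sweep(closes: np.ndarray, combos: list[tuple[int, int]]) -> np.ndarray:
    # On the GPU each tuple would map to one thread index; here we just loop.
    return np.array([backtest_one_combo(closes, f, s) for f, s in combos])
```

Note that each combination runs strictly sequentially over the bars; only the *combinations* are parallel, which is why trade state never needs synchronization.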
```bash
pip install -e ".[dev]"           # core + test deps
pip install -e ".[dev,mt5,nvml]"  # add MT5 live loader (Win-only) + GPU monitor
```

Numba CUDA needs an NVIDIA GPU and a recent CUDA toolkit (12.x is fine). On a machine without a GPU, kernels are skipped and the CPU reference path is used, so the test suite still passes; only the speedup is missing.
```bash
# Generate the included synthetic EURUSD H1 sample (one-shot).
python examples/generate_sample_data.py

# Run a 5,000-combo MA-crossover sweep on a local Ray cluster.
gpu-mt5-bt run examples/ma_crossover.yaml

# Render the HTML report from the latest run.
gpu-mt5-bt report runs/<latest>

# What strategies are registered?
gpu-mt5-bt strategies
```

`run` resolves the run directory to `runs/<UTC-timestamp>_<config-name>/` and writes:
```
runs/20260508T101530Z_ma_crossover/
├── config.yaml         # frozen copy of the input config
├── metadata.json       # symbol, timeframe, n_bars, n_combos, started_at
├── results.parquet     # one row per parameter combo
├── trades_top.parquet  # detailed trades for top-N combos per shard
├── _shards_done.txt    # checkpoint for `--resume`
├── report.html         # generated by `gpu-mt5-bt report`
└── logs/run.log
```
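A checkpoint file like `_shards_done.txt` makes crash recovery simple. The sketch below shows one way `--resume` could be driven from it; the helper names are hypothetical and the real implementation may differ:

```python
from pathlib import Path

def pending_shards(run_dir: Path, all_shards: list[int]) -> list[int]:
    """Return shard ids that still need to run.

    `_shards_done.txt` holds one completed shard id per line; on resume
    those shards are skipped."""
    done_file = run_dir / "_shards_done.txt"
    done: set[int] = set()
    if done_file.exists():
        done = {int(tok) for tok in done_file.read_text().split()}
    return [s for s in all_shards if s not in done]

def mark_done(run_dir: Path, shard_id: int) -> None:
    # Append-only writes keep the checkpoint valid even if the run crashes
    # mid-way: a shard id is only present once its results were written.
    with open(run_dir / "_shards_done.txt", "a") as f:
        f.write(f"{shard_id}\n")
```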
| Flag | Meaning |
|---|---|
| `--resume` | Pick up the most recent matching run dir and skip completed shards. |
| `--dry-run` | Print resolved config + grid size and exit (no execution). |
| `--local` | Force in-process execution (no Ray). Useful for debugging. |
| `--device gpu` / `cpu` | Force the device per worker. Default: auto. |
```bash
# Head node
gpu-mt5-bt cluster start --head --num-gpus 2 --num-cpus 8

# Each worker node
gpu-mt5-bt cluster start --address 10.0.0.1:6379 --num-gpus 4

# Driver (anywhere reachable)
# Set ray_address: 'ray://10.0.0.1:10001' in your config and run normally:
gpu-mt5-bt run my_sweep.yaml
```

`distributed.num_gpus_per_worker` in the config controls how many GPUs each Ray actor reserves. With one actor per GPU, each actor caches the bar array and compiled kernel between shards, so only the first shard pays the JIT cost.
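The caching behaviour can be illustrated with a plain class (hypothetical; the real worker is a Ray actor, and the Ray decorator is omitted here so the sketch runs anywhere):

```python
class GpuWorker:
    """Sketch of a per-GPU actor that amortizes one-off setup cost.

    In the real system the bars live in GPU memory and `_kernel` is a
    compiled CUDA kernel; here a counter stands in for the JIT cost."""

    def __init__(self, bars: list[float]):
        self.bars = bars        # uploaded once, reused by every shard
        self._kernel = None     # compiled lazily on the first shard
        self.compile_count = 0

    def _get_kernel(self):
        if self._kernel is None:
            self.compile_count += 1  # stands in for the one-off JIT compile
            self._kernel = lambda combos: [sum(self.bars) * f for f, _ in combos]
        return self._kernel

    def run_shard(self, combos: list[tuple[int, int]]) -> list[float]:
        return self._get_kernel()(combos)
```

Because the actor is long-lived, the second and later `run_shard` calls reuse both the cached bars and the cached kernel.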
Minimal example (`examples/ma_crossover.yaml`):

```yaml
strategy: ma_cross

data:
  source: csv
  path: examples/data/EURUSD_H1.csv
  symbol: EURUSD
  timeframe: H1
  start: 2018-01-01
  end: 2024-12-31

execution:
  starting_balance: 10000
  leverage: 100
  commission_per_lot: 7.0
  slippage_points: 1
  spread_mode: from_bars      # or fixed: 1.5
  stop_out_pct: 0.5
  triple_swap_wednesday: true

position_sizing:
  mode: fixed_lot             # or percent_risk / martingale
  lot: 0.1

optimization:
  fast_period:       { min: 5,   max: 50,  step: 1 }
  slow_period:       { min: 20,  max: 200, step: 1 }
  trailing_stop_atr: { min: 1.0, max: 5.0, step: 0.5 }

distributed:
  ray_address: auto           # 'auto' | 'local' | 'ray://host:10001'
  chunk_size: 10000
  num_gpus_per_worker: 1

output:
  metric_to_optimize: sharpe  # final_equity | sharpe | profit_factor | calmar | sortino
  keep_top_n_trades: 50
```

Every field is validated by Pydantic; typos and missing required fields fail fast with a precise error.
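Under the hood, the `optimization:` block expands into a Cartesian grid that is then cut into `chunk_size` shards. A minimal sketch of that expansion, with hypothetical helper names:

```python
from itertools import product

def expand_grid(spec: dict) -> list[dict]:
    """Expand {name: {min, max, step}} ranges into the full Cartesian grid,
    mirroring the `optimization:` block (illustrative helper)."""
    axes = []
    for name, r in spec.items():
        values, v = [], r["min"]
        while v <= r["max"] + 1e-9:          # inclusive of `max`
            values.append((name, round(v, 10)))
            v += r["step"]
        axes.append(values)
    return [dict(combo) for combo in product(*axes)]

def shard(grid: list, chunk_size: int) -> list[list]:
    # `distributed.chunk_size` combos per shard; the last shard may be short.
    return [grid[i:i + chunk_size] for i in range(0, len(grid), chunk_size)]
```

With the example config above, `fast_period` contributes 46 values and `slow_period` 181, so the grid (before the trailing-stop axis) already has 46 × 181 = 8,326 combinations.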
A strategy is a Python class registered into a global registry. Each strategy must provide:

- a CPU-only reference implementation (`run_cpu`) used by tests and the CPU fallback path
- a Numba CUDA kernel (`build_kernel`) with signature `(bars, params, exec_cfg, out_metrics, out_trades, n_bars)`

The two implementations must agree numerically; the test suite enforces this on a fixed-seed synthetic series.
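The agreement check follows a common pattern: run both implementations on the same fixed-seed input and assert the outputs match. The sketch below uses two SMA implementations as stand-ins for the reference and fast paths (the function names are illustrative, not the library's API):

```python
import numpy as np

def sma_reference(x: np.ndarray, period: int) -> np.ndarray:
    """Scalar loop -- plays the role of the CPU reference path."""
    out = np.full_like(x, np.nan)
    for i in range(period - 1, len(x)):
        out[i] = x[i - period + 1:i + 1].sum() / period
    return out

def sma_vectorized(x: np.ndarray, period: int) -> np.ndarray:
    """Vectorized version -- plays the role of the 'fast' implementation."""
    out = np.full_like(x, np.nan)
    c = np.cumsum(x)
    out[period - 1] = c[period - 1] / period
    out[period:] = (c[period:] - c[:-period]) / period
    return out
```

A fixed seed keeps the comparison reproducible: `np.random.default_rng(seed)` yields the same series on every run, so a numerical divergence is always a real bug, never noise.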
The execution machinery exposes ready-made device functions you can call from your kernel:
| Helper | Purpose |
|---|---|
| `apply_spread_device` | Add/subtract spread to a fill price |
| `commission_device` | Per-lot commission |
| `fx_pnl_device` | P&L in account currency |
| `lot_for_combo_device` | Lot size given the configured sizing mode |
| `swap_for_bar_device` | Per-bar swap accrual (with Wed triple-swap) |
| `record_trade_device` | Write a trade row into the output buffer |
| `sma_at_device`, `ema_at_device`, `rsi_at_device`, `atr_at_device`, `donchian_at_device` | Indicators evaluated at one bar |
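The "indicator at one bar" shape matters on the GPU: each thread evaluates the indicator at the current bar instead of materializing a whole indicator array per combo. A CPU analogue of such a helper (the signature is an assumption, modelled on `sma_at_device` above):

```python
def sma_at(close: list[float], i: int, period: int) -> float:
    """SMA over the `period` bars ending at bar i (inclusive).

    CPU analogue of a device helper like `sma_at_device`: evaluating at
    one bar means each GPU thread needs only O(1) extra memory, rather
    than one full indicator array per parameter combination."""
    if i + 1 < period:
        return float("nan")  # not enough history yet
    total = 0.0
    for j in range(i - period + 1, i + 1):
        total += close[j]
    return total / period
```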
See `src/gpu_mt5_bt/strategies/ma_cross.py` for a complete worked example
(Python `_run_one_combo_cpu` + `@cuda.jit` `ma_cross_kernel`). Once authored,
register it from `strategies/__init__.py` and reference it by name in YAML.
The reference strategies that ship in-box:

| Name | Idea |
|---|---|
| `ma_cross` | Fast/slow SMA crossover with optional ATR trailing stop |
| `rsi_meanrev` | Buy on RSI cross-up through oversold, sell on cross-down through overbought |
| `donchian_breakout` | N-bar channel breakout entry, opposite-channel exit, ATR hard stop |
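As a plain-Python restatement of the `ma_cross` entry rule (illustrative only; the shipped kernel's exact tie-breaking and exit handling may differ):

```python
def ma_cross_signal(close: list[float], i: int, fast: int, slow: int) -> int:
    """+1 on a fast-over-slow cross at bar i, -1 on the opposite cross,
    0 otherwise."""
    def sma(end: int, period: int) -> float:
        return sum(close[end - period + 1:end + 1]) / period

    if i < slow:  # need full history for both MAs at bars i-1 and i
        return 0
    prev = sma(i - 1, fast) - sma(i - 1, slow)
    curr = sma(i, fast) - sma(i, slow)
    if prev <= 0 < curr:
        return +1
    if prev >= 0 > curr:
        return -1
    return 0
```

A "cross" is defined by the *sign change* of the fast-minus-slow spread between consecutive bars, which is why the signal needs both the previous and current bar.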
Export an MT5 strategy-tester report (right-click chart → Save As Report →
`.htm`) and run:

```bash
gpu-mt5-bt validate runs/<latest> path/to/StrategyTester.htm
```

The validator parses the summary fields and trade list out of the HTML, picks
the GPU run with the highest `final_equity` from `results.parquet`, and prints
a side-by-side diff. Acceptance: final equity within 0.1% (configurable
with `--tolerance`).
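The acceptance criterion reduces to a relative-difference check on final equity, sketched below (hypothetical helper; the real validator compares more fields):

```python
def within_tolerance(mt5_equity: float, gpu_equity: float,
                     tolerance: float = 0.001) -> bool:
    """True when the two final-equity figures agree within `tolerance`
    relative to the MT5 figure (0.1% by default, as --tolerance would)."""
    return abs(gpu_equity - mt5_equity) <= tolerance * abs(mt5_equity)
```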
```bash
pytest -v                                     # all tests
pytest -m "not gpu"                           # skip GPU-only tests
pytest tests/integration/test_distributed.py  # only Ray integration
```

GPU tests are auto-skipped when CUDA isn't available; same for Ray and MT5. Coverage target is ≥80% on non-kernel code, ≥60% overall.
MIT.