Canonical benchmark harness for the pure-Rust hyperdb-api crate and
the hyperdb-api-node bindings. The one benchmark everyone should run
first is the unified Rust suite at
hyperdb-api/benches/benchmark_suite.rs:
it measures sync and async insert / query paths against the same
schema in one run and emits a side-by-side comparison table.
For cross-language comparison there's also the Node.js bench at
hyperdb-api-node/__test__/benchmark.mjs
that uses the identical schema so its numbers go directly into the
same tables. The Rust-side specialized benches (benchmark.rs,
arrow_batching_benchmark.rs, grpc_benchmark_tests.rs,
async_parallel_benchmark.rs) are "dig-deeper" references for
specific questions.
All benchmarks share the same schema so numbers compare directly:
measurements(id INT NOT NULL, sensor_id INT, value DOUBLE, timestamp BIGINT)
-- 24 bytes / row
The shared primitives (ResourceStats, HostEnv, formatting,
deterministic row generators) live in
hyperdb-api/benches/common.rs and
are pulled into each bench via
#[path = "common.rs"] mod common;.
export HYPERD_PATH=/path/to/hyperd
# Default: 10M rows per workload, 4 parallel workers
cargo run -p hyperdb-api --release --example benchmark_suite
# Custom: (rows per workload) (parallel workers)
cargo run -p hyperdb-api --release --example benchmark_suite -- 100000000 4
# Switch transport without rebuilding (default = TCP):
# ipc | IPC | pipe -> Named Pipe (Windows) / Unix Domain Socket (Unix)
BENCH_TRANSPORT=ipc \
cargo run -p hyperdb-api --release --example benchmark_suite -- 100000000 4The banner under Configuration: prints Transport: Tcp or Transport: Ipc
so the saved benchmark_suite.md always records which transport produced
the numbers.
The suite prints a live log and, at the end, writes two artifacts
under test_results/:
benchmark_suite.md— markdown table identical to the one this doc shows for each platform.benchmark_suite.json— machine-readable version with full host and record fields for offline analysis.
| Workload | Flavor | Variant |
|---|---|---|
insert.bulk |
sync | Inserter (HyperBinary), 1 connection |
insert.bulk |
sync | ChunkSender × N, 1 connection, N threads |
insert.bulk |
async | AsyncArrowInserter, 1 connection |
insert.bulk |
async | AsyncArrowInserter × N, N connections |
insert.bulk |
async | spawn_blocking + ChunkSender × N, N connections |
query.full_scan |
sync / async / parallel async | SELECT id, sensor_id, value, timestamp FROM measurements |
query.filtered |
sync / async / parallel async | WHERE sensor_id = 5 |
query.aggregation |
sync / async / parallel async | GROUP BY sensor_id |
Parallel async queries run against the database populated by the parallel-Arrow insert (one table per worker), so the full-scan row count is N × rows-per-workload.
| Bench | What it isolates |
|---|---|
benchmark.rs |
Sync single- vs multi-threaded insert, per-query resource stats, IPC vs TCP, TCP vs gRPC on a synthesized 4-column schema |
arrow_batching_benchmark.rs |
Arrow IPC flush-threshold sweep (1 B / 16 MB / 64 MB), sync + async, IPC vs TCP |
grpc_benchmark_tests.rs |
gRPC SYNC / ADAPTIVE / ASYNC transfer modes at 10K / 100K / 100M row scales |
async_parallel_benchmark.rs |
Parallel AsyncArrowInserter vs spawn-blocking ChunkSender, parallel streaming queries in 3 shapes |
hyperdb-api-node/__test__/benchmark.mjs |
Node.js N-API bindings — eager / streaming / chunked / columnar / Arrow IPC variants |
Running any of these:
cargo run -p hyperdb-api --release --example benchmark [ROWS]
cargo run -p hyperdb-api --release --example arrow_batching_benchmark [ROWS]
cargo run -p hyperdb-api --release --example grpc_benchmark_tests
cargo run -p hyperdb-api --release --example async_parallel_benchmark [ROWS] [WORKERS]
cargo run -p hyperdb-api --release --example benchmark_suite [ROWS] [WORKERS]The Node.js bench exercises the hyperdb-api-node N-API bindings on
the same measurements schema, so its numbers go in the same
tables as the Rust suite. It additionally covers paths that only
exist in the Node API — eager executeQuery, streaming
executeQueryStream, executeQueryColumnar (Arrow-columnar fast
path), and executeQueryToArrow (full Arrow IPC roundtrip).
cd hyperdb-api-node
npm install # first time only
npm run build # builds hyperdb-api-node.<platform>.node
HYPERD_PATH=/path/to/hyperd node __test__/benchmark.mjs [ROWS]Default is 1M rows. 10M matches the Rust suite's default for cross-language comparison. 100M is feasible for insert but runs out of V8 heap on eager-materialization queries (this is a real characteristic of going through JS-object boundaries; use the Columnar or Arrow variants for large reads).
Results are filed per platform with hardware/software provenance so numbers can be compared apples-to-apples. Each platform section has three subsections:
- Rust suite at 100M rows per workload, 4 parallel workers.
- Node.js bench at 10M rows (larger scales OOM on the eager scan path; see note in that subsection).
- Rust vs Node.js head-to-head at 10M rows.
Contributions welcome for additional platforms — paste the
test_results/benchmark_suite.md output and the Node bench
summary table under the appropriate section and include the host
block from the suite's stdout.
Hardware / software
- OS: Darwin 26.4 (aarch64)
- CPU: Apple M3 Max (14 physical / 14 logical cores)
- Memory: 96.0 GB
- Rust: rustc 1.94.0 (4a4ef493e 2026-03-02)
- Node.js: v25.8.1 (for the hyperdb-api-node bench)
- hyperdb-api version: 0.1.0-rc.1
- hyperd: Release build on same host
- Date: 2026-05-02 (median of 5 post-warmup runs; TCP
SO_RCVBUF/SO_SNDBUFtuned to 4 MiB)
| Workload | Variant | Flavor | Rows | Time (s) | Rows/sec | MB/sec |
|---|---|---|---|---|---|---|
| insert.bulk | AsyncArrowInserter | async | 100.00M | 3.104 | 32.22 M/s | 737.4 |
| insert.bulk | AsyncArrowInserter × 4 | async | 100.00M | 1.667 | 59.99 M/s | 1373.0 |
| insert.bulk | ChunkSender × 4 | sync | 100.00M | 3.671 | 27.24 M/s | 623.5 |
| insert.bulk | Inserter (HyperBinary) | sync | 100.00M | 3.679 | 27.18 M/s | 622.2 |
| insert.bulk | spawn_blocking+ChunkSender × 4 | async | 100.00M | 2.177 | 45.94 M/s | 1051.6 |
| query.aggregation | 4 parallel connections | async | 40 | 0.109 | 365/s | 0.0 |
| query.aggregation | single connection | sync | 10 | 0.048 | 207/s | 0.0 |
| query.aggregation | single connection | async | 10 | 0.049 | 204/s | 0.0 |
| query.filtered | 4 parallel connections | async | 10.00M | 0.220 | 45.36 M/s | 519.1 |
| query.filtered | single connection | sync | 10.00M | 0.297 | 33.67 M/s | 385.3 |
| query.filtered | single connection | async | 10.00M | 0.374 | 26.74 M/s | 306.0 |
| query.full_scan | 4 parallel connections | async | 100.00M | 2.526 | 39.59 M/s | 906.1 |
| query.full_scan | single connection | sync | 100.00M | 5.317 | 18.81 M/s | 430.5 |
| query.full_scan | single connection | async | 100.00M | 5.384 | 18.57 M/s | 425.1 |
Headline takeaways (Rust, macOS / M3 Max):
- Async inserts beat sync inserts at every scale — parallel
AsyncArrowInserter × 4is the fastest path at 1373 MB/s aggregate (60 M rows/s), and even single-connectionAsyncArrowInserter(32 M rows/s) edges ahead of syncInserter(27 M rows/s). The sync paths themselves saw a ~9% improvement from the 4 MiBSO_RCVBUF/SO_SNDBUFtuning landed 2026-05. - Parallel queries scale well on full-scan — 4 connections reach 40 M rows/s / 906 MB/s, a 2.1× wall-clock speedup over the single-connection sync scan.
- The async vs sync gap closed on single-connection queries — post-tuning,
query.filtered asyncruns 27 M rows/s vs the sync path's 34 M rows/s, and full_scan is essentially tied (18.6 vs 18.8 M rows/s). The historical "async is slower single-connection" warning no longer holds at the single-consumer-filter scale. For concurrent workloads, async remains the clear win.
Run via HYPERD_PATH=… node __test__/benchmark.mjs 10000000. The
Columnar and Arrow IPC variants exist only in the Node API and
are the fastest ways to move data out of the JS boundary. The
eager-object path (executeQuery returning Row[]) is the only
one that OOMs at 100M rows under the default V8 heap — insert
succeeds at 100M (51 s, 2.0 M/s, 45 MB/s) but the subsequent
eager scan exhausts the heap. For large reads through
hyperdb-api-node, always use executeQueryColumnar or
executeQueryToArrow.
Numbers below are post-rewrite (all async-native, Inserter renamed
to RowInserter, new ArrowInserter class).
| Workload | Variant | Rows | Time (s) | Rows/sec | MB/sec |
|---|---|---|---|---|---|
| insert.bulk | RowInserter (COPY, row API) | 10.00M | 5.330 | 1.88 M/s | 42.9 |
| insert.bulk | ArrowInserter (COPY, Arrow IPC) | 10.00M | 0.379 | 26.4 M/s | 603.9 |
| query.full_scan | executeQuery (eager, 1M only) | 1.00M | 0.678 | 1.47 M/s | 33.8 |
| query.full_scan | executeQueryStream (1M only) | 1.00M | 0.801 | 1.25 M/s | 28.6 |
| query.full_scan | executeQueryStream (chunked, 1M only) | 1.00M | 0.874 | 1.15 M/s | 26.2 |
| query.full_scan | executeQueryColumnar | 1.00M | 0.102 | 9.80 M/s | 223.7 |
| query.full_scan | executeQueryToArrow | 1.00M | 0.050 | 20.0 M/s | 458.7 |
| query.filtered | executeQueryStream (sensor_id=5) | 100K | 0.114 | 878 K/s | 20.1 |
| query.filtered | executeQueryColumnar | 100K | 0.015 | 6.80 M/s | 156.3 |
| query.filtered | executeQueryToArrow | 100K | 0.005 | 20.6 M/s | 471.7 |
| query.aggregation | GROUP BY sensor_id | 10 | 0.003 | 320 M/s | — |
Same schema, same dataset shape. Post-rewrite, the Node bindings
now have parity with Rust on the Arrow-ingest path because
ArrowInserter moves bytes directly into Hyper without any
per-row JS↔Rust conversion.
| Workload | Rust (best) | Node (best) | Rust factor |
|---|---|---|---|
| insert.bulk | AsyncArrowInserter × 4 — 32.5 M/s / 745 MB/s | ArrowInserter — 26.4 M/s / 603.9 MB/s | 1.2× |
| insert.bulk (row API) | sync Inserter — ~25 M/s (native) | RowInserter — 1.9 M/s | ~13× (CPU-bound JS encode) |
| query.full_scan | async × 4 — 36.3 M/s / 831 MB/s | executeQueryToArrow — 20.0 M/s / 458.7 MB/s | 1.8× |
| query.filtered | sync — 25.6 M/s / 292.8 MB/s | executeQueryToArrow — 20.6 M/s / 471.7 MB/s | 1.2× |
| query.aggregation | sync — 1.46 K/s | GROUP BY — 320 M/s | — (server-side; both latency-bound) |
Reading: on the Arrow-IPC path Node is within ~20% of native
Rust — the remaining gap is the IPC serialization cost in JS. On
the row-by-row API Rust is still ~13× faster because it can skip
the JS object materialization entirely. The pre-rewrite 16× insert
gap closes to ~1.2× once callers opt into ArrowInserter.
Hardware / software (placeholder — replace with host block from your suite run)
- OS: (e.g. Ubuntu 24.04)
- CPU:
- Memory:
- Rust:
- Node.js:
- hyperdb-api version:
- hyperd:
- Date:
Paste the contents of test_results/benchmark_suite.md here after running the suite on Linux. Keep the same column order so the section renders identically across platforms.
Paste the SUMMARY block from node __test__/benchmark.mjs 10000000. See the macOS subsection for the target table shape.
Fill in once both Rust (at 10M) and Node (at 10M) numbers are captured.
Hardware / software
- OS: Windows 11 (build 26100) (x86_64)
- CPU: Intel(R) Core(TM) i9-10980XE @ 3.00 GHz (18 physical / 36 logical cores)
- Memory: 127.8 GB
- Rust: rustc 1.92.0 (ded5c06cf 2025-12-08)
- Node.js: not yet captured
- hyperdb-api version: 0.1.0-rc.1
- hyperd: Release build pinned via
hyperdb-bootstrap - Date: 2026-05-02
| Workload | Variant | Flavor | Rows | Time (s) | Rows/sec | MB/sec |
|---|---|---|---|---|---|---|
| insert.bulk | AsyncArrowInserter | async | 100.00M | 18.563 | 5.39 M/s | 123.3 |
| insert.bulk | AsyncArrowInserter × 4 | async | 100.00M | 4.931 | 20.28 M/s | 464.1 |
| insert.bulk | ChunkSender × 4 | sync | 100.00M | 23.255 | 4.30 M/s | 98.4 |
| insert.bulk | Inserter (HyperBinary) | sync | 100.00M | 22.716 | 4.40 M/s | 100.8 |
| insert.bulk | spawn_blocking+ChunkSender × 4 | async | 100.00M | 4.778 | 20.93 M/s | 479.0 |
| query.aggregation | 4 parallel connections | async | 40 | 0.367 | 109/s | 0.0 |
| query.aggregation | single connection | sync | 10 | 0.180 | 56/s | 0.0 |
| query.aggregation | single connection | async | 10 | 0.179 | 56/s | 0.0 |
| query.filtered | 4 parallel connections | async | 10.00M | 0.611 | 16.37 M/s | 187.3 |
| query.filtered | single connection | sync | 10.00M | 1.263 | 7.92 M/s | 90.6 |
| query.filtered | single connection | async | 10.00M | 1.443 | 6.93 M/s | 79.3 |
| query.full_scan | 4 parallel connections | async | 100.00M | 6.003 | 16.66 M/s | 381.3 |
| query.full_scan | single connection | sync | 100.00M | 14.124 | 7.08 M/s | 162.1 |
| query.full_scan | single connection | async | 100.00M | 16.178 | 6.18 M/s | 141.5 |
Headline takeaways (Rust, native Windows / i9-10980XE):
- Parallel async inserts are the throughput-dominant path —
spawn_blocking + ChunkSender × 4reaches 20.9 M rows/s / 479 MB/s, ~2× faster than sync inserts and within ~30% of the TCP loopback ceiling on this box. The 4-way parallel insert numbers are roughly on par with macOS / M3 Max in absolute throughput, suggesting hyperd's ingest path is not the bottleneck here. - Single-connection sync query went from 2.89 M/s (pre-2026-05 tuning) to 7.08 M/s — a 2.5× improvement — after the read-window + TCP-buffer changes documented below.
- Single-connection sync inserts on Windows lag native Linux/macOS by ~5× even after tuning. This is a residual
hyperd-side gap; the parallel paths hide it because they exercise multiple ingest threads.
Run with BENCH_TRANSPORT=ipc to switch the data path from TCP loopback to a
Windows Named Pipe. Both transports go through the same hyperdb-api API; only
the wire underneath changes.
| Workload | Variant | TCP rows/s | IPC rows/s | Δ |
|---|---|---|---|---|
| insert.bulk | sync Inserter (HyperBinary) | 4.40 | 6.16 | +40% |
| insert.bulk | sync ChunkSender × 4 | 4.30 | 6.06 | +41% |
| insert.bulk | async AsyncArrowInserter | 5.39 | 7.24 | +34% |
| insert.bulk | async AsyncArrowInserter × 4 | 20.28 | 19.84 | -2% |
| insert.bulk | async spawn_blocking+ChunkSender × 4 | 20.93 | 20.81 | -1% |
| query.full_scan | sync | 7.08 | 5.02 | -29% |
| query.full_scan | async | 6.18 | 1.46 | -76% |
| query.full_scan | async × 4 | 16.66 | 4.96 | -70% |
| query.filtered | sync | 7.92 | 7.09 | -10% |
| query.filtered | async | 6.93 | 3.92 | -43% |
| query.filtered | async × 4 | 16.37 | 7.80 | -52% |
Reading: Named Pipe wins single-connection write-heavy paths by 34–41%
but catastrophically regresses every read-heavy path — especially async
(query.full_scan async drops 76%). The asymmetry localizes to tokio's
NamedPipeClient::poll_read: each completion-port wake-up appears to
deliver substantially less data than the corresponding WSARecv wake-up
on a TCP socket, multiplying per-poll overhead on long streamed reads.
Recommendation: keep TransportMode::Tcp (the workspace default) for
mixed workloads. Opt into TransportMode::Ipc only for insert-dominant
Windows pipelines that don't stream large query results back through the
same process.
Not yet captured on native Windows. Run via npm install && npm run build && node __test__/benchmark.mjs 10000000 from hyperdb-api-node/ and paste the SUMMARY block here.
Fill in once both Rust (at 10M) and Node (at 10M) numbers are captured.
Hardware / software (placeholder)
- OS: (e.g. Ubuntu 22.04 under WSL2)
- CPU:
- Memory:
- Rust:
- Node.js:
- hyperdb-api version:
- hyperd:
- Date:
Paste the contents of test_results/benchmark_suite.md here after running the suite under WSL2. WSL2 numbers should land near native Linux — see the Windows notes below for context.
Paste the SUMMARY block from node __test__/benchmark.mjs 10000000.
Fill in once both Rust and Node numbers are captured.
Windows native I/O against hyperd historically ran roughly 6× slower
than macOS / Linux on the streaming-query paths. A 2026-05 client-side
tuning pass closed that gap to roughly 2.5–3.3× on sync full-scan
queries; the residual is hyperd-side.
If you're benchmarking on Windows:
- For performance comparison: WSL2 still runs faster because hyperd's internal hot paths perform better under Linux. Expect Linux-like numbers there.
- For Windows-native validation: run the suite directly under PowerShell / cmd; the numbers in the Windows native section above are the current state of the art post-tuning.
Four client-side optimizations on the loopback TCP data path, all in
hyperdb-api-core::client:
| Change | File / function | Why |
|---|---|---|
| Read syscall window 8 KB → 64 KB | connection.rs::RawConnection::read_message, async_connection.rs::AsyncRawConnection::read_message |
Each WSARecv on Windows is several times more expensive than its recv counterpart on Linux/macOS. The default 8 KB stack-buffer ceiling caused 8× syscall amplification on long streamed reads. |
Read directly into BytesMut spare capacity |
same as above | Removes the temporary stack buffer + extend_from_slice memcpy. Safe Rust via resize + truncate. |
SO_RCVBUF / SO_SNDBUF 64 KB → 4 MiB |
client.rs::Client::connect, async_client.rs::AsyncClient::connect |
Windows defaults to ~64 KiB TCP buffers, which clamps the receive window so hyperd blocks on send() once the kernel buffer fills. Linux auto-tunes much higher. Empirical sweep found 4 MiB is the throughput knee — 8 MiB regresses sync inserts ~18% from extra memory pressure. |
Initial BytesMut capacity 8 KB → 64 KB |
connection.rs, async_connection.rs (struct ctor) |
Avoids early reallocation churn during the first batch of messages. |
On the same i9-10980XE / Windows 11 host the four changes together took
the single-connection sync query.full_scan from 2.89 to 7.08 M/s
(+145%) and the 4-connection parallel scan from 9.31 to 16.66 M/s
(+79%). Inserts are unchanged within noise because their bottleneck is
hyperd's ingest CPU, not the wire.
The 4 MiB SO_RCVBUF / SO_SNDBUF setting is empirically the right
shape on Windows, where the kernel default is tiny. It should be at
worst neutral on macOS / Linux because their auto-tuning kernels treat
setsockopt as an upper bound, not a forced size, and our request is
large enough not to clamp legitimate windows. But this hasn't been
benchmarked end-to-end on those platforms post-tuning.
If you're investigating from macOS or Linux: please run the suite at the
same 100M / 4-worker scale on main and confirm the numbers above the
Windows section haven't shifted. If they have, the most likely cause is
the 4 MiB setsockopt clamping somewhere it shouldn't — search
set_recv_buffer_size / set_send_buffer_size in
hyperdb-api-core/src/client/{client,async_client}.rs and consider gating those
lines on #[cfg(target_os = "windows")] if a regression appears.
A companion bench helper exists for transport A/B without rebuilding:
BENCH_TRANSPORT=tcp ./target/release/examples/benchmark_suite 100000000 4
BENCH_TRANSPORT=ipc ./target/release/examples/benchmark_suite 100000000 4On Unix, BENCH_TRANSPORT=ipc switches to a Unix Domain Socket; on
Windows it switches to a Named Pipe. The Named Pipe results above show
that IPC is a write-only win on Windows, but on Linux UDS may net
out differently — worth measuring.
- Build in release mode:
cargo build --release -p hyperdb-api --example benchmark_suite - Run the suite:
HYPERD_PATH=/path/to/hyperd \ ./target/release/examples/benchmark_suite 100000000 4
- Copy-paste:
- The
Host:block from stdout into the platform section as the hardware/software block. - The
| Workload | … |markdown table at the end of stdout into the results block.
- The
- Commit both the doc update and the JSON artifact (
test_results/benchmark_suite.json) so future runs can diff against yours.
- Scale: 100M rows is the default for the comparison tables; smaller scales (< 10M) don't give the parallel variants enough work to amortize task-spawn overhead.
- Workers: 4 is the default because it matches typical disk / NIC
parallelism on a developer machine. Scale up to
num_cpusfor peak aggregate throughput on servers with NVMe / 10 GbE. - Release mode: always. Debug mode is 5–10× slower and the difference is not a linear factor across the matrix, so relative comparisons become meaningless.
- DEVELOPMENT.md — workspace architecture, build instructions, and pointers to crate-level dev guides.
- The suite uses
CreateAndReplaceper bench, so it leaves the DB files behind undertest_results/for postmortem. They're gitignored. - Parallel-async insert variants use N independent tables
(
measurements_0…measurements_N-1) so no connection contends on the same table. Parallel-async queries run against those same tables, one per worker. - Every row is deterministic:
id = start + i,sensor_id = id % 10,value = id * 0.1,timestamp = 1_700_000_000_000 + id * 1000. Two runs of the suite against the same hyperd produce byte-identical.hyperfiles. HyperProcessis created once and shared by all benchmarks in a single suite run. Drop order is explicit at the end ofmainso the tokio runtime terminates beforehyperdto avoid shutdown races.