hailo: bench fingerprint label + StatsResponse npu_pool_size + ADR refresh (iter 256-257) #420

Merged
ruvnet merged 2 commits into main from hailo-bench-fingerprint-label
May 4, 2026
@ruvnet ruvnet commented May 4, 2026

Summary

Two iterations adding cluster-side observability for per-model + per-pool measurements, plus refreshing ADR-176/178 to record the iter-234..257 hailo work.

What ships

iter-256 — bench --prom fingerprint label

Bench's textfile-collector output carried only concurrency as a label, so a Prometheus alert grouping by series couldn't tell a genuine throughput regression apart from a model swap. Now every metric carries concurrency="N",fingerprint="<hex>". Empty fingerprint (--allow-empty-fingerprint) renders as fingerprint="" rather than getting dropped, so the label set stays scrape-stable.

ruvector_hailo_bench_throughput_per_second{concurrency="2",fingerprint="9c56e596...3aeb..."} 70.712

avg by (fingerprint) (...) now gives one series per model: fingerprint changes are deploy events the operator already knew about, not noise.
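The label rendering described above can be sketched roughly as follows. This is a minimal illustration, not the actual `write_prom_textfile` code: `render_sample` is a hypothetical helper, and it assumes the fingerprint is plain hex, so no label-value escaping is applied.

```rust
/// Hypothetical helper (not the real `write_prom_textfile`): renders one
/// sample line in the Prometheus text exposition format.
///
/// The `fingerprint` label is always emitted, even when the value is empty,
/// so every scrape sees the same label names.
/// Assumption: `fingerprint` is hex, so it contains nothing to escape.
fn render_sample(name: &str, concurrency: u32, fingerprint: &str, value: f64) -> String {
    // `{{` and `}}` are literal braces in a format string.
    format!(
        "{}{{concurrency=\"{}\",fingerprint=\"{}\"}} {}",
        name, concurrency, fingerprint, value
    )
}
```

An empty fingerprint yields `fingerprint=""` instead of omitting the label, which is the scrape-stability property the PR describes.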

iter-257 — StatsResponse.npu_pool_size + ADR refresh

Backward-compatible proto3 add: uint32 npu_pool_size = 10 on StatsResponse. Old workers send 0 (proto3 default → "unknown / pre-iter-257"); new workers send the resolved value. Wired through worker → transport StatsSnapshot → grpc_transport.
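A client might interpret the wire value along these lines. This is a sketch under assumptions: the function names are hypothetical, and the `pool=N` display string is only illustrative. The point is that a proto3 scalar without explicit presence reads as 0 when an old worker never set it, so 0 has to be treated as "unknown" rather than as a real pool size.

```rust
/// Hypothetical helper: maps the raw `npu_pool_size` wire value to an Option.
/// proto3 scalars default to 0 when the sender did not set the field, so 0
/// cannot be distinguished from "field absent" and is treated as unknown.
fn pool_size_from_wire(raw: u32) -> Option<u32> {
    if raw == 0 {
        None
    } else {
        Some(raw)
    }
}

/// Hypothetical renderer for operator-facing output.
fn render_pool_size(raw: u32) -> String {
    match pool_size_from_wire(raw) {
        Some(n) => format!("pool={}", n),
        None => "unknown / pre-iter-257".to_string(),
    }
}
```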

ADR refresh:

  • ADR-176 (HEF EPIC): P6 row covering iter 234-237 pool measurement work + iter 256-257 observability
  • ADR-178 (gap analysis): status flipped from Proposed → Closed with 8-row per-gap remediation table

Test plan

  • cargo check -p ruvector-hailo-cluster --bins clean
  • cargo test -p ruvector-hailo-cluster --lib (114 passed)
  • cargo test -p ruvector-hailo-cluster --test bench_cli (6 passed) — locks the new fingerprint label
  • Backward-compatible proto3 field add: pre-iter-257 worker sends 0, new client renders as "unknown"

🤖 Generated with claude-flow

ruvnet and others added 2 commits May 4, 2026 10:37
Bench's textfile-collector output carried only `concurrency` as a
label, so a Prometheus alert grouping by series couldn't tell a
genuine throughput regression apart from a model swap. The
fingerprint *was* recorded by the bench (--auto-fingerprint
already discovered + printed it to stderr) but never made it to
the prom labels.

Now every metric carries `concurrency="N",fingerprint="<hex>"`.
Empty fingerprint (--allow-empty-fingerprint) renders as
`fingerprint=""` rather than getting dropped, so the label set
stays scrape-stable whether or not enforcement is on.

Example output (iter 256, cognitum-v0):

  ruvector_hailo_bench_throughput_per_second{concurrency="2",fingerprint="9c56e5965aea9afd99ad51826805f1be01bb0ea3301aafb74982e29e3b9cf3fa"} 70.712

Now `avg by (fingerprint) (avg_over_time(ruvector_hailo_bench_throughput_per_second[1h]))`
gives one series per model: a throughput drop within the 9c56... series is a
real regression, while a fingerprint change is a deploy event the
operator already knew about.
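
The distinction drawn above can be illustrated with a small sketch. The types, the function name, and the `max_drop_ratio` threshold are all hypothetical and are not part of the bench or of any alert rule in this PR. They only show the decision: compare throughput when the fingerprint is unchanged, and treat a fingerprint change as a deploy event.

```rust
/// Hypothetical classification of two consecutive bench samples.
#[derive(Debug, PartialEq)]
enum Change {
    DeployEvent,
    Regression,
    Stable,
}

/// Hypothetical sample: the model fingerprint plus measured throughput.
struct Sample<'a> {
    fingerprint: &'a str,
    throughput: f64,
}

/// `max_drop_ratio` is an assumed tolerance, e.g. 0.1 for a 10% drop.
fn classify(prev: &Sample<'_>, curr: &Sample<'_>, max_drop_ratio: f64) -> Change {
    // A different fingerprint means a different model: not comparable.
    if prev.fingerprint != curr.fingerprint {
        return Change::DeployEvent;
    }
    // Same model: a drop beyond the tolerance counts as a regression.
    if curr.throughput < prev.throughput * (1.0 - max_drop_ratio) {
        Change::Regression
    } else {
        Change::Stable
    }
}
```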

# What ships
- BenchSummary gains a `fingerprint: String` field, populated from
  the resolved fingerprint (whatever --fingerprint or
  --auto-fingerprint produced).
- write_prom_textfile renders it on every metric.
- bench_cli_prom_file_contains_throughput_metric updated to lock
  the new label format so a future regression surfaces in CI.

Local verification:
  cargo test -p ruvector-hailo-cluster --test bench_cli (6 passed)
  cargo clippy --all-targets -- -D warnings (clean)

Co-Authored-By: claude-flow <ruv@ruv.net>
…er 257)

Surface the resolved RUVECTOR_NPU_POOL_SIZE through the gRPC
StatsResponse so cluster-side observability can differentiate
single-pipeline vs pool=N measurements.

# Proto change (backward-compatible)
StatsResponse gains `uint32 npu_pool_size = 10`. Old workers
send 0 (proto3 default), which clients render as "unknown / pre-
iter-257"; new workers send the resolved value (1, 2, 4, ...).

# Wire-through
- worker.rs: WorkerService.npu_pool_size populated from the env
  var at startup, surfaced via get_stats RPC.
- transport.rs: StatsSnapshot.npu_pool_size field with
  #[serde(default)] so JSON consumers from old workers don't fail.
- grpc_transport.rs: populated from proto resp on stats() RPC.
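
The wire-through above might look roughly like this std-only sketch. The structs are simplified stand-ins for the generated proto type and the transport snapshot, `resolve_pool_size` is a hypothetical helper, and the fallback of 1 for an unset or invalid value is an assumption that this PR does not state. `Default` plays the role that `#[serde(default)]` plays for JSON from old workers: a missing field becomes 0.

```rust
/// Stand-in for the generated proto message (hypothetical, one field only).
struct StatsResponse {
    npu_pool_size: u32,
}

/// Stand-in for the transport snapshot. `Default` gives 0 for a missing
/// value, mirroring what `#[serde(default)]` does when deserializing.
#[derive(Debug, Default, PartialEq)]
struct StatsSnapshot {
    npu_pool_size: u32,
}

impl From<&StatsResponse> for StatsSnapshot {
    fn from(resp: &StatsResponse) -> Self {
        StatsSnapshot {
            npu_pool_size: resp.npu_pool_size,
        }
    }
}

/// Hypothetical resolver for the raw RUVECTOR_NPU_POOL_SIZE value.
/// Assumption: unset, unparsable, or zero falls back to 1.
fn resolve_pool_size(raw: Option<&str>) -> u32 {
    raw.and_then(|s| s.trim().parse::<u32>().ok())
        .filter(|n| *n > 0)
        .unwrap_or(1)
}
```

At startup a worker could call `resolve_pool_size(std::env::var("RUVECTOR_NPU_POOL_SIZE").ok().as_deref())` once and store the result.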

# ADR refresh (also in this commit)
- ADR-176 (HEF integration EPIC): added P6 row covering iter
  234-237 pool measurement work + iter 256-257 observability layer.
- ADR-178 (gap analysis): bumped Status from Proposed to Closed
  with a per-gap remediation table (8 gaps, 6 closed, 1 deferred,
  2 tracked separately).

Local verification:
  cargo check -p ruvector-hailo-cluster --bins (clean)
  cargo test -p ruvector-hailo-cluster --lib (114 passed)

Co-Authored-By: claude-flow <ruv@ruv.net>
@ruvnet ruvnet merged commit 0442856 into main May 4, 2026
23 of 27 checks passed
@ruvnet ruvnet deleted the hailo-bench-fingerprint-label branch May 4, 2026 14:58