Date: 2026-04-30 Verdict: both granularity tweaks (coarser via chain fusion, finer via net-RC split) are statistical washes on the benchmarks we have. The real wins on the way to this conclusion were operational, not algorithmic: NUMA pinning (2–4× wall reduction) and capping workers at physical-core count (~10% over c=32 with SMT).
| Optimization | Effect on update_timing wall |
Status |
|---|---|---|
NUMA pinning (taskset -c 0-19,40-59) |
2–4× faster | Real, free, ship it |
| Cap workers at physical-core count (c=16, no SMT) | ~10% faster than c=32 | Real, free |
Coarser: Chain fusion (OT_FUSE_CHAINS=1) |
0–5% wall, p > 0.05 in nearly all cases | Wash. No-op. |
Finer: Net-RC split (OT_NET_RC_SPLIT=1) |
~+1.5% wall (slight regression), p > 0.25 | Wash. Slight regression. |
Both granularity directions land in the noise. OpenTimer's existing per-pin task granularity is already at a near-optimum on these workloads.
Side deliverable (kept): the digest profiler itself was measurably improved during this work — six original assumptions replaced with per-run measurements. See Profiler improvements.
Each pin in OpenTimer's propagation graph emits one tf::Task running 5 sequential sub-ops (_fprop_rc_timing → _fprop_slew → _fprop_delay → _fprop_at → _fprop_test). Chain fusion combines a run of pins where each is a 1-fanin/1-fanout link (in FPROP_CAND state) into a single task that runs the same sub-ops for each pin in dependency order — same algorithm, fewer tasks.
ot/timer/timer.cpp _build_prop_tasks(), gated by OT_FUSE_CHAINS=1. OT_FUSE_MAX caps chain length (default 8). Off by default. Correctness: report_timing -num_paths 50 bit-identical baseline vs fused on vga_lcd.
| Iteration | Claimed result | What we found later |
|---|---|---|
| 1. Single run, no NUMA pinning, leon2 c=32 | −31% wall with OT_FUSE_MAX=8 |
Most was fusion compensating for cross-socket cache thrashing. Fix at the source. |
| 2. Add NUMA pinning, single run | −7.9% wall at best on leon2 c=32 | Single-run noise. |
| 3. NUMA pinning + 3 repeats | −4.6% best mean on leon2 c=32 (m=32) | Still noisy; baseline σ was ~3% of mean. |
| 4. NUMA pinning + 10 repeats + Welch's t-test | p > 0.6 on every leon2 c=32 fused config. vga_lcd c=8 m=16 reaches p=0.046 (barely significant) | Honest answer: fusion is a wash. |
leon2_iccad c=32:
config n mean ms sd SEM Δ% t p sig?
BASELINE 10 6066.2 514.6 162.7 — — — —
m=2 10 6097.8 334.7 105.8 +0.5% -0.16 0.87 ns
m=4 10 5978.1 314.8 99.6 -1.5% +0.46 0.64 ns
m=8 10 6033.6 282.0 89.2 -0.5% +0.18 0.86 ns
Zero detectable effect. Baseline σ alone (~515 ms) exceeds any putative effect. To detect a real 1% wall difference at p<0.05 here we'd need >150 repeats.
vga_lcd c=8:
config n mean ms sd SEM Δ% t p sig?
BASELINE 10 605.6 39.0 12.3 — — — —
m=4 10 588.0 32.8 10.4 -2.9% +1.09 0.274 ns
m=8 10 586.1 29.6 9.4 -3.2% +1.26 0.208 ns
m=16 10 577.7 20.7 6.5 -4.6% +2.00 0.046 *
m=16 reaches p<0.05 but barely. Would not survive Bonferroni correction (3 comparisons → α=0.0167). Treat as inconclusive.
Topology already bounds chain length tightly:
| Bench | n chains | mean | p50 | p90 | p99 | max |
|---|---|---|---|---|---|---|
| leon2_iccad | 2,245,369 | 1.93 | 1 | 3 | 15 | 21 |
| vga_lcd | 238,595 | 1.67 | 1 | 3 | 9 | 11 |
→ OT_FUSE_MAX ∈ {16, 32, 64, 128, ∞} produce identical task graphs for vga_lcd. For leon2, m ∈ {32, 64, 128, ∞} are identical; m=16 differs by 38K extra splits out of ~4.5M tasks (<1%). Any wall-time difference between these configs is run-to-run noise.
The cap was added defensively in case some benchmark had 1000-pin chains (clock trees, scan chains). For OpenTimer's actual benchmark suite, no cap is needed — topology guarantees short chains.
The sub-op profiler (added as ot/taskflow/algorithm/subop_profiler.hpp, gated by OT_SUBOP_PROFILE=1) revealed that _fprop_rc_timing accounts for 83–90% of all per-pin compute:
=== vga_lcd c=8 === === leon2 c=16 ===
fwd_rc_timing 3571 ms (83.2%) fwd_rc_timing 73535 ms (90.4%)
fwd_slew 302 ms ( 7.1%) fwd_slew 3240 ms ( 4.0%)
fwd_delay 175 ms ( 4.1%) fwd_delay 1858 ms ( 2.3%)
fwd_at 59 ms ( 1.4%) fwd_at 652 ms ( 0.8%)
fwd_test 29 ms ( 0.7%) fwd_test 245 ms ( 0.3%)
bwd_rat 154 ms ( 3.6%) bwd_rat 1821 ms ( 2.2%)
Reading Net::_update_rc_timing() confirmed: rc_timing operates on the net (RC tree topology + load capacitances from cell library), with zero dependency on any upstream pin's timing values. It's also already memoized per-net via _rc_timing_updated.
The hypothesis: lift _fprop_rc_timing out of per-pin tasks into a per-NET "Phase 0" with no inter-task dependencies. Per-pin tasks then only do slew/delay/at/test and depend on (a) their net's rc task, (b) fanin pins. This should let RC walks overlap with upstream pins' tail work and raise PE from ~88–92% toward ~95–99%, yielding 5–12% wall improvement.
New function Timer::_build_prop_tasks_net_split() in ot/timer/timer.cpp, gated by OT_NET_RC_SPLIT=1. Emits Phase-0 net-rc tasks first, then per-pin Phase-1 tasks. Backward direction unchanged. Correctness: report_timing -num_paths 100 bit-identical baseline vs net-split on both vga_lcd (2272 lines) and leon2 (3308 lines).
=== vga_lcd c=8 ===
baseline: n=5 mean=588.9 ms sd=27.7 PE=86.6%
netsplit: n=5 mean=597.5 ms sd=17.0 PE=83.4%
Δ wall = +1.47% t=-0.60 p=0.55 ns
=== leon2 c=16 ===
baseline: n=5 mean=5540 ms sd=142 PE=90.6%
netsplit: n=5 mean=5623 ms sd=77 PE=87.7%
Δ wall = +1.50% t=-1.15 p=0.25 ns
Both benchmarks: ~1.5% slight regression, not statistically significant.
Crucially, PE dropped by 3 points in both cases — the opposite of what was predicted. The reasoning:
- The graph is work-bound, not depth-bound. Wall ≈ total_CPU_work / cores / PE. Critical-path-shortening doesn't help when wall is already pinned to the work bound. Total CPU work doesn't change with re-organization.
- Splitting added 18% more tasks (1.6M new net tasks for leon2). At ~0.4 μs dispatch each, that's pure overhead.
- Per-pin tasks shrank from ~17 μs to ~0.7 μs after lifting rc out. Dispatch (~0.4 μs) became 30–60% of pin-task wall — terrible dispatch:work ratio.
- The "extra parallelism" the split exposes was already there: the work was already perfectly parallel via Taskflow's stealing across pins. Splitting just made each unit of work smaller and more expensive to schedule.
Lesson: a sub-op being 90% of compute does not mean it's 90% of opportunity. Memoization + existing inter-pin parallelism mean that work was already well-distributed; "freeing" it from a sequential per-pin chain doesn't help when the chain wasn't the bottleneck.
Same machine, same software, same set_num_threads. Only taskset -c 0-19,40-59 (one NUMA node = CPUs 0-19 physical + 40-59 SMT siblings) added.
| Bench | Cores | Unpinned baseline | NUMA-pinned baseline | Speedup |
|---|---|---|---|---|
| leon2_iccad | 32 | 24.90 s | 6.20 s | 4.0× |
| leon2_iccad | 8 | 14.35 s | 6.91 s | 2.1× |
| vga_lcd | 32 | 1.100 s | 0.503 s | 2.2× |
| vga_lcd | 8 | 0.825 s | 0.613 s | 1.4× |
Cause: Default Linux scheduling lets Taskflow's worker threads roam across both sockets; cross-socket steals incur remote-memory penalties; first-touch allocations land arbitrarily. Pinning to one NUMA node fixes all three.
This invalidates the prior report_scaling.html (5.4 MB) — its leon2 c=32 baseline of 24.9 s was the SMT/NUMA bug, not OpenTimer.
From the focused 5×5 sweep (cores × max_chain), looking only at the BASELINE column:
| Bench | c=2 | c=4 | c=8 | c=16 | c=32 |
|---|---|---|---|---|---|
| leon2 | 15411 | 10170 | 6722 | 5554 | 6107 |
| vga_lcd | 1344 | 832 | 643 | 485 | 535 |
c=16 is the optimal core count on both benchmarks. c=32 forces 12 SMT siblings into play (CPUs 40-51 share L1/L2 with their physical cousins on 0-11), which costs ~10% wall. Free recommendation: cap worker count at the number of physical cores in one NUMA node (here: 20).
While running these experiments we identified and fixed six places where the digest observer was using assumptions instead of measurements. The diagnoses now carry the same warnings on the same workloads, but with per-run-calibrated thresholds and self-documenting output. Three substantive fixes were made, plus three earlier ones from the start of the engagement.
| # | Was | Now |
|---|---|---|
| 1 | Hardcoded SCHEDULING_COST_US constant (a guess) |
Calibrated per-run from observed sub-1μs gaps. Output: est. X μs @ 0.Y μs/task, calibrated from N sub-1μs gaps |
| 2 | All worker idle in one bucket | Macro-idle (wall − busy) separated from micro-gap (inter-task gaps on same worker) |
| 3 | Diagnoses double-counted: LOAD_IMBALANCE + STRAGGLER_TASKS for one root cause |
Collapsed when correlated; LOAD_IMBALANCE adds a "root cause: single straggler" annotation |
| 4 | STRAGGLER_TASKS ratio used self-referential mean (mean included the straggler) |
Trimmed mean excluding top-K outliers |
| 5 | SERIAL_BOTTLENECK used a hard cliff at avg_workers < 2.0 |
Continuous busy-time-weighted parallelism (no threshold) |
| 6 | SERIAL_BOTTLENECK AND-gate missed common cases |
Relaxed to OR-gate: serial > 70%, OR (serial > 50% AND eff < 50%) |
Fix C — parallelism-shape bucket widths use actual run window, not theoretical bucket widths.
The bucket display was dividing by the full theoretical bucket width (e.g., 9s for 1-10s), even when a 5.5s run only used 4.5s of that bucket. Result: avg_workers under-reported by up to 2× in tail buckets, which contradicted the parallel-efficiency number elsewhere in the digest.
| Workload | PE | BEFORE shape (avg/W) | AFTER shape (avg/W) | Now consistent? |
|---|---|---|---|---|
| leon2 c=32 | 93.8% | n/a | 31.3/32 (98%) + 29.9/32 (94%) | ✅ |
| leon2 c=16 | 91.4% | 9.0/16 (56%) + 8.2/16 (51%) | 15.1/16 (94%) + 14.6/16 (91%) | ✅ |
| vga_lcd c=8 | 86.7% | 6.1/8 (76%) + 4.2/8 (52%) | 7.4/8 (92%) + 6.9/8 (86%) | ✅ |
| c1355 c=4 | 77.1% | 3.2/4 (80%) + 0.1/4 (3%) | 3.3/4 (82%) + 2.7/4 (68%) | ✅ tail bucket meaningful |
Side effect: SERIAL_BOTTLENECK weighted_parallelism is now accurate. c17's serial fraction dropped from 71% → 65%, and on some runs the diagnosis stops firing entirely (correctly — c17's problem is load imbalance, not seriality).
Fix A — SCHEDULING_OVERHEAD trigger uses measured cost. Was hardcoded mean_dur < 1.0μs; now mean_dur < 2 × calibrated_sched_cost_us. Output reads task_mean(0.5μs) < 2×scheduling_cost(~0.7μs). Doesn't fire on any OpenTimer benchmark (mean is pulled high by stragglers), but the threshold is now portable across schedulers of different per-task dispatch cost.
Fix B — long-gap classification is adaptive. Was hardcoded "gap > 100μs is long"; now adapts to 100 × measured sched_cost_us, snapped to nearest decade in {10μs, 100μs, 1ms}. Each digest now self-documents: 'long-gap' = >X μs (≈ 100× sched_cost, calibrated Y μs/task). To enable adaptive re-aggregation at digest time, the on_exit gap accumulator went from 3 fixed buckets to 5 finer ones.
| Workload | sched_cost | Old hardcoded | New adaptive |
|---|---|---|---|
| vga_lcd c=4 | 0.257 μs | >100μs | >10μs |
| vga_lcd c=8 | 0.310 μs | >100μs | >10μs |
| c1355 c=4 | 0.291 μs | >100μs | >10μs |
| leon2 c=16 | 0.353 μs | >100μs | >100μs |
| leon2 c=32 | 0.427 μs | >100μs | >100μs |
Net result for users: the parallelism shape now matches the parallel efficiency (was off by ~40 percentage points on leon2 c=16 before). Digests are scheduler-portable. No diagnostic numbers are hardcoded.
What worked:
- NUMA pinning via
taskset(nonumactlneeded; first-touch allocation handles memory locality once threads stay put) - Round-robin interleaving of configs within repeats (any system drift hits all configs equally)
- Welch's t-test for unequal-variance comparisons
- Reading
Wall : %d μsline fromDigestObserveroutput for ground truth - Sub-op profiler on a separate file (
subop_profiler.hpp) that doesn't touch the existingdigest_observer.hpp - Bit-identical 100-path
report_timingdiff as the cheap correctness check
What didn't:
- Single-run comparisons (yielded the false 31% claim)
- 3-repeat means (still inconclusive at our effect size; gave the false 5–8% claims)
- Per-process invocation runner (each
ot-shellre-reads ~60s of input files — ate ~80 min of compute on the 80-run repeat sweep alone) - Reasoning from "X is the dominant cost" to "extracting X is the optimization" — see Experiment 2 hypothesis failure
What we'd do differently next time:
- Refactor runner to keep one
ot-shellalive and force re-propagation between configs (would cut the per-run cost from ~80s → ~10s for leon2). Requires either: (a) a Timer "force_dirty" debug hook, or (b) re-read_spef-ing on each iteration (still slow). Worth building before any future optimization sweep. - Move env-var reads from
static constto per-call so a single process can sweep multipleOT_FUSE_MAX/OT_NET_RC_SPLITvalues. - Compute work-bound vs depth-bound before designing the optimization. If wall ≈ CPU_work/cores, no re-organization will help — only reducing CPU_work or improving PE will.
ot/timer/timer.cpp code change (uncommitted) — both experiments
ot/timer/timer.hpp header decl for _build_prop_tasks_net_split
ot/taskflow/algorithm/digest_observer.hpp digest observer (in-tree, untouched)
ot/taskflow/algorithm/subop_profiler.hpp new — gated by OT_SUBOP_PROFILE=1
digest-observer-package/digest_observer.hpp source-of-truth observer
reports/inf_sweep/ 84 digests — first NUMA-pinned sweep + m=4 follow-up
reports/focused_sweep/ 150 digests — 5 cores × 5 configs × 3 reps × 2 benches
reports/winners_repeat/ 80 digests — chain-fusion 10-rep statistical validation
reports/netsplit_correctness/ timing reports baseline vs netsplit (bit-identical)
reports/netsplit_repeats/ 20 digests — net-split 5-rep validation
reports/chain_fusion_findings.md this document
reports/report_scaling.html superseded — its leon2 c=32 baseline was SMT/NUMA bug
/tmp/run_inf_sweep.sh runners (informational; not committed)
/tmp/run_focused_sweep.sh
/tmp/run_winners_repeat.sh
/tmp/run_netsplit_correctness.sh
/tmp/run_netsplit_repeats.sh
- Do not ship chain fusion or net-RC split as currently designed. No statistically significant benefit on the benchmarks we have; net-split shows a slight regression.
- Keep both code paths behind their env vars so they're easy to test on future benchmarks (deep clock trees, real production CPU netlists).
- Adopt NUMA pinning + physical-core-only worker counts immediately. Together they explain >>10× the wall savings either granularity tweak would have at best.
- Re-baseline the prior
report_scaling.htmlnumbers. Its dramatic "scaling wall" was the absence of NUMA pinning, not OpenTimer. - Stop searching for wins inside the granularity dimension on these benchmarks. Both directions are dead ends. If there's more performance to find, it has to come from elsewhere: reducing total CPU work (algorithm changes, out of scope), reducing per-task memory access cost (allocator/data-layout changes), or exploiting the NUMA topology more aggressively (NUMA-aware stealing in Taskflow).
- Do real production netlists (CPUs with extensive clock distribution) have longer chains where fusion would matter? Would the net-RC split help on benchmarks with much higher fanout per net?
- Would a memory allocator swap (tcmalloc, mimalloc, jemalloc per-arena) move the needle? OpenTimer allocates ~millions of small objects (RctNode, Pin, Arc) — allocator quality could matter more than task granularity at this scale.
- What's the contribution of
Net::_update_rc_timing()'s recursiveunordered_map<string, RctNode>walks to the per-pin cost? Replacing the map with a flat indexed structure would be an algorithm change (out of scope) but worth measuring. The 46 μs/net rc-walk cost suggests this is a real hot spot. - Would NUMA-aware work stealing (Taskflow doesn't have it built-in) provide more than the current pinning approach?
- Could
_fprop_slew/delay/at/test(collectively only ~10% of per-pin time) be batched or vectorized across pins on the same net? They share cell-library data.