Granularity Experiments in OpenTimer — Findings & Verdict

Date: 2026-04-30 Verdict: both granularity tweaks (coarser via chain fusion, finer via net-RC split) are statistical washes on the benchmarks we have. The real wins on the way to this conclusion were operational, not algorithmic: NUMA pinning (2–4× wall reduction) and capping workers at physical-core count (~10% over c=32 with SMT).

TL;DR

Optimization	Effect on `update_timing` wall	Status
NUMA pinning (`taskset -c 0-19,40-59`)	2–4× faster	Real, free, ship it
Cap workers at physical-core count (c=16, no SMT)	~10% faster than c=32	Real, free
Coarser: Chain fusion (`OT_FUSE_CHAINS=1`)	0–5% wall, p > 0.05 in nearly all cases	Wash. No-op.
Finer: Net-RC split (`OT_NET_RC_SPLIT=1`)	~+1.5% wall (slight regression), p > 0.25	Wash. Slight regression.

Both granularity directions land in the noise. OpenTimer's existing per-pin task granularity is already at a near-optimum on these workloads.

Side deliverable (kept): the digest profiler itself was measurably improved during this work — six original assumptions replaced with per-run measurements. See Profiler improvements.

Experiment 1 — Coarser: chain fusion

What it is

Each pin in OpenTimer's propagation graph emits one tf::Task running 5 sequential sub-ops (_fprop_rc_timing → _fprop_slew → _fprop_delay → _fprop_at → _fprop_test). Chain fusion combines a run of pins where each is a 1-fanin/1-fanout link (in FPROP_CAND state) into a single task that runs the same sub-ops for each pin in dependency order — same algorithm, fewer tasks.

Code change

ot/timer/timer.cpp _build_prop_tasks(), gated by OT_FUSE_CHAINS=1. OT_FUSE_MAX caps chain length (default 8). Off by default. Correctness: report_timing -num_paths 50 bit-identical baseline vs fused on vga_lcd.

The journey, with corrections

Iteration	Claimed result	What we found later
1. Single run, no NUMA pinning, leon2 c=32	−31% wall with `OT_FUSE_MAX=8`	Most was fusion compensating for cross-socket cache thrashing. Fix at the source.
2. Add NUMA pinning, single run	−7.9% wall at best on leon2 c=32	Single-run noise.
3. NUMA pinning + 3 repeats	−4.6% best mean on leon2 c=32 (m=32)	Still noisy; baseline σ was ~3% of mean.
4. NUMA pinning + 10 repeats + Welch's t-test	p > 0.6 on every leon2 c=32 fused config. vga_lcd c=8 m=16 reaches p=0.046 (barely significant)	Honest answer: fusion is a wash.

Final results (10 reps, Welch's t-test)

leon2_iccad c=32:

config       n    mean ms    sd     SEM     Δ%      t      p     sig?
BASELINE    10    6066.2    514.6   162.7    —      —      —     —
m=2         10    6097.8    334.7   105.8   +0.5%  -0.16  0.87   ns
m=4         10    5978.1    314.8    99.6   -1.5%  +0.46  0.64   ns
m=8         10    6033.6    282.0    89.2   -0.5%  +0.18  0.86   ns

Zero detectable effect. Baseline σ alone (~515 ms) exceeds any putative effect. To detect a real 1% wall difference at p<0.05 here we'd need >150 repeats.

vga_lcd c=8:

config       n    mean ms    sd     SEM     Δ%      t      p     sig?
BASELINE    10     605.6    39.0    12.3     —      —      —     —
m=4         10     588.0    32.8    10.4   -2.9%  +1.09  0.274  ns
m=8         10     586.1    29.6     9.4   -3.2%  +1.26  0.208  ns
m=16        10     577.7    20.7     6.5   -4.6%  +2.00  0.046   *

m=16 reaches p<0.05 but barely. Would not survive Bonferroni correction (3 comparisons → α=0.0167). Treat as inconclusive.

Chain length distribution (why `OT_FUSE_MAX` doesn't matter much)

Topology already bounds chain length tightly:

Bench	n chains	mean	p50	p90	p99	max
leon2_iccad	2,245,369	1.93	1	3	15	21
vga_lcd	238,595	1.67	1	3	9	11

→ OT_FUSE_MAX ∈ {16, 32, 64, 128, ∞} produce identical task graphs for vga_lcd. For leon2, m ∈ {32, 64, 128, ∞} are identical; m=16 differs by 38K extra splits out of ~4.5M tasks (<1%). Any wall-time difference between these configs is run-to-run noise.

The cap was added defensively in case some benchmark had 1000-pin chains (clock trees, scan chains). For OpenTimer's actual benchmark suite, no cap is needed — topology guarantees short chains.

Experiment 2 — Finer: net-RC split

Hypothesis

The sub-op profiler (added as ot/taskflow/algorithm/subop_profiler.hpp, gated by OT_SUBOP_PROFILE=1) revealed that _fprop_rc_timing accounts for 83–90% of all per-pin compute:

=== vga_lcd c=8 ===              === leon2 c=16 ===
fwd_rc_timing  3571 ms (83.2%)   fwd_rc_timing 73535 ms (90.4%)
fwd_slew        302 ms ( 7.1%)   fwd_slew       3240 ms ( 4.0%)
fwd_delay       175 ms ( 4.1%)   fwd_delay      1858 ms ( 2.3%)
fwd_at           59 ms ( 1.4%)   fwd_at          652 ms ( 0.8%)
fwd_test         29 ms ( 0.7%)   fwd_test        245 ms ( 0.3%)
bwd_rat         154 ms ( 3.6%)   bwd_rat        1821 ms ( 2.2%)

Reading Net::_update_rc_timing() confirmed: rc_timing operates on the net (RC tree topology + load capacitances from cell library), with zero dependency on any upstream pin's timing values. It's also already memoized per-net via _rc_timing_updated.

The hypothesis: lift _fprop_rc_timing out of per-pin tasks into a per-NET "Phase 0" with no inter-task dependencies. Per-pin tasks then only do slew/delay/at/test and depend on (a) their net's rc task, (b) fanin pins. This should let RC walks overlap with upstream pins' tail work and raise PE from ~88–92% toward ~95–99%, yielding 5–12% wall improvement.

Code change

New function Timer::_build_prop_tasks_net_split() in ot/timer/timer.cpp, gated by OT_NET_RC_SPLIT=1. Emits Phase-0 net-rc tasks first, then per-pin Phase-1 tasks. Backward direction unchanged. Correctness: report_timing -num_paths 100 bit-identical baseline vs net-split on both vga_lcd (2272 lines) and leon2 (3308 lines).

Results (5 reps, Welch's t-test)

=== vga_lcd c=8 ===
baseline:  n=5  mean=588.9 ms  sd=27.7  PE=86.6%
netsplit:  n=5  mean=597.5 ms  sd=17.0  PE=83.4%
Δ wall = +1.47%   t=-0.60   p=0.55   ns

=== leon2 c=16 ===
baseline:  n=5  mean=5540 ms   sd=142   PE=90.6%
netsplit:  n=5  mean=5623 ms   sd=77    PE=87.7%
Δ wall = +1.50%   t=-1.15   p=0.25   ns

Both benchmarks: ~1.5% slight regression, not statistically significant.

Why the hypothesis was wrong

Crucially, PE dropped by 3 points in both cases — the opposite of what was predicted. The reasoning:

The graph is work-bound, not depth-bound. Wall ≈ total_CPU_work / cores / PE. Critical-path-shortening doesn't help when wall is already pinned to the work bound. Total CPU work doesn't change with re-organization.
Splitting added 18% more tasks (1.6M new net tasks for leon2). At ~0.4 μs dispatch each, that's pure overhead.
Per-pin tasks shrank from ~17 μs to ~0.7 μs after lifting rc out. Dispatch (~0.4 μs) became 30–60% of pin-task wall — terrible dispatch:work ratio.
The "extra parallelism" the split exposes was already there: the work was already perfectly parallel via Taskflow's stealing across pins. Splitting just made each unit of work smaller and more expensive to schedule.

Lesson: a sub-op being 90% of compute does not mean it's 90% of opportunity. Memoization + existing inter-pin parallelism mean that work was already well-distributed; "freeing" it from a sequential per-pin chain doesn't help when the chain wasn't the bottleneck.

NUMA pinning effect (the big lever found along the way)

Same machine, same software, same set_num_threads. Only taskset -c 0-19,40-59 (one NUMA node = CPUs 0-19 physical + 40-59 SMT siblings) added.

Bench	Cores	Unpinned baseline	NUMA-pinned baseline	Speedup
leon2_iccad	32	24.90 s	6.20 s	4.0×
leon2_iccad	8	14.35 s	6.91 s	2.1×
vga_lcd	32	1.100 s	0.503 s	2.2×
vga_lcd	8	0.825 s	0.613 s	1.4×

Cause: Default Linux scheduling lets Taskflow's worker threads roam across both sockets; cross-socket steals incur remote-memory penalties; first-touch allocations land arbitrarily. Pinning to one NUMA node fixes all three.

This invalidates the prior report_scaling.html (5.4 MB) — its leon2 c=32 baseline of 24.9 s was the SMT/NUMA bug, not OpenTimer.

SMT contention finding (c=16 vs c=32 baselines)

From the focused 5×5 sweep (cores × max_chain), looking only at the BASELINE column:

Bench	c=2	c=4	c=8	c=16	c=32
leon2	15411	10170	6722	5554	6107
vga_lcd	1344	832	643	485	535

c=16 is the optimal core count on both benchmarks. c=32 forces 12 SMT siblings into play (CPUs 40-51 share L1/L2 with their physical cousins on 0-11), which costs ~10% wall. Free recommendation: cap worker count at the number of physical cores in one NUMA node (here: 20).

Profiler improvements

While running these experiments we identified and fixed six places where the digest observer was using assumptions instead of measurements. The diagnoses now carry the same warnings on the same workloads, but with per-run-calibrated thresholds and self-documenting output. Three substantive fixes were made, plus three earlier ones from the start of the engagement.

Earlier fixes (already in place when this work started)

#	Was	Now
1	Hardcoded `SCHEDULING_COST_US` constant (a guess)	Calibrated per-run from observed sub-1μs gaps. Output: `est. X μs @ 0.Y μs/task, calibrated from N sub-1μs gaps`
2	All worker idle in one bucket	Macro-idle (`wall − busy`) separated from micro-gap (inter-task gaps on same worker)
3	Diagnoses double-counted: `LOAD_IMBALANCE` + `STRAGGLER_TASKS` for one root cause	Collapsed when correlated; `LOAD_IMBALANCE` adds a "root cause: single straggler" annotation
4	`STRAGGLER_TASKS` ratio used self-referential mean (mean included the straggler)	Trimmed mean excluding top-K outliers
5	`SERIAL_BOTTLENECK` used a hard cliff at `avg_workers < 2.0`	Continuous busy-time-weighted parallelism (no threshold)
6	`SERIAL_BOTTLENECK` AND-gate missed common cases	Relaxed to OR-gate: serial > 70%, OR (serial > 50% AND eff < 50%)

Three new fixes from this engagement (post-experiments)

Fix C — parallelism-shape bucket widths use actual run window, not theoretical bucket widths. The bucket display was dividing by the full theoretical bucket width (e.g., 9s for 1-10s), even when a 5.5s run only used 4.5s of that bucket. Result: avg_workers under-reported by up to 2× in tail buckets, which contradicted the parallel-efficiency number elsewhere in the digest.

Workload	PE	BEFORE shape (avg/W)	AFTER shape (avg/W)	Now consistent?
leon2 c=32	93.8%	n/a	31.3/32 (98%) + 29.9/32 (94%)	✅
leon2 c=16	91.4%	9.0/16 (56%) + 8.2/16 (51%)	15.1/16 (94%) + 14.6/16 (91%)	✅
vga_lcd c=8	86.7%	6.1/8 (76%) + 4.2/8 (52%)	7.4/8 (92%) + 6.9/8 (86%)	✅
c1355 c=4	77.1%	3.2/4 (80%) + 0.1/4 (3%)	3.3/4 (82%) + 2.7/4 (68%)	✅ tail bucket meaningful

Side effect: SERIAL_BOTTLENECK weighted_parallelism is now accurate. c17's serial fraction dropped from 71% → 65%, and on some runs the diagnosis stops firing entirely (correctly — c17's problem is load imbalance, not seriality).

Fix A — SCHEDULING_OVERHEAD trigger uses measured cost. Was hardcoded mean_dur < 1.0μs; now mean_dur < 2 × calibrated_sched_cost_us. Output reads task_mean(0.5μs) < 2×scheduling_cost(~0.7μs). Doesn't fire on any OpenTimer benchmark (mean is pulled high by stragglers), but the threshold is now portable across schedulers of different per-task dispatch cost.

Fix B — long-gap classification is adaptive. Was hardcoded "gap > 100μs is long"; now adapts to 100 × measured sched_cost_us, snapped to nearest decade in {10μs, 100μs, 1ms}. Each digest now self-documents: 'long-gap' = >X μs (≈ 100× sched_cost, calibrated Y μs/task). To enable adaptive re-aggregation at digest time, the on_exit gap accumulator went from 3 fixed buckets to 5 finer ones.

Workload	sched_cost	Old hardcoded	New adaptive
vga_lcd c=4	0.257 μs	>100μs	>10μs
vga_lcd c=8	0.310 μs	>100μs	>10μs
c1355 c=4	0.291 μs	>100μs	>10μs
leon2 c=16	0.353 μs	>100μs	>100μs
leon2 c=32	0.427 μs	>100μs	>100μs

Net result for users: the parallelism shape now matches the parallel efficiency (was off by ~40 percentage points on leon2 c=16 before). Digests are scheduler-portable. No diagnostic numbers are hardcoded.

Methodology notes

What worked:

NUMA pinning via taskset (no numactl needed; first-touch allocation handles memory locality once threads stay put)
Round-robin interleaving of configs within repeats (any system drift hits all configs equally)
Welch's t-test for unequal-variance comparisons
Reading Wall : %d μs line from DigestObserver output for ground truth
Sub-op profiler on a separate file (subop_profiler.hpp) that doesn't touch the existing digest_observer.hpp
Bit-identical 100-path report_timing diff as the cheap correctness check

What didn't:

Single-run comparisons (yielded the false 31% claim)
3-repeat means (still inconclusive at our effect size; gave the false 5–8% claims)
Per-process invocation runner (each ot-shell re-reads ~60s of input files — ate ~80 min of compute on the 80-run repeat sweep alone)
Reasoning from "X is the dominant cost" to "extracting X is the optimization" — see Experiment 2 hypothesis failure

What we'd do differently next time:

Refactor runner to keep one ot-shell alive and force re-propagation between configs (would cut the per-run cost from ~80s → ~10s for leon2). Requires either: (a) a Timer "force_dirty" debug hook, or (b) re-read_spef-ing on each iteration (still slow). Worth building before any future optimization sweep.
Move env-var reads from static const to per-call so a single process can sweep multiple OT_FUSE_MAX / OT_NET_RC_SPLIT values.
Compute work-bound vs depth-bound before designing the optimization. If wall ≈ CPU_work/cores, no re-organization will help — only reducing CPU_work or improving PE will.

Files and artifacts

ot/timer/timer.cpp                                     code change (uncommitted) — both experiments
ot/timer/timer.hpp                                     header decl for _build_prop_tasks_net_split
ot/taskflow/algorithm/digest_observer.hpp              digest observer (in-tree, untouched)
ot/taskflow/algorithm/subop_profiler.hpp               new — gated by OT_SUBOP_PROFILE=1
digest-observer-package/digest_observer.hpp            source-of-truth observer

reports/inf_sweep/                       84 digests — first NUMA-pinned sweep + m=4 follow-up
reports/focused_sweep/                  150 digests — 5 cores × 5 configs × 3 reps × 2 benches
reports/winners_repeat/                  80 digests — chain-fusion 10-rep statistical validation
reports/netsplit_correctness/            timing reports baseline vs netsplit (bit-identical)
reports/netsplit_repeats/                20 digests — net-split 5-rep validation
reports/chain_fusion_findings.md         this document
reports/report_scaling.html              superseded — its leon2 c=32 baseline was SMT/NUMA bug

/tmp/run_inf_sweep.sh                    runners (informational; not committed)
/tmp/run_focused_sweep.sh
/tmp/run_winners_repeat.sh
/tmp/run_netsplit_correctness.sh
/tmp/run_netsplit_repeats.sh

Verdict

Do not ship chain fusion or net-RC split as currently designed. No statistically significant benefit on the benchmarks we have; net-split shows a slight regression.
Keep both code paths behind their env vars so they're easy to test on future benchmarks (deep clock trees, real production CPU netlists).
Adopt NUMA pinning + physical-core-only worker counts immediately. Together they explain >>10× the wall savings either granularity tweak would have at best.
Re-baseline the prior report_scaling.html numbers. Its dramatic "scaling wall" was the absence of NUMA pinning, not OpenTimer.
Stop searching for wins inside the granularity dimension on these benchmarks. Both directions are dead ends. If there's more performance to find, it has to come from elsewhere: reducing total CPU work (algorithm changes, out of scope), reducing per-task memory access cost (allocator/data-layout changes), or exploiting the NUMA topology more aggressively (NUMA-aware stealing in Taskflow).

Open questions for future investigation

Do real production netlists (CPUs with extensive clock distribution) have longer chains where fusion would matter? Would the net-RC split help on benchmarks with much higher fanout per net?
Would a memory allocator swap (tcmalloc, mimalloc, jemalloc per-arena) move the needle? OpenTimer allocates ~millions of small objects (RctNode, Pin, Arc) — allocator quality could matter more than task granularity at this scale.
What's the contribution of Net::_update_rc_timing()'s recursive unordered_map<string, RctNode> walks to the per-pin cost? Replacing the map with a flat indexed structure would be an algorithm change (out of scope) but worth measuring. The 46 μs/net rc-walk cost suggests this is a real hot spot.
Would NUMA-aware work stealing (Taskflow doesn't have it built-in) provide more than the current pinning approach?
Could _fprop_slew/delay/at/test (collectively only ~10% of per-pin time) be batched or vectorized across pins on the same net? They share cell-library data.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Granularity Experiments in OpenTimer — Findings & Verdict

TL;DR

Experiment 1 — Coarser: chain fusion

What it is

Code change

The journey, with corrections

Final results (10 reps, Welch's t-test)

Chain length distribution (why `OT_FUSE_MAX` doesn't matter much)

Experiment 2 — Finer: net-RC split

Hypothesis

Code change

Results (5 reps, Welch's t-test)

Why the hypothesis was wrong

NUMA pinning effect (the big lever found along the way)

SMT contention finding (c=16 vs c=32 baselines)

Profiler improvements

Earlier fixes (already in place when this work started)

Three new fixes from this engagement (post-experiments)

Methodology notes

Files and artifacts

Verdict

Open questions for future investigation

FilesExpand file tree

opentimer-granularity-experiments.md

Latest commit

History

opentimer-granularity-experiments.md

File metadata and controls

Granularity Experiments in OpenTimer — Findings & Verdict

TL;DR

Experiment 1 — Coarser: chain fusion

What it is

Code change

The journey, with corrections

Final results (10 reps, Welch's t-test)

Chain length distribution (why OT_FUSE_MAX doesn't matter much)

Experiment 2 — Finer: net-RC split

Hypothesis

Code change

Results (5 reps, Welch's t-test)

Why the hypothesis was wrong

NUMA pinning effect (the big lever found along the way)

SMT contention finding (c=16 vs c=32 baselines)

Profiler improvements

Earlier fixes (already in place when this work started)

Three new fixes from this engagement (post-experiments)

Methodology notes

Files and artifacts

Verdict

Open questions for future investigation

Chain length distribution (why `OT_FUSE_MAX` doesn't matter much)