Architecture E (POC, opt-in -gc e): Perceus reuse-in-place front line + precise STW tracing GC backstop for the C backend #27458
eptx wants to merge 31 commits into
libgc's parallel mark defaults to one helper thread per core. On macOS every stop-the-world collection then wakes N-1 mark helpers that contend (mach thread_suspend/resume + mark-queue spin), starving the application's own worker threads. For an allocation-heavy multi-threaded server (the cx picoev multi-reactor HTTP leg) this collapses throughput.
Measured on a 12-core M-series, serve-file [?http-service] under wrk -t8 -c100:
  parallel mark (default): ~48.5K req/s, ~5 active cores
  single marker:           ~125-131K req/s (2.6x)
Fix: emit GC_set_markers_count(1) in the main() boehm preamble on macOS, before GC_INIT(). GC_thr_init computes the marker count there and starts the helpers eagerly, so the call must precede it (an equivalent call in _vinit runs too late). Mirrors the GC_set_markers_count(1) already done for shared libs. The GC_MARKERS env var still overrides this (read first in GC_thr_init). Scoped to macOS; Linux parallel-mark is left at default pending separate measurement.
(cherry picked from commit 0421ce3)
The C codegen for global declarations handled @[volatile] but silently ignored @[thread_local], so a @[thread_local] __global compiled to a plain process-shared global. Any concurrent use of such a global then raced across threads. Emit the __thread storage-class (GCC/Clang/tcc; valid for the zero/nil/literal initializers V produces for globals) when the decl carries the attribute. Required by the scope-aware region allocator, whose per-thread state must be genuinely thread-local.
(cherry picked from commit 13e1ce9)
…c to ON
The bundled gc.c amalgamation embedded `/* #undef THREAD_LOCAL_ALLOC */` in its autoheader config, the one alloc-lock mitigation left OFF. Flip it to `#define THREAD_LOCAL_ALLOC 1` (the canonical configure-equivalent of `--enable-thread-local-alloc=yes`); the TLA implementation is already present in the amalgamation (guarded by `#ifdef THREAD_LOCAL_ALLOC`), so this activates real code. Compiles clean.
This affects the Linux / `-prod`-bundled `gc.o` path only. The macOS `cx` build links the prebuilt `thirdparty/tcc/lib/libgc.a` (built by thirdparty-macos-arm64_bdwgc.sh), which already defaults TLA on, so the flip brings the bundled config into parity with what macOS already ships.
Companion runtime lever (already present at this pin): the macOS marker-pin `GC_set_markers_count(1)` before `GC_INIT()` in cmain.v::gen_boehm_gc_init(). Together: ~2.4-5x on MARK-bound multi-thread work. Neither removes the GC_allocate_ml alloc-lock, so alloc-heavy `[?map [par]]` (cx-private vlang#14) stays ~1.3x slower than serial; this is partial relief by design. Full acceptance + measurements in cx-private bench/parallel-alloc/P0-ACCEPTANCE.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit ae03a36)
…vgc backstop + Linux port + per-thread accounting + deep-free
Forward-port of the full E work (developed on the upstream-master clone) onto the latest-V (a83aabb, Jun-11) base, on top of the cherry-picked cx patches. Adds:
- P3 minimal STW mark-region backstop collector (vgc_*), churn-correct on macOS.
- Linux backstop port: signal-suspend+ACK + dl_iterate_phdr ELF roots (vgc_platform.h).
- P1/P2 Perceus front line, DECOUPLED from -autofree (fires on the `perceus` define alone) so it composes with the tracing backstop.
- Per-thread heap accounting (live_delta/alloc_delta) → R2 alloc-MP near-linear.
- Sound deep-free of nested heap fields of a dropped &Foo (perceus.v deep-drop analysis).
- `-gc e` unified flag = vgc backstop + Perceus front line (opt-in; boehm stays default).
perceus.v is a new cgen pass (was untracked in the clone; now committed here).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit 677770d)
A vgc span serves one size CLASS but real workloads pack many TYPES (plus conservative ptrmap==0 allocations) into it. The span recorded only the FIRST typed allocation's ptrmap and applied it to every object, so any object whose real pointer layout differed had live child pointers skipped during mark -> reclaimed-while-reachable (observed as corrupted results in a deep alloc-heavy serial fold). The backstop now scans every scannable (non-noscan) span conservatively: finds every pointer, may over-retain, never under-retains. The backstop runs rarely behind the Perceus front line, so the cost is negligible.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit 004b02f)
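A minimal sketch of the failure mode described above, on a toy heap of fixed-size objects. It is not the vgc implementation: `Obj`, `mark_from`, `reaches`, and the one-bit-per-word ptrmap encoding are all assumptions made for illustration. Marking with the first type's ptrmap misses a child stored in a word that map does not cover, while treating every word as a candidate pointer cannot miss it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Toy heap: NOBJ objects of WORDS machine words each (hypothetical layout). */
enum { NOBJ = 8, WORDS = 4 };
typedef struct { uintptr_t w[WORDS]; } Obj;

static Obj heap[NOBJ];
static bool marked[NOBJ];

/* Every word is a candidate pointer. */
#define CONSERVATIVE ((1u << WORDS) - 1u)

/* Map an arbitrary word to the object it points into, or -1 if it is not a heap address. */
static int obj_index(uintptr_t v) {
    uintptr_t lo = (uintptr_t)&heap[0];
    uintptr_t hi = lo + sizeof heap;
    if (v < lo || v >= hi) return -1;
    return (int)((v - lo) / sizeof(Obj));
}

/* Mark everything reachable from v. Bit k of ptrmap means "word k may hold a pointer". */
static void mark_from(uintptr_t v, unsigned ptrmap) {
    int i = obj_index(v);
    if (i < 0 || marked[i]) return;
    marked[i] = true;
    for (int k = 0; k < WORDS; k++)
        if (ptrmap & (1u << k)) mark_from(heap[i].w[k], ptrmap);
}

/* Does a mark starting at heap[root] reach heap[target] under the given ptrmap? */
static bool reaches(int root, int target, unsigned ptrmap) {
    assert(root >= 0 && root < NOBJ && target >= 0 && target < NOBJ);
    memset(marked, 0, sizeof marked);
    mark_from((uintptr_t)&heap[root], ptrmap);
    return marked[target];
}
```

With `heap[0].w[2]` pointing at `heap[3]`, `reaches(0, 3, 0x1u)` models the span-wide ptrmap of a type whose only pointer is word 0 and loses the child; `reaches(0, 3, CONSERVATIVE)` keeps it, at the price of possibly retaining objects that a non-pointer word happens to address.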
gen_free_for_sumtype/gen_free_for_array were not .option-aware and emitted it->_typ / it->len against the _option_* wrapper, producing a C error for any program with a ?SumType or ?[]T field freed under autofree/perceus. Mirror the existing option handling in gen_free_for_map/struct; no-op for non-option types.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit 26ac2bb)
Remove references to the downstream consumer/repo from comments so the runtime and Perceus analysis read as standalone upstream V work. No code change.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit 6f94f55)
Option/result wrappers always embed an IError err field (a pointer-bearing interface) regardless of the payload, so `?int` etc. contain a pointer even though int does not. contains_ptr called final_sym, which strips the .option/.result flags, and so classified `[]?int` as scan-free -> emitted a _noscan allocation whose live err pointer a conservative GC mark would skip (potential reclaim-while-reachable). Check the flags before final_sym.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit d112d5c)
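The ordering issue can be shown with a small sketch in C. The `Type` record, the flag names, and both functions are hypothetical stand-ins for the checker's type representation, not the compiler's actual API; the only point is that the wrapper flags are consulted before the payload kind.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified type record. */
typedef enum { KIND_INT, KIND_POINTER } Kind;
enum { FLAG_OPTION = 1u << 0, FLAG_RESULT = 1u << 1 };

typedef struct {
    Kind kind;       /* kind of the payload type */
    unsigned flags;  /* option/result wrapper flags carried by the type */
} Type;

/* Buggy shape: decides from the payload alone, as if the flags had already been stripped. */
static bool contains_ptr_payload_only(const Type *t) {
    assert(t != NULL);
    return t->kind == KIND_POINTER;
}

/* Fixed shape: a wrapper always embeds a pointer-bearing err field, so check flags first. */
static bool contains_ptr(const Type *t) {
    assert(t != NULL);
    if (t->flags & (FLAG_OPTION | FLAG_RESULT)) return true;
    return contains_ptr_payload_only(t);
}

/* An array may use a noscan allocation only when its element type holds no pointer. */
static bool array_can_be_noscan(const Type *elem) {
    return !contains_ptr(elem);
}
```

Under these assumptions an element type of `{ KIND_INT, FLAG_OPTION }` (standing in for `?int`) is no longer eligible for a noscan array, while a plain `{ KIND_INT, 0 }` still is.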
When the Perceus in-place map reuse fires, the result buffer is the receiver's own buffer (cap >= len), so the map loop can store results by ascending index instead of calling array_push. This removes the per-element call + grow-check + memmove and, by making the loop body transparent, lets the C optimizer vectorize it and elide the (non-escaping) allocation. ~3x faster than Boehm single-thread on the reuse micro-bench (prod), parity-or-better otherwise; correctness unchanged (reuse precondition already proved unique+dead+same-size).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit d7e9f5a)
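As a rough illustration of the two loop shapes, the sketch below contrasts a push with its capacity check against a direct indexed store into the receiver's own buffer. `IntArray`, `push`, and `map_in_place` are hypothetical names; the generated code operates on V's array type, which this does not reproduce.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct { int *data; size_t len; size_t cap; } IntArray;

/* General path: every element goes through a call with a grow check. */
static int push(IntArray *a, int v) {
    assert(a != NULL);
    if (a->len == a->cap) {
        size_t ncap = a->cap ? a->cap * 2 : 4;
        int *p = realloc(a->data, ncap * sizeof *p);
        if (!p) return -1;
        a->data = p;
        a->cap = ncap;
    }
    a->data[a->len++] = v;
    return 0;
}

/* Reuse path: the result buffer is the receiver's buffer, so cap >= len already
   holds and each result can be stored by ascending index with no grow check. */
static void map_in_place(IntArray *a, int (*f)(int)) {
    assert(a != NULL && a->cap >= a->len);
    for (size_t i = 0; i < a->len; i++)
        a->data[i] = f(a->data[i]);
}

static int twice(int x) { return 2 * x; }
```

The second loop has no call that can reallocate, which is what leaves the body open to the C optimizer.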
The trace ring + async-signal-safe crash dump (P3 bring-up scaffolding) are now compiled only under `-cflags -DVGC_DIAG`; without it vgc_trace/vgc_trace_init are zero-overhead no-ops and the startup vgc_say probes are removed, so a -gc vgc/e binary emits nothing to stderr. The loud span-registry-overflow abort message is preserved (vgc_say + the raw stderr write helpers stay).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit 25c1145)
…drop, HEAP_vgc arity, overflow-thread panic)
Forward-ported from the upstream-tracking clone (vlang-v-latest). All V-only, CX-agnostic, with standalone patches + CX-free repros in cx-private bench/parallel-alloc/.
1. vgc tiny-allocator free clobbers live siblings: the tiny allocator packs several sub-16B noscan objects into one span slot; vgc_free cleared the whole slot, reclaiming live tiny neighbors (short map-key char buffers) -> map corruption under -gc vgc/e. Fix: per-span is_tiny flag; vgc_free defers tiny-block slots to the tracing collector (Go model).
2. Perceus drops a store-target index var before its use: pcs_lower_stmt AssignStmt left a store target's index/container idents (m[key]=v) out of the use-set, so Perceus dropped `key` at its declaration, freeing it before map_set cloned it. Fix: non-Ident store targets collect all idents as uses.
3. HEAP_vgc macro arity: HEAP_vgc(type,expr,ptrmap,nptrs) was emitted with 2 args by some paths -> "too few arguments to function-like macro". The ptrmap is dead at runtime (conservative-mark backstop ignores it), so HEAP_vgc == HEAP. Fix: always emit plain HEAP under vgc (cgen.v + assign.v).
4. vgc overflow-thread panic: >vgc_max_threads(64) concurrent threads exhaust the fixed cache table -> cache_idx=-1 -> caches[-1] out-of-range panic. Fix: cache_idx<0 allocates from central (vgc_cache_get_span) + folds accounting into the global atomics (vgc_acct_alloc).
Validated: vlib map_test/array_test/string_test green under -gc vgc AND -gc e; g_churn battery + corpora none==e clean; cx default gate 125/125 + conformance green (no regression).
NOTE: building cx itself with -gc e surfaces 6 PRE-EXISTING cx-under-gc-e failures (they fail on the pre-fix fork too); tracked separately, not caused by these fixes.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit a95aff9)
Two CX-agnostic V-runtime changes for parallel alloc-heavy scaling under the full-STW vgc backstop (-gc e/vgc):
1. Dynamic span capacity (always-on, sound robustness fix). allspans was a fixed inline [262144]&VGC_Span; a higher GC trigger exhausted it -> the loud vgc_say(0xDEAD) abort. Now allspans is an mmap-backed pointer (vgc_os_alloc, 16M-entry default = 128MB address space, lazily committed by the OS), allocated once on the first vgc_span_alloc under vgc_heap.lock (NOT in vgc_init -- spans are allocated during _vinit, before vgc_init runs). The pointer never moves, so the collector's lock-free allspans walks (incl. lazy sweep) never see a relocated/freed buffer. Loud abort kept as the backstop at the (now huge) cap. Env override VGC_ALLSPANS_CAP.
2. Per-thread GC pacing knobs (env-gated, DEFAULT OFF -> a build with no env set is byte-identical to before). VGC_NEXT_GC_MB raises the trigger floor; VGC_PACE scales the live trigger by live_threads so N concurrent allocators don't trip the shared trigger N x more often per unit of per-thread progress.
Validated (clone, identical changes): g_churn battery 0 corruptions (default + VGC_PACE), corpora none==e byte-identical, map_test/array_test green under -gc e, v2 self-hosts; cx gate 125/125 + conformance green built against this runtime.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit 72edb9e)
Mark the live graph while mutators run, with two brief STW points (start: snapshot roots + enable barrier + alloc-black; termination: re-scan dirtied roots + dirty spans, final drain, sweep). STW stays the default and is byte-for-byte diff-able.
- card/dirty-span write barrier (vgc_wb_store): marks the mutated object's span dirty BEFORE the store; collector re-scans dirty spans at mark-termination (vgc_rescan_dirty_spans). Preemption-safe under mach-suspend (unlike an immediate-shade enqueue, which can be lost mid-barrier) and keeps the mark queue collector-exclusive.
- codegen emission in assign.v (gen_cm_write_barrier) for heap-targeted pointer-bearing stores; over-approximating, side-effect-free bases only. Fixes a gap where obj[i+k].field stores were skipped (InfixExpr index).
- builtin bulk-mutator barriers (array push/ensure_cap/set/clone/insert, map set, vgc_realloc, vgc_memdup*) for pointer moves via memcpy.
- alloc-black during mark (vgc_alloc_black_hook).
GC-assist deliberately NOT wired: unsound under preemptive suspend (an assisting mutator frozen mid-scan of a popped grey object orphans it). Heap overshoot during long marks is bounded by the existing exhaustion->force-collect net.
Default build emits no barrier calls and is unchanged. Gates green under the define: corpora none==e byte-identical, map/array tests, g_churn battery, multi-thread cm_stress, self-host, ASan == STW baseline. Measured payoff: large-live-set [par] ~1.2x faster than STW.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit d22ae0e)
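A minimal sketch of the dirty-span barrier shape described in this commit, assuming a per-span dirty flag and a global "mark in progress" flag. The names `wb_store`, `rescan_dirty_spans`, and `marking` are illustrative, not the runtime's; a real termination pass would re-scan each dirty span's objects rather than just count them.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

enum { SLOTS = 4 };

typedef struct {
    atomic_bool dirty;   /* set by mutators, consumed by the collector */
    void *slots[SLOTS];  /* pointer-bearing payload (toy) */
} Span;

static atomic_bool marking;  /* true while a concurrent mark is active */

/* Barrier: record the span as dirty BEFORE performing the store, so a mutator
   suspended between the two steps has already announced the mutation. */
static void wb_store(Span *span, size_t i, void *val) {
    assert(span != NULL && i < SLOTS);
    if (atomic_load_explicit(&marking, memory_order_acquire))
        atomic_store_explicit(&span->dirty, true, memory_order_release);
    span->slots[i] = val;
}

/* Mark termination: visit each dirty span once and clear its flag.
   Returns how many spans needed a re-scan. */
static int rescan_dirty_spans(Span *spans, size_t n) {
    assert(spans != NULL);
    int rescanned = 0;
    for (size_t i = 0; i < n; i++)
        if (atomic_exchange_explicit(&spans[i].dirty, false, memory_order_acq_rel))
            rescanned++;
    return rescanned;
}
```

Because the barrier only sets a flag, mutators never touch the mark queue, which matches the commit's point about keeping that queue collector-exclusive.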
…codegen + vgc_free central lock
Three independent soundness fixes for the `-gc e` front line (Perceus RC + precise STW tracing backstop), all general-purpose:
1. perceus.v: return-aliasing pin. A heap value bound from a CALL may alias the callee's receiver/args or sub-objects it traversed and copied out by value (e.g. a tree walk returning a []Element of shallow copies that share child buffers and can overlap one another). The old assign-aliasing pin only fired when an RHS *ident* was itself heap-owning, so such results were treated as uniquely owned and eager deep-dropped -> double free. Now any non-fresh call-bound aggregate is marked shared and left to the tracing backstop; only proven-fresh producers (map/filter over primitives, fresh string builds) stay droppable. Sound by construction: marking shared only suppresses a drop.
2. auto_free_methods.v: gen_free_for_option_ptr. A `?&T` field's free method resolved the sym to T's struct and inlined T's field frees treating the option payload as a T value, emitting `((T**)&it->data)->field` -> "base type 'T *' is not a structure". Now the pointee is freed via the base `_free` (`T_free(*(T**)&it->data)`), mirroring the non-option `&T` field case.
3. vgc_free: take the per-size-class central lock around the alloc-bit / count / free-index mutation, the same lock the allocator and the collector use. Closes a race between an eager free and a concurrent allocation of the same class. Obeys the lock-before-suspend discipline: brief, no allocation/blocking, so the collector (which pre-acquires every central lock) cannot deadlock.
Corpora byte-identical (none==e), churn battery 0-corrupt, multi-thread stress 0 mismatches, serial reuse throughput unchanged.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit 38607b2)
…ointer
The STW collector captures the triggering (collector) thread's own roots via a setjmp trampoline (vgc_run_gc_spilled) and the cooperative-safepoint park (vgc_park_spill), then conservatively scans [sp, stack_base]. Both anchored `sp` at __builtin_frame_address(0), the FRAME POINTER. But setjmp spills the callee-saved registers into its jmp_buf, which lives BELOW the frame pointer (between SP and FP). So the scanned range excluded the spill area, and a live mutator root held only in a callee-saved register (routine under -Os) was never scanned and got reclaimed while still live.
Surfaced as a sporadic signal-11 under concurrent allocation (e.g. an HTTP server hammered with many connections): a value reachable through a worker thread's registers was swept, then read after free at a scattered point in the request path.
Add vgc_real_sp() (reads the actual SP register on arm64/x86_64/i386, falls back to __builtin_frame_address) and anchor both self-scan paths at it, clamped to <= &buf so the jmp_buf spill area is always covered. Also route vgc_get_sp() through it for consistent registration/refresh ranges. Sound by construction: the scanned range only ever grows to include the true lowest in-use stack address. Suspended threads were already correct (their SP comes from thread_get_state); only the self/collector path relied on setjmp.
CX-free, general-purpose: affects -gc vgc and -gc e (shared trampoline). The compiler self-build is .no_gc so the toolchain binary is unaffected.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit 50fde69)
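The anchoring rule can be sketched as follows for GCC/Clang, assuming a downward-growing stack. `real_sp`, `scan_low_bound`, and `spill_area_is_covered` are illustrative names, not the runtime's functions; the inline assembly simply copies the stack-pointer register on the three architectures the commit names and falls back to the frame address elsewhere.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Read the actual stack pointer where we know how; otherwise fall back to the
   frame address, which may sit ABOVE the register spill area. */
static inline uintptr_t real_sp(void) {
    uintptr_t sp;
#if defined(__aarch64__)
    __asm__ volatile("mov %0, sp" : "=r"(sp));
#elif defined(__x86_64__)
    __asm__ volatile("movq %%rsp, %0" : "=r"(sp));
#elif defined(__i386__)
    __asm__ volatile("movl %%esp, %0" : "=r"(sp));
#else
    sp = (uintptr_t)__builtin_frame_address(0);
#endif
    return sp;
}

/* Lower bound of the conservative self-scan: the real SP, clamped so that it
   never lies above the jmp_buf holding the spilled callee-saved registers. */
static uintptr_t scan_low_bound(const void *jmp_buf_addr) {
    assert(jmp_buf_addr != NULL);
    uintptr_t lo = real_sp();
    uintptr_t spill = (uintptr_t)jmp_buf_addr;
    return lo < spill ? lo : spill;
}

/* Spill registers with setjmp, then check the scan range starts at or below them. */
static bool spill_area_is_covered(void) {
    jmp_buf buf;
    if (setjmp(buf) != 0) return false;  /* never reached: nothing longjmps here */
    return scan_low_bound(&buf) <= (uintptr_t)&buf;
}
```

The clamp is what makes the range sound even on the fallback path: whatever `real_sp` returns, the scan starts no higher than the spill buffer.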
…e world
The STW collector resumed all mutators (vgc_resume_thread loop) and THEN ran its end-of-cycle bookkeeping (vgc_update_trigger() and `gc_cycle++`) while mutators were already running. gc_cycle is read by mutators to stamp a freshly-acquired span's `sweep_gen` (vgc_span_alloc / vgc_central_get_span / vgc_get_free_span), and the sweep's in-flight-span guard (vgc_sweep_span) recycles an empty span only when `sweep_gen != gc_cycle`. So in the tiny window between resume and `gc_cycle++`, a resumed mutator could acquire an empty, still-in-flight span and stamp it with the OLD cycle; the next cycle's sweep then saw `sweep_gen != gc_cycle` and recycled that span out from under the mutator: a use-after-free.
Pathologically narrow and timing-sensitive (it masks under every instrumentation), so it surfaced only as a sporadic signal-11 under sustained concurrent allocation (an HTTP server hammered with many connections), on both the mach (macOS) and signal (Linux) STW backends. ThreadSanitizer (Linux) pinned it precisely: a data race on `vgc_heap.gc_cycle` between vgc_gc_start (writer) and vgc_span_alloc (reader).
Moving the bump + the trigger update ahead of the resume loop (while the world is still stopped) closes the window: every post-resume acquisition stamps exactly the cycle the next sweep checks against, and the just-run sweep still correctly skipped spans acquired during this cycle (it ran at the pre-bump cycle). After this change TSan reports zero vgc data races (was 30), and the dominant concurrent-HTTP crash is gone.
CX-free, general-purpose; affects -gc vgc and -gc e (shared STW path).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit 8baa8db)
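The window can be replayed deterministically in a single thread. The sketch below is a toy model, not the collector: `Span`, `acquire_span`, and `sweep_span` are hypothetical, and the "resumed mutator" is just a function call placed either before or after the cycle bump.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    unsigned sweep_gen;  /* cycle in which the span was acquired */
    bool empty;          /* no live objects yet */
    bool recycled;       /* set when the sweep takes the span back */
} Span;

static unsigned gc_cycle;

/* Mutator side: stamp a freshly acquired (still empty) span with the current cycle. */
static void acquire_span(Span *s) {
    assert(s != NULL);
    s->sweep_gen = gc_cycle;
    s->empty = true;
    s->recycled = false;
}

/* Collector side: the in-flight guard recycles an empty span only if it was
   stamped in a different cycle from the one being swept. */
static void sweep_span(Span *s) {
    assert(s != NULL);
    if (s->empty && s->sweep_gen != gc_cycle) s->recycled = true;
}

/* Replay the end of one cycle followed by the next cycle's sweep.
   Returns true if the span was recycled while its owner still held it. */
static bool span_recycled_under_owner(bool bump_before_resume) {
    Span s;
    gc_cycle = 7;
    if (bump_before_resume) {
        gc_cycle++;        /* bookkeeping while the world is stopped */
        acquire_span(&s);  /* resumed mutator stamps the NEW cycle */
    } else {
        acquire_span(&s);  /* resumed mutator stamps the OLD cycle */
        gc_cycle++;        /* bookkeeping arrives too late */
    }
    sweep_span(&s);        /* next cycle's sweep */
    return s.recycled;
}
```

In this model `span_recycled_under_owner(false)` reproduces the old ordering and loses the span, while `span_recycled_under_owner(true)` keeps it.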
…_alloc data race
SUBSTANTIVE FIX (vgc_d_vgc.c.v): vgc_find_span / vgc_is_heap_ptr read vgc_heap.narenas LOCK-FREE (they run on the collector's STW conservative scan, where a lock would deadlock against a frozen mutator) while vgc_span_alloc wrote it under vgc_heap.lock. That is a publication data race: a lock-free reader could observe narenas grow before the new arenas[idx] (base/size/page_span map) it gates was published, then read a stale base/size or a nil page_span -> wrong or missing span -> heap corruption. ThreadSanitizer pinpointed it (read in vgc_find_span vs write in vgc_span_alloc, global vgc_heap) under concurrent HTTP connection-teardown churn; the corruption surfaced as a segfault in the request path.
Fix: bump narenas as the LAST step of vgc_span_alloc with an atomic RELEASE store (after arenas[idx] fields + page map + allspans), and load it with an atomic ACQUIRE in the lock-free readers. TSan: 1 race -> 0. This also makes every early-return in vgc_span_alloc leave narenas consistent (a half-built arena is simply never published). CX-agnostic; reproduces with any heavily multi-threaded alloc/free workload.
DIAGNOSTIC TOOLING (inert; behind -d defines, zero codegen in normal builds; verified a plain `-gc e` build is unaffected): a mark-closure verifier (vgc_verify_mark_closure), a /proc/self/maps root-finder (vgc_rootfind_*), a data-segment dump, and a low-perturbation watch (per-cycle reset + emit-on-sweep, pub vgc_set_watch) under -d vgc_verify / -d vgc_watch; plus a coarse allocator lock (vgc_alloc_lock + a per-thread re-entrancy guard) under -d vgc_coarse_alloc. These proved the vgc mark/sweep collector is sound (the residual is allocator-side concurrency, not reclamation) and are kept for the ongoing allocator-race hunt.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit c39ce23)
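The publication pattern in the fix can be sketched with C11 atomics. This is an illustration under stated assumptions: `Arena`, `publish_arena`, and `find_arena` are made-up names, the arena record is reduced to base and size, and writers are assumed to be serialized by an external lock as the commit describes.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { MAX_ARENAS = 16 };

typedef struct { uintptr_t base; size_t size; } Arena;

static Arena arenas[MAX_ARENAS];
static atomic_uint narenas;  /* number of fully published arenas */

/* Writer (assumed to hold the heap lock): fill in the record first, then
   publish it by bumping the count with a RELEASE store as the last step. */
static bool publish_arena(uintptr_t base, size_t size) {
    assert(size > 0);
    unsigned idx = atomic_load_explicit(&narenas, memory_order_relaxed);
    if (idx >= MAX_ARENAS) return false;  /* early return: nothing half-published */
    arenas[idx].base = base;
    arenas[idx].size = size;
    atomic_store_explicit(&narenas, idx + 1, memory_order_release);
    return true;
}

/* Lock-free reader: the ACQUIRE load guarantees that every arena below the
   observed count has its fields visible. Returns the arena index or -1. */
static int find_arena(uintptr_t p) {
    unsigned n = atomic_load_explicit(&narenas, memory_order_acquire);
    for (unsigned i = 0; i < n; i++)
        if (p >= arenas[i].base && p - arenas[i].base < arenas[i].size)
            return (int)i;
    return -1;
}
```

A reader that observes the old count simply does not see the new arena yet; it can never see the new count paired with stale fields.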
Found via a reliable `Connection: close` teardown-churn repro (wrk -t8 -c256 against the cx guide HTTP server); each fix pushes the crash materially later (round 1 -> 8 -> 12). All CX-agnostic, observable behaviour byte-identical. Full cx gate green under -gc e (V-impl 125/125 + conformance + python/rust/go bindings + abi-c, prod+non-prod).
1. vgc page_span[] publication race (vgc_d_vgc.c.v). A span carved from an EXISTING arena writes arenas[i].page_span[pidx]=span under vgc_heap.lock but does NOT bump narenas, so the narenas release/acquire publication does not cover these slot writes. vgc_find_span reads page_span lock-free in the vgc_free/vgc_realloc hot path -> data race -> stale span -> heap corruption. Fix: atomic u64 RELEASE store on the slot + ACQUIRE load in vgc_find_span (paired, like narenas).
2. vgc mcache bitmap double-update (vgc_d_vgc.c.v, vgc_platform.h). The lock-free mcache fast path (vgc_span_alloc_obj) did a plain read-modify-write on span.alloc_bits/alloc_count while a concurrent cross-thread vgc_free RMW'd the SAME span's bitmap under central[class].lock -- the lock does not exclude the unlocked alloc side -> lost update -> a slot handed out twice. Fix: atomic OR (alloc) / atomic AND (free) on alloc_bits + atomic add/sub on alloc_count (new u8 fetch_or/fetch_and + sub_u32 helpers in all three cc branches). free_index stays a racy-but-safe scan hint; the STW sweep stays non-atomic (mutators suspended).
3. picoev handle_timeout stale-target assert (picoev.v). Under Connection: close teardown churn a timed-out fd can retain a residual timeouts entry whose target was already torn down (loop_id=-1, cb=nil). The assert aborted the reactor; poll_once already skips such targets. Fix: drop the stale timeout entry + skip, mirroring poll_once.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit 46d2ae5)
…2 races)
Under heavy multi-threaded alloc/free churn the vgc allocator could return
NULL for an in-range small allocation, surfacing as a null `&T{}` and a
caller null-deref. Two independent root causes, each strictly load-bearing
(ablation-confirmed); a third (a non-atomic alloc_bits read) was identified
but is not load-bearing and is left out to keep the alloc fast path
atomic-free.
* span scan missed the start byte's low bits. vgc_span_alloc_obj's two-pass
free-slot scan applied the free_index start_bit offset in BOTH passes, so
bits [0, start_bit) of the start byte were scanned in neither pass. A span
with a free low slot but a high/stale free_index (the state a fill leaves
when a cross-thread free's lowering of free_index is not yet visible) then
reported "full" and returned nil. Single-byte-bitmap spans (small nelems)
hit it whenever free_index == nelems. Fix: apply the offset only in pass 0;
the wrap pass scans the start byte from bit 0.
* GC reclaimed mcache-resident spans. vgc_sweep_span reclaims any empty span
with a stale sweep_gen, and on_central==0 means BOTH "free-floating" AND
"owned by an mcache". An empty cached span was reset (nelems=0) and pooled
by vgc_put_free_span while still referenced by the mcache slot and by a
thread suspended inside vgc_malloc -- span descriptors live outside the GC
arena, so the conservative root scan never protects them, and
vgc_fixup_caches only nulls the cache slot, not a suspended owner's local.
The owner then read a zeroed/torn span. Fix: vgc_protect_cached_spans()
stamps every mcache-resident span's sweep_gen under STW before sweep,
reusing the existing in-flight guard.
Deterministic white-box regression (CX-free): vgc_residual4_selftest in
vlib/builtin/vgc_selftest_d_vgc.c.v, driven by
bench/parallel-alloc/vgc_residual4_test.v; both checks verified to fail
without their fix. Full V _test suite + conformance green under -gc e.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit 871dced)
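The first root cause in the commit above (the span scan) is easy to reproduce on a one-byte bitmap. The sketch below is a reconstruction from the description only: `find_free_slot` and its `offset_in_both_passes` switch are hypothetical, and the real `vgc_span_alloc_obj` may be structured differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Scan one bitmap byte from first_bit upward; a clear bit is a free slot.
   Returns the slot number or -1. */
static int scan_byte(const uint8_t *bits, uint32_t nelems, uint32_t byte, uint32_t first_bit) {
    for (uint32_t bit = first_bit; bit < 8; bit++) {
        uint32_t slot = byte * 8 + bit;
        if (slot >= nelems) break;
        if (!(bits[byte] & (1u << bit))) return (int)slot;
    }
    return -1;
}

/* Two-pass scan starting at the free_index hint.
   offset_in_both_passes = true reproduces the bug; false is the fixed shape. */
static int find_free_slot(const uint8_t *bits, uint32_t nelems,
                          uint32_t free_index, bool offset_in_both_passes) {
    assert(bits != NULL && nelems > 0);
    uint32_t nbytes = (nelems + 7) / 8;
    uint32_t start_byte = free_index / 8;
    uint32_t start_bit = free_index % 8;

    /* Pass 0: from the hint to the end of the bitmap. */
    for (uint32_t b = start_byte; b < nbytes; b++) {
        int s = scan_byte(bits, nelems, b, b == start_byte ? start_bit : 0);
        if (s >= 0) return s;
    }
    /* Pass 1 (wrap): from byte 0 up to and including the start byte. */
    for (uint32_t b = 0; b <= start_byte && b < nbytes; b++) {
        uint32_t first = (offset_in_both_passes && b == start_byte) ? start_bit : 0;
        int s = scan_byte(bits, nelems, b, first);
        if (s >= 0) return s;
    }
    return -1;  /* span reported full */
}
```

For a five-slot span with bitmap `0x1E` (slot 0 free, slots 1 to 4 allocated) and `free_index == 5`, the buggy variant reports the span full while the fixed variant hands out slot 0.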
Three CX-agnostic allocator changes that recover [par] / multi-reactor scaling under -gc e without regressing serial throughput or correctness.
Root cause (measured): the alloc fast path was ~98% global-lock spin under parallel churn, NOT GC pacing and NOT memory bandwidth (8 separate processes scale; 8 in-process workers do not). Two locks dominated: vgc_heap.lock (new-span carve, whose hold included a per-span mmap()) and central[].lock (full-span return).
1. Span-descriptor bump slab (vgc_alloc_span_meta): span descriptors are never individually freed (pooled by vgc_put_free_span forever), so replace the per-carve mmap(sizeof(VGC_Span)) -- a syscall under vgc_heap.lock -- with a pointer bump + rare bulk mmap. Side effect: ~3x lower RSS (the per-span mmap wasted ~16KB of page granularity on a ~400B descriptor).
2. Drop full spans instead of returning them to central.full. The partial list was never reused (returns only land FULL spans on full; sweep relinks only fully-empty), so the per-fill return was pure central[].lock contention. A dropped span stays in allspans, is swept normally, and is reclaimed when empty via the on_central==0 path; reuse flows through the free_spans pool + the active span's free_index. Sound vs residual vlang#4 (a dropped span is no longer mcache-resident, so same-cycle reclaim is correct; it is sweep_gen-protected while still referenced during the drop).
3. Per-thread GC pacing on by default (was env-gated). Adaptive: scales the live trigger by live_threads only when >1, so single-threaded/small programs keep the historic 256MB trigger and RSS. VGC_PACE=0 disables it.
Results (best-of-5, -prod -gc e, 12-core): par 360ms = 3.8x faster than serial at a 4GB trigger (RSS 2.3GB); ~2.9x at the default with pacing; serial 1550 -> 1380ms; the alloc lock-spin leaves the hot profile entirely (now compute-bound).
Gates: residual-vlang#4 selftest PASS; cx V-impl 125/125; local MP stress 50/50; HTTP churn repro (wrk -t8 -c256 Connection:close) survived 15 rounds niltrace=0; real HTTP server sweep ~22M requests crash-free.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit 82e3934)
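A minimal sketch of the bump-slab idea from item 1, assuming descriptors are never freed individually. `MetaSlab` and `slab_alloc` are hypothetical names, the 64KB chunk size is an arbitrary choice for the example, and `malloc` stands in for the bulk `mmap` so the sketch stays portable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

enum { SLAB_CHUNK = 64 * 1024 };  /* bulk refill size (assumed for the sketch) */

typedef struct {
    char *cur;         /* next free byte in the current chunk */
    size_t remaining;  /* bytes left in the current chunk */
    unsigned refills;  /* how many bulk allocations were needed */
} MetaSlab;

/* Hand out `size` bytes by bumping a pointer; refill in bulk only when the
   current chunk is exhausted. Nothing is ever returned to the slab. */
static void *slab_alloc(MetaSlab *s, size_t size) {
    assert(s != NULL && size > 0 && size <= SLAB_CHUNK);
    size = (size + 15u) & ~(size_t)15u;    /* keep 16-byte alignment */
    if (s->remaining < size) {
        char *chunk = malloc(SLAB_CHUNK);  /* stand-in for a bulk mmap */
        if (!chunk) return NULL;
        s->cur = chunk;
        s->remaining = SLAB_CHUNK;
        s->refills++;
    }
    void *p = s->cur;
    s->cur += size;
    s->remaining -= size;
    return p;
}
```

With ~400-byte descriptors, a hundred allocations fit in a single chunk, so the expensive call happens once instead of once per span; the unused tail of an exhausted chunk is simply abandoned, which is acceptable only because descriptors live forever.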
…nostic
Genericize two comments that named the downstream consumer, keeping vlib source provider-neutral (matches 6f94f555): the mark-closure verifier's noscan-referrer note and the vgc_residual4_selftest header. Comment-only; no code change.
(cherry picked from commit 1ff9e4f)
…er -gc e
The autofree/Perceus drop for an option-pointer local (b := &?Foo{}) closed the
free call twice: the option branch wrote '.data)' (closing the call's open paren)
and then the shared tail wrote ');' again -> 'free((Foo**)b.data));' -> C error
'extraneous ) before ;'. The non-option path closes only once via that tail, so
emit '.data' (no close) and let the single tail ')' close the call. Fixes
vlib/v/tests/options/option_init_ptr_test.v under -gc e (passes under e/boehm/none;
boehm options suite 213/213, no regression). One of the documented -gc e edge bugs.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit 3bcf843)
gen_free_for_array took the string-construct branch for an option element ([]?T) because the UNWRAPPED payload sym (e.g. string) has a user free -> it emitted a call to _option_<T>_free, a wrapper method no codegen path produces -> C error 'undeclared function builtin___option_string_free'. Fix: for an option element, INLINE the payload free (check option state, free the payload via its base free on &data), mirroring the sum-type-variant-option and whole-?[]T paths. No separate _option_<T>_free method needed. Fixes option_ifguard_array_of_option_test under -gc e (options suite 213/213; boehm/none unaffected).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit 3d53776)
…e_threads
(c) the real MP-alloc lever (measured): vgc_free took the per-class central[].lock on EVERY non-tiny free, so N threads dropping the same size class serialized N-way on one lock; bench_scalar anti-scaled 35->7 Mops/s T1->T8, below boehm. (This is why the earlier 'near-linear, ~8x boehm' result regressed: the residual-vlang#4 fix added that lock for correctness.)
Fix: skip the central lock when span.on_central == 0 (resident in a thread mcache, or dropped awaiting sweep); such a span has no central-list membership to guard, and its bitmap+count mutations are individually atomic. The atomic fetch_and's prior value gates the count decrement so a double-free still cannot double-subtract. free_index stays a racy-but-safe hint. Spans actually on a central list (on_central != 0: the unregistered-overflow-thread fallback) keep the lock, preserving list consistency and the collector's lock-before-suspend fence.
Also: make live_threads ++/-- in register/unregister ATOMIC. vgc_maybe_gc reads it lock-free for per-thread GC pacing (on by default), so a plain RMW raced that atomic read (TSan-flagged; a PACE-on-by-default regression).
Result (best-of-3, -prod -gc e, 12-core): bench_scalar T1->T8 = 45->326 Mops/s (7.2x, near-linear; 5.5x boehm at T8); bench_mp T1 5.9->76.
Soundness: residual-vlang#4 white-box selftest PASS; container HTTP churn 15 rounds niltrace=0 (the exact race the lock guarded); TSan 0 warnings (was 1, the live_threads race, now fixed); cx V-impl 125/125 + full make test green.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit c69fd59)
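The lock-free bitmap update with a prior-value gate can be sketched with C11 atomics. `Span`, `span_alloc_slot`, and `span_free_slot` here are illustrative names and a simplified layout, not the runtime's types; the point is only that the value returned by the fetch operation decides whether the count changes.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { SPAN_SLOTS = 64 };

typedef struct {
    _Atomic uint8_t alloc_bits[SPAN_SLOTS / 8];  /* one bit per slot */
    atomic_uint alloc_count;                     /* number of set bits */
} Span;

/* Claim a slot with an atomic OR. Only the caller that flipped the bit from
   0 to 1 owns the slot and bumps the count. */
static bool span_alloc_slot(Span *s, unsigned slot) {
    assert(s != NULL && slot < SPAN_SLOTS);
    uint8_t mask = (uint8_t)(1u << (slot % 8));
    uint8_t prev = atomic_fetch_or(&s->alloc_bits[slot / 8], mask);
    if (prev & mask) return false;               /* already allocated */
    atomic_fetch_add(&s->alloc_count, 1u);
    return true;
}

/* Release a slot with an atomic AND. The prior value gates the decrement,
   so freeing the same slot twice cannot subtract twice. */
static bool span_free_slot(Span *s, unsigned slot) {
    assert(s != NULL && slot < SPAN_SLOTS);
    uint8_t mask = (uint8_t)(1u << (slot % 8));
    uint8_t prev = atomic_fetch_and(&s->alloc_bits[slot / 8], (uint8_t)~mask);
    if (!(prev & mask)) return false;            /* was not allocated */
    atomic_fetch_sub(&s->alloc_count, 1u);
    return true;
}
```

Each update is a single atomic read-modify-write on one byte, so an allocation and a cross-thread free of neighbouring slots cannot overwrite each other's bits, with or without a lock around either side.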
…ps root-finder
Debug-only soundness tooling used during development (compiled out by default). Removed for review clarity: vgc_verify_mark_closure, vgc_rootfind_region + the $if vgc_verify ? call sites and the verify-only globals/C decls. The low-perturbation vgc_watch_* hooks (also debug, runtime-gated by vgc_watch_addr, inert by default) are kept: they were the load-bearing diagnostic for the concurrency fixes and add negligible inert cost.
CX-free V programs backing the PR's perf/soundness claims (so reviewers can reproduce): bench_scalar (single-class MP alloc scaling), bench_mp (nested-object MP), par_reclaim (bounded-live control), cm_stress (concurrent-mark hazards), boehm_mp_bench (Boehm baseline). vgc_residual4_test (white-box fix self-check) is already present.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 56f73cde59
    // MUST be after _vinit(): _vinit zero-initializes every global (incl.
    // vgc_heap), so running vgc_init before it would have gc_enabled/next_gc
    // wiped back to 0 -> GC disabled for the whole program -> unbounded heap.
    g.writeln('\tbuiltin__vgc_init();')
Avoid re-registering threads after pre-init allocations
When _vinit allocates under -gc vgc/-gc e, the allocation path calls vgc_ensure_registered() before this generated builtin__vgc_init() runs; vgc_init() then unconditionally calls vgc_register_thread() again. That leaves the earlier cache slot still registered with the same OS thread port, so the next collection sees it as another mutator and tries to suspend/scan the collector thread itself (i != self_idx), which can self-suspend or hang as soon as GC triggers in programs with allocating global/module init. Make registration idempotent here, or avoid the second registration in all generated entry points.
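One way to make registration idempotent, sketched in C under assumptions: a thread-local slot index remembers an earlier registration, so a second call returns the same slot instead of claiming a new one. `register_thread`, `caches`, and `my_cache_idx` are hypothetical names, and the table scan is left unlocked for brevity; a real table shared between threads would need a lock or atomics.

```c
#include <assert.h>
#include <stdbool.h>

enum { MAX_THREADS = 64 };  /* matches the vgc_max_threads(64) mentioned in the PR */

typedef struct { bool in_use; } CacheSlot;

static CacheSlot caches[MAX_THREADS];
static int registered_count;

/* Per-thread memo of the slot this thread already owns (-1 = not registered).
   __thread is the GCC/Clang storage-class keyword. */
static __thread int my_cache_idx = -1;

/* Idempotent registration: the first call claims a slot, later calls from the
   same thread return it unchanged. Returns -1 if the table is full. */
static int register_thread(void) {
    if (my_cache_idx >= 0) return my_cache_idx;
    for (int i = 0; i < MAX_THREADS; i++) {
        if (!caches[i].in_use) {
            caches[i].in_use = true;
            registered_count++;
            my_cache_idx = i;
            return i;
        }
    }
    return -1;
}
```

With this shape, an early call from the allocation path and a later call from the init path would resolve to one slot per OS thread, so the collector would not find a second entry for itself.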
    vgc_cm_stw_exit(self_idx)

    vgc_update_trigger()
    vgc_heap.gc_cycle++
Bump the GC cycle before resuming concurrent mutators
In -d vgc_concurrent builds this resumes mutators before updating next_gc and incrementing gc_cycle. The STW path above explicitly moved those updates before resume because a resumed allocator can acquire/cache a span stamped with the old cycle; the next sweep then sees sweep_gen != gc_cycle and can recycle that cached span while the mutator still owns it. Move vgc_update_trigger()/vgc_heap.gc_cycle++ before vgc_cm_stw_exit(self_idx), matching the non-concurrent path.
    static inline uint32_t vgc_thread_self_port(void) { return 0; }
    static inline void vgc_suspend_thread(uint32_t t) { (void)t; }
    static inline void vgc_resume_thread(uint32_t t) { (void)t; }
Disable VGC where OS-level suspension is unavailable
On Windows/BSD this fallback registers every thread with mach_port == 0 and makes suspend/resume no-ops, but vgc_gc_start() no longer uses the old cooperative safepoint path and only suspends slots with c.mach_port != 0. As a result, a multi-threaded -gc vgc/-gc e program on these supported targets will mark and sweep while other mutators keep running, so live objects allocated or stored by those threads can be reclaimed. This needs to either reject VGC on unsupported STW platforms or retain a safe cooperative fallback.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 56f73cde59
// MUST be after _vinit(): _vinit zero-initializes every global (incl.
// vgc_heap), so running vgc_init before it would have gc_enabled/next_gc
// wiped back to 0 -> GC disabled for the whole program -> unbounded heap.
g.writeln('\tbuiltin__vgc_init();')
(1) vfmt array.v + the two vgc files (hand-edited during cherry-pick conflict resolution + the verifier strip) so code-formatting CI passes. (2) The standalone POC benchmarks are each `module main`; sitting beside vgc_residual4_test.v they collided under `v test` (duplicate `main`/`Obj`). Moved to bench/parallel-alloc/poc/ so the selftest's module is clean; benches still run individually.
…ootstrap)
The concurrent-mark write barrier vgc_wb_store is defined only in
vgc_gc_d_vgc.c.v (compiled under -d vgc, i.e. -gc e). Its call sites in
array.v/map.v sit inside `$if vgc_concurrent ? { ... }`, so they emit no
code in ordinary builds — but the checker still walks those comptime
branches, and `v -os cross` emits every branch into the generated v.c.
Both paths require the symbol to resolve in non-vgc builds, so default
(boehm) `-os cross` / bootstrap-v / build-vc failed with
"unknown function: vgc_wb_store".
Add a no-op fallback in a _notd_vgc.c.v sibling so the symbol resolves
in boehm/none builds (mutually exclusive with the real definition via
file suffix). No behavioral change: the barrier is only ever active
under `-gc e -d vgc_concurrent`.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
`v fmt -verify` (run by test-cleancode in nearly every CI job) flagged vlib/v/gen/c/perceus.v and vlib/v/gen/c/cgen.v as not vfmt'ed; also reformat the two POC bench files. Formatting-only (field/comment alignment); no logic change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
report-missing-fn-doc requires a name-leading doc comment on every new public function. Document the two introduced by this PR. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
write_heap_alloc / write_heap_alloc_close no longer use their `typ`
parameter (the precise-pointer-map HEAP_vgc variant is intentionally
unused — see the note there). V emits a `notice: unused parameter: typ`
to stderr on every build; the tools-* CI jobs treat any stderr output
from a tool compile as a failure, so the whole tools-{linux,macos,
freebsd,openbsd,docker} matrix went red.
Mark the parameter `_` (kept in the signature for call-site symmetry;
maintainers may prefer removing it outright).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Thanks for approving the CI run — that surfaced exactly what we needed. I triaged the 38 failures; they collapse to four root causes, all on our side and all low-risk, now addressed on the branch:
Worth flagging: this CI matrix builds V's default configuration, so the […]
A couple of things I'd like direction on before going further: […]
This is still a POC / "needs verification"; happy to adjust scope, split it, or hold any of it pending your read on the architecture.
…neutral) Mirrors the 4 fixes landed on the upstream PR branch (cx-home/v pr/mem-mgmt-poc, vlang#27458) into the CX build lineage. All behavior-neutral; -gc e codegen is byte-identical. - vgc_wb_store: add no-op fallback (vgc_wb_fallback_notd_vgc.c.v) so the symbol resolves in non-vgc / `-os cross` builds (dormant for CX since it builds -gc e by default, but makes a boehm/cross build robust). - cgen.v: mark dead `typ` param of write_heap_alloc[_close] `_` (kills the per-build `unused parameter` notice on stderr). - perceus.v + cgen.v: vfmt. - vgc_init / vgc_set_watch: doc comments. Gate: devbox run -- make test-vcx GREEN (V-impl 125/125 + conformance md 22/0, namespaces 16/0). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
So... I decided to run […] and for bench_mp: […]
Which means on my system, this new way is slower than both the others. 😕
Latest V, CachyOS, on 7950X CPU, 32G RAM, and NVMe disks.
Alex will have to make final decisions, but my opinion:
v2 is still not quite done yet, so we can test earlier/more stably on v1. However, checking against v2 before it is finalized could be very useful as well.
I think the core idea is interesting: Perceus-style reuse, ownership-driven drops/frees, and a tracing backstop are all directions worth studying. My concern is more about integration and timing.
I have been doing some work on the V2 ownership/autofree direction based on the ownership problems listed in #27116. In the current V2 work, the scope is deliberately staged: it collects ownership/autofree facts from the V2 post-transform FlatAst and type information, classifies transfers and release/cleanup eligibility in the V2 type/checker layer, then emits bounded cleanup first through the V2 CleanC backend and fixture tests. The goal, however, should be a backend-neutral V2 ownership/free contract. Once the model is stable, the same ownership decisions will need clear lowering rules for the native backends too, including x64, arm64, and any future backend. That model needs to cover shallow copies, escaping locals, non-owning pointers, sumtype payloads, pointer fields, globals/struct storage, etc. That makes me wary of adding another memory-management path before the V2 model is implemented, reviewed, and stabilized.
For the current C backend side, I also think this is still too experimental for […] This PR is not just adding a small isolated switch; it also changes the current C backend code generator, autofree/free-method generation, builtins, VGC runtime behavior, platform STW/root scanning, allocator behavior, and Boehm/libgc-related code. Even if […]
So in my opinion, the safest path would be to keep this as a dedicated experimental/RFC branch for now, possibly as a dedicated branch under the V org if maintainers want to explore it, and to extract the useful parts separately: benchmarks, reproducers, design notes, and small isolated fixes with tests. If the experimental branch later shows strong results across supported OSes, with correctness tests and stable performance data, then moving parts of it toward the current C backend / master path would make much more sense.
This is only my personal opinion, of course. ^^ Just to be clear, I still think this PR is interesting and definitely worth exploring further.
Really appreciate the careful look, and thank you for taking the time to actually benchmark it, @JalonSolov.
On performance: that's a fair and important data point. Our wins were measured on M2 Max (arm64/macOS); your 7950X / CachyOS numbers […]
On integration & timing, @GGRei: I think you're right, and I'd rather follow your lead than push a ~5k-line cross-cutting change at […]
On the builtin reference (@JalonSolov): agreed. I'll rework so the barrier never references a vgc-only symbol outside […]
I'll re-frame the PR accordingly. Thanks again; this is exactly the steer I was hoping for.
Of course, please keep in mind that I am nobody special in the project, just a modest contributor giving my personal opinion. Medvednikov is the real final decision-maker here, with JalonSolov's help of course.
Architecture E: a Perceus front-line + precise stop-the-world tracing backstop for V's C backend
What this is (please read first)
This is a proof-of-concept, developed with Claude (Anthropic's coding agent), shared
because the results look very positive for V and we'd like the community to verify
them. We (the CX project — a tree-walking language interpreter written in V) set out to
test whether V could meet our memory-management and multi-core needs instead of
switching to Rust, while staying aligned with V's stated direction (autofree /
reuse-in-place). It worked well enough to be worth contributing — at minimum a baseline
POC, possibly a real contribution.
Direct about the claims: every performance number below was measured on our machines
and our workloads only — no broad independent benchmark suite, no third-party review.
Treat them as claims to verify, not facts. The correctness work is firmer (TSan, a
deterministic white-box self-check, and a churn reproducer — all included) but also wants
independent eyes. We'd value the community pressure-testing both.
Context: follows up on the Perceus discussion #27166 (and a Discord exchange where
@JalonSolov suggested a PR so Alex could look it over). Open question for maintainers
up front: target v1 (current master, where it's built + tested) or plan for v2? If v2
reworks the backend/codegen we're happy to advise on a port — better to know before deep
review.
All changes are provider-neutral V-runtime / codegen work: CX was the workload that
surfaced the bugs and motivated the optimizations, but nothing here is specific to it
(source scrubbed of downstream-specific naming).
Summary (TL;DR)
This adds a new opt-in memory-management mode for the C backend, `-gc e`, that pairs
a Perceus-style reference-counting front line (compiler-emitted, in-place reuse) with
a precise, from-scratch stop-the-world tracing collector (`vgc`) as the backstop,
plus the bug fixes and allocator optimizations that made the combination sound and fast
under heavy multi-threaded allocation. The front line reclaims the common, uniquely-owned
case with zero tracing; the backstop reclaims arbitrary aliased/cyclic graphs that RC
cannot. `vgc` alone is also usable (`-gc vgc`).
Why: Boehm (V's default conservative GC) anti-scales on alloc-heavy multicore
workloads (its parallel marker and alloc lock serialize mutators) and over-retains
(conservative). E targets both single-thread throughput (reuse-in-place avoids
allocation) and multicore scaling (per-thread allocation + accounting, no shared
alloc-path lock in the steady state).
Status / honesty: developed and gated against one large real consumer + a battery of
provider-neutral micro-benchmarks (included). The performance numbers below are
measured on those workloads and should be independently verified before any claim is
relied on. STW is the default collection strategy; concurrent mark is behind a separate
`-d vgc_concurrent` and is not proposed for default. The verification tooling
(mark-closure verifier, root-finder) is compiled out unless `-d vgc_verify`.
Architecture & rationale
V today offers Boehm (conservative, default), `-autofree` (compiler-managed scope frees,
single-ownership assumption), and `-gc none`. None gives both low allocation and
linear multicore scaling for an allocation-heavy program with aliased/graph-shaped data.
Architecture E = front line + backstop, decoupled from `-autofree`:
Perceus front line (`vlib/v/gen/c/perceus.v`, new). A compile-time ownership/share
analysis emits in-place reuse and drop for values it can prove uniquely owned,
reclaiming the common case without touching the collector. Crucially it is decoupled
from `-autofree`: `-autofree` restructures codegen assuming sole ownership and is
incompatible with a backing collector (corrupts under any GC; demonstrated). E runs
the drop analysis off its own `perceus` define, so Perceus drops are the sole frees and
the analysis stays sound (it pins assignment-aliases, call-result aliases, and any
value whose heap field is exposed → those fall through to the backstop).
Precise STW tracing backstop (`vlib/builtin/vgc_*.c.v`, new). A from-scratch
mark/sweep collector with a Go-`mcache`-style segregated allocator (per-thread span
caches, per-size-class central lists, arena-backed spans). It reclaims what RC can't
(cycles, aliased graphs) and runs rarely because the front line absorbs most frees.
Precise (type-driven) marking where sound; conservative stack/register scanning for
roots. Mutators are stopped via OS-level suspend (mach / signal).
The hybrid is the point. RC alone leaks cycles; tracing alone pays full mark cost
on every cycle. Perceus handles the dominant uniquely-owned case in-place; the tracing
backstop is the correctness net for the rest. This mirrors Koka/Lean's Perceus + a
collector, adapted to V (which lacks a uniform per-object header, so the backstop owns
arbitrary-graph reclamation rather than a global RC header scheme).
Isolation-for-scaling doctrine: linear multicore scaling comes from per-thread
isolation of the allocation path (per-thread span caches + per-thread heap accounting),
not from a faster shared collector. The collector is the rare backstop; the steady-state
alloc/free fast path touches no shared cacheline or lock.
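Per-thread accounting of this kind can be sketched in a few lines of C11. This is an illustrative model only: the names (`demo_account`, `demo_live_delta`, `DEMO_FLUSH_BYTES`) are hypothetical, and the ~1 MB threshold is taken from the optimization list further down rather than from the actual source.

```c
#include <stdatomic.h>

#define DEMO_FLUSH_BYTES (1LL << 20)   /* ~1 MB flush threshold (assumed) */

static _Atomic long long demo_global_live;       /* shared, touched rarely */
static _Thread_local long long demo_live_delta;  /* private to each thread */

/* Record a signed byte change; publish only when the batch is large. */
void demo_account(long long bytes) {
    demo_live_delta += bytes;
    if (demo_live_delta >= DEMO_FLUSH_BYTES ||
        demo_live_delta <= -DEMO_FLUSH_BYTES) {
        atomic_fetch_add_explicit(&demo_global_live, demo_live_delta,
                                  memory_order_relaxed);
        demo_live_delta = 0;
    }
}

/* Read the shared total (may lag each thread by up to the threshold). */
long long demo_global_live_bytes(void) {
    return atomic_load_explicit(&demo_global_live, memory_order_relaxed);
}
```

The trade-off is that the shared total lags by up to one threshold per thread, which is acceptable for a pacing heuristic and keeps the shared cacheline out of the allocation fast path.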
Bugs fixed (correctness)
Each is provider-neutral and was reproduced under heavy concurrent alloc/free (a
multi-reactor HTTP server + churn micro-benchmarks). The commit hashes below are the
originating development commits; each commit on this PR branch carries a
`(cherry picked from commit …)` trailer, and the PR's Commits tab is the authoritative
per-change view.
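Two of the fixes in this section concern publishing allocator metadata to a lock-free reader with release/acquire ordering. The general pattern is sketched below in C11; it is a model under assumptions, not the `vgc` source. `demo_arena`, `demo_publish_arena`, and `demo_find_arena` are hypothetical names, and the writer is assumed to be serialized by a lock that is not shown.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_ARENAS 4

typedef struct { uintptr_t base; size_t len; } demo_arena;

static demo_arena demo_arenas[DEMO_MAX_ARENAS];
static _Atomic unsigned demo_narenas;   /* number of fully initialized slots */

/* Writer (serialized by the allocator lock in a real design). */
int demo_publish_arena(uintptr_t base, size_t len) {
    unsigned n = atomic_load_explicit(&demo_narenas, memory_order_relaxed);
    if (n >= DEMO_MAX_ARENAS) return -1;
    demo_arenas[n].base = base;          /* 1. fill the slot ... */
    demo_arenas[n].len = len;
    /* 2. ... then publish it: release orders the writes above first. */
    atomic_store_explicit(&demo_narenas, n + 1, memory_order_release);
    return (int)n;
}

/* Lock-free reader: the acquire load pairs with the release store. */
int demo_find_arena(uintptr_t addr) {
    unsigned n = atomic_load_explicit(&demo_narenas, memory_order_acquire);
    for (unsigned i = 0; i < n; i++) {
        if (addr >= demo_arenas[i].base &&
            addr - demo_arenas[i].base < demo_arenas[i].len) return (int)i;
    }
    return -1;
}
```

A reader that observes the new count is then guaranteed to also observe the slot contents written before it, which is the property a plain store to the count does not give.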
| # | Fix | Commit | Problem |
|---|---|---|---|
| 1 | […] | 50fde691 | `setjmp` spills callee-saved regs below the FP; an FP-anchored scan missed a live root held only in a spilled reg → reclaimed-while-live. |
| 2 | […] `gc_cycle` + GC trigger under STW, before resuming the world | 8baa8db0 | […] `sweep_gen` with the old cycle let the next sweep recycle a still-in-flight span (UAF). TSan: 30→0 races. |
| 3 | […] `narenas` with release/acquire | c39ce23f | `vgc_find_span` read `narenas` while `span_alloc` wrote it under lock: a publication race on the arena it gates. TSan-pinpointed. |
| 4 | […] `page_span` slots with release/acquire | 46d2ae5a | […] `narenas`, so the page-map writes weren't published to the lock-free `find_span` reader → stale span. |
| 5 | […] `fetch_or`/`fetch_and` + atomic count | 46d2ae5a | […] `alloc_bits` (alloc fast path) raced a cross-thread `free`'s RMW under the central lock → one slot handed out twice. |
| 6 | `vgc_span_alloc_obj` two-pass scan start-byte coverage | 871dceda | […] `[0,start_bit)` of the start byte scanned in neither → a span with a free low slot reported "full" → `vgc_malloc` NULL → caller null-deref. |
| 7 | […] | 871dceda | […] `sweep_gen` under STW. |
| 8 | […] `ptrmap` | 004b02f2 | `ptrmap` was a per-span property set by the first typed alloc, but a size class packs many types → objects whose layout differed had live child pointers skipped → reclaimed-while-reachable. Conservative scanning over-retains, never under-retains. |
| 9 | `?&T` free-method codegen + `vgc_free` central lock | 38607b2d | […] `_option_*` wrapper (C compile error); (c) `vgc_free` now takes the per-class central lock (was a real MP soundness gap). |
| 10 | […] `?SumType`/`?[]T` fields | 26ac2bbe | `gen_free_for_sumtype`/`_array` emitted `it->_typ`/`it->len` on the `_option_*` wrapper → C error for any program freeing such a field under autofree/Perceus. |
| 11 | `contains_ptr` treats `?T`/`!T` as pointer-bearing | d112d5c8 | `[]?int` was flagged noscan (the option strips to `.int`), but `_option_int` carries an `IError` pointer → a pointer-bearing object marked noscan. |
| 12 | `-gc e` correctness fixes: map tiny-free, Perceus drop, HEAP_vgc arity, overflow-thread panic | a95aff916b | […] `vgc_max_threads` case that indexed `caches[-1]` and recursed through malloc in the panic path. The HEAP_vgc-arity fix alone cleared 22 of 34 of V's own `-gc e` test failures.) |
| 13 | […] when freeing an option-pointer local (`b := &?Foo{}`) | 3bcf843fb9 | […] `free((Foo**)b.data));` C error. Fixes `option_init_ptr_test` under `-gc e`; boehm/none unaffected. |
| 14 | […] `[]?T` | 3d537762 | […] `?string` referenced `_option_string_free`, a wrapper no path generated (the unwrapped sym has a user `free` → string-construct branch). Now inline the option-element payload free. Fixes `option_ifguard_array_of_option_test`. |
| 15 | […] `live_threads` in register/unregister | c69fd59b | `vgc_maybe_gc` reads `live_threads` lock-free for per-thread GC pacing; the plain `++`/`--` raced that atomic read (TSan-flagged). |

Optimizations (performance)
| Optimization | Commit | Notes |
|---|---|---|
| […] `live_delta`/`alloc_delta`, flush to the global atomics only every ~1 MB | 677770dd/38607b2d | |
| […] `on_central == 0`): `vgc_free` skips the per-class `central[].lock` (kept only for spans actually on a central list); bitmap+count stay atomic, the `fetch_and` prior value gates the decrement (double-free-safe) | c69fd59b | […] `bench_scalar`: 8 threads alloc+drop one 32 B class) serialized N-way and anti-scaled (35→7 Mops/s T1→T8, below Boehm). With the skip it is near-linear again: 45→326 Mops/s T1→T8 (7.2×, ~5.5× Boehm at T8); `bench_mp` T1 5.9→76. Verified residual-#4-safe (white-box selftest + container churn 15 rounds niltrace=0 + TSan 0). |
| […] `&Foo`, gated by a sound deep-drop analysis | 677770dd d7e9f5a172edb9e5 | |
| `-d vgc_concurrent` (opt-in, STW stays default) | d22ae0ee | |
| […] `mmap` from under the heap lock; ~3× lower RSS) + drop full spans instead of returning them to the central full-list (the never-reused per-fill central-lock traffic) + per-thread GC pacing on by default (adaptive: only when >1 mutator) | 82e39343 | |

Profile evidence: under parallel churn the alloc fast path was ~98% spin on two global
locks (`vgc_heap.lock` for span carving, whose hold included an `mmap` syscall, and the
per-class central lock for span return); 8 separate processes scaled but 8 in-process
workers did not, isolating the cost to in-process shared allocator state (not bandwidth).
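The "prior value gates the decrement" idea from the lock-free free path can be modeled with C11 atomics. The following is a sketch under assumptions, not the allocator's code: `demo_span`, `demo_claim_slot`, and `demo_free_slot` are hypothetical names, and the span is fixed at 64 slots.

```c
#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    _Atomic uint8_t alloc_bits[8];   /* one bit per slot, 64 slots */
    _Atomic int nalloc;              /* live-slot count */
} demo_span;

demo_span demo_one_span;             /* zero-initialized: all slots free */

/* Claim a slot; returns 1 only if this caller flipped the bit 0 -> 1. */
int demo_claim_slot(demo_span *s, unsigned slot) {
    uint8_t mask = (uint8_t)(1u << (slot & 7u));
    uint8_t prev = atomic_fetch_or(&s->alloc_bits[slot >> 3], mask);
    if (prev & mask) return 0;       /* someone else already owns it */
    atomic_fetch_add(&s->nalloc, 1);
    return 1;
}

/* Free a slot; the prior value decides whether the count is decremented. */
int demo_free_slot(demo_span *s, unsigned slot) {
    uint8_t mask = (uint8_t)(1u << (slot & 7u));
    uint8_t prev = atomic_fetch_and(&s->alloc_bits[slot >> 3], (uint8_t)~mask);
    if (!(prev & mask)) return 0;    /* bit already clear: double free, ignore */
    atomic_fetch_sub(&s->nalloc, 1);
    return 1;
}

/* Current live-slot count. */
int demo_live_slots(demo_span *s) { return atomic_load(&s->nalloc); }
```

Because each read-modify-write returns the byte's previous value, exactly one caller wins each bit transition, so a repeated free neither corrupts the bitmap nor drives the count negative, and no lock is needed for a span that only its owner touches.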
Soundness evidence
- […] `-fsanitize=thread`) found and confirmed fixes #2, #3, #5 (race count → 0 after each).
- […] `vlib/builtin/vgc_selftest_d_vgc.c.v` (driven by `bench/parallel-alloc/vgc_residual4_test.v`); reverting either fix fails it.
- […] `bench/parallel-alloc/`): `g_churn` (alloc/free/realloc storms, multi-thread), `bench_mp`/`bench_scalar` (MP alloc scaling), `par_live` (large concurrent live sets), `cm_stress` (concurrent-mark hazards), `cm_barrier_proto.c` (the write-barrier model with deterministic teeth).
- […] `-gc none`/`boehm`/`vgc`/`e`.
Scope / what to review carefully
- […] `vgc_d_vgc.c.v` (allocator) and `vgc_gc_d_vgc.c.v` (collector) first, then `perceus.v` (analysis) and the codegen touch-points (`assign.v`, `auto_free_methods.v`, `autofree.v`, `cgen.v`, `fn.v`).
- […] `-d vgc_concurrent` (needs a sound GC-assist), `vgc_verify` tooling (debug-gated), and the experimental `cx_region.c.v`/transport-layer patches (consumer-specific; excluded from this proposal).
- […] `a83aabb10f`; a rebase onto current master is required.
- […] already covers owner-frees of mcache-resident spans, the dominant case; a complete mimalloc-style per-span atomic thread-free list would also make cross-thread frees of central-listed spans lock-free); sound concurrent-mark GC-assist (cooperative safepoints); generational option; and V's own `-gc e` codegen edge cases: ~10 of 2146 `vlib/v/tests` programs (all `-gc e`-specific, pass under `none`/`boehm`), characterized as three families:
  (1) option-wrapper / generic / sub-module `_free` not generated, e.g. an array of `?string` references `builtin___option_string_free` but the value-option wrapper free is never emitted (free-method generation vs `-skip-unused` DCE; the unwrapped element sym has a user `free`, so the option-wrapper free path is skipped);
  (2) reflection metadata reclaimed (4 reflection / generic-anon-fn tests segfault);
  (3) Perceus string early-drop (3 tmpl/comptime/interface-str tests produce truncated/aliased strings).
  These touch shared autofree/option/Perceus codegen (boehm-regression-sensitive) and runtime mark soundness; each warrants a dedicated pass, not bundled here. Fix #13 above cleared one (`option_init_ptr`).
Test environment (so the numbers mean something — and what's NOT covered)
Everything below was measured on a single machine. This is a real limitation: we have
not tested other CPUs, x86, or native (non-virtualized) Linux. Please reproduce on your
own hardware.
- […] efficiency), 64 GB RAM, macOS 26.4.1 (build 25E253), Apple clang 21.0.0. `-prod` builds via `-cc cc`.
- […] clang 18.1.3, wrk 4.1.0), `aarch64`, i.e. Linux 6.12 (linuxkit) running in Docker's VM on that same M2 Max, not a separate native or x86 host. TSan + the concurrent-HTTP churn reproducer ran here. So: arm64 only; x86, native Linux, and other core counts are unverified. The collector's conservative stack/register scan and the OS-suspend STW path are platform-sensitive, so independent runs on x86/native Linux are exactly the verification we're asking for.
- `-gc boehm` is the baseline.
How to verify