Architecture E (POC, opt-in `-gc e`): Perceus reuse-in-place front line + precise STW tracing GC backstop for the C backend by eptx · Pull Request #27458 · vlang/v

eptx · 2026-06-15T00:40:01Z

Architecture E: a Perceus front-line + precise stop-the-world tracing backstop for V's C backend

What this is (please read first)

This is a proof-of-concept, developed with Claude (Anthropic's coding agent), shared
because the results look very positive for V and we'd like the community to verify
them. We (the CX project — a tree-walking language interpreter written in V) set out to
test whether V could meet our memory-management and multi-core needs instead of
switching to Rust, while staying aligned with V's stated direction (autofree /
reuse-in-place). It worked well enough to be worth contributing — at minimum a baseline
POC, possibly a real contribution.

To be direct about the claims: every performance number below was measured on our machines
and our workloads only, with no broad independent benchmark suite and no third-party review.
Treat them as claims to verify, not facts. The correctness work is firmer (TSan, a
deterministic white-box self-check, and a churn reproducer, all included) but also needs
independent eyes. We'd value the community pressure-testing both.

Context: follows up on the Perceus discussion #27166 (and a Discord exchange where
@JalonSolov suggested a PR so Alex could look it over). Open question for maintainers
up front: target v1 (current master, where it's built + tested) or plan for v2? If v2
reworks the backend/codegen we're happy to advise on a port — better to know before deep
review.

All changes are provider-neutral V-runtime / codegen work: CX was the workload that
surfaced the bugs and motivated the optimizations, but nothing here is specific to it
(source scrubbed of downstream-specific naming).

Summary (TL;DR)

This adds a new opt-in memory-management mode for the C backend, -gc e, that pairs
a Perceus-style reference-counting front line (compiler-emitted, in-place reuse) with
a precise, from-scratch stop-the-world tracing collector (vgc) as the backstop —
plus the bug fixes and allocator optimizations that made the combination sound and fast
under heavy multi-threaded allocation. The front line reclaims the common, uniquely-owned
case with zero tracing; the backstop reclaims arbitrary aliased/cyclic graphs that RC
cannot. vgc alone is also usable (-gc vgc).

Why: Boehm (V's default conservative GC) anti-scales on alloc-heavy multicore
workloads (its parallel marker and alloc lock serialize mutators) and over-retains
(conservative). E targets both single-thread throughput (reuse-in-place avoids
allocation) and multicore scaling (per-thread allocation + accounting, no shared
alloc-path lock in the steady state).

Status / honesty: developed and gated against one large real consumer + a battery of
provider-neutral micro-benchmarks (included). The performance numbers below are
measured on those workloads and should be independently verified before any claim is
relied on. STW is the default collection strategy; concurrent mark is behind a separate
-d vgc_concurrent and is not proposed for default. The verification tooling
(mark-closure verifier, root-finder) is compiled out unless -d vgc_verify.

Architecture & rationale

V today offers Boehm (conservative, default), -autofree (compiler-managed scope frees,
single-ownership assumption), and -gc none. None gives both low allocation and
linear multicore scaling for an allocation-heavy program with aliased/graph-shaped data.

Architecture E = front line + backstop, decoupled from -autofree:

1. Perceus front line (vlib/v/gen/c/perceus.v, new). A compile-time ownership/share
analysis emits in-place reuse and drop for values it can prove uniquely owned,
reclaiming the common case without touching the collector. Crucially it is decoupled
from -autofree: -autofree restructures codegen assuming sole ownership and is
incompatible with a backing collector (it corrupts under any GC; demonstrated). E runs
the drop analysis off its own perceus define, so Perceus drops are the sole frees and
the analysis stays sound (it pins assignment-aliases, call-result aliases, and any
value whose heap field is exposed → those fall through to the backstop).
2. Precise STW tracing backstop (vlib/builtin/vgc_*.c.v, new). A from-scratch
mark/sweep collector with a Go-mcache-style segregated allocator (per-thread span
caches, per-size-class central lists, arena-backed spans). It reclaims what RC can't
(cycles, aliased graphs) and runs rarely because the front line absorbs most frees.
Precise (type-driven) marking where sound; conservative stack/register scanning for
roots. Mutators are stopped via OS-level suspend (mach / signal).
3. The hybrid is the point. RC alone leaks reference cycles; tracing alone pays the full
mark cost on every collection. Perceus handles the dominant uniquely-owned case in-place;
the tracing backstop is the correctness net for the rest. This takes Koka/Lean-style
Perceus and puts a collector behind it, adapted to V (which lacks a uniform per-object
header, so the backstop owns arbitrary-graph reclamation rather than a global RC header
scheme).

Isolation-for-scaling doctrine: linear multicore scaling comes from per-thread
isolation of the allocation path (per-thread span caches + per-thread heap accounting),
not from a faster shared collector. The collector is the rare backstop; the steady-state
alloc/free fast path touches no shared cacheline or lock.
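
The accounting half of this doctrine can be sketched in a few lines of C. This is a minimal illustration, not the PR's code: `ThreadAcct`, `acct_alloc`, `acct_free`, `acct_flush` and `global_live_bytes` are hypothetical names, and the 1 MB flush threshold is taken from the "~1 MB" figure quoted for optimization A. In a real runtime each thread would own one `ThreadAcct` (for example in thread-local storage).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical threshold: publish to the shared counter about once per MB allocated. */
#define FLUSH_BYTES (1 << 20)

/* Shared counter: written only when a thread flushes, never on the fast path. */
static _Atomic int64_t global_live_bytes;

typedef struct {
    int64_t live_delta;  /* net bytes (allocs minus frees) not yet published */
    int64_t alloc_delta; /* bytes allocated since the last flush */
} ThreadAcct;

/* Publish the private deltas with a single atomic add, then reset them. */
static void acct_flush(ThreadAcct *t) {
    atomic_fetch_add_explicit(&global_live_bytes, t->live_delta, memory_order_relaxed);
    t->live_delta = 0;
    t->alloc_delta = 0;
}

/* Fast path: plain thread-private arithmetic; the shared cacheline is untouched. */
static void acct_alloc(ThreadAcct *t, int64_t n) {
    assert(n >= 0);
    t->live_delta += n;
    t->alloc_delta += n;
    if (t->alloc_delta >= FLUSH_BYTES) acct_flush(t);
}

static void acct_free(ThreadAcct *t, int64_t n) {
    assert(n >= 0);
    t->live_delta -= n;
}
```

The trade-off is that the global figure lags each thread by up to one flush interval, which is acceptable when it only drives a GC trigger rather than an exact limit.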

Bugs fixed (correctness)

Each is provider-neutral and was reproduced under heavy concurrent alloc/free (a
multi-reactor HTTP server + churn micro-benchmarks). The commit hashes below are the
originating development commits; each commit on this PR branch carries a
(cherry picked from commit …) trailer, and the PR's Commits tab is the authoritative
per-change view.

#	Fix	Commit	Notes
1	Collector self-scan anchored at the real SP, not the frame pointer	`50fde691`	`setjmp` spills callee-saved regs below the FP; an FP-anchored scan missed a live root held only in a spilled reg → reclaimed-while-live.
2	Advance `gc_cycle` + GC trigger under STW, before resuming the world	`8baa8db0`	A resumed mutator stamping a fresh span's `sweep_gen` with the old cycle let the next sweep recycle a still-in-flight span (UAF). TSan: 30→0 races.
3	Publish `narenas` with release/acquire	`c39ce23f`	Lock-free `vgc_find_span` read `narenas` while `span_alloc` wrote it under lock — publication race on the arena it gates. TSan-pinpointed.
4	Publish `page_span` slots with release/acquire	`46d2ae5a`	Spans carved from an existing arena don't bump `narenas`, so the page-map writes weren't published to the lock-free `find_span` reader → stale span.
5	mcache bitmap atomic `fetch_or`/`fetch_and` + atomic count	`46d2ae5a`	Unlocked RMW on `alloc_bits` (alloc fast path) raced a cross-thread `free`'s RMW under the central lock → one slot handed out twice.
6	`vgc_span_alloc_obj` two-pass scan start-byte coverage	`871dceda`	The free-index offset was applied in both passes, leaving `[0,start_bit)` of the start byte scanned in neither → a span with a free low slot reported "full" → `vgc_malloc` NULL → caller null-deref.
7	Don't reclaim mcache-resident spans in sweep	`871dceda`	A cached span that momentarily empties was recycled while still referenced by an mcache slot / a suspended owner's local (span descriptors live outside the GC arena, so the root scan can't protect them). Fix: stamp registered threads' cached spans' `sweep_gen` under STW.
8	Conservative backstop mark; drop the unsound per-span `ptrmap`	`004b02f2`	`ptrmap` was a per-span property set by the first typed alloc, but a size class packs many types → objects whose layout differed had live child pointers skipped → reclaimed-while-reachable. Conservative scanning over-retains, never under-retains.
9	Sound eager-drop for aliased call results + `?&T` free-method codegen + `vgc_free` central lock	`38607b2d`	(a) a heap value bound from a call may alias the callee's traversed sub-objects → pin it (backstop, don't deep-drop); (b) option-of-pointer free emitted a struct member-access on the `_option_*` wrapper (C compile error); (c) `vgc_free` now takes the per-class central lock (was a real MP soundness gap).
10	Option-aware free methods for `?SumType` / `?[]T` fields	`26ac2bbe`	`gen_free_for_sumtype`/`_array` emitted `it->_typ`/`it->len` on the `_option_*` wrapper → C error for any program freeing such a field under autofree/Perceus.
11	`contains_ptr` treats `?T` / `!T` as pointer-bearing	`d112d5c8`	`[]?int` was flagged noscan (the option strips to `.int`), but `_option_int` carries an `IError` pointer → a pointer-bearing object marked noscan.
12	Four `-gc e` correctness fixes: map tiny-free, Perceus drop, HEAP_vgc arity, overflow-thread panic	`a95aff916b`	Incl. the >`vgc_max_threads` case that indexed `caches[-1]` and recursed through malloc in the panic path. The HEAP_vgc-arity fix alone cleared 22 of 34 of V's own `-gc e` test failures.
13	Drop extraneous `)` when freeing an option-pointer local (`b := &?Foo{}`)	`3bcf843fb9`	The option branch closed the free call's paren and the shared tail closed it again → `free((Foo**)b.data));` C error. Fixes `option_init_ptr_test` under `-gc e`; boehm/none unaffected.
14	Generate the option-element free for `[]?T`	`3d537762`	An array of `?string` referenced `_option_string_free`, a wrapper no path generated (the unwrapped sym has a user `free` → string-construct branch). Now inline the option-element payload free. Fixes `option_ifguard_array_of_option_test`.
15	Atomic `live_threads` in register/unregister	`c69fd59b`	`vgc_maybe_gc` reads `live_threads` lock-free for per-thread GC pacing; the plain `++/--` raced that atomic read (TSan-flagged).
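
Fix #6 in the table above is easiest to see on a toy bitmap. The sketch below is an assumption-laden simplification: it scans bit by bit (the real allocator presumably works on whole bytes), and `find_free_slot` is a hypothetical name. What it shows is the invariant the fix restores: the wrap-around pass has to cover every slot below the start index, including the low bits of the byte the scan started in.

```c
#include <assert.h>
#include <stdint.h>

/* Returns the index of a free (zero) bit among nslots slots, or -1 if all are set.
   `start` plays the role of the free-index hint. */
static int find_free_slot(const uint8_t *alloc_bits, int nslots, int start) {
    assert(nslots > 0 && start >= 0 && start < nslots);

    /* Pass 1: from the hint to the end of the span. */
    for (int i = start; i < nslots; i++)
        if (!(alloc_bits[i >> 3] & (1u << (i & 7)))) return i;

    /* Pass 2: wrap around and scan [0, start). Applying the start offset here as
       well would skip the low bits of the start byte, which is the reported bug. */
    for (int i = 0; i < start; i++)
        if (!(alloc_bits[i >> 3] & (1u << (i & 7)))) return i;

    return -1; /* genuinely full */
}
```

With 16 slots, a hint of 10 and only slot 9 free, the free slot sits in the same byte as the hint but below it, so only a complete second pass finds it.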

Optimizations (performance)

#	Optimization	Commit	Claimed effect (verify)
A	Per-thread heap accounting (Go per-P style): alloc/free bump thread-private `live_delta`/`alloc_delta`, flush to the global atomics only every ~1 MB	`677770dd` / `38607b2d`	Removes global-atomic cacheline contention on the accounting path.
A2	Lock-free free fast path for mcache-resident / dropped spans (`on_central == 0`): `vgc_free` skips the per-class `central[].lock` (kept only for spans actually on a central list); bitmap+count stay atomic, the `fetch_and` prior value gates the decrement (double-free-safe)	`c69fd59b`	The residual-#4 fix had added that lock to every non-tiny free → a same-class free storm (`bench_scalar`: 8 threads alloc+drop one 32 B class) serialized N-way and anti-scaled (35→7 Mops/s T1→T8, below Boehm). With the skip it is near-linear again: 45→326 Mops/s T1→T8 (7.2×, ~5.5× Boehm at T8); `bench_mp` T1 5.9→76. Verified residual-#4-safe (white-box selftest + container churn 15 rounds niltrace=0 + TSan 0).
B	Perceus deep-free of nested heap fields of a dropped `&Foo`, gated by a sound deep-drop analysis	`677770dd`	Nested-object MP T1→T8 ~7.5× (near-linear); removes the GC pressure that compounded MP contention.
C	In-place reuse: direct indexed stores for reused map slots	`d7e9f5a1`	Avoids re-hashing on reuse-in-place.
D	Dynamic span registry (mmap-backed, lazily committed) + env-gated GC pacing	`72edb9e5`	Removes a fixed 262k-span cap; lets the trigger scale without a hard abort.
E	Concurrent tri-color mark behind `-d vgc_concurrent` (opt-in, STW stays default)	`d22ae0ee`	~1.2–1.4× on a parallel alloc-heavy fold vs STW; not proposed for default (needs a sound GC-assist first).
F	Alloc-path lock-contention removal (B18): span-descriptor bump slab (drops a per-carve `mmap` from under the heap lock; ~3× lower RSS) + drop full spans instead of returning them to the central full-list (the never-reused per-fill central-lock traffic) + per-thread GC pacing on by default (adaptive — only when >1 mutator)	`82e39343`	Parallel alloc-heavy workload recovered from anti-scaling (≈parity) to ~3.8× its serial at a high trigger; ~3× lower RSS.

Profile evidence: under parallel churn the alloc fast path was ~98% spin on two global
locks (vgc_heap.lock for span carving — whose hold included an mmap syscall — and the
per-class central lock for span return); 8 separate processes scaled but 8 in-process
workers did not, isolating the cost to in-process shared allocator state (not bandwidth).

Soundness evidence

TSan (Linux, clang -fsanitize=thread) found and confirmed fixes #2, #3, #5
(race count → 0 after each).
Deterministic white-box self-check for fixes #6/#7: vlib/builtin/vgc_selftest_d_vgc.c.v
(driven by bench/parallel-alloc/vgc_residual4_test.v) — reverting either fix fails it.
Churn batteries (provider-neutral, in bench/parallel-alloc/): g_churn
(alloc/free/realloc storms, multi-thread), bench_mp/bench_scalar (MP alloc scaling),
par_live (large concurrent live sets), cm_stress (concurrent-mark hazards),
cm_barrier_proto.c (the write-barrier model with deterministic teeth).
Differential: program output byte-identical across -gc none / boehm / vgc / e.
Long multi-reactor HTTP soundness run: tens of millions of requests, crash-free.

Scope / what to review carefully

New GC backend is large; suggest reviewing vgc_d_vgc.c.v (allocator) and
vgc_gc_d_vgc.c.v (collector) first, then perceus.v (analysis) and the codegen
touch-points (assign.v, auto_free_methods.v, autofree.v, cgen.v, fn.v).
Not for default upstream: -d vgc_concurrent (needs a sound GC-assist),
vgc_verify tooling (debug-gated), and the experimental cx_region.c.v /
transport-layer patches (consumer-specific; excluded from this proposal).
Based on a83aabb10f; a rebase onto current master is required.
Known follow-ups: full deferred cross-thread free (the targeted lock-free path #A2
already covers owner-frees of mcache-resident spans — the dominant case; a complete
mimalloc-style per-span atomic thread-free list would also make cross-thread frees of
central-listed spans lock-free); sound
concurrent-mark GC-assist (cooperative safepoints); generational
option; and V's own -gc e codegen edge cases — ~10 of 2146 vlib/v/tests programs
(all -gc e-specific, pass under none/boehm), characterized as three families:
(1) option-wrapper / generic / sub-module _free not generated — e.g. an array of
?string references builtin___option_string_free but the value-option wrapper free is
never emitted (free-method generation vs -skip-unused DCE; the unwrapped element sym
has a user free, so the option-wrapper free path is skipped);
(2) reflection metadata reclaimed (4 reflection / generic-anon-fn tests segfault);
(3) Perceus string early-drop (3 tmpl/comptime/interface-str tests produce
truncated/aliased strings). These touch shared autofree/option/Perceus codegen
(boehm-regression-sensitive) and runtime mark soundness — each warrants a dedicated pass,
not bundled here. Fix #13 above cleared one (option_init_ptr).

Test environment (so the numbers mean something — and what's NOT covered)

Everything below was measured on a single machine. This is a real limitation: we have
not tested other CPUs, x86, or native (non-virtualized) Linux. Please reproduce on your
own hardware.

Dev + all macOS benchmarks: Apple M2 Max, 12 cores (8 performance + 4
efficiency), 64 GB RAM, macOS 26.4.1 (build 25E253), Apple clang 21.0.0. -prod
builds via -cc cc.
Linux correctness/concurrency testing: a Docker container (Ubuntu 24.04.4,
clang 18.1.3, wrk 4.1.0), aarch64 — i.e. Linux 6.12 (linuxkit) running in Docker's
VM on that same M2 Max, not a separate native or x86 host. TSan + the concurrent-
HTTP churn reproducer ran here. So: arm64 only; x86, native Linux, and other core
counts are unverified. The collector's conservative stack/register scan and the
OS-suspend STW path are platform-sensitive — independent runs on x86/native Linux are
exactly the verification we're asking for.
Numbers are best-of-3 (compute benches) wall-clock; -gc boehm is the baseline.

How to verify

# build the dev compiler, then for any program:
v -gc e   prog.v     # Perceus front line + vgc backstop
v -gc vgc prog.v     # backstop only
v -gc e -d vgc_concurrent prog.v   # opt-in concurrent mark
# benches (provider-neutral):
v -enable-globals -gc e bench/parallel-alloc/poc/bench_scalar.v   # MP alloc scaling vs boehm
v -enable-globals -gc e bench/parallel-alloc/poc/bench_mp.v
v -gc e test bench/parallel-alloc/vgc_residual4_test.v        # white-box fix self-check

libgc's parallel mark defaults to one helper thread per core. On macOS every stop-the-world collection then wakes N-1 mark helpers that contend (mach thread_suspend/resume + mark-queue spin), starving the application's own worker threads. For an allocation-heavy multi-threaded server (the cx picoev multi-reactor HTTP leg) this collapses throughput.
Measured on a 12-core M-series, serve-file [?http-service] under wrk -t8 -c100:
- parallel mark (default): ~48.5K req/s, ~5 active cores
- single marker: ~125-131K req/s (2.6x)
Fix: emit GC_set_markers_count(1) in the main() boehm preamble on macOS, before GC_INIT(). GC_thr_init computes the marker count there and starts the helpers eagerly, so the call must precede it (an equivalent call in _vinit runs too late). Mirrors the GC_set_markers_count(1) already done for shared libs. The GC_MARKERS env var still overrides this (read first in GC_thr_init). Scoped to macOS; Linux parallel-mark is left at default pending separate measurement.
(cherry picked from commit 0421ce3)

The C codegen for global declarations handled @[volatile] but silently ignored @[thread_local], so a @[thread_local] __global compiled to a plain process-shared global. Any concurrent use of such a global then raced across threads. Emit the __thread storage-class (GCC/Clang/tcc; valid for the zero/nil/literal initializers V produces for globals) when the decl carries the attribute. Required by the scope-aware region allocator, whose per-thread state must be genuinely thread-local. (cherry picked from commit 13e1ce9)
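
The effect of the storage class is easy to demonstrate in plain C. The sketch below is illustrative only (`tl_counter`, `worker` and `caller_copy_after_worker` are made-up names, and it assumes a GCC/Clang toolchain with POSIX threads): a `__thread` global has one instance per thread, so a write in a worker leaves the caller's copy alone. Without the qualifier both threads would share a single variable.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* One instance per thread, which is what @[thread_local] is meant to produce. */
static __thread int tl_counter;

static void *worker(void *out) {
    tl_counter = 7;            /* writes this thread's copy only */
    *(int *)out = tl_counter;  /* report what the worker observed */
    return NULL;
}

/* Runs one worker to completion and returns the calling thread's copy afterwards. */
static int caller_copy_after_worker(int *worker_saw) {
    pthread_t th;
    tl_counter = 1;
    int rc = pthread_create(&th, NULL, worker, worker_saw);
    assert(rc == 0);
    (void)rc;
    pthread_join(th, NULL);
    return tl_counter;         /* still the caller's own value */
}
```

On older toolchains the program may need to be linked with `-pthread`.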

…c to ON The bundled gc.c amalgamation embedded `/* #undef THREAD_LOCAL_ALLOC */` in its autoheader config — the one alloc-lock mitigation left OFF. Flip it to `#define THREAD_LOCAL_ALLOC 1` (the canonical configure-equivalent of `--enable-thread-local-alloc=yes`); the TLA implementation is already present in the amalgamation (guarded by `#ifdef THREAD_LOCAL_ALLOC`), so this activates real code. Compiles clean. This affects the Linux / `-prod`-bundled `gc.o` path only. The macOS `cx` build links the prebuilt `thirdparty/tcc/lib/libgc.a` (built by thirdparty-macos-arm64_bdwgc.sh), which already defaults TLA on — so the flip brings the bundled config into parity with what macOS already ships. Companion runtime lever (already present at this pin): the macOS marker-pin `GC_set_markers_count(1)` before `GC_INIT()` in cmain.v::gen_boehm_gc_init(). Together: ~2.4-5x on MARK-bound multi-thread work. Neither removes the GC_allocate_ml alloc-lock, so alloc-heavy `[?map [par]]` (cx-private vlang#14) stays ~1.3x slower than serial — partial relief by design. Full acceptance + measurements in cx-private bench/parallel-alloc/P0-ACCEPTANCE.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> (cherry picked from commit ae03a36)

…vgc backstop + Linux port + per-thread accounting + deep-free
Forward-port of the full E work (developed on the upstream-master clone) onto the latest-V (a83aabb, Jun-11) base, on top of the cherry-picked cx patches. Adds:
- P3 minimal STW mark-region backstop collector (vgc_*), churn-correct on macOS.
- Linux backstop port: signal-suspend+ACK + dl_iterate_phdr ELF roots (vgc_platform.h).
- P1/P2 Perceus front line, DECOUPLED from -autofree (fires on the `perceus` define alone) so it composes with the tracing backstop.
- Per-thread heap accounting (live_delta/alloc_delta) → R2 alloc-MP near-linear.
- Sound deep-free of nested heap fields of a dropped &Foo (perceus.v deep-drop analysis).
- `-gc e` unified flag = vgc backstop + Perceus front line (opt-in; boehm stays default).
perceus.v is a new cgen pass (was untracked in the clone; now committed here).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit 677770d)

A vgc span serves one size CLASS but real workloads pack many TYPES (plus conservative ptrmap==0 allocations) into it. The span recorded only the FIRST typed allocation's ptrmap and applied it to every object, so any object whose real pointer layout differed had live child pointers skipped during mark -> reclaimed-while-reachable (observed as corrupted results in a deep alloc-heavy serial fold). The backstop now scans every scannable (non-noscan) span conservatively: finds every pointer, may over-retain, never under-retains. The backstop runs rarely behind the Perceus front line, so the cost is negligible. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> (cherry picked from commit 004b02f)

gen_free_for_sumtype/gen_free_for_array were not .option-aware and emitted it->_typ / it->len against the _option_* wrapper, producing a C error for any program with a ?SumType or ?[]T field freed under autofree/perceus. Mirror the existing option handling in gen_free_for_map/struct; no-op for non-option types. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> (cherry picked from commit 26ac2bb)

Remove references to the downstream consumer/repo from comments so the runtime and Perceus analysis read as standalone upstream V work. No code change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> (cherry picked from commit 6f94f55)

Option/result wrappers always embed an IError err field (a pointer-bearing interface) regardless of the payload, so `?int` etc. contain a pointer even though int does not. contains_ptr called final_sym, which strips the .option/.result flags, and so classified `[]?int` as scan-free -> emitted a _noscan allocation whose live err pointer a conservative GC mark would skip (potential reclaim-while-reachable). Check the flags before final_sym. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> (cherry picked from commit d112d5c)
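
The ordering issue described above can be modelled with a toy type descriptor. Everything here is hypothetical (`TypeInfo`, `Kind`, the flag constants); it is not V's type table, only a sketch of why the wrapper flags have to be inspected before the payload kind.

```c
#include <assert.h>
#include <stdbool.h>

enum Kind { KIND_INT, KIND_POINTER };
enum { FLAG_OPTION = 1, FLAG_RESULT = 2 };

/* Toy descriptor: a payload kind plus wrapper flags (?T sets OPTION, !T sets RESULT). */
typedef struct {
    enum Kind kind;
    unsigned flags;
} TypeInfo;

static bool contains_ptr(TypeInfo t) {
    assert((t.flags & ~(unsigned)(FLAG_OPTION | FLAG_RESULT)) == 0);
    /* Wrapper flags first: an option/result carries an error pointer even when
       its payload is a plain int. */
    if (t.flags & (FLAG_OPTION | FLAG_RESULT)) return true;
    /* Only then classify by the unwrapped payload. */
    return t.kind == KIND_POINTER;
}
```

Checking the payload first would classify `?int` by its `int` payload and mark the allocation pointer-free, which is the misclassification the commit describes.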

When the Perceus in-place map reuse fires, the result buffer is the receiver's own buffer (cap >= len), so the map loop can store results by ascending index instead of calling array_push. This removes the per-element call + grow-check + memmove and, by making the loop body transparent, lets the C optimizer vectorize it and elide the (non-escaping) allocation. ~3x faster than Boehm single-thread on the reuse micro-bench (prod), parity-or-better otherwise; correctness unchanged (reuse precondition already proved unique+dead+same-size). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> (cherry picked from commit d7e9f5a)
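
A rough C analogue of the reuse path, under stated assumptions: `IntArray`, `map_ints` and the `src_unique_and_dead` flag are invented for illustration, and the flag stands in for the compile-time proof that the receiver is unique and dead after the call. When that holds, the output buffer is the input buffer and every result is written by index.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct {
    int *data;
    size_t len, cap;
} IntArray;

/* Maps f over src. With src_unique_and_dead set, the result aliases src's buffer. */
static IntArray map_ints(IntArray src, int (*f)(int), bool src_unique_and_dead) {
    assert(src.len <= src.cap);
    IntArray dst = src;
    if (!src_unique_and_dead) {
        /* No proof of uniqueness: fall back to a fresh buffer. */
        dst.data = malloc(src.len * sizeof *dst.data);
        if (dst.data == NULL && src.len != 0) abort();
        dst.cap = src.len;
    }
    /* Direct indexed stores: no per-element push call, grow check, or memmove. */
    for (size_t i = 0; i < src.len; i++)
        dst.data[i] = f(src.data[i]);
    return dst;
}

static int twice(int x) { return 2 * x; }
```

Because the loop body is a plain load, call and store, a C compiler is free to inline `f` and vectorize it, which is the effect the commit message attributes to this change.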

The trace ring + async-signal-safe crash dump (P3 bring-up scaffolding) are now compiled only under `-cflags -DVGC_DIAG`; without it vgc_trace/vgc_trace_init are zero-overhead no-ops and the startup vgc_say probes are removed, so a -gc vgc/e binary emits nothing to stderr. The loud span-registry-overflow abort message is preserved (vgc_say + the raw stderr write helpers stay). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> (cherry picked from commit 25c1145)

…drop, HEAP_vgc arity, overflow-thread panic)
Forward-ported from the upstream-tracking clone (vlang-v-latest). All V-only, CX-agnostic, with standalone patches + CX-free repros in cx-private bench/parallel-alloc/.
1. vgc tiny-allocator free clobbers live siblings: the tiny allocator packs several sub-16B noscan objects into one span slot; vgc_free cleared the whole slot, reclaiming live tiny neighbors (short map-key char buffers) -> map corruption under -gc vgc/e. Fix: per-span is_tiny flag; vgc_free defers tiny-block slots to the tracing collector (Go model).
2. Perceus drops a store-target index var before its use: pcs_lower_stmt AssignStmt left a store target's index/container idents (m[key]=v) out of the use-set, so Perceus dropped `key` at its declaration, freeing it before map_set cloned it. Fix: non-Ident store targets collect all idents as uses.
3. HEAP_vgc macro arity: HEAP_vgc(type,expr,ptrmap,nptrs) was emitted with 2 args by some paths -> "too few arguments to function-like macro". The ptrmap is dead at runtime (conservative-mark backstop ignores it), so HEAP_vgc == HEAP. Fix: always emit plain HEAP under vgc (cgen.v + assign.v).
4. vgc overflow-thread panic: >vgc_max_threads(64) concurrent threads exhaust the fixed cache table -> cache_idx=-1 -> caches[-1] out-of-range panic. Fix: cache_idx<0 allocates from central (vgc_cache_get_span) + folds accounting into the global atomics (vgc_acct_alloc).
Validated: vlib map_test/array_test/string_test green under -gc vgc AND -gc e; g_churn battery + corpora none==e clean; cx default gate 125/125 + conformance green (no regression).
NOTE: building cx itself with -gc e surfaces 6 PRE-EXISTING cx-under-gc-e failures (they fail on the pre-fix fork too); these are tracked separately and not caused by these fixes.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit a95aff9)

Two CX-agnostic V-runtime changes for parallel alloc-heavy scaling under the full-STW vgc backstop (-gc e/vgc):
1. Dynamic span capacity (always-on, sound robustness fix). allspans was a fixed inline [262144]&VGC_Span; a higher GC trigger exhausted it -> the loud vgc_say(0xDEAD) abort. Now allspans is an mmap-backed pointer (vgc_os_alloc, 16M-entry default = 128MB address space, lazily committed by the OS), allocated once on the first vgc_span_alloc under vgc_heap.lock (NOT in vgc_init -- spans are allocated during _vinit, before vgc_init runs). The pointer never moves, so the collector's lock-free allspans walks (incl. lazy sweep) never see a relocated/freed buffer. Loud abort kept as the backstop at the (now huge) cap. Env override VGC_ALLSPANS_CAP.
2. Per-thread GC pacing knobs (env-gated, DEFAULT OFF -> a build with no env set is byte-identical to before). VGC_NEXT_GC_MB raises the trigger floor; VGC_PACE scales the live trigger by live_threads so N concurrent allocators don't trip the shared trigger N x more often per unit of per-thread progress.
Validated (clone, identical changes): g_churn battery 0 corruptions (default + VGC_PACE), corpora none==e byte-identical, map_test/array_test green under -gc e, v2 self-hosts; cx gate 125/125 + conformance green built against this runtime.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit 72edb9e)
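
The span-registry idea in item 1 can be approximated with a single anonymous mapping. This is a sketch under assumptions: `registry_init`, `allspans_cap` and the opaque `Span` type are illustrative stand-ins rather than the PR's `vgc_os_alloc` path, and it relies on the usual POSIX behaviour that anonymous private mappings are zero-filled and only backed by physical pages once touched.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

typedef struct Span Span; /* opaque here; only pointers are stored */

static Span **allspans;     /* set once and never moved, so readers need no lock */
static size_t allspans_cap;

/* Reserve address space for `cap` span pointers. Returns 0 on success, -1 on failure. */
static int registry_init(size_t cap) {
    assert(allspans == NULL && cap > 0);
    void *p = mmap(NULL, cap * sizeof(Span *), PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;
    allspans = p;
    allspans_cap = cap;
    return 0;
}
```

Since the table is never reallocated, a collector walking it without a lock cannot observe a relocated or freed buffer, which is the property the commit message relies on.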

Mark the live graph while mutators run, with two brief STW points (start: snapshot roots + enable barrier + alloc-black; termination: re-scan dirtied roots + dirty spans, final drain, sweep). STW stays the default and is byte-for-byte diff-able.
- card/dirty-span write barrier (vgc_wb_store): marks the mutated object's span dirty BEFORE the store; collector re-scans dirty spans at mark-termination (vgc_rescan_dirty_spans). Preemption-safe under mach-suspend (unlike an immediate-shade enqueue, which can be lost mid-barrier) and keeps the mark queue collector-exclusive.
- codegen emission in assign.v (gen_cm_write_barrier) for heap-targeted pointer-bearing stores; over-approximating, side-effect-free bases only. Fixes a gap where obj[i+k].field stores were skipped (InfixExpr index).
- builtin bulk-mutator barriers (array push/ensure_cap/set/clone/insert, map set, vgc_realloc, vgc_memdup*) for pointer moves via memcpy.
- alloc-black during mark (vgc_alloc_black_hook).
GC-assist deliberately NOT wired: unsound under preemptive suspend (an assisting mutator frozen mid-scan of a popped grey object orphans it). Heap overshoot during long marks is bounded by the existing exhaustion->force-collect net. Default build emits no barrier calls and is unchanged.
Gates green under the define: corpora none==e byte-identical, map/array tests, g_churn battery, multi-thread cm_stress, self-host, ASan == STW baseline. Measured payoff: large-live-set [par] ~1.2x faster than STW.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit d22ae0e)

…codegen + vgc_free central lock
Three independent soundness fixes for the `-gc e` front line (Perceus RC + precise STW tracing backstop), all general-purpose:
1. perceus.v: return-aliasing pin. A heap value bound from a CALL may alias the callee's receiver/args or sub-objects it traversed and copied out by value (e.g. a tree walk returning a []Element of shallow copies that share child buffers and can overlap one another). The old assign-aliasing pin only fired when an RHS *ident* was itself heap-owning, so such results were treated as uniquely owned and eager deep-dropped -> double free. Now any non-fresh call-bound aggregate is marked shared and left to the tracing backstop; only proven-fresh producers (map/filter over primitives, fresh string builds) stay droppable. Sound by construction: marking shared only suppresses a drop.
2. auto_free_methods.v: gen_free_for_option_ptr. A `?&T` field's free method resolved the sym to T's struct and inlined T's field frees treating the option payload as a T value, emitting `((T**)&it->data)->field` -> "base type 'T *' is not a structure". Now the pointee is freed via the base `_free` (`T_free(*(T**)&it->data)`), mirroring the non-option `&T` field case.
3. vgc_free: take the per-size-class central lock around the alloc-bit / count / free-index mutation, the same lock the allocator and the collector use. Closes a race between an eager free and a concurrent allocation of the same class. Obeys the lock-before-suspend discipline: brief, no allocation/blocking, so the collector (which pre-acquires every central lock) cannot deadlock.
Corpora byte-identical (none==e), churn battery 0-corrupt, multi-thread stress 0 mismatches, serial reuse throughput unchanged.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit 38607b2)

…ointer The STW collector captures the triggering (collector) thread's own roots via a setjmp trampoline (vgc_run_gc_spilled) and the cooperative-safepoint park (vgc_park_spill), then conservatively scans [sp, stack_base]. Both anchored `sp` at __builtin_frame_address(0) — the FRAME POINTER. But setjmp spills the callee-saved registers into its jmp_buf, which lives BELOW the frame pointer (between SP and FP). So the scanned range excluded the spill area, and a live mutator root held only in a callee-saved register — routine under -Os — was never scanned and got reclaimed while still live. Surfaced as a sporadic signal-11 under concurrent allocation (e.g. an HTTP server hammered with many connections): a value reachable through a worker thread's registers was swept, then read after free at a scattered point in the request path. Add vgc_real_sp() (reads the actual SP register on arm64/x86_64/i386, falls back to __builtin_frame_address) and anchor both self-scan paths at it, clamped to <= &buf so the jmp_buf spill area is always covered. Also route vgc_get_sp() through it for consistent registration/refresh ranges. Sound by construction: the scanned range only ever grows to include the true lowest in-use stack address. Suspended threads were already correct (their SP comes from thread_get_state); only the self/collector path relied on setjmp. CX-free, general-purpose: affects -gc vgc and -gc e (shared trampoline). The compiler self-build is .no_gc so the toolchain binary is unaffected. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> (cherry picked from commit 50fde69)

…e world The STW collector resumed all mutators (vgc_resume_thread loop) and THEN ran its end-of-cycle bookkeeping — vgc_update_trigger() and `gc_cycle++` — while mutators were already running. gc_cycle is read by mutators to stamp a freshly-acquired span's `sweep_gen` (vgc_span_alloc / vgc_central_get_span / vgc_get_free_span), and the sweep's in-flight-span guard (vgc_sweep_span) recycles an empty span only when `sweep_gen != gc_cycle`. So in the tiny window between resume and `gc_cycle++`, a resumed mutator could acquire an empty, still-in-flight span and stamp it with the OLD cycle; the next cycle's sweep then saw `sweep_gen != gc_cycle` and recycled that span out from under the mutator — a use-after-free. Pathologically narrow and timing-sensitive (it masks under every instrumentation), so it surfaced only as a sporadic signal-11 under sustained concurrent allocation (an HTTP server hammered with many connections), on both the mach (macOS) and signal (Linux) STW backends. ThreadSanitizer (Linux) pinned it precisely: a data race on `vgc_heap.gc_cycle` between vgc_gc_start (writer) and vgc_span_alloc (reader). Moving the bump + the trigger update ahead of the resume loop — while the world is still stopped — closes the window: every post-resume acquisition stamps exactly the cycle the next sweep checks against, and the just-run sweep still correctly skipped spans acquired during this cycle (it ran at the pre-bump cycle). After this change TSan reports zero vgc data races (was 30), and the dominant concurrent-HTTP crash is gone. CX-free, general-purpose; affects -gc vgc and -gc e (shared STW path). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> (cherry picked from commit 8baa8db)
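
The window described above can be replayed deterministically in a single thread. The model below is a deliberate simplification with hypothetical names (`span_acquire`, `sweep_would_recycle`, `finish_cycle`); the "mutator" is just a function call placed where a resumed thread would run, so the sketch shows only the ordering argument and none of the real concurrency.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t sweep_gen; /* cycle in which the span was last acquired */
    int live_objs;      /* 0 means empty, i.e. a recycling candidate */
} Span;

static uint32_t gc_cycle;

/* Mutator side: a freshly acquired span is stamped with the current cycle. */
static void span_acquire(Span *s) {
    assert(s != NULL);
    s->sweep_gen = gc_cycle;
}

/* Sweep side: recycle an empty span only if it was not acquired in this cycle. */
static bool sweep_would_recycle(const Span *s) {
    return s->live_objs == 0 && s->sweep_gen != gc_cycle;
}

/* End of a collection. `acquired_on_resume` stands for a span that a mutator
   grabs the moment it is resumed, before it has allocated anything from it. */
static void finish_cycle(bool bump_before_resume, Span *acquired_on_resume) {
    if (bump_before_resume) gc_cycle++;  /* fixed order: world still stopped */
    span_acquire(acquired_on_resume);    /* mutators resume and allocate */
    if (!bump_before_resume) gc_cycle++; /* old order: bump after resume */
}
```

With the old order the span carries the previous cycle's stamp, so the next sweep sees it as stale and empty and would recycle it; with the bump moved ahead of the resume, the stamp matches the cycle the next sweep checks against.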

…_alloc data race

SUBSTANTIVE FIX (vgc_d_vgc.c.v): vgc_find_span / vgc_is_heap_ptr read vgc_heap.narenas LOCK-FREE (they run on the collector's STW conservative scan, where a lock would deadlock against a frozen mutator) while vgc_span_alloc wrote it under vgc_heap.lock. That is a publication data race: a lock-free reader could observe narenas grow before the new arenas[idx] (base/size/page_span map) it gates was published, then read a stale base/size or a nil page_span -> wrong or missing span -> heap corruption. ThreadSanitizer pinpointed it (read in vgc_find_span vs write in vgc_span_alloc, global vgc_heap) under concurrent HTTP connection-teardown churn; the corruption surfaced as a segfault in the request path.

Fix: bump narenas as the LAST step of vgc_span_alloc with an atomic RELEASE store (after arenas[idx] fields + page map + allspans), and load it with an atomic ACQUIRE in the lock-free readers. TSan: 1 race -> 0. This also makes every early-return in vgc_span_alloc leave narenas consistent (a half-built arena is simply never published). CX-agnostic; reproduces with any heavily multi-threaded alloc/free workload.

DIAGNOSTIC TOOLING (inert; behind -d defines, zero codegen in normal builds; verified a plain `-gc e` build is unaffected): a mark-closure verifier (vgc_verify_mark_closure), a /proc/self/maps root-finder (vgc_rootfind_*), a data-segment dump, and a low-perturbation watch (per-cycle reset + emit-on-sweep, pub vgc_set_watch) under -d vgc_verify / -d vgc_watch; plus a coarse allocator lock (vgc_alloc_lock + a per-thread re-entrancy guard) under -d vgc_coarse_alloc. These proved the vgc mark/sweep collector is sound (the residual is allocator-side concurrency, not reclamation) and are kept for the ongoing allocator-race hunt.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> (cherry picked from commit c39ce23)
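The release/acquire publication pattern described in the fix can be sketched with C11 atomics. This is an illustration under assumptions, not the code in vgc_d_vgc.c.v: the `Heap` and `Arena` layouts, `MAX_ARENAS`, and the function names `heap_publish_arena` / `heap_find_arena` are hypothetical, and the writer is assumed to already hold the heap lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ARENAS 8

typedef struct { uintptr_t base; size_t size; } Arena;

typedef struct {
    Arena arenas[MAX_ARENAS];
    _Atomic uint32_t narenas;   /* number of fully published arenas */
} Heap;

/* Writer side (assumed to run under the heap lock): fill the slot
 * first, then publish it by bumping the count with a RELEASE store.
 * Returns the new index, or -1 if the table is full. */
static int heap_publish_arena(Heap *h, uintptr_t base, size_t size) {
    assert(h != NULL && size > 0);
    uint32_t idx = atomic_load_explicit(&h->narenas, memory_order_relaxed);
    if (idx >= MAX_ARENAS) {
        return -1;              /* early return: nothing half-built is visible */
    }
    h->arenas[idx].base = base;
    h->arenas[idx].size = size;
    atomic_store_explicit(&h->narenas, idx + 1, memory_order_release);
    return (int)idx;
}

/* Lock-free reader: the ACQUIRE load pairs with the release store, so
 * every slot below the observed count is fully initialized.
 * Returns the index of the arena containing p, or -1. */
static int heap_find_arena(Heap *h, uintptr_t p) {
    assert(h != NULL);
    uint32_t n = atomic_load_explicit(&h->narenas, memory_order_acquire);
    for (uint32_t i = 0; i < n; i++) {
        const Arena *a = &h->arenas[i];
        if (p >= a->base && p - a->base < a->size) {
            return (int)i;
        }
    }
    return -1;
}
```

The count is the only shared variable readers synchronize on, which is why it has to be the last thing the writer touches.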

Found via a reliable `Connection: close` teardown-churn repro (wrk -t8 -c256 against the cx guide HTTP server); each fix pushes the crash materially later (round 1 -> 8 -> 12). All CX-agnostic, observable behaviour byte-identical. Full cx gate green under -gc e (V-impl 125/125 + conformance + python/rust/go bindings + abi-c, prod+non-prod).

1. vgc page_span[] publication race (vgc_d_vgc.c.v). A span carved from an EXISTING arena writes arenas[i].page_span[pidx]=span under vgc_heap.lock but does NOT bump narenas, so the narenas release/acquire publication does not cover these slot writes. vgc_find_span reads page_span lock-free in the vgc_free/vgc_realloc hot path -> data race -> stale span -> heap corruption. Fix: atomic u64 RELEASE store on the slot + ACQUIRE load in vgc_find_span (paired, like narenas).
2. vgc mcache bitmap double-update (vgc_d_vgc.c.v, vgc_platform.h). The lock-free mcache fast path (vgc_span_alloc_obj) did a plain read-modify-write on span.alloc_bits/alloc_count while a concurrent cross-thread vgc_free RMW'd the SAME span's bitmap under central[class].lock -- the lock does not exclude the unlocked alloc side -> lost update -> a slot handed out twice. Fix: atomic OR (alloc) / atomic AND (free) on alloc_bits + atomic add/sub on alloc_count (new u8 fetch_or/fetch_and + sub_u32 helpers in all three cc branches). free_index stays a racy-but-safe scan hint; the STW sweep stays non-atomic (mutators suspended).
3. picoev handle_timeout stale-target assert (picoev.v). Under Connection: close teardown churn a timed-out fd can retain a residual timeouts entry whose target was already torn down (loop_id=-1, cb=nil). The assert aborted the reactor; poll_once already skips such targets. Fix: drop the stale timeout entry + skip, mirroring poll_once.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> (cherry picked from commit 46d2ae5)

…2 races)

Under heavy multi-threaded alloc/free churn the vgc allocator could return NULL for an in-range small allocation, surfacing as a null `&T{}` and a caller null-deref. Two independent root causes, each strictly load-bearing (ablation-confirmed); a third (a non-atomic alloc_bits read) was identified but is not load-bearing and is left out to keep the alloc fast path atomic-free.

* span scan missed the start byte's low bits. vgc_span_alloc_obj's two-pass free-slot scan applied the free_index start_bit offset in BOTH passes, so bits [0, start_bit) of the start byte were scanned in neither pass. A span with a free low slot but a high/stale free_index (the state a fill leaves when a cross-thread free's lowering of free_index is not yet visible) then reported "full" and returned nil. Single-byte-bitmap spans (small nelems) hit it whenever free_index == nelems. Fix: apply the offset only in pass 0; the wrap pass scans the start byte from bit 0.
* GC reclaimed mcache-resident spans. vgc_sweep_span reclaims any empty span with a stale sweep_gen, and on_central==0 means BOTH "free-floating" AND "owned by an mcache". An empty cached span was reset (nelems=0) and pooled by vgc_put_free_span while still referenced by the mcache slot and by a thread suspended inside vgc_malloc -- span descriptors live outside the GC arena, so the conservative root scan never protects them, and vgc_fixup_caches only nulls the cache slot, not a suspended owner's local. The owner then read a zeroed/torn span. Fix: vgc_protect_cached_spans() stamps every mcache-resident span's sweep_gen under STW before sweep, reusing the existing in-flight guard.

Deterministic white-box regression (CX-free): vgc_residual4_selftest in vlib/builtin/vgc_selftest_d_vgc.c.v, driven by bench/parallel-alloc/vgc_residual4_test.v; both checks verified to fail without their fix. Full V _test suite + conformance green under -gc e.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> (cherry picked from commit 871dced)
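The first root cause (the wrap pass skipping the start byte's low bits) is easy to reproduce in isolation. Below is a hedged sketch of a two-pass bitmap scan, not the body of vgc_span_alloc_obj: the function name `span_find_free`, the clamping of an out-of-range hint, and the `offset_in_wrap_pass` switch (which re-enables the old behaviour for comparison) are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Two-pass free-slot scan over an allocation bitmap (bit set = in use).
 * Pass 0 scans from the free_index hint to the end; pass 1 wraps and
 * scans from slot 0 up to and including the start byte.
 * offset_in_wrap_pass = true reproduces the bug: the start-bit offset
 * is applied in BOTH passes, so the start byte's low bits are skipped.
 * Returns a free slot index, or -1 if none was found. */
static int span_find_free(const uint8_t *bits, uint32_t nelems,
                          uint32_t free_index, bool offset_in_wrap_pass) {
    assert(bits != NULL && nelems > 0);
    if (free_index >= nelems) {
        free_index = nelems - 1;        /* clamp a stale hint (assumption) */
    }
    uint32_t nbytes = (nelems + 7u) / 8u;
    uint32_t start_byte = free_index / 8u;
    uint32_t start_bit = free_index % 8u;

    for (int pass = 0; pass < 2; pass++) {
        uint32_t lo = (pass == 0) ? start_byte : 0u;
        uint32_t hi = (pass == 0) ? nbytes : start_byte + 1u;  /* exclusive */
        for (uint32_t b = lo; b < hi; b++) {
            bool use_offset = (b == start_byte) &&
                              (pass == 0 || offset_in_wrap_pass);
            for (uint32_t bit = use_offset ? start_bit : 0u; bit < 8u; bit++) {
                uint32_t slot = b * 8u + bit;
                if (slot >= nelems) {
                    break;              /* padding bits past the last slot */
                }
                if ((bits[b] & (1u << bit)) == 0) {
                    return (int)slot;
                }
            }
        }
    }
    return -1;
}
```

For a single-byte bitmap `0xFE` (only slot 0 free) and a hint of 5, the fixed scan finds slot 0 in the wrap pass, while the buggy variant looks at bits 5 to 7 twice and reports the span as full.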

Three CX-agnostic allocator changes that recover [par] / multi-reactor scaling under -gc e without regressing serial throughput or correctness.

Root cause (measured): the alloc fast path was ~98% global-lock spin under parallel churn, NOT GC pacing and NOT memory bandwidth (8 separate processes scale; 8 in-process workers do not). Two locks dominated: vgc_heap.lock (new-span carve, whose hold included a per-span mmap()) and central[].lock (full-span return).

1. Span-descriptor bump slab (vgc_alloc_span_meta): span descriptors are never individually freed (pooled by vgc_put_free_span forever), so replace the per-carve mmap(sizeof(VGC_Span)) -- a syscall under vgc_heap.lock -- with a pointer bump + rare bulk mmap. Side effect: ~3x lower RSS (the per-span mmap wasted ~16KB of page granularity on a ~400B descriptor).
2. Drop full spans instead of returning them to central.full. The partial list was never reused (returns only land FULL spans on full; sweep relinks only fully-empty), so the per-fill return was pure central[].lock contention. A dropped span stays in allspans, is swept normally, and is reclaimed when empty via the on_central==0 path; reuse flows through the free_spans pool + the active span's free_index. Sound vs residual vlang#4 (a dropped span is no longer mcache-resident, so same-cycle reclaim is correct; it is sweep_gen-protected while still referenced during the drop).
3. Per-thread GC pacing on by default (was env-gated). Adaptive: scales the live trigger by live_threads only when >1, so single-threaded/small programs keep the historic 256MB trigger and RSS. VGC_PACE=0 disables it.

Results (best-of-5, -prod -gc e, 12-core): par 360ms = 3.8x faster than serial at a 4GB trigger (RSS 2.3GB); ~2.9x at the default with pacing; serial 1550 -> 1380ms; the alloc lock-spin leaves the hot profile entirely (now compute-bound).

Gates: residual-vlang#4 selftest PASS; cx V-impl 125/125; local MP stress 50/50; HTTP churn repro (wrk -t8 -c256 Connection:close) survived 15 rounds niltrace=0; real HTTP server sweep ~22M requests crash-free.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> (cherry picked from commit 82e3934)
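The bump-slab idea in item 1 can be sketched as follows. This is an assumption-laden illustration rather than vgc_alloc_span_meta itself: `MetaSlab`, `meta_alloc`, the 64 KB chunk size and the 16-byte rounding are hypothetical, and `malloc` stands in for the bulk mmap so the example stays portable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SLAB_BYTES (64u * 1024u)    /* hypothetical bulk-reservation size */

typedef struct {
    unsigned char *cur;   /* next free byte in the current chunk */
    size_t left;          /* bytes remaining in the current chunk */
    size_t chunks;        /* number of bulk reservations made so far */
} MetaSlab;

/* Bump-allocates a descriptor that is never freed individually.
 * The common case is a pointer bump; only when the chunk runs out is a
 * new bulk chunk reserved (malloc here, a bulk mmap in the commit).
 * Returns NULL only if the bulk reservation fails. */
static void *meta_alloc(MetaSlab *s, size_t size) {
    assert(s != NULL && size > 0 && size <= SLAB_BYTES);
    size = (size + 15u) & ~(size_t)15u;      /* keep 16-byte alignment */
    if (s->left < size) {
        unsigned char *chunk = malloc(SLAB_BYTES);
        if (chunk == NULL) {
            return NULL;
        }
        s->cur = chunk;                      /* old chunk's tail is abandoned */
        s->left = SLAB_BYTES;
        s->chunks++;
    }
    void *p = s->cur;
    s->cur += size;
    s->left -= size;
    return p;
}
```

Because descriptors are pooled forever, there is no free path to write; the trade-off is that the small tail of each exhausted chunk is wasted.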

…nostic Genericize two comments that named the downstream consumer, keeping vlib source provider-neutral (matches 6f94f555): the mark-closure verifier's noscan-referrer note and the vgc_residual4_selftest header. Comment-only; no code change. (cherry picked from commit 1ff9e4f)

…er -gc e The autofree/Perceus drop for an option-pointer local (b := &?Foo{}) closed the free call twice: the option branch wrote '.data)' (closing the call's open paren) and then the shared tail wrote ');' again -> 'free((Foo**)b.data));' -> C error 'extraneous ) before ;'. The non-option path closes only once via that tail, so emit '.data' (no close) and let the single tail ')' close the call. Fixes vlib/v/tests/options/option_init_ptr_test.v under -gc e (passes under e/boehm/none; boehm options suite 213/213, no regression). One of the documented -gc e edge bugs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> (cherry picked from commit 3bcf843)

gen_free_for_array took the string-construct branch for an option element ([]?T) because the UNWRAPPED payload sym (e.g. string) has a user free -> it emitted a call to _option_<T>_free, a wrapper method no codegen path produces -> C error 'undeclared function builtin___option_string_free'. Fix: for an option element, INLINE the payload free (check option state, free the payload via its base free on &data), mirroring the sum-type-variant-option and whole-?[]T paths. No separate _option_<T>_free method needed. Fixes option_ifguard_array_of_option_test under -gc e (options suite 213/213; boehm/none unaffected). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> (cherry picked from commit 3d53776)

…e_threads

(c) the real MP-alloc lever (measured): vgc_free took the per-class central[].lock on EVERY non-tiny free, so N threads dropping the same size class serialized N-way on one lock; bench_scalar anti-scaled 35->7 Mops/s T1->T8, below boehm. (This is why the earlier 'near-linear, ~8x boehm' result regressed: the residual-vlang#4 fix added that lock for correctness.)

Fix: skip the central lock when span.on_central == 0 (resident in a thread mcache, or dropped awaiting sweep); such a span has no central-list membership to guard, and its bitmap+count mutations are individually atomic. The atomic fetch_and's prior value gates the count decrement so a double-free still cannot double-subtract. free_index stays a racy-but-safe hint. Spans actually on a central list (on_central != 0: the unregistered-overflow-thread fallback) keep the lock, preserving list consistency and the collector's lock-before-suspend fence.

Also: make live_threads ++/-- in register/unregister ATOMIC. vgc_maybe_gc reads it lock-free for per-thread GC pacing (on by default), so a plain RMW raced that atomic read (TSan-flagged; a PACE-on-by-default regression).

Result (best-of-3, -prod -gc e, 12-core): bench_scalar T1->T8 = 45->326 Mops/s (7.2x, near-linear; 5.5x boehm at T8); bench_mp T1 5.9->76.

Soundness: residual-vlang#4 white-box selftest PASS; container HTTP churn 15 rounds niltrace=0 (the exact race the lock guarded); TSan 0 warnings (was 1, the live_threads race, now fixed); cx V-impl 125/125 + full make test green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> (cherry picked from commit c69fd59)
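The "prior value gates the decrement" rule can be shown with C11 atomics on a small bitmap. A minimal sketch, assuming a hypothetical 64-slot `Span` with `span_claim` / `span_release` helpers; the real runtime's field widths, helper names and the alloc-side behaviour may differ.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    _Atomic uint8_t alloc_bits[8];   /* one bit per slot, set = allocated */
    _Atomic uint32_t alloc_count;    /* number of set bits */
} Span;

/* Marks a slot allocated with an atomic OR. The count is bumped only
 * if the bit was previously clear. Returns false if already taken. */
static bool span_claim(Span *s, uint32_t slot) {
    assert(s != NULL && slot < 64u);
    uint8_t mask = (uint8_t)(1u << (slot % 8u));
    uint8_t prev = atomic_fetch_or(&s->alloc_bits[slot / 8u], mask);
    if (prev & mask) {
        return false;
    }
    atomic_fetch_add(&s->alloc_count, 1u);
    return true;
}

/* Clears a slot with an atomic AND. The PRIOR value decides whether
 * the count is decremented, so a double free cannot subtract twice.
 * Returns false when the slot was already free. */
static bool span_release(Span *s, uint32_t slot) {
    assert(s != NULL && slot < 64u);
    uint8_t mask = (uint8_t)(1u << (slot % 8u));
    uint8_t prev = atomic_fetch_and(&s->alloc_bits[slot / 8u], (uint8_t)~mask);
    if (!(prev & mask)) {
        return false;                /* double free: count left untouched */
    }
    atomic_fetch_sub(&s->alloc_count, 1u);
    return true;
}
```

Each bit flip and each count update is individually atomic, which is the property the commit relies on when it skips the central lock for spans that are not on a central list.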

…ps root-finder Debug-only soundness tooling used during development (compiled out by default). Removed for review clarity: vgc_verify_mark_closure, vgc_rootfind_region + the $if vgc_verify ? call sites and the verify-only globals/C decls. The low-perturbation vgc_watch_* hooks (also debug, runtime-gated by vgc_watch_addr, inert by default) are kept — they were the load-bearing diagnostic for the concurrency fixes and add negligible inert cost.

CX-free V programs backing the PR's perf/soundness claims (so reviewers can reproduce): bench_scalar (single-class MP alloc scaling), bench_mp (nested-object MP), par_reclaim (bounded-live control), cm_stress (concurrent-mark hazards), boehm_mp_bench (Boehm baseline). vgc_residual4_test (white-box fix self-check) is already present.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 56f73cde59

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-15T00:46:56Z

+		// MUST be after _vinit(): _vinit zero-initializes every global (incl.
+		// vgc_heap), so running vgc_init before it would have gc_enabled/next_gc
+		// wiped back to 0 -> GC disabled for the whole program -> unbounded heap.
+		g.writeln('\tbuiltin__vgc_init();')


Avoid re-registering threads after pre-init allocations

When _vinit allocates under -gc vgc/-gc e, the allocation path calls vgc_ensure_registered() before this generated builtin__vgc_init() runs; vgc_init() then unconditionally calls vgc_register_thread() again. That leaves the earlier cache slot still registered with the same OS thread port, so the next collection sees it as another mutator and tries to suspend/scan the collector thread itself (i != self_idx), which can self-suspend or hang as soon as GC triggers in programs with allocating global/module init. Make registration idempotent here, or avoid the second registration in all generated entry points.
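One way to read the "make registration idempotent" suggestion is a lookup-before-insert on the thread table. The sketch below is hypothetical: `Registry`, `registry_register`, `MAX_THREADS` and the use of a plain integer thread id are assumptions, and the caller is assumed to hold whatever lock protects the table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_THREADS 16

typedef struct {
    uint64_t thread_id[MAX_THREADS];
    int used[MAX_THREADS];
    int count;                       /* number of occupied slots */
} Registry;

/* Returns the slot for tid, creating one only if the thread is not
 * registered yet. Calling it twice for the same thread yields the same
 * slot, so a later collection never sees the thread twice.
 * Returns -1 when the table is full. */
static int registry_register(Registry *r, uint64_t tid) {
    assert(r != NULL);
    for (int i = 0; i < MAX_THREADS; i++) {
        if (r->used[i] && r->thread_id[i] == tid) {
            return i;                /* already registered: reuse the slot */
        }
    }
    for (int i = 0; i < MAX_THREADS; i++) {
        if (!r->used[i]) {
            r->used[i] = 1;
            r->thread_id[i] = tid;
            r->count++;
            return i;
        }
    }
    return -1;
}
```

The alternative named in the review, skipping the second registration in the generated entry points, avoids the lookup but has to be applied consistently in every entry point.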

chatgpt-codex-connector · 2026-06-15T00:46:56Z

+	vgc_cm_stw_exit(self_idx)

+	vgc_update_trigger()
 	vgc_heap.gc_cycle++


Bump the GC cycle before resuming concurrent mutators

In -d vgc_concurrent builds this resumes mutators before updating next_gc and incrementing gc_cycle. The STW path above explicitly moved those updates before resume because a resumed allocator can acquire/cache a span stamped with the old cycle; the next sweep then sees sweep_gen != gc_cycle and can recycle that cached span while the mutator still owns it. Move vgc_update_trigger()/vgc_heap.gc_cycle++ before vgc_cm_stw_exit(self_idx), matching the non-concurrent path.

chatgpt-codex-connector · 2026-06-15T00:46:56Z

+  static inline uint32_t vgc_thread_self_port(void) { return 0; }
+  static inline void vgc_suspend_thread(uint32_t t) { (void)t; }
+  static inline void vgc_resume_thread(uint32_t t) { (void)t; }


Disable VGC where OS-level suspension is unavailable

On Windows/BSD this fallback registers every thread with mach_port == 0 and makes suspend/resume no-ops, but vgc_gc_start() no longer uses the old cooperative safepoint path and only suspends slots with c.mach_port != 0. As a result, a multi-threaded -gc vgc/-gc e program on these supported targets will mark and sweep while other mutators keep running, so live objects allocated or stored by those threads can be reclaimed. This needs to either reject VGC on unsupported STW platforms or retain a safe cooperative fallback.

(1) vfmt array.v + the two vgc files (hand-edited during cherry-pick conflict resolution + the verifier strip) so code-formatting CI passes. (2) The standalone POC benchmarks are each `module main`; sitting beside vgc_residual4_test.v they collided under `v test` (duplicate `main`/`Obj`). Moved to bench/parallel-alloc/poc/ so the selftest's module is clean; benches still run individually.

…ootstrap) The concurrent-mark write barrier vgc_wb_store is defined only in vgc_gc_d_vgc.c.v (compiled under -d vgc, i.e. -gc e). Its call sites in array.v/map.v sit inside `$if vgc_concurrent ? { ... }`, so they emit no code in ordinary builds — but the checker still walks those comptime branches, and `v -os cross` emits every branch into the generated v.c. Both paths require the symbol to resolve in non-vgc builds, so default (boehm) `-os cross` / bootstrap-v / build-vc failed with "unknown function: vgc_wb_store". Add a no-op fallback in a _notd_vgc.c.v sibling so the symbol resolves in boehm/none builds (mutually exclusive with the real definition via file suffix). No behavioral change: the barrier is only ever active under `-gc e -d vgc_concurrent`. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

`v fmt -verify` (run by test-cleancode in nearly every CI job) flagged vlib/v/gen/c/perceus.v and vlib/v/gen/c/cgen.v as not vfmt'ed; also reformat the two POC bench files. Formatting-only (field/comment alignment); no logic change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

report-missing-fn-doc requires a name-leading doc comment on every new public function. Document the two introduced by this PR. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

write_heap_alloc / write_heap_alloc_close no longer use their `typ` parameter (the precise-pointer-map HEAP_vgc variant is intentionally unused — see the note there). V emits a `notice: unused parameter: typ` to stderr on every build; the tools-* CI jobs treat any stderr output from a tool compile as a failure, so the whole tools-{linux,macos, freebsd,openbsd,docker} matrix went red. Mark the parameter `_` (kept in the signature for call-site symmetry; maintainers may prefer removing it outright). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

eptx · 2026-06-15T20:20:27Z

Thanks for approving the CI run — that surfaced exactly what we needed.

I triaged the 38 failures; they collapse to four root causes, all on our side and all low-risk, now addressed on the branch:

1. -os cross / bootstrap break (real regression, the important one). The concurrent-mark write barrier vgc_wb_store is defined only under -d vgc (-gc e), but its call sites in array.v/map.v sit in $if vgc_concurrent ? blocks. v -os cross emits every comptime branch into v.c, and the checker walks inactive branches too, so in a default (boehm) build the symbol is unresolved and bootstrap-v, build-vc, cross-* and the BSD cross-compiles fail with unknown function: vgc_wb_store. Fixed by adding a no-op vgc_wb_store fallback in a _notd_vgc.c.v sibling (mutually exclusive with the real definition via file suffix; zero behavioral change, since the barrier is only ever live under -gc e -d vgc_concurrent).
2. vfmt. vlib/v/gen/c/perceus.v and cgen.v weren't vfmt-clean. Since test-cleancode runs early in most jobs, this alone reddened the linux/macos/windows compilers, the docker images, and the sanitize-* and tcc-* jobs (they abort at the fmt gate before reaching their namesake work). Reformatted.
3. unused parameter: typ. write_heap_alloc[_close] no longer use their typ arg (the precise-pointer-map HEAP_vgc variant is intentionally dead; see the note there). V prints a notice to stderr even on a successful build, and the tools-* harness treats any stderr from a tool compile as failure, so the whole tools-* matrix (and the riscv64 build) went red. Marked the param _.
4. Missing doc comments on pub fn vgc_init / vgc_set_watch. Added.

Worth flagging: this CI matrix builds V's default configuration, so the -gc e code paths (C11 atomics, @[thread_local], the conservative scanner) aren't actually compiled or exercised by these jobs — the failures above are all default-build lint/cross issues, not anything specific to the new backend. If it'd be useful, I'm happy to add a small opt-in -gc e lane so the new collector gets real coverage; the two known limitations there are (a) tcc can't compile the C11 atomics the barrier/STW code uses, so a -gc e lane would need a non-tcc compiler, and (b) the conservative mark scan will trip ASan/MSan/UBSan without suppressions. I'd rather take your steer on whether you even want that lane than guess.

A couple of things I'd like direction on before going further:

Fallback approach for (1): the no-op _notd_vgc stub is the minimal fix, but if you'd prefer the barrier machinery never reference a vgc-only symbol from builtin at all (e.g. gating the call sites differently), I'll restructure to match your conventions.
v1 vs v2: this POC targets the current C backend (vlib/v/gen/c). Is that the right place for an experiment like this, or would you want it oriented toward v2?

This is still a POC / "needs verification" — happy to adjust scope, split it, or hold any of it pending your read on the architecture.

…neutral)

Mirrors the 4 fixes landed on the upstream PR branch (cx-home/v pr/mem-mgmt-poc, vlang#27458) into the CX build lineage. All behavior-neutral; -gc e codegen is byte-identical.

- vgc_wb_store: add no-op fallback (vgc_wb_fallback_notd_vgc.c.v) so the symbol resolves in non-vgc / `-os cross` builds (dormant for CX since it builds -gc e by default, but makes a boehm/cross build robust).
- cgen.v: mark dead `typ` param of write_heap_alloc[_close] `_` (kills the per-build `unused parameter` notice on stderr).
- perceus.v + cgen.v: vfmt.
- vgc_init / vgc_set_watch: doc comments.

Gate: devbox run -- make test-vcx GREEN (V-impl 125/125 + conformance md 22/0, namespaces 16/0).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

JalonSolov · 2026-06-15T21:10:29Z

So... I decided to run -gc vgc vs -gc boehm (I know it's the default... I made it explicit...) vs -gc e, I compiled V itself after checking out the PR, and ran the test with hyperfine... here are my results for bench_scalar:

Summary
  ./ve -enable-globals -gc boehm run bench/parallel-alloc/poc/bench_scalar.v ran
    1.31 ± 0.07 times faster than ./ve -enable-globals -gc vgc run bench/parallel-alloc/poc/bench_scalar.v
    1.39 ± 0.08 times faster than ./ve -enable-globals -gc e run bench/parallel-alloc/poc/bench_scalar.v

and for bench_mp:

Summary
  ./ve -enable-globals -gc vgc run bench/parallel-alloc/poc/bench_mp.v ran
    1.02 ± 0.04 times faster than ./ve -enable-globals -gc boehm run bench/parallel-alloc/poc/bench_mp.v
    1.07 ± 0.02 times faster than ./ve -enable-globals -gc e run bench/parallel-alloc/poc/bench_mp.v

Which means on my system, this new way is slower than both the others. 😕

Latest V, CachyOS, on 7950X CPU, 32G RAM, and NVME disks.

JalonSolov · 2026-06-15T21:13:35Z

Alex will have to make final decisions, but my opinion:

never reference a vgc-only symbol from builtin at all, unless -gc vgc is used
implement for both backends

v2 is still not quite done yet, so we can test earlier/more stably on v1. However, checking against v2 before it is finalized could be very useful as well.

GGRei · 2026-06-15T22:31:14Z

I think the core idea is interesting, Perceus-style reuse, ownership-driven drops/frees, and a tracing backstop are all directions worth studying.

My concern is more about integration and timing. I have been doing some work on the V2 ownership/autofree direction based on the ownership problems listed in #27116.

In the current V2 work, the scope is deliberately staged: it collects ownership/autofree facts from the V2 post-transform FlatAst and type information, classifies transfers and release/cleanup eligibility in the V2 type/checker layer, then emits bounded cleanup first through the V2 CleanC backend and fixture tests. The goal, however, should be a backend-neutral V2 ownership/free contract. Once the model is stable, the same ownership decisions will need clear lowering rules for the native backends too, including x64, arm64, and any future backend. That model needs to cover shallow copies, escaping locals, non-owning pointers, sumtype payloads, pointer fields, globals/struct storage, etc.

That makes me wary of adding another memory-management path before the V2 model is implemented, reviewed, and stabilized.

For the current C backend side, I also think this is still too experimental for master as-is. It is true that V already has several memory-management modes there, such as Boehm, none, and autofree, but autofree is already a sensitive area with known issues in that path.

This PR is not just adding a small isolated switch, it also changes the current C backend code generator, autofree/free-method generation, builtins, VGC runtime behavior, platform STW/root scanning, allocator behavior, and Boehm/libgc-related code. Even if -gc e is opt-in, that maintenance surface is not really isolated, so I am cautious about adding a new experimental hybrid mode.

So in my opinion, the safest path would be to keep this as a dedicated experimental/RFC branch for now, possibly as a dedicated branch under the V org if maintainers want to explore it, and extract the useful parts separately like benchmarks, reproducers, design notes, and small isolated fixes with tests. If the experimental branch later proves strong results across supported OSes, with correctness tests and stable performance data, then moving parts of it toward the current C backend / master path would make much more sense.

This is only my personal opinion, of course. ^^

Just to be clear, I still think this PR is interesting and definitely worth exploring further.

eptx · 2026-06-16T03:49:06Z

Really appreciate the careful look — and thank you for taking the time to actually benchmark it, @JalonSolov.

On performance: that's a fair and important data point. Our wins were measured on M2 Max (arm64/macOS); your 7950X / CachyOS numbers — -gc e behind both boehm and vgc on bench_scalar and bench_mp — tell me the result is workload- and platform-dependent, not a universal win, and I won't claim otherwise. It would need broad cross-OS / cross-arch data before being a serious master proposal, and it doesn't have that yet.

On integration & timing, @GGRei: I think you're right, and I'd rather follow your lead than push a ~5k-line cross-cutting change at master:

Happy to keep the full hybrid as a dedicated experimental / RFC branch (under the V org if you'd like a home for it; otherwise my fork), explicitly not targeting master.
I'd prefer to peel off the independently-useful pieces as small, isolated, tested PRs — the benchmarks + reproducers, the design notes, and the few genuinely isolated fixes we hit along the way. Would those be welcome, and do you want them as separate PRs?
I also don't want this competing with the v2 ownership/autofree direction in #27116 ("Restore CI step: Ensure V2 can be compiled with -autofree"); that's clearly the strategic path. I'll exercise the ideas against v2 early (per @JalonSolov) so anything useful feeds into that model rather than around it.

On the builtin reference (@JalonSolov): agreed — I'll rework so the barrier never references a vgc-only symbol outside -gc vgc builds (gating the call sites, rather than the builtin no-op fallback I pushed just to clear CI), and look at covering both vgc and e.

I'll re-frame the PR accordingly. Thanks again — this is exactly the steer I was hoping for.

GGRei · 2026-06-16T04:01:59Z

Of course, please keep in mind that I am nobody special in the project, just a modest contributor giving my personal opinion.

Medvednikov is the real final decision-maker here, with JalonSolov's help of course.

eptx and others added 26 commits June 14, 2026 20:18

chatgpt-codex-connector Bot reviewed Jun 15, 2026

eptx and others added 2 commits June 14, 2026 20:47

eptx and others added 3 commits June 15, 2026 16:06

docs(vgc): add doc comments for pub fn vgc_init and vgc_set_watch

843469f

report-missing-fn-doc requires a name-leading doc comment on every new public function. Document the two introduced by this PR. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
