
Architecture E (POC, opt-in -gc e): Perceus reuse-in-place front line + precise STW tracing GC backstop for the C backend #27458

Open
eptx wants to merge 31 commits into vlang:master from cx-home:pr/mem-mgmt-poc

@eptx eptx commented Jun 15, 2026


Architecture E: a Perceus front-line + precise stop-the-world tracing backstop for V's C backend

What this is (please read first)

This is a proof-of-concept, developed with Claude (Anthropic's coding agent), shared
because the results look very positive for V and we'd like the community to verify
them. We (the CX project, a tree-walking language interpreter written in V) set out to
test whether V could meet our memory-management and multi-core needs instead of
switching to Rust, while staying aligned with V's stated direction (autofree /
reuse-in-place). It worked well enough to be worth contributing: at minimum a baseline
POC, possibly a real contribution.

To be direct about the claims: every performance number below was measured on our
machines and our workloads only, with no broad independent benchmark suite and no
third-party review. Treat them as claims to verify, not facts. The correctness work is
firmer (TSan, a deterministic white-box self-check, and a churn reproducer, all included)
but also wants independent eyes. We'd value the community pressure-testing both.

Context: follows up on the Perceus discussion #27166 (and a Discord exchange where
@JalonSolov suggested a PR so Alex could look it over). Open question for maintainers
up front: target v1 (current master, where it's built + tested) or plan for v2? If v2
reworks the backend/codegen we're happy to advise on a port; better to know before deep
review.

All changes are provider-neutral V-runtime / codegen work: CX was the workload that
surfaced the bugs and motivated the optimizations, but nothing here is specific to it
(source scrubbed of downstream-specific naming).

Summary (TL;DR)

This adds a new opt-in memory-management mode for the C backend, -gc e, that pairs
a Perceus-style reference-counting front line (compiler-emitted, in-place reuse) with
a precise, from-scratch stop-the-world tracing collector (vgc) as the backstop —
plus the bug fixes and allocator optimizations that made the combination sound and fast
under heavy multi-threaded allocation. The front line reclaims the common, uniquely-owned
case with zero tracing; the backstop reclaims arbitrary aliased/cyclic graphs that RC
cannot. vgc alone is also usable (-gc vgc).

Why: Boehm (V's default conservative GC) anti-scales on alloc-heavy multicore
workloads (its parallel marker and alloc lock serialize mutators) and over-retains
(conservative). E targets both single-thread throughput (reuse-in-place avoids
allocation) and multicore scaling (per-thread allocation + accounting, no shared
alloc-path lock in the steady state).

Status / honesty: developed and gated against one large real consumer + a battery of
provider-neutral micro-benchmarks (included). The performance numbers below are
measured on those workloads and should be independently verified before any claim is
relied on. STW is the default collection strategy; concurrent mark is behind a separate
-d vgc_concurrent and is not proposed for default. The verification tooling
(mark-closure verifier, root-finder) is compiled out unless -d vgc_verify.


Architecture & rationale

V today offers Boehm (conservative, default), -autofree (compiler-managed scope frees,
single-ownership assumption), and -gc none. None gives both low allocation and
linear multicore scaling for an allocation-heavy program with aliased/graph-shaped data.

Architecture E = front line + backstop, decoupled from -autofree:

  1. Perceus front line (vlib/v/gen/c/perceus.v, new). A compile-time ownership/share
    analysis emits in-place reuse and drop for values it can prove uniquely owned,
    reclaiming the common case without touching the collector. Crucially it is decoupled
    from -autofree: -autofree restructures codegen assuming sole ownership and is
    incompatible with a backing collector (it corrupts under any GC; demonstrated). E runs
    the drop analysis off its own perceus define, so Perceus drops are the sole frees and
    the analysis stays sound (it pins assignment-aliases, call-result aliases, and any
    value whose heap field is exposed → those fall through to the backstop).

  2. Precise STW tracing backstop (vlib/builtin/vgc_*.c.v, new). A from-scratch
    mark/sweep collector with a Go-mcache-style segregated allocator (per-thread span
    caches, per-size-class central lists, arena-backed spans). It reclaims what RC can't
    (cycles, aliased graphs) and runs rarely because the front line absorbs most frees.
    Precise (type-driven) marking where sound; conservative stack/register scanning for
    roots. Mutators are stopped via OS-level suspend (mach / signal).

  3. The hybrid is the point. RC alone leaks cycles; tracing alone pays full mark cost
    on every cycle. Perceus handles the dominant uniquely-owned case in-place; the tracing
    backstop is the correctness net for the rest. This mirrors Koka/Lean's Perceus + a
    collector, adapted to V (which lacks a uniform per-object header, so the backstop owns
    arbitrary-graph reclamation rather than a global RC header scheme).

Isolation-for-scaling doctrine: linear multicore scaling comes from per-thread
isolation of the allocation path (per-thread span caches + per-thread heap accounting),
not from a faster shared collector. The collector is the rare backstop; the steady-state
alloc/free fast path touches no shared cacheline or lock.
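
To make the accounting half of that doctrine concrete, here is a minimal C sketch of thread-private deltas that are flushed to a shared atomic only in batches. It is an illustration under stated assumptions, not the PR's code: the names (`acct_alloc`, `acct_free`, `acct_flush`, `ACCT_FLUSH_BYTES`) are hypothetical, and the 1 MiB threshold simply mirrors the "~1 MB" batching mentioned in the optimization list.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flush threshold, mirroring the "~1 MB" batching. */
#define ACCT_FLUSH_BYTES ((int64_t)1 << 20)

/* Shared counter: written only when a thread flushes its private delta. */
static _Atomic int64_t global_live_bytes;

/* Thread-private delta: no other thread ever reads or writes it. */
static _Thread_local int64_t live_delta;

/* Fold this thread's pending delta into the shared counter. */
static void acct_flush(void) {
    if (live_delta != 0) {
        atomic_fetch_add_explicit(&global_live_bytes, live_delta,
                                  memory_order_relaxed);
        live_delta = 0;
    }
}

/* Record an allocation; touches shared state only past the threshold. */
static void acct_alloc(size_t n) {
    assert(n <= (size_t)INT64_MAX);
    live_delta += (int64_t)n;
    if (live_delta >= ACCT_FLUSH_BYTES) acct_flush();
}

/* Record a free; a large negative delta is flushed the same way. */
static void acct_free(size_t n) {
    assert(n <= (size_t)INT64_MAX);
    live_delta -= (int64_t)n;
    if (live_delta <= -ACCT_FLUSH_BYTES) acct_flush();
}

/* Approximate global view: lags each thread by less than the threshold. */
static int64_t acct_global_live(void) {
    return atomic_load_explicit(&global_live_bytes, memory_order_relaxed);
}
```

The trade-off in this sketch is that the global figure can lag each thread by up to the threshold, which is acceptable for a GC trigger that only needs an approximate heap size.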


Bugs fixed (correctness)

Each is provider-neutral and was reproduced under heavy concurrent alloc/free (a
multi-reactor HTTP server + churn micro-benchmarks). The commit hashes below are the
originating development commits; each commit on this PR branch carries a
(cherry picked from commit …) trailer, and the PR's Commits tab is the authoritative
per-change view.

| # | Fix | Commit | Notes |
|---|-----|--------|-------|
| 1 | Collector self-scan anchored at the real SP, not the frame pointer | 50fde691 | setjmp spills callee-saved regs below the FP; an FP-anchored scan missed a live root held only in a spilled reg → reclaimed-while-live. |
| 2 | Advance gc_cycle + GC trigger under STW, before resuming the world | 8baa8db0 | A resumed mutator stamping a fresh span's sweep_gen with the old cycle let the next sweep recycle a still-in-flight span (UAF). TSan: 30→0 races. |
| 3 | Publish narenas with release/acquire | c39ce23f | Lock-free vgc_find_span read narenas while span_alloc wrote it under lock: a publication race on the arena it gates. TSan-pinpointed. |
| 4 | Publish page_span slots with release/acquire | 46d2ae5a | Spans carved from an existing arena don't bump narenas, so the page-map writes weren't published to the lock-free find_span reader → stale span. |
| 5 | mcache bitmap atomic fetch_or/fetch_and + atomic count | 46d2ae5a | Unlocked RMW on alloc_bits (alloc fast path) raced a cross-thread free's RMW under the central lock → one slot handed out twice. |
| 6 | vgc_span_alloc_obj two-pass scan start-byte coverage | 871dceda | The free-index offset was applied in both passes, leaving [0,start_bit) of the start byte scanned in neither → a span with a free low slot reported "full" → vgc_malloc NULL → caller null-deref. |
| 7 | Don't reclaim mcache-resident spans in sweep | 871dceda | A cached span that momentarily empties was recycled while still referenced by an mcache slot / a suspended owner's local (span descriptors live outside the GC arena, so the root scan can't protect them). Fix: stamp registered threads' cached spans' sweep_gen under STW. |
| 8 | Conservative backstop mark; drop the unsound per-span ptrmap | 004b02f2 | ptrmap was a per-span property set by the first typed alloc, but a size class packs many types → objects whose layout differed had live child pointers skipped → reclaimed-while-reachable. Conservative scanning over-retains, never under-retains. |
| 9 | Sound eager-drop for aliased call results + ?&T free-method codegen + vgc_free central lock | 38607b2d | (a) a heap value bound from a call may alias the callee's traversed sub-objects → pin it (backstop, don't deep-drop); (b) option-of-pointer free emitted a struct member-access on the _option_* wrapper (C compile error); (c) vgc_free now takes the per-class central lock (was a real MP soundness gap). |
| 10 | Option-aware free methods for ?SumType / ?[]T fields | 26ac2bbe | gen_free_for_sumtype/_array emitted it->_typ/it->len on the _option_* wrapper → C error for any program freeing such a field under autofree/Perceus. |
| 11 | contains_ptr treats ?T / !T as pointer-bearing | d112d5c8 | []?int was flagged noscan (the option strips to .int), but _option_int carries an IError pointer → a pointer-bearing object marked noscan. |
| 12 | Four -gc e correctness fixes: map tiny-free, Perceus drop, HEAP_vgc arity, overflow-thread panic | a95aff916b | Incl. the >vgc_max_threads case that indexed caches[-1] and recursed through malloc in the panic path. The HEAP_vgc-arity fix alone cleared 22 of 34 of V's own -gc e test failures. |
| 13 | Drop extraneous ) when freeing an option-pointer local (b := &?Foo{}) | 3bcf843fb9 | The option branch closed the free call's paren and the shared tail closed it again → free((Foo**)b.data)); C error. Fixes option_init_ptr_test under -gc e; boehm/none unaffected. |
| 14 | Generate the option-element free for []?T | 3d537762 | An array of ?string referenced _option_string_free, a wrapper no path generated (the unwrapped sym has a user free → string-construct branch). Now inline the option-element payload free. Fixes option_ifguard_array_of_option_test. |
| 15 | Atomic live_threads in register/unregister | c69fd59b | vgc_maybe_gc reads live_threads lock-free for per-thread GC pacing; the plain ++/-- raced that atomic read (TSan-flagged). |
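
Fixes 3 and 4 both rely on the same release/acquire publication pattern, which can be sketched in C11 as follows. This is a simplified illustration, not the vgc source: `Arena`, `arena_publish`, `arena_find`, and `MAX_ARENAS` are hypothetical names, and the writer is assumed to be serialized by an external lock, as the notes describe for the span allocator.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ARENAS 8  /* hypothetical cap for the sketch */

typedef struct {
    uintptr_t base;
    size_t size;
} Arena;

static Arena arenas[MAX_ARENAS];
static _Atomic size_t narenas;  /* the count gates which entries are valid */

/* Writer (assumed to hold the heap lock): fill the entry, then publish. */
static int arena_publish(uintptr_t base, size_t size) {
    assert(size > 0);
    size_t n = atomic_load_explicit(&narenas, memory_order_relaxed);
    if (n == MAX_ARENAS) return -1;
    arenas[n].base = base;
    arenas[n].size = size;
    /* Release store as the LAST step: a reader that observes n + 1 is
     * guaranteed to also observe the fields written above. */
    atomic_store_explicit(&narenas, n + 1, memory_order_release);
    return (int)n;
}

/* Lock-free reader: the acquire load pairs with the release store. */
static int arena_find(uintptr_t p) {
    size_t n = atomic_load_explicit(&narenas, memory_order_acquire);
    for (size_t i = 0; i < n; i++) {
        if (p >= arenas[i].base && p - arenas[i].base < arenas[i].size) {
            return (int)i;
        }
    }
    return -1;
}
```

A side effect of publishing last is that a writer which bails out early leaves a half-built entry invisible, since the count never covers it.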

Optimizations (performance)

| # | Optimization | Commit | Claimed effect (verify) |
|---|--------------|--------|-------------------------|
| A | Per-thread heap accounting (Go per-P style): alloc/free bump thread-private live_delta/alloc_delta, flush to the global atomics only every ~1 MB | 677770dd / 38607b2d | Removes global-atomic cacheline contention on the accounting path. |
| A2 | Lock-free free fast path for mcache-resident / dropped spans (on_central == 0): vgc_free skips the per-class central[].lock (kept only for spans actually on a central list); bitmap+count stay atomic, the fetch_and prior value gates the decrement (double-free-safe) | c69fd59b | The residual-#4 fix had added that lock to every non-tiny free → a same-class free storm (bench_scalar: 8 threads alloc+drop one 32 B class) serialized N-way and anti-scaled (35→7 Mops/s T1→T8, below Boehm). With the skip it is near-linear again: 45→326 Mops/s T1→T8 (7.2×, ~5.5× Boehm at T8); bench_mp T1 5.9→76. Verified residual-#4-safe (white-box selftest + container churn 15 rounds niltrace=0 + TSan 0). |
| B | Perceus deep-free of nested heap fields of a dropped &Foo, gated by a sound deep-drop analysis | 677770dd | Nested-object MP T1→T8 ~7.5× (near-linear); removes the GC pressure that compounded MP contention. |
| C | In-place reuse: direct indexed stores for reused map slots | d7e9f5a1 | Avoids re-hashing on reuse-in-place. |
| D | Dynamic span registry (mmap-backed, lazily committed) + env-gated GC pacing | 72edb9e5 | Removes a fixed 262k-span cap; lets the trigger scale without a hard abort. |
| E | Concurrent tri-color mark behind -d vgc_concurrent (opt-in, STW stays default) | d22ae0ee | ~1.2–1.4× on a parallel alloc-heavy fold vs STW; not proposed for default (needs a sound GC-assist first). |
| F | Alloc-path lock-contention removal (B18): span-descriptor bump slab (drops a per-carve mmap from under the heap lock; ~3× lower RSS) + drop full spans instead of returning them to the central full-list (the never-reused per-fill central-lock traffic) + per-thread GC pacing on by default (adaptive: only when >1 mutator) | 82e39343 | Parallel alloc-heavy workload recovered from anti-scaling (≈parity) to ~3.8× its serial at a high trigger; ~3× lower RSS. |
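
The "fetch_and prior value gates the decrement" idea in A2 can be sketched with C11 atomics. This is a hedged, single-word illustration rather than the vgc implementation: the `Span` layout, the 64-slot limit, and the function names are assumptions made for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SLOTS 64  /* one 64-bit word of allocation bits, for the sketch */

typedef struct {
    _Atomic uint64_t alloc_bits;  /* bit i set => slot i is allocated */
    _Atomic uint32_t nalloc;      /* number of allocated slots */
} Span;

/* Claim the lowest free slot, or return -1 if the span is full. */
static int span_alloc_slot(Span *s) {
    for (int i = 0; i < SLOTS; i++) {
        uint64_t mask = (uint64_t)1 << i;
        if (atomic_load_explicit(&s->alloc_bits, memory_order_relaxed) & mask) {
            continue;
        }
        uint64_t prev = atomic_fetch_or_explicit(&s->alloc_bits, mask,
                                                 memory_order_acq_rel);
        if ((prev & mask) == 0) {  /* this call set the bit, so it owns the slot */
            atomic_fetch_add_explicit(&s->nalloc, 1, memory_order_relaxed);
            return i;
        }
    }
    return -1;
}

/* Free slot i without a lock. Returns false on a double free. */
static bool span_free_slot(Span *s, int i) {
    assert(i >= 0 && i < SLOTS);
    uint64_t mask = (uint64_t)1 << i;
    uint64_t prev = atomic_fetch_and_explicit(&s->alloc_bits, ~mask,
                                              memory_order_acq_rel);
    if ((prev & mask) == 0) {
        return false;  /* bit was already clear: do not decrement again */
    }
    atomic_fetch_sub_explicit(&s->nalloc, 1, memory_order_relaxed);
    return true;
}
```

Because only the caller that actually flips the bit adjusts the count, two racing frees of the same slot cannot both decrement it, and two racing allocations cannot both claim it.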

Profile evidence: under parallel churn the alloc fast path was ~98% spin on two global
locks (vgc_heap.lock for span carving — whose hold included an mmap syscall — and the
per-class central lock for span return); 8 separate processes scaled but 8 in-process
workers did not, isolating the cost to in-process shared allocator state (not bandwidth).


Soundness evidence

Scope / what to review carefully

  • New GC backend is large; suggest reviewing vgc_d_vgc.c.v (allocator) and
    vgc_gc_d_vgc.c.v (collector) first, then perceus.v (analysis) and the codegen
    touch-points (assign.v, auto_free_methods.v, autofree.v, cgen.v, fn.v).
  • Not for default upstream: -d vgc_concurrent (needs a sound GC-assist),
    vgc_verify tooling (debug-gated), and the experimental cx_region.c.v /
    transport-layer patches (consumer-specific; excluded from this proposal).
  • Based on a83aabb10f; a rebase onto current master is required.
  • Known follow-ups: full deferred cross-thread free (the targeted lock-free path #A2
    already covers owner-frees of mcache-resident spans — the dominant case; a complete
    mimalloc-style per-span atomic thread-free list would also make cross-thread frees of
    central-listed spans lock-free); sound
    concurrent-mark GC-assist (cooperative safepoints); generational
    option; and V's own -gc e codegen edge cases — ~10 of 2146 vlib/v/tests programs
    (all -gc e-specific, pass under none/boehm), characterized as three families:
    (1) option-wrapper / generic / sub-module _free not generated — e.g. an array of
    ?string references builtin___option_string_free but the value-option wrapper free is
    never emitted (free-method generation vs -skip-unused DCE; the unwrapped element sym
    has a user free, so the option-wrapper free path is skipped);
    (2) reflection metadata reclaimed (4 reflection / generic-anon-fn tests segfault);
    (3) Perceus string early-drop (3 tmpl/comptime/interface-str tests produce
    truncated/aliased strings). These touch shared autofree/option/Perceus codegen
    (boehm-regression-sensitive) and runtime mark soundness — each warrants a dedicated pass,
    not bundled here. Fix #13 above cleared one (option_init_ptr).

Test environment (so the numbers mean something — and what's NOT covered)

Everything below was measured on a single machine. This is a real limitation: we have
not tested other CPUs, x86, or native (non-virtualized) Linux. Please reproduce on your
own hardware.

  • Dev + all macOS benchmarks: Apple M2 Max, 12 cores (8 performance + 4
    efficiency), 64 GB RAM, macOS 26.4.1 (build 25E253), Apple clang 21.0.0. -prod
    builds via -cc cc.
  • Linux correctness/concurrency testing: a Docker container (Ubuntu 24.04.4,
    clang 18.1.3, wrk 4.1.0), aarch64, i.e. Linux 6.12 (linuxkit) running in Docker's
    VM on that same M2 Max, not a separate native or x86 host. TSan + the
    concurrent-HTTP churn reproducer ran here. So: arm64 only; x86, native Linux, and
    other core counts are unverified. The collector's conservative stack/register scan
    and the OS-suspend STW path are platform-sensitive; independent runs on x86/native
    Linux are exactly the verification we're asking for.
  • Numbers are best-of-3 (compute benches) wall-clock; -gc boehm is the baseline.

How to verify

# build the dev compiler, then for any program:
v -gc e   prog.v     # Perceus front line + vgc backstop
v -gc vgc prog.v     # backstop only
v -gc e -d vgc_concurrent prog.v   # opt-in concurrent mark
# benches (provider-neutral):
v -enable-globals -gc e bench/parallel-alloc/poc/bench_scalar.v   # MP alloc scaling vs boehm
v -enable-globals -gc e bench/parallel-alloc/poc/bench_mp.v
v -gc e test bench/parallel-alloc/vgc_residual4_test.v        # white-box fix self-check

eptx and others added 26 commits June 14, 2026 20:18
libgc's parallel mark defaults to one helper thread per core. On macOS
every stop-the-world collection then wakes N-1 mark helpers that contend
(mach thread_suspend/resume + mark-queue spin), starving the
application's own worker threads. For an allocation-heavy multi-threaded
server (the cx picoev multi-reactor HTTP leg) this collapses throughput.

Measured on a 12-core M-series, serve-file [?http-service] under
wrk -t8 -c100:
  parallel mark (default): ~48.5K req/s, ~5 active cores
  single marker:           ~125-131K req/s  (2.6x)

Fix: emit GC_set_markers_count(1) in the main() boehm preamble on macOS,
before GC_INIT() — GC_thr_init computes the marker count there and starts
the helpers eagerly, so the call must precede it (an equivalent call in
_vinit runs too late). Mirrors the GC_set_markers_count(1) already done
for shared libs. The GC_MARKERS env var still overrides this (read first
in GC_thr_init). Scoped to macOS; Linux parallel-mark is left at default
pending separate measurement.

(cherry picked from commit 0421ce3)
The C codegen for global declarations handled @[volatile] but silently
ignored @[thread_local], so a @[thread_local] __global compiled to a
plain process-shared global. Any concurrent use of such a global then
raced across threads. Emit the __thread storage-class (GCC/Clang/tcc;
valid for the zero/nil/literal initializers V produces for globals)
when the decl carries the attribute. Required by the scope-aware region
allocator, whose per-thread state must be genuinely thread-local.

(cherry picked from commit 13e1ce9)
…c to ON

The bundled gc.c amalgamation embedded `/* #undef THREAD_LOCAL_ALLOC */`
in its autoheader config — the one alloc-lock mitigation left OFF. Flip it
to `#define THREAD_LOCAL_ALLOC 1` (the canonical configure-equivalent of
`--enable-thread-local-alloc=yes`); the TLA implementation is already present
in the amalgamation (guarded by `#ifdef THREAD_LOCAL_ALLOC`), so this activates
real code. Compiles clean.

This affects the Linux / `-prod`-bundled `gc.o` path only. The macOS `cx`
build links the prebuilt `thirdparty/tcc/lib/libgc.a` (built by
thirdparty-macos-arm64_bdwgc.sh), which already defaults TLA on — so the flip
brings the bundled config into parity with what macOS already ships.

Companion runtime lever (already present at this pin): the macOS marker-pin
`GC_set_markers_count(1)` before `GC_INIT()` in cmain.v::gen_boehm_gc_init().
Together: ~2.4-5x on MARK-bound multi-thread work. Neither removes the
GC_allocate_ml alloc-lock, so alloc-heavy `[?map [par]]` (cx-private vlang#14) stays
~1.3x slower than serial — partial relief by design. Full acceptance +
measurements in cx-private bench/parallel-alloc/P0-ACCEPTANCE.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit ae03a36)
…vgc backstop + Linux port + per-thread accounting + deep-free

Forward-port of the full E work (developed on the upstream-master clone) onto the
latest-V (a83aabb, Jun-11) base, on top of the cherry-picked cx patches. Adds:
- P3 minimal STW mark-region backstop collector (vgc_*), churn-correct on macOS.
- Linux backstop port: signal-suspend+ACK + dl_iterate_phdr ELF roots (vgc_platform.h).
- P1/P2 Perceus front line, DECOUPLED from -autofree (fires on the `perceus` define
  alone) so it composes with the tracing backstop.
- Per-thread heap accounting (live_delta/alloc_delta) → R2 alloc-MP near-linear.
- Sound deep-free of nested heap fields of a dropped &Foo (perceus.v deep-drop analysis).
- `-gc e` unified flag = vgc backstop + Perceus front line (opt-in; boehm stays default).

perceus.v is a new cgen pass (was untracked in the clone; now committed here).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit 677770d)
A vgc span serves one size CLASS but real workloads pack many TYPES (plus
conservative ptrmap==0 allocations) into it. The span recorded only the FIRST
typed allocation's ptrmap and applied it to every object, so any object whose
real pointer layout differed had live child pointers skipped during mark ->
reclaimed-while-reachable (observed as corrupted results in a deep alloc-heavy
serial fold). The backstop now scans every scannable (non-noscan) span
conservatively: finds every pointer, may over-retain, never under-retains. The
backstop runs rarely behind the Perceus front line, so the cost is negligible.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit 004b02f)
gen_free_for_sumtype/gen_free_for_array were not .option-aware and emitted
it->_typ / it->len against the _option_* wrapper, producing a C error for any
program with a ?SumType or ?[]T field freed under autofree/perceus. Mirror the
existing option handling in gen_free_for_map/struct; no-op for non-option types.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit 26ac2bb)
Remove references to the downstream consumer/repo from comments so the runtime
and Perceus analysis read as standalone upstream V work. No code change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit 6f94f55)
Option/result wrappers always embed an IError err field (a pointer-bearing
interface) regardless of the payload, so `?int` etc. contain a pointer even
though int does not. contains_ptr called final_sym, which strips the
.option/.result flags, and so classified `[]?int` as scan-free -> emitted a
_noscan allocation whose live err pointer a conservative GC mark would skip
(potential reclaim-while-reachable). Check the flags before final_sym.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit d112d5c)
When the Perceus in-place map reuse fires, the result buffer is the receiver's
own buffer (cap >= len), so the map loop can store results by ascending index
instead of calling array_push. This removes the per-element call + grow-check +
memmove and, by making the loop body transparent, lets the C optimizer vectorize
it and elide the (non-escaping) allocation. ~3x faster than Boehm single-thread
on the reuse micro-bench (prod), parity-or-better otherwise; correctness
unchanged (reuse precondition already proved unique+dead+same-size).
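
As a rough picture of what "store results by ascending index" means at the C level, the sketch below maps a function over an array in place. It assumes the reuse precondition has already been established elsewhere (the buffer is uniquely owned and dead after the call); `IntArray`, `map_in_place`, and `twice` are hypothetical names, not identifiers from the generated code.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int *data;
    size_t len;
    size_t cap;
} IntArray;

/* Map f over `a`, writing each result back into the same buffer.
 * Precondition (assumed proved by the compiler analysis, only
 * sanity-checked here): the caller owns `a` uniquely and never
 * reads the old contents again. */
static IntArray map_in_place(IntArray a, int (*f)(int)) {
    assert(a.cap >= a.len);
    for (size_t i = 0; i < a.len; i++) {
        a.data[i] = f(a.data[i]);  /* direct indexed store: no push, no grow check */
    }
    return a;  /* the result aliases the input buffer by design */
}

/* Example element function. */
static int twice(int x) { return 2 * x; }
```

With no call or capacity check inside the loop body, a C optimizer has a plain counted loop to work with, which is the property the commit message credits for vectorization.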

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit d7e9f5a)
The trace ring + async-signal-safe crash dump (P3 bring-up scaffolding) are now
compiled only under `-cflags -DVGC_DIAG`; without it vgc_trace/vgc_trace_init are
zero-overhead no-ops and the startup vgc_say probes are removed, so a -gc vgc/e
binary emits nothing to stderr. The loud span-registry-overflow abort message is
preserved (vgc_say + the raw stderr write helpers stay).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit 25c1145)
…drop, HEAP_vgc arity, overflow-thread panic)

Forward-ported from the upstream-tracking clone (vlang-v-latest). All V-only,
CX-agnostic, with standalone patches + CX-free repros in cx-private
bench/parallel-alloc/.

1. vgc tiny-allocator free clobbers live siblings: the tiny allocator packs
   several sub-16B noscan objects into one span slot; vgc_free cleared the whole
   slot, reclaiming live tiny neighbors (short map-key char buffers) -> map
   corruption under -gc vgc/e. Fix: per-span is_tiny flag; vgc_free defers
   tiny-block slots to the tracing collector (Go model).

2. Perceus drops a store-target index var before its use: pcs_lower_stmt
   AssignStmt left a store target's index/container idents (m[key]=v) out of the
   use-set, so Perceus dropped `key` at its declaration, freeing it before
   map_set cloned it. Fix: non-Ident store targets collect all idents as uses.

3. HEAP_vgc macro arity: HEAP_vgc(type,expr,ptrmap,nptrs) was emitted with 2
   args by some paths -> "too few arguments to function-like macro". The ptrmap
   is dead at runtime (conservative-mark backstop ignores it), so HEAP_vgc == HEAP.
   Fix: always emit plain HEAP under vgc (cgen.v + assign.v).

4. vgc overflow-thread panic: >vgc_max_threads(64) concurrent threads exhaust the
   fixed cache table -> cache_idx=-1 -> caches[-1] out-of-range panic. Fix:
   cache_idx<0 allocates from central (vgc_cache_get_span) + folds accounting into
   the global atomics (vgc_acct_alloc).

Validated: vlib map_test/array_test/string_test green under -gc vgc AND -gc e;
g_churn battery + corpora none==e clean; cx default gate 125/125 + conformance
green (no regression). NOTE: building cx itself with -gc e surfaces 6 PRE-EXISTING
cx-under-gc-e failures (fail on the pre-fix fork too) — tracked separately, not
caused by these fixes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit a95aff9)
Two CX-agnostic V-runtime changes for parallel alloc-heavy scaling under the
full-STW vgc backstop (-gc e/vgc):

1. Dynamic span capacity (always-on, sound robustness fix). allspans was a fixed
   inline [262144]&VGC_Span; a higher GC trigger exhausted it -> the loud
   vgc_say(0xDEAD) abort. Now allspans is an mmap-backed pointer (vgc_os_alloc,
   16M-entry default = 128MB address space, lazily committed by the OS),
   allocated once on the first vgc_span_alloc under vgc_heap.lock (NOT in
   vgc_init -- spans are allocated during _vinit, before vgc_init runs). The
   pointer never moves, so the collector's lock-free allspans walks (incl. lazy
   sweep) never see a relocated/freed buffer. Loud abort kept as the backstop at
   the (now huge) cap. Env override VGC_ALLSPANS_CAP.

2. Per-thread GC pacing knobs (env-gated, DEFAULT OFF -> a build with no env set
   is byte-identical to before). VGC_NEXT_GC_MB raises the trigger floor;
   VGC_PACE scales the live trigger by live_threads so N concurrent allocators
   don't trip the shared trigger N x more often per unit of per-thread progress.
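
The first change can be pictured as reserving a large, fixed-address table with an anonymous mapping and letting the OS commit pages on first touch. The following is a minimal POSIX sketch under that assumption; `SpanRegistry`, `registry_init`, and `registry_add` are hypothetical names, and the real code goes through vgc_os_alloc rather than calling mmap directly.

```c
#define _DEFAULT_SOURCE  /* for MAP_ANONYMOUS on glibc under strict modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_ANONYMOUS
#define MAP_ANONYMOUS MAP_ANON  /* older BSD/macOS spelling */
#endif

typedef struct Span Span;  /* opaque for this sketch */

typedef struct {
    Span **slots;  /* fixed address for the lifetime of the process */
    size_t cap;    /* capacity in entries */
    size_t len;    /* entries in use */
} SpanRegistry;

/* Reserve address space for `cap` pointers. Pages are committed by the
 * OS only when first touched, so a large cap costs little up front. */
static int registry_init(SpanRegistry *r, size_t cap) {
    assert(cap > 0);
    void *p = mmap(NULL, cap * sizeof(Span *), PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;
    r->slots = p;
    r->cap = cap;
    r->len = 0;
    return 0;
}

/* Append a span; the table never moves, so concurrent lock-free
 * walkers never see a relocated buffer. Returns -1 at the cap. */
static int registry_add(SpanRegistry *r, Span *s) {
    if (r->len == r->cap) return -1;
    r->slots[r->len++] = s;
    return 0;
}
```

With 8-byte pointers, a 16M-entry table is 128 MB of address space, matching the figure in the commit message; only the touched pages count toward resident memory.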

Validated (clone, identical changes): g_churn battery 0 corruptions (default +
VGC_PACE), corpora none==e byte-identical, map_test/array_test green under -gc e,
v2 self-hosts; cx gate 125/125 + conformance green built against this runtime.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit 72edb9e)
Mark the live graph while mutators run, with two brief STW points (start:
snapshot roots + enable barrier + alloc-black; termination: re-scan dirtied
roots + dirty spans, final drain, sweep). STW stays the default and is
byte-for-byte diff-able.

- card/dirty-span write barrier (vgc_wb_store): marks the mutated object's span
  dirty BEFORE the store; collector re-scans dirty spans at mark-termination
  (vgc_rescan_dirty_spans). Preemption-safe under mach-suspend (unlike an
  immediate-shade enqueue, which can be lost mid-barrier) and keeps the mark
  queue collector-exclusive.
- codegen emission in assign.v (gen_cm_write_barrier) for heap-targeted
  pointer-bearing stores; over-approximating, side-effect-free bases only. Fixes
  a gap where obj[i+k].field stores were skipped (InfixExpr index).
- builtin bulk-mutator barriers (array push/ensure_cap/set/clone/insert, map set,
  vgc_realloc, vgc_memdup*) for pointer moves via memcpy.
- alloc-black during mark (vgc_alloc_black_hook).
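
The dirty-span barrier described in the first bullet can be sketched as follows. This is an illustrative model only: the `Span` layout, the 4-slot array, and the names `wb_store` and `rescan_dirty_spans` are assumptions for the example, and a real collector would re-mark the span's pointers where the comment indicates.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define SPAN_SLOTS 4  /* tiny span for the sketch */

typedef struct {
    _Atomic unsigned char dirty;  /* set by mutators, cleared by the collector */
    void *slots[SPAN_SLOTS];      /* pointer-bearing payload */
} Span;

/* Write barrier: flag the span dirty BEFORE performing the store, so a
 * mutator suspended between the two steps has already left a record. */
static void wb_store(Span *span, size_t i, void *val) {
    assert(i < SPAN_SLOTS);
    atomic_store_explicit(&span->dirty, 1, memory_order_release);
    span->slots[i] = val;
}

/* Mark-termination step: visit and clear every dirty span.
 * Returns how many spans needed a re-scan. */
static size_t rescan_dirty_spans(Span *spans, size_t n) {
    size_t rescanned = 0;
    for (size_t k = 0; k < n; k++) {
        if (atomic_exchange_explicit(&spans[k].dirty, 0,
                                     memory_order_acq_rel)) {
            /* A real collector would re-mark pointers in spans[k] here. */
            rescanned++;
        }
    }
    return rescanned;
}
```

Because the barrier only sets a flag, the mark queue stays exclusive to the collector, which is the property the commit message contrasts with an immediate-shade enqueue.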

GC-assist deliberately NOT wired: unsound under preemptive suspend (an assisting
mutator frozen mid-scan of a popped grey object orphans it). Heap overshoot
during long marks is bounded by the existing exhaustion->force-collect net.

Default build emits no barrier calls and is unchanged. Gates green under the
define: corpora none==e byte-identical, map/array tests, g_churn battery,
multi-thread cm_stress, self-host, ASan == STW baseline. Measured payoff:
large-live-set [par] ~1.2x faster than STW.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit d22ae0e)
…codegen + vgc_free central lock

Three independent soundness fixes for the `-gc e` front line (Perceus RC +
precise STW tracing backstop), all general-purpose:

1. perceus.v — return-aliasing pin. A heap value bound from a CALL may alias the
   callee's receiver/args or sub-objects it traversed and copied out by value
   (e.g. a tree walk returning a []Element of shallow copies that share child
   buffers and can overlap one another). The old assign-aliasing pin only fired
   when an RHS *ident* was itself heap-owning, so such results were treated as
   uniquely owned and eager deep-dropped -> double free. Now any non-fresh
   call-bound aggregate is marked shared and left to the tracing backstop; only
   proven-fresh producers (map/filter over primitives, fresh string builds) stay
   droppable. Sound by construction: marking shared only suppresses a drop.

2. auto_free_methods.v — gen_free_for_option_ptr. A `?&T` field's free method
   resolved the sym to T's struct and inlined T's field frees treating the option
   payload as a T value, emitting `((T**)&it->data)->field` -> "base type 'T *' is
   not a structure". Now the pointee is freed via the base `_free`
   (`T_free(*(T**)&it->data)`), mirroring the non-option `&T` field case.

3. vgc_free — take the per-size-class central lock around the alloc-bit / count /
   free-index mutation, the same lock the allocator and the collector use. Closes
   a race between an eager free and a concurrent allocation of the same class.
   Obeys the lock-before-suspend discipline: brief, no allocation/blocking, so the
   collector (which pre-acquires every central lock) cannot deadlock.

Corpora byte-identical (none==e), churn battery 0-corrupt, multi-thread stress
0 mismatches, serial reuse throughput unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit 38607b2)
…ointer

The STW collector captures the triggering (collector) thread's own roots via a
setjmp trampoline (vgc_run_gc_spilled) and the cooperative-safepoint park
(vgc_park_spill), then conservatively scans [sp, stack_base]. Both anchored `sp`
at __builtin_frame_address(0) — the FRAME POINTER. But setjmp spills the
callee-saved registers into its jmp_buf, which lives BELOW the frame pointer
(between SP and FP). So the scanned range excluded the spill area, and a live
mutator root held only in a callee-saved register — routine under -Os — was
never scanned and got reclaimed while still live. Surfaced as a sporadic
signal-11 under concurrent allocation (e.g. an HTTP server hammered with many
connections): a value reachable through a worker thread's registers was swept,
then read after free at a scattered point in the request path.

Add vgc_real_sp() (reads the actual SP register on arm64/x86_64/i386, falls back
to __builtin_frame_address) and anchor both self-scan paths at it, clamped to
<= &buf so the jmp_buf spill area is always covered. Also route vgc_get_sp()
through it for consistent registration/refresh ranges. Sound by construction:
the scanned range only ever grows to include the true lowest in-use stack
address. Suspended threads were already correct (their SP comes from
thread_get_state); only the self/collector path relied on setjmp.
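
A helper in the spirit of vgc_real_sp() might look like the sketch below. It uses GCC/Clang inline assembly and builtins, so it is compiler-specific, and the names `real_sp` and `scan_low` are hypothetical; the actual implementation in the PR may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Read the actual stack pointer register where we know how to;
 * otherwise fall back to the frame address (GCC/Clang builtin). */
static inline uintptr_t real_sp(void) {
    uintptr_t sp;
#if defined(__aarch64__)
    __asm__ volatile("mov %0, sp" : "=r"(sp));
#elif defined(__x86_64__)
    __asm__ volatile("mov %%rsp, %0" : "=r"(sp));
#elif defined(__i386__)
    __asm__ volatile("mov %%esp, %0" : "=r"(sp));
#else
    sp = (uintptr_t)__builtin_frame_address(0);
#endif
    return sp;
}

/* Lower bound for a conservative self-scan: never above the spill
 * buffer, so the register spill area is always inside the range. */
static uintptr_t scan_low(const void *spill_buf) {
    assert(spill_buf != 0);
    uintptr_t sp = real_sp();
    uintptr_t buf = (uintptr_t)spill_buf;
    return sp < buf ? sp : buf;
}
```

Taking the minimum of the two addresses can only widen the scanned range on a downward-growing stack, which is the "sound by construction" argument made above.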

CX-free, general-purpose: affects -gc vgc and -gc e (shared trampoline). The
compiler self-build is .no_gc so the toolchain binary is unaffected.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit 50fde69)
…e world

The STW collector resumed all mutators (vgc_resume_thread loop) and THEN ran its
end-of-cycle bookkeeping — vgc_update_trigger() and `gc_cycle++` — while mutators
were already running. gc_cycle is read by mutators to stamp a freshly-acquired
span's `sweep_gen` (vgc_span_alloc / vgc_central_get_span / vgc_get_free_span),
and the sweep's in-flight-span guard (vgc_sweep_span) recycles an empty span only
when `sweep_gen != gc_cycle`. So in the tiny window between resume and `gc_cycle++`,
a resumed mutator could acquire an empty, still-in-flight span and stamp it with
the OLD cycle; the next cycle's sweep then saw `sweep_gen != gc_cycle` and recycled
that span out from under the mutator — a use-after-free. Pathologically narrow and
timing-sensitive (it masks under every instrumentation), so it surfaced only as a
sporadic signal-11 under sustained concurrent allocation (an HTTP server hammered
with many connections), on both the mach (macOS) and signal (Linux) STW backends.

ThreadSanitizer (Linux) pinned it precisely: a data race on `vgc_heap.gc_cycle`
between vgc_gc_start (writer) and vgc_span_alloc (reader). Moving the bump + the
trigger update ahead of the resume loop — while the world is still stopped — closes
the window: every post-resume acquisition stamps exactly the cycle the next sweep
checks against, and the just-run sweep still correctly skipped spans acquired during
this cycle (it ran at the pre-bump cycle). After this change TSan reports zero vgc
data races (was 30), and the dominant concurrent-HTTP crash is gone.

CX-free, general-purpose; affects -gc vgc and -gc e (shared STW path).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit 8baa8db)
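
The ordering argument above can be replayed in a small single-threaded model. The types and function names here are hypothetical, chosen for illustration; only the `sweep_gen != gc_cycle` guard and the two orderings come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint32_t sweep_gen; uint32_t live; } Span;   /* hypothetical */
typedef struct { uint32_t gc_cycle; } Heap;                   /* hypothetical */

/* Mutator side: a freshly acquired span is stamped with the current cycle. */
static void acquire_span(const Heap *h, Span *s) { s->sweep_gen = h->gc_cycle; }

/* Sweep side: an empty span is recycled only if it was not stamped this cycle. */
static bool sweep_would_recycle(const Heap *h, const Span *s) {
    return s->live == 0 && s->sweep_gen != h->gc_cycle;
}

/* Replays the end of one cycle and the guard check of the next sweep.
 * Returns true if the span a mutator acquired right after resume is recycled. */
static bool span_lost_after_cycle(bool bump_before_resume) {
    Heap h = { 7 };
    Span s = { 0, 0 };
    assert(s.live == 0);
    if (bump_before_resume) {
        h.gc_cycle++;            /* world still stopped: cycle becomes 8 */
        acquire_span(&h, &s);    /* resumed mutator stamps 8 */
    } else {
        acquire_span(&h, &s);    /* resumed mutator stamps the old cycle 7 */
        h.gc_cycle++;            /* collector bumps too late */
    }
    return sweep_would_recycle(&h, &s);   /* next sweep checks against 8 */
}
```

With the bump after resume the stamp (7) differs from the cycle the next sweep checks (8), so the in-flight span is recycled; with the bump first, both are 8 and the guard holds.
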
…_alloc data race

SUBSTANTIVE FIX (vgc_d_vgc.c.v): vgc_find_span / vgc_is_heap_ptr read
vgc_heap.narenas LOCK-FREE (they run on the collector's STW conservative scan,
where a lock would deadlock against a frozen mutator) while vgc_span_alloc wrote
it under vgc_heap.lock. That is a publication data race: a lock-free reader could
observe narenas grow before the new arenas[idx] (base/size/page_span map) it
gates was published, then read a stale base/size or a nil page_span -> wrong or
missing span -> heap corruption. ThreadSanitizer pinpointed it (read in
vgc_find_span vs write in vgc_span_alloc, global vgc_heap) under concurrent
HTTP connection-teardown churn; the corruption surfaced as a segfault in the
request path. Fix: bump narenas as the LAST step of vgc_span_alloc with an
atomic RELEASE store (after arenas[idx] fields + page map + allspans), and load
it with an atomic ACQUIRE in the lock-free readers. TSan: 1 race -> 0. This also
makes every early-return in vgc_span_alloc leave narenas consistent (a half-built
arena is simply never published). CX-agnostic; reproduces with any heavily
multi-threaded alloc/free workload.

DIAGNOSTIC TOOLING (inert; behind -d defines, zero codegen in normal builds —
verified a plain `-gc e` build is unaffected): a mark-closure verifier
(vgc_verify_mark_closure), a /proc/self/maps root-finder (vgc_rootfind_*), a data-
segment dump, and a low-perturbation watch (per-cycle reset + emit-on-sweep, pub
vgc_set_watch) under -d vgc_verify / -d vgc_watch; plus a coarse allocator lock
(vgc_alloc_lock + a per-thread re-entrancy guard) under -d vgc_coarse_alloc. These
proved the vgc mark/sweep collector is sound (the residual is allocator-side
concurrency, not reclamation) and are kept for the ongoing allocator-race hunt.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit c39ce23)
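
A minimal sketch of the release/acquire publication pattern described above, using C11 atomics. The `Arena` layout, `MAX_ARENAS`, and both function names are assumptions made for the example; the writer is assumed to already hold the heap lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ARENAS 8   /* hypothetical capacity */

typedef struct { uintptr_t base; size_t size; } Arena;   /* hypothetical layout */

static Arena arenas[MAX_ARENAS];
static _Atomic uint32_t narenas;

/* Writer (assumed to hold the heap lock): fill the slot first, publish last. */
static int arena_publish(uintptr_t base, size_t size) {
    assert(size > 0);
    uint32_t idx = atomic_load_explicit(&narenas, memory_order_relaxed);
    if (idx >= MAX_ARENAS) return -1;   /* early return: nothing is published */
    arenas[idx].base = base;
    arenas[idx].size = size;
    /* Release store: the slot writes above become visible before the count. */
    atomic_store_explicit(&narenas, idx + 1, memory_order_release);
    return (int)idx;
}

/* Lock-free reader: the acquire load pairs with the release store, so every
 * slot below n is fully initialized when it is read. */
static int arena_find(uintptr_t p) {
    uint32_t n = atomic_load_explicit(&narenas, memory_order_acquire);
    for (uint32_t i = 0; i < n; i++) {
        if (p >= arenas[i].base && p - arenas[i].base < arenas[i].size) return (int)i;
    }
    return -1;
}
```

Because the count is the last thing written, a reader either sees the old count and ignores the new slot, or sees the new count together with the completed slot.
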
Found via a reliable `Connection: close` teardown-churn repro
(wrk -t8 -c256 against the cx guide HTTP server); each fix pushes the
crash materially later (round 1 -> 8 -> 12). All CX-agnostic, observable
behaviour byte-identical. Full cx gate green under -gc e (V-impl 125/125
+ conformance + python/rust/go bindings + abi-c, prod+non-prod).

1. vgc page_span[] publication race (vgc_d_vgc.c.v). A span carved from an
   EXISTING arena writes arenas[i].page_span[pidx]=span under vgc_heap.lock
   but does NOT bump narenas, so the narenas release/acquire publication
   does not cover these slot writes. vgc_find_span reads page_span lock-free
   in the vgc_free/vgc_realloc hot path -> data race -> stale span -> heap
   corruption. Fix: atomic u64 RELEASE store on the slot + ACQUIRE load in
   vgc_find_span (paired, like narenas).

2. vgc mcache bitmap double-update (vgc_d_vgc.c.v, vgc_platform.h). The
   lock-free mcache fast path (vgc_span_alloc_obj) did a plain
   read-modify-write on span.alloc_bits/alloc_count while a concurrent
   cross-thread vgc_free RMW'd the SAME span's bitmap under
   central[class].lock -- the lock does not exclude the unlocked alloc side
   -> lost update -> a slot handed out twice. Fix: atomic OR (alloc) /
   atomic AND (free) on alloc_bits + atomic add/sub on alloc_count (new u8
   fetch_or/fetch_and + sub_u32 helpers in all three cc branches).
   free_index stays a racy-but-safe scan hint; the STW sweep stays
   non-atomic (mutators suspended).

3. picoev handle_timeout stale-target assert (picoev.v). Under
   Connection: close teardown churn a timed-out fd can retain a residual
   timeouts entry whose target was already torn down (loop_id=-1, cb=nil).
   The assert aborted the reactor; poll_once already skips such targets.
   Fix: drop the stale timeout entry + skip, mirroring poll_once.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit 46d2ae5)
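
The allocation side of fix 2 can be illustrated with a C11 atomic OR. This is a sketch under assumptions, not the PR's code: the `Span` layout (32 slots, byte-wide bitmap) and `span_try_claim` are hypothetical, and the memory orders are one reasonable choice.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    _Atomic uint8_t  alloc_bits[4];   /* one bit per slot, 32 slots (assumed) */
    _Atomic uint32_t alloc_count;
} Span;

/* Claim a slot. The atomic OR returns the previous byte, so exactly one caller
 * observes the bit as clear and wins, even if a concurrent free is updating
 * another bit of the same byte. */
static bool span_try_claim(Span *s, uint32_t slot) {
    assert(slot < 32);
    uint8_t mask = (uint8_t)(1u << (slot & 7));
    uint8_t prev = atomic_fetch_or_explicit(&s->alloc_bits[slot >> 3], mask,
                                            memory_order_acq_rel);
    if (prev & mask) return false;    /* already owned: do not hand it out twice */
    atomic_fetch_add_explicit(&s->alloc_count, 1, memory_order_relaxed);
    return true;
}
```

A plain load, OR, store sequence can overwrite a bit that another thread changed in between; the single read-modify-write removes that window.
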
…2 races)

Under heavy multi-threaded alloc/free churn the vgc allocator could return
NULL for an in-range small allocation, surfacing as a null `&T{}` and a
caller null-deref. Two independent root causes, each strictly load-bearing
(ablation-confirmed); a third (a non-atomic alloc_bits read) was identified
but is not load-bearing and is left out to keep the alloc fast path
atomic-free.

* span scan missed the start byte's low bits. vgc_span_alloc_obj's two-pass
  free-slot scan applied the free_index start_bit offset in BOTH passes, so
  bits [0, start_bit) of the start byte were scanned in neither pass. A span
  with a free low slot but a high/stale free_index (the state a fill leaves
  when a cross-thread free's lowering of free_index is not yet visible) then
  reported "full" and returned nil. Single-byte-bitmap spans (small nelems)
  hit it whenever free_index == nelems. Fix: apply the offset only in pass 0;
  the wrap pass scans the start byte from bit 0.

* GC reclaimed mcache-resident spans. vgc_sweep_span reclaims any empty span
  with a stale sweep_gen, and on_central==0 means BOTH "free-floating" AND
  "owned by an mcache". An empty cached span was reset (nelems=0) and pooled
  by vgc_put_free_span while still referenced by the mcache slot and by a
  thread suspended inside vgc_malloc -- span descriptors live outside the GC
  arena, so the conservative root scan never protects them, and
  vgc_fixup_caches only nulls the cache slot, not a suspended owner's local.
  The owner then read a zeroed/torn span. Fix: vgc_protect_cached_spans()
  stamps every mcache-resident span's sweep_gen under STW before sweep,
  reusing the existing in-flight guard.

Deterministic white-box regression (CX-free): vgc_residual4_selftest in
vlib/builtin/vgc_selftest_d_vgc.c.v, driven by
bench/parallel-alloc/vgc_residual4_test.v; both checks verified to fail
without their fix. Full V _test suite + conformance green under -gc e.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit 871dced)
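
The scan fix in the first bullet can be sketched as a standalone function. `find_free_slot` and its signature are hypothetical; the point it illustrates is that the hint's bit offset applies only to pass 0, while the wrap pass rescans the start byte from bit 0.

```c
#include <assert.h>
#include <stdint.h>

/* Return the index of a clear bit among nelems slots, or -1 if the span is
 * full. free_index is only a hint for where to start looking. */
static int find_free_slot(const uint8_t *bits, uint32_t nelems, uint32_t free_index) {
    assert(nelems > 0);
    uint32_t nbytes = (nelems + 7) / 8;
    uint32_t start_byte = (free_index / 8) % nbytes;
    uint32_t start_bit = free_index % 8;
    for (int pass = 0; pass < 2; pass++) {
        uint32_t lo = pass == 0 ? start_byte : 0;
        uint32_t hi = pass == 0 ? nbytes : start_byte + 1;
        for (uint32_t b = lo; b < hi; b++) {
            /* Offset only in pass 0; the wrap pass covers bits [0, start_bit). */
            uint32_t first = (pass == 0 && b == start_byte) ? start_bit : 0;
            for (uint32_t bit = first; bit < 8; bit++) {
                uint32_t slot = b * 8 + bit;
                if (slot >= nelems) break;
                if (!(bits[b] & (1u << bit))) return (int)slot;
            }
        }
    }
    return -1;
}
```

For a one-byte bitmap `0xFC` with a hint of 5, pass 0 sees only the used bits 5 to 7, and the wrap pass then finds slot 0. Applying the offset in both passes would report the span as full.
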
Three CX-agnostic allocator changes that recover [par] / multi-reactor scaling
under -gc e without regressing serial throughput or correctness. Root cause
(measured): the alloc fast path was ~98% global-lock spin under parallel churn,
NOT GC pacing and NOT memory bandwidth (8 separate processes scale; 8 in-process
workers do not). Two locks dominated: vgc_heap.lock (new-span carve, whose hold
included a per-span mmap()) and central[].lock (full-span return).

1. Span-descriptor bump slab (vgc_alloc_span_meta): span descriptors are never
   individually freed (pooled by vgc_put_free_span forever), so replace the
   per-carve mmap(sizeof(VGC_Span)) -- a syscall under vgc_heap.lock -- with a
   pointer bump + rare bulk mmap. Side effect: ~3x lower RSS (the per-span mmap
   wasted ~16KB of page granularity on a ~400B descriptor).

2. Drop full spans instead of returning them to central.full. The partial list
   was never reused (returns only land FULL spans on full; sweep relinks only
   fully-empty), so the per-fill return was pure central[].lock contention. A
   dropped span stays in allspans, is swept normally, and is reclaimed when empty
   via the on_central==0 path; reuse flows through the free_spans pool + the
   active span's free_index. Sound vs residual #4 (a dropped span is no longer
   mcache-resident, so same-cycle reclaim is correct; it is sweep_gen-protected
   while still referenced during the drop).

3. Per-thread GC pacing on by default (was env-gated). Adaptive: scales the live
   trigger by live_threads only when >1, so single-threaded/small programs keep
   the historic 256MB trigger and RSS. VGC_PACE=0 disables it.

Results (best-of-5, -prod -gc e, 12-core): par 360ms = 3.8x faster than serial
at a 4GB trigger (RSS 2.3GB); ~2.9x at the default with pacing; serial 1550->
1380ms; the alloc lock-spin leaves the hot profile entirely (now compute-bound).
Gates: residual-#4 selftest PASS; cx V-impl 125/125; local MP stress 50/50;
HTTP churn repro (wrk -t8 -c256 Connection:close) survived 15 rounds niltrace=0;
real HTTP server sweep ~22M requests crash-free.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit 82e3934)
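
Change 1 (the descriptor bump slab) follows a common pattern that can be sketched as below. `MetaSlab`, `meta_alloc`, and `SLAB_BYTES` are hypothetical names, and `calloc` stands in for the bulk `mmap()` so the example stays portable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SLAB_BYTES (64 * 1024)   /* hypothetical chunk size */

typedef struct { char *cur; char *end; } MetaSlab;

/* Hand out descriptor memory by bumping a pointer; refill with one bulk
 * allocation when the chunk runs out. Nothing is freed individually, which
 * matches descriptors that are pooled forever. */
static void *meta_alloc(MetaSlab *s, size_t size) {
    assert(size > 0 && size <= SLAB_BYTES);
    size = (size + 15) & ~(size_t)15;            /* round up to 16 bytes */
    if (s->cur == NULL || (size_t)(s->end - s->cur) < size) {
        char *chunk = calloc(1, SLAB_BYTES);     /* stand-in for a bulk mmap() */
        if (chunk == NULL) return NULL;
        s->cur = chunk;
        s->end = chunk + SLAB_BYTES;
    }
    void *p = s->cur;
    s->cur += size;
    return p;
}
```

The common case is a compare and an add, so the slow system call happens once per chunk rather than once per descriptor, and small descriptors pack densely instead of each occupying a page.
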
…nostic

Genericize two comments that named the downstream consumer, keeping vlib source
provider-neutral (matches 6f94f555): the mark-closure verifier's noscan-referrer
note and the vgc_residual4_selftest header. Comment-only; no code change.

(cherry picked from commit 1ff9e4f)
…er -gc e

The autofree/Perceus drop for an option-pointer local (b := &?Foo{}) closed the
free call twice: the option branch wrote '.data)' (closing the call's open paren)
and then the shared tail wrote ');' again -> 'free((Foo**)b.data));' -> C error
'extraneous ) before ;'. The non-option path closes only once via that tail, so
emit '.data' (no close) and let the single tail ')' close the call. Fixes
vlib/v/tests/options/option_init_ptr_test.v under -gc e (passes under e/boehm/none;
boehm options suite 213/213, no regression). One of the documented -gc e edge bugs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit 3bcf843)
gen_free_for_array took the string-construct branch for an option element ([]?T)
because the UNWRAPPED payload sym (e.g. string) has a user free -> it emitted a
call to _option_<T>_free, a wrapper method no codegen path produces -> C error
'undeclared function builtin___option_string_free'. Fix: for an option element,
INLINE the payload free (check option state, free the payload via its base free on
&data), mirroring the sum-type-variant-option and whole-?[]T paths. No separate
_option_<T>_free method needed. Fixes option_ifguard_array_of_option_test under
-gc e (options suite 213/213; boehm/none unaffected).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit 3d53776)
…e_threads

(c) the real MP-alloc lever (measured): vgc_free took the per-class central[].lock
on EVERY non-tiny free, so N threads dropping the same size class serialized N-way
on one lock — bench_scalar anti-scaled 35->7 Mops/s T1->T8, below boehm. (This is
why the earlier 'near-linear, ~8x boehm' result regressed: the residual-#4 fix added
that lock for correctness.)

Fix: skip the central lock when span.on_central == 0 (resident in a thread mcache,
or dropped awaiting sweep) — such a span has no central-list membership to guard, and
its bitmap+count mutations are individually atomic. The atomic fetch_and's prior value
gates the count decrement so a double-free still cannot double-subtract. free_index
stays a racy-but-safe hint. Spans actually on a central list (on_central != 0: the
unregistered-overflow-thread fallback) keep the lock, preserving list consistency and
the collector's lock-before-suspend fence.

Also: make live_threads ++/-- in register/unregister ATOMIC — vgc_maybe_gc reads it
lock-free for per-thread GC pacing (on by default), so a plain RMW raced that atomic
read (TSan-flagged; a PACE-on-by-default regression).

Result (best-of-3, -prod -gc e, 12-core): bench_scalar T1->T8 = 45->326 Mops/s
(7.2x, near-linear; 5.5x boehm at T8); bench_mp T1 5.9->76. Soundness: residual-#4
white-box selftest PASS; container HTTP churn 15 rounds niltrace=0 (the exact race
the lock guarded); TSan 0 warnings (was 1, the live_threads race — now fixed); cx
V-impl 125/125 + full make test green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit c69fd59)
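
The free path described above can be sketched like this. It is an illustration under assumptions: the `Span` layout and `span_free_slot` are hypothetical, and an `atomic_flag` spinlock stands in for the per-class central lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    _Atomic uint8_t  alloc_bits[4];   /* one bit per slot, 32 slots (assumed) */
    _Atomic uint32_t alloc_count;
    int              on_central;      /* 0: mcache-resident or dropped */
    atomic_flag     *central_lock;    /* stand-in for the per-class lock */
} Span;

/* Free one slot. Returns false if the slot was not allocated (double free). */
static bool span_free_slot(Span *s, uint32_t slot) {
    assert(slot < 32);
    bool locked = s->on_central != 0;   /* only list members need the lock */
    if (locked) {
        while (atomic_flag_test_and_set_explicit(s->central_lock, memory_order_acquire)) {
            /* spin until the lock is free */
        }
    }
    uint8_t mask = (uint8_t)(1u << (slot & 7));
    uint8_t prev = atomic_fetch_and_explicit(&s->alloc_bits[slot >> 3],
                                             (uint8_t)~mask, memory_order_acq_rel);
    bool was_set = (prev & mask) != 0;
    /* The prior value gates the decrement, so a second free of the same slot
     * clears an already-clear bit and leaves the count alone. */
    if (was_set) atomic_fetch_sub_explicit(&s->alloc_count, 1, memory_order_relaxed);
    if (locked) atomic_flag_clear_explicit(s->central_lock, memory_order_release);
    return was_set;
}
```

Threads freeing into spans with `on_central == 0` never touch the shared lock, which is the contention the commit message reports removing.
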
…ps root-finder

Debug-only soundness tooling used during development (compiled out by default).
Removed for review clarity: vgc_verify_mark_closure, vgc_rootfind_region + the
$if vgc_verify ? call sites and the verify-only globals/C decls. The
low-perturbation vgc_watch_* hooks (also debug, runtime-gated by vgc_watch_addr,
inert by default) are kept — they were the load-bearing diagnostic for the
concurrency fixes and add negligible inert cost.
CX-free V programs backing the PR's perf/soundness claims (so reviewers can
reproduce): bench_scalar (single-class MP alloc scaling), bench_mp (nested-object
MP), par_reclaim (bounded-live control), cm_stress (concurrent-mark hazards),
boehm_mp_bench (Boehm baseline). vgc_residual4_test (white-box fix self-check) is
already present.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 56f73cde59


Comment thread vlib/v/gen/c/cmain.v
// MUST be after _vinit(): _vinit zero-initializes every global (incl.
// vgc_heap), so running vgc_init before it would have gc_enabled/next_gc
// wiped back to 0 -> GC disabled for the whole program -> unbounded heap.
g.writeln('\tbuiltin__vgc_init();')


P1: Avoid re-registering threads after pre-init allocations

When _vinit allocates under -gc vgc/-gc e, the allocation path calls vgc_ensure_registered() before this generated builtin__vgc_init() runs; vgc_init() then unconditionally calls vgc_register_thread() again. That leaves the earlier cache slot still registered with the same OS thread port, so the next collection sees it as another mutator and tries to suspend/scan the collector thread itself (i != self_idx), which can self-suspend or hang as soon as GC triggers in programs with allocating global/module init. Make registration idempotent here, or avoid the second registration in all generated entry points.
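
One possible shape for the idempotent registration suggested here, as a sketch only: `ensure_registered`, `gc_init`, `my_slot`, and `nslots` are hypothetical names, and a real runtime would take its registry lock around the slot assignment.

```c
#include <assert.h>

#define MAX_THREADS 64   /* hypothetical registry capacity */

static int nslots;                       /* number of registered mutator slots */
static _Thread_local int my_slot = -1;   /* this thread's slot, -1 if none */

/* Register the calling thread once; later calls return the same slot. */
static int ensure_registered(void) {
    if (my_slot >= 0) return my_slot;    /* already registered: no second slot */
    assert(nslots < MAX_THREADS);
    my_slot = nslots++;                  /* real code would hold the registry lock */
    return my_slot;
}

/* Init path: reuse the slot a pre-init allocation may already have taken. */
static int gc_init(void) { return ensure_registered(); }
```

With this shape an allocation during global init and the later init call both resolve to the same slot, so the collector never sees its own thread as a second mutator.
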


Comment on lines +422 to 425
vgc_cm_stw_exit(self_idx)

vgc_update_trigger()
vgc_heap.gc_cycle++


P2: Bump the GC cycle before resuming concurrent mutators

In -d vgc_concurrent builds this resumes mutators before updating next_gc and incrementing gc_cycle. The STW path above explicitly moved those updates before resume because a resumed allocator can acquire/cache a span stamped with the old cycle; the next sweep then sees sweep_gen != gc_cycle and can recycle that cached span while the mutator still owns it. Move vgc_update_trigger()/vgc_heap.gc_cycle++ before vgc_cm_stw_exit(self_idx), matching the non-concurrent path.


Comment on lines +815 to +817
static inline uint32_t vgc_thread_self_port(void) { return 0; }
static inline void vgc_suspend_thread(uint32_t t) { (void)t; }
static inline void vgc_resume_thread(uint32_t t) { (void)t; }


P2: Disable VGC where OS-level suspension is unavailable

On Windows/BSD this fallback registers every thread with mach_port == 0 and makes suspend/resume no-ops, but vgc_gc_start() no longer uses the old cooperative safepoint path and only suspends slots with c.mach_port != 0. As a result, a multi-threaded -gc vgc/-gc e program on these supported targets will mark and sweep while other mutators keep running, so live objects allocated or stored by those threads can be reclaimed. This needs to either reject VGC on unsupported STW platforms or retain a safe cooperative fallback.


eptx and others added 2 commits June 14, 2026 20:47
(1) vfmt array.v + the two vgc files (hand-edited during cherry-pick conflict
resolution + the verifier strip) so code-formatting CI passes. (2) The standalone
POC benchmarks are each `module main`; sitting beside vgc_residual4_test.v they
collided under `v test` (duplicate `main`/`Obj`). Moved to bench/parallel-alloc/poc/
so the selftest's module is clean; benches still run individually.
…ootstrap)

The concurrent-mark write barrier vgc_wb_store is defined only in
vgc_gc_d_vgc.c.v (compiled under -d vgc, i.e. -gc e). Its call sites in
array.v/map.v sit inside `$if vgc_concurrent ? { ... }`, so they emit no
code in ordinary builds — but the checker still walks those comptime
branches, and `v -os cross` emits every branch into the generated v.c.
Both paths require the symbol to resolve in non-vgc builds, so default
(boehm) `-os cross` / bootstrap-v / build-vc failed with
"unknown function: vgc_wb_store".

Add a no-op fallback in a _notd_vgc.c.v sibling so the symbol resolves
in boehm/none builds (mutually exclusive with the real definition via
file suffix). No behavioral change: the barrier is only ever active
under `-gc e -d vgc_concurrent`.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
eptx and others added 3 commits June 15, 2026 16:06
`v fmt -verify` (run by test-cleancode in nearly every CI job) flagged
vlib/v/gen/c/perceus.v and vlib/v/gen/c/cgen.v as not vfmt'ed; also
reformat the two POC bench files. Formatting-only (field/comment
alignment); no logic change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
report-missing-fn-doc requires a name-leading doc comment on every new
public function. Document the two introduced by this PR.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
write_heap_alloc / write_heap_alloc_close no longer use their `typ`
parameter (the precise-pointer-map HEAP_vgc variant is intentionally
unused — see the note there). V emits a `notice: unused parameter: typ`
to stderr on every build; the tools-* CI jobs treat any stderr output
from a tool compile as a failure, so the whole tools-{linux,macos,
freebsd,openbsd,docker} matrix went red.

Mark the parameter `_` (kept in the signature for call-site symmetry;
maintainers may prefer removing it outright).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@eptx

eptx commented Jun 15, 2026

Contributor Author

Thanks for approving the CI run — that surfaced exactly what we needed.

I triaged the 38 failures; they collapse to four root causes, all on our side and all low-risk, now addressed on the branch:

  1. -os cross / bootstrap break (real regression, the important one). The concurrent-mark write barrier vgc_wb_store is defined only under -d vgc (-gc e), but its call sites in array.v/map.v sit in $if vgc_concurrent ? blocks. v -os cross emits every comptime branch into v.c, and the checker walks inactive branches too — so in a default (boehm) build the symbol is unresolved and bootstrap-v, build-vc, cross-* and the BSD cross-compiles fail with unknown function: vgc_wb_store. Fixed by adding a no-op vgc_wb_store fallback in a _notd_vgc.c.v sibling (mutually exclusive with the real definition via file suffix; zero behavioral change — the barrier is only ever live under -gc e -d vgc_concurrent).

  2. vfmt. vlib/v/gen/c/perceus.v and cgen.v weren't vfmt-clean. Since test-cleancode runs early in most jobs, this alone reddened the linux/macos/windows compilers, the docker images, and the sanitize-* and tcc-* jobs (they abort at the fmt gate before reaching their namesake work). Reformatted.

  3. unused parameter: typ. write_heap_alloc[_close] no longer use their typ arg (the precise-pointer-map HEAP_vgc variant is intentionally dead — see the note there). V prints a notice to stderr even on a successful build, and the tools-* harness treats any stderr from a tool compile as failure, so the whole tools-* matrix (and the riscv64 build) went red. Marked the param _.

  4. Missing doc comments on pub fn vgc_init / vgc_set_watch. Added.

Worth flagging: this CI matrix builds V's default configuration, so the -gc e code paths (C11 atomics, @[thread_local], the conservative scanner) aren't actually compiled or exercised by these jobs — the failures above are all default-build lint/cross issues, not anything specific to the new backend. If it'd be useful, I'm happy to add a small opt-in -gc e lane so the new collector gets real coverage; the two known limitations there are (a) tcc can't compile the C11 atomics the barrier/STW code uses, so a -gc e lane would need a non-tcc compiler, and (b) the conservative mark scan will trip ASan/MSan/UBSan without suppressions. I'd rather take your steer on whether you even want that lane than guess.

A couple of things I'd like direction on before going further:

  • Fallback approach for (1): the no-op _notd_vgc stub is the minimal fix, but if you'd prefer the barrier machinery never reference a vgc-only symbol from builtin at all (e.g. gating the call sites differently), I'll restructure to match your conventions.
  • v1 vs v2: this POC targets the current C backend (vlib/v/gen/c). Is that the right place for an experiment like this, or would you want it oriented toward v2?

This is still a POC / "needs verification" — happy to adjust scope, split it, or hold any of it pending your read on the architecture.

eptx added a commit to cx-home/v that referenced this pull request Jun 15, 2026
…neutral)

Mirrors the 4 fixes landed on the upstream PR branch (cx-home/v
pr/mem-mgmt-poc, vlang#27458) into the CX build lineage. All
behavior-neutral; -gc e codegen is byte-identical.

- vgc_wb_store: add no-op fallback (vgc_wb_fallback_notd_vgc.c.v) so the
  symbol resolves in non-vgc / `-os cross` builds (dormant for CX since
  it builds -gc e by default, but makes a boehm/cross build robust).
- cgen.v: mark dead `typ` param of write_heap_alloc[_close] `_` (kills
  the per-build `unused parameter` notice on stderr).
- perceus.v + cgen.v: vfmt.
- vgc_init / vgc_set_watch: doc comments.

Gate: devbox run -- make test-vcx GREEN (V-impl 125/125 + conformance
md 22/0, namespaces 16/0).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@JalonSolov

Collaborator

So... I decided to run -gc vgc vs -gc boehm (I know it's the default... I made it explicit...) vs -gc e, I compiled V itself after checking out the PR, and ran the test with hyperfine... here are my results for bench_scalar:

Summary
  ./ve -enable-globals -gc boehm run bench/parallel-alloc/poc/bench_scalar.v ran
    1.31 ± 0.07 times faster than ./ve -enable-globals -gc vgc run bench/parallel-alloc/poc/bench_scalar.v
    1.39 ± 0.08 times faster than ./ve -enable-globals -gc e run bench/parallel-alloc/poc/bench_scalar.v

and for bench_mp:

Summary
  ./ve -enable-globals -gc vgc run bench/parallel-alloc/poc/bench_mp.v ran
    1.02 ± 0.04 times faster than ./ve -enable-globals -gc boehm run bench/parallel-alloc/poc/bench_mp.v
    1.07 ± 0.02 times faster than ./ve -enable-globals -gc e run bench/parallel-alloc/poc/bench_mp.v

Which means on my system, this new way is slower than both the others. 😕

Latest V, CachyOS, on 7950X CPU, 32G RAM, and NVME disks.

@JalonSolov

Collaborator

Alex will have to make final decisions, but my opinion:

  • never reference a vgc-only symbol from builtin at all, unless -gc vgc is used
  • implement for both backends

v2 is still not quite done yet, so we can test earlier/more stably on v1. However, checking against v2 before it is finalized could be very useful as well.

@GGRei

GGRei commented Jun 15, 2026

Contributor

I think the core idea is interesting: Perceus-style reuse, ownership-driven drops/frees, and a tracing backstop are all directions worth studying.

My concern is more about integration and timing. I have been doing some work on the V2 ownership/autofree direction based on the ownership problems listed in #27116.

In the current V2 work, the scope is deliberately staged: it collects ownership/autofree facts from the V2 post-transform FlatAst and type information, classifies transfers and release/cleanup eligibility in the V2 type/checker layer, then emits bounded cleanup first through the V2 CleanC backend and fixture tests. The goal, however, should be a backend-neutral V2 ownership/free contract. Once the model is stable, the same ownership decisions will need clear lowering rules for the native backends too, including x64, arm64, and any future backend. That model needs to cover shallow copies, escaping locals, non-owning pointers, sumtype payloads, pointer fields, globals/struct storage, etc.

That makes me wary of adding another memory-management path before the V2 model is implemented, reviewed, and stabilized.

For the current C backend side, I also think this is still too experimental for master as-is. It is true that V already has several memory-management modes there, such as Boehm, none, and autofree, but autofree is already a sensitive area with known issues in that path.

This PR is not just adding a small isolated switch; it also changes the current C backend code generator, autofree/free-method generation, builtins, VGC runtime behavior, platform STW/root scanning, allocator behavior, and Boehm/libgc-related code. Even if -gc e is opt-in, that maintenance surface is not really isolated, so I am cautious about adding a new experimental hybrid mode.

So in my opinion, the safest path would be to keep this as a dedicated experimental/RFC branch for now, possibly as a dedicated branch under the V org if maintainers want to explore it, and extract the useful parts separately like benchmarks, reproducers, design notes, and small isolated fixes with tests. If the experimental branch later proves strong results across supported OSes, with correctness tests and stable performance data, then moving parts of it toward the current C backend / master path would make much more sense.

This is only my personal opinion, of course. ^^

Just to be clear, I still think this PR is interesting and definitely worth exploring further.

@eptx

eptx commented Jun 16, 2026

Contributor Author

Really appreciate the careful look — and thank you for taking the time to actually benchmark it, @JalonSolov.

On performance: that's a fair and important data point. Our wins were measured on M2 Max (arm64/macOS); your 7950X / CachyOS numbers — -gc e behind both boehm and vgc on bench_scalar and bench_mp — tell me the result is workload- and platform-dependent, not a universal win, and I won't claim otherwise. It would need broad cross-OS / cross-arch data before being a serious master proposal, and it doesn't have that yet.

On integration & timing, @GGRei: I think you're right, and I'd rather follow your lead than push a ~5k-line cross-cutting change at master:

  • Happy to keep the full hybrid as a dedicated experimental / RFC branch (under the V org if you'd like a home for it; otherwise my fork), explicitly not targeting master.
  • I'd prefer to peel off the independently-useful pieces as small, isolated, tested PRs — the benchmarks + reproducers, the design notes, and the few genuinely isolated fixes we hit along the way. Would those be welcome, and do you want them as separate PRs?
  • I also don't want this competing with the v2 ownership/autofree direction in #27116 ("Restore CI step: Ensure V2 can be compiled with -autofree"); that's clearly the strategic path. I'll exercise the ideas against v2 early (per @JalonSolov) so anything useful feeds into that model rather than around it.

On the builtin reference (@JalonSolov): agreed — I'll rework so the barrier never references a vgc-only symbol outside -gc vgc builds (gating the call sites, rather than the builtin no-op fallback I pushed just to clear CI), and look at covering both vgc and e.

I'll re-frame the PR accordingly. Thanks again — this is exactly the steer I was hoping for.

@GGRei

GGRei commented Jun 16, 2026

Contributor

Of course, please keep in mind that I am nobody special in the project, just a modest contributor giving my personal opinion.

Medvednikov is the real final decision-maker here, with JalonSolov's help of course.
