Skip to content

fix(windows): make OpenCode reliable on Windows — retry spurious spawnSync ETIMEDOUT (#46)#85

Merged
Cannon07 merged 3 commits into
mainfrom
feat/windows-opencode
Jun 11, 2026
Merged

fix(windows): make OpenCode reliable on Windows — retry spurious spawnSync ETIMEDOUT (#46)#85
Cannon07 merged 3 commits into
mainfrom
feat/windows-opencode

Conversation

@Cannon07

Copy link
Copy Markdown
Owner

Summary

OpenCode's cross-platform plumbing already shipped (the index.ts Windows branch that runs hook-entry.ps1 via PowerShell, the libuv-based installer, the shared PS shim, and the platform.is_absolute resolver fix from #81 which the opencode normaliser shares). In practice, though, OpenCode-on-Windows dropped files under concurrent, multi-file proposals — the most common OpenCode pattern. This PR fixes that last gap.

The bug

When OpenCode fires hooks for a concurrent batch (e.g. several file writes in one turn), the first hook's execSync threw ETIMEDOUT after ~15 ms — not the 15 s timeout — so that file silently got no diff and no neo-tree marker. The rest of the batch succeeded.

Confirmed on a real Windows box across several 5-file runs: the first file always failed fast (ETIMEDOUT/SIGTERM, attempt 1, ~12–19 ms) while files 2–5 succeeded.

Root cause

A Node spawnSync behaviour: its timeout deadline is derived from libuv's cached loop time, which is refreshed only once per loop iteration. The first spawnSync that runs right after async work — here, the awaited enqueueHook in the tool hooks — sees a stale "now", so now + timeout is already in the past and libuv SIGTERMs the child the instant it spawns. Unix execs the .sh directly and doesn't hit this the same way, which is why it's Windows-only.

The fix (backends/opencode/index.ts)

  • runHook is now async and retries a spurious ETIMEDOUT — one that returns faster than SPURIOUS_TIMEOUT_MS (2 s) — but awaits a turn of the event loop first so libuv refreshes its cached clock. A synchronous retry would re-read the same stale value and fail again (which is exactly why the next hook in a burst always succeeds).
  • A genuine ~15 s timeout is not retried, so hang-protection is preserved.
  • enqueueHook awaits the async runHook, so the existing send-order serialization is unchanged.
  • Also documents that OpenCode's after-hook carries tool args on input (the before-hook on output) — confirmed against the live API while debugging.

Validation

End-to-end on a real Windows box with OpenCode, captured via temporary transport-layer logging (since removed):

  • the retry fired across the runs, all recovered on attempt 1;
  • every 5-file burst now delivers all 5 pre + 5 post hooks to the shim (previously the first file was dropped);
  • single-file edits/writes/deletes, post-tool clearing, and absolute-path handling all behave correctly.

Unix is unaffected — the retry only triggers on a fast ETIMEDOUT, which the direct .sh exec path doesn't produce.

Follow-ups (filed, intentionally out of scope here)

🤖 Generated with Claude Code

…in (#46)

When OpenCode fires hooks for a concurrent batch (e.g. multiple file writes in
one turn), the FIRST hook's execSync threw ETIMEDOUT after ~15ms — not the 15s
timeout — so that file silently got no diff or neo-tree marker. Confirmed on a
real Windows box: across several 5-file runs the first file always failed fast
(ETIMEDOUT/SIGTERM, attempt 1, ~12-19ms) while the rest succeeded.

Root cause is a Node spawnSync behaviour: its timeout deadline is derived from
libuv's *cached* loop time, refreshed once per loop iteration. The first
spawnSync right after async work (the awaited enqueueHook) sees a stale "now",
so `now + timeout` is already in the past and libuv SIGTERMs the child the
instant it spawns. Unix execs the .sh directly and doesn't hit this the same
way (it's why the bug is Windows-only).

Fix: make runHook async and retry a *spurious* ETIMEDOUT (ETIMEDOUT that returns
faster than SPURIOUS_TIMEOUT_MS), but `await` a turn of the event loop first so
libuv refreshes its cached time — a synchronous retry would re-read the same
stale value (which is exactly why the *next* hook in a burst always succeeds). A
genuine ~15s timeout is not retried, so hang-protection is preserved.
enqueueHook now awaits the async runHook, so ordering is unchanged.

Validated: retry fired 7x across the runs, all recovered on attempt 1; every
5-file burst now delivers all 5 pre/post hooks to the shim.

Also documents that OpenCode's after-hook carries tool args on `input` (the
before-hook on `output`) — confirmed against the live API.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@Cannon07

Copy link
Copy Markdown
Owner Author

Review — OpenCode reliability on Windows

Excellent diagnosis and a correctly-targeted, well-scoped fix. The root cause (libuv's cached loop time making spawnSync's deadline already-expired on the first post-async spawn) is sophisticated, and yielding the event loop before retry directly addresses it. CI green; happy path covered on Unix, the retry/burst behavior validated manually on Windows.

Why it's correct

  • The yield is the crux. A synchronous retry would re-read the same stale libuv clock and fail identically — await setImmediate(...) before continue forces libuv to refresh its cached "now", which is why the next hook in a burst always succeeded.
  • Spurious vs. genuine discrimination preserves hang protection. Gating on elapsed < SPURIOUS_TIMEOUT_MS (2 s) means a real ~15 s hang is not retried — it still abstains and logs.
  • Bounded and total. Every path inside the loop returns; no unbounded retry, no fall-through.
  • Serialization preserved. hookQueue.then(async () => { await fn() }) resolves only after runHook fully resolves, so the setImmediate yields advance libuv/timers but don't let a later hook start early — send-order holds.
  • Unix genuinely unaffected — the retry only triggers on a fast ETIMEDOUT, which the direct .sh exec path doesn't produce.

CI coverage boundary (clarification, not a defect)

  • The refactor's happy path is covered on Unix CI: tests/backends/opencode/harness.ts imports the real index.ts and drives the hooks, so the async change didn't regress invocation/ordering.
  • The two things that matter most — the spurious-timeout retry branch (a Windows libuv quirk) and the concurrent multi-file burst (the harness fires one op at a time) — are inherently un-CI-able and rest on the manual Windows validation. That's the right evidence for this class of bug.

One question before merge

The retry decision keys only on err.code === "ETIMEDOUT", but the PR description notes the failure manifests as ETIMEDOUT/SIGTERM, and the logging branch still checks both:

const spurious = !!err && err.code === "ETIMEDOUT" && elapsed < SPURIOUS_TIMEOUT_MS

In practice Node's timeout-kill sets code: 'ETIMEDOUT' (and 7/7 recovery confirms it matches here), but Node has historically also surfaced timeout-kills as signal: 'SIGTERM' with a null code on some platforms — in which case the detector would miss and the file would silently drop again, reintroducing the bug. Since it's still gated by elapsed < 2 s, widening costs nothing on the genuine-hang path:

const spurious = !!err && (err.code === "ETIMEDOUT" || err.signal === "SIGTERM") && elapsed < SPURIOUS_TIMEOUT_MS

Was the SIGTERM case deliberately excluded, or is it safe to widen to match the original logging check?

Minor / optional

  • No regression guard for the retry logic. Hard to unit-test a libuv quirk, but harness.ts could grow a fault-injection action (mock execSync to throw one fast ETIMEDOUT then succeed) to assert recovery — cheap insurance against someone later "simplifying" the retry away. Given how painful this was to find, a guard has real value.
  • Log message lost its specificity ("timed out after 15s""timed out"). Including elapsed ms would distinguish a real hang from an exhausted-retry case in the field.
  • The comment documenting after-hook args on input vs before-hook on output is a good capture of a live-API asymmetry that would bite later.

Net: strong PR — the only thing I'd genuinely want resolved is the SIGTERM question; everything else is optional hardening.

Cannon07 and others added 2 commits June 11, 2026 22:59
…eout retry (#46)

Review follow-up. The retry keyed only on `err.code === "ETIMEDOUT"`, but Node's
timeout-kill can surface as `signal: 'SIGTERM'` with a null code on some
platforms — there the spurious-timeout would be missed and the first hook of a
concurrent burst would silently drop again. The logging branch already checked
both, so the retry detection was narrower than the logging for no reason.

Unify on a single `timedOut` predicate (code ETIMEDOUT OR signal SIGTERM), still
gated on `elapsed < SPURIOUS_TIMEOUT_MS` so a genuine ~15s hang is never retried.
Also restore elapsed ms in the timeout log so a real hang is distinguishable from
exhausted spurious retries.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Review follow-up (optional hardening). The Windows libuv quirk that makes the
first post-async spawnSync spuriously time out can't be reproduced on CI, so the
retry was previously only covered by manual Windows validation — fragile against
a later "simplification".

Extract the retry loop into an exported, injectable `runWithSpuriousRetry(run,
label)` (production behaviour unchanged — runHook just passes the execSync call
as `run`), and add a unit guard that drives it with a fake `run`:

- fast ETIMEDOUT recovers on retry (the core case)
- fast SIGTERM-only recovers (guards the platform-variance fix)
- a non-timeout error is NOT retried
- a persistent fast timeout is bounded at MAX_HOOK_ATTEMPTS (no infinite loop)
- success on first attempt runs once

retry_test.ts imports the real index.ts and asserts call counts; test_retry.sh
drives it via bun/npx tsx (no nvim needed) and skips cleanly when neither is
present. Uses pathToFileURL so the dynamic import also works on Windows.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@Cannon07 Cannon07 merged commit 01ca749 into main Jun 11, 2026
3 checks passed
@Cannon07 Cannon07 deleted the feat/windows-opencode branch June 11, 2026 19:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant