fix(windows): make OpenCode reliable on Windows - retry spurious spawnSync ETIMEDOUT (#46) #85
…in (#46)

When OpenCode fires hooks for a concurrent batch (e.g. multiple file writes in one turn), the FIRST hook's execSync threw ETIMEDOUT after ~15ms, not the 15s timeout, so that file silently got no diff or neo-tree marker. Confirmed on a real Windows box: across several 5-file runs the first file always failed fast (ETIMEDOUT/SIGTERM, attempt 1, ~12-19ms) while the rest succeeded.

Root cause is a Node spawnSync behaviour: its timeout deadline is derived from libuv's *cached* loop time, refreshed once per loop iteration. The first spawnSync right after async work (the awaited enqueueHook) sees a stale "now", so `now + timeout` is already in the past and libuv SIGTERMs the child the instant it spawns. Unix execs the .sh directly and doesn't hit this the same way (it's why the bug is Windows-only).

Fix: make runHook async and retry a *spurious* ETIMEDOUT (ETIMEDOUT that returns faster than SPURIOUS_TIMEOUT_MS), but `await` a turn of the event loop first so libuv refreshes its cached time; a synchronous retry would re-read the same stale value (which is exactly why the *next* hook in a burst always succeeds). A genuine ~15s timeout is not retried, so hang-protection is preserved. enqueueHook now awaits the async runHook, so ordering is unchanged.

Validated: retry fired 7x across the runs, all recovered on attempt 1; every 5-file burst now delivers all 5 pre/post hooks to the shim.

Also documents that OpenCode's after-hook carries tool args on `input` (the before-hook on `output`), confirmed against the live API.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
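To illustrate why awaiting an async `runHook` inside `enqueueHook` need not change send order, here is a minimal sketch of a promise-chain queue. The body of `enqueueHook` below is an assumption for illustration only; the PR does not show the real implementation, and `tail` is a hypothetical name.

```typescript
// Hypothetical promise-chain queue: each task starts only after the previous
// one has settled, so tasks run in enqueue order even when they are async.
let tail: Promise<void> = Promise.resolve();

function enqueueHook(task: () => Promise<void>): Promise<void> {
  // Chain the new task behind whatever is already queued.
  const next = tail.then(task);
  // Swallow failures on the shared tail so one failed hook does not
  // block the hooks queued after it; the caller still sees the rejection.
  tail = next.catch(() => undefined);
  return next;
}
```

Because each task is chained onto the previous one, a slow first hook (for example one that retries after a spurious timeout) delays the later hooks rather than letting them overtake it.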
Review: OpenCode reliability on Windows
Excellent diagnosis and a correctly targeted, well-scoped fix. The root cause (libuv's cached loop time making …
Why it's correct
CI coverage boundary (clarification, not a defect)
One question before merge
The retry decision keys only on …
const spurious = !!err && err.code === "ETIMEDOUT" && elapsed < SPURIOUS_TIMEOUT_MS
In practice Node's timeout-kill sets …
const spurious = !!err && (err.code === "ETIMEDOUT" || err.signal === "SIGTERM") && elapsed < SPURIOUS_TIMEOUT_MS
Was the …
Minor / optional
Net: strong PR; the only thing I'd genuinely want resolved is the …
…eout retry (#46)

Review follow-up. The retry keyed only on `err.code === "ETIMEDOUT"`, but Node's timeout-kill can surface as `signal: 'SIGTERM'` with a null code on some platforms. There the spurious timeout would be missed and the first hook of a concurrent burst would silently drop again. The logging branch already checked both, so the retry detection was narrower than the logging for no reason.

Unify on a single `timedOut` predicate (code ETIMEDOUT OR signal SIGTERM), still gated on `elapsed < SPURIOUS_TIMEOUT_MS` so a genuine ~15s hang is never retried. Also restore elapsed ms in the timeout log so a real hang is distinguishable from exhausted spurious retries.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
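The unified predicate described in this commit could look roughly like the sketch below. The `ExecError` interface and the `isSpurious` helper are hypothetical names introduced for illustration; only `timedOut`, `SPURIOUS_TIMEOUT_MS` (2 s per the PR description), and the two conditions come from the PR text.

```typescript
// Hypothetical minimal shape of the error thrown by a failed execSync call.
interface ExecError {
  code?: string | null;
  signal?: string | null;
}

// The PR description gives this threshold as 2 s.
const SPURIOUS_TIMEOUT_MS = 2_000;

// A timeout kill may surface as code ETIMEDOUT or as signal SIGTERM
// with a null code, so accept either form.
function timedOut(err: ExecError | undefined): boolean {
  return !!err && (err.code === "ETIMEDOUT" || err.signal === "SIGTERM");
}

// Only a *fast* timeout is treated as spurious; a genuine ~15 s hang
// exceeds the threshold and is therefore never retried.
function isSpurious(err: ExecError | undefined, elapsedMs: number): boolean {
  return timedOut(err) && elapsedMs < SPURIOUS_TIMEOUT_MS;
}
```

Sharing one `timedOut` check between the retry decision and the logging branch keeps the two from drifting apart again.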
Review follow-up (optional hardening). The Windows libuv quirk that makes the first post-async spawnSync spuriously time out can't be reproduced on CI, so the retry was previously only covered by manual Windows validation, which is fragile against a later "simplification".

Extract the retry loop into an exported, injectable `runWithSpuriousRetry(run, label)` (production behaviour unchanged: runHook just passes the execSync call as `run`), and add a unit guard that drives it with a fake `run`:

- fast ETIMEDOUT recovers on retry (the core case)
- fast SIGTERM-only recovers (guards the platform-variance fix)
- a non-timeout error is NOT retried
- a persistent fast timeout is bounded at MAX_HOOK_ATTEMPTS (no infinite loop)
- success on first attempt runs once

retry_test.ts imports the real index.ts and asserts call counts; test_retry.sh drives it via bun/npx tsx (no nvim needed) and skips cleanly when neither is present. Uses pathToFileURL so the dynamic import also works on Windows.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
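For readers without the diff open, the injectable retry loop and a fake `run` might be sketched as follows. This is not the code from `index.ts`: the function body, the boolean return value, the value of `MAX_HOOK_ATTEMPTS`, the use of `setTimeout` to yield a loop turn, and the `fakeRun` helper are all assumptions made for illustration.

```typescript
interface ExecError { code?: string | null; signal?: string | null }

const SPURIOUS_TIMEOUT_MS = 2_000; // 2 s, per the PR description
const MAX_HOOK_ATTEMPTS = 3;       // assumed value; the PR does not state it

// Yield one turn of the event loop so libuv can refresh its cached time.
// setTimeout(0) is an assumption; the PR only says a loop turn is awaited.
const nextLoopTurn = (): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, 0));

// Hypothetical sketch: resolves to true on success, false on failure.
async function runWithSpuriousRetry(run: () => void, label: string): Promise<boolean> {
  for (let attempt = 1; attempt <= MAX_HOOK_ATTEMPTS; attempt++) {
    const started = Date.now();
    try {
      run();
      return true;
    } catch (e) {
      const err = e as ExecError | undefined;
      const elapsed = Date.now() - started;
      const timedOut = err?.code === "ETIMEDOUT" || err?.signal === "SIGTERM";
      const spurious = timedOut && elapsed < SPURIOUS_TIMEOUT_MS;
      // Give up on real errors, genuine hangs, or exhausted attempts.
      if (!spurious || attempt === MAX_HOOK_ATTEMPTS) {
        console.error(`[${label}] failed on attempt ${attempt} after ${elapsed}ms`);
        return false;
      }
      await nextLoopTurn(); // a synchronous retry would see the same stale clock
    }
  }
  return false;
}

// Hypothetical fake `run`: throws the queued failures in order, then succeeds.
function fakeRun(failures: ExecError[]): { run: () => void; calls: () => number } {
  let calls = 0;
  return {
    run: () => {
      const failure = failures[calls++];
      if (failure) throw Object.assign(new Error("fake exec failure"), failure);
    },
    calls: () => calls,
  };
}
```

A guard test can then assert call counts, for example that `fakeRun([{ code: "ETIMEDOUT" }])` is invoked exactly twice before `runWithSpuriousRetry` reports success.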
Summary
OpenCode's cross-platform plumbing already shipped (the `index.ts` Windows branch that runs `hook-entry.ps1` via PowerShell, the libuv-based installer, the shared PS shim, and the `platform.is_absolute` resolver fix from #81, which the opencode normaliser shares). In practice, though, OpenCode-on-Windows dropped files under concurrent, multi-file proposals, the most common OpenCode pattern. This PR fixes that last gap.
The bug
When OpenCode fires hooks for a concurrent batch (e.g. several file writes in one turn), the first hook's `execSync` threw `ETIMEDOUT` after ~15 ms, not the 15 s timeout, so that file silently got no diff and no neo-tree marker. The rest of the batch succeeded.
Confirmed on a real Windows box across several 5-file runs: the first file always failed fast (`ETIMEDOUT`/`SIGTERM`, attempt 1, ~12–19 ms) while files 2–5 succeeded.
Root cause
A Node `spawnSync` behaviour: its timeout deadline is derived from libuv's cached loop time, which is refreshed only once per loop iteration. The first `spawnSync` that runs right after async work (here, the awaited `enqueueHook` in the tool hooks) sees a stale "now", so `now + timeout` is already in the past and libuv `SIGTERM`s the child the instant it spawns. Unix execs the `.sh` directly and doesn't hit this the same way, which is why it's Windows-only.
The fix (`backends/opencode/index.ts`)
- `runHook` is now `async` and retries a spurious `ETIMEDOUT` (one that returns faster than `SPURIOUS_TIMEOUT_MS`, 2 s), but `await`s a turn of the event loop first so libuv refreshes its cached clock. A synchronous retry would re-read the same stale value and fail again (which is exactly why the next hook in a burst always succeeds).
- `enqueueHook` awaits the async `runHook`, so the existing send-order serialization is unchanged.
- Also documents that OpenCode's after-hook carries tool args on `input` (the before-hook on `output`), confirmed against the live API while debugging.
Validation
End-to-end on a real Windows box with OpenCode, captured via temporary transport-layer logging (since removed):
Unix is unaffected: the retry only triggers on a fast `ETIMEDOUT`, which the direct `.sh` exec path doesn't produce.
Follow-ups (filed, intentionally out of scope here)
`CODE_PREVIEW_DEBUG` (the instrumentation that made this debuggable, productized), to be its own PR.
🤖 Generated with Claude Code