Commit 072bc53

justin808 and claude authored
Document health-probe status-code contract and Control Plane probes (#4053) (#4063)
## Why

Issue #4053: the recommended renderer readiness probe `curl --http2-prior-knowledge -fsS http://localhost:3800/ready` breaks container startup. The container never becomes ready; removing `-fsS` "fixes" it (but then the probe always passes, defeating its purpose).

**Root cause (confirmed in `worker.ts`):** `-f`/`--fail` makes curl exit non-zero on any HTTP status `>= 400`, and `/ready` returns `503 {"status":"waiting_for_bundle"}` during the cold-start window until the answering worker compiles its first bundle. A `--fail` probe against `/ready` with no warm-up path therefore deadlocks startup. This is intended gating behavior, but the docs lacked an explicit per-endpoint status-code contract and any Control Plane guidance, so the reporter hit the deadlock without a clear answer.

Endpoint contract (from [`worker.ts`](packages/react-on-rails-pro-node-renderer/src/worker.ts)):

- `/health` → always `200`
- `/info` → always `200`
- `/ready` → `200` once a bundle is compiled, else `503`

## What changed

Documentation only, in `docs/oss/building-features/node-renderer/health-checks.md`:

1. **Status-Code Contract** section with a per-endpoint table and a callout explaining exactly why `curl -fsS .../ready` breaks startup, and what to probe instead (`/health` or `/info` for an always-passes probe; reserve `--fail` against `/ready` for setups with a warm-up path).
2. **Control Plane (CPLN)** section: CPLN exposes HTTP and Command (exec) probes; the HTTP probe is HTTP/1.1 and cannot speak the renderer's h2c listener, so use a Command probe with h2c-aware curl against `/health` by default, switching to `/ready` only with a warm-up path.

This addresses all three of the issue's requests: a probe command that works, the `/ready`+`/info` status-code contract, and the CPLN-specific probe options.

## Validation

- `npx prettier --check` passes on the changed file.
- Docs-only; no CHANGELOG entry per `.claude/docs/changelog-guidelines.md` (docs fixes excluded).

Fixes #4053

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---

> [!NOTE]
> **Low Risk**
> Documentation-only changes with no runtime, security, or deployment code modifications.
>
> **Overview**
> Documents why **`curl --fail` against `/ready`** can deadlock container readiness during cold start, and how to probe safely.
>
> Adds a **Status-Code Contract** table for `/health`, `/info`, and `/ready` (including when `503 waiting_for_bundle` is expected) plus guidance to use **`/health` or `/info`** for probes that must pass once the process is up, and **`/ready` with `--fail` only when a warm-up path** compiles the first bundle.
>
> Adds a **Control Plane (CPLN)** section explaining that HTTP probes cannot use the renderer’s **h2c** listener, with example **`exec` Command probes** using h2c-aware `curl` against `/health` for readiness and liveness. The same content is mirrored in **`llms-full.txt`**.
>
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit 40ae833.</sup>

## Summary by CodeRabbit

* **Documentation**
  * Added a “Status-Code Contract” describing expected HTTP responses for health endpoints, including when `/ready` may return `503` during cold start
  * Expanded guidance on `curl --fail`/`-f`/`-fsS`, warning against using `/ready` as a failing readiness gate without a warm-up path
  * Added Control Plane-specific probe instructions and example probe shapes to match connectivity constraints

## Agent Merge Confidence (coordinator)

**Merged by Claude Code coordinator under explicit maintainer (justin808) delegation:** "merge authorization if confident + documented."

- **Confidence: HIGH.** Docs-only (`docs/oss/.../health-checks.md`) + regenerated `llms-full.txt`.
- **Rebased** onto merged #4087 (Group B, generated `llms-full.txt` serialization); `llms-full.txt` regenerated via `node script/generate-llms-full.mjs`; regeneration produced zero diff and the `--check` guard passes (auto-merge was correct).
- **Doc accuracy verified vs source:** `/health`→200, `/ready`→200/503 `waiting_for_bundle` confirmed against `packages/react-on-rails-pro-node-renderer/src/worker.ts`; CPLN probe schema confirmed against `react_on_rails_pro/.controlplane/rails.yml`.
- **Current-head review threads triaged against the actual file (not merged blind):** the recurring CodeRabbit/claude/codex "missing `timeoutSeconds`" and greptile "missing liveness probe" flags are **stale/incorrect**: the CPLN snippet already contains `timeoutSeconds: 5` on both `readinessProbe` and `livenessProbe`. `startupProbe` omission is intentional and safe (probes `/health`, always 200; cold start cannot misfire). The `-sfS` and localhost-qualification notes are advisory/misreads. No actionable must-fix remained.
- **CI:** `mergeStateStatus: CLEAN`, all checks pass/skip. **Merge ledger** sole hard violation `unknown_review_decision`, overridden by maintainer merge delegation. Changelog `not_user_visible`.
- **Cross-batch:** Batch-1 #4037 also touches `llms-full.txt` and must rebase+regenerate after this merge.

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

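
The root cause described in the commit message can be sketched as a small model. This is an illustration only, not the renderer's `worker.ts`: the function names `statusFor`, `curlExitCode`, and `probePasses` are hypothetical, and the model assumes the connection itself succeeds. The exit code `22` is the code curl documents for an HTTP status of 400 or above when `--fail` is set.

```typescript
// Sketch only: models the documented contract; it is not the renderer's actual code.
type Endpoint = "/health" | "/info" | "/ready";

// Status code per the contract: /health and /info always answer 200;
// /ready answers 200 only once the answering worker has compiled a bundle.
function statusFor(endpoint: Endpoint, bundleCompiled: boolean): number {
  if (endpoint === "/ready" && !bundleCompiled) {
    return 503; // body: {"status":"waiting_for_bundle"}
  }
  return 200;
}

// With --fail, curl exits 22 on any HTTP status >= 400; otherwise it exits 0
// (assuming the connection itself succeeded).
function curlExitCode(status: number, failFlag: boolean): number {
  return failFlag && status >= 400 ? 22 : 0;
}

// A command probe passes when its command exits 0.
function probePasses(endpoint: Endpoint, bundleCompiled: boolean, failFlag: boolean): boolean {
  return curlExitCode(statusFor(endpoint, bundleCompiled), failFlag) === 0;
}

const bundleCompiledAtColdStart = false;
console.log(probePasses("/ready", bundleCompiledAtColdStart, true));  // false: the reported deadlock
console.log(probePasses("/ready", bundleCompiledAtColdStart, false)); // true: passes, but no longer gates anything
console.log(probePasses("/health", bundleCompiledAtColdStart, true)); // true: safe default
```

Under these assumptions, only the `/ready` plus `--fail` combination fails during cold start, which matches the behavior reported in the issue.
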
1 parent 890841c commit 072bc53

2 files changed

Lines changed: 138 additions & 0 deletions

docs/oss/building-features/node-renderer/health-checks.md

Lines changed: 69 additions & 0 deletions
@@ -20,6 +20,24 @@ routes outside the authenticated render and asset endpoints and do not require t
 probes cannot carry it). Keep the renderer on `localhost` or private networking as usual; see
 [Network Security](./basics.md#network-security).
 
+## Status-Code Contract
+
+Probe tooling that uses `curl --fail` / `-f` (which `-sf` and `-fsS` both include) exits non-zero on any HTTP status
+`>= 400`. Whether `--fail` is safe therefore depends on which endpoint you probe:
+
+| Endpoint | Status codes | Safe with `curl --fail` / `-f`? |
+| --------- | ---------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------- |
+| `/health` | Always `200` (process can answer) | **Yes** — never non-2xx. |
+| `/info` | Always `200` (returns Node and renderer versions) | **Yes** — never non-2xx. |
+| `/ready` | `200` once a bundle is compiled; `503` `waiting_for_bundle` until then | **Only with a warm-up path.** Without one, `-f` turns the cold-start `503` into a failed probe. |
+
+> **Why `curl -fsS .../ready` can break container startup:** during the cold-start window `/ready` returns `503`
+> (`{"status":"waiting_for_bundle"}`) until the answering worker compiles its first bundle, and `-f`/`--fail` turns
+> that `503` into a non-zero exit. If that command gates startup/readiness and nothing pre-warms the renderer, the
+> probe never passes and the container never becomes ready. This is the `503` working as designed, not a bug — see
+> [Gating traffic on `/ready`](#gating-traffic-on-ready). For a probe that must always pass once the process is up,
+> point `--fail` at `/health` (or `/info`); reserve `--fail` against `/ready` for setups with a warm-up path.
+
 ## Enabling the Endpoints
 
 The endpoints are **off by default**. Enable them with the `enableHealthEndpoints` config option or the
@@ -220,6 +238,57 @@ services:
 start_period: 10s
 ```
 
+## Control Plane (CPLN)
+
+Control Plane exposes two relevant probe shapes: an **HTTP** probe and a **Command** (exec) probe. The HTTP probe is
+HTTP/1.1 and **cannot speak the renderer's h2c listener** (see [h2c](#h2c-why-httpget-probes-do-not-work)), so it
+always fails against these endpoints — use a **Command** probe with an h2c-aware curl instead. Command probes run
+inside the container, so the default `localhost` binding works and no `0.0.0.0` host is required.
+
+Use `/health` (always `200`) for liveness and readiness by default so a normal cold start cannot fail the workload
+before the first render compiles a bundle. Control Plane uses the Kubernetes-style `readinessProbe` / `livenessProbe`
+fields on the workload container (the same shape as the [Kubernetes example above](#kubernetes-probes) and the
+existing [Control Plane deployment docs](../../deployment/docker-deployment.md#deploying-with-control-plane)), with a Command
+probe expressed as `exec.command`:
+
+```yaml
+kind: workload
+spec:
+  containers:
+    - name: node-renderer
+      # ... image, ports, env (RENDERER_ENABLE_HEALTH_ENDPOINTS: 'true') ...
+      # Command probe — h2c-aware curl against /health (always 200).
+      readinessProbe:
+        exec:
+          command:
+            - curl
+            - -sf
+            - --max-time
+            - '3'
+            - --http2-prior-knowledge
+            - http://localhost:3800/health
+        periodSeconds: 5
+        failureThreshold: 3
+        timeoutSeconds: 5 # exceed curl --max-time 3 so the probe, not the orchestrator, owns the timeout
+      livenessProbe:
+        exec:
+          command:
+            - curl
+            - -sf
+            - --max-time
+            - '3'
+            - --http2-prior-knowledge
+            - http://localhost:3800/health
+        periodSeconds: 10
+        failureThreshold: 3
+        timeoutSeconds: 5 # exceed curl --max-time 3 so the probe, not the orchestrator, owns the timeout
+```
+
+Do **not** point a `--fail` Command probe at `/ready` unless something pre-warms the renderer, or the probe will fail
+on the cold-start `503` and the workload never becomes ready (the exact failure in
+[Status-Code Contract](#status-code-contract)). Only switch the path to `/ready` once a warm-up path delivers the
+first render to every worker that answers probes — see [Gating traffic on `/ready`](#gating-traffic-on-ready).
+
 ## Semantics and Caveats
 
 - **Per-worker checks.** With `workersCount > 1`, the Node.js cluster module distributes incoming connections across
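
The timing values in the Control Plane snippet can be read with a little arithmetic. The sketch below is a rough model, not Control Plane's documented behavior: it assumes Kubernetes-style semantics (one probe attempt per `periodSeconds`, failure declared after `failureThreshold` consecutive failed attempts). The `ProbeTiming` type and both helper functions are hypothetical names introduced for illustration.

```typescript
// Sketch only: field names mirror the YAML snippet; the timing model is an assumption.
interface ProbeTiming {
  periodSeconds: number;
  failureThreshold: number;
  timeoutSeconds: number;
  curlMaxTimeSeconds: number; // the value passed to curl --max-time
}

// Approximate seconds of sustained failure before the probe is treated as failed,
// assuming one attempt per period and a threshold of consecutive failures.
function secondsUntilFailed(p: ProbeTiming): number {
  return p.periodSeconds * p.failureThreshold;
}

// True when curl gives up before the orchestrator's own probe timeout fires,
// so the probe command (not the orchestrator) owns the timeout.
function curlOwnsTimeout(p: ProbeTiming): boolean {
  return p.curlMaxTimeSeconds < p.timeoutSeconds;
}

const readiness: ProbeTiming = { periodSeconds: 5, failureThreshold: 3, timeoutSeconds: 5, curlMaxTimeSeconds: 3 };
const liveness: ProbeTiming = { periodSeconds: 10, failureThreshold: 3, timeoutSeconds: 5, curlMaxTimeSeconds: 3 };

console.log(secondsUntilFailed(readiness)); // 15
console.log(secondsUntilFailed(liveness));  // 30
console.log(curlOwnsTimeout(readiness));    // true
```

Under this model the readiness probe reacts to a stuck renderer in roughly 15 seconds and the liveness probe in roughly 30, and `--max-time 3` stays below `timeoutSeconds: 5` as the inline comment in the snippet intends.
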

llms-full.txt

Lines changed: 69 additions & 0 deletions
@@ -15258,6 +15258,24 @@ routes outside the authenticated render and asset endpoints and do not require t
 probes cannot carry it). Keep the renderer on `localhost` or private networking as usual; see
 [Network Security](./basics.md#network-security).
 
+## Status-Code Contract
+
+Probe tooling that uses `curl --fail` / `-f` (which `-sf` and `-fsS` both include) exits non-zero on any HTTP status
+`>= 400`. Whether `--fail` is safe therefore depends on which endpoint you probe:
+
+| Endpoint | Status codes | Safe with `curl --fail` / `-f`? |
+| --------- | ---------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------- |
+| `/health` | Always `200` (process can answer) | **Yes** — never non-2xx. |
+| `/info` | Always `200` (returns Node and renderer versions) | **Yes** — never non-2xx. |
+| `/ready` | `200` once a bundle is compiled; `503` `waiting_for_bundle` until then | **Only with a warm-up path.** Without one, `-f` turns the cold-start `503` into a failed probe. |
+
+> **Why `curl -fsS .../ready` can break container startup:** during the cold-start window `/ready` returns `503`
+> (`{"status":"waiting_for_bundle"}`) until the answering worker compiles its first bundle, and `-f`/`--fail` turns
+> that `503` into a non-zero exit. If that command gates startup/readiness and nothing pre-warms the renderer, the
+> probe never passes and the container never becomes ready. This is the `503` working as designed, not a bug — see
+> [Gating traffic on `/ready`](#gating-traffic-on-ready). For a probe that must always pass once the process is up,
+> point `--fail` at `/health` (or `/info`); reserve `--fail` against `/ready` for setups with a warm-up path.
+
 ## Enabling the Endpoints
 
 The endpoints are **off by default**. Enable them with the `enableHealthEndpoints` config option or the
@@ -15458,6 +15476,57 @@ services:
 start_period: 10s
 ```
 
+## Control Plane (CPLN)
+
+Control Plane exposes two relevant probe shapes: an **HTTP** probe and a **Command** (exec) probe. The HTTP probe is
+HTTP/1.1 and **cannot speak the renderer's h2c listener** (see [h2c](#h2c-why-httpget-probes-do-not-work)), so it
+always fails against these endpoints — use a **Command** probe with an h2c-aware curl instead. Command probes run
+inside the container, so the default `localhost` binding works and no `0.0.0.0` host is required.
+
+Use `/health` (always `200`) for liveness and readiness by default so a normal cold start cannot fail the workload
+before the first render compiles a bundle. Control Plane uses the Kubernetes-style `readinessProbe` / `livenessProbe`
+fields on the workload container (the same shape as the [Kubernetes example above](#kubernetes-probes) and the
+existing [Control Plane deployment docs](../../deployment/docker-deployment.md#deploying-with-control-plane)), with a Command
+probe expressed as `exec.command`:
+
+```yaml
+kind: workload
+spec:
+  containers:
+    - name: node-renderer
+      # ... image, ports, env (RENDERER_ENABLE_HEALTH_ENDPOINTS: 'true') ...
+      # Command probe — h2c-aware curl against /health (always 200).
+      readinessProbe:
+        exec:
+          command:
+            - curl
+            - -sf
+            - --max-time
+            - '3'
+            - --http2-prior-knowledge
+            - http://localhost:3800/health
+        periodSeconds: 5
+        failureThreshold: 3
+        timeoutSeconds: 5 # exceed curl --max-time 3 so the probe, not the orchestrator, owns the timeout
+      livenessProbe:
+        exec:
+          command:
+            - curl
+            - -sf
+            - --max-time
+            - '3'
+            - --http2-prior-knowledge
+            - http://localhost:3800/health
+        periodSeconds: 10
+        failureThreshold: 3
+        timeoutSeconds: 5 # exceed curl --max-time 3 so the probe, not the orchestrator, owns the timeout
+```
+
+Do **not** point a `--fail` Command probe at `/ready` unless something pre-warms the renderer, or the probe will fail
+on the cold-start `503` and the workload never becomes ready (the exact failure in
+[Status-Code Contract](#status-code-contract)). Only switch the path to `/ready` once a warm-up path delivers the
+first render to every worker that answers probes — see [Gating traffic on `/ready`](#gating-traffic-on-ready).
+
 ## Semantics and Caveats
 
 - **Per-worker checks.** With `workersCount > 1`, the Node.js cluster module distributes incoming connections across
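
The requirement that a warm-up path reach "every worker that answers probes" can be illustrated with a toy model. This is a hypothetical sketch, not renderer code: `readyStatuses` is an invented name, and it assumes probe connections are handed to workers in strict round-robin order, which may not match the real scheduling on every platform.

```typescript
// Sketch only: a hypothetical model of per-worker /ready answers.
// Assumes probe connections are distributed to workers in strict round-robin order.
function readyStatuses(workerHasBundle: boolean[], probeCount: number): number[] {
  const statuses: number[] = [];
  for (let i = 0; i < probeCount; i++) {
    const worker = i % workerHasBundle.length; // which worker answers probe i
    statuses.push(workerHasBundle[worker] ? 200 : 503);
  }
  return statuses;
}

// Only one of two workers has compiled a bundle: /ready alternates.
console.log(readyStatuses([true, false], 4).join(" ")); // 200 503 200 503

// Warm-up reached every worker: /ready is stable.
console.log(readyStatuses([true, true], 4).join(" "));  // 200 200 200 200
```

Under these assumptions, a partially warmed pool makes a `--fail` probe against `/ready` flap between passing and failing, which is why the guidance above ties `/ready` probing to a warm-up path that covers all workers.
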
