Skip to content

Document health-probe status-code contract and Control Plane probes (#4053)#4063

Merged
justin808 merged 3 commits into
mainfrom
jg/4053-health-probe-docs-fail-flag
Jun 18, 2026
Merged

Document health-probe status-code contract and Control Plane probes (#4053)#4063
justin808 merged 3 commits into
mainfrom
jg/4053-health-probe-docs-fail-flag

Conversation

@justin808

@justin808 justin808 commented Jun 16, 2026

Copy link
Copy Markdown
Member

Why

Issue #4053: the recommended renderer readiness probe curl --http2-prior-knowledge -fsS http://localhost:3800/ready breaks container startup. The container never becomes ready; removing -fsS "fixes" it (but then the probe always passes, defeating its purpose).

Root cause (confirmed in worker.ts): -f/--fail makes curl exit non-zero on any HTTP status >= 400, and /ready returns 503 {"status":"waiting_for_bundle"} during the cold-start window until the answering worker compiles its first bundle. A --fail probe against /ready with no warm-up path therefore deadlocks startup. This is intended gating behavior — but the docs lacked an explicit per-endpoint status-code contract and any Control Plane guidance, so the reporter hit the deadlock without a clear answer.

Endpoint contract (from worker.ts):

  • /health → always 200
  • /info → always 200
  • /ready200 once a bundle is compiled, else 503

What changed

Documentation only — docs/oss/building-features/node-renderer/health-checks.md:

  1. Status-Code Contract section with a per-endpoint table and a callout explaining exactly why curl -fsS .../ready breaks startup, and what to probe instead (/health or /info for an always-passes probe; reserve --fail against /ready for setups with a warm-up path).
  2. Control Plane (CPLN) section: CPLN exposes HTTP and Command (exec) probes; the HTTP probe is HTTP/1.1 and cannot speak the renderer's h2c listener, so use a Command probe with h2c-aware curl against /health by default, switching to /ready only with a warm-up path.

This addresses all three of the issue's requests: a probe command that works, the /ready+/info status-code contract, and the CPLN-specific probe options.

Validation

  • npx prettier --check passes on the changed file.
  • Docs-only; no CHANGELOG entry per .claude/docs/changelog-guidelines.md (docs fixes excluded).

Fixes #4053

🤖 Generated with Claude Code


Note

Low Risk
Documentation-only changes with no runtime, security, or deployment code modifications.

Overview
Documents why curl --fail against /ready can deadlock container readiness during cold start, and how to probe safely.

Adds a Status-Code Contract table for /health, /info, and /ready (including when 503 waiting_for_bundle is expected) plus guidance to use /health or /info for probes that must pass once the process is up, and /ready with --fail only when a warm-up path compiles the first bundle.

Adds a Control Plane (CPLN) section explaining that HTTP probes cannot use the renderer’s h2c listener, with example exec Command probes using h2c-aware curl against /health for readiness and liveness. The same content is mirrored in llms-full.txt.

Reviewed by Cursor Bugbot for commit 40ae833. Bugbot is set up for automated code reviews on this repo. Configure here.

Summary by CodeRabbit

  • Documentation
    • Added a “Status-Code Contract” describing expected HTTP responses for health endpoints, including when /ready may return 503 during cold start
    • Expanded guidance on curl --fail/-f/-fsS, warning against using /ready as a failing readiness gate without a warm-up path
    • Added Control Plane-specific probe instructions and example probe shapes to match connectivity constraints

Agent Merge Confidence (coordinator)

Merged by Claude Code coordinator under explicit maintainer (justin808) delegation — "merge authorization if confident + documented."

  • Confidence: HIGH. Docs-only (docs/oss/.../health-checks.md) + regenerated llms-full.txt.
  • Rebased onto merged docs(tooling): surface SVG diagram alt text in generated llms-full files #4087 (Group B, generated llms-full.txt serialization); llms-full.txt regenerated via node script/generate-llms-full.mjs — regeneration produced zero diff and the --check guard passes (auto-merge was correct).
  • Doc accuracy verified vs source: /health→200, /ready→200/503 waiting_for_bundle confirmed against packages/react-on-rails-pro-node-renderer/src/worker.ts; CPLN probe schema confirmed against react_on_rails_pro/.controlplane/rails.yml.
  • Current-head review threads triaged against the actual file (not merged blind): the recurring CodeRabbit/claude/codex "missing timeoutSeconds" and greptile "missing liveness probe" flags are stale/incorrect — the CPLN snippet already contains timeoutSeconds: 5 on both readinessProbe and livenessProbe. startupProbe omission is intentional and safe (probes /health, always 200; cold start cannot misfire). The -sfS and localhost-qualification notes are advisory/misreads. No actionable must-fix remained.
  • CI: mergeStateStatus: CLEAN, all checks pass/skip. Merge ledger sole hard violation unknown_review_decision — overridden by maintainer merge delegation. Changelog not_user_visible.
  • Cross-batch: Batch-1 Add OSS hydrate_on scheduling #4037 also touches llms-full.txt and must rebase+regenerate after this merge.

@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Documentation for the node-renderer health-checks page gains two additions: a status-code contract section with a table and curl --fail safety guidance, and a new Control Plane (CPLN) section explaining h2c probe limitations with an exec/Command probe example. Both sections are added to the primary docs file and a secondary documentation aggregation file.

Changes

Node Renderer Health-Check Documentation

Layer / File(s) Summary
Status-code contract and curl --fail guidance
docs/oss/building-features/node-renderer/health-checks.md, llms-full.txt
Adds a table specifying that /health and /info return only 200, while /ready can return 503 during cold start; documents when curl -f is safe vs when it causes probe failures.
CPLN exec/Command probe configuration
docs/oss/building-features/node-renderer/health-checks.md, llms-full.txt
Adds a CPLN-specific section explaining that HTTP probes cannot reach the h2c listener, provides an exec/Command probe example using h2c-capable curl against /health, and reiterates the --fail warning for /ready.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

Possibly related PRs

  • shakacode/react_on_rails#3930: Updates node-renderer Kubernetes probe documentation around exec/h2c-capable curl guidance and avoiding HTTP-only probes, directly overlapping with the CPLN exec probe section added here.
  • shakacode/react_on_rails#3348: Updates node-renderer probe/health documentation semantics—especially how curl --fail/endpoint behavior affects readiness (h2c-aware /health vs /ready cold-start risk)—directly related to the status-code contract and probe guidance added here.
  • shakacode/react_on_rails#3241: Updates OSS node-renderer probe documentation around curl --fail semantics and h2c-capable HTTP/2 curl usage, matching the probe flag guidance in this PR.

Suggested labels

review-needed, documentation, P3

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately and concisely describes the main change: documenting health-probe status-code contract and Control Plane probes for the renderer.
Linked Issues check ✅ Passed All three requirements from issue #4053 are met: working readiness probe command documented, status-code contract clarified with table, and Control Plane probe capabilities aligned with Command/exec approach.
Out of Scope Changes check ✅ Passed All changes are documentation-only and directly address the three specific objectives outlined in issue #4053 with no unrelated additions.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch jg/4053-health-probe-docs-fail-flag

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@greptile-apps

greptile-apps Bot commented Jun 16, 2026

Copy link
Copy Markdown

Greptile Summary

This is a documentation-only PR that adds two new sections to docs/oss/building-features/node-renderer/health-checks.md to address a container startup deadlock caused by using curl --fail against /ready during the cold-start window (issue #4053).

  • Status-Code Contract section: adds a per-endpoint table clarifying that /health and /info always return 200 (safe with --fail), while /ready returns 503 until the first bundle compiles — making --fail against /ready unsafe without a warm-up path.
  • Control Plane (CPLN) section: documents that CPLN's HTTP probe speaks HTTP/1.1 and cannot reach the h2c listener, so a Command (exec) probe with h2c-aware curl against /health must be used instead, with a YAML snippet and a warning against pointing --fail at /ready without pre-warming.

Confidence Score: 4/5

Documentation-only change with no runtime risk; the new sections are accurate against the worker.ts source, and both findings are editorial gaps rather than incorrect guidance.

The status-code contract table and callout are correct and match the worker.ts implementation exactly. The CPLN YAML snippet, however, promises 'readiness / liveness' in its comment but omits the liveness block, leaving readers with an incomplete configuration. The pre-existing h2c section example still hands users curl -sf .../ready without a cross-reference to the new warning.

docs/oss/building-features/node-renderer/health-checks.md — the CPLN snippet and the h2c section example both need small follow-up edits.

Important Files Changed

Filename Overview
docs/oss/building-features/node-renderer/health-checks.md Adds Status-Code Contract table and CPLN Command probe section; CPLN snippet omits the liveness probe despite prose saying both should use /health, and the existing h2c section example still uses --fail against /ready without a cross-reference to the new warning.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Container starts] --> B{Renderer process up?}
    B -- No --> DEAD[Container never starts]
    B -- Yes --> C{Probe type?}

    C -- "curl --fail /health or /info" --> D["Always 200 ✓"]
    D --> READY[Probe passes — container ready]

    C -- "curl --fail /ready (no warm-up path)" --> E{Bundle compiled?}
    E -- "Yes (200)" --> READY
    E -- "No (503 waiting_for_bundle)" --> F[Probe fails — non-zero exit]
    F --> DEAD2[Container never becomes ready — deadlock]

    C -- "curl --fail /ready (with warm-up path)" --> G[Warm-up delivers first render]
    G --> E

    style DEAD fill:#f66,color:#fff
    style DEAD2 fill:#f66,color:#fff
    style READY fill:#6c6,color:#fff
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
    A[Container starts] --> B{Renderer process up?}
    B -- No --> DEAD[Container never starts]
    B -- Yes --> C{Probe type?}

    C -- "curl --fail /health or /info" --> D["Always 200 ✓"]
    D --> READY[Probe passes — container ready]

    C -- "curl --fail /ready (no warm-up path)" --> E{Bundle compiled?}
    E -- "Yes (200)" --> READY
    E -- "No (503 waiting_for_bundle)" --> F[Probe fails — non-zero exit]
    F --> DEAD2[Container never becomes ready — deadlock]

    C -- "curl --fail /ready (with warm-up path)" --> G[Warm-up delivers first render]
    G --> E

    style DEAD fill:#f66,color:#fff
    style DEAD2 fill:#f66,color:#fff
    style READY fill:#6c6,color:#fff
Loading

Comments Outside Diff (1)

  1. docs/oss/building-features/node-renderer/health-checks.md, line 82-84 (link)

    P2 h2c section still recommends -sf with /ready without a cross-reference

    The "h2c: Why httpGet Probes Do NOT Work" section (unchanged by this PR) gives curl -sf --http2-prior-knowledge http://localhost:3800/ready as its exec-probe example — using --fail directly against /ready. That is precisely the pattern that caused issue Health-probe docs recommend curl -fsS for renderer readiness, which breaks container startup #4053. A reader who navigates directly to this section, rather than reading top-to-bottom, sees the same unsafe recommendation without encountering the new Status-Code Contract callout. Adding a sentence such as "(see Status-Code Contract before pointing --fail at /ready)" would close this gap.

Reviews (1): Last reviewed commit: "Document health-probe status-code contra..." | Re-trigger Greptile

Comment on lines +251 to +262
```yaml
# Control Plane workload — readiness / liveness as a Command probe
readiness:
exec:
command:
- curl
- -sf
- --max-time
- '3'
- --http2-prior-knowledge
- http://localhost:3800/health
```

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Missing liveness probe in CPLN snippet

The prose says "Use /health (always 200) for liveness and readiness by default" and the comment on the code block says "readiness / liveness as a Command probe", but only the readiness key is shown. A user copying this snippet will configure readiness without liveness, contrary to the intent described in the surrounding text. A corresponding liveness block pointing at /health should appear alongside readiness in the example.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c93a5ec155

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".


```yaml
# Control Plane workload — readiness / liveness as a Command probe
readiness:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Use Control Plane probe field names

For Control Plane manifests this block needs to be attached to the workload container as readinessProbe/livenessProbe; the rest of our CPLN docs use spec.containers[].readinessProbe (for example, docs/oss/deployment/docker-deployment.md:365), and CPLN's probe schema is Kubernetes-style with exec/httpGet/tcpSocket/grpc inside that probe. As written, a user who copies this readiness: object into a workload manifest gets no valid Command probe, so the advertised readiness/liveness check will not run.

Useful? React with 👍 / 👎.

justin808 added a commit that referenced this pull request Jun 17, 2026
The "is this PR's CI ready?" rule was restated in prose across pr-batch,
adversarial-pr-review, and pr-processing: run `gh pr checks <PR> --required`,
fall back to the full `gh pr checks <PR>` list when no required checks exist,
ignore cancelled/superseded rows, and treat an empty list as UNKNOWN/not-ready.

Encapsulate that rule in `.agents/skills/pr-batch/bin/pr-ci-readiness`, written
as pure Ruby (stdlib json + open3, no bundler/gems) mirroring the
fetch-pr-review-data template: a PrCiReadiness module of pure module_function
verdict helpers plus a Runner class for the CLI. Subprocesses use Open3 array
args (no shell); `gh pr checks` stdout is captured despite its non-zero exit on
failing/pending checks, with stderr discarded, and the rows are inspected to
print a verdict of READY, NOT_READY, or UNKNOWN with failing/pending names and a
required_used flag. Flags: --json (default)/--text/--self-check/--help.

This replaces the original bash implementation. The helpers already hard-depended
on Ruby (the bash versions shelled out to `ruby -rjson`), so going pure Ruby
removes the bash<->Ruby seam at zero dependency cost. Output JSON shape, executable
path/name, and CLI flags are unchanged, so the SKILL.md / workflow references stay
valid and need no edits.

The companion test is now minitest (pr-ci-readiness-test.rb, replacing the
-test.bash): pure verdict logic is tested directly with canned check arrays, and
the CLI/Runner path is exercised through a fake `gh` on PATH. Verified output is
byte-for-byte identical to the previous bash version against merged PR #4058 and
in-progress PR #4063. rubocop and minitest both pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@claude

claude Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Code Review

Verdict: Approve with one minor suggestion

This is a well-written, technically accurate documentation PR that cleanly resolves issue #4053. The root cause analysis (cold-start 503 + curl --fail deadlock) is correct and will save users significant debugging time.

What's working well

  • Status-Code Contract table is accurate: curl --fail / -f exits non-zero on HTTP ≥ 400, and /ready correctly returns 503 waiting_for_bundle during cold start.
  • Callout box (the > **Why... blockquote) is crisp — it states the mechanism, names the status code, and links to the existing gating section.
  • All anchor cross-references are valid: #status-code-contract, #gating-traffic-on-ready, #h2c-why-httpget-probes-do-not-work all resolve to existing headings.
  • CPLN section correctly identifies the h2c limitation and steers readers toward Command probes — consistent with how the h2c and ECS sections handle the same constraint.
  • Correctly omits a CHANGELOG entry per project guidelines (docs-only fix).

Minor issue

The CPLN YAML comment says # Control Plane workload — readiness / liveness as a Command probe, advertising both probes, but only the readiness key is shown. See the inline comment for details.

Note on curl HTTP/2 requirement

The CPLN section uses --http2-prior-knowledge but doesn't surface the curl build requirement (curl --version | grep -i http2). The h2c section is already cross-linked in the prose, so readers can find it, but a one-line note like the one in the h2c section ("Verify your image's curl has HTTP/2 support...") or a cross-link sentence right before the YAML block would make this self-contained for CPLN readers.

Comment on lines +252 to +263
# Control Plane workload — readiness / liveness as a Command probe
readiness:
exec:
command:
- curl
- -sf
- --max-time
- '3'
- --http2-prior-knowledge
- http://localhost:3800/health
```

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment # Control Plane workload — readiness / liveness as a Command probe advertises both probe types, but only readiness is defined in the snippet. A reader following this literally would have liveness unconfigured.

Either show both keys (same curl command, same /health path) or narrow the comment so it doesn't imply liveness is covered:

Suggested change
# Control Plane workload — readiness / liveness as a Command probe
readiness:
exec:
command:
- curl
- -sf
- --max-time
- '3'
- --http2-prior-knowledge
- http://localhost:3800/health
```
```yaml
# Control Plane workload — readiness as a Command probe
# (apply the same shape to liveness if desired)
readiness:
exec:
command:
- curl
- -sf
- --max-time
- '3'
- --http2-prior-knowledge
- http://localhost:3800/health

@justin808 justin808 force-pushed the jg/4053-health-probe-docs-fail-flag branch from c93a5ec to 7469f72 Compare June 18, 2026 20:43
@claude

claude Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Code Review — PR #4063

Documentation-only change. The root-cause analysis is correct and the additions are accurate and well-written. A few things worth addressing before merge:


1. Missing timeoutSeconds in Control Plane YAML (correctness risk)

The CPLN readinessProbe and livenessProbe examples omit timeoutSeconds. If CPLN applies the standard Kubernetes default of 1 second, the orchestrator will kill the probe container before curl's own --max-time 3 fires — every probe attempt would time out even when /health is healthy. The ECS and Kubernetes /ready examples in the same file both include an explicit timeoutSeconds for exactly this reason.

The CPLN blocks should include timeoutSeconds: 5 (matching the /ready exec probe pattern elsewhere in the doc):

readinessProbe:
  exec:
    command: [curl, -sf, --max-time, '3', --http2-prior-knowledge, http://localhost:3800/health]
  periodSeconds: 5
  failureThreshold: 3
  timeoutSeconds: 5   # must be > --max-time (3 s) or the orchestrator kills curl first
livenessProbe:
  exec:
    command: [...]
  periodSeconds: 10
  failureThreshold: 3
  timeoutSeconds: 5

2. No startup probe / cold-start liveness guidance for CPLN

The Kubernetes section leads with a startupProbe that shields liveness during the cold-start window. The CPLN section is silent on this. Even a brief note would help: if CPLN supports startup probes, show the pattern; if not, note that users should tune failureThreshold on the liveness probe to absorb boot time. Without any guidance, users may configure a liveness probe that restarts the workload before it finishes booting.


3. -sf suppresses curl error output

All new examples use -sf (--silent --fail). The original failing command in the issue was -fsS (--fail --silent --show-error). The -S/--show-error flag re-enables error messages even under --silent, which is very useful when debugging a failing probe in log-sparse environments. Consider -sfS in the CPLN examples so probe failures produce actionable output in container logs.


Minor observations

  • /info in the status-code table: /info is not described in this document (it lives in js-configuration.md). A parenthetical like "(see Built-in Endpoints)" after "returns Node and renderer versions" would make the table self-contained.
  • Long line in CPLN prose: The sentence ending ...with a Command wraps a very long Markdown link mid-phrase; consider breaking or shortening the anchor text.
  • No initialDelaySeconds in CPLN example: If CPLN probes start immediately, users may want a note that failureThreshold needs tuning to absorb boot time (e.g., 10 s boot / 5 s period = 2 failures consumed before the threshold of 3 is hit).

Summary: The content is correct and fills a real documentation gap. Item 1 (missing timeoutSeconds) is the only blocking issue — it could cause intermittent probe failures for anyone who follows the CPLN example literally if the platform defaults to a 1 s timeout. Items 2 and 3 are worth addressing before merge. Everything else is minor polish.

Comment on lines +270 to +271
periodSeconds: 5
failureThreshold: 3

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing timeoutSeconds here. If CPLN applies the standard Kubernetes default (1 s), the orchestrator will terminate the curl process before its own --max-time 3 fires, causing every probe to time out even when /health is healthy. Add timeoutSeconds: 5 to match the /ready exec probe pattern used elsewhere in this doc.

Suggested change
periodSeconds: 5
failureThreshold: 3
periodSeconds: 5
failureThreshold: 3
timeoutSeconds: 5

Comment on lines +281 to +282
periodSeconds: 10
failureThreshold: 3

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue: timeoutSeconds is missing. Without it, the orchestrator's default (1 s in Kubernetes) races against --max-time 3 and wins, killing the probe before curl finishes.

Suggested change
periodSeconds: 10
failureThreshold: 3
periodSeconds: 10
failureThreshold: 3
timeoutSeconds: 5

Comment on lines +265 to +266
- -sf
- --max-time

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider -sfS (--silent --fail --show-error) instead of -sf. The -S/--show-error flag re-enables curl error messages even under --silent, so probe failures produce actionable output in container logs rather than silent non-zero exits. Same applies to the livenessProbe curl below.

Suggested change
- -sf
- --max-time
- -sfS
- --max-time

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7469f72281

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +266 to +268
- --max-time
- '3'
- --http2-prior-knowledge

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Add timeoutSeconds to the CPLN exec probes

For Control Plane command probes, timeoutSeconds defaults to 1s (matching the Kubernetes-style default documented in docs/oss/building-features/node-renderer/js-configuration.md:480), but this snippet lets curl wait up to 3s without raising the probe timeout. A copied manifest on a loaded renderer can have CPLN kill the exec probe after 1s, so the readiness/liveness check fails before the intended curl deadline; add timeoutSeconds: 5 to both probes as in the earlier exec-probe example.

Useful? React with 👍 / 👎.

@justin808

Copy link
Copy Markdown
Member Author

Addressed the current-head review feedback in c871729 (on top of the corrected CPLN snippet in 7469f72):

  • Missing liveness probe / CPLN field names (greptile, codex, claude): the CPLN snippet now uses the Kubernetes-style readinessProbe / livenessProbe container fields that Control Plane actually accepts (matching this repo's own react_on_rails_pro/.controlplane/rails.yml workload manifest and the existing Control Plane deployment docs), and shows both probes pointing at /health. The previous flat readiness: key would not have configured a valid probe.
  • Missing timeoutSeconds (claude, lines 271/282): added timeoutSeconds: 5 to both probes so the curl --max-time 3 budget owns the deadline rather than the orchestrator's 1s default, matching the /ready exec probe pattern already used in this doc.
  • -sfS vs -sf (claude, line 266): keeping -sf for consistency — every other curl example in this doc (Kubernetes, ECS, Docker Compose) uses -sf. Switching only the CPLN snippet to -sfS would be inconsistent; if we want show-error output it should be a separate doc-wide change.

Also regenerated llms-full.txt to clear the stale check-llms-full gate.

@justin808 justin808 force-pushed the jg/4053-health-probe-docs-fail-flag branch from c871729 to d245889 Compare June 18, 2026 21:09
@claude

claude Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Code Review — PR #4063

Documentation-only change. The status-code claims are verified against worker.ts: /health always 200 ✓, /info always 200 ✓, /ready returns 200 or 503 based on hasAnyVMContext() ✓. Cross-reference links resolve correctly. Overall this is a clear, accurate fix for the cold-start deadlock confusion.


Issues found

Minor (2)

  1. Table cell ambiguity — /ready row. The Status codes cell reads `503` `waiting_for_bundle` until then, which makes it look like waiting_for_bundle might be an HTTP status text rather than the JSON response body. The actual response body is {"status":"waiting_for_bundle"}. Suggesting a small clarification — see inline comment.

  2. CPLN YAML omits startupProbe with no explanation. The existing renderer probe example in docs/oss/deployment/docker-deployment.md (the file this section links to) includes a startupProbe for the node-renderer container. The new CPLN section omits it. This is defensible — because we probe /health (always 200), liveness cannot fire during a normal cold-start — but the omission is inconsistent with the sibling doc and may confuse readers comparing the two. Either add the startupProbe or add an inline comment explaining why it is not needed here. See inline comment.


Observations (non-blocking)

  • -sf vs -fsS in the blockquote warning. The blockquote introduces -fsS as the broken pattern, but the YAML examples correctly use -sf (omitting -S/--show-error, which is redundant for probe commands whose stdout is discarded). The YAML choice is right; the blockquote is just describing the issue as originally reported. Not a problem.
  • Retry-After header on 503 not documented. /ready also sends a Retry-After header on its 503 response (see worker.ts line ~1057). Not documenting it here is fine for probe guidance; worth knowing if anyone writes a custom polling loop.
  • timeoutSeconds: 5 > --max-time 3 relationship is correctly explained and consistent with the ECS and Kubernetes sections.

| --------- | ---------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------- |
| `/health` | Always `200` (process can answer) | **Yes** — never non-2xx. |
| `/info` | Always `200` (returns Node and renderer versions) | **Yes** — never non-2xx. |
| `/ready` | `200` once a bundle is compiled; `503` `waiting_for_bundle` until then | **Only with a warm-up path.** Without one, `-f` turns the cold-start `503` into a failed probe. |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cell `503` `waiting_for_bundle` until then reads as if waiting_for_bundle might be an HTTP status text. Consider making the JSON body explicit so it is unambiguous:

Suggested change
| `/ready` | `200` once a bundle is compiled; `503` `waiting_for_bundle` until then | **Only with a warm-up path.** Without one, `-f` turns the cold-start `503` into a failed probe. |
| `/ready` | `200` once a bundle is compiled; `503` with `{"status":"waiting_for_bundle"}` until then | **Only with a warm-up path.** Without one, `-f` turns the cold-start `503` into a failed probe. |

Comment on lines +254 to +285
```yaml
kind: workload
spec:
containers:
- name: node-renderer
# ... image, ports, env (RENDERER_ENABLE_HEALTH_ENDPOINTS: 'true') ...
# Command probe — h2c-aware curl against /health (always 200).
readinessProbe:
exec:
command:
- curl
- -sf
- --max-time
- '3'
- --http2-prior-knowledge
- http://localhost:3800/health
periodSeconds: 5
failureThreshold: 3
timeoutSeconds: 5 # exceed curl --max-time 3 so the probe, not the orchestrator, owns the timeout
livenessProbe:
exec:
command:
- curl
- -sf
- --max-time
- '3'
- --http2-prior-knowledge
- http://localhost:3800/health
periodSeconds: 10
failureThreshold: 3
timeoutSeconds: 5 # exceed curl --max-time 3 so the probe, not the orchestrator, owns the timeout
```

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CPLN YAML omits a startupProbe, while the renderer probe example in the linked docker-deployment.md includes one. Since this section probes /health (always 200), liveness cannot misfire during a cold start, so omitting startupProbe is safe — but the omission is silent. Consider either:

Option A — add an inline comment explaining the omission:

Suggested change
```yaml
kind: workload
spec:
containers:
- name: node-renderer
# ... image, ports, env (RENDERER_ENABLE_HEALTH_ENDPOINTS: 'true') ...
# Command probe — h2c-aware curl against /health (always 200).
readinessProbe:
exec:
command:
- curl
- -sf
- --max-time
- '3'
- --http2-prior-knowledge
- http://localhost:3800/health
periodSeconds: 5
failureThreshold: 3
timeoutSeconds: 5 # exceed curl --max-time 3 so the probe, not the orchestrator, owns the timeout
livenessProbe:
exec:
command:
- curl
- -sf
- --max-time
- '3'
- --http2-prior-knowledge
- http://localhost:3800/health
periodSeconds: 10
failureThreshold: 3
timeoutSeconds: 5 # exceed curl --max-time 3 so the probe, not the orchestrator, owns the timeout
```
```yaml
kind: workload
spec:
containers:
- name: node-renderer
# ... image, ports, env (RENDERER_ENABLE_HEALTH_ENDPOINTS: 'true') ...
# No startupProbe needed: /health always returns 200, so liveness cannot
# misfire during cold start. Add one if you switch the paths to /ready.
# Command probe — h2c-aware curl against /health (always 200).
readinessProbe:
exec:
command:
- curl
- -sf
- --max-time
- '3'
- --http2-prior-knowledge
- http://localhost:3800/health
periodSeconds: 5
failureThreshold: 3
timeoutSeconds: 5 # exceed curl --max-time 3 so the probe, not the orchestrator, owns the timeout
livenessProbe:
exec:
command:
- curl
- -sf
- --max-time
- '3'
- --http2-prior-knowledge
- http://localhost:3800/health
periodSeconds: 10
failureThreshold: 3
timeoutSeconds: 5 # exceed curl --max-time 3 so the probe, not the orchestrator, owns the timeout

**Option B** — add a `startupProbe` for consistency with the sibling doc (a `tcpSocket` probe is sufficient, matching the Kubernetes section pattern).

Either works; the suggestion above just adds the comment.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d245889a01

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Control Plane exposes two relevant probe shapes: an **HTTP** probe and a **Command** (exec) probe. The HTTP probe is
HTTP/1.1 and **cannot speak the renderer's h2c listener** (see [h2c](#h2c-why-httpget-probes-do-not-work)), so it
always fails against these endpoints — use a **Command** probe with an h2c-aware curl instead. Command probes run
inside the container, so the default `localhost` binding works and no `0.0.0.0` host is required.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Qualify localhost guidance for separate workloads

In the separate node-renderer Control Plane workload topology, Rails connects over the workload service address, not from inside this container; the existing deployment guide says that case must use 0.0.0.0 for the renderer host (docs/oss/building-features/node-renderer/container-deployment.md:218-220). As written, this sentence can lead users to keep the default localhost binding because the exec probe passes, while real Rails→renderer traffic to the separate workload cannot connect.

Useful? React with 👍 / 👎.

@claude

claude Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Code Review — Health-probe status-code contract and Control Plane probes

Overall: Well-written, directly actionable docs fix for issue #4053. The root cause analysis is accurate, the status-code table is immediately useful, and the CPLN section correctly identifies the h2c/HTTP/1.1 incompatibility. A few things to address before merging:


Strengths

  • Status-Code Contract table is exactly what was missing. Anyone who hits the curl -f deadlock now has a clear "which endpoint / which flag" reference.
  • The callout explaining why -fsS .../ready deadlocks is technically accurate and well-worded.
  • The timeoutSeconds: 5 > curl --max-time 3 pattern with an explanatory comment is good practice and well-communicated.
  • CPLN section correctly identifies that HTTP probes can't speak h2c and directs users to exec probes instead.

Notable gap: no startupProbe in the CPLN example

The Kubernetes section shields liveness during boot with a startupProbe:

# Startup: shield liveness while the renderer boots. TCP is enough here
# because readiness below gates traffic.
startupProbe:
  tcpSocket:
    port: 3800
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 12 # 10 + (5 * 12) = 70 s total
  timeoutSeconds: 1

The CPLN example omits a startupProbe entirely, and neither readinessProbe nor livenessProbe sets initialDelaySeconds. Without either guard, a slow-starting renderer (large image pull + Node.js cold start) can be killed by the liveness probe before it finishes booting. CPLN supports startupProbe in the same Kubernetes-style format, so a matching exec-based startupProbe against /health would close the gap.

For CPLN the tcpSocket form used in the Kubernetes example isn't available (no 0.0.0.0 binding required), but the exec form works:

startupProbe:
  exec:
    command:
      - curl
      - -sf
      - --max-time
      - '3'
      - --http2-prior-knowledge
      - http://localhost:3800/health
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 12  # 10 + (5 × 12) = 70 s; tune to your cold-start window
  timeoutSeconds: 5

An inline suggestion has been posted on the specific lines.


Minor: llms-full.txt manual sync

Both changed files carry identical diffs. If llms-full.txt is generated (e.g. by a build or CI step), it should be regenerated rather than hand-patched to prevent drift. If it's intentionally maintained by hand, that's fine — just worth confirming the pattern is consistent with how other docs changes are handled.


Summary

Approve with the startupProbe gap addressed — either by adding it to the CPLN YAML example, or by adding a note explaining why it's omitted (e.g., if CPLN handles startup shielding differently). Everything else is accurate and ready to merge.

- name: node-renderer
# ... image, ports, env (RENDERER_ENABLE_HEALTH_ENDPOINTS: 'true') ...
# Command probe — h2c-aware curl against /health (always 200).
readinessProbe:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CPLN YAML example is missing a startupProbe, which the Kubernetes section includes to shield liveness during boot. Without it (and without initialDelaySeconds on liveness), a slow-starting renderer can be killed by the liveness probe before it finishes booting.

CPLN supports startupProbe in the same Kubernetes-style format. The tcpSocket form from the Kubernetes example can't be used here (no 0.0.0.0 binding), but the exec form against /health works:

Suggested change
readinessProbe:
startupProbe:
exec:
command:
- curl
- -sf
- --max-time
- '3'
- --http2-prior-knowledge
- http://localhost:3800/health
initialDelaySeconds: 10
periodSeconds: 5
failureThreshold: 12 # 10 + (5 × 12) = 70 s; tune to your cold-start window
timeoutSeconds: 5
# Command probe — h2c-aware curl against /health (always 200).
readinessProbe:

This mirrors the guidance in the Kubernetes section ("Startup: shield liveness while the renderer boots") and prevents the liveness probe from racing against cold start.

justin808 and others added 3 commits June 18, 2026 11:49
Issue #4053 reported that the recommended renderer readiness probe
`curl --http2-prior-knowledge -fsS http://localhost:3800/ready` breaks
container startup. Root cause: `-f`/`--fail` exits non-zero on any HTTP
status >= 400, and `/ready` returns 503 (waiting_for_bundle) during the
cold-start window until the answering worker compiles its first bundle.
Pointing a --fail probe at /ready with no warm-up path therefore
deadlocks startup. This is documented gating behavior, but the docs
lacked an explicit per-endpoint status-code contract and Control Plane
guidance.

Changes to health-checks.md:
- Add a Status-Code Contract section: /health and /info are always 200
  (safe with --fail); /ready is 200/503 (only safe with --fail when a
  warm-up path exists). Includes a callout explaining why
  `curl -fsS .../ready` breaks startup and what to use instead.
- Add a Control Plane (CPLN) section: CPLN offers HTTP and Command
  (exec) probes; the HTTP probe cannot speak the renderer's h2c
  listener, so use a Command probe with h2c-aware curl against /health
  by default, and only switch to /ready with a warm-up path.

Docs-only change; no CHANGELOG entry per changelog guidelines.

Fixes #4053

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Address review threads on the Control Plane snippet in
docs/oss/building-features/node-renderer/health-checks.md:

- Use the Kubernetes-style readinessProbe / livenessProbe container
  fields that Control Plane actually accepts, matching the existing
  Control Plane deployment docs, instead of a flat top-level
  `readiness:` key that would not configure a valid probe.
- Show both readinessProbe and livenessProbe (the comment previously
  advertised liveness but only defined readiness).

Regenerate llms-full.txt so the docs change is reflected (fixes the
stale check-llms-full CI gate).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add `timeoutSeconds: 5` to the Control Plane readinessProbe and
livenessProbe examples so the curl `--max-time 3` budget, not the
orchestrator's default 1s probe timeout, owns the deadline — matching
the existing /ready exec probe pattern in this doc. Without it the
orchestrator can kill curl before it finishes, flaking a healthy probe.

Regenerate llms-full.txt.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@justin808 justin808 force-pushed the jg/4053-health-probe-docs-fail-flag branch from d245889 to 40ae833 Compare June 18, 2026 21:50
@claude

claude Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Review: Health-Probe Status-Code Contract and CPLN Docs

Overall: Clean, well-written documentation that directly addresses the root cause in #4053. The status-code contract table is concise and accurate, the blockquote callout puts the deadlock scenario front-and-center, and the CPLN section correctly explains why HTTP probes fail (h2c incompatibility) with a working exec probe example. A few things worth discussing:


Missing startupProbe in the CPLN section

The Kubernetes section shields the liveness probe during cold start with a startupProbe (initialDelaySeconds: 10, up to 70 s via failureThreshold: 12 × periodSeconds: 5). The new CPLN section omits it. If CPLN supports the Kubernetes-style startupProbe field (it typically does for workload containers), the liveness probe begins immediately, giving only failureThreshold: 3 × periodSeconds: 10 = 30 s before a restart. During image pulls or slow Fastify startup that window can be too tight.

Suggested addition after the existing livenessProbe block (exec/curl against /health since HTTP probes can't speak h2c):

startupProbe:
  exec:
    command:
      - curl
      - -sf
      - --max-time
      - '3'
      - --http2-prior-knowledge
      - http://localhost:3800/health
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 12  # 10 + (5 × 12) = 70 s total; tune to your cold-start time
  timeoutSeconds: 5

If CPLN does not support startupProbe, a note explaining that and recommending a higher failureThreshold on the liveness probe as compensation would be helpful.


>= 400 wording — minor clarity

The note says curl --fail exits non-zero on "any HTTP status >= 400". Mathematically correct (503 ≥ 400), but readers who think of 4xx and 5xx as distinct categories may not immediately apply the rule to /ready's 503. A small parenthetical — >= 400 (4xx and 5xx) — would prevent that misread.


Long inline YAML comment — minor

The line timeoutSeconds: 5 # exceed curl --max-time 3 so the probe, not the orchestrator, owns the timeout (appears twice in the CPLN block) is quite long. Some Markdown renderers or narrow displays will wrap it mid-line, breaking the visual association with the field. Consider moving the explanation to a preceding # comment line in the YAML snippet.


## Status-Code Contract

Probe tooling that uses `curl --fail` / `-f` (which `-sf` and `-fsS` both include) exits non-zero on any HTTP status

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mathematically correct (503 ≥ 400), but readers who think in 4xx vs 5xx buckets may not immediately realize this applies to /ready's 503. A small parenthetical would prevent that misread:

Suggested change
Probe tooling that uses `curl --fail` / `-f` (which `-sf` and `-fsS` both include) exits non-zero on any HTTP status
`>= 400`. Whether `--fail` is safe therefore depends on which endpoint you probe:

Consider: `>= 400` (4xx and 5xx) — or replace with "any non-2xx/3xx status" which maps more directly to how developers think about "failure" HTTP responses.

Comment on lines +273 to +284
livenessProbe:
exec:
command:
- curl
- -sf
- --max-time
- '3'
- --http2-prior-knowledge
- http://localhost:3800/health
periodSeconds: 10
failureThreshold: 3
timeoutSeconds: 5 # exceed curl --max-time 3 so the probe, not the orchestrator, owns the timeout

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Kubernetes section includes a startupProbe that shields the liveness probe while the container boots (up to 70 s via initialDelaySeconds: 10 + failureThreshold: 12 × periodSeconds: 5). Without one here, the livenessProbe starts immediately and allows only failureThreshold: 3 × periodSeconds: 10 = 30 s before a restart — potentially too tight during image pulls or slow Fastify startup.

If CPLN supports startupProbe (Kubernetes-style workload definitions usually do), consider adding:

      startupProbe:
        exec:
          command:
            - curl
            - -sf
            - --max-time
            - '3'
            - --http2-prior-knowledge
            - http://localhost:3800/health
        initialDelaySeconds: 10
        periodSeconds: 5
        failureThreshold: 12  # 10 + (5 × 12) = 70 s total; tune to your cold-start time
        timeoutSeconds: 5

If CPLN does not support startupProbe, a note explaining that and recommending a higher failureThreshold on the liveness probe would prevent readers from silently under-allocating startup time.

@justin808 justin808 merged commit 072bc53 into main Jun 18, 2026
44 checks passed
@justin808 justin808 deleted the jg/4053-health-probe-docs-fail-flag branch June 18, 2026 22:13
justin808 added a commit that referenced this pull request Jun 19, 2026
* origin/main: (40 commits)
  feat(pro): use built-in Rails i18n compiler for React Intl demo (#4128)
  Fix pr-merge-ledger UTF-8 crash under non-UTF-8 locale (#4123)
  Add canonical AI-agent prompts source (prompts.yml) (#4124)
  Local benchmark runner: raise server-boot timeout for slower machines (#4073) (#4125)
  Docs(generator): note Pro production devtool for source-mapped SSR stacks (#3893) (#4113)
  docs: add "Consuming an Unreleased Build" guide and fix pnpm git-subdir syntax (#4117)
  Address deferred AI-review feedback on PR-helper scripts (#4069) (#4105)
  Wrap generated demo file paths in onboarding page (Part 1 of #4062) (#4107)
  fix(ci): build bundle-size base from PR merge commit's first parent (#4110)
  Add internal RSC architecture deep-dive docs (RoR Pro vs Next.js) (#4006)
  Disable noisy automatic benchmark regression issue filing (#4071) (#4116)
  Release-train branching + phase-tiered merge gating (beta/RC/final) (#4018)
  Fix Webpack dependency selection in install generator (#4109)
  Document health-probe status-code contract and Control Plane probes (#4053) (#4063)
  Local dedicated-hardware benchmark runner (#4073) (#4088)
  docs(tooling): surface SVG diagram alt text in generated llms-full files (#4087)
  docs(agents): codify review-loop convergence + local/CI parity in PR-batch workflow (#4101)
  Split RenderFunction: drop the legacy renderer arm (#4096)
  Add OSS hydrate_on scheduling (#4037)
  Docs: fix stale evaluate-issue gate cross-reference (#3910) (#4104)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Health-probe docs recommend curl -fsS for renderer readiness, which breaks container startup

1 participant