Skip to content

Harden skill: generic runner detection + mandatory failure-triage gate before fixing source #221

Description

@jobordu

Summary

Field feedback from running /nf:harden on a Python/pytest financial repo. The loop machinery (convergence, termination, guardrails) is well-built, but the skill is missing its judgment layer — and that's exactly where the value and the risk live for correctness-sensitive code. Run literally and unattended on that repo it would have skipped entirely (wrong language) or "fixed" correct code to satisfy a confident-but-wrong adversarial test. Both were worked around by hand.

Verified against core/workflows/harden.md.

Structural weakness 1 — hardcoded to JS/TS (skill no-ops on other ecosystems)

core/workflows/harden.md Step 2 only discovers *.test.{js,cjs,ts} / *.spec.{js,ts} (lines 39-42) and only maps runners vitest/jest/node --test/npm test from package.json (lines 56-60). On a pytest repo it finds zero test files → prints "HARDENING SKIPPED" and does nothing.

  • Fix: detect the test runner/ecosystem generically (at least the top few: pytest, go test, cargo test, JUnit/Maven, …) — discover via config files / lockfiles, not just package.json. Glob patterns and $RUN_CMD should derive from the detected ecosystem.

Structural weakness 2 — fix loop blindly trusts the adversarial test (the dangerous one)

Step 4b (line 177) tells the fixer: "Fix the source implementation (NOT the tests) to make the failing tests pass." There is no gate that the failing test asserts correct behavior before source is changed to satisfy it. On a financial system this is a footgun: an adversarial agent that misreads the spec writes a wrong-expectation test → the fixer "repairs" correct code into a regression. Hit live twice (an impossible forecast_mult<1 input where blind re-flooring would diverge from live and trip a self-consistency guard; and the agent's own wrong-floor test it had to retract).

  • Fix (highest priority): insert a mandatory triage step before the fixer is ever allowed to touch source — for each new failure, classify real-gap vs invalid-assertion, citing the spec/source. Only real gaps proceed to the fixer; invalid-assertion tests are rejected/rewritten, never satisfied by mutating source. Ideally the triage is a separate agent/role from both the adversary and the fixer.

Smaller gaps also compensated for by hand

  • No full-suite / gate awareness. The loop only checks whether the new adversarial tests pass ($RUN_CMD); it never runs coverage floors, lint, or the rest of the suite. A "fix" can green the new test while breaking a 100% coverage gate and the skill won't notice. → Run the full suite + project gates between iterations and fail/flag on regression.
  • No cross-iteration diversity. Step 4a re-spawns the same prompt every round (categories are static per run; --full only widens them once), so iterations re-cover ground and burn the cap. → Rotate the focus per iteration (e.g. math → wiring → numerics → newly-added code).
  • "Converged" is oversold. 2 quiet rounds means "the agent stopped finding things" — sensitive to agent luck/temperature, not a robustness proof. The HARDENING CONVERGED banner implies more certainty than the heuristic delivers. → Soften the wording / state it's a heuristic stop.
  • Single adversarial agent (no panel). One Task per round; a small diverse panel would surface more.
  • Hardcoded path node ~/.claude/nf/bin/nf-tools.cjs (line 180) — resolve portably.
  • "Add tests to EXISTING test files only" (line 132) gets awkward when a focus area has no test file yet.

Acceptance criteria

  • Runs meaningfully on at least pytest + one more non-Node ecosystem (generic runner detection; no false "SKIPPED").
  • A failure is triaged real-gap vs invalid-assertion (with spec/source citation) before any source change; invalid-assertion tests are never satisfied by mutating source.
  • Full suite + configured gates (coverage/lint) run between iterations; a fix that regresses a gate is flagged, not silently accepted.
  • Per-iteration focus diversity; "converged" wording reflects it's a heuristic stop.

Filed from live /nf:harden usage feedback.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions