Harden skill: generic runner detection + mandatory failure-triage gate before fixing source

## Summary
Field feedback from running `/nf:harden` on a **Python/pytest** financial repo. The loop machinery (convergence, termination, guardrails) is well-built, but the skill is missing its **judgment layer** — and that's exactly where the value and the risk live for correctness-sensitive code. Run literally and unattended on that repo it would have **skipped entirely** (wrong language) or **"fixed" correct code to satisfy a confident-but-wrong adversarial test**. Both were worked around by hand.

Verified against `core/workflows/harden.md`.

## Structural weakness 1 — hardcoded to JS/TS (skill no-ops on other ecosystems)
`core/workflows/harden.md` Step 2 only discovers `*.test.{js,cjs,ts}` / `*.spec.{js,ts}` (lines 39-42) and only maps runners `vitest`/`jest`/`node --test`/`npm test` from `package.json` (lines 56-60). On a pytest repo it finds zero test files → prints **"HARDENING SKIPPED"** and does nothing.
- **Fix:** detect the test runner/ecosystem generically (at least the top few: pytest, go test, cargo test, JUnit/Maven, …) — discover via config files / lockfiles, not just `package.json`. Glob patterns and `$RUN_CMD` should derive from the detected ecosystem.

## Structural weakness 2 — fix loop blindly trusts the adversarial test (the dangerous one)
Step 4b (line 177) tells the fixer: *"Fix the source implementation (NOT the tests) to make the failing tests pass."* There is **no gate** that the failing test asserts *correct* behavior before source is changed to satisfy it. On a financial system this is a footgun: an adversarial agent that misreads the spec writes a wrong-expectation test → the fixer "repairs" correct code into a regression. Hit live twice (an impossible `forecast_mult<1` input where blind re-flooring would diverge from live and trip a self-consistency guard; and the agent's own wrong-floor test it had to retract).
- **Fix (highest priority):** insert a **mandatory triage step** before the fixer is ever allowed to touch source — for each new failure, classify **real-gap vs invalid-assertion**, citing the spec/source. Only real gaps proceed to the fixer; invalid-assertion tests are rejected/rewritten, never satisfied by mutating source. Ideally the triage is a *separate* agent/role from both the adversary and the fixer.

## Smaller gaps also compensated for by hand
- **No full-suite / gate awareness.** The loop only checks whether the new adversarial tests pass (`$RUN_CMD`); it never runs coverage floors, lint, or the rest of the suite. A "fix" can green the new test while breaking a 100% coverage gate and the skill won't notice. → Run the full suite + project gates between iterations and fail/flag on regression.
- **No cross-iteration diversity.** Step 4a re-spawns the same prompt every round (categories are static per run; `--full` only widens them once), so iterations re-cover ground and burn the cap. → Rotate the focus per iteration (e.g. math → wiring → numerics → newly-added code).
- **"Converged" is oversold.** 2 quiet rounds means "the agent stopped finding things" — sensitive to agent luck/temperature, not a robustness proof. The `HARDENING CONVERGED` banner implies more certainty than the heuristic delivers. → Soften the wording / state it's a heuristic stop.
- **Single adversarial agent (no panel).** One Task per round; a small diverse panel would surface more.
- **Hardcoded path** `node ~/.claude/nf/bin/nf-tools.cjs` (line 180) — resolve portably.
- **"Add tests to EXISTING test files only" (line 132)** gets awkward when a focus area has no test file yet.

## Acceptance criteria
- Runs meaningfully on at least pytest + one more non-Node ecosystem (generic runner detection; no false "SKIPPED").
- A failure is **triaged real-gap vs invalid-assertion (with spec/source citation) before** any source change; invalid-assertion tests are never satisfied by mutating source.
- Full suite + configured gates (coverage/lint) run between iterations; a fix that regresses a gate is flagged, not silently accepted.
- Per-iteration focus diversity; "converged" wording reflects it's a heuristic stop.

_Filed from live `/nf:harden` usage feedback._


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Harden skill: generic runner detection + mandatory failure-triage gate before fixing source #221

Summary

Structural weakness 1 — hardcoded to JS/TS (skill no-ops on other ecosystems)

Structural weakness 2 — fix loop blindly trusts the adversarial test (the dangerous one)

Smaller gaps also compensated for by hand

Acceptance criteria

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Harden skill: generic runner detection + mandatory failure-triage gate before fixing source #221

Description

Summary

Structural weakness 1 — hardcoded to JS/TS (skill no-ops on other ecosystems)

Structural weakness 2 — fix loop blindly trusts the adversarial test (the dangerous one)

Smaller gaps also compensated for by hand

Acceptance criteria

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions