Commit 7b74bbf

docs(skill): add Tier-2 findings to mutation_testing skill
Captures the pytcp Tier-2 audit learnings:

- §4 equivalent-mutant classes 12-17: bounded-domain enum/mode comparisons (arp_ignore == 8 ≡ >= 8 over {0,1,2,8}), __debug__-and-log guard operands (log mocked), idempotent re-assignment guards, defensively-unreachable case-_ counters (parser rejects upstream), keyword-only *-separator AST no-ops, never-wrapping modular masks.
- New §6.5 "Shard archetypes & baseline economics": the central Tier-2 lesson. Archetype A (thin-unit, high-ROI, close gaps) vs Archetype B (integration-saturated, audit-don't-grind), the mock-construct fast-baseline technique (socket drop-in: 6.7s daemon -> 0.4s unit), the baseline-choice rule (a seam/parity test under-reports ~15x: ack 4.7% seam vs 33-73% full-suite), and how to report AUDITED shards.
- "When to invoke" now leads with archetype classification; §4 headline + cross-references cite the pytcp results doc.

Doc-only (skill content).
1 parent aee1996 commit 7b74bbf

1 file changed

Lines changed: 162 additions & 1 deletion

.claude/skills/mutation_testing/SKILL.md

@@ -54,6 +54,14 @@ shapes that dominated the residual.
5454
- Re-confirming a package after a round of test additions
5555
(use the fast survivor-only re-scan, §8).
5656

57+
**First, classify the shard's archetype (§6.5).** Thin-unit
58+
shards (fast unit baseline) are high-ROI — run the full scan
59+
and close gaps. Integration-saturated shards (session/fsm,
60+
packet handlers) are audit-only — sample against the *real*
61+
integration baseline, classify, close one demonstrator. Do
62+
not grind an integration-saturated shard against a slow or a
63+
seam-only baseline.
64+
5765
## When NOT to invoke
5866

5967
- As a `make lint` / CI gate. Mutation testing is a
@@ -247,7 +255,11 @@ score badly understates the suite** because a large fraction
247255
of survivors are **equivalent mutants** that *no runtime test
248256
can ever kill*. Report raw AND equivalent-adjusted; the
249257
adjusted number is the honest one. (net_addr: 80.5 % raw but
250-
**95.2 %** once equivalents are excluded.)
258+
**95.2 %** once equivalents are excluded. pytcp Tier-2: ipc
259+
64→68 % raw / **~98 %** adjusted; socket drop-in 39 % raw /
260+
**~85 %** adjusted — the annotation-mutant fraction climbs
261+
even higher on the typed runtime, e.g. socket drop-in had
262+
~431 of 579 survivors in PEP 604 unions.)
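
A minimal sketch of the raw versus equivalent-adjusted arithmetic, assuming that "adjusted" means equivalent mutants are removed from the denominator. The function name and the counts used below are hypothetical and are not taken from the audit.

```python
def mutation_scores(killed: int, survived: int, equivalent: int) -> tuple[float, float]:
    """Return (raw, adjusted) kill rates as fractions in [0, 1].

    Assumption: 'equivalent' counts survivors classified as equivalent
    mutants, so they are removed from the denominator only.
    """
    if killed < 0 or survived < 0 or not 0 <= equivalent <= survived:
        raise ValueError("counts must be non-negative and equivalent <= survived")
    total = killed + survived
    killable = total - equivalent
    if total == 0 or killable == 0:
        raise ValueError("no killable mutants to score")
    raw = killed / total
    adjusted = killed / killable
    return raw, adjusted


# Hypothetical shard: 80 killed, 20 survivors, 15 of them equivalent.
raw, adjusted = mutation_scores(killed=80, survived=20, equivalent=15)
print(f"raw {raw:.1%}, adjusted {adjusted:.1%}")
```

Reporting both numbers side by side keeps the classification auditable: a reader can see how much of the gap between them rests on the equivalent-mutant count.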
251263

252264
Classify survivors with the rule below. The first two classes
253265
are mechanically detectable; the rest require reading the
@@ -328,6 +340,63 @@ mutated line.
328340
non-negative). Killable only by an input the realistic
329341
wire never carries — usually low-value; confirm the
330342
coincidence arithmetically before deferring.
343+
12. **Bounded-domain enum/mode comparisons** (pytcp Tier-2) —
344+
a `==` against one value of a small *closed* set, where
345+
    the other members make `==` ≡ `>=` or `==` ≡ `<=` over the
346+
whole reachable domain. The canonical case is a sysctl
347+
    mode read: `arp_ignore == 8` → `>= 8` is equivalent
348+
because the mode is one of `{0,1,2,8}` and 8 is the max;
349+
    `arp_ignore == 2` → `>= 2` is equivalent inside the
350+
`else` arm where the domain is already narrowed to
351+
`{0,1,2}`. Generalizes class 4 to enum-valued domains.
352+
Killable only by a mode value the validator forbids —
353+
low-value; confirm the domain before deferring.
354+
13. **`__debug__ and log(...)` guard operands** (pytcp) — the
355+
pervasive `__debug__ and log("chan", f"...")` tracing
356+
    idiom. Tests mock `log`, so mutating the `and` → `or`,
357+
    the f-string interpolation (`<<` → `*` inside the
358+
message), or the guard is invisible: the log call's
359+
*effect* is never asserted. Any survivor whose source line
360+
is inside a `__debug__ and log(` argument is equivalent.
361+
Dominant alongside class 1 on the handler shards.
362+
14. **Idempotent re-assignment guards** (pytcp) — a `if x !=
363+
computed_value: x = computed_value` shape where the guard
364+
only gates a (mocked) log + a re-derivation that lands on
365+
the *same* value. Mutating the guard comparison (`!=`
366+
    → `==`, `<<` → `^` in the RHS) changes only whether the
367+
branch is *entered*; the branch body re-assigns the
368+
correctly-computed value either way, so end state is
369+
identical. The value-bearing assignment line (not the
370+
guard) is the killable one — and it is usually already
371+
killed. (ARP/UDP RX window-update; the `snd_wnd != win <<
372+
wsc` guard.)
373+
15. **Defensively-unreachable `case _:` / parser-rejected
374+
branches** (pytcp) — a handler `match` default that bumps
375+
an `op_unknown__drop` counter, where the *parser* (e.g.
376+
TX-strict `from_int(...).is_unknown`) already rejects the
377+
bad value upstream, so the default is dead via any
378+
realistic wire input. The `+= 1` NumberReplacer survives
379+
because no frame reaches it. Distinguish from a real gap
380+
by tracing whether the wire value can reach the line at
381+
all (grep the parser's reject path first).
382+
16. **Keyword-only `*`-separator AST no-ops** (pytcp) —
383+
cosmic-ray reports a `ReplaceBinaryOperator_Mul_Div` on a
384+
`def f(self, *, arg)` line (the `*` is the kw-only marker,
385+
not a binary op). The mutation either no-ops or turns
386+
`*` into `/` (positional-only), and every call site passes
387+
the arg *by keyword* anyway, so behaviour is unchanged.
388+
Killable in principle by `inspect.signature(...).kind is
389+
KEYWORD_ONLY` (see the tcp/state `KeywordOnlySignatures`
390+
batch), but for runtime handlers the kw-only contract is
391+
low-value ceremony — defer unless the signature is a
392+
public API surface.
393+
17. **Never-wrapping modular masks** (pytcp) — `(a - b) &
394+
0xFFFF_FFFF` on a sequence-delta / flight-size / RTT that
395+
never actually wraps in the tested range, so dropping or
396+
    altering the mask (`& 0xFFFF_FFFF` → `// 0xFFFF_FFFF`,
397+
`<<`) yields the same value. Killable only by a 32-bit-
398+
wrap scenario the suite does not drive; usually equivalent
399+
in practice. Class-4 cousin for explicit wrap masks.
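
Three of the classes above (12, 16, 17) can be confirmed mechanically before deferring a survivor. The sketch below is illustrative only: the helper names and the `handler` function are hypothetical, and the mode domain `{0,1,2,8}` is the one stated in class 12.

```python
import inspect

ARP_IGNORE_MODES = (0, 1, 2, 8)  # closed domain quoted in class 12
MASK32 = 0xFFFF_FFFF


def eq_mutant_is_equivalent(domain, pivot):
    """Class 12: True when `x == pivot` and `x >= pivot` agree on every
    reachable value, i.e. the `==` to `>=` mutant cannot be killed."""
    return all((m == pivot) == (m >= pivot) for m in domain)


def masked_delta(a, b):
    """Class 17: sequence delta with the explicit 32-bit wrap mask."""
    return (a - b) & MASK32


def unmasked_delta(a, b):
    """Class 17 mutant: the same delta with the mask dropped."""
    return a - b


def handler(self, *, arg):
    """Hypothetical handler with a keyword-only parameter (class 16)."""
    return arg


def is_keyword_only(func, name):
    """Class 16: pin the kw-only contract through the signature."""
    param = inspect.signature(func).parameters[name]
    return param.kind is inspect.Parameter.KEYWORD_ONLY


# Class 12: equivalent at the maximum of the full domain, and at 2 only
# once the domain has been narrowed to {0, 1, 2}.
print(eq_mutant_is_equivalent(ARP_IGNORE_MODES, 8))
print(eq_mutant_is_equivalent(ARP_IGNORE_MODES, 2))
print(eq_mutant_is_equivalent((0, 1, 2), 2))

# Class 17: identical without a wrap, different once the delta wraps.
print(masked_delta(1000, 900), unmasked_delta(1000, 900))
print(masked_delta(5, 0xFFFF_FFF0), unmasked_delta(5, 0xFFFF_FFF0))

# Class 16: the assertion a signature test would make.
print(is_keyword_only(handler, "arg"))
```

The wrap case in the second class-17 call is the kind of input a killing test would need; if the suite never drives one, the mask mutant survives.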
331400

332401
The remainder are **genuine gaps**. Triage each by reading
333402
the mutated line; propose the test that would catch it.
@@ -527,6 +596,90 @@ and prints each result; it never strands. Always
527596

528597
---
529598

599+
## 6.5 Shard archetypes & baseline economics (pytcp Tier-2)
600+
601+
The pytcp audit split into two archetypes with **opposite
602+
ROI**, and recognising which one you are in *before* spending
603+
compute is the single biggest Tier-2 lesson. The
604+
discriminator is the **test surface**, not the source.
605+
606+
### Archetype A — thin-unit shards (high ROI, close gaps)
607+
608+
`lib`, `tcp-math`, `tcp/state`, `ipc` value codecs, the
609+
`socket` drop-in. Logic reachable by a **fast unit test**
610+
(<1 s baseline) over a constructed object. Survivors are
611+
cheap to find and cheap to close; this is where mutation
612+
testing pays for itself (ipc closed 69 gaps, socket drop-in
613+
69). Run the full-scan workflow (§3) and close real gaps.
614+
615+
**Mock-construct the wrapper to get a fast baseline.** A
616+
daemon-/IO-backed wrapper whose logic runs *before* the IO
617+
(the `socket` drop-in: type-guards, `makefile` mode parsing,
618+
factory dispatch, blocking-mode state) is unit-testable by
619+
constructing it over a `create_autospec(Collaborator,
620+
spec_set=True)` and patching the lazy accessor
621+
(`patch.object(mod, "_get_default_stack", ...)`). This turned
622+
a 6.7 s daemon-integration baseline into a 0.4 s unit
623+
baseline and let 69 gaps close fast. Always check whether the
624+
"needs a daemon/socket" surface actually needs one for the
625+
*branch under test*.
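
A minimal sketch of the mock-construct technique, using only `unittest.mock`. Everything named here (`Stack`, `DropInSocket`, `build_with_mock`, and the `mod` namespace standing in for the wrapper's module) is hypothetical; only `create_autospec` and `patch.object` are the real standard-library calls. It passes `instance=True` so the mock behaves like a collaborator instance rather than the class.

```python
import types
from unittest.mock import create_autospec, patch


class Stack:
    """Hypothetical IO-backed collaborator (the daemon side)."""

    def open_channel(self, kind: str) -> int:
        raise RuntimeError("would need a running daemon")


def _real_get_default_stack() -> Stack:
    raise RuntimeError("would start the daemon")


# Stand-in for the wrapper's module; in a real code base this would be the
# imported module object that owns the lazy accessor.
mod = types.SimpleNamespace(_get_default_stack=_real_get_default_stack)


class DropInSocket:
    """Hypothetical wrapper whose argument checks run before any IO."""

    def __init__(self, kind: str) -> None:
        if not isinstance(kind, str):
            raise TypeError("kind must be a str")
        if kind not in ("stream", "dgram"):
            raise ValueError(f"unsupported kind: {kind!r}")
        self._fd = mod._get_default_stack().open_channel(kind)


def build_with_mock(kind):
    """Construct the wrapper over an autospec'd collaborator, no daemon."""
    stack = create_autospec(Stack, spec_set=True, instance=True)
    stack.open_channel.return_value = 7
    with patch.object(mod, "_get_default_stack", return_value=stack):
        sock = DropInSocket(kind)
    return sock, stack


sock, stack = build_with_mock("stream")
stack.open_channel.assert_called_once_with("stream")
```

Because the type and value guards run before the accessor is called, a mutant in those guards is killed by a sub-second unit test, and the autospec keeps the mock honest about the collaborator's real method signatures.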
626+
627+
### Archetype B — integration-saturated shards (audit, don't grind)
628+
629+
`tcp/session`, `tcp/fsm`, `runtime/packet_handler`. Logic
630+
driven only by the integration suite (FSM state, wire RX→TX,
631+
stat counters). Two sub-cases:
632+
633+
- **Baseline-bound (session/fsm)** — the *only* test surface
634+
that exercises the logic is the slow whole-protocol suite
635+
(TCP: 590 tests / 33 s). A per-collaborator *seam* test
636+
  (3 tests / 792 LOC) is **NOT a valid mutation baseline**:
637+
it under-reports ~15× (ack: **4.7 % seam vs 33–73 %
638+
full-suite** on the same survivors). Exhaustive closure is
639+
~46 h of serial compute and mostly re-confirms saturation.
640+
**Audit, don't grind:** sample survivors against the *full*
641+
suite to get the true rate, classify the residue, close
642+
**one kill-proven demonstrator** (recover-marker decay),
643+
and document the recommendation (opportunistic backlog).
644+
- **Per-protocol-auditable (packet_handler)** — each protocol
645+
*does* have a green-in-isolation, fast suite (arp/udp
646+
~3.6 s), so a handler file mutation-tests cleanly against
647+
its own protocol's suite. Worked examples (arp, udp) both
648+
landed at **~76 % adjusted**: the suites assert exact
649+
`packet_stats` counters on every branch, so branch/counter
650+
mutations die. Residue is annotation (class 1) + log guards
651+
(class 13) + bounded-mode (class 12) + low-density genuine
652+
gaps on **under-tested branches the suite's topology never
653+
reaches** (a sysctl mode on a single-subnet topology, a
654+
multi-socket fan-out, an error-emit selection). Close those
655+
*opportunistically* with a scenario test; a full 20-handler
656+
sweep is auditable but low-yield.
657+
658+
### The baseline-choice rule
659+
660+
**Pick the smallest test surface that genuinely exercises the
661+
mutated logic.** A seam/parity test that only pins the
662+
*refactor boundary* (e.g. "the collaborator is wired in")
663+
exercises ~5 % of the code and yields a meaningless 5 % score.
664+
Before trusting a per-mutant number, sanity-check the baseline:
665+
grep the candidate test for assertions on the *behaviour* the
666+
mutated lines produce. If it only asserts wiring/identity,
667+
escalate to the protocol's full integration suite — and if
668+
*that* is the only adequate surface and it is slow, switch from
669+
"close every gap" to "sample + classify + one demonstrator".
670+
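
The "grep the candidate test" check can be approximated with a small AST scan. This is a rough sketch under stated assumptions: `behaviour_assertions` and the two sample tests are hypothetical, and matching on attribute or variable names is only a heuristic for "asserts on behaviour".

```python
import ast


def behaviour_assertions(test_source: str, behaviour_names: set[str]) -> list[str]:
    """Return the source of every assertion that mentions one of the given
    behaviour names (attributes or variables the mutated lines produce)."""
    hits = []
    for node in ast.walk(ast.parse(test_source)):
        is_assert_stmt = isinstance(node, ast.Assert)
        is_assert_call = (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr.startswith("assert")
        )
        if not (is_assert_stmt or is_assert_call):
            continue
        mentioned = set()
        for sub in ast.walk(node):
            if isinstance(sub, ast.Name):
                mentioned.add(sub.id)
            elif isinstance(sub, ast.Attribute):
                mentioned.add(sub.attr)
        if mentioned & behaviour_names:
            hits.append(ast.get_source_segment(test_source, node))
    return hits


# Hypothetical seam test: asserts wiring only.
SEAM_TEST = '''
def test_wired(session):
    assert session.ack_handler is not None
'''

# Hypothetical behaviour test: asserts the value the mutated line produces.
BEHAVIOUR_TEST = '''
def test_window(session):
    session.on_ack(win=4)
    assert session.snd_wnd == 4
'''

print(behaviour_assertions(SEAM_TEST, {"snd_wnd"}))
print(behaviour_assertions(BEHAVIOUR_TEST, {"snd_wnd"}))
```

An empty result for the candidate baseline suggests it pins wiring rather than behaviour, which is the cue to escalate to the protocol's full suite.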
671+
### Reporting integration-saturated shards
672+
673+
Put them in the per-shard table with the real (sampled)
674+
adjusted rate, a footnote stating the baseline caveat
675+
(seam-under-reports / sampled-not-exhaustive), and a status of
676+
`AUDITED` rather than `DONE`. The honest deliverable is the
677+
*finding* (this code is integration-saturated; here is the
678+
true rate; here is one demonstrator; the rest is opportunistic)
679+
— not a forced exhaustive grind.
680+
681+
---
682+
530683
## 7. Deliverable — the results document
531684

532685
Write `docs/refactor/<pkg>_mutation_audit_results.md`:
@@ -622,6 +775,14 @@ characterised.
622775
`…_results.md` — the at-scale precedent (21 sharded runs,
623776
dependency-scoped test-commands, whole-file gaps, the
624777
per-shard score table + deep-TLV follow-up seam).
778+
- `docs/refactor/pytcp_mutation_audit_results.md` — the
779+
Tier-1/Tier-2 precedent for the §6.5 archetype split:
780+
thin-unit shards (lib / tcp-math / tcp-state / ipc / socket
781+
drop-in, gaps closed) vs integration-saturated shards
782+
(session/fsm baseline-bound + audited; packet_handler
783+
per-protocol-auditable at ~76 % adjusted). Worked examples
784+
of the mock-construct fast baseline and the seam-under-
785+
reports caveat.
625786
- `.claude/rules/unit_testing.md` — test authoring (the
626787
corrections land as unit tests; §7.2 docstring audit, §6a
627788
mocking, tight assertions).
