Commit 0ba0c53

ccie18643 and claude committed
skill(mutation_testing): fold in net_proto at-scale lessons
The skill was written from the net_addr audit (a flat value-type library). The net_proto audit (20 protocols + lib, 27,007 mutants, 21 sharded runs) surfaced lessons that refine or extend it:

- Rail #2 refinement: for a package of INDEPENDENT modules the correct test scope is the module's own tests + its shared-dependency tests (dependency-scoped sharding), NOT the whole-package suite; that is not the dangerous under-scoping the rail warns about. The exception is shared foundation code (lib/), which keeps full-suite scope. Plus the cross-module-constant blind spot (a constant defined in A, consumed by B, survives A's shard but is killed by B's tests).
- §4 two new equivalent classes: result-preserving optimizations (the inet_cksum 8-byte chunking) and base-coincidence arithmetic (ARP 0x0001, routing RH0=0, POINTER_BASE==SLOT_LEN, IGMP code-128).
- §5 the self-referential-constant kill-proof trap (assert the literal, not the imported constant the mutation changes; hit twice on MIN_MSS and hop=64) and the Python-harness recommendation over shell loops (which stranded a mutation via a restore race).
- §6 the whole-file/whole-thing omission as the FIRST-check, highest-value pattern (three found: FastOpen, Packet Too Big, MLDv2 Query); the dispatch-guaranteed wrong-type assert untested across ~30 options; and the one-sided length/version boundary sub-pattern.
- Intro + §11 add net_proto as the at-scale worked example.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent c137b3f commit 0ba0c53

1 file changed

Lines changed: 146 additions & 19 deletions

.claude/skills/mutation_testing/SKILL.md

@@ -14,12 +14,23 @@ test notices. A break no test catches — a **surviving
1414
mutant** — is a coverage blind spot that line coverage
1515
cannot see.
1616

17-
The canonical worked example is the net_addr audit
18-
(2026-06-08): **78.2 % → 80.5 % raw**, **92.4 % → 95.2 %
19-
equivalent-adjusted**, 14 test-only commits. The runbook
20-
and results live at
17+
The first worked example is the net_addr audit (2026-06-08,
18+
a flat value-type library): **78.2 % → 80.5 % raw**,
19+
**92.4 % → 95.2 % equivalent-adjusted**, 14 test-only
20+
commits. Runbook + results:
2121
`docs/refactor/net_addr_mutation_audit.md` and
22-
`docs/refactor/net_addr_mutation_audit_results.md`.
22+
`…_results.md`.
23+
24+
The at-scale example is the net_proto audit (2026-06-09, 20
25+
protocols + shared `lib`, **27,007 mutants across 21
26+
sharded runs**): **70 genuine gaps closed test-only**,
27+
including **three whole-codec omissions** (an option, two
28+
messages, with no test file at all). It is the precedent for
29+
the dependency-scoped sharding (rail #2), the whole-file-gap
30+
first-check (§6), and the result-preserving / base-coincidence
31+
equivalent classes (§4). Results:
32+
`docs/refactor/net_proto_mutation_audit_results.md` (the
33+
plan/sharding doc is `…_mutation_audit.md`).
2334

2435
## When to invoke
2536

@@ -67,14 +78,38 @@ result or a near-miss during the net_addr audit.
6778
dies fast as `MemoryError` (counted *killed*) instead.
6879
Monitor `free -h` during the run regardless.
6980

70-
2. **Test-command = the FULL package unit suite, never a
71-
narrower scope.** A narrow test-command reports
72-
**false-positive survivors** — mutants the broader suite
73-
would kill. (The trial run scoped to one test file and
74-
"found" a survivor that the sibling SACK tests already
75-
killed.) Running the whole package suite per mutant means
76-
every survivor is a *genuine* gap no test anywhere
77-
catches.
81+
2. **Test-command = the FULL consumer set of the mutated
82+
code, never UNDER that.** A test-command that omits any
83+
test which exercises the mutated module reports
84+
**false-positive survivors** — mutants a left-out test
85+
would kill. (The net_addr trial run scoped to one test
86+
file and "found" a survivor the sibling SACK tests already
87+
killed.) The danger is *under*-scoping below the real
88+
consumer set — NOT scoping to the exact consumers.
89+
**Refinement from the net_proto audit (4.5× scale):** for
90+
a package of *independent* modules (e.g. net_proto's 20
91+
protocols — a udp mutant can never be killed by a tcp
92+
test), the **correct** test scope is that module's own
93+
tests **plus the shared-dependency tests it transitively
94+
needs** (`tests/unit/lib`), and sharding per module that
95+
way is right — *not* the dangerous narrowing this rail
96+
warns about. It also runs ~3× faster and lets you
97+
prioritize / stop early. **The one exception is shared
98+
foundation code** (net_proto's `lib/`: `inet_cksum`,
99+
`int_checks`, `proto_*`) — that is consumed by every
100+
module, so its shard MUST run the **full** package suite.
101+
Rule of thumb: scope to *exactly the set of tests that can
102+
kill a mutant in this module*, computed from the
103+
dependency graph — full-suite when in doubt, dependency-
104+
scoped when the independence is provable.
105+
**Cross-module-constant blind spot:** a constant defined
106+
in module A but *consumed* by module B (net_proto's
107+
`IP6__MIN_MTU`, defined in ip6, used by icmp6 error
108+
messages) survives A's shard but is killed by B's tests.
109+
When a survivor is a bare constant with no in-module
110+
reader, check whether another module's suite kills it
111+
before calling it a gap — it is *cross-shard-covered*, not
112+
a real gap.
78113

79114
3. **Clear `__pycache__` before the run AND between EVERY
80115
manual mutate→revert.** A stale `.pyc` makes a later run
@@ -258,6 +293,30 @@ mutated line.
258293
are killable — and those are killable **directly**, by
259294
calling the helper with crafted inputs, NOT by coaxing
260295
the generator into producing them (see §6 lesson).
296+
10. **Result-preserving optimizations** (net_proto) — an
297+
internal fast path whose output is identical regardless
298+
of how it chunks. `inet_cksum`'s 8-byte loop:
299+
`(remainder := buffer_len - offset) >= 8` and `q_count =
300+
remainder >> 3` — mutating the chunk threshold (`>= 8` →
301+
`>= 9`) or the chunk count (`>> 3` → `>> 4`) only shifts
302+
bytes between the fast path and the remainder loop; the
303+
one's-complement sum is associative, so the checksum is
304+
byte-identical. Verify by confirming the mutant survives
305+
*every* consumer's round-trip test, then it is equivalent.
306+
11. **Base-coincidence arithmetic** (net_proto) — a constant
307+
whose specific value makes an operator mutation coincide
308+
with the original on the entire *reachable* domain.
309+
Examples: ARP `hrtype == 0x0001` has byte 0 = 0, so
310+
reading `frame[1:2]` classifies identically to
311+
`frame[0:2]`; routing-header `routing_type == RH0` where
312+
`RH0 = 0` makes `<= 0` ≡ `== 0` on the byte domain; a
313+
pointer check where `POINTER_BASE == SLOT_LEN == 4` makes
314+
`(p - 4) % 4` ≡ `(p + 4) % 4`; IGMP max-resp-`code == 128`
315+
where the linear value (128) equals the float decode
316+
`(0|0x10) << 3` (128). Generalizes class 4 (max-value /
317+
non-negative). Killable only by an input the realistic
318+
wire never carries — usually low-value; confirm the
319+
coincidence arithmetically before deferring.
261320

262321
The remainder are **genuine gaps**. Triage each by reading
263322
the mutated line; propose the test that would catch it.
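Equivalent classes 10 and 11 can be confirmed arithmetically before a survivor is deferred. The sketch below is a minimal illustration under stated assumptions, not the project's `inet_cksum`: `ones_complement_sum` is a hypothetical stand-in whose fast path consumes 8 bytes while at least `threshold` bytes remain, which models the `>= 8` → `>= 9` mutation only. The coincidence checks use the values quoted in class 11.

```python
def fold16(total: int) -> int:
    """Fold carries back into 16 bits (one's-complement addition)."""
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def ones_complement_sum(data: bytes, threshold: int = 8) -> int:
    """Hypothetical stand-in for a chunked checksum loop.

    The fast path takes 8 bytes at a time while at least `threshold`
    bytes remain; the remainder loop takes 2 bytes at a time.
    """
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input to a whole 16-bit word
    total = 0
    offset = 0
    while len(data) - offset >= threshold:  # the mutated comparison
        for i in range(offset, offset + 8, 2):
            total += int.from_bytes(data[i:i + 2], "big")
        offset += 8
    while offset < len(data):  # remainder loop picks up whatever is left
        total += int.from_bytes(data[offset:offset + 2], "big")
        offset += 2
    return fold16(total)


def coincidences_hold() -> bool:
    """Check the class-11 coincidences over the whole byte domain."""
    arp = b"\x00\x01"  # hrtype 0x0001: byte 0 is zero
    return (
        int.from_bytes(arp[1:2], "big") == int.from_bytes(arp[0:2], "big")
        and all((b <= 0) == (b == 0) for b in range(256))
        and all((p - 4) % 4 == (p + 4) % 4 for p in range(256))
        and ((0 | 0x10) << 3) == 128
    )
```

Because addition is associative, the total does not depend on which loop consumed a given word. That is why a threshold of 8 and a threshold of 9 agree on every buffer length.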
@@ -287,18 +346,73 @@ input or trusting a single test file is how false claims ship.
287346
5. **`git checkout` the source + clear `__pycache__` again.**
288347
6. Verify `git diff` on the package is empty before moving on.
289348

290-
For a batch of kill-proofs, wrap each in a function that
291-
clears pycache between iterations — a tight `cp/sed/run`
292-
loop within the same filesystem-mtime second WILL reuse a
293-
stale `.pyc` and lie to you.
349+
**The self-referential-constant trap (net_proto, hit twice).**
350+
When the gap is a `NumberReplacer` on a *constant definition*
351+
(`TCP__MIN_MSS = 536` → `537`, `IP6__DEFAULT_HOP_LIMIT = 64` →
352+
`65`), the killing test MUST assert against the **literal**
353+
value (`assert x == 536`), NOT against the imported constant
354+
(`assert x == TCP__MIN_MSS`). Asserting against the constant
355+
makes the expectation move *with* the mutation — both sides
356+
change to 537, the assert still passes, the mutant survives.
357+
This passed my first kill-proof for MIN_MSS and hop=64 and
358+
looked closed; only re-running the survivor scan exposed it.
359+
Assert the literal; optionally add a second
360+
`assert THE_CONSTANT == 536` line to pin the constant by name
361+
too.
362+
363+
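The trap can be reproduced in a few lines. This is a minimal sketch, assuming a hypothetical `clamp_mss` helper that reads the constant; `make_module` only simulates the original and mutated module so that both can be compared side by side.

```python
from types import SimpleNamespace


def make_module(min_mss: int) -> SimpleNamespace:
    """Simulate a module that defines TCP__MIN_MSS and one reader of it."""
    mod = SimpleNamespace(TCP__MIN_MSS=min_mss)
    # Hypothetical consumer: never returns less than the minimum MSS.
    mod.clamp_mss = lambda mss: max(mss, mod.TCP__MIN_MSS)
    return mod


def weak_test(mod: SimpleNamespace) -> bool:
    # The expectation is read from the mutated module, so it moves
    # together with the mutation and the comparison still holds.
    return mod.clamp_mss(0) == mod.TCP__MIN_MSS


def strong_test(mod: SimpleNamespace) -> bool:
    # The expectation is the literal, so it stays fixed when the
    # constant definition is mutated.
    return mod.clamp_mss(0) == 536


original = make_module(536)
mutant = make_module(537)  # NumberReplacer: 536 -> 537

assert weak_test(original) and weak_test(mutant)        # mutant survives
assert strong_test(original) and not strong_test(mutant)  # mutant killed
```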
**Use a Python harness for batches, not a shell loop**
364+
(net_proto). A `cp/sed/run` shell loop (especially with
365+
`r=$(kp ...)` command substitution or a trailing `| sort`)
366+
can have its **restore step (`cp bak file`) race or get its
367+
stdout eaten**, leaving a **stranded mutation on disk** (it
368+
happened: 10 dhcp4 option files left with `<=` applied, then
369+
the next iteration found nothing to substitute). A small
370+
Python driver (`subprocess.run` per mutant, `open(f,"w")`
371+
restore, `shutil.rmtree` pycache between) is deterministic
372+
and prints each result; it never strands. Always
373+
`git diff --stat` the package after a batch regardless
374+
(rail #6).
294375

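A driver of the kind described above might look like the following. It is a sketch under assumptions, not the harness used in the audit: `kill_proof` and `clear_pycache` are hypothetical names, the mutation is a single textual replacement, and a non-zero exit code from the test command is taken to mean "killed". The restore sits in a `finally` block, so it runs even if the test command raises.

```python
import shutil
import subprocess
import sys
from pathlib import Path


def clear_pycache(root: Path) -> None:
    """Remove every __pycache__ directory under root."""
    for cache in list(root.rglob("__pycache__")):
        shutil.rmtree(cache, ignore_errors=True)


def kill_proof(source: Path, old: str, new: str,
               test_cmd: list[str], root: Path) -> bool:
    """Apply one mutation, run the tests, always restore the source.

    Returns True when the test command fails, i.e. the mutant is killed.
    """
    original = source.read_text()
    if old not in original:
        # Refuse to run rather than report a result for an unmutated file.
        raise ValueError(f"{old!r} not found in {source}")
    try:
        source.write_text(original.replace(old, new, 1))
        clear_pycache(root)  # no stale .pyc may mask the mutation
        result = subprocess.run(test_cmd, cwd=root, capture_output=True)
        return result.returncode != 0
    finally:
        source.write_text(original)  # restore cannot be skipped
        clear_pycache(root)
```

A batch is then a plain loop over `(source, old, new)` tuples that prints each result. `git diff --stat` after the batch remains the final check.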
295376
---
296377

297378
## 6. Common real-gap patterns (where survivors actually cluster)
298379

380+
- **Whole-file / whole-thing omissions — CHECK THIS FIRST**
381+
(net_proto). The highest-value finds are not arithmetic at
382+
all: an entire source file (an option, a message, a codec)
383+
with **no dedicated test**, where *every* mutant in it
384+
survives. Line coverage shows it "covered" because the
385+
dispatch *imports* it, but its logic is never asserted. The
386+
net_proto audit found three — TCP FastOpen option, ICMPv6
387+
Packet Too Big, MLDv2 Query (185 survivors). **Before
388+
triaging individual operators, bucket survivors by source
389+
file and compare the count against a `find … -name
390+
'test__*<file>*'`** — a file with ~80–185 survivors and no
391+
test file is a whole-thing gap. Close it with the full
392+
per-file test (the §8 test-matrix in `unit_testing.md`),
393+
not a one-off; it converts the most mutants per unit effort.
394+
- **The dispatch-guaranteed assert, untested everywhere**
395+
(net_proto). The `buffer[0] == int(Type)` / `from_bytes(...)
396+
== int(Type)` kind-byte assert at the top of every option's
397+
`from_buffer` was untested across ~30 options in 4 protocols
398+
(ip4 / dhcp4 / dhcp6 / accecn). The container dispatch
399+
*guarantees* the byte, so the assert never fires in normal
400+
flow, but `== Type` → `<=` / `>=` survives with no
401+
wrong-type test. One cheap shared batch: a wrong-type-below
402+
(e.g. `0x00`) and wrong-type-above (`0xff` / `0xffff`)
403+
`from_buffer` over a valid frame, expecting `AssertionError`.
404+
Detect the gap quickly by scripting the `<=` mutation across
405+
every option and re-running just that option's suite.
299406
- **Degenerate / weak fixtures.** The single most common
300-
real gap. A test asserts the right output but for an input
301-
where many mutations coincide:
407+
*arithmetic* gap. A test asserts the right output but for an
408+
input where many mutations coincide:
409+
- **empty-data / zero-value operands** — an `X + len(data)`
410+
`__len__` asserted only with `data=b""` (so `X+0 == X-0`,
411+
the `+` → `-` mutant survives); a timestamp slice asserted only
412+
with a top-byte-zero value (so `[+4:+8]` → `[+5:+8]` reads
413+
the same int); a header flag round-trip with only `rd` set
414+
(every other bit position unexercised). Use non-empty
415+
data, a top-byte-set value, **all flags distinct**.
302416
- all-zero-byte operands (MAC `02:00:00:...` hides EUI-64
303417
field-placement bugs) → use a non-degenerate operand
304418
(`aa:bb:cc:dd:ee:ff`).
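The whole-file first-check from the top of §6 amounts to a count per source file joined against the test-file names. The sketch below assumes the survivors are already available as a list of source paths and the test files as a list of names; `whole_file_gaps`, the example paths, and the default threshold of 80 (taken from the "~80–185 survivors" range) are illustrative, not part of any tool.

```python
from collections import Counter
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def whole_file_gaps(survivor_paths, test_file_names, threshold=80):
    """Return (source_path, survivor_count) for files that have many
    survivors and no test file matching test__*<stem>*."""
    counts = Counter(survivor_paths)
    gaps = []
    for path, count in counts.most_common():
        stem = PurePosixPath(path).stem
        has_test = any(
            fnmatchcase(name, f"test__*{stem}*") for name in test_file_names
        )
        if count >= threshold and not has_test:
            gaps.append((path, count))
    return gaps


# Hypothetical survivor list: one entry per surviving mutant.
survivors = (
    ["tcp/options/fastopen.py"] * 90
    + ["tcp/options/mss.py"] * 90
    + ["udp/header.py"] * 3
)
tests = ["test__tcp__option__mss.py"]
print(whole_file_gaps(survivors, tests))
```

Files that pass this filter are candidates only. A file with a differently named test still needs a manual look before it is called a whole-thing gap.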
@@ -321,6 +435,15 @@ stale `.pyc` and lie to you.
321435
landing *exactly* on `0` or `MAX` (both valid). Add the
322436
exact-endpoint cases to pin the `<=` and the `- 1` max
323437
constant.
438+
- **One-sided length boundaries** (net_proto). A fixed-length
439+
check `buffer[1] != LEN` whose only wrong-length test uses
440+
an *under*-length value — `<` and `!=` agree below `LEN`, so
441+
the `!=` → `<` mutant survives; it dies only on an
442+
*over*-length frame (`LEN+1`). Likewise a version/type check
443+
tested only *below* the value (DNS `ver=5`, ARP `hrtype=0`,
444+
ip6 `ver=5`): `!=` → `<` survives until you add an *above*
445+
case (`ver=7`). Always test wrong-value on **both** sides of
446+
a `!=`.
324447
- **One-sided predicate tests.** A prefix predicate
325448
(`& mask == prefix`) tested with a None-case on only one
326449
side of the prefix. The `==` → `<=` mutant needs a *below*-
@@ -433,6 +556,10 @@ characterised.
433556
- `docs/refactor/net_addr_mutation_audit_results.md` — the
434557
exemplar results document (per-module table, equivalent
435558
ledger, kill-proven corrections).
559+
- `docs/refactor/net_proto_mutation_audit.md` /
560+
`…_results.md` — the at-scale precedent (21 sharded runs,
561+
dependency-scoped test-commands, whole-file gaps, the
562+
per-shard score table + deep-TLV follow-up seam).
436563
- `.claude/rules/unit_testing.md` — test authoring (the
437564
corrections land as unit tests; §7.2 docstring audit, §6a
438565
mocking, tight assertions).

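The one-sided length boundary pattern from §6 comes down to a small truth table. The following is a minimal sketch with a hypothetical fixed length `LEN = 4`; `rejects_original` and `rejects_mutant` model the `!=` check and its `<` mutant as plain predicates rather than real parser code.

```python
LEN = 4  # hypothetical fixed option length


def rejects_original(length_byte: int) -> bool:
    """Original check: any length other than LEN is rejected."""
    return length_byte != LEN


def rejects_mutant(length_byte: int) -> bool:
    """`!=` mutated to `<`: only under-length values are rejected."""
    return length_byte < LEN


def distinguishes(length_byte: int) -> bool:
    """True when a test using this input would kill the mutant."""
    return rejects_original(length_byte) != rejects_mutant(length_byte)


assert not distinguishes(LEN - 1)  # under-length: both reject, mutant survives
assert not distinguishes(LEN)      # valid length: both accept
assert distinguishes(LEN + 1)      # over-length: only the original rejects
```

The same table applies to version and type checks: a wrong value below the expected one cannot separate `!=` from `<`, and a wrong value above it can.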