Module 11: Eval Harness for Security Tools

Type 13 · Eval Harness: build a labelled test corpus, a precision/recall scorecard, and a CI regression gate for the log parser/classifier you wrote earlier in the track, so its quality is measured and a future edit cannot silently break it.

Last reviewed: 2026-06

Python for Security — a tool that works on the log you tested it against is an anecdote; a tool with a scorecard is software.

Difficulty: Intermediate · Estimated time: ~3.5–4.5 hrs (study + lab) · Type: Eval Harness · Prerequisites: 02 — Files, Regex & Log Parsing, 10 — Packaging, Testing & Owning AI Code

In 60 seconds

A unit test asks "does the code do what I wrote?"; an eval asks "does the tool catch the attacks, and how often does it cry wolf?" — and only the second has a number. This module turns the Module-02 log parser into a measured tool: a labelled, held-out corpus, a precision/recall scorecard instead of a vibe, and a CI gate that fails the build when a change degrades recall. The proof the gate works is a planted regression — you weaken the rule on purpose and watch CI turn red. A gate you've only ever seen pass isn't a gate.

Why this matters

You built a log parser back in Module 02 — it pulled failed-login IPs out of an SSH auth log and flagged the brute-force offenders. It worked on the sample log. Module 10 taught you to pin its behaviour with pytest. But a unit test answers "does this function do what I coded?" — it does not answer the question that actually matters for a detection tool: "does it catch the attacks, and how often does it cry wolf?" Those are different questions, and the second one has a number. A regex that flags Failed password will quietly miss the attacker who pivoted to valid-credential spraying, and will quietly fire on the cron job that mistypes its own password twice a night — and your unit tests, all green, will tell you nothing about either. The day a teammate "improves" the regex and silently drops recall from 0.95 to 0.60, the demo still looks fine and nothing turns red. This module is the discipline that makes a security tool trustworthy: a labelled corpus of real and malformed log lines, a precision/recall scorecard instead of a vibe, and a CI gate that fails the build the day a change degrades it. It is the same skill the AI-ops track applies to models — applied here to the deterministic tools this whole track produces.

Objective

Build an eval harness for a security tool you already wrote: assemble a labelled, held-out corpus of log lines (true detections + benign near-misses + malformed input), choose and justify a metric, score the tool into a scorecard, find the precision/recall knee deliberately, and wire a CI regression gate that fails the build when a planted change degrades the score.

The core idea

Your log parser flags brute-force attempts. Is it any good? Prove it. Before reading on, write down how you would convince a skeptical SOC lead — with evidence, not adjectives — that your Module-02 parser catches the attacks that matter and doesn't drown them in false alarms. If your honest answer is "it worked on the sample log," you've just named the trap.

A detection tool is not deterministic in the way that matters. The code is deterministic, yes — but the space of inputs it will face is not, and that is the thing you cannot eyeball. You tested it on one log; production hands it a thousand variants you never saw. The move that makes the tool trustworthy is not a cleverer regex — it is measurement against a corpus the tool was never tuned on, reported as a number, gated in CI. That is the entire module, and it is the construct the rest of this build-track has been doing by accident ("verify on positive and negative cases") without ever naming.

The mental model

The code is deterministic; the space of inputs it will face is not — and that gap is the thing you cannot eyeball. The move that makes a detection tool trustworthy isn't a cleverer regex, it's a number: score it against data it was never tuned on, report precision/recall, and gate that number in CI.

A labelled corpus is the spec your unit tests aren't. A pytest assertion says "this input yields this output." A corpus is a graded exam: dozens of log lines, each tagged attack or benign, including the cases that break naive tools — the benign cron double-failure, the Unicode-mangled line, the truncated entry, the slow-and-low spray that never trips a per-minute threshold. You run the tool over the whole corpus and compare its verdicts to the answer key. The corpus is held out from whatever sample you tuned the regex against — score on the data you tuned on and the number is inflated by the same memorisation that makes the demo lie. (This is the machine-learning train/dev/test split, and it transfers intact to a rule-based tool: tune the regex against the dev log, grade it against a test corpus it has never seen.)
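A minimal sketch of what such a corpus and its held-out wall might look like in code. Everything here is illustrative: the `LabelledLine` type, the `note` field, the sample log lines, and the helper names are assumptions, not the lab's actual layout.

```python
from dataclasses import dataclass

LABELS = {"attack", "benign"}


@dataclass(frozen=True)
class LabelledLine:
    text: str       # raw log line, as it would appear in the log
    label: str      # ground truth from the answer key: "attack" or "benign"
    note: str = ""  # why this line is in the corpus (hypothetical field)


# Dev sample: the lines the regex was tuned against (made-up examples).
DEV_SAMPLE = [
    LabelledLine("Failed password for root from 203.0.113.7 port 22 ssh2", "attack"),
    LabelledLine("Accepted publickey for deploy from 198.51.100.4 port 22 ssh2", "benign"),
]

# Held-out test corpus: graded on, never tuned on.
TEST_CORPUS = [
    LabelledLine("Failed password for invalid user admin from 203.0.113.9 port 22 ssh2",
                 "attack", "classic brute force"),
    LabelledLine("Failed password for backup from 10.0.0.5 port 22 ssh2",
                 "benign", "cron job mistyping its own password"),
    LabelledLine("Failed passw", "benign", "truncated entry"),
]


def validate(corpus):
    """Reject unknown labels so a typo cannot silently skew the score."""
    bad = [item.label for item in corpus if item.label not in LABELS]
    if bad:
        raise ValueError(f"unknown labels: {bad}")


def leaked_lines(dev, test):
    """Return raw lines present in both sets; an honest split returns an empty set."""
    return {item.text for item in dev} & {item.text for item in test}


validate(TEST_CORPUS)
assert not leaked_lines(DEV_SAMPLE, TEST_CORPUS), "test corpus overlaps the dev sample"
```

An exact-match overlap check only catches the crudest leak. It will not notice a test line that differs from a dev line by one IP address, so the held-out discipline still rests on how the corpus was collected.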

The gotcha

Your corpus is imbalanced: most log lines are benign. If 95% of the lines are benign, a tool that flags nothing scores 95% "accuracy" while catching zero attacks. Accuracy lies on skewed data; gate on recall (a missed intrusion is a breach) and watch precision as its cost. And never grade on the lines you tuned on: score on held-out data, or the number is inflated by the same memorisation that makes the demo lie.
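The arithmetic behind that gotcha is short enough to run. This is a toy illustration with an assumed 95/5 split, not data from the lab corpus:

```python
# Hypothetical corpus: 95 benign lines and 5 attacks (1 = attack, 0 = benign).
truth = [0] * 95 + [1] * 5

# A "detector" that never flags anything.
verdicts = [0] * len(truth)

# Accuracy: fraction of lines where the verdict matches the answer key.
correct = sum(t == v for t, v in zip(truth, verdicts))
accuracy = correct / len(truth)

# Recall: fraction of the real attacks that were flagged.
caught = sum(1 for t, v in zip(truth, verdicts) if t == 1 and v == 1)
recall = caught / sum(truth)

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
# prints: accuracy=0.95 recall=0.00
```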

Metric choice is a judgment, and accuracy is usually the wrong one. Your corpus is imbalanced, since most log lines are benign, so a tool that flags nothing can score 95% or more on "accuracy" while catching zero attacks. The vocabulary you need is the confusion matrix (true/false positives and negatives) and the two ratios built from it: precision, TP / (TP + FP) (of the lines you flagged, how many were real attacks?), and recall, TP / (TP + FN) (of the real attacks, how many did you catch?). For a detection tool the load-bearing number is usually recall: a missed intrusion costs a breach, while a false positive costs an analyst a few minutes. But recall bought at the price of a flooded alert queue is its own failure, which is why you watch precision (and the false-positive rate) as the cost. The eval is what lets you find the knee of that tradeoff on purpose instead of by feel: tighten the rule and recall drops; loosen it and precision drops; the scorecard shows you exactly where.
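As a rough sketch of how the confusion matrix and the knee fall out of plain counting, the snippet below sweeps a failed-login threshold over a small held-out set. The `SOURCES` data, the per-source threshold rule, and the choice to score an empty denominator as 0.0 are all assumptions made for illustration; the Module-02 parser may decide differently.

```python
def confusion(truth, verdicts):
    """Count TP, FP, FN, TN for two parallel lists of booleans."""
    pairs = list(zip(truth, verdicts))
    tp = sum(1 for t, v in pairs if t and v)
    fp = sum(1 for t, v in pairs if not t and v)
    fn = sum(1 for t, v in pairs if t and not v)
    tn = sum(1 for t, v in pairs if not t and not v)
    return tp, fp, fn, tn


def precision_recall(tp, fp, fn):
    # Assumed convention: an empty denominator scores 0.0 instead of raising.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical held-out data: (failed logins from one source, is it really an attack?)
SOURCES = [(1, False), (2, False), (2, False), (3, False), (4, True),
           (6, False), (8, True), (15, True), (40, True), (120, True)]

truth = [is_attack for _, is_attack in SOURCES]

# Sweep the rule from loose to tight and print one scorecard row per setting.
for threshold in (2, 3, 5, 10, 50):
    verdicts = [count >= threshold for count, _ in SOURCES]
    tp, fp, fn, tn = confusion(truth, verdicts)
    p, r = precision_recall(tp, fp, fn)
    print(f"threshold={threshold:>3}  TP={tp} FP={fp} FN={fn} TN={tn}  "
          f"precision={p:.2f} recall={r:.2f}")
```

On this toy data the rows move in opposite directions as the threshold rises: recall falls while precision climbs. The knee is the row where the next step of tightening starts costing attacks faster than it removes false alarms.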

Coverage ≠ effectiveness. A 500-line corpus is not better than a 40-line one if all 500 are easy Failed password lines. Counting items is vanity; the corpus earns its keep by deliberately sampling the hard cases — the near-misses that look like the other class, the malformed input that crashes a brittle parser, the novel attack phrased unusually. The hand-built 40-line corpus that includes the cases you know trip naive tools is worth more than a thousand auto-generated easy ones.
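One way to make that visible is to tag each corpus item with the kind of case it represents and report recall per tag, so a pile of easy lines cannot hide a blind spot. The category names and the `RESULTS` tuples below are hypothetical:

```python
from collections import defaultdict

# Hypothetical scored corpus: (category tag, is_attack, tool_flagged_it).
RESULTS = [
    ("easy", True, True),
    ("easy", True, True),
    ("easy", True, True),
    ("easy", False, False),
    ("slow-and-low spray", True, False),
    ("unusual phrasing", True, False),
    ("benign near-miss", False, True),
    ("malformed", False, False),
]


def recall_by_category(results):
    """Recall per category, computed over attack items only."""
    caught = defaultdict(int)
    total = defaultdict(int)
    for category, is_attack, flagged in results:
        if is_attack:
            total[category] += 1
            caught[category] += int(flagged)
    return {category: caught[category] / total[category] for category in total}


for category, score in recall_by_category(RESULTS).items():
    print(f"{category:<20} recall={score:.2f}")
```

Here the "easy" attacks are all caught while both hard categories score zero. Adding more easy lines would raise the overall recall without changing either of those rows.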

Go deeper: the confusion matrix, the knee, and parametrize

Precision (of the lines you flagged, how many were real?) and recall (of the real attacks, how many did you catch?) both come out of the confusion matrix — TP/FP/FN/TN. The eval lets you find the knee of the tradeoff on purpose: tighten the rule and recall drops, loosen it and precision drops, and the scorecard shows you exactly where. Mechanically, @pytest.mark.parametrize turns each labelled corpus line into its own pass/fail test — the bridge from the Module-10 suite to a corpus-driven eval.
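A minimal sketch of the parametrize bridge, assuming pytest is installed. The `FAILED_LOGIN` regex and `flags_line` function are stand-ins for the Module-02 rule, whose real name and logic may differ, and the corpus lines are made up:

```python
import re

import pytest

# Stand-in for the Module-02 rule; hypothetical, not the track's actual regex.
FAILED_LOGIN = re.compile(r"Failed password for (?:invalid user )?\S+ from \S+")


def flags_line(line: str) -> bool:
    """Return True when the rule flags this log line."""
    return FAILED_LOGIN.search(line) is not None


# Each entry is (raw log line, expected verdict from the answer key).
CORPUS = [
    ("Failed password for root from 203.0.113.7 port 22 ssh2", True),
    ("Failed password for invalid user admin from 203.0.113.9 port 22 ssh2", True),
    ("Accepted password for alice from 198.51.100.4 port 22 ssh2", False),
    ("Failed passw", False),
]


# pytest generates one test per corpus entry, so each line passes or fails on its own.
@pytest.mark.parametrize("line, expected", CORPUS)
def test_corpus_line(line, expected):
    assert flags_line(line) == expected
```

Per-line tests show which lines the tool got wrong. They do not replace the aggregate recall floor, because a real corpus will usually contain hard cases the tool is known to miss.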

The regression gate is what makes this engineering, not a one-off study. The deliverable is a gate: the eval runs in CI, and a change that drops recall below a declared floor fails the build — exactly as a unit test fails on a broken function. The proof that the gate works is a planted regression: you deliberately weaken the rule (so it under-detects), and the gate must turn red and exit non-zero. A gate you have only ever seen pass is not a gate — you haven't shown it can catch anything. The green-on-good / red-on-regressed contrast is the lesson; it's what lets a teammate refactor the parser on a Friday without praying. Unit tests prove the code didn't break; the eval gate proves the tool didn't get worse.

flowchart LR
    C["held-out corpus<br/>(attack / benign / malformed)"] --> R["run parser<br/>over every line"]
    R --> CM["compare verdicts<br/>to answer key"]
    CM --> SC["scorecard:<br/>precision / recall"]
    SC --> G{"recall ≥ floor?"}
    G -->|yes| PASS["CI green — merge"]
    G -->|"no (planted regression)"| FAIL["CI red — exit non-zero"]
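The flow above can be sketched as a small gate function that returns a process exit code. The floor value, both regexes, and the corpus lines are assumptions for illustration; the planted regression here narrows the rule so it under-detects.

```python
import re

RECALL_FLOOR = 0.90  # hypothetical floor; declare the real one from your scorecard

GOOD_RULE = re.compile(r"Failed password for (?:invalid user )?\S+ from")
# Planted regression: the rule is narrowed to root only, so it under-detects.
WEAKENED_RULE = re.compile(r"Failed password for root from")

# Hypothetical held-out corpus: (raw log line, is it an attack?)
CORPUS = [
    ("Failed password for root from 203.0.113.7 port 22 ssh2", True),
    ("Failed password for invalid user admin from 203.0.113.9 port 22 ssh2", True),
    ("Failed password for oracle from 203.0.113.9 port 22 ssh2", True),
    ("Failed password for invalid user test from 203.0.113.12 port 22 ssh2", True),
    ("Accepted publickey for deploy from 198.51.100.4 port 22 ssh2", False),
    ("Connection closed by 198.51.100.8 port 22 [preauth]", False),
]


def recall(rule, corpus):
    """Fraction of the attack lines that the rule flags."""
    attacks = [line for line, is_attack in corpus if is_attack]
    if not attacks:
        raise ValueError("corpus has no attack lines; recall is undefined")
    caught = sum(1 for line in attacks if rule.search(line))
    return caught / len(attacks)


def gate(rule, corpus, floor=RECALL_FLOOR):
    """Return 0 when recall meets the floor and 1 otherwise."""
    score = recall(rule, corpus)
    passed = score >= floor
    print(f"{'PASS' if passed else 'FAIL'}: recall={score:.2f} floor={floor:.2f}")
    return 0 if passed else 1


good_code = gate(GOOD_RULE, CORPUS)          # expected to pass
planted_code = gate(WEAKENED_RULE, CORPUS)   # expected to fail
```

In a CI entry point the last step would be something like `raise SystemExit(gate(RULE, CORPUS))`, so the non-zero code fails the job. Running both rules side by side is the green-on-good, red-on-regressed contrast in miniature.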

AI caveat

A model writes the mechanical parts well: confusion-matrix counting, the scorecard, the Actions YAML. What it quietly gets wrong is the judgment: it defaults to accuracy (override it to recall), it will happily grade on the lines it generated and "tested" (enforce the held-out wall), and it won't fail closed by default. Use it to expand the corpus with adversarial near-misses, then label every one yourself, because a model labelling its own test set is the contamination this module warns about.

Learn (~2.5 hrs)

The confusion matrix & the metrics (~50 min)
- Wikipedia, "Precision and recall": the canonical definitions with the 2×2 contingency table (TP/FP/FN/TN) and the worked classifier example; read down to the F-score section so the vocabulary your scorecard prints is precise. ~15 min.
- Jason Brownlee, "Failure of Classification Accuracy for Imbalanced Class Distributions": why you don't gate on accuracy. It covers the "accuracy paradox", where a do-nothing detector scores 99% on skewed data. Short and concrete; this is the single most important misconception this module fixes. ~15 min.
- scikit-learn, "Metrics and scoring": the Python reference. Read §3.4.4.6 (confusion matrix) and §3.4.4.9 (precision/recall/F-measure). You can reimplement these by hand in the lab (it's just counting), but this shows the precision_score/recall_score/confusion_matrix calls you'd reach for in a real harness. ~20 min.

The harness mechanism in Python (~50 min)
- pytest, "How to parametrize fixtures and test functions": @pytest.mark.parametrize is how you turn a labelled corpus into one test per item, so each corpus line is its own pass/fail; this is the bridge from the Module-10 test suite to a corpus-driven eval. ~20 min.
- coverage.py docs: read the intro and "Quick start". Coverage measures which lines ran, which is the perfect foil for the "coverage ≠ effectiveness" idea: high line-coverage on easy inputs tells you nothing about detection quality. Use it to find untested branches, never as your quality metric. ~15 min.

Wiring the regression gate (~30 min)
- GitHub Docs, "Building and testing Python": how to run a Python script/pytest in GitHub Actions so a non-zero exit fails the PR; read the pytest section. This is the CI half of the gate, the part that makes the eval block a merge rather than sit in a notebook. ~15 min.

Key concepts

A unit test asks "does the code do what I wrote?"; an eval asks "does the tool catch the attacks, and how often does it cry wolf?" — different questions, and only the second has precision/recall.
Labelled corpus vs. the sample you tuned on: grade on held-out data or every number is inflated by memorisation.
Metric is a judgment: recall is load-bearing for detection (a miss is a breach), precision/FP-rate is the cost you pay for it; accuracy lies on an imbalanced corpus.
The precision/recall knee: tighten the rule → recall drops; loosen it → precision drops. The eval finds the operating point deliberately.
Coverage ≠ effectiveness — sample the hard near-miss and malformed cases, not just more easy lines.
The regression gate: a planted weakening of the rule must fail the build — a gate you've only seen pass isn't a gate.

AI acceleration

Have a model draft the mechanical parts — the confusion-matrix counting, the precision/recall/F1 arithmetic, the scorecard table, the argument parsing, the GitHub Actions YAML. That's boilerplate and a model writes it well. What you must own is everything a model will quietly get wrong here: the choice of metric (a model defaults to accuracy — you override it to recall and justify it against the cost of a missed alert), the held-out discipline (a model will happily grade on the same lines it "tested," and on the lines it generated; you enforce the wall), and the gate's direction and fail-closed behaviour (does a missing metric or an erroring eval fail the build, or silently pass?). Use a model to expand the corpus with adversarial near-misses — benign lines crafted to look like attacks, attacks phrased to dodge the obvious regex — then label every one yourself and verify it, because a model labelling its own test set is the contamination this whole module warns against. You generate candidates; you own the ground truth.
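The fail-closed question is the one most worth checking by hand. A minimal sketch, assuming the eval step writes a JSON scorecard with a `recall` key; the file name, the JSON shape, and the function name are all hypothetical:

```python
import json


def gate_exit_code(scorecard_path, floor):
    """Return 0 only when a valid recall at or above the floor is read; otherwise 1."""
    try:
        with open(scorecard_path, encoding="utf-8") as fh:
            scorecard = json.load(fh)
        recall = scorecard["recall"]
    except (OSError, ValueError, KeyError, TypeError) as exc:
        # Missing file, malformed JSON, missing key, or wrong shape: fail closed.
        print(f"FAIL (closed): could not read recall: {exc!r}")
        return 1

    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(recall, bool) or not isinstance(recall, (int, float)):
        print(f"FAIL (closed): recall is not a number: {recall!r}")
        return 1

    # NaN fails this range check too, because every comparison with NaN is False.
    if not 0.0 <= recall <= 1.0:
        print(f"FAIL (closed): recall out of range: {recall!r}")
        return 1

    if recall < floor:
        print(f"FAIL: recall={recall:.2f} is below floor={floor:.2f}")
        return 1

    print(f"PASS: recall={recall:.2f} floor={floor:.2f}")
    return 0


# A scorecard that was never written must fail the build, not pass it.
missing_code = gate_exit_code("no-such-scorecard.json", floor=0.90)
```

The design choice is that success is the only path returning 0. Every error path, including ones nobody anticipated inside the `try`, has to opt in to passing rather than opt out.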

Check yourself

What question does an eval answer that a unit test cannot — and why does only the eval have a number?
Why is accuracy the wrong metric for an imbalanced detection corpus, and what do you gate on instead?
Why is a planted regression the proof your CI gate works, and what does its absence leave unproven?
