Module 03 — IaC Security Scanning

Type 8 · Judgment-as-Code / Gate: encode your verdict on an IaC misconfiguration as a CI gate that fails the bad configuration and passes the fixed one, with one true false-positive correctly suppressed; the deliverable is the gate, proven both ways, not an essay. (Secondary: Build-&-Operate, where you run a real scanner over real Terraform.)

Last reviewed: 2026-06

Security Automation — the misconfiguration that becomes a breach ships first as a line of Terraform. Catch it in the diff, then make the catch permanent.

Difficulty: Intermediate · Estimated time: ~3–4 hrs (study + lab) · Prerequisites: Foundations · Module 02 — Infrastructure as Code

In 60 seconds

The misconfiguration that becomes a breach ships first as a line of Terraform. A static scanner (checkov, tfsec) catches the known-bad pattern — public-read, 0.0.0.0/0, wildcard IAM — in the PR diff, milliseconds instead of months. But a scanner is a fast junior reviewer with no context: it can't tell the open port you meant from the one that's a breach. The deliverable isn't the scan; it's the CI gate that encodes your verdict so it can't regress, with true false-positives suppressed with a rationale, never silenced.

Why this matters

A misconfigured S3 bucket costs nothing to fix in a .tf file before it deploys. Once it ships with public read, gets discovered by an outside scanner, lands in a breach report, and has to be disclosed to customers, it costs orders of magnitude more. The real incidents share an uncomfortable through-line: the unencrypted public buckets behind the 2017 wave of S3 leaks (Accenture, Verizon/NICE Systems, Booz Allen, Dow Jones) and the over-broad IAM role the attacker rode in the 2019 Capital One breach almost never start life in a console. They start as a line of Terraform, get reviewed by someone reading for logic rather than posture, and ship. By the time a posture scanner finds them in production, they have been live for months.

This module moves the catch left, to the diff. The same property that makes infrastructure-as-code auditable makes it scannable before a single resource exists: a static analyzer parses the HCL, builds the resource graph, and matches it against a rule library (checkov's CKV_AWS_* checks, tfsec's built-ins), with many rules mapped back to a CIS control. A misconfiguration that takes months to find in production takes milliseconds to flag in a pull request, and costs nothing to fix before tofu apply.

But the scan is not the lesson. The lesson is what you do with the verdict. A finding you fix by hand regresses the next time someone copies the module. IaC is the one place where your verdict can become a mechanical gate that blocks the merge — and that gate is the deliverable of this module.

The core idea: a scanner is a fast junior reviewer with no context

Hold this picture, because the rest of the module is its consequences. A scanner is a brilliant, tireless junior reviewer who has memorized every known-bad pattern and understands none of your intentions. It will catch encrypted = false, acl = "public-read", cidr_blocks = ["0.0.0.0/0"], and Action = "*" every time, instantly, across ten thousand files. It will never tell you that the open security group on port 443 is the one your public load balancer actually needs, or that the open one on 5432 is a database you just exposed to the internet — because both are the same pattern, and the difference is a decision the scanner can't see. It pattern-matches; it cannot read intent, business context, or the blast radius two resources away.
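To make "same pattern, different decision" concrete, here is a toy sketch of rule matching. It is not how checkov or tfsec is implemented (real scanners parse HCL and build a resource graph first); the rule dictionaries and the `open_ingress_findings` helper are hypothetical stand-ins invented for illustration.

```python
# Toy illustration only: these dicts are hand-written stand-ins for parsed
# security-group rules, and the helper is hypothetical, not a scanner API.
from typing import Dict, List

RULES: List[Dict] = [
    {"name": "alb_https", "port": 443, "cidr_blocks": ["0.0.0.0/0"]},
    {"name": "postgres", "port": 5432, "cidr_blocks": ["0.0.0.0/0"]},
    {"name": "internal_ssh", "port": 22, "cidr_blocks": ["10.0.0.0/8"]},
]


def open_ingress_findings(rules: List[Dict]) -> List[str]:
    """Return the names of rules whose ingress is open to the whole internet."""
    return [rule["name"] for rule in rules if "0.0.0.0/0" in rule["cidr_blocks"]]


findings = open_ingress_findings(RULES)
print(findings)  # both the load balancer rule and the database rule match
```

The matcher reports `alb_https` and `postgres` identically, because nothing in the pattern encodes which one was intended. That missing field is the reviewer's judgment.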

The mental model

A scanner is a brilliant, tireless junior reviewer who has memorized every known-bad pattern and understands none of your intentions. It catches the pattern every time, instantly, across ten thousand files — and never tells you that the open port 443 is the load balancer you need and the open 5432 is a database you just exposed. The pattern is the same; the difference is your judgment.

So the scanner splits the world cleanly into two halves, and your job is different in each:

The known-bad pattern — unencrypted storage, public ACL, wildcard IAM, SSH open to the world, IMDSv2 not enforced. Here the scanner is right and you just fix it. The skill is throughput, not judgment.
The bad decision the scanner misses — an open SG that is genuinely intended (a true false-positive you must suppress correctly, with a rationale, not silence) versus an open SG that is a real exposure; a hardcoded secret in a variable default; an IAM policy that's valid HCL but composes into privilege escalation. Logic and context live here, and this is where you add value the tool can't.

The discipline that ties it together is the suppression. Every tool lets you silence a finding with an inline comment (in checkov, #checkov:skip=CKV_AWS_18: <reason>). Suppressing a true false-positive, such as the logging bucket that doesn't need to log to itself or the intended public-HTTPS rule, is a legitimate, senior move: you are overruling the junior with a documented reason. Suppressing by check-ID across the whole codebase, or with no rationale, is how the junior gets ignored entirely and the bad decision ships anyway. A suppression is an audit trail, not a mute button. Getting that distinction right is the judgment this module is about.
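One way to keep suppressions honest is to audit them mechanically. The sketch below scans Terraform source text for inline `checkov:skip` comments and reports any that carry no rationale or only a token one. The `audit_suppressions` helper, the `MIN_REASON_CHARS` threshold, and the sample resources are all hypothetical; this is a review aid under those assumptions, not a Checkov feature.

```python
import re
from typing import List, Tuple

# Matches inline comments such as "#checkov:skip=CKV_AWS_18: reason text".
SKIP_RE = re.compile(r"checkov:skip=(?P<check_id>[A-Za-z0-9_]+)(?::(?P<reason>.*))?")

MIN_REASON_CHARS = 15  # hypothetical team threshold, not a Checkov setting


def audit_suppressions(tf_source: str) -> List[Tuple[int, str, str]]:
    """Return (line_number, check_id, problem) for each weak suppression."""
    problems = []
    for lineno, line in enumerate(tf_source.splitlines(), start=1):
        match = SKIP_RE.search(line)
        if match is None:
            continue
        reason = (match.group("reason") or "").strip()
        if not reason:
            problems.append((lineno, match.group("check_id"), "no rationale"))
        elif len(reason) < MIN_REASON_CHARS:
            problems.append((lineno, match.group("check_id"), "rationale too short"))
    return problems


# Hypothetical Terraform: one documented suppression, one silent one.
SAMPLE = '''
resource "aws_s3_bucket" "logs" {
  #checkov:skip=CKV_AWS_18: This is the access-log target bucket; it does not log to itself
  bucket = "example-logs"
}

resource "aws_s3_bucket" "data" {
  #checkov:skip=CKV_AWS_19
  bucket = "example-data"
}
'''

for problem in audit_suppressions(SAMPLE):
    print(problem)
```

Only the second suppression is reported. A length threshold cannot tell a good reason from a bad one, so a check like this narrows what the human reviewer has to read; it does not replace the reading.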

The gotcha

Suppression is where the gate quietly fails. Silencing a true false-positive inline, with a check-ID and a defensible reason, is a senior move. Blanket-skipping a check across the codebase, or suppressing a real exposure as if it were noise, is how the bad decision ships with a paper trail that makes it look reviewed. And calibrate strict-first: if you start permissive, "tighten later" never happens.

checkov (Bridgecrew / Palo Alto Networks) and tfsec (Aqua Security) cover overlapping but not identical rule sets, so running both is common: each catches what the other misses. The choice between them matters far less than the habit of running one consistently in CI as a gate. checkov -d . --soft-fail-on LOW tolerates LOW-severity findings and exits non-zero when anything more severe remains, and a pipeline that fails on that exit code means the misconfig never reaches tofu apply. The rest is calibration: too strict and every PR fails on noise, too loose and real misconfigs slip through. Start strict and suppress with justification; never start permissive and plan to tighten later. The gate is where you encode that verdict so it can't regress.
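The gate logic itself is small enough to sketch and to prove both ways. The example below assumes a JSON report shaped like `{"results": {"failed_checks": [{"check_id": ...}]}}`, which is a simplified reading of Checkov's JSON output and should be verified against the version you run; it also assumes inline-suppressed findings are reported separately and never appear under `failed_checks`. The `gate_exit_code` helper and both sample reports are hypothetical. In a real pipeline the scanner's own exit code usually does this job.

```python
import json
from typing import List


def failed_check_ids(report_json: str) -> List[str]:
    """Collect check IDs from the assumed results.failed_checks list."""
    parsed = json.loads(report_json)
    # Assumption: a run covering several frameworks may emit a list of reports.
    reports = parsed if isinstance(parsed, list) else [parsed]
    ids = []
    for report in reports:
        for failed in report.get("results", {}).get("failed_checks", []):
            ids.append(failed["check_id"])
    return ids


def gate_exit_code(report_json: str) -> int:
    """Return 1 (block the merge) if any failed check remains, else 0."""
    return 1 if failed_check_ids(report_json) else 0


# Hypothetical reports: the original config and the genuinely fixed one.
BAD = json.dumps(
    {"results": {"failed_checks": [
        {"check_id": "CKV_AWS_24", "resource": "aws_security_group.bastion"}
    ]}}
)
FIXED = json.dumps({"results": {"failed_checks": []}})

print(gate_exit_code(BAD), gate_exit_code(FIXED))
```

Running the same function over the bad and the fixed report is the "proven both ways" step: a gate that was only ever observed passing has not shown that it can fail.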

flowchart LR
    PR["PR diff<br/>(.tf change)"] --> S["scanner<br/>(checkov / tfsec)"]
    S --> F{"finding?"}
    F -->|"none"| OK["merge → <code>tofu apply</code>"]
    F -->|"true false-positive"| SUP["suppress inline<br/>+ rationale + check-ID"]
    SUP --> OK
    F -->|"real misconfig"| BLOCK["exit non-zero<br/>— block the merge"]

AI caveat

AI is excellent at writing Terraform that passes a scanner — and just as good at hiding an IAM over-grant behind it. It will "fix" a finding by moving a wildcard from Action to Resource (still broken) or suppress a real exposure as if it were a false-positive. Let it draft and run the scanner; you confirm each suppression has a real rationale and the gate fails for the right reason.

Learn (~2 hrs)

Build-first and tool-heavy: read enough to triage findings and write a real gate, then go to the lab.

The scanners and their rule libraries (~1 hr)
- Checkov — Quick Start (~25 min) — run the quickstart against a local Terraform directory; understand --check, --skip-check, and the output format. The Terraform check index is the fastest way to see exactly what field each CKV_AWS_* check tests — look up CKV_AWS_18 (S3 access logging) and CKV_AWS_19/CKV_AWS_145 (S3 encryption) so a finding stops being a black box.
- tfsec — Documentation (Getting Started + Configuration) (~20 min) — Terraform-specific depth and very readable output; read the severity levels and inline-suppression syntax, and notice the overlap (and gaps) versus Checkov.
- Checkov — Suppressing and Skipping checks (inline checkov:skip) (~15 min) — the correct way to record a true false-positive, with a rationale. This is the judgment move, documented.

Writing the gate — the actual deliverable (~45 min)
- Checkov — Hard and soft fail (exit codes, --soft-fail-on, --hard-fail-on) (~20 min) — read precisely how Checkov sets its exit code and how --soft-fail-on / --hard-fail-on choose which severities block. The gate lives or dies on this.
- bridgecrewio/checkov-action (the GitHub Action) (~15 min) — the canonical CI integration; read how soft_fail and SARIF upload (output_format: cli,sarif → github/codeql-action/upload-sarif) wire into a PR check, and pin the action to a commit SHA.
- Writing a custom Checkov check (Python / YAML) (~10 min) — skim, for the stretch: when no built-in rule encodes your org's verdict, you write the rule.

Why the patterns matter (~15 min)
- CIS AWS Foundations Benchmark (~10 min, skim) — the controls each CKV_AWS_* maps to (S3 encryption, SG ingress, IMDSv2). The gate enforces these; cite them in findings.
- MITRE ATT&CK — T1078.004 Valid Accounts: Cloud Accounts (~5 min) — the over-broad IAM role and public ingress these scans catch are exactly what an attacker rides after initial access; the framing for why a blocked merge prevents an attack, not just a lint warning.

Key concepts

A scanner is a fast junior reviewer with no context: it catches the known-bad pattern (encrypted = false, public-read, 0.0.0.0/0, *) but never the bad decision (intended vs. catastrophic open port; a secret in a variable; IAM that composes into admin).
Shift-left literally: block the misconfig in the PR diff, before tofu apply, not in a post-deploy audit months later.
checkov and tfsec overlap but differ — run both; the choice matters less than the habit of gating in CI.
Suppression is an audit trail, not a mute button: silence a true false-positive inline with a rationale and a check-ID — never blanket-skip across the codebase.
Calibrate strict-first: too strict floods PRs with noise, too loose lets real misconfigs through; start strict and suppress with justification.
The deliverable is the gate: the verdict encoded so it fails the bad state and passes the fix, and can't regress when someone copies the module.

AI acceleration

AI is excellent at writing Terraform that passes a scanner — and just as good at writing Terraform that looks correct but hides an IAM over-grant or an encryption miss. The reliable loop: let the model draft a resource block, run checkov/tfsec on it immediately, feed the findings back, iterate — the model is your first-pass engineer; you are the reviewer. Doing this by hand also teaches you which misconfigs AI consistently produces. But the judgment the model can't do for you is exactly the scanner's blind spot: it will happily "fix" a finding by moving a wildcard from Action to Resource (still broken), suppress a real exposure as if it were a false-positive, or pass the gate while leaving a secret in a variable. Make the model draft the gate and the suppressions; you confirm each suppression has a real rationale, that the gate fails the original config for the right reason, and that it passes only the genuinely-fixed one. AI authors, you review, you own the verdict.
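The "wildcard moved from Action to Resource" failure is easy to check for when you review a model's fix. The sketch below inspects a single IAM policy statement, represented as a plain dict, and names every field that is still a bare `*`. The `wildcard_fields` helper and the three sample statements are hypothetical, and the check is deliberately narrow: it does not catch partial wildcards such as `s3:*`, and it says nothing about how policies compose.

```python
from typing import Any, Dict, List


def _as_list(value: Any) -> List[str]:
    """IAM allows a single string or a list; normalise to a list."""
    if value is None:
        return []
    return [value] if isinstance(value, str) else list(value)


def wildcard_fields(statement: Dict[str, Any]) -> List[str]:
    """Names of fields in an Allow statement that contain a bare '*'."""
    if statement.get("Effect") != "Allow":
        return []
    return [
        field
        for field in ("Action", "Resource")
        if "*" in _as_list(statement.get(field))
    ]


# Hypothetical statements: the original, the model's "fix", and a real fix.
ORIGINAL = {"Effect": "Allow", "Action": "*",
            "Resource": "arn:aws:s3:::example-bucket/*"}
AI_FIX = {"Effect": "Allow", "Action": ["s3:GetObject", "s3:PutObject"],
          "Resource": "*"}
REAL_FIX = {"Effect": "Allow", "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-bucket/*"}

for name, stmt in [("original", ORIGINAL), ("ai_fix", AI_FIX), ("real_fix", REAL_FIX)]:
    print(name, wildcard_fields(stmt))
```

Comparing the output before and after a proposed fix shows whether the wildcard was removed or merely relocated, which is the question a passing scan alone may not answer.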

Check yourself

A scanner flags two open security groups identically. What can it never tell you about them, and whose job is that?
When is suppressing a finding a senior move, and when is it how the bad decision ships "reviewed"?
Why "start strict and suppress" rather than "start permissive and tighten later"?
