Module 03 — IaC Security Scanning
Type 8 · Judgment-as-Code / Gate: encode your verdict on an IaC misconfiguration as a CI gate that fails the bad state and passes the fix, with one true false-positive correctly suppressed; the deliverable is the gate, proven both ways, not an essay. (Secondary: Build-&-Operate, where you run a real scanner over real Terraform.)
Last reviewed: 2026-06
Security Automation — the misconfiguration that becomes a breach ships first as a line of Terraform. Catch it in the diff, then make the catch permanent.
In 60 seconds
The misconfiguration that becomes a breach ships first as a line of Terraform. A static scanner
(checkov, tfsec) catches the known-bad pattern — public-read, 0.0.0.0/0, wildcard IAM —
in the PR diff, milliseconds instead of months. But a scanner is a fast junior reviewer with no
context: it can't tell the open port you meant from the one that's a breach. The deliverable isn't
the scan; it's the CI gate that encodes your verdict so it can't regress, with true
false-positives suppressed with a rationale, never silenced.
Why this matters
A misconfigured S3 bucket costs nothing to fix in a .tf file before it deploys. After it ships with
public read, gets discovered by a scanner, lands in a breach report, and has to be disclosed to
customers, it costs orders of magnitude more. The real incidents share an uncomfortable
through-line: the unencrypted public buckets behind the 2017 wave of S3 leaks (Accenture,
Verizon/Nice, Booz Allen, Dow Jones) and the over-broad IAM role the attacker rode in the 2019
Capital One breach almost never start life in a console. They start as a line of Terraform,
get reviewed by someone reading for logic rather than posture, and ship. By the time a posture
scanner finds them in production, they have been live for months.
This module moves the catch left, to the diff. The same property that makes infrastructure-as-code
auditable makes it scannable before a single resource exists: a static analyzer parses the HCL,
builds the resource graph, and matches it against a rule library — checkov's CKV_AWS_* checks,
tfsec's built-ins — each mapped back to a CIS control. A misconfiguration that takes days to find in
production takes milliseconds to flag in a pull request, and costs nothing to fix before tofu apply.
But the scan is not the lesson. The lesson is what you do with the verdict. A finding you fix by hand regresses the next time someone copies the module. IaC is the one place where your verdict can become a mechanical gate that blocks the merge — and that gate is the deliverable of this module.
The core idea: a scanner is a fast junior reviewer with no context
Hold this picture, because the rest of the module is its consequences. A scanner is a brilliant,
tireless junior reviewer who has memorized every known-bad pattern and understands none of your
intentions. It will catch encrypted = false, acl = "public-read", cidr_blocks = ["0.0.0.0/0"],
and Action = "*" every time, instantly, across ten thousand files. It will never tell you that the
open security group on port 443 is the one your public load balancer actually needs, or that the open
one on 5432 is a database you just exposed to the internet — because both are the same pattern, and the
difference is a decision the scanner can't see. It pattern-matches; it cannot read intent, business
context, or the blast radius two resources away.
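The point can be made concrete with a toy rule. The sketch below is not how checkov or tfsec is implemented; `IngressRule` and `open_to_world` are hypothetical names invented for illustration. It only shows that a check keyed on the `0.0.0.0/0` pattern returns the same answer for the intended load balancer rule and the exposed database.

```python
from dataclasses import dataclass

OPEN_CIDR = "0.0.0.0/0"


@dataclass(frozen=True)
class IngressRule:
    """Hypothetical, simplified view of one security-group ingress rule."""
    name: str
    port: int
    cidr_blocks: tuple[str, ...]


def open_to_world(rule: IngressRule) -> bool:
    """Pattern check only: is any source CIDR the whole internet?"""
    return OPEN_CIDR in rule.cidr_blocks


rules = [
    IngressRule("public_alb_https", 443, ("0.0.0.0/0",)),   # intended
    IngressRule("postgres", 5432, ("0.0.0.0/0",)),          # real exposure
    IngressRule("internal_ssh", 22, ("10.0.0.0/8",)),       # not flagged
]

# Both open rules are reported identically; nothing in the rule's input
# says which one was a deliberate decision.
findings = [r.name for r in rules if open_to_world(r)]
print(findings)
```

Telling the two findings apart needs information that is not in the resource block at all, which is why the triage step stays with the reviewer.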
The mental model
A scanner is a brilliant, tireless junior reviewer who has memorized every known-bad pattern and understands none of your intentions. It catches the pattern every time, instantly, across ten thousand files — and never tells you that the open port 443 is the load balancer you need and the open 5432 is a database you just exposed. The pattern is the same; the difference is your judgment.
So the scanner splits the world cleanly into two halves, and your job is different in each:
- The known-bad pattern — unencrypted storage, public ACL, wildcard IAM, SSH open to the world, IMDSv2 not enforced. Here the scanner is right and you just fix it. The skill is throughput, not judgment.
- The bad decision the scanner misses — an open SG that is genuinely intended (a true false-positive you must suppress correctly, with a rationale, not silence) versus an open SG that is a real exposure; a hardcoded secret in a variable default; an IAM policy that's valid HCL but composes into privilege escalation. Logic and context live here, and this is where you add value the tool can't.
The discipline that ties it together is the suppression. Every tool lets you silence a finding with
an inline comment (#checkov:skip=CKV_AWS_18: <reason>). Suppressing a true false-positive — the
logging bucket that doesn't need to log to itself, the intended public-HTTPS rule — is a legitimate,
senior move: you over-ruling the junior with a documented reason. Suppressing by check-ID across the
whole codebase, or with no rationale, is how the junior gets ignored entirely and the bad decision ships
anyway. A suppression is an audit trail, not a mute button. Getting that distinction right is the
judgment this module is about.
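One way to keep suppressions honest is to lint them. The following is a minimal sketch, assuming inline comments of the form `#checkov:skip=<check-ID>: <reason>` as quoted above; the regular expression is a simplification written for this example rather than Checkov's own parser, and `MIN_REASON_CHARS`, `audit_suppressions`, and the sample bucket names are hypothetical.

```python
import re

# Simplified pattern for an inline skip comment. This is an assumption for
# illustration, not the parser Checkov itself uses.
SKIP_RE = re.compile(
    r"#\s*checkov:skip=(?P<check_id>[A-Z0-9_]+)(?::(?P<reason>.*))?$"
)

MIN_REASON_CHARS = 15  # hypothetical team threshold for a usable rationale


def audit_suppressions(tf_source: str) -> list[tuple[int, str, str]]:
    """Return (line_number, check_id, problem) for each weak suppression."""
    problems = []
    for lineno, line in enumerate(tf_source.splitlines(), start=1):
        match = SKIP_RE.search(line.strip())
        if match is None:
            continue
        reason = (match.group("reason") or "").strip()
        if len(reason) < MIN_REASON_CHARS:
            problems.append(
                (lineno, match.group("check_id"), "missing or thin rationale")
            )
    return problems


SAMPLE_TF = '''\
resource "aws_s3_bucket" "access_logs" {
  #checkov:skip=CKV_AWS_18: log-target bucket; logging it to itself would loop
  bucket = "example-access-logs"
}

resource "aws_s3_bucket" "reports" {
  #checkov:skip=CKV_AWS_18
  bucket = "example-reports"
}
'''

weak = audit_suppressions(SAMPLE_TF)
print(weak)
```

A length check cannot judge whether a rationale is defensible; it only guarantees there is something for a human reviewer to read and challenge.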
The gotcha
Suppression is where the gate quietly fails. Silencing a true false-positive inline, with a check-ID and a defensible reason, is a senior move. Blanket-skipping a check across the codebase, or suppressing a real exposure as if it were noise, is how the bad decision ships with a paper trail that makes it look reviewed. And calibrate strict-first: if you start permissive, "tighten later" never happens.
checkov (Bridgecrew / Palo Alto Networks) and tfsec (Aqua Security) cover overlapping but not
identical rule sets; running both is common because each catches what the other misses. The choice
between them matters far less than the habit of running one consistently in CI as a gate:
checkov -d . --soft-fail-on LOW exits non-zero when a finding above LOW severity remains, and a
pipeline that fails on that exit code means the misconfig never reaches tofu apply. The rest is
calibration: too strict and every PR fails on noise, too loose and real misconfigs slip through.
Start strict and suppress with justification; never start permissive and plan to tighten later.
The gate is where you encode that verdict so it can't regress.
flowchart LR
    PR["PR diff<br/>(.tf change)"] --> S["scanner<br/>(checkov / tfsec)"]
    S --> F{"finding?"}
    F -->|"none"| OK["merge → <code>tofu apply</code>"]
    F -->|"true false-positive"| SUP["suppress inline<br/>+ rationale + check-ID"]
    SUP --> OK
    F -->|"real misconfig"| BLOCK["exit non-zero<br/>— block the merge"]
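The flow above can be sketched as a small decision function and exercised both ways. This is an illustrative model under stated assumptions, not Checkov's exit-code logic: `Finding` is a hypothetical flattened record rather than the scanner's real output schema, and treating a suppression without a rationale as a failure is a policy choice made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """Hypothetical, flattened view of one scanner result."""
    check_id: str
    resource: str
    suppressed: bool = False
    rationale: str = ""


def gate_exit_code(findings: list[Finding]) -> int:
    """Return 0 to allow the merge, 1 to block it."""
    for finding in findings:
        if not finding.suppressed:
            return 1  # a real misconfig is still present
        if not finding.rationale.strip():
            return 1  # a suppression with no reason counts as a failure
    return 0


# Bad state: an unencrypted bucket plus an unsuppressed logging finding.
bad_state = [
    Finding("CKV_AWS_19", "aws_s3_bucket.reports"),
    Finding("CKV_AWS_18", "aws_s3_bucket.access_logs"),
]

# Fixed state: encryption is fixed in the .tf file, and the logging finding
# on the log-target bucket is suppressed with a documented reason.
fixed_state = [
    Finding(
        "CKV_AWS_18",
        "aws_s3_bucket.access_logs",
        suppressed=True,
        rationale="log-target bucket; logging it to itself would loop",
    ),
]

# Silenced state: the real finding is muted instead of fixed.
silenced_state = [
    Finding("CKV_AWS_19", "aws_s3_bucket.reports", suppressed=True),
]

print(gate_exit_code(bad_state), gate_exit_code(fixed_state),
      gate_exit_code(silenced_state))
```

Running the same function over the bad, fixed, and silenced states is the "proven both ways" habit in miniature: the gate should block the original config and the unexplained mute, and pass only the genuine fix.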
AI caveat
AI is excellent at writing Terraform that passes a scanner — and just as good at hiding an IAM
over-grant behind it. It will "fix" a finding by moving a wildcard from Action to Resource
(still broken) or suppress a real exposure as if it were a false-positive. Let it draft and run the
scanner; you confirm each suppression has a real rationale and the gate fails for the right reason.
Learn (~2 hrs)
Build-first and tool-heavy: read enough to triage findings and write a real gate, then go to the lab.
The scanners and their rule libraries (~1 hr)
- Checkov — Quick Start (~25 min) — run the quickstart against a local Terraform directory; understand --check, --skip-check, and the output format. The Terraform check index is the fastest way to see exactly what field each CKV_AWS_* check tests — look up CKV_AWS_18 (S3 access logging) and CKV_AWS_19/CKV_AWS_145 (S3 encryption) so a finding stops being a black box.
- tfsec — Documentation (Getting Started + Configuration) (~20 min) — Terraform-specific depth and very readable output; read the severity levels and inline-suppression syntax, and notice the overlap (and gaps) versus Checkov.
- Checkov — Suppressing and Skipping checks (inline checkov:skip) (~15 min) — the correct way to record a true false-positive, with a rationale. This is the judgment move, documented.
Writing the gate — the actual deliverable (~45 min)
- Checkov — Hard and soft fail (exit codes, --soft-fail-on, --hard-fail-on) (~20 min) — read precisely how Checkov sets its exit code and how --soft-fail-on / --hard-fail-on choose which severities block. The gate lives or dies on this.
- bridgecrewio/checkov-action (the GitHub Action) (~15 min) — the canonical CI integration; read how soft_fail and SARIF upload (output_format: cli,sarif → github/codeql-action/upload-sarif) wire into a PR check, and pin the action to a commit SHA.
- Writing a custom Checkov check (Python / YAML) (~10 min) — skim, for the stretch: when no built-in rule encodes your org's verdict, you write the rule.
Why the patterns matter (~15 min)
- CIS AWS Foundations Benchmark (~10 min, skim) — the controls each CKV_AWS_* maps to (S3 encryption, SG ingress, IMDSv2). The gate enforces these; cite them in findings.
- MITRE ATT&CK — T1078.004 Valid Accounts: Cloud Accounts (~5 min) — the over-broad IAM role and public ingress these scans catch are exactly what an attacker rides after initial access; the framing for why a blocked merge prevents an attack, not just a lint warning.
Key concepts
- A scanner is a fast junior reviewer with no context: it catches the known-bad pattern (encrypted = false, public-read, 0.0.0.0/0, *) but never the bad decision (intended vs. catastrophic open port; a secret in a variable; IAM that composes into admin).
- Shift-left literally: block the misconfig in the PR diff, before tofu apply, not in a post-deploy audit months later.
- checkov and tfsec overlap but differ, so run both; the choice matters less than the habit of gating in CI.
- Suppression is an audit trail, not a mute button: silence a true false-positive inline with a rationale and a check-ID; never blanket-skip across the codebase.
- Calibrate strict-first: too strict floods PRs with noise, too loose lets real misconfigs through; start strict and suppress with justification.
- The deliverable is the gate: the verdict encoded so it fails the bad state and passes the fix, and can't regress when someone copies the module.
AI acceleration
AI is excellent at writing Terraform that passes a scanner — and just as good at writing Terraform
that looks correct but hides an IAM over-grant or an encryption miss. The reliable loop: let the model
draft a resource block, run checkov/tfsec on it immediately, feed the findings back, iterate — the
model is your first-pass engineer; you are the reviewer. Doing this by hand also teaches you which
misconfigs AI consistently produces. But the judgment the model can't do for you is exactly the
scanner's blind spot: it will happily "fix" a finding by moving a wildcard from Action to Resource
(still broken), suppress a real exposure as if it were a false-positive, or pass the gate while
leaving a secret in a variable. Make the model draft the gate and the suppressions; you confirm each
suppression has a real rationale, that the gate fails the original config for the right reason, and
that it passes only the genuinely-fixed one. AI authors, you review, you own the verdict.
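The "moved wildcard" failure is easy to check mechanically when you review a model's fix. Below is a minimal sketch over IAM policy statements represented as plain dictionaries; `as_list`, `wildcard_fields`, and the example ARNs are hypothetical, and the check only looks for a bare `*`, so partial wildcards such as `s3:*` would need their own rule.

```python
def as_list(value):
    """IAM accepts a string or a list for Action and Resource; normalize."""
    return [value] if isinstance(value, str) else list(value)


def wildcard_fields(statement: dict) -> list[str]:
    """Return which of Action/Resource in an Allow statement are a bare '*'."""
    if statement.get("Effect") != "Allow":
        return []
    return [
        field
        for field in ("Action", "Resource")
        if "*" in as_list(statement.get(field, []))
    ]


# Original finding: every action allowed on one bucket's objects.
original = {
    "Effect": "Allow",
    "Action": "*",
    "Resource": "arn:aws:s3:::example-reports/*",
}

# A "fix" that names the actions but moves the wildcard to Resource.
moved_wildcard = {
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:PutObject", "iam:PassRole"],
    "Resource": "*",
}

# A genuine fix: specific action, specific resource.
real_fix = {
    "Effect": "Allow",
    "Action": ["s3:GetObject"],
    "Resource": "arn:aws:s3:::example-reports/*",
}

for name, stmt in [("original", original),
                   ("moved_wildcard", moved_wildcard),
                   ("real_fix", real_fix)]:
    print(name, wildcard_fields(stmt))
```

Checking both fields together is the design choice that matters here: a review that only re-tests the field named in the original finding would accept the moved wildcard as fixed.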
Check yourself
- A scanner flags two open security groups identically. What can it never tell you about them, and whose job is that?
- When is suppressing a finding a senior move, and when is it how the bad decision ships "reviewed"?
- Why "start strict and suppress" rather than "start permissive and tighten later"?