Module 07 — Log Parsing & Normalisation

Type 9 · Tool-Build — parse a real, messy log into structured fields and normalise it to a common schema (ECS), handling the malformed lines; you commit a reusable parser and its parse-rate verification. (Secondary: Build-&-Operate — the parser is a piece of the data plane that has to keep working.)

Last reviewed: 2026-06

Defensive Operations — raw logs are chaos; a common schema is what makes detection scale.

Difficulty: Intermediate · Estimated time: ~5–7 hrs (study + lab) · Prerequisites: Foundations

In 60 seconds

Every source describes the world differently — Apache, sshd, Sysmon, and a firewall each name "source IP" their own way. Until you parse raw text into typed fields and normalise them to a shared schema (the Elastic Common Schema), one detection can't work across sources and you can't correlate them. It's two steps: parse, then rename into a common vocabulary. This is AI's home turf and where it fails most silently — a parser that drops 5% of lines looks fine until a detection misses. Always check the parse rate.

Why this matters

Every source logs differently — Apache, sshd, Sysmon, and a firewall each describe a "source IP" in their own way. Until logs are parsed into fields and normalised to a common schema, you can't write one detection that works across sources, or correlate them. Normalisation (e.g. to the Elastic Common Schema) is the unglamorous plumbing that makes everything downstream possible — and it's exactly where AI both helps most and breaks things silently.

Objective

Parse a real, messy log into structured fields and normalise it to a common schema, handling the malformed lines.

The core idea

This is the plumbing that makes everything else possible, and it's invisible until it breaks. Every source describes the world differently — Apache, sshd, Sysmon, and a firewall each have their own name and format for something as basic as "source IP." Until you parse raw text into typed fields and normalise those fields to a common schema (like the Elastic Common Schema), you cannot write one detection that works across sources, and you cannot correlate them: a "suspicious source IP" rule would need rewriting for every vendor. Normalisation is what lets one rule mean the same thing everywhere.

flowchart LR
    A["Apache log<br/>(clientip)"] --> PA[parse] --> NA[normalise]
    S["sshd log<br/>(rhost)"] --> PS[parse] --> NS[normalise]
    F["firewall log<br/>(src)"] --> PF[parse] --> NF[normalise]
    NA --> ECS["source.ip<br/>(ECS)"]
    NS --> ECS
    NF --> ECS
    ECS --> D["one detection<br/>works across all"]
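
The diagram's rename step can be sketched in a few lines of Python. This is a minimal illustration, not a production normaliser: the source-side names (`clientip`, `rhost`, `src`) come from the diagram, while `FIELD_MAPS`, `normalise`, `is_suspicious`, and the sample events are hypothetical names and data invented for the example. Real ECS documents usually nest the field as `source` → `ip`; a flat dotted key is used here only to keep the sketch short.

```python
# Hypothetical per-source rename tables; source-side names follow the diagram.
FIELD_MAPS = {
    "apache": {"clientip": "source.ip"},
    "sshd": {"rhost": "source.ip"},
    "firewall": {"src": "source.ip"},
}


def normalise(source_type, fields):
    """Rename source-specific field names to the shared vocabulary.

    Fields without a mapping are passed through unchanged (an assumption
    made for brevity; a real pipeline would decide where they belong).
    """
    mapping = FIELD_MAPS[source_type]
    return {mapping.get(name, name): value for name, value in fields.items()}


def is_suspicious(event, watchlist):
    """One detection, written once against the common field name."""
    return event.get("source.ip") in watchlist


# Made-up, already-parsed events from three different sources.
events = [
    normalise("apache", {"clientip": "203.0.113.7", "verb": "GET"}),
    normalise("sshd", {"rhost": "203.0.113.7", "user": "root"}),
    normalise("firewall", {"src": "198.51.100.4", "action": "deny"}),
]

hits = [e for e in events if is_suspicious(e, {"203.0.113.7"})]
print(len(hits), "of", len(events), "events matched the watchlist")
```

The detection never mentions Apache, sshd, or the firewall. Supporting a new source means adding one entry to the rename table, not rewriting the rule.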

The mental model

Two steps: parse (unstructured text → fields, via grok/regex/VRL), then normalise (rename those fields into a shared vocabulary). For the network engineer it's the exact reason you map every vendor's syslog into a common field set before building one dashboard across a mixed fleet — the detection logic should never have to care which box emitted the line.
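
To make the two steps concrete, here is a small sketch that parses one Apache Common Log Format line with a regular expression and then renames the result into ECS field names. It is an illustration under stated assumptions: `parse_clf` and `to_ecs` are hypothetical helper names, the regex covers only the plain Common Log Format, the flat dotted keys stand in for ECS's nested objects, and `%b` month parsing assumes an English/C locale. In the lab you would more likely express the same logic in grok or VRL.

```python
import re
from datetime import datetime

# Step 1 pattern: Common Log Format, e.g.
# host ident user [time] "method path protocol" status size
CLF = re.compile(
    r'(?P<clientip>\S+) \S+ (?P<user>\S+) \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_clf(line):
    """Parse: unstructured text -> typed fields. Returns None on no match."""
    match = CLF.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])  # type the field, don't leave it a string
    fields["ts"] = datetime.strptime(fields["ts"], "%d/%b/%Y:%H:%M:%S %z")
    return fields


def to_ecs(fields):
    """Normalise: rename parsed fields into the shared ECS vocabulary."""
    return {
        "@timestamp": fields["ts"].isoformat(),
        "source.ip": fields["clientip"],
        "http.request.method": fields["method"],
        "url.path": fields["path"],
        "http.response.status_code": fields["status"],
    }


sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
event = to_ecs(parse_clf(sample))
print(event)
```

Keeping the two functions separate mirrors the mental model: a format change touches only `parse_clf`, while a schema change touches only `to_ecs`.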

This module exists because parsing is where AI helps most and also where it fails most silently. A model can draft a grok or VRL parser for an unfamiliar format in seconds, which is genuinely useful, but a clean run tells you nothing about whether that draft is correct.

The gotcha

A parser that drops 5% of lines, mislabels a field, or mangles a timestamp looks completely fine until a detection quietly misses the one event that mattered. The single discipline that saves you: always check the parse rate and the actual field values against the raw log. A green pipeline is not a correct pipeline — the same lesson as module 01, one layer down.
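
One way to make that discipline routine is to compute the parse rate every time and keep the rejected lines for inspection. The sketch below assumes a parser that returns `None` for lines it cannot handle; `toy_parser`, `parse_report`, the made-up `key=value` format, and the sample lines are all hypothetical and exist only to exercise the check.

```python
def toy_parser(line):
    """Hypothetical parser for a made-up 'key=value' firewall format.

    Returns a dict of fields, or None when the required 'src' field is absent.
    """
    fields = dict(part.split("=", 1) for part in line.split() if "=" in part)
    return fields if "src" in fields else None


def parse_report(lines, parser):
    """Run the parser over every line; return the parse rate plus both buckets."""
    parsed, rejected = [], []
    for line in lines:
        fields = parser(line)
        if fields is None:
            rejected.append(line)  # keep the raw line so a human can read it
        else:
            parsed.append((line, fields))
    total = len(parsed) + len(rejected)
    rate = len(parsed) / total if total else 0.0
    return rate, parsed, rejected


# Made-up sample: two good lines, one marker line, one line missing 'src'.
raw = [
    "src=203.0.113.7 dst=10.0.0.5 action=deny",
    "src=198.51.100.4 dst=10.0.0.9 action=allow",
    "-- MARK --",
    "dst=10.0.0.5 action=deny",
]

rate, parsed, rejected = parse_report(raw, toy_parser)
print(f"parse rate: {rate:.1%} ({len(rejected)} rejected)")

# Spot-check field values against the raw text, not just the line count.
mismatches = [(line, f) for line, f in parsed if f["src"] not in line]
print("value mismatches:", len(mismatches))
```

Here two of the four sample lines are rejected, so the pipeline would run without error while silently losing half its input. The parse rate surfaces that; the rejected list tells you which formats the parser still needs to handle.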

Learn (~4 hrs)

Pipelines
- Vector documentation: a modern, fast log pipeline; read the "Quickstart" and the VRL (transform language) intro.
- Elastic Common Schema (ECS): the normalisation target, a shared field set so detections work across sources.

What good logs contain
- OWASP Logging Cheat Sheet: what parseable, useful logs should include.

Key concepts

Parsing: unstructured text → fields (grok / regex / VRL)
Normalisation to a common schema (ECS)
Enrichment (geoIP, asset/user context)
Handling malformed / multiline logs
Why normalisation makes one detection work across many sources
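
As an illustration of the multiline case, the sketch below stitches continuation lines back onto the event they belong to before any parsing happens. It rests on one loudly stated assumption: a line that starts with whitespace continues the previous event (common for stack traces, but not universal). `join_multiline` and the sample lines are hypothetical.

```python
def join_multiline(lines):
    """Merge indented continuation lines into the preceding event.

    Assumption: a line beginning with a space or tab continues the
    previous event. Blank lines are skipped.
    """
    events = []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines entirely
        if line[:1] in (" ", "\t") and events:
            events[-1] += "\n" + line.rstrip()
        else:
            events.append(line.rstrip())
    return events


# Made-up input: one error with a two-line trace, then an unrelated event.
raw = [
    "ERROR request failed",
    "  at handler()",
    "  at server()",
    "",
    "INFO request ok",
]

for event in join_multiline(raw):
    print(repr(event))
```

Without this step each trace line reaches the parser as its own malformed event, which drags the parse rate down and splits one incident across several records.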

AI acceleration

This is AI's home turf — a model writes a grok/VRL parser for an unfamiliar format in seconds. It's also where it fails silently: a parser that drops 5% of lines or mislabels a field looks fine until a detection misses. Always check the parse rate and the field values against the raw log.

Check yourself

Why can't you write one "suspicious source IP" detection that works across Apache, sshd, and a firewall until you've normalised — what specifically breaks?
What's the difference between parsing and normalising, and why do you need both?
Your AI-drafted parser runs clean and the pipeline is green — what single number tells you whether it's actually correct?
