Module 07 — Log Parsing & Normalisation

Type 9 · Tool-Build: parse a real, messy log into structured fields and normalise it to a common schema (ECS), handling the malformed lines; you commit a reusable parser and its parse-rate verification. (Secondary: Build-&-Operate, because the parser is a piece of the data plane that has to keep working.)

Last reviewed: 2026-06

Defensive Operations: raw logs are chaos; a common schema is what makes detection scale.

Difficulty: Intermediate  ·  Estimated time: ~5–7 hrs (study + lab)  ·  Prerequisites: Foundations

In 60 seconds

Every source describes the world differently: Apache, sshd, Sysmon, and a firewall each name "source IP" their own way. Until you parse raw text into typed fields and normalise them to a shared schema (the Elastic Common Schema), one detection can't work across sources and you can't correlate them. It's two steps: parse, then rename into a common vocabulary. Parsing is AI's home turf and also where it fails most silently: a parser that drops 5% of lines looks fine until a detection misses. Always check the parse rate.

Why this matters

Every source logs differently — Apache, sshd, Sysmon, and a firewall each describe a "source IP" in their own way. Until logs are parsed into fields and normalised to a common schema, you can't write one detection that works across sources, or correlate them. Normalisation (e.g. to the Elastic Common Schema) is the unglamorous plumbing that makes everything downstream possible — and it's exactly where AI both helps most and breaks things silently.

Objective

Parse a real, messy log into structured fields and normalise it to a common schema, handling the malformed lines.

The core idea

This is the plumbing that makes everything else possible, and it's invisible until it breaks. Every source describes the world differently — Apache, sshd, Sysmon, and a firewall each have their own name and format for something as basic as "source IP." Until you parse raw text into typed fields and normalise those fields to a common schema (like the Elastic Common Schema), you cannot write one detection that works across sources, and you cannot correlate them: a "suspicious source IP" rule would need rewriting for every vendor. Normalisation is what lets one rule mean the same thing everywhere.

flowchart LR
    A["Apache log<br/>(clientip)"] --> PA[parse] --> NA[normalise]
    S["sshd log<br/>(rhost)"] --> PS[parse] --> NS[normalise]
    F["firewall log<br/>(src)"] --> PF[parse] --> NF[normalise]
    NA --> ECS["source.ip<br/>(ECS)"]
    NS --> ECS
    NF --> ECS
    ECS --> D["one detection<br/>works across all"]
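
The diagram can be sketched in a few lines of Python. This is a minimal illustration, not a production mapper: the native field names (clientip, rhost, src) are taken from the diagram, the FIELD_MAPS table and both function names are hypothetical, and events are kept as flat dictionaries with dotted keys for brevity, whereas ECS itself defines source.ip as a nested field.

```python
# Hypothetical per-source rename maps. The native field names follow the
# diagram above; real sources may name these fields differently.
FIELD_MAPS = {
    "apache": {"clientip": "source.ip"},
    "sshd": {"rhost": "source.ip"},
    "firewall": {"src": "source.ip"},
}


def normalise(source: str, event: dict) -> dict:
    """Rename source-specific fields to their ECS names; keep everything else."""
    mapping = FIELD_MAPS[source]
    return {mapping.get(key, key): value for key, value in event.items()}


def suspicious_source(event: dict, blocklist: set) -> bool:
    """One detection, written once against the ECS field name."""
    return event.get("source.ip") in blocklist


# Three already-parsed events from three different sources.
events = [
    normalise("apache", {"clientip": "203.0.113.7", "verb": "GET"}),
    normalise("sshd", {"rhost": "198.51.100.4", "user": "root"}),
    normalise("firewall", {"src": "203.0.113.7", "action": "deny"}),
]

# The same rule now applies to every source without per-vendor rewrites.
hits = [e for e in events if suspicious_source(e, {"203.0.113.7"})]
```

Without the normalise step, suspicious_source would need one branch per vendor field name, which is exactly the rewriting the common schema is meant to remove.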

The mental model

Two steps: parse (unstructured text → fields, via grok/regex/VRL), then normalise (rename those fields into a shared vocabulary). For the network engineer it's the exact reason you map every vendor's syslog into a common field set before building one dashboard across a mixed fleet — the detection logic should never have to care which box emitted the line.
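
As a rough sketch of the two steps, the example below parses one line with a regular expression and then renames the fields. It assumes the Apache Common Log Format; the pattern, the function names, and the sample line are illustrative, and the ECS names are written as flat dotted keys for brevity. A grok or VRL parser would follow the same shape.

```python
import re
from datetime import datetime

# Assumption: Apache Common Log Format. Adjust the pattern for other formats.
CLF = re.compile(
    r'^(?P<clientip>\S+) \S+ (?P<user>\S+) \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)


def parse(line: str):
    """Step 1: unstructured text -> typed fields. Returns None if no match."""
    match = CLF.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Type the values so later stages compare numbers and times, not strings.
    fields["ts"] = datetime.strptime(fields["ts"], "%d/%b/%Y:%H:%M:%S %z")
    fields["status"] = int(fields["status"])
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields


def normalise(fields: dict) -> dict:
    """Step 2: rename the parsed fields into the shared ECS vocabulary."""
    return {
        "@timestamp": fields["ts"].isoformat(),
        "source.ip": fields["clientip"],
        "user.name": fields["user"],
        "http.request.method": fields["method"],
        "url.path": fields["path"],
        "http.response.status_code": fields["status"],
        "http.response.body.bytes": fields["size"],
    }


line = ('203.0.113.7 - alice [10/Oct/2000:13:55:36 -0700] '
        '"GET /index.html HTTP/1.0" 200 2326')
event = normalise(parse(line))
```

Returning None for a non-matching line, rather than raising or emitting a half-filled event, keeps malformed input visible so it can be counted and inspected.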

This module exists because parsing is AI's home turf and where it fails most silently. A model writes a grok or VRL parser for an unfamiliar format in seconds — genuinely useful.

The gotcha

A parser that drops 5% of lines, mislabels a field, or mangles a timestamp looks completely fine until a detection quietly misses the one event that mattered. The single discipline that saves you: always check the parse rate and the actual field values against the raw log. A green pipeline is not a correct pipeline — the same lesson as module 01, one layer down.
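
One way to make that discipline concrete is to have the parser report its own parse rate and keep every rejected line for inspection. The sketch below is only an illustration: the sshd-style pattern, the parse_report name, and the sample lines are hypothetical, and what counts as an acceptable rate is a decision for your own pipeline.

```python
import re

# Hypothetical sshd-style pattern, for illustration only.
PATTERN = re.compile(
    r"^Failed password for (?:invalid user )?(?P<user>\S+) "
    r"from (?P<rhost>\S+) port (?P<port>\d+)"
)


def parse_report(lines):
    """Parse every line; return parsed events, rejected lines, and parse rate."""
    parsed, rejected = [], []
    for line in lines:
        match = PATTERN.match(line)
        if match:
            # Keep the raw line next to the fields so values can be spot-checked.
            parsed.append({"raw": line, **match.groupdict()})
        else:
            rejected.append(line)
    total = len(parsed) + len(rejected)
    rate = len(parsed) / total if total else 0.0
    return parsed, rejected, rate


sample = [
    "Failed password for root from 203.0.113.7 port 22 ssh2",
    "Failed password for invalid user admin from 198.51.100.4 port 50122 ssh2",
    "Failed password for root from 203.0.113.7",  # truncated line
    "Accepted publickey for deploy from 192.0.2.10 port 40022 ssh2",
]
parsed, rejected, rate = parse_report(sample)
print(f"parse rate: {rate:.0%}, rejected: {len(rejected)}")
for line in rejected:
    print("REJECTED:", line)
```

The rejected list matters as much as the number: reading it shows whether the misses are harmless noise or an entire event type the pattern never covered.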

AI caveat

This is AI's home turf — a model writes a grok/VRL parser for an unfamiliar format in seconds. It's also where it fails silently: a parser that drops 5% of lines or mislabels a field looks fine until a detection misses. Always check the parse rate and the field values against the raw log.

Learn (~4 hrs)

Pipelines

  • Vector documentation: a modern, fast log pipeline; read the "Quickstart" and the VRL (transform language) intro.
  • Elastic Common Schema (ECS): the normalisation target, a shared field set so detections work across sources.

What good logs contain

  • OWASP Logging Cheat Sheet: what parseable, useful logs should include.

Key concepts

  • Parsing: unstructured text → fields (grok / regex / VRL)
  • Normalisation to a common schema (ECS)
  • Enrichment (geoIP, asset/user context)
  • Handling malformed / multiline logs
  • Why normalisation makes one detection work across many sources
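
Two of the concepts above, multiline handling and enrichment, can be sketched briefly. Everything here is an assumption made for illustration: continuation lines are taken to be those starting with whitespace (as in a stack trace), the ASSETS inventory and its values are hypothetical, and the asset context is attached under the ECS labels field using flat dotted keys elsewhere in the event.

```python
import ipaddress


def join_multiline(lines):
    """Fold indented continuation lines into the record that precedes them."""
    record = None
    for line in lines:
        line = line.rstrip("\n")
        if line[:1] in (" ", "\t") and record is not None:
            record += "\n" + line  # continuation of the current record
        else:
            if record is not None:
                yield record
            record = line  # start of a new record
    if record is not None:
        yield record


# Hypothetical asset inventory keyed by network.
ASSETS = {
    ipaddress.ip_network("10.0.5.0/24"): {"team": "payments", "environment": "prod"},
}


def enrich(event: dict) -> dict:
    """Attach asset context to an event whose source.ip is in a known network."""
    address = ipaddress.ip_address(event["source.ip"])
    for network, context in ASSETS.items():
        if address in network:
            return {**event, "labels": context}
    return event  # unknown address: pass the event through unchanged


raw = [
    "2024-01-01T00:00:00Z ERROR request failed",
    "  Traceback (most recent call last):",
    '    File "app.py", line 10, in handler',
    "2024-01-01T00:00:01Z INFO request ok",
]
records = list(join_multiline(raw))
enriched = enrich({"source.ip": "10.0.5.20"})
```

Joining has to happen before parsing, otherwise each continuation line reaches the parser alone, fails to match, and drags the parse rate down for reasons unrelated to the pattern.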

Check yourself

  • Why can't you write one "suspicious source IP" detection that works across Apache, sshd, and a firewall until you've normalised — what specifically breaks?
  • What's the difference between parsing and normalising, and why do you need both?
  • Your AI-drafted parser runs clean and the pipeline is green — what single number tells you whether it's actually correct?
