Module 02: File Triage & Identification

Type 2 · Misconception Reveal: disprove "the extension tells you the type" and "packed = malicious" by classifying unknown files on real format and per-section entropy signals. Deliverable: a triage report that routes each sample to the right deeper analysis. (Secondary: Tool-Build, wrapping the checks into a reusable classifier.)

Last reviewed: 2026-06

Malware Analysis — the first question is not "what does it do?" but "what is it?" — misidentifying a file type is how analysis goes wrong from the first step.

Difficulty: Intermediate · Estimated time: ~4–6 hrs (study + lab) · Prerequisites: Foundations

In 60 seconds

The first triage question is not "what does it do?" but "what is it?" — and the extension lies. A file's true identity lives in its magic bytes (MZ, %PDF, PK); its inner structure and per-section entropy tell you whether it's packed. Triage is routing, not analysis: for each file answer (1) true type, (2) packed or not, (3) which deeper path it warrants — in seconds, at scale. An Emotet wave is hundreds of files, and the doc and the packed PE it drops go down two different paths.

Why this matters

File type identification is the entry point to every analysis workflow. Before you run a single tool, you need to know what you are actually looking at. An attacker who renames a DLL to a .pdf, or embeds a PE inside a ZIP inside a Word document, is betting that the analyst or the security tool will take the extension at face value. Triage cuts through that. Emotet is the canonical reason this step exists at scale: for years one of the highest-volume threats in the world, it arrived as a phishing attachment (a Word doc, or a ZIP hiding one) that dropped a packed PE downloader — exactly the nested "document → script → packed executable" container that defeats extension-trust. A triage analyst working an Emotet wave is not analysing one file; they are routing hundreds, and the first decision per file is what is this, and is it packed? (Emotet — MITRE ATT&CK S0367 documents the spearphishing-attachment delivery (T1566.001) and its modular downloader role.)
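
The nested-container bet described above can be checked mechanically for the ZIP layer. The following is a minimal sketch, assuming a single level of nesting and using only the standard library; the function name `find_embedded_executables` and the sample member names are hypothetical. It lists archive members that start with `MZ` regardless of what their names claim.

```python
import io
import zipfile


def find_embedded_executables(blob: bytes) -> list[str]:
    """Return ZIP member names whose content starts with the PE magic MZ."""
    hits = []
    with zipfile.ZipFile(io.BytesIO(blob)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            with archive.open(info) as member:
                if member.read(2) == b"MZ":
                    hits.append(info.filename)
    return hits


# Build a small archive in memory: one member is a PE stub named like a PDF.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("invoice.pdf", b"MZ\x90\x00" + b"\x00" * 60)
    archive.writestr("readme.txt", b"see attached")

print(find_embedded_executables(buffer.getvalue()))  # ['invoice.pdf']
```

The sketch does not recurse into archives inside archives, and it does not handle legacy OLE documents, which are not ZIP containers.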

Objective

Given a set of unknown files, classify each one correctly by type, determine its format and packing/obfuscation status, and produce a triage report that tells the analysis team what deeper work is warranted.

The core idea

The mental model

The extension is a lie — or at best a polite suggestion. The real identity of a file lives in its first few bytes: the magic number. A PE always starts MZ (0x4D 0x5A); a PDF starts %PDF; a ZIP begins PK (0x50 0x4B). The file command has done this correctly for decades and is still the right first tool. The mistake is stopping there — magic bytes name the outer container, but the question that matters is what the inner structure is telling you.
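
As a rough illustration of the magic-number check, here is a minimal sketch that labels a file from its first bytes. The signature table is deliberately tiny, the labels and function names are hypothetical, and it is no substitute for `file` or libmagic, which know thousands of formats.

```python
# Signatures are checked in order; all of these sit at offset 0.
MAGIC_SIGNATURES = [
    (b"MZ", "pe"),                                 # DOS/PE executable
    (b"%PDF", "pdf"),
    (b"PK\x03\x04", "zip"),                        # also docx, xlsx, jar, apk
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole"),  # legacy doc, xls, msi
]


def sniff_type(header: bytes) -> str:
    """Return a coarse type label from the first bytes of a file."""
    for signature, label in MAGIC_SIGNATURES:
        if header.startswith(signature):
            return label
    return "unknown"


def sniff_file(path: str) -> str:
    """Read only the header of a file on disk and classify it."""
    with open(path, "rb") as handle:
        return sniff_type(handle.read(16))


print(sniff_type(b"MZ\x90\x00\x03"))  # pe
print(sniff_type(b"%PDF-1.7"))        # pdf
print(sniff_type(b"just some text"))  # unknown
```

Note that `zip` here only names the outer container: a modern Office document and a plain archive both return the same label, which is exactly why the inner structure still has to be inspected.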

A PE file with a section named UPX0 and entropy above 7.0 (on a scale of 0 to 8 bits per byte) is almost certainly packed: the actual code is compressed or encrypted and won't be visible to static analysis until unpacked. Entropy is the single fastest heuristic for "there is something hidden here": random-looking, high-entropy content in a section that should contain legible code is the signature of packing, encryption, or self-modifying shellcode. Packed samples are not rare; most commodity malware is packed.
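
For reference, the entropy figure quoted above is Shannon entropy over byte values, measured in bits per byte. A minimal standard-library sketch (the function name is hypothetical) looks like this:

```python
import math
import os
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    total = len(data)
    if total == 0:
        return 0.0
    counts = Counter(data)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


print(shannon_entropy(b"\x00" * 4096))           # constant padding: 0.0
print(shannon_entropy(bytes(range(256)) * 16))   # every byte equally likely: 8.0
print(shannon_entropy(os.urandom(65536)) > 7.9)  # random data sits just under 8
```

In practice you would feed it the raw bytes of each PE section rather than the whole file. The pefile library exposes a per-section `get_entropy()` method that should give a comparable figure without hand-rolling the maths.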

The practitioner translation: triage is routing, not analysis. Your goal in this step is to answer three questions for each file: (1) what is its true type, (2) is it packed or obfuscated, and (3) what deeper analysis path does it warrant? Spending an hour doing string extraction on a packed PE is wasted effort; the right move is to note it as packed, flag it for unpacking or dynamic analysis, and move on.
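
One way to keep the three questions honest is to make them the fields of the report row. The sketch below is an assumption about what such a record could look like, not a prescribed report format; the class name, field names, and sample file names are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TriageRecord:
    sample: str     # file name as received
    true_type: str  # question 1: from magic bytes, not the extension
    packed: bool    # question 2: packing or obfuscation suspected
    route: str      # question 3: which deeper path it warrants


def to_report(records: list[TriageRecord]) -> str:
    """Serialise records as JSON Lines, one sample per line."""
    return "\n".join(json.dumps(asdict(record)) for record in records)


records = [
    TriageRecord("invoice.doc", "ole", False, "document / macro path (Module 11)"),
    TriageRecord("dropped.bin", "pe", True, "unpacking path (Module 09)"),
]
report = to_report(records)
print(report)
```

JSON Lines is used here only because it appends cleanly when a queue of hundreds of samples is processed one at a time; a CSV would serve equally well.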

```mermaid
flowchart TD
    F["unknown file"] --> M{"magic bytes?"}
    M -->|"MZ (PE)"| E{"section entropy<br/>> 7.0?"}
    M -->|"%PDF / PK / OLE"| DOC["document / macro path<br/>(Module 11)"]
    E -->|"yes — packed"| UP["unpacking path<br/>(Module 09)"]
    E -->|"no"| ST["static analysis path<br/>(Module 03)"]
```

This is the information the incident team needs to prioritise the queue. An Emotet attachment makes the routing concrete: the doc itself goes to the document/macro path (Module 11), while the packed downloader PE it drops gets flagged for unpacking (Module 09). Two different files, two different paths, decided in seconds at the triage step.
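
The flowchart above can be sketched as a single function. This is a minimal sketch under stated assumptions: the type labels (`pe`, `pdf`, `zip`, `ole`) are hypothetical, the 7.0 cut-off is the heuristic from the chart rather than a verdict, and the `manual review` fallback for unrecognised types is an addition that the chart does not show.

```python
def route(true_type: str, max_section_entropy: float = 0.0,
          threshold: float = 7.0) -> str:
    """Map a true type and the highest section entropy to a deeper path."""
    if true_type == "pe":
        if max_section_entropy > threshold:
            return "unpacking path (Module 09)"
        return "static analysis path (Module 03)"
    if true_type in {"pdf", "zip", "ole"}:
        return "document / macro path (Module 11)"
    # Assumption: anything the chart does not cover goes to a human.
    return "manual review"


print(route("ole"))      # document / macro path (Module 11)
print(route("pe", 7.6))  # unpacking path (Module 09)
print(route("pe", 6.1))  # static analysis path (Module 03)
```

Keeping the threshold as a parameter makes it easy to swap the fixed cut-off for a section-aware rule later.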

The gotcha

"Packed = malicious" is false, and so is a fixed entropy threshold. High entropy is suspicious relative to the section type — installers and legitimately compressed assets are high-entropy too. Packing is a routing signal ("needs unpacking before static analysis"), not a verdict; treat 7.0 as a heuristic to investigate, not a label to apply.

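To make "suspicious relative to the section type" concrete, here is a minimal sketch that compares each section against a ceiling for that kind of section instead of one global number. The ceiling values are invented for illustration and are not calibrated against any corpus; the function name is hypothetical. The output is a list of sections to investigate, not a verdict.

```python
# Assumption: illustrative ceilings only. Resources often hold compressed
# images, so .rsrc is allowed to run hotter than code or data sections.
EXPECTED_CEILING = {".text": 6.8, ".data": 6.5, ".rdata": 6.5, ".rsrc": 7.6}
DEFAULT_CEILING = 7.0


def sections_to_investigate(section_entropy: dict[str, float]) -> list[str]:
    """Return section names whose entropy exceeds what that section type usually shows."""
    flagged = []
    for name, entropy in section_entropy.items():
        ceiling = EXPECTED_CEILING.get(name, DEFAULT_CEILING)
        if entropy > ceiling:
            flagged.append(name)
    return flagged


# High-entropy code is worth a look; equally high-entropy resources are not.
print(sections_to_investigate({".text": 7.6, ".rsrc": 7.4}))  # ['.text']
print(sections_to_investigate({".text": 6.2, ".rsrc": 7.4}))  # []
```

A fixed 7.0 rule would have flagged nothing different in the first case but would also flag the resources of many ordinary installers, which is the false positive the gotcha warns about.
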
Go deeper: compiler/packer fingerprinting with DIE

Detect-It-Easy (DIE) goes further than file because it parses internal structure and applies heuristics: it identifies the compiler (MSVC, GCC, Delphi), the packer (UPX, MPRESS, Themida), and the architecture. That matters for routing — a Delphi binary behaves differently under a debugger than an MSVC one; an ARM binary can't be detonated in a standard x86 sandbox. File type and compiler fingerprint together determine which tools come next.
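
The architecture part of that fingerprint can be read directly from the PE header, which is useful when DIE is not at hand. The sketch below is not how DIE works internally; it only reads the `Machine` field of the COFF header as laid out in the PE specification. The function name and the short label strings are hypothetical.

```python
import struct

# Machine values from the PE/COFF specification.
MACHINE_NAMES = {
    0x014C: "x86",
    0x8664: "x64",
    0x01C0: "ARM",
    0x01C4: "ARM (Thumb-2)",
    0xAA64: "ARM64",
}


def pe_machine(data: bytes) -> str:
    """Return the target architecture of a PE image, or 'not a PE'."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        return "not a PE"
    # e_lfanew at offset 0x3C points to the "PE\0\0" signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if len(data) < pe_offset + 6 or data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        return "not a PE"
    (machine,) = struct.unpack_from("<H", data, pe_offset + 4)
    return MACHINE_NAMES.get(machine, f"unknown (0x{machine:04X})")


# Synthetic header: MZ, padding, e_lfanew = 0x40, PE signature, Machine = x64.
stub = (b"MZ" + b"\x00" * 58 + struct.pack("<I", 0x40)
        + b"PE\x00\x00" + struct.pack("<H", 0x8664))
print(pe_machine(stub))  # x64
```

A result of `ARM64` on a sample bound for an x86 sandbox is a routing decision in itself: it needs a different detonation environment before any dynamic work starts.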

Learn (~3 hrs)

What files actually are

- PE Format Reference (Microsoft Docs): the specification for Windows PE executables; read the overview and the "MS-DOS Stub" through "Section Table" sections to understand what you're parsing.

A real high-volume sample to triage (~15 min)

- Emotet (MITRE ATT&CK S0367): read the overview and delivery techniques to understand the exact container chain triage exists to unpack: spearphishing attachment (T1566.001) → macro/PowerShell (T1059.001) → packed PE downloader. This is the family behind a typical triage queue.

Entropy and packing

- Malware Traffic Analysis, Packing and Obfuscation: browse a real triage case; note how entropy graphs appear in analyst writeups.

Python tools

- pefile documentation: the library reference; focus on the sections, imports, and entropy APIs.
- python-magic docs: a thin wrapper around libmagic; read the examples section.

Key concepts

- Magic bytes vs. file extensions
- Entropy as a packing/obfuscation signal
- PE section structure: .text, .data, .rsrc, and what high-entropy sections mean
- Compiler/packer fingerprinting with DIE
- Triage as routing: what analysis path does each file warrant?
- MITRE ATT&CK T1027: Obfuscated Files or Information
- Real worked family: Emotet (modular downloader). Its phishing-doc-to-packed-PE chain is the nested container triage is built to route; the doc and the dropped PE take different analysis paths

AI acceleration

AI can generate a triage script quickly, but it tends to treat entropy thresholds as fixed ("above 7.0 = packed") when the real answer is "high entropy relative to the section type is suspicious." Have the AI draft the classifier, then test it against a known-clean file and a known-packed one. If it classifies both as "low risk," the threshold logic is wrong and you own that miss.
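
The check described above can be sketched as a tiny harness. Everything here is an assumption for illustration: the two byte buffers are synthetic stand-ins for a known-clean and a known-packed file (real testing should use real samples), the labels `low risk` and `needs unpacking` are hypothetical, and `entropy_based` is only a placeholder for whatever classifier the AI drafted.

```python
import math
import os
from collections import Counter


def entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte."""
    total = len(data)
    if total == 0:
        return 0.0
    return sum((n / total) * math.log2(total / n) for n in Counter(data).values())


# Synthetic stand-ins: repetitive text for "clean", random bytes for "packed".
CLEAN_LIKE = b"push ebp; mov ebp, esp; ret; " * 2048
PACKED_LIKE = os.urandom(len(CLEAN_LIKE))


def check_classifier(classify) -> list[str]:
    """Return a list of failures; an empty list means both checks passed."""
    failures = []
    if classify(CLEAN_LIKE) != "low risk":
        failures.append("clean-like sample was not rated low risk")
    if classify(PACKED_LIKE) == "low risk":
        failures.append("packed-like sample was rated low risk")
    return failures


def always_low_risk(data: bytes) -> str:
    return "low risk"  # a broken draft that never flags anything


def entropy_based(data: bytes) -> str:
    return "needs unpacking" if entropy(data) > 7.0 else "low risk"


print(check_classifier(always_low_risk))  # ['packed-like sample was rated low risk']
print(check_classifier(entropy_based))    # []
```

Passing this harness only shows the classifier separates the two extremes; it says nothing about high-entropy but legitimate content such as installers.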

Check yourself

- Why is the file extension the least trustworthy piece of identification, and what replaces it?
- A .text section reads at entropy 7.6. Is that a verdict? What does it actually route the file to?
- Triage routes an Emotet attachment to two different deeper paths. Which two files, and which paths?
