Module 08: Decompilation & Code Analysis

Type 6 · Reconstruct: decompile a binary in a free decompiler, identify the core algorithm (an XOR/RC4 key-scheduling loop), and rename variables to match the recovered logic. Deliverable: an annotated code listing ready to drop into an analysis report. (Secondary: Adversarial Review. Treat decompiler/AI output as a guess and cross-check it against capa.)

Last reviewed: 2026-06

Malware Analysis — turn compiled binaries back into readable logic so you can confirm what a sample actually does.

Difficulty: Advanced · Estimated time: ~4–6 hrs (study + lab) · Prerequisites: Foundations

In 60 seconds

Decompilation goes one level above disassembly — it reconstructs source-like structure (if, while, typed calls) so you read the algorithm, not the opcodes. But the output is pseudo-C: a compiler's-eye guess with right data flow and wrong names, types, and occasionally control flow. Your job is to test the guess, not trust it — find the loop boundary, trace what each variable accumulates, match it to a known primitive (RC4 KSA, XOR), and rename as you go. The cross-check that closes the case is capa from Module 04 agreeing independently.

Why this matters

Disassembly shows you the CPU's view: opcodes, registers, branches. Decompilation goes one level higher and reconstructs source-like structure: if, while, function calls with typed arguments. That step narrows the gap between "I see instructions" and "I understand the algorithm." For an analyst, it is the difference between spending two hours decoding a loop by hand and spending ten minutes reading what retdec or Ghidra's decompiler reconstructed. Employers hire analysts who can read decompiled output fluently, not analysts who have memorized every opcode. IcedID (BokBot) shows why this matters: this modular banking trojan hides its payloads as binaries embedded in RC4-encrypted PNG files (MITRE: T1027.003 steganography, T1027.002 software packing). To pull that payload, you decompile the loader, recognise the RC4 key-scheduling and stream-generation loops, and confirm the algorithm, which is precisely the worked example this module builds toward. (IcedID: MITRE ATT&CK S0483.)

Objective

Open a compiled binary in a free decompiler, identify the core algorithm (an XOR-based key-scheduling loop), rename variables to match your mental model, and produce a brief annotated code listing that you can drop into an analysis report.

The core idea

The mental model

Decompilers do not recover the original source — they make an educated guess at the structure the compiler was given. The output is pseudo-C: right data flow, but wrong variable names, wrong types, and occasionally wrong control flow where the optimiser was aggressive. Your job is not to trust the guess; it is to test it. The fastest test is to find the loop boundary, trace what each variable accumulates over iterations, and check whether that pattern matches a known algorithm.
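To test a "this loop is RC4" hypothesis you need a known-good reference to compare against. The following is a minimal Python sketch of textbook RC4 (key scheduling plus keystream generation), not code recovered from any sample; the function names are illustrative. The comments point at the features you would look for in pseudo-C: a 256-iteration fill, an index that accumulates state and key bytes, and a swap.

```python
def rc4_ksa(key: bytes) -> list[int]:
    """Key-scheduling algorithm: permute a 256-entry state table using the key."""
    if not key:
        raise ValueError("key must be non-empty")
    state = list(range(256))      # identity fill 0..255, often its own small loop
    j = 0
    for i in range(256):          # loop boundary: exactly 256 iterations
        # j accumulates the current state byte plus a key byte, modulo 256
        j = (j + state[i] + key[i % len(key)]) % 256
        # the swap of state[i] and state[j] is the most recognisable step
        state[i], state[j] = state[j], state[i]
    return state


def rc4_crypt(key: bytes, data: bytes) -> bytes:
    """Stream generation (PRGA): XOR each data byte with the next keystream byte."""
    state = rc4_ksa(key)
    i = j = 0
    out = bytearray()
    for byte in data:
        i = (i + 1) % 256
        j = (j + state[i]) % 256
        state[i], state[j] = state[j], state[i]
        out.append(byte ^ state[(state[i] + state[j]) % 256])
    return bytes(out)
```

A quick sanity check is the widely published test vector: `rc4_crypt(b"Key", b"Plaintext").hex()` gives `bbf316e8d940af0ad3`. Because RC4 is a pure XOR stream cipher, calling `rc4_crypt` again with the same key decrypts, so one routine in the sample usually serves both directions.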

The gotcha

The hardest output to read is not the one with complex logic — it is the one with inlined copies of the same small function. Compilers aggressively inline short routines, so a four-line key-mixing loop can appear copy-pasted twelve times with different starting values. Recognise the copy and you only have to understand it once; miss it and you re-derive the same loop a dozen times (or, worse, conclude they differ).
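One way to spot inlined copies is to normalise each suspect loop so that variable names and constants no longer matter, then group the loops whose normalised text is identical. The sketch below assumes you have pasted loop bodies out of the decompiler pane as strings; the tokenizer, the keyword list, and the function names are all illustrative simplifications rather than part of any tool.

```python
import hashlib
import re
from collections import defaultdict

# Hex literals, decimal literals, identifiers, then any single non-space symbol.
_TOKEN = re.compile(r"0[xX][0-9a-fA-F]+|\d+|[A-Za-z_]\w*|\S")
_NUMBER = re.compile(r"0[xX][0-9a-fA-F]+|\d+")
_IDENT = re.compile(r"[A-Za-z_]\w*")
_KEYWORDS = {"for", "while", "do", "if", "else", "return", "break", "continue"}


def normalize(snippet: str) -> str:
    """Replace constants with N and identifiers with v0, v1, ... in first-seen order."""
    names: dict[str, str] = {}
    out = []
    for tok in _TOKEN.findall(snippet):
        if _NUMBER.fullmatch(tok):
            out.append("N")
        elif _IDENT.fullmatch(tok) and tok not in _KEYWORDS:
            out.append(names.setdefault(tok, f"v{len(names)}"))
        else:
            out.append(tok)
    return " ".join(out)


def group_copies(snippets: dict[str, str]) -> list[list[str]]:
    """Return groups of labels whose normalised bodies hash to the same value."""
    buckets: dict[str, list[str]] = defaultdict(list)
    for label, body in snippets.items():
        digest = hashlib.sha256(normalize(body).encode()).hexdigest()
        buckets[digest].append(label)
    return [labels for labels in buckets.values() if len(labels) > 1]
```

Treat a match as a candidate only. Collapsing every constant to `N` deliberately hides the "different starting values", so once two loops group together, diff their constants by hand to confirm they really are the same routine.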

Go deeper: rename as you go, then corroborate with capa

Stripped symbols and inlining are the day-one obstacles, and the counter-move for both is the same: rename as you go. Give every function a hypothesis name the instant you form one (maybe_rc4_ksa, xor_decrypt) and update it on evidence; Ghidra and retdec both persist renames. An analyst who leaves everything named FUN_00401020 is reading, not analysing. The cross-check a decompiler can't replace: match the logic against capa (Module 04). When capa flags "encode data using XOR" and the decompiler shows a repeating-key XOR loop, you have two independent lines of evidence. That convergence is what goes in the report.
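The convergence check can be made mechanical. The sketch below assumes a capa JSON report (produced with `capa -j`) whose top level has a `rules` object keyed by rule name; verify that layout against the capa version you run. The function names, the hypothesis names, and the keywords are illustrative.

```python
import json


def load_capa_report(path: str) -> dict:
    """Load a capa JSON report from disk (assumed to come from `capa -j`)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def corroborate(capa_report: dict, hypotheses: dict[str, str]) -> dict[str, list[str]]:
    """Map each hypothesis name to the capa rule names containing its keyword.

    `hypotheses` maps a renamed function (e.g. "maybe_rc4_ksa") to a keyword
    expected in a matching capa rule name (e.g. "rc4"). An empty list means
    capa offers no independent support for that rename yet.
    """
    rule_names = list(capa_report.get("rules", {}))
    return {
        func_name: [rule for rule in rule_names if keyword.lower() in rule.lower()]
        for func_name, keyword in hypotheses.items()
    }
```

A rename with an empty result is not wrong, only uncorroborated: keep the `maybe_` prefix until either capa or a manual data-flow trace backs it up.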

Learn (~3 hrs)

How decompilers work
- Ghidra, NSA's open-source decompiler (official course, Units 1–3): the canonical free training; Units 1–3 cover the decompiler pane specifically (~1.5 hrs).
- retdec, the open-source decompiler from Avast: the README and wiki explain how to invoke it headlessly; read the "Usage" section and the output format doc (~20 min).

Reading decompiled output
- Practical Malware Analysis, Ch. 6, "Recognizing C Code Constructs in Assembly": covers loops, arrays, and structs in compiled output; the decompiler produces the same patterns one abstraction higher.

XOR and RC4 as recurring malware primitives
- MITRE ATT&CK T1027.013, Obfuscated Files or Information: Encrypted/Encoded File: why XOR-family ciphers are the most common obfuscation; links to real CTI reports (~15 min).
- IcedID, MITRE ATT&CK S0483: a real family that embeds payloads in RC4-encrypted PNG files (T1027.003 / T1027.002); read its entry to see RC4-as-payload-crypto attributed to a documented banking trojan, which is the decompilation target shape this module practises (~15 min).
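For comparison with the RC4 material, the simplest member of the XOR family is a repeating-key XOR. This is a minimal sketch with an illustrative function name; in decompiled output the same idea usually appears as a loop whose key index is taken modulo the key length.

```python
from itertools import cycle


def xor_repeating(data: bytes, key: bytes) -> bytes:
    """XOR every data byte with the key, restarting the key when it runs out."""
    if not key:
        raise ValueError("key must be non-empty")
    return bytes(b ^ k for b, k in zip(data, cycle(key)))
```

Two properties make this loop easy to confirm once you suspect it. Applying the function twice with the same key returns the original bytes, so one routine both encodes and decodes. And wherever the plaintext is zero bytes (padding, for example), the output is the key itself: `xor_repeating(b"\x00\x00\x00", b"AB")` returns `b"ABA"`.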

Key concepts

Decompiled output is pseudo-C: correct data flow, uncertain types and names.
Rename every function and variable the moment you form a hypothesis; update on evidence.
Inlined functions appear as repeated identical loops — recognize the copy, understand it once.
Cross-check decompiler findings against capa output for independent corroboration.
Stripped binaries have no symbol names; your renamed hypothesis names are the analysis artifact.
Real worked family: IcedID (BokBot) hides payloads in RC4-encrypted PNGs (T1027.003); recognising the RC4 KSA/PRGA in decompiled loader output is exactly this module's skill.

AI acceleration

Feed a decompiled function to a model and ask it to identify the algorithm and rename variables. The model is often right on common primitives (RC4 KSA, CRC32, base64 decode). Your job: verify every rename against the actual data flow — models hallucinate variable relationships. Use the model's hypothesis as a starting point, not a conclusion, and document which renames came from AI and which from manual verification.
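Documenting provenance is easier if every rename is logged the moment it is proposed. The ledger below is one possible shape for that record, not a feature of Ghidra or retdec; the class names, field names, and the "ai"/"manual" labels are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Rename:
    old: str              # decompiler-assigned name, e.g. "FUN_00401020"
    new: str              # hypothesis name, e.g. "maybe_rc4_ksa"
    source: str           # "ai" or "manual"
    verified: bool = False
    evidence: str = ""    # what in the data flow confirmed the rename


@dataclass
class RenameLedger:
    entries: list[Rename] = field(default_factory=list)

    def propose(self, old: str, new: str, source: str) -> Rename:
        """Record a rename hypothesis and where it came from."""
        if source not in {"ai", "manual"}:
            raise ValueError("source must be 'ai' or 'manual'")
        entry = Rename(old=old, new=new, source=source)
        self.entries.append(entry)
        return entry

    def verify(self, old: str, evidence: str) -> None:
        """Mark every rename of `old` as checked against the data flow."""
        for entry in self.entries:
            if entry.old == old:
                entry.verified = True
                entry.evidence = evidence

    def unverified_ai(self) -> list[Rename]:
        """AI-suggested renames that still lack manual confirmation."""
        return [e for e in self.entries if e.source == "ai" and not e.verified]
```

Before the annotated listing goes into a report, `unverified_ai()` should return an empty list; anything left in it is still a model's guess.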

AI caveat

A model nails common primitives (RC4 KSA, CRC32, base64) but hallucinates variable relationships — so verify every rename against the actual data flow and record which renames came from AI vs. manual work. The model's labels are a hypothesis; the data flow is the proof.

Check yourself

Decompiler output is pseudo-C. Which parts of it should you trust, and which should you treat as a guess?
You see what looks like the same 4-line loop a dozen times. What is the compiler likely doing, and why does it matter?
You think a loop is RC4. What independent line of evidence (from an earlier module) corroborates that for the report?
