Lab 04: RAG, and the retrieval eval that keeps it honest
Setup
git clone https://github.com/plaintext-security/plaintext-labs
cd plaintext-labs/ai-augmented-ops/04-rag
make up && make demo
Requirements: Docker, 6 GB RAM free. No GPU needed.
Three containers start: Ollama (generation), ChromaDB (vector store), and a lab container with the
ingest, query, and eval scripts. First run downloads tinyllama (~637 MB) and nomic-embed-text
(~274 MB). make demo ingests the knowledge base, runs one query, shows the retrieved chunks
alongside the generated answer — and then runs the retrieval eval over the labelled query set and
prints the scorecard, so you see the build and its measurement in one pass.
The retrieval eval (make eval) scores recorded retrievals against a committed labelled query set. Once ingested, that scoring is deterministic and needs no live generation: the same offline, CI-friendly discipline as Module 11.
Scenario
A security team maintains a knowledge base of runbooks, past incident summaries, and detection notes. During an incident, analysts waste time hunting the right runbook in the wiki. You build a RAG pipeline that answers a natural-language question from those actual documents — and then you do the thing most teams skip: you prove the retrieval works. A confident answer over the wrong runbook is worse than no answer; the eval is how you catch it before an analyst trusts it at 3 a.m.
Everything runs locally. No cloud API keys, no external targets, no authorization needed.
Do
Part A: Build & operate the RAG
- [ ] make demo and read the full output carefully. Find:
  - The retrieved chunks: which documents did the retrieval step return?
  - The generated answer: what did the model produce?
  - Any fact in the answer that is not in the retrieved chunks (hallucination-on-context). Write one sentence: judged by the prose alone, would you have noticed if the retrieval was wrong?
- [ ] make shell, then open scripts/ingest.py. Read the chunking logic: what is CHUNK_SIZE in characters, and what is the overlap? In data/knowledge-base/, which document is longest? Would the current chunk size capture a complete procedure step from that document? Note your reasoning; you'll measure the effect of this dial in Part B.
- [ ] Run three more queries with scripts/query.py and eyeball relevance. For each, were the retrieved chunks relevant, and did the answer reflect them? Then run a query for something not in the corpus and document what the model does when retrieval finds nothing.
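The chunking question above (does a chunk of CHUNK_SIZE characters with some overlap still contain a whole procedure step?) can be reasoned about with a small sketch. This is not the code in scripts/ingest.py, which may chunk differently; chunk_text, step_survives, the sample runbook text, and the size/overlap pairs are all hypothetical and chosen only for illustration.

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


def step_survives(step_text: str, chunks: list[str]) -> bool:
    """True if the whole procedure step sits inside at least one chunk."""
    return any(step_text in chunk for chunk in chunks)


# Hypothetical runbook text, not taken from data/knowledge-base/.
runbook = (
    "1. Disable the compromised account in the identity provider. "
    "2. Revoke all active sessions and refresh tokens for that user. "
    "3. Rotate any API keys the account could read."
)
step = "2. Revoke all active sessions and refresh tokens for that user."

# Assumed (chunk_size, overlap) pairs; substitute the values you read in ingest.py.
for size, overlap in [(150, 50), (40, 10)]:
    chunks = chunk_text(runbook, size, overlap)
    print(f"size={size} overlap={overlap} chunks={len(chunks)} "
          f"step intact={step_survives(step, chunks)}")
```

A chunk size smaller than the step itself can never hold it whole, and even a larger size can split a step across a boundary if the overlap is too short, which is why Part B measures the effect rather than guessing it.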
Part B: Build the retrieval eval (the deliverable)
- [ ] Read the labelled query set and understand why it's held out. data/eval-queries.json maps ~15 realistic SOC questions to their known-relevant source document(s): the answer key for retrieval. These queries are separate from the demo question and were never used to tune the chunking. Skim three: confirm each one's "relevant" doc is a judgment about the source, not a keyword match (e.g. a query about "a stolen password" should map to the credential-incident runbook even though it never says "stolen"). That semantic gap is exactly what recall@k tests.
- [ ] Score recall@k. make eval embeds each labelled query, retrieves the top-k chunks, and checks whether at least one genuinely relevant chunk appears, writing results/retrieval-scorecard.md with recall@1 / @3 / @5 plus the per-query misses. Read it as a number, not a vibe. Which queries missed at k=3, and why: phrasing, chunk boundaries, or vocabulary?
- [ ] Score groundedness. make eval also reports a minimal groundedness check on the generated answers (span overlap between the answer's claims and the retrieved chunks; an answer making claims absent from its context scores low). Find the query whose answer read most confidently but scored worst on groundedness; that gap is the silent failure the module is about. Note where simple span-overlap is too crude and would need an LLM-grader (and why that re-introduces the eval-the-evaluator problem).
- [ ] Prove the regression gate catches a real change. make gate runs the eval with a declared floor (recall_at_k=0.80) and exits non-zero if retrieval drops below it. Now cause a regression: shrink the chunk size to a value too small to bracket a procedure step (make ingest CHUNK_SIZE=120), re-run make gate, and watch recall@3 collapse and the gate go red (exit 1). Restore the good chunk size, re-ingest, and confirm the gate goes green (exit 0). The green-on-good / red-on-regression contrast is the whole point: a "harmless" chunking tweak can silently tank retrieval, and the gate is what stops it merging.
- [ ] Extend the held-out set with a hard case. Add one document to data/knowledge-base/ (a short 150–300-word runbook on a topic not yet covered, e.g. insider-threat containment), write two labelled queries for it in data/eval-queries.json (one phrased like the document, one phrased unlike it), re-ingest, and re-run make eval. Does retrieval find it on both? The phrased-unlike query is the one that exposes embedding/vocabulary gaps, exactly the case that "more easy queries" would never catch (coverage ≠ effectiveness).
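The recall@k scoring described above (a query counts as a hit if at least one relevant source appears in its top-k retrievals) can be sketched in a few lines. This is a minimal illustration, not the contents of scripts/eval.py: the record shape ("query" / "relevant"), the document filenames, and the retrieved rankings are all assumptions made up for the example, and the real data/eval-queries.json may be laid out differently.

```python
def hit_at_k(retrieved_sources: list[str], relevant: set[str], k: int) -> bool:
    """True if any of the top-k retrieved sources is labelled relevant."""
    return any(source in relevant for source in retrieved_sources[:k])


def recall_at_k(labelled: list[dict], retrieved: dict[str, list[str]], k: int):
    """Return (fraction of queries with a hit in the top k, list of missed queries)."""
    misses = [
        item["query"]
        for item in labelled
        if not hit_at_k(retrieved.get(item["query"], []), set(item["relevant"]), k)
    ]
    return (len(labelled) - len(misses)) / len(labelled), misses


# Hypothetical labelled set: each query maps to its known-relevant document(s).
labelled = [
    {"query": "someone's password was stolen", "relevant": ["credential-incident.md"]},
    {"query": "laptop is encrypting files", "relevant": ["ransomware-containment.md"]},
    {"query": "odd lookups to long random subdomains", "relevant": ["dns-tunnelling.md"]},
]

# Hypothetical recorded retrievals: source document of each chunk, best match first.
retrieved = {
    "someone's password was stolen": [
        "phishing-triage.md", "credential-incident.md", "dns-tunnelling.md"],
    "laptop is encrypting files": [
        "ransomware-containment.md", "phishing-triage.md", "credential-incident.md"],
    "odd lookups to long random subdomains": [
        "phishing-triage.md", "credential-incident.md",
        "ransomware-containment.md", "dns-tunnelling.md"],
}

for k in (1, 3, 5):
    score, misses = recall_at_k(labelled, retrieved, k)
    print(f"recall@{k} = {score:.2f}  misses: {misses}")
```

Scoring from recorded rankings like this keeps the metric deterministic: the same labelled set and the same retrievals always give the same number, so a change in the score points at a change in the pipeline.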
Success criteria: you're done when
- [ ] make demo runs to completion: retrieved chunks + generated answer + the retrieval scorecard.
- [ ] make eval produces results/retrieval-scorecard.md with recall@1/@3/@5 and a groundedness number, plus per-query misses.
- [ ] You can state, in writing, why a RAG needs a retrieval metric and not just an "answer reads well" check, and you've seen one confident answer score low on groundedness.
- [ ] make gate exits 0 on the good pipeline and 1 after CHUNK_SIZE=120; you've watched recall@3 collapse and recover.
- [ ] Your new document + its two labelled queries are committed, ingested, and scored.
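The groundedness number in the criteria above is described in Part B as span overlap between the answer's claims and the retrieved chunks. One crude way to read that description is sketched here: split the answer into sentences, treat each as a claim, and call a claim supported when enough of its content words appear in the context. The function names, the stopword list, the 0.6 threshold, and the sample texts are all assumptions; the lab's own check may be implemented differently.

```python
import re

# Small illustrative stopword list; a real check would likely use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "on", "for",
             "is", "are", "was", "be", "that", "this", "it", "with", "then"}


def content_tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def split_claims(answer: str) -> list[str]:
    """Treat each sentence of the answer as one claim."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]


def groundedness(answer: str, chunks: list[str], threshold: float = 0.6):
    """Return (share of claims supported by the chunks, list of unsupported claims)."""
    context = content_tokens(" ".join(chunks))
    claims = split_claims(answer)
    if not claims:
        return 0.0, []
    unsupported = []
    for claim in claims:
        tokens = content_tokens(claim)
        overlap = len(tokens & context) / len(tokens) if tokens else 0.0
        if overlap < threshold:
            unsupported.append(claim)
    return (len(claims) - len(unsupported)) / len(claims), unsupported


# Hypothetical retrieved chunk and generated answer.
chunks = ["Disable the compromised account in the identity provider, "
          "then revoke all active sessions."]
answer = ("Disable the compromised account and revoke active sessions. "
          "Notify the regulator within 72 hours.")

score, unsupported = groundedness(answer, chunks)
print(f"groundedness = {score:.2f}")
print("unsupported claims:", unsupported)
```

Token overlap of this kind cannot tell a paraphrase from an unsupported claim, or a negated sentence from its original, which is the crudeness the lab asks you to note before reaching for an LLM grader.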
Deliverables
data/eval-queries.json (the labelled query set, including your two added queries) +
scripts/eval.py (with any metric/gate change you made) + data/knowledge-base/<your-runbook>.md +
results/retrieval-scorecard.md, all committed. The eval-as-code is the artifact: a labelled
held-out query set, a recall@k / groundedness scorecard, and a gate that fails on a retrieval
regression. Do not commit live run dumps (results/predictions-*.json, raw chunk text) — they're
gitignored and regenerate from the corpus + eval. The ingested collection + the corpus additions are
the retrieval backend the SOC copilot reuses in Module 06.
Automate & own it
Required. Wire the retrieval gate into CI so a regression cannot merge. Add a
.github/workflows/rag-eval.yml (in your own portfolio repo) that, on every PR, ingests the corpus
and runs:
CHUNK_SIZE=120 regression.
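The gate that CI runs boils down to comparing a scorecard against declared floors and turning the result into an exit code. A minimal sketch follows, assuming the scorecard is available as a dict of metric name to value; the gate and main functions, the dict shape, and the two sample scorecards are hypothetical (the numbers are made up for illustration, not lab results), and only the recall_at_k=0.80 floor comes from the lab text.

```python
def gate(scorecard: dict[str, float], floors: dict[str, float]) -> list[str]:
    """Return one failure message per metric that is missing or below its floor."""
    failures = []
    for metric, floor in floors.items():
        value = scorecard.get(metric)
        if value is None:
            failures.append(f"{metric}: missing from scorecard")
        elif value < floor:
            failures.append(f"{metric}: {value:.2f} is below the floor {floor:.2f}")
    return failures


def main(scorecard: dict[str, float], floors: dict[str, float]) -> int:
    """Print the verdict and return the exit code a CI step would use."""
    failures = gate(scorecard, floors)
    for failure in failures:
        print("GATE FAIL", failure)
    if not failures:
        print("GATE PASS")
    return 1 if failures else 0


FLOORS = {"recall_at_k": 0.80}  # the declared floor named in Part B

# Illustrative scorecards only; real values come from your own eval run.
good_run = {"recall_at_k": 0.87}
regressed_run = {"recall_at_k": 0.40}

print("good pipeline exit code:", main(good_run, FLOORS))
print("regressed pipeline exit code:", main(regressed_run, FLOORS))
# A real script would finish with sys.exit(main(scorecard, FLOORS)) so CI sees the code.
```

Treating a missing metric as a failure is a deliberate choice in this sketch: a gate that passes when the eval silently produced nothing would hide exactly the kind of regression it exists to catch.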
AI acceleration
Have a model expand the labelled query set — ask it for SOC questions phrased unlike the source documents (the hard, vocabulary-gap cases recall@k is meant to catch) — then label them yourself: open the source doc and confirm which chunk is genuinely relevant. A model labelling its own answer key is the contamination Module 11 warns about; it proposes queries, you own the ground truth. Then paste a low-groundedness answer plus its retrieved chunks into a frontier model and ask "which claims here are not supported by this context?" — and check its verdict against the chunks yourself, because trusting the grader uncritically is the same mistake one level up.
Connects forward
The ingested collection and the retrieval eval both feed Module 06: the SOC copilot retrieves from this corpus, and its end-to-end scorecard reuses this recall@k + groundedness check as the retrieval half. Module 11 is where this harness is generalised: the same held-out + scorecard + gate discipline, across triage and RAG together. Module 09 attacks this pipeline: a document injected into the knowledge base can poison retrieval and manipulate answers, and your retrieval eval becomes the regression test that proves the poisoning stays fixed once you mitigate it.
Marketable proof
"I can build a RAG pipeline grounded in a private corpus — nomic-embed for embeddings, ChromaDB for the vector store, Ollama for generation — and I built the retrieval eval that proves it works: a labelled held-out query set, a recall@k and groundedness scorecard, and a CI gate that fails when a chunking or embedding change drops recall. I measure retrieval, I don't trust the prose."
Stretch
- Add hybrid search: combine ChromaDB vector similarity with a keyword filter, then re-run make eval and report whether recall@k improved or regressed. Let the number decide, not intuition.
- Upgrade groundedness from span-overlap to an LLM-graded check (does each answer claim follow from the retrieved chunks?), and write up where it disagreed with span-overlap and why that grader now needs its own eval.
- Sweep CHUNK_SIZE across several values, plot recall@3 against chunk size, and pick the operating point deliberately: the chunking dial, tuned by measurement instead of feel.