Building a RAG system is easy. Building one that works well requires measurement. Without evaluation, you're guessing about quality.
Two Things to Evaluate
1. Retrieval Quality
Did we find the right documents?
Recall@k: Of all relevant documents, how many did we retrieve in the top K results? If there are 3 relevant docs and you retrieve 2 of them in the top 5, recall@5 = 2/3 ≈ 67%.
Precision@k: Of the K documents retrieved, how many are actually relevant? If you retrieve 5 docs and 2 are relevant, precision@5 = 40%.
MRR (Mean Reciprocal Rank): How high does the first relevant result rank? Each query gets the reciprocal of that rank (first position = 1.0, second = 0.5, third ≈ 0.33), and MRR averages this across all queries.
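The three metrics above are a few lines of Python each. A minimal sketch, assuming `retrieved` is a ranked list of document IDs and `relevant` is the set of IDs that should have been found:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant docs that appear in the top-k results."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are relevant."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / k

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1 / rank
    return 0.0

# The example from the text: 3 relevant docs, 2 of them retrieved in the top 5.
retrieved = ["a", "b", "c", "d", "e"]   # top-5 results, in rank order
relevant = {"b", "d", "z"}              # "z" was never retrieved

print(recall_at_k(retrieved, relevant, 5))     # 2/3 ≈ 0.67
print(precision_at_k(retrieved, relevant, 5))  # 2/5 = 0.4
print(reciprocal_rank(retrieved, relevant))    # first hit at rank 2 → 0.5
```

For MRR over a whole test set, average `reciprocal_rank` across all queries.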
2. Generation Quality
Did the answer correctly use the retrieved information?
Faithfulness: Does the answer only claim things supported by the retrieved chunks? No hallucination beyond the source material.
Relevance: Does the answer actually address the question?
Completeness: Did the answer use all relevant information from the chunks?
Building a Golden Dataset
You need a set of question-answer pairs where you know the correct answer and which documents should be retrieved.
{
"question": "What was the pricing model agreed with Comfama?",
"expected_answer": "Fee blocks by project size: S, M, L, XL",
"relevant_docs": ["2025-12-12-pricing-strategy-comfama.md"],
"relevant_sections": ["Discussion", "Decisions"]
}
Start with 20-30 pairs covering different query types:
Factual lookups ("When did we last meet with X?")
Synthesis ("What are the main themes across client meetings?")
Temporal ("What changed in the pricing model between Q3 and Q4?")
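Once the pairs exist, scoring retrieval against them is a short loop. A sketch, assuming you supply a `search(question, k)` function that wraps your retriever and returns ranked document IDs:

```python
golden = [
    {
        "question": "What was the pricing model agreed with Comfama?",
        "relevant_docs": ["2025-12-12-pricing-strategy-comfama.md"],
    },
    # ... the remaining 20-30 pairs
]

def evaluate_retrieval(search, golden, k=5):
    """Average recall@k over the golden dataset.

    `search` is a placeholder for your own retriever: a callable
    (question, k) -> list of doc IDs in rank order.
    """
    total = 0.0
    for item in golden:
        retrieved = search(item["question"], k)
        relevant = set(item["relevant_docs"])
        hits = len(set(retrieved[:k]) & relevant)
        total += hits / len(relevant)
    return total / len(golden)
```

Run this after every change to chunking, embeddings, or ranking; a single averaged number makes regressions visible immediately.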
LLM-as-Judge
Instead of manually grading every answer, use a strong LLM to evaluate:
System: You are evaluating a RAG system's answer.
Given the source documents and the generated answer,
rate faithfulness (1-5) and relevance (1-5).
Explain any issues.
Source: [retrieved chunks]
Question: [user query]
Answer: [generated answer]
This scales evaluation. You can run it on every answer in your test set automatically.
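The judge loop needs two pieces: formatting the prompt and parsing scores back out of free text. A hedged sketch, where `llm` is any callable `str -> str` you supply as a thin wrapper around your model provider's API (the prompt wording and score format below are assumptions, not a fixed spec):

```python
import re

JUDGE_PROMPT = """You are evaluating a RAG system's answer.
Given the source documents and the generated answer,
rate faithfulness (1-5) and relevance (1-5), each on its own line
in the form "faithfulness: 4". Explain any issues.

Source: {chunks}
Question: {question}
Answer: {answer}"""

def judge(llm, chunks, question, answer):
    """Ask the judge model to score one answer; return scores plus raw reply.

    A score is None when the reply does not contain it in the expected
    "metric: N" form, so malformed judge output is visible, not silently 0.
    """
    reply = llm(JUDGE_PROMPT.format(chunks=chunks, question=question, answer=answer))
    scores = {}
    for metric in ("faithfulness", "relevance"):
        match = re.search(rf"{metric}\s*[:=]\s*([1-5])", reply, re.IGNORECASE)
        scores[metric] = int(match.group(1)) if match else None
    return scores, reply
```

Asking for a fixed "metric: N" line format keeps the parsing trivial; for stricter guarantees you can request JSON output instead and parse with `json.loads`.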
Most improvements come from retrieval fixes, not prompt engineering. If the wrong documents are retrieved, no prompt will fix the output.
❓ Quiz 1
Your RAG system retrieves 5 documents for a query. 3 are relevant out of 4 total relevant docs in the corpus. What is recall@5?
Recall@5 = relevant docs retrieved / total relevant docs = 3/4 = 75%. It measures how many of the relevant documents we found, out of all that exist.
🛠 Exercise 1
Create 5 evaluation question-answer pairs for a RAG system over your Obsidian meeting notes. For each, write: the query, the expected answer, and which source document(s) should be retrieved. Cover different query types: factual, synthesis, and temporal.