Chunking and Embeddings

Retrieval starts with how you break documents into pieces (chunking) and how you represent those pieces numerically (embeddings). These decisions determine what your RAG system can and can't find.

Chunking Strategies

Documents need to be split into chunks small enough to be relevant but large enough to contain useful context.

By headers/sections

Split at H2 or H3 boundaries. Each section becomes a chunk. Works great for structured documents like your meeting notes (which have headers for topics, action items, etc.).
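A minimal sketch of header-based splitting, assuming H2 boundaries (`## `) and ignoring any text before the first header; the `Section` shape here is illustrative, not a required schema:

```typescript
// Split a markdown document at H2 ("## ") boundaries.
// Text before the first H2 (e.g. a title line) is skipped in this sketch.
interface Section {
  section: string; // the H2 header text
  content: string; // everything until the next H2
}

function splitByH2(markdown: string): Section[] {
  const sections: Section[] = [];
  let current: Section | null = null;
  for (const line of markdown.split("\n")) {
    const match = line.match(/^##\s+(.*)/);
    if (match) {
      if (current) sections.push(current);
      current = { section: match[1].trim(), content: "" };
    } else if (current) {
      current.content += line + "\n";
    }
  }
  if (current) sections.push(current);
  return sections.map((s) => ({ ...s, content: s.content.trim() }));
}
```

Each returned section maps directly to one chunk, so topic boundaries are never cut mid-thought.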

Fixed-size with overlap

Split every N tokens with M tokens of overlap. Simple, works for any document. But you'll cut through sentences and ideas.
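A sketch of the sliding window, using whitespace-separated words as a stand-in for tokens (a real implementation would count tokens with a tokenizer such as tiktoken):

```typescript
// Fixed-size chunking with overlap. Words approximate tokens here.
function fixedSizeChunks(text: string, size: number, overlap: number): string[] {
  if (overlap >= size) throw new Error("overlap must be smaller than chunk size");
  const words = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  const step = size - overlap; // how far the window advances each iteration
  for (let i = 0; i < words.length; i += step) {
    chunks.push(words.slice(i, i + size).join(" "));
    if (i + size >= words.length) break; // last window reached the end
  }
  return chunks;
}
```

The overlap means the last few words of one chunk reappear at the start of the next, so an idea cut at a boundary still survives intact in at least one chunk.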

Semantic chunking

Use the LLM or embeddings to identify topic boundaries. More accurate but slower and more expensive.

For your Obsidian vault:

Your meeting notes already have structure — YAML frontmatter, headers, sections. Chunking by section (H2) is the obvious choice. Each chunk gets metadata from the frontmatter (date, project, attendees).

Meeting: "2025-12-12 Pricing strategy Comfama"

Chunk 1: { section: "Discussion", content: "...", date: "2025-12-12", project: "Comfama" }
Chunk 2: { section: "Action Items", content: "...", date: "2025-12-12", project: "Comfama" }
Chunk 3: { section: "Decisions", content: "...", date: "2025-12-12", project: "Comfama" }

Embeddings

Embeddings convert text into vectors — arrays of numbers that capture semantic meaning. Similar texts have similar vectors.
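"Similar vectors" is usually measured with cosine similarity — the angle between two vectors, where 1 means same direction, 0 means unrelated, and -1 means opposite. A minimal sketch:

```typescript
// Cosine similarity between two embedding vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Retrieval then reduces to: embed the query, compute similarity against every stored chunk vector, and return the top matches (vector stores like pgvector do this scan with an index instead of brute force).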

Popular embedding models:

  • text-embedding-3-small (OpenAI): Cheap, good enough for most cases, 1536 dimensions
  • text-embedding-3-large (OpenAI): Better quality, 3072 dimensions, 2x cost
  • voyage-3 (Voyage AI): Strong on code and technical content
  • Local models: nomic-embed-text, bge-small — free, run on your machine

Vector stores:

  • Supabase pgvector: You already use Supabase. Enable the pgvector extension, add an embedding column, done. Zero new infrastructure.
  • Pinecone: Managed, fast, scales well. Extra service to maintain.
  • ChromaDB: Local, good for prototyping. Not great for production.

For your stack, Supabase pgvector is the obvious choice — no new service, you already have the client set up.

The Chunk Size Tradeoff

Small chunks (100-200 tokens):

  • More precise retrieval
  • Risk losing surrounding context
  • More chunks to search
  • Better for specific facts

Large chunks (500-1000 tokens):

  • More context per chunk
  • Risk retrieving irrelevant text
  • Fewer chunks, faster search
  • Better for narrative/reasoning

For meeting notes: 300-500 tokens per chunk is a good starting point. Each section of a meeting typically falls in this range naturally.
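A quick way to sanity-check chunk sizes is the rough heuristic of ~4 characters per token for English text (a real tokenizer like tiktoken gives exact counts; the 300-500 bounds below are just this lesson's starting point):

```typescript
// Rough token estimate: ~4 characters per token for English text.
// Heuristic only — use a real tokenizer for exact counts.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Check whether a chunk falls in the suggested 300-500 token range.
function isGoodChunkSize(text: string, min = 300, max = 500): boolean {
  const tokens = estimateTokens(text);
  return tokens >= min && tokens <= max;
}
```

Running this over your chunked vault quickly flags sections that are too short to stand alone or long enough to be worth splitting further.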

❓ Quiz 1
For your Obsidian meeting notes (which have YAML frontmatter and H2 sections), what's the best chunking strategy?
Your notes are already well-structured. Chunking by sections preserves natural topic boundaries, and the frontmatter gives you metadata (date, project, attendees) for filtering.

Review

Time to consolidate what you learned.

🎮 Chunking Strategy Picker
Click each item to move it between categories. Place all items, then click Check.
Items to sort:

  • Meeting notes with H2 sections
  • Obsidian vault entries
  • Mixed-format wikis
  • Research papers
  • Raw chat logs
  • API docs with structured headers
  • Legal contracts with nested clauses
  • CSV data dumps
  • Long unstructured emails

Categories: By Headers · Fixed-Size · Semantic
🛠 Exercise 1
Write a TypeScript function that takes a markdown meeting note (string) and returns an array of chunks. Each chunk should be an object with: content (the section text), section (the H2 header name), and metadata extracted from YAML frontmatter (date, project, attendees if present). Use gray-matter for frontmatter parsing.