Learning Goals#

  • Explain embeddings and similarity at a high level.

  • Chunk and embed a small corpus with a modern sentence encoder.

  • Index vectors and run top-k similarity search (FAISS or exact NN).

  • Assess retrieval quality and iterate on chunking/top_k.

Module 2: Document Retrieval and Embeddings#

Part of the RCD Workshops series: RAG for Research Applications#

In this module, we’ll dive into how to fetch relevant documents for RAG, covering both “classic” (keyword) and modern (embedding) approaches, with hands-on practice for each step.

2.1: From Keywords to Vectors: Why Classic Search Isn’t Enough#

Traditional document search relies on keyword matching — for example, using TF-IDF or BM25 — but this method misses synonyms and rephrasings. RAG leverages embeddings instead: both documents and queries are mapped to dense vectors that reflect semantic meaning, so relevant documents can be found even when they share no words with the query.

By “dense vectors,” we mean that each document or query is represented as a point in a high-dimensional space, where similar meanings are closer together. This allows us to find relevant documents based on their semantic content rather than just exact word matches.

# Environment check: these packages are used throughout this module.
# If an import fails, install the missing package first (e.g. pip install faiss-cpu).
import numpy, faiss, torch
print("NumPy", numpy.__version__)
print("FAISS", faiss.__version__, "(CPU) | Torch", torch.__version__, "CUDA", torch.version.cuda, "GPU", torch.cuda.is_available())
# EXERCISE: Classic Keyword Search
corpus = [
    'Impacts of climate change on global economies are substantial.',
    'Recent studies discuss worldwide financial losses due to global warming.',
    'I learned to sew in my high school home economics class.'
]
query = 'climate economics'
def keyword_search(query, docs):
    # Naive keyword match: return every doc containing any query word (case-insensitive)
    return [d for d in docs if any(word.lower() in d.lower() for word in query.split())]
keyword_search(query, corpus)
[Figure: Semantic similarity Venn diagram]

Above: only exact (or near-exact) keyword matches are returned. The paraphrased document about financial losses from global warming is missed entirely, while the irrelevant "home economics" sentence slips in because it happens to contain the word "economics".
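The same gap shows up with proper term weighting, not just naive substring matching. Below is a minimal sketch of TF-IDF retrieval over the same corpus and query, assuming scikit-learn is available (it is not used elsewhere in this module): the paraphrased document still scores zero because it shares no vocabulary with the query.

# Sketch: classic TF-IDF retrieval (assumes scikit-learn is installed)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vectorizer = TfidfVectorizer()
doc_vecs = vectorizer.fit_transform(corpus)   # sparse term-weight vectors, one per doc
query_vec = vectorizer.transform([query])     # reuse corpus/query from the exercise above

# Score each document against the query; only shared words can contribute
scores = cosine_similarity(query_vec, doc_vecs)[0]
for doc, score in sorted(zip(corpus, scores), key=lambda pair: -pair[1]):
    print(f'{score:.2f}  {doc}')

BM25 behaves similarly: it ranks keyword matches better than raw TF-IDF, but it still gives no credit to synonyms or rephrasings.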

2.2: What Are Embeddings?#

Embeddings are vector representations of text such that meaningfully similar texts have vectors close together in space.

[Figure: Image embeddings]

Let’s see a toy example:

# Import the sentence encoder and NumPy
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
sentences = [
    'Large language models can learn from research papers.',
    'AI systems use documents to answer questions.',
    'Bananas are yellow and tasty.'
]
embs = model.encode(sentences)

# Now we have embeddings for each sentence. Let's take a look at the first chunk of each embedding.
for i, s in enumerate(sentences):
    print(f"Sentence {i}: {s}")
    print(f"Embedding: {embs[i][:15]}...\n")  # Display first 15 elements of each embedding
# Now let's look at the cosine similarities among our documents.
def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
for i, s1 in enumerate(sentences):
    for j, s2 in enumerate(sentences):
        if i < j:
            print(f"Similarity('{s1}', '{s2}') = {cosine(embs[i], embs[j]):.2f}")

You should see noticeably higher similarity between the two topically related sentences and much lower similarity for the unrelated one (the banana sentence).
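If you prefer not to write the cosine function yourself, sentence-transformers ships a helper that computes the whole similarity matrix in one call. A minimal sketch, reusing embs and sentences from the cell above:

# Compute the pairwise cosine-similarity matrix with the built-in helper
from sentence_transformers import util

sim_matrix = util.cos_sim(embs, embs)  # 3 x 3 tensor of cosine similarities
for i, s1 in enumerate(sentences):
    for j, s2 in enumerate(sentences):
        if i < j:
            print(f"Similarity('{s1}', '{s2}') = {sim_matrix[i][j].item():.2f}")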


Quick Check: In your own words#

Why do we use embeddings instead of plain keyword search when building a RAG system?

from utils import create_answer_box
create_answer_box("**Your Answer:** We use embeddings instead of only keywords because...", question_id="mod2_why_embeddings")

2.3: Preparing Documents for Retrieval: Chunking and Embedding#

Documents are often too long for models to process at once. We break them into chunks (by token/paragraph) before embedding.

Why chunk?

  • Keeps each unit the right size for LLM input

  • Lets retrieval focus on topically coherent sections, which improves precision

Let’s practice chunking and embedding a custom document.

# Example: Manual chunking
doc = """
Retrieval-Augmented Generation (RAG) augments LLMs by allowing retrieval from external sources. \
Chunking splits text into manageable parts; for example, splitting by paragraph.

Embeddings allow searches to find relevant sections even if different words are used. Cosine similarity quantifies text closeness.

Document retrieval pipelines (using tools like FAISS) depend on these steps working well together.
"""
chunks = [c.strip() for c in doc.split('\n') if c.strip()]
for i, chunk in enumerate(chunks):
    print(f'Chunk {i+1}: {chunk}')
# Embed your chunks
chunk_embs = model.encode(chunks)

# Print the first 15 elements of each chunk embedding
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {chunk}")
    print(f"Embedding: {chunk_embs[i][:15]}...\n")  # Display first 15 elements of each embedding

Dataset: Demo Corpus#

We will use a tiny mixed-domain corpus (AI, Climate, Biomedical, Materials paper abstracts) stored in data/demo_corpus.jsonl.

from pathlib import Path
import pandas as pd

DATA_PATH = 'data/demo_corpus.jsonl'
df = pd.read_json(DATA_PATH, lines=True)
docs = df.to_dict('records')
print(f'Loaded {len(docs)} docs from {DATA_PATH}')
display(df.head())
docs[0:3]

2.4: Indexing Scientific Abstracts (with FAISS)#

We’ll go end-to-end using the demo corpus of scientific abstracts: chunk abstracts → encode chunks → index → retrieve.

# Build a tiny passage index from scientific abstracts with simple chunking
import faiss
import numpy as np

def chunk_text(text, max_chars=400):
    text = (text or '').strip()
    if not text:
        return []
    return [text[i:i+max_chars].strip() for i in range(0, len(text), max_chars)]

# Prepare chunk records from the loaded demo corpus (expects df/docs from above)
chunk_texts = []
chunk_meta = []
for d in docs:
    abs_text = d.get('abstract', '')
    pieces = chunk_text(abs_text, max_chars=400)
    for j, t in enumerate(pieces):
        if not t:
            continue
        chunk_texts.append(t)
        chunk_meta.append({'doc_id': d.get('id'), 'title': d.get('title'), 'chunk_id': j})

# Encode and normalize for cosine similarity via inner product
embs = model.encode(chunk_texts)
embs = np.array([v/np.linalg.norm(v) for v in embs], dtype='float32')
index = faiss.IndexFlatIP(embs.shape[1])
index.add(embs)

# Simple demo query over abstracts
query = 'How do RAG systems combine LLMs with retrieval?'
q = model.encode([query])[0]
q = (q/np.linalg.norm(q)).astype('float32')
D, I = index.search(np.array([q]), k=3)
for rank, (idx, score) in enumerate(zip(I[0], D[0]), start=1):
    m = chunk_meta[idx]
    snippet = chunk_texts[idx][:160].replace('\n',' ')
    print(f'#{rank} score={score:.3f}| {m["title"][:90]}...')
    print(f'   {snippet}...\n')
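One of this module's goals is iterating on top_k, so it helps to wrap the query path in a reusable function. The sketch below reuses model, index, chunk_texts, and chunk_meta from the cell above; the retrieve helper itself is illustrative:

# Sketch: a reusable retrieval helper so you can vary top_k easily
def retrieve(query, k=3):
    q = model.encode([query])[0]
    q = (q / np.linalg.norm(q)).astype('float32')
    scores, ids = index.search(np.array([q]), k=k)
    results = []
    for idx, score in zip(ids[0], scores[0]):
        if idx == -1:  # FAISS pads with -1 when fewer than k vectors are indexed
            continue
        m = chunk_meta[idx]
        results.append({'score': float(score), 'title': m['title'],
                        'chunk_id': m['chunk_id'], 'text': chunk_texts[idx]})
    return results

# Try a few k values and watch how scores fall off toward the tail
for k in (1, 3, 5):
    hits = retrieve('How do RAG systems combine LLMs with retrieval?', k=k)
    tail = hits[-1]['score'] if hits else float('nan')
    print(f"top_k={k}: {len(hits)} hits, lowest score={tail:.3f}")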

Quick Knowledge Check#

What would happen if you used a very long chunk size? Write a brief hypothesis about the kinds of results you’d expect from a retrieval module configured that way.

from utils import create_answer_box
create_answer_box("**Your Hypothesis:**\n- With a long chunk size..", question_id="mod2_longchunk")

End of Module 2#

You’ve now practiced the core steps of document retrieval for RAG: classic vs. semantic search, embedding, chunking, and vector indexing.

Next: We’ll assemble these building blocks into a complete RAG pipeline!