Embeddings and Semantic Search
Embeddings are the technology that enables agents to search through thousands of documents and find the ones most relevant to a question — not by matching keywords, but by understanding meaning. This is the foundation of RAG (Retrieval-Augmented Generation) and agent memory systems.
What Embeddings Are
An embedding is a list of numbers (a vector) that represents the meaning of a piece of text. Texts with similar meaning have vectors that are close together in mathematical space — regardless of the exact words used.
Example:
- "How do I reset my password?" → [0.23, -0.45, 0.78, ...]
- "I forgot my login credentials" → [0.25, -0.43, 0.76, ...]
- "What is the weather in Tokyo?" → [-0.82, 0.31, -0.15, ...]
The first two questions are semantically similar — their vectors are close. The weather question is different — its vector is far away.
Why this matters for agents: An agent can search through 10,000 documents and find the 5 most relevant to a question in milliseconds — by comparing vectors, not reading the full text of each document.
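To make "comparing vectors" concrete, here is a minimal brute-force sketch in plain Python. It reuses the truncated three-number vectors from the example above as if they were complete embeddings, which is an assumption for illustration only. The `top_k` helper is hypothetical. Real vector stores typically replace this linear scan with an approximate nearest-neighbor index.

```python
import heapq
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: list[float], corpus: dict[str, list[float]], k: int) -> list[tuple[str, float]]:
    """Return the k (text, score) pairs most similar to the query (hypothetical helper)."""
    scored = ((text, cosine_similarity(query, vec)) for text, vec in corpus.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy 3-dimensional vectors: the truncated values from the example above,
# treated as complete embeddings for illustration only.
corpus = {
    "I forgot my login credentials": [0.25, -0.43, 0.76],
    "What is the weather in Tokyo?": [-0.82, 0.31, -0.15],
}
query_vector = [0.23, -0.45, 0.78]  # "How do I reset my password?"

ranked = top_k(query_vector, corpus, k=1)
print(ranked[0][0])  # I forgot my login credentials
```

The search never looks at the words themselves; it only ranks stored vectors by how closely they point in the same direction as the query vector.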
Creating Embeddings
from langchain_openai import OpenAIEmbeddings
from langchain_community.embeddings import HuggingFaceEmbeddings
import numpy as np
# OpenAI embeddings (cloud, requires API key)
embeddings_model = OpenAIEmbeddings(model="text-embedding-3-small")
# text-embedding-3-small: 1536 dimensions, cheap (~$0.00002 per 1K tokens)
# text-embedding-3-large: 3072 dimensions, higher quality, more expensive
# Embed a single text
vector = embeddings_model.embed_query("How do I reset my password?")
print(f"Dimensions: {len(vector)}") # 1536
print(f"First 5 values: {vector[:5]}")
# Embed multiple texts at once (more efficient)
texts = [
    "How do I reset my password?",
    "I forgot my login credentials",
    "What is the weather in Tokyo?",
]
vectors = embeddings_model.embed_documents(texts)
print(f"Embedded {len(vectors)} texts")
# HuggingFace embeddings (local, no API cost)
local_embeddings = HuggingFaceEmbeddings(
    # Smaller, faster model for development and testing
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
Measuring Similarity
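Cosine similarity measures how closely two vectors point in the same direction (1.0 = same direction, 0 = unrelated, -1 = opposite):
def cosine_similarity(vec1: list[float], vec2: list[float]) -> float:
    """Cosine of the angle between two vectors (np is numpy, imported above)."""
    v1, v2 = np.array(vec1), np.array(vec2)
    # Dot product divided by the product of the vector lengths
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))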
# Embed three texts
q1 = embeddings_model.embed_query("How do I reset my password?")
q2 = embeddings_model.embed_query("I forgot my login credentials")
q3 = embeddings_model.embed_query("What is the weather in Tokyo?")
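# Exact values depend on the embedding model; the ordering is what matters.
print(cosine_similarity(q1, q2))  # higher score: both are about account access
print(cosine_similarity(q1, q3))  # noticeably lower score: unrelated topic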
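Cosine similarity ignores vector length and looks only at direction. One consequence, sketched below with made-up two-dimensional vectors, is that once vectors are scaled to length 1 a plain dot product gives the same number. The `normalize` and `dot` helpers are illustrative names, not part of any library. Whether a given embedding model already returns unit-length vectors is something to check in its documentation.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))


def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to length 1 while keeping its direction."""
    length = math.sqrt(dot(vec, vec))
    return [x / length for x in vec]


# Made-up vectors with easy arithmetic: both have length 5.
a = [3.0, 4.0]
b = [4.0, 3.0]

cosine = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))  # 24 / 25
unit_dot = dot(normalize(a), normalize(b))

print(round(cosine, 2))    # 0.96
print(round(unit_dot, 2))  # 0.96
```

This is why many stores normalize vectors once at insert time: every later comparison can then skip the two length computations.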
Vector Stores: Persisting and Searching Embeddings
A vector store persists embeddings and provides fast similarity search over them. For development, a local store such as Chroma is enough:
Chroma (Local, Zero Configuration)
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# Create from documents
documents = [
    Document(page_content="The return policy allows returns within 30 days.", metadata={"source": "policy.pdf", "page": 1}),
    Document(page_content="Orders over $50 qualify for free shipping.", metadata={"source": "shipping.pdf", "page": 1}),
    Document(page_content="Customer support is available 24/7 via chat.", metadata={"source": "support.pdf", "page": 1}),
    Document(page_content="Premium members get 20% off all orders.", metadata={"source": "membership.pdf", "page": 2}),
]
# This embeds each document and stores vectors to disk
vectorstore = Chroma.from_documents(
    documents=documents,
    embedding=embeddings,
    persist_directory="./chroma_db"
)
# Search by similarity
results = vectorstore.similarity_search("How can I return a product?", k=2)
for doc in results:
    print(f"Content: {doc.page_content}")
    print(f"Source: {doc.metadata['source']}")
    print()
Loading an Existing Chroma Store
# Load from disk (don't re-embed every time!)
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings
)
# Use as a retriever
retriever = vectorstore.as_retriever(
    search_type="similarity",  # Or "mmr" for diversity
    search_kwargs={"k": 4}
)
docs = retriever.invoke("What is the return policy?")
Similarity Search with Scores
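# similarity_search_with_score returns (Document, score) tuples.
# With Chroma the score is a distance, so LOWER means more similar.
results_with_scores = vectorstore.similarity_search_with_score(
    "How do I get free shipping?",
    k=3
)
for doc, score in results_with_scores:
    print(f"Distance: {score:.3f} | {doc.page_content[:80]}")
# For threshold filtering, use relevance scores instead:
# these are meant to fall in the 0..1 range, where HIGHER means more relevant.
results_with_relevance = vectorstore.similarity_search_with_relevance_scores(
    "How do I get free shipping?",
    k=3
)
relevant_docs = [
    doc for doc, score in results_with_relevance
    if score > 0.7  # Keep only results above 0.7; tune this per model and data
]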
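Scores are easy to misread because some stores report a distance (lower is better) and others a similarity (higher is better). The sketch below assumes unit-length vectors and a store that reports squared Euclidean distance, which is how Chroma's default `l2` space is documented; under those assumptions the two measures are linked by d² = 2 − 2·cos. The result tuples and the `max_distance` threshold are hypothetical values chosen for illustration.

```python
import math


def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_from_squared_l2(distance: float) -> float:
    """Convert squared L2 distance to cosine similarity.

    Valid only when both vectors have length 1: d^2 = 2 - 2*cos.
    """
    return 1.0 - distance / 2.0


# Two unit-length vectors (0.6^2 + 0.8^2 = 1).
a = [0.6, 0.8]
b = [0.8, 0.6]
distance = squared_l2(a, b)                  # 0.04 + 0.04 = 0.08
cosine = cosine_from_squared_l2(distance)    # 1 - 0.04 = 0.96

# Hypothetical (text, distance) results: filter with "<=", not ">".
results = [
    ("Orders over $50 qualify for free shipping.", 0.30),
    ("Premium members get 20% off all orders.", 1.10),
]
max_distance = 0.6  # assumed threshold; tune on real data
relevant = [text for text, dist in results if dist <= max_distance]
print(relevant)  # ['Orders over $50 qualify for free shipping.']
```

Before hard-coding any threshold, it is worth printing a few scores for queries with known good and bad matches to confirm which direction "better" runs in your setup.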
Building a Document Ingestion Pipeline
For real applications, you need a systematic pipeline to process and store documents:
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
# Newer LangChain releases ship the splitters in the langchain-text-splitters package
from langchain_text_splitters import RecursiveCharacterTextSplitter

def ingest_documents(documents_dir: str, vectorstore_dir: str) -> Chroma:
    """Load all PDFs from a directory and add to vector store."""
    # Load documents (PyPDFLoader yields one Document per page)
    loader = DirectoryLoader(
        documents_dir,
        glob="**/*.pdf",
        loader_cls=PyPDFLoader,
        show_progress=True
    )
    raw_documents = loader.load()
    print(f"Loaded {len(raw_documents)} document pages")
    # Split into chunks; sizes are measured in characters because length_function=len
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len,
    )
    chunks = splitter.split_documents(raw_documents)
    print(f"Split into {len(chunks)} chunks")
    # Create embeddings and store
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=vectorstore_dir
    )
    # _collection is a private attribute and may change between versions
    print(f"Stored {vectorstore._collection.count()} vectors")
    return vectorstore
# Run once to build the database
vectorstore = ingest_documents("./docs", "./chroma_db")
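The `chunk_size` and `chunk_overlap` settings are easier to reason about with a stripped-down model. The sketch below is a fixed-size sliding window, not what `RecursiveCharacterTextSplitter` actually does (that splitter also tries to break on paragraph and sentence separators). `chunk_text` is a hypothetical helper written for this illustration.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(text, chunk_size=10, chunk_overlap=4)
print(chunks)  # ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk repeats the last four characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk. The cost is more chunks, and therefore more vectors to embed and store.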
Metadata Filtering
Filter search results by metadata alongside semantic similarity:
# Only search documents from a specific source
results = vectorstore.similarity_search(
    "What are the pricing options?",
    k=4,
    filter={"source": "pricing.pdf"}  # Chroma metadata filter
)
# More complex filter
results = vectorstore.similarity_search(
    query="customer onboarding process",
    k=5,
    filter={"$and": [{"department": "sales"}, {"year": {"$gte": 2023}}]}
)
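Conceptually, a metadata filter narrows the candidate set and similarity then ranks whatever survives. The sketch below evaluates a small subset of the filter syntax used above (plain equality, `$gte`, and `$and`) over hypothetical in-memory records. It is a teaching model only and makes no claim about how Chroma implements filtering internally; `matches` and the sample records are assumptions.

```python
def matches(metadata: dict, where: dict) -> bool:
    """Evaluate a tiny subset of a Chroma-style filter: equality, $gte, $and."""
    for key, condition in where.items():
        if key == "$and":
            # Every sub-filter in the list must match
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif isinstance(condition, dict):
            for op, target in condition.items():
                if op != "$gte":
                    raise ValueError(f"unsupported operator in this sketch: {op}")
                value = metadata.get(key)
                if value is None or value < target:
                    return False
        elif metadata.get(key) != condition:
            return False
    return True


# Hypothetical records standing in for stored documents
docs = [
    {"text": "Sales onboarding checklist", "metadata": {"department": "sales", "year": 2024}},
    {"text": "Legacy sales onboarding guide", "metadata": {"department": "sales", "year": 2021}},
    {"text": "Engineering onboarding guide", "metadata": {"department": "engineering", "year": 2024}},
]
where = {"$and": [{"department": "sales"}, {"year": {"$gte": 2023}}]}

candidates = [d["text"] for d in docs if matches(d["metadata"], where)]
print(candidates)  # ['Sales onboarding checklist']
```

Only the surviving candidates would then be ranked by similarity, which is why a filter that is too strict can return fewer than `k` results.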
Maximal Marginal Relevance (MMR)
MMR balances relevance with diversity, so a search does not return four chunks that all say the same thing:
# Standard similarity: may return very similar chunks
results_sim = vectorstore.similarity_search("pricing", k=4)
# MMR: balances relevance with diversity
results_mmr = vectorstore.max_marginal_relevance_search(
    "pricing",
    k=4,
    fetch_k=20,  # Fetch 20 candidates, select 4 most diverse relevant ones
    lambda_mult=0.5  # 0 = max diversity, 1 = max relevance
)
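The selection rule behind MMR is short enough to write out. The sketch below is a from-scratch version of the standard greedy formulation, not LangChain's own code: at each step it picks the candidate that maximizes `lambda_mult * relevance - (1 - lambda_mult) * redundancy`, where redundancy is the highest similarity to anything already selected. The two-dimensional vectors are made up so that two candidates are near-duplicates.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query: list[float], candidates: list[list[float]], k: int,
        lambda_mult: float = 0.5) -> list[int]:
    """Greedy MMR: return indices of up to k candidates, in selection order."""
    selected: list[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i: int) -> float:
            relevance = cosine(query, candidates[i])
            # Highest similarity to anything already picked (0 if nothing picked yet)
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
candidates = [
    [0.9, 0.10],   # index 0: most relevant
    [0.9, 0.12],   # index 1: near-duplicate of index 0
    [0.7, -0.70],  # index 2: less relevant, but different
]

print(mmr(query, candidates, k=2, lambda_mult=0.5))  # [0, 2]
print(mmr(query, candidates, k=2, lambda_mult=1.0))  # [0, 1]
```

With `lambda_mult=1.0` the redundancy term vanishes and MMR reduces to plain similarity ranking, which picks the near-duplicate. At `0.5`, the duplicate is penalized enough that the distinct candidate wins the second slot.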
Choosing the Right Embedding Model
| Model | Dimensions | Cost | Quality | Best For |
|---|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | Low | High | Production (cost-effective) |
| OpenAI text-embedding-3-large | 3072 | Medium | Highest | When quality matters most |
| all-MiniLM-L6-v2 (local) | 384 | Free | Good | Development, no API cost |
| all-mpnet-base-v2 (local) | 768 | Free | Better | Production with local model |
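For production, text-embedding-3-small is a reasonable default: it offers a strong balance of quality and cost. Re-evaluate on your own data before committing, because switching models later means re-embedding every document.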
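Dimensions and price translate directly into storage and spend, so a quick estimate helps when comparing rows in the table. The sketch below assumes vectors are stored as 32-bit floats and ignores index and metadata overhead. The 10,000-chunk corpus and the 250-tokens-per-chunk average are hypothetical, and the price is the approximate figure quoted earlier for text-embedding-3-small, which may change.

```python
BYTES_PER_FLOAT32 = 4  # assumption: each vector component is a 32-bit float


def raw_vector_storage_mb(num_vectors: int, dimensions: int) -> float:
    """Raw vector storage in megabytes, excluding index and metadata overhead."""
    return num_vectors * dimensions * BYTES_PER_FLOAT32 / 1_000_000


def embedding_cost_usd(total_tokens: int, usd_per_1k_tokens: float) -> float:
    """One-off cost of embedding a corpus at a given per-1K-token price."""
    return total_tokens / 1000 * usd_per_1k_tokens


num_chunks = 10_000          # hypothetical corpus size
avg_tokens_per_chunk = 250   # hypothetical average

for name, dims in [("all-MiniLM-L6-v2", 384),
                   ("text-embedding-3-small", 1536),
                   ("text-embedding-3-large", 3072)]:
    print(f"{name}: {raw_vector_storage_mb(num_chunks, dims):.2f} MB")

# Approximate price quoted earlier in this lesson for text-embedding-3-small
cost = embedding_cost_usd(num_chunks * avg_tokens_per_chunk, usd_per_1k_tokens=0.00002)
print(f"Embedding cost: ${cost:.2f}")
```

Under these assumptions the three models need about 15 MB, 61 MB, and 123 MB of raw vector storage respectively, and embedding the whole corpus with text-embedding-3-small costs about five cents.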
Next lesson: Pinecone and Chroma — choosing and configuring vector databases for production agents.