Embeddings & Semantic Search Explained | AI Agent Development Course | AiTechWorlds

Embeddings and Semantic Search

Embeddings are the technology that enables agents to search through thousands of documents and find the ones most relevant to a question — not by matching keywords, but by understanding meaning. This is the foundation of RAG (Retrieval-Augmented Generation) and agent memory systems.

What Embeddings Are

An embedding is a list of numbers (a vector) that represents the meaning of a piece of text. Texts with similar meaning have vectors that are close together in vector space, regardless of the exact words used.

Example:

"How do I reset my password?" → [0.23, -0.45, 0.78, ...]
"I forgot my login credentials" → [0.25, -0.43, 0.76, ...]
"What is the weather in Tokyo?" → [-0.82, 0.31, -0.15, ...]

The first two questions are semantically similar — their vectors are close. The weather question is different — its vector is far away.

Why this matters for agents: An agent can search through 10,000 documents and find the 5 most relevant to a question in milliseconds by comparing precomputed vectors instead of reading the full text of each document.
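The comparison described above can be sketched as a brute-force search: score every stored vector against the query vector and keep the top k. This is a minimal illustration that treats the truncated three-number vectors from the example as if they were complete embeddings (an assumption for demonstration only). `top_k` is a hypothetical helper, and production vector stores typically use approximate indexes rather than scanning every vector.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int = 5) -> list[tuple[str, float]]:
    """Return the k (text, score) pairs most similar to the query (hypothetical helper)."""
    scored = [(text, cosine(query_vec, vec)) for text, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 3-dimensional vectors from the example above (real embeddings are much longer)
doc_vecs = {
    "I forgot my login credentials": [0.25, -0.43, 0.76],
    "What is the weather in Tokyo?": [-0.82, 0.31, -0.15],
}
query_vec = [0.23, -0.45, 0.78]  # "How do I reset my password?"

best = top_k(query_vec, doc_vecs, k=1)
print(best[0][0])  # I forgot my login credentials
```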

Creating Embeddings

from langchain_openai import OpenAIEmbeddings
from langchain_community.embeddings import HuggingFaceEmbeddings
import numpy as np

# OpenAI embeddings (cloud, requires API key)
embeddings_model = OpenAIEmbeddings(model="text-embedding-3-small")
# text-embedding-3-small: 1536 dimensions, cheap (~$0.00002 per 1K tokens)
# text-embedding-3-large: 3072 dimensions, higher quality, more expensive

# Embed a single text
vector = embeddings_model.embed_query("How do I reset my password?")
print(f"Dimensions: {len(vector)}")  # 1536
print(f"First 5 values: {vector[:5]}")

# Embed multiple texts at once (more efficient)
texts = [
    "How do I reset my password?",
    "I forgot my login credentials",
    "What is the weather in Tokyo?"
]
vectors = embeddings_model.embed_documents(texts)
print(f"Embedded {len(vectors)} texts")

# HuggingFace embeddings (local, no API cost)
local_embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
    # Smaller, faster model for development and testing
)
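To make the pricing comment above concrete, the arithmetic can be sketched as follows. The helper name `estimate_embedding_cost` and the corpus size are hypothetical, and the default price is simply the approximate figure quoted in the comment above; check current pricing before relying on it.

```python
def estimate_embedding_cost(total_tokens: int, price_per_1k_tokens: float = 0.00002) -> float:
    """Estimate embedding cost in USD (hypothetical helper; the default price is the
    approximate figure quoted above and may be out of date)."""
    return (total_tokens / 1000) * price_per_1k_tokens

# Assumed corpus: 10,000 documents averaging 500 tokens each
total_tokens = 10_000 * 500
cost = estimate_embedding_cost(total_tokens)
print(f"Estimated cost for {total_tokens:,} tokens: ${cost:.2f}")
```

At that rate a one-time embedding pass over a corpus of this size costs cents, so the larger recurring cost is usually re-embedding when documents or models change.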

Measuring Similarity

Cosine similarity measures how closely two vectors point in the same direction (1.0 = same direction, 0 = unrelated, -1 = opposite):

import numpy as np

def cosine_similarity(vec1: list[float], vec2: list[float]) -> float:
    """Dot product of the two vectors divided by the product of their lengths."""
    v1, v2 = np.array(vec1), np.array(vec2)
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

# Embed three texts
q1 = embeddings_model.embed_query("How do I reset my password?")
q2 = embeddings_model.embed_query("I forgot my login credentials")
q3 = embeddings_model.embed_query("What is the weather in Tokyo?")

# Exact values depend on the embedding model; compare scores relative to each other
print(cosine_similarity(q1, q2))  # higher score: similar meaning
print(cosine_similarity(q1, q3))  # lower score: unrelated topic
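Two properties of cosine similarity are worth seeing in code: it ignores vector length, and for unit-length vectors it reduces to a plain dot product. A minimal sketch with made-up two-dimensional vectors (not real embeddings); the helper names are illustrative.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def normalize(v: list[float]) -> list[float]:
    """Scale a vector to length 1."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

a = [3.0, 4.0]
b = [4.0, 3.0]

# Scaling a vector changes its length but not its direction, so the score is unchanged
scaled_a = [10 * x for x in a]
same_after_scaling = math.isclose(cosine(a, b), cosine(scaled_a, b))

# After normalizing both vectors to length 1, the dot product equals the cosine similarity
ua, ub = normalize(a), normalize(b)
unit_dot = sum(x * y for x, y in zip(ua, ub))
dot_equals_cosine = math.isclose(unit_dot, cosine(a, b))

print(same_after_scaling, dot_equals_cosine)  # True True
print(cosine([1.0, 0.0], [-1.0, 0.0]))        # -1.0 (opposite directions)
print(cosine([1.0, 0.0], [0.0, 1.0]))         # 0.0 (perpendicular)
```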

Vector Stores: Persisting and Searching Embeddings

Vector stores persist embeddings and provide fast similarity search over them. For development, Chroma is a convenient choice:

Chroma (Local, Zero Configuration)

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Create from documents
documents = [
    Document(page_content="The return policy allows returns within 30 days.", metadata={"source": "policy.pdf", "page": 1}),
    Document(page_content="Orders over $50 qualify for free shipping.", metadata={"source": "shipping.pdf", "page": 1}),
    Document(page_content="Customer support is available 24/7 via chat.", metadata={"source": "support.pdf", "page": 1}),
    Document(page_content="Premium members get 20% off all orders.", metadata={"source": "membership.pdf", "page": 2}),
]

# This embeds each document and stores vectors to disk
vectorstore = Chroma.from_documents(
    documents=documents,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

# Search by similarity
results = vectorstore.similarity_search("How can I return a product?", k=2)
for doc in results:
    print(f"Content: {doc.page_content}")
    print(f"Source: {doc.metadata['source']}")
    print()
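Conceptually, a vector store does two things: it embeds and stores each document, then ranks the stored vectors against an embedded query. The sketch below shows that flow in memory only. `TinyVectorStore` and `toy_embed` are hypothetical names, and `toy_embed` is a word-count stand-in for a real embedding model, so it matches shared words rather than meaning.

```python
import math
from collections import Counter

def toy_embed(text: str) -> Counter:
    """Stand-in for an embedding model: a sparse word-count vector (NOT semantic)."""
    cleaned = text.lower().replace(".", " ").replace("?", " ")
    return Counter(cleaned.split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class TinyVectorStore:
    """In-memory illustration of add-then-search; not a real vector database."""

    def __init__(self) -> None:
        self._rows: list[tuple[Counter, str, dict]] = []

    def add(self, text: str, metadata: dict) -> None:
        # Embed once at insert time and keep the vector next to the text
        self._rows.append((toy_embed(text), text, metadata))

    def similarity_search(self, query: str, k: int = 2) -> list[tuple[str, dict]]:
        # Embed the query, score every stored vector, return the k best
        q = toy_embed(query)
        ranked = sorted(self._rows, key=lambda row: cosine(q, row[0]), reverse=True)
        return [(text, metadata) for _, text, metadata in ranked[:k]]

store = TinyVectorStore()
store.add("The return policy allows returns within 30 days.", {"source": "policy.pdf"})
store.add("Orders over $50 qualify for free shipping.", {"source": "shipping.pdf"})
store.add("Customer support is available 24/7 via chat.", {"source": "support.pdf"})

top_text, top_meta = store.similarity_search("What is the return policy?", k=1)[0]
print(top_meta["source"])  # policy.pdf
```

Swapping `toy_embed` for a real embedding model is what turns this keyword overlap into semantic search.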

Loading an Existing Chroma Store

# Load from disk (don't re-embed every time!)
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings
)

# Use as a retriever
retriever = vectorstore.as_retriever(
    search_type="similarity",   # Or "mmr" for diversity
    search_kwargs={"k": 4}
)

docs = retriever.invoke("What is the return policy?")
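One way to avoid re-embedding on every run is to build the store only when its directory is missing or empty. The sketch below shows that pattern generically. `load_or_build`, `fake_build`, and `fake_load` are hypothetical names; in practice the two callables would wrap calls such as `Chroma.from_documents(...)` and `Chroma(persist_directory=...)`.

```python
import tempfile
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")

def load_or_build(store_dir: str, build: Callable[[str], T], load: Callable[[str], T]) -> T:
    """Load an existing store if the directory has content, otherwise build it (hypothetical helper)."""
    path = Path(store_dir)
    if path.is_dir() and any(path.iterdir()):
        return load(store_dir)
    return build(store_dir)

# Stand-ins for the expensive "embed everything" step and the cheap "open from disk" step
def fake_build(store_dir: str) -> str:
    Path(store_dir).mkdir(parents=True, exist_ok=True)
    (Path(store_dir) / "index.bin").write_bytes(b"vectors")
    return "built"

def fake_load(store_dir: str) -> str:
    return "loaded"

with tempfile.TemporaryDirectory() as tmp:
    store_dir = str(Path(tmp) / "chroma_db")
    first = load_or_build(store_dir, fake_build, fake_load)   # directory missing: build
    second = load_or_build(store_dir, fake_build, fake_load)  # directory populated: load

print(first, second)  # built loaded
```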

Similarity Search with Scores

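# Returns (Document, score) tuples. With Chroma this score is a distance
# (L2 by default), so LOWER means more similar.
results_with_scores = vectorstore.similarity_search_with_score(
    "How do I get free shipping?",
    k=3
)

for doc, score in results_with_scores:
    print(f"Distance: {score:.3f} | {doc.page_content[:80]}")

# For thresholding, use relevance scores normalized to [0, 1] (higher = more relevant)
results_with_relevance = vectorstore.similarity_search_with_relevance_scores(
    "How do I get free shipping?",
    k=3
)

# Filter by minimum relevance (only return relevant results)
relevant_docs = [
    doc for doc, score in results_with_relevance
    if score > 0.7  # Starting point only; tune the threshold on your own data
]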
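Depending on the vector store and how its index is configured, the score returned alongside each document may be a distance (lower is better) rather than a similarity (higher is better), so it is worth checking before applying a threshold. For unit-length vectors the two are related by d^2 = 2 - 2 * cos(a, b), where d is the L2 distance. The sketch below verifies this on made-up vectors; the helper names are hypothetical.

```python
import math

def normalize(v: list[float]) -> list[float]:
    """Scale a vector to length 1."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def squared_l2_to_cosine(d2: float) -> float:
    """Convert a squared L2 distance to cosine similarity (valid for unit-length vectors only)."""
    return 1.0 - d2 / 2.0

a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])

d2 = squared_l2(a, b)
print(round(cosine(a, b), 4), round(squared_l2_to_cosine(d2), 4))
```

Both printed values agree, which is why a distance-based score and a cosine score rank unit-length vectors identically, just in opposite order.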

Building a Document Ingestion Pipeline

For real applications, you need a systematic pipeline to process and store documents:

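from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
# Recent LangChain releases ship the splitters in the langchain-text-splitters package
from langchain_text_splitters import RecursiveCharacterTextSplitter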

def ingest_documents(documents_dir: str, vectorstore_dir: str) -> Chroma:
    """Load all PDFs from a directory and add to vector store."""
    
    # Load documents
    loader = DirectoryLoader(
        documents_dir,
        glob="**/*.pdf",
        loader_cls=PyPDFLoader,
        show_progress=True
    )
    raw_documents = loader.load()
    print(f"Loaded {len(raw_documents)} document pages")
    
    # Split into chunks
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len,
    )
    chunks = splitter.split_documents(raw_documents)
    print(f"Split into {len(chunks)} chunks")
    
    # Create embeddings and store
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    
    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=vectorstore_dir
    )
    
    print(f"Stored {vectorstore._collection.count()} vectors")
    return vectorstore

# Run once to build the database
vectorstore = ingest_documents("./docs", "./chroma_db")
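To see what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified fixed-window chunker. It is not the algorithm `RecursiveCharacterTextSplitter` uses (which also tries to split on separators such as paragraph and sentence boundaries); `chunk_text` is a hypothetical helper for illustration only.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.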

Metadata Filtering

Filter search results by metadata alongside semantic similarity:

# Only search documents from a specific source
results = vectorstore.similarity_search(
    "What are the pricing options?",
    k=4,
    filter={"source": "pricing.pdf"}  # Chroma metadata filter
)

# More complex filter (assumes documents were stored with "department" and "year" metadata)
results = vectorstore.similarity_search(
    query="customer onboarding process",
    k=5,
    filter={"$and": [{"department": "sales"}, {"year": {"$gte": 2023}}]}
)
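The filter dictionaries above can be read as predicates over each document's metadata. The sketch below evaluates a small subset of that syntax (plain equality, `$gte`, and `$and`) in pure Python to show the logic. It is an illustration, not Chroma's implementation; `matches` and the sample metadata values are hypothetical.

```python
def matches(metadata: dict, flt: dict) -> bool:
    """Return True if metadata satisfies the filter (supports equality, $gte, and $and only)."""
    for key, condition in flt.items():
        if key == "$and":
            # Every sub-filter in the list must match
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif isinstance(condition, dict):
            # Operator form, e.g. {"year": {"$gte": 2023}}
            for op, value in condition.items():
                if op != "$gte":
                    raise ValueError(f"Unsupported operator in this sketch: {op}")
                if key not in metadata or not metadata[key] >= value:
                    return False
        elif metadata.get(key) != condition:
            # Plain form, e.g. {"source": "pricing.pdf"}
            return False
    return True

complex_filter = {"$and": [{"department": "sales"}, {"year": {"$gte": 2023}}]}

print(matches({"department": "sales", "year": 2024}, complex_filter))  # True
print(matches({"department": "sales", "year": 2022}, complex_filter))  # False
```

In a real store the filter narrows the candidate set, and similarity ranking then applies only to documents that pass it.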

Maximum Marginal Relevance (MMR)

MMR balances relevance with diversity — avoids returning 4 chunks that all say the same thing:

# Standard similarity: may return very similar chunks
results_sim = vectorstore.similarity_search("pricing", k=4)

# MMR: balances relevance with diversity
results_mmr = vectorstore.max_marginal_relevance_search(
    "pricing",
    k=4,
    fetch_k=20,   # Fetch 20 candidates, select 4 most diverse relevant ones
    lambda_mult=0.5  # 0 = max diversity, 1 = max relevance
)
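The selection logic behind MMR can be sketched as a greedy loop: repeatedly pick the candidate with the best trade-off between similarity to the query and similarity to what has already been picked. This is a minimal illustration on made-up two-dimensional vectors, not LangChain's implementation; `mmr` is a hypothetical helper, and the candidate list plays the role of the `fetch_k` results.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr(query_vec: list[float], doc_vecs: list[list[float]], k: int, lambda_mult: float = 0.5) -> list[int]:
    """Return indices of up to k documents chosen greedily by maximal marginal relevance."""
    selected: list[int] = []
    candidates = list(range(len(doc_vecs)))

    def score(i: int) -> float:
        relevance = cosine(query_vec, doc_vecs[i])
        # Redundancy: similarity to the closest already-selected document (0 if none yet)
        redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
        return lambda_mult * relevance - (1 - lambda_mult) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

query = [1.0, 0.0]
docs = [
    [1.0, 0.2],    # index 0: most relevant
    [1.0, 0.25],   # index 1: near-duplicate of index 0
    [1.0, -0.8],   # index 2: less relevant, but different
]

print(mmr(query, docs, k=2, lambda_mult=0.5))  # [0, 2]
print(mmr(query, docs, k=2, lambda_mult=1.0))  # [0, 1]
```

With `lambda_mult=1.0` the loop degenerates to plain similarity ranking and returns the near-duplicate; at `0.5` the redundancy penalty pushes the second pick to the more distinct document.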

Choosing the Right Embedding Model

Model	Dimensions	Cost	Quality	Best For
OpenAI text-embedding-3-small	1536	Low	High	Production (cost-effective)
OpenAI text-embedding-3-large	3072	Medium	Highest	When quality matters most
all-MiniLM-L6-v2 (local)	384	Free	Good	Development, no API cost
all-mpnet-base-v2 (local)	768	Free	Better	Production with local model

For production: text-embedding-3-small is a sensible default, offering a good balance of quality and cost.
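Dimension count also drives storage. The following is a rough estimate, assuming each value is stored as a 4-byte float32 and ignoring index and metadata overhead (both are assumptions, and `vector_storage_bytes` is a hypothetical helper).

```python
def vector_storage_bytes(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw size of the vectors alone, assuming float32 values and no index overhead."""
    return num_vectors * dimensions * bytes_per_value

# Dimensions taken from the table above; 100,000 vectors is an assumed corpus size
models = [
    ("text-embedding-3-small", 1536),
    ("text-embedding-3-large", 3072),
    ("all-MiniLM-L6-v2", 384),
    ("all-mpnet-base-v2", 768),
]

for name, dims in models:
    mib = vector_storage_bytes(100_000, dims) / (1024 ** 2)
    print(f"{name}: {mib:.1f} MiB for 100,000 vectors")
```

Doubling the dimensions doubles the raw storage and the per-comparison work, which is part of the trade-off behind the quality column.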

Next lesson: Pinecone and Chroma — choosing and configuring vector databases for production agents.