AiTechWorlds

RAG: Retrieval-Augmented Generation Guide

Complete RAG pipeline — chunking, embedding, vector search, re-ranking, and LangChain implementation with code.

#rag #langchain #vector-database #llm #embeddings

What Is RAG?

RAG (Retrieval-Augmented Generation) is an architecture that connects an LLM to an external knowledge source at inference time. Instead of relying solely on training data, the model retrieves relevant documents and uses them as context to answer queries.
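
The retrieve-then-generate idea can be sketched without any external library. The example below is a toy illustration only: word overlap stands in for vector search, the knowledge base is made-up sample data, and `tokenize`, `retrieve`, and `build_prompt` are hypothetical helper names. The resulting prompt is what would be sent to an LLM.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(query: str, documents: list[str], k: int = 1) -> list[str]:
    """Toy retriever: rank documents by word overlap with the query."""
    query_tokens = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_tokens & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, context_docs: list[str]) -> str:
    """Place the retrieved text in the prompt so the model answers from it."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}"
    )


# Made-up sample knowledge base (an assumption for illustration).
knowledge_base = [
    "Shipping takes 3 to 5 business days.",
    "The refund policy allows returns within 30 days.",
    "Support is available by email on weekdays.",
]

question = "What is the refund policy?"
context = retrieve(question, knowledge_base, k=1)
prompt = build_prompt(question, context)
print(prompt)
```

In a real pipeline the overlap score is replaced by embedding similarity, which is what the steps that follow set up.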

Core Problems RAG Solves:

LLMs hallucinate when asked about facts outside their training data
A fixed knowledge cutoff leaves the model's knowledge stale
Fine-tuning an entire model for every knowledge update is expensive

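RAG Pipeline Architecture

The pipeline has two phases. Indexing: documents are split into chunks, embedded, and stored in a vector database. Querying: the question is embedded, the most similar chunks are retrieved (and optionally re-ranked), and the LLM generates an answer from them.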

Step-by-Step Implementation

Step 1: Document Ingestion

python

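from langchain_text_splitters import RecursiveCharacterTextSplitter

# `documents` is assumed to be a list of LangChain Document objects,
# for example the output of a document loader's .load() call.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,         # measured in characters by default, not tokens
    chunk_overlap=64,       # overlap preserves context across chunks
    separators=["\n\n", "\n", ".", " ", ""]  # tried in order; "" is the last-resort split
)
chunks = splitter.split_documents(documents)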

Step 2: Embed and Store

python

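from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

# OpenAIEmbeddings reads the API key from the OPENAI_API_KEY environment variable.
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Embeds every chunk and writes the index to disk under persist_directory.
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)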

Step 3: Retrieve and Generate

python

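from langchain_openai import ChatOpenAI
# Note: RetrievalQA is marked deprecated in recent LangChain releases,
# so the import path may differ depending on the installed version.
from langchain.chains import RetrievalQA

retriever = vectorstore.as_retriever(
    search_type="mmr",          # Maximum Marginal Relevance: reduces redundancy
    search_kwargs={"k": 5}      # number of chunks passed to the LLM
)

qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o-mini"),
    retriever=retriever,
    return_source_documents=True
)

result = qa_chain.invoke({"query": "What is the refund policy?"})
print(result["result"])

# return_source_documents=True also exposes the retrieved chunks.
for doc in result["source_documents"]:
    print(doc.metadata)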
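
To make the `search_type="mmr"` setting less of a black box, here is a minimal pure-Python sketch of Maximum Marginal Relevance. It is an illustration of the technique, not LangChain's implementation; the vectors are made-up sample data and `lambda_mult` is used here as the relevance/diversity trade-off (1.0 means relevance only).

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return indices of k documents, balancing relevance against redundancy."""
    selected: list[int] = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def score(i: int) -> float:
            relevance = cosine(query_vec, doc_vecs[i])
            # Highest similarity to anything already chosen (0.0 for the first pick).
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Made-up vectors: documents 0 and 1 are near-duplicates, document 2 is different.
query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.12], [0.8, -0.6]]

diverse = mmr(query, docs, k=2, lambda_mult=0.5)
relevance_only = mmr(query, docs, k=2, lambda_mult=1.0)
print(diverse, relevance_only)
```

With the balanced setting the near-duplicate is skipped in favor of the different document; with `lambda_mult=1.0` the selection falls back to plain similarity ranking.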

Chunking Strategies

Strategy	Best For	Chunk Size
Fixed-size (character)	Simple docs, quick setup	256–1024 chars
Recursive (separator-based)	Mixed documents	512–1024 chars
Sentence-level	QA, precise retrieval	1–5 sentences
Semantic chunking	Long form content	Variable
Document structure	PDFs, HTML, code	By section/function

Overlap rule of thumb: 10–20% of chunk size to avoid cutting context at boundaries.
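
The overlap rule can be made concrete with a small fixed-size chunker. This is a sketch under simple assumptions (character-based sizes, no separator awareness); `chunk_text` and `overlap_ratio` are hypothetical names, and 0.125 matches the 64-of-512 setting used in Step 1.

```python
def chunk_text(text: str, chunk_size: int = 512, overlap_ratio: float = 0.125) -> list[str]:
    """Split text into fixed-size character chunks that share an overlapping tail."""
    overlap = int(chunk_size * overlap_ratio)
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


# Synthetic 1000-character input for illustration.
sample = "".join(str(i % 10) for i in range(1000))
pieces = chunk_text(sample, chunk_size=512, overlap_ratio=0.125)
print([len(p) for p in pieces])
```

Each chunk begins with the last 64 characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.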

Embedding Model Comparison

Model	Dimensions	Best For	Free?
`text-embedding-3-small`	1,536	General RAG (OpenAI)	No
`text-embedding-3-large`	3,072	High accuracy (OpenAI)	No
`nomic-embed-text`	768	Local / open source	Yes
`bge-m3`	1,024	Multilingual	Yes
`all-MiniLM-L6-v2`	384	Fast, low resource	Yes
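
Dimension count translates directly into storage cost. The back-of-the-envelope sketch below assumes vectors are stored as uncompressed 32-bit floats and ignores index overhead and metadata, so real figures will differ; `raw_index_bytes` is a hypothetical helper.

```python
BYTES_PER_FLOAT32 = 4


def raw_index_bytes(num_chunks: int, dimensions: int) -> int:
    """Raw vector storage: one float32 per dimension per chunk."""
    return num_chunks * dimensions * BYTES_PER_FLOAT32


# Dimensions taken from the table above.
models = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "nomic-embed-text": 768,
    "bge-m3": 1024,
    "all-MiniLM-L6-v2": 384,
}

for name, dims in models.items():
    gigabytes = raw_index_bytes(1_000_000, dims) / 1e9
    print(f"{name}: {gigabytes:.2f} GB per million chunks")
```

Under these assumptions, moving from a 384-dimension model to a 3,072-dimension one multiplies raw vector storage by eight.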

Vector Database Comparison

Database	Deployment	Scale	Free Tier
Chroma	Local / self-hosted	Small-medium	Yes (local)
Pinecone	Cloud managed	Large	Yes (limited starter plan)
Qdrant	Self-hosted / cloud	Large	Yes (cloud)
Weaviate	Self-hosted / cloud	Large	Yes
FAISS	In-memory local	Medium	Yes (library)
pgvector	PostgreSQL extension	Medium	Yes
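
At its core, every database in the table answers the same question: which stored vectors are closest to the query vector? A brute-force version of that operation is sketched below with made-up three-dimensional vectors; production systems typically replace the linear scan with approximate nearest-neighbor indexes. `top_k` is a hypothetical helper name.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], index: dict[str, list[float]], k: int = 2):
    """Score every stored vector against the query and return the k best (id, score) pairs."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up vectors standing in for real embeddings.
index = {
    "refunds": [0.9, 0.1, 0.0],
    "shipping": [0.1, 0.9, 0.0],
    "support": [0.0, 0.2, 0.9],
}

results = top_k([1.0, 0.2, 0.0], index, k=2)
print(results)
```

The linear scan costs one similarity computation per stored vector, which is why the scale column matters once a collection grows large.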

Advanced RAG Techniques

Technique	What It Does	When to Use
HyDE	Generate hypothetical doc before retrieval	Vague/short queries
Multi-query retrieval	Generate multiple query variants	Low recall
Re-ranking (cross-encoder)	Re-score retrieved docs with a more powerful model	Precision-critical
Parent document retrieval	Retrieve full section when child chunk matches	Chunking loses context
Self-RAG	Model decides when to retrieve	Adaptive pipelines
RAPTOR	Hierarchical summarization tree	Very long documents
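
The re-ranking row describes a two-stage pattern: a cheap first-stage retriever returns candidates, then a slower scorer that reads the query and each document together reorders them. The sketch below shows only that pattern. `toy_pair_score` is a deliberately simple stand-in for a cross-encoder model, and the candidate texts are made up.

```python
import re
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 3,
) -> list[str]:
    """Re-score each (query, document) pair and keep the top_n highest scoring documents."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]


def toy_pair_score(query: str, doc: str) -> float:
    """Stand-in for a cross-encoder: fraction of query words that appear in the document."""
    query_words = set(re.findall(r"\w+", query.lower()))
    doc_words = set(re.findall(r"\w+", doc.lower()))
    if not query_words:
        return 0.0
    return len(query_words & doc_words) / len(query_words)


# Made-up first-stage results, in the order the retriever returned them.
first_stage = [
    "Our policy on shipping delays.",
    "Refund policy: returns accepted within 30 days.",
    "Contact support by email.",
]

reranked = rerank("What is the refund policy?", first_stage, toy_pair_score, top_n=2)
print(reranked)
```

Swapping in a real cross-encoder means replacing only `score_fn`; the surrounding retrieve-then-reorder logic stays the same.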

Naive RAG vs Advanced RAG

Aspect	Naive RAG	Advanced RAG
Indexing	Fixed chunks	Hierarchical + metadata
Retrieval	Dense vector search only	Hybrid (dense + sparse)
Re-ranking	None	Cross-encoder re-ranker
Query	Raw user input	Query expansion/rewrite
Response	Direct LLM output	Cited, grounded output

Hybrid Search (BM25 + Vector)

python

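from langchain_community.retrievers import BM25Retriever   # needs the rank_bm25 package
from langchain.retrievers import EnsembleRetriever

bm25 = BM25Retriever.from_documents(chunks, k=5)            # sparse keyword retriever
vector = vectorstore.as_retriever(search_kwargs={"k": 5})   # dense embedding retriever

hybrid = EnsembleRetriever(
    retrievers=[bm25, vector],
    weights=[0.4, 0.6]      # same order as retrievers: 0.4 keyword (BM25), 0.6 semantic
)

docs = hybrid.invoke("What is the refund policy?")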
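
Combining two ranked lists requires a fusion rule, because BM25 scores and cosine similarities are not on the same scale. A common choice is weighted Reciprocal Rank Fusion, which uses only rank positions. The sketch below is a simplified illustration of that rule, not the library's code; the constant `c=60` and the document IDs are assumptions for the example.

```python
def weighted_rrf(rankings: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    """Fuse ranked lists: each list adds weight / (rank + c) to a document's score."""
    scores: dict[str, float] = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (rank + c)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Made-up results from a keyword retriever and a vector retriever.
bm25_ranking = ["a", "b", "c"]
vector_ranking = ["b", "d", "a"]

fused = weighted_rrf([bm25_ranking, vector_ranking], weights=[0.4, 0.6])
print(fused)
```

Documents found by both retrievers accumulate score from each list, so they rise above documents found by only one.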

Evaluation Metrics

Metric	Measures	Tool
Context Recall	How much ground truth is retrieved	RAGAS
Context Precision	How relevant retrieved docs are	RAGAS
Faithfulness	Is answer grounded in retrieved docs?	RAGAS
Answer Relevancy	Does answer address the question?	RAGAS
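
The two retrieval metrics can be approximated by hand when labeled relevant chunk IDs are available. The functions below are simplified set-based analogues, not the RAGAS definitions (RAGAS relies on LLM judgments and its context precision is rank-aware); the chunk IDs are made up.

```python
def context_precision(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Share of retrieved chunks that are actually relevant."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for chunk_id in retrieved_ids if chunk_id in relevant_ids)
    return hits / len(retrieved_ids)


def context_recall(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Share of relevant chunks that the retriever managed to find."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids) & relevant_ids) / len(relevant_ids)


# Made-up evaluation example: 4 chunks retrieved, 3 chunks labeled relevant.
retrieved = ["c1", "c4", "c7", "c9"]
relevant = {"c1", "c2", "c7"}

precision = context_precision(retrieved, relevant)
recall = context_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Low precision points at noisy retrieval (consider re-ranking or smaller chunks), while low recall suggests the relevant text is not being found at all (consider hybrid search or query expansion).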

Common Mistakes

Chunk size too large — buries the relevant sentence in a noisy chunk
No overlap between chunks — cuts sentences mid-thought at chunk boundaries
Using only semantic search — keyword matches (BM25) often outperform vectors for exact terms
Skipping re-ranking — top-5 from embedding search often needs reordering for accuracy
Not filtering by metadata — retrieve from the right document subset before scoring similarity
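
The last point, filtering by metadata before scoring similarity, can be sketched as follows. This is an illustration with made-up chunks and vectors; `Chunk` and `filtered_search` are hypothetical names, and the dot product is used as the similarity on the assumption that vectors are roughly normalized.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    vector: list[float]
    metadata: dict = field(default_factory=dict)


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def filtered_search(query_vec: list[float], chunks: list[Chunk], where: dict, k: int = 3) -> list[Chunk]:
    """Keep only chunks whose metadata matches `where`, then rank the survivors by similarity."""
    subset = [
        chunk for chunk in chunks
        if all(chunk.metadata.get(key) == value for key, value in where.items())
    ]
    return sorted(subset, key=lambda chunk: dot(query_vec, chunk.vector), reverse=True)[:k]


# Made-up chunks: the changelog entry is similar to the query but from the wrong source.
chunks = [
    Chunk("Refunds are accepted within 30 days.", [1.0, 0.0], {"source": "policy"}),
    Chunk("Fixed the refund button in version 2.", [0.9, 0.1], {"source": "changelog"}),
    Chunk("Shipping takes 3 to 5 business days.", [0.0, 1.0], {"source": "policy"}),
]

hits = filtered_search([1.0, 0.0], chunks, where={"source": "policy"}, k=2)
print([hit.text for hit in hits])
```

Without the filter, the changelog entry would rank second on similarity alone even though it comes from the wrong document set.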
