5 LangChain RAG Evaluation Metrics with Ragas (2026)
Evaluate your LangChain RAG pipeline with Ragas: faithfulness, answer relevancy, context recall, context precision, and answer correctness — full Python code.
Building a RAG system is the easy part. Knowing whether it actually works — that is harder. Without evaluation, you are making blind changes: swapping chunking strategies, adjusting retrieval k, changing prompts, with no idea whether your metrics improved or declined.
Ragas (Retrieval-Augmented Generation Assessment) is a widely used framework for evaluating LangChain RAG pipelines. The five metrics covered here are all scored with the help of an LLM judge. Two of them, faithfulness and answer relevancy, are reference-free, meaning you do not need manually labeled ground-truth answers; the other three compare against a reference answer. The reference-free metrics are practical to run continuously as you iterate.
This guide implements a full Ragas evaluation pipeline for a LangChain RAG system. You will be able to score your pipeline before and after any change to see whether it improved.
Understanding the 5 Ragas Metrics
Faithfulness (0–1)
Measures whether every claim in the answer is supported by the retrieved context. An answer with 10 claims where 8 are in the context scores 0.80. This is the primary hallucination detector.
High faithfulness means the model is not making things up. Low faithfulness means the model is answering from its training data, not your documents.
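The arithmetic behind the score can be sketched in a few lines. This is only an illustration: in Ragas an LLM judge extracts the claims and decides whether each one is supported, whereas the hypothetical `faithfulness_score` helper below takes those verdicts as given.

```python
def faithfulness_score(claim_supported: list[bool]) -> float:
    """Fraction of answer claims that the retrieved context supports."""
    if not claim_supported:
        return 0.0  # assumption: an answer with no claims scores 0
    return sum(claim_supported) / len(claim_supported)


# The example from the text: 10 claims, 8 of them found in the context
verdicts = [True] * 8 + [False] * 2
print(faithfulness_score(verdicts))  # 0.8
```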
Answer Relevancy (0–1)
Measures how directly the answer addresses the question. Ragas uses an LLM to generate several candidate questions from the answer alone, embeds them along with the original question, and averages the cosine similarity between each candidate and the original.
High answer relevancy means the answer is focused on what was asked. Low answer relevancy means the answer drifts off-topic or gives generic information.
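A minimal sketch of the similarity step, assuming the questions have already been embedded. The two-dimensional vectors are toy values chosen for readability, and `answer_relevancy` is a hypothetical helper, not the Ragas API.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def answer_relevancy(question_vec: list[float], generated_vecs: list[list[float]]) -> float:
    """Mean similarity between the original question and questions regenerated from the answer."""
    if not generated_vecs:
        return 0.0
    sims = [cosine_similarity(question_vec, vec) for vec in generated_vecs]
    return sum(sims) / len(sims)


# Toy embeddings: one regenerated question matches the original, one is unrelated
original = [1.0, 0.0]
regenerated = [[1.0, 0.0], [0.0, 1.0]]
print(answer_relevancy(original, regenerated))  # 0.5
```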
Context Recall (0–1)
Requires a reference answer. Measures what percentage of the information in the reference answer is covered by the retrieved context. If the reference answer has 5 facts and the context contains 4 of them, recall is 0.80.
Low context recall means your retriever is not finding all the relevant documents.
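To make the ratio concrete, here is a rough sketch that uses case-insensitive substring matching as a stand-in for the LLM judgment Ragas actually performs. The `context_recall` helper and its inputs are illustrative assumptions; it also returns the missing facts, which is usually the more actionable output.

```python
def context_recall(reference_facts: list[str], contexts: list[str]) -> tuple[float, list[str]]:
    """Return (share of reference facts found in the contexts, facts that were missed)."""
    if not reference_facts:
        return 0.0, []
    joined = " ".join(contexts).lower()
    # Substring matching is a crude proxy; Ragas asks an LLM to attribute each statement
    missing = [fact for fact in reference_facts if fact.lower() not in joined]
    covered = len(reference_facts) - len(missing)
    return covered / len(reference_facts), missing


facts = ["document loader", "text splitter", "embedding model", "vector store", "retriever"]
contexts = [
    "A RAG pipeline needs a document loader and a text splitter.",
    "An embedding model writes vectors to a vector store.",
]
score, missing = context_recall(facts, contexts)
print(score, missing)  # 0.8 ['retriever']
```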
Context Precision (0–1)
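Requires a reference answer. Measures how much of the retrieved context is actually relevant, and whether the relevant chunks are ranked near the top. Ragas judges each retrieved chunk against the reference and averages precision over the ranks where a relevant chunk appears, so the same documents in a worse order score lower. A high-precision system retrieves few documents, most of which are useful. A low-precision system retrieves many documents, most of which are noise.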
Low context precision means your retriever is pulling in too much irrelevant content.
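The Ragas documentation describes this score as rank-aware: precision is computed at each rank that holds a relevant chunk, then averaged. The sketch below assumes that formulation and takes the per-chunk relevance verdicts as given (Ragas obtains them from an LLM judge); `context_precision` here is a hypothetical helper, not the library function.

```python
def context_precision(relevance: list[bool]) -> float:
    """Average of precision@k over the ranks k that hold a relevant chunk."""
    hits = 0
    precision_sum = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / k  # precision@k at this relevant rank
    return precision_sum / hits if hits else 0.0


# One relevant chunk out of three: ranking it first scores far higher than ranking it last
print(context_precision([True, False, False]))  # 1.0
print(round(context_precision([False, False, True]), 3))  # 0.333
```

Because order matters, reranking retrieved chunks can raise this score even when the set of retrieved documents stays the same.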
Answer Correctness (0–1)
Requires a reference answer. Combines semantic similarity and factual overlap between the generated answer and the reference answer. This is the closest Ragas gets to "is the answer right?"
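One way to picture the combination is a weighted average of a factual F1 score and an embedding similarity. The sketch below is an assumption-laden illustration: the 0.75/0.25 weights are assumed to mirror the Ragas defaults (check your installed version), and the true-positive, false-positive, and false-negative counts would normally come from an LLM comparing statements in the answer and the reference.

```python
def factual_f1(tp: int, fp: int, fn: int) -> float:
    """F1 over statements: tp = in both, fp = only in answer, fn = only in reference."""
    denominator = tp + 0.5 * (fp + fn)
    return tp / denominator if denominator else 0.0


def answer_correctness(
    tp: int,
    fp: int,
    fn: int,
    semantic_similarity: float,
    weights: tuple[float, float] = (0.75, 0.25),  # assumed defaults, not confirmed by the source
) -> float:
    """Weighted average of factual overlap and semantic similarity."""
    factual_weight, semantic_weight = weights
    total = factual_weight + semantic_weight
    return (factual_weight * factual_f1(tp, fp, fn) + semantic_weight * semantic_similarity) / total


# 3 shared statements, 1 extra in the answer, 1 missing from it, similarity 0.9
print(round(answer_correctness(tp=3, fp=1, fn=1, semantic_similarity=0.9), 4))  # 0.7875
```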
Setup
pip install ragas langchain langchain-openai langchain-community chromadb datasets
import os
os.environ["OPENAI_API_KEY"] = "your-key-here"
Step 1: Build the RAG Pipeline to Evaluate
First, build the RAG system you want to evaluate:
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.documents import Document
# Sample knowledge base documents
documents = [
Document(page_content="""
LangChain is an open-source framework for building applications powered by large language models.
It was created by Harrison Chase and first released in October 2022.
LangChain provides abstractions for chains, agents, memory, and retrieval.
The framework supports over 50 LLM providers and 100+ integrations.
""", metadata={"source": "langchain_overview.txt"}),
Document(page_content="""
RAG (Retrieval-Augmented Generation) combines a retrieval system with an LLM.
In a RAG pipeline, relevant documents are retrieved from a vector store at query time.
These documents are appended to the prompt as context before the LLM generates an answer.
RAG reduces hallucinations because the LLM can cite specific retrieved passages.
The key components are: document loader, text splitter, embedding model, vector store, and retriever.
""", metadata={"source": "rag_guide.txt"}),
Document(page_content="""
Vector databases store high-dimensional embeddings for fast similarity search.
Chroma is a popular open-source vector database for local development.
Pinecone is a managed vector database with automatic scaling.
Weaviate supports both dense and sparse vector search (hybrid search).
FAISS is a Facebook library optimized for billion-scale vector search.
Qdrant provides a REST and gRPC API for vector operations.
""", metadata={"source": "vector_db_comparison.txt"}),
Document(page_content="""
LangChain agents use LLMs to decide which tools to call based on user input.
The ReAct framework (Reasoning + Acting) is the most common agent architecture.
Agents can use tools like web search, calculators, code execution, and API calls.
The LangChain OpenAI Functions agent uses structured function calling for tool invocation.
Agent memory can be maintained across turns using ConversationBufferMemory.
""", metadata={"source": "agents_guide.txt"}),
Document(page_content="""
LangChain LCEL (LangChain Expression Language) provides a declarative syntax for building chains.
Chains are composed using the pipe operator: chain = prompt | llm | output_parser
LCEL chains support both sync and async execution natively.
Runnables implement ainvoke(), astream(), and abatch() for async operation.
The RunnableParallel class runs multiple chains concurrently and merges their outputs.
""", metadata={"source": "lcel_guide.txt"}),
]
# Build the vector store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(documents, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
# Build the generation chain
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
prompt = ChatPromptTemplate.from_messages([
("system", """Answer the question using only the provided context.
If the answer is not in the context, say "I don't have information about that."
Context:
{context}"""),
("human", "{question}"),
])
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
rag_chain = (
{"context": retriever | format_docs, "question": RunnablePassthrough()}
| prompt
| llm
| StrOutputParser()
)
# Test it
answer = rag_chain.invoke("What is LangChain and when was it created?")
print(answer)
Step 2: Create the Evaluation Dataset
Ragas needs a dataset of questions, contexts, answers, and optionally ground truth answers:
from ragas import EvaluationDataset, SingleTurnSample
import asyncio
# Define evaluation questions with reference answers
eval_questions = [
{
"question": "What is LangChain and when was it first released?",
"ground_truth": "LangChain is an open-source framework for building LLM applications, created by Harrison Chase and first released in October 2022."
},
{
"question": "What are the key components of a RAG pipeline?",
"ground_truth": "The key components of a RAG pipeline are: document loader, text splitter, embedding model, vector store, and retriever."
},
{
"question": "What is the difference between Chroma and Pinecone?",
"ground_truth": "Chroma is an open-source vector database for local development. Pinecone is a managed vector database with automatic scaling."
},
{
"question": "What is the ReAct framework in LangChain agents?",
"ground_truth": "ReAct (Reasoning + Acting) is the most common LangChain agent architecture where the LLM reasons about which tools to call."
},
{
"question": "What does LCEL stand for and what operator does it use?",
"ground_truth": "LCEL stands for LangChain Expression Language. It uses the pipe operator (|) to compose chains declaratively."
},
{
"question": "How many LLM providers does LangChain support?",
"ground_truth": "LangChain supports over 50 LLM providers."
},
{
"question": "What is Weaviate's special feature compared to other vector databases?",
"ground_truth": "Weaviate supports both dense and sparse vector search, also known as hybrid search."
},
{
"question": "What types of tools can LangChain agents use?",
"ground_truth": "LangChain agents can use tools like web search, calculators, code execution, and API calls."
},
]
async def generate_evaluation_dataset(questions: list[dict]) -> list[dict]:
    """Generate answers and collect contexts for evaluation."""
    samples = []
    for item in questions:
        question = item["question"]
        # Retrieve context
        retrieved_docs = retriever.invoke(question)
        contexts = [doc.page_content for doc in retrieved_docs]
        # Generate answer
        answer = await rag_chain.ainvoke(question)
        samples.append({
            "user_input": question,
            "retrieved_contexts": contexts,
            "response": answer,
            "reference": item.get("ground_truth", ""),
        })
        print(f"Generated: {question[:60]}...")
    return samples

# Run the data generation
samples = asyncio.run(generate_evaluation_dataset(eval_questions))

# Preview the dataset
for s in samples[:2]:
    print(f"\nQ: {s['user_input']}")
    print(f"A: {s['response'][:200]}...")
    print(f"Context snippets: {len(s['retrieved_contexts'])}")
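Before spending evaluator tokens, it can help to check that every sample is complete, since an empty context list or a blank reference will distort the scores. The helper below is a hypothetical pre-flight check, assuming samples are plain dicts with the four keys used in this step.

```python
def find_invalid_samples(samples: list[dict]) -> list[tuple[int, str]]:
    """Return (index, problem) pairs for samples that would skew evaluation."""
    problems = []
    for index, sample in enumerate(samples):
        # Text fields must be present and non-blank
        for field in ("user_input", "response", "reference"):
            if not str(sample.get(field) or "").strip():
                problems.append((index, f"empty {field}"))
        # At least one retrieved chunk is needed for the context-based metrics
        if not sample.get("retrieved_contexts"):
            problems.append((index, "no retrieved contexts"))
    return problems


demo_samples = [
    {
        "user_input": "What does LCEL stand for?",
        "retrieved_contexts": ["LCEL provides a declarative syntax for building chains."],
        "response": "LangChain Expression Language.",
        "reference": "LCEL stands for LangChain Expression Language.",
    },
    {
        "user_input": "What is Chroma?",
        "retrieved_contexts": [],
        "response": "A vector database.",
        "reference": "",
    },
]
print(find_invalid_samples(demo_samples))
# [(1, 'empty reference'), (1, 'no retrieved contexts')]
```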
Step 3: Run Ragas Evaluation
from ragas import evaluate
from ragas.metrics import (
Faithfulness,
AnswerRelevancy,
ContextRecall,
ContextPrecision,
AnswerCorrectness,
)
from ragas import EvaluationDataset, SingleTurnSample
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
# Configure evaluation LLM and embeddings (can use cheaper model for evaluation)
eval_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini", temperature=0))
eval_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings(model="text-embedding-3-small"))
# Build the Ragas dataset
ragas_samples = [
SingleTurnSample(
user_input=s["user_input"],
retrieved_contexts=s["retrieved_contexts"],
response=s["response"],
reference=s["reference"],
)
for s in samples
]
dataset = EvaluationDataset(samples=ragas_samples)
# Define metrics to evaluate
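metrics = [
    Faithfulness(llm=eval_llm),
    AnswerRelevancy(llm=eval_llm, embeddings=eval_embeddings),
    ContextRecall(llm=eval_llm),
    ContextPrecision(llm=eval_llm),
    # Answer correctness includes a semantic similarity term, so pass the same
    # embeddings explicitly instead of relying on the library default
    AnswerCorrectness(llm=eval_llm, embeddings=eval_embeddings),
]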
# Run evaluation
print("Running Ragas evaluation...")
results = evaluate(
dataset=dataset,
metrics=metrics,
)
# Convert to pandas: one row per question, one column per metric
import pandas as pd
results_df = results.to_pandas()

# Display aggregate results. Depending on the Ragas version, results["faithfulness"]
# may return the list of per-question scores rather than a mean, so average the
# DataFrame columns instead of formatting results[...] directly.
metric_labels = {
    "faithfulness": "Faithfulness:",
    "answer_relevancy": "Answer Relevancy:",
    "context_recall": "Context Recall:",
    "context_precision": "Context Precision:",
    "answer_correctness": "Answer Correctness:",
}
print("\n" + "=" * 50)
print("RAGAS EVALUATION RESULTS")
print("=" * 50)
for column, label in metric_labels.items():
    print(f"{label:<20}{results_df[column].mean():.3f}")

print("\nPer-question breakdown:")
print(results_df[["user_input", "faithfulness", "answer_relevancy", "answer_correctness"]].to_string())
Step 4: Automated TestSet Generation
If you do not have a pre-built evaluation set, Ragas can generate one from your documents:
from ragas.testset import TestsetGenerator
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
generator_embeddings = LangchainEmbeddingsWrapper(
OpenAIEmbeddings(model="text-embedding-3-small")
)
generator = TestsetGenerator(
llm=generator_llm,
embedding_model=generator_embeddings,
)
# Generate 20 test questions from your documents
testset = generator.generate_with_langchain_docs(
documents=documents,
testset_size=20,
)
# Convert to pandas to inspect
testset_df = testset.to_pandas()
print(f"Generated {len(testset_df)} test questions")
print(testset_df[["user_input", "reference"]].head())
# Save for reuse
testset_df.to_csv("evaluation_testset.csv", index=False)
Step 5: Continuous Evaluation Pipeline
Build an evaluation pipeline you can run after every change:
import json
from datetime import datetime
from pathlib import Path

import pandas as pd

def evaluate_rag_pipeline(
    rag_chain,
    retriever,
    testset_path: str = "evaluation_testset.csv",
    results_dir: str = "eval_results",
) -> dict:
    """
    Run a full Ragas evaluation and save results with timestamp.
    Returns a dict with all metric scores.
    """
    Path(results_dir).mkdir(exist_ok=True)

    # Load test set
    testset_df = pd.read_csv(testset_path)
    questions = testset_df["user_input"].tolist()
    references = (
        testset_df["reference"].tolist()
        if "reference" in testset_df.columns
        else [""] * len(questions)
    )

    # Generate answers
    print(f"Generating answers for {len(questions)} questions...")
    ragas_samples = []
    for question, reference in zip(questions, references):
        retrieved_docs = retriever.invoke(question)
        contexts = [doc.page_content for doc in retrieved_docs]
        answer = rag_chain.invoke(question)
        ragas_samples.append(SingleTurnSample(
            user_input=question,
            retrieved_contexts=contexts,
            response=answer,
            reference=reference,
        ))

    # Evaluate
    dataset = EvaluationDataset(samples=ragas_samples)
    eval_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini", temperature=0))
    eval_embeddings = LangchainEmbeddingsWrapper(
        OpenAIEmbeddings(model="text-embedding-3-small")
    )
    results = evaluate(
        dataset=dataset,
        metrics=[
            Faithfulness(llm=eval_llm),
            AnswerRelevancy(llm=eval_llm, embeddings=eval_embeddings),
            ContextRecall(llm=eval_llm),
            ContextPrecision(llm=eval_llm),
            AnswerCorrectness(llm=eval_llm, embeddings=eval_embeddings),
        ],
    )

    # Save results. Average the per-question scores from the DataFrame, because
    # results["metric"] may return a list rather than a mean in some Ragas versions.
    results_df = results.to_pandas()
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    scores = {
        "timestamp": timestamp,
        "faithfulness": float(results_df["faithfulness"].mean()),
        "answer_relevancy": float(results_df["answer_relevancy"].mean()),
        "context_recall": float(results_df["context_recall"].mean()),
        "context_precision": float(results_df["context_precision"].mean()),
        "answer_correctness": float(results_df["answer_correctness"].mean()),
        "num_samples": len(questions),
    }
    output_path = f"{results_dir}/eval_{timestamp}.json"
    with open(output_path, "w") as f:
        json.dump(scores, f, indent=2)
    print(f"\nResults saved to {output_path}")
    return scores

# Run evaluation
scores = evaluate_rag_pipeline(rag_chain, retriever)
print("\nFinal scores:")
for metric, score in scores.items():
    if isinstance(score, float):
        print(f"  {metric}: {score:.3f}")
Step 6: Compare Before and After Pipeline Changes
Track how the metrics change across experiments:
import json
import glob
import pandas as pd
def compare_evaluation_runs(results_dir: str = "eval_results") -> pd.DataFrame:
    """Load and compare all evaluation runs."""
    result_files = sorted(glob.glob(f"{results_dir}/eval_*.json"))
    if not result_files:
        print("No evaluation results found.")
        return pd.DataFrame()

    runs = []
    for f in result_files:
        with open(f) as fp:
            runs.append(json.load(fp))

    df = pd.DataFrame(runs)
    metric_cols = ["faithfulness", "answer_relevancy", "context_recall",
                   "context_precision", "answer_correctness"]
    print("Evaluation History:")
    print(df[["timestamp"] + metric_cols].to_string(index=False))

    if len(df) > 1:
        latest = df.iloc[-1]
        previous = df.iloc[-2]
        print("\nChange from previous run:")
        for metric in metric_cols:
            delta = latest[metric] - previous[metric]
            direction = "+" if delta >= 0 else ""
            print(f"  {metric}: {direction}{delta:.3f}")
    return df

compare_evaluation_runs()
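LLM-judged scores can move slightly between identical runs, so a raw delta is easy to over-read. One option is to flag only drops that exceed a tolerance. The sketch below assumes score dicts shaped like the saved JSON files; the `find_regressions` helper and the 0.02 tolerance are illustrative choices, not values from Ragas.

```python
METRICS = (
    "faithfulness",
    "answer_relevancy",
    "context_recall",
    "context_precision",
    "answer_correctness",
)


def find_regressions(previous: dict, latest: dict, tolerance: float = 0.02) -> dict[str, float]:
    """Return metrics whose score dropped by more than `tolerance`, with the rounded delta."""
    regressions = {}
    for metric in METRICS:
        if metric in previous and metric in latest:
            delta = latest[metric] - previous[metric]
            if delta < -tolerance:
                regressions[metric] = round(delta, 3)
    return regressions


before = {"faithfulness": 0.90, "context_recall": 0.70}
after = {"faithfulness": 0.84, "context_recall": 0.69}
print(find_regressions(before, after))  # {'faithfulness': -0.06}
```

Here the 0.01 drop in context recall is treated as noise, while the 0.06 drop in faithfulness is reported.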
Comparison Table: RAG Evaluation Metrics
| Metric | Needs Reference Answer | What It Detects | Target Score | Typical Score Range |
|---|---|---|---|---|
| Faithfulness | No | Hallucinations | > 0.85 | 0.60 – 0.95 |
| Answer Relevancy | No | Off-topic answers | > 0.80 | 0.65 – 0.95 |
| Context Recall | Yes | Retriever coverage gaps | > 0.75 | 0.50 – 0.90 |
| Context Precision | Yes | Retriever noise | > 0.70 | 0.45 – 0.90 |
| Answer Correctness | Yes | Overall accuracy | > 0.75 | 0.50 – 0.90 |
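The target column can double as a simple quality gate. The sketch below copies the targets from the table into a dict and lists the metrics that fail to exceed them; `metrics_below_target` is a hypothetical helper, and the sample scores are made up for illustration.

```python
# Targets copied from the "Target Score" column above
TARGETS = {
    "faithfulness": 0.85,
    "answer_relevancy": 0.80,
    "context_recall": 0.75,
    "context_precision": 0.70,
    "answer_correctness": 0.75,
}


def metrics_below_target(scores: dict, targets: dict = TARGETS) -> list[str]:
    """Names of metrics present in `scores` that do not exceed their target."""
    return sorted(
        name
        for name, target in targets.items()
        if name in scores and not scores[name] > target
    )


example_scores = {
    "faithfulness": 0.91,
    "answer_relevancy": 0.78,
    "context_recall": 0.80,
    "context_precision": 0.66,
    "answer_correctness": 0.77,
}
print(metrics_below_target(example_scores))  # ['answer_relevancy', 'context_precision']
```

In a CI job, a non-empty list could be used to fail the build before a pipeline change is merged.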
Diagnosing Low Scores
Each metric points to a different part of your pipeline:
def diagnose_low_scores(scores: dict) -> list[str]:
    """Return actionable recommendations based on metric scores."""
    recommendations = []
    if scores.get("faithfulness", 1.0) < 0.75:
        recommendations.append(
            "LOW FAITHFULNESS: LLM is hallucinating. "
            "Try: (1) Add 'only use context' instructions, "
            "(2) Use a more instruction-following model, "
            "(3) Reduce temperature to 0."
        )
    if scores.get("answer_relevancy", 1.0) < 0.75:
        recommendations.append(
            "LOW ANSWER RELEVANCY: Answers drift off-topic. "
            "Try: (1) Add format instructions to your prompt, "
            "(2) Use chain-of-thought prompting, "
            "(3) Reduce max_tokens to force conciseness."
        )
    if scores.get("context_recall", 1.0) < 0.65:
        recommendations.append(
            "LOW CONTEXT RECALL: Retriever misses relevant documents. "
            "Try: (1) Increase k (retrieve more docs), "
            "(2) Use hybrid search (dense + sparse), "
            "(3) Try smaller chunk sizes."
        )
    if scores.get("context_precision", 1.0) < 0.60:
        recommendations.append(
            "LOW CONTEXT PRECISION: Retriever pulls irrelevant docs. "
            "Try: (1) Decrease k, "
            "(2) Add EmbeddingsFilter post-retrieval, "
            "(3) Use MMR retrieval to reduce redundancy."
        )
    if scores.get("answer_correctness", 1.0) < 0.65:
        recommendations.append(
            "LOW ANSWER CORRECTNESS: Answers are factually wrong. "
            "Check faithfulness first — if high, improve retrieval recall. "
            "If both are fine, your knowledge base may be incomplete."
        )
    if not recommendations:
        recommendations.append("All metrics look healthy. Consider raising thresholds for tighter evaluation.")
    return recommendations

recommendations = diagnose_low_scores(scores)
for r in recommendations:
    print(f"\n{r}")
Integration with LangChain Pipelines
Ragas evaluation plugs directly into the pipeline from the RAG system tutorial. Run it against any LangChain retriever and chain pair; you do not need to modify your production code.
To build the RAG pipeline evaluated here, see the vector database guide for vector store setup and the semantic search tutorial for retrieval strategies. Once your scores are healthy, the guide on deploying an AI model to production covers the path to serving the pipeline.
Frequently Asked Questions
What is the difference between faithfulness and answer relevancy in Ragas?
Faithfulness measures whether the answer is grounded in the retrieved context — it catches hallucinations where the model uses knowledge outside the context. Answer relevancy measures whether the answer actually addresses the question — it catches cases where the model answers a different question than the one asked. A system can be faithful but irrelevant (cites real context but about the wrong topic) or relevant but unfaithful (answers the right question using information not in the context).
How much does running Ragas evaluation cost?
Every metric in this guide relies on LLM calls (plus embedding calls for answer relevancy and answer correctness), which adds cost on top of your generation costs. A dataset of 100 question-answer pairs typically costs $0.10–0.50 with GPT-4o-mini as the evaluator. To manage costs, use a cheaper model for evaluation than for generation. Ragas also supports local models via Ollama, which avoids API fees when you need to run evaluation frequently.
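For a rough budget, the cost can be estimated from the number of judge calls and tokens per call. Every number in the sketch below is a placeholder assumption (calls per sample, token counts, and prices all vary by metric, model, and provider), so substitute figures from your own runs and your provider's current price list.

```python
def estimate_eval_cost(
    num_samples: int,
    judge_calls_per_sample: int,
    input_tokens_per_call: int,
    output_tokens_per_call: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Estimated evaluator cost in the same currency as the per-million-token prices."""
    calls = num_samples * judge_calls_per_sample
    input_cost = calls * input_tokens_per_call * input_price_per_million / 1_000_000
    output_cost = calls * output_tokens_per_call * output_price_per_million / 1_000_000
    return round(input_cost + output_cost, 4)


# Placeholder inputs: 100 samples, 10 judge calls each, hypothetical token counts and prices
cost = estimate_eval_cost(
    num_samples=100,
    judge_calls_per_sample=10,
    input_tokens_per_call=1500,
    output_tokens_per_call=300,
    input_price_per_million=0.15,
    output_price_per_million=0.60,
)
print(f"Estimated evaluation cost: {cost}")
```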
How do I create a good evaluation dataset for my RAG system?
A good evaluation dataset includes 50–200 question-answer pairs that cover the full range of queries your system will face. Include simple factoid questions, multi-hop questions requiring synthesis, and edge cases where the answer is not in your knowledge base. You can use Ragas TestsetGenerator to automatically create synthetic questions from your documents, or manually curate questions from real user queries collected during beta testing.
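A quick way to check coverage is to tag each question with a category and count them. The category names and the minimum of five per category below are hypothetical choices for illustration; the `category` field is not something Ragas requires.

```python
from collections import Counter

# Hypothetical labels for the three kinds of question described above
CATEGORIES = ("factoid", "multi_hop", "unanswerable")


def coverage_gaps(questions: list[dict], min_per_category: int = 5) -> dict[str, int]:
    """Return under-represented categories mapped to their current question counts."""
    counts = Counter(q.get("category") for q in questions)
    return {
        category: counts.get(category, 0)
        for category in CATEGORIES
        if counts.get(category, 0) < min_per_category
    }


eval_set = (
    [{"question": f"factoid {i}", "category": "factoid"} for i in range(6)]
    + [{"question": f"multi-hop {i}", "category": "multi_hop"} for i in range(2)]
)
print(coverage_gaps(eval_set))  # {'multi_hop': 2, 'unanswerable': 0}
```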
AiTechWorlds Team