10 LangChain RAG Evaluation with Ragas (Metrics and Datasets)
Evaluate your LangChain RAG pipelines with Ragas: faithfulness, answer relevancy, context recall, TestsetGenerator, and CI/CD integration for production quality.
Most teams build a RAG pipeline, test it manually with a few questions, declare it "good enough," and ship it. Six months later they are debugging why users are complaining about wrong answers — and they have no historical metrics to understand what changed.
Systematic RAG evaluation is not optional in production. It is the difference between an AI feature that improves with each iteration and one that degrades silently. Ragas is the most widely adopted open-source framework for this, and it integrates cleanly with LangChain pipelines.
This guide covers ten evaluation patterns using Ragas: the core metrics, automated test set generation, CI/CD integration, and how to interpret scores to make concrete improvements to your pipeline.
For the RAG pipeline itself, see the RAG system tutorial. For vector store setup, see the vector database guide.
Why RAG Pipelines Need Systematic Evaluation
A RAG system has two failure modes that feel identical to the user — wrong answers — but have completely different root causes:
- Retrieval failure: The right document was not retrieved. The LLM could not answer correctly because the context was irrelevant.
- Generation failure: The right document was retrieved, but the LLM ignored it or hallucinated beyond it.
Without metrics that distinguish these two failures, you are guessing at fixes. Ragas gives you separate scores for each component, so you can target your improvements correctly.
Installation and Setup
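pip install ragas langchain langchain-openai langchain-chroma langchain-community matplotlib
# The examples use the Ragas 0.1.x API (ragas.testset.generator, ragas.testset.evolutions).
# If those imports fail on a newer release, pin an older one: pip install "ragas<0.2"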
import os
os.environ["OPENAI_API_KEY"] = "your-openai-key"
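The Core Ragas Metrics
Ragas scores the retrieval and generation halves of a pipeline separately. The seven metrics below each isolate one aspect of quality, and the final pattern runs them all in a single call: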
1. Faithfulness
Measures whether the answer is grounded in the retrieved context. A score of 1.0 means every claim in the answer is supported by the context. A score of 0.4 means more than half the claims are not.
from ragas.metrics import faithfulness
from datasets import Dataset
from ragas import evaluate
# Sample evaluation data
eval_data = {
"question": [
"What is the capital of France?",
"When was Python first released?",
],
"answer": [
"The capital of France is Paris.",
"Python was first released in 1991 by Guido van Rossum.",
],
"contexts": [
["France is a country in Western Europe. Its capital city is Paris."],
["Python is a programming language. Guido van Rossum created it in 1991."],
],
"ground_truth": [
"Paris",
"1991",
],
}
dataset = Dataset.from_dict(eval_data)
result = evaluate(dataset, metrics=[faithfulness])
print(f"Faithfulness: {result['faithfulness']:.3f}")
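To make the score concrete: faithfulness is the fraction of claims in the answer that the context supports. The sketch below reproduces only that final arithmetic, with hand-written verdicts standing in for the LLM's per-claim judgments. The function name `faithfulness_from_verdicts` is illustrative and not part of the Ragas API.

```python
def faithfulness_from_verdicts(verdicts: list[bool]) -> float:
    """Return the share of claims judged as supported by the context.

    Each entry in `verdicts` is True when a claim is supported.
    """
    if not verdicts:
        raise ValueError("at least one claim verdict is required")
    return sum(verdicts) / len(verdicts)


# Five claims, only two supported: more than half are ungrounded
score = faithfulness_from_verdicts([True, True, False, False, False])
print(f"Faithfulness: {score:.2f}")
```

Two supported claims out of five gives 0.4, the low-score case described above.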
2. Answer Relevancy
Measures how well the answer addresses the question. A high score requires the answer to be on-topic and complete — not just tangentially related.
from ragas.metrics import answer_relevancy
from ragas import evaluate
result = evaluate(dataset, metrics=[answer_relevancy])
print(f"Answer Relevancy: {result['answer_relevancy']:.3f}")
3. Context Recall
Measures whether the retrieved context contains the information needed to answer the question. This requires ground truth answers. A low score points to a retrieval problem.
from ragas.metrics import context_recall
result = evaluate(dataset, metrics=[context_recall])
print(f"Context Recall: {result['context_recall']:.3f}")
4. Context Precision
Measures whether the retrieved chunks are relevant, penalizing retrieval of noise alongside signal.
from ragas.metrics import context_precision
result = evaluate(dataset, metrics=[context_precision])
print(f"Context Precision: {result['context_precision']:.3f}")
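The arithmetic behind a precision score of this kind can be sketched without an LLM. The version below assumes the rank-weighted formulation (precision@k averaged over the positions that hold a relevant chunk), so relevant chunks ranked early score higher than the same chunks buried under noise. The relevance flags are hand-written here, and `context_precision_at_k` is an illustrative helper rather than a Ragas function.

```python
def context_precision_at_k(relevance: list[int]) -> float:
    """relevance[i] is 1 if the chunk at rank i+1 is relevant, else 0."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0  # nothing relevant was retrieved
    hits = 0
    weighted = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted += hits / rank  # precision@rank, counted at relevant ranks only
    return weighted / total_relevant


good_order = context_precision_at_k([1, 1, 0, 0])  # relevant chunks ranked first
bad_order = context_precision_at_k([0, 0, 1, 1])   # relevant chunks ranked last
print(f"good order: {good_order:.3f}, bad order: {bad_order:.3f}")
```

Both retrievals contain the same two relevant chunks, but the second ranks two irrelevant chunks ahead of them and scores lower as a result.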
5. Answer Correctness
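End-to-end correctness: does the answer agree with the ground truth? The score blends factual overlap with semantic similarity, which makes it the broadest single metric but also the hardest to attribute to retrieval or generation alone.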
from ragas.metrics import answer_correctness
result = evaluate(dataset, metrics=[answer_correctness])
print(f"Answer Correctness: {result['answer_correctness']:.3f}")
6. Answer Similarity
Semantic similarity between the generated answer and the ground truth, measured as the cosine similarity of their embeddings.
from ragas.metrics import answer_similarity
result = evaluate(dataset, metrics=[answer_similarity])
print(f"Answer Similarity: {result['answer_similarity']:.3f}")
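Embedding-based similarity is typically computed as the cosine of the angle between two vectors. The sketch below shows that calculation with toy three-dimensional vectors in place of real embeddings. The vectors and the `cosine_similarity` helper are illustrative assumptions, not Ragas internals.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


answer_vec = [1.0, 2.0, 0.0]     # toy "embedding" of the generated answer
truth_vec = [2.0, 4.0, 0.0]      # same direction, different length
unrelated_vec = [0.0, 0.0, 3.0]  # orthogonal to both

same_direction = cosine_similarity(answer_vec, truth_vec)
orthogonal = cosine_similarity(answer_vec, unrelated_vec)
print(f"{same_direction:.3f} {orthogonal:.3f}")
```

Because the measure ignores vector length, two answers that say the same thing at different lengths can still score close to 1.0.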
7. Context Entity Recall
Checks whether named entities from the ground truth appear in the retrieved context — particularly valuable for knowledge-intensive tasks.
from ragas.metrics import context_entity_recall
result = evaluate(dataset, metrics=[context_entity_recall])
print(f"Context Entity Recall: {result['context_entity_recall']:.3f}")
8–10. Running All Metrics Together
from ragas.metrics import (
faithfulness,
answer_relevancy,
context_recall,
context_precision,
answer_correctness,
answer_similarity,
context_entity_recall,
)
from ragas import evaluate
import pandas as pd
ALL_METRICS = [
faithfulness,
answer_relevancy,
context_recall,
context_precision,
answer_correctness,
answer_similarity,
context_entity_recall,
]
result = evaluate(dataset, metrics=ALL_METRICS)
df = result.to_pandas()
print(df[["question"] + [m.name for m in ALL_METRICS]].to_string())
Generating Test Datasets Automatically
Manually creating evaluation datasets is tedious. Ragas's TestsetGenerator creates question-answer pairs from your documents automatically:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
# Load your documents
loader = DirectoryLoader(
"./documents",
glob="**/*.txt",
loader_cls=TextLoader
)
documents = loader.load()
# Split into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)
# Initialize the generator with models
generator_llm = ChatOpenAI(model="gpt-4o")
critic_llm = ChatOpenAI(model="gpt-4o")
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
generator = TestsetGenerator.from_langchain(
generator_llm,
critic_llm,
embeddings
)
# Generate 30 test questions across three complexity levels
testset = generator.generate_with_langchain_docs(
chunks,
test_size=30,
distributions={
simple: 0.5, # 15 simple factual questions
reasoning: 0.3, # 9 multi-hop reasoning questions
multi_context: 0.2 # 6 questions requiring multiple documents
}
)
# Convert to a DataFrame to inspect
test_df = testset.to_pandas()
print(test_df[["question", "ground_truth", "evolution_type"]].head(10))
# Save for reuse
test_df.to_csv("./eval_data/testset.csv", index=False)
The simple evolution creates direct factual questions. reasoning creates questions that require inference. multi_context creates questions that require synthesizing information across multiple documents — the hardest category and where many RAG systems fail.
Running a Full Evaluation Against Your LangChain RAG Pipeline
import asyncio
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall, context_precision
# Set up the RAG pipeline
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vector_store = Chroma(
collection_name="docs",
embedding_function=embeddings,
persist_directory="./chroma_db"
)
retriever = vector_store.as_retriever(search_kwargs={"k": 4})
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
prompt = ChatPromptTemplate.from_template("""
Answer the question based on the context below.
Context: {context}
Question: {question}
""")
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
rag_chain = (
{"context": retriever | format_docs, "question": RunnablePassthrough()}
| prompt
| llm
| StrOutputParser()
)
async def run_rag_on_testset(test_questions: list[str]) -> dict:
    """
    Run the RAG pipeline on each test question and collect
    the answers and retrieved contexts.
    """
    answers = []
    contexts = []
    for question in test_questions:
        # Get retrieved docs
        retrieved_docs = await retriever.ainvoke(question)
        ctx_texts = [doc.page_content for doc in retrieved_docs]
        # Get the answer
        answer = await rag_chain.ainvoke(question)
        answers.append(answer)
        contexts.append(ctx_texts)
    return {"answers": answers, "contexts": contexts}
async def evaluate_rag_pipeline(testset_path: str) -> dict:
    import pandas as pd
    test_df = pd.read_csv(testset_path)
    questions = test_df["question"].tolist()
    ground_truths = test_df["ground_truth"].tolist()
    # Run the pipeline
    rag_output = await run_rag_on_testset(questions)
    # Build the Ragas dataset
    eval_dataset = Dataset.from_dict({
        "question": questions,
        "answer": rag_output["answers"],
        "contexts": rag_output["contexts"],
        "ground_truth": ground_truths,
    })
    # Evaluate
    result = evaluate(
        eval_dataset,
        metrics=[
            faithfulness,
            answer_relevancy,
            context_recall,
            context_precision,
        ]
    )
    return {
        "faithfulness": result["faithfulness"],
        "answer_relevancy": result["answer_relevancy"],
        "context_recall": result["context_recall"],
        "context_precision": result["context_precision"],
        "dataframe": result.to_pandas()
    }
# Run evaluation
scores = asyncio.run(evaluate_rag_pipeline("./eval_data/testset.csv"))
print(f"Faithfulness: {scores['faithfulness']:.3f}")
print(f"Answer Relevancy: {scores['answer_relevancy']:.3f}")
print(f"Context Recall: {scores['context_recall']:.3f}")
print(f"Context Precision: {scores['context_precision']:.3f}")
Interpreting Scores and Diagnosing Problems
Understanding what low scores mean is as important as measuring them:
def diagnose_rag_issues(scores: dict) -> list[str]:
    """
    Interpret Ragas scores and suggest specific improvements.
    """
    issues = []
    if scores["faithfulness"] < 0.75:
        issues.append(
            "LOW FAITHFULNESS: The LLM is generating claims not supported by context. "
            "Fix: Tighten your prompt ('answer ONLY from the context below'), "
            "lower the model temperature, or use a model with less tendency to hallucinate."
        )
    if scores["answer_relevancy"] < 0.70:
        issues.append(
            "LOW ANSWER RELEVANCY: Answers are off-topic or incomplete. "
            "Fix: Review your prompt template for clarity, ensure the question is "
            "passed through correctly, and check for prompt injection in user inputs."
        )
    if scores["context_recall"] < 0.70:
        issues.append(
            "LOW CONTEXT RECALL: Retrieved context is missing key information. "
            "Fix: Increase k (retrieve more documents), improve your chunking strategy, "
            "add metadata filtering, or switch to a better embedding model."
        )
    if scores["context_precision"] < 0.70:
        issues.append(
            "LOW CONTEXT PRECISION: Too much irrelevant content being retrieved. "
            "Fix: Reduce k, use MMR (Max Marginal Relevance) retrieval, "
            "add metadata filters to scope retrieval, or improve document splitting."
        )
    if not issues:
        issues.append("All scores above threshold: pipeline is performing well.")
    return issues
issues = diagnose_rag_issues(scores)
for issue in issues:
    print(issue)
    print()
CI/CD Integration with GitHub Actions
Automated evaluation in your deployment pipeline prevents regressions:
# .github/workflows/rag-evaluation.yml
name: RAG Pipeline Evaluation
on:
  push:
    paths:
      - "src/rag/**"
      - "data/documents/**"
  pull_request:
    paths:
      - "src/rag/**"
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: pip install ragas langchain langchain-openai langchain-chroma
      - name: Run RAG evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python scripts/evaluate_rag.py
      - name: Check score thresholds
        run: python scripts/check_thresholds.py
The threshold checker script:
# scripts/check_thresholds.py
import json
import sys
THRESHOLDS = {
"faithfulness": 0.80,
"answer_relevancy": 0.75,
"context_recall": 0.70,
"context_precision": 0.70,
}
with open("./eval_results/scores.json") as f:
    scores = json.load(f)
failures = []
for metric, threshold in THRESHOLDS.items():
    # A metric missing from the file counts as 0 and therefore fails
    score = scores.get(metric, 0)
    if score < threshold:
        failures.append(
            f"FAIL: {metric} = {score:.3f} (threshold: {threshold:.2f})"
        )
    else:
        print(f"PASS: {metric} = {score:.3f} (threshold: {threshold:.2f})")
if failures:
    print("\nFailed metrics:")
    for failure in failures:
        print(failure)
    sys.exit(1)
print("\nAll metrics passed!")
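The checker reads ./eval_results/scores.json, so scripts/evaluate_rag.py has to write that file at the end of its run. The article does not show that step. A minimal sketch of it follows, assuming the scores arrive as the dictionary returned by evaluate_rag_pipeline (which also carries a non-serializable "dataframe" entry). The helper name `write_scores` is hypothetical, and the demo uses made-up scores and a temporary directory.

```python
import json
import tempfile
from pathlib import Path


def write_scores(scores: dict, path: str = "./eval_results/scores.json") -> Path:
    """Write only the numeric entries of `scores` to a JSON file."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)  # create eval_results/ if missing
    numeric = {
        name: float(value)
        for name, value in scores.items()
        if isinstance(value, (int, float))  # skips the DataFrame entry
    }
    out.write_text(json.dumps(numeric, indent=2))
    return out


# Demo with made-up scores, written to a temporary directory
demo_scores = {"faithfulness": 0.91, "context_recall": 0.78, "dataframe": object()}
demo_path = Path(tempfile.mkdtemp()) / "eval_results" / "scores.json"
saved = json.loads(write_scores(demo_scores, str(demo_path)).read_text())
print(saved)
```

In the real script the call would be write_scores(scores) with the default path, placed after the evaluation finishes.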
Ragas Metric Comparison
| Metric | Requires ground truth? | Measures | Failure points to |
|---|---|---|---|
| Faithfulness | No | LLM grounding | Generation problem |
| Answer Relevancy | No | Answer on-topic | Prompt or LLM problem |
| Context Recall | Yes | Retrieval completeness | Retrieval problem |
| Context Precision | Yes | Retrieval noise | Retrieval problem |
| Answer Correctness | Yes | End-to-end accuracy | Both |
| Answer Similarity | Yes | Semantic closeness | Both |
| Context Entity Recall | Yes | Named entity coverage | Retrieval problem |
Tracking Score Trends Over Time
A single score is a snapshot. A time series shows when quality started to slip and which change caused it:
import json
import sqlite3
from datetime import datetime
def save_evaluation_results(
    scores: dict,
    pipeline_version: str,
    db_path: str = "./eval_results/history.db"
):
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS evaluations (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            timestamp TEXT,
            pipeline_version TEXT,
            faithfulness REAL,
            answer_relevancy REAL,
            context_recall REAL,
            context_precision REAL,
            answer_correctness REAL
        )
    """)
    conn.execute("""
        INSERT INTO evaluations VALUES (NULL, ?, ?, ?, ?, ?, ?, ?)
    """, (
        datetime.now().isoformat(),
        pipeline_version,
        scores.get("faithfulness"),
        scores.get("answer_relevancy"),
        scores.get("context_recall"),
        scores.get("context_precision"),
        scores.get("answer_correctness"),
    ))
    conn.commit()
    conn.close()
def plot_metric_trends(db_path: str):
    import pandas as pd
    import matplotlib.pyplot as plt
    conn = sqlite3.connect(db_path)
    df = pd.read_sql("SELECT * FROM evaluations ORDER BY timestamp", conn)
    conn.close()
    fig, axes = plt.subplots(2, 2, figsize=(12, 8))
    metrics = ["faithfulness", "answer_relevancy", "context_recall", "context_precision"]
    thresholds = [0.80, 0.75, 0.70, 0.70]
    for ax, metric, threshold in zip(axes.flat, metrics, thresholds):
        ax.plot(df["timestamp"], df[metric], marker="o")
        ax.axhline(y=threshold, color="r", linestyle="--", label=f"Threshold ({threshold})")
        ax.set_title(metric.replace("_", " ").title())
        ax.set_ylim(0, 1)
        ax.legend()
        ax.tick_params(axis="x", rotation=45)
    plt.tight_layout()
    plt.savefig("./eval_results/metric_trends.png", dpi=150)
    plt.show()
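A stored history also makes it possible to flag regressions automatically instead of spotting them on a chart. The sketch below compares the two most recent rows of an evaluations table with the same column names as above and reports any metric that dropped by more than a tolerance. The function name `find_regressions`, the 0.05 tolerance, and the demo rows are assumptions for illustration only.

```python
import sqlite3

METRICS = ["faithfulness", "answer_relevancy", "context_recall", "context_precision"]


def find_regressions(conn: sqlite3.Connection, tolerance: float = 0.05) -> dict:
    """Return {metric: (previous, latest)} for drops larger than `tolerance`."""
    rows = conn.execute(
        f"SELECT {', '.join(METRICS)} FROM evaluations "
        "ORDER BY timestamp DESC LIMIT 2"
    ).fetchall()
    if len(rows) < 2:
        return {}  # nothing to compare against yet
    latest, previous = rows
    regressions = {}
    for metric, new, old in zip(METRICS, latest, previous):
        if new is not None and old is not None and old - new > tolerance:
            regressions[metric] = (old, new)
    return regressions


# Demo: in-memory database with two made-up runs
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE evaluations (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        timestamp TEXT, pipeline_version TEXT,
        faithfulness REAL, answer_relevancy REAL,
        context_recall REAL, context_precision REAL,
        answer_correctness REAL
    )
""")
conn.executemany(
    "INSERT INTO evaluations VALUES (NULL, ?, ?, ?, ?, ?, ?, NULL)",
    [
        ("2024-01-01T00:00:00", "v1", 0.90, 0.85, 0.80, 0.75),
        ("2024-02-01T00:00:00", "v2", 0.88, 0.84, 0.65, 0.76),
    ],
)
regressions = find_regressions(conn)
conn.close()
print(regressions)
```

In this demo only context recall falls by more than the tolerance, which would point the investigation at the retriever rather than the prompt.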
For the broader context of building reliable AI systems, see AI agents explained and Deploy AI model to production.
Frequently Asked Questions
What is Ragas and how does it evaluate RAG pipelines? Ragas is an open-source Python library that evaluates Retrieval-Augmented Generation pipelines without requiring human-labeled ground truth for every metric. It uses LLMs to assess aspects like whether the answer is grounded in the retrieved context (faithfulness), whether the context actually contains the answer (context recall), and whether the answer addresses the question (answer relevancy).
What Ragas score should I target for a production RAG system? Treat any number as a starting point rather than a fixed benchmark. A common target is faithfulness above 0.85, answer relevancy above 0.80, and context recall above 0.75, calibrated against your own baseline runs. Thresholds vary by domain: medical and legal applications typically require higher faithfulness scores (0.90+) to minimize hallucination risk.
How do I automatically generate test datasets for RAG evaluation? Ragas provides a TestsetGenerator that takes your source documents and automatically creates question-answer pairs at different complexity levels: simple factual questions, reasoning questions that require inference, and multi-context questions that draw on several documents. This is useful when you do not have existing labeled evaluation data but need a representative test set.
AiTechWorlds Team