10 LangChain RAG Evaluation Patterns with Ragas (Metrics and Datasets)

⚡ Quick Answer

Evaluate your LangChain RAG pipelines with Ragas: faithfulness, answer relevancy, context recall, TestsetGenerator, and CI/CD integration for production quality.

AiTechWorlds Team May 31, 2026 10 min read

Most teams build a RAG pipeline, test it manually with a few questions, declare it "good enough," and ship it. Six months later they are debugging why users are complaining about wrong answers — and they have no historical metrics to understand what changed.

Systematic RAG evaluation is not optional in production. It is the difference between an AI feature that improves with each iteration and one that degrades silently. Ragas is one of the most widely adopted open-source frameworks for this, and it integrates cleanly with LangChain pipelines.

This guide covers ten evaluation patterns using Ragas: the core metrics, automated test set generation, CI/CD integration, and how to interpret scores to make concrete improvements to your pipeline.

For the RAG pipeline itself, see the RAG system tutorial. For vector store setup, see the vector database guide.

Why RAG Pipelines Need Systematic Evaluation

A RAG system has two failure modes that feel identical to the user — wrong answers — but have completely different root causes:

Retrieval failure: The right document was not retrieved. The LLM could not answer correctly because the context was irrelevant.
Generation failure: The right document was retrieved, but the LLM ignored it or hallucinated beyond it.

Without metrics that distinguish these two failures, you are guessing at fixes. Ragas gives you separate scores for each component, so you can target your improvements correctly.
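As a rough illustration of how separate scores steer a fix, the sketch below labels one evaluated sample by the stage that most likely failed. It uses two of the metrics covered later in this guide (context recall for retrieval, faithfulness for generation). The function name `classify_failure` and the 0.7 cutoff are assumptions made for this example, not part of Ragas.

```python
def classify_failure(context_recall: float, faithfulness: float,
                     threshold: float = 0.7) -> str:
    """Label one evaluated sample by the stage that most likely failed.

    The 0.7 threshold is an illustrative assumption, not a Ragas default.
    """
    if context_recall < threshold:
        # The needed information never reached the LLM.
        return "retrieval"
    if faithfulness < threshold:
        # The context was adequate, but the answer strayed from it.
        return "generation"
    return "ok"


samples = [
    {"context_recall": 0.2, "faithfulness": 0.9},
    {"context_recall": 0.9, "faithfulness": 0.3},
    {"context_recall": 0.9, "faithfulness": 0.95},
]
labels = [classify_failure(s["context_recall"], s["faithfulness"]) for s in samples]
print(labels)  # ['retrieval', 'generation', 'ok']
```

Retrieval is checked first because a low faithfulness score is hard to interpret when the context itself was missing the answer.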

Installation and Setup

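# The examples below use the Ragas 0.1.x API (ragas.testset.generator, ragas.testset.evolutions); later releases reorganized these modules
pip install "ragas<0.2" langchain langchain-openai langchain-chroma langchain-community langchain-text-splitters pandas matplotlib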

import os
os.environ["OPENAI_API_KEY"] = "your-openai-key"

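The Core Ragas Metrics

Ragas scores retrieval and generation separately. This section walks through seven metrics that cover retrieval quality, generation quality, and end-to-end answer quality, then runs them together: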

1. Faithfulness

Measures whether the answer is grounded in the retrieved context. The score is the fraction of claims in the answer that the context supports: 1.0 means every claim is supported, and 0.4 means only 40% are, so most of the answer is ungrounded.

from ragas.metrics import faithfulness
from datasets import Dataset
from ragas import evaluate

# Sample evaluation data
eval_data = {
    "question": [
        "What is the capital of France?",
        "When was Python first released?",
    ],
    "answer": [
        "The capital of France is Paris.",
        "Python was first released in 1991 by Guido van Rossum.",
    ],
    "contexts": [
        ["France is a country in Western Europe. Its capital city is Paris."],
        ["Python is a programming language. Guido van Rossum created it in 1991."],
    ],
    "ground_truth": [
        "Paris",
        "1991",
    ],
}

dataset = Dataset.from_dict(eval_data)
result = evaluate(dataset, metrics=[faithfulness])
print(f"Faithfulness: {result['faithfulness']:.3f}")
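To make the 0.4 figure concrete, here is a minimal sketch of the arithmetic behind the score. Ragas uses an LLM to extract claims from the answer and to judge each one against the context; in this sketch the verdicts are hand-written, and `faithfulness_score` is a hypothetical helper, not a Ragas function.

```python
def faithfulness_score(claim_supported: list[bool]) -> float:
    """Fraction of answer claims that the retrieved context supports."""
    if not claim_supported:
        raise ValueError("need at least one claim")
    return sum(claim_supported) / len(claim_supported)


# Five claims extracted from an answer; a judge found two of them in the context.
verdicts = [True, True, False, False, False]
score = faithfulness_score(verdicts)
print(score)  # 0.4
```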

2. Answer Relevancy

Measures how well the answer addresses the question. A high score requires the answer to be on-topic and complete — not just tangentially related.

from ragas.metrics import answer_relevancy
from ragas import evaluate

result = evaluate(dataset, metrics=[answer_relevancy])
print(f"Answer Relevancy: {result['answer_relevancy']:.3f}")

3. Context Recall

Measures whether the retrieved context contains the information needed to answer the question. This requires ground truth answers. A low score points to a retrieval problem.

from ragas.metrics import context_recall

result = evaluate(dataset, metrics=[context_recall])
print(f"Context Recall: {result['context_recall']:.3f}")

4. Context Precision

Measures whether the retrieved chunks are relevant and whether the relevant ones are ranked near the top. Irrelevant chunks placed ahead of useful ones lower the score.

from ragas.metrics import context_precision

result = evaluate(dataset, metrics=[context_precision])
print(f"Context Precision: {result['context_precision']:.3f}")
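The effect of ranking is easier to see with numbers. The sketch below follows the rank-weighted formula described in the Ragas documentation for this metric (precision at each rank, counted only where the chunk is relevant, averaged over the relevant chunks). The relevance flags are hand-written assumptions here, whereas Ragas obtains them from an LLM judge, and `context_precision_at_k` is a name chosen for this example.

```python
def context_precision_at_k(relevant: list[bool]) -> float:
    """Rank-weighted precision over retrieved chunks, given in retrieval order."""
    hits = 0
    weighted = 0.0
    for rank, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            # precision@rank, accumulated only at ranks holding a relevant chunk
            weighted += hits / rank
    return weighted / hits if hits else 0.0


top_ranked = context_precision_at_k([True, False, False, False])
bottom_ranked = context_precision_at_k([False, False, False, True])
print(top_ranked)     # 1.0
print(bottom_ranked)  # 0.25
```

Both retrievals contain exactly one useful chunk out of four, yet the score drops from 1.0 to 0.25 when that chunk is ranked last.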

5. Answer Correctness

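End-to-end correctness: does the answer agree with the ground truth? The score blends factual overlap with semantic similarity, which makes it the broadest single metric, but a low value alone does not tell you whether retrieval or generation is at fault.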

from ragas.metrics import answer_correctness

result = evaluate(dataset, metrics=[answer_correctness])
print(f"Answer Correctness: {result['answer_correctness']:.3f}")
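A sketch of how such a blended score can be computed is shown below. It assumes an F1-style factual score built from counts of matching (TP), unsupported (FP), and missing (FN) statements, combined with a semantic similarity value using weights of 0.75 and 0.25. Those weights are an assumption for this example; check the defaults in your installed Ragas version. Both function names are hypothetical.

```python
def factual_f1(tp: int, fp: int, fn: int) -> float:
    """F1-style overlap between answer statements and ground-truth statements."""
    denom = tp + 0.5 * (fp + fn)
    return tp / denom if denom else 0.0


def answer_correctness_score(tp: int, fp: int, fn: int, similarity: float,
                             weights: tuple[float, float] = (0.75, 0.25)) -> float:
    """Weighted blend of factual overlap and semantic similarity (weights assumed)."""
    w_fact, w_sim = weights
    return w_fact * factual_f1(tp, fp, fn) + w_sim * similarity


# Two statements match, one is unsupported, one ground-truth fact is missing.
blended = answer_correctness_score(tp=2, fp=1, fn=1, similarity=0.9)
print(round(blended, 3))  # 0.725
```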

6. Answer Similarity

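Semantic similarity between the generated answer and the ground truth, measured as the cosine similarity of their embeddings. It rewards answers that mean the same thing as the reference even when the wording differs.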

from ragas.metrics import answer_similarity

result = evaluate(dataset, metrics=[answer_similarity])
print(f"Answer Similarity: {result['answer_similarity']:.3f}")
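For readers unfamiliar with the underlying measure, this is a minimal sketch of cosine similarity on toy two-dimensional vectors. Real embeddings have hundreds or thousands of dimensions and come from an embedding model; the vectors here are made up for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


answer_vec = [1.0, 0.0]  # toy embedding of the generated answer
truth_vec = [1.0, 1.0]   # toy embedding of the ground truth
similarity = cosine_similarity(answer_vec, truth_vec)
print(round(similarity, 3))  # 0.707
```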

7. Context Entity Recall

Checks whether named entities from the ground truth appear in the retrieved context — particularly valuable for knowledge-intensive tasks.

from ragas.metrics import context_entity_recall

result = evaluate(dataset, metrics=[context_entity_recall])
print(f"Context Entity Recall: {result['context_entity_recall']:.3f}")
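The idea behind this metric reduces to a set overlap. In the sketch below the entity sets are written by hand as an assumption; Ragas extracts them with an LLM. `entity_recall` is a hypothetical helper used only to show the arithmetic.

```python
def entity_recall(ground_truth_entities: set[str], context_entities: set[str]) -> float:
    """Share of ground-truth entities that also appear in the retrieved context."""
    if not ground_truth_entities:
        return 0.0
    found = ground_truth_entities & context_entities
    return len(found) / len(ground_truth_entities)


gt_entities = {"Guido van Rossum", "Python", "1991"}
ctx_entities = {"Python", "Guido van Rossum", "CWI"}
recall = entity_recall(gt_entities, ctx_entities)
print(round(recall, 3))  # 0.667
```

Here the context never mentions "1991", so a question about the release year cannot be answered from it even though the other entities match.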

8. Running All Metrics Together

from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    answer_correctness,
    answer_similarity,
    context_entity_recall,
)
from ragas import evaluate
import pandas as pd

ALL_METRICS = [
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    answer_correctness,
    answer_similarity,
    context_entity_recall,
]

result = evaluate(dataset, metrics=ALL_METRICS)
df = result.to_pandas()
print(df[["question"] + [m.name for m in ALL_METRICS]].to_string())

Generating Test Datasets Automatically

Manually creating evaluation datasets is tedious. Ragas's TestsetGenerator creates question-answer pairs from your documents automatically:

from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load your documents
loader = DirectoryLoader(
    "./documents",
    glob="**/*.txt",
    loader_cls=TextLoader
)
documents = loader.load()

# Split into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)

# Initialize the generator with models
generator_llm = ChatOpenAI(model="gpt-4o")
critic_llm = ChatOpenAI(model="gpt-4o")
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)

# Generate 30 test questions across three complexity levels
testset = generator.generate_with_langchain_docs(
    chunks,
    test_size=30,
    distributions={
        simple: 0.5,       # 15 simple factual questions
        reasoning: 0.3,    # 9 multi-hop reasoning questions
        multi_context: 0.2  # 6 questions requiring multiple documents
    }
)

# Convert to a DataFrame to inspect
test_df = testset.to_pandas()
print(test_df[["question", "ground_truth", "evolution_type"]].head(10))

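# Save for reuse (create the output directory first so to_csv does not fail)
import os
os.makedirs("./eval_data", exist_ok=True)
test_df.to_csv("./eval_data/testset.csv", index=False)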

The simple evolution creates direct factual questions. reasoning creates questions that require inference. multi_context creates questions that require synthesizing information across multiple documents — the hardest category and where many RAG systems fail.
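The comments in the generator call (15, 9, and 6 questions) follow directly from multiplying each share by test_size. The sketch below reproduces that arithmetic with plain string keys instead of the Ragas evolution objects. `questions_per_type` is a hypothetical helper, and the counts Ragas actually produces may differ slightly, since generated questions can be filtered out.

```python
def questions_per_type(test_size: int, distributions: dict[str, float]) -> dict[str, int]:
    """Expected number of generated questions for each evolution type."""
    total = sum(distributions.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"distribution shares must sum to 1.0, got {total}")
    return {name: round(test_size * share) for name, share in distributions.items()}


counts = questions_per_type(
    30, {"simple": 0.5, "reasoning": 0.3, "multi_context": 0.2}
)
print(counts)  # {'simple': 15, 'reasoning': 9, 'multi_context': 6}
```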

Running a Full Evaluation Against Your LangChain RAG Pipeline

import asyncio
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall, context_precision

# Set up the RAG pipeline
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vector_store = Chroma(
    collection_name="docs",
    embedding_function=embeddings,
    persist_directory="./chroma_db"
)
retriever = vector_store.as_retriever(search_kwargs={"k": 4})
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

prompt = ChatPromptTemplate.from_template("""
Answer the question based on the context below.
Context: {context}
Question: {question}
""")

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)


async def run_rag_on_testset(test_questions: list[str]) -> dict:
    """
    Run the RAG pipeline on each test question and collect
    the answers and retrieved contexts.
    """
    answers = []
    contexts = []

    for question in test_questions:
        # Get retrieved docs
        retrieved_docs = await retriever.ainvoke(question)
        ctx_texts = [doc.page_content for doc in retrieved_docs]

        # Get the answer
        answer = await rag_chain.ainvoke(question)

        answers.append(answer)
        contexts.append(ctx_texts)

    return {"answers": answers, "contexts": contexts}


async def evaluate_rag_pipeline(testset_path: str) -> dict:
    import pandas as pd

    test_df = pd.read_csv(testset_path)
    questions = test_df["question"].tolist()
    ground_truths = test_df["ground_truth"].tolist()

    # Run the pipeline
    rag_output = await run_rag_on_testset(questions)

    # Build the Ragas dataset
    eval_dataset = Dataset.from_dict({
        "question": questions,
        "answer": rag_output["answers"],
        "contexts": rag_output["contexts"],
        "ground_truth": ground_truths,
    })

    # Evaluate
    result = evaluate(
        eval_dataset,
        metrics=[
            faithfulness,
            answer_relevancy,
            context_recall,
            context_precision,
        ]
    )

    return {
        "faithfulness": result["faithfulness"],
        "answer_relevancy": result["answer_relevancy"],
        "context_recall": result["context_recall"],
        "context_precision": result["context_precision"],
        "dataframe": result.to_pandas()
    }


# Run evaluation
scores = asyncio.run(evaluate_rag_pipeline("./eval_data/testset.csv"))
print(f"Faithfulness:      {scores['faithfulness']:.3f}")
print(f"Answer Relevancy:  {scores['answer_relevancy']:.3f}")
print(f"Context Recall:    {scores['context_recall']:.3f}")
print(f"Context Precision: {scores['context_precision']:.3f}")

Interpreting Scores and Diagnosing Problems

Understanding what low scores mean is as important as measuring them:

def diagnose_rag_issues(scores: dict) -> list[str]:
    """
    Interpret Ragas scores and suggest specific improvements.
    """
    issues = []

    if scores["faithfulness"] < 0.75:
        issues.append(
            "LOW FAITHFULNESS: The LLM is generating claims not supported by context. "
            "Fix: Tighten your prompt ('answer ONLY from the context below'), "
            "lower the model temperature, or use a model with less tendency to hallucinate."
        )

    if scores["answer_relevancy"] < 0.70:
        issues.append(
            "LOW ANSWER RELEVANCY: Answers are off-topic or incomplete. "
            "Fix: Review your prompt template for clarity, ensure the question is "
            "passed through correctly, and check for prompt injection in user inputs."
        )

    if scores["context_recall"] < 0.70:
        issues.append(
            "LOW CONTEXT RECALL: Retrieved context is missing key information. "
            "Fix: Increase k (retrieve more documents), improve your chunking strategy, "
            "add metadata filtering, or switch to a better embedding model."
        )

    if scores["context_precision"] < 0.70:
        issues.append(
            "LOW CONTEXT PRECISION: Too much irrelevant content being retrieved. "
            "Fix: Reduce k, use MMR (Max Marginal Relevance) retrieval, "
            "add metadata filters to scope retrieval, or improve document splitting."
        )

    if not issues:
        issues.append("All scores above threshold — pipeline is performing well.")

    return issues

issues = diagnose_rag_issues(scores)
for issue in issues:
    print(issue)
    print()

CI/CD Integration with GitHub Actions

Automated evaluation in your deployment pipeline prevents regressions:

# .github/workflows/rag-evaluation.yml
name: RAG Pipeline Evaluation

on:
  push:
    paths:
      - "src/rag/**"
      - "data/documents/**"
  pull_request:
    paths:
      - "src/rag/**"

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: pip install "ragas<0.2" langchain langchain-openai langchain-chroma

      - name: Run RAG evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python scripts/evaluate_rag.py

      - name: Check score thresholds
        run: python scripts/check_thresholds.py

The threshold checker script:

# scripts/check_thresholds.py
import json
import sys

THRESHOLDS = {
    "faithfulness": 0.80,
    "answer_relevancy": 0.75,
    "context_recall": 0.70,
    "context_precision": 0.70,
}

with open("./eval_results/scores.json") as f:
    scores = json.load(f)

failures = []
for metric, threshold in THRESHOLDS.items():
    score = scores.get(metric, 0)
    # "not score >= threshold" also fails NaN, which a plain "<" would let pass
    if not score >= threshold:
        failures.append(
            f"FAIL: {metric} = {score:.3f} (threshold: {threshold:.2f})"
        )
    else:
        print(f"PASS: {metric} = {score:.3f} (threshold: {threshold:.2f})")

if failures:
    print("\nFailed metrics:")
    for failure in failures:
        print(failure)
    sys.exit(1)

print("\nAll metrics passed!")
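The checker reads ./eval_results/scores.json, which the evaluation step has to write. The article does not show scripts/evaluate_rag.py, so the following is only a sketch of the file-writing part under stated assumptions: `write_scores` is a hypothetical helper, and it drops non-numeric entries (such as a DataFrame) and NaN values so that a metric that failed to compute is absent from the file and is read as 0 by the checker.

```python
import json
import math
import tempfile
from pathlib import Path


def write_scores(scores: dict, path: str = "./eval_results/scores.json") -> dict:
    """Persist numeric, non-NaN scores as JSON and return what was written."""
    clean = {
        name: float(value)
        for name, value in scores.items()
        if isinstance(value, (int, float)) and not math.isnan(value)
    }
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)  # the directory may not exist in CI
    out.write_text(json.dumps(clean, indent=2))
    return clean


# Demo against a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "eval_results" / "scores.json"
    written = write_scores(
        {"faithfulness": 0.91, "context_recall": float("nan"), "dataframe": object()},
        str(target),
    )
    reloaded = json.loads(target.read_text())
    print(reloaded)  # {'faithfulness': 0.91}
```

In a real script the dictionary passed in would be the scores returned by the evaluation function shown earlier.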

Ragas Metric Comparison

Metric	Requires ground truth?	Measures	Failure points to
Faithfulness	No	LLM grounding	Generation problem
Answer Relevancy	No	Answer on-topic	Prompt or LLM problem
Context Recall	Yes	Retrieval completeness	Retrieval problem
Context Precision	Yes	Retrieval noise	Retrieval problem
Answer Correctness	Yes	End-to-end accuracy	Both
Answer Similarity	Yes	Semantic closeness	Both
Context Entity Recall	Yes	Named entity coverage	Retrieval problem

Tracking Score Trends Over Time

A single score is a snapshot. A time series shows whether each change to the pipeline helped or hurt:

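import os
import sqlite3
from datetime import datetime

def save_evaluation_results(
    scores: dict,
    pipeline_version: str,
    db_path: str = "./eval_results/history.db"
):
    # sqlite3 creates the database file but not its parent directory
    os.makedirs(os.path.dirname(db_path) or ".", exist_ok=True)
    conn = sqlite3.connect(db_path)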
    conn.execute("""
        CREATE TABLE IF NOT EXISTS evaluations (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            timestamp TEXT,
            pipeline_version TEXT,
            faithfulness REAL,
            answer_relevancy REAL,
            context_recall REAL,
            context_precision REAL,
            answer_correctness REAL
        )
    """)
    conn.execute("""
        INSERT INTO evaluations VALUES (NULL, ?, ?, ?, ?, ?, ?, ?)
    """, (
        datetime.now().isoformat(),
        pipeline_version,
        scores.get("faithfulness"),
        scores.get("answer_relevancy"),
        scores.get("context_recall"),
        scores.get("context_precision"),
        scores.get("answer_correctness"),
    ))
    conn.commit()
    conn.close()

def plot_metric_trends(db_path: str):
    import pandas as pd
    import matplotlib.pyplot as plt

    conn = sqlite3.connect(db_path)
    df = pd.read_sql("SELECT * FROM evaluations ORDER BY timestamp", conn)
    conn.close()

    fig, axes = plt.subplots(2, 2, figsize=(12, 8))
    metrics = ["faithfulness", "answer_relevancy", "context_recall", "context_precision"]
    thresholds = [0.80, 0.75, 0.70, 0.70]

    for ax, metric, threshold in zip(axes.flat, metrics, thresholds):
        ax.plot(df["timestamp"], df[metric], marker="o")
        ax.axhline(y=threshold, color="r", linestyle="--", label=f"Threshold ({threshold})")
        ax.set_title(metric.replace("_", " ").title())
        ax.set_ylim(0, 1)
        ax.legend()
        ax.tick_params(axis="x", rotation=45)

    plt.tight_layout()
    plt.savefig("./eval_results/metric_trends.png", dpi=150)
    plt.show()
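Plots need someone to look at them. A stored history also allows an automatic comparison between the two most recent runs, sketched below. The example uses an in-memory SQLite table with a reduced set of columns and made-up scores; `find_regressions` and the 0.02 tolerance are assumptions for illustration.

```python
import sqlite3


def find_regressions(conn: sqlite3.Connection, metrics: list[str],
                     tolerance: float = 0.02) -> dict[str, tuple[float, float]]:
    """Return {metric: (previous, latest)} for metrics that dropped by more than tolerance."""
    # Column names come from a fixed list in code, never from user input.
    cols = ", ".join(metrics)
    rows = conn.execute(
        f"SELECT {cols} FROM evaluations ORDER BY timestamp DESC LIMIT 2"
    ).fetchall()
    if len(rows) < 2:
        return {}  # nothing to compare against yet
    latest, previous = rows
    return {
        metric: (prev, new)
        for metric, prev, new in zip(metrics, previous, latest)
        if prev is not None and new is not None and prev - new > tolerance
    }


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE evaluations (timestamp TEXT, faithfulness REAL, context_recall REAL)"
)
conn.executemany(
    "INSERT INTO evaluations VALUES (?, ?, ?)",
    [
        ("2026-05-01T10:00:00", 0.90, 0.80),
        ("2026-05-08T10:00:00", 0.82, 0.81),
    ],
)
regressions = find_regressions(conn, ["faithfulness", "context_recall"])
conn.close()
print(regressions)  # {'faithfulness': (0.9, 0.82)}
```

ISO 8601 timestamps sort correctly as text, which is why ordering by the timestamp column is enough to find the latest two runs.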

For the broader context of building reliable AI systems, see AI agents explained and Deploy AI model to production.

Frequently Asked Questions

What is Ragas and how does it evaluate RAG pipelines?
Ragas is an open-source Python library that evaluates Retrieval-Augmented Generation pipelines without requiring human-labeled ground truth for every metric. It uses LLMs to assess aspects like whether the answer is grounded in the retrieved context (faithfulness), whether the context actually contains the answer (context recall), and whether the answer addresses the question (answer relevancy).

What Ragas score should I target for a production RAG system?
There is no universal benchmark. A reasonable starting point is faithfulness above 0.85, answer relevancy above 0.80, and context recall above 0.75, calibrated against your own baseline. These targets are stricter than the CI gate used earlier in this guide (0.80, 0.75, 0.70), which acts as a regression floor rather than a goal. Thresholds also vary by domain: medical and legal applications typically require higher faithfulness scores (0.90+) to minimize hallucination risk.

How do I automatically generate test datasets for RAG evaluation?
Ragas provides a TestsetGenerator that takes your source documents and automatically creates question-answer pairs at different complexity levels: simple factual questions, reasoning questions that require inference, and multi-context questions that draw on several documents. This is useful when you do not have existing labeled evaluation data but need a representative test set.

Langchain

10 LangChain RAG Evaluation with Ragas (Metrics and Datasets)

⚡ Quick Answer

Evaluate your LangChain RAG pipelines with Ragas: faithfulness, answer relevancy, context recall, TestsetGenerator, and CI/CD integration for production quality.

AiTechWorlds Team May 31, 2026 10 min read

#LangChain #Ragas #RAG evaluation #metrics #testing

📚Part of the Langchain guide — explore all Langchain articles→

Share:Facebook Twitter/X LinkedIn Telegram WhatsApp

📱

Get more content like this on Telegram!

Daily AI tips, notes & resources — free

Join Free →

This guide covers ten evaluation patterns using Ragas: the core metrics, automated test set generation, CI/CD integration, and how to interpret scores to make concrete improvements to your pipeline.

For the RAG pipeline itself, see the RAG system tutorial. For vector store setup, see the vector database guide.

Why RAG Pipelines Need Systematic Evaluation

A RAG system has two failure modes that feel identical to the user — wrong answers — but have completely different root causes:

Retrieval failure: The right document was not retrieved. The LLM could not answer correctly because the context was irrelevant.
Generation failure: The right document was retrieved, but the LLM ignored it or hallucinated beyond it.

Without metrics that distinguish these two failures, you are guessing at fixes. Ragas gives you separate scores for each component, so you can target your improvements correctly.

Installation and Setup

pip install ragas langchain langchain-openai langchain-chroma

import os
os.environ["OPENAI_API_KEY"] = "your-openai-key"

The 10 Core Ragas Metrics

Ragas evaluates RAG systems across four dimensions, with multiple metrics in each:

1. Faithfulness

Measures whether the answer is grounded in the retrieved context. A score of 1.0 means every claim in the answer is supported by the context. A score of 0.4 means more than half the claims are not.

from ragas.metrics import faithfulness
from datasets import Dataset
from ragas import evaluate

# Sample evaluation data
eval_data = {
    "question": [
        "What is the capital of France?",
        "When was Python first released?",
    ],
    "answer": [
        "The capital of France is Paris.",
        "Python was first released in 1991 by Guido van Rossum.",
    ],
    "contexts": [
        ["France is a country in Western Europe. Its capital city is Paris."],
        ["Python is a programming language. Guido van Rossum created it in 1991."],
    ],
    "ground_truth": [
        "Paris",
        "1991",
    ],
}

dataset = Dataset.from_dict(eval_data)
result = evaluate(dataset, metrics=[faithfulness])
print(f"Faithfulness: {result['faithfulness']:.3f}")

2. Answer Relevancy

Measures how well the answer addresses the question. A high score requires the answer to be on-topic and complete — not just tangentially related.

from ragas.metrics import answer_relevancy
from ragas import evaluate

result = evaluate(dataset, metrics=[answer_relevancy])
print(f"Answer Relevancy: {result['answer_relevancy']:.3f}")

3. Context Recall

Measures whether the retrieved context contains the information needed to answer the question. This requires ground truth answers. A low score points to a retrieval problem.

from ragas.metrics import context_recall

result = evaluate(dataset, metrics=[context_recall])
print(f"Context Recall: {result['context_recall']:.3f}")

4. Context Precision

Measures whether the retrieved chunks are relevant, penalizing retrieval of noise alongside signal.

from ragas.metrics import context_precision

result = evaluate(dataset, metrics=[context_precision])
print(f"Context Precision: {result['context_precision']:.3f}")

5. Answer Correctness

End-to-end correctness: is the answer factually correct given the ground truth? This is the most comprehensive single metric.

from ragas.metrics import answer_correctness

result = evaluate(dataset, metrics=[answer_correctness])
print(f"Answer Correctness: {result['answer_correctness']:.3f}")

6. Answer Similarity

Semantic similarity between the generated answer and the ground truth, measured by embedding distance.

from ragas.metrics import answer_similarity

result = evaluate(dataset, metrics=[answer_similarity])
print(f"Answer Similarity: {result['answer_similarity']:.3f}")

7. Context Entity Recall

Checks whether named entities from the ground truth appear in the retrieved context — particularly valuable for knowledge-intensive tasks.

from ragas.metrics import context_entity_recall

result = evaluate(dataset, metrics=[context_entity_recall])
print(f"Context Entity Recall: {result['context_entity_recall']:.3f}")

8–10. Running All Metrics Together

from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    answer_correctness,
    answer_similarity,
    context_entity_recall,
)
from ragas import evaluate
import pandas as pd

ALL_METRICS = [
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    answer_correctness,
    answer_similarity,
    context_entity_recall,
]

result = evaluate(dataset, metrics=ALL_METRICS)
df = result.to_pandas()
print(df[["question"] + [m.name for m in ALL_METRICS]].to_string())

Generating Test Datasets Automatically

Manually creating evaluation datasets is tedious. Ragas's TestsetGenerator creates question-answer pairs from your documents automatically:

from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load your documents
loader = DirectoryLoader(
    "./documents",
    glob="**/*.txt",
    loader_cls=TextLoader
)
documents = loader.load()

# Split into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)

# Initialize the generator with models
generator_llm = ChatOpenAI(model="gpt-4o")
critic_llm = ChatOpenAI(model="gpt-4o")
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)

# Generate 30 test questions across three complexity levels
testset = generator.generate_with_langchain_docs(
    chunks,
    test_size=30,
    distributions={
        simple: 0.5,       # 15 simple factual questions
        reasoning: 0.3,    # 9 multi-hop reasoning questions
        multi_context: 0.2  # 6 questions requiring multiple documents
    }
)

# Convert to a DataFrame to inspect
test_df = testset.to_pandas()
print(test_df[["question", "ground_truth", "evolution_type"]].head(10))

# Save for reuse
test_df.to_csv("./eval_data/testset.csv", index=False)

Running a Full Evaluation Against Your LangChain RAG Pipeline

import asyncio
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall, context_precision

# Set up the RAG pipeline
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vector_store = Chroma(
    collection_name="docs",
    embedding_function=embeddings,
    persist_directory="./chroma_db"
)
retriever = vector_store.as_retriever(search_kwargs={"k": 4})
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

prompt = ChatPromptTemplate.from_template("""
Answer the question based on the context below.
Context: {context}
Question: {question}
""")

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)


async def run_rag_on_testset(test_questions: list[str]) -> dict:
    """
    Run the RAG pipeline on each test question and collect
    the answers and retrieved contexts.
    """
    answers = []
    contexts = []

    for question in test_questions:
        # Get retrieved docs
        retrieved_docs = await retriever.ainvoke(question)
        ctx_texts = [doc.page_content for doc in retrieved_docs]

        # Get the answer
        answer = await rag_chain.ainvoke(question)

        answers.append(answer)
        contexts.append(ctx_texts)

    return {"answers": answers, "contexts": contexts}


async def evaluate_rag_pipeline(testset_path: str) -> dict:
    import pandas as pd

    test_df = pd.read_csv(testset_path)
    questions = test_df["question"].tolist()
    ground_truths = test_df["ground_truth"].tolist()

    # Run the pipeline
    rag_output = await run_rag_on_testset(questions)

    # Build the Ragas dataset
    eval_dataset = Dataset.from_dict({
        "question": questions,
        "answer": rag_output["answers"],
        "contexts": rag_output["contexts"],
        "ground_truth": ground_truths,
    })

    # Evaluate
    result = evaluate(
        eval_dataset,
        metrics=[
            faithfulness,
            answer_relevancy,
            context_recall,
            context_precision,
        ]
    )

    return {
        "faithfulness": result["faithfulness"],
        "answer_relevancy": result["answer_relevancy"],
        "context_recall": result["context_recall"],
        "context_precision": result["context_precision"],
        "dataframe": result.to_pandas()
    }


# Run evaluation
scores = asyncio.run(evaluate_rag_pipeline("./eval_data/testset.csv"))
print(f"Faithfulness:      {scores['faithfulness']:.3f}")
print(f"Answer Relevancy:  {scores['answer_relevancy']:.3f}")
print(f"Context Recall:    {scores['context_recall']:.3f}")
print(f"Context Precision: {scores['context_precision']:.3f}")

Interpreting Scores and Diagnosing Problems

Understanding what low scores mean is as important as measuring them:

def diagnose_rag_issues(scores: dict) -> list[str]:
    """
    Interpret Ragas scores and suggest specific improvements.
    """
    issues = []

    if scores["faithfulness"] < 0.75:
        issues.append(
            "LOW FAITHFULNESS: The LLM is generating claims not supported by context. "
            "Fix: Tighten your prompt ('answer ONLY from the context below'), "
            "lower the model temperature, or use a model with less tendency to hallucinate."
        )

    if scores["answer_relevancy"] < 0.70:
        issues.append(
            "LOW ANSWER RELEVANCY: Answers are off-topic or incomplete. "
            "Fix: Review your prompt template for clarity, ensure the question is "
            "passed through correctly, and check for prompt injection in user inputs."
        )

    if scores["context_recall"] < 0.70:
        issues.append(
            "LOW CONTEXT RECALL: Retrieved context is missing key information. "
            "Fix: Increase k (retrieve more documents), improve your chunking strategy, "
            "add metadata filtering, or switch to a better embedding model."
        )

    if scores["context_precision"] < 0.70:
        issues.append(
            "LOW CONTEXT PRECISION: Too much irrelevant content being retrieved. "
            "Fix: Reduce k, use MMR (Max Marginal Relevance) retrieval, "
            "add metadata filters to scope retrieval, or improve document splitting."
        )

    if not issues:
        issues.append("All scores above threshold — pipeline is performing well.")

    return issues

issues = diagnose_rag_issues(scores)
for issue in issues:
    print(issue)
    print()
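
Aggregate scores show which stage is weak; the per-sample rows show which questions to inspect first. The sketch below assumes the rows come from `scores["dataframe"].to_dict("records")` and that each row has a `question` column plus one column per metric (column names vary between Ragas versions, so check yours). The helper name `worst_samples` and the sample rows are hypothetical.

```python
import math


def worst_samples(rows: list[dict], metric: str, n: int = 5) -> list[dict]:
    """Return the n rows with the lowest value for `metric`.

    Rows whose score is missing or NaN are skipped, since they cannot be ranked.
    """
    scored = [
        row for row in rows
        if isinstance(row.get(metric), (int, float)) and not math.isnan(row[metric])
    ]
    return sorted(scored, key=lambda row: row[metric])[:n]


# Hypothetical rows; in practice: rows = scores["dataframe"].to_dict("records")
rows = [
    {"question": "What is the refund window?", "faithfulness": 0.95},
    {"question": "Who approves expenses?", "faithfulness": 0.40},
    {"question": "How do I reset my password?", "faithfulness": float("nan")},
    {"question": "Where is the office?", "faithfulness": 0.70},
]

for row in worst_samples(rows, "faithfulness", n=2):
    print(f"{row['faithfulness']:.2f}  {row['question']}")
```

Reading the retrieved contexts for the two or three worst questions usually tells you more than the average does, because one bad chunk or one ambiguous question can drag a metric down on its own.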

CI/CD Integration with GitHub Actions

Running the evaluation on every change to the RAG code or the source documents catches regressions before they reach production:

# .github/workflows/rag-evaluation.yml
name: RAG Pipeline Evaluation

on:
  push:
    paths:
      - "src/rag/**"
      - "data/documents/**"
  pull_request:
    paths:
      - "src/rag/**"

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: pip install ragas langchain langchain-openai langchain-chroma

      - name: Run RAG evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python scripts/evaluate_rag.py

      - name: Check score thresholds
        run: python scripts/check_thresholds.py

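The threshold checker script. Its gates for faithfulness and answer relevancy (0.80 and 0.75) are stricter than the diagnostic cutoffs used earlier (0.75 and 0.70), so a run can fail CI without triggering a diagnosis message: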

# scripts/check_thresholds.py
import json
import math
import sys

THRESHOLDS = {
    "faithfulness": 0.80,
    "answer_relevancy": 0.75,
    "context_recall": 0.70,
    "context_precision": 0.70,
}

with open("./eval_results/scores.json") as fh:
    scores = json.load(fh)

failures = []
for metric, threshold in THRESHOLDS.items():
    score = scores.get(metric)
    # Missing, null, or NaN scores must fail explicitly:
    # NaN < threshold is False, so it would otherwise count as a pass.
    if score is None or math.isnan(score):
        failures.append(f"FAIL: {metric} has no valid score")
    elif score < threshold:
        failures.append(
            f"FAIL: {metric} = {score:.3f} (threshold: {threshold:.2f})"
        )
    else:
        print(f"PASS: {metric} = {score:.3f} (threshold: {threshold:.2f})")

if failures:
    print("\nFailed metrics:")
    for failure in failures:
        print(failure)
    sys.exit(1)

print("\nAll metrics passed!")
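
The checker reads `./eval_results/scores.json`, so the evaluation script has to write that file before the threshold step runs. A minimal sketch of that hand-off, assuming the `scores` dict returned by `evaluate_rag_pipeline` holds a single float per metric (the helper name `write_scores` is hypothetical):

```python
import json
import math
from pathlib import Path

METRICS = ("faithfulness", "answer_relevancy", "context_recall", "context_precision")


def write_scores(scores: dict, path: str = "./eval_results/scores.json") -> dict:
    """Write the aggregate metric scores to a JSON file and return what was written."""
    payload = {}
    for metric in METRICS:
        value = scores.get(metric)
        # Leave out missing or NaN scores; the checker then treats them as failures.
        if value is None or math.isnan(float(value)):
            continue
        payload[metric] = float(value)

    out_path = Path(path)
    # Create ./eval_results if it does not exist yet (it will not on a fresh CI runner)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(payload, indent=2))
    return payload
```

Calling `write_scores(scores)` at the end of `scripts/evaluate_rag.py` is enough for the workflow above. Only the four gated metrics are written, so non-serializable entries such as the `dataframe` key never reach `json.dumps`.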

Ragas Metric Comparison

Metric	Requires ground truth?	Measures	Failure points to
Faithfulness	No	LLM grounding	Generation problem
Answer Relevancy	No	Answer on-topic	Prompt or LLM problem
Context Recall	Yes	Retrieval completeness	Retrieval problem
Context Precision	Yes	Retrieval noise	Retrieval problem
Answer Correctness	Yes	End-to-end accuracy	Both
Answer Similarity	Yes	Semantic closeness	Both
Context Entity Recall	Yes	Named entity coverage	Retrieval problem
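
One practical consequence of the table: metrics that need no ground truth can run on live traffic, while the rest require a labeled test set. The sketch below encodes the table as plain data. The names `METRIC_INFO` and `metrics_for` are illustrative and not part of Ragas.

```python
# Each entry mirrors one row of the table: (needs ground truth, failure points to)
METRIC_INFO = {
    "faithfulness": (False, "generation"),
    "answer_relevancy": (False, "prompt or LLM"),
    "context_recall": (True, "retrieval"),
    "context_precision": (True, "retrieval"),
    "answer_correctness": (True, "both"),
    "answer_similarity": (True, "both"),
    "context_entity_recall": (True, "retrieval"),
}


def metrics_for(has_ground_truth: bool) -> list[str]:
    """Return the metric names that can be computed with the data available."""
    return [
        name for name, (needs_gt, _) in METRIC_INFO.items()
        if has_ground_truth or not needs_gt
    ]


print(metrics_for(has_ground_truth=False))      # ['faithfulness', 'answer_relevancy']
print(len(metrics_for(has_ground_truth=True)))  # 7
```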

Tracking Score Trends Over Time

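A single evaluation run tells you where the pipeline stands today. A history of runs shows when a score started to slip and which pipeline version introduced the change: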

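import sqlite3
from datetime import datetime
from pathlib import Path

def save_evaluation_results(
    scores: dict,
    pipeline_version: str,
    db_path: str = "./eval_results/history.db"
):
    # sqlite3.connect() does not create missing parent directories
    Path(db_path).parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(db_path)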
    conn.execute("""
        CREATE TABLE IF NOT EXISTS evaluations (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            timestamp TEXT,
            pipeline_version TEXT,
            faithfulness REAL,
            answer_relevancy REAL,
            context_recall REAL,
            context_precision REAL,
            answer_correctness REAL
        )
    """)
    conn.execute("""
        INSERT INTO evaluations VALUES (NULL, ?, ?, ?, ?, ?, ?, ?)
    """, (
        datetime.now().isoformat(),
        pipeline_version,
        scores.get("faithfulness"),
        scores.get("answer_relevancy"),
        scores.get("context_recall"),
        scores.get("context_precision"),
        scores.get("answer_correctness"),
    ))
    conn.commit()
    conn.close()

def plot_metric_trends(db_path: str):
    import pandas as pd
    import matplotlib.pyplot as plt

    conn = sqlite3.connect(db_path)
    df = pd.read_sql("SELECT * FROM evaluations ORDER BY timestamp", conn)
    conn.close()

    fig, axes = plt.subplots(2, 2, figsize=(12, 8))
    metrics = ["faithfulness", "answer_relevancy", "context_recall", "context_precision"]
    thresholds = [0.80, 0.75, 0.70, 0.70]

    for ax, metric, threshold in zip(axes.flat, metrics, thresholds):
        ax.plot(df["timestamp"], df[metric], marker="o")
        ax.axhline(y=threshold, color="r", linestyle="--", label=f"Threshold ({threshold})")
        ax.set_title(metric.replace("_", " ").title())
        ax.set_ylim(0, 1)
        ax.legend()
        ax.tick_params(axis="x", rotation=45)

    plt.tight_layout()
    plt.savefig("./eval_results/metric_trends.png", dpi=150)
    plt.show()
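
A chart helps a human reviewer, but the same history can also flag regressions automatically. The following sketch compares the two most recent rows of the `evaluations` table and reports any metric that dropped by more than a tolerance. The function name `find_regressions`, the default tolerance of 0.05, and the sample rows are assumptions for illustration. It relies on ISO-format timestamps sorting chronologically as text.

```python
import sqlite3

METRICS = ["faithfulness", "answer_relevancy", "context_recall", "context_precision"]


def find_regressions(conn: sqlite3.Connection, tolerance: float = 0.05) -> list[str]:
    """Compare the two most recent runs and report metrics that dropped by more than `tolerance`."""
    rows = conn.execute(
        f"SELECT pipeline_version, {', '.join(METRICS)} "
        "FROM evaluations ORDER BY timestamp DESC LIMIT 2"
    ).fetchall()
    if len(rows) < 2:
        return []  # Nothing to compare against yet

    latest, previous = rows
    regressions = []
    for i, metric in enumerate(METRICS, start=1):
        new, old = latest[i], previous[i]
        if new is None or old is None:
            continue  # Skip metrics that were not recorded in both runs
        if old - new > tolerance:
            regressions.append(
                f"{metric}: {old:.3f} ({previous[0]}) -> {new:.3f} ({latest[0]})"
            )
    return regressions


# Usage with an in-memory database that mirrors the schema above (sample data)
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE evaluations (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        timestamp TEXT, pipeline_version TEXT,
        faithfulness REAL, answer_relevancy REAL,
        context_recall REAL, context_precision REAL, answer_correctness REAL
    )
""")
conn.executemany(
    "INSERT INTO evaluations VALUES (NULL, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("2026-05-01T10:00:00", "v1", 0.85, 0.80, 0.75, 0.72, None),
        ("2026-05-02T10:00:00", "v2", 0.70, 0.78, 0.76, 0.71, None),
    ],
)
print(find_regressions(conn))  # ['faithfulness: 0.850 (v1) -> 0.700 (v2)']
```

Comparing against the previous run catches sudden drops that still sit above the absolute thresholds, which a fixed gate alone would let through.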

For the broader context of building reliable AI systems, see AI agents explained and Deploy AI model to production.

Frequently Asked Questions




