5 LangChain RAG Evaluation Metrics with Ragas (2026)

⚡ Quick Answer

Evaluate your LangChain RAG pipeline with Ragas: faithfulness, answer relevancy, context recall, context precision, and answer correctness — full Python code.

AiTechWorlds Team May 31, 2026 13 min read

#LangChain #RAG evaluation #Ragas #metrics #LLM assessment

Building a RAG system is the easy part. Knowing whether it actually works — that is harder. Without evaluation, you are making blind changes: swapping chunking strategies, adjusting retrieval k, changing prompts, with no idea whether your metrics improved or declined.

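Ragas (Retrieval-Augmented Generation Assessment) is a widely used open-source framework for evaluating RAG pipelines, including ones built with LangChain. This guide covers five of its metrics. Two of them, faithfulness and answer relevancy, are reference-free: they need only the question, the retrieved context, and the generated answer, so you can run them continuously as you iterate without manually labeled ground truth. The other three (context recall, context precision, and answer correctness) compare against a reference answer for each question.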

This guide implements a full Ragas evaluation pipeline for a LangChain RAG system. You will be able to score your pipeline before and after any change to see whether it improved.

Understanding the 5 Ragas Metrics

Faithfulness (0–1)

Measures whether every claim in the answer is supported by the retrieved context. An answer with 10 claims where 8 are in the context scores 0.80. This is the primary hallucination detector.

High faithfulness means the model is not making things up. Low faithfulness usually means the model is drawing on its training data or inventing details instead of relying on your documents.
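The scoring arithmetic for faithfulness can be sketched in a few lines. This is a minimal illustration, not Ragas' implementation: in Ragas the claims and the per-claim verdicts come from an LLM, whereas here the verdicts are supplied by hand, and `faithfulness_score` is a hypothetical helper name.

```python
def faithfulness_score(claim_supported: list[bool]) -> float:
    """Fraction of answer claims that the retrieved context supports."""
    if not claim_supported:
        raise ValueError("need at least one claim verdict")
    return sum(claim_supported) / len(claim_supported)


# Hand-labelled verdicts for a 10-claim answer: 8 supported, 2 unsupported.
verdicts = [True] * 8 + [False] * 2
print(f"faithfulness = {faithfulness_score(verdicts):.2f}")
```

One unsupported claim in a short answer moves the score much more than one in a long answer, which is worth remembering when comparing pipelines that produce answers of different lengths.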

Answer Relevancy (0–1)

Measures how directly the answer addresses the question. Ragas prompts an LLM to generate several candidate questions from the answer alone, embeds them, and averages their cosine similarity to the original question. An answer that stays on topic yields generated questions close to the one that was actually asked.

High answer relevancy means the answer is focused on what was asked. Low answer relevancy means the answer drifts off-topic or gives generic information.
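The similarity step can be illustrated with toy vectors. This is a sketch under stated assumptions: the 2-D "embeddings" below are made-up values standing in for real embedding-model output, the LLM step that generates questions from the answer is skipped, and `answer_relevancy_score` is a hypothetical helper rather than a Ragas function.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def answer_relevancy_score(
    question_vec: list[float],
    generated_question_vecs: list[list[float]],
) -> float:
    """Mean similarity between the original question and the questions
    generated back from the answer."""
    sims = [cosine_similarity(question_vec, g) for g in generated_question_vecs]
    return sum(sims) / len(sims)


# Toy embeddings (hypothetical values, not real model output).
original = [1.0, 0.0]
on_topic = [[1.0, 0.0], [0.8, 0.6]]   # generated questions close to the original
off_topic = [[0.0, 1.0], [0.6, 0.8]]  # generated questions about something else

print(f"on-topic answer:  {answer_relevancy_score(original, on_topic):.2f}")
print(f"off-topic answer: {answer_relevancy_score(original, off_topic):.2f}")
```

Because the score depends on embedding similarity, changing the evaluation embedding model can shift absolute values, so compare runs that used the same evaluator.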

Context Recall (0–1)

Requires a reference answer. Measures what percentage of the information in the reference answer is covered by the retrieved context. If the reference answer has 5 facts and the context contains 4 of them, recall is 0.80.

Low context recall means your retriever is not finding all the relevant documents.
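To make the recall arithmetic concrete, here is a rough sketch that swaps the LLM judgment for a crude stand-in: a reference fact counts as covered if it appears verbatim (case-insensitively) in any retrieved chunk. The helper name `context_recall_proxy` and the substring rule are assumptions for illustration; Ragas itself asks an LLM whether each reference statement can be attributed to the context.

```python
def context_recall_proxy(reference_facts: list[str], contexts: list[str]) -> float:
    """Share of reference facts found verbatim in the retrieved chunks."""
    if not reference_facts:
        raise ValueError("need at least one reference fact")
    joined = " ".join(contexts).lower()
    covered = [fact.lower() in joined for fact in reference_facts]
    return sum(covered) / len(covered)


reference_facts = [
    "document loader",
    "text splitter",
    "embedding model",
    "vector store",
    "retriever",
]
# The retrieved chunk mentions four of the five components.
contexts = [
    "The key components are: document loader, text splitter, "
    "embedding model, and vector store."
]

print(f"context recall = {context_recall_proxy(reference_facts, contexts):.2f}")
```

A substring check misses paraphrases, which is exactly why the real metric uses an LLM; the sketch only shows how the ratio is formed.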

Context Precision (0–1)

Requires a reference answer. Measures how much of the retrieved context is actually relevant, and whether the relevant chunks are ranked near the top. A high-precision retriever returns mostly useful chunks and puts them first. A low-precision retriever returns many chunks, most of which are noise.

Low context precision means your retriever is pulling in too much irrelevant content.
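Ragas' documentation describes context precision as rank-aware: it averages precision@k over the positions that hold a relevant chunk. The sketch below follows that description with hand-labelled relevance flags. It is an illustration, not the library's code; in Ragas an LLM decides whether each chunk is relevant, and `context_precision_score` is a hypothetical name.

```python
def context_precision_score(relevance: list[bool]) -> float:
    """Mean of precision@k taken at each rank k that holds a relevant chunk.

    `relevance` lists the retrieved chunks in rank order (best first).
    """
    hits = 0
    precisions = []
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / k)  # precision@k at this relevant rank
    return sum(precisions) / len(precisions) if precisions else 0.0


# Same number of relevant chunks, different ranking.
print(f"relevant chunk first: {context_precision_score([True, False, False]):.3f}")
print(f"relevant chunk last:  {context_precision_score([False, False, True]):.3f}")
```

Under this formulation, reranking so that useful chunks come first can raise the score even when k stays the same.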

Answer Correctness (0–1)

Requires a reference answer. Combines semantic similarity and factual overlap between the generated answer and the reference answer. This is the closest Ragas gets to "is the answer right?"
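The combination can be sketched as a weighted average of a factual F1 score and a semantic similarity score. Treat the details as assumptions: the 0.75/0.25 weighting is what I understand the Ragas default to be, the statement counts would normally come from an LLM comparing answer and reference, and the function names are hypothetical.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 over statements: tp = in both answer and reference,
    fp = only in the answer, fn = only in the reference."""
    denom = tp + 0.5 * (fp + fn)
    return tp / denom if denom else 0.0


def answer_correctness_score(
    tp: int,
    fp: int,
    fn: int,
    semantic_similarity: float,
    weights: tuple[float, float] = (0.75, 0.25),  # assumed default weighting
) -> float:
    """Weighted average of factual overlap (F1) and semantic similarity."""
    w_factual, w_semantic = weights
    factual = f1_from_counts(tp, fp, fn)
    return (w_factual * factual + w_semantic * semantic_similarity) / (
        w_factual + w_semantic
    )


# 3 statements match, 1 extra statement in the answer, 1 missing from it.
score = answer_correctness_score(tp=3, fp=1, fn=1, semantic_similarity=0.9)
print(f"answer correctness = {score:.4f}")
```

Because the factual part dominates, an answer that is fluent and similar in wording but misses reference facts still scores low.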

Setup

pip install ragas langchain langchain-openai langchain-community chromadb datasets

import os
os.environ["OPENAI_API_KEY"] = "your-key-here"

Step 1: Build the RAG Pipeline to Evaluate

First, build the RAG system you want to evaluate:

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.documents import Document

# Sample knowledge base documents
documents = [
    Document(page_content="""
    LangChain is an open-source framework for building applications powered by large language models.
    It was created by Harrison Chase and first released in October 2022.
    LangChain provides abstractions for chains, agents, memory, and retrieval.
    The framework supports over 50 LLM providers and 100+ integrations.
    """, metadata={"source": "langchain_overview.txt"}),

    Document(page_content="""
    RAG (Retrieval-Augmented Generation) combines a retrieval system with an LLM.
    In a RAG pipeline, relevant documents are retrieved from a vector store at query time.
    These documents are appended to the prompt as context before the LLM generates an answer.
    RAG reduces hallucinations because the LLM can cite specific retrieved passages.
    The key components are: document loader, text splitter, embedding model, vector store, and retriever.
    """, metadata={"source": "rag_guide.txt"}),

    Document(page_content="""
    Vector databases store high-dimensional embeddings for fast similarity search.
    Chroma is a popular open-source vector database for local development.
    Pinecone is a managed vector database with automatic scaling.
    Weaviate supports both dense and sparse vector search (hybrid search).
    FAISS is a Facebook library optimized for billion-scale vector search.
    Qdrant provides a REST and gRPC API for vector operations.
    """, metadata={"source": "vector_db_comparison.txt"}),

    Document(page_content="""
    LangChain agents use LLMs to decide which tools to call based on user input.
    The ReAct framework (Reasoning + Acting) is the most common agent architecture.
    Agents can use tools like web search, calculators, code execution, and API calls.
    The LangChain OpenAI Functions agent uses structured function calling for tool invocation.
    Agent memory can be maintained across turns using ConversationBufferMemory.
    """, metadata={"source": "agents_guide.txt"}),

    Document(page_content="""
    LangChain LCEL (LangChain Expression Language) provides a declarative syntax for building chains.
    Chains are composed using the pipe operator: chain = prompt | llm | output_parser
    LCEL chains support both sync and async execution natively.
    Runnables implement ainvoke(), astream(), and abatch() for async operation.
    The RunnableParallel class runs multiple chains concurrently and merges their outputs.
    """, metadata={"source": "lcel_guide.txt"}),
]

# Build the vector store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(documents, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# Build the generation chain
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

prompt = ChatPromptTemplate.from_messages([
    ("system", """Answer the question using only the provided context.
If the answer is not in the context, say "I don't have information about that."

Context:
{context}"""),
    ("human", "{question}"),
])

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Test it
answer = rag_chain.invoke("What is LangChain and when was it created?")
print(answer)

Step 2: Create the Evaluation Dataset

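Ragas needs a dataset of questions, retrieved contexts, generated answers, and, for the reference-based metrics, ground-truth answers. In the sample schema used below, these fields are named user_input, retrieved_contexts, response, and reference: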

from ragas import EvaluationDataset, SingleTurnSample
import asyncio

# Define evaluation questions with reference answers
eval_questions = [
    {
        "question": "What is LangChain and when was it first released?",
        "ground_truth": "LangChain is an open-source framework for building LLM applications, created by Harrison Chase and first released in October 2022."
    },
    {
        "question": "What are the key components of a RAG pipeline?",
        "ground_truth": "The key components of a RAG pipeline are: document loader, text splitter, embedding model, vector store, and retriever."
    },
    {
        "question": "What is the difference between Chroma and Pinecone?",
        "ground_truth": "Chroma is an open-source vector database for local development. Pinecone is a managed vector database with automatic scaling."
    },
    {
        "question": "What is the ReAct framework in LangChain agents?",
        "ground_truth": "ReAct (Reasoning + Acting) is the most common LangChain agent architecture where the LLM reasons about which tools to call."
    },
    {
        "question": "What does LCEL stand for and what operator does it use?",
        "ground_truth": "LCEL stands for LangChain Expression Language. It uses the pipe operator (|) to compose chains declaratively."
    },
    {
        "question": "How many LLM providers does LangChain support?",
        "ground_truth": "LangChain supports over 50 LLM providers."
    },
    {
        "question": "What is Weaviate's special feature compared to other vector databases?",
        "ground_truth": "Weaviate supports both dense and sparse vector search, also known as hybrid search."
    },
    {
        "question": "What types of tools can LangChain agents use?",
        "ground_truth": "LangChain agents can use tools like web search, calculators, code execution, and API calls."
    },
]


async def generate_evaluation_dataset(questions: list[dict]) -> list[dict]:
    """Generate answers and collect contexts for evaluation."""
    samples = []

    for item in questions:
        question = item["question"]

        # Retrieve context
        retrieved_docs = retriever.invoke(question)
        contexts = [doc.page_content for doc in retrieved_docs]

        # Generate answer
        answer = await rag_chain.ainvoke(question)

        samples.append({
            "user_input": question,
            "retrieved_contexts": contexts,
            "response": answer,
            "reference": item.get("ground_truth", ""),
        })

        print(f"Generated: {question[:60]}...")

    return samples


# Run the data generation
samples = asyncio.run(generate_evaluation_dataset(eval_questions))

# Preview the dataset
for s in samples[:2]:
    print(f"\nQ: {s['user_input']}")
    print(f"A: {s['response'][:200]}...")
    print(f"Context snippets: {len(s['retrieved_contexts'])}")
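Before spending evaluator tokens, it can help to check that every sample has the fields the metrics expect. The following is a small sketch, assuming the dict layout produced in this step; `find_sample_problems` and `REQUIRED_KEYS` are hypothetical names, not part of Ragas.

```python
REQUIRED_KEYS = ("user_input", "retrieved_contexts", "response", "reference")


def find_sample_problems(samples: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means the samples look usable."""
    problems = []
    for i, sample in enumerate(samples):
        missing = [key for key in REQUIRED_KEYS if key not in sample]
        if missing:
            problems.append(f"sample {i}: missing keys {missing}")
            continue
        if not sample["retrieved_contexts"]:
            problems.append(f"sample {i}: no retrieved contexts")
        if not str(sample["response"]).strip():
            problems.append(f"sample {i}: empty response")
        if not str(sample["reference"]).strip():
            problems.append(f"sample {i}: empty reference")
    return problems


# Example with one good sample and one that lacks contexts and a reference.
demo_samples = [
    {
        "user_input": "What does LCEL stand for?",
        "retrieved_contexts": ["LCEL is the LangChain Expression Language."],
        "response": "LangChain Expression Language.",
        "reference": "LCEL stands for LangChain Expression Language.",
    },
    {
        "user_input": "What is FAISS?",
        "retrieved_contexts": [],
        "response": "I don't have information about that.",
        "reference": "",
    },
]
for problem in find_sample_problems(demo_samples):
    print(problem)
```

An empty reference is only a problem for the three reference-based metrics; faithfulness and answer relevancy can still be scored on such samples.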

Step 3: Run Ragas Evaluation

from ragas import evaluate
from ragas.metrics import (
    Faithfulness,
    AnswerRelevancy,
    ContextRecall,
    ContextPrecision,
    AnswerCorrectness,
)
from ragas import EvaluationDataset, SingleTurnSample
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper


# Configure evaluation LLM and embeddings (can use cheaper model for evaluation)
eval_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini", temperature=0))
eval_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings(model="text-embedding-3-small"))

# Build the Ragas dataset
ragas_samples = [
    SingleTurnSample(
        user_input=s["user_input"],
        retrieved_contexts=s["retrieved_contexts"],
        response=s["response"],
        reference=s["reference"],
    )
    for s in samples
]

dataset = EvaluationDataset(samples=ragas_samples)

# Define metrics to evaluate
metrics = [
    Faithfulness(llm=eval_llm),
    AnswerRelevancy(llm=eval_llm, embeddings=eval_embeddings),
    ContextRecall(llm=eval_llm),
    ContextPrecision(llm=eval_llm),
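    # Answer correctness also has a semantic-similarity component, so pass
    # the embeddings explicitly instead of relying on a library default.
    AnswerCorrectness(llm=eval_llm, embeddings=eval_embeddings),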
]

# Run evaluation
print("Running Ragas evaluation...")
results = evaluate(
    dataset=dataset,
    metrics=metrics,
)

# Convert to pandas: one row per question, one column per metric
import pandas as pd
results_df = results.to_pandas()

# Display results. Averaging the per-question columns works whether
# results[...] returns a single score or a per-sample list, which
# depends on the Ragas version.
print("\n" + "="*50)
print("RAGAS EVALUATION RESULTS")
print("="*50)
print(f"Faithfulness:       {results_df['faithfulness'].mean():.3f}")
print(f"Answer Relevancy:   {results_df['answer_relevancy'].mean():.3f}")
print(f"Context Recall:     {results_df['context_recall'].mean():.3f}")
print(f"Context Precision:  {results_df['context_precision'].mean():.3f}")
print(f"Answer Correctness: {results_df['answer_correctness'].mean():.3f}")

print("\nPer-question breakdown:")
print(results_df[["user_input", "faithfulness", "answer_relevancy", "answer_correctness"]].to_string())
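Averages hide the individual failures, so it is often useful to pull out the worst-scoring questions for a metric. The sketch below works on plain dicts, which you could obtain from the DataFrame with `results_df.to_dict("records")`. The helper name `lowest_scoring` is hypothetical, and rows whose score is missing or NaN (for example when an evaluator call fails) are skipped.

```python
import math


def lowest_scoring(rows: list[dict], metric: str, n: int = 3) -> list[dict]:
    """Return the n rows with the lowest valid score for `metric`."""
    scored = [
        row
        for row in rows
        if isinstance(row.get(metric), (int, float)) and not math.isnan(row[metric])
    ]
    return sorted(scored, key=lambda row: row[metric])[:n]


# Stand-in rows with made-up scores, shaped like the per-question breakdown.
demo_rows = [
    {"user_input": "q1", "faithfulness": 0.9},
    {"user_input": "q2", "faithfulness": float("nan")},
    {"user_input": "q3", "faithfulness": 0.4},
    {"user_input": "q4", "faithfulness": 0.7},
]
for row in lowest_scoring(demo_rows, "faithfulness", n=2):
    print(f"{row['user_input']}: {row['faithfulness']:.2f}")
```

Reading the retrieved contexts and response for these rows usually tells you faster than the aggregate whether the retriever or the prompt is at fault.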

Step 4: Automated TestSet Generation

If you do not have a pre-built evaluation set, Ragas can generate one from your documents:

from ragas.testset import TestsetGenerator
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
generator_embeddings = LangchainEmbeddingsWrapper(
    OpenAIEmbeddings(model="text-embedding-3-small")
)

generator = TestsetGenerator(
    llm=generator_llm,
    embedding_model=generator_embeddings,
)

# Generate 20 test questions from your documents
testset = generator.generate_with_langchain_docs(
    documents=documents,
    testset_size=20,
)

# Convert to pandas to inspect
testset_df = testset.to_pandas()
print(f"Generated {len(testset_df)} test questions")
print(testset_df[["user_input", "reference"]].head())

# Save for reuse
testset_df.to_csv("evaluation_testset.csv", index=False)

Step 5: Continuous Evaluation Pipeline

Build an evaluation pipeline you can run after every change:

import json
from datetime import datetime
from pathlib import Path


def evaluate_rag_pipeline(
    rag_chain,
    retriever,
    testset_path: str = "evaluation_testset.csv",
    results_dir: str = "eval_results",
) -> dict:
    """
    Run a full Ragas evaluation and save results with timestamp.
    Returns a dict with all metric scores.
    """
    Path(results_dir).mkdir(exist_ok=True)

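    # Load test set (pandas is imported as pd in Step 3)
    testset_df = pd.read_csv(testset_path)
    questions = testset_df["user_input"].tolist()
    # Empty CSV cells load as NaN, so replace them with empty strings
    references = (
        testset_df["reference"].fillna("").tolist()
        if "reference" in testset_df.columns
        else [""] * len(questions)
    )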

    # Generate answers
    print(f"Generating answers for {len(questions)} questions...")
    ragas_samples = []

    for question, reference in zip(questions, references):
        retrieved_docs = retriever.invoke(question)
        contexts = [doc.page_content for doc in retrieved_docs]
        answer = rag_chain.invoke(question)

        ragas_samples.append(SingleTurnSample(
            user_input=question,
            retrieved_contexts=contexts,
            response=answer,
            reference=reference,
        ))

    # Evaluate
    dataset = EvaluationDataset(samples=ragas_samples)
    eval_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini", temperature=0))
    eval_embeddings = LangchainEmbeddingsWrapper(
        OpenAIEmbeddings(model="text-embedding-3-small")
    )

    results = evaluate(
        dataset=dataset,
        metrics=[
            Faithfulness(llm=eval_llm),
            AnswerRelevancy(llm=eval_llm, embeddings=eval_embeddings),
            ContextRecall(llm=eval_llm),
            ContextPrecision(llm=eval_llm),
            AnswerCorrectness(llm=eval_llm),
        ],
    )

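    # Save results. Average the per-question columns so this works whether
    # results[...] holds one score or a per-sample list (depends on Ragas version).
    results_df = results.to_pandas()
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    scores = {
        "timestamp": timestamp,
        "faithfulness": float(results_df["faithfulness"].mean()),
        "answer_relevancy": float(results_df["answer_relevancy"].mean()),
        "context_recall": float(results_df["context_recall"].mean()),
        "context_precision": float(results_df["context_precision"].mean()),
        "answer_correctness": float(results_df["answer_correctness"].mean()),
        "num_samples": len(questions),
    }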

    output_path = f"{results_dir}/eval_{timestamp}.json"
    with open(output_path, "w") as f:
        json.dump(scores, f, indent=2)

    print(f"\nResults saved to {output_path}")
    return scores


# Run evaluation
scores = evaluate_rag_pipeline(rag_chain, retriever)
print("\nFinal scores:")
for metric, score in scores.items():
    if isinstance(score, float):
        print(f"  {metric}: {score:.3f}")

Step 6: Compare Before and After Pipeline Changes

Track how each metric changes across experiments:

import json
import glob
import pandas as pd


def compare_evaluation_runs(results_dir: str = "eval_results") -> pd.DataFrame:
    """Load and compare all evaluation runs."""
    result_files = sorted(glob.glob(f"{results_dir}/eval_*.json"))

    if not result_files:
        print("No evaluation results found.")
        return pd.DataFrame()

    runs = []
    for f in result_files:
        with open(f) as fp:
            runs.append(json.load(fp))

    df = pd.DataFrame(runs)
    metric_cols = ["faithfulness", "answer_relevancy", "context_recall",
                   "context_precision", "answer_correctness"]

    print("Evaluation History:")
    print(df[["timestamp"] + metric_cols].to_string(index=False))

    if len(df) > 1:
        latest = df.iloc[-1]
        previous = df.iloc[-2]
        print("\nChange from previous run:")
        for metric in metric_cols:
            delta = latest[metric] - previous[metric]
            direction = "+" if delta >= 0 else ""
            print(f"  {metric}: {direction}{delta:.3f}")

    return df


compare_evaluation_runs()
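If you want the comparison to gate a change automatically, for example in CI, one option is to flag any metric that dropped by more than a tolerance. This is a minimal sketch: `find_regressions`, the 0.02 default tolerance, and the demo scores are all hypothetical choices, and the input dicts are assumed to have the same shape as the JSON files saved in Step 5.

```python
METRICS = [
    "faithfulness",
    "answer_relevancy",
    "context_recall",
    "context_precision",
    "answer_correctness",
]


def find_regressions(
    previous: dict,
    latest: dict,
    metrics: list[str] = METRICS,
    tolerance: float = 0.02,  # hypothetical threshold; tune to your score noise
) -> dict:
    """Return {metric: delta} for metrics that fell by more than `tolerance`."""
    regressions = {}
    for metric in metrics:
        delta = latest[metric] - previous[metric]
        if delta < -tolerance:
            regressions[metric] = round(delta, 3)
    return regressions


# Made-up scores for two runs.
previous_run = {"faithfulness": 0.90, "answer_relevancy": 0.85,
                "context_recall": 0.70, "context_precision": 0.65,
                "answer_correctness": 0.72}
latest_run = {"faithfulness": 0.84, "answer_relevancy": 0.88,
              "context_recall": 0.69, "context_precision": 0.75,
              "answer_correctness": 0.72}

regressions = find_regressions(previous_run, latest_run)
print(regressions or "No regressions beyond tolerance.")
```

LLM-judged scores vary somewhat from run to run even with temperature 0, so a tolerance of zero tends to produce false alarms, especially on small test sets.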

Comparison Table: RAG Evaluation Metrics

Metric	Needs Reference Answer	What It Detects	Target Score	Typical Score Range
Faithfulness	No	Hallucinations	> 0.85	0.60 – 0.95
Answer Relevancy	No	Off-topic answers	> 0.80	0.65 – 0.95
Context Recall	Yes	Retriever coverage gaps	> 0.75	0.50 – 0.90
Context Precision	Yes	Retriever noise	> 0.70	0.45 – 0.90
Answer Correctness	Yes	Overall accuracy	> 0.75	0.50 – 0.90

Diagnosing Low Scores

Each metric points to a different part of your pipeline:

def diagnose_low_scores(scores: dict) -> list[str]:
    """Return actionable recommendations based on metric scores."""
    recommendations = []

    if scores.get("faithfulness", 1.0) < 0.75:
        recommendations.append(
            "LOW FAITHFULNESS: LLM is hallucinating. "
            "Try: (1) Add 'only use context' instructions, "
            "(2) Use a more instruction-following model, "
            "(3) Reduce temperature to 0."
        )

    if scores.get("answer_relevancy", 1.0) < 0.75:
        recommendations.append(
            "LOW ANSWER RELEVANCY: Answers drift off-topic. "
            "Try: (1) Add format instructions to your prompt, "
            "(2) Use chain-of-thought prompting, "
            "(3) Reduce max_tokens to force conciseness."
        )

    if scores.get("context_recall", 1.0) < 0.65:
        recommendations.append(
            "LOW CONTEXT RECALL: Retriever misses relevant documents. "
            "Try: (1) Increase k (retrieve more docs), "
            "(2) Use hybrid search (dense + sparse), "
            "(3) Try smaller chunk sizes."
        )

    if scores.get("context_precision", 1.0) < 0.60:
        recommendations.append(
            "LOW CONTEXT PRECISION: Retriever pulls irrelevant docs. "
            "Try: (1) Decrease k, "
            "(2) Add EmbeddingsFilter post-retrieval, "
            "(3) Use MMR retrieval to reduce redundancy."
        )

    if scores.get("answer_correctness", 1.0) < 0.65:
        recommendations.append(
            "LOW ANSWER CORRECTNESS: Answers are factually wrong. "
            "Check faithfulness first — if high, improve retrieval recall. "
            "If both are fine, your knowledge base may be incomplete."
        )

    if not recommendations:
        recommendations.append("All metrics look healthy. Consider raising thresholds for tighter evaluation.")

    return recommendations


recommendations = diagnose_low_scores(scores)
for r in recommendations:
    print(f"\n{r}")

Integration with LangChain Pipelines

Ragas evaluation plugs directly into the pipeline from the RAG system tutorial. Run it against any LangChain retriever and chain pair; you do not need to modify your production code.

For building the RAG pipeline that you evaluate here, see the vector database guide for vector store setup and the semantic search tutorial for retrieval strategies. Once your scores are healthy, the guide on deploying an AI model to production covers the path to serving the pipeline.

Frequently Asked Questions

What is the difference between faithfulness and answer relevancy in Ragas?

Faithfulness measures whether the answer is grounded in the retrieved context — it catches hallucinations where the model uses knowledge outside the context. Answer relevancy measures whether the answer actually addresses the question — it catches cases where the model answers a different question than the one asked. A system can be faithful but irrelevant (cites real context but about the wrong topic) or relevant but unfaithful (answers the right question using information not in the context).

How much does running Ragas evaluation cost?

Every metric in this guide relies on LLM calls, and some also use embedding calls, which adds cost on top of your generation costs. A dataset of 100 question-answer pairs typically costs $0.10–0.50 with GPT-4o-mini as the evaluator. Using a smaller or cheaper model for evaluation than for generation helps manage costs. Ragas can also run against local models, for example through LangChain's Ollama integration, which removes per-token API charges when you need to evaluate frequently.

How do I create a good evaluation dataset for my RAG system?

A good evaluation dataset includes 50–200 question-answer pairs that cover the full range of queries your system will face. Include simple factoid questions, multi-hop questions requiring synthesis, and edge cases where the answer is not in your knowledge base. You can use Ragas TestsetGenerator to automatically create synthetic questions from your documents, or manually curate questions from real user queries collected during beta testing.

Langchain

5 LangChain RAG Evaluation Metrics with Ragas (2026)

⚡ Quick Answer

Evaluate your LangChain RAG pipeline with Ragas: faithfulness, answer relevancy, context recall, context precision, and answer correctness — full Python code.

AiTechWorlds Team May 31, 2026 13 min read

#LangChain #RAG evaluation #Ragas #metrics #LLM assessment

📚Part of the Langchain guide — explore all Langchain articles→

Share:Facebook Twitter/X LinkedIn Telegram WhatsApp

📱

Get more content like this on Telegram!

Daily AI tips, notes & resources — free

Join Free →

This guide implements a full Ragas evaluation pipeline for a LangChain RAG system. You will be able to score your pipeline before and after any change to see whether it improved.

Understanding the 5 Ragas Metrics

Faithfulness (0–1)

Measures whether every claim in the answer is supported by the retrieved context. An answer with 10 claims where 8 are in the context scores 0.80. This is the primary hallucination detector.

High faithfulness means the model is not making things up. Low faithfulness means the model is answering from its training data, not your documents.

Answer Relevancy (0–1)

High answer relevancy means the answer is focused on what was asked. Low answer relevancy means the answer drifts off-topic or gives generic information.

Context Recall (0–1)

Low context recall means your retriever is not finding all the relevant documents.

Context Precision (0–1)

Low context precision means your retriever is pulling in too much irrelevant content.

Answer Correctness (0–1)

Requires a reference answer. Combines semantic similarity and factual overlap between the generated answer and the reference answer. This is the closest Ragas gets to "is the answer right?"

Setup

pip install ragas langchain langchain-openai langchain-community chromadb datasets

import os
os.environ["OPENAI_API_KEY"] = "your-key-here"

Step 1: Build the RAG Pipeline to Evaluate

First, build the RAG system you want to evaluate:

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.documents import Document

# Sample knowledge base documents
documents = [
    Document(page_content="""
    LangChain is an open-source framework for building applications powered by large language models.
    It was created by Harrison Chase and first released in October 2022.
    LangChain provides abstractions for chains, agents, memory, and retrieval.
    The framework supports over 50 LLM providers and 100+ integrations.
    """, metadata={"source": "langchain_overview.txt"}),

    Document(page_content="""
    RAG (Retrieval-Augmented Generation) combines a retrieval system with an LLM.
    In a RAG pipeline, relevant documents are retrieved from a vector store at query time.
    These documents are appended to the prompt as context before the LLM generates an answer.
    RAG reduces hallucinations because the LLM can cite specific retrieved passages.
    The key components are: document loader, text splitter, embedding model, vector store, and retriever.
    """, metadata={"source": "rag_guide.txt"}),

    Document(page_content="""
    Vector databases store high-dimensional embeddings for fast similarity search.
    Chroma is a popular open-source vector database for local development.
    Pinecone is a managed vector database with automatic scaling.
    Weaviate supports both dense and sparse vector search (hybrid search).
    FAISS is a Facebook library optimized for billion-scale vector search.
    Qdrant provides a REST and gRPC API for vector operations.
    """, metadata={"source": "vector_db_comparison.txt"}),

    Document(page_content="""
    LangChain agents use LLMs to decide which tools to call based on user input.
    The ReAct framework (Reasoning + Acting) is the most common agent architecture.
    Agents can use tools like web search, calculators, code execution, and API calls.
    The LangChain OpenAI Functions agent uses structured function calling for tool invocation.
    Agent memory can be maintained across turns using ConversationBufferMemory.
    """, metadata={"source": "agents_guide.txt"}),

    Document(page_content="""
    LangChain LCEL (LangChain Expression Language) provides a declarative syntax for building chains.
    Chains are composed using the pipe operator: chain = prompt | llm | output_parser
    LCEL chains support both sync and async execution natively.
    Runnables implement ainvoke(), astream(), and abatch() for async operation.
    The RunnableParallel class runs multiple chains concurrently and merges their outputs.
    """, metadata={"source": "lcel_guide.txt"}),
]

# Build the vector store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(documents, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# Build the generation chain
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

prompt = ChatPromptTemplate.from_messages([
    ("system", """Answer the question using only the provided context.
If the answer is not in the context, say "I don't have information about that."

Context:
{context}"""),
    ("human", "{question}"),
])

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Test it
answer = rag_chain.invoke("What is LangChain and when was it created?")
print(answer)

Step 2: Create the Evaluation Dataset

Ragas needs a dataset of questions, contexts, answers, and optionally ground truth answers:

from ragas import EvaluationDataset, SingleTurnSample
import asyncio

# Define evaluation questions with reference answers
eval_questions = [
    {
        "question": "What is LangChain and when was it first released?",
        "ground_truth": "LangChain is an open-source framework for building LLM applications, created by Harrison Chase and first released in October 2022."
    },
    {
        "question": "What are the key components of a RAG pipeline?",
        "ground_truth": "The key components of a RAG pipeline are: document loader, text splitter, embedding model, vector store, and retriever."
    },
    {
        "question": "What is the difference between Chroma and Pinecone?",
        "ground_truth": "Chroma is an open-source vector database for local development. Pinecone is a managed vector database with automatic scaling."
    },
    {
        "question": "What is the ReAct framework in LangChain agents?",
        "ground_truth": "ReAct (Reasoning + Acting) is the most common LangChain agent architecture where the LLM reasons about which tools to call."
    },
    {
        "question": "What does LCEL stand for and what operator does it use?",
        "ground_truth": "LCEL stands for LangChain Expression Language. It uses the pipe operator (|) to compose chains declaratively."
    },
    {
        "question": "How many LLM providers does LangChain support?",
        "ground_truth": "LangChain supports over 50 LLM providers."
    },
    {
        "question": "What is Weaviate's special feature compared to other vector databases?",
        "ground_truth": "Weaviate supports both dense and sparse vector search, also known as hybrid search."
    },
    {
        "question": "What types of tools can LangChain agents use?",
        "ground_truth": "LangChain agents can use tools like web search, calculators, code execution, and API calls."
    },
]


async def generate_evaluation_dataset(questions: list[dict]) -> list[dict]:
    """Generate answers and collect contexts for evaluation."""
    samples = []

    for item in questions:
        question = item["question"]

        # Retrieve context (use the async call so the event loop is not blocked)
        retrieved_docs = await retriever.ainvoke(question)
        contexts = [doc.page_content for doc in retrieved_docs]

        # Generate answer
        answer = await rag_chain.ainvoke(question)

        samples.append({
            "user_input": question,
            "retrieved_contexts": contexts,
            "response": answer,
            "reference": item.get("ground_truth", ""),
        })

        print(f"Generated: {question[:60]}...")

    return samples


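# Run the data generation
# (in Jupyter, where an event loop is already running, use
#  `samples = await generate_evaluation_dataset(eval_questions)` instead)
samples = asyncio.run(generate_evaluation_dataset(eval_questions))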

# Preview the dataset
for s in samples[:2]:
    print(f"\nQ: {s['user_input']}")
    print(f"A: {s['response'][:200]}...")
    print(f"Context snippets: {len(s['retrieved_contexts'])}")
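
Before handing these samples to Ragas, it can help to check that every record has the four fields listed above and that none of them is empty. The following is a minimal sketch, not part of Ragas: `find_sample_problems` and `REQUIRED_KEYS` are hypothetical names, and the checks assume the dict layout produced by `generate_evaluation_dataset`.

```python
# Hypothetical helper: sanity-check sample dicts before evaluation.
REQUIRED_KEYS = ("user_input", "retrieved_contexts", "response", "reference")


def find_sample_problems(samples: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means the samples look usable."""
    problems = []
    for i, sample in enumerate(samples):
        missing = [key for key in REQUIRED_KEYS if key not in sample]
        if missing:
            problems.append(f"sample {i}: missing keys {missing}")
            continue  # the remaining checks need all keys present
        if not sample["retrieved_contexts"]:
            problems.append(f"sample {i}: retriever returned no contexts")
        if not sample["response"].strip():
            problems.append(f"sample {i}: empty response")
        if not sample["reference"].strip():
            problems.append(f"sample {i}: empty reference")
    return problems
```

An empty reference is worth flagging because context recall, context precision, and answer correctness all compare against it.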

Step 3: Run Ragas Evaluation

from ragas import evaluate
from ragas.metrics import (
    Faithfulness,
    AnswerRelevancy,
    ContextRecall,
    ContextPrecision,
    AnswerCorrectness,
)
from ragas import EvaluationDataset, SingleTurnSample
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper


# Configure evaluation LLM and embeddings (can use cheaper model for evaluation)
eval_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini", temperature=0))
eval_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings(model="text-embedding-3-small"))

# Build the Ragas dataset
ragas_samples = [
    SingleTurnSample(
        user_input=s["user_input"],
        retrieved_contexts=s["retrieved_contexts"],
        response=s["response"],
        reference=s["reference"],
    )
    for s in samples
]

dataset = EvaluationDataset(samples=ragas_samples)

# Define metrics to evaluate
metrics = [
    Faithfulness(llm=eval_llm),
    AnswerRelevancy(llm=eval_llm, embeddings=eval_embeddings),
    ContextRecall(llm=eval_llm),
    ContextPrecision(llm=eval_llm),
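    # Answer correctness also uses embeddings for its semantic-similarity component
    AnswerCorrectness(llm=eval_llm, embeddings=eval_embeddings),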
]

# Run evaluation
print("Running Ragas evaluation...")
results = evaluate(
    dataset=dataset,
    metrics=metrics,
)

# Convert to pandas: one row per question, one column per metric
import pandas as pd
results_df = results.to_pandas()

# Display results as the mean over all questions.
# Recent Ragas releases return per-question scores when the result is indexed
# by metric name, so average the DataFrame columns (pandas skips NaN by default).
print("\n" + "="*50)
print("RAGAS EVALUATION RESULTS")
print("="*50)
print(f"Faithfulness:       {results_df['faithfulness'].mean():.3f}")
print(f"Answer Relevancy:   {results_df['answer_relevancy'].mean():.3f}")
print(f"Context Recall:     {results_df['context_recall'].mean():.3f}")
print(f"Context Precision:  {results_df['context_precision'].mean():.3f}")
print(f"Answer Correctness: {results_df['answer_correctness'].mean():.3f}")

print("\nPer-question breakdown:")
print(results_df[["user_input", "faithfulness", "answer_relevancy", "answer_correctness"]].to_string())
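
Averages hide which questions are dragging a metric down. The sketch below, under stated assumptions, works on plain row dicts (for example from `results_df.to_dict("records")`) and treats NaN as "no score", on the assumption that a failed judgment may be recorded that way in your Ragas version. `nan_safe_mean` and `lowest_scoring` are hypothetical helper names.

```python
import math


def nan_safe_mean(values: list[float]) -> float:
    """Mean of the values that are real numbers; NaN if none are."""
    valid = [v for v in values if v is not None and not math.isnan(v)]
    return sum(valid) / len(valid) if valid else float("nan")


def lowest_scoring(rows: list[dict], metric: str, n: int = 3) -> list[dict]:
    """Return the n rows with the lowest score for one metric, skipping unscored rows."""
    scored = [
        row for row in rows
        if row.get(metric) is not None and not math.isnan(row[metric])
    ]
    return sorted(scored, key=lambda row: row[metric])[:n]
```

Reading the retrieved contexts and response for the two or three worst questions is usually more informative than the aggregate number alone.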

Step 4: Automated TestSet Generation

If you do not have a pre-built evaluation set, Ragas can generate one from your documents:

from ragas.testset import TestsetGenerator
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
generator_embeddings = LangchainEmbeddingsWrapper(
    OpenAIEmbeddings(model="text-embedding-3-small")
)

generator = TestsetGenerator(
    llm=generator_llm,
    embedding_model=generator_embeddings,
)

# Generate 20 test questions from your documents
testset = generator.generate_with_langchain_docs(
    documents=documents,
    testset_size=20,
)

# Convert to pandas to inspect
testset_df = testset.to_pandas()
print(f"Generated {len(testset_df)} test questions")
print(testset_df[["user_input", "reference"]].head())

# Save for reuse
testset_df.to_csv("evaluation_testset.csv", index=False)
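
Generated test sets are worth a quick cleanup pass before you reuse them, since an LLM-written set can contain near-identical questions or rows without a reference answer. This is a minimal sketch using only the standard library; `clean_testset_rows` is a hypothetical helper, and it assumes the `user_input` and `reference` column names saved above.

```python
import csv
import io


def clean_testset_rows(rows) -> list[dict]:
    """Drop rows with an empty question or reference and case-insensitive duplicate questions."""
    seen = set()
    cleaned = []
    for row in rows:
        question = (row.get("user_input") or "").strip()
        reference = (row.get("reference") or "").strip()
        key = question.lower()
        if not question or not reference or key in seen:
            continue
        seen.add(key)
        cleaned.append({"user_input": question, "reference": reference})
    return cleaned


# Demo on an in-memory CSV; with a real file, pass csv.DictReader(open_file) instead.
demo_csv = "user_input,reference\nWhat is LCEL?,LangChain Expression Language\nwhat is lcel?,duplicate\nNo reference,\n"
demo_rows = clean_testset_rows(csv.DictReader(io.StringIO(demo_csv)))
```

A manual read-through of the surviving questions is still advisable, because automated checks cannot tell whether a reference answer is actually correct.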

Step 5: Continuous Evaluation Pipeline

Build an evaluation pipeline you can run after every change:

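import json
from datetime import datetime
from pathlib import Path

import pandas as pd

# Also relies on the Ragas and LangChain imports from Step 3
# (evaluate, EvaluationDataset, SingleTurnSample, the metric classes, and the wrappers).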


def evaluate_rag_pipeline(
    rag_chain,
    retriever,
    testset_path: str = "evaluation_testset.csv",
    results_dir: str = "eval_results",
) -> dict:
    """
    Run a full Ragas evaluation and save results with timestamp.
    Returns a dict with all metric scores.
    """
    Path(results_dir).mkdir(exist_ok=True)

    # Load test set (empty CSV cells load as NaN, so normalize them to "")
    testset_df = pd.read_csv(testset_path)
    questions = testset_df["user_input"].tolist()
    if "reference" in testset_df.columns:
        references = testset_df["reference"].fillna("").tolist()
    else:
        references = [""] * len(questions)

    # Generate answers
    print(f"Generating answers for {len(questions)} questions...")
    ragas_samples = []

    for question, reference in zip(questions, references):
        retrieved_docs = retriever.invoke(question)
        contexts = [doc.page_content for doc in retrieved_docs]
        answer = rag_chain.invoke(question)

        ragas_samples.append(SingleTurnSample(
            user_input=question,
            retrieved_contexts=contexts,
            response=answer,
            reference=reference,
        ))

    # Evaluate
    dataset = EvaluationDataset(samples=ragas_samples)
    eval_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini", temperature=0))
    eval_embeddings = LangchainEmbeddingsWrapper(
        OpenAIEmbeddings(model="text-embedding-3-small")
    )

    results = evaluate(
        dataset=dataset,
        metrics=[
            Faithfulness(llm=eval_llm),
            AnswerRelevancy(llm=eval_llm, embeddings=eval_embeddings),
            ContextRecall(llm=eval_llm),
            ContextPrecision(llm=eval_llm),
            AnswerCorrectness(llm=eval_llm),
        ],
    )

    # Save results: average the per-question scores (pandas skips NaN by default)
    means = results.to_pandas().mean(numeric_only=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    scores = {
        "timestamp": timestamp,
        "faithfulness": float(means["faithfulness"]),
        "answer_relevancy": float(means["answer_relevancy"]),
        "context_recall": float(means["context_recall"]),
        "context_precision": float(means["context_precision"]),
        "answer_correctness": float(means["answer_correctness"]),
        "num_samples": len(questions),
    }

    output_path = f"{results_dir}/eval_{timestamp}.json"
    with open(output_path, "w") as f:
        json.dump(scores, f, indent=2)

    print(f"\nResults saved to {output_path}")
    return scores


# Run evaluation
scores = evaluate_rag_pipeline(rag_chain, retriever)
print("\nFinal scores:")
for metric, score in scores.items():
    if isinstance(score, float):
        print(f"  {metric}: {score:.3f}")
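
A timestamp alone does not tell you which change produced which scores. One option, sketched here under stated assumptions, is to store a short label with each run. `save_labeled_run` is a hypothetical helper, not part of Ragas; it writes the same `eval_<timestamp>.json` file naming used above and adds a `label` field.

```python
import json
from datetime import datetime
from pathlib import Path


def save_labeled_run(scores: dict, label: str, results_dir: str = "eval_results") -> Path:
    """Write one evaluation run to JSON, tagged with a label describing the change under test."""
    out_dir = Path(results_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    # The label and timestamp are added last so they always appear in the record.
    record = {**scores, "timestamp": timestamp, "label": label}
    path = out_dir / f"eval_{timestamp}.json"
    path.write_text(json.dumps(record, indent=2))
    return path
```

A call such as `save_labeled_run(scores, "chunk_size=500, k=4")` keeps the experiment description next to the numbers it produced.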

Step 6: Compare Before and After Pipeline Changes

Track how the metrics change across experiments:

import json
import glob
import pandas as pd


def compare_evaluation_runs(results_dir: str = "eval_results") -> pd.DataFrame:
    """Load and compare all evaluation runs."""
    result_files = sorted(glob.glob(f"{results_dir}/eval_*.json"))

    if not result_files:
        print("No evaluation results found.")
        return pd.DataFrame()

    runs = []
    for f in result_files:
        with open(f) as fp:
            runs.append(json.load(fp))

    df = pd.DataFrame(runs)
    metric_cols = ["faithfulness", "answer_relevancy", "context_recall",
                   "context_precision", "answer_correctness"]

    print("Evaluation History:")
    print(df[["timestamp"] + metric_cols].to_string(index=False))

    if len(df) > 1:
        latest = df.iloc[-1]
        previous = df.iloc[-2]
        print("\nChange from previous run:")
        for metric in metric_cols:
            delta = latest[metric] - previous[metric]
            direction = "+" if delta >= 0 else ""
            print(f"  {metric}: {direction}{delta:.3f}")

    return df


compare_evaluation_runs()
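
LLM-judged scores tend to move a little between runs even when nothing changed, so a raw delta can overstate a difference. The sketch below flags only drops larger than a tolerance. `find_regressions` is a hypothetical helper, and the default tolerance of 0.02 is an illustrative assumption that you should calibrate by evaluating the same pipeline twice.

```python
METRICS = (
    "faithfulness",
    "answer_relevancy",
    "context_recall",
    "context_precision",
    "answer_correctness",
)


def find_regressions(previous: dict, latest: dict, tolerance: float = 0.02) -> dict:
    """Return {metric: delta} for metrics that dropped by more than the tolerance."""
    regressions = {}
    for metric in METRICS:
        if metric not in previous or metric not in latest:
            continue  # skip metrics that were not recorded in both runs
        delta = latest[metric] - previous[metric]
        if delta < -tolerance:
            regressions[metric] = round(delta, 3)
    return regressions
```

With the run files from Step 5, `previous` and `latest` would be the last two dicts loaded from `eval_results`.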

Comparison Table: RAG Evaluation Metrics

Metric	Needs Reference Answer	What It Detects	Target Score	Typical Score Range
Faithfulness	No	Hallucinations	> 0.85	0.60 – 0.95
Answer Relevancy	No	Off-topic answers	> 0.80	0.65 – 0.95
Context Recall	Yes	Retriever coverage gaps	> 0.75	0.50 – 0.90
Context Precision	Yes	Retriever noise	> 0.70	0.45 – 0.90
Answer Correctness	Yes	Overall accuracy	> 0.75	0.50 – 0.90
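
The target scores in the table can be turned into a simple pass/fail gate, for example in CI. This is a minimal sketch: the `TARGETS` values are copied from the Target Score column above, and `metrics_below_target` is a hypothetical helper name.

```python
# Target scores from the comparison table; a metric passes only if it is strictly above its target.
TARGETS = {
    "faithfulness": 0.85,
    "answer_relevancy": 0.80,
    "context_recall": 0.75,
    "context_precision": 0.70,
    "answer_correctness": 0.75,
}


def metrics_below_target(scores: dict, targets: dict = TARGETS) -> dict:
    """Return {metric: (score, target)} for every reported metric that misses its target."""
    return {
        metric: (scores[metric], target)
        for metric, target in targets.items()
        if metric in scores and scores[metric] <= target
    }
```

A CI job could exit with a non-zero status whenever the returned dict is non-empty. Treat the targets as starting points and adjust them to what your own pipeline can realistically reach.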

Diagnosing Low Scores

Each metric points to a different part of your pipeline:

def diagnose_low_scores(scores: dict) -> list[str]:
    """Return actionable recommendations based on metric scores."""
    recommendations = []

    if scores.get("faithfulness", 1.0) < 0.75:
        recommendations.append(
            "LOW FAITHFULNESS: LLM is hallucinating. "
            "Try: (1) Add 'only use context' instructions, "
            "(2) Use a more instruction-following model, "
            "(3) Reduce temperature to 0."
        )

    if scores.get("answer_relevancy", 1.0) < 0.75:
        recommendations.append(
            "LOW ANSWER RELEVANCY: Answers drift off-topic. "
            "Try: (1) Add format instructions to your prompt, "
            "(2) Use chain-of-thought prompting, "
            "(3) Reduce max_tokens to force conciseness."
        )

    if scores.get("context_recall", 1.0) < 0.65:
        recommendations.append(
            "LOW CONTEXT RECALL: Retriever misses relevant documents. "
            "Try: (1) Increase k (retrieve more docs), "
            "(2) Use hybrid search (dense + sparse), "
            "(3) Try smaller chunk sizes."
        )

    if scores.get("context_precision", 1.0) < 0.60:
        recommendations.append(
            "LOW CONTEXT PRECISION: Retriever pulls irrelevant docs. "
            "Try: (1) Decrease k, "
            "(2) Add EmbeddingsFilter post-retrieval, "
            "(3) Use MMR retrieval to reduce redundancy."
        )

    if scores.get("answer_correctness", 1.0) < 0.65:
        recommendations.append(
            "LOW ANSWER CORRECTNESS: Answers are factually wrong. "
            "Check faithfulness first — if high, improve retrieval recall. "
            "If both are fine, your knowledge base may be incomplete."
        )

    if not recommendations:
        recommendations.append("All metrics look healthy. Consider raising thresholds for tighter evaluation.")

    return recommendations


recommendations = diagnose_low_scores(scores)
for r in recommendations:
    print(f"\n{r}")

Integration with LangChain Pipelines

The evaluation code above plugs directly into the pipeline from the RAG system tutorial. Pass any LangChain retriever and chain pair to evaluate_rag_pipeline; your production code does not need to change.

Frequently Asked Questions

What is the difference between faithfulness and answer relevancy in Ragas?

How much does running Ragas evaluation cost?

How do I create a good evaluation dataset for my RAG system?
