How to Use LangSmith for Debugging LangChain Apps (2026)

⚡ Quick Answer

Learn how to use LangSmith to trace, debug, and evaluate LangChain apps — with run inspection, dataset creation, A/B testing chains, and a practical debugging workflow.

AiTechWorlds Team May 31, 2026 12 min read

The first time I ran a LangChain agent with 10 tool calls and got a wrong answer, I had no idea where things went off track. Was it the retrieval step? The prompt? The tool execution? The final synthesis? Without tracing, debugging an agent is like trying to fix a car engine by listening to it from outside the garage. LangSmith opens the hood.

This guide walks through LangSmith from initial setup through advanced evaluation workflows. I will cover the features I use daily: run inspection, custom metadata, dataset creation for regression testing, and chain comparison. This is the missing piece in most LangChain tutorials.

If you are not yet using LangChain agents, the Build AI agent with LangChain guide is a good starting point. The LangChain tutorial 2025 covers the basics you will need before adding observability.

Why Observability Matters for LLM Apps

LLM applications fail in ways that traditional software does not. A bug in a REST API throws an exception you can trace. An LLM that misunderstands a prompt returns a plausible-sounding wrong answer with no error. You need tooling specifically designed to capture what the model saw, what it produced, and how long each step took.

According to LangChain's 2025 developer survey, teams using LangSmith reported a 40% reduction in debugging time for production incidents. Having visibility into what your LLM is actually doing makes that kind of difference.

Setup

First, create a LangSmith account at smith.langchain.com. The free tier is enough for everything in this guide.

pip install langchain langchain-openai langsmith python-dotenv

Get your API key from the LangSmith settings page, then add to .env:

OPENAI_API_KEY=your_openai_key
LANGCHAIN_TRACING_V2=true
LANGCHAIN_API_KEY=your_langsmith_key
LANGCHAIN_PROJECT=my-project-name   # traces go to this project
LANGCHAIN_ENDPOINT=https://api.smith.langchain.com

That is all you need to start tracing. Call load_dotenv() before you create any LangChain objects, and every LangChain call will automatically appear in your LangSmith dashboard.

from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

load_dotenv()  # loads LANGCHAIN_TRACING_V2 and LANGCHAIN_API_KEY

llm = ChatOpenAI(model="gpt-4o", temperature=0)
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("human", "{question}"),
])

chain = prompt | llm

# This call is now automatically traced in LangSmith
result = chain.invoke({"question": "What is the capital of France?"})
print(result.content)

After running this, open your LangSmith project and you will see the trace with the full prompt, response, latency, and token counts.
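
If no trace shows up, the usual cause is a missing or misspelled environment variable. The sketch below checks the variables from the .env file above before any chain runs. The helper name missing_tracing_vars is hypothetical, and the check assumes the tracing flag must be the literal string "true".

```python
import os

# Variable names are the ones used in the .env file above.
REQUIRED_VARS = ("LANGCHAIN_TRACING_V2", "LANGCHAIN_API_KEY")


def missing_tracing_vars(env=None):
    """Return the names of tracing settings that are unset, empty, or suspicious."""
    env = os.environ if env is None else env
    problems = [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
    # Assumption: tracing is only switched on when the flag is the string "true".
    flag = env.get("LANGCHAIN_TRACING_V2", "").strip().lower()
    if flag not in ("", "true"):
        problems.append("LANGCHAIN_TRACING_V2 (must be 'true')")
    return problems


if __name__ == "__main__":
    problems = missing_tracing_vars()
    print("Tracing config OK" if not problems else f"Check these settings: {problems}")
```

Run it after load_dotenv() so it sees the values loaded from .env rather than only the shell environment.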

Understanding the Trace View

The LangSmith trace view shows a hierarchical breakdown of every step in a chain run. For a simple chain like the one above, you see the parent chain run with the prompt template and the LLM call nested beneath it as child runs.

For agents with tool calls, this becomes much more detailed — each tool call shows its input arguments and return value, so you can trace exactly why the agent made each decision.
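
To make the parent and child structure concrete, here is a toy sketch that prints a run hierarchy as an indented outline. It is not LangSmith's data model: the dict fields (id, parent_id, name, ms) and the timings are made up for illustration.

```python
from collections import defaultdict


def render_trace(runs):
    """Return an indented outline of runs given dicts with id, parent_id, name, ms."""
    children = defaultdict(list)
    for run in runs:
        children[run["parent_id"]].append(run)

    lines = []

    def walk(parent_id, depth):
        # Emit each child of parent_id, then recurse into its own children.
        for run in children[parent_id]:
            lines.append(f"{'  ' * depth}{run['name']} ({run['ms']} ms)")
            walk(run["id"], depth + 1)

    walk(None, 0)  # root runs have no parent
    return "\n".join(lines)


# Hypothetical runs for the prompt | llm chain; the timings are invented.
example_runs = [
    {"id": "a", "parent_id": None, "name": "RunnableSequence", "ms": 820},
    {"id": "b", "parent_id": "a", "name": "ChatPromptTemplate", "ms": 2},
    {"id": "c", "parent_id": "a", "name": "ChatOpenAI", "ms": 815},
]
print(render_trace(example_runs))
```

Reading the outline top down is the same habit the trace view rewards: find the slow or wrong child first, then open its inputs and outputs.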

Adding Custom Metadata to Traces

Automatic tracing is a start. But you also want to know who triggered a trace, what session it belonged to, and any custom tags for filtering.

from langchain_core.runnables import RunnableConfig

config = RunnableConfig(
    metadata={
        "user_id": "user_12345",
        "session_id": "session_abc",
        "environment": "production",
        "feature_flag": "v2_prompt",
    },
    tags=["production", "v2", "customer-support"],
    run_name="customer-support-query",  # shows as the trace name
)

result = chain.invoke(
    {"question": "How do I reset my password?"},
    config=config
)

Now in LangSmith you can filter all traces by user_id, environment, or tag. For debugging production incidents, filtering by user_id to see exactly what a specific user experienced is invaluable.
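
Filtering only works if every call site uses the same metadata keys. One way to enforce that is a small builder, sketched below. The function name build_trace_config and its validation rule are assumptions for illustration. It returns a plain dict, which LangChain accepts as a config because RunnableConfig is a TypedDict.

```python
def build_trace_config(user_id, session_id, environment, extra_tags=()):
    """Build a config dict with the metadata keys and tags used above."""
    if not user_id or not session_id:
        raise ValueError("user_id and session_id are required for filtering later")
    return {
        "metadata": {
            "user_id": user_id,
            "session_id": session_id,
            "environment": environment,
        },
        # Tag with the environment so dev and production traces separate cleanly.
        "tags": [environment, *extra_tags],
        "run_name": "customer-support-query",
    }


config = build_trace_config("user_12345", "session_abc", "production", ["v2"])
# Pass it exactly like the RunnableConfig above:
# result = chain.invoke({"question": "How do I reset my password?"}, config=config)
```

Failing fast on a missing user_id is a design choice: a trace without it still records, but you cannot find it later by user.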

Tracing Non-LangChain Code with @traceable

LangSmith is not limited to LangChain components. Use the @traceable decorator on any Python function.

from langsmith import traceable

@traceable(name="database-lookup", run_type="tool")
def lookup_user_from_database(user_id: str) -> dict:
    """Simulate a database lookup."""
    # In production, this would query your actual database
    return {
        "user_id": user_id,
        "name": "Alice Johnson",
        "plan": "premium",
        "last_login": "2026-05-30"
    }

@traceable(name="fetch-user-history", run_type="tool")
def fetch_user_conversation_history(user_id: str, limit: int = 10) -> list:
    """Fetch recent conversation history for a user."""
    return [
        {"role": "user", "content": "How do I export my data?"},
        {"role": "assistant", "content": "You can export from Settings > Data Export."},
    ]

@traceable(name="generate-support-response", run_type="chain")
def generate_support_response(user_id: str, question: str) -> str:
    """Full pipeline: fetch context, generate response."""
    user_info = lookup_user_from_database(user_id)
    history = fetch_user_conversation_history(user_id)

    context = f"User: {user_info['name']}, Plan: {user_info['plan']}"
    prompt_text = f"{context}\n\nHistory: {history[-2:]}\n\nQuestion: {question}"

    response = llm.invoke(prompt_text)
    return response.content

# All three functions appear as nested spans in LangSmith
answer = generate_support_response("user_12345", "Can I share my account?")
print(answer)

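The trace tree shows generate-support-response as the parent with database-lookup and fetch-user-history as children, each with their own timing and I/O data. The llm.invoke call shows up under the same parent, because LangChain models are traced automatically.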

Inspecting Runs Programmatically

Beyond the UI, LangSmith has a Python client for programmatic access to your traces. This is useful for automated quality checks and feeding data into evaluation pipelines.

from langsmith import Client

client = Client()

# List recent runs for a project
runs = list(client.list_runs(
    project_name="my-project-name",
    run_type="chain",
    limit=10,
    order="desc"
))

print(f"Recent runs: {len(runs)}")

for run in runs[:3]:
    print(f"\n--- Run: {run.name} ---")
    print(f"  Status: {run.status}")
    print(f"  Latency: {run.end_time - run.start_time if run.end_time else 'N/A'}")
    print(f"  Tokens: {run.total_tokens}")
    print(f"  Input: {str(run.inputs)[:100]}")
    print(f"  Output: {str(run.outputs)[:100]}")

# Filter runs by metadata
filtered_runs = list(client.list_runs(
    project_name="my-project-name",
    filter='and(eq(metadata_key, "environment"), eq(metadata_value, "production"))',
    limit=50,
))
print(f"\nProduction runs: {len(filtered_runs)}")
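
For automated quality checks, the listed runs can be reduced to a few numbers. The sketch below is one way to do that. It only reads the start_time, end_time, and error attributes, so it works on the runs returned above. The stand-in objects exist only so the example runs without a LangSmith account.

```python
from datetime import datetime, timedelta
from statistics import median
from types import SimpleNamespace


def summarize_runs(runs):
    """Return finished-run count, error count, and median latency in seconds."""
    latencies = [
        (run.end_time - run.start_time).total_seconds()
        for run in runs
        if run.end_time is not None  # skip runs that are still in progress
    ]
    errors = sum(1 for run in runs if getattr(run, "error", None))
    return {
        "finished": len(latencies),
        "errors": errors,
        "median_latency_s": median(latencies) if latencies else None,
    }


# Stand-in runs with made-up timings; replace with the `runs` list from list_runs.
start = datetime(2026, 5, 31, 12, 0, 0)
fake_runs = [
    SimpleNamespace(start_time=start, end_time=start + timedelta(seconds=1.5), error=None),
    SimpleNamespace(start_time=start, end_time=start + timedelta(seconds=0.5), error="Timeout"),
    SimpleNamespace(start_time=start, end_time=None, error=None),
]
print(summarize_runs(fake_runs))
```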

Creating Evaluation Datasets

One of LangSmith's most valuable features is dataset management for regression testing. You create a dataset of question-answer pairs, then run your chain against it to measure accuracy.

from langsmith import Client

client = Client()

# Create a dataset
dataset = client.create_dataset(
    dataset_name="customer-support-qa-v1",
    description="Evaluation dataset for customer support agent"
)

# Add test cases
test_cases = [
    {
        "question": "How do I reset my password?",
        "expected_answer": "You can reset your password from the login page by clicking 'Forgot password'."
    },
    {
        "question": "What payment methods do you accept?",
        "expected_answer": "We accept Visa, Mastercard, American Express, and PayPal."
    },
    {
        "question": "How do I cancel my subscription?",
        "expected_answer": "You can cancel from Settings > Billing > Cancel Subscription."
    },
    {
        "question": "Is my data encrypted?",
        "expected_answer": "Yes, all data is encrypted at rest with AES-256 and in transit with TLS 1.3."
    },
]

# Create examples in the dataset
for case in test_cases:
    client.create_example(
        inputs={"question": case["question"]},
        outputs={"expected_answer": case["expected_answer"]},
        dataset_id=dataset.id
    )

print(f"Dataset '{dataset.name}' created with {len(test_cases)} examples")
print(f"Dataset ID: {dataset.id}")
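
Test cases pulled from logs or spreadsheets often contain blanks and repeats. A minimal sketch of a cleanup step before upload follows. The helper name prepare_examples and the deduplication rule (case-insensitive match on the question) are assumptions, not part of the LangSmith client.

```python
def prepare_examples(test_cases):
    """Drop blank and duplicate questions, then split cases into inputs and outputs."""
    seen = set()
    inputs, outputs = [], []
    for case in test_cases:
        question = case.get("question", "").strip()
        answer = case.get("expected_answer", "").strip()
        key = question.lower()
        if not question or not answer or key in seen:
            continue  # skip incomplete or repeated cases
        seen.add(key)
        inputs.append({"question": question})
        outputs.append({"expected_answer": answer})
    return inputs, outputs


raw_cases = [
    {"question": "How do I reset my password?", "expected_answer": "Use 'Forgot password'."},
    {"question": "how do i reset my password?", "expected_answer": "Duplicate entry."},
    {"question": "Is my data encrypted?", "expected_answer": ""},
]
inputs, outputs = prepare_examples(raw_cases)
print(f"Kept {len(inputs)} of {len(raw_cases)} cases")
```

Each kept pair can then go through the same client.create_example call shown above, by looping over zip(inputs, outputs).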

Running Evaluations Against a Dataset

# Reuses `chain` and `ChatOpenAI` from the setup section
from langsmith.evaluation import evaluate, LangChainStringEvaluator

# Define the chain to evaluate (your production chain)
def run_chain(inputs: dict) -> dict:
    result = chain.invoke({"question": inputs["question"]})
    return {"answer": result.content}

# Define evaluators
evaluators = [
    # Checks if the answer contains the key facts from the expected answer
    LangChainStringEvaluator(
        "qa",
        config={"llm": ChatOpenAI(model="gpt-4o-mini", temperature=0)}
    ),
]

# Run the evaluation
results = evaluate(
    run_chain,
    data="customer-support-qa-v1",
    evaluators=evaluators,
    experiment_prefix="baseline-gpt4o",
    metadata={"model": "gpt-4o", "prompt_version": "v1"},
)

print(f"\nEvaluation complete: {results.experiment_name}")
# Results appear in LangSmith UI under the Datasets tab

Comparing Two Chains (A/B Testing Prompts)

This is where LangSmith earns its cost. Run two different prompt versions against the same dataset and compare accuracy side by side.

from langsmith.evaluation import evaluate, LangChainStringEvaluator

# Version 1: Original prompt
prompt_v1 = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful customer support assistant."),
    ("human", "{question}"),
])
chain_v1 = prompt_v1 | llm

# Version 2: More detailed system prompt
prompt_v2 = ChatPromptTemplate.from_messages([
    ("system", """You are a customer support specialist for a SaaS product.
    Answer questions accurately and concisely.
    If you don't know the answer, say so rather than guessing.
    Keep responses under 100 words."""),
    ("human", "{question}"),
])
chain_v2 = prompt_v2 | llm

def run_chain_v1(inputs: dict) -> dict:
    return {"answer": chain_v1.invoke({"question": inputs["question"]}).content}

def run_chain_v2(inputs: dict) -> dict:
    return {"answer": chain_v2.invoke({"question": inputs["question"]}).content}

evaluators = [
    LangChainStringEvaluator(
        "qa",
        config={"llm": ChatOpenAI(model="gpt-4o-mini", temperature=0)}
    ),
]

# Run both experiments
results_v1 = evaluate(
    run_chain_v1,
    data="customer-support-qa-v1",
    evaluators=evaluators,
    experiment_prefix="prompt-v1",
)

results_v2 = evaluate(
    run_chain_v2,
    data="customer-support-qa-v1",
    evaluators=evaluators,
    experiment_prefix="prompt-v2",
)

print("Experiments complete. Compare results in LangSmith UI.")
# In the UI, both experiments appear in the Datasets view
# You can click 'Compare' to see a side-by-side score breakdown
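
If you export per-example scores from both experiments, a few lines of Python give the same comparison outside the UI. The sketch below assumes two score lists in the same example order. The function name compare_experiments and the sample scores are hypothetical.

```python
from statistics import mean


def compare_experiments(scores_a, scores_b):
    """Compare per-example scores from two experiments on the same examples."""
    if not scores_a or len(scores_a) != len(scores_b):
        raise ValueError("need two non-empty score lists of equal length")
    pairs = list(zip(scores_a, scores_b))
    return {
        "mean_a": mean(scores_a),
        "mean_b": mean(scores_b),
        "a_better": sum(1 for a, b in pairs if a > b),
        "b_better": sum(1 for a, b in pairs if b > a),
        "ties": sum(1 for a, b in pairs if a == b),
    }


# Made-up scores for prompt-v1 and prompt-v2 on a four-example dataset.
print(compare_experiments([1, 0, 1, 0], [1, 1, 1, 0]))
```

With only four examples, a one-example difference like this is weak evidence. The per-example win counts are there to show how thin the margin is before you trust the means.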

Debugging a Failing Agent

Here is a real debugging workflow for when an agent gives wrong answers.

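from langchain.agents import create_openai_tools_agent, AgentExecutor
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables import RunnableConfig
from langchain_core.tools import tool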

@tool
def get_product_info(product_name: str) -> str:
    """Get information about a product from the catalog."""
    catalog = {
        "python course": "Python Complete Course: $149.99, 40 hours of content",
        "ai toolkit": "AI Toolkit Pro: $299.99, includes 50 pre-built agents",
    }
    return catalog.get(product_name.lower(), f"Product '{product_name}' not found")

@tool
def calculate_discount(original_price: float, discount_percent: float) -> str:
    """Calculate the discounted price."""
    discount = original_price * (discount_percent / 100)
    final = original_price - discount
    return f"Original: ${original_price}, Discount: ${discount:.2f}, Final: ${final:.2f}"

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a sales assistant. Help customers with product information and pricing."),
    ("human", "{input}"),
    MessagesPlaceholder(variable_name="agent_scratchpad"),
])

agent = create_openai_tools_agent(llm=llm, tools=[get_product_info, calculate_discount], prompt=prompt)
agent_executor = AgentExecutor(agent=agent, tools=[get_product_info, calculate_discount], verbose=True)

# Run with tracing metadata for easy lookup
config = RunnableConfig(
    metadata={"debug_session": "pricing_issue_2026_05_31"},
    run_name="sales-agent-pricing-test",
)

result = agent_executor.invoke(
    {"input": "What's the price of the Python Course with a 20% discount?"},
    config=config
)
print(result["output"])

Now in LangSmith, filter by debug_session: pricing_issue_2026_05_31 to find this exact run. You can see:

Which tools were called and in what order
The exact arguments passed to each tool
The tool's return value
The model's reasoning between tool calls
Token usage at each step

This makes it trivial to find where the agent went wrong — was the tool called with wrong arguments? Did the tool return unexpected data? Did the model misinterpret the tool output?
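
Once the trace shows the arguments a tool received, replaying that call outside the agent tells you whether the tool or the model is at fault. The sketch below copies the arithmetic from calculate_discount into a plain function so it runs without LangChain. The replayed arguments are hypothetical, chosen to show a model passing a fraction where a percentage was expected.

```python
def calculate_discount_plain(original_price: float, discount_percent: float) -> str:
    """Same arithmetic as the calculate_discount tool, without the @tool wrapper."""
    discount = original_price * (discount_percent / 100)
    final = original_price - discount
    return f"Original: ${original_price}, Discount: ${discount:.2f}, Final: ${final:.2f}"


# Hypothetical arguments copied from two tool calls in a trace.
as_percent = calculate_discount_plain(149.99, 20)    # percent passed as 20
as_fraction = calculate_discount_plain(149.99, 0.2)  # fraction passed by mistake

print(as_percent)   # Original: $149.99, Discount: $30.00, Final: $119.99
print(as_fraction)  # Original: $149.99, Discount: $0.30, Final: $149.69
```

Here the tool is correct in both cases. If the trace showed the second call, a fix worth trying would be a docstring that states the expected unit, for example "discount_percent is a number from 0 to 100".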

Setting Up Automated Regression Tests

Once you have an evaluation dataset, run it in CI/CD to catch regressions before they reach production:

# run_evals.py — run this in your CI pipeline
import sys
from langsmith.evaluation import evaluate, LangChainStringEvaluator
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

PASS_THRESHOLD = 0.80  # require 80%+ accuracy to pass

llm = ChatOpenAI(model="gpt-4o", temperature=0)
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful customer support assistant."),
    ("human", "{question}"),
])
chain = prompt | llm

def run_chain(inputs: dict) -> dict:
    return {"answer": chain.invoke({"question": inputs["question"]}).content}

evaluators = [
    LangChainStringEvaluator(
        "qa",
        config={"llm": ChatOpenAI(model="gpt-4o-mini", temperature=0)}
    ),
]

results = evaluate(
    run_chain,
    data="customer-support-qa-v1",
    evaluators=evaluators,
    experiment_prefix="ci-check",
)

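# Check pass rate. Iterating the results yields one row per example with the keys
# "run", "example", and "evaluation_results" (verify against your langsmith version).
eval_results = list(results)

def first_score(row) -> float:
    """Return the first evaluator score recorded for a row, or 0 if there is none."""
    scores = [r.score for r in row["evaluation_results"]["results"] if r.score is not None]
    return scores[0] if scores else 0

pass_count = sum(1 for row in eval_results if first_score(row) >= 0.5)
total = len(eval_results)
pass_rate = pass_count / total if total > 0 else 0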

print(f"Pass rate: {pass_rate:.1%} ({pass_count}/{total})")

if pass_rate < PASS_THRESHOLD:
    print(f"FAILED: Pass rate {pass_rate:.1%} is below threshold {PASS_THRESHOLD:.1%}")
    sys.exit(1)
else:
    print(f"PASSED: Pass rate {pass_rate:.1%} meets threshold")
    sys.exit(0)
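
The threshold logic is easier to trust when it lives in a pure function you can unit test without calling any model. A minimal sketch follows. The function name ci_exit_code is hypothetical, and treating an empty or ungraded run as a failure is an assumption that matches the script above, where a zero pass rate fails the build.

```python
def ci_exit_code(scores, threshold=0.80, pass_score=0.5):
    """Return 0 if the share of passing scores meets the threshold, else 1."""
    graded = [s for s in scores if s is not None]  # an evaluator may return no score
    if not graded:
        return 1  # nothing graded counts as a failure, not a silent pass
    pass_rate = sum(1 for s in graded if s >= pass_score) / len(graded)
    return 0 if pass_rate >= threshold else 1


print(ci_exit_code([1, 1, 1, 1, 0]))  # 4 of 5 pass, meets the 80% threshold
print(ci_exit_code([1, 0, 1, 0]))     # 2 of 4 pass, below the threshold
```

In run_evals.py, the final block could then collapse to collecting the scores and calling sys.exit(ci_exit_code(scores)).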

LangSmith vs Alternative Observability Tools

Feature	LangSmith	Weights & Biases	Arize Phoenix	Custom Logging
LangChain auto-tracing	Yes, native	Partial	Yes	Manual
Multi-step trace view	Yes	Limited	Yes	Manual
Evaluation datasets	Yes, built-in	Yes	Limited	Manual
A/B prompt comparison	Yes, native	Yes	Limited	Manual
Cost per trace	Free tier / paid	Paid	Open source	Free
Setup complexity	Low	Medium	Medium	High
Self-hosted option	Yes (Enterprise)	No	Yes	Yes

LangSmith wins on ease of setup for LangChain projects. Arize Phoenix is the strongest alternative for teams that need open-source or self-hosted solutions.

What to Build Next

With LangSmith wired up, the next step is building evaluation datasets that reflect your actual user queries. Pull real production queries from your logs (anonymized), create ground truth answers, and run weekly evaluations as your prompt or model changes.

For the agents side, Build AI agent with LangChain shows the agent patterns that benefit most from LangSmith tracing. If you are building RAG pipelines, the RAG system tutorial pairs well with LangSmith for measuring retrieval quality. For deploying traced applications to production, Deploy AI model to production covers the infrastructure side.

Conclusion

LangSmith transforms LLM development from guesswork into engineering. The automatic tracing alone is worth the five-minute setup — every chain call becomes debuggable without adding any extra logging code. The evaluation and comparison features take you further: once you have a test dataset, you can compare prompt changes objectively rather than relying on vibes.

My personal workflow: trace everything in development, add custom metadata for production users, create evaluation datasets from real production queries, and run evaluations in CI before every deploy. That workflow has caught real regressions before they reached users multiple times.

Set it up today, even for a simple project. Having trace history from day one is much better than wishing you had it after a production incident.

FAQs

Is LangSmith free to use?

LangSmith has a free tier that includes 5,000 traces per month and basic evaluation features. The Developer plan ($39/month) extends to 50,000 traces and unlocks comparison datasets and automated evaluations. For teams, there are Enterprise plans with custom pricing and self-hosted deployment options.

Can I use LangSmith with non-LangChain LLM calls?

Yes. LangSmith can trace any Python code using its @traceable decorator, not just LangChain chains. You can wrap OpenAI SDK calls, Anthropic SDK calls, or any custom function. The decorator captures inputs, outputs, latency, and any metadata you pass to it.

How do I share a specific trace with a teammate for debugging?

In the LangSmith UI, open the trace you want to share, click the share button in the top right, and copy the shareable link. This link gives view access to that specific trace. You can also share full project traces by inviting teammates to your LangSmith workspace.

Langchain

How to Use LangSmith for Debugging LangChain Apps (2026)

⚡ Quick Answer

Learn how to use LangSmith to trace, debug, and evaluate LangChain apps — with run inspection, dataset creation, A/B testing chains, and a practical debugging workflow.

AiTechWorlds Team May 31, 2026 12 min read

#LangChain #LangSmith #debugging #tracing #observability

📚Part of the Langchain guide — explore all Langchain articles→

Share:Facebook Twitter/X LinkedIn Telegram WhatsApp

📱

Get more content like this on Telegram!

Daily AI tips, notes & resources — free

Join Free →

If you are not yet using LangChain agents, the Build AI agent with LangChain guide is a good starting point. The LangChain tutorial 2025 covers the basics you will need before adding observability.

Why Observability Matters for LLM Apps

Setup

First, create a LangSmith account at smith.langchain.com. The free tier is enough for everything in this guide.

pip install langchain langchain-openai langsmith python-dotenv

Get your API key from the LangSmith settings page, then add to .env:

OPENAI_API_KEY=your_openai_key
LANGCHAIN_TRACING_V2=true
LANGCHAIN_API_KEY=your_langsmith_key
LANGCHAIN_PROJECT=my-project-name   # traces go to this project
LANGCHAIN_ENDPOINT=https://api.smith.langchain.com

That is genuinely all you need to start tracing. Import load_dotenv() and every LangChain call will automatically appear in your LangSmith dashboard.

from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

load_dotenv()  # loads LANGCHAIN_TRACING_V2 and LANGCHAIN_API_KEY

llm = ChatOpenAI(model="gpt-4o", temperature=0)
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("human", "{question}"),
])

chain = prompt | llm

# This call is now automatically traced in LangSmith
result = chain.invoke({"question": "What is the capital of France?"})
print(result.content)

After running this, open your LangSmith project and you will see the trace with the full prompt, response, latency, and token counts.

Understanding the Trace View

The LangSmith trace view shows a hierarchical breakdown of every step in a chain run. For a simple chain like the one above, you see:

For agents with tool calls, this becomes much more detailed — each tool call shows its input arguments and return value, so you can trace exactly why the agent made each decision.

Adding Custom Metadata to Traces

Automatic tracing is a start. But you also want to know who triggered a trace, what session it belonged to, and any custom tags for filtering.

from langchain_core.runnables import RunnableConfig

config = RunnableConfig(
    metadata={
        "user_id": "user_12345",
        "session_id": "session_abc",
        "environment": "production",
        "feature_flag": "v2_prompt",
    },
    tags=["production", "v2", "customer-support"],
    run_name="customer-support-query",  # shows as the trace name
)

result = chain.invoke(
    {"question": "How do I reset my password?"},
    config=config
)

Tracing Non-LangChain Code with @traceable

LangSmith is not limited to LangChain components. Use the @traceable decorator on any Python function.

from langsmith import traceable

@traceable(name="database-lookup", run_type="tool")
def lookup_user_from_database(user_id: str) -> dict:
    """Simulate a database lookup."""
    # In production, this would query your actual database
    return {
        "user_id": user_id,
        "name": "Alice Johnson",
        "plan": "premium",
        "last_login": "2026-05-30"
    }

@traceable(name="fetch-user-history", run_type="tool")
def fetch_user_conversation_history(user_id: str, limit: int = 10) -> list:
    """Fetch recent conversation history for a user."""
    return [
        {"role": "user", "content": "How do I export my data?"},
        {"role": "assistant", "content": "You can export from Settings > Data Export."},
    ]

@traceable(name="generate-support-response", run_type="chain")
def generate_support_response(user_id: str, question: str) -> str:
    """Full pipeline: fetch context, generate response."""
    user_info = lookup_user_from_database(user_id)
    history = fetch_user_conversation_history(user_id)

    context = f"User: {user_info['name']}, Plan: {user_info['plan']}"
    prompt_text = f"{context}\n\nHistory: {history[-2:]}\n\nQuestion: {question}"

    response = llm.invoke(prompt_text)
    return response.content

# All three functions appear as nested spans in LangSmith
answer = generate_support_response("user_12345", "Can I share my account?")
print(answer)

The trace tree shows generate-support-response as the parent with database-lookup and fetch-user-history as children, each with their own timing and I/O data.

Inspecting Runs Programmatically

Beyond the UI, LangSmith has a Python client for programmatic access to your traces. This is useful for automated quality checks and feeding data into evaluation pipelines.

from langsmith import Client

client = Client()

# List recent runs for a project
runs = list(client.list_runs(
    project_name="my-project-name",
    run_type="chain",
    limit=10,
    order="desc"
))

print(f"Recent runs: {len(runs)}")

for run in runs[:3]:
    print(f"\n--- Run: {run.name} ---")
    print(f"  Status: {run.status}")
    print(f"  Latency: {run.end_time - run.start_time if run.end_time else 'N/A'}")
    print(f"  Tokens: {run.total_tokens}")
    print(f"  Input: {str(run.inputs)[:100]}")
    print(f"  Output: {str(run.outputs)[:100]}")

# Filter runs by metadata
filtered_runs = list(client.list_runs(
    project_name="my-project-name",
    filter='and(eq(metadata_key, "environment"), eq(metadata_value, "production"))',
    limit=50,
))
print(f"\nProduction runs: {len(filtered_runs)}")

Creating Evaluation Datasets

One of LangSmith's most valuable features is dataset management for regression testing. You create a dataset of question-answer pairs, then run your chain against it to measure accuracy.

from langsmith import Client

client = Client()

# Create a dataset
dataset = client.create_dataset(
    dataset_name="customer-support-qa-v1",
    description="Evaluation dataset for customer support agent"
)

# Add test cases
test_cases = [
    {
        "question": "How do I reset my password?",
        "expected_answer": "You can reset your password from the login page by clicking 'Forgot password'."
    },
    {
        "question": "What payment methods do you accept?",
        "expected_answer": "We accept Visa, Mastercard, American Express, and PayPal."
    },
    {
        "question": "How do I cancel my subscription?",
        "expected_answer": "You can cancel from Settings > Billing > Cancel Subscription."
    },
    {
        "question": "Is my data encrypted?",
        "expected_answer": "Yes, all data is encrypted at rest with AES-256 and in transit with TLS 1.3."
    },
]

# Create examples in the dataset
for case in test_cases:
    client.create_example(
        inputs={"question": case["question"]},
        outputs={"expected_answer": case["expected_answer"]},
        dataset_id=dataset.id
    )

print(f"Dataset '{dataset.name}' created with {len(test_cases)} examples")
print(f"Dataset ID: {dataset.id}")

Running Evaluations Against a Dataset

from langchain.smith import RunEvalConfig
from langsmith.evaluation import evaluate, LangChainStringEvaluator

# Define the chain to evaluate (your production chain)
def run_chain(inputs: dict) -> dict:
    result = chain.invoke({"question": inputs["question"]})
    return {"answer": result.content}

# Define evaluators
evaluators = [
    # Checks if the answer contains the key facts from the expected answer
    LangChainStringEvaluator(
        "qa",
        config={"llm": ChatOpenAI(model="gpt-4o-mini", temperature=0)}
    ),
]

# Run the evaluation
results = evaluate(
    run_chain,
    data="customer-support-qa-v1",
    evaluators=evaluators,
    experiment_prefix="baseline-gpt4o",
    metadata={"model": "gpt-4o", "prompt_version": "v1"},
)

print(f"\nEvaluation complete: {results.experiment_name}")
# Results appear in LangSmith UI under the Datasets tab

Comparing Two Chains (A/B Testing Prompts)

This is where LangSmith earns its cost. Run two different prompt versions against the same dataset and compare accuracy side by side.

from langsmith.evaluation import evaluate, LangChainStringEvaluator

# Version 1: Original prompt
prompt_v1 = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful customer support assistant."),
    ("human", "{question}"),
])
chain_v1 = prompt_v1 | llm

# Version 2: More detailed system prompt
prompt_v2 = ChatPromptTemplate.from_messages([
    ("system", """You are a customer support specialist for a SaaS product.
    Answer questions accurately and concisely.
    If you don't know the answer, say so rather than guessing.
    Keep responses under 100 words."""),
    ("human", "{question}"),
])
chain_v2 = prompt_v2 | llm

def run_chain_v1(inputs: dict) -> dict:
    return {"answer": chain_v1.invoke({"question": inputs["question"]}).content}

def run_chain_v2(inputs: dict) -> dict:
    return {"answer": chain_v2.invoke({"question": inputs["question"]}).content}

evaluators = [
    LangChainStringEvaluator(
        "qa",
        config={"llm": ChatOpenAI(model="gpt-4o-mini", temperature=0)}
    ),
]

# Run both experiments
results_v1 = evaluate(
    run_chain_v1,
    data="customer-support-qa-v1",
    evaluators=evaluators,
    experiment_prefix="prompt-v1",
)

results_v2 = evaluate(
    run_chain_v2,
    data="customer-support-qa-v1",
    evaluators=evaluators,
    experiment_prefix="prompt-v2",
)

print("Experiments complete. Compare results in LangSmith UI.")
# In the UI, both experiments appear in the Datasets view
# You can click 'Compare' to see a side-by-side score breakdown

Debugging a Failing Agent

Here is a real debugging workflow for when an agent gives wrong answers.

from langchain.agents import create_openai_tools_agent, AgentExecutor
from langchain_core.tools import tool

@tool
def get_product_info(product_name: str) -> str:
    """Get information about a product from the catalog."""
    catalog = {
        "python course": "Python Complete Course: $149.99, 40 hours of content",
        "ai toolkit": "AI Toolkit Pro: $299.99, includes 50 pre-built agents",
    }
    return catalog.get(product_name.lower(), f"Product '{product_name}' not found")

@tool
def calculate_discount(original_price: float, discount_percent: float) -> str:
    """Calculate the discounted price."""
    discount = original_price * (discount_percent / 100)
    final = original_price - discount
    return f"Original: ${original_price}, Discount: ${discount:.2f}, Final: ${final:.2f}"

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a sales assistant. Help customers with product information and pricing."),
    ("human", "{input}"),
    MessagesPlaceholder(variable_name="agent_scratchpad"),
])

agent = create_openai_tools_agent(llm=llm, tools=[get_product_info, calculate_discount], prompt=prompt)
agent_executor = AgentExecutor(agent=agent, tools=[get_product_info, calculate_discount], verbose=True)

# Run with tracing metadata for easy lookup
config = RunnableConfig(
    metadata={"debug_session": "pricing_issue_2026_05_31"},
    run_name="sales-agent-pricing-test",
)

result = agent_executor.invoke(
    {"input": "What's the price of the Python Course with a 20% discount?"},
    config=config
)
print(result["output"])

Now in LangSmith, filter by debug_session: pricing_issue_2026_05_31 to find this exact run. You can see:

Which tools were called and in what order
The exact arguments passed to each tool
The tool's return value
The model's reasoning between tool calls
Token usage at each step

This makes it much easier to pinpoint where the agent went wrong. Was the tool called with the wrong arguments? Did the tool return unexpected data? Did the model misinterpret the tool output?
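
One way to settle whether a tool returned unexpected data is to replay the arguments from the trace against the tool logic outside the agent. The sketch below copies the bodies of the two tools into plain functions so it runs without LangChain or an API key. The names `lookup_product` and `discounted_price` are illustrative, and the candidate arguments are guesses at what a model might pass; in practice you would copy the real arguments from the trace.

```python
CATALOG = {
    "python course": "Python Complete Course: $149.99, 40 hours of content",
    "ai toolkit": "AI Toolkit Pro: $299.99, includes 50 pre-built agents",
}


def lookup_product(product_name: str) -> str:
    """Same lookup logic as get_product_info, without the @tool wrapper."""
    return CATALOG.get(product_name.lower(), f"Product '{product_name}' not found")


def discounted_price(original_price: float, discount_percent: float) -> float:
    """Same arithmetic as calculate_discount, returning only the final price."""
    return original_price - original_price * (discount_percent / 100)


# Replay plausible tool arguments (replace these with the ones shown in the trace)
for candidate in ["Python Course", "the Python Course", "Python Complete Course"]:
    print(f"{candidate!r} -> {lookup_product(candidate)}")

# Check the discount arithmetic independently of the model
print(f"{discounted_price(149.99, 20):.2f}")
```

Because the catalog lookup is an exact match on the lowercased name, any variation in how the model phrases the product name falls through to the "not found" branch. If the trace shows such an argument, the fix belongs in the tool (or its docstring) rather than in the prompt.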

Setting Up Automated Regression Tests

Once you have an evaluation dataset, run it in CI/CD to catch regressions before they reach production:

# run_evals.py — run this in your CI pipeline
import sys
from langsmith.evaluation import evaluate, LangChainStringEvaluator
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

PASS_THRESHOLD = 0.80  # require 80%+ accuracy to pass

llm = ChatOpenAI(model="gpt-4o", temperature=0)
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful customer support assistant."),
    ("human", "{question}"),
])
chain = prompt | llm

def run_chain(inputs: dict) -> dict:
    return {"answer": chain.invoke({"question": inputs["question"]}).content}

evaluators = [
    LangChainStringEvaluator(
        "qa",
        config={"llm": ChatOpenAI(model="gpt-4o-mini", temperature=0)}
    ),
]

results = evaluate(
    run_chain,
    data="customer-support-qa-v1",
    evaluators=evaluators,
    experiment_prefix="ci-check",
)

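# Check pass rate. Each result row keeps the evaluator output under
# "evaluation_results" -> "results" (assumes a recent langsmith SDK).
eval_results = list(results)

def row_passed(row) -> bool:
    scores = [r.score for r in row["evaluation_results"]["results"] if r.score is not None]
    # A row with no scores counts as a failure rather than a silent pass
    return bool(scores) and all(s >= 0.5 for s in scores)

pass_count = sum(1 for row in eval_results if row_passed(row))
total = len(eval_results)
pass_rate = pass_count / total if total > 0 else 0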

print(f"Pass rate: {pass_rate:.1%} ({pass_count}/{total})")

if pass_rate < PASS_THRESHOLD:
    print(f"FAILED: Pass rate {pass_rate:.1%} is below threshold {PASS_THRESHOLD:.1%}")
    sys.exit(1)
else:
    print(f"PASSED: Pass rate {pass_rate:.1%} meets threshold")
    sys.exit(0)
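
The threshold logic is easier to trust if it can be unit-tested without calling LangSmith or a model. Here is a minimal sketch, assuming the per-example scores have already been extracted into a plain list. The `pass_rate` and `exit_code` helpers are hypothetical names, and treating a missing score as a failure is a design assumption rather than SDK behavior.

```python
from typing import List, Optional


def pass_rate(scores: List[Optional[float]], min_score: float = 0.5) -> float:
    """Fraction of examples whose score meets min_score; missing scores count as failures."""
    if not scores:
        return 0.0
    passed = sum(1 for s in scores if s is not None and s >= min_score)
    return passed / len(scores)


def exit_code(scores: List[Optional[float]], threshold: float = 0.80) -> int:
    """Return 0 when the pass rate meets the threshold, 1 otherwise."""
    return 0 if pass_rate(scores) >= threshold else 1


# Hypothetical scores: four passes and one failure gives exactly 80%
print(exit_code([1.0, 1.0, 1.0, 1.0, 0.0]))
# An evaluator error (None) is counted against the pass rate
print(pass_rate([1.0, 1.0, 0.0, None]))
```

Counting an empty result set as a 0% pass rate means a misconfigured dataset name fails the build instead of passing it by accident.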

LangSmith vs Alternative Observability Tools

Feature	LangSmith	Weights & Biases	Arize Phoenix	Custom Logging
LangChain auto-tracing	Yes, native	Partial	Yes	Manual
Multi-step trace view	Yes	Limited	Yes	Manual
Evaluation datasets	Yes, built-in	Yes	Limited	Manual
A/B prompt comparison	Yes, native	Yes	Limited	Manual
Pricing	Free tier / paid	Paid	Open source	Free
Setup complexity	Low	Medium	Medium	High
Self-hosted option	Yes (Enterprise)	No	Yes	Yes

LangSmith wins on ease of setup for LangChain projects. Arize Phoenix is the strongest alternative for teams that need open-source or self-hosted solutions.

What to Build Next

Conclusion

Set it up today, even for a simple project. Having trace history from day one is much better than wishing you had it after a production incident.

FAQs


