How to Use LangSmith for Debugging LangChain Apps (2026)
Learn how to use LangSmith to trace, debug, and evaluate LangChain apps — with run inspection, dataset creation, A/B testing chains, and a practical debugging workflow.
The first time I ran a LangChain agent with 10 tool calls and got a wrong answer, I had no idea where things went off track. Was it the retrieval step? The prompt? The tool execution? The final synthesis? Without tracing, debugging an agent is like trying to fix a car engine by listening to it from outside the garage. LangSmith opens the hood.
This guide walks through LangSmith from initial setup through advanced evaluation workflows. I will cover the features I use daily: run inspection, custom metadata, dataset creation for regression testing, and chain comparison. This is the missing piece in most LangChain tutorials.
If you are not yet using LangChain agents, the Build AI agent with LangChain guide is a good starting point. The LangChain tutorial 2025 covers the basics you will need before adding observability.
Why Observability Matters for LLM Apps
LLM applications fail in ways that traditional software does not. A bug in a REST API throws an exception you can trace. An LLM that misunderstands a prompt returns a plausible-sounding wrong answer with no error. You need tooling specifically designed to capture what the model saw, what it produced, and how long each step took.
According to LangChain's 2025 developer survey, teams using LangSmith reported a 40% reduction in debugging time for production incidents. Having visibility into what your LLM is actually doing makes that kind of difference.
Setup
First, create a LangSmith account at smith.langchain.com. The free tier is enough for everything in this guide.
pip install langchain langchain-openai langsmith python-dotenv
Get your API key from the LangSmith settings page, then add the following to .env:
OPENAI_API_KEY=your_openai_key
LANGCHAIN_TRACING_V2=true
LANGCHAIN_API_KEY=your_langsmith_key
LANGCHAIN_PROJECT=my-project-name # traces go to this project
LANGCHAIN_ENDPOINT=https://api.smith.langchain.com
That is all you need to start tracing. Call load_dotenv() before you create any LangChain objects, and every LangChain call will automatically appear in your LangSmith dashboard.
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
load_dotenv() # loads LANGCHAIN_TRACING_V2 and LANGCHAIN_API_KEY
llm = ChatOpenAI(model="gpt-4o", temperature=0)
prompt = ChatPromptTemplate.from_messages([
("system", "You are a helpful assistant."),
("human", "{question}"),
])
chain = prompt | llm
# This call is now automatically traced in LangSmith
result = chain.invoke({"question": "What is the capital of France?"})
print(result.content)
After running this, open your LangSmith project and you will see the trace with the full prompt, response, latency, and token counts.
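If no trace shows up, the usual cause is a missing or misspelled environment variable. Here is a minimal sketch of a pre-flight check, assuming the variable names from the .env example above. The helper names missing_tracing_vars and tracing_enabled are my own and not part of LangSmith.

```python
import os

# Variable names taken from the .env example in this guide.
REQUIRED_VARS = ("LANGCHAIN_TRACING_V2", "LANGCHAIN_API_KEY")


def missing_tracing_vars(env=None):
    """Return the required tracing variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def tracing_enabled(env=None):
    """True when all required variables are set and tracing is switched on."""
    env = os.environ if env is None else env
    if missing_tracing_vars(env):
        return False
    return env["LANGCHAIN_TRACING_V2"].strip().lower() == "true"


if __name__ == "__main__":
    missing = missing_tracing_vars()
    if missing:
        print("Tracing is not configured; missing:", ", ".join(missing))
    else:
        print("Tracing enabled:", tracing_enabled())
```

Run it right after load_dotenv() so it sees the values from your .env file.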
Understanding the Trace View
The LangSmith trace view shows a hierarchical breakdown of every step in a chain run. For a simple chain like the one above, you see:
RunnableSequence (total: 823ms, 287 tokens)
├── ChatPromptTemplate (2ms)
│ └── Input: {"question": "What is the capital of France?"}
│ └── Output: [SystemMessage, HumanMessage]
└── ChatOpenAI (821ms, 287 tokens, $0.00144)
└── Input: [SystemMessage, HumanMessage]
└── Output: AIMessage("The capital of France is Paris.")
For agents with tool calls, this becomes much more detailed — each tool call shows its input arguments and return value, so you can trace exactly why the agent made each decision.
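To make the hierarchy concrete, here is a small sketch that models a trace as a nested dictionary and finds the slowest leaf step. The dictionary shape is a hypothetical stand-in I made up to mirror the example above; it is not the format LangSmith returns.

```python
def flatten(node, depth=0):
    """Yield (depth, name, latency_ms) for a node and all of its children."""
    yield depth, node["name"], node["latency_ms"]
    for child in node.get("children", []):
        yield from flatten(child, depth + 1)


def slowest_leaf(node):
    """Return the leaf step with the highest latency."""
    children = node.get("children", [])
    if not children:
        return node
    return max((slowest_leaf(c) for c in children), key=lambda n: n["latency_ms"])


# Hypothetical stand-in mirroring the example trace above.
trace = {
    "name": "RunnableSequence",
    "latency_ms": 823,
    "children": [
        {"name": "ChatPromptTemplate", "latency_ms": 2},
        {"name": "ChatOpenAI", "latency_ms": 821},
    ],
}

for depth, name, latency_ms in flatten(trace):
    print(f"{'  ' * depth}{name} ({latency_ms}ms)")

print("Slowest step:", slowest_leaf(trace)["name"])
```

Looking at leaves rather than parents matters because a parent's latency is just the sum of its children, so the parent is always "slowest" and tells you nothing.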
Adding Custom Metadata to Traces
Automatic tracing is a start. But you also want to know who triggered a trace, what session it belonged to, and any custom tags for filtering.
from langchain_core.runnables import RunnableConfig
config = RunnableConfig(
metadata={
"user_id": "user_12345",
"session_id": "session_abc",
"environment": "production",
"feature_flag": "v2_prompt",
},
tags=["production", "v2", "customer-support"],
run_name="customer-support-query", # shows as the trace name
)
result = chain.invoke(
{"question": "How do I reset my password?"},
config=config
)
Now in LangSmith you can filter all traces by user_id, environment, or tag. For debugging production incidents, filtering by user_id to see exactly what a specific user experienced is invaluable.
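Metadata only helps if every request sets the same keys. One way to keep that consistent is a small builder function. This is a sketch: build_trace_config is a hypothetical helper, and it returns a plain dict, which works because RunnableConfig is a typed dictionary.

```python
def build_trace_config(user_id, session_id, environment, extra_tags=()):
    """Build a config dict with consistent metadata and tags for one request."""
    if not user_id or not session_id:
        raise ValueError("user_id and session_id are required for filtering")
    return {
        "metadata": {
            "user_id": user_id,
            "session_id": session_id,
            "environment": environment,
        },
        "tags": [environment, *extra_tags],
        "run_name": f"{environment}-request",
    }


config = build_trace_config("user_12345", "session_abc", "production", ["v2"])
print(config["tags"])
```

Pass the result as config=config to chain.invoke(), exactly as in the example above.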
Tracing Non-LangChain Code with @traceable
LangSmith is not limited to LangChain components. Use the @traceable decorator on any Python function.
from langsmith import traceable
@traceable(name="database-lookup", run_type="tool")
def lookup_user_from_database(user_id: str) -> dict:
    """Simulate a database lookup."""
    # In production, this would query your actual database
    return {
        "user_id": user_id,
        "name": "Alice Johnson",
        "plan": "premium",
        "last_login": "2026-05-30"
    }
@traceable(name="fetch-user-history", run_type="tool")
def fetch_user_conversation_history(user_id: str, limit: int = 10) -> list:
    """Fetch recent conversation history for a user."""
    return [
        {"role": "user", "content": "How do I export my data?"},
        {"role": "assistant", "content": "You can export from Settings > Data Export."},
    ]
@traceable(name="generate-support-response", run_type="chain")
def generate_support_response(user_id: str, question: str) -> str:
    """Full pipeline: fetch context, generate response."""
    user_info = lookup_user_from_database(user_id)
    history = fetch_user_conversation_history(user_id)
    context = f"User: {user_info['name']}, Plan: {user_info['plan']}"
    prompt_text = f"{context}\n\nHistory: {history[-2:]}\n\nQuestion: {question}"
    response = llm.invoke(prompt_text)
    return response.content
# All three functions appear as nested spans in LangSmith
answer = generate_support_response("user_12345", "Can I share my account?")
print(answer)
The trace tree shows generate-support-response as the parent with database-lookup and fetch-user-history as children, each with their own timing and I/O data.
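If some environments do not have langsmith installed (a lightweight test image, for example), you can keep the decorators in place with a fallback. This is a sketch of one possible pattern, not something LangSmith ships: the no-op traceable below is my own stand-in, and normalize_question is a made-up example function.

```python
try:
    from langsmith import traceable
except ImportError:
    def traceable(*args, **kwargs):
        """No-op stand-in supporting both @traceable and @traceable(...)."""
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]

        def decorator(func):
            return func

        return decorator


@traceable(name="normalize-question", run_type="tool")
def normalize_question(text: str) -> str:
    """Collapse whitespace so equivalent questions produce identical inputs."""
    return " ".join(text.split())


print(normalize_question("  Can I   share my account? "))
```

The function behaves the same either way; with langsmith installed and tracing enabled it also shows up as a span.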
Inspecting Runs Programmatically
Beyond the UI, LangSmith has a Python client for programmatic access to your traces. This is useful for automated quality checks and feeding data into evaluation pipelines.
from langsmith import Client
client = Client()
# List recent runs for a project
runs = list(client.list_runs(
project_name="my-project-name",
run_type="chain",
limit=10,
order="desc"
))
print(f"Recent runs: {len(runs)}")
for run in runs[:3]:
    print(f"\n--- Run: {run.name} ---")
    print(f" Status: {run.status}")
    print(f" Latency: {run.end_time - run.start_time if run.end_time else 'N/A'}")
    print(f" Tokens: {run.total_tokens}")
    print(f" Input: {str(run.inputs)[:100]}")
    print(f" Output: {str(run.outputs)[:100]}")
# Filter runs by metadata
filtered_runs = list(client.list_runs(
project_name="my-project-name",
filter='and(eq(metadata_key, "environment"), eq(metadata_value, "production"))',
limit=50,
))
print(f"\nProduction runs: {len(filtered_runs)}")
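For automated quality checks, I usually reduce a list of runs to a few numbers. Below is a sketch that summarizes latency, assuming only that each run exposes start_time and end_time as in the loop above. The SimpleNamespace objects are stand-ins so the example runs without an API key.

```python
from datetime import datetime, timedelta
from types import SimpleNamespace


def latency_seconds(run):
    """Return run duration in seconds, or None if the run has not finished."""
    if run.end_time is None:
        return None
    return (run.end_time - run.start_time).total_seconds()


def summarize_latency(runs):
    """Return count, mean, and max latency over finished runs."""
    values = [s for s in map(latency_seconds, runs) if s is not None]
    if not values:
        return {"count": 0, "mean": None, "max": None}
    return {"count": len(values), "mean": sum(values) / len(values), "max": max(values)}


# Hypothetical stand-ins with the same start_time/end_time attributes.
start = datetime(2026, 5, 30, 12, 0, 0)
sample_runs = [
    SimpleNamespace(start_time=start, end_time=start + timedelta(seconds=1)),
    SimpleNamespace(start_time=start, end_time=start + timedelta(seconds=3)),
    SimpleNamespace(start_time=start, end_time=None),  # still running
]
print(summarize_latency(sample_runs))
```

In real use you would pass the list returned by client.list_runs() instead of sample_runs.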
Creating Evaluation Datasets
One of LangSmith's most valuable features is dataset management for regression testing. You create a dataset of question-answer pairs, then run your chain against it to measure accuracy.
from langsmith import Client
client = Client()
# Create a dataset
dataset = client.create_dataset(
dataset_name="customer-support-qa-v1",
description="Evaluation dataset for customer support agent"
)
# Add test cases
test_cases = [
{
"question": "How do I reset my password?",
"expected_answer": "You can reset your password from the login page by clicking 'Forgot password'."
},
{
"question": "What payment methods do you accept?",
"expected_answer": "We accept Visa, Mastercard, American Express, and PayPal."
},
{
"question": "How do I cancel my subscription?",
"expected_answer": "You can cancel from Settings > Billing > Cancel Subscription."
},
{
"question": "Is my data encrypted?",
"expected_answer": "Yes, all data is encrypted at rest with AES-256 and in transit with TLS 1.3."
},
]
# Create examples in the dataset
for case in test_cases:
    client.create_example(
        inputs={"question": case["question"]},
        outputs={"expected_answer": case["expected_answer"]},
        dataset_id=dataset.id
    )
print(f"Dataset '{dataset.name}' created with {len(test_cases)} examples")
print(f"Dataset ID: {dataset.id}")
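Hand-written test cases get you started, but datasets built from logs tend to contain blanks and duplicates. Here is a sketch of a cleanup step that produces the inputs/outputs shape used above. rows_to_examples is a hypothetical helper, and the row keys are assumed to match the test_cases dictionaries.

```python
def rows_to_examples(rows):
    """Convert logged Q&A rows into example dicts, skipping blanks and duplicates."""
    seen = set()
    examples = []
    for row in rows:
        question = " ".join(row.get("question", "").split())
        answer = row.get("expected_answer", "").strip()
        key = question.lower()
        if not question or not answer or key in seen:
            continue
        seen.add(key)
        examples.append({
            "inputs": {"question": question},
            "outputs": {"expected_answer": answer},
        })
    return examples


rows = [
    {"question": "How do I reset my password?", "expected_answer": "Use 'Forgot password'."},
    {"question": "how do I  reset my password?", "expected_answer": "Duplicate wording."},
    {"question": "Is my data encrypted?", "expected_answer": ""},
]
print(len(rows_to_examples(rows)))
```

Each resulting dict can be passed to client.create_example() as inputs=example["inputs"] and outputs=example["outputs"].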
Running Evaluations Against a Dataset
from langsmith.evaluation import evaluate, LangChainStringEvaluator
# Define the chain to evaluate (your production chain)
def run_chain(inputs: dict) -> dict:
    result = chain.invoke({"question": inputs["question"]})
    return {"answer": result.content}
# Define evaluators
evaluators = [
# Checks if the answer contains the key facts from the expected answer
LangChainStringEvaluator(
"qa",
config={"llm": ChatOpenAI(model="gpt-4o-mini", temperature=0)}
),
]
# Run the evaluation
results = evaluate(
run_chain,
data="customer-support-qa-v1",
evaluators=evaluators,
experiment_prefix="baseline-gpt4o",
metadata={"model": "gpt-4o", "prompt_version": "v1"},
)
print(f"\nEvaluation complete: {results.experiment_name}")
# Results appear in LangSmith UI under the Datasets tab
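An LLM-graded evaluator costs tokens on every run, so I like to pair it with a cheap deterministic check. The sketch below scores how many words from the expected answer appear in the model's answer. It assumes a recent langsmith SDK that accepts evaluator functions with outputs and reference_outputs parameters (older versions expect a run and an example instead), and keyword_overlap is my own name.

```python
import re


def _tokens(text):
    """Lowercase word tokens, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_overlap(outputs: dict, reference_outputs: dict) -> dict:
    """Score the fraction of reference words that appear in the answer."""
    expected = _tokens(reference_outputs["expected_answer"])
    actual = _tokens(outputs["answer"])
    score = len(expected & actual) / len(expected) if expected else 0.0
    return {"key": "keyword_overlap", "score": score}


result = keyword_overlap(
    {"answer": "You can cancel from Settings > Billing."},
    {"expected_answer": "Settings, Billing"},
)
print(result)
```

If your SDK version supports this signature, the function can be added to the evaluators list next to the "qa" evaluator. Treat the score as a rough signal only: word overlap rewards copying phrasing, not correctness.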
Comparing Two Chains (A/B Testing Prompts)
This is where LangSmith earns its cost. Run two different prompt versions against the same dataset and compare accuracy side by side.
from langsmith.evaluation import evaluate, LangChainStringEvaluator
# Version 1: Original prompt
prompt_v1 = ChatPromptTemplate.from_messages([
("system", "You are a helpful customer support assistant."),
("human", "{question}"),
])
chain_v1 = prompt_v1 | llm
# Version 2: More detailed system prompt
prompt_v2 = ChatPromptTemplate.from_messages([
("system", """You are a customer support specialist for a SaaS product.
Answer questions accurately and concisely.
If you don't know the answer, say so rather than guessing.
Keep responses under 100 words."""),
("human", "{question}"),
])
chain_v2 = prompt_v2 | llm
def run_chain_v1(inputs: dict) -> dict:
    return {"answer": chain_v1.invoke({"question": inputs["question"]}).content}
def run_chain_v2(inputs: dict) -> dict:
    return {"answer": chain_v2.invoke({"question": inputs["question"]}).content}
evaluators = [
LangChainStringEvaluator(
"qa",
config={"llm": ChatOpenAI(model="gpt-4o-mini", temperature=0)}
),
]
# Run both experiments
results_v1 = evaluate(
run_chain_v1,
data="customer-support-qa-v1",
evaluators=evaluators,
experiment_prefix="prompt-v1",
)
results_v2 = evaluate(
run_chain_v2,
data="customer-support-qa-v1",
evaluators=evaluators,
experiment_prefix="prompt-v2",
)
print("Experiments complete. Compare results in LangSmith UI.")
# In the UI, both experiments appear in the Datasets view
# You can click 'Compare' to see a side-by-side score breakdown
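Average scores can hide per-example changes, which is why the side-by-side view is worth opening. Here is a sketch of the same idea in plain Python, assuming you have exported per-example scores into dictionaries keyed by example ID. The scores below are invented for illustration and are not real results.

```python
def compare_experiments(scores_a, scores_b):
    """Compare per-example scores keyed by example ID."""
    shared = sorted(set(scores_a) & set(scores_b))
    if not shared:
        raise ValueError("no shared examples to compare")
    improved = [k for k in shared if scores_b[k] > scores_a[k]]
    regressed = [k for k in shared if scores_a[k] > scores_b[k]]
    mean_a = sum(scores_a[k] for k in shared) / len(shared)
    mean_b = sum(scores_b[k] for k in shared) / len(shared)
    return {
        "mean_a": mean_a,
        "mean_b": mean_b,
        "improved": improved,
        "regressed": regressed,
    }


# Hypothetical scores for illustration, not real results.
v1 = {"ex1": 1, "ex2": 0, "ex3": 1, "ex4": 1}
v2 = {"ex1": 1, "ex2": 1, "ex3": 0, "ex4": 1}
print(compare_experiments(v1, v2))
```

In this made-up case both prompts have the same mean, yet one example improved and another regressed. That is exactly the situation an average alone would miss.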
Debugging a Failing Agent
Here is a real debugging workflow for when an agent gives wrong answers.
from langchain.agents import create_openai_tools_agent, AgentExecutor
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables import RunnableConfig
from langchain_core.tools import tool
@tool
def get_product_info(product_name: str) -> str:
    """Get information about a product from the catalog."""
    catalog = {
        "python course": "Python Complete Course: $149.99, 40 hours of content",
        "ai toolkit": "AI Toolkit Pro: $299.99, includes 50 pre-built agents",
    }
    return catalog.get(product_name.lower(), f"Product '{product_name}' not found")
@tool
def calculate_discount(original_price: float, discount_percent: float) -> str:
    """Calculate the discounted price."""
    discount = original_price * (discount_percent / 100)
    final = original_price - discount
    return f"Original: ${original_price}, Discount: ${discount:.2f}, Final: ${final:.2f}"
prompt = ChatPromptTemplate.from_messages([
("system", "You are a sales assistant. Help customers with product information and pricing."),
("human", "{input}"),
MessagesPlaceholder(variable_name="agent_scratchpad"),
])
agent = create_openai_tools_agent(llm=llm, tools=[get_product_info, calculate_discount], prompt=prompt)
agent_executor = AgentExecutor(agent=agent, tools=[get_product_info, calculate_discount], verbose=True)
# Run with tracing metadata for easy lookup
config = RunnableConfig(
metadata={"debug_session": "pricing_issue_2026_05_31"},
run_name="sales-agent-pricing-test",
)
result = agent_executor.invoke(
{"input": "What's the price of the Python Course with a 20% discount?"},
config=config
)
print(result["output"])
Now in LangSmith, filter by debug_session: pricing_issue_2026_05_31 to find this exact run. You can see:
- Which tools were called and in what order
- The exact arguments passed to each tool
- The tool's return value
- The model's reasoning between tool calls
- Token usage at each step
This makes it trivial to find where the agent went wrong — was the tool called with wrong arguments? Did the tool return unexpected data? Did the model misinterpret the tool output?
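You can also run the first of those checks in code. The sketch below scans tool calls for results that look like misses. It assumes the AgentExecutor was created with return_intermediate_steps=True, so that result["intermediate_steps"] holds (action, observation) pairs. The SimpleNamespace object stands in for a real agent action, and flag_suspicious_steps is a hypothetical helper.

```python
from types import SimpleNamespace


def flag_suspicious_steps(intermediate_steps):
    """Return human-readable notes for tool calls that look wrong."""
    notes = []
    for index, (action, observation) in enumerate(intermediate_steps, start=1):
        text = str(observation)
        if "not found" in text.lower():
            notes.append(f"step {index}: {action.tool} returned a miss for {action.tool_input}")
        if not text.strip():
            notes.append(f"step {index}: {action.tool} returned an empty result")
    return notes


# Stand-in for a step where the model passed the full product title,
# which does not match the catalog key "python course".
steps = [
    (
        SimpleNamespace(tool="get_product_info", tool_input={"product_name": "Python Complete Course"}),
        "Product 'Python Complete Course' not found",
    ),
]
for note in flag_suspicious_steps(steps):
    print(note)
```

A miss like this points at the tool's lookup logic or its docstring rather than at the model, which narrows the search before you open the trace.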
Setting Up Automated Regression Tests
Once you have an evaluation dataset, run it in CI/CD to catch regressions before they reach production:
# run_evals.py — run this in your CI pipeline
import sys
from langsmith.evaluation import evaluate, LangChainStringEvaluator
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
PASS_THRESHOLD = 0.80 # require 80%+ accuracy to pass
llm = ChatOpenAI(model="gpt-4o", temperature=0)
prompt = ChatPromptTemplate.from_messages([
("system", "You are a helpful customer support assistant."),
("human", "{question}"),
])
chain = prompt | llm
def run_chain(inputs: dict) -> dict:
    return {"answer": chain.invoke({"question": inputs["question"]}).content}
evaluators = [
LangChainStringEvaluator(
"qa",
config={"llm": ChatOpenAI(model="gpt-4o-mini", temperature=0)}
),
]
results = evaluate(
run_chain,
data="customer-support-qa-v1",
evaluators=evaluators,
experiment_prefix="ci-check",
)
# Check pass rate. Each result row is a dict with "run", "example", and
# "evaluation_results" keys; the scores live under evaluation_results["results"].
eval_results = list(results)
pass_count = sum(
    1
    for r in eval_results
    if any((res.score or 0) >= 0.5 for res in r["evaluation_results"]["results"])
)
total = len(eval_results)
pass_rate = pass_count / total if total > 0 else 0
print(f"Pass rate: {pass_rate:.1%} ({pass_count}/{total})")
if pass_rate < PASS_THRESHOLD:
    print(f"FAILED: Pass rate {pass_rate:.1%} is below threshold {PASS_THRESHOLD:.1%}")
    sys.exit(1)
else:
    print(f"PASSED: Pass rate {pass_rate:.1%} meets threshold")
    sys.exit(0)
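The gating logic is easier to trust when it lives in pure functions you can unit test without calling any API. A minimal sketch, assuming you have already collected one score per example into a list. The function names pass_rate and exit_code are mine.

```python
def pass_rate(scores, min_score=0.5):
    """Fraction of scores at or above min_score; None scores count as failures."""
    if not scores:
        return 0.0
    passed = sum(1 for s in scores if s is not None and s >= min_score)
    return passed / len(scores)


def exit_code(scores, threshold=0.80):
    """Return 0 when the pass rate meets the threshold, otherwise 1."""
    return 0 if pass_rate(scores) >= threshold else 1


print(exit_code([1, 1, 1, 1, 0]))
```

Treating an empty score list as a failure is a deliberate choice here: if the evaluation produced nothing, the pipeline should not go green.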
LangSmith vs Alternative Observability Tools
| Feature | LangSmith | Weights & Biases | Arize Phoenix | Custom Logging |
|---|---|---|---|---|
| LangChain auto-tracing | Yes, native | Partial | Yes | Manual |
| Multi-step trace view | Yes | Limited | Yes | Manual |
| Evaluation datasets | Yes, built-in | Yes | Limited | Manual |
| A/B prompt comparison | Yes, native | Yes | Limited | Manual |
| Pricing model | Free tier / paid | Paid | Open source | Free |
| Setup complexity | Low | Medium | Medium | High |
| Self-hosted option | Yes (Enterprise) | No | Yes | Yes |
LangSmith wins on ease of setup for LangChain projects. Arize Phoenix is the strongest alternative for teams that need open-source or self-hosted solutions.
What to Build Next
With LangSmith wired up, the next step is building evaluation datasets that reflect your actual user queries. Pull real production queries from your logs (anonymized), create ground truth answers, and run weekly evaluations as your prompt or model changes.
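Anonymizing queries before they go into a dataset can start with simple pattern masking. This is a rough sketch, not a complete PII solution: the two patterns below only cover email addresses and long digit runs, and both the patterns and the anonymize name are my own assumptions about what your logs contain.

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
LONG_NUMBER_RE = re.compile(r"\b\d{6,}\b")


def anonymize(text: str) -> str:
    """Mask email addresses and long digit runs (IDs, phone numbers)."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    return LONG_NUMBER_RE.sub("<NUMBER>", text)


print(anonymize("Email alice@example.com about order 12345678"))
```

Names, addresses, and free-text identifiers need more than regular expressions, so review a sample by hand before uploading anything.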
For the agents side, Build AI agent with LangChain shows the agent patterns that benefit most from LangSmith tracing. If you are building RAG pipelines, the RAG system tutorial pairs well with LangSmith for measuring retrieval quality. For deploying traced applications to production, Deploy AI model to production covers the infrastructure side.
Conclusion
LangSmith transforms LLM development from guesswork into engineering. The automatic tracing alone is worth the five-minute setup — every chain call becomes debuggable without adding any extra logging code. The evaluation and comparison features take you further: once you have a test dataset, you can compare prompt changes objectively rather than relying on vibes.
My personal workflow: trace everything in development, add custom metadata for production users, create evaluation datasets from real production queries, and run evaluations in CI before every deploy. That workflow has caught real regressions before they reached users multiple times.
Set it up today, even for a simple project. Having trace history from day one is much better than wishing you had it after a production incident.
FAQs
Is LangSmith free to use? LangSmith has a free tier that includes 5,000 traces per month and basic evaluation features. The Developer plan ($39/month) extends to 50,000 traces and unlocks comparison datasets and automated evaluations. For teams, there are Enterprise plans with custom pricing and self-hosted deployment options.
Can I use LangSmith with non-LangChain LLM calls? Yes. LangSmith can trace any Python code using its @traceable decorator, not just LangChain chains. You can wrap OpenAI SDK calls, Anthropic SDK calls, or any custom function. The decorator captures inputs, outputs, latency, and any metadata you pass to it.
How do I share a specific trace with a teammate for debugging? In the LangSmith UI, open the trace you want to share, click the share button in the top right, and copy the shareable link. This link gives view access to that specific trace. You can also share full project traces by inviting teammates to your LangSmith workspace.
AiTechWorlds Team