How to Implement Multi-Turn Conversations with LangChain
Learn how to build conversational agents with persistent memory using LangChain's ConversationBufferMemory, RunnableWithMessageHistory, and a FastAPI chatbot endpoint.
Get more content like this on Telegram!
Daily AI tips, notes & resources — free
When I first tried to build a chatbot that remembered what the user said three messages ago, I hit every wall you can imagine. The bot would forget the user's name, contradict itself mid-conversation, and generally behave like someone with a goldfish memory. That is the core challenge with stateless LLMs — each API call stands completely alone. LangChain's memory abstractions exist precisely to bridge that gap, and once you understand them, building a genuinely coherent multi-turn agent stops feeling like fighting the framework.
This guide walks through the full picture: from the classic ConversationBufferMemory API to the newer RunnableWithMessageHistory pattern, then wraps everything inside a production-ready FastAPI endpoint. I will be honest about where each approach shines and where it falls apart, because some of these tradeoffs are not obvious until you run into them in production.
Before diving in, a quick orientation. If you are brand new to LangChain, start with the LangChain tutorial 2025 first. If you want to understand the bigger picture, AI agent memory and planning gives excellent context for why conversational memory matters beyond just chatbots.
Why Memory Is Hard With LLMs
Large language models are stateless by design. You send a prompt, you get a completion, the model forgets everything the moment the response is returned. That is fine for one-shot tasks — summarize this document, translate this sentence — but it falls apart completely for conversations.
The naive fix is appending the entire conversation history to every prompt. That works until the history grows longer than the context window. Then you are suddenly dropping messages or hitting token limit errors. The smarter fix is being selective: keep recent messages verbatim, compress older ones, or store only specific facts. LangChain gives you several tools for each of these strategies.
According to Andreessen Horowitz's 2025 AI survey, over 60% of production LLM applications involve multi-turn interactions, yet memory management remains one of the top pain points developers report. So you are in good company finding this tricky.
Setting Up the Project
pip install langchain langchain-openai langchain-community fastapi uvicorn python-dotenv tiktoken
Create a .env file:
OPENAI_API_KEY=your_key_here
Approach 1: ConversationBufferMemory
The classic approach. ConversationBufferMemory stores every message and injects them into the prompt on each call. Simple, transparent, and gets you far for short conversations.
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationChain
load_dotenv()
llm = ChatOpenAI(model="gpt-4o", temperature=0.7)
memory = ConversationBufferMemory(
return_messages=True, # returns Message objects, not a plain string
memory_key="history"
)
chain = ConversationChain(
llm=llm,
memory=memory,
verbose=True # prints the full prompt — great for debugging
)
# First turn
response1 = chain.predict(input="Hi! My name is Sofia and I'm learning Python.")
print(response1)
# Second turn — the model should remember "Sofia"
response2 = chain.predict(input="What's my name again?")
print(response2)
# Third turn
response3 = chain.predict(input="What was the first thing I told you?")
print(response3)
Run this and you will see the model correctly recall "Sofia" across turns. The verbose=True flag prints exactly what goes into the prompt — worth keeping during development to catch subtle issues early.
Inspecting and Clearing the Buffer
# See what is stored
print(memory.chat_memory.messages)
# Clear it for a fresh session
memory.clear()
The downside becomes obvious fast: if a conversation runs 50 turns, you are sending 50 turns of history with every single request. Token costs compound quickly.
Approach 2: ConversationSummaryMemory
When conversations grow long, ConversationSummaryMemory compresses older turns into a running summary using the LLM itself. Recent context stays fresh; old context gets compressed.
from langchain.memory import ConversationSummaryMemory
# This memory type needs an LLM to generate summaries
summary_memory = ConversationSummaryMemory(
llm=ChatOpenAI(model="gpt-4o-mini", temperature=0),
return_messages=True
)
chain = ConversationChain(
llm=llm,
memory=summary_memory,
verbose=True
)
chain.predict(input="I'm building a Python web scraper for e-commerce sites.")
chain.predict(input="I'm using BeautifulSoup and Requests right now.")
chain.predict(input="My main challenge is handling JavaScript-rendered pages.")
# Memory now holds a compressed summary, not every individual message
print(summary_memory.buffer)
The trade-off is real: you burn tokens to save tokens. For very long sessions (50+ turns) it is worth it. For typical chatbot sessions under 20 turns, the overhead rarely justifies itself.
Approach 3: ConversationSummaryBufferMemory
This hybrid approach keeps recent messages verbatim and summarizes anything that exceeds a token limit. It is usually the best default for production use because it preserves recent precision while capping total memory size.
from langchain.memory import ConversationSummaryBufferMemory
hybrid_memory = ConversationSummaryBufferMemory(
llm=ChatOpenAI(model="gpt-4o-mini", temperature=0),
max_token_limit=2000, # keep recent messages up to this many tokens verbatim
return_messages=True
)
chain = ConversationChain(llm=llm, memory=hybrid_memory)
for i in range(10):
chain.predict(input=f"Message {i}: Tell me something interesting about number {i}.")
# Recent messages are verbatim; older ones are summarized
print("Buffer messages:", hybrid_memory.chat_memory.messages[-3:])
print("Summary:", hybrid_memory.moving_summary_buffer)
Approach 4: RunnableWithMessageHistory (Modern LCEL Pattern)
LangChain v0.2+ introduced RunnableWithMessageHistory, which is the recommended approach for any new project. It works with any LCEL chain and separates the message history store from the chain logic entirely — a much cleaner architecture that makes swapping storage backends trivial.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.chat_history import BaseChatMessageHistory
# In-memory store keyed by session ID
store: dict[str, BaseChatMessageHistory] = {}
def get_session_history(session_id: str) -> BaseChatMessageHistory:
if session_id not in store:
store[session_id] = ChatMessageHistory()
return store[session_id]
# Build the chain
llm = ChatOpenAI(model="gpt-4o", temperature=0.7)
prompt = ChatPromptTemplate.from_messages([
("system", "You are a helpful assistant. Be concise but friendly."),
MessagesPlaceholder(variable_name="history"),
("human", "{input}"),
])
chain = prompt | llm
# Wrap with history management
chain_with_history = RunnableWithMessageHistory(
chain,
get_session_history,
input_messages_key="input",
history_messages_key="history",
)
# Use a config dict to identify the session
config = {"configurable": {"session_id": "user_abc_session_1"}}
# First turn
response1 = chain_with_history.invoke(
{"input": "My name is Marcus and I work in data engineering."},
config=config
)
print(response1.content)
# Second turn — same session_id carries the history over
response2 = chain_with_history.invoke(
{"input": "What field do I work in?"},
config=config
)
print(response2.content)
Why This Pattern Is Better
The separation of concerns is the real win. Your chain does not know or care about session management. The get_session_history function handles all of that. Want to switch from in-memory to Redis? Change one function, touch nothing else.
# Redis-backed history (requires: pip install redis langchain-redis)
from langchain_redis import RedisChatMessageHistory
def get_redis_session_history(session_id: str) -> RedisChatMessageHistory:
return RedisChatMessageHistory(
session_id=session_id,
url="redis://localhost:6379",
ttl=3600 # expire after 1 hour of inactivity
)
chain_with_redis = RunnableWithMessageHistory(
chain,
get_redis_session_history,
input_messages_key="input",
history_messages_key="history",
)
The chain definition stays identical. That is the whole point of this architecture.
Injecting User Context Into the System Prompt
A very common production pattern: collect user preferences or profile data at signup and inject them into every conversation.
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
personalized_prompt = ChatPromptTemplate.from_messages([
("system", """You are a personalized coding assistant.
User profile:
- Name: {user_name}
- Skill level: {skill_level}
- Primary language: {primary_language}
Tailor your explanations to their skill level. Use their name occasionally."""),
MessagesPlaceholder(variable_name="history"),
("human", "{input}"),
])
personalized_chain = personalized_prompt | llm
personalized_with_history = RunnableWithMessageHistory(
personalized_chain,
get_session_history,
input_messages_key="input",
history_messages_key="history",
)
config = {"configurable": {"session_id": "user_marcus_001"}}
response = personalized_with_history.invoke(
{
"input": "How do I optimize a slow Pandas DataFrame operation?",
"user_name": "Marcus",
"skill_level": "intermediate",
"primary_language": "Python",
},
config=config
)
print(response.content)
Memory Approach Comparison
| Approach | Token Usage | Accuracy | Best For |
|---|---|---|---|
| ConversationBufferMemory | High (all messages) | Perfect recall | Short sessions under 20 turns |
| ConversationSummaryMemory | Low (summaries only) | May lose detail | Very long sessions, 50+ turns |
| ConversationSummaryBufferMemory | Medium | Good balance | Most production applications |
| RunnableWithMessageHistory + Buffer | Configurable | Perfect recall | New projects using LCEL |
| RunnableWithMessageHistory + Redis | Configurable | Perfect recall | Multi-server deployments |
Building the FastAPI Chatbot Endpoint
Here is a complete FastAPI application tying everything together with proper per-user session management.
# app.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import Optional
import uvicorn
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.chat_history import BaseChatMessageHistory
load_dotenv()
# ---- Session store ----
store: dict[str, BaseChatMessageHistory] = {}
def get_session_history(session_id: str) -> BaseChatMessageHistory:
if session_id not in store:
store[session_id] = ChatMessageHistory()
return store[session_id]
# ---- Chain setup ----
llm = ChatOpenAI(model="gpt-4o", temperature=0.7)
prompt = ChatPromptTemplate.from_messages([
("system", "You are a helpful assistant for AiTechWorlds.com users. "
"Be concise, accurate, and friendly. "
"Remember context from earlier in the conversation."),
MessagesPlaceholder(variable_name="history"),
("human", "{input}"),
])
chain = prompt | llm
chain_with_history = RunnableWithMessageHistory(
chain,
get_session_history,
input_messages_key="input",
history_messages_key="history",
)
# ---- FastAPI app ----
app = FastAPI(title="LangChain Chatbot API", version="1.0")
class ChatRequest(BaseModel):
session_id: str
message: str
class ChatResponse(BaseModel):
session_id: str
response: str
turn_count: int
@app.post("/chat", response_model=ChatResponse)
async def chat(request: ChatRequest):
if not request.message.strip():
raise HTTPException(status_code=400, detail="Message cannot be empty")
config = {"configurable": {"session_id": request.session_id}}
try:
result = await chain_with_history.ainvoke(
{"input": request.message},
config=config
)
except Exception as e:
raise HTTPException(status_code=500, detail=f"LLM error: {str(e)}")
history = get_session_history(request.session_id)
turn_count = len(history.messages) // 2 # 1 turn = 1 human msg + 1 AI msg
return ChatResponse(
session_id=request.session_id,
response=result.content,
turn_count=turn_count,
)
@app.delete("/chat/{session_id}")
async def clear_session(session_id: str):
if session_id in store:
store[session_id].clear()
return {"message": f"Session {session_id} cleared"}
raise HTTPException(status_code=404, detail="Session not found")
@app.get("/chat/{session_id}/history")
async def get_history(session_id: str):
if session_id not in store:
raise HTTPException(status_code=404, detail="Session not found")
messages = store[session_id].messages
return {
"session_id": session_id,
"turn_count": len(messages) // 2,
"messages": [
{"role": msg.type, "content": msg.content}
for msg in messages
]
}
if __name__ == "__main__":
uvicorn.run(app, host="0.0.0.0", port=8000)
Start it:
uvicorn app:app --reload --port 8000
Test with curl:
# First turn
curl -X POST http://localhost:8000/chat \
-H "Content-Type: application/json" \
-d '{"session_id": "user_001", "message": "Hi, my name is Priya and I love machine learning."}'
# Second turn — same session_id
curl -X POST http://localhost:8000/chat \
-H "Content-Type: application/json" \
-d '{"session_id": "user_001", "message": "What do I love?"}'
# Check full history
curl http://localhost:8000/chat/user_001/history
Adding Streaming Responses
Nobody wants to stare at a blank screen while the LLM generates a response. Streaming sends tokens as they arrive.
from fastapi.responses import StreamingResponse
from langchain_openai import ChatOpenAI
streaming_llm = ChatOpenAI(model="gpt-4o", temperature=0.7, streaming=True)
streaming_chain = prompt | streaming_llm
streaming_chain_with_history = RunnableWithMessageHistory(
streaming_chain,
get_session_history,
input_messages_key="input",
history_messages_key="history",
)
@app.post("/chat/stream")
async def chat_stream(request: ChatRequest):
config = {"configurable": {"session_id": request.session_id}}
async def token_generator():
async for chunk in streaming_chain_with_history.astream(
{"input": request.message},
config=config
):
if chunk.content:
yield f"data: {chunk.content}\n\n"
yield "data: [DONE]\n\n"
return StreamingResponse(
token_generator(),
media_type="text/event-stream"
)
Handling Token Limits Gracefully
For very long sessions, you will eventually approach the model's context window limit. A trimmer prevents that from crashing your endpoint.
import tiktoken
from langchain_community.chat_message_histories import ChatMessageHistory
MAX_HISTORY_TOKENS = 6000
def trim_history_to_token_limit(
history: ChatMessageHistory,
max_tokens: int = MAX_HISTORY_TOKENS
) -> list:
"""Keep the most recent messages that fit within the token budget."""
encoder = tiktoken.encoding_for_model("gpt-4o")
messages = history.messages
total_tokens = 0
kept_messages = []
for msg in reversed(messages):
tokens = len(encoder.encode(msg.content))
if total_tokens + tokens > max_tokens:
break
kept_messages.insert(0, msg)
total_tokens += tokens
return kept_messages
def get_trimmed_history(session_id: str):
if session_id not in store:
store[session_id] = ChatMessageHistory()
history = store[session_id]
trimmed = trim_history_to_token_limit(history)
temp_history = ChatMessageHistory()
temp_history.messages = trimmed
return temp_history
# Use get_trimmed_history instead of get_session_history in the chain wrapper
chain_with_trimmed_history = RunnableWithMessageHistory(
chain,
get_trimmed_history,
input_messages_key="input",
history_messages_key="history",
)
Testing Memory Behavior
Do not ship memory code without tests. Here is a minimal pytest setup that catches the most common bugs.
# test_memory.py
import pytest
from langchain_community.chat_message_histories import ChatMessageHistory
def test_session_isolation():
"""Two different session IDs must not share memory."""
store = {}
def get_history(session_id):
if session_id not in store:
store[session_id] = ChatMessageHistory()
return store[session_id]
h1 = get_history("user_a")
h2 = get_history("user_b")
h1.add_user_message("My name is Alice")
assert len(h2.messages) == 0
assert len(h1.messages) == 1
def test_history_grows_per_turn():
"""Each turn should add exactly 2 messages (human + AI)."""
history = ChatMessageHistory()
history.add_user_message("Hello")
history.add_ai_message("Hi there!")
history.add_user_message("How are you?")
history.add_ai_message("I'm doing well, thanks!")
assert len(history.messages) == 4
def test_clear_session():
"""Clearing should remove all messages."""
history = ChatMessageHistory()
history.add_user_message("Remember this")
history.add_ai_message("Sure!")
history.clear()
assert len(history.messages) == 0
Run with: pytest test_memory.py -v
Common Mistakes and Their Fixes
Sharing a single memory object across all users — This is the most common bug I see in sample code. Every user needs their own history instance keyed by session ID. The store dict pattern in the examples above prevents this entirely.
Forgetting return_messages=True with classic ConversationBufferMemory — Without it, history is injected as a plain string, which breaks chat prompt templates expecting proper message objects.
Using synchronous .invoke() inside async FastAPI routes — Call .ainvoke() instead, or the async FastAPI thread will block.
Not testing session isolation — Add the isolation test above before deploying. You do not want user A seeing user B's conversation history.
Ignoring token limits entirely — Add the trimmer or use ConversationSummaryBufferMemory to avoid context window errors in long-running sessions.
What to Build Next
Now that you have working multi-turn memory, the natural next step is giving your agent tools. Build AI agent with LangChain covers a complete tool-using agent. If you want to add document context so the agent can answer questions about your product docs, the RAG system tutorial pairs well with what you just built here.
For production deployment — rate limiting, authentication, monitoring — Deploy AI model to production covers the operational side in detail. And if you want to compare this approach to other chatbot frameworks, Build AI chatbot Python surveys several options side by side.
Conclusion
Multi-turn memory in LangChain comes down to one core idea: persist message history between calls and inject it into the prompt. The classic ConversationBufferMemory makes this easy for quick projects. The modern RunnableWithMessageHistory makes it clean — separating history management from chain logic so you can swap storage backends without touching your core code.
The FastAPI endpoint in this guide is production-ready for moderate traffic. For serious scale, swap the in-memory store for Redis and add proper authentication headers. The chain structure stays identical — and that is exactly the point of the architecture.
If this helped you build something, drop a comment below or share what you are making with these techniques. Questions about specific memory backends or tricky session management edge cases? Ask away.
FAQs
What is the difference between ConversationBufferMemory and ConversationSummaryMemory? ConversationBufferMemory stores every message verbatim, so it's accurate but grows indefinitely. ConversationSummaryMemory uses an LLM to compress older turns into a running summary, saving tokens at the cost of fine-grained detail. For short sessions pick buffer; for long-running agents pick summary or the hybrid SummaryBufferMemory.
Does RunnableWithMessageHistory work with any LLM provider? Yes. RunnableWithMessageHistory wraps any LCEL chain, so it works with OpenAI, Anthropic, Google, Ollama, or any chat model LangChain supports. You only need to provide a callable that returns a BaseChatMessageHistory instance keyed on session ID.
How do I persist conversation history across server restarts? Swap the in-memory store for a durable backend. LangChain ships RedisChatMessageHistory, MongoDBChatMessageHistory, and DynamoDBChatMessageHistory. Pass one of those as the get_session_history callable and conversations survive restarts automatically.
Frequently Asked Questions
AiTechWorlds Team
✓ Verified WriterThe AiTechWorlds team is passionate about AI, technology, and education. We create high-quality, research-backed content to help you learn, grow, and succeed in the modern digital world.
Related Articles
How to Deploy AutoGen Agents as APIs with FastAPI (2026)
Learn to serve AutoGen multi-agent systems as production REST APIs using FastAPI with async endpoints and real-time streaming responses.
5 AutoGen Conversational Patterns (One-Shot, Multi-Turn, Hierarchical)
Master AutoGen's 5 core agent interaction models — from one-shot requests to hierarchical orchestration — with full code examples and use case comparisons.
AutoGen vs LangChain: Which for Multi-Agent Systems in 2026?
AutoGen vs LangChain for multi-agent systems in 2026 — feature comparison, same use case in both frameworks, and an honest verdict on when each wins.
5 AutoGPT Memory Types (Vector, Redis, File, Conversation)
Compare AutoGPT's 5 memory backends — local file, Redis, Pinecone, Milvus, and Weaviate. Choose the right one for speed, cost, and persistence needs.