How to Implement Multi-Turn Conversations with LangChain

When I first tried to build a chatbot that remembered what the user said three messages ago, I hit every wall you can imagine. The bot would forget the user's name, contradict itself mid-conversation, and generally behave like someone with a goldfish memory. That is the core challenge with stateless LLMs — each API call stands completely alone. LangChain's memory abstractions exist precisely to bridge that gap, and once you understand them, building a genuinely coherent multi-turn agent stops feeling like fighting the framework.

This guide walks through the full picture: from the classic ConversationBufferMemory API to the newer RunnableWithMessageHistory pattern, then wraps everything inside a production-ready FastAPI endpoint. I will be honest about where each approach shines and where it falls apart, because some of these tradeoffs are not obvious until you run into them in production.

Before diving in, a quick orientation. If you are brand new to LangChain, start with the LangChain tutorial 2025 first. If you want to understand the bigger picture, AI agent memory and planning gives excellent context for why conversational memory matters beyond just chatbots.

Why Memory Is Hard With LLMs

Large language models are stateless by design. You send a prompt, you get a completion, and the model forgets everything the moment the response is returned. That is fine for one-shot tasks (summarize this document, translate this sentence), but it falls apart completely for conversations.

The naive fix is appending the entire conversation history to every prompt. That works until the history grows longer than the context window. Then you are suddenly dropping messages or hitting token limit errors. The smarter fix is being selective: keep recent messages verbatim, compress older ones, or store only specific facts. LangChain gives you several tools for each of these strategies.
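To make the "keep recent messages verbatim" strategy concrete, here is a minimal sketch in plain Python with no LangChain involved. The `Message` tuple and the `keep_last_turns` helper are hypothetical names used only for illustration, and the sketch assumes one turn is exactly one human message plus one AI reply.

```python
from typing import NamedTuple

class Message(NamedTuple):
    role: str      # "human" or "ai"
    content: str

def keep_last_turns(history: list[Message], max_turns: int) -> list[Message]:
    """Return only the most recent `max_turns` human/AI exchanges."""
    if max_turns <= 0:
        return []
    # Assumes one turn = one human message followed by one AI reply.
    return history[-2 * max_turns:]

history = [
    Message("human", "My name is Sofia."),
    Message("ai", "Nice to meet you, Sofia!"),
    Message("human", "I'm learning Python."),
    Message("ai", "Great choice."),
    Message("human", "What's my name?"),
    Message("ai", "Your name is Sofia."),
]

window = keep_last_turns(history, max_turns=2)
print(len(window))        # 4
print(window[0].content)  # I'm learning Python.
```

Notice that with a two-turn window the message that introduced the name has already fallen out. That is the cost of plain truncation, and it is why the summary-based strategies exist.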

According to Andreessen Horowitz's 2025 AI survey, over 60% of production LLM applications involve multi-turn interactions, yet memory management remains one of the top pain points developers report. So you are in good company finding this tricky.

Setting Up the Project

pip install langchain langchain-openai langchain-community fastapi uvicorn python-dotenv tiktoken pytest

Create a .env file:

OPENAI_API_KEY=your_key_here

Approach 1: ConversationBufferMemory

The classic approach. ConversationBufferMemory stores every message and injects them into the prompt on each call. Simple, transparent, and gets you far for short conversations.

from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationChain

load_dotenv()

llm = ChatOpenAI(model="gpt-4o", temperature=0.7)

memory = ConversationBufferMemory(
    return_messages=True,   # returns Message objects, not a plain string
    memory_key="history"
)

chain = ConversationChain(
    llm=llm,
    memory=memory,
    verbose=True            # prints the full prompt — great for debugging
)

# First turn
response1 = chain.predict(input="Hi! My name is Sofia and I'm learning Python.")
print(response1)

# Second turn — the model should remember "Sofia"
response2 = chain.predict(input="What's my name again?")
print(response2)

# Third turn
response3 = chain.predict(input="What was the first thing I told you?")
print(response3)

Run this and you will see the model correctly recall "Sofia" across turns. The verbose=True flag prints exactly what goes into the prompt — worth keeping during development to catch subtle issues early.

Inspecting and Clearing the Buffer

# See what is stored
print(memory.chat_memory.messages)

# Clear it for a fresh session
memory.clear()

The downside becomes obvious fast: if a conversation runs 50 turns, you are sending 50 turns of history with every single request. Token costs compound quickly.
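A quick back-of-the-envelope calculation shows how fast this grows. The sketch below assumes every message costs a flat 50 tokens, which is a made-up figure for illustration; real messages vary widely.

```python
TOKENS_PER_MESSAGE = 50  # assumed flat cost per message, for illustration only

def prompt_tokens_at_turn(turn: int) -> int:
    """Tokens sent on a given turn (1-based): all prior messages plus the new one."""
    prior_messages = 2 * (turn - 1)   # one human + one AI message per earlier turn
    return (prior_messages + 1) * TOKENS_PER_MESSAGE

def total_prompt_tokens(turns: int) -> int:
    """Total prompt tokens sent across a whole conversation of `turns` turns."""
    return sum(prompt_tokens_at_turn(t) for t in range(1, turns + 1))

print(prompt_tokens_at_turn(50))   # 4950
print(total_prompt_tokens(10))     # 5000
print(total_prompt_tokens(50))     # 125000
```

The per-request cost grows linearly with the turn number, but the total bill grows with the square of the conversation length: five times as many turns costs twenty-five times as many prompt tokens.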

Approach 2: ConversationSummaryMemory

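When conversations grow long, ConversationSummaryMemory uses an LLM to fold each turn into a running summary. The prompt then carries that summary instead of the raw transcript, so its size grows far more slowly than the conversation itself. Unlike the hybrid in Approach 3, no messages are kept verbatim.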

from langchain.memory import ConversationSummaryMemory

# This memory type needs an LLM to generate summaries
summary_memory = ConversationSummaryMemory(
    llm=ChatOpenAI(model="gpt-4o-mini", temperature=0),
    return_messages=True
)

chain = ConversationChain(
    llm=llm,
    memory=summary_memory,
    verbose=True
)

chain.predict(input="I'm building a Python web scraper for e-commerce sites.")
chain.predict(input="I'm using BeautifulSoup and Requests right now.")
chain.predict(input="My main challenge is handling JavaScript-rendered pages.")

# Memory now holds a compressed summary, not every individual message
print(summary_memory.buffer)

The trade-off is real: you burn tokens to save tokens. For very long sessions (50+ turns) it is worth it. For typical chatbot sessions under 20 turns, the overhead rarely justifies itself.

Approach 3: ConversationSummaryBufferMemory

This hybrid approach keeps recent messages verbatim and summarizes anything that exceeds a token limit. It is usually the best default for production use because it preserves recent precision while capping total memory size.

from langchain.memory import ConversationSummaryBufferMemory

hybrid_memory = ConversationSummaryBufferMemory(
    llm=ChatOpenAI(model="gpt-4o-mini", temperature=0),
    max_token_limit=2000,   # keep recent messages up to this many tokens verbatim
    return_messages=True
)

chain = ConversationChain(llm=llm, memory=hybrid_memory)

for i in range(10):
    chain.predict(input=f"Message {i}: Tell me something interesting about number {i}.")

# Recent messages are verbatim; older ones are summarized
print("Buffer messages:", hybrid_memory.chat_memory.messages[-3:])
print("Summary:", hybrid_memory.moving_summary_buffer)
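The pruning idea behind this hybrid can be sketched in a few lines of plain Python. This is not LangChain's implementation: `count_tokens` is a crude word-count stand-in for a real tokenizer, and `summarize` is a hypothetical stub marking where an LLM call would go.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())

def summarize(previous_summary: str, dropped: list[str]) -> str:
    """Hypothetical stub; a real version would ask an LLM to fold `dropped` into the summary."""
    return (previous_summary + " " + " | ".join(dropped)).strip()

def prune(buffer: list[str], summary: str, max_tokens: int) -> tuple[list[str], str]:
    """Move the oldest messages into the summary until the buffer fits the budget."""
    buffer = list(buffer)          # work on a copy, leave the caller's list alone
    dropped: list[str] = []
    while buffer and sum(count_tokens(m) for m in buffer) > max_tokens:
        dropped.append(buffer.pop(0))
    if dropped:
        summary = summarize(summary, dropped)
    return buffer, summary

messages = [
    "I am building a scraper",
    "Which library do you use",
    "BeautifulSoup and Requests",
]
recent, summary = prune(messages, summary="", max_tokens=8)
print(recent)    # ['Which library do you use', 'BeautifulSoup and Requests']
print(summary)   # I am building a scraper
```

Only the oldest message crosses into the summary here, because dropping it brings the buffer down to exactly the 8-token budget.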

Approach 4: RunnableWithMessageHistory (Modern LCEL Pattern)

RunnableWithMessageHistory is the LCEL-native way to add memory, and it is the approach I recommend for new projects over the classic memory classes above. It works with any LCEL chain and separates the message history store from the chain logic entirely, a much cleaner architecture that makes swapping storage backends trivial.

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.chat_history import BaseChatMessageHistory

# In-memory store keyed by session ID
store: dict[str, BaseChatMessageHistory] = {}

def get_session_history(session_id: str) -> BaseChatMessageHistory:
    if session_id not in store:
        store[session_id] = ChatMessageHistory()
    return store[session_id]

# Build the chain
llm = ChatOpenAI(model="gpt-4o", temperature=0.7)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. Be concise but friendly."),
    MessagesPlaceholder(variable_name="history"),
    ("human", "{input}"),
])

chain = prompt | llm

# Wrap with history management
chain_with_history = RunnableWithMessageHistory(
    chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="history",
)

# Use a config dict to identify the session
config = {"configurable": {"session_id": "user_abc_session_1"}}

# First turn
response1 = chain_with_history.invoke(
    {"input": "My name is Marcus and I work in data engineering."},
    config=config
)
print(response1.content)

# Second turn — same session_id carries the history over
response2 = chain_with_history.invoke(
    {"input": "What field do I work in?"},
    config=config
)
print(response2.content)
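Conceptually, the wrapper does three things on every call: load the history for the session, run the chain with that history, and append the new exchange. The sketch below mimics that flow in plain Python so the mechanics are visible. It is a simplification, not the library's code; `fake_chain` and `invoke_with_history` are hypothetical names, and the "model" simply reports how many messages it was given.

```python
from collections import defaultdict

# session_id -> list of (role, content) pairs; stands in for the history store
sessions: dict[str, list[tuple[str, str]]] = defaultdict(list)

def fake_chain(history: list[tuple[str, str]], user_input: str) -> str:
    """Stand-in for `prompt | llm`: reports how much history it received."""
    return f"I can see {len(history)} earlier messages."

def invoke_with_history(session_id: str, user_input: str) -> str:
    history = sessions[session_id]                  # 1. load this session's history
    reply = fake_chain(list(history), user_input)   # 2. run the chain with it
    history.append(("human", user_input))           # 3. persist the new exchange
    history.append(("ai", reply))
    return reply

print(invoke_with_history("alice", "Hi"))            # I can see 0 earlier messages.
print(invoke_with_history("alice", "Still there?"))  # I can see 2 earlier messages.
print(invoke_with_history("bob", "Hello"))           # I can see 0 earlier messages.
```

The chain function never touches `sessions` directly, which is the same separation the real wrapper gives you.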

Why This Pattern Is Better

The separation of concerns is the real win. Your chain does not know or care about session management. The get_session_history function handles all of that. Want to switch from in-memory to Redis? Change one function, touch nothing else.

# Redis-backed history (requires: pip install redis langchain-redis)
from langchain_redis import RedisChatMessageHistory

def get_redis_session_history(session_id: str) -> RedisChatMessageHistory:
    return RedisChatMessageHistory(
        session_id=session_id,
        redis_url="redis://localhost:6379",  # langchain-redis names this parameter redis_url
        ttl=3600    # seconds; stored messages expire after 1 hour
    )

chain_with_redis = RunnableWithMessageHistory(
    chain,
    get_redis_session_history,
    input_messages_key="input",
    history_messages_key="history",
)

The chain definition stays identical. That is the whole point of this architecture.

Injecting User Context Into the System Prompt

A very common production pattern: collect user preferences or profile data at signup and inject them into every conversation.

from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

personalized_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a personalized coding assistant.

User profile:
- Name: {user_name}
- Skill level: {skill_level}
- Primary language: {primary_language}

Tailor your explanations to their skill level. Use their name occasionally."""),
    MessagesPlaceholder(variable_name="history"),
    ("human", "{input}"),
])

personalized_chain = personalized_prompt | llm

personalized_with_history = RunnableWithMessageHistory(
    personalized_chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="history",
)

config = {"configurable": {"session_id": "user_marcus_001"}}

response = personalized_with_history.invoke(
    {
        "input": "How do I optimize a slow Pandas DataFrame operation?",
        "user_name": "Marcus",
        "skill_level": "intermediate",
        "primary_language": "Python",
    },
    config=config
)
print(response.content)

Memory Approach Comparison

Approach	Token Usage	Accuracy	Best For
ConversationBufferMemory	High (all messages)	Perfect recall	Short sessions under 20 turns
ConversationSummaryMemory	Low (summaries only)	May lose detail	Very long sessions, 50+ turns
ConversationSummaryBufferMemory	Medium	Good balance	Most production applications
RunnableWithMessageHistory + Buffer	Configurable	Perfect recall	New projects using LCEL
RunnableWithMessageHistory + Redis	Configurable	Perfect recall	Multi-server deployments

Building the FastAPI Chatbot Endpoint

Here is a complete FastAPI application tying everything together with proper per-user session management.

# app.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import uvicorn

from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.chat_history import BaseChatMessageHistory

load_dotenv()

# ---- Session store ----
store: dict[str, BaseChatMessageHistory] = {}

def get_session_history(session_id: str) -> BaseChatMessageHistory:
    if session_id not in store:
        store[session_id] = ChatMessageHistory()
    return store[session_id]

# ---- Chain setup ----
llm = ChatOpenAI(model="gpt-4o", temperature=0.7)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant for AiTechWorlds.com users. "
               "Be concise, accurate, and friendly. "
               "Remember context from earlier in the conversation."),
    MessagesPlaceholder(variable_name="history"),
    ("human", "{input}"),
])

chain = prompt | llm

chain_with_history = RunnableWithMessageHistory(
    chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="history",
)

# ---- FastAPI app ----
app = FastAPI(title="LangChain Chatbot API", version="1.0")

class ChatRequest(BaseModel):
    session_id: str
    message: str

class ChatResponse(BaseModel):
    session_id: str
    response: str
    turn_count: int

@app.post("/chat", response_model=ChatResponse)
async def chat(request: ChatRequest):
    if not request.message.strip():
        raise HTTPException(status_code=400, detail="Message cannot be empty")

    config = {"configurable": {"session_id": request.session_id}}

    try:
        result = await chain_with_history.ainvoke(
            {"input": request.message},
            config=config
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"LLM error: {str(e)}")

    history = get_session_history(request.session_id)
    turn_count = len(history.messages) // 2   # 1 turn = 1 human msg + 1 AI msg

    return ChatResponse(
        session_id=request.session_id,
        response=result.content,
        turn_count=turn_count,
    )

@app.delete("/chat/{session_id}")
async def clear_session(session_id: str):
    if session_id in store:
        store[session_id].clear()
        return {"message": f"Session {session_id} cleared"}
    raise HTTPException(status_code=404, detail="Session not found")

@app.get("/chat/{session_id}/history")
async def get_history(session_id: str):
    if session_id not in store:
        raise HTTPException(status_code=404, detail="Session not found")

    messages = store[session_id].messages
    return {
        "session_id": session_id,
        "turn_count": len(messages) // 2,
        "messages": [
            {"role": msg.type, "content": msg.content}
            for msg in messages
        ]
    }

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

Start it:

uvicorn app:app --reload --port 8000

Test with curl:

# First turn
curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"session_id": "user_001", "message": "Hi, my name is Priya and I love machine learning."}'

# Second turn — same session_id
curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"session_id": "user_001", "message": "What do I love?"}'

# Check full history
curl http://localhost:8000/chat/user_001/history

Adding Streaming Responses

Nobody wants to stare at a blank screen while the LLM generates a response. Streaming sends tokens as they arrive.

from fastapi.responses import StreamingResponse
from langchain_openai import ChatOpenAI

streaming_llm = ChatOpenAI(model="gpt-4o", temperature=0.7, streaming=True)
streaming_chain = prompt | streaming_llm
streaming_chain_with_history = RunnableWithMessageHistory(
    streaming_chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="history",
)

@app.post("/chat/stream")
async def chat_stream(request: ChatRequest):
    config = {"configurable": {"session_id": request.session_id}}

    async def token_generator():
        async for chunk in streaming_chain_with_history.astream(
            {"input": request.message},
            config=config
        ):
            if chunk.content:
                yield f"data: {chunk.content}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(
        token_generator(),
        media_type="text/event-stream"
    )
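One detail worth guarding against: a chunk that contains a newline would break the single `data:` line and confuse an SSE client. In the Server-Sent Events format, a multi-line payload is sent as several `data:` lines inside one event, and the client joins them back with newlines. A minimal sketch of a helper that does this (the name `format_sse` is hypothetical):

```python
def format_sse(payload: str) -> str:
    """Frame a text payload as one Server-Sent Event, one `data:` line per line of text."""
    lines = payload.split("\n")
    body = "".join(f"data: {line}\n" for line in lines)
    return body + "\n"   # a blank line terminates the event

print(repr(format_sse("Hello")))            # 'data: Hello\n\n'
print(repr(format_sse("line 1\nline 2")))   # 'data: line 1\ndata: line 2\n\n'
```

Inside the token generator you would then yield `format_sse(chunk.content)` instead of building the string by hand.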

Handling Token Limits Gracefully

For very long sessions, you will eventually approach the model's context window limit. A trimmer prevents that from crashing your endpoint.

import tiktoken
from langchain_community.chat_message_histories import ChatMessageHistory

MAX_HISTORY_TOKENS = 6000

def trim_history_to_token_limit(
    history: ChatMessageHistory,
    max_tokens: int = MAX_HISTORY_TOKENS
) -> list:
    """Keep the most recent messages that fit within the token budget."""
    encoder = tiktoken.encoding_for_model("gpt-4o")
    messages = history.messages

    total_tokens = 0
    kept_messages = []

    for msg in reversed(messages):
        tokens = len(encoder.encode(msg.content))
        if total_tokens + tokens > max_tokens:
            break
        kept_messages.insert(0, msg)
        total_tokens += tokens

    return kept_messages

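def get_trimmed_history(session_id: str) -> ChatMessageHistory:
    if session_id not in store:
        store[session_id] = ChatMessageHistory()

    history = store[session_id]
    # Trim in place and return the stored object. Returning a temporary copy
    # would make the wrapper append new turns to the copy, so they would never
    # reach the store and the bot would forget every exchange.
    history.messages = trim_history_to_token_limit(history)
    return history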

# Use get_trimmed_history instead of get_session_history in the chain wrapper
chain_with_trimmed_history = RunnableWithMessageHistory(
    chain,
    get_trimmed_history,
    input_messages_key="input",
    history_messages_key="history",
)
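A budget-based trim can leave the window starting with an AI reply whose question was cut off. The sketch below runs the same newest-first trimming on plain tuples and then drops any leading AI message. The word-count tokenizer and the helper name `trim_to_budget` are assumptions for illustration; in the real endpoint you would count with tiktoken as above.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())

def trim_to_budget(messages: list[tuple[str, str]], max_tokens: int) -> list[tuple[str, str]]:
    """Keep the newest (role, content) pairs that fit, starting on a human message."""
    kept: list[tuple[str, str]] = []
    total = 0
    for role, content in reversed(messages):
        cost = count_tokens(content)
        if total + cost > max_tokens:
            break
        kept.append((role, content))
        total += cost
    kept.reverse()                      # restore chronological order
    while kept and kept[0][0] != "human":
        kept.pop(0)                     # drop replies whose question was trimmed away
    return kept

history = [
    ("human", "My name is Priya"),
    ("ai", "Nice to meet you Priya"),
    ("human", "I love machine learning"),
    ("ai", "That is a great field"),
]
print(trim_to_budget(history, max_tokens=14))
# [('human', 'I love machine learning'), ('ai', 'That is a great field')]
```

With a 14-token budget the reply "Nice to meet you Priya" would fit, but its question would not, so the helper drops it rather than hand the model half an exchange.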

Testing Memory Behavior

Do not ship memory code without tests. Here is a minimal pytest setup that catches the most common bugs.

# test_memory.py
import pytest
from langchain_community.chat_message_histories import ChatMessageHistory

def test_session_isolation():
    """Two different session IDs must not share memory."""
    store = {}

    def get_history(session_id):
        if session_id not in store:
            store[session_id] = ChatMessageHistory()
        return store[session_id]

    h1 = get_history("user_a")
    h2 = get_history("user_b")

    h1.add_user_message("My name is Alice")

    assert len(h2.messages) == 0
    assert len(h1.messages) == 1

def test_history_grows_per_turn():
    """Each turn should add exactly 2 messages (human + AI)."""
    history = ChatMessageHistory()
    history.add_user_message("Hello")
    history.add_ai_message("Hi there!")
    history.add_user_message("How are you?")
    history.add_ai_message("I'm doing well, thanks!")

    assert len(history.messages) == 4

def test_clear_session():
    """Clearing should remove all messages."""
    history = ChatMessageHistory()
    history.add_user_message("Remember this")
    history.add_ai_message("Sure!")
    history.clear()

    assert len(history.messages) == 0

Run with: pytest test_memory.py -v

Common Mistakes and Their Fixes

Sharing a single memory object across all users: This is the most common bug I see in sample code. Every user needs their own history instance keyed by session ID. The store dict pattern in the examples above prevents this entirely.

Forgetting return_messages=True with classic ConversationBufferMemory: Without it, history is injected as a plain string, which breaks chat prompt templates expecting proper message objects.

Using synchronous .invoke() inside async FastAPI routes: Call .ainvoke() instead. A synchronous call blocks the event loop, so every other request on that worker waits until the LLM responds.

Not testing session isolation: Add the isolation test above before deploying. You do not want user A seeing user B's conversation history.

Ignoring token limits entirely: Add the trimmer or use ConversationSummaryBufferMemory to avoid context window errors in long-running sessions.
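The blocking mistake from the list above is easy to see in isolation. The sketch below uses only the standard library: `slow_llm_call` is a hypothetical stand-in for a synchronous `.invoke()`, and a background ticker counts how often the event loop gets to run while the call is in flight. `asyncio.to_thread` appears here only to demonstrate the difference; with LangChain the fix is simply `.ainvoke()`.

```python
import asyncio
import time

def slow_llm_call(prompt: str) -> str:
    """Hypothetical stand-in for a synchronous LLM call that takes 0.2 seconds."""
    time.sleep(0.2)
    return f"echo: {prompt}"

async def count_ticks(stop: asyncio.Event) -> int:
    """Count how many times the event loop lets this task run before `stop` is set."""
    ticks = 0
    while not stop.is_set():
        await asyncio.sleep(0.01)
        ticks += 1
    return ticks

async def run_once(blocking: bool) -> tuple[str, int]:
    stop = asyncio.Event()
    ticker = asyncio.create_task(count_ticks(stop))
    await asyncio.sleep(0)                       # give the ticker a chance to start
    if blocking:
        reply = slow_llm_call("hi")              # freezes the event loop for 0.2 s
    else:
        reply = await asyncio.to_thread(slow_llm_call, "hi")  # loop stays responsive
    stop.set()
    return reply, await ticker

if __name__ == "__main__":
    print("blocking:", asyncio.run(run_once(blocking=True)))
    print("threaded:", asyncio.run(run_once(blocking=False)))
```

In the blocking run the ticker barely advances, because nothing else can execute while the call sleeps. In the threaded run it keeps ticking throughout, which is what concurrent requests need.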

What to Build Next

Now that you have working multi-turn memory, the natural next step is giving your agent tools. Build AI agent with LangChain covers a complete tool-using agent. If you want to add document context so the agent can answer questions about your product docs, the RAG system tutorial pairs well with what you just built here.

For production deployment — rate limiting, authentication, monitoring — Deploy AI model to production covers the operational side in detail. And if you want to compare this approach to other chatbot frameworks, Build AI chatbot Python surveys several options side by side.

Conclusion

Multi-turn memory in LangChain comes down to one core idea: persist message history between calls and inject it into the prompt. The classic ConversationBufferMemory makes this easy for quick projects. The modern RunnableWithMessageHistory makes it clean — separating history management from chain logic so you can swap storage backends without touching your core code.

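The FastAPI endpoint in this guide is a workable base for moderate traffic on a single process; the in-memory store lives inside that process, so sessions are not shared across workers and do not survive a restart. For serious scale, swap the in-memory store for Redis and add proper authentication. The chain structure stays identical, and that is exactly the point of the architecture.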

If this helped you build something, drop a comment below or share what you are making with these techniques. Questions about specific memory backends or tricky session management edge cases? Ask away.

FAQs

What is the difference between ConversationBufferMemory and ConversationSummaryMemory? ConversationBufferMemory stores every message verbatim, so it is accurate but grows without bound. ConversationSummaryMemory uses an LLM to compress the conversation into a running summary, saving tokens at the cost of fine-grained detail. For short sessions pick buffer; for long-running agents pick summary or the hybrid ConversationSummaryBufferMemory.

Does RunnableWithMessageHistory work with any LLM provider? Yes. RunnableWithMessageHistory wraps any LCEL chain, so it works with OpenAI, Anthropic, Google, Ollama, or any chat model LangChain supports. You only need to provide a callable that returns a BaseChatMessageHistory instance keyed on session ID.

How do I persist conversation history across server restarts? Swap the in-memory store for a durable backend. LangChain's integrations include RedisChatMessageHistory, MongoDBChatMessageHistory, and DynamoDBChatMessageHistory. Have your get_session_history callable return an instance of one of those for the given session ID, and conversations survive restarts.

Why Memory Is Hard With LLMs

Setting Up the Project

pip install langchain langchain-openai langchain-community fastapi uvicorn python-dotenv tiktoken

Create a .env file:

OPENAI_API_KEY=your_key_here

Approach 1: ConversationBufferMemory

The classic approach. ConversationBufferMemory stores every message and injects them into the prompt on each call. Simple, transparent, and gets you far for short conversations.

from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationChain

load_dotenv()

llm = ChatOpenAI(model="gpt-4o", temperature=0.7)

memory = ConversationBufferMemory(
    return_messages=True,   # returns Message objects, not a plain string
    memory_key="history"
)

chain = ConversationChain(
    llm=llm,
    memory=memory,
    verbose=True            # prints the full prompt — great for debugging
)

# First turn
response1 = chain.predict(input="Hi! My name is Sofia and I'm learning Python.")
print(response1)

# Second turn — the model should remember "Sofia"
response2 = chain.predict(input="What's my name again?")
print(response2)

# Third turn
response3 = chain.predict(input="What was the first thing I told you?")
print(response3)

Inspecting and Clearing the Buffer

# See what is stored
print(memory.chat_memory.messages)

# Clear it for a fresh session
memory.clear()

The downside becomes obvious fast: if a conversation runs 50 turns, you are sending 50 turns of history with every single request. Token costs compound quickly.

Approach 2: ConversationSummaryMemory

When conversations grow long, ConversationSummaryMemory compresses older turns into a running summary using the LLM itself. Recent context stays fresh; old context gets compressed.

from langchain.memory import ConversationSummaryMemory

# This memory type needs an LLM to generate summaries
summary_memory = ConversationSummaryMemory(
    llm=ChatOpenAI(model="gpt-4o-mini", temperature=0),
    return_messages=True
)

chain = ConversationChain(
    llm=llm,
    memory=summary_memory,
    verbose=True
)

chain.predict(input="I'm building a Python web scraper for e-commerce sites.")
chain.predict(input="I'm using BeautifulSoup and Requests right now.")
chain.predict(input="My main challenge is handling JavaScript-rendered pages.")

# Memory now holds a compressed summary, not every individual message
print(summary_memory.buffer)

The trade-off is real: you burn tokens to save tokens. For very long sessions (50+ turns) it is worth it. For typical chatbot sessions under 20 turns, the overhead rarely justifies itself.

Approach 3: ConversationSummaryBufferMemory

from langchain.memory import ConversationSummaryBufferMemory

hybrid_memory = ConversationSummaryBufferMemory(
    llm=ChatOpenAI(model="gpt-4o-mini", temperature=0),
    max_token_limit=2000,   # keep recent messages up to this many tokens verbatim
    return_messages=True
)

chain = ConversationChain(llm=llm, memory=hybrid_memory)

for i in range(10):
    chain.predict(input=f"Message {i}: Tell me something interesting about number {i}.")

# Recent messages are verbatim; older ones are summarized
print("Buffer messages:", hybrid_memory.chat_memory.messages[-3:])
print("Summary:", hybrid_memory.moving_summary_buffer)

Approach 4: RunnableWithMessageHistory (Modern LCEL Pattern)

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.chat_history import BaseChatMessageHistory

# In-memory store keyed by session ID
store: dict[str, BaseChatMessageHistory] = {}

def get_session_history(session_id: str) -> BaseChatMessageHistory:
    if session_id not in store:
        store[session_id] = ChatMessageHistory()
    return store[session_id]

# Build the chain
llm = ChatOpenAI(model="gpt-4o", temperature=0.7)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. Be concise but friendly."),
    MessagesPlaceholder(variable_name="history"),
    ("human", "{input}"),
])

chain = prompt | llm

# Wrap with history management
chain_with_history = RunnableWithMessageHistory(
    chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="history",
)

# Use a config dict to identify the session
config = {"configurable": {"session_id": "user_abc_session_1"}}

# First turn
response1 = chain_with_history.invoke(
    {"input": "My name is Marcus and I work in data engineering."},
    config=config
)
print(response1.content)

# Second turn — same session_id carries the history over
response2 = chain_with_history.invoke(
    {"input": "What field do I work in?"},
    config=config
)
print(response2.content)

Why This Pattern Is Better

# Redis-backed history (requires: pip install redis langchain-redis)
from langchain_redis import RedisChatMessageHistory

def get_redis_session_history(session_id: str) -> RedisChatMessageHistory:
    return RedisChatMessageHistory(
        session_id=session_id,
        url="redis://localhost:6379",
        ttl=3600    # expire after 1 hour of inactivity
    )

chain_with_redis = RunnableWithMessageHistory(
    chain,
    get_redis_session_history,
    input_messages_key="input",
    history_messages_key="history",
)

The chain definition stays identical. That is the whole point of this architecture.

Injecting User Context Into the System Prompt

A very common production pattern: collect user preferences or profile data at signup and inject them into every conversation.

from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

personalized_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a personalized coding assistant.

User profile:
- Name: {user_name}
- Skill level: {skill_level}
- Primary language: {primary_language}

Tailor your explanations to their skill level. Use their name occasionally."""),
    MessagesPlaceholder(variable_name="history"),
    ("human", "{input}"),
])

personalized_chain = personalized_prompt | llm

personalized_with_history = RunnableWithMessageHistory(
    personalized_chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="history",
)

config = {"configurable": {"session_id": "user_marcus_001"}}

response = personalized_with_history.invoke(
    {
        "input": "How do I optimize a slow Pandas DataFrame operation?",
        "user_name": "Marcus",
        "skill_level": "intermediate",
        "primary_language": "Python",
    },
    config=config
)
print(response.content)

Memory Approach Comparison

Approach	Token Usage	Accuracy	Best For
ConversationBufferMemory	High (all messages)	Perfect recall	Short sessions under 20 turns
ConversationSummaryMemory	Low (summaries only)	May lose detail	Very long sessions, 50+ turns
ConversationSummaryBufferMemory	Medium	Good balance	Most production applications
RunnableWithMessageHistory + Buffer	Configurable	Perfect recall	New projects using LCEL
RunnableWithMessageHistory + Redis	Configurable	Perfect recall	Multi-server deployments

Building the FastAPI Chatbot Endpoint

Here is a complete FastAPI application tying everything together with proper per-user session management.

# app.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import Optional
import uvicorn

from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.chat_history import BaseChatMessageHistory

load_dotenv()

# ---- Session store ----
store: dict[str, BaseChatMessageHistory] = {}

def get_session_history(session_id: str) -> BaseChatMessageHistory:
    if session_id not in store:
        store[session_id] = ChatMessageHistory()
    return store[session_id]

# ---- Chain setup ----
llm = ChatOpenAI(model="gpt-4o", temperature=0.7)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant for AiTechWorlds.com users. "
               "Be concise, accurate, and friendly. "
               "Remember context from earlier in the conversation."),
    MessagesPlaceholder(variable_name="history"),
    ("human", "{input}"),
])

chain = prompt | llm

chain_with_history = RunnableWithMessageHistory(
    chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="history",
)

# ---- FastAPI app ----
app = FastAPI(title="LangChain Chatbot API", version="1.0")

class ChatRequest(BaseModel):
    session_id: str
    message: str

class ChatResponse(BaseModel):
    session_id: str
    response: str
    turn_count: int

@app.post("/chat", response_model=ChatResponse)
async def chat(request: ChatRequest):
    if not request.message.strip():
        raise HTTPException(status_code=400, detail="Message cannot be empty")

    config = {"configurable": {"session_id": request.session_id}}

    try:
        result = await chain_with_history.ainvoke(
            {"input": request.message},
            config=config
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"LLM error: {str(e)}")

    history = get_session_history(request.session_id)
    turn_count = len(history.messages) // 2   # 1 turn = 1 human msg + 1 AI msg

    return ChatResponse(
        session_id=request.session_id,
        response=result.content,
        turn_count=turn_count,
    )

@app.delete("/chat/{session_id}")
async def clear_session(session_id: str):
    if session_id in store:
        store[session_id].clear()
        return {"message": f"Session {session_id} cleared"}
    raise HTTPException(status_code=404, detail="Session not found")

@app.get("/chat/{session_id}/history")
async def get_history(session_id: str):
    if session_id not in store:
        raise HTTPException(status_code=404, detail="Session not found")

    messages = store[session_id].messages
    return {
        "session_id": session_id,
        "turn_count": len(messages) // 2,
        "messages": [
            {"role": msg.type, "content": msg.content}
            for msg in messages
        ]
    }

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

Start it:

uvicorn app:app --reload --port 8000

Test with curl:

# First turn
curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"session_id": "user_001", "message": "Hi, my name is Priya and I love machine learning."}'

# Second turn — same session_id
curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"session_id": "user_001", "message": "What do I love?"}'

# Check full history
curl http://localhost:8000/chat/user_001/history

Adding Streaming Responses

Nobody wants to stare at a blank screen while the LLM generates a response. Streaming sends tokens as they arrive.

from fastapi.responses import StreamingResponse
from langchain_openai import ChatOpenAI

streaming_llm = ChatOpenAI(model="gpt-4o", temperature=0.7, streaming=True)
streaming_chain = prompt | streaming_llm
streaming_chain_with_history = RunnableWithMessageHistory(
    streaming_chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="history",
)

@app.post("/chat/stream")
async def chat_stream(request: ChatRequest):
    config = {"configurable": {"session_id": request.session_id}}

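    async def token_generator():
        async for chunk in streaming_chain_with_history.astream(
            {"input": request.message},
            config=config
        ):
            if chunk.content:
                # SSE requires every line of a payload to carry its own
                # "data:" prefix, so split chunks that contain newlines.
                lines = chunk.content.split("\n")
                yield "".join(f"data: {line}\n" for line in lines) + "\n"
        # Sentinel event so the client knows the reply is complete
        yield "data: [DONE]\n\n"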

    return StreamingResponse(
        token_generator(),
        media_type="text/event-stream"
    )
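
To check the stream from a script rather than a browser, the client has to parse the event-stream framing itself. The sketch below is one way to do that in plain Python. The helper names iter_sse_data and collect_reply are hypothetical, not part of FastAPI or LangChain, and the sketch assumes the HTTP client has already split the response into decoded text lines without trailing newlines.

```python
from typing import Iterable, Iterator, List


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each complete server-sent event."""
    buffer: List[str] = []
    for line in lines:
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith("data:"):
            payload = line[len("data:"):]
            # The format allows one optional space after the colon.
            if payload.startswith(" "):
                payload = payload[1:]
            buffer.append(payload)
        # Other fields (event:, id:, comments) are ignored in this sketch.
    # An event that was never closed by a blank line is dropped.


def collect_reply(lines: Iterable[str]) -> str:
    """Concatenate streamed chunks until the [DONE] sentinel arrives."""
    parts: List[str] = []
    for data in iter_sse_data(lines):
        if data == "[DONE]":
            break
        parts.append(data)
    return "".join(parts)


if __name__ == "__main__":
    sample = ["data: Hel", "", "data: lo", "", "data: [DONE]", ""]
    print(collect_reply(sample))
```

Each blank line closes one event, and several data: lines inside a single event are joined with newlines, which is how the event-stream format carries multi-line payloads.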

Handling Token Limits Gracefully

For very long sessions, you will eventually approach the model's context window limit. A trimmer prevents that from crashing your endpoint.

import tiktoken
from langchain_community.chat_message_histories import ChatMessageHistory

MAX_HISTORY_TOKENS = 6000

def trim_history_to_token_limit(
    history: ChatMessageHistory,
    max_tokens: int = MAX_HISTORY_TOKENS
) -> list:
    """Keep the most recent messages that fit within the token budget."""
    encoder = tiktoken.encoding_for_model("gpt-4o")
    messages = history.messages

    total_tokens = 0
    kept_messages = []

    for msg in reversed(messages):
        tokens = len(encoder.encode(msg.content))
        if total_tokens + tokens > max_tokens:
            break
        kept_messages.insert(0, msg)
        total_tokens += tokens

    return kept_messages

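def get_trimmed_history(session_id: str):
    if session_id not in store:
        store[session_id] = ChatMessageHistory()

    history = store[session_id]
    # Trim in place and return the stored object. Returning a temporary copy
    # would make RunnableWithMessageHistory save new turns to the copy, so
    # they would never reach the store.
    # Trade-off: messages that fall outside the budget are discarded for good.
    history.messages = trim_history_to_token_limit(history)
    return history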

# Use get_trimmed_history instead of get_session_history in the chain wrapper
chain_with_trimmed_history = RunnableWithMessageHistory(
    chain,
    get_trimmed_history,
    input_messages_key="input",
    history_messages_key="history",
)
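
The budget walk is easier to reason about on plain data. The following sketch repeats the same newest-first logic on (role, content) tuples, with a word counter standing in for tiktoken so that it runs without extra dependencies. It also adds one refinement that is not in the code above: dropping a leading AI message so that the trimmed window starts with a human turn. The names trim_to_budget and count_words are illustrative only.

```python
from typing import Callable, List, Tuple

# (role, content); a stand-in for LangChain message objects
Message = Tuple[str, str]


def count_words(text: str) -> int:
    """Crude stand-in for a real tokenizer such as tiktoken."""
    return len(text.split())


def trim_to_budget(
    messages: List[Message],
    max_tokens: int,
    count_tokens: Callable[[str], int] = count_words,
) -> List[Message]:
    """Keep the newest messages whose combined cost fits in max_tokens."""
    kept: List[Message] = []
    total = 0
    # Walk from newest to oldest and stop at the first message that overflows.
    for role, content in reversed(messages):
        cost = count_tokens(content)
        if total + cost > max_tokens:
            break
        kept.append((role, content))
        total += cost
    kept.reverse()  # restore chronological order
    # Assumption of this sketch: a window should not open with an AI reply
    # whose question was trimmed away.
    while kept and kept[0][0] != "human":
        kept.pop(0)
    return kept


if __name__ == "__main__":
    chat = [
        ("human", "my name is Priya"),
        ("ai", "nice to meet you Priya"),
        ("human", "what do I love"),
        ("ai", "machine learning"),
    ]
    for role, content in trim_to_budget(chat, max_tokens=7):
        print(role, content)
```

Appending and reversing once avoids repeated inserts at the front of the list, which matters only for very long histories.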

Testing Memory Behavior

Do not ship memory code without tests. Here is a minimal pytest setup that catches the most common bugs.

# test_memory.py
import pytest
from langchain_community.chat_message_histories import ChatMessageHistory

def test_session_isolation():
    """Two different session IDs must not share memory."""
    store = {}

    def get_history(session_id):
        if session_id not in store:
            store[session_id] = ChatMessageHistory()
        return store[session_id]

    h1 = get_history("user_a")
    h2 = get_history("user_b")

    h1.add_user_message("My name is Alice")

    assert len(h2.messages) == 0
    assert len(h1.messages) == 1

def test_history_grows_per_turn():
    """Each turn should add exactly 2 messages (human + AI)."""
    history = ChatMessageHistory()
    history.add_user_message("Hello")
    history.add_ai_message("Hi there!")
    history.add_user_message("How are you?")
    history.add_ai_message("I'm doing well, thanks!")

    assert len(history.messages) == 4

def test_clear_session():
    """Clearing should remove all messages."""
    history = ChatMessageHistory()
    history.add_user_message("Remember this")
    history.add_ai_message("Sure!")
    history.clear()

    assert len(history.messages) == 0

Run with: pytest test_memory.py -v
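
One more test worth considering checks that the history factory hands back the stored object on every call, since a factory that returns a temporary copy would silently lose any turns written to it. The sketch below uses a hypothetical FakeHistory stand-in so it runs without LangChain installed; in a real suite you would likely swap in ChatMessageHistory.

```python
class FakeHistory:
    """Minimal stand-in for a chat history object (hypothetical)."""

    def __init__(self):
        self.messages = []

    def add_message(self, message):
        self.messages.append(message)


def make_history_factory(store):
    """Build a session-history factory backed by the given dict."""

    def get_history(session_id):
        if session_id not in store:
            store[session_id] = FakeHistory()
        return store[session_id]

    return get_history


def test_factory_returns_stored_object():
    """Writes made through the factory must land in the store."""
    store = {}
    get_history = make_history_factory(store)

    first = get_history("user_a")
    first.add_message("Hello")
    second = get_history("user_a")

    assert second is first
    assert store["user_a"].messages == ["Hello"]
```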

Common Mistakes and Their Fixes

Using synchronous .invoke() inside async FastAPI routes: a synchronous call blocks the event loop, so every other request stalls until the LLM responds. Call .ainvoke() instead.

Not testing session isolation: add the isolation test above before deploying. You do not want user A seeing user B's conversation history.

Ignoring token limits entirely: add the trimmer or use ConversationSummaryBufferMemory to avoid context window errors in long-running sessions.
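
The first mistake can be seen without an LLM. This sketch, which assumes Python 3.9+ for asyncio.to_thread, uses time.sleep as a stand-in for a synchronous .invoke() call and counts how often a background heartbeat task gets to run while that call is in flight. All function names here are illustrative.

```python
import asyncio
import time
from typing import List


def slow_sync_call() -> str:
    """Stand-in for a blocking call such as a synchronous .invoke()."""
    time.sleep(0.2)
    return "done"


async def heartbeat(ticks: List[float]) -> None:
    """Record a tick roughly every 20 ms while the event loop is free."""
    while True:
        ticks.append(time.perf_counter())
        await asyncio.sleep(0.02)


async def measure(blocking: bool) -> int:
    """Return how many heartbeat ticks happened during the slow call."""
    ticks: List[float] = []
    task = asyncio.create_task(heartbeat(ticks))
    await asyncio.sleep(0)  # give the heartbeat a chance to start

    if blocking:
        slow_sync_call()  # runs on the event loop thread and stalls it
    else:
        await asyncio.to_thread(slow_sync_call)  # runs in a worker thread

    task.cancel()
    await asyncio.gather(task, return_exceptions=True)
    return len(ticks)


if __name__ == "__main__":
    print("ticks while blocked:", asyncio.run(measure(blocking=True)))
    print("ticks while free:   ", asyncio.run(measure(blocking=False)))
```

With the blocking call the heartbeat is starved for the whole 0.2 seconds, while with the call moved to a worker thread it keeps ticking. For LangChain runnables, .ainvoke() is the more direct fix; asyncio.to_thread is an option when only a synchronous API is available.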

AiTechWorlds Team
