AI Tips Prompting Python AI Tools Web Dev ChatGPT LLM Agent Dev Reviews Notes Free Books

AiTechWorlds

AI conversation history stored in memory — LangChain conversation memory types buffer summary

How to Implement Memory in LangChain (Buffer, Summary, Vector)

Q: How do I share memory across multiple users in a multi-user chatbot?

Use session IDs. Each user gets a unique session_id passed to get_session_history(). LangChain's RunnableWithMessageHistory handles the routing — one history store per session_id. Store histories in Redis or a database for multi-instance deployments.

⚡ Quick Answer

Learn every major LangChain memory type — ConversationBufferMemory, SummaryMemory, VectorStoreMemory, and EntityMemory — with working code and a comparison table.

AiTechWorlds Team May 31, 2026 14 min read

#LangChain #Memory #Chatbot #ConversationMemory #Python

📚Part of the Langchain guide — explore all Langchain articles→

Share:Facebook Twitter/X LinkedIn Telegram WhatsApp

📱

Get more content like this on Telegram!

Daily AI tips, notes & resources — free

Join Free →

Every chatbot needs memory. Without it, the model has no context — ask "what did I just say?" and it draws a blank. Build memory wrong, and you'll either blow through your token budget on long conversations or confuse users by forgetting things they said five messages ago.

I've built a lot of chatbots. The memory decision — which type to use, how to persist it, when to summarize — is one of the decisions that actually moves the needle on user experience. This guide covers every major LangChain memory type with working code, a comparison table, and honest notes on when each one is the right tool.

If you're building a chatbot from scratch, the build AI chatbot Python guide gives you the full application context. For the agent side of memory — how agents remember across tool calls — see AI agent memory and planning.

Understanding How LangChain Memory Works

Before the code, let me clarify the modern LangChain memory architecture. The framework went through a significant refactor: the older ConversationBufferMemory class is now considered legacy. Modern LangChain uses:

ChatMessageHistory — stores the raw messages (the data layer)
RunnableWithMessageHistory — injects history into any LCEL chain (the integration layer)
MessagesPlaceholder — the slot in your prompt where history gets inserted

This separation is cleaner than the old approach. The history store is swappable (in-memory, Redis, Postgres), and the chain code stays the same regardless.

That said, the legacy classes still work and you'll encounter them constantly in existing code. This guide covers both.

Memory Type 1: ConversationBufferMemory

Buffer memory keeps every message in the conversation history. Simplest, most faithful, most expensive at scale.

Modern LCEL Approach

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.output_parsers import StrOutputParser
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.7)

# Prompt with a placeholder for conversation history
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. Be concise and friendly."),
    MessagesPlaceholder(variable_name="history"),
    ("human", "{input}")
])

chain = prompt | llm | StrOutputParser()

# In-memory store (use Redis/DB for production)
session_histories = {}

def get_session_history(session_id: str) -> ChatMessageHistory:
    if session_id not in session_histories:
        session_histories[session_id] = ChatMessageHistory()
    return session_histories[session_id]

# Wrap the chain with history management
chain_with_memory = RunnableWithMessageHistory(
    chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="history"
)

# Have a multi-turn conversation
config = {"configurable": {"session_id": "user_abc"}}

response1 = chain_with_memory.invoke({"input": "My name is Priya."}, config=config)
print(f"Assistant: {response1}")

response2 = chain_with_memory.invoke({"input": "I'm learning Python."}, config=config)
print(f"Assistant: {response2}")

response3 = chain_with_memory.invoke({"input": "What's my name and what am I learning?"}, config=config)
print(f"Assistant: {response3}")  # Should correctly reference Priya and Python

# Inspect the stored history
history = get_session_history("user_abc")
print(f"\nMessages in history: {len(history.messages)}")
for msg in history.messages:
    print(f"[{msg.__class__.__name__}]: {msg.content[:60]}")

Legacy Buffer Memory (you'll encounter this in existing codebases)

from langchain.memory import ConversationBufferMemory
from langchain_openai import ChatOpenAI
from langchain.chains import ConversationChain

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.7)
memory = ConversationBufferMemory(return_messages=True)

conversation = ConversationChain(llm=llm, memory=memory, verbose=False)

response = conversation.predict(input="Hello, I'm working on a Python project.")
print(response)

response = conversation.predict(input="The project uses LangChain and OpenAI.")
print(response)

# Access the stored history
print(memory.chat_memory.messages)

When buffer memory breaks down: Each message takes tokens. A 50-turn conversation might accumulate 5,000+ tokens of history before you've even written your actual query. For user-facing applications, you need a cap.

Memory Type 2: Buffer Window Memory

Window memory solves the token accumulation problem by only keeping the last N message pairs.

from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.output_parsers import StrOutputParser
from langchain_core.messages import BaseMessage
from typing import List

class WindowedChatHistory:
    """Keeps only the last k message pairs (2k messages total)."""
    
    def __init__(self, k: int = 5):
        self.k = k
        self._history = ChatMessageHistory()
    
    @property
    def messages(self) -> List[BaseMessage]:
        all_messages = self._history.messages
        # Keep only last 2k messages (k pairs of human+AI)
        if len(all_messages) > self.k * 2:
            return all_messages[-(self.k * 2):]
        return all_messages
    
    def add_messages(self, messages: List[BaseMessage]) -> None:
        for msg in messages:
            self._history.add_message(msg)
    
    def clear(self) -> None:
        self._history.clear()

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.5)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. You remember the last few messages."),
    MessagesPlaceholder(variable_name="history"),
    ("human", "{input}")
])

chain = prompt | llm | StrOutputParser()

# Window stores — k=3 means we remember last 3 exchanges
windowed_stores: dict = {}

def get_windowed_history(session_id: str) -> BaseChatMessageHistory:
    if session_id not in windowed_stores:
        windowed_stores[session_id] = WindowedChatHistory(k=3)
    return windowed_stores[session_id]

chain_with_window = RunnableWithMessageHistory(
    chain,
    get_windowed_history,
    input_messages_key="input",
    history_messages_key="history"
)

config = {"configurable": {"session_id": "windowed_user"}}

# Messages beyond k=3 pairs will be forgotten
for i in range(8):
    response = chain_with_window.invoke(
        {"input": f"This is message number {i+1}."},
        config=config
    )
    print(f"Turn {i+1}: {response[:60]}")

# Ask about early messages — they'll be forgotten
response = chain_with_window.invoke(
    {"input": "What was message number 1?"},
    config=config
)
print(f"Memory test: {response}")

Production note: Token-based windowing (trimming based on token count rather than message count) is more precise than message-count windowing. LangChain's trim_messages utility handles this:

from langchain_core.messages import trim_messages

# Trim messages to fit within a token budget
trimmed = trim_messages(
    messages=history.messages,
    max_tokens=2000,
    token_counter=llm,  # Use the LLM to count tokens accurately
    strategy="last",    # Keep the most recent messages
    include_system=True,
    allow_partial=False
)

Memory Type 3: ConversationSummaryMemory

Summary memory uses an LLM to periodically compress old conversation history into a summary. The summary replaces the raw messages, keeping token usage bounded while preserving semantic content.

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.output_parsers import StrOutputParser
from langchain_core.messages import HumanMessage, AIMessage, SystemMessage
from langchain_community.chat_message_histories import ChatMessageHistory

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
summarizer_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

summarize_prompt = ChatPromptTemplate.from_template(
    """Summarize the conversation below into 2-3 sentences that capture the key information.
    Focus on facts about the user, their goals, and any decisions made.
    
    Conversation:
    {conversation}
    
    Summary:"""
)

summarize_chain = summarize_prompt | summarizer_llm | StrOutputParser()

class SummaryMemoryStore:
    """Maintains a running summary of old messages plus recent raw messages."""
    
    def __init__(self, recent_messages: int = 6, summary_threshold: int = 10):
        self.recent_messages = recent_messages
        self.summary_threshold = summary_threshold
        self._messages = []
        self._summary = ""
    
    @property
    def messages(self):
        result = []
        if self._summary:
            result.append(SystemMessage(content=f"[Conversation summary so far: {self._summary}]"))
        result.extend(self._messages[-self.recent_messages:])
        return result
    
    def add_messages(self, messages):
        for msg in messages:
            self._messages.append(msg)
        
        # Summarize when we exceed the threshold
        if len(self._messages) > self.summary_threshold:
            self._compress()
    
    def _compress(self):
        # Get messages to summarize (all but the most recent)
        to_summarize = self._messages[:-self.recent_messages]
        if not to_summarize:
            return
        
        conversation_text = "\n".join(
            f"{msg.__class__.__name__}: {msg.content}"
            for msg in to_summarize
        )
        
        # Generate summary
        if self._summary:
            conversation_text = f"Previous summary: {self._summary}\n\nNew conversation:\n{conversation_text}"
        
        self._summary = summarize_chain.invoke({"conversation": conversation_text})
        # Keep only recent messages
        self._messages = self._messages[-self.recent_messages:]
        print(f"[Memory compressed. Summary: {self._summary[:100]}...]")
    
    def clear(self):
        self._messages = []
        self._summary = ""

# Integrate with LCEL chain
memory_store = SummaryMemoryStore(recent_messages=6, summary_threshold=10)

chat_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful personal assistant."),
    MessagesPlaceholder(variable_name="history"),
    ("human", "{input}")
])

chat_chain = chat_prompt | llm | StrOutputParser()

def chat_with_summary_memory(user_input: str) -> str:
    # Get current history
    history = memory_store.messages
    
    # Invoke the chain
    response = chat_chain.invoke({
        "history": history,
        "input": user_input
    })
    
    # Update memory with new exchange
    memory_store.add_messages([
        HumanMessage(content=user_input),
        AIMessage(content=response)
    ])
    
    return response

# Test across a long conversation
topics = [
    "I'm a software engineer working at a startup in Berlin.",
    "We're building a recommendation system using LangChain.",
    "The main challenge is scaling the vector search to millions of users.",
    "We're considering Pinecone vs Qdrant for the vector database.",
    "Our budget is about $500/month for infrastructure.",
    "I prefer Python over JavaScript for backend work.",
    "We're using FastAPI for the REST endpoints.",
    "The team has 4 engineers and one ML researcher.",
    "We ship new features every two weeks.",
    "Our biggest bottleneck is the embedding generation pipeline.",
    "We're thinking about using batch embeddings to reduce costs.",
]

for msg in topics:
    response = chat_with_summary_memory(msg)
    print(f"User: {msg[:50]}")
    print(f"Bot: {response[:80]}\n")

# Test memory recall after compression
recall = chat_with_summary_memory("What city am I working in and what database are we considering?")
print(f"\nMemory recall: {recall}")

The trade-off: Summary memory reduces token usage dramatically but loses verbatim detail. If a user mentioned "my API key is X123" early in the conversation, a summary might not retain that. For use cases where exact details matter, buffer or vector memory is safer.

Memory Type 4: Vector Store Memory

Vector memory stores conversation turns as embeddings and retrieves the most semantically relevant past exchanges for each new query. Instead of recalling the most recent messages, it recalls the most relevant ones.

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.messages import HumanMessage, AIMessage
from langchain_core.documents import Document
import uuid

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.5)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Vector store for conversation memory
memory_store = Chroma(
    collection_name="conversation_memory",
    embedding_function=embeddings,
    persist_directory="./conversation_memory_db"
)

def store_exchange(human_msg: str, ai_msg: str, session_id: str):
    """Store a conversation exchange as a document."""
    exchange_text = f"User: {human_msg}\nAssistant: {ai_msg}"
    doc = Document(
        page_content=exchange_text,
        metadata={
            "session_id": session_id,
            "human_message": human_msg,
            "ai_message": ai_msg
        }
    )
    memory_store.add_documents([doc], ids=[str(uuid.uuid4())])

def retrieve_relevant_memory(query: str, session_id: str, k: int = 3) -> str:
    """Retrieve the k most relevant past exchanges."""
    results = memory_store.similarity_search(
        query,
        k=k,
        filter={"session_id": session_id}
    )
    if not results:
        return "No relevant past conversations found."
    
    return "\n\n".join([
        f"Past exchange:\n{doc.page_content}"
        for doc in results
    ])

prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a helpful assistant with access to relevant past conversation snippets.
    
Relevant past conversations:
{memory}

Use this context when relevant, but don't reference it directly unless needed."""),
    ("human", "{input}")
])

chain = prompt | llm | StrOutputParser()

def chat_with_vector_memory(user_input: str, session_id: str) -> str:
    # Retrieve relevant memories
    relevant_memory = retrieve_relevant_memory(user_input, session_id)
    
    # Generate response with memory context
    response = chain.invoke({
        "memory": relevant_memory,
        "input": user_input
    })
    
    # Store this exchange for future retrieval
    store_exchange(user_input, response, session_id)
    
    return response

# Test with a session
session = "test_session_001"
exchanges = [
    "I'm building a customer support chatbot for an e-commerce company.",
    "The main products we sell are electronics and home appliances.",
    "We get about 500 support tickets per day.",
    "The biggest issue customers have is about delivery tracking.",
    "We use Shopify for our store and have a custom API.",
]

for msg in exchanges:
    resp = chat_with_vector_memory(msg, session)
    print(f"User: {msg}")
    print(f"Bot: {resp[:100]}\n")

# Ask something that requires recalling specific earlier info
recall_test = chat_with_vector_memory(
    "What platform do we use and what's our main customer issue?",
    session
)
print(f"\nRecall test: {recall_test}")

When vector memory shines: Long-running assistants, note-taking agents, or personalization use cases where the user has many previous sessions and you want to surface the most contextually relevant history, not just the most recent.

Memory Type 5: Entity Memory

Entity memory tracks specific facts about named entities (people, companies, places) across a conversation. It's the right choice when your chatbot needs to remember structured facts.

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
import json

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Entity extractor
entity_extractor_prompt = ChatPromptTemplate.from_template(
    """Extract any new facts about named entities from this conversation turn.
    Return a JSON object with entity names as keys and their properties as values.
    Only include NEW or UPDATED information not already in the existing entities.
    
    Existing entities: {entities}
    
    New conversation turn:
    User: {human_message}
    Assistant: {ai_message}
    
    New entity facts (JSON only, empty object if none):"""
)

entity_extractor = entity_extractor_prompt | llm | StrOutputParser()

class EntityMemory:
    def __init__(self):
        self.entities = {}
    
    def update(self, human_msg: str, ai_msg: str):
        """Extract and store entity facts from a conversation turn."""
        result = entity_extractor.invoke({
            "entities": json.dumps(self.entities, indent=2),
            "human_message": human_msg,
            "ai_message": ai_msg
        })
        
        try:
            # Handle markdown code blocks
            if "```json" in result:
                result = result.split("```json")[1].split("```")[0].strip()
            elif "```" in result:
                result = result.split("```")[1].split("```")[0].strip()
            
            new_facts = json.loads(result)
            for entity, facts in new_facts.items():
                if entity not in self.entities:
                    self.entities[entity] = {}
                if isinstance(facts, dict):
                    self.entities[entity].update(facts)
                else:
                    self.entities[entity]["info"] = facts
        except (json.JSONDecodeError, IndexError):
            pass  # Skip malformed responses
    
    def get_context(self, query: str = "") -> str:
        if not self.entities:
            return "No entity information stored yet."
        return json.dumps(self.entities, indent=2)

entity_memory = EntityMemory()

chat_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a helpful assistant.
    
Known entities and facts:
{entity_context}

Use this information naturally in conversation."""),
    ("human", "{input}")
])

chat_chain = chat_prompt | llm | StrOutputParser()

def chat_with_entity_memory(user_input: str) -> str:
    # Get current entity context
    entity_context = entity_memory.get_context(user_input)
    
    # Generate response
    response = chat_chain.invoke({
        "entity_context": entity_context,
        "input": user_input
    })
    
    # Update entity memory with this exchange
    entity_memory.update(user_input, response)
    
    return response

# Test
print(chat_with_entity_memory("My colleague Alice is a senior Python developer at TechCorp."))
print(chat_with_entity_memory("Alice is working on migrating their codebase to FastAPI."))
print(chat_with_entity_memory("TechCorp is based in San Francisco and has about 200 employees."))

# Test recall of specific entity facts
print("\n--- Entity recall test ---")
print(chat_with_entity_memory("What can you tell me about Alice and TechCorp?"))
print(json.dumps(entity_memory.entities, indent=2))

Entity memory is particularly valuable for CRM-adjacent chatbots, personal assistants, or any application where remembering structured facts about specific people/organizations improves the experience.

Memory Comparison Table

Memory Type	Token Usage	Recall Quality	Update Speed	Best For
Buffer (full)	Grows indefinitely	Perfect — verbatim	Instant	Short conversations, precise recall
Buffer Window	Bounded (last N turns)	Good for recent, none for old	Instant	General chatbots with moderate conversation length
Summary	Bounded (summary size)	Good semantic, loses detail	LLM call required	Long conversations, cost-sensitive apps
Vector Store	Bounded	Excellent — semantic retrieval	Embedding call	Long-running assistants, personalization
Entity	Bounded (entity store size)	Perfect for tracked entities	LLM extraction call	CRM bots, personal assistants, fact tracking

Persistent Memory Across Sessions

All the examples above lose memory on server restart. For production chatbots, you need persistence.

Redis-Backed Memory

from langchain_community.chat_message_histories import RedisChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.output_parsers import StrOutputParser

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.7)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    MessagesPlaceholder(variable_name="history"),
    ("human", "{input}")
])

chain = prompt | llm | StrOutputParser()

def get_redis_history(session_id: str) -> RedisChatMessageHistory:
    return RedisChatMessageHistory(
        session_id=session_id,
        url="redis://localhost:6379",
        ttl=86400  # Expire after 24 hours
    )

chain_with_redis = RunnableWithMessageHistory(
    chain,
    get_redis_history,
    input_messages_key="input",
    history_messages_key="history"
)

# This persists across server restarts
response = chain_with_redis.invoke(
    {"input": "Remember this: my project deadline is June 15th."},
    config={"configurable": {"session_id": "persistent_user_001"}}
)

SQL-Backed Memory

from langchain_community.chat_message_histories import SQLChatMessageHistory

def get_sql_history(session_id: str) -> SQLChatMessageHistory:
    return SQLChatMessageHistory(
        session_id=session_id,
        connection="sqlite:///./chat_history.db"
        # Or PostgreSQL: "postgresql://user:pass@localhost/dbname"
    )

For detailed coverage of persistent memory and multi-session management in production chatbots, the LangChain multi-turn conversations guide goes deep on the patterns.

Choosing the Right Memory Type

Here's my decision tree for picking a memory approach:

Is the conversation short (under 20 turns)? → Buffer memory is fine.

Is the conversation long but you mainly need recent context? → Buffer Window with token trimming.

Is cost a concern and the conversation can exceed 30+ turns? → Summary memory.

Do you need to recall specific facts mentioned anywhere in a long history? → Vector store memory.

Do you need to track structured facts about entities (people, places, organizations)? → Entity memory.

Do you need cross-session persistence? → Redis or SQL-backed history with any of the above.

Many production applications combine types: buffer window for recent context, plus entity memory for important facts, plus periodic summarization. The AI agent memory and planning guide covers these hybrid approaches.

Conclusion

Memory is what transforms a stateless LLM call into a conversation. LangChain gives you five meaningful memory types, each optimized for a different trade-off between token cost, recall quality, and update overhead. Buffer memory is simplest and most accurate but can blow your token budget. Summary memory is cost-efficient but lossy. Vector memory gives you semantic recall across long histories. Entity memory tracks specific facts precisely.

For new chatbot projects: start with RunnableWithMessageHistory plus in-memory ChatMessageHistory. Measure your token usage after a week of real usage. Then decide whether you need windowing, summarization, or a more specialized memory type.

When you're ready to add tools and true autonomy, see LangChain agent types — agents use memory differently than chains, and the patterns there build directly on what we covered here.

Frequently Asked Questions

Does LangChain memory persist across server restarts?

In-memory stores (ChatMessageHistory) are wiped on restart. For persistence, use Redis-backed, database-backed, or file-backed history stores. LangChain provides RedisChatMessageHistory, SQLChatMessageHistory, and FileChatMessageHistory for persistent memory.

How do I share memory across multiple users in a multi-user chatbot?

Use session IDs. Each user gets a unique session_id passed to get_session_history(). LangChain's RunnableWithMessageHistory handles the routing — one history store per session_id. Store histories in Redis or a database for multi-instance deployments.

What is the token limit for ConversationBufferMemory?

Buffer memory keeps every message, so token usage grows indefinitely. For long conversations, switch to ConversationBufferWindowMemory (keeps last N messages) or ConversationSummaryMemory (summarizes old messages). In production, always cap memory with one of these approaches.

Share this article:Facebook Twitter/X LinkedIn Telegram WhatsApp

💬 DiscussionPowered by GitHub Discussions

Frequently Asked Questions

In-memory stores (ChatMessageHistory) are wiped on restart. For persistence, use Redis-backed, database-backed, or file-backed history stores. LangChain provides RedisEntityStore, SQLChatMessageHistory, and FileChatMessageHistory for persistent memory.

AiTechWorlds Team

✓ Verified Writer

The AiTechWorlds team is passionate about AI, technology, and education. We create high-quality, research-backed content to help you learn, grow, and succeed in the modern digital world.

📱 Follow on Telegram 🐦 Follow on X Learn More →

search relevance ranking showing scores — LangChain advanced RAG retrieval strategies

Agent Development

10 LangChain Retrieval Strategies for Better RAG Results

Go beyond basic similarity search with ParentDocumentRetriever, MultiQueryRetriever, EnsembleRetriever, HyDE, and 6 more LangChain retrieval strategies — with code for each.

May 31, 2026 13 min read

AI agent architecture with memory and tool connections — LangChain agent memory tools

Agent Development

Build a LangChain Agent with Memory and Tools (Full Example)

Build a complete LangChain conversational agent with persistent memory, multiple tools, and step-by-step trace — from setup to a production-ready implementation with code.

May 31, 2026 14 min read

developer coding AI agent decision loop — LangChain agent types ZeroShot ReAct Conversational

Agent Development

5 LangChain Agent Types Explained (ZeroShot, ReAct, and More)

Understand every major LangChain agent type — ZeroShotAgent, ReAct, ConversationalAgent, and more — with Python code and agent trace walkthroughs.

May 31, 2026 13 min read

FastAPI server running LangChain endpoint — deploy LangChain FastAPI REST streaming

Agent Development

How to Deploy a LangChain App as a FastAPI REST Endpoint

Serve a LangChain app as a production FastAPI REST endpoint with streaming, async chains, error handling, and Docker deployment — full Python code included.

May 31, 2026 11 min read

Go deeper on this topic

InterviewPython BookAI Agent Development Guide BookBuilding AI Apps: Developer's Guide CourseAI Agent Development Course QuizPython Basics QuizPython OOP Concepts

10K+ Members Growing Daily

Get Free AI Notes Daily

Join AiTechWorlds on Telegram and get daily AI tips, prompt engineering templates, coding resources, and exclusive content — 100% free!

📚 Free Study Notes🤖 AI Tips Daily⚡ Prompt Templates💻 Coding Resources

Join Free Channel

No spam. Leave anytime.

Langchain

How to Implement Memory in LangChain (Buffer, Summary, Vector)

⚡ Quick Answer

Learn every major LangChain memory type — ConversationBufferMemory, SummaryMemory, VectorStoreMemory, and EntityMemory — with working code and a comparison table.

AiTechWorlds Team May 31, 2026 14 min read

#LangChain #Memory #Chatbot #ConversationMemory #Python

📚Part of the Langchain guide — explore all Langchain articles→

Share:Facebook Twitter/X LinkedIn Telegram WhatsApp

📱

Get more content like this on Telegram!

Daily AI tips, notes & resources — free

Join Free →

Understanding How LangChain Memory Works

ChatMessageHistory — stores the raw messages (the data layer)
RunnableWithMessageHistory — injects history into any LCEL chain (the integration layer)
MessagesPlaceholder — the slot in your prompt where history gets inserted

This separation is cleaner than the old approach. The history store is swappable (in-memory, Redis, Postgres), and the chain code stays the same regardless.

That said, the legacy classes still work and you'll encounter them constantly in existing code. This guide covers both.

Memory Type 1: ConversationBufferMemory

Buffer memory keeps every message in the conversation history. Simplest, most faithful, most expensive at scale.

Modern LCEL Approach

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.output_parsers import StrOutputParser
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.7)

# Prompt with a placeholder for conversation history
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. Be concise and friendly."),
    MessagesPlaceholder(variable_name="history"),
    ("human", "{input}")
])

chain = prompt | llm | StrOutputParser()

# In-memory store (use Redis/DB for production)
session_histories = {}

def get_session_history(session_id: str) -> ChatMessageHistory:
    if session_id not in session_histories:
        session_histories[session_id] = ChatMessageHistory()
    return session_histories[session_id]

# Wrap the chain with history management
chain_with_memory = RunnableWithMessageHistory(
    chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="history"
)

# Have a multi-turn conversation
config = {"configurable": {"session_id": "user_abc"}}

response1 = chain_with_memory.invoke({"input": "My name is Priya."}, config=config)
print(f"Assistant: {response1}")

response2 = chain_with_memory.invoke({"input": "I'm learning Python."}, config=config)
print(f"Assistant: {response2}")

response3 = chain_with_memory.invoke({"input": "What's my name and what am I learning?"}, config=config)
print(f"Assistant: {response3}")  # Should correctly reference Priya and Python

# Inspect the stored history
history = get_session_history("user_abc")
print(f"\nMessages in history: {len(history.messages)}")
for msg in history.messages:
    print(f"[{msg.__class__.__name__}]: {msg.content[:60]}")

Legacy Buffer Memory (you'll encounter this in existing codebases)

from langchain.memory import ConversationBufferMemory
from langchain_openai import ChatOpenAI
from langchain.chains import ConversationChain

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.7)
memory = ConversationBufferMemory(return_messages=True)

conversation = ConversationChain(llm=llm, memory=memory, verbose=False)

response = conversation.predict(input="Hello, I'm working on a Python project.")
print(response)

response = conversation.predict(input="The project uses LangChain and OpenAI.")
print(response)

# Access the stored history
print(memory.chat_memory.messages)

Memory Type 2: Buffer Window Memory

Window memory solves the token accumulation problem by only keeping the last N message pairs.

from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.output_parsers import StrOutputParser
from langchain_core.messages import BaseMessage
from typing import List

class WindowedChatHistory:
    """Keeps only the last k message pairs (2k messages total)."""
    
    def __init__(self, k: int = 5):
        self.k = k
        self._history = ChatMessageHistory()
    
    @property
    def messages(self) -> List[BaseMessage]:
        all_messages = self._history.messages
        # Keep only last 2k messages (k pairs of human+AI)
        if len(all_messages) > self.k * 2:
            return all_messages[-(self.k * 2):]
        return all_messages
    
    def add_messages(self, messages: List[BaseMessage]) -> None:
        for msg in messages:
            self._history.add_message(msg)
    
    def clear(self) -> None:
        self._history.clear()

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.5)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. You remember the last few messages."),
    MessagesPlaceholder(variable_name="history"),
    ("human", "{input}")
])

chain = prompt | llm | StrOutputParser()

# Window stores — k=3 means we remember last 3 exchanges
windowed_stores: dict = {}

def get_windowed_history(session_id: str) -> BaseChatMessageHistory:
    if session_id not in windowed_stores:
        windowed_stores[session_id] = WindowedChatHistory(k=3)
    return windowed_stores[session_id]

chain_with_window = RunnableWithMessageHistory(
    chain,
    get_windowed_history,
    input_messages_key="input",
    history_messages_key="history"
)

config = {"configurable": {"session_id": "windowed_user"}}

# Messages beyond k=3 pairs will be forgotten
for i in range(8):
    response = chain_with_window.invoke(
        {"input": f"This is message number {i+1}."},
        config=config
    )
    print(f"Turn {i+1}: {response[:60]}")

# Ask about early messages — they'll be forgotten
response = chain_with_window.invoke(
    {"input": "What was message number 1?"},
    config=config
)
print(f"Memory test: {response}")

Production note: Token-based windowing (trimming based on token count rather than message count) is more precise than message-count windowing. LangChain's trim_messages utility handles this:

from langchain_core.messages import trim_messages

# Trim messages to fit within a token budget
trimmed = trim_messages(
    messages=history.messages,
    max_tokens=2000,
    token_counter=llm,  # Use the LLM to count tokens accurately
    strategy="last",    # Keep the most recent messages
    include_system=True,
    allow_partial=False
)

Memory Type 3: ConversationSummaryMemory

Summary memory uses an LLM to periodically compress old conversation history into a summary. The summary replaces the raw messages, keeping token usage bounded while preserving semantic content.

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.output_parsers import StrOutputParser
from langchain_core.messages import HumanMessage, AIMessage, SystemMessage
from langchain_community.chat_message_histories import ChatMessageHistory

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
summarizer_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

summarize_prompt = ChatPromptTemplate.from_template(
    """Summarize the conversation below into 2-3 sentences that capture the key information.
    Focus on facts about the user, their goals, and any decisions made.
    
    Conversation:
    {conversation}
    
    Summary:"""
)

summarize_chain = summarize_prompt | summarizer_llm | StrOutputParser()

class SummaryMemoryStore:
    """Maintains a running summary of old messages plus recent raw messages."""
    
    def __init__(self, recent_messages: int = 6, summary_threshold: int = 10):
        self.recent_messages = recent_messages
        self.summary_threshold = summary_threshold
        self._messages = []
        self._summary = ""
    
    @property
    def messages(self):
        result = []
        if self._summary:
            result.append(SystemMessage(content=f"[Conversation summary so far: {self._summary}]"))
        result.extend(self._messages[-self.recent_messages:])
        return result
    
    def add_messages(self, messages):
        for msg in messages:
            self._messages.append(msg)
        
        # Summarize when we exceed the threshold
        if len(self._messages) > self.summary_threshold:
            self._compress()
    
    def _compress(self):
        # Get messages to summarize (all but the most recent)
        to_summarize = self._messages[:-self.recent_messages]
        if not to_summarize:
            return
        
        conversation_text = "\n".join(
            f"{msg.__class__.__name__}: {msg.content}"
            for msg in to_summarize
        )
        
        # Generate summary
        if self._summary:
            conversation_text = f"Previous summary: {self._summary}\n\nNew conversation:\n{conversation_text}"
        
        self._summary = summarize_chain.invoke({"conversation": conversation_text})
        # Keep only recent messages
        self._messages = self._messages[-self.recent_messages:]
        print(f"[Memory compressed. Summary: {self._summary[:100]}...]")
    
    def clear(self):
        self._messages = []
        self._summary = ""

# Integrate with LCEL chain
memory_store = SummaryMemoryStore(recent_messages=6, summary_threshold=10)

chat_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful personal assistant."),
    MessagesPlaceholder(variable_name="history"),
    ("human", "{input}")
])

chat_chain = chat_prompt | llm | StrOutputParser()

def chat_with_summary_memory(user_input: str) -> str:
    # Get current history
    history = memory_store.messages
    
    # Invoke the chain
    response = chat_chain.invoke({
        "history": history,
        "input": user_input
    })
    
    # Update memory with new exchange
    memory_store.add_messages([
        HumanMessage(content=user_input),
        AIMessage(content=response)
    ])
    
    return response

# Test across a long conversation
topics = [
    "I'm a software engineer working at a startup in Berlin.",
    "We're building a recommendation system using LangChain.",
    "The main challenge is scaling the vector search to millions of users.",
    "We're considering Pinecone vs Qdrant for the vector database.",
    "Our budget is about $500/month for infrastructure.",
    "I prefer Python over JavaScript for backend work.",
    "We're using FastAPI for the REST endpoints.",
    "The team has 4 engineers and one ML researcher.",
    "We ship new features every two weeks.",
    "Our biggest bottleneck is the embedding generation pipeline.",
    "We're thinking about using batch embeddings to reduce costs.",
]

for msg in topics:
    response = chat_with_summary_memory(msg)
    print(f"User: {msg[:50]}")
    print(f"Bot: {response[:80]}\n")

# Test memory recall after compression
recall = chat_with_summary_memory("What city am I working in and what database are we considering?")
print(f"\nMemory recall: {recall}")

Memory Type 4: Vector Store Memory

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.messages import HumanMessage, AIMessage
from langchain_core.documents import Document
import uuid

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.5)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Vector store for conversation memory
memory_store = Chroma(
    collection_name="conversation_memory",
    embedding_function=embeddings,
    persist_directory="./conversation_memory_db"
)

def store_exchange(human_msg: str, ai_msg: str, session_id: str):
    """Store a conversation exchange as a document."""
    exchange_text = f"User: {human_msg}\nAssistant: {ai_msg}"
    doc = Document(
        page_content=exchange_text,
        metadata={
            "session_id": session_id,
            "human_message": human_msg,
            "ai_message": ai_msg
        }
    )
    memory_store.add_documents([doc], ids=[str(uuid.uuid4())])

def retrieve_relevant_memory(query: str, session_id: str, k: int = 3) -> str:
    """Retrieve the k most relevant past exchanges."""
    results = memory_store.similarity_search(
        query,
        k=k,
        filter={"session_id": session_id}
    )
    if not results:
        return "No relevant past conversations found."
    
    return "\n\n".join([
        f"Past exchange:\n{doc.page_content}"
        for doc in results
    ])

prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a helpful assistant with access to relevant past conversation snippets.
    
Relevant past conversations:
{memory}

Use this context when relevant, but don't reference it directly unless needed."""),
    ("human", "{input}")
])

chain = prompt | llm | StrOutputParser()

def chat_with_vector_memory(user_input: str, session_id: str) -> str:
    # Retrieve relevant memories
    relevant_memory = retrieve_relevant_memory(user_input, session_id)
    
    # Generate response with memory context
    response = chain.invoke({
        "memory": relevant_memory,
        "input": user_input
    })
    
    # Store this exchange for future retrieval
    store_exchange(user_input, response, session_id)
    
    return response

# Test with a session
session = "test_session_001"
exchanges = [
    "I'm building a customer support chatbot for an e-commerce company.",
    "The main products we sell are electronics and home appliances.",
    "We get about 500 support tickets per day.",
    "The biggest issue customers have is about delivery tracking.",
    "We use Shopify for our store and have a custom API.",
]

for msg in exchanges:
    resp = chat_with_vector_memory(msg, session)
    print(f"User: {msg}")
    print(f"Bot: {resp[:100]}\n")

# Ask something that requires recalling specific earlier info
recall_test = chat_with_vector_memory(
    "What platform do we use and what's our main customer issue?",
    session
)
print(f"\nRecall test: {recall_test}")

Memory Type 5: Entity Memory

Entity memory tracks specific facts about named entities (people, companies, places) across a conversation. It's the right choice when your chatbot needs to remember structured facts.

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
import json

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Entity extractor
entity_extractor_prompt = ChatPromptTemplate.from_template(
    """Extract any new facts about named entities from this conversation turn.
    Return a JSON object with entity names as keys and their properties as values.
    Only include NEW or UPDATED information not already in the existing entities.
    
    Existing entities: {entities}
    
    New conversation turn:
    User: {human_message}
    Assistant: {ai_message}
    
    New entity facts (JSON only, empty object if none):"""
)

entity_extractor = entity_extractor_prompt | llm | StrOutputParser()

class EntityMemory:
    def __init__(self):
        self.entities = {}
    
    def update(self, human_msg: str, ai_msg: str):
        """Extract and store entity facts from a conversation turn."""
        result = entity_extractor.invoke({
            "entities": json.dumps(self.entities, indent=2),
            "human_message": human_msg,
            "ai_message": ai_msg
        })
        
        try:
            # Handle markdown code blocks
            if "```json" in result:
                result = result.split("```json")[1].split("```")[0].strip()
            elif "```" in result:
                result = result.split("```")[1].split("```")[0].strip()
            
            new_facts = json.loads(result)
            for entity, facts in new_facts.items():
                if entity not in self.entities:
                    self.entities[entity] = {}
                if isinstance(facts, dict):
                    self.entities[entity].update(facts)
                else:
                    self.entities[entity]["info"] = facts
        except (json.JSONDecodeError, IndexError):
            pass  # Skip malformed responses
    
    def get_context(self, query: str = "") -> str:
        if not self.entities:
            return "No entity information stored yet."
        return json.dumps(self.entities, indent=2)

entity_memory = EntityMemory()

chat_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a helpful assistant.
    
Known entities and facts:
{entity_context}

Use this information naturally in conversation."""),
    ("human", "{input}")
])

chat_chain = chat_prompt | llm | StrOutputParser()

def chat_with_entity_memory(user_input: str) -> str:
    # Get current entity context
    entity_context = entity_memory.get_context(user_input)
    
    # Generate response
    response = chat_chain.invoke({
        "entity_context": entity_context,
        "input": user_input
    })
    
    # Update entity memory with this exchange
    entity_memory.update(user_input, response)
    
    return response

# Test
print(chat_with_entity_memory("My colleague Alice is a senior Python developer at TechCorp."))
print(chat_with_entity_memory("Alice is working on migrating their codebase to FastAPI."))
print(chat_with_entity_memory("TechCorp is based in San Francisco and has about 200 employees."))

# Test recall of specific entity facts
print("\n--- Entity recall test ---")
print(chat_with_entity_memory("What can you tell me about Alice and TechCorp?"))
print(json.dumps(entity_memory.entities, indent=2))

Memory Comparison Table

Memory Type	Token Usage	Recall Quality	Update Speed	Best For
Buffer (full)	Grows indefinitely	Perfect — verbatim	Instant	Short conversations, precise recall
Buffer Window	Bounded (last N turns)	Good for recent, none for old	Instant	General chatbots with moderate conversation length
Summary	Bounded (summary size)	Good semantic, loses detail	LLM call required	Long conversations, cost-sensitive apps
Vector Store	Bounded	Excellent — semantic retrieval	Embedding call	Long-running assistants, personalization
Entity	Bounded (entity store size)	Perfect for tracked entities	LLM extraction call	CRM bots, personal assistants, fact tracking

Persistent Memory Across Sessions

All the examples above lose memory on server restart. For production chatbots, you need persistence.

Redis-Backed Memory

from langchain_community.chat_message_histories import RedisChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.output_parsers import StrOutputParser

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.7)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    MessagesPlaceholder(variable_name="history"),
    ("human", "{input}")
])

chain = prompt | llm | StrOutputParser()

def get_redis_history(session_id: str) -> RedisChatMessageHistory:
    return RedisChatMessageHistory(
        session_id=session_id,
        url="redis://localhost:6379",
        ttl=86400  # Expire after 24 hours
    )

chain_with_redis = RunnableWithMessageHistory(
    chain,
    get_redis_history,
    input_messages_key="input",
    history_messages_key="history"
)

# This persists across server restarts
response = chain_with_redis.invoke(
    {"input": "Remember this: my project deadline is June 15th."},
    config={"configurable": {"session_id": "persistent_user_001"}}
)

SQL-Backed Memory

from langchain_community.chat_message_histories import SQLChatMessageHistory

def get_sql_history(session_id: str) -> SQLChatMessageHistory:
    return SQLChatMessageHistory(
        session_id=session_id,
        connection="sqlite:///./chat_history.db"
        # Or PostgreSQL: "postgresql://user:pass@localhost/dbname"
    )

For detailed coverage of persistent memory and multi-session management in production chatbots, the LangChain multi-turn conversations guide goes deep on the patterns.

Choosing the Right Memory Type

Here's my decision tree for picking a memory approach:

Is the conversation short (under 20 turns)? → Buffer memory is fine.

Is the conversation long but you mainly need recent context? → Buffer Window with token trimming.

Is cost a concern and the conversation can exceed 30+ turns? → Summary memory.

Do you need to recall specific facts mentioned anywhere in a long history? → Vector store memory.

Do you need to track structured facts about entities (people, places, organizations)? → Entity memory.

Do you need cross-session persistence? → Redis or SQL-backed history with any of the above.

Conclusion

When you're ready to add tools and true autonomy, see LangChain agent types — agents use memory differently than chains, and the patterns there build directly on what we covered here.

Frequently Asked Questions

Does LangChain memory persist across server restarts?

How do I share memory across multiple users in a multi-user chatbot?

What is the token limit for ConversationBufferMemory?

Share this article:Facebook Twitter/X LinkedIn Telegram WhatsApp

💬 DiscussionPowered by GitHub Discussions

Frequently Asked Questions

AiTechWorlds Team

✓ Verified Writer

The AiTechWorlds team is passionate about AI, technology, and education. We create high-quality, research-backed content to help you learn, grow, and succeed in the modern digital world.

📱 Follow on Telegram 🐦 Follow on X Learn More →

Agent Development

10 LangChain Retrieval Strategies for Better RAG Results

Go beyond basic similarity search with ParentDocumentRetriever, MultiQueryRetriever, EnsembleRetriever, HyDE, and 6 more LangChain retrieval strategies — with code for each.

May 31, 2026 13 min read

Agent Development

Build a LangChain Agent with Memory and Tools (Full Example)

Build a complete LangChain conversational agent with persistent memory, multiple tools, and step-by-step trace — from setup to a production-ready implementation with code.

May 31, 2026 14 min read

Agent Development

5 LangChain Agent Types Explained (ZeroShot, ReAct, and More)

Understand every major LangChain agent type — ZeroShotAgent, ReAct, ConversationalAgent, and more — with Python code and agent trace walkthroughs.

May 31, 2026 13 min read

Agent Development

How to Deploy a LangChain App as a FastAPI REST Endpoint

Serve a LangChain app as a production FastAPI REST endpoint with streaming, async chains, error handling, and Docker deployment — full Python code included.

May 31, 2026 11 min read

Go deeper on this topic

InterviewPython BookAI Agent Development Guide BookBuilding AI Apps: Developer's Guide CourseAI Agent Development Course QuizPython Basics QuizPython OOP Concepts

10K+ Members Growing Daily

Get Free AI Notes Daily

Join AiTechWorlds on Telegram and get daily AI tips, prompt engineering templates, coding resources, and exclusive content — 100% free!

📚 Free Study Notes🤖 AI Tips Daily⚡ Prompt Templates💻 Coding Resources

Join Free Channel

No spam. Leave anytime.

How to Implement Memory in LangChain (Buffer, Summary, Vector)

Understanding How LangChain Memory Works

Memory Type 1: ConversationBufferMemory

Modern LCEL Approach

Legacy Buffer Memory (you'll encounter this in existing codebases)

Memory Type 2: Buffer Window Memory

Memory Type 3: ConversationSummaryMemory

Memory Type 4: Vector Store Memory

Memory Type 5: Entity Memory

Memory Comparison Table

Persistent Memory Across Sessions

Redis-Backed Memory

SQL-Backed Memory

Choosing the Right Memory Type

Conclusion

Frequently Asked Questions

💬 DiscussionPowered by GitHub Discussions

Frequently Asked Questions

AiTechWorlds Team

Related Articles

10 LangChain Retrieval Strategies for Better RAG Results

Build a LangChain Agent with Memory and Tools (Full Example)

5 LangChain Agent Types Explained (ZeroShot, ReAct, and More)

How to Deploy a LangChain App as a FastAPI REST Endpoint

Go deeper on this topic

Get Free AI Notes Daily

How to Implement Memory in LangChain (Buffer, Summary, Vector)

Understanding How LangChain Memory Works

Memory Type 1: ConversationBufferMemory

Modern LCEL Approach

Legacy Buffer Memory (you'll encounter this in existing codebases)

Memory Type 2: Buffer Window Memory

Memory Type 3: ConversationSummaryMemory

Memory Type 4: Vector Store Memory

Memory Type 5: Entity Memory

Memory Comparison Table

Persistent Memory Across Sessions

Redis-Backed Memory

SQL-Backed Memory

Choosing the Right Memory Type

Conclusion

Frequently Asked Questions

💬 DiscussionPowered by GitHub Discussions

Frequently Asked Questions

AiTechWorlds Team

Related Articles

10 LangChain Retrieval Strategies for Better RAG Results

Build a LangChain Agent with Memory and Tools (Full Example)

5 LangChain Agent Types Explained (ZeroShot, ReAct, and More)

How to Deploy a LangChain App as a FastAPI REST Endpoint

Go deeper on this topic

Get Free AI Notes Daily