How to Implement Memory in LangChain (Buffer, Summary, Vector)
Learn every major LangChain memory type (buffer, buffer window, summary, vector store, and entity) with working code and a comparison table.
Every chatbot needs memory. Without it, the model has no context — ask "what did I just say?" and it draws a blank. Build memory wrong, and you'll either blow through your token budget on long conversations or confuse users by forgetting things they said five messages ago.
I've built a lot of chatbots. The memory decision — which type to use, how to persist it, when to summarize — is one of the decisions that actually moves the needle on user experience. This guide covers every major LangChain memory type with working code, a comparison table, and honest notes on when each one is the right tool.
If you're building a chatbot from scratch, the build AI chatbot Python guide gives you the full application context. For the agent side of memory — how agents remember across tool calls — see AI agent memory and planning.
Understanding How LangChain Memory Works
Before the code, let me clarify the modern LangChain memory architecture. The framework went through a significant refactor: the older ConversationBufferMemory class is now considered legacy. Modern LangChain uses:
ChatMessageHistory: stores the raw messages (the data layer)
RunnableWithMessageHistory: injects history into any LCEL chain (the integration layer)
MessagesPlaceholder: the slot in your prompt where history gets inserted
This separation is cleaner than the old approach. The history store is swappable (in-memory, Redis, Postgres), and the chain code stays the same regardless.
That said, the legacy classes still work and you'll encounter them constantly in existing code. This guide covers both.
Memory Type 1: ConversationBufferMemory
Buffer memory keeps every message in the conversation history. Simplest, most faithful, most expensive at scale.
Modern LCEL Approach
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.output_parsers import StrOutputParser
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.7)

# Prompt with a placeholder for conversation history
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. Be concise and friendly."),
    MessagesPlaceholder(variable_name="history"),
    ("human", "{input}")
])

chain = prompt | llm | StrOutputParser()

# In-memory store (use Redis/DB for production)
session_histories = {}

def get_session_history(session_id: str) -> ChatMessageHistory:
    if session_id not in session_histories:
        session_histories[session_id] = ChatMessageHistory()
    return session_histories[session_id]

# Wrap the chain with history management
chain_with_memory = RunnableWithMessageHistory(
    chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="history"
)

# Have a multi-turn conversation
config = {"configurable": {"session_id": "user_abc"}}

response1 = chain_with_memory.invoke({"input": "My name is Priya."}, config=config)
print(f"Assistant: {response1}")

response2 = chain_with_memory.invoke({"input": "I'm learning Python."}, config=config)
print(f"Assistant: {response2}")

response3 = chain_with_memory.invoke({"input": "What's my name and what am I learning?"}, config=config)
print(f"Assistant: {response3}")  # Should correctly reference Priya and Python

# Inspect the stored history
history = get_session_history("user_abc")
print(f"\nMessages in history: {len(history.messages)}")
for msg in history.messages:
    print(f"[{msg.__class__.__name__}]: {msg.content[:60]}")
Legacy Buffer Memory (you'll encounter this in existing codebases)
from langchain.memory import ConversationBufferMemory
from langchain_openai import ChatOpenAI
from langchain.chains import ConversationChain
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.7)
memory = ConversationBufferMemory(return_messages=True)
conversation = ConversationChain(llm=llm, memory=memory, verbose=False)
response = conversation.predict(input="Hello, I'm working on a Python project.")
print(response)
response = conversation.predict(input="The project uses LangChain and OpenAI.")
print(response)
# Access the stored history
print(memory.chat_memory.messages)
When buffer memory breaks down: Each message takes tokens. A 50-turn conversation might accumulate 5,000+ tokens of history before you've even written your actual query. For user-facing applications, you need a cap.
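The arithmetic behind that estimate can be sketched as follows. The figure of 50 tokens per message is an assumption chosen to match the 5,000-token example; real messages vary widely.

```python
def history_tokens_before_turn(turn: int, tokens_per_message: int = 50) -> int:
    """Tokens of stored history the model must re-read before answering `turn` (1-indexed)."""
    messages_so_far = 2 * (turn - 1)  # each finished turn stored one human and one AI message
    return messages_so_far * tokens_per_message

def cumulative_history_tokens(turns: int, tokens_per_message: int = 50) -> int:
    """Total history tokens sent across all turns, since every call resends the buffer."""
    return sum(history_tokens_before_turn(t, tokens_per_message) for t in range(1, turns + 1))

after_50_turns = history_tokens_before_turn(51)  # history carried into turn 51
total_over_50 = cumulative_history_tokens(50)    # history tokens billed over turns 1..50
print(after_50_turns, total_over_50)
```

Under these assumptions the per-call history grows linearly, but the total billed over the conversation grows quadratically, because each call resends everything before it.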
Memory Type 2: Buffer Window Memory
Window memory solves the token accumulation problem by only keeping the last N message pairs.
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.output_parsers import StrOutputParser
from langchain_core.messages import BaseMessage
from typing import List

class WindowedChatHistory(BaseChatMessageHistory):
    """Exposes only the last k message pairs (2k messages total).

    Subclasses BaseChatMessageHistory so it matches the type that
    get_windowed_history() declares. The underlying store still keeps
    every message; only the view handed to the prompt is windowed.
    """

    def __init__(self, k: int = 5):
        self.k = k
        self._history = ChatMessageHistory()

    @property
    def messages(self) -> List[BaseMessage]:
        all_messages = self._history.messages
        # Keep only last 2k messages (k pairs of human+AI)
        if len(all_messages) > self.k * 2:
            return all_messages[-(self.k * 2):]
        return all_messages

    def add_messages(self, messages: List[BaseMessage]) -> None:
        for msg in messages:
            self._history.add_message(msg)

    def clear(self) -> None:
        self._history.clear()

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.5)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. You remember the last few messages."),
    MessagesPlaceholder(variable_name="history"),
    ("human", "{input}")
])

chain = prompt | llm | StrOutputParser()

# Window stores: k=3 means we remember the last 3 exchanges
windowed_stores: dict = {}

def get_windowed_history(session_id: str) -> BaseChatMessageHistory:
    if session_id not in windowed_stores:
        windowed_stores[session_id] = WindowedChatHistory(k=3)
    return windowed_stores[session_id]

chain_with_window = RunnableWithMessageHistory(
    chain,
    get_windowed_history,
    input_messages_key="input",
    history_messages_key="history"
)

config = {"configurable": {"session_id": "windowed_user"}}

# Messages beyond k=3 pairs are no longer shown to the model
for i in range(8):
    response = chain_with_window.invoke(
        {"input": f"This is message number {i+1}."},
        config=config
    )
    print(f"Turn {i+1}: {response[:60]}")

# Ask about early messages; they have fallen out of the window
response = chain_with_window.invoke(
    {"input": "What was message number 1?"},
    config=config
)
print(f"Memory test: {response}")
Production note: Token-based windowing (trimming based on token count rather than message count) is more precise than message-count windowing. LangChain's trim_messages utility handles this:
from langchain_core.messages import trim_messages

# Trim messages to fit within a token budget
trimmed = trim_messages(
    messages=history.messages,
    max_tokens=2000,
    token_counter=llm,   # Use the LLM to count tokens accurately
    strategy="last",     # Keep the most recent messages
    include_system=True,
    allow_partial=False
)
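To show what keeping the most recent messages within a token budget means, here is a simplified framework-free sketch. It is not LangChain's implementation: `approx_tokens` is a crude assumption of about 4 characters per token, and `trim_to_budget` is a hypothetical helper.

```python
from typing import List, Tuple

Message = Tuple[str, str]  # (role, content)

def approx_tokens(text: str) -> int:
    """Crude estimate of ~4 characters per token (an assumption, not a real tokenizer)."""
    return max(1, len(text) // 4)

def trim_to_budget(messages: List[Message], max_tokens: int) -> List[Message]:
    """Keep a leading system message plus the newest messages that fit the budget."""
    system = [messages[0]] if messages and messages[0][0] == "system" else []
    rest = messages[len(system):]
    budget = max_tokens - sum(approx_tokens(content) for _, content in system)
    kept: List[Message] = []
    for role, content in reversed(rest):  # walk from newest to oldest
        cost = approx_tokens(content)
        if cost > budget:
            break  # whole messages only, no partial truncation
        kept.append((role, content))
        budget -= cost
    return system + kept[::-1]  # restore chronological order

conversation = [
    ("system", "s" * 40),   # ~10 tokens
    ("human", "a" * 400),   # ~100 tokens
    ("ai", "b" * 400),      # ~100 tokens
    ("human", "c" * 400),   # ~100 tokens
]
trimmed_example = trim_to_budget(conversation, max_tokens=250)
print([role for role, _ in trimmed_example])
```

Note that a trimmed window can start on an AI message, as it does here. If your model expects history to begin with a human turn, handle that explicitly.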
Memory Type 3: ConversationSummaryMemory
Summary memory uses an LLM to periodically compress old conversation history into a summary. The summary replaces the raw messages, keeping token usage bounded while preserving semantic content.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.output_parsers import StrOutputParser
from langchain_core.messages import HumanMessage, AIMessage, SystemMessage
from langchain_community.chat_message_histories import ChatMessageHistory

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
summarizer_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

summarize_prompt = ChatPromptTemplate.from_template(
    """Summarize the conversation below into 2-3 sentences that capture the key information.
Focus on facts about the user, their goals, and any decisions made.

Conversation:
{conversation}

Summary:"""
)

summarize_chain = summarize_prompt | summarizer_llm | StrOutputParser()

class SummaryMemoryStore:
    """Maintains a running summary of old messages plus recent raw messages."""

    def __init__(self, recent_messages: int = 6, summary_threshold: int = 10):
        self.recent_messages = recent_messages
        self.summary_threshold = summary_threshold
        self._messages = []
        self._summary = ""

    @property
    def messages(self):
        result = []
        if self._summary:
            result.append(SystemMessage(content=f"[Conversation summary so far: {self._summary}]"))
        result.extend(self._messages[-self.recent_messages:])
        return result

    def add_messages(self, messages):
        for msg in messages:
            self._messages.append(msg)
            # Summarize when we exceed the threshold
            if len(self._messages) > self.summary_threshold:
                self._compress()

    def _compress(self):
        # Get messages to summarize (all but the most recent)
        to_summarize = self._messages[:-self.recent_messages]
        if not to_summarize:
            return

        conversation_text = "\n".join(
            f"{msg.__class__.__name__}: {msg.content}"
            for msg in to_summarize
        )

        # Generate summary, folding in the previous summary if there is one
        if self._summary:
            conversation_text = f"Previous summary: {self._summary}\n\nNew conversation:\n{conversation_text}"
        self._summary = summarize_chain.invoke({"conversation": conversation_text})

        # Keep only recent messages
        self._messages = self._messages[-self.recent_messages:]
        print(f"[Memory compressed. Summary: {self._summary[:100]}...]")

    def clear(self):
        self._messages = []
        self._summary = ""

# Integrate with LCEL chain
memory_store = SummaryMemoryStore(recent_messages=6, summary_threshold=10)

chat_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful personal assistant."),
    MessagesPlaceholder(variable_name="history"),
    ("human", "{input}")
])

chat_chain = chat_prompt | llm | StrOutputParser()

def chat_with_summary_memory(user_input: str) -> str:
    # Get current history
    history = memory_store.messages

    # Invoke the chain
    response = chat_chain.invoke({
        "history": history,
        "input": user_input
    })

    # Update memory with new exchange
    memory_store.add_messages([
        HumanMessage(content=user_input),
        AIMessage(content=response)
    ])
    return response

# Test across a long conversation
topics = [
    "I'm a software engineer working at a startup in Berlin.",
    "We're building a recommendation system using LangChain.",
    "The main challenge is scaling the vector search to millions of users.",
    "We're considering Pinecone vs Qdrant for the vector database.",
    "Our budget is about $500/month for infrastructure.",
    "I prefer Python over JavaScript for backend work.",
    "We're using FastAPI for the REST endpoints.",
    "The team has 4 engineers and one ML researcher.",
    "We ship new features every two weeks.",
    "Our biggest bottleneck is the embedding generation pipeline.",
    "We're thinking about using batch embeddings to reduce costs.",
]

for msg in topics:
    response = chat_with_summary_memory(msg)
    print(f"User: {msg[:50]}")
    print(f"Bot: {response[:80]}\n")

# Test memory recall after compression
recall = chat_with_summary_memory("What city am I working in and what database are we considering?")
print(f"\nMemory recall: {recall}")
The trade-off: Summary memory reduces token usage dramatically but loses verbatim detail. If a user mentioned "my API key is X123" early in the conversation, a summary might not retain that. For use cases where exact details matter, buffer or vector memory is safer.
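To see how often the summarizer actually runs, here is a framework-free sketch of the same threshold logic with the LLM call replaced by a stub. `RollingSummary` and `stub_summarize` are hypothetical names, and the defaults mirror the `recent_messages=6` and `summary_threshold=10` used above.

```python
from typing import Callable, List

class RollingSummary:
    """Models the threshold logic only: a summary plus recent raw messages."""

    def __init__(self, summarize: Callable[[str, List[str]], str],
                 recent_messages: int = 6, summary_threshold: int = 10):
        self.summarize = summarize
        self.recent_messages = recent_messages
        self.summary_threshold = summary_threshold
        self.messages: List[str] = []
        self.summary = ""
        self.compressions = 0

    def add(self, message: str) -> None:
        self.messages.append(message)
        if len(self.messages) > self.summary_threshold:
            old = self.messages[:-self.recent_messages]
            self.summary = self.summarize(self.summary, old)
            self.messages = self.messages[-self.recent_messages:]
            self.compressions += 1

def stub_summarize(previous: str, old_messages: List[str]) -> str:
    # Stand-in for the LLM call: records how many messages were folded in so far.
    folded = int(previous.split()[0]) if previous else 0
    return f"{folded + len(old_messages)} messages summarized"

memory = RollingSummary(stub_summarize)
for i in range(24):  # 12 exchanges = 24 messages
    memory.add(f"message {i + 1}")

print(memory.compressions, len(memory.messages), memory.summary)
```

With these settings the raw buffer never exceeds 10 messages, and after the first compression the summarizer runs once every 5 new messages. That recurring call is the update cost you pay for bounded token usage.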
Memory Type 4: Vector Store Memory
Vector memory stores conversation turns as embeddings and retrieves the most semantically relevant past exchanges for each new query. Instead of recalling the most recent messages, it recalls the most relevant ones.
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.messages import HumanMessage, AIMessage
from langchain_core.documents import Document
import uuid

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.5)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Vector store for conversation memory
memory_store = Chroma(
    collection_name="conversation_memory",
    embedding_function=embeddings,
    persist_directory="./conversation_memory_db"
)

def store_exchange(human_msg: str, ai_msg: str, session_id: str):
    """Store a conversation exchange as a document."""
    exchange_text = f"User: {human_msg}\nAssistant: {ai_msg}"
    doc = Document(
        page_content=exchange_text,
        metadata={
            "session_id": session_id,
            "human_message": human_msg,
            "ai_message": ai_msg
        }
    )
    memory_store.add_documents([doc], ids=[str(uuid.uuid4())])

def retrieve_relevant_memory(query: str, session_id: str, k: int = 3) -> str:
    """Retrieve the k most relevant past exchanges."""
    results = memory_store.similarity_search(
        query,
        k=k,
        filter={"session_id": session_id}
    )
    if not results:
        return "No relevant past conversations found."
    return "\n\n".join([
        f"Past exchange:\n{doc.page_content}"
        for doc in results
    ])

prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a helpful assistant with access to relevant past conversation snippets.

Relevant past conversations:
{memory}

Use this context when relevant, but don't reference it directly unless needed."""),
    ("human", "{input}")
])

chain = prompt | llm | StrOutputParser()

def chat_with_vector_memory(user_input: str, session_id: str) -> str:
    # Retrieve relevant memories
    relevant_memory = retrieve_relevant_memory(user_input, session_id)

    # Generate response with memory context
    response = chain.invoke({
        "memory": relevant_memory,
        "input": user_input
    })

    # Store this exchange for future retrieval
    store_exchange(user_input, response, session_id)
    return response

# Test with a session
session = "test_session_001"
exchanges = [
    "I'm building a customer support chatbot for an e-commerce company.",
    "The main products we sell are electronics and home appliances.",
    "We get about 500 support tickets per day.",
    "The biggest issue customers have is about delivery tracking.",
    "We use Shopify for our store and have a custom API.",
]

for msg in exchanges:
    resp = chat_with_vector_memory(msg, session)
    print(f"User: {msg}")
    print(f"Bot: {resp[:100]}\n")

# Ask something that requires recalling specific earlier info
recall_test = chat_with_vector_memory(
    "What platform do we use and what's our main customer issue?",
    session
)
print(f"\nRecall test: {recall_test}")
When vector memory shines: Long-running assistants, note-taking agents, or personalization use cases where the user has many previous sessions and you want to surface the most contextually relevant history, not just the most recent.
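The difference between "most relevant" and "most recent" can be illustrated without any external service. In this toy sketch, word counts stand in for learned embeddings (a deliberate simplification, since real embeddings also capture synonyms and paraphrase); `embed`, `cosine`, and `top_k` are hypothetical helpers.

```python
import math
import re
from collections import Counter
from typing import List

def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts. A real system would use an embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def top_k(query: str, memories: List[str], k: int = 2) -> List[str]:
    """Return the k stored memories most similar to the query, best first."""
    q = embed(query)
    return sorted(memories, key=lambda m: cosine(q, embed(m)), reverse=True)[:k]

# Oldest first: the relevant fact is the earliest one stored.
memories = [
    "We use Shopify for our store and have a custom API.",
    "We get about 500 support tickets per day.",
    "The biggest issue customers have is about delivery tracking.",
    "I prefer tea over coffee.",
]
best = top_k("Which store platform do we use?", memories, k=1)
print(best)
```

A window of the last two messages would have dropped the Shopify fact; similarity ranking surfaces it regardless of its position in the history.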
Memory Type 5: Entity Memory
Entity memory tracks specific facts about named entities (people, companies, places) across a conversation. It's the right choice when your chatbot needs to remember structured facts.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
import json

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Entity extractor
entity_extractor_prompt = ChatPromptTemplate.from_template(
    """Extract any new facts about named entities from this conversation turn.
Return a JSON object with entity names as keys and their properties as values.
Only include NEW or UPDATED information not already in the existing entities.

Existing entities: {entities}

New conversation turn:
User: {human_message}
Assistant: {ai_message}

New entity facts (JSON only, empty object if none):"""
)

entity_extractor = entity_extractor_prompt | llm | StrOutputParser()

class EntityMemory:
    def __init__(self):
        self.entities = {}

    def update(self, human_msg: str, ai_msg: str):
        """Extract and store entity facts from a conversation turn."""
        result = entity_extractor.invoke({
            "entities": json.dumps(self.entities, indent=2),
            "human_message": human_msg,
            "ai_message": ai_msg
        })
        try:
            # Handle markdown code blocks
            if "```json" in result:
                result = result.split("```json")[1].split("```")[0].strip()
            elif "```" in result:
                result = result.split("```")[1].split("```")[0].strip()
            new_facts = json.loads(result)
            # Valid JSON that is not an object (a list, a string) has nothing to merge
            if not isinstance(new_facts, dict):
                return
            for entity, facts in new_facts.items():
                if entity not in self.entities:
                    self.entities[entity] = {}
                if isinstance(facts, dict):
                    self.entities[entity].update(facts)
                else:
                    self.entities[entity]["info"] = facts
        except (json.JSONDecodeError, IndexError):
            pass  # Skip malformed responses

    def get_context(self, query: str = "") -> str:
        if not self.entities:
            return "No entity information stored yet."
        return json.dumps(self.entities, indent=2)

entity_memory = EntityMemory()

chat_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a helpful assistant.

Known entities and facts:
{entity_context}

Use this information naturally in conversation."""),
    ("human", "{input}")
])

chat_chain = chat_prompt | llm | StrOutputParser()

def chat_with_entity_memory(user_input: str) -> str:
    # Get current entity context
    entity_context = entity_memory.get_context(user_input)

    # Generate response
    response = chat_chain.invoke({
        "entity_context": entity_context,
        "input": user_input
    })

    # Update entity memory with this exchange
    entity_memory.update(user_input, response)
    return response

# Test
print(chat_with_entity_memory("My colleague Alice is a senior Python developer at TechCorp."))
print(chat_with_entity_memory("Alice is working on migrating their codebase to FastAPI."))
print(chat_with_entity_memory("TechCorp is based in San Francisco and has about 200 employees."))

# Test recall of specific entity facts
print("\n--- Entity recall test ---")
print(chat_with_entity_memory("What can you tell me about Alice and TechCorp?"))
print(json.dumps(entity_memory.entities, indent=2))
Entity memory is particularly valuable for CRM-adjacent chatbots, personal assistants, or any application where remembering structured facts about specific people/organizations improves the experience.
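The merge step is worth testing on its own, without an LLM in the loop. The sketch below feeds hand-written JSON through a hypothetical `merge_facts` helper; the field names (`role`, `employer`, `project`, `city`) are illustrative assumptions about what an extractor might return.

```python
import json
from typing import Any, Dict

EntityStore = Dict[str, Dict[str, Any]]

def merge_facts(entities: EntityStore, new_facts_json: str) -> EntityStore:
    """Merge extracted facts into the store; newer values overwrite older ones."""
    try:
        new_facts = json.loads(new_facts_json)
    except json.JSONDecodeError:
        return entities  # ignore malformed extractor output
    if not isinstance(new_facts, dict):
        return entities  # valid JSON, but not an object of entities
    for name, facts in new_facts.items():
        record = entities.setdefault(name, {})
        if isinstance(facts, dict):
            record.update(facts)
        else:
            record["info"] = facts
    return entities

store: EntityStore = {}
merge_facts(store, '{"Alice": {"role": "senior Python developer", "employer": "TechCorp"}}')
merge_facts(store, '{"Alice": {"project": "FastAPI migration"}, "TechCorp": {"city": "San Francisco"}}')
merge_facts(store, "not json")    # skipped
merge_facts(store, '["Alice"]')   # skipped: a list, not an object
print(json.dumps(store, indent=2))
```

Because updates overwrite by key, a corrected fact replaces the old value instead of accumulating alongside it, which keeps the entity context small.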
Memory Comparison Table
| Memory Type | Token Usage | Recall Quality | Update Speed | Best For |
|---|---|---|---|---|
| Buffer (full) | Grows indefinitely | Perfect — verbatim | Instant | Short conversations, precise recall |
| Buffer Window | Bounded (last N turns) | Good for recent, none for old | Instant | General chatbots with moderate conversation length |
| Summary | Bounded (summary size) | Good semantic, loses detail | LLM call required | Long conversations, cost-sensitive apps |
| Vector Store | Bounded (top-k retrieved exchanges) | Strong for semantically related turns, regardless of age | Embedding call | Long-running assistants, personalization |
| Entity | Bounded (entity store size) | Perfect for tracked entities | LLM extraction call | CRM bots, personal assistants, fact tracking |
Persistent Memory Across Sessions
All the examples above lose memory on server restart. For production chatbots, you need persistence.
Redis-Backed Memory
from langchain_community.chat_message_histories import RedisChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.output_parsers import StrOutputParser

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.7)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    MessagesPlaceholder(variable_name="history"),
    ("human", "{input}")
])

chain = prompt | llm | StrOutputParser()

def get_redis_history(session_id: str) -> RedisChatMessageHistory:
    return RedisChatMessageHistory(
        session_id=session_id,
        url="redis://localhost:6379",
        ttl=86400  # Expire after 24 hours
    )

chain_with_redis = RunnableWithMessageHistory(
    chain,
    get_redis_history,
    input_messages_key="input",
    history_messages_key="history"
)

# This persists across server restarts
response = chain_with_redis.invoke(
    {"input": "Remember this: my project deadline is June 15th."},
    config={"configurable": {"session_id": "persistent_user_001"}}
)
SQL-Backed Memory
from langchain_community.chat_message_histories import SQLChatMessageHistory

def get_sql_history(session_id: str) -> SQLChatMessageHistory:
    return SQLChatMessageHistory(
        session_id=session_id,
        connection="sqlite:///./chat_history.db"
        # Or PostgreSQL: "postgresql://user:pass@localhost/dbname"
    )
For detailed coverage of persistent memory and multi-session management in production chatbots, the LangChain multi-turn conversations guide goes deep on the patterns.
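To show what "persists across restarts" means mechanically, here is a standard-library sketch of a session-scoped history backed by SQLite. The table schema and the `SqliteHistory` class are assumptions made for illustration; they are not how LangChain's SQL-backed history stores data internally.

```python
import os
import sqlite3
import tempfile
from typing import List, Tuple

class SqliteHistory:
    """Hypothetical session-scoped message log backed by a SQLite file."""

    def __init__(self, path: str, session_id: str):
        self.session_id = session_id
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS messages ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "session_id TEXT, role TEXT, content TEXT)"
        )

    def add(self, role: str, content: str) -> None:
        with self.conn:  # commits on success
            self.conn.execute(
                "INSERT INTO messages (session_id, role, content) VALUES (?, ?, ?)",
                (self.session_id, role, content),
            )

    def messages(self) -> List[Tuple[str, str]]:
        rows = self.conn.execute(
            "SELECT role, content FROM messages WHERE session_id = ? ORDER BY id",
            (self.session_id,),
        )
        return rows.fetchall()

    def close(self) -> None:
        self.conn.close()

db_path = os.path.join(tempfile.mkdtemp(), "chat_history.db")

before_restart = SqliteHistory(db_path, "persistent_user_001")
before_restart.add("human", "Remember this: my project deadline is June 15th.")
before_restart.close()  # simulate the server going down

after_restart = SqliteHistory(db_path, "persistent_user_001")
restored = after_restart.messages()
someone_else = SqliteHistory(db_path, "another_user").messages()
print(restored, someone_else)
```

The message survives the reopened connection, and a different session ID sees nothing. Those are the two properties you need from any persistent backend.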
Choosing the Right Memory Type
Here's my decision tree for picking a memory approach:
Is the conversation short (under 20 turns)? → Buffer memory is fine.
Is the conversation long but you mainly need recent context? → Buffer Window with token trimming.
Is cost a concern and the conversation can exceed 30+ turns? → Summary memory.
Do you need to recall specific facts mentioned anywhere in a long history? → Vector store memory.
Do you need to track structured facts about entities (people, places, organizations)? → Entity memory.
Do you need cross-session persistence? → Redis or SQL-backed history with any of the above.
Many production applications combine types: buffer window for recent context, plus entity memory for important facts, plus periodic summarization. The AI agent memory and planning guide covers these hybrid approaches.
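The decision tree above can be written down as a small function, which is handy for documenting the choice in a codebase. `choose_memory` and its parameters are hypothetical, and the priority order (entities first, then old-fact recall, then length and cost) is an assumption of this sketch where the questions overlap.

```python
from typing import Tuple

def choose_memory(turns: int, *, needs_old_facts: bool = False,
                  tracks_entities: bool = False, cost_sensitive: bool = False,
                  cross_session: bool = False) -> Tuple[str, str]:
    """Return (memory_type, storage) for a rough description of the workload."""
    if tracks_entities:
        memory = "entity"
    elif needs_old_facts:
        memory = "vector store"
    elif turns < 20:
        memory = "buffer"
    elif cost_sensitive and turns > 30:
        memory = "summary"
    else:
        memory = "buffer window"
    # Persistence is orthogonal: any memory type can sit on a durable store.
    storage = "Redis or SQL" if cross_session else "in-memory"
    return memory, storage

print(choose_memory(10))
print(choose_memory(25))
print(choose_memory(50, cost_sensitive=True))
print(choose_memory(50, needs_old_facts=True, cross_session=True))
```

A real application that combines types would return several of these at once; the function only captures the single-type starting point.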
Conclusion
Memory is what transforms a stateless LLM call into a conversation. LangChain gives you five meaningful memory types, each optimized for a different trade-off between token cost, recall quality, and update overhead. Buffer memory is simplest and most accurate but can blow your token budget. Summary memory is cost-efficient but lossy. Vector memory gives you semantic recall across long histories. Entity memory tracks specific facts precisely.
For new chatbot projects: start with RunnableWithMessageHistory plus in-memory ChatMessageHistory. Measure your token usage after a week of real usage. Then decide whether you need windowing, summarization, or a more specialized memory type.
When you're ready to add tools and true autonomy, see LangChain agent types — agents use memory differently than chains, and the patterns there build directly on what we covered here.
Frequently Asked Questions
Does LangChain memory persist across server restarts?
In-memory stores (ChatMessageHistory) are wiped on restart. For persistence, use Redis-backed, database-backed, or file-backed history stores. LangChain provides RedisChatMessageHistory, SQLChatMessageHistory, and FileChatMessageHistory for persistent memory.
How do I share memory across multiple users in a multi-user chatbot?
Use session IDs. Each user gets a unique session_id passed to get_session_history(). LangChain's RunnableWithMessageHistory handles the routing — one history store per session_id. Store histories in Redis or a database for multi-instance deployments.
What is the token limit for ConversationBufferMemory?
ConversationBufferMemory has no built-in token limit. It keeps every message, so token usage grows until the prompt exceeds the model's context window. For long conversations, switch to ConversationBufferWindowMemory (keeps the last k exchanges) or ConversationSummaryMemory (summarizes old messages). In production, always cap memory with one of these approaches.
AiTechWorlds Team
The AiTechWorlds team is passionate about AI, technology, and education. We create high-quality, research-backed content to help you learn, grow, and succeed in the modern digital world.