AI Agent Memory and Planning: How Agents Remember and Reason About Long Tasks
AI agent memory and planning explained — how agents store context across sessions, plan multi-step tasks, and use working memory, episodic memory, and semantic memory effectively.
An agent that forgets everything between sessions is a chatbot with extra steps. An agent that can't maintain coherent state across a 20-step task is an expensive random walk.
Memory and planning are what separate toy agents from genuinely useful ones. I've rebuilt the memory system in production agents three times — each time learning what actually matters. This guide covers what works.
The Four Types of Agent Memory
┌──────────────────────────────────────────────────────┐
│              Agent Memory Architecture               │
├──────────────────────────────────────────────────────┤
│                                                      │
│  WORKING MEMORY (Context Window)                     │
│  ───────────────────────────────                     │
│  Current messages, tool results, active state        │
│  Capacity: 128K-200K tokens                          │
│  Persistence: None (lost when session ends)          │
│                                                      │
│  EPISODIC MEMORY (What happened)                     │
│  ───────────────────────────────                     │
│  Past conversations, completed tasks, interactions   │
│  Capacity: Unlimited (database)                      │
│  Persistence: Permanent                              │
│  Retrieval: Semantic search or recency               │
│                                                      │
│  SEMANTIC MEMORY (What I know)                       │
│  ─────────────────────────────                       │
│  Facts, user preferences, domain knowledge           │
│  Capacity: Unlimited (vector database)               │
│  Persistence: Permanent (with updates)               │
│  Retrieval: Embedding similarity                     │
│                                                      │
│  PROCEDURAL MEMORY (How to do things)                │
│  ────────────────────────────────────                │
│  System prompt, retrieved how-to documents           │
│  Persistence: Static or dynamically retrieved        │
└──────────────────────────────────────────────────────┘
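To make the diagram concrete, here is a minimal sketch of how the four layers might be folded into a single prompt for one turn. The `MemorySnapshot` container and `build_messages` helper are hypothetical names introduced for illustration; they are not part of any library or of the classes later in this guide.

```python
from dataclasses import dataclass, field


@dataclass
class MemorySnapshot:
    """Hypothetical container for what each memory layer contributes to one turn."""
    procedural: str                                     # system prompt / how-to text
    semantic: list[str] = field(default_factory=list)   # recalled facts
    episodic: list[str] = field(default_factory=list)   # relevant past episodes
    working: list[dict] = field(default_factory=list)   # live chat messages


def build_messages(snapshot: MemorySnapshot) -> list[dict]:
    """Fold long-term memories into the system message, then append live messages."""
    sections = [snapshot.procedural]
    if snapshot.semantic:
        facts = "\n".join(f"- {fact}" for fact in snapshot.semantic)
        sections.append("Known facts:\n" + facts)
    if snapshot.episodic:
        episodes = "\n".join(f"- {ep}" for ep in snapshot.episodic)
        sections.append("Relevant past interactions:\n" + episodes)
    system = {"role": "system", "content": "\n\n".join(sections)}
    return [system] + list(snapshot.working)
```

Only working memory is passed through as messages; everything else is retrieved first and then compressed into the system message, which is why retrieval quality matters so much for the long-term layers.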
Part 1: Working Memory Management
import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.encoding_for_model("gpt-4o")

# Marker used to recognize tool results that were already compressed
SUMMARY_PREFIX = "[Tool result summarized"


class WorkingMemory:
    """Manages the active context window for an agent."""

    def __init__(self, max_tokens: int = 100000, reserved_output: int = 4000):
        self.messages = []
        # Leave headroom for the model's reply
        self.max_tokens = max_tokens - reserved_output
        self.system_prompt = ""
        self._system_tokens = 0

    def set_system_prompt(self, prompt: str):
        self.system_prompt = prompt
        self._system_tokens = len(enc.encode(prompt))

    def count_tokens(self, messages: list) -> int:
        """Approximate count: content tokens plus ~4 tokens of per-message overhead."""
        total = self._system_tokens
        for msg in messages:
            if isinstance(msg.get("content"), str):
                total += len(enc.encode(msg["content"])) + 4
        return total

    def add_message(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        self._maybe_compress()

    def _maybe_compress(self):
        """Compress messages when approaching context limit."""
        while self.count_tokens(self.messages) > self.max_tokens:
            if len(self.messages) <= 4:  # Keep at least 2 exchanges
                break
            # Strategy: replace the oldest verbose tool result with a placeholder.
            # Placeholders from earlier passes are skipped so every pass makes progress.
            for i, msg in enumerate(self.messages):
                if (
                    msg["role"] == "tool"
                    and i < len(self.messages) - 4
                    and not msg["content"].startswith(SUMMARY_PREFIX)
                ):
                    original = msg["content"]
                    self.messages[i] = {
                        **msg,
                        "content": f"{SUMMARY_PREFIX}: {len(original)} chars]",
                    }
                    break
            else:
                # No tool results left to compress: drop the oldest message
                self.messages.pop(0)

    def get_context(self) -> list:
        return [{"role": "system", "content": self.system_prompt}] + self.messages

    def summarize_and_reset(self) -> str:
        """Summarize current context and reset working memory."""
        if not self.messages:
            return ""
        # Ask LLM to summarize the conversation
        summary_response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "Summarize the key facts, decisions, and outcomes from this agent interaction in 200 words or less."},
                *self.messages
            ]
        )
        summary = summary_response.choices[0].message.content
        # Reset with summary as context
        self.messages = [
            {
                "role": "system",
                "content": f"Previous context summary: {summary}"
            }
        ]
        return summary
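For experimenting with the compression policy offline, without an API key or `tiktoken`, the same idea can be sketched as a pure function. This is an illustration under stated assumptions: `rough_tokens` is a crude stand-in for a real tokenizer (about 4 characters per token), and `trim_to_budget` is a hypothetical helper, not part of the class above.

```python
from typing import Callable

SUMMARY_PREFIX = "[Tool result summarized"


def rough_tokens(text: str) -> int:
    """Stand-in tokenizer (assumption): about 4 characters per token."""
    return max(1, len(text) // 4)


def trim_to_budget(
    messages: list[dict],
    budget: int,
    keep_recent: int = 4,
    count: Callable[[str], int] = rough_tokens,
) -> list[dict]:
    """Return a trimmed copy of messages that fits the budget where possible."""
    msgs = [dict(m) for m in messages]  # never mutate the caller's history

    def total() -> int:
        return sum(count(m["content"]) + 4 for m in msgs)

    while total() > budget and len(msgs) > keep_recent:
        # First choice: shrink the oldest tool result that is still verbose.
        for i, m in enumerate(msgs[:-keep_recent]):
            if m["role"] == "tool" and not m["content"].startswith(SUMMARY_PREFIX):
                msgs[i] = {**m, "content": f"{SUMMARY_PREFIX}: {len(m['content'])} chars]"}
                break
        else:
            # Nothing left to shrink: drop the oldest message instead.
            msgs.pop(0)
    return msgs
```

Each pass either shrinks one tool result or removes one message, so the loop always terminates. The most recent `keep_recent` messages are never touched, which means the result can still exceed the budget if those messages alone are too large.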
Part 2: Episodic Memory with Semantic Search
import json
from datetime import datetime

import chromadb
from openai import OpenAI

client = OpenAI()


class EpisodicMemory:
    """Store and retrieve past agent interactions."""

    def __init__(self, collection_name: str = "agent_episodes"):
        self.chroma = chromadb.PersistentClient(path="./agent_memory")
        # Use cosine distance so that `1 - distance` below is a similarity score.
        # Without this setting, Chroma uses its default metric (squared L2).
        self.collection = self.chroma.get_or_create_collection(
            collection_name,
            metadata={"hnsw:space": "cosine"},
        )

    def _embed(self, text: str) -> list[float]:
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=[text]
        )
        return response.data[0].embedding

    def store_episode(
        self,
        session_id: str,
        user_message: str,
        agent_response: str,
        tools_used: list[str] | None = None,
        outcome: str = "success"
    ):
        """Store a completed interaction."""
        episode_text = f"User: {user_message}\nAgent: {agent_response}"
        now = datetime.now()  # one timestamp for both the ID and the metadata
        self.collection.upsert(
            ids=[f"{session_id}_{now.timestamp()}"],
            embeddings=[self._embed(episode_text)],
            documents=[episode_text],
            metadatas=[{
                "session_id": session_id,
                "timestamp": now.isoformat(),
                # Chroma metadata values must be scalars, so the list is JSON-encoded
                "tools_used": json.dumps(tools_used or []),
                "outcome": outcome,
                "user_message_short": user_message[:100]
            }]
        )

    def retrieve_relevant(self, current_query: str, top_k: int = 3) -> list[dict]:
        """Find past interactions similar to current query."""
        query_emb = self._embed(current_query)
        results = self.collection.query(
            query_embeddings=[query_emb],
            n_results=top_k,
            include=["documents", "metadatas", "distances"]
        )
        episodes = []
        for doc, meta, distance in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0]
        ):
            if distance < 0.5:  # Only return highly relevant episodes
                episodes.append({
                    "episode": doc[:500],  # Truncate for context
                    "timestamp": meta["timestamp"],
                    "outcome": meta["outcome"],
                    "relevance": 1 - distance
                })
        return episodes

    def format_for_context(self, episodes: list[dict]) -> str:
        """Format retrieved episodes for injection into agent context."""
        if not episodes:
            return ""
        parts = ["Relevant past interactions:"]
        for ep in episodes:
            date = ep["timestamp"][:10]  # YYYY-MM-DD prefix of the ISO timestamp
            parts.append(f"\n[{date}, {ep['outcome']}] {ep['episode'][:200]}")
        return "\n".join(parts)
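The `distance < 0.5` filter and the `1 - distance` relevance score only mean what they appear to mean when the distance is a cosine distance, and vector stores can be configured with other metrics, so it is worth checking which one your collection uses. The sketch below reproduces the filtering logic with plain Python lists so the numbers can be inspected by hand. `cosine_distance` and `filter_relevant` are hypothetical helpers, and the two-dimensional vectors are toy data, not real embeddings.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        return 1.0  # treat zero vectors as unrelated
    return 1.0 - dot / norm


def filter_relevant(
    query_vec: list[float],
    episodes: list[tuple[str, list[float]]],
    max_distance: float = 0.5,
    top_k: int = 3,
) -> list[dict]:
    """Rank episodes by distance, keep the top_k, then drop weak matches."""
    scored = [(cosine_distance(query_vec, vec), text) for text, vec in episodes]
    scored.sort(key=lambda pair: pair[0])
    return [
        {"episode": text, "relevance": round(1 - dist, 3)}
        for dist, text in scored[:top_k]
        if dist < max_distance
    ]
```

With a query of `[1.0, 0.0]`, an episode embedded at `[1.0, 1.0]` sits at a 45 degree angle and gets a relevance of about 0.707, while an orthogonal episode at `[0.0, 1.0]` has distance 1.0 and is filtered out.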
Part 3: Planning Patterns
Plan-and-Execute
from pydantic import BaseModel
from typing import List


class TaskPlan(BaseModel):
    goal: str
    tasks: List[str]
    success_criteria: str


def generate_plan(goal: str) -> TaskPlan:
    """Generate a structured task plan before execution."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": """Create a step-by-step plan to accomplish the goal.
Return JSON with:
- goal: restated goal
- tasks: list of specific, actionable tasks (5-10 steps)
- success_criteria: how to know the goal is accomplished
Tasks should be concrete actions, not vague steps."""
            },
            {"role": "user", "content": f"Goal: {goal}"}
        ],
        response_format={"type": "json_object"}
    )
    plan_data = json.loads(response.choices[0].message.content)
    return TaskPlan(**plan_data)
class PlanExecuteAgent:
    def __init__(self, tools: list, model: str = "gpt-4o-mini", max_replans: int = 2):
        self.tools = tools
        self.model = model
        self.max_replans = max_replans  # cap replanning so the loop cannot run forever
        self.working_memory = WorkingMemory()
        self.episodic_memory = EpisodicMemory()
        self.current_plan: TaskPlan | None = None
        self.completed_tasks: list[str] = []
        self.task_results: list[str] = []

    def run(self, goal: str, session_id: str = "session_001") -> str:
        # 1. Generate plan
        self.current_plan = generate_plan(goal)
        print(f"Plan created: {len(self.current_plan.tasks)} tasks")

        # 2. Retrieve relevant memories
        past_episodes = self.episodic_memory.retrieve_relevant(goal)
        memory_context = self.episodic_memory.format_for_context(past_episodes)

        # 3. Set up working memory
        self._set_plan_prompt(goal, memory_context)

        # 4. Execute tasks, switching to a new plan when results call for it
        pending = list(self.current_plan.tasks)
        replans = 0
        while pending:
            task = pending.pop(0)
            # Check if re-planning is needed
            if replans < self.max_replans and self._should_replan(task):
                print("Replanning based on intermediate results...")
                self.current_plan = generate_plan(
                    f"{goal} (previously attempted, now adjusting based on: {self.task_results[-1][:200]})"
                )
                self._set_plan_prompt(goal, memory_context)
                pending = list(self.current_plan.tasks)  # continue with the new plan
                replans += 1
                continue
            print(f"\nExecuting task {len(self.completed_tasks) + 1}: {task}")
            result = self._execute_task(task, len(self.completed_tasks))
            self.completed_tasks.append(task)
            self.task_results.append(result)

        # 5. Generate final output
        final_response = self._synthesize_results(goal)

        # 6. Store in episodic memory
        self.episodic_memory.store_episode(
            session_id=session_id,
            user_message=goal,
            agent_response=final_response[:500],
            # Tools may be plain functions or schema dicts, so fall back to str()
            tools_used=[getattr(t, "__name__", str(t)) for t in self.tools],
            outcome="success"
        )
        return final_response

    def _set_plan_prompt(self, goal: str, memory_context: str):
        """Put the current plan and any retrieved memories into the system prompt."""
        plan_lines = "\n".join(
            f"{i + 1}. {task}" for i, task in enumerate(self.current_plan.tasks)
        )
        self.working_memory.set_system_prompt(
            f"You are executing a plan to: {goal}\n"
            f"Plan:\n{plan_lines}\n"
            f"Success criteria: {self.current_plan.success_criteria}\n"
            f"{memory_context}\n"
            "Complete each task in order. Mark tasks as complete when done."
        )

    def _should_replan(self, next_task: str) -> bool:
        """Check if earlier results suggest the plan should change."""
        if not self.task_results:
            return False
        last_result = self.task_results[-1]
        check_response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": f"Given this result: '{last_result[:200]}'\n"
                           f"Does the next planned task still make sense: '{next_task}'?\n"
                           f"Answer only: yes or no"
            }]
        )
        # Match the start of the answer; a substring test would also fire on "know" or "not"
        answer = (check_response.choices[0].message.content or "").strip().lower()
        return answer.startswith("no")

    def _execute_task(self, task: str, step_index: int) -> str:
        """Execute a single task."""
        context = f"Completed {step_index} of {len(self.current_plan.tasks)} tasks."
        if self.task_results:
            context += f"\nLast result: {self.task_results[-1][:300]}"
        self.working_memory.add_message("user", f"{context}\n\nNow execute: {task}")
        response = client.chat.completions.create(
            model=self.model,
            messages=self.working_memory.get_context()
        )
        result = response.choices[0].message.content
        self.working_memory.add_message("assistant", result)
        return result

    def _synthesize_results(self, goal: str) -> str:
        results_text = "\n\n".join([
            f"Task: {task}\nResult: {result[:300]}"
            for task, result in zip(self.completed_tasks, self.task_results)
        ])
        response = client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": "Synthesize the task results into a coherent final answer."},
                {"role": "user", "content": f"Goal: {goal}\n\nTask Results:\n{results_text}"}
            ]
        )
        return response.choices[0].message.content
Part 4: Semantic Memory for User Preferences
import hashlib


class SemanticMemory:
    """Store learned facts about users and domain."""

    def __init__(self):
        self.chroma = chromadb.PersistentClient(path="./agent_memory")
        self.facts = self.chroma.get_or_create_collection("semantic_facts")

    def _embed(self, text: str) -> list[float]:
        # Same embedding call as EpisodicMemory
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=[text]
        )
        return response.data[0].embedding

    def store_fact(self, fact: str, category: str, subject: str = "user"):
        embedding = self._embed(fact)
        # Built-in hash() is salted per process, so use a stable digest;
        # storing the same fact again then overwrites it instead of duplicating it
        digest = hashlib.sha256(fact.encode("utf-8")).hexdigest()[:16]
        doc_id = f"{subject}_{category}_{digest}"
        self.facts.upsert(
            ids=[doc_id],
            embeddings=[embedding],
            documents=[fact],
            metadatas=[{  # one metadata dict per document
                "category": category,
                "subject": subject,
                "stored_at": datetime.now().isoformat()
            }]
        )

    def recall(self, query: str, top_k: int = 5) -> list[str]:
        query_emb = self._embed(query)
        results = self.facts.query(
            query_embeddings=[query_emb],
            n_results=top_k
        )
        return results["documents"][0] if results["documents"] else []

    def extract_and_store_facts(self, conversation: str):
        """Use LLM to extract memorable facts from conversation."""
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": """Extract facts worth remembering from this conversation.
Return JSON: {"facts": [{"fact": "...", "category": "preference/project/personal/technical"}]}
Only include facts that would be useful in future conversations.
Return empty list if nothing worth remembering."""
                },
                {"role": "user", "content": conversation}
            ],
            response_format={"type": "json_object"}
        )
        data = json.loads(response.choices[0].message.content)
        for item in data.get("facts", []):
            self.store_fact(item["fact"], item["category"])
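Model output is not guaranteed to match the requested JSON shape, so it can help to validate extracted facts before they reach permanent storage. The sketch below is one possible guard, assuming the four categories named in the prompt above; `parse_facts` and `ALLOWED_CATEGORIES` are hypothetical names introduced for illustration.

```python
import json

# Assumption: the categories listed in the extraction prompt
ALLOWED_CATEGORIES = {"preference", "project", "personal", "technical"}


def parse_facts(raw_json: str) -> list[tuple[str, str]]:
    """Return (fact, category) pairs, silently skipping malformed entries."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        return []
    items = data.get("facts") if isinstance(data, dict) else None
    if not isinstance(items, list):
        return []
    facts = []
    for item in items:
        if not isinstance(item, dict):
            continue
        fact = str(item.get("fact", "")).strip()
        category = str(item.get("category", "")).strip().lower()
        # Keep only non-empty facts in a known category
        if fact and category in ALLOWED_CATEGORIES:
            facts.append((fact, category))
    return facts
```

Dropping a malformed entry is usually cheaper than storing it: a bad fact in semantic memory can be recalled into every later session.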
Conclusion
Memory and planning are what separate agents that can handle complex, multi-session tasks from those that can only respond to single prompts. The working memory management, episodic retrieval, and structured planning patterns here are the foundation of production agent systems.
The key insight: context is the most precious resource in an agent. Every token not used wisely is a token that could hold a more relevant fact, a more useful tool result, or a better plan.
For the graph-based framework that makes these patterns composable, see our LangGraph tutorial. For building specialized research agents with these memory systems, see our AI research agent guide.
Frequently Asked Questions
What are the types of memory in AI agents?
Working memory (context window — limited, temporary), episodic memory (past interactions in a database — unlimited, permanent), semantic memory (learned facts in a vector DB — unlimited, searchable), and procedural memory (how-to instructions in system prompt or retrieved docs).
How do agents plan multi-step tasks?
Three common patterns: ReAct (plan on the fly, one step at a time), plan-and-execute (generate the full plan upfront, then execute it), and tree of thoughts (generate several plan branches and pick the best). A hybrid often works well: generate an initial plan, then allow replanning at checkpoints when intermediate results suggest the plan needs adjustment.
What is working memory for AI agents?
The active context window: everything in the current prompt. Its size depends on the model, roughly 128K-200K tokens for the models this guide assumes. Strategies: summarize old tool results, extract key facts to persistent storage before removing them from context, and always prioritize recent and relevant information.
How does an agent maintain state across long tasks?
Structured state objects (TypedDict/JSON), checkpoint saves to database after each step, scratchpad pattern (notes field tracking key findings), and summary compression (compress completed subtask details into compact form). LangGraph's state management handles this systematically.
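As a rough illustration of the first three techniques, here is a sketch that keeps state in a `TypedDict`, uses a `notes` list as the scratchpad, and writes a checkpoint to SQLite after each step. The `AgentState` fields, the `checkpoints` table, and the helper names are all assumptions made for this example; LangGraph ships its own checkpointing, which this does not reproduce.

```python
import json
import sqlite3
from typing import Optional, TypedDict


class AgentState(TypedDict):
    goal: str
    step: int
    notes: list[str]  # scratchpad: key findings carried between steps
    done: bool


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the checkpoint database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        "run_id TEXT, step INTEGER, state TEXT, PRIMARY KEY (run_id, step))"
    )
    return conn


def save_checkpoint(conn: sqlite3.Connection, run_id: str, state: AgentState) -> None:
    """Persist the state after a step; re-saving the same step overwrites it."""
    conn.execute(
        "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
        (run_id, state["step"], json.dumps(state)),
    )
    conn.commit()


def load_latest(conn: sqlite3.Connection, run_id: str) -> Optional[AgentState]:
    """Return the most recent checkpoint for a run, or None if there is none."""
    row = conn.execute(
        "SELECT state FROM checkpoints WHERE run_id = ? ORDER BY step DESC LIMIT 1",
        (run_id,),
    ).fetchone()
    return json.loads(row[0]) if row else None
```

After a crash or restart, the agent can call `load_latest` and resume from the saved step with its notes intact instead of replaying the whole task.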
What is the difference between short-term and long-term agent memory?
Short-term: in-context (current session), fast but limited and temporary. Long-term: persisted outside the model (database + vector DB), unlimited capacity, requires explicit storage and retrieval. The challenge: deciding what to store and when to retrieve it into context.
AiTechWorlds Team
The AiTechWorlds team is passionate about AI, technology, and education. We create high-quality, research-backed content to help you learn, grow, and succeed in the modern digital world.