5 AutoGPT Long-Term Memory Strategies (Summarize, Forget)

⚡ Quick Answer

Manage AutoGPT long-term memory with five strategies: sliding window, summarization, selective retention, external vector memory, and hierarchical memory. Each one helps avoid context explosion in autonomous agents.

AiTechWorlds Team May 31, 2026 13 min read


Context explosion is the silent killer of long-running AutoGPT sessions. The agent starts a multi-hour task with perfect recall. An hour in, it's forgotten the original constraints. Two hours in, it's contradicting decisions it made earlier. Three hours in, it's consuming so many tokens that each step costs more than the last.

The root problem isn't that AutoGPT is forgetful — it's that the default memory behavior is to keep everything until the context window is full, then start losing things unpredictably from the beginning of the conversation. Deliberate memory management changes this from an accident into a choice.

Here are five strategies, from simplest to most sophisticated, with working code for each.

The Memory Problem in Concrete Terms

Before getting into solutions, understand the scale of the problem.

GPT-4o has a 128K token context window. That sounds large. In practice, an AutoGPT session with tool calls, results, and agent reasoning burns through it faster than you'd expect:

Average tool call + result: ~500 tokens
Average agent reasoning step: ~800 tokens
System prompt + goal definition: ~500 tokens

If each step involves one tool call and one reasoning step (about 1,300 tokens), a 100-step session consumes roughly 130,500 tokens, slightly over the limit. Most real tasks involve far more than 100 steps. Without memory management, you're guaranteed to hit the wall on anything non-trivial.
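
The arithmetic behind that estimate can be checked with a short sketch. The per-step figures are the rough averages listed above, and the assumption that every step contains exactly one tool call and one reasoning step is a simplification, as is treating "128K" as 128,000 tokens.

```python
# Per-step token estimates taken from the figures above (rough averages, not measurements).
TOOL_CALL_TOKENS = 500
REASONING_TOKENS = 800
SYSTEM_PROMPT_TOKENS = 500
CONTEXT_LIMIT = 128_000  # assumption: "128K" treated as 128,000 tokens


def estimate_session_tokens(steps: int) -> int:
    """Estimate total tokens if every step has one tool call and one reasoning step."""
    return SYSTEM_PROMPT_TOKENS + steps * (TOOL_CALL_TOKENS + REASONING_TOKENS)


def max_steps_within_limit(limit: int = CONTEXT_LIMIT) -> int:
    """Largest step count whose estimated history still fits in the window."""
    return (limit - SYSTEM_PROMPT_TOKENS) // (TOOL_CALL_TOKENS + REASONING_TOKENS)


print(estimate_session_tokens(100))  # 130500
print(max_steps_within_limit())      # 98
```

Under these assumptions the unmanaged history stops fitting at around step 98, which is why even a modest task needs some form of compression.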

The ideal state is that the agent always knows: what its overall goal is, what it has already accomplished, what key facts it has discovered, and what decisions it has made. Everything else can be compressed or forgotten.
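
One way to picture that ideal state is as a small structured record that is re-rendered into the prompt at every step. The sketch below is illustrative only: the `AgentState` class, its field names, and the sample values are hypothetical and do not come from AutoGPT itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AgentState:
    """Hypothetical container for the four things the agent should always know."""
    goal: str
    completed: List[str] = field(default_factory=list)
    facts: List[str] = field(default_factory=list)
    decisions: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the state as a compact block suitable for a system message."""
        lines = [f"Goal: {self.goal}"]
        for label, items in (
            ("Completed", self.completed),
            ("Facts", self.facts),
            ("Decisions", self.decisions),
        ):
            if items:  # skip empty sections to save tokens
                lines.append(f"{label}: " + "; ".join(items))
        return "\n".join(lines)


state = AgentState(goal="Compare three battery chemistries")
state.facts.append("Budget cap is $500")
state.decisions.append("Exclude lead-acid from the comparison")
print(state.render())
```

Each of the five strategies can be read as a different way of keeping a record like this small and current.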

Strategy 1: Sliding Window Memory

The simplest approach. Keep only the N most recent messages in the active context. Discard everything older.

# memory/sliding_window.py
from typing import List, Dict


class SlidingWindowMemory:
    def __init__(self, window_size: int = 20, preserve_system: bool = True):
        """
        window_size: Number of recent messages to keep in context
        preserve_system: Always keep system/goal messages regardless of age
        """
        self.window_size = window_size
        self.preserve_system = preserve_system
        self._messages: List[Dict] = []

    def add_message(self, role: str, content: str, message_type: str = "normal"):
        """Add a message to memory."""
        self._messages.append({
            "role": role,
            "content": content,
            "type": message_type,  # "system", "goal", "critical", "normal"
        })

    def get_context(self) -> List[Dict]:
        """Get current context window (role + content only, no internal metadata)."""
        if self.preserve_system:
            # Always include system/goal messages
            protected = [m for m in self._messages if m["type"] in ("system", "goal")]
            recent = [m for m in self._messages if m["type"] not in ("system", "goal")]
            recent = recent[-self.window_size:]
            all_msgs = protected + recent
        else:
            all_msgs = self._messages[-self.window_size:]

        # Strip internal type metadata before returning
        return [{"role": m["role"], "content": m["content"]} for m in all_msgs]

    def get_stats(self) -> dict:
        total_tokens_est = sum(len(m["content"].split()) * 1.3 for m in self._messages)
        context_tokens_est = sum(
            len(m["content"].split()) * 1.3 for m in self.get_context()
        )
        return {
            "total_messages": len(self._messages),
            "context_messages": len(self.get_context()),
            "estimated_total_tokens": int(total_tokens_est),
            "estimated_context_tokens": int(context_tokens_est),
            "messages_dropped": max(0, len(self._messages) - len(self.get_context())),
        }

When to use it: Short tasks with predictable step counts. When you don't care about the agent referencing early decisions. Development and prototyping.

Tradeoffs: Simple and zero-cost. But the agent may rediscover things it already found, contradict earlier decisions, or lose important context set up in early messages.
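
The tradeoff is easy to reproduce in isolation. The following minimal sketch uses `collections.deque` with `maxlen` as a stand-in for an unprotected window; the step strings are made up for illustration.

```python
from collections import deque

# A deque with maxlen drops the oldest item automatically on append.
window = deque(maxlen=3)

steps = [
    "Constraint: never write outside ./output",
    "Listed files in ./data",
    "Parsed report.csv",
    "Computed monthly totals",
]
for step in steps:
    window.append(step)

print(list(window))
# The constraint set in the first message has already fallen out of the window.
constraint_survived = any(s.startswith("Constraint") for s in window)
print(constraint_survived)  # False
```

This is the failure that the preserve_system option in the class above is meant to avoid for messages tagged as system or goal.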

Strategy 2: Summarization Memory

Instead of dropping old messages, compress them into a rolling summary. The agent retains the gist of past work without the full token cost.

# memory/summarization_memory.py
from openai import OpenAI
from typing import List, Dict, Optional
import os


class SummarizationMemory:
    def __init__(
        self,
        window_size: int = 15,
        summarize_threshold: int = 30,
        summary_model: str = "gpt-4o-mini",
    ):
        self.window_size = window_size
        self.summarize_threshold = summarize_threshold
        self.summary_model = summary_model
        self.client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

        self._messages: List[Dict] = []
        self._summary: Optional[str] = None

    def add_message(self, role: str, content: str):
        self._messages.append({"role": role, "content": content})

        # Trigger summarization when buffer exceeds threshold
        if len(self._messages) > self.summarize_threshold:
            self._compress_old_messages()

    def _compress_old_messages(self):
        """Summarize messages outside the current window."""
        messages_to_compress = self._messages[:-self.window_size]
        self._messages = self._messages[-self.window_size:]

        # Prepare messages for summarization
        text_to_summarize = "\n".join([
            f"{m['role'].upper()}: {m['content'][:300]}"
            for m in messages_to_compress
        ])

        # Include existing summary if present
        existing_context = ""
        if self._summary:
            existing_context = f"Previous summary:\n{self._summary}\n\n"

        prompt = f"""{existing_context}Summarize the following agent conversation history.
        Focus on:
        1. What tasks were completed
        2. Key facts and data discovered
        3. Decisions made and their rationale
        4. Files or outputs created
        5. Current progress toward the main goal

        Be specific about numbers, names, and critical details.
        Maximum 300 words.

        Conversation to summarize:
        {text_to_summarize}"""

        response = self.client.chat.completions.create(
            model=self.summary_model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=400,
        )

        self._summary = response.choices[0].message.content

    def get_context(self) -> List[Dict]:
        """Get current context including summary if available."""
        messages = []

        if self._summary:
            messages.append({
                "role": "system",
                "content": f"[Conversation History Summary]\n{self._summary}"
            })

        messages.extend(self._messages[-self.window_size:])
        return messages

    def get_summary(self) -> Optional[str]:
        return self._summary

When to use it: Tasks over 30-60 steps. Research workflows where the agent needs to remember findings from earlier steps. Any task where coherence across the full session matters.

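Tradeoffs: Incurs additional LLM cost for summarization (use a cheap model like gpt-4o-mini to minimize this). Summaries lose nuance, and the code above also truncates each message to 300 characters before summarizing, so anything past that point never reaches the summary. Numbers and specific facts should be extracted before summarizing.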
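
Extracting numbers and specific facts before summarizing does not have to involve another LLM call. The sketch below pulls URLs and numeric values out of a message with regular expressions so they can be stored verbatim. The patterns, the function name, and the sample text are illustrative assumptions and will likely need tuning for a real agent's output.

```python
import re
from typing import Dict, List

# Illustrative patterns only; real agents will likely need task-specific ones.
URL_RE = re.compile(r"https?://\S+")
NUMBER_RE = re.compile(r"\$?\d+(?:,\d{3})*(?:\.\d+)?%?")


def extract_verbatim_facts(text: str) -> Dict[str, List[str]]:
    """Collect URLs and numeric values that a summary might blur or drop."""
    urls = URL_RE.findall(text)
    # Remove URLs first so digits inside them are not reported as numbers.
    without_urls = URL_RE.sub(" ", text)
    numbers = NUMBER_RE.findall(without_urls)
    return {"urls": urls, "numbers": numbers}


sample = (
    "Found 3 suppliers at https://example.com/list2 with prices "
    "from $1,200 to $1,450.50, a 12% spread."
)
facts = extract_verbatim_facts(sample)
print(facts["urls"])     # ['https://example.com/list2']
print(facts["numbers"])  # ['3', '$1,200', '$1,450.50', '12%']
```

The extracted values can then be kept alongside the rolling summary, so the summary carries the narrative and the extracted list carries the precision.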

Strategy 3: Selective Retention Memory

Not all memories are equal. This strategy explicitly categorizes messages and applies different retention policies to each category.

# memory/selective_memory.py
from typing import List, Dict, Optional
from enum import Enum
import json


class MemoryTier(str, Enum):
    PERMANENT = "permanent"    # Never dropped (goals, critical decisions)
    LONG_TERM = "long_term"   # Kept for full session (key findings, facts)
    WORKING = "working"        # Recent context only (last N steps)
    EPHEMERAL = "ephemeral"   # Drop after single use (tool call intermediates)


class SelectiveMemory:
    def __init__(self, working_window: int = 10):
        self.working_window = working_window
        self._permanent: List[Dict] = []
        self._long_term: List[Dict] = []
        self._working: List[Dict] = []

    def add(self, role: str, content: str, tier: MemoryTier = MemoryTier.WORKING):
        message = {"role": role, "content": content}

        if tier == MemoryTier.PERMANENT:
            self._permanent.append(message)
        elif tier == MemoryTier.LONG_TERM:
            self._long_term.append(message)
        elif tier == MemoryTier.WORKING:
            self._working.append(message)
        # EPHEMERAL messages are not stored at all

    def promote_to_long_term(self, content: str):
        """Promote a fact or finding to long-term memory."""
        self._long_term.append({
            "role": "system",
            "content": f"[Retained Fact] {content}"
        })

    def get_context(self) -> List[Dict]:
        """Assemble context from all tiers."""
        context = []
        context.extend(self._permanent)                     # Always present
        context.extend(self._long_term[-20:])               # Newest 20 long-term items (older ones stay stored but are not sent)
        context.extend(self._working[-self.working_window:]) # Recent working memory
        return context

    def extract_and_retain_facts(self, text: str, llm_client, model: str = "gpt-4o-mini"):
        """Use LLM to extract important facts from a message and retain them."""
        extraction_prompt = f"""Extract the most important facts from this text that should be
        remembered for future reference. Focus on: specific numbers, names, URLs, file paths,
        decisions made, constraints identified.

        Return a JSON array of strings. Each string should be a single, specific fact.
        Maximum 5 facts. Return [] if nothing important.

        Text: {text[:1000]}"""

        response = llm_client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": extraction_prompt}],
            max_tokens=200,
        )

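        try:
            facts = json.loads(response.choices[0].message.content or "[]")
        except json.JSONDecodeError:
            return  # Model did not return valid JSON; retain nothing

        # Guard against valid JSON that is not a list of strings
        if not isinstance(facts, list):
            return
        for fact in facts[:5]:
            if isinstance(fact, str):
                self.promote_to_long_term(fact)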

When to use it: Complex research tasks where some outputs are critical facts and others are just process steps. Any task where the agent needs to reliably reference specific data points discovered early on.

Tradeoffs: Requires classifying messages at creation time. The automatic fact extraction adds cost. Works best when you can predict what types of information matter.
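
Classifying at creation time can start with something as plain as keyword rules. The sketch below is one possible approach, not part of the class above: the `RULES` table, its keywords, and the `classify` helper are hypothetical, and the enum is redefined here only to keep the example self-contained.

```python
from enum import Enum
from typing import List, Tuple


class MemoryTier(str, Enum):
    PERMANENT = "permanent"
    LONG_TERM = "long_term"
    WORKING = "working"
    EPHEMERAL = "ephemeral"


# Hypothetical keyword rules; tune these to the agent's actual message formats.
RULES: List[Tuple[Tuple[str, ...], MemoryTier]] = [
    (("goal:", "constraint:", "must not", "never "), MemoryTier.PERMANENT),
    (("found:", "decision:", "result:"), MemoryTier.LONG_TERM),
    (("http 200", "retrying", "progress:"), MemoryTier.EPHEMERAL),
]


def classify(content: str) -> MemoryTier:
    """Return the first tier whose keywords appear in the message, else WORKING."""
    lowered = content.lower()
    for keywords, tier in RULES:
        if any(keyword in lowered for keyword in keywords):
            return tier
    return MemoryTier.WORKING


print(classify("Constraint: stay under $500").value)       # permanent
print(classify("Decision: use Chroma for storage").value)  # long_term
print(classify("Retrying request (2/3)").value)            # ephemeral
print(classify("Opened report.csv").value)                 # working
```

Rules are checked in order, so a message that matches several tiers lands in the most durable one listed first.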

Strategy 4: External Vector Memory

Store all past agent outputs in a vector database. At each step, retrieve only the most relevant past context rather than keeping everything in the active window.

# memory/vector_memory.py
import os
import json
from typing import List, Dict, Optional
from openai import OpenAI
import numpy as np


class VectorMemory:
    """In-memory vector store (replace with Pinecone/Chroma for production)."""

    def __init__(self, embedding_model: str = "text-embedding-3-small"):
        self.client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
        self.embedding_model = embedding_model
        self.store: List[Dict] = []

    def _embed(self, text: str) -> List[float]:
        response = self.client.embeddings.create(
            input=text[:8000],
            model=self.embedding_model,
        )
        return response.data[0].embedding

    def _cosine_similarity(self, a: List[float], b: List[float]) -> float:
        a, b = np.array(a), np.array(b)
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        # Guard against zero-length vectors to avoid division by zero
        return float(np.dot(a, b) / denom) if denom else 0.0

    def store_memory(self, content: str, metadata: Optional[dict] = None):
        """Store a piece of memory with its embedding."""
        embedding = self._embed(content)
        self.store.append({
            "content": content,
            "embedding": embedding,
            "metadata": metadata or {},
        })

    def retrieve_relevant(
        self,
        query: str,
        top_k: int = 5,
        threshold: float = 0.7
    ) -> List[str]:
        """Retrieve memories most relevant to the current query."""
        if not self.store:
            return []

        query_embedding = self._embed(query)

        # Score all memories
        scored = []
        for item in self.store:
            score = self._cosine_similarity(query_embedding, item["embedding"])
            if score >= threshold:
                scored.append((score, item["content"]))

        # Return top_k by score
        scored.sort(reverse=True, key=lambda x: x[0])
        return [content for _, content in scored[:top_k]]

    def build_memory_context(self, current_query: str) -> str:
        """Build a memory context string for injection into the prompt."""
        relevant = self.retrieve_relevant(current_query)
        if not relevant:
            return ""

        context_parts = ["[Relevant Memory Retrieval]"]
        for i, memory in enumerate(relevant, 1):
            context_parts.append(f"{i}. {memory[:200]}")

        return "\n".join(context_parts)

For production, replace this with a proper vector database. The vector database guide covers Pinecone, Chroma, and Weaviate in detail — all three work well for this pattern.

When to use it: Long-running projects spanning multiple sessions. Agents that need to reference a large knowledge base. Research tasks where earlier findings are selectively relevant to later steps.

Tradeoffs: Higher cost, since every store and retrieve operation calls the embedding API. Adds latency. Requires external infrastructure for production. In return, memory is no longer bounded by the context window, only by storage capacity and retrieval quality.
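
Part of the embedding cost comes from embedding the same text more than once, for example when the agent repeats a query. A minimal sketch of memoizing embeddings with `functools.lru_cache` follows. `fake_embed` is a stand-in for a real embedding call and the toy vector it returns is meaningless; only the caching pattern is the point.

```python
from functools import lru_cache
from typing import Tuple

api_calls = 0  # counts how often the stand-in embedding backend is hit


def fake_embed(text: str) -> Tuple[float, ...]:
    """Stand-in for a real embedding API call; returns a toy 2-d vector."""
    global api_calls
    api_calls += 1
    return (float(len(text)), float(text.count(" ")))


@lru_cache(maxsize=1024)
def cached_embed(text: str) -> Tuple[float, ...]:
    """Memoize embeddings by exact text so repeated inputs cost nothing extra."""
    return fake_embed(text)


cached_embed("battery energy density")
cached_embed("battery energy density")  # served from the cache
cached_embed("solid-state electrolytes")
print(api_calls)  # 2
```

The cache only helps with exact repeats, and returning tuples rather than lists keeps the shared cached values immutable.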

Strategy 5: Hierarchical Memory

The most sophisticated approach. Memories are organized in tiers: immediate context, session summary, project history. Each tier feeds the one above it through progressive summarization.

# memory/hierarchical_memory.py
from openai import OpenAI
from typing import List, Dict, Optional
import os
import json
from datetime import datetime


class HierarchicalMemory:
    def __init__(
        self,
        immediate_window: int = 10,
        session_summary_interval: int = 25,
        llm_model: str = "gpt-4o-mini",
    ):
        self.immediate_window = immediate_window
        self.session_summary_interval = session_summary_interval
        self.llm_model = llm_model
        self.client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

        # Three memory tiers
        self.immediate: List[Dict] = []          # Last N messages
        self.session_summaries: List[str] = []   # Periodic session summaries
        self.project_memory: Dict = {            # Cross-session persistent state
            "goal": "",
            "key_decisions": [],
            "discovered_facts": [],
            "completed_tasks": [],
            "created_files": [],
        }

    def set_goal(self, goal: str):
        """Set the project goal — always kept in permanent memory."""
        self.project_memory["goal"] = goal

    def add_message(self, role: str, content: str):
        """Add a message and trigger compression if needed."""
        self.immediate.append({"role": role, "content": content})

        # Check if we need to create a session summary
        if len(self.immediate) >= self.session_summary_interval:
            self._create_session_summary()

    def _create_session_summary(self):
        """Compress current immediate memory into a session summary."""
        messages_text = "\n".join([
            f"{m['role']}: {m['content'][:250]}"
            for m in self.immediate[:-self.immediate_window]
        ])

        if not messages_text:
            return

        prompt = f"""Create a concise summary of these agent actions.
        Update the project state with any new information.

        Current project state:
        Goal: {self.project_memory['goal']}
        Known facts: {json.dumps(self.project_memory['discovered_facts'][-5:])}
        Completed tasks: {json.dumps(self.project_memory['completed_tasks'][-5:])}

        New messages to summarize:
        {messages_text}

        Return JSON with:
        {{
          "summary": "2-3 sentence narrative of what happened",
          "new_facts": ["fact1", "fact2"],
          "new_completed_tasks": ["task1"],
          "new_decisions": ["decision1"],
          "new_files": ["file_path1"]
        }}"""

        try:
            response = self.client.chat.completions.create(
                model=self.llm_model,
                messages=[{"role": "user", "content": prompt}],
                max_tokens=400,
                response_format={"type": "json_object"},
            )

            result = json.loads(response.choices[0].message.content)

            # Update project memory
            self.session_summaries.append(result.get("summary", ""))
            self.project_memory["discovered_facts"].extend(result.get("new_facts", []))
            self.project_memory["completed_tasks"].extend(result.get("new_completed_tasks", []))
            self.project_memory["key_decisions"].extend(result.get("new_decisions", []))
            self.project_memory["created_files"].extend(result.get("new_files", []))

            # Keep only recent immediate messages
            self.immediate = self.immediate[-self.immediate_window:]

        except Exception as e:
            print(f"Warning: Memory compression failed: {e}")
            # Fall back to simple truncation
            self.immediate = self.immediate[-self.immediate_window:]

    def get_context(self) -> List[Dict]:
        """Assemble full context from all memory tiers."""
        context = []

        # Project goal (always first)
        if self.project_memory["goal"]:
            context.append({
                "role": "system",
                "content": f"Project Goal: {self.project_memory['goal']}"
            })

        # Project state
        state_parts = []
        if self.project_memory["discovered_facts"]:
            facts = self.project_memory["discovered_facts"][-10:]
            state_parts.append(f"Known facts: {'; '.join(facts)}")
        if self.project_memory["completed_tasks"]:
            tasks = self.project_memory["completed_tasks"][-5:]
            state_parts.append(f"Completed: {'; '.join(tasks)}")
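        if self.project_memory["created_files"]:
            files = self.project_memory["created_files"][-5:]
            state_parts.append(f"Created files: {', '.join(files)}")
        # key_decisions is collected during compression, so surface it as well
        if self.project_memory["key_decisions"]:
            decisions = self.project_memory["key_decisions"][-5:]
            state_parts.append(f"Key decisions: {'; '.join(decisions)}")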

        if state_parts:
            context.append({
                "role": "system",
                "content": "[Project State]\n" + "\n".join(state_parts)
            })

        # Recent session summaries (last 3)
        if self.session_summaries:
            recent_summaries = self.session_summaries[-3:]
            context.append({
                "role": "system",
                "content": "[Recent History]\n" + "\n".join(recent_summaries)
            })

        # Immediate context
        context.extend(self.immediate)

        return context

    def save_to_disk(self, path: str):
        """Persist project memory across sessions."""
        state = {
            "project_memory": self.project_memory,
            "session_summaries": self.session_summaries,
            "saved_at": datetime.now().astimezone().isoformat(),  # timezone-aware; utcnow() is deprecated since Python 3.12
        }
        with open(path, "w") as f:
            json.dump(state, f, indent=2)

    def load_from_disk(self, path: str):
        """Resume project memory from a previous session."""
        with open(path) as f:
            state = json.load(f)
        self.project_memory = state["project_memory"]
        self.session_summaries = state["session_summaries"]
        print(f"Loaded memory from session saved at {state['saved_at']}")
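
The compression trigger is easier to follow without the API calls. The sketch below is a stripped-down, hypothetical `TieredBuffer` that keeps the same window and interval logic but takes the summarizer as an injected function, so it runs offline. `stub_summarize` is a placeholder for the LLM call, not a real summarizer.

```python
from typing import Callable, List


class TieredBuffer:
    """Toy two-tier memory: recent messages plus summaries of older batches."""

    def __init__(self, window: int, interval: int,
                 summarize: Callable[[List[str]], str]):
        if not 0 < window < interval:
            raise ValueError("need 0 < window < interval")
        self.window = window        # messages kept verbatim after compression
        self.interval = interval    # buffer size that triggers compression
        self.summarize = summarize  # injected so the example needs no API access
        self.immediate: List[str] = []
        self.summaries: List[str] = []

    def add(self, message: str) -> None:
        self.immediate.append(message)
        if len(self.immediate) >= self.interval:
            # Fold everything older than the window into one summary entry.
            old = self.immediate[:-self.window]
            self.summaries.append(self.summarize(old))
            self.immediate = self.immediate[-self.window:]


def stub_summarize(messages: List[str]) -> str:
    """Stand-in for an LLM call: records only which messages were folded in."""
    return f"{len(messages)} messages: {messages[0]} .. {messages[-1]}"


buf = TieredBuffer(window=2, interval=5, summarize=stub_summarize)
for i in range(1, 9):
    buf.add(f"step {i}")

print(buf.summaries)  # ['3 messages: step 1 .. step 3', '3 messages: step 4 .. step 6']
print(buf.immediate)  # ['step 7', 'step 8']
```

Injecting the summarizer also makes the trigger logic unit-testable, which is harder when the client is constructed inside the class.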

Strategy Comparison

Strategy	Token Cost	Coherence	Complexity	Best For
Sliding Window	Zero	Low — loses early context	Minimal	Short tasks, prototyping
Summarization	Low	Medium — retains gist	Low	Medium tasks, 30-100 steps
Selective Retention	Low	High for tagged content	Medium	Tasks with known critical facts
Vector Memory	Medium	High — semantic retrieval	High	Multi-session, large knowledge base
Hierarchical	Medium	Highest — structured state	High	Complex long-running projects

According to the AI agent memory and planning research, agents using hierarchical memory strategies outperform sliding window approaches by 34% on task completion for sessions exceeding 50 steps — the summarization overhead is more than offset by the reduction in contradictory actions.

Implementing Memory in an AutoGen Agent

import autogen
from memory.hierarchical_memory import HierarchicalMemory

memory = HierarchicalMemory(immediate_window=10, session_summary_interval=25)
memory.set_goal("Research and write a comprehensive report on battery technology advances in 2026")

# If resuming a session:
# memory.load_from_disk("project_memory.json")

# Inject memory context into each agent interaction
def get_messages_with_memory(user_message: str) -> list:
    memory.add_message("user", user_message)
    # get_context() already ends with the message just added,
    # so it must not be appended a second time
    return memory.get_context()

# After each agent response:
def store_agent_response(response: str):
    memory.add_message("assistant", response)
    memory.save_to_disk("project_memory.json")  # Auto-save

The right memory strategy depends on task length and complexity. For anything under 20-30 steps, sliding window is fine. For research projects spanning hours or multiple sessions, hierarchical memory is worth the additional setup. The pattern connects naturally to "Build AI agent with LangChain", where LangChain's conversation memory classes implement very similar approaches with a different API.
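
The selection guidance can be written down as a small decision function. This is a heuristic sketch only: the function name, parameters, and the exact cutoffs (30 and 100 steps) are assumptions read off the comparison table, not measured thresholds.

```python
def choose_strategy(
    expected_steps: int,
    multi_session: bool = False,
    large_knowledge_base: bool = False,
    known_critical_facts: bool = False,
) -> str:
    """Map rough task traits to one of the five strategies (heuristic only)."""
    if large_knowledge_base:
        return "vector"  # retrieval scales past what any summary can hold
    if multi_session:
        return "hierarchical"  # persistent project state across sessions
    if expected_steps <= 30:
        return "sliding_window"
    if known_critical_facts:
        return "selective_retention"
    if expected_steps <= 100:
        return "summarization"
    return "hierarchical"


print(choose_strategy(15))                             # sliding_window
print(choose_strategy(60))                             # summarization
print(choose_strategy(60, known_critical_facts=True))  # selective_retention
print(choose_strategy(500))                            # hierarchical
```

In practice the strategies also combine, for example a sliding window for recent steps on top of a vector store for older ones.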

The goal in every case is the same: the agent should always know where it is, where it came from, and where it's going — even when the raw conversation history is too long to fit in context.

Frequently Asked Questions

Why does AutoGPT lose important information during long tasks?

AutoGPT, like all LLM-based agents, has a fixed context window. When a session exceeds this limit, older content is truncated or compressed. Without deliberate memory management, important early context (goals, constraints, partial results) gets pushed out by newer, less important content.

What is the difference between AutoGPT's working memory and long-term memory?

Working memory is the active context window: what the agent is currently reasoning over. Long-term memory is persistent storage outside the context window, typically a vector database or file system. The agent retrieves from long-term memory as needed, treating it like a searchable external knowledge base.

Does summarization lose important details?

Yes. Summarization always involves information loss. The key is controlling what gets summarized and what gets preserved verbatim. Critical facts (names, numbers, specific decisions) should be extracted and stored explicitly before summarization. The summary captures flow and context; the extracted facts preserve precision.

Autogpt Autogen

5 AutoGPT Long-Term Memory Strategies (Summarize, Forget)

⚡ Quick Answer

Master AutoGPT long-term memory with 5 proven strategies: sliding window, summarization, selective retention, and more — avoid context explosion in autonomous agents.

AiTechWorlds Team May 31, 2026 13 min read

#AutoGPT #long-term memory #context management #agent memory

📚Part of the Autogpt Autogen guide — explore all Autogpt Autogen articles→

Share:Facebook Twitter/X LinkedIn Telegram WhatsApp

📱

Get more content like this on Telegram!

Daily AI tips, notes & resources — free

Join Free →

Here are five strategies, from simplest to most sophisticated, with working code for each.

The Memory Problem in Concrete Terms

Before getting into solutions, understand the scale of the problem.

GPT-4o has a 128K token context window. That sounds large. In practice, an AutoGPT session with tool calls, results, and agent reasoning burns through it faster than you'd expect:

Average tool call + result: ~500 tokens
Average agent reasoning step: ~800 tokens
System prompt + goal definition: ~500 tokens

Strategy 1: Sliding Window Memory

The simplest approach. Keep only the N most recent messages in the active context. Discard everything older.

# memory/sliding_window.py
from typing import List, Dict
import json


class SlidingWindowMemory:
    def __init__(self, window_size: int = 20, preserve_system: bool = True):
        """
        window_size: Number of recent messages to keep in context
        preserve_system: Always keep system/goal messages regardless of age
        """
        self.window_size = window_size
        self.preserve_system = preserve_system
        self._messages: List[Dict] = []

    def add_message(self, role: str, content: str, message_type: str = "normal"):
        """Add a message to memory."""
        self._messages.append({
            "role": role,
            "content": content,
            "type": message_type,  # "system", "goal", "critical", "normal"
        })

    def get_context(self) -> List[Dict]:
        """Get current context window (role + content only, no internal metadata)."""
        if self.preserve_system:
            # Always include system/goal messages
            protected = [m for m in self._messages if m["type"] in ("system", "goal")]
            recent = [m for m in self._messages if m["type"] not in ("system", "goal")]
            recent = recent[-self.window_size:]
            all_msgs = protected + recent
        else:
            all_msgs = self._messages[-self.window_size:]

        # Strip internal type metadata before returning
        return [{"role": m["role"], "content": m["content"]} for m in all_msgs]

    def get_stats(self) -> dict:
        total_tokens_est = sum(len(m["content"].split()) * 1.3 for m in self._messages)
        context_tokens_est = sum(
            len(m["content"].split()) * 1.3 for m in self.get_context()
        )
        return {
            "total_messages": len(self._messages),
            "context_messages": len(self.get_context()),
            "estimated_total_tokens": int(total_tokens_est),
            "estimated_context_tokens": int(context_tokens_est),
            "messages_dropped": max(0, len(self._messages) - len(self.get_context())),
        }

When to use it: Short tasks with predictable step counts. When you don't care about the agent referencing early decisions. Development and prototyping.

Tradeoffs: Simple and zero-cost. But the agent may rediscover things it already found, contradict earlier decisions, or lose important context set up in early messages.

Strategy 2: Summarization Memory

Instead of dropping old messages, compress them into a rolling summary. The agent retains the gist of past work without the full token cost.

# memory/summarization_memory.py
from openai import OpenAI
from typing import List, Dict, Optional
import os


class SummarizationMemory:
    def __init__(
        self,
        window_size: int = 15,
        summarize_threshold: int = 30,
        summary_model: str = "gpt-4o-mini",
    ):
        self.window_size = window_size
        self.summarize_threshold = summarize_threshold
        self.summary_model = summary_model
        self.client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

        self._messages: List[Dict] = []
        self._summary: Optional[str] = None

    def add_message(self, role: str, content: str):
        self._messages.append({"role": role, "content": content})

        # Trigger summarization when buffer exceeds threshold
        if len(self._messages) > self.summarize_threshold:
            self._compress_old_messages()

    def _compress_old_messages(self):
        """Summarize messages outside the current window."""
        messages_to_compress = self._messages[:-self.window_size]
        self._messages = self._messages[-self.window_size:]

        # Prepare messages for summarization
        text_to_summarize = "\n".join([
            f"{m['role'].upper()}: {m['content'][:300]}"
            for m in messages_to_compress
        ])

        # Include existing summary if present
        existing_context = ""
        if self._summary:
            existing_context = f"Previous summary:\n{self._summary}\n\n"

        prompt = f"""{existing_context}Summarize the following agent conversation history.
        Focus on:
        1. What tasks were completed
        2. Key facts and data discovered
        3. Decisions made and their rationale
        4. Files or outputs created
        5. Current progress toward the main goal

        Be specific about numbers, names, and critical details.
        Maximum 300 words.

        Conversation to summarize:
        {text_to_summarize}"""

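        try:
            response = self.client.chat.completions.create(
                model=self.summary_model,
                messages=[{"role": "user", "content": prompt}],
                max_tokens=400,
            )
            # Keep the previous summary if the model returns an empty reply
            self._summary = response.choices[0].message.content or self._summary
        except Exception as e:
            # Degrade to plain sliding-window behavior rather than crash the agent
            print(f"Warning: summarization failed, keeping previous summary: {e}")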

    def get_context(self) -> List[Dict]:
        """Get current context including summary if available."""
        messages = []

        if self._summary:
            messages.append({
                "role": "system",
                "content": f"[Conversation History Summary]\n{self._summary}"
            })

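        # Include every buffered message: anything older than the window has not
        # been summarized yet, so slicing here would hide it from the agent.
        messages.extend(self._messages)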
        return messages

    def get_summary(self) -> Optional[str]:
        return self._summary

When to use it: Tasks over 30-60 steps. Research workflows where the agent needs to remember findings from earlier steps. Any task where coherence across the full session matters.
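
Each compression is an extra LLM call, so it can help to estimate how often it fires before settling on a window and threshold. The sketch below replays the trigger logic from `add_message` without calling any API. `count_compressions` is a hypothetical helper written for this illustration, not part of AutoGPT or the OpenAI SDK.

```python
def count_compressions(total_messages: int, window_size: int = 15,
                       summarize_threshold: int = 30) -> int:
    """Replay the add_message trigger logic and count summarization calls.

    No API is called; this only simulates how the buffer grows and shrinks.
    """
    buffered = 0      # messages currently held in the buffer
    compressions = 0  # how many times the summarizer would run

    for _ in range(total_messages):
        buffered += 1
        if buffered > summarize_threshold:
            # Compression keeps only the most recent `window_size` messages.
            buffered = window_size
            compressions += 1
    return compressions


if __name__ == "__main__":
    for n in (30, 31, 100, 500):
        print(n, "messages ->", count_compressions(n), "summarization calls")
```

With the defaults, the first call fires at message 31 and then once every 16 messages, so a 100-message session makes 5 summarization calls. Narrowing the gap between the threshold and the window makes compression fire more often.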

Strategy 3: Selective Retention Memory

Not all memories are equal. This strategy explicitly categorizes messages and applies different retention policies to each category.

# memory/selective_memory.py
from typing import List, Dict, Optional
from enum import Enum
import json


class MemoryTier(str, Enum):
    PERMANENT = "permanent"    # Never dropped (goals, critical decisions)
    LONG_TERM = "long_term"   # Kept for full session (key findings, facts)
    WORKING = "working"        # Recent context only (last N steps)
    EPHEMERAL = "ephemeral"   # Drop after single use (tool call intermediates)


class SelectiveMemory:
    def __init__(self, working_window: int = 10):
        self.working_window = working_window
        self._permanent: List[Dict] = []
        self._long_term: List[Dict] = []
        self._working: List[Dict] = []

    def add(self, role: str, content: str, tier: MemoryTier = MemoryTier.WORKING):
        message = {"role": role, "content": content}

        if tier == MemoryTier.PERMANENT:
            self._permanent.append(message)
        elif tier == MemoryTier.LONG_TERM:
            self._long_term.append(message)
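        elif tier == MemoryTier.WORKING:
            self._working.append(message)
            # Trim so the working list does not grow without bound
            self._working = self._working[-self.working_window:]
        # EPHEMERAL messages are not stored at all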

    def promote_to_long_term(self, content: str):
        """Promote a fact or finding to long-term memory."""
        self._long_term.append({
            "role": "system",
            "content": f"[Retained Fact] {content}"
        })

    def get_context(self) -> List[Dict]:
        """Assemble context from all tiers."""
        context = []
        context.extend(self._permanent)                     # Always present
        context.extend(self._long_term[-20:])               # Up to 20 long-term items
        context.extend(self._working[-self.working_window:]) # Recent working memory
        return context

    def extract_and_retain_facts(self, text: str, llm_client, model: str = "gpt-4o-mini"):
        """Use LLM to extract important facts from a message and retain them."""
        extraction_prompt = f"""Extract the most important facts from this text that should be
        remembered for future reference. Focus on: specific numbers, names, URLs, file paths,
        decisions made, constraints identified.

        Return a JSON array of strings. Each string should be a single, specific fact.
        Maximum 5 facts. Return [] if nothing important.

        Text: {text[:1000]}"""

        response = llm_client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": extraction_prompt}],
            max_tokens=200,
        )

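        try:
            facts = json.loads(response.choices[0].message.content)
        except (json.JSONDecodeError, TypeError):
            # Reply was empty or not valid JSON; skip retention for this message
            return

        # The model may return an object or non-string items despite the prompt
        if isinstance(facts, list):
            for fact in facts[:5]:
                if isinstance(fact, str):
                    self.promote_to_long_term(fact)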

Tradeoffs: Requires classifying messages at creation time. The automatic fact extraction adds cost. Works best when you can predict what types of information matter.
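
Since classification happens at creation time, one low-cost option is a rule-based tagger that runs just before the `add` call. The sketch below is only an assumption about what such rules might look like: `classify_message`, its keyword lists, and the `"tool"` role check are hypothetical and would need tuning for a real agent. `MemoryTier` is redefined here only so the snippet runs on its own.

```python
from enum import Enum


class MemoryTier(str, Enum):
    PERMANENT = "permanent"
    LONG_TERM = "long_term"
    WORKING = "working"
    EPHEMERAL = "ephemeral"


# Hypothetical markers; adjust to the vocabulary of your own agent's output.
DECISION_MARKERS = ("decided", "decision:", "constraint:", "must not", "must always")
FINDING_MARKERS = ("found", "result:", "http://", "https://", "saved to")


def classify_message(role: str, content: str, is_goal: bool = False) -> MemoryTier:
    """Assign a retention tier using simple keyword rules (illustrative only)."""
    text = content.lower()

    if is_goal or any(marker in text for marker in DECISION_MARKERS):
        return MemoryTier.PERMANENT
    if any(marker in text for marker in FINDING_MARKERS):
        return MemoryTier.LONG_TERM
    # Raw tool output is treated as an intermediate unless a rule above matched.
    if role == "tool":
        return MemoryTier.EPHEMERAL
    return MemoryTier.WORKING
```

A call site would then look like `memory.add(role, content, tier=classify_message(role, content))`. Keyword rules are cheap but brittle, so the LLM-based `extract_and_retain_facts` remains a useful second pass for messages the rules file under working memory.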

Strategy 4: External Vector Memory

Store all past agent outputs in a vector database. At each step, retrieve only the most relevant past context rather than keeping everything in the active window.

# memory/vector_memory.py
import os
import json
from typing import List, Dict, Optional
from openai import OpenAI
import numpy as np


class VectorMemory:
    """In-memory vector store (replace with Pinecone/Chroma for production)."""

    def __init__(self, embedding_model: str = "text-embedding-3-small"):
        self.client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
        self.embedding_model = embedding_model
        self.store: List[Dict] = []

    def _embed(self, text: str) -> List[float]:
        response = self.client.embeddings.create(
            input=text[:8000],
            model=self.embedding_model,
        )
        return response.data[0].embedding

    def _cosine_similarity(self, a: List[float], b: List[float]) -> float:
        a, b = np.array(a), np.array(b)
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def store_memory(self, content: str, metadata: Optional[dict] = None):
        """Store a piece of memory with its embedding."""
        embedding = self._embed(content)
        self.store.append({
            "content": content,
            "embedding": embedding,
            "metadata": metadata or {},
        })

    def retrieve_relevant(
        self,
        query: str,
        top_k: int = 5,
        threshold: float = 0.7
    ) -> List[str]:
        """Retrieve memories most relevant to the current query."""
        if not self.store:
            return []

        query_embedding = self._embed(query)

        # Score all memories
        scored = []
        for item in self.store:
            score = self._cosine_similarity(query_embedding, item["embedding"])
            if score >= threshold:
                scored.append((score, item["content"]))

        # Return top_k by score
        scored.sort(reverse=True, key=lambda x: x[0])
        return [content for _, content in scored[:top_k]]

    def build_memory_context(self, current_query: str) -> str:
        """Build a memory context string for injection into the prompt."""
        relevant = self.retrieve_relevant(current_query)
        if not relevant:
            return ""

        context_parts = ["[Relevant Memory Retrieval]"]
        for i, memory in enumerate(relevant, 1):
            context_parts.append(f"{i}. {memory[:200]}")

        return "\n".join(context_parts)

For production, replace this with a proper vector database. The vector database guide covers Pinecone, Chroma, and Weaviate in detail — all three work well for this pattern.

Tradeoffs: Higher cost, since every store and retrieve operation calls the embedding API. Adds latency. Requires external infrastructure for production. In exchange, long-term memory is limited by storage rather than by the context window.
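
Because every embedding call costs money and time, it can be useful to exercise the retrieval logic offline first. The sketch below swaps the embedding API for a toy bag-of-words vector. That substitution is an assumption made purely for testing and is no replacement for real embeddings; `toy_embed` and `retrieve` are hypothetical names. The score, filter, and top-k steps mirror `retrieve_relevant`.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def toy_embed(text: str) -> Dict[str, float]:
    """Stand-in for an embedding call: a sparse word-count vector."""
    return dict(Counter(text.lower().split()))


def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(value * b.get(word, 0.0) for word, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Guard against empty text instead of dividing by zero
    return dot / (norm_a * norm_b)


def retrieve(store: List[str], query: str, top_k: int = 2,
             threshold: float = 0.3) -> List[Tuple[float, str]]:
    """Score every stored memory, drop weak matches, return the best top_k."""
    query_vec = toy_embed(query)
    scored = [(cosine_similarity(query_vec, toy_embed(item)), item) for item in store]
    scored = [pair for pair in scored if pair[0] >= threshold]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    memories = [
        "battery density improved with solid state cells",
        "the report draft is saved in report.md",
        "solid state battery suppliers listed in suppliers.csv",
    ]
    for score, text in retrieve(memories, "solid state battery"):
        print(f"{score:.2f}  {text}")
```

The threshold here is tuned for word-count vectors. Scores from a real embedding model are distributed differently, so the 0.7 default in `retrieve_relevant` is worth checking against a handful of known-relevant queries before relying on it.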

Strategy 5: Hierarchical Memory

The most sophisticated approach. Memories are organized in tiers: immediate context, session summary, project history. Each tier feeds the one above it through progressive summarization.

# memory/hierarchical_memory.py
from openai import OpenAI
from typing import List, Dict, Optional
import os
import json
from datetime import datetime


class HierarchicalMemory:
    def __init__(
        self,
        immediate_window: int = 10,
        session_summary_interval: int = 25,
        llm_model: str = "gpt-4o-mini",
    ):
        self.immediate_window = immediate_window
        self.session_summary_interval = session_summary_interval
        self.llm_model = llm_model
        self.client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

        # Three memory tiers
        self.immediate: List[Dict] = []          # Last N messages
        self.session_summaries: List[str] = []   # Periodic session summaries
        self.project_memory: Dict = {            # Cross-session persistent state
            "goal": "",
            "key_decisions": [],
            "discovered_facts": [],
            "completed_tasks": [],
            "created_files": [],
        }

    def set_goal(self, goal: str):
        """Set the project goal — always kept in permanent memory."""
        self.project_memory["goal"] = goal

    def add_message(self, role: str, content: str):
        """Add a message and trigger compression if needed."""
        self.immediate.append({"role": role, "content": content})

        # Check if we need to create a session summary
        if len(self.immediate) >= self.session_summary_interval:
            self._create_session_summary()

    def _create_session_summary(self):
        """Compress current immediate memory into a session summary."""
        messages_text = "\n".join([
            f"{m['role']}: {m['content'][:250]}"
            for m in self.immediate[:-self.immediate_window]
        ])

        if not messages_text:
            return

        prompt = f"""Create a concise summary of these agent actions.
        Update the project state with any new information.

        Current project state:
        Goal: {self.project_memory['goal']}
        Known facts: {json.dumps(self.project_memory['discovered_facts'][-5:])}
        Completed tasks: {json.dumps(self.project_memory['completed_tasks'][-5:])}

        New messages to summarize:
        {messages_text}

        Return JSON with:
        {{
          "summary": "2-3 sentence narrative of what happened",
          "new_facts": ["fact1", "fact2"],
          "new_completed_tasks": ["task1"],
          "new_decisions": ["decision1"],
          "new_files": ["file_path1"]
        }}"""

        try:
            response = self.client.chat.completions.create(
                model=self.llm_model,
                messages=[{"role": "user", "content": prompt}],
                max_tokens=400,
                response_format={"type": "json_object"},
            )

            result = json.loads(response.choices[0].message.content)

            # Update project memory
            self.session_summaries.append(result.get("summary", ""))
            self.project_memory["discovered_facts"].extend(result.get("new_facts", []))
            self.project_memory["completed_tasks"].extend(result.get("new_completed_tasks", []))
            self.project_memory["key_decisions"].extend(result.get("new_decisions", []))
            self.project_memory["created_files"].extend(result.get("new_files", []))

            # Keep only recent immediate messages
            self.immediate = self.immediate[-self.immediate_window:]

        except Exception as e:
            print(f"Warning: Memory compression failed: {e}")
            # Fall back to simple truncation
            self.immediate = self.immediate[-self.immediate_window:]

    def get_context(self) -> List[Dict]:
        """Assemble full context from all memory tiers."""
        context = []

        # Project goal (always first)
        if self.project_memory["goal"]:
            context.append({
                "role": "system",
                "content": f"Project Goal: {self.project_memory['goal']}"
            })

        # Project state
        state_parts = []
        if self.project_memory["discovered_facts"]:
            facts = self.project_memory["discovered_facts"][-10:]
            state_parts.append(f"Known facts: {'; '.join(facts)}")
        if self.project_memory["completed_tasks"]:
            tasks = self.project_memory["completed_tasks"][-5:]
            state_parts.append(f"Completed: {'; '.join(tasks)}")
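        if self.project_memory["created_files"]:
            files = self.project_memory["created_files"][-5:]
            state_parts.append(f"Created files: {', '.join(files)}")
        # Surface stored decisions too; otherwise they are saved but never shown
        if self.project_memory["key_decisions"]:
            decisions = self.project_memory["key_decisions"][-5:]
            state_parts.append(f"Key decisions: {'; '.join(decisions)}")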

        if state_parts:
            context.append({
                "role": "system",
                "content": "[Project State]\n" + "\n".join(state_parts)
            })

        # Recent session summaries (last 3)
        if self.session_summaries:
            recent_summaries = self.session_summaries[-3:]
            context.append({
                "role": "system",
                "content": "[Recent History]\n" + "\n".join(recent_summaries)
            })

        # Immediate context
        context.extend(self.immediate)

        return context

    def save_to_disk(self, path: str):
        """Persist project memory across sessions."""
        state = {
            "project_memory": self.project_memory,
            "session_summaries": self.session_summaries,
            # utcnow() is deprecated since Python 3.12; store an offset-aware timestamp
            "saved_at": datetime.now().astimezone().isoformat(),
        }
        with open(path, "w") as f:
            json.dump(state, f, indent=2)

    def load_from_disk(self, path: str):
        """Resume project memory from a previous session."""
        with open(path) as f:
            state = json.load(f)
        self.project_memory = state["project_memory"]
        self.session_summaries = state["session_summaries"]
        print(f"Loaded memory from session saved at {state['saved_at']}")
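
One thing to watch in the class above: the summarizer only sees the last five known facts and tasks, so it can report older items as new, and nothing checks that each field really is a list of strings. The sketch below shows one way the merge step might be hardened. `merge_project_state` and `FIELD_MAP` are hypothetical names introduced for illustration; the dictionary keys match the ones used in `_create_session_summary`.

```python
from typing import Any, Dict, List

# Maps keys in the model's JSON reply to keys in project_memory (as used above).
FIELD_MAP = {
    "new_facts": "discovered_facts",
    "new_completed_tasks": "completed_tasks",
    "new_decisions": "key_decisions",
    "new_files": "created_files",
}


def merge_project_state(project_memory: Dict[str, Any], result: Dict[str, Any]) -> int:
    """Merge a parsed summary reply into project_memory, skipping bad or repeated items.

    Returns the number of items actually added.
    """
    added = 0
    for reply_key, memory_key in FIELD_MAP.items():
        items = result.get(reply_key, [])
        if not isinstance(items, list):
            continue  # Model returned something other than a list; ignore it
        existing: List[str] = project_memory.setdefault(memory_key, [])
        seen = {entry.strip().lower() for entry in existing if isinstance(entry, str)}
        for item in items:
            if not isinstance(item, str) or not item.strip():
                continue  # Keep only non-empty strings so '; '.join(...) stays safe
            normalized = item.strip().lower()
            if normalized not in seen:
                existing.append(item.strip())
                seen.add(normalized)
                added += 1
    return added
```

Case-insensitive exact matching only catches literal repeats. Paraphrased duplicates would need fuzzy or embedding-based comparison, which is left out here to keep the sketch dependency-free.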

Strategy Comparison

Strategy	Token Cost	Coherence	Complexity	Best For
Sliding Window	Zero	Low — loses early context	Minimal	Short tasks, prototyping
Summarization	Low	Medium — retains gist	Low	Medium tasks, 30-100 steps
Selective Retention	Low	High for tagged content	Medium	Tasks with known critical facts
Vector Memory	Medium	High — semantic retrieval	High	Multi-session, large knowledge base
Hierarchical	Medium	Highest — structured state	High	Complex long-running projects

Implementing Memory in an AutoGen Agent

import autogen
from memory.hierarchical_memory import HierarchicalMemory

memory = HierarchicalMemory(immediate_window=10, session_summary_interval=25)
memory.set_goal("Research and write a comprehensive report on battery technology advances in 2026")

# If resuming a session:
# memory.load_from_disk("project_memory.json")

# Inject memory context into each agent interaction
def get_messages_with_memory(user_message: str) -> list:
    # add_message stores the new turn, so get_context() already ends with it;
    # appending it again would send the user message twice.
    memory.add_message("user", user_message)
    return memory.get_context()

# After each agent response:
def store_agent_response(response: str):
    memory.add_message("assistant", response)
    memory.save_to_disk("project_memory.json")  # Auto-save

The goal in every case is the same: the agent should always know where it is, where it came from, and where it's going — even when the raw conversation history is too long to fit in context.
