How to Deploy AutoGen Agents as APIs with FastAPI (2026)
Learn to serve AutoGen multi-agent systems as production REST APIs using FastAPI with async endpoints and real-time streaming responses.
Running a multi-agent AutoGen system interactively is useful for prototyping. But when you need other services, frontends, or third-party tools to talk to your agents, you need a proper API layer. FastAPI is the obvious choice: it's async-native, fast, and ships with automatic OpenAPI docs.
This guide walks through turning an AutoGen agent system into a production-grade REST API. You'll get a full working example including async endpoints, session management, and streaming responses.
Why Expose Agents as APIs
Before jumping into code, it's worth understanding what you gain from this architecture. When your AutoGen agents live behind an HTTP API, they become composable services. Your React frontend, your Slack bot, your mobile app, and your backend services can all talk to the same agent logic without duplicating anything.
The alternative, embedding agent logic directly in each consumer, creates a maintenance nightmare. Every prompt change, every model swap, every tool addition needs to be rolled out to every consumer separately. An API-first approach centralizes all of that.
There's also the operational angle. FastAPI gives you request validation, authentication hooks, a middleware layer where rate limiting can be plugged in, and automatic documentation. These aren't things you want to build yourself on top of raw AutoGen.
Project Setup
Start with a clean virtual environment and install the dependencies:
pip install pyautogen fastapi "uvicorn[standard]" python-dotenv redis pydantic
Your project structure should look like this:
agent_api/
├── main.py
├── agents/
│   ├── __init__.py
│   ├── base.py
│   └── research.py
├── models/
│   ├── __init__.py
│   └── schemas.py
├── services/
│   ├── __init__.py
│   └── session.py
└── .env
Set your environment variables in .env:
OPENAI_API_KEY=sk-...
REDIS_URL=redis://localhost:6379
SESSION_TTL=3600
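These three values can be read and validated once at startup instead of being looked up in several modules. The sketch below shows one way to do that with only the standard library. `Settings` and `load_settings` are hypothetical names that are not part of the project above, and the code assumes the variables are already present in the process environment (for example after python-dotenv has loaded the .env file).

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the values defined in .env."""
    openai_api_key: str
    redis_url: str
    session_ttl: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from a mapping (defaults to os.environ) and validate them."""
    env = os.environ if env is None else env

    api_key = env.get("OPENAI_API_KEY", "").strip()
    if not api_key:
        raise ValueError("OPENAI_API_KEY is not set")

    ttl_raw = env.get("SESSION_TTL", "3600")
    try:
        ttl = int(ttl_raw)
    except ValueError:
        raise ValueError(f"SESSION_TTL must be an integer, got {ttl_raw!r}") from None
    if ttl <= 0:
        raise ValueError("SESSION_TTL must be positive")

    return Settings(
        openai_api_key=api_key,
        redis_url=env.get("REDIS_URL", "redis://localhost:6379"),
        session_ttl=ttl,
    )
```

Failing fast here means a missing key shows up when the server boots rather than on the first agent call.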
Defining Your Request and Response Schemas
Clean Pydantic schemas are the foundation of a good FastAPI service. Define them before writing any agent logic:
# models/schemas.py
from datetime import datetime, timezone
from typing import List, Optional

from pydantic import BaseModel, Field


def _utc_now() -> datetime:
    # datetime.utcnow() is deprecated since Python 3.12; use an aware UTC timestamp
    return datetime.now(timezone.utc)


class ChatMessage(BaseModel):
    role: str = Field(..., description="'user' or 'assistant'")
    content: str
    timestamp: datetime = Field(default_factory=_utc_now)


class AgentRequest(BaseModel):
    message: str = Field(..., min_length=1, max_length=4000)
    session_id: Optional[str] = Field(None, description="Resume existing session")
    max_turns: int = Field(default=5, ge=1, le=20)
    stream: bool = Field(default=False)


class AgentResponse(BaseModel):
    session_id: str
    response: str
    turns_used: int
    messages: List[ChatMessage]
    finished: bool


class StreamChunk(BaseModel):
    session_id: str
    chunk: str
    done: bool = False
Session Management with Redis
Each API request needs to carry its own conversation context. Without session management, every call starts a brand-new conversation with no memory of what came before.
# services/session.py
import json
import os
import uuid
from typing import List, Optional

import redis.asyncio as aioredis


class SessionManager:
    def __init__(self):
        self.redis = aioredis.from_url(
            os.getenv("REDIS_URL", "redis://localhost:6379"),
            decode_responses=True
        )
        self.ttl = int(os.getenv("SESSION_TTL", 3600))

    async def create_session(self) -> str:
        session_id = str(uuid.uuid4())
        await self.redis.setex(
            f"session:{session_id}",
            self.ttl,
            json.dumps([])
        )
        return session_id

    async def get_history(self, session_id: str) -> Optional[List[dict]]:
        data = await self.redis.get(f"session:{session_id}")
        if data is None:
            return None
        return json.loads(data)

    async def save_history(self, session_id: str, history: List[dict]):
        await self.redis.setex(
            f"session:{session_id}",
            self.ttl,
            json.dumps(history)
        )

    async def delete_session(self, session_id: str):
        await self.redis.delete(f"session:{session_id}")
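For unit tests or local development without a Redis instance, the same interface can be backed by a plain dictionary. The sketch below is a hypothetical `InMemorySessionManager` (not part of the original project) that mirrors the four methods above and emulates the TTL with a monotonic clock. It is only suitable for a single process: with several uvicorn workers, each worker would hold its own separate store.

```python
import json
import time
import uuid
from typing import Dict, List, Optional, Tuple


class InMemorySessionManager:
    """Hypothetical in-process stand-in for the Redis-backed SessionManager."""

    def __init__(self, ttl: int = 3600):
        self.ttl = ttl
        # session key -> (expiry timestamp, JSON-encoded history)
        self._store: Dict[str, Tuple[float, str]] = {}

    def _put(self, session_id: str, history: List[dict]) -> None:
        expires_at = time.monotonic() + self.ttl
        self._store[f"session:{session_id}"] = (expires_at, json.dumps(history))

    async def create_session(self) -> str:
        session_id = str(uuid.uuid4())
        self._put(session_id, [])
        return session_id

    async def get_history(self, session_id: str) -> Optional[List[dict]]:
        key = f"session:{session_id}"
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, payload = entry
        if time.monotonic() >= expires_at:
            # Expired: drop the entry, similar to what a Redis TTL would do
            del self._store[key]
            return None
        return json.loads(payload)

    async def save_history(self, session_id: str, history: List[dict]) -> None:
        # Like SETEX, saving also refreshes the TTL
        self._put(session_id, history)

    async def delete_session(self, session_id: str) -> None:
        self._store.pop(f"session:{session_id}", None)
```

Because the method names and return types match, a test can swap this class in wherever the Redis-backed manager is used.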
Building the AutoGen Agent Layer
Now set up the actual agents. The key insight here is that agents need to be created fresh for each request (or at least have their history reset) to avoid state bleed between users:
# agents/research.py
# Assumes the classic autogen 0.2-style API, where each agent's chat_messages
# maps a peer agent to the list of messages exchanged with it.
import asyncio
import os
from typing import List, Optional, Tuple

import autogen


def build_research_team(
    chat_history: Optional[List[dict]] = None,
) -> Tuple[autogen.AssistantAgent, autogen.UserProxyAgent]:
    llm_config = {
        "config_list": [
            {
                "model": "gpt-4o",
                "api_key": os.getenv("OPENAI_API_KEY"),
            }
        ],
        "temperature": 0.1,
        "timeout": 60,
    }
    assistant = autogen.AssistantAgent(
        name="ResearchAssistant",
        system_message="""You are a research assistant that provides accurate,
well-structured answers. Break complex topics into clear sections.
Always cite your reasoning. When you have fully answered the question,
end your response with TERMINATE.""",
        llm_config=llm_config,
    )
    user_proxy = autogen.UserProxyAgent(
        name="UserProxy",
        human_input_mode="NEVER",
        max_consecutive_auto_reply=10,
        is_termination_msg=lambda msg: "TERMINATE" in (msg.get("content") or ""),
        code_execution_config=False,
    )
    # Restore chat history if the session exists. The stored history is written
    # from the assistant's point of view, so roles are mirrored for the proxy.
    if chat_history:
        for msg in chat_history:
            role = msg.get("role", "user")
            content = msg.get("content") or ""
            assistant.chat_messages.setdefault(user_proxy, []).append(
                {"role": role, "content": content}
            )
            mirrored = "assistant" if role == "user" else "user"
            user_proxy.chat_messages.setdefault(assistant, []).append(
                {"role": mirrored, "content": content}
            )
    return assistant, user_proxy


async def run_agent_async(
    message: str,
    chat_history: List[dict],
    max_turns: int,
) -> Tuple[str, List[dict]]:
    assistant, user_proxy = build_research_team(chat_history)
    # initiate_chat is synchronous, so run it in the default thread pool
    # to keep the event loop free for other requests
    loop = asyncio.get_running_loop()
    await loop.run_in_executor(
        None,
        lambda: user_proxy.initiate_chat(
            assistant,
            message=message,
            max_turns=max_turns,
            clear_history=False,  # keep the restored history instead of wiping it
            silent=True,
        ),
    )
    # Extract the last assistant message
    messages = assistant.chat_messages.get(user_proxy, [])
    last_response = ""
    for msg in reversed(messages):
        if msg.get("role") == "assistant":
            content = msg.get("content") or ""  # content may be None
            last_response = content.replace("TERMINATE", "").strip()
            break
    # Build updated history for persistence
    updated_history = [
        {"role": msg.get("role", "user"), "content": msg.get("content") or ""}
        for msg in messages
    ]
    return last_response, updated_history
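The reply-extraction step at the end of `run_agent_async` can also be pulled out into a pure function, which makes it testable without calling an LLM. The sketch below uses a hypothetical helper name, `extract_final_reply`, and assumes messages are dicts with `role` and `content` keys as in the code above. Unlike the loop above, it skips assistant messages that are empty once the stop token is removed.

```python
from typing import List, Optional


def extract_final_reply(messages: List[dict], stop_token: str = "TERMINATE") -> str:
    """Return the most recent non-empty assistant message with the stop token removed."""
    for msg in reversed(messages):
        if msg.get("role") != "assistant":
            continue
        # content can be None, so normalise it before string operations
        content: Optional[str] = msg.get("content")
        text = (content or "").replace(stop_token, "").strip()
        if text:
            return text
    return ""
```

Keeping this logic separate also gives you one place to change if you later switch to a different termination marker.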
The Main FastAPI Application
Now wire everything together in main.py:
# main.py
import asyncio
from typing import AsyncGenerator

from dotenv import load_dotenv

load_dotenv()  # read .env before anything calls os.getenv

from fastapi import FastAPI, HTTPException, Depends  # Depends is used once auth is added
from fastapi.responses import StreamingResponse
from fastapi.middleware.cors import CORSMiddleware

from models.schemas import AgentRequest, AgentResponse, ChatMessage, StreamChunk
from services.session import SessionManager
from agents.research import run_agent_async

app = FastAPI(
    title="AutoGen Agent API",
    description="Multi-agent research assistant powered by AutoGen",
    version="1.0.0",
)

# Wide open for local development; restrict origins before exposing this publicly
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)

session_manager = SessionManager()


@app.post("/chat", response_model=AgentResponse)
async def chat(request: AgentRequest):
    """Standard request-response chat endpoint."""
    # Resolve or create session
    session_id = request.session_id
    if session_id:
        history = await session_manager.get_history(session_id)
        if history is None:
            raise HTTPException(status_code=404, detail="Session not found or expired")
    else:
        session_id = await session_manager.create_session()
        history = []
    try:
        response_text, updated_history = await run_agent_async(
            message=request.message,
            chat_history=history,
            max_turns=request.max_turns,
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Agent error: {str(e)}")
    await session_manager.save_history(session_id, updated_history)
    return AgentResponse(
        session_id=session_id,
        response=response_text,
        turns_used=len([m for m in updated_history if m["role"] == "assistant"]),
        # Return the conversation so far instead of an always-empty list
        messages=[
            ChatMessage(role=m["role"], content=m.get("content") or "")
            for m in updated_history
        ],
        finished=True,
    )


@app.post("/chat/stream")
async def chat_stream(request: AgentRequest):
    """Streaming endpoint that yields response chunks as SSE."""
    session_id = request.session_id
    if session_id:
        history = await session_manager.get_history(session_id)
        if history is None:
            raise HTTPException(status_code=404, detail="Session not found")
    else:
        session_id = await session_manager.create_session()
        history = []

    async def generate_stream() -> AsyncGenerator[str, None]:
        # Signal session start
        start_chunk = StreamChunk(session_id=session_id, chunk="", done=False)
        yield f"data: {start_chunk.model_dump_json()}\n\n"
        try:
            response_text, updated_history = await run_agent_async(
                message=request.message,
                chat_history=history,
                max_turns=request.max_turns,
            )
            # The full reply is already available here; emitting it word by
            # word only simulates streaming
            words = response_text.split(" ")
            for i, word in enumerate(words):
                chunk_text = word + (" " if i < len(words) - 1 else "")
                chunk = StreamChunk(
                    session_id=session_id,
                    chunk=chunk_text,
                    done=False
                )
                yield f"data: {chunk.model_dump_json()}\n\n"
                await asyncio.sleep(0.02)  # Simulate streaming delay
            await session_manager.save_history(session_id, updated_history)
            done_chunk = StreamChunk(session_id=session_id, chunk="", done=True)
            yield f"data: {done_chunk.model_dump_json()}\n\n"
        except Exception as e:
            error_chunk = StreamChunk(
                session_id=session_id,
                chunk=f"Error: {str(e)}",
                done=True
            )
            yield f"data: {error_chunk.model_dump_json()}\n\n"

    return StreamingResponse(
        generate_stream(),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "X-Accel-Buffering": "no",
        }
    )


@app.delete("/session/{session_id}")
async def delete_session(session_id: str):
    """Clean up a session explicitly."""
    await session_manager.delete_session(session_id)
    return {"message": "Session deleted"}


@app.get("/health")
async def health():
    return {"status": "healthy", "version": "1.0.0"}
Running the Server
Start with uvicorn for development:
uvicorn main:app --reload --host 0.0.0.0 --port 8000
For production, use multiple workers:
uvicorn main:app --host 0.0.0.0 --port 8000 --workers 4 --timeout-keep-alive 30
Test your endpoints immediately at http://localhost:8000/docs.
Testing with curl
Verify the basic endpoint works:
# Start a new session
curl -X POST http://localhost:8000/chat \
-H "Content-Type: application/json" \
-d '{
"message": "What are the main differences between transformers and RNNs?",
"max_turns": 3
}'
# Continue the session
curl -X POST http://localhost:8000/chat \
-H "Content-Type: application/json" \
-d '{
"message": "Which performs better on long sequences?",
"session_id": "YOUR_SESSION_ID_HERE",
"max_turns": 3
}'
# Test streaming
curl -N http://localhost:8000/chat/stream \
-X POST \
-H "Content-Type: application/json" \
-d '{"message": "Explain attention mechanisms briefly"}'
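To consume the streaming endpoint from Python rather than curl, the client has to split the response into `data:` lines and decode each JSON payload. The sketch below shows that parsing step in isolation. `iter_sse_chunks` and `collect_response` are hypothetical helper names, and the payload fields (`chunk`, `done`) are assumed to match the StreamChunk schema defined earlier. Any iterable of decoded text lines works as input, for example the line iterator of an HTTP client's streaming response.

```python
import json
from typing import Iterable, Iterator, List


def iter_sse_chunks(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the decoded JSON payload of each `data:` line in an SSE stream."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        # Blank lines separate events; other SSE fields (event:, id:) are ignored here
        if not line.startswith("data:"):
            continue
        yield json.loads(line[len("data:"):].strip())


def collect_response(lines: Iterable[str]) -> str:
    """Concatenate chunk text until the server marks the stream as done."""
    parts: List[str] = []
    for payload in iter_sse_chunks(lines):
        # The final chunk is normally empty, but an error chunk carries text,
        # so append before checking the done flag
        parts.append(payload.get("chunk", ""))
        if payload.get("done"):
            break
    return "".join(parts)
```

A UI would typically render each chunk as it arrives instead of collecting them, but the parsing logic stays the same.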
Performance Considerations
AutoGen agents are not lightweight. Each LLM call has latency in the hundreds of milliseconds to seconds range. A few things to keep in mind when deploying:
| Concern | Recommendation | Expected Impact |
|---|---|---|
| Agent instantiation overhead | Reuse llm_config, create agents per request | Minimal; config is cheap |
| Concurrent requests | Use thread pool executor for sync AutoGen | Concurrent agent runs per worker are capped by the executor's thread count (Python's default is min(32, CPU count + 4)) |
| Session storage | Redis with TTL | Sub-millisecond history retrieval |
| LLM latency | Cache identical queries | Repeated questions skip the LLM call entirely on a cache hit |
| Memory growth | Enforce max history length | Prevents unbounded context growth |
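Two of the rows above, capping history length and caching identical queries, are small enough to sketch with the standard library. `trim_history` and `ResponseCache` are hypothetical names, the cache lives in a single process (a shared store such as Redis would be needed across workers), and caching is only a reasonable idea for questions asked without prior session history, since the same text can deserve a different answer mid-conversation.

```python
import hashlib
import time
from typing import Dict, List, Optional, Tuple


def trim_history(history: List[dict], max_messages: int = 20) -> List[dict]:
    """Keep only the most recent messages so stored context cannot grow without bound."""
    if max_messages <= 0:
        return []
    return history[-max_messages:]


class ResponseCache:
    """Hypothetical in-process cache for answers to identical first-turn questions."""

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl_seconds = ttl_seconds
        # cache key -> (time stored, cached response)
        self._entries: Dict[str, Tuple[float, str]] = {}

    @staticmethod
    def _key(message: str) -> str:
        # Normalise whitespace and case so trivially different inputs share a key
        normalised = " ".join(message.lower().split())
        return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

    def get(self, message: str) -> Optional[str]:
        entry = self._entries.get(self._key(message))
        if entry is None:
            return None
        stored_at, response = entry
        if time.monotonic() - stored_at > self.ttl_seconds:
            return None
        return response

    def put(self, message: str, response: str) -> None:
        self._entries[self._key(message)] = (time.monotonic(), response)
```

Trimming would run just before the history is saved, and the cache lookup just before the agent is invoked.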
According to benchmarks from the AutoGen team, GPT-4o responses typically arrive in 2-8 seconds for research-style queries. Plan your client timeouts accordingly: 30-60 seconds is a safe range.
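Client timeouts are easier to reason about when the server enforces a deadline of its own. The sketch below wraps a blocking call in a worker thread and stops waiting after a fixed number of seconds. `run_with_timeout` is a hypothetical helper, and `asyncio.to_thread` requires Python 3.9 or newer. Note the limitation stated in the comment: the thread is not killed, so an in-flight LLM call still runs to completion in the background.

```python
import asyncio
from typing import Any, Callable


async def run_with_timeout(blocking_call: Callable[[], Any], timeout_seconds: float) -> Any:
    """Run a blocking callable in a worker thread and stop waiting after a deadline."""
    try:
        return await asyncio.wait_for(
            asyncio.to_thread(blocking_call),
            timeout=timeout_seconds,
        )
    except asyncio.TimeoutError:
        # The worker thread itself cannot be interrupted; it keeps running in
        # the background while the caller gets a prompt error instead.
        raise TimeoutError(f"agent run exceeded {timeout_seconds} seconds") from None
```

In an endpoint, the raised `TimeoutError` could be mapped to a 504 response so callers can tell a slow agent apart from a failing one.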
Adding Authentication
Don't ship this without auth. A basic API key middleware takes five minutes to add:
import os

from fastapi import Depends, HTTPException, Security
from fastapi.security import APIKeyHeader

api_key_header = APIKeyHeader(name="X-API-Key")
# Drop blanks: "".split(",") yields [""], which would otherwise accept an empty key
VALID_API_KEYS = {k.strip() for k in os.getenv("API_KEYS", "").split(",") if k.strip()}


async def verify_api_key(api_key: str = Security(api_key_header)):
    if api_key not in VALID_API_KEYS:
        raise HTTPException(status_code=401, detail="Invalid API key")
    return api_key


# Add to any endpoint:
@app.post("/chat", response_model=AgentResponse, dependencies=[Depends(verify_api_key)])
async def chat(request: AgentRequest):
    ...
This pairs well with OpenAI API integration patterns where you're already managing credentials carefully.
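One detail worth testing in isolation is how the key list is parsed and compared. In Python, `"".split(",")` returns `[""]`, so an unset `API_KEYS` variable needs its blank entries filtered out or an empty key would count as valid. The sketch below uses hypothetical helper names (`parse_api_keys`, `is_valid_api_key`) and compares keys with `secrets.compare_digest` to avoid leaking information through timing.

```python
import secrets
from typing import FrozenSet


def parse_api_keys(raw: str) -> FrozenSet[str]:
    """Split a comma-separated key list, dropping blanks and surrounding whitespace."""
    return frozenset(key.strip() for key in raw.split(",") if key.strip())


def is_valid_api_key(candidate: str, valid_keys: FrozenSet[str]) -> bool:
    """Check the candidate against every configured key in constant time per key."""
    if not candidate:
        return False
    supplied = candidate.encode("utf-8")
    matched = False
    for key in valid_keys:
        # No early exit, so timing does not depend on which key matched
        if secrets.compare_digest(supplied, key.encode("utf-8")):
            matched = True
    return matched
```

These helpers could back the `verify_api_key` dependency without changing how endpoints declare it.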
Deploying to Production
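Once your API is working locally, containerize it. The Dockerfile below expects a requirements.txt in the project root listing the packages installed during setup: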
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "2"]
For a more complete deployment pipeline, check out the deploy AI model to production guide which covers container registries, load balancing, and health checks.
Multi-Agent Endpoints
The pattern above works for a single assistant. For multi-agent workflows (say, a researcher plus a critic plus a writer), you create a GroupChat and expose it the same way:
async def run_group_chat_async(message: str, max_turns: int) -> str:
    # llm_config is assumed to be the same dict built in agents/research.py
    def _run() -> str:
        # Fresh agents per request; system messages and other arguments are elided
        assistant = autogen.AssistantAgent("Researcher", llm_config=llm_config)
        critic = autogen.AssistantAgent("Critic", llm_config=llm_config)
        user_proxy = autogen.UserProxyAgent(
            "User", human_input_mode="NEVER", code_execution_config=False
        )
        groupchat = autogen.GroupChat(
            agents=[user_proxy, assistant, critic],
            messages=[],
            max_round=max_turns,
        )
        manager = autogen.GroupChatManager(groupchat=groupchat, llm_config=llm_config)
        user_proxy.initiate_chat(manager, message=message)
        # Return the last non-empty message
        for msg in reversed(groupchat.messages):
            if msg.get("content"):
                return msg["content"]
        return ""

    # Run the blocking group chat in the default thread pool
    return await asyncio.get_running_loop().run_in_executor(None, _run)
This is the core pattern behind how frameworks like CrewAI expose their multi-agent pipelines as services.
What to Monitor
Once deployed, the metrics that matter most for agent APIs are different from standard web services:
- LLM token usage per request: your biggest cost driver
- Agent turn count distribution: high turn counts signal prompt or task definition problems
- Session duration: long sessions with many turns may indicate the agent isn't resolving tasks
- Error rate by error type: context length errors vs API timeouts vs logic errors need different fixes
Connect these to Prometheus or Datadog and set alerts before you go live. A runaway agent burning tokens at scale is an expensive surprise.
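Before wiring up an external system, it can help to see what collecting these numbers looks like in code. The sketch below is a hypothetical in-process tally (`AgentMetrics` is not part of the original project, and it does not use a Prometheus or Datadog client). It covers token usage, turn distribution, and error rate by type; session duration is left out and would need timestamps stored alongside each session.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class AgentMetrics:
    """Hypothetical in-process tally of the metrics listed above."""
    requests: int = 0
    total_tokens: int = 0
    turn_counts: Counter = field(default_factory=Counter)
    errors_by_type: Counter = field(default_factory=Counter)

    def record_success(self, tokens_used: int, turns_used: int) -> None:
        self.requests += 1
        self.total_tokens += tokens_used
        self.turn_counts[turns_used] += 1

    def record_error(self, error: Exception) -> None:
        self.requests += 1
        # Group by exception class so timeouts and context-length errors stay separate
        self.errors_by_type[type(error).__name__] += 1

    def snapshot(self) -> Dict[str, object]:
        failed = sum(self.errors_by_type.values())
        succeeded = self.requests - failed
        return {
            "requests": self.requests,
            "avg_tokens_per_success": self.total_tokens / succeeded if succeeded else 0.0,
            "error_rate": failed / self.requests if self.requests else 0.0,
            "turn_distribution": dict(self.turn_counts),
            "errors_by_type": dict(self.errors_by_type),
        }
```

The same counters map naturally onto the counter and histogram types that monitoring systems provide, so this shape should carry over when you switch to a real exporter.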
Exposing AutoGen agents as FastAPI services is not complex, but it does require thinking carefully about state, concurrency, and operational concerns. The patterns here scale from a weekend project to a production service handling thousands of daily users with only configuration changes, not architectural rewrites. Start simple, measure everything, and add complexity only when the metrics demand it.
Frequently Asked Questions
Can AutoGen agents handle concurrent API requests? Yes. With FastAPI's async support and Python's asyncio, multiple agent conversations can run concurrently. Use separate ConversableAgent instances per request or implement a session management layer to isolate state between users.
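How do I stream AutoGen agent responses in real time? Use FastAPI's StreamingResponse with an async generator and set AutoGen's human_input_mode to NEVER. Note that the /chat/stream endpoint in this guide waits for the full reply and then emits it word by word, which only simulates streaming. True token-level streaming would require forwarding chunks from the LLM as they arrive, which the code here does not do.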
What's the best way to handle AutoGen agent state between API calls? Store conversation history in Redis or a database keyed by session ID. On each request, reload the chat history into the agent before continuing, then persist the updated history after the agent replies.
AiTechWorlds Team