AI Tips Prompting Python AI Tools Web Dev ChatGPT LLM Agent Dev Reviews Notes Free Books

AiTechWorlds

AutoGen agent served as REST API endpoint — FastAPI deployment

How to Deploy AutoGen Agents as APIs with FastAPI (2026)

⚡ Quick Answer

Learn to serve AutoGen multi-agent systems as production REST APIs using FastAPI with async endpoints and real-time streaming responses.

AiTechWorlds Team May 31, 2026 10 min read

#AutoGen #FastAPI #API deployment #multi-agent systems

📚Part of the Autogpt Autogen guide — explore all Autogpt Autogen articles→

Share:Facebook Twitter/X LinkedIn Telegram WhatsApp

📱

Get more content like this on Telegram!

Daily AI tips, notes & resources — free

Join Free →

Running a multi-agent AutoGen system interactively is useful for prototyping. But when you need other services, frontends, or third-party tools to talk to your agents, you need a proper API layer. FastAPI is the obvious choice — it's async-native, fast, and ships with automatic OpenAPI docs.

This guide walks through turning an AutoGen agent system into a production-grade REST API. You'll get a full working example including async endpoints, session management, and streaming responses.

Why Expose Agents as APIs

Before jumping into code, it's worth understanding what you gain from this architecture. When your AutoGen agents live behind an HTTP API, they become composable services. Your React frontend, your Slack bot, your mobile app, and your backend services can all talk to the same agent logic without duplicating anything.

The alternative — embedding agent logic directly in each consumer — creates a maintenance nightmare. Every prompt change, every model swap, every tool addition needs to be rolled out to every consumer separately. An API-first approach centralizes all of that.

There's also the operational angle. FastAPI gives you request validation, authentication hooks, rate limiting middleware, and automatic documentation. These aren't things you want to build yourself on top of raw AutoGen.

Project Setup

Start with a clean virtual environment and install the dependencies:

pip install pyautogen fastapi uvicorn[standard] python-dotenv redis pydantic

Your project structure should look like this:

agent_api/
├── main.py
├── agents/
│   ├── __init__.py
│   ├── base.py
│   └── research.py
├── models/
│   ├── __init__.py
│   └── schemas.py
├── services/
│   ├── __init__.py
│   └── session.py
└── .env

Set your environment variables in .env:

OPENAI_API_KEY=sk-...
REDIS_URL=redis://localhost:6379
SESSION_TTL=3600

Defining Your Request and Response Schemas

Clean Pydantic schemas are the foundation of a good FastAPI service. Define them before writing any agent logic:

# models/schemas.py
from pydantic import BaseModel, Field
from typing import Optional, List
from datetime import datetime


class ChatMessage(BaseModel):
    role: str = Field(..., description="'user' or 'assistant'")
    content: str
    timestamp: datetime = Field(default_factory=datetime.utcnow)


class AgentRequest(BaseModel):
    message: str = Field(..., min_length=1, max_length=4000)
    session_id: Optional[str] = Field(None, description="Resume existing session")
    max_turns: int = Field(default=5, ge=1, le=20)
    stream: bool = Field(default=False)


class AgentResponse(BaseModel):
    session_id: str
    response: str
    turns_used: int
    messages: List[ChatMessage]
    finished: bool


class StreamChunk(BaseModel):
    session_id: str
    chunk: str
    done: bool = False

Session Management with Redis

Each API request needs to carry its own conversation context. Without session management, every call starts a brand-new conversation with no memory of what came before.

# services/session.py
import json
import redis.asyncio as aioredis
from typing import Optional, List
import os
import uuid


class SessionManager:
    def __init__(self):
        self.redis = aioredis.from_url(
            os.getenv("REDIS_URL", "redis://localhost:6379"),
            decode_responses=True
        )
        self.ttl = int(os.getenv("SESSION_TTL", 3600))

    async def create_session(self) -> str:
        session_id = str(uuid.uuid4())
        await self.redis.setex(
            f"session:{session_id}",
            self.ttl,
            json.dumps([])
        )
        return session_id

    async def get_history(self, session_id: str) -> Optional[List[dict]]:
        data = await self.redis.get(f"session:{session_id}")
        if data is None:
            return None
        return json.loads(data)

    async def save_history(self, session_id: str, history: List[dict]):
        await self.redis.setex(
            f"session:{session_id}",
            self.ttl,
            json.dumps(history)
        )

    async def delete_session(self, session_id: str):
        await self.redis.delete(f"session:{session_id}")

Building the AutoGen Agent Layer

Now set up the actual agents. The key insight here is that agents need to be created fresh for each request (or at least have their history reset) to avoid state bleed between users:

# agents/research.py
import autogen
from typing import List, Tuple
import os


def build_research_team(chat_history: List[dict] = None) -> Tuple[autogen.AssistantAgent, autogen.UserProxyAgent]:
    llm_config = {
        "config_list": [
            {
                "model": "gpt-4o",
                "api_key": os.getenv("OPENAI_API_KEY"),
            }
        ],
        "temperature": 0.1,
        "timeout": 60,
    }

    assistant = autogen.AssistantAgent(
        name="ResearchAssistant",
        system_message="""You are a research assistant that provides accurate,
        well-structured answers. Break complex topics into clear sections.
        Always cite your reasoning. When you have fully answered the question,
        end your response with TERMINATE.""",
        llm_config=llm_config,
    )

    user_proxy = autogen.UserProxyAgent(
        name="UserProxy",
        human_input_mode="NEVER",
        max_consecutive_auto_reply=10,
        is_termination_msg=lambda msg: "TERMINATE" in (msg.get("content") or ""),
        code_execution_config=False,
    )

    # Restore chat history if session exists
    if chat_history:
        for msg in chat_history:
            if msg["role"] == "user":
                user_proxy.chat_messages[assistant] = user_proxy.chat_messages.get(assistant, [])
            assistant.chat_messages[user_proxy] = assistant.chat_messages.get(user_proxy, [])

    return assistant, user_proxy


async def run_agent_async(
    message: str,
    chat_history: List[dict],
    max_turns: int
) -> Tuple[str, List[dict]]:
    import asyncio

    assistant, user_proxy = build_research_team(chat_history)

    # Run the synchronous AutoGen chat in a thread pool
    loop = asyncio.get_event_loop()
    result = await loop.run_in_executor(
        None,
        lambda: user_proxy.initiate_chat(
            assistant,
            message=message,
            max_turns=max_turns,
            silent=True,
        )
    )

    # Extract the last assistant message
    messages = assistant.chat_messages.get(user_proxy, [])
    last_response = ""
    for msg in reversed(messages):
        if msg.get("role") == "assistant":
            content = msg.get("content", "")
            last_response = content.replace("TERMINATE", "").strip()
            break

    # Build updated history for persistence
    updated_history = []
    for msg in messages:
        updated_history.append({
            "role": msg.get("role", "user"),
            "content": msg.get("content", ""),
        })

    return last_response, updated_history

The Main FastAPI Application

Now wire everything together in main.py:

# main.py
import asyncio
import uuid
from fastapi import FastAPI, HTTPException, Depends
from fastapi.responses import StreamingResponse
from fastapi.middleware.cors import CORSMiddleware
import json
from typing import AsyncGenerator

from models.schemas import AgentRequest, AgentResponse, StreamChunk
from services.session import SessionManager
from agents.research import run_agent_async

app = FastAPI(
    title="AutoGen Agent API",
    description="Multi-agent research assistant powered by AutoGen",
    version="1.0.0",
)

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)

session_manager = SessionManager()


@app.post("/chat", response_model=AgentResponse)
async def chat(request: AgentRequest):
    """Standard request-response chat endpoint."""
    # Resolve or create session
    session_id = request.session_id
    if session_id:
        history = await session_manager.get_history(session_id)
        if history is None:
            raise HTTPException(status_code=404, detail="Session not found or expired")
    else:
        session_id = await session_manager.create_session()
        history = []

    try:
        response_text, updated_history = await run_agent_async(
            message=request.message,
            chat_history=history,
            max_turns=request.max_turns,
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Agent error: {str(e)}")

    await session_manager.save_history(session_id, updated_history)

    return AgentResponse(
        session_id=session_id,
        response=response_text,
        turns_used=len([m for m in updated_history if m["role"] == "assistant"]),
        messages=[],
        finished=True,
    )


@app.post("/chat/stream")
async def chat_stream(request: AgentRequest):
    """Streaming endpoint that yields response chunks as SSE."""
    session_id = request.session_id
    if session_id:
        history = await session_manager.get_history(session_id)
        if history is None:
            raise HTTPException(status_code=404, detail="Session not found")
    else:
        session_id = await session_manager.create_session()
        history = []

    async def generate_stream() -> AsyncGenerator[str, None]:
        # Signal session start
        start_chunk = StreamChunk(session_id=session_id, chunk="", done=False)
        yield f"data: {start_chunk.model_dump_json()}\n\n"

        try:
            response_text, updated_history = await run_agent_async(
                message=request.message,
                chat_history=history,
                max_turns=request.max_turns,
            )

            # Stream the response word by word
            words = response_text.split(" ")
            for i, word in enumerate(words):
                chunk_text = word + (" " if i < len(words) - 1 else "")
                chunk = StreamChunk(
                    session_id=session_id,
                    chunk=chunk_text,
                    done=False
                )
                yield f"data: {chunk.model_dump_json()}\n\n"
                await asyncio.sleep(0.02)  # Simulate streaming delay

            await session_manager.save_history(session_id, updated_history)

            done_chunk = StreamChunk(session_id=session_id, chunk="", done=True)
            yield f"data: {done_chunk.model_dump_json()}\n\n"

        except Exception as e:
            error_chunk = StreamChunk(
                session_id=session_id,
                chunk=f"Error: {str(e)}",
                done=True
            )
            yield f"data: {error_chunk.model_dump_json()}\n\n"

    return StreamingResponse(
        generate_stream(),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "X-Accel-Buffering": "no",
        }
    )


@app.delete("/session/{session_id}")
async def delete_session(session_id: str):
    """Clean up a session explicitly."""
    await session_manager.delete_session(session_id)
    return {"message": "Session deleted"}


@app.get("/health")
async def health():
    return {"status": "healthy", "version": "1.0.0"}

Running the Server

Start with uvicorn for development:

uvicorn main:app --reload --host 0.0.0.0 --port 8000

For production, use multiple workers:

uvicorn main:app --host 0.0.0.0 --port 8000 --workers 4 --timeout-keep-alive 30

Test your endpoints immediately at http://localhost:8000/docs.

Testing with curl

Verify the basic endpoint works:

# Start a new session
curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{
    "message": "What are the main differences between transformers and RNNs?",
    "max_turns": 3
  }'

# Continue the session
curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{
    "message": "Which performs better on long sequences?",
    "session_id": "YOUR_SESSION_ID_HERE",
    "max_turns": 3
  }'

# Test streaming
curl -N http://localhost:8000/chat/stream \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{"message": "Explain attention mechanisms briefly"}'

Performance Considerations

AutoGen agents are not lightweight. Each LLM call has latency in the hundreds of milliseconds to seconds range. A few things to keep in mind when deploying:

Concern	Recommendation	Expected Impact
Agent instantiation overhead	Reuse llm_config, create agents per request	Minimal — config is cheap
Concurrent requests	Use thread pool executor for sync AutoGen	Handles 10-50 concurrent requests per worker
Session storage	Redis with TTL	Sub-millisecond history retrieval
LLM latency	Cache identical queries	90%+ reduction for repeated questions
Memory growth	Enforce max history length	Prevents unbounded context growth

According to benchmarks from the AutoGen team, GPT-4o responses typically arrive in 2-8 seconds for research-style queries. Plan your client timeouts accordingly — 30-60 seconds is a safe range.

Adding Authentication

Don't ship this without auth. A basic API key middleware takes five minutes to add:

from fastapi import Security
from fastapi.security import APIKeyHeader

api_key_header = APIKeyHeader(name="X-API-Key")
VALID_API_KEYS = set(os.getenv("API_KEYS", "").split(","))

async def verify_api_key(api_key: str = Security(api_key_header)):
    if api_key not in VALID_API_KEYS:
        raise HTTPException(status_code=401, detail="Invalid API key")
    return api_key

# Add to any endpoint:
@app.post("/chat", response_model=AgentResponse, dependencies=[Depends(verify_api_key)])
async def chat(request: AgentRequest):
    ...

This pairs well with OpenAI API integration patterns where you're already managing credentials carefully.

Deploying to Production

Once your API is working locally, containerize it:

FROM python:3.11-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "2"]

For a more complete deployment pipeline, check out the deploy AI model to production guide which covers container registries, load balancing, and health checks.

Multi-Agent Endpoints

The pattern above works for a single assistant. For multi-agent workflows — say, a researcher plus a critic plus a writer — you create a GroupChat and expose it the same way:

async def run_group_chat_async(message: str, max_turns: int) -> str:
    loop = asyncio.get_event_loop()

    def _run():
        assistant = autogen.AssistantAgent("Researcher", llm_config=llm_config, ...)
        critic = autogen.AssistantAgent("Critic", llm_config=llm_config, ...)
        user_proxy = autogen.UserProxyAgent("User", human_input_mode="NEVER", ...)

        groupchat = autogen.GroupChat(
            agents=[user_proxy, assistant, critic],
            messages=[],
            max_round=max_turns,
        )
        manager = autogen.GroupChatManager(groupchat=groupchat, llm_config=llm_config)

        user_proxy.initiate_chat(manager, message=message)

        # Return the last non-empty message
        for msg in reversed(groupchat.messages):
            if msg.get("content"):
                return msg["content"]
        return ""

    return await loop.run_in_executor(None, _run)

This is the core pattern behind how frameworks like CrewAI tutorial expose their multi-agent pipelines as services.

What to Monitor

Once deployed, the metrics that matter most for agent APIs are different from standard web services:

LLM token usage per request — your biggest cost driver
Agent turn count distribution — high turn counts signal prompt or task definition problems
Session duration — long sessions with many turns may indicate the agent isn't resolving tasks
Error rate by error type — context length errors vs API timeouts vs logic errors need different fixes

Connect these to Prometheus or Datadog and set alerts before you go live. A runaway agent burning tokens at scale is an expensive surprise.

Exposing AutoGen agents as FastAPI services is not complex, but it does require thinking carefully about state, concurrency, and operational concerns. The patterns here scale from a weekend project to a production service handling thousands of daily users with only configuration changes, not architectural rewrites. Start simple, measure everything, and add complexity only when the metrics demand it.

Frequently Asked Questions

Can AutoGen agents handle concurrent API requests? Yes. With FastAPI's async support and Python's asyncio, multiple agent conversations can run concurrently. Use separate ConversableAgent instances per request or implement a session management layer to isolate state between users.

How do I stream AutoGen agent responses in real time? Use FastAPI's StreamingResponse with an async generator. Configure AutoGen's human_input_mode to NEVER and hook into the agent's reply callback to yield chunks as they arrive from the LLM.

What's the best way to handle AutoGen agent state between API calls? Store conversation history in Redis or a database keyed by session ID. On each request, reload the chat history into the agent before continuing, then persist the updated history after the agent replies.

Share this article:Facebook Twitter/X LinkedIn Telegram WhatsApp

💬 DiscussionPowered by GitHub Discussions

Frequently Asked Questions

Yes. With FastAPI's async support and Python's asyncio, multiple agent conversations can run concurrently. Use separate ConversableAgent instances per request or implement a session management layer to isolate state between users.

AiTechWorlds Team

✓ Verified Writer

The AiTechWorlds team is passionate about AI, technology, and education. We create high-quality, research-backed content to help you learn, grow, and succeed in the modern digital world.

📱 Follow on Telegram 🐦 Follow on X Learn More →

AI agent role assignment diagram — AutoGen agent types roles

Agent Development

5 AutoGen Agent Roles (Assistant, UserProxy, CodeExecutor)

Understand the 5 core AutoGen agent types — AssistantAgent, UserProxyAgent, CodeExecutorAgent, and more — with code examples and a comparison table for each role.

May 31, 2026 11 min read

Azure OpenAI enterprise integration with AutoGen — managed private instances

Agent Development

How to Use AutoGen with Azure OpenAI (Enterprise Security)

Connect Microsoft AutoGen to Azure OpenAI for enterprise-grade AI agents. Step-by-step setup with private endpoints, OAI_CONFIG_LIST, and deployment config.

May 31, 2026 10 min read

AI agent automatically fixing code bugs — AutoGen code debugging auto-fix

Agent Development

Build a Code Debugging Agent with AutoGen (Auto-Fix PRs)

Build an AutoGen agent that reviews code, analyzes PR diffs, suggests fixes, and automates code quality improvements with a full working implementation.

May 31, 2026 11 min read

Python code being executed by AutoGen agent — code interpreter execution

Agent Development

How to Use AutoGen with Code Interpreter (Execute Python)

Learn how to set up AutoGen's code interpreter with LocalCommandLineCodeExecutor and DockerCommandLineCodeExecutor to safely execute Python in agent workflows.

May 31, 2026 10 min read

Go deeper on this topic

BookPython Mastery 2026 ProjectAI-Powered Resume Screening System

10K+ Members Growing Daily

Get Free AI Notes Daily

Join AiTechWorlds on Telegram and get daily AI tips, prompt engineering templates, coding resources, and exclusive content — 100% free!

📚 Free Study Notes🤖 AI Tips Daily⚡ Prompt Templates💻 Coding Resources

Join Free Channel

No spam. Leave anytime.

Autogpt Autogen

How to Deploy AutoGen Agents as APIs with FastAPI (2026)

⚡ Quick Answer

Learn to serve AutoGen multi-agent systems as production REST APIs using FastAPI with async endpoints and real-time streaming responses.

AiTechWorlds Team May 31, 2026 10 min read

#AutoGen #FastAPI #API deployment #multi-agent systems

📚Part of the Autogpt Autogen guide — explore all Autogpt Autogen articles→

Share:Facebook Twitter/X LinkedIn Telegram WhatsApp

📱

Get more content like this on Telegram!

Daily AI tips, notes & resources — free

Join Free →

This guide walks through turning an AutoGen agent system into a production-grade REST API. You'll get a full working example including async endpoints, session management, and streaming responses.

Why Expose Agents as APIs

Project Setup

Start with a clean virtual environment and install the dependencies:

pip install pyautogen fastapi uvicorn[standard] python-dotenv redis pydantic

Your project structure should look like this:

agent_api/
├── main.py
├── agents/
│   ├── __init__.py
│   ├── base.py
│   └── research.py
├── models/
│   ├── __init__.py
│   └── schemas.py
├── services/
│   ├── __init__.py
│   └── session.py
└── .env

Set your environment variables in .env:

OPENAI_API_KEY=sk-...
REDIS_URL=redis://localhost:6379
SESSION_TTL=3600

Defining Your Request and Response Schemas

Clean Pydantic schemas are the foundation of a good FastAPI service. Define them before writing any agent logic:

# models/schemas.py
from pydantic import BaseModel, Field
from typing import Optional, List
from datetime import datetime


class ChatMessage(BaseModel):
    role: str = Field(..., description="'user' or 'assistant'")
    content: str
    timestamp: datetime = Field(default_factory=datetime.utcnow)


class AgentRequest(BaseModel):
    message: str = Field(..., min_length=1, max_length=4000)
    session_id: Optional[str] = Field(None, description="Resume existing session")
    max_turns: int = Field(default=5, ge=1, le=20)
    stream: bool = Field(default=False)


class AgentResponse(BaseModel):
    session_id: str
    response: str
    turns_used: int
    messages: List[ChatMessage]
    finished: bool


class StreamChunk(BaseModel):
    session_id: str
    chunk: str
    done: bool = False

Session Management with Redis

Each API request needs to carry its own conversation context. Without session management, every call starts a brand-new conversation with no memory of what came before.

# services/session.py
import json
import redis.asyncio as aioredis
from typing import Optional, List
import os
import uuid


class SessionManager:
    def __init__(self):
        self.redis = aioredis.from_url(
            os.getenv("REDIS_URL", "redis://localhost:6379"),
            decode_responses=True
        )
        self.ttl = int(os.getenv("SESSION_TTL", 3600))

    async def create_session(self) -> str:
        session_id = str(uuid.uuid4())
        await self.redis.setex(
            f"session:{session_id}",
            self.ttl,
            json.dumps([])
        )
        return session_id

    async def get_history(self, session_id: str) -> Optional[List[dict]]:
        data = await self.redis.get(f"session:{session_id}")
        if data is None:
            return None
        return json.loads(data)

    async def save_history(self, session_id: str, history: List[dict]):
        await self.redis.setex(
            f"session:{session_id}",
            self.ttl,
            json.dumps(history)
        )

    async def delete_session(self, session_id: str):
        await self.redis.delete(f"session:{session_id}")

Building the AutoGen Agent Layer

Now set up the actual agents. The key insight here is that agents need to be created fresh for each request (or at least have their history reset) to avoid state bleed between users:

# agents/research.py
import autogen
from typing import List, Tuple
import os


def build_research_team(chat_history: List[dict] = None) -> Tuple[autogen.AssistantAgent, autogen.UserProxyAgent]:
    llm_config = {
        "config_list": [
            {
                "model": "gpt-4o",
                "api_key": os.getenv("OPENAI_API_KEY"),
            }
        ],
        "temperature": 0.1,
        "timeout": 60,
    }

    assistant = autogen.AssistantAgent(
        name="ResearchAssistant",
        system_message="""You are a research assistant that provides accurate,
        well-structured answers. Break complex topics into clear sections.
        Always cite your reasoning. When you have fully answered the question,
        end your response with TERMINATE.""",
        llm_config=llm_config,
    )

    user_proxy = autogen.UserProxyAgent(
        name="UserProxy",
        human_input_mode="NEVER",
        max_consecutive_auto_reply=10,
        is_termination_msg=lambda msg: "TERMINATE" in (msg.get("content") or ""),
        code_execution_config=False,
    )

    # Restore chat history if session exists
    if chat_history:
        for msg in chat_history:
            if msg["role"] == "user":
                user_proxy.chat_messages[assistant] = user_proxy.chat_messages.get(assistant, [])
            assistant.chat_messages[user_proxy] = assistant.chat_messages.get(user_proxy, [])

    return assistant, user_proxy


async def run_agent_async(
    message: str,
    chat_history: List[dict],
    max_turns: int
) -> Tuple[str, List[dict]]:
    import asyncio

    assistant, user_proxy = build_research_team(chat_history)

    # Run the synchronous AutoGen chat in a thread pool
    loop = asyncio.get_event_loop()
    result = await loop.run_in_executor(
        None,
        lambda: user_proxy.initiate_chat(
            assistant,
            message=message,
            max_turns=max_turns,
            silent=True,
        )
    )

    # Extract the last assistant message
    messages = assistant.chat_messages.get(user_proxy, [])
    last_response = ""
    for msg in reversed(messages):
        if msg.get("role") == "assistant":
            content = msg.get("content", "")
            last_response = content.replace("TERMINATE", "").strip()
            break

    # Build updated history for persistence
    updated_history = []
    for msg in messages:
        updated_history.append({
            "role": msg.get("role", "user"),
            "content": msg.get("content", ""),
        })

    return last_response, updated_history

The Main FastAPI Application

Now wire everything together in main.py:

# main.py
import asyncio
import uuid
from fastapi import FastAPI, HTTPException, Depends
from fastapi.responses import StreamingResponse
from fastapi.middleware.cors import CORSMiddleware
import json
from typing import AsyncGenerator

from models.schemas import AgentRequest, AgentResponse, StreamChunk
from services.session import SessionManager
from agents.research import run_agent_async

app = FastAPI(
    title="AutoGen Agent API",
    description="Multi-agent research assistant powered by AutoGen",
    version="1.0.0",
)

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)

session_manager = SessionManager()


@app.post("/chat", response_model=AgentResponse)
async def chat(request: AgentRequest):
    """Standard request-response chat endpoint."""
    # Resolve or create session
    session_id = request.session_id
    if session_id:
        history = await session_manager.get_history(session_id)
        if history is None:
            raise HTTPException(status_code=404, detail="Session not found or expired")
    else:
        session_id = await session_manager.create_session()
        history = []

    try:
        response_text, updated_history = await run_agent_async(
            message=request.message,
            chat_history=history,
            max_turns=request.max_turns,
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Agent error: {str(e)}")

    await session_manager.save_history(session_id, updated_history)

    return AgentResponse(
        session_id=session_id,
        response=response_text,
        turns_used=len([m for m in updated_history if m["role"] == "assistant"]),
        messages=[],
        finished=True,
    )


@app.post("/chat/stream")
async def chat_stream(request: AgentRequest):
    """Streaming endpoint that yields response chunks as SSE."""
    session_id = request.session_id
    if session_id:
        history = await session_manager.get_history(session_id)
        if history is None:
            raise HTTPException(status_code=404, detail="Session not found")
    else:
        session_id = await session_manager.create_session()
        history = []

    async def generate_stream() -> AsyncGenerator[str, None]:
        # Signal session start
        start_chunk = StreamChunk(session_id=session_id, chunk="", done=False)
        yield f"data: {start_chunk.model_dump_json()}\n\n"

        try:
            response_text, updated_history = await run_agent_async(
                message=request.message,
                chat_history=history,
                max_turns=request.max_turns,
            )

            # Stream the response word by word
            words = response_text.split(" ")
            for i, word in enumerate(words):
                chunk_text = word + (" " if i < len(words) - 1 else "")
                chunk = StreamChunk(
                    session_id=session_id,
                    chunk=chunk_text,
                    done=False
                )
                yield f"data: {chunk.model_dump_json()}\n\n"
                await asyncio.sleep(0.02)  # Simulate streaming delay

            await session_manager.save_history(session_id, updated_history)

            done_chunk = StreamChunk(session_id=session_id, chunk="", done=True)
            yield f"data: {done_chunk.model_dump_json()}\n\n"

        except Exception as e:
            error_chunk = StreamChunk(
                session_id=session_id,
                chunk=f"Error: {str(e)}",
                done=True
            )
            yield f"data: {error_chunk.model_dump_json()}\n\n"

    return StreamingResponse(
        generate_stream(),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "X-Accel-Buffering": "no",
        }
    )


@app.delete("/session/{session_id}")
async def delete_session(session_id: str):
    """Clean up a session explicitly."""
    await session_manager.delete_session(session_id)
    return {"message": "Session deleted"}


@app.get("/health")
async def health():
    return {"status": "healthy", "version": "1.0.0"}

Running the Server

Start with uvicorn for development:

uvicorn main:app --reload --host 0.0.0.0 --port 8000

For production, use multiple workers:

uvicorn main:app --host 0.0.0.0 --port 8000 --workers 4 --timeout-keep-alive 30

Test your endpoints immediately at http://localhost:8000/docs.

Testing with curl

Verify the basic endpoint works:

# Start a new session
curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{
    "message": "What are the main differences between transformers and RNNs?",
    "max_turns": 3
  }'

# Continue the session
curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{
    "message": "Which performs better on long sequences?",
    "session_id": "YOUR_SESSION_ID_HERE",
    "max_turns": 3
  }'

# Test streaming
curl -N http://localhost:8000/chat/stream \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{"message": "Explain attention mechanisms briefly"}'

Performance Considerations

AutoGen agents are not lightweight. Each LLM call has latency in the hundreds of milliseconds to seconds range. A few things to keep in mind when deploying:

Concern	Recommendation	Expected Impact
Agent instantiation overhead	Reuse llm_config, create agents per request	Minimal — config is cheap
Concurrent requests	Use thread pool executor for sync AutoGen	Handles 10-50 concurrent requests per worker
Session storage	Redis with TTL	Sub-millisecond history retrieval
LLM latency	Cache identical queries	90%+ reduction for repeated questions
Memory growth	Enforce max history length	Prevents unbounded context growth

According to benchmarks from the AutoGen team, GPT-4o responses typically arrive in 2-8 seconds for research-style queries. Plan your client timeouts accordingly — 30-60 seconds is a safe range.

Adding Authentication

Don't ship this without auth. A basic API key middleware takes five minutes to add:

from fastapi import Security
from fastapi.security import APIKeyHeader

api_key_header = APIKeyHeader(name="X-API-Key")
VALID_API_KEYS = set(os.getenv("API_KEYS", "").split(","))

async def verify_api_key(api_key: str = Security(api_key_header)):
    if api_key not in VALID_API_KEYS:
        raise HTTPException(status_code=401, detail="Invalid API key")
    return api_key

# Add to any endpoint:
@app.post("/chat", response_model=AgentResponse, dependencies=[Depends(verify_api_key)])
async def chat(request: AgentRequest):
    ...

This pairs well with OpenAI API integration patterns where you're already managing credentials carefully.

Deploying to Production

Once your API is working locally, containerize it:

FROM python:3.11-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "2"]

For a more complete deployment pipeline, check out the deploy AI model to production guide which covers container registries, load balancing, and health checks.

Multi-Agent Endpoints

The pattern above works for a single assistant. For multi-agent workflows — say, a researcher plus a critic plus a writer — you create a GroupChat and expose it the same way:

async def run_group_chat_async(message: str, max_turns: int) -> str:
    loop = asyncio.get_event_loop()

    def _run():
        assistant = autogen.AssistantAgent("Researcher", llm_config=llm_config, ...)
        critic = autogen.AssistantAgent("Critic", llm_config=llm_config, ...)
        user_proxy = autogen.UserProxyAgent("User", human_input_mode="NEVER", ...)

        groupchat = autogen.GroupChat(
            agents=[user_proxy, assistant, critic],
            messages=[],
            max_round=max_turns,
        )
        manager = autogen.GroupChatManager(groupchat=groupchat, llm_config=llm_config)

        user_proxy.initiate_chat(manager, message=message)

        # Return the last non-empty message
        for msg in reversed(groupchat.messages):
            if msg.get("content"):
                return msg["content"]
        return ""

    return await loop.run_in_executor(None, _run)

This is the core pattern behind how frameworks like CrewAI tutorial expose their multi-agent pipelines as services.

What to Monitor

Once deployed, the metrics that matter most for agent APIs are different from standard web services:

LLM token usage per request — your biggest cost driver
Agent turn count distribution — high turn counts signal prompt or task definition problems
Session duration — long sessions with many turns may indicate the agent isn't resolving tasks
Error rate by error type — context length errors vs API timeouts vs logic errors need different fixes

Connect these to Prometheus or Datadog and set alerts before you go live. A runaway agent burning tokens at scale is an expensive surprise.

Frequently Asked Questions

Share this article:Facebook Twitter/X LinkedIn Telegram WhatsApp

💬 DiscussionPowered by GitHub Discussions

Frequently Asked Questions

AiTechWorlds Team

✓ Verified Writer

The AiTechWorlds team is passionate about AI, technology, and education. We create high-quality, research-backed content to help you learn, grow, and succeed in the modern digital world.

📱 Follow on Telegram 🐦 Follow on X Learn More →

Agent Development

5 AutoGen Agent Roles (Assistant, UserProxy, CodeExecutor)

Understand the 5 core AutoGen agent types — AssistantAgent, UserProxyAgent, CodeExecutorAgent, and more — with code examples and a comparison table for each role.

May 31, 2026 11 min read

Agent Development

How to Use AutoGen with Azure OpenAI (Enterprise Security)

Connect Microsoft AutoGen to Azure OpenAI for enterprise-grade AI agents. Step-by-step setup with private endpoints, OAI_CONFIG_LIST, and deployment config.

May 31, 2026 10 min read

Agent Development

Build a Code Debugging Agent with AutoGen (Auto-Fix PRs)

Build an AutoGen agent that reviews code, analyzes PR diffs, suggests fixes, and automates code quality improvements with a full working implementation.

May 31, 2026 11 min read

Agent Development

How to Use AutoGen with Code Interpreter (Execute Python)

Learn how to set up AutoGen's code interpreter with LocalCommandLineCodeExecutor and DockerCommandLineCodeExecutor to safely execute Python in agent workflows.

May 31, 2026 10 min read

Go deeper on this topic

BookPython Mastery 2026 ProjectAI-Powered Resume Screening System

10K+ Members Growing Daily

Get Free AI Notes Daily

Join AiTechWorlds on Telegram and get daily AI tips, prompt engineering templates, coding resources, and exclusive content — 100% free!

📚 Free Study Notes🤖 AI Tips Daily⚡ Prompt Templates💻 Coding Resources

Join Free Channel

No spam. Leave anytime.

How to Deploy AutoGen Agents as APIs with FastAPI (2026)

Why Expose Agents as APIs

Project Setup

Defining Your Request and Response Schemas

Session Management with Redis

Building the AutoGen Agent Layer

The Main FastAPI Application

Running the Server

Testing with curl

Performance Considerations

Adding Authentication

Deploying to Production

Multi-Agent Endpoints

What to Monitor

Frequently Asked Questions

💬 DiscussionPowered by GitHub Discussions

Frequently Asked Questions

AiTechWorlds Team

Related Articles

5 AutoGen Agent Roles (Assistant, UserProxy, CodeExecutor)

How to Use AutoGen with Azure OpenAI (Enterprise Security)

Build a Code Debugging Agent with AutoGen (Auto-Fix PRs)

How to Use AutoGen with Code Interpreter (Execute Python)

Go deeper on this topic

Get Free AI Notes Daily

How to Deploy AutoGen Agents as APIs with FastAPI (2026)

Why Expose Agents as APIs

Project Setup

Defining Your Request and Response Schemas

Session Management with Redis

Building the AutoGen Agent Layer

The Main FastAPI Application

Running the Server

Testing with curl

Performance Considerations

Adding Authentication

Deploying to Production

Multi-Agent Endpoints

What to Monitor

Frequently Asked Questions

💬 DiscussionPowered by GitHub Discussions

Frequently Asked Questions

AiTechWorlds Team

Related Articles

5 AutoGen Agent Roles (Assistant, UserProxy, CodeExecutor)

How to Use AutoGen with Azure OpenAI (Enterprise Security)

Build a Code Debugging Agent with AutoGen (Auto-Fix PRs)

How to Use AutoGen with Code Interpreter (Execute Python)

Go deeper on this topic

Get Free AI Notes Daily