How to Use AutoGen with Local Models (GPT4All, Ooba, Ollama)
Run AutoGen agents entirely offline using GPT4All, Oobabooga, and Ollama local models. Full setup guide with LLM configs, API compatibility, and honest speed benchmarks.
Privacy regulations, air-gapped environments, and API cost control are driving more teams toward running AI agents on local hardware. The problem is that most AutoGen tutorials assume OpenAI credentials. If you want to run AutoGen entirely offline — or just want to avoid sending sensitive data to third-party servers — this guide is for you.
We'll cover three local model backends: GPT4All (easiest setup, great for quick tests), Oobabooga text-generation-webui (most flexible, supports the widest model range), and Ollama (best developer experience, recommended for most teams). Each section includes the exact LLM config to drop into your AutoGen setup.
For context on AutoGen agent patterns before wiring up local models, see AI agents explained and the AutoGen conversational patterns guide.
Why Local Models for AutoGen?
The case for local models isn't just privacy — though that's compelling for healthcare, finance, and legal applications. The economics matter too:
- Zero per-token cost after hardware investment
- No rate limits — run as many parallel agents as your hardware supports
- Latency control — no network round-trips, though GPU speed varies
- Data stays on-premise — critical for regulated industries
The trade-off is quality. Most local models in the 7B-13B range are noticeably weaker than GPT-4o on complex reasoning tasks. The 70B range closes that gap significantly. We'll address these trade-offs honestly in the comparison section.
Prerequisites
# Install AutoGen
pip install pyautogen
# Depending on your backend:
pip install gpt4all # GPT4All
pip install llama-cpp-python # GGUF loader (Oobabooga itself is installed from its repository, see Option 2)
# Ollama: install from https://ollama.ai
Hardware requirements:
- 7B models: 8GB RAM or 6GB VRAM
- 13B models: 16GB RAM or 8GB VRAM
- 70B models (quantized): 48GB RAM, or 24GB VRAM with part of the model offloaded to system RAM (the 4-bit weights alone are about 40GB)
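As a rough cross-check on these figures, weight memory can be approximated as parameter count times bits per weight. The sketch below is a back-of-the-envelope estimate only: the function names and the 20% overhead factor are assumptions made for illustration, and real model files come out somewhat larger because quantized formats also store scaling data and keep some layers at higher precision.

```python
def estimate_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def estimate_runtime_gb(params_billion: float, bits_per_weight: float,
                        overhead: float = 0.2) -> float:
    """Weights plus an assumed fractional overhead for KV cache and buffers."""
    return estimate_weight_gb(params_billion, bits_per_weight) * (1 + overhead)


for name, params in [("7B", 7), ("13B", 13), ("70B", 70)]:
    print(f"{name}: ~{estimate_weight_gb(params, 4):.1f} GB weights at 4-bit, "
          f"~{estimate_runtime_gb(params, 4):.1f} GB with overhead")
```

A 70B model at 4 bits works out to about 35GB of raw weights, which is why it cannot sit entirely inside a single 24GB GPU.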
Option 1: Ollama (Recommended)
Ollama is the simplest local model server. It exposes an OpenAI-compatible API at localhost:11434, which AutoGen can use without any adapter code.
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh
# Pull models
ollama pull llama3:8b # Fast, good for simple tasks
ollama pull llama3:70b # Best quality, needs strong hardware
ollama pull mistral:7b # Great for coding tasks
ollama pull codellama:13b # Specialized for code generation
ollama pull mixtral:8x7b # Strong all-rounder
# Start Ollama server (runs automatically after install on most systems)
ollama serve
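Before wiring up AutoGen, it helps to confirm that the server is reachable and that the models you pulled are visible. The sketch below queries Ollama's `/api/tags` endpoint, which lists locally available models, using only the standard library. The helper names are illustrative, and the address assumes the default port used throughout this guide.

```python
import json
import urllib.error
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address assumed in this guide


def parse_model_names(payload: dict) -> list:
    """Extract model tags from the JSON returned by Ollama's /api/tags."""
    return [m["name"] for m in payload.get("models", [])]


def list_local_models(base_url: str = OLLAMA_URL, timeout: float = 5.0):
    """Return pulled model tags, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return parse_model_names(json.load(resp))
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a response that is not valid JSON
        return None
```

Calling `list_local_models()` once at startup gives a clearer failure than a timeout buried inside an agent conversation: `None` means the server is down, and an empty list means nothing has been pulled yet.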
AutoGen configuration for Ollama:
# autogen_ollama_config.py
import autogen
# Ollama config — note the base_url pointing to local server
ollama_llm_config = {
"config_list": [
{
"model": "llama3:8b",
"api_key": "ollama", # Placeholder — Ollama doesn't require a real key
"base_url": "http://localhost:11434/v1", # OpenAI-compatible endpoint
"api_type": "openai"
}
],
"temperature": 0,
"timeout": 300 # Local models can be slow — set a generous timeout
}
# Create agents exactly as you would with OpenAI
assistant = autogen.AssistantAgent(
    name="LocalAssistant",
    llm_config=ollama_llm_config,
    # Ask for an explicit end marker so is_termination_msg below can fire
    system_message="You are a helpful coding assistant. Write clean Python code. Reply with DONE when the task is complete."
)
user = autogen.UserProxyAgent(
    name="User",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=5,
    code_execution_config={"work_dir": "local_output", "use_docker": False},
    # content can be None for some messages, so fall back to an empty string
    is_termination_msg=lambda msg: "DONE" in (msg.get("content") or "").upper()
)
user.initiate_chat(
assistant,
message="Write a Python script that reads a CSV file and calculates summary statistics."
)
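A plain substring check for "DONE" also fires on words such as "abandoned", or on an ordinary "done" in the middle of a sentence. A slightly stricter predicate is sketched here. `is_done` and `DONE_MARKER` are illustrative names, and the rule that the marker must appear on the last non-empty line is an assumption about how you prompt the model, not AutoGen behavior.

```python
import re

DONE_MARKER = re.compile(r"\bDONE\b")  # case-sensitive, whole word only


def is_done(msg: dict) -> bool:
    """True when the last non-empty line of the message contains DONE."""
    content = msg.get("content") or ""  # content may be missing or None
    lines = [line.strip() for line in content.splitlines() if line.strip()]
    return bool(lines) and bool(DONE_MARKER.search(lines[-1]))
```

It can be passed directly as `is_termination_msg=is_done` when constructing the `UserProxyAgent`.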
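Switching between models is as simple as changing the "model" field. You can also list several entries in config_list: AutoGen tries them in order and moves on to the next entry when a request fails or times out.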
# Multi-model fallback config
ollama_config_with_fallback = {
"config_list": [
{
"model": "llama3:70b",
"api_key": "ollama",
"base_url": "http://localhost:11434/v1",
"api_type": "openai"
},
{
"model": "llama3:8b", # Fallback if the 70B request fails or times out
"api_key": "ollama",
"base_url": "http://localhost:11434/v1",
"api_type": "openai"
}
],
"temperature": 0,
"timeout": 600
}
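The idea behind an ordered config list can be illustrated without any model at all. The sketch below is not AutoGen's internal code; it is a minimal stand-in in which `call_with_fallback` and `fake_call` are hypothetical names and the timeout is simulated.

```python
from typing import Callable, Sequence


def call_with_fallback(configs: Sequence[dict],
                       call_model: Callable[[dict], str]) -> str:
    """Try each config in order and return the first successful reply."""
    last_error = None
    for config in configs:
        try:
            return call_model(config)
        except Exception as exc:  # e.g. a timeout or connection error
            last_error = exc
    raise RuntimeError("all configured models failed") from last_error


# Stand-in for a real client: the large model "times out", the small one answers.
def fake_call(config: dict) -> str:
    if config["model"] == "llama3:70b":
        raise TimeoutError("simulated timeout")
    return f"reply from {config['model']}"


configs = [{"model": "llama3:70b"}, {"model": "llama3:8b"}]
print(call_with_fallback(configs, fake_call))  # reply from llama3:8b
```

One consequence worth keeping in mind: the fallback only triggers after the first entry has failed, so a generous timeout on the 70B model delays the switch to the smaller one by that same amount.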
Option 2: Oobabooga text-generation-webui
Oobabooga offers the most model flexibility — GGUF, GPTQ, AWQ, EXL2 formats, LoRA support, and fine-grained generation parameters. It requires more setup but supports a wider range of community models.
# Clone and setup Oobabooga
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
pip install -r requirements.txt
# Start with OpenAI API extension enabled
python server.py --api --extensions openai --listen
# Server starts at localhost:5000/v1 by default
AutoGen configuration for Oobabooga:
# autogen_ooba_config.py
import autogen
ooba_llm_config = {
"config_list": [
{
"model": "mistral-7b-instruct-v0.2", # Must match loaded model name in Ooba
"api_key": "none",
"base_url": "http://localhost:5000/v1",
"api_type": "openai"
}
],
"temperature": 0.1,
"timeout": 600,
"max_tokens": 2048
}
assistant = autogen.AssistantAgent(
name="OobaAssistant",
llm_config=ooba_llm_config,
system_message="""You are a data analysis expert.
Analyze data, write Python code to process it, and explain your findings."""
)
user = autogen.UserProxyAgent(
name="Analyst",
human_input_mode="NEVER",
max_consecutive_auto_reply=8,
code_execution_config={"work_dir": "analysis_output", "use_docker": False}
)
user.initiate_chat(
assistant,
message="Analyze a dataset: I have a CSV with columns [date, sales, returns, region]. What metrics should I calculate and how?"
)
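Oobabooga-specific tip: max_tokens caps the length of each generated reply, and the prompt plus max_tokens must fit inside the context length the model was loaded with. Some GGUF models load with a small context window by default, so check the loaded context size and set max_tokens low enough to leave room for the conversation history.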
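The relationship between context size, prompt length, and reply budget is simple arithmetic. The helper below is a sketch with a hypothetical name and an assumed safety margin of 64 tokens; the prompt token count has to come from your own tokenizer or from the server's usage statistics.

```python
def safe_max_tokens(context_window: int, prompt_tokens: int,
                    requested: int = 2048, margin: int = 64) -> int:
    """Largest reply budget that still fits inside the context window."""
    available = context_window - prompt_tokens - margin
    return max(0, min(requested, available))


print(safe_max_tokens(4096, 1000))  # 2048: the full request fits
print(safe_max_tokens(2048, 1500))  # 484: only part of the request fits
print(safe_max_tokens(2048, 3000))  # 0: the prompt alone overflows the window
```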
Option 3: GPT4All
GPT4All is the simplest option — a desktop app with a Python SDK. No server setup, no command-line config. Best for quick experimentation.
# autogen_gpt4all_config.py
import autogen
from gpt4all import GPT4All
# The GPT4All Python SDK runs models in-process and does not expose an
# OpenAI-compatible server, so this class wraps it behind a
# chat-completion-style create() method. It is shown standalone here:
# AutoGen does not pick it up automatically.
class GPT4AllLLM:
    def __init__(self, model_name: str = "Meta-Llama-3-8B-Instruct.Q4_0.gguf"):
        # Downloads the model on first use if it is not already cached
        self.model = GPT4All(model_name)

    def create(self, messages: list, **kwargs) -> dict:
        """OpenAI-style chat completion interface."""
        # Flatten the chat history into a single prompt string
        prompt = self._messages_to_prompt(messages)
        with self.model.chat_session():
            response = self.model.generate(
                prompt,
                max_tokens=kwargs.get("max_tokens", 512),
                temp=kwargs.get("temperature", 0.1)
            )
        return {
            "choices": [
                {
                    "message": {
                        "role": "assistant",
                        "content": response
                    },
                    "finish_reason": "stop"
                }
            ],
            "model": "gpt4all",
            # Rough word count, not a true token count
            "usage": {"total_tokens": len(response.split())}
        }

    def _messages_to_prompt(self, messages: list) -> str:
        """Convert chat message format to a single string prompt."""
        prompt_parts = []
        for msg in messages:
            role = msg.get("role", "user")
            content = msg.get("content") or ""  # content may be None
            if role == "system":
                prompt_parts.append(f"System: {content}")
            elif role == "user":
                prompt_parts.append(f"Human: {content}")
            elif role == "assistant":
                prompt_parts.append(f"Assistant: {content}")
        # Trailing cue so the model continues as the assistant
        prompt_parts.append("Assistant:")
        return "\n".join(prompt_parts)
# For a simpler approach, enable the local API server in the GPT4All
# desktop app's settings (it listens on port 4891 by default).
# Then use it exactly like Ollama:
gpt4all_config = {
"config_list": [
{
"model": "gpt4all-falcon-newbpe-q4_0",
"api_key": "none",
"base_url": "http://localhost:4891/v1",
"api_type": "openai"
}
],
"temperature": 0,
"timeout": 300
}
API Compatibility Layer
All three backends can expose OpenAI-compatible APIs (GPT4All through the local server in its desktop app), which means the same AutoGen code works across all of them with a config swap. Here's a utility that lets you switch backends dynamically:
# local_llm_factory.py
import os
from typing import Literal, Optional

import autogen


def get_llm_config(
    backend: Literal["ollama", "oobabooga", "gpt4all", "openai"] = "ollama",
    model: Optional[str] = None,
    temperature: float = 0
) -> dict:
    """Factory function for AutoGen LLM configs across backends."""
    configs = {
        "ollama": {
            "model": model or "llama3:8b",
            "api_key": "ollama",
            "base_url": "http://localhost:11434/v1",
            "api_type": "openai"
        },
        "oobabooga": {
            "model": model or "mistral-7b-instruct-v0.2",
            "api_key": "none",
            "base_url": "http://localhost:5000/v1",
            "api_type": "openai"
        },
        "gpt4all": {
            "model": model or "gpt4all-falcon-newbpe-q4_0",
            "api_key": "none",
            "base_url": "http://localhost:4891/v1",
            "api_type": "openai"
        },
        "openai": {
            "model": model or "gpt-4o",
            # Read the key from the environment rather than hard-coding it
            "api_key": os.getenv("OPENAI_API_KEY", "YOUR_OPENAI_API_KEY"),
            "api_type": "openai"
        }
    }
    return {
        "config_list": [configs[backend]],
        "temperature": temperature,
        # Local backends get a longer timeout than the cloud API
        "timeout": 120 if backend == "openai" else 600
    }


# Usage: swap backends without touching agent code
backend = os.getenv("LLM_BACKEND", "ollama")
llm_config = get_llm_config(backend=backend, model=os.getenv("LLM_MODEL"))
assistant = autogen.AssistantAgent(
name="FlexibleAssistant",
llm_config=llm_config
)
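Because the backend name comes from an environment variable, a typo such as `olama` surfaces as a bare `KeyError` inside the factory. A small validation step gives a more readable error. This is a sketch only: `resolve_backend` and `SUPPORTED_BACKENDS` are hypothetical names, and the list of backends simply mirrors the factory above.

```python
import os

SUPPORTED_BACKENDS = ("ollama", "oobabooga", "gpt4all", "openai")


def resolve_backend(env=None, default: str = "ollama") -> str:
    """Read LLM_BACKEND from the environment and reject unknown values."""
    env = os.environ if env is None else env
    backend = env.get("LLM_BACKEND", default).strip().lower()
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(
            f"Unknown LLM_BACKEND {backend!r}; expected one of {SUPPORTED_BACKENDS}"
        )
    return backend
```

Passing a plain dict as `env` keeps the function easy to test without touching the real environment.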
Multi-Agent Setup with Local Models
Local models work in all AutoGen conversation patterns, including multi-agent GroupChats. One practical approach is using a stronger local model for the orchestrator and lighter models for worker agents:
# hybrid_local_setup.py
import autogen
from local_llm_factory import get_llm_config
# Strong model for orchestration (70B if hardware allows)
orchestrator_config = get_llm_config("ollama", model="llama3:70b")
# Lighter model for specialized tasks
worker_config = get_llm_config("ollama", model="codellama:13b")
planner = autogen.AssistantAgent(
name="Planner",
llm_config=orchestrator_config,
system_message="Break down complex tasks and assign them to specialized agents."
)
coder = autogen.AssistantAgent(
name="Coder",
llm_config=worker_config,
system_message="Write Python code based on specifications provided."
)
reviewer = autogen.AssistantAgent(
name="Reviewer",
llm_config=worker_config,
system_message="Review code for correctness and style. Flag issues clearly."
)
# The user proxy executes the generated code, so it must be a member of the group chat
user = autogen.UserProxyAgent(
    name="User",
    human_input_mode="NEVER",
    code_execution_config={"work_dir": "local_output", "use_docker": False}
)
group_chat = autogen.GroupChat(
    agents=[user, planner, coder, reviewer],
    messages=[],
    max_round=10
)
manager = autogen.GroupChatManager(
    groupchat=group_chat,
    llm_config=orchestrator_config
)
user.initiate_chat(manager, message="Build a simple REST API with FastAPI that has CRUD endpoints for a todo list.")
Local Model Comparison Table
| Model | Size | Speed (tokens/sec, A100) | Coding Quality | Reasoning | Privacy | Best For |
|---|---|---|---|---|---|---|
| Llama 3 8B | 4.7GB | 85-120 | Good | Moderate | Full offline | Quick tasks, testing |
| Llama 3 70B | 40GB (Q4) | 15-25 | Excellent | Strong | Full offline | Production quality |
| Mistral 7B | 4.1GB | 90-130 | Very Good | Good | Full offline | Coding, instruction following |
| CodeLlama 13B | 7.3GB | 50-70 | Excellent | Moderate | Full offline | Code generation only |
| Mixtral 8x7B | 26GB | 20-35 | Excellent | Strong | Full offline | General purpose, balanced |
| GPT4All Falcon | 3.9GB | 100+ | Moderate | Moderate | Full offline | Lightweight deployments |
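Speed numbers measured on A100 80GB. Consumer GPU (RTX 4090) speeds are approximately 40-60% of these values for models that fit in its 24GB of VRAM. Llama 3 70B (40GB) and Mixtral 8x7B (26GB) do not fit, so they need CPU offloading and run considerably slower.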
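These throughput figures translate directly into wall-clock time, which is what the `timeout` values in the configs have to cover. The sketch below does the arithmetic; the 600-token reply length and the 10-round conversation are assumptions chosen for illustration, and prompt processing time is ignored.

```python
def reply_seconds(reply_tokens: int, tokens_per_sec: float) -> float:
    """Time to generate one reply, ignoring prompt processing."""
    return reply_tokens / tokens_per_sec


def chat_seconds(rounds: int, reply_tokens: int, tokens_per_sec: float) -> float:
    """Total generation time for a chat with a fixed reply length per round."""
    return rounds * reply_seconds(reply_tokens, tokens_per_sec)


# Llama 3 70B at the low end of the table (15 tokens/sec), 600-token replies
print(reply_seconds(600, 15))     # 40.0
print(chat_seconds(10, 600, 15))  # 400.0
```

The timeout applies to each request rather than to the whole conversation, so at these speeds a single reply fits comfortably within the 600-second setting even though a full 10-round chat takes several minutes.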
Handling Model Weaknesses
Local models sometimes produce shorter, less structured outputs than GPT-4o. These prompting patterns help:
# Prompting strategies for weaker local models
# 1. More explicit output format instructions
system_message = """You are a coding assistant.
ALWAYS format your responses as:
ANALYSIS: [your analysis here]
CODE:
# [python block start]
[code here]
# [block end]
EXPLANATION: [explanation here]
END"""
# 2. Shorter, more focused tasks
# Instead of: "Build a complete user authentication system"
# Use: "Write a password hashing function using bcrypt"
# 3. Chain smaller tasks instead of one large prompt
tasks = [
"Define the data model for a user account",
"Write the password validation function",
"Write the session token generation function",
"Write the login endpoint combining the above"
]
# 4. Increase temperature slightly for creative tasks, keep at 0 for code
creative_config = {**llm_config, "temperature": 0.3}
code_config = {**llm_config, "temperature": 0.0}
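An explicit output format pays off only if something checks it. The sketch below parses a reply that follows the ANALYSIS / CODE / EXPLANATION layout from the first strategy and returns `None` when the model ignored the format, so the caller can re-prompt. `parse_reply` and `SECTION_RE` are illustrative names, and the parser assumes the three labels appear once each, in that order.

```python
import re

SECTION_RE = re.compile(
    r"ANALYSIS:\s*(?P<analysis>.*?)\s*"
    r"CODE:\s*(?P<code>.*?)\s*"
    r"EXPLANATION:\s*(?P<explanation>.*?)\s*"
    r"(?:END\s*)?\Z",  # the END marker is optional
    re.DOTALL,
)


def parse_reply(text: str):
    """Split a reply into its three sections, or return None if malformed."""
    match = SECTION_RE.search(text)
    if match is None:
        return None
    return {name: value.strip() for name, value in match.groupdict().items()}


reply = (
    "ANALYSIS: Need a sum helper.\n"
    "CODE:\n"
    "def add(a, b):\n"
    "    return a + b\n"
    "EXPLANATION: Adds two numbers.\n"
    "END"
)
print(parse_reply(reply)["code"])
```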
Privacy Considerations
Running agents locally means your data never leaves your infrastructure. This is the key differentiator for regulated industries. See OpenAI API integration for comparison — cloud APIs process your data on third-party infrastructure.
When building privacy-sensitive agents, add a data classification check before any LLM call:
# data_classifier.py
import re

from local_llm_factory import get_llm_config

SENSITIVE_PATTERNS = [
    r'\b\d{3}-\d{2}-\d{4}\b',  # SSN
    r'\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b',  # Credit card
    r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',  # Email
    r'\b(?:\+1[-\s]?)?\(?\d{3}\)?[-\s]\d{3}[-\s]\d{4}\b'  # Phone
]


def contains_sensitive_data(text: str) -> bool:
    """True if any sensitive-data pattern matches the text."""
    return any(re.search(pattern, text) for pattern in SENSITIVE_PATTERNS)


def safe_llm_call(text: str, use_local: bool = False) -> dict:
    """Return a local config when forced or when sensitive data is detected."""
    if use_local:
        return get_llm_config("ollama")
    if contains_sensitive_data(text):
        print("Warning: Sensitive data detected. Routing to local model.")
        return get_llm_config("ollama")
    # Only text with no detected sensitive data goes to the cloud backend
    return get_llm_config("openai")
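For auditing, it is often useful to know which kind of data triggered the routing decision, not only that something matched. The sketch below keys the patterns by label and reports every label that matches. The patterns are adapted from the classifier above (phone numbers omitted for brevity), and `sensitive_labels` and `LABELED_PATTERNS` are hypothetical names. Regex checks like these are a coarse first filter, not a complete PII detector.

```python
import re

# Patterns adapted from the classifier, keyed by label so matches can be reported
LABELED_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b"),
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
}


def sensitive_labels(text: str) -> list:
    """Return the label of every pattern that matches the text."""
    return [label for label, pattern in LABELED_PATTERNS.items()
            if pattern.search(text)]


print(sensitive_labels("SSN 123-45-6789, mail a.b@example.com"))  # ['ssn', 'email']
```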
For teams building more complex agents that need both local processing and external knowledge, the Build AI agent with LangChain guide covers hybrid architectures.
FAQs
Can local models match GPT-4o quality in AutoGen workflows?
For simple, well-defined tasks — code generation, summarization, data extraction — capable local models like Llama 3 70B or Mistral Large come close. Complex reasoning chains and tool use still favor GPT-4o. A practical approach is using a local model for worker agents and reserving cloud API calls for the orchestrator.
How much RAM do I need to run local models with AutoGen?
For 7B models: 8GB RAM minimum, 16GB comfortable. For 13B models: 16GB RAM, 24GB comfortable. For 70B models: 48GB+ RAM, since the 4-bit weights are about 40GB; a GPU with 24GB of VRAM can hold only part of the model, with the rest offloaded to system RAM. Ollama's default model tags are typically already 4-bit quantized, so there is no separate quantization step.
Do local models work with AutoGen's code execution feature?
Yes, with caveats. Code execution in AutoGen is independent of the LLM — the model generates code and Python executes it locally. The limitation is that weaker local models produce less reliable code, so you'll need to allow more correction rounds or add a validation step before execution.
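One cheap form of validation is a syntax check before the code ever reaches the executor. The sketch below uses the standard library's `ast` module; `validate_python` is an illustrative name, and the check catches only syntax errors, not logic bugs or unsafe operations.

```python
import ast


def validate_python(code: str):
    """Return (ok, message) from a syntax-only check; nothing is executed."""
    try:
        ast.parse(code)
    except SyntaxError as exc:
        return False, f"line {exc.lineno}: {exc.msg}"
    return True, "ok"


print(validate_python("x = 1\nprint(x)"))
print(validate_python("def f(:\n    pass"))
```

When the check fails, the error message can be sent back to the model as the next prompt, which uses up a correction round without running broken code.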
AiTechWorlds Team
The AiTechWorlds team is passionate about AI, technology, and education. We create high-quality, research-backed content to help you learn, grow, and succeed in the modern digital world.