How to Use AutoGPT with Voice (Speech-to-Text + TTS)
Add voice control to AutoGPT with Whisper speech-to-text and ElevenLabs or pyttsx3 TTS. Build a conversational autonomous agent you talk to hands-free.
Voice interfaces change how autonomous agents feel to use. Instead of typing goals into a terminal and reading back walls of text, you speak naturally and hear responses. For hands-free workflows — cooking while asking an agent to research recipes, driving while getting a briefing, or accessibility use cases — voice transforms AutoGPT from a dev tool into something genuinely practical.
This guide covers the full technical stack: Whisper for speech-to-text, your choice of TTS engine, and the integration layer that ties it all together with an AutoGen-based agent. You'll have a working voice-controlled agent by the end.
Architecture Overview
The voice pipeline has three distinct layers:
Microphone Input
↓
[Whisper STT] → Text transcript
↓
[AutoGPT/AutoGen Agent] → Text response
↓
[TTS Engine] → Audio output
↓
Speaker Output
Each layer is independently swappable. You can start with pyttsx3 for free offline TTS and upgrade to ElevenLabs later without changing the agent layer at all.
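To make the "independently swappable" claim concrete, here is a minimal sketch of the three layers expressed as interfaces. The names `SpeechToText`, `Agent`, `TextToSpeech`, `VoicePipeline`, and the stand-in classes are hypothetical and used only for illustration; the real modules built later in this guide use their own class names.

```python
from typing import List, Protocol


class SpeechToText(Protocol):
    def listen_and_transcribe(self) -> str: ...


class Agent(Protocol):
    def respond(self, text: str) -> str: ...


class TextToSpeech(Protocol):
    def speak(self, text: str) -> None: ...


class VoicePipeline:
    """Chains STT -> agent -> TTS without knowing the concrete classes."""

    def __init__(self, stt: SpeechToText, agent: Agent, tts: TextToSpeech):
        self.stt = stt
        self.agent = agent
        self.tts = tts

    def run_once(self) -> str:
        transcript = self.stt.listen_and_transcribe()
        if not transcript:
            return ""  # Nothing was heard, so nothing is spoken
        reply = self.agent.respond(transcript)
        self.tts.speak(reply)
        return reply


# Stand-in components so the sketch runs without a microphone or API key.
class FakeSTT:
    def listen_and_transcribe(self) -> str:
        return "what time is it"


class EchoAgent:
    def respond(self, text: str) -> str:
        return f"You asked: {text}"


class RecordingTTS:
    def __init__(self) -> None:
        self.spoken: List[str] = []

    def speak(self, text: str) -> None:
        self.spoken.append(text)


tts = RecordingTTS()
pipeline = VoicePipeline(FakeSTT(), EchoAgent(), tts)
print(pipeline.run_once())
```

Because `VoicePipeline` depends only on method names, replacing `RecordingTTS` with a real engine should not require touching the agent or STT layer.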
Installing Dependencies
# Core speech stack
pip install openai-whisper pyaudio sounddevice soundfile numpy
# TTS options (install what you need)
pip install pyttsx3 # Free, offline, robotic
pip install openai # OpenAI TTS API
pip install elevenlabs # ElevenLabs (best quality, paid)
# Agent framework
pip install pyautogen
# Audio utilities
pip install playsound pydub
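On Windows, recent PyAudio releases generally install from PyPI with a plain pip install pyaudio. If pip falls back to compiling from source and fails, pipwin has historically been used to fetch a pre-built wheel, though it may not work on current Python versions: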
pip install pipwin
pipwin install pyaudio
On macOS, install portaudio first:
brew install portaudio
pip install pyaudio
Building the Speech-to-Text Module
Whisper is the gold standard for offline STT. It runs locally, handles accents well, and supports 99 languages:
# stt/whisper_stt.py
import whisper
import sounddevice as sd
import soundfile as sf
import numpy as np
import tempfile
import os
from typing import Optional


class WhisperSTT:
    def __init__(self, model_size: str = "base"):
        """
        model_size options: tiny, base, small, medium, large
        - tiny: fastest, least accurate (~39M params)
        - base: good balance for most use cases (~74M params)
        - small: better accuracy, still fast (~244M params)
        - medium: high accuracy, slower (~769M params)
        - large: most accurate, slowest (~1550M params)
        """
        print(f"Loading Whisper {model_size} model...")
        self.model = whisper.load_model(model_size)
        self.sample_rate = 16000  # Whisper expects 16kHz

    def record_audio(self, duration: int = 10) -> np.ndarray:
        """Record a fixed-length clip (no silence detection; see record_until_silence)."""
        print("Listening... (speak now)")
        # Record raw audio for exactly `duration` seconds
        audio = sd.rec(
            int(duration * self.sample_rate),
            samplerate=self.sample_rate,
            channels=1,
            dtype="float32",
        )
        sd.wait()  # Block until the recording has finished
        return audio.flatten()

    def record_until_silence(
        self,
        max_duration: int = 30,
        silence_threshold: float = 0.005,
        silence_chunks: int = 20
    ) -> np.ndarray:
        """Record until the user stops speaking."""
        chunk_size = int(self.sample_rate * 0.1)  # 100ms chunks
        all_audio = []
        silent_count = 0
        has_speech = False
        print("Listening... (speak your command)")
        with sd.InputStream(
            samplerate=self.sample_rate,
            channels=1,
            dtype="float32",
            blocksize=chunk_size
        ) as stream:
            max_chunks = int(max_duration * 10)
            for _ in range(max_chunks):
                chunk, _overflowed = stream.read(chunk_size)
                chunk_flat = chunk.flatten()
                all_audio.append(chunk_flat)
                # Detect speech vs silence
                rms = np.sqrt(np.mean(chunk_flat ** 2))
                if rms > silence_threshold:
                    has_speech = True
                    silent_count = 0
                elif has_speech:
                    silent_count += 1
                # Stop after silence following speech (20 chunks = 2 seconds)
                if has_speech and silent_count >= silence_chunks:
                    break
        print("Processing speech...")
        # Join the chunks once at the end instead of growing a list of samples
        if not all_audio:
            return np.zeros(0, dtype="float32")
        return np.concatenate(all_audio)

    def transcribe(self, audio: np.ndarray) -> str:
        """Transcribe audio array to text."""
        # Reserve a temp WAV path and close the handle before writing, so
        # soundfile can reopen it on Windows. (model.transcribe() also accepts
        # a 16 kHz float32 NumPy array directly.)
        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
            tmp_path = tmp.name
        try:
            sf.write(tmp_path, audio, self.sample_rate)
            result = self.model.transcribe(
                tmp_path,
                language="en",
                fp16=False,  # CPU inference
                verbose=False,
            )
            transcript = result["text"].strip()
            return transcript
        finally:
            os.unlink(tmp_path)

    def listen_and_transcribe(self) -> str:
        """Full pipeline: record then transcribe."""
        audio = self.record_until_silence()
        if len(audio) < self.sample_rate * 0.5:  # Less than 0.5 seconds
            return ""
        return self.transcribe(audio)
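The stop-on-silence rule in record_until_silence is easier to reason about when separated from the microphone. The sketch below reimplements the same rule over plain Python lists so it can be run and tuned without audio hardware. The helper names `rms` and `chunks_until_silence` are hypothetical, and the sample values are synthetic.

```python
import math
from typing import Iterable, List, Sequence


def rms(chunk: Sequence[float]) -> float:
    """Root-mean-square level of one chunk of samples."""
    if not chunk:
        return 0.0
    return math.sqrt(sum(s * s for s in chunk) / len(chunk))


def chunks_until_silence(
    chunks: Iterable[Sequence[float]],
    silence_threshold: float = 0.005,
    silence_chunks: int = 20,
) -> List[Sequence[float]]:
    """Collect chunks until `silence_chunks` quiet chunks follow some speech."""
    kept: List[Sequence[float]] = []
    silent_count = 0
    has_speech = False
    for chunk in chunks:
        kept.append(chunk)
        if rms(chunk) > silence_threshold:
            has_speech = True
            silent_count = 0  # Any loud chunk resets the silence counter
        elif has_speech:
            silent_count += 1
        if has_speech and silent_count >= silence_chunks:
            break
    return kept


quiet = [0.0, 0.0, 0.0, 0.0]
loud = [0.5, -0.5, 0.5, -0.5]
stream = [quiet, quiet, loud, loud, quiet, quiet, quiet, loud]
# Stops after the third quiet chunk that follows speech; the final loud chunk is never read.
print(len(chunks_until_silence(stream, silence_chunks=3)))  # 7
```

Note that leading silence is kept, just as in the class above. With 100 ms chunks, the default of 20 silent chunks corresponds to a two-second pause; lowering it makes the agent respond sooner but risks cutting off slow speakers.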
Building the TTS Module
Here's a unified TTS interface with three backends:
# tts/tts_engine.py
import os
import tempfile
from abc import ABC, abstractmethod
from typing import Optional
import threading


class TTSBase(ABC):
    @abstractmethod
    def speak(self, text: str):
        pass

    @abstractmethod
    def speak_async(self, text: str):
        pass


class Pyttsx3TTS(TTSBase):
    """Free, offline TTS. Works everywhere, sounds robotic."""

    def __init__(self, rate: int = 185, volume: float = 0.9):
        import pyttsx3
        self.engine = pyttsx3.init()
        self.engine.setProperty("rate", rate)
        self.engine.setProperty("volume", volume)
        # Use best available voice
        voices = self.engine.getProperty("voices")
        if voices:
            # Prefer female voice if available
            for voice in voices:
                if "female" in voice.name.lower() or "zira" in voice.name.lower():
                    self.engine.setProperty("voice", voice.id)
                    break

    def speak(self, text: str):
        self.engine.say(text)
        self.engine.runAndWait()

    def speak_async(self, text: str):
        thread = threading.Thread(target=self.speak, args=(text,))
        thread.start()


class OpenAITTS(TTSBase):
    """OpenAI TTS API. Good quality, pay per character."""

    def __init__(self, model: str = "tts-1-hd", voice: str = "alloy"):
        from openai import OpenAI
        self.client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
        self.model = model
        self.voice = voice  # alloy, echo, fable, onyx, nova, shimmer

    def speak(self, text: str):
        from playsound import playsound
        response = self.client.audio.speech.create(
            model=self.model,
            voice=self.voice,
            input=text,
        )
        with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as tmp:
            tmp.write(response.content)
            tmp_path = tmp.name
        try:
            playsound(tmp_path)
        finally:
            os.unlink(tmp_path)

    def speak_async(self, text: str):
        thread = threading.Thread(target=self.speak, args=(text,))
        thread.start()


class ElevenLabsTTS(TTSBase):
    """ElevenLabs TTS. Best quality, paid API."""

    def __init__(self, voice_id: str = "21m00Tcm4TlvDq8ikWAM"):
        from elevenlabs import ElevenLabs
        self.client = ElevenLabs(api_key=os.getenv("ELEVENLABS_API_KEY"))
        self.voice_id = voice_id
        # Popular voice IDs:
        # Rachel: 21m00Tcm4TlvDq8ikWAM
        # Domi: AZnzlk1XvdvUeBnXmlld
        # Bella: EXAVITQu4vr4xnSDxMaL

    def speak(self, text: str):
        from playsound import playsound
        # generate() is the 1.x SDK call; newer SDK releases may expose this as
        # client.text_to_speech.convert(...) instead, so check your installed version
        audio = self.client.generate(
            text=text,
            voice=self.voice_id,
            model="eleven_multilingual_v2",
        )
        # The SDK yields the audio in chunks; join them before writing
        audio_bytes = b"".join(audio)
        with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as tmp:
            tmp.write(audio_bytes)
            tmp_path = tmp.name
        try:
            playsound(tmp_path)
        finally:
            os.unlink(tmp_path)

    def speak_async(self, text: str):
        thread = threading.Thread(target=self.speak, args=(text,))
        thread.start()


def get_tts_engine(engine: str = "pyttsx3") -> TTSBase:
    """Factory function to get TTS engine by name."""
    engines = {
        "pyttsx3": Pyttsx3TTS,
        "openai": OpenAITTS,
        "elevenlabs": ElevenLabsTTS,
    }
    if engine not in engines:
        raise ValueError(f"Unknown TTS engine: {engine}. Choose from {list(engines.keys())}")
    return engines[engine]()
TTS Engine Comparison
| Engine | Quality | Cost | Latency | Offline | Languages |
|---|---|---|---|---|---|
| pyttsx3 | Poor — robotic | Free | ~50ms | Yes | Limited |
| OpenAI tts-1 | Good | $0.015/1K chars | 500-1500ms | No | 57 |
| OpenAI tts-1-hd | Very good | $0.030/1K chars | 1000-2500ms | No | 57 |
| ElevenLabs standard | Excellent | $0.30/1K chars | 800-2000ms | No | 29 |
| ElevenLabs turbo | Very good | $0.18/1K chars | 300-600ms | No | 32 |
| Azure Neural TTS | Good | $0.016/1K chars | 400-1200ms | No | 140+ |
For development, use pyttsx3 to avoid API costs. Switch to OpenAI TTS or ElevenLabs when you care about the listening experience.
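To put the per-character prices in context, here is a rough monthly cost estimate. The prices are copied from the comparison table above; the usage figures (50 spoken responses a day, 400 characters each) are assumptions chosen for illustration, and the function name `monthly_tts_cost` is hypothetical.

```python
# Per-1K-character prices in USD, copied from the comparison table above.
PRICE_PER_1K_CHARS = {
    "pyttsx3": 0.0,
    "openai-tts-1": 0.015,
    "openai-tts-1-hd": 0.030,
    "elevenlabs-standard": 0.30,
    "elevenlabs-turbo": 0.18,
}


def monthly_tts_cost(
    engine: str,
    responses_per_day: int,
    avg_chars: int = 400,
    days: int = 30,
) -> float:
    """Estimated monthly spend in USD for one engine, rounded to cents."""
    total_chars = responses_per_day * avg_chars * days
    return round(total_chars / 1000 * PRICE_PER_1K_CHARS[engine], 2)


# Assumed usage: 50 responses/day x 400 chars x 30 days = 600,000 characters
for name in PRICE_PER_1K_CHARS:
    print(f"{name}: ${monthly_tts_cost(name, responses_per_day=50):.2f}")
```

At that assumed volume, the gap between tts-1 and ElevenLabs standard is roughly twentyfold, which is why keeping responses short matters for cost as well as for listening comfort.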
The Voice Agent Integration
Now wire the STT and TTS layers around an AutoGen agent:
# voice_agent.py
import autogen
import os
import re
from stt.whisper_stt import WhisperSTT
from tts.tts_engine import get_tts_engine

# Defined for wake-word filtering; the run() loop below does not check it
WAKE_WORDS = ["hey agent", "okay agent", "agent", "assistant"]
EXIT_PHRASES = ["stop", "quit", "exit", "goodbye", "that's all"]


class VoiceAgent:
    def __init__(
        self,
        tts_engine: str = "openai",
        whisper_model: str = "base",
        max_response_length: int = 500,
    ):
        self.stt = WhisperSTT(model_size=whisper_model)
        self.tts = get_tts_engine(tts_engine)
        self.max_response_length = max_response_length
        self.conversation_history = []
        # Set up AutoGen agent
        llm_config = {
            "config_list": [
                {"model": "gpt-4o", "api_key": os.getenv("OPENAI_API_KEY")}
            ],
            "temperature": 0.3,
        }
        self.assistant = autogen.AssistantAgent(
            name="VoiceAssistant",
            system_message="""You are a voice assistant. Your responses will be spoken aloud,
so follow these rules:
1. Keep responses concise — 2-4 sentences maximum
2. Avoid markdown, bullets, numbered lists, and headers
3. Speak in natural conversational language
4. Do not use abbreviations that sound odd when read aloud (e.g., write "for example" not "e.g.")
5. If a topic requires a long explanation, offer to break it into parts
6. When you have answered fully, end with DONE""",
            llm_config=llm_config,
        )
        self.user_proxy = autogen.UserProxyAgent(
            name="VoiceUser",
            human_input_mode="NEVER",
            max_consecutive_auto_reply=3,
            is_termination_msg=lambda msg: "DONE" in (msg.get("content") or ""),
            code_execution_config=False,
        )

    def clean_for_speech(self, text: str) -> str:
        """Remove markdown and formatting that sounds bad when spoken."""
        # Replace code blocks first so later rules do not touch their contents
        text = re.sub(r'```[^`]*```', '[code block]', text)
        # Remove inline code
        text = re.sub(r'`([^`]+)`', r'\1', text)
        # Remove markdown headers (line start only, so "C# is" stays intact)
        text = re.sub(r'^\s*#+\s+', '', text, flags=re.MULTILINE)
        # Remove bullet points (before bold/italic, which also uses asterisks)
        text = re.sub(r'^\s*[-*•]\s+', '', text, flags=re.MULTILINE)
        # Remove numbered lists
        text = re.sub(r'^\s*\d+\.\s+', '', text, flags=re.MULTILINE)
        # Remove bold/italic
        text = re.sub(r'\*+([^*]+)\*+', r'\1', text)
        # Remove DONE marker (whole word only)
        text = re.sub(r'\bDONE\b', '', text).strip()
        # Truncate at a sentence boundary if too long
        if len(text) > self.max_response_length:
            sentences = text.split(". ")
            short = ""
            for s in sentences:
                if len(short) + len(s) < self.max_response_length:
                    short += s + ". "
                else:
                    break
            text = short.strip() + " ...I can continue if you'd like."
        return text

    def get_agent_response(self, user_input: str) -> str:
        """Get response from AutoGen agent."""
        self.user_proxy.initiate_chat(
            self.assistant,
            message=user_input,
            clear_history=False,  # Maintain conversation context
        )
        messages = self.assistant.chat_messages.get(self.user_proxy, [])
        # Walk backwards to find the assistant's most recent non-empty reply
        for msg in reversed(messages):
            if msg.get("role") == "assistant" and msg.get("content"):
                return self.clean_for_speech(msg["content"])
        return "I'm sorry, I didn't get a response. Please try again."

    def run(self):
        """Main voice interaction loop."""
        self.tts.speak("Voice agent ready. Speak your request.")
        print("Voice agent active. Press Ctrl+C to stop.")
        while True:
            try:
                # Listen for input
                transcript = self.stt.listen_and_transcribe()
                if not transcript:
                    continue
                print(f"You said: {transcript}")
                # Exit only when the whole utterance is an exit phrase, so
                # "how do I stop a process" still reaches the agent
                spoken = re.sub(r"[^\w\s']", "", transcript.lower()).strip()
                if spoken in EXIT_PHRASES:
                    self.tts.speak("Goodbye!")
                    break
                # Process the input (every utterance is sent; no wake-word check)
                response = self.get_agent_response(transcript)
                print(f"Agent: {response}")
                self.tts.speak(response)
            except KeyboardInterrupt:
                self.tts.speak("Stopping voice agent.")
                break
            except Exception as e:
                print(f"Error: {e}")
                self.tts.speak("I encountered an error. Please try again.")


if __name__ == "__main__":
    agent = VoiceAgent(
        tts_engine="openai",  # or "pyttsx3" for free offline
        whisper_model="base",
        max_response_length=400,
    )
    agent.run()
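The WAKE_WORDS list is defined in voice_agent.py, but the loop sends every utterance to the agent. A minimal transcript-based filter, sketched below, would ignore speech that does not begin with a wake word. This is a text check applied after Whisper has already transcribed the audio, not acoustic wake-word detection, and the function name `strip_wake_word` is hypothetical.

```python
import re
from typing import Optional, Sequence

WAKE_WORDS = ["hey agent", "okay agent", "agent", "assistant"]


def strip_wake_word(
    transcript: str, wake_words: Sequence[str] = WAKE_WORDS
) -> Optional[str]:
    """Return the command after a leading wake word, or None if none was spoken."""
    # Lowercase, replace punctuation with spaces, then collapse whitespace
    normalized = re.sub(r"[^\w\s']", " ", transcript.lower())
    normalized = " ".join(normalized.split())
    # Try longer phrases first so "hey agent" wins over plain "agent"
    for wake in sorted(wake_words, key=len, reverse=True):
        if normalized == wake:
            return ""  # Wake word alone, no command yet
        if normalized.startswith(wake + " "):
            return normalized[len(wake) + 1:]
    return None


print(strip_wake_word("Hey agent, what's the weather?"))  # what's the weather
print(strip_wake_word("What's the weather?"))  # None
```

In the loop, a `None` result would mean "skip this utterance", and an empty string could prompt the user for a command. The returned command is lowercased and stripped of punctuation, which is usually harmless for an LLM prompt but is worth knowing about.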
Running the Voice Agent
export OPENAI_API_KEY=sk-...
python voice_agent.py
Example interaction:
- You: "What's the capital of Australia and what's it known for?"
- Agent: "The capital of Australia is Canberra, chosen as a compromise between Sydney and Melbourne. It's known for the Australian War Memorial, the National Gallery, and Parliament House. Would you like more detail on any of these?"
This connects nicely to AI agents and the future of work — voice interfaces are one of the key ways agents will integrate into daily workflows rather than remaining purely developer tools.
Handling Long AutoGPT Responses
AutoGPT is designed for long-form output — research reports, code, detailed analysis. Voice doesn't work well with 2,000-word outputs. The solution is a response mode selector:
# Add to your agent system message:
"""When responding via voice:
- For factual questions: answer in 2-3 sentences
- For complex topics: give a 3-sentence summary, then ask if the user wants more detail
- For tasks (write code, create a document): confirm what you'll do, then say you've saved the output to the workspace
- Never read out full code or long documents"""
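The prompt rules above rely on the model behaving. A code-side guard can back them up by saving long or code-bearing output to the workspace and speaking only a short confirmation. The sketch below assumes a local workspace directory; the function name `route_response`, the file naming scheme, and the 400-character limit are all hypothetical choices.

```python
import tempfile
from pathlib import Path

CODE_FENCE = "`" * 3  # Marker that signals a code block in the reply


def route_response(text: str, workspace: Path, max_spoken_chars: int = 400) -> str:
    """Return what should be spoken; long or code-bearing output is saved instead."""
    if len(text) <= max_spoken_chars and CODE_FENCE not in text:
        return text  # Short plain answer: speak it as-is
    workspace.mkdir(parents=True, exist_ok=True)
    # Number files by how many saved responses already exist
    index = len(list(workspace.glob("response_*.md"))) + 1
    out_path = workspace / f"response_{index}.md"
    out_path.write_text(text, encoding="utf-8")
    return f"That answer is long, so I saved it to {out_path.name} in the workspace."


# Demo in a throwaway directory so nothing is written to a real workspace
demo_workspace = Path(tempfile.mkdtemp()) / "workspace"
print(route_response("Canberra is the capital of Australia.", demo_workspace))
print(route_response("word " * 200, demo_workspace))
```

The return value can be passed straight to the TTS engine, so the listener hears either the answer itself or where to find it.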
The "Build AI chatbot Python" guide has complementary patterns for managing response length in conversational contexts.
Production Considerations
For a voice agent you'll use daily, a few additional investments are worthwhile. A noise cancellation preprocessing step on microphone input dramatically improves Whisper accuracy in real environments. The noisereduce library handles this in two lines:
import noisereduce as nr
audio_cleaned = nr.reduce_noise(y=audio, sr=sample_rate)
Wake word detection using pvporcupine from Picovoice makes the agent feel more natural — it only activates when you say a specific phrase, rather than constantly recording.
Building voice into your agent stack is one of those changes that makes the technology feel genuinely different to interact with. The technical components are all mature and well-documented. The main investment is tuning the response format so the agent sounds natural rather than reading markdown out loud.
Frequently Asked Questions
Does AutoGPT have built-in voice support? Not natively in most AutoGPT forks. Voice support requires wrapping AutoGPT with a speech layer — using Whisper for speech-to-text and a TTS engine for output. Some community forks include voice integrations, but they vary in quality and maintenance status.
Which TTS engine sounds most natural for AutoGPT responses? ElevenLabs produces the most natural-sounding voices by a wide margin, but it costs money based on character usage. OpenAI TTS (tts-1-hd) is a strong middle ground — high quality at lower cost. pyttsx3 is free and works offline but sounds robotic. For production use, ElevenLabs or OpenAI TTS are worth the cost.
How do I handle AutoGPT's long responses with text-to-speech? Split long responses into sentences before passing to TTS. Libraries like NLTK's sent_tokenize or simple period-splitting work well. Speak each sentence sequentially. For very long outputs, add a voice command like 'summarize that' to get a shorter version before speaking it.
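As a rough illustration of the sentence-splitting approach from the last answer, the sketch below uses a simple regular expression rather than NLTK, so it will mis-split on abbreviations such as "Dr. Smith". The function names `split_sentences` and `chunk_for_tts` and the 200-character default are hypothetical.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Split after '.', '!' or '?' when followed by whitespace (naive)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_for_tts(text: str, max_chars: int = 200) -> List[str]:
    """Group whole sentences into chunks of at most max_chars where possible."""
    chunks: List[str] = []
    current = ""
    for sentence in split_sentences(text):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)  # Current chunk is full; start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


reply = "First one. Second one! Third one? Fourth."
for chunk in chunk_for_tts(reply, max_chars=22):
    print(chunk)
```

Each chunk can then be passed to `speak()` in turn, which lets playback start before the whole reply has been synthesized. A single sentence longer than `max_chars` is kept whole rather than cut mid-sentence.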
AiTechWorlds Team