How to Use AutoGPT with Voice (Speech-to-Text + TTS)

⚡ Quick Answer

Add voice control to AutoGPT with Whisper speech-to-text and ElevenLabs or pyttsx3 TTS. Build a conversational autonomous agent you talk to hands-free.

AiTechWorlds Team May 31, 2026 11 min read

#AutoGPT #voice interface #speech-to-text #text-to-speech

Voice interfaces change how autonomous agents feel to use. Instead of typing goals into a terminal and reading back walls of text, you speak naturally and hear responses. For hands-free workflows — cooking while asking an agent to research recipes, driving while getting a briefing, or accessibility use cases — voice transforms AutoGPT from a dev tool into something genuinely practical.

This guide covers the full technical stack: Whisper for speech-to-text, your choice of TTS engine, and the integration layer that ties it all together with an AutoGen-based agent. You'll have a working voice-controlled agent by the end.

Architecture Overview

The voice pipeline has three distinct layers:

Microphone Input
     ↓
[Whisper STT] → Text transcript
     ↓
[AutoGPT/AutoGen Agent] → Text response
     ↓
[TTS Engine] → Audio output
     ↓
Speaker Output

Each layer is independently swappable. You can start with pyttsx3 for free offline TTS and upgrade to ElevenLabs later without changing the agent layer at all.
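
One way to keep the layers swappable is to have the main loop depend only on three callables. The sketch below is a minimal illustration under that assumption; `run_turn` and the stand-in lambdas are hypothetical names, not part of the modules built later in this guide.

```python
from typing import Callable

# Each layer is just a callable, so any backend with the same shape can be dropped in.
Listen = Callable[[], str]        # microphone -> transcript
Respond = Callable[[str], str]    # transcript -> reply text
Speak = Callable[[str], None]     # reply text -> audio


def run_turn(listen: Listen, respond: Respond, speak: Speak) -> str:
    """Run one listen -> respond -> speak cycle and return the reply text."""
    transcript = listen()
    if not transcript:
        return ""  # nothing heard, so skip the agent call
    reply = respond(transcript)
    speak(reply)
    return reply


# Stand-ins for illustration; swap in the real STT, agent, and TTS objects later.
spoken = []
reply = run_turn(
    listen=lambda: "what time is it",
    respond=lambda text: f"You asked: {text}",
    speak=spoken.append,
)
```

Because `run_turn` never imports a speech library, replacing pyttsx3 with ElevenLabs only changes which callable is passed as `speak`.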

Installing Dependencies

# Core speech stack
pip install openai-whisper pyaudio sounddevice soundfile numpy

# TTS options (install what you need)
pip install pyttsx3           # Free, offline, robotic
pip install openai            # OpenAI TTS API
pip install elevenlabs        # ElevenLabs (best quality, paid)

# Agent framework
pip install pyautogen

# Audio utilities
pip install playsound pydub

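On Windows, pyaudio often requires a pre-built wheel (the code in this guide records through sounddevice, so pyaudio is only needed if you swap in a PyAudio-based recorder):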

pip install pipwin
pipwin install pyaudio

On macOS, install portaudio first:

brew install portaudio
pip install pyaudio

Building the Speech-to-Text Module

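Whisper is a strong default for offline STT. It runs locally, handles accents well, and supports 99 languages. Transcribing from a file path relies on the ffmpeg command-line tool, so install ffmpeg before running the module below: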

# stt/whisper_stt.py
import whisper
import sounddevice as sd
import soundfile as sf
import numpy as np
import tempfile
import os
from typing import Optional


class WhisperSTT:
    def __init__(self, model_size: str = "base"):
        """
        model_size options: tiny, base, small, medium, large
        - tiny: fastest, least accurate (~39M params)
        - base: good balance for most use cases (~74M params)
        - small: better accuracy, still fast (~244M params)
        - medium: high accuracy, slower (~769M params)
        - large: highest accuracy, slowest (~1550M params)
        """
        print(f"Loading Whisper {model_size} model...")
        self.model = whisper.load_model(model_size)
        self.sample_rate = 16000  # Whisper expects 16kHz

    def record_audio(self, duration: int = 10) -> np.ndarray:
        """Record a fixed-length clip. For silence detection, use record_until_silence."""
        print("Listening... (speak now)")

        # Record raw audio
        audio = sd.rec(
            int(duration * self.sample_rate),
            samplerate=self.sample_rate,
            channels=1,
            dtype="float32",
        )
        sd.wait()

        return audio.flatten()

    def record_until_silence(
        self,
        max_duration: int = 30,
        silence_threshold: float = 0.005,
        silence_chunks: int = 20
    ) -> np.ndarray:
        """Record until the user stops speaking, or until max_duration seconds pass."""
        chunk_size = int(self.sample_rate * 0.1)  # 100ms chunks
        chunks = []  # one float32 array per chunk, joined once at the end
        silent_count = 0
        has_speech = False

        print("Listening... (speak your command)")

        with sd.InputStream(
            samplerate=self.sample_rate,
            channels=1,
            dtype="float32",
            blocksize=chunk_size
        ) as stream:
            max_chunks = int(max_duration * 10)  # 10 chunks per second

            for _ in range(max_chunks):
                chunk, _overflowed = stream.read(chunk_size)
                chunk_flat = chunk.flatten()
                chunks.append(chunk_flat)

                # Detect speech vs silence from the chunk's RMS level
                rms = np.sqrt(np.mean(chunk_flat ** 2))

                if rms > silence_threshold:
                    has_speech = True
                    silent_count = 0
                elif has_speech:
                    silent_count += 1

                # Stop after silence following speech
                # (20 chunks of 100ms = 2 seconds by default)
                if has_speech and silent_count >= silence_chunks:
                    break

        if not has_speech:
            # Nothing rose above the threshold, so there is nothing to transcribe
            return np.zeros(0, dtype=np.float32)

        print("Processing speech...")
        return np.concatenate(chunks)

    def transcribe(self, audio: np.ndarray) -> str:
        """Transcribe audio array to text."""
        # Save to a temp WAV file (transcribe() also accepts a 16 kHz float32 array directly)
        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
            sf.write(tmp.name, audio, self.sample_rate)
            tmp_path = tmp.name

        try:
            result = self.model.transcribe(
                tmp_path,
                language="en",
                fp16=False,  # CPU inference
                verbose=False,
            )
            transcript = result["text"].strip()
            return transcript
        finally:
            os.unlink(tmp_path)

    def listen_and_transcribe(self) -> str:
        """Full pipeline: record then transcribe."""
        audio = self.record_until_silence()
        if len(audio) < self.sample_rate * 0.5:  # Less than 0.5 seconds
            return ""
        return self.transcribe(audio)
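
The stopping rule inside `record_until_silence` is easier to tune when it is pulled out and run on plain numbers. This is a sketch of the same state machine operating on a list of per-chunk RMS levels; `find_stop_chunk` is a hypothetical helper name, and the defaults mirror the ones used in the class above.

```python
from typing import Iterable, Optional


def find_stop_chunk(
    rms_values: Iterable[float],
    silence_threshold: float = 0.005,
    silence_chunks: int = 20,
) -> Optional[int]:
    """Return the index of the chunk where recording would stop, or None."""
    silent_count = 0
    has_speech = False
    for index, rms in enumerate(rms_values):
        if rms > silence_threshold:
            # Any loud chunk counts as speech and resets the silence counter
            has_speech = True
            silent_count = 0
        elif has_speech:
            # Quiet chunks only count once speech has started
            silent_count += 1
        if has_speech and silent_count >= silence_chunks:
            return index
    return None


# 0.5s of room noise, 1s of speech, then 3s of quiet (100ms per chunk)
levels = [0.0] * 5 + [0.1] * 10 + [0.0] * 30
stop_index = find_stop_chunk(levels)
```

With 100ms chunks, `silence_chunks=20` means the recorder waits for two seconds of quiet after the last loud chunk. Leading silence never triggers a stop, because the counter only advances after speech has been detected.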

Building the TTS Module

Here's a unified TTS interface with three backends:

# tts/tts_engine.py
import os
import tempfile
from abc import ABC, abstractmethod
from typing import Optional
import threading


class TTSBase(ABC):
    @abstractmethod
    def speak(self, text: str):
        pass

    @abstractmethod
    def speak_async(self, text: str):
        pass


class Pyttsx3TTS(TTSBase):
    """Free, offline TTS. Works everywhere, sounds robotic."""

    def __init__(self, rate: int = 185, volume: float = 0.9):
        import pyttsx3
        self.engine = pyttsx3.init()
        self.engine.setProperty("rate", rate)
        self.engine.setProperty("volume", volume)

        # Use best available voice
        voices = self.engine.getProperty("voices")
        if voices:
            # Prefer female voice if available
            for voice in voices:
                if "female" in voice.name.lower() or "zira" in voice.name.lower():
                    self.engine.setProperty("voice", voice.id)
                    break

    def speak(self, text: str):
        self.engine.say(text)
        self.engine.runAndWait()

    def speak_async(self, text: str):
        thread = threading.Thread(target=self.speak, args=(text,))
        thread.start()


class OpenAITTS(TTSBase):
    """OpenAI TTS API. Good quality, pay per character."""

    def __init__(self, model: str = "tts-1-hd", voice: str = "alloy"):
        from openai import OpenAI
        self.client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
        self.model = model
        self.voice = voice  # alloy, echo, fable, onyx, nova, shimmer

    def speak(self, text: str):
        from playsound import playsound

        response = self.client.audio.speech.create(
            model=self.model,
            voice=self.voice,
            input=text,
        )

        with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as tmp:
            tmp.write(response.content)
            tmp_path = tmp.name

        try:
            playsound(tmp_path)
        finally:
            os.unlink(tmp_path)

    def speak_async(self, text: str):
        thread = threading.Thread(target=self.speak, args=(text,))
        thread.start()


class ElevenLabsTTS(TTSBase):
    """ElevenLabs TTS. Best quality, paid API."""

    def __init__(self, voice_id: str = "21m00Tcm4TlvDq8ikWAM"):
        from elevenlabs.client import ElevenLabs
        self.client = ElevenLabs(api_key=os.getenv("ELEVENLABS_API_KEY"))
        self.voice_id = voice_id
        # Popular voice IDs:
        # Rachel: 21m00Tcm4TlvDq8ikWAM
        # Domi: AZnzlk1XvdvUeBnXmlld
        # Bella: EXAVITQu4vr4xnSDxMaL

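    def speak(self, text: str):
        from playsound import playsound

        # Current SDK releases expose synthesis as text_to_speech.convert(),
        # which yields the audio as byte chunks (older releases used client.generate())
        audio = self.client.text_to_speech.convert(
            voice_id=self.voice_id,
            text=text,
            model_id="eleven_multilingual_v2",
        )

        audio_bytes = b"".join(audio)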

        with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as tmp:
            tmp.write(audio_bytes)
            tmp_path = tmp.name

        try:
            playsound(tmp_path)
        finally:
            os.unlink(tmp_path)

    def speak_async(self, text: str):
        thread = threading.Thread(target=self.speak, args=(text,))
        thread.start()


def get_tts_engine(engine: str = "pyttsx3") -> TTSBase:
    """Factory function to get TTS engine by name."""
    engines = {
        "pyttsx3": Pyttsx3TTS,
        "openai": OpenAITTS,
        "elevenlabs": ElevenLabsTTS,
    }
    if engine not in engines:
        raise ValueError(f"Unknown TTS engine: {engine}. Choose from {list(engines.keys())}")
    return engines[engine]()
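
Each `speak_async` above starts a new thread per call, so two calls in quick succession can talk over each other. One possible alternative is a single worker thread that drains a queue. The sketch below assumes only that the backend offers a blocking `speak(text)` callable; `SpeechQueue` is a hypothetical name, and whether a given engine tolerates being driven from a non-main thread is something to verify on your platform.

```python
import queue
import threading
from typing import Callable, Optional


class SpeechQueue:
    """Speak queued texts one at a time on a single background thread."""

    def __init__(self, speak: Callable[[str], None]):
        self._speak = speak
        self._queue: "queue.Queue[Optional[str]]" = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self) -> None:
        while True:
            text = self._queue.get()
            if text is None:  # sentinel: shut down the worker
                break
            try:
                self._speak(text)
            except Exception as exc:
                # Keep the worker alive if one utterance fails
                print(f"TTS error: {exc}")

    def say(self, text: str) -> None:
        """Queue text for speaking and return immediately."""
        self._queue.put(text)

    def close(self) -> None:
        """Finish everything already queued, then stop the worker."""
        self._queue.put(None)
        self._worker.join()


# Usage with a stand-in speak function; pass a real engine's speak method instead.
log = []
speech = SpeechQueue(log.append)
speech.say("First sentence.")
speech.say("Second sentence.")
speech.close()  # blocks until both have been spoken, in order
```

The queue preserves order, so replies are never interleaved, and the caller still gets control back immediately after `say`.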

TTS Engine Comparison

Engine	Quality	Cost	Latency	Offline	Languages
pyttsx3	Poor — robotic	Free	~50ms	Yes	Limited
OpenAI tts-1	Good	$0.015/1K chars	500-1500ms	No	57
OpenAI tts-1-hd	Very good	$0.030/1K chars	1000-2500ms	No	57
ElevenLabs standard	Excellent	$0.30/1K chars	800-2000ms	No	29
ElevenLabs turbo	Very good	$0.18/1K chars	300-600ms	No	32
Azure Neural TTS	Good	$0.016/1K chars	400-1200ms	No	140+

For development, use pyttsx3 to avoid API costs. Switch to OpenAI TTS or ElevenLabs when you care about the listening experience.
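
To turn the per-character rates into a budget, multiply by how much the agent actually says. This sketch copies the rates from the table above as-is; the engine keys and the `estimate_cost` helper are hypothetical names for illustration, and the usage figures are made-up inputs rather than measurements.

```python
# Rates in USD per 1,000 characters, copied from the comparison table above.
RATES_PER_1K_CHARS = {
    "pyttsx3": 0.0,
    "openai-tts-1": 0.015,
    "openai-tts-1-hd": 0.030,
    "elevenlabs-standard": 0.30,
    "elevenlabs-turbo": 0.18,
}


def estimate_cost(engine: str, chars_per_reply: int, replies: int) -> float:
    """Estimate spend in USD for a number of spoken replies."""
    total_chars = chars_per_reply * replies
    return RATES_PER_1K_CHARS[engine] * total_chars / 1000


# Example: 100 replies a day at roughly 400 characters each
daily = estimate_cost("openai-tts-1-hd", chars_per_reply=400, replies=100)
print(f"${daily:.2f} per day")
```

At those inputs, tts-1-hd comes to about $1.20 a day and ElevenLabs standard to about $12, which is why capping response length matters more as voice quality goes up.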

The Voice Agent Integration

Now wire the STT and TTS layers around an AutoGen agent:

# voice_agent.py
import autogen
import os
import re
from stt.whisper_stt import WhisperSTT
from tts.tts_engine import get_tts_engine

WAKE_WORDS = ["hey agent", "okay agent", "agent", "assistant"]
EXIT_PHRASES = ["stop", "quit", "exit", "goodbye", "that's all"]


class VoiceAgent:
    def __init__(
        self,
        tts_engine: str = "openai",
        whisper_model: str = "base",
        max_response_length: int = 500,
    ):
        self.stt = WhisperSTT(model_size=whisper_model)
        self.tts = get_tts_engine(tts_engine)
        self.max_response_length = max_response_length
        self.conversation_history = []

        # Set up AutoGen agent
        llm_config = {
            "config_list": [
                {"model": "gpt-4o", "api_key": os.getenv("OPENAI_API_KEY")}
            ],
            "temperature": 0.3,
        }

        self.assistant = autogen.AssistantAgent(
            name="VoiceAssistant",
            system_message="""You are a voice assistant. Your responses will be spoken aloud,
            so follow these rules:
            1. Keep responses concise — 2-4 sentences maximum
            2. Avoid markdown, bullets, numbered lists, and headers
            3. Speak in natural conversational language
            4. Do not use abbreviations that sound odd when read aloud (e.g., write "for example" not "e.g.")
            5. If a topic requires a long explanation, offer to break it into parts
            6. When you have answered fully, end with DONE""",
            llm_config=llm_config,
        )

        self.user_proxy = autogen.UserProxyAgent(
            name="VoiceUser",
            human_input_mode="NEVER",
            max_consecutive_auto_reply=3,
            is_termination_msg=lambda msg: "DONE" in (msg.get("content") or ""),
            code_execution_config=False,
        )

    def clean_for_speech(self, text: str) -> str:
        """Remove markdown and formatting that sounds bad when spoken."""
        # Remove markdown headers (only at the start of a line, so "C# code" survives)
        text = re.sub(r'^[ \t]*#+\s+', '', text, flags=re.MULTILINE)
        # Remove bold/italic
        text = re.sub(r'\*+([^*]+)\*+', r'\1', text)
        # Remove bullet points
        text = re.sub(r'^\s*[-*•]\s+', '', text, flags=re.MULTILINE)
        # Remove numbered lists
        text = re.sub(r'^\s*\d+\.\s+', '', text, flags=re.MULTILINE)
        # Remove code blocks
        text = re.sub(r'```[^`]*```', '[code block]', text)
        # Remove inline code
        text = re.sub(r'`([^`]+)`', r'\1', text)
        # Remove DONE marker
        text = text.replace("DONE", "").strip()
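        # Truncate at a sentence boundary if too long
        if len(text) > self.max_response_length:
            sentences = text.split(". ")
            short = ""
            for s in sentences:
                if len(short) + len(s) < self.max_response_length:
                    short += s + ". "
                else:
                    break
            # Fall back to a hard cut when even the first sentence is too long
            short = short.strip() or text[: self.max_response_length].rstrip() + "."
            text = short + " I can continue if you'd like."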

        return text

    def get_agent_response(self, user_input: str) -> str:
        """Get response from AutoGen agent."""
        self.user_proxy.initiate_chat(
            self.assistant,
            message=user_input,
            clear_history=False,  # Maintain conversation context
        )

        messages = self.assistant.chat_messages.get(self.user_proxy, [])
        for msg in reversed(messages):
            if msg.get("role") == "assistant" and msg.get("content"):
                return self.clean_for_speech(msg["content"])

        return "I'm sorry, I didn't get a response. Please try again."

    def run(self):
        """Main voice interaction loop."""
        self.tts.speak("Voice agent ready. Say 'agent' followed by your request.")
        print("Voice agent active. Press Ctrl+C to stop.")

        while True:
            try:
                # Listen for input
                transcript = self.stt.listen_and_transcribe()

                if not transcript:
                    continue

                print(f"You said: {transcript}")

                # Check exit conditions: match the whole utterance, so a request
                # like "how do I stop a container" is not treated as an exit
                if transcript.lower().strip(" .,!?") in EXIT_PHRASES:
                    self.tts.speak("Goodbye!")
                    break

                # Process the input. WAKE_WORDS is not enforced here, so every
                # transcript reaches the agent
                response = self.get_agent_response(transcript)
                print(f"Agent: {response}")
                self.tts.speak(response)

            except KeyboardInterrupt:
                self.tts.speak("Stopping voice agent.")
                break
            except Exception as e:
                print(f"Error: {e}")
                self.tts.speak("I encountered an error. Please try again.")


if __name__ == "__main__":
    agent = VoiceAgent(
        tts_engine="openai",    # or "pyttsx3" for free offline
        whisper_model="base",
        max_response_length=400,
    )
    agent.run()
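
`voice_agent.py` defines `WAKE_WORDS`, but `run()` never consults it. One way to gate requests on a wake word is sketched below, using only the standard library. The helper names `normalize`, `strip_wake_word`, and `is_exit` are hypothetical, and the approach assumes the wake word appears at the start of the transcript.

```python
import re
from typing import Optional

WAKE_WORDS = ["hey agent", "okay agent", "agent", "assistant"]
EXIT_PHRASES = ["stop", "quit", "exit", "goodbye", "that's all"]


def normalize(transcript: str) -> str:
    """Lowercase and replace punctuation with spaces, keeping apostrophes."""
    cleaned = re.sub(r"[^\w\s']", " ", transcript.lower())
    return " ".join(cleaned.split())


def strip_wake_word(transcript: str) -> Optional[str]:
    """Return the request after the wake word, or None if no wake word leads."""
    text = normalize(transcript)
    # Try longer phrases first so "hey agent" wins over "agent"
    for wake in sorted(WAKE_WORDS, key=len, reverse=True):
        if text == wake or text.startswith(wake + " "):
            return text[len(wake):].strip()
    return None


def is_exit(transcript: str) -> bool:
    """True only when the whole utterance is an exit phrase."""
    return normalize(transcript) in EXIT_PHRASES


request = strip_wake_word("Hey agent, what's the weather?")
```

In the main loop, a `None` result would mean "ignore this utterance". Note that the returned request is lowercased and has its punctuation removed, which is usually fine for an LLM prompt but worth knowing if you log transcripts.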

Running the Voice Agent

export OPENAI_API_KEY=sk-...
python voice_agent.py

Example interaction:

You: "What's the capital of Australia and what's it known for?"
Agent: "The capital of Australia is Canberra, chosen as a compromise between Sydney and Melbourne. It's known for the Australian War Memorial, the National Gallery, and Parliament House. Would you like more detail on any of these?"

This connects nicely to AI agents and the future of work — voice interfaces are one of the key ways agents will integrate into daily workflows rather than remaining purely developer tools.

Handling Long AutoGPT Responses

AutoGPT is designed for long-form output: research reports, code, detailed analysis. Voice doesn't work well with 2,000-word outputs. One fix is to tell the agent how to respond for each kind of request:

# Add to your agent system message:
"""When responding via voice:
- For factual questions: answer in 2-3 sentences
- For complex topics: give a 3-sentence summary, then ask if the user wants more detail
- For tasks (write code, create a document): confirm what you'll do, then say you've saved the output to the workspace
- Never read out full code or long documents"""
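
The prompt asks the agent to save long output rather than read it aloud, but it helps to enforce that in code as well. Here is a sketch of one way to do it, assuming a local workspace directory; `route_response`, the `response_N.txt` naming scheme, and the 400-character cutoff are all hypothetical choices for illustration.

```python
from pathlib import Path


def route_response(text: str, workspace: Path, max_spoken_chars: int = 400) -> str:
    """Return what to speak; write over-long output to the workspace instead."""
    if len(text) <= max_spoken_chars:
        return text
    workspace.mkdir(parents=True, exist_ok=True)
    # Number files by how many already exist: response_1.txt, response_2.txt, ...
    index = len(list(workspace.glob("response_*.txt"))) + 1
    path = workspace / f"response_{index}.txt"
    path.write_text(text, encoding="utf-8")
    return f"That answer is long, so I saved it to {path.name} in the workspace."
```

Calling `route_response(reply, Path("workspace"))` just before the TTS step keeps short answers spoken and sends anything longer to disk, regardless of whether the model followed the prompt.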

The Build AI chatbot Python guide has complementary patterns for managing response length in conversational contexts.

Production Considerations

For a voice agent you'll use daily, a few additional investments are worthwhile. A noise reduction step on microphone input can noticeably improve Whisper accuracy in noisy rooms. The noisereduce library (pip install noisereduce) does this in two lines:

import noisereduce as nr
audio_cleaned = nr.reduce_noise(y=audio, sr=sample_rate)

Wake word detection using pvporcupine from Picovoice makes the agent feel more natural: it still listens continuously, but it only hands audio to Whisper after you say a specific phrase, rather than transcribing everything it hears.

Building voice into your agent stack is one of those changes that makes the technology feel genuinely different to interact with. The technical components are all mature and well-documented. The main investment is tuning the response format so the agent sounds natural rather than reading markdown out loud.

Frequently Asked Questions

Does AutoGPT have built-in voice support?

Not natively in most AutoGPT forks. Voice support requires wrapping AutoGPT with a speech layer: Whisper for speech-to-text and a TTS engine for output. Some community forks include voice integrations, but they vary in quality and maintenance status.

Which TTS engine sounds most natural for AutoGPT responses?

Of the engines compared above, ElevenLabs sounds the most natural, but it bills by character usage. OpenAI TTS (tts-1-hd) is a strong middle ground, with high quality at lower cost. pyttsx3 is free and works offline but sounds robotic. For production use, ElevenLabs or OpenAI TTS is worth the cost.

How do I handle AutoGPT's long responses with text-to-speech?

Split long responses into sentences before passing them to TTS, then speak each sentence in order. NLTK's sent_tokenize or a simple split on sentence-ending punctuation works well. For very long outputs, add a voice command like 'summarize that' to get a shorter version before speaking it.
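
The sentence-splitting approach from the last answer can be sketched with the standard library alone. This version splits on whitespace that follows a period, exclamation mark, or question mark; `split_sentences` and `speak_in_sentences` are hypothetical helper names, and `speak` stands for any blocking TTS callable.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Split on whitespace that follows ., !, or ? and drop empty pieces."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def speak_in_sentences(text: str, speak: Callable[[str], None]) -> int:
    """Speak each sentence in order and return how many were spoken."""
    sentences = split_sentences(text)
    for sentence in sentences:
        speak(sentence)
    return len(sentences)


heard = []
count = speak_in_sentences(
    "Canberra is the capital. It sits between Sydney and Melbourne! Want more?",
    heard.append,
)
```

A plain regex will also split after abbreviations such as "Dr." or "e.g.", which is where NLTK's sent_tokenize does better. For short conversational replies the simple split is usually enough.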

Autogpt Autogen

How to Use AutoGPT with Voice (Speech-to-Text + TTS)

⚡ Quick Answer

Add voice control to AutoGPT with Whisper speech-to-text and ElevenLabs or pyttsx3 TTS. Build a conversational autonomous agent you talk to hands-free.

AiTechWorlds Team May 31, 2026 11 min read

#AutoGPT #voice interface #speech-to-text #text-to-speech

📚Part of the Autogpt Autogen guide — explore all Autogpt Autogen articles→

Share:Facebook Twitter/X LinkedIn Telegram WhatsApp

📱

Get more content like this on Telegram!

Daily AI tips, notes & resources — free

Join Free →

Architecture Overview

The voice pipeline has three distinct layers:

Microphone Input
     ↓
[Whisper STT] → Text transcript
     ↓
[AutoGPT/AutoGen Agent] → Text response
     ↓
[TTS Engine] → Audio output
     ↓
Speaker Output

Each layer is independently swappable. You can start with pyttsx3 for free offline TTS and upgrade to ElevenLabs later without changing the agent layer at all.

Installing Dependencies

# Core speech stack
pip install openai-whisper pyaudio sounddevice soundfile numpy

# TTS options (install what you need)
pip install pyttsx3           # Free, offline, robotic
pip install openai            # OpenAI TTS API
pip install elevenlabs        # ElevenLabs (best quality, paid)

# Agent framework
pip install pyautogen

# Audio utilities
pip install playsound pydub

On Windows, pyaudio often requires a pre-built wheel:

pip install pipwin
pipwin install pyaudio

On macOS, install portaudio first:

brew install portaudio
pip install pyaudio

Building the Speech-to-Text Module

Whisper is the gold standard for offline STT. It runs locally, handles accents well, and supports 99 languages:

# stt/whisper_stt.py
import whisper
import sounddevice as sd
import soundfile as sf
import numpy as np
import tempfile
import os
from typing import Optional


class WhisperSTT:
    def __init__(self, model_size: str = "base"):
        """
        model_size options: tiny, base, small, medium, large
        - tiny: fastest, least accurate (~39M params)
        - base: good balance for most use cases (~74M params)
        - small: better accuracy, still fast (~244M params)
        - medium: high accuracy, slower (~769M params)
        """
        print(f"Loading Whisper {model_size} model...")
        self.model = whisper.load_model(model_size)
        self.sample_rate = 16000  # Whisper expects 16kHz

    def record_audio(
        self,
        duration: int = 10,
        silence_threshold: float = 0.01,
        silence_duration: float = 2.0
    ) -> np.ndarray:
        """Record audio with automatic silence detection."""
        print("Listening... (speak now)")

        # Record raw audio
        audio = sd.rec(
            int(duration * self.sample_rate),
            samplerate=self.sample_rate,
            channels=1,
            dtype="float32",
        )
        sd.wait()

        return audio.flatten()

    def record_until_silence(
        self,
        max_duration: int = 30,
        silence_threshold: float = 0.005,
        silence_chunks: int = 20
    ) -> np.ndarray:
        """Record until the user stops speaking."""
        chunk_size = int(self.sample_rate * 0.1)  # 100ms chunks
        all_audio = []
        silent_count = 0
        has_speech = False

        print("Listening... (speak your command)")

        with sd.InputStream(
            samplerate=self.sample_rate,
            channels=1,
            dtype="float32",
            blocksize=chunk_size
        ) as stream:
            max_chunks = int(max_duration * 10)

            for _ in range(max_chunks):
                chunk, _ = stream.read(chunk_size)
                chunk_flat = chunk.flatten()
                all_audio.extend(chunk_flat)

                # Detect speech vs silence
                rms = np.sqrt(np.mean(chunk_flat ** 2))

                if rms > silence_threshold:
                    has_speech = True
                    silent_count = 0
                elif has_speech:
                    silent_count += 1

                # Stop after silence following speech
                if has_speech and silent_count >= silence_chunks:
                    break

        print("Processing speech...")
        return np.array(all_audio)

    def transcribe(self, audio: np.ndarray) -> str:
        """Transcribe audio array to text."""
        # Save to temp file (Whisper needs a file path)
        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
            sf.write(tmp.name, audio, self.sample_rate)
            tmp_path = tmp.name

        try:
            result = self.model.transcribe(
                tmp_path,
                language="en",
                fp16=False,  # CPU inference
                verbose=False,
            )
            transcript = result["text"].strip()
            return transcript
        finally:
            os.unlink(tmp_path)

    def listen_and_transcribe(self) -> str:
        """Full pipeline: record then transcribe."""
        audio = self.record_until_silence()
        if len(audio) < self.sample_rate * 0.5:  # Less than 0.5 seconds
            return ""
        return self.transcribe(audio)

Building the TTS Module

Here's a unified TTS interface with three backends:

# tts/tts_engine.py
import os
import tempfile
from abc import ABC, abstractmethod
from typing import Optional
import threading


class TTSBase(ABC):
    @abstractmethod
    def speak(self, text: str):
        pass

    @abstractmethod
    def speak_async(self, text: str):
        pass


class Pyttsx3TTS(TTSBase):
    """Free, offline TTS. Works everywhere, sounds robotic."""

    def __init__(self, rate: int = 185, volume: float = 0.9):
        import pyttsx3
        self.engine = pyttsx3.init()
        self.engine.setProperty("rate", rate)
        self.engine.setProperty("volume", volume)

        # Use best available voice
        voices = self.engine.getProperty("voices")
        if voices:
            # Prefer female voice if available
            for voice in voices:
                if "female" in voice.name.lower() or "zira" in voice.name.lower():
                    self.engine.setProperty("voice", voice.id)
                    break

    def speak(self, text: str):
        self.engine.say(text)
        self.engine.runAndWait()

    def speak_async(self, text: str):
        thread = threading.Thread(target=self.speak, args=(text,))
        thread.start()


class OpenAITTS(TTSBase):
    """OpenAI TTS API. Good quality, pay per character."""

    def __init__(self, model: str = "tts-1-hd", voice: str = "alloy"):
        from openai import OpenAI
        self.client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
        self.model = model
        self.voice = voice  # alloy, echo, fable, onyx, nova, shimmer

    def speak(self, text: str):
        from playsound import playsound

        response = self.client.audio.speech.create(
            model=self.model,
            voice=self.voice,
            input=text,
        )

        with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as tmp:
            tmp.write(response.content)
            tmp_path = tmp.name

        try:
            playsound(tmp_path)
        finally:
            os.unlink(tmp_path)

    def speak_async(self, text: str):
        thread = threading.Thread(target=self.speak, args=(text,))
        thread.start()


class ElevenLabsTTS(TTSBase):
    """ElevenLabs TTS. Best quality, paid API."""

    def __init__(self, voice_id: str = "21m00Tcm4TlvDq8ikWAM"):
        from elevenlabs import ElevenLabs
        self.client = ElevenLabs(api_key=os.getenv("ELEVENLABS_API_KEY"))
        self.voice_id = voice_id
        # Popular voice IDs:
        # Rachel: 21m00Tcm4TlvDq8ikWAM
        # Domi: AZnzlk1XvdvUeBnXmlld
        # Bella: EXAVITQu4vr4xnSDxMaL

    def speak(self, text: str):
        import io
        from playsound import playsound

        audio = self.client.generate(
            text=text,
            voice=self.voice_id,
            model="eleven_multilingual_v2",
        )

        audio_bytes = b"".join(audio)

        with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as tmp:
            tmp.write(audio_bytes)
            tmp_path = tmp.name

        try:
            playsound(tmp_path)
        finally:
            os.unlink(tmp_path)

    def speak_async(self, text: str):
        thread = threading.Thread(target=self.speak, args=(text,))
        thread.start()


def get_tts_engine(engine: str = "pyttsx3") -> TTSBase:
    """Factory function to get TTS engine by name."""
    engines = {
        "pyttsx3": Pyttsx3TTS,
        "openai": OpenAITTS,
        "elevenlabs": ElevenLabsTTS,
    }
    if engine not in engines:
        raise ValueError(f"Unknown TTS engine: {engine}. Choose from {list(engines.keys())}")
    return engines[engine]()

TTS Engine Comparison

Engine	Quality	Cost	Latency	Offline	Languages
pyttsx3	Poor — robotic	Free	~50ms	Yes	Limited
OpenAI tts-1	Good	$0.015/1K chars	500-1500ms	No	57
OpenAI tts-1-hd	Very good	$0.030/1K chars	1000-2500ms	No	57
ElevenLabs standard	Excellent	$0.30/1K chars	800-2000ms	No	29
ElevenLabs turbo	Very good	$0.18/1K chars	300-600ms	No	32
Azure Neural TTS	Good	$0.016/1K chars	400-1200ms	No	140+

For development, use pyttsx3 to avoid API costs. Switch to OpenAI TTS or ElevenLabs when you care about the listening experience.

The Voice Agent Integration

Now wire the STT and TTS layers around an AutoGen agent:

# voice_agent.py
import autogen
import os
import re
from stt.whisper_stt import WhisperSTT
from tts.tts_engine import get_tts_engine

WAKE_WORDS = ["hey agent", "okay agent", "agent", "assistant"]
EXIT_PHRASES = ["stop", "quit", "exit", "goodbye", "that's all"]


class VoiceAgent:
    def __init__(
        self,
        tts_engine: str = "openai",
        whisper_model: str = "base",
        max_response_length: int = 500,
    ):
        self.stt = WhisperSTT(model_size=whisper_model)
        self.tts = get_tts_engine(tts_engine)
        self.max_response_length = max_response_length
        self.conversation_history = []

        # Set up AutoGen agent
        llm_config = {
            "config_list": [
                {"model": "gpt-4o", "api_key": os.getenv("OPENAI_API_KEY")}
            ],
            "temperature": 0.3,
        }

        self.assistant = autogen.AssistantAgent(
            name="VoiceAssistant",
            system_message="""You are a voice assistant. Your responses will be spoken aloud,
            so follow these rules:
            1. Keep responses concise — 2-4 sentences maximum
            2. Avoid markdown, bullets, numbered lists, and headers
            3. Speak in natural conversational language
            4. Do not use abbreviations that sound odd when read aloud (e.g., write "for example" not "e.g.")
            5. If a topic requires a long explanation, offer to break it into parts
            6. When you have answered fully, end with DONE""",
            llm_config=llm_config,
        )

        self.user_proxy = autogen.UserProxyAgent(
            name="VoiceUser",
            human_input_mode="NEVER",
            max_consecutive_auto_reply=3,
            is_termination_msg=lambda msg: "DONE" in (msg.get("content") or ""),
            code_execution_config=False,
        )

    def clean_for_speech(self, text: str) -> str:
        """Remove markdown and formatting that sounds bad when spoken."""
        # Replace code blocks first so the rules below don't mangle their contents
        text = re.sub(r'```.*?```', '[code block]', text, flags=re.DOTALL)
        # Remove inline code markers
        text = re.sub(r'`([^`]+)`', r'\1', text)
        # Remove markdown headers (line start only, so "C# is" is left alone)
        text = re.sub(r'^\s*#+\s+', '', text, flags=re.MULTILINE)
        # Remove bullet points (before bold/italic, which also uses "*")
        text = re.sub(r'^\s*[-*•]\s+', '', text, flags=re.MULTILINE)
        # Remove numbered lists
        text = re.sub(r'^\s*\d+\.\s+', '', text, flags=re.MULTILINE)
        # Remove bold/italic (never across line breaks)
        text = re.sub(r'\*+([^*\n]+)\*+', r'\1', text)
        # Remove DONE marker
        text = text.replace("DONE", "").strip()
        # Truncate at a sentence boundary if too long
        if len(text) > self.max_response_length:
            short = ""
            for s in text.split(". "):
                if len(short) + len(s) < self.max_response_length:
                    short += s + ". "
                else:
                    break
            # Fall back to a hard cut when even the first sentence is too long
            short = short.strip() or text[:self.max_response_length].rstrip()
            text = short + " ...I can continue if you'd like."

        return text

    def get_agent_response(self, user_input: str) -> str:
        """Get response from AutoGen agent."""
        self.user_proxy.initiate_chat(
            self.assistant,
            message=user_input,
            clear_history=False,  # Maintain conversation context
        )

        messages = self.assistant.chat_messages.get(self.user_proxy, [])
        for msg in reversed(messages):
            if msg.get("role") == "assistant" and msg.get("content"):
                return self.clean_for_speech(msg["content"])

        return "I'm sorry, I didn't get a response. Please try again."

    def run(self):
        """Main voice interaction loop."""
        self.tts.speak("Voice agent ready. Say 'agent' followed by your request.")
        print("Voice agent active. Press Ctrl+C to stop.")

        while True:
            try:
                # Listen for input
                transcript = self.stt.listen_and_transcribe()

                if not transcript:
                    continue

                print(f"You said: {transcript}")

                # Check exit conditions. Match the whole utterance so that a
                # request like "how do I stop a timer" does not end the session.
                if transcript.lower().strip(" .,!?") in EXIT_PHRASES:
                    self.tts.speak("Goodbye!")
                    break

                # Process the input. Every transcript goes straight to the agent;
                # WAKE_WORDS is defined above but not checked in this loop.
                response = self.get_agent_response(transcript)
                print(f"Agent: {response}")
                self.tts.speak(response)

            except KeyboardInterrupt:
                self.tts.speak("Stopping voice agent.")
                break
            except Exception as e:
                print(f"Error: {e}")
                self.tts.speak("I encountered an error. Please try again.")


if __name__ == "__main__":
    agent = VoiceAgent(
        tts_engine="openai",    # or "pyttsx3" for free offline
        whisper_model="base",
        max_response_length=400,
    )
    agent.run()
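
The `run()` loop above sends every transcript to the agent, so the `WAKE_WORDS` list is never consulted. A minimal sketch of a text-based wake-word gate follows. `extract_command` is a hypothetical helper (not part of AutoGen or Whisper), and it assumes the transcript arrives as plain text with ordinary punctuation.

```python
import re

WAKE_WORDS = ["hey agent", "okay agent", "agent", "assistant"]


def extract_command(transcript, wake_words=WAKE_WORDS):
    """Return the request that follows a leading wake word, or None.

    Hypothetical helper for illustration. Longer wake words are tried
    first so "hey agent" is not consumed as the shorter "agent".
    """
    text = transcript.strip()
    for wake in sorted(wake_words, key=len, reverse=True):
        # The wake word must open the utterance and end on a word boundary;
        # trailing punctuation and spaces after it are skipped.
        pattern = rf"{re.escape(wake)}\b[\s,.!?:;-]*"
        match = re.match(pattern, text, flags=re.IGNORECASE)
        if match:
            return text[match.end():].strip()
    return None
```

Inside `run()`, you could call `command = extract_command(transcript)` and `continue` when it returns `None`, then pass `command` to `get_agent_response`. An empty string means the user said only the wake word. This still transcribes everything the microphone hears; it only filters what reaches the agent.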

Running the Voice Agent

export OPENAI_API_KEY=sk-...
python voice_agent.py

Example interaction:

You: "What's the capital of Australia and what's it known for?"
Agent: "The capital of Australia is Canberra, chosen as a compromise between Sydney and Melbourne. It's known for the Australian War Memorial, the National Gallery, and Parliament House. Would you like more detail on any of these?"

This ties into the broader topic of AI agents and the future of work: voice interfaces are one of the main ways agents will fit into daily workflows rather than staying purely developer tools.

Handling Long AutoGPT Responses

AutoGPT is designed for long-form output: research reports, code, detailed analysis. Voice doesn't work well with 2,000-word outputs. One fix is to add response-mode rules to the agent's system message, so the reply style depends on the kind of request:

# Append this to your agent's system message (the constant name is illustrative):
VOICE_RESPONSE_RULES = """When responding via voice:
- For factual questions: answer in 2-3 sentences
- For complex topics: give a 3-sentence summary, then ask if the user wants more detail
- For tasks (write code, create a document): confirm what you'll do, then say you've saved the output to the workspace
- Never read out full code or long documents"""
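
The rule about saving task output rather than reading it aloud can also be enforced in code instead of being left entirely to the model. The following is a rough sketch under stated assumptions: `prepare_spoken_reply` and the `response_NNN.md` naming scheme are hypothetical, the workspace is assumed to be a local directory, and the 400-character default simply mirrors the `max_response_length` used earlier.

```python
from pathlib import Path

FENCE = "`" * 3  # markdown code-fence marker


def prepare_spoken_reply(response, workspace, max_chars=400):
    """Return text that is safe to speak; park long or code-heavy output on disk.

    Hypothetical helper for illustration, not part of AutoGPT or AutoGen.
    """
    has_code = FENCE in response
    if not has_code and len(response) <= max_chars:
        return response  # short enough to read aloud as-is

    workspace = Path(workspace)
    workspace.mkdir(parents=True, exist_ok=True)
    # Number files sequentially based on how many are already saved.
    index = len(list(workspace.glob("response_*.md"))) + 1
    out_file = workspace / f"response_{index:03d}.md"
    out_file.write_text(response, encoding="utf-8")

    kind = "code" if has_code else "full answer"
    return f"I've saved the {kind} to {out_file.name} in the workspace."
```

You would call this on the raw agent reply before any speech cleanup, then pass the returned string to the TTS engine. The user hears a one-line confirmation and can open the saved file later.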

The Build AI chatbot Python guide has complementary patterns for managing response length in conversational contexts.

Production Considerations

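Background noise hurts transcription accuracy. The noisereduce package can clean a recording before it reaches Whisper:

import noisereduce as nr

# audio: 1-D NumPy array of recorded samples; sample_rate: its rate in Hz
audio_cleaned = nr.reduce_noise(y=audio, sr=sample_rate)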

Wake word detection using pvporcupine from Picovoice makes the agent feel more natural — it only activates when you say a specific phrase, rather than constantly recording.


