How does DSPy differ from manual prompt engineering?

In traditional prompt engineering, you write prompt text directly and evaluate its performance manually. DSPy (Declarative Self-improving Python) flips this: you define your pipeline as a program using typed signatures and module compositions, then compile the program using an optimizer (like BootstrapFewShot or MIPROv2) against a training set. DSPy writes the actual prompt text (few-shot examples, chain-of-thought instructions, formatting directives) based on what empirically works on your data. The key benefits: you're optimizing actual performance metrics rather than prompt aesthetics, the resulting prompts are selected by measured performance on your data rather than by intuition, and you can swap models without rewriting prompts (just recompile). The limitation: DSPy requires you to have labeled data and a scoring function, which not every application has.

What is the difference between prompt optimization and prompt tuning?

These terms are often confused but refer to different things. Prompt optimization (what this article covers) works in token/text space: it modifies the actual text of your prompt, selecting words and examples that improve task performance. No model weights are changed. Soft prompt tuning (often just called prompt tuning; prefix tuning is a closely related method) works in embedding space: it learns continuous vectors that are prepended to the input embeddings, with gradients updating only those vectors while the model weights stay frozen. The learned 'soft tokens' are not human-readable. Soft prompt tuning requires access to model internals and training infrastructure, while text-space optimization works with any black-box API. For most practitioners using hosted APIs (OpenAI, Anthropic, Google), text-space optimization is what's practical. Soft tuning becomes relevant when you have model weights and the task really needs more than prompting can provide.
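
To make the embedding-space side concrete, here is a toy sketch in pure Python. Everything in it is hypothetical: the frozen "model" is a single fixed weight vector applied to the mean of the input vectors, and `tune_soft_prompt` is an illustrative name. It shows only the core idea of soft prompt tuning: the model weights stay fixed while gradient steps update the vectors prepended to the input.

```python
def forward(weights, vectors):
    """Frozen toy model: dot product of weights with the mean input vector."""
    dim = len(weights)
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return sum(w * m for w, m in zip(weights, mean))


def tune_soft_prompt(weights, token_embeddings, target, n_soft=2, lr=0.5, steps=50):
    """Learn n_soft prepended vectors by gradient descent; weights never change."""
    dim = len(weights)
    soft = [[0.0] * dim for _ in range(n_soft)]
    total = n_soft + len(token_embeddings)
    losses = []
    for _ in range(steps):
        output = forward(weights, soft + token_embeddings)
        error = output - target
        losses.append(error ** 2)
        # d(loss)/d(soft[k][i]) = 2 * error * weights[i] / total
        for vec in soft:
            for i in range(dim):
                vec[i] -= lr * 2 * error * weights[i] / total
    return soft, losses


weights = [0.5, -1.0, 2.0]                              # frozen model parameters
token_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # stand-in for embedded input text
soft, losses = tune_soft_prompt(weights, token_embeddings, target=1.0)
print(f"loss before: {losses[0]:.4f}, after: {losses[-1]:.6f}")
```

The learned `soft` vectors are just lists of floats with no corresponding words, which is why soft prompts cannot be read or edited the way a text prompt can.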


Automatic Prompt Optimization: Using AI to Write Better Prompts

⚡ Quick Answer

Automatic prompt optimization uses AI to iteratively improve prompts without manual tuning. Learn DSPy, APE, and gradient-free optimization methods with real benchmarks.

Abdullah Al Arman Emon June 5, 2026 11 min read


A colleague of mine spent three days manually tuning a prompt for a classification task. Trying different phrasings, adding examples, adjusting the output format specification. He got from 71% to 79% accuracy. Then he ran DSPy on it for 40 minutes and it hit 86%.

That gap — 7 points of accuracy, three days of work vs. 40 minutes of compute — is why automatic prompt optimization is worth understanding properly.

The premise is straightforward: prompts are programs. Programs have parameters. Parameters can be optimized. The surprising part is how well this works in practice, and how little of the ML community outside NLP knows about it.

Why Manual Prompt Engineering Has a Ceiling

Manual prompt engineering relies on intuition and trial-and-error. You have a mental model of how the LLM interprets text, you adjust the prompt based on that model, you observe results, you update your intuition. This works, but it has hard limits.

Your intuitions about LLM behavior are derived from your own language understanding, not from the model's actual learned representations. Phrasing choices that seem equivalent to you may produce substantially different distributions of model outputs. The only way to know which phrasings work better is to measure them — but systematically measuring thousands of prompt variants manually is impractical.

Automatic prompt optimization does the measurement systematically. It explores the space of possible prompts, evaluates each on a held-out set, and converges toward higher-performing variants. It doesn't have your intuitions, but it has patience.
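
The measurement loop described above can be sketched without any API calls. In this minimal sketch, `toy_model` is a hypothetical stand-in for an LLM call (a keyword rule, so the example runs offline), and `select_best_prompt` is an illustrative name, not part of any library.

```python
from typing import Callable


def accuracy(prompt: str, examples: list[dict], call_model: Callable[[str, str], str]) -> float:
    """Fraction of examples where the model output equals the expected label."""
    hits = sum(call_model(prompt, ex["input"]) == ex["expected_output"] for ex in examples)
    return hits / len(examples)


def select_best_prompt(candidates, held_out, call_model):
    """Score every candidate on the same held-out set and return (score, prompt)."""
    scored = [(accuracy(p, held_out, call_model), p) for p in candidates]
    return max(scored, key=lambda pair: pair[0])


def toy_model(prompt: str, text: str) -> str:
    """Hypothetical stand-in for an LLM: handles negation only if the prompt asks for it."""
    negative = any(w in text.lower() for w in ("terrible", "waste"))
    if "negation" in prompt.lower() and "not " in text.lower():
        negative = not negative
    return "negative" if negative else "positive"


held_out = [
    {"input": "Complete waste of money.", "expected_output": "negative"},
    {"input": "Not terrible at all, I liked it.", "expected_output": "positive"},
    {"input": "Exceeded all my expectations.", "expected_output": "positive"},
]
candidates = [
    "Classify the sentiment.",
    "Classify the sentiment. Pay attention to negation.",
]
best_score, best_prompt = select_best_prompt(candidates, held_out, toy_model)
print(best_score, best_prompt)
```

The two candidates look nearly equivalent to a human reader, yet only the second gets the negated example right. That difference is visible only because both were scored on the same examples.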

Approaches to Automatic Prompt Optimization

Automatic Prompt Engineer (APE)

The original APE paper (Zhou et al., 2022) used a simple but effective approach: generate many candidate prompts using an LLM, evaluate them on a small labeled set, return the best one.

from openai import OpenAI

client = OpenAI()

def generate_candidate_prompts(
    task_description: str,
    input_output_examples: list[dict],
    n_candidates: int = 20
) -> list[str]:
    """Generate candidate prompts using an LLM."""
    
    examples_str = "\n".join([
        f"Input: {ex['input']}\nOutput: {ex['output']}"
        for ex in input_output_examples[:5]
    ])
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "You are a prompt engineer. Generate diverse, effective instruction prompts."
            },
            {
                "role": "user", 
                "content": f"""Generate {n_candidates} different instruction prompts for this task.

Task description: {task_description}

Example inputs and outputs:
{examples_str}

Generate {n_candidates} instruction prompts. Each should:
- Be complete and standalone
- Vary in style (direct, step-by-step, role-based, etc.)
- Focus on different aspects of the task

Output as a numbered list. One prompt per line."""
            }
        ],
        temperature=0.9,
    )
    
    raw = response.choices[0].message.content
    prompts = []
    for line in raw.split('\n'):
        line = line.strip()
        if line and (line[0].isdigit() or line.startswith('-')):
            # Remove numbering
            prompt = line.lstrip('0123456789.-) ').strip()
            if len(prompt) > 20:
                prompts.append(prompt)
    
    return prompts[:n_candidates]


def evaluate_prompt(
    prompt: str,
    eval_examples: list[dict],
    score_fn: callable,
    model: str = "gpt-4o-mini"
) -> float:
    """Evaluate a prompt on a set of examples."""
    scores = []
    
    for example in eval_examples:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": prompt},
                {"role": "user", "content": example["input"]}
            ],
            temperature=0.0,
        )
        
        prediction = response.choices[0].message.content
        score = score_fn(prediction, example["expected_output"])
        scores.append(score)
    
    return sum(scores) / len(scores)


def run_ape(
    task_description: str,
    train_examples: list[dict],
    eval_examples: list[dict],
    score_fn: callable,
    n_candidates: int = 20
) -> tuple[str, float]:
    """Run Automatic Prompt Engineer."""
    
    print(f"Generating {n_candidates} candidate prompts...")
    candidates = generate_candidate_prompts(task_description, train_examples, n_candidates)
    
    best_prompt = None
    best_score = -1
    
    for i, prompt in enumerate(candidates):
        score = evaluate_prompt(prompt, eval_examples, score_fn)
        print(f"Candidate {i+1}/{len(candidates)}: score={score:.3f}")
        
        if score > best_score:
            best_score = score
            best_prompt = prompt
    
    return best_prompt, best_score


# Example usage: sentiment classification
def exact_match_score(prediction: str, expected: str) -> float:
    pred_lower = prediction.lower().strip()
    exp_lower = expected.lower().strip()
    # Check if the expected label appears in the prediction
    return 1.0 if exp_lower in pred_lower else 0.0

train_data = [
    {"input": "This movie was absolutely terrible.", "output": "negative"},
    {"input": "I loved every minute of it!", "output": "positive"},
    {"input": "The product works as described.", "output": "neutral"},
    # ... more examples
]

eval_data = [
    {"input": "Complete waste of money.", "expected_output": "negative"},
    {"input": "Exceeded all my expectations.", "expected_output": "positive"},
    # ... more examples
]

best_prompt, score = run_ape(
    task_description="Classify customer sentiment as positive, negative, or neutral",
    train_examples=train_data,
    eval_examples=eval_data,
    score_fn=exact_match_score,
)
print(f"\nBest prompt (score={score:.3f}):\n{best_prompt}")

APE is conceptually simple and works well for focused classification tasks. Its weakness is that it only generates prompts once and doesn't iteratively refine them.
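
One fragile spot in the code above is the list parsing: `lstrip('0123456789.-) ')` strips every leading digit, so a candidate such as `2) 3 labels are allowed: ...` loses its leading `3`. A regex that removes only the list marker is one possible alternative. `parse_numbered_prompts` is a hypothetical helper name, and the sketch assumes the LLM returns one candidate per line.

```python
import re

# Matches a leading list marker such as "1.", "2)", "-", or "*" plus trailing spaces.
_LIST_MARKER = re.compile(r"^\s*(?:\d+[.)]|[-*])\s+")


def parse_numbered_prompts(raw: str, min_length: int = 20) -> list[str]:
    """Extract one prompt per list item, removing only the list marker itself."""
    prompts = []
    for line in raw.splitlines():
        if not _LIST_MARKER.match(line):
            continue  # skip headings, blank lines, and commentary
        prompt = _LIST_MARKER.sub("", line, count=1).strip()
        if len(prompt) > min_length:
            prompts.append(prompt)
    return prompts


raw = """Here are the prompts:
1. Classify the sentiment of the review as positive, negative, or neutral.
2) 3 labels are allowed: positive, negative, neutral. Pick exactly one.
- Too short
"""
parsed = parse_numbered_prompts(raw)
for p in parsed:
    print(p)
```

Because the parser is a pure function, it can be unit-tested on saved LLM outputs before any evaluation budget is spent.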

DSPy: The Right Abstraction for Production

DSPy (from Stanford NLP) takes a fundamentally different approach. Instead of optimizing prompt text directly, it optimizes programs made of composable LM modules.

import dspy
from dspy.teleprompt import BootstrapFewShot, MIPROv2

# Configure the LM
lm = dspy.LM("openai/gpt-4o-mini", temperature=0.0)
dspy.configure(lm=lm)

# Define your task using typed signatures
class SentimentClassifier(dspy.Signature):
    """Classify the sentiment of customer feedback."""
    feedback: str = dspy.InputField(desc="Customer feedback text")
    sentiment: str = dspy.OutputField(desc="One of: positive, negative, neutral")
    confidence: str = dspy.OutputField(desc="High, medium, or low confidence")

class SentimentPipeline(dspy.Module):
    def __init__(self):
        super().__init__()  # initialize dspy.Module before registering sub-modules
        self.classify = dspy.ChainOfThought(SentimentClassifier)
    
    def forward(self, feedback: str):
        return self.classify(feedback=feedback)

# Define your metric
def sentiment_metric(example, prediction, trace=None):
    return example.sentiment.lower() == prediction.sentiment.lower()

# Load training data
trainset = [
    dspy.Example(
        feedback="This is the best purchase I've made all year!",
        sentiment="positive"
    ).with_inputs("feedback"),
    dspy.Example(
        feedback="Broke after two days. Terrible quality.",
        sentiment="negative"
    ).with_inputs("feedback"),
    dspy.Example(
        feedback="It does what it says on the box.",
        sentiment="neutral"
    ).with_inputs("feedback"),
    # ... 50-100+ examples for good optimization
]

devset = trainset[int(len(trainset)*0.8):]  # 20% for evaluation
trainset = trainset[:int(len(trainset)*0.8)]

# Compile with BootstrapFewShot optimizer
optimizer = BootstrapFewShot(metric=sentiment_metric, max_bootstrapped_demos=4)
pipeline = SentimentPipeline()
compiled_pipeline = optimizer.compile(pipeline, trainset=trainset)

# Now use the compiled pipeline
result = compiled_pipeline(feedback="Shipping was slow but product quality is excellent.")
print(f"Sentiment: {result.sentiment}, Confidence: {result.confidence}")

# Save the compiled program (includes optimized prompts/examples)
compiled_pipeline.save("sentiment_classifier_v1.json")

The key insight in DSPy: BootstrapFewShot doesn't just select prompts. It runs your program on training examples, keeps the runs whose outputs pass your metric, and injects those as in-context demonstrations. It's finding working few-shot examples systematically rather than you choosing them by intuition.
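
As a rough mental model, the bootstrapping step looks something like the sketch below. This is a simplified illustration, not DSPy's actual implementation: `bootstrap_demos`, `toy_program`, and `label_match` are hypothetical names, and the real optimizer also records intermediate reasoning traces and supports a separate teacher program.

```python
from typing import Callable


def bootstrap_demos(
    program: Callable[[str], str],
    trainset: list[dict],
    metric: Callable[[dict, str], bool],
    max_demos: int = 4,
) -> list[dict]:
    """Run the program on training inputs and keep outputs that pass the metric."""
    demos = []
    for example in trainset:
        if len(demos) >= max_demos:
            break
        prediction = program(example["feedback"])
        if metric(example, prediction):
            # A passing run becomes an in-context demonstration.
            demos.append({"feedback": example["feedback"], "sentiment": prediction})
    return demos


def toy_program(feedback: str) -> str:
    """Hypothetical stand-in for an LM call."""
    return "negative" if "broke" in feedback.lower() else "positive"


def label_match(example: dict, prediction: str) -> bool:
    """Metric: the prediction must equal the gold label."""
    return example["sentiment"] == prediction


trainset = [
    {"feedback": "This is the best purchase I've made all year!", "sentiment": "positive"},
    {"feedback": "Broke after two days. Terrible quality.", "sentiment": "negative"},
    {"feedback": "It does what it says on the box.", "sentiment": "neutral"},
]
demos = bootstrap_demos(toy_program, trainset, label_match)
print(len(demos), [d["sentiment"] for d in demos])
```

The third example is dropped because the toy program gets it wrong, so only verified input/output pairs end up in the prompt.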

MIPROv2 for Harder Tasks

For more complex tasks, MIPROv2 (Multi-prompt Instruction Proposal Optimizer v2) is more powerful but more expensive:

from dspy.teleprompt import MIPROv2

# MIPROv2 generates instruction candidates AND selects few-shot examples
optimizer = MIPROv2(
    metric=sentiment_metric,
    auto="medium",  # "light", "medium", or "heavy" — controls how many candidates to try
    num_threads=4,
)

compiled_pipeline_v2 = optimizer.compile(
    SentimentPipeline(),
    trainset=trainset,
    valset=devset,
    max_bootstrapped_demos=4,
    max_labeled_demos=4,
)

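The "heavy" setting in MIPROv2 can run hundreds of candidate evaluations. Each evaluation calls the model once for every example it scores, so with a 100-example validation set the total can reach thousands of API calls, or tens of thousands if every candidate is scored on the full set. Budget accordingly.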
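
A back-of-the-envelope estimate helps with budgeting. The sketch below assumes one API call per example per candidate evaluation and ignores the calls spent proposing instructions and bootstrapping demos. `estimate_eval_calls` and `estimate_cost_usd` are hypothetical helpers, and the inputs are illustrative rather than measured.

```python
def estimate_eval_calls(n_evaluations: int, examples_per_evaluation: int, calls_per_example: int = 1) -> int:
    """Upper-bound estimate of API calls spent scoring candidates."""
    return n_evaluations * examples_per_evaluation * calls_per_example


def estimate_cost_usd(n_calls: int, tokens_per_call: int, usd_per_million_tokens: float) -> float:
    """Rough cost, treating input and output tokens at a single blended price."""
    return n_calls * tokens_per_call * usd_per_million_tokens / 1_000_000


# Illustrative inputs: 200 candidate evaluations, each on a 25-example minibatch.
calls = estimate_eval_calls(n_evaluations=200, examples_per_evaluation=25)
print(calls)  # 5000

# Hypothetical pricing and prompt size; substitute your provider's real numbers.
print(estimate_cost_usd(calls, tokens_per_call=800, usd_per_million_tokens=0.5))  # 2.0
```

Running the same arithmetic with your own validation-set size and per-token price before launching a run avoids surprises on the bill.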

Benchmark: Manual vs. APE vs. DSPy

These numbers are from running each method on a multi-class intent classification task (10 classes, 500 test examples) using GPT-4o-mini:

Method	Accuracy	Optimization Time	API Calls (optimization)	Ongoing Cost per Query
Baseline (no prompt)	61.2%	0	0	1x
Manual (3 iterations)	74.8%	~4 hours	~200	1x
APE (20 candidates)	78.3%	~25 min	120	1x
DSPy BootstrapFewShot	83.1%	~40 min	350	1.3x (longer prompt)
DSPy MIPROv2 (medium)	86.7%	~2 hours	1,200	1.4x
Fine-tuned GPT-4o-mini	89.1%	~6 hours + data	N/A	Similar

Results vary significantly by task type and model. These are indicative, not universal.
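
One way to gauge how much weight the gaps in this table can bear is a normal-approximation confidence interval for accuracy. The sketch assumes the 500 test examples are independent and uses the standard binomial standard error; `accuracy_interval` is an illustrative helper name.

```python
import math


def accuracy_interval(accuracy: float, n_examples: int, z: float = 1.96) -> tuple[float, float]:
    """95% normal-approximation interval for an accuracy measured on n_examples."""
    stderr = math.sqrt(accuracy * (1.0 - accuracy) / n_examples)
    return accuracy - z * stderr, accuracy + z * stderr


for name, acc in [
    ("APE (20 candidates)", 0.783),
    ("DSPy BootstrapFewShot", 0.831),
    ("DSPy MIPROv2 (medium)", 0.867),
]:
    low, high = accuracy_interval(acc, n_examples=500)
    print(f"{name}: {acc:.1%} (95% CI {low:.1%} to {high:.1%})")
```

With 500 examples the interval is roughly 3 to 4 points on either side at these accuracy levels, so the intervals for neighboring rows overlap. Treat the ranking between adjacent methods as suggestive rather than settled.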

DSPy MIPROv2 approaches fine-tuning quality without touching model weights — a striking result that held across the tasks I tested it on.

Gradient-Free Optimization with ProTeGi

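ProTeGi (Prompt Optimization with Textual Gradients), introduced in the paper "Automatic Prompt Optimization with 'Gradient Descent' and Beam Search" (Pryzant et al., 2023), is a gradient-free method that mimics gradient descent using natural language. The code below is a simplified, single-path version of the idea, without the paper's beam search: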

from openai import OpenAI

client = OpenAI()

def get_textual_gradient(
    prompt: str,
    failed_examples: list[dict],
    model: str = "gpt-4o"
) -> str:
    """Generate natural-language 'gradient' — critique of prompt failures."""
    
    failures_str = "\n\n".join([
        f"Input: {ex['input']}\nExpected: {ex['expected']}\nGot: {ex['got']}"
        for ex in failed_examples[:5]
    ])
    
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": f"""You are analyzing why a prompt is failing on certain examples.

Current prompt:
---
{prompt}
---

Examples where it failed:
{failures_str}

Provide a detailed critique:
1. Why is the prompt failing on these examples?
2. What specific changes would fix these failures?
3. What ambiguities in the prompt are causing errors?

Be specific and actionable."""
            }
        ]
    )
    return response.choices[0].message.content


def apply_textual_gradient(
    prompt: str,
    gradient: str,
    model: str = "gpt-4o"
) -> str:
    """Apply the textual gradient to improve the prompt."""
    
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": f"""You are improving an instruction prompt.

Current prompt:
---
{prompt}
---

Critique and suggested improvements:
---
{gradient}
---

Rewrite the prompt to address these issues. Keep what works, fix what doesn't.
Output only the improved prompt, no explanation."""
            }
        ]
    )
    return response.choices[0].message.content


def run_protegi(
    initial_prompt: str,
    train_examples: list[dict],
    eval_examples: list[dict],
    score_fn: callable,
    n_iterations: int = 5,
    model: str = "gpt-4o-mini"
) -> tuple[str, list[float]]:
    """Run ProTeGi-style textual gradient descent.

    Relies on evaluate_prompt() from the APE example. Every example dict
    needs "input" and "expected_output" keys. Returns the best-scoring
    prompt seen on eval_examples, plus the score at each step.
    """
    
    current_prompt = initial_prompt
    best_prompt, best_score = initial_prompt, float("-inf")
    score_history = []
    
    # One extra pass so the final rewrite is also scored before returning
    for iteration in range(n_iterations + 1):
        # Evaluate current prompt
        score = evaluate_prompt(current_prompt, eval_examples, score_fn, model)
        score_history.append(score)
        print(f"Iteration {iteration}: score={score:.3f}")
        
        # A rewrite can make things worse, so remember the best prompt so far
        if score > best_score:
            best_prompt, best_score = current_prompt, score
        
        if iteration == n_iterations:
            break
        
        # Find failed examples
        failed = []
        for ex in train_examples:
            response = client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": current_prompt},
                    {"role": "user", "content": ex["input"]}
                ],
                temperature=0.0,
            )
            prediction = response.choices[0].message.content
            if not score_fn(prediction, ex["expected_output"]):
                failed.append({
                    "input": ex["input"],
                    "expected": ex["expected_output"],
                    "got": prediction
                })
        
        if not failed:
            print("No failures found, stopping early.")
            break
        
        # Generate textual gradient
        gradient = get_textual_gradient(current_prompt, failed[:5])
        
        # Apply gradient to get improved prompt
        current_prompt = apply_textual_gradient(current_prompt, gradient)
    
    return best_prompt, score_history

ProTeGi is my preferred approach when I don't want to introduce DSPy as a dependency. It's understandable, debuggable, and often competitive with APE.

Practical Optimization Strategy

For most tasks, I'd recommend this progression:

1. Start with a manually written baseline to understand the failure modes. Don't skip this: the failure analysis is valuable.
2. Run APE for quick wins with 15-20 candidates if you need something fast.
3. Use DSPy BootstrapFewShot if you have 50+ labeled examples and need reproducible optimization.
4. Escalate to MIPROv2 only if BootstrapFewShot plateaus and the task is important enough to justify the compute.

The Prompt Engineering course has a module on building evaluation harnesses that are essential for any of these optimization approaches. You need a reliable scoring function before optimization is meaningful.

For reference on building the evaluation datasets these methods require, the RAG Retrieval Notes covers how to build labeled datasets from retrieved content.

What Optimization Can't Fix

Automatic prompt optimization (APO) maximizes your metric. If your metric is wrong, it maximizes the wrong thing. I've seen APO produce prompts that scored 94% on the eval set and were terrible in production, because the eval set didn't cover the distribution of real inputs.
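
The substring-based `exact_match_score` from the APE example is a small instance of this problem: a response that lists every label scores as correct. The sketch below contrasts that behavior with a stricter scorer. `strict_label_score` is a hypothetical name, and the label set is the one assumed in the sentiment example.

```python
LABELS = {"positive", "negative", "neutral"}


def substring_score(prediction: str, expected: str) -> float:
    """Lenient scorer: passes if the expected label appears anywhere in the output."""
    return 1.0 if expected.lower().strip() in prediction.lower() else 0.0


def strict_label_score(prediction: str, expected: str) -> float:
    """Strict scorer: the output, minus surrounding punctuation, must be exactly one label."""
    cleaned = prediction.lower().strip().strip(".!\"'")
    return 1.0 if cleaned in LABELS and cleaned == expected.lower().strip() else 0.0


hedged = "It could be positive, negative, or neutral."
print(substring_score(hedged, "negative"), strict_label_score(hedged, "negative"))
print(substring_score("Negative.", "negative"), strict_label_score("Negative.", "negative"))
```

An optimizer scored with the lenient function is rewarded for prompts that make the model hedge across all labels, which is exactly the kind of metric exploit that looks fine on the eval set and fails in production.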

A related failure: optimized prompts can become brittle. A 40-token instruction carefully tuned for your eval set may break completely on slightly different input distributions. Test optimized prompts on held-out data that wasn't used in optimization, and monitor performance in production.
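
A minimal sketch of that discipline, assuming your labeled data fits in a list: split once with a fixed seed, optimize only on the train and dev portions, and compare the dev score with the untouched test score afterwards. The helper names and the 5-point gap threshold are illustrative assumptions.

```python
import random


def three_way_split(examples: list, train_frac: float = 0.6, dev_frac: float = 0.2, seed: int = 0):
    """Shuffle with a fixed seed, then cut into train / dev / test portions."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_dev = int(len(shuffled) * dev_frac)
    return shuffled[:n_train], shuffled[n_train:n_train + n_dev], shuffled[n_train + n_dev:]


def looks_overfit(dev_score: float, test_score: float, max_gap: float = 0.05) -> bool:
    """Flag a prompt whose dev score exceeds its held-out test score by more than max_gap."""
    return (dev_score - test_score) > max_gap


data = [{"input": f"example {i}", "expected_output": "neutral"} for i in range(100)]
train, dev, test = three_way_split(data)
print(len(train), len(dev), len(test))  # 60 20 20
print(looks_overfit(dev_score=0.94, test_score=0.81))  # True
```

The test portion should be scored once, after optimization is finished. If it is consulted repeatedly to choose between prompts, it becomes part of the optimization and stops being held out.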

There's also a readability concern. DSPy-compiled prompts with bootstrapped few-shot examples can be long and opaque. The few-shot examples that were empirically best might be confusing to a human reading the prompt. This matters when you need to audit, explain, or modify the prompt later.

The Advanced Prompting Quiz tests your ability to identify when optimization has overfit — a useful skill to develop before deploying optimized prompts to production.

Automatic prompt optimization won't replace prompt engineering — you still need to understand the task, define good metrics, and interpret results. What it replaces is the tedious trial-and-error phase. Write a reasonable starting prompt, build an evaluation set, run the optimizer, then use the time you saved to think about the things the optimizer can't measure.

That's usually the more interesting work anyway.


Frequently Asked Questions

What is automatic prompt optimization, and when should you use it?

Automatic prompt optimization (APO) uses algorithms (often LLMs themselves) to iteratively improve prompt performance on a task, replacing manual prompt engineering. You should use it when: you have a well-defined task with measurable outputs (you can score responses as correct/incorrect or rate them on a rubric), you have at least 50-100 labeled examples to optimize against, and you're doing enough inference that the optimization investment pays off. APO is less useful for open-ended creative tasks where 'better' is hard to measure, for one-off queries, or when you need a specific behavioral style that's hard to capture in a metric. The core tradeoff: APO finds prompts that maximize your metric, but it can overfit to your evaluation set and miss qualitative properties you care about but didn't measure.

Abdullah Al Arman Emon✓ Verified Writer

Software Testing Expert & Prompt Engineering

Ensures every release is bug-free through rigorous testing, and crafts high-precision prompts that power our AI-driven workflows. Abdullah Al Arman Emon leads QA and prompt engineering across AiTechWorlds.


Advanced Prompting

Automatic Prompt Optimization: Using AI to Write Better Prompts

⚡ Quick Answer

Automatic prompt optimization uses AI to iteratively improve prompts without manual tuning. Learn DSPy, APE, and gradient-free optimization methods with real benchmarks.

Abdullah Al Arman Emon June 5, 2026 11 min read

#automatic-prompt-optimization #dspy #prompt-tuning #prompt-engineering

📚Part of the Advanced Prompting guide — explore all Advanced Prompting articles→

Share:Facebook Twitter/X LinkedIn Telegram WhatsApp

📱

Get more content like this on Telegram!

Daily AI tips, notes & resources — free

Join Free →

Automatic Prompt Optimization: Using AI to Write Better Prompts

That gap — 7 points of accuracy, three days of work vs. 40 minutes of compute — is why automatic prompt optimization is worth understanding properly.

Why Manual Prompt Engineering Has a Ceiling

Approaches to Automatic Prompt Optimization

Automatic Prompt Engineer (APE)

The original APE paper (Zhou et al., 2022) used a simple but effective approach: generate many candidate prompts using an LLM, evaluate them on a small labeled set, return the best one.

from openai import OpenAI
import random

client = OpenAI()

def generate_candidate_prompts(
    task_description: str,
    input_output_examples: list[dict],
    n_candidates: int = 20
) -> list[str]:
    """Generate candidate prompts using an LLM."""
    
    examples_str = "\n".join([
        f"Input: {ex['input']}\nOutput: {ex['output']}"
        for ex in input_output_examples[:5]
    ])
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "You are a prompt engineer. Generate diverse, effective instruction prompts."
            },
            {
                "role": "user", 
                "content": f"""Generate {n_candidates} different instruction prompts for this task.

Task description: {task_description}

Example inputs and outputs:
{examples_str}

Generate {n_candidates} instruction prompts. Each should:
- Be complete and standalone
- Vary in style (direct, step-by-step, role-based, etc.)
- Focus on different aspects of the task

Output as a numbered list. One prompt per line."""
            }
        ],
        temperature=0.9,
    )
    
    raw = response.choices[0].message.content
    prompts = []
    for line in raw.split('\n'):
        line = line.strip()
        if line and (line[0].isdigit() or line.startswith('-')):
            # Remove numbering
            prompt = line.lstrip('0123456789.-) ').strip()
            if len(prompt) > 20:
                prompts.append(prompt)
    
    return prompts[:n_candidates]


def evaluate_prompt(
    prompt: str,
    eval_examples: list[dict],
    score_fn: callable,
    model: str = "gpt-4o-mini"
) -> float:
    """Evaluate a prompt on a set of examples."""
    scores = []
    
    for example in eval_examples:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": prompt},
                {"role": "user", "content": example["input"]}
            ],
            temperature=0.0,
        )
        
        prediction = response.choices[0].message.content
        score = score_fn(prediction, example["expected_output"])
        scores.append(score)
    
    return sum(scores) / len(scores)


def run_ape(
    task_description: str,
    train_examples: list[dict],
    eval_examples: list[dict],
    score_fn: callable,
    n_candidates: int = 20
) -> tuple[str, float]:
    """Run Automatic Prompt Engineer."""
    
    print(f"Generating {n_candidates} candidate prompts...")
    candidates = generate_candidate_prompts(task_description, train_examples, n_candidates)
    
    best_prompt = None
    best_score = -1
    
    for i, prompt in enumerate(candidates):
        score = evaluate_prompt(prompt, eval_examples, score_fn)
        print(f"Candidate {i+1}/{len(candidates)}: score={score:.3f}")
        
        if score > best_score:
            best_score = score
            best_prompt = prompt
    
    return best_prompt, best_score


# Example usage: sentiment classification
def exact_match_score(prediction: str, expected: str) -> float:
    pred_lower = prediction.lower().strip()
    exp_lower = expected.lower().strip()
    # Check if the expected label appears in the prediction
    return 1.0 if exp_lower in pred_lower else 0.0

train_data = [
    {"input": "This movie was absolutely terrible.", "output": "negative"},
    {"input": "I loved every minute of it!", "output": "positive"},
    {"input": "The product works as described.", "output": "neutral"},
    # ... more examples
]

eval_data = [
    {"input": "Complete waste of money.", "expected_output": "negative"},
    {"input": "Exceeded all my expectations.", "expected_output": "positive"},
    # ... more examples
]

best_prompt, score = run_ape(
    task_description="Classify customer sentiment as positive, negative, or neutral",
    train_examples=train_data,
    eval_examples=eval_data,
    score_fn=exact_match_score,
)
print(f"\nBest prompt (score={score:.3f}):\n{best_prompt}")

APE is conceptually simple and works well for focused classification tasks. Its weakness is that it only generates prompts once and doesn't iteratively refine them.

DSPy: The Right Abstraction for Production

DSPy (from Stanford NLP) takes a fundamentally different approach. Instead of optimizing prompt text directly, it optimizes programs made of composable LM modules.

import dspy
from dspy.teleprompt import BootstrapFewShot, MIPROv2

# Configure the LM
lm = dspy.LM("openai/gpt-4o-mini", temperature=0.0)
dspy.configure(lm=lm)

# Define your task using typed signatures
class SentimentClassifier(dspy.Signature):
    """Classify the sentiment of customer feedback."""
    feedback: str = dspy.InputField(desc="Customer feedback text")
    sentiment: str = dspy.OutputField(desc="One of: positive, negative, neutral")
    confidence: str = dspy.OutputField(desc="High, medium, or low confidence")

class SentimentPipeline(dspy.Module):
    def __init__(self):
        self.classify = dspy.ChainOfThought(SentimentClassifier)
    
    def forward(self, feedback: str):
        return self.classify(feedback=feedback)

# Define your metric
def sentiment_metric(example, prediction, trace=None):
    return example.sentiment.lower() == prediction.sentiment.lower()

# Load training data
trainset = [
    dspy.Example(
        feedback="This is the best purchase I've made all year!",
        sentiment="positive"
    ).with_inputs("feedback"),
    dspy.Example(
        feedback="Broke after two days. Terrible quality.",
        sentiment="negative"
    ).with_inputs("feedback"),
    dspy.Example(
        feedback="It does what it says on the box.",
        sentiment="neutral"
    ).with_inputs("feedback"),
    # ... 50-100+ examples for good optimization
]

devset = trainset[int(len(trainset)*0.8):]  # 20% for evaluation
trainset = trainset[:int(len(trainset)*0.8)]

# Compile with BootstrapFewShot optimizer
optimizer = BootstrapFewShot(metric=sentiment_metric, max_bootstrapped_demos=4)
pipeline = SentimentPipeline()
compiled_pipeline = optimizer.compile(pipeline, trainset=trainset)

# Now use the compiled pipeline
result = compiled_pipeline(feedback="Shipping was slow but product quality is excellent.")
print(f"Sentiment: {result.sentiment}, Confidence: {result.confidence}")

# Save the compiled program (includes optimized prompts/examples)
compiled_pipeline.save("sentiment_classifier_v1.json")

MIPROv2 for Harder Tasks

For more complex tasks, MIPROv2 (Multi-prompt Instruction Proposal Optimizer v2) is more powerful but more expensive:

from dspy.teleprompt import MIPROv2

# MIPROv2 generates instruction candidates AND selects few-shot examples
optimizer = MIPROv2(
    metric=sentiment_metric,
    auto="medium",  # "light", "medium", or "heavy" — controls how many candidates to try
    num_threads=4,
)

compiled_pipeline_v2 = optimizer.compile(
    SentimentPipeline(),
    trainset=trainset,
    valset=devset,
    max_bootstrapped_demos=4,
    max_labeled_demos=4,
)

The "heavy" setting in MIPROv2 can run hundreds of candidate evaluations. For a 100-example validation set, this means hundreds of API calls. Budget accordingly.

Benchmark: Manual vs. APE vs. DSPy

These numbers are from running each method on a multi-class intent classification task (10 classes, 500 test examples) using GPT-4o-mini:

Method	Accuracy	Optimization Time	API Calls (optimization)	Ongoing Cost per Query
Baseline (no prompt)	61.2%	0	0	1x
Manual (3 iterations)	74.8%	~4 hours	~200	1x
APE (20 candidates)	78.3%	~25 min	120	1x
DSPy BootstrapFewShot	83.1%	~40 min	350	1.3x (longer prompt)
DSPy MIPROv2 (medium)	86.7%	~2 hours	1,200	1.4x
Fine-tuned GPT-4o-mini	89.1%	~6 hours + data	N/A	Similar

Results vary significantly by task type and model. These are indicative, not universal.

DSPy MIPROv2 approaches fine-tuning quality without touching model weights — a striking result that held across the tasks I tested it on.

Gradient-Free Optimization with ProTeGi

ProTeGi (Automatic Prompt Optimization with "Gradient Descent" through Textual Feedback) is a gradient-free method that mimics gradient descent using natural language:

from openai import OpenAI

client = OpenAI()

def get_textual_gradient(
    prompt: str,
    failed_examples: list[dict],
    model: str = "gpt-4o"
) -> str:
    """Generate natural-language 'gradient' — critique of prompt failures."""
    
    failures_str = "\n\n".join([
        f"Input: {ex['input']}\nExpected: {ex['expected']}\nGot: {ex['got']}"
        for ex in failed_examples[:5]
    ])
    
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": f"""You are analyzing why a prompt is failing on certain examples.

Current prompt:
---
{prompt}
---

Examples where it failed:
{failures_str}

Provide a detailed critique:
1. Why is the prompt failing on these examples?
2. What specific changes would fix these failures?
3. What ambiguities in the prompt are causing errors?

Be specific and actionable."""
            }
        ]
    )
    return response.choices[0].message.content


def apply_textual_gradient(
    prompt: str,
    gradient: str,
    model: str = "gpt-4o"
) -> str:
    """Apply the textual gradient to improve the prompt."""
    
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": f"""You are improving an instruction prompt.

Current prompt:
---
{prompt}
---

Critique and suggested improvements:
---
{gradient}
---

Rewrite the prompt to address these issues. Keep what works, fix what doesn't.
Output only the improved prompt, no explanation."""
            }
        ]
    )
    return response.choices[0].message.content


def run_protegi(
    initial_prompt: str,
    train_examples: list[dict],
    eval_examples: list[dict],
    score_fn: callable,
    n_iterations: int = 5,
    model: str = "gpt-4o-mini"
) -> tuple[str, list[float]]:
    """Run ProTeGi-style textual gradient descent."""
    
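    current_prompt = initial_prompt
    best_prompt, best_score = initial_prompt, float("-inf")
    score_history = []
    
    # One extra pass so the final rewrite is scored before returning
    for iteration in range(n_iterations + 1):
        # Evaluate current prompt (evaluate_prompt is not defined in this
        # snippet; it is assumed to be the scoring helper used earlier)
        score = evaluate_prompt(current_prompt, eval_examples, score_fn, model)
        score_history.append(score)
        print(f"Iteration {iteration}: score={score:.3f}")
        
        # A rewrite can make things worse, so remember the best prompt seen
        if score > best_score:
            best_prompt, best_score = current_prompt, score
        if iteration == n_iterations:
            break
        
        # Find failed examples on the training set
        failed = []
        for ex in train_examples:
            response = client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": current_prompt},
                    {"role": "user", "content": ex["input"]}
                ],
                temperature=0.0,
            )
            prediction = response.choices[0].message.content
            if not score_fn(prediction, ex["expected_output"]):
                failed.append({
                    "input": ex["input"],
                    "expected": ex["expected_output"],
                    "got": prediction
                })
        
        if not failed:
            print("No failures found, stopping early.")
            break
        
        # Generate textual gradient from up to five failures
        gradient = get_textual_gradient(current_prompt, failed[:5])
        
        # Apply gradient to get improved prompt
        current_prompt = apply_textual_gradient(current_prompt, gradient)
    
    # Return the best-scoring prompt, not merely the last rewrite
    return best_prompt, score_history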

ProTeGi is my preferred approach when I don't want to introduce DSPy as a dependency. It's understandable, debuggable, and often competitive with APE.
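
The loop above expects `score_fn(prediction, expected)` to return something truthy on success, and each example to carry `"input"` and `"expected_output"` keys. Here is a minimal sketch of a compatible scorer for label-style tasks. The normalization rules and the sample data are assumptions for illustration, not part of ProTeGi itself.

```python
import string


def exact_match(prediction: str, expected: str) -> bool:
    """Exact match that ignores case, surrounding whitespace, and edge punctuation."""
    def normalize(text: str) -> str:
        return text.strip().strip(string.punctuation).strip().lower()
    return normalize(prediction) == normalize(expected)


# Hypothetical examples in the shape run_protegi reads
train_examples = [
    {"input": "Where is my refund?", "expected_output": "billing"},
    {"input": "The app crashes on login.", "expected_output": "technical_support"},
]

if __name__ == "__main__":
    print(exact_match(" Billing.\n", "billing"))
    print(exact_match("billing", "technical_support"))
    # With an API key configured, the call would look like:
    # final_prompt, history = run_protegi(
    #     "Classify the user's intent.", train_examples, train_examples, exact_match
    # )
```

Lenient normalization matters here: without it, a correct label followed by a period counts as a failure and pollutes the critique with formatting noise. In practice the evaluation examples should be a separate held-out set rather than the training list reused as in the commented call.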

Practical Optimization Strategy

For most tasks, I'd recommend this progression:

Start with a manually written baseline to understand the failure modes. Don't skip this — the failure analysis is valuable.
Run APE for quick wins with 15-20 candidates if you need something fast.
Use DSPy BootstrapFewShot if you have 50+ labeled examples and need reproducible optimization.
Escalate to MIPROv2 only if BootstrapFewShot plateaus and the task is important enough to justify the compute.
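
The progression above can be sketched as a small decision helper. This is one reading of the list rather than a rule: the function name, its parameters, and the exact branching are mine, and only the 50-example threshold comes from the text.

```python
def next_step(
    has_baseline: bool,
    num_labeled: int,
    need_quick_win: bool = False,
    bootstrap_plateaued: bool = False,
    worth_compute: bool = False,
) -> str:
    """Suggest the next optimization step, following the progression in the list."""
    if not has_baseline:
        # Failure analysis on a hand-written prompt comes first
        return "manual baseline"
    if bootstrap_plateaued:
        # Escalate only when the task justifies the extra compute
        return "MIPROv2" if worth_compute else "keep BootstrapFewShot result"
    if need_quick_win or num_labeled < 50:
        return "APE (15-20 candidates)"
    return "BootstrapFewShot"


if __name__ == "__main__":
    print(next_step(has_baseline=False, num_labeled=0))
    print(next_step(has_baseline=True, num_labeled=30))
    print(next_step(has_baseline=True, num_labeled=200))
    print(next_step(has_baseline=True, num_labeled=200,
                    bootstrap_plateaued=True, worth_compute=True))
```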

For guidance on building the evaluation datasets these methods require, the RAG Retrieval Notes cover how to build labeled datasets from retrieved content.

What Optimization Can't Fix

The Advanced Prompting Quiz tests your ability to identify when optimization has overfit — a useful skill to develop before deploying optimized prompts to production.

That's usually the more interesting work anyway.


Abdullah Al Arman Emon (Verified Writer)

Software Testing Expert & Prompt Engineering
