Automatic Prompt Optimization: Using AI to Write Better Prompts
Automatic prompt optimization uses AI to iteratively improve prompts without manual tuning. Learn DSPy, APE, and gradient-free optimization methods with real benchmarks.
A colleague of mine spent three days manually tuning a prompt for a classification task. Trying different phrasings, adding examples, adjusting the output format specification. He got from 71% to 79% accuracy. Then he ran DSPy on it for 40 minutes and it hit 86%.
That gap (7 points of accuracy, three days of work versus 40 minutes of compute) is why automatic prompt optimization is worth understanding properly.
The premise is straightforward: prompts are programs. Programs have parameters. Parameters can be optimized. The surprising part is how well this works in practice, and how little of the ML community outside NLP knows about it.
Why Manual Prompt Engineering Has a Ceiling
Manual prompt engineering relies on intuition and trial-and-error. You have a mental model of how the LLM interprets text, you adjust the prompt based on that model, you observe results, you update your intuition. This works, but it has hard limits.
Your intuitions about LLM behavior are derived from your own language understanding, not from the model's actual learned representations. Phrasing choices that seem equivalent to you may produce substantially different distributions of model outputs. The only way to know which phrasings work better is to measure them, but systematically measuring thousands of prompt variants manually is impractical.
Automatic prompt optimization does the measurement systematically. It explores the space of possible prompts, evaluates each on a held-out set, and converges toward higher-performing variants. It doesn't have your intuitions, but it has patience.
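The loop described above can be sketched in a few lines. This is a minimal illustration, not any library's implementation: `select_best_prompt` and `toy_predict` are hypothetical names, and `toy_predict` is a rule-based stand-in for a real LLM call so the example runs offline.

```python
from typing import Callable, Sequence


def select_best_prompt(
    candidates: Sequence[str],
    examples: Sequence[tuple[str, str]],
    predict: Callable[[str, str], str],
) -> tuple[str, float]:
    """Score every candidate prompt on the same examples and return the best one."""
    best_prompt, best_score = "", -1.0
    for prompt in candidates:
        correct = sum(predict(prompt, text) == label for text, label in examples)
        score = correct / len(examples)
        if score > best_score:
            best_prompt, best_score = prompt, score
    return best_prompt, best_score


def toy_predict(prompt: str, text: str) -> str:
    """Hypothetical stand-in for an LLM: it only answers well if the prompt names the labels."""
    if "positive or negative" not in prompt:
        return "unknown"
    return "negative" if "bad" in text else "positive"


held_out = [("bad service", "negative"), ("great food", "positive")]
candidates = ["Classify this.", "Answer positive or negative."]
best, score = select_best_prompt(candidates, held_out, toy_predict)
print(best, score)
```

Every method in this article is a variation on this loop. What changes is how candidates are proposed and how much of the program (instructions, examples, or both) is being searched over.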
Approaches to Automatic Prompt Optimization
Automatic Prompt Engineer (APE)
The original APE paper (Zhou et al., 2022) used a simple but effective approach: generate many candidate prompts using an LLM, evaluate them on a small labeled set, return the best one.
from openai import OpenAI

# Reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()
def generate_candidate_prompts(
    task_description: str,
    input_output_examples: list[dict],
    n_candidates: int = 20
) -> list[str]:
    """Generate candidate prompts using an LLM."""
    examples_str = "\n".join([
        f"Input: {ex['input']}\nOutput: {ex['output']}"
        for ex in input_output_examples[:5]
    ])

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "You are a prompt engineer. Generate diverse, effective instruction prompts."
            },
            {
                "role": "user",
                "content": f"""Generate {n_candidates} different instruction prompts for this task.

Task description: {task_description}

Example inputs and outputs:
{examples_str}

Generate {n_candidates} instruction prompts. Each should:
- Be complete and standalone
- Vary in style (direct, step-by-step, role-based, etc.)
- Focus on different aspects of the task

Output as a numbered list. One prompt per line."""
            }
        ],
        temperature=0.9,
    )

    raw = response.choices[0].message.content
    prompts = []
    for line in raw.split('\n'):
        line = line.strip()
        if line and (line[0].isdigit() or line.startswith('-')):
            # Remove numbering
            prompt = line.lstrip('0123456789.-) ').strip()
            if len(prompt) > 20:
                prompts.append(prompt)
    return prompts[:n_candidates]
def evaluate_prompt(
    prompt: str,
    eval_examples: list[dict],
    score_fn: callable,
    model: str = "gpt-4o-mini"
) -> float:
    """Evaluate a prompt on a set of examples."""
    scores = []
    for example in eval_examples:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": prompt},
                {"role": "user", "content": example["input"]}
            ],
            temperature=0.0,
        )
        prediction = response.choices[0].message.content
        score = score_fn(prediction, example["expected_output"])
        scores.append(score)
    # Guard against an empty evaluation set
    return sum(scores) / len(scores) if scores else 0.0
def run_ape(
    task_description: str,
    train_examples: list[dict],
    eval_examples: list[dict],
    score_fn: callable,
    n_candidates: int = 20
) -> tuple[str, float]:
    """Run Automatic Prompt Engineer."""
    print(f"Generating {n_candidates} candidate prompts...")
    candidates = generate_candidate_prompts(task_description, train_examples, n_candidates)

    best_prompt = None
    best_score = -1
    for i, prompt in enumerate(candidates):
        score = evaluate_prompt(prompt, eval_examples, score_fn)
        print(f"Candidate {i+1}/{len(candidates)}: score={score:.3f}")
        if score > best_score:
            best_score = score
            best_prompt = prompt
    return best_prompt, best_score
# Example usage: sentiment classification
def exact_match_score(prediction: str, expected: str) -> float:
    pred_lower = prediction.lower().strip()
    exp_lower = expected.lower().strip()
    # Check if the expected label appears in the prediction
    return 1.0 if exp_lower in pred_lower else 0.0

train_data = [
    {"input": "This movie was absolutely terrible.", "output": "negative"},
    {"input": "I loved every minute of it!", "output": "positive"},
    {"input": "The product works as described.", "output": "neutral"},
    # ... more examples
]

eval_data = [
    {"input": "Complete waste of money.", "expected_output": "negative"},
    {"input": "Exceeded all my expectations.", "expected_output": "positive"},
    # ... more examples
]

best_prompt, score = run_ape(
    task_description="Classify customer sentiment as positive, negative, or neutral",
    train_examples=train_data,
    eval_examples=eval_data,
    score_fn=exact_match_score,
)
print(f"\nBest prompt (score={score:.3f}):\n{best_prompt}")
APE is conceptually simple and works well for focused classification tasks. The weakness of this basic version is that it generates prompts once and never iteratively refines them.
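One way to add refinement is to keep the top-scoring candidates and ask for variants of them over several rounds, in the spirit of the iterative resampling the APE paper also describes. The sketch below is an illustration under stated assumptions: `iterative_search`, `toy_score`, and `toy_variants` are hypothetical names, and the two toy functions stand in for `evaluate_prompt` and an LLM paraphrasing call so the example runs offline.

```python
from typing import Callable


def iterative_search(
    seeds: list[str],
    score: Callable[[str], float],
    propose_variants: Callable[[str], list[str]],
    n_rounds: int = 3,
    keep_top: int = 2,
) -> tuple[str, float]:
    """Repeatedly keep the best prompts and expand them into new variants."""
    pool = list(seeds)
    for _ in range(n_rounds):
        survivors = sorted(pool, key=score, reverse=True)[:keep_top]
        # Survivors stay in the pool, so the best score never decreases.
        pool = survivors + [v for p in survivors for v in propose_variants(p)]
    best = max(pool, key=score)
    return best, score(best)


# Toy scoring: reward prompts that mention both (made-up) helpful hints.
HINTS = ("one word", "positive, negative, or neutral")


def toy_score(prompt: str) -> float:
    return sum(hint in prompt for hint in HINTS) / len(HINTS)


def toy_variants(prompt: str) -> list[str]:
    return [
        prompt + " Answer with one word.",
        prompt + " Choose positive, negative, or neutral.",
    ]


best_prompt, best_score = iterative_search(
    ["Classify the sentiment."], toy_score, toy_variants
)
print(best_score, best_prompt)
```

With a real scorer, each call to `score` costs one pass over the evaluation set, so a practical version would cache scores instead of recomputing them for survivors every round.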
DSPy: The Right Abstraction for Production
DSPy (from Stanford NLP) takes a fundamentally different approach. Instead of optimizing prompt text directly, it optimizes programs made of composable LM modules.
import dspy
from dspy.teleprompt import BootstrapFewShot, MIPROv2

# Configure the LM
lm = dspy.LM("openai/gpt-4o-mini", temperature=0.0)
dspy.configure(lm=lm)

# Define your task using typed signatures
class SentimentClassifier(dspy.Signature):
    """Classify the sentiment of customer feedback."""
    feedback: str = dspy.InputField(desc="Customer feedback text")
    sentiment: str = dspy.OutputField(desc="One of: positive, negative, neutral")
    confidence: str = dspy.OutputField(desc="High, medium, or low confidence")

class SentimentPipeline(dspy.Module):
    def __init__(self):
        super().__init__()  # let dspy.Module set up its internal state
        self.classify = dspy.ChainOfThought(SentimentClassifier)

    def forward(self, feedback: str):
        return self.classify(feedback=feedback)

# Define your metric
def sentiment_metric(example, prediction, trace=None):
    return example.sentiment.lower() == prediction.sentiment.lower()
# Load training data
trainset = [
    dspy.Example(
        feedback="This is the best purchase I've made all year!",
        sentiment="positive"
    ).with_inputs("feedback"),
    dspy.Example(
        feedback="Broke after two days. Terrible quality.",
        sentiment="negative"
    ).with_inputs("feedback"),
    dspy.Example(
        feedback="It does what it says on the box.",
        sentiment="neutral"
    ).with_inputs("feedback"),
    # ... 50-100+ examples for good optimization
]

# Hold out the last 20% for evaluation and keep the first 80% for training
split = int(len(trainset) * 0.8)
devset = trainset[split:]
trainset = trainset[:split]
# Compile with BootstrapFewShot optimizer
optimizer = BootstrapFewShot(metric=sentiment_metric, max_bootstrapped_demos=4)
pipeline = SentimentPipeline()
compiled_pipeline = optimizer.compile(pipeline, trainset=trainset)
# Now use the compiled pipeline
result = compiled_pipeline(feedback="Shipping was slow but product quality is excellent.")
print(f"Sentiment: {result.sentiment}, Confidence: {result.confidence}")
# Save the compiled program (includes optimized prompts/examples)
compiled_pipeline.save("sentiment_classifier_v1.json")
The key insight in DSPy: BootstrapFewShot doesn't just select prompts. It runs your program over the training data, keeps the examples whose outputs pass your metric, and injects them as in-context demonstrations. It's finding working few-shot examples systematically rather than you choosing them by intuition.
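The bootstrapping idea can be pictured with a small standalone sketch. This is a simplification for intuition, not DSPy's actual code: `bootstrap_demos`, `toy_predict`, and `label_matches` are hypothetical names, and the keyword-based predictor stands in for a real LM program.

```python
from typing import Callable


def bootstrap_demos(
    trainset: list[dict],
    predict: Callable[[str], str],
    metric: Callable[[dict, str], bool],
    max_demos: int = 4,
) -> list[dict]:
    """Run the predictor on training examples and keep the ones that pass the metric."""
    demos = []
    for example in trainset:
        prediction = predict(example["feedback"])
        if metric(example, prediction):
            demos.append({"feedback": example["feedback"], "sentiment": prediction})
        if len(demos) >= max_demos:
            break
    return demos


def toy_predict(feedback: str) -> str:
    """Hypothetical stand-in for an LM call."""
    return "positive" if "great" in feedback else "negative"


def label_matches(example: dict, prediction: str) -> bool:
    return example["sentiment"] == prediction


toy_trainset = [
    {"feedback": "great value", "sentiment": "positive"},
    {"feedback": "arrived broken", "sentiment": "negative"},
    {"feedback": "not great", "sentiment": "negative"},  # the toy predictor gets this wrong
]
demos = bootstrap_demos(toy_trainset, toy_predict, label_matches)
print(demos)
```

The third example is dropped because the predictor's output fails the metric. Only verified input/output pairs end up as demonstrations.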
MIPROv2 for Harder Tasks
For more complex tasks, MIPROv2 (Multi-prompt Instruction Proposal Optimizer v2) is more powerful but more expensive:
from dspy.teleprompt import MIPROv2

# MIPROv2 generates instruction candidates AND selects few-shot examples
optimizer = MIPROv2(
    metric=sentiment_metric,
    auto="medium",  # "light", "medium", or "heavy": controls how many candidates to try
    num_threads=4,
)

compiled_pipeline_v2 = optimizer.compile(
    SentimentPipeline(),
    trainset=trainset,
    valset=devset,
    max_bootstrapped_demos=4,
    max_labeled_demos=4,
)
The "heavy" setting in MIPROv2 can run hundreds of candidate evaluations. Each full pass over a 100-example validation set costs at least 100 LM calls, so the total can run well into the thousands. Budget accordingly.
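A back-of-the-envelope estimate helps before launching a run. The helper below is a rough upper bound under the assumption that every candidate is scored on the full validation set. `estimate_eval_calls` is a hypothetical name, and the candidate counts are illustrative numbers, not MIPROv2's actual trial counts.

```python
def estimate_eval_calls(n_candidates: int, val_size: int, calls_per_example: int = 1) -> int:
    """Upper bound on LM calls if every candidate is scored on every validation example."""
    return n_candidates * val_size * calls_per_example


# Illustrative numbers only: a small run versus a large one on 100 validation examples.
small_run = estimate_eval_calls(n_candidates=10, val_size=100)
large_run = estimate_eval_calls(n_candidates=200, val_size=100)
print(small_run, large_run)
```

Real runs can come in lower than this bound when the optimizer evaluates on minibatches or reuses cached responses, and higher when the program makes several LM calls per example.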
Benchmark: Manual vs. APE vs. DSPy
These numbers are from running each method on a multi-class intent classification task (10 classes, 500 test examples) using GPT-4o-mini:
| Method | Accuracy | Optimization Time | API Calls (optimization) | Ongoing Cost per Query |
|---|---|---|---|---|
| Baseline (no prompt) | 61.2% | 0 | 0 | 1x |
| Manual (3 iterations) | 74.8% | ~4 hours | ~200 | 1x |
| APE (20 candidates) | 78.3% | ~25 min | 120 | 1x |
| DSPy BootstrapFewShot | 83.1% | ~40 min | 350 | 1.3x (longer prompt) |
| DSPy MIPROv2 (medium) | 86.7% | ~2 hours | 1,200 | 1.4x |
| Fine-tuned GPT-4o-mini | 89.1% | ~6 hours + data | N/A | Similar |
Results vary significantly by task type and model. These are indicative, not universal.
DSPy MIPROv2 approaches fine-tuning quality without touching model weights, a striking result that held across the tasks I tested it on.
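When reading a table like this, it helps to know how much of a gap could be sampling noise. The sketch below uses the normal approximation to a binomial proportion, which is an assumption that works reasonably for 500 examples and accuracies away from 0% or 100%. `accuracy_interval` is a hypothetical helper name.

```python
import math


def accuracy_interval(accuracy: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% interval for an accuracy measured on n independent examples."""
    standard_error = math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - z * standard_error, accuracy + z * standard_error


# MIPROv2 row from the table: 86.7% accuracy on 500 test examples
low, high = accuracy_interval(0.867, 500)
print(f"{low:.3f} to {high:.3f}")
```

On 500 examples the interval is about three points either side of the measured accuracy, so adjacent rows that differ by a few points may not be reliably separated by a single run.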
Gradient-Free Optimization with ProTeGi
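ProTeGi (Prompt Optimization with Textual Gradients, from the paper Automatic Prompt Optimization with "Gradient Descent" and Beam Search) is a gradient-free method that mimics gradient descent using natural language. The version below keeps the critique-and-rewrite loop and omits the paper's beam search: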
from openai import OpenAI
client = OpenAI()
def get_textual_gradient(
    prompt: str,
    failed_examples: list[dict],
    model: str = "gpt-4o"
) -> str:
    """Generate a natural-language 'gradient': a critique of prompt failures."""
    failures_str = "\n\n".join([
        f"Input: {ex['input']}\nExpected: {ex['expected']}\nGot: {ex['got']}"
        for ex in failed_examples[:5]
    ])

    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": f"""You are analyzing why a prompt is failing on certain examples.

Current prompt:
---
{prompt}
---

Examples where it failed:
{failures_str}

Provide a detailed critique:
1. Why is the prompt failing on these examples?
2. What specific changes would fix these failures?
3. What ambiguities in the prompt are causing errors?

Be specific and actionable."""
            }
        ]
    )
    return response.choices[0].message.content
def apply_textual_gradient(
    prompt: str,
    gradient: str,
    model: str = "gpt-4o"
) -> str:
    """Apply the textual gradient to improve the prompt."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": f"""You are improving an instruction prompt.

Current prompt:
---
{prompt}
---

Critique and suggested improvements:
---
{gradient}
---

Rewrite the prompt to address these issues. Keep what works, fix what doesn't.
Output only the improved prompt, no explanation."""
            }
        ]
    )
    return response.choices[0].message.content
def run_protegi(
    initial_prompt: str,
    train_examples: list[dict],
    eval_examples: list[dict],
    score_fn: callable,
    n_iterations: int = 5,
    model: str = "gpt-4o-mini"
) -> tuple[str, list[float]]:
    """Run ProTeGi-style textual gradient descent.

    Uses evaluate_prompt() from the APE section above.
    """
    current_prompt = initial_prompt
    best_prompt, best_score = initial_prompt, -1.0
    score_history = []

    for iteration in range(n_iterations):
        # Evaluate current prompt
        score = evaluate_prompt(current_prompt, eval_examples, score_fn, model)
        score_history.append(score)
        print(f"Iteration {iteration}: score={score:.3f}")

        # A rewrite can make things worse, so remember the best prompt seen so far
        if score > best_score:
            best_prompt, best_score = current_prompt, score

        # Find failed examples
        failed = []
        for ex in train_examples:
            response = client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": current_prompt},
                    {"role": "user", "content": ex["input"]}
                ],
                temperature=0.0,
            )
            prediction = response.choices[0].message.content
            if not score_fn(prediction, ex["expected_output"]):
                failed.append({
                    "input": ex["input"],
                    "expected": ex["expected_output"],
                    "got": prediction
                })

        if not failed:
            print("No failures found, stopping early.")
            break

        # Generate textual gradient
        gradient = get_textual_gradient(current_prompt, failed[:5])
        # Apply gradient to get improved prompt
        current_prompt = apply_textual_gradient(current_prompt, gradient)

    # Return the best evaluated prompt, not an unevaluated final rewrite
    return best_prompt, score_history
ProTeGi is my preferred approach when I don't want to introduce DSPy as a dependency. It's understandable, debuggable, and often competitive with APE.
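Because an LLM rewrite is not guaranteed to help, one common safeguard is to generate several rewrites per step and accept one only if it beats the current prompt. The sketch below is a hypothetical addition rather than part of the ProTeGi paper's algorithm: `guarded_step` and `toy_score` are made-up names, and `toy_score` stands in for a real evaluation pass.

```python
from typing import Callable


def guarded_step(
    current_prompt: str,
    rewrites: list[str],
    score: Callable[[str], float],
) -> str:
    """Accept the best rewrite only if it scores higher than the current prompt."""
    best = max(rewrites, key=score, default=current_prompt)
    return best if score(best) > score(current_prompt) else current_prompt


def toy_score(prompt: str) -> float:
    """Hypothetical scorer: pretend prompts that ask for one word perform best."""
    return 1.0 if "one word" in prompt else 0.0


improved = guarded_step(
    "Classify.",
    ["Classify!", "Classify. Reply with one word."],
    toy_score,
)
unchanged = guarded_step("Classify. Reply with one word.", ["Classify."], toy_score)
print(improved)
print(unchanged)
```

The trade-off is cost: every rewrite needs its own evaluation pass, so the number of rewrites per step multiplies the API calls per iteration.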
Practical Optimization Strategy
For most tasks, I'd recommend this progression:
- Start with a manually written baseline to understand the failure modes. Don't skip this β the failure analysis is valuable.
- Run APE for quick wins with 15-20 candidates if you need something fast.
- Use DSPy BootstrapFewShot if you have 50+ labeled examples and need reproducible optimization.
- Escalate to MIPROv2 only if BootstrapFewShot plateaus and the task is important enough to justify the compute.
The Prompt Engineering course has a module on building evaluation harnesses, which are essential for any of these optimization approaches. You need a reliable scoring function before optimization is meaningful.
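As an example of what "reliable" can mean, a substring check like the `exact_match_score` function shown earlier gives full credit to a hedged answer that mentions several labels. A stricter scorer might look like the sketch below. `strict_label_score` and the `LABELS` tuple are hypothetical, and the label set is assumed from the sentiment example.

```python
import re

# Assumed label set, taken from the sentiment example in this article
LABELS = ("positive", "negative", "neutral")


def strict_label_score(prediction: str, expected: str) -> float:
    """Return 1.0 only when the prediction names exactly one label and it is the expected one."""
    text = prediction.lower()
    found = {label for label in LABELS if re.search(rf"\b{label}\b", text)}
    return 1.0 if found == {expected.lower().strip()} else 0.0


clean = strict_label_score("Positive", "positive")
hedged = strict_label_score("It could be positive or negative", "positive")
wrapped = strict_label_score("Sentiment: neutral.", "neutral")
print(clean, hedged, wrapped)
```

This still does not handle negation ("not positive" would count as positive), so for anything beyond simple labels it is usually safer to constrain the output format and compare exact values.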
For reference on building the evaluation datasets these methods require, the RAG Retrieval Notes covers how to build labeled datasets from retrieved content.
What Optimization Can't Fix
Automatic prompt optimization maximizes your metric. If your metric is wrong, it maximizes the wrong thing. I've seen optimizers produce prompts that scored 94% on the eval set and were terrible in production, because the eval set didn't cover the distribution of real inputs.
A related failure: optimized prompts can become brittle. A 40-token instruction carefully tuned for your eval set may break completely on slightly different input distributions. Test optimized prompts on held-out data that wasn't used in optimization, and monitor performance in production.
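A simple way to get that held-out data is a three-way split: the optimizer sees the train and dev portions, and the test portion is scored once at the end. The sketch below is one possible approach. `three_way_split` is a hypothetical helper and the 60/20/20 proportions are an arbitrary assumption.

```python
import random


def three_way_split(
    examples: list,
    seed: int = 0,
    train_frac: float = 0.6,
    dev_frac: float = 0.2,
) -> tuple[list, list, list]:
    """Shuffle once with a fixed seed, then cut into train, dev, and test portions."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_dev = int(len(shuffled) * dev_frac)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # never shown to the optimizer
    return train, dev, test


train, dev, test = three_way_split(list(range(10)))
print(len(train), len(dev), len(test))
```

If the optimized prompt scores much higher on dev than on test, that gap is a sign the optimizer has fit the evaluation set rather than the task.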
There's also a readability concern. DSPy-compiled prompts with bootstrapped few-shot examples can be long and opaque. The few-shot examples that were empirically best might be confusing to a human reading the prompt. This matters when you need to audit, explain, or modify the prompt later.
The Advanced Prompting Quiz tests your ability to identify when optimization has overfit, a useful skill to develop before deploying optimized prompts to production.
Automatic prompt optimization won't replace prompt engineering: you still need to understand the task, define good metrics, and interpret results. What it replaces is the tedious trial-and-error phase. Write a reasonable starting prompt, build an evaluation set, run the optimizer, then use the time you saved to think about the things the optimizer can't measure.
That's usually the more interesting work anyway.
AiTechWorlds Team