Can an LLM reliably evaluate the quality of its own prompts?

Partially, and with important limitations. LLMs are reasonable at identifying obvious flaws in prompts: ambiguity, missing context, conflicting instructions, overly complex output format requirements. They're less reliable at predicting whether a prompt will produce accurate outputs on rare cases, or at detecting subtle biases in prompt framing. The 'evaluator is the same model being evaluated' problem is real — GPT-4o's assessment of a GPT-4o prompt is partially circular. For factual accuracy evaluation, a model will tend to approve prompts that produce confident-sounding outputs, even if those outputs are wrong. Best practice: use meta-prompting for qualitative prompt design (structure, clarity, coverage of edge cases) and reserve empirical measurement on labeled data for quality assurance. Don't use LLM self-evaluation as a substitute for actual accuracy benchmarking.

What is the Stanford meta-prompting approach and how does it work?

The Stanford meta-prompting paper (Suzgun & Kalai, 2024) introduced a 'meta-prompt' that instructs an LLM to decompose tasks and orchestrate sub-agents to solve them. The meta-prompt instructs the model to: (1) analyze the incoming task and break it into subtasks, (2) for each subtask, formulate a specialized prompt and call itself as a 'fresh' assistant (simulating a new LLM call), (3) aggregate the sub-agent responses into a final answer. This gives the model the ability to self-scaffold — it creates the problem decomposition on the fly rather than requiring a human to pre-specify the pipeline. The model acts simultaneously as orchestrator and worker. Benchmarks showed strong performance on complex reasoning tasks, with the self-scaffolding particularly helping on tasks that benefit from separation of concerns (e.g., independently verifying each step of a multi-step proof).
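The three steps above can be sketched with genuinely separate calls. This is a minimal illustration, not the paper's implementation: `call_llm` is a hypothetical callable (system prompt, user message) that you would back with a real chat-completion call, and the orchestrator and specialist instructions are placeholders I made up for the example.

```python
from typing import Callable

# Hypothetical signature: (system_prompt, user_message) -> model reply.
# In practice this would wrap a chat-completion API call.
LLMCall = Callable[[str, str], str]


def orchestrate(task: str, call_llm: LLMCall, max_subtasks: int = 5) -> str:
    """Decompose a task, solve each subtask in a fresh context, then aggregate."""
    # Step 1: ask the orchestrator for one subtask per line.
    plan = call_llm(
        "You are an orchestrator. List the subtasks needed to solve the task, "
        "one per line, with no numbering or commentary.",
        task,
    )
    subtasks = [line.strip() for line in plan.splitlines() if line.strip()]
    subtasks = subtasks[:max_subtasks]

    # Step 2: each subtask gets its own call, so the worker sees only that
    # subtask and none of the other subtasks or their answers.
    answers = [
        call_llm("You are a specialist. Solve only the task you are given.", sub)
        for sub in subtasks
    ]

    # Step 3: hand the collected answers back for a final synthesis.
    report = "\n\n".join(
        f"Subtask: {sub}\nAnswer: {ans}" for sub, ans in zip(subtasks, answers)
    )
    return call_llm(
        "You are an orchestrator. Combine the subtask answers into one final answer.",
        f"Original task:\n{task}\n\nSubtask results:\n{report}",
    )
```

Because every worker call starts from an empty message history, the "fresh assistant" property holds by construction rather than by instruction. The cost is one API call per subtask plus two for planning and aggregation.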

AiTechWorlds

Meta-Prompting: Using LLMs to Generate and Improve Their Own Prompts

⚡ Quick Answer

Meta-prompting uses LLMs to write, critique, and refine prompts — often outperforming human-written ones. Learn the patterns, failure modes, and production use cases.

Abdullah Al Arman Emon | June 5, 2026 | 12 min read

#meta-prompting #prompt-generation #self-improving-prompts #prompt-engineering

There's an uncomfortable truth about prompt engineering: the people who write prompts for LLMs are worse at modeling LLM behavior than the LLMs themselves. Your mental model of how GPT-4o interprets a particular phrasing is an approximation. The model's own prediction of how it will respond to a given prompt is, in a meaningful sense, more accurate.

This is the premise behind meta-prompting. Don't just use an LLM to complete tasks — use it to think about how tasks should be prompted. Use the model's self-knowledge as a design tool.

The idea sounds circular. In practice, it's one of the more productive techniques I've added to my workflow.

What Meta-Prompting Actually Means

The term covers several related but distinct practices:

- Prompt generation: Ask an LLM to write a prompt for a task you describe
- Prompt critique: Ask an LLM to identify weaknesses in an existing prompt
- Prompt refinement: Iterative generation-critique-revision cycles
- Prompt decomposition: Ask an LLM to break a complex task into a multi-prompt pipeline
- Self-scaffolding: An LLM that generates its own sub-prompts at inference time

These form a spectrum from simple design assistance to full autonomous prompt construction. Most practical applications use the simpler end of the spectrum.

Prompt Generation: Getting a First Draft

The most common use case. You describe what you want, the model writes the prompt. This is genuinely useful, not as a replacement for prompt engineering expertise but as a first-draft generator that you then refine.

from openai import OpenAI

client = OpenAI()

META_PROMPT_GENERATOR = """You are an expert prompt engineer. Write high-quality system prompts for LLM applications.

When writing a prompt:
1. Be explicit about the task, output format, and constraints
2. Include guidance on how to handle edge cases and ambiguous inputs
3. Specify the tone and style appropriate for the use case
4. Include any required output structure or formatting
5. Anticipate common failure modes and address them preemptively

Output ONLY the prompt text — no explanation, no preamble."""

def generate_prompt(task_description: str, examples: list[dict] = None) -> str:
    examples_str = ""
    if examples:
        examples_str = "\n\nHere are some example inputs and ideal outputs:\n" + "\n\n".join([
            f"Input: {ex['input']}\nIdeal output: {ex['output']}"
            for ex in examples
        ])
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": META_PROMPT_GENERATOR},
            {
                "role": "user",
                "content": f"Write a system prompt for this task:\n\n{task_description}{examples_str}"
            }
        ],
        temperature=0.3,
    )
    return response.choices[0].message.content


# Example
task = """
A customer support agent for a SaaS analytics product. 
The agent should:
- Answer questions about the product features and pricing
- Help users debug common data pipeline issues
- Escalate complex technical issues to the engineering team
- Always be professional but not overly formal
- Admit uncertainty rather than guessing
"""

generated_prompt = generate_prompt(task)
print(generated_prompt)

The output from this is usually better than what most developers write in their first attempt. Not because the model has magic insight, but because the meta-prompt forces it to think about edge cases, failure modes, and output structure — things developers often skip in first-draft prompts.

Prompt Critique: Finding Weaknesses Before They Bite You

Given an existing prompt, ask the model to identify problems. This is where meta-prompting earns its keep most reliably.

META_PROMPT_CRITIC = """You are a prompt quality analyst. Your job is to find potential failure modes in LLM system prompts.

For the prompt you review, identify:

AMBIGUITIES: Parts of the prompt that could be interpreted in multiple ways, leading to inconsistent behavior.

MISSING GUIDANCE: Common scenarios or edge cases the prompt doesn't address, which the model will handle unpredictably.

CONFLICTING INSTRUCTIONS: Instructions that could pull the model in different directions.

FORMAT ISSUES: Problems with output format specification that could cause parsing failures.

SECURITY CONCERNS: Prompt design choices that make the system vulnerable to injection or jailbreak.

OVERCONSTRAINING: Instructions so restrictive that they prevent helpful responses.

Be specific. For each issue, quote the relevant part of the prompt and explain why it's a problem."""

def critique_prompt(prompt: str, context: str = "") -> str:
    context_str = f"\nContext for this prompt: {context}" if context else ""
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": META_PROMPT_CRITIC},
            {
                "role": "user",
                "content": f"Critique this prompt:{context_str}\n\n---\n{prompt}\n---"
            }
        ],
        temperature=0.2,
    )
    return response.choices[0].message.content


# The iterative refinement loop
def refine_prompt_iteratively(
    initial_prompt: str,
    task_context: str,
    n_iterations: int = 3
) -> tuple[str, list[str]]:
    
    current_prompt = initial_prompt
    critique_history = []
    
    for i in range(n_iterations):
        print(f"\n--- Iteration {i+1} ---")
        
        # Critique current prompt
        critique = critique_prompt(current_prompt, task_context)
        critique_history.append(critique)
        print(f"Critique:\n{critique[:500]}...")
        
        # Generate improved version
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": "You are a prompt engineer. Improve prompts based on critiques. Output only the improved prompt, no explanation."
                },
                {
                    "role": "user",
                    "content": f"""Improve this prompt based on the critique.

Current prompt:
---
{current_prompt}
---

Critique:
---
{critique}
---

Write the improved prompt:"""
                }
            ],
            temperature=0.2,
        )
        
        current_prompt = response.choices[0].message.content
        print(f"Updated prompt:\n{current_prompt[:300]}...")
    
    return current_prompt, critique_history

I use this critique loop routinely before deploying any new system prompt. The model catches things I miss — usually ambiguities about edge case handling and inconsistencies between different parts of a long prompt.
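A fixed iteration count can waste calls once the prompt has converged. One possible variation, sketched here under my own assumptions, stops when a revision barely changes the text. `critique_fn` and `revise_fn` are hypothetical callables standing in for the critique and revision calls, and the 0.98 similarity threshold is an arbitrary starting point, not a tuned value.

```python
from difflib import SequenceMatcher
from typing import Callable


def refine_until_stable(
    prompt: str,
    critique_fn: Callable[[str], str],
    revise_fn: Callable[[str, str], str],
    max_iterations: int = 5,
    stable_ratio: float = 0.98,
) -> tuple:
    """Run critique/revise rounds until the prompt stops changing much.

    Returns (final_prompt, rounds_run).
    """
    rounds = 0
    for _ in range(max_iterations):
        critique = critique_fn(prompt)
        revised = revise_fn(prompt, critique)
        rounds += 1
        # ratio() is 1.0 for identical strings and falls toward 0.0 as they diverge.
        similarity = SequenceMatcher(None, prompt, revised).ratio()
        prompt = revised
        if similarity >= stable_ratio:
            break
    return prompt, rounds
```

With the functions above, `critique_fn` could be `lambda p: critique_prompt(p, task_context)`; `revise_fn` would wrap the revision call from `refine_prompt_iteratively`, which would first need to be factored out into its own function.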

Meta-Prompting for Few-Shot Example Generation

Generating good few-shot examples is harder than it looks. Bad examples introduce biases, cover only easy cases, or don't represent the distribution of real inputs. LLMs can help here:

def generate_few_shot_examples(
    task_description: str,
    target_coverage: list[str],  # Specific scenarios to cover
    n_examples: int = 10
) -> list[dict]:
    """
    Generate diverse, high-quality few-shot examples.
    target_coverage: list of edge cases / scenario types to explicitly cover
    """
    
    coverage_str = "\n".join([f"- {scenario}" for scenario in target_coverage])
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "You are a training data expert. Generate high-quality, diverse labeled examples for NLP tasks."
            },
            {
                "role": "user",
                "content": f"""Generate {n_examples} diverse input/output examples for this task:

Task: {task_description}

Make sure to cover these specific scenarios:
{coverage_str}

For each example:
- Input should be realistic and natural
- Output should be ideal, exactly what the model should produce
- Vary the difficulty level
- Include some tricky/ambiguous cases

Format as a JSON object with a single "examples" key:
{{"examples": [
  {{"input": "...", "output": "..."}},
  ...
]}}"""
            }
        ],
        temperature=0.7,
        response_format={"type": "json_object"}
    )
    
    import json
    raw = json.loads(response.choices[0].message.content)
    # Handle both {"examples": [...]} and [...] formats
    if isinstance(raw, list):
        return raw
    for key in raw:
        if isinstance(raw[key], list):
            return raw[key]
    return []


# Example: generating examples for a ticket routing system
examples = generate_few_shot_examples(
    task_description="Route customer support tickets to the correct team: billing, technical, or general",
    target_coverage=[
        "Billing question disguised as a technical question",
        "Angry customer with unclear issue",
        "Multiple issues in one ticket",
        "Non-English input mixed with English",
        "Very short ambiguous message",
        "Technical issue that's actually a billing/account problem",
    ],
    n_examples=12
)

The target_coverage parameter is key. Without explicitly specifying edge cases, the model generates examples from the easy part of the distribution. The hard cases are where your model will fail in production, so they should be represented in few-shot examples.
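Generated examples still deserve a mechanical check before they go into a prompt. The sketch below is one way to do that, assuming each example is a dict with string "input" and "output" fields as in the code above. `clean_examples` and its `allowed_outputs` parameter are hypothetical names for illustration.

```python
from typing import Optional


def clean_examples(raw_examples: list, allowed_outputs: Optional[set] = None) -> list:
    """Keep well-formed, unique examples; optionally enforce a closed label set."""
    seen_inputs = set()
    cleaned = []
    for ex in raw_examples:
        # Skip anything that is not a dict with string "input" and "output" fields.
        if not isinstance(ex, dict):
            continue
        text, label = ex.get("input"), ex.get("output")
        if not isinstance(text, str) or not isinstance(label, str):
            continue
        text, label = text.strip(), label.strip()
        if not text or not label:
            continue
        # For classification tasks, reject labels outside the expected set.
        if allowed_outputs is not None and label not in allowed_outputs:
            continue
        # Drop duplicates that differ only in case or surrounding whitespace.
        key = text.lower()
        if key in seen_inputs:
            continue
        seen_inputs.add(key)
        cleaned.append({"input": text, "output": label})
    return cleaned
```

For the ticket-routing example, calling `clean_examples(examples, {"billing", "technical", "general"})` would discard any example whose label drifted outside the three teams. It does not check whether a label is correct; that still needs a human read.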

Stanford Meta-Prompting: Self-Scaffolding at Inference Time

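The Stanford meta-prompting paper (Suzgun & Kalai, 2024) introduced a more ambitious idea: a single meta-prompt that instructs the model to decompose a task and call fresh instances of itself on each subtask, so the same model acts as both orchestrator and worker. The prompt below is a simplified single-call adaptation of that idea, not the paper's original prompt.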

STANFORD_META_PROMPT = """You are an expert assistant with access to a powerful reasoning capability.

For complex tasks, you can break them into subtasks and solve each independently.

When you receive a complex question:
1. ANALYZE: Identify if the question benefits from decomposition
2. DECOMPOSE: If yes, identify the independent subtasks
3. SOLVE EACH: For each subtask, reason carefully as if it were a standalone question
4. INTEGRATE: Combine the subtask answers into a coherent final answer

Format for decomposed reasoning:
[SUBTASK 1: description]
[REASONING: your reasoning for this subtask]
[ANSWER: subtask answer]

[SUBTASK 2: description]
...

[FINAL INTEGRATION]
Combining the above: [final answer]

For simple questions, just answer directly.
IMPORTANT: Each subtask should be solved as if you have no information from other subtasks — 
this prevents anchoring bias and produces more reliable results."""


def run_meta_prompted_query(query: str, model: str = "gpt-4o") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": STANFORD_META_PROMPT},
            {"role": "user", "content": query}
        ],
        temperature=0.1,
    )
    return response.choices[0].message.content


# Complex multi-part query
result = run_meta_prompted_query("""
A startup raised $2M at a $10M valuation in 2023. In 2024, they raised another $5M at a $25M valuation.
The founders owned 80% after the first round.

What percentage do the founders own after the second round, and what is the dollar value of 
the original investors' stake after the second round?
""")
print(result)

This is less powerful than actually making separate API calls (the "fresh context" claim is not really true within a single context window — the model does see all subtasks simultaneously), but it's useful as a lightweight zero-infrastructure alternative to multi-agent pipelines.
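The cap-table query above has a closed-form answer, which makes it a handy check on whatever the model returns. The sketch below works it out under assumptions the query leaves unstated: no option pool, no pro rata participation by earlier investors, and the 20% not held by founders belongs entirely to the first-round investors. The query also does not say whether $25M is pre-money or post-money, so the helper (a hypothetical `dilute` function) computes both readings.

```python
def dilute(stakes: dict, new_money: float, valuation: float, post_money: bool = True) -> tuple:
    """Apply one priced round; return (new ownership fractions, post-money valuation)."""
    post = valuation if post_money else valuation + new_money
    new_fraction = new_money / post
    # Existing holders are diluted pro rata by the newly issued fraction.
    diluted = {name: share * (1 - new_fraction) for name, share in stakes.items()}
    diluted["new_investors"] = new_fraction
    return diluted, post


after_round_1 = {"founders": 0.80, "original_investors": 0.20}

for post_money in (True, False):
    stakes, post = dilute(after_round_1, 5_000_000, 25_000_000, post_money)
    label = "post-money" if post_money else "pre-money"
    print(
        f"{label}: founders {stakes['founders']:.1%}, "
        f"original investors ${stakes['original_investors'] * post:,.0f}"
    )
```

Read as post-money, the founders end at 64.0% and the original investors' stake is worth $4,000,000. Read as pre-money, the figures are 66.7% and $5,000,000. The first round ($2M for 20%) is consistent with the post-money reading, so a good model response should either use that convention or flag the ambiguity.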

Prompt Decomposition: Breaking Complex Tasks into Pipelines

One of the most underused meta-prompting applications is using an LLM to design multi-step prompt pipelines:

PIPELINE_DESIGNER_PROMPT = """You are a prompt pipeline architect.

When given a complex task, design a multi-step pipeline where each step is a focused LLM call.

For each step, specify:
- Step name and purpose
- Input (from user or previous step output)
- Prompt instruction (complete, standalone)
- Output format
- How output feeds into next step

Design principles:
- Each step should do ONE thing well
- Steps should be independently testable
- Minimize information passing between steps
- Each step prompt should be explicit about its narrow scope

Output as a structured JSON pipeline spec."""

def design_pipeline(task_description: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": PIPELINE_DESIGNER_PROMPT},
            {
                "role": "user",
                "content": f"Design a prompt pipeline for: {task_description}"
            }
        ],
        temperature=0.2,
        response_format={"type": "json_object"}
    )
    import json
    return json.loads(response.choices[0].message.content)


# Example
pipeline_spec = design_pipeline(
    "Given a research paper PDF, produce a structured summary with: "
    "main findings, methodology, limitations, and relevance to ML practitioners."
)

# The model will produce a multi-step pipeline:
# Step 1: Extract key sections from raw text
# Step 2: Identify main findings
# Step 3: Analyze methodology
# Step 4: Extract limitations
# Step 5: Assess ML relevance
# Step 6: Combine into structured output

This meta-prompting application is one I return to constantly. Designing a good pipeline manually requires experience with how LLMs handle different task types. The model's suggested decompositions are often better than my first instinct, and they're a good starting point for iteration.
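Once a spec exists, running it is mostly plumbing. The sketch below assumes a spec shape that the designer prompt above does not guarantee: a top-level "steps" list whose items have "name" and "prompt" keys. To rely on it, you would need to pin that schema in `PIPELINE_DESIGNER_PROMPT`. `call_llm` is again a hypothetical callable taking a system prompt and a user message.

```python
from typing import Callable


def run_pipeline(spec: dict, user_input: str, call_llm: Callable[[str, str], str]) -> dict:
    """Execute steps in order, feeding each step's output into the next.

    Assumes a hypothetical spec shape:
    {"steps": [{"name": "...", "prompt": "..."}, ...]}
    """
    outputs = {}
    current = user_input
    for index, step in enumerate(spec.get("steps", []), start=1):
        name = step.get("name", f"step_{index}")
        # Each step is one focused call: its own instruction plus the previous output.
        current = call_llm(step["prompt"], current)
        outputs[name] = current
    return outputs
```

Returning every intermediate output, rather than only the last one, keeps each step independently inspectable, which matches the "independently testable" design principle in the designer prompt.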

Where Meta-Prompting Falls Short

Evaluating factual correctness. The model is good at evaluating prompt structure but not at predicting whether a prompt will produce factually accurate outputs. A critique of a prompt for "describe quantum entanglement" won't catch the case where the prompt produces plausible-sounding but incorrect physics.

Creativity and voice. If you need prompts that produce a specific brand voice or creative style, meta-prompting produces generic results. The model generates prompts that will produce "good" outputs by its standards — which tend to be clear, helpful, and somewhat homogeneous.

Novel task types. Meta-prompting works best on tasks similar to what the model has seen in its training data. For genuinely novel applications, the generated prompts often miss the specific requirements.

For more on when to move beyond prompt engineering entirely, the LLM Concepts Notes covers the spectrum from prompting to fine-tuning. The Prompt Engineering course has a practical module on integrating meta-prompting into a real development workflow.

Comparing Meta-Prompting Approaches

Approach	Setup Cost	Quality	Best Use Case	Requires Examples?
Simple generation	Low	Good first draft	New prompts from scratch	Optional
Critique + refine	Low	Better than manual	Improving existing prompts	No
Iterative refinement (3+ cycles)	Medium	High	Production-grade prompts	Optional
Few-shot generation	Low	Task-specific	Training data creation	No (specify coverage)
Pipeline decomposition	Low	Variable	Complex multi-step tasks	No
Stanford meta-prompt	None (inference)	High for reasoning	Complex analytical queries	No
APO + meta-prompting	High	Highest	High-volume production tasks	Yes (50+)

Practical Workflow

The meta-prompting workflow that consistently produces better results than pure manual engineering:

1. Describe your task in plain language (no prompt syntax)
2. Use the generator to produce a first draft
3. Read it critically and note your concerns
4. Run the critique meta-prompt on it
5. Look at what the critique identified vs. what you noticed: the overlap tells you what you already knew; the gaps tell you what you missed
6. Refine once or twice
7. Test on real examples before any further optimization
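The final step, testing on real examples, can be as simple as exact-match accuracy over a small labeled set. This is a minimal sketch, assuming a classification-style task where outputs are short labels. `predict` is a hypothetical callable that sends one input through your prompt and returns the model's answer.

```python
from typing import Callable


def measure_accuracy(predict: Callable[[str], str], labeled_examples: list) -> tuple:
    """Return (accuracy, failures) for a prompt-backed predictor on labeled data."""
    if not labeled_examples:
        raise ValueError("labeled_examples must not be empty")
    failures = []
    for ex in labeled_examples:
        got = predict(ex["input"]).strip().lower()
        want = ex["output"].strip().lower()
        # Exact match after normalization; suitable for labels, not free text.
        if got != want:
            failures.append({"input": ex["input"], "expected": want, "got": got})
    accuracy = 1 - len(failures) / len(labeled_examples)
    return accuracy, failures
```

The failures list is usually more useful than the headline number: reading the misses tells you which part of the prompt to revise. For free-text outputs, exact match is too strict and you would need a task-specific comparison instead.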

The goal isn't to remove humans from the loop — it's to use the model's self-knowledge as a second opinion. You bring the domain knowledge and the judgment about what actually matters; the model brings a different perspective on how prompts are likely to be interpreted.

For deeper exploration of how this connects to self-improving systems and agent architectures, see the AI Agent Dev course and the Agent Development section of the site. The Advanced Prompting Quiz tests the concepts covered here in a way that's actually useful for consolidating understanding rather than just pattern-matching to definitions.

Meta-prompting doesn't make prompt engineering unnecessary. It makes it faster, more systematic, and less dependent on individual expertise. The people who will be best at prompt engineering in two years won't be the ones who memorized the most techniques — they'll be the ones who built the right meta-prompting workflows and used them consistently.

The model knows a lot about how it works. Let it tell you.

Frequently Asked Questions

What is the difference between meta-prompting and automatic prompt optimization (APO)?

Meta-prompting is the practice of using an LLM to generate, critique, or improve prompts: the LLM is reasoning about prompting itself. Automatic prompt optimization (APO) is a specific application of meta-prompting where the goal is to maximize a measurable metric through iterative refinement. Meta-prompting is broader: it includes using an LLM to write a first draft of a system prompt, to critique an existing prompt's potential weaknesses, to generate diverse prompt variants for A/B testing, to decompose a complex task into a multi-prompt pipeline, and to generate few-shot examples. APO is the quantitative, evaluation-driven end of meta-prompting. The rest of meta-prompting is more qualitative, using LLM judgment about prompts as a design aid rather than as an optimization signal.

Abdullah Al Arman Emon✓ Verified Writer

Software Testing Expert & Prompt Engineering

Ensures every release is bug-free through rigorous testing, and crafts high-precision prompts that power our AI-driven workflows. Abdullah Al Arman Emon leads QA and prompt engineering across AiTechWorlds.

Advanced Prompting

Meta-Prompting: Using LLMs to Generate and Improve Their Own Prompts

⚡ Quick Answer

Meta-prompting uses LLMs to write, critique, and refine prompts — often outperforming human-written ones. Learn the patterns, failure modes, and production use cases.

Abdullah Al Arman Emon June 5, 2026 12 min read

#meta-prompting #prompt-generation #self-improving-prompts #prompt-engineering

📚Part of the Advanced Prompting guide — explore all Advanced Prompting articles→

Share:Facebook Twitter/X LinkedIn Telegram WhatsApp

📱

Get more content like this on Telegram!

Daily AI tips, notes & resources — free

Join Free →

Meta-Prompting: Using LLMs to Generate and Improve Their Own Prompts

This is the premise behind meta-prompting. Don't just use an LLM to complete tasks — use it to think about how tasks should be prompted. Use the model's self-knowledge as a design tool.

The idea sounds circular. In practice, it's one of the more productive techniques I've added to my workflow.

What Meta-Prompting Actually Means

The term covers several related but distinct practices:

Prompt generation: Ask an LLM to write a prompt for a task you describe
Prompt critique: Ask an LLM to identify weaknesses in an existing prompt
Prompt refinement: Iterative generation-critique-revision cycles
Prompt decomposition: Ask an LLM to break a complex task into a multi-prompt pipeline
Self-scaffolding: An LLM that generates its own sub-prompts at inference time

These form a spectrum from simple design assistance to full autonomous prompt construction. Most practical applications use the simpler end of the spectrum.

Prompt Generation: Getting a First Draft

from openai import OpenAI

client = OpenAI()

META_PROMPT_GENERATOR = """You are an expert prompt engineer. Write high-quality system prompts for LLM applications.

When writing a prompt:
1. Be explicit about the task, output format, and constraints
2. Include guidance on how to handle edge cases and ambiguous inputs
3. Specify the tone and style appropriate for the use case
4. Include any required output structure or formatting
5. Anticipate common failure modes and address them preemptively

Output ONLY the prompt text — no explanation, no preamble."""

def generate_prompt(task_description: str, examples: list[dict] = None) -> str:
    examples_str = ""
    if examples:
        examples_str = "\n\nHere are some example inputs and ideal outputs:\n" + "\n\n".join([
            f"Input: {ex['input']}\nIdeal output: {ex['output']}"
            for ex in examples
        ])
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": META_PROMPT_GENERATOR},
            {
                "role": "user",
                "content": f"Write a system prompt for this task:\n\n{task_description}{examples_str}"
            }
        ],
        temperature=0.3,
    )
    return response.choices[0].message.content


# Example
task = """
A customer support agent for a SaaS analytics product. 
The agent should:
- Answer questions about the product features and pricing
- Help users debug common data pipeline issues
- Escalate complex technical issues to the engineering team
- Always be professional but not overly formal
- Admit uncertainty rather than guessing
"""

generated_prompt = generate_prompt(task)
print(generated_prompt)

Prompt Critique: Finding Weaknesses Before They Bite You

Given an existing prompt, ask the model to identify problems. This is where meta-prompting earns its keep most reliably.

META_PROMPT_CRITIC = """You are a prompt quality analyst. Your job is to find potential failure modes in LLM system prompts.

For the prompt you review, identify:

AMBIGUITIES: Parts of the prompt that could be interpreted in multiple ways, leading to inconsistent behavior.

MISSING GUIDANCE: Common scenarios or edge cases the prompt doesn't address, which the model will handle unpredictably.

CONFLICTING INSTRUCTIONS: Instructions that could pull the model in different directions.

FORMAT ISSUES: Problems with output format specification that could cause parsing failures.

SECURITY CONCERNS: Prompt design choices that make the system vulnerable to injection or jailbreak.

OVERCONSTRAINING: Instructions so restrictive that they prevent helpful responses.

Be specific. For each issue, quote the relevant part of the prompt and explain why it's a problem."""

def critique_prompt(prompt: str, context: str = "") -> str:
    context_str = f"\nContext for this prompt: {context}" if context else ""
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": META_PROMPT_CRITIC},
            {
                "role": "user",
                "content": f"Critique this prompt:{context_str}\n\n---\n{prompt}\n---"
            }
        ],
        temperature=0.2,
    )
    return response.choices[0].message.content


# The iterative refinement loop
def refine_prompt_iteratively(
    initial_prompt: str,
    task_context: str,
    n_iterations: int = 3
) -> tuple[str, list[str]]:
    
    current_prompt = initial_prompt
    critique_history = []
    
    for i in range(n_iterations):
        print(f"\n--- Iteration {i+1} ---")
        
        # Critique current prompt
        critique = critique_prompt(current_prompt, task_context)
        critique_history.append(critique)
        print(f"Critique:\n{critique[:500]}...")
        
        # Generate improved version
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": "You are a prompt engineer. Improve prompts based on critiques. Output only the improved prompt, no explanation."
                },
                {
                    "role": "user",
                    "content": f"""Improve this prompt based on the critique.

Current prompt:
---
{current_prompt}
---

Critique:
---
{critique}
---

Write the improved prompt:"""
                }
            ],
            temperature=0.2,
        )
        
        current_prompt = response.choices[0].message.content
        print(f"Updated prompt:\n{current_prompt[:300]}...")
    
    return current_prompt, critique_history

Meta-Prompting for Few-Shot Example Generation

Generating good few-shot examples is harder than it looks. Bad examples introduce biases, cover only easy cases, or don't represent the distribution of real inputs. LLMs can help here:

def generate_few_shot_examples(
    task_description: str,
    target_coverage: list[str],  # Specific scenarios to cover
    n_examples: int = 10
) -> list[dict]:
    """
    Generate diverse, high-quality few-shot examples.
    target_coverage: list of edge cases / scenario types to explicitly cover
    """
    
    coverage_str = "\n".join([f"- {scenario}" for scenario in target_coverage])
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "You are a training data expert. Generate high-quality, diverse labeled examples for NLP tasks."
            },
            {
                "role": "user",
                "content": f"""Generate {n_examples} diverse input/output examples for this task:

Task: {task_description}

Make sure to cover these specific scenarios:
{coverage_str}

For each example:
- Input should be realistic and natural
- Output should be ideal, exactly what the model should produce
- Vary the difficulty level
- Include some tricky/ambiguous cases

Format as JSON:
[
  {{"input": "...", "output": "..."}},
  ...
]"""
            }
        ],
        temperature=0.7,
        response_format={"type": "json_object"}
    )
    
    import json
    raw = json.loads(response.choices[0].message.content)
    # Handle both {"examples": [...]} and [...] formats
    if isinstance(raw, list):
        return raw
    for key in raw:
        if isinstance(raw[key], list):
            return raw[key]
    return []


# Example: generating examples for a ticket routing system
examples = generate_few_shot_examples(
    task_description="Route customer support tickets to the correct team: billing, technical, or general",
    target_coverage=[
        "Billing question disguised as a technical question",
        "Angry customer with unclear issue",
        "Multiple issues in one ticket",
        "Non-English input mixed with English",
        "Very short ambiguous message",
        "Technical issue that's actually a billing/account problem",
    ],
    n_examples=12
)

Stanford Meta-Prompting: Self-Scaffolding at Inference Time

STANFORD_META_PROMPT = """You are an expert assistant with access to a powerful reasoning capability.

For complex tasks, you can break them into subtasks and solve each independently.

When you receive a complex question:
1. ANALYZE: Identify if the question benefits from decomposition
2. DECOMPOSE: If yes, identify the independent subtasks
3. SOLVE EACH: For each subtask, reason carefully as if it were a standalone question
4. INTEGRATE: Combine the subtask answers into a coherent final answer

Format for decomposed reasoning:
[SUBTASK 1: description]
[REASONING: your reasoning for this subtask]
[ANSWER: subtask answer]

[SUBTASK 2: description]
...

[FINAL INTEGRATION]
Combining the above: [final answer]

For simple questions, just answer directly.
IMPORTANT: Each subtask should be solved as if you have no information from other subtasks — 
this prevents anchoring bias and produces more reliable results."""


def run_meta_prompted_query(query: str, model: str = "gpt-4o") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": STANFORD_META_PROMPT},
            {"role": "user", "content": query}
        ],
        temperature=0.1,
    )
    return response.choices[0].message.content
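
The single-call wrapper above asks the model to treat subtasks as independent, but inside one response every subtask still shares the same context. A multi-call variant, where each subtask goes to a fresh call, is closer to the orchestrator-and-worker pattern. Here is a rough sketch under stated assumptions: `llm` is any callable taking a system prompt and a user message (a thin wrapper around `client.chat.completions.create` would fit), and the decomposition step is assumed to reply with a JSON object holding a `subtasks` list. Neither the function name nor that JSON shape comes from the paper.

```python
import json
from typing import Callable

LLM = Callable[[str, str], str]  # (system_prompt, user_message) -> reply text


def solve_with_fresh_calls(query: str, llm: LLM) -> str:
    """Decompose, solve each subtask in its own call, then integrate."""
    plan = llm(
        "Split the question into independent subtasks. "
        'Reply with JSON: {"subtasks": ["...", "..."]}',
        query,
    )
    try:
        subtasks = json.loads(plan).get("subtasks", [])
    except (json.JSONDecodeError, AttributeError):
        subtasks = []
    if not isinstance(subtasks, list) or not subtasks:
        # Simple question or unparseable plan: fall back to a direct answer.
        return llm("Answer the question carefully.", query)

    # One call per subtask, so no subtask sees another subtask's reasoning.
    answers = [llm("Solve this standalone subtask.", str(task)) for task in subtasks]

    notes = "\n".join(f"- {task}: {answer}" for task, answer in zip(subtasks, answers))
    return llm(
        "Combine the subtask answers into one coherent final answer.",
        f"Question: {query}\n\nSubtask answers:\n{notes}",
    )
```

This costs one call per subtask plus two more, and it only helps when the subtasks really are independent; if one subtask needs another's result, the earlier answer has to be passed along explicitly.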


# Complex multi-part query
result = run_meta_prompted_query("""
A startup raised $2M at a $10M post-money valuation in 2023. In 2024, they raised another $5M at a $25M post-money valuation.
The founders owned 80% after the first round.

What percentage do the founders own after the second round, and what is the dollar value of
the original investors' stake after the second round?
""")
print(result)
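
Because the model's reply to this query is itself a calculation, it helps to have a reference value to compare against. The sketch below works the numbers by hand, assuming both valuations are post-money (the reading consistent with founders holding 80% after a $2M raise at $10M) and that there is no option pool or other dilution. The helper name `dilute` is illustrative.

```python
def dilute(stake: float, raised: float, post_money: float) -> float:
    """Scale an existing ownership fraction down by a new priced round."""
    new_investor_share = raised / post_money
    return stake * (1.0 - new_investor_share)


founders_r1 = 0.80
seed_investors_r1 = 2_000_000 / 10_000_000  # 20% bought in the first round

# The second round sells 5M / 25M = 20% of the company, diluting everyone by 0.8.
founders_r2 = dilute(founders_r1, 5_000_000, 25_000_000)
seed_investors_r2 = dilute(seed_investors_r1, 5_000_000, 25_000_000)
seed_value_r2 = seed_investors_r2 * 25_000_000

print(f"Founders after round 2: {founders_r2:.0%}")
print(f"Original investors: {seed_investors_r2:.0%}, worth ${seed_value_r2:,.0f}")
```

Under these assumptions the founders hold 64% and the original investors hold 16%, worth $4M. A model answer that disagrees is either using a pre-money reading or has made an arithmetic slip.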

Prompt Decomposition: Breaking Complex Tasks into Pipelines

One of the most underused meta-prompting applications is using an LLM to design multi-step prompt pipelines:

PIPELINE_DESIGNER_PROMPT = """You are a prompt pipeline architect.

When given a complex task, design a multi-step pipeline where each step is a focused LLM call.

For each step, specify:
- Step name and purpose
- Input (from user or previous step output)
- Prompt instruction (complete, standalone)
- Output format
- How output feeds into next step

Design principles:
- Each step should do ONE thing well
- Steps should be independently testable
- Minimize information passing between steps
- Each step prompt should be explicit about its narrow scope

Output as a structured JSON pipeline spec."""

def design_pipeline(task_description: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": PIPELINE_DESIGNER_PROMPT},
            {
                "role": "user",
                "content": f"Design a prompt pipeline for: {task_description}"
            }
        ],
        temperature=0.2,
        response_format={"type": "json_object"}
    )
    import json
    return json.loads(response.choices[0].message.content)


# Example
pipeline_spec = design_pipeline(
    "Given a research paper PDF, produce a structured summary with: "
    "main findings, methodology, limitations, and relevance to ML practitioners."
)

# The model typically returns a multi-step pipeline along these lines:
# Step 1: Extract key sections from raw text
# Step 2: Identify main findings
# Step 3: Analyze methodology
# Step 4: Extract limitations
# Step 5: Assess ML relevance
# Step 6: Combine into structured output
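
A designed spec is only useful once something executes it. The sketch below runs a spec as a simple linear chain. The JSON shape is an assumption, since the designer prompt above does not pin a schema: it expects a top-level `steps` list whose items carry `name` and `prompt` keys. In practice you would state that schema explicitly in `PIPELINE_DESIGNER_PROMPT`. The `llm` callable is likewise a stand-in for whatever wraps your model call.

```python
from typing import Callable

LLM = Callable[[str, str], str]  # (system_prompt, user_message) -> reply text


def run_pipeline(spec: dict, initial_input: str, llm: LLM) -> dict:
    """Run each step in order, feeding every step the previous step's output."""
    steps = spec.get("steps")  # assumed key, not guaranteed by the designer prompt
    if not isinstance(steps, list) or not steps:
        raise ValueError("pipeline spec has no 'steps' list")

    outputs: dict[str, str] = {}
    current = initial_input
    for index, step in enumerate(steps, start=1):
        name = step.get("name", f"step_{index}")
        prompt = step.get("prompt")
        if not prompt:
            raise ValueError(f"step {name!r} is missing a 'prompt'")
        current = llm(prompt, current)
        outputs[name] = current  # keep intermediates so each step can be inspected
    return outputs
```

A strict chain is the simplest case. In the research-paper example, the findings, methodology, and limitations steps would all read from the extracted text rather than from each other, so a real runner would need each step to name its input source.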

Where Meta-Prompting Falls Short

Comparing Meta-Prompting Approaches

Approach	Setup Cost	Quality	Best Use Case	Requires Examples?
Simple generation	Low	Good first draft	New prompts from scratch	Optional
Critique + refine	Low	Better than manual	Improving existing prompts	No
Iterative refinement (3+ cycles)	Medium	High	Production-grade prompts	Optional
Few-shot generation	Low	Task-specific	Training data creation	No (specify coverage)
Pipeline decomposition	Low	Variable	Complex multi-step tasks	No
Stanford meta-prompt	None (inference)	High for reasoning	Complex analytical queries	No
APO + meta-prompting	High	Highest	High-volume production tasks	Yes (50+)

Practical Workflow

A meta-prompting workflow that tends to produce better results than purely manual prompt engineering:

1. Describe your task in plain language (no prompt syntax)
2. Use the generator to produce a first draft
3. Read it critically and note your concerns
4. Run the critique meta-prompt on it
5. Compare what the critique identified with what you noticed: the overlap tells you what you already knew; the gaps tell you what you missed
6. Refine once or twice
7. Test on real examples before any further optimization
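
The final step, testing on real examples, needs very little code. Here is a minimal sketch, assuming a classification-style task scored by exact label match; `predict` stands for whatever function wraps your prompt and model call, and `keyword_router` is a stand-in so the example runs without an API key.

```python
from typing import Callable


def evaluate_prompt(predict: Callable[[str], str], labeled: list[tuple[str, str]]) -> dict:
    """Score a prompt-backed predictor against labeled (input, expected) pairs."""
    if not labeled:
        raise ValueError("need at least one labeled example")
    failures = []
    for text, expected in labeled:
        got = predict(text).strip().lower()
        if got != expected.strip().lower():
            failures.append({"input": text, "expected": expected, "got": got})
    accuracy = 1 - len(failures) / len(labeled)
    return {"accuracy": accuracy, "failures": failures}


# Stand-in predictor; replace with a function that calls the model with your prompt.
def keyword_router(ticket: str) -> str:
    return "billing" if "charge" in ticket.lower() else "technical"


report = evaluate_prompt(
    keyword_router,
    [
        ("I was charged twice", "billing"),
        ("App crashes on login", "technical"),
        ("How do I change my email?", "general"),
    ],
)
print(f"accuracy={report['accuracy']:.2f}, failures={len(report['failures'])}")
```

Keeping the failure list, not just the accuracy number, gives the next critique or refinement round concrete cases to work from.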

The model knows a lot about how it works. Let it tell you.


Abdullah Al Arman Emon

Software Testing Expert & Prompt Engineering
