LLM Token Pricing Explained: How to Calculate and Minimize AI API Costs
LLM token pricing explained — how tokens are counted, 2025 API pricing for GPT-4, Claude, and Gemini, and practical strategies to cut costs by 70-90% without losing quality.
My first month building with the OpenAI API, I got a $340 bill I hadn't expected. I'd been sending entire documents as context for every query — 10,000-token prompts for questions that needed maybe 500 tokens of relevant context. Simple optimization reduced that bill to $28 the following month.
Understanding token pricing is the difference between an AI feature that's economically viable and one that's quietly burning money. With the right strategies, you can often reduce costs by 70-90% without any quality degradation.
Here's the complete picture of how pricing works, what the major APIs charge in 2025, and the actual techniques that move the needle.
How Tokens Work
Tokens are the basic unit of text that LLMs process. They're not quite words, not quite characters — they're subword pieces determined by the tokenizer.
import tiktoken

# GPT-4o uses the o200k_base encoding (GPT-4 and GPT-3.5 Turbo use cl100k_base)
enc = tiktoken.encoding_for_model("gpt-4o")

examples = {
    "Common word": "hello",
    "Uncommon word": "serendipitous",
    "Code": "def calculate_roi(investment, returns):",
    "Number": "12345678",
    "Non-English": "こんにちは",  # Japanese
}

for name, text in examples.items():
    tokens = enc.encode(text)
    print(f"{name}: '{text}'")
    print(f"  Tokens: {len(tokens)} → {[enc.decode([t]) for t in tokens]}\n")

# Illustrative output (exact counts and splits depend on the encoding):
# Common word: 'hello'
#   Tokens: 1 → ['hello']
# Uncommon word: 'serendipitous'
#   Tokens: 4 → ['seren', 'dip', 'it', 'ous']
# Non-English: 'こんにちは'
#   Tokens: 5 → many single-character tokens
Token Counting Rules of Thumb
| Content Type | Tokens per Word | Tokens per 1K Words |
|---|---|---|
| English prose | 1.3-1.5 | 1,300-1,500 |
| English code | 1.5-2.0 | 1,500-2,000 |
| Technical writing | 1.4-1.6 | 1,400-1,600 |
| Chinese/Japanese | 2-3× the English count | 2,600-4,500 (per 1K English-equivalent words) |
| Mixed code + text | ~1.7 | 1,700 |
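The ratios in the table can be turned into a quick estimator for budgeting before any text is actually tokenized. This is a minimal sketch: the function name and the content-type keys are made up for illustration, and the ranges are simply the table's rules of thumb, not measured values.

```python
# Tokens-per-word ranges copied from the rules-of-thumb table above.
TOKENS_PER_WORD = {
    "english_prose": (1.3, 1.5),
    "english_code": (1.5, 2.0),
    "technical_writing": (1.4, 1.6),
}


def estimate_tokens(word_count: int, content_type: str = "english_prose") -> tuple[int, int]:
    """Return a (low, high) token estimate for a given word count."""
    if content_type not in TOKENS_PER_WORD:
        raise ValueError(f"Unknown content type: {content_type}")
    low, high = TOKENS_PER_WORD[content_type]
    return round(word_count * low), round(word_count * high)


low, high = estimate_tokens(1000, "english_prose")
print(f"1,000 words of English prose: roughly {low:,}-{high:,} tokens")
```

For anything billing-critical, count real tokens with the provider's tokenizer instead of relying on the estimate.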
2025 API Pricing: Major Providers
OpenAI
| Model | Input ($/M tokens) | Output ($/M tokens) | Notes |
|---|---|---|---|
| GPT-4o | $5.00 | $15.00 | Flagship |
| GPT-4o mini | $0.15 | $0.60 | Budget, strong for simple tasks |
| o1 | $15.00 | $60.00 | Reasoning model |
| o3-mini | $1.10 | $4.40 | Budget reasoning |
| GPT-3.5 Turbo | $0.50 | $1.50 | Legacy, often worse than 4o mini |
Anthropic (Claude)
| Model | Input ($/M tokens) | Output ($/M tokens) | Notes |
|---|---|---|---|
| Claude 3.5 Sonnet | $3.00 | $15.00 | Flagship |
| Claude 3 Haiku | $0.25 | $1.25 | Budget, very fast |
| Claude 3 Opus | $15.00 | $75.00 | Most capable |
| Claude 3.5 Sonnet, cached input (cache read) | $0.30 | n/a | 90% off the $3.00 input rate on cache hits |
Google (Gemini)
| Model | Input ($/M tokens) | Output ($/M tokens) | Notes |
|---|---|---|---|
| Gemini 1.5 Pro (<128K ctx) | $3.50 | $10.50 | Flagship |
| Gemini 1.5 Pro (>128K ctx) | $7.00 | $21.00 | Long context premium |
| Gemini 1.5 Flash | $0.075 | $0.30 | Budget, fastest |
| Gemini 1.0 Ultra | $10.00 | $30.00 | Legacy flagship |
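The two Gemini 1.5 Pro rows mean the per-token rate depends on prompt length, which a flat price lookup would miss. The sketch below encodes that tier switch using the table's rates. It assumes the whole request is billed at the higher tier once the prompt exceeds 128,000 tokens; that threshold handling and the function name are assumptions for illustration, so check the provider's pricing page for the exact rule.

```python
def gemini_15_pro_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one Gemini 1.5 Pro request, using the table's two tiers.

    Assumption: prompts longer than 128,000 tokens move the entire request
    (input and output) to the long-context rates.
    """
    long_context_threshold = 128_000
    if input_tokens > long_context_threshold:
        input_rate, output_rate = 7.00, 21.00   # $/M tokens, long-context tier
    else:
        input_rate, output_rate = 3.50, 10.50   # $/M tokens, standard tier
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


short = gemini_15_pro_cost(100_000, 1_000)
long = gemini_15_pro_cost(200_000, 1_000)
print(f"100K-token prompt: ${short:.4f}")
print(f"200K-token prompt: ${long:.4f}")
```

Doubling the prompt from 100K to 200K tokens roughly quadruples the cost here, because both the token count and the rate double.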
Cost Calculator
def calculate_cost(
    input_tokens: int,
    output_tokens: int,
    model: str,
    requests_per_day: int = 1
) -> dict:
    # (input, output) rates in $ per million tokens, from the tables above
    pricing = {
        "gpt-4o": (5.00, 15.00),
        "gpt-4o-mini": (0.15, 0.60),
        "claude-3-5-sonnet": (3.00, 15.00),
        "claude-3-haiku": (0.25, 1.25),
        "gemini-1.5-pro": (3.50, 10.50),
        "gemini-1.5-flash": (0.075, 0.30),
    }
    if model not in pricing:
        raise ValueError(f"Unknown model: {model}")
    input_rate, output_rate = pricing[model]
    cost_per_request = (input_tokens / 1_000_000 * input_rate) + \
                       (output_tokens / 1_000_000 * output_rate)
    return {
        "cost_per_request": cost_per_request,
        "daily_cost": cost_per_request * requests_per_day,
        "monthly_cost": cost_per_request * requests_per_day * 30,
        "annual_cost": cost_per_request * requests_per_day * 365,
    }

# Example: document analysis app
result = calculate_cost(
    input_tokens=8000,   # 6K document + 2K system prompt
    output_tokens=500,   # Summary response
    model="gpt-4o",
    requests_per_day=1000
)
print(f"Per request: ${result['cost_per_request']:.4f}")
print(f"Monthly: ${result['monthly_cost']:.2f}")

# Compare models
for model in ["gpt-4o", "gpt-4o-mini", "claude-3-haiku", "gemini-1.5-flash"]:
    r = calculate_cost(8000, 500, model, 1000)
    print(f"{model}: ${r['monthly_cost']:.2f}/month")

# Output:
# gpt-4o: $1425.00/month
# gpt-4o-mini: $45.00/month
# claude-3-haiku: $78.75/month
# gemini-1.5-flash: $22.50/month
Strategy 1: Model Routing
Not every task needs GPT-4. Route tasks to the cheapest model that can handle them:
from openai import OpenAI

client = OpenAI()

def classify_task_complexity(task: str) -> str:
    """Classify task complexity to route to appropriate model."""
    # Simple tasks: classification, extraction, basic QA
    # Medium tasks: summarization, explanation, code review
    # Hard tasks: complex reasoning, nuanced writing, debugging
    simple_indicators = ["classify", "extract", "yes or no", "is this", "which category"]
    hard_indicators = ["debug", "explain why", "analyze", "compare", "architect", "reason"]
    task_lower = task.lower()
    # Hard indicators are checked first, so a task matching both lists is routed as hard
    if any(word in task_lower for word in hard_indicators):
        return "hard"
    if any(word in task_lower for word in simple_indicators):
        return "simple"
    return "medium"

def smart_completion(
    messages: list,
    task_description: str = "",
    force_model: str | None = None
) -> str:
    """Route to cheapest model based on task complexity."""
    model_map = {
        "simple": "gpt-4o-mini",  # $0.15/$0.60 per M tokens
        "medium": "gpt-4o-mini",  # Test if mini is sufficient
        "hard": "gpt-4o"          # $5/$15 per M tokens
    }
    complexity = classify_task_complexity(task_description)
    model = force_model or model_map[complexity]
    response = client.chat.completions.create(
        model=model,
        messages=messages
    )
    # Log for monitoring
    usage = response.usage
    print(f"Model: {model} | Complexity: {complexity} | "
          f"Tokens: {usage.prompt_tokens} in, {usage.completion_tokens} out")
    return response.choices[0].message.content

# Simple task → mini model
result = smart_completion(
    messages=[{"role": "user", "content": "Classify this as positive or negative: 'Great product!'"}],
    task_description="classify sentiment"
)
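How much routing saves depends on the share of traffic that still needs the flagship model. The sketch below estimates the average per-request cost for a given mix, using the GPT-4o and GPT-4o mini rates from the OpenAI table. The function names and the 20% "hard" share are hypothetical values chosen for illustration.

```python
# Rates in $/M tokens (input, output), copied from the OpenAI table above.
RATES = {"gpt-4o": (5.00, 15.00), "gpt-4o-mini": (0.15, 0.60)}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a single request at the listed rates."""
    input_rate, output_rate = RATES[model]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


def blended_cost(hard_fraction: float, input_tokens: int, output_tokens: int) -> float:
    """Average cost per request when `hard_fraction` of traffic goes to gpt-4o
    and the remainder goes to gpt-4o-mini."""
    if not 0.0 <= hard_fraction <= 1.0:
        raise ValueError("hard_fraction must be between 0 and 1")
    flagship = request_cost("gpt-4o", input_tokens, output_tokens)
    budget = request_cost("gpt-4o-mini", input_tokens, output_tokens)
    return hard_fraction * flagship + (1 - hard_fraction) * budget


# Hypothetical mix: 20% hard tasks, 1,000 input / 200 output tokens per request.
all_flagship = blended_cost(1.0, 1000, 200)
routed = blended_cost(0.2, 1000, 200)
print(f"All gpt-4o: ${all_flagship:.5f} per request")
print(f"Routed:     ${routed:.5f} per request")
print(f"Savings:    {1 - routed / all_flagship:.0%}")
```

Logging the routed complexity per request in production gives the real value of `hard_fraction` to plug in here.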
Strategy 2: Prompt Caching (Claude)
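Anthropic bills cache reads at 10% of the normal input rate (a 90% discount), while the request that first writes the cache costs 25% more than usual. This matters most when a long system prompt is reused across many requests: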
import anthropic

client = anthropic.Anthropic()

# Long system prompt. It must exceed the model's minimum cacheable length
# (1,024 tokens for Claude 3.5 Sonnet, 2,048 for Claude 3 Haiku).
SYSTEM_PROMPT = """You are an expert assistant for [Company Name].
Company Context:
[Include extensive company documentation, product details, policies here...]
[This block is 2000+ tokens and doesn't change between requests]
"""

def cached_completion(user_message: str) -> str:
    """Use prompt caching for system prompt: 90% discount on cache hits."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"}  # Cache this block
            }
        ],
        messages=[{"role": "user", "content": user_message}]
    )
    # Check cache usage
    usage = response.usage
    print(f"Cache read tokens: {usage.cache_read_input_tokens}")       # 90% cheaper
    print(f"Cache write tokens: {usage.cache_creation_input_tokens}")  # 25% more expensive
    print(f"Uncached input tokens: {usage.input_tokens}")
    return response.content[0].text

# First request: pays full price + cache write fee (25% more)
result1 = cached_completion("What is your return policy?")
# Second request: 90% discount on the 2000-token system prompt
result2 = cached_completion("How do I track my order?")
# Cache hit saves: 2000 tokens × $3/M × 90% = ~$0.0054 per request
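Because the first request pays a 25% surcharge to write the cache, caching only pays off once the prompt is reused. The sketch below works through that break-even for a shared prompt prefix, using the $3/M Claude 3.5 Sonnet input rate and the 1.25× write and 0.10× read multipliers quoted above. It assumes every request after the first is a cache hit (no expiry in between), which is an idealization; the function names are illustrative.

```python
def prefix_cost_without_cache(prefix_tokens: int, requests: int,
                              input_rate: float = 3.00) -> float:
    """USD spent on the shared prefix when every request pays the full input rate."""
    return prefix_tokens * requests * input_rate / 1_000_000


def prefix_cost_with_cache(prefix_tokens: int, requests: int,
                           input_rate: float = 3.00,
                           write_multiplier: float = 1.25,
                           read_multiplier: float = 0.10) -> float:
    """USD spent on the shared prefix with one cache write, then only cache hits.

    Assumption: the cache never expires between requests.
    """
    if requests < 1:
        return 0.0
    write = prefix_tokens * input_rate * write_multiplier
    reads = prefix_tokens * input_rate * read_multiplier * (requests - 1)
    return (write + reads) / 1_000_000


for n in (1, 2, 100):
    plain = prefix_cost_without_cache(2000, n)
    cached = prefix_cost_with_cache(2000, n)
    print(f"{n:>3} requests: ${plain:.4f} uncached vs ${cached:.4f} cached")
```

Under these assumptions a single request is cheaper without caching, and caching comes out ahead from the second request onward.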
Strategy 3: Output Length Control
Output tokens cost 3-5× more than input — controlling response length is high leverage:
def cost_optimized_prompt(question: str, max_words: int = 100) -> str:
    """Append explicit length instructions to control output costs."""
    return f"""{question}
Respond in {max_words} words or fewer. Be direct and specific. No preamble."""

# Without length control: model might write 500 tokens
# With "respond in 100 words": often around 130-150 tokens (about 1.3-1.5 tokens per word)
# Cost reduction: roughly 70% on output tokens, usually with no loss of useful content

# For structured data extraction: JSON output is token-efficient
extraction_prompt = """Extract from this text:
- customer_name
- order_id
- issue_type
- urgency (high/medium/low)
Return ONLY valid JSON. No explanation.
Text: {user_text}"""
# Fill the placeholder with extraction_prompt.format(user_text=...)
# JSON output is concise, parseable, and cheaper than prose explanations
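A length instruction in the prompt is only a request. A `max_tokens` cap on the API call is a hard limit that can back it up. The helper below is a sketch for converting a word budget into such a cap. The 1.5 tokens-per-word ratio comes from the upper end of the English prose rule of thumb, and the 20% headroom is an arbitrary assumption so that answers that follow the instruction are not cut off.

```python
import math


def max_tokens_for_words(max_words: int,
                         tokens_per_word: float = 1.5,
                         headroom: float = 1.2) -> int:
    """Translate a word budget into a token cap with some safety margin.

    The rounding step guards against floating-point noise before taking the ceiling.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    return math.ceil(round(max_words * tokens_per_word * headroom, 6))


cap = max_tokens_for_words(100)
print(f"Prompt asks for 100 words; hard cap of {cap} tokens")
```

A cap truncates the response rather than making it more concise, so it works best as a backstop alongside the prompt instruction, not as a replacement for it.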
Strategy 4: OpenAI Batch API (50% Discount)
For non-real-time workloads:
import json
from openai import OpenAI

client = OpenAI()

# Placeholder input: replace with the texts you want summarized
documents_to_process = ["First document text...", "Second document text..."]

# Prepare batch requests
requests = [
    {
        "custom_id": f"request-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o",
            "messages": [
                {"role": "user", "content": f"Summarize: {document}"}
            ],
            "max_tokens": 200
        }
    }
    for i, document in enumerate(documents_to_process)
]

# Write to JSONL file
with open("batch_requests.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

# Upload and create batch
with open("batch_requests.jsonl", "rb") as f:
    batch_file = client.files.create(file=f, purpose="batch")

batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h"  # Results within 24 hours
)
print(f"Batch created: {batch.id}")
print("50% discount applied to all tokens in this batch")

# Check status (a freshly created batch will not be complete yet; re-run this later)
batch_status = client.batches.retrieve(batch.id)
print(f"Status: {batch_status.status}")

# Retrieve results when complete
if batch_status.status == "completed":
    results = client.files.content(batch_status.output_file_id)
    for line in results.text.splitlines():
        result = json.loads(line)
        print(result["response"]["body"]["choices"][0]["message"]["content"])
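A batch usually takes a while to finish, so a single status check right after creation will rarely see `completed`. One way to handle that is a small polling loop, sketched below. The helper takes the retrieve function as a parameter (for example `client.batches.retrieve`), so it can be exercised without network access. The helper name, the polling interval, and the set of terminal statuses are assumptions based on the statuses the Batch API documents.

```python
import time
from typing import Any, Callable

# Assumed terminal statuses; other values (e.g. "validating", "in_progress",
# "finalizing") are treated as still running.
TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}


def wait_for_batch(
    retrieve: Callable[[str], Any],
    batch_id: str,
    poll_seconds: float = 60.0,
    timeout_seconds: float = 24 * 60 * 60,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Poll `retrieve(batch_id)` until the batch reaches a terminal status.

    Raises TimeoutError if no terminal status is seen within `timeout_seconds`
    of accumulated waiting.
    """
    waited = 0.0
    while True:
        batch = retrieve(batch_id)
        if batch.status in TERMINAL_STATUSES:
            return batch
        if waited >= timeout_seconds:
            raise TimeoutError(f"Batch {batch_id} still '{batch.status}' after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Usage with the OpenAI client (not executed here):
# final = wait_for_batch(client.batches.retrieve, batch.id)
# if final.status == "completed":
#     ...download final.output_file_id...
```

The caller still needs to check which terminal status came back, since `failed` and `expired` batches have no complete output file to download.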
Strategy 5: Response Caching
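Cache responses to identical queries so repeated questions never reach the API. Matching merely similar queries requires a semantic cache, which this exact-match example does not implement: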
import hashlib

from openai import OpenAI

client = OpenAI()

# Simple exact-match caching
response_cache = {}

def cached_api_call(prompt: str, model: str = "gpt-4o") -> str:
    # MD5 serves only as a fast cache key here, not as a security measure
    cache_key = hashlib.md5(f"{model}:{prompt}".encode()).hexdigest()
    if cache_key in response_cache:
        print("Cache hit: $0 cost")
        return response_cache[cache_key]
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    result = response.choices[0].message.content
    response_cache[cache_key] = result
    # In production: use Redis with TTL
    # redis_client.setex(cache_key, 3600, result)  # 1 hour TTL
    return result

# For FAQ systems: pre-cache answers to common questions
# Cache hit rate of 30-50% cuts costs significantly for common use cases
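Two cheap refinements raise the hit rate and limit staleness without external infrastructure: normalizing the prompt before hashing it, and expiring entries after a fixed time. The class below is an in-memory sketch of both. The class name, the normalization rule (lowercase and collapsed whitespace), and the one-hour default TTL are illustrative assumptions, and lowercasing is only safe when case does not change the meaning of the prompt.

```python
import hashlib
import time
from typing import Callable, Optional


class TTLResponseCache:
    """In-memory response cache with normalized keys and time-based expiry."""

    def __init__(self, ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without waiting
        self._store: dict[str, tuple[float, str]] = {}

    @staticmethod
    def make_key(model: str, prompt: str) -> str:
        # Lowercase and collapse whitespace so trivial variations share a key.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(f"{model}:{normalized}".encode("utf-8")).hexdigest()

    def get(self, model: str, prompt: str) -> Optional[str]:
        key = self.make_key(model, prompt)
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # entry is stale; drop it
            return None
        return value

    def set(self, model: str, prompt: str, value: str) -> None:
        key = self.make_key(model, prompt)
        self._store[key] = (self.clock() + self.ttl_seconds, value)


cache = TTLResponseCache(ttl_seconds=3600)
cache.set("gpt-4o", "What is your return policy?", "Returns are accepted within 30 days.")
print(cache.get("gpt-4o", "  what is your RETURN policy? "))
```

A lookup that returns `None` is the signal to call the API and store the fresh response with `set`.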
Real-World Cost Optimization Example
Before optimization:
- Task: Analyze customer feedback documents
- Stack: GPT-4o, full 30-page document in context, verbose responses
- Per request: 15,000 input tokens × $5/M + 800 output tokens × $15/M = $0.075 + $0.012 = $0.087
- At 500 requests/day: $43.50/day → $1,305/month
After optimization:
- Stack: RAG (only retrieve relevant chunks) + GPT-4o mini + concise prompts
- Per request: 2,000 input tokens × $0.15/M + 200 output tokens × $0.60/M = $0.0003 + $0.00012 = $0.00042
- At 500 requests/day: $0.21/day → $6.30/month
Cost reduction: 99.5%, with comparable quality on structured extraction tasks
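The before-and-after figures can be reproduced in a few lines, which also makes it easy to substitute your own token counts and request volume. This is a sketch using the rates and volumes from the example; the function name is illustrative.

```python
def monthly_cost(input_tokens: int, output_tokens: int,
                 input_rate: float, output_rate: float,
                 requests_per_day: int, days: int = 30) -> float:
    """Monthly cost in USD; rates are in $ per million tokens."""
    per_request = (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000
    return per_request * requests_per_day * days


before = monthly_cost(15_000, 800, 5.00, 15.00, requests_per_day=500)  # GPT-4o, full document
after = monthly_cost(2_000, 200, 0.15, 0.60, requests_per_day=500)     # GPT-4o mini + RAG
reduction = 1 - after / before

print(f"Before: ${before:,.2f}/month")
print(f"After:  ${after:,.2f}/month")
print(f"Reduction: {reduction:.1%}")
# Output:
# Before: $1,305.00/month
# After:  $6.30/month
# Reduction: 99.5%
```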
Conclusion
Accounting for token pricing is one of the most important architectural considerations when building LLM applications. The cost difference between an optimized and an unoptimized system can be 10-100×.
The hierarchy of impact: first choose the right model tier (biggest impact), then control output length, use caching for repeated prompts, apply RAG to reduce context size, and batch non-real-time workloads. Combining these strategies typically reduces costs by 70-90% without meaningful quality loss.
For the retrieval system that enables sending smaller prompts (RAG), see our RAG guide. For running models locally at zero API cost, see our open-source LLM guide.
Frequently Asked Questions
How does LLM token pricing work?
APIs charge per token (roughly 4 characters or 0.75 words of English), with separate rates for input and output. Output tokens typically cost 3-5× more than input. At the GPT-4o rates listed above ($5/M input, $15/M output), a request with 1,000 input tokens and 200 output tokens costs $0.005 + $0.003 = $0.008. Model choice is the biggest cost lever: at those list prices gpt-4o-mini is about 33× cheaper than gpt-4o on input and 25× cheaper on output.
How many tokens is 1000 words?
About 1,300-1,500 tokens for typical English text. Code needs more tokens per word (roughly 1.5-2.0, versus 1.3-1.5 for prose). Non-English text (especially Chinese/Japanese) takes 2-3× as many tokens as equivalent English. Rule of thumb: multiply word count by 1.33 for a quick estimate, or by 1.5 for a conservative one.
What is the cheapest LLM API in 2025?
Gemini 1.5 Flash ($0.075 input/$0.30 output per million tokens) and GPT-4o mini ($0.15/$0.60) are cheapest among frontier APIs. For open-source hosted: Groq's free tier or Together AI (~$0.90/M for LLaMA 3.1 70B). Local inference via Ollama is free after hardware costs.
How can I reduce my LLM API costs?
Ranked by impact: use smaller models for simple tasks (20-33× savings); control output length with explicit instructions (3-5× savings on output); use prompt caching for repeated system prompts (90% discount); use Batch API for non-real-time work (50% discount); implement RAG to send smaller prompts; cache responses to repeated queries.
Is it worth using GPT-4 vs GPT-4o mini for my use case?
For most practical tasks (classification, extraction, summarization, basic QA), GPT-4o mini performs 80-95% as well at 25-33× lower cost. Flagship quality is justified for complex reasoning, nuanced writing, difficult debugging, and tasks where errors have high business cost. Prototype with the flagship model, then step down to cheaper models until quality starts to degrade.
AiTechWorlds Team