AI agent cost optimization: 10 tips to reduce your LLM bill
Practical strategies to cut AI agent costs by 40-60% — caching, prompt compression, model tiering, batch processing, and monitoring.
OpenAI’s prompt caching guide describes how caching frequent input prefixes can significantly reduce both cost and latency, which complements the caching strategies in this post.
Anthropic’s prompt caching documentation shows how caching repeated context prefixes lowers API costs: cached reads are billed at a fraction of the normal input price, which makes it one of the most impactful cost-saving techniques.
TL;DR: My first production agent cost ₹12,000/month in API calls. After applying these 10 strategies — semantic caching, prompt compression, model tiering, batching, context budgeting, tool result caching, rate limiting, cost monitoring, open-source models, and pruning unused tools — the same agent runs on ₹4,500/month (62% reduction) with zero quality loss.
My first production agent cost ₹12,000/month in API calls. I almost killed the project right there.
The agent was doing legitimate work — processing support tickets, generating reports, automating workflows. But the costs were eating the margins. At ₹12,000/month, the agent was more expensive than the intern it was supposed to replace.
I spent the next month optimizing. Here’s what I learned.
After applying these 10 strategies, the same agent runs on ₹4,500/month — a 62% reduction. The work output is the same. The quality is the same. The only difference is we stopped wasting tokens.
Key takeaways:
- Semantic caching alone saves 25-35% — cache LLM responses for similar queries
- Model tiering saves 15-25% — cheap model for simple tasks, expensive model for complex
- Prompt compression, batching, and context budgeting each save 10-20%
- Composite savings of 60-70% are achievable without quality loss
1. Semantic caching (~30% savings)
The biggest waste in most agent deployments: answering the same question repeatedly. A semantic cache stores previous LLM responses and returns them for similar queries:
import time

import numpy as np
from openai import OpenAI


class SemanticCache:
    def __init__(self, similarity_threshold=0.92):
        self.cache = []  # [{embedding, prompt, response, cost, timestamp}]
        self.threshold = similarity_threshold
        self.embedding_client = OpenAI()
        self.hits = 0
        self.misses = 0
        self.total_savings = 0.0  # LLM cost avoided by cache hits

    def get(self, prompt: str) -> dict | None:
        # Note: every lookup still pays for one (cheap) embedding call
        prompt_embedding = self._embed(prompt)
        # Linear scan: fine for a small cache, use a vector index at scale
        for entry in self.cache:
            similarity = self._cosine_similarity(prompt_embedding, entry["embedding"])
            if similarity >= self.threshold:
                self.hits += 1
                self.total_savings += entry["cost"]
                return {
                    "response": entry["response"],
                    "cached": True,
                    "similarity": similarity,
                    "savings": entry["cost"]
                }
        self.misses += 1
        return None

    def set(self, prompt: str, response: str, cost: float):
        self.cache.append({
            "embedding": self._embed(prompt),
            "prompt": prompt,
            "response": response,
            "cost": cost,
            "timestamp": time.time()
        })

    def stats(self):
        total = self.hits + self.misses
        return {
            "hit_rate": self.hits / total if total > 0 else 0,
            # Sum of the original cost of every response served from cache
            "total_savings": self.total_savings
        }

    def _embed(self, text: str) -> list[float]:
        response = self.embedding_client.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        return response.data[0].embedding

    def _cosine_similarity(self, a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
Expected savings: 25-35% for support agents, 15-20% for coding agents. Support queries have high repetition (same billing question from different users). Coding agents produce more unique outputs, so cache hit rates are lower.
Gotcha: Cache invalidation. If your knowledge base changes, cached responses become stale. Solution: add a TTL (time-to-live) — typically 24 hours for support, 1 hour for time-sensitive queries.
2. Prompt compression (~15% savings)
Long system prompts eat your input token budget, because every token in the system prompt is billed again on every call:
# Before: 750 tokens in system prompt
SYSTEM_PROMPT_VERBOSE = """You are a helpful AI agent that assists users with their tasks.
Your job is to understand what the user wants and help them accomplish it.
You have access to the following tools: read_file, write_file, run_command, search_web.
When you use a tool, make sure to provide the correct arguments.
If you're not sure about something, ask the user for clarification.
...
""" # ~500 more words
# After: 180 tokens in system prompt
SYSTEM_PROMPT_COMPRESSED = """You are a coding agent with tools: read_file, write_file, run_command, search_web.
Rules:
- Provide correct tool arguments
- Ask for clarification when unsure
- One task at a time
- Report results concisely"""
Expected savings: 10-20% from shorter system prompts. More importantly, shorter prompts reduce latency — fewer tokens means faster first-token generation.
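To see what the shorter prompt is worth in money, here is a rough calculation. The token counts (750 and 180) come from the example above. The call volume and the price per 1K input tokens are assumptions for illustration, so substitute your own figures.

```python
def monthly_prompt_cost(prompt_tokens: int, calls_per_month: int, price_per_1k: float) -> float:
    """Cost of sending the system prompt on every call for a month."""
    return prompt_tokens * calls_per_month / 1000 * price_per_1k


CALLS_PER_MONTH = 10_000  # assumed volume
PRICE_PER_1K_INPUT = 0.25  # assumed price in INR per 1K input tokens

before = monthly_prompt_cost(750, CALLS_PER_MONTH, PRICE_PER_1K_INPUT)
after = monthly_prompt_cost(180, CALLS_PER_MONTH, PRICE_PER_1K_INPUT)
token_reduction = (750 - 180) / 750

print(f"Before: ₹{before:.2f}/month, after: ₹{after:.2f}/month")
print(f"System prompt tokens cut by {token_reduction:.0%}")
```

The 76% cut applies only to the system prompt. The effect on the total bill depends on how much of each call is system prompt versus conversation history and output.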
3. Model tiering (~20% savings)
One of the largest line-item savings: don’t use your most expensive model for every task:
MODEL_TIERS = {
    "cheap": {
        "model": "claude-3-haiku-20240307",
        "cost_per_call": 0.002,  # $0.002 per call
        "use_for": ["classify", "extract", "simple_qa", "summarize"]
    },
    "standard": {
        "model": "claude-sonnet-4-20250514",
        "cost_per_call": 0.015,  # $0.015 per call
        "use_for": ["generate", "analyze", "code_review"]
    },
    "expensive": {
        "model": "claude-opus-4-20250514",
        "cost_per_call": 0.075,  # $0.075 per call
        "use_for": ["complex_reasoning", "debug", "plan"]
    }
}


def select_model(task_type: str, complexity: str = "low") -> str:
    """Return the model ID for a task; unknown task types fall back to the cheap tier."""
    if complexity == "high" or task_type in MODEL_TIERS["expensive"]["use_for"]:
        tier = "expensive"
    elif task_type in MODEL_TIERS["standard"]["use_for"]:
        tier = "standard"
    else:
        tier = "cheap"
    return MODEL_TIERS[tier]["model"]
Expected savings: 15-25% overall. In my experience, about 60% of agent tasks are simple enough for Haiku-level models. Only 10% need Opus-level reasoning. The remaining 30% work fine on Sonnet.
Real example from my stack:
- Intent classification → Haiku (₹0.03/call)
- Code generation → Sonnet (₹0.30/call)
- Debug analysis → Opus (₹1.50/call)
- Average cost per run: ₹0.50 instead of ₹1.50 if everything used Opus
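The blended figure can be checked with a weighted average. This sketch uses the task shares quoted above (60% Haiku, 30% Sonnet, 10% Opus) and the per-call prices from the list. It gives a per-call cost, so a run that makes several calls will cost proportionally more.

```python
# (share of tasks, cost per call in INR), taken from the figures above
TASK_MIX = {
    "haiku": (0.60, 0.03),
    "sonnet": (0.30, 0.30),
    "opus": (0.10, 1.50),
}


def blended_cost_per_call(mix: dict[str, tuple[float, float]]) -> float:
    """Weighted average cost per call across model tiers."""
    return sum(share * cost for share, cost in mix.values())


blended = blended_cost_per_call(TASK_MIX)
all_opus = TASK_MIX["opus"][1]

print(f"Blended: ₹{blended:.3f} per call")
print(f"All Opus: ₹{all_opus:.2f} per call")
print(f"Reduction vs all Opus: {1 - blended / all_opus:.0%}")
```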
4. Batch independent LLM calls (~20% savings)
Many agent workflows make multiple independent LLM calls. If they don’t depend on each other, batch them:
import asyncio

from anthropic import AsyncAnthropic

async_client = AsyncAnthropic()


async def batch_llm_calls(calls: list[dict]) -> list[str]:
    """Execute independent LLM calls concurrently and return their text in order."""
    async def single_call(call):
        response = await async_client.messages.create(**call)
        return response.content[0].text

    results = await asyncio.gather(*[single_call(c) for c in calls])
    return results


# Before: 3 sequential calls = 3x latency
# classify = llm.call(...)
# extract = llm.call(...)
# summarize = llm.call(...)

# After: 3 concurrent calls = roughly 1x latency
calls = [
    {"model": "claude-3-haiku-20240307", "max_tokens": 100, "messages": [...]},  # classify
    {"model": "claude-3-haiku-20240307", "max_tokens": 200, "messages": [...]},  # extract
    {"model": "claude-3-haiku-20240307", "max_tokens": 150, "messages": [...]},  # summarize
]
# Use asyncio.run() from synchronous code; inside an async function, await it directly
results = asyncio.run(batch_llm_calls(calls))
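Expected savings: 15-20% in my runs, with the clearest gain in latency. Running calls concurrently removes the waiting between requests but does not change the per-token price. For jobs that can wait for results, provider batch endpoints such as Anthropic’s Message Batches API are billed at a discount.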
When NOT to batch: If one call depends on another’s output (e.g., you need to classify before you can retrieve), batching doesn’t apply. Only batch truly independent calls.
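The latency effect of running independent calls together can be seen without touching a real API. In this sketch `fake_llm_call` is a hypothetical stand-in that just sleeps, and the 0.05 second delays are arbitrary. Only `asyncio.gather` and `asyncio.sleep` are real library calls.

```python
import asyncio
import time


async def fake_llm_call(name: str, delay: float) -> str:
    """Hypothetical stand-in for a network-bound LLM request."""
    await asyncio.sleep(delay)
    return f"{name}: done"


async def run_sequential(specs: list[tuple[str, float]]) -> list[str]:
    # Each call waits for the previous one to finish
    return [await fake_llm_call(name, delay) for name, delay in specs]


async def run_concurrent(specs: list[tuple[str, float]]) -> list[str]:
    # All calls are in flight at once; gather preserves input order
    return list(await asyncio.gather(*(fake_llm_call(n, d) for n, d in specs)))


def timed(coro_fn, specs):
    start = time.perf_counter()
    results = asyncio.run(coro_fn(specs))
    return results, time.perf_counter() - start


SPECS = [("classify", 0.05), ("extract", 0.05), ("summarize", 0.05)]
seq_results, seq_time = timed(run_sequential, SPECS)
con_results, con_time = timed(run_concurrent, SPECS)

print(f"sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s")
```

Both versions return the same results in the same order. The concurrent one finishes in roughly the time of the slowest call instead of the sum of all calls.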
5. Context window budgeting (~15% savings)
Most agents stuff everything into the context window without thinking about what’s actually needed. Budget your context:
MAX_CONTEXT_BUDGET = 32000  # tokens


def prepare_context(messages, max_context=MAX_CONTEXT_BUDGET):
    """Trim context to fit within budget, prioritizing recent and important messages.

    Assumes count_tokens() and summarize_messages() are defined elsewhere.
    """
    if count_tokens(messages) <= max_context:
        return messages

    # Strategy: keep system prompt, last 2 turns, summarized middle
    system_prompt = [m for m in messages if m["role"] == "system"]
    conversation = [m for m in messages if m["role"] != "system"]
    recent = conversation[-4:]  # Last 2 user + 2 assistant
    middle = conversation[:-4]

    # Replace the middle messages with a summary
    if middle:
        summary = summarize_messages(middle)
        budget_messages = system_prompt + [
            {"role": "system", "content": f"Previous context: {summary}"}
        ] + recent
        if count_tokens(budget_messages) <= max_context:
            return budget_messages

    # If still over budget, keep only the system prompt and the last turn
    return system_prompt + recent[-2:]
Expected savings: 10-15% reduction in input tokens per call. More importantly, a shorter context means lower latency on every response.
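The budgeting function relies on two helpers that are not shown, `count_tokens` and `summarize_messages`. The versions below are placeholders, not the author's implementations. The token count uses the rough rule of thumb of about four characters per token for English text, and the summary simply truncates each message. In practice you would swap in your provider's tokenizer and a call to a cheap model.

```python
def count_tokens(messages: list[dict]) -> int:
    """Rough estimate: ~4 characters per token plus a small per-message overhead.

    This is an approximation (assumption), not a real tokenizer.
    """
    return sum(len(m["content"]) // 4 + 4 for m in messages)


def summarize_messages(messages: list[dict], max_chars: int = 400) -> str:
    """Placeholder summary: the first 80 characters of each message, joined and capped.

    A production version would ask a cheap model for a real summary.
    """
    parts = [f'{m["role"]}: {m["content"][:80]}' for m in messages]
    return " | ".join(parts)[:max_chars]


history = [
    {"role": "user", "content": "a" * 400},
    {"role": "assistant", "content": "b" * 800},
]
print(count_tokens(history))
print(len(summarize_messages(history)))
```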
6. Tool result caching (~10% savings)
Tool calls are often repeated. The same file is read multiple times. The same API is called with the same parameters:
import hashlib
import json
import time


class ToolResultCache:
    def __init__(self, ttl_seconds=300):  # 5 minute TTL
        self.cache = {}
        self.ttl = ttl_seconds

    def get_or_execute(self, tool_name: str, args: dict, tool_fn):
        cache_key = self._make_key(tool_name, args)
        # Serve from cache only while the entry is younger than the TTL
        if cache_key in self.cache:
            entry = self.cache[cache_key]
            if time.time() - entry["timestamp"] < self.ttl:
                return entry["result"]
        result = tool_fn(**args)
        self.cache[cache_key] = {
            "result": result,
            "timestamp": time.time(),
            "tool": tool_name,
            "args": args
        }
        return result

    def _make_key(self, tool_name, args):
        # sort_keys makes the key independent of argument order; args must be JSON-serializable
        serialized = json.dumps({"tool": tool_name, "args": args}, sort_keys=True)
        return hashlib.md5(serialized.encode()).hexdigest()
Expected savings: 5-10%. This is situational — high for file-reading agents, low for agents that make unique API calls.
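A TTL alone can return stale data if the agent writes a file and then reads it again within the window. One way to handle this, sketched below, is to drop cached results for a path whenever that path is written. The class name `InvalidatingToolCache`, the `invalidate_path` method, and the convention that file tools take a `path` argument are all assumptions for illustration.

```python
import hashlib
import json


class InvalidatingToolCache:
    """Tool result cache that can forget everything cached for a given path."""

    def __init__(self):
        self._results = {}  # cache key -> result
        self._keys_by_path = {}  # path -> set of cache keys that read it

    def _key(self, tool_name: str, args: dict) -> str:
        serialized = json.dumps({"tool": tool_name, "args": args}, sort_keys=True)
        return hashlib.sha256(serialized.encode()).hexdigest()

    def get_or_execute(self, tool_name: str, args: dict, tool_fn):
        key = self._key(tool_name, args)
        if key not in self._results:
            self._results[key] = tool_fn(**args)
            if "path" in args:  # assumed convention for file tools
                self._keys_by_path.setdefault(args["path"], set()).add(key)
        return self._results[key]

    def invalidate_path(self, path: str) -> None:
        for key in self._keys_by_path.pop(path, set()):
            self._results.pop(key, None)


# In-memory stand-ins for real file tools
files = {"notes.txt": "v1"}
reads = []


def read_file(path: str) -> str:
    reads.append(path)
    return files[path]


cache = InvalidatingToolCache()
first = cache.get_or_execute("read_file", {"path": "notes.txt"}, read_file)
second = cache.get_or_execute("read_file", {"path": "notes.txt"}, read_file)  # cached
files["notes.txt"] = "v2"  # simulate write_file
cache.invalidate_path("notes.txt")
third = cache.get_or_execute("read_file", {"path": "notes.txt"}, read_file)
print(first, second, third, len(reads))
```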
7. Rate limiting non-critical agents (~15% savings)
Background agents that run on a schedule don’t need instant responses. Rate-limit them to off-peak hours:
import time

import schedule


class RateLimitedAgent:
    def __init__(self, agent_fn, max_calls_per_hour=60):
        self.agent_fn = agent_fn
        self.max_calls = max_calls_per_hour
        self.call_times = []

    def throttle_if_needed(self) -> None:
        """Sleep until a slot is free in the rolling one-hour window."""
        # Clean old entries
        now = time.time()
        self.call_times = [t for t in self.call_times if now - t < 3600]
        if len(self.call_times) >= self.max_calls:
            sleep_time = 3600 - (now - self.call_times[0])
            if sleep_time > 0:
                print(f"Throttling: sleeping {sleep_time:.0f}s")
                time.sleep(sleep_time)
        self.call_times.append(time.time())

    def run(self, *args, **kwargs):
        self.throttle_if_needed()
        return self.agent_fn(*args, **kwargs)


# Schedule non-critical agents for off-peak hours (10 PM - 6 AM IST)
def run_nightly_reports():
    # generate_report is your own agent function, defined elsewhere
    agent = RateLimitedAgent(generate_report, max_calls_per_hour=30)
    # ... process reports


schedule.every().day.at("22:00").do(run_nightly_reports)

# The schedule library only fires jobs when run_pending() is called
while True:
    schedule.run_pending()
    time.sleep(60)
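Expected savings: 10-15% in my setup, from capping how many calls background agents can make per hour and moving them to off-peak windows. Whether off-peak scheduling itself lowers the price depends on your provider, so check its pricing before counting on it.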
8. Monitor cost per request (~5% savings from awareness)
You can’t optimize what you don’t measure. Cost tracking is the foundation:
class CostMonitor:
    # Per 1K tokens, in INR. Keys are short model names, not full model IDs.
    RATES = {
        "claude-sonnet": {"input": 0.25, "output": 1.25},
        "claude-haiku": {"input": 0.03, "output": 0.15},
        "claude-opus": {"input": 1.50, "output": 7.50},
        "gpt-4o": {"input": 0.20, "output": 0.80},
        "gpt-4o-mini": {"input": 0.01, "output": 0.04},
    }

    def __init__(self, daily_budget=500):  # ₹500/day default
        self.daily_budget = daily_budget
        self.daily_cost = 0.0  # reset this at the start of each day
        self.alerts = []

    def track(self, run_id, model, input_tokens, output_tokens):
        # Unknown models fall back to the Sonnet rate
        rate = self.RATES.get(model, self.RATES["claude-sonnet"])
        cost = (input_tokens / 1000 * rate["input"]) + (output_tokens / 1000 * rate["output"])
        self.daily_cost += cost
        # Warn once spending passes 80% of the daily budget
        if self.daily_cost > self.daily_budget * 0.8:
            self.alerts.append({
                "type": "budget_warning",
                "cost": self.daily_cost,
                "budget": self.daily_budget,
                "run_id": run_id
            })
        return cost

    def get_report(self):
        return {
            "daily_cost": f"₹{self.daily_cost:.2f}",
            "budget_remaining": f"₹{max(0, self.daily_budget - self.daily_cost):.2f}",
            "alerts_count": len(self.alerts),
            "alerts": self.alerts[-5:]  # Last 5 alerts
        }
Expected savings: The awareness alone saves 5-10%. When you see which runs cost the most, you naturally find optimizations. I found a buggy agent loop this way that was burning ₹200/day in retries.
9. Open-source models for internal tasks (~20% savings for those tasks)
For internal tools and batch processing, running open-source models locally can be cheaper than API calls:
# Run Llama 3 70B locally
ollama run llama3:70b
# Or use a cloud GPU
# T4 GPU: ~₹30/hour, serves ~500 requests/hour
# Cost: ₹0.06 per request vs ₹0.30 for Sonnet
Where it works: Internal code review, batch document processing, data extraction, classification at scale.
Where it doesn’t: Customer-facing agents, complex tool use, tasks requiring high reliability.
My setup: API models for customer-facing, Llama 3 70B (via Ollama) for internal batch jobs. This cut my API bill by 20%.
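The per-request figure in the comments above follows from simple division, and the same numbers give a break-even volume. This sketch uses the quoted ₹30/hour GPU price, 500 requests/hour throughput, and ₹0.30 per Sonnet request. It assumes the GPU is billed by the hour whether or not it is busy.

```python
GPU_COST_PER_HOUR = 30.0  # INR, from the figures above
GPU_MAX_REQUESTS_PER_HOUR = 500  # quoted throughput
API_COST_PER_REQUEST = 0.30  # INR, Sonnet figure from above


def gpu_cost_per_request(requests_per_hour: float) -> float:
    """Effective cost per request when the GPU is billed hourly."""
    return GPU_COST_PER_HOUR / requests_per_hour


at_full_load = gpu_cost_per_request(GPU_MAX_REQUESTS_PER_HOUR)
break_even_requests = GPU_COST_PER_HOUR / API_COST_PER_REQUEST

print(f"At full load: ₹{at_full_load:.2f} per request")
print(f"Break-even: {break_even_requests:.0f} requests/hour")
print(f"At 50 requests/hour: ₹{gpu_cost_per_request(50):.2f} per request")
```

Below the break-even volume the hourly GPU is the more expensive option, which is why this works best for steady batch jobs rather than occasional requests.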
10. Prune unused tools (~5% savings and reliability)
Every tool you give an agent increases the prompt size (tools are serialized into the prompt) and adds decision complexity:
# Before: 15 tools, each with detailed descriptions
ALL_TOOLS = [
read_file, write_file, list_directory, search_files,
run_command, install_package, run_tests, build_project,
search_web, fetch_url, scrape_page, call_api,
query_database, send_email, create_ticket
] # ~2,500 tokens of tool definitions
# After: Only the tools this agent actually uses
CORE_TOOLS = [
read_file, write_file, run_command, search_files
] # ~600 tokens of tool definitions
Expected savings: 5-10% reduction in input tokens per call. More importantly: fewer tools means fewer wrong tool choices. The agent doesn’t accidentally call send_email when it meant write_file.
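To decide which tools to keep, it helps to count how often each one is actually called. The sketch below assumes you already log tool names per call. The `call_log` contents are made up for illustration and `prune_tools` is a hypothetical helper.

```python
from collections import Counter

ALL_TOOL_NAMES = [
    "read_file", "write_file", "list_directory", "search_files",
    "run_command", "install_package", "run_tests", "build_project",
    "search_web", "fetch_url", "scrape_page", "call_api",
    "query_database", "send_email", "create_ticket",
]


def prune_tools(available: list[str], call_log: list[str], min_calls: int = 1) -> list[str]:
    """Keep only tools called at least min_calls times, in their original order."""
    usage = Counter(call_log)
    return [tool for tool in available if usage[tool] >= min_calls]


# Illustrative log of tool names, one entry per tool call
call_log = ["read_file", "write_file", "read_file", "run_command", "search_files", "read_file"]

core_tools = prune_tools(ALL_TOOL_NAMES, call_log)
unused = [tool for tool in ALL_TOOL_NAMES if tool not in core_tools]
print(core_tools)
print(f"{len(unused)} tools never called")
```

Use a log that covers a representative period before pruning, since a tool that is rarely needed may still be essential when it is.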
Related: AI agent error handling patterns — retry strategies, cost spikes, and graceful degradation for production agents.
If you can only implement two strategies today, make it semantic caching and model tiering. Those two alone will save 40-50% with zero quality impact. Add the rest as your agent scales and you identify specific cost drivers. Most importantly, track your costs before and after — the numbers will tell you which optimization to tackle next.