---
title: "AI agent cost optimization: 10 tips to reduce your LLM bill"
canonical: "https://agenticup.dev/posts/ai-agent-cost-optimization-tips/"
pubDate: "2026-06-01T00:00:00.000Z"
description: "My first production agent cost ₹12,000/month in API calls. After applying these 10 strategies, the same agent runs on ₹4,500/month. Here's exactly how — with code, expected savings, and tradeoffs."
tags: [cost optimization, llm costs, ai agents, production, savings]
---

OpenAI's [prompt caching guide](https://platform.openai.com/docs/guides/prompt-caching) describes how caching frequent input prefixes can significantly reduce both cost and latency. That provider-side caching complements the application-level caching strategies in this post.

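Anthropic's [prompt caching documentation](https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching) shows how repeated context prefixes can be cached so that later reads are billed at a steep discount to the normal input-token rate (with a small surcharge on cache writes). For agents that resend the same long prefix on every call, it is one of the most impactful cost-saving techniques.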

**TL;DR:** My first production agent cost ₹12,000/month in API calls. After applying these 10 strategies — semantic caching, prompt compression, model tiering, batching, context budgeting, tool result caching, rate limiting, cost monitoring, open-source models, and pruning unused tools — the same agent runs on ₹4,500/month (62% reduction) with zero quality loss.

My first production agent cost ₹12,000/month in API calls. I almost killed the project right there.

The agent was doing legitimate work — processing support tickets, generating reports, automating workflows. But the costs were eating the margins. At ₹12,000/month, the agent was more expensive than the intern it was supposed to replace.

I spent the next month optimizing. Here's what I learned.

After applying these 10 strategies, the same agent runs on ₹4,500/month — a 62% reduction. The work output is the same. The quality is the same. The only difference is we stopped wasting tokens.

> **Key takeaways:**
> - Semantic caching alone saves 25-35% — cache LLM responses for similar queries
> - Model tiering saves 15-25% — cheap model for simple tasks, expensive model for complex
> - Prompt compression, batching, and context budget each save 10-20%
> - Composite savings of 60-70% are achievable without quality loss

## 1. Semantic caching (~30% savings)

The biggest waste in most agent deployments: answering the same question repeatedly. A semantic cache stores previous LLM responses and returns them for similar queries:

```python
import time

import numpy as np
from openai import OpenAI

class SemanticCache:
    def __init__(self, similarity_threshold=0.92):
        self.cache = []  # [{embedding, prompt, response, cost, timestamp}]
        self.threshold = similarity_threshold
        self.embedding_client = OpenAI()
        self.hits = 0
        self.misses = 0
        self.saved = 0.0  # running total of API cost avoided by cache hits

    def get(self, prompt: str) -> dict | None:
        prompt_embedding = self._embed(prompt)

        # Linear scan: returns the first entry above the threshold
        for entry in self.cache:
            similarity = self._cosine_similarity(prompt_embedding, entry["embedding"])
            if similarity >= self.threshold:
                self.hits += 1
                self.saved += entry["cost"]  # a hit avoids paying for this response again
                return {
                    "response": entry["response"],
                    "cached": True,
                    "similarity": similarity,
                    "savings": entry["cost"]
                }

        self.misses += 1
        return None

    def set(self, prompt: str, response: str, cost: float):
        self.cache.append({
            "embedding": self._embed(prompt),
            "prompt": prompt,
            "response": response,
            "cost": cost,
            "timestamp": time.time()
        })

    def stats(self):
        total = self.hits + self.misses
        return {
            "hit_rate": self.hits / total if total > 0 else 0,
            # Sum of the original cost of every response served from cache
            "total_savings": self.saved
        }

    def _embed(self, text: str) -> list[float]:
        response = self.embedding_client.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        return response.data[0].embedding

    def _cosine_similarity(self, a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
```

**Expected savings:** 25-35% for support agents, 15-20% for coding agents. Support queries have high repetition (same billing question from different users). Coding agents produce more unique outputs, so cache hit rates are lower.

**Gotcha:** Cache invalidation. If your knowledge base changes, cached responses become stale. Solution: add a TTL (time-to-live) — typically 24 hours for support, 1 hour for time-sensitive queries.

## 2. Prompt compression (~15% savings)

Long system prompts eat your input token budget. Every token in the system prompt is billed again on every call:

```python
# Before: 750 tokens in system prompt
SYSTEM_PROMPT_VERBOSE = """You are a helpful AI agent that assists users with their tasks.
Your job is to understand what the user wants and help them accomplish it.
You have access to the following tools: read_file, write_file, run_command, search_web.
When you use a tool, make sure to provide the correct arguments.
If you're not sure about something, ask the user for clarification.
...
"""  # ~500 more words

# After: 180 tokens in system prompt
SYSTEM_PROMPT_COMPRESSED = """You are a coding agent with tools: read_file, write_file, run_command, search_web.
Rules:
- Provide correct tool arguments
- Ask for clarification when unsure
- One task at a time
- Report results concisely"""
```

**Expected savings:** 10-20% from shorter system prompts. More importantly, shorter prompts reduce latency — fewer tokens means faster first-token generation.
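
To see what the 750 to 180 token cut is worth in money, the arithmetic can be sketched as below. The function name and the 10,000 calls/month volume are assumptions for illustration; the 0.25 rate is the Sonnet input price (INR per 1K tokens) used in the cost monitor later in this post.

```python
def monthly_prompt_savings(tokens_before: int, tokens_after: int,
                           calls_per_month: int, rate_per_1k_input: float) -> float:
    """Money saved per month by shrinking a system prompt that is sent on every call."""
    tokens_saved_per_call = tokens_before - tokens_after
    return tokens_saved_per_call / 1000 * rate_per_1k_input * calls_per_month

# 750 -> 180 tokens as in the example above.
# 10,000 calls/month is a hypothetical volume; 0.25 INR per 1K input tokens is assumed.
savings = monthly_prompt_savings(750, 180, 10_000, 0.25)
print(f"INR {savings:,.0f} saved per month")
```

Under those assumptions the saving is about ₹1,425 a month: 570 tokens saved per call, times 10,000 calls, is 5.7 million input tokens that are no longer billed.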

## 3. Model tiering (~20% savings)

The simplest way to shrink your biggest line item: don't use your most expensive model for every task.

```python
MODEL_TIERS = {
    "cheap": {
        "model": "claude-3-haiku-20240307",
        "cost_per_call": 0.002,    # $0.002 per call
        "use_for": ["classify", "extract", "simple_qa", "summarize"]
    },
    "standard": {
        "model": "claude-sonnet-4-20250514",
        "cost_per_call": 0.015,    # $0.015 per call
        "use_for": ["generate", "analyze", "code_review"]
    },
    "expensive": {
        "model": "claude-opus-4-20250514",
        "cost_per_call": 0.075,    # $0.075 per call
        "use_for": ["complex_reasoning", "debug", "plan"]
    }
}

def select_model(task_type: str, complexity: str = "low") -> dict:
    """Return the tier config (model ID, cost, task list) for a task."""
    if complexity == "high" or task_type in MODEL_TIERS["expensive"]["use_for"]:
        return MODEL_TIERS["expensive"]
    elif task_type in MODEL_TIERS["standard"]["use_for"]:
        return MODEL_TIERS["standard"]
    else:
        # Anything unrecognized defaults to the cheapest tier
        return MODEL_TIERS["cheap"]
```

**Expected savings:** 15-25% overall. In my experience, about 60% of agent tasks are simple enough for Haiku-level models. Only 10% need Opus-level reasoning. The remaining 30% work fine on Sonnet.

**Real example from my stack:**
- Intent classification → Haiku (₹0.03/call)
- Code generation → Sonnet (₹0.30/call)
- Debug analysis → Opus (₹1.50/call)
- Average cost per run: ₹0.50 instead of ₹1.50 if everything used Opus
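
The effect of the 60/30/10 split can be checked with a quick blended-cost calculation. This is a rough sketch using the per-call prices listed above; `blended_cost` is a hypothetical helper, and the result is per call, whereas a full run may chain several calls.

```python
# Per-call prices (INR) from the list above and the 60/30/10 task split quoted earlier.
PRICE_PER_CALL = {"haiku": 0.03, "sonnet": 0.30, "opus": 1.50}
TASK_SHARE = {"haiku": 0.60, "sonnet": 0.30, "opus": 0.10}

def blended_cost(prices: dict[str, float], shares: dict[str, float]) -> float:
    """Expected cost of one call when tasks are routed to tiers by share."""
    assert abs(sum(shares.values()) - 1.0) < 1e-9, "shares must sum to 1"
    return sum(prices[tier] * shares[tier] for tier in shares)

per_call = blended_cost(PRICE_PER_CALL, TASK_SHARE)
print(f"blended: INR {per_call:.3f} per call, all-Opus: INR {PRICE_PER_CALL['opus']:.2f}")
```

With these inputs the blended price works out to roughly ₹0.26 per call, and more than half of that still comes from the 10% of calls routed to Opus.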

## 4. Batch independent LLM calls (~20% savings)

Many agent workflows make multiple independent LLM calls. If they don't depend on each other, batch them:

```python
import asyncio
from anthropic import AsyncAnthropic

async_client = AsyncAnthropic()

async def batch_llm_calls(calls: list[dict]) -> list[str]:
    """Execute independent LLM calls in parallel."""
    async def single_call(call):
        response = await async_client.messages.create(**call)
        return response.content[0].text

    results = await asyncio.gather(*[single_call(c) for c in calls])
    return results

# Before: 3 sequential calls, so total latency is the sum of all three
# classify = llm.call(...)
# extract = llm.call(...)
# summarize = llm.call(...)

# After: 3 parallel calls, so total latency is roughly that of the slowest call
calls = [
    {"model": "claude-3-haiku-20240307", "max_tokens": 100, "messages": [...]},  # classify
    {"model": "claude-3-haiku-20240307", "max_tokens": 200, "messages": [...]},  # extract
    {"model": "claude-3-haiku-20240307", "max_tokens": 150, "messages": [...]},  # summarize
]
# asyncio.run() drives the coroutine from synchronous code;
# inside an async function, use `await batch_llm_calls(calls)` instead
results = asyncio.run(batch_llm_calls(calls))
```

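**Expected savings:** 15-20% reduction in cost and latency. The latency gain comes from running the calls concurrently. Concurrency alone does not change the tokens each call is billed for, so the cost gain depends on sharing work between calls (for example, a common prompt prefix) or on moving work that can wait to a provider's asynchronous batch API; OpenAI and Anthropic both offer one at a discount to their standard rates.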

**When NOT to batch:** If one call depends on another's output (e.g., you need to classify before you can retrieve), batching doesn't apply. Only batch truly independent calls.
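
One way to decide what can be batched is to write down which calls depend on which, then group them into waves where every call in a wave depends only on earlier waves. The sketch below is illustrative: `plan_batches` and the example workflow are hypothetical, not part of any SDK.

```python
def plan_batches(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group calls into waves; each wave depends only on calls in earlier waves."""
    remaining = {name: set(d) for name, d in deps.items()}
    done: set[str] = set()
    waves: list[list[str]] = []
    while remaining:
        # A call is ready once all of its dependencies have completed
        ready = sorted(name for name, d in remaining.items() if d <= done)
        if not ready:
            raise ValueError("circular or unknown dependency")
        waves.append(ready)
        done.update(ready)
        for name in ready:
            del remaining[name]
    return waves

# Hypothetical workflow: retrieval needs the classification result first.
deps = {
    "classify": set(),
    "extract": set(),
    "retrieve": {"classify"},
    "summarize": {"retrieve", "extract"},
}
waves = plan_batches(deps)
print(waves)
```

Each inner list can be passed to a parallel runner such as `batch_llm_calls`, while the waves themselves run one after another.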

## 5. Context window budgeting (~15% savings)

Most agents stuff everything into the context window without thinking about what's actually needed. Budget your context:

```python
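MAX_CONTEXT_BUDGET = 32000  # tokens

# Note: count_tokens() and summarize_messages() are not defined in this snippet;
# supply your own token counter and summarizer.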

def prepare_context(messages, max_context=MAX_CONTEXT_BUDGET):
    """Trim context to fit within budget, prioritizing recent and important messages."""
    # Count tokens in current messages
    total_tokens = count_tokens(messages)

    if total_tokens <= max_context:
        return messages

    # Strategy: keep system prompt, last 2 turns, truncated middle
    system_prompt = [m for m in messages if m["role"] == "system"]
    recent = messages[-4:]  # Last 2 user + 2 assistant
    middle = messages[len(system_prompt):-4]

    # Summarize middle messages
    if middle:
        summary = summarize_messages(middle)
        # Keep only the summary and recent messages
        budget_messages = system_prompt + [
            {"role": "system", "content": f"Previous context: {summary}"}
        ] + recent

        if count_tokens(budget_messages) <= max_context:
            return budget_messages

    # If still over budget, keep only recent messages
    return system_prompt + recent[-2:]  # Just last turn
```

**Expected savings:** 10-15% reduction in input tokens per call. Shorter contexts also produce faster responses.
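
`prepare_context` relies on a `count_tokens` helper that the snippet does not define. A rough stand-in, assuming about four characters per token for English text plus a small per-message allowance, might look like this. Both constants are assumptions; for exact counts, use your provider's tokenizer or token-counting endpoint.

```python
import math

CHARS_PER_TOKEN = 4        # rough heuristic for English text, not an exact tokenizer
PER_MESSAGE_OVERHEAD = 4   # assumed allowance for role and formatting tokens

def count_tokens(messages: list[dict]) -> int:
    """Estimate the token count of a chat history from its character length."""
    total = 0
    for message in messages:
        total += math.ceil(len(message["content"]) / CHARS_PER_TOKEN)
        total += PER_MESSAGE_OVERHEAD
    return total
```

Because the estimate is approximate, it is safer to set the budget somewhat below the model's real context limit.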

## 6. Tool result caching (~10% savings)

Tool calls are often repeated. The same file is read multiple times. The same API is called with the same parameters:

```python
import hashlib
import json
import time

class ToolResultCache:
    def __init__(self, ttl_seconds=300):  # 5 minute TTL
        self.cache = {}
        self.ttl = ttl_seconds

    def get_or_execute(self, tool_name: str, args: dict, tool_fn):
        cache_key = self._make_key(tool_name, args)

        if cache_key in self.cache:
            entry = self.cache[cache_key]
            if time.time() - entry["timestamp"] < self.ttl:
                return entry["result"]

        result = tool_fn(**args)
        self.cache[cache_key] = {
            "result": result,
            "timestamp": time.time(),
            "tool": tool_name,
            "args": args
        }
        return result

    def _make_key(self, tool_name, args):
        serialized = json.dumps({"tool": tool_name, "args": args}, sort_keys=True)
        return hashlib.md5(serialized.encode()).hexdigest()
```

**Expected savings:** 5-10%. This is situational — high for file-reading agents, low for agents that make unique API calls.

## 7. Rate limiting non-critical agents (~15% savings)

Background agents that run on a schedule don't need instant responses. Rate-limit them to off-peak hours:

```python
import schedule
import time

class RateLimitedAgent:
    def __init__(self, agent_fn, max_calls_per_hour=60):
        self.agent_fn = agent_fn
        self.max_calls = max_calls_per_hour
        self.call_times = []

    def wait_if_needed(self) -> None:
        """Block until a call is allowed under the hourly limit, then record it."""
        # Drop calls that are older than one hour
        now = time.time()
        self.call_times = [t for t in self.call_times if now - t < 3600]

        if len(self.call_times) >= self.max_calls:
            # Sleep until the oldest call in the window is an hour old
            sleep_time = 3600 - (now - self.call_times[0])
            if sleep_time > 0:
                print(f"Throttling: sleeping {sleep_time:.0f}s")
                time.sleep(sleep_time)

        self.call_times.append(time.time())

    def run(self, *args, **kwargs):
        self.wait_if_needed()
        return self.agent_fn(*args, **kwargs)


# Schedule non-critical agents for off-peak hours (10 PM - 6 AM IST)
def run_nightly_reports():
    agent = RateLimitedAgent(generate_report, max_calls_per_hour=30)
    # ... process reports

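# at("22:00") is interpreted in the server's local time zone
schedule.every().day.at("22:00").do(run_nightly_reports)

# The schedule library only fires jobs while run_pending() is being polled
while True:
    schedule.run_pending()
    time.sleep(60)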
```

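**Expected savings:** 10-15% from capping how many calls background agents can make per hour and from moving deferrable work to off-peak windows. Time-of-day discounts exist only on some providers, so check your provider's pricing before counting on them.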

## 8. Monitor cost per request (~5% savings from awareness)

You can't optimize what you don't measure. Cost tracking is the foundation:

```python
class CostMonitor:
    def __init__(self, daily_budget=500):  # ₹500/day default
        self.daily_budget = daily_budget
        self.daily_cost = 0.0
        self.alerts = []

    def track(self, run_id, model, input_tokens, output_tokens):
        RATES = {
            "claude-sonnet": {"input": 0.25, "output": 1.25},    # per 1K tokens, in INR
            "claude-haiku":  {"input": 0.03, "output": 0.15},
            "claude-opus":   {"input": 1.50, "output": 7.50},
            "gpt-4o":        {"input": 0.20, "output": 0.80},
            "gpt-4o-mini":   {"input": 0.01, "output": 0.04},
        }

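        # Unknown model names fall back to the Sonnet rate, so pass the short keys
        # above (not full model IDs) or extend RATES to match what you log
        rate = RATES.get(model, RATES["claude-sonnet"])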
        cost = (input_tokens / 1000 * rate["input"]) + (output_tokens / 1000 * rate["output"])
        self.daily_cost += cost

        if self.daily_cost > self.daily_budget * 0.8:
            self.alerts.append({
                "type": "budget_warning",
                "cost": self.daily_cost,
                "budget": self.daily_budget,
                "run_id": run_id
            })

        return cost

    def get_report(self):
        return {
            "daily_cost": f"₹{self.daily_cost:.2f}",
            "budget_remaining": f"₹{max(0, self.daily_budget - self.daily_cost):.2f}",
            "alerts_count": len(self.alerts),
            "alerts": self.alerts[-5:]  # Last 5 alerts
        }
```

**Expected savings:** The awareness alone saves 5-10%. When you see which runs cost the most, you naturally find optimizations. I found a buggy agent loop this way that was burning ₹200/day in retries.
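
Finding the runs that cost the most is a matter of summing cost per `run_id`. A minimal sketch, assuming you keep the value returned by `CostMonitor.track()` alongside each `run_id` (the `top_runs` helper and the sample records are hypothetical):

```python
from collections import defaultdict

def top_runs(records: list[tuple[str, float]], n: int = 3) -> list[tuple[str, float]]:
    """Sum cost per run_id and return the n most expensive runs, highest first."""
    totals: dict[str, float] = defaultdict(float)
    for run_id, cost in records:
        totals[run_id] += cost
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]

# Hypothetical (run_id, cost) pairs collected from tracked calls.
records = [("run-a", 1.0), ("run-b", 0.5), ("run-a", 2.0), ("run-c", 4.0), ("run-b", 0.25)]
print(top_runs(records, n=2))
```

A run that keeps appearing at the top of this list with many small calls is a good candidate for the kind of retry loop described above.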

## 9. Open-source models for internal tasks (~20% savings for those tasks)

For internal tools and batch processing, running open-source models locally can be cheaper than API calls:

```bash
# Run Llama 3 70B locally
ollama run llama3:70b

# Or use a cloud GPU
# T4 GPU: ~₹30/hour, serves ~500 requests/hour
# (a T4 has 16 GB of memory, so this assumes a model much smaller than 70B)
# Cost: ₹0.06 per request vs ₹0.30 for Sonnet
```

**Where it works:** Internal code review, batch document processing, data extraction, classification at scale.

**Where it doesn't:** Customer-facing agents, complex tool use, tasks requiring high reliability.

**My setup:** API models for customer-facing, Llama 3 70B (via Ollama) for internal batch jobs. This cut my API bill by 20%.
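
Whether self-hosting pays off depends on volume, because an hourly GPU is billed whether it is busy or idle. The break-even point can be sketched with the figures quoted above; both function names are hypothetical, and the sketch ignores setup and maintenance time.

```python
def self_hosted_cost_per_request(gpu_cost_per_hour: float, requests_per_hour: float) -> float:
    """Effective cost of one request on a GPU billed by the hour."""
    return gpu_cost_per_hour / requests_per_hour

def break_even_requests_per_hour(gpu_cost_per_hour: float, api_cost_per_request: float) -> float:
    """Requests per hour above which the hourly GPU is cheaper than the API."""
    return gpu_cost_per_hour / api_cost_per_request

# Figures from the comments above: INR 30/hour GPU, ~500 requests/hour, INR 0.30 per Sonnet call.
print(self_hosted_cost_per_request(30, 500))
print(break_even_requests_per_hour(30, 0.30))
```

At those prices the GPU only wins above roughly 100 requests per hour, which is why it suits batch jobs better than sporadic traffic.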

## 10. Prune unused tools (~5% savings and reliability)

Every tool you give an agent increases the prompt size (tools are serialized into the prompt) and adds decision complexity:

```python
# Before: 15 tools, each with detailed descriptions
ALL_TOOLS = [
    read_file, write_file, list_directory, search_files,
    run_command, install_package, run_tests, build_project,
    search_web, fetch_url, scrape_page, call_api,
    query_database, send_email, create_ticket
]  # ~2,500 tokens of tool definitions

# After: Only the tools this agent actually uses
CORE_TOOLS = [
    read_file, write_file, run_command, search_files
]  # ~600 tokens of tool definitions
```

**Expected savings:** 5-10% reduction in input tokens per call. More importantly: fewer tools means fewer wrong tool choices. The agent doesn't accidentally call `send_email` when it meant `write_file`.
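
To decide which tools to prune, compare the registered tool names against a log of the tools the agent actually called. This is a small sketch with made-up data; `unused_tools` is a hypothetical helper, and the call log would come from whatever tracing you already have.

```python
from collections import Counter

def unused_tools(registered: list[str], call_log: list[str]) -> list[str]:
    """Return registered tool names that never appear in the call log."""
    used = Counter(call_log)
    return [name for name in registered if used[name] == 0]

# Hypothetical data: a few tool names from the list above and a made-up call log.
registered = ["read_file", "write_file", "run_command", "search_files",
              "send_email", "create_ticket"]
call_log = ["read_file", "read_file", "run_command", "write_file",
            "search_files", "read_file"]
print(unused_tools(registered, call_log))
```

Run it over a few weeks of logs before removing anything, since rarely used tools may still matter for edge cases.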

---

*Related: [AI agent error handling patterns](/posts/ai-agent-error-handling-patterns/) — retry strategies, cost spikes, and graceful degradation for production agents.*

<div class="callout">
  <div class="callout-title">Start with caching and tiering</div>
  <p>If you can only implement two strategies today, make it semantic caching and model tiering. Those two alone will save 40-50% with zero quality impact. Add the rest as your agent scales and you identify specific cost drivers. Most importantly, track your costs before and after — the numbers will tell you which optimization to tackle next.</p>
</div>