How do I handle API rate limits in production?

Implement retry logic with exponential backoff for transient errors. Use the retry decorator pattern shown in this guide: 3 retries with a base delay of 1 second and a backoff factor of 2.

AI agent deployment guide: from localhost to production

How to host and deploy an AI agent, from local development to a production server. Covers containerization, deployment, monitoring, cost control, and reliability patterns.

TL;DR: My agent worked perfectly on my laptop. It crashed 12 minutes after deployment. The gap between local and production is not the code, it’s the infrastructure. Here is the 7-step deployment guide I wish existed when I started.

Cloudflare’s Workers documentation shows how serverless functions serve as API gateways for AI agents. The Docker documentation provides the standard for containerization, which is the first step in the deployment pipeline described in this guide.

An agent that works on your laptop is a demo. An agent that works in production without constant attention is a product.

The gap between these two states is where most agent projects die. I’ve deployed about a dozen agents to production. Some are still running. Some died in staging. Here’s the deployment playbook I’ve developed from the survivors.

Key takeaways:

Production AI agents need cost controls, monitoring, error recovery, and alerting, not just working agent code

Docker containerization forces dependency discipline and eliminates “works on my machine” failures

Structured logging lets you query past runs by cost, status, or failure pattern

Start on Railway for simplicity, graduate to Fly.io or a VPS as your agent outgrows the platform

This follows what I call the Vertical Agent Method: build narrow, purpose-built agents that replace one specific workflow, not general-purpose assistants. The deployment patterns below are designed for exactly this kind of focused, production-grade agent.

Step 1: Containerize the agent

Before anything else, get the agent into a Docker container. This forces you to make dependencies explicit and eliminates the “works on my machine” class of failures.

FROM python:3.12-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
 git \
 curl \
 && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the agent code
COPY src/ ./src/
COPY config/ ./config/

# Set up non-root user
RUN useradd -m -u 1000 agent && chown -R agent:agent /app
USER agent

CMD ["python", "-m", "src.main"]

Key choices:

python:3.12-slim: minimal base image reduces attack surface and build time
Non-root user: basic security hygiene, doesn’t add much complexity
Dependencies before code: layer caching means faster rebuilds when code changes

Step 2: API key management

This seems obvious, but I’ve seen production agents with API keys hardcoded in config files. Don’t.

import os
from pydantic_settings import BaseSettings

class AgentConfig(BaseSettings):
    model_config = {"env_prefix": "AGENT_"}

    # Required
    anthropic_api_key: str
    openai_api_key: str | None = None

    # Optional with defaults
    max_steps: int = 20
    max_tokens: int = 4096
    model: str = "claude-sonnet-4-20250514"
    cost_warning_threshold: float = 0.50  # $0.50 per run
    cost_hard_limit: float = 5.00  # $5.00 absolute max

    # Logging
    log_level: str = "INFO"
    log_file: str | None = None

config = AgentConfig()

Use environment variables, loaded through Pydantic’s BaseSettings. This gives you validation, defaults, and a single source of truth.

In production, inject secrets through your deployment platform’s secrets manager (Railway, Fly, Cloudflare Workers all have this). Never in your codebase.

Step 3: Cost controls

Agents cost money. Production agents cost money at scale. You need controls that stop a runaway agent from generating a surprising bill.

class CostLimitExceeded(Exception):
    """Raised when a run spends more than its hard limit."""


class CostTracker:
    def __init__(self, hard_limit: float = 5.00):
        self.hard_limit = hard_limit
        self.total_cost = 0.0
        self.step_costs: list[float] = []

    def add_step(self, tokens_in: int, tokens_out: int,
                 model: str = "claude-sonnet-4-20250514"):
        cost = self._calculate_cost(tokens_in, tokens_out, model)
        self.total_cost += cost
        self.step_costs.append(cost)

        # Stop the run as soon as cumulative spend passes the hard limit
        if self.total_cost > self.hard_limit:
            raise CostLimitExceeded(
                f"Cost limit ${self.hard_limit:.2f} exceeded: ${self.total_cost:.2f}"
            )

    @property
    def average_cost_per_step(self) -> float:
        if not self.step_costs:
            return 0.0
        return sum(self.step_costs) / len(self.step_costs)

    def _calculate_cost(self, tokens_in, tokens_out, model):
        # USD per token as (input_rate, output_rate)
        rates = {
            "claude-sonnet-4-20250514": (3e-06, 15e-06),
            "claude-3-5-haiku-20241022": (0.8e-06, 4e-06),
            "gpt-4o-mini": (0.15e-06, 0.6e-06),
        }
        # Unknown models fall back to the Sonnet rates
        input_rate, output_rate = rates.get(model, (3e-06, 15e-06))
        return tokens_in * input_rate + tokens_out * output_rate

Two numbers matter: a warning threshold (alert me if this run exceeds $X) and a hard limit (stop the agent if it hits $Y). Without both, you’ll get a surprise bill.
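As a minimal sketch of how the two thresholds could work together, the class below warns once when a run crosses the warning threshold and raises when it crosses the hard limit. BudgetGuard and its warned flag are hypothetical names introduced for illustration. The defaults mirror the $0.50 and $5.00 values from the config above, and the warning goes to a standard logger, where a real deployment would probably hook in its alerting.

```python
import logging

logger = logging.getLogger("agent.cost")


class CostLimitExceeded(Exception):
    """Raised when a run spends more than its hard limit."""


class BudgetGuard:
    """Tracks cumulative spend against a warning threshold and a hard limit."""

    def __init__(self, warning_threshold: float = 0.50,
                 hard_limit: float = 5.00):
        self.warning_threshold = warning_threshold
        self.hard_limit = hard_limit
        self.total_cost = 0.0
        self.warned = False

    def add_cost(self, cost: float) -> None:
        self.total_cost += cost
        # Hard limit: stop the agent.
        if self.total_cost > self.hard_limit:
            raise CostLimitExceeded(
                f"Cost limit ${self.hard_limit:.2f} exceeded: "
                f"${self.total_cost:.2f}"
            )
        # Warning threshold: alert once, keep running.
        if not self.warned and self.total_cost > self.warning_threshold:
            self.warned = True
            logger.warning(
                "Run cost $%.2f passed warning threshold $%.2f",
                self.total_cost, self.warning_threshold,
            )
```

Warning only once per run keeps a long run from flooding the alert channel while it is still under the hard limit.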

Step 4: Monitoring and logging

An agent that doesn’t log is a black box. When it fails, and it will, you need to know what happened.

import json
import logging
import os
from datetime import datetime, timezone

class AgentLogger:
    def __init__(self, name: str, log_dir: str = "runs"):
        self.name = name
        self.log_dir = log_dir
        # Timezone-aware UTC; datetime.utcnow() is deprecated as of Python 3.12
        self.start_time = datetime.now(timezone.utc)
        self.steps: list[dict] = []

    def log_step(self, step_num: int, action: str,
                 tool: str | None, result: str,
                 tokens_in: int, tokens_out: int, cost: float):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "step": step_num,
            "action": action,
            "tool": tool,
            "result_length": len(result),  # store the size, not the full result
            "tokens_in": tokens_in,
            "tokens_out": tokens_out,
            "cost": round(cost, 6),
        }
        self.steps.append(entry)

    def save(self):
        run_log = {
            "agent": self.name,
            "start": self.start_time.isoformat(),
            "end": datetime.now(timezone.utc).isoformat(),
            "total_steps": len(self.steps),
            "total_cost": round(sum(s["cost"] for s in self.steps), 4),
            "steps": self.steps,
        }
        # One JSON file per run: <log_dir>/<agent name>/<start timestamp>.json
        filename = f"{self.start_time.strftime('%Y%m%d_%H%M%S')}.json"
        path = f"{self.log_dir}/{self.name}/{filename}"
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as f:
            json.dump(run_log, f, indent=2)
        return path

Every agent run should produce a structured log that you can query: “show me all runs that cost more than $1” or “how many runs failed on step 3.”
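The first of those queries can be sketched in a few lines, assuming the runs/<agent>/<timestamp>.json layout and the total_cost field written by AgentLogger.save(). The helper names load_runs and runs_costing_more_than are illustrative. The second query would need a status or error field on each step, which the logger above does not record, so it is left out here.

```python
import json
from pathlib import Path


def load_runs(log_dir: str = "runs") -> list[dict]:
    """Load every run log found under <log_dir>/<agent>/<file>.json."""
    runs = []
    for path in sorted(Path(log_dir).glob("*/*.json")):
        with open(path) as f:
            run = json.load(f)
        run["path"] = str(path)  # remember where the log came from
        runs.append(run)
    return runs


def runs_costing_more_than(runs: list[dict], threshold: float) -> list[dict]:
    """Return the runs whose total cost is above the threshold (in USD)."""
    return [run for run in runs if run["total_cost"] > threshold]
```

For example, runs_costing_more_than(load_runs(), 1.00) answers “show me all runs that cost more than $1.”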

For production monitoring, I send key metrics to a simple dashboard:

Cost per run (average and P95)
Steps per run (is it converging or looping?)
Error rate (what percentage of runs fail?)
Duration per run (is it getting slower?)

You can use Prometheus + Grafana, or a simpler solution like Datadog or even a spreadsheet if you’re solo. The important thing is to look at the metrics regularly.
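If you are starting with the spreadsheet end of that range, the four metrics can be computed directly from the run logs. This is a rough sketch under a few assumptions: each run dict has the total_cost, total_steps, start, and end fields from the logger above, plus a hypothetical status field ("failed" on failure) that you would have to add yourself. P95 uses the simple nearest-rank definition.

```python
from datetime import datetime


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # ceiling of pct/100 * n
    return ordered[max(rank, 1) - 1]


def summarize(runs: list[dict]) -> dict:
    """Compute the four dashboard metrics for a non-empty list of runs."""
    costs = [run["total_cost"] for run in runs]
    steps = [run["total_steps"] for run in runs]
    durations = [
        (datetime.fromisoformat(run["end"])
         - datetime.fromisoformat(run["start"])).total_seconds()
        for run in runs
    ]
    # "status" is an assumed field; the logger above does not write it.
    failed = sum(1 for run in runs if run.get("status") == "failed")
    return {
        "avg_cost": sum(costs) / len(costs),
        "p95_cost": percentile(costs, 95),
        "avg_steps": sum(steps) / len(steps),
        "error_rate": failed / len(runs),
        "avg_duration_s": sum(durations) / len(durations),
    }
```

Tracking P95 alongside the average matters because a handful of looping runs can dominate the bill while barely moving the mean.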

Step 5: Error recovery

Production agents encounter errors constantly. LLM APIs time out. Tools return unexpected data. Network requests fail. Your agent needs to handle all of these gracefully.

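import logging
import time
from functools import wraps

# Assumes the Anthropic SDK, which exports both exception classes;
# swap in the equivalents from whichever LLM client you use.
from anthropic import APITimeoutError, RateLimitError

class ToolExecutionError(Exception):
    """Raised by your own tool wrappers when a tool call fails."""

def retry(max_retries=3, base_delay=1.0, backoff=2.0):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            last_error = None
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except (APITimeoutError, RateLimitError) as e:
                    last_error = e
                    if attempt == max_retries - 1:
                        break  # out of attempts: don't sleep before giving up
                    delay = base_delay * (backoff ** attempt)
                    logging.warning(
                        f"Retry {attempt + 1}/{max_retries} "
                        f"after {delay:.1f}s: {e}"
                    )
                    time.sleep(delay)
                except ToolExecutionError as e:
                    # Tool errors are not retryable: return the error
                    return {"error": str(e), "retryable": False}
            raise last_error
        return wrapper
    return decorator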

The principle: transient errors (timeouts, rate limits) should auto-retry. Permanent errors (invalid inputs, missing data) should return a helpful error message. Don’t retry the latter: it wastes money and time.
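To see the backoff schedule in action, here is a self-contained sketch of the same pattern with two changes made purely for illustration: RateLimitError is a local stand-in for the SDK exception, and the sleep function is injectable so the example runs instantly and records the delays instead of waiting. With the defaults (base delay 1 second, factor 2), the computed delays are 1, 2, and 4 seconds.

```python
import time
from functools import wraps


class RateLimitError(Exception):
    """Local stand-in for the SDK's rate limit exception (illustrative)."""


def backoff_delays(max_retries=3, base_delay=1.0, backoff=2.0):
    """Delay computed for each attempt: base_delay * backoff ** attempt."""
    return [base_delay * backoff ** attempt for attempt in range(max_retries)]


def retry(max_retries=3, base_delay=1.0, backoff=2.0, sleep=time.sleep):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except RateLimitError:
                    if attempt == max_retries - 1:
                        raise  # out of attempts: surface the error
                    sleep(base_delay * backoff ** attempt)
        return wrapper
    return decorator


calls = {"count": 0}
slept = []  # records each delay instead of actually sleeping


@retry(sleep=slept.append)
def flaky_call():
    """Fails with a rate limit twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RateLimitError("429 Too Many Requests")
    return "ok"


result = flaky_call()
```

Here the call succeeds on the third attempt after waiting 1 and then 2 seconds; the 4-second delay would only be used if a fourth attempt were allowed.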

Step 6: Deployment platforms

For solo developers, the deployment platform choice matters. Here’s what I’ve found:

Platform	Cost	Best for	Gotchas
Railway	$5–$20/month	Quick deployment, simple agents	Limited region options
Fly.io	~$12/month	Better global presence	More config work
Cloudflare Workers	$0–$10/month	Stateless agents, webhook handlers	30s execution timeout
VPS (Hetzner, etc.)	€4–€10/month	Full control, long-running agents	You manage everything
Self-hosted	Server cost	Privacy-sensitive workloads	You own all ops

My recommendation for most solo developers: start on Railway, move to Fly.io or a Hetzner VPS when you outgrow it. Railway handles the complexity of deployment (Dockerfile → running service) with minimal configuration. The premium is worth the saved time.

Step 7: Deployment checklist

Before any agent goes to production, run through this checklist:

A production agent’s lifecycle

Here’s what a well-deployed production agent looks like:

A trigger arrives (webhook, schedule, API call)
The orchestrator validates the input
A new run is created with a unique ID and cost budget
The agent loop executes with checkpointing and logging
On success, the output is stored and the orchestrator sends a notification
On failure, the error is logged, the cost is refunded to budget, and an alert fires
The run log is available for inspection

The difference between this and a script running on a laptop isn’t the agent logic. It’s the infrastructure around it: cost tracking, error recovery, monitoring, and alerting.

The agent itself is the easy part. The deployment is where you earn your experience.

Related: How to build your first AI agent, a step-by-step tutorial from scratch, and Best AI agent frameworks for 2026, comparing LangChain, CrewAI, and custom builds.

Related: How to build an AI customer support agent (that works): a complete walkthrough of building and deploying a production customer support agent.

Pro tip

Don't deploy your first agent perfectly. Deploy it fast, watch it fail, and fix the failure pattern. The production pattern I've described here emerged from failures, not planning. Run the loop (deploy, observe, improve) and the architecture will evolve naturally.

Related: The Vertical Agent Method: the framework behind how we build and ship AI agents.

FAQ

What’s the cheapest way to deploy an AI agent? For a simple agent, Cloudflare Workers or a $5 VPS with Docker works fine. For multi-agent systems, Railway or a small Kubernetes cluster is more appropriate.

Do I need Docker to deploy AI agents? Not strictly, but Docker makes dependency management, environment consistency, and scaling much easier. I’d recommend it for any production deployment.

How do I monitor costs for production AI agents? Set up a CostTracker with a warning threshold and a hard limit. Log every LLM call with token counts and cost. Use structured logging so you can query past runs by cost.

What deployment platform should I start with? Start on Railway: it handles Dockerfile-based deployment with minimal configuration. Move to Fly.io or a Hetzner VPS when you outgrow it.

AI agent deployment server setup. Production-grade VPS infrastructure with Docker, Nginx, SSL, and CI/CD
AI agent logging and monitoring. Seeing inside your agent with structured logs, metrics, and alerting
AI agent error handling patterns. Retry strategies, circuit breakers, and graceful degradation for production

This article was published on Agentic Up (https://agenticup.dev): practical guides for developers and founders building with AI agents. Reach me at [email protected]