How to build an AI customer support agent that works
A complete walkthrough of building a production customer support agent — from RAG setup to handoff escalation. What I learned from shipping 3 support agents.
The Anthropic documentation on tool use describes the client-tool pattern used in this agent — where the LLM returns a tool_use block and your application handles execution, making it ideal for customer support workflows that need controlled tool access.
TL;DR: A production support agent needs four stages: intent classification, RAG retrieval, response generation, and escalation logic. Skip classification and your agent will answer billing questions with technical responses. This walkthrough covers the architecture, code, and real cost data — about ₹1.05 per 3-turn conversation.
I’ve shipped three customer support agents. The first one was a disaster — it answered confidently with wrong information, escalated everything to humans anyway, and cost more in API calls than the support team’s salary.
The second one was better. The third one actually works.
Here’s what I learned. The architecture, the code, the hard-won lessons, and the exact numbers you should expect.
Key takeaways:
- Intent classification before response generation is the most critical component
- RAG works well when your docs are well-structured, poorly when they’re messy
- Cost per conversation ranges from ₹1 to ₹15 depending on model and complexity
- The most common failure mode is overconfidence — fix it with confidence thresholds
Architecture overview
Every support agent I’ve built follows the same four-stage pipeline:
User Message
│
▼
[1. INTENT CLASSIFICATION]
│ billing / technical / account / general
▼
[2. RETRIEVAL (RAG)]
│ Relevant docs + past solutions
▼
[3. RESPONSE GENERATION]
│ Draft response + confidence score
▼
[4. ESCALATION DECISION]
│ Send to user? Or escalate to human?
▼
Response or Escalation
The critical insight: classify before you respond. If you skip intent classification, your agent treats every question the same. A billing question gets a technical answer. A frustrated customer gets a generic script. The architecture above evolved exactly because I made those mistakes on version one.
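The four stages above can be sketched as one function that wires them together. This is a minimal sketch, not the production code: the stage functions are passed in as callables, and the `stub_*` functions are hypothetical placeholders that only show what each stage hands to the next.

```python
from typing import Callable


def run_pipeline(
    message: str,
    classify: Callable[[str], dict],
    retrieve: Callable[[str, str], list],
    generate: Callable[[str, str, list], dict],
    decide: Callable[[str, dict], bool],
) -> dict:
    """Run the four stages in order and return a reply or an escalation."""
    intent = classify(message)["intent"]     # Stage 1: intent classification
    docs = retrieve(intent, message)         # Stage 2: retrieval
    draft = generate(intent, message, docs)  # Stage 3: response generation
    escalate = decide(intent, draft)         # Stage 4: escalation decision
    return {
        "intent": intent,
        "escalate": escalate,
        "response": None if escalate else draft["response"],
    }


# Stub stages for illustration only (hypothetical, no API calls)
def stub_classify(message: str) -> dict:
    return {"intent": "billing" if "invoice" in message.lower() else "general"}


def stub_retrieve(intent: str, message: str) -> list:
    return [f"{intent} doc 1", f"{intent} doc 2"]


def stub_generate(intent: str, message: str, docs: list) -> dict:
    return {"response": f"Answer based on {len(docs)} docs", "confidence": 0.9}


def stub_decide(intent: str, draft: dict) -> bool:
    return draft["confidence"] < 0.7


result = run_pipeline(
    "Where is my invoice?", stub_classify, stub_retrieve, stub_generate, stub_decide
)
```

Passing the stages in as arguments keeps each one replaceable, which matters when you swap models or retrieval backends between versions.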
Stage 1: Intent classification
Before the agent responds, it needs to know what kind of question it’s dealing with:
import json

from anthropic import Anthropic

client = Anthropic()  # Reads ANTHROPIC_API_KEY from the environment

# Supported intents and what each one covers
INTENTS = {
    "billing": "payment issues, invoices, pricing questions",
    "technical": "API errors, bug reports, integration issues",
    "account": "password reset, account changes, subscription management",
    "general": "product questions, feature requests, how-to questions",
}

def classify_intent(message: str, history: list) -> dict:
    """Classify the user's intent with confidence score."""
    # Note: history is accepted but not used by this prompt yet
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # Fast and cheap for classification
        max_tokens=200,
        # The Messages API takes the system prompt as a top-level parameter,
        # not as a "system" role inside messages
        system="""Classify this support message into one of:
- billing: payments, invoices, refunds, pricing
- technical: bugs, errors, API issues, integration help
- account: login, password, subscription changes
- general: product questions, feature requests, how-to
Return JSON: {"intent": "billing", "confidence": 0.95, "reasoning": "..."}""",
        messages=[{"role": "user", "content": message}],
    )
    return json.loads(response.content[0].text)
I use Haiku for classification because it’s fast (under 500ms) and cheap (₹0.03 per call). Sonnet-level reasoning isn’t needed for picking from four categories.
Expected classification accuracy: ~92% for well-defined intents. The edge cases are mixed-intent messages like “I can’t log in and I’m being charged” — the classifier picks the dominant intent, but the response needs to address both.
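An accuracy figure like this comes from running the classifier over a hand-labeled sample. A minimal sketch of that measurement, assuming you already have predicted and true labels as parallel lists. The function name and the sample data are made up for illustration.

```python
from collections import Counter


def classification_report(predicted: list, actual: list) -> dict:
    """Compute accuracy and count which intents get confused with which."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("need at least one labeled example")
    correct = sum(p == a for p, a in zip(predicted, actual))
    # Key is (actual, predicted) for every pair that disagrees
    confusions = Counter((a, p) for p, a in zip(predicted, actual) if p != a)
    return {"accuracy": correct / len(actual), "confusions": dict(confusions)}


# Illustrative labels only, not real evaluation data
actual = ["billing", "technical", "account", "general", "billing"]
predicted = ["billing", "technical", "billing", "general", "billing"]
report = classification_report(predicted, actual)
```

The confusion counts are often more useful than the headline number: they show which pairs of intents the classifier mixes up, which is where mixed-intent messages tend to surface.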
Stage 2: RAG setup
Once you know the intent, retrieve relevant information from your support documentation:
import chromadb
from chromadb.utils import embedding_functions

# Initialize ChromaDB with support docs
chroma_client = chromadb.PersistentClient(path="./support_docs_db")
embedding_fn = embedding_functions.DefaultEmbeddingFunction()
collection = chroma_client.get_or_create_collection(
    name="support_docs",
    embedding_function=embedding_fn,
)

# Query relevant docs based on intent + message
def retrieve_support_docs(intent: str, message: str, top_k: int = 3) -> list:
    query = f"{intent}: {message}"
    results = collection.query(
        query_texts=[query],
        n_results=top_k,
        include=["documents", "metadatas", "distances"],
    )
    docs = []
    for i, doc in enumerate(results["documents"][0]):
        docs.append({
            "content": doc,
            "metadata": results["metadatas"][0][i],
            # Lower distance means a closer match. Depending on the collection's
            # distance metric this value can fall below 0, so treat it as a
            # relative score rather than a probability.
            "relevance_score": 1.0 - results["distances"][0][i],
        })
    return docs
Document chunking strategy that works:
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # 500 characters per chunk
    chunk_overlap=50,  # 50 characters overlap between chunks
    separators=["\n## ", "\n### ", "\n\n", "\n", " "],
    length_function=len,
)

# Split each support article into chunks
chunks = splitter.split_text(article_content)

# Store with metadata
for i, chunk in enumerate(chunks):
    collection.add(
        documents=[chunk],
        metadatas=[{
            "article_title": article_title,
            "chunk_index": i,
            "intent": article_intent,
        }],
        ids=[f"{article_slug}_{i}"],
    )
Key RAG lesson: The quality of your retrieval depends almost entirely on the quality of your documentation. If your support docs are well-structured with clear headings and one topic per section, RAG works beautifully. If your docs are wall-of-text pages, RAG returns garbage. Clean your docs first, then build RAG.
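One way to check whether an article is ready for indexing is to audit its structure first. A rough sketch, assuming articles are Markdown with `## ` section headings. The function name and the 1,500-character limit are illustrative assumptions, not measured values.

```python
import re


def audit_article(markdown_text: str, max_section_chars: int = 1500) -> list:
    """Return warnings for sections that are likely to chunk badly."""
    warnings = []
    # Split at the start of every level-2 heading, keeping heading and body together
    sections = [s for s in re.split(r"(?m)^(?=## )", markdown_text) if s.strip()]
    for section in sections:
        first_line = section.strip().splitlines()[0]
        if not first_line.startswith("## "):
            warnings.append("Text before the first heading has no section title")
        elif len(section) > max_section_chars:
            title = first_line[3:]
            warnings.append(
                f"Section '{title}' is {len(section)} chars; consider splitting it"
            )
    return warnings
```

An article that comes back with no warnings is not guaranteed to retrieve well, but one that comes back with many is a good candidate for cleanup before it goes into the collection.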
Stage 3: Response generation
With the intent and relevant documents, generate a response:
def generate_response(intent: str, message: str, retrieved_docs: list, history: list) -> dict:
    """Generate a response with confidence score."""
    docs_context = "\n\n".join([
        f"[Source: {d['metadata']['article_title']}] (relevance: {d['relevance_score']:.2f})\n{d['content']}"
        for d in retrieved_docs
    ])
    # Uses the json import and Anthropic client from the classification step
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        # The system prompt goes in the top-level `system` parameter,
        # not in a "system" role message
        system=f"""You are a support agent. Use only the provided documentation to answer.
Intent: {intent}
Relevant documentation:
{docs_context}
Rules:
- Answer only from the documentation provided
- If the docs don't fully answer the question, say what you know and what you're unsure about
- Keep responses concise (2-3 paragraphs max)
- Include specific steps when giving instructions
- Never make up pricing or policies
- Set missing_info to an empty string when the docs fully answer the question
Return JSON: {{"response": "...", "confidence": 0.85, "missing_info": "..."}}""",
        messages=[
            *history[-6:],  # Last 3 turns of conversation history
            {"role": "user", "content": message},
        ],
    )
    return json.loads(response.content[0].text)
The confidence score is critical. It determines whether the response goes to the user or gets escalated to a human.
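Because so much hinges on that score, it helps to parse the model's reply defensively: `json.loads` raises if the model wraps the JSON in extra text. A minimal sketch, where `parse_model_reply` is a hypothetical helper. It takes the span from the first `{` to the last `}` and falls back to confidence 0.0 when parsing fails, so a broken reply gets escalated instead of sent.

```python
import json


def parse_model_reply(raw_text: str) -> dict:
    """Parse the model's JSON reply, falling back to zero confidence on failure."""
    fallback = {
        "response": "",
        "confidence": 0.0,
        "missing_info": "unparseable model output",
    }
    start = raw_text.find("{")
    end = raw_text.rfind("}")
    if start == -1 or end <= start:
        return fallback
    try:
        data = json.loads(raw_text[start:end + 1])
    except json.JSONDecodeError:
        return fallback
    # Coerce confidence to a float in [0, 1]; anything odd counts as 0.0
    try:
        confidence = float(data.get("confidence", 0.0))
    except (TypeError, ValueError):
        confidence = 0.0
    data["confidence"] = min(max(confidence, 0.0), 1.0)
    data.setdefault("response", "")
    data.setdefault("missing_info", "")
    return data
```

Treating a parse failure as zero confidence is a design choice: it trades a few unnecessary escalations for never sending a half-parsed answer to a customer.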
Stage 4: Escalation decision
This is the make-or-break component. The escalation logic decides which responses are good enough to send and which need human review:
def should_escalate(
    intent: str,
    response: dict,
    conversation_history: list,
    sentiment: str,
) -> dict:
    """Decide whether to send response or escalate."""
    reasons = []

    # 1. Low confidence
    if response["confidence"] < 0.7:
        reasons.append(f"Low confidence ({response['confidence']:.2f})")

    # 2. Missing information
    if response.get("missing_info"):
        reasons.append(f"Missing info: {response['missing_info']}")

    # 3. User asked for human
    # The history may end with the assistant's reply, so find the latest user message
    last_user_msg = next(
        (m["content"] for m in reversed(conversation_history) if m["role"] == "user"),
        "",
    )
    human_phrases = ["human", "agent", "manager", "speak to someone", "real person"]
    if any(phrase in last_user_msg.lower() for phrase in human_phrases):
        reasons.append("User requested human")

    # 4. Long conversation (history holds two messages per turn, so 6 = 3 turns)
    if len(conversation_history) >= 6:
        reasons.append("Conversation reached 3+ turns")

    # 5. Angry sentiment
    if sentiment == "angry":
        reasons.append("Negative sentiment detected")

    # 6. Sensitive actions
    sensitive_intents = ["billing", "account"]
    if intent in sensitive_intents and response["confidence"] < 0.85:
        reasons.append(f"Sensitive intent ({intent}) with moderate confidence")

    # Use a name that does not shadow the function itself
    escalate = bool(reasons)
    return {
        "escalate": escalate,
        "reasons": reasons,
        "severity": "high" if escalate else "low",
    }
Escalation rates I see in production:
- Well-documented product: 15-20% of conversations escalated
- Poorly-documented product: 40-50% escalated
- After 2 weeks of refinement: drops to 10-15%
The goal isn’t 0% escalation — that’s impossible unless your product has zero edge cases. The goal is 10-15% escalation, where the escalations are legitimate edge cases that a human needs to handle.
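Checking against that target can be as simple as computing the rate over a window of logged conversations. A small sketch, assuming each logged conversation is a dict with an `escalated` boolean (a hypothetical log format). The 10-15% band comes from the numbers above.

```python
def escalation_rate(conversations: list) -> float:
    """Fraction of conversations that were escalated to a human."""
    if not conversations:
        return 0.0
    escalated = sum(1 for c in conversations if c["escalated"])
    return escalated / len(conversations)


def rate_status(rate: float, low: float = 0.10, high: float = 0.15) -> str:
    """Compare the rate against the target band."""
    if rate > high:
        return "too_high"
    if rate < low:
        return "below_target"
    return "on_target"


# Illustrative log: 2 of 16 conversations escalated
log = [{"escalated": i < 2} for i in range(16)]
rate = escalation_rate(log)
status = rate_status(rate)
```

A rate below the band is worth a look too, since it can mean the agent is sending answers it should have handed off.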
Multi-turn conversation handling
Support conversations are rarely one question → one answer. Users ask follow-ups, clarify their problem, or change the topic entirely. Here’s how to handle that:
class ConversationManager:
    def __init__(self, max_turns_without_resolution=5):
        self.sessions = {}  # session_id -> conversation state
        self.max_turns = max_turns_without_resolution

    def process_turn(self, session_id: str, message: str) -> dict:
        if session_id not in self.sessions:
            self.sessions[session_id] = {
                "history": [],
                "unresolved_count": 0,
                "current_intent": None,
                "summary": "",
            }
        session = self.sessions[session_id]

        # Classify intent (might change between turns)
        intent_result = classify_intent(message, session["history"])

        # Retrieve docs
        docs = retrieve_support_docs(intent_result["intent"], message)

        # Generate response
        response = generate_response(
            intent_result["intent"],
            message,
            docs,
            session["history"],
        )

        # Track unresolved issues
        if intent_result["intent"] == session["current_intent"]:
            session["unresolved_count"] += 1
        else:
            session["unresolved_count"] = 0
            session["current_intent"] = intent_result["intent"]

        # Compress history after every 3 turns
        session["history"].append({"role": "user", "content": message})
        session["history"].append({"role": "assistant", "content": response["response"]})
        if len(session["history"]) > 6:  # More than 3 turns
            session["summary"] = self._summarize_history(session["history"], session["summary"])
            session["history"] = session["history"][-4:]  # Keep only last 2 turns

        # Check escalation
        # _detect_sentiment is not defined in this walkthrough; it should
        # return a label such as "angry" for the latest message
        escalation = should_escalate(
            intent_result["intent"],
            response,
            session["history"],
            self._detect_sentiment(message),
        )

        # Force escalation on repeated unresolved issues
        if session["unresolved_count"] >= 3:
            escalation["escalate"] = True
            escalation["reasons"].append("Same issue repeated 3+ times")

        return {
            "response": response["response"] if not escalation["escalate"] else None,
            "escalate": escalation["escalate"],
            "escalation_reasons": escalation["reasons"],
            "intent": intent_result["intent"],
            "confidence": response["confidence"],
        }

    def _summarize_history(self, history, previous_summary):
        """Compress conversation history to save tokens."""
        response = client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=200,
            messages=[{
                "role": "user",
                "content": f"Summarize this support conversation in 2 sentences:\n\n{previous_summary}\n\n{history[-4:]}",
            }],
        )
        return response.content[0].text
The conversation summarizer is the unsung hero. Without it, a 10-turn conversation burns through 5,000+ tokens. With summarization, you keep the same context with 500 tokens.
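For the summary to carry context forward, it has to reach the model on the next turn. A minimal sketch of one way to do that, assuming the summary is appended to the system prompt and only the most recent messages are sent verbatim. `build_context` is a hypothetical helper, not part of the `ConversationManager` shown earlier.

```python
def build_context(
    base_system_prompt: str,
    summary: str,
    history: list,
    message: str,
    keep_messages: int = 4,
) -> dict:
    """Combine the running summary with the most recent turns."""
    system = base_system_prompt
    if summary:
        system += f"\n\nSummary of the conversation so far:\n{summary}"
    # Keep an even number of messages so the slice starts with a user message
    recent = history[-keep_messages:] if keep_messages > 0 else []
    messages = [*recent, {"role": "user", "content": message}]
    return {"system": system, "messages": messages}


history = [
    {"role": "user", "content": "q1"},
    {"role": "assistant", "content": "a1"},
    {"role": "user", "content": "q2"},
    {"role": "assistant", "content": "a2"},
    {"role": "user", "content": "q3"},
    {"role": "assistant", "content": "a3"},
]
ctx = build_context(
    "You are a support agent.",
    "User cannot log in after a password reset.",
    history,
    "Still broken",
)
```

The returned `system` and `messages` values map onto the parameters of the same name in a Messages API call.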
Cost analysis per conversation
Here are real numbers from my latest support agent deployment:
| Component | Model | Cost per call | Calls per conversation | Total |
|---|---|---|---|---|
| Intent classification | Haiku | ₹0.03 | 3 | ₹0.09 |
| RAG retrieval | Embedding model | ₹0.005 | 3 | ₹0.015 |
| Response generation | Sonnet | ₹0.30 | 3 | ₹0.90 |
| Sentiment analysis | Haiku | ₹0.03 | 1 | ₹0.03 |
| Conversation summary | Haiku | ₹0.03 | 1 (every 3 turns) | ₹0.01 |
| Total per 3-turn conversation | | | | ₹1.05 ($0.013) |
With Sonnet: ~₹1.05 per conversation. With Haiku throughout: ~₹0.15 per conversation.
For 500 conversations/month: ₹525 with Sonnet, ₹75 with Haiku.
The choice depends on your complexity. Technical support needs Sonnet for accurate troubleshooting. General questions can use Haiku.
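The table's arithmetic can be reproduced in a few lines. This sketch sums the Total column as given above and scales it to a monthly volume. It uses `Decimal` so that rounding to two places is predictable. The names are illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

# Totals per component, copied from the cost table (in rupees)
component_totals = {
    "intent_classification": Decimal("0.09"),
    "rag_retrieval": Decimal("0.015"),
    "response_generation": Decimal("0.90"),
    "sentiment_analysis": Decimal("0.03"),
    "conversation_summary": Decimal("0.01"),
}


def per_conversation_cost(totals: dict) -> Decimal:
    """Sum the component totals and round to two decimal places."""
    raw = sum(totals.values(), Decimal("0"))
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def monthly_cost(per_conversation: Decimal, conversations: int) -> Decimal:
    """Scale the per-conversation cost to a monthly volume."""
    return per_conversation * conversations


cost = per_conversation_cost(component_totals)
monthly = monthly_cost(cost, 500)
```

Swapping in your own per-component totals and monthly volume gives a quick estimate before you commit to a model tier.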
Common failure modes
1. Overconfidence (most common)
The agent retrieves a partially relevant document and fills in the gaps with hallucinated details. The user gets a confident-sounding wrong answer.
Fix: Add an “I don’t know” fallback. If confidence is below 0.7, the agent says “I’m not sure about this, let me connect you with someone who can help” instead of guessing.
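A minimal sketch of that fallback, assuming the generation step returns a dict with `response` and `confidence` keys as in Stage 3. The function name is hypothetical. The threshold and fallback wording follow the fix described above.

```python
FALLBACK_MESSAGE = (
    "I'm not sure about this, let me connect you with someone who can help."
)


def apply_confidence_fallback(draft: dict, threshold: float = 0.7) -> dict:
    """Replace a low-confidence draft with the fallback and flag it for handoff."""
    # A missing confidence value is treated as 0.0, which triggers the fallback
    if draft.get("confidence", 0.0) < threshold:
        return {"response": FALLBACK_MESSAGE, "escalate": True}
    return {"response": draft["response"], "escalate": False}
```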
2. Context bleed
The agent carries information from an earlier topic in the same session into a new one and applies it incorrectly.
Fix: Clear the conversation summary when the intent changes completely (e.g., from “billing” to “technical”).
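A sketch of that reset, assuming the session is a dict with `current_intent` and `summary` keys like the one kept by the conversation manager. `update_session_intent` is a hypothetical helper name.

```python
def update_session_intent(session: dict, new_intent: str) -> dict:
    """Record the new intent and clear the summary if the topic changed."""
    previous = session.get("current_intent")
    if previous is not None and previous != new_intent:
        # Old summary describes a different topic, so drop it
        session["summary"] = ""
    session["current_intent"] = new_intent
    return session
```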
3. Documentation gap
The user asks about a feature or scenario that isn’t documented. The agent has nothing relevant to retrieve.
Fix: Monitor retrieval failure rates. If RAG returns nothing relevant for >10% of queries, you need better documentation.
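That monitoring can be sketched as a small function over logged retrieval results. It assumes you log the best relevance score for each query, with `None` when nothing came back. The 0.3 relevance cutoff is an illustrative assumption to tune against your own data. The 10% limit is the one stated above.

```python
def retrieval_failure_rate(top_scores: list, min_relevance: float = 0.3) -> float:
    """Fraction of queries whose best retrieved doc was missing or too weak."""
    if not top_scores:
        return 0.0
    failures = sum(1 for s in top_scores if s is None or s < min_relevance)
    return failures / len(top_scores)


def needs_better_docs(rate: float, limit: float = 0.10) -> bool:
    """True when the failure rate is above the documentation-gap limit."""
    return rate > limit


# Illustrative scores only: 2 of 8 queries failed
scores = [0.8, 0.1, None, 0.6, 0.7, 0.9, 0.5, 0.75]
rate = retrieval_failure_rate(scores)
```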
4. Escalation fatigue
The agent escalates too easily, overwhelming the human support team.
Fix: Track your escalation rate and aim for 10-15%. If it’s higher, loosen your escalation triggers or improve your RAG.
The architecture above looks complex, but version one is just the classification + RAG + response pipeline. Skip the escalation logic, sentiment analysis, and conversation summarizer. Add those in version two after you see where the agent actually fails. Every pattern I've shared was added in response to a real failure, not in anticipation of one.