What's the most common failure mode for support agents?

Overconfidence: the agent answers a question it doesn't have the right information for. This happens when RAG retrieves a partially relevant document and the LLM fills in the gaps with hallucinated details. The fix is a confidence threshold and an 'I don't know' fallback.

How to build an AI customer support agent that works

A complete walkthrough of building a production customer support agent, from RAG setup to handoff escalation, and what I learned from shipping 3 support agents.

The Anthropic documentation on tool use describes the client-tool pattern used in this agent, where the LLM returns a tool_use block and your application handles execution. That split makes it a good fit for customer support workflows that need controlled tool access.

TL;DR: A production support agent needs four stages: intent classification, RAG retrieval, response generation, and escalation logic. Skip classification and your agent will answer billing questions with technical responses. This walkthrough covers the architecture, code, and real cost data: about ₹1.05 per 3-turn conversation.

I’ve shipped three customer support agents. The first one was a disaster: it answered confidently with wrong information, escalated everything to humans anyway, and cost more in API calls than the support team’s salary.

The second one was better. The third one works.

Here’s what I learned: the architecture, the code, the hard-won lessons, and the exact numbers you should expect.

Key takeaways:

Intent classification before response generation is the most critical component

RAG works well when your docs are well-structured, poorly when they’re messy

Cost per conversation ranges from ₹1 to ₹15 depending on model and complexity

The most common failure mode is overconfidence: fix it with confidence thresholds

What is the architecture of an AI customer support agent?

Every support agent I’ve built follows the same four-stage pipeline:

User Message
 │
 ▼
[1. INTENT CLASSIFICATION]
 │ billing / technical / account / general
 ▼
[2. RETRIEVAL (RAG)]
 │ Relevant docs + past solutions
 ▼
[3. RESPONSE GENERATION]
 │ Draft response + confidence score
 ▼
[4. ESCALATION DECISION]
 │ Send to user? Or escalate to human?
 ▼
 Response or Escalation

The critical insight: classify before you respond. If you skip intent classification, your agent treats every question the same. A billing question gets a technical answer. A frustrated customer gets a generic script. The architecture above evolved exactly because I made those mistakes on version one.
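As a rough sketch of how the four stages chain together, the pipeline can be written as one function that takes each stage as a plain callable. The names `run_pipeline` and `PipelineResult` are hypothetical and the stub stages are for illustration only; the real stages are covered in the sections that follow.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PipelineResult:
    intent: str
    reply: Optional[str]  # None when the conversation is escalated
    escalated: bool


def run_pipeline(
    message: str,
    classify: Callable[[str], str],
    retrieve: Callable[[str, str], list],
    generate: Callable[[str, str, list], dict],
    decide: Callable[[str, dict], bool],
) -> PipelineResult:
    """Run the four stages in order; every stage is injected (hypothetical interface)."""
    intent = classify(message)               # 1. intent classification
    docs = retrieve(intent, message)         # 2. retrieval (RAG)
    draft = generate(intent, message, docs)  # 3. response generation
    escalate = decide(intent, draft)         # 4. escalation decision
    reply = None if escalate else draft["response"]
    return PipelineResult(intent=intent, reply=reply, escalated=escalate)


# Stub stages, for illustration only
result = run_pipeline(
    "Why was I charged twice?",
    classify=lambda m: "billing",
    retrieve=lambda i, m: ["Refund policy chunk"],
    generate=lambda i, m, d: {"response": "Here is what the refund policy says.", "confidence": 0.9},
    decide=lambda i, draft: draft["confidence"] < 0.7,
)
```

Because classification runs first, its output is available to every later stage, which is the point of the ordering.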

How does intent classification work in a support agent?

Before the agent responds, it needs to know what kind of question it’s dealing with:

import json

from anthropic import Anthropic

client = Anthropic()  # Reads ANTHROPIC_API_KEY from the environment

# Intent labels and what each one covers
INTENTS = {
    "billing": "payment issues, invoices, refunds, pricing questions",
    "technical": "API errors, bug reports, integration issues",
    "account": "login, password reset, account changes, subscription management",
    "general": "product questions, feature requests, how-to questions",
}

def classify_intent(message: str, history: list) -> dict:
    """Classify the user's intent with confidence score."""
    # history is accepted so callers can pass it, but this version classifies
    # the latest message on its own.
    intent_list = "\n".join(f"- {name}: {desc}" for name, desc in INTENTS.items())
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # Fast and cheap for classification
        max_tokens=200,
        # The Messages API takes the system prompt as a top-level parameter,
        # not as a message with role "system".
        system=f"""Classify this support message into one of:
{intent_list}

Return JSON: {{"intent": "billing", "confidence": 0.95, "reasoning": ".."}}""",
        messages=[{"role": "user", "content": message}],
    )
    return json.loads(response.content[0].text)

I use Haiku for classification because it’s fast (under 500ms) and cheap (₹0.03 per call). Sonnet-level reasoning isn’t needed for picking from four categories.

Expected classification accuracy: ~92% for well-defined intents. The edge cases are mixed-intent messages like “I can’t log in and I’m being charged”: the classifier picks the dominant intent, but the response needs to address both.
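One way to handle mixed-intent messages is to have the classifier return a score per intent and keep any strong runner-up alongside the dominant one. The sketch below assumes the prompt has been changed to return such scores; `select_intents` and the 0.3 floor are hypothetical choices, not part of the classifier shown above.

```python
def select_intents(scores: dict, secondary_floor: float = 0.3) -> list:
    """Return the dominant intent plus any runner-up scoring at or above the floor.

    `scores` maps intent name to a score in [0, 1]. The floor is an assumed
    value and should be tuned on real traffic.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    primary = ranked[0][0]
    secondary = [name for name, score in ranked[1:] if score >= secondary_floor]
    return [primary] + secondary


# "I can't log in and I'm being charged" might score like this:
mixed = select_intents(
    {"account": 0.55, "billing": 0.40, "technical": 0.03, "general": 0.02}
)
```

With both intents in hand, retrieval can run once per intent so the response has documentation for each part of the question.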

How do I set up RAG for a support agent?

Once you know the intent, retrieve relevant information from your support documentation:

import chromadb
from chromadb.utils import embedding_functions

# Initialize ChromaDB with support docs
chroma_client = chromadb.PersistentClient(path="./support_docs_db")
embedding_fn = embedding_functions.DefaultEmbeddingFunction()

collection = chroma_client.get_or_create_collection(
    name="support_docs",
    embedding_function=embedding_fn
)

# Query relevant docs based on intent + message
def retrieve_support_docs(intent: str, message: str, top_k: int = 3) -> list:
    query = f"{intent}: {message}"

    results = collection.query(
        query_texts=[query],
        n_results=top_k,
        include=["documents", "metadatas", "distances"]
    )

    docs = []
    for i, doc in enumerate(results["documents"][0]):
        docs.append({
            "content": doc,
            "metadata": results["metadatas"][0][i],
            # Rough score: higher means closer. It goes negative when the
            # distance exceeds 1, so treat it as a ranking signal.
            "relevance_score": 1.0 - results["distances"][0][i]
        })

    return docs

Document chunking strategy that works:

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,  # 500 characters per chunk
    chunk_overlap=50,  # 50 characters overlap between chunks
    separators=["\n## ", "\n### ", "\n\n", "\n", " "],
    length_function=len,
)

# article_content, article_title, article_intent and article_slug come from
# whatever loads your support articles.

# Split each support article into chunks
chunks = splitter.split_text(article_content)

# Store with metadata
for i, chunk in enumerate(chunks):
    collection.add(
        documents=[chunk],
        metadatas=[{
            "article_title": article_title,
            "chunk_index": i,
            "intent": article_intent,
        }],
        ids=[f"{article_slug}_{i}"]
    )

Key RAG lesson: The quality of your retrieval depends almost entirely on the quality of your documentation. If your support docs are well-structured with clear headings and one topic per section, RAG works beautifully. If your docs are wall-of-text pages, RAG returns garbage. Clean your docs first, then build RAG.
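A quick way to find the wall-of-text pages before indexing is to flag sections that run too long under a single heading. This is a minimal sketch that assumes the articles are Markdown with `##` and `###` headings (the same separators the splitter uses); `oversized_sections` and the 1,500-character limit are hypothetical.

```python
import re


def oversized_sections(markdown_text: str, max_chars: int = 1500) -> list:
    """Return the headings of sections whose body is longer than max_chars."""
    # With one capture group, re.split yields
    # [preamble, heading_1, body_1, heading_2, body_2, ...]
    parts = re.split(r"^(#{2,3} .+)$", markdown_text, flags=re.MULTILINE)
    flagged = []
    if len(parts[0].strip()) > max_chars:
        flagged.append("(text before first heading)")
    for heading, body in zip(parts[1::2], parts[2::2]):
        if len(body.strip()) > max_chars:
            flagged.append(heading.strip())
    return flagged


sample = "Intro\n## Resetting a password\nShort steps.\n## Billing\n" + "x" * 2000
flagged = oversized_sections(sample)
```

Anything this flags is a candidate for splitting into one-topic sections before it goes into the vector store.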

How does response generation work in a support agent?

With the intent and relevant documents, generate a response:

def generate_response(intent: str, message: str, retrieved_docs: list, history: list) -> dict:
    """Generate a response with confidence score."""

    docs_context = "\n\n".join([
        f"[Source: {d['metadata']['article_title']}] (relevance: {d['relevance_score']:.2f})\n{d['content']}"
        for d in retrieved_docs
    ])

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        # The system prompt is a top-level parameter, not a message role
        system=f"""You are a support agent. Use only the provided documentation to answer.

Intent: {intent}

Relevant documentation:
{docs_context}

Rules:
- Answer only from the documentation provided
- If the docs don't fully answer the question, say what you know and what you're unsure about
- Keep responses concise (2-3 paragraphs max)
- Include specific steps when giving instructions
- Never make up pricing or policies

Return JSON: {{"response": "..", "confidence": 0.85, "missing_info": ".."}}""",
        messages=[
            *history[-6:],  # Last 3 turns of conversation history
            {"role": "user", "content": message},
        ],
    )

    return json.loads(response.content[0].text)

The confidence score is critical. It determines whether the response goes to the user or gets escalated to a human.
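Since everything downstream depends on that score, it helps to parse the model output defensively: models sometimes wrap JSON in extra text, and a bare `json.loads` would raise. The sketch below is one possible approach, with a hypothetical `parse_model_json` helper that treats anything unparseable as zero confidence so it ends up escalated.

```python
import json
import re


def parse_model_json(raw: str) -> dict:
    """Extract the JSON object from model output; fall back to confidence 0.0."""
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    try:
        data = json.loads(match.group(0)) if match else None
    except json.JSONDecodeError:
        data = None

    if not isinstance(data, dict) or "response" not in data:
        # Unusable output: zero confidence pushes this to a human
        return {"response": "", "confidence": 0.0, "missing_info": "unparseable model output"}

    try:
        confidence = float(data.get("confidence", 0.0))
    except (TypeError, ValueError):
        confidence = 0.0
    data["confidence"] = min(max(confidence, 0.0), 1.0)  # clamp to [0, 1]
    return data


parsed = parse_model_json('Here is the JSON: {"response": "Hi", "confidence": 1.4} Hope that helps')
broken = parse_model_json("Sorry, I cannot answer that.")
```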

How does the escalation decision work?

This is the make-or-break component. The escalation logic decides which responses are good enough to send and which need human review:

def should_escalate(
    intent: str,
    response: dict,
    conversation_history: list,
    sentiment: str
) -> dict:
    """Decide whether to send response or escalate."""
    reasons = []

    # 1. Low confidence
    if response["confidence"] < 0.7:
        reasons.append(f"Low confidence ({response['confidence']:.2f})")

    # 2. Missing information
    if response.get("missing_info"):
        reasons.append(f"Missing info: {response['missing_info']}")

    # 3. User asked for human (look at the latest user message, not the latest reply)
    last_user_msg = next(
        (m["content"] for m in reversed(conversation_history) if m["role"] == "user"),
        ""
    )
    if any(phrase in last_user_msg.lower() for phrase in ["human", "agent", "manager", "speak to someone", "real person"]):
        reasons.append("User requested human")

    # 4. Long conversation (each turn adds one user and one assistant message)
    if len(conversation_history) >= 6:  # 3+ turns
        reasons.append("Exceeded 3 conversation turns")

    # 5. Angry sentiment
    if sentiment == "angry":
        reasons.append("Negative sentiment detected")

    # 6. Sensitive actions
    sensitive_intents = ["billing", "account"]
    if intent in sensitive_intents and response["confidence"] < 0.85:
        reasons.append(f"Sensitive intent ({intent}) with moderate confidence")

    escalate = len(reasons) > 0  # Named so it doesn't shadow the function

    return {
        "escalate": escalate,
        "reasons": reasons,
        "severity": "high" if escalate else "low",
    }

Escalation rates I see in production:

Well-documented product: 15-20% of conversations escalated
Poorly-documented product: 40-50% escalated
After 2 weeks of refinement: drops to 10-15%

The goal isn’t 0% escalation: that’s impossible unless your product has zero edge cases. The goal is 10-15% escalation, where the escalations are legitimate edge cases that a human needs to handle.
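To see where you sit against that target, you can compute the rate from logged results and count which triggers fire most often. This sketch assumes each logged result is a dict with `escalate` and `escalation_reasons` keys; the helper names are hypothetical.

```python
from collections import Counter


def escalation_rate(results: list) -> float:
    """Fraction of logged conversations that were escalated."""
    if not results:
        return 0.0
    return sum(1 for r in results if r["escalate"]) / len(results)


def top_escalation_reasons(results: list, n: int = 3) -> list:
    """Most common escalation triggers, with per-conversation details stripped."""
    counts = Counter()
    for r in results:
        for reason in r.get("escalation_reasons", []):
            # "Low confidence (0.55)" -> "Low confidence"; "Missing info: x" -> "Missing info"
            counts[reason.split(" (")[0].split(":")[0]] += 1
    return counts.most_common(n)


log = [{"escalate": False, "escalation_reasons": []} for _ in range(17)] + [
    {"escalate": True, "escalation_reasons": ["Low confidence (0.55)"]},
    {"escalate": True, "escalation_reasons": ["Low confidence (0.62)", "Missing info: refund window"]},
    {"escalate": True, "escalation_reasons": ["User requested human"]},
]
rate = escalation_rate(log)
reasons = top_escalation_reasons(log)
```

If one trigger dominates the count, that is the first threshold worth revisiting.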

How do I handle multi-turn conversations?

Support conversations are rarely one question → one answer. Users ask follow-ups, clarify their problem, or change the topic entirely. Here’s how to handle that:

class ConversationManager:
    def __init__(self, max_turns_without_resolution=5):
        self.sessions = {}  # session_id -> conversation state
        self.max_turns = max_turns_without_resolution

    def process_turn(self, session_id: str, message: str) -> dict:
        if session_id not in self.sessions:
            self.sessions[session_id] = {
                "history": [],
                "unresolved_count": 0,
                "current_intent": None,
                "summary": ""
            }

        session = self.sessions[session_id]

        # Classify intent (might change between turns)
        intent_result = classify_intent(message, session["history"])

        # Retrieve docs
        docs = retrieve_support_docs(intent_result["intent"], message)

        # Generate response
        response = generate_response(
            intent_result["intent"],
            message,
            docs,
            session["history"]
        )

        # Track unresolved issues: the same intent on consecutive turns
        # counts as the issue still being open
        if intent_result["intent"] == session["current_intent"]:
            session["unresolved_count"] += 1
        else:
            session["unresolved_count"] = 0
            session["current_intent"] = intent_result["intent"]

        # Record this turn
        session["history"].append({"role": "user", "content": message})
        session["history"].append({"role": "assistant", "content": response["response"]})

        # Compress history once it grows past 3 turns
        if len(session["history"]) > 6:
            # Summarize the messages about to be dropped, then keep the last 2 turns
            session["summary"] = self._summarize_history(session["history"][:-4], session["summary"])
            session["history"] = session["history"][-4:]

        # Check escalation
        escalation = should_escalate(
            intent_result["intent"],
            response,
            session["history"],
            self._detect_sentiment(message)
        )

        # Force escalation on repeated unresolved issues
        if session["unresolved_count"] >= 3:
            escalation["escalate"] = True
            escalation["reasons"].append("Same issue repeated 3+ times")

        return {
            "response": response["response"] if not escalation["escalate"] else None,
            "escalate": escalation["escalate"],
            "escalation_reasons": escalation["reasons"],
            "intent": intent_result["intent"],
            "confidence": response["confidence"],
        }

    def _detect_sentiment(self, message: str) -> str:
        """Label a message as angry, neutral, or positive.

        The original method is not shown in this article; this is a minimal
        sketch that assumes a one-word Haiku classification.
        """
        response = client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=5,
            system="Classify the sentiment of this support message. Reply with one word: angry, neutral, or positive.",
            messages=[{"role": "user", "content": message}]
        )
        return response.content[0].text.strip().lower()

    def _summarize_history(self, history, previous_summary):
        """Compress conversation history to save tokens."""
        transcript = "\n".join(f"{m['role']}: {m['content']}" for m in history)
        response = client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=200,
            messages=[{
                "role": "user",
                "content": f"Summarize this support conversation in 2 sentences:\n\n{previous_summary}\n\n{transcript}"
            }]
        )
        return response.content[0].text

The conversation summarizer is the unsung hero. Without it, a 10-turn conversation burns through 5,000+ tokens. With summarization, you keep the same context with 500 tokens.
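For the summary to save tokens it has to reach the model in place of the dropped turns. One option, sketched here, is to append it to the system prompt before each generation call; `with_summary` is a hypothetical helper and the wording of the injected section is an assumption.

```python
def with_summary(system_prompt: str, summary: str) -> str:
    """Append the running conversation summary to a system prompt, if there is one."""
    summary = summary.strip()
    if not summary:
        return system_prompt
    return f"{system_prompt}\n\nConversation so far (summarized):\n{summary}"


base_prompt = "You are a support agent. Use only the provided documentation to answer."
prompt = with_summary(base_prompt, "User could not log in after a password reset. Reset link was resent.")
```

Keeping the summary in the system prompt leaves the message list as a clean alternation of recent user and assistant turns.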

What does each support conversation cost?

Here are real numbers from my latest support agent deployment:

Component	Model	Cost per call	Calls per conversation	Total
Intent classification	Haiku	₹0.03	3	₹0.09
RAG retrieval	Embedding model	₹0.005	3	₹0.015
Response generation	Sonnet	₹0.30	3	₹0.90
Sentiment analysis	Haiku	₹0.03	1	₹0.03
Conversation summary	Haiku	₹0.03	1 (every 3 turns)	₹0.01
Total per 3-turn conversation				₹1.05 ($0.013)

With Sonnet: ~₹1.05 per conversation. With Haiku throughout: ~₹0.15 per conversation.

For 500 conversations/month: ₹525 with Sonnet, ₹75 with Haiku.

The choice depends on your complexity. Technical support needs Sonnet for accurate troubleshooting. General questions can use Haiku.
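The arithmetic behind those figures can be checked with a few lines. This sketch takes the per-component totals straight from the table above and uses `Decimal` to avoid floating-point rounding surprises; the function names are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

# "Total" column from the cost table, in rupees per 3-turn conversation
LINE_TOTALS = {
    "intent_classification": Decimal("0.09"),
    "rag_retrieval": Decimal("0.015"),
    "response_generation": Decimal("0.90"),
    "sentiment_analysis": Decimal("0.03"),
    "conversation_summary": Decimal("0.01"),
}


def per_conversation_cost(line_totals: dict) -> Decimal:
    """Sum the component totals and round to the nearest paisa."""
    total = sum(line_totals.values(), Decimal("0"))
    return total.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def monthly_cost(per_conversation: Decimal, conversations: int) -> Decimal:
    """Monthly bill for a given conversation volume."""
    return per_conversation * conversations


sonnet_cost = per_conversation_cost(LINE_TOTALS)
sonnet_monthly = monthly_cost(sonnet_cost, 500)
haiku_monthly = monthly_cost(Decimal("0.15"), 500)
```

Response generation accounts for roughly 86% of the per-conversation total, which is why the model choice for that one stage dominates the bill.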

What failure modes affect AI support agents?

1. Overconfidence (most common)

The agent retrieves a partially relevant document and fills in the gaps with hallucinated details. The user gets a confident-sounding wrong answer.

Fix: Add an “I don’t know” fallback. If confidence is below 0.7, the agent says “I’m not sure about this, let me connect you with someone who can help” instead of guessing.
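A minimal sketch of that fallback, assuming the draft is a dict with `response` and `confidence` keys as returned by the generation step; `apply_fallback` is a hypothetical name.

```python
FALLBACK_MESSAGE = "I’m not sure about this, let me connect you with someone who can help."


def apply_fallback(draft: dict, threshold: float = 0.7) -> dict:
    """Swap a low-confidence draft for the fallback message and flag a handoff."""
    # A missing confidence value is treated as 0.0, so it falls back too
    if draft.get("confidence", 0.0) < threshold:
        return {"response": FALLBACK_MESSAGE, "escalate": True}
    return {"response": draft["response"], "escalate": False}


unsure = apply_fallback({"response": "Refunds take 3 days.", "confidence": 0.55})
sure = apply_fallback({"response": "Go to Settings, then Security.", "confidence": 0.92})
```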

2. Context bleed

The agent remembers information from a previous conversation in the same session and applies it incorrectly.

Fix: Clear the conversation summary when the intent changes completely (e.g., from “billing” to “technical”).
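A sketch of that reset, assuming a session dict with `current_intent`, `summary`, and `unresolved_count` keys; `apply_intent_change` is a hypothetical helper.

```python
def apply_intent_change(session: dict, new_intent: str) -> bool:
    """Clear the carried-over summary when the intent switches; return True if it did."""
    previous = session.get("current_intent")
    changed = previous is not None and previous != new_intent
    if changed:
        session["summary"] = ""          # drop context from the old topic
        session["unresolved_count"] = 0  # the new topic starts fresh
    session["current_intent"] = new_intent
    return changed


session = {"current_intent": "billing", "summary": "User disputes an invoice.", "unresolved_count": 2}
switched = apply_intent_change(session, "technical")
```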

3. Documentation gap

The user asks about a feature or scenario that isn’t documented. The agent has nothing relevant to retrieve.

Fix: Monitor retrieval failure rates. If RAG returns nothing relevant for >10% of queries, you need better documentation.
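One way to monitor this is a small counter that records, per query, whether any retrieved chunk cleared a relevance cutoff. `RetrievalMonitor` is a hypothetical class, and the 0.5 cutoff is an assumption to tune against your own data; the 10% alert level is the figure above.

```python
class RetrievalMonitor:
    """Track how often retrieval returns nothing relevant."""

    def __init__(self, min_relevance: float = 0.5, alert_rate: float = 0.10):
        self.min_relevance = min_relevance  # assumed cutoff, tune on your data
        self.alert_rate = alert_rate        # alert when more than 10% of queries fail
        self.total = 0
        self.failures = 0

    def record(self, docs: list) -> bool:
        """Record one query; return True if it counted as a retrieval failure."""
        failed = not any(d["relevance_score"] >= self.min_relevance for d in docs)
        self.total += 1
        self.failures += int(failed)
        return failed

    @property
    def failure_rate(self) -> float:
        return self.failures / self.total if self.total else 0.0

    def needs_better_docs(self) -> bool:
        return self.failure_rate > self.alert_rate


monitor = RetrievalMonitor()
for _ in range(8):
    monitor.record([{"relevance_score": 0.8}])
monitor.record([])                           # nothing retrieved at all
monitor.record([{"relevance_score": 0.2}])   # only a weak match
```

Logging the queries that fail gives you a ready-made list of documentation topics to write next.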

4. Escalation fatigue

The agent escalates too easily, overwhelming the human support team.

Fix: Track your escalation rate and aim for 10-15%. If it’s higher, loosen your escalation triggers or improve your RAG.

Related: AI agent error handling patterns: retry strategies, fallback behaviors, and graceful degradation for production agents.

Also: AI agent cost optimization: 10 tips to reduce your LLM bill: keeping agent costs under control.

Ship version one this week

The architecture above looks complex, but version one is just the classification + RAG + response pipeline. Skip the escalation logic, sentiment analysis, and conversation summarizer. Add those in version two after you see where the agent fails. Every pattern I've shared was added in response to a real failure, not in anticipation of one.

FAQ

What’s the right architecture for a customer support agent? A four-stage pipeline: classification (intent detection) → retrieval (RAG from support docs) → response generation → escalation decision. The classification stage is critical: without it, the agent responds to billing questions with technical answers.

How do I handle multi-turn conversations? Use a conversation summarizer that compresses the chat history after every 3 turns. Pass the summary instead of raw history to keep token usage low. Also track unresolved issues across turns: if the user asks the same question 3 times, escalate regardless of confidence.

When should I hand off to a human? Hand off when: the agent's confidence is below 0.7, the user explicitly asks for a human, the conversation runs past 3 turns or the same issue comes up 3 times without resolution, a billing or account question has confidence below 0.85, or the sentiment is clearly angry or frustrated.

What’s the typical cost per conversation for an AI support agent? In the deployment measured above, a 3-turn support conversation costs about ₹1.05 ($0.013) with Claude Sonnet for response generation and about ₹0.15 with Claude Haiku throughout. For 500 conversations/month, that’s ₹525 with Sonnet or ₹75 with Haiku. Overall, cost per conversation ranges from ₹1 to ₹15 depending on model and complexity.


This article was published on Agentic Up (https://agenticup.dev): practical guides for developers and founders building with AI agents.