---
title: Build an agentic incident triage assistant with AWS Quick and New Relic
canonical: "https://agenticup.dev/posts/agentic-incident-triage-aws-new-relic/"
pubDate: "2026-06-10T00:00:00.000Z"
description: "AWS and New Relic published a guide for building an agentic incident triage assistant. Here's the architecture pattern. automated context gathering, diagnostic execution, and remediation suggestions triggered by alerts."
tags: [aws, incident-response, agentic-workflow, new-relic, devops, automation]
---

TL;DR: My pager went off at 2am. By the time I logged in, the logs had rotated, the metrics window had passed, and I spent 30 minutes reconstructing what happened. This architecture fixes that. Alert fires, agent gathers context, runs diagnostics, suggests remediation. All before you finish your coffee.

Incident response follows a predictable pattern. An alert fires, and the engineer on call spends the first 5-10 minutes gathering context: checking dashboards, pulling logs, reviewing recent deployments. Only then do they start diagnosing the actual problem. An agentic triage assistant can automate that context-gathering phase.

> **Key takeaways:**
> - The first 5 minutes of incident response is context gathering: highly automatable
> - Architecture: alert → agent orchestration → observability data → diagnostics → remediation suggestion
> - AWS Quick handles the agent loop, Bedrock AgentCore provides the LLM
> - Works best for well-understood failure patterns with clear diagnostic procedures
> - Always include a human-in-the-loop for remediation: agents suggest, humans approve

## What is the architecture of an agentic incident triage assistant?

The pattern from the AWS/New Relic reference architecture breaks down into four stages:

**1. Alert triggers the agent.** A New Relic alert fires and sends a webhook to Amazon Quick. The alert payload includes the incident type, affected service, severity, and a link to the related dashboard.

**2. Context gathering.** The agent receives the alert and immediately starts collecting context: recent logs from the affected service, error rates from New Relic metrics, recent deployment activity, and related alerts from the past hour.

**3. Diagnostic execution.** Based on the incident type, the agent runs predefined diagnostic procedures. For a high-latency alert, it checks database query performance, CPU use, and upstream dependency latency. For an error rate spike, it looks at recent code deployments and error log patterns.

**4. Remediation suggestion.** The agent compiles its findings into a structured triage report: what's happening, what's changed, likely causes, and suggested remediation steps. This goes to the on-call engineer for review.

## How can I extend this incident triage architecture?

The interesting part is what happens after the initial implementation. Once you have this pattern running, you can extend it:

- **Playbook automation.** For well-understood failure modes, the agent can execute remediation steps directly, restart services, roll back deployments, scale resources, with human approval for each action.

- **Post-mortem generation.** After the incident is resolved, the agent can automatically generate a post-mortem draft from the timeline, context data, and remediation actions taken.

- **Pattern learning.** Over time, the agent can learn which diagnostic steps are most useful for each incident type and prioritize them accordingly.

The [AWS and New Relic reference architecture post](https://aws.amazon.com/blogs/machine-learning/build-an-agentic-incident-triage-assistant-with-amazon-quick-and-new-relic/) is worth reading for the specific implementation details. But the pattern itself, alert-driven agent orchestration with observability tool integration, is applicable to any stack.

I've covered [agent deployment patterns](/posts/ai-agent-deployment-guide-localhost-to-production/) and [monitoring agents in production](/posts/ai-agent-logging-monitoring/): the incident triage pattern fits naturally into both.

## FAQ

> **What is an agentic incident triage assistant?**
> An AI agent that automatically responds to incidents by gathering context (logs, metrics, traces), running diagnostic checks, and suggesting remediation steps. This reduces the first 5 minutes of manual incident response.
>
> **What AWS services are used?**
> Amazon Quick (agent orchestration), Bedrock AgentCore (LLM integration), and integration with New Relic for observability data.
>
> **Can this pattern work outside AWS?**
> Yes: the architecture pattern of alert → context gathering → diagnostics → suggestion works with any observability platform and agent framework.
>

## Related Posts

- [AI agent error handling patterns](/posts/ai-agent-error-handling-patterns/). Retry strategies, circuit breakers, and graceful degradation for production agents
- [The policy gate every agent needs before production](/posts/ai-agent-policy-gates/). How to add safety checks and human approval gates to agent tool calls
- [AI agent multi-step workflows](/posts/ai-agent-multi-step-workflows/). Sequential chains, parallel execution, and conditional branching for orchestrated agents

---

This article was published on Agentic Up (https://agenticup.dev): practical guides for developers and founders building with AI agents. Reach me at hello@agenticup.dev.
