---
title: "Loop engineering: the production agent loop nobody talks about"
canonical: "https://agenticup.dev/posts/loop-engineering-production-agent-loops/"
pubDate: "2026-06-20T00:00:00.000Z"
description: "Every agent framework converges on the same six-line loop. The production-hardened version adds context compaction, loop detection, cost budgets, step-level durability, and the ability for the agent to write its own skills."
tags: [agent-loops, durable-execution, agent-orchestration, production-agents, step-checkpointing, react, reflexion, hatchet, temporal, inngest]
---

**TL;DR:** Every agent framework converges on the same six-line loop. The production-hardened version with context compaction, loop detection, cost budgets, step-level durability, and the ability for the agent to write its own skills. That is where the real engineering lives. The core insight: every loop iteration follows a gain-cost tradeoff. The second loop refines. The third loop degrades. Design for two, not for N.

> **Key takeaways:**
> - The canonical agent loop is six lines of code. Every framework implements the same ReAct loop. The loop is settled. The engineering around it is not.
> - Every loop iteration follows a gain-cost tradeoff. LoopCoder-v2 proved it at the architecture level: two loops improve SWE-bench from 43% to 64%. Three loops regress to 27%. Marginal refinement saturates. Fixed overhead accumulates. Design for two.
> - The escalation path is: ReAct → measure stability (ICIR) → Reflexion → re-anchor at 30 steps → verifier-critic for high-stakes → multi-agent only when single-agent caps out.
> - The three-layer architecture (loop, skill, orchestrator) separates concerns. The loop is the unit of work. The skill is the asset that compounds. The orchestrator provides durability, retries, concurrency control, and observability.
> - On Monday morning, ship a ReAct baseline with Hatchet or Temporal for step-level durability. Measure where it caps out. Add one layer at a time. Don't lead with multi-agent.

---

The first time I deployed an agent loop to production, it worked for exactly six hours. Then the container restarted during a routine deploy. The agent came back online instantly. It also forgot everything it was doing. It re-fetched data it had already processed. It re-called the LLM for decisions it had already made. It sent three identical Slack messages before I killed it. The team wasn't amused.

The loop itself was fine. Six lines of a while loop. The problem wasn't the loop. It was everything the loop did not have. No checkpointing. No step-level retry. No way to answer the question "what step was I on when I died?"

I spent the next three weeks building what the runtime should have given me. I should have been shipping features.

## What is the canonical agent loop, and why is it not the hard part?

Every major agent framework runs the same core loop. I spent an unreasonable amount of time reading through the source code of Claude Code, Codex, Cursor, LangGraph, the Vercel AI SDK, and smolagents ([Steve Kinney, Anatomy of an Agent Loop](https://stevekinney.com/writing/agent-loops)) expecting to find meaningfully different approaches. I found the same six lines of logic wearing different costumes.

```
while not done:
    response = call_llm(messages)
    if response.tool_calls:
        results = execute_tools(response.tool_calls)
        messages.extend(results)
    else:
        return response
```

Barry Zhang at Princeton boiled it down even further: `env = Environment(); while True: action = llm.run(system_prompt + env.state); env.state = tools.run(action)`. Two lines if you squint.

This pattern is called ReAct, from Yao et al. at Princeton and Google Research in 2022. The model interleaves reasoning and acting: think about what to do, do it, observe the result, think again. It showed a 34% improvement on ALFWorld benchmarks over chain-of-thought alone.

The loop is the solved problem. The engineering around the loop. Context management, durability, cost containment, and graceful degradation. That is where all the interesting decisions live. And most teams discover the hard way that a while loop in a terminal gives you none.

## What is the gain-cost tradeoff that every loop faces?

This is the central insight that most discussions miss. Every loop iteration has a gain and a cost. The gain is marginal refinement. The loop processes new information, corrects course, produces a better output. The cost is fixed overhead. Latency, tokens, cache growth, coordination delay.

The first iteration (the initial ReAct pass) captures most of the value. A second iteration catches mistakes and refines. Beyond that, the gain shrinks while the cost accumulates.

LoopCoder-v2, a 7B coding model from Beihang University and IQuest Research ([arxiv 2606.18023](https://arxiv.org/abs/2606.18023)), demonstrated this empirically at the architecture level. They trained a Parallel Loop Transformer with different loop counts from scratch on 18 trillion tokens:

| Loop count | SWE-bench Verified | Multi-SWE |
|---|---|---|
| 1 (non-looped) | 43.0% | 14.0% |
| **2** | **64.4%** | **31.0%** |
| 3 | 27.6% | drops |
| 4 | worse | worse |

Two loops improved dramatically. Three loops regressed. The second loop provided productive refinement. Coherent updates to hidden states, changed attention routing, shifted output distributions. Later loops produced diminishing, oscillatory updates with reduced representational diversity.

The same pattern applies to agent orchestration loops. The first ReAct pass captures most of the value. A second Reflexion pass catches repeated mistakes. A third or fourth pass? The LLM oscillates, the context grows, the cost compounds.

Loop engineering is the discipline of designing for the gain-cost sweet spot and no further.

Think of it like a restaurant kitchen during a dinner service. The head chef reads the ticket and calls out the order. That is loop one. The sous chef checks the plating before it goes out. That is loop two. If the head chef keeps re-reading the ticket and the sous chef keeps re-checking the same plate, the ticket never leaves the pass. The gain of another check is zero. The cost is a cold meal and a wait.

Kitchens that work well do not stack redundant checks. They make the first pass count and add exactly one verification layer. Agent loops work the same way.

## Where do naive agent loops break in production?

The costliest thing in AI coding is no longer writing code. It is managing the agent loop.

A while true in a terminal doesn't survive a restart. Neither does a long-running process on a VM. When the process dies. A deploy, an OOM, a spot instance reclamation. The loop restarts from scratch. It re-fetches data it already had. It re-calls the LLM for decisions it already made. It sends a duplicate notification. It spawns a duplicate sub-agent. You wake up to three identical Slack messages and a confused team.

The fix isn't better error handling. It is an execution model where each step is checkpointed, each decision is persisted, and recovery means resuming from the last successful step.

## What is the escalation path from simple to production-grade?

The most common mistake teams make is skipping steps. They jump straight to multi-agent before single-agent reaches its ceiling. Multi-agent adds 2-5x the coordination overhead and significantly more debugging surface. The quality gain is often modest unless the failure mode is genuinely decomposable.

Follow this sequence. Do not skip steps.

**Step 1: ReAct.** Build a single-agent ReAct baseline. Measure success rate, tool-call accuracy, latency, and cost on your eval set. The first pass captures most of the value.

**Step 1.5: Measure stability, not just win rate.** A single success rate number hides variance. A model that hits 90% one day and 40% the next is worse than a model that hits 75% every day. The quant research community formalized this as ICIR (information coefficient / information coefficient IR): `mean(score) / std(score)`. A high-variance agent loop has a low ICIR. Track it. Optimize for it. Consistency beats flashy single runs.

**Step 2: Reflexion if failures repeat.** If your eval shows the same mistake across multiple runs, add a self-critique step after each iteration. Latency increases by roughly 30%. Quality improves 10-30% on the failure-mode subset ([Shinn et al. 2023](https://arxiv.org/abs/2303.11366)). Bound the critique-revise cycle to 2-3 attempts. This is the second loop, the one that provides most marginal gain.

**Step 3: Re-anchor at 30+ steps.** At approximately 30 loop iterations, the accumulated context triggers a silent prefix-cache miss that can spike costs 5-10x. Add re-anchor checkpoints that compact the context: summarize what happened, drop less important details, restore a manageable cache profile. This is the single most expensive footgun in production agent loops and almost nobody discusses it.

**Step 4: Verifier-critic for high-stakes outputs.** If output quality matters. Legal, financial, or security-critical code. Add a critic agent that scores output against a rubric. Use different models for generator and critic to prevent collusion. This doubles inference cost on the verified subset but cuts critical errors.

**Step 5: Multi-agent only when single-agent caps out.** Measure a specific failure mode that single-agent cannot address: parallel work, role specialization, perspective diversity. If none of these apply, multi-agent adds overhead without benefit.

**Step 6: Hierarchy or graph for multi-agent.** Avoid swarm in production. Hierarchy (supervisor-worker) when task decomposition is clear. Graph when control flow is conditional and needs observability.

## What are the 8 canonical agent architecture patterns?

Industry taxonomy stabilized around eight patterns across four quadrants ([Agent Architecture Patterns: 2026 Taxonomy](https://www.digitalapplied.com/blog/agent-architecture-patterns-taxonomy-2026)). Most production systems compose 2-3 of these.

**ReAct** is the single-agent default. Strong on reasoning and tool use. Weak past approximately 50 steps without re-anchoring.

**Reflexion** adds post-step self-critique. Reduces repeated failures by 30-50% on coding and math.

**Plan-and-execute** uses a planner agent and a cheaper executor agent. Cheaper at scale. Brittle when plans need mid-run adaptation.

**Supervisor-worker** is the hierarchical multi-agent pattern. A supervisor decomposes tasks to specialized workers. Each worker runs its own loop. The standard pattern for production multi-agent systems.

**Multi-agent debate** puts agents in adversarial positions with a synthesizer. Useful for high-stakes decisions. The failure mode is premature convergence.

**Verifier-critic** pairs a generator with a critic scoring against a rubric. Standard for catching hallucinations in production.

**Graph orchestration** arranges agents as nodes in a directed graph. Strong observability. LangGraph implements this.

**Swarm / blackboard** uses peer topology with no supervisor. Theoretically interesting. Rarely outperforms hierarchy in production.

## What is the three-layer architecture for production loops?

Break the agent loop into three layers. Each maps to a concrete primitive.

**Layer 1: The Loop.** A cron plus a decision-maker. It runs on a schedule or a trigger, evaluates state, and decides what to do next. The cron is the heartbeat. The LLM is the decision-maker. Steps are the durable execution units that checkpoint progress.

**Layer 2: The Skill.** A multi-step, retryable, composable, independently deployable workflow. Not a prompt. The loop is plumbing. The asset is the skill it calls. Each new skill makes every loop more capable. Skills compound.

**Layer 3: The Orchestrator.** The engine that runs everything. Schedules crons. Executes steps. Manages retries. Enforces concurrency limits. Stores run history. Hot-deploys new functions without disrupting running ones.

This reframes the standard mental model. Agents are not LLM plus tools. Agents are loops plus skills plus orchestration.

## What does durable execution look like in practice?

Durable execution means the runtime guarantees your workflow will complete despite crashes, restarts, or evictions. Each step is checkpointed automatically. If the process dies between steps, completed steps don't re-execute on restart.

The practical difference from checkpointing is stark. LangGraph saves graph state at every superstep. CrewAI has persist and task replay. Google ADK has event sourcing. All three save snapshots. None detect the failure, resume automatically, or prevent duplicate execution when two processes try to resume the same thread.

A durable execution engine gives you these four guarantees:

| Requirement | What it means | Why while loop fails |
|---|---|---|
| Step-level retry | Retry step 3, not steps 1 and 2 | Loop restart re-runs everything |
| Sub-agent lifecycle | Spawn child, wait hours, cancel if parent dies | No parent-child management |
| Guaranteed event delivery | Event fires while process is down, still processed | Events lost if process isn't running |
| Post-hoc observability | See every step, decision, retry after the fact | Logs are your only option |

Step-level checkpointing isn't just a correctness feature. It is a money saver. A single LLM call at 200K context on GPT-5.5 costs roughly ₹6,000 ($70). If your agent retries from the beginning on every failure, it re-calls the LLM for every previous step. Multiply by 10 agents. That is the cost difference between durability and starting over.

## What are the open-source options for durable loop infrastructure?

The landscape shifted significantly in mid-2026. Three options stand out.

**Temporal** is the gold standard, battle-tested at Uber, Netflix, Snap, Block, and Retool. In May 2026 at Replay 2026, Temporal announced Serverless Workers and Standalone Activities. Standalone Activities are a significant development for agent loops: activities can now run independently, not just as steps inside a workflow. This means sub-agents have their own lifecycle independent of the parent loop. Temporal also announced official integrations with Google ADK and OpenAI Agents SDK ([Temporal Replay 2026](https://temporal.io/)). The tradeoff remains infrastructure complexity. A Temporal Server cluster needs Cassandra or PostgreSQL plus Elasticsearch.

**Hatchet** shipped v1 on June 11, 2026 ([HN](https://news.ycombinator.com/item?id=43572733)). MIT license, built on Postgres. One database, Docker Compose. The step.run primitive checkpoints automatically. Hatchet was built specifically for AI agent workloads and has the simplest setup of any durable execution engine. The "Durable Execution the Hard Way" GitHub repo by the same team teaches the internals by building an engine from scratch in Go. If you want to self-host with minimal infrastructure, Hatchet is the default.

**Restate** is the newest option. A single binary, no external dependencies. Uses durable promises instead of event-sourcing. The tradeoff is maturity. Restate has the smallest community and is unproven at extreme scale.

| Engine | Setup | Step checkpoint | Dependencies | Best for |
|---|---|---|---|---|
| Temporal | Cluster | Auto (replay) | PSQL + ES | Battle-tested at scale |
| Hatchet v1 | Docker Compose | Auto (step.run) | Postgres only | Self-host simplicity |
| Restate | Single binary | Auto (promises) | None | Simplest deployment |

## What does the self-extending agent look like?

This is where loop engineering becomes self-modifying. The agent doesn't just run inside loops. It authors new loops and registers them with the orchestration engine.

Inngest's Utah project ([github.com/inngest/utah](https://github.com/inngest/utah)) demonstrates this pattern. A sidecar process gives the agent access to the orchestration SDK as a tool. The agent writes multi-step functions, registers them with the engine, and they start running immediately. No PR. No deploy pipeline.

Here is a concrete walkthrough from the Inngest article:

Monday morning, an engineer says: "Our services keep having latency spikes overnight. Nobody notices until morning." The agent writes two functions. A health check loop that runs every 30 minutes, pulling error rates, latency, and resource usage. The LLM classifies system health as normal, degraded, or critical. An incident triage skill that fetches detailed metrics and deploy history, correlates root causes with an LLM, and posts a summary to a Slack channel with recommended actions. Error handling: if the metrics API is down, back off and retry. If the LLM fails, fall back to rule-based severity classification.

A sidecar picks up the function code and registers it. Live in seconds. No human touched the deploy pipeline.

Every 30 minutes, the engine triggers the health check. If something is wrong, it invokes the triage skill. Fully autonomous.

A separate review loop runs weekly. It reads the run history and evaluates performance. If the health check keeps false-positive flagging a service because the thresholds are too sensitive, the review loop catches it. The skill gets updated automatically. The loop improves itself.

Before and after: Monday morning, the engineer would have needed to build the health check manually, set up the cron, configure the alerts, write the triage playbook, and deploy it. That is 2-3 days of work. With the self-extending pattern, the agent does it in minutes. The engineer reviews the output and owns the code.

The agent is ephemeral. Its output is durable. Kill the agent process and restart it. The skills keep running. Swap the underlying model. The skills keep running. Every durable skill is institutional knowledge encoded as executable infrastructure.

## What should you ship on Monday morning?

Here is the actionable verdict.

Start with a single-agent ReAct baseline on Hatchet or Temporal. Hatchet if you want Docker Compose and Postgres. Temporal if you need battle-tested scale. Both give you automatic step-level checkpointing.

Measure success rate, tool-call accuracy, latency, and cost. If failures repeat, add Reflexion as a second loop iteration. If loops exceed 30 steps, add re-anchor checkpoints. If outputs are high-stakes, add a verifier-critic.

Do not add a second agent until you know why the first one cannot do the job.

The gain-cost tradeoff is the lens for every decision. The second loop catches mistakes. The third loop oscillates. Two loops is the sweet spot at every level. The architecture level, the orchestration level, and the organizational level. Design for two. Measure before you escalate.

## FAQ

> **Should I default to ReAct or jump straight to multi-agent?**
> Default to ReAct. The most common mistake teams make is jumping to multi-agent before single-agent reaches its quality ceiling. Build a ReAct baseline. Measure where it caps out. Escalate from there. Multi-agent adds 2-5x the coordination overhead and significantly more debugging surface. The quality gain is often modest unless the failure mode is genuinely decomposable.

> **When does Reflexion justify its latency tax?**
> When your agent repeats the same failure mode across multiple runs. Reflexion adds roughly 30% latency per loop iteration but reduces repeated mistakes by 30-50% on coding and math tasks. Add it when your evaluation shows systematic failure patterns, not random errors.

> **Hatchet or Temporal for a self-hosted agent loop?**
> Hatchet for simpler deployments (one Postgres, Docker Compose, MIT license, agent-focused). Temporal for battle-tested reliability at scale. The decision depends on whether you can run a Temporal Server cluster or need a lighter setup. Temporal's new Standalone Activities (Replay 2026) make it more attractive for agent loops.

> **What is the most expensive footgun in production agent loops?**
> Cache invalidation at approximately 30 steps. The accumulated context triggers a silent prefix-cache miss that spikes costs 5-10x. Add re-anchor checkpoints that compact the context. This is almost never discussed and burns more money than model choice or prompt quality combined.

> **Can the agent write its own loops?**
> Yes, with the right orchestration layer. Inngest's Utah project demonstrates this: a sidecar process gives the agent access to the orchestration SDK as a tool. The agent writes multi-step functions, registers them with the engine, and they start running immediately. The agent becomes a builder of its own infrastructure.

## Related Posts

- [Project Think: the durable execution pattern most agent frameworks ignore](/posts/project-think-durable-execution/). Cloudflare's approach to durable execution with runFiber, stash, and automatic crash recovery — and why checkpointing is different from durable execution.
- [Cloudflare + Flue: the open agent harness stack](/posts/cloudflare-flue-agent-harness-stack/). The three-layer architecture connecting durable primitives to agent frameworks.
- [Your agent is 1.6% model. The rest is the harness.](/posts/claude-code-harness-architecture-98-percent/). What a production-grade agent harness looks like — permission gates, context compaction, tool routing, recovery, and persistence.
- [AI agent error handling patterns](/posts/ai-agent-error-handling-patterns/). Five predictable failure modes in production agents and how to fix each one, including loop detection and circuit breakers.
- [The 2026 LLM architecture landscape: 5 design families and when to use each](/posts/llm-architecture-five-families-2026/). The architecture that runs inside the loop — GQA, MLA, MoE, hybrid attention, and SSMs compared.

---

This article was published on Agentic Up (https://agenticup.dev): practical guides for developers and founders building with AI agents. Reach me at hello@agenticup.dev