---
title: "Project Think: the durable execution pattern most agent frameworks ignore"
canonical: "https://agenticup.dev/posts/project-think-durable-execution/"
pubDate: "2026-06-20T00:00:00.000Z"
description: "Most agent frameworks offer checkpoints that shift failure handling to you. Cloudflare's Project Think embeds durability at the runtime level. The difference matters when your agent is 5 steps into a 10-step plan and the process dies."
tags: [cloudflare, project-think, durable-execution, agent-frameworks, crash-recovery, runfiber, durable-objects, actor-model, production-agents, agent-infrastructure]
---

**TL;DR:** Most agent frameworks offer checkpoints, not durable execution. LangGraph saves graph state per superstep. CrewAI has @persist and task replay. Google ADK event-sources its sessions. All three require you to detect failures, resume manually, and prevent duplicate execution yourself. Cloudflare Project Think takes a different approach: it embeds durability at the runtime level using Durable Objects, fibers with mid-execution checkpointing, and automatic crash recovery. For production agents that run for hours or days, this isn't a nice-to-have. It is the difference between an agent that occasionally loses its train of thought and one that always finishes what it started.

> **Key takeaways:**
> - Checkpointing isn't durable execution. LangGraph, CrewAI, and Google ADK all save state, but you must detect failures, resume manually, and build distributed locking to prevent duplicates. The framework saves the snapshot. Everything else is your problem.
> - Cloudflare Project Think embeds durability at the runtime level using Durable Objects as the actor model foundation. Every agent is an addressable entity with embedded SQLite that survives hibernation, restarts, and evictions.
> - runFiber() checkpoints execution mid-task using ctx.stash(). If the agent is evicted, onFiberRecovered() fires on the next activation with the last checkpoint. The agent resumes from where it stopped, not from the beginning.
> - The economic model inverts container-based agents. Durable Objects consume zero compute when idle, so an agent that is 99% dormant costs roughly 1% of an always-on container. At 10,000 agents that are each active 1% of the time, this is the difference between about 100 concurrently active instances and 10,000 always-on containers.
> - LangGraph uses checkpointers that require manual invoke(None, config) with the correct thread_id. CrewAI's @persist doesn't auto-resume — you add conditional skip logic to every method. Google ADK's event sourcing still relies on you to detect failures and re-invoke. None of them provide runtime-guaranteed completion.

---

I spent a week building an agent on LangGraph last month. The agent worked correctly on the happy path. It called tools, routed between sub-graphs, and produced the right output. Then I killed the process mid-execution to test crash recovery.

The graph state was saved in PostgreSQL. Every superstep checkpoint was there. The thread_id was intact. I called invoke(None, config) to resume. It worked, for that one test. Then I tried to scale it. Two processes crashed simultaneously. Both tried to resume the same thread_id. I spent three days building distributed locking, failure detection, and a watchdog that would restart workflows. By day four I realized I was building what the runtime should have given me.

LangGraph isn't unusual here. CrewAI has @persist. Google ADK has event sourcing with ResumabilityConfig. All three save state. None of them guarantee your agent finishes. That guarantee requires durable execution — and Cloudflare's Project Think is the first agent platform to ship it as a first-class primitive.

## What is durable execution, and why do agents need it?

Durable execution means the runtime guarantees your workflow will complete despite crashes, restarts, or evictions. The runtime detects the failure, restores the last checkpoint, and resumes automatically. You don't write failure detection, retry logic, or distributed locking. The platform handles it.

Agents need this more than traditional applications because agents are long-running by nature. A single agent session can span minutes of active reasoning with an LLM, hours of waiting for tool results, or days of human-in-the-loop approvals. During that time, the process can die at any moment — a deploy, a platform restart, a resource limit, a network partition.

The worst part isn't the crash itself. It is what happens to the agent's reasoning. An agent that has completed 7 of 10 planning steps, gathered intermediate results, and is waiting for a tool response doesn't just lose those steps. It loses the thread of reasoning. The next token it was about to predict, the chain of thought it was building, the context it had accumulated — all gone. Restoring from a state snapshot gets you back to a known point, but the agent doesn't remember what it was doing.

True durable execution solves this by checkpointing not just state but progress. The agent resumes from the last checkpoint with its execution context intact.

## How does checkpointing fall short in practice?

The Diagrid analysis of LangGraph, CrewAI, and Google ADK identifies the same three gaps across all three frameworks.

**No automatic failure detection.** If your process crashes, no one knows. LangGraph has no supervisor or watchdog. CrewAI's @persist saves state but doesn't monitor whether the flow is alive. Google ADK's event sourcing preserves history but doesn't detect that the current invocation died. Your workflow is dead until something external notices — and that something is you.

**No automatic resumption.** Once you detect the failure, you must resume manually. LangGraph requires invoke(None, config) with the correct thread_id. CrewAI requires from_pending(flow_id) followed by resume(). Google ADK requires re-invoking with the original invocation_id. At scale, with hundreds of concurrent workflows, you are now building a failure-detection-and-retry infrastructure.

**No duplicate execution prevention.** If two processes try to resume the same thread_id simultaneously — entirely possible during a partial failure in a distributed system — none of these frameworks have built-in coordination. You are now responsible for distributed locking, lease coordination, and idempotency keys.

LangGraph's checkpointer saves state at every superstep. CrewAI's @persist decorator writes to SQLite after each method. Google ADK appends events to the session store. All three save snapshots. None of them automate the detection, resumption, and coordination that turn a snapshot into a guarantee.
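
To make the second and third gaps concrete, here is a minimal sketch of the lease logic you end up writing yourself. `LeaseTable`, `tryAcquire`, and `expired` are hypothetical names, not part of any of the three frameworks, and the table lives in memory only for illustration. In a real deployment the check-and-set would have to run atomically in a shared store.

```typescript
// Hypothetical lease table, not part of LangGraph, CrewAI, or Google ADK.
interface Lease {
  owner: string;
  expiresAt: number; // epoch milliseconds
}

class LeaseTable {
  private leases = new Map<string, Lease>();

  // Grants the lease if it is free, expired, or already held by `owner`.
  tryAcquire(threadId: string, owner: string, now: number, ttlMs: number): boolean {
    const current = this.leases.get(threadId);
    if (current && current.owner !== owner && current.expiresAt > now) {
      return false; // another live worker is resuming this thread
    }
    this.leases.set(threadId, { owner, expiresAt: now + ttlMs });
    return true;
  }

  // A watchdog would poll this to find threads whose worker stopped renewing.
  expired(now: number): string[] {
    const stalled: string[] = [];
    this.leases.forEach((lease, threadId) => {
      if (lease.expiresAt <= now) stalled.push(threadId);
    });
    return stalled;
  }
}

const table = new LeaseTable();
const firstClaim = table.tryAcquire("thread-1", "worker-a", 0, 30000);
const rivalClaim = table.tryAcquire("thread-1", "worker-b", 1000, 30000);
const stalledThreads = table.expired(31000);
const takeover = table.tryAcquire("thread-1", "worker-b", 31000, 30000);
```

Here `worker-b` is refused while `worker-a` holds a live lease and only takes over after the lease lapses. The lapse doubles as the failure signal a watchdog would act on.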

## How does Project Think solve this at the runtime level?

Project Think builds on Cloudflare's Durable Objects — the actor model runtime that gives every agent an identity, persistent state, and the ability to wake on message. This isn't a framework layer. It is the platform itself.

A Durable Object is an addressable entity with its own embedded SQLite database. It hibernates when idle and wakes on demand. State, SQL data, schedules, and fiber checkpoints survive hibernation and restarts. In-memory variables, running timers, open fetch requests, and local closures don't — but that is the point. The platform boundary between durable and ephemeral is explicit and well-defined.

Think of it like a neighborhood restaurant that keeps a prep list on a clipboard. The clipboard hangs on the kitchen wall. It survives the lunch rush, a health inspection, a power outage. The head chef's mental notes about the exact pinch of salt in the special sauce don't survive. The clipboard does. Durable Objects are the clipboard. Fibers are the checkmarks the chef makes next to each prep item.

**runFiber()** provides crash-recoverable execution. It persists a row in SQLite for the duration of the work. The fiber has a name, a status, and a stash — a JSON blob that the agent writes to at checkpoints. When the agent calls ctx.stash(), the platform persists the checkpoint to SQLite synchronously. If the agent is evicted mid-task, the fiber row survives. On the next activation — triggered by any event source — onFiberRecovered() fires with the last checkpoint.

The pattern is simple: checkpoint before expensive work, recover from the last checkpoint. This isn't automatic replay of every step. You decide what recovery means for your domain.

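```typescript
// Method on an agent class. gatherResources, runSubAgent, and updateTaskStatus
// are the agent's own helpers.
async executeTask(task: Task) {
  await this.runFiber(`task:${task.id}`, async (ctx) => {
    // Checkpoint once the inputs are gathered, before the expensive step.
    const resources = await this.gatherResources(task);
    ctx.stash({ phase: "prepared", resources, task });

    // Checkpoint again so recovery can skip the sub-agent run.
    const result = await this.runSubAgent(task, resources);
    ctx.stash({ phase: "executed", result, task });

    await this.updateTaskStatus(task.id, "complete", result);
  });
}
```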
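
The recovery side of that pattern can be illustrated with a small simulation. This is not the Agents SDK: `FiberStore`, `runTask`, and the `Checkpoint` shape are hypothetical stand-ins, the store is an in-memory map rather than SQLite, and the steps are synchronous to keep the sketch short. It only shows the idea of reading the last stash and skipping the phases it already covers.

```typescript
// Simulation only: FiberStore stands in for the fiber row the platform keeps in SQLite.
// None of these names come from the Agents SDK.
type Checkpoint =
  | { phase: "prepared"; resources: string[] }
  | { phase: "executed"; resources: string[]; result: string };

class FiberStore {
  private rows = new Map<string, string>();

  stash(fiber: string, checkpoint: Checkpoint): void {
    this.rows.set(fiber, JSON.stringify(checkpoint)); // the stash is a JSON blob
  }
  load(fiber: string): Checkpoint | undefined {
    const raw = this.rows.get(fiber);
    return raw === undefined ? undefined : (JSON.parse(raw) as Checkpoint);
  }
  complete(fiber: string): void {
    this.rows.delete(fiber);
  }
}

let gatherCalls = 0;
function gatherResources(): string[] {
  gatherCalls += 1;
  return ["doc-a", "doc-b"];
}

// Runs the task, skipping any phase the last checkpoint already covers.
function runTask(fiber: string, store: FiberStore, crashAfterPrepare: boolean): string {
  const last = store.load(fiber);
  const resources = last ? last.resources : gatherResources();
  if (!last) store.stash(fiber, { phase: "prepared", resources });
  if (crashAfterPrepare) throw new Error("simulated eviction");

  const result =
    last && last.phase === "executed" ? last.result : `summary of ${resources.join(", ")}`;
  store.stash(fiber, { phase: "executed", resources, result });
  store.complete(fiber);
  return result;
}

const fiberStore = new FiberStore();
let evicted = false;
try {
  runTask("task:42", fiberStore, true);
} catch {
  evicted = true; // the in-memory call stack is gone, the stash is not
}
const recovered = runTask("task:42", fiberStore, false);
```

The second call finds the `prepared` checkpoint and does not gather resources again, which is the behavior you would implement inside a recovery handler.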

**startFiber()** adds idempotency keys, retained status records, and cancellation on top of the same fiber machinery. It is the right choice for webhooks and external callbacks where the provider may retry delivery — the idempotency key guarantees the agent processes the request exactly once.
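
The deduplication idea behind idempotency keys can be sketched in a few lines. `IdempotentRunner` and `FiberRecord` are hypothetical, in-memory analogues and say nothing about how startFiber() is implemented. The sketch also assumes the work succeeds, so no record is retained if `work` throws.

```typescript
// Hypothetical analogue of an idempotency-keyed start; not the Agents SDK API.
interface FiberRecord {
  state: "completed";
  result: string;
}

class IdempotentRunner {
  private records = new Map<string, FiberRecord>();

  // Runs `work` only the first time a key is seen; later calls get the retained record.
  start(key: string, work: () => string): { duplicate: boolean; record: FiberRecord } {
    const existing = this.records.get(key);
    if (existing) return { duplicate: true, record: existing };
    const record: FiberRecord = { state: "completed", result: work() };
    this.records.set(key, record);
    return { duplicate: false, record };
  }
}

let chargesMade = 0;
const runner = new IdempotentRunner();
const handleWebhook = (deliveryId: string) =>
  runner.start(deliveryId, () => {
    chargesMade += 1;
    return `charged for ${deliveryId}`;
  });

const firstDelivery = handleWebhook("evt_123");
const retriedDelivery = handleWebhook("evt_123"); // provider retry, same key
```

The retried delivery returns the retained record and the side effect runs once.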

**keepAlive()** prevents the agent from hibernating during active work that takes longer than the idle eviction window of about 70-140 seconds. Streaming an LLM response, orchestrating a multi-step tool chain, waiting on a slow API — keepAlive() creates a heartbeat that resets the inactivity timer. The recommended approach is keepAliveWhile(), which guarantees the heartbeat is cleaned up when work finishes.
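
The reason a scoped wrapper is safer than a bare heartbeat can be shown with a generic sketch. `startHeartbeat` and `heartbeatWhile` are hypothetical helpers that mirror the idea, and the real SDK methods may differ in shape. The point is the `try`/`finally`: the disposer runs whether the work resolves or throws.

```typescript
// Hypothetical helpers that mirror the idea; the real SDK methods may differ in shape.
let activeHeartbeats = 0;

function heartbeatCount(): number {
  return activeHeartbeats;
}

// Starts a periodic heartbeat and returns an idempotent disposer.
function startHeartbeat(beat: () => void, intervalMs: number): () => void {
  const timer = setInterval(beat, intervalMs);
  activeHeartbeats += 1;
  let disposed = false;
  return () => {
    if (disposed) return;
    disposed = true;
    activeHeartbeats -= 1;
    clearInterval(timer);
  };
}

// Keeps the heartbeat running exactly as long as `work` is pending.
async function heartbeatWhile<T>(
  beat: () => void,
  intervalMs: number,
  work: () => Promise<T>,
): Promise<T> {
  const dispose = startHeartbeat(beat, intervalMs);
  try {
    return await work();
  } finally {
    dispose(); // runs on success and on failure
  }
}
```

With a bare heartbeat, a thrown error between start and dispose leaves the timer running and the agent never becomes idle. The wrapper removes that failure mode.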

| Duration | Strategy |
|---|---|
| Seconds | Normal request handling |
| Minutes | keepAlive() / keepAliveWhile() with disposer |
| Minutes to hours | startFiber() with waitForCompletion |
| Hours to days | Async pattern: start job, hibernate, wake on completion callback |

## What about persistent sessions and agent memory?

Durable execution solves crash recovery. Persistent sessions solve the problem of what happens to context between turns.

Project Think's Session API stores conversations as a relational tree, not a linear list. Each message has a parent_id. The agent can fork conversations — explore alternative solutions in parallel without polluting the primary reasoning path. It can compact older branches non-destructively, preserving the summary while reducing storage cost.
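
A parent-linked message tree is easy to picture in code. The sketch below is a hypothetical data model that follows the prose (`parentId` plays the role of parent_id), not the real Session API. `SessionTree`, `append`, and `pathTo` are illustrative names.

```typescript
// Hypothetical data model; the field names follow the prose, not the real Session API.
interface SessionMessage {
  id: number;
  parentId: number | null;
  text: string;
}

class SessionTree {
  private messages = new Map<number, SessionMessage>();
  private nextId = 1;

  // Appends a message under `parentId`; reusing an older parent creates a fork.
  append(parentId: number | null, text: string): SessionMessage {
    const message: SessionMessage = { id: this.nextId++, parentId, text };
    this.messages.set(message.id, message);
    return message;
  }

  // Walks parent links from a leaf back to the root and returns root-first text.
  pathTo(leafId: number): string[] {
    const path: string[] = [];
    let current = this.messages.get(leafId);
    while (current) {
      path.unshift(current.text);
      current = current.parentId === null ? undefined : this.messages.get(current.parentId);
    }
    return path;
  }
}

const tree = new SessionTree();
const question = tree.append(null, "Why is the build failing?");
const mainBranch = tree.append(question.id, "Try clearing the cache.");
const forkBranch = tree.append(question.id, "Try pinning the dependency.");

const mainPath = tree.pathTo(mainBranch.id);
const forkPath = tree.pathTo(forkBranch.id);
```

Both paths share the root question, and neither branch sees the other's messages, which is what keeps an exploratory fork out of the primary reasoning path.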

Context Blocks are structured, persistent sections of the system prompt that the model can query and update. The agent proactively manages its own learned facts. It can add a fact, update it when new evidence arrives, and remove it when it becomes irrelevant. This isn't vector search over past conversations. It is a structured key-value store that the agent controls through the same tool-calling interface it uses for everything else.
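
As a rough illustration of the add, update, and remove cycle, here is a hypothetical `ContextBlock` class. The class, its methods, and the rendered format are assumptions made for this sketch and do not reflect the actual Project Think interface.

```typescript
// Hypothetical shape for a context block; not the real Project Think API.
class ContextBlock {
  private label: string;
  private facts = new Map<string, string>();

  constructor(label: string) {
    this.label = label;
  }

  // Adds a fact or overwrites it when new evidence arrives.
  set(key: string, value: string): void {
    this.facts.set(key, value);
  }

  // Drops a fact that is no longer relevant; returns false if it was absent.
  remove(key: string): boolean {
    return this.facts.delete(key);
  }

  // Renders the block as the text that would sit in the system prompt.
  render(): string {
    const lines: string[] = [];
    this.facts.forEach((value, key) => lines.push(`- ${key}: ${value}`));
    return [`[${this.label}]`, ...lines].join("\n");
  }
}

const preferences = new ContextBlock("user_preferences");
preferences.set("language", "Python");
preferences.set("timezone", "UTC");
preferences.set("language", "TypeScript"); // updated on new evidence
preferences.remove("timezone");
const rendered = preferences.render();
```

Each method maps naturally onto a tool the model could call, which is the contrast with similarity search over past transcripts.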

Cloudflare Agent Memory, announced during Agents Week 2026, adds a managed service layer for persistent agent memory — recall, forgetting, and cross-session context. It is a separate managed service rather than a runtime primitive, which means it comes with its own cost and latency characteristics.

## What does this mean for the economics of running agents?

The economic argument for Durable Objects is as important as the technical one.

A container costs the same whether it is actively processing or waiting for input. An agent that is 99% dormant and 1% active still costs 100% of a container. At 10,000 agents — a reasonable number for a team or small company — that is 10,000 always-on instances.

Durable Objects invert this. An agent exists as an addressable entity with persistent state, but consumes zero compute when hibernated. When something happens, the platform wakes the agent, loads its state from SQLite, and hands it the event. The agent does its work, then goes back to sleep. No containers. No idle costs. No external databases.

The comparison is stark:

| Dimension | VMs / Containers | Durable Objects |
|---|---|---|
| Idle cost | Full compute, always | Zero (hibernated) |
| Scaling | Provision and manage capacity | Automatic, per-agent |
| State | External database required | Built-in SQLite |
| Recovery | You build it (process mgrs, health checks) | Platform restarts, state survives |
| Identity/Routing | Load balancers, sticky sessions | Built-in (name to agent) |
| 10K agents at 1% active | 10,000 always-on instances | ~100 active at any moment |

Project Think doesn't make the economics slightly better. It makes the deployment model for agents fundamentally different from everything that came before.
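
The last row of the table is simple arithmetic, sketched here under the assumption that activity is spread evenly across agents. It counts instances only. Actual billing depends on pricing that this article does not cover, and `concurrentlyActive` is an illustrative helper.

```typescript
// Instance counts only; real billing depends on pricing the article does not give.
function concurrentlyActive(agentCount: number, activePercent: number): number {
  return (agentCount * activePercent) / 100;
}

const agentCount = 10000;
const alwaysOnContainers = agentCount; // one container per agent
const activeObjects = concurrentlyActive(agentCount, 1); // expected awake at any moment
const instanceRatio = alwaysOnContainers / activeObjects;
```

At 1% activity the container model keeps 100 times as many instances running as are doing work at any moment.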

## What are the real tradeoffs?

The Cloudflare approach isn't universally superior. The tradeoffs matter for specific use cases.

**Cold start latency.** A hibernated agent takes time to wake. The Durable Object must be loaded from storage, SQLite data must be read, and the runtime must reconstruct state. For latency-sensitive applications where every millisecond counts, a warm container may be a better fit despite the cost.

**Computation-bound workloads.** Agents that spend most of their time doing heavy computation rather than waiting for I/O — batch data processing, model inference at scale — benefit less from the idle-cost savings. If your agent is 90% compute and 10% waiting, the Durable Object advantage disappears.

**Ecosystem maturity.** Durable Objects are Cloudflare-specific. If you need to run agents across multiple cloud providers, on-premises, or in hybrid deployments, the portability cost is significant. Temporal offers durable execution across arbitrary infrastructure but requires managing your own cluster.

**The thinking-mode integration tax.** Project Think's fiber recovery works with any LLM call, but the agent harness needs to be Think-aware. The Flue framework wraps this into a clean abstraction, but if you are building directly on the Agents SDK, you need to understand the fiber lifecycle, the stash API, and the onFiberRecovered() contract.

## Where does durable execution go next?

Temporal has been the reference implementation of durable execution since before agents were the primary workload. The Durable AI Agent Bundle from Temporal makes the case that every production agent needs this foundation. Cloudflare's contribution is making it serverless and embedding it in the platform rather than requiring a separate cluster.

Three trends will determine whether durable execution becomes standard in agent frameworks.

First, the durable execution vs checkpointing distinction will harden. Teams that run agents in production will stop accepting "checkpoint support" as sufficient and will demand runtime-guaranteed completion. The frameworks that embed durability at the runtime level will win the production deployment story.

Second, the open-source agent frameworks will add durable execution layers. LangGraph already has checkpointers that can be wired to Temporal. The question is whether frameworks ship durable execution as a first-class primitive or as an integration you must assemble yourself. The Diagrid analysis suggests the latter is still the default.

Third, the economic model of Durable Objects will pressure container-based agent hosting. When one platform charges for idle time and another does not, the cost difference at scale isn't marginal. It determines whether agent deployment at 100,000 concurrent agents is economically viable.

The frameworks that solve durability at the runtime level won't just be better at crash recovery. They will define what production-grade agent infrastructure looks like.

## FAQ

> **Can I use Project Think with any LLM provider?**
> Yes. The Think SDK and Agents SDK are provider-agnostic. You configure an LLM provider API key (OpenAI, Anthropic, DeepSeek, etc.) and the SDK handles the integration. The durable execution primitives work independently of which model you use.
>
> **How does this compare to Temporal for agent workloads?**
> Temporal is the gold standard for durable execution across arbitrary infrastructure but requires managing your own Temporal cluster. Cloudflare's approach is serverless — no cluster management, no scaling decisions, no infrastructure to operate. The tradeoff is vendor lock-in to Cloudflare's platform vs Temporal's portability.
>
> **What happens to WebSocket connections during fiber recovery?**
> Project Think's Think base class wraps chat turns in recoverable fibers by default. If the Durable Object is evicted mid-stream, Think reconstructs any buffered chunks, persists partial LLM responses, and notifies the client on reconnection. WebSocket state per connection is also persisted.
>
> **Can I use runFiber with existing LangGraph or CrewAI agents?**
> Not directly. runFiber is a Cloudflare Agents SDK primitive. You would need to port your agent to the Agents SDK or use a wrapper like Flue that abstracts the underlying primitives. The Flue framework provides Markdown skills, CLI, and multi-cloud deploy targets including Cloudflare.
>
> **Does durable execution guarantee exactly-once processing?**
> Not automatically. runFiber provides at-least-once delivery with checkpoint recovery. For exactly-once semantics, use startFiber() with idempotency keys, which prevents duplicate processing of the same external event. The platform guarantees the idempotency check is atomic with the fiber creation.

## Related Posts

- [Cloudflare + Flue: the open agent harness stack](/posts/cloudflare-flue-agent-harness-stack/). The three-layer architecture: Cloudflare Agents SDK (platform), Pi/Think (harness), Flue (framework).
- [DeepSeek V4's hybrid attention changes your agent context budget](/posts/deepseek-v4-agent-context-budget/). What million-token affordable context means for agent design, and how multi-model routing becomes the default deployment pattern.
- [AI agent error handling patterns](/posts/ai-agent-error-handling-patterns/). Five predictable failure modes in production agents and how to fix each one.
- [Your agent is 1.6% model. The rest is the harness.](/posts/claude-code-harness-architecture-98-percent/). What a production-grade agent harness looks like — permission gates, context compaction, tool routing, recovery, and persistence.
- [AWS just turned the agent harness into a managed service](/posts/aws-bedrock-agentcore-harness-managed/). Bedrock AgentCore harness went GA on June 18. Two API calls to a production agent. Durable execution with managed memory and auto-tracing.

---

This article was published on Agentic Up (https://agenticup.dev): practical guides for developers and founders building with AI agents. Reach me at hello@agenticup.dev
