Building pi-loop: A Research-Backed Recursive Agent Loop
The decompose-solve-critique-iterate-synthesize loop I built as a Pi extension, with the research papers that shaped each decision.
pi-loop is a recursive agent loop that breaks complex tasks into 8-15 tiny sub-problems, solves them in parallel, critiques each solution with majority voting, and recursively deepens where the critic finds gaps. It runs as a Pi extension and ships with zero dependencies.
Install it with pi install npm:@agenticup/pi-loop, then run /reload.
TL;DR: I built a Pi extension that implements a decompose-solve-critique-iterate-synthesize loop backed by 12 research papers. Extreme decomposition into micro-tasks eliminated timeouts. Adaptive majority voting caught critic blind spots. The loop analyzed and improved its own code.
Key takeaways:
- Extreme decomposition into 8-15 micro-tasks eliminates nearly all timeouts and errors
- A backward precondition check catches what forward decomposition misses
- Adaptive majority voting in critique prevents false positives without 3x cost
- Deeper decomposition on failed sub-problems beats re-prompting
The first time I ran pi-loop with 22 sub-problems, every single one returned a solution. Zero timeouts. Zero errors.
That wasn’t the case four hours earlier.
I started with a scaffold that decomposed into 4-6 broad sub-problems. They timed out constantly. The critique phase ran one critic per solution and got confused by conflated concerns. The synthesis prompt concatenated without checking for contradictions. It worked, but barely.
What changed was the research.
I spent the day reading twelve papers on task decomposition, iterative refinement, multi-agent voting, and loop architecture. Each one pointed to a specific weakness. Each one suggested a concrete fix.
The Pipeline
pi-loop runs six stages:
Phase 1: Forward decompose, 8-15 tiny sub-problems [MAKER]
Phase 1.5: Backward precondition check [DRIP]
Phase 2: Solve in parallel, semaphore concurrency
Phase 3: Adaptive critique, 1 critic escalating to 3 [MAKER]
Phase 4: Iterate or deeper decompose [ADaPT]
Phase 5: Synthesize with conflict detection [DRAGON + GSA]
Phase 1: Extreme Decomposition
The old decompose prompt said “break into 4-6 sub-problems.” The model produced broad topics like “design retry logic with backoff, jitter, max retries, and DLQ” with four concerns crammed into one sub-problem. The sub-agent would time out. The critic would evaluate four unrelated things at once.
MAKER (arxiv 2511.09030) changed that. MAKER showed that extreme decomposition into micro-tasks, paired with voting-based error correction, can push error rates close to zero. The paper reports solving a million-step LLM task with zero errors.
The new prompt:
- Aim for 8-15 sub-problems
- Each sub-problem covers exactly ONE concern
- Design tasks: solvable in 2-3 sentences
- Review tasks: covers ONE file
Sub-problems went from “Design full retry logic with backoff, jitter, DLQ” to “Choose backoff formula” and “Pick jitter strategy.” Each sub-agent had less to do. Timeouts dropped to zero.
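The rules above can be sketched as a prompt builder plus a sanity check on the reply. This is a minimal illustration, not pi-loop's source: the names buildDecomposePrompt, parseSubProblems, and countInRange are hypothetical, and the one-sub-problem-per-line reply format is an assumption.

```typescript
// Assumed bounds, taken from the "8-15 sub-problems" rule in the text.
const MIN_SUBPROBLEMS = 8;
const MAX_SUBPROBLEMS = 15;

// Hypothetical prompt builder that encodes the four decomposition rules.
function buildDecomposePrompt(task: string): string {
  return [
    `Task: ${task}`,
    `Break this into ${MIN_SUBPROBLEMS}-${MAX_SUBPROBLEMS} sub-problems, one per line.`,
    "Each sub-problem covers exactly ONE concern.",
    "Design tasks: solvable in 2-3 sentences.",
    "Review tasks: cover ONE file.",
  ].join("\n");
}

// Parse one sub-problem per non-empty line, stripping "1." / "1)" / "-" / "*" prefixes.
function parseSubProblems(reply: string): string[] {
  return reply
    .split("\n")
    .map((line) => line.replace(/^\s*(?:[-*]|\d+[.)])\s*/, "").trim())
    .filter((line) => line.length > 0);
}

// True when the decomposition landed inside the target range.
function countInRange(subProblems: string[]): boolean {
  return subProblems.length >= MIN_SUBPROBLEMS && subProblems.length <= MAX_SUBPROBLEMS;
}
```

A reply that falls outside the range is a cheap signal to re-run the decompose step before spending any solve calls.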
Phase 1.5: The Backward Pass
Forward decomposition misses things. You break a task into parts based on what is in front of you, not based on what must be true first.
DRIP (Decompositional Reasoning, OpenReview 2025) fixes this with a backward pass. After forward decompose, pi-loop asks the model: “For each sub-problem, what precondition must hold before solving it? If a precondition isn’t covered by another sub-problem, it’s missing.”
The first version was too aggressive. It flagged obvious things like “need a database before you can design queries.” I tightened it to three criteria:
- The sub-problem has no solution without it
- It isn’t common knowledge
- It isn’t already implied by another sub-problem
Now it catches genuinely missing pieces without bloating the problem set.
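One way to picture the three criteria is as a filter over candidate preconditions. The sketch below assumes the model labels each candidate with three flags; the Precondition shape and the missingPreconditions name are hypothetical, not taken from pi-loop.

```typescript
// Hypothetical shape: one candidate precondition, labelled by the model.
interface Precondition {
  text: string;
  blocking: boolean;        // the sub-problem has no solution without it
  commonKnowledge: boolean; // obvious enough that flagging it adds noise
  coveredBy: string | null; // id of a sub-problem that already implies it, if any
}

// Keep only candidates that satisfy all three criteria from the text.
function missingPreconditions(candidates: Precondition[]): Precondition[] {
  return candidates.filter(
    (c) => c.blocking && !c.commonKnowledge && c.coveredBy === null,
  );
}
```

Under this filter, "need a database before you can design queries" is dropped as common knowledge, while a blocking, non-obvious, uncovered precondition becomes a new sub-problem.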
Phase 2: Semaphore Concurrency
The solve phase runs sub-agents in parallel, gated by a semaphore rather than a batch pool. A batch pool waits for the whole group to finish before starting the next group, so one slow task holds up every slot. A semaphore launches all tasks immediately and only gates how many run at once. With a limit of five, for example, task 3 finishing early lets task 6 start right away.
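A minimal sketch of that gating pattern, assuming nothing about pi-loop's internals: the Semaphore class and solveAll helper are illustrative names, and the tasks are plain async functions standing in for sub-agent calls.

```typescript
// Minimal counting semaphore: at most `limit` tasks hold a permit at once.
class Semaphore {
  private waiters: Array<() => void> = [];
  private available: number;

  constructor(limit: number) {
    if (!Number.isInteger(limit) || limit < 1) {
      throw new Error("limit must be a positive integer");
    }
    this.available = limit;
  }

  private async acquire(): Promise<void> {
    if (this.available > 0) {
      this.available -= 1;
      return;
    }
    // No permit free: park until release() hands one over.
    await new Promise<void>((resolve) => this.waiters.push(resolve));
  }

  private release(): void {
    const next = this.waiters.shift();
    if (next) next(); // pass the permit straight to the next waiter
    else this.available += 1;
  }

  // Run one task under the semaphore, always releasing the permit afterwards.
  async run<T>(task: () => Promise<T>): Promise<T> {
    await this.acquire();
    try {
      return await task();
    } finally {
      this.release();
    }
  }
}

// Start every task immediately; the semaphore decides how many actually run.
async function solveAll<T>(
  tasks: Array<() => Promise<T>>,
  limit: number,
): Promise<PromiseSettledResult<T>[]> {
  const gate = new Semaphore(limit);
  return Promise.allSettled(tasks.map((task) => gate.run(task)));
}
```

Because a finished task hands its permit directly to the next waiter, a slot never sits idle while work is queued, which is the property a batch pool lacks.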
Phase 3: Adaptive Majority Voting
A single critic has blind spots. It might miss an edge case. It might hallucinate a problem that doesn’t exist. I’ve seen both happen.
MAKER’s error correction via voting solves this. But running 3 critics for every sub-problem is expensive. The majority of solutions pass the first critic.
The adaptive approach:
1. Run one critic
2. If critic says PASS, accept it
3. If critic says ITERATE, escalate to 2 more critics
4. With 3 votes total, the majority wins. A tie, possible only when an escalated critic fails to return a verdict, goes to PASS.
This saves two of every three critique calls (about 67%) on solutions that pass the first critic, while keeping majority-vote reliability on flagged ones.
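// First pass: a single critic decides whether escalation is needed.
const firstCritic = await subAgent(critiquePrompt, options);
if (firstCritic.trim().startsWith("ITERATE")) {
  // Flagged: run critics 2 and 3 in parallel; allSettled tolerates a failed call.
  const moreCritics = await Promise.allSettled(
    [2, 3].map(() => subAgent(critiquePrompt, options)),
  );
  // Tally firstCritic plus the fulfilled moreCritics; the majority decides.
}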
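The snippet above stops before the tally. Here is a self-contained sketch of the full escalation, under stated assumptions: Critic is a hypothetical wrapper around one critic sub-agent call, verdicts are read from the start of the reply as in the snippet, and a failure of the first critic simply propagates.

```typescript
type Verdict = "PASS" | "ITERATE";

// Hypothetical: wraps a single critic sub-agent call and returns its raw reply.
type Critic = () => Promise<string>;

// Anything that does not start with ITERATE is treated as PASS.
function parseVerdict(raw: string): Verdict {
  return raw.trim().startsWith("ITERATE") ? "ITERATE" : "PASS";
}

async function adaptiveCritique(
  critic: Critic,
): Promise<{ verdict: Verdict; calls: number }> {
  const first = parseVerdict(await critic());
  if (first === "PASS") return { verdict: "PASS", calls: 1 };

  // Flagged: escalate to two more critics and tolerate individual failures.
  const extra = await Promise.allSettled([critic(), critic()]);
  const votes: Verdict[] = [first];
  for (const result of extra) {
    if (result.status === "fulfilled") votes.push(parseVerdict(result.value));
  }

  const iterate = votes.filter((v) => v === "ITERATE").length;
  const pass = votes.length - iterate;
  // ITERATE needs a strict majority; a tie (one critic failed) goes to PASS.
  return { verdict: iterate > pass ? "ITERATE" : "PASS", calls: 3 };
}
```

The calls field makes the saving visible: a solution that passes the first critic costs one call instead of three.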
Phase 4: As-Needed Deeper Decompose
When a sub-problem fails critique, the default response is to re-prompt: “Your solution missed X. Fix it.”
ADaPT (arxiv 2311.05772) suggests a different path. Instead of refining the same sub-problem, decompose it further into smaller pieces. Solve each piece independently. Then synthesize back into a refined solution.
The ADaPT path only activates when maxDepth > 1. At max depth, it falls back to direct refinement.
if (problem.depth < maxDepth) {
  // Decompose into 3-5 smaller parts
  // Solve each in parallel
  // Synthesize into refined solution
} else {
  // Direct refine with critic feedback
}
The kitchen analogy: Phase 4 is the head chef at the pass. The head chef does not cook. They orchestrate. When a station falls behind, the head chef does not take over. They break the task into smaller pieces and redistribute. ADaPT does the same thing.
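The branch above can be fleshed out as a recursive function. This is a sketch under assumptions, not pi-loop's implementation: the Hooks interface and every method on it are hypothetical stand-ins for sub-agent calls, and the root problem is assumed to start at depth 1.

```typescript
interface Problem {
  text: string;
  depth: number; // assumed to start at 1 for the root problem
}

// Hypothetical hooks standing in for sub-agent calls.
interface Hooks {
  solve(p: Problem): Promise<string>;
  critique(p: Problem, solution: string): Promise<{ pass: boolean; feedback: string }>;
  decompose(p: Problem, feedback: string): Promise<string[]>;
  refine(p: Problem, solution: string, feedback: string): Promise<string>;
  synthesize(p: Problem, parts: string[]): Promise<string>;
}

async function solveWithAdapt(p: Problem, maxDepth: number, h: Hooks): Promise<string> {
  const solution = await h.solve(p);
  const review = await h.critique(p, solution);
  if (review.pass) return solution;

  if (p.depth < maxDepth) {
    // Failed critique with depth to spare: split further instead of re-prompting.
    const parts = await h.decompose(p, review.feedback);
    const solved = await Promise.all(
      parts.map((text) => solveWithAdapt({ text, depth: p.depth + 1 }, maxDepth, h)),
    );
    return h.synthesize(p, solved);
  }

  // At max depth: fall back to direct refinement with the critic's feedback.
  return h.refine(p, solution, review.feedback);
}
```

With maxDepth set to 1 the recursive branch never runs, which matches the fallback to direct refinement described above.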
Phase 5: Synthesis with Conflict Detection
The final phase takes all sub-problem solutions and synthesizes them into one answer. This is where contradictions surface.
Sub-problem A might say “use Redis for caching.” Sub-problem B might say “in-memory is faster, skip Redis.” Without conflict detection, the synthesis model picks one arbitrarily.
DRAGON (arxiv 2601.06502) adds explicit conflict resolution. The synthesis prompt now includes every sub-problem’s solution, every critic verdict, and an instruction to check for contradictions. PASS solutions take precedence over ITERATE solutions.
Generative Self-Aggregation (arxiv 2503.04104) reinforces this. GSA reports that generating a new answer from all candidates outperforms picking the best one. LLMs are good at generative combination and bad at discriminative judgment. Synthesis plays to their strength.
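A synthesis prompt along those lines might be assembled as follows. The wording of the instructions, the Reviewed shape, and the buildSynthesisPrompt name are all assumptions made for illustration; only the ingredients (every solution, every verdict, a contradiction check, PASS over ITERATE) come from the text.

```typescript
// Hypothetical shape: one sub-problem's solution plus its final critic verdict.
interface Reviewed {
  id: string;
  solution: string;
  verdict: "PASS" | "ITERATE";
}

function buildSynthesisPrompt(task: string, items: Reviewed[]): string {
  // List PASS solutions first so reading order matches the precedence rule.
  const ordered = [...items].sort(
    (a, b) => Number(a.verdict === "ITERATE") - Number(b.verdict === "ITERATE"),
  );
  const sections = ordered.map(
    (item) => `Sub-problem ${item.id} [${item.verdict}]\n${item.solution}`,
  );
  return [
    `Task: ${task}`,
    "Sub-problem solutions with critic verdicts:",
    ...sections,
    "Before answering, list any contradictions between these solutions.",
    "Where two solutions conflict, prefer the one marked PASS over ITERATE.",
    "Then write one new combined answer rather than selecting a single solution.",
  ].join("\n\n");
}
```

The last instruction reflects the GSA point: the model is asked to generate a combined answer, not to judge which candidate is best.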
The Trajectory Tax
Every sub-agent call creates a full Pi session. It loads extensions, resolves models, checks auth. With 22 sub-problems, that is about 50 session creations per run.
The AgentDiet paper (arxiv 2509.23586) finds that agent trajectories contain 39 to 59 percent useless, redundant, or expired information. Removing it does not hurt performance.
For pi-loop, the accumulated log sends the full output history on every update. Truncating to 40 lines prevented unbounded growth without losing useful output.
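The truncation can be as small as the helper below. It assumes the most recent 40 lines are the ones worth keeping, which the text does not state explicitly; tailLines is a hypothetical name.

```typescript
// Keep only the last `maxLines` lines of an accumulated log.
// Assumption: the newest output is the useful part, so older lines are dropped.
function tailLines(log: string, maxLines: number = 40): string {
  if (!Number.isInteger(maxLines) || maxLines < 1) {
    throw new Error("maxLines must be a positive integer");
  }
  const lines = log.split("\n");
  if (lines.length <= maxLines) return log;
  return lines.slice(-maxLines).join("\n");
}
```

Applied on every update, this caps the payload at a fixed size no matter how long the run gets.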
The session creation cost is the one bottleneck I haven’t solved. Pi’s SDK doesn’t expose a session reset, so each sub-agent starts fresh. It’s the price of isolation.
Cost
The research papers are free. The API calls are not.
A typical run with 14 sub-problems takes about 30-40 model calls. At opencode-go rates, that is around ₹60-80 ($0.72-0.96) per run. That is comparable to asking the same question 5-6 times and synthesizing.
The tradeoff: one deep run replaces five or six shallow attempts. The output comes structured, critiqued, and conflict-checked.
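A back-of-envelope estimate shows where a 30-40 call range can come from. The breakdown is an assumption, not measured from pi-loop: one call each for decompose, backward check, and synthesis; one solve and one first critic per sub-problem; two extra critics per flagged sub-problem; one direct refinement per sub-problem that loses the vote. Deeper decomposition would add more.

```typescript
// Rough call-count model under the assumptions stated above.
function estimateCalls(
  subProblems: number,
  flagged: number,            // sub-problems whose first critic said ITERATE
  refined: number = flagged,  // sub-problems that still lose the 3-way vote
): number {
  const fixed = 3;                // decompose + backward check + synthesis
  const solves = subProblems;     // one solve call each
  const firstCritics = subProblems;
  const extraCritics = 2 * flagged;
  return fixed + solves + firstCritics + extraCritics + refined;
}
```

Under this model, 14 sub-problems with none flagged come to 31 calls and with three flagged come to 40, which brackets the range above. The same arithmetic gives 50 calls for 22 sub-problems with one flagged.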
What I Would Change
Three things I would do differently:
- Session pooling. If the SDK added a reset method, pre-creating N sessions before the solve phase would eliminate the creation wait from the critical path.
- Token budget. The self-healing orchestrator paper (arxiv 2606.01416) recommends exposing a total cost cap. pi-loop has per-agent timeouts but no overall budget. A run with 40 sub-problems could surprise a user.
- Complexity-guided decomposition. ACONIC (arxiv 2510.07772) models task complexity as constraint treewidth. Asking the decompose model to rate each sub-problem’s complexity and split the complex ones further would make decomposition adaptive instead of fixed-count.
Limitations
- Sub-agents are stateless. They can’t learn from earlier phases. Each one starts with a fresh session. This is by design (no cross-contamination), but it limits depth.
- The model parameter is cosmetic. You can request a specific model for sub-agents, but it’s appended as a prompt hint, not a routing instruction. True model routing needs SDK-level support.
- Twenty minutes per sub-agent. The timeout is generous because complex sub-problems can be slow. Typically they complete in 30-60 seconds, but the safety window is wide.
FAQ
When should I use pi-loop? Architecture decisions, multi-step analysis, code reviews, research synthesis. Not for simple lookups.
Does it work with any model? Sub-agents use your current Pi session model. The model parameter is a hint, not a router.
How is this different from asking Pi twice? pi-loop decomposes, solves in parallel, critiques with voting, and iterates. Asking twice gives you two attempts at the same approach. pi-loop gives you N approaches checked by independent critics.
What is the failure mode? Sub-agents time out occasionally. They get a fallback value and the loop continues. The critic catches gaps and triggers iteration. The worst case is a weaker answer, not a crashed run.
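The timeout-with-fallback behavior can be sketched generically. This is an illustration, not pi-loop's code: withTimeout is a hypothetical helper, it only substitutes the fallback on timeout (a rejection still propagates), and it does not cancel the underlying work.

```typescript
// Resolve with `fallback` if `work` has not settled within `ms` milliseconds.
// Note: the underlying work keeps running; this only stops waiting for it.
function withTimeout<T>(work: Promise<T>, ms: number, fallback: T): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => resolve(fallback), ms);
    work.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      },
    );
  });
}
```

A caller would wrap each sub-agent call with a 20-minute limit and a placeholder string, so a stalled sub-problem yields a weak answer for the critic to flag instead of blocking the run.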
Related Posts
- Loop engineering: the production agent loop nobody talks about. The article that motivated pi-loop, the gain-cost tradeoff of iterative refinement.
- The Vertical Agent Method. Pick one workflow, build one agent, ship in 14 days.
- Pi extensions: connecting agents to external tools. How the Pi extension system works.
- AI agent error handling patterns. Recovery patterns that prevent cascading failures in production agents.
Agent mode: pi-loop implements a recursive decompose-solve-critique-iterate-synthesize pipeline backed by MAKER, ADaPT, DRIP, DRAGON, Self-Refine, and SGH research. Each sub-problem gets an independent Pi session with fresh context. Critiques use adaptive majority voting. Flagged sub-problems decompose further instead of naive re-prompting.
This article was published on Agentic Up (https://agenticup.dev): practical guides for developers and founders building with AI agents.