Is this survey relevant to someone building agents today?

Yes, specifically the agent-initiated code artifacts pattern. The best production agents already do this: they write tests to verify their own output, create temporary tools for one-off data transformations, and build evaluators for quality checking. The survey codifies a pattern that top practitioners already use.

Code as Agent Harness: What a 102-page survey tells us

Stanford, Meta, and UIUC dropped a 102-page survey on code as agent harness. Three insights: agent-initiated code, code as environment model, and why the harness matters most.

A 102-page survey landed from Stanford, Meta, and UIUC: Code as Agent Harness. It covers 400+ papers across three layers of agent architecture. I sat down with it expecting academic abstraction. What I found was a surprisingly practical framework for how agents should use code.

Not as a final output. As a runtime.

TL;DR: Stanford, Meta, and UIUC surveyed 400+ papers on code as agent harness. The most practical insight: agents that write their own tools, tests, and evaluators mid-task outperform those that don’t. The harness matters more than most teams think. Start having your agent write tests before making changes.

Key takeaways:

Agent-initiated code artifacts are the most underused lever in production agents

Code as environment model means your agent’s state should be executable, not described

Multi-agent coordination through shared code artifacts beats message-passing for complex workflows

The harness is the bottleneck: the model is only as good as the execution layer around it

What does “code as agent harness” mean?

The survey’s central argument is simple: code is not just what agents produce. It’s the medium through which they reason, act, and coordinate.

Every agent has a harness: the software layer between the model and the task. Tools, sandbox, memory, execution loop, policy gates, observability. The survey calls this the “system-provided harness infrastructure.” It’s what I mapped in the 15 jobs every agent harness must do.

The new contribution is what the survey layers on top of that infrastructure: agent-initiated code artifacts. These are code objects that agents create, execute, observe, revise, and persist during task execution. Not the final deliverable. The operational code that helps them get there.

Examples from production agents I’ve built:

An agent writing a regression test to verify a database migration before committing
A code review agent generating a temporary analysis script to check for security patterns
A data pipeline agent writing a custom validator to inspect output quality
A support agent generating a one-shot transformation script for a customer’s malformed data

In every case, the agent wrote code that helped it complete the task. That code was thrown away or evolved into a reusable skill. It was never the deliverable.

Why agent-initiated code artifacts matter

The survey organizes code artifacts into three roles within the harness:

Code for reasoning. The agent externalizes intermediate computation into executable programs. Instead of reasoning about a data transformation in natural language, it writes a Python script, runs it, and inspects the output. The execution trace becomes the reasoning chain. This is more reliable than hoping the model tracks state in its weights.
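A minimal sketch of that idea, assuming the agent's generated script arrives as a plain string: run it in a child Python process and keep the captured output as the trace. The names `run_script` and `ScriptTrace` are hypothetical and not from the survey, and a real harness would run this step inside a sandbox rather than on the host.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class ScriptTrace:
    """What the agent observes after running a script it wrote."""
    stdout: str
    stderr: str
    returncode: int


def run_script(source: str, timeout_s: float = 10.0) -> ScriptTrace:
    """Run agent-written Python in a child process and capture the result.

    Assumption: no sandboxing here; a production harness isolates this step.
    """
    proc = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return ScriptTrace(proc.stdout, proc.stderr, proc.returncode)


# The agent computes the answer instead of estimating it in prose.
script = """
prices = [19.99, 5.00, 12.50]
total = sum(prices)
print(f"{total:.2f}")
"""
trace = run_script(script)
```

The harness can feed `trace.stdout` and `trace.stderr` back into the next model call, so each reasoning step rests on an observed result rather than a remembered one.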

Code for acting. Generated programs serve as executable policies. An agent controlling a GUI writes desktop automation scripts. A robotics agent writes executable motion plans. A DevOps agent writes infrastructure-as-code manifests. The code is the action, not a description of the action.

Code for environment modeling. This is the one most teams miss. The agent uses code to represent its understanding of the environment: a test suite represents verification criteria, a build script represents the deployment pipeline, a trace log represents execution history. The environment becomes inspectable and verifiable in a way that natural language context never is.
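One way to picture this, as a sketch rather than anything the survey prescribes: store verification criteria as named, runnable checks instead of a prose description. `EnvironmentModel` and the migration checks are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

Check = Callable[[dict], bool]


@dataclass
class EnvironmentModel:
    """Verification criteria stored as runnable checks, not prose."""
    checks: Dict[str, Check] = field(default_factory=dict)

    def add_check(self, name: str, check: Check) -> None:
        self.checks[name] = check

    def verify(self, state: dict) -> Dict[str, bool]:
        """Run every check against an observed state snapshot."""
        return {name: bool(check(state)) for name, check in self.checks.items()}


# Hypothetical criteria for a database migration.
env = EnvironmentModel()
env.add_check("no_rows_lost", lambda s: s["rows_after"] == s["rows_before"])
env.add_check("no_null_emails", lambda s: s["null_emails"] == 0)

report = env.verify({"rows_before": 100, "rows_after": 100, "null_emails": 3})
```

Here `report` tells the agent exactly which criterion failed, which is the kind of inspectable state a paragraph of context cannot give it.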

The survey cites evidence that agents using code artifacts for mid-task verification reduce hallucination rates and improve task completion on long-horizon workflows. This matches what I’ve seen: the agents that write their own tests before making changes produce fewer regressions.

What the survey gets right

Three things the survey nails that most agent architecture discussions miss.

First: the harness is the bottleneck.

The survey dedicates significant space to harness mechanisms: planning, memory, tool use, feedback-driven control, and optimization. It argues that the reliability ceiling for most agents is not the model’s reasoning ability but the harness’s ability to convert model outputs into verifiable actions.

This aligns with the LangChain anatomy of an agent harness post, which cites a case where swapping the harness moved a coding agent from Top 30 to Top 5 on Terminal Bench 2.0 without changing the model. The same model, different execution layer, dramatically different results.

Though I should note the counter-evidence: the Agents’ Last Exam benchmark found model choice matters 3x more than harness choice. The two aren’t in conflict. The harness determines what’s possible. The model determines how well it executes within those bounds. A great model in a bad harness beats a bad model in a great harness, but a great model in a great harness beats everything.

Second: code as a shared coordination surface.

When multiple agents work on the same codebase, the survey argues that shared code artifacts beat message-passing for coordination. Instead of agents sending each other natural language summaries, they share tests, traces, build artifacts, and executable workflows. A reviewer agent runs the tests the coder agent wrote. A tester agent executes the trace the planner generated. The code is the coordination protocol.
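A small sketch of code as the coordination surface, assuming each agent publishes Python source as an artifact and a reviewer simply executes it. `CodeArtifact` and `run_shared_tests` are hypothetical names, and the in-process `exec` stands in for whatever sandboxed runner a real system would use.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CodeArtifact:
    author: str  # which agent produced it
    kind: str    # "implementation" or "test"
    source: str  # Python source text


def run_shared_tests(artifacts: List[CodeArtifact]) -> Dict[str, bool]:
    """Reviewer side: load the shared code, then run every test_* function."""
    namespace: dict = {}
    # Load implementations first so the tests can call them.
    for kind in ("implementation", "test"):
        for artifact in artifacts:
            if artifact.kind == kind:
                exec(artifact.source, namespace)  # assumption: trusted or sandboxed
    results: Dict[str, bool] = {}
    for name, obj in list(namespace.items()):
        if name.startswith("test_") and callable(obj):
            try:
                obj()
                results[name] = True
            except AssertionError:
                results[name] = False
    return results


shared = [
    CodeArtifact("coder", "implementation",
                 "def slugify(s):\n    return s.strip().lower().replace(' ', '-')\n"),
    CodeArtifact("coder", "test",
                 "def test_basic():\n    assert slugify(' Hello World ') == 'hello-world'\n"),
    CodeArtifact("tester", "test",
                 "def test_double_space():\n    assert slugify('a  b') == 'a-b'\n"),
]
verdict = run_shared_tests(shared)
```

No agent had to summarize anything for another: the tester's failing test is the message, and the coder can rerun it after a fix.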

This is what dynamic workflows in Claude Code already do: sub-agents produce structured artifacts that other sub-agents consume. The survey just formalizes the pattern.

Third: feedback-driven control loops.

The survey catalogs how agents use execution feedback to revise their own behavior. Static analysis catches type errors before runtime. Test failures trigger re-planning. Runtime exceptions generate corrective patches. The pattern is always the same: execute, observe, revise. Code makes this loop tight because execution feedback is immediate and structured, unlike human feedback, which is slow and subjective.
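The execute, observe, revise loop can be sketched in a few lines. This is an illustration under stated assumptions, not the survey's algorithm: `revise` is a hypothetical callback standing in for a model call, and the code runs in-process where a real harness would use a sandbox.

```python
from typing import Callable, List, Optional, Tuple


def run_and_observe(source: str, check: Callable[[dict], None]) -> Optional[str]:
    """Execute the code and its check; return None on success, else the error."""
    namespace: dict = {}
    try:
        exec(compile(source, "<agent-code>", "exec"), namespace)
        check(namespace)
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


def execute_observe_revise(
    source: str,
    check: Callable[[dict], None],
    revise: Callable[[str, str], str],
    max_rounds: int = 3,
) -> Tuple[Optional[str], List[Optional[str]]]:
    """Loop until the check passes or the round budget runs out."""
    observations: List[Optional[str]] = []
    for _ in range(max_rounds):
        error = run_and_observe(source, check)
        observations.append(error)
        if error is None:
            return source, observations
        source = revise(source, error)  # structured feedback drives the patch
    return None, observations


def check_mean(ns: dict) -> None:
    assert ns["mean"]([2, 4]) == 3
    assert ns["mean"]([]) == 0


def stub_revise(source: str, error: str) -> str:
    # Stand-in for a model call that reads the error and returns a patch.
    return "def mean(xs):\n    return sum(xs) / len(xs) if xs else 0\n"


final, observations = execute_observe_revise(
    "def mean(xs):\n    return sum(xs) / len(xs)\n", check_mean, stub_revise
)
```

The first round fails on the empty list, the error text goes to `revise`, and the second round passes. Capping `max_rounds` keeps a stubborn failure from looping forever.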

What the survey misses

One thing: the deployment reality.

The survey covers 400+ papers on agent architectures but barely mentions cost, latency, or failure modes in production. The agent-initiated code artifacts pattern sounds great in a research environment. In production, every code execution costs API tokens and wall-clock time. An agent that writes and runs five verification scripts per turn burns through context window and budget faster than one that doesn’t.

The teams I’ve seen succeed with code artifacts are disciplined about when to use them. They gate code execution behind policy checks. They set budgets per turn. They use compaction to keep context manageable. The survey describes what’s possible. It doesn’t describe the cost of doing it wrong.
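A sketch of what that discipline might look like in code, assuming a per-turn cap on runs and estimated seconds plus a simple policy gate. `TurnBudget` and `BLOCKED_TOKENS` are hypothetical, and a substring denylist is only a placeholder for a real policy check, not a security boundary.

```python
from dataclasses import dataclass
from typing import Tuple

# Placeholder policy: real gates would be far stricter than substring matching.
BLOCKED_TOKENS = ("os.system", "shutil.rmtree")


@dataclass
class TurnBudget:
    max_runs: int
    max_seconds: float
    runs: int = 0
    seconds: float = 0.0

    def request(self, source: str, est_seconds: float) -> Tuple[bool, str]:
        """Decide whether one more code execution is allowed this turn."""
        if any(token in source for token in BLOCKED_TOKENS):
            return False, "policy"
        if self.runs + 1 > self.max_runs:
            return False, "run budget"
        if self.seconds + est_seconds > self.max_seconds:
            return False, "time budget"
        # Only approved requests are charged against the budget.
        self.runs += 1
        self.seconds += est_seconds
        return True, "ok"


budget = TurnBudget(max_runs=2, max_seconds=5.0)
decisions = [
    budget.request("print(1)", 1.0),
    budget.request("import shutil; shutil.rmtree('/tmp/x')", 1.0),
    budget.request("print(2)", 5.0),
    budget.request("print(3)", 1.0),
    budget.request("print(4)", 1.0),
]
```

Returning a reason alongside the decision lets the agent adapt, for example by skipping an optional verification script when the time budget is nearly spent.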

What this means for your next agent

If you take one thing from this survey, make it the agent-initiated code artifacts pattern. Your agent should be writing code that helps it complete the task, not just code that is the task.

Concretely:

Write tests first. Before your agent modifies a function, have it write a test that captures the expected behavior. Run the test before and after the change. This catches regressions before they reach production.
Use scripts as intermediate state. Instead of tracking data transformations in context, have your agent save each transformation to disk as an executable script. The script is inspectable and rerunnable, and it stays out of the context window until the agent needs it.
Build custom evaluators. Have your agent write a verifier function that checks its own output against criteria you define. Run the verifier before returning results. Reject and retry if it fails.
Evolve scripts into skills. When an agent writes the same kind of helper script twice, have it save it as a reusable skill. The survey calls this “lifelong code-based agents.” I call it not writing the same script twice.
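Two of the steps above, sketched under assumptions: a test-guarded change that runs the same check before and after an edit, and a skill library that promotes a helper the second time the agent writes one of the same kind. `guarded_change`, `SkillLibrary`, and the `kind` label are hypothetical; how an agent decides two scripts are the same kind is left open here.

```python
from collections import Counter
from typing import Callable, Dict


def passes(source: str, check: Callable[[dict], None]) -> bool:
    """Load the code and run the check; any exception counts as a failure."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # assumption: trusted or sandboxed code
        check(namespace)
        return True
    except Exception:
        return False


def guarded_change(before: str, after: str, check: Callable[[dict], None]) -> str:
    """Run the same check before and after the edit and classify the outcome."""
    ok_before, ok_after = passes(before, check), passes(after, check)
    if ok_after:
        return "accept"
    return "regression" if ok_before else "still-failing"


class SkillLibrary:
    """Promote a helper script to a skill on its second appearance."""

    def __init__(self) -> None:
        self.seen: Counter = Counter()
        self.skills: Dict[str, str] = {}

    def record(self, kind: str, source: str) -> bool:
        """Return True once scripts of this kind have been promoted."""
        self.seen[kind] += 1
        if self.seen[kind] >= 2:
            self.skills[kind] = source
        return kind in self.skills


def check_area(ns: dict) -> None:
    assert ns["area"](3, 4) == 12


old = "def area(w, h):\n    return w * h\n"
refactor = "def area(w, h):\n    return h * w\n"
broken = "def area(w, h):\n    return w + h\n"
outcomes = [guarded_change(old, refactor, check_area),
            guarded_change(old, broken, check_area)]

library = SkillLibrary()
promoted = [library.record("csv_cleanup", "print('v1')"),
            library.record("csv_cleanup", "print('v2')")]
```

The `regression` outcome is the case the checklist is aimed at: the check passed on the old code and fails on the new code, so the change is rejected before it ships.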

The survey is at arxiv.org/abs/2605.18747. The companion repo is at github.com/YennNing/Awesome-Code-as-Agent-Harness-Papers. It’s worth reading the planning and memory sections even if you skip the rest.

FAQ

What is code as agent harness? It’s the view that code is not just something agents generate, but the operational medium through which they reason, act, and coordinate. The harness (tools, sandbox, memory, execution loop) sits between the model and the task. Code as harness means agents use program execution as their primary mode of interaction with the environment.

What are agent-initiated code artifacts? Code objects that agents create, execute, observe, and revise during task execution. Examples include writing a regression test to verify a fix, generating a temporary script to transform data, building a custom evaluator to check output quality, or creating a reusable skill from a solved problem.

Does the survey say the harness matters more than the model? It argues they’re co-dependent. The survey cites evidence that changing the harness under a fixed model can shift performance significantly. The Agents’ Last Exam (ALE) benchmark found model choice matters 3x more than harness choice. Both can be true: the harness determines what’s possible, the model determines how well it executes.

What does code as environment model mean? Instead of representing the agent’s world through natural language, the harness uses executable code: codebases represent project state, tests represent verification criteria, execution traces represent history. This makes state inspectable and verifiable in a way free text isn’t.

The 15 jobs every agent harness must do: comprehensive breakdown of harness components
Your AI Agent Just Scaffolded a Project from 2020: what happens when agents execute code without version pinning
Is Your Agent Extension Actually Working?: measuring whether harness improvements actually lift outcomes
AI agent error handling patterns: turning execution feedback into corrective action
The Vertical Agent Method framework: pick one workflow, build one agent, ship in 14 days

Survey: “Code as Agent Harness” by Ning, Tieu, Fu et al. (UIUC, Meta, Stanford). 400+ papers surveyed. May 2026.

This article was published on Agentic Up (https://agenticup.dev): practical guides for developers and founders building with AI agents.