The 15 jobs every agent harness must do
15 jobs, not one. Agent harness architecture reference: turn requests, credentials, model catalog, FSM, skills, system prompt, streaming, policy, approvals, budgets, hooks, sessions, compaction, events, tracing.
If you remember nothing else from this post, remember this:
An agent harness is not one thing. It’s 15 separate jobs, bundled together by the frameworks because nothing underneath gave you a way to compose them.
Most teams don’t choose the bundle: they inherit it. They pick LangChain or LangGraph because it’s the obvious choice, and they accept the tradeoff: every job is in one box, and when one job needs to change, they have to change the whole box.
This post is the reference guide for those 15 jobs. What each one does. Why it matters. How it connects to the others. And which posts go deeper on each.
Bookmark this. It’s your checklist before any production deploy.
TL;DR: An agent harness has 15 jobs, from persisting turn requests to tracing every step. The state machine (job 4) is the conductor; everything else is a station. Frameworks bundle all 15 into one install; the composition model lets you swap any one without touching the others. Start with job 4, then add jobs incrementally.
Key takeaways:
- 15 jobs, not 1: each has a different change frequency and replacement risk
- The state machine (job 4) is the conductor: it orchestrates all other jobs
- Frameworks bundle all 15; composition lets you replace any one independently
- Thin harness = jobs 1-4, 7 (autonomous agents). Thick harness = all 15 (production workflows)
- Every job has a kitchen analogy: use this guide to explain the harness to anyone
The 15 jobs at a glance
| # | Job | Kitchen equivalent | Deep dive |
|---|---|---|---|
| 1 | Accept and persist turn request | Hostess writes reservation in booking book | Logging & monitoring |
| 2 | Resolve credentials per provider | Vendor accounts: know which login to use | - |
| 3 | Look up model capabilities | Equipment reference sheet: what can each station do? | Context window |
| 4 | Drive the per-turn state machine | Head chef at the pass: orchestrates the kitchen | State machine |
| 5 | Load and serve skill bodies | Recipe cards at each station | Function calling |
| 6 | Assemble the system prompt | Pre-service team briefing: mode, identity, available skills | - |
| 7 | Stream tokens back to client | Pass window: plates go out as components are ready | Function calling |
| 8 | Policy-check every tool call | “Are we allowed to use that ingredient?” | Policy gates |
| 9 | Pause for human approval | Sommelier approval for expensive wine | Policy gates |
| 10 | Track LLM spend against budgets | Kitchen accountant: food cost per dish | Error handling |
| 11 | Run hooks before and after tool calls | Quality control, before and after each plate | - |
| 12 | Persist session as branching tree | Ticket log that branches when orders change | Branching sessions |
| 13 | Compact history when context fills | Prep board consolidation: clear space, keep working | Context window |
| 14 | Emit event stream for UI | Display board: every table’s status in real time | Logging & monitoring |
| 15 | OpenTelemetry trace across every step | Kitchen CCTV: one continuous recording of everything | Logging & monitoring |
The 15 jobs in detail
Job 1: Accept a turn request and persist it
What it does: Catches the incoming message, gives it a unique ID, writes it to the session store before anything else happens.
Why it matters: The request exists the moment it arrives. If you don’t persist it immediately and the agent crashes before handling it, you have no proof the request was received. Persisting first gives you a paper trail from the start, and the ID is what all subsequent jobs reference.
Kitchen equivalent: The hostess at a restaurant takes a reservation and writes it in the booking book before the kitchen even knows about it. That log is the proof the request existed.
Links: AI agent logging and monitoring: session state is logged at the start of every turn.
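A minimal sketch of persist-first intake, assuming an in-memory log stands in for the real session store. `TurnRequest`, `SessionStore`, and `accept_turn` are hypothetical names for illustration, not part of any framework:

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from time import time


@dataclass
class TurnRequest:
    session_id: str
    text: str
    turn_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    received_at: float = field(default_factory=time)


class SessionStore:
    """Hypothetical append-only store; a real one would write to disk or a database."""

    def __init__(self) -> None:
        self.log: list[str] = []

    def append(self, request: TurnRequest) -> None:
        self.log.append(json.dumps(asdict(request)))


def accept_turn(store: SessionStore, session_id: str, text: str) -> TurnRequest:
    request = TurnRequest(session_id=session_id, text=text)
    store.append(request)  # persist before any model call or tool runs
    return request


store = SessionStore()
request = accept_turn(store, "session-1", "Summarize the Q3 report")
```

Every later job can reference `request.turn_id`, because the ID exists before anything else happens.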
Job 2: Resolve credentials for the model provider
What it does: Figures out which provider (OpenAI, Anthropic, Google, a local model) is being called for this turn, finds the right API key, and makes sure it’s available when the actual call happens.
Why it matters: You can’t hard-code one API key. Providers change, keys rotate, you might have different keys per workspace or per customer. The harness has to look up the right credential at runtime, every time.
Kitchen equivalent: Every supplier (meat vendor, fish vendor, vegetable vendor) has their own account. When you need to order beef, you have to know which vendor account to charge.
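One way to sketch the runtime lookup, assuming credentials live in environment-style variables. The naming scheme (`<WORKSPACE>_<PROVIDER>_API_KEY` with a provider-wide fallback) is an assumption made for this example, not a convention from any provider:

```python
import os


class CredentialError(LookupError):
    """Raised when no key is configured for the requested provider."""


def resolve_credential(provider: str, workspace: str, env=os.environ) -> str:
    # Hypothetical naming scheme: workspace-specific key first, shared key second.
    candidates = [
        f"{workspace.upper()}_{provider.upper()}_API_KEY",
        f"{provider.upper()}_API_KEY",
    ]
    for name in candidates:
        key = env.get(name)
        if key:
            return key
    raise CredentialError(f"no credential for {provider!r} in workspace {workspace!r}")


# Demo mapping used instead of the real environment.
demo_env = {"ACME_OPENAI_API_KEY": "sk-acme", "OPENAI_API_KEY": "sk-shared"}
acme_key = resolve_credential("openai", "acme", demo_env)
other_key = resolve_credential("openai", "other", demo_env)
```

Because the lookup runs on every turn, a rotated key takes effect on the next call without a redeploy.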
Job 3: Look up what the chosen model can do
What it does: Maintains a catalog of every model available to the agent: context window size, vision capabilities, tool support, streaming support. Checks this catalog before routing a request.
Why it matters: Send a 200-page document to a model with a 4,000-token context window and it fails. Ask a text-only model to do a vision task and it fails. The catalog prevents these failures by routing requests to capable models.
Kitchen equivalent: Before the kitchen starts cooking, someone checks: do we have a grill that can do steaks? What’s the oven’s max temperature? The equipment reference sheet.
Links: AI agent context window management: context window limits are part of the model catalog.
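The catalog check can be sketched as a capability filter. The two entries below are placeholders, not real model specifications, and `pick_model` is a hypothetical helper:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    name: str
    context_tokens: int
    vision: bool
    tools: bool


# Placeholder entries for illustration only.
CATALOG = [
    ModelSpec("small-text", 4_000, vision=False, tools=False),
    ModelSpec("large-multimodal", 200_000, vision=True, tools=True),
]


def pick_model(prompt_tokens: int, needs_vision: bool = False,
               needs_tools: bool = False) -> ModelSpec:
    # Try the smallest context window first, skipping models that cannot do the job.
    for spec in sorted(CATALOG, key=lambda s: s.context_tokens):
        if prompt_tokens > spec.context_tokens:
            continue
        if needs_vision and not spec.vision:
            continue
        if needs_tools and not spec.tools:
            continue
        return spec
    raise ValueError("no model in the catalog can handle this request")
```

Failing at routing time with a clear error is cheaper than failing inside a provider call.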
Job 4: Drive the per-turn state machine
What it does: The conductor. Manages the sequence of states every turn goes through: start → provision → assistant_stream → function_execute → steering_check → teardown. Knows what transitions are valid, handles errors, decides when to loop and when to stop.
Why it matters: This is the heart of the harness. Without it, the agent has no concept of “where it is.” It can run the same tool call 11 times. It doesn’t know when to stop. It doesn’t recover from crashes. The FSM is what makes the agent reliable.
Kitchen equivalent: The head chef at the pass. Knows exactly what stage every dish is at. Calls out the transitions: “risotto out, steak on the plate.” Decides what to do when something goes wrong: remake, substitute, or tell the table.
Links: Build a state machine for your AI agent in a weekend: the full implementation of the 6-state turn FSM.
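As a rough sketch, the six states can be driven by a transition table. The state names come from the sequence above; the exact set of allowed transitions and the `max_loops` cap are assumptions for this example, and the linked post's implementation may differ:

```python
# Assumed transition table for the six states named above.
TRANSITIONS = {
    "start": {"provision"},
    "provision": {"assistant_stream", "teardown"},
    "assistant_stream": {"function_execute", "steering_check", "teardown"},
    "function_execute": {"steering_check", "teardown"},
    "steering_check": {"assistant_stream", "teardown"},
    "teardown": set(),
}


class TurnFSM:
    def __init__(self, max_loops: int = 10) -> None:
        self.state = "start"
        self.loops = 0
        self.max_loops = max_loops
        self.history = ["start"]

    def advance(self, target: str) -> str:
        if target not in TRANSITIONS[self.state]:
            raise ValueError(f"invalid transition {self.state} -> {target}")
        # Looping back to the model counts against the cap; past it, force teardown.
        if self.state == "steering_check" and target == "assistant_stream":
            self.loops += 1
            if self.loops > self.max_loops:
                target = "teardown"
        self.state = target
        self.history.append(target)
        return target
```

The loop cap is what stops the "same tool call 11 times" failure: the turn ends even if the model never decides to stop.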
Job 5: Load and serve skill bodies
What it does: Maintains the catalog of every tool the agent can call: the function schema, the inputs it needs, the errors it might return, when to use it, when not to. Serves these skill bodies on demand so the model knows how to call each tool correctly.
Why it matters: The model doesn’t inherently know how to use your tools. The skill body is what makes a tool discoverable and correctly callable. If the skill body for “send email” is wrong, the model calls it with wrong parameters or doesn’t call it at all.
Kitchen equivalent: Recipe cards at each station. The grill station has a card for ribeye: temperature, resting time, finishing butter. The sauce station has a card for béchamel: roux thickness, when to add milk. Every procedure, documented.
Links: OpenAI function calling tutorial: tool schemas and function calling patterns.
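A skill body can be sketched as a plain dictionary holding a JSON-Schema-style parameter block plus usage guidance. The `send_email` fields and the `when_not_to_use` key are illustrative assumptions, not a schema any provider requires:

```python
# Hypothetical skill catalog keyed by tool name.
SKILLS = {
    "send_email": {
        "description": "Send an email on behalf of the user.",
        "parameters": {
            "type": "object",
            "properties": {
                "to": {"type": "string"},
                "subject": {"type": "string"},
                "body": {"type": "string"},
            },
            "required": ["to", "subject", "body"],
        },
        "when_not_to_use": "Do not use for drafts the user has not confirmed.",
    },
}


def load_skill(name: str) -> dict:
    try:
        return SKILLS[name]
    except KeyError:
        raise LookupError(f"unknown skill {name!r}") from None


def missing_arguments(name: str, args: dict) -> list[str]:
    # Report required parameters the model left out of its call.
    required = load_skill(name)["parameters"]["required"]
    return [param for param in required if param not in args]
```

Checking a call against the skill body before execution catches the "wrong parameters" case described above.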
Job 6: Assemble the system prompt
What it does: Builds the instruction block sent to the model every turn. Assembles it from pieces: the mode paragraph (plan/ask/agent), the identity preamble (who the agent is, how to use tools), the list of available skills, the working directory context.
Why it matters: The system prompt shapes the model’s behavior. Get it wrong and the model doesn’t know it’s an agent, doesn’t know how to use tools, doesn’t know what mode it’s in. The harness has to assemble the right prompt for every turn.
Kitchen equivalent: Pre-service team briefing. “Tonight we’re doing a tasting menu: that’s the mode. You’re the team at Restaurant XYZ: that’s the identity. We have 12 courses planned: here’s the menu. The sommelier is on call if you need wine pairings: that’s a skill available on demand.”
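The assembly step can be sketched as joining the four pieces in a fixed order. The mode wording and the `assemble_system_prompt` helper are assumptions for illustration; only the pieces themselves (mode, identity, skills, working directory) come from the description above:

```python
# Placeholder mode paragraphs; real wording would be tuned per product.
MODE_PARAGRAPHS = {
    "plan": "You are in plan mode: propose steps, do not execute tools.",
    "ask": "You are in ask mode: answer questions, do not modify anything.",
    "agent": "You are in agent mode: you may call tools to complete the task.",
}


def assemble_system_prompt(mode: str, identity: str, skills: dict[str, str],
                           working_dir: str) -> str:
    if mode not in MODE_PARAGRAPHS:
        raise ValueError(f"unknown mode {mode!r}")
    skill_lines = "\n".join(
        f"- {name}: {summary}" for name, summary in sorted(skills.items())
    )
    sections = [
        identity,
        MODE_PARAGRAPHS[mode],
        "Available skills:\n" + skill_lines,
        f"Working directory: {working_dir}",
    ]
    return "\n\n".join(sections)


prompt = assemble_system_prompt(
    "agent",
    "You are a coding assistant.",
    {"read_file": "Read a file from disk", "send_email": "Send an email"},
    "/workspace/project",
)
```

Sorting the skills keeps the prompt stable from turn to turn, which makes diffs and debugging easier.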
Job 7: Stream tokens back to the client
What it does: Catches the model’s streaming response and pushes it to the client (browser, CLI) in real time. The user sees the response as it’s being generated, not after it’s fully done.
Why it matters: Seeing text appear gradually feels responsive. Waiting for the whole response feels slow. Streaming is standard for modern AI interfaces, and the harness has to handle it correctly: managing the connection, handling disconnects, making sure the stream reaches the right client.
Kitchen equivalent: The pass window: plates go out as components are ready. The sauce is done, it goes out. The garnish is placed, it goes out. The customer sees the dish being assembled in front of them.
Links: OpenAI function calling tutorial: streaming with tool call deltas.
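A minimal sketch of forwarding chunks with disconnect handling, assuming the transport raises `ConnectionError` when the client goes away. `stream_to_client` and the `send` callback are hypothetical:

```python
def stream_to_client(chunks, send) -> dict:
    """Forward each chunk as it arrives; stop cleanly if the client disconnects."""
    delivered = []
    for chunk in chunks:
        try:
            send(chunk)
        except ConnectionError:
            # Keep what was delivered so the turn can still be persisted.
            return {"completed": False, "text": "".join(delivered)}
        delivered.append(chunk)
    return {"completed": True, "text": "".join(delivered)}


received = []
result = stream_to_client(iter(["Hel", "lo ", "world"]), received.append)
```

Returning the delivered text lets the harness record exactly what the user saw, even on a dropped connection.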
Job 8: Check every tool call against a policy before it runs
What it does: Every tool the model wants to call goes through one chokepoint: consultBefore. The policy rules say what’s allowed, what’s denied, what needs human approval. The gate returns allow, deny, or needs_approval before any tool executes.
Why it matters: This is the safety gate. Without it, any tool the model decides to call runs immediately: delete files, send emails, spend money. The policy check is what keeps the agent from doing things it shouldn’t.
Kitchen equivalent: Every time a station wants to use a restricted ingredient (“we want to use the truffle, it’s $200 for the portion”), the policy check asks: “Does the ticket allow premium ingredients? Is there a budget for this?”
Links: The policy gate every agent needs before production: the fail-closed pattern, consultBefore implementation, three outcomes.
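A fail-closed gate with the three outcomes might look like the sketch below. The Python name `consult_before` mirrors the consultBefore chokepoint named above, but the rule table and tool names are assumptions for this example:

```python
from typing import Callable

ALLOW, DENY, NEEDS_APPROVAL = "allow", "deny", "needs_approval"

Rule = Callable[[dict], str]

# Hypothetical rules keyed by tool name.
RULES: dict[str, Rule] = {
    "read_file": lambda args: ALLOW,
    "send_email": lambda args: NEEDS_APPROVAL,
    "delete_file": lambda args: (
        DENY if args.get("path", "").startswith("/etc") else NEEDS_APPROVAL
    ),
}


def consult_before(tool: str, args: dict) -> str:
    rule = RULES.get(tool)
    if rule is None:
        return DENY  # fail closed: tools without a rule never run
    try:
        decision = rule(args)
    except Exception:
        return DENY  # fail closed: a broken rule must not open the gate
    return decision if decision in (ALLOW, DENY, NEEDS_APPROVAL) else DENY
```

The important property is that every unexpected path (missing rule, crashing rule, malformed decision) ends in deny rather than allow.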
Job 9: Pause tool calls that need human decision and route the answer back
What it does: Some tool calls pass the policy check but still need a human to say “yes, do this.” These get parked: the turn pauses, the human is notified, the answer routes back into the right turn and the turn resumes exactly where it left off.
Why it matters: Not everything can be fully automated. Customer-facing actions, destructive actions, expensive actions: these need a human in the loop. The harness has to support this without breaking the turn’s state.
Kitchen equivalent: The chef wants to use the restaurant’s last bottle of a rare wine for a table’s order. The policy says “allowed” but the sommelier has to physically approve it. The kitchen doesn’t proceed until the sommelier says yes. When they do, the kitchen continues without re-cooking anything.
Links: The policy gate every agent needs before production: the reactive approval trigger (turn::on_approval) pattern.
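Parking and resuming can be sketched with a queue keyed by approval ID. This is an in-memory illustration under assumed names (`ApprovalQueue`, `park`, `resolve`); a production harness would persist parked calls so they survive a restart:

```python
import uuid


class ApprovalQueue:
    def __init__(self) -> None:
        self.parked: dict[str, dict] = {}

    def park(self, turn_id: str, tool: str, args: dict) -> str:
        approval_id = uuid.uuid4().hex
        self.parked[approval_id] = {"turn_id": turn_id, "tool": tool, "args": args}
        return approval_id

    def resolve(self, approval_id: str, approved: bool, run_tool) -> dict:
        # pop() raises KeyError for unknown or already-answered approvals.
        call = self.parked.pop(approval_id)
        if not approved:
            return {"turn_id": call["turn_id"], "status": "rejected"}
        result = run_tool(call["tool"], call["args"])
        return {"turn_id": call["turn_id"], "status": "executed", "result": result}


queue = ApprovalQueue()
approval_id = queue.park("turn-1", "send_email", {"to": "a@example.com"})
outcome = queue.resolve(approval_id, True, lambda tool, args: f"ran {tool}")
```

The parked record carries the turn ID, which is how the answer finds its way back to the right turn.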
Job 10: Track LLM spend against per-workspace or per-agent budgets
What it does: Every LLM call costs money. The harness tracks spending against budgets set per workspace, per agent, per customer. When a workspace approaches its limit, the harness throttles requests or alerts someone.
Why it matters: Without this, you have no financial visibility. You don’t know which agent is burning through budget, which customer is accidentally running expensive loops, when you’re going to hit a surprise bill.
Kitchen equivalent: The restaurant accountant tracks every dish’s food cost against the menu price. If a table orders 10 portions of the expensive tasting menu, the accountant knows that bill is going to be high. If the monthly ingredient budget is running low, the chef gets alerted.
Links: AI agent error handling patterns: circuit breakers and cost caps as budget enforcement mechanisms.
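The arithmetic is simple enough to sketch: cost per call is tokens divided by 1,000 times the per-1k price, accumulated per workspace. The prices, the 80% alert threshold, and the `BudgetTracker` class are assumptions for this example:

```python
class BudgetTracker:
    def __init__(self, limits_usd: dict[str, float], alert_ratio: float = 0.8) -> None:
        self.limits = limits_usd
        self.spent = {workspace: 0.0 for workspace in limits_usd}
        self.alert_ratio = alert_ratio

    def record(self, workspace: str, input_tokens: int, output_tokens: int,
               usd_per_1k_in: float, usd_per_1k_out: float) -> str:
        cost = (input_tokens / 1000) * usd_per_1k_in \
            + (output_tokens / 1000) * usd_per_1k_out
        self.spent[workspace] += cost
        return self.status(workspace)

    def status(self, workspace: str) -> str:
        ratio = self.spent[workspace] / self.limits[workspace]
        if ratio >= 1.0:
            return "blocked"
        if ratio >= self.alert_ratio:
            return "alert"
        return "ok"


tracker = BudgetTracker({"acme": 1.0})
# Placeholder prices, not any provider's real rates.
first = tracker.record("acme", 100_000, 0, usd_per_1k_in=0.005, usd_per_1k_out=0.015)
second = tracker.record("acme", 80_000, 0, usd_per_1k_in=0.005, usd_per_1k_out=0.015)
third = tracker.record("acme", 100_000, 0, usd_per_1k_in=0.005, usd_per_1k_out=0.015)
```

The state machine can read the returned status before each model call and throttle or stop the turn when it is "blocked".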
Job 11: Run hooks before and after tool calls
What it does: Hooks are side effects that run at specific points: before a tool executes (log it, redact sensitive data) and after it executes (check for errors, update a counter). The harness provides the before/after pattern so you can add behavior without modifying the tool itself.
Why it matters: This is how you add custom behavior (logging, redaction, metrics, other side effects) without touching the tool code. Hooks are composable: add as many as you want.
Kitchen equivalent: Quality control checks. Before each plate goes out: “Did the chef wash their hands? Is the temperature right?” After: “Was the plate returned clean? Was there a complaint?” These checks happen around every action, not as part of the action.
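The before/after pattern can be sketched as a wrapper that threads arguments through before-hooks and the result through after-hooks. `run_with_hooks` and the two example hooks are hypothetical names:

```python
import re

calls = {"count": 0}


def redact_emails(args: dict) -> dict:
    # Before-hook: mask anything that looks like an email address in string arguments.
    pattern = r"[\w.+-]+@[\w-]+\.[\w.]+"
    return {
        key: re.sub(pattern, "[redacted]", value) if isinstance(value, str) else value
        for key, value in args.items()
    }


def count_call(result):
    # After-hook: update a counter, leave the result untouched.
    calls["count"] += 1
    return result


def run_with_hooks(tool, args: dict, before=(), after=()):
    for hook in before:
        args = hook(args)
    result = tool(**args)
    for hook in after:
        result = hook(result)
    return result


def echo(text: str) -> str:
    return f"echo: {text}"


output = run_with_hooks(
    echo, {"text": "mail bob@example.com now"},
    before=[redact_emails], after=[count_call],
)
```

The tool (`echo` here) never learns that redaction or counting happened, which is the point of keeping hooks outside the tool.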
Job 12: Persist the session as a branching tree
What it does: Stores the full conversation history as a tree, not a line. Each turn is a node with optional children. When the user asks “what if we tried X instead of Y?”, a new branch forks from the last common node. The original branch stays intact.
Why it matters: Linear sessions break the moment you want to explore a branch. With a branching tree, you never lose the main thread. You can fork, explore, and return, or keep both branches and compare.
Kitchen equivalent: The kitchen’s ticket log with a twist. When a customer changes an order mid-way, the kitchen writes a new ticket that branches off the original. The original order (what was started before the change) is still in the log. The kitchen can go back to it.
Links: Why your agent forgets conversations (and how to fix it with a branching tree): the branching tree model, fork and resume implementation.
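A tree of turns can be sketched with parent pointers: forking is just adding a second child to an existing node. `TurnNode` is a hypothetical class for illustration, and the linked post's model may store more per node:

```python
class TurnNode:
    def __init__(self, text: str, parent: "TurnNode | None" = None) -> None:
        self.text = text
        self.parent = parent
        self.children: list["TurnNode"] = []

    def reply(self, text: str) -> "TurnNode":
        # Adding a child never modifies existing siblings, so old branches stay intact.
        child = TurnNode(text, parent=self)
        self.children.append(child)
        return child

    def path(self) -> list[str]:
        # Walk up to the root to rebuild the history for this branch only.
        node, texts = self, []
        while node is not None:
            texts.append(node.text)
            node = node.parent
        return texts[::-1]


root = TurnNode("plan a trip")
paris = root.reply("go to Paris")
hotel = paris.reply("book a hotel")
rome = root.reply("what if we went to Rome instead?")
```

`hotel.path()` and `rome.path()` share the root but nothing else, so the context sent to the model depends on which leaf you resume from.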
Job 13: Compact session history when the context window fills up
What it does: When the conversation gets long enough that the context window starts filling, the harness compacts the history through summarization, selective forgetting, or compression, so the agent can keep running without hitting the wall.
Why it matters: Without compaction, the agent hits the context window limit and either drops old history (losing context) or refuses new requests (breaking the agent). Compaction lets the agent keep running on long conversations.
Kitchen equivalent: The kitchen’s prep board can only hold so many orders. When it’s full, the chef reviews the board, consolidates similar tickets, clears space. “We’re still working on the same 5 tables, just more efficiently.” The kitchen keeps running.
Links: AI agent context window management: sliding windows, summarization, structured memory, and when to use each.
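One compaction strategy, sketched under simplifying assumptions: token counts are approximated by word counts, and the summarizer is a stand-in that a real harness would replace with a model call. `compact` and `estimate_tokens` are hypothetical helpers:

```python
def estimate_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


def compact(history: list[str], max_tokens: int, keep_recent: int = 2,
            summarize=None) -> list[str]:
    total = sum(estimate_tokens(message) for message in history)
    if total <= max_tokens or len(history) <= keep_recent:
        return history
    if summarize is None:
        # Placeholder summarizer; a real one would call a model.
        summarize = lambda msgs: f"[summary of {len(msgs)} earlier messages]"
    split = len(history) - keep_recent
    older, recent = history[:split], history[split:]
    return [summarize(older)] + recent


history = [f"message {i} " + "word " * 8 for i in range(5)]
compacted = compact(history, max_tokens=30)
```

The most recent messages survive verbatim, and everything older collapses into one summary entry.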
Job 14: Emit an event stream for the UI to subscribe to
What it does: The UI needs to know what’s happening inside the agent in real time: tool calls, results, approval requests, turn endings. The harness emits events on topics, and the UI subscribes to the events it needs.
Why it matters: Without this, the UI is blind. It sends a request and waits for a final response with no visibility into what’s happening in the middle. With an event stream, the UI shows “the agent is calling the email tool” in real time.
Kitchen equivalent: The kitchen display board. “Table 7: steak is being cooked, table 12: soup is plated, table 3: waiting for manager approval on wine choice.” Events appear as they happen, not just at the end.
Links: AI agent logging and monitoring: structured event logging, JSON Lines format, replay patterns.
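Topic-based emission can be sketched as a small publish/subscribe bus. The `agent_end` topic appears later in this post; `tool_call` and the `EventBus` class are assumed names for illustration:

```python
from collections import defaultdict


class EventBus:
    def __init__(self) -> None:
        self.subscribers = defaultdict(list)

    def subscribe(self, topic: str, handler) -> None:
        self.subscribers[topic].append(handler)

    def emit(self, topic: str, payload: dict) -> dict:
        event = {"topic": topic, **payload}
        for handler in self.subscribers[topic]:
            handler(event)
        return event


bus = EventBus()
ui_feed = []
bus.subscribe("tool_call", ui_feed.append)   # the UI only listens to what it needs
bus.emit("tool_call", {"tool": "send_email", "turn_id": "turn-1"})
bus.emit("agent_end", {"turn_id": "turn-1"})  # no subscriber, so nothing reaches the UI
```

In a real deployment the handlers would push over a websocket or server-sent events rather than append to a list.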
Job 15: Carry one OpenTelemetry trace across every step
What it does: Every operation in the turn is tagged with the same session/message/function IDs. When something goes wrong, you can see the full chain: which session, which turn, which function call, how long it took, what it returned.
Why it matters: Without tracing, debugging a failing agent is like finding a leak without knowing which floor the pipe is on. You know something went wrong but you can’t see the path. With tracing, you can pinpoint exactly where the failure happened.
Kitchen equivalent: Full CCTV recording of every shift. When a dish goes wrong, you can rewind and see exactly what happened: “the sous chef added the sauce at the wrong time.” You follow one order’s journey from placement to service.
Links: AI agent logging and monitoring: decision point logging, structured JSON, replay debugging.
The state machine is the conductor
All 15 jobs are connected by one: job 4, the per-turn state machine.
The FSM is the conductor. Everything else is a station the conductor talks to.
Turn comes in
→ Job 1: Persist request
→ Job 2: Resolve credentials
→ Job 3: Look up model capabilities
→ Job 4: push FSM (orchestrates everything below)
    → Job 5: Load skill bodies
    → Job 6: Assemble system prompt
    → Job 7: Stream tokens back to client
    → Job 8: Policy check every tool call
    → Job 9: Pause for human approval if needed
    → Job 10: Track spend against budget
    → Job 11: Run before/after hooks
    → Job 12: Persist session as branching tree
    → Job 13: Compact history when context fills
    → Job 14: Emit events for UI
    → Job 15: OTel trace across everything
When the FSM transitions from assistant_stream to function_execute, it triggers the policy gate (job 8). When the FSM transitions to steering_check, it evaluates whether to continue the loop. When the turn ends, whether it stopped normally or failed, the FSM calls teardown, which triggers job 12 (persist) and job 14 (emit the agent_end event).
Every job is a station. The FSM is the train that moves between them.
Thin vs thick: the slider
The 15 jobs aren’t all-or-nothing. You can run a thin harness or a thick one by adding or removing jobs from your config:
Thin harness: Jobs 1, 2, 3, 4, 7. No approvals, no budgets, no hooks, no compaction, no tracing. For autonomous research agents where you trust the model. The agent runs fast and loose.
Thick harness: All 15. For production customer-facing workflows where every tool call needs to be auditable, every dollar tracked, every action logged and traceable. The agent runs with guardrails.
The distance between thin and thick isn’t a rewrite. It’s a config change. Same wire protocol, same trace shape, same observability story. The slider moves by adding and removing workers from your config.
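The slider can be sketched as a set of enabled job numbers. The worker names are hypothetical labels derived from the table at the top of this post, and treating job 4 as mandatory is an assumption based on its role as conductor:

```python
# Hypothetical worker names, one per job in the table above.
JOBS = {
    1: "turn_intake", 2: "credentials", 3: "model_catalog", 4: "state_machine",
    5: "skills", 6: "system_prompt", 7: "streaming", 8: "policy",
    9: "approvals", 10: "budgets", 11: "hooks", 12: "sessions",
    13: "compaction", 14: "events", 15: "tracing",
}

THIN = {1, 2, 3, 4, 7}
THICK = set(JOBS)


def build_harness(enabled: set[int]) -> list[str]:
    unknown = enabled - set(JOBS)
    if unknown:
        raise ValueError(f"unknown jobs: {sorted(unknown)}")
    if 4 not in enabled:
        raise ValueError("job 4 (state machine) is required to drive the others")
    return [JOBS[number] for number in sorted(enabled)]


thin_workers = build_harness(THIN)
thick_workers = build_harness(THICK)
```

Moving from thin to thick is then a matter of adding numbers to the set, for example `THIN | {8, 9}` to add the policy gate and approvals.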
The framework trap
The reason the 15 jobs exist as a list is that most teams discover them by hitting them, one by one, in production, when something breaks.
The framework trap is this: you pick a framework (LangChain, LangGraph, CrewAI) and it ships all 15 jobs in one box. It works great for the first few months. Then you need to replace the policy engine (job 8) because your security requirements changed. You find out you can’t just swap it: it’s baked into the framework’s loop. You have two choices: fight the framework, or rewrite the harness from scratch.
The alternative is the composition model: each job is a separate worker on a shared bus. The policy engine is a worker. The credential resolver is a worker. The session store is a worker. Replace any one by writing a new worker that registers the same function IDs. The rest of the stack doesn’t change.
That’s the architectural bet underneath everything in this post. The 15 jobs are not a design choice: they’re a fact about what an agent harness has to do. The design choice is whether you bundle them or compose them.
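The composition model can be sketched as a bus where workers register handlers under function IDs. The `Bus` class and the `policy::consult_before` ID are assumptions for this example, styled after the `turn::on_approval` trigger mentioned earlier:

```python
class Bus:
    def __init__(self) -> None:
        self.handlers = {}

    def register(self, function_id: str, handler) -> None:
        # Last registration wins, so a replacement worker takes over the ID.
        self.handlers[function_id] = handler

    def call(self, function_id: str, **kwargs):
        return self.handlers[function_id](**kwargs)


def allow_all_policy(tool: str, args: dict) -> str:
    return "allow"


def strict_policy(tool: str, args: dict) -> str:
    return "allow" if tool == "read_file" else "needs_approval"


def execute_tool(bus: Bus, tool: str, args: dict) -> str:
    # The caller only knows the function ID, never which worker is behind it.
    decision = bus.call("policy::consult_before", tool=tool, args=args)
    return f"{tool}: {decision}"


bus = Bus()
bus.register("policy::consult_before", allow_all_policy)
before_swap = execute_tool(bus, "send_email", {})
bus.register("policy::consult_before", strict_policy)
after_swap = execute_tool(bus, "send_email", {})
```

`execute_tool` is unchanged across the swap, which is the property the bundled frameworks make hard to get.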
Start here
If you’re building your own harness, start with job 4: the state machine. Get a turn flowing through the 6 states with a real model call. Everything else builds on top of that foundation.
The state machine post has the full implementation. The policy gates post has the safety layer. The branching sessions post has the session persistence model.
This post is the map. Those posts are the trailheads.
Agent mode: The 15 jobs are the complete picture of what an agent harness does. Bookmark this reference. Use it as a checklist when evaluating frameworks, when designing your own stack, and before any production deployment. Every job is a place where something can go wrong, and every job is a place where something can be replaced when it does.
Related Posts
Read Build a state machine for your AI agent in a weekend for the full FSM implementation: the 6 states, valid transitions, error handling, teardown.
Read The policy gate every agent needs before production for jobs 8 and 9: fail-closed policy checks, the three outcomes, the reactive approval trigger.
Read Why your agent forgets conversations for job 12: the branching tree model for session persistence that doesn’t lose context.
Read AI agent context window management for job 13: compaction strategies that keep the agent running on long conversations.
Read AI agent logging and monitoring for jobs 14 and 15: event streams, structured logging, and replay debugging.
Read AI agent multi-step workflows for how workflow patterns layer on top of the state machine: sequential, parallel, conditional, human-in-the-loop.
A 2026 survey on AI agent architectures maps the production agent stack across tools, memory, and guardrails. The Reddit AI Agents community discusses real-world agent harness configurations.
This article was published on Agentic Up (https://agenticup.dev): practical guides for developers and founders building with AI agents.