Tutorials

The Operator Loop Stack: Five Layers of Production Agent Loops

A deep dive into the operator loop stack — harness, loop contract, state layer, checker, and human checkpoint. How each layer answers one failure mode of autonomous AI agents, with concrete Claude Code implementations for every layer.

Claude Skills TeamJuly 3, 202610 min read
#loop-engineering#operator-loop#autonomous-agents#agent-harness#claude-code

When loop engineering took off in mid-2026, practitioners converged on a five-layer model for loops that actually survive production: the operator loop stack. This guide walks each layer — the failure mode it answers, and how to implement it with Claude Code primitives today.

Why a Stack, Not a Script

The naive autonomous loop is a while-loop around an agent: while not done: agent.run(). Everyone who ships one learns the same four lessons within a week:

  1. The agent doesn't know what "done" means (so it never stops, or stops early)
  2. It loses its place when context compacts (so it repeats or forgets work)
  3. It grades its own homework (so "done" means "claimed done")
  4. Nobody caught the one action that mattered (so autonomy cost more than it saved)

Each lesson is one missing layer. The stack exists because every layer answers a specific, predictable failure mode.

Layer 1: The Harness — "Act With What?"

The environment underneath the loop: tools, skills, permissions, connectors, memory. This layer is harness engineering's whole job, and it's a prerequisite — a loop on a weak harness just automates failure.

Minimum viable harness for a coding loop: a methodology skill set (Superpowers), scoped permissions, and version control the loop can't escape.

Failure mode answered: the agent improvises procedure and touches things it shouldn't.

Layer 2: The Loop Contract — "Done Means What?"

A machine-checkable definition of done, written before the loop starts:

Done means:
- npm test exits 0
- npx tsc --noEmit is clean
- diff confined to src/billing/
- CHANGELOG entry added

Two properties make a contract loop-grade: every line is verifiable by a command (not "code is clean"), and it includes a scope fence (the diff confined to line). Contracts with a judgment call in them ("UX feels better") belong to a human checkpoint, not a loop contract.

Failure mode answered: the loop can't terminate correctly because "done" lives in someone's head.

Layer 3: The State Layer — "Where Were We?"

Everything the loop knows that must outlive the context window: plan files, task lists, worktree isolation, progress logs. The design test: kill the session mid-loop; a fresh session should resume from files alone.

Concrete implementation: a plan document with checkbox tasks, Claude Code's task tracking, one git worktree per loop, and notes-as-files for anything discovered along the way (file-based knowledge tools like Obsidian Markdown shine here).

Failure mode answered: compaction amnesia — the loop repeats finished work or abandons half-done work.

Layer 4: The Checker — "Says Who?"

Verification the agent cannot skip, in escalating cost order:

  1. Hooks — formatter/linter after every edit, test gate before every commit. Deterministic.
  2. The contract's commands — the actual test suite, type check, build.
  3. Fresh-context review — a sub-agent that sees only the diff and the contract, so it can't inherit the implementer's assumptions. Two passes (spec compliance, then code quality) catch different failure classes.
  4. Adversarial verification — for claims that matter, a checker prompted to refute rather than confirm.

The layer's law: trust the checker, not the transcript. "All tests pass" in prose is a claim; an exit code is a fact. Eval-driven development extends this from single runs to the loop itself — measuring whether your loop changes actually improve outcomes.

Failure mode answered: self-graded homework.

Layer 5: The Human Checkpoint — "Who Signs?"

The narrow gate where a person reviews: merge, deploy, publish, spend. Everything before it is autonomous; nothing after it happens without sign-off.

Design principles: the checkpoint receives a finished, verified unit (branch pushed, checks green, summary written) — not a 40-turn transcript to archaeology through. And it's placed at irreversibility, not at effort: a loop can write ten thousand lines unsupervised, but it can't push one commit to main.

Failure mode answered: the one action that mattered went unreviewed.

Composing the Stack

A production loop reads like this: the harness defines what the agent can do; the contract defines success; the state layer carries progress; the checker converts claims into facts; the checkpoint guards irreversibility. Collections like Everything Claude Code ship most of the stack pre-wired — skills for procedure, hooks for checking, sub-agents for review, even a loop-operator agent for monitoring long-running loops.

Loops built this way scale sideways naturally: three worktrees, three contracts, one human checkpoint at the end of each — parallel feature development with a single reviewer.

Start Small

Don't build all five layers for your first loop. Sequence that works: contract + checker first (that alone transforms interactive sessions), then state (survive restarts), then tighten the harness, then move your checkpoint later as trust grows. The practical first-loop walkthrough follows exactly this order.

Go deeper: Loop Engineering: The Complete Guide · Harness Engineering · Loop-ready skill collections

Skills in This Post

Related Posts