
From Rule, Spec, to Harness: A Phased Adoption Path for AI Coding


For most of the past year, I’ve been helping teams of all sizes adopt AI coding: from early pilots with just one or two people, to teams of a dozen trying to use AI in daily delivery, to organizations now designing the next steps of their adoption path. I’ve repeatedly encountered the same phenomenon: everyone is excited during the pilot phase. Code is generated quickly, scaffolds are beautifully set up, and even tests get filled in. But when it comes to scaling, things start to go off the rails. Some developers submit AI-generated code directly, bypassing review; others discover that AI has modified interfaces it shouldn’t have, but the PR is already merged; and some treat AI output as a half-finished product and manually rewrite it from scratch, rendering the effort useless.

The real problem isn’t that the model isn’t strong enough, but that the team doesn’t yet know how to integrate AI into their engineering system. Pilots rely on individual ingenuity; scaling requires system-level safeguards. Most teams only validate during pilots that “AI can write code,” but they don’t validate how AI’s output is accepted by the engineering system.

Prologue: The Chasm Between Pilot and Scale

This isn’t just my observation. If you compare OpenAI’s Codex documentation with Anthropic’s Claude Code memory docs, you’ll see both companies ultimately emphasize the same things: rule files, persistent memory, structured state, and an external mechanism that pulls agents back within engineering boundaries. The bottleneck in AI coding has never been on the generation side, but on the reception side. Once code is generated: was it produced within the boundaries? Does it converge to the correct scope? Has it been validated? Can it ultimately enter the review and release system? These questions can’t be answered by the model; they must be answered by the engineering system.

So, what we need to build isn’t a path to “make AI write more code,” but a path to “enable AI to stably complete delivery within the engineering system.” Rule, Spec, Loop, and Harness aren’t four parallel capability modules, but a progressive construction path with each layer tightening control—“don’t go rogue,” “this is all you do,” “how to continuously converge,” and “why the results should be accepted into production.” Each layer provides constraints for the next.

Rule: First, Turn “Don’t Go Rogue” into Machine-Enforceable Boundaries

Most teams initially focus on what AI can do. But in real projects, the most expensive mistakes are rarely about poorly written code. They’re about crossing boundaries, incorrectly modifying contracts, expanding change surfaces, or declaring tasks complete when they shouldn’t be.

This is why Rule must come first. In OpenAI’s documentation, AGENTS.md isn’t just a regular README—it’s a persistent project guide that Codex reads at startup. It supports hierarchical overrides from global to project to subdirectory levels, and the official recommendation is to keep it minimal, adding rules only after repeated errors are observed. Anthropic, meanwhile, defines CLAUDE.md and auto-memory as context loaded in every session, while reminding you: These are context, not hard constraints. Putting both together reveals a clear conclusion: Rule files are just the entry point, not the guardrails. True guardrails must be enforced by external signals like tests, linters, builds, schema checks, and review policies.
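As a rough sketch of that hierarchy (the global path below follows Codex’s ~/.codex convention; other tools place it elsewhere):

```text
~/.codex/AGENTS.md                # global: personal defaults across projects
your-repo/AGENTS.md               # project: minimal entry point, read at startup
your-repo/packages/api/AGENTS.md  # subdirectory: specific rules, closest to their context
```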

So, the focus of Rule shouldn’t be “give AI more background,” but first turning team discipline into default boundaries. Write “NEVER” and “DO NOT” first, then suggestions; constrain high-cost errors before discussing optimal implementations; keep only the minimal entry point in the root directory, sinking specific rules as close as possible to their context or skill; and once a particular error repeats, stop relying on verbal reminders—instead, codify it into rule files and validation chains.
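A minimal sketch of such a root-level file, with illustrative rules rather than an official template:

```markdown
# AGENTS.md (root: keep this minimal)

## Hard boundaries
- NEVER modify public API signatures or anything under src/contracts/ (illustrative path).
- NEVER declare a task complete while tests, lint, or typecheck fail.
- DO NOT touch files outside the scope named in the current Spec.

## Defaults
- Prefer the smallest diff that passes validation.
- Specific rules live in subdirectory AGENTS.md files, which override this one.
```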

In other words, Rule doesn’t solve the problem of making agents smarter—it teaches them not to overstep.

Spec: Next, Converge “What We Want to Do” into a Single Executable Change

Rule alone isn’t enough, because clear boundaries don’t equate to clear goals. Many AI coding failures happen not because the model misunderstood the requirements, but because the requirements were never organized into a clearly bounded, scope-controlled, and verifiable deliverable. AI naturally expands tasks: asked to adjust a prompt, it restructures the workflow on the side; asked to add a piece of state, it alters contracts along the way; asked to write a happy path, it “tidies up” an entire module.

So, Spec isn’t a longer prompt—it’s a change scope discipline. In OpenAI’s documentation for multi-hour tasks, even PLANS.md is written as a living document: Milestones, progress, observable results, verification commands, and self-contained context must be clearly defined, allowing the agent to advance along these checkpoints. This direction is worth noting, because it shows that truly useful Specs aren’t about “providing more background”—they compress vague intentions into a single executable, reviewable, and verifiable change.
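A sketch of what such a living plan can look like, with illustrative milestones and commands:

```markdown
# PLANS.md: living plan for the parser extraction (illustrative)

## Milestones
- [x] M1: Move parsing logic into src/parser/ behind the existing interface
- [ ] M2: Add failure-path tests for malformed input

## Verification
- npm test -- parser and npm run typecheck must pass after every round.

## Progress
- Round 3: M1 done; typecheck flagged an unused export, removed it.
```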

In my view, an effective Spec must answer at least five questions: What problem are we solving this time, what aren’t we solving, which surfaces are allowed to change, which contracts must remain untouched, and what counts as completion? It doesn’t need to be a tome, nor should it become a new documentation burden. Its true value lies in locking down scope, front-loading completion conditions, and nipping the impulse to “tidy things up on the side” in the bud.
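Held to those five questions, a Spec can stay as short as this sketch (paths and criteria are illustrative):

```markdown
# Spec: harden parser error handling

- Problem: malformed input currently crashes the parser.
- Not solving: performance, streaming input, new syntax.
- Allowed surfaces: src/parser/ and its tests only.
- Frozen contracts: the exported parse() signature and src/contracts/.
- Done when: new failure-path tests pass and no existing test or type check breaks.
```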

Rule solves “don’t go rogue”; Spec solves “don’t go off track.” This is why they must follow each other.

Loop: Transform One-Shot Generation into a Convergent Execution Loop

With Rule and Spec in place, AI finally has boundaries and a target. But even then, errors will still occur during execution. What truly determines whether AI can work continuously in engineering isn’t the quality of a single response, but whether there’s a Loop that turns “read context, make a minimal change, run external validation, record state, proceed to the next round” into a routine the system enforces.

This step isn’t abstract theory. Vercel’s ralph-loop-agent bluntly describes continuous AI agent loops: The agent works, an evaluator checks results, and if not complete, it continues to the next round. OpenAI’s recent summary of Codex’s long-task experience explicitly describes the loop as: Plan, edit code, run tools, observe results, repair failures, update docs/status, and repeat. It emphasizes that long tasks are more reliable not because of a longer prompt, but because the harness provides structured context and clear “done when” routines.

Three actions are particularly critical here. First, each round of changes must be small enough to be quickly adjudicated by the validator; second, state must be externalized into files, logs, Git history, and plan documents—not relying on the model to “remember”; third, if a check fails, stop and fix it—don’t keep rolling forward with failure. Otherwise, Loop easily degrades from “continuous convergence” to “continuous drift.”
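As a minimal sketch of such a loop driver: the agent-facing functions are hypothetical stubs, and the validation commands assume ordinary npm scripts.

```typescript
import { execSync } from "node:child_process";
import { readFileSync, appendFileSync } from "node:fs";

// Hypothetical hooks where real agent calls would go; stubbed for illustration.
function proposeMinimalChange(plan: string): void {
  /* ask the agent for one small diff scoped by the plan */
}
function repairFailure(log: string): void {
  /* feed the failing output back to the agent before the next round */
}
function isDone(plan: string): boolean {
  return !plan.includes("- [ ]"); // done when no open checkboxes remain in the plan
}

// External validation is the real guardrail: the model never grades itself.
function validate(): { ok: boolean; output: string } {
  try {
    return { ok: true, output: execSync("npm test && npm run lint", { encoding: "utf8" }) };
  } catch (err) {
    return { ok: false, output: String(err) };
  }
}

const MAX_ROUNDS = 10;
for (let round = 1; round <= MAX_ROUNDS; round++) {
  const plan = readFileSync("PLANS.md", "utf8");                   // read externalized state
  proposeMinimalChange(plan);                                      // one small, adjudicable change
  const result = validate();                                       // run external validation
  appendFileSync("loop.log", `round ${round}: ok=${result.ok}\n`); // record state outside the model
  if (!result.ok) {
    repairFailure(result.output);                                  // stop and fix; don't roll forward
    continue;
  }
  if (isDone(plan)) break;                                         // "done when" comes from the Spec
}
```

The key design choice is that validate() shells out to external tools: a round only counts as progress when signals the model doesn’t control turn green.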

So, Loop doesn’t optimize generation quality—it optimizes convergence efficiency. Without Spec, Loop is just more patient trial and error; with Spec, Loop becomes an execution loop that continuously approaches acceptance conditions.

Harness: Finally, Reintegrate AI into Engineering Governance—Not Directly into Production

Once the three layers of Rule, Spec, and Loop are established, AI is no longer just “able to write code”; it’s advancing tasks within a controlled environment. But to carry this through to actual delivery, one final layer is needed: Harness.

Harness isn’t a single tool, but an entire engineering exoskeleton that brings AI outputs into validation, submission, review, release, and accountability systems. Anthropic has explicitly identified harness design as a key variable in long-horizon autonomous coding performance, placing task chunking, structured hand-offs, independent evaluators, and clear scoring criteria at the center of the system. On OpenAI’s side, testing, checking, and reviewing are chained into a reliability loop, with the suggestion to bind code_review.md and AGENTS.md together so that review rules become repository-level constraints.
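That binding can be as small as an excerpt like this in the rule file (the file names follow the docs cited above; the wording is illustrative):

```markdown
<!-- AGENTS.md (excerpt) -->
## Review
- Apply every item in code_review.md before proposing a merge.
- Treat a failed item there as a blocking error, not a suggestion.
```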

This reveals one thing: the task of Harness has never been to make AI freer, but to make the organization more confident. It doesn’t solve “can AI write code,” but “why these results are worthy of trust.”

In engineering, this layer typically manifests as a set of combined mechanisms: Contracts guard the shared boundaries of schemas, types, and interfaces; Hooks move linting, typechecking, tests, and pre-commit checks upstream; Fitness functions and CI make repository-level decisions, including risk grading, manual review, and whether to allow release. If the first three layers aren’t established, Harness can at best scavenge at the tail end; if they’re in place, Harness gains true organizational-level governance significance.
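A minimal sketch of such a combined gate, assuming ordinary npm scripts (check:schema is hypothetical) and an illustrative contracts path:

```typescript
import { execSync } from "node:child_process";

// Upstream checks: any red result blocks the change before review even starts.
const checks = ["npm run lint", "npm run typecheck", "npm test", "npm run check:schema"];
for (const cmd of checks) {
  try {
    execSync(cmd, { stdio: "inherit" });
  } catch {
    console.error(`Gate failed at: ${cmd}`);
    process.exit(1);
  }
}

// Repository-level decision: contract-touching diffs are risk-graded for manual review.
const changed = execSync("git diff --name-only origin/main...HEAD", { encoding: "utf8" });
if (changed.split("\n").some((f) => f.startsWith("src/contracts/"))) {
  console.log("Contract surface changed: routing to manual review.");
  process.exit(2); // non-zero exit so CI marks the PR as needing human sign-off
}
console.log("All gates green: eligible for the automated merge queue.");
```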

In other words, Harness answers not how strong AI is, but why this change is sufficiently trustworthy.

Why These Four Steps Must Be Progressive

These four layers are often misunderstood as a checklist of capabilities, but they actually have strict dependencies.

Without Rule, Spec is just giving unbounded agents requirements. Without Spec, Loop is just amplifying errors more efficiently. Without Loop, Harness can only passively scavenge at the PR and CI tail end. Only by first containing default behavior, then nailing down the scope of a single change, then turning execution into a closed loop, and finally integrating results into quality and release systems, can AI coding progress from “occasionally usable” to “stable delivery.”

This is why I prefer to understand this path as “expanding the control plane layer by layer.” Rule is the innermost layer, solving default agent behavior; Spec is the next layer out, solving the scope discipline of a single change; Loop turns static constraints into dynamic execution mechanisms; Harness aggregates the first three layers into organizational governance. They’re not four independent modules, but a maturity path.

Closing Thoughts

The next stage of AI coding will be decided not by model capabilities alone, but by whoever first transforms engineering experience into a delivery system that machines can participate in and organizations can govern.

Rule: Teach AI not to overstep.
Spec: Make clear to AI what it should do this time.
Loop: Ensure AI can’t declare “I think this is good enough” as completion.
Harness: Make AI’s results truly enter validation, review, and release systems.

From Rule to Harness, what we build isn’t a smarter coding assistant, but a delivery system that can work stably within engineering boundaries. The real gap won’t be the model—it’ll be who builds this delivery system first.

(This post is a machine-made, human-reviewed, and authorized translation of phodal.com/blog/from-rule-spec-to-harness-ai-coding-adoption-path/.)