
How I Set Up Codex for Spec-Driven Development

Codex · Workflow · AI Engineering · Specs · Automation

I wanted Codex to feel like a reliable teammate, not a fast autocomplete that occasionally rewrites half my repo.

The shift that worked for me was simple:

No approved spec, no code changes.

This post walks through my actual setup flow, based on the init.md blueprint in my spec-driven-template-codex repo, plus how I use it day to day when building features.

The System in One Flow

User request
   |
   v
spec-architect drafts task spec
   |
   v
human approval gate
   |
   v
agent-router picks specialist
   |
   v
specialist implements inside scope_in only
   |
   v
validation (npm run verify)
   |
   v
commit with spec deletion + evidence
   |
   v
full-branch PR review

This ordering matters more than any individual prompt trick.

What I Build First in a New Repo

My init.md breaks setup into explicit tasks. In practice, I treat them as six foundation layers.

1) Project Standard Files (CODEX.md + AGENTS.md)

I keep two top-level files:

  • CODEX.md is the canonical contract.
  • AGENTS.md is the loader that tells Codex to follow that contract.

CODEX.md carries the rules that I do not want to renegotiate per session:

  • command list (dev, build, lint, verify)
  • architecture boundaries
  • domain routing table
  • commit policy
  • the hard workflow gates

I keep this file direct and non-negotiable. If a rule is optional, I remove it.
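As a rough illustration, here is the shape of that file. The section names, commands, and paths are from my own setup; yours will differ:

```markdown
# CODEX.md

## Commands
- `npm run dev` / `npm run build` / `npm run lint` / `npm run verify`

## Architecture Boundaries
- UI code lives in `src/ui/`; it never imports from `src/server/` directly.

## Domain Routing
| Domain   | Specialist          |
|----------|---------------------|
| frontend | frontend-specialist |
| api      | api-specialist      |

## Commit Policy
- Conventional commit messages; no `--no-verify`; stage files explicitly.

## Hard Gates
- No code changes without an approved spec.
- Every completed spec is deleted and backed by chain evidence.
```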

2) Behavioral Blueprint in .codex/WORKFLOW.md

This file is where behavior is encoded, not implied.

My key block is:

  • first principle: never implement without an approved spec
  • spec-first gate on every request
  • architect mode when no spec exists
  • mandatory subagent chain (spec-architect -> agent-router -> specialist)
  • model enforcement (model + model_reasoning_effort on every agent)
  • evidence gate tied to deleted specs

I use .codex/STRATEGY.md as the stable "why" and .codex/WORKFLOW.md as the executable "how".
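A condensed sketch of how those gates read in the file itself (the wording is illustrative, not the template's literal contents):

```markdown
# WORKFLOW.md

## First Principle
Never implement without an approved spec.

## Mandatory Chain
spec-architect -> agent-router -> specialist

## Model Enforcement
Every agent file must pin `model` and `model_reasoning_effort`.

## Evidence Gate
A spec may only be deleted together with its evidence JSON.
```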

3) Agent Topology in .codex/agents/*.toml

I split responsibilities so one agent is not making every decision end-to-end.

Core agents:

  • spec-architect: plans and drafts specs only
  • agent-router: reads approved specs and dispatches
  • domain specialists: implement only in scope_in
  • pr-reviewer: branch-level quality gate

A detail that made my setup much more predictable: every agent file pins both model and model_reasoning_effort. I do not allow inheritance.

My usual pattern:

  • strongest reasoning for architecture and review
  • medium reasoning for implementation specialists
  • lower-cost, fast routing for dispatch-only work
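A hypothetical specialist file showing the pinning. The field layout follows my convention, and the model identifiers are placeholders:

```toml
# .codex/agents/frontend-specialist.toml
name = "frontend-specialist"
description = "Implements approved frontend specs inside scope_in only"

# Both fields are mandatory; the workflow guard rejects agent files without them.
model = "gpt-5"
model_reasoning_effort = "medium"

[scope]
owned_paths = ["src/ui/", "src/components/"]
```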

4) Spec Template as the Unit of Work

Each task is a TASK-YYYY-MM-DD-###.spec.md with strict front matter:

  • goal
  • scope_in and scope_out
  • constraints
  • validation
  • status
  • collaborators and design flags when needed

The point is not bureaucracy. The point is forcing clarity before edits begin.

I keep tasks small enough to finish in around 30 minutes. If I cannot describe it that tightly, it usually means I am hiding complexity.
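A filled-in example of the template, with the task content invented for illustration:

```markdown
---
id: TASK-2025-01-15-001
goal: Add empty-state message to the project list view
scope_in:
  - src/ui/ProjectList.tsx
scope_out:
  - src/server/
constraints:
  - No new dependencies
validation: npm run verify
status: draft
---

## Notes
Show "No projects yet" when the list query returns zero rows.
```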

5) Hard Guardrails with Hooks

This is where workflow stops being "best effort."

I add .codex/hooks/workflow-guard.sh and wire it through .codex/hooks.json (or inline in .codex/config.toml).

The guard blocks patterns that silently damage quality:

  • git commit --no-verify
  • broad staging like git add .
  • commit attempts without staged spec deletion
  • missing required agent files
  • missing model or model_reasoning_effort fields
  • missing or invalid evidence JSON for deleted specs
  • mismatch between evidence model values and pinned agent models

The important property is that policy is enforced at command time rather than relying on anyone remembering it.
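A minimal sketch of what such a guard can look like, assuming the hook receives the command string as its first argument and the staged name-status list (as produced by `git diff --cached --name-status`) on stdin. The function names and exact patterns are illustrative, not the template's literal contents:

```shell
#!/usr/bin/env sh
# Hypothetical sketch of .codex/hooks/workflow-guard.sh.

deny() { echo "workflow-guard: $1" >&2; exit 1; }

# Block command patterns that silently damage quality.
guard_command() {
  cmd="$1"
  case "$cmd" in
    *--no-verify*) deny "git commit --no-verify is blocked" ;;
    "git add ."|"git add -A"|"git add --all")
      deny "broad staging is blocked; stage files explicitly" ;;
  esac
}

# Require at least one staged spec deletion (read from stdin).
guard_commit_has_spec_deletion() {
  if ! grep -q '^D[[:space:]].*\.spec\.md$'; then
    deny "commit must stage a spec deletion with evidence"
  fi
}

guard_command "$@"
```

In my real setup this script also validates the agent files and the evidence JSON; the point here is the shape: pattern checks first, then a hard `exit 1` on any violation.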

6) Evidence + Memory

For every completed spec, I track chain evidence in:

  • .codex/evidence/agent-chain/<spec-id>.json

I record:

  • agent name
  • model used
  • chain step (architect, router, specialist)
  • timestamp
  • success status
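One evidence file might look like this; the values, including the model names, are invented for illustration:

```json
[
  {
    "agent": "spec-architect",
    "model": "gpt-5",
    "step": "architect",
    "timestamp": "2025-01-15T10:02:11Z",
    "success": true
  },
  {
    "agent": "agent-router",
    "model": "gpt-5-mini",
    "step": "router",
    "timestamp": "2025-01-15T10:04:40Z",
    "success": true
  }
]
```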

I also initialize .codex/memory/ for persistent preferences and constraints so sessions start with context instead of re-discovery.

My Day-to-Day Execution Pattern

Once the repo is bootstrapped, feature work becomes very repeatable.

Step 1: Request -> Draft Spec

I start by spawning spec-architect and asking it to create or update a spec.

If there is no approved spec, no implementation is allowed.

Step 2: Approve Before Code

I keep status flow explicit:

draft -> approved -> in_progress -> done|blocked

Approval is where I catch wrong assumptions early, before diff churn begins.

Step 3: Route by Domain

I spawn agent-router on the approved spec.

  • If domain is clear, route to one specialist.
  • If domain is mixed, split specs first.
  • Parallel only for truly non-overlapping owned paths.

Step 4: Implement Only Inside Scope

Specialists are constrained by spec boundaries.

No "while I'm here" changes. No opportunistic refactors outside scope.

This keeps diffs reviewable and rollback-friendly.

Step 5: Validate and Commit Under Policy

I run npm run verify, then commit with strict formatting.

My commit gate expects the spec lifecycle to be completed: the spec is deleted and matching evidence is staged whenever the workflow requires it.

Step 6: Run PR-Level Review

After feature specs are done, I run a full-branch review.

That catches regressions that are invisible when you only inspect one task at a time.

What Changed After I Adopted This

Three practical improvements stood out.

1) Fewer accidental repo-wide edits

Explicit scope_in stopped many "small change" cascades.

2) Faster reviews

Review conversation shifted from "what happened?" to "is this the right behavior?" because intent was already encoded in specs.

3) Better handoffs across days

When I pause and resume later, I continue from spec status and evidence instead of reconstructing context from raw diffs.

Common Failure Modes I Guard Against

"This is too small for a spec"

Small tasks are where process drift starts. I still create a tiny spec.

"Let's skip verify once"

If verify is painful, optimize verify. Skipping it just moves failure later.

"Agent touched unrelated files"

I treat that as workflow failure, not a harmless side effect. I re-scope and rerun.

"We can commit now and clean evidence later"

I avoid deferred compliance. Evidence exists to prove the actual chain that happened.

Minimal Setup Order If You Want to Copy This

If you are starting fresh, this is the shortest safe sequence:

  1. Create CODEX.md and AGENTS.md
  2. Add specs/templates/TASK.spec.template.md
  3. Add .codex/WORKFLOW.md and .codex/STRATEGY.md
  4. Create core agents in .codex/agents/
  5. Enable hooks in .codex/config.toml and wire workflow-guard.sh
  6. Add evidence schema path under .codex/evidence/agent-chain/
  7. Test blocked and allowed commit scenarios

If step 7 is skipped, your rules are probably not real yet.
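To make step 7 concrete, here is a tiny standalone smoke test in the same spirit. `check_cmd` is a stand-in that mirrors the guard's pattern rules, not an invocation of the real hook:

```shell
#!/usr/bin/env sh
# Standalone stand-in for the guard: one blocked and one allowed scenario.
check_cmd() {
  case "$1" in
    *--no-verify*) return 1 ;;             # blocked: bypasses verify
    "git add ."|"git add -A") return 1 ;;  # blocked: broad staging
    *) return 0 ;;                         # allowed
  esac
}

if check_cmd "git commit --no-verify -m 'wip'"; then
  echo "FAIL: --no-verify slipped through"
else
  echo "blocked as expected"
fi

if check_cmd "git commit -m 'feat: project list empty state'"; then
  echo "allowed as expected"
fi
```

Running this prints "blocked as expected" and "allowed as expected"; the real test is the same idea pointed at your actual hook in a scratch repo.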

Final Takeaway

My Codex setup works because it converts process from documentation into enforcement:

  • specs define intent
  • agents separate responsibilities
  • hooks enforce non-negotiable policies
  • evidence proves what actually ran
  • PR review validates system-level safety

I still iterate prompts, but prompts are now the smallest part of the system.

The bigger win is having a workflow that stays stable even when tasks, tools, or models change.