From Spec to Skill: The Mental Model That Made Claude Code Click

You’re three hours into a Claude Code session. The feature is almost done. You ask for one more refactor, something you’d normally trust Claude to nail, and it produces nonsense. Wrong function name. Forgets a constraint you established two hours ago. Suggests a library you explicitly told it not to use.

You compact. Try again. Same result.

This is the wall. I hit it constantly until I sat down with two recent talks (Matt Pocock’s AI Engineer workshop and Barry Zhang and Mahesh Murag’s Anthropic skills keynote) and finally understood what I was doing wrong.

The short version: I was treating Claude Code like a chatbot when I should have been treating it like an operating system. I lacked the vocabulary to tell the difference between a workflow, a skill, an agent, and a sub-agent, which meant every “improvement” I made to my setup was a guess.

This post is the mental model that fixed it.

The Smart Zone Is Real

Here’s what I had to learn the hard way: LLMs have a usable context budget that’s a lot smaller than the advertised one.

Pocock calls it the smart zone, an idea he credits to Dex Horthy at HumanLayer: roughly the first 100K tokens of a session. Beyond that, attention degrades, the model starts forgetting earlier decisions, and you get the kind of dumb output you’d expect from a junior who’s been awake for 36 hours.

The 1M-token context window Anthropic shipped earlier this year? That’s not a smart zone upgrade. That’s just a bigger dumb zone tacked on. Useful for retrieval (finding a clause buried in a 200-page contract). Not useful for coding.

"They shipped a lot more dumb zone to you."

Once you accept that the smart zone is a hard practical limit on your tool, every other design decision flows from it. You stop trying to cram an entire feature into one session. You stop compacting and praying. You start architecting your work to fit.

Compacting Is a Trap

This was the second realization. Compacting feels like progress. You squeeze 80K tokens of conversation down to a 5K summary and keep going. But you’re not actually back in the smart zone. You’re in the smart zone with sediment. Every compaction strips nuance the model used to have access to, and the next implementation step is now based on a lossy reconstruction of your earlier conversation.

Pocock’s rule, which I’ve now adopted: clear, don’t compact. Treat every session like the protagonist in Memento. Bound your tasks small enough that they finish in one fresh context window, then start the next one from scratch.

If the task can’t fit in one window, the task is too big. Split it.
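A rough way to sanity-check task size before starting a session is sketched below. Treat every number as an assumption: the 100K budget is the approximate figure quoted above, the four-characters-per-token ratio is a common rule of thumb rather than a real tokenizer, and `working_room` is a hypothetical allowance for the tool calls and diffs the session will generate.

```python
# Rough sizing check. Nothing here calls a real tokenizer; the ratio and
# both budgets are assumptions chosen for illustration.
SMART_ZONE_TOKENS = 100_000   # approximate figure quoted in this post
CHARS_PER_TOKEN = 4           # rule-of-thumb estimate only


def estimate_tokens(text: str) -> int:
    """Approximate the token count of a piece of text."""
    return len(text) // CHARS_PER_TOKEN


def fits_in_smart_zone(inputs: list[str], working_room: int = 60_000) -> bool:
    """Return True if the upfront material still leaves `working_room`
    tokens for the session's own tool calls, diffs, and replies."""
    upfront = sum(estimate_tokens(text) for text in inputs)
    return upfront + working_room <= SMART_ZONE_TOKENS


issue = "Add a CSV export button to the reports page. " * 20
big_dump = "x" * 400_000  # e.g. a pasted log or a whole module tree

print(fits_in_smart_zone([issue]))            # small, well-bounded task
print(fits_in_smart_zone([issue, big_dump]))  # too much upfront context
```

If the second case describes your task, that’s the signal to split it rather than compact your way through it.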

The Four Words You Need

Here’s where the vocabulary problem comes in. Most posts about Claude Code use “workflow,” “skill,” “agent,” and “sub-agent” interchangeably. They’re not the same thing, and not knowing the difference kept me from building anything reusable.

Workflow. A predefined sequence of steps. You decide what happens in what order. The LLM fills in each step but doesn’t decide what comes next. Predictable, cheap, debuggable.

Agent. The LLM decides what to do next. You give it a goal and tools; it picks which tool to call, when to backtrack, when to stop. Flexible, but more expensive and harder to predict.

Sub-agent. A separate LLM invocation with its own isolated context window, called by a parent. The point is context isolation: the sub-agent can burn through 80K tokens exploring your codebase and return a 2K summary to the parent, which stays in the smart zone.

Skill. A folder containing instructions (and optionally scripts and assets) that encodes a procedure. The skill is the artifact on disk. The workflow is what happens when Claude executes it.

That last one is the unlock. A skill is the noun. A workflow is the verb. Once you see this, everything Barry Zhang said in his keynote clicks: skills are the new application layer on top of the agent runtime, the same way apps sit on top of an operating system. You don’t build a new agent for every domain. You build the agent once and ship skills to extend it.

This is also how Anthropic defines the workflow/agent split in their original Building Effective Agents post: “Workflows are systems where LLMs and tools are orchestrated through predefined code paths. Agents, on the other hand, are systems where LLMs dynamically direct their own processes and tool usage.”
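The difference is easier to see in code. This is a minimal sketch, not anyone’s real runtime: `Model` is a hypothetical stand-in for an LLM call, `scripted` is a fake model that returns canned replies, and the tool names are invented for illustration.

```python
from typing import Callable

# `Model` is a hypothetical stand-in for an LLM call: prompt in, text out.
Model = Callable[[str], str]
Tool = Callable[[], str]


def run_workflow(model: Model, task: str) -> list[str]:
    """Workflow: the code decides the order. The model only fills in steps."""
    steps = ["plan", "implement", "test"]
    return [model(f"{step}: {task}") for step in steps]


def run_agent(model: Model, task: str, tools: dict[str, Tool],
              max_turns: int = 10) -> list[str]:
    """Agent: the model decides which tool runs next and when to stop."""
    history: list[str] = []
    for _ in range(max_turns):
        choice = model(f"task: {task}\nhistory: {history}\nnext tool or 'done'?")
        if choice not in tools:  # 'done' or anything unrecognized ends the loop
            break
        history.append(f"{choice} -> {tools[choice]()}")
    return history


def scripted(replies: list[str]) -> Model:
    """Fake model that returns canned replies in order, for demonstration."""
    queue = iter(replies)
    return lambda prompt: next(queue)


# Hypothetical tools; real ones would touch the codebase.
tools = {"grep": lambda: "3 matches", "run_tests": lambda: "all green"}

print(run_workflow(scripted(["a plan", "a diff", "a test run"]), "add CSV export"))
print(run_agent(scripted(["grep", "run_tests", "done"]), "add CSV export", tools))
```

In `run_workflow` the step list lives in the code. In `run_agent` the only thing the code owns is the loop and the turn limit; the path through the tools comes from the model.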

"Skills are just folders. This simplicity is deliberate."

Why Most “Agents” Should Be Workflows

Barry’s main argument, and I think he’s right, is that most things people build as agents should actually be workflows. Only reach for true agent-shaped autonomy when you genuinely can’t pre-script the path. He made this case first in his earlier How We Build Effective Agents talk at the AI Engineer Summit, and the skills keynote is the natural extension: skills are the structured artifact that workflow-first thinking actually produces.

If you find yourself typing the same setup instructions session after session, that’s a workflow waiting to be extracted. If you can write down “first do A, then B, then check C, then either D or E depending on the result,” that’s a workflow. Bake it into a skill. Stop reinventing it every Tuesday.

The test I use now: who decides what happens next, me or Claude?

If it’s me, I should be writing it down once instead of typing it every time.

If it’s genuinely Claude (open-ended exploration, novel problem, no fixed sequence), that’s where agent autonomy earns its cost.
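Extracting a workflow into a skill can be as small as one folder and one file. The sketch below assumes the layout I understand from Anthropic’s Agent Skills documentation: a directory containing a SKILL.md whose frontmatter carries a `name` and a `description`. The `.claude/skills` location, the `write_skill` helper, and the skill text itself are my own illustrative choices, so check the docs before copying any of it.

```python
from pathlib import Path
import tempfile


def write_skill(root: Path, name: str, description: str, body: str) -> Path:
    """Create <root>/<name>/SKILL.md with minimal frontmatter.

    Assumes `description` is a single line with no colons, so the
    frontmatter stays trivially valid without a YAML library.
    """
    skill_dir = root / name
    skill_dir.mkdir(parents=True, exist_ok=True)
    skill_file = skill_dir / "SKILL.md"
    skill_file.write_text(
        f"---\nname: {name}\ndescription: {description}\n---\n\n{body}\n",
        encoding="utf-8",
    )
    return skill_file


# Hypothetical skill content, written to a temp dir so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    path = write_skill(
        Path(tmp) / ".claude" / "skills",
        name="grill-me",
        description="Interview the user about a feature until the design is agreed",
        body="1. Ask one question at a time.\n2. Stop when no open questions remain.",
    )
    print(path.read_text(encoding="utf-8"))
```

The numbered steps in the body are the workflow. The folder is the skill.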

The Pipeline That Actually Works

Pocock’s workshop is the most concrete version of this I’ve seen. He runs a five-skill pipeline that takes him from “Slack message from a stakeholder” to shipped code. I’ve been running variations of it for the last few weeks and the difference is real.

Here’s the shape:

/grill-me. Relentless interview about the feature until you and Claude share what Pocock calls a “design concept.” Not a plan. Not a doc. Just alignment. Expect 20-80 questions for anything non-trivial.
/write-prd. Summarize the alignment into a destination document. User stories, out-of-scope items, testing decisions. Don’t bother reading it. You wrote it with Claude, you already know what’s in it.
/prd-to-issues. Split the PRD into vertical slices as a local kanban board. Independently grabbable issues with blocking relationships.
/ralph-once. AFK loop that grabs the next unblocked issue, does TDD red-green-refactor, runs your feedback loops, commits, repeats.
/improve-codebase-architecture. Background skill that finds shallow modules and proposes deepening them, because the ceiling of AI quality is the quality of your feedback loops.

The first three are human-in-the-loop. Planning is irreducibly human. This is where your taste, your domain knowledge, and your understanding of the business actually matter. The Ralph loop is AFK. You walk away. Claude grinds. You come back to commits.

This split is the whole game. Stop trying to automate planning. Stop trying to babysit implementation. The leverage is in being explicit about which is which.
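The handoff between /prd-to-issues and /ralph-once comes down to one question: which issue is unblocked right now? Here is a minimal sketch of that selection step. The `Issue` shape and the `next_unblocked` function are hypothetical; I don’t know how Pocock’s skills store the board, only that issues carry blocking relationships.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Issue:
    id: str
    title: str
    blocked_by: set[str] = field(default_factory=set)  # ids that must finish first
    done: bool = False


def next_unblocked(issues: list[Issue]) -> Optional[Issue]:
    """Return the first open issue whose blockers are all done, else None."""
    done_ids = {issue.id for issue in issues if issue.done}
    for issue in issues:
        if not issue.done and issue.blocked_by <= done_ids:
            return issue
    return None


board = [
    Issue("1", "Export one report as CSV"),
    Issue("2", "Add a date filter to the export", blocked_by={"1"}),
    Issue("3", "Email the export", blocked_by={"1", "2"}),
]

# One loop iteration: grab, (implement and commit happen here), mark done.
while (issue := next_unblocked(board)) is not None:
    print(f"working on #{issue.id}: {issue.title}")
    issue.done = True
```

Each pass through the loop is meant to be its own fresh session, which is what keeps the AFK part inside the smart zone.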

Vertical Slices Are Non-Negotiable

This was the technique that improved my outputs most. AI loves to code horizontally: all the database schema first, then all the API, then all the UI. It feels organized. It’s actually broken.

Why? Because you don’t get integration feedback until phase three. You haven’t tested that the layers connect until you’ve already burned hours building each layer independently.

The fix is the old Pragmatic Programmer technique: tracer bullets. Thin vertical slices that cross every layer from day one. Your first slice might be ugly. It might handle exactly one user story. But it exercises the full stack (schema change, service logic, API endpoint, UI surface) and you get feedback on the whole flow before you’ve committed to a single architectural decision you can’t unwind.

"Without that, AI is coding blind until it reaches the later phases."

When I read PRDs that propose “phase 1: data layer, phase 2: API, phase 3: frontend,” I throw them out now. Every phase should be a thin vertical slice that you could ship if you had to.
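That review rule is mechanical enough to sketch. The layer names and the plan format below are assumptions for illustration; the check simply flags any phase that fails to touch every layer.

```python
# Hypothetical layer names; use whatever your stack actually has.
LAYERS = {"schema", "service", "api", "ui"}


def horizontal_phases(plan: dict[str, set[str]]) -> list[str]:
    """Return the names of phases that do not cross every layer."""
    return [name for name, touched in plan.items() if not LAYERS <= touched]


layered_plan = {
    "phase 1: data layer": {"schema"},
    "phase 2: API": {"service", "api"},
    "phase 3: frontend": {"ui"},
}

sliced_plan = {
    "slice 1: export one report": {"schema", "service", "api", "ui"},
    "slice 2: add date filter": {"schema", "service", "api", "ui"},
}

print(horizontal_phases(layered_plan))  # every phase is flagged
print(horizontal_phases(sliced_plan))   # nothing flagged
```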

Bad Codebases Make Bad Agents

This is the line from Pocock’s workshop that I keep coming back to. The quality of your AI output has a ceiling, and that ceiling is the quality of your feedback loops.

If your codebase has no tests, no type checking, no linting, AI can’t verify its own work. It will hallucinate confidently and you’ll catch the bugs in production. If your codebase is a tangle of shallow modules with crisscrossing dependencies, AI can’t reason about it cleanly. It has to trace through too much code, and most of that tracing happens in the dumb zone.
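In practice a feedback loop is just a command whose exit code tells the model whether it broke something. A minimal runner might look like the sketch below. The runner is my own illustration; the pytest, mypy, and ruff commands in the comment are one plausible toolchain, not a requirement, and the demo uses trivial Python one-liners so it runs anywhere.

```python
import subprocess
import sys


def run_feedback_loops(checks: dict[str, list[str]]) -> dict[str, bool]:
    """Run each check command and record whether it exited with code 0."""
    results: dict[str, bool] = {}
    for name, command in checks.items():
        proc = subprocess.run(command, capture_output=True, text=True)
        results[name] = proc.returncode == 0
    return results


# In a real project the checks might be (assumed toolchain):
#   {"tests": ["pytest", "-q"], "types": ["mypy", "."], "lint": ["ruff", "check", "."]}
# The stand-ins below just exit with a known status.
checks = {
    "passing check": [sys.executable, "-c", "pass"],
    "failing check": [sys.executable, "-c", "raise SystemExit(1)"],
}

results = run_feedback_loops(checks)
for name, ok in results.items():
    print(f"{name}: {'ok' if ok else 'FAILED'}")
```

A codebase with none of these gives the runner nothing to run, which is the point: no signal, no self-correction.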

John Ousterhout’s framing from A Philosophy of Software Design (deep modules versus shallow modules) is now load-bearing for me. A deep module exposes a small interface and hides a lot of implementation behind it. You can wrap one big test boundary around a deep module and catch real bugs. A shallow module has an interface nearly as complicated as what it hides, so the logic ends up spread across a dozen tiny files and you’re forced into a brittle mocking pattern where the tests don’t catch what matters.
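Here is what a deep module can look like at toy scale. Everything in it is invented for illustration (the pricing rules, the coupon code, the tax rate). The point is the shape: one public function, with the details behind it, tested at the boundary without mocks.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Line:
    unit_cents: int
    quantity: int


def order_total_cents(lines: list[Line], coupon: Optional[str] = None) -> int:
    """The whole public interface. Callers never see the helpers below."""
    discounted = _apply_coupon(_subtotal(lines), coupon)
    return discounted + _tax(discounted)


# Hidden implementation. These rules are hypothetical and free to change
# without touching any caller or any boundary test.
def _subtotal(lines: list[Line]) -> int:
    return sum(line.unit_cents * line.quantity for line in lines)


def _apply_coupon(subtotal: int, coupon: Optional[str]) -> int:
    return subtotal - subtotal // 10 if coupon == "TEN_OFF" else subtotal


def _tax(amount: int) -> int:
    return amount * 8 // 100


# Boundary tests: real inputs, real outputs, nothing mocked.
cart = [Line(unit_cents=1000, quantity=2), Line(unit_cents=500, quantity=1)]
assert order_total_cents(cart) == 2700
assert order_total_cents(cart, coupon="TEN_OFF") == 2430
```

An agent working on this module only needs to read one signature to use it, and two assertions tell it whether a change broke anything.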

Refactoring for depth isn’t just a code-quality concern anymore. It’s an AI-tooling investment. Every hour spent making your modules deeper pays back as faster, more accurate AI implementation work.

This is also why I’m skeptical of teams that try to skip the code-review step. The QA phase isn’t bureaucratic overhead. It’s where you impose taste, where you push back, where you stop the model from producing technically-correct slop that violates conventions nobody bothered to encode. You can’t automate it away. You can only make it faster by giving the reviewer (human or AI) better tools.

Push vs Pull: Where to Put Your Standards

One last distinction that took me a while to get right. When you’re trying to enforce coding standards with AI, you have two levers:

Push. Information you force into context every time. Anything in CLAUDE.md gets pushed. Anything you paste at the top of a session gets pushed. Push burns tokens whether or not the information is relevant to the current task.

Pull. Information Claude can fetch when it decides it needs it. Skills are pull-based by design. Claude reads the skill’s description, decides it’s relevant, and pulls the rest into context only then. Anthropic calls this progressive disclosure, and it’s the mechanism that lets you ship hundreds of skills without blowing the smart zone.

My current heuristic:

Implementer agent: pull. Let it fetch coding standards when it has a question. Don’t blow the smart zone budget on conventions it might never touch.
Reviewer agent: push. Push the coding standards in directly. The reviewer’s whole job is to compare code against rules, so the rules need to be in context up front.

Same information. Different staging. Big difference in output quality.
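The staging difference is easy to see in a sketch. This is a toy model of push versus pull, not how Claude Code is implemented: the `Skill` shape is hypothetical, and `keyword_match` is a crude stand-in for the relevance decision that the model itself makes in the real system.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Skill:
    name: str
    description: str  # short, always visible
    body: str         # long, loaded only on demand in the pull case


def pushed_context(skills: list[Skill]) -> str:
    """Push: every description and every body goes in up front."""
    return "\n".join(f"{s.name}: {s.description}\n{s.body}" for s in skills)


def pulled_context(skills: list[Skill], task: str,
                   is_relevant: Callable[[Skill, str], bool]) -> str:
    """Pull: descriptions up front, bodies only for skills judged relevant."""
    index = "\n".join(f"{s.name}: {s.description}" for s in skills)
    bodies = [s.body for s in skills if is_relevant(s, task)]
    return "\n".join([index, *bodies])


def keyword_match(skill: Skill, task: str) -> bool:
    """Stand-in for the model's judgment: match on the skill's first word."""
    return skill.name.split("-")[0] in task.lower()


skills = [
    Skill("review-pr", "Review a pull request", "REVIEW RULES " * 500),
    Skill("fix-bug", "Reproduce and fix a bug", "DEBUG STEPS " * 500),
    Skill("write-prd", "Draft a PRD", "PRD TEMPLATE " * 500),
]
task = "Review the open pull request"

push = pushed_context(skills)
pull = pulled_context(skills, task, keyword_match)
print(len(push), len(pull))  # the pulled context is a fraction of the pushed one
```

For the reviewer agent, you would deliberately take the `pushed_context` path for the standards it has to enforce, and accept the token cost.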

What I’d Tell My Past Self

If I could go back three months and hand myself a sticky note for the monitor, it would say this:

Treat Claude Code like an OS, not a chatbot. The agent runtime is the kernel. MCP servers are the device drivers. Skills are the applications. You don’t reinvent the kernel every time you want a new capability. You ship a skill.

Size your tasks to the smart zone. If a task won’t finish in 100K tokens of fresh context, the task is too big. Split it.

Workflow first, agent second. Anything you’d type the same way twice belongs in a skill. Save autonomous agent behavior for the parts that genuinely can’t be scripted.

Vertical slices, always. Horizontal layers feel organized and produce broken integrations. Thin slices feel ugly and produce shippable software.

Your codebase is your ceiling. Invest in depth, tests, types, and lint. Every hour there compounds into faster, better AI work.

Wrapping Up

The biggest shift wasn’t tooling. It was vocabulary.

When I couldn’t tell a workflow from a skill from an agent, I couldn’t explain what I was trying to build. I’d grab at techniques from blog posts, glue them together, and wonder why my setup felt brittle. Once I had the four words (workflow, skill, agent, sub-agent), I could finally architect my AI workflow the way I’d architect software: deliberately, with clear boundaries, and with each piece doing one thing well.

The spec-driven development I wrote about earlier this year is still the foundation. But it’s no longer enough on its own. Specs without skills means you write the same setup prompt over and over. Skills without specs means you automate misalignment at scale. The pipeline only works when you have both: destination documents that capture what you’re building, and skills that capture how you build it consistently.

Start with one skill. Mine was /grill-me. Yours might be /fix-bug or /review-pr or /ship. Pick the procedure you re-type most often, write it down once, and run it for a week. You’ll feel the difference before you finish the week.

The vibe got us started. The spec gave us structure. Skills are how we ship.

Sources

Matt Pocock, Full Walkthrough: Workflow for AI Coding (AI Engineer workshop, 2025)
Barry Zhang & Mahesh Murag, Don’t Build Agents, Build Skills Instead (Anthropic, 2025)
Barry Zhang, How We Build Effective Agents (AI Engineer Summit, 2025)
Anthropic, Building Effective Agents (engineering blog)
Anthropic, Agent Skills documentation
Anthropic, anthropics/skills GitHub repository
Agent Skills open standard specification
Dex Horthy (HumanLayer), origin of the “smart zone / dumb zone” framing
John Ousterhout, A Philosophy of Software Design (deep modules vs shallow modules)
Andy Hunt & Dave Thomas, The Pragmatic Programmer (tracer bullets / vertical slices)
