A Claude Code case study · skill-creator methodology

Rebuilt skill: 100% pass.
Old skill: 74.2%.
Token cost: near-identical (+1.2%).

What happens when you ask Claude Code to “tear it down and rebuild it” instead of “improve it.” One skill, one engineering loop, eight eval runs, and an evidence trail you can audit line by line.

23 / 23 · assertions passed, rebuilt skill
17 / 23 · assertions passed, old skill
+1.2% · token cost of the improvement
8 · eval runs: 4 prompts × 2 skill versions
01 · The Loop

Not a rewrite. An engineering loop.

The skill under test is decompose-goal: it flattens a free-text goal (“check the container, redeploy if down, verify”) into atomic subtasks that agents, workflow engines, cron jobs, or humans can execute. The request was simple: “review, tear down, and rebuild it.” Instead of a one-shot rewrite, that phrase triggered Claude Code’s skill-creator methodology, which imposed the loop below. The human sits at the edges. The machine runs the middle.

Legend: human decides · machine executes
  1. Callsite research first (machine)
    Before touching a line: grep for everything that actually consumes the skill’s output. The research found the skill’s documented sponsor was wrong. The real consumer was a nightly cron worker that parses the output headlessly and depends on a hard contract: a literal “Atomic subtasks (N):” header. Break that line, break production.
  2. Design decisions surfaced to the human (human)
    Four multi-choice questions, asked before any code: add a JSON schema? (yes) · dependency annotations? (opt-in) · per-subtask verification rule? (as an internal litmus test) · publish a shareable version? (deferred). The human spends 60 seconds making four decisions, not 60 minutes reviewing a surprise rewrite.
  3. Snapshot the old skill (machine)
    A byte-for-byte frozen copy of the original. Two jobs: rollback insurance if the rewrite is worse, and an honest A/B baseline. Without it you know the new skill scores 100%, but not whether the old one would have too.
  4. Rewrite (machine)
    Orchestrator-specific framing replaced with executor profiles (agent turn / DAG node / cron run / human checklist), a fully specified JSON schema, opt-in [after: N] dependency annotations, and a verification litmus test: if you can’t state how to verify a subtask succeeded, it’s still compound. Decompose further.
  5. Four evals, run blind, twice over (machine)
    Four realistic prompts, each executed by two parallel fresh subagents: one with the new skill loaded, one with the old snapshot. Eight runs total. Same prompts for both sides, so any score difference is attributable to the skill text, not the questions.
  6. Assertion-based grading with written evidence (machine)
    Every run graded against per-eval assertions. Each verdict carries a quoted evidence line, so “fail” is never just a vibe; it points at the exact line that broke.
  7. Aggregate benchmark (machine)
    All grades roll up into one table: new skill 100% (±0%) vs old skill 74.2% (±29%), at the same wall-clock time and a ~1% token delta. This is the go/no-go signal for promoting the rewrite.
  8. Human review → feedback → iteration 2 (human)
    A self-contained review page renders every output and grade side by side so the author can audit the grader itself, then export structured feedback. Disagreements become fixed assertions; observed weaknesses become new evals. The same eval set is reused, so iteration 2’s score is directly comparable to iteration 1’s.
Human at the edges, machine in the middle. The author makes design decisions up front (step 2) and exercises judgment at the end (step 8). Everything between (snapshotting, rewriting, running, grading, aggregating) is mechanical and reproducible. The benchmark is a signal, not an auto-approval.
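Step 3 is the easiest part of the loop to make concrete. The following is a minimal sketch of a byte-for-byte snapshot, not the skill-creator's actual implementation: the function names, the `SKILL.md` file name, and the `snapshot` directory are all hypothetical. It copies the file, then compares SHA-256 digests so the baseline is provably identical to what was live before the rewrite.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def snapshot_skill(skill_file: Path, snapshot_dir: Path) -> Path:
    """Copy skill_file into snapshot_dir and confirm the copy is byte-identical."""
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    frozen = snapshot_dir / skill_file.name
    shutil.copy2(skill_file, frozen)
    if sha256_of(frozen) != sha256_of(skill_file):
        raise RuntimeError(f"snapshot of {skill_file} is not byte-identical")
    return frozen


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical skill file; the real path is not given in this write-up.
        original = Path(tmp) / "SKILL.md"
        original.write_text("old skill text\n", encoding="utf-8")
        frozen = snapshot_skill(original, Path(tmp) / "snapshot")
        # Rewriting the live file must leave the frozen baseline untouched.
        original.write_text("rebuilt skill text\n", encoding="utf-8")
        print(frozen.read_text(encoding="utf-8"), end="")
```

The hash check looks redundant after a plain copy, but it makes the "honest baseline" claim auditable: the digest can be recorded next to the benchmark results.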
02 · The Benchmark

Same prompts. Same model. Same cost. Different skill text.

Both configurations answered the same four prompts in fresh subagent sessions. The only variable was which version of the skill file was loaded.

Metric | Old skill | Rebuilt skill | Delta
Pass rate (mean across evals) | 74.2% ± 28.6% | 100% ± 0% | +25.8 pts
Assertions passed | 17 / 23 | 23 / 23 | +6
Wall-clock per run | 47.5s ± 7.7s | 47.4s ± 7.0s | −0.2% (noise)
Tokens per run | 76,645 ± 678 | 77,553 ± 484 | +908 (+1.2%)
Eval 0 · Default markdown contract · new 6/6 · old 6/6
Eval 1 ★ · JSON mode · new 6/6 · old 2/6
Eval 2 · Dependency DAG · new 6/6 · old 5/6
Eval 3 · Executor framing (LangGraph) · new 5/5 · old 4/5
The variance is the real finding. The old skill’s ±29% standard deviation means it aced the control eval, collapsed on JSON mode (2/6), and dropped one assertion each on the dependency and executor-framing evals. For a dispatch contract that programs parse, unreliable is worse than uniformly mediocre: you can’t build on output whose shape changes from one request to the next.
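The headline numbers can be re-derived from the per-eval scores. The sketch below is a check on the arithmetic, not the benchmark tool itself; it assumes the ± figure is a sample standard deviation (n − 1) over the four per-eval pass rates, which is the reading that reproduces the published 28.6%.

```python
from statistics import mean, stdev

# (passed, total) per eval, copied from the per-eval scores in this section.
OLD = [(6, 6), (2, 6), (5, 6), (4, 5)]
NEW = [(6, 6), (6, 6), (6, 6), (5, 5)]


def summarize(results):
    """Mean and sample standard deviation of per-eval pass rates, in percent."""
    rates = [100 * passed / total for passed, total in results]
    return {
        "mean_pct": round(mean(rates), 1),
        "stdev_pct": round(stdev(rates), 1),  # sample stdev (n - 1), an assumption
        "passed": sum(passed for passed, _ in results),
        "total": sum(total for _, total in results),
    }


old, new = summarize(OLD), summarize(NEW)
print(old)  # mean 74.2, stdev 28.6, 17 of 23
print(new)  # mean 100.0, stdev 0.0, 23 of 23
print(round(new["mean_pct"] - old["mean_pct"], 1))  # 25.8 points

# Token delta from the benchmark table.
token_delta = 77_553 - 76_645
print(token_delta, round(100 * token_delta / 76_645, 1))  # 908 1.2
```

Note that the pass rate is a mean of per-eval rates, not pooled assertions: pooling would give 17 / 23 ≈ 73.9% rather than 74.2%.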
03 · The Evidence

Don’t trust the table. Read the runs.

Every eval below shows the actual prompt, both raw outputs side by side, and the grader’s assertion-by-assertion verdicts with evidence. Paths and hosts have been replaced with neutral placeholders; structure and content are otherwise verbatim.

Featured: the 2/6 collapse. The old skill’s description promised “a numbered JSON or Markdown list” but never defined a single JSON field. So the model invented an ad-hoc shape: subtasks with only id and action, an invented “atomic_subtasks” key instead of count, the halt token buried in prose instead of a machine-readable field, and no verify anywhere. A human reading both outputs might shrug; both look reasonable. A program keyed to the schema parses the new output and breaks on the old one. Four of the old skill’s six failed assertions across the whole benchmark trace back to this one missing paragraph.
Prompt

Decompose this goal into subtasks as JSON: Check whether the webapp container on the container host is running via the Portainer API. If it’s running, output [HEALTHY] and halt. If not, pull the latest webapp image, redeploy the stack with PullImage true, then verify the container came up.

NEW 6/6 vs OLD 2/6
Rebuilt skill output · decomposition.json
{
  "goal": "Check whether the webapp container on the container host is running via the Portainer API. If it's running, output [HEALTHY] and halt. If not, pull the latest webapp image, redeploy the stack with PullImage true, then verify the container came up.",
  "count": 5,
  "subtasks": [
    {
      "id": 1,
      "action": "Query the Portainer API (https://portainer.internal:9443, endpoint 1) for the webapp container's state — capture whether it is running.",
      "artifacts": ["https://portainer.internal:9443/api/endpoints/1/docker/containers/json"],
      "halt_condition": null,
      "depends_on": [],
      "verify": "A running/not-running state value for the webapp container was captured from the Portainer API response."
    },
    {
      "id": 2,
      "action": "If the webapp container is running, output [HEALTHY] and halt.",
      "artifacts": [],
      "halt_condition": "[HEALTHY]",
      "depends_on": [1],
      "verify": "Either [HEALTHY] was emitted and execution stopped, or the container was not running and execution continued."
    },
    {
      "id": 3,
      "action": "Pull the latest webapp image via the Portainer API (POST /api/endpoints/1/docker/images/create?fromImage=acme/webapp&tag=latest).",
      "artifacts": ["https://portainer.internal:9443/api/endpoints/1/docker/images/create"],
      "halt_condition": null,
      "depends_on": [2],
      "verify": "The image-pull API call returned success and the latest webapp image is present on the Docker host."
    },
    {
      "id": 4,
      "action": "Redeploy the webapp stack via the Portainer API (PUT /api/stacks/{id} with PullImage: true).",
      "artifacts": ["https://portainer.internal:9443/api/stacks/{id}"],
      "halt_condition": null,
      "depends_on": [3],
      "verify": "The stack-update API call returned success (HTTP 200) for the webapp stack."
    },
    {
      "id": 5,
      "action": "Verify via the Portainer API that the webapp container is now running.",
      "artifacts": ["https://portainer.internal:9443/api/endpoints/1/docker/containers/json"],
      "halt_condition": null,
      "depends_on": [4],
      "verify": "The Portainer API reports the webapp container in a running state after the redeploy."
    }
  ]
}
Old skill output · decomposition.json
{
  "atomic_subtasks": 5,   ← invented key, no "count"
  "subtasks": [
    {
      "id": 1,
      "action": "Query the webapp container state on the container host via GET https://portainer.internal:9443/api/endpoints/3/docker/containers/json?all=true (authenticate with the stored Portainer credentials in the same call) — capture the webapp container's Id and State."
    },                       ← only id + action; no artifacts, no halt_condition, no depends_on, no verify
    {
      "id": 2,
      "action": "If the webapp container State is \"running\", output [HEALTHY] and halt."
    },                       ← halt token buried in prose, not a field
    {
      "id": 3,
      "action": "Pull the latest webapp image via POST https://portainer.internal:9443/api/endpoints/3/docker/images/create?fromImage=acme/webapp&tag=latest (re-authenticate in the same call)."
    },
    {
      "id": 4,
      "action": "Redeploy the webapp stack via PUT https://portainer.internal:9443/api/stacks/<webapp-stack-id>?endpointId=1 with PullImage: true in the request body (re-authenticate in the same call)."
    },
    {
      "id": 5,
      "action": "Verify the webapp container State is \"running\" by re-querying GET https://portainer.internal:9443/api/endpoints/3/docker/containers/json?all=true — report the observed state as the final result."
    }
  ]
}
Assertions · grader verdicts with evidence
single-parseable-json · new ✓ · old ✓
  new: One fenced json block; valid JSON with goal/count/subtasks.
  old: One fenced json block, valid JSON.
schema-fields-complete · new ✓ · old ✗
  new: All 5 subtasks carry all six fields.
  old: Subtasks have only id and action. The old skill promised JSON in its description but never defined a schema, so the model invented a minimal ad-hoc shape.
count-matches-length · new ✓ · old ✗
  new: count: 5, subtasks array has 5 entries.
  old: No 'count' field; an ad-hoc 'atomic_subtasks': 5 key was invented instead. A programmatic consumer keyed to the schema would break.
healthy-halt-condition · new ✓ · old ✗
  new: Subtask 2 has "halt_condition": "[HEALTHY]".
  old: [HEALTHY] only appears embedded inside subtask 2's action text; there is no halt_condition field to read.
verify-is-observable · new ✓ · old ✗
  new: e.g. subtask 4: "The stack-update API call returned success (HTTP 200)" — checkable states, not restatements.
  old: No verify fields exist anywhere in the output.
no-chained-actions · new ✓ · old ✓
  new: Subtasks 3, 4, 5 split image pull / stack update / running-state verification.
  old: Subtasks 3, 4, 5 are correctly split. The decomposition itself was fine; the structure failed.
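Three of the assertions above are purely structural, so they can be checked mechanically. The sketch below shows one way a grader could pair each verdict with an evidence string. It is illustrative only: the `Verdict` type and `grade` function are hypothetical names, and the case study does not say how its grader is implemented (the remaining assertions, such as verify-is-observable, need judgment rather than a key lookup).

```python
import json
from dataclasses import dataclass

# The six per-subtask fields named by the rebuilt skill's JSON schema.
REQUIRED = ("id", "action", "artifacts", "halt_condition", "depends_on", "verify")


@dataclass
class Verdict:
    assertion: str
    passed: bool
    evidence: str  # the observation the verdict rests on


def grade(raw: str, halt_token: str) -> list[Verdict]:
    """Grade three structural assertions against one JSON decomposition."""
    doc = json.loads(raw)
    subtasks = doc.get("subtasks", [])

    incomplete = [s.get("id") for s in subtasks if any(k not in s for k in REQUIRED)]
    count = doc.get("count")
    holders = [s.get("id") for s in subtasks if s.get("halt_condition") == halt_token]

    return [
        Verdict(
            "schema-fields-complete",
            not incomplete,
            f"subtasks missing fields: {incomplete}" if incomplete
            else f"all {len(subtasks)} subtasks carry all six fields",
        ),
        Verdict(
            "count-matches-length",
            count == len(subtasks),
            f"count: {count!r}, subtasks array has {len(subtasks)} entries",
        ),
        Verdict(
            "healthy-halt-condition",
            bool(holders),
            f"halt_condition == {halt_token!r} on subtasks {holders}" if holders
            else f"no halt_condition field equals {halt_token!r}",
        ),
    ]


# A trimmed stand-in for the old skill's shape (not the full output).
old_style = json.dumps({"atomic_subtasks": 2, "subtasks": [
    {"id": 1, "action": "Query the container state."},
    {"id": 2, "action": "If running, output [HEALTHY] and halt."},
]})
for v in grade(old_style, "[HEALTHY]"):
    print("PASS" if v.passed else "FAIL", v.assertion, "|", v.evidence)
```

Keeping the evidence string next to the boolean is what lets a reviewer audit the grader instead of trusting it.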
Prompt

Decompose this goal into subtasks: Nightly repo hygiene on the workstation — run git status across every repo under ~/projects, list any repo with uncommitted changes older than 7 days, write the report to ~/reports/repo-hygiene-<date>.md, and if every repo is clean output [ALL_CLEAN] and stop.

NEW 6/6 vs OLD 6/6

The control case. This is the bread-and-butter markdown contract the nightly cron worker parses, and the old skill was genuinely good at it. The rebuild had to not regress here, and didn’t.

Rebuilt skill output · decomposition.md
Atomic subtasks (5):
1. Enumerate every git repository under ~/projects
   (directories containing a .git) — capture the
   repo list.
2. Run `git status --porcelain` in each enumerated
   repo — capture which repos have uncommitted
   changes.
3. For each repo with uncommitted changes, check
   whether the changes are older than 7 days —
   capture the flagged repo list.
4. If the flagged repo list is empty, output
   [ALL_CLEAN] and halt.
5. Write the report listing each flagged repo and
   its stale uncommitted changes to
   ~/reports/repo-hygiene-<date>.md.
Old skill output · decomposition.md
Atomic subtasks (5):
1. Enumerate every git repository under ~/projects
   (directories containing a `.git`) — capture the
   repo list.
2. Run `git status --porcelain` in each enumerated
   repo — capture per-repo uncommitted changes.
3. If every repo reports zero uncommitted changes,
   output [ALL_CLEAN] and halt.
4. For each repo with uncommitted changes, check
   the modification timestamps of the changed
   files — keep only repos whose uncommitted
   changes are older than 7 days.
5. Write the report listing those repos (repo path
   + dirty files + change age) to
   ~/reports/repo-hygiene-<date>.md.
Assertions · grader verdicts with evidence
header-line-present · new ✓ · old ✓
  new: First line is 'Atomic subtasks (5):' and there are exactly 5 numbered lines.
  old: First line is 'Atomic subtasks (5):' with 5 numbered lines.
imperative-no-chaining · new ✓ · old ✓
  new: Lines start Enumerate/Run/For each.../If.../Write; no line joins two distinct actions with and/then.
  old: No and/then chaining in any line.
halt-branch-present · new ✓ · old ✓
  new: Line 4: 'If the flagged repo list is empty, output [ALL_CLEAN] and halt.'
  old: Line 3: 'If every repo reports zero uncommitted changes, output [ALL_CLEAN] and halt.'
artifacts-named · new ✓ · old ✓
  new: Line 5 names ~/reports/repo-hygiene-<date>.md verbatim.
  old: Line 5 names ~/reports/repo-hygiene-<date>.md.
no-invented-steps · new ✓ · old ✓
  new: 5 subtasks map 1:1 onto enumerate/status/age-filter/halt/write; no commits, pushes, or notifications added.
  old: 5 subtasks, all implied by the goal.
no-optional-modes-leak · new ✓ · old ✓
  new: Plain numbered markdown list only.
  old: Plain numbered markdown list only.
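To show why the header line is a hard contract, here is a minimal sketch of a strict consumer for the default markdown output. The nightly worker's real parser is not shown in this write-up, so the function name, the regexes, and the strictness rules are assumptions. The sketch accepts an "Atomic subtasks (N):" header followed by exactly N numbered items, with indented lines treated as wrapped continuations.

```python
import re

HEADER = re.compile(r"^Atomic subtasks \((\d+)\):$")
ITEM = re.compile(r"^(\d+)\. (.+)$")


def parse_default_contract(text: str) -> list[str]:
    """Return the subtask strings, or raise ValueError if the contract is broken."""
    lines = text.strip().splitlines()
    if not lines:
        raise ValueError("empty output")
    header = HEADER.match(lines[0])
    if not header:
        raise ValueError(f"bad header line: {lines[0]!r}")

    subtasks: list[str] = []
    for line in lines[1:]:
        item = ITEM.match(line)
        if item:
            if int(item.group(1)) != len(subtasks) + 1:
                raise ValueError(f"out-of-order item number: {line!r}")
            subtasks.append(item.group(2).strip())
        elif line.startswith(" ") and subtasks:
            # Indented line: continuation of the previous wrapped item.
            subtasks[-1] += " " + line.strip()
        else:
            # Blank lines and trailing prose are rejected, not skipped.
            raise ValueError(f"unexpected line: {line!r}")

    if len(subtasks) != int(header.group(1)):
        raise ValueError(
            f"header promises {header.group(1)} subtasks, found {len(subtasks)}"
        )
    return subtasks


sample = (
    "Atomic subtasks (2):\n"
    "1. Enumerate every git repository under ~/projects\n"
    "   and capture the repo list.\n"
    "2. Write the report to ~/reports/repo-hygiene-<date>.md.\n"
)
print(parse_default_contract(sample))
```

A parser this strict would also reject any prose paragraph appended after the list, because unindented non-item lines raise instead of being ignored.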
Prompt

Decompose this goal into subtasks with dependencies: Generate three marketing assets for the product launch — a hero image, a 30-second video script, and a landing-page headline set — then assemble all three into a review doc at ~/drafts/launch-pack.md and notify me it is ready for review.

NEW 6/6 vs OLD 5/6

The old skill had no dependency convention at all, so the model improvised: it invented a [deps:] syntax on the spot, then appended explanatory prose after the list. Reasonable for a human. Fatal for a deterministic parser, because the improvisation would differ run to run.

Rebuilt skill output · decomposition.md
Atomic subtasks (5):
1. Generate the launch hero image; save it to
   ~/drafts/launch-hero-image.png.
2. Write the 30-second launch video script.
3. Write the landing-page headline set.
4. Assemble the hero image, video script, and
   headline set into a review doc at
   ~/drafts/launch-pack.md. [after: 1, 2, 3]
5. Notify the requester that
   ~/drafts/launch-pack.md is ready for review.
   [after: 4]
Old skill output · decomposition.md
Atomic subtasks (5):
1. Generate a hero image for the launch — save the
   image artifact (or its path/URL) for assembly.
   [deps: none]  ← invented ad-hoc syntax
2. Write a 30-second video script for the launch —
   capture the script text for assembly.
   [deps: none]
3. Write a landing-page headline set for the
   launch — capture the headline variants for
   assembly. [deps: none]
4. Assemble the hero image (from 1), video script
   (from 2), and headline set (from 3) into a
   review doc at ~/drafts/launch-pack.md.
   [deps: 1, 2, 3]
5. Notify the requester that
   ~/drafts/launch-pack.md is ready for review.
   [deps: 4]

Dependency notes: subtasks 1–3 are independent and
can run in parallel; subtask 4 blocks on all three;
subtask 5 blocks on 4.
↑ trailing commentary a parser would choke on
Assertions · grader verdicts with evidence
header-line-present · new ✓ · old ✓
  new: 'Atomic subtasks (5):' with 5 lines.
  old: 'Atomic subtasks (5):' with 5 lines.
parallel-assets-unannotated · new ✓ · old ✓
  new: Lines 1-3 (hero image, video script, headline set) have no annotation — independently startable per the skill's annotated-mode semantics.
  old: Lines 1-3 marked '[deps: none]' — an invented ad-hoc convention, but it does convey independent startability.
assemble-depends-on-all-assets · new ✓ · old ✓
  new: Line 4 ends with '[after: 1, 2, 3]'.
  old: Line 4 marked '[deps: 1, 2, 3]'.
notify-depends-on-assemble · new ✓ · old ✓
  new: Line 5 ends with '[after: 4]'.
  old: Line 5 marked '[deps: 4]'.
imperative-no-chaining · new ✓ · old ✓
  new: Assemble (4) and notify (5) are separate entries.
  old: One action per line.
output-only-header-and-list · new ✓ · old ✗
  new: Output is exactly the header plus 5 numbered lines.
  old: Output ends with a 'Dependency notes:' prose paragraph after the list. The old skill had no dependency convention, so the model invented '[deps:]' syntax AND added explanatory commentary — format would vary run to run, breaking deterministic parsers.
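A fixed annotation syntax matters because a consumer only recognizes what it was written for. The sketch below parses the rebuilt skill's trailing [after: N] annotation; the regex and the `split_annotation` name are assumptions, not the skill's documented grammar. Fed the old skill's improvised [deps: …] form, it finds no dependencies at all and leaves the bracket text inside the action.

```python
import re

# Trailing "[after: 1, 2, 3]" annotation at the end of a subtask line.
AFTER = re.compile(r"\s*\[after:\s*(\d+(?:\s*,\s*\d+)*)\]\s*$")


def split_annotation(item: str) -> tuple[str, list[int]]:
    """Split a subtask line into (action text, prerequisite subtask numbers)."""
    match = AFTER.search(item)
    if not match:
        return item.strip(), []  # unannotated: may start immediately
    deps = [int(n) for n in match.group(1).split(",")]
    return item[: match.start()].strip(), deps


new_line = ("Assemble the hero image, video script, and headline set into a "
            "review doc at ~/drafts/launch-pack.md. [after: 1, 2, 3]")
old_line = ("Assemble the hero image (from 1), video script (from 2), and headline "
            "set (from 3) into a review doc at ~/drafts/launch-pack.md. [deps: 1, 2, 3]")

print(split_annotation(new_line)[1])  # [1, 2, 3]
print(split_annotation(old_line)[1])  # []  (improvised syntax is not recognized)
```

The failure mode is silent: the old line parses without error and simply loses its ordering constraint, which is worse for a scheduler than a hard parse failure.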
Prompt

I’m wiring up a LangGraph workflow and need this flattened into nodes: scrape the pricing pages of our three competitor sites, normalize each into a common JSON shape, diff against last month’s snapshot at ./data/pricing/2026-05.json, and write a changes report — if nothing changed the workflow should just end with a NO_CHANGES output.

NEW 5/5 vs OLD 4/5

This eval deliberately tests the whole point of the rebuild: the old skill only knew one kind of executor (a single agent taking turns inside one specific orchestrator). Asked for workflow nodes, it produced a strictly sequential list and bundled all three normalizations into one step. The rebuilt skill recognized a DAG engine and emitted a real fan-out/fan-in graph.

Rebuilt skill output · decomposition.md
Atomic subtasks (9):
1. Scrape the pricing page of competitor site 1 —
   capture raw page content.
2. Scrape the pricing page of competitor site 2 —
   capture raw page content.
3. Scrape the pricing page of competitor site 3 —
   capture raw page content.
4. Normalize the site 1 scrape into the common
   pricing JSON shape. [after: 1]
5. Normalize the site 2 scrape into the common
   pricing JSON shape. [after: 2]
6. Normalize the site 3 scrape into the common
   pricing JSON shape. [after: 3]
7. Diff the three normalized JSON objects against
   the snapshot at ./data/pricing/2026-05.json —
   capture the change set. [after: 4, 5, 6]
8. If the change set is empty, output NO_CHANGES
   and end the workflow. [after: 7]
9. Write the changes report from the captured
   change set. [after: 8]
Old skill output · decomposition.md
Atomic subtasks (8):
1. Scrape the pricing page of competitor site 1 —
   capture raw page content.
2. Scrape the pricing page of competitor site 2 —
   capture raw page content.
3. Scrape the pricing page of competitor site 3 —
   capture raw page content.
4. For each scraped pricing page, normalize the
   content into the common JSON shape — one
   normalized object per competitor.
   ← three nodes bundled into one; hides the
     fan-out a DAG engine wants
5. Read last month's snapshot at
   ./data/pricing/2026-05.json.
6. Diff the three normalized JSON objects against
   the snapshot from ./data/pricing/2026-05.json —
   capture the change set.
7. If the change set is empty, output NO_CHANGES
   and halt.
8. Write the changes report from the non-empty
   change set.
   ← strictly sequential; zero dependency info
Assertions · grader verdicts with evidence
usable-node-list · new ✓ · old ✓
  new: 'Atomic subtasks (9):' with 9 node-sized entries.
  old: 'Atomic subtasks (8):' with 8 entries.
no-changes-shortcircuit · new ✓ · old ✓
  new: Line 8: 'If the change set is empty, output NO_CHANGES and end the workflow. [after: 7]'.
  old: Line 7: 'If the change set is empty, output NO_CHANGES and halt.'
no-orchestrator-jargon · new ✓ · old ✓
  new: Output is system-agnostic; speaks only of the workflow.
  old: No internal orchestrator vocabulary or agent codenames leaked into the output.
dag-awareness · new ✓ · old ✗
  new: Scrapes 1-3 unannotated (parallel); per-site normalizes 4-6 each depend only on their own scrape ([after: 1]/[after: 2]/[after: 3]); diff joins at [after: 4, 5, 6]. A real fan-out/fan-in DAG.
  old: Strictly sequential list with no dependency information. Also bundles all three normalizations into one 'For each scraped pricing page' subtask (line 4), hiding the fan-out a LangGraph consumer would want as separate nodes.
snapshot-path-named · new ✓ · old ✓
  new: Line 7 names ./data/pricing/2026-05.json.
  old: Lines 5-6 name ./data/pricing/2026-05.json.
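The practical payoff of the dag-awareness assertion is easy to compute. The sketch below uses Python's standard-library `graphlib` (not LangGraph) to group the rebuilt skill's nine subtasks into batches that can run in parallel, using the [after: N] annotations transcribed by hand. The sequential reading of the old list is an assumption: with no annotations, a consumer has no basis for anything but running the eight steps in order.

```python
from graphlib import TopologicalSorter

# node -> set of prerequisites, transcribed from the rebuilt skill's annotations.
NEW_DEPS = {
    1: set(), 2: set(), 3: set(),
    4: {1}, 5: {2}, 6: {3},
    7: {4, 5, 6},
    8: {7},
    9: {8},
}

# Assumption: an unannotated list is executed strictly in order.
OLD_DEPS = {n: ({n - 1} if n > 1 else set()) for n in range(1, 9)}


def parallel_batches(deps: dict[int, set[int]]) -> list[list[int]]:
    """Group nodes into waves; every node in a wave has all prerequisites done."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


print(parallel_batches(NEW_DEPS))
# [[1, 2, 3], [4, 5, 6], [7], [8], [9]]
print(len(parallel_batches(NEW_DEPS)), "waves vs", len(parallel_batches(OLD_DEPS)))
# 5 waves vs 8
```

The scrape and normalize stages collapse from six sequential steps into two waves, which is the fan-out the old list hid.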
04 · Before / After

The diff that bought 26 points.

Four excerpts from the skill file itself, old vs new (lightly redacted: internal agent codenames and private paths replaced with generic terms). Skills are just instructions; every benchmark point above traces to one of these paragraphs.

1 · Who is the executor?
The old skill knew exactly one consumer: one internal mission orchestrator’s dispatch agents. The rebuild made “atomic” relative to the executor. This single table is why eval 3 went from a flat sequential list to a real fan-out/fan-in DAG.
Before
"…atomic, dispatchable subtasks suitable for [internal orchestrator] / [worker-agent] execution … each runnable by a single agent turn." (One notion of atomic, hard-coded to one system.)
After
"What 'atomic' means depends on who runs the subtask:
| Executor | Atomic unit |
| Agent turn (default) | one search / write / run |
| Workflow / DAG node | one node's worth |
| Headless cron run | one command sequence |
| Human checklist | one uninterrupted action |"
2 · The JSON promise
The killer detail of the whole benchmark. The old skill promised JSON output in its description and then never defined a single field, so the model improvised a shape that breaks any programmatic consumer. All four eval 1 failures (four of the six across the whole benchmark) trace to this one missing paragraph.
Before · the entire JSON spec
"Output is a numbered JSON or Markdown list — one atomic action per line." (That's it. No fields, no example, no rules.)
After
"Emit exactly one fenced json block, nothing else:
{ "goal": …, "count": 5, "subtasks": [ { "id", "action", "artifacts", "halt_condition", "depends_on", "verify" } ] }
Field rules:
count must equal subtasks.length.
depends_on is always present ([] = may start immediately).
halt_condition is the literal token emitted on short-circuit, or null.
verify is required."
3 · The verification litmus test
The old skill defined “atomic” by feel. The rebuild gives the model a checkable criterion it runs silently on every candidate subtask, and that surfaces as the required verify field in JSON mode.
Before
(No verification concept anywhere in the file.)
After
"For each candidate subtask, ask: can you state in one sentence how an observer would verify it succeeded? (A concrete observable outcome — a file exists, a command exited 0, a value was captured — not a restatement of the action.) If you cannot articulate verification, the subtask is still compound: decompose further."
4 · Known consumers (do not break)
Step 1 of the loop (callsite research) found the skill’s documented sponsor was wrong; the real consumer was a nightly cron worker parsing the output headlessly. The rebuild writes that contract into the skill itself, so no future rewrite can drift it silently.
Before
"Why this is a skill: pattern recurred in 5 distinct sessions … every time the orchestrator preps a goal for a dispatch agent." (Justifies the skill's existence; documents no contract.)
After
"Known consumers (do not break): the nightly worker's decompose step invokes this skill headlessly and parses the default markdown contract: a single 'Atomic subtasks (N):' line followed by numbered imperative lines. Any change to the default output shape must keep that contract intact."
05 · Try This Yourself

The pattern is reusable. The phrasing matters.

You don’t need custom tooling for this; you need to ask for the loop instead of the rewrite. These phrasings reliably steer Claude Code from “edit the file” into “run the methodology”:

"tear it down and build it back up" · "benchmark it against the old version" · "with evals" · "snapshot the old one first" · "show me the evidence"
research callsites → human decides → snapshot → rewrite → eval ×2 (new vs old) → grade w/ evidence → benchmark → human reviews → iterate

Human at the edges, machine in the middle: you make the design decisions before any code, and you audit the grader at the end. Everything between is mechanical, reproducible, and cheap to re-run.

01 · The full loop
Review, tear down, and rebuild my <skill-name> skill. Snapshot the old version first, then benchmark the rebuild against that snapshot on realistic evals and show me the evidence for every grade.
02 · Callsite research before any edit
Before rewriting anything: find every place that actually consumes this skill's output (grep the whole config tree, crons, hooks, and other skills) and tell me exactly what output contract each consumer depends on.
03 · Decisions up front, not surprises after
Surface every design decision as a multiple-choice question BEFORE writing any code. I want to decide the tradeoffs; you implement them.
04 · A/B evals against the snapshot
Write 4 realistic eval prompts with explicit pass/fail assertions. Run each prompt in two fresh subagents (one with the new version loaded, one with the old snapshot) so the only variable is the skill text.
05 · Evidence-graded benchmark + human gate
Grade every run assertion-by-assertion with a quoted evidence line per verdict. Roll the grades into one benchmark table (pass rate, time, tokens, old vs new), then build me a self-contained review page so I can audit the grader and export feedback for iteration 2.