Not a rewrite. An engineering loop.
The skill under test is decompose-goal: it flattens a free-text goal
(“check the container, redeploy if down, verify”) into atomic subtasks that agents,
workflow engines, cron jobs, or humans can execute. The request was simple:
“review, tear down, and rebuild it.”
Instead of a one-shot rewrite, that phrase triggered Claude Code’s skill-creator
methodology, which imposed the loop below. The human sits at the edges. The machine
runs the middle.
-
1
Callsite research first machineBefore touching a line: grep for everything that actually consumes the skill’s output. The research found the skill’s documented sponsor was wrong. The real consumer was a nightly cron worker that parses the output headlessly and depends on a hard contract: a literal
Atomic subtasks (N):header. Break that line, break production. -
2
Design decisions surfaced to the human humanFour multi-choice questions, asked before any code: add a JSON schema? (yes) · dependency annotations? (opt-in) · per-subtask verification rule? (as an internal litmus test) · publish a shareable version? (deferred). The human spends 60 seconds making four decisions, not 60 minutes reviewing a surprise rewrite.
-
3
Snapshot the old skill machineA byte-for-byte frozen copy of the original. Two jobs: rollback insurance if the rewrite is worse, and an honest A/B baseline. Without it you know the new skill scores 100%, but not whether the old one would have too.
-
4
Rewrite machineOrchestrator-specific framing replaced with executor profiles (agent turn / DAG node / cron run / human checklist), a fully specified JSON schema, opt-in
[after: N]dependency annotations, and a verification litmus test: if you can’t state how to verify a subtask succeeded, it’s still compound. Decompose further. -
5
Four evals, run blind, twice over machineFour realistic prompts, each executed by two parallel fresh subagents: one with the new skill loaded, one with the old snapshot. Eight runs total. Same prompts for both sides, so any score difference is attributable to the skill text, not the questions.
-
6
Assertion-based grading with written evidence machineEvery run graded against per-eval assertions. Each verdict carries a quoted evidence line, so “fail” is never just a vibe; it points at the exact line that broke.
-
7
Aggregate benchmark machineAll grades roll up into one table: new skill 100% (±0%) vs old skill 74.2% (±29%), at the same wall-clock time and a ~1% token delta. This is the go/no-go signal for promoting the rewrite.
-
8
Human review → feedback → iteration 2 humanA self-contained review page renders every output and grade side by side so the author can audit the grader itself, then export structured feedback. Disagreements become fixed assertions; observed weaknesses become new evals. The same eval set is reused, so iteration 2’s score is directly comparable to iteration 1’s.
Same prompts. Same model. Same cost. Different skill text.
Both configurations answered the same four prompts in fresh subagent sessions. The only variable was which version of the skill file was loaded.
| Metric | Old skill | Rebuilt skill | Delta |
|---|---|---|---|
| Pass rate (mean across evals) | 74.2% ± 28.6% | 100% ± 0% | +25.8 pts |
| Assertions passed | 17 / 23 | 23 / 23 | +6 |
| Wall-clock per run | 47.5s ± 7.7s | 47.4s ± 7.0s | −0.2% (noise) |
| Tokens per run | 76,645 ± 678 | 77,553 ± 484 | +908 (+1.2%) |
Don’t trust the table. Read the runs.
Every eval below shows the actual prompt, both raw outputs side by side, and the grader’s assertion-by-assertion verdicts with evidence. Paths and hosts have been replaced with neutral placeholders; structure and content are otherwise verbatim.
Decompose this goal into subtasks as JSON: Check whether the webapp container on the container host is running via the Portainer API. If it’s running, output [HEALTHY] and halt. If not, pull the latest webapp image, redeploy the stack with PullImage true, then verify the container came up.
{
"goal": "Check whether the webapp container on the
container host is running via the Portainer API.
If it's running, output [HEALTHY] and halt. If
not, pull the latest webapp image, redeploy the
stack with PullImage true, then verify the
container came up.",
"count": 5,
"subtasks": [
{
"id": 1,
"action": "Query the Portainer API
(https://portainer.internal:9443, endpoint 1)
for the webapp container's state — capture
whether it is running.",
"artifacts": ["https://portainer.internal:9443
/api/endpoints/1/docker/containers/json"],
"halt_condition": null,
"depends_on": [],
"verify": "A running/not-running state value for
the webapp container was captured from the
Portainer API response."
},
{
"id": 2,
"action": "If the webapp container is running,
output [HEALTHY] and halt.",
"artifacts": [],
"halt_condition": "[HEALTHY]",
"depends_on": [1],
"verify": "Either [HEALTHY] was emitted and
execution stopped, or the container was not
running and execution continued."
},
{
"id": 3,
"action": "Pull the latest webapp image via the
Portainer API (POST /api/endpoints/1/docker
/images/create?fromImage=acme/webapp&tag=latest).",
"artifacts": ["https://portainer.internal:9443
/api/endpoints/1/docker/images/create"],
"halt_condition": null,
"depends_on": [2],
"verify": "The image-pull API call returned
success and the latest webapp image is present
on the Docker host."
},
{
"id": 4,
"action": "Redeploy the webapp stack via the
Portainer API (PUT /api/stacks/{id} with
PullImage: true).",
"artifacts": ["https://portainer.internal:9443
/api/stacks/{id}"],
"halt_condition": null,
"depends_on": [3],
"verify": "The stack-update API call returned
success (HTTP 200) for the webapp stack."
},
{
"id": 5,
"action": "Verify via the Portainer API that
the webapp container is now running.",
"artifacts": ["https://portainer.internal:9443
/api/endpoints/1/docker/containers/json"],
"halt_condition": null,
"depends_on": [4],
"verify": "The Portainer API reports the webapp
container in a running state after the
redeploy."
}
]
}
{
"atomic_subtasks": 5, ← invented key, no "count"
"subtasks": [
{
"id": 1,
"action": "Query the webapp container state on
the container host via GET
https://portainer.internal:9443/api/endpoints
/3/docker/containers/json?all=true
(authenticate with the stored Portainer
credentials in the same call) — capture the
webapp container's Id and State."
}, ← only id + action;
no artifacts, no
halt_condition, no
depends_on, no verify
{
"id": 2,
"action": "If the webapp container State is
\"running\", output [HEALTHY] and halt."
}, ← halt token buried in
prose, not a field
{
"id": 3,
"action": "Pull the latest webapp image via POST
https://portainer.internal:9443/api/endpoints
/3/docker/images/create?fromImage=acme/webapp
&tag=latest (re-authenticate in the same
call)."
},
{
"id": 4,
"action": "Redeploy the webapp stack via PUT
https://portainer.internal:9443/api/stacks
/<webapp-stack-id>?endpointId=1 with PullImage:
true in the request body (re-authenticate in
the same call)."
},
{
"id": 5,
"action": "Verify the webapp container State is
\"running\" by re-querying GET
https://portainer.internal:9443/api/endpoints
/3/docker/containers/json?all=true — report
the observed state as the final result."
}
]
}
Decompose this goal into subtasks: Nightly repo hygiene on the workstation — run git status across every repo under ~/projects, list any repo with uncommitted changes older than 7 days, write the report to ~/reports/repo-hygiene-<date>.md, and if every repo is clean output [ALL_CLEAN] and stop.
The control case. This is the bread-and-butter markdown contract the nightly cron worker parses, and the old skill was genuinely good at it. The rebuild had to not regress here, and didn’t.
Atomic subtasks (5): 1. Enumerate every git repository under ~/projects (directories containing a .git) — capture the repo list. 2. Run `git status --porcelain` in each enumerated repo — capture which repos have uncommitted changes. 3. For each repo with uncommitted changes, check whether the changes are older than 7 days — capture the flagged repo list. 4. If the flagged repo list is empty, output [ALL_CLEAN] and halt. 5. Write the report listing each flagged repo and its stale uncommitted changes to ~/reports/repo-hygiene-<date>.md.
Atomic subtasks (5): 1. Enumerate every git repository under ~/projects (directories containing a `.git`) — capture the repo list. 2. Run `git status --porcelain` in each enumerated repo — capture per-repo uncommitted changes. 3. If every repo reports zero uncommitted changes, output [ALL_CLEAN] and halt. 4. For each repo with uncommitted changes, check the modification timestamps of the changed files — keep only repos whose uncommitted changes are older than 7 days. 5. Write the report listing those repos (repo path + dirty files + change age) to ~/reports/repo-hygiene-<date>.md.
Decompose this goal into subtasks with dependencies: Generate three marketing assets for the product launch — a hero image, a 30-second video script, and a landing-page headline set — then assemble all three into a review doc at ~/drafts/launch-pack.md and notify me it is ready for review.
The old skill had no dependency convention at all, so the model improvised: it invented a
[deps:] syntax on the spot, then appended explanatory prose after the list.
Reasonable for a human. Fatal for a deterministic parser, because the improvisation
would differ run to run.
Atomic subtasks (5): 1. Generate the launch hero image; save it to ~/drafts/launch-hero-image.png. 2. Write the 30-second launch video script. 3. Write the landing-page headline set. 4. Assemble the hero image, video script, and headline set into a review doc at ~/drafts/launch-pack.md. [after: 1, 2, 3] 5. Notify the requester that ~/drafts/launch-pack.md is ready for review. [after: 4]
Atomic subtasks (5): 1. Generate a hero image for the launch — save the image artifact (or its path/URL) for assembly. [deps: none] ← invented ad-hoc syntax 2. Write a 30-second video script for the launch — capture the script text for assembly. [deps: none] 3. Write a landing-page headline set for the launch — capture the headline variants for assembly. [deps: none] 4. Assemble the hero image (from 1), video script (from 2), and headline set (from 3) into a review doc at ~/drafts/launch-pack.md. [deps: 1, 2, 3] 5. Notify the requester that ~/drafts/launch-pack.md is ready for review. [deps: 4] Dependency notes: subtasks 1–3 are independent and can run in parallel; subtask 4 blocks on all three; subtask 5 blocks on 4. ↑ trailing commentary a parser would choke on
I’m wiring up a LangGraph workflow and need this flattened into nodes: scrape the pricing pages of our three competitor sites, normalize each into a common JSON shape, diff against last month’s snapshot at ./data/pricing/2026-05.json, and write a changes report — if nothing changed the workflow should just end with a NO_CHANGES output.
This eval deliberately tests the whole point of the rebuild: the old skill only knew one kind of executor (a single agent taking turns inside one specific orchestrator). Asked for workflow nodes, it produced a strictly sequential list and bundled all three normalizations into one step. The rebuilt skill recognized a DAG engine and emitted a real fan-out/fan-in graph.
Atomic subtasks (9): 1. Scrape the pricing page of competitor site 1 — capture raw page content. 2. Scrape the pricing page of competitor site 2 — capture raw page content. 3. Scrape the pricing page of competitor site 3 — capture raw page content. 4. Normalize the site 1 scrape into the common pricing JSON shape. [after: 1] 5. Normalize the site 2 scrape into the common pricing JSON shape. [after: 2] 6. Normalize the site 3 scrape into the common pricing JSON shape. [after: 3] 7. Diff the three normalized JSON objects against the snapshot at ./data/pricing/2026-05.json — capture the change set. [after: 4, 5, 6] 8. If the change set is empty, output NO_CHANGES and end the workflow. [after: 7] 9. Write the changes report from the captured change set. [after: 8]
Atomic subtasks (8): 1. Scrape the pricing page of competitor site 1 — capture raw page content. 2. Scrape the pricing page of competitor site 2 — capture raw page content. 3. Scrape the pricing page of competitor site 3 — capture raw page content. 4. For each scraped pricing page, normalize the content into the common JSON shape — one normalized object per competitor. ← three nodes bundled into one; hides the fan-out a DAG engine wants 5. Read last month's snapshot at ./data/pricing/2026-05.json. 6. Diff the three normalized JSON objects against the snapshot from ./data/pricing/2026-05.json — capture the change set. 7. If the change set is empty, output NO_CHANGES and halt. 8. Write the changes report from the non-empty change set. ← strictly sequential; zero dependency info
The diff that bought 26 points.
Four excerpts from the skill file itself, old vs new (lightly redacted: internal agent codenames and private paths replaced with generic terms). Skills are just instructions; every benchmark point above traces to one of these paragraphs.
"…atomic, dispatchable subtasks suitable for [internal orchestrator] / [worker-agent] execution … each runnable by a single agent turn." (One notion of atomic, hard-coded to one system.)
"What 'atomic' means depends on who runs the subtask: | Executor | Atomic unit | | Agent turn (default)| one search / write / run | | Workflow / DAG node | one node's worth | | Headless cron run | one command sequence | | Human checklist | one uninterrupted action |"
"Output is a numbered JSON or Markdown list — one atomic action per line." (That's it. No fields, no example, no rules.)
"Emit exactly one fenced json block, nothing else: { "goal": …, "count": 5, "subtasks": [ { "id", "action", "artifacts", "halt_condition", "depends_on", "verify" } ] } Field rules: count must equal subtasks.length. depends_on is always present ([] = may start immediately). halt_condition is the literal token emitted on short-circuit, or null. verify is required."
verify field in JSON mode.
(No verification concept anywhere in the file.)
"For each candidate subtask, ask: can you state in one sentence how an observer would verify it succeeded? (A concrete observable outcome — a file exists, a command exited 0, a value was captured — not a restatement of the action.) If you cannot articulate verification, the subtask is still compound: decompose further."
"Why this is a skill: pattern recurred in 5 distinct sessions … every time the orchestrator preps a goal for a dispatch agent." (Justifies the skill's existence; documents no contract.)
"Known consumers (do not break): the nightly worker's decompose step invokes this skill headlessly and parses the default markdown contract: a single 'Atomic subtasks (N):' line followed by numbered imperative lines. Any change to the default output shape must keep that contract intact."
The pattern is reusable. The phrasing matters.
You don’t need custom tooling for this; you need to ask for the loop instead of the rewrite. These phrasings reliably steer Claude Code from “edit the file” into “run the methodology”:
Human at the edges, machine in the middle: you make the design decisions before any code, and you audit the grader at the end. Everything between is mechanical, reproducible, and cheap to re-run.
Review, tear down, and rebuild my <skill-name> skill. Snapshot the old version first, then benchmark the rebuild against that snapshot on realistic evals and show me the evidence for every grade.
Before rewriting anything: find every place that actually consumes this skill's output (grep the whole config tree, crons, hooks, and other skills) and tell me exactly what output contract each consumer depends on.
Surface every design decision as a multiple-choice question BEFORE writing any code. I want to decide the tradeoffs; you implement them.
Write 4 realistic eval prompts with explicit pass/fail assertions. Run each prompt in two fresh subagents (one with the new version loaded, one with the old snapshot) so the only variable is the skill text.
Grade every run assertion-by-assertion with a quoted evidence line per verdict. Roll the grades into one benchmark table (pass rate, time, tokens, old vs new), then build me a self-contained review page so I can audit the grader and export feedback for iteration 2.