Code quality at agent pace — a line of stations between every commit and production. Each station measures one number. Any station can stop the line.
Cursor, Claude Code, and Codex write the bulk of the code. We review, shape, gate, ship. Same five people both years — Keith, Reggie, Eli, Bobby, Matias — shipping 3,635 commits across 25 repos in the first four months of 2026. Two years prior, the same five shipped 255.
The toolset has to do the supervising. Every repo on a different testing config is the velocity penalty we cannot afford.
One pattern, applied across every repo, that turns "is this any good?" from a vibe-check into a number we can read.
From this morning's rollout across 10 repos — none of these are tests:
- Missing `waitUntil: "commit"` in `page.waitForURL` — would have hung E2E for 30 s each
- `httpOnly: false` on NCEE staging — only surfaced on invalid-token

Reggie's NCEE testing deck still covers the test-failure class. This deck is the gauntlet that wraps it — the static-analysis, build-artifact, supply-chain, and observability stations every commit walks past on its way to ship.
Every commit walks the gauntlet on its way to production. Each station measures one number. Any station can stop the line — and nothing on the line ever weakens without a signed reason.
"wallace 248833 → 248861 after react 19.2.5". Public reason or it doesn't merge.
No `|| true`. No `--no-verify`. No "warn-only" tier. Half-on stations are the pattern the next agent copies — either the station fails on regression, or it isn't a station.
One unfilled fleet-wide: L15 test authorship — the AI maintains tests under our standards, but nobody is actively growing the suite. Documented as the second open hire in every per-repo SKILL.
No `|| true`. No informational tier. No `--no-verify`. If a step is worth running, it's worth failing on. The same gates catch compliance regressions — secrets sweep, audit baseline, enforce_admins. Also our Vanta substrate.

```jsonc
// .wallace/tenant.json
{
  "totalSize": 248833,
  "selectorCount": 4129,
  "specificity": {
    "max": [0, 4, 4, 0]
  },
  "rules": {
    "empty": { "total": 0 },
    "important": { "total": 0 }
  }
}
```
Today's measured value, written verbatim. Zero headroom.
```js
// scripts/wallace/check.mjs
const baseline = readJson(BASELINE);
const measured = await analyzeCss(BUILT);
for (const [k, v] of Object.entries(baseline)) {
  if (measured[k] > v) {
    fail(`${k}: ${measured[k]} > ${v}`);
  }
}
// Exits non-zero if any metric
// regressed. No informational tier.
```
Any increase fails the gate. The wheel only turns one direction.
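The comparison itself is small enough to test in isolation. A minimal sketch — helper and key names are illustrative, only the shape mirrors `.wallace/tenant.json` — of the same one-way ratchet as a pure function that walks nested baselines:

```javascript
// One-way ratchet: every numeric leaf in the baseline is a ceiling.
// Returns the list of regressions; an empty list means the gate is green.
// Sketch only — the walk is generic, not the shipped check.mjs.
function findRegressions(baseline, measured, path = []) {
  const regressions = [];
  for (const [key, limit] of Object.entries(baseline)) {
    const value = measured?.[key];
    if (typeof limit === "number") {
      // Missing or larger measured value fails; smaller passes silently.
      if (typeof value !== "number" || value > limit) {
        regressions.push(`${[...path, key].join(".")}: ${value} > ${limit}`);
      }
    } else if (limit && typeof limit === "object") {
      // Recurse into nested objects (and arrays, via index keys).
      regressions.push(...findRegressions(limit, value ?? {}, [...path, key]));
    }
  }
  return regressions;
}
```

Because the function is pure, the gate logic can be unit-tested without building CSS at all — the analysis step stays swappable.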
```
$ git log -1 .wallace/tenant.json
chore(wallace): bump totalSize 248833 → 248861 (+28 bytes)

react 19.2.5 ships ~28 bytes of new createRoot scaffolding
we can't drop. Verified bundle diff in PR #4129.
```
Public reason or it doesn't merge. Loosening is ceremony.
This pattern is the whole gauntlet. Every station — built-CSS bytes (Wallace), JS bundle bytes, npm-audit count, fallow clone groups, vitest coverage floor, gate wall-time — is the same three-step template: freeze a number, gate on it, ceremony to raise it. aatm-brain has 9 stations running today.
Provide the agent with up-to-date docs, idiomatic usage, working examples for that vendor's product. Soft guidance — the agent reads them, picks them up, sometimes ignores them.
Output: better-formed code that uses the vendor correctly. Mode: guidance, advisory.
Enforce our bytes-per-build budget, our coverage floor, our npm-audit baseline, our commit-message format. Hard stops — if the threshold regresses, the build is red.
Output: code that complies with company-wide quality contracts. Mode: enforcement, gating.
Vendor MCPs help the agent write better code. The gauntlet decides whether that code ships. We use vendor MCPs in every repo. We also walk every repo through the same gauntlet.
| Gate | IDE · ms · on save | Pre-commit · ~3 s · on commit | Pre-push · ~70 s · on push | CI Test Gate · ~3 min · parallel | Cron · daily · persistent PR |
|---|---|---|---|---|---|
| L1 · ESLint + format | ● | ● | ● | ● | |
| L2 · TypeScript | ● | ● | ● | | |
| L7 · e2e preflight | ● | ● | | | |
| L6 · Fallow preflight (diff-scoped) | ● | ● | | | |
| Branch protection / commitlint / secrets | ● | ● | ● | | |
| L12 · npm audit | ● | ● | ● | | |
| A4 · Dependabot grouped weekly | ● | ● | ● | | |
| L13 · Coverage floor | ● | ● | | | |
| L8 · Wallace built CSS | if CSS | ● | ● | | |
| L5 · Fallow project graph | ● | ● | ● | | |
| L9 · Bundle byte budget | ● | ● | | | |
| CI Test Gate orchestration | | | | ● | |
| A1 · Wall-time budget wrapper | ● | ● | | | |
| L14 · Visual regression (PNG diff) | | | | ● | |
| v3.2 · PostHog telemetry gate | | | | v3.2 | |
| A3 · Daily report PR | | | | | ● |
The cheapest phase wins. Catch at IDE → free. Catch at CI → minutes. Catch in production → a deploy and an apology. Most gates fire at multiple phases on purpose: the same lint runs locally and in CI so a fast local loop never bypasses the slow remote one.
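Gate wall-time is itself one of the frozen numbers (the A1 budget wrapper), because a gate that gets slow is a gate people route around. A minimal sketch — function and message wording are hypothetical, not the shipped wrapper:

```javascript
// A1-style wall-time budget wrapper — a sketch, not the shipped script.
// Runs a gate command, fails red if it exceeds the budget, and prints the
// measured time either way so the number is visible on every run.
import { spawnSync } from "node:child_process";

export function runWithBudget(cmd, args, budgetMs) {
  const start = Date.now();
  const result = spawnSync(cmd, args, { stdio: "inherit" });
  const elapsed = Date.now() - start;
  console.log(
    `gate wall-time: ${(elapsed / 1000).toFixed(1)} s / ${(budgetMs / 1000).toFixed(0)} s budget`
  );
  if (result.status !== 0) return result.status ?? 1; // the gate itself failed
  if (elapsed > budgetMs) {
    // Slowness is a regression too: the wheel only turns one direction.
    console.error("wall-time budget exceeded");
    return 1;
  }
  return 0;
}
```

Wrapping every station in this makes "gate wall-time" a station like any other — raising the budget takes the same public-reason ceremony as raising a byte count.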
"Execute org-wide v3.1 rollout per ~/.claude/plans/put-together-a-plan-reactive-starfish.md. Authorship gate: skip repos not primarily authored by Keith. DO NOT STOP UNTIL ALL ITERATIONS COMPLETE. Every repo finished with our changes working and running on main."
No per-repo instructions. The agent read the global skill, the rollout plan, the per-repo INDEX, and the standards. It picked the iteration order, opened PRs, fixed CI failures, merged 10 PRs in ~6 hours of wall-clock. 9 deployed clean to production.
The mechanism worked before we had a name for it. Reggie hand-rolled custom ESLint plugins, surfaced them via the LSP, and watched the agent self-correct. The agent treats lints as gates and rewrites until they pass.
v3.1 is the same pattern, made repo-canonical: standards surfaced as lints get respected; rules in a doc get ignored.
Every minute in the SKILL pays back across every agent that comes after it.
Adopt v3.1 standard evolution onto a repo that already had 11 of 14 layers wired. Add the new amendments — wall-time budget (A1), SHA-cached pre-push (A1b), JSX prose (L11), visual regression (L14, A2), daily report (A3) — and document the rest as honest gaps in the per-repo SKILL.
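The SHA-cached pre-push (A1b) is a small mechanism: if the full ~70 s suite already passed for this exact HEAD, skip the re-run; any new commit changes the SHA and invalidates the cache. A sketch under stated assumptions — the cache file path and function names are illustrative, not the repo's actual script:

```javascript
// A1b-style SHA cache — sketch. Skip the slow pre-push suite when it has
// already passed for this exact commit; correctness holds because any new
// commit produces a new SHA. Cache path is hypothetical.
import { execSync } from "node:child_process";
import { existsSync, readFileSync, writeFileSync } from "node:fs";

// Current commit — the cache key.
export function headSha() {
  return execSync("git rev-parse HEAD").toString().trim();
}

// True only if the recorded SHA matches the one we're about to push.
export function alreadyValidated(cachePath, sha) {
  return existsSync(cachePath) && readFileSync(cachePath, "utf8").trim() === sha;
}

// Written only after the full suite passes.
export function recordValidated(cachePath, sha) {
  writeFileSync(cachePath, sha + "\n");
}
```

Note this never weakens the gate: a cache hit means the identical tree already went through the full suite, so there is no `--no-verify`-shaped hole.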
The `e2e:preflight` static check caught 5 real `page.waitForURL` calls missing `waitUntil: "commit"` — bugs that would have hung E2E for 30 s each in CI. Static analysis cheaper than runtime hang every time.

`AATM_PREPUSH_FORCE=1` because the audit baseline holds 2 high CVEs out of scope for this PR. The gap is documented in the PR description and the per-repo SKILL. Honest gap, not skipped gate.

5 × 30-second hangs prevented = ~2.5 minutes per CI run × every PR forever. Static check < runtime hang, always.
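The shape of that preflight is simple enough to sketch. A regex version, purely illustrative — the real gate may well be AST-based, and a regex like this would miss calls with nested parentheses in their arguments:

```javascript
// L7-style e2e preflight — a regex sketch of the static check.
// Flags page.waitForURL(...) calls whose options don't pin
// waitUntil: "commit", the pattern that hung E2E runs for 30 s each.
export function findUnpinnedWaitForURL(source) {
  const findings = [];
  const re = /page\.waitForURL\s*\(([^)]*)\)/g;
  let m;
  while ((m = re.exec(source)) !== null) {
    // Keep the call only if its argument list never pins waitUntil.
    if (!/waitUntil\s*:\s*["']commit["']/.test(m[1])) {
      findings.push(m[0]);
    }
  }
  return findings;
}
```

Run over the diff-scoped spec files at pre-commit, a non-empty return fails the gate — milliseconds of scanning instead of a 30-second hang per call in CI.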
| Layer | Before | After |
|---|---|---|
| L1 ESLint | ✓ | ✓ held |
| L2 tsc | ✓ | ✓ held |
| L5 Fallow | ✓ | ✓ held |
| L6 Fallow preflight | ✗ | ✓ gained |
| L7 e2e preflight | ✗ | ✓ gained · caught 5 |
| L8 Wallace CSS | ✓ | ✓ held |
| L9 Bundle budget | ✓ | ✓ held |
| L10 md prose (alex) | ✓ | ✓ held |
| L11 JSX prose | ✗ | ✓ gained |
| L12 npm audit | ✗ baseline | ✗ deferred · documented |
| L13 coverage floor | ✓ 45% | ✓ frozen |
| L14 visual regression | ✗ | ✓ gained |
| A1 budget wrapper | ✓ | ✓ held |
| A1b SHA-cached pre-push | ✓ | ✓ held |
| A3 daily report cron | ✗ | ✓ gained |
| L3/L4 tokens + contrast | ✗ | ✗ documented |
5 layers gained · 9 held at the freeze · 2 gaps explicitly deferred with a paper trail.
Add the new amendments to the most-mature repo in the convoy — wall-time budget, SHA-cached pre-push, visual regression, daily report. Refresh the per-repo SKILL to v3.1 wording. Land the iteration as the reference for the other 9 PRs to cite.
One E2E spec (client-feedback page loads, smoke · global agent) succeeded on retry #3. Not deterministically broken — flaky. v3.1 made the flake visible by failing fast; v3.1 didn't fix it. v3.2 work: quarantine flake-prone tests or fix them. Don't normalize retries.

The stricter pre-commit + gauntlet pattern pushed Bobby off the blueprint repo for two days while we tuned which warnings get treated as hard stops vs. info. Stations that catch quality regressions only matter if the team can still work — calibration is its own line item.
Frozen rot is still rot — but it's visible rot, with a paper trail. Next PR can't make it worse. Burndown is a real workstream, not a someday-refactor.
| Layer | Before | After |
|---|---|---|
| L1 ESLint | ✓ | ✓ held |
| L2 tsc | ✓ | ✓ held |
| L3 theme tokens | ✓ | ✓ held |
| L4 WCAG contrast | ✓ | ✓ held |
| L5 Fallow | ✓ | ✓ baselined 565+990 |
| L8 Wallace CSS | ✓ | ✓ held |
| L9 Bundle budget | ✓ | ✓ held |
| L11 JSX prose | ✓ | ✓ caught "Looser" |
| L12 npm audit | ✓ 1 high | ✗ baseline · burndown |
| L13 coverage floor | ✓ 38% | ✓ frozen at floor |
| L14 visual regression | ✓ | ✓ held |
| A1 budget wrapper | ✗ | ✓ gained · 67.4 s/90 s |
| A1b SHA-cached pre-push | ✗ | ✓ gained |
| A3 daily report cron | ✗ | ✓ gained |
| CI Test Gate drill | flake-prone | flake-prone · v3.2 fix |
| Carpenter burndown | ~unknown | 1,555 · baselined |
3 layers gained · 11 held · 2 visible gaps with named workstreams.
Eight repos started with most layers unfilled. We didn't pretend to fill them. The iteration PR landed three things on every repo — and named the rest as gaps in the per-repo SKILL.
The signal isn't the layers we filled. It's the layers we named as gaps. A documented gap is a hire we can plan; a hidden gap is a fire we'll fight blind.
Production deploys went green for 9. bfd-platform's E2E flake unmasked an old retry-tolerant setup we'd never properly seen — that's a v3.2 fix, not a v3.1 regression.
Daily-report cron 401'd at PR creation across all 10 repos. Cause: the org-wide setting "Allow GitHub Actions to create and approve pull requests" was off; repo-level Actions perms can't override it. ~30 min to find. Now step 0 of any cron-PR workflow.
`cfut_` vs `cfat_`: the org's ~/.config/bfd/cloudflare.env held a `cfut_` wrangler-OAuth token, not a `cfat_` API token. Pages-create + DNS need API tokens; OAuth fails silently mid-flow. A token-type check is now the first line of every CF-touching script.
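That first-line check is a few lines. A sketch — the prefixes mirror the incident above as this post describes them, and the function name is hypothetical; treat the prefix strings as this team's convention rather than Cloudflare documentation:

```javascript
// Token-type preflight — sketch. Fail fast and loudly before any
// Pages-create or DNS call, instead of letting OAuth fail silently mid-flow.
// Prefixes (cfat_/cfut_) are assumptions taken from the incident write-up.
export function checkCloudflareToken(token) {
  if (typeof token !== "string" || token.length === 0) {
    return { ok: false, reason: "no token set" };
  }
  if (token.startsWith("cfut_")) {
    return {
      ok: false,
      reason: "wrangler OAuth token (cfut_) — Pages-create + DNS need an API token",
    };
  }
  if (!token.startsWith("cfat_")) {
    return { ok: false, reason: "unrecognized token prefix" };
  }
  return { ok: true };
}
```

Same ratchet philosophy in miniature: a cheap static check at the top of the script beats a silent runtime failure halfway through a DNS change.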
Token system + contrast gate wired only on bfd-platform. Eight UI-bearing repos ship hand-stitched canvas — every refactor risks brand drift and a11y regression. Single biggest hire on the v3.2 board. One repo per quarter cadence.
Senior, not junior. Goes to any repo, names the missing standard, writes the gate or files the work order. Owns Vanta + compliance alongside quality. One head, not five.
Maintains tests, fixes lint, writes coverage, tightens stations — under the SME's standards. "AI is going to do a 10× better job of writing and maintaining those tests than an intern will." — Reggie
Every station in the gauntlet is one instance of that. CSS bytes, bundle bytes, audit count, coverage floor, gate wall-time. Start with the one drifting in your repo this week.
"I don't code anymore, so I'm not scared… eventually I learn." — Matias, 2026-04-20. The point of every station is that the next person — agent, junior, returning teammate — can't accidentally make it worse.
The team's headline: a metric of quality, not an experiment on quality. One SME unlocked. Same standards across every repo. The toolset doing the supervising. That's how five people ship like fifty without hating ourselves.
~/.claude/skills/code-quality-setup/SKILL.md — symlinked into Claude Code, Codex, and Cursor. Every agent reads it before generating code in the new repo.
Drop the per-repo SKILL template at .cursor/skills/code-quality/SKILL.md. Fill in which layers are filled, pending, or N/A. The honest gap doc is the deliverable.
In priority order: L1+L2 → A1 budget wrapper → A3 daily report → L6/L7 preflights → L5 fallow → L13 coverage → L8/L9 bytes → L12 audit → L10/L11 prose → L3/L4 tokens+contrast → L14 visual regression.
Update ~/.claude/skills/code-quality-setup/per-repo/INDEX.md — name, cron hour, filled vs. pending layers. The index is the org-wide adoption roster.
Open the iter PR. Each layer commit gets its own message. Gaps land documented, not pretended-shipped. PR description names every filled layer + every documented gap.
10 PRs merged · 9 production deploys clean · 7 real issues caught mid-flight · gaps documented, not hidden.