Agents are moving. We needed a way to know — at a glance, on every commit — whether the diff is any good. The Gauntlet is what we built and how we keep it sharp as the toolset compounds.
Same five people, two years apart. We don't write most of the code anymore — we review, shape, gate, ship. The line of supervision can't be a person.
Cursor, Claude Code, and Codex write the bulk of the diff. We can't read every line at this pace. The gates have to.
Missing waitUntil: "commit" on page.waitForURL, which would have hung E2E for 30 s each.
Cookies set httpOnly: false on NCEE staging.
"Looser" flagged by alex as a homophone.
Reggie's NCEE testing deck covers the test-failure class. The Gauntlet wraps it: the static-analysis, build-artifact, supply-chain, and observability stations every commit walks past on its way to ship.
Every commit walks the gauntlet on its way to production. Each station measures one number. Any station can stop the line — and nothing on the line ever weakens without a written reason.
Four checkpoints. Every release is a PR. None can be skipped — not even by the people who built the system.
staging triggers the GitHub Actions staging deploy. A follow-up PR into main triggers the production deploy.
Even the people who own the repo cannot bypass this. The merge button only lights up when both the Gauntlet and the Copilot review have signed off and every Copilot comment has been remediated.
19 layers, four colors. Lexical/semantic catch the obvious. Structure/visual catch the subtle. Bytes/coverage make "good" a number. Supply chain + policy stop upstream rot. Amendments keep the whole thing humane.
waitUntil:"commit" in page.waitForURL.
--update-snapshots.
Same pattern, four examples. Pick any station and the shape is identical.
// .wallace/tenant.json
{
  "totalSize": 248833,
  "selectorCount": 4129,
  "specificity": {
    "max": [0, 4, 4, 0]
  },
  "rules": {
    "empty": { "total": 0 },
    "important": { "total": 0 }
  }
}
Today's measured value, written verbatim. Zero headroom.
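// scripts/wallace/check.mjs
// readJson, analyzeCss, and fail are assumed to be project helpers.
const baseline = readJson(BASELINE);
const measured = await analyzeCss(BUILT);
for (const [k, v] of Object.entries(baseline)) {
  // `>` is only meaningful on numeric leaves; nested metrics
  // (specificity, rules) need flattening before this comparison.
  if (measured[k] > v) {
    fail(`${k}: ${measured[k]} > ${v}`);
  }
}
// Exits non-zero if any metric
// regressed. No informational tier.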
Any increase fails the gate. The wheel only turns one direction.
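The baseline above mixes flat numbers (totalSize) with nested ones (specificity.max, rules.empty.total), so a one-direction check has to reach the numeric leaves. The following is a minimal sketch of that idea, not the project's actual script: flattenMetrics and findRegressions are hypothetical names, and comparing specificity element by element is a simplification of real specificity ordering.

```javascript
// Hypothetical helper: flatten nested metrics into dotted numeric leaves,
// e.g. { rules: { empty: { total: 0 } } } -> { "rules.empty.total": 0 }.
function flattenMetrics(obj, prefix = "", out = {}) {
  for (const [key, value] of Object.entries(obj)) {
    const path = prefix ? `${prefix}.${key}` : key;
    if (typeof value === "number") {
      out[path] = value;
    } else if (value !== null && typeof value === "object") {
      flattenMetrics(value, path, out); // arrays become path.0, path.1, ...
    }
  }
  return out;
}

// Returns one message per metric that grew past its baseline value.
function findRegressions(baseline, measured) {
  const base = flattenMetrics(baseline);
  const live = flattenMetrics(measured);
  const failures = [];
  for (const [path, limit] of Object.entries(base)) {
    const current = live[path];
    if (current === undefined) continue; // metric absent from this build
    if (current > limit) failures.push(`${path}: ${current} > ${limit}`);
  }
  return failures;
}

const baseline = {
  totalSize: 248833,
  selectorCount: 4129,
  specificity: { max: [0, 4, 4, 0] },
  rules: { empty: { total: 0 }, important: { total: 0 } },
};
const measured = {
  totalSize: 248861, // grew by 28 bytes
  selectorCount: 4100, // shrank: allowed
  specificity: { max: [0, 4, 4, 0] },
  rules: { empty: { total: 0 }, important: { total: 1 } },
};

console.log(findRegressions(baseline, measured));
// [ 'totalSize: 248861 > 248833', 'rules.important.total: 1 > 0' ]
```

A shrinking metric produces no message, which is what lets the baseline tighten without ceremony.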
$ git log -1 .wallace/tenant.json
chore(wallace): bump totalSize 248833 → 248861 (+28 bytes)

react 19.2.5 ships ~28 bytes of new createRoot scaffolding we can't drop. Verified bundle diff in PR #4129.
Public reason or it doesn't merge. Loosening is ceremony.
// .bundle/tenant.json
{
  "chunks": {
    "index": { "max": 184231 },
    "vendor": { "max": 308442 },
    "ui": { "max": 64211 }
  }
}
Per-chunk byte budgets. Hash-stripped — any deterministic build re-verifies.
// scripts/bundle/check.mjs
for (const [name, spec] of Object.entries(budget.chunks)) {
  const built = sizeOf(`dist/${name}-*.js`);
  if (built > spec.max) {
    fail(`${name}: ${built} > ${spec.max}`);
  }
}
Compares actual built size against the frozen budget. Fail-stop on overage.
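One way to read "hash-stripped" is that the content hash is dropped from each built filename before it is matched to a budget key. The sketch below assumes files are named `<chunk>-<hash>.js` with no hyphen inside the hash, in line with the `dist/${name}-*.js` glob; chunkNameOf and checkBudgets are hypothetical names, and the file list is passed in rather than read from disk so the logic stays testable.

```javascript
// Assumption: built files look like "<chunk>-<hash>.js" and the hash has no hyphen.
function chunkNameOf(fileName) {
  const match = /^(.+)-[^-]+\.js$/.exec(fileName);
  return match ? match[1] : null;
}

// budget: { chunks: { name: { max } } }; builtFiles: [{ file, bytes }]
function checkBudgets(budget, builtFiles) {
  const sizes = {};
  for (const { file, bytes } of builtFiles) {
    const chunk = chunkNameOf(file);
    if (chunk === null) continue; // not a hashed chunk file
    sizes[chunk] = (sizes[chunk] ?? 0) + bytes;
  }
  const failures = [];
  for (const [name, spec] of Object.entries(budget.chunks)) {
    const built = sizes[name];
    if (built === undefined) {
      failures.push(`${name}: no built file found`);
    } else if (built > spec.max) {
      failures.push(`${name}: ${built} > ${spec.max}`);
    }
  }
  return failures;
}

const budget = {
  chunks: {
    index: { max: 184231 },
    vendor: { max: 308442 },
    ui: { max: 64211 },
  },
};
// Illustrative build output; the hashes are made up.
const builtFiles = [
  { file: "index-a1b2c3d4.js", bytes: 184231 },
  { file: "vendor-9f8e7d6c.js", bytes: 312118 },
  { file: "ui-0a0b0c0d.js", bytes: 64000 },
];

console.log(checkBudgets(budget, builtFiles));
// [ 'vendor: 312118 > 308442' ]
```

Treating a missing chunk as a failure is a design choice in this sketch; a renamed chunk would otherwise slip past its budget unnoticed.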
chore(bundle): vendor chunk 308442 → 312118 (+3.6 KB)

@radix-ui/react-popover 1.0.7 ships extra positioning math needed for the new draft surface. Floor stays free to ratchet back once we drop the old composer.
A loosen ceremony names the cause and a way back.
// .fallow/baseline.json
{
  "cloneGroups": 565,
  "healthFindings": 990,
  "boundaries": {
    "violations": 0
  }
}
Whole-monorepo project graph snapshot. Shrinks freely. Grows only with reason.
$ fallow audit
✓ unused exports held
✓ circular deps held
✓ boundary violations held
✕ clone groups 573 > 565
fail(boundary='clones')
A new duplicate stops the build until either the duplicate goes or the baseline is consciously raised.
chore(fallow): cloneGroups 565 → 542 after dedupe

Pulled the BFDAdmin feedback list/get/screenshot helpers into a shared module. Free ratchet, no PR needed.
Tightening is free. The commit is documentation, not approval.
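The asymmetry between tightening and loosening can be made mechanical: compare the old and new baseline, and only demand a written reason when a number went up. This is a rough sketch under stated assumptions, not the team's tooling. classifyBaselineChange and commitAllowed are hypothetical names, only flat numeric keys are compared, and "a written reason" is approximated as a non-empty commit body.

```javascript
// Compare two flat baselines and report which numeric keys moved in which direction.
function classifyBaselineChange(before, after) {
  const loosened = [];
  const tightened = [];
  for (const key of Object.keys(after)) {
    const oldValue = before[key];
    const newValue = after[key];
    if (typeof oldValue !== "number" || typeof newValue !== "number") continue;
    if (newValue > oldValue) loosened.push(key);
    else if (newValue < oldValue) tightened.push(key);
  }
  return { loosened, tightened };
}

// Assumption: the "reason" lives in the commit body (everything after the subject line).
function commitAllowed(change, commitMessage) {
  if (change.loosened.length === 0) return true; // tightening needs no ceremony
  const [, ...rest] = commitMessage.split("\n");
  return rest.join("\n").trim().length > 0;
}

const dedupe = classifyBaselineChange(
  { cloneGroups: 565, healthFindings: 990 },
  { cloneGroups: 542, healthFindings: 990 },
);
console.log(dedupe); // { loosened: [], tightened: [ 'cloneGroups' ] }
console.log(commitAllowed(dedupe, "chore(fallow): cloneGroups 565 → 542 after dedupe")); // true

const bump = classifyBaselineChange({ cloneGroups: 565 }, { cloneGroups: 573 });
console.log(commitAllowed(bump, "chore(fallow): cloneGroups 565 → 573")); // false
```

A body-length check cannot judge whether the reason is a good one; that part stays with the reviewer.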
// .deps/audit-baseline.json
{
  "high": 0,
  "critical": 0,
  "moderate": 11
}
CVE counts, frozen. Zero high/critical, today's moderate count as the cap.
// scripts/deps/audit-check.mjs
const live = await npmAudit();
for (const sev of ["high", "critical", "moderate"]) {
  if (live[sev] > baseline[sev]) {
    fail(`${sev}: ${live[sev]} > ${baseline[sev]}`);
  }
}
Live `npm audit` against the baseline. New CVEs above floor halt the line.
chore(deps): audit baseline moderate 11 → 9

Patched lodash transitive via dependabot batch 2026-04 group. Two CVEs cleared. Free ratchet.
Dependabot's grouped weekly PR usually takes the count down. The diff is the receipt.
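For the audit station, the per-severity counts can be read from the `metadata.vulnerabilities` object that `npm audit --json` emits on npm 7 and later. The sketch below works on an already-parsed report so it stays self-contained; auditRegressions is a hypothetical name and the sample report is made up for illustration.

```javascript
// auditReport: parsed output of `npm audit --json` (npm 7+ shape assumed).
// baseline: { high, critical, moderate } caps, as in .deps/audit-baseline.json.
function auditRegressions(baseline, auditReport) {
  const live = auditReport?.metadata?.vulnerabilities ?? {};
  const failures = [];
  for (const sev of ["critical", "high", "moderate"]) {
    const count = live[sev] ?? 0;
    const cap = baseline[sev] ?? 0; // a severity with no cap defaults to zero tolerance
    if (count > cap) failures.push(`${sev}: ${count} > ${cap}`);
  }
  return failures;
}

const auditBaseline = { high: 0, critical: 0, moderate: 11 };
// Illustrative report: one new moderate finding above the cap.
const sampleReport = {
  metadata: {
    vulnerabilities: { info: 0, low: 3, moderate: 12, high: 0, critical: 0, total: 15 },
  },
};

console.log(auditRegressions(auditBaseline, sampleReport));
// [ 'moderate: 12 > 11' ]
```

A caller that shells out to npm should expect `npm audit` to exit non-zero whenever it reports findings, so the JSON has to be read from stdout even on a failing exit code.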
Vendor MCPs: provide the agent with up-to-date docs, idiomatic usage, and working examples for that vendor's product. Soft guidance: the agent reads them, picks them up, sometimes ignores them.
Output: better-formed code that uses the vendor correctly. Mode: guidance, advisory.
The Gauntlet: enforce our bytes-per-build budget, our coverage floor, our npm-audit baseline, our commit-message format. Hard stops: if the threshold regresses, the build is red.
Output: code that complies with company-wide quality contracts. Mode: enforcement, gating.
Vendor MCPs help the agent write better code. The gauntlet decides whether that code ships. We use vendor MCPs in every repo. We also walk every repo through the same gauntlet.
| Gate | IDE (ms · on save) | Pre-commit (~3 s · on commit) | Pre-push (~70 s · on push) | CI Test Gate (~3 min · parallel) | Cron (daily · persistent PR) |
|---|---|---|---|---|---|
| L1 · ESLint + format | ● | ● | ● | ● | |
| L2 · TypeScript | ● | ● | ● | ||
| L7 · e2e preflight | ● | ● | |||
| L6 · Fallow preflight (diff-scoped) | ● | ● | |||
| Branch protection / commitlint / secrets | ● | ● | ● | ||
| L12 · npm audit | ● | ● | ● | ||
| A4 · Dependabot grouped weekly | ● | ● | ● | ||
| L13 · Coverage floor | ● | ● | |||
| L8 · Wallace built CSS | if CSS | ● | ● | ||
| L5 · Fallow project graph | ● | ● | ● | ||
| L9 · Bundle byte budget | ● | ● | |||
| CI Test Gate orchestration | ● | ||||
| A1 · Wall-time budget wrapper | ● | ● | |||
| L14 · Visual regression (PNG diff) | ● | ||||
| v3.2 · PostHog telemetry gate | v3.2 | ||||
| A3 · Daily report PR | ● |
The cheapest phase wins. Catch at IDE → free. Catch at CI → minutes. Catch in production → a deploy and an apology. Most gates fire at multiple phases on purpose: the same lint runs locally and in CI so a fast local loop never bypasses the slow remote one.
"Execute org-wide v3.1 rollout per ~/.claude/plans/put-together-a-plan- reactive-starfish.md. Authorship gate: skip repos not primarily authored by Keith. DO NOT STOP UNTIL ALL ITERATIONS COMPLETE. Every repo finished with our changes working and running on main."
No per-repo instructions. The agent read the global skill, the rollout plan, the per-repo INDEX, and the standards. It picked the iteration order, opened PRs, fixed CI failures, merged 10 PRs in ~6 hours of wall-clock. 9 deployed clean to production.
The mechanism worked before we had a name for it. Reggie hand-rolled custom ESLint plugins, surfaced them via the LSP, and watched the agent self-correct. The agent treats lints as gates and rewrites until they pass.
v3.1 is the same pattern, made repo-canonical: standards surfaced as lints get respected; rules in a doc get ignored.
Every minute in the SKILL pays back across every agent that comes after it.
Adopt v3.1 standard evolution onto a repo that already had 11 of 14 layers wired. Add the new amendments — wall-time budget (A1), SHA-cached pre-push (A1b), JSX prose (L11), visual regression (L14, A2), daily report (A3) — and document the rest as honest gaps in the per-repo SKILL.
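The deck names the SHA-cached pre-push amendment (A1b) without spelling out its mechanics. One plausible reading, offered purely as an assumption, is that the hook remembers which commit SHAs already passed and skips the gates when the same commit is pushed again. createShaCache and runPrePush are hypothetical names; a real hook would persist the cache to disk and resolve the SHA from git.

```javascript
// In-memory stand-in for a persisted "SHAs that already passed" cache.
function createShaCache(passedShas = []) {
  const passed = new Set(passedShas);
  return {
    shouldRun: (sha) => !passed.has(sha),
    recordPass: (sha) => passed.add(sha),
  };
}

// runGates is any function returning true when every pre-push gate passed.
function runPrePush(sha, cache, runGates) {
  if (!cache.shouldRun(sha)) return "skipped (cached pass)";
  const ok = runGates();
  if (ok) cache.recordPass(sha); // failures are never cached, so they always re-run
  return ok ? "passed" : "failed";
}

let gateRuns = 0;
const gates = () => {
  gateRuns += 1;
  return true;
};
const cache = createShaCache();

console.log(runPrePush("abc123", cache, gates)); // passed
console.log(runPrePush("abc123", cache, gates)); // skipped (cached pass)
console.log(gateRuns); // 1
```

Caching only passes keeps the shortcut safe: a commit that failed once cannot be waved through on a retry.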
e2e:preflight static check caught 5 real page.waitForURL calls missing waitUntil: "commit", bugs that would have hung E2E for 30 s each in CI. Static analysis is cheaper than a runtime hang every time.
AATM_PREPUSH_FORCE=1 because the audit baseline holds 2 high CVEs out of scope for this PR. The gap is documented in the PR description and the per-repo SKILL. Honest gap, not skipped gate.
5 × 30-second hangs prevented = ~2.5 minutes per CI run × every PR forever. Static check < runtime hang, always.
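A static check for this bug class can be approximated with a plain text scan: find each page.waitForURL( call, read its arguments up to the matching close paren, and flag it when no waitUntil: "commit" option appears. This is a naive sketch and not the repo's e2e:preflight script. findUnsafeWaitForURL is a hypothetical name, and the paren matching ignores parens inside strings or regex literals, which an AST-based check would handle properly.

```javascript
// Returns { line, call } for every page.waitForURL(...) lacking waitUntil: "commit".
function findUnsafeWaitForURL(source) {
  const findings = [];
  const needle = "page.waitForURL(";
  let from = 0;
  while (true) {
    const start = source.indexOf(needle, from);
    if (start === -1) break;
    // Walk forward to the paren that closes the call.
    let depth = 1;
    let i = start + needle.length;
    while (i < source.length && depth > 0) {
      if (source[i] === "(") depth += 1;
      else if (source[i] === ")") depth -= 1;
      i += 1;
    }
    const args = source.slice(start + needle.length, i - 1);
    if (!/waitUntil\s*:\s*["']commit["']/.test(args)) {
      const line = source.slice(0, start).split("\n").length; // 1-based line number
      findings.push({ line, call: source.slice(start, i) });
    }
    from = i;
  }
  return findings;
}

const spec = [
  'await page.waitForURL("**/dashboard");',
  'await page.waitForURL(/\\/login/, { waitUntil: "commit" });',
  'await page.waitForURL((url) => url.pathname === "/done", { timeout: 5000 });',
].join("\n");

console.log(findUnsafeWaitForURL(spec).map((f) => f.line));
// [ 1, 3 ]
```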
| Layer | Before | After |
|---|---|---|
| L1 ESLint | ✓ | ✓ held |
| L2 tsc | ✓ | ✓ held |
| L5 Fallow | ✓ | ✓ held |
| L6 Fallow preflight | ✗ | ✓ gained |
| L7 e2e preflight | ✗ | ✓ gained · caught 5 |
| L8 Wallace CSS | ✓ | ✓ held |
| L9 Bundle budget | ✓ | ✓ held |
| L10 md prose (alex) | ✓ | ✓ held |
| L11 JSX prose | ✗ | ✓ gained |
| L12 npm audit | ✗ baseline | ✗ deferred · documented |
| L13 coverage floor | ✓ 45% | ✓ frozen |
| L14 visual regression | ✗ | ✓ gained |
| A1 budget wrapper | ✓ | ✓ held |
| A1b SHA-cached pre-push | ✓ | ✓ held |
| A3 daily report cron | ✗ | ✓ gained |
| L3/L4 tokens + contrast | ✗ | ✗ documented |
5 layers gained · 9 held at the freeze · 2 gaps explicitly deferred with a paper trail.
Add the new amendments to the most-mature repo in the convoy — wall-time budget, SHA-cached pre-push, visual regression, daily report. Refresh the per-repo SKILL to v3.1 wording. Land the iteration as the reference for the other 9 PRs to cite.
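A wall-time budget wrapper can be as small as timing a task and comparing the elapsed time with a ceiling. The sketch below is an illustration under assumptions, not the A1 implementation: runWithBudget is a hypothetical name, the task is synchronous, and the clock is injectable so the example can be checked without waiting. It reports an overrun after the fact rather than killing a slow task.

```javascript
// Time a task and report whether it stayed inside its wall-time budget.
function runWithBudget(label, budgetSeconds, task, now = Date.now) {
  const startedAt = now();
  const result = task();
  const elapsedSeconds = (now() - startedAt) / 1000;
  return {
    result,
    elapsedSeconds,
    withinBudget: elapsedSeconds <= budgetSeconds,
    summary: `${label}: ${elapsedSeconds.toFixed(1)} s/${budgetSeconds} s`,
  };
}

// Fake clock for the example: the task "takes" 67.4 seconds.
const ticks = [0, 67400];
const fakeClock = () => ticks.shift();

const report = runWithBudget("pre-push", 90, () => "gates ok", fakeClock);
console.log(report.summary); // pre-push: 67.4 s/90 s
console.log(report.withinBudget); // true
```

A gate built on this would fail the run when withinBudget is false, which turns slow tooling into a visible regression instead of a quiet tax.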
client-feedback page loads, smoke - global agent); succeeded on retry #3. Not deterministically broken, just flaky. v3.1 made the flake visible by failing fast. v3.1 didn't fix it. v3.2 work: quarantine flake-prone tests or fix them. Don't normalize retries.
The stricter pre-commit + gauntlet pattern pushed Bobby off the blueprint repo for two days while we tuned which warnings get treated as hard stops vs. info. Stations that catch quality regressions only matter if the team can still work; calibration is its own line item.
Frozen rot is still rot — but it's visible rot, with a paper trail. Next PR can't make it worse. Burndown is a real workstream, not a someday-refactor.
| Layer | Before | After |
|---|---|---|
| L1 ESLint | ✓ | ✓ held |
| L2 tsc | ✓ | ✓ held |
| L3 theme tokens | ✓ | ✓ held |
| L4 WCAG contrast | ✓ | ✓ held |
| L5 Fallow | ✓ | ✓ baselined 565+990 |
| L8 Wallace CSS | ✓ | ✓ held |
| L9 Bundle budget | ✓ | ✓ held |
| L11 JSX prose | ✓ | ✓ caught "Looser" |
| L12 npm audit | ✓ 1 high | ✗ baseline · burndown |
| L13 coverage floor | ✓ 38% | ✓ frozen at floor |
| L14 visual regression | ✓ | ✓ held |
| A1 budget wrapper | ✗ | ✓ gained · 67.4 s/90 s |
| A1b SHA-cached pre-push | ✗ | ✓ gained |
| A3 daily report cron | ✗ | ✓ gained |
| CI Test Gate drill | flake-prone | flake-prone · v3.2 fix |
| Carpenter burndown | ~unknown | 1,555 · baselined |
3 layers gained · 11 held · 2 visible gaps with named workstreams.
Eight repos started with most layers unfilled. We didn't pretend to fill them. The iteration PR landed three things on every repo — and named the rest as gaps in the per-repo SKILL.
The signal isn't the layers we filled. It's the layers we named as gaps. A documented gap is a hire we can plan; a hidden gap is a fire we'll fight blind.
Production deploys went green for 9. bfd-platform's E2E flake unmasked an old retry-tolerant setup we'd never properly seen — that's a v3.2 fix, not a v3.1 regression.
Daily-report cron 401'd at PR creation across all 10 repos. Cause: Allow GitHub Actions to create and approve pull requests was off org-wide. Repo-level Actions perms can't override it. ~30 min to find. Now step 0 of any cron-PR workflow.
cfut_ vs cfat_
Org's ~/.config/bfd/cloudflare.env held a cfut_ wrangler-OAuth token, not a cfat_ API token. Pages-create + DNS need API tokens; OAuth fails silently mid-flow. Token-type check is now the first line of every CF-touching script.
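A token-type check of this kind can be a prefix test that fails loudly before any Cloudflare call is made. The sketch takes the cfut_ and cfat_ prefixes exactly as described above; assertApiToken is a hypothetical name, and how the token gets loaded from the env file is left out.

```javascript
// Throws unless the token looks like a cfat_ API token; returns it otherwise.
function assertApiToken(token) {
  if (typeof token !== "string" || token.length === 0) {
    throw new Error("Cloudflare token is missing");
  }
  if (token.startsWith("cfut_")) {
    throw new Error("Got a cfut_ wrangler-OAuth token; Pages-create and DNS need a cfat_ API token");
  }
  if (!token.startsWith("cfat_")) {
    throw new Error("Unrecognized token prefix; expected cfat_");
  }
  return token;
}

// Placeholder values, not real credentials.
console.log(assertApiToken("cfat_example")); // cfat_example
try {
  assertApiToken("cfut_example");
} catch (err) {
  console.log(err.message);
}
```

Failing at the first line converts a silent mid-flow OAuth failure into an immediate, named error.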
Token system + contrast gate wired only on bfd-platform. Eight UI-bearing repos ship hand-stitched canvas — every refactor risks brand drift and a11y regression. Single biggest hire on the v3.2 board. One repo per quarter cadence.
Senior, not junior. Goes to any repo, names the missing standard, writes the gate or files the work order. Owns Vanta + compliance alongside quality. One head, not five.
Maintains tests, fixes lint, writes coverage, tightens stations — under the SME's standards. "AI is going to do a 10× better job of writing and maintaining those tests than an intern will." — Reggie
Every station in the gauntlet is one instance of that. CSS bytes, bundle bytes, audit count, coverage floor, gate wall-time. Start with the one drifting in your repo this week.
"I don't code anymore, so I'm not scared… eventually I learn." — Matias, 2026-04-20. The point of every station is that the next person — agent, junior, returning teammate — can't accidentally make it worse.
The team's headline: a metric of quality, not an experiment on quality. One SME unlocked. Same standards across every repo. The toolset doing the supervising. That's how five people ship like fifty without hating ourselves.
~/.claude/skills/code-quality-setup/SKILL.md — symlinked into Claude Code, Codex, and Cursor. Every agent reads it before generating code in the new repo.
Drop the per-repo SKILL template at .cursor/skills/code-quality/SKILL.md. Fill in which layers are filled, pending, or N/A. The honest gap doc is the deliverable.
In priority order: L1+L2 → A1 budget wrapper → A3 daily report → L6/L7 preflights → L5 fallow → L13 coverage → L8/L9 bytes → L12 audit → L10/L11 prose → L3/L4 tokens+contrast → L14 visual regression.
Update ~/.claude/skills/code-quality-setup/per-repo/INDEX.md — name, cron hour, filled vs. pending layers. The index is the org-wide adoption roster.
Open the iter PR. Each layer commit gets its own message. Gaps land documented, not pretended-shipped. PR description names every filled layer + every documented gap.
10 PRs merged · 9 production deploys clean · 7 real issues caught mid-flight · gaps documented, not hidden.
Methodology: Jan 1 – Apr 28 in 2024 vs the same 119-day window in 2026. Counting authored commits across all branches, all repos. Same five engineers both years (Keith, Reggie, Eli, Bobby, Matias). Agent-only branches and third-party PRs not counted.
14.3× more commits. No new headcount. The agents wrote the bulk; the gauntlet kept the floor.
25× repo coverage. The gauntlet had to be a portable pattern, not per-repo bespoke.
What's NOT counted: agent-only branches that never merged, third-party PRs, dependabot grouped batches, force-pushed branches, and squashed history pre-2024.