Tech talk · 30 min

Black Flag standards

The BFD
Gauntlet.

Agents are moving. None of us are reading every line anymore. Cursor, Claude Code, and Codex are writing most of the diff. We needed something the toolset could enforce on its own, on every commit, so that "is this any good?" stopped being a vibe-check. The Gauntlet is what we landed on. This is how it works and why we keep tightening it.

Keith Pattison · Black Flag Design 2026-04-28

Black Flag Design · 2026-04-28 · 01 / 17

Why we kept saying this out loud

From the team

The toolset has to do the supervising.

Same five people, two years apart. We don't write most of the code anymore; we review, shape, gate, ship. The line of supervision can't be a person.

Keith · Reggie · Eli · Bobby · Matias

5 · Engineers, both years

25 · Repos active in 2026

14× · Commits 2024 to 2026 ↗

3,635 vs 255 commits
Jan–Apr · same 5 engineers

Active product repos →

aatm-brain · bfd-platform · bfd-widget · bfd-front-door · ncee-interactive-blueprint · meeting-os · day-savers · longevity-strategists · bfd-mcp · bfd-skills · bfd-admin-apps

"If you're gonna jump into a repo and know what's going on quickly, we have to be standardized on some level."

2026-02-09

"You owe us $250 in AI compute, but you didn't have to pay an extra $14,000 for another developer. I'm not really a math guy, but…"

2026-02-23

One pattern, applied across every repo, that turns "is this any good?" from a vibe-check into a number we can read.

Cursor, Claude Code, Codex write the bulk of the diff. We can't read every line at this pace; the toolset has to.

The problem · the toolset has to do the supervising · 02 / 17

Why now · what changed

Why now · agent pace breaks the model

The testing pyramid was built for human pace.
Now code arrives at machine pace, and the model breaks.

"Agents are moving, and AI can ignore rules. It does on a regular basis. Lints are hard stops. Rules are guidelines."

same conversation · 2026-04-20 → 04-27

Pre-agent · craft pace

~2 commits per dev per week · 1 active repo

Same humans wrote and reviewed. Taste caught bad patterns. The pyramid was self-enforced because volume was small.

Post-agent · machine pace

~30 commits per dev per week · 25 active repos

Cursor, Claude Code, Codex write the bulk of every diff. No human reads every line. The pyramid still describes a healthy test suite. It says nothing about whether this commit can ship.

Unit tests · many · fast

Integration · some · medium

Functional · few · slow

E2E · slowest

What the pyramid alone doesn't catch

Each item carries a real number from a recent BFD repo.

Supply chain

14 high-severity npm advisories on bfd-front-door upstream. Live audit + baseline gate stops new CVEs from landing.

Structural drift

565 clone groups + 990 health findings on bfd-platform. 1,555 frozen work orders of CSS decay. Fallow tracks the project graph; Wallace tracks the built CSS.

Operational hazards

5 missing waitUntil: "commit" on page.waitForURL would have hung E2E for 30s each. Cookies set httpOnly: false on NCEE staging. "Looser" flagged by alex as homophone.

The pyramid is right. It's just not enough.

Reggie's NCEE testing deck covers the test-failure class. The Gauntlet wraps it with the static-analysis, build-artifact, supply-chain, and observability stations every commit walks past on its way to ship.

Why now · the pyramid + the gauntlet · 03 / 17

How a change reaches our users

From idea to live · the path every change walks

Every change earns its way
to our users.

Five checkpoints. Every release is a PR. Two human approvals. None can be skipped, not even by the people who built the system.

Author

Build it on a side branch.

An engineer (or an agent on their behalf) writes the change in isolation, away from what's live.

Automated

The Gauntlet runs.

Every quality gate from the next slide runs against the change in CI. If any gate fails, the PR can't merge. Full stop.

Copilot

Copilot reviews the PR.

GitHub Copilot reviews every PR. The agent has to remediate every comment before the PR is allowed to merge.

Stage

PR into staging.

HUMAN APPROVAL · One of us approves and merges. The merge into staging auto-triggers the staging deploy via Actions.

Prod

PR into main.

HUMAN APPROVAL · A second human approval cuts the release. Merge into main auto-triggers the production deploy.

Even the people who own the repo cannot bypass this. The merge button only lights up when the Gauntlet is green, every Copilot comment is resolved, and a teammate has clicked approve. Branch protection enforces it for both staging and main.
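The three merge conditions can be read as one predicate. A minimal sketch, assuming a hypothetical PR-state object: the field names below are illustrative and are not GitHub's API, and in practice branch protection does this evaluation, not our code.

```javascript
// Hypothetical PR state; field names are illustrative, not GitHub's API.
function canMerge(pr) {
  const blockers = [];
  if (!pr.gauntletGreen) blockers.push("gauntlet not green");
  if (pr.unresolvedCopilotComments > 0) {
    blockers.push(`${pr.unresolvedCopilotComments} Copilot comment(s) unresolved`);
  }
  // Assumption: the author's own approval does not count; it has to be a teammate.
  const teammateApprovals = pr.approvals.filter((who) => who !== pr.author);
  if (teammateApprovals.length === 0) blockers.push("no teammate approval");
  return { mergeable: blockers.length === 0, blockers };
}

const blocked = canMerge({
  author: "keith",
  gauntletGreen: true,
  unresolvedCopilotComments: 1,
  approvals: ["reggie"],
});
console.log(blocked.mergeable, blocked.blockers);
```

Returning the list of blockers, not just a boolean, keeps the "why is the button grey?" answer next to the decision.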

Live examples →

aatm-brain #158, adopt the gauntlet · bfd-platform #84, a written "LOOSEN" · ncee-blueprint #68, review-driven burndown

Release flow · branch → staging → main · five checkpoints, two approvals · 04 / 17

Our approach

Black Flag standards · the gauntlet

The BFD Gauntlet.

Inside checkpoint 02 of the release flow. Each station measures one number, any station can stop the line, and nothing on the line ever weakens without a written reason.

PHASE 1 · IDE · on save · ms

PHASE 2 · PRE-COMMIT · ~3 s · on commit

PHASE 3 · PRE-PUSH · ~70 s · on push

PHASE 4 · CI TEST GATE · ~4 min · parallel

PHASE 5 · CRON · daily · persistent PR

Defense in depth: every station fires at the cheapest phase that can catch the bug. Most stations fire at multiple phases so a fast local loop never bypasses the slow remote one.

COMMIT · agent · human

Station	Threshold	Status
Lint + format	0 warnings	✓ pass
TypeScript	strict mode	✓ pass
Project graph	clones held	✓ pass
e2e preflight	no hangs	✓ pass
CSS bytes	at baseline	✓ pass
JS bundle	over budget	✕ FAIL · GATE FAILED, merge blocked
L11 · Prose (alex)	0 flags	✓ pass
L12 · npm audit	at baseline	✓ pass
L13 · Coverage	at floor	✓ pass
L14 · Visual diff	no change	✓ pass
Wall-time	≤ 90s local	✓ pass
Drift cron	daily PR	✓ pass

MERGEABLE · PR check turns green

$ npm run check

▶ L1  eslint              0 warnings ........... PASS  1.4s
▶ L2  typescript          strict ............... PASS  3.9s
▶ L5  fallow audit        0 new clones ......... PASS  4.7s
▶ L7  e2e preflight       0 hangs detected ..... PASS  0.6s
▶ L8  wallace             248,833 bytes (===) .. PASS  0.9s
▶ L9  bundle budget       312,118 / 308,442 .... FAIL  0.5s
                          vendor +3,676 bytes over budget
                          → raise the floor in a LOOSEN commit, or undo the bloat
─────────────────────────────────────────────────────
gauntlet: 1 FAIL · 5 PASS · halt at L9

PRE-COMMIT & PRE-PUSH · Lint, types, fallow preflight, prose, the e2e-hang scan, bundle and CSS bytes all run on the dev's laptop before the commit even leaves. Fast layers, fail-stop, no warn-only tier. Most failures never reach a PR.

CI · TEST GATE · Every PR re-runs every layer in parallel jobs in GitHub Actions. The "merge" button is gated by the test-gate check. Until every station is green, the branch can't merge into staging. Required, not advisory.

NO BYPASS, EVER · Branch protection blocks bypass for everyone, including repo owners. Loosening any threshold takes a commit message naming the new value and the reason. Public reasoning forces the call to be deliberate.
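The fail-stop behavior in the `npm check` transcript can be sketched as a small runner. This is an illustration, not the team's script: the station shape `{ id, name, run }` is an assumption, the `run` functions are stand-ins, and it runs synchronously for brevity.

```javascript
// Each station is { id, name, run }, where run() returns { ok, detail }.
// Hypothetical shape; the real stations shell out to eslint, tsc, fallow, etc.
function runGauntlet(stations) {
  const results = [];
  for (const station of stations) {
    const { ok, detail } = station.run();
    results.push({ id: station.id, name: station.name, ok, detail });
    if (!ok) break; // fail-stop: later stations never run, there is no warn-only tier
  }
  const failed = results.filter((r) => !r.ok);
  const passed = results.length - failed.length;
  const summary =
    failed.length === 0
      ? `gauntlet: ${passed} PASS`
      : `gauntlet: ${failed.length} FAIL · ${passed} PASS · halt at ${failed[0].id}`;
  return { ok: failed.length === 0, results, summary };
}

const demo = runGauntlet([
  { id: "L1", name: "eslint", run: () => ({ ok: true, detail: "0 warnings" }) },
  { id: "L2", name: "typescript", run: () => ({ ok: true, detail: "strict" }) },
  { id: "L9", name: "bundle budget", run: () => ({ ok: false, detail: "312,118 / 308,442" }) },
  { id: "L11", name: "prose", run: () => ({ ok: true, detail: "0 flags" }) },
]);
console.log(demo.summary);
```

A caller would exit non-zero when `ok` is false, which is what lets the same runner back a git hook and a CI job.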

The BFD Gauntlet · the line every commit walks · 05 / 17

One stack, every repo

Our code-quality stack · the gates that travel together

The code-quality stack.
One set of gates, every repo we own.

A working ecosystem of small, opinionated tools, each catching one class of regression. Lexical / semantic catch the obvious. Structure / visual catch the subtle. Bytes / coverage make "good" a number. Supply chain + policy stop upstream rot. Amendments keep the whole thing humane. Each table below covers one gate: what we picked, why, and what we considered instead.

Tool	What we did with it
ESLint 9.x	Chosen. Plugin ecosystem, custom-rule story, every BFD dev knows the failure modes.
Biome	Strong contender. 10x faster but custom rules need a Rust rewrite. Watching v3.
Oxlint	Tracking. Speed is great, missing react-hooks, plugin model unclear.
deno_lint	Considered. Bound to Deno runtime, doesn't fit our Node + Vite stack.
Standard / xo	Considered. Opinionated bundles. We have our own opinions to enforce.

Tool	What we did with it
TypeScript 5.9 strict	Chosen. Best inference, best ecosystem, only credible default for new TS work in 2026.
Flow	Not seriously considered. Effectively dead-end; Meta migrated to TS.
JSDoc-only TS	Considered for tiny libs. Piloted on BFD Skills, didn't extend to apps.
Plain JS	Not seriously considered. Loses every property we care about.

Tool	What we did with it
alex	Chosen. One config, zero false-positives in our corpus, ships everywhere.
Vale	Considered. Powerful, much heavier setup. Right for a real editorial program; overkill for us.
write-good	Open question for v3.2. Complementary, focuses on prose quality.
Custom rules in ESLint markdown plugin	Considered. Limited surface; alex already does this well.

Tool	What we did with it
Custom 80-LOC extractor	Chosen. Fits exactly what we need, no extra runtime.
LinguiJS	Considered. i18n-shaped, much heavier than we need.
react-intl extract	Considered. Same shape as Lingui, same overkill.
Babel plugin	Considered as an alternative AST walker. Equivalent, less idiomatic for our TS codebase.

Tool	What we did with it
Fallow + fallow.cloud	Chosen. Score + boundaries + cloud rollup, all in one tool.
madge	Considered. Strong graph viz, no scoring.
dependency-cruiser	Considered. Excellent boundary rules, no clone detection.
jscpd	Considered. Clone detection only, no graph or boundaries.
SonarQube	Considered. Heavier, paid tier, less BFD-shaped.

Tool	What we did with it
Custom 40-LOC AST scan	Chosen. Exactly the rule we needed, ships on our schedule.
eslint-plugin-playwright	Tracking upstream. Doesn't currently model this rule.
Custom ESLint rule	Tried first. Fired too broadly; harder to limit scope to test files.

Tool	What we did with it
Playwright snapshots	Chosen. Already in the stack, zero added vendor.
Percy	Considered. Best UI for review-as-a-job, paid, vendor-locked.
Chromatic	Considered. Storybook-native, doesn't fit our E2E shape.
jest-image-snapshot	Considered. Ties to Jest, which we don't use.

Tool	What we did with it
@projectwallace/css-analyzer	Chosen. Quality scores beat what we'd hand-roll.
Custom metrics	Tried. Adequate baseline, didn't beat Wallace's quality signal.
cssstats	Considered. Deprecated.
projectwallace.com SaaS	Considered. Local + git history gives us the same value, free.

Tool	What we did with it
Custom rollup plugin	Chosen. Hash-stripped budgets, JSON baseline, fits our shape.
bundlesize	Considered. Travis-only, unmaintained.
size-limit	Considered. Heavier config, opinions about measurement.
bundlewatch	Considered. SaaS-bound, didn't want vendor lock-in.

Tool	What we did with it
Vitest + c8	Chosen. Mirrors Vite, fast on ESM, integration is one config file.
Jest	Considered. Slower on ESM, mocking opinions don't match ours.
Mocha + Chai	Considered. Older shape, more config surface, no integrated coverage.
Node test runner	Considered. Standard-library answer, less ecosystem support for what we need.

Tool	What we did with it
npm audit + baseline	Chosen. Free, sufficient, ~60 LOC.
Snyk	Considered. Best UI, paid per-seat, vendor-locked.
OSV-Scanner	Pilot. Excellent, slower, more false-positives. Revisit v3.2.
GitHub Dependabot alerts	Considered. Notifications-shaped, not gate-shaped.

Tool	What we did with it
Dependabot	Chosen. Free with GitHub, sufficient grouping, low config burden.
Renovate	Open question. More flexible, more config to maintain. Re-evaluate v3.2.
Greenkeeper	Not considered. Deprecated.
Custom CI cron	Considered. Reinventing the wheel.

Tool	What we did with it
GitHub native + terraform-github-provider	Chosen. Rules-as-code, version-controlled, free.
GitLab CE / Gitea	Considered. Self-hosted forge, complexity we don't need.
commitlint	Chosen. Standard for Conventional Commits enforcement.
git-secrets	Chosen. Sufficient. Watching gitleaks for v3.2.
gitleaks	Open question for v3.2. Better entropy detection.

Tool	What we did with it
GitHub Actions reusable workflows	Chosen. Already in the platform, deterministic, easy aggregate.
Nx Cloud DTE	Considered. Overkill for our matrix size today.
Turborepo remote cache	Considered. Same shape as Nx Cloud, same conclusion.
Self-hosted Buildkite / CircleCI	Considered. We don't need a separate forge.

Tool	What we did with it
Custom 30-LOC bash wrapper	Chosen. The design is to feel slowness, not hide it.
turbo	Considered. Hides slowness behind cache. Wrong shape for what we want.
nx caching	Considered. Same shape as turbo, same conclusion.

Tool	What we did with it
Custom ~50 LOC bash with tree-SHA cache	Chosen. Content-addressed, safer scope.
husky + lint-staged caching	Considered. File-set scope is a weaker key.
turbo cache	Considered. Heavier, requires turbo as the runner.

Tool	What we did with it
GitHub Actions cron + persistent PR	Chosen. Calm, durable, lives where PRs live.
Renovate-style notifications	Considered. Notifications get muted; signal decays.
Slack alerts	Considered. Same problem as notifications, plus channel sprawl.
Custom dashboard	Considered. Yet another surface to remember to check.

Tool	What we did with it
Style Dictionary	Likely. Token build-pipeline, deterministic output.
axe-core via axe-playwright	Likely. Standard bridge, fits our existing suite.
WAVE	Considered. Browser-extension-shaped, doesn't fit CI gating.
Pa11y	Considered. CLI-shaped, less ergonomic than axe-core.

Code-quality stack · 19 gates · 4 categories · 1 open gap · 06 / 17

Why the gauntlet works where guidelines don't

The doctrine · diagnosis first, cure second

"It wasn't following the rules that even existed half the time. I didn't feel like adding more rules would really help, because it wasn't following the ones well that were there to begin with."

2026-04-20 · the diagnosis

"If I'm finding that it keeps fucking something up, I'm making it write custom ESLint rules that enforce the shit."

same conversation · the cure

Each station ratchets: tighter is free, loosening is ceremony.

In plain English: every gate's threshold can drop without asking. Raising it takes a written reason in the commit. Quality only ever moves one direction.

CAPTURE · Today's number is the threshold. Don't aspire. Whatever the metric is right now is the freeze point, committed verbatim with zero headroom.

TIGHTEN · When the number gets better, lower the threshold. No ceremony, no discussion. The wheel only turns one direction.

LOOSEN · Raising the number takes a commit message naming the new value AND the reason. Public reasoning forces the call to be deliberate.

NO BYPASS · No silenced failures. No "warn-only" tier. If a step is worth running, it's worth failing the build on. Same gates catch compliance regressions: secrets sweep, audit baseline, enforce_admins.
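Capture, tighten, and fail can be sketched as one decision function. A sketch under assumptions: the `direction` parameter is our addition to cover metrics where "better" means higher (a coverage floor) as well as lower (bytes, clone groups, CVE counts); the status names are hypothetical.

```javascript
// direction: "max" for ceilings (bytes, clone groups, CVE counts),
//            "min" for floors (coverage percentage).
function ratchet(threshold, measured, direction = "max") {
  const worse = direction === "max" ? measured > threshold : measured < threshold;
  const better = direction === "max" ? measured < threshold : measured > threshold;
  if (worse) return { status: "fail", threshold }; // needs a LOOSEN commit or a fix
  if (better) return { status: "tighten", threshold: measured }; // free, no ceremony
  return { status: "hold", threshold };
}

console.log(ratchet(248833, 248861)); // CSS bytes grew: fail
console.log(ratchet(565, 542)); // clone groups dropped: tighten to 542
console.log(ratchet(45, 44.7, "min")); // coverage fell below the floor: fail
```

Note there is no branch that raises a ceiling: loosening only happens through a human-written commit, never through this function.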

The doctrine · gates not guidelines · 07 / 17

Inside one station

Inside a station · same shape, every gate

Every station follows the same pattern:
freeze a number, gate on it, ceremony to raise it.

Different libraries, different file formats, different units, but the shape is identical across the gauntlet. That's why a new station is fast to add and impossible to silently weaken: once you know one, you know them all. Aim: every commit walks the same flow whether it touches CSS, JS, types, deps, or pixels.

1 · Freeze the value

// .wallace/tenant.json
{
  "totalSize":     248833,
  "selectorCount": 4129,
  "specificity": {
    "max": [0, 4, 4, 0]
  },
  "rules": {
    "empty":     { "total": 0 },
    "important": { "total": 0 }
  }
}

Today's measured value, written verbatim. Zero headroom.

2 · Gate on it

// scripts/wallace/check.mjs
const baseline = readJson(BASELINE);
const measured = await analyzeCss(BUILT);

for (const [k, v] of entries(baseline)) {
  if (measured[k] > v) {
    fail(`${k}: ${measured[k]} > ${v}`);
  }
}
// Exits non-zero if any metric
// regressed. No informational tier.

Any increase fails the gate. The wheel only turns one direction.
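The frozen baseline nests some metrics (`specificity.max`, `rules.*.total`), so a gate has to compare leaves, not just top-level keys. A minimal sketch, assuming the measured object mirrors the baseline's shape (that is our assumption, not the analyzer's documented output) and that a missing metric should fail. `findRegressions` and `tupleGreater` are hypothetical helper names.

```javascript
// Compare two specificity-style tuples left to right; true if a > b.
function tupleGreater(a, b) {
  for (let i = 0; i < Math.max(a.length, b.length); i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    if (x !== y) return x > y;
  }
  return false;
}

// Walk the baseline and collect every leaf where the measured value regressed.
// A metric missing from `measured` counts as a failure, never a silent pass.
function findRegressions(baseline, measured, path = []) {
  const failures = [];
  for (const [key, limit] of Object.entries(baseline)) {
    const here = [...path, key];
    const value = measured?.[key];
    if (Array.isArray(limit)) {
      if (!Array.isArray(value) || tupleGreater(value, limit)) {
        failures.push(`${here.join(".")}: ${JSON.stringify(value)} > ${JSON.stringify(limit)}`);
      }
    } else if (typeof limit === "object" && limit !== null) {
      failures.push(...findRegressions(limit, value, here));
    } else if (typeof value !== "number" || value > limit) {
      failures.push(`${here.join(".")}: ${value} > ${limit}`);
    }
  }
  return failures;
}

const frozen = {
  totalSize: 248833,
  selectorCount: 4129,
  specificity: { max: [0, 4, 4, 0] },
  rules: { empty: { total: 0 }, important: { total: 0 } },
};
const built = {
  totalSize: 248861,
  selectorCount: 4129,
  specificity: { max: [0, 4, 4, 0] },
  rules: { empty: { total: 0 }, important: { total: 2 } },
};
console.log(findRegressions(frozen, built));
```

A check script would exit non-zero whenever the returned list is non-empty.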

3 · Raising costs ceremony

$ git log -1 .wallace/tenant.json

chore(wallace): bump totalSize
  248833 → 248861 (+28 bytes)

react 19.2.5 ships ~28 bytes
of new createRoot scaffolding
we can't drop. Verified bundle
diff in PR #4129.

Public reason or it doesn't merge. Loosening is ceremony.
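"Public reason or it doesn't merge" can be enforced mechanically. A sketch of the kind of check a commit-msg hook could run when a threshold goes up, assuming the message must name the old and new values joined by an arrow and carry at least one further body line as the reason. The function name and the exact rules are hypothetical.

```javascript
// Returns a list of problems; an empty list means the loosen is acceptable.
function checkLoosenMessage(message, oldValue, newValue) {
  const problems = [];
  // The change must be named: old value, an arrow, new value.
  const named = new RegExp(`${oldValue}\\s*(→|->)\\s*${newValue}`);
  if (!named.test(message)) {
    problems.push(`message must name ${oldValue} → ${newValue}`);
  }
  // The reason: at least one line after the subject that is not just the numbers.
  const reasonLines = message
    .trim()
    .split("\n")
    .slice(1)
    .map((line) => line.trim())
    .filter((line) => line !== "" && !named.test(line));
  if (reasonLines.length === 0) {
    problems.push("message must give a reason in the body");
  }
  return problems;
}

const loosen = [
  "chore(wallace): bump totalSize",
  "  248833 → 248861 (+28 bytes)",
  "",
  "react 19.2.5 ships ~28 bytes",
  "of new createRoot scaffolding",
  "we can't drop.",
].join("\n");
console.log(checkLoosenMessage(loosen, 248833, 248861));
```

The check can only confirm that a reason exists; whether the reason is good is still the reviewer's call.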

1 · Freeze the value

// .bundle/tenant.json
{
  "chunks": {
    "index":  { "max": 184231 },
    "vendor": { "max": 308442 },
    "ui":     { "max":  64211 }
  }
}

Per-chunk byte budgets. Hash-stripped, so any deterministic build re-verifies.

2 · Gate on it

// scripts/bundle/check.mjs
for (const [name, spec] of entries(budget.chunks)) {
  const built = sizeOf(`dist/${name}-*.js`);
  if (built > spec.max) {
    fail(`${name}: ${built} > ${spec.max}`);
  }
}

Compares actual built size against the frozen budget. Fail-stop on overage.
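"Hash-stripped" is what lets one budget file survive every rebuild. A minimal sketch, assuming `[name]-[hash].js` file names with no hyphen inside the hash; `chunkName` and `checkBudget` are hypothetical helpers, and the file list would come from the build output, not a hard-coded array.

```javascript
// "vendor-8d2f01ab.js" -> "vendor". Assumes "[name]-[hash].js" with no "-" in the hash.
function chunkName(fileName) {
  return fileName.replace(/-[A-Za-z0-9_]+\.js$/, "");
}

// builtFiles: [{ fileName, bytes }]. Chunks without a budget entry are ignored here.
function checkBudget(budget, builtFiles) {
  const sizes = {};
  for (const { fileName, bytes } of builtFiles) {
    const name = chunkName(fileName);
    sizes[name] = (sizes[name] ?? 0) + bytes;
  }
  const failures = [];
  for (const [name, spec] of Object.entries(budget.chunks)) {
    const built = sizes[name] ?? 0;
    if (built > spec.max) {
      failures.push(`${name}: ${built} > ${spec.max} (+${built - spec.max} bytes over budget)`);
    }
  }
  return failures;
}

const budget = {
  chunks: {
    index: { max: 184231 },
    vendor: { max: 308442 },
    ui: { max: 64211 },
  },
};
const overBudget = checkBudget(budget, [
  { fileName: "index-1a2b3c4d.js", bytes: 184231 },
  { fileName: "vendor-8d2f01ab.js", bytes: 312118 },
  { fileName: "ui-55aa77cc.js", bytes: 64000 },
]);
console.log(overBudget);
```

One design question the sketch leaves open: whether a brand-new chunk with no budget entry should fail the gate too.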

3 · Raising costs ceremony

chore(bundle): vendor chunk
  308442 → 312118 (+3.6 KB)

@radix-ui/react-popover 1.0.7
ships extra positioning math
needed for the new draft surface.
Floor stays free to ratchet back
once we drop the old composer.

A loosen ceremony names the cause and a way back.

1 · Freeze the value

// .fallow/baseline.json
{
  "cloneGroups":   565,
  "healthFindings": 990,
  "boundaries": {
    "violations": 0
  }
}

Whole-monorepo project graph snapshot. Shrinks freely. Grows only with reason.

2 · Gate on it

$ fallow audit
✓ unused exports         held
✓ circular deps          held
✓ boundary violations    held
✕ clone groups   573 > 565
       fail(boundary='clones')

A new duplicate stops the build until either the duplicate goes or the baseline is consciously raised.

3 · Tighten ceremony

chore(fallow): cloneGroups
  565 → 542 after dedupe

Pulled the BFDAdmin feedback
list/get/screenshot helpers
into a shared module. Free
ratchet, no PR needed.

Tightening is free. The commit is documentation, not approval.

1 · Freeze the value

// .deps/audit-baseline.json
{
  "high":     0,
  "critical": 0,
  "moderate": 11
}

CVE counts, frozen. Zero high/critical, today's moderate count as the cap.

2 · Gate on it

// scripts/deps/audit-check.mjs
const live = await npmAudit();
for (const sev of ["high", "critical", "moderate"]) {
  if (live[sev] > baseline[sev]) {
    fail(`${sev}: ${live[sev]} > ${baseline[sev]}`);
  }
}

Live `npm audit` against the baseline. New CVEs above floor halt the line.
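For the live side of the comparison, `npm audit --json` reports per-severity counts under `metadata.vulnerabilities`. A sketch that reads those counts from a captured JSON string; `checkAudit` is a hypothetical name, and the sample report is trimmed to the fields used here.

```javascript
// auditJson: the stdout of `npm audit --json`, captured as a string.
// baseline: frozen caps per severity, e.g. the contents of audit-baseline.json.
function checkAudit(auditJson, baseline) {
  const counts = JSON.parse(auditJson).metadata.vulnerabilities;
  const failures = [];
  for (const [severity, cap] of Object.entries(baseline)) {
    const live = counts[severity] ?? 0;
    if (live > cap) failures.push(`${severity}: ${live} > ${cap}`);
  }
  return failures;
}

// Trimmed sample report; a real one also carries per-package details.
const sampleReport = JSON.stringify({
  metadata: {
    vulnerabilities: { info: 0, low: 3, moderate: 12, high: 0, critical: 0, total: 15 },
  },
});
const auditBaseline = { high: 0, critical: 0, moderate: 11 };
console.log(checkAudit(sampleReport, auditBaseline));
```

Severities absent from the baseline (here `low` and `info`) are not gated, which matches the frozen file above.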

3 · Tighten ceremony

chore(deps): audit baseline
  moderate 11 → 9

Patched lodash transitive
via dependabot batch 2026-04
group. Two CVEs cleared.
Free ratchet.

Dependabot's grouped weekly PR usually takes the count down. The diff is the receipt.

Inside a station · freeze · gate · ceremony · 08 / 17

vs vendor MCPs

Eli's question, answered with Reggie's war story

"Are we contriving this in a way that's so 2024? Sentry, Convex, Clerk all have approved Claude and Cursor plugins already…"

practice dev · 2026-04-27

"I used the Claude AWS MCP to inspect the account and make a migration plan. Then the Cloudflare MCP verified it would work. Moved it all over and crossed my fingers, and it was fine."

practice dev · 2026-04-20 · the proof point

Vendor MCPs teach the AI how to use a library.
The BFD Gauntlet enforces our standards. Two different jobs.

VENDOR MCPs · LIBRARY KNOWLEDGE

Sentry · Convex · Clerk · etc.

Provide the agent with up-to-date docs, idiomatic usage, working examples for that vendor's product. Soft guidance: the agent reads them, picks them up, sometimes ignores them.

"How do I set up Clerk middleware?"
"What's the right way to query Convex from a server action?"
"What Sentry tags should I attach to this error?"

Output: better-formed code that uses the vendor correctly. Mode: guidance, advisory.

THE GAUNTLET · COMPANY STANDARDS

Black Flag standards · gates that fail builds.

Enforce our bytes-per-build budget, our coverage floor, our npm-audit baseline, our commit-message format. Hard stops: if the threshold regresses, the build is red.

"This PR adds 28 KB to the CSS bundle. Fails Wallace."
"This change drops coverage from 45 % to 44.7 %. Fails L13."
"This commit raises the wallace threshold without a public reason. Fails commit-msg."

Output: code that complies with company-wide quality contracts. Mode: enforcement, gating.

Both, not either. They live at different layers.

Vendor MCPs help the agent write better code. The gauntlet decides whether that code ships. We use vendor MCPs in every repo. We also walk every repo through the same gauntlet.

What this isn't · vendor MCPs vs Black Flag gates · 09 / 17

When each gate fires

Phase timing · catch the same problem at the cheapest possible phase

Each gate runs at the earliest cheap phase.
Defense in depth: same gate, multiple phases.

Gate	IDE (ms · on save)	Pre-commit (~3 s · on commit)	Pre-push (~70 s · on push)	CI Test Gate (~3 min · parallel)	Cron (daily · persistent PR)
L1 · ESLint + format	●	●	●	●
L2 · TypeScript		●	●	●
L7 · e2e preflight			●	●
L6 · Fallow preflight (diff-scoped)		●	●
Branch protection / commitlint / secrets		●	●	●
L12 · npm audit			●	●	●
A4 · Dependabot grouped weekly			●	●	●
L13 · Coverage floor			●	●
L8 · Wallace built CSS		if CSS	●	●
L5 · Fallow project graph		●	●	●
L9 · Bundle byte budget			●	●
CI Test Gate orchestration				●
A1 · Wall-time budget wrapper			●	●
L14 · Visual regression (PNG diff)				●
v3.2 · PostHog telemetry gate					v3.2
A3 · Daily report PR					●

The cheapest phase wins. Catch at IDE → free. Catch at CI → minutes. Catch in production → a deploy and an apology. Most gates fire at multiple phases on purpose: the same lint runs locally and in CI so a fast local loop never bypasses the slow remote one.
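The phase table can also live as data, so a hook or CI job can ask which gates it owns. A small sketch using five rows from the table; the phase keys and helper names are hypothetical.

```javascript
// Phases in order of cost, cheapest first.
const PHASES = ["ide", "pre-commit", "pre-push", "ci", "cron"];

// A subset of the table above: gate -> phases where it fires.
const GATES = {
  "L1 · ESLint + format": ["ide", "pre-commit", "pre-push", "ci"],
  "L2 · TypeScript": ["pre-commit", "pre-push", "ci"],
  "L7 · e2e preflight": ["pre-push", "ci"],
  "L14 · Visual regression": ["ci"],
  "A3 · Daily report PR": ["cron"],
};

// Which gates should this phase run?
function gatesFor(phase) {
  return Object.keys(GATES).filter((gate) => GATES[gate].includes(phase));
}

// Where is the earliest (cheapest) place this gate can catch a problem?
function cheapestPhase(gate) {
  return PHASES.find((phase) => GATES[gate].includes(phase));
}

console.log(gatesFor("pre-push"));
console.log(cheapestPhase("L7 · e2e preflight"));
```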

Phase timing · same gate, multiple phases · 10 / 17

How it performed in real codebases

The rollout · 11 repos · one prompt · one morning

Agents don't read your docs.
Agents react to your linter.

Standards expressed as prose get skipped. Standards expressed as lints, type errors, and red CI checks get rewritten until they pass. Get the toolset to say it, and the agent will do it.

"Execute org-wide v3.1 rollout per
~/.claude/plans/put-together-a-plan-
reactive-starfish.md.

Authorship gate: skip repos not primarily
authored by Keith.

DO NOT STOP UNTIL ALL ITERATIONS COMPLETE.
Every repo finished with our changes
working and running on main."

No per-repo prompting. The agent read the plan, the per-repo INDEX, and the standards SKILL, then picked the order, opened PRs, fixed CI failures, and merged. The thin directive worked because the standards underneath it were thick.

The mechanism, before we had a name for it

"There's an ESLint rule now that says every route in this array has to start with /analysis based on the file name, or else the linter gets pissed off."

practice dev · 2026-04-20 · the in-repo plugin

Hand-rolled lint rules surfaced through the language server. The agent reacted to the red squiggles and rewrote until they were gone. No prompt change, no instruction file. The toolset said it, the agent did it.
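The check behind a rule like the `/analysis` one is small. A sketch of the logic as a plain function, not as an ESLint rule: the file-naming convention (`analysis.routes.ts` implies `/analysis`) and the helper names are assumptions, and the real rule would walk the AST and report through the linter.

```javascript
// "src/routes/analysis.routes.ts" -> "/analysis" (hypothetical naming convention).
function expectedPrefix(filePath) {
  const base = filePath.split("/").pop();
  return "/" + base.split(".")[0];
}

// Returns the routes that do not live under the prefix implied by the file name.
function routeViolations(filePath, routes) {
  const prefix = expectedPrefix(filePath);
  return routes.filter((route) => route !== prefix && !route.startsWith(prefix + "/"));
}

console.log(
  routeViolations("src/routes/analysis.routes.ts", [
    "/analysis",
    "/analysis/summary",
    "/reports/1",
    "/analysis-old",
  ])
);
```

Matching on `prefix + "/"` instead of a bare `startsWith(prefix)` keeps look-alikes such as `/analysis-old` from slipping through.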

"I'm running SonarQube locally, barely set up, default rule set. The agents are reacting to them as lints via the LSP, even though I haven't tuned anything."

practice dev · 2026-04-27 · the productized version

v3.1 is that same pattern, made repo-canonical: every standard we care about gets a gate. Gates fail builds. Failures get rewritten. The standard takes care of itself.

Thin prompt + thick standards ≫ thick prompt + thin standards.

The shortest, dumbest directive on top of a real gauntlet beats a 4,000-word prompt running against a CONTRIBUTING.md.

The rollout · 11 PRs · agents react to lints, not docs · 11 / 17

Real codebase · #1 of 3 walkthroughs

aatm-brain · iter 2 · PR #158 · daily report 07:00 UTC

aatm-brain · most layers wired before;
now backed by amendments + a real catch.

What we asked the agent to do

Adopt v3.1 standard evolution onto a repo that already had 11 of 14 layers wired. Add the new amendments: wall-time budget (A1), SHA-cached pre-push (A1b), JSX prose (L11), visual regression (L14, A2), daily report (A3). Document the rest as honest gaps in the per-repo SKILL.

What actually happened

The new e2e:preflight static check caught 5 real page.waitForURL calls missing waitUntil: "commit", bugs that would have hung E2E for 30 s each in CI. A static check is cheaper than a runtime hang every time.
Pre-push had to be force-pushed with AATM_PREPUSH_FORCE=1 because the audit baseline holds 2 high CVEs out of scope for this PR. The gap is documented in the PR description and the per-repo SKILL. Honest gap, not skipped gate.
Coverage at 45 % was frozen as the new floor, a gate against regression. Growing the suite is a separate workstream.
Theme tokens + WCAG contrast deferred: the biggest standing investment for v3.2.

The L7 preflight paid for itself in one PR.

5 × 30-second hangs prevented = ~2.5 minutes per CI run × every PR forever. Static check < runtime hang, always.
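The idea behind the preflight fits in a few lines. The deck's check is an AST scan; the sketch below is a plain text scan instead, which is simpler but can be fooled by parentheses inside string literals. `findHangRisks` is a hypothetical name.

```javascript
// Flags every page.waitForURL(...) call whose arguments lack waitUntil: "commit".
// Text-based approximation of the AST scan described above.
function findHangRisks(source) {
  const risks = [];
  const needle = ".waitForURL(";
  let from = 0;
  while (true) {
    const start = source.indexOf(needle, from);
    if (start === -1) break;
    // Walk to the matching close paren so multi-line calls are covered.
    let depth = 0;
    let end = start + needle.length - 1;
    for (; end < source.length; end++) {
      if (source[end] === "(") depth++;
      else if (source[end] === ")" && --depth === 0) break;
    }
    const call = source.slice(start, end + 1);
    if (!/waitUntil\s*:\s*["']commit["']/.test(call)) {
      const line = source.slice(0, start).split("\n").length;
      risks.push({ line, call });
    }
    from = end + 1;
  }
  return risks;
}

const spec = [
  'await page.waitForURL("/dashboard");',
  'await page.waitForURL("/login", { waitUntil: "commit" });',
  "await page.waitForURL(",
  '  (url) => url.pathname === "/done",',
  ");",
].join("\n");
console.log(findHangRisks(spec).map((risk) => risk.line));
```

Run over the test directory at pre-push, a scan like this costs well under a second, against 30 s per missed call at runtime.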

aatm-brain · before → after by layer

Layer	Before	After
L1 ESLint	✓	✓ held
L2 tsc	✓	✓ held
L5 Fallow	✓	✓ held
L6 Fallow preflight	✗	✓ gained
L7 e2e preflight	✗	✓ gained · caught 5
L8 Wallace CSS	✓	✓ held
L9 Bundle budget	✓	✓ held
L10 md prose (alex)	✓	✓ held
L11 JSX prose	✗	✓ gained
L12 npm audit	✗ baseline	✗ deferred · documented
L13 coverage floor	✓ 45%	✓ frozen
L14 visual regression	✗	✓ gained
A1 budget wrapper	✓	✓ held
A1b SHA-cached pre-push	✓	✓ held
A3 daily report cron	✗	✓ gained
L3/L4 tokens + contrast	✗	✗ documented

5 layers gained · 9 held at the freeze · 2 gaps explicitly deferred with a paper trail.

aatm-brain · iter 2 · PR #158 · 12 / 17

Real codebase · #2 of 3 walkthroughs

bfd-platform · iter 1 · PR #82 · the reference implementation

bfd-platform · we baselined 1,555 work orders
rather than pretend we'd cleaned them.

What we asked the agent to do

Add the new amendments to the most-mature repo in the convoy: wall-time budget, SHA-cached pre-push, visual regression, daily report. Refresh the per-repo SKILL to v3.1 wording. Land the iteration as the reference for the other 9 PRs to cite.

What actually happened

Baselined 565 fallow clone groups + 990 health findings, accumulated structural debt the new layer surfaced. Frozen as the v3.1 baseline so it can't get worse. Burndown is a separate workstream.
Test Gate flaked twice. CI Test Gate failed with 2 E2E suites timing out (client-feedback page loads, smoke - global agent); succeeded on retry #3. Not deterministically broken, just flaky. v3.1 made the flake visible by failing fast. v3.1 didn't fix it. v3.2 work: quarantine flake-prone tests or fix them. Don't normalize retries.
Wall-time wrapper (A1) measured the full pipeline at 67.4 s, under the 90 s ceiling with 22.6 s of headroom.
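The wall-time station is the same freeze-and-gate shape applied to seconds. The deck's wrapper is bash; the sketch below shows the logic in JavaScript, with a hypothetical `runWithBudget` helper and an injectable clock so it can be exercised without waiting.

```javascript
// Runs a task, always reports the elapsed time, and fails when over the ceiling.
// `now` returns milliseconds; it defaults to the global performance.now().
function runWithBudget(label, ceilingSeconds, task, now = () => performance.now()) {
  const started = now();
  task();
  const elapsedSeconds = (now() - started) / 1000;
  const headroomSeconds = Math.round((ceilingSeconds - elapsedSeconds) * 10) / 10;
  const ok = elapsedSeconds <= ceilingSeconds;
  // Print the number on every run: the point is to feel slowness, not hide it.
  console.log(`${label}: ${elapsedSeconds.toFixed(1)} s / ${ceilingSeconds} s · ${ok ? "PASS" : "FAIL"}`);
  return { ok, elapsedSeconds, headroomSeconds };
}

// Fake clock standing in for a 67.4 s pipeline run.
const ticks = [0, 67400];
const measuredRun = runWithBudget("gauntlet", 90, () => {}, () => ticks.shift());
console.log(measuredRun.headroomSeconds);
```

Printing on success as well as failure is deliberate: drift toward the ceiling stays visible long before it trips the gate.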

The honest tradeoff

The stricter pre-commit + gauntlet pattern pushed Bobby off the blueprint repo for two days while we tuned which warnings get treated as hard stops vs. info. Stations that catch quality regressions only matter if the team can still work; calibration is its own line item.

The honest baseline beats the fake clean slate.

Frozen rot is still rot, but it's visible rot, with a paper trail. Next PR can't make it worse. Burndown is a real workstream, not a someday-refactor.

bfd-platform · before → after by layer

Layer	Before	After
L1 ESLint	✓	✓ held
L2 tsc	✓	✓ held
L3 theme tokens	✓	✓ held
L4 WCAG contrast	✓	✓ held
L5 Fallow	✓	✓ baselined 565+990
L8 Wallace CSS	✓	✓ held
L9 Bundle budget	✓	✓ held
L11 JSX prose	✓	✓ caught "Looser"
L12 npm audit	✓ 1 high	✗ baseline · burndown
L13 coverage floor	✓ 38%	✓ frozen at floor
L14 visual regression	✓	✓ held
A1 budget wrapper	✗	✓ gained · 67.4 s/90 s
A1b SHA-cached pre-push	✗	✓ gained
A3 daily report cron	✗	✓ gained
CI Test Gate drill	flake-prone	flake-prone · v3.2 fix
Carpenter burndown	~unknown	1,555 · baselined

3 layers gained · 11 held · 2 visible gaps with named workstreams.

bfd-platform · iter 1 · PR #82 · with honest tradeoff · 13 / 17

Real codebase · #3 of 3 + day-savers

ncee + 7 others · min-viable adoption · plus what we wouldn't have predicted

ncee, front-door, mcp, playbook, style-guide, cli, widget, muster.
Min-viable. Honest. Documented.

What "min-viable" actually shipped

Eight repos started with most layers unfilled. We didn't pretend to fill them. The iteration PR landed three things on every repo, and named the rest as gaps in the per-repo SKILL.

Wall-time budget wrapper (A1): every repo now has a measured ceiling. Drift visible the moment it appears.
Daily report cron (A3): every repo has a persistent-PR log. The day the org-perm flips, every cron starts firing.
Per-repo SKILL with explicit gaps: ncee documents 8 unfilled layers; bfd-front-door documents 14+ upstream Astro CVEs; bfd-cli documents node:test → vitest as deferred.

The signal isn't the layers we filled. It's the layers we named as gaps. A documented gap is a hire we can plan; a hidden gap is a fire we'll fight blind.

9 of 10 repos deployed clean. The 10th flaked twice and passed on retry.

Production deploys went green for 9. bfd-platform's E2E flake unmasked an old retry-tolerant setup we'd never properly seen; that's a v3.2 fix, not a v3.1 regression.

Day-savers we'll write down for the next agent

Org-level GitHub setting

The PR-create permission flip

Daily-report cron 401'd at PR creation across all 10 repos. Cause: the "Allow GitHub Actions to create and approve pull requests" setting was off org-wide. Repo-level Actions perms can't override it. ~30 min to find. Now step 0 of any cron-PR workflow.

Cloudflare token type confusion

cfut_ vs cfat_

Org's ~/.config/bfd/cloudflare.env held a cfut_ wrangler-OAuth token, not a cfat_ API token. Pages-create + DNS need API tokens; OAuth fails silently mid-flow. Token-type check is now the first line of every CF-touching script.
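A token-type check like the one described can be a two-line guard. A sketch that relies only on the two prefixes named above; the function names and the error wording are hypothetical.

```javascript
// Classify a Cloudflare credential by the prefixes named in the incident above.
function cloudflareTokenKind(token) {
  if (token.startsWith("cfat_")) return "api";
  if (token.startsWith("cfut_")) return "wrangler-oauth";
  return "unknown";
}

// Fail loudly up front instead of silently mid-flow.
function requireApiToken(token) {
  const kind = cloudflareTokenKind(token);
  if (kind !== "api") {
    throw new Error(`expected a cfat_ API token, got ${kind}`);
  }
  return token;
}

console.log(cloudflareTokenKind("cfut_example"));
console.log(cloudflareTokenKind("cfat_example"));
```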

Largest standing investment · v3.2

L3/L4 · tokens + WCAG contrast on 8 of 10 repos

Token system + contrast gate wired only on bfd-platform. Eight UI-bearing repos ship hand-stitched canvas; every refactor risks brand drift and a11y regression. Single biggest hire on the v3.2 board. One repo per quarter cadence.

The convoy + day-savers · 2026-04-28 · 14 / 17

What's in it for you

The hiring debate · two voices, five weeks apart

"Do we want a recent college grad to handle maintaining and building out a testing suite? Has AI changed the way we should be thinking about the next deal?"

practice dev · 2026-04-27

"I'm starting to feel like we may never have to hire another person again. I couldn't be more excited."

practice dev · 2026-03-23

Don't hire five juniors. Hire one technical-QA SME.
Write the standards. Let the agent maintain the code under them.

THE HIRING ANSWER

One red-team / technical-QA SME.

Senior, not junior. Goes to any repo, names the missing standard, writes the gate or files the work order. Owns Vanta + compliance alongside quality. One head, not five.

WHAT THE AGENT DOES INSTEAD

The work that used to need a junior, bounded enough for the agent now.

Maintains tests, fixes lint, writes coverage, tightens stations, all under the SME's standards. "AI is going to do a 10× better job of writing and maintaining those tests than an intern will." · Reggie

THE SINGLE TAKEAWAY

Pick one number your repo can measure. Freeze it. Make the build fail on regression.

Every station in the gauntlet is one instance of that. CSS bytes, bundle bytes, audit count, coverage floor, gate wall-time. Start with the one drifting in your repo this week.

The gauntlet protects the inheritor, not the author.

"I don't code anymore, so I'm not scared… eventually I learn." · Matias, 2026-04-20. The point of every station is that the next person (agent, junior, returning teammate) can't accidentally make it worse.

The team's headline: a metric of quality, not an experiment on quality. One SME unlocked. Same standards across every repo. The toolset doing the supervising. That's how five people ship like fifty without hating ourselves.

The hiring debate · gauntlet protects the inheritor · 15 / 17

From one station to the full gauntlet

Adoption playbook · five steps for every new repo

From one station to the full BFD Gauntlet.
Same five steps that ran the rollout.

STEP 1

Read the standards

~/.claude/skills/code-quality-setup/SKILL.md, symlinked into Claude Code, Codex, and Cursor. Every agent reads it before generating code in the new repo.

STEP 2

Per-repo SKILL

Drop the per-repo SKILL template at .cursor/skills/code-quality/SKILL.md. Mark each layer as filled, pending, or N/A. The honest gap doc is the deliverable.

STEP 3

Hire the layers

In priority order: L1+L2 → A1 budget wrapper → A3 daily report → L6/L7 preflights → L5 fallow → L13 coverage → L8/L9 bytes → L12 audit → L10/L11 prose → L3/L4 tokens+contrast → L14 visual regression.

STEP 4

Add to the index

Update ~/.claude/skills/code-quality-setup/per-repo/INDEX.md: name, cron hour, filled vs. pending layers. The index is the org-wide adoption roster.

STEP 5

Ship

Open the iter PR. Each layer commit gets its own message. Gaps land documented, not pretended-shipped. PR description names every filled layer + every documented gap.
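Steps 2, 3, and 5 lean on the same bookkeeping: which layers are filled, which are pending, and which comes next. A minimal sketch follows. It takes the priority order straight from Step 3, while the `RepoLayers` shape, the function names, and the example repo are hypothetical. The real per-repo SKILL is a Markdown document, not this object.

```typescript
// Priority order as listed in Step 3, flattened to one layer ID per entry.
const PRIORITY = [
  "L1", "L2", "A1", "A3", "L6", "L7", "L5", "L13",
  "L8", "L9", "L12", "L10", "L11", "L3", "L4", "L14",
] as const;

type LayerId = (typeof PRIORITY)[number];
type LayerStatus = "filled" | "pending" | "n/a";

// Hypothetical in-memory form of a per-repo SKILL's layer table.
// Layers left out are treated as pending, so a gap is never silently skipped.
type RepoLayers = Partial<Record<LayerId, LayerStatus>>;

function statusOf(layers: RepoLayers, id: LayerId): LayerStatus {
  return layers[id] ?? "pending";
}

// First pending layer in priority order, or null when nothing is left to hire.
function nextLayerToHire(layers: RepoLayers): LayerId | null {
  for (const id of PRIORITY) {
    if (statusOf(layers, id) === "pending") return id;
  }
  return null;
}

// What a PR description would need to name: filled layers and documented gaps.
function summarize(layers: RepoLayers): { filled: LayerId[]; gaps: LayerId[] } {
  return {
    filled: PRIORITY.filter((id) => statusOf(layers, id) === "filled"),
    gaps: PRIORITY.filter((id) => statusOf(layers, id) === "pending"),
  };
}

// Illustrative repo state, not taken from any real repo.
const exampleRepo: RepoLayers = {
  L1: "filled",
  L2: "filled",
  A1: "filled",
  A3: "pending",
  L14: "n/a",
};
console.log(nextLayerToHire(exampleRepo)); // "A3"
console.log(summarize(exampleRepo));
```

Treating an unlisted layer as pending rather than N/A keeps the gap list honest by default: a layer only drops off the list when someone explicitly marks it.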

A metric of quality, not an experiment on quality.
One SME, five engineers, eleven repos, one gauntlet.

10 PRs merged · 9 production deploys clean · 7 real issues caught mid-flight · gaps documented, not hidden.

Thanks.
Questions?

The BFD Gauntlet · adoption + close · 16 / 17

Appendix · output growth 2024 → 2026

Appendix · the 14× metric, broken down

Same five engineers. Same calendar window.
14× the commits. 25× the active repos.

Methodology: Jan 1 – Apr 28 in 2024 vs the same 119-day window in 2026. Counting authored commits across all branches, all repos. Same five engineers both years (Keith, Reggie, Eli, Bobby, Matias). Agent-only branches and third-party PRs not counted.

Commits, Jan to Apr: 255 in 2024 · 3,635 in 2026

Heaviest months: Mar 2026 (1,142 commits) and Apr 2026 (978). The Apr week we adopted the gauntlet on aatm-brain was the biggest single week of the window: 312 commits.

14.3× the commits. No new headcount. The agents wrote the bulk; the gauntlet held the floor.
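The headline multiple is plain division over the two commit counts. A short sketch of the arithmetic, using only the figures stated in this appendix (255, 3,635, and the 119-day window); the variable and function names are illustrative.

```typescript
// Figures taken from the appendix above.
const commits2024 = 255;
const commits2026 = 3635;
const windowDays = 119;

// Growth multiple: how many times the baseline the later count is.
function multiple(later: number, baseline: number): number {
  if (baseline <= 0) throw new Error("baseline must be positive");
  return later / baseline;
}

const commitMultiple = multiple(commits2026, commits2024);
const perDay2024 = commits2024 / windowDays;
const perDay2026 = commits2026 / windowDays;

console.log(`${commitMultiple.toFixed(1)}× the commits`); // 14.3× the commits
console.log(`${perDay2024.toFixed(1)} → ${perDay2026.toFixed(1)} commits per day`);
```

Because both years use the same window length, the per-day rates scale by the same multiple as the raw counts.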

Active repos receiving merges · 2024 vs 2026

11 are active client product repos. The other 14 are internal tools (BFD platform, widget, MCP, CLI, Playbook, etc). Same gauntlet shape on every one.

25× repo coverage. The gauntlet had to be a portable pattern, not per-repo bespoke.

The five

119 · Days, same window each year

11 · Active product repos in 2026

What's NOT counted: agent-only branches that never merged, third-party PRs, dependabot grouped batches, force-pushed branches, and squashed history pre-2024.

Appendix · output growth 2024 → 2026 · 17 / 17

