Dallas Crilley
Agent systems · RevOps architecture

I Rebuilt Cloudflare’s $1 Review Factory for RevOps

Cloudflare ran an AI code-review system across 5,169 repositories for about a dollar per merge request. The interesting part isn’t the code review — it’s the factory pattern. I rebuilt that architecture for the surface I actually operate: revenue data. Here’s what it looks like in working code.

The constraint moved

In the age of agents, two things bottleneck a team: planning and review. Cloudflare attacked the review constraint — a merge request used to wait hours for a first human look, and AI review collapsed that to minutes. RevOps has the exact same shape. Revenue data rots quietly: duplicate accounts inflate the pipeline, attribution credits a contact at the wrong company, a closed-won deal carries no close date, enrichment goes stale. A human analyst auditing a batch is slow and expensive, so in practice nobody audits every sync.

The demo this post describes reviews a synthetic HubSpot-shaped batch the way Cloudflare reviews a diff. A coordinator dispatches specialist agents, fuses their findings, and returns one approval-biased verdict — with a live cost ledger pricing the run against an analyst-hour. It runs on synthetic data only; the agents make real model calls server-side.

Here’s the shape of a batch the factory might receive:

{
  "id": "full",
  "companies": [
    { "id": "co_201", "name": "Vertex Robotics", "domain": "vertexrobotics.example" },
    { "id": "co_202", "name": "Vertex Robotics LLC", "domain": "vertexrobotics.example" }
  ],
  "contacts": [
    { "id": "ct_201", "email": "[email protected]", "companyId": "co_201" },
    { "id": "ct_202", "email": "[email protected]", "companyId": "co_202" }
  ],
  "deals": [
    {
      "id": "dl_201",
      "name": "Vertex — cell automation",
      "amount": 410000,
      "stage": "decisionmakerboughtin",
      "closeDate": "2026-07-31",
      "companyId": "co_201"
    }
  ]
}

One: tokenomics is the master skill

The headline isn’t “AI reviews your data,” it’s “AI reviews your data for about a dollar.” That number is an engineering choice, not a coincidence. You don’t put the most powerful model on every step — that destroys the unit economics. The specialists run on a cheap, fast model; only the coordinator, which has to exercise judgment, runs on a mid-tier one.

The engine declares its models and rates up front so the ledger can price every run deterministically:

export const MODELS = {
  cheap: 'google/gemini-2.5-flash-lite',  // specialists
  mid:   'google/gemini-2.5-flash',       // coordinator
};

export const RATES = {
  [MODELS.cheap]: { input: 0.1, output: 0.4 },   // $ per 1M tokens
  [MODELS.mid]:   { input: 0.3, output: 2.5 },
};

export function costOf(usages) {
  let tokens = 0, usd = 0;
  for (const u of usages) {
    const rate = RATES[u.model] ?? { input: 0, output: 0 };
    tokens += u.inputTokens + u.outputTokens;
    usd += (u.inputTokens / 1_000_000) * rate.input
         + (u.outputTokens / 1_000_000) * rate.output;
  }
  // micro-dollars: a real run on these models is well under a cent
  return { tokens, usd: Math.round(usd * 1_000_000) / 1_000_000 };
}

Every run reports tokens and dollar cost next to a stated analyst baseline. If you can’t see what each agent costs, you can’t arbitrage that cost against the value it produces — and the whole proposition falls apart.
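To make the ledger concrete, here is a worked example of the pricing arithmetic. The rates are the ones declared above; the token counts and the analyst baseline (30 minutes at $60 per hour) are hypothetical numbers chosen for illustration, not measurements from the demo.

```javascript
// Rates copied from the engine above ($ per 1M tokens).
const RATES = {
  'google/gemini-2.5-flash-lite': { input: 0.1, output: 0.4 },
  'google/gemini-2.5-flash': { input: 0.3, output: 2.5 },
};

// Same pricing logic as costOf above, repeated so this snippet runs on its own.
function costOf(usages) {
  let tokens = 0, usd = 0;
  for (const u of usages) {
    const rate = RATES[u.model] ?? { input: 0, output: 0 };
    tokens += u.inputTokens + u.outputTokens;
    usd += (u.inputTokens / 1_000_000) * rate.input
         + (u.outputTokens / 1_000_000) * rate.output;
  }
  return { tokens, usd: Math.round(usd * 1_000_000) / 1_000_000 };
}

// Hypothetical usage: four specialist calls plus one coordinator call.
const specialist = {
  model: 'google/gemini-2.5-flash-lite', inputTokens: 2000, outputTokens: 300,
};
const usages = [
  specialist, specialist, specialist, specialist,
  { model: 'google/gemini-2.5-flash', inputTokens: 1500, outputTokens: 400 },
];

// Hypothetical analyst baseline: 30 minutes at $60/hour.
const ANALYST_USD = 0.5 * 60;

const run = costOf(usages);
const ratio = Math.round(ANALYST_USD / run.usd);

console.log(run);  // { tokens: 11100, usd: 0.00273 }
console.log(`analyst baseline is roughly ${ratio}x the model cost`);
```

With these assumed numbers the coordinator's single call costs more than all four specialist calls combined, which is the reason to keep the pricier model on exactly one step.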

Two: many specialists beat one generalist

The naive version is one big prompt: “here’s the CRM export, find problems.” It’s expensive, unfocused, and impossible to verify. The factory instead runs four narrow reviewers — duplicate detection, attribution integrity, stage-and-pipeline logic, and enrichment freshness. Each reads only the slice of records its domain needs.

The deduplication specialist, for example, reads only the fields it needs to establish identity, including the few deal fields that reveal a deal entered twice:

export const dedup = {
  category: 'dedup',
  model: MODELS.cheap,

  slice(batch) {
    return {
      companies: batch.companies.map(c =>
        ({ id: c.id, name: c.name, domain: c.domain })),
      contacts: batch.contacts.map(c =>
        ({ id: c.id, email: c.email, companyId: c.companyId })),
      deals: batch.deals.map(d =>
        ({ id: d.id, name: d.name, amount: d.amount,
           companyId: d.companyId, closeDate: d.closeDate })),
    };
  },

  buildCall(batch, shared) {
    return {
      label: 'dedup',
      model: this.model,
      system: [
        'You are the DEDUP / IDENTITY reviewer on a RevOps data-integrity team.',
        'Find duplicate records: companies sharing a domain, contacts sharing an email,',
        'and deals that are clearly the same deal entered twice.',
        ANTI_SPEC,
        FINDING_FORMAT,
      ].join('\n'),
      user: shared + '\n\nRecords to inspect:\n'
           + JSON.stringify(this.slice(batch), null, 2),
    };
  },
};

Sending every agent the whole dataset would multiply the token bill. Slicing keeps each specialist’s context small and its prompt precise. Every specialist emits findings in the same structured JSON shape, so a coordinator can fuse them.
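One way the fusing step could look is sketched below. In the demo the coordinator is itself a model call, so treat this as a deterministic pre-pass under stated assumptions, not the demo's implementation. FINDING_FORMAT is not shown in this post, so the finding shape, the severity levels, and the verdict names here are all hypothetical.

```javascript
// Hypothetical finding shape (FINDING_FORMAT is not shown in this post):
//   { category, severity: 'info' | 'warn' | 'critical', recordIds: string[], message }
const SEVERITY_RANK = { info: 0, warn: 1, critical: 2 };

function fuseFindings(findingsBySpecialist) {
  const merged = new Map();
  for (const finding of findingsBySpecialist.flat()) {
    // Two findings in the same category about the same set of records
    // are treated as one; keep the more severe of the two.
    const key = finding.category + '|' + [...finding.recordIds].sort().join(',');
    const seen = merged.get(key);
    if (!seen || SEVERITY_RANK[finding.severity] > SEVERITY_RANK[seen.severity]) {
      merged.set(key, finding);
    }
  }

  // Most severe first, so a human reads the important items at the top.
  const findings = [...merged.values()].sort(
    (a, b) => SEVERITY_RANK[b.severity] - SEVERITY_RANK[a.severity]);

  // Approval bias: only a critical finding blocks the batch.
  let verdict = 'approve';
  if (findings.some(f => f.severity === 'critical')) verdict = 'block';
  else if (findings.length > 0) verdict = 'approve_with_comments';

  return { verdict, findings };
}

const fused = fuseFindings([
  [{ category: 'dedup', severity: 'info', recordIds: ['co_201', 'co_202'],
     message: 'Companies share the domain vertexrobotics.example' }],
  [{ category: 'dedup', severity: 'warn', recordIds: ['co_202', 'co_201'],
     message: 'Likely the same company entered twice' }],
  [{ category: 'enrichment', severity: 'info', recordIds: ['ct_201'],
     message: 'Enrichment data may be stale' }],
]);

console.log(fused.verdict, fused.findings.length);  // approve_with_comments 2
```

The two dedup findings collapse into one because they name the same pair of records, and the batch still ships because nothing is critical.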

Schematic — batch → specialists → coordinator → one approval-biased verdict.

Three: scale the compute to the blast radius

You don’t send the dream team to review a typo. A six-record nightly sync gets a two-agent team and a coordinator on the cheap model; a twenty-nine-record quarter-end pipeline pull gets the full four-agent roster and a mid-tier coordinator, with a three-agent tier in between. The system classifies the batch by risk (record count, with deals weighted double because they carry money and stage) and spins up exactly as much review as the blast radius justifies.

export function classifyRisk(batch) {
  const score = batch.companies.length
              + batch.contacts.length
              + batch.deals.length * 2;  // deals carry money + stage

  let tier;
  if (score <= 8) tier = 'trivial';
  else if (score <= 20) tier = 'lite';
  else tier = 'full';

  return {
    tier,
    specialistCount: tier === 'trivial' ? 2
                   : tier === 'lite' ? 3
                   : 4,
    coordinatorModel: tier === 'trivial' ? MODELS.cheap : MODELS.mid,
  };
}

Most reviews should be cheap, so the expensive ones can afford to be thorough. The result is one verdict a human can act on in seconds, not four transcripts to reconcile.
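As a worked example, the sample batch near the top of this post scores 2 + 2 + 1 × 2 = 6, which lands in the trivial tier. The sketch below repeats classifyRisk so it runs on its own and adds a hypothetical pickRoster helper. The priority order of the specialists is an assumption made for illustration; the post does not say which reviewers are dropped first.

```javascript
const MODELS = {
  cheap: 'google/gemini-2.5-flash-lite',
  mid: 'google/gemini-2.5-flash',
};

// Same logic as classifyRisk above, repeated so the snippet is self-contained.
function classifyRisk(batch) {
  const score = batch.companies.length
              + batch.contacts.length
              + batch.deals.length * 2;
  const tier = score <= 8 ? 'trivial' : score <= 20 ? 'lite' : 'full';
  return {
    tier,
    specialistCount: tier === 'trivial' ? 2 : tier === 'lite' ? 3 : 4,
    coordinatorModel: tier === 'trivial' ? MODELS.cheap : MODELS.mid,
  };
}

// Hypothetical priority order: which reviewers survive when the roster is cut.
const ROSTER_PRIORITY = ['dedup', 'stage', 'attribution', 'enrichment'];

function pickRoster(batch) {
  const risk = classifyRisk(batch);
  return { ...risk, roster: ROSTER_PRIORITY.slice(0, risk.specialistCount) };
}

// Record counts match the sample batch: 2 companies, 2 contacts, 1 deal.
const sample = {
  companies: [{ id: 'co_201' }, { id: 'co_202' }],
  contacts: [{ id: 'ct_201' }, { id: 'ct_202' }],
  deals: [{ id: 'dl_201' }],
};

const plan = pickRoster(sample);
console.log(plan.tier, plan.roster);  // trivial [ 'dedup', 'stage' ]
```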

Four: build for failure, or you’re just vibe-coding

A one-shot system you assume will always work isn’t engineering. Putting real model calls behind a public URL forces the issue. Every call runs under a timeout with a single failback to a cheaper previous-generation model. A KV-backed layer enforces a per-IP rate limit and a global daily token budget, and an aggressive cache means most clicks serve a previously-computed real run — instantly, for free.
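The timeout-plus-single-failback behavior could be sketched roughly as follows. This is a minimal illustration, not the demo's code: callModel is a hypothetical function supplied by the caller, and the model names in the usage stub are placeholders, since the post does not name the failback model.

```javascript
// Run one attempt under a deadline. `callModel` is a hypothetical function
// (model, prompt, signal) => Promise<string> supplied by the caller.
async function withTimeout(callModel, model, prompt, timeoutMs) {
  const controller = new AbortController();
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => {
      controller.abort();  // lets the underlying request cancel itself
      reject(new Error(`timeout after ${timeoutMs}ms on ${model}`));
    }, timeoutMs);
  });
  try {
    return await Promise.race([
      callModel(model, prompt, controller.signal),
      deadline,
    ]);
  } finally {
    clearTimeout(timer);  // never leave a timer running after the race settles
  }
}

// Try the primary model once, then fail back exactly once.
// If the failback attempt also fails, the error propagates to the caller.
async function callWithFailback(callModel, { primary, failback, prompt, timeoutMs }) {
  try {
    const text = await withTimeout(callModel, primary, prompt, timeoutMs);
    return { text, model: primary, failedBack: false };
  } catch (err) {
    const text = await withTimeout(callModel, failback, prompt, timeoutMs);
    return { text, model: failback, failedBack: true, reason: err.message };
  }
}

// Stub for demonstration: the primary model hangs, the failback answers.
const stub = (model) => model === 'primary-model'
  ? new Promise(() => {})
  : Promise.resolve('ok from ' + model);

callWithFailback(stub, {
  primary: 'primary-model', failback: 'failback-model',
  prompt: 'review this batch', timeoutMs: 50,
}).then(r => console.log(r.model, r.failedBack));  // failback-model true
```

Capping the retry at one attempt bounds the worst-case latency at twice the timeout and keeps a flaky provider from quietly multiplying the token bill.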

The guard decides how to serve each request before any model is called:

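export async function decideServe(kv, cacheKey, ip, cfg) {
  // Cache first: a previously computed real run is served instantly, for free.
  const cached = await kv.get(cacheKey);
  if (cached) {
    await bump(kv, 'stat:hits');
    return { mode: 'cached', result: cached };
  }

  // Global daily token budget. A missing counter reads as null and a stored
  // one may come back as a string, so coerce to a number before comparing.
  // Assumption: budget keys are bucketed by UTC date (YYYY-MM-DD).
  const today = new Date().toISOString().slice(0, 10);
  const spent = Number(await kv.get(budgetKey(today))) || 0;
  if (spent >= cfg.dailyTokenBudget) {
    const lastGood = await kv.get(LAST_GOOD_KEY);
    if (!lastGood) return { mode: 'error', reason: 'budget exhausted' };
    return { mode: 'failback', result: lastGood, reason: 'daily budget exhausted' };
  }

  // Per-IP rate limit, coerced the same way.
  const hits = Number(await kv.get(rateKey(ip))) || 0;
  if (hits >= cfg.maxRequestsPerIp) {
    const lastGood = await kv.get(LAST_GOOD_KEY);
    if (!lastGood) return { mode: 'error', reason: 'rate limit exceeded' };
    return { mode: 'failback', result: lastGood, reason: 'rate limit exceeded' };
  }

  // Nothing blocks the request: run the agents for real.
  return { mode: 'fresh' };
}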

When the daily budget is exhausted or an IP exceeds its rate limit, the system fails back to the last good run rather than burning money, and returns an error only if no good run has been stored yet. The cache, the budget, and the failback aren’t hidden plumbing; they’re shown in the UI. Each run is labeled fresh, cached, or failback, so the “live” claim stays honest.
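The guard only reads the cache and counters; something has to write them after a fresh run succeeds. The post does not show that half, so the following is a sketch under assumptions: the in-memory store stands in for the real KV layer, the key formats are hypothetical, and the helper bodies are guesses that reuse the names the guard references (bump, budgetKey, rateKey, LAST_GOOD_KEY).

```javascript
// In-memory stand-in for the KV store, for illustration only.
function memoryKv() {
  const store = new Map();
  return {
    async get(key) { return store.has(key) ? store.get(key) : null; },
    async put(key, value) { store.set(key, value); },
  };
}

// Hypothetical key helpers; the demo's real key scheme is not shown.
const budgetKey = (day) => `budget:${day}`;
const rateKey = (ip) => `rate:${ip}`;
const LAST_GOOD_KEY = 'run:last-good';

// Read-modify-write counter. Not atomic: concurrent requests can undercount,
// so treat the limits it feeds as approximate.
async function bump(kv, key, by = 1) {
  const next = (Number(await kv.get(key)) || 0) + by;
  await kv.put(key, String(next));
  return next;
}

// After a fresh run succeeds, update everything the guard reads next time.
async function recordFreshRun(kv, { cacheKey, ip, day, result, tokens }) {
  const payload = JSON.stringify(result);
  await kv.put(cacheKey, payload);       // identical requests now hit the cache
  await kv.put(LAST_GOOD_KEY, payload);  // failback target when a limit trips
  const spent = await bump(kv, budgetKey(day), tokens);
  const hits = await bump(kv, rateKey(ip));
  return { spent, hits };
}

(async () => {
  const kv = memoryKv();
  const totals = await recordFreshRun(kv, {
    cacheKey: 'cache:full', ip: '203.0.113.7', day: '2026-01-31',
    result: { verdict: 'approve' }, tokens: 11100,
  });
  console.log(totals);  // { spent: 11100, hits: 1 }
})();
```

Writing the last-good run on every success is what lets the guard degrade to a real, previously computed result instead of an error page.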

What carries over, and what doesn’t

What generalizes is the shape: a coordinator over diverse specialists, structured findings fused into one biased-toward-shipping decision, compute scaled to risk, and cost treated as a first-class metric with real resilience around the model calls. That pattern applies to far more than code or CRM data — any review queue with a cost and a quality bar.

What I deliberately left on the table is the part Cloudflare also hasn’t reached: zero-touch self-improvement. The factory flags problems; it doesn’t yet fix the instructions that let them recur. That’s the honest edge of the current build, and the next thing worth attacking.

The demo is live and runnable — pick a batch, watch the agents review it, and read the cost ledger. It runs on synthetic data only.