Adapting Cloudflare’s $1 Review Factory for RevOps

The constraint moved

In the age of agents, two things bottleneck a team: planning and review. Cloudflare attacked the review constraint — a merge request used to wait hours for a first human look, and AI review collapsed that to minutes. RevOps has much the same shape. Revenue data rots quietly: duplicate accounts inflate the pipeline, attribution credits a contact at the wrong company, a closed-won deal carries no close date, enrichment goes stale. A human analyst auditing a batch is slow and expensive, so in practice nobody audits every sync.

The demo this post describes reviews a synthetic HubSpot-shaped batch the way Cloudflare reviews a diff. A coordinator dispatches specialist agents, fuses their findings, and returns one approval-biased verdict — with a live cost ledger pricing the run against an analyst-hour. It runs on synthetic data only; the agents make real model calls server-side.

This is the shape of a batch the factory might receive:

{
  "id": "full",
  "companies": [
    { "id": "co_201", "name": "Vertex Robotics", "domain": "vertexrobotics.example" },
    { "id": "co_202", "name": "Vertex Robotics LLC", "domain": "vertexrobotics.example" }
  ],
  "contacts": [
    { "id": "ct_201", "email": "[email protected]", "companyId": "co_201" },
    { "id": "ct_202", "email": "[email protected]", "companyId": "co_202" }
  ],
  "deals": [
    {
      "id": "dl_201",
      "name": "Vertex — cell automation",
      "amount": 410000,
      "stage": "decisionmakerboughtin",
      "closeDate": "2026-07-31",
      "companyId": "co_201"
    }
  ]
}

One: tokenomics is the master skill

The headline isn’t “AI reviews your data,” it’s “AI reviews your data for about a dollar.” That number is an engineering choice, not a coincidence. You don’t put the most powerful model on every step — that destroys the unit economics. The specialists run on a cheap, fast model; only the coordinator, which has to exercise judgment, runs on a mid-tier one.

The engine declares its models and rates up front so the ledger can price every run deterministically:

export const MODELS = {
  cheap: 'google/gemini-2.5-flash-lite',  // specialists
  mid:   'google/gemini-2.5-flash',       // coordinator
};

export const RATES = {
  [MODELS.cheap]: { input: 0.1, output: 0.4 },   // $ per 1M tokens
  [MODELS.mid]:   { input: 0.3, output: 2.5 },
};

export function costOf(usages) {
  let tokens = 0, usd = 0;
  for (const u of usages) {
    const rate = RATES[u.model] ?? { input: 0, output: 0 };
    tokens += u.inputTokens + u.outputTokens;
    usd += (u.inputTokens / 1_000_000) * rate.input
         + (u.outputTokens / 1_000_000) * rate.output;
  }
  // micro-dollars: a real run on these models is well under a cent
  return { tokens, usd: Math.round(usd * 1_000_000) / 1_000_000 };
}

Every run reports tokens and dollar cost next to a stated analyst baseline. If you can’t see what each agent costs, you can’t arbitrage that cost against the value it produces — and the whole proposition falls apart.
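To make that comparison concrete, here is a minimal sketch of pricing one run against an analyst doing the same audit by hand. The helper name `compareToBaseline`, the $60/hour rate, and the 30-minute audit time are illustrative assumptions, not figures from the demo.

```javascript
// Hypothetical helper: set one run's cost against a human analyst's cost.
// hourlyUsd and minutes are assumed baseline inputs, not values from the demo.
function compareToBaseline(runUsd, baseline) {
  const analystUsd = baseline.hourlyUsd * (baseline.minutes / 60);
  return {
    runUsd,
    analystUsd,
    // How many times cheaper the agent run is; Infinity when the run was free (e.g. cached).
    savingsMultiple: runUsd > 0 ? analystUsd / runUsd : Infinity,
  };
}

// Assumed numbers: a $0.002 run versus 30 minutes of analyst time at $60/hour.
const ledgerRow = compareToBaseline(0.002, { hourlyUsd: 60, minutes: 30 });
console.log(ledgerRow.analystUsd); // 30
```

The point of putting both numbers in one row is that the ratio, not the raw token count, is what a RevOps lead can act on.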

Two: many specialists beat one generalist

The naive version is one big prompt: “here’s the CRM export, find problems.” It’s expensive, unfocused, and impossible to verify. The factory instead runs four narrow reviewers — duplicate detection, attribution integrity, stage-and-pipeline logic, and enrichment freshness. Each reads only the slice of records its domain needs.

The deduplication specialist, for example, reads only the fields it needs to match records (names, domains, emails, and each deal’s amount and close date):

export const dedup = {
  category: 'dedup',
  model: MODELS.cheap,

  slice(batch) {
    return {
      companies: batch.companies.map(c =>
        ({ id: c.id, name: c.name, domain: c.domain })),
      contacts: batch.contacts.map(c =>
        ({ id: c.id, email: c.email, companyId: c.companyId })),
      deals: batch.deals.map(d =>
        ({ id: d.id, name: d.name, amount: d.amount,
           companyId: d.companyId, closeDate: d.closeDate })),
    };
  },

  buildCall(batch, shared) {
    return {
      label: 'dedup',
      model: this.model,
      system: [
        'You are the DEDUP / IDENTITY reviewer on a RevOps data-integrity team.',
        'Find duplicate records: companies sharing a domain, contacts sharing an email,',
        'and deals that are clearly the same deal entered twice.',
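        ANTI_SPEC,       // shared prompt fragment, defined elsewhere (not shown here)
        FINDING_FORMAT,  // shared output-format instructions, defined elsewhere (not shown here)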
      ].join('\n'),
      user: shared + '\n\nRecords to inspect:\n'
           + JSON.stringify(this.slice(batch), null, 2),
    };
  },
};

Sending every agent the whole dataset would multiply the token bill. Slicing keeps each specialist’s context small and its prompt precise. Every specialist emits findings in the same structured JSON shape, so a coordinator can fuse them.
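The shared shape is what makes fusion mechanical. The sketch below is a deterministic stand-in for that step, not the demo’s coordinator, which is a model call. `FINDING_FORMAT` isn’t shown in this post, so the field names (`category`, `severity`, `recordIds`, `summary`), the severity scale, and the verdict labels are all assumptions.

```javascript
// Assumed severity scale; the demo's real FINDING_FORMAT is not shown in this post.
const SEVERITY_RANK = { low: 0, medium: 1, high: 2 };

// Merge per-specialist finding lists into one list plus an approval-biased verdict.
function fuseFindings(findingLists) {
  const seen = new Set();
  const merged = [];
  for (const finding of findingLists.flat()) {
    // The same records flagged twice in the same category count once.
    const key = finding.category + ':' + [...finding.recordIds].sort().join(',');
    if (seen.has(key)) continue;
    seen.add(key);
    merged.push(finding);
  }
  // Most severe first, so a human reads the blocking issue before the nits.
  merged.sort((a, b) => SEVERITY_RANK[b.severity] - SEVERITY_RANK[a.severity]);

  // Approval bias: only a high-severity finding blocks the batch.
  const blocking = merged.some(f => f.severity === 'high');
  const verdict = blocking ? 'block'
                : merged.length > 0 ? 'approve_with_notes'
                : 'approve';
  return { verdict, findings: merged };
}

// Example: one medium dedup finding on the two Vertex Robotics companies.
const fused = fuseFindings([
  [{ category: 'dedup', severity: 'medium',
     recordIds: ['co_201', 'co_202'], summary: 'Companies share a domain' }],
  [], // a specialist that found nothing
]);
console.log(fused.verdict); // approve_with_notes
```

Because every finding has the same fields, the merge needs no per-specialist logic, and the bias toward approval lives in one explicit rule.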

Batch: synthetic CRM records (2–29, by tier)
  → 4 specialists (cheap model): duplicate detection, attribution integrity, stage-and-pipeline logic, enrichment freshness
  → Coordinator (mid model): fuses the structured findings into one result
  → Verdict: one approval-biased call, acted on in seconds

Schematic: batch → specialists → coordinator → one approval-biased verdict.

Three: scale the compute to the blast radius

You don’t send the dream team to review a typo. A six-record nightly sync gets a two-agent team and a cheaper coordinator; a twenty-nine-record quarter-end pipeline pull gets the full four-agent roster with the mid-tier coordinator, and batches in between get three specialists. The system classifies each batch by risk (record count, with deals weighted double because they carry money and stage) and spins up only as much review as the blast radius justifies.

export function classifyRisk(batch) {
  const score = batch.companies.length
              + batch.contacts.length
              + batch.deals.length * 2;  // deals carry money + stage

  let tier;
  if (score <= 8) tier = 'trivial';
  else if (score <= 20) tier = 'lite';
  else tier = 'full';

  return {
    tier,
    specialistCount: tier === 'trivial' ? 2
                   : tier === 'lite' ? 3
                   : 4,
    coordinatorModel: tier === 'trivial' ? MODELS.cheap : MODELS.mid,
  };
}

Most reviews should be cheap, so the expensive ones can afford to be thorough. The result is one verdict a human can act on in seconds, not four transcripts to reconcile.
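A worked example makes the thresholds easier to check. This sketch repeats the scoring rule and tier cut-offs from `classifyRisk` so it runs on its own; the helper names (`riskScore`, `tierFor`, `batchOf`) and the mix of records in each batch are assumptions for illustration.

```javascript
// Same rule as classifyRisk: one point per company or contact, two per deal.
function riskScore(batch) {
  return batch.companies.length + batch.contacts.length + batch.deals.length * 2;
}

// Same thresholds as classifyRisk: <= 8 trivial, <= 20 lite, otherwise full.
function tierFor(score) {
  if (score <= 8) return 'trivial';
  if (score <= 20) return 'lite';
  return 'full';
}

// Placeholder batch with the given record counts; contents don't affect the score.
function batchOf(companies, contacts, deals) {
  return {
    companies: Array(companies).fill({}),
    contacts: Array(contacts).fill({}),
    deals: Array(deals).fill({}),
  };
}

const nightly = batchOf(2, 2, 2);      // 6 records (assumed mix)
const quarterEnd = batchOf(10, 12, 7); // 29 records (assumed mix)

console.log(riskScore(nightly), tierFor(riskScore(nightly)));       // 8 trivial
console.log(riskScore(quarterEnd), tierFor(riskScore(quarterEnd))); // 36 full
```

Note that the tier follows the weighted score, not the raw record count: a six-record batch made up entirely of deals would score 12 and land in the lite tier.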

Four: build for failure first

A one-shot system you assume will always work isn’t really engineering yet. Putting real model calls behind a public URL forces the issue. Every call runs under a timeout with a single failback to a cheaper previous-generation model. A KV-backed layer enforces a per-IP rate limit and a global daily token budget, and an aggressive cache means most clicks serve a previously-computed real run — instantly, for free.
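The timeout-plus-failback part can be sketched in a few lines. This is an illustration under stated assumptions, not the demo’s code: `callModel(model, request)` is a hypothetical function that returns a promise, and the names `withTimeout` and `callWithFailback` are mine.

```javascript
// Reject if the promise has not settled within ms milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('timeout after ' + ms + 'ms')), ms);
  });
  // Clear the timer either way so it cannot fire after the race is decided.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Try the primary model once; on error or timeout, try the backup model once.
// callModel(model, request) is a hypothetical function returning a promise.
async function callWithFailback(callModel, request, models, timeoutMs) {
  try {
    const output = await withTimeout(callModel(models.primary, request), timeoutMs);
    return { output, model: models.primary, failedBack: false };
  } catch (primaryError) {
    // Single failback: if this attempt also fails, the error reaches the caller.
    const output = await withTimeout(callModel(models.backup, request), timeoutMs);
    return { output, model: models.backup, failedBack: true };
  }
}
```

One caveat: `Promise.race` stops waiting for the slow call but does not cancel it. A production version would likely pass an abort signal to the underlying request so a timed-out call stops consuming tokens.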

The guard decides how to serve each request before any model is called:

export async function decideServe(kv, cacheKey, ip, cfg) {
  const cached = await kv.get(cacheKey);
  if (cached) {
    await bump(kv, 'stat:hits');
    return { mode: 'cached', result: cached };
  }

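  // Assumption: the budget counter is keyed by UTC day (YYYY-MM-DD), computed per request.
  const today = new Date().toISOString().slice(0, 10);
  const spent = await kv.get(budgetKey(today));
  if (spent >= cfg.dailyTokenBudget) {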
    const lastGood = await kv.get(LAST_GOOD_KEY);
    if (!lastGood) return { mode: 'error', reason: 'budget exhausted' };
    return { mode: 'failback', result: lastGood, reason: 'daily budget exhausted' };
  }

  const hits = await kv.get(rateKey(ip));
  if (hits >= cfg.maxRequestsPerIp) {
    const lastGood = await kv.get(LAST_GOOD_KEY);
    if (!lastGood) return { mode: 'error', reason: 'rate limit exceeded' };
    return { mode: 'failback', result: lastGood, reason: 'rate limit exceeded' };
  }

  return { mode: 'fresh' };
}

When the daily budget is exhausted or a caller exceeds the rate limit, the system fails back to the last good run rather than burning money, and it returns an error only if no good run has been stored yet. The cache, the budget, and the failback aren’t hidden plumbing; they’re shown in the UI. Each run is labeled fresh, cached, or failback, so the “live” claim stays honest.

Cache hit? yes → serve cached (instant, free)
Daily token budget left? no → failback to last good run
Under per-IP rate limit? no → failback to last good run
All checks clear → fresh model call

Schematic: the guard’s checks, in order, before any model is called.
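The guard only reads state; something has to write it after a fresh run succeeds. Here is a minimal sketch of that write path, assuming a KV store with async `get(key)` and `put(key, value)` over string values. The key names, `recordFreshRun`, and the in-memory `memoryKv` stand-in are all hypothetical.

```javascript
// Hypothetical key names; the demo's real keys are not shown in this post.
const LAST_GOOD = 'run:last-good';
const budgetKeyFor = (day) => 'budget:' + day;

// After a fresh run succeeds: cache it, keep it as the failback, charge the budget.
// kv is assumed to expose async get(key) and put(key, value) with string values.
async function recordFreshRun(kv, cacheKey, result, tokensUsed, day) {
  const serialized = JSON.stringify(result);
  await kv.put(cacheKey, serialized);
  await kv.put(LAST_GOOD, serialized);

  const key = budgetKeyFor(day);
  const spent = Number(await kv.get(key)) || 0; // missing counter means nothing spent
  // Read-modify-write is not atomic; concurrent runs can undercount slightly.
  await kv.put(key, String(spent + tokensUsed));
  return spent + tokensUsed;
}

// Minimal in-memory stand-in for a KV namespace, for local testing only.
function memoryKv() {
  const store = new Map();
  return {
    get: async (key) => (store.has(key) ? store.get(key) : null),
    put: async (key, value) => { store.set(key, value); },
  };
}
```

Only successful fresh runs should overwrite the last-good entry; otherwise a failed run could become the thing every later failback serves.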

What carries over, and what doesn’t

What generalizes is the shape: a coordinator over diverse specialists, structured findings fused into one biased-toward-shipping decision, compute scaled to risk, and cost treated as a first-class metric with real resilience around the model calls. That pattern applies to far more than code or CRM data — any review queue with a cost and a quality bar.

What I deliberately left on the table is the part Cloudflare also hasn’t reached: zero-touch self-improvement. The factory flags problems; it doesn’t yet fix the instructions that let them recur. That’s the honest edge of the current build, and probably the next thing worth attacking.

The demo is live and runnable — pick a batch, watch the agents review it, and read the cost ledger. It runs on synthetic data only.
