Your Own Work Is Invisible to You
After enough hours inside a project, you lose the ability to see it cold. I built a re-runnable panel of fresh-context AI reviewers — each with a distinct adversarial mission, an anchored scoring rubric, and a hard evidence rule — to audit one of my own live projects the way a stranger would. This is the architecture, what it caught, and where it falls short.
The problem: insider blindness
Every builder has this failure mode. You know where everything is, so navigation never feels confusing. You know what every claim means, so nothing reads as overstated. You wrote the docs, so they always seem complete. The defects that matter most — the dead link, the contradiction between two pages, the claim that outruns the code — are precisely the ones you can no longer perceive.
Asking an AI assistant “review my site” doesn’t fix this, because a long-running assistant has the same disease: it helped build the thing. It knows your intentions, and intentions are exactly what a cold visitor doesn’t have. The fix isn’t a smarter reviewer. It’s an architecture that manufactures genuine cold eyes, repeatably.
Design rule one: fresh context is the whole point
Each reviewer runs as a freshly dispatched agent with zero conversation history and no repo tour beyond what its mission prompt grants. It gets live URLs and a mission — not the project’s backstory, not prior findings, not my framing of what matters. Anti-anchoring is the core mechanism: the moment a reviewer inherits context, it starts reviewing your story about the project instead of the project.
The reviewers also run in parallel and never see each other’s output. Independent passes that agree are signal; a panel that deliberates converges on whoever spoke first.
Design rule two: personas with conflicting missions
A single “be critical” prompt produces generic criticism. What works is giving each agent a narrow persona with a time budget and a decision to make — because real audiences don’t read your work; they decide something about it under time pressure.
My panel uses three lenses. A first-impressions reviewer gets ninety seconds per surface and must answer basic questions fast — and report exactly where it got confused, slowed down, or bounced. A depth reviewer gets ten minutes to judge whether the strongest material holds up: where attention dropped, which claim triggered doubt, what a serious practitioner would expect to find but didn’t. And a skeptic gets twenty minutes and an explicitly adversarial mission: catch the marketing outrunning the code.
The skeptic doesn’t trust pages. It fires its own requests at the live services behind the project — including negative cases the docs don’t advertise, like inputs that should fail with structured errors rather than flattering ones. It shallow-clones the public repos, reads source the way a reviewer reads a stranger’s PR, and runs the test suites locally to check whether tests exercise real logic or pad counts. Its verdict is one word — credible, maybe, or pass — plus the three observations that drove it.
Design rule three: an uncited score is invalid
Free-form critique from a language model regresses to plausible-sounding generalities. Two constraints keep the output honest. First, every finding must cite evidence: an exact URL, a file and line, or a request/response pair from a probe it actually fired. No citation, no finding. Second, scoring happens on an anchored 0–10 scale — fifteen dimensions across the three lenses — where each band is defined by consequence, not adjective: 0–2 actively costs you, 5–6 is par and wins nothing, 9–10 is the thing a visitor would quote to someone else. Anchors are what make scores from different agents on different days comparable enough to trend.
The loop: baseline, fix, re-run, diff
One audit is a critique. The system only becomes useful when runs are diffable. After each run, reports are written verbatim to dated files, scores append to a scoreboard, and every finding becomes a tracked task — one fix per commit, so the fix history stays auditable.
The re-run then executes a recurrence check against the prior run’s findings: each one is classified fixed (not re-found by a reviewer that knew nothing about it), recurred (re-found — meaning the fix didn’t land or didn’t work, which escalates it above anything new), or obsolete. Recurrence is the honest signal. A fresh-context reviewer independently not-finding a defect is much stronger evidence than the implementer asserting it’s fixed.
Running this against a live project of mine, the baseline panel surfaced things I had walked past for weeks: template residue in a public repo, two pages quietly contradicting each other, a status value being coerced to success in a place that flattered a demo, and numbers that were technically true but read as implausible. Ten fixes shipped, one per commit. The re-run confirmed all ten landed — none recurred — and the panel’s verdict firmed up a notch while the lowest-scoring dimensions pointed at exactly what to do next.
What this costs, honestly
The rubric is the expensive part. My baseline run shipped before the anchored scale existed, so its scores aren’t comparable to anything — the first run mostly bought me the rubric. Writing consequence-based anchors that two different agents interpret the same way took longer than wiring the agents themselves.
Judge variance is real. The same persona on the same surfaces will not produce identical scores twice; I treat one-point moves as noise and trust direction only across multiple dimensions. Personas also drift polite — adversarial missions need blunt, almost rude register baked into the prompt, or the skeptic turns into a cheerleader with caveats. And the panel only reviews what its missions name: it will not tell you that the entire premise is wrong, only that the execution leaks.
What generalizes: any project with public surfaces and live behavior — a product, an API, documentation, an open-source repo — can be audited this way. The ingredients are boring on purpose: fresh context per reviewer, conflicting time-boxed missions, evidence-or-it-didn’t-happen, anchored scores, and a recurrence check that treats the previous run as the test suite for your fixes.
The panel runs against real systems I build and operate. The case studies cover what those systems are; this post is the judgment layer that keeps them honest.