Dallas Crilley
CoHost AI Studio · Eval architecture

The Gate Has to Stand Without the LLM

If an AI system controls the step before publish, the gate cannot be vibes. I built CoHost around a stricter rule: programmatic metrics make the blocking decision, LLM evals add qualitative judgment, and a human still owns the final irreversible action.

The failure mode was silent quality loss

Podcast post-production is a chain of small judgment calls: transcript quality, speaker labels, loudness, chapters, show notes, clips, social copy, and distribution. When those handoffs are manual, the work is slow. When they are automated naively, the risk is worse: bad output can become publishable-looking output without anyone noticing where the quality fell below the line.

CoHost AI Studio was my answer to that risk. The pipeline runs twenty-three named production steps as a dependency graph, then puts a quality gate in front of distribution. The gate scores eleven steps programmatically across twenty-five individual metrics. Five content-heavy steps can also receive LLM-based qualitative evaluations, but those evals are advisory by design. The publish block does not depend on an LLM being available, cheap, or consistent that day.

Rule one: deterministic checks block

The gate starts with metrics that can be measured directly. Transcription gets word confidence, speaker resolution, and coverage checks. Mastering gets a loudness delta against the target. Video and distribution-adjacent steps get artifact completeness. Each scorer emits a normalized score, a reason, and enough context for an operator to understand what failed.
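As a rough illustration of what one such scorer could look like, here is a minimal sketch of a loudness-delta check. It is not CoHost's actual code: the `MetricScore` shape, the `score_loudness` name, the -16 LUFS target, and the 4 LU tolerance are all assumptions chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MetricScore:
    """One programmatic metric result (hypothetical shape)."""
    metric: str
    score: float                      # normalized to the range [0.0, 1.0]
    reason: str                       # human-readable explanation
    context: dict = field(default_factory=dict)  # raw numbers for the operator


def score_loudness(measured_lufs: float,
                   target_lufs: float = -16.0,   # assumed target
                   tolerance_lu: float = 4.0     # assumed tolerance
                   ) -> MetricScore:
    """Score integrated loudness by its distance from the target.

    The score falls linearly from 1.0 at the target to 0.0 at
    `tolerance_lu` loudness units away, and is clamped at 0.0 beyond that.
    """
    if tolerance_lu <= 0:
        raise ValueError("tolerance_lu must be positive")
    delta = abs(measured_lufs - target_lufs)
    score = max(0.0, 1.0 - delta / tolerance_lu)
    reason = (f"measured {measured_lufs:.1f} LUFS, "
              f"{delta:.1f} LU from target {target_lufs:.1f} LUFS")
    return MetricScore(
        metric="mastering.loudness_delta",
        score=score,
        reason=reason,
        context={"measured_lufs": measured_lufs,
                 "target_lufs": target_lufs,
                 "delta_lu": delta},
    )
```

The point of the shape is that the number never travels alone: the reason and the raw context go with it, so an operator can see what failed without rerunning the step.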

Those scores are combined into a weighted composite: heavier production-critical steps count more, lighter copy-generation steps count less. But the composite has a hard floor. Any single step below the failure threshold fails the episode regardless of the average. That rule matters because averages are exactly how catastrophic single-dimension failures hide. An episode with good chapters, show notes, and social copy is still not publishable if the audio is wrong.
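The interaction between the weighted average and the per-step floor can be sketched in a few lines. This is an illustration under stated assumptions, not the real engine: the function name, the 0.5 step floor, the 0.7 composite threshold, and the example weights are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    passed: bool
    composite: float
    failed_steps: tuple   # steps that fell below the per-step floor


def evaluate_gate(step_scores: dict,
                  weights: dict,
                  step_floor: float = 0.5,            # assumed floor
                  composite_threshold: float = 0.7    # assumed threshold
                  ) -> GateResult:
    """Weighted composite with a hard per-step floor.

    A step below `step_floor` fails the episode even when the weighted
    average is comfortably above `composite_threshold`.
    """
    if not step_scores:
        raise ValueError("no step scores to evaluate")
    total_weight = sum(weights[step] for step in step_scores)
    composite = sum(score * weights[step]
                    for step, score in step_scores.items()) / total_weight
    failed = tuple(sorted(step for step, score in step_scores.items()
                          if score < step_floor))
    passed = not failed and composite >= composite_threshold
    return GateResult(passed=passed, composite=composite, failed_steps=failed)


# Good copy, bad audio: the average looks fine, the floor still blocks.
example = evaluate_gate(
    step_scores={"mastering": 0.4, "chapters": 1.0,
                 "show_notes": 1.0, "social_copy": 1.0},
    weights={"mastering": 2.0, "chapters": 1.0,
             "show_notes": 1.0, "social_copy": 1.0},
)
```

In this example the composite works out to about 0.76, which clears the assumed threshold, yet `example.passed` is `False` because mastering sits below the floor.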

Rule two: LLM evals advise

LLM evals are useful where deterministic checks run out of language. They can judge whether show notes are coherent, whether clips preserve the point of a segment, or whether social copy sounds plausible but is wrong for the episode. CoHost can run those evals for chapters, show notes, social copy, clips, and transcript-heavy review.

I intentionally kept those evals out of the blocking core. They attach reasoning to the scorecard so a human can inspect the judgment, but the pipeline still has a durable gate when the model is slow, expensive, unavailable, or simply uncertain. The pattern is not “let the model decide.” It is “make the model explain what the deterministic system cannot see, then put that explanation in front of the operator.”
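One way to keep an eval advisory is to make sure its failure modes can only ever change the notes on the scorecard, never the gate decision. The sketch below assumes a dictionary scorecard with a `gate_passed` key and a hypothetical `attach_advisory` helper; neither name comes from CoHost itself.

```python
from typing import Callable, Optional


def attach_advisory(scorecard: dict,
                    step: str,
                    run_eval: Optional[Callable[[], str]]) -> dict:
    """Return a copy of the scorecard with an advisory note for one step.

    `run_eval` stands in for an LLM call that returns reasoning text.
    If it is missing or raises, the note records that fact. The
    `gate_passed` value is never read or written here.
    """
    updated = dict(scorecard)
    notes = dict(updated.get("advisory", {}))
    if run_eval is None:
        notes[step] = {"status": "skipped", "reasoning": None}
    else:
        try:
            notes[step] = {"status": "ok", "reasoning": run_eval()}
        except Exception as exc:  # slow, unavailable, or over budget
            notes[step] = {"status": "unavailable",
                           "reasoning": f"{type(exc).__name__}: {exc}"}
    updated["advisory"] = notes
    return updated
```

Because the deterministic verdict is computed first and this helper only adds to the `advisory` section, a model timeout degrades the review experience but leaves the gate intact.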

Rule three: the graph owns failure policy

Each production step declares its dependencies, completion predicate, input and output artifacts, and cascade behavior. Some failures block all dependent work. Some pause and ask a human. Some are protected so a nonessential failure cannot cascade into the distribution path. The graph is validated before a run starts: cycles, duplicate steps, and unknown dependencies fail early.
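The pre-run validation described above maps closely onto Python's standard `graphlib` module. The following is a minimal sketch, assuming a simplified `Step` record with hypothetical field names and failure-policy labels; the real step declarations carry more (completion predicates, artifacts) than is shown here.

```python
from dataclasses import dataclass
from graphlib import CycleError, TopologicalSorter


@dataclass(frozen=True)
class Step:
    name: str
    depends_on: tuple = ()
    # Hypothetical policy labels: "block", "ask_human", or "isolate".
    on_failure: str = "block"


def validate_graph(steps: list) -> list:
    """Return a valid run order, or raise ValueError before anything runs."""
    names = [step.name for step in steps]

    duplicates = sorted({name for name in names if names.count(name) > 1})
    if duplicates:
        raise ValueError(f"duplicate steps: {duplicates}")

    known = set(names)
    unknown = sorted({dep for step in steps for dep in step.depends_on
                      if dep not in known})
    if unknown:
        raise ValueError(f"unknown dependencies: {unknown}")

    # TopologicalSorter expects a mapping of node -> set of predecessors.
    graph = {step.name: set(step.depends_on) for step in steps}
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        # The detected cycle is the second element of exc.args.
        raise ValueError(f"dependency cycle: {exc.args[1]}") from exc
```

Failing here, before the first step executes, is what lets an operator reason about blast radius from the declaration alone.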

This is the difference between a pipeline and a long script. A script retries from the top or leaves an operator guessing where to resume. A graph can skip completed work, isolate failed branches, preserve artifact lineage, and show why the publish gate is closed. That is what makes the scorecard operational instead of decorative.

The human is part of the architecture

CoHost has a normal quality-gate mode that blocks failed episodes and a stricter mode that also blocks warnings. It also has human-prompt cascade points for consequential decisions. That is not a fallback for automation that failed. It is a product decision: publish is the wrong place to optimize for full autonomy before trust has been earned.
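The two modes and the human review point reduce to a small policy function. This is a sketch only: the `Verdict` and `GateMode` enums and the `publish_allowed` name are assumptions made for the example, not the product's API.

```python
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    WARN = "warn"
    FAIL = "fail"


class GateMode(Enum):
    NORMAL = "normal"   # blocks failed episodes
    STRICT = "strict"   # also blocks warnings


def publish_allowed(verdict: Verdict, mode: GateMode,
                    human_approved: bool) -> bool:
    """Decide whether the irreversible publish step may proceed."""
    if verdict is Verdict.FAIL:
        return False
    if verdict is Verdict.WARN and mode is GateMode.STRICT:
        return False
    # Even a clean scorecard does not publish itself.
    return human_approved
```

Note the last line: human approval is a required input to the decision, not an exception path that only fires when automation breaks.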

The operator sees the per-step scores, the deterministic reasons, and any qualitative eval reasoning. The system narrows the review surface to the parts that matter, but it does not pretend the final judgment has disappeared. In applied AI work, I trust that shape more than a system that hides the judgment call behind a green checkmark.

What generalizes

This architecture is not podcast-specific. The transferable pattern is an instrumented workflow with an explicit gate before the consequential action. Swap publish for agent-drafted emails entering a send queue, CRM updates committing to Salesforce, generated content entering a CMS, or enrichment results changing a customer-facing record. The same rule holds: score what can be measured, ask an eval model for qualitative judgment where it helps, and keep irreversible actions behind a policy boundary that a person can inspect.

The part I would bring to any applied-AI team is the registry shape. New pipeline steps register scorers without rewriting the composite engine. Qualitative evals sit beside the deterministic gate instead of replacing it. Per-dimension floors prevent a fluent average from burying a single dangerous miss. And failure policy lives in the graph, where operators can reason about blast radius before the run starts.
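A registry of this kind can be very small. The sketch below assumes a decorator-based design with hypothetical names (`register_scorer`, `score_step`) and an unweighted mean per step; it shows only the shape, in which adding a step means registering a function and the engine stays untouched.

```python
from typing import Callable

# Maps a step name to the scorer functions registered for it.
_SCORERS: dict = {}


def register_scorer(step: str) -> Callable:
    """Decorator that registers a scorer function for a pipeline step."""
    def decorator(fn: Callable) -> Callable:
        _SCORERS.setdefault(step, []).append(fn)
        return fn
    return decorator


def score_step(step: str, artifacts: dict) -> float:
    """Average every registered scorer for one step (assumed: equal weights)."""
    scorers = _SCORERS.get(step)
    if not scorers:
        raise KeyError(f"no scorers registered for step {step!r}")
    return sum(fn(artifacts) for fn in scorers) / len(scorers)


@register_scorer("transcription")
def coverage(artifacts: dict) -> float:
    """Fraction of the audio covered by the transcript, capped at 1.0."""
    ratio = artifacts["transcribed_seconds"] / artifacts["audio_seconds"]
    return min(1.0, ratio)
```

A new step would add its own `@register_scorer("...")` functions in its own module, and `score_step` would pick them up without any change to the composite logic.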

What I would not overclaim

CoHost is pre-launch. Its local single-show path is demonstrably functional and heavily tested, but I do not claim production outcome metrics from shows it has not published. The thresholds are explicit engineering judgment, not a statistically calibrated model trained on hundreds of labeled episodes. That calibration is the next layer that real production traffic would enable.

The honest claim is narrower and stronger: the architecture turns “AI helped make this” into an auditable workflow. Before anything leaves the system, there is a scorecard, a deterministic floor, optional qualitative eval reasoning, and a human review point for the irreversible step.

This is the AI/eval side of the same portfolio: systems that measure outputs, expose failure, and keep consequential automation reviewable.