Your AI agent needs a second agent whose only job is to say no

Full-stack & AI engineer. I write about multi-agent systems, LLM reliability, and what actually breaks when a model call meets production.

Most agent tutorials show you how to make a model do something. Call a tool, plan a task, chain a few steps together. That part is mostly solved now, and the content reflects it — there are a thousand "build an agent with X" posts.

Almost none of them talk about the thing that actually kept me up at night: what stops the agent from saying something harmful to a stressed-out parent at 9pm?

I build FamNest, a wellness app for busy parents. The core feature is a coaching agent that responds to messages like "I've been up four nights in a row and I think I'm losing it." The first version answered those messages well most of the time. "Most of the time" is a terrifying thing to ship when the downside is telling an exhausted parent something glib, dismissive, or genuinely unsafe.

So I stopped trying to make one model perfect. Instead I gave it a second agent whose entire job is to read what the first one wrote and decide whether it's allowed to go out. This post is how that works, why a single well-prompted model wasn't enough, and the schema I'd reuse on the next project.

One model with a long safety prompt is not a safety system

My first instinct was the obvious one: write a really good system prompt. "You are a supportive coach. Never give medical advice. Watch for signs of crisis. Be warm but careful." You know the drill.

It mostly worked, and that's the trap. A safety instruction buried in a 600-token system prompt is competing with everything else you asked the model to do — be warm, be concise, be specific, remember the context, sound human. Under that load, the safety instruction is just one more soft preference. It bends. It bends exactly when the input is unusual, which is exactly when it matters.

The deeper problem is that a model grading itself in the same pass in which it generates tends to rate its own work as fine. There's no separation of concerns: generation and judgment come out of the same forward pass, shaped by the same "sound helpful" objective.

So I split them. One agent writes. A separate agent — different prompt, narrow job, no obligation to be nice — reviews. The reviewer doesn't care about tone or helpfulness. It answers one question: is this response safe to send?

The reviewer returns a verdict, not a vibe

The thing that made this reliable was forcing the reviewer to output a structured verdict instead of prose. Three possible values:

  • ok — send it as written.

  • revise — there's a fixable problem; here's what's wrong.

  • crisis — this conversation has moved past coaching; a human/escalation path needs to take over.

In code the contract looks like this:

ts

type Verdict = "ok" | "revise" | "crisis";

interface SafetyReview {
  verdict: Verdict;
  reason: string;          // why, in plain language
  concerns: string[];      // specific issues to fix on a "revise"
}

The reviewer agent gets the user's message plus the coach's drafted reply, and is prompted to return only that JSON object. No chit-chat. A revise has to come with concrete concerns so the next step has something to act on — "softens nothing, jumps straight to advice" is actionable; "could be better" is not.
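To make the input side of that contract concrete, here is a minimal sketch of how the reviewer's prompt might be assembled. The prompt wording, the `REVIEWER_INSTRUCTIONS` constant, and the `buildReviewerPrompt` helper are all illustrative assumptions, not FamNest's actual prompt.

```typescript
// Hypothetical wording: an illustration of a short, single-purpose prompt.
const REVIEWER_INSTRUCTIONS = [
  "You review a coaching reply before it is sent to a parent.",
  "Answer one question: is this reply safe to send?",
  'Return only a JSON object: {"verdict": "ok" | "revise" | "crisis", "reason": string, "concerns": string[]}.',
  'On "revise", list concrete, fixable concerns. No other text.',
].join("\n");

function buildReviewerPrompt(userMsg: string, draft: string): string {
  // JSON.stringify quotes and escapes both texts, so newlines or quotes
  // inside them do not break the layout of the prompt.
  return [
    REVIEWER_INSTRUCTIONS,
    `USER_MESSAGE: ${JSON.stringify(userMsg)}`,
    `DRAFT_REPLY: ${JSON.stringify(draft)}`,
  ].join("\n\n");
}
```

The reviewer sees both texts as clearly delimited data, and nothing in the instructions asks it to be warm or helpful.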

The discipline of three states, not a confidence score, matters more than it looks. A float between 0 and 1 pushes you into picking arbitrary thresholds and arguing with yourself about 0.71 vs 0.74. Three named states map cleanly onto three different things the system should do, which is the only reason the verdict exists.
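One way to keep "three states, three actions" honest is an exhaustive switch that the compiler checks. This is a minimal sketch; the `Action` names and the `actionFor` function are hypothetical, chosen only to show the one-to-one mapping.

```typescript
type Verdict = "ok" | "revise" | "crisis";

// Hypothetical action names, used only to show the mapping.
type Action = "send" | "retry_with_concerns" | "escalate_to_human";

function actionFor(verdict: Verdict): Action {
  switch (verdict) {
    case "ok":
      return "send";
    case "revise":
      return "retry_with_concerns";
    case "crisis":
      return "escalate_to_human";
    default: {
      // If a fourth verdict is ever added to the union, this assignment
      // stops compiling until the new case is handled.
      const unhandled: never = verdict;
      throw new Error(`Unhandled verdict: ${String(unhandled)}`);
    }
  }
}
```

A confidence score has no equivalent of this check: nothing forces every threshold band to have a defined behavior.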

The whole flow at a glance

Before the code, here's the shape of it. One thing to notice: every path eventually resolves to something safe, and the failure path (dotted) skips straight to the floor.

[Figure: flowchart from coach agent to safety reviewer to bounded revision loop to deterministic floor]

The revision loop, and why it's bounded

When the verdict is revise, the draft goes back to the coach agent along with the reviewer's concerns, and it tries again. Then the reviewer looks at the new version. This is the loop.

The non-negotiable part: the loop is bounded. It runs at most a fixed number of times — I cap it low, two or three — and then it stops no matter what.

This sounds like a small implementation detail, but it is the most important decision in the whole design. An unbounded "revise until the reviewer is happy" loop is a way to set money on fire and occasionally hang forever. Two agents can disagree politely and indefinitely. You have to decide, up front, what happens when they can't agree.

ts

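const MAX_REVISIONS = 2; // cap on revise rounds; kept low on purpose

async function generateSafeReply(userMsg: string) {
  try {
    let draft = await coach.respond(userMsg);

    // attempt 0 reviews the first draft; each later attempt reviews a revision
    for (let attempt = 0; attempt <= MAX_REVISIONS; attempt++) {
      const review = await reviewer.assess(userMsg, draft);

      if (review.verdict === "crisis") return crisisResponse();
      if (review.verdict === "ok") return draft;

      // verdict === "revise": stop once the revision budget is spent,
      // so no draft is generated that will never be reviewed
      if (attempt === MAX_REVISIONS) break;
      draft = await coach.respond(userMsg, { fix: review.concerns });
    }
  } catch {
    // API error, malformed verdict, anything else: fall through to the floor
  }

  // loop exhausted without an "ok": do NOT ship the last draft
  return fallbackResponse();
}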

Notice what happens when the loop runs out: it does not return the last attempt and hope. The last attempt is, by definition, a draft the reviewer still didn't approve. Shipping it would defeat the entire point.
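To see the bound actually hold, here is a self-contained variant of the loop with the agents passed in as parameters, exercised against a stub reviewer that never approves. The `Agents` interface, `boundedReply`, `stubbornAgents`, and the `FLOOR` and `CRISIS` stand-in strings are all hypothetical names for this sketch. It reviews every draft it generates, so a cap of N means at most N revisions and N + 1 reviews.

```typescript
type Verdict = "ok" | "revise" | "crisis";
interface SafetyReview { verdict: Verdict; reason: string; concerns: string[] }

// Hypothetical agent shapes, injected so the loop can run against stubs.
interface Agents {
  draft(userMsg: string, concerns?: string[]): Promise<string>;
  review(userMsg: string, draft: string): Promise<SafetyReview>;
}

const FLOOR = "floor"; // stand-in for the pre-written fallback text
const CRISIS = "crisis-path"; // stand-in for the escalation response

async function boundedReply(agents: Agents, userMsg: string, maxRevisions: number): Promise<string> {
  let draft = await agents.draft(userMsg);
  for (let attempt = 0; attempt <= maxRevisions; attempt++) {
    const review = await agents.review(userMsg, draft);
    if (review.verdict === "crisis") return CRISIS;
    if (review.verdict === "ok") return draft;
    if (attempt === maxRevisions) break; // budget spent: do not draft again
    draft = await agents.draft(userMsg, review.concerns);
  }
  return FLOOR; // never the last unapproved draft
}

// A reviewer that never approves, to show the loop still terminates.
function stubbornAgents(calls: { count: number }): Agents {
  return {
    async draft(): Promise<string> {
      calls.count++;
      return "draft";
    },
    async review(): Promise<SafetyReview> {
      calls.count++;
      return { verdict: "revise", reason: "never satisfied", concerns: ["still too glib"] };
    },
  };
}
```

With a cap of 2, the worst case in this sketch is three drafts and three reviews, six model calls in total, and then the floor. The cost of a permanent disagreement is known before the first request is served.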

The floor: a deterministic answer that needs no model at all

This is the piece I'm proudest of, and it's the least flashy.

When the loop is exhausted, or the model API errors, or the JSON comes back malformed, or anything else goes sideways — the system drops to a deterministic fallback floor. A pre-written, human-authored, plain response that is always safe, requires zero model calls, and can never fail in the way a generation can fail.

It's not clever. It's something like an honest "I want to make sure I get this right — here's how to reach a real person" with the appropriate resources. It will never win an engagement metric. That's fine. Its job is to be the thing that's true when everything smarter has failed.

The mental model I keep coming back to: capability is optional, the floor is not. Every fancy path in the system is allowed to fail as long as it fails into the floor. The floor itself has no dependencies that can fail. No model, no network round trip to an LLM, no parsing. Just text.

That inversion changed how I think about reliability for agents generally. You don't make the smart path bulletproof. You make a dumb path that's already bulletproof, and you let the smart path degrade onto it.
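"Fail into the floor" can be sketched as a wrapper around whatever the smart path is. The `withFloor` name and the `FLOOR_RESPONSE` wording are assumptions for illustration; the real floor text is human-authored and would carry the appropriate resources.

```typescript
// Assumed wording: the actual floor text is not shown in this post.
const FLOOR_RESPONSE =
  "I want to make sure I get this right. Here is how to reach a real person.";

// Runs any model-backed path and degrades to the constant on any failure.
async function withFloor(smartPath: () => Promise<string>): Promise<string> {
  try {
    const reply = await smartPath();
    // Treat an empty or non-string reply as a failure as well.
    return typeof reply === "string" && reply.trim() !== "" ? reply : FLOOR_RESPONSE;
  } catch {
    // Any thrown error (API failure, malformed verdict) lands on the same constant.
    return FLOOR_RESPONSE;
  }
}
```

The fallback branch does no I/O and no parsing: it returns a string literal, which is what keeps it from failing the way a generation can.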

What this bought me

Three concrete things, none of which a single bigger prompt gave me:

Separation of concerns. The coach is allowed to be warm and a little loose, because it is no longer the last line of defense. The reviewer is allowed to be cold and strict, because it never has to talk to a user. Each agent does one job well instead of two jobs in tension.

A real escalation path. The crisis verdict means the system can recognize when a conversation has stopped being something software should be handling alone, and route accordingly — instead of a coaching model gamely trying to coach its way through a moment that needs a human.

Auditability. Every reply now has a verdict and a reason attached to it. When something looks off, I'm not re-reading a 600-line prompt guessing why the model did what it did. I have a log: here's what the coach drafted, here's what the reviewer said, here's what shipped.
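As a rough sketch of what such a log record could look like, assuming one entry per user message: the `AuditEntry` shape, its field names, and the `auditEntry` helper are hypothetical, not the schema FamNest actually stores.

```typescript
type Verdict = "ok" | "revise" | "crisis";
interface SafetyReview { verdict: Verdict; reason: string; concerns: string[] }

// Hypothetical log shape: field names are illustrative.
interface AuditEntry {
  userMsg: string;
  attempts: { draft: string; review: SafetyReview }[]; // every draft and its verdict
  shipped: string; // the text that actually went out
  shippedVia: "draft" | "crisis" | "floor";
  at: string; // ISO 8601 timestamp
}

function auditEntry(
  userMsg: string,
  attempts: { draft: string; review: SafetyReview }[],
  shipped: string,
  now: Date = new Date(),
): AuditEntry {
  // The final verdict decides which path produced the shipped text;
  // no attempts at all means the system went straight to the floor.
  const last = attempts[attempts.length - 1];
  const shippedVia: AuditEntry["shippedVia"] =
    last?.review.verdict === "ok" ? "draft" : last?.review.verdict === "crisis" ? "crisis" : "floor";
  return { userMsg, attempts, shipped, shippedVia, at: now.toISOString() };
}
```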

What I'd tell you before you build this

A few things I learned the unglamorous way:

  • Make the reviewer's output a hard schema and validate it. If the reviewer returns malformed JSON, that is itself a failure that should route to the floor, not a thing you try to regex your way out of.

  • The reviewer's prompt should be short and single-purpose. The moment you ask it to also be helpful or also suggest rewrites in detail, you've recreated the original problem inside the reviewer.

  • Bound the loop before you write the loop. Decide the cap and the exhaustion behavior first. If you add them later, you'll add them after the runaway-cost incident, not before.

  • Write the floor response first, not last. It's the foundation everything else degrades onto. Building it last means building everything on a foundation that doesn't exist yet.
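The first point, validating the reviewer's output as a hard schema, can be sketched without any library. The `parseSafetyReview` name is an assumption; returning `null` here stands for "route to the floor", and rejecting a `revise` with no concerns is one way to enforce the rule that a revise must be actionable.

```typescript
type Verdict = "ok" | "revise" | "crisis";
interface SafetyReview { verdict: Verdict; reason: string; concerns: string[] }

const VERDICTS: readonly string[] = ["ok", "revise", "crisis"];

// Returns a validated review, or null when the output should route to the floor.
function parseSafetyReview(raw: string): SafetyReview | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // not valid JSON: no regex rescue attempts
  }
  if (typeof data !== "object" || data === null || Array.isArray(data)) return null;

  const { verdict, reason, concerns } = data as Record<string, unknown>;
  if (typeof verdict !== "string" || VERDICTS.indexOf(verdict) === -1) return null;
  if (typeof reason !== "string") return null;
  if (!Array.isArray(concerns) || !concerns.every((c) => typeof c === "string")) return null;
  // A "revise" with nothing to fix gives the coach nothing to act on.
  if (verdict === "revise" && concerns.length === 0) return null;

  return { verdict: verdict as Verdict, reason, concerns: concerns as string[] };
}
```

Anything that is not exactly the expected object, including valid JSON wrapped in friendly prose, comes back as `null`.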

I don't think this pattern is specific to wellness or to parents. Any agent where a bad output has a real-world cost — health, money, legal, safety — is an agent that probably shouldn't be its own last reviewer. Generation and judgment want to be different jobs, held by different agents, with a dumb, reliable floor underneath both of them.

The model that writes shouldn't be the only thing deciding whether what it wrote is allowed to leave the building.


This is part of how I build FamNest, an AI wellness app for busy parents. I write up the engineering decisions as I make them — the building story, not the marketing story.