A Model Call Isn't a System

The first version of FamNest's daily check-in coach was one function. Take the parent's message, send it to the model, return the reply. It worked in the demo. It worked for me, testing it alone, in a good mood, typing reasonable sentences.

It did not work the moment I imagined a real parent — three hours of sleep, a toddler who won't nap, typing something raw at 11pm — hitting send on that same box.

A single model call is a script. The gap between a script and a system is everything that has to happen around the call: routing, verification, memory, fallback. That gap is where the actual engineering lives, and it's where most AI tutorials quietly stop.

The pipeline

Coach → Safety → (revise) → ship, with two supporting layers underneath.

The coach agent drafts the plan. Its prompt makes it reason silently first — read the numbers, read the parent's history, pick the single biggest lever, and fit every action inside the minutes the parent actually has. Not five action items. One thing, sized to a real day.

The safety reviewer agent reads that draft before it ever reaches the parent and returns a verdict: ok, revise, or crisis. On revise — an action that blows past the time budget, or advice drifting toward medical territory — the coach gets exactly one correction pass. Bounded on purpose. An unbounded critique loop is just a slower way to ship a bad reply.

The memory layer distills recent check-ins into a short trend summary — sleep slipping, stress climbing, the parent keeps circling back to "be more patient" — and feeds that into the next coaching pass. That's the difference between a form that resets every day and something that feels like it's actually been paying attention.

A lightweight lexical RAG layer retrieves from a curated, sourced wellness corpus and grounds the coach's advice in it, with source provenance attached to every plan. I chose lexical retrieval over vector embeddings deliberately — it's a fraction of the footprint, runs fine on a free serverless tier, and for a corpus this size the recall difference doesn't justify the cost.

Underneath all of it sits a deterministic floor: if any model call fails outright, a fixed fallback plan still ships. A parent reaching out at a low moment never gets a blank error screen instead of a response.

The constraint that shaped everything: no budget

Here's the part that's more honest than impressive. I couldn't put a card on a paid LLM API — the payment-method wall that anyone building from certain parts of the world knows well. Stripe, OpenAI billing, plenty of "just add a credit card" onboarding flows simply don't clear for a Kenya-issued card the way they assume.

Instead of stalling on that wall, I built the LLM layer to be provider-swappable from day one and pointed it at Groq's free tier running Llama 3.3 70B. Groq's API is OpenAI-compatible, so the integration cost was small, and swapping to Claude or GPT later is a single environment variable, not a rewrite. The entire pipeline — coaching, safety review, revisions, weekly summaries — runs in production today at zero API cost.

Constraints get a bad reputation. This one forced a cleaner abstraction boundary than I would have bothered writing with a blank check sitting in front of me.

The two bugs that only real testing caught

Code review wouldn't have caught either of these. I wrote a throwaway script that drives the full pipeline through three scenarios — a brand-new parent, a returning parent running on no sleep, and a crisis message — with the database stubbed out. Then I actually ran it.

Bug one: the offline path had no crisis safety. When the model was unavailable and the deterministic fallback kicked in, it took whatever the parent wrote and jammed it into a generic encouraging template. A message containing "I want to disappear" came back wrapped in language about showing up being real strength. The crisis check only existed on the AI-call path — nobody had asked what happens when that path isn't taken. The fix was a deterministic keyword guard that runs on the offline path too, so a crisis message always routes to real support resources regardless of whether the model is reachable.

Bug two: the safety classifier itself was non-deterministic on crisis input. On the AI path, the safety reviewer is an LLM call, and the exact same crisis message returned revise on one run and crisis on the next. The revise verdict quietly skipped the curated crisis response and let a breathing-exercise suggestion through on an acute-crisis message. That's the kind of bug that passes every demo and fails exactly when it matters.

The lesson sits underneath both bugs: never leave your single highest-stakes decision to a probabilistic classifier with no deterministic backstop. The model can be the first opinion. It can't be the only one.

What this actually is

None of this is exotic. There's no agent framework, no autonomous tool-calling loop, no twelve-step chain. It's two LLM calls with a verdict gate between them, a memory summary, a retrieval step, and a fallback that doesn't depend on any of it working.

That's the actual claim in the title. A model call isn't a system — but it doesn't take much more than a model call to become one. It takes deciding, explicitly, what happens when the model is wrong, when the model is unreachable, and when the stakes are too high to let it decide alone.