Beyond the Agent Framework: Two Layers Every AI-native Production App Needs
A year ago, most LLM features in production were one model call wrapped in a prompt. That’s no longer the shape of the work. Teams are building agents now, on pydantic-ai and others, and the average production feature is a graph of model calls, tool invocations, and structured outputs feeding back into more model calls. That’s progress. It also means every reliability and safety problem you had with one call now applies to ten, and a few new ones (runaway tool loops, partial tool failures, structured-output drift) come along for free. A demo becomes a product when two things are true: it keeps working when the model doesn’t, and it stops working when it shouldn’t. Almost no team gets both right on the first try, because the patterns that make this easy aren’t in the SDK quickstarts.
If you’re leading an engineering team shipping LLM-backed features, the two layers worth investing in early are cascades (for reliability) and guardrails (for governance, privacy, and security). They sound like infrastructure busywork. They are actually the difference between “we shipped an AI feature” and “we run an AI product.”
This post is opinionated. Where there’s a tradeoff, I’ll tell you which side I think is right. I’ll use one running example throughout: a customer support triage agent. It reads a long support thread, classifies the issue, calls a tool to fetch similar past tickets, and drafts a response for the human agent to send. Three model calls (classify, plan retrieval, draft), one tool invocation, structured output at every step. Realistic enough to touch every concern that matters (PII, prompt injection, cost, latency, quality drift, tool failures) and small enough that every pattern below applies to it directly.
Layer 1: Cascades
Why a single model call is a liability
A single client.messages.create(...) call inside the triage agent assumes the provider is up, your account isn’t rate-limited, the model you pinned still exists, the latency is acceptable, and the cost per call hasn’t quietly tripled because someone changed the prompt. Now multiply that by three model calls per ticket. In any given month, at least one of those assumptions breaks. I’ve seen all five break in the same week.
The usual responses are wrong in instructive ways:
“We’ll add a retry.” Retries help with transient errors. They don’t help when the provider is having a regional outage, when you’ve hit your spend cap, or when the model returned a successful response that’s unusable.
“We’ll switch providers if there’s a problem.” Switching providers under pressure, in a hotfix, with no tested fallback path, is how you turn a one-hour incident into a one-week incident.
“We’ll use a router/proxy product.” Fine, but you’ve moved the problem, not solved it. You still need to decide what the fallback policy is. The proxy can’t decide for you.
What a cascade actually is
A cascade is an ordered, declarative fallback policy across models and providers, evaluated per call, observable in production, and changeable without a deploy.
For the classification step in the triage agent, the policy might be: try a strong model first, fall back to a peer model from a different provider if the first fails, and as a last resort fall back to a smaller, cheaper model and mark the response degraded so the UI can tell the human agent. Each step in the agent gets its own cascade. The classifier and the drafter don’t need the same fallback policy, and pretending they do is how you end up paying Sonnet rates for a five-way classification.
In pseudo-code, the shape that works is small on purpose:

# one cascade per step, each tuned for its job
classifier_cascade = [
    ("anthropic", "claude-sonnet-4-6"),  # primary
    ("openai", "gpt-4.1"),               # peer fallback
    ("anthropic", "claude-haiku-4-5"),   # degraded fallback
]
retrieval_planner_cascade = [
    ("anthropic", "claude-haiku-4-5"),   # tool-calling, no need for premium
    ("openai", "gpt-4.1-mini"),
]
drafter_cascade = [
    ("anthropic", "claude-sonnet-4-6"),  # quality matters, no degraded fallback
    ("openai", "gpt-4.1"),
]
agent = TriageAgent(
    classifier=classifier_cascade,
    retrieval_planner=retrieval_planner_cascade,
    drafter=drafter_cascade,
)
response = await agent.run(thread_text)

The invocation loop underneath is boring on purpose: try each step in order, catch the failure modes you’ve decided are recoverable (rate-limit errors, provider outages, timeouts, content-filter rejections), record which step served the call, and surface a degraded flag when the last-resort step is used.
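Here’s a minimal sketch of that loop, to show how little is in it. Everything named here is an assumption for illustration: `RecoverableError` stands in for whatever provider exceptions you decide to treat as recoverable, `call_model` is a hypothetical adapter you would write around each provider SDK, and `last_step_is_degraded` is how this sketch lets the classifier mark its last resort as degraded while the drafter does not.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Sequence


class RecoverableError(Exception):
    """Hypothetical umbrella for failures you have decided are recoverable
    (rate limits, provider outages, content-filter rejections)."""


@dataclass(frozen=True)
class CascadeResult:
    text: str
    provider: str
    model: str
    step_index: int  # 0 means the primary served the call
    degraded: bool   # True when a last-resort step served the call


# Assumed adapter signature: (provider, model, prompt) -> response text.
ModelCaller = Callable[[str, str, str], Awaitable[str]]


async def run_cascade(
    cascade: Sequence[tuple[str, str]],
    call_model: ModelCaller,
    prompt: str,
    *,
    timeout_s: float = 30.0,
    last_step_is_degraded: bool = False,
) -> CascadeResult:
    """Try each (provider, model) in order and return the first success."""
    failures: list[str] = []
    for index, (provider, model) in enumerate(cascade):
        try:
            text = await asyncio.wait_for(
                call_model(provider, model, prompt), timeout=timeout_s
            )
        except (RecoverableError, asyncio.TimeoutError) as exc:
            # Recoverable: remember why this step failed, then try the next one.
            failures.append(f"{provider}/{model}: {exc!r}")
            continue
        is_last = index == len(cascade) - 1
        return CascadeResult(
            text=text,
            provider=provider,
            model=model,
            step_index=index,
            degraded=last_step_is_degraded and is_last and index > 0,
        )
    raise RuntimeError("every cascade step failed: " + "; ".join(failures))
```

Anything that isn’t in the recoverable set propagates immediately. That’s deliberate in this sketch: a bug in your own code shouldn’t quietly trigger a fallback.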
The non-obvious things that matter
Inject the cascade; don’t import it. Code that calls LLMs should take the cascade as a constructor argument, not reach for a global. This is the single change that makes the rest of this testable. Every “we couldn’t write a unit test for that” excuse I’ve heard about LLM code traces back to skipping this.
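A small sketch of what injection buys you. The `Classifier` class, its label set, and the idea that a cascade is “any async callable from prompt to text” are all assumptions made for this example, not an API from any framework.

```python
import asyncio
from typing import Awaitable, Callable

# Assumed shape: a cascade is any async callable that maps a prompt to text.
Cascade = Callable[[str], Awaitable[str]]


class Classifier:
    """Classifies a support thread; the cascade is injected, never imported."""

    # Hypothetical five-way label set.
    LABELS = ("billing", "bug", "account", "feature_request", "other")

    def __init__(self, cascade: Cascade) -> None:
        self._cascade = cascade

    async def classify(self, thread_text: str) -> str:
        raw = await self._cascade(f"Classify this support thread:\n{thread_text}")
        label = raw.strip().lower()
        # Anything outside the known label set is treated as "other".
        return label if label in self.LABELS else "other"


# In a unit test the cascade is a plain async stub: no network, no API key.
async def stub_cascade(prompt: str) -> str:
    return " Billing \n"


label = asyncio.run(Classifier(stub_cascade).classify("I was charged twice"))
print(label)  # prints "billing"
```

The production wiring passes a real cascade into the same constructor. The test never has to patch a module global.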
Alert on every fallback, not just on total failure. A cascade that silently falls through to the cheap model on every call is broken in a way that won’t show up in error rates. It’ll show up in your bill and in subtle quality regressions weeks later. If step 0 fails, that’s a signal. Page someone, or at least Slack someone.
Every call should land in a log you can query. Fallback rate and cost-per-step are the two numbers that tell you whether the policy is working. Without them, the cascade is a black box that feels reliable. With them, you can answer “did the classifier fall back more this week, and on which tickets?” in one query.
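To make those two numbers concrete, here’s a sketch that computes them from an in-memory list of call records. The `CallRecord` fields are assumptions about what your log contains, and I’m reading “cost-per-step” as mean cost per call for each agent step. In practice this would be a SQL query against your log store rather than Python.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CallRecord:
    """One row of a hypothetical call log; field names are assumptions."""
    agent_step: str     # e.g. "classifier", "retrieval_planner", "drafter"
    cascade_index: int  # 0 = primary served the call, >0 = a fallback did
    cost_usd: float


def fallback_rate(records: list[CallRecord]) -> dict[str, float]:
    """Share of calls per agent step that were served by a fallback."""
    totals: dict[str, int] = defaultdict(int)
    fallbacks: dict[str, int] = defaultdict(int)
    for record in records:
        totals[record.agent_step] += 1
        if record.cascade_index > 0:
            fallbacks[record.agent_step] += 1
    return {step: fallbacks[step] / totals[step] for step in totals}


def cost_per_step(records: list[CallRecord]) -> dict[str, float]:
    """Mean cost per call for each agent step."""
    totals: dict[str, float] = defaultdict(float)
    counts: dict[str, int] = defaultdict(int)
    for record in records:
        totals[record.agent_step] += record.cost_usd
        counts[record.agent_step] += 1
    return {step: totals[step] / counts[step] for step in totals}
```

If the classifier’s fallback rate moves from near zero to a quarter of calls, that’s the signal to investigate, even though no request failed.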
Treat the “degraded” path as a first-class product state. If your fallback is a smaller model, the UI should know. Human agents tolerate a degraded state gracefully (“draft generated with reduced model”). They don’t tolerate “this used to work and now it’s worse and nobody told me.”
What cascades are not
Cascades are not a quality-improvement mechanism. They don’t make drafts better. They make drafts exist when they otherwise wouldn’t. Don’t oversell them internally. Engineering leaders who pitch cascades as a quality lever get caught out when the cheaper fallback model produces visibly worse output.
Layer 2: Guardrails
Why prompt-based safety is theater
The default approach to keeping the triage agent “safe” is to add sentences to the system prompt: “Do not include customer SSNs or credit card numbers in the draft. Do not mention competitors. Do not give legal or medical advice.” This works until the moment it matters. Models drift, prompts get truncated, jailbreaks get cleverer, and, most commonly, somebody on your team edits the prompt for an unrelated reason and quietly removes a rule that legal cared about. With an agent, the problem multiplies: every step has its own prompt, and the rule legal cared about needs to be in all of them.
Prompt rules are suggestions to a probabilistic system. Guardrails are policy enforced outside the model, on both the input and the output.
What a guardrail actually is
A guardrail is a deterministic, auditable check that runs around the model call, configured separately from the prompt, and owned by someone other than the engineer writing the feature.
That last point is the one most teams miss. The reason guardrails exist as a separate layer is that the people responsible for what the model is allowed to do are usually not the people writing the prompt. Legal, compliance, security, and ops have a legitimate interest in saying “the triage agent must never output a credit card number” without having to file a PR against your feature code. If the only way to enforce that is to edit a Python file, you have an org problem disguised as a tech problem.
The shape that works, applied to the triage agent:

# guardrails.yaml
version: 2026-05-12.3
profiles:
  triage_agent:
    input:
      - name: pii_redaction
        action: redact  # redact before sending to model
        patterns: [ssn, credit_card, api_key]
      - name: prompt_injection_scan
        action: block_on_match
        threshold: 0.85
    output:
      - name: pii_egress_scan
        action: block_and_alert
        patterns: [ssn, credit_card, internal_employee_id]
      - name: competitor_mentions
        action: rewrite
        rules: [
          "replace competitor names with 'another provider'"
        ]
      - name: refusal_consistency
        action: log_only

The invocation composes guardrails around the agent:

async def triage(thread_text):
    request = guardrails.apply_input(
        "triage_agent",
        thread_text,
    )
    # the agent, with cascades from Layer 1 at each step
    response = await agent.run(request)
    response = guardrails.apply_output(
        "triage_agent",
        response,
    )
    audit.record(
        guardrails.version,
        request.id,
        guardrails.findings,
    )
    return response

Notice the cascades and the guardrails are composed but independent. You can change a fallback policy without touching guardrail policy, and you can tighten guardrail policy without redeploying the feature.
The non-obvious things that matter
Separate the three actions: redact, block, alert. Most teams conflate them. Redaction modifies content silently and lets the call proceed (good for PII on the input side of the agent, where customer threads routinely contain account numbers). Block raises an error and surfaces a user-facing message (good for prompt injection; if a customer thread is trying to jailbreak the agent, you don’t want to triage it). Alert notifies a human asynchronously (good for ambiguous output cases where you want to learn before you enforce). Use all three. They are not substitutes.
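One way to keep the three actions from blurring together is to make them distinct values in code. This is a toy sketch: the `Rule` and `Outcome` types are hypothetical, and the regexes (including the made-up `refund_promise` rule) are placeholders that are far too naive for real PII or injection detection.

```python
import re
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    REDACT = "redact"  # modify the text, let the call proceed
    BLOCK = "block"    # stop the call, surface an error
    ALERT = "alert"    # leave the text alone, notify a human later


class GuardrailBlocked(Exception):
    """Raised by a block rule; the caller maps it to a user-facing message."""


@dataclass(frozen=True)
class Rule:
    name: str
    pattern: re.Pattern
    action: Action


@dataclass
class Outcome:
    text: str
    alerts: list[str] = field(default_factory=list)


def apply_rules(text: str, rules: list[Rule]) -> Outcome:
    outcome = Outcome(text)
    for rule in rules:
        if not rule.pattern.search(outcome.text):
            continue
        if rule.action is Action.BLOCK:
            raise GuardrailBlocked(rule.name)
        if rule.action is Action.REDACT:
            outcome.text = rule.pattern.sub(f"[{rule.name.upper()}]", outcome.text)
        else:
            outcome.alerts.append(rule.name)  # queued for asynchronous review
    return outcome


# Toy patterns for illustration only.
RULES = [
    Rule("prompt_injection",
         re.compile(r"ignore (all )?previous instructions", re.I), Action.BLOCK),
    Rule("ssn", re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), Action.REDACT),
    Rule("refund_promise", re.compile(r"\bguaranteed?\b", re.I), Action.ALERT),
]
```

Because each action is a separate branch, moving a rule from alert to block once you trust it is a one-word config change rather than a rewrite.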
Make the policy editable by non-engineers, but version-controlled. A guardrail config that only an engineer can edit will go stale. A guardrail config that anyone can edit with no history will get you sued. The sweet spot is a configuration UI for ops, legal, and security users that produces a versioned, auditable change record: every edit attributed, every version diff-able, rollback in one click.
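The storage side of that is less work than it sounds. Here’s a hedged sketch of an append-only policy history with attribution, diffs, and rollback, using only the standard library. `PolicyHistory` and its methods are invented for this example; a real system would persist versions in a database or in git rather than in memory.

```python
import difflib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class PolicyVersion:
    number: int
    author: str
    saved_at: datetime
    text: str  # the full guardrail config as text, e.g. the YAML


class PolicyHistory:
    """Append-only history: edits and rollbacks both create a new version."""

    def __init__(self) -> None:
        self._versions: list[PolicyVersion] = []

    def save(self, author: str, text: str) -> PolicyVersion:
        version = PolicyVersion(
            len(self._versions) + 1, author, datetime.now(timezone.utc), text
        )
        self._versions.append(version)
        return version

    def get(self, number: int) -> PolicyVersion:
        if not 1 <= number <= len(self._versions):
            raise KeyError(f"no policy version {number}")
        return self._versions[number - 1]

    def current(self) -> PolicyVersion:
        return self.get(len(self._versions))

    def diff(self, old: int, new: int) -> str:
        before = self.get(old).text.splitlines()
        after = self.get(new).text.splitlines()
        lines = difflib.unified_diff(before, after, f"v{old}", f"v{new}", lineterm="")
        return "\n".join(lines)

    def rollback(self, author: str, to: int) -> PolicyVersion:
        # A rollback is itself an attributed version, so history is never rewritten.
        return self.save(author, self.get(to).text)
```

The configuration UI sits on top of something like this. The point of the sketch is that “who changed what, and when” falls out of never overwriting a version.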
Log the guardrail findings, even when nothing was blocked. The most valuable artifact a guardrail layer produces, long-term, is not the blocks. It’s the corpus of near-misses, drafts the model wanted to write that contained something policy-adjacent and were nudged. That corpus is how you tune the policy, defend it to auditors, and convince your CISO this is real.
Prefer purpose-built classifiers to general-purpose LLMs for enforcement. Regex is brittle, especially for PII. Modern guardrail stacks lean on small specialized models (GLiNER and similar NER models for PII, prompt-injection classifiers, toxicity classifiers). They’re fast, cheap, give you a calibrated score you can threshold, and, critically, they don’t share failure modes with the generative model you’re guarding. Use a general-purpose LLM as a judge for genuinely ambiguous cases where no classifier exists, and prefer it for offline evaluation over inline enforcement. In the live path it adds a second model call’s latency and cost to every request, and it inherits the same drift and jailbreak surface as the model it’s policing.
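What “a score you can threshold” looks like in code, as a sketch. The scorer below is a keyword-counting stand-in, not a real classifier, and the `review_at` band is my assumption. The 0.85 block threshold simply mirrors the example config.

```python
from typing import Callable

# Assumed interface: a scorer maps text to a probability-like value in [0, 1].
Scorer = Callable[[str], float]


def toy_injection_score(text: str) -> float:
    """Stand-in scorer. A real deployment would call a trained classifier here."""
    hints = ("ignore previous instructions", "system prompt", "you are now")
    lowered = text.lower()
    hits = sum(hint in lowered for hint in hints)
    return min(1.0, hits / 2)


def screen(
    text: str,
    score: Scorer,
    block_at: float = 0.85,
    review_at: float = 0.5,
) -> str:
    """Map a classifier score to one of three decisions."""
    value = score(text)
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"score out of range: {value}")
    if value >= block_at:
        return "block"
    if value >= review_at:
        return "alert"  # ambiguous: let it through, flag it for a human
    return "allow"
```

Swapping the toy scorer for a real model changes one argument. The thresholds stay in config, where the people who own the policy can move them.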
What guardrails are not
Guardrails are not a substitute for a good prompt or a well-chosen model. They are a backstop. If your prompt is producing dangerous output 30% of the time and your classifiers are catching it, you don’t have a working product. You have a guardrail layer that’s about to miss the 31st percent.
How to think about the investment
If you’re an engineering leader deciding whether to build either of these now or “later,” the honest framing:
Cascades pay back the first time a provider has a regional outage. That happens more often than the SLAs suggest. The investment is small if you do it before you have ten features each calling the LLM directly, and large if you wait until you have to retrofit.
Guardrails pay back the first time someone outside engineering asks “how do we make sure the model can never do X?” When that question comes, from legal, from a customer’s security review, from a regulator, you want the answer to be a config diff, not a sprint.
The pattern I’d push for: build both as shared code, not per-feature code. One module that exposes a single call. Give it a feature name, get back a response that’s already been through the cascade and the guardrails. Feature teams don’t write retry loops. They don’t write PII checks. They don’t pick fallback models. That’s not a platform-team project, it’s a file. The mistake to avoid is letting every feature team roll its own, because then you have neither cascades nor guardrails, just a coordination problem with extra steps.
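To show how small that file can be, here’s a sketch of the single entry point. `LLMGateway` and `Feature` are names I’m inventing for the example, and the stubs stand in for the real guardrails and the cascaded agent.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Mapping


@dataclass(frozen=True)
class Feature:
    """Everything one feature needs, wired up once by whoever owns the module."""
    check_input: Callable[[str], str]     # input guardrails: return cleaned text or raise
    run: Callable[[str], Awaitable[str]]  # the model or agent call, cascade included
    check_output: Callable[[str], str]    # output guardrails: return safe text or raise


class LLMGateway:
    def __init__(self, features: Mapping[str, Feature]) -> None:
        self._features = dict(features)

    async def call(self, feature_name: str, text: str) -> str:
        feature = self._features[feature_name]  # KeyError for unregistered features
        cleaned = feature.check_input(text)
        raw = await feature.run(cleaned)
        return feature.check_output(raw)


# Stubs standing in for the real guardrails and cascaded agent.
async def fake_agent(text: str) -> str:
    return f"Draft reply about: {text}"


gateway = LLMGateway({
    "triage_agent": Feature(
        check_input=lambda t: t.replace("123-45-6789", "[SSN]"),
        run=fake_agent,
        check_output=str.strip,
    )
})
reply = asyncio.run(gateway.call("triage_agent", "SSN 123-45-6789 was charged twice"))
print(reply)  # prints "Draft reply about: SSN [SSN] was charged twice"
```

Feature code only ever sees `gateway.call(...)`. Which models sit behind it, and which rules wrap it, are decisions made once in the shared module.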
The teams that get this right treat the LLM call as the least interesting part of the system. The interesting parts are what surrounds it: which model do we fall back to, and what are we allowed to say.
The production version of both layers is smaller than the YAML in this post makes it look. In the next post I’ll take the triage agent from a naive pydantic-ai (or similar) implementation to cascaded, guarded, and logged at every step, and show that each step is a handful of lines, not a quarter of platform work.


