Agent security

LLMs Are Probabilistic. Agent Authority Cannot Be.

Why AI agents need runtime enforcement layers that separate reasoning from authority.

System design diagram showing user intent compiled into a task scope before untrusted context reaches the agent, with runtime enforcement controlling actions such as sending, downloading, payment, and credentials.
A task scope should be created from trusted user intent before untrusted runtime context reaches the agent.

For years, the main risk of an AI model was bad output. You read it, judged it, and decided what to do next. You were the security boundary.

That era is ending. Agents now read, click, send, download, upload, buy, change settings, call tools, write code, and reach into business systems. The output is no longer a paragraph you skim and discard. It is an action that lands in the real world and stays there. The dangerous part of an agent is no longer what it says - it is what it is allowed to do.

A language model is probabilistic by design. It predicts the next good move; that is what makes it useful. But the authority to act on that move - to spend money, send mail, push code, touch credentials - cannot be probabilistic too. Most agent stacks blur this: the model decides what to try and, in practice, also decides what is allowed. Those are two different jobs, and the second one should not belong to the model.

The security mismatch

The mismatch is simple. The model reasons statistically. The actions it triggers are deterministic, privileged, and sometimes irreversible.

A model predicting tokens has no built-in sense of authority. It can misread the task, hallucinate a step nobody asked for, be steered by a sentence buried in a webpage, or drift from "summarize this thread" to "reply to this thread" without anyone deciding that was acceptable. None of this requires a bug - it is the normal behavior of a system that predicts rather than verifies.

The action on the other end is concrete. A payment clears. A file leaves the building. A commit lands on main. A reply goes to the wrong person. There is no probability distribution over whether the money moved. It moved.

So we are wiring a probabilistic component straight into privileged capabilities and asking it to also guard them. A system that can be talked out of its instructions should not be the last line of defense for instructions.

The cleanest signal you have is the original request

There is one moment in an agent's run that is relatively clean: when the user says what they want. "Book a flight." "Summarize my email." "Find a part under a hundred dollars and order it." "Run the tests and tell me what failed."

That request is not perfect - people are vague, and intent can be misread. But it is far less contaminated than everything the agent touches afterward. Webpages, retrieved documents, tool outputs, API responses, email bodies, comments, ads, and hidden prompts are all reachable by an attacker. Even the agent's own earlier summaries become untrusted once they draw on content it did not control.

This gives a clear principle: untrusted context can guide execution, but it must never expand authority. A webpage can tell the agent how to finish the task; it cannot tell the agent to do a different, larger one. To hold that line, fix the boundary before the agent reads any of that context - while the only input is the user. Once the agent has read the web, the web has had a chance to talk back.
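One way to picture this principle, as a minimal Python sketch: authority is fixed from the user's request before any page is read, and whatever untrusted content suggests later is kept only if it already falls inside that authority. The names `Authority`, `Hint`, and `usable_hints`, along with the action names and the example URL, are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class Authority:
    """Fixed from the user's request before any untrusted content is read."""
    allowed_actions: frozenset


@dataclass(frozen=True)
class Hint:
    """A step suggested by untrusted context (hypothetical shape)."""
    source: str
    suggested_action: str


def usable_hints(authority, hints):
    """Keep hints that guide execution; drop any that would expand authority."""
    return [h for h in hints if h.suggested_action in authority.allowed_actions]


# Built while the only input is the user, e.g. "find this part and compare prices".
authority = Authority(allowed_actions=frozenset({"search", "read_page"}))

# Collected later, from a page the agent visited.
hints = [
    Hint("https://shop.example/item", "read_page"),   # helps finish the task
    Hint("https://shop.example/item", "send_email"),  # a different, larger task
]

kept = usable_hints(authority, hints)
print([h.suggested_action for h in kept])

# The frozen dataclass rejects attempts to rebind its fields after creation.
try:
    authority.allowed_actions = frozenset({"search", "read_page", "send_email"})
except FrozenInstanceError:
    print("authority unchanged")
```

The filter is deliberately dumb: it never asks why the page wanted an email sent, only whether sending email was already in bounds.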

Turn the request into a task scope

The move is to put a small, deterministic step between the user's request and the running agent. Before the agent does anything, that step translates the request into a structured task scope: a concrete description of what this job may involve - the objective, the services and resources in bounds, the actions permitted, the data that may be read, written, or shared, the operations that need explicit confirmation, and what to do when something falls outside the lines (allow, ask, block, or re-scope).
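As a rough illustration, a task scope with those parts might be represented like this in Python. The field names, the default of blocking, and the example values are assumptions made for this sketch; the scope is written by hand here, and how it would be derived from the request is not shown.

```python
from dataclasses import dataclass
from enum import Enum


class OutOfScope(Enum):
    """What to do when a proposed action falls outside the scope."""
    ALLOW = "allow"
    ASK = "ask"
    BLOCK = "block"
    RESCOPE = "re-scope"


@dataclass(frozen=True)
class TaskScope:
    objective: str
    services: frozenset            # services in bounds
    resources: frozenset           # resources in bounds
    actions: frozenset             # actions permitted
    data_read: frozenset           # data that may be read
    data_write: frozenset          # data that may be written
    data_share: frozenset          # data that may be shared
    needs_confirmation: frozenset  # operations that need an explicit yes
    on_out_of_scope: OutOfScope = OutOfScope.BLOCK  # assumed default


# Hand-written scope for the request "Summarize my email."
scope = TaskScope(
    objective="Summarize unread email",
    services=frozenset({"mail"}),
    resources=frozenset({"inbox"}),
    actions=frozenset({"list_threads", "read_thread"}),
    data_read=frozenset({"unread_threads"}),
    data_write=frozenset(),
    data_share=frozenset(),
    needs_confirmation=frozenset(),
    on_out_of_scope=OutOfScope.ASK,
)

print("send_reply" in scope.actions)
```

Nothing in this scope mentions sending, so a reply is outside it by construction rather than by instruction.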

This is not the model writing its own permissions. It is a narrow step that runs once, on trusted input, and produces a policy the rest of the system holds the agent to. The agent can still be clever inside the box. It just doesn't get to redraw the box because a webpage suggested it should.

Enforce outside the model

A scope that lives in the prompt is a suggestion. The model can be convinced to ignore it, because anything written in tokens can be overwritten by other tokens. Enforcement has to sit outside the reasoning loop, close to the real capability - at the layer where the action actually happens.

In practice the check lives at the boundary the agent has to cross to do anything real: enterprise connectors, file systems, email gateways, code repositories, cloud APIs, messaging systems, payment flows, credential stores, browser APIs, and local tools. The model proposes an action; the enforcement layer compares it to the scope and lets it through, asks, or refuses. That decision does not depend on the model being in a good mood or the context being clean.
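The propose-then-check flow can be sketched in a few lines of Python. This is a minimal illustration, not a reference design: `Scope`, `decide`, `Gate`, the action names, and the `ask_user` callback are all hypothetical, and a real deployment would place the check inside the connector or gateway itself.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    ASK = "ask"
    BLOCK = "block"


@dataclass(frozen=True)
class Scope:
    allowed: frozenset   # actions that may run without asking
    confirm: frozenset   # actions that need an explicit user yes


def decide(scope, action_name):
    """Deterministic check: depends only on the scope and the proposed action."""
    if action_name in scope.confirm:
        return Decision.ASK
    if action_name in scope.allowed:
        return Decision.ALLOW
    return Decision.BLOCK


class Gate:
    """Wraps real capabilities so every call crosses the check first."""

    def __init__(self, scope, ask_user):
        self._scope = scope
        self._ask_user = ask_user  # callable(action_name) -> bool

    def execute(self, action_name, capability, *args):
        decision = decide(self._scope, action_name)
        if decision is Decision.BLOCK:
            raise PermissionError(f"{action_name}: outside task scope")
        if decision is Decision.ASK and not self._ask_user(action_name):
            raise PermissionError(f"{action_name}: user declined")
        return capability(*args)


scope = Scope(allowed=frozenset({"search_products"}),
              confirm=frozenset({"complete_payment"}))
gate = Gate(scope, ask_user=lambda name: False)  # stand-in: the user says no

print(gate.execute("search_products", lambda q: f"results for {q}", "m3 bolt"))
for name in ("complete_payment", "email_cart"):
    try:
        gate.execute(name, lambda: "done")
    except PermissionError as err:
        print(err)
```

The model never sees `decide`; it only sees whether its proposed call went through, which is why persuading the model does not change the outcome.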

What this looks like

Decision diagram showing three actions evaluated against a task scope: searching and comparing products is allowed, completing payment requires confirmation, and emailing the cart to an unknown address is blocked.
The same task scope can allow low-risk actions, require confirmation for sensitive actions, and block actions suggested by untrusted context.

A shopping agent can search, compare, and fill a cart freely - low-risk and reversible. Payment is different: it needs confirmation and must respect the constraints from the original request, like the price ceiling and the seller. A page that says "check out to see the price" does not get to move money, because moving money was never in scope without a human yes.
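For the payment case, the constraints carried over from the original request can be checked as plain values. The sketch below assumes the request was "find a part under a hundred dollars and order it"; `PaymentConstraints`, `PaymentRequest`, `check_payment`, and the seller name are hypothetical.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class PaymentConstraints:
    max_total: Decimal          # exclusive ceiling: "under a hundred dollars"
    allowed_sellers: frozenset


@dataclass(frozen=True)
class PaymentRequest:
    seller: str
    total: Decimal


def check_payment(constraints, request, user_confirmed):
    """Return (ok, reason). Every condition must hold before money moves."""
    if request.seller not in constraints.allowed_sellers:
        return False, "seller not in scope"
    if request.total >= constraints.max_total:
        return False, "total at or above the price ceiling"
    if not user_confirmed:
        return False, "needs user confirmation"
    return True, "ok"


constraints = PaymentConstraints(
    max_total=Decimal("100.00"),
    allowed_sellers=frozenset({"parts.example"}),
)

within = PaymentRequest(seller="parts.example", total=Decimal("84.50"))
too_high = PaymentRequest(seller="parts.example", total=Decimal("120.00"))

print(check_payment(constraints, within, user_confirmed=False))
print(check_payment(constraints, within, user_confirmed=True))
print(check_payment(constraints, too_high, user_confirmed=True))
```

Note that confirmation alone is not enough in this sketch: a confirmed payment above the ceiling still fails, because the ceiling came from the user's request and the page has no say in it.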

An email agent can read and summarize the threads relevant to its task. It cannot send replies, forward attachments, or wander into unrelated private threads unless the scope allowed it. "Summarize my unread mail" does not authorize "reply to my boss," however naturally one follows the other.

A coding agent can inspect the repo and run tests all day. Pushing commits, rotating CI secrets, and calling external services stay closed unless the scope opened them. A TODO that reads "also push this" is not the user asking for it.
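For a coding agent, one possible place for the check is the point where a proposed command is about to run. The Python sketch below is intentionally conservative and incomplete: the allowlists, the `allow_push` flag, and the function name are assumptions, and a real check would need more care (for example, with git options that write files).

```python
import shlex

READ_ONLY_GIT = {"status", "diff", "log", "show"}   # assumed inspect-only subcommands
TEST_RUNNERS = {"pytest"}                           # assumed test command
SHELL_METACHARS = set(";&|`$<>\n")


def command_in_scope(command, allow_push=False):
    """Return True if the command may run under an inspect-and-test scope."""
    if SHELL_METACHARS & set(command):
        return False                 # refuse chaining, redirection, substitution
    try:
        argv = shlex.split(command)
    except ValueError:
        return False                 # unbalanced quotes
    if not argv:
        return False
    if argv[0] in TEST_RUNNERS:
        return True
    if argv[0] == "git" and len(argv) > 1:
        if argv[1] in READ_ONLY_GIT:
            return True
        if argv[1] == "push":
            return allow_push        # closed unless the scope opened it
    return False                     # anything unlisted stays closed


for cmd in ("pytest -q tests/", "git diff HEAD~1", "git push origin main",
            "pytest && git push", "curl https://example.com"):
    print(cmd, "->", command_in_scope(cmd))
```

The `allow_push` flag is set from the task scope, not from anything the agent reads in the repository, so a TODO comment cannot flip it.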

A browser agent is one good place to put this, since the browser already sees origin, navigation, downloads, and credential use - a natural spot for a deterministic check. But it is one home for the pattern, not the pattern itself. Enterprise, workflow, cloud, and OS-level agents all need the same separation.

The line that matters

The LLM can remain probabilistic. The authority around it must become deterministic.

That is the whole shift. We don't need the model to be perfectly obedient or perfectly resistant to manipulation - we won't get either. We need its authority defined somewhere it cannot rewrite, and checked somewhere it cannot talk past.

Where this leaves us

Prompt-based safety is necessary, but it is not enough. Telling a model to behave, refusing obvious abuse, filtering inputs - keep all of it. But that is guidance for a system that predicts, and prediction is not a security boundary. You cannot make a probabilistic component trustworthy by asking it more firmly.

The next step in agent security is to separate reasoning from authority. Let the model decide what to try next. Let a deterministic layer, built from the user's original request and enforced outside the model, decide what the agent is allowed to do. This does not solve AI safety. It removes one specific, common failure: a probabilistic system holding the keys to privileged, irreversible actions and free to hand them out whenever the context asks nicely.

Give agents boundaries they cannot reason their way around. The part that needs control is not the reasoning. It is the authority attached to it.