We audited dozens of enterprise AI deployments in Q4. Teams relying solely on native model safety failed to stop 90% of our agentic attacks. Here's what a proper defense architecture looks like.

The definition of "AI security" has been shifting since the first LLMs hit production. When we started Promptention, most conversations about securing AI were about jailbreaks—getting a model to say something it shouldn't. That's still a real problem. But it's no longer the primary one.

The frontier has moved to agents: systems that take actions, call tools, write to memory, and operate across multi-step pipelines with minimal human oversight. Attacking an agent means attacking a process, not just a conversation. The blast radius is larger, because a successful attack can trigger real actions rather than just unwanted text.

What the Audits Showed

We audited dozens of enterprise AI deployments in Q4. The pattern was consistent: teams that relied solely on native model safety, expecting GPT-4, GPT-5, or Claude to protect itself, failed to stop 90% of our agentic attacks.

Native safety mechanisms are built for conversational misuse. They are not built for adversarial agent pipelines where:

Tools can be manipulated through injected inputs
Memory can be poisoned across sessions
Loops can be exhausted to cause denial-of-service
Indirect injections arrive through documents, databases, or web content the agent retrieves

Expecting the model itself to catch all of this is not a security strategy. It's a liability.

What Proper Defense Looks Like

Security in production AI requires layering. No single mechanism catches everything. Here's the architecture that holds up:

Layer 1 — Agentic Defense

Protecting the agent's decision-making process and execution context.

Tool control — restrict which tools the agent can invoke and under what conditions
Memory integrity — validate that stored context hasn't been tampered with between sessions
Loop exhaustion checks — detect and break runaway chains that indicate adversarial manipulation or bugs
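
As a rough illustration of the three controls above, here is a minimal Python sketch. Everything in it is hypothetical and not taken from any particular product: the `AgentGuard` class, the `ToolCallDenied` exception, the step budget, and the choice of an HMAC over JSON-serialized memory records are all assumptions made for the example. A real deployment would also need per-tool argument checks and proper key management.

```python
import hashlib
import hmac
import json


class ToolCallDenied(Exception):
    """Raised when a tool call violates the (hypothetical) agent policy."""


class AgentGuard:
    """Illustrative guard: tool allowlist, step budget, signed memory records."""

    def __init__(self, allowed_tools, max_steps, memory_key):
        self.allowed_tools = set(allowed_tools)
        self.max_steps = max_steps
        self.memory_key = memory_key  # bytes; assumed to come from a secret store
        self.steps = 0

    def check_tool_call(self, tool_name):
        # Loop exhaustion check: every attempted call counts against the budget.
        self.steps += 1
        if self.steps > self.max_steps:
            raise ToolCallDenied(f"step budget of {self.max_steps} exceeded")
        # Tool control: only explicitly allowlisted tools may run.
        if tool_name not in self.allowed_tools:
            raise ToolCallDenied(f"tool {tool_name!r} is not on the allowlist")

    def sign_memory(self, record):
        # Memory integrity: sign a canonical JSON form of the record.
        payload = json.dumps(record, sort_keys=True).encode("utf-8")
        return hmac.new(self.memory_key, payload, hashlib.sha256).hexdigest()

    def verify_memory(self, record, signature):
        # Constant-time comparison avoids leaking signature prefixes.
        return hmac.compare_digest(self.sign_memory(record), signature)
```

In this sketch the agent loop would call `check_tool_call` before each tool invocation and `verify_memory` when loading stored context at the start of a session, discarding any record whose signature no longer matches.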

Layer 2 — Input Control

Everything entering the model's context is treated as potentially adversarial.

Strict input validation — structural and semantic checks before anything reaches the model
Indirect injection scanning — content retrieved from external sources (documents, search results, emails) gets scanned before it enters the context window
Multimodal analysis — text is not the only attack surface; images, PDFs, and other formats need coverage too
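
To make the first two checks above concrete, here is a small Python sketch of a structural validator and a pattern-based scanner for retrieved content. The function names, the length limit, and the pattern list are assumptions chosen for illustration. Regex heuristics like these are easy to evade and would only be one signal among several in practice.

```python
import re

# Hypothetical phrases that often show up in injection attempts.
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.I),
    re.compile(r"disregard (the |your )?(system|previous) (prompt|instructions)", re.I),
    re.compile(r"reveal (the |your )?system prompt", re.I),
]


def validate_structure(text, max_chars=8000):
    """Structural check: bounded length, no control characters except whitespace."""
    if len(text) > max_chars:
        return False
    return not any(ch < " " and ch not in "\t\n\r" for ch in text)


def scan_retrieved_content(text):
    """Return the patterns that matched; an empty list means nothing was flagged."""
    return [p.pattern for p in SUSPICIOUS_PATTERNS if p.search(text)]
```

A retrieval step could run both functions on each document, search result, or email and quarantine anything that fails before it is added to the context window.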

Layer 3 — Data Integrity

The model's outputs and the data it handles require their own protection layer.

PII redaction — sensitive data gets stripped before it leaves the system or gets stored
Output blocking — responses are checked before they reach the end user or trigger downstream actions
Hallucination guardrails — for high-stakes use cases, outputs get verified against source material before they're acted on
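
The redaction and output-blocking steps above might look something like the following Python sketch. The regular expressions, the placeholder tokens, and the blocked markers are illustrative assumptions. Production PII detection typically needs far broader coverage than two patterns.

```python
import re

# Deliberately simple patterns: an email address and a US SSN-style number.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact_pii(text):
    """Replace matched PII with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return SSN_RE.sub("[SSN]", text)


def check_output(text, blocked_markers=("BEGIN PRIVATE KEY", "api_key=")):
    """Return None to block the response, otherwise the redacted text."""
    lowered = text.lower()
    if any(marker.lower() in lowered for marker in blocked_markers):
        return None
    return redact_pii(text)
```

Here `check_output` would sit between the model and anything downstream, so that a blocked response never reaches the end user or triggers an action.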

Layer 4 — Verification

Security isn't static. The systems need continuous review.

Supply chain audit — understand what models, plugins, and third-party components are in the stack and whether they've been vetted
Continuous red teaming — the threat landscape evolves; your testing needs to evolve with it
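
One narrow piece of a supply chain audit, checking that components match a vetted manifest, can be sketched in Python as follows. The manifest format (component name mapped to a SHA-256 hex digest) and the function names are assumptions for the example. A fuller audit would also cover provenance, licenses, and dependency trees.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Hex SHA-256 digest of a component's bytes."""
    return hashlib.sha256(data).hexdigest()


def audit_components(components, pinned):
    """Compare components against pinned digests.

    components: dict of name -> bytes (model weights, plugin code, ...)
    pinned:     dict of name -> expected hex digest from a vetted manifest
    Returns a list of human-readable findings; empty means all matched.
    """
    findings = []
    for name, data in components.items():
        expected = pinned.get(name)
        if expected is None:
            findings.append(f"{name}: not in the vetted manifest")
        elif sha256_of(data) != expected:
            findings.append(f"{name}: digest mismatch")
    return findings
```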

The Practical Takeaway

Native model safety catches some things. It doesn't catch:

Indirect prompt injections embedded in retrieved content
Adversarial tool manipulation
Memory poisoning across sessions
Encoding-level attacks that bypass the model's own filters
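
As one example of handling encoding-level tricks outside the model, the Python sketch below canonicalizes text before any other check runs. It assumes two specific evasions, compatibility characters such as fullwidth letters and zero-width characters inserted inside words. Other encodings (base64, homoglyphs from other scripts) would need their own handling, and stripping zero-width joiners can alter legitimate text such as some emoji sequences.

```python
import unicodedata

# Zero-width and invisible characters commonly used to split keywords.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def canonicalize(text):
    """Fold compatibility forms with NFKC, then drop zero-width characters."""
    text = unicodedata.normalize("NFKC", text)
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)
```

Running pattern or policy checks on the canonical form rather than the raw input means a phrase hidden behind fullwidth letters or invisible separators is compared in its plain form.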

A defense-in-depth architecture doesn't assume the model will protect itself. It wraps the model in layers that operate independently, at different levels of the stack, with different detection mechanisms.

That's what makes it defensible when something actually gets through.
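
The idea of independent layers can be sketched as a small Python runner. This is an illustration only: the `Finding` type, the `run_layers` function, and the convention that a check returns a reason string or `None` are assumptions for the example. The point it shows is that every layer runs regardless of what the others report, and that a layer which crashes is treated as a finding rather than a pass.

```python
from typing import Callable, List, NamedTuple, Optional, Tuple


class Finding(NamedTuple):
    layer: str
    reason: str


# A check returns a reason string when it objects, or None when it passes.
Check = Callable[[str], Optional[str]]


def run_layers(text: str, layers: List[Tuple[str, Check]]) -> List[Finding]:
    """Run every layer independently and collect all findings."""
    findings = []
    for name, check in layers:
        try:
            reason = check(text)
        except Exception as exc:
            # Fail closed: a broken check counts as a finding, not a pass.
            reason = f"check failed: {exc}"
        if reason is not None:
            findings.append(Finding(name, reason))
    return findings
```

A caller would typically block or escalate whenever the returned list is non-empty, and log which layer raised each finding.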

Promptention's platform implements this architecture across agentic defense, input control, data integrity, and continuous verification. Built for production environments, not demos.

Defense-in-Depth for AI: Why Native Model Safety Isn't Enough
