We audited dozens of enterprise AI deployments in Q4. Teams relying solely on native model safety failed to stop 90% of our agentic attacks. Here's what a proper defense architecture looks like.
The definition of "AI security" has been shifting since the first LLMs hit production. When we started Promptention, most conversations about securing AI were about jailbreaks—getting a model to say something it shouldn't. That's still a real problem. But it's no longer the primary one.
The frontier has moved to agents: systems that take actions, call tools, write to memory, and operate across multi-step pipelines with minimal human oversight. Attacking an agent means attacking a process, not just a conversation. The blast radius is larger, because a compromised agent can act through its tools and memory rather than just produce bad text.
What the Audits Showed
In the Q4 audits, the pattern was consistent: teams that relied solely on native model safety, expecting a model such as GPT-4, GPT-5, or Claude to protect itself, failed to stop 90% of our agentic attacks.
Native safety mechanisms are built for conversational misuse. They are not built for adversarial agent pipelines where:
- Tools can be manipulated through injected inputs
- Memory can be poisoned across sessions
- Agent loops can be driven into runaway execution, exhausting resources and causing denial of service
- Indirect injections arrive through documents, databases, or web content the agent retrieves
Expecting the model itself to catch all of this is not a security strategy. It's a liability.
What Proper Defense Looks Like
Security in production AI requires layering. No single mechanism catches everything. Here's the architecture that holds up:
Layer 1 — Agentic Defense
Protecting the agent's decision-making process and execution context.
- Tool control — restrict which tools the agent can invoke and under what conditions
- Memory integrity — validate that stored context hasn't been tampered with between sessions
- Loop exhaustion checks — detect and break runaway chains that indicate adversarial manipulation or bugs
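The three controls above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Promptention's implementation: the tool names, the `TOOL_POLICY` table, the `MAX_STEPS` value, and every function name here are hypothetical. Memory integrity is shown with an HMAC tag, which is one possible technique for detecting tampering between sessions.

```python
import hashlib
import hmac
import json

# Hypothetical policy: tool name -> condition over the call arguments.
TOOL_POLICY = {
    "search_docs": lambda args: len(args.get("query", "")) <= 256,
    "send_email": lambda args: args.get("to", "").endswith("@example.com"),
}

MAX_STEPS = 8  # assumed per-task budget; tune per workload


class PolicyViolation(Exception):
    """Raised when a guard refuses to let the agent continue."""


def authorize_tool_call(name, args):
    """Allow a call only if the tool is listed and its condition holds."""
    check = TOOL_POLICY.get(name)
    if check is None or not check(args):
        raise PolicyViolation(f"tool call rejected: {name}")


class StepBudget:
    """Counts agent steps and aborts once the budget is spent."""

    def __init__(self, limit=MAX_STEPS):
        self.limit = limit
        self.used = 0

    def tick(self):
        self.used += 1
        if self.used > self.limit:
            raise PolicyViolation("step budget exhausted")


def seal_memory(record, key):
    """Serialize a memory record and attach an HMAC-SHA256 tag."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    tag = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return {"payload": payload.decode("utf-8"), "tag": tag}


def open_memory(sealed, key):
    """Return the record only if its tag still matches the payload."""
    payload = sealed["payload"].encode("utf-8")
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, sealed["tag"]):
        raise PolicyViolation("memory record failed integrity check")
    return json.loads(payload)
```

In an agent loop, `StepBudget.tick()` would run once per iteration and `authorize_tool_call` before every tool invocation. Note that an HMAC tag only detects changes made outside the signing path. It does not help if poisoned content was written through the legitimate memory API, which is why input scanning is still needed.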
Layer 2 — Input Control
Everything entering the model's context is treated as potentially adversarial.
- Strict input validation — structural and semantic checks before anything reaches the model
- Indirect injection scanning — content retrieved from external sources (documents, search results, emails) gets scanned before it enters the context window
- Multimodal analysis — text is not the only attack surface; images, PDFs, and other formats need coverage too
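As a rough illustration of indirect injection scanning, the sketch below normalizes retrieved text and checks it against a short pattern list before it would enter the context window. The patterns and the names `SUSPICIOUS_PATTERNS`, `normalize`, and `scan_retrieved` are assumptions made for this example. A regex list is a placeholder for a real detector, and it covers text only, not images or PDFs.

```python
import re
import unicodedata

# Illustrative patterns only; a production scanner would need a maintained
# detector rather than a short regex list.
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.I),
    re.compile(r"you are now\b", re.I),
    re.compile(r"reveal (the |your )?system prompt", re.I),
]

# Zero-width and BOM characters sometimes used to split trigger phrases.
INVISIBLE = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))


def normalize(text):
    """Fold compatibility forms (NFKC), then drop invisible characters."""
    return unicodedata.normalize("NFKC", text).translate(INVISIBLE)


def scan_retrieved(text):
    """Return the patterns that match the normalized text (empty if clean)."""
    cleaned = normalize(text)
    return [p.pattern for p in SUSPICIOUS_PATTERNS if p.search(cleaned)]
```

Normalizing before matching is the important design choice here: without it, a phrase written in fullwidth characters or split by a zero-width space would pass a naive string match.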
Layer 3 — Data Integrity
The model's outputs and the data it handles require their own protection layer.
- PII redaction — sensitive data gets stripped before it leaves the system or gets stored
- Output blocking — responses are checked before they reach the end user or trigger downstream actions
- Hallucination guardrails — for high-stakes use cases, outputs get verified against source material before they're acted on
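A minimal sketch of the first two items, PII redaction and output blocking, might look like the following. The two regexes (email addresses and US-style Social Security numbers) and the `BLOCKED_MARKERS` strings are assumptions chosen for illustration. Real PII detection needs far broader coverage than this.

```python
import re

# Assumed, deliberately narrow patterns: email addresses and US-style SSNs.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

# Hypothetical markers that should never appear in a released response.
BLOCKED_MARKERS = ("BEGIN PRIVATE KEY", "INTERNAL ONLY")


def redact_pii(text):
    """Replace each PII match with a bracketed label such as [EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


def release_output(text):
    """Return redacted text, or None if the response must be blocked."""
    if any(marker in text for marker in BLOCKED_MARKERS):
        return None
    return redact_pii(text)
```

The caller treats `None` as "do not deliver and do not trigger downstream actions", so blocking happens before redaction and never leaks a partially cleaned response.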
Layer 4 — Verification
Security isn't static. The systems need continuous review.
- Supply chain audit — understand what models, plugins, and third-party components are in the stack and whether they've been vetted
- Continuous red teaming — the threat landscape evolves; your testing needs to evolve with it
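One concrete piece of a supply chain audit is checking that the model files and plugins on disk are the ones that were vetted. The sketch below compares SHA-256 digests against a pinned manifest. The manifest format (file name mapped to hex digest) and the function names are assumptions for this example.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large model weights are not loaded at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit_artifacts(manifest, root):
    """Return names of artifacts that are missing or whose hash differs."""
    failures = []
    for name, expected in manifest.items():
        path = Path(root) / name
        if not path.is_file() or sha256_of(path) != expected:
            failures.append(name)
    return failures
```

An empty result means every pinned artifact is present and unchanged. This says nothing about whether the vetted version was safe in the first place, which is the part red teaming has to cover.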
The Practical Takeaway
Native model safety catches some things. It doesn't reliably catch:
- Indirect prompt injections embedded in retrieved content
- Adversarial tool manipulation
- Memory poisoning across sessions
- Encoding-level attacks that bypass the model's own filters
A defense-in-depth architecture doesn't assume the model will protect itself. It wraps the model in layers that operate independently, at different levels of the stack, with different detection mechanisms.
That's what makes it defensible when something actually gets through.
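To make "layers that operate independently" concrete, here is one way the composition could be sketched. The `Verdict` type, the `run_layers` helper, and the two example checks are hypothetical. The point is only that every layer runs and reports on its own, and that a crashing check counts as a failure instead of silently passing.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Verdict:
    layer: str
    allowed: bool
    reason: str = ""


# Each layer is a hypothetical callable: text -> Verdict.
Layer = Callable[[str], Verdict]


def run_layers(text: str, layers: List[Layer]) -> List[Verdict]:
    """Run every layer, even after a failure, so each reports independently."""
    verdicts = []
    for layer in layers:
        try:
            verdicts.append(layer(text))
        except Exception as exc:  # a crashing check fails closed
            name = getattr(layer, "__name__", "layer")
            verdicts.append(Verdict(name, False, repr(exc)))
    return verdicts


def is_allowed(verdicts: List[Verdict]) -> bool:
    """Content passes only if every layer allowed it."""
    return all(v.allowed for v in verdicts)


# Two toy layers, used only to show the shape of a check.
def max_length(text: str) -> Verdict:
    return Verdict("max_length", len(text) <= 2000, "length check")


def no_secret_marker(text: str) -> Verdict:
    return Verdict("no_secret_marker", "BEGIN PRIVATE KEY" not in text, "marker check")
```

Collecting all verdicts instead of stopping at the first failure costs a little latency, but it leaves a record of which layers caught an attack and which missed it.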
Promptention's platform implements this architecture across agentic defense, input control, data integrity, and continuous verification. Built for production environments, not demos.



