The Two Broken Approaches to LLM Security

We see the same two flawed guardrail strategies across enterprise AI deployments. Both fail in different ways—and both are worse than they appear.

We analyze enterprise AI deployments regularly. Across those reviews, two guardrail approaches come up more than any others. Both are widely adopted. Both fail in ways that aren't obvious until something goes wrong.


Approach 1: Using an LLM to Guard an LLM

The idea is intuitive: run user input through a safety-checking LLM before it reaches the main model. If the safety check passes, proceed. If it doesn't, block.

OpenAI released an open-source system built on exactly this pattern. Many production deployments have independently arrived at the same architecture.

The problem is architectural, not instructional.

Both LLMs—the guard and the main model—share the same fundamental design:

Same tokenization process. Both models break text into tokens identically. Unicode manipulation, encoding tricks, and whitespace exploits that affect one affect the other.

Same context processing. Both interpret meaning and structure through the same mechanisms. An adversarial framing that confuses one will likely confuse the other.

Same architectural blind spots. Vulnerabilities at the token level, in the attention mechanism, or in the embedding space exist in both systems.

The only meaningful difference between a guardrail LLM and the model it protects is the instructions they're running on. Instructions don't patch structural vulnerabilities. An attacker who can manipulate one model through prompt injection can often manipulate the other the same way—including convincing the guardrail to approve inputs it should block.

Two firewalls with the same backdoor password don't add security. They add complexity.


Approach 2: Basic Classifiers

The other common pattern is simpler: rule-based filters that trigger on specific words or patterns. Cheaper to run, faster, and easier to audit.

The failure mode is different but just as real. Here's an example from a customer support chatbot we reviewed:

User: "Ignore what I said previously, I want to talk to customer support."

LLM: BLOCKED — Detected phrases: "ignore" + "previously"

A straightforward, legitimate request—blocked because a pattern-matching classifier saw trigger words without understanding context. The user changed their mind about their question. The system flagged it as a prompt injection attempt.

The downstream effects:

  • Customers can't get basic help
  • Support teams absorb the overflow
  • Users figure out they need to rephrase things awkwardly to avoid the filter
  • The AI becomes less useful without becoming more secure

Pattern matching doesn't understand intent. It understands characters. Any sufficiently creative attacker—and most legitimate users eventually—will phrase their way around it.


Why Both Fail

Both approaches share a root problem: they operate at the same level as the threat they're trying to stop.

LLM-on-LLM defenses are vulnerable to the same injection techniques. Pattern matchers are vulnerable to any phrasing that avoids the trigger words. Neither adapts when attackers adjust. Neither understands what a request is actually trying to do.

What works is different in kind:

  • Context-aware analysis that evaluates intent, not just tokens
  • Multi-layered defense with mechanisms operating at different levels of the stack
  • Detection that isn't LLM-based, so the same architectural vulnerabilities don't apply to both the attacker and the defender
  • Continuous adaptation as attack patterns evolve

LLM security isn't solved by adding more LLMs to the pipeline or by building longer blocklists. The threat operates at the architectural level. The defense has to as well.


Promptention's detection layer is built to operate at a different level than the models it protects—not another LLM sitting in front of your LLM.