Succinctly Verifiable Agentic Guardrails With ZKP Over Automated Reasoning - or how to prevent the Agentic Commerce mega-hack.
Agents often have loosely defined guardrails that depend on ‘observability’ & ‘reputation’ and ultimately on a human in the loop. ICME's Succinctly Verifiable Automated Reasoning allows for guardrails that deliver powerful protection. Use AI guardrails based on math, not trust.
There's a fundamental problem with AI security today. The tools we use were never designed for the world we're rapidly entering: a world where AI agents don't just assist humans but transact autonomously with each other at machine speed, likely with no human in the loop. A world of 'Agentic Commerce' where there's no one to catch what the filters miss or to investigate long traces when the dashboard flags an issue. The many guardrails that are reactive, 'something went wrong, let's take a look', turn into 'Cool, but who's looking?' And if that something was triggered by a prompt injection attack, do you trust your agent to do the right thing with your bank account, all on its own?
AI security today relies on a three-pillar approach: guardrails, observability, and policy enforcement. Companies deploy content filters to block harmful outputs, implement real-time monitoring to track agent behavior, and use LLM-based judges to detect jailbreaks and prompt injections. Users trust provider reputations and hope for the best. A rock-solid strategy "/s".
LLM-based judges, without mathematical verification, can cause more problems than they solve. Security researchers call this the "same model, different hat" problem. When LLMs are used to both generate responses and evaluate their safety, both components inherit the same weaknesses, allowing coordinated bypasses through prompt injection attacks. Buy one vulnerability, get one free!
If an attacker can trick the main AI agent through a carefully crafted prompt, they can often use the same technique to trick the guardrail that's supposed to be watching it; because both are fundamentally language models vulnerable to the same manipulation tactics.
LLM-based judges validate outputs against rules, but they operate after an LLM has generated content, and if that generation process was compromised, the check may be validating a sophisticated attack rather than catching it.
So what should we do?
Advanced systems like AWS AgentCore provide continuous evaluations with pre-built evaluators covering correctness, helpfulness, and safety, running against live interactions and raising alerts when metrics drop. These systems take guardrail policies written in natural language and convert them into enforceable formal logic. With this, Amazon's Bedrock Guardrails claims to block up to 88% of harmful content, while AWS's Automated Reasoning checks deliver up to 99% verification accuracy in detecting AI hallucinations and enforcing guardrail consistency. These are impressive achievements for human-supervised AI systems, where anomalies can be flagged and investigated and where the proofs themselves don't need to be verified; more on this below.
Let's take Automated Reasoning as an example of how far we've come, and where the limits are. The system uses mathematical logic and formal verification techniques to validate accuracy, providing definitive rules and parameters against which AI responses are checked. Unlike probabilistic reasoning methods that deal with uncertainty, Automated Reasoning translates natural language policies into formal logic consisting of rules, variables, and types. For a mortgage approval system, it can ensure an AI agent never approves a loan for someone with a credit score below 680 or less than 20% down payment, catching hallucinations before they become costly mistakes.
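To make that concrete, here is a minimal sketch in Python of what the mortgage policy above might look like once translated into deterministic rules. The 680 credit score and 20% down payment thresholds come from the example; the names and structure are illustrative, not any vendor's actual policy format.

```python
from dataclasses import dataclass

# Illustrative only: a hand-rolled rendering of the mortgage policy above as
# formal, deterministic rules. Real Automated Reasoning systems compile
# natural-language policy into a logic their solver understands; this just
# shows the shape of the result.

@dataclass
class LoanDecision:
    credit_score: int
    down_payment_pct: float
    approved: bool

def violates_policy(d: LoanDecision) -> list[str]:
    """Return every rule the proposed decision breaks (empty list = compliant)."""
    violations = []
    if d.approved and d.credit_score < 680:
        violations.append("approved with credit score below 680")
    if d.approved and d.down_payment_pct < 0.20:
        violations.append("approved with less than 20% down payment")
    return violations

# An AI-drafted approval gets checked before it leaves the system.
print(violates_policy(LoanDecision(credit_score=655, down_payment_pct=0.25, approved=True)))
# ['approved with credit score below 680']
```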
But here's what it can't do: prove that the AI agent actually used the policy you think it used, or verify which model executed the decision. Automated Reasoning also assumes the policies are public, and the resulting proofs can be large, whereas we want to verify all of the above in sub-second time, on constrained environments and devices. You know, where all agentic commerce is actually going to happen.
Humans need to provide clear intent.
Consider a real attack scenario: Your company sets a policy that no single transaction should exceed 100 USD. An indirect prompt injection hidden inside a normal-looking document that the AI agent fetches instructs the agent to break a 1,000 USD transfer into one hundred separate 10 USD transactions. Tedious for a human, easy for a machine. Each individual transaction is checked by Automated Reasoning, which correctly validates that 10 is less than 100 and approves it. It gets a gold star! The formal logic worked perfectly, but the agent was manipulated into circumventing the policy's intent through behavior the rules didn't anticipate. By the time your monitoring flags the unusual pattern of 100 micro-transactions, the money is gone.
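A toy sketch of the gap, assuming nothing beyond the numbers above: the per-transaction check approves every one of the hundred 10 USD transfers, and the aggregate walks out the door anyway.

```python
# Every micro-transaction satisfies the per-transaction rule, so a checker
# that only sees single transactions approves all of them, even though the
# aggregate violates the policy's intent.

PER_TX_LIMIT = 100  # USD, the stated policy

def per_tx_check(amount: float) -> bool:
    return amount <= PER_TX_LIMIT

attack = [10.0] * 100                          # the injected instruction: 100 x 10 USD
assert all(per_tx_check(a) for a in attack)    # every single check passes
print(sum(attack))                             # 1000.0 USD leaves the account anyway
```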
New World, New Zero-Days
The problem isn't just that guardrails can be bypassed, it's what happens when they're bypassed at machine speed in an autonomous commerce system.
In traditional cybersecurity, a zero-day vulnerability is bad, but there are natural friction points that limit the damage. A human needs to click or download something. Computers don't have bank account access by default. Transactions need manual approval... the slow kind. Bank fraud detection systems can flag unusual patterns. There's time to react, investigate, and shut things down. Hours, maybe days, but time.
In agentic commerce, that time doesn't exist.
A vulnerability that might steal 1,000 USD in a human-speed system steals 10 million in a machine-speed system. Not because the vulnerability is worse, but because the exploit executes thousands of times faster than humans can respond. Continuous evaluations can trigger alerts when metrics drop, but alerts are reactive: by the time they fire, thousands more compromised transactions have already executed.
The numbers are brutal: If your agent can execute 1,000 transactions per second and your incident response team needs 60 seconds to investigate an alert, that's 60,000 potentially fraudulent transactions before you even understand what's happening. And that assumes your team is monitoring 24/7, immediately sees the alert, and instantly understands it's not a false positive.
Machine speed in commerce enables machine-scale theft. And current guardrails, designed for human-supervised systems, mostly without money on the line, simply cannot keep up.
If we always need a human in the loop to catch what automated systems might miss, we completely defeat the benefits and purpose of agent-to-agent commerce.
Fortunately... math works.
But humans still need to make good rules.
Humans still need to make good rules!!
Humans still need to make all of the good rules?!!!
Narrator: They will never be able to make all of the rules.
But once the rules and models are set, math and cryptography should carry us the rest of the way.
After all, there's a reason the global financial system processes trillions of dollars daily without requiring humans to verify each transaction: cryptography provides mathematical certainty that trust and observation cannot.
When Visa processes a contactless payment, it doesn't trust that the card is legitimate, it cryptographically verifies it. When your bank confirms a wire transfer, it doesn't hope the amount wasn't tampered with, it proves it mathematically. When blockchain networks settle transactions, they don't rely on reputation (in most cases anyway..), they use cryptographic proofs that make fraud computationally infeasible.
Cryptography turns trust problems into math problems. And math doesn't care about prompt injection, social engineering, or sophisticated attacks. Either the proof is valid or it isn't. There's no middle ground, no probabilistic confidence score, no "88% blocked."
Agents can also make rules.
The same AI agents that need guardrails can help humans design better guardrails and protect against attack. Or at least one can be specially trained to prevent all of the edge-cases, injection attacks, and known or reported vulnerabilities; making new rules in a comprehensive Automated Reasoning model for Agentic Commerce (ICME's Argus Codex). Yes, we named it after Argus Panoptes — the mythical 100-eyed giant who never fully slept because some eyes were always watching. Seemed appropriate for 24/7 global threat monitoring. When anyone anywhere in the world reports a new attack, Argus updates policies to protect on-the-fly.
Automated Reasoning systems (which often also use an LLM for the translation step into formal logic) can automatically generate test scenarios from policy definitions, making coverage more comprehensive. Humans set the intent (“prevent large unauthorized transfers”), but agents systematically explore edge cases: currency conversion, transaction fees, split payments, aggregate daily limits. Humans are good at high-level policy; agents excel at exhaustive enumeration.
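As a rough sketch of what that enumeration could look like (the dimensions and values are hypothetical, not any product's API), an agent can sweep combinations of the edge cases above and hand them to the rule set as test scenarios:

```python
from itertools import product

# Toy version of "exhaustive enumeration": from one sentence of human intent
# ("no transfer over 100 USD"), sweep the edge-case dimensions listed above
# and emit scenarios for the rule set to be tested against.

base_amounts = [99.99, 100.00, 100.01]   # straddle the limit
fees = [0.0, 2.50]                        # does the limit apply before or after fees?
fx_rates = [1.0, 0.92]                    # EUR-denominated request, USD-denominated rule
splits = [1, 2, 100]                      # single payment vs. split payments

test_cases = [
    {"amount": a, "fee": f, "fx": r, "parts": s}
    for a, f, r, s in product(base_amounts, fees, fx_rates, splits)
]
print(len(test_cases), "scenarios generated from one sentence of intent")  # 36
```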
Even better, agents can continuously audit rules against real transaction patterns; 24/7, 365.2422 days a year... A monitoring agent reviews cryptographic proofs from legitimate transactions and makes suggestions for improvement.
"15% of valid business transactions are 100-500 USD invoice payments. Current policy blocks legitimate commerce while being vulnerable to this new attack." 🤖
The agent proposes a refined rule set for an updated policy version: the policy improves, legitimate payments go through, and coverage extends to new attack vectors.
Here is the feedback loop:
- Humans design intent, agents generate comprehensive test cases
- Agents analyze attack patterns, find gaps, and propose improvements
- If you make it economically valuable to improve the guardrail agent, you create an incentive structure for white hat hackers to outpace black hat hackers, or even drive the incentive for black hat attackers to ZERO: any breach freezes the incentive structure for the whole system, locking the funds that would have been stolen or reverting them to their owner (more on this another day...). Subscribe!
Humans provide judgment and intent, agents provide exhaustive analysis and continuous refinement. This is why "comprehensive policy generation" isn't wishful thinking, it's achievable when agents help humans systematically harden rules using cryptographically verified real-world data.
Automated Reasoning provides deterministic rule checking, but zkML (introduced below) provides both deterministic proof generation and succinct verification. This matters for machine-speed systems where you need to verify thousands of proofs per second, and it matters when you don't trust the service provider.
Succinct verification of complex workflows is exactly what's missing in AI security, and it's exactly what Zero Knowledge Machine Learning (zkML) provides.
zkML: Mathematical Receipts for AI Execution
zkML is an emerging technology built on years of research into zero-knowledge succinct non-interactive arguments of knowledge (zkSNARKs). In zkML, the party that computes the ML inference also generates a cryptographic proof that the computation was performed correctly.
The result is a mathematical receipt that shows:
- Which specific model (with exact weights and version) was executed
- On which specific inputs (these can be private)
- And this proof can be verified by anyone, instantly, without trusting the party that generated it.
- The receipts are succinctly verifiable, meaning even if the underlying computation took hours, it can be checked in under a second.
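For intuition, here is a minimal sketch of what such a receipt could look like as a data structure, with a placeholder where a real zkSNARK verifier (for example, one exposed by a zkML library such as Jolt Atlas) would go. The field names and function signatures are illustrative assumptions, not any library's actual API.

```python
from dataclasses import dataclass

def snark_verify(proof: bytes, *public_values: bytes) -> bool:
    # Placeholder: a real verifier checks the succinct proof against the
    # public commitments in milliseconds, no matter how long inference took.
    raise NotImplementedError("plug in a real zkML verifier here")

@dataclass(frozen=True)
class InferenceReceipt:
    model_commitment: bytes    # binds the exact weights and model version
    input_commitment: bytes    # commitment to the inputs (inputs can stay private)
    output: bytes              # the claimed result of the inference
    proof: bytes               # the succinct proof: kilobytes, not a full trace

def verify_receipt(receipt: InferenceReceipt, trusted_model_commitment: bytes) -> bool:
    """Anyone can run this, instantly, without trusting whoever produced the receipt."""
    if receipt.model_commitment != trusted_model_commitment:
        return False  # a swapped-out or unapproved model is rejected outright
    return snark_verify(receipt.proof, receipt.model_commitment,
                        receipt.input_commitment, receipt.output)
```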
From Reactive Guardrails to Proactive Proof
Remember our $100 transaction limit being bypassed through 100 micro-transactions? The world's most tedious heist?
That doesn't work anymore with zkML:
Automated Reasoning translates policy rules into formal logic and validates AI outputs with up to 99% accuracy. That mathematical validation is exactly what you want — if humans set comprehensive rules and train robust guardrail models, the system will provide best-in-class security. The formal logic enforcement is deterministic: no interpretation, no drift, just mathematical rule checking at machine speed. By wrapping the formal proof checker and conversion model logic in zkML, we get one final succinct proof covering the guardrail end-to-end. This means any other machine can check the guardrails worked without recomputing complex formal proofs or computations.
You can verify, in under a second, that all Automated Reasoning guardrails were followed and that the policy was converted into formal logic by an approved model, the best possible model for the job (Argus Codex or another), executing faithfully. And you can do this with absolutely no trust in the provider. Your agents verify, not trust!
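From the verifier's side, that end-to-end check could look something like the sketch below: one succinct proof, checked against the approved model commitment and the current policy version, gates the transaction. The commitments, field names, and stub verifier are assumptions for illustration, not ICME's actual interface.

```python
from dataclasses import dataclass

APPROVED_GUARDRAIL_MODELS = {b"argus-codex-v1-weights-hash"}   # hypothetical commitment
CURRENT_POLICY_HASH = b"policy-v7-hash"                        # hypothetical policy version

def snark_verify(proof: bytes, *public_values: bytes) -> bool:
    raise NotImplementedError("stand-in for a real zkSNARK verifier")

@dataclass(frozen=True)
class GuardrailProof:
    guardrail_model_commitment: bytes   # which guardrail model actually ran
    policy_hash: bytes                  # which policy version it enforced
    transaction_commitment: bytes       # what it evaluated (details can stay private)
    verdict: str                        # "approve" or "deny"
    proof: bytes                        # one succinct proof for the end-to-end chain

def accept_transaction(p: GuardrailProof) -> bool:
    """Sub-second, trust-free gate another agent can run before accepting a payment."""
    return (
        p.guardrail_model_commitment in APPROVED_GUARDRAIL_MODELS
        and p.policy_hash == CURRENT_POLICY_HASH
        and p.verdict == "approve"
        and snark_verify(p.proof, p.guardrail_model_commitment,
                         p.policy_hash, p.transaction_commitment,
                         p.verdict.encode())
    )
```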
Argus Codex and other guardrail models accrue extensive, specialized knowledge about prompt injection attacks, novel breaches, and anything new that researchers or agents dream up. They recognize the human flaw in a loose policy, like the micro-transaction draining attack above, and recommend a policy upgrade before an attack ever takes place. They do these checks continuously and automatically.
The problem for agentic commerce was never the rule checking itself. We could all just hardcode every rule that we want, and hope for the best. Or simply trust service provider reputation.. but with money on the line?
The problem for agentic commerce is proving which guardrail models actually ran, on what data, with which rules. Potentially to third parties that don't trust you, your model, your code, or your guardrail process.
You know your agent is secure with your client's money.. But does your client know that? What if your client is also an agent 🤖?
zkML solves this trust problem. Your agent generates a cryptographic proof showing that an approved guardrail model validated the transaction. The proof captures not just "a guardrail ran" but exactly what it evaluated: the decision context, the specific rules checked, which model version executed.
When 100 micro-transactions execute in rapid succession, the monitoring agent doesn't need human judgment, it verifies cryptographic proofs mathematically. The proofs reveal: same guardrail models, same formal logic, identical approval reasoning, 10-second timespan.
In a trustless system with well-designed rules, and a sufficiently advanced monitoring agent, this is game over for the attacker.
The math alone proves the anomaly, the proofs show a pattern legitimate transactions cannot produce. The monitoring agent automatically denies all pending requests, halts the transaction chain, and quarantines the compromised agent. No interpretation needed. No investigation backlog. No human intervention. Circuit breaker triggers autonomously at machine speed.
zkML-powered monitoring bots will never need to re-execute entire computation traces to discover the underlying policy break. They can quickly check the succinct proofs and react instantly when a rule is broken.
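Here is a minimal sketch of that monitoring loop, assuming the receipts have already been verified: an aggregate, sliding-window rule that the per-transaction check can't see, with an autonomous circuit breaker. The thresholds and structure are illustrative, not a production design.

```python
from collections import deque

WINDOW_SECONDS = 10
AGGREGATE_LIMIT = 100.0   # USD across the window, matching the policy intent

class CircuitBreaker:
    def __init__(self):
        self.recent = {}      # account -> deque of (timestamp, amount)
        self.frozen = set()   # quarantined accounts/agents

    def observe(self, account: str, amount: float, ts: float) -> bool:
        """Return True if the transaction may proceed, False if the account is frozen."""
        if account in self.frozen:
            return False
        window = self.recent.setdefault(account, deque())
        window.append((ts, amount))
        while window and ts - window[0][0] > WINDOW_SECONDS:
            window.popleft()
        if sum(a for _, a in window) > AGGREGATE_LIMIT:
            self.frozen.add(account)   # halt the chain, quarantine the agent
            return False
        return True

# The micro-transaction attack from earlier: 100 x 10 USD within 10 seconds.
breaker = CircuitBreaker()
results = [breaker.observe("acct-7", 10.0, ts=i * 0.1) for i in range(100)]
print(results.index(False))  # 10: the breaker trips on the 11th transfer, not after all 100
```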
This is the division of labor that makes Agentic Commerce viable.
- Humans: Design rules, train models, set security requirements (done well, this should go far)
- Automated Reasoning: Enforces rules through formal logic (99% accuracy, deterministic)
- Guardrail agents: Battle-test and update rules
- zkML: Proves the correct model and rules actually executed (cryptographic certainty, succinct verification)
- Monitoring agents: Verify proofs and trigger responses autonomously (machine speed, no bottleneck)
If humans set up comprehensive policy guardrails and well-trained monitoring agents, the system self-enforces with mathematical guarantees. No trust required. No humans in the loop. Just math verifying math, autonomously catching patterns that would drain bank accounts before a human could read the first alert.
This isn't "better monitoring" or "smarter guardrails". It's fundamentally different. Just like Visa doesn't trust merchants and merchants don't trust customers, agents don't need to trust other agents when every claim comes with a cryptographic receipt that can't be forged.
Math receipts beat playground reputation every day.
What Happens to Observability?
With zkML and Automated Reasoning guardrails, observability fundamentally transforms from reactive monitoring to cryptographic verification.
Traditional observability watches behavior patterns to infer what might be happening:
- Monitor metrics, logs, traces
- Look for anomalies in behavior
- Investigate when things look suspicious
- Reactive: Something looks wrong → investigate → respond
Traditional Automated Reasoning validation adds latency that scales with complexity, and cloud-trail logging only records API calls, not whether the claimed model actually ran or what computation occurred. By the time dashboards flag unusual patterns and a human gets around to checking... the damage is done.
zkML-powered observability:
- Every action carries cryptographic proof of execution
- Anomalies are mathematically provable, not statistically inferred. These proofs can be checked at machine speed. Without humans.
- Proactive: Proof shows violation → automatic enforcement.
Observability doesn't disappear. It evolves. Instead of teams staring at dashboards hoping to catch anomalies before catastrophic loss, monitoring agents verify cryptographic receipts in real-time. The system doesn't ask "does this pattern look suspicious?" It asks "does this proof mathematically violate any policy?"
One is a question requiring human judgment. The other is a mathematical fact triggering autonomous enforcement.
This is the shift from observation to verification. Watching vs. knowing. And in a world where agents transact at machine speed, only succinctly verifiable Automated Reasoning proofs scale. Humans need not apply.
AI security today relies on a three-pillar approach: guardrails, observability, and policy enforcement. But these pillars were designed for human-supervised, mostly non-adversarial systems where alerts can be investigated and delayed responses are acceptable. The system in place has always been an interim fix. Not the future solution.
Agentic commerce operates at machine speed with no human in the loop. Prompt injection appears in over 73% of production AI deployments, and even sophisticated systems can be bypassed with simple techniques achieving 100% evasion success. At machine speed, a single vulnerability doesn't steal thousands, it steals millions before humans can respond.
zkML transforms all three pillars
Guardrails: From probabilistic filtering (88% blocked) to cryptographic proof of execution. Not "did a guardrail run?" but "here's mathematical proof that Argus Codex validated this transaction against your policy rules".
Observability: From reactive monitoring (investigate suspicious patterns after the fact) to proactive verification (mathematically prove policy violations in real time). Monitoring agents don't watch dashboards, they verify cryptographic receipts and trigger autonomous enforcement.
Policy Enforcement: From trusting logs and hoping rules were checked... to succinct verification of formal logic execution. Every decision carries a cryptographic receipt proving which rules ran, on what data, with which models.
Humans design comprehensive rules with AI assistance. Automated Reasoning enforces them deterministically. zkML proves the right models executed the right checks with succinctly verifiable proofs. Monitoring agents check thousands of proofs per second. They respond and upgrade to deal with threats completely autonomously.
This is how trillion-dollar agent economies become viable: not by hoping security works, but by mathematically proving it is working. Cryptography is already securing global commerce. It will also secure Agentic Commerce.
The only question is whether you build in mathematical security before a disaster... or after.
zkML can also provide privacy for input data. I will get into how and why this is important in another post.
At ICME Labs we are building Jolt Atlas so that you don't need to trust agents.. you can verify them. Star and follow 😄
I build zkML software and write about where AI meets cryptography.