Vericoding: The End of "Trust Me Bro, The AI Wrote It"

92% of developers use AI coding tools daily. Trust in AI-generated code has dropped from 77% to 60%. The gap between those two numbers is where the next billion-dollar problem lives.

Vibe coding won. That debate is over. GitHub says 46% of all new code is AI-generated. Among Y Combinator's Winter 2025 cohort, 21% of startups have codebases that are 91% or more AI-generated. Gartner forecasts 60% of all new software code will be AI-generated by end of 2026. The machines are writing the code. Congratulations.

Now the bad news.

AI co-authored code has 2.74x more security vulnerabilities than human-written code. 45% of it fails security tests. Vibe-coded projects accumulate technical debt 3x faster. METR ran a randomized controlled trial with experienced developers and found they were 19% slower with AI tools, while believing they were 24% faster. 95% of developers report feeling productive while measurably producing lower-quality code.

AWS has said it plainly: review capacity, not developer output, is the bottleneck in 2026. We can generate code at machine speed. We cannot verify it at machine speed. That asymmetry is the whole problem.

And it's already exploding. The Tea App breach exposed 72,000 government IDs because a database was just… open. The Orchids platform had a zero-click vulnerability that sat unfixed for months. Security experts are predicting "catastrophic explosions" as vibe-coded applications hit production at scale. These aren't hypotheticals. They're happening now.

The software industry automated code generation but left verification in 2019. That's like building a car factory that runs at 10,000 units per hour and staffing the quality inspection line with two guys and a clipboard.

Vibe Coding vs. Vericoding

The term vericoding was introduced by researchers at the Beneficial AI Foundation in late 2025. Where vibe coding generates potentially buggy code from natural language descriptions, vericoding generates formally verified code from formal specifications. Code that is mathematically proven correct. Not tested. Not probably correct. Proven.

The difference is not incremental. It is categorical.

A test suite says: "I tried 1,000 inputs and none of them broke." Formal verification says: "I have a mathematical proof that no input, out of the infinite space of all possible inputs, can break this." Tests demonstrate the presence of bugs. Proofs demonstrate their absence.

The vericoding benchmark tested 12,504 formal specifications across three verification languages: Dafny, Verus/Rust, and Lean. Using off-the-shelf LLMs with no special training or fine-tuning, vericoding success rates hit 82% in Dafny. Pure Dafny verification improved from 68% to 96% over the past year. The curve is steep and accelerating.

Dafny is a verification-aware programming language created by K. Rustan M. Leino, now at AWS's Automated Reasoning Group. It compiles through Boogie into SMT-LIB2 queries that are checked by Z3, an SMT solver. If Z3 says UNSAT, your code is proven correct against its specification. Not probably. Proven.
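As a minimal illustration of what that toolchain proves (the method and its contract are a toy example of mine, not AWS code):

```dafny
// Toy example: Dafny compiles this contract through Boogie into an
// SMT query. Z3 reporting UNSAT for the negated postcondition means
// no integer input, out of infinitely many, can make Abs return a
// negative result or a value other than x or -x.
method Abs(x: int) returns (y: int)
  ensures y >= 0
  ensures y == x || y == -x
{
  if x < 0 {
    y := -x;
  } else {
    y := x;
  }
}
```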

AWS uses this exact stack to verify their authorization engine, the one handling over a billion API calls per second. They verified Cedar, their authorization policy language, in Dafny and proved it satisfies security properties with mathematical certainty. Then they used differential testing against quadrillions of production authorizations to confirm the Rust implementation matches. The result? A 65% performance improvement and provable security.

This is not academic technology looking for a problem. This is production infrastructure running at planetary scale.

The Gap Everyone Sees, Nobody Has Closed

Every vericoding system today, including the research from AWS's own DafnyPro and ATLAS teams, assumes you already have a formal specification. Someone has to write:

requires amount <= balance
ensures balance == old(balance) - amount
ensures balance >= 0

That someone is currently a verification expert. There are maybe a few thousand people on Earth who can write Dafny specifications fluently. That's not a market, that's a mailing list!
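For concreteness, here is a hypothetical sketch of how that contract sits on a complete Dafny method. The class layout is illustrative, and the non-negativity precondition on amount is my addition, not part of the spec above:

```dafny
// Hypothetical sketch: the withdrawal spec attached to a full
// method. The amount >= 0 precondition is an added assumption;
// without some such constraint the spec permits negative
// "withdrawals" that behave like deposits.
class Account {
  var balance: int

  method Withdraw(amount: int)
    requires amount >= 0
    requires amount <= balance
    modifies this
    ensures balance == old(balance) - amount
    ensures balance >= 0
  {
    balance := balance - amount;
  }
}
```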

The pipeline academia is building looks like this:

Formal Spec → LLM → Verified Code → Production

DafnyPro hits 86% on this. ATLAS generates thousands of verified programs. The vericoding benchmark shows 82% success. All impressive. All starting from a formal spec that a human expert already wrote.

The pipeline we have starts earlier, where the actual problem is:

Natural Language → Formal Spec → Human Review → LLM → Verified Code → Production

That first step, natural language to formal spec, is the reason vericoding hasn't become a product. It's the hardest translation, the one the researchers explicitly scope out of their papers. The vericoding benchmark authors said it directly: "We acknowledge that spec generation is an important problem, but focus on the task of generating implementations and formal proofs in this work."

We're building the part they left out. And we have an unfair advantage.

We Already Do This. It's Called PreFlight.

If you've been following this blog, you know ICME's PreFlight. You write a guardrail policy in plain English. PreFlight compiles it to formal logic (SMT-LIB). An SMT solver checks every agent action against that logic in under a second. Every decision produces a cryptographic proof: tamper-proof, independently verifiable, shareable with any third party. SAT means allowed. UNSAT means blocked.

That's the same fundamental pipeline. Natural language in. Formal logic out. Verified by an SMT solver. Cryptographic receipt attached. The only difference is what you're verifying. PreFlight verifies agent actions against a policy. Vericoding verifies code against a specification. The core machinery is identical: translate human intent into math, then let the solver do its job.

PreFlight already handles the hard part that every academic vericoding paper punts on. The NL-to-formal-logic translation. The ambiguity detection. The sub-second verification. The cryptographic proof that the check actually ran. These aren't theoretical capabilities. They're in production, running against real agent actions, right now.

Why PreFlight Is a Vericoding Engine in Disguise

Here's the part that makes this more than a research direction for us.

Automated Reasoning's core job is translating natural language into formal logic. That translation is the same fundamental problem whether you're compiling a guardrail policy or generating a code specification. "Never approve transactions over $10,000 without a second confirmation" compiles to guardrail logic. "No withdrawal can exceed the account balance" compiles to a Dafny precondition. The domain shifts, but the translation pattern is identical: vague human intent into precise formal constraint.

The system we've built handles every stage of this translation. Parsing ambiguous natural language. Cross-checking across multiple model passes. Catching gaps where the intent is underspecified. Returning a formal result that an SMT solver can verify. These are hard engineering problems, and we've already solved them for guardrails. The architecture generalizes.

This is the cold start problem that academic vericoding teams keep running into. There's almost no paired NL-to-Dafny data in the wild. The ATLAS team had to synthetically generate their training data. Meanwhile, PreFlight has been operating in production, processing real natural language policies across financial, access control, and compliance domains. The patterns our system has learned for translating intent into formal logic apply directly to translating intent into code specifications. We don't need to start from zero. We're extending a system that already works.

The Vericoding Pipeline

1. Natural Language Intent

The developer writes what they want in English: "No user can withdraw more than their balance. Failed withdrawals don't change state. Daily withdrawal limits are enforced."

Same interface as writing a PreFlight policy. Same experience. If you've used PreFlight, you already know how to write a vericoding spec.

2. LLM → Dafny Specification (Multi-Model Cross-Check)

Multiple LLMs independently translate the intent into Dafny pre/postconditions. Only translations where a majority agree are accepted. PreFlight already uses this cross-checking pattern for policy compilation. Applying it to code specs is a direct extension of the same architecture.
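One plausible target for that translation, working from the step-1 intent. This is a sketch: the field names, the boolean result flag, and the DailyLimit constant are all assumptions of mine, not output of any real system:

```dafny
// Illustrative translation of the step-1 English intent into a
// Dafny contract. Field names and DailyLimit are assumptions.
class Account {
  var balance: nat
  var withdrawnToday: nat
  const DailyLimit: nat := 5000

  method Withdraw(amount: nat) returns (ok: bool)
    requires withdrawnToday <= DailyLimit
    modifies this
    // "No user can withdraw more than their balance."
    ensures ok ==> amount <= old(balance) && balance == old(balance) - amount
    // "Failed withdrawals don't change state."
    ensures !ok ==> balance == old(balance) && withdrawnToday == old(withdrawnToday)
    // "Daily withdrawal limits are enforced."
    ensures withdrawnToday <= DailyLimit
  {
    if amount <= balance && withdrawnToday + amount <= DailyLimit {
      balance := balance - amount;
      withdrawnToday := withdrawnToday + amount;
      ok := true;
    } else {
      ok := false;
    }
  }
}
```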

3. SMT Spec Scoring

Before any code is generated, the spec itself is analyzed by Z3. The spec is just logic. You don't need an implementation to audit it. The solver checks: Is the spec consistent? What inputs have undefined behavior? How many distinct implementations satisfy it? The output is a confidence score and a gap analysis the human can actually read:

Spec Confidence: 72%

✓ Balance non-negativity: fully specified
✓ Success case: fully specified
⚠ Failure case: underspecified
⚠ Daily limit: not mentioned
✗ Concurrent access: not addressed

PreFlight already catches these kinds of ambiguities in guardrail policies. A travel expense policy that technically allows $1,001 purchases without approval? The solver finds it. A withdrawal spec that doesn't constrain failure behavior? Same solver, same technique.
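To make the expense-policy example concrete, here is a hypothetical SMT-LIB2 gap check. It models a policy whose author wrote a strict "over $1,001" condition when the intent was a $1,000 ceiling; the query and its constants are illustrative:

```smt
; Hypothetical gap check. As written, approval is required only for
; amounts strictly over $1,001, though the intent was a $1,000
; ceiling. Ask the solver for an amount that exceeds $1,000 yet
; needs no approval:
(declare-const amount Int)
(define-fun needs-approval () Bool (> amount 1001))
(assert (not needs-approval))
(assert (> amount 1000))
(check-sat)   ; sat: the gap is real, with witness amount = 1001
```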

4. Human Battle Testing

The human reviews ambiguous parts of the spec, not 500 lines of generated code. "Does ensures balance == old(balance) - amount capture what I mean by withdrawal?" That is a question anyone with domain knowledge can answer.

5. LLM → Verified Implementation

DafnyPro-style annotation and code generation. The LLM writes the implementation, Z3 verifies it satisfies the spec. If it fails, iterate. This loop already works at 86% first-pass success in academic benchmarks.
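The annotation half of that loop is worth seeing. In this toy linear search (my own example), the ensures clauses are the spec; the two invariant lines are the kind of annotation the generation step has to supply before Z3 can close the proof:

```dafny
// Toy example: without the two invariant lines, the verifier
// cannot prove the "key not found" postcondition after the loop.
// Supplying such annotations is the work the LLM does inside the
// verify-and-iterate loop.
method Find(a: seq<int>, key: int) returns (idx: int)
  ensures 0 <= idx ==> idx < |a| && a[idx] == key
  ensures idx < 0 ==> key !in a
{
  idx := 0;
  while idx < |a|
    invariant 0 <= idx <= |a|
    invariant forall j :: 0 <= j < idx ==> a[j] != key
  {
    if a[idx] == key {
      return;
    }
    idx := idx + 1;
  }
  idx := -1;
}
```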

6. Compile to Production Language

Dafny compiles to Python, Java, C#, Go, and JavaScript. Verified code, in the language your stack actually uses.

7. Archive with Cryptographic Proof

The SMT-LIB2 file is a portable, standard-format proof artifact. Any third party can feed it to any compliant solver (Z3, CVC5, Yices) and independently confirm the verification result. And because this is ICME, the proof doesn't stop at the solver output. We wrap it in a zero-knowledge proof (ZKP), the same way PreFlight wraps every guardrail decision. Verifiable in under a second. On any device. You can also fold these proofs together to verify the entire program end to end!

Guardrails & Verified Code

PreFlight proves agent actions are safe. Vericoding proves agent-written code is correct. These are not completely separate. They are the same trust infrastructure applied at different layers of the stack.

Think about what the fully integrated picture looks like. An AI agent writes code (vibe coding). The code is verified against a formal spec derived from human intent (vericoding). Every verification produces a cryptographic receipt (PreFlight). The agent then operates under guardrail policies that are themselves formally verified (PreFlight again). From code generation to runtime behavior, every step is provably correct and independently auditable. One trust layer, top to bottom.

For regulated industries and agentic commerce, this is a major unlock. Healthcare companies aren't vibe coding because they can't prove correctness to auditors. Financial services firms aren't vibe coding because compliance requires evidence, not vibes. 72% of healthcare and 66% of financial services companies are sitting out the AI coding revolution entirely.

We're building the tool that lets them in:

  • Translate compliance rules from English to formal specs
  • Score specs for completeness before code is generated
  • Prove implementations correct against the spec
  • Archive standard-format proof artifacts for regulators
  • Wrap every proof in a cryptographic receipt, verifiable by any third party in under a second

That's not a developer tool. That's compliance infrastructure for the age of AI-generated code. And the TAM is every regulated line of code that will ever be written by an AI.

The Math Works. We're Shipping It.

The vericoding benchmark shows 82% success with off-the-shelf models. DafnyPro shows 86% with inference-time techniques. Dafny verification hit 96% in the latest results. The SMT-LIB standard has been stable for decades. Z3 powers a billion checks per day at AWS.

The academic pieces are proven. The solver infrastructure is battle-tested. What's been missing is a product that starts from natural language, ends with verified code, and produces a cryptographic proof of every step. We've already built the NL-to-formal-logic engine for guardrails. We've already shipped the cryptographic proof layer. The architecture generalizes directly to code verification.

Vericoding is an additional step. And we're already most of the way there!

The vibe coding trust crisis is here. 46% of code is AI-generated. Trust is falling. Breaches are climbing. Regulated industries are locked out. The bottleneck is verification, not generation.

LLM automated coding. Now we're using cryptography to automate verification.

With math, not vibes.

Wyatt Benno

I build software and write about where AI meets cryptography.

