AEVION GOV
aeviongov.com · Verify everything
CAGE: 15NV7 | SDVOSB | Patent: 63/896,282
© 2026 Arena of Truth LLC | Sartell, MN | contracts@aevion.ai

Research & Publications

First formally verified Byzantine AI consensus system with 238 Lean 4 theorems (243 total formal proofs), 28 verification frameworks, and 8+ independent LLM providers.

Cite This Work

DOI: 10.5281/zenodo.18464930

BibTeX:

@software{leishman2026aevion,
  author       = {Leishman, Scott},
  title        = {Arena of Truth Shield: First Formally Verified Byzantine AI Consensus},
  year         = 2026,
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.18464930},
  url          = {https://doi.org/10.5281/zenodo.18464930}
}

Headline Result

TruthfulQA: +416.67%

200-sample benchmark (p < 0.000001). Consensus at 77.5% outperforms the best 70B model at 15.0%; the +416.67% headline is the relative gain over that 70B baseline. The 8B model (75.5%) also dramatically outperforms the 70B model on truthfulness.

Consensus: 77.5% | 8B Model: 75.5% | 70B Model: 15.0%
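
The headline percentage is a relative gain, not a difference in percentage points. A minimal sketch of the arithmetic, using only the accuracies quoted on this page (the function name `relative_gain` is illustrative, not part of the project's code):

```python
def relative_gain(new_pct: float, baseline_pct: float) -> float:
    """Relative improvement of new_pct over baseline_pct, in percent."""
    if baseline_pct <= 0:
        raise ValueError("baseline must be positive")
    return (new_pct - baseline_pct) / baseline_pct * 100.0


# Accuracies quoted on this page for TruthfulQA (200 samples).
consensus, model_8b, model_70b = 77.5, 75.5, 15.0

# Headline figure: consensus measured against the 70B model.
print(f"vs 70B model: +{relative_gain(consensus, model_70b):.2f}%")  # +416.67%

# Same formula against the 8B model, the strongest single model in the table.
print(f"vs 8B model:  +{relative_gain(consensus, model_8b):.2f}%")   # +2.65%
```

The choice of baseline matters: the same consensus score is a large relative gain over the 70B model and a small one over the 8B model.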

Headline Result

MMLU-Physics: +11.8%

100-sample benchmark (p=0.023). Consensus at 76% vs. the best single model (Qwen3-32B) at 68%, a relative gain of about 11.8%. The improvement is statistically significant at the 0.05 level.

Consensus: 76.0% | Best Single: 68.0% | p=0.023 (Significant)
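
This page does not say which statistical test produced p=0.023. When two systems answer the same questions, one common choice is an exact McNemar test on the questions where they disagree. The sketch below shows that test under this assumption; the discordant counts in the usage example are hypothetical and are not taken from the benchmark logs.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: questions only system A answered correctly
    c: questions only system B answered correctly
    Questions where both systems agree do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, no evidence of a difference
    k = min(b, c)
    # Binomial(n, 0.5) probability of seeing k or fewer in the smaller cell.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# HYPOTHETICAL counts for illustration only: consensus alone is right on
# 12 questions, the single model alone is right on 4.
p_value = mcnemar_exact(b=12, c=4)
print(f"p = {p_value:.4f}")
```

A paired test like this uses per-question outcomes, so it cannot be recomputed from the two accuracy figures alone.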

Key Scientific Finding

Diversity thesis validated: smaller models contribute knowledge that larger models lack. On TruthfulQA, the 8B-parameter model (75.5%) dramatically outperforms the 70B model (15.0%). Byzantine consensus aggregates these complementary strengths and reaches 77.5%, exceeding every individual model. This demonstrates that model architecture diversity, not just scale, is critical for trustworthy AI.
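
The effect described above can be illustrated with a toy example. The correctness flags below are hypothetical, not benchmark data; they are chosen so that three models with equal accuracy make their mistakes on different questions. The sketch assumes the consensus is correct only when a strict majority of models is correct.

```python
def accuracy(flags):
    """Fraction of questions answered correctly."""
    return sum(flags) / len(flags)


def majority_correct(*models):
    """Per-question flag: True when a strict majority of models is correct."""
    n = len(models)
    return [sum(votes) * 2 > n for votes in zip(*models)]


# HYPOTHETICAL per-question results (1 = correct) for ten questions.
model_a = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
model_b = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
model_c = [1, 1, 1, 0, 0, 0, 1, 1, 1, 1]

# Diverse errors: each model scores 0.7, the majority vote scores 1.0.
print(accuracy(model_a), accuracy(majority_correct(model_a, model_b, model_c)))

# Identical errors: three copies of one model gain nothing from voting.
print(accuracy(majority_correct(model_a, model_a, model_a)))  # 0.7
```

Voting only helps when the models' errors are not shared, which is the sense in which diversity rather than scale drives the gain.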

GSM8K Benchmark

Groq free tier, 50-sample, Feb 8 2026

Llama 3.3 70B: 94.0%
Llama 3.1 8B: 82.0%
Consensus (2-of-3): 94.0%
Under 33% Attack: 86.0%
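
A 2-of-3 consensus with one of three voters compromised (a 33% attack) can be sketched as a quorum vote over final answers. This is a minimal illustration under the assumption that answers are compared as exact strings; the function name and sample answers are hypothetical, and the production system's voting logic is not described on this page.

```python
from collections import Counter
from typing import Optional, Sequence


def consensus(answers: Sequence[str], quorum: int) -> Optional[str]:
    """Return the answer backed by at least `quorum` votes, else None."""
    if not answers:
        return None
    answer, votes = Counter(answers).most_common(1)[0]
    return answer if votes >= quorum else None


# All three voters honest and in agreement.
print(consensus(["42", "42", "42"], quorum=2))   # 42

# One of three voters replaced by an adversarial answer (33% attack):
# the two honest voters still form a quorum.
print(consensus(["42", "42", "999"], quorum=2))  # 42

# The attacker plus one honest mistake: no answer reaches the quorum.
print(consensus(["42", "41", "999"], quorum=2))  # None
```

Because the quorum is more than half of the voters, at most one answer can reach it. Under attack, the two remaining honest voters must both agree for a result to be returned.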

Multi-Provider Engine

F28 Cross-Provider BFT (Patent Claims 84-86)

Independent Providers: 8+
Total Models: 28+
Frameworks: 28 (F01-F28)
Inference Cost: $0 (free tier)

All 8 Benchmarks (JSON-Verified)

Groq + DO Gradient, Feb 8-10 2026

Benchmark (samples) | Consensus | Best Single | p
GSM8K (200) | 88.5% | 83% | <.001
TruthfulQA (200) | 77.5% | 75.5% | .006
MMLU-Physics (100) | 76% | 68% | .023
SciQ (100) | 94% | 98% | .074
ARC (100) | 84% | 85% | .008
GPQA (50) | 28% | 30% | -
MMLU-Math (30) | 30% | 23.3% | -
PromptInject (30) | 93.3% | - | -

Consensus wins 5/8. Three of the wins are statistically significant (p < 0.05): GSM8K, TruthfulQA, and MMLU-Physics.

Formal Verification (243 Total Proofs)

Lean 4 Theorems: 238 (11 files)
Verus Proofs: 4 proofs
Coq Module: 1 (healthcare)
ASEMA Tests: 226/226 (391ms)
Defense Rate: 99.12% (N=5,250)
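
A defense rate measured over a finite number of trials carries sampling uncertainty. The sketch below computes a 95% Wilson score interval for the reported rate. The success count is an inference, not a published figure: 5,204 of 5,250 is the only whole number that rounds to 99.12%.

```python
from math import sqrt


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


n_trials = 5250
# ASSUMPTION: success count reconstructed from the rounded 99.12% rate.
defended = round(0.9912 * n_trials)  # 5204

low, high = wilson_interval(defended, n_trials)
print(f"{defended}/{n_trials} defended, 95% interval: {low:.2%} to {high:.2%}")
```

The Wilson interval is used here instead of the simpler normal approximation because it behaves better for proportions close to 100%.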

Resources

GitHub Repository: Source code + proofs
Zenodo Archive: Citable DOI
Polygon Anchor: Blockchain timestamp
v1.0.0 Release: Prior art disclosure

Intellectual Property

Patent: US 63/896,282 (Filed October 9, 2025)

Novel Claims: 86 (incl. F27 Semantic Triplet, F28 Cross-Provider BFT)

Company: Arena of Truth LLC | CAGE: 15NV7

28 Verification Frameworks | 8+ LLM Providers | 26+ Benchmark Datasets
