Research & Publications
The first formally verified Byzantine AI consensus system: 238 Lean 4 theorems (243 formal proofs in total, counting the Verus and Coq proofs), 28 verification frameworks, and 8+ independent LLM providers.
Cite This Work
BibTeX:
@software{leishman2026aevion,
  author    = {Leishman, Scott},
  title     = {Arena of Truth Shield: First Formally Verified Byzantine AI Consensus},
  year      = 2026,
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.18464930},
  url       = {https://doi.org/10.5281/zenodo.18464930}
}
Headline Result
TruthfulQA: +416.67%
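200-sample benchmark (p < 0.000001; reported as 0.000000 at six decimal places). Consensus reaches 77.5% against 15% for the best 70B model, and the headline figure is the relative gain over that 70B baseline. The 8B model alone scores 75.5%, far ahead of the 70B model on truthfulness.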
Headline Result
MMLU-Physics: +11.8%
100-sample benchmark (p=0.023). Consensus at 76% vs best single model (Qwen3-32B) at 68%. Statistically significant improvement.
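Both headline percentages are relative gains over a baseline accuracy, not percentage-point differences. The short sketch below recomputes them from the accuracies quoted above; the function name `relative_improvement` is illustrative and does not come from the project.

```python
def relative_improvement(consensus_pct: float, baseline_pct: float) -> float:
    """Relative gain of consensus over a baseline accuracy, in percent."""
    return (consensus_pct - baseline_pct) / baseline_pct * 100.0

# TruthfulQA: consensus 77.5% vs the 70B model at 15%.
truthfulqa_gain = relative_improvement(77.5, 15.0)

# MMLU-Physics: consensus 76% vs Qwen3-32B at 68%.
mmlu_physics_gain = relative_improvement(76.0, 68.0)

print(f"TruthfulQA:   +{truthfulqa_gain:.2f}%")    # +416.67%
print(f"MMLU-Physics: +{mmlu_physics_gain:.1f}%")  # +11.8%
```

Note that the TruthfulQA figure is measured against the 70B model. Against the 8B model (75.5%), the same formula gives a much smaller relative gain.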
Key Scientific Finding
Diversity thesis validated: smaller models contribute knowledge that larger models lack. On TruthfulQA, the 8B-parameter model (75.5%) far outperforms the 70B model (15.0%). Byzantine consensus aggregates these complementary strengths and reaches 77.5%, exceeding every individual model. This indicates that model architecture diversity, not just scale, is critical for trustworthy AI.
GSM8K Benchmark
Groq free tier, 50-sample, Feb 8 2026
| Llama 3.3 70B | 94.0% |
| Llama 3.1 8B | 82.0% |
| Consensus (2-of-3) | 94.0% |
| Under 33% Attack | 86.0% |
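The page does not spell out the voting rule behind "Consensus (2-of-3)". As a minimal sketch, assuming it is a plain quorum vote over normalized answer strings, the idea could look like the following. The function name `quorum_vote` and the sample answers are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence

def quorum_vote(answers: Sequence[str], quorum: int = 2) -> Optional[str]:
    """Return the answer backed by at least `quorum` voters, else None."""
    if not answers:
        return None
    answer, count = Counter(answers).most_common(1)[0]
    return answer if count >= quorum else None

# Three hypothetical model answers to one GSM8K-style question.
honest = ["42", "42", "41"]
print(quorum_vote(honest))      # 42

# One of three voters (33%) returns an adversarial answer.
attacked = ["42", "999", "42"]
print(quorum_vote(attacked))    # 42

# All three disagree, so no quorum is reached.
print(quorum_vote(["1", "2", "3"]))  # None
```

With three voters and a quorum of two, a single faulty or adversarial voter cannot force a wrong answer on its own. It can only break consensus when the two remaining voters already disagree, which is one plausible reason accuracy falls under attack instead of collapsing.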
Multi-Provider Engine
F28 Cross-Provider BFT (Patent Claims 84-86)
| Independent Providers | 8+ |
| Total Models | 28+ |
| Frameworks | 28 (F01-F28) |
| Inference Cost | $0 (free tier) |
All 8 Benchmarks (JSON-Verified)
Groq + DO Gradient, Feb 8-10 2026
| Benchmark (samples) | Consensus | Best single model | p-value |
| GSM8K (200) | 88.5% | 83% | <.001 |
| TruthfulQA (200) | 77.5% | 75.5% | .006 |
| MMLU-Physics (100) | 76% | 68% | .023 |
| SciQ (100) | 94% | 98% | .074 |
| ARC (100) | 84% | 85% | .008 |
| GPQA (50) | 28% | 30% | - |
| MMLU-Math (30) | 30% | 23.3% | - |
| PromptInject (30) | 93.3% | - | - |
Consensus wins on 5 of 8 benchmarks. Three of those wins (GSM8K, TruthfulQA, MMLU-Physics) are statistically significant (p < 0.05).
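The page does not say which statistical test produced the p-values. When two systems answer the same questions, one common choice is an exact McNemar test on the questions where they disagree. The sketch below illustrates that test under this assumption; the counts are made up and are not taken from the benchmarks.

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value for paired right/wrong outcomes.

    b: questions consensus answered correctly and the baseline did not
    c: questions the baseline answered correctly and consensus did not
    """
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, so no evidence of a difference
    k = min(b, c)
    # Binomial(n, 0.5) probability of a split at least this lopsided.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

# Hypothetical discordant counts, for illustration only.
p_value = mcnemar_exact(b=10, c=0)
print(f"p = {p_value:.6f}")  # p = 0.001953
```

Only the discordant questions enter the test. That is why a paired comparison can reach significance even when the gap in overall accuracy is only a few points.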
Formal Verification (243 Total Proofs)
| Lean 4 Theorems | 238 (11 files) |
| Verus Proofs | 4 proofs |
| Coq Module | 1 (healthcare) |
| ASEMA Tests | 226/226 (391ms) |
| Defense Rate | 99.12% (N=5,250) |
Intellectual Property
Patent: US 63/896,282 (Filed October 9, 2025)
Novel Claims: 86 (incl. F27 Semantic Triplet, F28 Cross-Provider BFT)
Company: Arena of Truth LLC | CAGE: 15NV7
28 Verification Frameworks | 8+ LLM Providers | 26+ Benchmark Datasets