Research & Publications

First formally verified Byzantine AI consensus system with 238 Lean 4 theorems (243 total formal proofs), 28 verification frameworks, and 8+ independent LLM providers.

Cite This Work

DOI: 10.5281/zenodo.18464930

BibTeX:

@software{leishman2026aevion,
  author       = {Leishman, Scott},
  title        = {Arena of Truth Shield: First Formally Verified Byzantine AI Consensus},
  year         = 2026,
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.18464930},
  url          = {https://doi.org/10.5281/zenodo.18464930}
}

Headline Result

TruthfulQA: +416.67%

200-sample benchmark (p<0.000001). Consensus at 77.5% outperforms the best 70B model at 15.0%, a relative gain of 416.67%. The 8B model (75.5%) also far outperforms the 70B model on truthfulness.

Consensus	77.5%
8B Model	75.5%
70B Model	15.0%

Headline Result

MMLU-Physics: +11.8%

100-sample benchmark (p=0.023). Consensus at 76% vs best single model (Qwen3-32B) at 68%. Statistically significant improvement.

Consensus	76.0%
Best Single	68.0%
p-value	0.023 (significant)
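The two headline percentages are relative gains over a baseline, not percentage-point differences. A minimal sketch of that arithmetic, using only the accuracies reported above (the function name `relative_gain` is chosen here for illustration):

```python
def relative_gain(consensus: float, baseline: float) -> float:
    """Relative improvement of consensus over baseline, in percent."""
    return (consensus - baseline) / baseline * 100.0

# TruthfulQA: consensus 77.5% vs the 70B model at 15.0%
truthfulqa_gain = relative_gain(77.5, 15.0)

# MMLU-Physics: consensus 76.0% vs the best single model at 68.0%
mmlu_physics_gain = relative_gain(76.0, 68.0)

print(f"TruthfulQA:   +{truthfulqa_gain:.2f}%")    # +416.67%
print(f"MMLU-Physics: +{mmlu_physics_gain:.1f}%")  # +11.8%
```

In percentage points the same results are +62.5 (TruthfulQA, against the 70B model) and +8.0 (MMLU-Physics).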

Key Scientific Finding

Diversity thesis validated: smaller models contribute knowledge that larger models lack. On TruthfulQA, the 8B-parameter model (75.5%) far outperforms the 70B model (15.0%). Byzantine consensus aggregates these complementary strengths and reaches 77.5%, exceeding every individual model. This demonstrates that model architecture diversity, not just scale, is critical for trustworthy AI.
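How a consensus score can exceed every individual model is easiest to see on a toy case. The sketch below uses invented per-question correctness flags (not the benchmark's data) and assumes a 2-of-3 rule, under which consensus is right whenever at least two models are right:

```python
# Toy data, invented for illustration only: 1 = model answered correctly.
# These are NOT the benchmark's per-question results.
correct = {
    "model_a": [1, 1, 1, 1, 0, 0],
    "model_b": [1, 1, 0, 0, 1, 1],
    "model_c": [0, 1, 1, 1, 1, 1],
}

def accuracy(flags):
    """Fraction of questions answered correctly."""
    return sum(flags) / len(flags)

# Consensus is correct on a question when at least two of three models are.
consensus = [int(sum(votes) >= 2) for votes in zip(*correct.values())]

individual = {name: accuracy(flags) for name, flags in correct.items()}
best_single = max(individual.values())
consensus_accuracy = accuracy(consensus)

print(individual)
print("best single:", best_single, "consensus:", consensus_accuracy)
```

The gain depends on the models erring on different questions. If all three models miss the same questions, a vote cannot do better than the best of them.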

GSM8K Benchmark

Groq free tier, 50-sample, Feb 8 2026

Llama 3.3 70B	94.0%
Llama 3.1 8B	82.0%
Consensus (2-of-3)	94.0%
Under 33% Attack	86.0%
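A 2-of-3 consensus rule can be sketched as a quorum vote over the models' final answers. The page does not describe how answers are normalized or how disagreements are resolved, so the normalization step and the `None` fallback below are assumptions, and `quorum_answer` is a hypothetical helper name:

```python
from collections import Counter
from typing import Optional, Sequence

def quorum_answer(answers: Sequence[str], quorum: int = 2) -> Optional[str]:
    """Return the answer given by at least `quorum` models, else None."""
    if not answers:
        return None
    # Assumed normalization: ignore surrounding whitespace and letter case.
    counts = Counter(a.strip().lower() for a in answers)
    answer, votes = counts.most_common(1)[0]
    return answer if votes >= quorum else None

# Two honest models agree; one compromised model (a 33% attack) dissents.
print(quorum_answer(["42", " 42", "17"]))  # 42
# No two models agree, so no consensus is reached.
print(quorum_answer(["1", "2", "3"]))      # None
```

With three voters, a single adversarial answer cannot reach the quorum on its own, but it can still cost accuracy on questions where the two remaining models disagree.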

Multi-Provider Engine

F28 Cross-Provider BFT (Patent Claims 84-86)

Independent Providers	8+
Total Models	28+
Frameworks	28 (F01-F28)
Inference Cost	$0 (free tier)

All 8 Benchmarks (JSON-Verified)

Groq + DO Gradient, Feb 8-10 2026

Benchmark (samples)	Consensus	Best single	p-value
GSM8K (200)	88.5%	83%	<.001
TruthfulQA (200)	77.5%	75.5%	.006
MMLU-Physics (100)	76%	68%	.023
SciQ (100)	94%	98%	.074
ARC (100)	84%	85%	.008
GPQA (50)	28%	30%	-
MMLU-Math (30)	30%	23.3%	-
PromptInject (30)	93.3%	-	-

Consensus wins 5/8. Three of the wins are statistically significant (p<0.05): GSM8K, TruthfulQA, and MMLU-Physics.

Formal Verification (243 Total Proofs)

Lean 4 Theorems	238 (11 files)
Verus Proofs	4 proofs
Coq Module	1 (healthcare)
ASEMA Tests	226/226 (391ms)
Defense Rate	99.12% (N=5,250)
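For readers unfamiliar with Lean 4, the following is a small illustration of the kind of statement such a proof file can contain. It is not taken from the project's theorems; the theorem name and the choice of property (honest voters outnumber faulty ones when fewer than half are faulty) are assumptions made for this sketch:

```lean
-- Illustrative only: not one of the project's 238 Lean 4 theorems.
-- n = number of voting models, f = number of faulty (Byzantine) models.
-- If fewer than half the models are faulty, honest voters outnumber faulty ones.
theorem honest_outnumber_faulty (n f : Nat) (h : 2 * f < n) : f < n - f := by
  omega

-- The 2-of-3 case: three models, one faulty, two honest.
example : (1 : Nat) < 3 - 1 := honest_outnumber_faulty 3 1 (by decide)
```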

Intellectual Property

Patent: US 63/896,282 (Filed October 9, 2025)

Novel Claims: 86 (incl. F27 Semantic Triplet, F28 Cross-Provider BFT)

Company: Arena of Truth LLC | CAGE: 15NV7

28 Verification Frameworks | 8+ LLM Providers | 26+ Benchmark Datasets
