Verifiable evidence

Benchmarks & Results

Byzantine-safe math benchmarks backed by hash-chained evidence. Run the Colab notebook, then share or upload your results here.

GSM8K Byzantine 3-Model Fleet

Hash-chained evidence · Patent Pending 63/896,282

Grade-school math (GSM8K) evaluated with a 3-model consensus protocol: if at least 2 of the 3 models agree, the answer is released; otherwise the problem ends in a constitutional halt and no answer is released. Each problem's record is committed to a SHA-256 hash chain, and the final hash commits the full run.

  • Notebook: notebooks/gsm8k_byzantine_demo.ipynb — run in Google Colab with API keys in Secrets.
  • Output: Timestamped JSON (metrics + per-problem records + final hash).
  • Full run: 1,319 problems (full test set); set GSM8K_SAMPLES = 1319 in Config.
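
The consensus rule and hash chain described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function names, the record fields, the all-zero starting hash, and the JSON serialization of each record are all assumptions made for the example.

```python
import hashlib
import json
from collections import Counter

HALT = "HALT"  # hypothetical marker for a constitutional halt


def consensus(answers):
    """Release the answer shared by at least 2 models, otherwise halt."""
    answer, votes = Counter(answers).most_common(1)[0]
    return answer if votes >= 2 else HALT


def chain_hash(prev_hash, record):
    """SHA-256 over the previous hash plus a canonical JSON dump of the record."""
    payload = prev_hash + json.dumps(record, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_chain(per_problem_answers):
    """Build per-problem records and return them with the final hash."""
    prev = "0" * 64  # assumed genesis value; the notebook may use something else
    records = []
    for index, answers in enumerate(per_problem_answers):
        record = {
            "index": index,
            "answers": list(answers),
            "released": consensus(answers),
        }
        prev = chain_hash(prev, record)
        records.append({**record, "hash": prev})
    return records, prev


if __name__ == "__main__":
    records, final_hash = run_chain([["18", "18", "20"], ["3", "5", "7"]])
    for r in records:
        print(r["index"], r["released"], r["hash"][:12])
    print("final_hash:", final_hash)
```

Because each hash depends on the previous one, changing any earlier record changes every later hash, including the final one. That is what lets a single final hash stand in for the whole run.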

Latest run results

After running the Colab notebook, save the generated gsm8k_byzantine_run_*.json file and, optionally, upload it to Google Drive or commit it to docs/results/. Once you have a run, paste these headline metrics here:

single_model_accuracy, consensus_accuracy_non_halt, halt_percent, final_hash
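
One way to pull these four values out of a saved run is sketched below. The JSON layout is an assumption: the sketch looks for each key under a top-level "metrics" object and falls back to the top level, so adjust it if the notebook's output is structured differently. The helper names latest_run and headline_metrics are hypothetical.

```python
import glob
import json
import os

HEADLINE_KEYS = [
    "single_model_accuracy",
    "consensus_accuracy_non_halt",
    "halt_percent",
    "final_hash",
]


def latest_run(results_dir="."):
    """Return the most recently modified run file in results_dir, or None."""
    pattern = os.path.join(results_dir, "gsm8k_byzantine_run_*.json")
    paths = glob.glob(pattern)
    return max(paths, key=os.path.getmtime) if paths else None


def headline_metrics(path):
    """Read the headline values from a run file (assumed layout, see above)."""
    with open(path, encoding="utf-8") as f:
        run = json.load(f)
    metrics = run.get("metrics", {})  # assumed location of the metric values
    return {key: metrics.get(key, run.get(key)) for key in HEADLINE_KEYS}


if __name__ == "__main__":
    path = latest_run("docs/results")
    if path is None:
        print("No gsm8k_byzantine_run_*.json file found.")
    else:
        for key, value in headline_metrics(path).items():
            print(f"{key}: {value}")
```

Any key missing from the file comes back as None, which makes a layout mismatch easy to spot before pasting values here.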

AIMO3 (AI Mathematical Olympiad)

Progress Prize 3 · Byzantine consensus submission (v4/v5) with code-verified weighting and constitutional halt.

Strategy and notebooks: AIMO3/ — see AIMO3_SCORE_47_STRATEGY.md and aevion-aimo3-submission-v5-47.ipynb.