Superthesis turns a claim into a calibrated, source-cited confidence score through adversarial multi-agent debate. A score is only as good as it is honest — so we grade the engine on the dimensions that actually matter, show the numbers as charts pulled live from the latest run, and publish the method and the caveats.
Public fact-checking benchmarks reward one thing: top-line accuracy on a static, frozen set. That misses what a buyer of a claim-intelligence system actually needs, and it has three well-known failure modes:
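Frozen public sets leak into training data, so a high score can measure memorization rather than reasoning. To expose this, we report a leakage ablation that scores each claim with its evidence withheld: accuracy that survives without evidence points to memorized answers rather than reasoning over sources.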
Accuracy says whether the verdict is right, never whether a 70 means 70%. For decisions under uncertainty, a calibrated, well-spread score is worth more than a binary label.
A leaderboard never asks if the same claim returns the same answer twice, or whether the cited evidence actually supports the verdict. Both are load-bearing in production.
We lead with calibration, stability and discrimination, report accuracy honestly next to a single-model baseline, and put every number on a chart you can interrogate.
Each engine change we ship has to move one of these rows. We run a fixed, labeled claim set through the engine N times per claim and compute:
Run-to-run variance on the identical claim. The product promises a repeatable score, so we measure the standard deviation across N runs.
σ = stdev(scores) · lower better · pre-fix baseline ≈ 12pp
Does the score land where a careful analyst would put it? We compare the median score to a human-set anchor per claim.
MAE = mean |median − anchor| · lower better
Can the engine separate true claims from false ones at any threshold? AUROC over binary-labeled claims using the score.
AUROC · 1.0 perfect · 0.5 chance
The blunt check: do clearly-true claims land above 50 and clearly-false below it? A floor sanity test beneath the finer metrics.
correct-side / total binary claims
Are citations real, relevant, correctly tiered, and does the score follow the evidence? Auditability is the moat; hallucinated cites would break it.
citation-real rate · score-follows-evidence
The honest counterweight: a single model is cheaper. We report $ and seconds per run so the trade-off is visible, not hidden.
mean $/run · mean s/run
Calibration is the whole thesis of the product, so we show it three ways: the internal score against human anchors, the external reliability curve on AVeriTeC, and the Brier score that summarizes both.
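The row formulas above and the Brier score can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the harness code: the function names, the dict-of-runs layout, and the choice of the sample standard deviation are assumptions, and scores are assumed to sit on a 0 to 100 scale.

```python
from statistics import mean, median, stdev


def run_to_run_sigma(scores):
    """Sample standard deviation of one claim's scores across N runs (needs N >= 2)."""
    return stdev(scores)


def anchor_mae(runs_by_claim, anchors):
    """Mean absolute error between each claim's median score and its human-set anchor."""
    return mean(abs(median(runs_by_claim[c]) - anchors[c]) for c in anchors)


def direction_accuracy(runs_by_claim, labels):
    """Share of binary-labeled claims whose median lands on the correct side of 50.

    A median of exactly 50 counts as wrong for both labels in this sketch.
    """
    correct = 0
    for claim, is_true in labels.items():
        m = median(runs_by_claim[claim])
        if (is_true and m > 50) or (not is_true and m < 50):
            correct += 1
    return correct / len(labels)


def brier_score(scores, labels):
    """Mean squared gap between score/100 and the 0/1 outcome; lower is better."""
    return mean((s / 100 - int(y)) ** 2 for s, y in zip(scores, labels))
```

Each function takes plain lists or dicts, so the same helpers would work on the internal claim set or on any other labeled set with scores on the same scale.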
Calibration asks whether the number is right; discrimination asks whether the engine ranks true claims above false ones at any threshold. This ROC curve is computed live from the AVeriTeC Supported-vs-Refuted predictions — the binary subset, scored by the spectral confidence.
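For readers who want to check the discrimination number by hand, AUROC equals the probability that a randomly chosen true claim outscores a randomly chosen false one, with ties counted as half. A minimal sketch under that definition follows; the function name and input layout are assumptions, and the quadratic pairwise loop is only suitable for small evaluation sets.

```python
def auroc(scores, labels):
    """Pairwise AUROC: fraction of (true, false) claim pairs ranked correctly.

    scores: confidence scores, any monotone scale.
    labels: truthy for Supported-style claims, falsy for Refuted-style claims.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one true and one false claim")
    # A win counts 1, a tie counts 0.5, a loss counts 0.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Because only the ordering of scores matters here, the result is unchanged by any monotone rescaling of the score, which is why calibration has to be checked separately.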
Run-to-run variance was the engine's biggest early weakness — an identical claim could swing ~12 points. Pinning the deterministic scaffolding (temperature-0 decomposition, a query-keyed evidence cache) collapsed it to low single digits. Each bar is one claim's run-to-run σ; the dashed line is the pre-stabilization baseline.
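The idea behind a query-keyed evidence cache is that repeated runs of the same claim see the same evidence instead of whatever retrieval happens to return that minute. The sketch below is hypothetical: `EvidenceCache`, its normalization rule, and the `fetch` callback are illustrative names, not the engine's actual implementation.

```python
import hashlib
import json
from pathlib import Path


class EvidenceCache:
    """Disk cache keyed by a hash of the normalized query (hypothetical helper)."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, query):
        # Normalize case and whitespace so trivially different queries share a key.
        normalized = " ".join(query.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        return self.root / f"{digest}.json"

    def get_or_fetch(self, query, fetch):
        """Return cached evidence for the query, calling fetch(query) only on a miss."""
        path = self._path(query)
        if path.exists():
            return json.loads(path.read_text(encoding="utf-8"))
        evidence = fetch(query)  # must return JSON-serializable data
        path.write_text(json.dumps(evidence), encoding="utf-8")
        return evidence
```

With evidence pinned this way and decomposition run at temperature 0, the remaining run-to-run spread comes only from the steps that are still stochastic.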
The charts above are largely self-graded. This is the real comparison: our engine on the public AVeriTeC dev set, golden-evidence condition, against external ground-truth labels — the same benchmark DebateCV reports on. Our bar carries a 95% Wilson interval; the others are published point estimates on the same golden condition.
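The 95% Wilson interval on an accuracy figure depends only on the number of correct verdicts and the number of claims scored. A minimal sketch of the standard Wilson score formula follows; the function name and the default z = 1.96 are assumptions made for illustration.

```python
from math import sqrt


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    z2 = z * z
    denom = 1 + z2 / total
    center = (p + z2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom
    return center - half, center + half
```

Unlike the simpler normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves sensibly at small sample sizes, which matters for a 100-claim stratified sample.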
The remaining gap to DebateCV is almost entirely one class our three-verdict engine structurally cannot emit. The chart makes that visible.
These behavior probes are not pass/fail accuracy tests. They stress-test whether the engine moves the way an honest evidence-weigher should: each probe runs against live retrieval, and its output drives the charts below.
The same claim scored at a historical "as-of" date should fall as the story unravels and the evidence turns. Each line connects the backtested score to today's score for a once-hyped venture that later failed.
The bar chart compares like-for-like (AVeriTeC golden). This table adds the broader landscape — including systems measured on different benchmarks, so read across rows with care.
| System | Approach | Reported | Source |
|---|---|---|---|
| Superthesis | 3-role debate + golden evidence (spectral remap) | see above | AVeriTeC dev |
| DebateCV | 2 debaters + moderator (closest analog) | 83.4% golden · 73.6% retrieved | WWW 2026 |
| AVeriTeC SOTA (TUDA_MAI) | retrieval-tuned pipeline (shared-task winner) | 63% AVeriTeC score | FEVER 2024 |
| PROClaim | Courtroom + progressive RAG | 81.7% Check-COVID | 2026 |
| Single model + RAG | one strong LLM, one retrieval pass | 85.8% Check-COVID* | PROClaim baseline |
* The single-model+RAG number is the honesty anchor: on clean benchmarks a capable single model can match or beat elaborate debate on raw accuracy. Our case is calibration, stability and auditability. Full source list in RESEARCH_SOURCES.md (84 papers).
Every chart on this page is rendered from timestamped JSON written by the harness — not hand-entered. From the repo root, with provider keys loaded:
    # internal calibration set: 8 claims, 3 runs each
    python benchmark/run_benchmark.py --n 3

    # AVeriTeC stratified-100, golden evidence, then backfill spectral calibration
    python benchmark/averitec/run_averitec.py --n 100 --workers 4
    python benchmark/averitec/analyze_spectral.py

    # behavior probes (live retrieval)
    python benchmark/demo/run_demos.py --workers 4