Benchmark

How we grade our own engine.

Superthesis turns a claim into a calibrated, source-cited confidence score through adversarial multi-agent debate. A score is only as good as it is honest — so we grade the engine on the dimensions that actually matter, show the numbers as charts pulled live from the latest run, and publish the method and the caveats.

Why this page exists

A leaderboard number isn't enough

Public fact-checking benchmarks reward one thing: top-line accuracy on a static, frozen set. That misses what a buyer of a claim-intelligence system actually needs, and the leaderboard approach has three well-known failure modes:

Contamination

Frozen public sets leak into training data; a high score can measure memorization, not reasoning. To expose this, we report a leakage ablation that scores claims with no evidence supplied.

No calibration

Accuracy says whether the verdict is right, never whether a 70 means 70%. For decisions under uncertainty, a calibrated, well-spread score is worth more than a binary label.

No stability or audit

A leaderboard never asks if the same claim returns the same answer twice, or whether the cited evidence actually supports the verdict. Both are load-bearing in production.

So we grade differently

We lead with calibration, stability and discrimination, report accuracy honestly next to a single-model baseline, and put every number on a chart you can interrogate.

Methodology

Six graded dimensions

Each engine change we ship has to move one of these rows. We run a fixed, labeled claim set through the engine N times per claim and compute:

Stability

Run-to-run variance on the identical claim. The product promises a repeatable score, so we measure the standard deviation across N runs.

σ = stdev(scores) · lower better · pre-fix baseline ≈ 12pp
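
The stability metric above can be sketched in a few lines of Python. This is a minimal illustration, not the harness's own code: the function name, the claim IDs and the scores are all hypothetical, and it assumes the sample standard deviation (the page does not say whether sample or population is used).

```python
from statistics import stdev


def run_to_run_sigma(scores_by_claim):
    """Map each claim ID to the sample standard deviation of its scores.

    Assumption: sample standard deviation; each claim needs >= 2 runs.
    """
    return {
        claim_id: stdev(scores)
        for claim_id, scores in scores_by_claim.items()
    }


# Hypothetical scores (0-100) from three runs per claim.
runs = {
    "claim-a": [62.0, 64.0, 66.0],
    "claim-b": [30.0, 30.0, 30.0],
}
sigmas = run_to_run_sigma(runs)
print(sigmas)
```

A claim that returns the same score on every run has σ = 0, which is the ideal the metric rewards.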

Calibration

Does the score land where a careful analyst would put it? We compare the median score to a human-set anchor per claim.

MAE = mean |median − anchor| · lower better
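
As a rough sketch of the formula above, the calibration error takes the median score per claim, measures its distance to the human anchor, and averages those distances. The function name and the sample data are hypothetical.

```python
from statistics import median


def calibration_mae(scores_by_claim, anchors):
    """Mean absolute error between each claim's median score and its anchor."""
    errors = [
        abs(median(scores) - anchors[claim_id])
        for claim_id, scores in scores_by_claim.items()
    ]
    return sum(errors) / len(errors)


# Hypothetical runs and human-set anchors, both on a 0-100 scale.
runs = {"claim-a": [60.0, 70.0, 65.0], "claim-b": [20.0, 30.0, 40.0]}
anchors = {"claim-a": 70.0, "claim-b": 25.0}
mae = calibration_mae(runs, anchors)
print(mae)
```

Using the median rather than the mean keeps a single outlier run from dominating the per-claim error.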

Discrimination

Can the engine separate true claims from false ones at any threshold? AUROC over binary-labeled claims using the score.

AUROC · 1.0 perfect · 0.5 chance
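
AUROC equals the probability that a randomly chosen true claim scores higher than a randomly chosen false one, with ties counted as half. The sketch below computes it directly from that definition. It is an illustrative pairwise implementation on made-up data, not necessarily how the harness computes it.

```python
def auroc(scores, labels):
    """Pairwise AUROC: P(score of a true claim > score of a false claim)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one true and one false claim")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical scores and binary labels (True = claim is true).
scores = [90, 80, 55, 40, 20]
labels = [True, True, False, True, False]
area = auroc(scores, labels)
print(round(area, 4))
```

The pairwise form is quadratic in the number of claims, which is fine for small labeled sets; a rank-based formulation scales better.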

Directional accuracy

The blunt check: do clearly-true claims land above 50 and clearly-false below it? A floor sanity test beneath the finer metrics.

correct-side / total binary claims
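
A minimal sketch of the directional check follows. It assumes a score of exactly 50 counts as a miss for either label, which the page does not specify. The names and data are illustrative.

```python
def directional_accuracy(scores, labels, midpoint=50.0):
    """Fraction of binary claims whose score lands on the correct side of 50.

    Assumption: a score exactly at the midpoint is counted as incorrect.
    """
    correct = 0
    for score, is_true in zip(scores, labels):
        if is_true and score > midpoint:
            correct += 1
        elif not is_true and score < midpoint:
            correct += 1
    return correct / len(scores)


# Hypothetical scores and binary labels (True = claim is true).
scores = [90, 80, 55, 40, 20]
labels = [True, True, False, True, False]
accuracy = directional_accuracy(scores, labels)
print(accuracy)
```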

Faithfulness (planned)

Are citations real, relevant, correctly tiered, and does the score follow the evidence? Auditability is the moat; hallucinated cites would break it.

citation-real rate · score-follows-evidence

Cost & latency

The honest counterweight: a single model is cheaper. We report $ and seconds per run so the trade-off is visible, not hidden.

mean $/run · mean s/run
Calibration · the differentiator

Does a 70 mean 70?

Calibration is the whole thesis of the product, so we show it three ways: the internal score against human anchors, the external reliability curve on AVeriTeC, and the Brier score that summarizes both.
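
For readers unfamiliar with the two external measures, here is a small sketch of a Brier score and the binning behind a reliability curve. It assumes the 0-100 score has been rescaled to a probability in [0, 1] and that equal-width bins are used. Both are assumptions, and the data is invented.

```python
def brier_score(probs, outcomes):
    """Mean squared gap between predicted probability and the 0/1 outcome."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def reliability_bins(probs, outcomes, n_bins=5):
    """Return (mean predicted prob, observed true rate, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, o))
    rows = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            true_rate = sum(o for _, o in members) / len(members)
            rows.append((mean_p, true_rate, len(members)))
    return rows


# Hypothetical predictions (score / 100) and outcomes (1 = true, 0 = false).
probs = [0.9, 0.7, 0.7, 0.3, 0.1]
outcomes = [1, 1, 0, 0, 0]
brier = brier_score(probs, outcomes)
rows = reliability_bins(probs, outcomes)
print(round(brier, 3), rows)
```

On a perfectly calibrated engine, each bin's observed true rate matches its mean predicted probability, so the points sit on the diagonal of the reliability curve.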

[Chart: Score vs. human anchor (internal set)]
[Chart: Reliability curve (AVeriTeC)]
Discrimination

Separating true from false

Calibration asks whether the number is right; discrimination asks whether the engine ranks true claims above false ones at any threshold. This ROC curve is computed live from the AVeriTeC Supported-vs-Refuted predictions — the binary subset, scored by the spectral confidence.

[Chart: ROC, Supported vs. Refuted (AVeriTeC)]
Stability

The same claim, the same answer

Run-to-run variance was the engine's biggest early weakness — an identical claim could swing ~12 points. Pinning the deterministic scaffolding (temperature-0 decomposition, a query-keyed evidence cache) collapsed it to low single digits. Each bar is one claim's run-to-run σ; the dashed line is the pre-stabilization baseline.
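
To illustrate why a query-keyed evidence cache reduces variance, here is a hypothetical sketch: identical queries (after whitespace and case normalization) always return the same stored evidence, so repeat runs see the same inputs. The class, the normalization rule and the fetch function are assumptions made for illustration, not the engine's actual implementation.

```python
import hashlib


class EvidenceCache:
    """Hypothetical in-memory cache keyed by a hash of the normalized query."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable that performs the real retrieval
        self._store = {}

    @staticmethod
    def key_for(query):
        # Assumed normalization: lowercase and collapse whitespace.
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, query):
        key = self.key_for(query)
        if key not in self._store:
            self._store[key] = self._fetch(query)  # only hit retrieval on a miss
        return self._store[key]


calls = []


def fake_fetch(query):
    """Stand-in for live retrieval; records each call for the demo."""
    calls.append(query)
    return [f"evidence for: {query}"]


cache = EvidenceCache(fake_fetch)
first = cache.get("Is the claim true?")
second = cache.get("  is the  claim TRUE? ")
print(first == second, len(calls))
```

The second lookup is served from the cache, so retrieval runs once and both runs score against identical evidence.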

[Chart: Per-claim run-to-run σ]
External benchmark · AVeriTeC dev

The apples-to-apples test

The charts above are largely self-graded. This is the real comparison: our engine on the public AVeriTeC dev set, golden-evidence condition, against external ground-truth labels — the same benchmark DebateCV reports on. Our bar carries a 95% Wilson interval; the others are published point estimates on the same golden condition.
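
The Wilson interval mentioned above follows the standard Wilson score formula for a binomial proportion. The sketch below uses z = 1.96 for a 95% interval. The 70-of-100 input is a made-up example, not a result from this page.

```python
from math import sqrt


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half_width = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half_width, centre + half_width


# Illustrative only: 70 correct out of 100 claims.
low, high = wilson_interval(70, 100)
print(round(low, 4), round(high, 4))
```

Unlike the simple normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly at small sample sizes, which matters on a 100-claim subset.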

[Chart: Accuracy, golden evidence (AVeriTeC dev)]

Per-class F1 — where the gap lives

The remaining gap to DebateCV comes almost entirely from one class that our three-verdict engine structurally cannot emit. The chart makes that visible.
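
A small sketch shows why a class that is never emitted drags down a 4-class comparison: its F1 is zero regardless of how well the other classes do. The class names "A" to "D" and the labels are placeholders, not AVeriTeC data.

```python
def per_class_f1(gold, pred, classes):
    """F1 per class, computed as 2*TP / (2*TP + FP + FN); 0.0 if undefined."""
    result = {}
    for c in classes:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        denom = 2 * tp + fp + fn
        result[c] = 2 * tp / denom if denom else 0.0
    return result


# Placeholder labels: the predictor here never emits class "D".
classes = ["A", "B", "C", "D"]
gold = ["A", "A", "B", "B", "C", "D"]
pred = ["A", "A", "B", "A", "C", "C"]
f1 = per_class_f1(gold, pred, classes)
macro_f1 = sum(f1.values()) / len(classes)
print(f1, round(macro_f1, 4))
```

With one class pinned at zero, the macro average is capped at 0.75 even when the other three classes are perfect.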

[Chart: Per-class F1, spectral remap (4-class, imbalanced)]

Behavior probes

Does it behave the way the design claims?

These aren't pass/fail accuracy tests; they stress-test whether the engine moves the way an honest evidence-weigher should. Each probe is run live, and its data drives the charts below.

Hindsight collapse

The same claim scored at a historical "as-of" date should fall as the story unravels and the evidence turns. Each line connects the backtested score to today's score for a once-hyped venture that later failed.

[Chart: Backtest score → today]
[Chart: Affirmation (overwhelmingly-true claims)]
[Chart: Forecasts, real spread (2026–2030, leakage-free)]

Context

Where this sits against the literature

The bar chart compares like-for-like (AVeriTeC golden). This table adds the broader landscape — including systems measured on different benchmarks, so read across rows with care.

System | Approach | Reported | Source
Superthesis | 3-role debate + golden evidence (spectral remap) | see above | AVeriTeC dev
DebateCV | 2 debaters + moderator (closest analog) | 83.4% golden · 73.6% retrieved | WWW 2026
AVeriTeC SOTA (TUDA_MAI) | retrieval-tuned pipeline (shared-task winner) | 63% AVeriTeC score | FEVER 2024
PROClaim | Courtroom + progressive RAG | 81.7% Check-COVID | 2026
Single model + RAG | one strong LLM, one retrieval pass | 85.8% Check-COVID* | PROClaim baseline

* The single-model+RAG number is the honesty anchor: on clean benchmarks a capable single model can match or beat elaborate debate on raw accuracy. Our case is calibration, stability and auditability. Full source list in RESEARCH_SOURCES.md (84 papers).

Reproduce

Run the numbers yourself

Every chart on this page is rendered from timestamped JSON written by the harness — not hand-entered. From the repo root, with provider keys loaded:

# internal calibration set — 8 claims, 3 runs each
python benchmark/run_benchmark.py --n 3

# AVeriTeC stratified-100, golden evidence, then backfill spectral calibration
python benchmark/averitec/run_averitec.py --n 100 --workers 4
python benchmark/averitec/analyze_spectral.py

# behavior probes (live retrieval)
python benchmark/demo/run_demos.py --workers 4