Documentation · updated Jun 2026

How the Superthesis engine works

Superthesis turns a claim into a calibrated, source-cited confidence score by running a structured adversarial debate between specialized LLM agents over retrieval-augmented rounds. This page documents the product and the engine end to end — the layered architecture, the product design, the system that runs it, the debate pipeline stage by stage, and how the whole thing is benchmarked.

Overview

Give the engine a falsifiable proposition — "Nokia's manufacturing scale is a durable competitive moat" — and it returns a number between 0 and 100 with an uncertainty interval, a stance-by-stance breakdown, and the sources that moved the score. It does this not with a single model pass but by staging an argument: one agent builds the strongest case for the claim, another the strongest case against, and a moderator adjudicates on evidence quality.

The unifying primitive is the round — a retrieval-augmented exchange where each side argues from freshly retrieved evidence and the moderator decides whether the dispute is settled or needs another pass. Everything below is organized around that unit.

What it outputs. Either a spectral score (a continuous 0–100 reading rather than a binary label) with a calibrated interval, or a verdict (Supported / Refuted / Incomplete). Both come with per-sub-claim arguments, tiered source citations, and an explicit "what would change this score" analysis.

Why a debate engine

A single strong model with good retrieval can already answer "is this claim true?" reasonably well. Debate earns its added cost in three specific places a single pass struggles with: surfacing the strongest counter-case (not just the model's first instinct), calibration (a defensible number, not a confident guess), and an auditable trail a human can contest. The published research is blunt that debate does not reliably beat a single model on raw accuracy, so we don't claim it does. We claim robustness on noisy evidence and traceability.

Approach | Counter-case | Calibration | Audit trail
Single LLM judge | Model's first take | Uncalibrated | None
Naive RAG + LLM | Retrieval-dependent | Weak | Citations only
Superthesis | Adversarially constructed | Anchored + measured | Full transcript + sources
When not to use it. For a clean, well-documented factual lookup, a single model + retrieval is cheaper and just as good. Reach for debate when the claim is contested, the evidence is noisy or conflicting, and the reasoning needs to be defensible — not just the answer.

Core product layers

Superthesis is built as a stack of independent layers, each with a single responsibility. The separation is deliberate: retrieval is centralized so every agent argues from one evidence pool, models can be swapped per role without touching the debate logic, and calibration is measured as its own property rather than baked into the engine. The layers are listed from the application at the top down to the reasoning substrate at the bottom:

  • L6 · Application & API (UI + service). The web app (run console, live trace, verdict/transcript, drift chart, queue, settings) plus the SSE debate endpoint and a REST API over runs, claims, and history. Every page shares one design system.
  • L5 · Persistence & drift (SQLite, WAL). Every run is stored with its score, interval, cost, and full result blob; an append-only series records how a claim's score moves over time and across backtest dates.
  • L4 · Structural analysis (reasoning map UI in progress). Surfaces the claim's underlying assumptions, first principles, critical path, and external dependencies. The data ships in every result today (see Output schema); the dedicated reasoning-map view and export are in progress.
  • L3 · Adjudication & calibration (moderator + score). The Final Moderator turns the debate into a calibrated output: a spectral 0–100 score with an interval, or a Supported / Refuted / Incomplete verdict, anchored to a labeled calibration set. Includes the suitability check and an optional moderator ensemble.
  • L2 · Debate engine (MAD orchestration). The multi-agent-debate (MAD) scaffold: MECE (mutually-exclusive, collectively-exhaustive) decomposition, two parallel debaters across two rounds, a moderator between them, and an anti-sycophancy protocol, all run on a thread pool with a parallel-execution self-check.
  • L1 · Evidence (retrieval). Centralized web retrieval with T1/T2 source tiering, a query-keyed cache for reproducibility, progressive per-round retrieval, and temporal scrubbing that bounds evidence to a backtest date.
  • L0 · Reasoning substrate (models). The LLMs each agent runs on. Homogeneous by default (one Claude model for every role) and optionally heterogeneous, routing each role to a different model family through one vendor-agnostic call layer.
Cross-cutting: security & cost. Provider keys are encrypted at rest (Fernet), every mutation is CSRF-protected, and a two-part cost guard — a global concurrency cap plus a per-IP hourly limit — bounds spend on the public endpoint. Token cost is accounted at each model's true rate and attached to every run.

Product design

The product has a thesis of its own: a contested claim deserves a defensible answer, not just a confident one. Every surface is built to make the reasoning inspectable and the number honest. Superthesis is positioned as a verdict engine — adjacent to but distinct from fact-checking (binary, media-focused), research tools (no verdict), and BI/decision platforms (no evidence adjudication). The category it claims is claim intelligence.

Design principles

A handful of opinionated stances, each grounded in the multi-agent-debate literature, shape the whole product:

  • Calibration over confidence. The moderator "synthesizes the evidence — not the rhetoric." A 70 should mean what a careful analyst means by 70, and the interval is shown as genuine uncertainty.
  • Honesty about what debate buys. The product does not claim raw-accuracy SOTA — the research is blunt that debate doesn't reliably beat a single strong model. It claims robustness on noisy evidence and an auditable trail: "a traceable, contestable reasoning artifact, not a magic accuracy number."
  • Fixed adversarial roles. The case-for side always argues for the claim, the case-against always against, the moderator always judges — never reassigned mid-run (the app labels these Prosecutor / Defender / Moderator). Dynamic role assignment is rejected because it regresses on verification tasks.
  • A debater's self-confidence is never a trust signal. Both sides over-claim victory most of the time, so a debater's implied score is an input to the moderator, never surfaced as the answer.
  • Evidence over identity. The moderator judges anonymized, order-randomized arguments, so it weighs the case — not which agent or slot produced it.
  • Honesty about limits is a feature. Suitability warnings, self-referential-bias detection, and an explicit "what would change this" analysis ship as first-class surfaces.

User-facing surfaces

Surface | What it does
Run console | Claim box, Spectral/Verdict mode toggle, an optional "analyze as of" backtest date, document-context attach, and a claim queue that auto-fires through a batch.
Live trace | A server-sent execution log that streams each pass as it runs, parallel passes visually marked, with a sticky "debate is running" banner that survives navigation.
Verdict / transcript | Score hero with an interval bar, per-sub-claim cards, the expandable For/Against debate grid with tiered sources, the moderator's decisive-source callout, and a cost panel.
Drift chart | Score-over-time for a claim across historical and backtest runs; the heart of the monitoring loop.
History & settings | Source-filtered run history (app / bench / eval / demo / forecast) and a settings surface for models, per-agent routing, keys, and debate parameters.

Design system

One canonical stylesheet is the source of truth for color, type, radius, and motion; every page links it first. The visual language is editorial-meets-instrument-panel:

  • Three type families — Newsreader (serif, for the claim and the big score, an authoritative editorial register), Inter (UI), and JetBrains Mono (labels, metadata, source tiers).
  • Light/dark, set before paint — the theme is applied from localStorage before first paint to avoid a flash, with a shared toggle on every page.
  • The spectral score — a large number colored along a 0–100 scale, paired with a confidence-interval bar; Verdict mode swaps it for categorical pills.
  • A consistent semantic palette — Supported green, Refuted red, Incomplete amber for verdicts; fixed agent colors (Prosecutor red, Defender blue, Moderator violet) carried from the live trace through the transcript into settings.
One engine, several products. The same calibration-and-debate core is designed to power more than the analyst-facing app — a continuous assumption-monitor for executives, and a consumer score-drift watch ("set a thesis, get notified when the score moves") are on the roadmap. The docs here describe the engine and the shipped app.

Architecture at a glance

The topology is a supervisor pattern: a Moderator adjudicates two adversarial debaters, over a retrieval-augmented round loop. Retrieval is centralized — the orchestrator runs the searches and injects the evidence as text — so every agent argues from the same evidence pool and disagreement reflects reasoning, not search luck. (The engine calls the two debaters Case For and Case Against; the app UI labels them Prosecutor and Defender.)

Claim (a falsifiable proposition)
→ Decompose into sub-claims (MECE split · 1 retrieval directive each)
→ Centralized retrieval (web search · T1/T2 source tiers · cache)
→ Round 1, parallel: Case For (strongest case TRUE) ∥ Case Against (strongest case FALSE)
→ R1 Moderator (issues the Round-2 retrieval directive; progressive RAG)
→ Round 2, parallel, new evidence: Case For R2 (addresses rebuttal) ∥ Case Against R2 (addresses rebuttal)
→ Final Moderator (judges anonymized, order-randomized arguments)
→ Score + interval + citations (0–100 · calibrated band · tiered sources)

Agent | Role | Constraint
Case For | Builds the strongest evidence-grounded case that the claim is true. | Must argue from retrieved evidence; honest implied score.
Case Against | Builds the strongest case that the claim is false. | Same evidence pool; anti-sycophancy protocol.
R1 Moderator | Identifies the decisive dispute after Round 1 and writes the Round-2 search query. | Deterministic (temperature 0).
Final Moderator | Weighs evidence quality over rhetoric and produces the score, interval, and synthesis. | Sees identity-stripped, order-randomized arguments.

System architecture

The topology above describes what argues; this section describes how it runs. Superthesis is a single Flask application that streams a debate to the browser over Server-Sent Events while orchestrating parallel agents on a thread pool, with SQLite for persistence and a query-keyed retrieval cache for reproducibility.

Stack

Concern | Choice
Web framework | Flask 3.1 · gunicorn 23 (gthread)
Reasoning | anthropic SDK (default) · openai · google-genai (optional)
Retrieval | Anthropic web_search tool (server-side)
Persistence | SQLite (WAL) via stdlib sqlite3, no ORM
Secrets | cryptography / Fernet, keys encrypted at rest
Runtime | Python 3.12

Request lifecycle & SSE

A debate is requested at GET /debate_stream (the browser's EventSource) or POST /debate (curl/API). Both run the same producer/consumer bridge: a background worker thread runs the engine and pushes progress events into a queue; the request thread drains the queue and yields each item as an SSE frame, until a sentinel closes the stream with a final result event. The finished run is persisted on the worker thread, so a database hiccup never breaks the stream.

Browser (EventSource → /debate_stream)
→ Worker thread (run_medium() emits step events)
→ Queue (progress sink)
→ Request thread drains queue → SSE frames (event: step · … · event: result)
→ save_run() → SQLite (persisted after the stream completes)
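
The producer/consumer bridge described above can be sketched with the standard library alone. This is a minimal illustration, not the shipped code: `fake_engine`, `stream_debate`, `format_sse`, and `_SENTINEL` are hypothetical names, and the real engine entry point (`run_medium()`) is replaced by a stand-in that emits two step events.

```python
import json
import queue
import threading

_SENTINEL = object()  # hypothetical marker telling the consumer the run is over


def format_sse(event, data):
    """Serialize one Server-Sent Events frame."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def fake_engine(claim, emit):
    """Stand-in for the real engine: emits two step events, returns a result."""
    emit("step", {"stage": "decomposition"})
    emit("step", {"stage": "round_1"})
    return {"claim": claim, "overall_score": 2}


def stream_debate(claim, engine=fake_engine):
    """Run the engine on a worker thread and yield SSE frames as they arrive."""
    events = queue.Queue()

    def worker():
        try:
            result = engine(claim, lambda ev, data: events.put((ev, data)))
            events.put(("result", result))
        except Exception as exc:  # failures become a terminal error frame
            events.put(("error", {"message": str(exc)}))
        finally:
            events.put(_SENTINEL)  # always close the stream

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = events.get()  # request thread blocks until the worker pushes
        if item is _SENTINEL:
            break
        yield format_sse(*item)
```

In a Flask view, a generator like this would typically be wrapped as `Response(stream_debate(claim), mimetype="text/event-stream")`. Persisting the run inside `worker()` rather than in the generator is what keeps a database error from breaking the stream.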

Concurrency

Parallelism uses real OS threads (ThreadPoolExecutor), not asyncio. Three regions run concurrently inside a single debate: per-sub-claim Round-1 retrieval, the two Round-1 debaters, and the two Round-2 debaters. A parallel gate records start/end timestamps and asserts the two debaters actually overlapped, surfacing PASS/FAIL on the run. Because the engine owns its own threads, the server runs gthread workers — gevent monkey-patching would conflict.
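
A parallel gate of this kind might look like the sketch below. The function names and the PASS/FAIL return shape are assumptions made for illustration; only the idea (timestamp both debaters, assert their windows overlap) comes from the description above.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed(fn, *args):
    """Run fn and return (result, start, end) using a monotonic clock."""
    start = time.monotonic()
    result = fn(*args)
    return result, start, time.monotonic()


def run_pair_with_gate(case_for, case_against, evidence):
    """Run both debaters on a thread pool and check their time windows overlapped."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        future_for = pool.submit(timed, case_for, evidence)
        future_against = pool.submit(timed, case_against, evidence)
        result_for, start_for, end_for = future_for.result()
        result_against, start_against, end_against = future_against.result()
    # Two intervals overlap when the later start precedes the earlier end.
    overlapped = max(start_for, start_against) < min(end_for, end_against)
    return result_for, result_against, "PASS" if overlapped else "FAIL"
```

A FAIL here would indicate the two calls ran back to back, for example because the pool was sized to one worker.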

Retrieval & caching

Retrieval is centralized on Anthropic's server-side web-search tool, run at temperature 0 so synthesis is deterministic; results come back as SOURCE / PASSAGE blocks. Each query is hashed and cached in SQLite with a 24-hour TTL, so an identical query returns identical evidence — the single biggest lever on run-to-run stability. In backtest mode, three layers bound evidence to the as-of date: a before: operator on every query, a sentence-level scrub of post-cutoff dates and hindsight phrasing, and a Haiku classifier that drops sources written about the period after the fact.
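
The query-keyed cache can be illustrated with a small sketch. The table name `retrieval_cache` and the 24-hour TTL come from this page; the column names, the SHA-256 choice, and the query normalization are assumptions, and `search_fn` stands in for the real web-search call.

```python
import hashlib
import json
import sqlite3
import time

TTL_SECONDS = 24 * 60 * 60  # 24-hour TTL, as described above


def query_key(query):
    """Hash a case- and whitespace-normalized query (normalization is an assumption)."""
    normalized = " ".join(query.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def open_cache(path=":memory:"):
    """Open the cache database and create the table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS retrieval_cache ("
        "query_hash TEXT PRIMARY KEY, evidence TEXT NOT NULL, fetched_at REAL NOT NULL)"
    )
    return conn


def cached_search(conn, query, search_fn, now=None):
    """Return cached evidence if it is younger than the TTL, else search and store."""
    now = time.time() if now is None else now
    key = query_key(query)
    row = conn.execute(
        "SELECT evidence, fetched_at FROM retrieval_cache WHERE query_hash = ?",
        (key,),
    ).fetchone()
    if row and now - row[1] < TTL_SECONDS:
        return json.loads(row[0])
    evidence = search_fn(query)
    conn.execute(
        "INSERT OR REPLACE INTO retrieval_cache VALUES (?, ?, ?)",
        (key, json.dumps(evidence), now),
    )
    conn.commit()
    return evidence
```

Because the key is derived only from the query text, two runs that decompose a claim into the same directives read the same evidence, which is the stability property the paragraph above relies on.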

Persistence

SQLite in WAL mode (write-ahead logging, so worker threads can write while the request reads), six tables, idempotent migrations on startup. Runs store the score, interval, mode, cost, tokens, and the full result JSON; a drift series is appended per run to power the score-over-time chart; deletes are soft. The database path is environment-configurable so it can live on a mounted volume in production.

Table | Holds
claims | One row per distinct claim, with run count and first/last-seen timestamps.
runs | Every debate: score, interval, mode, cost, tokens, as-of date, full result blob, source tag.
drift_reports | Append-only score time-series per claim.
retrieval_cache | Query-hash → evidence, with a 24h TTL.
settings · api_key_store | Workspace settings and Fernet-encrypted provider keys (one per provider).
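
As a rough sketch of the persistence pattern (WAL mode, idempotent migrations, an append-only drift series, soft deletes), assuming a heavily simplified schema: the column names, the `SUPERTHESIS_DB` variable name, and the function signatures are hypothetical, and only two of the six tables are shown.

```python
import os
import sqlite3

# SUPERTHESIS_DB is a hypothetical name; the docs only say the path is configurable.
DB_PATH = os.environ.get("SUPERTHESIS_DB", "superthesis.db")

# Illustrative columns only; the real schema has more fields and six tables.
MIGRATIONS = (
    "CREATE TABLE IF NOT EXISTS runs ("
    "id INTEGER PRIMARY KEY, claim TEXT NOT NULL, score REAL, "
    "result_json TEXT, deleted_at TEXT)",
    "CREATE TABLE IF NOT EXISTS drift_reports ("
    "id INTEGER PRIMARY KEY, claim TEXT NOT NULL, score REAL, recorded_at TEXT)",
)


def connect(path=DB_PATH):
    """Open the database in WAL mode and apply idempotent migrations."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    for statement in MIGRATIONS:
        conn.execute(statement)  # IF NOT EXISTS makes re-running safe
    conn.commit()
    return conn


def save_run(conn, claim, score, result_json, recorded_at):
    """Store the run and append one point to the claim's drift series."""
    cursor = conn.execute(
        "INSERT INTO runs (claim, score, result_json) VALUES (?, ?, ?)",
        (claim, score, result_json),
    )
    conn.execute(
        "INSERT INTO drift_reports (claim, score, recorded_at) VALUES (?, ?, ?)",
        (claim, score, recorded_at),
    )
    conn.commit()
    return cursor.lastrowid


def soft_delete_run(conn, run_id, deleted_at):
    """Mark a run deleted without removing the row."""
    conn.execute("UPDATE runs SET deleted_at = ? WHERE id = ?", (deleted_at, run_id))
    conn.commit()
```

Readers of the run list would then filter on `deleted_at IS NULL`, while the drift series stays intact.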

Provider routing & keys

One vendor-agnostic call() dispatches to Claude, GPT, or Gemini, normalizing token accounting and text extraction across SDKs. Each role resolves to a model from settings; if a selected provider has no key, that role safely falls back to the default rather than failing the run. The Anthropic key comes from the environment; optional OpenAI/Google keys are entered in-app, stored Fernet-encrypted, and billed to the user's own provider account. (Web search is Anthropic-only, so cross-vendor debaters receive evidence as injected text, never as a tool.)
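
The fallback rule (a role whose provider has no key uses the default model instead of failing) could be expressed roughly as follows. The model ids, the prefix table, and the function names are placeholders for illustration, not the engine's actual identifiers.

```python
DEFAULT_MODEL = "claude-default"  # placeholder id, not a real model name

# Hypothetical prefix-to-provider table used only for this sketch.
PROVIDERS = {"claude": "anthropic", "gpt": "openai", "gemini": "google"}


def provider_of(model_id):
    """Map a model id to its provider by prefix."""
    for prefix, provider in PROVIDERS.items():
        if model_id.startswith(prefix):
            return provider
    raise ValueError(f"unknown model family: {model_id}")


def resolve_model(role, role_models, keys):
    """Pick the configured model for a role, or the default if its provider has no key."""
    model = role_models.get(role, DEFAULT_MODEL)
    if keys.get(provider_of(model)):
        return model
    return DEFAULT_MODEL
```

With an Anthropic and an OpenAI key configured, a role routed to a Gemini model would quietly resolve to the default rather than abort the debate.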

Cost guard

A public debate costs real money, so spend is bounded two ways: a global concurrency semaphore (default 3 simultaneous debates) and a per-IP hourly cap (default 20), both environment-tunable. A Budget object charges every model call at its true per-token rate, so the cost attached to each run stays accurate even across a heterogeneous panel.
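
A two-part guard like this can be sketched with a semaphore and a sliding one-hour window per IP. The class and method names are hypothetical and the sliding-window choice is an assumption; the defaults (3 concurrent, 20 per hour) are the ones stated above.

```python
import threading
import time
from collections import defaultdict, deque


class CostGuard:
    """Global concurrency cap plus a per-IP hourly limit (illustrative sketch)."""

    def __init__(self, max_concurrent=3, per_ip_hourly=20):
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._per_ip_hourly = per_ip_hourly
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests
        self._lock = threading.Lock()

    def try_acquire(self, ip, now=None):
        """Return True and take a slot if the request is within both limits."""
        now = time.time() if now is None else now
        with self._lock:
            hits = self._hits[ip]
            while hits and now - hits[0] >= 3600:
                hits.popleft()  # forget requests older than one hour
            if len(hits) >= self._per_ip_hourly:
                return False
            if not self._slots.acquire(blocking=False):
                return False  # too many debates already running
            hits.append(now)
            return True

    def release(self):
        """Free a concurrency slot when the debate finishes."""
        self._slots.release()
```

Checking both limits before any model call is what lets a rejected request cost nothing.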

Endpoints

Route | Purpose
GET /debate_stream · POST /debate | Run a debate, streamed as SSE.
GET /api/runs · /api/runs/<id> | List recent runs; fetch one full run. DELETE soft-deletes.
GET /api/claims · …/history | Grouped claims and a claim's drift series.
GET /api/benchmark · /api/averitec/* | Latest benchmark scorecards and the AVeriTeC sample/results.
GET/POST /api/settings · /apikey | Settings and provider keys (CSRF-protected mutations).

Deployment

Shipped as a Render blueprint: one web service running gunicorn with gthread workers (2 workers × 8 threads, and a 600-second timeout to keep multi-minute SSE debates alive), a 1 GB persistent disk holding the SQLite file so history survives redeploys, and secrets set in the dashboard rather than committed. Each active debate holds one worker thread for its duration, so simultaneous capacity is roughly workers × threads.
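
The worker settings above map onto gunicorn's standard configuration names. A sketch of an equivalent `gunicorn.conf.py`, assuming the values are not already passed as command-line flags in the blueprint:

```python
# gunicorn.conf.py (sketch; the shipped blueprint may set these differently)
worker_class = "gthread"  # threaded workers; the engine manages its own thread pool
workers = 2               # worker processes
threads = 8               # request threads per worker
timeout = 600             # seconds; keeps multi-minute SSE debates alive

# Each active debate holds one request thread for its duration.
max_simultaneous_streams = workers * threads
```

Note that the cost guard's concurrency cap (default 3) is lower than this thread capacity, so in a default configuration the guard is the binding limit on simultaneous debates.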

The pipeline, stage by stage

A run proceeds through nine stages, split into a deterministic setup phase and an adversarial runtime phase.

Setup · deterministic
1. Suitability check

A fast pass decides whether the claim is well-served by debate at all — checking evidence retrievability and flagging claims that are time-sensitive or self-referential. Runs at temperature 0.

in: claim → out: suitable | add_documents | warning
2. Decomposition

The claim is split into mutually-exclusive, collectively-exhaustive sub-claims, each with one precise retrieval directive (the exact search query to run). A MECE self-check retries if the split overlaps or leaves gaps.

in: claim → out: sub-claims[] + retrieval directives[]
3. Round-1 retrieval

The orchestrator executes each directive against live web search, tiering sources (T1 = government/regulatory/primary, T2 = secondary), and — under backtest mode — scrubbing any evidence dated after the as-of date. Results are cached so an identical query returns identical evidence across runs.

in: directives → out: tiered evidence blocks
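
To make the backtest scrubbing concrete, here is a deliberately simplified sketch of a sentence-level date scrub. It only compares four-digit years against the as-of year; the real scrub is described as also handling hindsight phrasing and is followed by a classifier pass, neither of which is attempted here. The function name and regexes are assumptions.

```python
import re

YEAR = re.compile(r"\b(?:19|20)\d{2}\b")
SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")


def scrub_after(text, as_of_year):
    """Drop any sentence that mentions a year later than the as-of year."""
    sentences = SENTENCE_BREAK.split(text.strip())
    kept = [
        sentence
        for sentence in sentences
        if all(int(match.group(0)) <= as_of_year for match in YEAR.finditer(sentence))
    ]
    return " ".join(kept)
```

Sentences with no year at all are kept, which is one reason a second layer is needed to catch after-the-fact writing that carries no explicit date.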
Runtime · adversarial
4. Round 1: Case For ∥ Case Against

Both debaters run in parallel, each producing a structured argument per sub-claim with an implied score, an epistemic-confidence rating, and cited sources. They argue from the same injected evidence.

parallel · in: evidence → out: two R1 arguments
5. R1 Moderator → progressive retrieval

The moderator reads both R1 arguments, identifies the single decisive dispute, and writes one new search query to resolve it. This is progressive RAG: retrieval in Round 2 is conditioned on what the debate actually contested, not fixed up front.

in: R1 args → out: Round-2 retrieval directive
6. Round-2 retrieval

The new directive is executed, expanding the evidence pool with material targeted at the live dispute.

in: R2 directive → out: additional evidence
7. Round 2: rebuttal, in parallel

Each side argues again with the new evidence and the opponent's R1 argument in view, explicitly addressing the rebuttal. Position changes must be justified by specific new evidence, not rhetorical pressure.

parallel · in: R1 + new evidence → out: two R2 arguments
8. Final adjudication

The Final Moderator receives all four arguments — anonymized and in randomized order so it weighs evidence, not which side or slot produced it — and synthesizes a per-sub-claim verdict, an overall score, the decisive sources, and a "what would change this" analysis.

in: 4 args (anonymized) → out: structured verdict
9. Score, interval & citations

The overall score is assembled, an uncertainty interval is computed (empirically when moderator self-consistency is on, otherwise from sub-claim dispersion), a tail constraint caps unsupported high scores, and the cited evidence is attached.

out: score · interval · sub-claims · citations

Trace one claim

Here is a real run of "The Earth is flat" in spectral mode, abbreviated:

run · The Earth is flat
# Stage 2 — decomposition (3 sub-claims, each with a directive)
SC1 measured geometry  → "WGS84 oblate spheroid GPS geodetic survey"
SC2 horizon/curvature   → "observed Earth curvature high-altitude measurement"
SC3 consensus of science→ "scientific consensus Earth shape NASA NOAA"

# Stage 4 — Round 1, parallel
Case For    implied_score 2   # concedes no evidence supports flat geometry
Case Against implied_score 2   # NGA + NOAA define Earth as oblate spheroid (T1)

# Stage 8 — final moderator (anonymized order: A=FALSE, B=TRUE)
referee_synthesis: "Case against is decisively supported by two T1
  government sources; the case-for concedes the core point."

# Stage 9 — output
overall_score: 2      interval: 2–2      verdict: refuted

The same claim, re-run, reliably returns ~2 — the deterministic setup stages (decomposition, retrieval) are pinned and cached, while the debaters' natural-temperature variance stays small enough that the interval barely moves. That repeatability is the point of the stabilization work described under Engine internals.

Key features

  • Adversarial multi-perspective reasoning — a dedicated agent constructs the strongest counter-case, so the score reflects the best argument on each side, not the model's first instinct.
  • Progressive, per-round retrieval — Round-2 evidence is targeted at the dispute the debate actually surfaced, not fixed up front.
  • Tiered, cited evidence — every argument carries sources tagged T1 (primary/regulatory) or T2 (secondary), and the verdict names the decisive source.
  • Calibrated confidence with an interval — the score is anchored to a human calibration set and reported with an uncertainty band, not as a bare number.
  • Spectral and verdict modes — a 0–100 score for analytical claims, or Supported / Refuted / Incomplete for binary ones.
  • Backtest mode — score a claim "as of" a historical date with temporally-bounded retrieval, to chart how the answer drifts as evidence accumulates.
  • Heterogeneous models (optional) — assign a different model family to each agent for genuine reasoning diversity.

Calibration & stability

A confidence score is only useful if it is both calibrated (a 70 means roughly what a careful analyst would call 70) and stable (the same claim returns the same answer). We treat these as first-class, measured properties — see the live benchmark.

Calibration

The engine is anchored against a labeled set spanning the range — a flat-earth claim should floor near 2, a well-supported regulatory fact should land in the 80s. Calibration is graded as mean absolute error against those anchors.

Stability

Run-to-run variance was the engine's biggest early weakness — an identical claim could swing ~30 points. The fix was to make the deterministic scaffolding actually deterministic: the decomposition and retrieval-synthesis passes run at temperature 0, and an identical search query returns cached evidence. That collapsed run-to-run σ from roughly 12 points to low single digits. The debaters themselves stay at natural temperature — diversity of argument is the point of debate — so a small, claim-dependent residual variance remains and is reported honestly in the interval.

Do. Read the score as calibrated confidence given the retrieved evidence, and read the interval as genuine uncertainty.
Don't. Read a 60 as "60% of experts agree." It is the engine's evidence-weighted confidence, not a poll.

Benchmarking harness

Calibration and stability are only credible if they are measured, so the engine ships with a benchmarking harness — three independent sub-harnesses under benchmark/. All of them call the engine in-process (no HTTP, to avoid SSE/timeout fragility), tag each run with a source so benchmark runs never pollute product history, and write timestamped JSON that the live benchmark page renders automatically.

Harness | Set | Mode | Measures
run_benchmark.py | 8 hand-anchored claims | spectral | Calibration, stability, discrimination, cost
run_averitec.py | AVeriTeC, stratified 100 | verdict | Accuracy & macro-F1 vs. a public benchmark
run_demos.py | Demo + expansion sets | spectral | Behavior: hindsight, leakage, forecasting

Metrics, as implemented

Metric | Definition
Stability σ | Population std-dev of the score across N runs of the same claim; reported as the mean and worst-claim σ.
Calibration MAE | Mean absolute error of the run median against the claim's hand-set anchor.
Discrimination AUROC | Mann–Whitney AUROC over binary claims: P(a random true claim outscores a random false one).
Accuracy / macro-F1 | For AVeriTeC: exact-match accuracy and 4-class macro-F1 against gold labels.
Brier score | On the Supported/Refuted subset, mean squared error of score/100 vs. outcome (forecasting-native calibration).
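
The definitions above translate almost directly into code. The sketch below follows them as written; the function names are illustrative, and counting ties as half a win in the AUROC is the usual Mann–Whitney convention rather than something this page specifies.

```python
import statistics


def stability_sigma(scores):
    """Population standard deviation of one claim's scores across runs."""
    return statistics.pstdev(scores)


def calibration_mae(runs_by_claim, anchors):
    """Mean absolute error of each claim's run median against its anchor."""
    errors = [
        abs(statistics.median(runs_by_claim[claim]) - anchor)
        for claim, anchor in anchors.items()
    ]
    return sum(errors) / len(errors)


def auroc(true_scores, false_scores):
    """P(a random true claim outscores a random false one); ties count as 0.5."""
    wins = 0.0
    for t in true_scores:
        for f in false_scores:
            if t > f:
                wins += 1.0
            elif t == f:
                wins += 0.5
    return wins / (len(true_scores) * len(false_scores))


def brier(scores, outcomes):
    """Mean squared error of score/100 against 0/1 outcomes."""
    return sum((s / 100 - o) ** 2 for s, o in zip(scores, outcomes)) / len(scores)
```

A score of 50 on every claim gives a Brier score of 0.25, which is why 0.25 serves as the uninformed baseline.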

Headline results

Internal calibration set (8 claims × 2 runs):

  • Mean run-to-run σ: 0.81 pp (down from a ~12pp pre-stabilization baseline)
  • Calibration MAE against the anchors: 4.06 pp
  • Discrimination AUROC (true vs. false): 1.00
  • Directional accuracy on binary claims: 5/5

AVeriTeC, stratified-100, golden-evidence condition:

  • Accuracy: 0.79 (spectral remap); 0.74 raw verdict-mode
  • Macro-F1, 4 classes: 0.486 (spectral remap); 0.451 raw verdict
  • Brier score: 0.074 (baseline 0.25; strong)
  • DebateCV comparator (WWW 2026): 83.4%; ~0.61 majority baseline

AVeriTeC, honestly

The AVeriTeC run uses the golden-evidence condition — the engine is given the benchmark's gold Q&A and is not allowed live retrieval — which makes it directly comparable to the structurally similar DebateCV system (83.4% golden-evidence, WWW 2026). The sample is a seeded, stratified 100 drawn from the dev set, so the label mix is preserved. One ceiling is structural and disclosed: AVeriTeC has a fourth Conflicting Evidence / Cherrypicking class that the engine's three-way verdict cannot emit — that class alone is ~8% of the sample and an automatic miss, which more than accounts for the ~4pp we trail DebateCV. A fixed-threshold remap of the spectral score (≥50 Supported, ≤30 Refuted) recovers ~5pp over the raw verdict, and a leakage-free 5-fold cross-validation confirms that gain is not threshold overfitting.
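
The fixed-threshold remap is small enough to show in full. The two thresholds are the ones quoted above; the label returned for the middle band (scores strictly between 30 and 50) is an assumption, as is the function name.

```python
def remap_spectral(score, supported_at=50, refuted_at=30):
    """Map a 0-100 spectral score onto a three-way verdict with fixed thresholds."""
    if score >= supported_at:
        return "supported"
    if score <= refuted_at:
        return "refuted"
    return "incomplete"  # assumed label for the band between the thresholds
```

Because the thresholds are fixed rather than fitted per fold, the remap has no tunable parameters to overfit, which is the property the cross-validation check is meant to confirm.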

Behavior probes

The demo set is not pass/fail — it stress-tests whether the engine behaves the way the design claims it should:

  • Hindsight collapse. The same claim scored at a historical date vs. today should fall as the story unravels — Theranos 12→4, WeWork 16→5, Enron 42→8.
  • Leakage probe. An ablation that scores a claim from the model's prior alone (no evidence) lands at 5 for Theranos, below the scrubbed 2015 backtest (12). The gap is the test: if they matched, the backtest would be memory, not evidence.
  • Affirmation. Overwhelmingly-true claims should score high, not just skeptically low — smoking→cancer 97, 2008→subprime 82 (mean ~87), confirming the engine can affirm, not only doubt.
  • Forecasts. Forward-looking 2026–2030 claims (provably leakage-free) show a real spread rather than a flat prior.

Reproduce a run

from the repo root, with provider keys loaded
# internal calibration set — 8 claims, 3 runs each
python benchmark/run_benchmark.py --n 3

# AVeriTeC stratified-100, golden evidence, then backfill spectral calibration
python benchmark/averitec/run_averitec.py --n 100 --workers 4
python benchmark/averitec/analyze_spectral.py

# behavior probes (live retrieval)
python benchmark/demo/run_demos.py --workers 4
Read these as honest, small-sample numbers. The internal set is 8 claims (not a public leaderboard), AVeriTeC results are N=1 per claim, and a sixth dimension — faithfulness (does the verdict follow from the cited evidence?) — is planned but not yet scored. The figures above are a June 2026 snapshot; the benchmark page renders the latest live, and results are regenerated whenever the engine changes.

Engine internals

Four design decisions, each grounded in the 2024–2026 multi-agent-debate literature (full source list in RESEARCH_SOURCES.md):

Heterogeneous models per role

The most-replicated finding in the field is that homogeneous debate — every agent the same model — is barely distinguishable from sampling one model several times. The gain comes from model diversity. The engine routes each role to a model by id (Claude / GPT / Gemini), so the debaters can bring genuinely different priors. Retrieval stays centralized, so cross-vendor agents still argue from one evidence pool.

Moderator anonymization

Judges defer to which agent said something far more than they should. Before the Final Moderator sees the arguments, their role labels are stripped and their order is randomized; the for/against mapping is re-attached only after the verdict, for display. This removes identity-anchoring and positional bias.
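
A minimal sketch of the blinding step, assuming each argument is a dict with `role` and `text` keys (a hypothetical shape). It covers label stripping and order randomization only; scrubbing role names that appear inside the argument text itself is not handled here.

```python
import random


def anonymize(arguments, rng=None):
    """Shuffle arguments and replace role labels with neutral letters.

    Returns the blinded arguments and a letter-to-role mapping that is kept
    aside and re-attached only after the moderator has produced its verdict.
    """
    rng = rng or random.Random()
    shuffled = list(arguments)
    rng.shuffle(shuffled)
    labels = [chr(ord("A") + index) for index in range(len(shuffled))]
    blinded = [
        {"label": label, "text": argument["text"]}
        for label, argument in zip(labels, shuffled)
    ]
    mapping = {label: argument["role"] for label, argument in zip(labels, shuffled)}
    return blinded, mapping
```

The moderator prompt is built from `blinded` alone; `mapping` never enters the model call.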

Retrieval stabilization

Temperature-0 decomposition and retrieval synthesis plus a query-keyed evidence cache make the same claim reproduce the same evidence — the single biggest lever on run-to-run stability.

Moderator self-consistency (optional)

For high-stakes claims, the Final Moderator can be resampled, taking the median score and deriving the interval from the observed spread, with adaptive early-stopping when samples agree. Off by default — the moderator was empirically the stable stage, so the resampling budget is better spent elsewhere.
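
One way to picture the resampling loop is the sketch below. The stopping rule (stop once at least three samples agree within two points), the sample cap, and the use of the min/max range as the interval are all assumptions for illustration; the page only specifies a median score, a spread-derived interval, and adaptive early stopping.

```python
import statistics


def resample_moderator(sample_fn, max_samples=5, min_samples=3, agree_within=2.0):
    """Resample the moderator, stopping early once the samples agree."""
    scores = []
    for _ in range(max_samples):
        scores.append(sample_fn())
        spread = max(scores) - min(scores)
        if len(scores) >= min_samples and spread <= agree_within:
            break  # samples agree closely enough; save the remaining budget
    return {
        "score": statistics.median(scores),
        "interval": {"low": min(scores), "high": max(scores)},
        "samples": len(scores),
    }
```

When the moderator is as stable as reported, the loop usually exits at the minimum sample count, so the option costs little on easy claims.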

Output schema

Every run returns a structured object. The fields that matter:

Field | Type | Meaning
overall_score | 0–100 | Calibrated evidence-weighted confidence the claim is true.
interval | {low,high} | Uncertainty band around the score.
overall_verdict | enum or null | supported / refuted / incomplete (verdict mode).
sub_claims[] | object[] | Per sub-claim: case_for, case_against, referee_synthesis, decisive_source.
what_would_change | object | Evidence that would drive the score toward 0 or 100, and where to find it.
structural_analysis | object | First principles, the critical path, and external dependencies of the claim.
_usage | object | Tokens and cost for the run.

Spectral mode populates overall_score + interval; verdict mode populates overall_verdict. Each sub-claim names its decisive source and tier (T1 / T2).
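
As a small consumer-side illustration, the helpers below read a result using only the field names listed above. The helper names are hypothetical, and treating `decisive_source` as a plain value (rather than a nested object) is an assumption.

```python
def headline(result):
    """One-line summary: the verdict in verdict mode, else the score with its band."""
    verdict = result.get("overall_verdict")
    if verdict:
        return verdict.capitalize()
    band = result["interval"]
    return f"{result['overall_score']} ({band['low']} to {band['high']})"


def decisive_sources(result):
    """Collect the decisive source named by each sub-claim."""
    return [sub_claim["decisive_source"] for sub_claim in result.get("sub_claims", [])]
```

Checking `overall_verdict` first lets the same code handle both modes without a separate mode flag.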

API & quickstart

Score a claim by POSTing to /debate. The endpoint streams Server-Sent Events with live stage progress, then a final result.

POST /debate
curl -N -X POST https://superthesis.ai/debate \
  -H "Content-Type: application/json" \
  -d '{"claim": "The SEC requires public companies to file a 10-K.",
       "mode": "spectral"}'

Parameters: claim (required), mode (spectral | verdict), context (optional supporting text), as_of_date (optional ISO date for backtest mode). Per-agent models, moderator samples, and retrieval caching are configured in Settings.
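
For a Python client, a standard-library sketch might look like this. It assumes each SSE frame carries a JSON `data:` payload and that the terminal events are named `result` and `error`; `parse_sse` and `score_claim` are illustrative names, not part of a published SDK.

```python
import json
import urllib.request


def parse_sse(lines):
    """Yield (event, data) pairs from an iterable of decoded SSE lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":  # a blank line terminates the current frame
            if data:
                yield event, json.loads("\n".join(data))
            event, data = "message", []
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].lstrip())


def score_claim(base_url, claim, mode="spectral"):
    """POST a claim to /debate and return the payload of the final result event."""
    body = json.dumps({"claim": claim, "mode": mode}).encode("utf-8")
    request = urllib.request.Request(
        base_url.rstrip("/") + "/debate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        decoded = (line.decode("utf-8") for line in response)
        for event, data in parse_sse(decoded):
            if event == "result":
                return data
            if event == "error":
                raise RuntimeError(data)
    raise RuntimeError("stream ended without a result event")
```

Intermediate `step` events are ignored here; a UI would render them as the live trace instead.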

Authentication, errors & limits

  • Auth. The engine runs against your own provider key (set server-side); mutating /api routes require a double-submit CSRF token. There is no public bearer-token API yet — treat /debate as a self-hosted endpoint.
  • Errors. Because the stream has already opened, engine failures arrive as a terminal "event: error" SSE frame rather than an HTTP status; an unsuitable claim streams a suitability warning before any debate runs.
  • Limits. The cost guard caps concurrent debates (default 3) and per-IP volume (default 20/hour); over either cap the request is rejected before any spend. Response shape is under Output schema.
Cost. A medium-effort run averages ≈$0.49 (about $0.40–0.60 on the calibration set), billed to your own provider API key. Retrieval caching makes re-runs of the same claim materially cheaper.

FAQ

Isn't this just an LLM-as-judge?

No. An LLM judge is a single pass that scores whatever it's handed. Superthesis constructs an adversarial case on each side from retrieved evidence, refreshes evidence based on the live dispute, and adjudicates anonymized arguments — producing a contestable transcript and a calibrated score, not a one-shot opinion.

How is the score calibrated?

Against a labeled anchor set spanning the 0–100 range, graded as mean absolute error. Calibration is published on the benchmark page and re-checked whenever the engine changes.

What happens with conflicting or missing sources?

Conflicting evidence is exactly where debate helps — each side marshals its strongest support and the moderator weighs tiers. With thin evidence, the suitability check flags it, arguments carry low epistemic confidence, and the interval widens to reflect real uncertainty.

Is it deterministic?

The setup stages are: they run at temperature 0 and their retrieval is cached, so the evidence and structure reproduce. The debaters run at natural temperature by design, leaving a small residual variance that the interval reports.

How many rounds?

Two, fixed — Round 1, a progressive-retrieval step, then Round 2 — which the literature supports as the practical optimum; more rounds before adjudication tend to reduce accuracy. (A high-effort third round is on the roadmap.)

Limitations & known gaps

The honest caveats, collected in one place:

  • Debate isn't an accuracy cheat code. On clean, well-documented facts a single model + retrieval matches it for less — debate earns its cost on contested, noisy claims (see Why a debate engine).
  • Small evaluation samples. The calibration set is 8 claims; AVeriTeC is N=1 per claim. The numbers are indicative, not leaderboard-grade.
  • Faithfulness isn't scored yet. Whether the verdict strictly follows from the cited evidence is the next metric to ship; until then the decisive source is surfaced for manual check.
  • Residual run-to-run variance. The debaters run at natural temperature, so an identical claim can move a few points between runs — the interval reports it.
  • Retrieval is web-search-bounded. Evidence quality is capped by what web search returns; thin or paywalled topics widen the interval and lower epistemic confidence.
  • Backtests aren't leakage-proof. Temporal scrubbing reduces hindsight, but a model's parametric memory of famous events can't be fully removed — which is exactly what the leakage probe measures.

See also: the benchmark methodology & live ratings, the per-claim AVeriTeC predictions, and the research source log (84 papers).