How the Superthesis engine works
Superthesis turns a claim into a calibrated, source-cited confidence score by running a structured adversarial debate between specialized LLM agents over retrieval-augmented rounds. This page documents the product and the engine end to end — the layered architecture, the product design, the system that runs it, the debate pipeline stage by stage, and how the whole thing is benchmarked.
Overview
Give the engine a falsifiable proposition — "Nokia's manufacturing scale is a durable competitive moat" — and it returns a number between 0 and 100 with an uncertainty interval, a stance-by-stance breakdown, and the sources that moved the score. It does this not with a single model pass but by staging an argument: one agent builds the strongest case for the claim, another the strongest case against, and a moderator adjudicates on evidence quality.
The unifying primitive is the round — a retrieval-augmented exchange where each side argues from freshly retrieved evidence and the moderator decides whether the dispute is settled or needs another pass. Everything below is organized around that unit.
Why a debate engine
A single strong model with good retrieval can already answer "is this claim true?" reasonably well. Debate earns its added cost in three specific places where a single pass struggles: surfacing the strongest counter-case (not just the model's first instinct), calibration (a defensible number, not a confident guess), and an auditable trail a human can contest. The published research is blunt that debate does not reliably beat a single model on raw accuracy, so we don't claim it does. We claim robustness on noisy evidence and traceability.
| Approach | Counter-case | Calibration | Audit trail |
|---|---|---|---|
| Single LLM judge | Model's first take | Uncalibrated | None |
| Naive RAG + LLM | Retrieval-dependent | Weak | Citations only |
| Superthesis | Adversarially constructed | Anchored + measured | Full transcript + sources |
Core product layers
Superthesis is built as a stack of independent layers, each with a single responsibility. The separation is deliberate: retrieval is centralized so every agent argues from one evidence pool, models can be swapped per role without touching the debate logic, and calibration is measured as its own property rather than baked into the engine. The layers run from the reasoning substrate at the bottom up to the application.
Product design
The product has a thesis of its own: a contested claim deserves a defensible answer, not just a confident one. Every surface is built to make the reasoning inspectable and the number honest. Superthesis is positioned as a verdict engine — adjacent to but distinct from fact-checking (binary, media-focused), research tools (no verdict), and BI/decision platforms (no evidence adjudication). The category it claims is claim intelligence.
Design principles
A handful of opinionated stances, each grounded in the multi-agent-debate literature, shape the whole product:
- Calibration over confidence. The moderator "synthesizes the evidence — not the rhetoric." A 70 should mean what a careful analyst means by 70, and the interval is shown as genuine uncertainty.
- Honesty about what debate buys. The product does not claim raw-accuracy SOTA — the research is blunt that debate doesn't reliably beat a single strong model. It claims robustness on noisy evidence and an auditable trail: "a traceable, contestable reasoning artifact, not a magic accuracy number."
- Fixed adversarial roles. The case-for side always argues for the claim, the case-against always against, the moderator always judges — never reassigned mid-run (the app labels these Prosecutor / Defender / Moderator). Dynamic role assignment is rejected because it regresses on verification tasks.
- A debater's self-confidence is never a trust signal. Both sides over-claim victory most of the time, so a debater's implied score is an input to the moderator, never surfaced as the answer.
- Evidence over identity. The moderator judges anonymized, order-randomized arguments, so it weighs the case — not which agent or slot produced it.
- Honesty about limits is a feature. Suitability warnings, self-referential-bias detection, and an explicit "what would change this" analysis ship as first-class surfaces.
User-facing surfaces
| Surface | What it does |
|---|---|
| Run console | Claim box, Spectral/Verdict mode toggle, an optional "analyze as of" backtest date, document-context attach, and a claim queue that auto-fires through a batch. |
| Live trace | A server-sent execution log that streams each pass as it runs, parallel passes visually marked, with a sticky "debate is running" banner that survives navigation. |
| Verdict / transcript | Score hero with an interval bar, per-sub-claim cards, the expandable For/Against debate grid with tiered sources, the moderator's decisive-source callout, and a cost panel. |
| Drift chart | Score-over-time for a claim across historical and backtest runs — the heart of the monitoring loop. |
| History & settings | Source-filtered run history (app / bench / eval / demo / forecast) and a settings surface for models, per-agent routing, keys, and debate parameters. |
Design system
One canonical stylesheet is the source of truth for color, type, radius, and motion; every page links it first. The visual language is editorial-meets-instrument-panel:
- Three type families: Newsreader (serif, for the claim and the big score, an authoritative editorial register), Inter (UI), and JetBrains Mono (labels, metadata, source tiers).
- Light/dark, set before paint: the theme is applied from localStorage before first paint to avoid a flash, with a shared toggle on every page.
- The spectral score: a large number colored along a 0–100 scale, paired with a confidence-interval bar; Verdict mode swaps it for categorical pills.
- A consistent semantic palette: Supported green, Refuted red, Incomplete amber for verdicts; fixed agent colors (Prosecutor red, Defender blue, Moderator violet) carried from the live trace through the transcript into settings.
Architecture at a glance
The topology is a supervisor pattern: a Moderator adjudicates two adversarial debaters, over a retrieval-augmented round loop. Retrieval is centralized — the orchestrator runs the searches and injects the evidence as text — so every agent argues from the same evidence pool and disagreement reflects reasoning, not search luck. (The engine calls the two debaters Case For and Case Against; the app UI labels them Prosecutor and Defender.)
| Agent | Role | Constraint |
|---|---|---|
| Case For | Builds the strongest evidence-grounded case that the claim is true. | Must argue from retrieved evidence; honest implied score. |
| Case Against | Builds the strongest case that the claim is false. | Same evidence pool; anti-sycophancy protocol. |
| R1 Moderator | Identifies the decisive dispute after Round 1 and writes the Round-2 search query. | Deterministic (temperature 0). |
| Final Moderator | Weighs evidence quality over rhetoric and produces the score, interval, and synthesis. | Sees identity-stripped, order-randomized arguments. |
System architecture
The topology above describes what argues; this section describes how it runs. Superthesis is a single Flask application that streams a debate to the browser over Server-Sent Events while orchestrating parallel agents on a thread pool, with SQLite for persistence and a query-keyed retrieval cache for reproducibility.
Stack
| Concern | Choice |
|---|---|
| Web framework | Flask 3.1 · gunicorn 23 (gthread) |
| Reasoning | anthropic SDK (default) · openai · google-genai (optional) |
| Retrieval | Anthropic web_search tool (server-side) |
| Persistence | SQLite (WAL) via stdlib sqlite3 — no ORM |
| Secrets | cryptography / Fernet — keys encrypted at rest |
| Runtime | Python 3.12 |
Request lifecycle & SSE
A debate is requested at GET /debate_stream (the browser's EventSource) or POST /debate (curl/API). Both run the same producer/consumer bridge: a background worker thread runs the engine and pushes progress events into a queue; the request thread drains the queue and yields each item as an SSE frame, until a sentinel closes the stream with a final result event. The finished run is persisted on the worker thread, so a database hiccup never breaks the stream.
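The producer/consumer bridge described above can be sketched with the standard library alone. This is a minimal illustration, not the engine's actual code: the function names, the `progress` event name, and the `fake_engine` stand-in are all assumptions made for the example.

```python
import json
import queue
import threading
from typing import Callable, Iterator

_DONE = object()  # sentinel pushed by the worker when the engine has finished


def sse_frame(event: str, data: dict) -> str:
    """Format one Server-Sent Events frame."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def stream_debate(run_engine: Callable[[Callable[[dict], None]], dict]) -> Iterator[str]:
    """Run the engine on a worker thread and yield its progress as SSE frames."""
    events: queue.Queue = queue.Queue()
    outcome: dict = {}

    def worker() -> None:
        try:
            # The engine reports progress by calling events.put(...).
            outcome["result"] = run_engine(events.put)
            # Persisting the run would happen here, on the worker thread.
        except Exception as exc:
            outcome["error"] = str(exc)
        finally:
            events.put(_DONE)

    threading.Thread(target=worker, daemon=True).start()

    # The request thread drains the queue until the sentinel arrives.
    while True:
        item = events.get()
        if item is _DONE:
            break
        yield sse_frame("progress", item)  # "progress" is an assumed event name

    if "error" in outcome:
        yield sse_frame("error", {"message": outcome["error"]})
    else:
        yield sse_frame("result", outcome["result"])


def fake_engine(emit: Callable[[dict], None]) -> dict:
    """Hypothetical stand-in for the real engine."""
    emit({"stage": "decomposition"})
    emit({"stage": "round_1"})
    return {"overall_score": 2}
```

In a Flask view, a generator like `stream_debate(...)` would typically be wrapped in a `Response` with the `text/event-stream` MIME type.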
Concurrency
Parallelism uses real OS threads (ThreadPoolExecutor), not asyncio. Three regions run concurrently inside a single debate: per-sub-claim Round-1 retrieval, the two Round-1 debaters, and the two Round-2 debaters. A parallel gate records start/end timestamps and asserts the two debaters actually overlapped, surfacing PASS/FAIL on the run. Because the engine owns its own threads, the server runs gthread workers — gevent monkey-patching would conflict.
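A parallel gate of this kind could look roughly like the sketch below: both debaters are submitted to a `ThreadPoolExecutor`, each call is wrapped with start/end timestamps, and the gate passes only if the two time spans overlap. The function names and the `slow_debater` stand-in are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Span = tuple[str, float, float]  # (output, start, end)


def timed(fn: Callable[[], str]) -> Span:
    """Run fn and record monotonic start/end timestamps around it."""
    start = time.monotonic()
    output = fn()
    return output, start, time.monotonic()


def run_round(case_for: Callable[[], str], case_against: Callable[[], str]) -> dict:
    """Run both debaters on a thread pool and report whether they overlapped."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        for_future = pool.submit(timed, case_for)
        against_future = pool.submit(timed, case_against)
        for_out, for_start, for_end = for_future.result()
        against_out, against_start, against_end = against_future.result()

    # Two spans overlap when the later start precedes the earlier end.
    overlapped = max(for_start, against_start) < min(for_end, against_end)
    return {
        "case_for": for_out,
        "case_against": against_out,
        "parallel_gate": "PASS" if overlapped else "FAIL",
    }


def slow_debater(text: str) -> Callable[[], str]:
    """Hypothetical debater: the sleep stands in for a model call."""
    def run() -> str:
        time.sleep(0.1)
        return text
    return run


report = run_round(slow_debater("for"), slow_debater("against"))
```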
Retrieval & caching
Retrieval is centralized on Anthropic's server-side web-search tool, run at temperature 0 so synthesis is deterministic; results come back as SOURCE / PASSAGE blocks. Each query is hashed and cached in SQLite with a 24-hour TTL, so an identical query returns identical evidence — the single biggest lever on run-to-run stability. In backtest mode, three layers bound evidence to the as-of date: a before: operator on every query, a sentence-level scrub of post-cutoff dates and hindsight phrasing, and a Haiku classifier that drops sources written about the period after the fact.
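The query-keyed cache can be illustrated with a short sketch using `hashlib` and `sqlite3`. The column names, the choice of SHA-256, and the `search` callable are assumptions for the example; only the table name and the 24-hour TTL come from this page.

```python
import hashlib
import json
import sqlite3
import time
from typing import Callable, Optional

TTL_SECONDS = 24 * 60 * 60  # 24-hour TTL


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    # Column layout is illustrative, not the engine's real schema.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS retrieval_cache ("
        "query_hash TEXT PRIMARY KEY, evidence TEXT NOT NULL, fetched_at REAL NOT NULL)"
    )
    return conn


def query_key(query: str) -> str:
    """Hash the exact query string, so only identical queries share an entry."""
    return hashlib.sha256(query.encode("utf-8")).hexdigest()


def cached_search(
    conn: sqlite3.Connection,
    query: str,
    search: Callable[[str], list],
    now: Optional[float] = None,
) -> list:
    """Return cached evidence if it is fresh, otherwise search and store the result."""
    now = time.time() if now is None else now
    key = query_key(query)
    row = conn.execute(
        "SELECT evidence, fetched_at FROM retrieval_cache WHERE query_hash = ?", (key,)
    ).fetchone()
    if row is not None and now - row[1] < TTL_SECONDS:
        return json.loads(row[0])

    evidence = search(query)  # hypothetical callable wrapping the web-search tool
    conn.execute(
        "INSERT OR REPLACE INTO retrieval_cache VALUES (?, ?, ?)",
        (key, json.dumps(evidence), now),
    )
    conn.commit()
    return evidence
```

Passing `now` explicitly keeps the expiry logic testable without waiting for real time to pass.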
Persistence
SQLite in WAL mode (write-ahead logging, so worker threads can write while the request reads), six tables, idempotent migrations on startup. Runs store the score, interval, mode, cost, tokens, and the full result JSON; a drift series is appended per run to power the score-over-time chart; deletes are soft. The database path is environment-configurable so it can live on a mounted volume in production.
| Table | Holds |
|---|---|
| claims | One row per distinct claim, with run count and first/last-seen timestamps. |
| runs | Every debate: score, interval, mode, cost, tokens, as-of date, full result blob, source tag. |
| drift_reports | Append-only score time-series per claim. |
| retrieval_cache | Query-hash → evidence, with a 24h TTL. |
| settings · api_key_store | Workspace settings and Fernet-encrypted provider keys (one per provider). |
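As a rough illustration of the persistence choices above (WAL mode, idempotent startup migrations, soft deletes, an environment-configurable path), the sketch below uses only stdlib `sqlite3`. The column layout and the `SUPERTHESIS_DB` variable name are hypothetical; the real schema is not shown on this page.

```python
import os
import sqlite3
import tempfile

# Every statement is safe to re-run, which is what makes the migration idempotent.
MIGRATIONS = (
    """CREATE TABLE IF NOT EXISTS runs (
           id INTEGER PRIMARY KEY,
           claim TEXT NOT NULL,
           score REAL,
           result_json TEXT,
           source TEXT NOT NULL DEFAULT 'app',
           deleted_at TEXT
       )""",
    "CREATE INDEX IF NOT EXISTS idx_runs_claim ON runs (claim)",
)


def db_path_from_env() -> str:
    """Read the database path from the environment (variable name is an assumption)."""
    return os.environ.get("SUPERTHESIS_DB", "superthesis.db")


def connect(path: str) -> sqlite3.Connection:
    """Open the database in WAL mode and apply migrations."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    for statement in MIGRATIONS:
        conn.execute(statement)
    conn.commit()
    return conn


def soft_delete(conn: sqlite3.Connection, run_id: int) -> None:
    """Mark a run as deleted without removing the row."""
    conn.execute("UPDATE runs SET deleted_at = datetime('now') WHERE id = ?", (run_id,))
    conn.commit()


def live_runs(conn: sqlite3.Connection) -> list:
    """List runs that have not been soft-deleted."""
    return conn.execute(
        "SELECT id, claim, score FROM runs WHERE deleted_at IS NULL ORDER BY id"
    ).fetchall()
```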
Provider routing & keys
One vendor-agnostic call() dispatches to Claude, GPT, or Gemini, normalizing token accounting and text extraction across SDKs. Each role resolves to a model from settings; if a selected provider has no key, that role safely falls back to the default rather than failing the run. The Anthropic key comes from the environment; optional OpenAI/Google keys are entered in-app, stored Fernet-encrypted, and billed to the user's own provider account. (Web search is Anthropic-only, so cross-vendor debaters receive evidence as injected text, never as a tool.)
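The fallback rule can be sketched in a few lines. The `provider:model` id format, the role names, and the placeholder model ids below are assumptions for illustration only.

```python
DEFAULT_MODEL = "anthropic:default-model"  # placeholder id, not a real model name


def provider_of(model_id: str) -> str:
    """Extract the provider prefix from an assumed 'provider:model' id."""
    return model_id.split(":", 1)[0]


def resolve_model(role: str, routing: dict, providers_with_keys: set) -> str:
    """Pick the model for a role, falling back to the default if its provider has no key."""
    chosen = routing.get(role, DEFAULT_MODEL)
    if provider_of(chosen) in providers_with_keys:
        return chosen
    return DEFAULT_MODEL


routing = {
    "case_for": "openai:some-model",      # hypothetical ids
    "case_against": "google:some-model",
}
available = {"anthropic", "openai"}       # no Google key configured
```

With this routing, `case_for` resolves to its OpenAI model, while `case_against` falls back to the default because no Google key is present.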
Cost guard
A public debate costs real money, so spend is bounded two ways: a global concurrency semaphore (default 3 simultaneous debates) and a per-IP hourly cap (default 20), both environment-tunable. A Budget object charges every model call at its true per-token rate, so the cost attached to each run stays accurate even across a heterogeneous panel.
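One way to combine the two bounds is a non-blocking semaphore plus a per-IP window of timestamps, as in the sketch below. The class name and the sliding-window interpretation of "hourly" are assumptions; the engine may count per clock hour instead.

```python
import threading
import time
from collections import defaultdict, deque
from typing import Optional


class CostGuard:
    """Bound spend with a global concurrency cap and a per-IP hourly cap."""

    def __init__(self, max_concurrent: int = 3, per_ip_hourly: int = 20) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._per_ip_hourly = per_ip_hourly
        self._hits: dict = defaultdict(deque)
        self._lock = threading.Lock()

    def try_acquire(self, ip: str, now: Optional[float] = None) -> bool:
        """Return True and take a slot if both caps allow a new debate."""
        now = time.time() if now is None else now
        with self._lock:
            hits = self._hits[ip]
            # Drop timestamps that have aged out of the one-hour window.
            while hits and now - hits[0] >= 3600:
                hits.popleft()
            if len(hits) >= self._per_ip_hourly:
                return False
            if not self._slots.acquire(blocking=False):
                return False
            hits.append(now)
            return True

    def release(self) -> None:
        """Free a concurrency slot when a debate finishes."""
        self._slots.release()
```

Rejecting in `try_acquire` before any model call is what keeps an over-cap request from incurring spend.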
Endpoints
| Route | Purpose |
|---|---|
| GET /debate_stream · POST /debate | Run a debate, streamed as SSE. |
| GET /api/runs · /api/runs/<id> | List recent runs; fetch one full run. DELETE soft-deletes. |
| GET /api/claims · …/history | Grouped claims and a claim's drift series. |
| GET /api/benchmark · /api/averitec/* | Latest benchmark scorecards and the AVeriTeC sample/results. |
| GET/POST /api/settings · /apikey | Settings and provider keys (CSRF-protected mutations). |
Deployment
Shipped as a Render blueprint: one web service running gunicorn with gthread workers (2 workers × 8 threads, and a 600-second timeout to keep multi-minute SSE debates alive), a 1 GB persistent disk holding the SQLite file so history survives redeploys, and secrets set in the dashboard rather than committed. Each active debate holds one worker thread for its duration, so simultaneous capacity is roughly workers × threads.
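Gunicorn reads its settings from a Python file, so the worker layout above could be expressed as a `gunicorn.conf.py` along these lines. The worker class, counts, and timeout mirror the figures stated above; the bind address and the `PORT` fallback are assumptions.

```python
# gunicorn.conf.py (sketch)
import os

worker_class = "gthread"  # threaded workers; the engine manages its own thread pool
workers = 2
threads = 8
timeout = 600             # seconds, long enough for multi-minute SSE debates

# Assumption: the platform provides the port through the PORT environment variable.
bind = f"0.0.0.0:{os.environ.get('PORT', '8000')}"

# Each active debate holds one worker thread for its duration.
approx_simultaneous_streams = workers * threads  # 16 with the values above
```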
The pipeline, stage by stage
A run proceeds through nine stages, split into a deterministic setup phase and an adversarial runtime phase.
Suitability check
A fast pass decides whether the claim is well-served by debate at all — checking evidence retrievability and flagging claims that are time-sensitive or self-referential. Runs at temperature 0.
Decomposition
The claim is split into mutually-exclusive, collectively-exhaustive sub-claims, each with one precise retrieval directive (the exact search query to run). A MECE self-check retries if the split overlaps or leaves gaps.
Round-1 retrieval
The orchestrator executes each directive against live web search, tiering sources (T1 = government/regulatory/primary, T2 = secondary), and — under backtest mode — scrubbing any evidence dated after the as-of date. Results are cached so an identical query returns identical evidence across runs.
Round 1 — Case For ∥ Case Against
Both debaters run in parallel, each producing a structured argument per sub-claim with an implied score, an epistemic-confidence rating, and cited sources. They argue from the same injected evidence.
R1 Moderator → progressive retrieval
The moderator reads both R1 arguments, identifies the single decisive dispute, and writes one new search query to resolve it. This is progressive RAG: retrieval in Round 2 is conditioned on what the debate actually contested, not fixed up front.
Round-2 retrieval
The new directive is executed, expanding the evidence pool with material targeted at the live dispute.
Round 2 — rebuttal, in parallel
Each side argues again with the new evidence and the opponent's R1 argument in view, explicitly addressing the rebuttal. Position changes must be justified by specific new evidence, not rhetorical pressure.
Final adjudication
The Final Moderator receives all four arguments — anonymized and in randomized order so it weighs evidence, not which side or slot produced it — and synthesizes a per-sub-claim verdict, an overall score, the decisive sources, and a "what would change this" analysis.
Score, interval & citations
The overall score is assembled, an uncertainty interval is computed (empirically when moderator self-consistency is on, otherwise from sub-claim dispersion), a tail constraint caps unsupported high scores, and the cited evidence is attached.
Trace one claim
Here is a real run of "The Earth is flat" in spectral mode, abbreviated:
```text
# Stage 2: decomposition (3 sub-claims, each with a directive)
SC1 measured geometry    → "WGS84 oblate spheroid GPS geodetic survey"
SC2 horizon/curvature    → "observed Earth curvature high-altitude measurement"
SC3 consensus of science → "scientific consensus Earth shape NASA NOAA"

# Stage 4: Round 1, parallel
Case For      implied_score 2   # concedes no evidence supports flat geometry
Case Against  implied_score 2   # NGA + NOAA define Earth as oblate spheroid (T1)

# Stage 8: final moderator (anonymized order: A=FALSE, B=TRUE)
referee_synthesis: "Case against is decisively supported by two T1 government sources; the case-for concedes the core point."

# Stage 9: output
overall_score: 2
interval: 2–2
verdict: refuted
```
The same claim, re-run, reliably returns ~2 — the deterministic setup stages (decomposition, retrieval) are pinned and cached, while the debaters' natural-temperature variance stays small enough that the interval barely moves. That repeatability is the point of the stabilization work described under Engine internals.
Key features
- Adversarial multi-perspective reasoning — a dedicated agent constructs the strongest counter-case, so the score reflects the best argument on each side, not the model's first instinct.
- Progressive, per-round retrieval — Round-2 evidence is targeted at the dispute the debate actually surfaced, not fixed up front.
- Tiered, cited evidence — every argument carries sources tagged T1 (primary/regulatory) or T2 (secondary), and the verdict names the decisive source.
- Calibrated confidence with an interval — the score is anchored to a human calibration set and reported with an uncertainty band, not as a bare number.
- Spectral and verdict modes — a 0–100 score for analytical claims, or Supported / Refuted / Incomplete for binary ones.
- Backtest mode — score a claim "as of" a historical date with temporally-bounded retrieval, to chart how the answer drifts as evidence accumulates.
- Heterogeneous models (optional) — assign a different model family to each agent for genuine reasoning diversity.
Calibration & stability
A confidence score is only useful if it is both calibrated (a 70 means roughly what a careful analyst would call 70) and stable (the same claim returns the same answer). We treat these as first-class, measured properties — see the live benchmark.
Calibration
The engine is anchored against a labeled set spanning the 0–100 range: a flat-earth claim should floor near 2, and a well-supported regulatory fact should land in the 80s. Calibration is graded as mean absolute error against those anchors.
Stability
Run-to-run variance was the engine's biggest early weakness — an identical claim could swing ~30 points. The fix was to make the deterministic scaffolding actually deterministic: the decomposition and retrieval-synthesis passes run at temperature 0, and an identical search query returns cached evidence. That collapsed run-to-run σ from roughly 12 points to low single digits. The debaters themselves stay at natural temperature — diversity of argument is the point of debate — so a small, claim-dependent residual variance remains and is reported honestly in the interval.
Benchmarking harness
Calibration and stability are only credible if they are measured, so the engine ships with a benchmarking harness — three independent sub-harnesses under benchmark/. All of them call the engine in-process (no HTTP, to avoid SSE/timeout fragility), tag each run with a source so benchmark runs never pollute product history, and write timestamped JSON that the live benchmark page renders automatically.
| Harness | Set | Mode | Measures |
|---|---|---|---|
| run_benchmark.py | 8 hand-anchored claims | spectral | Calibration, stability, discrimination, cost |
| run_averitec.py | AVeriTeC, stratified 100 | verdict | Accuracy & macro-F1 vs. a public benchmark |
| run_demos.py | Demo + expansion sets | spectral | Behavior: hindsight, leakage, forecasting |
Metrics, as implemented
| Metric | Definition |
|---|---|
| Stability σ | Population std-dev of the score across N runs of the same claim; reported as the mean and worst-claim σ. |
| Calibration MAE | Mean absolute error of the run median against the claim's hand-set anchor. |
| Discrimination AUROC | Mann–Whitney AUROC over binary claims — P(a random true claim outscores a random false one). |
| Accuracy / macro-F1 | For AVeriTeC: exact-match accuracy and 4-class macro-F1 against gold labels. |
| Brier score | On the Supported/Refuted subset, mean squared error of score/100 vs. outcome — forecasting-native calibration. |
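The metric definitions in the table translate almost directly into code. The sketch below is an illustrative reading of those definitions using the standard library, not the harness source; function names are hypothetical and the sample numbers in the usage lines are made up for the example, not benchmark results.

```python
import statistics


def stability_sigma(scores: list) -> float:
    """Population standard deviation of the scores from N runs of one claim."""
    return statistics.pstdev(scores)


def calibration_mae(runs_by_claim: dict, anchors: dict) -> float:
    """Mean absolute error of each claim's run median against its anchor."""
    errors = [abs(statistics.median(runs_by_claim[c]) - anchors[c]) for c in anchors]
    return statistics.fmean(errors)


def auroc(true_scores: list, false_scores: list) -> float:
    """Mann-Whitney AUROC: P(random true claim outscores random false one), ties count half."""
    wins = sum(
        1.0 if t > f else 0.5 if t == f else 0.0
        for t in true_scores
        for f in false_scores
    )
    return wins / (len(true_scores) * len(false_scores))


def brier(scores: list, outcomes: list) -> float:
    """Mean squared error of score/100 against a 0/1 outcome."""
    return statistics.fmean([(s / 100 - o) ** 2 for s, o in zip(scores, outcomes)])


# Illustrative inputs only.
example_runs = {"a": [2, 4], "b": [80, 84]}
example_anchors = {"a": 2, "b": 85}
example_mae = calibration_mae(example_runs, example_anchors)  # medians 3 and 82
```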
Headline results
Internal calibration set (8 claims × 2 runs):
AVeriTeC, stratified-100, golden-evidence condition:
AVeriTeC, honestly
The AVeriTeC run uses the golden-evidence condition — the engine is given the benchmark's gold Q&A and is not allowed live retrieval — which makes it directly comparable to the structurally similar DebateCV system (83.4% golden-evidence, WWW 2026). The sample is a seeded, stratified 100 drawn from the dev set, so the label mix is preserved. One ceiling is structural and disclosed: AVeriTeC has a fourth Conflicting Evidence / Cherrypicking class that the engine's three-way verdict cannot emit — that class alone is ~8% of the sample and an automatic miss, which more than accounts for the ~4pp we trail DebateCV. A fixed-threshold remap of the spectral score (≥50 Supported, ≤30 Refuted) recovers ~5pp over the raw verdict, and a leakage-free 5-fold cross-validation confirms that gain is not threshold overfitting.
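The fixed-threshold remap mentioned above amounts to a small function. The two thresholds are the ones stated in the text; mapping scores that fall between them to Incomplete is an assumption, since the page does not say what the remap does with that band.

```python
def remap_verdict(score: float, supported_at: float = 50, refuted_at: float = 30) -> str:
    """Map a 0-100 spectral score to a categorical verdict with fixed thresholds."""
    if score >= supported_at:
        return "supported"
    if score <= refuted_at:
        return "refuted"
    return "incomplete"  # assumption: the middle band maps to Incomplete
```

Because the thresholds are fixed rather than fitted per fold, the remap has no parameters that could leak from a test fold.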
Behavior probes
The demo set is not pass/fail — it stress-tests whether the engine behaves the way the design claims it should:
- Hindsight collapse. The same claim scored at a historical date vs. today should fall as the story unravels — Theranos 12→4, WeWork 16→5, Enron 42→8.
- Leakage probe. An ablation that scores a claim from the model's prior alone (no evidence) lands at 5 for Theranos, against 12 for the scrubbed 2015 backtest. The gap is the test: if they matched, the backtest would be memory, not evidence.
- Affirmation. Overwhelmingly-true claims should score high, not just skeptically low — smoking→cancer 97, 2008→subprime 82 (mean ~87), confirming the engine can affirm, not only doubt.
- Forecasts. Forward-looking 2026–2030 claims (provably leakage-free) show a real spread rather than a flat prior.
Reproduce a run
```shell
# internal calibration set: 8 claims, 3 runs each
python benchmark/run_benchmark.py --n 3

# AVeriTeC stratified-100, golden evidence, then backfill spectral calibration
python benchmark/averitec/run_averitec.py --n 100 --workers 4
python benchmark/averitec/analyze_spectral.py

# behavior probes (live retrieval)
python benchmark/demo/run_demos.py --workers 4
```
Engine internals
Four design decisions, each grounded in the 2024–2026 multi-agent-debate literature (full source list in RESEARCH_SOURCES.md):
Heterogeneous models per role
The most-replicated finding in the field is that homogeneous debate — every agent the same model — is barely distinguishable from sampling one model several times. The gain comes from model diversity. The engine routes each role to a model by id (Claude / GPT / Gemini), so the debaters can bring genuinely different priors. Retrieval stays centralized, so cross-vendor agents still argue from one evidence pool.
Moderator anonymization
Judges defer to which agent said something far more than they should. Before the Final Moderator sees the arguments, their role labels are stripped and their order is randomized; the for/against mapping is re-attached only after the verdict, for display. This removes identity-anchoring and positional bias.
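A minimal sketch of the strip-shuffle-reattach step follows. The role keys, the letter labels, and the function names are hypothetical; the point is only that the moderator sees neutral labels in random order while the mapping is held back for display.

```python
import random


def anonymize(arguments: dict, rng: random.Random) -> tuple:
    """Strip role labels and randomize order.

    Returns (blinded, mapping): blinded maps neutral labels to argument text,
    mapping maps the same labels back to the original roles.
    """
    items = list(arguments.items())
    rng.shuffle(items)
    labels = [chr(ord("A") + i) for i in range(len(items))]
    blinded = {label: text for label, (_, text) in zip(labels, items)}
    mapping = {label: role for label, (role, _) in zip(labels, items)}
    return blinded, mapping


def reattach(verdict_by_label: dict, mapping: dict) -> dict:
    """Restore role names after the moderator has ruled, for display only."""
    return {mapping[label]: verdict for label, verdict in verdict_by_label.items()}


arguments = {
    "case_for_r1": "argument text 1",       # hypothetical role keys
    "case_against_r1": "argument text 2",
    "case_for_r2": "argument text 3",
    "case_against_r2": "argument text 4",
}
blinded, mapping = anonymize(arguments, random.Random())
```

Only `blinded` would be passed to the moderator prompt; `mapping` stays on the orchestrator side until the verdict is back.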
Retrieval stabilization
Temperature-0 decomposition and retrieval synthesis plus a query-keyed evidence cache make the same claim reproduce the same evidence — the single biggest lever on run-to-run stability.
Moderator self-consistency (optional)
For high-stakes claims, the Final Moderator can be resampled, taking the median score and deriving the interval from the observed spread, with adaptive early-stopping when samples agree. Off by default — the moderator was empirically the stable stage, so the resampling budget is better spent elsewhere.
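Resampling with early stopping could be sketched as below. Every parameter here (`max_samples`, `min_samples`, `agree_within`) and the use of the min/max range as the interval are assumptions chosen for illustration; the page does not specify the engine's actual stopping rule or interval formula.

```python
import statistics
from typing import Callable


def self_consistent_score(
    sample: Callable[[], float],
    max_samples: int = 5,
    min_samples: int = 3,
    agree_within: float = 2.0,
) -> tuple:
    """Resample the moderator; return (median score, (low, high), samples used)."""
    scores = []
    for _ in range(max_samples):
        scores.append(sample())
        # Adaptive early stop: enough samples and they already agree closely.
        if len(scores) >= min_samples and max(scores) - min(scores) <= agree_within:
            break
    return statistics.median(scores), (min(scores), max(scores)), len(scores)
```

Here `sample` stands in for one Final Moderator call that returns a score.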
Output schema
Every run returns a structured object. The fields that matter:
| Field | Type | Meaning |
|---|---|---|
| overall_score | 0–100 | Calibrated evidence-weighted confidence the claim is true. |
| interval | {low,high} | Uncertainty band around the score. |
| overall_verdict | enum|null | supported / refuted / incomplete (verdict mode). |
| sub_claims[] | object[] | Per sub-claim: case_for, case_against, referee_synthesis, decisive_source. |
| what_would_change | object | Evidence that would drive the score toward 0 or 100, and where to find it. |
| structural_analysis | object | First principles, the critical path, and external dependencies of the claim. |
| _usage | object | Tokens and cost for the run. |
Spectral mode populates overall_score + interval; verdict mode populates overall_verdict. Each sub-claim names its decisive source and tier (T1 / T2).
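A consumer of this object might pick its headline from whichever field the mode populated. The helper below is a hypothetical sketch that uses only field names from the table above.

```python
def headline(result: dict) -> str:
    """Summarize a run: the verdict if one is set, otherwise the score and its interval."""
    verdict = result.get("overall_verdict")
    if verdict is not None:
        return f"verdict: {verdict}"
    band = result["interval"]
    return f"score: {result['overall_score']} ({band['low']}-{band['high']})"


spectral_result = {
    "overall_score": 2,
    "interval": {"low": 2, "high": 2},
    "overall_verdict": None,
}
```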
API & quickstart
Score a claim by POSTing to /debate. The endpoint streams Server-Sent Events with live stage progress, then a final result.
```shell
curl -N -X POST https://superthesis.ai/debate \
  -H "Content-Type: application/json" \
  -d '{"claim": "The SEC requires public companies to file a 10-K.", "mode": "spectral"}'
```
Parameters: claim (required), mode (spectral | verdict), context (optional supporting text), as_of_date (optional ISO date for backtest mode). Per-agent models, moderator samples, and retrieval caching are configured in Settings.
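For callers who prefer Python to curl, the stream can be consumed with the standard library. The sketch below assumes that each frame's data payload is JSON and that the base URL points at your own deployment; `parse_sse` and `open_debate` are hypothetical helper names, and progress event names are not documented here.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[tuple]:
    """Yield (event, data) pairs from an SSE line stream; a blank line ends a frame."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if data:
                yield event, json.loads("\n".join(data))  # assumes JSON payloads
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment or keep-alive line
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].lstrip(" "))


def open_debate(base_url: str, claim: str, mode: str = "spectral") -> Iterator[tuple]:
    """POST a claim to /debate and yield parsed events as they arrive."""
    body = json.dumps({"claim": claim, "mode": mode}).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/debate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        yield from parse_sse(line.decode("utf-8") for line in response)
```

A caller would loop over `open_debate(...)`, print intermediate events, and stop at the `result` or `error` event.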
Authentication, errors & limits
- Auth. The engine runs against your own provider key (set server-side); mutating /api routes require a double-submit CSRF token. There is no public bearer-token API yet, so treat /debate as a self-hosted endpoint.
- Errors. Because the stream has already opened, engine failures arrive as a terminal event: error SSE frame rather than an HTTP status; an unsuitable claim streams a suitability warning before any debate runs.
- Limits. The cost guard caps concurrent debates (default 3) and per-IP volume (default 20/hour); over either cap the request is rejected before any spend. Response shape is under Output schema.
FAQ
Isn't this just an LLM-as-judge?
No. An LLM judge is a single pass that scores whatever it's handed. Superthesis constructs an adversarial case on each side from retrieved evidence, refreshes evidence based on the live dispute, and adjudicates anonymized arguments — producing a contestable transcript and a calibrated score, not a one-shot opinion.
How is the score calibrated?
Against a labeled anchor set spanning the 0–100 range, graded as mean absolute error. Calibration is published on the benchmark page and re-checked whenever the engine changes.
What happens with conflicting or missing sources?
Conflicting evidence is exactly where debate helps — each side marshals its strongest support and the moderator weighs tiers. With thin evidence, the suitability check flags it, arguments carry low epistemic confidence, and the interval widens to reflect real uncertainty.
Is it deterministic?
The setup stages are (temperature-0 + cached), so the evidence and structure reproduce. The debaters run at natural temperature by design, leaving a small residual variance that the interval reports.
How many rounds?
Two, fixed — Round 1, a progressive-retrieval step, then Round 2 — which the literature supports as the practical optimum; more rounds before adjudication tend to reduce accuracy. (A high-effort third round is on the roadmap.)
Limitations & known gaps
The honest caveats, collected in one place:
- Debate isn't an accuracy cheat code. On clean, well-documented facts a single model + retrieval matches it for less — debate earns its cost on contested, noisy claims (see Why a debate engine).
- Small evaluation samples. The calibration set is 8 claims; AVeriTeC is N=1 per claim. The numbers are indicative, not leaderboard-grade.
- Faithfulness isn't scored yet. Whether the verdict strictly follows from the cited evidence is the next metric to ship; until then the decisive source is surfaced for manual check.
- Residual run-to-run variance. The debaters run at natural temperature, so an identical claim can move a few points between runs — the interval reports it.
- Retrieval is web-search-bounded. Evidence quality is capped by what web search returns; thin or paywalled topics widen the interval and lower epistemic confidence.
- Backtests aren't leakage-proof. Temporal scrubbing reduces hindsight, but a model's parametric memory of famous events can't be fully removed — which is exactly what the leakage probe measures.
See also: the benchmark methodology & live ratings, the per-claim AVeriTeC predictions, and the research source log (84 papers).