One Engine, Every Argument

Ask a large language model whether a hard claim is true and it will tell you. Confidently. Instantly. Often wrongly. The number it hands back is the most persuasive part of the answer and the least trustworthy — one voice, mistaking fluency for knowledge.

We started superthesis from a different premise: the truth of a contested claim isn't a fact to be retrieved, it's an argument to be had. So instead of one model answering, we run three that disagree — a Thesis that defends the claim, an Antithesis that attacks it, and a Synthesis that scores the evidence on quality alone, blind to which side argued better. The output isn't an opinion. It's a calibrated reading from 0 to 1, every source cited, every step you can walk.

That engine works. But the question that actually kept us up at night wasn't how to debate one claim. It was the harder one: how do you build something that can debate anything?

We're not building a debate tool. We're building a reasoning substrate — with a debate interface on top.The through-line

01 — The trap everyone falls into

Don't scale by field. Scale by type.

The obvious plan is to go domain by domain: a module for medicine, one for law, one for finance, markets, politics. That project never finishes — there is always one more field. It's a treadmill disguised as a roadmap.

The escape is to notice that domains are the wrong axis entirely. A claim about gravitational lensing and a claim about interest rates can be the same kind of claim — "X causes Y" — and they demand the same machinery: confounders, mechanism, a counterfactual. Their subject matter is worlds apart. Their epistemic shape is identical. So we route on the type of claim, not its field. Six types span the space:

/ classify the claim → route to the right debate

Empirical

"X happened."

→ evidence + source quality

Causal

"X causes Y."

→ confounders · mechanism · counterfactual

Predictive

"X will happen."

→ base rates · scenarios · calibration

Normative

"X is good / we ought to."

→ surface the value premise

Definitional

"X counts as Y."

→ stipulation & usage

Comparative

"X beats Y."

→ estimation · decompose

Physics and politics share claim types. Get the typology right and "works across all fields" stops being a roadmap — it becomes a side effect.

The normative row is the one most systems botch: they try to win a values question with facts. The honest move is to detect that the disagreement is about a value weighting, make that premise explicit, and answer conditionally — "if you weight liberty over equality here, then…" An engine has to know when a question isn't the kind that evidence settles.

02 — The atomic move

An argument isn't a sentence. It's a graph.

Underneath the debate sits a finer structure. Every argument decomposes — borrowing from Stephen Toulmin — into evidence (grounds), the claim it supports, and the silent joint between them: the warrant, the unspoken "and therefore." Most bad reasoning hides exactly there, in the inference nobody says out loud. So we make the warrant a node you can attack.

Edges are typed by what they attack — supports, rebuts (the claim), undercuts (the warrant), undermines (the evidence). The same three attacks work in epidemiology or constitutional law. That's what makes one representation carry every debate.

03 — The decision we'll defend to the end

Two questions. Never one.

Here's the part most engines get wrong. They answer with a single number, which quietly fuses two completely different questions: what survives the argument, and how confident we should be. We keep them in separate layers — and that separation is what lets the engine be honestly uncertain.

Structure first — which arguments are acceptable. Magnitude second — how much belief the survivors warrant. Two questions, two layers, never conflated.

Why does this matter so much? Because a claim can be fiercely contested and still land at exactly 0.5 — and "we're confident it's a coin flip" is a completely different message from "we have no idea." A single number can't tell them apart. Two layers can. It's also why a fourth verdict — conflicting evidence — needs to be expressible rather than flattened into a shrug.

An engine that always answers is an engine that sometimes lies.Principle zero

04 — Why it works everywhere

A universal grammar of attack.

Here's the trick that lets one engine red-team a vaccine study and a Fed decision with the same code. Arguments aren't infinite snowflakes — they fall into recurring patterns. The philosopher Douglas Walton catalogued roughly sixty: appeal to expert opinion, cause-to-effect, analogy, sign, slippery slope. And each pattern ships with its own built-in objections — its critical questions. Detect the pattern, fire the questions, and you've generated the strongest principled attack automatically.

"A renowned economist says the policy will work."

scheme detected · appeal to expert opinion

Is the source a genuine expert in this specific field?
Is the quote accurate, and in context?
Are experts in the field actually in agreement?
Is the source biased or financially interested?
Is the claim consistent with the broader evidence?

No domain logic written. The same five questions interrogate an expert on epidemiology and one on monetary policy. Add a scheme once, and every field gets sharper at the same time.

05 — One job becomes nine

Truth-scoring is just the start.

When we mapped what people actually want, a second axis appeared. "Score whether this is true" turns out to be one job among many. The same topic becomes a different machine depending on your goal — a verdict, a list of risks, a sparring partner, a map of positions. The full model isn't claim type alone. It's claim type × intent. Intent resolves into nine modes. Pick a real task and watch where it routes:

Same substrate underneath every one — only the role set and the final output change. That's the whole bet: build the substrate field-agnostic, and breadth is free.

Tally every use case by its claim type and the routing logic becomes visible — and falsifiable. Two things jump out of the grid below: every mode has a dominant claim type (so a detected type can pre-select the likely mode), and Definitional is never a primary driver — definitions ride inside other claims rather than spawning debates of their own.

Coverage matrix — interaction mode × primary claim type, across the 66 cases that carry a single primary type (of 68 mapped; the two proposed wrappers have none). Bold cells are each mode's center of gravity. The empty Definitional column is a design finding, not an oversight.

interaction modes

use cases mapped

industries spanned

shared substrate

06 — The line we hold

Calibrated, not loud.

All of this rests on a few commitments we enforce by construction, not good intentions. Score the evidence blind to who argued more persuasively. Prize calibration over raw accuracy — a 0.5 that means 0.5 is worth more than a confident label that's wrong. Trace every node to a source, so the verdict is a graph you can walk, not a vibe. And protect the right to say not enough evidence — because the alternative is an engine that fills silence with confidence.

The verdict is a graph you can walk — not a vibe.On provenance

None of this is a rebuild from zero. The seven-stage pipeline we run today already contains most of it, implicitly. The work ahead is to make the implicit structure explicit and configurable — to let anyone compose a debate the way they'd sketch one on a whiteboard, and have the engine actually run it. That's the suite we're building.

You're reading the essential version. Want the full picture — the two-layer mechanics, the seven commitments we won't break, the guardrails on high-stakes use, the intellectual lineage, and the full 68-case register? Switch to the in-depth read (about 24 minutes).

07 — The line we hold, in full

Seven commitments, enforced by construction.

A reasoning engine is only as trustworthy as the rules it can't talk itself out of. These aren't aspirations in a doc — they're constraints built into the structure, so the engine can't violate them even when it would be convenient.

Separate evidence quality from rhetorical force.: The final referee scores the evidence anonymized and shuffled — blind to which side argued better. It's the difference between judging and being persuaded.
Calibration over accuracy.: A well-spread, honest score beats a confident label. We reward the engine for saying 0.5 when it's 0.5 — and grade it on whether a 70 really means 70.
Falsifiability and load-bearingness.: For every verdict, surface the single piece of evidence that would most move it. If nothing would, that isn't confidence — it's dogma.
Adversarial symmetry and steelmanning.: Both sides argue the strongest available case from the same evidence. No straw men. Anti-sycophancy is enforced in the protocol, not left to chance.
Decompose, then recompose.: Break the claim into mutually exclusive, collectively exhaustive sub-claims, score each, recombine. It bounds error and makes the chain auditable end to end.
Provenance is non-negotiable.: Every node traces to a source. The verdict is a graph you can walk, not a vibe — auditability is the moat.
The right to suspend judgment.: "Not enough evidence" is a first-class verdict. An engine that always answers is an engine that sometimes invents.

08 — Under the hood

How a verdict actually resolves.

The two layers from earlier aren't a metaphor; they're two distinct computations. The structural layer borrows from Phan Minh Dung's abstract argumentation frameworks: treat each argument as a node and each attack as an edge, then solve for the set that survives — the grounded extension. When you need structure inside the nodes — strict versus defeasible inference rules, explicit preferences — ASPIC+ supplies it.

Only then does the quantitative layer run: it maps the surviving structure to a calibrated 0–1 score by Bayesian aggregation — a prior or base rate, a likelihood ratio for each piece of standing evidence, a posterior. Because the two run in sequence and never collapse into one another, a claim whose arguments fight to a draw resolves to an honest 0.5, and a genuinely two-sided body of evidence earns a native "conflicting" verdict instead of a misleading average.

09 — Power, handled with care

The use cases that get guardrails.

A tool this general touches decisions where being wrong — or merely overconfident — does real harm. Three constraints are designed in from the start, not bolted on. Loan-borrower eligibility sits under fair-lending law (ECOA, disparate impact): the engine surfaces the strongest case for and against eligibility to assist a human decision — it never renders the lending decision itself. Refining a religious or philosophical belief runs in a mode that is non-coercive by design — it surfaces premises and answers conditionally, clarifying what you believe and why, never "disproving" a faith. And court simulation is a thinking tool, explicitly not legal advice. Honesty isn't only about the verdict. It's about knowing where the engine's authority ends.

10 — Not a rebuild

How this maps to the engine we run today.

Most of this already exists in the seven-stage pipeline we ship — just hardcoded. The work is to make the implicit structure explicit and configurable.

Today (hardcoded)	Becomes	What changes
Suitability classifier	Claim-type router	Classify the type, not just suitability; route to a default topology.
Decomposition (2–4)	Decompose / recompose	Already right in spirit; sub-claim count becomes a knob.
Case For / Against	Typed argument graph	Explicit warrants; attacks typed as rebut / undercut / undermine.
R1 referee "find the gap"	Load-bearing-evidence finder	Generalizes to the double-crux: name the evidence that would flip the verdict.
Final referee (ensemble)	Two-layer adjudication	Structural extension first, then Bayesian magnitude — enabling a native "conflicting" verdict.
Fixed 2 rounds, 4 roles	Serializable debate graph	Rounds and roles become data in a schema — the visual builder's output.

The substrate is already there, implicitly. The redesign is mostly an act of making it nameable, then editable.

11 — The interface

How anyone will build a debate.

If the substrate is a typed graph, the builder is a way to compose it without writing code — and four rules guide it. Typed, constrained blocks: validity is enforced by the type system, so you can't wire an incoherent debate; a gate requires both sides, a structure requires a terminal referee. Meta versus object: you edit the procedure; the engine emits the argument instance for a given claim. Progressive disclosure: classification proposes a structure, and the knobs reveal themselves only when you reach for them. Observability: you watch the argument graph get built and attacked, because the reasoning is the product. And the palette is organized by operation, not by role — a debate is always a paired gate, so it's one block with two lanes, never a lone advocate you could place by mistake.

12 — Standing on shoulders

Where the ideas come from.

None of this is invented from scratch. The anatomy of a single argument is Stephen Toulmin's. The formal question of which arguments survive a web of attacks is Dung's, extended by the computational-argumentation school. The universal attack generator is Douglas Walton's schemes and their critical questions. Calibration, reference classes, and decomposition come from Bayesian epistemology and Philip Tetlock's work on forecasting; the causal machinery from Judea Pearl and the Bradford Hill criteria; the discipline of stating what would change your mind from Karl Popper and Imre Lakatos; the hunt for the operative disagreement from the rationalist "double crux." Full citations sit at the foot of this piece.

One we record with particular care, because this company began with a misattributed quote: the thesis–antithesis–synthesis triad our three agents borrow is Fichte's (1795), codified and pinned on Hegel by Chalybäus in 1843. Hegel himself called such formulae a "lifeless schema." A product built to catch confident misattributions ought to get its own naming right.

13 — The road ahead

What we're building next.

The order of operations is deliberate. Write the claim-type taxonomy as a spec — the six types, their detection signals, the default topology each routes to; it unblocks everything else. Pin the node-and-edge type system beneath the visual builder. Stand up a scheme library, starting with the ten highest-frequency Walton patterns. Prototype the two-layer adjudicator and prove it can emit a native "conflicting" verdict. And publish the debate-graph schema as the contract between builder and engine — the thing that finally replaces the hardcoded flow with something anyone can shape.

The questions we're still arguing — fittingly — among ourselves: how much of the routing should be visible to the user versus automatic? Where is the line between "configurable" and "overwhelming"? And which claim type do we serve first — the empirical case is easiest, but the normative and predictive cases are where everyone else fails and our calibration edge shows most.

go deeperThe full use-case register — 68 cases▶

go deeperWhere the ideas come from — works cited▶

Argumentation — structure & schemes

Toulmin, S. E. (1958). The Uses of Argument. Cambridge University Press.

Walton, D., Reed, C., & Macagno, F. (2008). Argumentation Schemes. Cambridge University Press.

Computational & formal argumentation

Dung, P. M. (1995). On the acceptability of arguments and its fundamental role in nonmonotonic reasoning, logic programming and n-person games. Artificial Intelligence, 77(2), 321–357.

Modgil, S., & Prakken, H. (2014). The ASPIC+ framework for structured argumentation: a tutorial. Argument & Computation, 5(1), 31–62.

Calibration, forecasting & causal inference

Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1), 1–3.

Tetlock, P. E., & Gardner, D. (2015). Superforecasting: The Art and Science of Prediction. Crown.

Pearl, J. (2009). Causality: Models, Reasoning, and Inference (2nd ed.). Cambridge University Press.

Hill, A. B. (1965). The environment and disease: association or causation? Proceedings of the Royal Society of Medicine, 58(5), 295–300.

Philosophy of science & dialectic

Popper, K. R. (1959). The Logic of Scientific Discovery. Hutchinson.

Lipton, P. (2004). Inference to the Best Explanation (2nd ed.). Routledge.

Fichte, J. G. (1795). Grundriss des Eigentümlichen der Wissenschaftslehre. [Origin of the thesis–antithesis–synthesis triad — later misattributed to Hegel by Chalybäus (1843); Hegel himself called it a "lifeless schema." Recorded in the spirit of the product.]

Fact verification & benchmarks

Schlichtkrull, M., Guo, Z., & Vlachos, A. (2023). AVeriTeC: A dataset for real-world claim verification with evidence from the web. NeurIPS 2023. arXiv:2305.13117

Stop trusting. Start verifying.

Point superthesis at any claim, market, or question — and watch it argue both sides, cite every source, and resolve a calibrated signal. Free while The Field is in early access.

Find the signal →Read the research