Ask a large language model whether a hard claim is true and it will tell you. Confidently. Instantly. Often wrongly. The number it hands back is the most persuasive part of the answer and the least trustworthy — one voice, mistaking fluency for knowledge.
We started superthesis from a different premise: the truth of a contested claim isn't a fact to be retrieved, it's an argument to be had. So instead of one model answering, we run three that disagree — a Thesis that defends the claim, an Antithesis that attacks it, and a Synthesis that scores the evidence on quality alone, blind to which side argued better. The output isn't an opinion. It's a calibrated reading from 0 to 1, every source cited, every step you can walk.
That engine works. But the question that actually kept us up at night wasn't how to debate one claim. It was the harder one: how do you build something that can debate anything?
Don't scale by field. Scale by type.
The obvious plan is to go domain by domain: a module for medicine, one for law, one for finance, markets, politics. That project never finishes — there is always one more field. It's a treadmill disguised as a roadmap.
The escape is to notice that domains are the wrong axis entirely. A claim about gravitational lensing and a claim about interest rates can be the same kind of claim — "X causes Y" — and they demand the same machinery: confounders, mechanism, a counterfactual. Their subject matter is worlds apart. Their epistemic shape is identical. So we route on the type of claim, not its field. Six types span the space:
The normative row is the one most systems botch: they try to win a values question with facts. The honest move is to detect that the disagreement is about a value weighting, make that premise explicit, and answer conditionally — "if you weight liberty over equality here, then…" An engine has to know when a question isn't the kind that evidence settles.
An argument isn't a sentence. It's a graph.
Underneath the debate sits a finer structure. Every argument decomposes — borrowing from Stephen Toulmin — into evidence (grounds), the claim it supports, and the silent joint between them: the warrant, the unspoken "and therefore." Most bad reasoning hides exactly there, in the inference nobody says out loud. So we make the warrant a node you can attack.
Two questions. Never one.
Here's the part most engines get wrong. They answer with a single number, which quietly fuses two completely different questions: what survives the argument, and how confident we should be. We keep them in separate layers — and that separation is what lets the engine be honestly uncertain.
Why does this matter so much? Because a claim can be fiercely contested and still land at exactly 0.5 — and "we're confident it's a coin flip" is a completely different message from "we have no idea." A single number can't tell them apart. Two layers can. It's also why a fourth verdict — conflicting evidence — needs to be expressible rather than flattened into a shrug.
A universal grammar of attack.
Here's the trick that lets one engine red-team a vaccine study and a Fed decision with the same code. Arguments aren't infinite snowflakes — they fall into recurring patterns. The philosopher Douglas Walton catalogued roughly sixty: appeal to expert opinion, cause-to-effect, analogy, sign, slippery slope. And each pattern ships with its own built-in objections — its critical questions. Detect the pattern, fire the questions, and you've generated the strongest principled attack automatically.
- Is the source a genuine expert in this specific field?
- Is the quote accurate, and in context?
- Are experts in the field actually in agreement?
- Is the source biased or financially interested?
- Is the claim consistent with the broader evidence?
Truth-scoring is just the start.
When we mapped what people actually want, a second axis appeared. "Score whether this is true" turns out to be one job among many. The same topic becomes a different machine depending on your goal — a verdict, a list of risks, a sparring partner, a map of positions. The full model isn't claim type alone. It's claim type × intent. Intent resolves into nine modes. Pick a real task and watch where it routes:
Tally every use case by its claim type and the routing logic becomes visible — and falsifiable. Two things jump out of the grid below: every mode has a dominant claim type (so a detected type can pre-select the likely mode), and Definitional is never a primary driver — definitions ride inside other claims rather than spawning debates of their own.
Calibrated, not loud.
All of this rests on a few commitments we enforce by construction, not good intentions. Score the evidence blind to who argued more persuasively. Prize calibration over raw accuracy — a 0.5 that means 0.5 is worth more than a confident label that's wrong. Trace every node to a source, so the verdict is a graph you can walk, not a vibe. And protect the right to say not enough evidence — because the alternative is an engine that fills silence with confidence.
None of this is a rebuild from zero. The seven-stage pipeline we run today already contains most of it, implicitly. The work ahead is to make the implicit structure explicit and configurable — to let anyone compose a debate the way they'd sketch one on a whiteboard, and have the engine actually run it. That's the suite we're building.
You're reading the essential version. Want the full picture — the two-layer mechanics, the seven commitments we won't break, the guardrails on high-stakes use, the intellectual lineage, and the full 68-case register? Switch to the in-depth read (about 24 minutes).
Seven commitments, enforced by construction.
A reasoning engine is only as trustworthy as the rules it can't talk itself out of. These aren't aspirations in a doc — they're constraints built into the structure, so the engine can't violate them even when it would be convenient.
- Separate evidence quality from rhetorical force.
- The final referee scores the evidence anonymized and shuffled — blind to which side argued better. It's the difference between judging and being persuaded.
- Calibration over accuracy.
- A well-spread, honest score beats a confident label. We reward the engine for saying 0.5 when it's 0.5 — and grade it on whether a 70 really means 70.
- Falsifiability and load-bearingness.
- For every verdict, surface the single piece of evidence that would most move it. If nothing would, that isn't confidence — it's dogma.
- Adversarial symmetry and steelmanning.
- Both sides argue the strongest available case from the same evidence. No straw men. Anti-sycophancy is enforced in the protocol, not left to chance.
- Decompose, then recompose.
- Break the claim into mutually exclusive, collectively exhaustive sub-claims, score each, recombine. It bounds error and makes the chain auditable end to end.
- Provenance is non-negotiable.
- Every node traces to a source. The verdict is a graph you can walk, not a vibe — auditability is the moat.
- The right to suspend judgment.
- "Not enough evidence" is a first-class verdict. An engine that always answers is an engine that sometimes invents.
How a verdict actually resolves.
The two layers from earlier aren't a metaphor; they're two distinct computations. The structural layer borrows from Phan Minh Dung's abstract argumentation frameworks: treat each argument as a node and each attack as an edge, then solve for the set that survives — the grounded extension. When you need structure inside the nodes — strict versus defeasible inference rules, explicit preferences — ASPIC+ supplies it.
Only then does the quantitative layer run: it maps the surviving structure to a calibrated 0–1 score by Bayesian aggregation — a prior or base rate, a likelihood ratio for each piece of standing evidence, a posterior. Because the two run in sequence and never collapse into one another, a claim whose arguments fight to a draw resolves to an honest 0.5, and a genuinely two-sided body of evidence earns a native "conflicting" verdict instead of a misleading average.
The use cases that get guardrails.
A tool this general touches decisions where being wrong — or merely overconfident — does real harm. Three constraints are designed in from the start, not bolted on. Loan-borrower eligibility sits under fair-lending law (ECOA, disparate impact): the engine surfaces the strongest case for and against eligibility to assist a human decision — it never renders the lending decision itself. Refining a religious or philosophical belief runs in a mode that is non-coercive by design — it surfaces premises and answers conditionally, clarifying what you believe and why, never "disproving" a faith. And court simulation is a thinking tool, explicitly not legal advice. Honesty isn't only about the verdict. It's about knowing where the engine's authority ends.
How this maps to the engine we run today.
Most of this already exists in the seven-stage pipeline we ship — just hardcoded. The work is to make the implicit structure explicit and configurable.
| Today (hardcoded) | Becomes | What changes |
|---|---|---|
| Suitability classifier | Claim-type router | Classify the type, not just suitability; route to a default topology. |
| Decomposition (2–4) | Decompose / recompose | Already right in spirit; sub-claim count becomes a knob. |
| Case For / Against | Typed argument graph | Explicit warrants; attacks typed as rebut / undercut / undermine. |
| R1 referee "find the gap" | Load-bearing-evidence finder | Generalizes to the double-crux: name the evidence that would flip the verdict. |
| Final referee (ensemble) | Two-layer adjudication | Structural extension first, then Bayesian magnitude — enabling a native "conflicting" verdict. |
| Fixed 2 rounds, 4 roles | Serializable debate graph | Rounds and roles become data in a schema — the visual builder's output. |
How anyone will build a debate.
If the substrate is a typed graph, the builder is a way to compose it without writing code — and four rules guide it. Typed, constrained blocks: validity is enforced by the type system, so you can't wire an incoherent debate; a gate requires both sides, a structure requires a terminal referee. Meta versus object: you edit the procedure; the engine emits the argument instance for a given claim. Progressive disclosure: classification proposes a structure, and the knobs reveal themselves only when you reach for them. Observability: you watch the argument graph get built and attacked, because the reasoning is the product. And the palette is organized by operation, not by role — a debate is always a paired gate, so it's one block with two lanes, never a lone advocate you could place by mistake.
Where the ideas come from.
None of this is invented from scratch. The anatomy of a single argument is Stephen Toulmin's. The formal question of which arguments survive a web of attacks is Dung's, extended by the computational-argumentation school. The universal attack generator is Douglas Walton's schemes and their critical questions. Calibration, reference classes, and decomposition come from Bayesian epistemology and Philip Tetlock's work on forecasting; the causal machinery from Judea Pearl and the Bradford Hill criteria; the discipline of stating what would change your mind from Karl Popper and Imre Lakatos; the hunt for the operative disagreement from the rationalist "double crux." Full citations sit at the foot of this piece.
One we record with particular care, because this company began with a misattributed quote: the thesis–antithesis–synthesis triad our three agents borrow is Fichte's (1795), codified and pinned on Hegel by Chalybäus in 1843. Hegel himself called such formulae a "lifeless schema." A product built to catch confident misattributions ought to get its own naming right.
What we're building next.
The order of operations is deliberate. Write the claim-type taxonomy as a spec — the six types, their detection signals, the default topology each routes to; it unblocks everything else. Pin the node-and-edge type system beneath the visual builder. Stand up a scheme library, starting with the ten highest-frequency Walton patterns. Prototype the two-layer adjudicator and prove it can emit a native "conflicting" verdict. And publish the debate-graph schema as the contract between builder and engine — the thing that finally replaces the hardcoded flow with something anyone can shape.
The questions we're still arguing — fittingly — among ourselves: how much of the routing should be visible to the user versus automatic? Where is the line between "configurable" and "overwhelming"? And which claim type do we serve first — the empirical case is easiest, but the normative and predictive cases are where everyone else fails and our calibration edge shows most.
go deeperThe full use-case register — 68 cases▶
go deeperWhere the ideas come from — works cited▶
Stop trusting. Start verifying.
Point superthesis at any claim, market, or question — and watch it argue both sides, cite every source, and resolve a calibrated signal. Free while The Field is in early access.