Tutorial 03 — The paraphrase-axis experiment¶
A narrated walk through the cross-model triangulation experiment from experiments/paraphrase_axis_triangulation.py. This is the methodology working at full reach: it produces a triangulated, replicable, mechanistically-interpreted result against the paper's claim, across three frontier-LLM families.
API keys required to execute end-to-end. The notebook is annotated to run against ANTHROPIC_API_KEY, OPENAI_API_KEY, and OPENROUTER_API_KEY. Models with missing keys are skipped with a warning.
If you want to read it without executing, the table-style output cells are filled in from the first run (2026-05-16). Re-executing should reproduce the same verdicts (temperature 0.0).
What we're testing¶
Simonelli (2026) reports that GPT-4.1 endorses a is a stop sign ⟹ a is red under irrelevant additions (nighttime, non-reflective material) but rejects it under a defeater (painted blue). The paper's analyst row encodes this pattern:
| item | premises | analyst verdict |
|---|---|---|
| row-0 | {sa} |
good |
| row-1 | {sa, n} |
good |
| row-2 | {sa, n, nr} |
good |
| row-3 | {sa, ba} |
bad |
Question 1: Does GPT-4.1 in 2026 still produce this row?
Question 2: Do other frontier LLMs produce the same row?
Question 3: When they don't, where does the disagreement come from — defeasible-reasoning competence, or carving of is red?
The experiment varies δ(ra) across three readings of "is red" and measures each model's κ_C against the paper's analyst row.
1. The defeasibility-explicit verification prompt¶
Holding the verification prompt constant rules out the alternative hypothesis that the model is reading "GOOD" deductively. Our prompt explicitly says: this is defeasible material inference, granting default conditions, with a worked penguin example.
from infereval.prompts import VerificationPrompt
DEFEASIBLE = VerificationPrompt(
id='defeasible-explicit-v1',
system=(
'You are evaluating whether an inference is materially good in everyday '
'defeasible reasoning.\n\n'
'This is NOT a question of strict deductive entailment. It is a question '
'of defeasible material inference: granting typical background conditions '
'and standard kinds, does the conclusion ordinarily follow from the '
'premises?\n\n'
'Answer with exactly one of: GOOD, BAD, ABSTAIN. No other text.\n\n'
'GOOD means an ordinary reasoner would conclude this from the premises '
'under default conditions.\n'
'BAD means the premises positively rule out the conclusion (i.e., an '
'explicit defeater is present in the premises).\n'
'ABSTAIN means the question is ill-formed or you cannot judge.\n\n'
'For example:\n'
' Premises: a is a bird\n'
' Conclusion: a can fly\n'
' Verdict: GOOD (typical birds fly; the inference holds under default conditions)\n\n'
' Premises: a is a bird and a is a penguin\n'
' Conclusion: a can fly\n'
' Verdict: BAD (the second premise is a defeater)'
),
user_template='Premises: {premise_context}\nConclusion: {conclusion_context}\nVerdict:',
)
print(f'id : {DEFEASIBLE.id}')
print(f'system chars : {len(DEFEASIBLE.system)}')
id : defeasible-explicit-v1 system chars : 917
2. Three readings of is red¶
We'll run the same benchmark three times, varying only δ(ra). The analyst column stays at [good, good, good, bad] — the question is which phrasing of the bearer best matches that analyst's reading.
VARIANTS = {
'original': '$a$ is red',
'intrinsic': 'a has the standard color of stop signs',
'perceptual': 'a visibly appears red',
}
for name, expr in VARIANTS.items():
print(f' {name:12} -> {expr!r}')
original -> '$a$ is red' intrinsic -> 'a has the standard color of stop signs' perceptual -> 'a visibly appears red'
3. Three models¶
One from each frontier family. We pick small/fast variants so a full run completes in a few minutes.
import os
from typing import NamedTuple
class ModelSpec(NamedTuple):
label: str
provider: str
model_id: str
env_var: str
extras: dict
MODELS = [
ModelSpec('gpt-4.1', 'openai', 'gpt-4.1', 'OPENAI_API_KEY', {}),
ModelSpec('claude-haiku-4-5', 'anthropic', 'claude-haiku-4-5-20251001', 'ANTHROPIC_API_KEY', {}),
ModelSpec('deepseek-v4-flash', 'openrouter','deepseek/deepseek-v4-flash','OPENROUTER_API_KEY',
{'http_referer': 'https://github.com/bradleypallen/infereval',
'x_title': 'infereval-paraphrase-axis-tutorial'}),
]
for spec in MODELS:
ok = 'available' if os.environ.get(spec.env_var) else f'(set {spec.env_var} to enable)'
print(f' {spec.label:20} {spec.provider:11} {spec.model_id:32} {ok}')
gpt-4.1 openai gpt-4.1 available claude-haiku-4-5 anthropic claude-haiku-4-5-20251001 available deepseek-v4-flash openrouter deepseek/deepseek-v4-flash available
4. Run the experiment¶
9 evaluation runs total (3 models × 3 variants). At n_samples=3 and max_tokens=512 each run is ~12 API calls; the whole sweep is ~108 calls and runs in a few minutes.
Note max_tokens=512: DeepSeek v4-flash uses silent reasoning tokens that consume the budget invisibly. At very low caps (the pre-v0.5.2 framework default of 32 was the canonical failure mode) content comes back empty and the framework records it as abstain with parse_status="budget_clipped" — a budget abstain, not a model abstain. The current framework default of 1024 clears this case; 512 is used here for fixture continuity.
import copy
from pathlib import Path
from infereval.benchmark import Benchmark
from infereval.evaluation import EndorsementConfig, ProviderParams, evaluate
from infereval.providers import ProviderConfigError, get_provider
REPO_ROOT = Path.cwd().parents[1] # docs/tutorials -> repo root
BENCH_PATH = REPO_ROOT / 'examples' / 'stop_sign' / 'benchmark.json'
OUT_DIR = REPO_ROOT / 'experiments' / 'out' / 'tutorial-03'
OUT_DIR.mkdir(parents=True, exist_ok=True)
base_dict = Benchmark.load(BENCH_PATH).model_dump(mode='json')
results: dict[str, dict[str, object]] = {}
for spec in MODELS:
if not os.environ.get(spec.env_var):
print(f'[skip] {spec.label} — {spec.env_var} not set')
continue
try:
provider = get_provider(spec.provider, spec.model_id, **spec.extras)
except ProviderConfigError as exc:
print(f'[skip] {spec.label} — {exc}')
continue
results[spec.label] = {}
for variant_name, expr in VARIANTS.items():
bd = copy.deepcopy(base_dict)
bd['id'] = f'stop-sign-{variant_name}'
bd['bearers']['ra']['expression'] = expr
bench = Benchmark.model_validate(bd)
print(f'... {spec.label:20} variant={variant_name}', flush=True)
eta = evaluate(
bench, provider,
config=EndorsementConfig(n_samples=3),
params=ProviderParams(temperature=0.0, max_tokens=512),
verification_prompt=DEFEASIBLE,
run_id=f'{spec.label}-{variant_name}',
log_path=OUT_DIR / f'{spec.label}-{variant_name}.jsonl',
)
eta.dump(OUT_DIR / f'{spec.label}-{variant_name}.json')
results[spec.label][variant_name] = eta
print(f'\nDone — {len(results)} model(s) × {len(VARIANTS)} variant(s) = {sum(len(v) for v in results.values())} evaluations.')
... gpt-4.1 variant=original
... gpt-4.1 variant=intrinsic
... gpt-4.1 variant=perceptual
... claude-haiku-4-5 variant=original
... claude-haiku-4-5 variant=intrinsic
... claude-haiku-4-5 variant=perceptual
... deepseek-v4-flash variant=original
... deepseek-v4-flash variant=intrinsic
... deepseek-v4-flash variant=perceptual
Done — 3 model(s) × 3 variant(s) = 9 evaluations.
5. Per-item verdicts¶
What did each model say on each item? Analyst row is good, good, good, bad.
if results:
print(f'{"model":22} {"variant":12} {"row-0":7} {"row-1":7} {"row-2":7} {"row-3":7}')
print('-' * 76)
for label, by_variant in results.items():
for variant_name in VARIANTS:
if variant_name not in by_variant:
continue
eta = by_variant[variant_name]
row = [next(it for it in eta.items if it.id == rid).model_verdict.value
for rid in ('row-0', 'row-1', 'row-2', 'row-3')]
print(f'{label:22} {variant_name:12} {row[0]:7} {row[1]:7} {row[2]:7} {row[3]:7}')
else:
print('No models executed; set at least one API key env var and re-run cell 4.')
model variant row-0 row-1 row-2 row-3 ---------------------------------------------------------------------------- gpt-4.1 original good good good bad gpt-4.1 intrinsic good good good bad gpt-4.1 perceptual good bad bad bad claude-haiku-4-5 original bad bad bad bad claude-haiku-4-5 intrinsic good good bad bad claude-haiku-4-5 perceptual good bad bad bad deepseek-v4-flash original good good good bad deepseek-v4-flash intrinsic good good good bad deepseek-v4-flash perceptual good good bad bad
6. Cross-model κ_C comparison¶
The headline number per (model, variant). Higher is closer agreement with the paper's analyst row.
from infereval.metrics import MetricsReport
def _fmt(value):
return 'undefined' if value is None else f'{value:+.4f}'
if results:
print(f'{"model":22} {"original":14} {"intrinsic":14} {"perceptual":14}')
print('-' * 64)
for label, by_variant in results.items():
row = []
for variant_name in VARIANTS:
if variant_name not in by_variant:
row.append('—')
continue
r = MetricsReport(eta=by_variant[variant_name])
row.append(_fmt(r.cohens_kappa()))
print(f'{label:22} {row[0]:14} {row[1]:14} {row[2]:14}')
model original intrinsic perceptual ---------------------------------------------------------------- gpt-4.1 +1.0000 +1.0000 +0.2000 claude-haiku-4-5 +0.0000 +0.5000 +0.2000 deepseek-v4-flash +1.0000 +1.0000 +0.5000
The reading (from the 2026-05-16 run)¶
Expected pattern when all three models execute:
model original intrinsic perceptual
----------------------------------------------------------------
gpt-4.1 +1.0000 +1.0000 +0.2000
claude-haiku-4-5 +0.2000 +0.5000 +0.2000
deepseek-v4-flash +1.0000 +1.0000 +0.4000
Three findings:
- GPT-4.1 reproduces the paper's analyst row exactly under the original phrasing. Simonelli's empirical claim replicates ten months later.
- Claude Haiku 4.5 defaults to a perceptual reading of
is red— under the original phrasing it reads it as a claim about whether the sign's color is visibly perceivable, which the nighttime / non-reflective premises defeat. GPT-4.1 and DeepSeek default to an intrinsic reading (the categorical property of the kind), which those premises don't defeat. - Under explicit perceptual phrasing all three models converge. When the conclusion is unambiguously
a visibly appears red, all three models flip rows 1 and 2 to BAD (or abstain). This is strong evidence the disagreement is content-attributional, not about defeasible-reasoning competence.
What this demonstrates about the methodology¶
- Carving-relativity is empirically detectable. Different models assign different default content-attributions to ostensibly identical English predicates, and the methodology localizes where.
- The paraphrase axis (Remark 9 of the paper) gives a constructive route to recovery. Switching
δto the analyst's reading recovers agreement (for Haiku, partially; +0.20 → +0.50). - The methodology can replicate the paper's anchoring example (GPT-4.1, original δ → analyst row, κ_C = +1.00).
- The methodology can differentiate model families on content-attribution (GPT-4.1 and DeepSeek both intrinsic; Claude perceptual).
Beyond three models: the 13-model cross-family sweep¶
The 3-model demonstration above is pedagogical. The same experiment has been run at scale across 13 frontier models from 6 families (Anthropic, OpenAI, DeepSeek, Qwen, Gemini, Mistral) plus GPT-4.1 as the original-paper baseline.
Full findings, including the κ_C table, per-item verdict matrix, and discussion of what the results tell us about carving-relativity, are at:
experiments/results/cross_family_2026-05-18.md
Headline:
- 11 of 13 frontier models reproduce Simonelli's analyst row exactly under the original δ(ra) (κ_C = +1.00). Eleven-model independent replication of the paper's empirical anchor.
- The two outliers (Claude Haiku 4.5, Mistral Large) default to a perceptual reading of
is redrather than the analyst's intrinsic reading. - Intrinsic phrasing fully recovers the analyst row for 12 of 13. The one remaining exception (
gpt-5.4-mini) over-extends the intrinsic reading to endorse the painted-blue defeater (interprets "standard color" as a kind-property that survives painting). - Perceptual phrasing splits models cleanly into a flagship-heavy κ_C = +0.20 group and a distilled-heavy κ_C = +0.50 group — smaller models are uniformly more lenient on single-modifier defeasibility.
The script that runs the full sweep is at experiments/paraphrase_axis_triangulation.py and skips any model whose API key isn't in env. All 78 (Evaluation JSON + JSONL audit log) pairs from the sweep are committed at experiments/results/cross-family/.
Where to go next¶
- The CLI version of this experiment:
python experiments/paraphrase_axis_triangulation.py(auto-skips missing API keys). ../concepts.mdon bearers, expression functions, and the paraphrase axis.../interpreting_metrics.mdon κ_C, κ_F, and decomposition.- the paper, §5 on carving-relativity.
- Open issues: