Tutorial 03 — The paraphrase-axis experiment

A narrated walk through the cross-model triangulation experiment in experiments/paraphrase_axis_triangulation.py. It shows the methodology at full reach: a triangulated, replicable, mechanistically interpreted test of the paper's claim across three frontier-LLM families.

API keys are required to execute the notebook end-to-end. It is set up to run against ANTHROPIC_API_KEY, OPENAI_API_KEY, and OPENROUTER_API_KEY; any model whose key is missing is skipped with a warning.

If you want to read it without executing, the table-style output cells are filled in from the first run (2026-05-16). Re-executing should reproduce the same verdicts (temperature 0.0).

What we're testing

Simonelli (2026) reports that GPT-4.1 endorses a is a stop sign ⟹ a is red under irrelevant additions (nighttime, non-reflective material) but rejects it under a defeater (painted blue). The paper's analyst row encodes this pattern:

| item | premises | analyst verdict |
| --- | --- | --- |
| row-0 | `{sa}` | good |
| row-1 | `{sa, n}` | good |
| row-2 | `{sa, n, nr}` | good |
| row-3 | `{sa, ba}` | bad |

- Question 1: Does GPT-4.1 in 2026 still produce this row?
- Question 2: Do other frontier LLMs produce the same row?
- Question 3: When they don't, where does the disagreement come from: defeasible-reasoning competence, or the carving of `is red`?

The experiment varies δ(ra) across three readings of "is red" and measures each model's κ_C against the paper's analyst row.

1. The defeasibility-explicit verification prompt

Holding the verification prompt constant rules out the alternative hypothesis that the model is reading "GOOD" deductively. Our prompt explicitly says: this is defeasible material inference, granting default conditions, with a worked penguin example.

In [1]:

from infereval.prompts import VerificationPrompt

DEFEASIBLE = VerificationPrompt(
    id='defeasible-explicit-v1',
    system=(
        'You are evaluating whether an inference is materially good in everyday '
        'defeasible reasoning.\n\n'
        'This is NOT a question of strict deductive entailment. It is a question '
        'of defeasible material inference: granting typical background conditions '
        'and standard kinds, does the conclusion ordinarily follow from the '
        'premises?\n\n'
        'Answer with exactly one of: GOOD, BAD, ABSTAIN. No other text.\n\n'
        'GOOD means an ordinary reasoner would conclude this from the premises '
        'under default conditions.\n'
        'BAD means the premises positively rule out the conclusion (i.e., an '
        'explicit defeater is present in the premises).\n'
        'ABSTAIN means the question is ill-formed or you cannot judge.\n\n'
        'For example:\n'
        '  Premises: a is a bird\n'
        '  Conclusion: a can fly\n'
        '  Verdict: GOOD  (typical birds fly; the inference holds under default conditions)\n\n'
        '  Premises: a is a bird and a is a penguin\n'
        '  Conclusion: a can fly\n'
        '  Verdict: BAD  (the second premise is a defeater)'
    ),
    user_template='Premises: {premise_context}\nConclusion: {conclusion_context}\nVerdict:',
)
print(f'id           : {DEFEASIBLE.id}')
print(f'system chars : {len(DEFEASIBLE.system)}')

id           : defeasible-explicit-v1
system chars : 917
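
To make concrete what a model sees for a single item, the sketch below fills the user_template from the cell above using plain str.format. The bearer expressions for sa and ba and the ' and ' join are assumptions for illustration (the join mirrors the penguin example in the system prompt); infereval's own rendering may differ, and render_user_message is a hypothetical helper, not part of the library.

```python
# Template string copied from DEFEASIBLE.user_template in the cell above.
USER_TEMPLATE = 'Premises: {premise_context}\nConclusion: {conclusion_context}\nVerdict:'


def render_user_message(premises, conclusion):
    """Fill the template. Joining premises with ' and ' is an assumption."""
    return USER_TEMPLATE.format(
        premise_context=' and '.join(premises),
        conclusion_context=conclusion,
    )


# Hypothetical bearer expressions for row-3 ({sa, ba}); the real wording lives
# in examples/stop_sign/benchmark.json and may differ.
message = render_user_message(['a is a stop sign', 'a is painted blue'], 'a is red')
print(message)
```

Printing the rendered message next to the system prompt is a quick way to sanity-check a new paraphrase before spending API calls on it.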

2. Three readings of `is red`

We'll run the same benchmark three times, varying only δ(ra). The analyst column stays at [good, good, good, bad] — the question is which phrasing of the bearer best matches that analyst's reading.

In [2]:

VARIANTS = {
    'original':   '$a$ is red',
    'intrinsic':  'a has the standard color of stop signs',
    'perceptual': 'a visibly appears red',
}
for name, expr in VARIANTS.items():
    print(f'  {name:12} -> {expr!r}')

  original     -> '$a$ is red'
  intrinsic    -> 'a has the standard color of stop signs'
  perceptual   -> 'a visibly appears red'

3. Three models

One from each frontier family. We pick small/fast variants so a full run completes in a few minutes.

In [3]:

import os
from typing import NamedTuple

class ModelSpec(NamedTuple):
    label: str
    provider: str
    model_id: str
    env_var: str
    extras: dict

MODELS = [
    ModelSpec('gpt-4.1',           'openai',    'gpt-4.1',                   'OPENAI_API_KEY',     {}),
    ModelSpec('claude-haiku-4-5',  'anthropic', 'claude-haiku-4-5-20251001', 'ANTHROPIC_API_KEY',  {}),
    ModelSpec('deepseek-v4-flash', 'openrouter','deepseek/deepseek-v4-flash','OPENROUTER_API_KEY',
              {'http_referer': 'https://github.com/bradleypallen/infereval',
               'x_title':      'infereval-paraphrase-axis-tutorial'}),
]

for spec in MODELS:
    ok = 'available' if os.environ.get(spec.env_var) else f'(set {spec.env_var} to enable)'
    print(f'  {spec.label:20} {spec.provider:11} {spec.model_id:32} {ok}')

  gpt-4.1              openai      gpt-4.1                          available
  claude-haiku-4-5     anthropic   claude-haiku-4-5-20251001        available
  deepseek-v4-flash    openrouter  deepseek/deepseek-v4-flash       available

4. Run the experiment

9 evaluation runs total (3 models × 3 variants). At n_samples=3 each run makes ~12 API calls (4 items × 3 samples), so the whole sweep is ~108 calls and, with max_tokens=512, runs in a few minutes.

A note on max_tokens=512: DeepSeek v4-flash uses silent reasoning tokens that consume the budget invisibly. At very low caps (the pre-v0.5.2 framework default of 32 was the canonical failure mode), the content comes back empty and the framework records it as abstain with parse_status="budget_clipped". That is a budget abstain, not a model abstain. The current framework default of 1024 clears this case; 512 is used here for fixture continuity.
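
One way to check whether the abstains in a run are budget abstains is to tally parse_status values in the JSONL audit log. The sketch below assumes each log line is a JSON object with a top-level parse_status field. Only the value "budget_clipped" comes from the note above; the item_id field, the "ok" value, and the helper name are hypothetical placeholders, so adjust them to the actual log schema.

```python
import json
from collections import Counter


def count_parse_statuses(jsonl_lines):
    """Tally parse_status values across JSONL records (field name assumed)."""
    counts = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        record = json.loads(line)
        counts[record.get('parse_status', 'missing')] += 1
    return counts


# Hypothetical log excerpt; the real records will carry more fields.
sample_log = [
    '{"item_id": "row-0", "parse_status": "ok"}',
    '{"item_id": "row-1", "parse_status": "budget_clipped"}',
    '',
    '{"item_id": "row-2", "parse_status": "budget_clipped"}',
]
counts = count_parse_statuses(sample_log)
print(counts['budget_clipped'])
```

For a real run, pass an open file handle for one of the .jsonl files written to OUT_DIR instead of sample_log. A non-zero budget_clipped count suggests raising max_tokens before interpreting any abstains.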

In [4]:

import copy
from pathlib import Path
from infereval.benchmark import Benchmark
from infereval.evaluation import EndorsementConfig, ProviderParams, evaluate
from infereval.providers import ProviderConfigError, get_provider

REPO_ROOT = Path.cwd().parents[1]  # docs/tutorials -> repo root
BENCH_PATH = REPO_ROOT / 'examples' / 'stop_sign' / 'benchmark.json'
OUT_DIR = REPO_ROOT / 'experiments' / 'out' / 'tutorial-03'
OUT_DIR.mkdir(parents=True, exist_ok=True)

base_dict = Benchmark.load(BENCH_PATH).model_dump(mode='json')

results: dict[str, dict[str, object]] = {}

for spec in MODELS:
    if not os.environ.get(spec.env_var):
        print(f'[skip] {spec.label} — {spec.env_var} not set')
        continue
    try:
        provider = get_provider(spec.provider, spec.model_id, **spec.extras)
    except ProviderConfigError as exc:
        print(f'[skip] {spec.label} — {exc}')
        continue
    results[spec.label] = {}
    for variant_name, expr in VARIANTS.items():
        bd = copy.deepcopy(base_dict)
        bd['id'] = f'stop-sign-{variant_name}'
        bd['bearers']['ra']['expression'] = expr
        bench = Benchmark.model_validate(bd)
        print(f'... {spec.label:20} variant={variant_name}', flush=True)
        eta = evaluate(
            bench, provider,
            config=EndorsementConfig(n_samples=3),
            params=ProviderParams(temperature=0.0, max_tokens=512),
            verification_prompt=DEFEASIBLE,
            run_id=f'{spec.label}-{variant_name}',
            log_path=OUT_DIR / f'{spec.label}-{variant_name}.jsonl',
        )
        eta.dump(OUT_DIR / f'{spec.label}-{variant_name}.json')
        results[spec.label][variant_name] = eta

print(f'\nDone — {len(results)} model(s) × {len(VARIANTS)} variant(s) = {sum(len(v) for v in results.values())} evaluations.')

... gpt-4.1              variant=original
... gpt-4.1              variant=intrinsic
... gpt-4.1              variant=perceptual
... claude-haiku-4-5     variant=original
... claude-haiku-4-5     variant=intrinsic
... claude-haiku-4-5     variant=perceptual
... deepseek-v4-flash    variant=original
... deepseek-v4-flash    variant=intrinsic
... deepseek-v4-flash    variant=perceptual

Done — 3 model(s) × 3 variant(s) = 9 evaluations.

5. Per-item verdicts

What did each model say on each item? Analyst row is good, good, good, bad.

In [5]:

if results:
    print(f'{"model":22} {"variant":12} {"row-0":7} {"row-1":7} {"row-2":7} {"row-3":7}')
    print('-' * 76)
    for label, by_variant in results.items():
        for variant_name in VARIANTS:
            if variant_name not in by_variant:
                continue
            eta = by_variant[variant_name]
            row = [next(it for it in eta.items if it.id == rid).model_verdict.value
                   for rid in ('row-0', 'row-1', 'row-2', 'row-3')]
            print(f'{label:22} {variant_name:12} {row[0]:7} {row[1]:7} {row[2]:7} {row[3]:7}')
else:
    print('No models executed; set at least one API key env var and re-run cell 4.')

model                  variant      row-0   row-1   row-2   row-3  
----------------------------------------------------------------------------
gpt-4.1                original     good    good    good    bad    
gpt-4.1                intrinsic    good    good    good    bad    
gpt-4.1                perceptual   good    bad     bad     bad    
claude-haiku-4-5       original     bad     bad     bad     bad    
claude-haiku-4-5       intrinsic    good    good    bad     bad    
claude-haiku-4-5       perceptual   good    bad     bad     bad    
deepseek-v4-flash      original     good    good    good    bad    
deepseek-v4-flash      intrinsic    good    good    good    bad    
deepseek-v4-flash      perceptual   good    good    bad     bad

6. Cross-model κ_C comparison

The headline number per (model, variant). Higher is closer agreement with the paper's analyst row.

In [6]:

from infereval.metrics import MetricsReport

def _fmt(value):
    return 'undefined' if value is None else f'{value:+.4f}'

if results:
    print(f'{"model":22} {"original":14} {"intrinsic":14} {"perceptual":14}')
    print('-' * 64)
    for label, by_variant in results.items():
        row = []
        for variant_name in VARIANTS:
            if variant_name not in by_variant:
                row.append('—')
                continue
            r = MetricsReport(eta=by_variant[variant_name])
            row.append(_fmt(r.cohens_kappa()))
        print(f'{label:22} {row[0]:14} {row[1]:14} {row[2]:14}')

model                  original       intrinsic      perceptual    
----------------------------------------------------------------
gpt-4.1                +1.0000        +1.0000        +0.2000       
claude-haiku-4-5       +0.0000        +0.5000        +0.2000       
deepseek-v4-flash      +1.0000        +1.0000        +0.5000
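
The κ_C values above can be checked by hand from the per-item verdicts in section 5. The sketch below is the textbook two-rater Cohen's κ, (p_o − p_e) / (1 − p_e), written from scratch. It is an assumption that infereval's cohens_kappa() agrees with it on rows without abstains, and the function name here is illustrative rather than the library's implementation.

```python
from collections import Counter

ANALYST = ['good', 'good', 'good', 'bad']


def cohens_kappa(analyst, model):
    """Textbook Cohen's kappa for two raters; None when chance agreement is 1."""
    n = len(analyst)
    # Observed agreement: fraction of items where the two verdicts match.
    p_o = sum(a == m for a, m in zip(analyst, model)) / n
    # Chance agreement: product of each rater's marginal rate, summed over labels.
    a_counts, m_counts = Counter(analyst), Counter(model)
    labels = set(a_counts) | set(m_counts)
    p_e = sum((a_counts[lab] / n) * (m_counts[lab] / n) for lab in labels)
    if p_e == 1.0:
        return None  # undefined: both raters use a single identical label
    return (p_o - p_e) / (1.0 - p_e)


# Verdict rows copied from the per-item table in section 5.
gpt_perceptual = ['good', 'bad', 'bad', 'bad']
haiku_original = ['bad', 'bad', 'bad', 'bad']
deepseek_perceptual = ['good', 'good', 'bad', 'bad']

print(f'{cohens_kappa(ANALYST, gpt_perceptual):+.4f}')       # +0.2000
print(f'{cohens_kappa(ANALYST, haiku_original):+.4f}')       # +0.0000
print(f'{cohens_kappa(ANALYST, deepseek_perceptual):+.4f}')  # +0.5000
```

With only four items, κ can take very few distinct values, which is why the same numbers (+0.20, +0.50, +1.00) recur across models.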

The reading (from the 2026-05-16 run)

Expected pattern when all three models execute:

model                  original       intrinsic      perceptual
----------------------------------------------------------------
gpt-4.1                +1.0000        +1.0000        +0.2000
claude-haiku-4-5       +0.0000        +0.5000        +0.2000
deepseek-v4-flash      +1.0000        +1.0000        +0.5000

Three findings:

- GPT-4.1 reproduces the paper's analyst row exactly under the original phrasing. Simonelli's empirical claim replicates ten months later.
- Claude Haiku 4.5 defaults to a perceptual reading of `is red`: under the original phrasing it reads the conclusion as a claim about whether the sign's color is visibly perceivable, which the nighttime and non-reflective premises defeat. GPT-4.1 and DeepSeek default to an intrinsic reading (the categorical property of the kind), which those premises don't defeat.
- Under explicit perceptual phrasing the models move toward each other. When the conclusion is unambiguously `a visibly appears red`, GPT-4.1 and Claude Haiku 4.5 both reject rows 1 and 2, and DeepSeek rejects row 2. This is strong evidence that the disagreement is content-attributional, not about defeasible-reasoning competence.
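
The "default reading" in the second finding can be made operational in more than one way. One simple heuristic, offered as an illustration and not as the procedure used by the tutorial or the paper, compares each model's verdict row under the original phrasing with its rows under the two explicit paraphrases and picks the closer one by Hamming distance. The verdicts are copied from the per-item table in section 5; the function names are hypothetical.

```python
# Per-item verdicts (row-0..row-3) copied from the table in section 5.
VERDICTS = {
    'gpt-4.1': {
        'original':   ['good', 'good', 'good', 'bad'],
        'intrinsic':  ['good', 'good', 'good', 'bad'],
        'perceptual': ['good', 'bad', 'bad', 'bad'],
    },
    'claude-haiku-4-5': {
        'original':   ['bad', 'bad', 'bad', 'bad'],
        'intrinsic':  ['good', 'good', 'bad', 'bad'],
        'perceptual': ['good', 'bad', 'bad', 'bad'],
    },
    'deepseek-v4-flash': {
        'original':   ['good', 'good', 'good', 'bad'],
        'intrinsic':  ['good', 'good', 'good', 'bad'],
        'perceptual': ['good', 'good', 'bad', 'bad'],
    },
}


def hamming(a, b):
    """Number of items on which two verdict rows differ."""
    return sum(x != y for x, y in zip(a, b))


def default_reading(rows):
    """Heuristic: which explicit paraphrase is the original row closest to?"""
    d_intrinsic = hamming(rows['original'], rows['intrinsic'])
    d_perceptual = hamming(rows['original'], rows['perceptual'])
    if d_intrinsic == d_perceptual:
        return 'tie'
    return 'intrinsic' if d_intrinsic < d_perceptual else 'perceptual'


for label, rows in VERDICTS.items():
    print(f'{label:20} -> {default_reading(rows)}')
```

On these rows the heuristic agrees with the reading above (Claude perceptual, the other two intrinsic), but with four items a single flipped verdict can change the label, so treat it as a diagnostic rather than a measurement.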

What this demonstrates about the methodology

- Carving-relativity is empirically detectable. Different models assign different default content-attributions to ostensibly identical English predicates, and the methodology localizes where.
- The paraphrase axis (Remark 9 of the paper) gives a constructive route to recovery. Switching δ to the analyst's reading recovers agreement (for Haiku only partially: κ_C goes from +0.00 to +0.50 in the table above).
- The methodology can replicate the paper's anchoring example (GPT-4.1, original δ → analyst row, κ_C = +1.00).
- The methodology can differentiate model families on content-attribution (GPT-4.1 and DeepSeek both intrinsic; Claude perceptual).

Beyond three models: the 13-model cross-family sweep

The 3-model demonstration above is pedagogical. The same experiment has been run at scale across 13 frontier models from 6 families (Anthropic, OpenAI, DeepSeek, Qwen, Gemini, Mistral) plus GPT-4.1 as the original-paper baseline.

Full findings, including the κ_C table, per-item verdict matrix, and discussion of what the results tell us about carving-relativity, are at:

experiments/results/cross_family_2026-05-18.md

Headline:

- 11 of 13 frontier models reproduce Simonelli's analyst row exactly under the original δ(ra) (κ_C = +1.00). This is an eleven-model independent replication of the paper's empirical anchor.
- The two outliers (Claude Haiku 4.5, Mistral Large) default to a perceptual reading of `is red` rather than the analyst's intrinsic reading.
- Intrinsic phrasing fully recovers the analyst row for 12 of 13. The one remaining exception (gpt-5.4-mini) over-extends the intrinsic reading and endorses the inference even under the painted-blue defeater (it interprets "standard color" as a kind-property that survives painting).
- Perceptual phrasing splits models cleanly into a flagship-heavy κ_C = +0.20 group and a distilled-heavy κ_C = +0.50 group; smaller models tend to be more lenient on single-modifier defeasibility.

The script that runs the full sweep is at experiments/paraphrase_axis_triangulation.py; it skips any model whose API key is not set in the environment. All 78 (Evaluation JSON + JSONL audit log) pairs from the sweep are committed at experiments/results/cross-family/.

Where to go next

- The CLI version of this experiment: python experiments/paraphrase_axis_triangulation.py (auto-skips missing API keys).
- ../concepts.md on bearers, expression functions, and the paraphrase axis.
- ../interpreting_metrics.md on κ_C, κ_F, and decomposition.
- The paper, §5 on carving-relativity.
- Open issues:
  - #4: raise default max_tokens so DeepSeek-style models don't budget-clip.
  - #5: surface finish_reason so budget-clipped abstains are distinguishable.
  - #6: let verification_prompt override the system field, so the experiment becomes JSON-drivable without Python.