Tutorial 01: Quickstart

An end-to-end walkthrough of infereval against the bundled stop-sign benchmark, using the committed ReplayProvider fixture so no API key is required.

This is the same content as the README's 60-second quickstart, but executable cell by cell so you can inspect intermediate state.

Prerequisites: pip install infereval (no extras needed for the replay path).

1. Load the benchmark

In [1]:

from pathlib import Path
from infereval.benchmark import Benchmark

# Resolve the repo root from this notebook's location.
REPO_ROOT = Path.cwd().parents[1]
BENCH_PATH = REPO_ROOT / 'examples' / 'stop_sign' / 'benchmark.json'

bench = Benchmark.load(BENCH_PATH)
print(f'id      : {bench.id}')
print(f'title   : {bench.title}')
print(f'|B|     : {len(bench.bearers)} bearers')
print(f'n       : {bench.n} items')
print(f'm       : {bench.m} analyst(s)')
print()
print('Bearers:')
for bid, b in bench.bearers.items():
    print(f'  {bid:4} -> {b.expression!r}')

id      : stop-sign-example-1
title   : Stop-sign RSR (Example 1 of Allen 2026)
|B|     : 5 bearers
n       : 4 items
m       : 1 analyst(s)

Bearers:
  sa   -> '$a$ is a stop sign'
  ra   -> '$a$ is red'
  n    -> 'it is nighttime'
  nr   -> '$a$ is not made with reflective material'
  ba   -> '$a$ has been painted blue'

2. Inspect the items

Each item is one ⟨Γ, Δ⟩ implication paired with analyst verdicts.

In [2]:

for item in bench.items:
    prem = '{' + ','.join(item.premises) + '}'
    conc = '{' + ','.join(item.conclusions) + '}'
    av = item.analyst_verdicts[0].value
    tags = ','.join(item.tags)
    print(f'{item.id:6}  {prem:14} -> {conc:6}  analyst={av:6}  tags=[{tags}]')

row-0   {sa}           -> {ra}    analyst=good    tags=[base-inference]
row-1   {n,sa}         -> {ra}    analyst=good    tags=[irrelevant-addition,nighttime]
row-2   {n,nr,sa}      -> {ra}    analyst=good    tags=[irrelevant-addition,nonreflective]
row-3   {ba,sa}        -> {ra}    analyst=bad     tags=[defeater,painted-blue]

3. Build a ReplayProvider

The committed replay fixture has 20 recorded responses (4 items × 5 samples) keyed by the SHA-256 hash of each prompt. The ReplayProvider looks up the prompt hash for each request and returns the matching canned response, so runs are deterministic and need neither a network connection nor an API key.
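
As a rough illustration of a hash-keyed lookup, the sketch below stores canned responses under the SHA-256 hex digest of each prompt. The class name TinyReplay, the in-memory dict, and the choice to cycle through recorded samples in order are all assumptions made for illustration; they are not infereval's actual implementation or fixture format.

```python
import hashlib
from collections import defaultdict


def prompt_key(prompt: str) -> str:
    """Return the SHA-256 hex digest of a prompt (UTF-8 encoded)."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


class TinyReplay:
    """Hypothetical stand-in for a replay provider; not infereval's class."""

    def __init__(self, recordings):
        # recordings: iterable of (prompt, response) pairs captured earlier.
        self._by_hash = defaultdict(list)
        for prompt, response in recordings:
            self._by_hash[prompt_key(prompt)].append(response)
        # Per-prompt counter so repeated requests walk through the samples.
        self._cursor = defaultdict(int)

    def complete(self, prompt: str) -> str:
        key = prompt_key(prompt)
        responses = self._by_hash.get(key)
        if not responses:
            raise KeyError(f"no recorded response for prompt hash {key[:12]}")
        # Serve recorded samples in order, wrapping around when exhausted.
        i = self._cursor[key]
        self._cursor[key] = i + 1
        return responses[i % len(responses)]


replay = TinyReplay([("Is the sign red?", "GOOD"), ("Is the sign red?", "GOOD.")])
first = replay.complete("Is the sign red?")
second = replay.complete("Is the sign red?")
```

Because the key depends only on the prompt text, any change to prompt wording produces a different hash and a lookup miss, which is why a replay fixture has to be recorded against the exact prompts the evaluation will send.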

In [3]:

from infereval.providers.mock import ReplayProvider

FIXTURE_PATH = REPO_ROOT / 'tests' / 'fixtures' / 'stop_sign_replay.jsonl'
provider = ReplayProvider(FIXTURE_PATH)
print(f'provider.name     = {provider.name}')
print(f'provider.model_id = {provider.model_id}')

provider.name     = replay
provider.model_id = claude-haiku-4-5-20251001

4. Run the evaluation

evaluate() iterates each benchmark item, builds the verification prompt via δ/ctx_Γ/ctx_Δ, samples the provider n_samples times, parses each response, and majority-votes to get E_M for that item.
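
The majority-vote step can be pictured with a minimal sketch like the one below. The function name majority_vote and the rule that a tie for first place yields an abstention are assumptions; the source does not say how infereval breaks ties.

```python
from collections import Counter


def majority_vote(verdicts):
    """Return (winner, tally) for a list of 'good'/'bad'/'abstain' strings.

    Assumption: an empty list or a tie for first place yields 'abstain'.
    """
    tally = Counter(verdicts)
    ranked = tally.most_common()
    if not ranked:
        return "abstain", tally
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "abstain", tally
    return ranked[0][0], tally


winner, tally = majority_vote(["bad", "bad", "good", "bad", "bad"])
print(winner, dict(tally))
```

Using an odd n_samples, as the tutorial does with 5, makes a two-way tie between good and bad impossible unless some samples abstain.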

In [4]:

# ProviderParams is also exported from this module but is not needed here.
from infereval.evaluation import EndorsementConfig, evaluate

eta = evaluate(
    bench,
    provider,
    config=EndorsementConfig(n_samples=5),
    run_id='quickstart-tutorial',
)
print(f'evaluation id     : {eta.id}')
print(f'benchmark hash    : {eta.benchmark_hash[:24]}...')
print(f'framework version : {eta.framework_version}')
print()
for item in eta.items:
    av = item.analyst_verdicts[0].value
    mv = item.model_verdict.value
    agree = '✓' if mv == av else '✗'
    print(f'{item.id:6} analyst={av:6} model={mv:6} {agree}')

evaluation id     : quickstart-tutorial
benchmark hash    : sha256:e89da2d622002ac58...
framework version : 0.1.0

row-0  analyst=good   model=good   ✓
row-1  analyst=good   model=good   ✓
row-2  analyst=good   model=good   ✓
row-3  analyst=bad    model=bad    ✓

5. Inspect per-sample detail

Every individual sample is preserved with its raw response and parsed verdict, giving a full audit trail.

In [5]:

row3 = next(it for it in eta.items if it.id == 'row-3')
print(f'row-3 (the defeater): premises={list(row3.premises)} conclusions={list(row3.conclusions)}')
print(f'  model_verdict = {row3.model_verdict.value}')
print(f'  majority_vote = good={row3.majority_vote.good} bad={row3.majority_vote.bad} abstain={row3.majority_vote.abstain}')
print('  samples:')
for s in row3.samples:
    print(f'    [{s.sample_index}] raw={s.raw_response!r} -> parsed={s.parsed_verdict.value}')

row-3 (the defeater): premises=['ba', 'sa'] conclusions=['ra']
  model_verdict = bad
  majority_vote = good=0 bad=5 abstain=0
  samples:
    [0] raw='BAD' -> parsed=bad
    [1] raw='BAD. The sign has been repainted blue.' -> parsed=bad
    [2] raw='BAD' -> parsed=bad
    [3] raw='BAD' -> parsed=bad
    [4] raw='BAD' -> parsed=bad
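
Sample [1] carries a trailing explanation yet still parses as bad. One way a lenient parser could behave like this is sketched below, assuming the verdict is the first alphabetic word of the response and that anything else counts as an abstention. The function parse_verdict and its rule are hypothetical, not infereval's actual parser.

```python
import re


def parse_verdict(raw: str) -> str:
    """Map a raw response to 'good', 'bad', or 'abstain' (hypothetical rule).

    Assumption: the verdict is the first alphabetic word, case-insensitive;
    any other leading word, or no word at all, counts as an abstention.
    """
    match = re.match(r"\s*([A-Za-z]+)", raw)
    word = match.group(1).lower() if match else ""
    return word if word in {"good", "bad"} else "abstain"


samples = ["BAD", "BAD. The sign has been repainted blue.", "BAD"]
parsed = [parse_verdict(s) for s in samples]
```

Keeping the raw response alongside the parsed verdict, as the evaluation record does, lets you audit cases where a parsing rule of this kind might misread a hedged or verbose answer.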

6. Compute metrics

In [6]:

from infereval.metrics import MetricsReport

report = MetricsReport(eta=eta, benchmark=bench)
for k, v in report.to_dict().items():
    print(f'  {k}: {v}')

Fleiss kappa undefined: fewer than 2 annotators

  n: 4
  coverage: 1.0
  coverage_per_analyst: [1.0]
  cohens_kappa_consensus: 1.0
  fleiss_kappa: 1.0
  inter_analyst_fleiss: None
  coverage_per_analyst_named: {'paper-author': 1.0}
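
To see where a value like cohens_kappa_consensus: 1.0 can come from, the sketch below applies the standard Cohen's kappa formula, kappa = (p_o - p_e) / (1 - p_e), to the analyst and model verdicts from step 4. Whether infereval computes its metric in exactly this way is an assumption; the helper cohens_kappa is illustrative only.

```python
from collections import Counter


def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences.

    Returns None when chance agreement is 1, where kappa is undefined.
    """
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    ca, cb = Counter(a), Counter(b)
    p_e = sum((ca[label] / n) * (cb[label] / n) for label in set(ca) | set(cb))
    if p_e == 1.0:
        return None
    return (p_o - p_e) / (1.0 - p_e)


analyst = ["good", "good", "good", "bad"]
model = ["good", "good", "good", "bad"]
kappa = cohens_kappa(analyst, model)
```

Here p_o is 1 and p_e is 0.75² + 0.25² = 0.625, so kappa is 1.0. With only four items the estimate is fragile: a single disagreement would move it substantially.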

7. Construct the derived frame ⟨B, I_M⟩

Definition 3 of the paper. contains(implication) returns True iff either clause holds:

(i) Γ ∩ Δ ≠ ∅ (Containment), or
(ii) E_M(⟨Γ, Δ⟩) = good,

with ⟨∅, ∅⟩ explicitly excluded.

The frame is lazy: the full I_M is unbounded, so you query it as needed.
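
The two clauses and the empty-pair exclusion can be sketched as a plain predicate. The function in_frame and the dict layout for endorsements are hypothetical. Treating an implication that was never queried as outside the frame is also an assumption of this sketch, since the real frame may resolve such queries differently.

```python
def in_frame(premises, conclusions, endorsements):
    """Membership test following the two clauses plus the empty-pair exclusion.

    endorsements: dict mapping (frozenset premises, frozenset conclusions)
    to 'good' or 'bad'. This layout is hypothetical.
    """
    gamma, delta = frozenset(premises), frozenset(conclusions)
    if not gamma and not delta:
        return False  # <empty, empty> is excluded by stipulation
    if gamma & delta:
        return True  # clause (i): Containment
    # Clause (ii); unqueried pairs fall through to False (assumption).
    return endorsements.get((gamma, delta)) == "good"


endorsements = {
    (frozenset({"sa"}), frozenset({"ra"})): "good",
    (frozenset({"ba", "sa"}), frozenset({"ra"})): "bad",
}
checks = {
    "row-0": in_frame(["sa"], ["ra"], endorsements),
    "row-3": in_frame(["ba", "sa"], ["ra"], endorsements),
    "witness": in_frame(["sa", "ra"], ["ra", "n"], endorsements),
    "empty": in_frame([], [], endorsements),
}
```

The empty-pair check comes first so that the exclusion holds regardless of what the endorsement table contains.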

In [7]:

from infereval.frame import DerivedFrame
from infereval.types import Implication

frame = DerivedFrame.from_endorsements(bench.runtime_bearers(), eta.endorsements())

# The four queried items:
for iid in ('row-0', 'row-1', 'row-2', 'row-3'):
    item = next(it for it in bench.items if it.id == iid)
    imp = Implication.of(item.premises, item.conclusions)
    print(f'{iid}  <{set(item.premises)}, {set(item.conclusions)}> in I_M ? {imp in frame}')

# Containment witness — never queried, but in I_M via clause (i):
witness = Implication.of(['sa', 'ra'], ['ra', 'n'])
print(f"witness  <{{sa,ra}}, {{ra,n}}> queried ? {witness in frame.queried_implications()}")
print(f"witness  <{{sa,ra}}, {{ra,n}}> in I_M ? {witness in frame}  (Containment)")

# Empty-empty: excluded by stipulation
print(f"empty    <{{}}, {{}}> in I_M ? {Implication.of([], []) in frame}")

row-0  <{'sa'}, {'ra'}> in I_M ? True
row-1  <{'sa', 'n'}, {'ra'}> in I_M ? True
row-2  <{'nr', 'sa', 'n'}, {'ra'}> in I_M ? True
row-3  <{'ba', 'sa'}, {'ra'}> in I_M ? False
witness  <{sa,ra}, {ra,n}> queried ? False
witness  <{sa,ra}, {ra,n}> in I_M ? True  (Containment)
empty    <{}, {}> in I_M ? False

8. Save and reload

The evaluation is a JSON document you can persist and re-analyze later.

In [8]:

import tempfile
from infereval.evaluation import Evaluation

with tempfile.NamedTemporaryFile(suffix='.json', delete=False) as f:
    out_path = Path(f.name)

eta.dump(out_path)
loaded = Evaluation.load(out_path)
print(f'round-trip matches: {loaded == eta}')
print(f'on-disk size:       {out_path.stat().st_size} bytes')
out_path.unlink()

round-trip matches: True
on-disk size:       9608 bytes

Next steps

02_authoring_a_benchmark.ipynb — build a new benchmark from scratch in a different domain.
03_paraphrase_axis_experiment.ipynb — the cross-model paraphrase-axis experiment narrated.
../concepts.md — the methodology's mental model.
../interpreting_metrics.md — when each kappa matters and how to read decompositions.
../providers.md — swapping ReplayProvider for a real LLM.