Tutorial 01 — Quickstart
An end-to-end walkthrough of infereval against the bundled stop-sign benchmark, using the committed ReplayProvider fixture so no API key is required.
This is the same content as the README's 60-second quickstart, but executable cell by cell so you can inspect intermediate state.
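Prerequisites: pip install infereval (no extras needed for the replay path). The cells below read the benchmark and the replay fixture from the repository tree, so run the notebook from its own directory inside a clone.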
1. Load the benchmark
from pathlib import Path
from infereval.benchmark import Benchmark
# Resolve the repo root, assuming the working directory is this notebook's folder (two levels below the root).
REPO_ROOT = Path.cwd().parents[1]
BENCH_PATH = REPO_ROOT / 'examples' / 'stop_sign' / 'benchmark.json'
bench = Benchmark.load(BENCH_PATH)
print(f'id : {bench.id}')
print(f'title : {bench.title}')
print(f'|B| : {len(bench.bearers)} bearers')
print(f'n : {bench.n} items')
print(f'm : {bench.m} analyst(s)')
print()
print('Bearers:')
for bid, b in bench.bearers.items():
    print(f' {bid:4} -> {b.expression!r}')
id : stop-sign-example-1
title : Stop-sign RSR (Example 1 of Allen 2026)
|B| : 5 bearers
n : 4 items
m : 1 analyst(s)

Bearers:
 sa -> '$a$ is a stop sign'
 ra -> '$a$ is red'
 n -> 'it is nighttime'
 nr -> '$a$ is not made with reflective material'
 ba -> '$a$ has been painted blue'
2. Inspect the items
Each item is one ⟨Γ, Δ⟩ implication paired with analyst verdicts.
for item in bench.items:
    prem = '{' + ','.join(item.premises) + '}'
    conc = '{' + ','.join(item.conclusions) + '}'
    av = item.analyst_verdicts[0].value
    tags = ','.join(item.tags)
    print(f'{item.id:6} {prem:14} -> {conc:6} analyst={av:6} tags=[{tags}]')
row-0 {sa} -> {ra} analyst=good tags=[base-inference]
row-1 {n,sa} -> {ra} analyst=good tags=[irrelevant-addition,nighttime]
row-2 {n,nr,sa} -> {ra} analyst=good tags=[irrelevant-addition,nonreflective]
row-3 {ba,sa} -> {ra} analyst=bad tags=[defeater,painted-blue]
3. Build a ReplayProvider
The committed replay fixture has 20 recorded responses (4 items × 5 samples) keyed by the SHA-256 hash of each prompt. The ReplayProvider looks up the prompt hash for each request and returns the matching canned response — deterministic, no network, no API key.
from infereval.providers.mock import ReplayProvider
FIXTURE_PATH = REPO_ROOT / 'tests' / 'fixtures' / 'stop_sign_replay.jsonl'
provider = ReplayProvider(FIXTURE_PATH)
print(f'provider.name = {provider.name}')
print(f'provider.model_id = {provider.model_id}')
provider.name = replay
provider.model_id = claude-haiku-4-5-20251001
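The hash-keyed lookup described above can be sketched with the standard library alone. This is a minimal illustration, not infereval's implementation: the class name `TinyReplay`, the record fields `prompt_hash` and `responses`, the `complete` method, and the example prompt are all hypothetical, and the real fixture layout may differ.

```python
import hashlib
import json


def prompt_key(prompt: str) -> str:
    """Return the hex SHA-256 digest of the UTF-8 encoded prompt."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


class TinyReplay:
    """Hypothetical stand-in for a replay provider (not infereval's class)."""

    def __init__(self, jsonl_lines):
        # Assumed record shape: {"prompt_hash": str, "responses": [str, ...]}
        self._responses = {}
        for line in jsonl_lines:
            record = json.loads(line)
            self._responses[record["prompt_hash"]] = record["responses"]

    def complete(self, prompt: str, sample_index: int) -> str:
        """Look up the canned response for this prompt and sample index."""
        key = prompt_key(prompt)
        if key not in self._responses:
            raise KeyError(f"no recorded response for prompt hash {key[:12]}")
        return self._responses[key][sample_index]


# Made-up prompt and fixture line, for illustration only.
prompt = "Given that a is a stop sign, is a red? Answer GOOD or BAD."
fixture = [json.dumps({"prompt_hash": prompt_key(prompt), "responses": ["GOOD", "GOOD"]})]

replay = TinyReplay(fixture)
print(replay.complete(prompt, 0))  # GOOD
```

Because the key is derived from the prompt text, any change to the prompt template produces a different hash and the lookup fails loudly instead of silently returning a stale answer.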
4. Run the evaluation
evaluate() iterates each benchmark item, builds the verification prompt via δ/ctx_Γ/ctx_Δ, samples the provider n_samples times, parses each response, and majority-votes to get E_M for that item.
from infereval.evaluation import EndorsementConfig, evaluate
eta = evaluate(
    bench,
    provider,
    config=EndorsementConfig(n_samples=5),
    run_id='quickstart-tutorial',
)
print(f'evaluation id : {eta.id}')
print(f'benchmark hash : {eta.benchmark_hash[:24]}...')
print(f'framework version : {eta.framework_version}')
print()
for item in eta.items:
    av = item.analyst_verdicts[0].value
    mv = item.model_verdict.value
    agree = '✓' if mv == av else '✗'
    print(f'{item.id:6} analyst={av:6} model={mv:6} {agree}')
evaluation id : quickstart-tutorial
benchmark hash : sha256:e89da2d622002ac58...
framework version : 0.1.0

row-0 analyst=good model=good ✓
row-1 analyst=good model=good ✓
row-2 analyst=good model=good ✓
row-3 analyst=bad model=bad ✓
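The majority-vote step that turns per-sample verdicts into a single model verdict can be sketched as follows. The function name `majority_vote` and the tie-breaking rule (a tie resolves to abstain) are assumptions made for illustration; the source does not say how infereval breaks ties.

```python
from collections import Counter

LABELS = ("good", "bad", "abstain")


def majority_vote(verdicts):
    """Return (winning_label, tally) for a list of per-sample verdicts.

    Assumption: when no single label has the most votes, the result
    is "abstain". The real framework may resolve ties differently.
    """
    counts = Counter(verdicts)
    tally = {label: counts.get(label, 0) for label in LABELS}
    top = max(tally.values())
    winners = [label for label, count in tally.items() if count == top]
    verdict = winners[0] if len(winners) == 1 else "abstain"
    return verdict, tally


# Five samples that all say "bad", like row-3 in this tutorial.
verdict, tally = majority_vote(["bad"] * 5)
print(verdict, tally)  # bad {'good': 0, 'bad': 5, 'abstain': 0}
```

With n_samples=5 and two substantive labels, an odd sample count makes a good/bad tie possible only when some samples abstain.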
5. Inspect per-sample detail
Every individual sample is preserved with its raw response and parsed verdict — the full audit trail.
row3 = next(it for it in eta.items if it.id == 'row-3')
print(f'row-3 (the defeater): premises={list(row3.premises)} conclusions={list(row3.conclusions)}')
print(f' model_verdict = {row3.model_verdict.value}')
print(f' majority_vote = good={row3.majority_vote.good} bad={row3.majority_vote.bad} abstain={row3.majority_vote.abstain}')
print(f' samples:')
for s in row3.samples:
    print(f' [{s.sample_index}] raw={s.raw_response!r} -> parsed={s.parsed_verdict.value}')
row-3 (the defeater): premises=['ba', 'sa'] conclusions=['ra']
model_verdict = bad
majority_vote = good=0 bad=5 abstain=0
samples:
[0] raw='BAD' -> parsed=bad
[1] raw='BAD. The sign has been repainted blue.' -> parsed=bad
[2] raw='BAD' -> parsed=bad
[3] raw='BAD' -> parsed=bad
[4] raw='BAD' -> parsed=bad
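Sample [1] shows that a response with trailing explanation still parses to bad. One way such a parser could work is to read only the leading GOOD/BAD token; the sketch below is a guess at that behavior, with `parse_verdict` and its fallback to abstain being hypothetical rather than infereval's actual parsing rules.

```python
import re

# Match a leading GOOD or BAD token, ignoring case and leading whitespace.
_VERDICT_RE = re.compile(r"\s*(GOOD|BAD)\b", flags=re.IGNORECASE)


def parse_verdict(raw: str) -> str:
    """Map a raw model response to 'good', 'bad', or 'abstain'.

    Assumption: anything that does not start with GOOD or BAD
    counts as an abstention.
    """
    match = _VERDICT_RE.match(raw)
    return match.group(1).lower() if match else "abstain"


for raw in ("BAD", "BAD. The sign has been repainted blue.", "I cannot tell."):
    print(f"{raw!r} -> {parse_verdict(raw)}")
```

Keeping the raw response alongside the parsed verdict, as the evaluation does, makes it possible to audit cases where a parser like this one guesses wrong.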
6. Compute metrics
from infereval.metrics import MetricsReport
report = MetricsReport(eta=eta, benchmark=bench)
for k, v in report.to_dict().items():
    print(f' {k}: {v}')
Fleiss kappa undefined: fewer than 2 annotators
n: 4
coverage: 1.0
coverage_per_analyst: [1.0]
cohens_kappa_consensus: 1.0
fleiss_kappa: 1.0
inter_analyst_fleiss: None
coverage_per_analyst_named: {'paper-author': 1.0}
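To see where a kappa of 1.0 comes from, here is a hand-rolled Cohen's kappa, κ = (p_o − p_e) / (1 − p_e), applied to the four verdict pairs from this run. It assumes that cohens_kappa_consensus compares the model verdicts with the analyst consensus; the helper `cohens_kappa` is illustrative and is not the function infereval uses.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length label sequences.

    Returns None when chance agreement is 1.0 (kappa is undefined).
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters match.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum((freq_a[label] / n) * (freq_b[label] / n)
              for label in set(rater_a) | set(rater_b))
    if p_e == 1.0:
        return None
    return (p_o - p_e) / (1.0 - p_e)


analyst = ["good", "good", "good", "bad"]  # rows 0-3, analyst verdicts
model = ["good", "good", "good", "bad"]    # rows 0-3, model verdicts
print(cohens_kappa(analyst, model))  # 1.0
```

Here p_o = 1.0 and p_e = 0.75·0.75 + 0.25·0.25 = 0.625, so κ = 0.375 / 0.375 = 1.0. Had the model answered good on all four rows, p_o and p_e would both be 0.75 and κ would drop to 0.0 despite 75% raw agreement.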
7. Construct the derived frame ⟨B, I_M⟩
Definition 3 of the paper. contains(implication) returns True iff:
- Γ ∩ Δ ≠ ∅ (Containment, clause i), OR
- E_M(⟨Γ, Δ⟩) = good (clause ii)
In either case, ⟨∅, ∅⟩ is explicitly excluded.
The frame is lazy — the full I_M is unbounded; you query it as needed.
from infereval.frame import DerivedFrame
from infereval.types import Implication
frame = DerivedFrame.from_endorsements(bench.runtime_bearers(), eta.endorsements())
# The four queried items:
for iid in ('row-0', 'row-1', 'row-2', 'row-3'):
    item = next(it for it in bench.items if it.id == iid)
    imp = Implication.of(item.premises, item.conclusions)
    print(f'{iid} <{set(item.premises)}, {set(item.conclusions)}> in I_M ? {imp in frame}')
# Containment witness — never queried, but in I_M via clause (i):
witness = Implication.of(['sa', 'ra'], ['ra', 'n'])
print(f"witness <{{sa,ra}}, {{ra,n}}> queried ? {witness in frame.queried_implications()}")
print(f"witness <{{sa,ra}}, {{ra,n}}> in I_M ? {witness in frame} (Containment)")
# Empty-empty: excluded by stipulation
print(f"empty <{{}}, {{}}> in I_M ? {Implication.of([], []) in frame}")
row-0 <{'sa'}, {'ra'}> in I_M ? True
row-1 <{'sa', 'n'}, {'ra'}> in I_M ? True
row-2 <{'nr', 'sa', 'n'}, {'ra'}> in I_M ? True
row-3 <{'ba', 'sa'}, {'ra'}> in I_M ? False
witness <{sa,ra}, {ra,n}> queried ? False
witness <{sa,ra}, {ra,n}> in I_M ? True (Containment)
empty <{}, {}> in I_M ? False
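The membership rule from Definition 3 is small enough to write out directly. The sketch below reproduces the results above with plain sets and a dict; `in_frame` and the `endorsements` mapping are hypothetical stand-ins, not the DerivedFrame internals.

```python
def in_frame(premises, conclusions, endorsements):
    """Return True iff <premises, conclusions> belongs to I_M.

    endorsements maps (frozenset, frozenset) pairs to 'good' or 'bad'
    for the implications that were actually queried.
    """
    gamma, delta = frozenset(premises), frozenset(conclusions)
    if not gamma and not delta:
        return False  # <empty, empty> is excluded by stipulation
    if gamma & delta:
        return True   # clause (i): Containment
    # Clause (ii): the model endorsed this implication as good.
    return endorsements.get((gamma, delta)) == "good"


# Two of the queried items from this tutorial.
endorsements = {
    (frozenset({"sa"}), frozenset({"ra"})): "good",
    (frozenset({"ba", "sa"}), frozenset({"ra"})): "bad",
}

print(in_frame(["sa"], ["ra"], endorsements))             # True
print(in_frame(["ba", "sa"], ["ra"], endorsements))       # False
print(in_frame(["sa", "ra"], ["ra", "n"], endorsements))  # True (Containment)
print(in_frame([], [], endorsements))                     # False
```

Evaluating membership on demand like this is what lets the frame stay lazy: nothing is enumerated, and an implication that was never queried and has no overlap between its two sides simply reports False.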
8. Save and reload
The evaluation is a JSON document you can persist and re-analyze later.
import tempfile
from infereval.evaluation import Evaluation
with tempfile.NamedTemporaryFile(suffix='.json', delete=False) as f:
    out_path = Path(f.name)
eta.dump(out_path)
loaded = Evaluation.load(out_path)
print(f'round-trip matches: {loaded == eta}')
print(f'on-disk size: {out_path.stat().st_size} bytes')
out_path.unlink()
round-trip matches: True
on-disk size: 9608 bytes
Next steps
- 02_authoring_a_benchmark.ipynb: build a new benchmark from scratch in a different domain.
- 03_paraphrase_axis_experiment.ipynb: the cross-model paraphrase-axis experiment narrated.
- ../concepts.md: the methodology's mental model.
- ../interpreting_metrics.md: when each kappa matters and how to read decompositions.
- ../providers.md: swapping ReplayProvider for a real LLM.