Tutorial 02 — Authoring a benchmark
Build a small medical-defeasibility benchmark from scratch and run it against a mock provider. No API key required — we use a ScriptedProvider so you can run the notebook to completion without spending any tokens.
Companion reading: ../authoring_benchmarks.md, which covers the same material in prose form.
1. Pick a target inference
We'll probe the inference bacterial diagnosis ⟹ antibiotics indicated. Side-premises we want to test:
- a fever — irrelevant addition (should preserve the inference)
- a cough — irrelevant addition
- a viral co-infection — irrelevant for this purpose (bacterial part still warrants antibiotics)
- a PCR false-positive flag — defeater (the bacterial diagnosis is invalid)
- full recovery — defeater (no remaining indication)
From those we get our bearer set.
bearers = {
'bd': {'expression': 'a has a bacterial diagnosis'},
'ab': {'expression': 'antibiotics are indicated for a'},
'sf': {'expression': 'a has a fever'},
'cf': {'expression': 'a has a cough'},
'vd': {'expression': 'a has a viral co-infection'},
'pcr': {'expression': "a's bacterial diagnosis was a PCR false positive"},
're': {'expression': 'a has fully recovered'},
}
for bid, b in bearers.items():
    print(f' {bid:4} -> {b["expression"]}')
bd   -> a has a bacterial diagnosis
ab   -> antibiotics are indicated for a
sf   -> a has a fever
cf   -> a has a cough
vd   -> a has a viral co-infection
pcr  -> a's bacterial diagnosis was a PCR false positive
re   -> a has fully recovered
2. Declare an analyst panel
We declare two analysts so that the inter-analyst agreement κ_F*(β) is defined. In a real benchmark you'd source verdicts from actual clinicians.
analysts = [
{'id': 'physician-a', 'display_name': 'Dr. A (internal medicine)',
'notes': 'Labels reflect the analyst\'s reading of standard clinical practice.'},
{'id': 'physician-b', 'display_name': 'Dr. B (infectious disease)'},
]
3. Write the items
Each item has an id, premises (set of bearer ids), conclusions, an analyst-verdicts tuple (one per analyst), optional tags, and an optional rsr_target.
Note the coinfection item: both physicians agree it's still GOOD because the bacterial diagnosis warrants antibiotics regardless of the viral co-infection. This is an irrelevant addition, not a defeater, so we tag it irrelevant-addition; the by-tag decomposition then groups it with the fever and cough items.
items = [
{'id': 'base',
'premises': ['bd'], 'conclusions': ['ab'],
'analyst_verdicts': ['good', 'good'],
'tags': ['base-inference'],
'rsr_target': {'X': ['bd'], 'A': ['ab']}},
{'id': 'irrelevant-fever',
'premises': ['bd', 'sf'], 'conclusions': ['ab'],
'analyst_verdicts': ['good', 'good'],
'tags': ['irrelevant-addition', 'symptom'],
'rsr_target': {'X': ['bd'], 'A': ['ab']}},
{'id': 'irrelevant-cough',
'premises': ['bd', 'cf'], 'conclusions': ['ab'],
'analyst_verdicts': ['good', 'good'],
'tags': ['irrelevant-addition', 'symptom'],
'rsr_target': {'X': ['bd'], 'A': ['ab']}},
{'id': 'irrelevant-coinfection',
'premises': ['bd', 'vd'], 'conclusions': ['ab'],
'analyst_verdicts': ['good', 'good'],
'tags': ['irrelevant-addition', 'coinfection'],
'rsr_target': {'X': ['bd'], 'A': ['ab']}},
{'id': 'defeater-pcr',
'premises': ['bd', 'pcr'], 'conclusions': ['ab'],
'analyst_verdicts': ['bad', 'bad'],
'tags': ['defeater', 'false-positive'],
'rsr_target': {'X': ['bd'], 'A': ['ab']}},
{'id': 'defeater-recovered',
'premises': ['bd', 're'], 'conclusions': ['ab'],
'analyst_verdicts': ['bad', 'bad'],
'tags': ['defeater', 'recovered'],
'rsr_target': {'X': ['bd'], 'A': ['ab']}},
]
print(f'{len(items)} items defined')
6 items defined
4. Assemble and validate
Wrap everything into a single dict and run it through the Benchmark Pydantic model. Validation catches:
- Bearer ids referenced in premises/conclusions/rsr_target that don't exist in bearers.
- analyst_verdicts tuples whose length differs from the number of analysts.
- Duplicate item ids, duplicate analyst ids.
- Invalid verdict values.
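To make these checks concrete, here is a minimal plain-Python sketch of the same four rules. It is an illustration only, not how infereval implements validation: the function name `find_problems` is hypothetical, and the allowed verdict set `{'good', 'bad'}` is an assumption based on the values used in this tutorial.

```python
from collections import Counter

ALLOWED_VERDICTS = {'good', 'bad'}  # assumption: only the two values used in this tutorial


def find_problems(bearers, analysts, items):
    """Return human-readable problems; an empty list means all four checks pass."""
    problems = []
    known = set(bearers)

    # Duplicate ids among items and among analysts.
    for kind, records in (('item', items), ('analyst', analysts)):
        counts = Counter(r['id'] for r in records)
        problems += [f'duplicate {kind} id: {i}' for i, c in counts.items() if c > 1]

    for item in items:
        # Collect every bearer id the item mentions, including the optional rsr_target.
        target = item.get('rsr_target') or {}
        refs = (item['premises'] + item['conclusions']
                + target.get('X', []) + target.get('A', []))
        unknown = sorted(set(refs) - known)
        if unknown:
            problems.append(f"item {item['id']}: unknown bearer ids {unknown}")

        # One verdict per analyst, each drawn from the allowed set.
        verdicts = item['analyst_verdicts']
        if len(verdicts) != len(analysts):
            problems.append(
                f"item {item['id']}: {len(verdicts)} verdicts for {len(analysts)} analysts")
        invalid = [v for v in verdicts if v not in ALLOWED_VERDICTS]
        if invalid:
            problems.append(f"item {item['id']}: invalid verdicts {invalid}")
    return problems


# Tiny stand-in data so the sketch runs on its own.
demo_bearers = {'bd': {}, 'ab': {}}
demo_analysts = [{'id': 'physician-a'}, {'id': 'physician-b'}]
good_item = {'id': 'base', 'premises': ['bd'], 'conclusions': ['ab'],
             'analyst_verdicts': ['good', 'good']}
bad_item = {'id': 'broken', 'premises': ['bd', 'ghost-bearer'], 'conclusions': ['ab'],
            'analyst_verdicts': ['good']}

print(find_problems(demo_bearers, demo_analysts, [good_item]))
print(find_problems(demo_bearers, demo_analysts, [good_item, bad_item]))
```

Collecting all problems into a list, rather than raising on the first one, is a design choice for this sketch: it lets an author fix several mistakes in one pass.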
from infereval.benchmark import Benchmark
bench_dict = {
'schema_version': '1.0',
'id': 'medical-defeasibility-demo',
'title': 'Bacterial diagnosis → antibiotics indicated (demo)',
'domain': 'medical reasoning',
'description': 'Pedagogical benchmark for docs/tutorials/02_authoring_a_benchmark.ipynb. '
'Not for clinical use.',
'bearers': bearers,
'analysts': analysts,
'items': items,
}
bench = Benchmark.model_validate(bench_dict)
print(f'id : {bench.id}')
print(f'|B| : {len(bench.bearers)} bearers')
print(f'n : {bench.n} items')
print(f'm : {bench.m} analysts')
id : medical-defeasibility-demo
|B| : 7 bearers
n : 6 items
m : 2 analysts
Validation rejects malformed input — let's confirm by trying to add an unknown bearer:
import copy
from pydantic import ValidationError
bad = copy.deepcopy(bench_dict)
bad['items'][0]['premises'].append('ghost-bearer')
try:
    Benchmark.model_validate(bad)
except ValidationError as exc:
    print('caught (expected):')
    print(f' {exc.errors()[0]["msg"]}')
caught (expected):
Value error, Item 'base' references unknown bearer ids: premises=['ghost-bearer'], conclusions=[]
5. Inspect the inter-analyst baseline
With m = 2 and verdicts spread across both classes (some items GOOD, some BAD), κ_F*(β) is defined. It should also be high, since both physicians agree on every item.
from infereval.metrics import inter_analyst_fleiss
kappa_star = inter_analyst_fleiss(bench)
print(f'κ_F*(β) = {kappa_star}')
κ_F*(β) = 1.0
Perfect — both analysts agree on every item, no ambiguity. In a real benchmark you would expect 0.6–0.95 reflecting genuine domain disagreement on edge cases.
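The value 1.0 can be checked by hand. The arithmetic below assumes that `inter_analyst_fleiss` is the standard Fleiss' kappa computed over the two analysts only; that assumption is consistent with the printed result but is not confirmed by the library source. Here N is the number of items, k the number of raters, and n_ij the number of raters who put item i in class j.

```latex
\begin{align*}
P_i &= \frac{1}{k(k-1)} \sum_j n_{ij}\,(n_{ij} - 1), &
\bar{P} &= \frac{1}{N} \sum_i P_i, \\
p_j &= \frac{1}{Nk} \sum_i n_{ij}, &
\bar{P}_e &= \sum_j p_j^2, \qquad
\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}. \\[6pt]
\text{With } N = 6,\ k = 2:\quad
\bar{P} &= 1 \ \text{(every item is unanimous)}, &
p_{\text{good}} &= \tfrac{8}{12}, \quad p_{\text{bad}} = \tfrac{4}{12}, \\
\bar{P}_e &= \tfrac{4}{9} + \tfrac{1}{9} = \tfrac{5}{9}, &
\kappa &= \frac{1 - \tfrac{5}{9}}{1 - \tfrac{5}{9}} = 1.
\end{align*}
```

The denominator is nonzero only because the verdicts use both classes. If every item were GOOD, the chance-agreement term would equal 1 and kappa would be undefined.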
6. Evaluate against a ScriptedProvider
We don't have a real provider here, so we'll script the responses. Pretend a model gets the irrelevant additions right but misses one defeater — this is the kind of pattern the by-tag decomposition is designed to surface.
A ScriptedProvider hands out its responses round-robin across all calls, so the script must follow the item order. With 6 items × 3 samples each, we need 18 responses.
from infereval.evaluation import EndorsementConfig, ProviderParams, evaluate
from infereval.providers.mock import ScriptedProvider
# Order matches the items above: base, fever, cough, coinfection, pcr, recovered
# Each item gets 3 responses. We'll script the model to agree on base + irrelevant additions,
# but flip on the PCR defeater (says GOOD when analyst says BAD).
scripted_responses = (
['GOOD', 'GOOD', 'GOOD'] + # base
['GOOD', 'GOOD', 'GOOD'] + # irrelevant-fever
['GOOD', 'GOOD', 'GOOD'] + # irrelevant-cough
['GOOD', 'GOOD', 'GOOD'] + # irrelevant-coinfection
['GOOD', 'GOOD', 'GOOD'] + # defeater-pcr (model is WRONG here — analyst said BAD)
['BAD', 'BAD', 'BAD'] # defeater-recovered (model agrees)
)
provider = ScriptedProvider(responses=scripted_responses, model_id='scripted-mock-v1')
eta = evaluate(
bench,
provider,
config=EndorsementConfig(n_samples=3),
params=ProviderParams(temperature=0.0, max_tokens=32),
run_id='medical-demo-1',
)
print(f'{"item":24} {"analyst-A":10} {"analyst-B":10} {"model":10} agree?')
print('-' * 70)
for it in eta.items:
    a, b = it.analyst_verdicts
    m = it.model_verdict
    agree_a = '✓' if m == a else '✗'
    print(f'{it.id:24} {a.value:10} {b.value:10} {m.value:10} A:{agree_a}')
item                     analyst-A  analyst-B  model      agree?
----------------------------------------------------------------------
base                     good       good       good       A:✓
irrelevant-fever         good       good       good       A:✓
irrelevant-cough         good       good       good       A:✓
irrelevant-coinfection   good       good       good       A:✓
defeater-pcr             bad        bad        good       A:✗
defeater-recovered       bad        bad        bad        A:✓
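Writing 18 responses by hand is easy to get wrong once a benchmark grows. As a hedged alternative, the flat script can be derived from a per-item mapping. The sketch below assumes, as the comments in this section state, that the provider is called `n_samples` times per item in benchmark order; the names `item_order` and `scripted_verdict` are hypothetical helpers, not part of infereval.

```python
N_SAMPLES = 3  # must match EndorsementConfig(n_samples=...) used in the run

# Item ids in the same order as the benchmark's items list.
item_order = [
    'base',
    'irrelevant-fever',
    'irrelevant-cough',
    'irrelevant-coinfection',
    'defeater-pcr',
    'defeater-recovered',
]

# What the mock model should answer for each item.
scripted_verdict = {
    'base': 'GOOD',
    'irrelevant-fever': 'GOOD',
    'irrelevant-cough': 'GOOD',
    'irrelevant-coinfection': 'GOOD',
    'defeater-pcr': 'GOOD',        # deliberately wrong: the analysts said BAD
    'defeater-recovered': 'BAD',
}

# Repeat each item's verdict once per sample, keeping item order.
scripted_responses = [
    scripted_verdict[item_id]
    for item_id in item_order
    for _ in range(N_SAMPLES)
]

# Fail early if the script and the sampling budget drift apart.
assert len(scripted_responses) == len(item_order) * N_SAMPLES
print(len(scripted_responses), 'responses scripted')
```

The resulting list is identical to the hand-written one, so it can be passed to ScriptedProvider unchanged.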
7. Compute headline metrics
from infereval.metrics import MetricsReport
report = MetricsReport(eta=eta, benchmark=bench)
for k, v in report.to_dict().items():
    print(f' {k}: {v}')
n: 6
coverage: 1.0
coverage_per_analyst: [1.0, 1.0]
cohens_kappa_consensus: 0.5714285714285715
fleiss_kappa: 0.723076923076923
inter_analyst_fleiss: 1.0
coverage_per_analyst_named: {'physician-a': 1.0, 'physician-b': 1.0}
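Both kappa values can be reproduced by hand from the verdict table. The arithmetic assumes that the consensus reference is the verdict the two analysts share, and that `fleiss_kappa` pools the two analysts and the model as three raters. Both assumptions match the printed numbers but are not confirmed by the library source.

```latex
\begin{align*}
\textbf{Cohen (model vs. consensus):}\quad
p_o &= \tfrac{5}{6}, \qquad
p_e = \tfrac{5}{6}\cdot\tfrac{4}{6} + \tfrac{1}{6}\cdot\tfrac{2}{6} = \tfrac{11}{18}, \\
\kappa_C &= \frac{p_o - p_e}{1 - p_e}
         = \frac{\tfrac{15}{18} - \tfrac{11}{18}}{\tfrac{7}{18}}
         = \tfrac{4}{7} \approx 0.5714. \\[6pt]
\textbf{Fleiss (3 raters, 6 items):}\quad
\bar{P} &= \frac{5 \cdot 1 + \tfrac{1}{3}}{6} = \tfrac{8}{9}, \qquad
p_{\text{good}} = \tfrac{13}{18}, \quad p_{\text{bad}} = \tfrac{5}{18}, \\
\bar{P}_e &= \frac{169 + 25}{324} = \tfrac{97}{162}, \qquad
\kappa_F = \frac{\tfrac{144}{162} - \tfrac{97}{162}}{\tfrac{65}{162}}
         = \tfrac{47}{65} \approx 0.7231.
\end{align*}
```

The model is GOOD on five items and the consensus on four, which gives the marginals in p_e. In the Fleiss case only defeater-pcr is split (two BAD, one GOOD), so its per-item agreement is 1/3 while the other five items contribute 1.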
8. Decompose by tag
This is where the diagnosis happens. We expect:
- base-inference: perfect agreement.
- irrelevant-addition: perfect agreement (all three subtypes match).
- defeater: disagreement on pcr, agreement on recovered → moderate kappa.
The decomposition lets the analyst see exactly which kind of inference the model is mishandling.
for tag in ('base-inference', 'irrelevant-addition', 'defeater'):
    sub = report.by_tag(tag)
    kc = sub.cohens_kappa()
    kf = sub.fleiss_kappa
    fkc = f'{kc:+.4f}' if kc is not None else 'undefined'
    fkf = f'{kf:+.4f}' if kf is not None else 'undefined'
    print(f'by-tag {tag:24} n={sub.n} cov={sub.coverage:.4f} κ_C={fkc:10} κ_F={fkf}')
kappa_C undefined: chance-expected agreement p_e = 1 (M and reference both degenerate on a single class over S)
Fleiss kappa undefined: chance-expected agreement P_bar_e = 1 (all annotations in one class over S_F)
kappa_C undefined: chance-expected agreement p_e = 1 (M and reference both degenerate on a single class over S)
Fleiss kappa undefined: chance-expected agreement P_bar_e = 1 (all annotations in one class over S_F)
by-tag base-inference           n=1 cov=1.0000 κ_C=undefined  κ_F=undefined
by-tag irrelevant-addition      n=3 cov=1.0000 κ_C=undefined  κ_F=undefined
by-tag defeater                 n=2 cov=1.0000 κ_C=+0.0000    κ_F=-0.2000
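The undefined entries follow from the arithmetic rather than from a model failure: when every label in a subset falls in one class, chance-expected agreement is 1 and the kappa denominator is zero. The sketch below recomputes the by-tag values from the verdict table using textbook Cohen and Fleiss formulas. It is an independent illustration, not infereval's code, and it assumes the reference is the analysts' shared verdict and that Fleiss pools the two analysts plus the model.

```python
from collections import Counter
from fractions import Fraction


def cohens_kappa(model, reference):
    """Cohen's kappa for two label sequences; None when chance agreement is 1."""
    n = len(model)
    p_o = Fraction(sum(m == r for m, r in zip(model, reference)), n)
    cm, cr = Counter(model), Counter(reference)
    p_e = sum(Fraction(cm[c], n) * Fraction(cr[c], n) for c in set(cm) | set(cr))
    if p_e == 1:
        return None  # both raters degenerate on a single class
    return float((p_o - p_e) / (1 - p_e))


def fleiss_kappa(ratings):
    """Fleiss' kappa for per-item label lists of equal length; None when P_e is 1."""
    n_items, k = len(ratings), len(ratings[0])
    totals, p_bar = Counter(), Fraction(0)
    for row in ratings:
        counts = Counter(row)
        totals.update(counts)
        p_bar += Fraction(sum(c * (c - 1) for c in counts.values()), k * (k - 1))
    p_bar /= n_items
    p_e = sum(Fraction(t, n_items * k) ** 2 for t in totals.values())
    if p_e == 1:
        return None  # all annotations in one class
    return float((p_bar - p_e) / (1 - p_e))


# (tag, shared analyst verdict, model verdict), copied from the table in section 6.
rows = [
    ('base-inference', 'good', 'good'),
    ('irrelevant-addition', 'good', 'good'),
    ('irrelevant-addition', 'good', 'good'),
    ('irrelevant-addition', 'good', 'good'),
    ('defeater', 'bad', 'good'),
    ('defeater', 'bad', 'bad'),
]

by_tag = {}
for tag in ('base-inference', 'irrelevant-addition', 'defeater'):
    sub = [(ref, mod) for t, ref, mod in rows if t == tag]
    kc = cohens_kappa([m for _, m in sub], [r for r, _ in sub])
    kf = fleiss_kappa([[r, r, m] for r, m in sub])  # two agreeing analysts + model
    by_tag[tag] = (kc, kf)
    print(tag, kc, kf)
```

On subsets this small, kappa is a blunt instrument: the defeater tag scores 0.0 and -0.2 even though the model gets one of its two items right, so it helps to read the per-item table alongside the by-tag numbers.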
9. Persist the benchmark
Once you're satisfied, write the benchmark JSON to disk and commit it to version control.
import tempfile
from pathlib import Path
with tempfile.NamedTemporaryFile(suffix='.json', delete=False) as f:
    out_path = Path(f.name)
bench.dump(out_path)
print(f'wrote {out_path.stat().st_size} bytes')
print(f'first ~400 chars of the JSON:')
print(out_path.read_text()[:400])
out_path.unlink()
wrote 3839 bytes
first ~400 chars of the JSON:
{
"schema_version": "1.0",
"id": "medical-defeasibility-demo",
"title": "Bacterial diagnosis → antibiotics indicated (demo)",
"domain": "medical reasoning",
"description": "Pedagogical benchmark for docs/tutorials/02_authoring_a_benchmark.ipynb. Not for clinical use.",
"bearers": {
"bd": {
"expression": "a has a bacterial diagnosis",
"paraphrases": []
},
"ab": {
What changes for a real run
Replace the ScriptedProvider block (cell 6) with a real provider:
from infereval.providers import get_provider
import os
os.environ.setdefault('ANTHROPIC_API_KEY', '...')
provider = get_provider('anthropic', 'claude-haiku-4-5-20251001')
...and bump n_samples to 5 and max_tokens to at least 256 (higher for reasoning-capable models).
See ../providers.md for the per-provider configuration cheat sheet.
Next steps
- 03_paraphrase_axis_experiment.ipynb: cross-model triangulation of carving-relativity.
- ../interpreting_metrics.md: what to do when the by-tag decomposition shows trouble.