Tutorial 02: Authoring a benchmark

Build a small medical-defeasibility benchmark from scratch and run it against a mock provider. No API key is required: a ScriptedProvider stands in for the model, so the notebook runs to completion without spending any tokens.

Companion reading: ../authoring_benchmarks.md, which covers the same material in prose form.

1. Pick a target inference

We'll probe the inference bacterial diagnosis ⟹ antibiotics indicated. Side-premises we want to test:

a fever — irrelevant addition (should preserve the inference)
a cough — irrelevant addition
a viral co-infection — irrelevant for this purpose (bacterial part still warrants antibiotics)
a PCR false-positive flag — defeater (the bacterial diagnosis is invalid)
full recovery — defeater (no remaining indication)

From those we get our bearer set.

In [1]:

bearers = {
    'bd':  {'expression': 'a has a bacterial diagnosis'},
    'ab':  {'expression': 'antibiotics are indicated for a'},
    'sf':  {'expression': 'a has a fever'},
    'cf':  {'expression': 'a has a cough'},
    'vd':  {'expression': 'a has a viral co-infection'},
    'pcr': {'expression': "a's bacterial diagnosis was a PCR false positive"},
    're':  {'expression': 'a has fully recovered'},
}
for bid, b in bearers.items():
    print(f'  {bid:4} -> {b["expression"]}')

  bd   -> a has a bacterial diagnosis
  ab   -> antibiotics are indicated for a
  sf   -> a has a fever
  cf   -> a has a cough
  vd   -> a has a viral co-infection
  pcr  -> a's bacterial diagnosis was a PCR false positive
  re   -> a has fully recovered

2. Declare an analyst panel

We declare two analysts, the minimum for which the inter-analyst agreement κ_F*(β) is defined. In a real benchmark you'd source verdicts from actual clinicians.

In [2]:

analysts = [
    {'id': 'physician-a', 'display_name': 'Dr. A (internal medicine)',
     'notes': 'Labels reflect the analyst\'s reading of standard clinical practice.'},
    {'id': 'physician-b', 'display_name': 'Dr. B (infectious disease)'},
]

3. Write the items

Each item has an id, premises (a set of bearer ids), conclusions, an analyst_verdicts tuple (one verdict per analyst), optional tags, and an optional rsr_target.

Note the coinfection item: both physicians agree the inference is still GOOD, because the bacterial diagnosis warrants antibiotics regardless of the viral co-infection. It is an irrelevant addition, not a defeater, so we tag it irrelevant-addition; the by-tag decomposition in section 8 groups items by these tags.

In [3]:

items = [
    {'id': 'base',
     'premises': ['bd'], 'conclusions': ['ab'],
     'analyst_verdicts': ['good', 'good'],
     'tags': ['base-inference'],
     'rsr_target': {'X': ['bd'], 'A': ['ab']}},
    {'id': 'irrelevant-fever',
     'premises': ['bd', 'sf'], 'conclusions': ['ab'],
     'analyst_verdicts': ['good', 'good'],
     'tags': ['irrelevant-addition', 'symptom'],
     'rsr_target': {'X': ['bd'], 'A': ['ab']}},
    {'id': 'irrelevant-cough',
     'premises': ['bd', 'cf'], 'conclusions': ['ab'],
     'analyst_verdicts': ['good', 'good'],
     'tags': ['irrelevant-addition', 'symptom'],
     'rsr_target': {'X': ['bd'], 'A': ['ab']}},
    {'id': 'irrelevant-coinfection',
     'premises': ['bd', 'vd'], 'conclusions': ['ab'],
     'analyst_verdicts': ['good', 'good'],
     'tags': ['irrelevant-addition', 'coinfection'],
     'rsr_target': {'X': ['bd'], 'A': ['ab']}},
    {'id': 'defeater-pcr',
     'premises': ['bd', 'pcr'], 'conclusions': ['ab'],
     'analyst_verdicts': ['bad', 'bad'],
     'tags': ['defeater', 'false-positive'],
     'rsr_target': {'X': ['bd'], 'A': ['ab']}},
    {'id': 'defeater-recovered',
     'premises': ['bd', 're'], 'conclusions': ['ab'],
     'analyst_verdicts': ['bad', 'bad'],
     'tags': ['defeater', 'recovered'],
     'rsr_target': {'X': ['bd'], 'A': ['ab']}},
]
print(f'{len(items)} items defined')

6 items defined
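
Since section 8 decomposes results by tag, it can help to check how many items each tag will contribute before running anything. The sketch below counts tags with the standard library only; the item list is a trimmed copy of the one above (ids and tags only), not a call into infereval.

```python
from collections import Counter

# Trimmed copy of the items above: only the fields needed for counting.
items = [
    {'id': 'base', 'tags': ['base-inference']},
    {'id': 'irrelevant-fever', 'tags': ['irrelevant-addition', 'symptom']},
    {'id': 'irrelevant-cough', 'tags': ['irrelevant-addition', 'symptom']},
    {'id': 'irrelevant-coinfection', 'tags': ['irrelevant-addition', 'coinfection']},
    {'id': 'defeater-pcr', 'tags': ['defeater', 'false-positive']},
    {'id': 'defeater-recovered', 'tags': ['defeater', 'recovered']},
]

# Tags are optional, so fall back to an empty list when an item has none.
tag_counts = Counter(tag for item in items for tag in item.get('tags', []))
for tag, count in tag_counts.most_common():
    print(f'{tag:22} {count}')
```

A tag that covers only one or two items, or whose items all carry the same verdict, gives the agreement statistics very little to work with.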

4. Assemble and validate

Wrap everything in a dict and run it through the Benchmark Pydantic model. Validation catches:

Bearer ids referenced in premises / conclusions / rsr_target that don't exist in bearers.
analyst_verdicts tuples whose length differs from the number of analysts.
Duplicate item ids, duplicate analyst ids.
Invalid verdict values.
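
To make those checks concrete, here is a rough pure-Python approximation of them. It is not the Benchmark model's actual implementation: check_benchmark is a hypothetical helper, and the allowed verdict set ('good', 'bad') is an assumption based on the values used in this tutorial.

```python
def check_benchmark(bearers, analysts, items, allowed=('good', 'bad')):
    """Return a list of problems; an empty list means the dicts look consistent.

    Hypothetical helper that mirrors the four checks listed above.
    """
    problems = []

    analyst_ids = [a['id'] for a in analysts]
    if len(set(analyst_ids)) != len(analyst_ids):
        problems.append('duplicate analyst ids')

    seen_item_ids = set()
    for item in items:
        iid = item['id']
        if iid in seen_item_ids:
            problems.append(f'duplicate item id: {iid}')
        seen_item_ids.add(iid)

        # Collect every bearer id the item mentions, including rsr_target.
        referenced = set(item['premises']) | set(item['conclusions'])
        for ids in (item.get('rsr_target') or {}).values():
            referenced |= set(ids)
        unknown = sorted(referenced - set(bearers))
        if unknown:
            problems.append(f'{iid}: unknown bearer ids {unknown}')

        verdicts = item['analyst_verdicts']
        if len(verdicts) != len(analysts):
            problems.append(f'{iid}: {len(verdicts)} verdicts for {len(analysts)} analysts')
        invalid = [v for v in verdicts if v not in allowed]
        if invalid:
            problems.append(f'{iid}: invalid verdicts {invalid}')
    return problems


bearers = {'bd': {}, 'ab': {}}
analysts = [{'id': 'physician-a'}, {'id': 'physician-b'}]
ok_item = {'id': 'base', 'premises': ['bd'], 'conclusions': ['ab'],
           'analyst_verdicts': ['good', 'good'],
           'rsr_target': {'X': ['bd'], 'A': ['ab']}}
# Reuses the id 'base', references a missing bearer, and has one verdict too few.
bad_item = {'id': 'base', 'premises': ['bd', 'ghost-bearer'], 'conclusions': ['ab'],
            'analyst_verdicts': ['good']}

problems = check_benchmark(bearers, analysts, [ok_item, bad_item])
for p in problems:
    print(p)
```

Collecting all problems in a list, instead of raising on the first one, is a design choice for this sketch only; it makes it easier to fix a hand-written benchmark in one pass.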

In [4]:

from infereval.benchmark import Benchmark

bench_dict = {
    'schema_version': '1.0',
    'id': 'medical-defeasibility-demo',
    'title': 'Bacterial diagnosis → antibiotics indicated (demo)',
    'domain': 'medical reasoning',
    'description': 'Pedagogical benchmark for docs/tutorials/02_authoring_a_benchmark.ipynb. '
                   'Not for clinical use.',
    'bearers': bearers,
    'analysts': analysts,
    'items': items,
}
bench = Benchmark.model_validate(bench_dict)
print(f'id    : {bench.id}')
print(f'|B|   : {len(bench.bearers)} bearers')
print(f'n     : {bench.n} items')
print(f'm     : {bench.m} analysts')

id    : medical-defeasibility-demo
|B|   : 7 bearers
n     : 6 items
m     : 2 analysts

Validation rejects malformed input. To confirm, reference an unknown bearer id in an item's premises:

In [5]:

import copy
from pydantic import ValidationError

bad = copy.deepcopy(bench_dict)
bad['items'][0]['premises'].append('ghost-bearer')
try:
    Benchmark.model_validate(bad)
except ValidationError as exc:
    print('caught (expected):')
    print(f'  {exc.errors()[0]["msg"]}')

caught (expected):
  Value error, Item 'base' references unknown bearer ids: premises=['ghost-bearer'], conclusions=[]

5. Inspect the inter-analyst baseline

With m = 2 and verdicts spread over both classes (some items GOOD, some BAD), κ_F*(β) is defined. It should also be high, because both physicians agree on every item.

In [6]:

from infereval.metrics import inter_analyst_fleiss

kappa_star = inter_analyst_fleiss(bench)
print(f'κ_F*(β) = {kappa_star}')

κ_F*(β) = 1.0

Perfect agreement: both analysts give the same verdict on every item. In a real benchmark you would expect roughly 0.6–0.95, reflecting genuine domain disagreement on edge cases.
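
For intuition, the value above can be reproduced by hand with the textbook Fleiss' kappa formula. This is a sketch under the assumption that inter_analyst_fleiss follows the standard definition; fleiss_kappa below is a hypothetical helper, not the library function.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Textbook Fleiss' kappa.

    ratings: one list of labels per item, all with the same number of raters (>= 2).
    Returns None when chance-expected agreement is 1, where kappa is undefined.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    totals = Counter()
    p_bar = 0.0
    for row in ratings:
        counts = Counter(row)
        totals.update(counts)
        # Share of rater pairs that agree on this item.
        p_bar += sum(c * (c - 1) for c in counts.values()) / (n_raters * (n_raters - 1))
    p_bar /= n_items
    # Chance-expected agreement from the overall class proportions.
    p_e = sum((c / (n_items * n_raters)) ** 2 for c in totals.values())
    if p_e == 1.0:
        return None
    return (p_bar - p_e) / (1.0 - p_e)


# Analyst verdicts from section 3: four GOOD/GOOD items, two BAD/BAD items.
analyst_rows = [['good', 'good']] * 4 + [['bad', 'bad']] * 2
kappa = fleiss_kappa(analyst_rows)
print(kappa)  # 1.0
```

If every verdict fell in one class, the function would return None, which is the degenerate case the statistic cannot handle.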

6. Evaluate against a `ScriptedProvider`

We don't have a real provider here, so we'll script the responses. Pretend a model gets the irrelevant additions right but misses one defeater. This is the kind of pattern the by-tag decomposition is designed to surface.

A ScriptedProvider hands out its responses round-robin across all calls, so the list must follow item order. With 6 items × 3 samples each, we need 18 responses.
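
One way to avoid miscounting is to generate the flat response list from a per-item plan. The sketch below assumes, as the comment in the cell states, that responses are consumed in item order with n_samples consecutive entries per item; planned and N_SAMPLES are illustrative names, not part of infereval.

```python
N_SAMPLES = 3  # must match EndorsementConfig(n_samples=...)

# Intended model verdict per item, in benchmark order
# (dicts preserve insertion order in Python 3.7+).
planned = {
    'base': 'GOOD',
    'irrelevant-fever': 'GOOD',
    'irrelevant-cough': 'GOOD',
    'irrelevant-coinfection': 'GOOD',
    'defeater-pcr': 'GOOD',        # deliberately wrong: analysts said BAD
    'defeater-recovered': 'BAD',
}

# Repeat each verdict N_SAMPLES times, keeping item order.
scripted_responses = [verdict for verdict in planned.values() for _ in range(N_SAMPLES)]

# Fail early if the script and the benchmark drift apart.
assert len(scripted_responses) == len(planned) * N_SAMPLES
print(len(scripted_responses))  # 18
```

The resulting list is equal to the hand-written one in the cell, so it can be passed to ScriptedProvider(responses=...) unchanged.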

In [7]:

from infereval.evaluation import EndorsementConfig, ProviderParams, evaluate
from infereval.providers.mock import ScriptedProvider

# Order matches the items above: base, fever, cough, coinfection, pcr, recovered
# Each item gets 3 responses. We'll script the model to agree on base + irrelevant additions,
# but flip on the PCR defeater (says GOOD when analyst says BAD).
scripted_responses = (
    ['GOOD', 'GOOD', 'GOOD'] +   # base
    ['GOOD', 'GOOD', 'GOOD'] +   # irrelevant-fever
    ['GOOD', 'GOOD', 'GOOD'] +   # irrelevant-cough
    ['GOOD', 'GOOD', 'GOOD'] +   # irrelevant-coinfection
    ['GOOD', 'GOOD', 'GOOD'] +   # defeater-pcr  (model is WRONG here — analyst said BAD)
    ['BAD',  'BAD',  'BAD']      # defeater-recovered  (model agrees)
)
provider = ScriptedProvider(responses=scripted_responses, model_id='scripted-mock-v1')

eta = evaluate(
    bench,
    provider,
    config=EndorsementConfig(n_samples=3),
    params=ProviderParams(temperature=0.0, max_tokens=32),
    run_id='medical-demo-1',
)

print(f'{"item":24} {"analyst-A":10} {"analyst-B":10} {"model":10} agree?')
print('-' * 70)
for it in eta.items:
    a, b = it.analyst_verdicts
    m = it.model_verdict
    agree_a = '✓' if m == a else '✗'
    print(f'{it.id:24} {a.value:10} {b.value:10} {m.value:10} A:{agree_a}')

item                     analyst-A  analyst-B  model      agree?
----------------------------------------------------------------------
base                     good       good       good       A:✓
irrelevant-fever         good       good       good       A:✓
irrelevant-cough         good       good       good       A:✓
irrelevant-coinfection   good       good       good       A:✓
defeater-pcr             bad        bad        good       A:✗
defeater-recovered       bad        bad        bad        A:✓

7. Compute headline metrics

In [8]:

from infereval.metrics import MetricsReport

report = MetricsReport(eta=eta, benchmark=bench)
for k, v in report.to_dict().items():
    print(f'  {k}: {v}')

  n: 6
  coverage: 1.0
  coverage_per_analyst: [1.0, 1.0]
  cohens_kappa_consensus: 0.5714285714285715
  fleiss_kappa: 0.723076923076923
  inter_analyst_fleiss: 1.0
  coverage_per_analyst_named: {'physician-a': 1.0, 'physician-b': 1.0}
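
The cohens_kappa_consensus figure can be checked by hand. The sketch below assumes the consensus label is the verdict both analysts share (they are unanimous here) and uses the textbook Cohen's kappa; cohens_kappa is a hypothetical helper, not the library's implementation.

```python
from collections import Counter


def cohens_kappa(reference, model):
    """Textbook Cohen's kappa between two label sequences of equal length.

    Returns None when chance-expected agreement is 1, where kappa is undefined.
    """
    n = len(reference)
    # Observed agreement: share of items where the two labels match.
    p_o = sum(r == m for r, m in zip(reference, model)) / n
    # Chance-expected agreement from the two marginal distributions.
    ref_freq, mod_freq = Counter(reference), Counter(model)
    p_e = sum(ref_freq[c] * mod_freq[c] for c in ref_freq) / (n * n)
    if p_e == 1.0:
        return None
    return (p_o - p_e) / (1.0 - p_e)


# Item order: base, fever, cough, coinfection, pcr, recovered.
consensus = ['good', 'good', 'good', 'good', 'bad', 'bad']
model = ['good', 'good', 'good', 'good', 'good', 'bad']

kappa = cohens_kappa(consensus, model)
print(f'{kappa:.4f}')  # 0.5714
```

Observed agreement is 5/6 and chance-expected agreement is 22/36, so κ = (5/6 − 22/36) / (1 − 22/36) = 4/7, matching the report.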

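8. Decompose by tag

This is where the diagnosis happens. We expect:

base-inference: the model matches the analysts, but the subset holds a single item in a single class, so κ is undefined.
irrelevant-addition: the model matches on all three subtypes, but every verdict is GOOD, so κ is again undefined.
defeater: disagreement on pcr, agreement on recovered. With only two items, both BAD for the analysts, the raw agreement of 1/2 equals the chance-expected agreement, so κ_C = 0.

The decomposition lets the analyst see exactly which kind of inference the model is mishandling.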

In [9]:

for tag in ('base-inference', 'irrelevant-addition', 'defeater'):
    sub = report.by_tag(tag)
    kc = sub.cohens_kappa()
    kf = sub.fleiss_kappa
    fkc = f'{kc:+.4f}' if kc is not None else 'undefined'
    fkf = f'{kf:+.4f}' if kf is not None else 'undefined'
    print(f'by-tag {tag:24} n={sub.n}  cov={sub.coverage:.4f}  κ_C={fkc:10} κ_F={fkf}')

kappa_C undefined: chance-expected agreement p_e = 1 (M and reference both degenerate on a single class over S)

Fleiss kappa undefined: chance-expected agreement P_bar_e = 1 (all annotations in one class over S_F)

kappa_C undefined: chance-expected agreement p_e = 1 (M and reference both degenerate on a single class over S)

Fleiss kappa undefined: chance-expected agreement P_bar_e = 1 (all annotations in one class over S_F)

by-tag base-inference           n=1  cov=1.0000  κ_C=undefined  κ_F=undefined
by-tag irrelevant-addition      n=3  cov=1.0000  κ_C=undefined  κ_F=undefined
by-tag defeater                 n=2  cov=1.0000  κ_C=+0.0000    κ_F=-0.2000
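
The undefined values and the zero for defeater follow directly from the chance-expected agreement p_e. The sketch below recomputes observed agreement, p_e, and Cohen's kappa per tag from the verdict table in section 6, assuming the textbook formula and treating the unanimous analyst verdict as the reference; agreement_stats is a hypothetical helper.

```python
from collections import Counter

# (item id, tags, analyst verdict, model verdict), copied from the tables above.
rows = [
    ('base', ['base-inference'], 'good', 'good'),
    ('irrelevant-fever', ['irrelevant-addition', 'symptom'], 'good', 'good'),
    ('irrelevant-cough', ['irrelevant-addition', 'symptom'], 'good', 'good'),
    ('irrelevant-coinfection', ['irrelevant-addition', 'coinfection'], 'good', 'good'),
    ('defeater-pcr', ['defeater', 'false-positive'], 'bad', 'good'),
    ('defeater-recovered', ['defeater', 'recovered'], 'bad', 'bad'),
]


def agreement_stats(pairs):
    """Return (p_o, p_e, kappa) for (reference, model) pairs; kappa is None if p_e == 1."""
    n = len(pairs)
    p_o = sum(ref == mod for ref, mod in pairs) / n
    ref_freq = Counter(ref for ref, _ in pairs)
    mod_freq = Counter(mod for _, mod in pairs)
    p_e = sum(ref_freq[c] * mod_freq[c] for c in ref_freq) / (n * n)
    kappa = None if p_e == 1.0 else (p_o - p_e) / (1.0 - p_e)
    return p_o, p_e, kappa


stats = {}
for tag in ('base-inference', 'irrelevant-addition', 'defeater'):
    # Keep only the items that carry this tag.
    pairs = [(ref, mod) for _, tags, ref, mod in rows if tag in tags]
    stats[tag] = agreement_stats(pairs)
    print(f'{tag:22} n={len(pairs)}  (p_o, p_e, kappa)={stats[tag]}')
```

On the two single-class subsets p_e equals 1 and the denominator 1 − p_e vanishes, so kappa cannot be computed even though the model agrees on every item. On defeater, p_o and p_e are both 0.5, which gives kappa 0. With subsets this small, the raw per-item table is more informative than the statistic.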

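9. Persist the benchmark

Once you're satisfied, write the benchmark JSON to disk and commit it to version control. The notebook cell (In [10]) writes to a temporary file and deletes it afterwards so that it leaves nothing behind; for a real benchmark, pass bench.dump a path inside your repository.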
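
If you ever need to read or write the same JSON without infereval, the standard library is enough. The sketch below is an illustration under assumptions, not a replacement for bench.dump: it works on a plain dict (a trimmed copy of bench_dict), and the file name is hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Trimmed copy of bench_dict from section 4; the title contains a non-ASCII arrow.
bench_dict = {
    'schema_version': '1.0',
    'id': 'medical-defeasibility-demo',
    'title': 'Bacterial diagnosis → antibiotics indicated (demo)',
}

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / 'medical_defeasibility_demo.json'  # hypothetical file name
    # ensure_ascii=False keeps the arrow readable in diffs; the explicit UTF-8
    # encoding avoids depending on the platform's default encoding.
    out_path.write_text(
        json.dumps(bench_dict, indent=2, ensure_ascii=False) + '\n',
        encoding='utf-8',
    )
    reloaded = json.loads(out_path.read_text(encoding='utf-8'))

print(reloaded == bench_dict)  # True
```

Stable indentation and a trailing newline keep version-control diffs small when the benchmark is edited later.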

In [10]:

import tempfile
from pathlib import Path

with tempfile.NamedTemporaryFile(suffix='.json', delete=False) as f:
    out_path = Path(f.name)
bench.dump(out_path)
print(f'wrote {out_path.stat().st_size} bytes')
print(f'first ~400 chars of the JSON:')
print(out_path.read_text()[:400])
out_path.unlink()

wrote 3839 bytes
first ~400 chars of the JSON:
{
  "schema_version": "1.0",
  "id": "medical-defeasibility-demo",
  "title": "Bacterial diagnosis → antibiotics indicated (demo)",
  "domain": "medical reasoning",
  "description": "Pedagogical benchmark for docs/tutorials/02_authoring_a_benchmark.ipynb. Not for clinical use.",
  "bearers": {
    "bd": {
      "expression": "a has a bacterial diagnosis",
      "paraphrases": []
    },
    "ab": {

What changes for a real run

Replace the ScriptedProvider block in section 6 with a real provider:

from infereval.providers import get_provider
import os
os.environ.setdefault('ANTHROPIC_API_KEY', '...')
provider = get_provider('anthropic', 'claude-haiku-4-5-20251001')

Then raise n_samples to 5 and max_tokens to at least 256 (higher for reasoning-capable models).

See ../providers.md for the per-provider configuration cheat sheet.

Next steps

03_paraphrase_axis_experiment.ipynb — cross-model triangulation of carving-relativity.
../interpreting_metrics.md — what to do when the by-tag decomposition shows trouble.