Interpreting metrics¶
You ran infereval evaluate, you ran infereval metrics, you got a wall of numbers. What do they mean?
The four numbers, in order¶
n (items) : 4
coverage (M) : 1.0000
coverage (per analyst) : 1.0000
κ_C(η, consensus) : +1.0000
κ_F(η) : +1.0000
κ_F*(β) (inter-analyst): undefined
coverage — the participation gate¶
cov(η) = fraction of items where the model produced a substantive verdict (good or bad, not abstain).
This is your first sanity check. If it's low, nothing else in the report is fully informative — the kappa metrics are restricted to substantive items, so a 0.3 coverage means you're scoring M on 30% of the benchmark.
What low coverage usually means:
| Coverage | Likely cause | What to do |
|---|---|---|
0.95–1.00 |
Normal | Move on to kappa. |
0.70–0.95 |
Some items are genuinely hard, or you have a small --max-tokens and a verbose model |
Inspect the parse_status field in η: unparseable means the model talked but didn't say a verdict word. budget_clipped means the provider truncated by max_tokens — raise it. sample_failed means a provider error after retries. |
0.30–0.70 |
Likely a verification prompt the model finds genuinely ambiguous, or a content-attribution disagreement (see Haiku-style perceptual default) | Inspect the prompts that produced abstains; try the paraphrase axis (experiments/paraphrase_axis_triangulation.py). |
0.00–0.30 |
Often token-budget or model-misconfigured | Check parse_status distribution. If lots of budget_clipped, bump --max-tokens to 2048+. See providers.md for the reasoning-tokens issue. |
Per-analyst coverage (cov_j(η)) is the analog for each human analyst — useful when authoring a benchmark to spot analysts who abstain a lot (their column has less inferential signal).
κ_C(η, consensus) — agreement with the analysts¶
Cohen's kappa against the analyst consensus c_i (strict-majority vote across analysts; abstain on tie or when most abstain). Restricted to items where both M and the consensus are substantive.
Reading scale, with caveats:
| Value | Loose label | What it means |
|---|---|---|
+1.00 |
perfect | M and the consensus agree on every substantive item. |
+0.80 to +0.99 |
strong | M reliably tracks consensus; disagreements are isolated. |
+0.40 to +0.80 |
moderate | M agrees better than chance but with notable exceptions. Decompose by tag to find the pattern. |
+0.20 to +0.40 |
weak | M's verdicts are correlated with consensus but only modestly. Often this is a content-attribution disagreement, not an inferential one. Try the paraphrase axis (docs/concepts.md). |
0.00 to +0.20 |
none | M is at or near chance. |
< 0.00 |
anti-correlated | M is systematically disagreeing. Often this is a prompt-reading issue (the model is treating "GOOD" as something other than what the verification prompt says). |
undefined |
n/a | Either the substantive subset is empty (raise coverage) or the chance-expected agreement p_e = 1 (everyone — including M — is in a single class on this subset, so kappa has no signal). |
The standard cautions about Landis–Koch labels apply: these are heuristic, not normative. In small-n benchmarks (n < 20), kappa is high-variance and the labels above are at best directional.
Switching the reference: by default the reference is the consensus, but you can compare M against any specific analyst:
This is informative when analysts disagree among themselves: comparing M to each analyst separately can show that M is closer to one analyst's column than the other. The paper's discussion of carving-relativity (§5) is exactly this kind of phenomenon — disagreement among competent labelers reflects different ways of carving the domain, and M may align with one carving more than another.
κ_F(η) — M as the (m+1)-th annotator¶
Fleiss' kappa treats M as if it were one more analyst on the panel. Per-item agreement is the pairwise-pair agreement count across all m+1 annotators, averaged across items, chance-corrected against the pooled marginal distribution.
κ_F is different from κ_C in two key ways:
- Symmetric.
κ_Fdoesn't single out a reference; every annotator contributes equally. This is appropriate when you don't want to privilege analyst consensus overM. - Pooled marginals. The chance-expected agreement is computed from the pooled distribution of all annotators' verdicts. This means
κ_F = κ_Conly by coincidence (typically only when the marginals are symmetric, e.g. under perfect agreement on a balanced item set). Atm = 1,κ_FwithMas the 2nd annotator is not the same number asκ_Cagainst that analyst, except in special cases.
Reading κ_F: same scale as κ_C. What you compare it to is the next metric.
κ_F*(β) — the inter-analyst baseline¶
This is the most important reference number when m ≥ 2. It's Fleiss' kappa computed over the analysts alone (no M).
The paper's Remark 4 makes the point: κ_F*(β) tells you how well the analysts agree with each other before M is in the picture. It's the ceiling above which M is doing something the analysts don't do. It's also the floor: if κ_F*(β) = 0.6, then M reaching κ_F = 0.5 is participating in the practice at a level not far from the analysts' own internal agreement — which is a strong result.
| Comparison | What it tells you |
|---|---|
κ_F(η) ≈ κ_F*(β) |
M's inclusion doesn't lower inter-annotator agreement — it participates at human-analyst-level coherence. |
κ_F(η) > κ_F*(β) |
M is more consistent with the analyst consensus than the analysts are with each other. Plausible if M is just tracking the median, but worth investigating. |
κ_F(η) << κ_F*(β) |
M is systematically disagreeing with the analyst community. Decompose by tag. |
κ_F*(β) = undefined |
Either m < 2 (no inter-analyst comparison possible) or the analysts are unanimous on every item (no signal). Single-analyst benchmarks are perfectly fine — you just lose this comparison. |
If your benchmark has m = 1, expect κ_F* to always be undefined and rely on κ_C for headline numbers.
Per-panel and cross-panel: independent-reference checks (v0.3.0+)¶
If your benchmark declares reference panels (per-analyst panel plus a benchmark-level primary_panel, see authoring_benchmarks.md Step 4b), infereval describe and infereval report surface two extra metrics:
κ_F*(β) per panel:
primary (n_analysts=3) : +0.6824
check (n_analysts=2) : +0.7100
κ_C cross-panel (primary vs check) : +0.5800
| Metric | What it is | Why it matters |
|---|---|---|
inter_analyst_fleiss_per_panel |
κ_F* computed within each declared panel |
Tells you how internally coherent each panel is on its own. A primary panel with κ_F* = 0.10 and a check panel with κ_F* = 0.70 means the primary panel is the one to inspect for analyst-disagreement. |
cross_panel_kappa |
Cohen's κ_C between the two panels' per-item consensuses (default primary vs the first declared check panel; override with --check) |
The construct-validity convergence check (R4 in closing_the_construct_validity_gap.md). A high cross-panel κ_C plus a high κ_C(η, primary-consensus) is the convergent half of a multi-trait/multi-method argument. A model that agrees with primary but disagrees with check is a warning sign that the primary consensus is panel-specific rather than carving-specific. |
The same undefined rules apply as for the corresponding base metrics (empty substantive intersection, p_e = 1, single-analyst panel).
Decompositions: when the overall number isn't enough¶
A single κ_C = 0.40 could mean many different things:
Mis uniformly mediocre across all items;Mis perfect on some items and chance on others;Mis perfect on "easy" items (base inferences) and consistently wrong on "hard" ones (defeaters);Mis fine on items with a particular tag and broken on others.
The first three are inferentially different and the fourth is a smoking gun for content-attribution issues. Decomposition is how you tell which it is.
By tag¶
infereval metrics eta.json --benchmark bench.json \
--by-tag base-inference \
--by-tag irrelevant-addition \
--by-tag defeater
Produces a separate n / coverage / κ_C / κ_F / κ_F* block per tag. This is the single most useful tool the framework gives you for diagnosis. A pattern like:
By tag: base-inference κ_C = +1.0000
By tag: irrelevant-addition κ_C = -1.0000
By tag: defeater κ_C = +1.0000
tells you a precise story: M handles base inferences and defeaters but is systematically wrong on irrelevant additions. This is exactly the cross-model finding from experiments/paraphrase_axis_triangulation.py: Claude Haiku 4.5 vs. the paper's analyst row.
Choose tag conventions when authoring the benchmark with this decomposition in mind. ["base-inference"], ["irrelevant-addition"], ["defeater"] are the obvious ones for RSR-style work; add domain-specific tags too (["nighttime"], ["coinfection"]).
By RSR target¶
Filter to items probing one specific target inference. Useful when a benchmark covers multiple target inferences (multiple rsr_target groupings) and you want the per-target story.
Beyond metrics: structure, factor effects, sensitivity (v0.4.0+)¶
infereval metrics answers "how well does M agree with the analysts on this benchmark, this run". Three other CLI commands answer adjacent construct-validity questions; their outputs are also consumed by infereval report.
infereval structure benchmark.json — content-validity gate¶
Three deterministic checks on the benchmark itself (no model, no evaluation needed):
| Check | What it verifies | A failure means |
|---|---|---|
| Containment | Every declared bearer appears in at least one item's Γ or Δ |
An orphaned bearer — likely a carving slip; either delete it or add an item that exercises it. |
| RSR role consistency | For each rsr_target group, the role hierarchy (base → addition → defeater) is logically consistent |
Misordered roles within a target — the RSR analysis won't be interpretable. |
| Base-case stability | Every rsr_target group contains at least one base-inference item (the Γ from which additions are built) |
No anchor for the RSR sweep — additions and defeaters have nothing to attach to. |
Read it as a gate: if structure returns non-zero, the benchmark fails content validity (R3) and metrics from it should be reported with that caveat. report echoes the structure verdict in §3 of its output.
infereval model eta.json --benchmark β.json — factor effects (v0.4.1)¶
Logistic regression of per-sample agreement (1 if M matches the reference, 0 otherwise) on the declared benchmark.factors levels, with item-clustered standard errors. Requires infereval[stats].
The output has two parts:
- Per-factor joint Wald tests (
factor_wald) — one p-value per declared factor, testing the null "this factor has no effect on agreement". A small p-value (e.g.< 0.05) means agreement varies systematically across the factor's levels — that factor is doing inferential work. A large p-value means the factor is not differentiating the model's behavior at this sample size. - Per-level coefficients (
effects) — log-odds of agreement at this level relative to the alphabetically-first level of the same factor (the baseline). Positive coef → higher agreement than baseline; negative → lower. Thep_valueis the per-level Wald test; theconf_int_low/highis the 95% CI on the log-odds.
pseudo_r2 (McFadden) gives a rough sense of overall fit; values above ~0.2 indicate the declared factors meaningfully predict per-sample agreement. This is the GLMM-proxy specified in R10 of the construct-validity programme.
infereval sweep — sensitivity analysis (v0.4.2)¶
Re-runs metrics across a swept parameter (e.g. --vary tie_break --values abstain,good,bad) and reports a stability verdict based on the range of κ_C across the sweep:
κ_C range |
Verdict | Reading |
|---|---|---|
< 0.05 |
stable across the sweep range |
Choice of parameter doesn't materially affect agreement. Safe to publish at any of the swept values. |
0.05 ≤ r < 0.10 |
moderately sensitive |
Some dependence on the parameter; report the specific value used and acknowledge the range. |
≥ 0.10 |
varies substantively |
Headline numbers are not robust to the parameter. Consider tighter parameter choices or a wider analyst panel before publishing a mastery claim. |
The point is not to pick the best parameter post-hoc — that's p-hacking. The point is to show that the parameter you committed to (infereval evaluate ...) wasn't load-bearing. report consumes the sweep summary and downgrades the construct-validity verdict by one tier if the sweep is anything other than stable (R11).
infereval report — the construct-validity envelope (v0.5.0)¶
Consumes the claims file plus the outputs of the four analytical commands above and emits a single Markdown report with a deterministic verdict over five tiers (Strongly substantiated / Substantiated / Partially substantiated / Limited / Insufficient). The verdict is a function of the claims declared (R-numbers) plus the structural and inferential evidence; it is not a free-text summary.
Two interactions to be aware of:
- Suppression asymmetry: if
negative_findings_suppressed: trueis set in the claims file, the verdict downgrades one tier. This makes silence about negative findings explicitly costly in the headline number. - Sweep instability penalty: any sweep verdict other than "stable" triggers a one-tier downgrade with a logged rationale.
The full mapping from claims and evidence to verdict tier is documented in closing_the_construct_validity_gap.md; the practitioner's walk-through is in construct_validity_workflow.md.
Edge cases the framework reports rather than hides¶
The metrics functions return None (with a logged warning) rather than raise. These cases are not bugs:
κ_Cundefined when substantive subset is empty. All items haveMabstain, or the reference abstain, or both. No items to compare.κ_Cundefined whenp_e = 1. On the substantive subset, every item is the same class (allgoodor allbad) for bothMand the reference. Chance and observed are both 1.0; their difference is 0/0. This often appears on small tag subsets where the analysts unanimously labeled one way andMmatched.κ_Fundefined when no items have all-substantive annotations. Same idea asκ_CemptyS, applied across the full annotator set.κ_F*undefined whenm < 2. Single-analyst benchmarks have no inter-analyst comparison.κ_F*undefined when analysts are unanimous.m ≥ 2analysts agreeing perfectly on every item givesP̄_e = 1and chance and observed are both 1.0. No signal.
Each of these returns null in the JSON output and "undefined" in the text / markdown output, with the reason logged at WARNING. They are diagnostic, not failures — they tell you something concrete about the structure of your benchmark or the model's behavior.
A worked reading: the live Haiku result from the M9 demo¶
We ran Claude Haiku 4.5 against the stop-sign benchmark with the original δ(ra) = "a is red" and got:
n (items) : 4
coverage (M) : 1.0000
κ_C(η, consensus) : +0.2000
κ_F(η) : +0.0000
κ_F*(β) (inter-analyst): undefined // m = 1
By tag: base-inference κ_C undefined (n=1; only one substantive item — p_e = 1)
By tag: irrelevant-addition κ_F = -1.0000 (n=2; perfect anti-correlation!)
By tag: defeater κ_C undefined (n=1; same reason as base-inference)
Step-by-step reading:
- Coverage is 1.00. No abstains; the model committed on every item. Good baseline.
- Overall κ_C = +0.20. Above chance but well below agreement. Something is going wrong.
- By-tag decomposition is the smoking gun.
irrelevant-addition: κ_F = -1.00meansMperfectly disagrees with the analyst on every item taggedirrelevant-addition. Those are exactly the two items whereΓ = {sa, n}andΓ = {sa, n, nr}— the nighttime / non-reflective additions. - Hypothesis:
Mreadsis redas a perceptual claim (nighttime defeats visibility), not an intrinsic claim. Both readings are coherent; they just denote different things.
Following up on the hypothesis is what experiments/paraphrase_axis_triangulation.py does: hold the analyst row constant, vary δ(ra), see whether the disagreement comes from the bearer carving. (It does. The intrinsic phrasing a has the standard color of stop signs flipped row-1 to good and raised κ_C to +0.50.)
This is the methodology working: it gave a single number (+0.20), then a decomposition that localized the disagreement (irrelevant-addition: -1.00), then a hypothesis the analyst could test by varying δ. Each step in this chain is a concrete output of the framework.
When the numbers are surprising¶
If the kappa numbers don't match your prior intuition for the model, the order of diagnostic steps is:
- Coverage low? Token-budget or prompt-ambiguity issue. Fix
--max-tokens; inspect a fewunparseablesamples inη. - Decompose by tag. Is the disagreement concentrated in one tag? That's almost always content-attributional.
- Inspect the actual prompts.
infereval evaluate --dry-runprints what the model is seeing. Verify the framing is what you intended. - Probe with reasoning enabled. Outside the framework, ask the model the same question with room to explain. Often the model's reasoning makes its content-attribution explicit (Claude's verbatim "we cannot infer that the sign displays its standard color — only that it possesses that color as a property" is the canonical example).
- Vary
δ. If the disagreement is content-attributional, swapping the bearer phrasing for the analyst's reading should recover agreement. Run a second benchmark with the variant; seeexperiments/paraphrase_axis_triangulation.pyfor the pattern.
Where to go next¶
concepts.mdfor the methodology's terms.authoring_benchmarks.mdif your interpretation suggests you need a richer benchmark — especially the panel/factor/construction-metadata fields that feed the new analytical commands.construct_validity_workflow.mdfor the end-to-end practitioner's guide: how to chainstructure→metrics→model→sweep→reportinto reproducible evidence for an inferential-mastery claim.closing_the_construct_validity_gap.mdfor which construct-validity requirements (R1–R21) each metric speaks to.providers.mdfor per-provider quirks (specifically, what to set--max-tokensto).experiments/paraphrase_axis_triangulation.pyfor a worked example of the diagnostic chain above.