API reference¶

Auto-generated from the docstrings in src/infereval/. The docstrings are maintained as a first-class artifact and cross-referenced against the paper, so help(infereval.metrics.cohens_kappa) is reliable; this page renders the same content as a navigable site.

If you're looking for symbolic notation rather than callables, see the Glossary.

Core data types¶

infereval.types.Verdict ¶

Bases: str, Enum

Endorsement verdict :math:E_M(\langle \Gamma, \Delta \rangle).

String-valued so JSON serialization yields "good" / "bad" / "abstain" directly. The (str, Enum) pattern is used in place of :class:enum.StrEnum for Python 3.10 compatibility.
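
The serialization behaviour described above can be sketched with a stand-in class. This is an illustration, not the library's definition: only Verdict.GOOD appears in the source shown on this page, so the member names BAD and ABSTAIN are assumptions inferred from the serialized values.

```python
import json
from enum import Enum


class Verdict(str, Enum):
    """Illustrative stand-in for infereval.types.Verdict."""

    GOOD = "good"
    BAD = "bad"          # assumed member name
    ABSTAIN = "abstain"  # assumed member name


# Members are real str instances, so the stdlib encoder writes the bare value.
payload = json.dumps({"verdict": Verdict.GOOD})

# Looking a member up by value recovers the same singleton after a round trip.
restored = Verdict(json.loads(payload)["verdict"])
```

Because each member is also a str, comparisons such as Verdict.ABSTAIN == "abstain" hold without any custom encoder or decoder.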

infereval.types.Bearer `dataclass` ¶

Bearer(id: str, expression: str, paraphrases: tuple[str, ...] = ())

A propositional content-bearer :math:\varphi \in B.

Parameters¶

id: Short stable identifier, e.g. "sa" for "a is a stop sign".

expression: Canonical natural-language statement :math:\delta(\varphi). May contain TeX-math delimiters (e.g. "$a$ is a stop sign"); these are stripped at prompt-construction time, not here.

paraphrases: Optional family of meaning-preserving variants of :math:\delta(\varphi). Empty by default. Supports the paraphrase axis of variation discussed in the paper's Discussion.

all_expressions ¶

all_expressions() -> tuple[str, ...]

Return the canonical expression followed by any paraphrases.

Source code in src/infereval/types.py

def all_expressions(self) -> tuple[str, ...]:
    """Return the canonical expression followed by any paraphrases."""
    return (self.expression, *self.paraphrases)
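
A small usage sketch, assuming a frozen dataclass with the three fields from the signature above. The frozen flag and the paraphrase text are assumptions made for the example; the page itself only states that Bearer is a dataclass.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)  # frozen is an assumption, not confirmed by this page
class Bearer:
    id: str
    expression: str
    paraphrases: tuple[str, ...] = ()

    def all_expressions(self) -> tuple[str, ...]:
        # Canonical expression first, then the paraphrases in declared order.
        return (self.expression, *self.paraphrases)


stop_sign = Bearer(
    id="sa",
    expression="$a$ is a stop sign",
    paraphrases=("$a$ is a stop-sign",),  # made-up paraphrase for illustration
)
expressions = stop_sign.all_expressions()
```

Note that the TeX delimiters survive here; as the parameter description says, they are stripped later, at prompt-construction time.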

infereval.types.Implication `dataclass` ¶

Implication(premises: frozenset[str], conclusions: frozenset[str], id: str | None = None)

A candidate implication :math:\langle \Gamma, \Delta \rangle.

Premises and conclusions are frozensets of bearer ids (the id field of :class:Bearer). The optional id field is a benchmark-level reference label; it is excluded from equality and hashing so that two implications with the same premise/conclusion sets compare equal regardless of label.

is_empty_empty `property` ¶

is_empty_empty: bool

Whether this is the :math:\langle \emptyset, \emptyset \rangle implication.

Excluded from :math:I_M by stipulation (Definition 3, last sentence).

of `classmethod` ¶

of(premises: Iterable[str], conclusions: Iterable[str], *, id: str | None = None) -> Implication

Convenience constructor accepting any iterables of bearer ids.

Source code in src/infereval/types.py

@classmethod
def of(
    cls,
    premises: Iterable[str],
    conclusions: Iterable[str],
    *,
    id: str | None = None,
) -> Implication:
    """Convenience constructor accepting any iterables of bearer ids."""
    return cls(frozenset(premises), frozenset(conclusions), id=id)

intersects ¶

intersects() -> bool

Whether :math:\Gamma \cap \Delta \neq \emptyset (Definition 3, clause (i)).

Source code in src/infereval/types.py

def intersects(self) -> bool:
    """Whether :math:`\\Gamma \\cap \\Delta \\neq \\emptyset` (Definition 3, clause (i))."""
    return bool(self.premises & self.conclusions)
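
One way to get the "label excluded from equality and hashing" behaviour is a frozen dataclass whose id field is declared with compare=False. The sketch below assumes that mechanism and an obvious body for is_empty_empty; neither is confirmed by the excerpts on this page, and the bearer id "ra" is made up.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterable


@dataclass(frozen=True)
class Implication:
    premises: frozenset[str]
    conclusions: frozenset[str]
    # compare=False drops the label from __eq__ and from the generated __hash__.
    id: str | None = field(default=None, compare=False)

    @classmethod
    def of(
        cls,
        premises: Iterable[str],
        conclusions: Iterable[str],
        *,
        id: str | None = None,
    ) -> Implication:
        return cls(frozenset(premises), frozenset(conclusions), id=id)

    @property
    def is_empty_empty(self) -> bool:
        # Assumed implementation: both sides empty.
        return not self.premises and not self.conclusions

    def intersects(self) -> bool:
        return bool(self.premises & self.conclusions)


labelled = Implication.of(["sa"], ["ra"], id="item-1")
unlabelled = Implication.of(["sa"], ["ra"])
reflexive = Implication.of(["sa", "ra"], ["sa"])
```

With this definition the labelled and unlabelled instances collapse to a single dictionary key, which is what lets a frame look up a verdict without knowing the benchmark label.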

infereval.frame.DerivedFrame `dataclass` ¶

DerivedFrame(bearers: Mapping[str, Bearer], endorsements: Mapping[Implication, Verdict])

The implication frame :math:\langle B, I_M \rangle derived from a model :math:M.

Construct via :meth:from_endorsements; do not instantiate directly unless you have already validated that all implication bearer-ids reference elements of bearers.

Attributes¶

bearers: Read-only mapping id -> Bearer representing :math:B.

endorsements: Read-only mapping from each queried :class:Implication to the verdict :math:E_M returned. Implications not in this mapping are treated as un-queried; :meth:contains returns False for them unless clause (i) applies.

from_endorsements `classmethod` ¶

from_endorsements(bearers: Mapping[str, Bearer], endorsements: Mapping[Implication, Verdict]) -> DerivedFrame

Build a frame from a bearer set and a mapping of queried endorsements.

Raises¶

ValueError: If any implication references a bearer id not present in bearers.

Source code in src/infereval/frame.py

@classmethod
def from_endorsements(
    cls,
    bearers: Mapping[str, Bearer],
    endorsements: Mapping[Implication, Verdict],
) -> DerivedFrame:
    """Build a frame from a bearer set and a mapping of queried endorsements.

    Raises
    ------
    ValueError
        If any implication references a bearer id not present in ``bearers``.
    """
    bearer_ids = set(bearers)
    for imp in endorsements:
        unknown = (imp.premises | imp.conclusions) - bearer_ids
        if unknown:
            raise ValueError(
                f"Implication {imp!r} references unknown bearer ids: {sorted(unknown)}"
            )
    return cls(
        bearers=MappingProxyType(dict(bearers)),
        endorsements=MappingProxyType(dict(endorsements)),
    )
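
The constructor above copies each mapping with dict(...) before wrapping it in MappingProxyType. The standalone sketch below shows what that combination buys, using plain strings in place of Bearer objects (the values are placeholders).

```python
from types import MappingProxyType

source = {"sa": "a is a stop sign"}

# Copy first, then wrap: the proxy is read-only and detached from `source`.
frozen_view = MappingProxyType(dict(source))

# A later mutation of the caller's dict does not leak into the proxy.
source["ra"] = "a is red"

# Writing through the proxy is rejected with TypeError.
try:
    frozen_view["ra"] = "a is red"
    write_rejected = False
except TypeError:
    write_rejected = True
```

Without the intermediate copy the proxy would still be read-only, but it would reflect any later changes made to the caller's dictionary.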

contains ¶

contains(implication: Implication) -> bool

Membership in :math:I_M per Definition 3.

<empty, empty> is excluded by stipulation. Otherwise membership holds iff clause (i) or clause (ii) holds. For implications not in :attr:endorsements, clause (ii) is treated as false (we have no evidence of endorsement).

Source code in src/infereval/frame.py

def contains(self, implication: Implication) -> bool:
    """Membership in :math:`I_M` per Definition 3.

    ``<empty, empty>`` is excluded by stipulation. Otherwise membership
    holds iff clause (i) or clause (ii) holds. For implications not in
    :attr:`endorsements`, clause (ii) is treated as false (we have no
    evidence of endorsement).
    """
    if implication.is_empty_empty:
        return False
    if implication.intersects():
        return True  # clause (i)
    return self.endorsements.get(implication) == Verdict.GOOD  # clause (ii)
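
The decision order in contains can be traced with a function-level sketch that uses plain frozensets and verdict strings instead of the library's types. The function name in_frame and the bearer ids are hypothetical.

```python
from __future__ import annotations

from typing import Iterable, Mapping


def in_frame(
    endorsements: Mapping[tuple[frozenset[str], frozenset[str]], str],
    premises: Iterable[str],
    conclusions: Iterable[str],
) -> bool:
    gamma, delta = frozenset(premises), frozenset(conclusions)
    if not gamma and not delta:
        return False  # <empty, empty> is excluded by stipulation
    if gamma & delta:
        return True  # clause (i): a shared bearer, no query needed
    # clause (ii): only a recorded "good" verdict counts; un-queried is False
    return endorsements.get((gamma, delta)) == "good"


recorded = {
    (frozenset({"sa"}), frozenset({"ra"})): "good",
    (frozenset({"ra"}), frozenset({"sa"})): "bad",
}
```

An implication whose premises and conclusions overlap is a member even if it was never queried, while an un-queried implication with disjoint sides is not.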

satisfies_containment ¶

satisfies_containment() -> bool

Containment is satisfied by construction (clause (i) of Definition 3).

This always returns True; the method is provided as an explicit witness of the invariant stated in the paper's remark "Containment by construction". Tests assert it to guard against future refactors that might break the invariant.

Source code in src/infereval/frame.py

def satisfies_containment(self) -> bool:
    """Containment is satisfied by construction (clause (i) of Definition 3).

    This always returns ``True``; the method is provided as an explicit
    witness of the invariant stated in the paper's remark "Containment by
    construction". Tests assert it to guard against future refactors that
    might break the invariant.
    """
    return True

queried_implications ¶

queried_implications() -> frozenset[Implication]

The implications for which an :math:E_M verdict has been recorded.

Source code in src/infereval/frame.py

def queried_implications(self) -> frozenset[Implication]:
    """The implications for which an :math:`E_M` verdict has been recorded."""
    return frozenset(self.endorsements)

Benchmark¶

infereval.benchmark.Benchmark ¶

Bases: BaseModel

A benchmark :math:\beta over a bearer set, analyst panel, and items.

See the paper, Definition 4.

n `property` ¶

n: int

Number of items :math:n.

m `property` ¶

m: int

Number of analysts :math:m.

load `classmethod` ¶

load(path: str | Path) -> Benchmark

Load a benchmark from JSON on disk and validate.

Source code in src/infereval/benchmark.py

@classmethod
def load(cls, path: str | Path) -> Benchmark:
    """Load a benchmark from JSON on disk and validate."""
    with Path(path).open("r", encoding="utf-8") as f:
        data = json.load(f)
    return cls.model_validate(data)

dump ¶

dump(path: str | Path, *, indent: int = 2) -> None

Write the benchmark to path as canonical-ish JSON.

Source code in src/infereval/benchmark.py

def dump(self, path: str | Path, *, indent: int = 2) -> None:
    """Write the benchmark to ``path`` as canonical-ish JSON."""
    with Path(path).open("w", encoding="utf-8") as f:
        f.write(self.dumps(indent=indent))
        f.write("\n")

cells ¶

cells() -> dict[tuple[str, ...], int]

Count items per cell of the fully crossed design.

Returns a mapping from cell-tuple to item count, where the cell-tuple is the per-factor level value in the order given by sorted(self.factors). Every cell of the cartesian product is present in the result (count 0 if no item lands there); items whose factor_levels don't name every declared factor are excluded entirely (they belong to no cell).

Source code in src/infereval/benchmark.py

def cells(self) -> dict[tuple[str, ...], int]:
    """Count items per cell of the fully crossed design.

    Returns a mapping from cell-tuple to item count, where the
    cell-tuple is the per-factor level value in the order given by
    ``sorted(self.factors)``. Every cell of the cartesian product is
    present in the result (count 0 if no item lands there); items
    whose ``factor_levels`` don't name every declared factor are
    excluded entirely (they belong to no cell).
    """
    from itertools import product as _product

    if not self.factors:
        return {}
    factor_names = sorted(self.factors)
    # Initialise every cell to zero so under-populated cells appear.
    cells: dict[tuple[str, ...], int] = {
        tuple(combo): 0
        for combo in _product(*(self.factors[f] for f in factor_names))
    }
    for item in self.items:
        if not all(f in item.factor_levels for f in factor_names):
            continue
        key = tuple(item.factor_levels[f] for f in factor_names)
        cells[key] = cells.get(key, 0) + 1
    return cells
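
A worked example of the counting rule, written as a free function over plain dictionaries so it runs without the library. The factor names and levels are invented for illustration; the logic mirrors the method body above.

```python
from __future__ import annotations

from itertools import product


def count_cells(
    factors: dict[str, list[str]],
    item_levels: list[dict[str, str]],
) -> dict[tuple[str, ...], int]:
    if not factors:
        return {}
    names = sorted(factors)  # fixes the order of values inside each cell tuple
    counts = {tuple(combo): 0 for combo in product(*(factors[n] for n in names))}
    for levels in item_levels:
        if not all(n in levels for n in names):
            continue  # item does not name every factor: belongs to no cell
        key = tuple(levels[n] for n in names)
        counts[key] = counts.get(key, 0) + 1
    return counts


factors = {"severity": ["mild", "severe"], "age": ["adult", "child"]}
items = [
    {"age": "adult", "severity": "mild"},
    {"age": "adult", "severity": "mild"},
    {"age": "child", "severity": "severe"},
    {"age": "adult"},  # partial assignment, excluded from every cell
]
counts = count_cells(factors, items)
```

Because factor names are sorted, each key is ordered as (age, severity) here, regardless of the order in which the factors were declared.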

panel_names ¶

panel_names() -> list[str]

Sorted unique panel names across :attr:analysts.

Returns [] for an unpanelled (flat) benchmark. Phase 1.4 of the construct-validity infrastructure (R4).

Source code in src/infereval/benchmark.py

def panel_names(self) -> list[str]:
    """Sorted unique panel names across :attr:`analysts`.

    Returns ``[]`` for an unpanelled (flat) benchmark. Phase 1.4 of
    the construct-validity infrastructure (R4).
    """
    return sorted({a.panel for a in self.analysts if a.panel is not None})

resolved_primary_panel ¶

resolved_primary_panel() -> str | None

The primary panel name to use for analyses.

Returns :attr:primary_panel if set; otherwise the alphabetically-first declared panel name; otherwise None for unpanelled benchmarks.

Source code in src/infereval/benchmark.py

def resolved_primary_panel(self) -> str | None:
    """The primary panel name to use for analyses.

    Returns :attr:`primary_panel` if set; otherwise the
    alphabetically-first declared panel name; otherwise ``None`` for
    unpanelled benchmarks.
    """
    if self.primary_panel is not None:
        return self.primary_panel
    names = self.panel_names()
    return names[0] if names else None
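
The fallback chain for the primary panel can be illustrated with two free functions that take the analysts' panel strings directly. The panel names are made up, and the functions are stand-ins for the methods above rather than the library's API.

```python
from __future__ import annotations


def panel_names(panels: list[str | None]) -> list[str]:
    # Sorted unique names; None marks an analyst without a panel.
    return sorted({p for p in panels if p is not None})


def resolve_primary(primary: str | None, panels: list[str | None]) -> str | None:
    if primary is not None:
        return primary  # an explicit declaration wins
    names = panel_names(panels)
    return names[0] if names else None  # alphabetical fallback, else unpanelled


declared = ["icu", "cardiology", "icu"]
```

For these inputs the fallback picks "cardiology", since it sorts before "icu"; a benchmark that cares about which panel is primary should therefore set primary_panel explicitly.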

infereval.benchmark.BenchmarkItem ¶

Bases: BaseModel

A single benchmark item: an implication paired with analyst verdicts.

analyst_rationales `class-attribute` `instance-attribute` ¶

analyst_rationales: list[str] | None = Field(default=None, description="Optional per-analyst, per-item rationales: the natural-language reason each analyst gave for their verdict on this item. Positionally aligned to analyst_verdicts — index j is analyst j's rationale, matching the benchmark's analysts declaration order. null (or absent) means 'this benchmark carries no rationale discipline.' A present-but-empty entry ('') means 'this analyst gave a verdict but recorded no reason on this item' — semantically distinct from null. When present, the length must equal len(benchmark.analysts).")

Optional per-analyst, per-item rationales: the natural-language reason each analyst gave for their verdict on this item. Positionally aligned to :attr:analyst_verdicts — index j is analyst j's rationale, matching :attr:Benchmark.analysts declaration order. None (or absent) means "this benchmark carries no rationale discipline." A present-but-empty entry ("") means "this analyst gave a verdict but recorded no reason on this item" — semantically distinct from None. When present, the length must equal len(benchmark.analysts) (enforced in :meth:Benchmark._check_consistency). The framework validates structure and length only; content is the analyst's responsibility. Added in v0.5.4 (AR1–AR12).

references `class-attribute` `instance-attribute` ¶

references: list[Reference] = Field(default_factory=list)

Provenance for this implication: the guideline section, paper, or regulatory document that justifies the analyst's verdict. Empty by default; populating these turns the benchmark into an auditable artifact that a domain expert can cross-check against source material.

factor_levels `class-attribute` `instance-attribute` ¶

factor_levels: dict[str, str] = Field(default_factory=dict)

Per-factor level assignments for this item, naming its position in the benchmark's crossed design. Keys must be factor names declared in :attr:Benchmark.factors; values must be levels from the corresponding levels list. Empty by default — items without factor_levels appear in no cell and are ignored by the min_items_per_cell check.

construction_metadata `class-attribute` `instance-attribute` ¶

construction_metadata: ConstructionMetadata | None = None

Per-item provenance for construct-validity audit: who authored the item, when, which models the author was blind to at construction time, and what source material they worked from. None by default; populate selectively for items where the provenance matters. Phase 1.3 of the construct-validity infrastructure (R5, R8, R9).

ladder `class-attribute` `instance-attribute` ¶

ladder: str | None = None

Ladder identifier (v0.5 schema) grouping items into a directed variation / monotonicity sequence, e.g. "A" … "G". Descriptive metadata; the shape-blind measurement layer never branches on it. Additive (v0.17.0).

variation `class-attribute` `instance-attribute` ¶

variation: VariationType | None = None

The item's role in its ladder: base / strengthen / contested / defeat / abstain_anchor / monotonicity_step. Drives reporting stratification (v0.17.1); the κ layer (Definitions 6–10) ignores it. Additive (v0.17.0).

target `class-attribute` `instance-attribute` ¶

target: str | None = None

Differential / succedent label for the item (v0.5 schema), e.g. "cpe" or "ards". In the single-succedent fragment it coincides with the sole conclusion; kept explicit for per-target reporting. Must name a declared :attr:Benchmark.targets entry when both are set. Additive.

placeholder `class-attribute` `instance-attribute` ¶

placeholder: PlaceholderVerdict | None = None

Author-provided provisional verdict for pre-recruitment model dry-runs ONLY. Not an analyst verdict: the measurement layer is mechanically firewalled from reading it (the placeholder firewall). "contested" marks an item the author expects the analyst panel to split on. Additive.

construction_note `class-attribute` `instance-attribute` ¶

construction_note: str | None = None

Author's prose justifying the item's design (e.g. why an item is contested). Distinct from :attr:references (citations) and :attr:analyst_rationales (per-analyst verdict reasons). Additive.

monotonicity_step `class-attribute` `instance-attribute` ¶

monotonicity_step: MonotonicityStep | None = None

Ordinal-ladder annotation, present iff :attr:variation is "monotonicity_step". Consumed by the monotonicity scorer (v0.17.1).

to_implication ¶

to_implication() -> Implication

Return the runtime :class:Implication view of this item.

Source code in src/infereval/benchmark.py

def to_implication(self) -> Implication:
    """Return the runtime :class:`Implication` view of this item."""
    return Implication(
        premises=frozenset(self.premises),
        conclusions=frozenset(self.conclusions),
        id=self.id,
    )

infereval.benchmark.BearerModel ¶

Bases: BaseModel

JSON shape for a :class:infereval.types.Bearer.

references `class-attribute` `instance-attribute` ¶

references: list[Reference] = Field(default_factory=list)

Provenance for the bearer's definition, e.g. the guideline section defining the threshold that "P/F < 300" is measured against.

ordinal_family `class-attribute` `instance-attribute` ¶

ordinal_family: str | None = None

Name of the ordinal family this bearer is a tier of, when it is one (e.g. "bnp" for bnp_lo). Must name a family declared in :attr:Benchmark.ordinal_families. None (default) for non-tier bearers. Additive (v0.17.0): pre-existing benchmarks validate unchanged.

infereval.benchmark.AnalystModel ¶

Bases: BaseModel

A human analyst :math:a_j whose verdicts appear in :math:V_i.

expertise_description `class-attribute` `instance-attribute` ¶

expertise_description: str | None = None

Free-text description of the analyst's relevant expertise — the domain background, specialty, years of practice, role, board certifications. Captured at recruitment time via infereval survey: the survey's first block asks each respondent for this prose, which lands here when the response CSV is imported. Distinct from :attr:notes (general analyst-side annotations) — this field is specifically recruitment metadata about why this analyst's judgement counts for the benchmark's domain. None (the default) means no expertise was declared (e.g. analysts seeded by the benchmark author rather than recruited via a survey).

Added in v0.9.0 with the infereval survey recruitment workflow. Additive: pre-v0.9.0 benchmarks validate unchanged.

panel `class-attribute` `instance-attribute` ¶

panel: str | None = None

Optional panel identifier. Analysts sharing the same panel string are members of the same panel for cross-panel agreement analysis (R4: independent reference check). None (default) means the benchmark is flat — every analyst is treated equivalently. Adding a panel string to ANY analyst requires ALL analysts to declare one (no partial-panel benchmarks).
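
The all-or-none rule can be stated as a one-line predicate. This is a sketch of the constraint as described, with a hypothetical function name; where and how the library enforces it is not shown on this page.

```python
from __future__ import annotations


def panels_consistent(panels: list[str | None]) -> bool:
    """True when every analyst declares a panel, or none does."""
    declared = [p is not None for p in panels]
    return all(declared) or not any(declared)
```

A mixed list such as ["panel-a", None] fails the predicate, which corresponds to the "no partial-panel benchmarks" rule.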

infereval.benchmark.RSRTarget ¶

Bases: BaseModel

Target inference :math:\langle X, A \rangle for an RSR-targeted item.

See the paper's Remark on "RSR-targeted benchmarks".

infereval.benchmark.ConstructionMetadata ¶

Bases: BaseModel

Per-item construction provenance for benchmark audit.

Records who authored an item, when, against what training-cutoff posture, and from what source materials. Phase 1.3 of the construct-validity infrastructure series, providing the data model for requirements R5 (documented construction), R8 (held-out items), and R9 (training-data separation).

Content is the analyst's responsibility — the framework validates structure (Pydantic types, extra="forbid") but does not enforce that, e.g., authored_on actually post-dates a model's training cutoff. The point is to make the presence of these declarations auditable.

authored_by `class-attribute` `instance-attribute` ¶

authored_by: str | None = None

Identifier for the author of the item, e.g. "physician-c".

authored_on `class-attribute` `instance-attribute` ¶

authored_on: date | None = None

ISO date the item was authored.

authored_blind_to_models `class-attribute` `instance-attribute` ¶

authored_blind_to_models: list[str] = Field(default_factory=list)

Model identifiers the author had not observed at construction time. Critical for R8 (held-out items): if the author had seen M's draft-version output on the item, M's agreement on the final item does not constitute independent evidence.

source `class-attribute` `instance-attribute` ¶

source: str | None = None

Free-form source citation for the primary material the author worked from (e.g. "Sanford Guide to Antimicrobial Therapy 2025"). Distinct from :attr:BenchmarkItem.references, which carries the framework-level :class:Reference objects supporting the verdict; source is intended for the primary material, not the literature that justifies the analyst's call.

infereval.benchmark.FactorConstraints ¶

Bases: BaseModel

Constraints the benchmark validator should enforce on the factorial design.

Currently supports min_items_per_cell: every cell of the fully crossed design (cartesian product of all declared factor levels) must contain at least this many items, where a cell is defined by the per-factor level assignments in :attr:BenchmarkItem.factor_levels.

Per "Closing the Construct-Validity Gap in infereval" (Phase 1.1). Addresses requirement R7 (multiple items per condition) and supports R12 (per-condition decomposition).

min_items_per_cell `class-attribute` `instance-attribute` ¶

min_items_per_cell: int | None = None

If set, every cell of the crossed design must have at least this many items. Set to None to skip the cell-count validation entirely (the per-key / per-value type checks on factor_levels still run).
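
A minimal sketch of the check this field describes, operating on a cell-count mapping of the shape returned by Benchmark.cells(). The function name and the choice to return the offending cells (rather than raise) are assumptions made for the example.

```python
from __future__ import annotations


def underpopulated_cells(
    counts: dict[tuple[str, ...], int],
    min_items_per_cell: int | None,
) -> list[tuple[str, ...]]:
    if min_items_per_cell is None:
        return []  # cell-count validation is skipped entirely
    return sorted(cell for cell, n in counts.items() if n < min_items_per_cell)


counts = {("adult", "mild"): 2, ("adult", "severe"): 0}
```

With a threshold of 1 only the empty cell is reported; with None nothing is reported, whatever the counts are.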

infereval.benchmark.ContextBuilders ¶

Bases: BaseModel

Pair of context builders for :math:\mathrm{ctx}_\Gamma and :math:\mathrm{ctx}_\Delta.

infereval.benchmark.TemplateContextBuilder ¶

Bases: BaseModel

A context builder specified by an inline template string.

The template is a format string with a single {expressions} placeholder. Bearer expressions are joined by joiner to fill it.
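
Read literally, the description above amounts to a str.format call over the joined expressions. The sketch below assumes exactly that; the function name, the default joiner, and the template text are placeholders, not values taken from the library.

```python
from typing import Sequence


def build_context(template: str, expressions: Sequence[str], joiner: str = "; ") -> str:
    # Join the bearer expressions, then fill the single {expressions} slot.
    return template.format(expressions=joiner.join(expressions))


context = build_context(
    "Suppose the following: {expressions}.",
    ["a is a stop sign", "a is red"],
    joiner=" and ",
)
```

Since the template is a format string, any literal braces in it would need to be doubled.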

infereval.benchmark.PluginContextBuilder ¶

Bases: BaseModel

A context builder specified by a dotted import path.

The plugin must resolve to a callable (Sequence[str]) -> str taking bearer expressions and returning the natural-language context.
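
A plugin of the required shape, together with one plausible way to resolve a dotted path using importlib. The resolver is an assumption about how such a path could be loaded, not the library's own loader, and both function names are hypothetical.

```python
import importlib
from typing import Callable, Sequence

ContextBuilder = Callable[[Sequence[str]], str]


def bullet_context(expressions: Sequence[str]) -> str:
    """Example plugin: one bearer expression per bullet line."""
    return "\n".join(f"- {e}" for e in expressions)


def load_builder(dotted_path: str) -> ContextBuilder:
    # Split "package.module.callable" into module path and attribute name.
    module_name, _, attr = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'package.module.callable', got {dotted_path!r}")
    builder = getattr(importlib.import_module(module_name), attr)
    if not callable(builder):
        raise TypeError(f"{dotted_path!r} does not resolve to a callable")
    return builder
```

Resolving the path once, before any provider call, keeps a mistyped plugin name from surfacing midway through a run.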

infereval.benchmark.VerificationPromptOverride ¶

Bases: BaseModel

Optional benchmark-level override of the framework's default verification prompt.

Only template is required by the schema, since it is the minimal thing an override needs to contribute; the other three fields are optional. When an optional field is None the framework default is used:

:attr:system None → :data:infereval.prompts.DEFAULT_SYSTEM_PROMPT.
:attr:parse_regex None → :data:infereval.prompts.DEFAULT_PARSE_REGEX.
:attr:id None → the caller-supplied override_id parameter to :func:infereval.prompts.resolve_verification_prompt.

Adding the system field makes the paraphrase-axis experiment fully JSON-drivable (no Python required to vary the verification prompt).

infereval.benchmark.Reference ¶

Bases: BaseModel

A traceable provenance entry for a benchmark, bearer, or item.

The motivating use case is regulated-domain benchmarks (medical, legal, financial) where every non-trivial implication needs a citation to a guideline, statute, or peer-reviewed source. Recording these as structured objects lets downstream tooling render bibliographies, validate DOIs, and connect items to the documents that justify them.

Only :attr:citation is required. The other fields populate when the relevant identifier is known and remain None otherwise.

Authoring shorthand: a plain string in any references list is auto-promoted to a :class:Reference with the string as :attr:citation and everything else None. See :func:_promote_reference_shorthand.
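
The shorthand can be pictured as a small normalisation step. The sketch uses a plain dataclass with only the three fields documented on this page (citation, section, note); the real model is a Pydantic BaseModel with further fields, and the promote helper is a hypothetical stand-in for _promote_reference_shorthand.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Reference:
    citation: str
    section: str | None = None
    note: str | None = None


def promote(entry: str | dict | Reference) -> Reference:
    if isinstance(entry, Reference):
        return entry
    if isinstance(entry, str):
        # A bare string becomes the citation; everything else stays None.
        return Reference(citation=entry)
    return Reference(**entry)


raw = [
    "Sanford Guide to Antimicrobial Therapy 2025",
    {"citation": "Sanford Guide to Antimicrobial Therapy 2025", "section": "Section 5.2"},
]
references = [promote(entry) for entry in raw]
```

Both authoring styles end up as the same structured type, so downstream tooling never has to branch on strings versus objects.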

section `class-attribute` `instance-attribute` ¶

section: str | None = None

Pinpoint location within the cited work, e.g. "Section 5.2" or "Hypoxemia criterion".

note `class-attribute` `instance-attribute` ¶

note: str | None = None

What specifically this reference supports, in the author's words.

Evaluation¶

infereval.evaluation.evaluate ¶

evaluate(benchmark: Benchmark, provider: Provider, *, config: EndorsementConfig | None = None, params: ProviderParams | None = None, verification_prompt: VerificationPrompt | None = None, strip_tex: bool = True, run_id: str | None = None, log_path: Path | str | None = None, variant: int = 0, template: Template | None = None, coherence_frame: CoherenceFrame | None = None) -> Evaluation

Run a model against a benchmark and assemble the resulting :math:\eta.

Iterates over every benchmark item, calls :func:infereval.endorsement.endorse to compute :math:E_M, and packages the per-item samples + majority-vote tally into an :class:Evaluation.

Parameters¶

benchmark: The :math:\beta to evaluate against.

provider: Any :class:infereval.providers.Provider (Anthropic, OpenAI, OpenRouter, or a mock).

config: Endorsement configuration. Defaults to EndorsementConfig() (n_samples=5, tie_break=abstain, default verification prompt id).

params: Provider decoding parameters. Defaults to ProviderParams() (temperature=1.0, max_tokens=1024).

verification_prompt: If supplied, overrides the framework default. If the benchmark has verification_prompt set, it takes precedence over the framework default but not over this argument.

strip_tex: Whether to strip $...$ TeX-math delimiters from bearer expressions at prompt-construction time (default True).

run_id: Stable identifier for this evaluation run, recorded as :attr:Evaluation.id. Generated as a UUID4 if not supplied.

log_path: Optional path for a JSONL run log; one event per line, suitable for jq or pandas.read_json(lines=True). If None (the default), no log file is written; library callers can still attach their own handlers to the infereval logger.

template: Explicit template override for the coherence question form. When None (the default), resolution follows :func:infereval.templates.resolve_template: a programmatic register_template binding for benchmark.id wins, then the benchmark's declared template_id, then the framework default. The resolved template's id is recorded in the run.started log event (§12.3 provenance).

Returns¶

Evaluation: The fully-populated :math:\eta ready to serialize to JSON.

Source code in src/infereval/evaluation.py

def evaluate(
    benchmark: Benchmark,
    provider: Provider,
    *,
    config: EndorsementConfig | None = None,
    params: ProviderParams | None = None,
    verification_prompt: VerificationPrompt | None = None,
    strip_tex: bool = True,
    run_id: str | None = None,
    log_path: Path | str | None = None,
    variant: int = 0,
    template: Template | None = None,
    coherence_frame: CoherenceFrame | None = None,
) -> Evaluation:
    """Run a model against a benchmark and assemble the resulting :math:`\\eta`.

    Iterates over every benchmark item, calls
    :func:`infereval.endorsement.endorse` to compute :math:`E_M`, and
    packages the per-item samples + majority-vote tally into an
    :class:`Evaluation`.

    Parameters
    ----------
    benchmark
        The :math:`\\beta` to evaluate against.
    provider
        Any :class:`infereval.providers.Provider` (Anthropic, OpenAI,
        OpenRouter, or a mock).
    config
        Endorsement configuration. Defaults to ``EndorsementConfig()``
        (n_samples=5, tie_break=abstain, default verification prompt id).
    params
        Provider decoding parameters. Defaults to ``ProviderParams()``
        (temperature=1.0, max_tokens=1024).
    verification_prompt
        If supplied, overrides the framework default. If the benchmark
        has ``verification_prompt`` set, it takes precedence over the
        framework default but not over this argument.
    strip_tex
        Whether to strip ``$...$`` TeX-math delimiters from bearer
        expressions at prompt-construction time (default ``True``).
    run_id
        Stable identifier for this evaluation run, recorded as
        :attr:`Evaluation.id`. Generated as a UUID4 if not supplied.
    log_path
        Optional path for a JSONL run log; one event per line, suitable
        for ``jq`` or ``pandas.read_json(lines=True)``. If ``None`` (the
        default), no log file is written; library callers can still attach
        their own handlers to the ``infereval`` logger.
    template
        Explicit template override for the ``coherence`` question form.
        When ``None`` (the default), resolution follows
        :func:`infereval.templates.resolve_template`: a programmatic
        ``register_template`` binding for ``benchmark.id`` wins, then the
        benchmark's declared ``template_id``, then the framework default.
        The resolved template's id is recorded in the ``run.started`` log
        event (§12.3 provenance).

    Returns
    -------
    Evaluation
        The fully-populated :math:`\\eta` ready to serialize to JSON.
    """
    # Late imports to avoid the evaluation <-> endorsement <-> context cycle.
    from .context import resolve_context_builders
    from .endorsement import endorse
    from .logging_setup import configure_run_logging, log_event
    from .prompts import resolve_verification_prompt
    from .templates import (
        THIN_COHERENCE_FRAME,
        coherence_frame_for_id,
        resolve_coherence_frame,
        resolve_template,
    )

    cfg = config or EndorsementConfig()
    par = params or ProviderParams()
    rid = run_id or str(uuid.uuid4())
    prompt = verification_prompt or resolve_verification_prompt(
        benchmark.verification_prompt
    )

    bearers = benchmark.runtime_bearers()
    premise_builder, conclusion_builder = resolve_context_builders(
        benchmark.context_builders
    )
    # Resolve once, up front: an unknown benchmark-declared template_id
    # fails loudly before any provider call is made.
    resolved_template = (
        template
        if template is not None
        else resolve_template(benchmark.id, template_id=benchmark.template_id)
    )
    # Coherence frame: explicit argument > input-config binding > programmatic
    # registration > benchmark binding > thin default — resolved up front so an
    # unknown id fails before any provider call, then stamped into the config
    # the η records (the frame leg of the §12.3 provenance tuple).
    if coherence_frame is not None:
        resolved_frame = coherence_frame
    elif cfg.coherence_frame_id != THIN_COHERENCE_FRAME.id:
        resolved_frame = coherence_frame_for_id(cfg.coherence_frame_id)
    else:
        resolved_frame = resolve_coherence_frame(
            benchmark.id, frame_id=benchmark.coherence_frame_id
        )
    cfg = cfg.model_copy(update={"coherence_frame_id": resolved_frame.id})

    bench_hash = canonical_benchmark_hash(benchmark)

    with configure_run_logging(
        log_path,
        run_id=rid,
        extra_context={"benchmark_id": benchmark.id, "framework_version": __version__},
    ):
        started = datetime.now(timezone.utc)
        log_event(
            log,
            "run.started",
            benchmark_id=benchmark.id,
            benchmark_hash=bench_hash,
            n_items=benchmark.n,
            provider=provider.name,
            model_id=provider.model_id,
            params=par.model_dump(mode="json"),
            endorsement_config=cfg.model_dump(mode="json"),
            verification_prompt_id=prompt.id,
            template_id=resolved_template.id,
            coherence_frame_id=resolved_frame.id,
            strip_tex=strip_tex,
            paraphrase_variant=variant,
            framework_version=__version__,
        )

        items: list[EvaluationItem] = []
        for bench_item in benchmark.items:
            implication = bench_item.to_implication()
            record = endorse(
                implication,
                bearers,
                provider,
                cfg,
                par,
                premise_builder=premise_builder,
                conclusion_builder=conclusion_builder,
                verification_prompt=prompt,
                strip_tex=strip_tex,
                request_id_prefix=f"{rid}:{bench_item.id}",
                variant=variant,
                question_form=cfg.question_form,
                template=resolved_template,
                coherence_frame=resolved_frame,
            )
            items.append(
                EvaluationItem(
                    id=bench_item.id,
                    premises=sorted(bench_item.premises),
                    conclusions=sorted(bench_item.conclusions),
                    analyst_verdicts=list(bench_item.analyst_verdicts),
                    analyst_rationales=(
                        list(bench_item.analyst_rationales)
                        if bench_item.analyst_rationales is not None
                        else None
                    ),
                    model_verdict=record.verdict,
                    samples=record.samples,
                    majority_vote=record.to_majority_vote(),
                    tags=list(bench_item.tags),
                    references=list(bench_item.references),
                )
            )

        finished = datetime.now(timezone.utc)
        cfg_with_prompt_id = cfg.model_copy(
            update={"verification_prompt_id": prompt.id}
        )

        log_event(
            log,
            "run.finished",
            n_items=len(items),
            wall_time_s=(finished - started).total_seconds(),
        )

    return Evaluation(
        id=rid,
        benchmark_id=benchmark.id,
        benchmark_hash=bench_hash,
        model=ModelInfo(
            provider=provider.name,
            model_id=provider.model_id,
            params=par,
        ),
        endorsement_config=cfg_with_prompt_id,
        started_at=started,
        finished_at=finished,
        items=items,
        references=list(benchmark.references),
        paraphrase_variant=variant,
    )

infereval.evaluation.Evaluation ¶

Bases: BaseModel

An evaluation :math:\eta of a model against a benchmark.

references `class-attribute` `instance-attribute` ¶

references: list[Reference] = Field(default_factory=list)

Corpus-level provenance, propagated from :attr:infereval.benchmark.Benchmark.references at evaluation time. Carries the paper, dialogue, or regulatory framework the benchmark is derived from, so an evaluation JSON read in isolation still names its primary sources.

paraphrase_variant `class-attribute` `instance-attribute` ¶

paraphrase_variant: int = 0

Index of the paraphrase variant used at evaluation time. 0 (default) means the canonical :attr:BearerModel.expression was used for every bearer. k >= 1 means bearer.paraphrases[k-1] was used per :func:infereval.endorsement._expressions_for (with fallback to the canonical for bearers that don't carry that paraphrase). Phase 1.2 of the construct-validity infrastructure (R10: paraphrase variation under fixed inferential content).
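
The indexing rule above can be sketched on plain strings. This is an illustration only, not the body of :func:infereval.endorsement._expressions_for; the helper name expression_for is hypothetical and it takes the canonical expression and the paraphrase tuple directly instead of a bearer object.

```python
def expression_for(expression: str, paraphrases: tuple[str, ...], variant: int) -> str:
    """Pick the text used for one bearer under a paraphrase variant (hypothetical helper)."""
    if variant == 0:
        # Variant 0 always means the canonical expression.
        return expression
    if 1 <= variant <= len(paraphrases):
        # Variant k >= 1 maps to paraphrases[k - 1].
        return paraphrases[variant - 1]
    # The bearer does not carry that paraphrase: fall back to the canonical text.
    return expression


canonical = "a is a stop sign"
variants = ("a is a sign that says stop",)

chosen = [expression_for(canonical, variants, k) for k in (0, 1, 2)]
print(chosen)
```

Variant 2 falls back to the canonical text here because the bearer carries only one paraphrase.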

endorsements ¶

endorsements() -> dict[Implication, Verdict]

Mapping Implication -> Verdict suitable for :meth:DerivedFrame.from_endorsements.

Source code in src/infereval/evaluation.py

def endorsements(self) -> dict[Implication, Verdict]:
    """Mapping ``Implication -> Verdict`` suitable for :meth:`DerivedFrame.from_endorsements`."""
    return {item.to_implication(): item.model_verdict for item in self.items}

infereval.evaluation.EvaluationItem ¶

Bases: BaseModel

One row of the evaluation :math:\eta: implication + analyst verdicts + :math:E_M.

analyst_rationales `class-attribute` `instance-attribute` ¶

analyst_rationales: list[str] | None = Field(default=None, description="Optional per-analyst rationales propagated from the source benchmark item's analyst_rationales at evaluation build time. Positionally aligned to analyst_verdicts. null (or absent) when the source benchmark carried no rationale discipline; a present list (possibly containing empty strings) when it did. Covered by Evaluation.benchmark_hash.")

Optional per-analyst rationales propagated from :attr:infereval.benchmark.BenchmarkItem.analyst_rationales at evaluation-build time. Positionally aligned to :attr:analyst_verdicts. None (or absent) when the source benchmark carried no rationale discipline; a present list (possibly containing empty strings) when it did. Covered by the existing :attr:Evaluation.benchmark_hash integrity mechanism, so a rationale cannot be silently altered between evaluation and report without changing the hash. Added in v0.5.4 (AR8, AR9).

references `class-attribute` `instance-attribute` ¶

references: list[Reference] = Field(default_factory=list)

Per-item provenance, propagated from :attr:infereval.benchmark.BenchmarkItem.references at evaluation time. Carries the guideline / paper / regulatory citation that justifies the analyst's verdict so the evaluation JSON is a self-contained, auditable artifact (no need to look up the source benchmark separately).

infereval.evaluation.EndorsementConfig ¶

Bases: BaseModel

Configuration governing how :math:E_M is computed.

Note on terminology: n_samples is the number of completions drawn from M per benchmark item, in the LLM-literature sense of "sample" (one draw from the model's output distribution). It is not the number of dataset rows — that is the benchmark's item count and is fixed by the benchmark. The methodology issues n_samples provider calls per item, parses each completion's verdict token, and majority-votes to compute :math:E_M for that item. See docs/concepts.md for the full terminology note.

question_form `class-attribute` `instance-attribute` ¶

question_form: Literal['support', 'coherence'] = 'support'

The logical question posed about each item (brief §3.1). support asks "does the conclusion follow?" and is defined only for single-succedent items; coherence asks "is committing Γ and denying Δ coherent?" and is defined for every arity. Persisted here so an evaluation records which question was asked (part of the §12.3 provenance tuple). Additive (v0.17.3): pre-existing evaluations load as support.

coherence_frame_id `class-attribute` `instance-attribute` ¶

coherence_frame_id: str = 'thin-v1'

The :class:infereval.templates.CoherenceFrame the coherence question is elicited under (part of the §12.3 provenance tuple; meaningful only when question_form="coherence"). Always concrete — never null — so the retest compatibility check refuses cross-frame comparisons, the same refusal the support path gets from verification_prompt_id.

On an input config, the default "thin-v1" means "resolve normally": an explicit coherence_frame argument to :func:evaluate, then a programmatic :func:infereval.templates.register_coherence_frame binding, then the benchmark's coherence_frame_id, then the thin default. A non-default value is an explicit binding, looked up in the catalog and failing loudly on unknown ids. (To force the thin frame over a benchmark binding, pass coherence_frame=THIN_COHERENCE_FRAME explicitly.) :func:evaluate stamps the resolved id back into the config the η records. Legacy ηs lacking the field backfill to "thin-v1" at load time: correct for every η produced by :func:evaluate, which elicited coherence only under the thin system before frames existed; hand-elicited experiment artifacts predating this field are identified by their run ids instead.

infereval.evaluation.ProviderParams ¶

Bases: BaseModel

Decoding parameters passed to a provider sample call.

The max_tokens default of 1024 is sized for current frontier models that consume budget on silent internal reasoning. See :class:infereval.providers.base.SampleRequest and docs/providers.md for the rationale and per-provider guidance.

infereval.evaluation.SampleRecord ¶

Bases: BaseModel

One sampled response from the provider plus its parsed verdict.

finish_reason `class-attribute` `instance-attribute` ¶

finish_reason: str | None = None

Provider-side stop reason, when reported. See :class:infereval.providers.base.SampleResult.finish_reason.

reasoning_tokens `class-attribute` `instance-attribute` ¶

reasoning_tokens: int | None = None

Reasoning / thinking token count, when the provider reports it. See :class:infereval.providers.base.SampleResult.reasoning_tokens.

provider_error `class-attribute` `instance-attribute` ¶

provider_error: str | None = None

v0.15.0+: non-None when the provider call failed (rate limit, HTTP error, empty response body, malformed response). When set, this sample represents an instrument failure, not a model decision — aggregators (:class:MajorityVote.from_samples) skip these samples rather than counting their parsed_verdict (which is left at Verdict.ABSTAIN for backward compatibility with v0.14.0 consumers that don't understand the new field). The R22 audit and metrics paths likewise treat provider_error is not None as missing data, not as a model abstention.

Pre-v0.15.0 eta JSONs without this field load with the default None — backward-compatible. Historical captures with silent empty-response failures (the v0.14.0 bug; see KNOWN_ISSUES_v0.14.0.md) can be heuristically re-classified post-hoc via infereval audit (new in v0.15.0).
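
The "instrument failure, not a model decision" rule can be illustrated with a small tally. This is a sketch under stated assumptions: Sample is a hypothetical stand-in for SampleRecord carrying only the two fields used here, and tally_usable is not the library's :class:MajorityVote.from_samples (whose tie-break rule is not reproduced).

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Sample:
    """Hypothetical stand-in for SampleRecord (two fields only)."""

    parsed_verdict: str
    provider_error: Optional[str] = None


def tally_usable(samples):
    """Count parsed verdicts, treating failed provider calls as missing data."""
    return Counter(s.parsed_verdict for s in samples if s.provider_error is None)


samples = [
    Sample("good"),
    Sample("good"),
    Sample("bad"),
    # Failed call: parsed_verdict stays "abstain" but must not be counted.
    Sample("abstain", provider_error="HTTP 429"),
]
counts = tally_usable(samples)
print(dict(counts))
```

Counting the failed sample would have recorded one abstention that the model never produced.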

infereval.evaluation.MajorityVote ¶

Bases: BaseModel

Tally of parsed verdicts plus the resolved majority and tie-break flag.

infereval.evaluation.canonical_benchmark_hash ¶

canonical_benchmark_hash(benchmark: Benchmark) -> str

SHA-256 of the benchmark's canonical-JSON form, prefixed sha256:.

Recorded in :attr:Evaluation.benchmark_hash for tamper detection. Two benchmarks that round-trip to the same canonical JSON have the same hash; this is robust to insertion order in dicts.

Source code in src/infereval/evaluation.py

def canonical_benchmark_hash(benchmark: Benchmark) -> str:
    """SHA-256 of the benchmark's canonical-JSON form, prefixed ``sha256:``.

    Recorded in :attr:`Evaluation.benchmark_hash` for tamper detection.
    Two benchmarks that round-trip to the same canonical JSON have the
    same hash; this is robust to insertion order in dicts.
    """
    canonical = json.dumps(
        benchmark.model_dump(mode="json", exclude_none=True),
        sort_keys=True,
        separators=(",", ":"),
    )
    return f"sha256:{hashlib.sha256(canonical.encode('utf-8')).hexdigest()}"
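
The insertion-order claim can be checked with plain dictionaries. The sketch below applies the same json.dumps settings to hand-written payloads; canonical_hash is a hypothetical helper that takes an already-dumped dict, and the payloads are made-up stand-ins for a benchmark's JSON form.

```python
import hashlib
import json


def canonical_hash(payload: dict) -> str:
    """Hash a JSON-compatible dict after sorting keys and removing whitespace."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return f"sha256:{hashlib.sha256(canonical.encode('utf-8')).hexdigest()}"


# Same content, different key insertion order at both nesting levels.
a = {"id": "demo", "items": [{"id": "i1", "premises": ["p"]}]}
b = {"items": [{"premises": ["p"], "id": "i1"}], "id": "demo"}
# One premise changed.
c = {"id": "demo", "items": [{"id": "i1", "premises": ["q"]}]}

print(canonical_hash(a) == canonical_hash(b))
print(canonical_hash(a) == canonical_hash(c))
```

Only dict key order is normalised by sort_keys; reordering a list, such as the items, still changes the hash.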

Metrics¶

infereval.metrics.coverage ¶

coverage(eta: Evaluation) -> float

:math:\mathrm{cov}(\eta) = |\{i : E_M(I_i) \neq \text{abstain}\}| / n.

Returns 0.0 for an empty evaluation rather than raising.

Source code in src/infereval/metrics.py

def coverage(eta: Evaluation) -> float:
    """:math:`\\mathrm{cov}(\\eta) = |\\{i : E_M(I_i) \\neq \\text{abstain}\\}| / n`.

    Returns ``0.0`` for an empty evaluation rather than raising.
    """
    if eta.n == 0:
        return 0.0
    substantive = sum(1 for it in eta.items if it.model_verdict != Verdict.ABSTAIN)
    return substantive / eta.n
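
A worked instance of the formula, using plain strings in place of Verdict members and a list in place of an Evaluation. The helper name coverage_of is hypothetical.

```python
def coverage_of(model_verdicts: list[str]) -> float:
    """Share of items with a substantive (non-abstain) model verdict."""
    if not model_verdicts:
        # Mirrors the documented behaviour: empty input yields 0.0, not an error.
        return 0.0
    substantive = sum(1 for v in model_verdicts if v != "abstain")
    return substantive / len(model_verdicts)


cov = coverage_of(["good", "bad", "abstain", "good"])
print(cov)  # 3 substantive verdicts out of 4 items -> 0.75
```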

infereval.metrics.consensus_verdict ¶

consensus_verdict(verdicts: Sequence[Verdict]) -> Verdict

Return the analyst consensus :math:c_i for one item's verdicts.

From the paper, Definition 8: good if more analysts say good than bad; bad if more say bad than good; otherwise abstain. Abstain votes are ignored when comparing the two substantive counts, so the result is abstain only when the good and bad counts are equal (including when every analyst abstains).

Source code in src/infereval/metrics.py

def consensus_verdict(verdicts: Sequence[Verdict]) -> Verdict:
    """Return the analyst consensus :math:`c_i` for one item's verdicts.

    From the paper, Definition 8: ``good`` if more analysts say ``good``
    than ``bad``; ``bad`` if more say ``bad`` than ``good``; otherwise
    ``abstain``. Abstain votes are ignored when comparing the two
    substantive counts, so the result is ``abstain`` only when the
    ``good`` and ``bad`` counts are equal (including when every analyst
    abstains).
    """
    good = sum(1 for v in verdicts if v == Verdict.GOOD)
    bad = sum(1 for v in verdicts if v == Verdict.BAD)
    if good > bad:
        return Verdict.GOOD
    if bad > good:
        return Verdict.BAD
    return Verdict.ABSTAIN
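
The comparison in the listing above can be exercised on a few edge cases. The sketch restates the same good-versus-bad count comparison on plain strings; the function name consensus is hypothetical.

```python
def consensus(verdicts: list[str]) -> str:
    """Same count comparison as the listing above, on plain strings."""
    good = sum(1 for v in verdicts if v == "good")
    bad = sum(1 for v in verdicts if v == "bad")
    if good > bad:
        return "good"
    if bad > good:
        return "bad"
    return "abstain"


cases = {
    "clear majority": consensus(["good", "good", "bad"]),
    "even split": consensus(["good", "bad"]),
    "one vote, two abstentions": consensus(["good", "abstain", "abstain"]),
    "no analysts": consensus([]),
}
for label, result in cases.items():
    print(f"{label}: {result}")
```

The third case shows that a single substantive vote decides the consensus when every other analyst abstains.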

infereval.metrics.consensus_reference ¶

consensus_reference(eta: Evaluation) -> ReferenceFn

Return :math:r(i) = c_i as a :data:ReferenceFn.

Source code in src/infereval/metrics.py

def consensus_reference(eta: Evaluation) -> ReferenceFn:
    """Return :math:`r(i) = c_i` as a :data:`ReferenceFn`."""
    per_item = [consensus_verdict(it.analyst_verdicts) for it in eta.items]

    def _ref(i: int) -> Verdict:
        return per_item[i]

    return _ref

infereval.metrics.cohens_kappa ¶

cohens_kappa(eta: Evaluation, reference: ReferenceFn, *, weights: WeightFn | None = None) -> float | None

:math:\kappa_C(\eta, r) = (p_o - p_e) / (1 - p_e).

Returns :data:None when :math:S(\eta, r) is empty, when the total weight over :math:S(\eta, r) is non-positive (weighted path only), or when :math:p_e = 1 (degenerate distribution). Logs a warning in each case so the user sees why the value is undefined.

Parameters¶

eta
    The evaluation.
reference
    Per-item reference verdict function (typically the analyst consensus :math:c_i).
weights
    Optional per-item weight function. When None (default), all substantive items count equally and the result is byte-identical to the unweighted formulation. When provided, observed and chance-expected agreement are computed as weighted relative frequencies, so items with low weight contribute less to the agreement statistic. See :func:margin_weight for the standard confidence weighting. Items with zero weight are dropped from the substantive subset for numerical stability.

Source code in src/infereval/metrics.py

def cohens_kappa(
    eta: Evaluation,
    reference: ReferenceFn,
    *,
    weights: WeightFn | None = None,
) -> float | None:
    """:math:`\\kappa_C(\\eta, r) = (p_o - p_e) / (1 - p_e)`.

    Returns :data:`None` when :math:`S(\\eta, r)` is empty, when the
    total weight over :math:`S(\\eta, r)` is non-positive (weighted
    path only), or when :math:`p_e = 1` (degenerate distribution).
    Logs a warning in each case so the user sees why the value is
    undefined.

    Parameters
    ----------
    eta
        The evaluation.
    reference
        Per-item reference verdict function (typically the analyst
        consensus :math:`c_i`).
    weights
        Optional per-item weight function. When ``None`` (default), all
        substantive items count equally and the result is byte-identical
        to the unweighted formulation. When provided, observed and
        chance-expected agreement are computed as weighted relative
        frequencies — items with low weight contribute less to the
        agreement statistic. See :func:`margin_weight` for the standard
        confidence weighting. Items with zero weight are dropped from
        the substantive subset for numerical stability.
    """
    S = sorted(substantive_index(eta, reference))
    if not S:
        log.warning(
            "kappa_C undefined: substantive subset S(eta, r) is empty"
        )
        return None

    if weights is None:
        # Unweighted path: behaviour preserved exactly.
        n_S = len(S)
        p_o = sum(1 for i in S if eta.items[i].model_verdict == reference(i)) / n_S
        p_M: dict[Verdict, float] = {}
        p_r: dict[Verdict, float] = {}
        for c in (Verdict.GOOD, Verdict.BAD):
            p_M[c] = sum(1 for i in S if eta.items[i].model_verdict == c) / n_S
            p_r[c] = sum(1 for i in S if reference(i) == c) / n_S
        p_e = sum(p_M[c] * p_r[c] for c in (Verdict.GOOD, Verdict.BAD))
    else:
        # Weighted path: each item contributes its weight to numerator
        # and denominator. Items with zero weight are dropped.
        w_S = [weights(eta.items[i]) for i in S]
        w_total = sum(w_S)
        if w_total <= 0:
            log.warning(
                "kappa_C undefined: total weight over the substantive subset is "
                "non-positive (all items have zero weight under the supplied "
                "weight function)"
            )
            return None
        p_o = (
            sum(
                w
                for i, w in zip(S, w_S, strict=True)
                if eta.items[i].model_verdict == reference(i)
            )
            / w_total
        )
        p_M_w: dict[Verdict, float] = {}
        p_r_w: dict[Verdict, float] = {}
        for c in (Verdict.GOOD, Verdict.BAD):
            p_M_w[c] = (
                sum(
                    w
                    for i, w in zip(S, w_S, strict=True)
                    if eta.items[i].model_verdict == c
                )
                / w_total
            )
            p_r_w[c] = (
                sum(
                    w
                    for i, w in zip(S, w_S, strict=True)
                    if reference(i) == c
                )
                / w_total
            )
        p_e = sum(p_M_w[c] * p_r_w[c] for c in (Verdict.GOOD, Verdict.BAD))

    if abs(1.0 - p_e) < 1e-12:
        log.warning(
            "kappa_C undefined: chance-expected agreement p_e = 1 "
            "(M and reference both degenerate on a single class over S)"
        )
        return None

    return (p_o - p_e) / (1.0 - p_e)
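
A worked instance of the unweighted path, on two plain lists of verdict strings. This is a sketch, not the library function: kappa_unweighted is a hypothetical name, and it assumes that the substantive subset keeps the items where both the model verdict and the reference verdict are non-abstain (substantive_index is not shown on this page).

```python
from typing import Optional, Sequence


def kappa_unweighted(model: Sequence[str], reference: Sequence[str]) -> Optional[float]:
    """Cohen's kappa over the pairs where both verdicts are substantive."""
    # Assumption: S keeps items where neither side abstains.
    pairs = [(m, r) for m, r in zip(model, reference) if m != "abstain" and r != "abstain"]
    if not pairs:
        return None
    n = len(pairs)
    p_o = sum(1 for m, r in pairs if m == r) / n  # observed agreement
    p_e = 0.0  # chance-expected agreement
    for c in ("good", "bad"):
        p_m = sum(1 for m, _ in pairs if m == c) / n
        p_r = sum(1 for _, r in pairs if r == c) / n
        p_e += p_m * p_r
    if abs(1.0 - p_e) < 1e-12:
        return None  # both columns degenerate on one class
    return (p_o - p_e) / (1.0 - p_e)


model = ["good", "good", "bad", "bad", "good", "bad", "abstain"]
reference = ["good", "bad", "bad", "bad", "good", "good", "good"]
kappa = kappa_unweighted(model, reference)
print(kappa)
```

Six pairs survive the filter, four of them agree (p_o = 4/6), both columns split evenly between good and bad (p_e = 0.5), so kappa = (4/6 - 1/2) / (1/2) = 1/3.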

infereval.metrics.fleiss_kappa ¶

fleiss_kappa(eta: Evaluation, *, weights: WeightFn | None = None) -> float | None

:math:\kappa_F(\eta) with :math:M as the :math:(m+1)-th annotator.

The annotators on each item are the analyst verdicts followed by model_verdict. Items where any annotator (analyst or model) is non-substantive are excluded from :math:S_F per the paper's Definition 10.

Parameters¶

eta
    The evaluation.
weights
    Optional per-item weight function. When None (default), each item contributes equally and the result is byte-identical to the unweighted formulation. When provided, each item's contribution to P_bar and to the chance-expected agreement is scaled by weights(item); pass :func:margin_weight to compute the confidence-weighted variant.

Source code in src/infereval/metrics.py

def fleiss_kappa(
    eta: Evaluation,
    *,
    weights: WeightFn | None = None,
) -> float | None:
    """:math:`\\kappa_F(\\eta)` with :math:`M` as the :math:`(m+1)`-th annotator.

    The annotators on each item are the analyst verdicts followed by
    ``model_verdict``. Items where any annotator (analyst or model) is
    non-substantive are excluded from :math:`S_F` per the paper's
    Definition 10.

    Parameters
    ----------
    eta
        The evaluation.
    weights
        Optional per-item weight function. When ``None`` (default), each
        item contributes equally and the result is byte-identical to the
        unweighted formulation. When provided, each item's contribution
        to ``P_bar`` and to the chance-expected agreement is scaled by
        ``weights(item)`` — pass :func:`margin_weight` to compute the
        confidence-weighted variant.
    """
    tuples = [
        [*item.analyst_verdicts, item.model_verdict] for item in eta.items
    ]
    tuple_weights = (
        [weights(item) for item in eta.items] if weights is not None else None
    )
    return _fleiss_over_tuples(tuples, tuple_weights=tuple_weights)
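
The helper _fleiss_over_tuples is not shown on this page. As an illustration of what the unweighted statistic computes, the sketch below applies the textbook Fleiss formula to rows of verdict strings, dropping any row that contains a non-substantive verdict. The name fleiss_unweighted is hypothetical, and the library's helper may differ in details such as logging and weighting.

```python
from typing import Optional, Sequence

CATEGORIES = ("good", "bad")


def fleiss_unweighted(rows: Sequence[Sequence[str]]) -> Optional[float]:
    """Textbook Fleiss' kappa over rows where every annotator is substantive."""
    kept = [list(row) for row in rows if all(v in CATEGORIES for v in row)]
    if not kept:
        return None
    n_raters = len(kept[0])
    if n_raters < 2 or any(len(row) != n_raters for row in kept):
        return None
    # counts[i][j] = number of annotators assigning category j to item i.
    counts = [[row.count(c) for c in CATEGORIES] for row in kept]
    # Mean per-item agreement P_bar.
    p_bar = sum(
        (sum(k * k for k in rc) - n_raters) / (n_raters * (n_raters - 1))
        for rc in counts
    ) / len(kept)
    # Chance-expected agreement from the pooled category shares.
    total = len(kept) * n_raters
    shares = [sum(rc[j] for rc in counts) / total for j in range(len(CATEGORIES))]
    p_e = sum(s * s for s in shares)
    if abs(1.0 - p_e) < 1e-12:
        return None
    return (p_bar - p_e) / (1.0 - p_e)


rows = [
    ["good", "good", "good"],
    ["good", "good", "bad"],
    ["bad", "bad", "bad"],
    ["good", "bad", "bad"],
    ["good", "abstain", "bad"],  # dropped: one annotator abstained
]
kappa_f = fleiss_unweighted(rows)
print(kappa_f)
```

In the library call the last column of each row would be model_verdict. Here the four kept rows give P_bar = 2/3 and P_e = 1/2, so the statistic is 1/3.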

infereval.metrics.inter_analyst_fleiss ¶

inter_analyst_fleiss(source: Evaluation | Benchmark, *, analyst_indices: Sequence[int] | None = None) -> float | None

:math:\kappa_F^*(\beta): Fleiss' kappa over analyst verdicts alone.

Accepts either an :class:~infereval.evaluation.Evaluation or a :class:~infereval.benchmark.Benchmark. Returns :data:None (with a logged warning) when :math:m < 2 or when the analysts are unanimous on every item -- the two conditions Remark 4 calls out as making the baseline unavailable.

By default Fleiss' kappa is computed over all analyst columns declared on source.analysts. This is the methodological reading of Remark 4: the inter-analyst baseline is the agreement of the full analyst pool whose verdicts the benchmark records.

For panelled benchmarks, the per-panel decomposition is available via :func:inter_analyst_fleiss_per_panel; the cross-panel Cohen's kappa is in :func:cross_panel_kappa. Those exist alongside the all-analyst figure here, not in place of it.

Parameters¶

source
    :class:Evaluation or :class:Benchmark.
analyst_indices
    Optional positional indices into source.analysts. When supplied, Fleiss' kappa is computed only over those columns. Useful for computing the figure over any subset (e.g. analyst_indices=benchmark.analyst_indices_in_panel(benchmark.resolved_primary_panel()) reproduces the primary-panel-only behaviour the pre-v0.7.0 default returned on panelled benchmarks).

Notes¶

.. versionchanged:: 0.7.0 On panelled benchmarks, the default behaviour changed from "primary panel only" to "all analysts". The previous default silently inflated κ_F* when the primary panel was internally unanimous; per :gh-issue:82, the all-analyst figure is the methodologically correct Remark 4 baseline. The analyst_indices parameter is the explicit migration path for the rare caller who wants the pre-0.7.0 narrowed value.

Source code in src/infereval/metrics.py

def inter_analyst_fleiss(
    source: Evaluation | Benchmark,
    *,
    analyst_indices: Sequence[int] | None = None,
) -> float | None:
    """:math:`\\kappa_F^*(\\beta)`: Fleiss' kappa over analyst verdicts alone.

    Accepts either an :class:`~infereval.evaluation.Evaluation` or a
    :class:`~infereval.benchmark.Benchmark`. Returns :data:`None` (with
    a logged warning) when :math:`m < 2` or when the analysts are
    unanimous on every item -- the two conditions Remark 4 calls out as
    making the baseline unavailable.

    By default Fleiss' kappa is computed over **all** analyst columns
    declared on ``source.analysts``. This is the methodological reading
    of Remark 4: the inter-analyst baseline is the agreement of the
    full analyst pool whose verdicts the benchmark records.

    For panelled benchmarks, the per-panel decomposition is available
    via :func:`inter_analyst_fleiss_per_panel`; the cross-panel Cohen's
    kappa is in :func:`cross_panel_kappa`. Those exist alongside the
    all-analyst figure here, not in place of it.

    Parameters
    ----------
    source
        :class:`Evaluation` or :class:`Benchmark`.
    analyst_indices
        Optional positional indices into ``source.analysts``. When
        supplied, Fleiss' kappa is computed only over those columns.
        Useful for computing the figure over any subset (e.g.
        ``analyst_indices=benchmark.analyst_indices_in_panel(
        benchmark.resolved_primary_panel())`` reproduces the
        primary-panel-only behaviour the pre-v0.7.0 default returned
        on panelled benchmarks).

    Notes
    -----
    .. versionchanged:: 0.7.0
        On panelled benchmarks, the default behaviour changed from
        "primary panel only" to "all analysts". The previous default
        silently inflated κ_F* when the primary panel was internally
        unanimous; per :gh-issue:`82`, the all-analyst figure is the
        methodologically correct Remark 4 baseline. The
        ``analyst_indices`` parameter is the explicit migration path
        for the rare caller who wants the pre-0.7.0 narrowed value.
    """
    items = source.items
    if analyst_indices is None:
        tuples = [list(it.analyst_verdicts) for it in items]
    else:
        tuples = [[it.analyst_verdicts[j] for j in analyst_indices] for it in items]
    return _fleiss_over_tuples(tuples)

infereval.metrics.inter_analyst_fleiss_per_panel ¶

inter_analyst_fleiss_per_panel(benchmark: Benchmark) -> dict[str, float | None]

:math:\kappa_F^* computed per analyst panel.

Returns a mapping panel_name -> κ_F* for every panel declared on the benchmark. A panel value is None when the panel has fewer than 2 analysts or when the analysts are unanimous on every item (per the same conditions :func:inter_analyst_fleiss honours).

Empty dict if the benchmark is unpanelled. Phase 1.4 of the construct-validity infrastructure (R4).

Source code in src/infereval/metrics.py

def inter_analyst_fleiss_per_panel(
    benchmark: Benchmark,
) -> dict[str, float | None]:
    """:math:`\\kappa_F^*` computed per analyst panel.

    Returns a mapping ``panel_name -> κ_F*`` for every panel declared on
    the benchmark. A panel value is ``None`` when the panel has fewer
    than 2 analysts or when the analysts are unanimous on every item
    (per the same conditions :func:`inter_analyst_fleiss` honours).

    Empty dict if the benchmark is unpanelled. Phase 1.4 of the
    construct-validity infrastructure (R4).
    """
    out: dict[str, float | None] = {}
    for name in benchmark.panel_names():
        indices = benchmark.analyst_indices_in_panel(name)
        tuples = [
            [it.analyst_verdicts[j] for j in indices] for it in benchmark.items
        ]
        out[name] = _fleiss_over_tuples(tuples)
    return out

infereval.metrics.cross_panel_kappa ¶

cross_panel_kappa(benchmark: Benchmark, *, primary: str | None = None, check: str | None = None) -> float | None

Cohen's :math:\kappa_C between two panels' per-item consensus verdicts.

Computes a per-panel consensus verdict for each item (majority among panel members, abstain on tie) and then runs Cohen's kappa between the two columns, restricted to items where both panels yield a substantive verdict.

Parameters¶

benchmark
    Panelled benchmark.
primary
    Name of the primary panel. Defaults to benchmark.resolved_primary_panel().
check
    Name of the panel to compare against. When None and exactly two panels are declared, picks the non-primary one automatically.

Returns¶

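float | None
    Cohen's kappa over the items on which both panels reach a substantive consensus. None (with a logged warning) when either named panel is not declared on the benchmark, when check is omitted and the benchmark does not declare exactly two panels, when no item is substantive for both panels, or when the chance-expected agreement equals 1.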

Phase 1.4 of the construct-validity infrastructure (R4 — guards against shared-error agreement within the primary panel by surfacing the independent panel's view).

Source code in src/infereval/metrics.py

def cross_panel_kappa(
    benchmark: Benchmark,
    *,
    primary: str | None = None,
    check: str | None = None,
) -> float | None:
    """Cohen's :math:`\\kappa_C` between two panels' per-item consensus verdicts.

    Computes a per-panel consensus verdict for each item (majority among
    panel members, abstain on tie) and then runs Cohen's kappa between
    the two columns, restricted to items where both panels yield a
    substantive verdict.

    Parameters
    ----------
    benchmark
        Panelled benchmark.
    primary
        Name of the primary panel. Defaults to
        ``benchmark.resolved_primary_panel()``.
    check
        Name of the panel to compare against. When ``None`` and exactly
        two panels are declared, picks the non-primary one
        automatically.

    Returns
    -------
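    float | None
        Cohen's kappa over the items on which both panels reach a
        substantive consensus. ``None`` (with a logged warning) when
        either named panel is not declared on the benchmark, when
        ``check`` is omitted and the benchmark does not declare exactly
        two panels, when no item is substantive for both panels, or
        when the chance-expected agreement equals 1.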

    Phase 1.4 of the construct-validity infrastructure (R4 — guards
    against shared-error agreement within the primary panel by
    surfacing the independent panel's view).
    """
    names = benchmark.panel_names()
    if primary is None:
        primary = benchmark.resolved_primary_panel()
    if primary is None or primary not in names:
        log.warning(
            "cross_panel_kappa: primary panel %r not declared on benchmark %r",
            primary,
            benchmark.id,
        )
        return None
    if check is None:
        others = [n for n in names if n != primary]
        if len(others) != 1:
            log.warning(
                "cross_panel_kappa: 'check' panel must be supplied when the "
                "benchmark declares != 2 panels (declared: %s)",
                names,
            )
            return None
        check = others[0]
    if check not in names:
        log.warning(
            "cross_panel_kappa: check panel %r not declared on benchmark %r",
            check,
            benchmark.id,
        )
        return None

    primary_idx = benchmark.analyst_indices_in_panel(primary)
    check_idx = benchmark.analyst_indices_in_panel(check)

    primary_col = [
        _panel_consensus_verdict(it.analyst_verdicts, primary_idx)
        for it in benchmark.items
    ]
    check_col = [
        _panel_consensus_verdict(it.analyst_verdicts, check_idx)
        for it in benchmark.items
    ]

    # Restrict to items where both panels reached a substantive consensus.
    pairs = [
        (p, c)
        for p, c in zip(primary_col, check_col, strict=True)
        if p != Verdict.ABSTAIN and c != Verdict.ABSTAIN
    ]
    if not pairs:
        log.warning(
            "cross_panel_kappa: empty substantive intersection between panels "
            "%r and %r on benchmark %r",
            primary,
            check,
            benchmark.id,
        )
        return None

    cats = (Verdict.GOOD, Verdict.BAD)
    n = len(pairs)
    p_obs = sum(1 for p, c in pairs if p == c) / n
    pa = {v: sum(1 for p, _ in pairs if p == v) / n for v in cats}
    pc = {v: sum(1 for _, c in pairs if c == v) / n for v in cats}
    p_exp = sum(pa[v] * pc[v] for v in cats)
    if abs(1 - p_exp) < 1e-12:
        log.warning(
            "cross_panel_kappa: chance-expected agreement = 1 (both panels "
            "are degenerate on the same single class); kappa is undefined"
        )
        return None
    return (p_obs - p_exp) / (1 - p_exp)
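
The first half of the procedure, building one consensus column per panel and keeping the items where both are substantive, can be sketched on plain lists. The helper _panel_consensus_verdict is not shown on this page; panel_consensus below is a hypothetical stand-in that assumes the same good-versus-bad count comparison as consensus_verdict, restricted to the panel's columns.

```python
def panel_consensus(verdicts: list[str], indices: list[int]) -> str:
    """Consensus among the analysts at `indices` (assumed rule: compare counts)."""
    picked = [verdicts[j] for j in indices]
    good, bad = picked.count("good"), picked.count("bad")
    if good > bad:
        return "good"
    if bad > good:
        return "bad"
    return "abstain"


# Four analysts per item: columns 0-1 form the primary panel, 2-3 the check panel.
rows = [
    ["good", "good", "good", "good"],
    ["bad", "bad", "bad", "bad"],
    ["good", "bad", "good", "good"],  # primary panel ties, so the item is dropped
    ["good", "good", "bad", "bad"],
]
primary_idx, check_idx = [0, 1], [2, 3]

pairs = [
    (p, c)
    for p, c in (
        (panel_consensus(r, primary_idx), panel_consensus(r, check_idx)) for r in rows
    )
    if p != "abstain" and c != "abstain"
]
print(pairs)
```

Cohen's kappa is then computed over these pairs exactly as in the final lines of the listing above.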

infereval.metrics.MetricsReport `dataclass` ¶

MetricsReport(eta: Evaluation, benchmark: Benchmark | None = None)

Bundle of metrics over an :class:Evaluation, with decomposition filters.

Parameters¶

eta
    The evaluation to report on.
benchmark
    Optional benchmark. Required for :meth:by_rsr_target and :meth:coverage_per_analyst_named; other methods work without it.

n `property` ¶

n: int

Number of evaluation items.

verdict_distributions `property` ¶

verdict_distributions: dict[str, VerdictDistribution]

Per-item :class:VerdictDistribution keyed by item id.

Always computed (cheap); included in :meth:to_dict output so downstream consumers can read the dispersion without re-running majority-vote logic.

coverage_per_analyst_named ¶

coverage_per_analyst_named() -> dict[str, float]

Per-analyst coverage keyed by analyst id (requires :attr:benchmark).

Source code in src/infereval/metrics.py

def coverage_per_analyst_named(self) -> dict[str, float]:
    """Per-analyst coverage keyed by analyst id (requires :attr:`benchmark`)."""
    if self.benchmark is None:
        raise ValueError(
            "coverage_per_analyst_named requires a benchmark to resolve analyst ids"
        )
    return {
        a.id: coverage_analyst(self.eta, j)
        for j, a in enumerate(self.benchmark.analysts)
    }

cohens_kappa ¶

cohens_kappa(reference: ReferenceFn | None = None, *, weights: WeightFn | None = None) -> float | None

:math:\kappa_C(\eta, r). Default reference is the analyst consensus :math:c_i.

Pass weights=margin_weight for the confidence-weighted variant.

Source code in src/infereval/metrics.py

def cohens_kappa(
    self,
    reference: ReferenceFn | None = None,
    *,
    weights: WeightFn | None = None,
) -> float | None:
    """:math:`\\kappa_C(\\eta, r)`. Default reference is the analyst consensus :math:`c_i`.

    Pass ``weights=margin_weight`` for the confidence-weighted variant.
    """
    ref = reference if reference is not None else consensus_reference(self.eta)
    return cohens_kappa(self.eta, ref, weights=weights)

cohens_kappa_analyst ¶

cohens_kappa_analyst(analyst_index: int, *, weights: WeightFn | None = None) -> float | None

:math:\kappa_C(\eta, v_{:,j}): M vs. one specific analyst.

Source code in src/infereval/metrics.py

def cohens_kappa_analyst(
    self,
    analyst_index: int,
    *,
    weights: WeightFn | None = None,
) -> float | None:
    """:math:`\\kappa_C(\\eta, v_{:,j})`: M vs. one specific analyst."""
    return cohens_kappa(
        self.eta,
        analyst_reference(self.eta, analyst_index),
        weights=weights,
    )

fleiss_kappa_weighted ¶

fleiss_kappa_weighted(weights: WeightFn) -> float | None

Weighted :math:\kappa_F(\eta) — pass :func:margin_weight for the standard variant.

Exposed as a method (not a property) because weighting is an opt-in methodological choice that must be made explicitly; the locked default is that the unweighted κ remains the headline number.

Source code in src/infereval/metrics.py

def fleiss_kappa_weighted(self, weights: WeightFn) -> float | None:
    """Weighted :math:`\\kappa_F(\\eta)` — pass :func:`margin_weight` for the standard variant.

    Exposed as a method (not a property) because weighting is an
    opt-in methodological choice that must be made explicitly; the
    locked default is that the unweighted κ remains the headline
    number.
    """
    return fleiss_kappa(self.eta, weights=weights)

cell_summary ¶

cell_summary(reference: ReferenceFn | None = None) -> CellSummary

:class:CellSummary for this report against reference.

Default reference is the analyst consensus :math:c_i. Used by the infereval metrics --by-tag / --by-rsr-target CLI renderers and by infereval report --by-tag … to surface the per-cell substantive-n and class counts and to gate the under-powered annotation.

Source code in src/infereval/metrics.py

def cell_summary(
    self, reference: ReferenceFn | None = None
) -> CellSummary:
    """:class:`CellSummary` for this report against ``reference``.

    Default reference is the analyst consensus :math:`c_i`. Used by
    the ``infereval metrics --by-tag`` / ``--by-rsr-target`` CLI
    renderers and by ``infereval report --by-tag …`` to surface the
    per-cell substantive-n and class counts and to gate the
    under-powered annotation.
    """
    ref = reference if reference is not None else consensus_reference(self.eta)
    return cell_summary(self.eta, ref)

cohens_kappa_with_ci ¶

cohens_kappa_with_ci(reference: ReferenceFn | None = None, *, iterations: int = 1000, subsample_size: int | None = None, confidence: float = 0.95, seed: int | None = None) -> tuple[float, float, float]

:math:\kappa_C with a Politis-Romano subsampling CI.

Convenience wrapper around :func:subsampling_kappa_ci. See its docstring for the procedure, the subsample-size default, and the exceptions raised for too-small benchmarks.

Source code in src/infereval/metrics.py

def cohens_kappa_with_ci(
    self,
    reference: ReferenceFn | None = None,
    *,
    iterations: int = 1000,
    subsample_size: int | None = None,
    confidence: float = 0.95,
    seed: int | None = None,
) -> tuple[float, float, float]:
    """:math:`\\kappa_C` with a Politis-Romano subsampling CI.

    Convenience wrapper around :func:`subsampling_kappa_ci`. See its
    docstring for the procedure, the subsample-size default, and
    the exceptions raised for too-small benchmarks.
    """
    ref = reference if reference is not None else consensus_reference(self.eta)
    return subsampling_kappa_ci(
        lambda e: cohens_kappa(e, ref),
        self.eta,
        iterations=iterations,
        subsample_size=subsample_size,
        confidence=confidence,
        seed=seed,
    )
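
The general shape of a Politis-Romano subsampling interval can be sketched as follows. This is a generic illustration that assumes a root-n convergence rate and an (estimate, low, high) return order. It is not the body of `subsampling_kappa_ci`, whose subsample-size default and error handling are described in its own docstring and may differ. The helper name `subsampling_ci_sketch` and the toy data are hypothetical.

```python
from __future__ import annotations

import math
import random
from typing import Callable, Sequence


def subsampling_ci_sketch(
    stat: Callable[[Sequence[float]], float],
    data: Sequence[float],
    *,
    subsample_size: int,
    iterations: int = 1000,
    confidence: float = 0.95,
    seed: int | None = None,
) -> tuple[float, float, float]:
    """Return (estimate, low, high) using subsamples drawn without replacement."""
    n = len(data)
    b = subsample_size
    if not 0 < b < n:
        raise ValueError("subsample_size must satisfy 0 < b < n")
    rng = random.Random(seed)
    pool = list(data)
    theta_n = stat(pool)
    # Root distribution: sqrt(b) * (theta_b - theta_n) over random subsamples.
    roots = sorted(
        math.sqrt(b) * (stat(rng.sample(pool, b)) - theta_n)
        for _ in range(iterations)
    )
    alpha = 1.0 - confidence
    lo_q = roots[math.floor((alpha / 2) * (iterations - 1))]
    hi_q = roots[math.ceil((1 - alpha / 2) * (iterations - 1))]
    # Invert the root at the full-sample rate sqrt(n).
    return theta_n, theta_n - hi_q / math.sqrt(n), theta_n - lo_q / math.sqrt(n)


def mean(xs: Sequence[float]) -> float:
    return sum(xs) / len(xs)


data = [float(x) for x in range(1, 41)]
est, low, high = subsampling_ci_sketch(
    mean, data, subsample_size=10, iterations=500, seed=0
)
```

Passing a fixed `seed` makes the interval reproducible, which is presumably why the wrappers above expose one.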

fleiss_kappa_with_ci ¶

fleiss_kappa_with_ci(*, iterations: int = 1000, subsample_size: int | None = None, confidence: float = 0.95, seed: int | None = None) -> tuple[float, float, float]

:math:\kappa_F with a Politis-Romano subsampling CI.

Source code in src/infereval/metrics.py

def fleiss_kappa_with_ci(
    self,
    *,
    iterations: int = 1000,
    subsample_size: int | None = None,
    confidence: float = 0.95,
    seed: int | None = None,
) -> tuple[float, float, float]:
    """:math:`\\kappa_F` with a Politis-Romano subsampling CI."""
    return subsampling_kappa_ci(
        fleiss_kappa,
        self.eta,
        iterations=iterations,
        subsample_size=subsample_size,
        confidence=confidence,
        seed=seed,
    )

aggregate_dispersion_summary ¶

aggregate_dispersion_summary(*, thin_margin_threshold: float = 0.4) -> AggregateDispersion

Corpus-level summary of per-item dispersion.

Source code in src/infereval/metrics.py

def aggregate_dispersion_summary(
    self, *, thin_margin_threshold: float = 0.4
) -> AggregateDispersion:
    """Corpus-level summary of per-item dispersion."""
    dists = list(self.verdict_distributions.values())
    n = len(dists)
    if n == 0:
        return AggregateDispersion(
            n_items=0,
            mean_entropy=0.0,
            mean_margin=0.0,
            n_thin_margin=0,
            thin_margin_threshold=thin_margin_threshold,
            n_tie_broken=0,
        )
    return AggregateDispersion(
        n_items=n,
        mean_entropy=sum(d.entropy for d in dists) / n,
        mean_margin=sum(d.margin for d in dists) / n,
        n_thin_margin=sum(1 for d in dists if d.margin < thin_margin_threshold),
        thin_margin_threshold=thin_margin_threshold,
        n_tie_broken=sum(1 for d in dists if d.tie_broken),
    )
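
A stand-alone sketch of the same aggregation may help when reading the thin-margin count. `Dist` is a hypothetical stand-in that exposes only the three fields the method reads, and the values are illustrative; how `entropy` and `margin` are computed per item is not shown on this page and is not assumed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dist:
    """Hypothetical stand-in for a per-item verdict distribution."""

    entropy: float
    margin: float
    tie_broken: bool


def summarize(dists: list[Dist], thin_margin_threshold: float = 0.4) -> dict:
    n = len(dists)
    if n == 0:
        # Mirror the empty-report branch: zeros rather than a division by zero.
        return {
            "n_items": 0,
            "mean_entropy": 0.0,
            "mean_margin": 0.0,
            "n_thin_margin": 0,
            "n_tie_broken": 0,
        }
    return {
        "n_items": n,
        "mean_entropy": sum(d.entropy for d in dists) / n,
        "mean_margin": sum(d.margin for d in dists) / n,
        # Strict "<": an item exactly at the threshold is not counted as thin.
        "n_thin_margin": sum(1 for d in dists if d.margin < thin_margin_threshold),
        "n_tie_broken": sum(1 for d in dists if d.tie_broken),
    }


summary = summarize(
    [Dist(0.0, 1.0, False), Dist(1.0, 0.4, False), Dist(1.5, 0.0, True)]
)
```

Note that the comparison is strict, so an item whose margin equals the threshold does not count toward `n_thin_margin`.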

by_tag ¶

by_tag(tag: str) -> MetricsReport

Return a report restricted to items carrying tag.

Source code in src/infereval/metrics.py

def by_tag(self, tag: str) -> MetricsReport:
    """Return a report restricted to items carrying ``tag``."""
    filtered = self.eta.model_copy(
        update={"items": [it for it in self.eta.items if tag in it.tags]}
    )
    return MetricsReport(eta=filtered, benchmark=self.benchmark)

by_rsr_target ¶

by_rsr_target(X: frozenset[str], A: frozenset[str]) -> MetricsReport

Return a report restricted to items whose rsr_target matches (X, A).

rsr_target lives on benchmark items, not evaluation items, so :attr:benchmark is required.

Source code in src/infereval/metrics.py

def by_rsr_target(self, X: frozenset[str], A: frozenset[str]) -> MetricsReport:
    """Return a report restricted to items whose ``rsr_target`` matches ``(X, A)``.

    ``rsr_target`` lives on benchmark items, not evaluation items, so
    :attr:`benchmark` is required.
    """
    if self.benchmark is None:
        raise ValueError("by_rsr_target requires a benchmark to read rsr_target fields")
    keep_ids = {
        bi.id
        for bi in self.benchmark.items
        if bi.rsr_target is not None
        and frozenset(bi.rsr_target.X) == X
        and frozenset(bi.rsr_target.A) == A
    }
    filtered = self.eta.model_copy(
        update={"items": [it for it in self.eta.items if it.id in keep_ids]}
    )
    return MetricsReport(eta=filtered, benchmark=self.benchmark)
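
The matching rule is order-insensitive because both sides are compared as frozensets. A minimal sketch with hypothetical stand-in types (`RsrTarget`, `Item`) and illustrative bearer ids:

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class RsrTarget:
    """Hypothetical stand-in: only the X and A fields are modelled."""

    X: tuple[str, ...]
    A: tuple[str, ...]


@dataclass(frozen=True)
class Item:
    id: str
    rsr_target: RsrTarget | None = None


def ids_matching(items: list[Item], X: frozenset[str], A: frozenset[str]) -> set[str]:
    # Same predicate as the keep_ids comprehension above.
    return {
        it.id
        for it in items
        if it.rsr_target is not None
        and frozenset(it.rsr_target.X) == X
        and frozenset(it.rsr_target.A) == A
    }


items = [
    Item("i1", RsrTarget(("p", "q"), ("r",))),
    Item("i2", RsrTarget(("q", "p"), ("r",))),  # same target, different order
    Item("i3", RsrTarget(("p",), ("r",))),
    Item("i4"),  # no rsr_target, never matches
]
keep = ids_matching(items, frozenset({"p", "q"}), frozenset({"r"}))
```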

to_dict ¶

to_dict(*, include_verdict_distributions: bool = True, thin_margin_threshold: float = 0.4) -> dict[str, Any]

Render as a JSON-friendly dict (None where a kappa is undefined).

Parameters¶

include_verdict_distributions
    When True (default, the report_verdict_distribution = true locked methodology default), include per-item :class:VerdictDistribution entries plus the aggregate dispersion summary. Pass False to suppress for consumers that want the pre-dispersion shape exactly.

thin_margin_threshold
    Plurality-margin cutoff for the aggregate-dispersion "thin-margin" count. Default 0.4 matches the :class:~infereval.structure.ThinMarginAgreementCheck default.

Source code in src/infereval/metrics.py

def to_dict(
    self,
    *,
    include_verdict_distributions: bool = True,
    thin_margin_threshold: float = 0.4,
) -> dict[str, Any]:
    """Render as a JSON-friendly dict (None where a kappa is undefined).

    Parameters
    ----------
    include_verdict_distributions
        When ``True`` (default, the ``report_verdict_distribution = true``
        locked methodology default), include per-item
        :class:`VerdictDistribution` entries plus the aggregate
        dispersion summary. Pass ``False`` to suppress for
        consumers that want the pre-dispersion shape exactly.
    thin_margin_threshold
        Plurality-margin cutoff for the aggregate-dispersion
        "thin-margin" count. Default ``0.4`` matches the
        :class:`~infereval.structure.ThinMarginAgreementCheck`
        default.
    """
    out: dict[str, Any] = {
        "n": self.n,
        "coverage": self.coverage,
        "coverage_per_analyst": self.coverage_per_analyst,
        "cohens_kappa_consensus": self.cohens_kappa(),
        "fleiss_kappa": self.fleiss_kappa,
        "inter_analyst_fleiss": self.inter_analyst_fleiss,
    }
    if self.benchmark is not None:
        out["coverage_per_analyst_named"] = self.coverage_per_analyst_named()
    if include_verdict_distributions:
        distributions = self.verdict_distributions
        out["verdict_distributions"] = {
            item_id: {
                "good": d.good,
                "bad": d.bad,
                "abstain": d.abstain,
                "n_samples": d.n_samples,
                "verdict": d.verdict.value,
                "tie_broken": d.tie_broken,
                "entropy": d.entropy,
                "margin": d.margin,
            }
            for item_id, d in distributions.items()
        }
        agg = self.aggregate_dispersion_summary(
            thin_margin_threshold=thin_margin_threshold
        )
        out["aggregate_dispersion"] = {
            "n_items": agg.n_items,
            "mean_entropy": agg.mean_entropy,
            "mean_margin": agg.mean_margin,
            "n_thin_margin": agg.n_thin_margin,
            "fraction_thin_margin": agg.fraction_thin_margin,
            "thin_margin_threshold": agg.thin_margin_threshold,
            "n_tie_broken": agg.n_tie_broken,
        }
    return out
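
Because undefined kappas are carried as None, the dict serializes with the standard `json` module and the missing values become `null`. A small sketch using hypothetical values shaped like a few of the keys built above:

```python
import json

# Hypothetical values; only the key names come from to_dict above.
report_dict = {
    "n": 12,
    "coverage": 0.75,
    "cohens_kappa_consensus": None,  # an undefined kappa is carried as None
    "fleiss_kappa": 0.41,
}

payload = json.dumps(report_dict, sort_keys=True)
restored = json.loads(payload)
```

Consumers should therefore test for None (or `null`) before doing arithmetic on a kappa field.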

Structural checks (R13)¶

infereval.structure.run_all_checks ¶

run_all_checks(evaluation: Evaluation, benchmark: Benchmark, *, thin_margin_threshold: float = DEFAULT_THIN_MARGIN_THRESHOLD) -> StructuralReport

Run all four structural checks and bundle the results.

Source code in src/infereval/structure.py

def run_all_checks(
    evaluation: Evaluation,
    benchmark: Benchmark,
    *,
    thin_margin_threshold: float = DEFAULT_THIN_MARGIN_THRESHOLD,
) -> StructuralReport:
    """Run all four structural checks and bundle the results."""
    return StructuralReport(
        evaluation_id=evaluation.id,
        benchmark_id=evaluation.benchmark_id,
        checks=(
            containment_closure_check(evaluation, benchmark),
            rsr_role_consistency_check(evaluation, benchmark),
            base_case_stability_check(evaluation, benchmark),
            thin_margin_agreement_check(
                evaluation, benchmark, threshold=thin_margin_threshold
            ),
        ),
    )

infereval.structure.containment_closure_check ¶

containment_closure_check(evaluation: Evaluation, benchmark: Benchmark) -> StructuralCheck

Sanity-check that all self-implications are in I_M by construction.

Per Definition 3 clause i, every implication ⟨Γ, Δ⟩ with Γ ∩ Δ ≠ ∅ is in I_M regardless of what the model says. This check counts such items in the benchmark and confirms they're structurally satisfied; it doesn't need to consult the model's verdict (the framework guarantees it). Reported anyway because the count itself is informative — a benchmark with zero self-implications has different structural texture from one with many.

Source code in src/infereval/structure.py

def containment_closure_check(
    evaluation: Evaluation, benchmark: Benchmark
) -> StructuralCheck:
    """Sanity-check that all self-implications are in ``I_M`` by construction.

    Per Definition 3 clause i, every implication ⟨Γ, Δ⟩ with
    ``Γ ∩ Δ ≠ ∅`` is in ``I_M`` regardless of what the model says.
    This check counts such items in the benchmark and confirms they're
    structurally satisfied; it doesn't *need* to consult the model's
    verdict (the framework guarantees it). Reported anyway because the
    count itself is informative — a benchmark with zero self-implications
    has different structural texture from one with many.
    """
    self_implications = [
        it
        for it in benchmark.items
        if set(it.premises) & set(it.conclusions)
    ]
    # Per construction, every such item is in I_M; rate is trivially 1.0
    # whenever items_checked > 0. We report the count for visibility.
    return StructuralCheck(
        name="containment_closure",
        items_checked=len(self_implications),
        items_satisfying=len(self_implications),
        anomalies=(),
    )
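
The self-implication test itself is a set intersection. A minimal sketch with hypothetical bearer ids:

```python
def is_self_implication(premises: set[str], conclusions: set[str]) -> bool:
    """True when premises and conclusions share at least one bearer id."""
    return bool(set(premises) & set(conclusions))


# Hypothetical candidate implications as (premises, conclusions) pairs.
candidates = [
    ({"sa", "sb"}, {"sa"}),  # overlap on "sa"
    ({"sa"}, {"sb"}),        # disjoint
    (set(), {"sa"}),         # empty premise set, no overlap
]
flags = [is_self_implication(p, c) for p, c in candidates]
```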

infereval.structure.rsr_role_consistency_check ¶

rsr_role_consistency_check(evaluation: Evaluation, benchmark: Benchmark) -> StructuralCheck

Check that role-tagged items' model verdicts match the role's prediction.

For each item carrying a role tag (supporter / defeater / irrelevant-addition) AND an rsr_target, looks up the "base-inference" item with the same target and uses the model's verdict on the base to predict the expected verdict on the role-tagged item:

- supporter is supposed to strengthen the base verdict. If the base is GOOD, the supporter should remain GOOD; if the base is BAD, the supporter is excluded (a supporter can't strengthen a bad inference; that's a defeater being treated wrongly).
- defeater is supposed to flip the base verdict. If the base is GOOD, the defeater should be BAD.
- irrelevant-addition is supposed to preserve the base verdict under RSR. If the base is GOOD, the irrelevant addition should stay GOOD; if the base is BAD, it should stay BAD.

Anomalies are items whose model verdict contradicts the expected role-conditional verdict. Items where the base or the role-tagged item itself has an ABSTAIN verdict are excluded from the check (the role's prediction is undefined relative to abstention).

Source code in src/infereval/structure.py

def rsr_role_consistency_check(
    evaluation: Evaluation, benchmark: Benchmark
) -> StructuralCheck:
    """Check that role-tagged items' model verdicts match the role's prediction.

    For each item carrying a role tag (``supporter`` / ``defeater`` /
    ``irrelevant-addition``) AND an ``rsr_target``, looks up the
    "base-inference" item with the same target and uses the model's
    verdict on the base to predict the expected verdict on the role-tagged
    item:

    - ``supporter`` is supposed to *strengthen* the base verdict. If the
      base is GOOD, the supporter should remain GOOD; if the base is
      BAD, the supporter is excluded (a supporter can't strengthen a
      bad inference — that's a defeater being treated wrongly).
    - ``defeater`` is supposed to *flip* the base verdict. If the base
      is GOOD, the defeater should be BAD.
    - ``irrelevant-addition`` is supposed to preserve the base verdict
      under RSR. If the base is GOOD, the irrelevant addition should
      stay GOOD; if the base is BAD, it should stay BAD.

    Anomalies are items whose model verdict contradicts the expected
    role-conditional verdict. Items where the base or the role-tagged
    item itself has an ABSTAIN verdict are excluded from the check
    (the role's prediction is undefined relative to abstention).
    """
    # Index items by id, evaluation-side and benchmark-side.
    eval_by_id = {it.id: it for it in evaluation.items}

    # Group benchmark items by their rsr_target's canonical key, then
    # within each target separate the base-inference reference items
    # from the role-tagged items we're going to check.
    targets: dict[
        tuple[tuple[str, ...], tuple[str, ...]],
        dict[str, list[BenchmarkItem]],
    ] = defaultdict(lambda: {"base": [], "checked": []})

    for it in benchmark.items:
        if it.rsr_target is None:
            continue
        key = (
            tuple(sorted(it.rsr_target.X)),
            tuple(sorted(it.rsr_target.A)),
        )
        role = _role_of(it)
        if role == _ROLE_BASE:
            targets[key]["base"].append(it)
        elif role in {_ROLE_SUPPORTER, _ROLE_DEFEATER, _ROLE_IRRELEVANT}:
            targets[key]["checked"].append(it)

    anomalies: list[StructuralAnomaly] = []
    items_checked = 0
    items_satisfying = 0

    for key, groups in targets.items():
        base_items = groups["base"]
        checked_items = groups["checked"]
        if not base_items or not checked_items:
            # Need a base reference and at least one role-tagged item
            # to run the check on this target.
            continue

        # If multiple base items exist, they must all share one verdict;
        # the target is skipped when they disagree (the
        # base_case_stability_check surfaces the divergence separately).
        # Partial-evaluation guard: a benchmark item carrying an
        # rsr_target may not appear in this evaluation (e.g. when the
        # eval was produced from a paraphrase-cycle variant or a tag
        # filter). Skip those items rather than raise; the metrics
        # contract elsewhere in the package is "missing data is
        # surfaced via warnings + None, not exceptions".
        present_base_items = [b for b in base_items if b.id in eval_by_id]
        if len(present_base_items) < len(base_items):
            missing = [b.id for b in base_items if b.id not in eval_by_id]
            log.warning(
                "rsr_role_consistency_check: skipping base items absent from "
                "evaluation %r: %s",
                evaluation.id,
                missing,
            )
        if not present_base_items:
            continue
        base_verdicts = [eval_by_id[b.id].model_verdict for b in present_base_items]
        if len(set(base_verdicts)) > 1:
            continue  # base is unstable; can't predict roles
        base_verdict = base_verdicts[0]
        if base_verdict == Verdict.ABSTAIN:
            # Base is non-substantive; role predictions are undefined.
            continue

        for it in checked_items:
            if it.id not in eval_by_id:
                log.warning(
                    "rsr_role_consistency_check: skipping role-tagged item "
                    "%r absent from evaluation %r",
                    it.id,
                    evaluation.id,
                )
                continue
            eval_item = eval_by_id[it.id]
            actual = eval_item.model_verdict
            if actual == Verdict.ABSTAIN:
                # The role-tagged item itself is non-substantive; skip.
                continue
            role = _role_of(it)
            assert role is not None  # checked above
            expected = _expected_verdict_for_role(role, base_verdict)
            if expected is None:
                continue  # role doesn't make a prediction here
            items_checked += 1
            if actual == expected:
                items_satisfying += 1
            else:
                target_str = (
                    f"⟨{{{','.join(key[0])}}}, {{{','.join(key[1])}}}⟩"
                )
                anomalies.append(
                    StructuralAnomaly(
                        item_id=it.id,
                        expected=str(expected),
                        actual=str(actual),
                        explanation=(
                            f"item is tagged '{role}' on target {target_str} "
                            f"with base verdict {base_verdict}; role predicts "
                            f"{expected} but model returned {actual}"
                        ),
                    )
                )

    return StructuralCheck(
        name="rsr_role_consistency",
        items_checked=items_checked,
        items_satisfying=items_satisfying,
        anomalies=tuple(anomalies),
    )
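
The helper `_expected_verdict_for_role` is called above but its source is not shown on this page. The sketch below is one reading of the docstring's role rules, not the library's implementation. In particular, returning None for a defeater on a BAD base is an assumption, since the docstring only states the GOOD case. The local `Verdict` enum is a stand-in for `infereval.types.Verdict`.

```python
from __future__ import annotations

from enum import Enum


class Verdict(str, Enum):
    """Local stand-in for infereval.types.Verdict."""

    GOOD = "good"
    BAD = "bad"
    ABSTAIN = "abstain"


def expected_verdict(role: str, base: Verdict) -> Verdict | None:
    """Predicted verdict for a role-tagged item, or None when no prediction."""
    if base is Verdict.ABSTAIN:
        return None  # role predictions are undefined relative to abstention
    if role == "supporter":
        # A supporter keeps a GOOD base GOOD; a BAD base is excluded.
        return Verdict.GOOD if base is Verdict.GOOD else None
    if role == "defeater":
        # Assumption: only the GOOD-base case makes a prediction.
        return Verdict.BAD if base is Verdict.GOOD else None
    if role == "irrelevant-addition":
        return base  # the base verdict is preserved either way
    return None


prediction = expected_verdict("defeater", Verdict.GOOD)
```

A None result corresponds to the "role doesn't make a prediction here" branch, so the item is neither counted as checked nor flagged.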

infereval.structure.base_case_stability_check ¶

base_case_stability_check(evaluation: Evaluation, benchmark: Benchmark) -> StructuralCheck

When a target has multiple base-inference items, the model should agree on all of them.

Anomalies surface targets where the model gives different verdicts on multiple base-inference items, since the base verdict is what the rest of the RSR machinery is anchored to.

Source code in src/infereval/structure.py

def base_case_stability_check(
    evaluation: Evaluation, benchmark: Benchmark
) -> StructuralCheck:
    """When a target has multiple ``base-inference`` items, the model should agree on all of them.

    Anomalies surface targets where the model gives different verdicts
    on multiple base-inference items, since the base verdict is what
    the rest of the RSR machinery is anchored to.
    """
    eval_by_id = {it.id: it for it in evaluation.items}
    base_by_target: dict[
        tuple[tuple[str, ...], tuple[str, ...]], list[BenchmarkItem]
    ] = defaultdict(list)

    for it in benchmark.items:
        if it.rsr_target is None or _role_of(it) != _ROLE_BASE:
            continue
        key = (
            tuple(sorted(it.rsr_target.X)),
            tuple(sorted(it.rsr_target.A)),
        )
        base_by_target[key].append(it)

    anomalies: list[StructuralAnomaly] = []
    items_checked = 0
    items_satisfying = 0
    for key, bases in base_by_target.items():
        # Partial-evaluation guard: same as in rsr_role_consistency_check.
        present_bases = [b for b in bases if b.id in eval_by_id]
        if len(present_bases) < len(bases):
            missing = [b.id for b in bases if b.id not in eval_by_id]
            log.warning(
                "base_case_stability_check: skipping base items absent from "
                "evaluation %r: %s",
                evaluation.id,
                missing,
            )
        if len(present_bases) < 2:
            continue  # nothing to check (need ≥ 2 present bases per target)
        verdicts = [eval_by_id[b.id].model_verdict for b in present_bases]
        unique = set(verdicts)
        items_checked += len(present_bases)
        if len(unique) == 1:
            items_satisfying += len(present_bases)
        else:
            target_str = (
                f"⟨{{{','.join(key[0])}}}, {{{','.join(key[1])}}}⟩"
            )
            # Flag every present base item in the divergent set as an anomaly.
            for b, v in zip(present_bases, verdicts, strict=True):
                anomalies.append(
                    StructuralAnomaly(
                        item_id=b.id,
                        expected=f"a single shared verdict across base-inferences on {target_str}",
                        actual=f"{v} (other verdicts on this target: {sorted(u.value for u in unique - {v})})",
                        explanation=(
                            f"target {target_str} has {len(present_bases)} base-inference "
                            f"items present in the evaluation with verdicts "
                            f"{[str(v) for v in verdicts]} — the base case is "
                            f"structurally unstable"
                        ),
                    )
                )

    return StructuralCheck(
        name="base_case_stability",
        items_checked=items_checked,
        items_satisfying=items_satisfying,
        anomalies=tuple(anomalies),
    )
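
The grouping step can be sketched on plain tuples. Sorting X and A into tuples gives a canonical key, so two base items that list the same target in a different order land in the same group. The function name and the sample rows are hypothetical, and verdicts are plain strings here rather than `Verdict` members.

```python
from collections import defaultdict


def unstable_targets(bases):
    """bases: iterable of (X, A, verdict) triples, one per base-inference item."""
    by_target = defaultdict(list)
    for X, A, verdict in bases:
        key = (tuple(sorted(X)), tuple(sorted(A)))  # canonical target key
        by_target[key].append(verdict)
    # A target is unstable when it has at least two bases that disagree.
    return {
        key: verdicts
        for key, verdicts in by_target.items()
        if len(verdicts) >= 2 and len(set(verdicts)) > 1
    }


rows = [
    ({"p", "q"}, {"r"}, "good"),
    ({"q", "p"}, {"r"}, "bad"),   # same target as the row above
    ({"s"}, {"t"}, "good"),
    ({"s"}, {"t"}, "good"),       # two bases that agree
    ({"u"}, {"v"}, "bad"),        # single base, nothing to compare
]
unstable = unstable_targets(rows)
```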

infereval.structure.StructuralReport `dataclass` ¶

StructuralReport(evaluation_id: str, benchmark_id: str, checks: tuple[StructuralCheck, ...] = tuple())

Bundle of structural checks run against an Evaluation + Benchmark pair.

all_satisfied `property` ¶

all_satisfied: bool

True iff every check has rate == 1.0 (and no anomalies).

infereval.structure.StructuralCheck `dataclass` ¶

StructuralCheck(name: str, items_checked: int, items_satisfying: int, anomalies: tuple[StructuralAnomaly, ...] = ())

Result of one structural property check against an Evaluation.

name `instance-attribute` ¶

name: str

Short identifier, e.g. "containment_closure".

rate `property` ¶

rate: float | None

Proportion of checked items satisfying the property; None when no items checked.
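
A minimal sketch of the rate computation as described, using a hypothetical stand-in class. How `all_satisfied` treats a check whose rate is None is not stated on this page, so the sketch does not model it.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class CheckSketch:
    """Hypothetical stand-in carrying only the two counts rate depends on."""

    name: str
    items_checked: int
    items_satisfying: int

    @property
    def rate(self) -> float | None:
        if self.items_checked == 0:
            return None  # no items checked, so the proportion is undefined
        return self.items_satisfying / self.items_checked


partial = CheckSketch("rsr_role_consistency", items_checked=8, items_satisfying=6)
empty = CheckSketch("base_case_stability", items_checked=0, items_satisfying=0)
```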

infereval.structure.StructuralAnomaly `dataclass` ¶

StructuralAnomaly(item_id: str, expected: str, actual: str, explanation: str)

One item that failed a structural check, with diagnostic context.

expected `instance-attribute` ¶

expected: str

Human-readable description of what the structural rule predicted.

actual `instance-attribute` ¶

actual: str

What the model's verdict actually was.

explanation `instance-attribute` ¶

explanation: str

Why this is flagged as an anomaly.

Factor-effects model (R7 / R12)¶

infereval.modeling.fit_factor_model ¶

fit_factor_model(evaluation: Evaluation, benchmark: Benchmark, *, reference: str = _DEFAULT_REFERENCE) -> ModelFit

Logistic regression of agreement on declared factor levels.

Parameters¶

evaluation
    The :class:~infereval.evaluation.Evaluation to model. Each item's per-sample verdicts are unrolled into separate observations; samples with ABSTAIN verdicts are dropped.

benchmark
    The source :class:~infereval.benchmark.Benchmark. Must declare at least one factor in benchmark.factors (per Phase 1.1).

reference
    Which analyst column defines "agreement". "consensus" (default) uses the per-item majority of the analyst panel (abstain on tie). An "analyst:<id>" string picks a single analyst column.

Returns¶

ModelFit

Raises¶

ModelingError
    If the benchmark declares no factors, if no sample observations remain after dropping abstains, or if the design matrix is rank-deficient (e.g. every item in the same cell).

Source code in src/infereval/modeling.py

def fit_factor_model(
    evaluation: Evaluation,
    benchmark: Benchmark,
    *,
    reference: str = _DEFAULT_REFERENCE,
) -> ModelFit:
    """Logistic regression of agreement on declared factor levels.

    Parameters
    ----------
    evaluation
        The :class:`~infereval.evaluation.Evaluation` to model. Each
        item's per-sample verdicts are unrolled into separate
        observations; samples with ABSTAIN verdicts are dropped.
    benchmark
        The source :class:`~infereval.benchmark.Benchmark`. Must declare
        at least one factor in ``benchmark.factors`` (per Phase 1.1).
    reference
        Which analyst column defines "agreement". ``"consensus"``
        (default) uses the per-item majority of the analyst panel
        (abstain on tie). An ``"analyst:<id>"`` string picks a single
        analyst column.

    Returns
    -------
    ModelFit

    Raises
    ------
    ModelingError
        If the benchmark declares no factors, if no sample observations
        remain after dropping abstains, or if the design matrix is
        rank-deficient (e.g. every item in the same cell).
    """
    if not benchmark.factors:
        raise ModelingError(
            "Benchmark declares no factors. infereval model needs at least "
            "one factor in `benchmark.factors` (Phase 1.1) to fit against. "
            "Re-author the benchmark with factor declarations, or use "
            "`infereval metrics --by-tag` for a tag-based decomposition."
        )

    # Late import so the rest of the package works without pandas or statsmodels.
    try:
        import pandas as pd  # type: ignore[import-untyped]
        import statsmodels.api as sm  # type: ignore[import-untyped]
    except ImportError as exc:
        raise ModelingError(
            "infereval.modeling requires the [stats] extra: "
            "pip install 'infereval[stats]'"
        ) from exc

    # 1. Build the long-format observation table.
    rows = _build_observation_rows(evaluation, benchmark, reference=reference)
    n_dropped = sum(1 for r in rows if r["agrees"] is None)
    rows = [r for r in rows if r["agrees"] is not None]
    if not rows:
        raise ModelingError(
            "No substantive observations after dropping abstain samples; "
            "cannot fit a model on an all-abstain dataset."
        )

    df = pd.DataFrame(rows)

    # 2. One-hot encode each factor; the first level listed in
    # benchmark.factors[f] is dropped as the baseline (pandas drop_first=True).
    factor_names = sorted(benchmark.factors)
    # Ensure every level is observed as a category so dummies are stable
    # even when the dataset doesn't contain every level.
    for f in factor_names:
        df[f] = pd.Categorical(df[f], categories=benchmark.factors[f], ordered=False)

    design_parts = [pd.Series(1.0, index=df.index, name="Intercept")]
    factor_to_cols: dict[str, list[str]] = {}
    for f in factor_names:
        dummies = pd.get_dummies(df[f], prefix=f, drop_first=True, dtype=float)
        design_parts.append(dummies)
        factor_to_cols[f] = list(dummies.columns)
    design_X = pd.concat(design_parts, axis=1)  # noqa: N806 -- statistical convention
    y = df["agrees"].astype(int).to_numpy()

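    # Cheap guard: this compares row and column counts only; it does not
    # compute the numerical rank of the design matrix.
    if design_X.shape[0] <= design_X.shape[1]: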
        raise ModelingError(
            f"Design matrix is rank-deficient: {design_X.shape[0]} observations vs. "
            f"{design_X.shape[1]} parameters. Add items or declare fewer levels."
        )

    # 3. Fit logistic regression with item-clustered SEs.
    model = sm.Logit(y, design_X)
    fit = model.fit(
        method="bfgs",
        disp=False,
        cov_type="cluster",
        cov_kwds={"groups": df["item_id"].to_numpy()},
        maxiter=200,
    )

    # 4. Fit the intercept-only null model for the null deviance and pseudo-R².
    null_model = sm.Logit(y, design_X[["Intercept"]])
    null_fit = null_model.fit(method="bfgs", disp=False, maxiter=200)

    deviance = float(-2 * fit.llf)
    null_deviance = float(-2 * null_fit.llf)
    if null_fit.llf == 0:
        pseudo_r2: float | None = None
    else:
        pseudo_r2 = float(1 - fit.llf / null_fit.llf)

    # 5. Per-factor joint Wald tests via the wald_test API.
    factor_wald: dict[str, float] = {}
    for f in factor_names:
        cols = factor_to_cols[f]
        if not cols:
            continue
        constraint = " = 0, ".join(f"{c}" for c in cols) + " = 0"
        try:
            wald = fit.wald_test(constraint, scalar=True)
            p = float(wald.pvalue)
        except Exception:  # noqa: BLE001 — Wald can fail when factor is collinear
            p = float("nan")
        factor_wald[f] = p

    # 6. Per-level effects table.
    params = fit.params
    bse = fit.bse
    pvalues = fit.pvalues
    conf = fit.conf_int()
    effects: list[FactorEffect] = []
    for f in factor_names:
        for col in factor_to_cols[f]:
            level = col[len(f) + 1 :]  # strip "factor_" prefix
            ci_low, ci_high = conf.loc[col]
            effects.append(
                FactorEffect(
                    factor=f,
                    level=level,
                    coef=float(params[col]),
                    std_err=float(bse[col]),
                    z_value=float(params[col] / bse[col]) if bse[col] else float("nan"),
                    p_value=float(pvalues[col]),
                    conf_int_low=float(ci_low),
                    conf_int_high=float(ci_high),
                )
            )

    notes = (
        "Fixed-effects logistic regression with item-clustered standard errors.",
        "Approximates the per-item random-effect structure of a proper GLMM.",
        f"Reference for 'agreement': {reference!r}.",
    )

    return ModelFit(
        n_observations=int(design_X.shape[0]),
        n_items=int(df["item_id"].nunique()),
        n_factors=len(factor_names),
        n_dropped_abstain=n_dropped,
        deviance=deviance,
        null_deviance=null_deviance,
        pseudo_r2=pseudo_r2,
        effects=tuple(effects),
        factor_wald=factor_wald,
        notes=notes,
    )
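
The dummy coding in step 2 can be illustrated without pandas. The sketch below reproduces the column layout that `get_dummies(..., prefix=f, drop_first=True)` yields for a categorical with a fixed category list: one 0/1 column per level except the first, which becomes the baseline absorbed by the intercept. The factor name `polarity` and its levels are hypothetical.

```python
def dummy_code(
    factor: str, levels: list[str], observed: list[str]
) -> tuple[list[str], list[list[float]]]:
    """Treatment-code one factor, omitting the first level as the baseline."""
    kept = levels[1:]
    columns = [f"{factor}_{lv}" for lv in kept]
    rows = [[1.0 if value == lv else 0.0 for lv in kept] for value in observed]
    return columns, rows


cols, rows = dummy_code(
    "polarity",
    ["mixed", "neg", "pos"],   # declared levels; "mixed" is the baseline
    ["pos", "mixed", "neg"],   # one observed level per row
)
```

A baseline row is all zeros, so each fitted coefficient reads as a contrast against that level. A declared level that never occurs in the data still gets a column, but that column is all zeros.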

infereval.modeling.ModelFit `dataclass` ¶

ModelFit(n_observations: int, n_items: int, n_factors: int, n_dropped_abstain: int, deviance: float, null_deviance: float, pseudo_r2: float | None, effects: tuple[FactorEffect, ...], factor_wald: dict[str, float], notes: tuple[str, ...])

Result of fitting the factor-effects logistic regression.

n_observations `instance-attribute` ¶

n_observations: int

Number of (item, sample) rows used in the fit.

n_items `instance-attribute` ¶

n_items: int

Number of distinct items contributing observations (= number of clusters).

n_dropped_abstain `instance-attribute` ¶

n_dropped_abstain: int

Sample observations excluded because the verdict was ABSTAIN.

deviance `instance-attribute` ¶

deviance: float

-2 × log-likelihood of the fitted model.

null_deviance `instance-attribute` ¶

null_deviance: float

-2 × log-likelihood of the intercept-only model.

pseudo_r2 `instance-attribute` ¶

pseudo_r2: float | None

McFadden's pseudo-R² = 1 - log-lik(full) / log-lik(null).
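
As a worked illustration of the three summary quantities (deviance, null deviance, McFadden's pseudo-R²), the sketch below computes them from Bernoulli log-likelihoods. The outcomes and "fitted" probabilities are hypothetical numbers chosen for the example, not the output of any real fit.

```python
import math


def bernoulli_loglik(y: list[int], p: list[float]) -> float:
    """Log-likelihood of 0/1 outcomes y under predicted probabilities p."""
    return sum(
        math.log(pi) if yi == 1 else math.log(1.0 - pi) for yi, pi in zip(y, p)
    )


y = [1, 1, 1, 0, 0, 1, 0, 1]
# An intercept-only logistic fit predicts the base rate for every row.
p_null = [sum(y) / len(y)] * len(y)
# Hypothetical fitted probabilities from a model with factors.
p_full = [0.9, 0.8, 0.7, 0.2, 0.3, 0.8, 0.4, 0.6]

llf_null = bernoulli_loglik(y, p_null)
llf_full = bernoulli_loglik(y, p_full)

deviance = -2 * llf_full
null_deviance = -2 * llf_null
# Same guard as fit_factor_model: undefined when the null log-likelihood is 0.
pseudo_r2 = None if llf_null == 0 else 1 - llf_full / llf_null
```

Since both deviances are -2 times a log-likelihood, the same value can be read as 1 - deviance / null_deviance.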

effects `instance-attribute` ¶

effects: tuple[FactorEffect, ...]

One row per non-baseline level of each declared factor.

factor_wald `instance-attribute` ¶

factor_wald: dict[str, float]

Per-factor joint Wald p-value testing 'this factor has no effect'.

notes `instance-attribute` ¶

notes: tuple[str, ...]

Methodology notes / caveats surfaced for the CLI report.

infereval.modeling.FactorEffect `dataclass` ¶

FactorEffect(factor: str, level: str, coef: float, std_err: float, z_value: float, p_value: float, conf_int_low: float, conf_int_high: float)

One row of the fitted coefficient table.

Coefficients are log-odds relative to the baseline level of the same factor, i.e. the first level listed for that factor in benchmark.factors (the level that drop_first=True omits in fit_factor_model). Positive coef → higher odds of agreement than the baseline level.
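
Since the coefficients are log-odds, exponentiating `coef` and the interval bounds gives an odds ratio against the baseline level. A small sketch with hypothetical numbers; the helper `odds_ratio` is not part of the package.

```python
import math


def odds_ratio(coef: float, ci_low: float, ci_high: float) -> tuple[float, float, float]:
    """Convert a log-odds coefficient and its interval to the odds-ratio scale."""
    return math.exp(coef), math.exp(ci_low), math.exp(ci_high)


# Hypothetical row: a zero coefficient with a symmetric log-odds interval.
ratio, ratio_low, ratio_high = odds_ratio(0.0, -0.5, 0.5)
```

An odds ratio of 1 means no difference from the baseline, and an interval that spans 1 corresponds to a log-odds interval that spans 0.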

Sensitivity sweeps (R11)¶

infereval.sweep.run_sweep ¶

run_sweep(benchmark: Benchmark, provider: Provider, *, parameter: str, values: list[object], out_dir: Path, config: EndorsementConfig | None = None, params: ProviderParams | None = None, run_id_prefix: str | None = None) -> SweepResult

Run :func:evaluate once per value and bundle the metrics.

Per-value outputs land in out_dir with deterministic names so a re-run replaces them in place.
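
The deterministic names can be predicted without running a sweep. The sketch below copies the naming scheme from the source listing that follows; the helper name is hypothetical, and `"temperature"` is an assumed parameter name, since the supported set is not listed on this page.

```python
from __future__ import annotations

from pathlib import Path


def sweep_output_paths(out_dir: Path, parameter: str, value: object) -> tuple[Path, Path]:
    """Mirror the file naming used by run_sweep (hypothetical helper)."""
    # Same sanitisation as the source: only slashes and spaces are replaced.
    value_str = str(value).replace("/", "-").replace(" ", "_")
    eta_path = out_dir / f"sweep-{parameter}={value_str}-eta.json"
    log_path = out_dir / f"sweep-{parameter}={value_str}-run.jsonl"
    return eta_path, log_path


# "temperature" is an assumed parameter name used only for illustration.
eta_path, log_path = sweep_output_paths(Path("out"), "temperature", 0.7)
print(eta_path.name)
print(log_path.name)
```

Because the name depends only on the parameter and the value, two values that sanitise to the same string would share one output file.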

Source code in src/infereval/sweep.py

def run_sweep(
    benchmark: Benchmark,
    provider: Provider,
    *,
    parameter: str,
    values: list[object],
    out_dir: Path,
    config: EndorsementConfig | None = None,
    params: ProviderParams | None = None,
    run_id_prefix: str | None = None,
) -> SweepResult:
    """Run :func:`evaluate` once per value and bundle the metrics.

    Per-value outputs land in ``out_dir`` with deterministic names so a
    re-run replaces them in place.
    """
    if parameter not in _SUPPORTED_PARAMS:
        raise SweepError(f"unsupported sweep parameter: {parameter!r}")
    if not values:
        raise SweepError("--values must contain at least one value")

    out_dir.mkdir(parents=True, exist_ok=True)
    base_config = config or EndorsementConfig()
    base_params = params or ProviderParams()

    rows: list[SweepRow] = []
    for value in values:
        cfg, par, variant = _apply_value(parameter, value, base_config, base_params)

        # Render value into a filename-safe form.
        value_str = str(value).replace("/", "-").replace(" ", "_")
        eta_path = out_dir / f"sweep-{parameter}={value_str}-eta.json"
        log_path = out_dir / f"sweep-{parameter}={value_str}-run.jsonl"

        rid = (
            f"{run_id_prefix}-{parameter}={value_str}"
            if run_id_prefix
            else f"sweep-{parameter}={value_str}"
        )

        eta = evaluate(
            benchmark,
            provider,
            config=cfg,
            params=par,
            variant=variant,
            run_id=rid,
            log_path=log_path,
        )
        eta.dump(eta_path)

        ref = consensus_reference(eta)
        kc = cohens_kappa(eta, ref)
        kf = fleiss_kappa(eta)
        cov_val = coverage(eta)
        n_agreement = sum(
            1
            for i, it in enumerate(eta.items)
            if it.model_verdict == ref(i)
        )

        rows.append(
            SweepRow(
                value=value,
                coverage=cov_val,
                kappa_c=kc,
                kappa_f=kf,
                n_agreement=n_agreement,
                n_total=eta.n,
                eta_path=eta_path,
            )
        )

    return SweepResult(parameter=parameter, rows=tuple(rows))

infereval.sweep.SweepResult `dataclass` ¶

SweepResult(parameter: str, rows: tuple[SweepRow, ...])

Bundle of per-value rows + an overall stability assessment.

parameter `instance-attribute` ¶

parameter: str

Name of the swept parameter.

kappa_c_range `property` ¶

kappa_c_range: float | None

Max-minus-min of κ_C across rows; None if any κ_C is None.
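
The None-propagating range described above can be sketched as follows. This is an illustration rather than the property's actual body; the function name is hypothetical, and returning None for an empty sequence is an assumption.

```python
from __future__ import annotations

from collections.abc import Sequence


def kappa_range(kappas: Sequence[float | None]) -> float | None:
    """Max-minus-min of the values; None if any value is None."""
    if not kappas or any(k is None for k in kappas):
        # Returning None for empty input is an assumption of this sketch.
        return None
    values = [k for k in kappas if k is not None]
    return max(values) - min(values)


print(kappa_range([0.5, 0.8, 0.6]))
print(kappa_range([0.5, None]))
```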

stability_verdict `property` ¶

stability_verdict: str

Human-readable single-sentence assessment of κ_C variation.

infereval.sweep.SweepRow `dataclass` ¶

SweepRow(value: object, coverage: float, kappa_c: float | None, kappa_f: float | None, n_agreement: int, n_total: int, eta_path: Path)

One row of the sweep summary: parameters + metrics for one value.

value `instance-attribute` ¶

value: object

The swept parameter's value for this row, type-coerced per the parameter.

n_agreement `instance-attribute` ¶

n_agreement: int

Count of items where model_verdict == consensus_reference.

eta_path `instance-attribute` ¶

eta_path: Path

On-disk location of the per-value evaluation JSON.

Construct-validity report (R16–R21)¶

infereval.report.ConstructValidityClaims ¶

Bases: BaseModel

Top-level container for the analyst's construct-validity declarations.

reliability `class-attribute` `instance-attribute` ¶

reliability: ReliabilityClaim | None = None

R22, second leg: declared individuation criterion for the reliability claim. Optional at the top level so that pre-0.6.1 claims files still validate, but required at scope ≥ domain_D_as_sampled for R22 satisfaction. The verdict gate in :func:compute_verdict caps the verdict at partially_defensible when the claim is missing (or its identity-criterion rationale is empty) AND competing_explanations.test_retest_run is True, mirroring the R19 carving-acknowledgement gate.

stub `classmethod` ¶

stub() -> ConstructValidityClaims

Return an obviously-placeholder stub for --init-claims.

Source code in src/infereval/report.py

@classmethod
def stub(cls) -> ConstructValidityClaims:
    """Return an obviously-placeholder stub for ``--init-claims``."""
    return cls(
        mastery_sense=MasterySenseClaim(
            sense="evaluative",
            description="FILL IN: the analyst's articulation of what mastery means here.",
        ),
        scope=ScopeClaim(
            scope="items_in_benchmark",
            justification="FILL IN: why this scope is appropriate.",
        ),
        constitution=ConstitutionClaim(
            position="evidence_of_mastery",
            justification="FILL IN: brief explanation of the position taken.",
        ),
        carving=CarvingClaim(
            acknowledges_carving_indexed=False,
            notes="FILL IN if acknowledges_carving_indexed=true.",
        ),
        competing_explanations=CompetingExplanationChecks(),
        reliability=ReliabilityClaim(
            identity_criterion=IdentityCriterion(
                # Framework-substantiated booleans default to True.
                # The analyst can deny them (set to False) only by
                # also deciding not to do a retest; otherwise
                # `infereval retest` would reject the run pair.
                same_benchmark_hash=True,
                same_endorsement_config=True,
                same_paraphrase_variant=True,
                # Analyst-substantiated booleans — these are real
                # commitments the analyst has to think about. The
                # stub leaves them as False to force the analyst
                # to consciously assert each one.
                same_provider_model_id=False,
                cross_update_identity_asserted=False,
                same_scaffolding=False,
                unverifiable_caveats=(
                    "FILL IN: what individuation commitments are being made "
                    "without framework-mechanical verification (e.g. "
                    "provider snapshot stability, scaffolding constancy)."
                ),
                rationale=(
                    "FILL IN: why these individuation choices are right for "
                    "this evaluation. Required at scope >= "
                    "domain_D_as_sampled for R22 satisfaction."
                ),
            )
        ),
    )

infereval.report.MasterySenseClaim ¶

Bases: BaseModel

R16: which sense of mastery the claim is about.

sense `instance-attribute` ¶

sense: Literal['evaluative', 'generative', 'standing', 'combination']

evaluative: endorsements-when-asked (the methodology's direct measurement).
generative: inferential behavior in unprompted production.
standing: a dispositional competence underlying both.
combination: a mix; describe explicitly in description.

description `instance-attribute` ¶

description: str

One to three sentences, the analyst's own articulation.

infereval.report.ScopeClaim ¶

Bases: BaseModel

R17: scope the mastery claim applies over.

scope `instance-attribute` ¶

scope: Literal['items_in_benchmark', 'domain_D_as_sampled', 'general_capacity']

items_in_benchmark: the claim is about the specific items in β.
domain_D_as_sampled: the claim generalises to D as sampled by β.
general_capacity: the claim is about inferential mastery as a general capacity.

justification `instance-attribute` ¶

justification: str

Why this scope is appropriate given β and the methodology used.

infereval.report.ConstitutionClaim ¶

Bases: BaseModel

R18: is agreement evidence of mastery or constitutive of it?

position `instance-attribute` ¶

position: Literal['evidence_of_mastery', 'constitutive_of_mastery']

evidence_of_mastery: agreement is evidence for a deeper underlying property.
constitutive_of_mastery: agreement (with structural coherence) IS mastery (Brandom's structural-behavioural characterisation).

justification `instance-attribute` ¶

justification: str

Brief explanation of the position taken and why.

infereval.report.CarvingClaim ¶

Bases: BaseModel

R19: carving-indexed framing of in-principle claims.

acknowledges_carving_indexed `instance-attribute` ¶

acknowledges_carving_indexed: bool

True iff any in-principle claims are framed in the carving-indexed form Remark 10 specifies.

notes `class-attribute` `instance-attribute` ¶

notes: str = ''

Required when acknowledges_carving_indexed is True; document the carving used or give pointers to the discussion.

infereval.report.CompetingExplanationChecks ¶

Bases: BaseModel

R4, R8, R9, R11, R13, R14, R15, R22: which checks were actually run.

All fields default to False (the conservative posture — the framework assumes no check was done unless the analyst explicitly declares it). The report's Unaddressed competing explanations section lists every False.
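
Listing every flag that is still False can be sketched as below. `Checks` is a hypothetical dataclass stand-in for the pydantic model, carrying only the two flags named on this page (`structural_check_run` and `test_retest_run`); the real class has more fields.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass
class Checks:
    """Stand-in for CompetingExplanationChecks with two of its flags."""

    structural_check_run: bool = False
    test_retest_run: bool = False


def unaddressed(checks: Checks) -> list[str]:
    """Names of every flag still False, in declaration order."""
    return [f.name for f in fields(checks) if not getattr(checks, f.name)]


print(unaddressed(Checks(structural_check_run=True)))  # ['test_retest_run']
```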

test_retest_run `class-attribute` `instance-attribute` ¶

test_retest_run: bool = False

R22: test-retest reliability check has been run (two independent evaluations against the same benchmark have been compared via infereval retest). Required at scope ≥ domain_D_as_sampled; informational at narrower scope. Per the methodology, an evaluation that doesn't replicate is not evidence of anything — within-run agreement statistics presuppose across-run reliability.

infereval.report.ReportVerdict `dataclass` ¶

ReportVerdict(label: Literal['defensible', 'partially_defensible', 'not_defensible'], one_liner: str, rationale: list[str])

Deterministic summary verdict computed from the claims + evidence.

infereval.report.compute_verdict ¶

compute_verdict(claims: ConstructValidityClaims, *, structure_report: dict[str, object] | None = None, benchmark: Benchmark | None = None, retest_result: dict[str, object] | None = None) -> ReportVerdict

Return the deterministic summary verdict for the claims + evidence.

The verdict is computed against the claims file together with the supplied analytical artifacts. When no artifacts are passed (structure_report=None, benchmark=None, retest_result=None), the verdict is computed from claims alone and a "verdict computed unaudited" rationale line is added so the reader can tell.

The deterministic rule:

"defensible" iff every check required by the declared scope is marked True AND no audited check returned a failing artifact AND the carving claim is explicit (acknowledges = True iff any in-principle claims are being made) AND the benchmark supports an inter-analyst baseline when one is required by the scope.
"not_defensible" iff no audit cap fired AND either more than half of the required checks are missing or the carving claim fails (scope beyond items_in_benchmark with carving unacknowledged or undocumented).
"partially_defensible" otherwise, including the "ran but didn't pass" cases (structural anomalies present, single-analyst benchmark with items_in_benchmark scope). Whenever an audit cap fires the label is partially_defensible, even if most checks are missing.

Audit caps (added in v0.5.3 from external review):

If structure_report is supplied AND structural_check_run is marked True AND the report contains any anomaly, the structural check is treated as failing — the verdict is capped at partially_defensible with a rationale line naming the count.
If benchmark is supplied AND the scope is items_in_benchmark AND len(benchmark.analysts) < 2, the verdict is capped at partially_defensible with a rationale line surfacing the panel size — agreement with a single analyst cannot inherit the convergent-validity guarantee that multi-analyst agreement carries.

Two further caps in the source below follow the same pattern:

If retest_result is supplied AND test_retest_run is marked True AND the retest is substantively unstable or its κ is undefined, the verdict is capped at partially_defensible (R22). For a multi-interval artifact the worst pair decides.
If the scope is beyond items_in_benchmark AND test_retest_run is marked True AND reliability is missing or its identity_criterion.rationale is empty, the verdict is capped at partially_defensible (R22, second leg).

Backwards-compatible callers that don't pass the artifacts get behaviour identical to v0.5.2 except for the additional "verdict computed unaudited" rationale line.
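
The final decision step can be condensed into a small pure function. This sketch mirrors the closing branches of the source listing below; the function name and its four parameters are hypothetical simplifications (the real function derives them from the claims and artifacts).

```python
def decide(n_missing: int, n_required: int, carving_ok: bool, audit_passes: bool) -> str:
    """Condensed version of the last branches of compute_verdict (sketch)."""
    if n_missing == 0 and carving_ok and audit_passes:
        return "defensible"
    # not_defensible is only reachable when no audit cap fired.
    if (n_missing > n_required / 2 or not carving_ok) and audit_passes:
        return "not_defensible"
    return "partially_defensible"


cases = [(0, 4, True, True), (3, 4, True, True), (3, 4, True, False), (1, 4, True, True)]
for args in cases:
    print(args, "->", decide(*args))
```

Note that exactly half of the checks missing (for example 2 of 4) does not meet the "more than half" threshold.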

Source code in src/infereval/report.py

def compute_verdict(
    claims: ConstructValidityClaims,
    *,
    structure_report: dict[str, object] | None = None,
    benchmark: Benchmark | None = None,
    retest_result: dict[str, object] | None = None,
) -> ReportVerdict:
    """Return the deterministic summary verdict for the claims + evidence.

    The verdict is computed against the *claims* file together with the
    supplied analytical artifacts. When no artifacts are passed
    (``structure_report=None``, ``benchmark=None``,
    ``retest_result=None``), the verdict is computed from claims alone
    and a "verdict computed unaudited" rationale line is added so the
    reader can tell.

    The deterministic rule:

    - "defensible" iff every check required by the declared scope is
      marked True AND no audited check returned a failing artifact AND
      the carving claim is explicit (acknowledges = True iff any
      in-principle claims are being made) AND the benchmark supports
      an inter-analyst baseline when one is required by the scope.
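    - "not_defensible" iff no audit cap fired AND either *more than
      half* of the required checks are missing or the carving claim
      fails (scope beyond ``items_in_benchmark`` with carving
      unacknowledged or undocumented).
    - "partially_defensible" otherwise, including the "ran but didn't
      pass" cases (structural anomalies present, single-analyst benchmark
      with ``items_in_benchmark`` scope). Whenever an audit cap fires the
      label is ``partially_defensible``, even if most checks are missing.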

    Audit caps (added in v0.5.3 from external review):

    - If ``structure_report`` is supplied AND ``structural_check_run``
      is marked True AND the report contains any anomaly, the structural
      check is treated as failing — the verdict is capped at
      ``partially_defensible`` with a rationale line naming the count.
    - If ``benchmark`` is supplied AND the scope is
      ``items_in_benchmark`` AND ``len(benchmark.analysts) < 2``, the
      verdict is capped at ``partially_defensible`` with a rationale
      line surfacing the panel size — agreement with a single analyst
      cannot inherit the convergent-validity guarantee that
      multi-analyst agreement carries.

    Backwards-compatible callers that don't pass the artifacts get
    behaviour identical to v0.5.2 except for the additional "verdict
    computed unaudited" rationale line.
    """
    required = _REQUIRED_CHECKS_BY_SCOPE[claims.scope.scope]
    ce = claims.competing_explanations
    present = {name for name in required if getattr(ce, name)}
    missing = required - present

    rationale = []
    if not missing:
        rationale.append(
            f"All {len(required)} competing-explanation checks required for "
            f"scope={claims.scope.scope!r} are marked as run."
        )
    else:
        rationale.append(
            f"{len(missing)} of {len(required)} required checks NOT run: "
            f"{sorted(missing)}."
        )

    # Carving check applies only when scope reaches beyond items_in_benchmark.
    carving_ok = True
    if claims.scope.scope != "items_in_benchmark":
        if not claims.carving.acknowledges_carving_indexed:
            carving_ok = False
            rationale.append(
                f"Scope={claims.scope.scope!r} reaches beyond the items "
                "themselves, but carving-indexed framing is NOT acknowledged "
                "(R19 unaddressed)."
            )
        elif not claims.carving.notes.strip():
            carving_ok = False
            rationale.append(
                "Carving acknowledged but no notes supplied; R19 requires "
                "the carving to be documented."
            )

    # Audit caps (v0.5.3): downgrade when the analyst declared a check
    # was run but the corresponding artifact tells a different story.
    structural_failed = False
    if (
        structure_report is not None
        and getattr(ce, "structural_check_run", False)
    ):
        checks_obj = structure_report.get("checks") or []
        checks_iter = checks_obj if isinstance(checks_obj, list) else []
        total_anomalies = 0
        for check in checks_iter:
            if not isinstance(check, dict):
                continue
            anomalies = check.get("anomalies", ())
            if isinstance(anomalies, (list, tuple)):
                total_anomalies += len(anomalies)
        if total_anomalies > 0:
            structural_failed = True
            rationale.append(
                f"`structural_check_run` is marked True, but the supplied "
                f"structure report contains {total_anomalies} anomal"
                f"{'y' if total_anomalies == 1 else 'ies'} — "
                "the check ran but did not pass. Verdict capped at "
                "partially_defensible."
            )

    panel_too_small = False
    panel_size: int | None = None
    if benchmark is not None and claims.scope.scope == "items_in_benchmark":
        panel_size = len(benchmark.analysts)
        if panel_size < 2:
            panel_too_small = True
            rationale.append(
                f"Benchmark has m={panel_size} analyst(s); κ_F\\*(β) is "
                "undefined and there is no independent reference column. "
                "A green verdict at items_in_benchmark scope would certify "
                "agreement with a single labeler — capped at "
                "partially_defensible."
            )

    # R22 audit cap: if test_retest_run is asserted and the supplied
    # retest artifact shows substantively-unstable reliability (or κ is
    # undefined), cap the verdict at partially_defensible. Same shape
    # as the v0.5.3 structural-anomaly cap.
    #
    # v0.13.0: the cap now handles both single-interval (v0.11.0+
    # RetestResult) and multi-interval (v0.12.0+ MultiIntervalRetestResult)
    # artifacts. Multi-interval is reduced via worst-case-across-pairs:
    # if ANY captured interval is substantively unstable or has
    # undefined κ, the cap fires. Conservative reading — the mastery
    # claim has to hold at every time scale the analyst captured.
    retest_failed = False
    if (
        retest_result is not None
        and getattr(ce, "test_retest_run", False)
    ):
        if _retest_is_multi_interval(retest_result):
            worst = _retest_worst_pair(retest_result)
            if worst is not None:
                worst_retest = (
                    worst.get("retest") if isinstance(worst, dict) else None
                )
                worst_verdict_str = (
                    str(worst_retest.get("stability_verdict", ""))
                    if isinstance(worst_retest, dict)
                    else ""
                )
                worst_kappa = (
                    worst_retest.get("test_retest_kappa")
                    if isinstance(worst_retest, dict)
                    else None
                )
                worst_interval = worst.get("interval_s", 0)
                retest_is_substantively_unstable = (
                    "substantively unstable" in worst_verdict_str.lower()
                )
                retest_undefined = worst_kappa is None
                if retest_is_substantively_unstable or retest_undefined:
                    retest_failed = True
                    n_pairs_raw = retest_result.get("pairs")
                    n_pairs = (
                        len(n_pairs_raw)
                        if isinstance(n_pairs_raw, list)
                        else 0
                    )
                    if retest_undefined:
                        rationale.append(
                            f"`test_retest_run` is marked True, but the "
                            f"supplied multi-interval retest result has "
                            f"undefined κ at interval {worst_interval}s "
                            f"(degenerate agreement structure on the "
                            f"comparison column) — the check ran across "
                            f"{n_pairs} interval"
                            f"{'s' if n_pairs != 1 else ''} but at least "
                            f"one did not produce a usable reliability "
                            f"estimate. Verdict capped at "
                            f"partially_defensible."
                        )
                    else:
                        flip_rate = (
                            worst_retest.get("flip_rate")
                            if isinstance(worst_retest, dict)
                            else None
                        )
                        flip_str = (
                            f", flip rate = {flip_rate * 100:.1f}%"
                            if isinstance(flip_rate, (int, float))
                            else ""
                        )
                        rationale.append(
                            f"`test_retest_run` is marked True, but the "
                            f"supplied multi-interval retest result has "
                            f"a substantively-unstable pair at interval "
                            f"{worst_interval}s "
                            f"(κ = {worst_kappa:+.3f}{flip_str}); the "
                            f"headline κ_C cannot be interpreted as "
                            f"signal under this reliability across the "
                            f"time scales captured. Verdict capped at "
                            f"partially_defensible."
                        )
        else:
            retest_verdict = str(retest_result.get("stability_verdict", ""))
            retest_kappa = retest_result.get("test_retest_kappa")
            retest_is_substantively_unstable = (
                "substantively unstable" in retest_verdict.lower()
            )
            retest_undefined = retest_kappa is None
            if retest_is_substantively_unstable or retest_undefined:
                retest_failed = True
                if retest_undefined:
                    rationale.append(
                        "`test_retest_run` is marked True, but the supplied "
                        "retest result has undefined κ (degenerate agreement "
                        "structure on the comparison column) — the check "
                        "ran but did not produce a usable reliability "
                        "estimate. Verdict capped at partially_defensible."
                    )
                else:
                    flip_rate = retest_result.get("flip_rate")
                    flip_str = (
                        f", flip rate = {flip_rate * 100:.1f}%"
                        if isinstance(flip_rate, (int, float))
                        else ""
                    )
                    rationale.append(
                        f"`test_retest_run` is marked True, but the supplied "
                        f"retest result is substantively unstable "
                        f"(κ = {retest_kappa:+.3f}{flip_str}) — the check ran "
                        f"but did not pass. The headline κ_C cannot be "
                        f"interpreted as signal under this reliability. "
                        f"Verdict capped at partially_defensible."
                    )

    # v0.6.1 R22 second leg: at scope >= domain_D_as_sampled, R22
    # satisfaction requires `test_retest_run=True` AND a declared
    # IdentityCriterion (`reliability.identity_criterion` populated
    # with a non-empty rationale). Without a declared criterion the κ
    # is uninterpretable — same shape as the R19 carving-acknowledgement
    # gate.
    individuation_undeclared = False
    if (
        claims.scope.scope != "items_in_benchmark"
        and getattr(ce, "test_retest_run", False)
    ):
        reliability = getattr(claims, "reliability", None)
        if reliability is None or not reliability.identity_criterion.rationale.strip():
            individuation_undeclared = True
            rationale.append(
                f"`test_retest_run` is marked True at "
                f"scope={claims.scope.scope!r}, but the identity "
                f"criterion under which the test-retest κ is "
                f"interpretable has not been declared (R22 second leg "
                f"— `reliability.identity_criterion` missing or rationale "
                f"empty). Without a declared criterion the κ is "
                f"uninterpretable as a reliability number. Verdict "
                f"capped at partially_defensible. Same shape as the R19 "
                f"carving-acknowledgement gate."
            )

    if structure_report is None and benchmark is None and retest_result is None:
        rationale.append(
            "Verdict computed unaudited: no structure_report, benchmark, "
            "or retest_result supplied to compute_verdict, so 'check run' "
            "is taken at face value and panel size / retest stability are "
            "not inspected. Render through `infereval report` (which "
            "passes all three) for the audited verdict."
        )

    # Decide.
    audit_passes = (
        not structural_failed
        and not panel_too_small
        and not retest_failed
        and not individuation_undeclared
    )
    if not missing and carving_ok and audit_passes:
        one_liner = f"Mastery claim defensible at scope={claims.scope.scope!r}."
        if panel_size is not None:
            one_liner = (
                f"Mastery claim defensible at scope={claims.scope.scope!r} "
                f"(m={panel_size} analysts)."
            )
        return ReportVerdict(
            label="defensible",
            one_liner=one_liner,
            rationale=rationale,
        )
    if (len(missing) > len(required) / 2 or not carving_ok) and audit_passes:
        return ReportVerdict(
            label="not_defensible",
            one_liner=(
                f"Mastery claim NOT defensible from the supplied evidence at "
                f"scope={claims.scope.scope!r}."
            ),
            rationale=rationale,
        )
    return ReportVerdict(
        label="partially_defensible",
        one_liner=(
            f"Mastery claim partially defensible at scope={claims.scope.scope!r} — "
            "see Unaddressed competing explanations."
        ),
        rationale=rationale,
    )
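
The two artifact checks behind the structural and single-interval retest caps read only a handful of dictionary keys. The sketch below restates them as standalone helpers; the helper names are hypothetical, the key names (`checks`, `anomalies`, `stability_verdict`, `test_retest_kappa`) are taken from the listing above, and the payload values are invented for illustration.

```python
from __future__ import annotations


def count_anomalies(structure_report: dict[str, object]) -> int:
    """Total anomalies across all checks, skipping malformed entries."""
    checks = structure_report.get("checks") or []
    total = 0
    for check in checks if isinstance(checks, list) else []:
        if isinstance(check, dict):
            anomalies = check.get("anomalies", ())
            if isinstance(anomalies, (list, tuple)):
                total += len(anomalies)
    return total


def single_interval_retest_fails(retest_result: dict[str, object]) -> bool:
    """True when the retest is substantively unstable or its kappa is undefined."""
    verdict = str(retest_result.get("stability_verdict", "")).lower()
    kappa_undefined = retest_result.get("test_retest_kappa") is None
    return "substantively unstable" in verdict or kappa_undefined


# Illustrative payloads: only the keys are taken from the source listing.
report = {"checks": [{"anomalies": ["a1", "a2"]}, {"anomalies": []}, "malformed"]}
retest = {"stability_verdict": "Substantively unstable", "test_retest_kappa": 0.12}
print(count_anomalies(report), single_interval_retest_fails(retest))
```

An artifact with no `test_retest_kappa` key counts as undefined, so an empty dictionary also trips the retest cap.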

infereval.report.render_markdown ¶

render_markdown(*, evaluation: Evaluation, benchmark: Benchmark, claims: ConstructValidityClaims, structure_report: dict[str, object] | None = None, sweep_summary: dict[str, object] | None = None, model_fit: dict[str, object] | None = None, retest_result: dict[str, object] | None = None, decomposition_cells: list[dict[str, object]] | None = None, generated_at: datetime | None = None, suppress_negatives: bool = False) -> str

Produce the construct-validity report as Markdown.

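Optional arguments (structure_report, sweep_summary, model_fit) populate the Evidence section; when absent, that section explicitly notes the missing evidence. retest_result feeds the Reliability (R22) block of the summary and, together with structure_report and benchmark, the audited verdict from compute_verdict. Setting suppress_negatives=True downgrades the summary verdict one tier and adds a warning banner below the generation timestamp.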
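
When suppress_negatives=True, the source listing below moves the verdict label down one tier, with not_defensible as the floor. A minimal sketch of that mapping; the function and constant names are hypothetical, and only the label strings come from the source.

```python
_DOWNGRADE = {
    "defensible": "partially_defensible",
    "partially_defensible": "not_defensible",
    "not_defensible": "not_defensible",
}


def downgrade_one_tier(label: str) -> str:
    """Return the label one tier below; not_defensible stays where it is."""
    return _DOWNGRADE[label]


for label in ("defensible", "partially_defensible", "not_defensible"):
    print(label, "->", downgrade_one_tier(label))
```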

Source code in src/infereval/report.py

def render_markdown(
    *,
    evaluation: Evaluation,
    benchmark: Benchmark,
    claims: ConstructValidityClaims,
    structure_report: dict[str, object] | None = None,
    sweep_summary: dict[str, object] | None = None,
    model_fit: dict[str, object] | None = None,
    retest_result: dict[str, object] | None = None,
    decomposition_cells: list[dict[str, object]] | None = None,
    generated_at: datetime | None = None,
    suppress_negatives: bool = False,
) -> str:
    """Produce the construct-validity report as Markdown.

    Optional arguments (``structure_report``, ``sweep_summary``,
    ``model_fit``) populate the Evidence section; when absent, that
    section explicitly notes the missing evidence.
    """
    from .metrics import (
        cohens_kappa,
        consensus_reference,
        coverage,
        fleiss_kappa,
        inter_analyst_fleiss,
        inter_analyst_fleiss_per_panel,
    )

    generated_at = generated_at or datetime.now(timezone.utc)

    kappa_c = cohens_kappa(evaluation, consensus_reference(evaluation))
    kappa_f = fleiss_kappa(evaluation)
    # v0.7.0 (#82): inter_analyst_fleiss returns the all-analyst κ_F*
    # by default. On panelled benchmarks the primary-panel value is
    # rendered as a sub-bullet below for methodological transparency.
    kappa_f_star = inter_analyst_fleiss(benchmark)
    panel_names = benchmark.panel_names() if benchmark is not None else []
    primary_panel_kappa: float | None = None
    primary_panel_name: str | None = None
    if panel_names:
        primary_panel_name = benchmark.resolved_primary_panel()
        if primary_panel_name is not None:
            per_panel = inter_analyst_fleiss_per_panel(benchmark)
            primary_panel_kappa = per_panel.get(primary_panel_name)
    cov = coverage(evaluation)
    verdict = compute_verdict(
        claims,
        structure_report=structure_report,
        benchmark=benchmark,
        retest_result=retest_result,
    )

    # Collect negative findings up-front so we can both render them and
    # apply the suppression penalty to the verdict in one place.
    findings = collect_negative_findings(
        structure_report=structure_report,
        sweep_summary=sweep_summary,
        model_fit=model_fit,
        retest_result=retest_result,
        factor_kinds=dict(benchmark.factor_kinds) if benchmark.factor_kinds else None,
        decomposition_cells=decomposition_cells,
    )
    any_phase2_supplied = any(
        x is not None for x in (
            structure_report, sweep_summary, model_fit, retest_result,
            decomposition_cells,
        )
    )

    # If suppression is enabled, the Summary verdict downgrades one tier:
    # defensible -> partially_defensible -> not_defensible. Hiding
    # evidence is itself a negative construct-validity signal.
    if suppress_negatives:
        downgraded_label = {
            "defensible": "partially_defensible",
            "partially_defensible": "not_defensible",
            "not_defensible": "not_defensible",
        }[verdict.label]
        if downgraded_label != verdict.label:
            verdict = ReportVerdict(
                label=downgraded_label,  # type: ignore[arg-type]
                one_liner=(
                    "Verdict downgraded one tier because "
                    "--suppress-negatives is enabled."
                ),
                rationale=[
                    *verdict.rationale,
                    "Negative-findings suppression downgrades the verdict "
                    "(Phase 3.2 / R21).",
                ],
            )

    lines: list[str] = []
    lines.append("# Construct-validity report")
    lines.append("")
    lines.append(f"_Generated: {generated_at.isoformat()}_")
    if suppress_negatives:
        lines.append("")
        lines.append(
            "> ⚠️ **Negative-findings suppression: ENABLED.** This is an "
            "explicit author choice via `--suppress-negatives`; the "
            "framework normally surfaces negative findings by default. "
            "Reviewers: ask why this flag was set."
        )
    lines.append("")

    # 1. Identity
    lines.append("## 1. Identity")
    lines.append("")
    lines.append(f"- **Evaluation**: `{evaluation.id}`")
    lines.append(f"- **Benchmark**: `{benchmark.id}`")
    lines.append(
        f"- **Model**: `{evaluation.model.provider}` / `{evaluation.model.model_id}`"
    )
    if evaluation.started_at:
        lines.append(f"- **Run started**: {evaluation.started_at.isoformat()}")
    lines.append(f"- **Items**: {evaluation.n}")
    lines.append(f"- **Analysts**: {benchmark.m}")
    lines.append("")

    # 2. Summary metrics
    #
    # v0.13.0 (#?): §2 is restructured into two sibling subheaded blocks
    # — `### Agreement` (cov/κ_C/κ_F/κ_F*) and `### Reliability (R22)`
    # (test-retest) — so test-retest reliability sits at the same visual
    # level as agreement. The `## 2.` anchor is preserved (no
    # renumbering cascade) but the visual hierarchy now reflects the
    # methodology paper's framing: agreement and reliability are
    # co-equal construct-validity dimensions, not a primary plus an
    # optional footnote.
    lines.append("## 2. Summary metrics")
    lines.append("")
    lines.append("### Agreement")
    lines.append("")
    lines.append(f"- **Coverage**: {cov:.4f}")
    lines.append(f"- **Cohen's κ_C (vs consensus)**: {_format_kappa(kappa_c)}")
    lines.append(f"- **Fleiss' κ_F**: {_format_kappa(kappa_f)}")
    # v0.7.0 (#82): on panelled benchmarks the headline κ_F* is the
    # all-analyst figure; the primary panel's value is rendered as a
    # sub-bullet so the methodological distinction (panels are an
    # additive convergent-validity device, not a replacement for the
    # baseline) is visible at the surface where the reader looks for
    # the Remark 4 number.
    if panel_names:
        lines.append(
            f"- **Inter-analyst κ_F\\* (all analysts)**: "
            f"{_format_kappa(kappa_f_star)}"
        )
        if primary_panel_name is not None:
            lines.append(
                f"  - *Primary panel (`{primary_panel_name}`) κ_F\\* = "
                f"{_format_kappa(primary_panel_kappa)}*"
            )
    else:
        lines.append(f"- **Inter-analyst κ_F\\***: {_format_kappa(kappa_f_star)}")
    lines.append("")
    lines.append("### Reliability (R22)")
    lines.append("")
    # Test-retest κ (R22): within-model analog of κ_F*. Always rendered
    # — informational at items_in_benchmark scope, verdict-gating at
    # scope ≥ domain_D_as_sampled. The renderer auto-detects single vs
    # multi-interval shape via the presence of a `pairs` key on the
    # supplied artifact, so a single CLI flag (--retest) consumes
    # either v0.11.0+ RetestResult or v0.12.0+ MultiIntervalRetestResult
    # JSON. v0.6.1: the κ row carries an explicit "under the declared
    # identity criterion ..." suffix when the criterion is present in
    # the supplied retest artifact, making explicit what the reliability
    # number is relative to (Hlobil's individuation point).
    _render_retest_section(lines, retest_result)
    lines.append("")

    # 3. Construct-validity claims (R16-R20)
    lines.append("## 3. Construct-validity claims (R16–R20)")
    lines.append("")
    lines.append(f"**Mastery sense (R16)**: {claims.mastery_sense.sense}")
    lines.append("")
    lines.append(f"> {claims.mastery_sense.description}")
    lines.append("")
    lines.append(f"**Scope (R17)**: {claims.scope.scope}")
    lines.append("")
    lines.append(f"> {claims.scope.justification}")
    lines.append("")
    lines.append(f"**Constitution vs. evidence (R18)**: {claims.constitution.position}")
    lines.append("")
    lines.append(f"> {claims.constitution.justification}")
    lines.append("")
    carving_status = (
        "acknowledged" if claims.carving.acknowledges_carving_indexed else "not acknowledged"
    )
    lines.append(f"**Carving-indexed framing (R19)**: {carving_status}")
    if claims.carving.notes.strip():
        lines.append("")
        lines.append(f"> {claims.carving.notes}")
    lines.append("")

    # v0.6.1 R22 second leg: render the declared individuation
    # criterion verbatim when claims include the reliability block.
    # Doubly-relative framing — the reliability claim is relative to
    # both the carving (R19, above) and the identity criterion (R22
    # second leg, here). Same commitment-and-relativity pattern.
    if claims.reliability is not None:
        crit = claims.reliability.identity_criterion
        lines.append(
            "**Reliability — identity criterion "
            "(R22, doubly-relative)**:"
        )
        lines.append("")
        lines.append(
            f"- Framework-substantiated: same_benchmark_hash="
            f"`{crit.same_benchmark_hash}`, same_endorsement_config="
            f"`{crit.same_endorsement_config}`, same_paraphrase_variant="
            f"`{crit.same_paraphrase_variant}`."
        )
        lines.append(
            f"- Analyst-substantiated: same_provider_model_id="
            f"`{crit.same_provider_model_id}`, "
            f"cross_update_identity_asserted="
            f"`{crit.cross_update_identity_asserted}`, "
            f"same_scaffolding=`{crit.same_scaffolding}`."
        )
        if crit.unverifiable_caveats.strip():
            lines.append("")
            lines.append(f"> _Unverifiable caveats:_ {crit.unverifiable_caveats}")
        if crit.rationale.strip():
            lines.append("")
            lines.append(f"> _Rationale:_ {crit.rationale}")
        lines.append("")

    # 4. Evidence
    lines.append("## 4. Evidence")
    lines.append("")
    lines.append("Auto-collected from optional Phase 2 artifacts:")
    lines.append("")

    if structure_report is not None:
        total_anomalies = structure_report.get("total_anomalies", 0)
        lines.append(
            f"- **Structural coherence checks** (R13): "
            f"{total_anomalies} anomalies flagged across the bundled checks."
        )
    else:
        lines.append("- **Structural coherence checks** (R13): NOT SUPPLIED.")

    if sweep_summary is not None:
        kc_range = sweep_summary.get("kappa_c_range")
        param = sweep_summary.get("parameter", "?")
        verdict_str = sweep_summary.get("stability_verdict", "?")
        if kc_range is not None:
            lines.append(
                f"- **Sensitivity sweep** over `{param}` (R11): "
                f"κ_C range = {kc_range:.3f}. {verdict_str}"
            )
        else:
            lines.append(
                f"- **Sensitivity sweep** over `{param}` (R11): {verdict_str}"
            )
    else:
        lines.append("- **Sensitivity sweep** (R11): NOT SUPPLIED.")

    if model_fit is not None:
        wald_raw = model_fit.get("factor_wald", {})
        wald = wald_raw if isinstance(wald_raw, dict) else {}
        sig = sum(1 for p in wald.values() if isinstance(p, (int, float)) and p < 0.05)
        lines.append(
            f"- **Factor-effects model fit** (R7, R12): "
            f"{sig}/{len(wald)} factors significant at α=0.05."
        )
    else:
        lines.append("- **Factor-effects model fit** (R7, R12): NOT SUPPLIED.")

    if retest_result is not None:
        retest_verdict_str = retest_result.get("stability_verdict", "?")
        n_flipped_raw = retest_result.get("flipped_items", [])
        n_flipped = len(n_flipped_raw) if isinstance(n_flipped_raw, list) else 0
        lines.append(
            f"- **Test-retest reliability** (R22): {retest_verdict_str} "
            f"({n_flipped} item(s) flipped between runs)."
        )
    else:
        lines.append("- **Test-retest reliability** (R22): NOT SUPPLIED.")
    lines.append("")

    # 4b. Negative findings (Phase 3.2, R21)
    lines.append("## 4b. Negative findings")
    lines.append("")
    if suppress_negatives:
        lines.append(
            "⚠️ **Suppressed via `--suppress-negatives`.** This is an "
            "explicit author choice; the framework normally surfaces "
            "negative findings by default. Reviewers: ask why this flag "
            "was set."
        )
    elif not any_phase2_supplied:
        lines.append(
            "No Phase 2 artifacts supplied; the auto-collection step had "
            "nothing to scan. See Unaddressed competing explanations (§5) "
            "for the analyst-declared check status."
        )
    elif not findings:
        lines.append("No negative findings detected in the supplied Phase 2 artifacts.")
    else:
        lines.append(
            "The framework auto-collects negative findings from the "
            "supplied Phase 2 artifacts. Each item below represents a "
            "check that ran but returned a finding that *weakens or "
            "complicates* the mastery claim."
        )
        lines.append("")
        # Group by source for readability.
        for src_label, src_key in [
            ("Structural anomalies", "structure"),
            ("Sweep instability", "sweep"),
            ("Factor-effects null findings", "model_fit"),
            ("Test-retest anomalies (R22)", "retest"),
            ("Decomposition under-powered (R12)", "decomposition_under_powered"),
        ]:
            src_items = [f for f in findings if f.source == src_key]
            if not src_items:
                continue
            lines.append(f"### {src_label} ({len(src_items)} flagged)")
            for f in src_items:
                lines.append(f"- {f.summary}")
            lines.append("")
    lines.append("")

    # 5. Unaddressed competing explanations
    lines.append("## 5. Unaddressed competing explanations")
    lines.append("")
    ce = claims.competing_explanations
    unaddressed = [
        (name, _human_label_for_check(name))
        for name in (
            "paraphrase_sweep_run",
            "sensitivity_sweep_run",
            "structural_check_run",
            "cross_panel_check_run",
            "independent_reference_panel_used",
            "held_out_items_used",
            "training_data_separation_verified",
            "cross_domain_comparison_run",
            "replication_attempted",
            "test_retest_run",
        )
        if not getattr(ce, name)
    ]
    if not unaddressed:
        lines.append("All declared competing-explanation checks marked as run.")
    else:
        lines.append(
            "The following checks were NOT run. Each omission weakens the "
            "defensibility of the corresponding mastery claim:"
        )
        lines.append("")
        for name, label in unaddressed:
            lines.append(f"- **{label}** (`{name}`)")
    lines.append("")

    # 6. Summary verdict
    lines.append("## 6. Summary verdict")
    lines.append("")
    badge = {
        "defensible": "✅",
        "partially_defensible": "⚠️",
        "not_defensible": "❌",
    }[verdict.label]
    lines.append(f"### {badge} {verdict.one_liner}")
    lines.append("")
    for note in verdict.rationale:
        lines.append(f"- {note}")
    lines.append("")
    lines.append("---")
    lines.append("")
    lines.append(
        "*Generated by `infereval report` (Phase 3.1, R16–R20). The verdict "
        "is computed deterministically from the claims file; the framework "
        "refuses to render a 'defensible' verdict without the corresponding "
        "competing-explanation checks.*"
    )

    return "\n".join(lines) + "\n"

infereval.report.NegativeFinding `dataclass` ¶

NegativeFinding(source: Literal['structure', 'sweep', 'model_fit', 'retest', 'decomposition_under_powered'], summary: str)

One auto-collected negative finding from a Phase 2 artifact.

A finding is "negative" in the construct-validity sense — a check that ran and returned a result that weakens or complicates the mastery claim. Per Closing the Construct-Validity Gap in infereval (Phase 3.2 / R21), the framework surfaces these by default in the report.

summary `instance-attribute` ¶

summary: str

One-line description rendered in the Negative findings section.
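
As a rough illustration of how findings with these two fields can be bucketed per source for display, the sketch below groups summaries by their source tag, much as the section 4b renderer does. `Finding` is a local stand-in for NegativeFinding and `group_by_source` is a hypothetical helper; neither is part of infereval, and the example summaries are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """Local stand-in for NegativeFinding: a source tag plus a one-line summary."""

    source: str
    summary: str


def group_by_source(findings: list[Finding]) -> dict[str, list[str]]:
    """Hypothetical helper: bucket summaries by source, preserving input order."""
    grouped: dict[str, list[str]] = {}
    for finding in findings:
        grouped.setdefault(finding.source, []).append(finding.summary)
    return grouped


# Made-up findings, for illustration only.
findings = [
    Finding("sweep", "Sweep over `temperature`: moderately sensitive"),
    Finding("structure", "some_check / item-3: example explanation"),
    Finding("sweep", "Sweep over `top_p`: moderately sensitive"),
]
grouped = group_by_source(findings)
```

Grouping by the source tag rather than by position keeps the rendered section stable when artifacts are supplied in a different order.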

infereval.report.collect_negative_findings ¶

collect_negative_findings(*, structure_report: dict[str, object] | None = None, sweep_summary: dict[str, object] | None = None, model_fit: dict[str, object] | None = None, retest_result: dict[str, object] | None = None, factor_kinds: dict[str, str] | None = None, decomposition_cells: list[dict[str, object]] | None = None) -> list[NegativeFinding]

Scan the supplied Phase 2 artifacts and return their negative findings.

Sources:

structure_report: each anomaly across all checks is one finding.
sweep_summary: instability (verdict not "stable across the sweep range") is one finding.
model_fit: factors whose Wald p > 0.05 are surfaced as no-significant-effect findings. When factor_kinds supplies a valence label for a factor, the finding's summary explicitly states whether the null is a weakening of the mastery claim (a substantive factor that didn't differentiate) or a strengthening one (an experimentally-controlled factor that properly didn't affect behavior — e.g. the paraphrase axis). Unlabelled factors get the historical neutral summary so the analyst can read the valence from context.
decomposition_cells (v0.8.0, closes #84): under-powered by-tag / by-rsr-target cells. Each cell is a dict with keys title (str), n_substantive (int), cohens_kappa (float | None), fleiss_kappa (float | None), and is_under_powered (bool). Cells with is_under_powered = True emit one finding each — the κ value on the cell is forced by single-class-each marginals (n below :data:infereval.metrics.MIN_K_FOR_SUBSAMPLING_CI), not measured, and shouldn't carry the verdict on its own.

Parameters¶

factor_kinds: Optional mapping factor_name -> {"substantive", "experimentally_controlled"} from Benchmark.factor_kinds. When omitted, all null-effect findings are summarised neutrally.
decomposition_cells: Optional list of per-cell summaries produced by :func:infereval.metrics.cell_summary (rendered as plain dicts for JSON-friendliness). When supplied, under-powered cells become section 4b negative findings under the decomposition_under_powered source.

Source code in src/infereval/report.py

def collect_negative_findings(
    *,
    structure_report: dict[str, object] | None = None,
    sweep_summary: dict[str, object] | None = None,
    model_fit: dict[str, object] | None = None,
    retest_result: dict[str, object] | None = None,
    factor_kinds: dict[str, str] | None = None,
    decomposition_cells: list[dict[str, object]] | None = None,
) -> list[NegativeFinding]:
    """Scan the supplied Phase 2 artifacts and return their negative findings.

    Sources:

    - **structure_report**: each anomaly across all checks is one finding.
    - **sweep_summary**: instability (verdict not "stable across the sweep
      range") is one finding.
    - **model_fit**: factors whose Wald p > 0.05 are surfaced as
      no-significant-effect findings. When ``factor_kinds`` supplies a
      valence label for a factor, the finding's summary explicitly
      states whether the null is a *weakening* of the mastery claim
      (a substantive factor that didn't differentiate) or a *strengthening*
      one (an experimentally-controlled factor that properly didn't
      affect behavior — e.g. the paraphrase axis). Unlabelled factors
      get the historical neutral summary so the analyst can read the
      valence from context.
    - **decomposition_cells** (v0.8.0, closes #84): under-powered by-tag /
      by-rsr-target cells. Each cell is a dict with keys ``title`` (str),
      ``n_substantive`` (int), ``cohens_kappa`` (float | None),
      ``fleiss_kappa`` (float | None), and ``is_under_powered`` (bool).
      Cells with ``is_under_powered = True`` emit one finding each —
      the κ value on the cell is forced by single-class-each marginals
      (n below :data:`infereval.metrics.MIN_K_FOR_SUBSAMPLING_CI`),
      not measured, and shouldn't carry the verdict on its own.

    Parameters
    ----------
    factor_kinds
        Optional mapping ``factor_name -> {"substantive",
        "experimentally_controlled"}`` from ``Benchmark.factor_kinds``.
        When omitted, all null-effect findings are summarised neutrally.
    decomposition_cells
        Optional list of per-cell summaries produced by
        :func:`infereval.metrics.cell_summary` (rendered as plain dicts
        for JSON-friendliness). When supplied, under-powered cells
        become section 4b negative findings under the
        ``decomposition_under_powered`` source.
    """
    findings: list[NegativeFinding] = []

    if structure_report is not None:
        checks_raw = structure_report.get("checks", [])
        checks = checks_raw if isinstance(checks_raw, list) else []
        for check in checks:
            if not isinstance(check, dict):
                continue
            anomalies = check.get("anomalies", ())
            if not anomalies:
                continue
            check_name = check.get("name", "?")
            for a in anomalies:
                if isinstance(a, dict):
                    item_id = a.get("item_id", "?")
                    expl = a.get("explanation", "")
                    findings.append(
                        NegativeFinding(
                            source="structure",
                            summary=f"{check_name} / {item_id}: {expl}",
                        )
                    )

    if sweep_summary is not None:
        verdict_raw = sweep_summary.get("stability_verdict", "")
        verdict_str = str(verdict_raw).lower()
        # The SweepResult.stability_verdict strings live in three flavours:
        # "stable" (positive), "moderately sensitive" (negative),
        # "substantively" (negative). "Stable" doesn't appear in the
        # negative ones, so its absence is the right signal.
        if verdict_str and "stable" not in verdict_str:
            param = sweep_summary.get("parameter", "?")
            findings.append(
                NegativeFinding(
                    source="sweep",
                    summary=f"Sweep over `{param}`: {sweep_summary.get('stability_verdict')}",
                )
            )

    if model_fit is not None:
        wald_raw = model_fit.get("factor_wald", {})
        wald = wald_raw if isinstance(wald_raw, dict) else {}
        kinds = factor_kinds or {}
        for factor, p in wald.items():
            if not isinstance(p, (int, float)):
                continue
            if p > 0.05:
                kind = kinds.get(str(factor))
                if kind == "substantive":
                    valence = (
                        " — **weakens the mastery claim**: this factor was "
                        "declared substantive, so the model failing to "
                        "differentiate across its levels is a negative finding"
                    )
                elif kind == "experimentally_controlled":
                    valence = (
                        " — **strengthens the mastery claim**: this factor "
                        "was declared experimentally-controlled, so the null "
                        "result is the wanted outcome (content-not-form "
                        "behavior)"
                    )
                else:
                    valence = ""
                findings.append(
                    NegativeFinding(
                        source="model_fit",
                        summary=(
                            f"`{factor}`: Wald p = {p:.3f} "
                            f"(no significant effect detected){valence}"
                        ),
                    )
                )

    if retest_result is not None:
        # v0.13.0: dispatch on artifact shape. Multi-interval emits one
        # corpus-level finding per non-stable pair and pools flipped
        # items across pairs by item_id (earliest-interval first-seen
        # annotation). Single-interval keeps the v0.12.0 behavior
        # verbatim.
        if _retest_is_multi_interval(retest_result):
            _collect_negative_findings_multi_interval(findings, retest_result)
        else:
            _collect_negative_findings_single(findings, retest_result)

    # Decomposition cells (v0.8.0, closes #84): under-powered by-tag /
    # by-rsr-target cells become section 4b negative findings. The
    # framework already gates Politis-Romano CIs at MIN_K_FOR_SUBSAMPLING_CI
    # on the headline; this extends that discipline into the decomposition.
    if decomposition_cells:
        from .metrics import MIN_K_FOR_SUBSAMPLING_CI

        for cell in decomposition_cells:
            if not isinstance(cell, dict):
                continue
            if not cell.get("is_under_powered"):
                continue
            title = cell.get("title", "?")
            n_sub = cell.get("n_substantive", "?")
            kappa_c = cell.get("cohens_kappa")
            kappa_f = cell.get("fleiss_kappa")
            kappa_pieces: list[str] = []
            if isinstance(kappa_c, (int, float)):
                kappa_pieces.append(f"κ_C = {kappa_c:+.3f}")
            elif kappa_c is None:
                kappa_pieces.append("κ_C undefined")
            if isinstance(kappa_f, (int, float)):
                kappa_pieces.append(f"κ_F = {kappa_f:+.3f}")
            elif kappa_f is None:
                kappa_pieces.append("κ_F undefined")
            kappa_str = "; ".join(kappa_pieces) if kappa_pieces else "κ undefined"
            findings.append(
                NegativeFinding(
                    source="decomposition_under_powered",
                    summary=(
                        f"{title}: n_substantive = {n_sub} "
                        f"(< {MIN_K_FOR_SUBSAMPLING_CI}); {kappa_str} is "
                        "under-powered — the magnitude is forced by "
                        "single-class-each marginals on a small subset, not "
                        "measured. Use the direction as a diagnostic lead; "
                        "confirm via a paraphrase or content-axis check."
                    ),
                )
            )

    return findings
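
The two threshold tests in the listing above can be tried in isolation. The sketch below re-creates them as standalone functions; `sweep_is_negative` and `split_wald` are hypothetical names, and the inputs are illustrative rather than real run outputs. It highlights two edge cases that follow from the logic as written: a verdict string containing "stable" as a substring (such as "unstable", which is not one of the three flavours named in the comment) would not be flagged, and a Wald p of exactly 0.05 is neither counted as significant by the renderer (p < 0.05) nor flagged as a null finding here (p > 0.05).

```python
from __future__ import annotations


def sweep_is_negative(stability_verdict: str) -> bool:
    """Same substring test as the listing: negative when 'stable' is absent."""
    verdict = stability_verdict.lower()
    return bool(verdict) and "stable" not in verdict


def split_wald(wald: dict[str, float]) -> tuple[list[str], list[str]]:
    """Return (significant, null) factor names using the two thresholds shown."""
    significant = [name for name, p in wald.items() if p < 0.05]
    null = [name for name, p in wald.items() if p > 0.05]
    return significant, null


# Illustrative inputs only.
checks = {
    "moderately sensitive": sweep_is_negative("moderately sensitive"),
    "Stable across the sweep range": sweep_is_negative("Stable across the sweep range"),
    "unstable": sweep_is_negative("unstable"),  # contains "stable", so not flagged
}
significant, null = split_wald({"domain": 0.01, "paraphrase": 0.40, "edge": 0.05})
```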

Providers¶

infereval.providers.get_provider ¶

get_provider(provider: str, model_id: str, **kwargs: Any) -> Provider

Construct a provider by short name.

Parameters¶

provider: Provider short name: "anthropic", "openai", "openrouter", or "mock".
model_id: Provider-specific model identifier.
**kwargs: Passed through to the concrete provider's constructor (e.g. api_key, base_url, retry_policy, http_referer).

Returns¶

Provider: A constructed provider instance satisfying the :class:Provider Protocol.

Raises¶

ProviderConfigError: If provider is not a known short name or required configuration is missing (e.g. API key not set, optional SDK not installed).

Source code in src/infereval/providers/__init__.py

def get_provider(provider: str, model_id: str, **kwargs: Any) -> Provider:
    """Construct a provider by short name.

    Parameters
    ----------
    provider
        Provider short name: ``"anthropic"``, ``"openai"``, ``"openrouter"``, or ``"mock"``.
    model_id
        Provider-specific model identifier.
    **kwargs
        Passed through to the concrete provider's constructor (e.g.
        ``api_key``, ``base_url``, ``retry_policy``, ``http_referer``).

    Returns
    -------
    Provider
        A constructed provider instance satisfying the :class:`Provider`
        Protocol.

    Raises
    ------
    ProviderConfigError
        If ``provider`` is not a known short name or required configuration is
        missing (e.g. API key not set, optional SDK not installed).
    """
    normalized = provider.strip().lower()
    if normalized == "anthropic":
        from .anthropic import AnthropicProvider

        return AnthropicProvider(model_id, **kwargs)
    if normalized == "openai":
        from .openai import OpenAIProvider

        return OpenAIProvider(model_id, **kwargs)
    if normalized == "openrouter":
        from .openrouter import OpenRouterProvider

        return OpenRouterProvider(model_id, **kwargs)
    if normalized == "mock":
        from .mock import ScriptedProvider

        return ScriptedProvider(model_id=model_id, **kwargs)
    raise ProviderConfigError(
        f"Unknown provider {provider!r}. "
        "Supported: 'anthropic', 'openai', 'openrouter', 'mock'."
    )

infereval.providers.base.Provider ¶

Bases: Protocol

The structural contract every LLM backend must satisfy.

infereval.providers.base.BaseProvider ¶

BaseProvider(model_id: str, *, retry_policy: RetryPolicy | None = None, rng: Random | None = None)

Bases: ABC

Abstract base providing the retry loop, timing, and logging.

Subclasses set the class-level :attr:name, implement :meth:_sample_once (one provider call), and implement :meth:_is_transient (which exceptions warrant a retry).

Source code in src/infereval/providers/base.py

def __init__(
    self,
    model_id: str,
    *,
    retry_policy: RetryPolicy | None = None,
    rng: random.Random | None = None,
) -> None:
    self.model_id = model_id
    self.retry_policy = retry_policy or RetryPolicy()
    self._rng = rng or random.Random()

sample ¶

sample(req: SampleRequest) -> SampleResult

Sample once with retries.

Raises¶

ProviderSampleError: If all retry attempts fail with transient errors, or the first attempt fails with a non-transient error.

Source code in src/infereval/providers/base.py

def sample(self, req: SampleRequest) -> SampleResult:
    """Sample once with retries.

    Raises
    ------
    ProviderSampleError
        If all retry attempts fail with transient errors, or the first
        attempt fails with a non-transient error.
    """
    last_exc: Exception | None = None
    for attempt in range(self.retry_policy.max_attempts):
        try:
            result = self._sample_once(req)
            if attempt > 0:
                log_event(
                    log,
                    "provider.sample.recovered",
                    provider=self.name,
                    model_id=self.model_id,
                    request_id=req.request_id,
                    attempt=attempt + 1,
                )
            return result
        except Exception as exc:  # noqa: BLE001 -- intentional broad catch; classified below
            last_exc = exc
            # v0.15.0: EmptyResponseError is always-transient — every
            # subclass provider inherits this classification without
            # needing to override _is_transient. Catches the v0.14.0
            # silent-empty-response failure mode.
            transient = (
                True
                if isinstance(exc, EmptyResponseError)
                else self._is_transient(exc)
            )
            log.warning(
                "provider.sample.error",
                extra={
                    "provider": self.name,
                    "model_id": self.model_id,
                    "request_id": req.request_id,
                    "attempt": attempt + 1,
                    "transient": transient,
                    "err": str(exc),
                },
            )
            if not transient:
                raise ProviderSampleError(
                    f"{self.name} sample failed (non-transient): {exc}"
                ) from exc
            if attempt + 1 >= self.retry_policy.max_attempts:
                break  # exhausted -- raise below
            sleep_for = self.retry_policy.sleep_for(attempt, self._rng)
            log_event(
                log,
                "provider.sample.retry",
                provider=self.name,
                request_id=req.request_id,
                sleep_s=sleep_for,
            )
            self._sleep(sleep_for)

    raise ProviderSampleError(
        f"{self.name} sample failed after {self.retry_policy.max_attempts} "
        f"attempts: {last_exc}"
    ) from last_exc
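
The control flow of the retry loop can be exercised without a provider. The following is a minimal standalone sketch, not the library's implementation: `TransientError`, `call_with_retries`, and the `backoff` and `sleep` parameters are hypothetical stand-ins, and unlike the listing above it lets non-transient exceptions propagate unwrapped instead of raising ProviderSampleError.

```python
from __future__ import annotations

import time
from typing import Callable


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retries(
    fn: Callable[[], str],
    *,
    max_attempts: int = 4,
    backoff: Callable[[int], float] = lambda i: 0.5 * 2.0**i,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fn until it succeeds or max_attempts transient failures occur."""
    last_exc: Exception | None = None
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError as exc:
            last_exc = exc
            if attempt + 1 >= max_attempts:
                break  # budget exhausted; fall through to the final raise
            sleep(backoff(attempt))
    raise RuntimeError(
        f"failed after {max_attempts} attempts: {last_exc}"
    ) from last_exc


# A function that fails twice, then succeeds on the third call.
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("temporary")
    return "GOOD"


slept: list[float] = []
result = call_with_retries(flaky, sleep=slept.append)
```

Injecting the sleep callable, as the listing does with `self._sleep`, lets a test record the delays instead of actually waiting.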

infereval.providers.base.SampleRequest `dataclass` ¶

SampleRequest(prompt: str, system: str | None = None, temperature: float = 1.0, max_tokens: int = 1024, top_p: float | None = None, seed: int | None = None, stop: tuple[str, ...] = (), request_id: str | None = None)

A single completion request issued to a provider.

The max_tokens default of 1024 is sized for current frontier models that consume budget on silent internal reasoning (DeepSeek v4-flash, OpenAI o-series, Gemini 2.5 Pro). Pre-reasoning models will only emit a handful of tokens for a one-word verdict regardless of this cap, so the higher default is cheap insurance against budget-clipping. See docs/providers.md for per-provider guidance.

request_id `class-attribute` `instance-attribute` ¶

request_id: str | None = None

Client-side correlation id propagated to logs and to :attr:SampleResult.request_id.

infereval.providers.base.SampleResult `dataclass` ¶

SampleResult(text: str, provider: str, model_id: str, request_id: str | None = None, wall_time_ms: float = 0.0, usage: Mapping[str, int] = dict(), raw: Mapping[str, Any] | None = None, finish_reason: str | None = None, reasoning_tokens: int | None = None)

One completed sample from a provider.

The finish_reason and reasoning_tokens fields surface provider-side stop-reason and reasoning-token-consumption metadata so that downstream code (the endorser, the JSONL log, the evaluation JSON) can distinguish budget-clipped abstains (model ran out of tokens on silent internal reasoning) from genuine abstains (model declined to commit). The values are passed through verbatim from each provider — see :data:BUDGET_FINISH_REASONS for the canonical union of values that signal a budget hit.

raw `class-attribute` `instance-attribute` ¶

raw: Mapping[str, Any] | None = None

Provider-native response payload, when available, for forensic inspection.

finish_reason `class-attribute` `instance-attribute` ¶

finish_reason: str | None = None

Provider-side stop reason. OpenAI: "stop" / "length" / ...; Anthropic: "end_turn" / "max_tokens" / "stop_sequence" / .... None if the provider didn't report one.

reasoning_tokens `class-attribute` `instance-attribute` ¶

reasoning_tokens: int | None = None

Count of tokens consumed by silent internal reasoning, where the provider exposes it (OpenAI: usage.completion_tokens_details.reasoning_tokens). None if not reported.
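
To make the budget-clipped versus genuine distinction concrete, here is a small sketch of how downstream code might label an abstain from its finish_reason. `ASSUMED_BUDGET_REASONS` is a local assumption built only from the OpenAI "length" and Anthropic "max_tokens" values mentioned above; the actual contents of BUDGET_FINISH_REASONS are not shown on this page, and `abstain_kind` is a hypothetical helper.

```python
from __future__ import annotations

# Assumption: a local stand-in for BUDGET_FINISH_REASONS, limited to the two
# token-limit stop reasons named in the finish_reason description.
ASSUMED_BUDGET_REASONS = frozenset({"length", "max_tokens"})


def abstain_kind(finish_reason: str | None) -> str:
    """Hypothetical helper: label an abstain by why the sample stopped."""
    if finish_reason is None:
        return "unknown"  # the provider did not report a stop reason
    if finish_reason in ASSUMED_BUDGET_REASONS:
        return "budget_clipped"  # ran out of tokens before committing
    return "genuine"  # stopped normally and still declined to commit


labels = [abstain_kind(r) for r in ("length", "end_turn", None)]
```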

infereval.providers.base.RetryPolicy `dataclass` ¶

RetryPolicy(max_attempts: int = 4, backoff_initial_s: float = 0.5, backoff_factor: float = 2.0, jitter: float = 0.25)

Exponential-backoff-with-jitter retry policy.

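Sleep before attempt i+1 (after the failure of attempt i, counting attempts from 0) is

.. math:: s_i = b \cdot f^{\,i} \cdot (1 + j \cdot u)

where :math:b is backoff_initial_s, :math:f is backoff_factor, :math:j is jitter, and :math:u is drawn uniformly from [-1, 1). The result is clamped at zero, which only matters when jitter exceeds 1.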
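
As a worked example of the formula, the sketch below computes the range each sleep can fall in under the default policy values (b = 0.5, f = 2.0, j = 0.25). `backoff_bounds` is a hypothetical helper written for this illustration, not a RetryPolicy method.

```python
def backoff_bounds(
    i: int, b: float = 0.5, f: float = 2.0, j: float = 0.25
) -> tuple[float, float]:
    """Return the (lowest, highest) sleep in seconds after failed attempt i."""
    base = b * f**i
    # u ranges over [-1, 1), so the multiplier ranges over [1 - j, 1 + j).
    return (max(0.0, base * (1.0 - j)), max(0.0, base * (1.0 + j)))


# Sleeps after failed attempts 0, 1, and 2.
schedule = [backoff_bounds(i) for i in range(3)]
# schedule == [(0.375, 0.625), (0.75, 1.25), (1.5, 2.5)]
worst_case_total = sum(high for _, high in schedule)
```

With the default max_attempts of 4, a sleep follows at most the first three failures, so the total time spent sleeping stays just under 4.375 seconds.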

sleep_for ¶

sleep_for(attempt_index: int, rng: Random) -> float

Return the sleep duration in seconds before the next attempt.

Source code in src/infereval/providers/base.py

def sleep_for(self, attempt_index: int, rng: random.Random) -> float:
    """Return the sleep duration in seconds before the next attempt."""
    base = self.backoff_initial_s * (self.backoff_factor**attempt_index)
    jitter_mul = 1.0 + self.jitter * (rng.random() * 2.0 - 1.0)
    return max(0.0, base * jitter_mul)

infereval.providers.mock.ScriptedProvider `dataclass` ¶

ScriptedProvider(responses: list[str | SampleResult], model_id: str = 'scripted-mock-v1', name: str = 'mock')

Returns a pre-determined sequence of responses, cycling on exhaustion.

Each element may be either a plain str (in which case it is wrapped in a :class:SampleResult at sample time) or a fully-formed :class:SampleResult.

Parameters¶

responses: Sequence of responses to return on successive sample calls.
model_id: Identifier reported in :attr:SampleResult.model_id.
name: Identifier reported in :attr:SampleResult.provider. Defaults to "mock" so evaluation JSON written from a test cleanly identifies itself as not real.

reset ¶

reset() -> None

Reset the index to the start of the response sequence.

Source code in src/infereval/providers/mock.py

def reset(self) -> None:
    """Reset the index to the start of the response sequence."""
    self._index = 0
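
The cycle-on-exhaustion and reset behaviour described for ScriptedProvider can be pictured with a small stand-in. `CyclingScript` and `next_text` are hypothetical names, and wrapping the index with a modulo is an assumption about how cycling is realised; the provider's own sample method is not shown on this page.

```python
from dataclasses import dataclass, field


@dataclass
class CyclingScript:
    """Hypothetical stand-in: replay scripted texts, wrapping around at the end."""

    responses: list[str]
    _index: int = field(default=0, init=False)

    def next_text(self) -> str:
        # Wrap around once the script is exhausted.
        text = self.responses[self._index % len(self.responses)]
        self._index += 1
        return text

    def reset(self) -> None:
        """Restart from the first scripted response."""
        self._index = 0


script = CyclingScript(["GOOD", "BAD"])
seen = [script.next_text() for _ in range(3)]  # the third call wraps around
script.reset()
after_reset = script.next_text()
```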

infereval.providers.mock.ReplayProvider ¶

ReplayProvider(fixture_path: Path | str, *, model_id: str | None = None)

Replays recorded provider responses from a JSONL fixture.

The fixture is one JSON object per line. Each record must carry a prompt_hash (matching :func:infereval.logging_setup.prompt_hash of the prompt that produced it) and a text field. Optional fields: provider, model_id, request_id, wall_time_ms, usage, raw.

When multiple records share a prompt hash, they are returned in fixture order; ReplayProvider cycles when the per-prompt sequence is exhausted, matching :class:ScriptedProvider semantics.

Missing prompt hashes raise :class:ProviderSampleError with a diagnostic message listing how many hashes are recorded.

This is the M8 vehicle for byte-for-byte regression testing of the endorsement pipeline without hitting a real API. Generate fixtures via the developer helper at tests/fixtures/build_stop_sign_replay.py.

Source code in src/infereval/providers/mock.py

def __init__(
    self,
    fixture_path: Path | str,
    *,
    model_id: str | None = None,
) -> None:
    self.fixture_path = Path(fixture_path)
    if not self.fixture_path.exists():
        raise ProviderConfigError(
            f"ReplayProvider fixture not found: {self.fixture_path}"
        )

    records: dict[str, list[dict[str, Any]]] = {}
    with self.fixture_path.open("r", encoding="utf-8") as f:
        for line_no, raw_line in enumerate(f, start=1):
            line = raw_line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                raise ProviderConfigError(
                    f"ReplayProvider fixture {self.fixture_path} line {line_no} "
                    f"is not valid JSON: {exc}"
                ) from exc
            if "prompt_hash" not in record or "text" not in record:
                raise ProviderConfigError(
                    f"ReplayProvider fixture {self.fixture_path} line {line_no} "
                    "missing required fields 'prompt_hash' and/or 'text'"
                )
            records.setdefault(record["prompt_hash"], []).append(record)

    if not records:
        raise ProviderConfigError(
            f"ReplayProvider fixture {self.fixture_path} is empty"
        )

    self._records = records
    self._cursors: dict[str, int] = {}

    # Default model_id: explicit > first record's > generic placeholder.
    if model_id is not None:
        self.model_id = model_id
    else:
        first_record = next(iter(records.values()))[0]
        self.model_id = first_record.get("model_id", "replay-v1")

reset ¶

reset() -> None

Reset all per-prompt cursors so replay restarts from the top.

Source code in src/infereval/providers/mock.py

def reset(self) -> None:
    """Reset all per-prompt cursors so replay restarts from the top."""
    self._cursors.clear()

Prompts¶

infereval.prompts.VerificationPrompt `dataclass` ¶

VerificationPrompt(id: str, system: str, user_template: str, parse_regex: str = DEFAULT_PARSE_REGEX, survey_header: str | None = None, survey_stem: str | None = None)

A verification prompt template.

Attributes¶

id
    Stable identifier recorded in evaluation JSON (endorsement_config.verification_prompt_id).
system
    System message sent to the provider. May be empty.
user_template
    Format string with {premise_context} and {conclusion_context} placeholders, used to build each per-sample user prompt.
parse_regex
    Regex applied (case-insensitively) to the model's response. The first match's group 1 is uppercased and interpreted as a :class:Verdict value (GOOD / BAD / ABSTAIN).
survey_header
    Optional human-facing surface of this prompt's frame: the survey instruction text stating the same assessment norms in respondent voice (the support-form analogue of :attr:infereval.templates.CoherenceFrame.survey_header). It changes only the header; choice labels and importer decode stay library-controlled. None means no survey surface is declared: the survey renderer falls back to the locked v0.9.0 header for default-v1 only and fails loudly for any other prompt, so a non-default frame can never silently elicit humans under the default's wording.
survey_stem
    The header's own closing question line, repeated per item when an exporter renders the header once as an instructions page (the support-form analogue of :attr:infereval.templates.CoherenceFrame.survey_stem). Must be a verbatim trailing substring of survey_header, so the stem adds no wording beyond the frame's reviewed surface. None means the instructions header mode fails loudly for this prompt.

build_user ¶

build_user(premise_context: str, conclusion_context: str) -> str

Return the per-sample user prompt with both contexts substituted in.

Source code in src/infereval/prompts.py

def build_user(self, premise_context: str, conclusion_context: str) -> str:
    """Return the per-sample user prompt with both contexts substituted in."""
    return self.user_template.format(
        premise_context=premise_context,
        conclusion_context=conclusion_context,
    )

compile_parser ¶

compile_parser() -> re.Pattern[str]

Compile :attr:parse_regex as a case-insensitive pattern.

Source code in src/infereval/prompts.py

def compile_parser(self) -> re.Pattern[str]:
    """Compile :attr:`parse_regex` as a case-insensitive pattern."""
    return re.compile(self.parse_regex, re.IGNORECASE)
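
The two methods above can be exercised in isolation. The following is a minimal sketch using a stand-in dataclass rather than the real :class:VerificationPrompt; DemoPrompt and DEMO_PARSE_REGEX are hypothetical, and the pattern is not the library's DEFAULT_PARSE_REGEX.

```python
import re
from dataclasses import dataclass

# Hypothetical pattern for illustration; not the library's DEFAULT_PARSE_REGEX.
DEMO_PARSE_REGEX = r"\b(good|bad|abstain)\b"


@dataclass(frozen=True)
class DemoPrompt:
    """Stand-in mirroring the two VerificationPrompt methods shown above."""

    user_template: str
    parse_regex: str = DEMO_PARSE_REGEX

    def build_user(self, premise_context: str, conclusion_context: str) -> str:
        return self.user_template.format(
            premise_context=premise_context,
            conclusion_context=conclusion_context,
        )

    def compile_parser(self) -> re.Pattern[str]:
        return re.compile(self.parse_regex, re.IGNORECASE)


demo = DemoPrompt(
    user_template="Premises: {premise_context}\nConclusion: {conclusion_context}\nAnswer:"
)
user_prompt = demo.build_user("a is a stop sign", "a is red")

# Case-insensitive search: "Good" matches even though the pattern is lowercase.
match = demo.compile_parser().search("Verdict: Good, given the premise.")
token = match.group(1).upper() if match else None
```

Because build_user relies on str.format, any literal braces in a custom user_template have to be doubled ({{ and }}).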

infereval.prompts.resolve_verification_prompt ¶

resolve_verification_prompt(override: VerificationPromptOverride | None, *, override_id: str = 'benchmark-override-v1') -> VerificationPrompt

Return the default prompt, or a benchmark-supplied override.

Each override field that is None falls back to the framework default:

- :attr:VerificationPromptOverride.system None → :data:DEFAULT_SYSTEM_PROMPT.
- :attr:VerificationPromptOverride.parse_regex None → :data:DEFAULT_PARSE_REGEX.
- :attr:VerificationPromptOverride.id None → override_id (caller-supplied fallback identifier).

A benchmark JSON can now fully specify a custom verification prompt (system + user template + parse regex + identifier) without dropping to the Python API.

Source code in src/infereval/prompts.py

def resolve_verification_prompt(
    override: VerificationPromptOverride | None,
    *,
    override_id: str = "benchmark-override-v1",
) -> VerificationPrompt:
    """Return the default prompt, or a benchmark-supplied override.

    Each override field that is ``None`` falls back to the framework
    default:

    - :attr:`VerificationPromptOverride.system` ``None`` →
      :data:`DEFAULT_SYSTEM_PROMPT`.
    - :attr:`VerificationPromptOverride.parse_regex` ``None`` →
      :data:`DEFAULT_PARSE_REGEX`.
    - :attr:`VerificationPromptOverride.id` ``None`` → ``override_id``
      (caller-supplied fallback identifier).

    A benchmark JSON can now fully specify a custom verification prompt
    (system + user template + parse regex + identifier) without dropping
    to the Python API.
    """
    if override is None:
        return DEFAULT_VERIFICATION_PROMPT
    return VerificationPrompt(
        id=override.id or override_id,
        system=override.system if override.system is not None else DEFAULT_SYSTEM_PROMPT,
        user_template=override.template,
        parse_regex=override.parse_regex or DEFAULT_PARSE_REGEX,
    )

Endorsement¶

infereval.endorsement.endorse ¶

endorse(implication: Implication, bearers: Mapping[str, Bearer], provider: Provider, config: EndorsementConfig, params: ProviderParams, *, premise_builder: ContextBuilder, conclusion_builder: ContextBuilder, verification_prompt: VerificationPrompt = DEFAULT_VERIFICATION_PROMPT, strip_tex: bool = True, request_id_prefix: str | None = None, variant: int = 0, question_form: QuestionForm = 'support', template: Template | None = None, coherence_frame: CoherenceFrame | None = None) -> EndorsementRecord

Compute :math:E_M(\langle \Gamma, \Delta \rangle) for one implication.

Issues config.n_samples calls to provider with the verification prompt built from premise_builder and conclusion_builder, parses each response, and aggregates via :func:majority_vote.

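Provider sample failures (after the provider's own retries are exhausted) are recorded with parse_status sample_failed and a placeholder ABSTAIN parsed_verdict. Since v0.15.0 these samples also carry provider_error and are excluded from the majority vote and the per-verdict counts; if every sample fails, the vote collapses to ABSTAIN.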

The variant parameter selects which expression each bearer is rendered with. variant=0 (the default) uses the canonical expressions; variant=k uses bearer.paraphrases[k-1] per :func:_expressions_for. Use this to drive the paraphrase axis of variation (R10) without needing to mutate the benchmark JSON between runs.
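
The variant indexing described above can be illustrated with a stand-in bearer. This is only a sketch: DemoBearer and expression_for are hypothetical, and the behaviour for an out-of-range variant is an assumption, since the private :func:_expressions_for helper is not shown on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DemoBearer:
    """Stand-in for infereval.types.Bearer with the same three fields."""

    id: str
    expression: str
    paraphrases: tuple[str, ...] = ()

    def all_expressions(self) -> tuple[str, ...]:
        return (self.expression, *self.paraphrases)


def expression_for(bearer: DemoBearer, variant: int) -> str:
    """variant=0 -> canonical expression; variant=k -> paraphrases[k-1]."""
    if variant == 0:
        return bearer.expression
    # Assumption: an out-of-range variant simply raises IndexError here;
    # the real helper may handle that case differently.
    return bearer.paraphrases[variant - 1]


stop_sign = DemoBearer(
    id="sa",
    expression="a is a stop sign",
    paraphrases=("a is a sign that tells drivers to stop",),
)
canonical = expression_for(stop_sign, variant=0)
paraphrased = expression_for(stop_sign, variant=1)
```

Under this reading, variant is the index into all_expressions(): position 0 is the canonical expression and position k is the k-th paraphrase.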

Source code in src/infereval/endorsement.py

def endorse(
    implication: Implication,
    bearers: Mapping[str, Bearer],
    provider: Provider,
    config: EndorsementConfig,
    params: ProviderParams,
    *,
    premise_builder: ContextBuilder,
    conclusion_builder: ContextBuilder,
    verification_prompt: VerificationPrompt = DEFAULT_VERIFICATION_PROMPT,
    strip_tex: bool = True,
    request_id_prefix: str | None = None,
    variant: int = 0,
    question_form: QuestionForm = "support",
    template: Template | None = None,
    coherence_frame: CoherenceFrame | None = None,
) -> EndorsementRecord:
    """Compute :math:`E_M(\\langle \\Gamma, \\Delta \\rangle)` for one implication.

    Issues ``config.n_samples`` calls to ``provider`` with the verification
    prompt built from ``premise_builder`` and ``conclusion_builder``,
    parses each response, and aggregates via :func:`majority_vote`.

    Provider sample failures (after the provider's own retries are
    exhausted) are recorded with ``parse_status="sample_failed"`` and a
    placeholder ``ABSTAIN`` verdict. Since v0.15.0 they are excluded from
    the majority vote and the per-verdict counts.

    The ``variant`` parameter selects which expression each bearer is
    rendered with. ``variant=0`` (the default) uses the canonical
    expressions; ``variant=k`` uses ``bearer.paraphrases[k-1]`` per
    :func:`_expressions_for`. Use this to drive the paraphrase axis
    of variation (R10) without needing to mutate the benchmark JSON
    between runs.
    """
    premise_exprs = _expressions_for(
        implication.premises, bearers, strip_tex=strip_tex, variant=variant
    )
    conclusion_exprs = _expressions_for(
        implication.conclusions, bearers, strip_tex=strip_tex, variant=variant
    )
    premise_ctx = premise_builder(premise_exprs)
    conclusion_ctx = conclusion_builder(conclusion_exprs)

    system_text, user_text, extract, prompt_id = _build_prompt(
        implication=implication,
        question_form=question_form,
        verification_prompt=verification_prompt,
        template=template,
        coherence_frame=coherence_frame,
        premise_ctx=premise_ctx,
        conclusion_ctx=conclusion_ctx,
        conclusion_exprs=conclusion_exprs,
    )

    sample_records: list[SampleRecord] = []
    verdicts: list[Verdict] = []
    user_prompt_hash = prompt_hash(user_text)
    premise_ids = sorted(implication.premises)
    conclusion_ids = sorted(implication.conclusions)

    log_event(
        log,
        "item.started",
        item_id=implication.id,
        n_samples=config.n_samples,
        tie_break=config.tie_break,
        premise_ids=premise_ids,
        conclusion_ids=conclusion_ids,
        prompt_hash=user_prompt_hash,
        prompt=user_text,
        system=system_text,
        verification_prompt_id=prompt_id,
        question_form=question_form,
    )

    for i in range(config.n_samples):
        rid = f"{request_id_prefix}:sample-{i}" if request_id_prefix else None
        req = SampleRequest(
            prompt=user_text,
            system=system_text,
            temperature=params.temperature,
            max_tokens=params.max_tokens,
            top_p=params.top_p,
            seed=params.seed,
            stop=params.stop,
            request_id=rid,
        )
        try:
            result = provider.sample(req)
            verdict, status = extract(result.text)
            # Promote unparseable -> budget_clipped when the provider says the
            # response was truncated by max_tokens. The verdict stays abstain
            # (Definition 2 fallback) but the parse_status now tells the user
            # the abstain is operational, not a model decision.
            if (
                status == "unparseable"
                and result.finish_reason in BUDGET_FINISH_REASONS
            ):
                status = "budget_clipped"
            record = SampleRecord(
                sample_index=i,
                raw_response=result.text,
                parsed_verdict=verdict,
                parse_status=status,
                request_id=result.request_id,
                wall_time_ms=result.wall_time_ms,
                usage=_usage_from_mapping(result.usage),
                finish_reason=result.finish_reason,
                reasoning_tokens=result.reasoning_tokens,
            )
            log_event(
                log,
                "sample.completed",
                item_id=implication.id,
                sample_index=i,
                provider=result.provider,
                model_id=result.model_id,
                request_id=result.request_id,
                prompt_hash=user_prompt_hash,
                raw_response=result.text,
                parsed_verdict=str(verdict),
                parse_status=status,
                wall_time_ms=result.wall_time_ms,
                input_tokens=result.usage.get("input_tokens") if result.usage else None,
                output_tokens=result.usage.get("output_tokens") if result.usage else None,
                finish_reason=result.finish_reason,
                reasoning_tokens=result.reasoning_tokens,
            )
        except ProviderSampleError as exc:
            log_event(
                log,
                "sample.failed",
                item_id=implication.id,
                sample_index=i,
                prompt_hash=user_prompt_hash,
                err=str(exc),
            )
            verdict = Verdict.ABSTAIN
            # v0.15.0: provider_error carries the str of the underlying
            # exception so downstream metrics / retest / report code can
            # distinguish instrument failure from real model abstention.
            # parsed_verdict stays ABSTAIN for backward compatibility
            # with v0.14.0 consumers that don't understand the new field;
            # aggregators that DO understand it (v0.15.0+) skip the sample.
            record = SampleRecord(
                sample_index=i,
                raw_response="",
                parsed_verdict=Verdict.ABSTAIN,
                parse_status="sample_failed",
                request_id=rid,
                wall_time_ms=None,
                usage=None,
                finish_reason=None,
                reasoning_tokens=None,
                provider_error=str(exc),
            )
        sample_records.append(record)
        verdicts.append(verdict)

    # v0.15.0: skip samples whose provider call failed (provider_error set)
    # when computing the majority vote and the per-verdict counts. The
    # raw SampleRecord retains the placeholder ABSTAIN parsed_verdict so
    # the eta JSON round-trips for v0.14.0 consumers; the *aggregated*
    # majority vote and counts reflect only samples that actually
    # produced a real model response. If every sample failed, the vote
    # collapses to ABSTAIN (the existing majority_vote(empty) contract)
    # — a fuller "model_verdict = None" representation is deferred to a
    # later release per the v0.15.0 plan.
    voting_verdicts = [
        rec.parsed_verdict
        for rec in sample_records
        if rec.provider_error is None
    ]
    final, tie_broken = majority_vote(voting_verdicts, tie_break=config.tie_break)
    counts: dict[Verdict, int] = {v: 0 for v in Verdict}
    for v in voting_verdicts:
        counts[v] += 1

    log_event(
        log,
        "item.completed",
        item_id=implication.id,
        verdict=str(final),
        tie_broken=tie_broken,
        good=counts[Verdict.GOOD],
        bad=counts[Verdict.BAD],
        abstain=counts[Verdict.ABSTAIN],
    )

    return EndorsementRecord(
        implication=implication,
        samples=sample_records,
        counts=counts,
        verdict=final,
        tie_broken=tie_broken,
        premise_context=premise_ctx,
        conclusion_context=conclusion_ctx,
        rendered_user_prompt=user_text,
    )

infereval.endorsement.EndorsementRecord `dataclass` ¶

EndorsementRecord(implication: Implication, samples: list[SampleRecord], counts: dict[Verdict, int], verdict: Verdict, tie_broken: bool, premise_context: str, conclusion_context: str, rendered_user_prompt: str = '')

Result of one :func:endorse call.

This is the in-memory analog of an evaluation file's per-item record; :func:infereval.evaluation.evaluate converts it to an :class:infereval.evaluation.EvaluationItem for serialization.

premise_context `instance-attribute` ¶

premise_context: str

The full natural-language premise context shown to the model.

conclusion_context `instance-attribute` ¶

conclusion_context: str

The full natural-language conclusion context shown to the model.

to_majority_vote ¶

to_majority_vote() -> MajorityVote

Project counts + verdict into a Pydantic :class:MajorityVote.

Source code in src/infereval/endorsement.py

def to_majority_vote(self) -> MajorityVote:
    """Project counts + verdict into a Pydantic :class:`MajorityVote`."""
    return MajorityVote(
        good=self.counts.get(Verdict.GOOD, 0),
        bad=self.counts.get(Verdict.BAD, 0),
        abstain=self.counts.get(Verdict.ABSTAIN, 0),
        verdict=self.verdict,
        tie_broken=self.tie_broken,
    )

infereval.endorsement.parse_verdict ¶

parse_verdict(text: str, pattern: Pattern[str] | None = None) -> tuple[Verdict, ParseStatus]

Extract a :class:Verdict from a raw model response.

Returns (Verdict.ABSTAIN, "unparseable") if no token matches, or if the captured group is not a known verdict, per the paper's Definition 2 ("Unparseable responses are mapped to abstain").

Source code in src/infereval/prompts.py

def parse_verdict(
    text: str,
    pattern: re.Pattern[str] | None = None,
) -> tuple[Verdict, ParseStatus]:
    """Extract a :class:`Verdict` from a raw model response.

    Returns ``(Verdict.ABSTAIN, "unparseable")`` if no token matches; per
    the paper's Definition 2 ("Unparseable responses are mapped to abstain").
    """
    if pattern is None:
        pattern = DEFAULT_VERIFICATION_PROMPT.compile_parser()
    match = pattern.search(text)
    if match is None:
        return Verdict.ABSTAIN, "unparseable"
    token = match.group(1).upper()
    try:
        return Verdict(token.lower()), "ok"
    except ValueError:
        # Regex group didn't match a known verdict; treat as unparseable.
        return Verdict.ABSTAIN, "unparseable"
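
The two unparseable paths in the source above (no match at all, and a match whose group 1 is not a verdict token) can be seen side by side in the following sketch. demo_parse, STRICT, and LOOSE are hypothetical; neither pattern is the library's DEFAULT_PARSE_REGEX.

```python
import re
from enum import Enum


class Verdict(str, Enum):
    GOOD = "good"
    BAD = "bad"
    ABSTAIN = "abstain"


# Hypothetical patterns; neither is the library's DEFAULT_PARSE_REGEX.
STRICT = re.compile(r"\b(good|bad|abstain)\b", re.IGNORECASE)
LOOSE = re.compile(r"answer:\s*(\w+)", re.IGNORECASE)


def demo_parse(text: str, pattern: re.Pattern[str]) -> tuple[Verdict, str]:
    """Condensed restatement of the parse logic shown above."""
    match = pattern.search(text)
    if match is None:
        return Verdict.ABSTAIN, "unparseable"
    try:
        return Verdict(match.group(1).lower()), "ok"
    except ValueError:
        # Group 1 matched, but it is not a known verdict token.
        return Verdict.ABSTAIN, "unparseable"


parsed_ok = demo_parse("I would say BAD.", STRICT)
parsed_none = demo_parse("It depends on the jurisdiction.", STRICT)
parsed_unknown = demo_parse("Answer: maybe", LOOSE)
```

The second path only arises with a custom parse_regex whose capture group is broader than the three verdict words.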

infereval.endorsement.majority_vote ¶

majority_vote(verdicts: list[Verdict], tie_break: TieBreak = 'abstain') -> tuple[Verdict, bool]

Aggregate per-sample verdicts into a single verdict.

Returns (chosen_verdict, tie_broken_flag).

Tie rules (in order):

1. If verdicts is empty, return (ABSTAIN, False).
2. If exactly one verdict has the max count, return it.
3. Tie: if ABSTAIN is among the tied set, return ABSTAIN.
4. Otherwise (pure GOOD/BAD tie), apply tie_break.

Source code in src/infereval/endorsement.py

def majority_vote(
    verdicts: list[Verdict],
    tie_break: TieBreak = "abstain",
) -> tuple[Verdict, bool]:
    """Aggregate per-sample verdicts into a single verdict.

    Returns ``(chosen_verdict, tie_broken_flag)``.

    Tie rules (in order):

    1. If ``verdicts`` is empty, return ``(ABSTAIN, False)``.
    2. If exactly one verdict has the max count, return it.
    3. Tie: if ABSTAIN is among the tied set, return ABSTAIN.
    4. Otherwise (pure GOOD/BAD tie), apply ``tie_break``.
    """
    if not verdicts:
        return Verdict.ABSTAIN, False

    counts = Counter(verdicts)
    max_count = max(counts.values())
    top = [v for v in counts if counts[v] == max_count]

    if len(top) == 1:
        return top[0], False

    # Tie. ABSTAIN wins any tie it is part of.
    if Verdict.ABSTAIN in top:
        return Verdict.ABSTAIN, True

    # Pure GOOD/BAD tie. Apply tie_break.
    if tie_break == "abstain":
        return Verdict.ABSTAIN, True
    if tie_break == "good":
        return Verdict.GOOD, True
    if tie_break == "bad":
        return Verdict.BAD, True
    if tie_break == "first":
        for v in verdicts:
            if v in top:
                return v, True
        return top[0], True  # unreachable, defensive
    # Unknown tie_break: fall back to abstain (Literal forbids this at type level).
    return Verdict.ABSTAIN, True
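
A few worked cases make the ordering of the tie rules concrete. The sketch below is a condensed restatement of the function above with a stand-in Verdict enum; demo_vote is a hypothetical name, not a library function.

```python
from collections import Counter
from enum import Enum


class Verdict(str, Enum):
    GOOD = "good"
    BAD = "bad"
    ABSTAIN = "abstain"


def demo_vote(verdicts: list[Verdict], tie_break: str = "abstain") -> tuple[Verdict, bool]:
    """Condensed restatement of the four rules listed above."""
    if not verdicts:
        return Verdict.ABSTAIN, False
    counts = Counter(verdicts)
    max_count = max(counts.values())
    top = [v for v in counts if counts[v] == max_count]
    if len(top) == 1:
        return top[0], False
    # Any tie involving ABSTAIN resolves to ABSTAIN, whatever tie_break says.
    if Verdict.ABSTAIN in top or tie_break == "abstain":
        return Verdict.ABSTAIN, True
    if tie_break == "first":
        return next(v for v in verdicts if v in top), True
    if tie_break in ("good", "bad"):
        return Verdict(tie_break), True
    return Verdict.ABSTAIN, True  # unknown tie_break falls back to abstain


G, B, A = Verdict.GOOD, Verdict.BAD, Verdict.ABSTAIN
clear_majority = demo_vote([G, G, B])
abstain_in_tie = demo_vote([G, A], tie_break="good")
pure_tie_default = demo_vote([G, B])
pure_tie_first = demo_vote([B, G], tie_break="first")
no_samples = demo_vote([])
```

Note that tie_break="good" does not rescue the GOOD/ABSTAIN tie: rule 3 is checked before tie_break is consulted. The empty case returns ABSTAIN with the tie flag unset, which is the result :func:endorse relies on when every sample failed.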

Context builders¶

infereval.context.resolve_context_builder ¶

resolve_context_builder(model: TemplateContextBuilder | PluginContextBuilder) -> ContextBuilder

Convert a benchmark's serialized context-builder config to a callable.

Source code in src/infereval/context.py

def resolve_context_builder(
    model: TemplateContextBuilder | PluginContextBuilder,
) -> ContextBuilder:
    """Convert a benchmark's serialized context-builder config to a callable."""
    if isinstance(model, TemplateContextBuilder):
        return make_template_builder(template=model.template, joiner=model.joiner)
    if isinstance(model, PluginContextBuilder):
        return resolve_plugin(model.plugin)
    raise TypeError(f"Unsupported context-builder model: {type(model).__name__}")

infereval.context.resolve_context_builders ¶

resolve_context_builders(builders: ContextBuilders) -> tuple[ContextBuilder, ContextBuilder]

Resolve a :class:ContextBuilders pair into (premise, conclusion) callables.

Source code in src/infereval/context.py

def resolve_context_builders(
    builders: ContextBuilders,
) -> tuple[ContextBuilder, ContextBuilder]:
    """Resolve a :class:`ContextBuilders` pair into ``(premise, conclusion)`` callables."""
    return (
        resolve_context_builder(builders.premise),
        resolve_context_builder(builders.conclusion),
    )

infereval.context.make_template_builder ¶

make_template_builder(*, template: str = '{expressions}', joiner: str = ' and ') -> ContextBuilder

Build a context builder that joins expressions and formats into a template.

Parameters¶

template
    Format string with a {expressions} placeholder.
joiner
    Separator inserted between bearer expressions.

Returns¶

ContextBuilder
    Callable that takes a sequence of expressions and returns the formatted context.

Notes¶

The empty-input case returns the template formatted against an empty string, which by default yields the empty string. The endorser does not use this builder on empty implications (Definition 3 excludes them).

Source code in src/infereval/context.py

def make_template_builder(
    *, template: str = "{expressions}", joiner: str = " and "
) -> ContextBuilder:
    """Build a context builder that joins expressions and formats into a template.

    Parameters
    ----------
    template
        Format string with a ``{expressions}`` placeholder.
    joiner
        Separator inserted between bearer expressions.

    Returns
    -------
    ContextBuilder
        Callable that takes a sequence of expressions and returns the
        formatted context.

    Notes
    -----
    The empty-input case returns the template formatted against an empty
    string, which by default yields the empty string. The endorser does not
    use this builder on empty implications (Definition 3 excludes them).
    """

    def _build(expressions: Sequence[str]) -> str:
        joined = joiner.join(expressions)
        return template.format(expressions=joined)

    return _build
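
The join-then-format behaviour, including the empty-input case from the Notes, can be checked with a standalone copy of the logic. demo_template_builder is a hypothetical name for this sketch, and the return type is spelled out as a plain Callable rather than the library's ContextBuilder alias.

```python
from collections.abc import Callable, Sequence


def demo_template_builder(
    *, template: str = "{expressions}", joiner: str = " and "
) -> Callable[[Sequence[str]], str]:
    """Same join-then-format logic as the source shown above."""

    def _build(expressions: Sequence[str]) -> str:
        return template.format(expressions=joiner.join(expressions))

    return _build


default_builder = demo_template_builder()
premise_builder = demo_template_builder(template="Suppose that {expressions}.", joiner="; ")

joined_default = default_builder(["a is a stop sign", "a is red"])
joined_custom = premise_builder(["a is a stop sign", "a is red"])

# Empty input: only the default template collapses to the empty string.
empty_default = default_builder([])
empty_custom = premise_builder([])
```

With a custom template the empty-input result is the template with a blank slot ("Suppose that ." here), not the empty string, so the "yields the empty string" remark in the Notes applies to the default template only.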

infereval.context.strip_tex_math ¶

strip_tex_math(text: str) -> str

Strip $...$ TeX-math delimiters, preserving their contents.

Examples¶

>>> strip_tex_math("$a$ is a stop sign")
'a is a stop sign'
>>> strip_tex_math("$a$ and $b$")
'a and b'
>>> strip_tex_math("no math here")
'no math here'

Unmatched single $ characters are left in place; only paired $...$ spans (without nested $) are stripped.

Source code in src/infereval/context.py

def strip_tex_math(text: str) -> str:
    """Strip ``$...$`` TeX-math delimiters, preserving their contents.

    Examples
    --------
    >>> strip_tex_math("$a$ is a stop sign")
    'a is a stop sign'
    >>> strip_tex_math("$a$ and $b$")
    'a and b'
    >>> strip_tex_math("no math here")
    'no math here'

    Unmatched single ``$`` characters are left in place; only paired
    ``$...$`` spans (without nested ``$``) are stripped.
    """
    return _TEX_MATH_RE.sub(r"\1", text)
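
The module-private _TEX_MATH_RE is not shown on this page. As a sketch under that caveat, one pattern that reproduces the documented behaviour (paired spans stripped, a lone $ left alone) is given below; DEMO_TEX_MATH_RE and demo_strip_tex_math are hypothetical, and the real pattern may differ.

```python
import re

# Assumed pattern: one or more non-$ characters between a pair of $ signs.
DEMO_TEX_MATH_RE = re.compile(r"\$([^$]+)\$")


def demo_strip_tex_math(text: str) -> str:
    """Replace each paired $...$ span with its contents."""
    return DEMO_TEX_MATH_RE.sub(r"\1", text)


stripped = demo_strip_tex_math("$a$ is a stop sign")
both = demo_strip_tex_math("$a$ and $b$")
unmatched = demo_strip_tex_math("a costs $5")  # lone $ has no partner
```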

