API reference

This page is auto-generated from the docstrings in src/infereval/. The docstrings are maintained as a first-class artifact and cross-referenced to the paper, so help(infereval.metrics.cohens_kappa) is reliable; this page renders the same content as a navigable site.

If you're looking for symbolic notation rather than callables, see the Glossary.

Core data types

infereval.types.Verdict

Bases: str, Enum

Endorsement verdict :math:E_M(\langle \Gamma, \Delta \rangle).

String-valued so JSON serialization yields "good" / "bad" / "abstain" directly. The (str, Enum) pattern is used in place of :class:enum.StrEnum for Python 3.10 compatibility.
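
A minimal sketch of the (str, Enum) pattern described above, using a stand-in class rather than the real infereval.types.Verdict. The member names GOOD and ABSTAIN appear elsewhere on this page; BAD is assumed by analogy.

```python
import json
from enum import Enum


class Verdict(str, Enum):
    """Stand-in for infereval.types.Verdict (member BAD is an assumption)."""

    GOOD = "good"
    BAD = "bad"
    ABSTAIN = "abstain"


# Because each member is also a str, json.dumps emits the bare value.
payload = json.dumps({"verdict": Verdict.GOOD})

# Looking a member up by value recovers it after a JSON round trip.
restored = Verdict(json.loads(payload)["verdict"])
```

Here payload is the string {"verdict": "good"} and restored is Verdict.GOOD, so no custom JSON encoder is needed.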

infereval.types.Bearer dataclass

Bearer(id: str, expression: str, paraphrases: tuple[str, ...] = ())

A propositional content-bearer :math:\varphi \in B.

Parameters

id
    Short stable identifier, e.g. "sa" for "a is a stop sign".
expression
    Canonical natural-language statement :math:\delta(\varphi). May contain TeX-math delimiters (e.g. "$a$ is a stop sign"); these are stripped at prompt-construction time, not here.
paraphrases
    Optional family of meaning-preserving variants of :math:\delta(\varphi). Empty by default. Supports the paraphrase axis of variation discussed in the paper's Discussion.

all_expressions

all_expressions() -> tuple[str, ...]

Return the canonical expression followed by any paraphrases.

Source code in src/infereval/types.py
def all_expressions(self) -> tuple[str, ...]:
    """Return the canonical expression followed by any paraphrases."""
    return (self.expression, *self.paraphrases)

infereval.types.Implication dataclass

Implication(premises: frozenset[str], conclusions: frozenset[str], id: str | None = None)

A candidate implication :math:\langle \Gamma, \Delta \rangle.

Premises and conclusions are frozenset of bearer ids (the id field of :class:Bearer). The optional id field is a benchmark-level reference label; it is excluded from equality and hashing so that two implications with the same premise/conclusion sets compare equal regardless of label.

is_empty_empty property

is_empty_empty: bool

Whether this is the :math:\langle \emptyset, \emptyset \rangle implication.

Excluded from :math:I_M by stipulation (Definition 3, last sentence).

of classmethod

of(premises: Iterable[str], conclusions: Iterable[str], *, id: str | None = None) -> Implication

Convenience constructor accepting any iterables of bearer ids.

Source code in src/infereval/types.py
@classmethod
def of(
    cls,
    premises: Iterable[str],
    conclusions: Iterable[str],
    *,
    id: str | None = None,
) -> Implication:
    """Convenience constructor accepting any iterables of bearer ids."""
    return cls(frozenset(premises), frozenset(conclusions), id=id)

intersects

intersects() -> bool

Whether :math:\Gamma \cap \Delta \neq \emptyset (Definition 3, clause (i)).

Source code in src/infereval/types.py
def intersects(self) -> bool:
    """Whether :math:`\\Gamma \\cap \\Delta \\neq \\emptyset` (Definition 3, clause (i))."""
    return bool(self.premises & self.conclusions)

infereval.frame.DerivedFrame dataclass

DerivedFrame(bearers: Mapping[str, Bearer], endorsements: Mapping[Implication, Verdict])

The implication frame :math:\langle B, I_M \rangle derived from a model :math:M.

Construct via :meth:from_endorsements; do not instantiate directly unless you have already validated that all implication bearer-ids reference elements of bearers.

Attributes

bearers
    Read-only mapping id -> Bearer representing :math:B.
endorsements
    Read-only mapping from each queried :class:Implication to the verdict :math:E_M returned. Implications not in this mapping are treated as un-queried; :meth:contains returns False for them unless clause (i) applies.

from_endorsements classmethod

from_endorsements(bearers: Mapping[str, Bearer], endorsements: Mapping[Implication, Verdict]) -> DerivedFrame

Build a frame from a bearer set and a mapping of queried endorsements.

Raises

ValueError
    If any implication references a bearer id not present in bearers.

Source code in src/infereval/frame.py
@classmethod
def from_endorsements(
    cls,
    bearers: Mapping[str, Bearer],
    endorsements: Mapping[Implication, Verdict],
) -> DerivedFrame:
    """Build a frame from a bearer set and a mapping of queried endorsements.

    Raises
    ------
    ValueError
        If any implication references a bearer id not present in ``bearers``.
    """
    bearer_ids = set(bearers)
    for imp in endorsements:
        unknown = (imp.premises | imp.conclusions) - bearer_ids
        if unknown:
            raise ValueError(
                f"Implication {imp!r} references unknown bearer ids: {sorted(unknown)}"
            )
    return cls(
        bearers=MappingProxyType(dict(bearers)),
        endorsements=MappingProxyType(dict(endorsements)),
    )
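
The copy-then-wrap step at the end of from_endorsements is what makes the frame's mappings read-only. A small standalone illustration with a plain dict (the second bearer, "ra", is hypothetical):

```python
from types import MappingProxyType

source = {"sa": "a is a stop sign"}
view = MappingProxyType(dict(source))  # copy first, then wrap read-only

# Mutating the caller's dict afterwards does not leak into the view,
# because the proxy wraps the copy rather than the original.
source["ra"] = "a is red"

try:
    view["ra"] = "a is red"  # proxies do not support item assignment
    write_rejected = False
except TypeError:
    write_rejected = True
```

Copying before wrapping matters: a proxy over the caller's own dict would still reflect later mutations of that dict.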

contains

contains(implication: Implication) -> bool

Membership in :math:I_M per Definition 3.

<empty, empty> is excluded by stipulation. Otherwise the implication is a member iff clause (i) or clause (ii) holds. For implications not in :attr:endorsements, clause (ii) is treated as false (we have no evidence of endorsement).

Source code in src/infereval/frame.py
def contains(self, implication: Implication) -> bool:
    """Membership in :math:`I_M` per Definition 3.

    ``<empty, empty>`` is excluded by stipulation. Otherwise the
    implication is a member iff clause (i) or clause (ii) holds. For
    implications not in :attr:`endorsements`, clause (ii) is treated as
    false (we have no evidence of endorsement).
    """
    if implication.is_empty_empty:
        return False
    if implication.intersects():
        return True  # clause (i)
    return self.endorsements.get(implication) == Verdict.GOOD  # clause (ii)
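
The decision order in contains can be traced on plain frozensets and string verdicts. This is a re-statement for illustration, not the library's API; in_frame and the bearer id "ra" are hypothetical names.

```python
def in_frame(premises, conclusions, endorsements):
    """Decide membership from plain frozensets and a verdict dict (stand-in)."""
    if not premises and not conclusions:
        return False  # <empty, empty> is excluded by stipulation
    if premises & conclusions:
        return True  # clause (i): premises and conclusions share a bearer
    # clause (ii): only a recorded "good" verdict counts; un-queried is False
    return endorsements.get((premises, conclusions)) == "good"


recorded = {
    (frozenset({"sa"}), frozenset({"ra"})): "good",
    (frozenset({"ra"}), frozenset({"sa"})): "bad",
}

endorsed = in_frame(frozenset({"sa"}), frozenset({"ra"}), recorded)
rejected = in_frame(frozenset({"ra"}), frozenset({"sa"}), recorded)
overlapping = in_frame(frozenset({"sa"}), frozenset({"sa", "ra"}), {})
unqueried = in_frame(frozenset({"sa"}), frozenset({"xb"}), recorded)
```

Note that overlapping is True even with no recorded endorsements at all, which is the sense in which Containment holds by construction.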

satisfies_containment

satisfies_containment() -> bool

Containment is satisfied by construction (clause (i) of Definition 3).

This always returns True; the method is provided as an explicit witness of the invariant stated in the paper's remark "Containment by construction". Tests assert it to guard against future refactors that might break the invariant.

Source code in src/infereval/frame.py
def satisfies_containment(self) -> bool:
    """Containment is satisfied by construction (clause (i) of Definition 3).

    This always returns ``True``; the method is provided as an explicit
    witness of the invariant stated in the paper's remark "Containment by
    construction". Tests assert it to guard against future refactors that
    might break the invariant.
    """
    return True

queried_implications

queried_implications() -> frozenset[Implication]

The implications for which an :math:E_M verdict has been recorded.

Source code in src/infereval/frame.py
def queried_implications(self) -> frozenset[Implication]:
    """The implications for which an :math:`E_M` verdict has been recorded."""
    return frozenset(self.endorsements)

Benchmark

infereval.benchmark.Benchmark

Bases: BaseModel

A benchmark :math:\beta over a bearer set, analyst panel, and items.

See the paper, Definition 4.

n property

n: int

Number of items :math:n.

m property

m: int

Number of analysts :math:m.

load classmethod

load(path: str | Path) -> Benchmark

Load a benchmark from JSON on disk and validate.

Source code in src/infereval/benchmark.py
@classmethod
def load(cls, path: str | Path) -> Benchmark:
    """Load a benchmark from JSON on disk and validate."""
    with Path(path).open("r", encoding="utf-8") as f:
        data = json.load(f)
    return cls.model_validate(data)

dump

dump(path: str | Path, *, indent: int = 2) -> None

Write the benchmark to path as canonical-ish JSON.

Source code in src/infereval/benchmark.py
def dump(self, path: str | Path, *, indent: int = 2) -> None:
    """Write the benchmark to ``path`` as canonical-ish JSON."""
    with Path(path).open("w", encoding="utf-8") as f:
        f.write(self.dumps(indent=indent))
        f.write("\n")

cells

cells() -> dict[tuple[str, ...], int]

Count items per cell of the fully crossed design.

Returns a mapping from cell-tuple to item count, where the cell-tuple is the per-factor level value in the order given by sorted(self.factors). Every cell of the cartesian product is present in the result (count 0 if no item lands there); items whose factor_levels don't name every declared factor are excluded entirely (they belong to no cell).

Source code in src/infereval/benchmark.py
def cells(self) -> dict[tuple[str, ...], int]:
    """Count items per cell of the fully crossed design.

    Returns a mapping from cell-tuple to item count, where the
    cell-tuple is the per-factor level value in the order given by
    ``sorted(self.factors)``. Every cell of the cartesian product is
    present in the result (count 0 if no item lands there); items
    whose ``factor_levels`` don't name every declared factor are
    excluded entirely (they belong to no cell).
    """
    from itertools import product as _product

    if not self.factors:
        return {}
    factor_names = sorted(self.factors)
    # Initialise every cell to zero so under-populated cells appear.
    cells: dict[tuple[str, ...], int] = {
        tuple(combo): 0
        for combo in _product(*(self.factors[f] for f in factor_names))
    }
    for item in self.items:
        if not all(f in item.factor_levels for f in factor_names):
            continue
        key = tuple(item.factor_levels[f] for f in factor_names)
        cells[key] = cells.get(key, 0) + 1
    return cells
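
A worked example of the cell count, re-stating the logic above on plain dicts so it runs without the library. The factor names, levels, and items are made up for illustration.

```python
from itertools import product


def count_cells(factors, items):
    """Re-statement of the counting logic on plain dicts (stand-in)."""
    if not factors:
        return {}
    names = sorted(factors)
    # Start every cell of the cartesian product at zero.
    counts = {tuple(combo): 0 for combo in product(*(factors[f] for f in names))}
    for levels in items:
        if not all(f in levels for f in names):
            continue  # item does not name every factor: belongs to no cell
        key = tuple(levels[f] for f in names)
        counts[key] = counts.get(key, 0) + 1
    return counts


factors = {"polarity": ["pos", "neg"], "depth": ["1", "2"]}
items = [
    {"polarity": "pos", "depth": "1"},
    {"polarity": "pos", "depth": "1"},
    {"polarity": "neg", "depth": "2"},
    {"polarity": "neg"},  # incomplete, so excluded from every cell
]
counts = count_cells(factors, items)
```

Because the factor names are sorted, each key is ordered (depth, polarity), and the two unpopulated cells still appear with a count of 0.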

panel_names

panel_names() -> list[str]

Sorted unique panel names across :attr:analysts.

Returns [] for an unpanelled (flat) benchmark. Phase 1.4 of the construct-validity infrastructure (R4).

Source code in src/infereval/benchmark.py
def panel_names(self) -> list[str]:
    """Sorted unique panel names across :attr:`analysts`.

    Returns ``[]`` for an unpanelled (flat) benchmark. Phase 1.4 of
    the construct-validity infrastructure (R4).
    """
    return sorted({a.panel for a in self.analysts if a.panel is not None})

resolved_primary_panel

resolved_primary_panel() -> str | None

The primary panel name to use for analyses.

Returns :attr:primary_panel if set; otherwise the alphabetically-first declared panel name; otherwise None for unpanelled benchmarks.

Source code in src/infereval/benchmark.py
def resolved_primary_panel(self) -> str | None:
    """The primary panel name to use for analyses.

    Returns :attr:`primary_panel` if set; otherwise the
    alphabetically-first declared panel name; otherwise ``None`` for
    unpanelled benchmarks.
    """
    if self.primary_panel is not None:
        return self.primary_panel
    names = self.panel_names()
    return names[0] if names else None
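
The fallback order can be checked in isolation. The sketch below takes the panel strings directly instead of a Benchmark; resolve_primary and the panel names are hypothetical.

```python
def resolve_primary(primary_panel, analyst_panels):
    """Fallback order: explicit choice, then first sorted panel, then None."""
    if primary_panel is not None:
        return primary_panel
    names = sorted({p for p in analyst_panels if p is not None})
    return names[0] if names else None


explicit = resolve_primary("site-b", ["site-a", "site-b"])
fallback = resolve_primary(None, ["site-b", "site-a", "site-b"])
flat = resolve_primary(None, [None, None])
```

An explicit primary panel wins even when it is not alphabetically first; the sort only decides the default.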

infereval.benchmark.BenchmarkItem

Bases: BaseModel

A single benchmark item: an implication paired with analyst verdicts.

analyst_rationales class-attribute instance-attribute

analyst_rationales: list[str] | None = Field(default=None, description="Optional per-analyst, per-item rationales: the natural-language reason each analyst gave for their verdict on this item. Positionally aligned to analyst_verdicts — index j is analyst j's rationale, matching the benchmark's analysts declaration order. null (or absent) means 'this benchmark carries no rationale discipline.' A present-but-empty entry ('') means 'this analyst gave a verdict but recorded no reason on this item' — semantically distinct from null. When present, the length must equal len(benchmark.analysts).")

Optional per-analyst, per-item rationales: the natural-language reason each analyst gave for their verdict on this item. Positionally aligned to :attr:analyst_verdicts — index j is analyst j's rationale, matching :attr:Benchmark.analysts declaration order. None (or absent) means "this benchmark carries no rationale discipline." A present-but-empty entry ("") means "this analyst gave a verdict but recorded no reason on this item" — semantically distinct from None. When present, the length must equal len(benchmark.analysts) (enforced in :meth:Benchmark._check_consistency). The framework validates structure and length only; content is the analyst's responsibility. Added in v0.5.4 (AR1–AR12).

references class-attribute instance-attribute

references: list[Reference] = Field(default_factory=list)

Provenance for this implication: the guideline section, paper, or regulatory document that justifies the analyst's verdict. Empty by default; populating these turns the benchmark into an auditable artifact that a domain expert can cross-check against source material.

factor_levels class-attribute instance-attribute

factor_levels: dict[str, str] = Field(default_factory=dict)

Per-factor level assignments for this item, naming its position in the benchmark's crossed design. Keys must be factor names declared in :attr:Benchmark.factors; values must be levels from the corresponding levels list. Empty by default — items without factor_levels appear in no cell and are ignored by the min_items_per_cell check.

construction_metadata class-attribute instance-attribute

construction_metadata: ConstructionMetadata | None = None

Per-item provenance for construct-validity audit: who authored the item, when, which models the author was blind to at construction time, and what source material they worked from. None by default; populate selectively for items where the provenance matters. Phase 1.3 of the construct-validity infrastructure (R5, R8, R9).

to_implication

to_implication() -> Implication

Return the runtime :class:Implication view of this item.

Source code in src/infereval/benchmark.py
def to_implication(self) -> Implication:
    """Return the runtime :class:`Implication` view of this item."""
    return Implication(
        premises=frozenset(self.premises),
        conclusions=frozenset(self.conclusions),
        id=self.id,
    )

infereval.benchmark.BearerModel

Bases: BaseModel

JSON shape for a :class:infereval.types.Bearer.

references class-attribute instance-attribute

references: list[Reference] = Field(default_factory=list)

Provenance for the bearer's definition, e.g. the guideline section defining the threshold that "P/F < 300" is measured against.

infereval.benchmark.AnalystModel

Bases: BaseModel

A human analyst :math:a_j whose verdicts appear in :math:V_i.

panel class-attribute instance-attribute

panel: str | None = None

Optional panel identifier. Analysts sharing the same panel string are members of the same panel for cross-panel agreement analysis (R4: independent reference check). None (default) means the benchmark is flat — every analyst is treated equivalently. Adding a panel string to ANY analyst requires ALL analysts to declare one (no partial-panel benchmarks).

infereval.benchmark.RSRTarget

Bases: BaseModel

Target inference :math:\langle X, A \rangle for an RSR-targeted item.

See the paper's Remark on "RSR-targeted benchmarks".

infereval.benchmark.ConstructionMetadata

Bases: BaseModel

Per-item construction provenance for benchmark audit.

Records who authored an item, when, against what training-cutoff posture, and from what source materials. Phase 1.3 of the construct-validity infrastructure series, providing the data model for requirements R5 (documented construction), R8 (held-out items), and R9 (training-data separation).

Content is the analyst's responsibility — the framework validates structure (Pydantic types, extra="forbid") but does not enforce that, e.g., authored_on actually post-dates a model's training cutoff. The point is to make the presence of these declarations auditable.

authored_by class-attribute instance-attribute

authored_by: str | None = None

Identifier for the author of the item, e.g. "physician-c".

authored_on class-attribute instance-attribute

authored_on: date | None = None

ISO date the item was authored.

authored_blind_to_models class-attribute instance-attribute

authored_blind_to_models: list[str] = Field(default_factory=list)

Model identifiers the author had not observed at construction time. Critical for R8 (held-out items): if the author had seen M's draft-version output on the item, M's agreement on the final item does not constitute independent evidence.

source class-attribute instance-attribute

source: str | None = None

Free-form source citation for the primary material the author worked from (e.g. "Sanford Guide to Antimicrobial Therapy 2025"). Distinct from :attr:BenchmarkItem.references, which carries the framework-level :class:Reference objects supporting the verdict; source is intended for the primary material, not the literature that justifies the analyst's call.

infereval.benchmark.FactorConstraints

Bases: BaseModel

Constraints the benchmark validator should enforce on the factorial design.

Currently supports min_items_per_cell: every cell of the fully crossed design (cartesian product of all declared factor levels) must contain at least this many items, where a cell is defined by the per-factor level assignments in :attr:BenchmarkItem.factor_levels.

Implements Phase 1.1 of "Closing the Construct-Validity Gap in infereval", addressing requirement R7 (multiple items per condition) and supporting R12 (per-condition decomposition).

min_items_per_cell class-attribute instance-attribute

min_items_per_cell: int | None = None

If set, every cell of the crossed design must have at least this many items. Set to None to skip the cell-count validation entirely (the per-key / per-value type checks on factor_levels still run).

infereval.benchmark.ContextBuilders

Bases: BaseModel

Pair of context builders for :math:\mathrm{ctx}_\Gamma and :math:\mathrm{ctx}_\Delta.

infereval.benchmark.TemplateContextBuilder

Bases: BaseModel

A context builder specified by an inline template string.

The template is a format string with a single {expressions} placeholder. Bearer expressions are joined by joiner to fill it.
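
A minimal sketch of how such a template could be filled, assuming plain str.format substitution. The function name build_context and the default joiner "; " are assumptions for illustration, not the library's values.

```python
def build_context(template: str, expressions: list[str], joiner: str = "; ") -> str:
    """Join the bearer expressions and substitute them into the template."""
    return template.format(expressions=joiner.join(expressions))


text = build_context(
    "Suppose the following: {expressions}.",
    ["a is a stop sign", "a is red"],
)
```

Here text is "Suppose the following: a is a stop sign; a is red."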

infereval.benchmark.PluginContextBuilder

Bases: BaseModel

A context builder specified by a dotted import path.

The plugin must resolve to a callable (Sequence[str]) -> str taking bearer expressions and returning the natural-language context.
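
A callable with the required shape might look like the following. The function numbered_context is hypothetical; any importable callable taking a sequence of strings and returning a string would fit the same contract.

```python
from collections.abc import Sequence


def numbered_context(expressions: Sequence[str]) -> str:
    """Hypothetical plugin: render each bearer expression as a numbered line."""
    return "\n".join(f"{i}. {expr}" for i, expr in enumerate(expressions, start=1))


rendered = numbered_context(["a is a stop sign", "a is red"])
```

The benchmark would then name this callable by its dotted import path, which depends on where the module lives in your own package.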

infereval.benchmark.VerificationPromptOverride

Bases: BaseModel

Optional benchmark-level override of the framework's default verification prompt.

Of the four fields, only template is required by the schema, since it is the minimal thing an override needs to contribute. When one of the other fields is None, the framework default is used:

  • :attr:system None → :data:infereval.prompts.DEFAULT_SYSTEM_PROMPT.
  • :attr:parse_regex None → :data:infereval.prompts.DEFAULT_PARSE_REGEX.
  • :attr:id None → the caller-supplied override_id parameter to :func:infereval.prompts.resolve_verification_prompt.

Adding the system field makes the paraphrase-axis experiment fully JSON-drivable (no Python required to vary the verification prompt).

infereval.benchmark.Reference

Bases: BaseModel

A traceable provenance entry for a benchmark, bearer, or item.

The motivating use case is regulated-domain benchmarks (medical, legal, financial) where every non-trivial implication needs a citation to a guideline, statute, or peer-reviewed source. Recording these as structured objects lets downstream tooling render bibliographies, validate DOIs, and connect items to the documents that justify them.

Only :attr:citation is required. The other fields populate when the relevant identifier is known and remain None otherwise.

Authoring shorthand: a plain string in any references list is auto-promoted to a :class:Reference with the string as :attr:citation and everything else None. See :func:_promote_reference_shorthand.
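
The shorthand can be pictured as follows. This is a sketch on plain dicts, not the body of _promote_reference_shorthand, and it shows only the fields documented on this page (citation, section, note); the real model may carry further optional fields.

```python
def promote(entry):
    """Turn a bare string into a citation-only record; pass dicts through."""
    if isinstance(entry, str):
        return {"citation": entry, "section": None, "note": None}
    return entry


authored = [
    "Sanford Guide to Antimicrobial Therapy 2025",
    {"citation": "Internal guideline", "section": "Section 5.2", "note": None},
]
promoted = [promote(entry) for entry in authored]
```

After promotion every entry has the same shape, so downstream tooling never needs to special-case bare strings.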

section class-attribute instance-attribute

section: str | None = None

Pinpoint location within the cited work, e.g. "Section 5.2" or "Hypoxemia criterion".

note class-attribute instance-attribute

note: str | None = None

What specifically this reference supports, in the author's words.

Evaluation

infereval.evaluation.evaluate

evaluate(benchmark: Benchmark, provider: Provider, *, config: EndorsementConfig | None = None, params: ProviderParams | None = None, verification_prompt: VerificationPrompt | None = None, strip_tex: bool = True, run_id: str | None = None, log_path: Path | str | None = None, variant: int = 0) -> Evaluation

Run a model against a benchmark and assemble the resulting :math:\eta.

Iterates over every benchmark item, calls :func:infereval.endorsement.endorse to compute :math:E_M, and packages the per-item samples + majority-vote tally into an :class:Evaluation.

Parameters

benchmark
    The :math:\beta to evaluate against.
provider
    Any :class:infereval.providers.Provider (Anthropic, OpenAI, OpenRouter, or a mock).
config
    Endorsement configuration. Defaults to EndorsementConfig() (n_samples=5, tie_break=abstain, default verification prompt id).
params
    Provider decoding parameters. Defaults to ProviderParams() (temperature=1.0, max_tokens=1024).
verification_prompt
    If supplied, overrides the framework default. If the benchmark has verification_prompt set, it takes precedence over the framework default but not over this argument.
strip_tex
    Whether to strip $...$ TeX-math delimiters from bearer expressions at prompt-construction time (default True).
run_id
    Stable identifier for this evaluation run, recorded as :attr:Evaluation.id. Generated as a UUID4 if not supplied.
log_path
    Optional path for a JSONL run log; one event per line, suitable for jq or pandas.read_json(lines=True). If None (the default), no log file is written; library callers can still attach their own handlers to the infereval logger.

Returns

Evaluation
    The fully-populated :math:\eta ready to serialize to JSON.

Source code in src/infereval/evaluation.py
def evaluate(
    benchmark: Benchmark,
    provider: Provider,
    *,
    config: EndorsementConfig | None = None,
    params: ProviderParams | None = None,
    verification_prompt: VerificationPrompt | None = None,
    strip_tex: bool = True,
    run_id: str | None = None,
    log_path: Path | str | None = None,
    variant: int = 0,
) -> Evaluation:
    """Run a model against a benchmark and assemble the resulting :math:`\\eta`.

    Iterates over every benchmark item, calls
    :func:`infereval.endorsement.endorse` to compute :math:`E_M`, and
    packages the per-item samples + majority-vote tally into an
    :class:`Evaluation`.

    Parameters
    ----------
    benchmark
        The :math:`\\beta` to evaluate against.
    provider
        Any :class:`infereval.providers.Provider` (Anthropic, OpenAI,
        OpenRouter, or a mock).
    config
        Endorsement configuration. Defaults to ``EndorsementConfig()``
        (n_samples=5, tie_break=abstain, default verification prompt id).
    params
        Provider decoding parameters. Defaults to ``ProviderParams()``
        (temperature=1.0, max_tokens=1024).
    verification_prompt
        If supplied, overrides the framework default. If the benchmark
        has ``verification_prompt`` set, it takes precedence over the
        framework default but not over this argument.
    strip_tex
        Whether to strip ``$...$`` TeX-math delimiters from bearer
        expressions at prompt-construction time (default ``True``).
    run_id
        Stable identifier for this evaluation run, recorded as
        :attr:`Evaluation.id`. Generated as a UUID4 if not supplied.
    log_path
        Optional path for a JSONL run log; one event per line, suitable
        for ``jq`` or ``pandas.read_json(lines=True)``. If ``None`` (the
        default), no log file is written; library callers can still attach
        their own handlers to the ``infereval`` logger.

    Returns
    -------
    Evaluation
        The fully-populated :math:`\\eta` ready to serialize to JSON.
    """
    # Late imports to avoid the evaluation <-> endorsement <-> context cycle.
    from .context import resolve_context_builders
    from .endorsement import endorse
    from .logging_setup import configure_run_logging, log_event
    from .prompts import resolve_verification_prompt

    cfg = config or EndorsementConfig()
    par = params or ProviderParams()
    rid = run_id or str(uuid.uuid4())
    prompt = verification_prompt or resolve_verification_prompt(
        benchmark.verification_prompt
    )

    bearers = benchmark.runtime_bearers()
    premise_builder, conclusion_builder = resolve_context_builders(
        benchmark.context_builders
    )

    bench_hash = canonical_benchmark_hash(benchmark)

    with configure_run_logging(
        log_path,
        run_id=rid,
        extra_context={"benchmark_id": benchmark.id, "framework_version": __version__},
    ):
        started = datetime.now(timezone.utc)
        log_event(
            log,
            "run.started",
            benchmark_id=benchmark.id,
            benchmark_hash=bench_hash,
            n_items=benchmark.n,
            provider=provider.name,
            model_id=provider.model_id,
            params=par.model_dump(mode="json"),
            endorsement_config=cfg.model_dump(mode="json"),
            verification_prompt_id=prompt.id,
            strip_tex=strip_tex,
            paraphrase_variant=variant,
            framework_version=__version__,
        )

        items: list[EvaluationItem] = []
        for bench_item in benchmark.items:
            implication = bench_item.to_implication()
            record = endorse(
                implication,
                bearers,
                provider,
                cfg,
                par,
                premise_builder=premise_builder,
                conclusion_builder=conclusion_builder,
                verification_prompt=prompt,
                strip_tex=strip_tex,
                request_id_prefix=f"{rid}:{bench_item.id}",
                variant=variant,
            )
            items.append(
                EvaluationItem(
                    id=bench_item.id,
                    premises=sorted(bench_item.premises),
                    conclusions=sorted(bench_item.conclusions),
                    analyst_verdicts=list(bench_item.analyst_verdicts),
                    analyst_rationales=(
                        list(bench_item.analyst_rationales)
                        if bench_item.analyst_rationales is not None
                        else None
                    ),
                    model_verdict=record.verdict,
                    samples=record.samples,
                    majority_vote=record.to_majority_vote(),
                    tags=list(bench_item.tags),
                    references=list(bench_item.references),
                )
            )

        finished = datetime.now(timezone.utc)
        cfg_with_prompt_id = cfg.model_copy(
            update={"verification_prompt_id": prompt.id}
        )

        log_event(
            log,
            "run.finished",
            n_items=len(items),
            wall_time_s=(finished - started).total_seconds(),
        )

    return Evaluation(
        id=rid,
        benchmark_id=benchmark.id,
        benchmark_hash=bench_hash,
        model=ModelInfo(
            provider=provider.name,
            model_id=provider.model_id,
            params=par,
        ),
        endorsement_config=cfg_with_prompt_id,
        started_at=started,
        finished_at=finished,
        items=items,
        references=list(benchmark.references),
        paraphrase_variant=variant,
    )

infereval.evaluation.Evaluation

Bases: BaseModel

An evaluation :math:\eta of a model against a benchmark.

references class-attribute instance-attribute

references: list[Reference] = Field(default_factory=list)

Corpus-level provenance, propagated from :attr:infereval.benchmark.Benchmark.references at evaluation time. Carries the paper, dialogue, or regulatory framework the benchmark is derived from, so an evaluation JSON read in isolation still names its primary sources.

paraphrase_variant class-attribute instance-attribute

paraphrase_variant: int = 0

Index of the paraphrase variant used at evaluation time. 0 (default) means the canonical :attr:BearerModel.expression was used for every bearer. k >= 1 means bearer.paraphrases[k-1] was used per :func:infereval.endorsement._expressions_for (with fallback to the canonical for bearers that don't carry that paraphrase). Phase 1.2 of the construct-validity infrastructure (R10: paraphrase variation under fixed inferential content).
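
The indexing rule above, including the fallback, can be sketched for a single bearer. This is not the body of _expressions_for; expression_for is a hypothetical helper, the paraphrase text is made up, and a non-negative variant is assumed.

```python
def expression_for(expression, paraphrases, variant):
    """Pick the text for one bearer under a given paraphrase variant index."""
    if variant == 0:
        return expression
    index = variant - 1
    # Fall back to the canonical text when this bearer lacks that paraphrase.
    return paraphrases[index] if index < len(paraphrases) else expression


canonical = "a is a stop sign"
variants = ("The object a is a stop sign",)

v0 = expression_for(canonical, variants, 0)
v1 = expression_for(canonical, variants, 1)
v2 = expression_for(canonical, variants, 2)  # no second paraphrase exists
```

With one paraphrase available, v1 uses it while v0 and v2 both resolve to the canonical expression.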

endorsements

endorsements() -> dict[Implication, Verdict]

Mapping Implication -> Verdict suitable for :meth:DerivedFrame.from_endorsements.

Source code in src/infereval/evaluation.py
def endorsements(self) -> dict[Implication, Verdict]:
    """Mapping ``Implication -> Verdict`` suitable for :meth:`DerivedFrame.from_endorsements`."""
    return {item.to_implication(): item.model_verdict for item in self.items}

infereval.evaluation.EvaluationItem

Bases: BaseModel

One row of the evaluation :math:\eta: implication + analyst verdicts + :math:E_M.

analyst_rationales class-attribute instance-attribute

analyst_rationales: list[str] | None = Field(default=None, description="Optional per-analyst rationales propagated from the source benchmark item's analyst_rationales at evaluation build time. Positionally aligned to analyst_verdicts. null (or absent) when the source benchmark carried no rationale discipline; a present list (possibly containing empty strings) when it did. Covered by Evaluation.benchmark_hash.")

Optional per-analyst rationales propagated from :attr:infereval.benchmark.BenchmarkItem.analyst_rationales at evaluation-build time. Positionally aligned to :attr:analyst_verdicts. None (or absent) when the source benchmark carried no rationale discipline; a present list (possibly containing empty strings) when it did. Covered by the existing :attr:Evaluation.benchmark_hash integrity mechanism, so a rationale cannot be silently altered between evaluation and report without changing the hash. Added in v0.5.4 (AR8, AR9).

references class-attribute instance-attribute

references: list[Reference] = Field(default_factory=list)

Per-item provenance, propagated from :attr:infereval.benchmark.BenchmarkItem.references at evaluation time. Carries the guideline / paper / regulatory citation that justifies the analyst's verdict so the evaluation JSON is a self-contained, auditable artifact (no need to look up the source benchmark separately).

infereval.evaluation.EndorsementConfig

Bases: BaseModel

Configuration governing how :math:E_M is computed.

Note on terminology: n_samples is the number of completions drawn from M per benchmark item, in the LLM-literature sense of "sample" (one draw from the model's output distribution). It is not the number of dataset rows — that is the benchmark's item count and is fixed by the benchmark. The methodology issues n_samples provider calls per item, parses each completion's verdict token, and majority-votes to compute :math:E_M for that item. See docs/concepts.md for the full terminology note.
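
To make the per-item vote concrete, here is one possible reading of "majority-votes" with a tie-break, written against plain strings. It assumes the most frequent verdict wins and that a shared top count resolves to the tie_break value; the library's exact rule, and how it treats unparsable completions, is not shown on this page.

```python
from collections import Counter


def majority(verdicts, tie_break="abstain"):
    """Return the most common verdict, or tie_break when the top count is shared."""
    ranked = Counter(verdicts).most_common()
    if not ranked:
        return tie_break  # nothing parsed: fall back (an assumption)
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_break
    return ranked[0][0]


# Five samples for one item, matching the documented default n_samples=5.
clear = majority(["good", "good", "bad", "good", "abstain"])
tied = majority(["good", "bad"])
```

In this sketch clear is "good" (three of five samples) and tied falls back to "abstain".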

infereval.evaluation.ProviderParams

Bases: BaseModel

Decoding parameters passed to a provider sample call.

The max_tokens default of 1024 is sized for current frontier models that consume budget on silent internal reasoning. See :class:infereval.providers.base.SampleRequest and docs/providers.md for the rationale and per-provider guidance.

infereval.evaluation.SampleRecord

Bases: BaseModel

One sampled response from the provider plus its parsed verdict.

finish_reason class-attribute instance-attribute

finish_reason: str | None = None

Provider-side stop reason, when reported. See :class:infereval.providers.base.SampleResult.finish_reason.

reasoning_tokens class-attribute instance-attribute

reasoning_tokens: int | None = None

Reasoning / thinking token count, when the provider reports it. See :class:infereval.providers.base.SampleResult.reasoning_tokens.

infereval.evaluation.MajorityVote

Bases: BaseModel

Tally of parsed verdicts plus the resolved majority and tie-break flag.

infereval.evaluation.canonical_benchmark_hash

canonical_benchmark_hash(benchmark: Benchmark) -> str

SHA-256 of the benchmark's canonical-JSON form, prefixed sha256:.

Recorded in :attr:Evaluation.benchmark_hash for tamper detection. Two benchmarks that round-trip to the same canonical JSON have the same hash; this is robust to insertion order in dicts.

Source code in src/infereval/evaluation.py
def canonical_benchmark_hash(benchmark: Benchmark) -> str:
    """SHA-256 of the benchmark's canonical-JSON form, prefixed ``sha256:``.

    Recorded in :attr:`Evaluation.benchmark_hash` for tamper detection.
    Two benchmarks that round-trip to the same canonical JSON have the
    same hash; this is robust to insertion order in dicts.
    """
    canonical = json.dumps(
        benchmark.model_dump(mode="json", exclude_none=True),
        sort_keys=True,
        separators=(",", ":"),
    )
    return f"sha256:{hashlib.sha256(canonical.encode('utf-8')).hexdigest()}"
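
The insertion-order claim can be checked without the package. The sketch below applies the same `json.dumps` settings to plain dicts; `canonical_hash` and the sample payloads are hypothetical stand-ins for a dumped benchmark.

```python
import hashlib
import json


def canonical_hash(payload: dict) -> str:
    """Hash a JSON-ready dict with sorted keys and compact separators."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return f"sha256:{hashlib.sha256(canonical.encode('utf-8')).hexdigest()}"


# Same content, different key insertion order at both nesting levels.
a = {"id": "demo", "items": [{"id": "i1", "tags": ["base-inference"]}]}
b = {"items": [{"tags": ["base-inference"], "id": "i1"}], "id": "demo"}
print(canonical_hash(a) == canonical_hash(b))  # True

# List order is part of the canonical form, so reordering a list changes the hash.
c = {"id": "demo", "items": [{"id": "i1", "tags": ["x", "y"]}]}
d = {"id": "demo", "items": [{"id": "i1", "tags": ["y", "x"]}]}
print(canonical_hash(c) == canonical_hash(d))  # False
```

Only dict key order is normalised by `sort_keys=True`; sequences keep their order, so reordering a list field would be reported as a different benchmark.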

Metrics

infereval.metrics.coverage

coverage(eta: Evaluation) -> float

:math:\mathrm{cov}(\eta) = |\{i : E_M(I_i) \neq \text{abstain}\}| / n.

Returns 0.0 for an empty evaluation rather than raising.

Source code in src/infereval/metrics.py
def coverage(eta: Evaluation) -> float:
    """:math:`\\mathrm{cov}(\\eta) = |\\{i : E_M(I_i) \\neq \\text{abstain}\\}| / n`.

    Returns ``0.0`` for an empty evaluation rather than raising.
    """
    if eta.n == 0:
        return 0.0
    substantive = sum(1 for it in eta.items if it.model_verdict != Verdict.ABSTAIN)
    return substantive / eta.n
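
As a worked instance of the formula, the sketch below computes coverage over a bare list of model verdict strings. `coverage_of` is a hypothetical stand-in that mirrors the listing above without requiring an :class:Evaluation.

```python
def coverage_of(model_verdicts: list[str]) -> float:
    """Share of items on which the model gave a non-abstain verdict."""
    if not model_verdicts:
        return 0.0  # empty input yields 0.0 rather than a ZeroDivisionError
    substantive = sum(1 for v in model_verdicts if v != "abstain")
    return substantive / len(model_verdicts)


print(coverage_of(["good", "bad", "abstain", "good"]))  # 0.75
print(coverage_of([]))  # 0.0
```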

infereval.metrics.consensus_verdict

consensus_verdict(verdicts: Sequence[Verdict]) -> Verdict

Return the analyst consensus :math:c_i for one item's verdicts.

From the paper, Definition 8: good if more analysts say good than bad; bad if more say bad than good; otherwise abstain. Abstain votes are ignored in the count, so the result is abstain only when the good and bad counts are equal (including when every analyst abstains).

Source code in src/infereval/metrics.py
def consensus_verdict(verdicts: Sequence[Verdict]) -> Verdict:
    """Return the analyst consensus :math:`c_i` for one item's verdicts.

    From the paper, Definition 8: ``good`` if more analysts say ``good``
    than ``bad``; ``bad`` if more say ``bad`` than ``good``; otherwise
    ``abstain``. Abstain votes are ignored in the count, so the result is
    ``abstain`` only when the ``good`` and ``bad`` counts are equal
    (including when every analyst abstains).
    """
    good = sum(1 for v in verdicts if v == Verdict.GOOD)
    bad = sum(1 for v in verdicts if v == Verdict.BAD)
    if good > bad:
        return Verdict.GOOD
    if bad > good:
        return Verdict.BAD
    return Verdict.ABSTAIN
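
The edge cases are easiest to see by tracing the comparison in the listing on a few inputs. The sketch below re-declares a stand-in `Verdict` enum (values taken from the Core data types section) and a local `consensus` helper so that it runs without the package.

```python
from enum import Enum


class Verdict(str, Enum):
    GOOD = "good"
    BAD = "bad"
    ABSTAIN = "abstain"


def consensus(verdicts: list[Verdict]) -> Verdict:
    """Compare good and bad counts only; abstain votes enter neither count."""
    good = sum(1 for v in verdicts if v == Verdict.GOOD)
    bad = sum(1 for v in verdicts if v == Verdict.BAD)
    if good > bad:
        return Verdict.GOOD
    if bad > good:
        return Verdict.BAD
    return Verdict.ABSTAIN


G, B, A = Verdict.GOOD, Verdict.BAD, Verdict.ABSTAIN
print(consensus([G, A, A]).value)  # good: one good vote outweighs zero bad votes
print(consensus([G, B, A]).value)  # abstain: good and bad counts are equal
print(consensus([B, B, G]).value)  # bad
print(consensus([]).value)         # abstain: both counts are zero
```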

infereval.metrics.consensus_reference

consensus_reference(eta: Evaluation) -> ReferenceFn

Return :math:r(i) = c_i as a :data:ReferenceFn.

Source code in src/infereval/metrics.py
def consensus_reference(eta: Evaluation) -> ReferenceFn:
    """Return :math:`r(i) = c_i` as a :data:`ReferenceFn`."""
    per_item = [consensus_verdict(it.analyst_verdicts) for it in eta.items]

    def _ref(i: int) -> Verdict:
        return per_item[i]

    return _ref

infereval.metrics.cohens_kappa

cohens_kappa(eta: Evaluation, reference: ReferenceFn) -> float | None

:math:\kappa_C(\eta, r) = (p_o - p_e) / (1 - p_e).

Returns :data:None when :math:S(\eta, r) is empty or :math:p_e = 1 (degenerate distribution). Logs a warning in both cases so the user sees why the value is undefined.

Source code in src/infereval/metrics.py
def cohens_kappa(eta: Evaluation, reference: ReferenceFn) -> float | None:
    """:math:`\\kappa_C(\\eta, r) = (p_o - p_e) / (1 - p_e)`.

    Returns :data:`None` when :math:`S(\\eta, r)` is empty or
    :math:`p_e = 1` (degenerate distribution). Logs a warning in both
    cases so the user sees why the value is undefined.
    """
    S = sorted(substantive_index(eta, reference))
    if not S:
        log.warning(
            "kappa_C undefined: substantive subset S(eta, r) is empty"
        )
        return None

    n_S = len(S)
    p_o = sum(1 for i in S if eta.items[i].model_verdict == reference(i)) / n_S

    p_M: dict[Verdict, float] = {}
    p_r: dict[Verdict, float] = {}
    for c in (Verdict.GOOD, Verdict.BAD):
        p_M[c] = sum(1 for i in S if eta.items[i].model_verdict == c) / n_S
        p_r[c] = sum(1 for i in S if reference(i) == c) / n_S

    p_e = sum(p_M[c] * p_r[c] for c in (Verdict.GOOD, Verdict.BAD))

    if abs(1.0 - p_e) < 1e-12:
        log.warning(
            "kappa_C undefined: chance-expected agreement p_e = 1 "
            "(M and reference both degenerate on a single class over S)"
        )
        return None

    return (p_o - p_e) / (1.0 - p_e)

infereval.metrics.fleiss_kappa

fleiss_kappa(eta: Evaluation) -> float | None

:math:\kappa_F(\eta) with :math:M as the :math:(m+1)-th annotator.

The annotators on each item are the analyst verdicts followed by model_verdict. Items where any annotator (analyst or model) is non-substantive are excluded from :math:S_F per the paper's Definition 10.

Source code in src/infereval/metrics.py
def fleiss_kappa(eta: Evaluation) -> float | None:
    """:math:`\\kappa_F(\\eta)` with :math:`M` as the :math:`(m+1)`-th annotator.

    The annotators on each item are the analyst verdicts followed by
    ``model_verdict``. Items where any annotator (analyst or model) is
    non-substantive are excluded from :math:`S_F` per the paper's
    Definition 10.
    """
    tuples = [
        [*item.analyst_verdicts, item.model_verdict] for item in eta.items
    ]
    return _fleiss_over_tuples(tuples)

infereval.metrics.inter_analyst_fleiss

inter_analyst_fleiss(source: Evaluation | Benchmark) -> float | None

:math:\kappa_F^*(\beta): Fleiss' kappa over analyst verdicts alone.

Accepts either an :class:~infereval.evaluation.Evaluation or a :class:~infereval.benchmark.Benchmark. Returns :data:None (with a logged warning) when :math:m < 2 or when the analysts are unanimous on every item -- the two conditions Remark 4 calls out as making the baseline unavailable.

For a panelled :class:Benchmark (Issue #36, Phase 1.4), this returns the κ_F* of the primary panel only, or None when no primary panel can be resolved. An :class:Evaluation is always scored over all of its analyst columns. See :func:inter_analyst_fleiss_per_panel for the per-panel breakdown and :func:cross_panel_kappa for the cross-panel agreement metric.

Source code in src/infereval/metrics.py
def inter_analyst_fleiss(source: Evaluation | Benchmark) -> float | None:
    """:math:`\\kappa_F^*(\\beta)`: Fleiss' kappa over analyst verdicts alone.

    Accepts either an :class:`~infereval.evaluation.Evaluation` or a
    :class:`~infereval.benchmark.Benchmark`. Returns :data:`None` (with
    a logged warning) when :math:`m < 2` or when the analysts are
    unanimous on every item -- the two conditions Remark 4 calls out as
    making the baseline unavailable.

    For a panelled :class:`~infereval.benchmark.Benchmark` (Issue #36,
    Phase 1.4), this returns the κ_F* of the *primary* panel only, or
    :data:`None` when no primary panel can be resolved. An
    :class:`~infereval.evaluation.Evaluation` is always scored over all
    of its analyst columns. See :func:`inter_analyst_fleiss_per_panel`
    for the per-panel breakdown and :func:`cross_panel_kappa` for the
    cross-panel agreement metric.
    """
    # Late import to avoid the metrics <-> benchmark cycle.
    from .benchmark import Benchmark as _Benchmark

    if isinstance(source, _Benchmark) and source.panel_names():
        primary = source.resolved_primary_panel()
        if primary is None:
            return None
        indices = source.analyst_indices_in_panel(primary)
        tuples = [[it.analyst_verdicts[j] for j in indices] for it in source.items]
        return _fleiss_over_tuples(tuples)
    items = source.items
    tuples = [list(it.analyst_verdicts) for it in items]
    return _fleiss_over_tuples(tuples)

infereval.metrics.inter_analyst_fleiss_per_panel

inter_analyst_fleiss_per_panel(benchmark: Benchmark) -> dict[str, float | None]

:math:\kappa_F^* computed per analyst panel.

Returns a mapping panel_name -> κ_F* for every panel declared on the benchmark. A panel value is None when the panel has fewer than 2 analysts or when the analysts are unanimous on every item (per the same conditions :func:inter_analyst_fleiss honours).

Empty dict if the benchmark is unpanelled. Phase 1.4 of the construct-validity infrastructure (R4).

Source code in src/infereval/metrics.py
def inter_analyst_fleiss_per_panel(
    benchmark: Benchmark,
) -> dict[str, float | None]:
    """:math:`\\kappa_F^*` computed per analyst panel.

    Returns a mapping ``panel_name -> κ_F*`` for every panel declared on
    the benchmark. A panel value is ``None`` when the panel has fewer
    than 2 analysts or when the analysts are unanimous on every item
    (per the same conditions :func:`inter_analyst_fleiss` honours).

    Empty dict if the benchmark is unpanelled. Phase 1.4 of the
    construct-validity infrastructure (R4).
    """
    out: dict[str, float | None] = {}
    for name in benchmark.panel_names():
        indices = benchmark.analyst_indices_in_panel(name)
        tuples = [
            [it.analyst_verdicts[j] for j in indices] for it in benchmark.items
        ]
        out[name] = _fleiss_over_tuples(tuples)
    return out

infereval.metrics.cross_panel_kappa

cross_panel_kappa(benchmark: Benchmark, *, primary: str | None = None, check: str | None = None) -> float | None

Cohen's :math:\kappa_C between two panels' per-item consensus verdicts.

Computes a per-panel consensus verdict for each item (majority among panel members, abstain on tie) and then runs Cohen's kappa between the two columns, restricted to items where both panels yield a substantive verdict.

Parameters

benchmark
    Panelled benchmark.
primary
    Name of the primary panel. Defaults to benchmark.resolved_primary_panel().
check
    Name of the panel to compare against. When None and exactly two panels are declared, picks the non-primary one automatically.

Returns

float | None
    Cohen's kappa over the substantive-on-both items, or None when that set is empty, when chance-expected agreement equals 1, when a named panel is not declared on the benchmark, or when check is omitted and the benchmark does not declare exactly two panels.

Phase 1.4 of the construct-validity infrastructure (R4: guards against shared-error agreement within the primary panel by surfacing the independent panel's view).

Source code in src/infereval/metrics.py
def cross_panel_kappa(
    benchmark: Benchmark,
    *,
    primary: str | None = None,
    check: str | None = None,
) -> float | None:
    """Cohen's :math:`\\kappa_C` between two panels' per-item consensus verdicts.

    Computes a per-panel consensus verdict for each item (majority among
    panel members, abstain on tie) and then runs Cohen's kappa between
    the two columns, restricted to items where both panels yield a
    substantive verdict.

    Parameters
    ----------
    benchmark
        Panelled benchmark.
    primary
        Name of the primary panel. Defaults to
        ``benchmark.resolved_primary_panel()``.
    check
        Name of the panel to compare against. When ``None`` and exactly
        two panels are declared, picks the non-primary one
        automatically.

    Returns
    -------
    float | None
        Cohen's kappa over the substantive-on-both items, or ``None``
        when that set is empty, when chance-expected agreement equals 1,
        when a named panel is not declared on the benchmark, or when
        ``check`` is omitted and the benchmark does not declare exactly
        two panels.

    Phase 1.4 of the construct-validity infrastructure (R4: guards
    against shared-error agreement within the primary panel by
    surfacing the independent panel's view).
    """
    names = benchmark.panel_names()
    if primary is None:
        primary = benchmark.resolved_primary_panel()
    if primary is None or primary not in names:
        log.warning(
            "cross_panel_kappa: primary panel %r not declared on benchmark %r",
            primary,
            benchmark.id,
        )
        return None
    if check is None:
        others = [n for n in names if n != primary]
        if len(others) != 1:
            log.warning(
                "cross_panel_kappa: 'check' panel must be supplied when the "
                "benchmark declares != 2 panels (declared: %s)",
                names,
            )
            return None
        check = others[0]
    if check not in names:
        log.warning(
            "cross_panel_kappa: check panel %r not declared on benchmark %r",
            check,
            benchmark.id,
        )
        return None

    primary_idx = benchmark.analyst_indices_in_panel(primary)
    check_idx = benchmark.analyst_indices_in_panel(check)

    primary_col = [
        _panel_consensus_verdict(it.analyst_verdicts, primary_idx)
        for it in benchmark.items
    ]
    check_col = [
        _panel_consensus_verdict(it.analyst_verdicts, check_idx)
        for it in benchmark.items
    ]

    # Restrict to items where both panels reached a substantive consensus.
    pairs = [
        (p, c)
        for p, c in zip(primary_col, check_col, strict=True)
        if p != Verdict.ABSTAIN and c != Verdict.ABSTAIN
    ]
    if not pairs:
        log.warning(
            "cross_panel_kappa: empty substantive intersection between panels "
            "%r and %r on benchmark %r",
            primary,
            check,
            benchmark.id,
        )
        return None

    cats = (Verdict.GOOD, Verdict.BAD)
    n = len(pairs)
    p_obs = sum(1 for p, c in pairs if p == c) / n
    pa = {v: sum(1 for p, _ in pairs if p == v) / n for v in cats}
    pc = {v: sum(1 for _, c in pairs if c == v) / n for v in cats}
    p_exp = sum(pa[v] * pc[v] for v in cats)
    if abs(1 - p_exp) < 1e-12:
        log.warning(
            "cross_panel_kappa: chance-expected agreement = 1 (both panels "
            "are fully degenerate on the same single class); kappa is undefined"
        )
        return None
    return (p_obs - p_exp) / (1 - p_exp)
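
To make the column construction concrete, the sketch below builds the two per-panel consensus columns for five items with four analysts and then applies the same kappa arithmetic. `panel_consensus` is a hypothetical stand-in for the unrendered `_panel_consensus_verdict`, assumed to compare good and bad counts among the panel's members and abstain on a tie; the panel layout (analysts 0-1 primary, 2-3 check) is also invented for the example.

```python
def panel_consensus(verdicts: tuple[str, ...], indices: list[int]) -> str:
    """Majority of good vs. bad among the panel's analysts; abstain on a tie."""
    good = sum(1 for j in indices if verdicts[j] == "good")
    bad = sum(1 for j in indices if verdicts[j] == "bad")
    if good > bad:
        return "good"
    if bad > good:
        return "bad"
    return "abstain"


items = [
    ("good", "good", "good", "good"),
    ("bad", "bad", "bad", "bad"),
    ("good", "bad", "good", "good"),  # primary panel ties, so the item is dropped
    ("good", "good", "bad", "bad"),   # panels disagree
    ("bad", "bad", "bad", "abstain"),
]
primary_idx, check_idx = [0, 1], [2, 3]

pairs = [
    (p, c)
    for p, c in (
        (panel_consensus(v, primary_idx), panel_consensus(v, check_idx))
        for v in items
    )
    if p != "abstain" and c != "abstain"
]

n = len(pairs)
p_obs = sum(1 for p, c in pairs if p == c) / n
p_exp = sum(
    (sum(1 for p, _ in pairs if p == v) / n) * (sum(1 for _, c in pairs if c == v) / n)
    for v in ("good", "bad")
)
kappa = (p_obs - p_exp) / (1 - p_exp)
print(len(pairs), p_obs, p_exp, kappa)  # 4 0.75 0.5 0.5
```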

infereval.metrics.MetricsReport dataclass

MetricsReport(eta: Evaluation, benchmark: Benchmark | None = None)

Bundle of metrics over an :class:Evaluation, with decomposition filters.

Parameters

eta
    The evaluation to report on.
benchmark
    Optional benchmark. Required for :meth:by_rsr_target and :meth:coverage_per_analyst_named; other methods work without it.

n property

n: int

Number of evaluation items.

coverage_per_analyst_named

coverage_per_analyst_named() -> dict[str, float]

Per-analyst coverage keyed by analyst id (requires :attr:benchmark).

Source code in src/infereval/metrics.py
def coverage_per_analyst_named(self) -> dict[str, float]:
    """Per-analyst coverage keyed by analyst id (requires :attr:`benchmark`)."""
    if self.benchmark is None:
        raise ValueError(
            "coverage_per_analyst_named requires a benchmark to resolve analyst ids"
        )
    return {
        a.id: coverage_analyst(self.eta, j)
        for j, a in enumerate(self.benchmark.analysts)
    }

cohens_kappa

cohens_kappa(reference: ReferenceFn | None = None) -> float | None

:math:\kappa_C(\eta, r). Default reference is the analyst consensus :math:c_i.

Source code in src/infereval/metrics.py
def cohens_kappa(self, reference: ReferenceFn | None = None) -> float | None:
    """:math:`\\kappa_C(\\eta, r)`. Default reference is the analyst consensus :math:`c_i`."""
    ref = reference if reference is not None else consensus_reference(self.eta)
    return cohens_kappa(self.eta, ref)

cohens_kappa_analyst

cohens_kappa_analyst(analyst_index: int) -> float | None

:math:\kappa_C(\eta, v_{:,j}): M vs. one specific analyst.

Source code in src/infereval/metrics.py
def cohens_kappa_analyst(self, analyst_index: int) -> float | None:
    """:math:`\\kappa_C(\\eta, v_{:,j})`: M vs. one specific analyst."""
    return cohens_kappa(self.eta, analyst_reference(self.eta, analyst_index))

by_tag

by_tag(tag: str) -> MetricsReport

Return a report restricted to items carrying tag.

Source code in src/infereval/metrics.py
def by_tag(self, tag: str) -> MetricsReport:
    """Return a report restricted to items carrying ``tag``."""
    filtered = self.eta.model_copy(
        update={"items": [it for it in self.eta.items if tag in it.tags]}
    )
    return MetricsReport(eta=filtered, benchmark=self.benchmark)

by_rsr_target

by_rsr_target(X: frozenset[str], A: frozenset[str]) -> MetricsReport

Return a report restricted to items whose rsr_target matches (X, A).

rsr_target lives on benchmark items, not evaluation items, so :attr:benchmark is required.

Source code in src/infereval/metrics.py
def by_rsr_target(self, X: frozenset[str], A: frozenset[str]) -> MetricsReport:
    """Return a report restricted to items whose ``rsr_target`` matches ``(X, A)``.

    ``rsr_target`` lives on benchmark items, not evaluation items, so
    :attr:`benchmark` is required.
    """
    if self.benchmark is None:
        raise ValueError("by_rsr_target requires a benchmark to read rsr_target fields")
    keep_ids = {
        bi.id
        for bi in self.benchmark.items
        if bi.rsr_target is not None
        and frozenset(bi.rsr_target.X) == X
        and frozenset(bi.rsr_target.A) == A
    }
    filtered = self.eta.model_copy(
        update={"items": [it for it in self.eta.items if it.id in keep_ids]}
    )
    return MetricsReport(eta=filtered, benchmark=self.benchmark)

to_dict

to_dict() -> dict[str, Any]

Render as a JSON-friendly dict (None where a kappa is undefined).

Source code in src/infereval/metrics.py
def to_dict(self) -> dict[str, Any]:
    """Render as a JSON-friendly dict (None where a kappa is undefined)."""
    out: dict[str, Any] = {
        "n": self.n,
        "coverage": self.coverage,
        "coverage_per_analyst": self.coverage_per_analyst,
        "cohens_kappa_consensus": self.cohens_kappa(),
        "fleiss_kappa": self.fleiss_kappa,
        "inter_analyst_fleiss": self.inter_analyst_fleiss,
    }
    if self.benchmark is not None:
        out["coverage_per_analyst_named"] = self.coverage_per_analyst_named()
    return out

Structural checks (R13)

infereval.structure.run_all_checks

run_all_checks(evaluation: Evaluation, benchmark: Benchmark) -> StructuralReport

Run all three structural checks and bundle the results.

Source code in src/infereval/structure.py
def run_all_checks(
    evaluation: Evaluation, benchmark: Benchmark
) -> StructuralReport:
    """Run all three structural checks and bundle the results."""
    return StructuralReport(
        evaluation_id=evaluation.id,
        benchmark_id=evaluation.benchmark_id,
        checks=(
            containment_closure_check(evaluation, benchmark),
            rsr_role_consistency_check(evaluation, benchmark),
            base_case_stability_check(evaluation, benchmark),
        ),
    )

infereval.structure.containment_closure_check

containment_closure_check(evaluation: Evaluation, benchmark: Benchmark) -> StructuralCheck

Sanity-check that all self-implications are in I_M by construction.

Per Definition 3 clause i, every implication ⟨Γ, Δ⟩ with Γ ∩ Δ ≠ ∅ is in I_M regardless of what the model says. This check counts such items in the benchmark and confirms they're structurally satisfied; it doesn't need to consult the model's verdict (the framework guarantees it). Reported anyway because the count itself is informative — a benchmark with zero self-implications has different structural texture from one with many.

Source code in src/infereval/structure.py
def containment_closure_check(
    evaluation: Evaluation, benchmark: Benchmark
) -> StructuralCheck:
    """Sanity-check that all self-implications are in ``I_M`` by construction.

    Per Definition 3 clause i, every implication ⟨Γ, Δ⟩ with
    ``Γ ∩ Δ ≠ ∅`` is in ``I_M`` regardless of what the model says.
    This check counts such items in the benchmark and confirms they're
    structurally satisfied; it doesn't *need* to consult the model's
    verdict (the framework guarantees it). Reported anyway because the
    count itself is informative — a benchmark with zero self-implications
    has different structural texture from one with many.
    """
    self_implications = [
        it
        for it in benchmark.items
        if set(it.premises) & set(it.conclusions)
    ]
    # Per construction, every such item is in I_M; rate is trivially 1.0
    # whenever items_checked > 0. We report the count for visibility.
    return StructuralCheck(
        name="containment_closure",
        items_checked=len(self_implications),
        items_satisfying=len(self_implications),
        anomalies=(),
    )
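
The filter at the heart of this check is a set intersection between premises and conclusions. A minimal sketch on plain dicts follows; the item ids and the bearer id "ra" are invented for illustration.

```python
items = [
    {"id": "i1", "premises": ["sa"], "conclusions": ["sa", "ra"]},
    {"id": "i2", "premises": ["sa"], "conclusions": ["ra"]},
    {"id": "i3", "premises": ["sa", "ra"], "conclusions": ["ra"]},
]

# An item is a self-implication when a bearer appears on both sides.
self_implications = [
    it["id"] for it in items if set(it["premises"]) & set(it["conclusions"])
]
print(self_implications)  # ['i1', 'i3']
```

Because the check never reads the model's verdict, items_checked and items_satisfying are always equal and the anomalies tuple is always empty.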

infereval.structure.rsr_role_consistency_check

rsr_role_consistency_check(evaluation: Evaluation, benchmark: Benchmark) -> StructuralCheck

Check that role-tagged items' model verdicts match the role's prediction.

For each item carrying a role tag (supporter / defeater / irrelevant-addition) AND an rsr_target, looks up the "base-inference" item with the same target and uses the model's verdict on the base to predict the expected verdict on the role-tagged item:

  • supporter is supposed to strengthen the base verdict. If the base is GOOD, the supporter should remain GOOD; if the base is BAD, the supporter is excluded (a supporter can't strengthen a bad inference — that's a defeater being treated wrongly).
  • defeater is supposed to flip the base verdict. If the base is GOOD, the defeater should be BAD.
  • irrelevant-addition is supposed to preserve the base verdict under RSR. If the base is GOOD, the irrelevant addition should stay GOOD; if the base is BAD, it should stay BAD.

Anomalies are items whose model verdict contradicts the expected role-conditional verdict. Items where the base or the role-tagged item itself has an ABSTAIN verdict are excluded from the check (the role's prediction is undefined relative to abstention).

Source code in src/infereval/structure.py
def rsr_role_consistency_check(
    evaluation: Evaluation, benchmark: Benchmark
) -> StructuralCheck:
    """Check that role-tagged items' model verdicts match the role's prediction.

    For each item carrying a role tag (``supporter`` / ``defeater`` /
    ``irrelevant-addition``) AND an ``rsr_target``, looks up the
    "base-inference" item with the same target and uses the model's
    verdict on the base to predict the expected verdict on the role-tagged
    item:

    - ``supporter`` is supposed to *strengthen* the base verdict. If the
      base is GOOD, the supporter should remain GOOD; if the base is
      BAD, the supporter is excluded (a supporter can't strengthen a
      bad inference — that's a defeater being treated wrongly).
    - ``defeater`` is supposed to *flip* the base verdict. If the base
      is GOOD, the defeater should be BAD.
    - ``irrelevant-addition`` is supposed to preserve the base verdict
      under RSR. If the base is GOOD, the irrelevant addition should
      stay GOOD; if the base is BAD, it should stay BAD.

    Anomalies are items whose model verdict contradicts the expected
    role-conditional verdict. Items where the base or the role-tagged
    item itself has an ABSTAIN verdict are excluded from the check
    (the role's prediction is undefined relative to abstention).
    """
    # Index items by id, evaluation-side and benchmark-side.
    eval_by_id = {it.id: it for it in evaluation.items}

    # Group benchmark items by their rsr_target's canonical key, then
    # within each target separate the base-inference reference items
    # from the role-tagged items we're going to check.
    targets: dict[
        tuple[tuple[str, ...], tuple[str, ...]],
        dict[str, list[BenchmarkItem]],
    ] = defaultdict(lambda: {"base": [], "checked": []})

    for it in benchmark.items:
        if it.rsr_target is None:
            continue
        key = (
            tuple(sorted(it.rsr_target.X)),
            tuple(sorted(it.rsr_target.A)),
        )
        role = _role_of(it)
        if role == _ROLE_BASE:
            targets[key]["base"].append(it)
        elif role in {_ROLE_SUPPORTER, _ROLE_DEFEATER, _ROLE_IRRELEVANT}:
            targets[key]["checked"].append(it)

    anomalies: list[StructuralAnomaly] = []
    items_checked = 0
    items_satisfying = 0

    for key, groups in targets.items():
        base_items = groups["base"]
        checked_items = groups["checked"]
        if not base_items or not checked_items:
            # Need a base reference and at least one role-tagged item
            # to run the check on this target.
            continue

        # If multiple base items exist, use their shared verdict, or
        # skip the target when they disagree (base_case_stability_check
        # surfaces the divergence separately).
        # Partial-evaluation guard: a benchmark item carrying an
        # rsr_target may not appear in this evaluation (e.g. when the
        # eval was produced from a paraphrase-cycle variant or a tag
        # filter). Skip those items rather than raise; the metrics
        # contract elsewhere in the package is "missing data is
        # surfaced via warnings + None, not exceptions".
        present_base_items = [b for b in base_items if b.id in eval_by_id]
        if len(present_base_items) < len(base_items):
            missing = [b.id for b in base_items if b.id not in eval_by_id]
            log.warning(
                "rsr_role_consistency_check: skipping base items absent from "
                "evaluation %r: %s",
                evaluation.id,
                missing,
            )
        if not present_base_items:
            continue
        base_verdicts = [eval_by_id[b.id].model_verdict for b in present_base_items]
        if len(set(base_verdicts)) > 1:
            continue  # base is unstable; can't predict roles
        base_verdict = base_verdicts[0]
        if base_verdict == Verdict.ABSTAIN:
            # Base is non-substantive; role predictions are undefined.
            continue

        for it in checked_items:
            if it.id not in eval_by_id:
                log.warning(
                    "rsr_role_consistency_check: skipping role-tagged item "
                    "%r absent from evaluation %r",
                    it.id,
                    evaluation.id,
                )
                continue
            eval_item = eval_by_id[it.id]
            actual = eval_item.model_verdict
            if actual == Verdict.ABSTAIN:
                # The role-tagged item itself is non-substantive; skip.
                continue
            role = _role_of(it)
            assert role is not None  # checked above
            expected = _expected_verdict_for_role(role, base_verdict)
            if expected is None:
                continue  # role doesn't make a prediction here
            items_checked += 1
            if actual == expected:
                items_satisfying += 1
            else:
                target_str = (
                    f"⟨{{{','.join(key[0])}}}, {{{','.join(key[1])}}}⟩"
                )
                anomalies.append(
                    StructuralAnomaly(
                        item_id=it.id,
                        expected=str(expected),
                        actual=str(actual),
                        explanation=(
                            f"item is tagged '{role}' on target {target_str} "
                            f"with base verdict {base_verdict}; role predicts "
                            f"{expected} but model returned {actual}"
                        ),
                    )
                )

    return StructuralCheck(
        name="rsr_role_consistency",
        items_checked=items_checked,
        items_satisfying=items_satisfying,
        anomalies=tuple(anomalies),
    )
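
The helper `_expected_verdict_for_role` is not rendered on this page. The sketch below encodes only the role rules stated in the docstring, using plain strings. The function name `expected_verdict` is hypothetical, and returning no prediction for a defeater on a BAD base is an assumption, since the docstring leaves that case unspecified.

```python
from __future__ import annotations


def expected_verdict(role: str, base: str) -> str | None:
    """Predicted verdict for a role-tagged item, or None when no prediction applies."""
    if base not in ("good", "bad"):
        return None  # abstaining base: role predictions are undefined
    if role == "irrelevant-addition":
        return base  # preserves the base verdict in both directions
    if role == "supporter":
        return "good" if base == "good" else None  # BAD base is excluded
    if role == "defeater":
        # Documented case: GOOD base flips to BAD.
        # Assumption: a BAD base yields no prediction.
        return "bad" if base == "good" else None
    return None


for role in ("supporter", "defeater", "irrelevant-addition"):
    print(role, expected_verdict(role, "good"), expected_verdict(role, "bad"))
```

A return value of None corresponds to the `continue` branch in the listing: the item is left out of items_checked and is never reported as an anomaly.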

infereval.structure.base_case_stability_check

base_case_stability_check(evaluation: Evaluation, benchmark: Benchmark) -> StructuralCheck

When a target has multiple base-inference items, the model should agree on all of them.

Anomalies surface targets where the model gives different verdicts on multiple base-inference items, since the base verdict is what the rest of the RSR machinery is anchored to.

Source code in src/infereval/structure.py
def base_case_stability_check(
    evaluation: Evaluation, benchmark: Benchmark
) -> StructuralCheck:
    """When a target has multiple ``base-inference`` items, the model should agree on all of them.

    Anomalies surface targets where the model gives different verdicts
    on multiple base-inference items, since the base verdict is what
    the rest of the RSR machinery is anchored to.
    """
    eval_by_id = {it.id: it for it in evaluation.items}
    base_by_target: dict[
        tuple[tuple[str, ...], tuple[str, ...]], list[BenchmarkItem]
    ] = defaultdict(list)

    for it in benchmark.items:
        if it.rsr_target is None or _role_of(it) != _ROLE_BASE:
            continue
        key = (
            tuple(sorted(it.rsr_target.X)),
            tuple(sorted(it.rsr_target.A)),
        )
        base_by_target[key].append(it)

    anomalies: list[StructuralAnomaly] = []
    items_checked = 0
    items_satisfying = 0
    for key, bases in base_by_target.items():
        # Partial-evaluation guard: same as in rsr_role_consistency_check.
        present_bases = [b for b in bases if b.id in eval_by_id]
        if len(present_bases) < len(bases):
            missing = [b.id for b in bases if b.id not in eval_by_id]
            log.warning(
                "base_case_stability_check: skipping base items absent from "
                "evaluation %r: %s",
                evaluation.id,
                missing,
            )
        if len(present_bases) < 2:
            continue  # nothing to check (need ≥ 2 present bases per target)
        verdicts = [eval_by_id[b.id].model_verdict for b in present_bases]
        unique = set(verdicts)
        items_checked += len(present_bases)
        if len(unique) == 1:
            items_satisfying += len(present_bases)
        else:
            target_str = (
                f"⟨{{{','.join(key[0])}}}, {{{','.join(key[1])}}}⟩"
            )
            # Flag every present base item in the divergent set as an anomaly.
            for b, v in zip(present_bases, verdicts, strict=True):
                anomalies.append(
                    StructuralAnomaly(
                        item_id=b.id,
                        expected=f"a single shared verdict across base-inferences on {target_str}",
                        actual=f"{v} (other verdicts on this target: {unique - {v}})",
                        explanation=(
                            f"target {target_str} has {len(present_bases)} base-inference "
                            f"items present in the evaluation with verdicts "
                            f"{[str(v) for v in verdicts]} — the base case is "
                            f"structurally unstable"
                        ),
                    )
                )

    return StructuralCheck(
        name="base_case_stability",
        items_checked=items_checked,
        items_satisfying=items_satisfying,
        anomalies=tuple(anomalies),
    )
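
The counting rule in the listing (every present base item of a target counts toward items_checked, and all of them are flagged when the target diverges) can be traced on a small hand-made input. The dict layout and ids below are invented for the sketch.

```python
from collections import defaultdict

base_items = [
    {"id": "b1", "X": ["sa"], "A": ["ra"], "verdict": "good"},
    {"id": "b2", "X": ["sa"], "A": ["ra"], "verdict": "bad"},
    {"id": "b3", "X": ["sb"], "A": ["rb"], "verdict": "good"},
    {"id": "b4", "X": ["sb"], "A": ["rb"], "verdict": "good"},
    {"id": "b5", "X": ["sc"], "A": ["rc"], "verdict": "good"},  # lone base: skipped
]

# Group base items by the canonical (sorted) form of their target.
by_target = defaultdict(list)
for it in base_items:
    key = (tuple(sorted(it["X"])), tuple(sorted(it["A"])))
    by_target[key].append(it)

items_checked = 0
items_satisfying = 0
unstable: list[str] = []
for bases in by_target.values():
    if len(bases) < 2:
        continue  # a single base item cannot disagree with itself
    items_checked += len(bases)
    if len({b["verdict"] for b in bases}) == 1:
        items_satisfying += len(bases)
    else:
        unstable.extend(b["id"] for b in bases)

print(items_checked, items_satisfying, unstable)  # 4 2 ['b1', 'b2']
```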

infereval.structure.StructuralReport dataclass

StructuralReport(evaluation_id: str, benchmark_id: str, checks: tuple[StructuralCheck, ...] = tuple())

Bundle of structural checks run against an Evaluation + Benchmark pair.

all_satisfied property

all_satisfied: bool

True iff every check has rate == 1.0 (and no anomalies).

infereval.structure.StructuralCheck dataclass

StructuralCheck(name: str, items_checked: int, items_satisfying: int, anomalies: tuple[StructuralAnomaly, ...] = ())

Result of one structural property check against an Evaluation.

name instance-attribute

name: str

Short identifier, e.g. "containment_closure".

rate property

rate: float | None

Proportion of checked items satisfying the property; None when no items checked.
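
The source of `rate` is not rendered here. A minimal sketch consistent with the description above, using a hypothetical `CheckSketch` dataclass in place of :class:StructuralCheck:

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class CheckSketch:
    name: str
    items_checked: int
    items_satisfying: int

    @property
    def rate(self) -> float | None:
        """Proportion satisfying the property, or None when nothing was checked."""
        if self.items_checked == 0:
            return None
        return self.items_satisfying / self.items_checked


print(CheckSketch("base_case_stability", 4, 2).rate)  # 0.5
print(CheckSketch("rsr_role_consistency", 0, 0).rate)  # None
```

Returning None instead of 0.0 or 1.0 keeps "nothing to check" distinguishable from "everything failed" and "everything passed".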

infereval.structure.StructuralAnomaly dataclass

StructuralAnomaly(item_id: str, expected: str, actual: str, explanation: str)

One item that failed a structural check, with diagnostic context.

expected instance-attribute

expected: str

Human-readable description of what the structural rule predicted.

actual instance-attribute

actual: str

What the model's verdict actually was.

explanation instance-attribute

explanation: str

Why this is flagged as an anomaly.

Factor-effects model (R7 / R12)

infereval.modeling.fit_factor_model

fit_factor_model(evaluation: Evaluation, benchmark: Benchmark, *, reference: str = _DEFAULT_REFERENCE) -> ModelFit

Logistic regression of agreement on declared factor levels.

Parameters

evaluation: The Evaluation (infereval.evaluation.Evaluation) to model. Each item's per-sample verdicts are unrolled into separate observations; samples with ABSTAIN verdicts are dropped.

benchmark: The source Benchmark (infereval.benchmark.Benchmark). Must declare at least one factor in benchmark.factors (per Phase 1.1).

reference: Which analyst column defines "agreement". "consensus" (default) uses the per-item majority of the analyst panel (abstain on tie). An "analyst:<id>" string picks a single analyst column.

Returns

ModelFit

Raises

ModelingError: If the benchmark declares no factors, if the [stats] extra (pandas and statsmodels) is not installed, if no sample observations remain after dropping abstains, or if the design matrix has no more observations than parameters (reported as rank-deficient).

Source code in src/infereval/modeling.py
def fit_factor_model(
    evaluation: Evaluation,
    benchmark: Benchmark,
    *,
    reference: str = _DEFAULT_REFERENCE,
) -> ModelFit:
    """Logistic regression of agreement on declared factor levels.

    Parameters
    ----------
    evaluation
        The :class:`~infereval.evaluation.Evaluation` to model. Each
        item's per-sample verdicts are unrolled into separate
        observations; samples with ABSTAIN verdicts are dropped.
    benchmark
        The source :class:`~infereval.benchmark.Benchmark`. Must declare
        at least one factor in ``benchmark.factors`` (per Phase 1.1).
    reference
        Which analyst column defines "agreement". ``"consensus"``
        (default) uses the per-item majority of the analyst panel
        (abstain on tie). An ``"analyst:<id>"`` string picks a single
        analyst column.

    Returns
    -------
    ModelFit

    Raises
    ------
    ModelingError
        If the benchmark declares no factors, if the ``[stats]`` extra
        (pandas, statsmodels) is not installed, if no sample observations
        remain after dropping abstains, or if the design matrix has no
        more observations than parameters (reported as rank-deficient).
    """
    if not benchmark.factors:
        raise ModelingError(
            "Benchmark declares no factors. infereval model needs at least "
            "one factor in `benchmark.factors` (Phase 1.1) to fit against. "
            "Re-author the benchmark with factor declarations, or use "
            "`infereval metrics --by-tag` for a tag-based decomposition."
        )

    # Late import so the rest of the package works without pandas or statsmodels.
    try:
        import pandas as pd  # type: ignore[import-untyped]
        import statsmodels.api as sm  # type: ignore[import-untyped]
    except ImportError as exc:
        raise ModelingError(
            "infereval.modeling requires the [stats] extra: "
            "pip install 'infereval[stats]'"
        ) from exc

    # 1. Build the long-format observation table.
    rows = _build_observation_rows(evaluation, benchmark, reference=reference)
    n_dropped = sum(1 for r in rows if r["agrees"] is None)
    rows = [r for r in rows if r["agrees"] is not None]
    if not rows:
        raise ModelingError(
            "No substantive observations after dropping abstain samples; "
            "cannot fit a model on an all-abstain dataset."
        )

    df = pd.DataFrame(rows)

    # 2. One-hot encode each factor; the first category (in the order given
    # by benchmark.factors[f]) is dropped as the baseline via drop_first=True.
    factor_names = sorted(benchmark.factors)
    # Ensure every level is observed as a category so dummies are stable
    # even when the dataset doesn't contain every level.
    for f in factor_names:
        df[f] = pd.Categorical(df[f], categories=benchmark.factors[f], ordered=False)

    design_parts = [pd.Series(1.0, index=df.index, name="Intercept")]
    factor_to_cols: dict[str, list[str]] = {}
    for f in factor_names:
        dummies = pd.get_dummies(df[f], prefix=f, drop_first=True, dtype=float)
        design_parts.append(dummies)
        factor_to_cols[f] = list(dummies.columns)
    design_X = pd.concat(design_parts, axis=1)  # noqa: N806 -- statistical convention
    y = df["agrees"].astype(int).to_numpy()

    if design_X.shape[0] <= design_X.shape[1]:
        raise ModelingError(
            f"Design matrix is rank-deficient: {design_X.shape[0]} observations vs. "
            f"{design_X.shape[1]} parameters. Add items or declare fewer levels."
        )

    # 3. Fit logistic regression with item-clustered SEs.
    model = sm.Logit(y, design_X)
    fit = model.fit(
        method="bfgs",
        disp=False,
        cov_type="cluster",
        cov_kwds={"groups": df["item_id"].to_numpy()},
        maxiter=200,
    )

    # 4. Fit the null model for pseudo-R² and overall LR test.
    null_model = sm.Logit(y, design_X[["Intercept"]])
    null_fit = null_model.fit(method="bfgs", disp=False, maxiter=200)

    deviance = float(-2 * fit.llf)
    null_deviance = float(-2 * null_fit.llf)
    if null_fit.llf == 0:
        pseudo_r2: float | None = None
    else:
        pseudo_r2 = float(1 - fit.llf / null_fit.llf)

    # 5. Per-factor joint Wald tests via the wald_test API.
    factor_wald: dict[str, float] = {}
    for f in factor_names:
        cols = factor_to_cols[f]
        if not cols:
            continue
        constraint = " = 0, ".join(f"{c}" for c in cols) + " = 0"
        try:
            wald = fit.wald_test(constraint, scalar=True)
            p = float(wald.pvalue)
        except Exception:  # noqa: BLE001 — Wald can fail when factor is collinear
            p = float("nan")
        factor_wald[f] = p

    # 6. Per-level effects table.
    params = fit.params
    bse = fit.bse
    pvalues = fit.pvalues
    conf = fit.conf_int()
    effects: list[FactorEffect] = []
    for f in factor_names:
        for col in factor_to_cols[f]:
            level = col[len(f) + 1 :]  # strip "factor_" prefix
            ci_low, ci_high = conf.loc[col]
            effects.append(
                FactorEffect(
                    factor=f,
                    level=level,
                    coef=float(params[col]),
                    std_err=float(bse[col]),
                    z_value=float(params[col] / bse[col]) if bse[col] else float("nan"),
                    p_value=float(pvalues[col]),
                    conf_int_low=float(ci_low),
                    conf_int_high=float(ci_high),
                )
            )

    notes = (
        "Fixed-effects logistic regression with item-clustered standard errors.",
        "Approximates the per-item random-effect structure of a proper GLMM.",
        f"Reference for 'agreement': {reference!r}.",
    )

    return ModelFit(
        n_observations=int(design_X.shape[0]),
        n_items=int(df["item_id"].nunique()),
        n_factors=len(factor_names),
        n_dropped_abstain=n_dropped,
        deviance=deviance,
        null_deviance=null_deviance,
        pseudo_r2=pseudo_r2,
        effects=tuple(effects),
        factor_wald=factor_wald,
        notes=notes,
    )

infereval.modeling.ModelFit dataclass

ModelFit(n_observations: int, n_items: int, n_factors: int, n_dropped_abstain: int, deviance: float, null_deviance: float, pseudo_r2: float | None, effects: tuple[FactorEffect, ...], factor_wald: dict[str, float], notes: tuple[str, ...])

Result of fitting the factor-effects logistic regression.

n_observations instance-attribute

n_observations: int

Number of (item, sample) rows used in the fit.

n_items instance-attribute

n_items: int

Number of distinct items contributing observations (= number of clusters).

n_dropped_abstain instance-attribute

n_dropped_abstain: int

Sample observations excluded because the verdict was ABSTAIN.

deviance instance-attribute

deviance: float

-2 × log-likelihood of the fitted model.

null_deviance instance-attribute

null_deviance: float

-2 × log-likelihood of the intercept-only model.

pseudo_r2 instance-attribute

pseudo_r2: float | None

McFadden's pseudo-R² = 1 - log-lik(full) / log-lik(null); None when the null log-likelihood is 0.
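
The deviance and pseudo-R² fields are simple functions of the two log-likelihoods. A minimal sketch of that arithmetic follows; the function names are hypothetical and the numbers are illustrative, not output from a real fit. The None branch mirrors the guard in the fit_factor_model source, which skips the ratio when the null log-likelihood is exactly 0.

```python
from __future__ import annotations


def deviance(llf: float) -> float:
    """Deviance as defined on this page: -2 x log-likelihood."""
    return -2.0 * llf


def mcfadden_pseudo_r2(llf_full: float, llf_null: float) -> float | None:
    """McFadden's pseudo-R^2; None when the null log-likelihood is 0."""
    if llf_null == 0:
        return None
    return 1.0 - llf_full / llf_null


# Illustrative log-likelihoods only.
llf_full, llf_null = -40.0, -50.0
r2 = mcfadden_pseudo_r2(llf_full, llf_null)  # 1 - (-40 / -50), about 0.2
```

With these inputs the fitted deviance is 80.0 and the null deviance is 100.0, so the fitted model accounts for roughly a fifth of the null log-likelihood.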

effects instance-attribute

effects: tuple[FactorEffect, ...]

One row per non-baseline level of each declared factor.
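
The "non-baseline level" wording follows from drop-first dummy coding. The sketch below reproduces that coding in plain Python for a single factor, assuming the baseline is the first level in the declared list (pandas drops the first category when drop_first=True). The factor name "depth" and its levels are invented for illustration, and drop_first_dummies is a hypothetical helper, not part of infereval.

```python
from __future__ import annotations


def drop_first_dummies(
    factor: str, levels: list[str], observed: list[str]
) -> list[dict[str, float]]:
    """One-hot encode `observed`, omitting the first declared level as baseline."""
    non_baseline = levels[1:]
    rows: list[dict[str, float]] = []
    for value in observed:
        if value not in levels:
            raise ValueError(f"undeclared level {value!r} for factor {factor!r}")
        # Column names follow the "<factor>_<level>" pattern used by get_dummies.
        rows.append({f"{factor}_{lvl}": float(value == lvl) for lvl in non_baseline})
    return rows


# Hypothetical factor with three declared levels; "low" acts as the baseline.
rows = drop_first_dummies("depth", ["low", "mid", "high"], ["low", "high", "mid"])
```

A factor with three levels yields two dummy columns, and hence two effects rows; an observation at the baseline level has zeros in both columns.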

factor_wald instance-attribute

factor_wald: dict[str, float]

Per-factor joint Wald p-value testing 'this factor has no effect'.
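
One common way to consume factor_wald is to list the factors whose joint test rejects at a chosen level. The sketch below does that on a plain dict; significant_factors is a hypothetical helper and the p-values are invented. NaN entries (which the source stores when the Wald test fails) are skipped rather than treated as significant.

```python
from __future__ import annotations

import math


def significant_factors(factor_wald: dict[str, float], alpha: float = 0.05) -> list[str]:
    """Names of factors whose joint Wald p-value is below alpha, NaNs excluded."""
    return sorted(
        name
        for name, p in factor_wald.items()
        if not math.isnan(p) and p < alpha
    )


# Illustrative p-values; NaN marks a factor whose Wald test could not be computed.
example = {"depth": 0.003, "polarity": 0.41, "domain": float("nan")}
flagged = significant_factors(example)
```

Only "depth" is flagged here; "domain" is neither significant nor non-significant, merely untested.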

notes instance-attribute

notes: tuple[str, ...]

Methodology notes / caveats surfaced for the CLI report.

infereval.modeling.FactorEffect dataclass

FactorEffect(factor: str, level: str, coef: float, std_err: float, z_value: float, p_value: float, conf_int_low: float, conf_int_high: float)

One row of the fitted coefficient table.

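Coefficients are log-odds relative to the baseline level of the same factor: the first level listed in benchmark.factors for that factor, which drop-first dummy coding omits (this is the alphabetically-first level only when levels are declared in sorted order). A positive coef means higher odds of agreement than at the baseline level.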
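
Because coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to report. A minimal sketch, with a hypothetical helper name and illustrative inputs rather than values from a real FactorEffect:

```python
import math


def odds_ratio(coef: float, ci_low: float, ci_high: float) -> tuple[float, float, float]:
    """Exponentiate a log-odds coefficient and its confidence bounds."""
    return math.exp(coef), math.exp(ci_low), math.exp(ci_high)


# Illustrative values only: coef = ln(2) means the odds of agreement double
# relative to the baseline level.
point, low, high = odds_ratio(math.log(2.0), 0.0, math.log(4.0))
```

An interval whose lower bound is at or below 1.0 on the odds-ratio scale corresponds to a log-odds interval that includes 0.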

Sensitivity sweeps (R11)

infereval.sweep.run_sweep

run_sweep(benchmark: Benchmark, provider: Provider, *, parameter: str, values: list[object], out_dir: Path, config: EndorsementConfig | None = None, params: ProviderParams | None = None, run_id_prefix: str | None = None) -> SweepResult

Run evaluate once per value and bundle the metrics.

Per-value outputs land in out_dir with deterministic names so a re-run replaces them in place.
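
The deterministic naming can be sketched as follows. The path templates and the value sanitisation mirror the run_sweep source on this page; sweep_paths itself is a hypothetical helper, and "temperature" is used only as an example parameter name (whether it is among the supported sweep parameters is an assumption).

```python
from pathlib import Path


def sweep_paths(out_dir: Path, parameter: str, value: object) -> tuple[Path, Path]:
    """Return (evaluation JSON path, run-log path) for one swept value."""
    # Same sanitisation as the source: slashes and spaces are replaced.
    value_str = str(value).replace("/", "-").replace(" ", "_")
    stem = f"sweep-{parameter}={value_str}"
    return out_dir / f"{stem}-eta.json", out_dir / f"{stem}-run.jsonl"


eta_path, log_path = sweep_paths(Path("out"), "temperature", 0.7)
```

Because the names depend only on the parameter and the value, running the same sweep again writes to the same two files per value.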

Source code in src/infereval/sweep.py
def run_sweep(
    benchmark: Benchmark,
    provider: Provider,
    *,
    parameter: str,
    values: list[object],
    out_dir: Path,
    config: EndorsementConfig | None = None,
    params: ProviderParams | None = None,
    run_id_prefix: str | None = None,
) -> SweepResult:
    """Run :func:`evaluate` once per value and bundle the metrics.

    Per-value outputs land in ``out_dir`` with deterministic names so a
    re-run replaces them in place.
    """
    if parameter not in _SUPPORTED_PARAMS:
        raise SweepError(f"unsupported sweep parameter: {parameter!r}")
    if not values:
        raise SweepError("--values must contain at least one value")

    out_dir.mkdir(parents=True, exist_ok=True)
    base_config = config or EndorsementConfig()
    base_params = params or ProviderParams()

    rows: list[SweepRow] = []
    for value in values:
        cfg, par, variant = _apply_value(parameter, value, base_config, base_params)

        # Render value into a filename-safe form.
        value_str = str(value).replace("/", "-").replace(" ", "_")
        eta_path = out_dir / f"sweep-{parameter}={value_str}-eta.json"
        log_path = out_dir / f"sweep-{parameter}={value_str}-run.jsonl"

        rid = (
            f"{run_id_prefix}-{parameter}={value_str}"
            if run_id_prefix
            else f"sweep-{parameter}={value_str}"
        )

        eta = evaluate(
            benchmark,
            provider,
            config=cfg,
            params=par,
            variant=variant,
            run_id=rid,
            log_path=log_path,
        )
        eta.dump(eta_path)

        ref = consensus_reference(eta)
        kc = cohens_kappa(eta, ref)
        kf = fleiss_kappa(eta)
        cov_val = coverage(eta)
        n_agreement = sum(
            1
            for i, it in enumerate(eta.items)
            if it.model_verdict == ref(i)
        )

        rows.append(
            SweepRow(
                value=value,
                coverage=cov_val,
                kappa_c=kc,
                kappa_f=kf,
                n_agreement=n_agreement,
                n_total=eta.n,
                eta_path=eta_path,
            )
        )

    return SweepResult(parameter=parameter, rows=tuple(rows))

infereval.sweep.SweepResult dataclass

SweepResult(parameter: str, rows: tuple[SweepRow, ...])

Bundle of per-value rows + an overall stability assessment.

parameter instance-attribute

parameter: str

Name of the swept parameter.

kappa_c_range property

kappa_c_range: float | None

Max-minus-min of κ_C across rows; None if any κ_C is None.
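
A minimal sketch of the range computation described above, operating on a plain list of per-row κ_C values. The function is a hypothetical stand-in for the property, and returning None for an empty list is an assumption the page does not state.

```python
from __future__ import annotations


def kappa_c_range(kappas: list[float | None]) -> float | None:
    """Max-minus-min of the values; None if any value is None (or none given)."""
    if not kappas or any(k is None for k in kappas):
        return None
    values = [k for k in kappas if k is not None]
    return max(values) - min(values)


# Illustrative kappa values only.
spread = kappa_c_range([0.62, 0.70, 0.66])
undefined = kappa_c_range([0.62, None, 0.66])
```

A single undefined κ_C makes the whole range undefined, so the sweep cannot report a misleadingly narrow spread.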

stability_verdict property

stability_verdict: str

Human-readable single-sentence assessment of κ_C variation.

infereval.sweep.SweepRow dataclass

SweepRow(value: object, coverage: float, kappa_c: float | None, kappa_f: float | None, n_agreement: int, n_total: int, eta_path: Path)

One row of the sweep summary: parameters + metrics for one value.

value instance-attribute

value: object

The swept parameter's value for this row, type-coerced per the parameter.

n_agreement instance-attribute

n_agreement: int

Count of items where model_verdict == consensus_reference.

eta_path instance-attribute

eta_path: Path

On-disk location of the per-value evaluation JSON.

Construct-validity report (R16–R21)

infereval.report.ConstructValidityClaims

Bases: BaseModel

Top-level container for the analyst's construct-validity declarations.

stub classmethod

stub() -> ConstructValidityClaims

Return an obviously-placeholder stub for --init-claims.

Source code in src/infereval/report.py
@classmethod
def stub(cls) -> ConstructValidityClaims:
    """Return an obviously-placeholder stub for ``--init-claims``."""
    return cls(
        mastery_sense=MasterySenseClaim(
            sense="evaluative",
            description="FILL IN: the analyst's articulation of what mastery means here.",
        ),
        scope=ScopeClaim(
            scope="items_in_benchmark",
            justification="FILL IN: why this scope is appropriate.",
        ),
        constitution=ConstitutionClaim(
            position="evidence_of_mastery",
            justification="FILL IN: brief explanation of the position taken.",
        ),
        carving=CarvingClaim(
            acknowledges_carving_indexed=False,
            notes="FILL IN if acknowledges_carving_indexed=true.",
        ),
        competing_explanations=CompetingExplanationChecks(),
    )

infereval.report.MasterySenseClaim

Bases: BaseModel

R16: which sense of mastery the claim is about.

sense instance-attribute

sense: Literal['evaluative', 'generative', 'standing', 'combination']
  • evaluative: endorsements-when-asked (the methodology's direct measurement).
  • generative: inferential behavior in unprompted production.
  • standing: a dispositional competence underlying both.
  • combination: a mix; describe explicitly in description.

description instance-attribute

description: str

One to three sentences, the analyst's own articulation.

infereval.report.ScopeClaim

Bases: BaseModel

R17: scope the mastery claim applies over.

scope instance-attribute

scope: Literal['items_in_benchmark', 'domain_D_as_sampled', 'general_capacity']
  • items_in_benchmark: the claim is about the specific items in β.
  • domain_D_as_sampled: the claim generalises to D as sampled by β.
  • general_capacity: the claim is about inferential mastery as a general capacity.

justification instance-attribute

justification: str

Why this scope is appropriate given β and the methodology used.

infereval.report.ConstitutionClaim

Bases: BaseModel

R18: is agreement evidence of mastery or constitutive of it?

position instance-attribute

position: Literal['evidence_of_mastery', 'constitutive_of_mastery']
  • evidence_of_mastery: agreement is evidence for a deeper underlying property.
  • constitutive_of_mastery: agreement (with structural coherence) IS mastery (Brandom's structural-behavioural characterisation).

justification instance-attribute

justification: str

Brief explanation of the position taken and why.

infereval.report.CarvingClaim

Bases: BaseModel

R19: carving-indexed framing of in-principle claims.

acknowledges_carving_indexed instance-attribute

acknowledges_carving_indexed: bool

True iff any in-principle claims are framed in the carving-indexed form Remark 10 specifies.

notes class-attribute instance-attribute

notes: str = ''

Required when acknowledges_carving_indexed is True; document the carving used or pointers to the discussion.
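
How the carving fields interact with scope can be sketched as a single predicate. This restates the carving branch of the compute_verdict source on this page, under the assumption that nothing else inspects these fields; carving_ok is written here as a free function for illustration only.

```python
def carving_ok(scope: str, acknowledges_carving_indexed: bool, notes: str) -> bool:
    """True when the carving claim is adequately documented for this scope."""
    if scope == "items_in_benchmark":
        # The carving check applies only when scope reaches beyond the items.
        return True
    # Beyond the items: the framing must be acknowledged AND documented.
    return acknowledges_carving_indexed and bool(notes.strip())


stub_like = carving_ok("items_in_benchmark", False, "FILL IN")
wider_without_notes = carving_ok("general_capacity", True, "   ")
wider_documented = carving_ok("domain_D_as_sampled", True, "Carving per Remark 10.")
```

Whitespace-only notes count as missing, so acknowledging the carving without documenting it does not pass.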

infereval.report.CompetingExplanationChecks

Bases: BaseModel

R4, R8, R9, R11, R13, R14, R15: which checks were actually run.

All fields default to False (the conservative posture — the framework assumes no check was done unless the analyst explicitly declares it). The report's Unaddressed competing explanations section lists every False.
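
The "lists every False" behaviour amounts to filtering the declared flags. A small sketch on a plain dict follows; structural_check_run is the one field name that appears in the source on this page, while sweep_run and paraphrase_check_run are hypothetical names used only to make the example concrete.

```python
from __future__ import annotations


def unaddressed(checks: dict[str, bool]) -> list[str]:
    """Names of checks the analyst has not declared as run, in sorted order."""
    return sorted(name for name, ran in checks.items() if not ran)


# Only structural_check_run is a confirmed field name; the others are invented.
declared = {
    "structural_check_run": True,
    "sweep_run": False,
    "paraphrase_check_run": False,
}
still_open = unaddressed(declared)
```

Because every flag defaults to False, an untouched claims file reports all checks as unaddressed.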

infereval.report.ReportVerdict dataclass

ReportVerdict(label: Literal['defensible', 'partially_defensible', 'not_defensible'], one_liner: str, rationale: list[str])

Deterministic summary verdict computed from the claims + evidence.

infereval.report.compute_verdict

compute_verdict(claims: ConstructValidityClaims, *, structure_report: dict[str, object] | None = None, benchmark: Benchmark | None = None) -> ReportVerdict

Return the deterministic summary verdict for the claims + evidence.

The verdict is computed against the claims file together with the supplied analytical artifacts. When no artifacts are passed (structure_report=None, benchmark=None), the verdict is computed from claims alone and a "verdict computed unaudited" rationale line is added so the reader can tell.

The deterministic rule:

  • "defensible" iff every check required by the declared scope is marked True AND no audited check returned a failing artifact AND the carving claim is explicit (acknowledges = True iff any in-principle claims are being made) AND the benchmark supports an inter-analyst baseline when one is required by the scope.
  • "not_defensible" iff more than half of the required checks are missing.
  • "partially_defensible" otherwise — including the "ran but didn't pass" cases (structural anomalies present, single-analyst benchmark with items_in_benchmark scope).

Audit caps (added in v0.5.3 from external review):

  • If structure_report is supplied AND structural_check_run is marked True AND the report contains any anomaly, the structural check is treated as failing — the verdict is capped at partially_defensible with a rationale line naming the count.
  • If benchmark is supplied AND the scope is items_in_benchmark AND len(benchmark.analysts) < 2, the verdict is capped at partially_defensible with a rationale line surfacing the panel size — agreement with a single analyst cannot inherit the convergent-validity guarantee that multi-analyst agreement carries.

Backwards-compatible callers that don't pass the artifacts get behaviour identical to v0.5.2 except for the additional "verdict computed unaudited" rationale line.
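
The first audit cap depends on counting anomalies across the supplied structure report. The sketch below isolates that counting step, following the defensive dict handling in the source; count_anomalies is a hypothetical name, and the report shape (a "checks" list of dicts, each with an "anomalies" sequence) is what the source reads, not a documented schema.

```python
from __future__ import annotations


def count_anomalies(structure_report: dict[str, object]) -> int:
    """Total anomalies across all checks; malformed entries are ignored."""
    checks = structure_report.get("checks") or []
    if not isinstance(checks, list):
        return 0
    total = 0
    for check in checks:
        if not isinstance(check, dict):
            continue
        anomalies = check.get("anomalies", ())
        if isinstance(anomalies, (list, tuple)):
            total += len(anomalies)
    return total


# Illustrative report; item ids are invented.
report = {
    "checks": [
        {"name": "containment_closure", "anomalies": []},
        {"name": "base_case_stability", "anomalies": [{"item_id": "b3"}, {"item_id": "b7"}]},
    ]
}
n_anomalies = count_anomalies(report)
```

Any count above zero, combined with structural_check_run marked True, triggers the cap.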

Source code in src/infereval/report.py
def compute_verdict(
    claims: ConstructValidityClaims,
    *,
    structure_report: dict[str, object] | None = None,
    benchmark: Benchmark | None = None,
) -> ReportVerdict:
    """Return the deterministic summary verdict for the claims + evidence.

    The verdict is computed against the *claims* file together with the
    supplied analytical artifacts. When no artifacts are passed
    (``structure_report=None``, ``benchmark=None``), the verdict is
    computed from claims alone and a "verdict computed unaudited"
    rationale line is added so the reader can tell.

    The deterministic rule:

    - "defensible" iff every check required by the declared scope is
      marked True AND no audited check returned a failing artifact AND
      the carving claim is explicit (acknowledges = True iff any
      in-principle claims are being made) AND the benchmark supports
      an inter-analyst baseline when one is required by the scope.
    - "not_defensible" iff *more than half* of the required checks
      are missing.
    - "partially_defensible" otherwise — including the "ran but didn't
      pass" cases (structural anomalies present, single-analyst benchmark
      with ``items_in_benchmark`` scope).

    Audit caps (added in v0.5.3 from external review):

    - If ``structure_report`` is supplied AND ``structural_check_run``
      is marked True AND the report contains any anomaly, the structural
      check is treated as failing — the verdict is capped at
      ``partially_defensible`` with a rationale line naming the count.
    - If ``benchmark`` is supplied AND the scope is
      ``items_in_benchmark`` AND ``len(benchmark.analysts) < 2``, the
      verdict is capped at ``partially_defensible`` with a rationale
      line surfacing the panel size — agreement with a single analyst
      cannot inherit the convergent-validity guarantee that
      multi-analyst agreement carries.

    Backwards-compatible callers that don't pass the artifacts get
    behaviour identical to v0.5.2 except for the additional "verdict
    computed unaudited" rationale line.
    """
    required = _REQUIRED_CHECKS_BY_SCOPE[claims.scope.scope]
    ce = claims.competing_explanations
    present = {name for name in required if getattr(ce, name)}
    missing = required - present

    rationale = []
    if not missing:
        rationale.append(
            f"All {len(required)} competing-explanation checks required for "
            f"scope={claims.scope.scope!r} are marked as run."
        )
    else:
        rationale.append(
            f"{len(missing)} of {len(required)} required checks NOT run: "
            f"{sorted(missing)}."
        )

    # Carving check applies only when scope reaches beyond items_in_benchmark.
    carving_ok = True
    if claims.scope.scope != "items_in_benchmark":
        if not claims.carving.acknowledges_carving_indexed:
            carving_ok = False
            rationale.append(
                f"Scope={claims.scope.scope!r} reaches beyond the items "
                "themselves, but carving-indexed framing is NOT acknowledged "
                "(R19 unaddressed)."
            )
        elif not claims.carving.notes.strip():
            carving_ok = False
            rationale.append(
                "Carving acknowledged but no notes supplied; R19 requires "
                "the carving to be documented."
            )

    # Audit caps (v0.5.3): downgrade when the analyst declared a check
    # was run but the corresponding artifact tells a different story.
    structural_failed = False
    if (
        structure_report is not None
        and getattr(ce, "structural_check_run", False)
    ):
        checks_obj = structure_report.get("checks") or []
        checks_iter = checks_obj if isinstance(checks_obj, list) else []
        total_anomalies = 0
        for check in checks_iter:
            if not isinstance(check, dict):
                continue
            anomalies = check.get("anomalies", ())
            if isinstance(anomalies, (list, tuple)):
                total_anomalies += len(anomalies)
        if total_anomalies > 0:
            structural_failed = True
            rationale.append(
                f"`structural_check_run` is marked True, but the supplied "
                f"structure report contains {total_anomalies} anomal"
                f"{'y' if total_anomalies == 1 else 'ies'} — "
                "the check ran but did not pass. Verdict capped at "
                "partially_defensible."
            )

    panel_too_small = False
    panel_size: int | None = None
    if benchmark is not None and claims.scope.scope == "items_in_benchmark":
        panel_size = len(benchmark.analysts)
        if panel_size < 2:
            panel_too_small = True
            rationale.append(
                f"Benchmark has m={panel_size} analyst(s); κ_F\\*(β) is "
                "undefined and there is no independent reference column. "
                "A green verdict at items_in_benchmark scope would certify "
                "agreement with a single labeler — capped at "
                "partially_defensible."
            )

    if structure_report is None and benchmark is None:
        rationale.append(
            "Verdict computed unaudited: no structure_report or benchmark "
            "supplied to compute_verdict, so 'check run' is taken at face "
            "value and panel size is not inspected. Render through "
            "`infereval report` (which passes both) for the audited verdict."
        )

    # Decide.
    audit_passes = not structural_failed and not panel_too_small
    if not missing and carving_ok and audit_passes:
        one_liner = f"Mastery claim defensible at scope={claims.scope.scope!r}."
        if panel_size is not None:
            one_liner = (
                f"Mastery claim defensible at scope={claims.scope.scope!r} "
                f"(m={panel_size} analysts)."
            )
        return ReportVerdict(
            label="defensible",
            one_liner=one_liner,
            rationale=rationale,
        )
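    # An audit cap is an upper bound on the verdict; a failed audit must
    # not lift a would-be not_defensible result to partially_defensible.
    if len(missing) > len(required) / 2 or not carving_ok: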
        return ReportVerdict(
            label="not_defensible",
            one_liner=(
                f"Mastery claim NOT defensible from the supplied evidence at "
                f"scope={claims.scope.scope!r}."
            ),
            rationale=rationale,
        )
    return ReportVerdict(
        label="partially_defensible",
        one_liner=(
            f"Mastery claim partially defensible at scope={claims.scope.scope!r} — "
            "see Unaddressed competing explanations."
        ),
        rationale=rationale,
    )

infereval.report.render_markdown

render_markdown(*, evaluation: Evaluation, benchmark: Benchmark, claims: ConstructValidityClaims, structure_report: dict[str, object] | None = None, sweep_summary: dict[str, object] | None = None, model_fit: dict[str, object] | None = None, generated_at: datetime | None = None, suppress_negatives: bool = False) -> str

Produce the construct-validity report as Markdown.

Optional arguments (structure_report, sweep_summary, model_fit) populate the Evidence section; when absent, that section explicitly notes the missing evidence.
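
render_markdown also accepts suppress_negatives; per the source on this page, enabling it downgrades the summary verdict by one tier. A minimal sketch of that mapping, written as a hypothetical standalone function rather than the library's own code:

```python
# Verdict labels ordered from strongest to weakest.
_TIERS = ("defensible", "partially_defensible", "not_defensible")


def downgrade_one_tier(label: str) -> str:
    """Move a verdict label one step down; the weakest label stays put."""
    idx = _TIERS.index(label)  # raises ValueError for unknown labels
    return _TIERS[min(idx + 1, len(_TIERS) - 1)]


after_suppression = [downgrade_one_tier(label) for label in _TIERS]
```

The mapping is idempotent only at the bottom tier, so suppression always costs something unless the verdict is already not_defensible.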

Source code in src/infereval/report.py
def render_markdown(
    *,
    evaluation: Evaluation,
    benchmark: Benchmark,
    claims: ConstructValidityClaims,
    structure_report: dict[str, object] | None = None,
    sweep_summary: dict[str, object] | None = None,
    model_fit: dict[str, object] | None = None,
    generated_at: datetime | None = None,
    suppress_negatives: bool = False,
) -> str:
    """Produce the construct-validity report as Markdown.

    Optional arguments (``structure_report``, ``sweep_summary``,
    ``model_fit``) populate the Evidence section; when absent, that
    section explicitly notes the missing evidence.
    """
    from .metrics import (
        cohens_kappa,
        consensus_reference,
        coverage,
        fleiss_kappa,
        inter_analyst_fleiss,
    )

    generated_at = generated_at or datetime.now(timezone.utc)

    kappa_c = cohens_kappa(evaluation, consensus_reference(evaluation))
    kappa_f = fleiss_kappa(evaluation)
    kappa_f_star = inter_analyst_fleiss(benchmark)
    cov = coverage(evaluation)
    verdict = compute_verdict(
        claims,
        structure_report=structure_report,
        benchmark=benchmark,
    )

    # Collect negative findings up-front so we can both render them and
    # apply the suppression penalty to the verdict in one place.
    findings = collect_negative_findings(
        structure_report=structure_report,
        sweep_summary=sweep_summary,
        model_fit=model_fit,
        factor_kinds=dict(benchmark.factor_kinds) if benchmark.factor_kinds else None,
    )
    any_phase2_supplied = any(
        x is not None for x in (structure_report, sweep_summary, model_fit)
    )

    # If suppression is enabled, the Summary verdict downgrades one tier:
    # defensible -> partially_defensible -> not_defensible. Hiding
    # evidence is itself a negative construct-validity signal.
    if suppress_negatives:
        downgraded_label = {
            "defensible": "partially_defensible",
            "partially_defensible": "not_defensible",
            "not_defensible": "not_defensible",
        }[verdict.label]
        if downgraded_label != verdict.label:
            verdict = ReportVerdict(
                label=downgraded_label,  # type: ignore[arg-type]
                one_liner=(
                    "Verdict downgraded one tier because "
                    "--suppress-negatives is enabled."
                ),
                rationale=[
                    *verdict.rationale,
                    "Negative-findings suppression downgrades the verdict "
                    "(Phase 3.2 / R21).",
                ],
            )

    lines: list[str] = []
    lines.append("# Construct-validity report")
    lines.append("")
    lines.append(f"_Generated: {generated_at.isoformat()}_")
    if suppress_negatives:
        lines.append("")
        lines.append(
            "> ⚠️ **Negative-findings suppression: ENABLED.** This is an "
            "explicit author choice via `--suppress-negatives`; the "
            "framework normally surfaces negative findings by default. "
            "Reviewers: ask why this flag was set."
        )
    lines.append("")

    # 1. Identity
    lines.append("## 1. Identity")
    lines.append("")
    lines.append(f"- **Evaluation**: `{evaluation.id}`")
    lines.append(f"- **Benchmark**: `{benchmark.id}`")
    lines.append(
        f"- **Model**: `{evaluation.model.provider}` / `{evaluation.model.model_id}`"
    )
    if evaluation.started_at:
        lines.append(f"- **Run started**: {evaluation.started_at.isoformat()}")
    lines.append(f"- **Items**: {evaluation.n}")
    lines.append(f"- **Analysts**: {benchmark.m}")
    lines.append("")

    # 2. Summary metrics
    lines.append("## 2. Summary metrics")
    lines.append("")
    lines.append(f"- **Coverage**: {cov:.4f}")
    lines.append(f"- **Cohen's κ_C (vs consensus)**: {_format_kappa(kappa_c)}")
    lines.append(f"- **Fleiss' κ_F**: {_format_kappa(kappa_f)}")
    lines.append(f"- **Inter-analyst κ_F\\***: {_format_kappa(kappa_f_star)}")
    lines.append("")

    # 3. Construct-validity claims (R16-R20)
    lines.append("## 3. Construct-validity claims (R16–R20)")
    lines.append("")
    lines.append(f"**Mastery sense (R16)**: {claims.mastery_sense.sense}")
    lines.append("")
    lines.append(f"> {claims.mastery_sense.description}")
    lines.append("")
    lines.append(f"**Scope (R17)**: {claims.scope.scope}")
    lines.append("")
    lines.append(f"> {claims.scope.justification}")
    lines.append("")
    lines.append(f"**Constitution vs. evidence (R18)**: {claims.constitution.position}")
    lines.append("")
    lines.append(f"> {claims.constitution.justification}")
    lines.append("")
    carving_status = (
        "acknowledged" if claims.carving.acknowledges_carving_indexed else "not acknowledged"
    )
    lines.append(f"**Carving-indexed framing (R19)**: {carving_status}")
    if claims.carving.notes.strip():
        lines.append("")
        lines.append(f"> {claims.carving.notes}")
    lines.append("")

    # 4. Evidence
    lines.append("## 4. Evidence")
    lines.append("")
    lines.append("Auto-collected from optional Phase 2 artifacts:")
    lines.append("")

    if structure_report is not None:
        total_anomalies = structure_report.get("total_anomalies", 0)
        lines.append(
            f"- **Structural coherence checks** (R13): "
            f"{total_anomalies} anomalies flagged across the bundled checks."
        )
    else:
        lines.append("- **Structural coherence checks** (R13): NOT SUPPLIED.")

    if sweep_summary is not None:
        kc_range = sweep_summary.get("kappa_c_range")
        param = sweep_summary.get("parameter", "?")
        verdict_str = sweep_summary.get("stability_verdict", "?")
        if kc_range is not None:
            lines.append(
                f"- **Sensitivity sweep** over `{param}` (R11): "
                f"κ_C range = {kc_range:.3f}. {verdict_str}"
            )
        else:
            lines.append(
                f"- **Sensitivity sweep** over `{param}` (R11): {verdict_str}"
            )
    else:
        lines.append("- **Sensitivity sweep** (R11): NOT SUPPLIED.")

    if model_fit is not None:
        wald_raw = model_fit.get("factor_wald", {})
        wald = wald_raw if isinstance(wald_raw, dict) else {}
        sig = sum(1 for p in wald.values() if isinstance(p, (int, float)) and p < 0.05)
        lines.append(
            f"- **Factor-effects model fit** (R7, R12): "
            f"{sig}/{len(wald)} factors significant at α=0.05."
        )
    else:
        lines.append("- **Factor-effects model fit** (R7, R12): NOT SUPPLIED.")
    lines.append("")

    # 4b. Negative findings (Phase 3.2, R21)
    lines.append("## 4b. Negative findings")
    lines.append("")
    if suppress_negatives:
        lines.append(
            "⚠️ **Suppressed via `--suppress-negatives`.** This is an "
            "explicit author choice; the framework normally surfaces "
            "negative findings by default. Reviewers: ask why this flag "
            "was set."
        )
    elif not any_phase2_supplied:
        lines.append(
            "No Phase 2 artifacts supplied; the auto-collection step had "
            "nothing to scan. See Unaddressed competing explanations (§5) "
            "for the analyst-declared check status."
        )
    elif not findings:
        lines.append("No negative findings detected in the supplied Phase 2 artifacts.")
    else:
        lines.append(
            "The framework auto-collects negative findings from the "
            "supplied Phase 2 artifacts. Each item below represents a "
            "check that ran but returned a finding that *weakens or "
            "complicates* the mastery claim."
        )
        lines.append("")
        # Group by source for readability.
        for src_label, src_key in [
            ("Structural anomalies", "structure"),
            ("Sweep instability", "sweep"),
            ("Factor-effects null findings", "model_fit"),
        ]:
            src_items = [f for f in findings if f.source == src_key]
            if not src_items:
                continue
            lines.append(f"### {src_label} ({len(src_items)} flagged)")
            for f in src_items:
                lines.append(f"- {f.summary}")
            lines.append("")
    lines.append("")

    # 5. Unaddressed competing explanations
    lines.append("## 5. Unaddressed competing explanations")
    lines.append("")
    ce = claims.competing_explanations
    unaddressed = [
        (name, _human_label_for_check(name))
        for name in (
            "paraphrase_sweep_run",
            "sensitivity_sweep_run",
            "structural_check_run",
            "cross_panel_check_run",
            "independent_reference_panel_used",
            "held_out_items_used",
            "training_data_separation_verified",
            "cross_domain_comparison_run",
            "replication_attempted",
        )
        if not getattr(ce, name)
    ]
    if not unaddressed:
        lines.append("All declared competing-explanation checks marked as run.")
    else:
        lines.append(
            "The following checks were NOT run. Each omission weakens the "
            "defensibility of the corresponding mastery claim:"
        )
        lines.append("")
        for name, label in unaddressed:
            lines.append(f"- **{label}** (`{name}`)")
    lines.append("")

    # 6. Summary verdict
    lines.append("## 6. Summary verdict")
    lines.append("")
    badge = {
        "defensible": "✅",
        "partially_defensible": "⚠️",
        "not_defensible": "❌",
    }[verdict.label]
    lines.append(f"### {badge} {verdict.one_liner}")
    lines.append("")
    for note in verdict.rationale:
        lines.append(f"- {note}")
    lines.append("")
    lines.append("---")
    lines.append("")
    lines.append(
        "*Generated by `infereval report` (Phase 3.1, R16–R20). The verdict "
        "is computed deterministically from the claims file; the framework "
        "refuses to render a 'defensible' verdict without the corresponding "
        "competing-explanation checks.*"
    )

    return "\n".join(lines) + "\n"
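
The "Unaddressed competing explanations" step in the listing above reduces to filtering a fixed tuple of flag names by `getattr`. A minimal sketch of that pattern follows; `CheckFlags` is a hypothetical stand-in for `claims.competing_explanations` and carries only three of the nine flags.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckFlags:
    """Hypothetical stand-in for ``claims.competing_explanations``."""

    paraphrase_sweep_run: bool = False
    sensitivity_sweep_run: bool = False
    replication_attempted: bool = False


def unaddressed_checks(flags: CheckFlags, names: tuple) -> list:
    """Return the names whose flag is falsy, in declaration order."""
    return [name for name in names if not getattr(flags, name)]


flags = CheckFlags(paraphrase_sweep_run=True)
missing = unaddressed_checks(
    flags,
    ("paraphrase_sweep_run", "sensitivity_sweep_run", "replication_attempted"),
)
print(missing)  # ['sensitivity_sweep_run', 'replication_attempted']
```

Because the names are iterated in a fixed order, the rendered list is stable across runs for the same claims file.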

infereval.report.NegativeFinding dataclass

NegativeFinding(source: Literal['structure', 'sweep', 'model_fit'], summary: str)

One auto-collected negative finding from a Phase 2 artifact.

A finding is "negative" in the construct-validity sense — a check that ran and returned a result that weakens or complicates the mastery claim. Per Closing the Construct-Validity Gap in infereval (Phase 3.2 / R21), the framework surfaces these by default in the report.

summary instance-attribute

summary: str

One-line description rendered in the Negative findings section.

infereval.report.collect_negative_findings

collect_negative_findings(*, structure_report: dict[str, object] | None = None, sweep_summary: dict[str, object] | None = None, model_fit: dict[str, object] | None = None, factor_kinds: dict[str, str] | None = None) -> list[NegativeFinding]

Scan the supplied Phase 2 artifacts and return their negative findings.

Sources:

  • structure_report: each anomaly across all checks is one finding.
  • sweep_summary: instability (verdict not "stable across the sweep range") is one finding.
  • model_fit: factors whose Wald p > 0.05 are surfaced as no-significant-effect findings. When factor_kinds supplies a valence label for a factor, the finding's summary explicitly states whether the null is a weakening of the mastery claim (a substantive factor that didn't differentiate) or a strengthening one (an experimentally-controlled factor that properly didn't affect behavior — e.g. the paraphrase axis). Unlabelled factors get the historical neutral summary so the analyst can read the valence from context.

Parameters

factor_kinds Optional mapping factor_name -> {"substantive", "experimentally_controlled"} from Benchmark.factor_kinds. When omitted, all null-effect findings are summarised neutrally.
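
The model_fit rule described above can be illustrated without the library. The sketch below is a simplified re-statement, not the function itself: the factor names and the short `weakens` / `strengthens` / `neutral` tags are hypothetical, and only the threshold and the skipping of non-numeric values are taken from the documented behavior.

```python
from __future__ import annotations


def null_effect_tags(
    factor_wald: dict[str, object],
    factor_kinds: dict[str, str] | None = None,
    alpha: float = 0.05,
) -> list[str]:
    """Tag each factor whose Wald p-value exceeds alpha."""
    kinds = factor_kinds or {}
    tags = []
    for factor, p in factor_wald.items():
        if not isinstance(p, (int, float)):
            continue  # non-numeric entries are ignored
        if p > alpha:
            kind = kinds.get(factor)
            if kind == "substantive":
                valence = "weakens"
            elif kind == "experimentally_controlled":
                valence = "strengthens"
            else:
                valence = "neutral"
            tags.append(f"{factor}: p={p:.3f} ({valence})")
    return tags


# Hypothetical factor names and p-values, for illustration only.
wald = {"polarity": 0.31, "paraphrase": 0.74, "depth": 0.002, "domain": "n/a", "length": 0.05}
kinds = {"polarity": "substantive", "paraphrase": "experimentally_controlled"}
tags = null_effect_tags(wald, kinds)
print(tags)  # ['polarity: p=0.310 (weakens)', 'paraphrase: p=0.740 (strengthens)']
```

Note the boundary: a p-value of exactly 0.05 is not flagged here (the test is `p > 0.05`), and the report's Evidence section does not count it as significant either (its test is `p < 0.05`).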

Source code in src/infereval/report.py
def collect_negative_findings(
    *,
    structure_report: dict[str, object] | None = None,
    sweep_summary: dict[str, object] | None = None,
    model_fit: dict[str, object] | None = None,
    factor_kinds: dict[str, str] | None = None,
) -> list[NegativeFinding]:
    """Scan the supplied Phase 2 artifacts and return their negative findings.

    Sources:

    - **structure_report**: each anomaly across all checks is one finding.
    - **sweep_summary**: instability (verdict not "stable across the sweep
      range") is one finding.
    - **model_fit**: factors whose Wald p > 0.05 are surfaced as
      no-significant-effect findings. When ``factor_kinds`` supplies a
      valence label for a factor, the finding's summary explicitly
      states whether the null is a *weakening* of the mastery claim
      (a substantive factor that didn't differentiate) or a *strengthening*
      one (an experimentally-controlled factor that properly didn't
      affect behavior — e.g. the paraphrase axis). Unlabelled factors
      get the historical neutral summary so the analyst can read the
      valence from context.

    Parameters
    ----------
    factor_kinds
        Optional mapping ``factor_name -> {"substantive",
        "experimentally_controlled"}`` from ``Benchmark.factor_kinds``.
        When omitted, all null-effect findings are summarised neutrally.
    """
    findings: list[NegativeFinding] = []

    if structure_report is not None:
        checks_raw = structure_report.get("checks", [])
        checks = checks_raw if isinstance(checks_raw, list) else []
        for check in checks:
            if not isinstance(check, dict):
                continue
            anomalies = check.get("anomalies", ())
            if not anomalies:
                continue
            check_name = check.get("name", "?")
            for a in anomalies:
                if isinstance(a, dict):
                    item_id = a.get("item_id", "?")
                    expl = a.get("explanation", "")
                    findings.append(
                        NegativeFinding(
                            source="structure",
                            summary=f"{check_name} / {item_id}: {expl}",
                        )
                    )

    if sweep_summary is not None:
        verdict_raw = sweep_summary.get("stability_verdict", "")
        verdict_str = str(verdict_raw).lower()
        # The SweepResult.stability_verdict strings live in three flavours:
        # "stable" (positive), "moderately sensitive" (negative),
        # "substantively" (negative). "Stable" doesn't appear in the
        # negative ones, so its absence is the right signal.
        if verdict_str and "stable" not in verdict_str:
            param = sweep_summary.get("parameter", "?")
            findings.append(
                NegativeFinding(
                    source="sweep",
                    summary=f"Sweep over `{param}`: {sweep_summary.get('stability_verdict')}",
                )
            )

    if model_fit is not None:
        wald_raw = model_fit.get("factor_wald", {})
        wald = wald_raw if isinstance(wald_raw, dict) else {}
        kinds = factor_kinds or {}
        for factor, p in wald.items():
            if not isinstance(p, (int, float)):
                continue
            if p > 0.05:
                kind = kinds.get(str(factor))
                if kind == "substantive":
                    valence = (
                        " — **weakens the mastery claim**: this factor was "
                        "declared substantive, so the model failing to "
                        "differentiate across its levels is a negative finding"
                    )
                elif kind == "experimentally_controlled":
                    valence = (
                        " — **strengthens the mastery claim**: this factor "
                        "was declared experimentally-controlled, so the null "
                        "result is the wanted outcome (content-not-form "
                        "behavior)"
                    )
                else:
                    valence = ""
                findings.append(
                    NegativeFinding(
                        source="model_fit",
                        summary=(
                            f"`{factor}`: Wald p = {p:.3f} "
                            f"(no significant effect detected){valence}"
                        ),
                    )
                )

    return findings

Providers

infereval.providers.get_provider

get_provider(provider: str, model_id: str, **kwargs: Any) -> Provider

Construct a provider by short name.

Parameters

provider
    Provider short name: "anthropic", "openai", "openrouter", or "mock".
model_id
    Provider-specific model identifier.
**kwargs
    Passed through to the concrete provider's constructor (e.g. api_key, base_url, retry_policy, http_referer).

Returns

Provider A constructed provider instance satisfying the :class:Provider Protocol.

Raises

ProviderConfigError If provider is not a known short name or required configuration is missing (e.g. API key not set, optional SDK not installed).

Source code in src/infereval/providers/__init__.py
def get_provider(provider: str, model_id: str, **kwargs: Any) -> Provider:
    """Construct a provider by short name.

    Parameters
    ----------
    provider
        Provider short name: ``"anthropic"``, ``"openai"``, ``"openrouter"``, or ``"mock"``.
    model_id
        Provider-specific model identifier.
    **kwargs
        Passed through to the concrete provider's constructor (e.g.
        ``api_key``, ``base_url``, ``retry_policy``, ``http_referer``).

    Returns
    -------
    Provider
        A constructed provider instance satisfying the :class:`Provider`
        Protocol.

    Raises
    ------
    ProviderConfigError
        If ``provider`` is not a known short name or required configuration is
        missing (e.g. API key not set, optional SDK not installed).
    """
    normalized = provider.strip().lower()
    if normalized == "anthropic":
        from .anthropic import AnthropicProvider

        return AnthropicProvider(model_id, **kwargs)
    if normalized == "openai":
        from .openai import OpenAIProvider

        return OpenAIProvider(model_id, **kwargs)
    if normalized == "openrouter":
        from .openrouter import OpenRouterProvider

        return OpenRouterProvider(model_id, **kwargs)
    if normalized == "mock":
        from .mock import ScriptedProvider

        return ScriptedProvider(model_id=model_id, **kwargs)
    raise ProviderConfigError(
        f"Unknown provider {provider!r}. "
        "Supported: 'anthropic', 'openai', 'openrouter', 'mock'."
    )

infereval.providers.base.Provider

Bases: Protocol

The structural contract every LLM backend must satisfy.

infereval.providers.base.BaseProvider

BaseProvider(model_id: str, *, retry_policy: RetryPolicy | None = None, rng: Random | None = None)

Bases: ABC

Abstract base providing the retry loop, timing, and logging.

Subclasses set the class-level :attr:name, implement :meth:_sample_once (one provider call), and implement :meth:_is_transient (which exceptions warrant a retry).

Source code in src/infereval/providers/base.py
def __init__(
    self,
    model_id: str,
    *,
    retry_policy: RetryPolicy | None = None,
    rng: random.Random | None = None,
) -> None:
    self.model_id = model_id
    self.retry_policy = retry_policy or RetryPolicy()
    self._rng = rng or random.Random()

sample

sample(req: SampleRequest) -> SampleResult

Sample once with retries.

Raises

ProviderSampleError If all retry attempts fail with transient errors, or the first attempt fails with a non-transient error.
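
The retry contract described above (retry transient failures up to a cap, fail fast on anything else) can be sketched in isolation. Everything here is a stand-in: `TransientError`, `SampleError`, and `sample_with_retries` are hypothetical names, and the sleep schedule omits jitter for readability.

```python
class TransientError(Exception):
    """Hypothetical retryable failure, e.g. a rate limit."""


class SampleError(Exception):
    """Hypothetical stand-in for ProviderSampleError."""


def sample_with_retries(call, *, is_transient, max_attempts=4, sleep=lambda s: None):
    """Call ``call`` until it succeeds, a non-transient error occurs, or attempts run out."""
    last_exc = None
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:  # classified immediately below
            last_exc = exc
            if not is_transient(exc):
                raise SampleError(f"non-transient: {exc}") from exc
            if attempt + 1 >= max_attempts:
                break  # exhausted; raise after the loop
            sleep(0.5 * 2.0**attempt)  # jitter omitted in this sketch
    raise SampleError(f"failed after {max_attempts} attempts: {last_exc}") from last_exc


calls = {"n": 0}


def flaky():
    """Fail twice with a transient error, then succeed."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("rate limited")
    return "GOOD"


slept = []
result = sample_with_retries(
    flaky,
    is_transient=lambda e: isinstance(e, TransientError),
    sleep=slept.append,
)
print(result, calls["n"], slept)  # GOOD 3 [0.5, 1.0]
```

Injecting `sleep` as a parameter keeps the sketch testable without real delays; the documented class routes its delays through a `_sleep` method for a similar effect.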

Source code in src/infereval/providers/base.py
def sample(self, req: SampleRequest) -> SampleResult:
    """Sample once with retries.

    Raises
    ------
    ProviderSampleError
        If all retry attempts fail with transient errors, or the first
        attempt fails with a non-transient error.
    """
    last_exc: Exception | None = None
    for attempt in range(self.retry_policy.max_attempts):
        try:
            result = self._sample_once(req)
            if attempt > 0:
                log_event(
                    log,
                    "provider.sample.recovered",
                    provider=self.name,
                    model_id=self.model_id,
                    request_id=req.request_id,
                    attempt=attempt + 1,
                )
            return result
        except Exception as exc:  # noqa: BLE001 -- intentional broad catch; classified below
            last_exc = exc
            transient = self._is_transient(exc)
            log.warning(
                "provider.sample.error",
                extra={
                    "provider": self.name,
                    "model_id": self.model_id,
                    "request_id": req.request_id,
                    "attempt": attempt + 1,
                    "transient": transient,
                    "err": str(exc),
                },
            )
            if not transient:
                raise ProviderSampleError(
                    f"{self.name} sample failed (non-transient): {exc}"
                ) from exc
            if attempt + 1 >= self.retry_policy.max_attempts:
                break  # exhausted -- raise below
            sleep_for = self.retry_policy.sleep_for(attempt, self._rng)
            log_event(
                log,
                "provider.sample.retry",
                provider=self.name,
                request_id=req.request_id,
                sleep_s=sleep_for,
            )
            self._sleep(sleep_for)

    raise ProviderSampleError(
        f"{self.name} sample failed after {self.retry_policy.max_attempts} "
        f"attempts: {last_exc}"
    ) from last_exc

infereval.providers.base.SampleRequest dataclass

SampleRequest(prompt: str, system: str | None = None, temperature: float = 1.0, max_tokens: int = 1024, top_p: float | None = None, seed: int | None = None, stop: tuple[str, ...] = (), request_id: str | None = None)

A single completion request issued to a provider.

The max_tokens default of 1024 is sized for current frontier models that consume budget on silent internal reasoning (DeepSeek v4-flash, OpenAI o-series, Gemini 2.5 Pro). Pre-reasoning models will only emit a handful of tokens for a one-word verdict regardless of this cap, so the higher default is cheap insurance against budget-clipping. See docs/providers.md for per-provider guidance.

request_id class-attribute instance-attribute

request_id: str | None = None

Client-side correlation id propagated to logs and to :attr:SampleResult.request_id.

infereval.providers.base.SampleResult dataclass

SampleResult(text: str, provider: str, model_id: str, request_id: str | None = None, wall_time_ms: float = 0.0, usage: Mapping[str, int] = dict(), raw: Mapping[str, Any] | None = None, finish_reason: str | None = None, reasoning_tokens: int | None = None)

One completed sample from a provider.

The finish_reason and reasoning_tokens fields surface provider-side stop-reason and reasoning-token-consumption metadata so that downstream code (the endorser, the JSONL log, the evaluation JSON) can distinguish budget-clipped abstains (model ran out of tokens on silent internal reasoning) from genuine abstains (model declined to commit). The values are passed through verbatim from each provider — see :data:BUDGET_FINISH_REASONS for the canonical union of values that signal a budget hit.

raw class-attribute instance-attribute

raw: Mapping[str, Any] | None = None

Provider-native response payload, when available, for forensic inspection.

finish_reason class-attribute instance-attribute

finish_reason: str | None = None

Provider-side stop reason. OpenAI: "stop" / "length" / ...; Anthropic: "end_turn" / "max_tokens" / "stop_sequence" / .... None if the provider didn't report one.

reasoning_tokens class-attribute instance-attribute

reasoning_tokens: int | None = None

Count of tokens consumed by silent internal reasoning, where the provider exposes it (OpenAI: usage.completion_tokens_details.reasoning_tokens). None if not reported.
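
As an illustration of how finish_reason separates budget-clipped abstains from genuine ones, the sketch below applies the promotion rule the endorser uses: an unparseable response whose stop reason signals a budget hit is relabelled. The contents of BUDGET_FINISH_REASONS are not listed on this page, so the set here is an assumption based on the "length" and "max_tokens" values mentioned above, and the "parsed" status is a hypothetical placeholder.

```python
# Assumed for illustration; the real BUDGET_FINISH_REASONS may differ.
ASSUMED_BUDGET_FINISH_REASONS = frozenset({"length", "max_tokens"})


def classify_status(parse_status, finish_reason):
    """Promote 'unparseable' to 'budget_clipped' when the provider reports a budget hit."""
    if parse_status == "unparseable" and finish_reason in ASSUMED_BUDGET_FINISH_REASONS:
        return "budget_clipped"
    return parse_status


statuses = [
    classify_status("unparseable", "length"),      # truncated by the token cap
    classify_status("unparseable", "stop"),        # model finished but gave no verdict
    classify_status("unparseable", None),          # provider reported no stop reason
    classify_status("parsed", "max_tokens"),       # a verdict was found; nothing to promote
]
print(statuses)  # ['budget_clipped', 'unparseable', 'unparseable', 'parsed']
```

In every case the verdict itself stays an abstain; only the status label changes, which is what lets downstream analysis tell operational abstains from model decisions.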

infereval.providers.base.RetryPolicy dataclass

RetryPolicy(max_attempts: int = 4, backoff_initial_s: float = 0.5, backoff_factor: float = 2.0, jitter: float = 0.25)

Exponential-backoff-with-jitter retry policy.

Sleep before attempt i+1 (after attempt i fails with a transient error, counting attempts from i = 0) is

.. math:: s_i = b \cdot f^{\,i} \cdot (1 + j \cdot u)

where :math:b is backoff_initial_s, :math:f is backoff_factor, :math:j is jitter, and :math:u \sim U[-1, 1]. The result is clamped at zero so that a jitter greater than 1 cannot produce a negative sleep.
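
A worked example of the formula with the default parameters (b = 0.5, f = 2.0, j = 0.25) may help when sizing timeouts. The helper names below are illustrative and not part of the library; the arithmetic follows the formula above.

```python
import random


def backoff_bounds(attempt_index, b=0.5, f=2.0, j=0.25):
    """Return the (low, high) limits of the sleep after a given failed attempt."""
    base = b * f**attempt_index
    return max(0.0, base * (1.0 - j)), max(0.0, base * (1.0 + j))


def jittered_sleep(attempt_index, rng, b=0.5, f=2.0, j=0.25):
    """Draw one sleep duration: b * f**i * (1 + j * u), clamped at zero."""
    u = rng.random() * 2.0 - 1.0  # uniform on [-1, 1)
    return max(0.0, b * f**attempt_index * (1.0 + j * u))


# With max_attempts = 4 there are at most three sleeps (i = 0, 1, 2).
bounds = [backoff_bounds(i) for i in range(3)]
print(bounds)  # [(0.375, 0.625), (0.75, 1.25), (1.5, 2.5)]

worst_case_total = sum(high for _, high in bounds)
print(worst_case_total)  # 4.375

rng = random.Random(0)  # seeded so the draws are reproducible
draws = [jittered_sleep(i, rng) for i in range(3)]
```

Under the defaults, a request that exhausts all four attempts therefore spends less than 4.375 seconds sleeping, excluding the time taken by the provider calls themselves.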

sleep_for

sleep_for(attempt_index: int, rng: Random) -> float

Return the sleep duration in seconds before the next attempt.

Source code in src/infereval/providers/base.py
def sleep_for(self, attempt_index: int, rng: random.Random) -> float:
    """Return the sleep duration in seconds before the next attempt."""
    base = self.backoff_initial_s * (self.backoff_factor**attempt_index)
    jitter_mul = 1.0 + self.jitter * (rng.random() * 2.0 - 1.0)
    return max(0.0, base * jitter_mul)

infereval.providers.mock.ScriptedProvider dataclass

ScriptedProvider(responses: list[str | SampleResult], model_id: str = 'scripted-mock-v1', name: str = 'mock')

Returns a pre-determined sequence of responses, cycling on exhaustion.

Each element may be either a plain str (in which case it is wrapped in a :class:SampleResult at sample time) or a fully-formed :class:SampleResult.

Parameters

responses
    Sequence of responses to return on successive sample calls.
model_id
    Identifier reported in :attr:SampleResult.model_id.
name
    Identifier reported in :attr:SampleResult.provider. Defaults to "mock" so evaluation JSON written from a test cleanly identifies itself as not real.
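
The cycling behavior can be pictured with a small stand-in. `CyclingScript` below is hypothetical and only models the index arithmetic; rejecting an empty script is this sketch's own choice, not documented behavior of ScriptedProvider.

```python
class CyclingScript:
    """Hypothetical stand-in that models 'cycle on exhaustion' plus reset()."""

    def __init__(self, responses):
        if not responses:
            raise ValueError("at least one response is required")  # sketch-only guard
        self._responses = list(responses)
        self._index = 0

    def next_text(self):
        """Return the next scripted response, wrapping around at the end."""
        text = self._responses[self._index % len(self._responses)]
        self._index += 1
        return text

    def reset(self):
        """Restart from the first response."""
        self._index = 0


script = CyclingScript(["GOOD", "BAD"])
seen = [script.next_text() for _ in range(3)]
print(seen)  # ['GOOD', 'BAD', 'GOOD']

script.reset()
after_reset = script.next_text()
print(after_reset)  # GOOD
```

Cycling means a test that requests more samples than it scripted still gets deterministic output, at the cost of silently reusing responses.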

reset

reset() -> None

Reset the index to the start of the response sequence.

Source code in src/infereval/providers/mock.py
def reset(self) -> None:
    """Reset the index to the start of the response sequence."""
    self._index = 0

infereval.providers.mock.ReplayProvider

ReplayProvider(fixture_path: Path | str, *, model_id: str | None = None)

Replays recorded provider responses from a JSONL fixture.

The fixture is one JSON object per line. Each record must carry a prompt_hash (matching :func:infereval.logging_setup.prompt_hash of the prompt that produced it) and a text field. Optional fields: provider, model_id, request_id, wall_time_ms, usage, raw.

When multiple records share a prompt hash, they are returned in fixture order; ReplayProvider cycles when the per-prompt sequence is exhausted, matching :class:ScriptedProvider semantics.

Sampling a prompt whose hash is absent from the fixture raises :class:ProviderSampleError, with a diagnostic message that reports how many hashes are recorded.

This is the M8 vehicle for byte-for-byte regression testing of the endorsement pipeline without hitting a real API. Generate fixtures via the developer helper at tests/fixtures/build_stop_sign_replay.py.
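
To make the fixture layout concrete, the sketch below writes a three-record JSONL file, groups it by prompt_hash in fixture order, and replays with per-prompt cycling. The hash strings are placeholders rather than real outputs of prompt_hash, and the `replay` helper is a hypothetical model of the cursor behavior described above.

```python
import json
import tempfile
from pathlib import Path

# Placeholder hashes; real fixtures carry the output of prompt_hash().
records = [
    {"prompt_hash": "hash-a", "text": "GOOD"},
    {"prompt_hash": "hash-a", "text": "BAD"},
    {"prompt_hash": "hash-b", "text": "ABSTAIN", "model_id": "recorded-model"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "replay.jsonl"
    path.write_text("\n".join(json.dumps(r) for r in records) + "\n", encoding="utf-8")

    # One JSON object per line; blank lines are skipped.
    grouped = {}
    with path.open("r", encoding="utf-8") as f:
        for raw_line in f:
            line = raw_line.strip()
            if not line:
                continue
            record = json.loads(line)
            grouped.setdefault(record["prompt_hash"], []).append(record["text"])


def replay(grouped, cursors, prompt_hash):
    """Return the next recorded text for a hash, cycling when exhausted."""
    texts = grouped[prompt_hash]
    position = cursors.get(prompt_hash, 0)
    cursors[prompt_hash] = position + 1
    return texts[position % len(texts)]


cursors = {}
replayed = [replay(grouped, cursors, "hash-a") for _ in range(3)]
print(replayed)  # ['GOOD', 'BAD', 'GOOD']
```

Keeping a separate cursor per hash is what allows interleaved prompts to replay independently of the order in which they are requested.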

Source code in src/infereval/providers/mock.py
def __init__(
    self,
    fixture_path: Path | str,
    *,
    model_id: str | None = None,
) -> None:
    self.fixture_path = Path(fixture_path)
    if not self.fixture_path.exists():
        raise ProviderConfigError(
            f"ReplayProvider fixture not found: {self.fixture_path}"
        )

    records: dict[str, list[dict[str, Any]]] = {}
    with self.fixture_path.open("r", encoding="utf-8") as f:
        for line_no, raw_line in enumerate(f, start=1):
            line = raw_line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                raise ProviderConfigError(
                    f"ReplayProvider fixture {self.fixture_path} line {line_no} "
                    f"is not valid JSON: {exc}"
                ) from exc
            if "prompt_hash" not in record or "text" not in record:
                raise ProviderConfigError(
                    f"ReplayProvider fixture {self.fixture_path} line {line_no} "
                    "missing required fields 'prompt_hash' and/or 'text'"
                )
            records.setdefault(record["prompt_hash"], []).append(record)

    if not records:
        raise ProviderConfigError(
            f"ReplayProvider fixture {self.fixture_path} is empty"
        )

    self._records = records
    self._cursors: dict[str, int] = {}

    # Default model_id: explicit > first record's > generic placeholder.
    if model_id is not None:
        self.model_id = model_id
    else:
        first_record = next(iter(records.values()))[0]
        self.model_id = first_record.get("model_id", "replay-v1")

reset

reset() -> None

Reset all per-prompt cursors so replay restarts from the top.

Source code in src/infereval/providers/mock.py
def reset(self) -> None:
    """Reset all per-prompt cursors so replay restarts from the top."""
    self._cursors.clear()

Prompts

infereval.prompts.VerificationPrompt dataclass

VerificationPrompt(id: str, system: str, user_template: str, parse_regex: str = DEFAULT_PARSE_REGEX)

A verification prompt template.

Attributes

id
    Stable identifier recorded in evaluation JSON (endorsement_config.verification_prompt_id).
system
    System message sent to the provider. May be empty.
user_template
    Format string with {premise_context} and {conclusion_context} placeholders, used to build each per-sample user prompt.
parse_regex
    Regex applied (case-insensitively) to the model's response. The first match's group 1 is uppercased and interpreted as a :class:Verdict value (GOOD / BAD / ABSTAIN).
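
A short sketch of how the two string fields behave, using only the standard library. The template and regex below are invented for illustration; they are not the framework's DEFAULT_VERIFICATION_PROMPT or DEFAULT_PARSE_REGEX, whose contents are not shown on this page.

```python
import re

# Hypothetical template and regex, for illustration only.
user_template = (
    "Premises:\n{premise_context}\n\n"
    "Conclusion:\n{conclusion_context}\n\n"
    "Answer GOOD, BAD, or ABSTAIN."
)
parse_regex = r"\b(GOOD|BAD|ABSTAIN)\b"

prompt = user_template.format(
    premise_context="a is a stop sign",
    conclusion_context="a is red",
)

parser = re.compile(parse_regex, re.IGNORECASE)


def first_verdict(response):
    """Return group 1 of the first match, uppercased, or None if nothing matches."""
    match = parser.search(response)
    return match.group(1).upper() if match else None


print(first_verdict("I think this is good."))  # GOOD
print(first_verdict("Bad, not good"))          # BAD
print(first_verdict("no idea"))                # None
```

Because user_template is expanded with `str.format`, any literal braces in a custom template (for example a JSON snippet) have to be doubled as `{{` and `}}`.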

build_user

build_user(premise_context: str, conclusion_context: str) -> str

Return the per-sample user prompt with both contexts substituted in.

Source code in src/infereval/prompts.py
def build_user(self, premise_context: str, conclusion_context: str) -> str:
    """Return the per-sample user prompt with both contexts substituted in."""
    return self.user_template.format(
        premise_context=premise_context,
        conclusion_context=conclusion_context,
    )

compile_parser

compile_parser() -> re.Pattern[str]

Compile :attr:parse_regex as a case-insensitive pattern.

Source code in src/infereval/prompts.py
def compile_parser(self) -> re.Pattern[str]:
    """Compile :attr:`parse_regex` as a case-insensitive pattern."""
    return re.compile(self.parse_regex, re.IGNORECASE)

infereval.prompts.resolve_verification_prompt

resolve_verification_prompt(override: VerificationPromptOverride | None, *, override_id: str = 'benchmark-override-v1') -> VerificationPrompt

Return the default prompt, or a benchmark-supplied override.

Each override field that is None falls back to the framework default:

  • :attr:VerificationPromptOverride.system None → :data:DEFAULT_SYSTEM_PROMPT.
  • :attr:VerificationPromptOverride.parse_regex None → :data:DEFAULT_PARSE_REGEX.
  • :attr:VerificationPromptOverride.id None → override_id (caller-supplied fallback identifier).

A benchmark JSON can now fully specify a custom verification prompt (system + user template + parse regex + identifier) without dropping to the Python API.
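
The fallback rules can be exercised with stand-in types. `Override`, `resolve`, and the two placeholder defaults below are hypothetical; the fallback expressions follow the source listing for this function, which tests `system` against None but uses `or` for `id` and `parse_regex`.

```python
from __future__ import annotations

from dataclasses import dataclass

DEFAULT_SYSTEM = "placeholder system prompt"  # not the framework's real default
DEFAULT_REGEX = r"(GOOD|BAD|ABSTAIN)"         # not the framework's real default


@dataclass(frozen=True)
class Override:
    """Hypothetical stand-in for VerificationPromptOverride."""

    template: str
    system: str | None = None
    parse_regex: str | None = None
    id: str | None = None


def resolve(override: Override, override_id: str = "benchmark-override-v1") -> dict:
    """Apply the per-field fallbacks and return the resolved fields."""
    return {
        "id": override.id or override_id,
        "system": override.system if override.system is not None else DEFAULT_SYSTEM,
        "user_template": override.template,
        "parse_regex": override.parse_regex or DEFAULT_REGEX,
    }


resolved = resolve(
    Override(template="{premise_context} => {conclusion_context}", system="", parse_regex="")
)
print(resolved["system"] == "")                  # True: an empty system message is kept
print(resolved["parse_regex"] == DEFAULT_REGEX)  # True: an empty regex falls back
print(resolved["id"])                            # benchmark-override-v1
```

The asymmetry is worth knowing when writing benchmark JSON: an empty system string is honored as "no system message", whereas an empty id or parse_regex is treated the same as None.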

Source code in src/infereval/prompts.py
def resolve_verification_prompt(
    override: VerificationPromptOverride | None,
    *,
    override_id: str = "benchmark-override-v1",
) -> VerificationPrompt:
    """Return the default prompt, or a benchmark-supplied override.

    Each override field that is ``None`` falls back to the framework
    default:

    - :attr:`VerificationPromptOverride.system` ``None`` →
      :data:`DEFAULT_SYSTEM_PROMPT`.
    - :attr:`VerificationPromptOverride.parse_regex` ``None`` →
      :data:`DEFAULT_PARSE_REGEX`.
    - :attr:`VerificationPromptOverride.id` ``None`` → ``override_id``
      (caller-supplied fallback identifier).

    A benchmark JSON can now fully specify a custom verification prompt
    (system + user template + parse regex + identifier) without dropping
    to the Python API.
    """
    if override is None:
        return DEFAULT_VERIFICATION_PROMPT
    return VerificationPrompt(
        id=override.id or override_id,
        system=override.system if override.system is not None else DEFAULT_SYSTEM_PROMPT,
        user_template=override.template,
        parse_regex=override.parse_regex or DEFAULT_PARSE_REGEX,
    )

Endorsement

infereval.endorsement.endorse

endorse(implication: Implication, bearers: Mapping[str, Bearer], provider: Provider, config: EndorsementConfig, params: ProviderParams, *, premise_builder: ContextBuilder, conclusion_builder: ContextBuilder, verification_prompt: VerificationPrompt = DEFAULT_VERIFICATION_PROMPT, strip_tex: bool = True, request_id_prefix: str | None = None, variant: int = 0) -> EndorsementRecord

Compute :math:E_M(\langle \Gamma, \Delta \rangle) for one implication.

Issues config.n_samples calls to provider with the verification prompt built from premise_builder and conclusion_builder, parses each response, and aggregates via :func:majority_vote.

Provider sample failures (after the provider's own retries are exhausted) are recorded as sample_failed and contribute an ABSTAIN verdict to the vote.

The variant parameter selects which expression each bearer is rendered with. variant=0 (the default) uses the canonical expressions; variant=k uses bearer.paraphrases[k-1] per :func:_expressions_for. Use this to drive the paraphrase axis of variation (R10) without needing to mutate the benchmark JSON between runs.
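
The variant indexing can be checked against Bearer.all_expressions, which returns the canonical expression followed by the paraphrases. The helper below is a hypothetical simplification of the selection rule for a single bearer; how the real `_expressions_for` treats an out-of-range variant is not shown on this page, and the paraphrase strings are invented.

```python
def expression_for(expression, paraphrases, variant=0):
    """Pick the canonical expression for variant 0, else paraphrases[variant - 1]."""
    if variant == 0:
        return expression
    return paraphrases[variant - 1]


canonical = "a is a stop sign"
paraphrases = ("a is a sign that tells drivers to stop", "a is a stop signal")

# Equivalent view: index into (canonical, *paraphrases), as all_expressions() returns it.
all_expressions = (canonical, *paraphrases)

chosen = [expression_for(canonical, paraphrases, k) for k in range(3)]
print(chosen == list(all_expressions))  # True
```

A paraphrase sweep over one bearer therefore runs variants 0 through len(paraphrases), with variant 0 serving as the canonical baseline.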

Source code in src/infereval/endorsement.py
def endorse(
    implication: Implication,
    bearers: Mapping[str, Bearer],
    provider: Provider,
    config: EndorsementConfig,
    params: ProviderParams,
    *,
    premise_builder: ContextBuilder,
    conclusion_builder: ContextBuilder,
    verification_prompt: VerificationPrompt = DEFAULT_VERIFICATION_PROMPT,
    strip_tex: bool = True,
    request_id_prefix: str | None = None,
    variant: int = 0,
) -> EndorsementRecord:
    """Compute :math:`E_M(\\langle \\Gamma, \\Delta \\rangle)` for one implication.

    Issues ``config.n_samples`` calls to ``provider`` with the verification
    prompt built from ``premise_builder`` and ``conclusion_builder``,
    parses each response, and aggregates via :func:`majority_vote`.

    Provider sample failures (after the provider's own retries are
    exhausted) are recorded as ``sample_failed`` and contribute an
    ``ABSTAIN`` verdict to the vote.

    The ``variant`` parameter selects which expression each bearer is
    rendered with. ``variant=0`` (the default) uses the canonical
    expressions; ``variant=k`` uses ``bearer.paraphrases[k-1]`` per
    :func:`_expressions_for`. Use this to drive the paraphrase axis
    of variation (R10) without needing to mutate the benchmark JSON
    between runs.
    """
    premise_exprs = _expressions_for(
        implication.premises, bearers, strip_tex=strip_tex, variant=variant
    )
    conclusion_exprs = _expressions_for(
        implication.conclusions, bearers, strip_tex=strip_tex, variant=variant
    )
    premise_ctx = premise_builder(premise_exprs)
    conclusion_ctx = conclusion_builder(conclusion_exprs)
    user_text = verification_prompt.build_user(premise_ctx, conclusion_ctx)

    parser = verification_prompt.compile_parser()
    sample_records: list[SampleRecord] = []
    verdicts: list[Verdict] = []
    user_prompt_hash = prompt_hash(user_text)
    premise_ids = sorted(implication.premises)
    conclusion_ids = sorted(implication.conclusions)

    log_event(
        log,
        "item.started",
        item_id=implication.id,
        n_samples=config.n_samples,
        tie_break=config.tie_break,
        premise_ids=premise_ids,
        conclusion_ids=conclusion_ids,
        prompt_hash=user_prompt_hash,
        verification_prompt_id=verification_prompt.id,
    )

    for i in range(config.n_samples):
        rid = f"{request_id_prefix}:sample-{i}" if request_id_prefix else None
        req = SampleRequest(
            prompt=user_text,
            system=verification_prompt.system,
            temperature=params.temperature,
            max_tokens=params.max_tokens,
            top_p=params.top_p,
            seed=params.seed,
            stop=params.stop,
            request_id=rid,
        )
        try:
            result = provider.sample(req)
            verdict, status = parse_verdict(result.text, parser)
            # Promote unparseable -> budget_clipped when the provider says the
            # response was truncated by max_tokens. The verdict stays abstain
            # (Definition 2 fallback) but the parse_status now tells the user
            # the abstain is operational, not a model decision.
            if (
                status == "unparseable"
                and result.finish_reason in BUDGET_FINISH_REASONS
            ):
                status = "budget_clipped"
            record = SampleRecord(
                sample_index=i,
                raw_response=result.text,
                parsed_verdict=verdict,
                parse_status=status,
                request_id=result.request_id,
                wall_time_ms=result.wall_time_ms,
                usage=_usage_from_mapping(result.usage),
                finish_reason=result.finish_reason,
                reasoning_tokens=result.reasoning_tokens,
            )
            log_event(
                log,
                "sample.completed",
                item_id=implication.id,
                sample_index=i,
                provider=result.provider,
                model_id=result.model_id,
                request_id=result.request_id,
                prompt_hash=user_prompt_hash,
                raw_response=result.text,
                parsed_verdict=str(verdict),
                parse_status=status,
                wall_time_ms=result.wall_time_ms,
                input_tokens=result.usage.get("input_tokens") if result.usage else None,
                output_tokens=result.usage.get("output_tokens") if result.usage else None,
                finish_reason=result.finish_reason,
                reasoning_tokens=result.reasoning_tokens,
            )
        except ProviderSampleError as exc:
            log_event(
                log,
                "sample.failed",
                item_id=implication.id,
                sample_index=i,
                prompt_hash=user_prompt_hash,
                err=str(exc),
            )
            verdict = Verdict.ABSTAIN
            record = SampleRecord(
                sample_index=i,
                raw_response="",
                parsed_verdict=Verdict.ABSTAIN,
                parse_status="sample_failed",
                request_id=rid,
                wall_time_ms=None,
                usage=None,
                finish_reason=None,
                reasoning_tokens=None,
            )
        sample_records.append(record)
        verdicts.append(verdict)

    final, tie_broken = majority_vote(verdicts, tie_break=config.tie_break)
    counts: dict[Verdict, int] = {v: 0 for v in Verdict}
    for v in verdicts:
        counts[v] += 1

    log_event(
        log,
        "item.completed",
        item_id=implication.id,
        verdict=str(final),
        tie_broken=tie_broken,
        good=counts[Verdict.GOOD],
        bad=counts[Verdict.BAD],
        abstain=counts[Verdict.ABSTAIN],
    )

    return EndorsementRecord(
        implication=implication,
        samples=sample_records,
        counts=counts,
        verdict=final,
        tie_broken=tie_broken,
        premise_context=premise_ctx,
        conclusion_context=conclusion_ctx,
        rendered_user_prompt=user_text,
    )
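
The status promotion inside the sampling loop can be read as a small pure function. The sketch below is a hypothetical stand-in, not part of infereval: `promote_status` is an invented name, and the members of `BUDGET_FINISH_REASONS` are assumed, since the listing does not show them.

```python
from __future__ import annotations

# Assumed values for illustration; the real BUDGET_FINISH_REASONS
# constant is not shown in the listing above.
BUDGET_FINISH_REASONS = frozenset({"length", "max_tokens"})


def promote_status(status: str, finish_reason: str | None) -> str:
    """Relabel an unparseable response as budget_clipped when truncated.

    Hypothetical helper mirroring the check in the sampling loop.
    Only the parse status changes; the verdict stays abstain either way.
    """
    if status == "unparseable" and finish_reason in BUDGET_FINISH_REASONS:
        return "budget_clipped"
    return status


# Truncated and unparseable: the abstain is operational.
assert promote_status("unparseable", "length") == "budget_clipped"
# Finished normally but unparseable: left as is.
assert promote_status("unparseable", "stop") == "unparseable"
# A parsed response is never relabelled, even if truncated.
assert promote_status("ok", "length") == "ok"
```

Under these assumptions, a downstream consumer could filter on `parse_status == "budget_clipped"` to separate abstentions caused by the token budget from abstentions the model actually produced.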

infereval.endorsement.EndorsementRecord dataclass

EndorsementRecord(implication: Implication, samples: list[SampleRecord], counts: dict[Verdict, int], verdict: Verdict, tie_broken: bool, premise_context: str, conclusion_context: str, rendered_user_prompt: str = '')

Result of one :func:endorse call.

This is the in-memory analog of an evaluation file's per-item record; :func:infereval.evaluation.evaluate converts it to an :class:infereval.evaluation.EvaluationItem for serialization.

premise_context instance-attribute

premise_context: str

The full natural-language premise context shown to the model.

conclusion_context instance-attribute

conclusion_context: str

The full natural-language conclusion context shown to the model.

to_majority_vote

to_majority_vote() -> MajorityVote

Project counts + verdict into a Pydantic :class:MajorityVote.

Source code in src/infereval/endorsement.py
def to_majority_vote(self) -> MajorityVote:
    """Project counts + verdict into a Pydantic :class:`MajorityVote`."""
    return MajorityVote(
        good=self.counts.get(Verdict.GOOD, 0),
        bad=self.counts.get(Verdict.BAD, 0),
        abstain=self.counts.get(Verdict.ABSTAIN, 0),
        verdict=self.verdict,
        tie_broken=self.tie_broken,
    )

infereval.endorsement.parse_verdict

parse_verdict(text: str, pattern: Pattern[str] | None = None) -> tuple[Verdict, ParseStatus]

Extract a :class:Verdict from a raw model response.

Returns (Verdict.ABSTAIN, "unparseable") if no token matches or if the matched token is not a known verdict, per the paper's Definition 2 ("Unparseable responses are mapped to abstain").

Source code in src/infereval/prompts.py
def parse_verdict(
    text: str,
    pattern: re.Pattern[str] | None = None,
) -> tuple[Verdict, ParseStatus]:
    """Extract a :class:`Verdict` from a raw model response.

    Returns ``(Verdict.ABSTAIN, "unparseable")`` if no token matches or if
    the matched token is not a known verdict, per the paper's Definition 2
    ("Unparseable responses are mapped to abstain").
    """
    if pattern is None:
        pattern = DEFAULT_VERIFICATION_PROMPT.compile_parser()
    match = pattern.search(text)
    if match is None:
        return Verdict.ABSTAIN, "unparseable"
    # Verdict values are lowercase, so normalize the captured token once.
    token = match.group(1).lower()
    try:
        return Verdict(token), "ok"
    except ValueError:
        # Regex group didn't match a known verdict; treat as unparseable.
        return Verdict.ABSTAIN, "unparseable"
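
To make the two "unparseable" paths concrete, here is a self-contained sketch that mirrors the listing. It is not the library code: `Verdict` is re-declared locally with the three documented values, `parse_verdict_sketch` is an invented name, and `PATTERN` is a hypothetical regex (the real parser comes from `DEFAULT_VERIFICATION_PROMPT.compile_parser()`, which is not shown here). The extra `MAYBE` alternative exists only to exercise the `ValueError` branch.

```python
import re
from enum import Enum


class Verdict(str, Enum):
    # Local stand-in for infereval.types.Verdict.
    GOOD = "good"
    BAD = "bad"
    ABSTAIN = "abstain"


# Hypothetical pattern, assumed for illustration only.
PATTERN = re.compile(r"\b(GOOD|BAD|ABSTAIN|MAYBE)\b", re.IGNORECASE)


def parse_verdict_sketch(text: str) -> tuple[Verdict, str]:
    """First regex hit wins; anything else falls back to abstain."""
    match = PATTERN.search(text)
    if match is None:
        return Verdict.ABSTAIN, "unparseable"
    try:
        return Verdict(match.group(1).lower()), "ok"
    except ValueError:
        # The group matched a token that is not a Verdict value.
        return Verdict.ABSTAIN, "unparseable"


# A recognised token parses cleanly.
assert parse_verdict_sketch("Verdict: good") == (Verdict.GOOD, "ok")
# An explicit abstain is a model decision, so the status is "ok".
assert parse_verdict_sketch("ABSTAIN") == (Verdict.ABSTAIN, "ok")
# No token at all: fallback abstain.
assert parse_verdict_sketch("I cannot tell.") == (Verdict.ABSTAIN, "unparseable")
# Token matched by the regex but unknown to the enum: fallback abstain.
assert parse_verdict_sketch("maybe") == (Verdict.ABSTAIN, "unparseable")
```

The second and third cases return the same verdict but different statuses, which is how a caller can tell a deliberate abstention from a parsing fallback.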

infereval.endorsement.majority_vote

majority_vote(verdicts: list[Verdict], tie_break: TieBreak = 'abstain') -> tuple[Verdict, bool]

Aggregate per-sample verdicts into a single verdict.

Returns (chosen_verdict, tie_broken_flag).

Tie rules (in order):

  1. If verdicts is empty, return (ABSTAIN, False).
  2. If exactly one verdict has the max count, return it.
  3. Tie: if ABSTAIN is among the tied set, return ABSTAIN.
  4. Otherwise (pure GOOD/BAD tie), apply tie_break.

Source code in src/infereval/endorsement.py
def majority_vote(
    verdicts: list[Verdict],
    tie_break: TieBreak = "abstain",
) -> tuple[Verdict, bool]:
    """Aggregate per-sample verdicts into a single verdict.

    Returns ``(chosen_verdict, tie_broken_flag)``.

    Tie rules (in order):

    1. If ``verdicts`` is empty, return ``(ABSTAIN, False)``.
    2. If exactly one verdict has the max count, return it.
    3. Tie: if ABSTAIN is among the tied set, return ABSTAIN.
    4. Otherwise (pure GOOD/BAD tie), apply ``tie_break``.
    """
    if not verdicts:
        return Verdict.ABSTAIN, False

    counts = Counter(verdicts)
    max_count = max(counts.values())
    top = [v for v in counts if counts[v] == max_count]

    if len(top) == 1:
        return top[0], False

    # Tie. ABSTAIN wins any tie it is part of.
    if Verdict.ABSTAIN in top:
        return Verdict.ABSTAIN, True

    # Pure GOOD/BAD tie. Apply tie_break.
    if tie_break == "abstain":
        return Verdict.ABSTAIN, True
    if tie_break == "good":
        return Verdict.GOOD, True
    if tie_break == "bad":
        return Verdict.BAD, True
    if tie_break == "first":
        for v in verdicts:
            if v in top:
                return v, True
        return top[0], True  # unreachable, defensive
    # Unknown tie_break: fall back to abstain (Literal forbids this at type level).
    return Verdict.ABSTAIN, True
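
The tie rules are easiest to check against a handful of inputs. The block below is a condensed restatement written for this example, not the library function: `majority_vote_sketch` is an invented name and `Verdict` is re-declared locally so the snippet runs on its own.

```python
from collections import Counter
from enum import Enum


class Verdict(str, Enum):
    # Local stand-in for infereval.types.Verdict.
    GOOD = "good"
    BAD = "bad"
    ABSTAIN = "abstain"


def majority_vote_sketch(verdicts, tie_break="abstain"):
    """Condensed restatement of the four documented tie rules."""
    if not verdicts:
        return Verdict.ABSTAIN, False              # rule 1
    counts = Counter(verdicts)
    max_count = max(counts.values())
    top = [v for v in counts if counts[v] == max_count]
    if len(top) == 1:
        return top[0], False                       # rule 2
    if Verdict.ABSTAIN in top:
        return Verdict.ABSTAIN, True               # rule 3
    if tie_break == "first":
        # Earliest sampled verdict among the tied ones.
        return next(v for v in verdicts if v in top), True
    # "good" / "bad" pick a side; anything else abstains (rule 4).
    fixed = {"good": Verdict.GOOD, "bad": Verdict.BAD}
    return fixed.get(tie_break, Verdict.ABSTAIN), True


G, B, A = Verdict.GOOD, Verdict.BAD, Verdict.ABSTAIN

assert majority_vote_sketch([]) == (A, False)              # empty input
assert majority_vote_sketch([G, G, B]) == (G, False)       # clear majority
assert majority_vote_sketch([G, A]) == (A, True)           # abstain wins its ties
assert majority_vote_sketch([G, B, A]) == (A, True)        # three-way tie
assert majority_vote_sketch([G, B]) == (A, True)           # default tie_break
assert majority_vote_sketch([G, B], tie_break="good") == (G, True)
assert majority_vote_sketch([B, G], tie_break="first") == (B, True)
```

Note that the flag is `False` for the empty input even though the result is an abstain, and `True` whenever two or more verdicts share the top count, including ties that resolve to abstain.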

Context builders

infereval.context.resolve_context_builder

resolve_context_builder(model: TemplateContextBuilder | PluginContextBuilder) -> ContextBuilder

Convert a benchmark's serialized context-builder config to a callable.

Source code in src/infereval/context.py
def resolve_context_builder(
    model: TemplateContextBuilder | PluginContextBuilder,
) -> ContextBuilder:
    """Convert a benchmark's serialized context-builder config to a callable."""
    if isinstance(model, TemplateContextBuilder):
        return make_template_builder(template=model.template, joiner=model.joiner)
    if isinstance(model, PluginContextBuilder):
        return resolve_plugin(model.plugin)
    raise TypeError(f"Unsupported context-builder model: {type(model).__name__}")

infereval.context.resolve_context_builders

resolve_context_builders(builders: ContextBuilders) -> tuple[ContextBuilder, ContextBuilder]

Resolve a :class:ContextBuilders pair into (premise, conclusion) callables.

Source code in src/infereval/context.py
def resolve_context_builders(
    builders: ContextBuilders,
) -> tuple[ContextBuilder, ContextBuilder]:
    """Resolve a :class:`ContextBuilders` pair into ``(premise, conclusion)`` callables."""
    return (
        resolve_context_builder(builders.premise),
        resolve_context_builder(builders.conclusion),
    )

infereval.context.make_template_builder

make_template_builder(*, template: str = '{expressions}', joiner: str = ' and ') -> ContextBuilder

Build a context builder that joins expressions and formats into a template.

Parameters

template Format string with a {expressions} placeholder.

joiner Separator inserted between bearer expressions.

Returns

ContextBuilder Callable that takes a sequence of expressions and returns the formatted context.

Notes

The empty-input case returns the template formatted against an empty string, which by default yields the empty string. The endorser does not use this builder on empty implications (Definition 3 excludes them).

Source code in src/infereval/context.py
def make_template_builder(
    *, template: str = "{expressions}", joiner: str = " and "
) -> ContextBuilder:
    """Build a context builder that joins expressions and formats into a template.

    Parameters
    ----------
    template
        Format string with a ``{expressions}`` placeholder.
    joiner
        Separator inserted between bearer expressions.

    Returns
    -------
    ContextBuilder
        Callable that takes a sequence of expressions and returns the
        formatted context.

    Notes
    -----
    The empty-input case returns the template formatted against an empty
    string, which by default yields the empty string. The endorser does not
    use this builder on empty implications (Definition 3 excludes them).
    """

    def _build(expressions: Sequence[str]) -> str:
        joined = joiner.join(expressions)
        return template.format(expressions=joined)

    return _build
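
A short usage sketch may help here. The function body is restated from the listing so the snippet runs standalone; the `ContextBuilder` alias is an assumed shape, and the "Suppose that ..." template is invented for illustration.

```python
from collections.abc import Callable, Sequence

# Assumed shape of the ContextBuilder alias (not shown in this section).
ContextBuilder = Callable[[Sequence[str]], str]


def make_template_builder(
    *, template: str = "{expressions}", joiner: str = " and "
) -> ContextBuilder:
    # Restated from the listing above so the example is self-contained.
    def _build(expressions: Sequence[str]) -> str:
        return template.format(expressions=joiner.join(expressions))

    return _build


default_builder = make_template_builder()
premise_builder = make_template_builder(
    template="Suppose that {expressions}.", joiner="; "
)

exprs = ["a is a stop sign", "a is red"]
assert default_builder(exprs) == "a is a stop sign and a is red"
assert premise_builder(exprs) == "Suppose that a is a stop sign; a is red."

# Empty input: only the default template collapses to the empty string.
assert default_builder([]) == ""
assert premise_builder([]) == "Suppose that ."
```

Because the builder relies on `str.format`, a template that needs literal braces has to double them (`{{` and `}}`), and a template without the `{expressions}` placeholder silently ignores its input.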

infereval.context.strip_tex_math

strip_tex_math(text: str) -> str

Strip $...$ TeX-math delimiters, preserving their contents.

Examples

>>> strip_tex_math("$a$ is a stop sign")
'a is a stop sign'
>>> strip_tex_math("$a$ and $b$")
'a and b'
>>> strip_tex_math("no math here")
'no math here'

Unmatched single $ characters are left in place; only paired $...$ spans (without nested $) are stripped.

Source code in src/infereval/context.py
def strip_tex_math(text: str) -> str:
    """Strip ``$...$`` TeX-math delimiters, preserving their contents.

    Examples
    --------
    >>> strip_tex_math("$a$ is a stop sign")
    'a is a stop sign'
    >>> strip_tex_math("$a$ and $b$")
    'a and b'
    >>> strip_tex_math("no math here")
    'no math here'

    Unmatched single ``$`` characters are left in place; only paired
    ``$...$`` spans (without nested ``$``) are stripped.
    """
    return _TEX_MATH_RE.sub(r"\1", text)
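
The listing relies on a module-level `_TEX_MATH_RE` that is not shown. One pattern consistent with the documented behaviour is sketched below; this is an assumption, not the library's actual regex, and the `_sketch` names are invented to keep that distinction visible.

```python
import re

# Assumed pattern: a "$", one or more non-"$" characters (captured),
# then a closing "$". The real _TEX_MATH_RE may differ.
TEX_MATH_RE_SKETCH = re.compile(r"\$([^$]+)\$")


def strip_tex_math_sketch(text: str) -> str:
    """Replace each paired $...$ span with its contents."""
    return TEX_MATH_RE_SKETCH.sub(r"\1", text)


# The three documented examples.
assert strip_tex_math_sketch("$a$ is a stop sign") == "a is a stop sign"
assert strip_tex_math_sketch("$a$ and $b$") == "a and b"
assert strip_tex_math_sketch("no math here") == "no math here"

# A lone "$" has no partner, so it is left in place.
assert strip_tex_math_sketch("costs $5") == "costs $5"
```

With this assumed pattern, any two `$` characters in the same string are treated as a pair, so an expression that mentions two currency amounts would lose both dollar signs. Bearer expressions that mix prices and math would need a stricter pattern.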