API reference¶
Auto-generated from the docstrings in src/infereval/. The docstrings are
maintained as a first-class artifact and cross-referenced to the paper, so
help(infereval.metrics.cohens_kappa) is reliable; this page renders
the same content as a navigable site.
If you're looking for symbolic notation rather than callables, see the Glossary.
Core data types¶
infereval.types.Verdict ¶
Bases: str, Enum
Endorsement verdict :math:E_M(\langle \Gamma, \Delta \rangle).
String-valued so JSON serialization yields "good" / "bad" / "abstain"
directly. The (str, Enum) pattern is used in place of
:class:enum.StrEnum for Python 3.10 compatibility.
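A minimal sketch of the (str, Enum) pattern described above, written as a stand-in rather than the library's own definition. The member names GOOD, BAD, and ABSTAIN are assumed for illustration.

```python
import json
from enum import Enum


class Verdict(str, Enum):
    """Stand-in for the documented enum; member names are assumptions."""

    GOOD = "good"
    BAD = "bad"
    ABSTAIN = "abstain"


# Each member is also a str, so json.dumps emits the plain string value.
payload = json.dumps({"verdict": Verdict.GOOD})

# Looking the value up again recovers the original member.
restored = Verdict(json.loads(payload)["verdict"])
```

The same code runs unchanged on Python 3.10, where enum.StrEnum is not available.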
infereval.types.Bearer
dataclass
¶
A propositional content-bearer :math:\varphi \in B.
Parameters¶
id
Short stable identifier, e.g. "sa" for "a is a stop sign".
expression
Canonical natural-language statement :math:\delta(\varphi).
May contain TeX-math delimiters (e.g. "$a$ is a stop sign"); these
are stripped at prompt-construction time, not here.
paraphrases
Optional family of meaning-preserving variants of :math:\delta(\varphi).
Empty by default. Supports the paraphrase axis of variation discussed in
the paper's Discussion.
all_expressions ¶
infereval.types.Implication
dataclass
¶
A candidate implication :math:\langle \Gamma, \Delta \rangle.
Premises and conclusions are frozenset of bearer ids (the id field of
:class:Bearer). The optional id field is a benchmark-level reference
label; it is excluded from equality and hashing so that two implications with
the same premise/conclusion sets compare equal regardless of label.
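One way to get label-independent equality and hashing is a frozen dataclass whose label field is declared with compare=False. The sketch below is illustrative only: the field names premises and conclusions and the signature of of are assumptions, not the library's definition.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class Implication:
    """Stand-in for the documented dataclass (field names assumed)."""

    premises: FrozenSet[str]
    conclusions: FrozenSet[str]
    # compare=False removes the label from both __eq__ and __hash__.
    id: Optional[str] = field(default=None, compare=False)

    @classmethod
    def of(
        cls,
        premises: Iterable[str],
        conclusions: Iterable[str],
        id: Optional[str] = None,
    ) -> "Implication":
        # Accept any iterables and normalise them to frozensets.
        return cls(frozenset(premises), frozenset(conclusions), id)


a = Implication.of(["sa"], ["sb"], id="item-1")
b = Implication.of(("sa",), {"sb"}, id="item-2")
```

Here a and b differ only in their label, so they compare equal and collapse to one key when used in a dict or set.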
is_empty_empty
property
¶
Whether this is the :math:\langle \emptyset, \emptyset \rangle implication.
Excluded from :math:I_M by stipulation (Definition 3, last sentence).
of
classmethod
¶
Convenience constructor accepting any iterables of bearer ids.
Source code in src/infereval/types.py
intersects ¶
infereval.frame.DerivedFrame
dataclass
¶
The implication frame :math:\langle B, I_M \rangle derived from a model :math:M.
Construct via :meth:from_endorsements; do not instantiate directly unless
you have already validated that all implication bearer-ids reference
elements of bearers.
Attributes¶
bearers
Read-only mapping id -> Bearer representing :math:B.
endorsements
Read-only mapping from each queried :class:Implication to the
verdict :math:E_M returned. Implications not in this mapping are
treated as un-queried; :meth:contains returns False for them
unless clause (i) applies.
from_endorsements
classmethod
¶
from_endorsements(bearers: Mapping[str, Bearer], endorsements: Mapping[Implication, Verdict]) -> DerivedFrame
Build a frame from a bearer set and a mapping of queried endorsements.
Raises¶
ValueError
If any implication references a bearer id not present in bearers.
Source code in src/infereval/frame.py
contains ¶
Membership in :math:I_M per Definition 3.
<empty, empty> is excluded by stipulation. Otherwise an implication is a
member iff clause (i) or clause (ii) holds. For implications not in
:attr:endorsements, clause (ii) is treated as false (we have no
evidence of endorsement).
Source code in src/infereval/frame.py
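The membership rule for contains can be sketched as a standalone function. This is a reading of the description above, not the library's code: it assumes clause (i) is a non-empty overlap between premises and conclusions, that clause (ii) means the model's recorded verdict is "good", and it uses plain strings and tuples in place of the library's types.

```python
from typing import Dict, FrozenSet, Tuple

Pair = Tuple[FrozenSet[str], FrozenSet[str]]


def contains(
    premises: FrozenSet[str],
    conclusions: FrozenSet[str],
    endorsements: Dict[Pair, str],
) -> bool:
    # <empty, empty> is excluded by stipulation.
    if not premises and not conclusions:
        return False
    # Clause (i): shared bearers put the implication in the frame by construction.
    if premises & conclusions:
        return True
    # Clause (ii), assumed: the model endorsed the implication as "good".
    # Un-queried implications have no entry, so this evaluates to False.
    return endorsements.get((premises, conclusions)) == "good"


queried = {(frozenset({"sa"}), frozenset({"sb"})): "good"}
```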
satisfies_containment ¶
Containment is satisfied by construction (clause (i) of Definition 3).
This always returns True; the method is provided as an explicit
witness of the invariant noted in the paper's remark "Containment by
construction". Tests assert it to guard against future refactors that
might break the invariant.
Source code in src/infereval/frame.py
queried_implications ¶
Benchmark¶
infereval.benchmark.Benchmark ¶
Bases: BaseModel
A benchmark :math:\beta over a bearer set, analyst panel, and items.
See the paper, Definition 4.
load
classmethod
¶
Load a benchmark from JSON on disk and validate.
dump ¶
Write the benchmark to path as canonical-ish JSON.
cells ¶
Count items per cell of the fully crossed design.
Returns a mapping from cell-tuple to item count, where the
cell-tuple is the per-factor level value in the order given by
sorted(self.factors). Every cell of the cartesian product is
present in the result (count 0 if no item lands there); items
whose factor_levels don't name every declared factor are
excluded entirely (they belong to no cell).
Source code in src/infereval/benchmark.py
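The counting rule for cells can be sketched with plain dictionaries. This is an illustration under assumptions: factors is taken to map each factor name to its list of levels, and each item is represented only by its factor_levels mapping. The helper name count_cells is hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple


def count_cells(
    factors: Dict[str, List[str]],
    items: List[Dict[str, str]],
) -> Dict[Tuple[str, ...], int]:
    names = sorted(factors)  # cell-tuple order follows sorted factor names
    # Start every cell of the cartesian product at zero.
    counts = {cell: 0 for cell in product(*(factors[n] for n in names))}
    for levels in items:
        # Items that do not name every declared factor belong to no cell.
        if any(n not in levels for n in names):
            continue
        cell = tuple(levels[n] for n in names)
        if cell in counts:  # undeclared levels are skipped in this sketch
            counts[cell] += 1
    return counts


factors = {"polarity": ["neg", "pos"], "depth": ["1", "2"]}
items = [
    {"depth": "1", "polarity": "pos"},
    {"depth": "1", "polarity": "pos"},
    {"depth": "1"},  # incomplete, so excluded entirely
]
cells = count_cells(factors, items)
```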
panel_names ¶
Sorted unique panel names across :attr:analysts.
Returns [] for an unpanelled (flat) benchmark. Phase 1.4 of
the construct-validity infrastructure (R4).
Source code in src/infereval/benchmark.py
resolved_primary_panel ¶
The primary panel name to use for analyses.
Returns :attr:primary_panel if set; otherwise the
alphabetically-first declared panel name; otherwise None for
unpanelled benchmarks.
Source code in src/infereval/benchmark.py
infereval.benchmark.BenchmarkItem ¶
Bases: BaseModel
A single benchmark item: an implication paired with analyst verdicts.
analyst_rationales
class-attribute
instance-attribute
¶
analyst_rationales: list[str] | None = Field(default=None, description="Optional per-analyst, per-item rationales: the natural-language reason each analyst gave for their verdict on this item. Positionally aligned to analyst_verdicts — index j is analyst j's rationale, matching the benchmark's analysts declaration order. null (or absent) means 'this benchmark carries no rationale discipline.' A present-but-empty entry ('') means 'this analyst gave a verdict but recorded no reason on this item' — semantically distinct from null. When present, the length must equal len(benchmark.analysts).")
Optional per-analyst, per-item rationales: the natural-language
reason each analyst gave for their verdict on this item.
Positionally aligned to :attr:analyst_verdicts — index j is
analyst j's rationale, matching :attr:Benchmark.analysts
declaration order. None (or absent) means "this benchmark
carries no rationale discipline." A present-but-empty entry
("") means "this analyst gave a verdict but recorded no reason
on this item" — semantically distinct from None. When present,
the length must equal len(benchmark.analysts) (enforced in
:meth:Benchmark._check_consistency). The framework validates
structure and length only; content is the analyst's responsibility.
Added in v0.5.4 (AR1–AR12).
references
class-attribute
instance-attribute
¶
Provenance for this implication: the guideline section, paper, or regulatory document that justifies the analyst's verdict. Empty by default; populating these turns the benchmark into an auditable artifact that a domain expert can cross-check against source material.
factor_levels
class-attribute
instance-attribute
¶
Per-factor level assignments for this item, naming its position
in the benchmark's crossed design. Keys must be factor names
declared in :attr:Benchmark.factors; values must be levels from
the corresponding levels list. Empty by default — items without
factor_levels appear in no cell and are ignored by the
min_items_per_cell check.
construction_metadata
class-attribute
instance-attribute
¶
Per-item provenance for construct-validity audit: who authored
the item, when, which models the author was blind to at
construction time, and what source material they worked from.
None by default; populate selectively for items where the
provenance matters. Phase 1.3 of the construct-validity
infrastructure (R5, R8, R9).
to_implication ¶
Return the runtime :class:Implication view of this item.
infereval.benchmark.BearerModel ¶
Bases: BaseModel
JSON shape for a :class:infereval.types.Bearer.
references
class-attribute
instance-attribute
¶
Provenance for the bearer's definition, e.g. the guideline section
defining the threshold that "P/F < 300" is measured against.
infereval.benchmark.AnalystModel ¶
Bases: BaseModel
A human analyst :math:a_j whose verdicts appear in :math:V_i.
panel
class-attribute
instance-attribute
¶
Optional panel identifier. Analysts sharing the same panel string
are members of the same panel for cross-panel agreement analysis
(R4: independent reference check). None (default) means the
benchmark is flat — every analyst is treated equivalently. Adding
a panel string to ANY analyst requires ALL analysts to declare one
(no partial-panel benchmarks).
infereval.benchmark.RSRTarget ¶
Bases: BaseModel
Target inference :math:\langle X, A \rangle for an RSR-targeted item.
See the paper's Remark on "RSR-targeted benchmarks".
infereval.benchmark.ConstructionMetadata ¶
Bases: BaseModel
Per-item construction provenance for benchmark audit.
Records who authored an item, when, against what training-cutoff posture, and from what source materials. Phase 1.3 of the construct-validity infrastructure series, providing the data model for requirements R5 (documented construction), R8 (held-out items), and R9 (training-data separation).
Content is the analyst's responsibility — the framework validates
structure (Pydantic types, extra="forbid") but does not enforce
that, e.g., authored_on actually post-dates a model's training
cutoff. The point is to make the presence of these declarations
auditable.
authored_by
class-attribute
instance-attribute
¶
Identifier for the author of the item, e.g. "physician-c".
authored_on
class-attribute
instance-attribute
¶
ISO date the item was authored.
authored_blind_to_models
class-attribute
instance-attribute
¶
Model identifiers the author had not observed at construction time. Critical for R8 (held-out items): if the author had seen M's draft-version output on the item, M's agreement on the final item does not constitute independent evidence.
source
class-attribute
instance-attribute
¶
Free-form source citation for the primary material the author
worked from (e.g. "Sanford Guide to Antimicrobial Therapy 2025").
Distinct from :attr:BenchmarkItem.references, which carries the
framework-level :class:Reference objects supporting the verdict;
source is intended for the primary material, not the literature
that justifies the analyst's call.
infereval.benchmark.FactorConstraints ¶
Bases: BaseModel
Constraints the benchmark validator should enforce on the factorial design.
Currently supports min_items_per_cell: every cell of the fully
crossed design (cartesian product of all declared factor levels)
must contain at least this many items, where a cell is defined by
the per-factor level assignments in :attr:BenchmarkItem.factor_levels.
Per Closing the Construct-Validity Gap in infereval (Phase 1.1). Addresses requirement R7 (multiple items per condition) and supports R12 (per-condition decomposition).
min_items_per_cell
class-attribute
instance-attribute
¶
If set, every cell of the crossed design must have at least this
many items. Set to None to skip the cell-count validation
entirely (the per-key / per-value type checks on factor_levels
still run).
infereval.benchmark.ContextBuilders ¶
Bases: BaseModel
Pair of context builders for :math:\mathrm{ctx}_\Gamma and :math:\mathrm{ctx}_\Delta.
infereval.benchmark.TemplateContextBuilder ¶
Bases: BaseModel
A context builder specified by an inline template string.
The template is a format string with a single {expressions} placeholder.
Bearer expressions are joined by joiner to fill it.
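A minimal sketch of how such a template could be filled, assuming ordinary str.format semantics. The function name build_context and the default joiner value are hypothetical.

```python
from typing import Sequence


def build_context(template: str, expressions: Sequence[str], joiner: str = "; ") -> str:
    # Join the bearer expressions, then substitute the single placeholder.
    return template.format(expressions=joiner.join(expressions))


context = build_context(
    "Suppose that {expressions}.",
    ["a is a stop sign", "a is red"],
    joiner=" and ",
)
```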
infereval.benchmark.PluginContextBuilder ¶
Bases: BaseModel
A context builder specified by a dotted import path.
The plugin must resolve to a callable (Sequence[str]) -> str taking
bearer expressions and returning the natural-language context.
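As an illustration, a plugin with the documented (Sequence[str]) -> str shape might look like numbered_context below. The resolve_dotted helper is a hypothetical sketch of resolving a "package.module.attr" path with importlib; the exact path format the library accepts is an assumption.

```python
from importlib import import_module
from typing import Callable, Sequence


def numbered_context(expressions: Sequence[str]) -> str:
    """Example plugin: render bearer expressions as a numbered list."""
    return "\n".join(f"{i}. {e}" for i, e in enumerate(expressions, start=1))


def resolve_dotted(path: str) -> Callable[..., object]:
    """Resolve 'package.module.attr' to a callable (hypothetical helper)."""
    module_name, _, attr = path.rpartition(".")
    target = getattr(import_module(module_name), attr)
    if not callable(target):
        raise TypeError(f"{path!r} does not resolve to a callable")
    return target


rendered = numbered_context(["a is a stop sign", "a is red"])
```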
infereval.benchmark.VerificationPromptOverride ¶
Bases: BaseModel
Optional benchmark-level override of the framework's default verification prompt.
Only template is required by the schema, since it is the minimal thing
an override needs to contribute; the other three fields are optional.
When an optional field is None the framework default is used:
- :attr:system None → :data:infereval.prompts.DEFAULT_SYSTEM_PROMPT.
- :attr:parse_regex None → :data:infereval.prompts.DEFAULT_PARSE_REGEX.
- :attr:id None → the caller-supplied override_id parameter to :func:infereval.prompts.resolve_verification_prompt.
Adding the system field makes the paraphrase-axis experiment
fully JSON-drivable (no Python required to vary the verification prompt).
infereval.benchmark.Reference ¶
Bases: BaseModel
A traceable provenance entry for a benchmark, bearer, or item.
The motivating use case is regulated-domain benchmarks (medical, legal, financial) where every non-trivial implication needs a citation to a guideline, statute, or peer-reviewed source. Recording these as structured objects lets downstream tooling render bibliographies, validate DOIs, and connect items to the documents that justify them.
Only :attr:citation is required. The other fields populate when the
relevant identifier is known and remain None otherwise.
Authoring shorthand: a plain string in any references list is
auto-promoted to a :class:Reference with the string as
:attr:citation and everything else None. See
:func:_promote_reference_shorthand.
Evaluation¶
infereval.evaluation.evaluate ¶
evaluate(benchmark: Benchmark, provider: Provider, *, config: EndorsementConfig | None = None, params: ProviderParams | None = None, verification_prompt: VerificationPrompt | None = None, strip_tex: bool = True, run_id: str | None = None, log_path: Path | str | None = None, variant: int = 0) -> Evaluation
Run a model against a benchmark and assemble the resulting :math:\eta.
Iterates over every benchmark item, calls
:func:infereval.endorsement.endorse to compute :math:E_M, and
packages the per-item samples + majority-vote tally into an
:class:Evaluation.
Parameters¶
benchmark
The :math:\beta to evaluate against.
provider
Any :class:infereval.providers.Provider (Anthropic, OpenAI,
OpenRouter, or a mock).
config
Endorsement configuration. Defaults to EndorsementConfig()
(n_samples=5, tie_break=abstain, default verification prompt id).
params
Provider decoding parameters. Defaults to ProviderParams()
(temperature=1.0, max_tokens=1024).
verification_prompt
If supplied, overrides the framework default. If the benchmark
has verification_prompt set, it takes precedence over the
framework default but not over this argument.
strip_tex
Whether to strip $...$ TeX-math delimiters from bearer
expressions at prompt-construction time (default True).
run_id
Stable identifier for this evaluation run, recorded as
:attr:Evaluation.id. Generated as a UUID4 if not supplied.
log_path
Optional path for a JSONL run log; one event per line, suitable
for jq or pandas.read_json(lines=True). If None (the
default), no log file is written; library callers can still attach
their own handlers to the infereval logger.
Returns¶
Evaluation
The fully-populated :math:\eta ready to serialize to JSON.
Source code in src/infereval/evaluation.py
infereval.evaluation.Evaluation ¶
Bases: BaseModel
An evaluation :math:\eta of a model against a benchmark.
references
class-attribute
instance-attribute
¶
Corpus-level provenance, propagated from
:attr:infereval.benchmark.Benchmark.references at evaluation
time. Carries the paper, dialogue, or regulatory framework the
benchmark is derived from, so an evaluation JSON read in isolation
still names its primary sources.
paraphrase_variant
class-attribute
instance-attribute
¶
Index of the paraphrase variant used at evaluation time. 0
(default) means the canonical :attr:BearerModel.expression was
used for every bearer. k >= 1 means bearer.paraphrases[k-1]
was used per :func:infereval.endorsement._expressions_for (with
fallback to the canonical for bearers that don't carry that
paraphrase). Phase 1.2 of the construct-validity infrastructure
(R10: paraphrase variation under fixed inferential content).
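The selection rule for paraphrase_variant, including the fallback to the canonical expression, can be sketched as follows. The function name expression_for is hypothetical and the sketch works on plain strings rather than the library's bearer objects.

```python
from typing import Sequence


def expression_for(canonical: str, paraphrases: Sequence[str], variant: int) -> str:
    if variant < 0:
        raise ValueError("variant must be >= 0")
    if variant == 0:
        return canonical  # variant 0 always means the canonical expression
    index = variant - 1  # variant k maps to paraphrases[k-1]
    # Fall back to the canonical form when the bearer lacks that paraphrase.
    return paraphrases[index] if index < len(paraphrases) else canonical


paraphrases = ["a is a stop sign", "a is a sign telling drivers to stop"]
```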
endorsements ¶
Mapping Implication -> Verdict suitable for :meth:DerivedFrame.from_endorsements.
infereval.evaluation.EvaluationItem ¶
Bases: BaseModel
One row of the evaluation :math:\eta: implication + analyst verdicts + :math:E_M.
analyst_rationales
class-attribute
instance-attribute
¶
analyst_rationales: list[str] | None = Field(default=None, description="Optional per-analyst rationales propagated from the source benchmark item's analyst_rationales at evaluation build time. Positionally aligned to analyst_verdicts. null (or absent) when the source benchmark carried no rationale discipline; a present list (possibly containing empty strings) when it did. Covered by Evaluation.benchmark_hash.")
Optional per-analyst rationales propagated from
:attr:infereval.benchmark.BenchmarkItem.analyst_rationales at
evaluation-build time. Positionally aligned to
:attr:analyst_verdicts. None (or absent) when the source
benchmark carried no rationale discipline; a present list (possibly
containing empty strings) when it did. Covered by the existing
:attr:Evaluation.benchmark_hash integrity mechanism, so a
rationale cannot be silently altered between evaluation and report
without changing the hash. Added in v0.5.4 (AR8, AR9).
references
class-attribute
instance-attribute
¶
Per-item provenance, propagated from
:attr:infereval.benchmark.BenchmarkItem.references at evaluation
time. Carries the guideline / paper / regulatory citation that
justifies the analyst's verdict so the evaluation JSON is a
self-contained, auditable artifact (no need to look up the source
benchmark separately).
infereval.evaluation.EndorsementConfig ¶
Bases: BaseModel
Configuration governing how :math:E_M is computed.
Note on terminology: n_samples is the number of completions
drawn from M per benchmark item, in the LLM-literature sense of
"sample" (one draw from the model's output distribution). It is
not the number of dataset rows — that is the benchmark's item
count and is fixed by the benchmark. The methodology issues
n_samples provider calls per item, parses each completion's
verdict token, and majority-votes to compute :math:E_M for that
item. See docs/concepts.md for the full terminology note.
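The per-item vote over n_samples completions can be sketched as below. This is an illustration, not the library's implementation: it assumes the most frequent parsed verdict wins, that a tie for first place resolves to the tie-break value, and that an empty sample list is treated as a tie.

```python
from collections import Counter
from typing import Sequence, Tuple


def majority_vote(
    sample_verdicts: Sequence[str], tie_break: str = "abstain"
) -> Tuple[str, bool]:
    """Return (resolved verdict, whether the tie-break was applied)."""
    tally = Counter(sample_verdicts)
    if not tally:
        return tie_break, True  # assumption: no samples counts as a tie
    ranked = tally.most_common()
    top_count = ranked[0][1]
    leaders = [verdict for verdict, count in ranked if count == top_count]
    if len(leaders) > 1:
        return tie_break, True
    return leaders[0], False


# Five samples for one item, matching the documented default n_samples=5.
verdict, tie_broken = majority_vote(["good", "good", "bad", "good", "abstain"])
```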
infereval.evaluation.ProviderParams ¶
Bases: BaseModel
Decoding parameters passed to a provider sample call.
The max_tokens default of 1024 is sized for current frontier models
that consume budget on silent internal reasoning. See
:class:infereval.providers.base.SampleRequest and
docs/providers.md for the rationale and per-provider guidance.
infereval.evaluation.SampleRecord ¶
Bases: BaseModel
One sampled response from the provider plus its parsed verdict.
finish_reason
class-attribute
instance-attribute
¶
Provider-side stop reason, when reported. See
:class:infereval.providers.base.SampleResult.finish_reason.
reasoning_tokens
class-attribute
instance-attribute
¶
Reasoning / thinking token count, when the provider reports it.
See :class:infereval.providers.base.SampleResult.reasoning_tokens.
infereval.evaluation.MajorityVote ¶
Bases: BaseModel
Tally of parsed verdicts plus the resolved majority and tie-break flag.
infereval.evaluation.canonical_benchmark_hash ¶
SHA-256 of the benchmark's canonical-JSON form, prefixed sha256:.
Recorded in :attr:Evaluation.benchmark_hash for tamper detection.
Two benchmarks that round-trip to the same canonical JSON have the
same hash; this is robust to insertion order in dicts.
Source code in src/infereval/evaluation.py
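A sketch of an order-insensitive content hash in the same spirit. The canonicalisation used here (sorted keys, compact separators, UTF-8) is an assumption; the library's canonical-JSON form may differ, so hashes produced by this sketch are not expected to match its output.

```python
import hashlib
import json
from typing import Any


def canonical_hash(obj: Any) -> str:
    # Sorting keys makes the serialisation independent of dict insertion order.
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return "sha256:" + digest


first = canonical_hash({"id": "demo", "items": [1, 2]})
second = canonical_hash({"items": [1, 2], "id": "demo"})
```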
Metrics¶
infereval.metrics.coverage ¶
:math:\mathrm{cov}(\eta) = |\{i : E_M(I_i) \neq \text{abstain}\}| / n.
Returns 0.0 for an empty evaluation rather than raising.
Source code in src/infereval/metrics.py
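The coverage formula reduces to a short computation over the model's per-item verdicts. The sketch below takes plain strings instead of an evaluation object, which is a simplification for illustration.

```python
from typing import Sequence


def coverage(model_verdicts: Sequence[str]) -> float:
    if not model_verdicts:
        return 0.0  # empty evaluation: return 0.0 rather than raising
    answered = sum(v != "abstain" for v in model_verdicts)
    return answered / len(model_verdicts)


cov = coverage(["good", "abstain", "bad", "good"])
```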
infereval.metrics.consensus_verdict ¶
Return the analyst consensus :math:c_i for one item's verdicts.
From the paper, Definition 8: good if strict majority of analysts
say good (vs. bad); bad if strict majority say bad;
otherwise abstain. Abstain votes do not count toward the majority
of either substantive class but contribute to a tie.
Source code in src/infereval/metrics.py
infereval.metrics.consensus_reference ¶
Return :math:r(i) = c_i as a :data:ReferenceFn.
Source code in src/infereval/metrics.py
infereval.metrics.cohens_kappa ¶
:math:\kappa_C(\eta, r) = (p_o - p_e) / (1 - p_e).
Returns :data:None when :math:S(\eta, r) is empty or
:math:p_e = 1 (degenerate distribution). Logs a warning in both
cases so the user sees why the value is undefined.
Source code in src/infereval/metrics.py
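A standalone sketch of the kappa computation on two aligned verdict lists. It assumes that :math:S(\eta, r) is the set of items where both the model and the reference are non-abstain, and it omits the logged warnings.

```python
from collections import Counter
from typing import Optional, Sequence


def cohens_kappa(model: Sequence[str], reference: Sequence[str]) -> Optional[float]:
    # Keep only items where both sides gave a substantive verdict (assumed S).
    pairs = [(m, r) for m, r in zip(model, reference) if m != "abstain" and r != "abstain"]
    if not pairs:
        return None
    n = len(pairs)
    p_o = sum(m == r for m, r in pairs) / n  # observed agreement
    model_counts = Counter(m for m, _ in pairs)
    ref_counts = Counter(r for _, r in pairs)
    # Chance agreement from the product of the two marginal distributions.
    p_e = sum(model_counts[c] * ref_counts[c] for c in model_counts) / (n * n)
    if p_e == 1.0:
        return None  # degenerate distribution: kappa is undefined
    return (p_o - p_e) / (1 - p_e)


kappa = cohens_kappa(
    ["good", "bad", "good", "bad", "abstain"],
    ["good", "bad", "bad", "bad", "good"],
)
```

In this example four items survive the filter, with p_o = 0.75 and p_e = 0.5, so kappa is 0.5.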
infereval.metrics.fleiss_kappa ¶
:math:\kappa_F(\eta) with :math:M as the :math:(m+1)-th annotator.
The annotators on each item are the analyst verdicts followed by
model_verdict. Items where any annotator (analyst or model) is
non-substantive are excluded from :math:S_F per the paper's
Definition 10.
Source code in src/infereval/metrics.py
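A sketch of Fleiss' kappa with the model appended as one more annotator, following the standard formula. It assumes "non-substantive" means abstain, that every item has the same number of analysts, and that undefined cases return None; the signature is hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence


def fleiss_kappa(
    analyst_verdicts: Sequence[Sequence[str]],
    model_verdicts: Sequence[str],
) -> Optional[float]:
    # Build one annotator row per item, dropping items with any abstain.
    rows = []
    for analysts, model in zip(analyst_verdicts, model_verdicts):
        row = list(analysts) + [model]
        if "abstain" not in row:
            rows.append(row)
    if not rows:
        return None
    n = len(rows[0])  # annotators per item: m analysts plus the model
    if n < 2:
        return None
    totals: Counter = Counter()
    p_bar = 0.0
    for row in rows:
        counts = Counter(row)
        totals.update(counts)
        # Per-item agreement: share of annotator pairs that agree.
        p_bar += (sum(c * c for c in counts.values()) - n) / (n * (n - 1))
    p_bar /= len(rows)
    # Chance agreement from the pooled category proportions.
    p_e = sum((t / (len(rows) * n)) ** 2 for t in totals.values())
    if p_e == 1.0:
        return None
    return (p_bar - p_e) / (1 - p_e)


kappa_f = fleiss_kappa(
    [["good", "good"], ["bad", "bad"], ["good", "abstain"]],
    ["bad", "bad", "good"],
)
```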
infereval.metrics.inter_analyst_fleiss ¶
:math:\kappa_F^*(\beta): Fleiss' kappa over analyst verdicts alone.
Accepts either an :class:~infereval.evaluation.Evaluation or a
:class:~infereval.benchmark.Benchmark. Returns :data:None (with
a logged warning) when :math:m < 2 or when the analysts are
unanimous on every item -- the two conditions Remark 4 calls out as
making the baseline unavailable.
For panelled benchmarks (Issue #36, Phase 1.4), this returns the
κ_F* of the primary panel only — see
:func:inter_analyst_fleiss_per_panel for per-panel breakdown and
:func:cross_panel_kappa for the cross-panel agreement metric.
Source code in src/infereval/metrics.py
infereval.metrics.inter_analyst_fleiss_per_panel ¶
:math:\kappa_F^* computed per analyst panel.
Returns a mapping panel_name -> κ_F* for every panel declared on
the benchmark. A panel value is None when the panel has fewer
than 2 analysts or when the analysts are unanimous on every item
(per the same conditions :func:inter_analyst_fleiss honours).
Empty dict if the benchmark is unpanelled. Phase 1.4 of the construct-validity infrastructure (R4).
Source code in src/infereval/metrics.py
infereval.metrics.cross_panel_kappa ¶
cross_panel_kappa(benchmark: Benchmark, *, primary: str | None = None, check: str | None = None) -> float | None
Cohen's :math:\kappa_C between two panels' per-item consensus verdicts.
Computes a per-panel consensus verdict for each item (majority among panel members, abstain on tie) and then runs Cohen's kappa between the two columns, restricted to items where both panels yield a substantive verdict.
Parameters¶
benchmark
Panelled benchmark.
primary
Name of the primary panel. Defaults to
benchmark.resolved_primary_panel().
check
Name of the panel to compare against. When None and exactly
two panels are declared, picks the non-primary one
automatically.
Returns¶
float | None
Cohen's kappa over the substantive-on-both items, or None
when fewer than two non-trivial agreement counts are available,
or when either named panel doesn't exist.
Phase 1.4 of the construct-validity infrastructure (R4 — guards against shared-error agreement within the primary panel by surfacing the independent panel's view).
Source code in src/infereval/metrics.py
infereval.metrics.MetricsReport
dataclass
¶
Bundle of metrics over an :class:Evaluation, with decomposition filters.
Parameters¶
eta
The evaluation to report on.
benchmark
Optional benchmark. Required for :meth:by_rsr_target and
:meth:coverage_per_analyst_named; other methods work without it.
coverage_per_analyst_named ¶
Per-analyst coverage keyed by analyst id (requires :attr:benchmark).
Source code in src/infereval/metrics.py
cohens_kappa ¶
:math:\kappa_C(\eta, r). Default reference is the analyst consensus :math:c_i.
Source code in src/infereval/metrics.py
cohens_kappa_analyst ¶
:math:\kappa_C(\eta, v_{:,j}): M vs. one specific analyst.
by_tag ¶
Return a report restricted to items carrying tag.
Source code in src/infereval/metrics.py
by_rsr_target ¶
Return a report restricted to items whose rsr_target matches (X, A).
rsr_target lives on benchmark items, not evaluation items, so
:attr:benchmark is required.
Source code in src/infereval/metrics.py
to_dict ¶
Render as a JSON-friendly dict (None where a kappa is undefined).
Source code in src/infereval/metrics.py
Structural checks (R13)¶
infereval.structure.run_all_checks ¶
Run all three structural checks and bundle the results.
Source code in src/infereval/structure.py
infereval.structure.containment_closure_check ¶
Sanity-check that all self-implications are in I_M by construction.
Per Definition 3 clause i, every implication ⟨Γ, Δ⟩ with
Γ ∩ Δ ≠ ∅ is in I_M regardless of what the model says.
This check counts such items in the benchmark and confirms they're
structurally satisfied; it doesn't need to consult the model's
verdict (the framework guarantees it). Reported anyway because the
count itself is informative — a benchmark with zero self-implications
has different structural texture from one with many.
Source code in src/infereval/structure.py
infereval.structure.rsr_role_consistency_check ¶
Check that role-tagged items' model verdicts match the role's prediction.
For each item carrying a role tag (supporter / defeater /
irrelevant-addition) AND an rsr_target, looks up the
"base-inference" item with the same target and uses the model's
verdict on the base to predict the expected verdict on the role-tagged
item:
- supporter is supposed to strengthen the base verdict. If the base is GOOD, the supporter should remain GOOD; if the base is BAD, the supporter is excluded (a supporter can't strengthen a bad inference; that's a defeater being treated wrongly).
- defeater is supposed to flip the base verdict. If the base is GOOD, the defeater should be BAD.
- irrelevant-addition is supposed to preserve the base verdict under RSR. If the base is GOOD, the irrelevant addition should stay GOOD; if the base is BAD, it should stay BAD.
Anomalies are items whose model verdict contradicts the expected role-conditional verdict. Items where the base or the role-tagged item itself has an ABSTAIN verdict are excluded from the check (the role's prediction is undefined relative to abstention).
Source code in src/infereval/structure.py
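The role-conditional predictions described for rsr_role_consistency_check can be sketched as a lookup. This is an illustration with plain strings: the function names are hypothetical, and treating a defeater on a BAD base as "no prediction" is an assumption, since the description only covers the GOOD case.

```python
from typing import Optional


def expected_verdict(role: str, base: str) -> Optional[str]:
    """Predicted verdict for a role-tagged item, or None if excluded."""
    if base == "abstain":
        return None  # prediction is undefined relative to abstention
    if role == "supporter":
        return "good" if base == "good" else None  # BAD base: excluded
    if role == "defeater":
        # Assumption: only the GOOD-base case yields a prediction.
        return "bad" if base == "good" else None
    if role == "irrelevant-addition":
        return base  # the base verdict should be preserved
    raise ValueError(f"unknown role: {role!r}")


def is_anomaly(role: str, base: str, observed: str) -> bool:
    expected = expected_verdict(role, base)
    if expected is None or observed == "abstain":
        return False  # excluded from the check, so never an anomaly
    return observed != expected
```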
infereval.structure.base_case_stability_check ¶
When a target has multiple base-inference items, the model should agree on all of them.
Anomalies surface targets where the model gives different verdicts on multiple base-inference items, since the base verdict is what the rest of the RSR machinery is anchored to.
Source code in src/infereval/structure.py
infereval.structure.StructuralReport
dataclass
¶
StructuralReport(evaluation_id: str, benchmark_id: str, checks: tuple[StructuralCheck, ...] = tuple())
Bundle of structural checks run against an Evaluation + Benchmark pair.
all_satisfied
property
¶
True iff every check has rate == 1.0 (and no anomalies).
infereval.structure.StructuralCheck
dataclass
¶
infereval.structure.StructuralAnomaly
dataclass
¶
One item that failed a structural check, with diagnostic context.
Factor-effects model (R7 / R12)¶
infereval.modeling.fit_factor_model ¶
fit_factor_model(evaluation: Evaluation, benchmark: Benchmark, *, reference: str = _DEFAULT_REFERENCE) -> ModelFit
Logistic regression of agreement on declared factor levels.
Parameters¶
evaluation
The :class:~infereval.evaluation.Evaluation to model. Each
item's per-sample verdicts are unrolled into separate
observations; samples with ABSTAIN verdicts are dropped.
benchmark
The source :class:~infereval.benchmark.Benchmark. Must declare
at least one factor in benchmark.factors (per Phase 1.1).
reference
Which analyst column defines "agreement". "consensus"
(default) uses the per-item majority of the analyst panel
(abstain on tie). An "analyst:<id>" string picks a single
analyst column.
Returns¶
ModelFit
Raises¶
ModelingError
If the benchmark declares no factors, if no sample observations remain after dropping abstains, or if the design matrix is rank-deficient (e.g. every item in the same cell).
Source code in src/infereval/modeling.py
infereval.modeling.ModelFit
dataclass
¶
ModelFit(n_observations: int, n_items: int, n_factors: int, n_dropped_abstain: int, deviance: float, null_deviance: float, pseudo_r2: float | None, effects: tuple[FactorEffect, ...], factor_wald: dict[str, float], notes: tuple[str, ...])
Result of fitting the factor-effects logistic regression.
n_observations
instance-attribute
¶
Number of (item, sample) rows used in the fit.
n_items
instance-attribute
¶
Number of distinct items contributing observations (= number of clusters).
n_dropped_abstain
instance-attribute
¶
Sample observations excluded because the verdict was ABSTAIN.
null_deviance
instance-attribute
¶
-2 × log-likelihood of the intercept-only model.
pseudo_r2
instance-attribute
¶
McFadden's pseudo-R² = 1 - log-lik(full) / log-lik(null).
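A small worked example of how the deviance fields and McFadden's pseudo-R² relate for binary outcomes. The outcomes and the "full model" probabilities below are made-up illustrative numbers, not the result of a fit, and the helper name is hypothetical.

```python
import math
from typing import Sequence


def bernoulli_log_lik(outcomes: Sequence[int], probs: Sequence[float]) -> float:
    # Sum of log-probabilities assigned to the observed 0/1 outcomes.
    return sum(math.log(p if y == 1 else 1.0 - p) for y, p in zip(outcomes, probs))


outcomes = [1, 1, 0, 0]

# Intercept-only model: every observation gets the overall agreement rate.
rate = sum(outcomes) / len(outcomes)
ll_null = bernoulli_log_lik(outcomes, [rate] * len(outcomes))

# Illustrative probabilities standing in for a fitted model.
ll_full = bernoulli_log_lik(outcomes, [0.8, 0.8, 0.2, 0.2])

null_deviance = -2.0 * ll_null
deviance = -2.0 * ll_full
pseudo_r2 = 1.0 - ll_full / ll_null
```

Because both deviances are -2 times a log-likelihood, the same pseudo-R² can be read off as 1 - deviance / null_deviance.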
effects
instance-attribute
¶
One row per non-baseline level of each declared factor.
factor_wald
instance-attribute
¶
Per-factor joint Wald p-value testing 'this factor has no effect'.
notes
instance-attribute
¶
Methodology notes / caveats surfaced for the CLI report.
infereval.modeling.FactorEffect
dataclass
¶
FactorEffect(factor: str, level: str, coef: float, std_err: float, z_value: float, p_value: float, conf_int_low: float, conf_int_high: float)
One row of the fitted coefficient table.
Coefficients are log-odds relative to the alphabetically-first level of the same factor (the baseline). Positive coef → higher odds of agreement than the baseline level.
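Since the coefficients are log-odds, exponentiating gives an odds ratio against the baseline level, which is often easier to read. The helper below is a hypothetical convenience, not part of the documented API.

```python
import math
from typing import Tuple


def odds_ratio(coef: float, conf_int_low: float, conf_int_high: float) -> Tuple[float, float, float]:
    """Convert a log-odds coefficient and its interval to the odds-ratio scale."""
    return math.exp(coef), math.exp(conf_int_low), math.exp(conf_int_high)


# A coefficient of ln(2) means twice the baseline odds of agreement.
ratio, low, high = odds_ratio(math.log(2.0), 0.0, math.log(4.0))
```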
Sensitivity sweeps (R11)¶
infereval.sweep.run_sweep ¶
run_sweep(benchmark: Benchmark, provider: Provider, *, parameter: str, values: list[object], out_dir: Path, config: EndorsementConfig | None = None, params: ProviderParams | None = None, run_id_prefix: str | None = None) -> SweepResult
Run :func:evaluate once per value and bundle the metrics.
Per-value outputs land in out_dir with deterministic names so a
re-run replaces them in place.
Source code in src/infereval/sweep.py
infereval.sweep.SweepResult
dataclass
¶
Bundle of per-value rows + an overall stability assessment.
infereval.sweep.SweepRow
dataclass
¶
SweepRow(value: object, coverage: float, kappa_c: float | None, kappa_f: float | None, n_agreement: int, n_total: int, eta_path: Path)
Construct-validity report (R16–R21)¶
infereval.report.ConstructValidityClaims ¶
Bases: BaseModel
Top-level container for the analyst's construct-validity declarations.
stub
classmethod
¶
Return an obviously-placeholder stub for --init-claims.
Source code in src/infereval/report.py
infereval.report.MasterySenseClaim ¶
Bases: BaseModel
R16: which sense of mastery the claim is about.
sense
instance-attribute
¶
- evaluative: endorsements-when-asked (the methodology's direct measurement).
- generative: inferential behavior in unprompted production.
- standing: a dispositional competence underlying both.
- combination: a mix; describe explicitly in description.
description
instance-attribute
¶
One to three sentences, the analyst's own articulation.
infereval.report.ScopeClaim ¶
Bases: BaseModel
R17: scope the mastery claim applies over.
scope
instance-attribute
¶
- items_in_benchmark: the claim is about the specific items in β.
- domain_D_as_sampled: the claim generalises to D as sampled by β.
- general_capacity: the claim is about inferential mastery as a general capacity.
justification
instance-attribute
¶
Why this scope is appropriate given β and the methodology used.
infereval.report.ConstitutionClaim ¶
Bases: BaseModel
R18: is agreement evidence of mastery or constitutive of it?
position
instance-attribute
¶
- evidence_of_mastery: agreement is evidence for a deeper underlying property.
- constitutive_of_mastery: agreement (with structural coherence) IS mastery (Brandom's structural-behavioural characterisation).
justification
instance-attribute
¶
Brief explanation of the position taken and why.
infereval.report.CarvingClaim ¶
Bases: BaseModel
R19: carving-indexed framing of in-principle claims.
infereval.report.CompetingExplanationChecks ¶
Bases: BaseModel
R4, R8, R9, R11, R13, R14, R15: which checks were actually run.
All fields default to False (the conservative posture — the
framework assumes no check was done unless the analyst explicitly
declares it). The report's Unaddressed competing explanations
section lists every False.
infereval.report.ReportVerdict
dataclass
¶
ReportVerdict(label: Literal['defensible', 'partially_defensible', 'not_defensible'], one_liner: str, rationale: list[str])
Deterministic summary verdict computed from the claims + evidence.
infereval.report.compute_verdict ¶
compute_verdict(claims: ConstructValidityClaims, *, structure_report: dict[str, object] | None = None, benchmark: Benchmark | None = None) -> ReportVerdict
Return the deterministic summary verdict for the claims + evidence.
The verdict is computed against the claims file together with the
supplied analytical artifacts. When no artifacts are passed
(structure_report=None, benchmark=None), the verdict is
computed from claims alone and a "verdict computed unaudited"
rationale line is added so the reader can tell.
The deterministic rule:
- "defensible" iff every check required by the declared scope is marked True AND no audited check returned a failing artifact AND the carving claim is explicit (acknowledges = True iff any in-principle claims are being made) AND the benchmark supports an inter-analyst baseline when one is required by the scope.
- "not_defensible" iff more than half of the required checks are missing.
- "partially_defensible" otherwise, including the "ran but didn't pass" cases (structural anomalies present, single-analyst benchmark with items_in_benchmark scope).
Audit caps (added in v0.5.3 from external review):
- If structure_report is supplied AND structural_check_run is marked True AND the report contains any anomaly, the structural check is treated as failing: the verdict is capped at partially_defensible with a rationale line naming the count.
- If benchmark is supplied AND the scope is items_in_benchmark AND len(benchmark.analysts) < 2, the verdict is capped at partially_defensible with a rationale line surfacing the panel size. Agreement with a single analyst cannot inherit the convergent-validity guarantee that multi-analyst agreement carries.
Backwards-compatible callers that don't pass the artifacts get behaviour identical to v0.5.2 except for the additional "verdict computed unaudited" rationale line.
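The three-way rule and the audit caps described above can be sketched as a small pure function. This is a simplified illustration, not the library's implementation: the function name, the flat dict of required checks, and the three boolean flags are all hypothetical stand-ins for the claims file and artifacts.

```python
def sketch_verdict(
    required_checks: dict[str, bool],
    *,
    audit_failed: bool = False,      # hypothetical: an audited check returned a failing artifact
    carving_explicit: bool = True,   # hypothetical: the carving claim is explicit
    baseline_ok: bool = True,        # hypothetical: inter-analyst baseline present when required
) -> str:
    """Return one of the three labels following the rule as documented."""
    missing = sum(1 for done in required_checks.values() if not done)
    # More than half of the required checks missing.
    if missing > len(required_checks) / 2:
        return "not_defensible"
    # Every required check marked True and nothing else standing in the way.
    if missing == 0 and not audit_failed and carving_explicit and baseline_ok:
        return "defensible"
    # Everything else, including "ran but didn't pass" and the audit caps.
    return "partially_defensible"
```

Because "defensible" requires zero missing checks while "not_defensible" requires more than half missing, the two conditions cannot both hold, so the order of the branches does not affect the result.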
Source code in src/infereval/report.py
infereval.report.render_markdown ¶
render_markdown(*, evaluation: Evaluation, benchmark: Benchmark, claims: ConstructValidityClaims, structure_report: dict[str, object] | None = None, sweep_summary: dict[str, object] | None = None, model_fit: dict[str, object] | None = None, generated_at: datetime | None = None, suppress_negatives: bool = False) -> str
Produce the construct-validity report as Markdown.
Optional arguments (structure_report, sweep_summary,
model_fit) populate the Evidence section; when absent, that
section explicitly notes the missing evidence.
Source code in src/infereval/report.py
infereval.report.NegativeFinding
dataclass
¶
One auto-collected negative finding from a Phase 2 artifact.
A finding is "negative" in the construct-validity sense — a check that ran and returned a result that weakens or complicates the mastery claim. Per Closing the Construct-Validity Gap in infereval (Phase 3.2 / R21), the framework surfaces these by default in the report.
summary
instance-attribute
¶
One-line description rendered in the Negative findings section.
infereval.report.collect_negative_findings ¶
collect_negative_findings(*, structure_report: dict[str, object] | None = None, sweep_summary: dict[str, object] | None = None, model_fit: dict[str, object] | None = None, factor_kinds: dict[str, str] | None = None) -> list[NegativeFinding]
Scan the supplied Phase 2 artifacts and return their negative findings.
Sources:
- structure_report: each anomaly across all checks is one finding.
- sweep_summary: instability (verdict not "stable across the sweep range") is one finding.
- model_fit: factors whose Wald p > 0.05 are surfaced as no-significant-effect findings. When factor_kinds supplies a valence label for a factor, the finding's summary explicitly states whether the null is a weakening of the mastery claim (a substantive factor that didn't differentiate) or a strengthening one (an experimentally-controlled factor that properly didn't affect behavior, e.g. the paraphrase axis). Unlabelled factors get the historical neutral summary so the analyst can read the valence from context.
Parameters¶
factor_kinds
Optional mapping factor_name -> {"substantive",
"experimentally_controlled"} from Benchmark.factor_kinds.
When omitted, all null-effect findings are summarised neutrally.
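The valence logic for null effects can be illustrated with a short sketch. It assumes the per-factor Wald p-values are available as a plain factor -> p mapping, which is a simplification of the real model_fit artifact; the function name and the three valence strings are hypothetical.

```python
from typing import Optional


def null_effect_valences(
    factor_wald: dict[str, float],
    factor_kinds: Optional[dict[str, str]] = None,
    alpha: float = 0.05,
) -> dict[str, str]:
    """Map each factor with Wald p > alpha to an assumed valence label."""
    kinds = factor_kinds or {}
    valences: dict[str, str] = {}
    for factor, p_value in factor_wald.items():
        if p_value <= alpha:
            continue  # significant effect: not a null-effect finding
        kind = kinds.get(factor)
        if kind == "substantive":
            valences[factor] = "weakening"
        elif kind == "experimentally_controlled":
            valences[factor] = "strengthening"
        else:
            valences[factor] = "neutral"  # unlabelled factor
    return valences
```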
Source code in src/infereval/report.py
Providers¶
infereval.providers.get_provider ¶
Construct a provider by short name.
Parameters¶
provider
Provider short name: "anthropic", "openai", "openrouter", or "mock".
model_id
Provider-specific model identifier.
**kwargs
Passed through to the concrete provider's constructor (e.g.
api_key, base_url, retry_policy, http_referer).
Returns¶
Provider
A constructed provider instance satisfying the :class:Provider
Protocol.
Raises¶
ProviderConfigError
If provider is not a known short name or required configuration is
missing (e.g. API key not set, optional SDK not installed).
Source code in src/infereval/providers/__init__.py
infereval.providers.base.Provider ¶
Bases: Protocol
The structural contract every LLM backend must satisfy.
infereval.providers.base.BaseProvider ¶
Bases: ABC
Abstract base providing the retry loop, timing, and logging.
Subclasses set the class-level :attr:name, implement
:meth:_sample_once (one provider call), and implement
:meth:_is_transient (which exceptions warrant a retry).
Source code in src/infereval/providers/base.py
sample ¶
Sample once with retries.
Raises¶
ProviderSampleError
If all retry attempts fail with transient errors, or the first attempt
fails with a non-transient error.
Source code in src/infereval/providers/base.py
infereval.providers.base.SampleRequest
dataclass
¶
SampleRequest(prompt: str, system: str | None = None, temperature: float = 1.0, max_tokens: int = 1024, top_p: float | None = None, seed: int | None = None, stop: tuple[str, ...] = (), request_id: str | None = None)
A single completion request issued to a provider.
The max_tokens default of 1024 is sized for current frontier models
that consume budget on silent internal reasoning (DeepSeek v4-flash,
OpenAI o-series, Gemini 2.5 Pro). Pre-reasoning models will only emit
a handful of tokens for a one-word verdict regardless of this cap, so
the higher default is cheap insurance against budget-clipping. See
docs/providers.md for per-provider guidance.
request_id
class-attribute
instance-attribute
¶
Client-side correlation id propagated to logs and to :attr:SampleResult.request_id.
infereval.providers.base.SampleResult
dataclass
¶
SampleResult(text: str, provider: str, model_id: str, request_id: str | None = None, wall_time_ms: float = 0.0, usage: Mapping[str, int] = dict(), raw: Mapping[str, Any] | None = None, finish_reason: str | None = None, reasoning_tokens: int | None = None)
One completed sample from a provider.
The finish_reason and reasoning_tokens fields surface
provider-side stop-reason and reasoning-token-consumption metadata so
that downstream code (the endorser, the JSONL log, the evaluation
JSON) can distinguish budget-clipped abstains (model ran out of
tokens on silent internal reasoning) from genuine abstains (model
declined to commit). The values are passed through verbatim from
each provider — see :data:BUDGET_FINISH_REASONS for the canonical
union of values that signal a budget hit.
raw
class-attribute
instance-attribute
¶
Provider-native response payload, when available, for forensic inspection.
finish_reason
class-attribute
instance-attribute
¶
Provider-side stop reason. OpenAI: "stop" / "length" / ...;
Anthropic: "end_turn" / "max_tokens" / "stop_sequence" / ....
None if the provider didn't report one.
reasoning_tokens
class-attribute
instance-attribute
¶
Count of tokens consumed by silent internal reasoning, where the
provider exposes it (OpenAI: usage.completion_tokens_details.reasoning_tokens).
None if not reported.
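A minimal sketch of how downstream code might use finish_reason to separate budget-clipped abstains from other abstains. The set below is an assumed stand-in for BUDGET_FINISH_REASONS containing only the OpenAI and Anthropic token-limit values named above; the real constant may include more. The function name and the returned labels are hypothetical.

```python
from typing import Optional

# Assumption: a stand-in for BUDGET_FINISH_REASONS, not the library's actual value.
ASSUMED_BUDGET_FINISH_REASONS = frozenset({"length", "max_tokens"})


def classify_abstain(finish_reason: Optional[str]) -> str:
    """Label an abstaining sample by whether the provider reported a budget hit."""
    if finish_reason in ASSUMED_BUDGET_FINISH_REASONS:
        return "budget_clipped"
    # Covers normal stops and providers that reported no reason (None).
    return "genuine_or_unknown"
```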
infereval.providers.base.RetryPolicy
dataclass
¶
RetryPolicy(max_attempts: int = 4, backoff_initial_s: float = 0.5, backoff_factor: float = 2.0, jitter: float = 0.25)
Exponential-backoff-with-jitter retry policy.
Sleep before attempt i+1 (after the i-th transient failure) is
.. math:: s_i = b \cdot f^{\,i} \cdot (1 + j \cdot u)
where :math:b is backoff_initial_s, :math:f is backoff_factor,
:math:j is jitter, and :math:u \sim U[-1, 1].
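The formula can be evaluated directly, as in the sketch below. The function name and the optional rng argument are hypothetical; the defaults mirror the RetryPolicy signature, and the index i is passed in as-is since the formula's indexing convention is the only thing taken from the text.

```python
import random
from typing import Optional


def sleep_seconds(
    i: int,
    *,
    backoff_initial_s: float = 0.5,
    backoff_factor: float = 2.0,
    jitter: float = 0.25,
    rng: Optional[random.Random] = None,
) -> float:
    """Evaluate s_i = b * f**i * (1 + j * u) with u drawn uniformly from [-1, 1]."""
    source = rng if rng is not None else random
    u = source.uniform(-1.0, 1.0)
    return backoff_initial_s * backoff_factor ** i * (1.0 + jitter * u)
```

With the default jitter of 0.25, each sleep falls within 25% of the un-jittered value b · f^i, which spreads out retries from concurrent callers.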
sleep_for ¶
Return the sleep duration in seconds before the next attempt.
Source code in src/infereval/providers/base.py
infereval.providers.mock.ScriptedProvider
dataclass
¶
ScriptedProvider(responses: list[str | SampleResult], model_id: str = 'scripted-mock-v1', name: str = 'mock')
Returns a pre-determined sequence of responses, cycling on exhaustion.
Each element may be either a plain str (in which case it is wrapped
in a :class:SampleResult at sample time) or a fully-formed
:class:SampleResult.
Parameters¶
responses
Sequence of responses to return on successive sample calls.
model_id
Identifier reported in :attr:SampleResult.model_id.
name
Identifier reported in :attr:SampleResult.provider. Defaults to
"mock" so evaluation JSON written from a test cleanly identifies
itself as not real.
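The "cycling on exhaustion" behaviour can be pictured with a small stand-alone class. This is an illustrative sketch with hypothetical names, not ScriptedProvider itself, and it handles only plain strings.

```python
from dataclasses import dataclass, field


@dataclass
class CyclingResponses:
    """Hand out scripted responses in order, wrapping around at the end."""

    responses: list[str]
    _cursor: int = field(default=0, init=False)

    def next_response(self) -> str:
        if not self.responses:
            raise ValueError("at least one scripted response is required")
        text = self.responses[self._cursor % len(self.responses)]
        self._cursor += 1
        return text
```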
infereval.providers.mock.ReplayProvider ¶
Replays recorded provider responses from a JSONL fixture.
The fixture is one JSON object per line. Each record must carry a
prompt_hash (matching :func:infereval.logging_setup.prompt_hash
of the prompt that produced it) and a text field. Optional fields:
provider, model_id, request_id, wall_time_ms, usage,
raw.
When multiple records share a prompt hash, they are returned in
fixture order; ReplayProvider cycles when the per-prompt sequence
is exhausted, matching :class:ScriptedProvider semantics.
Missing prompt hashes raise :class:ProviderSampleError with a
diagnostic message listing how many hashes are recorded.
This is the M8 vehicle for byte-for-byte regression testing of the
endorsement pipeline without hitting a real API. Generate fixtures via
the developer helper at tests/fixtures/build_stop_sign_replay.py.
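The per-prompt replay semantics can be sketched as follows. The class name is hypothetical, the fixture is passed as a string rather than read from a file, and LookupError stands in for ProviderSampleError; only the prompt_hash and text fields and the cycling rule are taken from the description above.

```python
import json
from collections import defaultdict


class ReplaySketch:
    """Replay recorded texts keyed by prompt hash, cycling per hash."""

    def __init__(self, jsonl_text: str) -> None:
        self._by_hash: dict[str, list[str]] = defaultdict(list)
        self._cursor: dict[str, int] = defaultdict(int)
        for line in jsonl_text.splitlines():
            if line.strip():  # skip blank lines in the fixture
                record = json.loads(line)
                self._by_hash[record["prompt_hash"]].append(record["text"])

    def replay(self, prompt_hash: str) -> str:
        texts = self._by_hash.get(prompt_hash)
        if not texts:
            # Stand-in for ProviderSampleError, with a count for diagnostics.
            raise LookupError(
                f"no recording for {prompt_hash!r}; {len(self._by_hash)} hashes recorded"
            )
        index = self._cursor[prompt_hash]
        self._cursor[prompt_hash] = index + 1
        return texts[index % len(texts)]
```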
Source code in src/infereval/providers/mock.py
Prompts¶
infereval.prompts.VerificationPrompt
dataclass
¶
VerificationPrompt(id: str, system: str, user_template: str, parse_regex: str = DEFAULT_PARSE_REGEX)
A verification prompt template.
Attributes¶
id
Stable identifier recorded in evaluation JSON
(endorsement_config.verification_prompt_id).
system
System message sent to the provider. May be empty.
user_template
Format string with {premise_context} and {conclusion_context}
placeholders, used to build each per-sample user prompt.
parse_regex
Regex applied (case-insensitively) to the model's response. The
first match's group 1 is uppercased and interpreted as a
:class:Verdict value (GOOD / BAD / ABSTAIN).
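As an illustration of the user_template placeholders, the sketch below fills a made-up template with str.format. The template wording and the two context strings are hypothetical; the framework's default template is not reproduced here.

```python
# Hypothetical template text; only the two placeholder names come from the docs.
user_template = (
    "Premises: {premise_context}\n"
    "Conclusion: {conclusion_context}\n"
    "Answer with one word: GOOD, BAD, or ABSTAIN."
)

rendered_prompt = user_template.format(
    premise_context="a is a stop sign",
    conclusion_context="a is red",
)
```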
infereval.prompts.resolve_verification_prompt ¶
resolve_verification_prompt(override: VerificationPromptOverride | None, *, override_id: str = 'benchmark-override-v1') -> VerificationPrompt
Return the default prompt, or a benchmark-supplied override.
Each override field that is None falls back to the framework
default:
- :attr:VerificationPromptOverride.system None → :data:DEFAULT_SYSTEM_PROMPT.
- :attr:VerificationPromptOverride.parse_regex None → :data:DEFAULT_PARSE_REGEX.
- :attr:VerificationPromptOverride.id None → override_id (caller-supplied fallback identifier).
A benchmark JSON can now fully specify a custom verification prompt (system + user template + parse regex + identifier) without dropping to the Python API.
Source code in src/infereval/prompts.py
Endorsement¶
infereval.endorsement.endorse ¶
endorse(implication: Implication, bearers: Mapping[str, Bearer], provider: Provider, config: EndorsementConfig, params: ProviderParams, *, premise_builder: ContextBuilder, conclusion_builder: ContextBuilder, verification_prompt: VerificationPrompt = DEFAULT_VERIFICATION_PROMPT, strip_tex: bool = True, request_id_prefix: str | None = None, variant: int = 0) -> EndorsementRecord
Compute :math:E_M(\langle \Gamma, \Delta \rangle) for one implication.
Issues config.n_samples calls to provider with the verification
prompt built from premise_builder and conclusion_builder,
parses each response, and aggregates via :func:majority_vote.
Provider sample failures (after the provider's own retries are
exhausted) are recorded as sample_failed and contribute an
ABSTAIN verdict to the vote.
The variant parameter selects which expression each bearer is
rendered with. variant=0 (the default) uses the canonical
expressions; variant=k uses bearer.paraphrases[k-1] per
:func:_expressions_for. Use this to drive the paraphrase axis
of variation (R10) without needing to mutate the benchmark JSON
between runs.
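The variant indexing can be sketched as a small helper. The function is hypothetical and is not _expressions_for; in particular, how the library handles a variant beyond the available paraphrases is not stated here, so this sketch simply lets the IndexError propagate.

```python
def expression_for(expression: str, paraphrases: list[str], variant: int = 0) -> str:
    """Return the canonical expression for variant 0, else paraphrases[variant - 1]."""
    if variant < 0:
        raise ValueError("variant must be non-negative")
    if variant == 0:
        return expression
    return paraphrases[variant - 1]  # IndexError if the paraphrase does not exist
```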
Source code in src/infereval/endorsement.py
infereval.endorsement.EndorsementRecord
dataclass
¶
EndorsementRecord(implication: Implication, samples: list[SampleRecord], counts: dict[Verdict, int], verdict: Verdict, tie_broken: bool, premise_context: str, conclusion_context: str, rendered_user_prompt: str = '')
Result of one :func:endorse call.
This is the in-memory analog of an evaluation file's per-item record;
:func:infereval.evaluation.evaluate converts it to an
:class:infereval.evaluation.EvaluationItem for serialization.
premise_context
instance-attribute
¶
The full natural-language premise context shown to the model.
conclusion_context
instance-attribute
¶
The full natural-language conclusion context shown to the model.
to_majority_vote ¶
Project counts + verdict into a Pydantic :class:MajorityVote.
Source code in src/infereval/endorsement.py
infereval.endorsement.parse_verdict ¶
Extract a :class:Verdict from a raw model response.
Returns (Verdict.ABSTAIN, "unparseable") if no token matches, per
the paper's Definition 2 ("Unparseable responses are mapped to abstain").
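A minimal sketch of the parsing step, assuming a simple word-boundary regex. The pattern below is hypothetical and is not DEFAULT_PARSE_REGEX; verdicts are represented as plain strings rather than Verdict members, and the "parsed" note returned on success is also an assumption.

```python
import re

# Hypothetical pattern; the framework's DEFAULT_PARSE_REGEX may differ.
HYPOTHETICAL_PARSE_REGEX = r"\b(GOOD|BAD|ABSTAIN)\b"


def parse_verdict_sketch(
    response: str, parse_regex: str = HYPOTHETICAL_PARSE_REGEX
) -> tuple[str, str]:
    """Take group 1 of the first case-insensitive match, uppercased."""
    match = re.search(parse_regex, response, flags=re.IGNORECASE)
    if match is None:
        # Unparseable responses are mapped to abstain.
        return ("ABSTAIN", "unparseable")
    return (match.group(1).upper(), "parsed")
```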
Source code in src/infereval/prompts.py
infereval.endorsement.majority_vote ¶
Aggregate per-sample verdicts into a single verdict.
Returns (chosen_verdict, tie_broken_flag).
Tie rules (in order):
- If verdicts is empty, return (ABSTAIN, False).
- If exactly one verdict has the max count, return it.
- Tie: if ABSTAIN is among the tied set, return ABSTAIN.
- Otherwise (pure GOOD/BAD tie), apply tie_break.
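The tie rules above can be sketched as follows. This is an illustration under stated assumptions, not the library function: verdicts are plain strings, tie_break is assumed to be a fixed fallback verdict, and the flag is assumed to be True whenever a tie had to be resolved (the text only pins it down for the empty case).

```python
from collections import Counter


def majority_vote_sketch(verdicts: list[str], tie_break: str = "BAD") -> tuple[str, bool]:
    """Aggregate verdict strings into (chosen_verdict, tie_broken_flag)."""
    if not verdicts:
        return ("ABSTAIN", False)
    counts = Counter(verdicts)
    top = max(counts.values())
    tied = [verdict for verdict, count in counts.items() if count == top]
    if len(tied) == 1:
        return (tied[0], False)
    if "ABSTAIN" in tied:
        return ("ABSTAIN", True)  # assumption: flag set when a tie was resolved
    return (tie_break, True)  # pure GOOD/BAD tie
```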
Source code in src/infereval/endorsement.py
Context builders¶
infereval.context.resolve_context_builder ¶
Convert a benchmark's serialized context-builder config to a callable.
Source code in src/infereval/context.py
infereval.context.resolve_context_builders ¶
Resolve a :class:ContextBuilders pair into (premise, conclusion) callables.
Source code in src/infereval/context.py
infereval.context.make_template_builder ¶
Build a context builder that joins expressions and formats into a template.
Parameters¶
template
Format string with a {expressions} placeholder.
joiner
Separator inserted between bearer expressions.
Returns¶
ContextBuilder
Callable that takes a sequence of expressions and returns the formatted
context.
Notes¶
The empty-input case returns the template formatted against an empty string, which by default yields the empty string. The endorser does not use this builder on empty implications (Definition 3 excludes them).
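A minimal sketch of such a builder, using a closure over template and joiner. The function name and both default values are assumptions chosen so that empty input yields the empty string, as the note above describes; the library's actual defaults are not shown on this page.

```python
from typing import Callable, Sequence


def make_template_builder_sketch(
    template: str = "{expressions}",  # assumed default
    joiner: str = " ",                # assumed default
) -> Callable[[Sequence[str]], str]:
    """Return a builder that joins expressions and formats them into template."""

    def build(expressions: Sequence[str]) -> str:
        return template.format(expressions=joiner.join(expressions))

    return build
```

For example, a builder created with template "Suppose that {expressions}." and joiner " and " turns two bearer expressions into a single premise sentence.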
Source code in src/infereval/context.py
infereval.context.strip_tex_math ¶
Strip $...$ TeX-math delimiters, preserving their contents.
Examples¶
>>> strip_tex_math("$a$ is a stop sign")
'a is a stop sign'
>>> strip_tex_math("$a$ and $b$")
'a and b'
>>> strip_tex_math("no math here")
'no math here'
Unmatched single $ characters are left in place; only paired
$...$ spans (without nested $) are stripped.
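One way to get this behaviour is a single regular-expression substitution, sketched below. The function name and the exact pattern are assumptions; the library's implementation may differ in edge cases.

```python
import re

# A paired $...$ span whose contents contain no further "$".
_PAIRED_MATH = re.compile(r"\$([^$]*)\$")


def strip_tex_math_sketch(text: str) -> str:
    """Drop paired $...$ delimiters while keeping the enclosed text."""
    return _PAIRED_MATH.sub(r"\1", text)
```

An unmatched single "$" never forms a pair, so the pattern does not touch it.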