Documentation
The README is the front door. This directory has the longer-form material.
Guides
| Guide | When you want it |
|---|---|
| Concepts | Mental model of the methodology — bearers, expression functions, contexts, the endorsement function E_M, the derived frame, the benchmark, the metrics. Pedagogical complement to the paper, §3–4. |
| Authoring benchmarks | Writing your own benchmark.json for a new domain — bearer carving, expression functions, RSR targets, factorial design, construction provenance, reference panels, validation. |
| Interpreting metrics | What κ_C, κ_F, κ_F* (per-panel and cross-panel) tell you; reading by-tag and factor-effects decompositions; sensitivity-sweep stability verdicts; what to do about low coverage. |
| Providers | Per-provider quirks: Anthropic ignoring the seed, DeepSeek's silent reasoning tokens, OpenRouter attribution headers, the OpenAI Chat-Completions choice. |
| Construct-validity workflow | End-to-end practitioner's guide for producing reproducible, well-founded evidence for a claim of inferential mastery against a carving. Covers the framework's nine analytical capabilities plus the research-program responsibilities that remain outside the tool. |
| Closing the construct-validity gap | Implementation-annotated companion to the workflow guide — which requirements (R1–R21) each release closed, what remains research-program work, and the as-shipped coverage table. |
Tutorials (Jupyter notebooks)
Each tutorial runs end-to-end without any API key by using the bundled ReplayProvider fixture or ScriptedProvider. To replace replay/scripted with a real provider, set the appropriate API key env var (ANTHROPIC_API_KEY, OPENAI_API_KEY, or OPENROUTER_API_KEY) and swap the provider construction call.
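The provider swap described above can be sketched as a small selection helper. This is a minimal illustration, not the framework's API: the environment variable names come from the paragraph above, while `choose_provider` and the returned labels are hypothetical names introduced here. The actual provider construction call is left to the tutorials.

```python
import os

# Env var names are taken from the text; the labels on the right are
# hypothetical and only illustrate which branch a notebook would take.
API_KEY_VARS = {
    "ANTHROPIC_API_KEY": "anthropic",
    "OPENAI_API_KEY": "openai",
    "OPENROUTER_API_KEY": "openrouter",
}


def choose_provider(env=None):
    """Return the label of the first provider whose API key is set.

    Falls back to "replay" (the no-key path) when no key is present.
    """
    env = os.environ if env is None else env
    for var, label in API_KEY_VARS.items():
        if env.get(var):  # unset or empty string counts as "no key"
            return label
    return "replay"


# Explicit dicts keep the example independent of the real environment.
print(choose_provider({}))                             # replay
print(choose_provider({"OPENAI_API_KEY": "sk-test"}))  # openai
```

Falling back to the replay path when no key is set keeps a notebook runnable offline, which matches the tutorials' no-API-key default.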
| Tutorial | Topic |
|---|---|
| 01_quickstart.ipynb | The README quickstart, interactively: describe → validate → evaluate → metrics, against the committed stop-sign replay fixture. |
| 02_authoring_a_benchmark.ipynb | Build a small medical-defeasibility benchmark from scratch. Carve bearers, write expressions, declare items, validate, evaluate against the replay provider. |
| 03_paraphrase_axis_experiment.ipynb | Narrate the cross-model paraphrase-axis experiment from experiments/paraphrase_axis_triangulation.py, with explanations between the cells that produce the cross-model κ_C table. |
| 04_pulmonology_visualization.ipynb | Visual analytics on the bundled pulmonary-edema benchmark + the six cross-family evaluations: bearer co-occurrence heatmap, per-target verdict distribution, item × model verdict matrix, pairwise Cohen κ, per-item disagreement counts. Requires matplotlib + pandas. |
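For readers unfamiliar with the pairwise Cohen κ that tutorial 04 reports, the standard formula is κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The sketch below is a textbook stdlib implementation, not the notebook's code. The verdict labels are made up for illustration, and it does not describe how the framework computes κ_C or κ_F.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout: kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical verdicts from two models on four items.
model_a = ["endorse", "reject", "endorse", "endorse"]
model_b = ["endorse", "reject", "reject", "endorse"]
print(cohen_kappa(model_a, model_b))  # 0.5
```

Here p_o = 3/4 and p_e = (3·2 + 1·2)/16 = 1/2, so κ = 0.5.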
Reference
- API reference: the docstrings in src/infereval/*.py are kept comprehensive and paper-cross-referenced. help(infereval.evaluation.evaluate) is reliable.
- CLI commands (each has --help): validate, describe, evaluate, metrics, structure (v0.4.0), model (v0.4.1), sweep (v0.4.2), report (v0.5.0).
- JSON Schemas (Draft 2020-12): see schemas.md for the rendered field reference; the raw files are committed at benchmark.schema.json and evaluation.schema.json. They are generated from the Pydantic models; a drift test keeps them in sync.
- Paper: the methodology's normative specification, Note on Simonelli's Stop Sign Dialogue (Allen 2026), is maintained as a separate paper. These docs are the gentle introduction.
Stability
- Framework version: 0.x — public Python API may shift between minor releases. Stable from 1.0.
- JSON schemas (schema_version: "1.0"): versioned independently from the framework. Stability from 1.0 onward is promised regardless of framework version. The construct-validity infrastructure series (v0.3.0 → v0.5.1) added optional fields only; every pre-0.3.0 benchmark continues to validate against the current schema.
- CLI surface: subcommand and flag names track the framework version. Stable from 1.0.
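The backward-compatibility claim for the schemas rests on a simple property: adding optional fields never invalidates a document that omits them. The toy check below illustrates that under stated assumptions. It is not a JSON Schema validator and does not use the project's schema files. The `items` and `provenance` field names and the `validates` helper are hypothetical, and the sketch assumes unknown keys are rejected.

```python
def validates(doc, schema):
    """Toy check: all required keys present, no keys outside declared properties."""
    has_required = all(key in doc for key in schema["required"])
    known_only = all(key in schema["properties"] for key in doc)
    return has_required and known_only


# Hypothetical "before" and "after" schemas: the newer one adds only an
# optional property and leaves the required list untouched.
old_schema = {
    "required": ["schema_version", "items"],
    "properties": {"schema_version": {}, "items": {}},
}
new_schema = {
    "required": ["schema_version", "items"],
    "properties": {"schema_version": {}, "items": {}, "provenance": {}},
}

old_benchmark = {"schema_version": "1.0", "items": []}
new_benchmark = {"schema_version": "1.0", "items": [], "provenance": {"author": "example"}}

print(validates(old_benchmark, new_schema))  # True: older document still validates
print(validates(new_benchmark, new_schema))  # True
print(validates(new_benchmark, old_schema))  # False: compatibility runs one way only
```

The last line shows the direction of the guarantee: older documents validate against the newer schema, but not necessarily the reverse.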
See CHANGELOG.md at the repo root for per-release notes.