Providers¶

The framework supports four providers out of the box: Anthropic, OpenAI, OpenRouter, and mock (ScriptedProvider for tests, ReplayProvider for deterministic replay against a JSONL fixture).

This page is the operational reference for the three real providers — credentials, model id conventions, and the per-provider quirks that have actually bitten us.

Quick reference¶

Provider	API key env var	`pip install` extras	Best for
`anthropic`	`ANTHROPIC_API_KEY`	`infereval[anthropic]`	Claude models (Opus, Sonnet, Haiku families). Messages API.
`openai`	`OPENAI_API_KEY`	`infereval[openai]`	OpenAI models. Chat Completions API.
`openrouter`	`OPENROUTER_API_KEY`	`infereval[openai]` (shared SDK)	DeepSeek, Mistral, Llama, Qwen, and any other model OpenRouter aggregates. Also Anthropic / OpenAI models if you want a unified billing flow.
`mock`	—	(always present)	Tests; the `--replay-from` flow for deterministic CI / demo runs.

Anthropic¶

pip install 'infereval[anthropic]'
export ANTHROPIC_API_KEY=sk-ant-...
infereval evaluate bench.json \
    --provider anthropic \
    --model claude-haiku-4-5-20251001 \
    --output eta.json --n-samples 5 --log run.jsonl

Model id conventions (as of 2026-05): claude-haiku-4-5-20251001, claude-sonnet-4-5-20250929, claude-opus-4-7-20260201, plus dated variants. The shorter aliases (claude-haiku-4-5, claude-sonnet-4-5, claude-opus-4-7, etc.) also work and point at the latest version of each model line. Opus 4.7+ engages extended thinking and benefits from a larger --max-tokens cap (the framework default of 1024 handles it; bump to 2048-4096 if you observe parse_status: budget_clipped). The recorded fixture filenames in experiments/results/stop_sign/ use the human shorthand claude-opus-4.7-… (with a dot) — that's a filename convention, not an Anthropic model id; the actual id sent to the API is always the dashed form.

Quirks¶

seed is ignored. Anthropic's Messages API does not honor a seed parameter. If you supply one, the framework records it in Evaluation.model.params.seed (so you can see what was requested) and emits a one-time warning on the infereval.providers.anthropic logger: provider.anthropic.seed_ignored. The seed has no effect on the model's output. Use temperature=0.0 for the closest approximation of determinism Anthropic supports.
stop sequences are passed through as stop_sequences=[...]. The default-v1 verification prompt doesn't need them.
Per-call cost is small for typical evaluations: each sample is ~80 input tokens + ~6 output tokens. A 4-item × 5-sample run is ~$0.001 on Haiku.

Recommended defaults¶

EndorsementConfig(n_samples=5)
ProviderParams(temperature=0.0, max_tokens=1024)   # framework default

The framework defaults to max_tokens=1024 at both the Python API and the CLI (--max-tokens 1024), which is sufficient for Haiku, Sonnet, and Opus on the default-v1 prompt (Opus 4.7+ engages extended thinking, which consumes part of the budget). Older Anthropic models emit only a handful of tokens for a one-word verdict regardless of this cap.

OpenAI¶

pip install 'infereval[openai]'
export OPENAI_API_KEY=sk-...
infereval evaluate bench.json \
    --provider openai \
    --model gpt-4.1 \
    --output eta.json --n-samples 5 --log run.jsonl

Model id conventions: gpt-4.1, gpt-4.1-mini, gpt-4.1-nano, gpt-5, gpt-5-mini, etc., plus dated variants (gpt-4.1-2025-04-14, …). Reasoning-family models (o1, o3, o4-mini) work but consume tokens silently — see the reasoning-tokens caveat below.

Quirks¶

seed is honored for most chat-completion models. Passes through as-is. Reproducibility under fixed seed + temperature is the best you'll get from any provider.
API surface choice: the framework uses Chat Completions, not the newer Responses API. This was a deliberate decision to maximize OpenRouter compatibility (Chat Completions is the shared interface). If you need Responses-API-specific features (e.g. structured outputs with strict schemas), drop to the OpenAI SDK directly and pass a client= kwarg to OpenAIProvider.
Reasoning models (o1, o3, o4-mini, ...) use internal reasoning tokens that consume the max_tokens budget invisibly. See the OpenRouter section below for the canonical case (DeepSeek v4-flash); the OpenAI reasoning models exhibit the same behavior. The framework default of 1024 handles the common case; bump to 2048–4096 for o1-pro / o3-pro / heavy thinking variants.

Recommended defaults¶

# Default — works for both reasoning and non-reasoning OpenAI models:
EndorsementConfig(n_samples=5)
ProviderParams(temperature=0.0, max_tokens=1024, seed=42)   # framework default

# Heavy-reasoning models (o3-pro, o1-pro, gpt-5.4-pro) — raise the cap:
ProviderParams(temperature=0.0, max_tokens=4096, seed=42)

OpenRouter¶

pip install 'infereval[openai]'   # shared SDK with OpenAI
export OPENROUTER_API_KEY=sk-or-...
infereval evaluate bench.json \
    --provider openrouter \
    --model deepseek/deepseek-v4-flash \
    --output eta.json --n-samples 5 --max-tokens 512 --log run.jsonl \
    --http-referer https://github.com/your-org/your-repo \
    --x-title your-experiment-name

OpenRouter is OpenAI-API-compatible at the protocol level; the framework uses the OpenAI SDK with base_url="https://openrouter.ai/api/v1" and the OpenRouter API key. Any of the hundreds of models OpenRouter aggregates is reachable through this single provider.

Model id conventions: <vendor>/<model> (anthropic/claude-3.5-sonnet, openai/gpt-4o-mini, deepseek/deepseek-v4-flash, meta-llama/llama-3.3-70b-instruct, qwen/qwen-2.5-72b-instruct, ...).

Quirks¶

Attribution headers (HTTP-Referer, X-Title) are recommended for accountability and show up in your OpenRouter dashboard. The framework supports them as --http-referer and --x-title flags (or http_referer=, x_title= kwargs in the Python API).
The API key env var is OPENROUTER_API_KEY, not OPENAI_API_KEY. The framework explicitly does not fall back to the OpenAI key (and a test enforces this), so you can have both set without cross-contamination.
DeepSeek v4-flash uses silent reasoning tokens. This is the canonical reasoning-tokens case. At a low max_tokens (e.g. 32, which was the framework default before v0.5.2), the model returns content=None, finish_reason="length", and reasoning_tokens equal to the cap. The framework now: (a) defaults max_tokens to 1024 at both the Python API and the CLI, which clears this case for v4-flash and most reasoning models; (b) distinguishes budget-clipped abstains from genuine ones: when finish_reason ∈ {"length", "max_tokens"} and no verdict could be parsed, the resulting SampleRecord.parse_status is "budget_clipped" (not "unparseable"). Inspect that field in your evaluation JSON to spot any remaining budget issues. For heavy-reasoning models (openai/o1-pro, openai/o3-pro, qwen3-max-thinking), bump --max-tokens further (2048-4096).
Per-model pricing varies enormously. Check the OpenRouter model page before launching a large benchmark. deepseek/deepseek-v4-flash is cheap (~$0.10 per million input tokens); openai/o3 is not.

Recommended defaults¶

# Default works for both reasoning and non-reasoning models:
ProviderParams(temperature=0.0, max_tokens=1024)   # framework default

# Heavy reasoning (deepseek-r1, openai/o1-pro, qwen3-max-thinking):
ProviderParams(temperature=0.0, max_tokens=4096)

Listing currently available DeepSeek models¶

curl -s -H "Authorization: Bearer $OPENROUTER_API_KEY" \
    https://openrouter.ai/api/v1/models | jq '.data[] | select(.id|test("deepseek")) | .id'

The same trick works for any other vendor: replace deepseek with anthropic, meta-llama, etc.

Mock providers (testing / replay)¶

Two flavors live in infereval.providers.mock:

`ScriptedProvider`¶

from infereval.providers.mock import ScriptedProvider
from infereval.providers.base import SampleRequest

p = ScriptedProvider(responses=["GOOD", "GOOD", "BAD"])
p.sample(SampleRequest(prompt="anything"))  # -> SampleResult(text="GOOD", ...)

Cycles through responses indefinitely. Useful for unit tests that need to control what the "model" says without hitting an API.

`ReplayProvider`¶

from infereval.providers.mock import ReplayProvider

p = ReplayProvider("tests/fixtures/stop_sign_replay.jsonl")

Loads a JSONL fixture where each record carries a prompt_hash (SHA-256 of the user prompt) and a text field. When sample() is called, the prompt is hashed, the matching record (or next record sharing that hash) is returned. Missing prompt → clear ProviderSampleError with a diagnostic message.

The CLI exposes this as --replay-from <path>. It's the basis of the 60-second quickstart in the README and is the right tool for any deterministic CI / demo run.

Fixture format:

{"prompt_hash":"sha256:...","text":"GOOD","provider":"anthropic","model_id":"claude-haiku-4-5-20251001","wall_time_ms":120.0,"usage":{"input_tokens":60,"output_tokens":6}}

To author a fixture from real provider runs, the pattern is:

Run infereval evaluate ... --log run.jsonl against the real provider.
Process run.jsonl to extract one record per sample.completed event into the replay format.

A helper to do step 2 cleanly is on the roadmap (issue not yet open); for now tests/fixtures/build_stop_sign_replay.py is the canonical example.

A common-issue checklist¶

Symptom	Likely cause	Fix
Coverage near zero, lots of `parse_status: budget_clipped` with empty `raw_response`	`max_tokens` too low for a reasoning-capable model	Raise `--max-tokens`. The framework default (1024) clears most cases; bump to 2048-4096 for heavy reasoning models.
Coverage near zero, `parse_status: unparseable` (not `budget_clipped`)	Genuine unparseable output: model is talking but not producing a verdict token	Inspect a few `raw_response` strings; consider a verification prompt that's stricter about the output format
`ProviderConfigError: ANTHROPIC_API_KEY not set` despite `export ANTHROPIC_API_KEY=...` in `~/.bashrc`	Shell didn't re-source profile	`source ~/.bashrc` or open a new terminal; verify with `echo $ANTHROPIC_API_KEY \\| head -c 8`
`ProviderConfigError: anthropic SDK not installed`	`infereval` installed without provider extras	`pip install 'infereval[anthropic]'` (or `[openai]` / `[all]`)
`provider.anthropic.seed_ignored` warning on every Anthropic run	Anthropic doesn't honor seed	Expected; either drop `seed` or accept the warning
`ProviderSampleError: no recorded response for prompt_hash=...` from `ReplayProvider`	Verification prompt template, bearer expressions, or context builders have changed since the fixture was generated	Regenerate the fixture (`python -m tests.fixtures.build_stop_sign_replay`) or use a real provider for new prompts
First sample on a long evaluation takes minutes	Anthropic / OpenAI / OpenRouter latency spike on a specific prompt; the BaseProvider retry loop is sleeping with exponential backoff	Wait it out (the retry policy caps at 4 attempts), or interrupt and inspect the partial JSONL log to confirm what happened
OpenRouter dashboard shows no app attribution	Missing `--http-referer` / `--x-title`	Pass both flags

Adding a new provider¶

If you want to wire up a new backend (Cohere, Together, Replicate, a self-hosted endpoint, ...), the pattern is:

Subclass infereval.providers.base.BaseProvider.
Set name = "your-name" as a class attribute.
Implement _sample_once(self, req: SampleRequest) -> SampleResult — one provider call, no retries (the base class handles them).
Implement _is_transient(self, exc: Exception) -> bool — return True for errors the retry loop should re-try.
Register in infereval.providers.__init__.get_provider.

The Anthropic / OpenAI implementations are ~150 lines each and serve as templates. If you write one, please open a PR.

Providers¶

Quick reference¶

Anthropic¶

Quirks¶

Recommended defaults¶

OpenAI¶

Quirks¶

Recommended defaults¶

OpenRouter¶

Quirks¶

Recommended defaults¶

Listing currently available DeepSeek models¶

Mock providers (testing / replay)¶

ScriptedProvider¶

ReplayProvider¶

A common-issue checklist¶

Adding a new provider¶

`ScriptedProvider`¶

`ReplayProvider`¶