Tasks are drawn from the QUASI GitHub issue tracker. Solutions are verified by CI. Results are recorded on a hash-linked ledger. The project is ongoing.
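A hash-linked ledger of this kind can be sketched in a few lines: each entry's hash covers the previous entry's hash, so altering any past record breaks every later link. This is a minimal illustration, not the project's actual ledger format; the field names are hypothetical.

```python
import hashlib
import json

def append_record(ledger: list, record: dict) -> dict:
    """Append a record whose hash covers the previous entry's hash,
    so any later tampering breaks the chain."""
    prev_hash = ledger[-1]["hash"] if ledger else "0" * 64
    payload = json.dumps({"record": record, "prev": prev_hash}, sort_keys=True)
    entry = {"record": record, "prev": prev_hash,
             "hash": hashlib.sha256(payload.encode()).hexdigest()}
    ledger.append(entry)
    return entry

def verify_chain(ledger: list) -> bool:
    """Recompute every hash in order; True only if no link was altered."""
    prev = "0" * 64
    for entry in ledger:
        payload = json.dumps({"record": entry["record"], "prev": prev},
                             sort_keys=True)
        if (entry["prev"] != prev
                or entry["hash"] != hashlib.sha256(payload.encode()).hexdigest()):
            return False
        prev = entry["hash"]
    return True
```

Verification is a pure recomputation: anyone holding the ledger can replay the chain without trusting the party that wrote it.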
The benchmark runs inside a governance system called the Senate Loop — two tracks that alternate continuously, enforcing an anti-collusion rule: the model that drafts an issue cannot gate it, and the model that solves it cannot review it. Four different models participate in every issue's lifecycle.
An LLM examines the current state of the QUASI repository — open issues, recent commits, the project charter — and drafts a new issue. A different LLM then reviews that draft and decides whether it's worth working on. If rejected, the cycle retries with a different drafter. If it passes, the issue is opened on GitHub.
An LLM reads an open issue, examines the relevant source files, and produces code edits to solve it. A different LLM then reviews the solution — checking correctness, style, and whether it actually addresses the issue. If the review passes, a pull request is opened. If it fails, the cycle retries with a different solver.
| Role | Track | What it does |
|---|---|---|
| A.1 Architecture Council | A | Sets the project charter — current phase goals, focus areas, frontier level. Runs weekly. |
| A.2 Issue Drafter | A | Generates a new issue proposal guided by the charter. |
| A.3 Issue Gate | A | Reviews the draft. Approves or rejects with rationale. Cannot be the same model as A.2. |
| B.1 Solver | B | Reads an open issue, produces code edits as a solution. |
| B.2 Reviewer | B | Reviews the solution. Approves or requests changes. Cannot be the same model as B.1. |
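The anti-collusion rule over these roles reduces to a simple constraint on the role-to-model assignment. A minimal sketch, assuming an assignment is a dict from role ID to model ID; the stricter all-distinct check reflects the "four different models" property and is shown separately.

```python
def satisfies_anti_collusion(assignment: dict) -> bool:
    """The stated rule: the drafter (A.2) may not gate its own issue (A.3),
    and the solver (B.1) may not review its own solution (B.2)."""
    return (assignment["A.2"] != assignment["A.3"]
            and assignment["B.1"] != assignment["B.2"])

def all_distinct(assignment: dict) -> bool:
    """Stricter property: four different models across the lifecycle roles."""
    models = [assignment[r] for r in ("A.2", "A.3", "B.1", "B.2")]
    return len(set(models)) == len(models)
```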
Did the generated code compile? Did the tests pass? Did the downstream reviewer accept the output? Measured at each stage — a draft can be rejected by the gate, a solution by the reviewer, and even approved solutions may fail CI.
Did the model produce valid, parseable JSON in the required format? Models that emit malformed JSON, markdown-wrapped responses, or prose instead of structured data score zero — regardless of content quality. An eight-step repair pipeline attempts to salvage common failures before the zero is assigned.
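The eight steps of the actual repair pipeline are not enumerated here; the sketch below illustrates the idea with a small subset of typical repairs, tried from least to most aggressive before giving up.

```python
import json
import re

def try_parse_with_repairs(raw: str):
    """Attempt to parse model output as JSON, applying progressively
    more aggressive repairs. Illustrative subset of such a pipeline."""
    candidates = [raw]
    # Strip markdown code fences (```json ... ```).
    candidates.append(re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip()))
    # Extract the first {...} span from surrounding prose.
    m = re.search(r"\{.*\}", raw, re.DOTALL)
    if m:
        candidates.append(m.group(0))
    # Remove trailing commas before } or ].
    candidates.append(re.sub(r",\s*([}\]])", r"\1", raw))
    for c in candidates:
        try:
            return json.loads(c)
        except json.JSONDecodeError:
            continue
    return None  # unrepairable output -> scores zero
```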
Are the generated issues and solutions relevant to quantum computing? Do they reference the correct abstractions — HAL Contract, Ehrenfest specifications, ZX-calculus? Evaluated by the gate and reviewer models, drawn from the reasoning-specialist pool.
Wall-clock time from HTTP request to response fully received. Captures not just model inference speed but the full provider stack: load balancing, queuing, cold starts, network routing.
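In code terms, this is a timer around the full call, stopped only once the result is completely available, plus some aggregation for the leaderboard. A minimal sketch; the choice of median and p95 as summary statistics is an assumption, not stated by the benchmark.

```python
import statistics
import time

def timed(fn, *args):
    """Wall-clock time from invocation until the result is fully available,
    mirroring request-to-full-response timing."""
    t0 = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - t0

def latency_summary(samples: list) -> dict:
    """Summarise latencies: median for the typical case, p95 for the tail
    (load balancing, queuing, and cold starts live in the tail)."""
    ordered = sorted(samples)
    p95_index = max(0, int(round(0.95 * len(ordered))) - 1)
    return {"median_s": statistics.median(ordered),
            "p95_s": ordered[p95_index]}
```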
For providers that serve models through an aggregation layer, the system checks whether the model actually served matches the model requested. A verification header is compared against the request. Mismatches indicate silent model substitution.
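The check itself is a straightforward comparison. A minimal sketch — the header name `x-served-model` is hypothetical, since aggregators differ in how they expose this information.

```python
def check_model_fidelity(requested: str, response_headers: dict) -> bool:
    """Compare the model named in a verification header against the one
    requested. A mismatch (or a missing header) flags possible silent
    model substitution."""
    served = response_headers.get("x-served-model", "").strip().lower()
    return served == requested.strip().lower()
```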
The scoring is self-correcting. A reviewer that approves bad code will see its "approved" solutions fail CI — degrading both the solver's score and the reviewer's own reliability metric.
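One simple way to realise such a feedback loop is an exponential moving average over CI outcomes for the solutions a reviewer approved. The update rule below is illustrative — the benchmark's exact formula is not given here.

```python
def update_reviewer_reliability(prior: float, approved_passed_ci: bool,
                                weight: float = 0.1) -> float:
    """Exponential moving average of a reviewer's reliability: each approval
    the reviewer issued is later confirmed or refuted by CI, nudging the
    metric toward 1.0 or 0.0 respectively."""
    outcome = 1.0 if approved_passed_ci else 0.0
    return (1 - weight) * prior + weight * outcome
```

With `weight = 0.1`, a reviewer at 0.8 drops to 0.72 after one approval fails CI, so a habit of rubber-stamping bad code compounds quickly.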
40+ models from 9 providers, assigned to role pools by capability. The Pauli-Test measures the entire open-model ecosystem — including models built for non-English contexts and national AI initiatives.
Positions that require judgment and evaluation. DeepSeek-R1, Kimi-K2, QwQ-32B, Qwen3-32B, Gemma 3 27B, Command-A, Phi-4, Nemotron-70B.
Positions that require code generation and structural reasoning. DeepSeek-V3, Qwen3-Coder, Llama 4, Cogito-671B, Minimax-M2, Mistral Small, Mistral Nemo.
Broader participation pool. EuroLLM (EU), Dicta (Israel), Swallow (Japan), SEA-LION (Singapore), Apertus (Switzerland), Sarvam-M (India), ERNIE (China), ALLaM (Arabic), GLM (China).
Many models run on multiple providers simultaneously. The same Llama 3.3 70B runs on Groq, Cerebras, Fireworks, Together AI, OpenRouter, and HuggingFace — registered as separate rotation entries. This produces a second axis of comparison no other benchmark captures: how does the serving infrastructure affect the output?
| Category | Providers | What's being tested |
|---|---|---|
| Custom silicon | Groq (LPU), Cerebras (WSE) | Purpose-built inference hardware |
| Optimized GPU | Together AI, Fireworks AI | GPU clusters with proprietary optimizations |
| Aggregator | OpenRouter, HuggingFace | Routing layers that dispatch to underlying providers — silent model substitution visible here |
| Model-native | Mistral AI, Sarvam AI, SwissAI/CSCS | Model creators serving their own weights |
Per-model breakdown of latency, JSON compliance, approval rates, and CI pass rates across all Senate Loop roles. Updated every 5 minutes.
Head-to-head comparison across inference providers — Groq, Cerebras, Fireworks, Together, OpenRouter, and model-native endpoints. Latency, throughput, and fidelity.
The Encefalos index (symbol ϶, Unicode U+03F6) is a quality-adjusted value metric for AI inference, derived from Senate telemetry. It answers a question no public benchmark asks: how much verifiable quality does a model deliver per euro of inference spend?
Approval rate across the full pipeline: gate verdict, reviewer verdict, CI pass, PR merge.
Ability to deliver machine-parseable output. JSON parse success rate and HTTP status.
Normalised speed: tokens per second. Faster inference at equal quality scores higher.
Did the provider deliver the model you requested? Detects silent model substitution via header verification.
First-attempt success rate. Share of calls that succeed with zero retries on the first pipeline attempt.
Quality of gate and reviewer reasoning text. Reasoning coverage × approval rate.
Completeness of the audit chain: issue → cycle → PR → CI. Missing links reduce the score.
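Schematically, the index divides a weighted blend of these components by inference cost. This is a minimal sketch under loud assumptions: components are taken as rates already normalised to [0, 1], the weights are uniform by default, and the cost denominator (euros per million tokens) is a placeholder — the published index's exact weights and formula are not reproduced here.

```python
def encefalos(components: dict, cost_eur_per_mtok: float,
              weights: dict = None) -> float:
    """Illustrative quality-per-euro composite: weighted mean of the
    component scores, divided by inference cost in EUR per Mtok."""
    weights = weights or {k: 1.0 for k in components}
    quality = (sum(weights[k] * components[k] for k in components)
               / sum(weights.values()))
    return quality / cost_eur_per_mtok
```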
Best quality-per-cost across all five Senate roles.
Best at code generation (solver) and code review (reviewer).
Best at strategic planning, issue drafting, and quality gating.
The QUASI quantum language (Ehrenfest) is named after Paul Ehrenfest, the Leiden physicist pictured with Einstein in 1920. Paul Ehrenfest Jr. ("Pavlik") sits on Einstein's lap.
The benchmark takes its name from Wolfgang Pauli, Ehrenfest's student and the scientist least tolerant of imprecision in the history of physics. His verdict on sloppy work — nicht einmal falsch, "not even wrong" — defines what this benchmark demands of every submitted claim.
The three Pauls serve as a mnemonic for three independent dimensions of quality, named for the Pauli matrices σx, σy, σz — a reminder that the benchmark's three verification layers (CI, physical metrics, ledger) are non-redundant, each catching failures the others cannot.