The Pauli-Test · Live Evaluation

A benchmark derived from an active quantum OS project

Tasks are drawn from the QUASI GitHub issue tracker. Solutions are verified by CI. Results are recorded on a hash-linked ledger. The project is ongoing.

Abstract. We introduce the Pauli-Test, a benchmark derived from QUASI — an open-source, hardware-agnostic Quantum Operating System developed collaboratively by AI agents and human contributors under continuous integration constraints. The benchmark is named for three Pauls: Paul Ehrenfest, after whom the QUASI language is named; his son Paul Ehrenfest Jr.; and Wolfgang Pauli — Ehrenfest's student, whose verdict on imprecision, nicht einmal falsch ("not even wrong"), defines the standard the benchmark demands. Capability is measured against a five-level ladder where each merged pull request constitutes a CI-validated measurement. Complexity at higher levels is bounded below by the physics of the systems being implemented.
What makes the Pauli-Test different from synthetic benchmarks (HumanEval, MMLU, SWE-bench) is that nothing about it is synthetic. There are no prepared problem sets, no held-out test suites, no human-curated correct answers. The benchmark generates its own tasks from the actual state of the codebase, solves them with LLM-generated code, and evaluates the results through an adversarial governance pipeline. The tasks change every cycle because the codebase changes every cycle.
The Senate Loop

The benchmark runs inside a governance system called the Senate Loop — two tracks that alternate continuously, enforcing an anti-collusion rule: the model that drafts an issue cannot gate it, and the model that solves it cannot review it. Four different models participate in every issue's lifecycle.

Track A — Generate work

Issue Drafting

An LLM examines the current state of the QUASI repository — open issues, recent commits, the project charter — and drafts a new issue. A different LLM then reviews that draft and decides whether it's worth working on. If rejected, the cycle retries with a different drafter. If it passes, the issue is opened on GitHub.

Track B — Do the work

Issue Solving

An LLM reads an open issue, examines the relevant source files, and produces code edits to solve it. A different LLM then reviews the solution — checking correctness, style, and whether it actually addresses the issue. If the review passes, a pull request is opened. If it fails, the cycle retries with a different solver.

Role | Track | What it does
A.1 Architecture Council | A | Sets the project charter — current phase goals, focus areas, frontier level. Runs weekly.
A.2 Issue Drafter | A | Generates a new issue proposal guided by the charter.
A.3 Issue Gate | A | Reviews the draft. Approves or rejects with rationale. Cannot be the same model as A.2.
B.1 Solver | B | Reads an open issue, produces code edits as a solution.
B.2 Reviewer | B | Reviews the solution. Approves or requests changes. Cannot be the same model as B.1.
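The anti-collusion rule can be sketched as a role-assignment check. This is a minimal illustration, not the production scheduler: the pool contents are placeholder names drawn from the roster, and the real pools hold 40+ model×provider entries.

```python
import random

def assign_roles(coding_pool, reasoning_pool, rng=random):
    """Assign four distinct models to one issue lifecycle.
    Enforces the anti-collusion rule: the model that drafts an issue
    cannot gate it, and the model that solves it cannot review it."""
    drafter = rng.choice(coding_pool)
    solver = rng.choice([m for m in coding_pool if m != drafter])
    gate = rng.choice([m for m in reasoning_pool if m != drafter])
    reviewer = rng.choice(
        [m for m in reasoning_pool if m not in (drafter, solver, gate)]
    )
    return {"drafter": drafter, "gate": gate,
            "solver": solver, "reviewer": reviewer}
```

Because the exclusions are applied at selection time, every lifecycle involves four different models even when the pools overlap.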
What gets measured

Correctness

Did the generated code compile? Did the tests pass? Did the downstream reviewer accept the output? Measured at each stage — a draft can be rejected by the gate, a solution by the reviewer, and even approved solutions may fail CI.

Structural compliance

Did the model produce valid, parseable JSON in the required format? Models that emit malformed JSON, markdown-wrapped responses, or prose instead of structured data score zero — regardless of content quality. An eight-step repair pipeline attempts to recover common failures before a zero is recorded.
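A repair pipeline of this kind can be sketched as follows. The steps shown are illustrative (fence stripping and brace trimming), not the actual eight; the point is that each candidate repair is tried in order and only a clean parse scores.

```python
import json
import re

def repair_json(raw: str):
    """Attempt to recover a JSON object from a model response.
    Returns the parsed object, or None (a structural-compliance zero)."""
    candidates = [raw]
    # Step: strip markdown code fences such as ```json ... ```
    fenced = re.search(r"```(?:json)?\s*(.*?)```", raw, re.DOTALL)
    if fenced:
        candidates.append(fenced.group(1))
    # Step: trim prose before the first brace and after the last brace
    start, end = raw.find("{"), raw.rfind("}")
    if start != -1 and end > start:
        candidates.append(raw[start:end + 1])
    for text in candidates:
        try:
            return json.loads(text)
        except json.JSONDecodeError:
            continue
    return None
```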

Domain reasoning

Are the generated issues and solutions relevant to quantum computing? Do they reference the correct abstractions — HAL Contract, Ehrenfest specifications, ZX-calculus? Evaluated by the gate and reviewer models, drawn from the reasoning-specialist pool.

Latency

Wall-clock time from HTTP request to response fully received. Captures not just model inference speed but the full provider stack: load balancing, queuing, cold starts, network routing.
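The measurement boundary matters: the clock starts at dispatch and stops only when the response is fully received. A minimal sketch, with the HTTP client call stood in by a plain callable and illustrative telemetry field names:

```python
import time

def timed_call(model, provider, fn, *args, **kwargs):
    """Wall-clock latency of one provider call, from dispatch until the
    response is fully received — so queuing, cold starts, and network
    routing are all included, not just model inference time."""
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - t0
    # One telemetry point per call; field names are illustrative.
    point = {"model": model, "provider": provider, "latency_s": elapsed}
    return result, point
```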

Provider fidelity

For providers that serve models through an aggregation layer, the system checks whether the model actually served matches the model requested. A verification header is compared against the request. Mismatches indicate silent model substitution.
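The check itself is simple string comparison against the request. In this sketch the header name `x-served-model` and the prefix-match rule are assumptions for illustration; aggregators expose the served model in different ways, and many providers append revision or variant suffixes.

```python
def check_fidelity(requested: str, response_headers: dict) -> dict:
    """Compare the model a provider actually served against the model
    requested. An empty or different value indicates silent substitution."""
    served = response_headers.get("x-served-model", "")
    # Prefix match tolerates revision suffixes like "-instruct".
    match = bool(served) and served.lower().startswith(requested.lower())
    return {"requested": requested, "served": served, "match": match}
```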

The scoring is self-correcting. A reviewer that approves bad code will see its "approved" solutions fail CI — degrading both the solver's score and the reviewer's own reliability metric.

Model roster

40+ models from 9 providers, assigned to role pools by capability. The Pauli-Test measures the entire open-model ecosystem — including models built for non-English contexts and national AI initiatives.

Reasoning specialists

Council · Gate · Reviewer

Positions that require judgment and evaluation. DeepSeek-R1, Kimi-K2, QwQ-32B, Qwen3-32B, Gemma 3 27B, Command-A, Phi-4, Nemotron-70B.

Coding specialists

Drafter · Solver

Positions that require code generation and structural reasoning. DeepSeek-V3, Qwen3-Coder, Llama 4, Cogito-671B, Minimax-M2, Mistral Small, Mistral Nemo.

General / regional

Drafter · Solver

Broader participation pool. EuroLLM (EU), Dicta (Israel), Swallow (Japan), SEA-LION (Singapore), Apertus (Switzerland), Sarvam-M (India), ERNIE (China), ALLaM (Arabic), GLM (China).

The provider dimension

Many models run on multiple providers simultaneously. The same Llama 3.3 70B runs on Groq, Cerebras, Fireworks, Together AI, OpenRouter, and HuggingFace — registered as separate rotation entries. This produces a second axis of comparison no other benchmark captures: how does the serving infrastructure affect the output?

Category | Providers | What's being tested
Custom silicon | Groq (LPU), Cerebras (WSE) | Purpose-built inference hardware
Optimized GPU | Together AI, Fireworks AI | GPU clusters with proprietary optimizations
Aggregator | OpenRouter, HuggingFace | Routing layers that dispatch to underlying providers — silent model substitution visible here
Model-native | Mistral AI, Sarvam AI, SwissAI/CSCS | Model creators serving their own weights
300–500 cycles per day
1,200–2,000 telemetry points daily
30+ observations per model×provider×role cell within weeks
40+ models · 9 providers
Live dashboards

Model Performance

Per-model breakdown of latency, JSON compliance, approval rates, and CI pass rates across all Senate Loop roles. Updated every 5 minutes.


Provider Benchmark

Head-to-head comparison across inference providers — Groq, Cerebras, Fireworks, Together, OpenRouter, and model-native endpoints. Latency, throughput, and fidelity.

Encefalos Quality Index — ϶/€

The Encefalos index (symbol ϶, Unicode U+03F6) is a quality-adjusted value metric for AI inference, derived from Senate telemetry. It answers a question no public benchmark asks: how much verifiable quality does a model deliver per euro of inference spend?

϶ = Q / C
Q = weighted composite of seven empirical quality dimensions  ·  C = provider cost factor
Seven quality dimensions
q₁ · 35%

Correctness

Approval rate across the full pipeline: gate verdict, reviewer verdict, CI pass, PR merge.

q₂ · 20%

Structural Compliance

Ability to deliver machine-parseable output. JSON parse success rate and HTTP status.

q₃ · 10%

Latency

Normalised speed: tokens per second. Faster inference at equal quality scores higher.

q₄ · 10%

Provider Fidelity

Did the provider deliver the model you requested? Detects silent model substitution via header verification.

q₅ · 10%

Reliability

First-attempt success rate. Calls with zero retries and first pipeline attempt.

q₆ · 10%

Domain Reasoning

Quality of gate and reviewer reasoning text. Reasoning coverage × approval rate.

q₇ · 5%

Traceability

Completeness of the audit chain: issue → cycle → PR → CI. Missing links reduce the score.
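The index computation reduces to a weighted sum over the seven dimensions, divided by cost. A minimal sketch using the weights listed above (q₁–q₇, summing to 1); the dimension keys and the cost normalisation are illustrative:

```python
# Weights mirror the seven dimensions above (q1..q7).
WEIGHTS = {
    "correctness": 0.35, "structural_compliance": 0.20, "latency": 0.10,
    "provider_fidelity": 0.10, "reliability": 0.10,
    "domain_reasoning": 0.10, "traceability": 0.05,
}

def encefalos(q: dict, cost_factor: float) -> float:
    """϶ = Q / C: the weighted composite of the seven quality dimensions
    (each normalised to [0, 1]) divided by the provider cost factor."""
    Q = sum(WEIGHTS[name] * q[name] for name in WEIGHTS)
    return Q / cost_factor
```

A model with perfect scores at unit cost lands at ϶ = 1.0; halving the cost factor doubles the index at equal quality, while a model that never emits parseable JSON forfeits the full 20% weight of q₂.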

Best ϶ Model

Overall

Best quality-per-cost across all five Senate roles.

Best Coder

B1 + B2

Best at code generation (solver) and code review (reviewer).

Best Reasoner

A1 + A2 + A3

Best at strategic planning, issue drafting, and quality gating.

Why this matters. Two providers may offer the same model at different prices, but the cheaper one can still deliver inferior ϶/€ value if quality dimensions diverge. Provider fidelity (q₄) captures a failure mode — silent model substitution — that no public benchmark measures, because it requires live API telemetry to detect. The Encefalos dashboard is live at quasi.hal-contract.org/stats — Encefalos.
Naming
Einstein at the home of Paul Ehrenfest, Leiden — June 1920.
Left: Paul Ehrenfest · Centre: Paul Jr. · Right: Albert Einstein.
Public domain.
Wolfgang Pauli, Pontresina, 1931/32.
cds.cern.ch/record/42722 · © Flury, St. Moritz

Three Pauls

The QUASI quantum language (Ehrenfest) is named after Paul Ehrenfest, the Leiden physicist pictured with Einstein in 1920. Paul Ehrenfest Jr. ("Pavlik") sits on Einstein's lap.

The benchmark takes its name from Wolfgang Pauli, Ehrenfest's student and the scientist least tolerant of imprecision in the history of physics. His verdict on sloppy work — nicht einmal falsch, "not even wrong" — defines what this benchmark demands of every submitted claim.

The three Pauls serve as a mnemonic for three independent dimensions of quality, named for the Pauli matrices σx, σy, σz — a reminder that the benchmark's three verification layers (CI, physical metrics, ledger) are non-redundant, each catching failures the others cannot.

σx
Paul Ehrenfest
the language
σy
Paul Jr.
the continuity
σz
Wolfgang Pauli
the standard
Advancement: ≥5 issues resolved at level L, CI passing, no human edits to the PR branch, verifiable from GitHub.
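The advancement criterion can be expressed as a simple predicate over merged pull requests. The record fields below are illustrative; the real check is verifiable directly from GitHub.

```python
def may_advance(merged_prs: list, level: int) -> bool:
    """Capability-ladder advancement: at least five issues resolved at
    level L, each with CI passing and no human edits on the PR branch."""
    qualifying = [pr for pr in merged_prs
                  if pr["level"] == level
                  and pr["ci_passed"]
                  and not pr["human_edits"]]
    return len(qualifying) >= 5
```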