Reproducibility, by design.

Every QueryGym run produces a single JSON file conforming to a public, versioned schema. Leaderboard submissions carry that JSON, a TREC-format run file, and the reformulated queries; together they let anyone reconstruct the experiment from a fresh clone.
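For readers new to the TREC run format: each line holds six whitespace-separated fields (query ID, the literal `Q0`, document ID, rank, score, run tag). The snippet below is a minimal sketch of a sanity check you could run on a run file before submitting. It is not part of QueryGym; the function name `parse_trec_run` and the sample IDs are made up for illustration.

```python
from typing import Iterable, List, Tuple

Row = Tuple[str, str, int, float, str]


def parse_trec_run(lines: Iterable[str]) -> List[Row]:
    """Parse TREC run lines of the form: qid Q0 docid rank score tag.

    Hypothetical helper for illustration; not a QueryGym API.
    """
    rows: List[Row] = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        parts = line.split()
        if len(parts) != 6:
            raise ValueError(f"line {lineno}: expected 6 fields, got {len(parts)}")
        qid, _q0, docid, rank, score, tag = parts
        # int() / float() raise ValueError if rank or score are malformed
        rows.append((qid, docid, int(rank), float(score), tag))
    return rows


# Illustrative data only: these IDs do not come from any real run.
sample = [
    "q1 Q0 d1 1 12.5 demo-run",
    "q1 Q0 d2 2 11.0 demo-run",
]
rows = parse_trec_run(sample)
```

To check a real file, pass an open file handle instead of the list, for example `parse_trec_run(open(path, encoding="utf-8"))`.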

subdomain

leaderboard.querygym.com

Browse the full SIGIR 2026 reproducibility leaderboard. Per-dataset, per-method, per-LLM views with citable run files.

View Leaderboard →

schema

Field-by-field documentation, validation rules, and a worked example. Mirrors the canonical JSON Schema file.

Read schema.md ↗

Submitting a result

Run the example pipeline, copy the outputs into the canonical layout with submit_run.py, regenerate the aggregate, and open a PR. CI validates the JSON against the schema; a maintainer verifies the numbers locally before merge.

bash

# 1. Run the example pipeline
python examples/querygym_pyserini/pipeline.py \
    --dataset msmarco-v1-passage.trecdl2019 \
    --method query2e --model gpt-4.1-mini \
    --output-dir outputs/dl19_query2e

# 2. Copy into the canonical layout
python -m reproducibility.scripts.submit_run \
    --from-dir outputs/dl19_query2e

# 3. Regenerate the aggregate
make repro-aggregate

# 4. Open a PR
git add reproducibility/data/ && git commit && git push
gh pr create
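Because CI validates the JSON against the schema, a quick local pre-check can catch obvious omissions before you open the PR. The sketch below only compares top-level keys against a JSON Schema's `required` array, so it is a rough approximation and no substitute for the CI validation. The helper name `missing_required_keys` and the toy field names (`dataset`, `method`, `model`) are assumptions chosen to mirror the CLI flags above, not the actual QueryGym schema.

```python
import json
from typing import Any, Dict, List


def missing_required_keys(result: Dict[str, Any], schema: Dict[str, Any]) -> List[str]:
    """Return top-level keys that the schema's `required` array lists
    but that are absent from the result.

    Hypothetical helper; checks only the top level, not nested objects or types.
    """
    required = schema.get("required", [])
    return [key for key in required if key not in result]


# Toy stand-ins: the real schema and result fields may differ.
toy_schema = json.loads('{"type": "object", "required": ["dataset", "method", "model"]}')
toy_result = {"dataset": "msmarco-v1-passage.trecdl2019", "method": "query2e"}

missing = missing_required_keys(toy_result, toy_schema)
print(missing)  # prints ['model']
```

In practice you would load both the schema file and the run's JSON with `json.load` and treat a non-empty list as a signal to fix the output before pushing.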

Papers

QueryGym is backed by two papers: the toolkit demo and a multi-LLM reproduction study. Both link directly to the committed corpus.

See full citations →