Reproducibility, by design.
Every QueryGym run produces a single JSON file conforming to a public, versioned schema. Each leaderboard submission carries that JSON file, a TREC-format run file, and the reformulated queries; together they are enough to reconstruct the experiment from a fresh clone.
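To make the run-file part concrete, here is a minimal sketch of a sanity check for a TREC-format run file, which uses six whitespace-separated columns per line: query id, the literal Q0, document id, rank, score, and run tag. The helper names (parse_trec_run, unsorted_queries) and the sample ids are hypothetical and are not part of QueryGym; the leaderboard's own validation may differ.

```python
from typing import NamedTuple


class RunEntry(NamedTuple):
    query_id: str
    doc_id: str
    rank: int
    score: float
    tag: str


def parse_trec_run(text: str) -> list[RunEntry]:
    """Parse TREC run lines of the form `qid Q0 docid rank score tag`."""
    entries = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        fields = line.split()
        if len(fields) != 6:
            raise ValueError(f"line {line_no}: expected 6 fields, got {len(fields)}")
        qid, _q0, doc_id, rank, score, tag = fields
        entries.append(RunEntry(qid, doc_id, int(rank), float(score), tag))
    return entries


def unsorted_queries(entries: list[RunEntry]) -> list[str]:
    """Return query ids whose scores increase somewhere in file order."""
    bad: list[str] = []
    last_score: dict[str, float] = {}
    for entry in entries:
        prev = last_score.get(entry.query_id)
        if prev is not None and entry.score > prev and entry.query_id not in bad:
            bad.append(entry.query_id)
        last_score[entry.query_id] = entry.score
    return bad


# Hypothetical sample: query ids, doc ids, and the run tag are made up.
SAMPLE = """\
q1 Q0 doc_a 1 12.5 demo_run
q1 Q0 doc_b 2 11.0 demo_run
q2 Q0 doc_c 1 3.0 demo_run
q2 Q0 doc_d 2 4.5 demo_run
"""

if __name__ == "__main__":
    parsed = parse_trec_run(SAMPLE)
    print(len(parsed), "entries; unsorted queries:", unsorted_queries(parsed))
```

A check like this only confirms that the file is well-formed; it says nothing about whether the reported metrics match, which is what the maintainer's local verification covers.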
Browse the full SIGIR 2026 reproducibility leaderboard. Per-dataset, per-method, per-LLM views with citable run files.
Field-by-field documentation, validation rules, and a worked example. Mirrors the canonical JSON Schema file.
Submitting a result
Run the example pipeline, copy its output into the canonical layout with submit_run.py, regenerate the aggregate, and open a PR. CI validates the JSON against the schema; a maintainer verifies the numbers locally before merge.
# 1. Run the example pipeline
python examples/querygym_pyserini/pipeline.py \
  --dataset msmarco-v1-passage.trecdl2019 \
  --method query2e --model gpt-4.1-mini \
  --output-dir outputs/dl19_query2e
# 2. Copy into the canonical layout
python -m reproducibility.scripts.submit_run \
  --from-dir outputs/dl19_query2e
# 3. Regenerate the aggregate
make repro-aggregate
# 4. Open a PR
git add reproducibility/data/ && git commit && git push
gh pr create
Papers
QueryGym is backed by two papers: the toolkit demo and a multi-LLM reproduction study. Both link directly to the committed corpus.