Reproducible query reformulation, powered by LLMs.
QueryGym is a toolkit for benchmarking and reproducing LLM-based query rewriting methods across IR datasets. Open prompt bank, pluggable searchers, frozen schemas, citable runs.
What you get
QueryGym pairs a small, opinionated library with a contract-driven reproducibility pipeline. The toolkit and the leaderboard share the same data shape.
- 📚 Single Prompt Bank: One YAML registry of every prompt, with version, license, and authorship metadata. Cite the exact text used in any run.
- 🔌 Pluggable searchers: Drop-in adapters for Pyserini, PyTerrier, BEIR, MS MARCO, and any custom retriever. Bring your own index.
- 🧬 Stable run schema: Every run emits a versioned JSON file conforming to a public JSON Schema, with the same shape for the toolkit and third-party submitters.
- 🧠 OpenAI-compatible LLMs: Works with any OpenAI-compatible endpoint. Switch between GPT-4.1, Qwen, Mistral, vLLM, and Ollama with a config change.
- 🔁 Reproducible by design: Every leaderboard row links a JSON, a TREC run file, and the reformulated queries. Re-evaluate from a fresh clone.
- 📄 Citable artifacts: Backed by two papers (WWW 2026 Demos, SIGIR 2026 Reproducibility) and a tagged reproducibility corpus on GitHub.
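To make the "TREC run file" artifact concrete, here is a minimal sketch of writing rankings in the standard six-column TREC run format (`qid Q0 docid rank score tag`). The function name `format_trec_run`, the score precision, and the example run tag are assumptions for illustration; this is not QueryGym's own writer.

```python
from typing import Dict, List, Tuple


def format_trec_run(
    rankings: Dict[str, List[Tuple[str, float]]], run_tag: str
) -> List[str]:
    """Return TREC run lines for {query_id: [(doc_id, score), ...]}.

    Hypothetical helper: shows the file format only, not QueryGym's API.
    """
    lines = []
    for qid, scored_docs in rankings.items():
        # Higher score means more relevant, so rank 1 is the top score.
        ordered = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
        for rank, (docid, score) in enumerate(ordered, start=1):
            lines.append(f"{qid} Q0 {docid} {rank} {score:.4f} {run_tag}")
    return lines


if __name__ == "__main__":
    # Toy scores for illustration only; these are not real results.
    demo = {"q1": [("d2", 1.5), ("d1", 3.0)]}
    print("\n".join(format_trec_run(demo, "genqr_ensemble")))
```

A file in this format can be scored with standard evaluation tools such as `trec_eval`, which is what makes re-evaluation from a fresh clone practical.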
The ecosystem
QueryGym is split into three surfaces. They share the same data contract, so a run from the toolkit or a third-party submitter lands in the same leaderboard.
- The toolkit itself: methods, prompt bank, searchers, and CLI.
- The leaderboard: SIGIR 2026 reproducibility results across IR benchmarks, with every row backed by a citable JSON.
- The documentation: API reference, methods reference, contributor guide, and schema docs.
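The shared data contract can be pictured as a versioned run record that both the toolkit and a third-party submitter must satisfy before a run reaches the leaderboard. The sketch below is an assumption-laden illustration: the field names, the version string `"1.0"`, and `validate_run_record` are all hypothetical, and the real contract is whatever the published JSON Schema defines.

```python
import json

# Hypothetical schema version and fields; the real ones live in the
# project's public JSON Schema and may differ.
SCHEMA_VERSION = "1.0"
REQUIRED_FIELDS = {
    "schema_version": str,
    "method": str,
    "dataset": str,
    "model": str,
    "metrics": dict,
}


def validate_run_record(record: dict) -> list:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"wrong type for field: {name}")
    version = record.get("schema_version")
    if isinstance(version, str) and version != SCHEMA_VERSION:
        problems.append(f"unsupported schema_version: {version}")
    return problems


# Placeholder values only; the metric is a dummy, not a reported result.
example_record = {
    "schema_version": SCHEMA_VERSION,
    "method": "genqr_ensemble",
    "dataset": "example-dataset",
    "model": "gpt-4.1-mini",
    "metrics": {"ndcg@10": 0.0},
}

if __name__ == "__main__":
    print(json.dumps(example_record, indent=2, sort_keys=True))
    print(validate_run_record(example_record))
```

Checking every record against one versioned shape is what lets runs from different sources sit side by side in the same table.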
Try it in 30 seconds
Reformulate a query against any OpenAI-compatible endpoint. Pyserini and BEIR are optional extras.
pip install querygym

import querygym as qg
reformulator = qg.create_reformulator("genqr_ensemble", model="gpt-4.1-mini")
result = reformulator.reformulate(qg.QueryItem("q1", "what causes diabetes?"))
print(result.reformulated)

Supported methods
Nine reformulation methods, each with a registered prompt, a paper reference, and a reproducibility entry on the leaderboard.
- GenQR (genqr): Generic LLM-driven keyword expansion.
- GenQR Ensemble (genqr_ensemble): 10 instruction variants for diverse keyword expansion.
- Query2Doc (query2doc): Generates pseudo-documents from LLM knowledge.
- QA Expand (qa_expand): Question-answer expansion with sub-questions.
- MuGI (mugi): Multi-granularity expansion with adaptive concatenation.
- LameR (lamer): Context-based passage synthesis from retrieved docs.
- CSQE (csqe): Sentence-level context expansion (KEQE + CSQE).
- ThinkQE (thinkqe): Multi-round reasoning with corpus feedback.
- Query2E (query2e): Query-to-entity / keyword expansion.
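Because each method is addressed by a string identifier, comparing methods on the same query reduces to a loop over those identifiers. The sketch below assumes only what the quick-start snippet shows (a reformulator with a `reformulate` method whose result has a `reformulated` attribute). The helper `reformulate_with_all` and the factory argument are hypothetical, and whether every method accepts the same constructor arguments is not stated here. The methods that use retrieved documents or corpus feedback may need a searcher as well.

```python
from typing import Any, Callable, Dict, Sequence

# Method identifiers as listed above.
METHOD_IDS = [
    "genqr", "genqr_ensemble", "query2doc", "qa_expand", "mugi",
    "lamer", "csqe", "thinkqe", "query2e",
]


def reformulate_with_all(
    query_item: Any,
    make_reformulator: Callable[[str], Any],
    method_ids: Sequence[str] = METHOD_IDS,
) -> Dict[str, Any]:
    """Run one query through several methods and collect the outputs.

    Hypothetical helper: `make_reformulator` maps a method id to an object
    exposing `.reformulate(item)`, whose result has `.reformulated`.
    """
    outputs = {}
    for method_id in method_ids:
        reformulator = make_reformulator(method_id)
        outputs[method_id] = reformulator.reformulate(query_item).reformulated
    return outputs


if __name__ == "__main__":
    # Stand-in objects so the sketch runs without an LLM endpoint.
    class StubResult:
        def __init__(self, text: str) -> None:
            self.reformulated = text

    class StubReformulator:
        def __init__(self, method_id: str) -> None:
            self.method_id = method_id

        def reformulate(self, item: Any) -> StubResult:
            return StubResult(f"{item} [{self.method_id}]")

    for name, text in reformulate_with_all("q", StubReformulator).items():
        print(name, "->", text)
```

With the library installed, the factory could plausibly be `lambda m: qg.create_reformulator(m, model="gpt-4.1-mini")` and the query a `qg.QueryItem`, mirroring the quick-start call.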