This page describes how to reproduce retrieval experiments with the uniCOIL model on the MS MARCO V2 collections. Details about our model can be found in the following paper:
Jimmy Lin and Xueguang Ma. A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for Information Retrieval Techniques. arXiv:2106.14807.
For uniCOIL, we make the corpus (sparse vectors) as well as the pre-built indexes available to download.
This document also describes hybrid combinations with our TCT-ColBERTv2 dense retrieval models.
At present, the indexes used in these hybrid combinations are referenced as absolute paths on our Waterloo machine orca, so those results are not broadly reproducible.
We are working on ways to distribute the indexes.
For the TREC 2021 Deep Learning Track, we did not have time to train a new uniCOIL model or to finish doc2query-T5 expansions. Thus, we applied uniCOIL without expansions in a zero-shot manner, using the model trained on the MS MARCO (V1) passage corpus, described here.
Specifically, we applied inference over the MS MARCO V2 passage corpus and segmented document corpus to obtain the term weights.
For the passage corpus, we start from a version that has already been processed with uniCOIL, i.e., gone through term reweighting (no document expansion, per the zero-shot setup described above). As an alternative, we also make pre-built indexes available, in which case the indexing step can be skipped.
Download the sparse representation of the corpus generated by uniCOIL:
wget https://vault.cs.uwaterloo.ca/s/a29gEzyXrK5NG4o/download -O collections/msmarco-passage-v2-unicoil-noexp-0shot-b8.tar
tar -xvf collections/msmarco-passage-v2-unicoil-noexp-0shot-b8.tar -C collections/
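Each record in the extracted corpus is a JSON document in Anserini's JsonVectorCollection format, where the uniCOIL term weights are stored as a term-to-weight map. A minimal sketch of what one record looks like (the id, terms, and weights below are made up for illustration):

import json

# Illustrative JsonVectorCollection record (id, terms, and weights are made up);
# "vector" maps each term to its uniCOIL impact weight.
record = {
    "id": "msmarco_passage_00_0",
    "contents": "example passage text",
    "vector": {"example": 87, "passage": 52, "text": 34},
}
print(json.dumps(record))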
Index the sparse vectors:
python -m pyserini.index -collection JsonVectorCollection \
-input collections/msmarco-passage-v2-unicoil-noexp-0shot-b8 \
-index indexes/lucene.unicoil-noexp.0shot.msmarco-passage-v2 \
-generator DefaultLuceneDocumentGenerator -impact -pretokenized \
-threads 32
If you want to save time and skip the indexing step, download the prebuilt index directly:
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/pyserini-indexes/lucene.unicoil-noexp.0shot.msmarco-passage-v2.tar.gz -P indexes/
# Alternate mirror
# wget https://vault.cs.uwaterloo.ca/s/bKbHmN6CjRtmoJq/download -O indexes/lucene.unicoil-noexp.0shot.msmarco-passage-v2.tar.gz
tar -xvf indexes/lucene.unicoil-noexp.0shot.msmarco-passage-v2.tar.gz -C indexes/
Sparse retrieval with uniCOIL:
python -m pyserini.search --topics msmarco-passage-v2-dev \
--encoder castorini/unicoil-noexp-msmarco-passage \
--index indexes/lucene.unicoil-noexp.0shot.msmarco-passage-v2 \
--output runs/run.msmarco-passage-v2.unicoil-noexp.0shot.txt \
--impact \
--hits 1000 \
--batch 144 \
--threads 36 \
--min-idf 1
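The same index can also be queried interactively from Python. A minimal sketch, assuming a recent Pyserini release that exposes LuceneImpactSearcher (the class name and import path have shifted across releases):

from pyserini.search.lucene import LuceneImpactSearcher

# Load the impact index and the uniCOIL query encoder from the Hugging Face Hub.
searcher = LuceneImpactSearcher(
    'indexes/lucene.unicoil-noexp.0shot.msmarco-passage-v2',
    'castorini/unicoil-noexp-msmarco-passage')

hits = searcher.search('how do you open a locked car door', k=10)
for i, hit in enumerate(hits):
    print(f'{i + 1:2} {hit.docid:25} {hit.score:.4f}')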
To evaluate, using trec_eval:
$ python -m pyserini.eval.trec_eval -c -M 100 -m map -m recip_rank msmarco-passage-v2-dev runs/run.msmarco-passage-v2.unicoil-noexp.0shot.txt
Results:
map all 0.1306
recip_rank all 0.1314
$ python -m pyserini.eval.trec_eval -c -m recall.100,1000 msmarco-passage-v2-dev runs/run.msmarco-passage-v2.unicoil-noexp.0shot.txt
Results:
recall_100 all 0.4964
recall_1000 all 0.7013
Note that we evaluate MAP and MRR at a cutoff of 100 hits to match the official evaluation metrics. However, we measure recall at both 100 and 1000 hits; the latter is a common setting for reranking.
For the segmented document corpus, we similarly start from a version that has already been processed with uniCOIL (term reweighting, again without document expansion). As an alternative, we also make pre-built indexes available, in which case the indexing step can be skipped.
Download the sparse representation of the corpus generated by uniCOIL:
wget https://vault.cs.uwaterloo.ca/s/x5cEaM3rXnTaE7j/download -O collections/msmarco-doc-v2-seg-unicoil-noexp-0shot-b8.tar
tar -xvf collections/msmarco-doc-v2-seg-unicoil-noexp-0shot-b8.tar -C collections/
Index the sparse vectors:
python -m pyserini.index -collection JsonVectorCollection \
-input collections/msmarco-doc-v2-seg-unicoil-noexp-0shot-b8 \
-index indexes/lucene.unicoil-noexp.0shot.msmarco-doc-v2-segmented \
-generator DefaultLuceneDocumentGenerator -impact -pretokenized \
-threads 32
If you want to save time and skip the indexing step, download the prebuilt index directly:
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/pyserini-indexes/lucene.unicoil-noexp.0shot.msmarco-doc-v2-segmented.tar.gz -P indexes/
# Alternate mirror
# wget https://vault.cs.uwaterloo.ca/s/PwHpjHrS2fcgR2Y/download -O indexes/lucene.unicoil-noexp.0shot.msmarco-doc-v2-segmented.tar.gz
tar -xvf indexes/lucene.unicoil-noexp.0shot.msmarco-doc-v2-segmented.tar.gz -C indexes/
Sparse retrieval with uniCOIL:
python -m pyserini.search --topics msmarco-doc-v2-dev \
--encoder castorini/unicoil-noexp-msmarco-passage \
--index indexes/lucene.unicoil-noexp.0shot.msmarco-doc-v2-segmented \
--output runs/run.msmarco-document-v2-segmented.unicoil-noexp.0shot.txt \
--impact \
--hits 10000 \
--batch 144 \
--threads 36 \
--max-passage-hits 1000 \
--max-passage \
--min-idf 1
For the document corpus, since we are searching the segmented version, we retrieve the top 10k segments and perform MaxP to obtain the top 1000 documents.
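MaxP here simply scores each document by the maximum score among its retrieved segments. A minimal sketch of the aggregation, assuming segment ids of the form <docid>#<segment>:

from collections import defaultdict

def maxp(segment_hits, k=1000):
    """Collapse ranked (segment_id, score) pairs into document scores via MaxP.

    Assumes segment ids look like '<docid>#<segment>'; each document keeps the
    maximum score over its segments, and the top-k documents are returned.
    """
    doc_scores = defaultdict(lambda: float('-inf'))
    for segment_id, score in segment_hits:
        docid = segment_id.split('#')[0]
        doc_scores[docid] = max(doc_scores[docid], score)
    return sorted(doc_scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Toy example: two segments of one document plus a segment of another.
print(maxp([('msmarco_doc_00_1#2', 9.1), ('msmarco_doc_00_1#0', 7.3), ('msmarco_doc_00_2#0', 8.4)]))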
To evaluate, using trec_eval:
$ python -m pyserini.eval.trec_eval -c -M 100 -m map -m recip_rank msmarco-doc-v2-dev runs/run.msmarco-document-v2-segmented.unicoil-noexp.0shot.txt
Results:
map all 0.2012
recip_rank all 0.2032
$ python -m pyserini.eval.trec_eval -c -m recall.100,1000 msmarco-doc-v2-dev runs/run.msmarco-document-v2-segmented.unicoil-noexp.0shot.txt
Results:
recall_100 all 0.7190
recall_1000 all 0.8813
We evaluate MAP and MRR at a cutoff of 100 hits to match the official evaluation metrics. However, we measure recall at both 100 and 1000 hits; the latter is a common setting for reranking.
Because there are duplicate passages in the MS MARCO V2 collections, score differences might be observed due to tie-breaking effects. For example, if we output in MS MARCO format (--output-format msmarco) and then convert to TREC format with pyserini.eval.convert_msmarco_run_to_trec_run, the scores will be different.
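The effect is easy to see in isolation: when two hits tie on score, their relative order, and hence rank-sensitive metrics like MRR, depends on the secondary sort key. A small illustration with made-up docids and scores:

# Two hits tie on score; sorting by score alone preserves their incoming order,
# while breaking ties by docid can swap them and shift ranks by one position.
hits = [('doc_b', 12.5), ('doc_a', 12.5), ('doc_c', 11.0)]

by_score_only = sorted(hits, key=lambda h: -h[1])
by_score_then_docid = sorted(hits, key=lambda h: (-h[1], h[0]))

print(by_score_only)        # [('doc_b', 12.5), ('doc_a', 12.5), ('doc_c', 11.0)]
print(by_score_then_docid)  # [('doc_a', 12.5), ('doc_b', 12.5), ('doc_c', 11.0)]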
Dense-sparse hybrid retrieval on the passage corpus (uniCOIL zero-shot + TCT-ColBERTv2 zero-shot):
python -m pyserini.hsearch dense --index /store/scratch/indexes/trec2021/faiss-flat.tct_colbert-v2-hnp.0shot.msmarco-passage-v2-augmented \
--encoder castorini/tct_colbert-v2-hnp-msmarco \
sparse --index /store/scratch/indexes/trec2021/lucene.unicoil-noexp.0shot.msmarco-passage-v2 \
--encoder castorini/unicoil-noexp-msmarco-passage \
--impact \
--min-idf 1 \
fusion --alpha 0.46 --normalization \
run --topics collections/passv2_dev_queries.tsv \
--output runs/run.msmarco-passage-v2.tct_v2+unicoil-noexp.0shot.top1k.dev1.trec \
--batch-size 72 --threads 72 \
--output-format trec
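The fusion step interpolates each candidate's dense and sparse scores after normalizing them, with --alpha controlling the relative weight. A minimal sketch of weighted interpolation with min-max normalization (illustrative only; not necessarily the exact weighting scheme pyserini.hsearch implements):

def minmax(scores):
    """Min-max normalize a docid -> score map to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    return {d: (s - lo) / (hi - lo) if hi > lo else 0.0 for d, s in scores.items()}

def interpolate(dense, sparse, alpha):
    """Fuse normalized dense and sparse scores; a docid missing from one list contributes 0.

    This is a generic convex combination, not necessarily pyserini.hsearch's exact formula.
    """
    dense, sparse = minmax(dense), minmax(sparse)
    fused = {d: alpha * sparse.get(d, 0.0) + (1 - alpha) * dense.get(d, 0.0)
             for d in set(dense) | set(sparse)}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Toy example with made-up scores.
print(interpolate({'p1': 72.0, 'p2': 70.5}, {'p1': 11.2, 'p3': 13.0}, alpha=0.46))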
Evaluation:
$ python -m pyserini.eval.trec_eval -c -m recall.10,100,1000 -mmap -m recip_rank collections/passv2_dev_qrels.tsv runs/run.msmarco-passage-v2.tct_v2+unicoil-noexp.0shot.top1k.dev1.trec
Results:
map all 0.1823
recip_rank all 0.1835
recall_10 all 0.3373
recall_100 all 0.6375
recall_1000 all 0.8620
Dense-sparse hybrid retrieval on the passage corpus (uniCOIL zero-shot + TCT-ColBERTv2 trained):
python -m pyserini.hsearch dense --index /store/scratch/j587yang/project/trec_2021/indexes/dl2021/passage/title_headings_body/tct_colbert-v2-hnp-msmarco-hn-msmarcov2-full \
--encoder /store/scratch/j587yang/project/trec_2021/checkpoints/torch_ckpt/tct_colbert-v2-hnp-msmarco-hn-msmarcov2 \
sparse --index /store/scratch/indexes/trec2021/lucene.unicoil-noexp.0shot.msmarco-passage-v2 \
--encoder castorini/unicoil-noexp-msmarco-passage \
--impact \
--min-idf 1 \
fusion --alpha 0.29 --normalization \
run --topics collections/passv2_dev_queries.tsv \
--output runs/run.msmarco-passage-v2.tct_v2-trained+unicoil-noexp-0shot.top1k.dev1.trec \
--batch-size 72 --threads 72 \
--output-format trec
Evaluation:
$ python -m pyserini.eval.trec_eval -c -m recall.10,100,1000 -mmap -m recip_rank collections/passv2_dev_qrels.tsv runs/run.msmarco-passage-v2.tct_v2-trained+unicoil-noexp-0shot.top1k.dev1.trec
Results:
map all 0.2265
recip_rank all 0.2283
recall_10 all 0.3964
recall_100 all 0.6701
recall_1000 all 0.8748
Dense-sparse hybrid retrieval on the segmented document corpus (uniCOIL zero-shot + TCT-ColBERTv2 zero-shot):
python -m pyserini.hsearch dense --index /store/scratch/indexes/trec2021/faiss-flat.tct_colbert-v2-hnp.0shot.msmarco-doc-v2-segmented \
--encoder castorini/tct_colbert-v2-hnp-msmarco \
sparse --index /store/scratch/indexes/trec2021/lucene.unicoil-noexp.0shot.msmarco-doc-v2-segmented \
--encoder castorini/unicoil-noexp-msmarco-passage \
--impact \
--min-idf 1 \
fusion --alpha 0.56 --normalization \
run --topics collections/docv2_dev_queries.tsv \
--output runs/run.msmarco-document-v2-segmented.tct_v2+unicoil_noexp.0shot.maxp.top100.dev1.trec \
--batch-size 72 --threads 72 \
--max-passage \
--max-passage-hits 100 \
--output-format trec
Evaluation:
$ python -m pyserini.eval.trec_eval -c -m recall.10,100 -mmap -m recip_rank collections/docv2_dev_qrels.tsv runs/run.msmarco-document-v2-segmented.tct_v2+unicoil_noexp.0shot.maxp.top100.dev1.trec
Results:
map all 0.2550
recip_rank all 0.2575
recall_10 all 0.5051
recall_100 all 0.8082
Dense-sparse hybrid retrieval on the segmented document corpus (uniCOIL zero-shot + TCT-ColBERTv2 trained):
python -m pyserini.hsearch dense --index /store/scratch/j587yang/project/trec_2021/indexes/dl2021/document/title_headings_body/tct_colbert-v2-hnp-msmarco-hn-msmarcov2-full-maxp \
--encoder /store/scratch/j587yang/project/trec_2021/checkpoints/torch_ckpt/tct_colbert-v2-hnp-msmarco-hn-msmarcov2 \
sparse --index /store/scratch/indexes/trec2021/lucene.unicoil-noexp.0shot.msmarco-doc-v2-segmented \
--encoder castorini/unicoil-noexp-msmarco-passage \
--impact \
--min-idf 1 \
fusion --alpha 0.54 --normalization \
run --topics collections/docv2_dev_queries.tsv \
--output runs/run.msmarco-document-v2-segmented.tct_v2-trained+unicoil-noexp-0shot.maxp.top100.dev1.trec \
--batch-size 72 --threads 72 \
--max-passage \
--max-passage-hits 100 \
--output-format trec
Evaluation:
$ python -m pyserini.eval.trec_eval -c -m recall.10,100 -mmap -m recip_rank collections/docv2_dev_qrels.tsv runs/run.msmarco-document-v2-segmented.tct_v2-trained+unicoil-noexp-0shot.maxp.top100.dev1.trec
Results:
map all 0.2945
recip_rank all 0.2970
recall_10 all 0.5389
recall_100 all 0.8128