This repository includes all of the code for my submission to SimpleText Task 1.
This repository was developed on Python 3.8.18; no testing has been done on other versions. To run this code, you will need access to the dataset for Task 1 of the SimpleText CLEF lab, as well as login details for their servers. The rest of this explanation assumes you have that access.
First create a virtual environment, for example:

```
python -m venv myenv
source myenv/bin/activate
```

Then install the required libraries using pip and the requirements.txt:

```
pip install -r requirements.txt
```
Clone the entire repository into a single folder so everything runs properly. Directory paths may need to be changed to fit your machine.
- baseline.txt -> the top 100 results from ElasticSearch in a three-stage search ("{query}", {query}, {topictext})
- selective.txt -> the top results from ElasticSearch in a one-to-two-stage search ("{query}", {query})
- rr_baseline.txt -> results from cross_encoder.py using baseline.txt as its argument
- final_results.txt -> results from combine_scores.py using rr_baseline.txt and selective.txt
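The staged searches described above could be sketched roughly as follows. Only the three query stages come from this README; the index name, document field, and the stop condition are assumptions:

```python
# Sketch of the staged ElasticSearch retrieval. The index name
# ("simpletext_abstracts"), the field ("abstract"), and the logic of
# falling through stages until enough hits are collected are assumptions;
# the three query stages ("{query}", {query}, {topictext}) are from the README.

def staged_queries(query: str, topic_text: str):
    """Build the query body for each search stage, in order."""
    return [
        # Stage 1: exact-phrase match on the query ("{query}")
        {"query": {"match_phrase": {"abstract": query}}},
        # Stage 2: plain full-text match ({query})
        {"query": {"match": {"abstract": query}}},
        # Stage 3: fall back to the topic text ({topictext})
        {"query": {"match": {"abstract": topic_text}}},
    ]

def staged_search(es, query, topic_text, size=100):
    """Run each stage in turn until `size` unique documents are collected."""
    seen, hits = set(), []
    for body in staged_queries(query, topic_text):
        resp = es.search(index="simpletext_abstracts", body=body, size=size)
        for hit in resp["hits"]["hits"]:
            if hit["_id"] not in seen:
                seen.add(hit["_id"])
                hits.append(hit)
        if len(hits) >= size:
            break
    return hits[:size]
```

Here `es` would be an `elasticsearch.Elasticsearch` client created with the credentials from config.json.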
- Get the SimpleText dataset from CLEF and place both the qrels and topics CSV files in the repository
- Update the config.json file with your username, password, and ElasticSearch URL, so the scripts can log in to ElasticSearch
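A config.json along these lines is expected; the exact field names are an assumption, so match whatever the config.json in the repository actually uses:

```json
{
    "user": "your_username",
    "password": "your_password",
    "url": "https://your-elasticsearch-host:9200"
}
```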
- Run
python run_everything.py
(you may need to modify the file and directory names in this script first)
The final results come from combining a re-ranked baseline retrieval, produced by our fine-tuned ms-marco-MiniLM-L-6-v2 cross-encoder, with the selective retrieval from ElasticSearch. The cross-encoder first re-ranks the top 100 results from ElasticSearch; that output is then fed into a program that performs a final re-ranking using a combination of the cross-encoder ranking and the selective retrieval. This rests on the assumption that a document ranked highly by both ElasticSearch and the cross-encoder is more likely to be relevant than a document ranked highly by only one system. When training on only the 2023 training data and testing on the 2023 test data, we get the results shown below.
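The exact formula used by combine_scores.py is not spelled out above. One plausible way to reward documents that both systems rank highly is reciprocal-rank fusion, sketched here as an illustration (the function name, the fusion method, and the constant k are all assumptions, not the repository's actual code):

```python
# Illustrative rank fusion: documents appearing near the top of BOTH the
# cross-encoder re-ranking and the selective retrieval accumulate a higher
# score than documents ranked highly by only one list.

def fuse(rr_baseline: list, selective: list, k: int = 60) -> list:
    """Combine two ranked lists of document ids into one fused ranking."""
    scores = {}
    for ranking in (rr_baseline, selective):
        for rank, doc_id in enumerate(ranking, start=1):
            # Reciprocal-rank contribution; k damps the effect of rank 1.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, best first.
    return sorted(scores, key=scores.get, reverse=True)
```

For example, a document that appears in both lists will generally outrank a document that appears only once, even at a slightly worse position in each list.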
The model was fine-tuned using the SimpleText 2024 Train qrels, following the training split from 2023, which leaves the rest for testing. The hyperparameters used are as follows:
- epochs: 5
- learning rate: 1e-05
- warmup_steps: 10% of train data
- evaluator: CERerankingEvaluator
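The fine-tuning setup could be sketched with the sentence-transformers CrossEncoder API as below. The hyperparameters come from this README; the batch size, the shape of the training samples, and the reading of "10% of train data" as 10% of the total training steps are assumptions:

```python
# Sketch of the cross-encoder fine-tuning. Assumptions: batch size 16,
# train_samples as sentence-transformers InputExample objects, and
# warmup_steps interpreted as 10% of the total number of training steps.
import math

def warmup_steps(num_train_samples: int, batch_size: int = 16,
                 epochs: int = 5, fraction: float = 0.10) -> int:
    """Warmup steps = 10% of the total training steps across all epochs."""
    steps_per_epoch = math.ceil(num_train_samples / batch_size)
    return int(fraction * steps_per_epoch * epochs)

def finetune(train_samples, dev_samples, batch_size=16):
    """train_samples: list of InputExample([query, abstract], label=relevance).
    dev_samples: {qid: {"query": ..., "positive": [...], "negative": [...]}}."""
    # Requires: pip install sentence-transformers torch
    from torch.utils.data import DataLoader
    from sentence_transformers.cross_encoder import CrossEncoder
    from sentence_transformers.cross_encoder.evaluation import CERerankingEvaluator

    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", num_labels=1)
    train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=batch_size)
    evaluator = CERerankingEvaluator(dev_samples, name="dev")  # MRR-based dev evaluator
    model.fit(
        train_dataloader=train_dataloader,
        evaluator=evaluator,
        epochs=5,
        warmup_steps=warmup_steps(len(train_samples), batch_size),
        optimizer_params={"lr": 1e-05},
        output_path="finetuned-ms-marco-MiniLM-L-6-v2",
    )
    return model
```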
MRR: 0.8235294117647058
NDCG@10: 0.5149918729116792
NDCG@20: 0.44750129093006613
MAP: 0.2458704904940402
BPREF: 0.2968438829981929

MRR: 1.0
NDCG@10: 0.7549655522724663
NDCG@20: 0.6687157520666694
MAP: 0.35702353860955655
BPREF: 0.40167510420799807

MRR: 0.9117647058823529
NDCG@10: 0.6349787125920727
NDCG@20: 0.5581085214983679
MAP: 0.30144701455179845
BPREF: 0.3492594936030955