All commands must be executed in the project directory (`gen_query_red_made`):
By default the DB contains only the docs represented in qrels (5185 docs). If you want to test the pipeline on a larger amount of docs, you can download a larger data.tsv file:
- download and extract the data file from Google Drive (5 GB, 400k docs)
- replace the existing one:
mv docs_400k.tsv data/msmarco/docs.tsv
To work with the VK dataset:
- download the vk-dataset files `docs.tsv`, `queries.tsv`, `qrels.tsv`
- create the directory `data/vk`
- move the downloaded files to `data/vk/` (see the sketch below)
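A minimal shell sketch of these steps, assuming the three files were downloaded into the current directory:
mkdir -p data/vk
mv docs.tsv queries.tsv qrels.tsv data/vk/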
python -m venv .env
source .env/bin/activate
pip install -r requirements.txt
docker build -t db_red .
docker run -it -v /path_to_project/gen_query_red_made/volume:/volume -v /path_to_project/gen_query_red_made/data:/data db_red ./db_init.sh <DB_NAME>
- DB_NAME - msmarco/vk
Test example (creating `msmarco.db` and `vk.db`):
docker run -it -v $(pwd)/volume:/volume -v $(pwd)/data:/data db_red ./db_init.sh msmarco
docker run -it -v $(pwd)/volume:/volume -v $(pwd)/data:/data db_red ./db_init.sh vk
Build the docker image for indexing documents and performing search:
docker build search_engine/ -t search_engine:0.3
Build the docker image for metrics calculation:
docker build metrics_calc/ -t metrics_calc:0.3
Run experiment:
./pipeline.sh <DB_NAME> <DATA_TABLE>
- DB_NAME - msmarco/vk
- DATA_TABLE - DOCS/JOINED
Test example (run the experiment on the `joined` table in the `vk` DB):
./pipeline.sh vk joined
python iterator.py table_name batch_size shuffle
- table_name - string
- batch_size - number
- shuffle - True/False
Example (get shuffled data from the table/view `joined` with `batch_size=16`):
python iterator.py joined 16 True
Iterator's response:
[
[(row_1),(row_2),(row_3)...(row_BS)],
.....,
[(row_k1),(row_k2),(row_k3)...(row_kBS)]
]
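A minimal Python sketch of what such a batch iterator might look like (illustrative only; the actual iterator.py may differ, and the db path `volume/db/project.db` is taken from the sqlite example below):

```python
import random
import sqlite3

def iterate_batches(table_name: str, batch_size: int, shuffle: bool,
                    db_path: str = "volume/db/project.db"):
    """Yield lists of row tuples, batch_size rows each (last batch may be shorter)."""
    conn = sqlite3.connect(db_path)
    # table_name is interpolated directly, so it must come from a trusted source
    rows = conn.execute(f"SELECT * FROM {table_name}").fetchall()
    conn.close()
    if shuffle:
        random.shuffle(rows)
    for i in range(0, len(rows), batch_size):
        yield rows[i:i + batch_size]
```

Usage matching the example above: `list(iterate_batches("joined", 16, shuffle=True))` produces the bracketed list-of-batches structure shown.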
- Run container:
docker run -it -v /path_to_project/gen_query_red_made/volume:/volume -v /path_to_project/gen_query_red_made/data:/data <image_name>
- Open sqlite (type the command in the docker container console):
sqlite3 volume/db/project.db
- Write SQL queries (pay attention to the trailing `;`):
select count(*) from qrels;
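A few more example queries, assuming the tables are named after the source .tsv files (docs, queries, qrels):
select count(*) from docs;
select * from queries limit 3;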
(foreign key constraints were removed because of missing docs data)
The first step of the pipeline is document indexing; the search engine service handles this.
Build the docker image (from the project root; same image as in the build step above):
docker build search_engine/ -t search_engine:0.3
Then use search_engine.sh to interact with the service:
Command format:
./search_engine.sh <COMMAND> <SQL TABLE> <EXPERIMENT NAME>
- COMMAND - index/start-run
- SQL TABLE - DOCS/JOINED
- EXPERIMENT NAME - any string you want
Command example:
./search_engine.sh index DOCS no_queries_test
The first argument is one of two commands: index or start-run. For this first step, use the index command.
The second argument is the name of a table in the sqlite database. This table should contain: TODO: write column names
The third argument is the experiment name. Attention: you should use this name in all other steps.
The next step is to make a run: create a .jsonl file containing the search engine's output, i.e. the top-100 search engine answers for each query.
Command format:
./search_engine.sh <COMMAND> <SQL TABLE> <EXPERIMENT NAME> <--b> <--k>
- COMMAND - start-run
- b - BM25 document-length normalization coefficient
- k - BM25 term-frequency saturation coefficient (commonly called k1)
This time the first argument is the other command: start-run. (A sketch of the BM25 formula these coefficients enter follows the example below.)
Test example:
./search_engine.sh start-run QUERIES no_queries_test --b=0.4 --k=0.9
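For intuition about what b and k control, here is a minimal sketch of the standard Okapi BM25 per-term score (the service's exact implementation may differ):

```python
import math

def bm25_term_score(tf: float, df: int, n_docs: int,
                    doc_len: float, avg_doc_len: float,
                    k: float = 0.9, b: float = 0.4) -> float:
    """Okapi BM25 contribution of one query term to one document's score.

    k (a.k.a. k1) controls how quickly repeated term occurrences saturate;
    b controls document-length normalization (b=0 off, b=1 full).
    """
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    norm = k * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k + 1) / (tf + norm)
```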
The last step is metrics calculation.
First, build the docker image of the metrics_calc service:
docker build metrics_calc/ -t metrics_calc:0.3
Command format:
./metrics_calc.sh <COMMAND> <TABLE_NAME> <EXPERIMENT_NAME>
- COMMAND - eval
- TABLE_NAME - the table with query relevance judgments (qrels)
- EXPERIMENT_NAME - the same name as in the previous steps
Test example:
./metrics_calc.sh eval QRELS no_queries_test
You will see the results of the experiment in the folder experiments_runs/<experiment_name>.
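As an illustration of the kind of metric such a service computes, a minimal sketch of recall@100 over an in-memory run and qrels (purely illustrative; the actual file formats and metric set are defined by metrics_calc):

```python
def recall_at_k(run: dict[str, list[str]],
                qrels: dict[str, set[str]], k: int = 100) -> float:
    """Mean fraction of relevant docs found in each query's top-k results."""
    scores = []
    for qid, relevant in qrels.items():
        if not relevant:
            continue  # skip queries with no relevance judgments
        top_k = run.get(qid, [])[:k]
        scores.append(len(relevant.intersection(top_k)) / len(relevant))
    return sum(scores) / len(scores) if scores else 0.0
```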