This repository contains the code and pre-trained models for our paper XPR: Cross-lingual Phrase Retriever.
**************************** Updates ****************************
- 5/10 We released our model checkpoint, evaluation code and dataset.
- 4/19 We released our paper.
- 2/26 Our paper has been accepted to ACL2022.
We propose XPR, a cross-lingual phrase retriever that extracts phrase representations from unlabeled example sentences.
We also create a large-scale cross-lingual phrase retrieval dataset, which contains 65K bilingual phrase pairs and 4.2M example sentences in 8 English-centric language pairs.
In the following sections, we describe how to use our XPR.
- First, install PyTorch by following the instructions on the official website. To faithfully reproduce our results, please use the version matching your platform/CUDA setup:

      torch==1.8.1+cu111

  PyTorch versions higher than 1.8.1 should also work.
- Then, run the following script to fetch the repo and install the remaining dependencies.
    git clone [email protected]:cwszz/XPR.git
    cd XPR
    pip install -r requirements.txt
    mkdir data
    mkdir model
    mkdir result
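Optionally, a quick sanity check (a minimal sketch; assumes you run it from the repository root) confirms the three working folders are in place before you go on:

```shell
# Confirm the working directories created above exist.
for d in data model result; do
  mkdir -p "$d"                 # idempotent: no-op if the folder already exists
  [ -d "$d" ] && echo "ok: $d/"
done
```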
Before using XPR, please prepare the dataset by following the steps below.

- Download our dataset here: link
- Unzip the dataset and move it into the `data` folder. (Make sure the path in the bash file points to the dataset.)
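To spot-check the unzipped files, you can parse a few records. The snippet below is only an illustrative sketch: the tab-separated layout (source phrase, target phrase, example sentence) and the sample record are our hypothetical assumptions, so inspect the actual files in `data/` for the real format.

```python
import csv
import io

# HYPOTHETICAL record layout: source phrase, target phrase, example sentence.
# The released dataset may use a different structure -- inspect data/ first.
sample = "overfitting\tsurapprentissage\tThe model suffers from overfitting.\n"

reader = csv.reader(io.StringIO(sample), delimiter="\t")
for src_phrase, tgt_phrase, sentence in reader:
    print(src_phrase, "->", tgt_phrase)  # -> overfitting -> surapprentissage
```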
Before using XPR, please prepare the checkpoint by following the steps below.

- Download our checkpoint here: link
- Move the downloaded checkpoint files into the `model` folder.
Train our method:

    bash train.sh
Test our method:
- Download the XPR checkpoint from Hugging Face: [link]
- Make sure the model path and dataset path in `test.sh` are correct
- The output log can be found in the `log` folder
Here is an example of evaluating XPR:
    bash test.sh

or
    export CUDA_VISIBLE_DEVICES='0'
    python3 predict.py \
      --lg $lg \
      --test_lg $test_lg \
      --dataset_path ./dataset/ \
      --load_model_path ./model/pytorch_model.bin \
      --queue_length 0 \
      --unsupervised 0 \
      --wo_projection 0 \
      --layer_id 12 \
      > log/test-${lg}-${test_lg}-32.log 2>&1
- `$lg`: The language on which the model was trained
- `$test_lg`: The language on which the model will be tested
- `--dataset_path`: The path of the dataset folder
- `--load_model_path`: The path of the checkpoint file
- `--queue_length`: The length of the memory queue
- `--unsupervised`: Unsupervised mode
- `--wo_projection`: Without the SimCLR projection head
- `--layer_id`: The layer used to represent phrases
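Conceptually, cross-lingual phrase retrieval reduces to nearest-neighbour search over phrase embeddings. The sketch below illustrates that idea with made-up toy vectors (the phrases and numbers are ours, purely for illustration); in XPR the embeddings are produced by the model from example sentences.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy phrase embeddings (illustrative only; XPR produces the real ones).
query_vec = [0.9, 0.1, 0.2]                      # e.g. "machine translation"
candidates = {
    "traduction automatique": [0.88, 0.12, 0.18],
    "apprentissage profond": [0.1, 0.9, 0.3],
}

# Retrieve the candidate phrase whose embedding is closest to the query.
best = max(candidates, key=lambda p: cosine(query_vec, candidates[p]))
print(best)  # -> traduction automatique
```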
Please cite our paper if you find the resources in this repository useful.