ed-pooling

This repository contains the code to reproduce the results presented in Impact of Subword Pooling Strategy on Cross-lingual Event Detection.

Setup

$ conda env create -f environment.yml
$ conda activate ed-pooling

Data

ACE

The ACE training, validation, and test data are available in English, Arabic, and Chinese. Because Chinese is not whitespace-delimited, we exclude it from this work. We use the same train/dev/test splits as Huang et al. 2022 and Xu et al. 2021. The English and Arabic data, formatted for the code in this repository, are available under:

data
├── ar-ace
│   ├── ar_dev.jhu.better-split-80.json
│   ├── ar_test.jhu.better-split-80.json
│   └── ar_train.jhu.better-split-80.json
└── en-ace
    ├── en_dev.jhu.better.json
    ├── en_test.jhu.better.json
    └── en_train.jhu.better.json

BETTER

To access the BETTER data (Abstract/Phase-1/Phase-2), please visit the official IARPA BETTER website.

MINION

This dataset was introduced in Pouran Ben Veyseh et al. (2022). We use the same train/dev/test splits as the official release. The data, formatted for the code in this repository, are available under:

data
└── {en,es,hi,ko,pl,pt,tr}-minion
    ├── dev.json
    ├── test.json
    └── train.json

IDF

We gather a corpus in the language of interest and then use scripts/token_scores.py to generate an IDF score for each token. The IDF score files used in this work are provided under data/idfs.
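As a rough illustration of what an IDF table holds, the sketch below computes smoothed IDF scores over a toy corpus. It is not the code in scripts/token_scores.py: the function name `compute_idf`, the whitespace tokenization, and the smoothing formula `log((1 + N) / (1 + df)) + 1` are all assumptions made for this example, and the actual script may work on subword tokens and use a different formula.

```python
import math
from collections import Counter


def compute_idf(documents):
    """Return a smoothed IDF score per token (hypothetical helper).

    Assumes each document is a whitespace-tokenized string; the real
    script may operate on subword tokens instead.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each token at most once per document.
        doc_freq.update(set(doc.split()))
    # Smoothed IDF: tokens that appear in fewer documents score higher.
    return {
        tok: math.log((1 + n_docs) / (1 + df)) + 1.0
        for tok, df in doc_freq.items()
    }


corpus = [
    "the troops attacked the city",
    "the city was quiet",
    "protesters marched",
]
idf = compute_idf(corpus)
```

Here "the" appears in two of the three documents and "attacked" in one, so "attacked" receives the higher score.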

Training

We provide complete guidelines on how to train on the English ACE dataset.

For strategy=first_token/last_token/average:

$ TASK=en-ace
$ STRATEGY=first_token  # can be either first_token/last_token/average
$ MLM=xlm-roberta-large
$ OUTPUT_DIR=expts/en-ace
$ SEED=42
$ bash scripts/train_triggers.sh \
    ${TASK} \
    ${STRATEGY} \
    ${MLM} \
    ${OUTPUT_DIR} \
    ${SEED}
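To make the three fixed strategies concrete, the following minimal sketch pools the subword vectors of a single word into one word vector. The function `pool_subwords` and the plain-list representation are assumptions for illustration only; the model code in this repository works on batched tensors and may differ in detail.

```python
def pool_subwords(subword_vectors, strategy):
    """Pool the subword vectors of one word into one vector (illustrative)."""
    if not subword_vectors:
        raise ValueError("a word must have at least one subword")
    if strategy == "first_token":
        return list(subword_vectors[0])
    if strategy == "last_token":
        return list(subword_vectors[-1])
    if strategy == "average":
        n = len(subword_vectors)
        # Element-wise mean across the subwords of the word.
        return [sum(dim) / n for dim in zip(*subword_vectors)]
    raise ValueError(f"unknown strategy: {strategy}")


# A word split into three subwords, each with a 2-dimensional embedding.
word = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
first = pool_subwords(word, "first_token")
last = pool_subwords(word, "last_token")
mean = pool_subwords(word, "average")
```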

For strategy=attention:

$ STRATEGY=attention
$ ATTENTION_TEMPERATURE=1.
$ bash scripts/train_triggers.sh \
    ${TASK} \
    ${STRATEGY} \
    ${MLM} \
    ${OUTPUT_DIR} \
    ${SEED} \
    ${ATTENTION_TEMPERATURE}
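The role of the temperature can be sketched as follows, assuming that attention pooling assigns each subword a score, normalizes the scores with a softmax, and takes the weighted sum of the subword vectors. Both helper names are hypothetical, the scores stand in for the output of a learned attention layer, and dividing the scores by the temperature is an assumption rather than a statement about the repository's implementation.

```python
import math


def softmax_with_temperature(scores, temperature=1.0):
    """Turn raw scores into weights; assumes scores are divided by the temperature."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_pool(subword_vectors, weights):
    """Weighted sum of the subword vectors of one word."""
    return [
        sum(w * x for w, x in zip(weights, dim))
        for dim in zip(*subword_vectors)
    ]


word = [[1.0, 0.0], [3.0, 2.0]]
scores = [0.0, 0.0]  # stand-in for scores from a learned attention layer
weights = softmax_with_temperature(scores, temperature=1.0)
pooled = weighted_pool(word, weights)

# Under this assumption, a higher temperature flattens the weights.
sharp = softmax_with_temperature([0.0, 2.0], temperature=1.0)
flat = softmax_with_temperature([0.0, 2.0], temperature=10.0)
```

With equal scores the weights are uniform, so the result matches average pooling.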

For strategy=idf:

$ STRATEGY=idf
$ IDF_TEMPERATURE=1.
$ IDF_DEFAULT_SCORE=10
$ IDF_SCORES_FILE=data/idfs/en.tsv
$ bash scripts/train_triggers.sh \
    ${TASK} \
    ${STRATEGY} \
    ${MLM} \
    ${OUTPUT_DIR} \
    ${SEED} \
    ${IDF_TEMPERATURE} \
    ${IDF_DEFAULT_SCORE} \
    ${IDF_SCORES_FILE}
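The sketch below shows one way the three IDF parameters could interact: each subword is looked up in the IDF table, subwords missing from the table fall back to the default score, and the scores are normalized into pooling weights. The function `idf_weights`, the softmax normalization, the division by the temperature, and the toy scores are assumptions for illustration and are not taken from the repository or from data/idfs.

```python
import math


def idf_weights(subwords, idf_scores, default_score=10.0, temperature=1.0):
    """Softmax-normalized IDF weights for the subwords of one word.

    Subwords missing from idf_scores fall back to default_score.
    Assumes scores are divided by the temperature before the softmax.
    """
    raw = [idf_scores.get(tok, default_score) / temperature for tok in subwords]
    peak = max(raw)  # subtract the max for numerical stability
    exps = [math.exp(r - peak) for r in raw]
    total = sum(exps)
    return [e / total for e in exps]


idf_scores = {"attack": 6.0, "ed": 2.0}  # toy values for illustration
weights = idf_weights(["attack", "ed"], idf_scores)
# "zzz" is not in the table, so it receives the default score of 10.
unseen = idf_weights(["attack", "zzz"], idf_scores, default_score=10.0)
```

In this toy setup the rarer subword "attack" dominates the frequent suffix "ed", and an unseen subword is treated as rare because the default score is high.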

Training on the other datasets (BETTER/MINION) follows similar commands. Valid tasks when training on English are TASK=en-ace/en-minion/phase1/phase2/abstract. To train a model on a language other than English, valid tasks are TASK=ar-ace/{es,hi,ko,pl,pt,tr}-minion.

Prediction

Using the models trained on English ACE, we can now make predictions on English/Arabic ACE test sets. In the following, we provide commands to make predictions on Arabic ACE.

For strategy=first_token/last_token/average:

$ TASK=en-ace
$ TRAINING_LANG=en
$ TASK_TYPE=AceTrigger
$ STRATEGY=first_token  # can be either first_token/last_token/average
$ MLM=xlm-roberta-large
$ MLM_TYPE=xlmr
$ OUTPUT_DIR=expts/en-ace
$ SEED=42
$ MAX_SEQ_LENGTH=128
$ TEST_INPUT_FILE=data/ar-ace/ar_test.jhu.better-split-80.json
$ OUTPUT_FILE=ar_test.jhu.better-split-80.preds.json
$ python run_token_classification.py \
    --task_type ${TASK_TYPE} \
    --model_name_or_path ${MLM} \
    --model_type ${MLM_TYPE} \
    --test_file ${TEST_INPUT_FILE} \
    --do_predict \
    --preds_out_file ${OUTPUT_FILE} \
    --max_seq_length ${MAX_SEQ_LENGTH} \
    --output_dir ${OUTPUT_DIR}/${TASK}_${TRAINING_LANG}_${STRATEGY}_${MLM}_${SEED} \
    --seed ${SEED} \
    --pooling_strategy ${STRATEGY}

For strategy=attention:

$ STRATEGY=attention
$ ATTENTION_TEMPERATURE=1.
$ python run_token_classification.py \
    --task_type ${TASK_TYPE} \
    --model_name_or_path ${MLM} \
    --model_type ${MLM_TYPE} \
    --test_file ${TEST_INPUT_FILE} \
    --do_predict \
    --preds_out_file ${OUTPUT_FILE} \
    --max_seq_length ${MAX_SEQ_LENGTH} \
    --output_dir ${OUTPUT_DIR}/${TASK}_${TRAINING_LANG}_${STRATEGY}_${MLM}_${SEED} \
    --seed ${SEED} \
    --pooling_strategy ${STRATEGY} \
    --token_scores_temperature ${ATTENTION_TEMPERATURE}

For strategy=idf:

$ STRATEGY=idf
$ IDF_TEMPERATURE=1.
$ IDF_DEFAULT_SCORE=10
$ IDF_SCORES_FILE=data/idfs/ar.tsv  # ar.tsv since we are predicting on Arabic data
$ python run_token_classification.py \
    --task_type ${TASK_TYPE} \
    --model_name_or_path ${MLM} \
    --model_type ${MLM_TYPE} \
    --test_file ${TEST_INPUT_FILE} \
    --do_predict \
    --preds_out_file ${OUTPUT_FILE} \
    --max_seq_length ${MAX_SEQ_LENGTH} \
    --output_dir ${OUTPUT_DIR}/${TASK}_${TRAINING_LANG}_${STRATEGY}_${MLM}_${SEED} \
    --seed ${SEED} \
    --pooling_strategy ${STRATEGY} \
    --token_scores_temperature ${IDF_TEMPERATURE} \
    --default_token_score ${IDF_DEFAULT_SCORE} \
    --token_scores_file ${IDF_SCORES_FILE}

When training or predicting on a task other than ACE, change TASK and TASK_TYPE accordingly. The mapping from TASK to TASK_TYPE is:

{
    {en,ar}-ace: AceTrigger,
    {phase1,phase2}: BetterBasicTrigger,
    {abstract}: BetterAbstractTrigger,
    {en,es,hi,ko,pl,pt,tr}-minion: TriggerClassificationMinionTask
}
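For scripting many runs, the mapping above and the model directory naming used in the prediction commands can be written out in Python. This is a convenience sketch: the names `TASK_TO_TASK_TYPE`, `task_type_for`, and `model_dir` are hypothetical and are not part of the repository, while the task names, task types, and directory pattern are copied from this README.

```python
MINION_LANGS = ("en", "es", "hi", "ko", "pl", "pt", "tr")

# Expanded form of the TASK -> TASK_TYPE mapping listed above.
TASK_TO_TASK_TYPE = {
    "en-ace": "AceTrigger",
    "ar-ace": "AceTrigger",
    "phase1": "BetterBasicTrigger",
    "phase2": "BetterBasicTrigger",
    "abstract": "BetterAbstractTrigger",
    **{f"{lang}-minion": "TriggerClassificationMinionTask" for lang in MINION_LANGS},
}


def task_type_for(task):
    """Look up the TASK_TYPE for a TASK, failing loudly on unknown tasks."""
    try:
        return TASK_TO_TASK_TYPE[task]
    except KeyError:
        raise ValueError(f"unknown task: {task}") from None


def model_dir(output_dir, task, training_lang, strategy, mlm, seed):
    """Build the directory name passed to --output_dir in the commands above."""
    return f"{output_dir}/{task}_{training_lang}_{strategy}_{mlm}_{seed}"


path = model_dir("expts/en-ace", "en-ace", "en", "first_token", "xlm-roberta-large", 42)
```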

Score

Once the predictions are made, we can evaluate them against the gold test set.

$ TASK=en-ace
$ TRAINING_LANG=en
$ STRATEGY=first_token  # can be either first_token/last_token/average/attention/idf
$ MLM=xlm-roberta-large
$ OUTPUT_DIR=expts/en-ace
$ SEED=42
$ TEST_INPUT_FILE=data/ar-ace/ar_test.jhu.better-split-80.json
$ OUTPUT_FILE=ar_test.jhu.better-split-80.preds.json
$ python scripts/score_triggers.py \
    --gold ${TEST_INPUT_FILE} \
    --system ${OUTPUT_DIR}/${TASK}_${TRAINING_LANG}_${STRATEGY}_${MLM}_${SEED}/${OUTPUT_FILE} \
    --task ${TASK}
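For readers unfamiliar with trigger scoring, the sketch below computes precision, recall, and F1 over exact matches between gold and predicted triggers. It is not the logic of scripts/score_triggers.py: the function name, the representation of a trigger as a (sentence id, start, end, event type) tuple, and the exact-match criterion are assumptions made for this example.

```python
def precision_recall_f1(gold, predicted):
    """Precision, recall, and F1 over exactly matching triggers (illustrative)."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical triggers: (sentence id, start token, end token, event type).
gold = {("s1", 3, 4, "Attack"), ("s2", 0, 1, "Transport")}
pred = {("s1", 3, 4, "Attack"), ("s2", 5, 6, "Attack")}
p, r, f1 = precision_recall_f1(gold, pred)
```

One of the two predictions matches a gold trigger, so precision, recall, and F1 are all 0.5 in this toy case.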

How to Cite

@misc{https://doi.org/10.48550/arxiv.2302.11365,
  doi = {10.48550/ARXIV.2302.11365},
  url = {https://arxiv.org/abs/2302.11365},
  author = {Agarwal, Shantanu and Fincke, Steven and Jenkins, Chris and Miller, Scott and Boschee, Elizabeth},
  keywords = {Computation and Language (cs.CL), Machine Learning (cs.LG), FOS: Computer and information sciences},
  title = {Impact of Subword Pooling Strategy on Cross-lingual Event Detection},
  publisher = {arXiv},
  year = {2023},
  copyright = {Creative Commons Attribution 4.0 International}
}
