sequence-tagging

[figure: one sentence from the JNLPBA dataset, visualized with doccano]

setup

pip install -r requirements
python -m spacy download en_core_web_sm

data

scierc-data

python -c "from util.data_io import download_data; download_data('http://nlp.cs.washington.edu/sciIE/data','sciERC_processed.tar.gz','data',unzip_it=True)"

JNLPBA

git clone https://github.com/allenai/scibert.git

see scibert/data/ner/JNLPBA
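To make the data format concrete, here is a minimal sketch of a reader for CoNLL-style files. It assumes whitespace-separated columns with the token in the first column and the tag in the last, blank lines between sentences, and optional `-DOCSTART-` lines. `read_conll` and the sample lines are hypothetical and not taken from this repo; check the actual files before relying on it.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # (token, tag) pairs


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Group CoNLL-style lines into sentences.

    Assumptions (not confirmed by the README): first column is the token,
    last column is the tag, blank lines separate sentences, and
    -DOCSTART- lines carry no tokens.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            # sentence boundary: flush what has been collected so far
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


# made-up sample in the assumed format, for illustration only
sample = """IL-2 NN O B-DNA
gene NN O I-DNA
expression NN O O

Activation NN O O
""".splitlines()

sentences = read_conll(sample)
print(len(sentences), sentences[0])
```

For a real file, pass an open file handle instead of `sample`, e.g. `read_conll(open(path))`.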

learning-curve on scierc-data

[figures: learning curves on the train and test sets]

learning-curves on JNLPBA-data

[figures: learning curves on the train and test sets]

active learning curves

uncertainty sampling vs. random sampling

  • sequence-tagger: spacy-features + crfsuite
  • 5 times 10 "steps"
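The two sampling strategies can be sketched as follows. This is an illustration and not the code used for the experiments: it assumes the tagger exposes per-token marginal probabilities and scores a sentence by its mean token entropy, which is only one of several possible aggregations. All function names are hypothetical.

```python
import math
import random
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy of one token's label distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def sequence_uncertainty(marginals: Sequence[Sequence[float]]) -> float:
    """Mean per-token entropy of a sentence (assumed aggregation)."""
    if not marginals:
        return 0.0
    return sum(token_entropy(p) for p in marginals) / len(marginals)


def select_uncertain(pool_marginals: Sequence[Sequence[Sequence[float]]], k: int) -> List[int]:
    """Indices of the k sentences the model is least sure about."""
    scores = [sequence_uncertainty(m) for m in pool_marginals]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def select_random(pool_size: int, k: int, seed: int = 0) -> List[int]:
    """Baseline: k sentence indices drawn uniformly without replacement."""
    return random.Random(seed).sample(range(pool_size), k)


# toy pool: three sentences with made-up two-label marginals
pool = [
    [[0.99, 0.01], [0.98, 0.02]],  # confident
    [[0.50, 0.50], [0.60, 0.40]],  # very uncertain
    [[0.90, 0.10]],                # somewhat uncertain
]
chosen = select_uncertain(pool, k=2)
baseline = select_random(len(pool), k=2)
print(chosen, baseline)
```

In each active-learning step the selected sentences would be moved from the unlabeled pool into the training set and the tagger retrained.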

steps of 10% of trainset-size

[figure: active-learning curves, 0.1 steps]

steps of 1% of trainset-size

[figure: active-learning curves, 0.01 steps]
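As a rough illustration of the two step sizes, the sketch below computes the size of the labeled set after each of 10 steps. It assumes the labeled set starts empty and grows by the same number of sentences in every step; the README does not spell this out, and `labeled_pool_sizes` is a hypothetical helper.

```python
from typing import List


def labeled_pool_sizes(train_size: int, step_fraction: float, num_steps: int = 10) -> List[int]:
    """Cumulative labeled-set size after each step.

    Assumption: every step adds round(train_size * step_fraction)
    sentences (at least one), capped at the full train set.
    """
    step = max(1, round(train_size * step_fraction))
    return [min(train_size, step * (i + 1)) for i in range(num_steps)]


coarse = labeled_pool_sizes(1000, 0.1)   # steps of 10% of the train set
fine = labeled_pool_sizes(1000, 0.01)    # steps of 1% of the train set
print(coarse[-1], fine[-1])
```

Under this assumption the 10% schedule ends with the whole train set labeled, while the 1% schedule only covers the first 10% of it.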

result

  • entropy/uncertainty-based sampling does not seem beneficial if the model is weak (too little training data, or too shallow?)

3fold shuffle split on JNLPBA-dataset

  • 20% of the train data, evaluated on the test set (which is not split)
  • why is farm so bad? where is the bug?
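The split described above could look roughly like this. It is a sketch under assumptions: each of the 3 folds draws its own random 20% of the train sentences, while the test set stays fixed. `shuffle_split_indices` is a hypothetical name, and the experiments may have used a library splitter instead.

```python
import random
from typing import List


def shuffle_split_indices(
    num_train: int,
    train_fraction: float = 0.2,
    num_folds: int = 3,
    seed: int = 0,
) -> List[List[int]]:
    """Draw one random train subset per fold.

    Assumption: folds are sampled independently, so subsets of
    different folds may overlap. The test set is not touched.
    """
    rng = random.Random(seed)
    subset_size = round(num_train * train_fraction)
    folds = []
    for _ in range(num_folds):
        indices = list(range(num_train))
        rng.shuffle(indices)
        folds.append(sorted(indices[:subset_size]))
    return folds


folds = shuffle_split_indices(num_train=100)
for fold_id, train_idx in enumerate(folds):
    # train on the sentences at train_idx, then evaluate on the full test set
    print(fold_id, len(train_idx))
```

Fixing `seed` keeps the folds reproducible, so different taggers can be compared on identical subsets.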

sequence tagging transformers + lightning

setup on HPC

  1. git clone https://github.com/dertilo/transformers.git
  2. cd transformers && git checkout lightning_examples
  3. cd examples && pip install -r requirements.txt
  4. on frontend: OMP_NUM_THREADS=2 wandb init
  5. on frontend: OMP_NUM_THREADS=8 bash download_data.sh
  6. on node: python preprocess.py --model_name_or_path bert-base-multilingual-cased --max_seq_length 128
  7. on node: export PYTHONPATH=~/transformers/examples
  8. on frontend: to download pretrained model: OMP_NUM_THREADS=8 python3 run_pl_ner.py --data_dir ./ --labels ./labels.txt --model_name_or_path $BERT_MODEL --do_train

train & evaluate

PYTHONPATH=~/transformers/examples WANDB_MODE=dryrun python ~/transformers/examples/token-classification/run_pl_ner.py --data_dir ./ \
--labels ./labels.txt \
--model_name_or_path bert-base-multilingual-cased  \
--output_dir germeval2014 \
--max_seq_length  128 \
--num_train_epochs 3 \
--train_batch_size 32 \
--seed 1 \
--do_train \
--do_predict
  • sync with wandb: OMP_NUM_THREADS=2 wandb sync wandb/dryrun-...
  • results after 3 epochs in ~20 minutes:
TEST RESULTS
{'avg_test_loss': tensor(0.0733),
 'f1': 0.8625160051216388,
 'precision': 0.8529597974042419,
 'recall': 0.8722887665911299,
 'val_loss': tensor(0.0733)}
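As a quick sanity check on these numbers, the sketch below recomputes F1 from the reported precision and recall. It assumes F1 is the harmonic mean of the two; the README does not say how the metric is computed.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# values copied from the TEST RESULTS above
precision = 0.8529597974042419
recall = 0.8722887665911299
reported_f1 = 0.8625160051216388

f1 = f1_from_precision_recall(precision, recall)
print(f1, abs(f1 - reported_f1))
```

The recomputed value agrees with the reported `f1` to about six decimal places, so the three numbers are internally consistent.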