ken77921/torch-relation-extraction

 
 

Universal Schema based relation extraction implemented in Torch.

Paper

This code was used for the paper Multilingual Relation Extraction using Compositional Universal Schema by Patrick Verga, David Belanger, Emma Strubell, Benjamin Roth, Andrew McCallum.

If you use this code, please cite us.

Dependencies

Overview

Universal Schema embeds text and knowledge base relations together to perform relation extraction and automatic knowledge base population. The typical universal schema model performs matrix factorization where rows are entity pairs and columns are relations.

This code allows you to perform matrix factorization where the column and row embeddings are parameterized by an arbitrary encoder. In the simplest case, a 'standard' matrix factorization would have each encoder as a lookup-table. More complex models could use combinations of LSTMs, CNNs, etc. Doing this is as simple as setting the rowEncoder and colEncoder parameters:

th src/UniversalSchema.lua -rowEncoder lookup-table -colEncoder lstm
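
As a rough illustration of the lookup-table case, the sketch below scores each (entity pair, relation) cell as a dot product between a row embedding and a column embedding. It is plain Python rather than the Torch code in src. The names make_lookup_table and score are hypothetical, the vectors are random rather than trained, and the dot-product score is an assumption about the usual factorization form, not a description of this repository's implementation.

```python
import random


def make_lookup_table(vocab, dim, seed=0):
    """Hypothetical lookup-table encoder: one random vector per vocabulary item."""
    rng = random.Random(seed)
    return {item: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for item in vocab}


def score(row_vec, col_vec):
    """Dot product between a row (entity pair) and a column (relation) embedding."""
    return sum(r * c for r, c in zip(row_vec, col_vec))


# Rows are entity pairs, columns are KB or text relations (toy values).
rows = [("Barack Obama", "Michelle Obama"), ("Paris", "France")]
cols = ["per:spouse", "$ARG1 is the capital of $ARG2"]

row_encoder = make_lookup_table(rows, dim=4, seed=1)
col_encoder = make_lookup_table(cols, dim=4, seed=2)

for pair in rows:
    for relation in cols:
        print(pair, relation, round(score(row_encoder[pair], col_encoder[relation]), 4))
```

Swapping the column lookup table for a sequence encoder (for example an LSTM over the relation tokens) changes only how the column vector is produced; the scoring step stays the same.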

Data Processing

Your entity-relation data should be a 4-column TSV file:

entity1 \t entity2 \t relation \t 1

./bin/process/process-data.sh -i your-data -o your-data.torch -v vocab-file

You can see the other available flags by running ./bin/process/process-data.sh --help
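
If your triples live in another format, a small script can emit the 4-column layout before you call process-data.sh. This is a minimal sketch: format_training_line and is_valid_line are hypothetical helpers, not part of this repository, and the only assumptions are the tab-separated layout and the trailing 1 shown above.

```python
def format_training_line(entity1, entity2, relation):
    """Build one 'entity1 \\t entity2 \\t relation \\t 1' training line."""
    fields = [entity1, entity2, relation, "1"]
    for field in fields:
        # A tab or newline inside a field would break the column layout.
        if "\t" in field or "\n" in field:
            raise ValueError("field contains a tab or newline: %r" % field)
    return "\t".join(fields)


def is_valid_line(line):
    """Check that a line has four tab-separated columns ending in 0 or 1."""
    parts = line.rstrip("\n").split("\t")
    return len(parts) == 4 and parts[3] in ("0", "1")


triples = [
    ("/m/011zd3", "/m/02jknp", "/people/person/profession"),
    ("/m/02k__v", "/m/01y5zy", "$ARG1 lives in the city of $ARG2"),
]

lines = [format_training_line(*triple) for triple in triples]
for line in lines:
    print(line)
```

Writing these lines to a file (for example train.tsv) gives an input suitable for the -i flag.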

You can also process arbitrary data in 3-column format with the -b flag:

row_value \t col_value \t 1

If you want your rows and columns to share the same vocabulary, use the -g flag:

./bin/process/process-data.sh -i your-3column-data -o your-data.torch -v vocab-file -b -g

Training Models

You can run various Universal Schema models located in src. Check out the various options in CmdArgs.lua

You can train models using the train script. The script takes two parameters: a gpuid (-1 for CPU) and a config file. You can run an example base Universal Schema model and evaluate MAP with the following command:

./bin/train/train-model.sh 0 bin/train/configs/examples/uschema-example

Evaluation

MAP

Mean average precision (MAP) is calculated every kth iteration, where k is set by the -evaluateFrequency command-line argument. AP is calculated on a per-column basis and then averaged to get MAP. To calculate MAP for your model, you need to generate one file per test column in the same format as your training data. Unlike the training data, the test data must explicitly include negative examples. Negative examples have a 0 in the last column of the file, while positive examples have a 1.

Place all of these files in a directory, test-data-dir for example, and then run the following command:

./bin/process/process-test-data-dir.sh test-data-dir test-data-dir.torch vocab-file

Here vocab-file should be the same vocab file used to generate your training data.
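
The per-column AP and its average described above can be sketched as follows. This is an illustrative Python version, not the evaluation code in this repository: average_precision and mean_average_precision are hypothetical names, and details such as tie handling may differ from the actual implementation.

```python
def average_precision(scored_examples):
    """AP for one column.

    scored_examples is a list of (score, label) pairs, where label is 1 for a
    positive example and 0 for an explicit negative example.
    """
    ranked = sorted(scored_examples, key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this positive's rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(columns):
    """Average the per-column AP values over all test columns."""
    aps = [average_precision(examples) for examples in columns.values()]
    return sum(aps) / len(aps) if aps else 0.0


# Toy scores for two test columns.
columns = {
    "per:spouse": [(0.9, 1), (0.8, 0), (0.7, 1)],
    "org:founded_by": [(0.6, 0), (0.5, 1)],
}
print(mean_average_precision(columns))
```

In this toy input the first column has positives at ranks 1 and 3 and the second has its only positive at rank 2, so a column with a highly ranked negative is penalized more than one whose negatives fall at the bottom.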

  • TAC evaluation requires setting up Relation Factory and setting $TAC_ROOT=/path/to/relation-factory. Follow the setup instructions on the Relation Factory GitHub page or run $TH_RELEX_ROOT/setup-relationfactory.sh.

First run: ./setup-tac-eval.sh

We include candidate files for years 2012, 2013, and 2014 as well as config files to evaluate each year.

You can tune thresholds on year 2012 and evaluate on year 2013 with this command:

./bin/tac-evaluation/tune-and-score.sh 2012 2013 trained-model vocab-file.txt gpu-id max-length-seq-to-consider output-dir
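
To make the idea of tuning on one year and scoring on another concrete, here is a minimal Python sketch. It assumes a single score threshold chosen to maximize F1 on held-out scored candidates; that criterion, the names f1 and tune_threshold, and the toy data are assumptions for illustration and may not match what tune-and-score.sh does.

```python
def f1(tp, fp, fn):
    """F1 from true positive, false positive, and false negative counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def tune_threshold(scored):
    """Pick the threshold with the best F1 when predicting score >= threshold.

    scored is a list of (score, is_correct) pairs from the tuning year.
    """
    total_pos = sum(1 for _, ok in scored if ok)
    best_threshold, best_f1 = float("inf"), 0.0
    for threshold in sorted({s for s, _ in scored}, reverse=True):
        tp = sum(1 for s, ok in scored if s >= threshold and ok)
        fp = sum(1 for s, ok in scored if s >= threshold and not ok)
        current = f1(tp, fp, total_pos - tp)
        if current > best_f1:
            best_threshold, best_f1 = threshold, current
    return best_threshold, best_f1


# Toy tuning-year candidates: (model score, whether the candidate is correct).
tuning_year = [(0.9, True), (0.8, True), (0.6, False), (0.4, True), (0.2, False)]
threshold, tuned_f1 = tune_threshold(tuning_year)

# Apply the tuned threshold to unlabeled candidates from the evaluation year.
eval_year_scores = [0.95, 0.5, 0.3]
kept = [s for s in eval_year_scores if s >= threshold]
print(threshold, kept)
```

Tuning and evaluating on different years keeps the threshold from being fit to the same candidates it is later judged on.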

You can also download some pretrained models from our paper Multilingual Relation Extraction using Compositional Universal Schema. The download includes a script that will evaluate the models.

Relation Extraction

You can also use this code to score relations. Here we'll walk through the steps to train a universal schema model.

e1 e2 relation 1
/m/02k__v /m/01y5zy $ARG1 lives in the city of $ARG2 1
/m/09cg6 /m/0r297 $ARG2 is a type of $ARG1 1
/m/02mwx2g /m/02lmm0_ /biology/gene_group_membership/gene 1
/m/0hqv6zr /m/0hqx04q /medicine/drug_formulation/formulation_of 1
/m/011zd3 /m/02jknp /people/person/profession 1
  1. First create a training set that combines the KB triples and text relations you care about. For example, generate a file like the one above called train.tsv.
  2. Next, process that file: ./bin/process/process-data.sh -i train.tsv -o data/train.torch -v vocab-file
  3. Now we want to train a model. Edit the example lstm config to say export TRAIN_FILE=train-mtx.torch and start training: ./bin/train/train-model.sh 0 bin/train/configs/lstm-example. This will save a model to models/lstm-example/*-model every 3 epochs.
  4. Now we can use this model to perform relation extraction. Generate a candidate file called candidates.tsv. The file should be tab separated with the following form:
    entity_1      kb_relation    entity_2      doc_info      arg1_start_token_idx      arg1_end_token_idx      arg2_start_token_idx      arg2_end_token_idx      sentence
    A concrete example is:
    Barack Obama      per:spouse      Michelle Obama      doc_info      0      2      8      10      Barack Obama was seen yesterday with his wife Michelle Obama .
  5. Finally, we can score each relation with the following command: th src/eval/ScoreCandidateFile.lua -candidates candidates.tsv -outFile scored-candidates.tsv -vocabFile vocab-file-tokens.txt -model models/lstm-example/5-model -gpuid 0

This will generate a scored candidate file with the same number of lines, with the sentence replaced by a score, where higher means more probable.

Barack Obama      per:spouse      Michelle Obama      doc_info      0      2      8      10      0.94 .
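
The candidate and scored-candidate formats above can be produced and read with a few lines of Python. This is a sketch under stated assumptions: the helper names are hypothetical, tokens are taken to be whitespace separated, and the end indices are treated as exclusive, which is inferred from the example (0 2 for "Barack Obama") rather than stated by the source.

```python
def find_span(tokens, mention_tokens):
    """Return (start, end) of the first match of mention_tokens, end exclusive."""
    n = len(mention_tokens)
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == mention_tokens:
            return start, start + n
    raise ValueError("mention not found in sentence: %r" % " ".join(mention_tokens))


def make_candidate_line(entity1, relation, entity2, doc_info, sentence):
    """Build one tab-separated candidate line for candidates.tsv."""
    tokens = sentence.split()
    a1_start, a1_end = find_span(tokens, entity1.split())
    a2_start, a2_end = find_span(tokens, entity2.split())
    fields = [entity1, relation, entity2, doc_info,
              str(a1_start), str(a1_end), str(a2_start), str(a2_end), sentence]
    return "\t".join(fields)


def parse_scored_line(line):
    """Read one line of the scored file; the last column holds the score."""
    fields = line.rstrip("\n").split("\t")
    # The example output above shows extra characters after the score,
    # so only the first whitespace-separated token of the column is parsed.
    return {"entity1": fields[0], "relation": fields[1], "entity2": fields[2],
            "score": float(fields[8].split()[0])}


sentence = "Barack Obama was seen yesterday with his wife Michelle Obama ."
candidate = make_candidate_line("Barack Obama", "per:spouse", "Michelle Obama",
                                "doc_info", sentence)
print(candidate)
```

Because find_span returns the first match, sentences that mention the same entity more than once would need the intended offsets supplied explicitly.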

Contact

Feel free to contact me with questions : [email protected]
