# Preprocessing

This folder contains the scripts used to generate the candidate entity pairs and the corpus statistics.

## Pair Generation

In the literature, two main approaches are used to generate the candidate pairs:

  1. Sentence splitting + random combination of entities
  2. Heuristic + random combination of entities

### 1. Sentence splitting + random combination of entities

Sentence splitting is only applied to the n2c2 corpus, as the DDI corpus is already split into sentences.

A clinical document is split into sentences, and all valid combinations of entity pairs within each sentence are considered. This approach has been used by Wei et al. (2020) and Christopoulou et al. (2020). Several libraries and tools provide sentence splitting for clinical documents; we use the PyRuSH Sentencizer (see Our Approach below).
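
A minimal sketch of the pairing step, assuming sentences are given as character-offset spans and entities as dicts with `start`/`end` offsets (both representations are assumptions for illustration, not the repository's actual data structures):

```python
from itertools import combinations


def pairs_within_sentences(sentences, entities):
    """Yield all unordered entity pairs that fall inside the same sentence.

    `sentences` is a list of (start, end) character offsets and `entities`
    a list of dicts with "start" and "end" character offsets.
    """
    for sent_start, sent_end in sentences:
        in_sentence = [
            e for e in entities
            if e["start"] >= sent_start and e["end"] <= sent_end
        ]
        # Every unordered combination of two entities in the same sentence
        # becomes a candidate pair; corpus-specific type constraints can be
        # applied on top of this.
        for e1, e2 in combinations(in_sentence, 2):
            yield e1, e2
```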

### 2. Heuristic + random combination of entities

A heuristic is used to determine a window in which entities might hold a relation, and all valid combinations of pairs within that window are considered. The following heuristic was first used by Xu et al. (2018) and later adopted by Alimova et al. (2020): the number of characters between the two entities is smaller than 1000, and no more than 3 other entities that may participate in relations are located between the candidate entities.
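
The heuristic can be expressed as a simple filter. The sketch below assumes entities are dicts with `start`/`end` character offsets (an assumption for illustration):

```python
MAX_CHAR_DISTANCE = 1000  # maximum characters between the two entities (Xu et al., 2018)
MAX_INTERVENING = 3       # maximum other relation-capable entities in between


def is_candidate(e1, e2, entities):
    """Return True if (e1, e2) satisfies the window heuristic."""
    left, right = sorted((e1, e2), key=lambda e: e["start"])
    gap_start, gap_end = left["end"], right["start"]
    # Reject pairs that are too far apart.
    if gap_end - gap_start >= MAX_CHAR_DISTANCE:
        return False
    # Count other entities lying entirely between the candidate entities.
    intervening = [
        e for e in entities
        if e is not left and e is not right
        and e["start"] >= gap_start and e["end"] <= gap_end
    ]
    return len(intervening) <= MAX_INTERVENING
```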

## Our Approach

We have chosen the first approach, using the PyRuSH Sentencizer to split clinical documents into sentences. The main drawback of the second approach is that its heuristic is based on statistics over the whole training set and treats 1000 characters as the maximum distance between two entities forming a candidate pair. Such a large window can produce extreme cases where a candidate pair spans entities in different sentences.

The code for sentence splitting is located in `./split_sentences.py`; it uses the PyRuSH Sentencizer to split the n2c2 `.txt` documents into sentences and stores the result in a `.json` file of the same name.
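
A minimal sketch of this step using PyRuSH's `RuSH` segmenter; the rule-file path and the JSON layout are assumptions for illustration, not necessarily what `./split_sentences.py` produces:

```python
import json
from pathlib import Path

from PyRuSH import RuSH

# The rule file shipped with PyRuSH; the repository may use its own configuration.
rush = RuSH("conf/rush_rules.tsv")


def split_document(txt_path: str) -> None:
    """Split one n2c2 .txt document and write a .json file of the same name."""
    text = Path(txt_path).read_text()
    spans = rush.segToSentenceSpans(text)
    sentences = [
        {"start": s.begin, "end": s.end, "text": text[s.begin:s.end]}
        for s in spans
    ]
    out_path = Path(txt_path).with_suffix(".json")
    out_path.write_text(json.dumps(sentences, indent=2))
```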

The file `./generate_relations.py` contains the code for preprocessing the corpora and creating the train and test relation collections. The train and test relation collections are stored as datadings files for efficient loading. For more information on the relation collection implementation, see the README.
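
A hedged sketch of how relation samples can be written with datadings' `FileWriter`; the field names (`key`, `text`, `head`, `tail`, `label`) are assumptions for illustration and not the repository's actual schema:

```python
from datadings.writer import FileWriter


def write_relations(relations, out_file):
    """Write candidate-relation samples to a datadings file."""
    with FileWriter(out_file) as writer:
        for i, rel in enumerate(relations):
            # Each sample is a dict with a unique "key" plus the relation fields.
            writer.write({
                "key": str(i),
                "text": rel["sentence"],
                "head": rel["head"],
                "tail": rel["tail"],
                "label": rel["label"],
            })
```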

## Statistics

The file `./generate_statistics.py` contains the code to generate a summary of the corpora, including the number of relations of each type and their proportion of the total.
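
A minimal sketch of such a summary, assuming relations are dicts with a `label` field (an assumption for illustration):

```python
from collections import Counter


def relation_statistics(relations):
    """Count relations per type and report each type's share of the total."""
    counts = Counter(rel["label"] for rel in relations)
    total = sum(counts.values())
    return {
        label: {"count": n, "proportion": n / total}
        for label, n in counts.most_common()
    }
```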