GitHub - yaricom/english-article-correction: The experiment with applying NLP to correction of definite/indefinite articles in English text corpus

The repository contains the source code for experiment with applying NLP to correction of English definite/indefinite articles in text corpus

Setting up working environment

Dependencies

The source code in this directory is written in Python 3 and number of dependencies on common Python libraries for ML and NLP:

Numpy - the Base N-dimensional array package
Pandas - the data analysis toolkit for Python
scikit-learn - the collection of tools for data mining and data analysis

All dependencies can be installed manually or as part of distributive by installing Anaconda data science platform. For Anaconda installation instructions visit Anaconda web site. In order to use source code present in this directory install Anaconda3 which includes Python3.

The data corpus

The data corpora in this experiment was generated using UMBC WebBase corpus. UMBC WebBase corpus by Lushan Han, UMBC Ebiquity Lab (http://ebiquity.umbc.edu/) is licensed under a Creative Commons Attribution 3.0 Unported License (http://creativecommons.org/licenses/by/3.0/deed.en_US). Based on a work at http://ebiq.org/r/351.

The data corpora comprise of following files:

sentences: sentence_train.txt, sentence_test.txt
corrections: corrections_train.txt, corrections_test.txt
constituency parse trees: parse_train.txt, parse_test.txt
dependency parse trees: dependencies_train.txt, dependencies_test.txt
part of speech tags: pos_tags_train.txt, pos_tags_test.txt
GloVe vectors indexes for list of vectors in glove_vectors.txt file: glove_train.txt, glove_test.txt

Before running experiments unpack data corpus archive file data.zip into data directory

Source code structure

The source code consists of series of Python3 scripts encapsulating specific functionality:

config.py - holds common configuration parameters (file and directories names, global constants, etc)
data_set.py - the data sets generator to transform raw corpora into Numpy arrays with features and labels
data_set_test.py - the unit tests for data_set.py script
tree_dict.py - the parser and manipulation utilities for constituency parse trees
tree_dict_test.py - the unit tests for tree_dict.py script
utils.py - the common utilities, such as: JSON parsing, data corpora sanity checks, etc
predictor.py - the predictive models runner. Encapsulates common functionality which can be applied to different predictors.
random_forest_model.py - the predictive model based on sklearn.ensemble.RandomForestClassifier
evaluate.py - the results evaluation script where the target metric is not accuracy but recall level at a specified false positive rate level.

Running experiments

Before running experiments with training a model that detects and corrects an incorrect usage of the English article (“a”, “an” or “the”) we need to process raw data corpora files.

To generate data set files execute from the terminal in the root directory following command:

$ create_data_set.sh

The generated data sets comprise features and labels per specific text corpora. The total of sixteen features considered as important. Some of the features represented the words and POS tags found at specific locations adjacent to the determiner (only English articles: a, an, the); others represented the nouns, and verbs that preceded or followed the determiner. Table 1 shows a subset of the feature list.

Index	Feature	Description
0	PrW	The preceding word's Glove index
1	PrW POS	The preceding word's POS tag
2	DTa	The Glove index of determiner (only English articles: a, an, the)
3	FlW	The following word's Glove index
4	FlW POS	The following word's POS tag
5	FlW2	The second following word's Glove index
6	FlW2 POS	The second following word's POS tag
7	FlNNs (i > 0)	The Glove index of following single/plural noun (where DT is not at the start of sentence)
8	FlNNs (i > 0) POS	The following single/plural noun's POS tag
9	PrW2	The second preceding word's Glove index
10	PrW2 POS	The second preceding word's POS tag
11	PrW3 (VB, VBD)	The Glove index of third preceding verb (VB, VBD) if present
12	PrW3 (VB, VBD) POS	The POS tag of third preceding verb (VB, VBD) if present
13	Vowel [0, 1]	The flag to indicate if following word starts with specific vowel
14	FlW (VB,VBD,VBG,VBN,VBP,VBZ)	The Glove index of third following verb (VB,VBD,VBG,VBN,VBP,VBZ) if present
15	FlW (VB,VBD,VBG,VBN,VBP,VBZ) POS	The POS tag of third following verb (VB,VBD,VBG,VBN,VBP,VBZ) if present

Table 1 The features list description

To run prediction against validation corpora with subsequent evaluation against ground truth labels execute from the terminal in the root directory following command:

$ run_validate.sh

To run prediction against test corpora with default parameters execute from the terminal in the root directory following command:

$ generate_results.sh

The predicted results will be saved in out directory as submission_test.txt file.

Conclusions

As result of conducted experiments and analysis, several main findings can be released:

The provided data corpora have a small number of samples which excludes building of advanced predictive models based on neural network methods.
The best predictive performance was achieved with ensemble classifiers based on decision trees architecture.
Among tried decision tree algorithms the best prediction score was achieved with Random Forest Classifier
It was found that the optimal number of estimators for the classifier is 1500. Other parameters were selected as by default.

The results for validation corpora by running evaluate.py script:

FP counts: {'the': 1519, 'an': 77,  'a': 435} 
FN counts: {'the': 1054, 'an': 112, 'a': 597}
TP counts: {'the': 5114, 'an': 97,  'a': 1142}
TN counts: {'the': 7093, 'an': 286, 'a': 2084}
>>> target score = 40.28 % (measured at 0.02 false positive rate level)
>>> accuracy (just for info) = 80.70 %

Authors

This source code maintained and managed by Iaroslav Omelianenko

Name		Name	Last commit message	Last commit date
Latest commit History 92 Commits
src		src
.gitignore		.gitignore
LICENSE		LICENSE
NOTICE		NOTICE
README.md		README.md
create_data_set.sh		create_data_set.sh
generate_results.sh		generate_results.sh
ngram_res_validate.sh		ngram_res_validate.sh
run_pred_validate.sh		run_pred_validate.sh
run_validate.sh		run_validate.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Setting up working environment

Dependencies

The data corpus

Source code structure

Running experiments

Conclusions

Authors

About

Releases

Packages

Languages

License

yaricom/english-article-correction

Folders and files

Latest commit

History

Repository files navigation

Setting up working environment

Dependencies

The data corpus

Source code structure

Running experiments

Conclusions

Authors

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages