ML_classification

Steps to run the project:

git clone https://github.com/srijyothsna/ML_classification.git
cd ML_classification
pip install -r requirements.txt
python classify.py <path_to_input_file_incl_filename> <dir_path_to_save_models>
python predict.py <dir_path_to_save_model> <doctor_assesment_str_in_quotes>

Framework/ Libraries

The solution is implemented in Python and tested on v3.4. The following packages (in requirements.txt) and any dependencies need to installed in the Python environment : sklearn scipy pandas numpy

Approach

Given that the problem involves multilabel/multiclass text classification, I have used a OneVsRestClassifier with LinerSVC model, fitted on a TfIdfVectorizer. The doctor’s written assessment and plan (‘A/P' field from the tsv file) is used as the feature vector for training. The model is trained 67% of the data, while the remaining 33% has been used as the test set.

The code is divided into 2 scripts:

classify.py - The data from the tsv file is read into a pandas dataframe; the classification model is built; data is divided into training/ test sets; the classifier is trained and tested on the test data; and the vectorizer and classifier are saved to file to persist the model.

This script can be run as follows: python classify.py <path_to_input_file_incl_filename> <dir_path_to_save_models>. The filenames for saving the vectorizer and classifier are hardcoded. In this project, input files is in ‘data’ directory and models are saved into 'state’ directory.

The script expects the input file to be in tsv format similar to the sample in 'data' directory. The output from the script - classification report with precision, recall and f1 score - is printed to terminal

predict.py - The saved vectorizer and classifier are loaded; and runs the model on the standalone payload string.

This script can be run as follows: python predict.py <dir_path_to_save_model> <doctor_assesment_str_in_quotes>. In this project, models are loaded from 'state’ directory. The output from this script - the predicted ICD code (first 3 characters) - is printed to the terminal.

The reason for splitting it into 2 scripts is to make it easier to run the model multiple times without having to train and test the model each time.

classify.py needs to be run only when there is change to the training/ test data or the classier itself. On the other hand, predict.py can be run any number of times with the currently trained model.

Results

The overall precision and recall for this test set are 0.64 and 0.61, respectively, with an f1-score of 0.61.

Possible improvements:

Given the limited time for solving this problem, the following steps have not been implemented, which can improve the performance further:

NLP steps like stemming, pos tagging, named entity recognition with ICD10 as the controlled vocabulary, negation
Use other models to train/ test like Random Forest, Bag of words or Word2vec.
More test data and randomized training/ tests with given data to avoid overfitting/ underfitting.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
data		data
ml		ml
state		state
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ML_classification

Steps to run the project:

Framework/ Libraries

Approach

Results

Possible improvements:

About

Releases

Packages

Languages

srijyothsna/ML_classification

Folders and files

Latest commit

History

Repository files navigation

ML_classification

Steps to run the project:

Framework/ Libraries

Approach

Results

Possible improvements:

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages