Automating the application of the DISCERN instrument to rate the quality of health information on the web.
- How to Use this Repo
- Data Preprocessing
- Model Training
- Model Deployment with the Web App
- Known issues
- `git clone` the repo and `cd` into it.
- Run `pip install -e .` to install the repo's python package.
  - If you get a `g++` error during installation, this may be due to OSX Mojave; see this StackOverflow answer.
- Acquire a copy of this project's data and structure it according to "A Note on Data" below.
- Skip on down to Example Usage below.
This repo contains no data. To use this package, you must have a copy of the data locally, in the following file structure:
path/to/discern/
├── data/
|   ├── target_ids.csv
|   ├── responses.csv
|   ├── html_articles/
|   |   └── *.html
|   └── transformed_data/
|       ├── *.pkl
|       ├── *_processor.dill
|       └── *_code.txt
└── experiment_objects/
    └── *.dill
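Before running anything, it can be handy to sanity-check that your local copy matches this layout. Here is a minimal sketch using only the standard library; the `discern_path` value is a placeholder for wherever you keep your own copy of the data:

```python
from pathlib import Path

# Placeholder: point this at your local copy of the data
discern_path = Path("path/to/discern")

# Directories the package expects, per the tree above
expected_dirs = [
    discern_path / "data",
    discern_path / "data" / "html_articles",
    discern_path / "data" / "transformed_data",
    discern_path / "experiment_objects",
]

for directory in expected_dirs:
    status = "ok" if directory.is_dir() else "MISSING"
    print("{}: {}".format(status, directory))
```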
Please follow this naming convention for exploratory notebooks in the shared Switchdrive folder: `<number>_<initials>_<short_description>.ipynb`.
- Download MetaMapLite:
  - Download MetaMapLite from here. You will need to request a license to access the download, which takes a few hours.
  - Place the zip file in a new directory called `metamap`, and unzip.
  - If necessary, install Java as per the MetaMap instructions.
  - Test MetaMap by creating a `test.txt` file with the contents "John had a huge heart attack". Run `./metamap.sh test.txt`. A new file, `test.mmi`, should be created with details about the Myocardial Infarction concept.
- Install the `pymetamap` wrapper:
  - A working version of `pymetamap` compatible with MetaMapLite is on a branch of a forked repo:
    - `git clone https://github.com/kaushikacharya/pymetamap.git`
    - `git checkout lite`
  - Inside your project environment, run `python setup.py install`.
`pymetamap` ingests text and returns NamedTuples for each MetaMap concept, with named fields:
from pymetamap import MetaMapLite

# insert the path to your parent `metamap` dir here
mm = MetaMapLite.get_instance('/Users/laurakinkead/Documents/metamap/public_mm_lite/')
sents = ['Heart Attack', 'John had a huge heart attack']
concepts, error = mm.extract_concepts(sents, [1, 2])

for concept in concepts:
    for fld in concept._fields:
        print("{}: {}".format(fld, getattr(concept, fld)))
    print("\n")
prints:
index: 2
mm: MMI
score: 3.75
preferred_name: Myocardial Infarction
cui: C0027051
semtypes: [dsyn]
trigger: "Heart Attack"-text-0-"heart attack"-NN-0
pos_info: 17/12
tree_codes: C14.280.647.500;C14.907.585.500
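If you prefer to work with plain dictionaries rather than NamedTuples (for example, to build records like the `metamap_detail` entries shown later in this README), the concepts can be converted directly. A minimal sketch, assuming `concepts` was produced by `mm.extract_concepts` as above:

```python
# Each concept is a namedtuple, so _asdict() gives an ordered mapping of its named fields.
concept_dicts = [concept._asdict() for concept in concepts]

for c in concept_dicts:
    print(c['preferred_name'], c['cui'], c['semtypes'])
```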
`DataManager` provides an interface for saving and loading intermediary data sets, while automatically tracking how each data set was generated. You pass the `DataManager` your raw data and your transformation function, and `DataManager`...

- runs the transformation function on your data
- saves the result, named with a timestamp, the git hash, and a descriptive tag of your choice
- saves the transformation function alongside the data, so it can be re-loaded, re-used, and even re-read!
Here's an example of using the data caching interface:
import pandas as pd
from autodiscern import DataManager

raw_data = pd.DataFrame()
# do a bunch of processing that takes a long time to run

def transform_func(df):
    # your complex and time consuming transformation code here
    return df
dm = DataManager(your_discern_path)
cached_file_name = dm.cache_data_processor(raw_data, transform_func, tag="short_description")
# cached_file_name will look like 2019-08-15_06-24-58_10d88c9_short_description
# === at some later date, when you want to load up the data ===
data_processor = dm.load_cached_data_processor(cached_file_name)
# access the cached data set
data_processor.data
# re-use the transform func that was used to create the cached data set
# useful for deploying a ML model, and making sure the exact same transforms get applied to prediction data points as were to the training set!
transformed_prediction_data_point = data_processor.rerun(raw_prediction_data_point)
# you can also access the function directly, to pass to another object
transform_func = data_processor.func
# you can also read the code of transform_func!
data_processor.view_code()
The files for generating cached data sets in this way are stored in `auto-discern/autodiscern/data_processors/*.py`.
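For reference, a new data processor file in that directory might look roughly like the sketch below. The transform and the tag are placeholders, not an existing processor, and the path is the one described in "A Note on Data" above.

```python
# Hypothetical skeleton of a file in autodiscern/data_processors/
import pandas as pd
import autodiscern as ad


def transform_func(df: pd.DataFrame) -> pd.DataFrame:
    # your complex and time consuming transformation code here
    return df


if __name__ == "__main__":
    dm = ad.DataManager("path/to/discern/data")
    raw_data = pd.DataFrame()  # replace with your raw data
    cached_file_name = dm.cache_data_processor(raw_data, transform_func, tag="my_processor")
    print(cached_file_name)
```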
# IPython magics for auto-reloading code changes to the library
%load_ext autoreload
%autoreload 2
import autodiscern as ad
# See "Note on Data" above for what to pass here
dm = ad.DataManager("path/to/discern/data")
# Load up a pickled data dictionary.
# automatically loads the file with the most recent timestamp
transformed_data = dm.load_most_recent_transformed_data()
# To load a specific file, use:
transformed_data = dm.load_transformed_data('filename')
`transformed_data` is a dictionary in the format `{id: data_dict}`.
Each data dict represents a snippet of text, and contains keys with information about that text.
Here is an example of the data structure:
{
    '123-4': {
        'entity_id': 123,
        'sub_id': 4,
        'content': "Deep brain stimulation involves implanting electrodes within certain areas of your brain.",
        'tokens': ['Deep', 'brain', 'stimulation', 'involves', 'implanting', 'electrodes', 'within', 'certain', 'areas', 'of', 'your', 'brain', '.'],
        'categoryName': 5,
        'url': 'http://www.mayoclinic.com/health/deep-brain-stimulation/MY00184/METHOD=print',
        'html_tags': ['h2', 'a'],
        'domains': ['nih'],
        'link_type': ['external'],
        'metamap': ['Procedures', 'Anatomy'],
        'metamap_detail': [{
            'index': "'123-4'",
            'mm': 'MMI',
            'score': '2.57',
            'preferred_name': 'Deep Brain Stimulation',
            'cui': 'C0394162',
            'semtypes': '[topp]',
            'trigger': '"Deep Brain Stimulation"-text-0-"Deep brain stimulation"-NNP-0',
            'pos_info': '1/22',
            'tree_codes': 'E02.331.300;E04.190'
        },
        {
            'index': "'123-4'",
            'mm': 'MMI',
            'score': '1.44',
            'preferred_name': 'Brain',
            'cui': 'C0006104',
            'semtypes': '[bpoc]',
            'trigger': '"Brain"-text-0-"brain"-NN-0',
            'pos_info': '84/5',
            'tree_codes': 'A08.186.211'
        }],
        'responses': pd.DataFrame(
            uid         5  6
            questionID
            1           1  1
            2           1  1
            3           5  5
            4           3  3
            5           3  4
            6           3  3
            7           2  3
            8           5  4
            9           5  4
            10          4  3
            11          5  5
            12          1  1
            13          4  1
            14          3  2
            15          5  3
        ),
    }
}
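As a small usage sketch (assuming `transformed_data` has been loaded as shown above, and using the example key `'123-4'` from the structure above), individual fields of a data dict can be accessed like this:

```python
# Assumes transformed_data was loaded via dm.load_most_recent_transformed_data()
snippet = transformed_data['123-4']

print(snippet['content'])      # the raw text of the snippet
print(snippet['tokens'])       # its word tokens
print(snippet['metamap'])      # high-level MetaMap semantic groups

# 'responses' is a pandas DataFrame of DISCERN ratings, indexed by questionID,
# with one column per rater uid
responses = snippet['responses']
print(responses.mean(axis=1))  # average rating per question across raters
```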
# IPython magics for auto-reloading code changes to the library
%load_ext autoreload
%autoreload 2
import autodiscern as ad
import autodiscern.annotations as ada
import autodiscern.transformations as adt
# ============================================
# STEP 1: Load the raw data
# ============================================
# See "Note on Data" above for what to pass here
dm = ad.DataManager("path/to/discern/data")
# (Optional) View the raw data like this (data is loaded in automatically):
dm.html_articles.head()
dm.responses.head()
# Build data dictionaries for processing. This builds a dict of dicts, each data dict keyed on its entity_id.
data_dict = dm.build_dicts()
# ============================================
# STEP 2: Clean and transform the data
# ============================================
# Select which transformations and segmentations you want to apply
# segment_into: words, sentences, paragraphs
html_transformer = adt.Transformer(leave_some_html=True,      # leave important html tags in place
                                   html_to_plain_text=True,   # convert html tags to a form that doesn't interrupt segmentation
                                   segment_into='sentences',  # segment documents into sentences
                                   flatten=True,              # after segmentation, flatten list[doc_dict[sentences]] into list[sentences]
                                   annotate_html=True,        # annotate sentences with html tags
                                   parallelism=True           # run in parallel for 2x speedup
                                   )
transformed_data = html_transformer.apply(data_dict)
# ============================================
# STEP 3: Add annotations
# ============================================
# Apply annotations, which add new keys to each data dict
transformed_data = ada.add_word_token_annotations(transformed_data)
# Applying MetaMap annotations takes about half an hour for the full dataset.
# This requires an independent installation of MetaMapLite.
# See the MetaMapLite installation instructions above for details on MetaMapLite and the pymetamap package.
transformed_data = ada.add_metamap_annotations(transformed_data, dm)
# WARNING: ner annotations are *very* slow
transformed_data = ada.add_ner_annotations(transformed_data)
# ============================================
# STEP 4: Save and reload data for future use
# ============================================
# Save the data with pickle. The filename is assigned automatically.
# You may add a descriptor to the filename via
# dm.save_transformed_data(transformed_data, tag='note')
dm.save_transformed_data(transformed_data)
# Load up a pickled data dictionary.
# automatically loads the file with the most recent timestamp
# To load a specific file, use
# dm.load_transformed_data('filename')
transformed_data = dm.load_most_recent_transformed_data()
# View results
counter = 5
for i in transformed_data:
    counter -= 1
    if counter < 0:
        break
    print("==={}===".format(i))
    for key in transformed_data[i]:
        print("{}: {}".format(key, transformed_data[i][key]))
    print()
# =====================================
# MISC
# =====================================
# tag Named Entities
from allennlp.predictors.predictor import Predictor
from IPython.display import HTML
ner_predictor = Predictor.from_path("https://s3-us-west-2.amazonaws.com/allennlp/models/ner-model-2018.12.18.tar.gz")
ner = []
# look at the first 50 sentences of the first document
for sentence in transformed_data[0]['content'][:50]:
    ner.append(adt.allennlp_ner_tagger(sentence, ner_predictor))
HTML(adt.ner_tuples_to_html(ner))
Model training experiments are managed via `sacred`. Experiment files are located at `auto-discern/sacred_experiments/`.
Experiments can be run like this:
python sacred_experiments/first_experiment.py
Config parameters can be modified for a run like this:
python first_experiment.py with "test_mode=True"
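For context on how that `with "test_mode=True"` override is picked up, here is a minimal sketch of how a `sacred` experiment file is typically structured (illustrative only, not the actual contents of `first_experiment.py`):

```python
# Minimal sacred skeleton -- see sacred_experiments/ for the real experiment files
from sacred import Experiment

ex = Experiment('autodiscern_example')


@ex.config
def config():
    test_mode = False  # can be overridden from the CLI with `with "test_mode=True"`


@ex.automain
def main(test_mode):
    if test_mode:
        print("running in test mode")
```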
The model that was published was trained with the following command:
python sacred_experiments/doc_experiment.py
Note to self: This model was trained in ScienceCloud.
You can open up a saved experiment object using its `sacred` id like this:
from autodiscern import DataManager
# See "Note on Data" above for what to pass here
dm = DataManager("path/to/discern/data")
sacred_id = 147
exp = dm.load_experiment(sacred_id)
This will return the trained experiment object, which you can use to calculate new results or make novel predictions.
The neural models were trained with the `neural/neural_discern_run_script.py` script.
A test version of the script can be run with `python neural_discern_run_script.py --test-mode`.
This will train each question for one fold and one epoch, and skip the hyperparameter search.
This script trains the 5 DISCERN question models in parallel across 5 GPUs.
You can choose which GPUs to use by modifying the `question_gpu_map` entry in the config, as sketched below.
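The exact keys depend on the config defined in the run script, but a `question_gpu_map` entry might look roughly like this (the question numbers and device ids below are illustrative placeholders):

```python
# Illustrative only -- check neural/neural_discern_run_script.py for the real config structure.
# Maps each DISCERN question model to the GPU it should train on.
question_gpu_map = {
    4: 0,   # question 4 trains on cuda:0
    5: 1,
    9: 2,
    10: 3,
    11: 4,
}
```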
Note to self: This model was trained on LeoMed (`sing_dis`; `/opt/conda/bin/python neural_discern_run_script.py`).
To deploy a model, that model's `<experiment_dir>` and supporting files must be copied into the repository using the following structure:
auto-discern/
└── autodiscern/
    └── pakage_data/
        ├── predictors/
        |   └── <experiment_dir>/
        |       ├── test/
        |       |   ├── question_4/
        |       |   |   └── fold_0/
        |       |   |       └── config/
        |       |   |           ├── exp_options.pkl
        |       |   |           └── mconfig.pkl
        |       |   └── ...
        |       |
        |       └── train_validation/
        |           ├── question_4/
        |           |   └── fold_0/
        |           |       └── model_statedict/
        |           |           ├── doc_categ_scorer.pkl
        |           |           ├── doc_encoder.pkl
        |           |           └── sent_encoder.pkl
        |           └── ...
        └── pytorch_biobert/
            ├── bert-base-cased-vocab.txt
            ├── bert_config.json
            └── biobert_statedict.pkl
Then, in `auto-discern/validator_site/app.py`:

- Set `DEFAULT_NEURAL_EXP_DIR` to `<experiment_dir>`.
- Set `DEFAULT_USE_GPU` to `True` or `False`, depending on whether the machine you will be deploying the model on has GPUs.
- If you want to use different folds from the cross validation than the default (fold 0, as shown in the file diagram above), set `DEFAULT_QUESTION_FOLD_MAP` accordingly. A sketch of these settings follows this list.
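Put together, the relevant lines in `app.py` might end up looking something like this (the experiment directory name and the fold map below are placeholders, not values from an actual run):

```python
# Placeholder values -- substitute your own <experiment_dir> and folds.
DEFAULT_NEURAL_EXP_DIR = "2019-10-01_12-00-00_abc1234_doc_experiment"  # your <experiment_dir>
DEFAULT_USE_GPU = False  # True only if the deployment machine has GPUs
# Optional: pick a different cross-validation fold per question (fold 0 is the default)
DEFAULT_QUESTION_FOLD_MAP = {4: 0, 5: 0, 9: 0, 10: 0, 11: 0}
```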
On your local machine, from within `autodiscern/`:

- Build the docker image: `docker build --tag=autodiscern .`
- Run the image locally and make sure it works: `docker run -p 80:80 autodiscern`
  - You can also open up the image and take a look around: `docker run -it autodiscern /bin/bash`
- Tag the image, incrementing the tag number: `docker tag autodiscern lokijuhy/autodiscern:v2`
- Push the image to the repository: `docker push lokijuhy/autodiscern:v2`
On the server:
- (optional?) Log in: `docker login -u docker-registry-username`
- Pull down the image: `docker pull lokijuhy/autodiscern:v2`
- Run the image! `docker run -d -p 80:80 lokijuhy/autodiscern:v2`
- When passing the `path` to the data (i.e. `path/to/data` in the `autodiscern.DataManager` class) on Windows, escape the backslash characters, e.g. `C:\\Users\\Username\\Path\\to\\Data` (see the sketch after this list).
- There might be a permission error while initializing the `autodiscern.Transformer` class, caused by the `spacy` module. The best way to resolve this issue is to reinstall `spacy` using `conda`. Make sure to run the `Anaconda prompt` in `Administrator` mode and run `conda install spacy`, then `python -m spacy download en`.
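For the first issue, a Windows path can be passed to the `DataManager` either with escaped backslashes or as a raw string; a minimal sketch (the path itself is just an example):

```python
import autodiscern as ad

# Escaped backslashes...
dm = ad.DataManager("C:\\Users\\Username\\Path\\to\\Data")

# ...or the equivalent raw string
dm = ad.DataManager(r"C:\Users\Username\Path\to\Data")
```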