Dependencies

Anaconda 3.x - full installation
gensim package
xgboost package
dill package
pickle module (part of the Python standard library, no separate install needed)
To unpickle a trained model without warnings: scikit-learn==0.21.3, pandas==0.25.1

Google's word2Vec pre-trained model is required to train pipeline_advanced3 from scratch. It is available here.

Caution: it takes up more than 3GB of space unzipped. You can skip using it and just unpickle the pre-trained pipeline from the root folder.
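
A minimal sketch of loading the pre-trained pipeline, assuming it was serialized with dill (listed in the dependencies) and sits at ./pickles/pipeline_advanced3.pkl as listed under File descriptions. The helper name load_pipeline is hypothetical and not part of the repository. Only unpickle files you trust, since unpickling can execute arbitrary code.

```python
import pickle
from pathlib import Path

try:
    import dill as serializer  # listed in the dependencies
except ImportError:
    serializer = pickle  # fallback; may fail on objects that only dill can restore


def load_pipeline(path="./pickles/pipeline_advanced3.pkl"):
    """Load a serialized pipeline from disk (hypothetical helper)."""
    with Path(path).open("rb") as handle:
        return serializer.load(handle)
```

Calling load_pipeline() from the repository root should return the stored object. If it is a scikit-learn style pipeline it would expose a predict method, but that is an assumption and worth checking in the notebook.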

Installation

As long as you have all the dependencies installed, this will run on the standard distribution of Anaconda with Python 3.

Project motivation

For this project, I use the Figure Eight disaster response data to build a classifier that flags messages for various emergency services.

Notes on modelling approach

The primary classifier I use is XGBoost. To address the class imbalance in the target categories, I increase the weights of the positive observations in the target data.
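
One common way to up-weight positive observations is the ratio of negatives to positives, which XGBoost accepts through its scale_pos_weight parameter. The sketch below assumes that heuristic; the notebook may compute its weights differently, and the function name and toy labels are made up for illustration.

```python
def positive_class_weight(labels):
    """Return n_negative / n_positive for a binary 0/1 label column."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0:
        return 1.0  # no positives to up-weight; keep the weight neutral
    return negatives / positives


# Toy label column for one category: 2 positives, 8 negatives.
toy_labels = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
weight = positive_class_weight(toy_labels)  # 8 / 2 = 4.0

# The value could then be passed to XGBoost, for example:
# xgboost.XGBClassifier(scale_pos_weight=weight)
```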

The standout part of this model is where I use the word2Vec model along with a clustering algorithm to react to clusters of words with similar meanings within a message. Extensive comments are in ML Pipeline Preparation.ipynb.
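
To illustrate the idea, here is a rough sketch that turns a message into per-cluster word counts. It assumes a word-to-cluster mapping already exists (in the project this would come from clustering word2Vec embeddings); the mapping, cluster ids, and function name below are invented for the example and do not reflect the format of the cached words_mappings.pkl.

```python
from collections import Counter

# Hypothetical word -> cluster id mapping; words with similar meanings share an id.
WORD_TO_CLUSTER = {
    "water": 0, "thirsty": 0, "drink": 0,
    "fire": 1, "smoke": 1, "burning": 1,
    "doctor": 2, "injured": 2, "hospital": 2,
}
N_CLUSTERS = 3


def cluster_counts(message, mapping=WORD_TO_CLUSTER, n_clusters=N_CLUSTERS):
    """Count how many words of a message fall into each cluster."""
    counts = Counter(
        mapping[token] for token in message.lower().split() if token in mapping
    )
    return [counts.get(cluster, 0) for cluster in range(n_clusters)]


features = cluster_counts("We are thirsty and need water the hospital is burning")
# features == [2, 1, 1]: two water-related words, one fire-related, one medical
```

A vector like this can be appended to the other features, so the classifier sees that a message mentions water-related words even when the exact word never appeared in training.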

To produce a stable model, I subsample both observations and columns.
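
In XGBoost, row and column subsampling are controlled by the subsample and colsample_bytree parameters, both of which must lie in (0, 1]. The values below are placeholders for illustration only; the settings actually used are in the notebook.

```python
# Illustrative values only; not the settings used in the notebook.
xgb_params = {
    "subsample": 0.8,         # fraction of rows sampled for each tree
    "colsample_bytree": 0.8,  # fraction of columns sampled for each tree
}

# Both fractions must be greater than 0 and at most 1.
for name, value in xgb_params.items():
    if not 0 < value <= 1:
        raise ValueError(f"{name} must be in (0, 1], got {value}")

# These could be unpacked into the classifier, for example:
# xgboost.XGBClassifier(**xgb_params)
```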

The implementation is quite complex, and I suggest that anyone interested check out the ML Pipeline Preparation.ipynb.

File descriptions

ETL Pipeline Preparation.ipynb - data parsing, preparation, and saving
ML Pipeline Preparation.ipynb - classification (main file). DOES require Google's pre-trained word2Vec to run fully (but you can always just unpickle).
./pickles/pipeline_advanced3.pkl - pre-trained classifier that can be unpickled
./pickles/words_mappings.pkl - cached words mappings
./raw_data/categories.csv - message categories data
./raw_data/messages.csv - messages themselves
./web/process_data.py - file for command-line ETL
./web/train_classifier.py - file for command-line classifier training. Does NOT require Google's pre-trained word2Vec (uses cached mappings)
./web/words_mappings.pkl - cached mappings for train_classifier.py
./web/run.py - file to run with flask (used in conjunction with Udacity's IDE, won't run by itself)
./py_scripts/my_etl_pipeline/etlfuncs.py - my ETL helper functions
./py_scripts/my_etl_pipeline/etl_pipeline.py - another implementation of the ETL pipeline

Licensing, Authors, Acknowledgements

You are free to use the code as you like.

Raw data and its terms of usage can be found here

AlexKirko/udacity-disaster-pipeline