The Nutritional Peril:

  <p class="lead">Every year, millions of people across the globe suffer from non-communicable diseases. Non-communicable diseases are not contagious; rather, they develop through lifestyle choices, including dietary intake. Past research on nutrition has often been contradictory. According to the <a href="http://www.nytimes.com/2016/08/11/upshot/were-so-confused-the-problems-with-food-and-exercise-studies.html">New York Times</a>, the same ingredient could be interpreted as starving away cancer or aiding its onset, depending on who you ask. Only two facts are well established: holding all exogenous variables constant, consuming more calories will cause weight gain while exercising will produce weight loss. For this project, we chose to apply a data science approach to the former, nutrition, and ask: with enough information, can we identify trends between food consumption and disease onset rates?</p>

</div>

Project Goal

We want to better understand how the consumption of select food groups correlates with the mortality rates of cancer, diabetes, and cardiovascular disease. Our goal is to apply machine learning techniques to the datasets and generate predictions of mortality rates.

Data Sources

Food data comes from FAOSTAT, the statistics database of the Food and Agriculture Organization of the UN. There are 74 crop predictors and 27 livestock predictors. Food data was measured in kg/capita/year to standardize measurements across different country populations and average over annual time frames. Disease data was obtained from the World Health Organization (WHO). We tested three non-communicable diseases: cancer, diabetes, and cardiovascular disease. The obtained datasets store values as mortality rates per 100,000 people, again controlling for country size to isolate the impact of nutrition as a predictor across the globe.

Data Exploration

Data exploration was first performed on the dataset to understand general trends between the predictors and disease mortality rates. The predictors were sorted by their individual R squared values to find the most informative food predictors. From the graph below, we observe that wheat and products appears to have the strongest correlation with the cancer mortality rate, with an individual R squared of around 0.18. While these individual R squared values are low, combining dozens of predictors in a multivariate model can compound their contributions and provide more accurate predictions. The data exploration and analysis chapter explores these predictors in further detail and across all three disease categories.
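The ranking step described above can be sketched as follows. This is a minimal illustration, not the project's actual notebook code: the function names `r_squared` and `rank_predictors` and the toy numbers are assumptions made for this example. It relies on the fact that, for a one-predictor linear fit with an intercept, R squared equals the squared Pearson correlation.

```python
def r_squared(x, y):
    """R^2 of a one-predictor linear fit (with intercept) of y on x.

    For simple linear regression this equals the squared Pearson correlation.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column explains nothing
    return (sxy ** 2) / (sxx * syy)


def rank_predictors(predictors, target):
    """Return (name, R^2) pairs sorted from most to least informative."""
    scores = {name: r_squared(values, target) for name, values in predictors.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up numbers for illustration only; not taken from FAOSTAT or the WHO.
toy_predictors = {
    "wheat": [10.0, 20.0, 30.0, 40.0, 50.0],
    "rice": [5.0, 3.0, 8.0, 1.0, 7.0],
}
toy_mortality = [1.0, 2.0, 3.0, 4.0, 5.0]

ranking = rank_predictors(toy_predictors, toy_mortality)
for name, score in ranking:
    print(f"{name}: R^2 = {score:.3f}")
```

In the toy data, `wheat` is perfectly linear in the target and therefore ranks first; real food predictors would score far lower, as the values reported above suggest.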

Data Cleaning

One of the most significant difficulties was cleaning the data for analysis. Almost 25% of the values were missing, as shown in the visualization below of the 45 most incomplete livestock and crop predictors, where a white bar indicates missing data and a black bar indicates the presence of a value. Extensive code was written to populate missing values with a regional average. For example, missing values for Cuba were filled with the average of all Central American countries. We hypothesized that averaging over geographically nearby countries would be the most accurate approach because certain foods may be more prevalent across a region.
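A regional-average fill like the one described above might look like the sketch below. It assumes the data lives in a pandas DataFrame with one row per country and a column naming each country's region; the function name `fill_with_regional_mean`, the column names, and the toy values are all hypothetical and not taken from the project's notebooks.

```python
import numpy as np
import pandas as pd


def fill_with_regional_mean(df, region_col, value_cols):
    """Return a copy of df where missing values are replaced by the region's mean."""
    filled = df.copy()
    for col in value_cols:
        # transform("mean") returns the regional mean aligned to every row;
        # missing values are skipped when the mean is computed.
        regional_mean = filled.groupby(region_col)[col].transform("mean")
        filled[col] = filled[col].fillna(regional_mean)
    return filled


# Toy data for illustration only (kg/capita/year values are made up).
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "region": ["R1", "R1", "R1", "R2"],
    "wheat": [10.0, np.nan, 30.0, 5.0],
})

filled_toy = fill_with_regional_mean(toy, "region", ["wheat"])
print(filled_toy)
```

Note that a region in which every country is missing a value would remain unfilled under this approach and would need a fallback, such as a global average.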

Data Science Models

Four different models were tested:

Naive Linear Regression Model
Linear Regression with LASSO for Variable Selection
Linear Regression with Ridge for Coefficient Shrinkage
Regression Tree for Advanced Analysis
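To make the comparison concrete, the sketch below fits the four model families listed above with scikit-learn on synthetic data. Everything specific here is an assumption for illustration: the synthetic data, the train/test split, and the hyperparameters (`alpha=0.1`, `alpha=1.0`, `max_depth=4`) are not the values used in the project.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the food predictors: only the first two columns matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=120)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hyperparameters are illustrative guesses, not tuned values.
models = {
    "linear": LinearRegression(),
    "lasso": Lasso(alpha=0.1),
    "ridge": Ridge(alpha=1.0),
    "tree": DecisionTreeRegressor(max_depth=4, random_state=0),
}

test_r2 = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_r2[name] = model.score(X_test, y_test)  # R^2 on held-out rows

# LASSO can set coefficients exactly to zero, which is what makes it usable
# for variable selection; Ridge only shrinks coefficients toward zero.
n_zeroed_by_lasso = int(np.sum(models["lasso"].coef_ == 0))

for name, score in test_r2.items():
    print(f"{name}: test R^2 = {score:.3f}")
print("coefficients zeroed by LASSO:", n_zeroed_by_lasso)
```

In practice the regularization strength would be chosen by cross-validation rather than fixed by hand.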

Predictor Impact

To visualize the impact of individual food predictors, a bubble chart was generated with the radius of each circle proportional to the magnitude of the corresponding coefficient in the linear regression prediction model. Red and blue correspond to positive and negative coefficients, respectively. In the chart below for diabetes mortality rates, there is a strong mix of predictors that influence disease mortality in both directions.
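The mapping from coefficients to bubbles described above can be sketched as a small helper. The function name `bubble_specs`, the `max_radius` parameter, and the toy coefficients are assumptions made for this example; the project's own chart code may differ.

```python
def bubble_specs(coefficients, max_radius=40.0):
    """Map each coefficient to a bubble radius and color.

    The radius is proportional to the coefficient's magnitude (the largest
    magnitude gets max_radius); red marks positive and blue marks negative
    coefficients. A coefficient of exactly zero gets radius 0 and is
    colored blue here by convention.
    """
    largest = max((abs(c) for c in coefficients.values()), default=0.0)
    specs = []
    for name, coef in coefficients.items():
        radius = max_radius * abs(coef) / largest if largest else 0.0
        color = "red" if coef > 0 else "blue"
        specs.append({"name": name, "radius": radius, "color": color})
    return specs


# Made-up coefficients for illustration only.
toy_coefficients = {"wheat": 0.5, "rice": -0.25}
specs = bubble_specs(toy_coefficients)
for spec in specs:
    print(spec)
```

The resulting list of dictionaries could then be handed to whichever charting layer draws the circles.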

Predicted Cancer Rates

The predicted disease mortality rates from our model were then visualized on a world map to look for geographic trends. The map above shows the estimated cardiovascular disease mortality per 100,000 individuals, while the map below shows the actual cardiovascular disease mortality rates. A darker blue indicates a higher mortality rate, while a lighter blue indicates that the disease is less prevalent in the country. There are clear similarities between the predicted and actual maps, supporting the validity of our approach. However, there are also noticeable differences, which suggest that the model oversimplifies the nuances of individual countries.
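The visual comparison above can be complemented with a simple numeric check of how far the predicted rates fall from the actual ones. The sketch below is an illustration under stated assumptions: the helper name `compare_maps`, the dictionary layout (country name to rate per 100,000), and the toy values are hypothetical.

```python
from math import sqrt


def compare_maps(predicted, actual):
    """Compare predicted and actual rates for countries present in both maps.

    Returns the per-country error (predicted minus actual), the root mean
    squared error, and the country with the largest absolute error.
    """
    shared = sorted(set(predicted) & set(actual))
    if not shared:
        raise ValueError("no countries in common")
    errors = {country: predicted[country] - actual[country] for country in shared}
    rmse = sqrt(sum(e ** 2 for e in errors.values()) / len(shared))
    worst = max(shared, key=lambda country: abs(errors[country]))
    return errors, rmse, worst


# Made-up rates per 100,000 for illustration only.
toy_predicted = {"A": 100.0, "B": 150.0, "C": 200.0}
toy_actual = {"A": 110.0, "B": 150.0, "C": 170.0}

errors, rmse, worst = compare_maps(toy_predicted, toy_actual)
print(f"RMSE: {rmse:.2f}, largest error in country {worst}")
```

Countries with large errors are natural candidates for a closer look at factors the food predictors do not capture.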

Conclusion

The goal of this project was to better understand the impact certain food predictors have on disease mortality. While we understand that health is a holistic outcome dependent on many variables in addition to diet, such as exercise, a country's GDP, and healthcare access, the consumption of certain food groups appears to be more strongly correlated with mortality than others. In particular, our results suggest that starches including potatoes, rice, and wheat should be limited, as they are positively linked with cancer, diabetes, and cardiovascular disease mortality rates across the world. We also aimed to be informative and to apply a machine learning approach to understanding global health. We hope that you find our full report informative and that it succeeds in meeting these goals.

angierao/cs109a-project
