According to the World Health Organization (WHO) stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths.This dataset is used to predict whether a patient is likely to get a stroke based on the input parameters like gender, age, various diseases, and smoking status. Each row in the data provides relevant information about the patient.
To predict brain stroke from patient's records such as age, bmi score, heart problem, hypertension and smoking practice. The dataset includes 100k patient records. Among the records, 1.5% of them are related to stroke patients and the remaining of them are related to non-stroke patients. Therefore, the data is extremely imbalanced.
This project describes step-by-step procedure for building a machine learning (ML) model for stroke prediction and for analysing which features are most useful for the prediction.
- Python (v3.7)
- Pandas (v1.0.3)
- Numpy (v1.18.1)
- Matplotlib (v3.2.1)
- Scikit-Learn (v0.22.1)
- Imbalanced-Learn (v0.6.2)
- LightGBM (v2.3.1)
- XGBoost (v1.0.2)
The dataset contains 5110 real world observations and 10 different attributes:
gender
: "Male", "Female" or "Other"age
: age of the patienthypertension
:- 0: if the patient doesn't have hypertension
- 1: if the patient has hypertension
heart_disease
:- 0: if the patient doesn't have any heart diseases
- 1: if the patient has a heart disease
ever_married
: "No" or "Yes"Residence_type
: "Rural" or "Urban"avg_glucose_level
: average glucose level in bloodbmi
: body mass indexsmoking_status
: "formerly smoked", "never smoked", "smokes" or "Unknown"stroke
: 1 if the patient had a stroke or 0 if not
Dataset can be downloaded from the Kaggle stroke dataset