This Python script analyzes the heart disease dataset from the UCI Machine Learning Repository. It covers data exploration, cleaning, statistical analysis, and visualization, using pandas for data manipulation and Matplotlib for plotting.
The dataset contains features related to heart disease, such as age, sex, chest pain type, and cholesterol level. The script reads it from its UCI Machine Learning Repository URL directly into a pandas DataFrame.
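The loading step described above might look roughly like the sketch below. The column names, the use of "?" as the missing-value marker, and the headerless file layout are assumptions, not taken from main.py. The inline rows are illustrative stand-ins so the example runs without network access.

```python
import io

import pandas as pd

# Assumed column names; the actual script may use different ones.
COLUMNS = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]

# Illustrative rows standing in for the remote file (not necessarily real records).
SAMPLE = """63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0
67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2
53.0,0.0,3.0,128.0,216.0,0.0,2.0,115.0,0.0,0.0,1.0,0.0,?,0
"""


def load_heart_data(source):
    """Read a headerless CSV into a DataFrame, treating '?' as missing."""
    return pd.read_csv(source, header=None, names=COLUMNS, na_values="?")


df = load_heart_data(io.StringIO(SAMPLE))
print(df.shape)
print(df.head())
```

Because pandas.read_csv accepts a URL as well as a file-like object, the same hypothetical load_heart_data helper could be called with the dataset URL in place of the in-memory sample.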
The script then explores and cleans the data: it checks each column for missing values and drops any rows that contain them.
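As a minimal sketch of this cleaning step, the snippet below counts missing values per column and then drops incomplete rows. The small DataFrame is made up for illustration, and the variable names are not taken from main.py.

```python
import numpy as np
import pandas as pd

# Made-up data with two incomplete rows.
df = pd.DataFrame({
    "age": [63, 67, 53, 41],
    "chol": [233.0, np.nan, 216.0, 204.0],
    "target": [0, 1, 0, np.nan],
})

# Count missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Drop every row that has at least one missing value.
cleaned = df.dropna().reset_index(drop=True)
print(f"Rows before: {len(df)}, rows after: {len(cleaned)}")
```

Dropping rows is the simplest strategy and works well when only a few rows are affected; it discards the other valid values in those rows.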
The script calculates summary statistics and a correlation matrix to show how each feature is distributed and how the features relate to one another.
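One plausible way to produce these two tables with pandas is shown below. The data is made up, and the column names (age, chol, thalach) are assumptions.

```python
import pandas as pd

# Made-up numeric data for illustration.
df = pd.DataFrame({
    "age": [40, 50, 60, 70],
    "chol": [200, 220, 240, 260],
    "thalach": [180, 170, 150, 140],
})

# Count, mean, standard deviation, min, quartiles, and max per column.
summary = df.describe()
print(summary)

# Pairwise Pearson correlation between numeric columns.
corr = df.corr()
print(corr)
```

Values in the correlation matrix range from -1 to 1; in this toy data, age and chol rise together while thalach falls as age rises.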
The script creates various visualizations to further explore the dataset:
- Distribution of age
- Scatter plot of age vs. cholesterol
- Correlation heatmap
- Health risk score distribution
- Scatter plot of age vs. health risk score
- Histogram of age distribution for people with the target condition
- Percentage of people with high cholesterol in each age group
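Two of the plots listed above, the age distribution and the age vs. cholesterol scatter plot, could be drawn along the following lines. This is a sketch with made-up data and assumed column names, not the plotting code from main.py. It uses the non-interactive Agg backend and writes the image to an in-memory buffer so that it runs without a display.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend; no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({
    "age": [34, 41, 45, 52, 55, 58, 61, 63, 67, 70],
    "chol": [182, 204, 230, 245, 210, 300, 270, 233, 286, 255],
})

fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram of age.
ax_hist.hist(df["age"], bins=5, edgecolor="black")
ax_hist.set_title("Distribution of Age")
ax_hist.set_xlabel("Age")
ax_hist.set_ylabel("Count")

# Scatter plot of age against cholesterol.
ax_scatter.scatter(df["age"], df["chol"])
ax_scatter.set_title("Age vs. Cholesterol")
ax_scatter.set_xlabel("Age")
ax_scatter.set_ylabel("Cholesterol")

fig.tight_layout()

# Render to an in-memory PNG; pass a file name instead to write to disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```

Passing a file name to savefig in place of the buffer would write the plot to disk as a PNG file.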
The script finds the age at which the largest number of people have high cholesterol and plots a histogram of the age distribution for people with the target condition. It also calculates the percentage of people with high cholesterol in each age group and identifies the age group with the highest percentage.
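The cholesterol calculations described above can be sketched as follows. The threshold of 240 for "high cholesterol" and the ten-year age groups are assumptions made for this example; the values used in main.py may differ. The data is made up.

```python
import pandas as pd

HIGH_CHOL_THRESHOLD = 240  # assumed cutoff, not taken from main.py

# Made-up data for illustration.
df = pd.DataFrame({
    "age": [38, 45, 45, 52, 58, 58, 58, 61, 61, 67],
    "chol": [190, 230, 260, 245, 300, 210, 280, 270, 200, 220],
})
df["high_chol"] = df["chol"] > HIGH_CHOL_THRESHOLD

# Age with the largest number of people who have high cholesterol.
age_with_most = df.loc[df["high_chol"], "age"].value_counts().idxmax()

# Assumed ten-year age groups: [30, 40), [40, 50), [50, 60), [60, 70).
bins = [30, 40, 50, 60, 70]
labels = ["30-39", "40-49", "50-59", "60-69"]
df["age_group"] = pd.cut(df["age"], bins=bins, labels=labels, right=False)

# The mean of a boolean column is the fraction of True values.
pct = df.groupby("age_group", observed=True)["high_chol"].mean() * 100
top_group = pct.idxmax()

print(f"Age with most high-cholesterol cases: {age_with_most}")
print(pct.round(1))
print(f"Age group with highest percentage: {top_group}")
```

The count and the percentage answer different questions: a large age group can have the most cases while a smaller group has the higher rate.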
The plot images are included in the repository:
- Distribution_of_Age.png
- Scatter_Plot_Age_vs_Cholesterol.png
- Correlation_Heatmap.png
- Health_Risk_Score_Distribution.png
- Scatter_Plot_Age_vs_Health_Risk_Score.png
- Age_Distribution_of_People_with_Target_Condition.png
- Percentage_of_People_with_High_Cholesterol_in_Each_Age_Group.png
- Install the required Python libraries, pandas and matplotlib (for example with pip install pandas matplotlib).
- Run the Python script main.py.
- View the generated visualizations to gain insights into the heart disease dataset.