This project focuses on analyzing, visualizing, and extracting insights from a dataset using various data mining and machine learning techniques. The work includes preprocessing data, exploring relationships, implementing clustering methods, and applying decision tree algorithms using libraries like sklearn. Advanced visualization techniques are employed to interpret results effectively.
- Null Values Check: Identified and handled missing values in the dataset.
- Normalization: Standardized the features to a common scale so that large-valued features do not dominate distance-based analysis.
- Visualized the dataset to understand patterns, distributions, and outliers using:
  - Scatter plots
  - Heatmaps
  - Correlation matrices
- Calculated Euclidean Distance to measure similarity between data points.
- Computed Entropy to quantify uncertainty in the dataset, which is the basis for information gain in decision trees.
- Implemented and visualized the correlation matrix to identify relationships between variables and explore multicollinearity.
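The preprocessing and statistical steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the toy DataFrame, its column names, the mean-fill strategy, and the `shannon_entropy` helper are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the project's dataset.
df = pd.DataFrame({
    "height": [1.0, 2.0, np.nan, 4.0],
    "weight": [10.0, 20.0, 30.0, 40.0],
    "label": ["a", "a", "b", "b"],
})
numeric_cols = ["height", "weight"]

# Null values check: count missing entries per column, then fill numeric
# gaps with the column mean (one of several reasonable strategies).
missing_per_column = df.isnull().sum()
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# Standardization (z-score): zero mean and unit variance per column.
standardized = (df[numeric_cols] - df[numeric_cols].mean()) / df[numeric_cols].std(ddof=0)

# Euclidean distance between the first two rows of the standardized data.
row_a = standardized.iloc[0].to_numpy()
row_b = standardized.iloc[1].to_numpy()
distance = float(np.linalg.norm(row_a - row_b))


def shannon_entropy(values):
    """Shannon entropy, in bits, of a sequence of discrete labels."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


label_entropy = shannon_entropy(df["label"].tolist())

# Pearson correlation matrix of the numeric features.
corr = standardized.corr()
```

The resulting `corr` DataFrame can be passed to a heatmap function (for example `seaborn.heatmap(corr, annot=True)`) to produce the kind of correlation plot described above.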
Using the sklearn library, decision tree algorithms were applied for classification and decision-making tasks:
- CART Algorithm (Classification and Regression Trees)
- Hunt's Algorithm, the recursive partitioning scheme on which CART and similar tree-induction algorithms are based.
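As a rough sketch of the decision tree step: scikit-learn's `DecisionTreeClassifier` uses an optimized version of CART, so a CART-style classifier can be trained as below. The Iris dataset, the train/test split, and the hyperparameters (`max_depth=3`, Gini impurity) are assumptions chosen for illustration, not the project's actual data or settings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Iris is used here only as a stand-in for the project's dataset.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# CART-style tree: binary splits chosen by Gini impurity.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X_train, y_train)

# Mean accuracy on the held-out split.
accuracy = tree.score(X_test, y_test)
```

Setting `criterion="entropy"` instead selects splits by information gain, which connects to the entropy computation mentioned earlier.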
- Performed clustering on the dataset and visualized clusters using scatter plots.
- Determined the optimal number of clusters using the Elbow Method and visualized results with an SSE (sum of squared errors) chart.
- Implemented hierarchical clustering and visualized the dendrogram to understand cluster formation.
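The clustering steps above might look something like the following sketch. The synthetic three-blob data, the range of `k` values, and the Ward linkage method are assumptions for illustration; the notebook may use different data and settings.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans

# Synthetic data: three well-separated 2-D blobs (hypothetical stand-in).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# Elbow Method: record the SSE (KMeans calls it inertia_) for each k.
sse = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse[k] = model.inertia_

# Hierarchical (agglomerative) clustering with Ward linkage.
Z = linkage(X, method="ward")

# Cut the hierarchy into at most three flat clusters.
hier_labels = fcluster(Z, t=3, criterion="maxclust")
```

Plotting `sse` against `k` (for example with `matplotlib.pyplot.plot`) gives the SSE chart, where the "elbow" marks the point of diminishing returns. Passing `Z` to `scipy.cluster.hierarchy.dendrogram` draws the dendrogram.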
- Python (Jupyter Notebook - Google Colab)
- Pandas: Data manipulation and preprocessing
- NumPy: Mathematical computations
- Matplotlib/Seaborn: Data visualization
- scikit-learn (sklearn): Machine learning algorithms (Decision Trees, K-Means)
- SciPy: Hierarchical clustering support
Clone the Repository:
git clone https://github.com/9twy/Data-mining-and-analysis.git