In this project, our group had to use Python 3 and Jupyter Notebook, together with scikit-learn, Orange3, or both, to perform an ML task.
The dataset analysed was Donors_dataset.csv, downloaded from Kaggle (Donors-Prediction).
We were restricted to tabular data (no images or image metadata), and the goal was to see how far we could go in predicting donations and understanding the donors. We had to use both supervised and unsupervised learning to tackle two tasks:
- Task 1 (Supervised Learning) - Predicting Donation and Donation Type
- Task 2 (Unsupervised Learning) - Characterizing Donors
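As a rough illustration of how the two tasks map onto scikit-learn, here is a minimal sketch. It does not reproduce the notebook: the data is a synthetic stand-in generated with `make_classification`, and the choice of a random forest, k-means, three clusters, and balanced accuracy are all assumptions made for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the tabular donor features (NOT the real dataset).
# The 80/20 class weights mimic an imbalanced "donated / did not donate" target.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4,
    weights=[0.8, 0.2], random_state=0,
)

# Task 1 (supervised): predict the target from the features.
# stratify=y keeps the class proportions the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
score = balanced_accuracy_score(y_test, clf.predict(X_test))
print(f"Balanced accuracy: {score:.2f}")

# Task 2 (unsupervised): group donors without using the target.
# Features are standardised first because k-means is distance-based.
X_scaled = StandardScaler().fit_transform(X)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X_scaled)
print("Cluster sizes:", {int(c): int((cluster_labels == c).sum()) for c in set(cluster_labels)})
```

Balanced accuracy is used here instead of plain accuracy because it is less misleading when one class dominates.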
An important preliminary step consisted of data cleaning and preprocessing. The following had to be considered:
- Data can contain errors/typos, whose correction might improve the analysis.
- Some features can take many distinct values; grouping them into categories (aggregation into bins) might improve the analysis.
- Data can contain missing values, which you might decide to fill. You might also decide to eliminate instances/features with high percentages of missing values.
- Not all features are necessarily important for the analysis.
- Depending on the analysis, some features might have to be excluded.
- Class distribution is an important characteristic of the dataset that should be checked. Class imbalance might impair machine learning.
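The considerations above can be sketched with pandas on a tiny made-up table. The column names (`age`, `city`, `mostly_missing`, `donated`), the 50% missing-value threshold, the median fill, and the bin edges are all hypothetical choices for illustration and are not taken from Donors_dataset.csv.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame; columns and values are illustrative only.
df = pd.DataFrame({
    "age": [23, 35, np.nan, 51, 64, 42],
    "city": ["Lisbon", "lisbon ", "Porto", "Porto", "Lisbno", "Porto"],
    "mostly_missing": [np.nan, np.nan, np.nan, np.nan, 1.0, np.nan],
    "donated": [0, 0, 1, 0, 0, 1],
})

# Errors/typos: normalise whitespace and casing, then fix a known misspelling.
df["city"] = df["city"].str.strip().str.title().replace({"Lisbno": "Lisbon"})

# Missing values: drop features that are more than 50% empty (assumed threshold).
missing_share = df.isna().mean()
df = df.drop(columns=missing_share[missing_share > 0.5].index)

# Fill the remaining gaps in a numeric feature with its median.
df["age"] = df["age"].fillna(df["age"].median())

# Many distinct values: aggregate ages into bins (assumed bin edges).
df["age_group"] = pd.cut(
    df["age"], bins=[0, 30, 50, 120], labels=["young", "middle", "senior"]
)

# Class distribution: check how balanced the target is.
class_share = df["donated"].value_counts(normalize=True)
print(class_share)
```

If the class shares turn out to be very uneven, options such as stratified splitting or class weighting are worth considering before training.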
This project includes all necessary files: the dataset (Donors_dataset.csv), the Jupyter Notebook (AA_202021_Final_Project_Group25.ipynb), and several Orange3 files.