Workshop W16 and solutions to Workshop W10
pbosilj committed Dec 13, 2024
1 parent b36f136 commit 907b0ab
Showing 2 changed files with 1,199 additions and 0 deletions.
1,062 changes: 1,062 additions & 0 deletions CMP3751 Machine Learning/workshops/Week 10/Week10_workshop_solutions.ipynb

Large diffs are not rendered by default.

137 changes: 137 additions & 0 deletions CMP3751 Machine Learning/workshops/Week 16/Week16_workshop.ipynb
@@ -0,0 +1,137 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Week 16 - Learning from a dataset\n",
"\n",
"As the final workshop in this module, this week's workshop will be a free-form exercise.\n",
"\n",
"The goal of Machine Learning is to construct predictive models from data (= samples represented by their features), in a way that:\n",
"- groups the data by discovering the structure of the data (**clustering**)\n",
"- we are able to predict an unknown feature for a new, unseen sample (**supervised learning**)\n",
" - if this feature is a real-valued, numerical feature, we talk about **regression**\n",
" - if this feature is a categorical feature, describing some _type_ of our sample, we talk about **classification**\n",
"- if the model is able to predict actions of an agent in a dynamic environment, we talk about **reinforcement learning**."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The type of ML approaches mostly covered in this module all fall under the **supervised learning** paradigm. Supervised learning approaches to ML are **data driven**, in the sense that the predictive models are constructed by learning from (many) examples.\n",
"\n",
"Your task for today's workshop will therefore be to pick and download a supervised learning (classification or regression) dataset from [a dataset repository](https://archive.ics.uci.edu/), find the best performing ML approach for the dataset, and train a predictive model. The suggested workflow consists of three parts:\n",
"- [Chosing and cleaning a dataset](#Chosing-and-cleaning-a-dataset)\n",
"- [Chosing the model and the parameters](#Chosing-the-model-and-the-parameters)\n",
"- [Evaluating the model](#Evaluating-the-model)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Chosing and cleaning a dataset\n",
"\n",
"Take care that you may need to \"clean\" the dataset of your choice: check for missing values, make sure it has been loaded in the correct format (i.e. some datasets use different delimiters, some have header lines and some do not, can rely on different file types...).\n",
"\n",
"You will also need to **handle the different types of data** if they are present in the dataset. As most algorithms implemented in `sklearn` are implemented for _numerical_ data, you should represent all of your features as numerical:\n",
"- **real-valued** numerical values can be left as-is\n",
"- **categorical** values should be one-hot encoded. This will transform a single feature column in a dataset with $n$ categorical values (e.g. blood type can be A, B, AB or O (n=4)) into $n$ indicator features (one column indicating if the blood type of the patient is A; one column for B, one column for AB and one for O. For this, you can check out [`pandas.get_dummies()`](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html) or [`sklearn.preprocessing.OneHotEncoder()`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)\n",
"- **ordinal** values fall between the above two categories. They might sometimes be expressed as numbers (user rating 1-5) or as categories ('low' 'mid' 'high'); however they _can be ordered and compared_. These should be left as-is if they are already represented as numbers, or encoded as numbers ('low' -> 0, 'mid' -> 1, 'high' -> 2) if expressed as categories.\n",
"\n",
"You should also check the **scale** of all the features. Some ML approaches benefit from normalised data, while some do not care about the scale. Check the scale of all your features (the difference between the lowest and highest value for each feature). Especially if the scales are vastly different, consider:\n",
"- **normalising** the data (i.e. forcing the feature into a 0-1 range)\n",
"- **zero-centering** the data (i.e. after normalisation, forcing the data to have a median at zero)\n",
"- for these purposes you can check out [`sklearn.preprocessing.StandardScaler()`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)\n",
"\n",
"Here is a list of some datasets I have used this year, but feel free to pick a different one from the repository (or anywhere else):\n",
"- presence of [amphibian species](https://archive.ics.uci.edu/ml/datasets/Amphibians) along planned roadways (multi class classification)\n",
"- quality of [Portugeese wines](https://archive.ics.uci.edu/ml/datasets/Wine+Quality) (classification)\n",
"- presence of [forest fires in Algeria](https://archive.ics.uci.edu/ml/datasets/Algerian+Forest+Fires+Dataset++) (classification)\n",
"- distinguishing between varieties of [dried beans](https://archive.ics.uci.edu/ml/datasets/dry+bean+dataset) (classification)\n",
"- [heart desease](https://archive.ics.uci.edu/ml/datasets/Statlog+%28Heart%29) dataset (classification)"
]
},
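{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below is one possible sketch of this cleaning workflow. It assumes a _hypothetical_ CSV file `my_dataset.csv` with a categorical column `colour`, an ordinal column `quality` and a target column `label` -- adapt the file name, delimiter and column names to the dataset you actually picked."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A minimal cleaning sketch -- all file and column names below are hypothetical.\n",
"import pandas as pd\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"df = pd.read_csv('my_dataset.csv', sep=',')  # check the delimiter and header line!\n",
"df = df.dropna()  # or impute the missing values instead of dropping rows\n",
"\n",
"# One-hot encode a categorical feature into indicator columns.\n",
"df = pd.get_dummies(df, columns=['colour'])\n",
"\n",
"# Encode an ordinal feature as ordered integers.\n",
"df['quality'] = df['quality'].map({'low': 0, 'mid': 1, 'high': 2})\n",
"\n",
"X = df.drop(columns=['label'])\n",
"y = df['label']\n",
"\n",
"# Zero-centre and rescale the features (in practice, fit the scaler on the training split only).\n",
"X_scaled = StandardScaler().fit_transform(X)"
]
},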
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Chosing the model and the parameters\n",
"\n",
"Compare the performance of different ML models on your dataset (e.g. for classification, you could compare [`sklearn.neigbors.KNeigborsClassifier()`](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html), [`sklearn.ensemble.RandomForestClassifier()`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html), [`sklearn.svm.SVC()`](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) (the Support Vector Machine classifier), [`sklearn.neural_network.MLPClassifier()`](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html)).\n",
"\n",
"Be careful how you perform the comparison. When **comparing performance**, be careful to use a _separate set_ for training (the training set), and a separate set for testing - your performance evaluation should be done on samples unseen during training. For this purpose, check out [`sklearn.model_selection.train_test_split()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).\n",
"\n",
"When chosing your performance **metrics**, you should take into account how _balanced_ your dataset is:\n",
"- **accuracy** is good for a balanced set (e.g. around the same number of instances for all the classes)\n",
"- but **F1** or **Cohen's Kappa** might be better for a dataset with a class imbalance.\n",
"\n",
"You will also notice that the above classifiers (and many more ML approaches) have different **parameters** to chose from. Their performance could differ substantially depending on the parameter choice. To chose between different parameters of the same algorithm (e.g. to chose K for [`sklearn.neigbors.KNeigborsClassifier()`](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)), you can perform **hyperparameter search**. This means comparing the performance for different values of K. However, there are some things to be careful about:\n",
"- the _test set_ (used to measure the final performance) **should not** be used to determine the best parameters. Doing that would mean that the \"unseen\" data from the test set was actually not completely unseen, as it was used to chose the best model (which is considered part of training).\n",
"- therefore, the _training set_ should be further split, by separating a smaller _validation set_ (while the remainder of the _training set_ with the validation part removed should be used for training).\n",
"- to achieve this, you can call [`sklearn.model_selection.train_test_split()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) _again_ on your training set.\n",
"- to quickly compare the classifier performance when using different combinations of parameters, you can look into [`sklearn.model_selection.GridSearchCV()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)"
]
},
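{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below is one possible sketch of this workflow, assuming `X_scaled` and `y` were prepared in the cells above: hold out a test set, let `GridSearchCV` do the internal train/validation splitting while searching over K, and only then evaluate the chosen model once on the test set."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A hyperparameter-search sketch -- assumes X_scaled and y from the cells above.\n",
"from sklearn.model_selection import train_test_split, GridSearchCV\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"from sklearn.metrics import f1_score\n",
"\n",
"# Keep a test set completely unseen during model selection.\n",
"X_train, X_test, y_train, y_test = train_test_split(\n",
"    X_scaled, y, test_size=0.3, random_state=42, stratify=y)\n",
"\n",
"# GridSearchCV performs the train/validation splitting internally (via cross-validation).\n",
"search = GridSearchCV(KNeighborsClassifier(),\n",
"                      param_grid={'n_neighbors': [1, 3, 5, 7, 9]},\n",
"                      scoring='f1_macro', cv=5)\n",
"search.fit(X_train, y_train)\n",
"print('best K:', search.best_params_)\n",
"\n",
"# Only now touch the test set, once, with the chosen model.\n",
"y_pred = search.best_estimator_.predict(X_test)\n",
"print('test F1:', f1_score(y_test, y_pred, average='macro'))"
]
},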
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Evaluating the model\n",
"\n",
"Our goal so far has been to select the best performing model on our data. For this purpose, we trained different models on the _training set_, compared different parameters of the same model on the _validation set_, and finally compared the performance of different models on the _testing set_.\n",
"\n",
"However, this means that the our models were evaluated only on the small portion of the dataset (the size of the training set is usually selected to be around 0.2-0.3). To get a better sense of the model performance, it would be good to test how it performs on _all the available data_. However, we still should not test the model on the data seen during training.\n",
"- To get around this, one can use k-fold cross validation. (Unless your chosen dataset is very small, say less than 100 samples, in which case leave-one-out might be better).\n",
"- For k-fold cross validation, check [`sklearn.model_selection.KFold()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html).\n",
"- Alternatively, for leave-one-out, check [`sklearn.model_selection.LeaveOneOut()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LeaveOneOut.html).\n",
"\n",
"If you are training a classifier model, try visualising the [confusion matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) (using [`sklearn.metrics.ConfusionMatrixDisplay()`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ConfusionMatrixDisplay.html)) _on the whole dataset_ by _aggregating the results for different folds_."
]
},
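{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below is one possible sketch of this aggregation, again assuming `X_scaled` and `y` from the cells above. Each sample is predicted exactly once (as part of some held-out fold), so the collected predictions cover the whole dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A k-fold evaluation sketch -- assumes X_scaled and y from the cells above.\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.model_selection import KFold\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay\n",
"\n",
"X_arr, y_arr = np.asarray(X_scaled), np.asarray(y)\n",
"y_true_all, y_pred_all = [], []\n",
"\n",
"for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=42).split(X_arr):\n",
"    model = KNeighborsClassifier(n_neighbors=5)  # substitute your best model and parameters\n",
"    model.fit(X_arr[train_idx], y_arr[train_idx])\n",
"    y_true_all.extend(y_arr[test_idx])\n",
"    y_pred_all.extend(model.predict(X_arr[test_idx]))\n",
"\n",
"# Aggregated over all folds, so the matrix covers the whole dataset.\n",
"ConfusionMatrixDisplay(confusion_matrix(y_true_all, y_pred_all)).plot()\n",
"plt.show()"
]
},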
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
