\n",
"
Tip
\n",
@@ -133,13 +132,13 @@
"Whether or not a machine learning model requires scaling the features depends\n",
"on the model family. Linear models such as logistic regression generally\n",
"benefit from scaling the features while other models such as decision trees do\n",
- "not need such preprocessing (but will not suffer from it).\n",
+ "not need such preprocessing (but would not suffer from it).\n",
"\n",
"We show how to apply such normalization using a scikit-learn transformer\n",
"called `StandardScaler`. This transformer shifts and scales each feature\n",
"individually so that they all have a 0-mean and a unit standard deviation.\n",
"\n",
- "We will investigate different steps used in scikit-learn to achieve such a\n",
+ "We now investigate different steps used in scikit-learn to achieve such a\n",
"transformation of the data.\n",
"\n",
"First, one needs to call the method `fit` in order to learn the scaling from\n",
@@ -175,10 +174,10 @@
"\n",
"
\n",
"
Note
\n",
- "
The fact that the model states of this scaler are arrays of means and\n",
- "standard deviations is specific to the StandardScaler. Other\n",
- "scikit-learn transformers will compute different statistics and store them\n",
- "as model states, in the same fashion.
\n",
+ "
The fact that the model states of this scaler are arrays of means and standard\n",
+ "deviations is specific to the StandardScaler. Other scikit-learn\n",
+ "transformers may compute different statistics and store them as model states,\n",
+ "in a similar fashion.
\n",
"
\n",
"\n",
"We can inspect the computed means and standard deviations."
@@ -353,7 +352,7 @@
"source": [
"We can easily combine sequential operations with a scikit-learn `Pipeline`,\n",
"which chains together operations and is used as any other classifier or\n",
- "regressor. The helper function `make_pipeline` will create a `Pipeline`: it\n",
+ "regressor. The helper function `make_pipeline` creates a `Pipeline`: it\n",
"takes as arguments the successive transformations to perform, followed by the\n",
"classifier or regressor model."
]
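As a sketch, assuming (as in the surrounding notebooks) a logistic regression as the final predictor:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The transformers come first, the predictor last; the resulting pipeline
# behaves like a single estimator with fit/predict/score methods.
model = make_pipeline(StandardScaler(), LogisticRegression())
```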
@@ -378,8 +377,8 @@
"source": [
"The `make_pipeline` function did not require us to give a name to each step.\n",
"Indeed, it was automatically assigned based on the name of the classes\n",
- "provided; a `StandardScaler` will be a step named `\"standardscaler\"` in the\n",
- "resulting pipeline. We can check the name of each steps of our model:"
+ "provided; a `StandardScaler` step is named `\"standardscaler\"` in the resulting\n",
+ "pipeline. We can check the name of each steps of our model:"
]
},
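Assuming the pipeline sketched above, the generated names can be listed like this:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

model = make_pipeline(StandardScaler(), LogisticRegression())
# Step names are the lowercased class names of the estimators.
print(list(model.named_steps))  # ['standardscaler', 'logisticregression']
```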
{
@@ -421,7 +420,7 @@
"![pipeline fit diagram](../figures/api_diagram-pipeline.fit.svg)\n",
"\n",
"When calling `model.fit`, the method `fit_transform` from each underlying\n",
- "transformer (here a single transformer) in the pipeline will be called to:\n",
+ "transformer (here a single transformer) in the pipeline is called to:\n",
"\n",
"- learn their internal model states\n",
"- transform the training data. Finally, the preprocessed data are provided to\n",
@@ -452,7 +451,7 @@
"called to preprocess the data. Note that there is no need to call the `fit`\n",
"method for these transformers because we are using the internal model states\n",
"computed when calling `model.fit`. The preprocessed data is then provided to\n",
- "the predictor that will output the predicted target by calling its method\n",
+ "the predictor that outputs the predicted target by calling its method\n",
"`predict`.\n",
"\n",
"As a shorthand, we can check the score of the full predictive pipeline calling\n",
diff --git a/notebooks/02_numerical_pipeline_sol_00.ipynb b/notebooks/02_numerical_pipeline_sol_00.ipynb
index ff144d5c0..e5be6f7e2 100644
--- a/notebooks/02_numerical_pipeline_sol_00.ipynb
+++ b/notebooks/02_numerical_pipeline_sol_00.ipynb
@@ -44,11 +44,12 @@
"number of neighbors we are going to use to make a prediction for a new data\n",
"point.\n",
"\n",
- "What is the default value of the `n_neighbors` parameter? Hint: Look at the\n",
- "documentation on the [scikit-learn\n",
+ "What is the default value of the `n_neighbors` parameter?\n",
+ "\n",
+ "**Hint**: Look at the documentation on the [scikit-learn\n",
"website](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)\n",
"or directly access the description inside your notebook by running the\n",
- "following cell. This will open a pager pointing to the documentation."
+ "following cell. This opens a pager pointing to the documentation."
]
},
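For reference, the documented default is 5; it can also be read off a fresh estimator instance:

```python
from sklearn.neighbors import KNeighborsClassifier

# Constructor parameters are stored as attributes of the same name.
print(KNeighborsClassifier().n_neighbors)  # 5
```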
{
diff --git a/notebooks/02_numerical_pipeline_sol_01.ipynb b/notebooks/02_numerical_pipeline_sol_01.ipynb
index 2198c76b8..352cf234f 100644
--- a/notebooks/02_numerical_pipeline_sol_01.ipynb
+++ b/notebooks/02_numerical_pipeline_sol_01.ipynb
@@ -37,8 +37,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "We will first split our dataset to have the target separated from the data\n",
- "used to train our predictive model."
+ "We first split our dataset to have the target separated from the data used to\n",
+ "train our predictive model."
]
},
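A sketch of that split, assuming the adult census CSV and its `"class"` target column used in the previous notebooks:

```python
import pandas as pd

# Path and column name assumed from the previous notebooks.
adult_census = pd.read_csv("../datasets/adult-census.csv")
target = adult_census["class"]             # the column we want to predict
data = adult_census.drop(columns="class")  # the remaining columns are features
```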
{
@@ -96,8 +96,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "Use a `DummyClassifier` such that the resulting classifier will always predict\n",
- "the class `' >50K'`. What is the accuracy score on the test set? Repeat the\n",
+ "Use a `DummyClassifier` such that the resulting classifier always predict the\n",
+ "class `' >50K'`. What is the accuracy score on the test set? Repeat the\n",
"experiment by always predicting the class `' <=50K'`.\n",
"\n",
"Hint: you can set the `strategy` parameter of the `DummyClassifier` to achieve\n",
@@ -131,8 +131,8 @@
},
"source": [
"We clearly see that the score is below 0.5 which might be surprising at first.\n",
- "We will now check the generalization performance of a model which always\n",
- "predict the low revenue class, i.e. `\" <=50K\"`."
+ "We now check the generalization performance of a model which always predict\n",
+ "the low revenue class, i.e. `\" <=50K\"`."
]
},
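A self-contained sketch of the `strategy="constant"` option on toy data standing in for the census split:

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# Toy data: one " >50K" sample out of four (illustrative, not the census data).
X = pd.DataFrame({"age": [25, 40, 60, 35]})
y = pd.Series([" >50K", " <=50K", " <=50K", " <=50K"])

# strategy="constant" ignores the features and always predicts `constant`.
clf = DummyClassifier(strategy="constant", constant=" >50K")
clf.fit(X, y)
print(clf.score(X, y))  # accuracy equals the frequency of " >50K", here 0.25
```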
{
@@ -175,7 +175,7 @@
},
"source": [
"Therefore, any predictive model giving results below this dummy classifier\n",
- "will not be helpful."
+ "would not be helpful."
]
},
{
diff --git a/notebooks/03_categorical_pipeline.ipynb b/notebooks/03_categorical_pipeline.ipynb
index 3972842a5..575268c9f 100644
--- a/notebooks/03_categorical_pipeline.ipynb
+++ b/notebooks/03_categorical_pipeline.ipynb
@@ -6,9 +6,9 @@
"source": [
"# Encoding of categorical variables\n",
"\n",
- "In this notebook, we will present typical ways of dealing with\n",
- "**categorical variables** by encoding them, namely **ordinal encoding** and\n",
- "**one-hot encoding**."
+ "In this notebook, we present some typical ways of dealing with **categorical\n",
+ "variables** by encoding them, namely **ordinal encoding** and **one-hot\n",
+ "encoding**."
]
},
{
@@ -94,9 +94,9 @@
"## Select features based on their data type\n",
"\n",
"In the previous notebook, we manually defined the numerical columns. We could\n",
- "do a similar approach. Instead, we will use the scikit-learn helper function\n",
- "`make_column_selector`, which allows us to select columns based on\n",
- "their data type. We will illustrate how to use this helper."
+ "do a similar approach. Instead, we can use the scikit-learn helper function\n",
+ "`make_column_selector`, which allows us to select columns based on their data\n",
+ "type. We now illustrate how to use this helper."
]
},
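A minimal sketch of the helper on a toy dataframe with illustrative columns:

```python
import pandas as pd
from sklearn.compose import make_column_selector as selector

df = pd.DataFrame({"age": [25, 40], "workclass": ["Private", "State-gov"]})

# The selector is a callable: given a dataframe, it returns the names of
# the columns whose dtype matches `dtype_include`.
categorical_selector = selector(dtype_include=object)
print(categorical_selector(df))  # ['workclass']
```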
{
@@ -159,9 +159,8 @@
"### Encoding ordinal categories\n",
"\n",
"The most intuitive strategy is to encode each category with a different\n",
- "number. The `OrdinalEncoder` will transform the data in such manner.\n",
- "We will start by encoding a single column to understand how the encoding\n",
- "works."
+ "number. The `OrdinalEncoder` transforms the data in such manner. We start by\n",
+ "encoding a single column to understand how the encoding works."
]
},
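A sketch of encoding one illustrative column; note that `OrdinalEncoder` expects a 2D input:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

education = pd.DataFrame(
    {"education": [" Bachelors", " HS-grad", " Masters", " HS-grad"]}
)
encoder = OrdinalEncoder()
print(encoder.fit_transform(education))  # each category replaced by an integer code
print(encoder.categories_)               # the learned categories, in code order
```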
{
@@ -258,13 +257,13 @@
"\n",
"`OneHotEncoder` is an alternative encoder that prevents the downstream\n",
"models to make a false assumption about the ordering of categories. For a\n",
- "given feature, it will create as many new columns as there are possible\n",
+ "given feature, it creates as many new columns as there are possible\n",
"categories. For a given sample, the value of the column corresponding to the\n",
- "category will be set to `1` while all the columns of the other categories\n",
- "will be set to `0`.\n",
+ "category is set to `1` while all the columns of the other categories\n",
+ "are set to `0`.\n",
"\n",
- "We will start by encoding a single feature (e.g. `\"education\"`) to illustrate\n",
- "how the encoding works."
+ "We can encode a single feature (e.g. `\"education\"`) to illustrate how the\n",
+ "encoding works."
]
},
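A sketch on the same kind of illustrative column; `sparse_output=False` is the parameter name in recent scikit-learn versions (older releases use `sparse=False`):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

education = pd.DataFrame({"education": [" Bachelors", " HS-grad", " Masters"]})

encoder = OneHotEncoder(sparse_output=False)  # dense array for easier inspection
encoded = encoder.fit_transform(education)    # one 0/1 column per category
print(encoder.get_feature_names_out())        # column names derived from categories
print(encoded)
```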
{
@@ -299,7 +298,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "We see that encoding a single feature will give a dataframe full of zeros\n",
+ "We see that encoding a single feature gives a dataframe full of zeros\n",
"and ones. Each category (unique value) became a column; the encoding\n",
"returned, for each sample, a 1 to specify which category it belongs to.\n",
"\n",
@@ -353,8 +352,8 @@
"source": [
"### Choosing an encoding strategy\n",
"\n",
- "Choosing an encoding strategy will depend on the underlying models and the\n",
- "type of categories (i.e. ordinal vs. nominal)."
+ "Choosing an encoding strategy depends on the underlying models and the type of\n",
+ "categories (i.e. ordinal vs. nominal)."
]
},
{
@@ -373,12 +372,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "\n",
- "Using an `OrdinalEncoder` will output ordinal categories. This means\n",
+ "Using an `OrdinalEncoder` outputs ordinal categories. This means\n",
"that there is an order in the resulting categories (e.g. `0 < 1 < 2`). The\n",
"impact of violating this ordering assumption is really dependent on the\n",
- "downstream models. Linear models will be impacted by misordered categories\n",
- "while tree-based models will not.\n",
+ "downstream models. Linear models would be impacted by misordered categories\n",
+ "while tree-based models would not.\n",
"\n",
"You can still use an `OrdinalEncoder` with linear models but you need to be\n",
"sure that:\n",
@@ -426,7 +424,7 @@
"We see that the `\"Holand-Netherlands\"` category is occurring rarely. This will\n",
"be a problem during cross-validation: if the sample ends up in the test set\n",
"during splitting then the classifier would not have seen the category during\n",
- "training and will not be able to encode it.\n",
+ "training and would not be able to encode it.\n",
"\n",
"In scikit-learn, there are some possible solutions to bypass this issue:\n",
"\n",
@@ -455,8 +453,8 @@
"
Tip
\n",
"
Be aware the OrdinalEncoder exposes a parameter also named handle_unknown.\n",
"It can be set to use_encoded_value. If that option is chosen, you can define\n",
- "a fixed value to which all unknowns will be set to during transform. For\n",
- "example, OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=42) will set all values encountered during transform to 42\n",
+ "a fixed value that is assigned to all unknown categories during transform.\n",
+ "For example, OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1) would set all values encountered during transform to -1\n",
"which are not part of the data encountered during the fit call. You are\n",
"going to use these parameters in the next exercise.
\n",
"
"
diff --git a/notebooks/03_categorical_pipeline_column_transformer.ipynb b/notebooks/03_categorical_pipeline_column_transformer.ipynb
index aca827f4c..f9f3d5293 100644
--- a/notebooks/03_categorical_pipeline_column_transformer.ipynb
+++ b/notebooks/03_categorical_pipeline_column_transformer.ipynb
@@ -6,12 +6,12 @@
"source": [
"# Using numerical and categorical variables together\n",
"\n",
- "In the previous notebooks, we showed the required preprocessing to apply\n",
- "when dealing with numerical and categorical variables. However, we decoupled\n",
- "the process to treat each type individually. In this notebook, we will show\n",
- "how to combine these preprocessing steps.\n",
+ "In the previous notebooks, we showed the required preprocessing to apply when\n",
+ "dealing with numerical and categorical variables. However, we decoupled the\n",
+ "process to treat each type individually. In this notebook, we show how to\n",
+ "combine these preprocessing steps.\n",
"\n",
- "We will first load the entire adult census dataset."
+ "We first load the entire adult census dataset."
]
},
{
@@ -38,10 +38,10 @@
"source": [
"## Selection based on data types\n",
"\n",
- "We will separate categorical and numerical variables using their data\n",
- "types to identify them, as we saw previously that `object` corresponds\n",
- "to categorical columns (strings). We make use of `make_column_selector`\n",
- "helper to select the corresponding columns."
+ "We separate categorical and numerical variables using their data types to\n",
+ "identify them, as we saw previously that `object` corresponds to categorical\n",
+ "columns (strings). We make use of `make_column_selector` helper to select the\n",
+ "corresponding columns."
]
},
{
@@ -84,14 +84,14 @@
"In the previous sections, we saw that we need to treat data differently\n",
"depending on their nature (i.e. numerical or categorical).\n",
"\n",
- "Scikit-learn provides a `ColumnTransformer` class which will send specific\n",
+ "Scikit-learn provides a `ColumnTransformer` class which sends specific\n",
"columns to a specific transformer, making it easy to fit a single predictive\n",
"model on a dataset that combines both kinds of variables together\n",
"(heterogeneously typed tabular data).\n",
"\n",
"We first define the columns depending on their data type:\n",
"\n",
- "* **one-hot encoding** will be applied to categorical columns. Besides, we use\n",
+ "* **one-hot encoding** is applied to categorical columns. Besides, we use\n",
" `handle_unknown=\"ignore\"` to solve the potential issues due to rare\n",
" categories.\n",
"* **numerical scaling** numerical features which will be standardized.\n",
@@ -149,11 +149,11 @@
"A `ColumnTransformer` does the following:\n",
"\n",
"* It **splits the columns** of the original dataset based on the column names\n",
- " or indices provided. We will obtain as many subsets as the number of\n",
- " transformers passed into the `ColumnTransformer`.\n",
+ " or indices provided. We obtain as many subsets as the number of transformers\n",
+ " passed into the `ColumnTransformer`.\n",
"* It **transforms each subsets**. A specific transformer is applied to each\n",
- " subset: it will internally call `fit_transform` or `transform`. The output\n",
- " of this step is a set of transformed datasets.\n",
+ " subset: it internally calls `fit_transform` or `transform`. The output of\n",
+ " this step is a set of transformed datasets.\n",
"* It then **concatenates the transformed datasets** into a single dataset.\n",
"\n",
"The important thing is that `ColumnTransformer` is like any other scikit-learn\n",
@@ -234,7 +234,7 @@
"source": [
"Then, we can send the raw dataset straight to the pipeline. Indeed, we do not\n",
"need to make any manual preprocessing (calling the `transform` or\n",
- "`fit_transform` methods) as it will be handled when calling the `predict`\n",
+ "`fit_transform` methods) as it is already handled when calling the `predict`\n",
"method. As an example, we predict on the five first samples from the test set."
]
},
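A self-contained sketch on toy mixed-type data, showing that raw rows go straight into `predict`:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy heterogeneous data standing in for the census dataset (illustrative).
data = pd.DataFrame({
    "age": [25, 40, 60, 35],
    "workclass": ["Private", "State-gov", "Private", "Private"],
})
target = pd.Series([" <=50K", " >50K", " <=50K", " >50K"])

preprocessor = ColumnTransformer([
    ("one-hot-encoder", OneHotEncoder(handle_unknown="ignore"), ["workclass"]),
    ("standard-scaler", StandardScaler(), ["age"]),
])
model = make_pipeline(preprocessor, LogisticRegression()).fit(data, target)
print(model.predict(data[:2]))  # raw rows in, predictions out; no manual transform
```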
@@ -337,10 +337,10 @@
"\n",
"However, it is often useful to check whether more complex models such as an\n",
"ensemble of decision trees can lead to higher predictive performance. In this\n",
- "section we will use such a model called **gradient-boosting trees** and\n",
- "evaluate its generalization performance. More precisely, the scikit-learn\n",
- "model we will use is called `HistGradientBoostingClassifier`. Note that\n",
- "boosting models will be covered in more detail in a future module.\n",
+ "section we use such a model called **gradient-boosting trees** and evaluate\n",
+ "its generalization performance. More precisely, the scikit-learn model we use\n",
+ "is called `HistGradientBoostingClassifier`. Note that boosting models will be\n",
+ "covered in more detail in a future module.\n",
"\n",
"For tree-based models, the handling of numerical and categorical variables is\n",
"simpler than for linear models:\n",
diff --git a/notebooks/03_categorical_pipeline_ex_01.ipynb b/notebooks/03_categorical_pipeline_ex_01.ipynb
index 1f7ab830e..d77bbef38 100644
--- a/notebooks/03_categorical_pipeline_ex_01.ipynb
+++ b/notebooks/03_categorical_pipeline_ex_01.ipynb
@@ -47,9 +47,8 @@
"source": [
"In the previous notebook, we used `sklearn.compose.make_column_selector` to\n",
"automatically select columns with a specific data type (also called `dtype`).\n",
- "Here, we will use this selector to get only the columns containing strings\n",
- "(column with `object` dtype) that correspond to categorical features in our\n",
- "dataset."
+ "Here, we use this selector to get only the columns containing strings (column\n",
+ "with `object` dtype) that correspond to categorical features in our dataset."
]
},
{
@@ -102,11 +101,11 @@
"