ENH Improve wording in stratification notebook #760

Merged
merged 7 commits on May 17, 2024
71 changes: 40 additions & 31 deletions python_scripts/cross_validation_stratification.py
@@ -36,10 +36,11 @@
model = make_pipeline(StandardScaler(), LogisticRegression())

# %% [markdown]
# Once we created our model, we will use the cross-validation framework to
# evaluate it. We will use the `KFold` cross-validation strategy. We will define
# a dataset with nine samples and repeat the cross-validation three times (i.e.
# `n_splits`).
# Once the model is created, we can evaluate it using cross-validation. We start
# by using the `KFold` strategy.
#
# Let's review how this strategy works. For this purpose, we define a dataset
# with nine samples and split it into three folds (i.e. `n_splits=3`).

# %%
import numpy as np
@@ -51,12 +51,12 @@
print("TRAIN:", train_index, "TEST:", test_index)

# %% [markdown]
# By defining three splits, we will use three samples for testing and six for
# training each time. `KFold` does not shuffle by default. It means that it will
# select the three first samples for the testing set at the first split, then
# the next three samples for the second split, and the three next for the
# last split. In the end, all samples have been used in testing at least once
# among the different splits.
# By defining three splits, we use three samples (one fold) for testing and six
# (two folds) for training each time. `KFold` does not shuffle by default. This
# means that the first three samples are selected for the testing set at the
# first split, the next three samples for the second split, and the last three
# for the third split. In the end, all samples have been used for testing at
# least once among the different splits.
#
# Now, let's apply this strategy to check the generalization performance of our
# model.
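
# %% [markdown]
# A minimal sketch of this evaluation, assuming the `data`, `target` and
# `model` objects defined earlier in the notebook:

# %%
from sklearn.model_selection import KFold, cross_validate

cv = KFold(n_splits=3)  # folds follow the original sample order
cv_results = cross_validate(model, data, target, cv=cv)
print(f"Mean test accuracy: {cv_results['test_score'].mean():.3f}")
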
@@ -73,8 +74,8 @@

# %% [markdown]
# It is a real surprise that our model cannot correctly classify any sample in
# any cross-validation split. We will now check our target's value to understand
# the issue.
# any cross-validation split. We now check our target's value to understand the
# issue.

# %%
import matplotlib.pyplot as plt
@@ -86,18 +87,17 @@
_ = plt.title("Class value in target y")

# %% [markdown]
# We see that the target vector `target` is ordered. It will have some
# unexpected consequences when using the `KFold` cross-validation. To illustrate
# the consequences, we will show the class count in each fold of the
# cross-validation in the train and test set.
# We see that the target vector `target` is ordered. This has some unexpected
# consequences when using `KFold` cross-validation. To illustrate these
# consequences, we show the class counts in each fold of the cross-validation,
# for both the train and test sets.
#
# Let's compute the class counts for both the training and testing sets using
# the `KFold` cross-validation, and plot this information in a bar plot.
#
# We will iterate given the number of split and check how many samples of each
# are present in the training and testing set. We will store the information
# into two distincts lists; one for the training set and one for the testing
# set.
# We iterate over the splits and check how many samples of each class are
# present in the training and testing sets. We then store this information in
# two distinct lists: one for the training set and one for the testing set.

# %%
import pandas as pd
@@ -114,8 +114,8 @@
test_cv_counts.append(target_test.value_counts())
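
# %% [markdown]
# Written out in full, the counting loop could look like the following sketch,
# assuming the `data`, `target` and `cv` objects from earlier cells:

# %%
train_cv_counts = []
test_cv_counts = []
for train_index, test_index in cv.split(data, target):
    target_train = target.iloc[train_index]
    target_test = target.iloc[test_index]
    # count how many samples of each class fall in this training/testing split
    train_cv_counts.append(target_train.value_counts())
    test_cv_counts.append(target_test.value_counts())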

# %% [markdown]
# To plot the information on a single figure, we will concatenate the
# information regarding the fold within the same dataset.
# To plot the information on a single figure, we concatenate the counts from
# each fold into a single dataframe.
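
# %% [markdown]
# For instance, the per-fold counts can be gathered into a single dataframe
# with one entry per fold (the name `folds_train_counts` is only illustrative):

# %%
folds_train_counts = pd.concat(
    train_cv_counts,
    axis=1,
    keys=[f"Fold #{idx}" for idx in range(len(train_cv_counts))],
).T  # one row per fold, one column per class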

# %%
train_cv_counts = pd.concat(
@@ -138,13 +138,13 @@
train_cv_counts.plot.bar()
plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left")
plt.ylabel("Count")
_ = plt.title("Training set")
_ = plt.title("Training set class counts")

# %%
test_cv_counts.plot.bar()
plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left")
plt.ylabel("Count")
_ = plt.title("Test set")
_ = plt.title("Test set class counts")

# %% [markdown]
# We can confirm that in each fold, only two of the three classes are present in
@@ -168,7 +168,7 @@
# 90%. Now that we solved our first issue, it would be interesting to check if
# the class frequency in the training and testing set is equal to our original
# set's class frequency. It would ensure that we are training and testing our
# model with a class distribution that we will encounter in production.
# model with a class distribution that we would encounter in production.
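
# %% [markdown]
# As a reference point for this comparison, the class proportions of the full
# target can be computed directly (a small illustrative addition):

# %%
target.value_counts(normalize=True)  # overall class proportions in the dataset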

# %%
train_cv_counts = []
@@ -191,13 +191,13 @@
train_cv_counts.plot.bar()
plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left")
plt.ylabel("Count")
_ = plt.title("Training set")
_ = plt.title("Training set class counts\n(with suffling)")

# %%
test_cv_counts.plot.bar()
plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left")
plt.ylabel("Count")
_ = plt.title("Test set")
_ = plt.title("Test set class counts\n(with suffling)")

# %% [markdown]
# We see that neither the training nor the testing sets have the same class
@@ -242,18 +242,27 @@
train_cv_counts.plot.bar()
plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left")
plt.ylabel("Count")
_ = plt.title("Training set")
_ = plt.title("Training set class counts\n(with stratifying)")

# %%
test_cv_counts.plot.bar()
plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left")
plt.ylabel("Count")
_ = plt.title("Test set")
_ = plt.title("Test set class counts\n(with stratifying)")

# %% [markdown]
# In this case, we observe that the class counts are very close both in the
# train set and the test set. The difference is due to the small number of
# samples in the iris dataset.
#
# In conclusion, it is good practice to use stratification within the
# cross-validation framework when dealing with a classification problem.
# In other words, stratifying is more effective than just shuffling when it
# comes to making sure that the distributions of classes in all the folds are
# representative of the entire dataset. As training and testing folds have
# similar class distributions, stratifying leads to a more realistic measure of
# the model’s ability to generalize. This is especially important when the
# performance metrics depend on the proportion of the positive class, as we will
# see in a future notebook.
#
# The interested reader can learn about other stratified cross-validation
# techniques in the [scikit-learn user
# guide](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators-with-stratification-based-on-class-labels).
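
# %% [markdown]
# A minimal sketch of the stratified strategy discussed above, assuming the
# `data`, `target` and `model` objects from this notebook:

# %%
from sklearn.model_selection import StratifiedKFold, cross_validate

cv = StratifiedKFold(n_splits=3)  # each fold preserves the class proportions
results = cross_validate(model, data, target, cv=cv)
print(
    "Mean test accuracy with stratification: "
    f"{results['test_score'].mean():.3f}"
)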