[ci skip] FIX correction for some typos (INRIA#779)
Co-authored-by: ArturoAmorQ <[email protected]> 00379f8
kalona committed May 23, 2024
1 parent c4a77f2 commit 482baaa
Showing 500 changed files with 5,539 additions and 4,556 deletions.
2 changes: 1 addition & 1 deletion .buildinfo
@@ -1,4 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: 4d0b8ff6c57249ded5f0eb4d309710a9
config: b1d24180e61c8890f3448461bef0fb15
tags: 645f666f9bcd5a90fca523b33c5a78b7
Binary file modified _images/cross_validation_train_test_diagram.png
Binary file modified _images/nested_cross_validation_diagram.png
6 changes: 3 additions & 3 deletions _sources/index.md
@@ -36,8 +36,8 @@ interpreting their predictions.
<a href="https://www.fun-mooc.fr/en/courses/machine-learning-python-scikit-learn">
"Machine learning in Python with scikit-learn MOOC"
</a>,
is available starting on October 18, 2022 and will last for 3 months. Enroll for
the full MOOC experience (quizz solutions, executable notebooks, discussion
is available starting on November 8th, 2023 and will remain open in self-paced mode.
Enroll for the full MOOC experience (quiz solutions, executable notebooks, discussion
forum, etc.)!
<br/>
The MOOC is free and the platform does not use the student data for any other purpose
@@ -79,7 +79,7 @@ You can cite us through the project's Zenodo archive using the following DOI:
[10.5281/zenodo.7220306](https://doi.org/10.5281/zenodo.7220306).

The following repository includes the notebooks, exercises and solutions to the
exercises (but not the quizz solutions ;):
exercises (but not the quizzes' solutions ;):

https://github.com/INRIA/scikit-learn-mooc/

9 changes: 9 additions & 0 deletions _sources/python_scripts/01_tabular_data_exploration.py
@@ -70,6 +70,15 @@
# %%
adult_census.head()

# %% [markdown]
# An alternative is to omit the `head` method. This would output the initial and
# final rows and columns, but everything in between is not shown by default. It
# also provides the dataframe's dimensions at the bottom in the format `n_rows`
# x `n_columns`.

# %%
adult_census
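
# %% [markdown]
# If only the dimensions are of interest, the dataframe's `shape` attribute
# returns them directly as a `(n_rows, n_columns)` tuple. A minimal sketch
# using the `adult_census` dataframe loaded above:

# %%
adult_census.shape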

# %% [markdown]
# The column named **class** is our target variable (i.e., the variable which we
# want to predict). The two possible classes are `<=50K` (low-revenue) and
8 changes: 4 additions & 4 deletions _sources/python_scripts/02_numerical_pipeline_hands_on.py
@@ -34,7 +34,7 @@
adult_census = pd.read_csv("../datasets/adult-census.csv")
# drop the duplicated column `"education-num"` as stated in the first notebook
adult_census = adult_census.drop(columns="education-num")
adult_census.head()
adult_census

# %% [markdown]
# The next step separates the target from the data. We performed the same
@@ -44,7 +44,7 @@
data, target = adult_census.drop(columns="class"), adult_census["class"]

# %%
data.head()
data

# %%
target
@@ -95,7 +95,7 @@
# the `object` data type.

# %%
data.head()
data

# %% [markdown]
# We see that the `object` data type corresponds to columns containing strings.
@@ -105,7 +105,7 @@

# %%
numerical_columns = ["age", "capital-gain", "capital-loss", "hours-per-week"]
data[numerical_columns].head()
data[numerical_columns]
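
# %% [markdown]
# As a side note, the numerical columns can also be selected by dtype rather
# than listed by hand. A minimal sketch, assuming the `data` dataframe from
# above:

# %%
from sklearn.compose import make_column_selector

# build a callable that returns the names of the numerical columns
numerical_selector = make_column_selector(dtype_include="number")
numerical_selector(data)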

# %% [markdown]
# Now that we limited the dataset to numerical columns only, we can analyse
6 changes: 3 additions & 3 deletions _sources/python_scripts/02_numerical_pipeline_introduction.py
@@ -39,7 +39,7 @@
# Let's have a look at the first records of this dataframe:

# %%
adult_census.head()
adult_census

# %% [markdown]
# We see that this CSV file contains all information: the target that we would
@@ -56,10 +56,10 @@

# %%
data = adult_census.drop(columns=[target_name])
data.head()
data

# %% [markdown]
# We can now linger on the variables, also denominated features, that we later
# We can now focus on the variables, also denominated features, that we later
# use to build our predictive model. In addition, we can also check how many
# samples are available in our dataset.

13 changes: 7 additions & 6 deletions _sources/python_scripts/03_categorical_pipeline.py
@@ -81,7 +81,7 @@

# %%
data_categorical = data[categorical_columns]
data_categorical.head()
data_categorical

# %%
print(f"The dataset is composed of {data_categorical.shape[1]} features")
@@ -194,7 +194,7 @@

# %%
print(f"The dataset is composed of {data_categorical.shape[1]} features")
data_categorical.head()
data_categorical

# %%
data_encoded = encoder.fit_transform(data_categorical)
@@ -253,7 +253,7 @@
# and check the generalization performance of this machine learning pipeline using
# cross-validation.
#
# Before we create the pipeline, we have to linger on the `native-country`.
# Before we create the pipeline, we have to focus on the `native-country`
# column. Let's recall some statistics regarding it.

# %%
@@ -329,9 +329,10 @@
print(f"The accuracy is: {scores.mean():.3f} ± {scores.std():.3f}")

# %% [markdown]
# As you can see, this representation of the categorical variables is
# slightly more predictive of the revenue than the numerical variables
# that we used previously.
# As you can see, this representation of the categorical variables is slightly
# more predictive of the revenue than the numerical variables that we used
# previously. This is because we have more (predictive) categorical features
# than numerical ones.

# %% [markdown]
#
@@ -165,7 +165,7 @@
# method. As an example, we predict on the first five samples from the test set.

# %%
data_test.head()
data_test

# %%
model.predict(data_test)[:5]
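
# %% [markdown]
# To judge these predictions, we can put them side by side with the true
# labels. A short sketch, assuming a `target_test` series was created together
# with `data_test` when splitting the data:

# %%
# the actual labels of the same five samples
target_test[:5]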
2 changes: 1 addition & 1 deletion _sources/python_scripts/cross_validation_ex_01.py
@@ -45,7 +45,7 @@
# exercise.
#
# Also, this classifier can become more flexible/expressive by using a so-called
# kernel that makes the model become non-linear. Again, no requirement regarding
# kernel that makes the model become non-linear. Again, no understanding regarding
# the mathematics is required to accomplish this exercise.
#
# We will use an RBF kernel where a parameter `gamma` allows us to tune the
76 changes: 47 additions & 29 deletions _sources/python_scripts/cross_validation_grouping.py
@@ -7,9 +7,8 @@

# %% [markdown]
# # Sample grouping
# We are going to linger into the concept of sample groups. As in the previous
# section, we will give an example to highlight some surprising results. This
# time, we will use the handwritten digits dataset.
# In this notebook we present the concept of **sample groups**. We use the
# handwritten digits dataset to highlight some surprising results.

# %%
from sklearn.datasets import load_digits
@@ -18,8 +17,17 @@
data, target = digits.data, digits.target

# %% [markdown]
# We will recreate the same model used in the previous notebook: a logistic
# regression classifier with a preprocessor to scale the data.
# We create a model consisting of a logistic regression classifier with a
# preprocessor to scale the data.
#
# ```{note}
# Here we use a `MinMaxScaler` as we know that each pixel's gray-scale is
# strictly bounded between 0 (white) and 16 (black). This makes `MinMaxScaler`
# more suited in this case than `StandardScaler`, as some pixels consistently
# have low variance (pixels at the borders might almost always be zero if most
# digits are centered in the image). Then, using `StandardScaler` can result in
# a very high scaled value due to division by a small number.
# ```

# %%
from sklearn.preprocessing import MinMaxScaler
@@ -29,8 +37,10 @@
model = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1_000))
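
# %% [markdown]
# To make the note above concrete, here is a minimal sketch (using the `data`
# array loaded above) comparing the largest absolute value produced by each
# scaler: `MinMaxScaler` is bounded by 1 by construction, while
# `StandardScaler` can inflate a low-variance pixel far beyond that by
# dividing by a tiny standard deviation.

# %%
import numpy as np

from sklearn.preprocessing import StandardScaler

print(f"MinMaxScaler max: {np.abs(MinMaxScaler().fit_transform(data)).max():.1f}")
print(f"StandardScaler max: {np.abs(StandardScaler().fit_transform(data)).max():.1f}")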

# %% [markdown]
# We will use the same baseline model. We will use a `KFold` cross-validation
# without shuffling the data at first.
# The idea is to compare the estimated generalization performance using
# different cross-validation techniques and see how such estimations are
# impacted by underlying data structures. We first use a `KFold`
# cross-validation without shuffling the data.

# %%
from sklearn.model_selection import cross_val_score, KFold
@@ -59,9 +69,9 @@
)
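
# %% [markdown]
# A sketch of the kind of calls that produce the two sets of scores compared
# below (the exact arguments are assumptions; only `test_score_no_shuffling`
# is a name taken from the notebook):

# %%
cv = KFold(shuffle=False)
test_score_no_shuffling = cross_val_score(model, data, target, cv=cv)

cv = KFold(shuffle=True, random_state=0)
test_score_with_shuffling = cross_val_score(model, data, target, cv=cv)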

# %% [markdown]
# We observe that shuffling the data improves the mean accuracy. We could go a
# little further and plot the distribution of the testing score. We can first
# concatenate the test scores.
# We observe that shuffling the data improves the mean accuracy. We can go a
# little further and plot the distribution of the testing score. For that
# purpose, we concatenate the test scores.

# %%
import pandas as pd
@@ -72,29 +82,29 @@
).T

# %% [markdown]
# Let's plot the distribution now.
# Let's now plot the score distributions.

# %%
import matplotlib.pyplot as plt

all_scores.plot.hist(bins=10, edgecolor="black", alpha=0.7)
all_scores.plot.hist(bins=16, edgecolor="black", alpha=0.7)
plt.xlim([0.8, 1.0])
plt.xlabel("Accuracy score")
plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left")
_ = plt.title("Distribution of the test scores")

# %% [markdown]
# The cross-validation testing error that uses the shuffling has less variance
# than the one that does not impose any shuffling. It means that some specific
# fold leads to a low score in this case.
# Shuffling the data results in a higher cross-validated test accuracy with
# less variance compared to when the data is not shuffled. This means that,
# without shuffling, some specific folds lead to a particularly low score.

# %%
print(test_score_no_shuffling)

# %% [markdown]
# Thus, there is an underlying structure in the data that shuffling will break
# and get better results. To get a better understanding, we should read the
# documentation shipped with the dataset.
# Thus, shuffling the data breaks the underlying structure and in turn makes
# the classification task easier for our model. To get a better understanding,
# we can read the dataset description in more detail:

# %%
print(digits.DESCR)
@@ -165,7 +175,7 @@
groups[lb:up] = group_id

# %% [markdown]
# We can check the grouping by plotting the indices linked to writer ids.
# We can check the grouping by plotting the indices linked to writers' ids.

# %%
plt.plot(groups)
@@ -176,8 +186,9 @@
_ = plt.title("Underlying writer groups existing in the target")

# %% [markdown]
# Once we group the digits by writer, we can use cross-validation to take this
# information into account: the class containing `Group` should be used.
# Once we group the digits by writer, we can incorporate this information into
# the cross-validation process by using group-aware variations of the strategies
# we have explored in this course, for example, the `GroupKFold` strategy.

# %%
from sklearn.model_selection import GroupKFold
@@ -191,10 +202,12 @@
)
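
# %% [markdown]
# A sketch of the kind of call that produces these scores (the exact arguments
# are assumptions):

# %%
cv = GroupKFold()
scores = cross_val_score(model, data, target, groups=groups, cv=cv)
print(f"The average accuracy is {scores.mean():.3f} ± {scores.std():.3f}")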

# %% [markdown]
# We see that this strategy is less optimistic regarding the model
# generalization performance. However, this is the most reliable if our goal is
# to make handwritten digits recognition writers independent. Besides, we can as
# well see that the standard deviation was reduced.
# We see that this strategy leads to a lower generalization performance than the
# other two techniques. However, this is the most reliable estimate if our goal
# is to evaluate the capabilities of the model to generalize to new unseen
# writers. In this sense, shuffling the dataset (or alternatively using the
# writers' ids as a new feature) would lead the model to memorize each
# writer's particular handwriting.

# %%
all_scores = pd.DataFrame(
@@ -207,13 +220,18 @@
).T

# %%
all_scores.plot.hist(bins=10, edgecolor="black", alpha=0.7)
all_scores.plot.hist(bins=16, edgecolor="black", alpha=0.7)
plt.xlim([0.8, 1.0])
plt.xlabel("Accuracy score")
plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left")
_ = plt.title("Distribution of the test scores")

# %% [markdown]
# As a conclusion, it is really important to take any sample grouping pattern
# into account when evaluating a model. Otherwise, the results obtained will be
# over-optimistic in regards with reality.
# In conclusion, accounting for any sample grouping patterns is crucial when
# assessing a model’s ability to generalize to new groups. Without this
# consideration, the results may appear overly optimistic compared to the actual
# performance.
#
# The interested reader can learn about other group-aware cross-validation
# techniques in the [scikit-learn user
# guide](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators-for-grouped-data).
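
# %% [markdown]
# For instance, `LeaveOneGroupOut` holds out one writer at a time. A minimal
# sketch reusing the `model`, `data`, `target` and `groups` defined above:

# %%
from sklearn.model_selection import LeaveOneGroupOut

cv = LeaveOneGroupOut()  # each writer's digits form the test set exactly once
scores = cross_val_score(model, data, target, groups=groups, cv=cv)
print(f"The average accuracy is {scores.mean():.3f} ± {scores.std():.3f}")
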
2 changes: 1 addition & 1 deletion _sources/python_scripts/cross_validation_learning_curve.py
@@ -13,7 +13,7 @@
# generalizing. Besides these aspects, it is also important to understand how
# the different errors are influenced by the number of samples available.
#
# In this notebook, we will show this aspect by looking a the variability of
# In this notebook, we will show this aspect by looking at the variability of
# the different errors.
#
# Let's first load the data and create the same model as in the previous
2 changes: 1 addition & 1 deletion _sources/python_scripts/cross_validation_sol_01.py
@@ -39,7 +39,7 @@
# exercise.
#
# Also, this classifier can become more flexible/expressive by using a so-called
# kernel that makes the model become non-linear. Again, no requirement regarding
# kernel that makes the model become non-linear. Again, no understanding regarding
# the mathematics is required to accomplish this exercise.
#
# We will use an RBF kernel where a parameter `gamma` allows us to tune the