
ENH Improve wording in stratification notebook #760

Merged · 7 commits merged into INRIA:main on May 17, 2024

Conversation

ArturoAmorQ (Collaborator)

During a training session at Inria Academy we noticed that this notebook never really justifies why stratification is important. This PR adds a couple of paragraphs to better motivate why a simple KFold with shuffling is not good enough practice.

It also takes the opportunity to put verbs in the present tense and to improve the general wording.

NB: I think this PR is safe to merge, as it does not change the overall experience of the MOOC.

@glemaitre glemaitre changed the title Improve wording in stratification notebook ENH Improve wording in stratification notebook Apr 26, 2024
@glemaitre glemaitre assigned glemaitre and unassigned glemaitre Apr 26, 2024
@glemaitre glemaitre self-requested a review April 26, 2024 13:50
@glemaitre glemaitre left a comment

Otherwise LGTM.

python_scripts/cross_validation_stratification.py — 4 resolved review comments (outdated)
#
# In conclusion, it is a good practice to use stratification within the
# cross-validation framework when dealing with a classification problem,
# especially for datasets with imbalanced classes or when the class distribution
glemaitre (Collaborator) commented:

I am not really sure about the conclusion. To me, stratification matters for the following two reasons:

  • if the target labels are ordered or grouped, stratification overcomes this issue even if you forget to shuffle, as in the plain k-fold case;
  • if the sample size is limited or small (and the data are shuffled), a stratified fold ensures similar train/test class distributions compared to uniform sampling.

Overcoming both issues above makes the evaluation closer to reality.
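The first point can be sketched with a minimal example (not part of the PR): on the iris dataset, whose labels are stored ordered by class, a plain `KFold` without shuffling produces test folds that miss entire classes, while `StratifiedKFold` keeps all classes in every fold.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, StratifiedKFold

X, y = load_iris(return_X_y=True)  # labels are ordered: all 0s, then 1s, then 2s

# Without shuffling, each KFold test fold covers one contiguous block of
# samples, so here every test fold contains a single class.
for _, test_idx in KFold(n_splits=3).split(X):
    print("KFold test classes:          ", np.unique(y[test_idx]))

# StratifiedKFold preserves the class proportions in every fold, so each
# test fold contains all three classes even without shuffling.
for _, test_idx in StratifiedKFold(n_splits=3).split(X, y):
    print("StratifiedKFold test classes:", np.unique(y[test_idx]))
```

Training on two classes and testing on the third is the degenerate case the notebook warns about, and shuffling alone only mitigates it on average; stratification guarantees the per-fold class balance.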

ArturoAmorQ (Collaborator, Author) replied:

I am not entirely convinced by the rewording I made, but I think I addressed your points in 42c9775.

@glemaitre glemaitre left a comment

LGTM.

@glemaitre glemaitre merged commit ca7d1d7 into INRIA:main May 17, 2024
3 checks passed
github-actions bot pushed a commit that referenced this pull request May 17, 2024
Co-authored-by: ArturoAmorQ <[email protected]>
Co-authored-by: Guillaume Lemaitre <[email protected]> ca7d1d7
@ArturoAmorQ ArturoAmorQ deleted the stratification branch May 17, 2024 09:18