
ENH Improve wording in stratification notebook #760

Merged · 7 commits merged into INRIA:main on May 17, 2024

Conversation

ArturoAmorQ (Collaborator)

During a training session at Inria Academy we noticed that this notebook never really justifies why stratification is important. This PR adds a couple of paragraphs to better motivate why a simple KFold with shuffling is not good enough practice.

It also takes the opportunity to put verbs in the present tense and to improve the general wording.

NB: I think this PR is safe to merge, as it does not change the overall experience of the MOOC.

@glemaitre glemaitre changed the title Improve wording in stratification notebook ENH Improve wording in stratification notebook Apr 26, 2024
@glemaitre glemaitre assigned glemaitre and unassigned glemaitre Apr 26, 2024
@glemaitre glemaitre self-requested a review April 26, 2024 13:50
@glemaitre glemaitre left a comment

Otherwise LGTM.

python_scripts/cross_validation_stratification.py — 4 resolved review comments (outdated)
#
# In conclusion, it is a good practice to use stratification within the
# cross-validation framework when dealing with a classification problem,
# especially for datasets with imbalanced classes or when the class distribution
glemaitre (Collaborator) commented:

I am not really sure about the conclusion. To me, stratification matters for the following two reasons:

  • if the target labels are ordered or grouped, stratification overcomes this issue even if you forget to shuffle, as in the plain k-fold case;
  • if the sample size is limited or small (and the data are shuffled), a stratified fold ensures similar train/test class distributions compared to uniform sampling.

Overcoming both issues above makes the evaluation closer to reality.
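The first point can be sketched with a minimal example (not part of the PR): on the iris dataset, whose labels are stored ordered by class, a plain `KFold` without shuffling produces test folds that miss entire classes, while `StratifiedKFold` keeps all classes in every fold.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, StratifiedKFold

X, y = load_iris(return_X_y=True)  # labels are ordered: all 0s, then 1s, then 2s

# Without shuffling, each KFold test fold covers one contiguous block of
# samples, so here every test fold contains a single class.
for _, test_idx in KFold(n_splits=3).split(X):
    print("KFold test classes:          ", np.unique(y[test_idx]))

# StratifiedKFold preserves the class proportions in every fold, so each
# test fold contains all three classes even without shuffling.
for _, test_idx in StratifiedKFold(n_splits=3).split(X, y):
    print("StratifiedKFold test classes:", np.unique(y[test_idx]))
```

Training on two classes and testing on the third is the degenerate case the notebook warns about, and shuffling alone only mitigates it on average; stratification guarantees the per-fold class balance.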

ArturoAmorQ (Collaborator, Author) replied:

I am not entirely convinced by the rewording I made, but I think I addressed your points in 42c9775.

@glemaitre glemaitre left a comment

LGTM.

@glemaitre glemaitre merged commit ca7d1d7 into INRIA:main May 17, 2024
3 checks passed
github-actions bot pushed a commit that referenced this pull request May 17, 2024
Co-authored-by: ArturoAmorQ <[email protected]>
Co-authored-by: Guillaume Lemaitre <[email protected]> ca7d1d7
@ArturoAmorQ ArturoAmorQ deleted the stratification branch May 17, 2024 09:18