After thinking about this, I'm not planning to stratify cross-validation folds by label. If there are so few positive labels that some splits end up with only one class by chance (e.g. the TP53/OV case described above), we're unlikely to be able to train effective models anyway, due to the extreme label imbalance.
In general, I think there are downsides to stratifying by label (see, e.g., this CrossValidated post or this one). I want our cross-validation to be as representative of external datasets as possible (some of which may have different label proportions than TCGA), and generating CV folds randomly many times seems like a better way to evaluate generalization than forcing every test set to have the same label proportion.
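To illustrate the repeated-random-splits idea, here's a minimal sketch (assuming scikit-learn; the toy label vector and split parameters are made up for illustration). Non-stratified random splits let the test-set label proportion fluctuate around the dataset's overall proportion, which loosely mimics external datasets whose proportions differ from the training data's:

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Toy labels: 80 negatives, 20 positives (hypothetical proportions).
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((len(y), 1))  # features are irrelevant to the split itself

# 10 random, non-stratified train/test splits.
ss = ShuffleSplit(n_splits=10, test_size=0.25, random_state=0)

# Positive-label fraction in each random test set; unlike stratified
# splitting, these fractions vary from split to split.
fracs = [y[test].mean() for _, test in ss.split(X)]
print(fracs)
```

Averaging a metric over many such splits gives a generalization estimate that doesn't assume every evaluation set matches the training label proportion.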
I may revisit this in the future, but closing for now.
Reopening this in light of #31 (comment). I think stratification by label may be the best solution to the issue described there; I need to think about it a bit more.
If labels are highly imbalanced (for example, TP53 in ovarian cancer), ROC AUC computation can break because some cross-validation splits will contain only one class.
Maybe using StratifiedKFold instead of standard k-fold CV is the best solution here?
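A minimal sketch of the contrast (assuming scikit-learn; the 96/4 label vector is a made-up stand-in for an extreme case like TP53/OV). Standard k-fold can, by chance, leave a test fold with zero positives, while `StratifiedKFold` preserves the class ratio in every fold:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy labels: 96 negatives, 4 positives (hypothetical extreme imbalance).
y = np.array([0] * 96 + [1] * 4)
X = np.zeros((len(y), 1))  # features don't affect the splitting

# Standard k-fold splits ignore y, so a test fold may get no positives,
# which makes ROC AUC undefined for that fold.
kf = KFold(n_splits=4, shuffle=True, random_state=0)
kf_pos = [int(y[test].sum()) for _, test in kf.split(X)]

# StratifiedKFold splits on y, so each of the 4 folds gets one positive.
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
skf_pos = [int(y[test].sum()) for _, test in skf.split(X, y)]

print(kf_pos, skf_pos)
```

With 4 positives and 4 folds, stratification guarantees exactly one positive per test fold, so ROC AUC is computable (if noisy) on every fold.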