After thinking about this, I'm not planning to stratify cross-validation folds by label. If there are so few positive labels that some splits end up with only one class by chance (e.g. the TP53/OV case described above), we're unlikely to be able to train effective models anyway, due to the extreme label imbalance.
In general, I think there are downsides to stratifying by label (see, e.g., this CrossValidated post or this one). I want our cross-validation to be as representative of external datasets as possible (some of which may have different label proportions than TCGA), and generating CV folds randomly many times seems like a better way to evaluate generalization than forcing every test set to have the same label proportion.
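To illustrate the repeated-random-splits idea, here's a minimal sketch (assuming scikit-learn; the toy label vector and split parameters are made up for illustration). Non-stratified random splits let the test-set label proportion fluctuate around the dataset's overall proportion, which loosely mimics external datasets whose proportions differ from the training data's:

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Toy labels: 80 negatives, 20 positives (hypothetical proportions).
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((len(y), 1))  # features are irrelevant to the split itself

# 10 random, non-stratified train/test splits.
ss = ShuffleSplit(n_splits=10, test_size=0.25, random_state=0)

# Positive-label fraction in each random test set; unlike stratified
# splitting, these fractions vary from split to split.
fracs = [y[test].mean() for _, test in ss.split(X)]
print(fracs)
```

Averaging a metric over many such splits gives a generalization estimate that doesn't assume every evaluation set matches the training label proportion.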
I may revisit this in the future, but closing for now.
Reopening this in light of #31 (comment). I think stratification by label may be the best solution to the issue described there; I need to think about it a bit more.
If labels are highly imbalanced (for example, TP53 in ovarian cancer), ROC AUC computation can break because some cross-validation splits will contain only one class.
Maybe using StratifiedKFold instead of standard k-fold CV is the best solution here?
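A minimal sketch of the contrast (assuming scikit-learn; the 96/4 label vector is a made-up stand-in for an extreme case like TP53/OV). Standard k-fold can, by chance, leave a test fold with zero positives, while `StratifiedKFold` preserves the class ratio in every fold:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy labels: 96 negatives, 4 positives (hypothetical extreme imbalance).
y = np.array([0] * 96 + [1] * 4)
X = np.zeros((len(y), 1))  # features don't affect the splitting

# Standard k-fold splits ignore y, so a test fold may get no positives,
# which makes ROC AUC undefined for that fold.
kf = KFold(n_splits=4, shuffle=True, random_state=0)
kf_pos = [int(y[test].sum()) for _, test in kf.split(X)]

# StratifiedKFold splits on y, so each of the 4 folds gets one positive.
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
skf_pos = [int(y[test].sum()) for _, test in skf.split(X, y)]

print(kf_pos, skf_pos)
```

With 4 positives and 4 folds, stratification guarantees exactly one positive per test fold, so ROC AUC is computable (if noisy) on every fold.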