-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Bug in sample
Parameter in feature_select()
: Prevent Unintended Row Removal and Dataset Modification
#495
base: main
Are you sure you want to change the base?
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #495 +/- ##
==========================================
+ Coverage 94.39% 94.41% +0.02%
==========================================
Files 57 57
Lines 3159 3171 +12
==========================================
+ Hits 2982 2994 +12
Misses 177 177
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
sample
Parameter: Prevent Unintended Row Removal and Dataset Modificationsample
Parameter in feature_select()
: Prevent Unintended Row Removal and Dataset Modification
This looks functionally good to me @axiomcura - kudos and thanks! @d33bs could you give a quick, but close look to see if there are any technical things for Erik to consider prior to merging? One thing that I could potentially see is to create a new test (rather than append to an existing one), which would be more explicitly named to test the sample parameter (e.g., |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice job @axiomcura! I added some comments with this review and would be interested in getting your response. Please don't hesitate to let me know if you have any questions.
Generally I also wondered:
- Have we covered the change in functionality with new tests? I noticed there was a test for
feature_select
but perhaps not the other functions which were modified. Consider adding tests for these as well to ensure the sampling edge cases are covered (to avoid surprises later).
@@ -175,7 +175,7 @@ def drop_outlier_features( | |||
|
|||
# Subset dataframe | |||
if samples != "all": | |||
population_df.query(samples, inplace=True) | |||
population_df = population_df.query(samples) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider adding a comment here or replacing the one above this if-branch to better describe what's happening here and why. For example, the comment in correlation_threshold
provides more information (but may also need to be updated).
@@ -48,7 +48,7 @@ def correlation_threshold( | |||
|
|||
# Subset dataframe and calculate correlation matrix across subset features | |||
if samples != "all": | |||
population_df.query(samples, inplace=True) | |||
population_df = population_df.query(samples) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider also adding or enhancing the code comment above this if-branch to provide a description of what this is doing.
@@ -33,7 +33,7 @@ def get_na_columns(population_df, features="infer", samples="all", cutoff=0.05): | |||
""" | |||
|
|||
if samples != "all": | |||
population_df.query(samples, inplace=True) | |||
population_df = population_df.query(samples) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar comment here: consider adding more detail through a code comment which describes what's happening here and why.
@@ -43,7 +43,7 @@ def noise_removal( | |||
""" | |||
# Subset dataframe | |||
if samples != "all": | |||
population_df.query(samples, inplace=True) | |||
population_df = population_df.query(samples) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar comment here: consider adding more detail through a code comment which describes what's happening here and why.
@@ -51,7 +51,7 @@ def variance_threshold( | |||
|
|||
# Subset dataframe | |||
if samples != "all": | |||
population_df.query(samples, inplace=True) | |||
population_df = population_df.query(samples) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For each one of these modified population_df
lines consider adding in the named parameter within the call.
population_df = population_df.query(samples) | |
population_df = population_df.query(expr=samples) |
@@ -169,6 +169,58 @@ def test_feature_select_noise_removal(): | |||
pd.testing.assert_frame_equal(result4b, expected_result4b) | |||
pd.testing.assert_frame_equal(result5b, expected_result5b) | |||
|
|||
# testing samples query leveraging metadata |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment applies to the modifications here within tests. I agree with @gwaybio and would suggest that this becomes a new test here to help limit the lines of code found within this single test.
@@ -43,7 +43,7 @@ def noise_removal( | |||
""" | |||
# Subset dataframe | |||
if samples != "all": | |||
population_df.query(samples, inplace=True) | |||
population_df = population_df.query(samples) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seeing how these functions which implement a query
pattern made me wonder if there's an opportunity in future changes to abstract the functionality into a decorator or class which can help provide the capability without the repetition. I mention this as a thought towards continuous improvement and not an expectation within this PR for review.
Description
Thank you for your contribution to pycytominer!
Please succinctly summarize your proposed change.
What motivated you to make this change?
This pull request closes #494, which describes when using the
sample
parameter unexpectedly removes rows and modifies the original dataset. The changes aim to resolve this issue, ensuring that thesample
parameter behaves as intended without altering the original data and removing data.I have also created new tests to verify that the number of rows remains unchanged when applying the
samples
parameter, whether through individual operations and a combination of multiple operations.Please also link to any relevant issues that your code is associated with.
What is the nature of your change?
Checklist
Please ensure that all boxes are checked before indicating that a pull request is ready for review.