Fix Bug in `sample` Parameter in `feature_select()`: Prevent Unintended Row Removal and Dataset Modification #495

axiomcura · 2025-01-21T19:24:12Z

Description

Thank you for your contribution to pycytominer!
Please succinctly summarize your proposed change.
What motivated you to make this change?

This pull request closes #494, which describes when using the sample parameter unexpectedly removes rows and modifies the original dataset. The changes aim to resolve this issue, ensuring that the sample parameter behaves as intended without altering the original data and removing data.

I have also created new tests to verify that the number of rows remains unchanged when applying the samples parameter, whether through individual operations and a combination of multiple operations.

Please also link to any relevant issues that your code is associated with.

What is the nature of your change?

Bug fix (fixes an issue).
Enhancement (adds functionality).
Breaking change (fix or feature that would cause existing functionality to not work as expected).
This change requires a documentation update.

Checklist

Please ensure that all boxes are checked before indicating that a pull request is ready for review.

I have read the CONTRIBUTING.md guidelines.
My code follows the style guidelines of this project.
I have performed a self-review of my own code.
I have commented my code, particularly in hard-to-understand areas.
I have made corresponding changes to the documentation.
My changes generate no new warnings.
New and existing unit tests pass locally with my changes.
I have added tests that prove my fix is effective or that my feature works.
I have deleted all non-relevant text in this pull request template.

codecov-commenter · 2025-01-21T19:29:08Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.41%. Comparing base (5743c62) to head (f205458).

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #495      +/-   ##
==========================================
+ Coverage   94.39%   94.41%   +0.02%     
==========================================
  Files          57       57              
  Lines        3159     3171      +12     
==========================================
+ Hits         2982     2994      +12     
  Misses        177      177

Flag	Coverage Δ
unittests	`94.41% <100.00%> (+0.02%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

gwaybio · 2025-01-21T23:06:40Z

This looks functionally good to me @axiomcura - kudos and thanks! @d33bs could you give a quick, but close look to see if there are any technical things for Erik to consider prior to merging? One thing that I could potentially see is to create a new test (rather than append to an existing one), which would be more explicitly named to test the sample parameter (e.g., test_feature_select_samples_paramater()), but I'm not sure how hard or important this would be.

d33bs

Nice job @axiomcura! I added some comments with this review and would be interested in getting your response. Please don't hesitate to let me know if you have any questions.

Generally I also wondered:

Have we covered the change in functionality with new tests? I noticed there was a test for feature_select but perhaps not the other functions which were modified. Consider adding tests for these as well to ensure the sampling edge cases are covered (to avoid surprises later).

d33bs · 2025-01-22T13:31:51Z

pycytominer/cyto_utils/features.py

@@ -175,7 +175,7 @@ def drop_outlier_features(

    # Subset dataframe
    if samples != "all":
-        population_df.query(samples, inplace=True)
+        population_df = population_df.query(samples)


Consider adding a comment here or replacing the one above this if-branch to better describe what's happening here and why. For example, the comment in correlation_threshold provides more information (but may also need to be updated).

d33bs · 2025-01-22T13:33:07Z

pycytominer/operations/correlation_threshold.py

@@ -48,7 +48,7 @@ def correlation_threshold(

    # Subset dataframe and calculate correlation matrix across subset features
    if samples != "all":
-        population_df.query(samples, inplace=True)
+        population_df = population_df.query(samples)


Consider also adding or enhancing the code comment above this if-branch to provide a description of what this is doing.

d33bs · 2025-01-22T13:33:31Z

pycytominer/operations/get_na_columns.py

@@ -33,7 +33,7 @@ def get_na_columns(population_df, features="infer", samples="all", cutoff=0.05):
    """

    if samples != "all":
-        population_df.query(samples, inplace=True)
+        population_df = population_df.query(samples)


Similar comment here: consider adding more detail through a code comment which describes what's happening here and why.

d33bs · 2025-01-22T13:33:46Z

pycytominer/operations/noise_removal.py

@@ -43,7 +43,7 @@ def noise_removal(
    """
    # Subset dataframe
    if samples != "all":
-        population_df.query(samples, inplace=True)
+        population_df = population_df.query(samples)


Similar comment here: consider adding more detail through a code comment which describes what's happening here and why.

d33bs · 2025-01-22T13:35:06Z

pycytominer/operations/variance_threshold.py

@@ -51,7 +51,7 @@ def variance_threshold(

    # Subset dataframe
    if samples != "all":
-        population_df.query(samples, inplace=True)
+        population_df = population_df.query(samples)


For each one of these modified population_df lines consider adding in the named parameter within the call.

Suggested change

population_df = population_df.query(samples)

population_df = population_df.query(expr=samples)

d33bs · 2025-01-22T13:38:53Z

tests/test_feature_select.py

@@ -169,6 +169,58 @@ def test_feature_select_noise_removal():
    pd.testing.assert_frame_equal(result4b, expected_result4b)
    pd.testing.assert_frame_equal(result5b, expected_result5b)

+    # testing samples query leveraging metadata


Comment applies to the modifications here within tests. I agree with @gwaybio and would suggest that this becomes a new test here to help limit the lines of code found within this single test.

d33bs · 2025-01-22T13:45:27Z

pycytominer/operations/noise_removal.py

@@ -43,7 +43,7 @@ def noise_removal(
    """
    # Subset dataframe
    if samples != "all":
-        population_df.query(samples, inplace=True)
+        population_df = population_df.query(samples)


Seeing how these functions which implement a query pattern made me wonder if there's an opportunity in future changes to abstract the functionality into a decorator or class which can help provide the capability without the repetition. I mention this as a thought towards continuous improvement and not an expectation within this PR for review.

axiomcura added 3 commits January 21, 2025 12:16

added new test to test 'samples' parameter

7bc24b3

updated gitignore

c4b421b

updated samples query step in all operations

f205458

axiomcura changed the title ~~Fix Bug in sample Parameter: Prevent Unintended Row Removal and Dataset Modification~~ Fix Bug in sample Parameter in feature_select(): Prevent Unintended Row Removal and Dataset Modification Jan 21, 2025

d33bs reviewed Jan 22, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix Bug in `sample` Parameter in `feature_select()`: Prevent Unintended Row Removal and Dataset Modification #495

Fix Bug in `sample` Parameter in `feature_select()`: Prevent Unintended Row Removal and Dataset Modification #495

axiomcura commented Jan 21, 2025

codecov-commenter commented Jan 21, 2025

gwaybio commented Jan 21, 2025

d33bs left a comment

d33bs Jan 22, 2025

d33bs Jan 22, 2025

d33bs Jan 22, 2025

d33bs Jan 22, 2025

d33bs Jan 22, 2025

d33bs Jan 22, 2025

d33bs Jan 22, 2025

	population_df = population_df.query(samples)
	population_df = population_df.query(expr=samples)

Fix Bug in sample Parameter in feature_select(): Prevent Unintended Row Removal and Dataset Modification #495

Are you sure you want to change the base?

Fix Bug in sample Parameter in feature_select(): Prevent Unintended Row Removal and Dataset Modification #495

Conversation

axiomcura commented Jan 21, 2025

Description

What is the nature of your change?

Checklist

codecov-commenter commented Jan 21, 2025

Codecov Report

gwaybio commented Jan 21, 2025

d33bs left a comment

Choose a reason for hiding this comment

d33bs Jan 22, 2025

Choose a reason for hiding this comment

d33bs Jan 22, 2025

Choose a reason for hiding this comment

d33bs Jan 22, 2025

Choose a reason for hiding this comment

d33bs Jan 22, 2025

Choose a reason for hiding this comment

d33bs Jan 22, 2025

Choose a reason for hiding this comment

d33bs Jan 22, 2025

Choose a reason for hiding this comment

d33bs Jan 22, 2025

Choose a reason for hiding this comment

Fix Bug in `sample` Parameter in `feature_select()`: Prevent Unintended Row Removal and Dataset Modification #495

Fix Bug in `sample` Parameter in `feature_select()`: Prevent Unintended Row Removal and Dataset Modification #495