Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Bug in sample Parameter in feature_select(): Prevent Unintended Row Removal and Dataset Modification #495

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

axiomcura
Copy link
Member

Description

Thank you for your contribution to pycytominer!
Please succinctly summarize your proposed change.
What motivated you to make this change?

This pull request closes #494, which describes when using the sample parameter unexpectedly removes rows and modifies the original dataset. The changes aim to resolve this issue, ensuring that the sample parameter behaves as intended without altering the original data and removing data.

I have also created new tests to verify that the number of rows remains unchanged when applying the samples parameter, whether through individual operations and a combination of multiple operations.

Please also link to any relevant issues that your code is associated with.

What is the nature of your change?

  • Bug fix (fixes an issue).
  • Enhancement (adds functionality).
  • Breaking change (fix or feature that would cause existing functionality to not work as expected).
  • This change requires a documentation update.

Checklist

Please ensure that all boxes are checked before indicating that a pull request is ready for review.

  • I have read the CONTRIBUTING.md guidelines.
  • My code follows the style guidelines of this project.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation.
  • My changes generate no new warnings.
  • New and existing unit tests pass locally with my changes.
  • I have added tests that prove my fix is effective or that my feature works.
  • I have deleted all non-relevant text in this pull request template.

@codecov-commenter
Copy link

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.41%. Comparing base (5743c62) to head (f205458).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #495      +/-   ##
==========================================
+ Coverage   94.39%   94.41%   +0.02%     
==========================================
  Files          57       57              
  Lines        3159     3171      +12     
==========================================
+ Hits         2982     2994      +12     
  Misses        177      177              
Flag Coverage Δ
unittests 94.41% <100.00%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@axiomcura axiomcura changed the title Fix Bug in sample Parameter: Prevent Unintended Row Removal and Dataset Modification Fix Bug in sample Parameter in feature_select(): Prevent Unintended Row Removal and Dataset Modification Jan 21, 2025
@gwaybio
Copy link
Member

gwaybio commented Jan 21, 2025

This looks functionally good to me @axiomcura - kudos and thanks! @d33bs could you give a quick, but close look to see if there are any technical things for Erik to consider prior to merging? One thing that I could potentially see is to create a new test (rather than append to an existing one), which would be more explicitly named to test the sample parameter (e.g., test_feature_select_samples_paramater()), but I'm not sure how hard or important this would be.

Copy link
Member

@d33bs d33bs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice job @axiomcura! I added some comments with this review and would be interested in getting your response. Please don't hesitate to let me know if you have any questions.

Generally I also wondered:

  • Have we covered the change in functionality with new tests? I noticed there was a test for feature_select but perhaps not the other functions which were modified. Consider adding tests for these as well to ensure the sampling edge cases are covered (to avoid surprises later).

@@ -175,7 +175,7 @@ def drop_outlier_features(

# Subset dataframe
if samples != "all":
population_df.query(samples, inplace=True)
population_df = population_df.query(samples)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding a comment here or replacing the one above this if-branch to better describe what's happening here and why. For example, the comment in correlation_threshold provides more information (but may also need to be updated).

@@ -48,7 +48,7 @@ def correlation_threshold(

# Subset dataframe and calculate correlation matrix across subset features
if samples != "all":
population_df.query(samples, inplace=True)
population_df = population_df.query(samples)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider also adding or enhancing the code comment above this if-branch to provide a description of what this is doing.

@@ -33,7 +33,7 @@ def get_na_columns(population_df, features="infer", samples="all", cutoff=0.05):
"""

if samples != "all":
population_df.query(samples, inplace=True)
population_df = population_df.query(samples)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar comment here: consider adding more detail through a code comment which describes what's happening here and why.

@@ -43,7 +43,7 @@ def noise_removal(
"""
# Subset dataframe
if samples != "all":
population_df.query(samples, inplace=True)
population_df = population_df.query(samples)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar comment here: consider adding more detail through a code comment which describes what's happening here and why.

@@ -51,7 +51,7 @@ def variance_threshold(

# Subset dataframe
if samples != "all":
population_df.query(samples, inplace=True)
population_df = population_df.query(samples)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For each one of these modified population_df lines consider adding in the named parameter within the call.

Suggested change
population_df = population_df.query(samples)
population_df = population_df.query(expr=samples)

@@ -169,6 +169,58 @@ def test_feature_select_noise_removal():
pd.testing.assert_frame_equal(result4b, expected_result4b)
pd.testing.assert_frame_equal(result5b, expected_result5b)

# testing samples query leveraging metadata
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment applies to the modifications here within tests. I agree with @gwaybio and would suggest that this becomes a new test here to help limit the lines of code found within this single test.

@@ -43,7 +43,7 @@ def noise_removal(
"""
# Subset dataframe
if samples != "all":
population_df.query(samples, inplace=True)
population_df = population_df.query(samples)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seeing how these functions which implement a query pattern made me wonder if there's an opportunity in future changes to abstract the functionality into a decorator or class which can help provide the capability without the repetition. I mention this as a thought towards continuous improvement and not an expectation within this PR for review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug: samples parameter in feature_select function reduces the rows in output
4 participants