Bug: samples parameter in feature_select function reduces the rows in output #494

Open
jenna-tomkinson opened this issue Jan 14, 2025 · 1 comment · May be fixed by #495

@jenna-tomkinson

Example code with output

Example code:

    # Only for Plate 6, we want to normalize to iNFixion institution and Null and WT cells 
    # to keep consistent with how the other plates are normalized (same cell line)
    if plate == "Plate_6":
        samples = "Metadata_Institution == 'iNFixion' and (Metadata_genotype == 'Null' or Metadata_genotype == 'WT')"

    print(f"Performing normalization for {plate} using samples parameter: {samples}")

    # Step 2: Normalization
    normalized_df = normalize(
        profiles=output_annotated_file,
        method="standardize",
        samples=samples,
    )
    print("Normalized dataframe shape", normalized_df.shape)

    output(
        df=normalized_df,
        output_filename=output_normalized_file,
        output_type="parquet",
    )

    # Step 3: Feature selection
    feature_select_df = feature_select(
        output_normalized_file,
        operation=feature_select_ops,
        na_cutoff=0,
        samples=samples,
    )

    print("Feature selected dataframe shape", feature_select_df.shape)

    output(
        df=feature_select_df,
        output_filename=output_feature_select_file,
        output_type="parquet",
    )

Output:

Normalized dataframe shape (7383, 2319)
Feature selected dataframe shape (1674, 1148)

Issue description

With the help of @gwaybio, we noticed that passing the samples parameter to the feature_select function significantly reduces the number of rows in the output (from 7383 to 1674 in the example above). This does not happen when the samples parameter is omitted, in which case it defaults to "all".

After talking with @axiomcura, we traced the issue to this part of the code:

    if samples != "all":
        population_df.query(samples, inplace=True)

Calling query with inplace=True drops every row that does not match the samples expression from the dataframe itself, so those samples/rows are missing from the returned output, which is incorrect.
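
The difference between the two forms of query can be seen in a small pandas-only sketch. The dataframe below is made up for illustration (its column names and values are assumptions, not data from this issue):

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
population_df = pd.DataFrame(
    {
        "Metadata_genotype": ["Null", "WT", "HET", "HET"],
        "feature_a": [0.1, 0.2, 0.3, 0.4],
    }
)
samples = "Metadata_genotype == 'Null' or Metadata_genotype == 'WT'"

# Without inplace=True, query returns a filtered copy and leaves the original alone.
subset_df = population_df.query(samples)
rows_after_copy_query = population_df.shape[0]

# With inplace=True, the non-matching rows are dropped from population_df itself.
population_df.query(samples, inplace=True)
rows_after_inplace_query = population_df.shape[0]
```

After the first call population_df still has all four rows; after the second call only the two matching rows remain.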

NOTE: The normalize function also has a samples parameter, but it does not have this issue.

Expected behavior

When the samples parameter is used, the feature_select function should return a dataframe with the same number of rows as the input. The specified samples should only determine which columns are kept.
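
A rough sketch of that behavior in plain pandas is shown here. The helper select_columns_using_samples is hypothetical and is not pycytominer's implementation; it uses "drop any column with a missing value" as a stand-in for a feature selection operation, and the toy data is made up:

```python
import numpy as np
import pandas as pd


def select_columns_using_samples(df, samples="all"):
    """Hypothetical helper: choose columns from a subset, keep every row."""
    # Decide which columns to keep using only the requested samples.
    subset = df if samples == "all" else df.query(samples)
    # Stand-in selection rule: drop any column with a missing value in the subset.
    keep = [col for col in df.columns if not subset[col].isna().any()]
    # Apply the column decision to the full dataframe, so no rows are lost.
    return df.loc[:, keep]


# Hypothetical toy data; column names and values are illustrative only.
toy_df = pd.DataFrame(
    {
        "Metadata_Well": ["A", "A", "B", "B"],
        "feature_x": [1.0, 2.0, np.nan, 4.0],
        "feature_y": [np.nan, 1.0, 2.0, 3.0],
    }
)

selected = select_columns_using_samples(toy_df, samples="Metadata_Well == 'A'")
```

Here feature_y is dropped because it has a missing value among the well "A" rows, while all four rows of the input are still present in the result.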

Additional information

  • Pycytominer v1.1.0
  • Linux operating system; Pop!_OS
  • Running in a Jupyter notebook
jenna-tomkinson added the bug label Jan 14, 2025
@axiomcura

Hello @jenna-tomkinson

After conducting preliminary tests, it appears that the original dataset is being overwritten, which confirms our initial intuition about the code's behavior. See below:

image

To diagnose the issue, a dummy dataset named df_ddata was created to simulate morphological data. This dataset was used to establish an expected output. The feature selection method drop_na_columns was chosen for testing, with a cutoff of 0. This specific operation was selected because it is highly predictable, making it easier to determine which columns will be dropped. The drop_na_columns method also utilizes the samples parameter, which plays a central role in the issue under investigation.

The expected output focused on the number of rows remaining after the feature selection process. This aspect is critical because the issue pertains to rows being erroneously deleted when the samples parameter is used.

image
Cell number 4 sets up the test and uses the pycytominer.feature_select() function, specifying the samples parameter to limit the feature selection to wells labeled “A” from the generated dataset. The results showed that rows and columns were being removed, which is unexpected and incorrect. These observations indicate a fundamental issue in how the function handles the samples parameter.

We can see that the output is missing both rows and columns, which should not be the case.

In Cell 5 we can see that the original dataset was also being overwritten. When the samples parameter was not set to “all,” the dataset was reduced to include only the data points matching the query.
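
The overwriting can be illustrated with a pandas-only sketch. Both functions below are hypothetical stand-ins written for this comment, not pycytominer code: filter_in_place mirrors the inplace query pattern quoted in the issue, and filter_on_copy shows one possible way to leave the caller's dataframe untouched.

```python
import pandas as pd


def filter_in_place(population_df, samples):
    # Mirrors the pattern quoted in the issue: filters the caller's object directly.
    if samples != "all":
        population_df.query(samples, inplace=True)
    return population_df


def filter_on_copy(population_df, samples):
    # Works on a copy, so the caller's dataframe keeps all of its rows.
    working_df = population_df.copy()
    if samples != "all":
        working_df.query(samples, inplace=True)
    return working_df


# Hypothetical toy data; column names and values are illustrative only.
original = pd.DataFrame(
    {"Metadata_Well": ["A", "B", "C"], "feature_x": [1.0, 2.0, 3.0]}
)

filter_on_copy(original, "Metadata_Well == 'A'")
rows_after_copy = len(original)

filter_in_place(original, "Metadata_Well == 'A'")
rows_after_in_place = len(original)
```

The caller's dataframe keeps its three rows after filter_on_copy, but is reduced to the single matching row after filter_in_place.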
