Bug: `samples` parameter in `feature_select` function reduces the rows in output #494

jenna-tomkinson · 2025-01-14T23:08:53Z

Example code with output

Example code:

    # Only for Plate 6, we want to normalize to iNFixion institution and Null and WT cells 
    # to keep consistent with how the other plates are normalized (same cell line)
    if plate == "Plate_6":
        samples = "Metadata_Institution == 'iNFixion' and (Metadata_genotype == 'Null' or Metadata_genotype == 'WT')"

    print(f"Performing normalization for {plate} using samples parameter: {samples}")

    # Step 2: Normalization
    normalized_df = normalize(
        profiles=output_annotated_file,
        method="standardize",
        samples=samples,
    )
    print("Normalized dataframe shape", normalized_df.shape)

    output(
        df=normalized_df,
        output_filename=output_normalized_file,
        output_type="parquet",
    )

    # Step 3: Feature selection
    feature_select_df = feature_select(
        output_normalized_file,
        operation=feature_select_ops,
        na_cutoff=0,
        samples=samples,
    )

    print("Feature selected dataframe shape", feature_select_df.shape)

    output(
        df=feature_select_df,
        output_filename=output_feature_select_file,
        output_type="parquet",
    )

Output:

Normalized dataframe shape (7383, 2319)
Feature selected dataframe shape (1674, 1148)

Issue description

When using the `samples` parameter in the `feature_select` function, we noticed (with the help of @gwaybio) that the number of rows dropped significantly, from 7383 to 1674 in the output above. This does not happen when the `samples` parameter is omitted, in which case it defaults to "all".

Through talking with @axiomcura, it was determined that the issue was related to this part of the code:

    if samples != "all":
        population_df.query(samples, inplace=True)

Calling `query` with `inplace=True` filters `population_df` itself, so every row that does not match `samples` is removed from the dataframe. Those rows are then missing from the output, which is incorrect.
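A minimal pandas sketch of that behavior, using a made-up four-row dataframe (the genotype values and `feature_a` are illustrative only, not taken from the real plates):

```python
import pandas as pd

# Toy stand-in for the profiles; values are invented for illustration.
population_df = pd.DataFrame(
    {
        "Metadata_genotype": ["WT", "Null", "HET", "HET"],
        "feature_a": [0.1, 0.2, 0.3, 0.4],
    }
)
samples = "Metadata_genotype == 'WT' or Metadata_genotype == 'Null'"

rows_before = len(population_df)

# Same pattern as the snippet quoted above: the filter is applied in place,
# so the non-matching rows are dropped from population_df itself.
if samples != "all":
    population_df.query(samples, inplace=True)

rows_after = len(population_df)
print(rows_before, rows_after)
```

Only the two rows matching the query remain, which matches the row loss seen in the `feature_select` output.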

NOTE: The `normalize` function also has a `samples` parameter, but it does not have this issue.

Expected behavior

When the `samples` parameter is used, the `feature_select` function should return a dataframe with the same number of rows as the input. The specified samples should only be used to determine which columns to keep.
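One way to picture the expected behavior is the sketch below. It is not pycytominer's implementation: `drop_na_columns_using_subset` is a hypothetical helper, the `Metadata_` prefix check and the "drop when the NaN fraction exceeds the cutoff" rule are assumptions, and the data is invented.

```python
import numpy as np
import pandas as pd


def drop_na_columns_using_subset(df, samples="all", na_cutoff=0.0):
    """Hypothetical helper: choose columns from a subset, keep all rows."""
    # Non-inplace query returns a new frame, so df is left untouched.
    subset = df if samples == "all" else df.query(samples)
    # Assumption: anything not prefixed with "Metadata_" is a feature.
    feature_cols = [c for c in df.columns if not c.startswith("Metadata_")]
    na_fraction = subset[feature_cols].isna().mean()
    to_drop = na_fraction[na_fraction > na_cutoff].index.tolist()
    # Drop the selected columns from the FULL dataframe, not the subset.
    return df.drop(columns=to_drop)


df = pd.DataFrame(
    {
        "Metadata_Row": ["A", "A", "B", "B"],
        "feature_x": [1.0, 2.0, np.nan, 4.0],  # NaN only outside row "A"
        "feature_y": [np.nan, 2.0, 3.0, 4.0],  # NaN inside row "A"
    }
)

result = drop_na_columns_using_subset(df, samples="Metadata_Row == 'A'")
print(result.shape, list(result.columns))
```

Here `feature_y` is dropped because it has a NaN among the selected samples, `feature_x` is kept because its NaN lies outside them, and all four rows are returned.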

Additional information

Pycytominer v1.1.0
Linux operating system (Pop!_OS)
Running in a Jupyter notebook


axiomcura · 2025-01-15T03:32:23Z

Hello @jenna-tomkinson

After conducting preliminary tests, it appears that the original dataset is being overwritten, which confirms our initial intuition about the code behavior. Details are below.

To diagnose the issue, a dummy dataset named df_ddata was created to simulate morphological data. This dataset was used to establish an expected output. The feature selection method drop_na_columns was chosen for testing, with a cutoff of 0. This specific operation was selected because it is highly predictable, making it easier to determine which columns will be dropped. The drop_na_columns method also utilizes the samples parameter, which plays a central role in the issue under investigation.

The expected output focused on the number of rows remaining after the feature selection process. This aspect is critical because the issue pertains to rows being erroneously deleted when the samples parameter is used.

Cell 4 sets up the test and calls the pycytominer.feature_select() function, specifying the samples parameter to limit the feature selection to wells labeled “A” from the generated dataset. The results showed that both rows and columns were being removed, which is unexpected and incorrect. These observations indicate a fundamental issue in how the function handles the samples parameter.

The output shows that rows are removed along with columns, which should not be the case.

In Cell 5 we can see that the original dataset was also being overwritten. When the samples parameter was not set to “all,” the dataset was reduced to include only the data points matching the query.
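The overwrite of the caller's dataset can be checked with a snapshot comparison. The sketch below is only an illustration: `select_mutating` mirrors the in-place pattern quoted earlier in this issue, `select_non_mutating` is a hypothetical alternative, and the small dataframe is made up.

```python
import pandas as pd


def select_mutating(population_df, samples):
    # In-place filter: the caller's dataframe object is modified.
    if samples != "all":
        population_df.query(samples, inplace=True)
    return population_df


def select_non_mutating(population_df, samples):
    # Non-inplace filter: query() returns a new dataframe.
    if samples == "all":
        return population_df
    return population_df.query(samples)


original = pd.DataFrame(
    {"Metadata_Well": ["A", "A", "B"], "feature_a": [1.0, 2.0, 3.0]}
)
snapshot = original.copy()  # independent copy to compare against

select_non_mutating(original, "Metadata_Well == 'A'")
unchanged_after_safe_call = original.equals(snapshot)

select_mutating(original, "Metadata_Well == 'A'")
unchanged_after_mutating_call = original.equals(snapshot)

print(unchanged_after_safe_call, unchanged_after_mutating_call)
```

A regression test along these lines would fail for the in-place version, since the caller's dataframe shrinks to the rows matching the query.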

jenna-tomkinson added the `bug` label (Something isn't working) on Jan 14, 2025

jenna-tomkinson assigned axiomcura on Jan 14, 2025

axiomcura linked a pull request on Jan 21, 2025 that will close this issue:

Fix Bug in sample Parameter in feature_select(): Prevent Unintended Row Removal and Dataset Modification #495 (Open, 13 tasks)
