Add mzIdentML plugin #101

Closed
wants to merge 1 commit into from

Conversation

@yueqixuan (Contributor) commented Oct 22, 2024

User description

Feature added: pmultiqc can now take identification results in mzIdentML format together with spectra in MGF (or mzML) format and generate a report similar to the one produced for quantms.


PR Type

enhancement, documentation


Description

  • Added a new command-line option --mzid_plugin to enable mzIdentML data extraction and reporting.
  • Implemented support for processing mzIdentML and MGF files, enhancing the data analysis capabilities.
  • Updated the quantms module to integrate mzIdentML data processing and visualization.
  • Enhanced the reporting features to include mzid-specific statistics and plots.
  • Updated the README to document the new --mzid_plugin option and its usage.

Changes walkthrough 📝

Relevant files

Enhancement

cli.py: Add `--mzid_plugin` option for mzIdentML extraction
pmultiqc/cli.py (+3/-4)

  • Added a new command-line option --mzid_plugin.
  • The option enables extraction of mzIdentML data.

main.py: Implement mzIdentML and MGF file support and reporting
pmultiqc/main.py (+6/-0)

  • Added support for mzIdentML and MGF file processing.
  • Implemented new functions for mzid data parsing and heatmap score calculation.
  • Updated report generation to include mzid-specific statistics.
  • Enhanced handling of peptide and protein identification data.

quantms.py: Integrate mzIdentML processing and visualization enhancements
pmultiqc/modules/quantms/quantms.py (+584/-152)

  • Integrated mzIdentML data processing into the quantms module.
  • Added functions for parsing mzid and mgf files.
  • Enhanced data visualization with mzid-specific plots.
  • Improved peptide and protein identification workflows.

Configuration changes

setup.py: Register `mzid_plugin` CLI option
setup.py (+2/-1)

  • Registered mzid_plugin as a new CLI option.

Documentation

README.md: Document `--mzid_plugin` option in README
README.md (+1/-0)

  • Documented the new --mzid_plugin option.
  • Updated usage examples to include mzid_plugin.


    Summary by CodeRabbit

    • New Features

      • Introduced a new command-line parameter --mzid_plugin for generating reports based on mzid and mzML/mgf files.
      • Enhanced the QuantMSModule to support additional data formats and introduced new methods for improved data visualization.
    • Documentation

      • Updated the README.md to reflect new command-line options and provide guidance on potential limitations with DIA-NN outputs.
    • Bug Fixes

      • Adjusted internal logic to ensure proper handling of new file types in the plugin.
    • Chores

      • Minor formatting adjustments made to command-line options for consistency.


    coderabbitai bot commented Oct 22, 2024

    Walkthrough

    The changes introduce a new command-line parameter --mzid_plugin in the pmultiqc library, enhancing its capability to generate reports from mzid and mzML/mgf files. Documentation in the README.md has been updated to reflect this addition and inform users about potential limitations with DIA-NN outputs. Modifications in pmultiqc/cli.py include the addition of the mzid_plugin option, while pmultiqc/main.py and pmultiqc/modules/quantms/quantms.py have been updated to handle new file types and improve data processing logic. A new entry point for the plugin has also been added in setup.py.

    Changes

    File | Change Summary
    README.md | Added --mzid_plugin command-line parameter and note on DIA-NN output limitations.
    pmultiqc/cli.py | Introduced mzid_plugin option; reformatted affix_type and disable_plugin for consistency.
    pmultiqc/main.py | Added checks for quantms/mgf and quantms/mzid in config.sp to handle new file types.
    pmultiqc/modules/quantms/quantms.py | Major restructuring of QuantMSModule; added methods for handling mzid and mgf files.
    setup.py | Added entry point for mzid_plugin in multiqc.cli_options.v1.
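
For readers unfamiliar with how a MultiQC plugin exposes a command-line option, the setup.py change summarized above can be pictured roughly as follows. Only the group name multiqc.cli_options.v1 and the option name mzid_plugin come from this PR; the target pmultiqc.cli:mzid_plugin and the helper split_entry_point are assumptions made for illustration, so treat this as a sketch rather than the actual file contents.

```python
# Sketch of an entry_points mapping as it might appear in setup.py.
# ASSUMPTION: the target "pmultiqc.cli:mzid_plugin" is a guess at where the
# option object lives; check the real setup.py in the PR before relying on it.
ENTRY_POINTS = {
    "multiqc.cli_options.v1": [
        "mzid_plugin = pmultiqc.cli:mzid_plugin",
    ],
}


def split_entry_point(spec):
    """Split 'name = module:attr' into its three parts (hypothetical helper)."""
    name, _, target = spec.partition("=")
    module, _, attr = target.strip().partition(":")
    return name.strip(), module, attr


for spec in ENTRY_POINTS["multiqc.cli_options.v1"]:
    print(split_entry_point(spec))
```

The point of the sketch is only that the option must be registered by name in the entry-point group; if the name on the left and the object on the right drift apart, the flag would presumably not show up in the CLI.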

    Sequence Diagram(s)

    sequenceDiagram
        participant User
        participant CLI
        participant Main
        participant QuantMSModule
    
        User->>CLI: run command with --mzid_plugin
        CLI->>Main: process command
        Main->>QuantMSModule: initialize with mzid and mgf handling
        QuantMSModule->>Main: return processed data
        Main->>CLI: output results
    

    🐇 "In the garden, we hop and play,
    New features bloom in a bright array.
    With mzid and mgf, we explore anew,
    Reports to craft, for me and you!
    So let’s celebrate with a joyful cheer,
    For the changes that bring us all near!" 🥕



    @qodo-merge-pro bot added the documentation, enhancement, and Review effort [1-5]: 4 labels on Oct 22, 2024

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
    🧪 No relevant tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review

    Code Complexity
    The mzid_CalHeatMapScore function contains complex logic for calculating various scores. This should be carefully reviewed for correctness and potential optimizations.

    Data Processing
    The parse_out_mgf and parse_out_mzid functions handle large amounts of data processing. These should be reviewed for efficiency and potential memory issues with large datasets.

    New CLI Option
    The new --mzid_plugin option has been added. Ensure it's properly integrated with the rest of the CLI and doesn't conflict with existing options.


    PR Code Suggestions ✨

    Explore these optional code suggestions:

    Category | Suggestion | Score
    Performance
    Use a generator-based approach for parsing MGF files to reduce memory usage

    Consider using a more efficient method for parsing MGF files, such as using a
    generator or streaming approach, to reduce memory usage for large files.

    pmultiqc/modules/quantms/quantms.py [2040-2043]

    -mgf_data = mgf.read(m)
    -log.info("{}: Done parsing MGF file {}...".format(datetime.now().strftime("%H:%M:%S"), m))
    +def mgf_generator(file_path):
    +    with mgf.read(file_path) as reader:
    +        for spectrum in reader:
    +            yield spectrum
    +
    +spectra = mgf_generator(m)  # build the generator from the file path, before m is shortened to a prefix
     m = self.file_prefix(m)
    -log.info("{}: Aggregating MGF file {}...".format(datetime.now().strftime("%H:%M:%S"), m))
    +log.info("{}: Parsing and aggregating MGF file {}...".format(datetime.now().strftime("%H:%M:%S"), m))
    +for spectrum in spectra:
    +    ...  # Process spectrum data here
    Suggestion importance[1-10]: 9

    Why: Implementing a generator for parsing MGF files can substantially reduce memory usage, which is crucial for handling large files efficiently, making this a highly impactful suggestion.

    9
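
To make the streaming idea concrete without depending on any third-party reader, here is a minimal standard-library sketch of a generator that yields one spectrum at a time from MGF text. It is not the parser used in pmultiqc: the function name iter_mgf_spectra, the returned dictionary layout, and the sample data are all assumptions, and only the basic BEGIN IONS / END IONS block structure is handled.

```python
import io


def iter_mgf_spectra(handle):
    """Yield one spectrum dict at a time from an open MGF text handle.

    Hypothetical helper: only one spectrum is held in memory at any moment.
    """
    params, peaks, inside = {}, [], False
    for raw in handle:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line == "BEGIN IONS":
            params, peaks, inside = {}, [], True
        elif line == "END IONS":
            if inside:
                yield {"params": params, "peaks": peaks}
            inside = False
        elif inside:
            if "=" in line and not line[0].isdigit():
                # Parameter line such as TITLE=... or PEPMASS=...
                key, _, value = line.partition("=")
                params[key.lower()] = value
            else:
                # Peak line: m/z, intensity, optionally more columns
                fields = line.split()
                if len(fields) >= 2:
                    peaks.append((float(fields[0]), float(fields[1])))


# Made-up sample data for illustration only.
SAMPLE = """BEGIN IONS
TITLE=scan=1
PEPMASS=500.25 1000.0
CHARGE=2+
100.1 10.0
200.2 20.0
END IONS
BEGIN IONS
TITLE=scan=2
PEPMASS=600.5
300.3 30.0
END IONS
"""

# Aggregate while streaming instead of materializing every spectrum first.
total_peaks = sum(len(s["peaks"]) for s in iter_mgf_spectra(io.StringIO(SAMPLE)))
print(total_peaks)
```

Because the aggregation consumes the generator directly, peak lists can be discarded as soon as their counts have been added, which is the memory saving the suggestion is after.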
    Optimize numerical computations using numpy arrays for improved performance

    Consider using a more efficient data structure, such as numpy arrays, for storing
    and manipulating large datasets like peak intensities and retention times. This can
    significantly improve performance for large datasets.

    pmultiqc/modules/quantms/quantms.py [1168-1172]

    -x = group['retention_time'] / np.sum(group['retention_time'])
    -n = len(group['retention_time'])
    -y = np.sum(x) / n
    -worst = ((1 - y) ** 0.5) * 1 / n + (y ** 0.5) * (n - 1) / n
    -sc = np.sum(np.abs(x - y) ** 0.5) / n
    +retention_times = group['retention_time'].values
    +x = retention_times / np.sum(retention_times)
    +n = len(retention_times)
    +y = np.mean(x)
    +worst = np.sqrt(1 - y) / n + np.sqrt(y) * (n - 1) / n
    +sc = np.mean(np.sqrt(np.abs(x - y)))
    Suggestion importance[1-10]: 8

    Why: Using numpy arrays for numerical computations can significantly improve performance, especially for large datasets, making this a valuable optimization.

    8
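
As a sanity check on the refactor above, the arithmetic can be written out in plain Python to confirm that the original and suggested expressions for worst agree. This is a sketch under assumptions: the function name rt_score_terms is hypothetical, the input is a plain list rather than a DataFrame column, and how pmultiqc combines sc and worst into a final heat map score is not shown here.

```python
import math


def rt_score_terms(retention_times):
    """Return (y, worst, sc) as computed in the snippet under review.

    Hypothetical helper using lists instead of pandas/numpy objects.
    """
    total = sum(retention_times)
    n = len(retention_times)
    x = [rt / total for rt in retention_times]
    y = sum(x) / n  # x sums to 1, so y is 1/n up to rounding
    # Original form of the expression
    worst_original = ((1 - y) ** 0.5) * 1 / n + (y ** 0.5) * (n - 1) / n
    # Suggested form of the expression
    worst_suggested = math.sqrt(1 - y) / n + math.sqrt(y) * (n - 1) / n
    assert math.isclose(worst_original, worst_suggested)
    sc = sum(math.sqrt(abs(v - y)) for v in x) / n
    return y, worst_suggested, sc


# Perfectly uniform retention times give zero deviation from the mean.
print(rt_score_terms([10.0, 10.0, 10.0, 10.0]))
```

One thing the sketch makes visible is that y depends only on the number of values, since x is normalized to sum to 1.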
    Best practice
    Use a more descriptive variable name to improve code clarity

    Use a more descriptive variable name for 'mt' to improve code readability. The
    current name doesn't clearly indicate its purpose or content.

    pmultiqc/modules/quantms/quantms.py [132]

    -mt = self.parse_mzml()
    +mzml_data = self.parse_mzml()
    Suggestion importance[1-10]: 7

    Why: Renaming 'mt' to 'mzml_data' enhances code clarity by making the variable's purpose more explicit, which is beneficial for maintainability and readability.

    7
    Enhancement
    Consolidate multiple dictionaries into a single data structure for improved efficiency and readability

    Consider using a more efficient data structure for storing spectrum information.
    Instead of multiple separate dictionaries, use a single dictionary with named tuples
    or a custom class to store all related information for each spectrum. This can
    improve code readability and performance.

    pmultiqc/modules/quantms/quantms.py [86-91]

    -self.oversampling = dict()
    -self.identified_spectrum = dict()
    -self.delta_mass = dict()
    +from collections import namedtuple
    +SpectrumInfo = namedtuple('SpectrumInfo', ['oversampling', 'identified', 'delta_mass'])
    +self.spectrum_data = {}
     self.Total_ms2_Spectral_Identified = 0
     self.Total_Peptide_Count = 0
     self.total_ms2_spectra = 0
    Suggestion importance[1-10]: 5

    Why: The suggestion to use a named tuple for consolidating related data into a single structure can improve code readability and maintainability. However, the performance gain might be marginal given the current usage context.

    5
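
The suggestion defines SpectrumInfo but does not show how it would be used, so here is one possible reading. Everything beyond the three field names is an assumption: the per-file keying, the helper record_file, and the sample values are illustrative only and do not reflect how QuantMSModule stores its data.

```python
from collections import namedtuple

# Field names taken from the suggestion; their exact meaning is assumed.
SpectrumInfo = namedtuple("SpectrumInfo", ["oversampling", "identified", "delta_mass"])


def record_file(store, file_name, oversampling, identified, delta_mass):
    """Store all per-file values under one key (hypothetical helper)."""
    store[file_name] = SpectrumInfo(oversampling, identified, delta_mass)
    return store[file_name]


spectrum_data = {}
record_file(spectrum_data, "run_01", {"1": 120, "2": 8}, 950, {0.01: 4})

# Named tuples are immutable, so updates create a new tuple via _replace.
spectrum_data["run_01"] = spectrum_data["run_01"]._replace(identified=951)
print(spectrum_data["run_01"].identified)
```

The immutability is a trade-off worth weighing: if the values are updated repeatedly while parsing, a small mutable class may be more convenient than a named tuple.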


    @coderabbitai coderabbitai bot left a comment


    Actionable comments posted: 12

    🧹 Outside diff range and nitpick comments (4)
    pmultiqc/main.py (2)

    46-47: LGTM! Consider adding a 'shared' key for consistency.

    The addition of the 'quantms/mgf' configuration is correct and aligns with the PR objectives. It follows the existing pattern for file type configurations.

    For consistency with some other entries (e.g., 'quantms/exp_design'), consider adding a 'shared' key:

    -    config.update_dict(config.sp, {'quantms/mgf': {'fn': '*.mgf', 'num_lines': 0}})
    +    config.update_dict(config.sp, {'quantms/mgf': {'fn': '*.mgf', 'num_lines': 0, 'shared': False}})

    49-50: LGTM! Consider adding a 'shared' key for consistency.

    The addition of the 'quantms/mzid' configuration is correct and aligns with the PR objectives. It follows the existing pattern for file type configurations.

    For consistency with some other entries (e.g., 'quantms/exp_design'), consider adding a 'shared' key:

    -    config.update_dict(config.sp, {'quantms/mzid': {'fn': '*.mzid', 'num_lines': 0}})
    +    config.update_dict(config.sp, {'quantms/mzid': {'fn': '*.mzid', 'num_lines': 0, 'shared': False}})
    README.md (2)

    36-36: LGTM! Consider adding an example usage.

    The addition of the --mzid_plugin parameter is well-documented and aligns with the PR objectives. The description is clear and concise.

    To further improve the documentation, consider adding a brief example of how to use this parameter, similar to the examples provided for other parameters.


    Line range hint 140-140: Improve visibility and wording of the DIA-NN note.

    The note about DIA-NN output differences provides important information about potential limitations. However, its current placement and wording could be improved.

    Consider the following suggestions:

    1. Move this note to a more prominent location, such as just after the "Usage" section or create a new "Limitations" section.
    2. Rephrase the note for clarity and formality. For example:
      **Note:** Due to significant differences in DIA-NN output files, some metrics may be challenging to calculate when using this plugin with DIA-NN data.
    📜 Review details

    Configuration used: CodeRabbit UI
    Review profile: CHILL

    📥 Commits

    Files that changed from the base of the PR and between d2965ef and 1e5c921.

    📒 Files selected for processing (5)
    • README.md (1 hunks)
    • pmultiqc/cli.py (1 hunks)
    • pmultiqc/main.py (1 hunks)
    • pmultiqc/modules/quantms/quantms.py (14 hunks)
    • setup.py (1 hunks)
    🧰 Additional context used
    🪛 Ruff
    pmultiqc/modules/quantms/quantms.py

    152-152: Avoid equality comparisons to False; use if not self.enable_exp: for false checks

    Replace with not self.enable_exp

    (E712)


    162-162: Avoid equality comparisons to False; use if not self.enable_sdrf: for false checks

    Replace with not self.enable_sdrf

    (E712)


    162-162: Avoid equality comparisons to False; use if not self.enable_exp: for false checks

    Replace with not self.enable_exp

    (E712)


    1153-1155: Use a single if statement instead of nested if statements

    (SIM102)


    1165-1165: Use sc.get(0, 0) instead of an if block

    Replace with sc.get(0, 0)

    (SIM401)


    2152-2152: Use spectrum_item_part.get("rank") instead of spectrum_item_part.get("rank", None)

    Replace spectrum_item_part.get("rank", None) with spectrum_item_part.get("rank")

    (SIM910)


    2156-2156: Use key in dict instead of key in dict.keys()

    Remove .keys()

    (SIM118)


    2158-2158: Use key in dict instead of key in dict.keys()

    Remove .keys()

    (SIM118)


    2160-2160: Use key in dict instead of key in dict.keys()

    Remove .keys()

    (SIM118)


    2174-2174: Undefined name search_engines

    (F821)


    2194-2194: Remove unnecessary True if ... else False

    Remove unnecessary True if ... else False

    (SIM210)


    2205-2205: Avoid equality comparisons to False; use if not mzid_table["isDecoy"]: for false checks

    Replace with not mzid_table["isDecoy"]

    (E712)


    2234-2234: Loop control variable i not used within loop body

    (B007)


    2242-2242: Avoid equality comparisons to True; use if group["pep_to_prot_unique"]: for truth checks

    Replace with group["pep_to_prot_unique"]

    (E712)

    🔇 Additional comments (7)
    pmultiqc/cli.py (4)

    34-34: LGTM: Reformatting of affix_type option

    The reformatting of the affix_type option improves consistency with other options while maintaining its functionality.


    35-35: LGTM: Reformatting of disable_plugin option

    The reformatting of the disable_plugin option improves consistency with other options while maintaining its functionality as a flag.


    34-36: Summary of changes to command-line options

    The modifications to this file enhance the command-line interface of pmultiqc:

    1. The affix_type and disable_plugin options have been reformatted for consistency.
    2. A new mzid_plugin option has been added to support mzIdentML file extraction, aligning with the PR objectives.

    These changes improve the tool's functionality and maintain a consistent interface style. Ensure that the new mzid_plugin option is properly integrated into the main logic of the tool.


    36-36: Approve addition of mzid_plugin option with suggestions

    The addition of the mzid_plugin option aligns well with the PR objectives to support mzIdentML files. However, consider the following suggestions:

    1. The help text could be more descriptive. Consider updating it to: "Enable extraction and processing of mzIdentML files for report generation".

    2. Ensure that this option is properly handled in the main logic of the tool.

    To verify the proper handling of this new option, please run the following script:

    ✅ Verification successful

    Approve addition of mzid_plugin option with suggestions

    The addition of the mzid_plugin option aligns well with the PR objectives to support mzIdentML files. However, consider the following suggestion:

    1. Improve the help text for clarity. For example: "Enable extraction and processing of mzIdentML files for report generation."
    🏁 Scripts executed

    The following scripts were executed for the analysis:

    Script:

    #!/bin/bash
    # Description: Verify the handling of the new mzid_plugin option in the main logic
    
    # Test: Search for usage of mzid_plugin in the main logic files
    echo "Searching for mzid_plugin usage in main.py:"
    rg "mzid_plugin" pmultiqc/main.py
    
    printf '\nSearching for mzid_plugin usage in quantms.py:\n'
    rg "mzid_plugin" pmultiqc/modules/quantms/quantms.py
    
    # Test: Check if there's a corresponding entry point in setup.py
    printf '\nChecking for mzid_plugin entry point in setup.py:\n'
    rg "mzid_plugin" setup.py
    

    Length of output: 911

    pmultiqc/main.py (1)

    46-50: Summary: Successful implementation of mgf and mzid support

    The changes in this file successfully add support for mgf and mzIdentML file formats, aligning with the PR objectives. The new configurations are well-integrated into the existing pmultiqc_plugin_execution_start() function and follow established patterns.

    These additions enhance the plugin's capability to handle diverse input formats, particularly benefiting users working with mgf and mzIdentML files in quantitative proteomics workflows.

    setup.py (1)

    55-56: LGTM! Verify the referenced function exists.

    The addition of the 'mzid_plugin' entry point is consistent with the PR objectives and follows the established pattern for CLI options. This change enables the integration of the new mzIdentML plugin feature.

    To ensure the referenced function exists, run the following script:

    README.md (1)

    Line range hint 1-142: Overall, the README updates are appropriate and informative.

    The changes to the README.md file accurately reflect the new --mzid_plugin feature and provide important information about potential limitations with DIA-NN outputs. The documentation is clear and consistent with the existing content. The suggested minor improvements will further enhance the usability of the documentation.

    mzid_tmp_part = {k: v for k, v in mzid_tmp.items() if k not in ["SpectrumIdentificationItem"]}
    for spectrum_item in mzid_tmp.get("SpectrumIdentificationItem", []):
        spectrum_item_part = {k: v for k, v in spectrum_item.items() if k not in ["PeptideEvidenceRef", "PeptideSequence"]}
        if spectrum_item_part.get("rank", None) == 1 and \

    🛠️ Refactor suggestion

    Remove redundant default value in get method

    When using the dict.get method, the default value None is already implied, so specifying it is unnecessary.

    Apply this diff to simplify the code:

    -                        if spectrum_item_part.get("rank", None) == 1 and \
    +                        if spectrum_item_part.get("rank") == 1 and \
    📝 Committable suggestion

    ‼️ IMPORTANT
    Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    Suggested change
    if spectrum_item_part.get("rank", None) == 1 and \
    if spectrum_item_part.get("rank") == 1 and \
    🧰 Tools
    🪛 Ruff

    2152-2152: Use spectrum_item_part.get("rank") instead of spectrum_item_part.get("rank", None)

    Replace spectrum_item_part.get("rank", None) with spectrum_item_part.get("rank")

    (SIM910)
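
The equivalence behind SIM910 can be checked in a few lines. The helper name is_top_ranked and the sample dictionaries are hypothetical; only the behavior of dict.get is relied on.

```python
def is_top_ranked(item):
    """Return True when the item carries rank 1; a missing rank yields False."""
    # dict.get returns None when the key is absent, so no explicit default is needed.
    return item.get("rank") == 1


with_rank = {"rank": 1}
without_rank = {"passThreshold": True}

print(is_top_ranked(with_rank), is_top_ranked(without_rank))
```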

    @@ -1634,8 +1811,8 @@ def parse_out_mzTab(self):
             # draw PSMs table for spectral counting
             if config.kwargs['quantification_method'] == "spectral_counting" and not config.kwargs.get('disable_table',
                                                                                                         True):
    -            mztab_data_psm_full = psm[['sequence', 'accession', 'search_engine_score[1]', 'stand_spectra_ref']]
    -            mztab_data_psm_full.rename(columns={"sequence": "Sequence", "accession": "Accession",
    +            mztab_data_psm_full = psm[['opt_global_cv_MS:1000889_peptidoform_sequence', 'accession', 'search_engine_score[1]', 'stand_spectra_ref']].copy

    ⚠️ Potential issue

    Fix typo: Missing parentheses in method call

    The .copy method should be called with parentheses to create a copy of the DataFrame. Missing parentheses may lead to unexpected behavior.

    Apply this diff to fix the typo:

    -                mztab_data_psm_full = psm[['opt_global_cv_MS:1000889_peptidoform_sequence', 'accession', 'search_engine_score[1]', 'stand_spectra_ref']].copy
    +                mztab_data_psm_full = psm[['opt_global_cv_MS:1000889_peptidoform_sequence', 'accession', 'search_engine_score[1]', 'stand_spectra_ref']].copy()

    Suggested change
    mztab_data_psm_full = psm[['opt_global_cv_MS:1000889_peptidoform_sequence', 'accession', 'search_engine_score[1]', 'stand_spectra_ref']].copy
    mztab_data_psm_full = psm[['opt_global_cv_MS:1000889_peptidoform_sequence', 'accession', 'search_engine_score[1]', 'stand_spectra_ref']].copy()
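
Why the missing parentheses matter can be shown with a standard-library analog. A plain dict stands in for the DataFrame here purely for illustration (DataFrame.copy is likewise a method that must be called); the variable names and values are made up.

```python
table = {"sequence": ["PEPTIDE"], "accession": ["P12345"]}

not_a_copy = table.copy    # bound method object; nothing has been copied
real_copy = table.copy()   # independent shallow copy of the mapping

# Changing the real copy leaves the original untouched.
real_copy["sequence"] = ["EDITED"]

print(callable(not_a_copy), table["sequence"], real_copy["sequence"])
```

Without the call, the variable holds a method rather than data, so any later attribute access meant for the table (such as the rename that follows in the PR) would fail.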


    for m, group in psm.groupby("filename"):
        self.oversampling_plot = Histogram('MS/MS counts per 3D-peak', plot_category='frequency', breaks=[1, 2, 3])
        for i, j in group.groupby(["PeptideSequence", "chargeState", "Modifications"]):

    🛠️ Refactor suggestion

    Optimize loop by ignoring unused loop variable

    The loop variable i is not used within the loop body. You can replace it with _ to indicate that it is intentionally unused.

    Apply this diff to make the code clearer:

    -                for i, j in group.groupby(["PeptideSequence", "chargeState", "Modifications"]):
    +                for _, j in group.groupby(["PeptideSequence", "chargeState", "Modifications"]):

    Suggested change
    for i, j in group.groupby(["PeptideSequence", "chargeState", "Modifications"]):
    for _, j in group.groupby(["PeptideSequence", "chargeState", "Modifications"]):
    🧰 Tools
    🪛 Ruff

    2234-2234: Loop control variable i not used within loop body

    (B007)


    # TODO remove_decoy
    if config.kwargs['remove_decoy']:
        mzid_table = mzid_table[mzid_table["isDecoy"] == False]

    🛠️ Refactor suggestion

    Simplify boolean filtering in pandas DataFrame

    Instead of comparing mzid_table["isDecoy"] == False, you can use ~mzid_table["isDecoy"] for boolean indexing, provided the isDecoy column has a boolean dtype. This is more idiomatic in pandas.

    Apply this diff to simplify the filtering:

    -                mzid_table = mzid_table[mzid_table["isDecoy"] == False]
    +                mzid_table = mzid_table[~mzid_table["isDecoy"]]

    Suggested change
    mzid_table = mzid_table[mzid_table["isDecoy"] == False]
    mzid_table = mzid_table[~mzid_table["isDecoy"]]
    🧰 Tools
    🪛 Ruff

    2205-2205: Avoid equality comparisons to False; use if not mzid_table["isDecoy"]: for false checks

    Replace with not mzid_table["isDecoy"]

    (E712)


    proteins = set(group["accession_group"])
    peptides = group[["PeptideSequence", "Modifications"]].drop_duplicates()
    unique_peptides = group[group["pep_to_prot_unique"] == True][["PeptideSequence", "Modifications"]].drop_duplicates()

    🛠️ Refactor suggestion

    Simplify boolean comparison in DataFrame filtering

    Instead of comparing group["pep_to_prot_unique"] == True, you can simply use group["pep_to_prot_unique"] for filtering.

    Apply this diff to simplify the filtering:

    -                unique_peptides = group[group["pep_to_prot_unique"] == True][["PeptideSequence", "Modifications"]].drop_duplicates()
    +                unique_peptides = group[group["pep_to_prot_unique"]][["PeptideSequence", "Modifications"]].drop_duplicates()

    Suggested change
    unique_peptides = group[group["pep_to_prot_unique"] == True][["PeptideSequence", "Modifications"]].drop_duplicates()
    unique_peptides = group[group["pep_to_prot_unique"]][["PeptideSequence", "Modifications"]].drop_duplicates()
    🧰 Tools
    🪛 Ruff

    2242-2242: Avoid equality comparisons to True; use if group["pep_to_prot_unique"]: for truth checks

    Replace with group["pep_to_prot_unique"]

    (E712)

    self.enable_sdrf = True

    # TODO in theory the module can work without the design. We just need to remove the according sections!
    if self.enable_sdrf == False and self.enable_exp == False:

    🛠️ Refactor suggestion

    Simplify boolean comparisons in conditional statement

    You can simplify the condition by using if not self.enable_sdrf and not self.enable_exp: instead of comparing to False. This improves readability.

    Apply this diff to simplify the condition:

    -                if self.enable_sdrf == False and self.enable_exp == False:
    +                if not self.enable_sdrf and not self.enable_exp:

    Suggested change
    if self.enable_sdrf == False and self.enable_exp == False:
    if not self.enable_sdrf and not self.enable_exp:
    🧰 Tools
    🪛 Ruff

    162-162: Avoid equality comparisons to False; use if not self.enable_sdrf: for false checks

    Replace with not self.enable_sdrf

    (E712)


    162-162: Avoid equality comparisons to False; use if not self.enable_exp: for false checks

    Replace with not self.enable_exp

    (E712)
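
The two spellings of this condition agree whenever the flags are real booleans, which appears to be the case in this module, but they differ for other falsy values such as None. The function names below are hypothetical and exist only to compare the two forms side by side.

```python
def no_design_eq(enable_sdrf, enable_exp):
    """Original style: explicit comparison to False."""
    return enable_sdrf == False and enable_exp == False  # noqa: E712


def no_design_not(enable_sdrf, enable_exp):
    """Suggested style: truthiness check."""
    return not enable_sdrf and not enable_exp


# Identical for genuine booleans...
for a in (True, False):
    for b in (True, False):
        assert no_design_eq(a, b) == no_design_not(a, b)

# ...but None is falsy without being equal to False.
print(no_design_eq(None, None), no_design_not(None, None))
```

If either flag could ever be left as None, the refactor changes behavior, so it is worth confirming that both attributes are always initialized to True or False.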

    else:
    mzid_table["CrossLinked_Peptide"] = False

    mzid_table["pep_to_prot_unique"] = mzid_table.groupby(["spectrumID", "PeptideSequence"])["accession"].transform(lambda x: True if len(set(x)) == 1 else False)

    🛠️ Refactor suggestion

    Simplify boolean expression in lambda function

    The lambda function can be simplified by removing the unnecessary True if ... else False, since the expression itself evaluates to a boolean.

    Apply this diff to simplify the lambda function:

    -            mzid_table["pep_to_prot_unique"] = mzid_table.groupby(["spectrumID", "PeptideSequence"])["accession"].transform(lambda x: True if len(set(x)) == 1 else False)
    +            mzid_table["pep_to_prot_unique"] = mzid_table.groupby(["spectrumID", "PeptideSequence"])["accession"].transform(lambda x: len(set(x)) == 1)
    📝 Committable suggestion

    Suggested change
    mzid_table["pep_to_prot_unique"] = mzid_table.groupby(["spectrumID", "PeptideSequence"])["accession"].transform(lambda x: True if len(set(x)) == 1 else False)
    mzid_table["pep_to_prot_unique"] = mzid_table.groupby(["spectrumID", "PeptideSequence"])["accession"].transform(lambda x: len(set(x)) == 1)
    🧰 Tools
    🪛 Ruff

    2194-2194: Remove unnecessary True if ... else False

    Remove unnecessary True if ... else False

    (SIM210)
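
A sketch of the simplified lambda on toy data, next to an alternative that uses the built-in `nunique` aggregation. The alternative is not part of the suggestion above. Note that `nunique` ignores missing values by default, so the two only agree when `accession` has none.

```python
import pandas as pd

# Hypothetical rows: spectrum s1 maps PEPTIDE to two proteins, s2 maps to one.
mzid_table = pd.DataFrame(
    {
        "spectrumID": ["s1", "s1", "s2"],
        "PeptideSequence": ["PEPTIDE", "PEPTIDE", "SEQUENCE"],
        "accession": ["P1", "P2", "P3"],
    }
)

grouped = mzid_table.groupby(["spectrumID", "PeptideSequence"])["accession"]

# The comparison already yields a bool, so "True if ... else False" is redundant.
via_lambda = grouped.transform(lambda x: len(set(x)) == 1)

# Alternative: count distinct accessions per group with the built-in aggregation.
via_nunique = grouped.transform("nunique") == 1

mzid_table["pep_to_prot_unique"] = via_lambda
```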


    enzyme_list = list()
    for mzid_path in self.mzid_paths:
    enzyme_list.append(list(next(mzid.MzIdentML(mzid_path).iterfind("Enzyme")).get("EnzymeName", None).keys())[0])

    ⚠️ Potential issue

    Add exception handling for potential StopIteration

    When using next() without a default value, if the iterator is empty, it raises a StopIteration exception. Consider handling this exception to prevent the program from crashing if no Enzyme elements are found in the mzIdentML file.

    Apply this diff to handle the exception:

    -                enzyme_list.append(list(next(mzid.MzIdentML(mzid_path).iterfind("Enzyme")).get("EnzymeName", None).keys())[0])
    +                try:
    +                    enzyme_iter = mzid.MzIdentML(mzid_path).iterfind("Enzyme")
    +                    enzyme = next(enzyme_iter).get("EnzymeName", None)
    +                    if enzyme:
    +                        enzyme_name = list(enzyme.keys())[0]
    +                    else:
    +                        enzyme_name = "Trypsin"
    +                    enzyme_list.append(enzyme_name)
    +                except StopIteration:
    +                    log.warning(f"No Enzyme found in {mzid_path}; defaulting to 'Trypsin'")
    +                    enzyme_list.append("Trypsin")
    📝 Committable suggestion

    Suggested change
    enzyme_list.append(list(next(mzid.MzIdentML(mzid_path).iterfind("Enzyme")).get("EnzymeName", None).keys())[0])
    try:
    enzyme_iter = mzid.MzIdentML(mzid_path).iterfind("Enzyme")
    enzyme = next(enzyme_iter).get("EnzymeName", None)
    if enzyme:
    enzyme_name = list(enzyme.keys())[0]
    else:
    enzyme_name = "Trypsin"
    enzyme_list.append(enzyme_name)
    except StopIteration:
    log.warning(f"No Enzyme found in {mzid_path}; defaulting to 'Trypsin'")
    enzyme_list.append("Trypsin")
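
An alternative sketch that avoids the `try`/`except` by giving `next()` a default. `first_enzyme_name` is a hypothetical helper, and it assumes each Enzyme element behaves like a dict whose `EnzymeName` value is itself a dict keyed by enzyme name, as the original one-liner implies. The `"Trypsin"` fallback is carried over from the suggestion, not confirmed by the author.

```python
import logging

log = logging.getLogger(__name__)

DEFAULT_ENZYME = "Trypsin"  # assumption: same fallback as the suggestion above


def first_enzyme_name(enzymes, source="<unknown>"):
    """Return the first enzyme name from an iterable of Enzyme-like dicts."""
    # next() with a default returns None instead of raising StopIteration.
    enzyme = next(iter(enzymes), None)
    # Treat a missing element or a missing/empty EnzymeName the same way.
    names = (enzyme or {}).get("EnzymeName") or {}
    if not names:
        log.warning("No enzyme name found in %s; defaulting to %r", source, DEFAULT_ENZYME)
        return DEFAULT_ENZYME
    return next(iter(names))


# Toy inputs standing in for the result of iterfind("Enzyme").
enzyme_list = [
    first_enzyme_name([{"EnzymeName": {"Trypsin/P": ""}}], source="a.mzid"),
    first_enzyme_name([], source="b.mzid"),
    first_enzyme_name([{}], source="c.mzid"),
]
```

In the module this could be called as `first_enzyme_name(mzid.MzIdentML(mzid_path).iterfind("Enzyme"), mzid_path)` inside the loop over `self.mzid_paths`.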


    for name, group in psm.groupby('stand_spectra_ref'):
    sc = group['missed_cleavages'].value_counts()
    mis_0 = sc[0] if 0 in sc else 0

    🛠️ Refactor suggestion

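    Simplify the lookup with the get method

    sc is the pandas Series returned by value_counts(), not a dict, but a Series also provides get. Instead of checking whether 0 is in sc, you can use sc.get(0, 0) to retrieve the count directly, with a default of 0 when no PSM has zero missed cleavages.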

    Apply this diff to use the get method:

    -                mis_0 = sc[0] if 0 in sc else 0
    +                mis_0 = sc.get(0, 0)
    📝 Committable suggestion

    Suggested change
    mis_0 = sc[0] if 0 in sc else 0
    mis_0 = sc.get(0, 0)
    🧰 Tools
    🪛 Ruff

    1165-1165: Use sc.get(0, 0) instead of an if block

    Replace with sc.get(0, 0)

    (SIM401)
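
A short sketch of `Series.get` on the result of `value_counts()`, using made-up missed-cleavage values in which no PSM has zero missed cleavages.

```python
import pandas as pd

# Hypothetical missed-cleavage values for one spectra file; none are 0.
missed = pd.Series([1, 1, 2, 1], name="missed_cleavages")
sc = missed.value_counts()

# get() looks the label up in the index and falls back to the default.
mis_0 = sc.get(0, 0)  # label 0 is absent, so the default is used
mis_1 = sc.get(1, 0)  # label 1 is present, so its count is returned
```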

    Comment on lines +1153 to +1155
    if "retention_time" not in psm.columns:
    # MGF
    if self.mgf_paths:

    🛠️ Refactor suggestion

    Simplify nested conditional statements

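    Consider combining the nested if statements into a single condition to improve code readability. The merge only preserves behavior when neither if has an else or elif branch, so check whether a sibling branch follows the MGF case before applying it.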

    Apply this diff to combine the conditions:

    -            if "retention_time" not in psm.columns:
    -                # MGF
    -                if self.mgf_paths:
    +            if "retention_time" not in psm.columns and self.mgf_paths:
    📝 Committable suggestion

    Suggested change
    if "retention_time" not in psm.columns:
    # MGF
    if self.mgf_paths:
    if "retention_time" not in psm.columns and self.mgf_paths:
    # MGF
    🧰 Tools
    🪛 Ruff

    1153-1155: Use a single if statement instead of nested if statements

    (SIM102)
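
A sketch of why SIM102 needs care when the inner `if` has a sibling branch. The `elif` for mzML files below is hypothetical and is not shown in the quoted code; the function and parameter names are illustrative only.

```python
def nested(has_rt, mgf_paths, mzml_paths):
    """Original shape: the file-type choice only happens when RT is missing."""
    if not has_rt:
        # MGF
        if mgf_paths:
            return "mgf"
        # hypothetical mzML fallback
        elif mzml_paths:
            return "mzml"
    return "skip"


def merged(has_rt, mgf_paths, mzml_paths):
    """Naive merge: the elif now also runs when RT is already present."""
    if not has_rt and mgf_paths:
        return "mgf"
    elif mzml_paths:
        return "mzml"
    return "skip"


# Same answer when retention time is missing ...
agree = nested(False, ["a.mgf"], []) == merged(False, ["a.mgf"], [])
# ... but different when it is already present and only mzML files exist.
nested_result = nested(True, [], ["a.mzML"])
merged_result = merged(True, [], ["a.mzML"])
```

Without any `else` or `elif`, the two shapes are equivalent and the merged condition is the simpler choice.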

    @yueqixuan closed this Oct 29, 2024
    Labels: documentation, enhancement, Review effort [1-5]: 4