-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add mzIdentML plugin #101
Add mzIdentML plugin #101
Conversation
WalkthroughThe changes introduce a new command-line parameter Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant CLI
participant Main
participant QuantMSModule
User->>CLI: run command with --mzid_plugin
CLI->>Main: process command
Main->>QuantMSModule: initialize with mzid and mgf handling
QuantMSModule->>Main: return processed data
Main->>CLI: output results
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
PR Reviewer Guide 🔍Here are some key observations to aid the review process:
|
PR Code Suggestions ✨Explore these optional code suggestions:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 12
🧹 Outside diff range and nitpick comments (4)
pmultiqc/main.py (2)
46-47
: LGTM! Consider adding a 'shared' key for consistency.The addition of the 'quantms/mgf' configuration is correct and aligns with the PR objectives. It follows the existing pattern for file type configurations.
For consistency with some other entries (e.g., 'quantms/exp_design'), consider adding a 'shared' key:
- config.update_dict(config.sp, {'quantms/mgf': {'fn': '*.mgf', 'num_lines': 0}}) + config.update_dict(config.sp, {'quantms/mgf': {'fn': '*.mgf', 'num_lines': 0, 'shared': False}})
49-50
: LGTM! Consider adding a 'shared' key for consistency.The addition of the 'quantms/mzid' configuration is correct and aligns with the PR objectives. It follows the existing pattern for file type configurations.
For consistency with some other entries (e.g., 'quantms/exp_design'), consider adding a 'shared' key:
- config.update_dict(config.sp, {'quantms/mzid': {'fn': '*.mzid', 'num_lines': 0}}) + config.update_dict(config.sp, {'quantms/mzid': {'fn': '*.mzid', 'num_lines': 0, 'shared': False}})README.md (2)
36-36
: LGTM! Consider adding an example usage.The addition of the
--mzid_plugin
parameter is well-documented and aligns with the PR objectives. The description is clear and concise.To further improve the documentation, consider adding a brief example of how to use this parameter, similar to the examples provided for other parameters.
Line range hint
140-140
: Improve visibility and wording of the DIA-NN note.The note about DIA-NN output differences provides important information about potential limitations. However, its current placement and wording could be improved.
Consider the following suggestions:
- Move this note to a more prominent location, such as just after the "Usage" section or create a new "Limitations" section.
- Rephrase the note for clarity and formality. For example:
**Note:** Due to significant differences in DIA-NN output files, some metrics may be challenging to calculate when using this plugin with DIA-NN data.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (5)
- README.md (1 hunks)
- pmultiqc/cli.py (1 hunks)
- pmultiqc/main.py (1 hunks)
- pmultiqc/modules/quantms/quantms.py (14 hunks)
- setup.py (1 hunks)
🧰 Additional context used
🪛 Ruff
pmultiqc/modules/quantms/quantms.py
152-152: Avoid equality comparisons to
False
; useif not self.enable_exp:
for false checksReplace with
not self.enable_exp
(E712)
162-162: Avoid equality comparisons to
False
; useif not self.enable_sdrf:
for false checksReplace with
not self.enable_sdrf
(E712)
162-162: Avoid equality comparisons to
False
; useif not self.enable_exp:
for false checksReplace with
not self.enable_exp
(E712)
1153-1155: Use a single
if
statement instead of nestedif
statements(SIM102)
1165-1165: Use
sc.get(0, 0)
instead of anif
blockReplace with
sc.get(0, 0)
(SIM401)
2152-2152: Use
spectrum_item_part.get("rank")
instead ofspectrum_item_part.get("rank", None)
Replace
spectrum_item_part.get("rank", None)
withspectrum_item_part.get("rank")
(SIM910)
2156-2156: Use
key in dict
instead ofkey in dict.keys()
Remove
.keys()
(SIM118)
2158-2158: Use
key in dict
instead ofkey in dict.keys()
Remove
.keys()
(SIM118)
2160-2160: Use
key in dict
instead ofkey in dict.keys()
Remove
.keys()
(SIM118)
2174-2174: Undefined name
search_engines
(F821)
2194-2194: Remove unnecessary
True if ... else False
Remove unnecessary
True if ... else False
(SIM210)
2205-2205: Avoid equality comparisons to
False
; useif not mzid_table["isDecoy"]:
for false checksReplace with
not mzid_table["isDecoy"]
(E712)
2234-2234: Loop control variable
i
not used within loop body(B007)
2242-2242: Avoid equality comparisons to
True
; useif group["pep_to_prot_unique"]:
for truth checksReplace with
group["pep_to_prot_unique"]
(E712)
🔇 Additional comments (7)
pmultiqc/cli.py (4)
34-34
: LGTM: Reformatting ofaffix_type
optionThe reformatting of the
affix_type
option improves consistency with other options while maintaining its functionality.
35-35
: LGTM: Reformatting ofdisable_plugin
optionThe reformatting of the
disable_plugin
option improves consistency with other options while maintaining its functionality as a flag.
34-36
: Summary of changes to command-line optionsThe modifications to this file enhance the command-line interface of pmultiqc:
- The
affix_type
anddisable_plugin
options have been reformatted for consistency.- A new
mzid_plugin
option has been added to support mzIdentML file extraction, aligning with the PR objectives.These changes improve the tool's functionality and maintain a consistent interface style. Ensure that the new
mzid_plugin
option is properly integrated into the main logic of the tool.
36-36
: Approve addition ofmzid_plugin
option with suggestionsThe addition of the
mzid_plugin
option aligns well with the PR objectives to support mzIdentML files. However, consider the following suggestions:
The help text could be more descriptive. Consider updating it to: "Enable extraction and processing of mzIdentML files for report generation".
Ensure that this option is properly handled in the main logic of the tool.
To verify the proper handling of this new option, please run the following script:
✅ Verification successful
Approve addition of
mzid_plugin
option with suggestionsThe addition of the
mzid_plugin
option aligns well with the PR objectives to support mzIdentML files. However, consider the following suggestion:
- Improve the help text for clarity. For example: "Enable extraction and processing of mzIdentML files for report generation."
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the handling of the new mzid_plugin option in the main logic # Test: Search for usage of mzid_plugin in the main logic files echo "Searching for mzid_plugin usage in main.py:" rg "mzid_plugin" pmultiqc/main.py echo "\nSearching for mzid_plugin usage in quantms.py:" rg "mzid_plugin" pmultiqc/modules/quantms/quantms.py # Test: Check if there's a corresponding entry point in setup.py echo "\nChecking for mzid_plugin entry point in setup.py:" rg "mzid_plugin" setup.pyLength of output: 911
pmultiqc/main.py (1)
46-50
: Summary: Successful implementation of mgf and mzid supportThe changes in this file successfully add support for mgf and mzIdentML file formats, aligning with the PR objectives. The new configurations are well-integrated into the existing
pmultiqc_plugin_execution_start()
function and follow established patterns.These additions enhance the plugin's capability to handle diverse input formats, particularly benefiting users working with mgf and mzIdentML files in quantitative proteomics workflows.
setup.py (1)
55-56
: LGTM! Verify the referenced function exists.The addition of the 'mzid_plugin' entry point is consistent with the PR objectives and follows the established pattern for CLI options. This change enables the integration of the new mzIdentML plugin feature.
To ensure the referenced function exists, run the following script:
README.md (1)
Line range hint
1-142
: Overall, the README updates are appropriate and informative.The changes to the README.md file accurately reflect the new
--mzid_plugin
feature and provide important information about potential limitations with DIA-NN outputs. The documentation is clear and consistent with the existing content. The suggested minor improvements will further enhance the usability of the documentation.
mzid_tmp_part = {k: v for k, v in mzid_tmp.items() if k not in ["SpectrumIdentificationItem"]} | ||
for spectrum_item in mzid_tmp.get("SpectrumIdentificationItem", []): | ||
spectrum_item_part = {k: v for k, v in spectrum_item.items() if k not in ["PeptideEvidenceRef", "PeptideSequence"]} | ||
if spectrum_item_part.get("rank", None) == 1 and \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Remove redundant default value in get
method
When using the dict.get
method, the default value None
is already implied, so specifying it is unnecessary.
Apply this diff to simplify the code:
- if spectrum_item_part.get("rank", None) == 1 and \
+ if spectrum_item_part.get("rank") == 1 and \
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
if spectrum_item_part.get("rank", None) == 1 and \ | |
if spectrum_item_part.get("rank") == 1 and \ |
🧰 Tools
🪛 Ruff
2152-2152: Use
spectrum_item_part.get("rank")
instead ofspectrum_item_part.get("rank", None)
Replace
spectrum_item_part.get("rank", None)
withspectrum_item_part.get("rank")
(SIM910)
@@ -1634,8 +1811,8 @@ def parse_out_mzTab(self): | |||
# draw PSMs table for spectral counting | |||
if config.kwargs['quantification_method'] == "spectral_counting" and not config.kwargs.get('disable_table', | |||
True): | |||
mztab_data_psm_full = psm[['sequence', 'accession', 'search_engine_score[1]', 'stand_spectra_ref']] | |||
mztab_data_psm_full.rename(columns={"sequence": "Sequence", "accession": "Accession", | |||
mztab_data_psm_full = psm[['opt_global_cv_MS:1000889_peptidoform_sequence', 'accession', 'search_engine_score[1]', 'stand_spectra_ref']].copy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix typo: Missing parentheses in method call
The .copy
method should be called with parentheses to create a copy of the DataFrame. Missing parentheses may lead to unexpected behavior.
Apply this diff to fix the typo:
- mztab_data_psm_full = psm[['opt_global_cv_MS:1000889_peptidoform_sequence', 'accession', 'search_engine_score[1]', 'stand_spectra_ref']].copy
+ mztab_data_psm_full = psm[['opt_global_cv_MS:1000889_peptidoform_sequence', 'accession', 'search_engine_score[1]', 'stand_spectra_ref']].copy()
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
mztab_data_psm_full = psm[['opt_global_cv_MS:1000889_peptidoform_sequence', 'accession', 'search_engine_score[1]', 'stand_spectra_ref']].copy | |
mztab_data_psm_full = psm[['opt_global_cv_MS:1000889_peptidoform_sequence', 'accession', 'search_engine_score[1]', 'stand_spectra_ref']].copy() |
|
||
for m, group in psm.groupby("filename"): | ||
self.oversampling_plot = Histogram('MS/MS counts per 3D-peak', plot_category='frequency', breaks=[1, 2, 3]) | ||
for i, j in group.groupby(["PeptideSequence", "chargeState", "Modifications"]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Optimize loop by ignoring unused loop variable
The loop variable i
is not used within the loop body. You can replace it with _
to indicate that it is intentionally unused.
Apply this diff to make the code clearer:
- for i, j in group.groupby(["PeptideSequence", "chargeState", "Modifications"]):
+ for _, j in group.groupby(["PeptideSequence", "chargeState", "Modifications"]):
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
for i, j in group.groupby(["PeptideSequence", "chargeState", "Modifications"]): | |
for _, j in group.groupby(["PeptideSequence", "chargeState", "Modifications"]): |
🧰 Tools
🪛 Ruff
2234-2234: Loop control variable
i
not used within loop body(B007)
|
||
# TODO remove_decoy | ||
if config.kwargs['remove_decoy']: | ||
mzid_table = mzid_table[mzid_table["isDecoy"] == False] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Simplify boolean filtering in pandas DataFrame
Instead of comparing mzid_table["isDecoy"] == False
, you can use ~mzid_table["isDecoy"]
for boolean indexing. This is more idiomatic in pandas.
Apply this diff to simplify the filtering:
- mzid_table = mzid_table[mzid_table["isDecoy"] == False]
+ mzid_table = mzid_table[~mzid_table["isDecoy"]]
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
mzid_table = mzid_table[mzid_table["isDecoy"] == False] | |
mzid_table = mzid_table[~mzid_table["isDecoy"]] |
🧰 Tools
🪛 Ruff
2205-2205: Avoid equality comparisons to
False
; useif not mzid_table["isDecoy"]:
for false checksReplace with
not mzid_table["isDecoy"]
(E712)
|
||
proteins = set(group["accession_group"]) | ||
peptides = group[["PeptideSequence", "Modifications"]].drop_duplicates() | ||
unique_peptides = group[group["pep_to_prot_unique"] == True][["PeptideSequence", "Modifications"]].drop_duplicates() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Simplify boolean comparison in DataFrame filtering
Instead of comparing group["pep_to_prot_unique"] == True
, you can simply use group["pep_to_prot_unique"]
for filtering.
Apply this diff to simplify the filtering:
- unique_peptides = group[group["pep_to_prot_unique"] == True][["PeptideSequence", "Modifications"]].drop_duplicates()
+ unique_peptides = group[group["pep_to_prot_unique"]][["PeptideSequence", "Modifications"]].drop_duplicates()
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
unique_peptides = group[group["pep_to_prot_unique"] == True][["PeptideSequence", "Modifications"]].drop_duplicates() | |
unique_peptides = group[group["pep_to_prot_unique"]][["PeptideSequence", "Modifications"]].drop_duplicates() |
🧰 Tools
🪛 Ruff
2242-2242: Avoid equality comparisons to
True
; useif group["pep_to_prot_unique"]:
for truth checksReplace with
group["pep_to_prot_unique"]
(E712)
self.enable_sdrf = True | ||
|
||
# TODO in theory the module can work without the design. We just need to remove the according sections! | ||
if self.enable_sdrf == False and self.enable_exp == False: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Simplify boolean comparisons in conditional statement
You can simplify the condition by using if not self.enable_sdrf and not self.enable_exp:
instead of comparing to False
. This improves readability.
Apply this diff to simplify the condition:
- if self.enable_sdrf == False and self.enable_exp == False:
+ if not self.enable_sdrf and not self.enable_exp:
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
if self.enable_sdrf == False and self.enable_exp == False: | |
if not self.enable_sdrf and not self.enable_exp: |
🧰 Tools
🪛 Ruff
162-162: Avoid equality comparisons to
False
; useif not self.enable_sdrf:
for false checksReplace with
not self.enable_sdrf
(E712)
162-162: Avoid equality comparisons to
False
; useif not self.enable_exp:
for false checksReplace with
not self.enable_exp
(E712)
else: | ||
mzid_table["CrossLinked_Peptide"] = False | ||
|
||
mzid_table["pep_to_prot_unique"] = mzid_table.groupby(["spectrumID", "PeptideSequence"])["accession"].transform(lambda x: True if len(set(x)) == 1 else False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Simplify boolean expression in lambda function
The lambda function can be simplified by removing the unnecessary True if ... else False
, since the expression itself evaluates to a boolean.
Apply this diff to simplify the lambda function:
- mzid_table["pep_to_prot_unique"] = mzid_table.groupby(["spectrumID", "PeptideSequence"])["accession"].transform(lambda x: True if len(set(x)) == 1 else False)
+ mzid_table["pep_to_prot_unique"] = mzid_table.groupby(["spectrumID", "PeptideSequence"])["accession"].transform(lambda x: len(set(x)) == 1)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
mzid_table["pep_to_prot_unique"] = mzid_table.groupby(["spectrumID", "PeptideSequence"])["accession"].transform(lambda x: True if len(set(x)) == 1 else False) | |
mzid_table["pep_to_prot_unique"] = mzid_table.groupby(["spectrumID", "PeptideSequence"])["accession"].transform(lambda x: len(set(x)) == 1) |
🧰 Tools
🪛 Ruff
2194-2194: Remove unnecessary
True if ... else False
Remove unnecessary
True if ... else False
(SIM210)
|
||
enzyme_list = list() | ||
for mzid_path in self.mzid_paths: | ||
enzyme_list.append(list(next(mzid.MzIdentML(mzid_path).iterfind("Enzyme")).get("EnzymeName", None).keys())[0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add exception handling for potential StopIteration
When using next()
without a default value, if the iterator is empty, it raises a StopIteration
exception. Consider handling this exception to prevent the program from crashing if no Enzyme
elements are found in the mzIdentML file.
Apply this diff to handle the exception:
enzyme_list.append(
- list(next(mzid.MzIdentML(mzid_path).iterfind("Enzyme")).get("EnzymeName", None).keys())[0])
+ try:
+ enzyme_iter = mzid.MzIdentML(mzid_path).iterfind("Enzyme")
+ enzyme = next(enzyme_iter).get("EnzymeName", None)
+ if enzyme:
+ enzyme_name = list(enzyme.keys())[0]
+ else:
+ enzyme_name = "Trypsin"
+ enzyme_list.append(enzyme_name)
+ except StopIteration:
+ log.warning(f"No Enzyme found in {mzid_path}; defaulting to 'Trypsin'")
+ enzyme_list.append("Trypsin")
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
enzyme_list.append(list(next(mzid.MzIdentML(mzid_path).iterfind("Enzyme")).get("EnzymeName", None).keys())[0]) | |
try: | |
enzyme_iter = mzid.MzIdentML(mzid_path).iterfind("Enzyme") | |
enzyme = next(enzyme_iter).get("EnzymeName", None) | |
if enzyme: | |
enzyme_name = list(enzyme.keys())[0] | |
else: | |
enzyme_name = "Trypsin" | |
enzyme_list.append(enzyme_name) | |
except StopIteration: | |
log.warning(f"No Enzyme found in {mzid_path}; defaulting to 'Trypsin'") | |
enzyme_list.append("Trypsin") |
|
||
for name, group in psm.groupby('stand_spectra_ref'): | ||
sc = group['missed_cleavages'].value_counts() | ||
mis_0 = sc[0] if 0 in sc else 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Simplify dictionary access with get
method
Instead of checking if 0
is in sc
, you can use sc.get(0, 0)
to directly retrieve the value with a default of 0
. This simplifies the code.
Apply this diff to use the get
method:
- mis_0 = sc[0] if 0 in sc else 0
+ mis_0 = sc.get(0, 0)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
mis_0 = sc[0] if 0 in sc else 0 | |
mis_0 = sc.get(0, 0) |
🧰 Tools
🪛 Ruff
1165-1165: Use
sc.get(0, 0)
instead of anif
blockReplace with
sc.get(0, 0)
(SIM401)
if "retention_time" not in psm.columns: | ||
# MGF | ||
if self.mgf_paths: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Simplify nested conditional statements
Consider combining the nested if
statements into a single condition to improve code readability.
Apply this diff to combine the conditions:
- if "retention_time" not in psm.columns:
- # MGF
- if self.mgf_paths:
+ if "retention_time" not in psm.columns and self.mgf_paths:
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
if "retention_time" not in psm.columns: | |
# MGF | |
if self.mgf_paths: | |
if "retention_time" not in psm.columns and self.mgf_paths: | |
# MGF |
🧰 Tools
🪛 Ruff
1153-1155: Use a single
if
statement instead of nestedif
statements(SIM102)
User description
Feature added: Provide data in mzIdentML and mgf (or mzML) formats to generate a report similar to quantms.
PR Type
enhancement, documentation
Description
--mzid_plugin
to enable mzIdentML data extraction and reporting.--mzid_plugin
option and its usage.Changes walkthrough 📝
cli.py
Add `--mzid_plugin` option for mzIdentML extraction
pmultiqc/cli.py
--mzid_plugin
.main.py
Implement mzIdentML and MGF file support and reporting
pmultiqc/main.py
calculation.
quantms.py
Integrate mzIdentML processing and visualization enhancements
pmultiqc/modules/quantms/quantms.py
setup.py
Register `mzid_plugin` CLI option
setup.py
mzid_plugin
as a new CLI option.README.md
Document `--mzid_plugin` option in README
README.md
--mzid_plugin
option.Summary by CodeRabbit
New Features
--mzid_plugin
for generating reports based on mzid and mzML/mgf files.QuantMSModule
to support additional data formats and introduced new methods for improved data visualization.Documentation
Bug Fixes
Chores