Protein Parsimony & Grouping (Protein Inference)

rmillikin edited this page Aug 25, 2017 · 7 revisions

Settings

Search Task

  • "Apply Protein Parsimony & Construct Protein Groups" checkbox - Constructs protein groups according to the rule of maximum parsimony (Occam's razor).
  • "Require at least two peptides to identify protein" checkbox - At least two peptides with a Q-Value under 0.01 are required to construct a target or contaminant protein group. PSMs still have a protein association listed in the PSM file even if that protein group has only one peptide (there are no orphan PSMs).
  • "Treat modified peptides as different peptides" checkbox - Modified forms of a peptide base sequence are treated as different peptides for the purposes of parsimony, protein group displays, peptide counts, etc., but not for protein group scoring. Check this if two proteins in an XML database are distinguished by a modification.
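
The interaction between the last two checkboxes can be sketched in a few lines of Python. This is only an illustration, not MetaMorpheus code: the function name, the dictionary keys (`protein`, `base_sequence`, `full_sequence`, `q_value`) and the one-protein-per-PSM simplification are all assumptions made for the example, and the special handling of decoy groups is ignored.

```python
from collections import defaultdict


def proteins_passing_peptide_rule(psms, treat_modified_as_different=False,
                                  q_cutoff=0.01, min_peptides=2):
    """Return proteins with at least `min_peptides` distinct confident peptides.

    Hypothetical helper: each PSM is a dict with the keys "protein",
    "base_sequence", "full_sequence" and "q_value".
    """
    peptides_by_protein = defaultdict(set)
    for psm in psms:
        # Only peptides with a Q-Value under the cutoff count toward the rule.
        if psm["q_value"] >= q_cutoff:
            continue
        # With the checkbox on, each modified form is its own peptide;
        # otherwise all forms collapse onto the base sequence.
        if treat_modified_as_different:
            key = psm["full_sequence"]
        else:
            key = psm["base_sequence"]
        peptides_by_protein[psm["protein"]].add(key)
    return {protein for protein, peptides in peptides_by_protein.items()
            if len(peptides) >= min_peptides}


example_psms = [
    {"protein": "P1", "base_sequence": "PEPTIDEK",
     "full_sequence": "PEPTIDEK", "q_value": 0.001},
    {"protein": "P1", "base_sequence": "PEPTIDEK",
     "full_sequence": "PEPT[Phospho]IDEK", "q_value": 0.002},
]
print(proteins_passing_peptide_rule(example_psms, False))
print(proteins_passing_peptide_rule(example_psms, True))
```

In this toy input, P1 has one base sequence observed in two modified forms, so it meets the two-peptide requirement only when modified peptides are counted separately.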

Output

Output is located in the ProteinGroups .tsv file and can be opened with Excel, Notepad, etc. Each header for the output is described here:

  • Protein Accession - Protein accession numbers (from the input protein database) for all proteins in the group are listed here with the "|" character as the delimiter.
  • Gene - Gene names (from the input protein database) for all proteins in the group are listed here with the "|" character as the delimiter.
  • Protein Full Name - Protein names (from the input protein database) for all proteins in the group are listed here with the "|" character as the delimiter.
  • Number of proteins in group - The number of proteins in the protein group.
  • Unique peptides - Peptides that are unique to the listed protein (they can only come from that one protein, based on the in silico digestion of the database). Currently, peptides that are unique to the group are not listed here; i.e., a protein group with >1 protein will always have 0 unique peptides because its peptides are shared among all proteins in the group.
  • Shared peptides - Peptides that are shared between multiple proteins or protein groups are listed.
  • Number of peptides - Sum of unique + shared peptides.
  • Number of unique peptides - Number of unique peptides for the group.
  • Sequence coverage % - Number of residues observed (in the group's peptides) divided by the total number of residues in the protein, as a percent.
  • Sequence coverage - Displays the sequence coverage for each protein in the group with the "|" character as the delimiter. Lowercase residues were not observed. Uppercase residues were observed.
  • Number of PSMs - Number of PSMs with Q-Values <0.01 corresponding to the peptides observed for the group.
  • Summed MetaMorpheus Score - The score of the highest-scoring PSM for each peptide base sequence, summed over all observed peptide base sequences. The list of protein groups is ordered by this score.
  • Decoy/Contaminant/Target - "D" means decoy protein group, "C" means contaminant, "T" means target.
  • Cumulative Target - Used for calculating Q-Value. Sum of target+contaminant proteins so far.
  • Cumulative Decoy - Used for calculating Q-Value. Sum of decoy proteins so far.
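  • Q-Value (%) - The protein group Q-Value, expressed as a percent. It is calculated from the Cumulative Target and Cumulative Decoy columns.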
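
Several of these columns are simple computations, and a rough Python sketch may make them easier to interpret. Everything here is illustrative: the function names and input shapes are invented for the example, and the Q-Value formula (cumulative decoys divided by cumulative targets, as a percent) is an assumption based on the usual target-decoy estimate, since this page does not state the exact formula MetaMorpheus uses.

```python
def sequence_coverage(protein_sequence, observed_peptides):
    """Return (display string, percent coverage) for one protein.

    Observed residues are uppercase and unobserved residues are lowercase.
    """
    seq = protein_sequence.upper()
    covered = [False] * len(seq)
    for peptide in observed_peptides:
        peptide = peptide.upper()
        if not peptide:
            continue
        # Mark every occurrence of the peptide, including overlapping ones.
        start = seq.find(peptide)
        while start != -1:
            for i in range(start, start + len(peptide)):
                covered[i] = True
            start = seq.find(peptide, start + 1)
    display = "".join(r if c else r.lower() for r, c in zip(seq, covered))
    percent = 100.0 * sum(covered) / len(seq) if seq else 0.0
    return display, percent


def summed_score(psms):
    """Sum the best PSM score per base sequence.

    `psms` is an iterable of (base_sequence, score) pairs (hypothetical shape).
    """
    best = {}
    for base_sequence, score in psms:
        best[base_sequence] = max(score, best.get(base_sequence, score))
    return sum(best.values())


def running_q_values(groups):
    """Walk protein groups from best to worst score, tracking cumulative counts.

    `groups` is a list of (score, label) pairs with label "T", "C" or "D".
    ASSUMPTION: Q-Value (%) = 100 * cumulative decoys / cumulative targets.
    """
    rows = []
    cumulative_target = 0
    cumulative_decoy = 0
    for score, label in sorted(groups, key=lambda g: g[0], reverse=True):
        if label == "D":
            cumulative_decoy += 1
        else:
            # Targets and contaminants are counted together.
            cumulative_target += 1
        if cumulative_target:
            q_percent = 100.0 * cumulative_decoy / cumulative_target
        else:
            q_percent = 100.0  # arbitrary choice when no target has been seen yet
        rows.append((score, label, cumulative_target, cumulative_decoy, q_percent))
    return rows


print(sequence_coverage("MKTAYIAKQR", ["TAYIAK"]))
print(summed_score([("AAAK", 10.0), ("AAAK", 12.5), ("CCCR", 7.0)]))
```

Many target-decoy implementations also enforce that Q-Values never decrease down the list; that step is left out of this sketch.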

Methodology

  • Indistinguishable proteins are listed; subset and subsumable proteins are not listed in the protein group.
  • If multiple raw files are searched with "Apply Protein Parsimony & Construct Protein Groups" checked, MetaMorpheus will wait until the end of searching all the raw files to apply parsimony and construct protein groups based on the aggregate results. The upshot is that protein groups are stabilized run-to-run for comparison.
  • Protein groups are constructed based on a global peptide Q-Value (all PSMs from all raw files are aggregated and the Q-Value cutoff is applied to these aggregated results). Only peptides below a global cutoff at Q-Value 0.01 are used for protein group construction.
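
To make the terms in the first bullet concrete, the sketch below groups indistinguishable proteins and then picks a small set of groups that explains all peptides. It is a generic greedy approximation to maximum parsimony, not the MetaMorpheus implementation, whose exact algorithm is not described on this page. The function name and input shape are hypothetical, and the input is assumed to already contain only peptides that passed the global Q-Value cutoff across all raw files.

```python
from collections import defaultdict


def build_protein_groups(peptides_by_protein):
    """Greedy parsimony sketch.

    `peptides_by_protein` maps a protein accession to the set of confident
    peptides that could have come from it (hypothetical input shape).
    Returns a list of protein groups, each a sorted list of accessions.
    """
    # Proteins with identical peptide sets are indistinguishable,
    # so they are merged into a single candidate group.
    proteins_by_peptide_set = defaultdict(list)
    for protein, peptides in peptides_by_protein.items():
        if peptides:
            proteins_by_peptide_set[frozenset(peptides)].append(protein)

    # Fixed candidate order (by accession) so that ties break deterministically.
    candidates = sorted(proteins_by_peptide_set,
                        key=lambda ps: sorted(proteins_by_peptide_set[ps]))
    uncovered = set().union(*candidates)

    groups = []
    while uncovered:
        # Pick the candidate that explains the most still-unexplained peptides.
        best = max(candidates, key=lambda ps: len(ps & uncovered))
        candidates.remove(best)
        groups.append(sorted(proteins_by_peptide_set[best]))
        uncovered -= best
    # Candidates never picked are subset or subsumable proteins;
    # they are left out of the result.
    return groups


example = {
    "A": {"p1", "p2", "p3"},
    "B": {"p1", "p2", "p3"},  # indistinguishable from A
    "C": {"p1"},              # subset of A
    "D": {"p4"},
    "S": {"p3", "p4"},        # subsumable: explained by A|B together with D
}
print(build_protein_groups(example))
```

Here A and B end up in one group, D forms its own group, and C and S are dropped because their peptides are already explained. The tie between D and S for peptide p4 is broken by accession order, which is an arbitrary choice made for this example.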

Further Reading

For an excellent and complete description of parsimony and the protein inference problem, consult:

Nesvizhskii, A. I.; Aebersold, R. Interpretation of Shotgun Proteomic Data. Mol. Cell. Proteomics 2005, 4 (10), 1419–1440.