How to use Protein.ids in downstream analysis #1224
Comments
Hi Gambrian, cRAP is the contaminant proteins ('Contaminant' option was selected in DIA-NN).
Protein.Ids column, in contrast to Protein.Group, lists all proteins from which the peptide can originate.
I suggest using the Protein.Group (proteins inferred using the maximum parsimony principle) or Genes (the corresponding genes) columns. In either case, if multiple IDs are still listed, separated by ;, it is best either (i) to work with such 'groups' as is or, (ii) if that is not possible for a particular analysis, to simply discard such (rare) IDs. Best, |
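The two options above can be sketched in base R. This is only an illustration on a made-up matrix: the accession IDs and the helper name `is_ambiguous` are hypothetical and not part of DIA-NN or the diann R package.

```r
# Toy abundance matrix; row names mimic Protein.Group values (made-up IDs).
abundance <- matrix(
  c(10.1, 10.4,
     8.2,  8.0,
    12.5, 12.9),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("P11111", "P22222;P33333", "Q44444"), c("run1", "run2"))
)

# Hypothetical helper: TRUE for IDs that list more than one entry.
is_ambiguous <- function(ids) grepl(";", ids, fixed = TRUE)

# Option (i): keep everything and treat "P22222;P33333" as one group.
groups_as_is <- abundance

# Option (ii): discard rows whose group lists several IDs.
unambiguous <- abundance[!is_ambiguous(rownames(abundance)), , drop = FALSE]
```

Option (ii) loses data in proportion to how many groups are ambiguous, so it is worth checking that share before choosing it.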
Dear Vadim, do you mean that Protein.Ids lists all members of the Protein.Group? Maybe I understood that correctly, but in some projects this phenomenon is not "rare" at all. These are results from a rat lung Astral project, for which I used the full UniProt FASTA (Swiss-Prot + TrEMBL) as the reference: 308334 out of 893821 rows of the raw report (just diann_load("report.tsv")) and 3720 out of 9108 abundance matrix row names (diann_maxlfq(df, group.header="Protein.Group", id.header = "Precursor.Id", quantity.header = "Precursor.Normalised")) contain ";", which confuses me when annotating functions (GO, KEGG). I think the main reason is that the reference FASTA does contain many similar proteins. In another, human dataset (using only the Swiss-Prot FASTA) this situation is indeed very rare, but it is unavoidable when studying some species. What do you recommend in this case? In addition, I saw that you improved the normalisation in 1.9.2. Does this mean that DIA-NN 1.9.2 results no longer need to be normalised to the total amount or by other methods, and that differential expression analysis can be performed directly? By the way, the DIA-NN version I used is 1.9.1; I will test the performance of 1.9.2. Thanks again for the great software and your quick response. Best |
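To put the counts quoted in this comment into proportion, here is a small base R sketch that only redoes the arithmetic on those four numbers; the variable names are chosen for illustration.

```r
# Counts quoted in the comment above.
rows_ambiguous   <- 308334  # report rows whose Protein.Group contains ";"
rows_total       <- 893821  # all rows returned by diann_load()
groups_ambiguous <- 3720    # abundance matrix row names containing ";"
groups_total     <- 9108    # all abundance matrix rows

share_rows   <- rows_ambiguous / rows_total
share_groups <- groups_ambiguous / groups_total

cat(sprintf("report rows with ';':   %.1f%%\n", 100 * share_rows))
cat(sprintf("matrix groups with ';': %.1f%%\n", 100 * share_groups))
```

Roughly a third of the report rows and about two fifths of the quantified groups are affected, which is why discarding them outright would be costly for this dataset.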
Are you using the default 'heuristic' protein inference?
Make sure 'heuristic' option is used, also perform the analysis on the genes level.
Yes, this is expected. But on genes level it's the same.
DIA-NN normalisation has always been quite good, if you re-normalise by the total amount, it will in most cases get worse. In 1.9.2 the normalisation is 'more precise'. |
I used the "heuristic" protein inference. Since I am still new to proteomics, I read DIA-NN's user guide carefully and used most of the parameters you recommend; only the precursor parameters were adapted to suit our experiment. Here are the specific parameters for my two search steps. Did I set any of them incorrectly? |
Looks good, but please fix the mass accuracies and the scan window. |
You specifically want to do this at sequence ID level and not gene level? For example, GO enrichment is inherently gene-level. |
Maybe I'm wrong, but I thought that most of the time the sequence IDs match the gene IDs, so I expect many groups to have gene IDs like "gene_id1;gene_id2". How do I determine the unique gene ID for a Protein.Group that contains ";"? Do you think the IDs in one Protein.Group will share the same gene ID (is this the real meaning of Protein inference: Genes)? |
Canonical human uniprot proteome is ~20k genes and ~70k sequences.
Very rarely. |
Oh thanks, I will check and report back. The reason I annotated by sequence ID is that I downloaded the FASTA from UniProt and got the annotation information from UniProt, and all UniProt results are keyed by accession ID. Would it be a good idea to use Entrez IDs if I merge them by gene ID? |
There are already 'Genes' and 'Genes.MaxLFQ' columns in the DIA-NN report, so you can use them as is :) |
If I still want to do functional annotation at sequence ID level (I don't really want to do that, just for understanding and discussion), should I set Protein Inference: Protein Name (from Fasta)? |
In this case, best to set to isoform IDs. |
I hope it is OK to hijack this thread, but since Vadim pointed out that the mass accuracies and scan window should be set, I have questions on this matter too. Why is it important to do the library generation and the main search of the samples in two steps? I've seen you mention that peptidoform scoring is off otherwise, but that is only relevant for PTM or isoform searches, and apparently this has been fixed in 1.9.2. Why is it still recommended to search in two steps? I have observed hardly any differences when I separate the library generation from FASTA and the main search. In that case, should protein inference in the second step (the search of the samples) again be set to genes, as for the FASTA library generation, or be switched off (based on a comment in the documentation here)? What is recommended here? Thank you! |
Please see https://github.com/vdemichev/DiaNN?tab=readme-ov-file#changing-default-settings for an overview.
These values must be identical for all runs for best quantification.
On any timsTOF, mass accuracies can be safely set to 15ppm (if you set outside 10ppm-15ppm range, DIA-NN will print a warning). Scan window, on any instrument, should approx match the peak width (the whole, not FWHM) - but here it's very simple, you just see what DIA-NN sets automatically and then set this number.
The mode with combining these steps is simply not supported. Meaning we don't validate it and there are therefore no guarantees that it works correctly. So there's a 'strongly not recommended' warning in DIA-NN about this. |
Thank you for your reply, Vadim. I believe I speak for all users when I say that this active conversation is truly great and that such support is not seen with all MS software solutions.
Thank you! |
This can overload the timsTOF and yield bad ID numbers. What was the estimated amount of peptides (after enrichment) per injection, and which timsTOF model is it? |
4.1 We chose ST only because Y behaves quite differently in MS2 and is so low in abundance (naturally and/or after enrichment). We assumed we would run into an FDR problem if Y were included for phospho, and I think this has been discussed in the phospho community. For the "most solid" phospho data, do you prefer to include Y in the searches or not?
4.2 Yes, I will try with and without M(Ox). In one dataset with DIA-NN 1.9.1 and M(Ox) on, we observed that 10% of the resulting phosphopeptides carry an M(Ox).
4.3 So far we have only measured technical replicates of cell lysates. I know from experience with FragPipe that the rescoring and validation algorithms, and MBR, benefit greatly from the number of replicates but also from biological variability. I guess this is the same for any software that depends on machine learning or matrix size, such as DIA-NN? So I am excited to see what site ID numbers DIA-NN will give in actual datasets with treatments, timepoints, etc.
4.4 The numbers are from DIA-NN 1.9.2. I am just wondering how they compare to your experience, as ddaPASEF with MSFragger seems to yield 2x more phosphopeptides/sites than diaPASEF with DIA-NN. I know that comparison at different localisation probability cut-offs is difficult, and I have no idea what the benchmark or real number is, or whether the sites reported for DDA are any (more) real. Naively, I would have guessed that DIA and DIA-NN should yield more, given missing values and low-abundance phosphopeptides.
Regarding peptides: the same sample was used for the DDA acquisition as well (enriched twice and pooled after elution in the same vial, so I can inject the same sample twice, once for DIA and once for DDA, for comparison). 100 µg of starting protein material was digested and subsequently enriched for phosphopeptides with Ti-MOAC (µPhos). I have no data on the peptide concentration after enrichment and before injection, but both the BPC and the TIC are absolutely fine.
An old version of this protocol (EasyPhos) uses 1 mg of starting material with the same amount of beads. Other recent Zr-IMAC protocols (MagReSyn HP) are also based on 100-200 µg of digest input. Thank you! |
More important is how many more unique sequences it yields. |
Thank you again for your fast and insightful responses.
|
Dear Vadim, I spent some time testing the performance of 1.9.2 and the impact of the protein inference setting (on 1.9.1), and wanted to report back and get some advice.
diann_result_gene <- diann_load(file.path(work_dir_gene,"report.tsv"))
abundance_df_gene <- diann_maxlfq(diann_result_gene, group.header="Protein.Group", id.header = "Precursor.Id", quantity.header = "Precursor.Normalised")
diann_result_isoform <- diann_load(file.path(work_dir_isoform,"report.tsv"))
abundance_df_isoform <- diann_maxlfq(diann_result_isoform, group.header="Protein.Group", id.header = "Precursor.Id", quantity.header = "Precursor.Normalised")
diann_result_protein <- diann_load(file.path(work_dir_protein,"report.tsv"))
abundance_df_protein <- diann_maxlfq(diann_result_protein, group.header="Protein.Group", id.header = "Precursor.Id", quantity.header = "Precursor.Normalised")
> ## gene
> str_detect(diann_result_gene$Protein.Group,";") %>% sum()
[1] 408760
> str_detect(diann_result_gene$Protein.Group,"cRAP-") %>% sum()
[1] 1546
> nrow(diann_result_gene)
[1] 1297780
>
> str_detect(abundance_df_gene %>% rownames(),";") %>% sum()
[1] 3130
> str_detect(rownames(abundance_df_gene),"cRAP-") %>% sum()
[1] 25
> nrow(abundance_df_gene)
[1] 8062
>
> str_detect(diann_result_gene$Genes,";") %>% sum()
[1] 24190
>
> ## isoform
> str_detect(diann_result_isoform$Protein.Group,";") %>% sum()
[1] 376122
> str_detect(diann_result_isoform$Protein.Group,"cRAP-") %>% sum()
[1] 3345
> nrow(diann_result_isoform)
[1] 1297780
>
> str_detect(abundance_df_isoform %>% rownames(),";") %>% sum()
[1] 3091
> str_detect(rownames(abundance_df_isoform),"cRAP-") %>% sum()
[1] 29
> nrow(abundance_df_isoform)
[1] 8170
>
> str_detect(diann_result_isoform$Genes,";") %>% sum()
[1] 24574
>
> ## protein name
> str_detect(diann_result_protein$Protein.Group,";") %>% sum()
[1] 376122
> str_detect(diann_result_protein$Protein.Group,"cRAP-") %>% sum()
[1] 3345
> nrow(diann_result_protein)
[1] 1297780
>
> str_detect(abundance_df_protein %>% rownames(),";") %>% sum()
[1] 3091
> str_detect(rownames(abundance_df_protein),"cRAP-") %>% sum()
[1] 29
> nrow(abundance_df_protein)
[1] 8170
>
> str_detect(diann_result_protein$Genes,";") %>% sum()
[1] 24574
So setting protein inference to "Isoform ID" does reduce this, but the effect is small. As you predicted, most rows do not have ";" in the Genes column. So when a row has no ";" in the Genes column, I want to keep the first accession ID, and when it does have ";", delete the row, so that as many results as possible can be annotated with the UniProt functional annotation. ② Do you think this is the right way? ③ There is also another small question. Since DIA-NN has already normalised the results, should I use diann_matrix to obtain the expression matrix directly instead of diann_maxlfq? Best |
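The rule proposed in this comment (drop rows with ";" in Genes, otherwise keep the first accession of Protein.Group) could look roughly like this in base R. This is a sketch of the poster's proposal on a made-up data frame, not a method endorsed in the thread; the IDs, gene names and the `First.Accession` column are hypothetical.

```r
# Toy report with the two relevant columns (made-up values).
report <- data.frame(
  Protein.Group = c("A0A001;A0A002", "P11111", "Q22222;Q33333"),
  Genes         = c("Gene1", "Gene2", "Gene3;Gene4"),
  stringsAsFactors = FALSE
)

# Drop rows whose Genes entry is ambiguous (contains ";").
keep <- !grepl(";", report$Genes, fixed = TRUE)
annotatable <- report[keep, , drop = FALSE]

# For the remaining rows, keep only the first accession of the group.
annotatable$First.Accession <- sub(";.*$", "", annotatable$Protein.Group)

print(annotatable)
```

Note that the first accession in a group is not necessarily the best-annotated one, so annotating by the Genes column, as suggested earlier in the thread, avoids that choice altogether.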
Dear Vadim
In the DIA-NN results, most Protein.Ids are single UniProt accession IDs (from the FASTA file I used), but some are several accession IDs joined by ";". Some of these include cRAP-***; I think this is because they match both the reference FASTA file and camprotR_240512_cRAP_20190401_full_tags.fasta. Can I delete these results? Others are just two or more accession IDs joined by ";". Why is this? I assume no master protein can be determined for them because the proteins share the same matching peptides. Can I use the first ID instead in subsequent analysis?
Best
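If one decides to exclude contaminant hits, rows can be flagged by the "cRAP-" prefix seen in this report. The base R sketch below runs on a made-up data frame with invented IDs. Whether to drop groups that mix a contaminant with a reference accession is a judgment call that the thread does not settle, so both flags are computed.

```r
# Toy Protein.Ids column (made-up IDs; "cRAP-" marks contaminant entries).
report <- data.frame(
  Protein.Ids = c("P11111", "cRAP-X00001;P22222", "Q33333;Q44444", "cRAP-X00002"),
  stringsAsFactors = FALSE
)

# TRUE if any ID in the entry is a contaminant.
has_contaminant <- grepl("cRAP-", report$Protein.Ids, fixed = TRUE)

# TRUE only if every ID in the entry is a contaminant.
split_ids <- strsplit(report$Protein.Ids, ";", fixed = TRUE)
only_contaminant <- vapply(
  split_ids,
  function(ids) all(startsWith(ids, "cRAP-")),
  logical(1)
)

# Strict filter: remove every row that mentions a contaminant at all.
no_contaminants <- report[!has_contaminant, , drop = FALSE]
```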