Inconsistent PG Q-value for multiple precursors of the same run and PG #1262

Open
hollenstein opened this issue Nov 14, 2024 · 3 comments

@hollenstein

Dear Vadim,

I am currently looking into how to work with the report.parquet file generated by DIA-NN. Specifically, at this point I want to aggregate (or rather filter) the output to the protein level. So I’ve started with some sanity checks to make sure that I understand what all of the columns mean. In doing so, I believe I might have stumbled upon a bug in the output, or at least I don't understand the behaviour.

What I did was group the data by unique combinations of "Run.Index" and "Protein.Group". My assumption was that, for all precursors of the same protein group in the same run, the entries within each of the following columns should be identical: "PG.Q.Value", "PG.PEP", "Global.PG.Q.Value", "Lib.PG.Q.Value", "Protein.Q.Value".
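A minimal sketch of this consistency check, written in plain Python so that it has no dependencies. It assumes the report rows are available as dicts keyed by column name (for example after loading report.parquet with a parquet reader); the helper names and all sample values are invented for illustration.

```python
from collections import defaultdict

# Column names as listed in the post above.
PG_LEVEL_COLUMNS = [
    "PG.Q.Value",
    "PG.PEP",
    "Global.PG.Q.Value",
    "Lib.PG.Q.Value",
    "Protein.Q.Value",
]


def find_inconsistent_groups(rows, columns=PG_LEVEL_COLUMNS):
    """Map (Run.Index, Protein.Group) to the columns holding >1 distinct value."""
    seen = defaultdict(lambda: defaultdict(set))
    for row in rows:
        key = (row["Run.Index"], row["Protein.Group"])
        for column in columns:
            seen[key][column].add(row[column])

    report = {}
    for key, per_column in seen.items():
        mixed = {c: sorted(v) for c, v in per_column.items() if len(v) > 1}
        if mixed:
            report[key] = mixed
    return report


def make_row(run, protein_group, global_pg_q, protein_q):
    """Build one hypothetical precursor row; every number here is made up."""
    return {
        "Run.Index": run,
        "Protein.Group": protein_group,
        "PG.Q.Value": 0.001,
        "PG.PEP": 0.002,
        "Global.PG.Q.Value": global_pg_q,
        "Lib.PG.Q.Value": 0.001,
        "Protein.Q.Value": protein_q,
    }


example_rows = [
    make_row(0, "P12345", 0.0005, 0.001),
    make_row(0, "P12345", 0.0005, 0.001),
    make_row(0, "P12345", 1.0, 0.001),  # one precursor deviates with a value of 1
    make_row(0, "Q67890", 0.0007, 0.002),
    make_row(0, "Q67890", 0.0007, 0.002),
]

inconsistent = find_inconsistent_groups(example_rows)
print(inconsistent)
```

Any key in the returned mapping is a run and protein group combination where the assumption of identical values does not hold.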

However, I occasionally observed that one precursor had a different value for Global.PG.Q.Value than the others, and sometimes I observed the same for the Protein.Q.Value column. It is also quite suspicious that the deviating value is always 1, which could indicate that the protein q-value was never written or updated for this precursor. Below are two examples.

image

image

I've also checked multiple DIA-NN analyses with different input files (timsTOF HT and Astral), and I've consistently observed this issue. Please let me know if I can provide further information or if you need any additional files.

Best,
David

@vdemichev
Owner

Hi David,

Thank you for bringing this up. This happens by design: all q-values are meant for filtering, and IDs that are considered not good get a value of 1.

  • Protein.Q.Value: this will be 1 for non-proteotypic peptides.
  • Global.PG.Q.Value: this will be 1 if the precursor did not pass certain global quality filtering, in which case the value should be 1 in all runs. I guess this is the case here?
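The two rules above could be checked against a report roughly as follows. This is only a sketch: it assumes rows are dicts keyed by column name, that "Proteotypic" is stored as 0/1, and that precursors are identified by a "Precursor.Id" column (an assumption, as that column is not named in this thread). Function names and sample values are invented.

```python
from collections import defaultdict


def non_proteotypic_without_sentinel(rows):
    """Rows that break the first rule: non-proteotypic but Protein.Q.Value != 1."""
    return [r for r in rows if not r["Proteotypic"] and r["Protein.Q.Value"] != 1]


def precursors_with_partial_sentinel(rows):
    """Precursor IDs whose Global.PG.Q.Value is 1 in some runs but not in all."""
    flags = defaultdict(set)
    for r in rows:
        flags[r["Precursor.Id"]].add(r["Global.PG.Q.Value"] == 1)
    return sorted(pid for pid, seen in flags.items() if seen == {True, False})


def make_row(run, precursor, proteotypic, protein_q, global_pg_q):
    """Build one hypothetical report row; all values are made up."""
    return {
        "Run.Index": run,
        "Precursor.Id": precursor,  # assumed identifier column
        "Proteotypic": proteotypic,
        "Protein.Q.Value": protein_q,
        "Global.PG.Q.Value": global_pg_q,
    }


example_rows = [
    make_row(0, "PEPTIDEA2", 1, 0.001, 0.0005),
    make_row(1, "PEPTIDEA2", 1, 0.001, 0.0005),
    make_row(0, "PEPTIDEB2", 0, 1.0, 1.0),  # non-proteotypic, flagged everywhere
    make_row(1, "PEPTIDEB2", 0, 1.0, 1.0),
    make_row(0, "PEPTIDEC3", 1, 0.002, 1.0),  # flagged in run 0 only
    make_row(1, "PEPTIDEC3", 1, 0.002, 0.0005),
]

rule_one_violations = non_proteotypic_without_sentinel(example_rows)
rule_two_violations = precursors_with_partial_sentinel(example_rows)
print(rule_one_violations, rule_two_violations)
```

An empty result from both helpers would be consistent with the behaviour described; a non-empty one points at rows worth inspecting.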

Best,
Vadim

@hollenstein
Author

Dear Vadim,

Thanks for the quick response. I have to admit that I find this behavior rather unintuitive, and it might also be unnecessarily complicated. Maybe you are interested in why I think it would be worthwhile to reconsider doing things this way?

First, in this implementation several q-value columns encode two different types of information: 1) the actual q-value associated with a certain entry in the PG (or another) column, and 2) a kind of boolean flag that indicates whether the row is invalid according to certain criteria. To correctly parse or interpret the data, these rules need to be known and reflected in the parsing function.

Second, this behaviour is undocumented (at least I couldn't find it) and unexpected, which makes it somewhat confusing to work with the output. Another downside is that if you documented it, the description of the columns would become much more complicated. And there is always the danger that the documentation of this behaviour doesn't get updated when the actual code changes.

Third, as I mentioned before, I am working on a convenient reader for DIA-NN results that allows extracting data on different abstraction levels (like proteins). In the Main output reference in the readme file, the description is: "Global.PG.Q.Value is global q-value for the protein group". From this description I would have assumed that for all entries of the same PG in the report.parquet file, the q-value would be the same. To extract the information on the PG level, I would simply need to select any entry of a specific PG, remove all the columns from the precursor, peptide, and modification level, and keep only the columns that are specific to the PG level. However, this is not possible with the current implementation. Instead, I first have to know all q-value columns that are also used to encode additional information, and remove all rows for which the value in any of those columns is 1. In principle this is no big issue, but it makes the implementation more error-prone and vulnerable to future changes.
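The workaround described above might look roughly like this. It is a sketch under assumptions: rows are dicts keyed by column name, the caller supplies which columns are PG-level and which ones carry the "value 1 means invalid" flag, and the function name and sample values are invented.

```python
def protein_group_table(rows, pg_columns, flag_columns, sentinel=1.0):
    """Collapse precursor rows to one PG-level record per (run, protein group).

    Rows with the sentinel value in any flag column are skipped first, so the
    record is taken from a row whose PG-level values were actually written.
    """
    table = {}
    for row in rows:
        if any(row[c] == sentinel for c in flag_columns):
            continue
        key = (row["Run.Index"], row["Protein.Group"])
        # Keep the first valid row per group; later rows are expected to agree.
        table.setdefault(key, {c: row[c] for c in pg_columns})
    return table


def make_row(run, protein_group, global_pg_q):
    """Build one hypothetical precursor row; all values are made up."""
    return {
        "Run.Index": run,
        "Protein.Group": protein_group,
        "PG.Q.Value": 0.001,
        "Global.PG.Q.Value": global_pg_q,
    }


example_rows = [
    make_row(0, "P12345", 1.0),  # flagged row comes first and must be skipped
    make_row(0, "P12345", 0.0005),
    make_row(0, "P12345", 0.0005),
    make_row(0, "Q67890", 1.0),  # every row of this group is flagged
]

pg_table = protein_group_table(
    example_rows,
    pg_columns=["PG.Q.Value", "Global.PG.Q.Value"],
    flag_columns=["Global.PG.Q.Value"],
)
print(pg_table)
```

Note the weakness this illustrates: a protein group whose rows are all flagged disappears entirely, and a genuine q-value of exactly 1 cannot be told apart from the flag.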

I think I understand the motivation behind implementing it like that. However, there might be simpler solutions for filtering that don't necessitate this double encoding of information in the q-value columns.

  • For example, if I want to remove precursor entries from peptides that are not proteotypic, there is already the very useful and unambiguous column "Proteotypic".
  • When working on the peptidoform level, I would apply a filter to the "Global.Peptidoform.Q.Value" column to remove unreliable peptidoform entries. If I want to look only at peptides from proteins with a certain q-value, I would of course also apply a "Global.PG.Q.Value" filter.
  • Another possible solution would be to add a new boolean column like "Global quality filter", that is set to 1 if an ID is considered not good.
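The explicit filtering suggested in the list above can be sketched as follows. Rows are assumed to be dicts keyed by column name, "Proteotypic" is assumed to be stored as 0/1, the 0.01 thresholds are placeholder defaults rather than recommendations, and "Precursor.Id" is an assumed identifier column used only to make the example output readable.

```python
def filter_precursors(rows, max_peptidoform_q=0.01, max_pg_q=0.01,
                      proteotypic_only=True):
    """Keep rows that pass each criterion via the column dedicated to it."""
    kept = []
    for row in rows:
        if proteotypic_only and not row["Proteotypic"]:
            continue  # peptide-level criterion, peptide-level column
        if row["Global.Peptidoform.Q.Value"] > max_peptidoform_q:
            continue  # peptidoform-level confidence
        if row["Global.PG.Q.Value"] > max_pg_q:
            continue  # protein-group-level confidence
        kept.append(row)
    return kept


def make_row(precursor, proteotypic, peptidoform_q, pg_q):
    """Build one hypothetical report row; all values are made up."""
    return {
        "Precursor.Id": precursor,  # assumed identifier column
        "Proteotypic": proteotypic,
        "Global.Peptidoform.Q.Value": peptidoform_q,
        "Global.PG.Q.Value": pg_q,
    }


example_rows = [
    make_row("PEPTIDEA2", 1, 0.001, 0.0005),  # passes everything
    make_row("PEPTIDEB2", 0, 0.001, 0.0005),  # not proteotypic
    make_row("PEPTIDEC3", 1, 0.05, 0.0005),   # unreliable peptidoform
    make_row("PEPTIDED2", 1, 0.001, 0.2),     # unreliable protein group
]

kept_ids = [r["Precursor.Id"] for r in filter_precursors(example_rows)]
kept_incl_shared = [
    r["Precursor.Id"]
    for r in filter_precursors(example_rows, proteotypic_only=False)
]
print(kept_ids, kept_incl_shared)
```

Each criterion is visible at the call site, so a reader can see which level of evidence removed a row.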

My point is that it is easier for other people (and my future self) to understand what is going on when a peptide-level column, and not a protein-level column, is used explicitly to filter peptide entries. And with the very comprehensive output, which I appreciate a lot btw., all the information is already present, so in principle it wouldn’t need to be encoded in the protein q-value columns as well.

Sorry for the long post. Maybe there are good reasons to do it like this that I haven't considered. I am looking forward to hearing your thoughts on this topic.

Best,
David

@vdemichev
Owner

Hi David,

Thank you very much for the detailed analysis of this! I agree, it's somewhat unintuitive. In general, the rationale behind many decisions in DIA-NN is to minimise the chance of users making a mistake by not following the docs or best practices. That is, the idea is that even if the user does something they really should not, they still end up with good quality data. For example, filtering by Protein.Q.Value must always produce only proteotypic peptides, just in case. But maybe we went too far here. I have added this question to the todo list and we will analyse the pros and cons; the changes will most likely not make it into the next release, though.

Many thanks again!
Best,
Vadim
