Inconsistent PG Q-value for multiple precursors of the same run and PG #1262
Comments
Hi David,
Thank you for bringing this up. This happens by design: all q-values are meant for filtering, and if some IDs are considered not good, they get a value of 1.
Best,
Dear Vadim,
Thanks for the quick response. I have to admit that I find this behaviour rather unintuitive, and it might also be unnecessarily complicated. Maybe you are interested in why I think it would be worthwhile to reconsider doing things this way?
First, in this implementation several q-value columns encode two different types of information: 1) the actual q-value associated with a certain entry in the PG (or another) column, and 2) a kind of boolean flag that indicates whether the row is invalid according to certain criteria. To correctly parse or interpret the data, these rules need to be known and reflected in the parsing function.
Second, this is undocumented (at least I couldn't find it) and unexpected behaviour, which makes the output somewhat confusing to work with. Another downside is that documenting it would make the description of the columns much more complicated. And there is always the danger that the documentation of this behaviour doesn't get updated when the actual code changes.
Third, as I mentioned before, I am working on a convenient reader for DIANN results that allows extracting data at different abstraction levels (like proteins). In the Main output reference in the readme file the description is
I think I understand the motivation behind implementing it like that. However, there might be simpler solutions for filtering that don't require this double encoding of information in the q-value columns.
My point is that it is easier for other people (and my future self) to understand what is going on when a peptide-level column is used explicitly to filter peptide entries, rather than a protein-level column. And with the very comprehensive output, which I appreciate a lot by the way, all the information is already present, so in principle it wouldn't need to be encoded in the protein q-value columns as well.
Sorry for the long post. Maybe there are good reasons to do it like this that I haven't considered. I am looking forward to hearing your thoughts on this topic.
Best,
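To illustrate the double encoding discussed in this comment, here is a minimal Python sketch of what a reader would have to do to separate the two pieces of information. It rests on assumptions that are not confirmed anywhere in this thread: that a value of exactly 1 always means "flagged", and that the remaining rows of the same run and protein group carry the real q-value. The function name, the `.Flagged` column suffix, and the toy rows are hypothetical. A genuine q-value of 1 cannot be told apart from a flag with this approach.

```python
from collections import Counter, defaultdict


def split_q_value_and_flag(rows, column="Global.PG.Q.Value"):
    """Split a double-encoded q-value column into a value and a boolean flag.

    Assumption (not from the DIA-NN docs): a stored value of exactly 1.0
    marks a flagged row, and the other rows of the same
    (Run.Index, Protein.Group) group hold the group's actual q-value.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[(row["Run.Index"], row["Protein.Group"])].append(row)

    decoded = []
    for group_rows in groups.values():
        # Most common value that is not the sentinel; 1.0 if none exists.
        real = Counter(r[column] for r in group_rows if r[column] != 1.0)
        group_value = real.most_common(1)[0][0] if real else 1.0
        for r in group_rows:
            decoded.append(
                {**r, column: group_value, column + ".Flagged": r[column] == 1.0}
            )
    return decoded


# Toy rows, invented for illustration only.
rows = [
    {"Run.Index": 0, "Protein.Group": "P1", "Precursor.Id": "AAK2",
     "Global.PG.Q.Value": 0.002},
    {"Run.Index": 0, "Protein.Group": "P1", "Precursor.Id": "CCR2",
     "Global.PG.Q.Value": 1.0},
]
decoded = split_q_value_and_flag(rows)
```

After decoding, every row of a group carries the same q-value and the flag lives in its own column, so a filter on either one can be written explicitly.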
Hi David,
Thank you very much for the detailed analysis of this! I agree, it's somewhat unintuitive. In general, the rationale behind many decisions in DIA-NN is to minimise the chance of users making a mistake by not following the docs or best practices. That is, the idea is that even if the user does something they really should not, they still end up with good quality data. For example, filtering by Protein.Q.Value must always produce only proteotypic peptides, just in case. But maybe we went too far here. I have added this question to the todo list and we will analyse the pros and cons; the changes will most likely not make it into the next release, though.
Many thanks again!
Dear Vadim,
I am currently looking into how to work with the report.parquet file generated by DIANN, specifically at this point I want to aggregate (or rather filter) the output to the protein level. So I’ve started with some sanity checks to make sure that I understand what all of the columns mean. By doing this I believe I might have stumbled upon a bug in the output, or at least I don't understand the behaviour.
What I did was group the data based on unique combinations of "Run.Index" and "Protein.Group". My assumption was that for all precursors of the same Protein group in the same run, all entries in one of the following columns should be identical: "PG.Q.Value", "PG.PEP", "Global.PG.Q.Value", "Lib.PG.Q.Value", "Protein.Q.Value".
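The check described above can be sketched in a few lines of Python. This is a minimal illustration and not the exact code used for this report: it works on plain dictionaries, one per report row, so that it runs without extra dependencies. The function name and the toy rows are hypothetical, and loading report.parquet into such rows is left out. With pandas, the same idea would be a `groupby` on the two key columns followed by `nunique`.

```python
from collections import defaultdict

# Columns expected to be constant within one (Run.Index, Protein.Group) group.
PG_LEVEL_COLUMNS = (
    "PG.Q.Value",
    "PG.PEP",
    "Global.PG.Q.Value",
    "Lib.PG.Q.Value",
    "Protein.Q.Value",
)


def find_inconsistent_groups(rows, columns=PG_LEVEL_COLUMNS):
    """Return the groups in which a column holds more than one distinct value.

    The result maps (run_index, protein_group) to a dict of
    {column: sorted distinct values} for the offending columns only.
    """
    seen = defaultdict(lambda: defaultdict(set))
    for row in rows:
        key = (row["Run.Index"], row["Protein.Group"])
        for col in columns:
            if col in row:  # tolerate rows that lack some columns
                seen[key][col].add(row[col])

    inconsistent = {}
    for key, per_column in seen.items():
        bad = {c: sorted(v) for c, v in per_column.items() if len(v) > 1}
        if bad:
            inconsistent[key] = bad
    return inconsistent


# Toy rows, invented for illustration only.
rows = [
    {"Run.Index": 0, "Protein.Group": "P1",
     "PG.Q.Value": 0.001, "Global.PG.Q.Value": 0.002},
    {"Run.Index": 0, "Protein.Group": "P1",
     "PG.Q.Value": 0.001, "Global.PG.Q.Value": 1.0},
    {"Run.Index": 0, "Protein.Group": "P2",
     "PG.Q.Value": 0.004, "Global.PG.Q.Value": 0.004},
]
inconsistent = find_inconsistent_groups(rows)
```

An empty result means every protein-group-level column is constant within each run and protein group, which is the assumption being tested.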
However, what I occasionally observed was that one precursor had a different value for Global.PG.Q.Value than the others, and sometimes I observed the same for the Protein.Q.Value column. It is also quite suspicious that the differing value is always 1, which could indicate that the protein q-value was never written or updated for this precursor. Below are two examples.
I've also checked multiple DIANN analyses with different input files (timsTOF HT and Astral), and I've consistently observed this issue. Please let me know if I can provide further information or if you need any additional files.
Best,
David