Add gene identifier column to pan-gene set .hsh file? #45
Comments
@sammyjava Can you give an example? Here is the status quo:
|
Simple in this example but we have cases where the transcript identifiers are NOT a number appended to the gene identifier, like
|
Note: this isn't needed for mine loading, since I already know which genes go with which transcripts from the GFFs. It's more about ease of use for users, such as the GCV. |
I see. |
Yeah, I think the only way right now is to use the transcript-gene child-parent relation in the GFF. I am updating my mine loader to do that properly in a post-processor (just load transcripts from the pan-gene sets and then find the genes and proteins in post). |
FWIW, when I need to do something similar in the context of gene family assignments, I extract a transcript2gene lookup from the gff (since mRNA records contain both their ID and the Parent attribute, it is pretty straightforward, at least assuming the order of ID and Parent attributes doesn't get switched around...). |
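A minimal sketch of what such a transcript2gene lookup might look like, assuming a GFF3 file whose mRNA records carry both `ID` and `Parent` in the ninth column. This is not the script referred to in the thread; the function name `transcript2gene` and the sample identifiers are hypothetical. Parsing the attributes into a dictionary avoids depending on the order of `ID` and `Parent`.

```python
def transcript2gene(gff_lines, feature_types=("mRNA",)):
    """Build a transcript ID -> parent gene ID lookup from GFF3 lines.

    Hypothetical helper: only records whose type (column 3) is in
    feature_types are used, and both ID and Parent must be present.
    """
    lookup = {}
    for line in gff_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        fields = line.split("\t")
        if len(fields) != 9 or fields[2] not in feature_types:
            continue
        # Column 9 holds semicolon-separated key=value pairs; a dict makes
        # the lookup independent of attribute order.
        attrs = dict(
            pair.strip().split("=", 1)
            for pair in fields[8].split(";")
            if "=" in pair
        )
        if "ID" in attrs and "Parent" in attrs:
            lookup[attrs["ID"]] = attrs["Parent"]
    return lookup


# Made-up records for illustration (not real Data Store identifiers).
sample_gff = [
    "##gff-version 3",
    "chr1\tsrc\tgene\t100\t900\t.\t+\t.\tID=geneA;Name=geneA",
    "chr1\tsrc\tmRNA\t100\t900\t.\t+\t.\tID=geneA.1;Parent=geneA",
    # Attributes in the opposite order, and a transcript ID that is not
    # simply the gene ID plus a number.
    "chr1\tsrc\tmRNA\t1500\t2400\t.\t-\t.\tParent=geneB;ID=tx_0042",
]

t2g = transcript2gene(sample_gff)
for tx, gene in sorted(t2g.items()):
    print(f"{tx}\t{gene}")
```

In practice the lines would come from an open file handle (for example via `gzip.open(path, "rt")` for compressed annotations) rather than an in-memory list.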
I'm curious as to what "extract a transcript2gene lookup" means. Does that refer to a program, like gffread? Or something you've hacked? |
Completely the latter. From a comment to one of the scripts that uses such a thing:
|
Having given this some more thought ...
My preference is to do it as a separate transcript-gene hash file. Two reasons: first, the pangene "hsh.tsv" file is the end product of a lengthy process, and I don't want to carry the extra information (gene ID) through that process; and second, I think of a hash file as having two fields (definitionally). Admittedly, neither of these reasons is particularly strong -- but I can't think of any good reason not to store the transcript-gene mapping in a separate file.
Implementation: I propose re-generating all BED files in the Data Store annotations, to have seven columns: This would have the advantage of making another place to find this mapping easily (it would become one of the standard files in an annotation set); and it would also allow regularizing the BED files in the Data Store. They are currently without standard names:
In the update, all would be Then, during creation of a pangene or gene family set, the transcript-gene hash will just be derived from columns 4 and 7 in the BED files. |
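As a rough illustration of the last step, here is a sketch of deriving a two-column transcript-gene hash from columns 4 and 7 of a BED file. It assumes column 4 holds the transcript ID and column 7 the gene ID, which is an inference from "transcript-gene hash" rather than something spelled out in the thread; the function name and sample rows are hypothetical.

```python
def bed_to_transcript_gene(bed_lines):
    """Collect column 4 -> column 7 pairs from seven-column BED lines.

    Assumption: column 4 is the transcript ID and column 7 the gene ID.
    Rows with fewer than seven columns are skipped.
    """
    pairs = {}
    for line in bed_lines:
        line = line.rstrip("\n")
        if not line or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines, comments, and BED header lines
        cols = line.split("\t")
        if len(cols) < 7:
            continue
        pairs[cols[3]] = cols[6]
    return pairs


# Made-up rows for illustration only.
sample_bed = [
    "chr1\t99\t900\tgeneA.1\t0\t+\tgeneA",
    "chr1\t1500\t2400\ttx_0042\t0\t-\tgeneB",
    "chr2\t10\t500\tshort_row\t0\t+",  # six columns, so it is ignored
]

hsh = bed_to_transcript_gene(sample_bed)

# Render as a two-field, tab-separated hash file body.
hsh_text = "".join(f"{tx}\t{gene}\n" for tx, gene in hsh.items())
print(hsh_text, end="")
```

Skipping short rows silently is a design choice for the sketch; a real pipeline might prefer to raise an error so that un-regenerated six-column BED files are noticed.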
No opinion here. Like I said, I use the annotation GFFs to connect transcripts to genes (and proteins, when they have the same identifier), so the extra data in the pan-gene set collections isn't needed for mine loading, but I do get that people like to see gene names pretty much everywhere. |
@StevenCannon-USDA I have no objection to augmenting the bed files with genes and standardizing naming.
Seriously, though, on analogy with the gene family assignment files I could at least imagine having some metadata about a gene's inclusion in a pangene set and storing it there (similar to how we store goodness of match to the HMM in the gfa files). Some off-the-cuff examples of such metadata that come to mind (none of them very compelling, but maybe could inspire some other thoughts about it)
pretty much just thinking out loud there, and don't feel strongly about any of it. |
Yes. Something like the following (contrived, but with real data).
In this sample, all are trivial except Sindora (singl). As you point out, it needn't be restricted to two columns - though this example is. It could be called e.g. |
Progress on this task: I have regenerated the bed files for all annotation collections (121 currently).
I'll also try to incorporate the transcript name -- gene name correspondence in the pan-gene sets when I recalculate those. |
Sounds like with the current timing I'll be able to load genes and proteins from the new pan-gene sets in 5.1.0.4. Not that it matters much, since I currently find the genes and proteins with a post-processor, but nice to do it in one swell foop. |
A conversation amongst NCGR legumistas concluded that it would be nice to have gene identifiers in the pan-gene set collections. This would be easily enabled with a gene identifier column in the .hsh file.