-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New genome+annotations for Arachis stenosperma (GenBank annotation) #189
Comments
@StevenCannon-USDA are you sure about calling these gnm1.ann2 annotations? I guess @jd-campbell had started a gnm1.ann1 but since it was never made public (that I know of) it seems a little weird to start counting at 2. But then again it wouldn't be the first time versioning got weird, so just let me know if you think this best and I'll go with it. |
@adf-ncgr - You are right. Thanks for catching this. I'll fix this upstream (probably not until this evening though). |
Looking into this more closely: At that time (~2021), we had reserved to keys to use for stenosperma: PFL2 for gnm1 and CZRZ for gnm1.ann1. At this point, we could ... Of those options, I am leaning toward the option of least resistance - which is to use gnm1.ann2.CZRZ for the GenBank annotation; and just live with the fact that there is likely never to be an ann1. This would be similar to the situation for Glycine max Wm82, where we have gnm1, gnm2, gnm4, gnm6. I don't feel strongly about this. Recalculating everything by replaying the notes would only take 15 minutes. I guess I'll offer the proposal above (the first annotation on gnm1 will be called ann2) in the spirit of a Request for Objections. Seeking opinions/votes particularly from @sdash-github, @jd-campbell, @adf-ncgr |
I don't object to 2), but my mild preference would be for 1). No preference on reusing the key or not. It's too bad that soybean didn't hop from gnm4 to gnm8, that would have been a nice numerical sequence (and the number of Wm82* genomes seems to grow exponentially no matter what!) |
I'd say the mild preference (for 1) wins. I'll recalculate the collections now. |
OK: option 1 has been implemented. The annotations collection is now For the record: replaying the data-preparation notes took half an hour - so the curation job is getting faster (though of course preparing the config file and modifying the notes for a new data set will still be a significant job -- maybe half a day or so, barring data curveballs). |
Thanks Steven- will get it into the AHRD queue with the others |
OK, the AHRD/BUSCO/GFA stuff is done on this. Should we go ahead and move from annex to public? Some of the following steps will probably prefer if it's located at the final resting place (e.g. the browser config updates which will want to embed stable URLs in the config file). I don't think the primary datastore contents will be updated much if any following those steps, so seems like a logical time to get it over there, but let me know your thoughts. Same will apply to the other couple of genomes which I've got at more or less the same level of analysis now. Happy to do the moving honors if you agree that is a sensible point at which to do so, @StevenCannon-USDA Also, one thing possibly worth discussing further on this particular dataset (and other future NCBI-sourced annotations), it looks like you've decided to go ahead with the idea of coining transcript/protein ids by tacking on .digit suffixes to the gene ids as is common practice outside of NCBI. I don't see any major issue with that, but I do think that the featid_map file ought to record the mapping from what they used (ie accession numbers) to what we're using, ie not: |
Thanks for the quick work. About linking the LOC IDs back to the transcript accession IDs: I agree. I just wasn't up to it at the time. Dealing with the GenBank GFFs left me feeling beat up. I was slow to realize that not only do the mRNA IDs have no string-based correspondence with their parent gene IDs, but that different splice variants within a gene also have no string-based correspondence with one another:
The (new) script that "fixes" this is rename_gff_mRNA_IDs.pl To do next: I guess to have it also report the mapping ...
... and then teach ds_souschef to use that. |
Hi again- quick note about Arachis/stenosperma/genomes/V10309.gnm1.PFL2 for @StevenCannon-USDA; looks like there are some minor differences between what's already in the v2 area of the datastore and what's in the annex. For example the one in v2 has chromosome names chr01-chr10 whereas the one in the annex has Chr1-Chr10 (all with full yuck of course). Since the gff file matches the new version in the annex, I assume we want to replace the older one with the new one but maybe you could take a quick look and ensure that the differences are all intentional before moving ahead? thanks! |
Good to check. |
Axctually, I just remembered that changing the chromosome names will have at least one implication, namely that some genome alignments we did previously using that genome against duranensis will need to be redone. Not a big deal in and of itself, and I can't think of other things outside of BLAST that would use the un-annotated genomes. But it looks like the NCBI sequence report just refers to them as 1-10 so presumably adding Chr or chr%02d is a subjective choice on our end. I kind of like using zero-padding on genomes with >=10 chromosomes, and it looks like every other Arachis genome has settled on the chr%02d form, so if you don't object or have other reasons maybe I will sed them back to that (including the gff). Let me know if you disagree! |
No objections. Better to do the patch than to have to re-run the alignments. |
Great, thanks for tolerating my obsession with trivial matters of consistency- I've made the changes for this and will proceed with the remaining tasks downstream. |
@adf-ncgr and @StevenCannon-USDA |
@StevenCannon-USDA, getting an odd error (though claiming to be a non-error!) in the JBrowse2 in some regions: |
I think I would handle this as part of the pre-processing work - near line 90 in this file: |
OK, but what exactly are you proposing to do- just re-ID the tRNA features so that there is no collision with the gene IDs? |
For the reasons you give, I think I would remove them (probably into another file). |
I can tackle this if you would like -- but I probably won't get to it until next week. |
I'm happy to do it as post-processing on this one if you want to tackle making it part of the pre-processing in future. |
I won't object. (Always happy to make you happy :-) ) |
OK, I've done the basic partitioning, although the separated non-coding file still has what JBrowse2 considers a "NonError". So if we offer that as a separate browser track, it will probably still need to get fixed. But I'm going to quit while I'm ahead... |
Great. Thank you, and a good weekend to you. |
Main steps for adding new genome and annotation collections
Genus/species/collection names:
What are the collection types and names? Example:
Arachis/stenosperma/annotations/V10309.gnm1.ann2.CZRZ
Arachis/stenosperma/genomes/V10309.gnm1.PFL2
Add collection(s) to the Data Store (annex)
Validate the README(s)
Update about_this_collection.yml
Calculate AHRD functional annotations
Calculate gene family assignments (.gfa)
Add to pan-gene set
Load relevant mine
Add BLAST targets
Incorporate into GCV
Update the jekyll collections listing
Update browser configs
run BUSCO
The text was updated successfully, but these errors were encountered: