Add genome and annotations for Acacia crassicarpa #195

Open
2 of 14 tasks
StevenCannon-USDA opened this issue Feb 13, 2024 · 6 comments

Comments

@StevenCannon-USDA

Main steps for adding new genome and annotation collections

Genus/species/collection names:

  • Acacia/crassicarpa/genomes/Acra3RX.gnm1.YX4L

  • Acacia/crassicarpa/annotations/Acra3RX.gnm1.ann1.6C0V

  • Add collection(s) to the Data Store, including commits to datastore-metadata (at annex as of 2024-02-13)

  • Validate the README(s)

  • Update about_this_collection.yml

  • Calculate AHRD functional annotations

  • Calculate gene family assignments (.gfa)

  • Add to pan-gene set

  • Load relevant mine

  • Add BLAST targets

  • Incorporate into GCV

  • Update the jekyll collections listing

  • Update browser configs

  • Run BUSCO

  • Update DSCensor

  • Add LINKOUTS to datastore, refresh linkout service

@adf-ncgr
Contributor

Little FYI on this one, @StevenCannon-USDA: it looks like most of the children of the gene features are given type "transcript", but some of the AHRD-related processing behaves badly if protein-coding transcripts are not given as mRNA. Let me know if you have any concerns about me making that change wholesale. I checked, and the number of genes is only off by one from the number of primary proteins, so I think it's fair to assume they are mRNA. The discrepancy seems to be caused by a gene with ID=acacr.Acra3RX.gnm1.ann1.nbis-gene-1, which also has the attributes gene_id=g12430;transcript_id=g12430.t2. That seems to imply it really ought to just be another isoform of g12430. I suppose I could manually fix that little oddity as well.
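A quick way to see how the children of gene features are typed is to tally column 3 for every feature whose Parent is a gene. The following is only a minimal sketch, assuming a tab-separated GFF3 with standard `ID=` and `Parent=` attributes; the function names `parse_attributes` and `tally_gene_children` are hypothetical and not part of any Data Store tooling.

```python
from collections import Counter


def parse_attributes(column9):
    """Split a GFF3 attribute column ("key=value;key=value") into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def tally_gene_children(gff3_lines):
    """Return (number of genes, Counter of feature types whose Parent is a gene).

    Hypothetical helper: assumes nine tab-separated columns per feature line.
    """
    records = []
    gene_ids = set()
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # skip malformed lines rather than guess at their layout
        attrs = parse_attributes(cols[8])
        records.append((cols[2], attrs))
        if cols[2] == "gene" and "ID" in attrs:
            gene_ids.add(attrs["ID"])

    counts = Counter()
    for feature_type, attrs in records:
        for parent in attrs.get("Parent", "").split(","):
            if parent in gene_ids:
                counts[feature_type] += 1
    return len(gene_ids), counts
```

Comparing the gene count with the per-type tallies (and with the number of primary proteins) should surface discrepancies like the single odd gene mentioned above.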

@StevenCannon-USDA
Author

Thank you - and I don't have concerns about s/transcript/mRNA/. (Why are there a million ways to munge a GFF?)

@adf-ncgr
Contributor

> (Why are there a million ways to munge a GFF?)

maybe you should write a song about it! ;)

@adf-ncgr
Contributor

Oh, actually there's probably more to do on this file, but I'd like a second opinion. In addition to that one weird little gene, it looks like there are a bunch (~2000) of existing mRNA features that appear to essentially just be duplicates of what was originally represented as "transcript", but with an odd ID that nothing else references. For example:

acacr.Acra3RX.gnm1.scaffold_1   GeneMark.hmm3   mRNA    127849  128478  .       +       .       ID=acacr.Acra3RX.gnm1.ann1.nbis-mrna-1;Parent=acacr.Acra3RX.gnm1.ann1.g3;gene_id=g3;transcript_id=g3.t1
acacr.Acra3RX.gnm1.scaffold_1   GeneMark.hmm3   transcript      127849  128478  .       +       .       ID=acacr.Acra3RX.gnm1.ann1.g3.t1;Parent=acacr.Acra3RX.gnm1.ann1.g3

Note that the transcript_id=g3.t1 attribute on the first record seems to suggest it really is a duplicate of the one below. I'm proposing to just delete these "extras", since they don't appear to provide any value and are arguably detrimental: having no exon children, they appear in the transcripts FASTA without any splicing (they don't appear in the cds or protein files, since they have no CDS children). Let me know if you see something about these that I'm overlooking that would argue for preserving them.
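One way to list the candidate "extras" is to look for mRNA features that nothing names as Parent and whose transcript_id matches the ID of a sibling under the same gene. This is a rough sketch under stated assumptions: the file is tab-separated GFF3, and a sibling ID either equals the transcript_id or ends with "." plus the transcript_id (to allow for the prefixed IDs shown above). The function names are hypothetical.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Split a GFF3 attribute column ("key=value;key=value") into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def find_redundant_mrnas(gff3_lines):
    """Return IDs of childless mRNA features that duplicate a sibling feature."""
    parents_in_use = set()            # every ID that some feature names as Parent
    ids_by_parent = defaultdict(set)  # parent ID -> IDs of its direct children
    mrnas = []
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        attrs = parse_attributes(cols[8])
        parents = [p for p in attrs.get("Parent", "").split(",") if p]
        parents_in_use.update(parents)
        if "ID" in attrs:
            for parent in parents:
                ids_by_parent[parent].add(attrs["ID"])
            if cols[2] == "mRNA":
                mrnas.append((attrs, parents))

    redundant = []
    for attrs, parents in mrnas:
        if attrs["ID"] in parents_in_use:
            continue  # has exon/CDS children, so it is not an "extra"
        transcript_id = attrs.get("transcript_id")
        if not transcript_id:
            continue
        siblings = set()
        for parent in parents:
            siblings |= ids_by_parent[parent]
        siblings.discard(attrs["ID"])
        # Assumption: sibling IDs may carry a prefix such as "acacr.Acra3RX.gnm1.ann1."
        if any(s == transcript_id or s.endswith("." + transcript_id) for s in siblings):
            redundant.append(attrs["ID"])
    return redundant
```

Reviewing the returned IDs before deleting anything keeps the removal auditable, and anything that is childless but has no matching sibling is left alone for manual inspection.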

@StevenCannon-USDA
Author

I agree with you: OK to delete those transcript records that seem to be duplicates of the mRNA features.

History of this gene file: I received it in GTF format (broken, actually, with 29 lines being space- rather than tab-separated). I used AGAT to transform the file to GFF3. The "nbis" labels indicate new identifiers added by AGAT (NBIS = National Bioinformatics Infrastructure Sweden).
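For illustration only, space-separated GTF lines of the kind described above could be repaired along these lines. This is a sketch, not the procedure actually used on this file; it assumes that none of the first eight columns contains whitespace, and `retab_gtf_line` is a hypothetical name.

```python
def retab_gtf_line(line):
    """Rewrite a GTF line so that its nine columns are tab-separated.

    Assumption: columns 1-8 contain no internal whitespace, so splitting on
    runs of whitespace at most eight times leaves the attribute column intact.
    """
    if line.startswith("#") or not line.strip():
        return line
    body = line.rstrip("\n")
    if body.count("\t") >= 8:
        return line  # already has nine tab-separated columns
    parts = body.split(None, 8)  # at most 8 splits -> at most 9 fields
    if len(parts) < 9:
        return line  # unexpected layout: leave untouched for manual review
    newline = "\n" if line.endswith("\n") else ""
    return "\t".join(parts) + newline
```

Lines with fewer than nine fields, such as gene records whose last column is a bare identifier, still split into nine parts here; anything shorter is passed through unchanged so it can be checked by hand.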

The original structure of the noncompliant records is:

scaffold_1  GeneMark.hmm3 gene  127849  128478  . + . g3
scaffold_1  GeneMark.hmm3 transcript  127849  128478  . + . g3.t1
scaffold_1  GeneMark.hmm3 start_codon 127849  127851  . + 0 transcript_id "g3.t1"; gene_id "g3";
scaffold_1  GeneMark.hmm3 mRNA  127849  128478  . + . transcript_id "g3.t1"; gene_id "g3";
scaffold_1  GeneMark.hmm3 CDS 127849  128478  . + 0 transcript_id "g3.t1"; gene_id "g3";
scaffold_1  GeneMark.hmm3 exon  127849  128478  . + 0 transcript_id "g3.t1"; gene_id "g3";
scaffold_1  GeneMark.hmm3 stop_codon  128476  128478  . + 0 transcript_id "g3.t1"; gene_id "g3";

In the GFF3, this becomes:

scaffold_1  GeneMark.hmm3 gene  127849  128478  . + . ID=g3
scaffold_1  GeneMark.hmm3 mRNA  127849  128478  . + . ID=nbis-mrna-1;Parent=g3;gene_id=g3;transcript_id=g3.t1
scaffold_1  GeneMark.hmm3 transcript  127849  128478  . + . ID=g3.t1;Parent=g3
scaffold_1  GeneMark.hmm3 exon  127849  128478  . + 0 ID=exon-26;Parent=g3.t1;gene_id=g3;transcript_id=g3.t1
scaffold_1  GeneMark.hmm3 CDS 127849  128478  . + 0 ID=cds-26;Parent=g3.t1;gene_id=g3;transcript_id=g3.t1
scaffold_1  GeneMark.hmm3 start_codon 127849  127851  . + 0 ID=start_codon-4;Parent=g3.t1;gene_id=g3;transcript_id=g3.t1
scaffold_1  GeneMark.hmm3 stop_codon  128476  128478  . + 0 ID=stop_codon-4;Parent=g3.t1;gene_id=g3;transcript_id=g3.t1

My reading of this is that the transcript record is a duplicate of mRNA and could be deleted.

@adf-ncgr
Contributor

Thanks, this helps clarify. I think I've seen "nbis" appearing in other files occasionally too. In any case, I'll delete the "nbis" records and retain the others to keep the naming consistent (and switch "transcript" -> "mRNA").
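The cleanup described here could be sketched roughly as follows: drop mRNA records whose ID contains "nbis-" and that have no children, then retype the remaining "transcript" features as "mRNA". This is a hypothetical illustration, not the script that was run; it assumes tab-separated GFF3 and, as discussed in the thread, that every "transcript" feature is protein-coding. The odd nbis-gene-1 record is deliberately not handled, since that one was to be fixed manually.

```python
def clean_gff3(gff3_lines, tag="nbis-"):
    """Drop childless mRNA records tagged by AGAT and retype transcript as mRNA.

    Hypothetical helper; `tag` is the substring assumed to mark AGAT-added IDs.
    """
    lines = list(gff3_lines)

    # First pass: collect every ID that some feature names as its Parent.
    parents_in_use = set()
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9:
            continue
        for field in cols[8].split(";"):
            if field.startswith("Parent="):
                parents_in_use.update(field[len("Parent="):].split(","))

    # Second pass: filter and retype.
    cleaned = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9:
            cleaned.append(line)  # pass comments and odd lines through unchanged
            continue
        feature_id = next(
            (f[len("ID="):] for f in cols[8].split(";") if f.startswith("ID=")), ""
        )
        if cols[2] == "mRNA" and tag in feature_id and feature_id not in parents_in_use:
            continue  # childless duplicate added during conversion
        if cols[2] == "transcript":
            cols[2] = "mRNA"
            line = "\t".join(cols) + ("\n" if line.endswith("\n") else "")
        cleaned.append(line)
    return cleaned
```

Requiring both the tag and the absence of children is a conservative choice: a tagged mRNA that does carry exon or CDS children would be kept rather than silently orphaning its subfeatures.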
