Add genome and annotations for Acacia crassicarpa #195

StevenCannon-USDA · 2024-02-13T15:58:48Z

adf-ncgr · 2024-02-16T22:33:45Z

Little FYI on this one, @StevenCannon-USDA: it looks like most of the children of the gene features are given type "transcript", but some of the AHRD-related processing behaves badly if protein-coding transcripts are not typed as mRNA. Let me know if you have any concerns about me making that change wholesale. I checked, and the number of genes is only off by one from the number of primary proteins, so I think it's fair to assume they are mRNA. The discrepancy seems to be caused by a gene with ID=acacr.Acra3RX.gnm1.ann1.nbis-gene-1, which also has the attributes gene_id=g12430;transcript_id=g12430.t2, which seems to imply it really ought to be just another isoform of g12430. I suppose I could manually fix that little oddity as well.

StevenCannon-USDA · 2024-02-17T00:24:03Z

Thank you - and I don't have concerns about s/transcript/mRNA/. (Why are there a million ways to munge a GFF?)

adf-ncgr · 2024-02-17T00:56:47Z

> (Why are there a million ways to munge a GFF?)

maybe you should write a song about it! ;)

adf-ncgr · 2024-02-17T01:10:58Z

Oh, actually there's probably more to do on this file, but I'd like a second opinion. In addition to that one weird little gene, it looks like there are a bunch (~2000) of existing mRNA features that appear to essentially just be duplicates of what was originally represented as "transcript", but with an odd ID that nothing else references. For example:

acacr.Acra3RX.gnm1.scaffold_1   GeneMark.hmm3   mRNA    127849  128478  .       +       .       ID=acacr.Acra3RX.gnm1.ann1.nbis-mrna-1;Parent=acacr.Acra3RX.gnm1.ann1.g3;gene_id=g3;transcript_id=g3.t1
acacr.Acra3RX.gnm1.scaffold_1   GeneMark.hmm3   transcript      127849  128478  .       +       .       ID=acacr.Acra3RX.gnm1.ann1.g3.t1;Parent=acacr.Acra3RX.gnm1.ann1.g3

Note that the transcript_id=g3.t1 attribute on the first record suggests it really is a duplicate of the record below it. I'm proposing to just delete these "extras": they don't appear to provide any value and are arguably detrimental, because they appear in the transcripts fasta without any splicing, since they have no exon children. (They don't appear in the cds or protein files because they have no CDS children.) Let me know if you see something I'm overlooking that would argue for preserving them.
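For anyone wanting to check this before deleting anything, here is a minimal sketch of how the childless duplicates could be listed. It is not the script used for this collection; the function names are hypothetical, and it assumes the ID prefix can be recovered by stripping `gene_id` from the end of `Parent`, which holds for the example above but has not been verified for the whole file.

```python
def parse_attrs(col9):
    """Parse a GFF3 attribute column (key=value;key=value) into a dict."""
    return dict(kv.split("=", 1) for kv in col9.strip().split(";") if "=" in kv)


def find_childless_duplicates(gff_lines):
    """Return IDs of mRNA features that have no children and whose
    transcript_id points at an existing 'transcript' feature."""
    records = []
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # skip malformed lines in this sketch
        records.append((cols[2], parse_attrs(cols[8])))

    # Every ID that some other feature names as its Parent.
    parents = set()
    for _, attrs in records:
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                parents.add(parent)

    transcript_ids = {a["ID"] for t, a in records if t == "transcript" and "ID" in a}

    duplicates = []
    for ftype, attrs in records:
        if ftype != "mRNA" or attrs.get("ID") in parents:
            continue  # not an mRNA, or it has children
        parent = attrs.get("Parent", "")
        gene_id = attrs.get("gene_id", "")
        transcript_id = attrs.get("transcript_id", "")
        if gene_id and transcript_id and parent.endswith(gene_id):
            # Assumption: full transcript ID = collection prefix + transcript_id.
            expected = parent[: len(parent) - len(gene_id)] + transcript_id
            if expected in transcript_ids:
                duplicates.append(attrs["ID"])
    return duplicates
```

Calling `find_childless_duplicates(open(path))` on the annotation file (with `path` being wherever the GFF3 lives) would give the list of candidate IDs to review; an empty intersection with `parents` is what makes them safe to drop.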

StevenCannon-USDA · 2024-02-17T13:55:22Z

I agree with you: OK to delete those transcript records that seem to be duplicates of the mRNA features.

History of this gene file: I received it in gtf format (broken actually, with 29 lines being space- rather than tab-separated). I used AGAT to transform the file to gff3. Labels nbis indicate new identifiers added by AGAT (NBIS=National Bioinformatics Infrastructure Sweden).
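As an illustration of the space- versus tab-separated problem, the sketch below shows one way such lines could be repaired before conversion. It is not what was actually run on this file; `retab_gtf_line` is a hypothetical helper, and it assumes the first eight columns contain no internal whitespace so that only the attribute column may hold spaces.

```python
def retab_gtf_line(line):
    """Return a GTF line with its nine columns tab-separated.

    Lines that already have eight tabs, comments, and blank lines are
    returned unchanged (minus the trailing newline).
    """
    text = line.rstrip("\n")
    if not text or text.startswith("#") or text.count("\t") == 8:
        return text
    # Split on any whitespace at most eight times: the ninth piece is the
    # attribute column, whose internal spaces are preserved.
    fields = text.split(None, 8)
    if len(fields) != 9:
        raise ValueError("expected 9 columns, got %d: %r" % (len(fields), text))
    return "\t".join(fields)
```

Raising on lines that cannot be split into nine columns is deliberate: for a handful of broken lines it seems safer to inspect them by hand than to guess.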

The original structure of the noncompliant records is:

scaffold_1  GeneMark.hmm3 gene  127849  128478  . + . g3
scaffold_1  GeneMark.hmm3 transcript  127849  128478  . + . g3.t1
scaffold_1  GeneMark.hmm3 start_codon 127849  127851  . + 0 transcript_id "g3.t1"; gene_id "g3";
scaffold_1  GeneMark.hmm3 mRNA  127849  128478  . + . transcript_id "g3.t1"; gene_id "g3";
scaffold_1  GeneMark.hmm3 CDS 127849  128478  . + 0 transcript_id "g3.t1"; gene_id "g3";
scaffold_1  GeneMark.hmm3 exon  127849  128478  . + 0 transcript_id "g3.t1"; gene_id "g3";
scaffold_1  GeneMark.hmm3 stop_codon  128476  128478  . + 0 transcript_id "g3.t1"; gene_id "g3";

In the gff3, this becomes

scaffold_1  GeneMark.hmm3 gene  127849  128478  . + . ID=g3
scaffold_1  GeneMark.hmm3 mRNA  127849  128478  . + . ID=nbis-mrna-1;Parent=g3;gene_id=g3;transcript_id=g3.t1
scaffold_1  GeneMark.hmm3 transcript  127849  128478  . + . ID=g3.t1;Parent=g3
scaffold_1  GeneMark.hmm3 exon  127849  128478  . + 0 ID=exon-26;Parent=g3.t1;gene_id=g3;transcript_id=g3.t1
scaffold_1  GeneMark.hmm3 CDS 127849  128478  . + 0 ID=cds-26;Parent=g3.t1;gene_id=g3;transcript_id=g3.t1
scaffold_1  GeneMark.hmm3 start_codon 127849  127851  . + 0 ID=start_codon-4;Parent=g3.t1;gene_id=g3;transcript_id=g3.t1
scaffold_1  GeneMark.hmm3 stop_codon  128476  128478  . + 0 ID=stop_codon-4;Parent=g3.t1;gene_id=g3;transcript_id=g3.t1

My reading of this is that the transcript record is a duplicate of mRNA and could be deleted.

adf-ncgr · 2024-02-17T16:06:28Z

Thanks, this helps clarify. I think I've seen "nbis" appearing in other files occasionally too. In any case, I'll delete the "nbis" records and retain the others to keep the naming consistent (and switch "transcript" -> "mRNA").
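A rough sketch of that cleanup as a line filter follows. It is an illustration rather than the exact commands used: `clean_gff3` is a hypothetical name, the `nbis-mrna-` pattern is inferred from the single example in this thread, and it assumes the dropped records have no children. The odd nbis-gene-1 record is not handled here and would still need a manual fix.

```python
def clean_gff3(lines):
    """Yield GFF3 lines with AGAT-added 'nbis' mRNA records dropped and
    remaining 'transcript' features retyped as 'mRNA'."""
    for line in lines:
        text = line.rstrip("\n")
        cols = text.split("\t")
        if text.startswith("#") or len(cols) != 9:
            yield text  # pass directives, comments, and odd lines through
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        # Assumption: duplicate records are exactly the mRNAs whose ID
        # contains 'nbis-mrna-'.
        if cols[2] == "mRNA" and "nbis-mrna-" in attrs.get("ID", ""):
            continue
        if cols[2] == "transcript":
            cols[2] = "mRNA"
        yield "\t".join(cols)
```

Matching on the parsed ID rather than on the raw line avoids accidentally dropping child features that merely mention an nbis identifier in their Parent attribute.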

StevenCannon-USDA commented Feb 13, 2024

Main steps for adding new genome and annotation collections

Genus/species/collection names:
