Add gene identifier column to pan-gene set .hsh file? #45

sammyjava · 2023-11-13T18:17:39Z

A conversation amongst NCGR legumistas concluded that it would be nice to have gene identifiers in the pan-gene set collections. This would be easily enabled with a gene identifier column in the .hsh file.

StevenCannon-USDA · 2023-11-13T19:03:36Z

@sammyjava Can you give an example? Here is the status quo:

zcat < Glycine.pan4.RK4P.hsh.tsv.gz | head
Glycine.pan4.pan00001	glyma.FiskebyIII.gnm1.ann1.GlymaFiskIII.06G261100.1
Glycine.pan4.pan00001	glyma.FiskebyIII.gnm1.ann1.GlymaFiskIII.06G262900.1
Glycine.pan4.pan00001	glyma.FiskebyIII.gnm1.ann1.GlymaFiskIII.06G261700.1
Glycine.pan4.pan00001	glyma.FiskebyIII.gnm1.ann1.GlymaFiskIII.06G262300.1
Glycine.pan4.pan00001	glyma.FiskebyIII.gnm1.ann1.GlymaFiskIII.06G263500.1
Glycine.pan4.pan00001	glyma.FiskebyIII.gnm1.ann1.GlymaFiskIII.06G264200.1
Glycine.pan4.pan00001	glyma.Hefeng25_IGA1002.gnm1.ann1.SoyHF25_06G281800.1
Glycine.pan4.pan00001	glyma.Huaxia3_IGA1007.gnm1.ann1.SoyHX3_06G278500.1
Glycine.pan4.pan00001	glyma.Huaxia3_IGA1007.gnm1.ann1.SoyHX3_06G279500.1
Glycine.pan4.pan00001	glyma.Jinyuan_IGA1006.gnm1.ann1.SoyJY_06G280300.1

sammyjava · 2023-11-13T20:00:06Z

Glycine.pan4.pan00001	glyma.FiskebyIII.gnm1.ann1.GlymaFiskIII.06G261100.1  glyma.FiskebyIII.gnm1.ann1.GlymaFiskIII.06G261100
Glycine.pan4.pan00001	glyma.FiskebyIII.gnm1.ann1.GlymaFiskIII.06G262900.1  glyma.FiskebyIII.gnm1.ann1.GlymaFiskIII.06G262900

Simple in this example but we have cases where the transcript identifiers are NOT a number appended to the gene identifier, like

Arachis.pan2.pan00001   arahy.BaileyII.gnm1.ann1.IDmodified-mrna-2217   arahy.BaileyII.gnm1.ann1.IDmodified-gene-2104

sammyjava · 2023-11-13T20:01:04Z

Note: this isn't needed for mine loading, since I already know which genes go with which transcripts from the GFFs. It's more about ease of use for users, such as the GCV.

StevenCannon-USDA · 2023-11-13T20:18:39Z

I see.
Will need to think about how to do it robustly.

sammyjava · 2023-11-13T20:47:51Z

Yeah, I think the only way right now is to use the transcript-gene child-parent relation in the GFF. I am updating my mine loader to do that properly in a post-processor (just load transcripts from the pan-gene sets and then find the genes and proteins in post).

adf-ncgr · 2023-11-13T20:51:04Z

FWIW, when I need to do something similar in the context of gene family assignments, I extract a transcript2gene lookup from the gff (since mRNA records contain both their ID and the Parent attribute, it is pretty straightforward, at least assuming the order of ID and Parent attributes doesn't get switched around...).

sammyjava · 2023-11-14T15:41:19Z

I'm curious as to what "extract a transcript2gene lookup" means. Does that refer to a program, like gffread? Or something you've hacked?

adf-ncgr · 2023-11-14T15:58:14Z

Completely the latter. From a comment to one of the scripts that uses such a thing:

#if you need a transcript2gene lookup, something like this should suffice
#zgrep '  mRNA    ' medtr.HM004.gnm1.ann1.2XTB.gene_models_main.gff3.gz | sed 's/.*ID=\([^;]*\).*Parent=\([^;]*\).*/\1\t\2/' > transcript2gene
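
If the order of the ID and Parent attributes is a concern, the same lookup can be built by reading each attribute separately. This is only a minimal sketch, assuming a standard nine-column, tab-separated GFF3 with ID= and Parent= in column 9. The function name transcript2gene is illustrative, not an existing tool. Matching on column 3 also avoids depending on literal tab characters in a grep pattern.

```shell
# Hypothetical helper: read GFF3 on stdin and print "transcript<TAB>gene"
# for every mRNA record, regardless of the order of ID= and Parent=.
transcript2gene() {
  awk -F'\t' -v OFS='\t' '
    $3 == "mRNA" {
      id = ""; parent = ""
      n = split($9, attrs, ";")
      for (i = 1; i <= n; i++) {
        if (attrs[i] ~ /^ID=/)     id = substr(attrs[i], 4)      # text after "ID="
        if (attrs[i] ~ /^Parent=/) parent = substr(attrs[i], 8)  # text after "Parent="
      }
      if (id != "" && parent != "") print id, parent
    }'
}

# Usage, with the file name from the comment above:
# zcat < medtr.HM004.gnm1.ann1.2XTB.gene_models_main.gff3.gz | transcript2gene > transcript2gene
```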

StevenCannon-USDA · 2023-11-27T16:22:14Z

Having given this some more thought ...

"This would be easily enabled with a gene identifier column in the .hsh file."

My preference is to do it as a separate transcript-gene hash file. Two reasons: first, the pangene "hsh.tsv" file is the end product of a lengthy process, and I don't want to carry the extra information (gene ID) through that process; and second, I think of a hash file as having two fields (definitionally).

Admittedly, neither of these reasons is particularly strong -- but I can't think of any good reason not to store the transcript-gene mapping in a separate file.

Implementation: I propose re-generating all BED files in the Data Store annotations, to have seven columns:
molecule, feature-start, feature-end, mRNA-ID, score(0), strand, gene-ID
The seventh column (gene-ID) would be nonstandard for BED, but would be an innocuous addition I think. It would be the per-specification BED6, with an extra seventh column holding the gene-ID.
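
To make the seven-column layout concrete, here is a rough sketch of deriving it from the mRNA records of a GFF3. It is not necessarily how the Data Store BED files will actually be generated. It assumes ID= and Parent= are present in column 9 and that the BED start is the GFF3 start minus one, since BED coordinates are 0-based and half-open. The function name gff_to_bed7 is made up for illustration.

```shell
# Hypothetical helper: read GFF3 on stdin, write BED6 plus a seventh gene-ID column.
gff_to_bed7() {
  awk -F'\t' -v OFS='\t' '
    $3 == "mRNA" {
      id = ""; parent = ""
      n = split($9, attrs, ";")
      for (i = 1; i <= n; i++) {
        if (attrs[i] ~ /^ID=/)     id = substr(attrs[i], 4)
        if (attrs[i] ~ /^Parent=/) parent = substr(attrs[i], 8)
      }
      # Columns: molecule, start (0-based), end, mRNA-ID, score(0), strand, gene-ID
      if (id != "" && parent != "") print $1, $4 - 1, $5, id, 0, $7, parent
    }'
}

# Usage (file names are placeholders):
# zcat < annotation.gff3.gz | gff_to_bed7 | gzip > gene_models_main.bed.gz
```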

This would have the advantage of making another place to find this mapping easily (it would become one of the standard files in an annotation set); and it would also allow regularizing the BED files in the Data Store. They are currently without standard names:

find . -name "*bed.gz" | perl -pe 's/.+\.(\w+\.bed)\.gz/$1/' | sort | uniq -c
  67 cds.bed
   1 gene_models_lowqual.bed
  54 gene_models_main.bed
   1 transdec_gff3.bed

In the update, all would be gene_models_main.bed

Then, during creation of a pangene or gene family set, the transcript-gene hash will just be derived from columns 4 and 7 in the BED files.
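
As a sketch of that last step, assuming the seven-column BED layout proposed above: keep columns 4 and 7 and drop any repeated transcript IDs. The function name bed_to_transcript_gene is hypothetical.

```shell
# Hypothetical helper: read seven-column BED on stdin and print
# "transcript<TAB>gene" once per transcript, in input order.
bed_to_transcript_gene() {
  awk -F'\t' -v OFS='\t' 'NF >= 7 && !seen[$4]++ { print $4, $7 }'
}

# Usage (placeholder file name):
# zcat < some.gene_models_main.bed.gz | bed_to_transcript_gene > transcript-gene.tsv
```

Rows with fewer than seven columns are skipped silently, so an old six-column BED file would yield an empty mapping rather than a wrong one.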

sammyjava · 2023-11-27T17:30:22Z

No opinion here. Like I said, I use the annotation GFFs to connect transcripts to genes (and proteins, when they have the same identifier), so the extra data in the pan-gene set collections isn't needed for mine loading, but I do get that people like to see gene names pretty much everywhere.

adf-ncgr · 2023-11-27T21:01:26Z

@StevenCannon-USDA I have no objection to augmenting the bed files with genes and standardizing naming.
But when you say

"Then, during creation of a pangene or gene family set, the transcript-gene hash will just be derived from columns 4 and 7 in the BED files."

does this imply you would use that hash to add a representation of the genes somewhere in the pangene set files too? Because as @sammyjava says, "people like to see gene names pretty much everywhere", and I just know you want to store a reference to an array as values in your hashes... ;)

Seriously, though, by analogy with the gene family assignment files, I could at least imagine having some metadata about a gene's inclusion in a pangene set and storing it there (similar to how we store goodness of match to the HMM in the gfa files). Some off-the-cuff examples of such metadata that come to mind (none of them very compelling, but they might inspire some other thoughts about it):

whether a given gene was chosen as the exemplar (of course, this also appears in the fastas)
the transcript length (obviously available elsewhere but possibly of use for assessing allelic variation at a very coarse level)
a hash based on the sequence of the transcript and protein sequences (again, a coarse way of representing whether there is some sort of sequence variation among the member sequences)

pretty much just thinking out loud there, and don't feel strongly about any of it.

StevenCannon-USDA · 2023-11-27T21:26:47Z

"does this imply you would use that hash to add a representation of the genes somewhere in the pangene set files too?"

Yes. Something like the following (contrived, but with real data).

for file in *bed.gz; do zcat $file | head -1 | cut -f4,7; done
arath.Col0.gnm9.ann11.AT2G44175.1	arath.Col0.gnm9.ann11.AT2G44175
glyma.Wm82.gnm4.ann1.Glyma.20G155500.1	glyma.Wm82.gnm4.ann1.Glyma.20G155500
medtr.A17_HM341.gnm4.ann2.Medtr3g070500.3	medtr.A17_HM341.gnm4.ann2.Medtr3g070500
phavu.G19833.gnm2.ann1.Phvul.005G131400.1	phavu.G19833.gnm2.ann1.Phvul.005G131400
pissa.Cameor.gnm1.ann1.Psat0s798g0080.1	pissa.Cameor.gnm1.ann1.Psat0s798g0080
prupe.Lovell.gnm2.ann1.Prupe.7G026800.1	prupe.Lovell.gnm2.ann1.Prupe.7G026800
sento.Myeongyun.gnm1.ann1.Sto13g434570	sento.Myeongyun.gnm1.Sto13g434570
singl.CAF01.gnm1.ann1.evm_model_Chr03_3668	singl.CAF01.gnm1.ann1.evm_TU_Chr03_3668
vigun.IT97K-499-35.gnm1.ann2.Vigun06g214150.1	vigun.IT97K-499-35.gnm1.ann2.Vigun06g214150
vitvi.PN40024.gnm2.ann1.VIT_200s0271g00070.1	vitvi.PN40024.gnm2.ann1.VIT_200s0271g00070

In this sample, all are trivial except Sindora (singl). As you point out, it needn't be restricted to two columns - though this example is. It could be called e.g. transcript-gene.tsv if it were simply a map/hash; or if it held additional information, it could be called e.g. transcript-info.tsv.

StevenCannon-USDA · 2023-12-19T02:00:26Z

Progress on this task: I have regenerated the bed files for all annotation collections (121 currently).
Now, each such collection has a bed.gz file with seven columns. The fourth has the transcript name and the seventh has the gene name. For example, zcat vigun.ZN016.gnm1.ann2.C7YV.gene_models_main.bed.gz | head -4

vigun.ZN016.gnm1.chr1	13793	14745	vigun.ZN016.gnm1.ann2.VuZN016.01G000100.2	0	-	vigun.ZN016.gnm1.ann2.VuZN016.01G000100
vigun.ZN016.gnm1.chr1	13793	14745	vigun.ZN016.gnm1.ann2.VuZN016.01G000100.3	0	-	vigun.ZN016.gnm1.ann2.VuZN016.01G000100
vigun.ZN016.gnm1.chr1	13793	15571	vigun.ZN016.gnm1.ann2.VuZN016.01G000100.1	0	-	vigun.ZN016.gnm1.ann2.VuZN016.01G000100
vigun.ZN016.gnm1.chr1	16175	19400	vigun.ZN016.gnm1.ann2.VuZN016.01G000200.1	0	+	vigun.ZN016.gnm1.ann2.VuZN016.01G000200

I'll also try to incorporate the transcript name -- gene name correspondence in the pan-gene sets when I recalculate those.
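
Until then, anyone who wants the three-column view can join an existing hsh file against a transcript-gene map themselves. The following is only a sketch under assumptions: both files are two-column TSVs, the map is keyed by transcript ID, and transcripts missing from the map get a placeholder "NA". The function name add_gene_column is invented here and is not part of the pan-gene pipeline.

```shell
# Hypothetical helper.
#   $1 = transcript-gene map (transcript<TAB>gene)
#   $2 = uncompressed pan-gene hsh file (panID<TAB>transcript)
# Prints panID, transcript, gene.
add_gene_column() {
  awk -F'\t' -v OFS='\t' '
    FILENAME == ARGV[1] { gene[$1] = $2; next }   # load the map first
    { print $1, $2, (($2 in gene) ? gene[$2] : "NA") }
  ' "$1" "$2"
}

# Usage (placeholder file names):
# add_gene_column transcript-gene.tsv Glycine.pan4.RK4P.hsh.tsv > hsh_with_genes.tsv
```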

sammyjava · 2023-12-19T14:21:53Z

Sounds like with the current timing I'll be able to load genes and proteins from the new pan-gene sets in 5.1.0.4. Not that it matters much, since I currently find the genes and proteins with a post-processor, but nice to do it in one swell foop.

sammyjava added the enhancement New feature or request label Nov 13, 2023
