Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add gene identifier column to pan-gene set .hsh file? #45

Open
sammyjava opened this issue Nov 13, 2023 · 14 comments
Open

Add gene identifier column to pan-gene set .hsh file? #45

sammyjava opened this issue Nov 13, 2023 · 14 comments
Labels
enhancement New feature or request

Comments

@sammyjava
Copy link
Contributor

A conversation amongst NCGR legumistas concluded that it would be nice to have gene identifiers in the pan-gene set collections. This would be easily enabled with a gene identifier column in the .hsh file.

@sammyjava sammyjava added the enhancement New feature or request label Nov 13, 2023
@StevenCannon-USDA
Copy link
Contributor

@sammyjava Can you give an example? Here is the status quo:

zcat < Glycine.pan4.RK4P.hsh.tsv.gz | head
Glycine.pan4.pan00001	glyma.FiskebyIII.gnm1.ann1.GlymaFiskIII.06G261100.1
Glycine.pan4.pan00001	glyma.FiskebyIII.gnm1.ann1.GlymaFiskIII.06G262900.1
Glycine.pan4.pan00001	glyma.FiskebyIII.gnm1.ann1.GlymaFiskIII.06G261700.1
Glycine.pan4.pan00001	glyma.FiskebyIII.gnm1.ann1.GlymaFiskIII.06G262300.1
Glycine.pan4.pan00001	glyma.FiskebyIII.gnm1.ann1.GlymaFiskIII.06G263500.1
Glycine.pan4.pan00001	glyma.FiskebyIII.gnm1.ann1.GlymaFiskIII.06G264200.1
Glycine.pan4.pan00001	glyma.Hefeng25_IGA1002.gnm1.ann1.SoyHF25_06G281800.1
Glycine.pan4.pan00001	glyma.Huaxia3_IGA1007.gnm1.ann1.SoyHX3_06G278500.1
Glycine.pan4.pan00001	glyma.Huaxia3_IGA1007.gnm1.ann1.SoyHX3_06G279500.1
Glycine.pan4.pan00001	glyma.Jinyuan_IGA1006.gnm1.ann1.SoyJY_06G280300.1

@sammyjava
Copy link
Contributor Author

Glycine.pan4.pan00001	glyma.FiskebyIII.gnm1.ann1.GlymaFiskIII.06G261100.1  glyma.FiskebyIII.gnm1.ann1.GlymaFiskIII.06G261100
Glycine.pan4.pan00001	glyma.FiskebyIII.gnm1.ann1.GlymaFiskIII.06G262900.1  glyma.FiskebyIII.gnm1.ann1.GlymaFiskIII.06G262900

Simple in this example but we have cases where the transcript identifiers are NOT a number appended to the gene identifier, like

Arachis.pan2.pan00001   arahy.BaileyII.gnm1.ann1.IDmodified-mrna-2217   arahy.BaileyII.gnm1.ann1.IDmodified-gene-2104

@sammyjava
Copy link
Contributor Author

sammyjava commented Nov 13, 2023

Note: this isn't needed for mine loading, since I already know which genes go with which transcripts from the GFFs. It's more about ease of use for users, such as the GCV.

@StevenCannon-USDA
Copy link
Contributor

I see.
Will need to think about how to do it robustly.

@sammyjava
Copy link
Contributor Author

Yeah, I think the only way right now is to use the transcript-gene child-parent relation in the GFF. I am updating my mine loader to do that properly in a post-processor (just load transcripts from the pan-gene sets and then find the genes and proteins in post).

@adf-ncgr
Copy link
Contributor

FWIW, when I need to do something similar in the context of gene family assignments, I extract a transcript2gene lookup from the gff (since mRNA records contain both their ID and the Parent attribute, it is pretty straightforward, at least assuming the order of ID and Parent attributes doesn't get switched around...).

@sammyjava
Copy link
Contributor Author

I'm curious as to what "extract a transcript2gene lookup" means. Does that refer to a program, like gffread? Or something you've hacked?

@adf-ncgr
Copy link
Contributor

Completely the latter. From a comment to one of the scripts that uses such a thing:

#if you need a transcript2gene lookup, something like this should suffice
#zgrep '  mRNA    ' medtr.HM004.gnm1.ann1.2XTB.gene_models_main.gff3.gz | sed 's/.*ID=\([^;]*\).*Parent=\([^;]*\).*/\1\t\2/' > transcript2gene

@StevenCannon-USDA
Copy link
Contributor

StevenCannon-USDA commented Nov 27, 2023

Having given this some more thought ...

This would be easily enabled with a gene identifier column in the .hsh file.

My preference is to do it as a separate transcript-gene hash file. Two reasons: first, the pangene "hsh.tsv" file is the end product of a lengthy process, and I don't want to carry the extra information (gene ID) through that process; and second, I think of a hash file as having two fields (definitionally).

Admittedly, neither of these reasons is particularly strong -- but I don't think of any good reason not to store the transcript-gene mapping in a separate file.

Implementation: I propose re-generating all BED files in the Data Store annotations, to have seven columns:
molecule, feature-start, feature-end, mRNA-ID, score(0), strand, gene-ID
The seventh column (gene-ID) would be nonstandard for BED, but would be an innocuous addition I think. It would be the per-specification BED6, with an extra seventh column holding the gene-ID.

This would have the advantage of making another place to find this mapping easily (it would become one of the standard files in an annotation set); and it would also allow regularizing the BED files in the Data Store. They are currently without standard names:

find . -name "*bed.gz" | perl -pe 's/.+\.(\w+\.bed)\.gz/$1/' | sort | uniq -c
  67 cds.bed
   1 gene_models_lowqual.bed
  54 gene_models_main.bed
   1 transdec_gff3.bed

In the update, all would be gene_models_main.bed

Then, during creation of a pangene or gene family set, the transcript-gene hash will just be derived from columns 4 and 7 in the BED files.

@sammyjava
Copy link
Contributor Author

No opinion here. Like I said, I use the annotation GFFs to connect transcripts to genes (and proteins, when they have the same identifier), so the extra data in the pan-gene set collections isn't needed for mine loading, but I do get that people like to see gene names pretty much everywhere.

@adf-ncgr
Copy link
Contributor

@StevenCannon-USDA I have no objection to augmenting the bed files with genes and standardizing naming.
But when you say

Then, during creation of a pangene or gene family set, the transcript-gene hash will just be derived from columns 4 and 7 in the BED files.
does this imply you would use that hash to add a representation of the genes somewhere in the pangene set files too? Because as @sammyjava says "people like to see gene names pretty much everywhere"- and I just know you want to store a reference to an array as values in your hashes... ;)

Seriously, though, on analogy with the gene family assignment files I could at least imagine having some metadata about a gene's inclusion in a pangene set and storing it there (similar to how we store goodness of match to the HMM in the gfa files). Some off-the-cuff examples of such metadata that come to mind (none of them very compelling, but maybe could inspire some other thoughts about it)

  • whether a given gene was chosen as the exemplar (of course, this also appears in the fastas)
  • the transcript length (obviously available elsewhere but possibly of use for assessing allelic variation at a very coarse level)
  • a hash based on the sequence of the transcript and protein sequences (again, a coarse way of representing whether there is some sort of sequence variation among the member sequences)

pretty much just thinking out loud there, and don't feel strongly about any of it.

@StevenCannon-USDA
Copy link
Contributor

does this imply you would use that hash to add a representation of the genes somewhere in the pangene set files too?

Yes. Something like the following (contrived, but with real data).

for file in *bed.gz; do zcat $file | head -1 | cut -f4,7; done
arath.Col0.gnm9.ann11.AT2G44175.1	arath.Col0.gnm9.ann11.AT2G44175
glyma.Wm82.gnm4.ann1.Glyma.20G155500.1	glyma.Wm82.gnm4.ann1.Glyma.20G155500
medtr.A17_HM341.gnm4.ann2.Medtr3g070500.3	medtr.A17_HM341.gnm4.ann2.Medtr3g070500
phavu.G19833.gnm2.ann1.Phvul.005G131400.1	phavu.G19833.gnm2.ann1.Phvul.005G131400
pissa.Cameor.gnm1.ann1.Psat0s798g0080.1	pissa.Cameor.gnm1.ann1.Psat0s798g0080
prupe.Lovell.gnm2.ann1.Prupe.7G026800.1	prupe.Lovell.gnm2.ann1.Prupe.7G026800
sento.Myeongyun.gnm1.ann1.Sto13g434570	sento.Myeongyun.gnm1.Sto13g434570
singl.CAF01.gnm1.ann1.evm_model_Chr03_3668	singl.CAF01.gnm1.ann1.evm_TU_Chr03_3668
vigun.IT97K-499-35.gnm1.ann2.Vigun06g214150.1	vigun.IT97K-499-35.gnm1.ann2.Vigun06g214150
vitvi.PN40024.gnm2.ann1.VIT_200s0271g00070.1	vitvi.PN40024.gnm2.ann1.VIT_200s0271g00070

In this sample, all are trivial except Sindora (singl). As you point out, it needn't be restricted to two columns - though this example is. It could be called e.g. transcript-gene.tsv if it were simply a map/hash; or if it held additional information, it could be called e.g. transcript-info.tsv.

@StevenCannon-USDA
Copy link
Contributor

Progress on this task: I have regenerated the bed files for all annotation collections (121 currently).
Now, each such collection has a bed.gz file with seven columns. The fourth has the transcript name and the seventh has the gene name. For example, zcat vigun.ZN016.gnm1.ann2.C7YV.gene_models_main.bed.gz | head -4

vigun.ZN016.gnm1.chr1	13793	14745	vigun.ZN016.gnm1.ann2.VuZN016.01G000100.2	0	-	vigun.ZN016.gnm1.ann2.VuZN016.01G000100
vigun.ZN016.gnm1.chr1	13793	14745	vigun.ZN016.gnm1.ann2.VuZN016.01G000100.3	0	-	vigun.ZN016.gnm1.ann2.VuZN016.01G000100
vigun.ZN016.gnm1.chr1	13793	15571	vigun.ZN016.gnm1.ann2.VuZN016.01G000100.1	0	-	vigun.ZN016.gnm1.ann2.VuZN016.01G000100
vigun.ZN016.gnm1.chr1	16175	19400	vigun.ZN016.gnm1.ann2.VuZN016.01G000200.1	0	+	vigun.ZN016.gnm1.ann2.VuZN016.01G000200

I'll also try to incorporate the transcript name -- gene name correspondence in the pan-gene sets when I recalculate those.

@sammyjava
Copy link
Contributor Author

Sounds like with the current timing I'll be able to load genes and proteins from the new pan-gene sets in 5.1.0.4. Not that it matters much, since I currently find the genes and proteins with a post-processor, but nice to do it in one swell foop.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants