Support species besides human? #4

dhimmel · 2021-10-12T20:48:15Z

@ACastanza asked:

> Versions for Mouse and Rat as well?

We should think about support for additional species, since it appears there will at least be demand for it.

dhimmel · 2021-10-12T20:50:29Z

There are at least two places where humanity is assumed:

ensembl-genes/src/ensembl_genes.py

Line 43 in 627b264

species: str = "homo_sapiens"

ensembl-genes/src/ensembl_genes.py

Lines 97 to 98 in 627b264

    
    mhc = pd.Interval(left=28_510_120, right=33_480_577, closed="both")
    xmhc = pd.Interval(left=25_726_063, right=33_410_226, closed="both")
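
One way to remove the hard-coded human intervals is to attach them to a per-species configuration and leave them unset where no coordinates are known. The following is only a minimal sketch: `SpeciesConfig`, `SPECIES`, and `in_region` are hypothetical names, not the module that exists in the repository, and plain `(left, right)` tuples stand in for `pd.Interval(closed="both")` so the example has no dependencies. The chromosome check is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# (left, right) coordinates, both ends inclusive.
# A stand-in for pd.Interval(..., closed="both") in the original code.
Region = Tuple[int, int]


@dataclass(frozen=True)
class SpeciesConfig:
    """Hypothetical per-species settings (illustration only)."""

    name: str
    mhc: Optional[Region] = None  # None when no coordinates are known
    xmhc: Optional[Region] = None


SPECIES = {
    "homo_sapiens": SpeciesConfig(
        name="homo_sapiens",
        mhc=(28_510_120, 33_480_577),
        xmhc=(25_726_063, 33_410_226),
    ),
    # No mouse or rat coordinates are given in this thread, so they stay unset.
    "mus_musculus": SpeciesConfig(name="mus_musculus"),
    "rattus_norvegicus": SpeciesConfig(name="rattus_norvegicus"),
}


def in_region(position: int, region: Optional[Region]) -> Optional[bool]:
    """Return whether position lies in region, or None if the region is undefined."""
    if region is None:
        return None
    left, right = region
    return left <= position <= right
```

Returning `None` instead of `False` for an undefined region keeps "not in the MHC" distinct from "MHC not defined for this species".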

ACastanza · 2021-10-12T21:06:54Z

ensembl-genes/src/ensembl_genes.py

Line 11 in 627b264

ensembl_human_gene_pattern = r"^ENSG[0-9]{11}$"

Other species would also need their own patterns. I think they would be r"^ENSMUSG[0-9]{11}$" for mouse and r"^ENSRNOG[0-9]{11}$" for rat.
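
As a rough sketch of how the patterns above could be selected by species: the human pattern is the one quoted from the source, the mouse and rat patterns are the ones proposed in this comment, and `GENE_ID_PATTERNS` and `is_valid_gene_id` are hypothetical names used only for illustration.

```python
import re

# Gene ID patterns keyed by Ensembl species name.
# Human is taken from the source file; mouse and rat follow the suggestion above.
GENE_ID_PATTERNS = {
    "homo_sapiens": r"^ENSG[0-9]{11}$",
    "mus_musculus": r"^ENSMUSG[0-9]{11}$",
    "rattus_norvegicus": r"^ENSRNOG[0-9]{11}$",
}


def is_valid_gene_id(gene_id: str, species: str) -> bool:
    """Check a gene ID against the pattern registered for the given species."""
    try:
        pattern = GENE_ID_PATTERNS[species]
    except KeyError:
        raise ValueError(f"no gene ID pattern configured for {species!r}") from None
    return re.match(pattern, gene_id) is not None
```

Raising on an unknown species, rather than returning `False`, makes a missing pattern visible instead of silently rejecting every ID.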


dhimmel · 2021-10-14T20:54:09Z

Have made lots of progress here, such that we can now configure the species.

One remaining task is to switch to output branches that include the species. I am thinking of changing the output branches from output-104 to output_homo_sapiens_core_104_38, so that they include the full database name. This will lead to URLs that are a bit longer and more unwieldy, but it will better match the design in which each export corresponds to a single core database.
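
To illustrate what the full database name carries, here is a small sketch that splits a name such as homo_sapiens_core_104_38 into its parts and derives the proposed branch name. It assumes the trailing number is the assembly version (as this thread interprets it) and only recognizes the database types that appear in the human listing in this comment; `DatabaseName`, `parse_database_name`, and `output_branch` are hypothetical helpers, not functions from the repository.

```python
import re
from typing import NamedTuple


class DatabaseName(NamedTuple):
    species: str
    db_type: str
    release: int
    assembly: str


# Assumption: names look like <species>_<type>_<release>_<assembly>, with the
# type restricted to those in the human listing in this comment.
_DB_NAME = re.compile(
    r"^(?P<species>[a-z0-9_]+)_"
    r"(?P<db_type>cdna|core|funcgen|otherfeatures|rnaseq|variation)_"
    r"(?P<release>[0-9]+)_(?P<assembly>[0-9a-z]+)$"
)


def parse_database_name(name: str) -> DatabaseName:
    """Split an Ensembl database name into species, type, release, and assembly."""
    match = _DB_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized database name: {name!r}")
    return DatabaseName(
        species=match["species"],
        db_type=match["db_type"],
        release=int(match["release"]),
        assembly=match["assembly"],
    )


def output_branch(name: str) -> str:
    """Return the proposed output branch name for a database name."""
    parse_database_name(name)  # validate before building the branch name
    return f"output_{name}"
```

For example, `parse_database_name("homo_sapiens_core_104_38")` yields the species `homo_sapiens`, type `core`, release `104`, and assembly `"38"`.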

I believe the full list of current databases is at http://ftp.ensembl.org/pub/current_mysql/ For humans, the current set of dbs is:

homo_sapiens_cdna_104_38/                          08-Apr-2021 11:23                   -
homo_sapiens_core_104_38/                          08-Apr-2021 11:30                   -
homo_sapiens_funcgen_104_38/                       08-Apr-2021 13:43                   -
homo_sapiens_otherfeatures_104_38/                 08-Apr-2021 11:35                   -
homo_sapiens_rnaseq_104_38/                        08-Apr-2021 11:21                   -
homo_sapiens_variation_104_38/                     09-Apr-2021 13:38                   -

I wonder if we will ever need to query across several of these databases (and not just core)?

Another option would be to have multiple directories under each release, one per species. This would result in fewer branches but would require exporting all species at once, which doesn't seem ideal.

ACastanza · 2021-10-14T20:57:56Z

I would lean towards folders. It would keep releases together better, and in my use case (which is admittedly not the typical one) I'd want to take the information from all species for a given release. Also, the release on Ensembl's side happens at the same time for all species. I suppose it would be a little unwieldy for someone who was triggering their own build and wanted only (for example) human data, but I don't think it would be too bad with only Human/Mouse/Rat.

I think querying the other databases would need its own set of special-purpose scripts.

dhimmel · 2021-10-14T21:25:09Z

> I don't think it would be too bad with only Human/Mouse/Rat.

It looks like there are currently 310 species (core databases). And if these datasets become popular, I could see requests for broader species coverage.

> In my use-case (which is admittedly not the typical one) I'd want to take the information from all species for a given release.

I think we could pretty easily write a script that downloads multiple species.

One last question that might help us decide concerns the reference genome / assembly. It seems that databases are sometimes released for multiple assemblies. See http://ftp.ensembl.org/pub/grch37/current/mysql/: there are currently both homo_sapiens_core_104_37 and homo_sapiens_core_104_38 databases. Are we ever going to need to support more than a single assembly for a given Ensembl release?

Another question: should I rename species.reference_genome to species.assembly / species.genome_assembly? What is the most accurate term for this field?

ACastanza · 2021-10-14T21:38:25Z

Separate branches for each species for each release would be a huge mess if this expands to any particularly large selection of the available species. If anything, that seems to me to be a stronger argument for folders within a given release. Maybe there's some way to split the difference, like branches for some taxonomic level with folders under them.

I'm sure that someone would find annotations for current genes on the old assembly useful. I know clinicians in particular are, let's say, reluctant to move on. But GRCh38 has been out for so long that people really need to move on from 37. The bigger question is the forward-looking one. At some point there will be a GRCh39; the holdup there has been end-to-end gap closure for all the chromosomes, and I think that's just about there. When that does finally happen, there will be a transition period during which it will be reasonable to ask for annotations on both assemblies.

Similarly, there was a recent transition from GRCm38 to GRCm39, and I know of at least one use case for the liftover database in mouse: GRCm39 did not (last I checked) have chromosomal karyotype band information available. Someone interested in looking at the genes in those coordinates might want to use the latest gene build with the old assembly.

Also, WRT renaming, I think "assembly" is the generally accepted term. I know that GENCODE uses it when referring to a particular build of the genome.


dhimmel · 2021-10-14T22:11:31Z

> I'm sure that someone would find annotations for current genes on the old assembly useful

Thanks for the explanation. It seems likely that we will want to support multiple assemblies per release, especially during transitional periods, even if that deprives us of the satisfaction of forcing others to upgrade 😏. Therefore, I'm inclined to use the database name, like homo_sapiens_core_104_38, as either the species directory or the branch name (depending on how we proceed). This would require users to explicitly choose the assembly when selecting a branch/directory, which is probably good for awareness that some fields are assembly-specific.

> Separate branches for each species for each release would be a huge mess if expanding this to any particularly large selection of the available species

Worst-case scenario: we support all 310 species, there are 4 releases per year, and the life of this project is a decade. Ignoring multiple assemblies for the same release, we'd have 310 × 4 × 10 = 12,400 branches.

https://github.com/flutter/flutter has 265 branches, and the GitHub interface is quite nice for filtering branches by substring match.

I imagine at some point we'll hit limits, not because of the number of branches but because of the cumulative file size in the repo. We could then start deploying these branches to new repos, perhaps one repo per release or release year. Or we might switch to another storage solution.

One challenge with directories is that users who only want a single species might have to download all species. Git's sparseCheckout is one workaround, but it looks a bit technical for many users.

With the branch method, a shell script like the following would get all three directories you'd like next to each other in a local directory.

release=104
databases=( homo_sapiens_core_${release}_38 mus_musculus_core_${release}_39 rattus_norvegicus_core_${release}_6 )
for database in "${databases[@]}"
do
   # remove echo to execute
   echo git clone --branch=output_$database --depth=1 https://github.com/related-sciences/ensembl-genes.git
done

Another challenge with directories is that the CI export jobs will take longer if they have to process multiple species. But I'm still open to this and will think about it some more. Happy to hear any additional feedback.

ACastanza · 2021-10-14T22:17:14Z

Looking at the flutter repo, that seems less bad than I thought it was going to be. The split between active and stale branches keeps it pretty clean, and the search function seems good. The downloading issue with folders is a valid one. Looks like the balance comes down on the side of branches.


dhimmel · 2021-10-15T02:40:12Z

Looks to be working! https://github.com/related-sciences/ensembl-genes/actions/runs/1344344511


dhimmel added a commit that referenced this issue Oct 12, 2021

abstract species to module

37a81fc

refs #4

dhimmel mentioned this issue Oct 12, 2021

add support for rat species #6

Merged

dhimmel added a commit that referenced this issue Oct 12, 2021

abstract species to module

dfa286d

refs #4

dhimmel added a commit that referenced this issue Oct 13, 2021

abstract species to module

925fb98

refs #4

dhimmel added a commit that referenced this issue Oct 13, 2021

abstract species to module

2bcdfff

refs #4

dhimmel added a commit that referenced this issue Oct 14, 2021

rename species.reference_genome to assembly

99201cc

as per generally accepted terminology #4 (comment)

dhimmel added a commit that referenced this issue Oct 15, 2021

switch output branch to output/database format

72ab359

refs #4

dhimmel added a commit that referenced this issue Oct 15, 2021

switch output branch to output/database format

ba949c6

refs #4

dhimmel added a commit that referenced this issue Oct 15, 2021

switch output branch to output/database format

db91405

refs #4

dhimmel added a commit that referenced this issue Oct 15, 2021

CI export: species matrix

9738939

refs #4

dhimmel added a commit that referenced this issue Oct 15, 2021

readme: species support

f7a0ba0

#4

dhimmel closed this as completed Oct 15, 2021

dhimmel mentioned this issue Oct 15, 2021

MHC / xMHC genomic coordinates for rat & mouse #8

Open
