Support species besides human? #4

Closed
dhimmel opened this issue Oct 12, 2021 · 9 comments

@dhimmel
Member

dhimmel commented Oct 12, 2021

@ACastanza asked:

Versions for Mouse and Rat as well?

We should think about support for additional species, since it appears there will at least be demand for it.

@dhimmel
Member Author

dhimmel commented Oct 12, 2021

There are at least two places where humanity is assumed:

species: str = "homo_sapiens"

mhc = pd.Interval(left=28_510_120, right=33_480_577, closed="both")
xmhc = pd.Interval(left=25_726_063, right=33_410_226, closed="both")

@ACastanza
Contributor

ensembl_human_gene_pattern = r"^ENSG[0-9]{11}$"

We would also need a pattern for other species. I think it would be r"^ENSMUSG[0-9]{11}$" for mouse and r"^ENSRNOG[0-9]{11}$" for rat.
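
As a minimal sketch of how per-species patterns might be organized (not the project's actual code), the lookup below keys the regexes by species name. The dictionary `ENSEMBL_GENE_PATTERNS` and the helper `is_valid_gene_id` are hypothetical names introduced for illustration; the mouse and rat patterns are the ones suggested in this comment.

```python
import re

# Hypothetical registry keyed by Ensembl species name.
# The human pattern is quoted in this thread; the mouse and rat
# patterns are the ones proposed above.
ENSEMBL_GENE_PATTERNS = {
    "homo_sapiens": re.compile(r"^ENSG[0-9]{11}$"),
    "mus_musculus": re.compile(r"^ENSMUSG[0-9]{11}$"),
    "rattus_norvegicus": re.compile(r"^ENSRNOG[0-9]{11}$"),
}


def is_valid_gene_id(species: str, gene_id: str) -> bool:
    """Return True when gene_id matches the pattern registered for species."""
    try:
        pattern = ENSEMBL_GENE_PATTERNS[species]
    except KeyError:
        raise ValueError(f"no gene ID pattern registered for {species!r}") from None
    # fullmatch rejects IDs with a trailing newline, which "$" alone would allow.
    return pattern.fullmatch(gene_id) is not None
```

Raising on an unregistered species, rather than silently returning False, would make a missing pattern visible as soon as a new species is configured.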

dhimmel added a commit that referenced this issue Oct 12, 2021
dhimmel added a commit that referenced this issue Oct 12, 2021
dhimmel added a commit that referenced this issue Oct 13, 2021
dhimmel added a commit that referenced this issue Oct 13, 2021
@dhimmel
Member Author

dhimmel commented Oct 14, 2021

Have made lots of progress here, such that we can now configure the species.

One remaining task is to switch to output branches that include the species. I am thinking of changing the output branches from output-104 to output_homo_sapiens_core_104_38 so that they include the full database name. This will lead to URLs that are a bit longer and more unwieldy, but it will better match the design in which each export corresponds to a single core database.

I believe the full list of current databases is at http://ftp.ensembl.org/pub/current_mysql/. For humans, the current set of dbs is:

homo_sapiens_cdna_104_38/                          08-Apr-2021 11:23                   -
homo_sapiens_core_104_38/                          08-Apr-2021 11:30                   -
homo_sapiens_funcgen_104_38/                       08-Apr-2021 13:43                   -
homo_sapiens_otherfeatures_104_38/                 08-Apr-2021 11:35                   -
homo_sapiens_rnaseq_104_38/                        08-Apr-2021 11:21                   -
homo_sapiens_variation_104_38/                     09-Apr-2021 13:38                   -
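
The names in the listing above appear to follow a species, database type, release, assembly version layout. The sketch below splits a name into those parts and derives the proposed branch name. It assumes that layout holds and that the database type is one of the six listed here; `EnsemblDatabase`, `parse_database_name`, and `output_branch` are hypothetical helpers, not part of the project.

```python
import re
from typing import NamedTuple


class EnsemblDatabase(NamedTuple):
    species: str
    db_type: str
    release: int
    assembly_version: str


# Assumption: database types are limited to those in the listing above.
_DB_NAME = re.compile(
    r"(?P<species>[a-z0-9_]+)_"
    r"(?P<db_type>cdna|core|funcgen|otherfeatures|rnaseq|variation)_"
    r"(?P<release>[0-9]+)_(?P<assembly_version>[0-9a-z]+)"
)


def parse_database_name(name: str) -> EnsemblDatabase:
    """Split a name such as homo_sapiens_core_104_38 into its components."""
    match = _DB_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognized database name: {name!r}")
    return EnsemblDatabase(
        species=match["species"],
        db_type=match["db_type"],
        release=int(match["release"]),
        assembly_version=match["assembly_version"],
    )


def output_branch(name: str) -> str:
    """Return the proposed output branch name for a database."""
    parse_database_name(name)  # validate before building the branch name
    return f"output_{name}"
```

Keeping the full database name in the branch means the release and assembly version can always be recovered from the branch name alone.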

I wonder if we will ever need to query across multiple of these databases (and not just core)?

Another option would be to have multiple directories under each release, one per species. This would result in fewer branches but would require exporting all species at once, which doesn't seem ideal.

@ACastanza
Contributor

ACastanza commented Oct 14, 2021

I would lean towards folders. It would keep releases together better, and in my use case (which is admittedly not the typical one) I'd want to take the information from all species for a given release. Also, the release on Ensembl's side happens at the same time for all species. I suppose it would be a little unwieldy for someone who was triggering their own build and wanted only (for example) human data, but I don't think it would be too bad with only Human/Mouse/Rat.

I think calling the other databases would need their own set of special purpose scripts.

@dhimmel
Member Author

dhimmel commented Oct 14, 2021

> I don't think it would be too bad with only Human/Mouse/Rat.

It looks like there are currently 310 species (core databases). And if these datasets become popular, I could see requests for broader species coverage.

> In my use-case (which is admittedly not the typical one) I'd want to take the information from all species for a given release.

I think we could write a script that would pretty easily download multiple species.

One last question that might help us decide concerns the reference genome / assembly. It seems that databases are sometimes released for multiple assemblies. See http://ftp.ensembl.org/pub/grch37/current/mysql/: there are currently both homo_sapiens_core_104_37 and homo_sapiens_core_104_38 databases. Are we ever going to need to support anything but a single assembly for a given Ensembl release?

Another question: should I rename species.reference_genome to species.assembly / species.genome_assembly? What is the most accurate term for this field?

@ACastanza
Contributor

Separate branches for each species for each release would be a huge mess if expanding this to any particularly large selection of the available species. If anything, that seems to me to be a stronger argument for folders within a given release, although maybe there's some way to split the difference, like branches for some taxonomic level and then folders under it.

I'm sure that someone would find annotations for current genes on the old assembly useful. I know clinicians in particular are, let's say, reluctant to move on. But GRCh38 has been out for so long that people really need to move on from 37. The bigger question is the forward-looking one. At some point there will be a GRCh39 - the holdup there has been end-to-end gap closure for all the chromosomes. I think that's just about there. When that does someday finally happen, there will be a transition period during which it will be reasonable to ask for annotations on both assemblies.

Similarly, there was a recent transition from GRCm38 to m39. I know that there is at least one use case for the liftover database in mouse: m39 did not (last I checked) have chromosomal karyotype band information available. Someone interested in looking at the genes in those coordinates might want to use the latest gene build with the old assembly.

Also, WRT renaming, I think "assembly" is the generally accepted terminology. I know that Gencode uses it when referring to a particular build of the genome.

dhimmel added a commit that referenced this issue Oct 14, 2021
as per generally accepted terminology
#4 (comment)
@dhimmel
Member Author

dhimmel commented Oct 14, 2021

> I'm sure that someone would find annotations for current genes on the old assembly useful

Thanks for the explanation. It seems likely that we will want to support multiple assemblies per release, especially during transitional periods, even if that deprives us of the satisfaction of forcing others to upgrade 😏 . Therefore, I'm inclined to use the database name, like homo_sapiens_core_104_38, as either the species directory or branch name (depending on how we proceed). This would require users to explicitly choose the assembly when selecting a branch/directory, and is probably good for awareness that some fields are assembly-specific.

> Separate branches for each species for each release would be a huge mess if expanding this to any particularly large selection of the available species

Worst-case scenario, we support all 310 species, there are 4 releases per year, and the life of this project is a decade. Ignoring support for multiple assemblies in the same release, we'd have 310 × 4 × 10 = 12,400 branches.

https://github.com/flutter/flutter has 265 branches and the GitHub interface is quite nice for filtering branches by substring match.

I imagine at some point we'll hit limits, not due to the number of branches but due to the cumulative file size in the repo. We can start deploying these branches to new repos, perhaps one repo per release or release year. Or we might switch to another storage solution.

One challenge with directories is that users who only want a single species might have to download all species. Git's sparseCheckout is one workaround but looks a bit technical for many users.
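
To give a sense of what the sparse-checkout workaround would involve, here is a rough sketch that only builds and prints the git commands, without running them. It assumes a hypothetical layout in which a single release branch (called output-104 here) holds one directory per database, which is not how the repo is organized today, and it assumes git 2.25 or newer for `git sparse-checkout`. The function name `sparse_checkout_commands` is made up for illustration.

```python
import shlex

REPO_URL = "https://github.com/related-sciences/ensembl-genes.git"


def sparse_checkout_commands(
    branch: str, directory: str, dest: str = "ensembl-genes"
) -> list[list[str]]:
    """Build git commands that fetch a single directory from one branch.

    The branch/directory layout is hypothetical; nothing is executed here.
    """
    return [
        # Shallow, blob-less clone that checks out only top-level files.
        [
            "git", "clone", "--depth=1", "--filter=blob:none", "--sparse",
            f"--branch={branch}", REPO_URL, dest,
        ],
        # Restrict the working tree to the requested directory.
        ["git", "-C", dest, "sparse-checkout", "set", directory],
    ]


# Dry run: print the commands instead of executing them.
for command in sparse_checkout_commands("output-104", "homo_sapiens_core_104_38"):
    print(shlex.join(command))
```

Even wrapped in a helper, this is two commands with several flags, which supports the concern that it is more than many users would want to deal with.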

With the branch method, a shell script like the following would get all three directories you'd like next to each other in a local directory.

release=104
databases=( homo_sapiens_core_${release}_38 mus_musculus_core_${release}_39 rattus_norvegicus_core_${release}_6 )
for database in "${databases[@]}"
do
   # remove echo to execute
   echo git clone --branch=output_$database --depth=1 https://github.com/related-sciences/ensembl-genes.git
done

Another challenge with directories is that the CI export jobs will take longer if they have to process multiple species. But I'm still open to this and will think about it some more. Happy to hear any additional feedback.

@ACastanza
Contributor

Looking at the flutter repo, that seems less bad than I thought it was going to be. The split between active and stale branches keeps it pretty clean, and the search function seems good. The downloading issue with folders is a valid one. Looks like the balance comes down on the side of branches.

dhimmel added a commit that referenced this issue Oct 15, 2021
dhimmel added a commit that referenced this issue Oct 15, 2021
dhimmel added a commit that referenced this issue Oct 15, 2021
dhimmel added a commit that referenced this issue Oct 15, 2021
@dhimmel
Member Author

dhimmel commented Oct 15, 2021
