Support species besides human? #4
There are at least two places where humanity is assumed: ensembl-genes/src/ensembl_genes.py Line 43 in 627b264
ensembl-genes/src/ensembl_genes.py Lines 97 to 98 in 627b264
|
ensembl-genes/src/ensembl_genes.py Line 11 in 627b264
Would also need a pattern for other species. I think it would be |
Have made lots of progress here such that we can configure the species. One remaining task is to switch to output branches that include species. I am thinking of changing the output branches from I believe the full list of current databases is at http://ftp.ensembl.org/pub/current_mysql/ For humans, the current set of dbs is:
I wonder if we will ever need to query across multiple of these databases (and not just Another option would be to have multiple directories under each release for different species. This would result in fewer branches but would require exporting all species at once, which doesn't seem ideal. |
I would lean towards folders. It would keep releases together better, and in my use-case (which is admittedly not the typical one) I'd want to take the information from all species for a given release. Also, the release on Ensembl's side happens at the same time for all species. I suppose it would make it a little unwieldy for someone who was triggering their own build and wanting only (for example) human data, but I don't think it would be too bad with only Human/Mouse/Rat. I think calling the other databases would need their own set of special purpose scripts. |
It looks like there are currently 310 species (core databases). And if these datasets become popular, I could see requests for broader species coverage.
I think we could write a script that would pretty easily download multiple species. One last question that might help us decide is the reference genome / assembly. It seems that sometimes databases are released for multiple assemblies. See http://ftp.ensembl.org/pub/grch37/current/mysql/: there is a Another question: should I rename |
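A minimal sketch of how a multi-species script might pick core databases out of a listing of database names. It assumes the names follow the pattern `<species>_core_<release>_<assembly>` with a numeric assembly suffix, as in `homo_sapiens_core_104_38`. The names `CoreDatabase`, `parse_core_database`, and `select_core_databases` are hypothetical and are not part of ensembl-genes.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Assumed naming pattern: <species>_core_<release>_<assembly>,
# e.g. homo_sapiens_core_104_38. Other database types (variation,
# funcgen, compara, ...) do not match and are skipped.
CORE_DB_PATTERN = re.compile(
    r"^(?P<species>[a-z0-9_]+)_core_(?P<release>\d+)_(?P<assembly>\d+)$"
)


class CoreDatabase(NamedTuple):
    """Hypothetical container for the parts of a core database name."""

    species: str
    release: int
    assembly: int


def parse_core_database(name: str) -> Optional[CoreDatabase]:
    """Split a core database name into its parts, or return None if it is not one."""
    match = CORE_DB_PATTERN.match(name)
    if match is None:
        return None
    return CoreDatabase(
        species=match["species"],
        release=int(match["release"]),
        assembly=int(match["assembly"]),
    )


def select_core_databases(names: Iterable[str]) -> List[CoreDatabase]:
    """Keep only the names that parse as core databases, in input order."""
    parsed = (parse_core_database(name) for name in names)
    return [database for database in parsed if database is not None]
```

Given a listing that mixes database types, `select_core_databases` would return one entry per species and assembly, which is the unit an export job would loop over.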
Separate branches for each species for each release would be a huge mess if expanding this to any particularly large selection of the available species. If anything, that seems to me to be a stronger argument for folders within a given release, although maybe there's some way to split the difference, like branches for some taxonomical level and then folders under it.

I'm sure that someone would find annotations for current genes on the old assembly useful. I know clinicians in particular are, let's say, reluctant to move on. But GRCh38 has been out for so long that people really need to move on from 37. The bigger question is the forward-looking one. At some point there will be a GRCh39; the hold-up there has been end-to-end gap closure for all the chromosomes, and I think that's just about there. When that does someday finally happen, there will be a transition period when annotations for both assemblies are going to be reasonable to ask for.

Similarly, there was a recent transition from GRCm38 to m39. I know that there is at least one use-case for the liftover database in mouse: m39 did not (last I checked) have chromosomal karyotype band information available. Someone interested in looking at the genes in those coordinates might want to use the latest gene build with the old assembly.

Also, WRT renaming, I think the "assembly" terminology is the generally accepted one. I know that Gencode uses it when referring to a particular build of the genome. |
as per generally accepted terminology #4 (comment)
Thanks for the explanation. It seems likely that we might want to support multiple assemblies per release, especially during transitional periods, even if that deprives us of the satisfaction of forcing others to upgrade 😏 . Therefore, I'm inclined to use the database name like
Worst case scenario, we support all 310 species and there are 4 releases per year and the life of this project is a decade. Ignoring supporting multiple assemblies for the same release, we'd have 12400 branches. https://github.com/flutter/flutter has 265 branches and the github interface is quite nice for filtering branches by substring match. I imagine at some point we'll hit limits, but not due to the number of branches but instead cumulative file size in the repo. We can start deploying these branches to new repos, perhaps one repo per release or release year. Or we might switch to another storage solution.

One challenge with directories is that users who only want a single species might have to download all species. Git's With the branch method, a shell script like the following would get all three directories you'd like next to each other in a local directory.

```shell
release=104
databases=(
  homo_sapiens_core_${release}_38
  mus_musculus_core_${release}_39
  rattus_norvegicus_core_${release}_6
)
for database in "${databases[@]}"
do
  # remove echo to execute
  # each branch is cloned into its own directory named after the database,
  # otherwise every clone would target the same ensembl-genes directory
  echo git clone --branch="output_$database" --depth=1 https://github.com/related-sciences/ensembl-genes.git "$database"
done
```

Another challenge to directories is that the CI export jobs will take longer if they have to process multiple species. But I'm still open to this and will think about it some more. Happy to hear any additional feedback. |
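The worst-case arithmetic and the branch naming from this comment can be sketched in Python. The `output_<database>` naming is taken from the clone command in the comment, and the default counts (310 species, 4 releases per year, 10 years) are the figures quoted there. The function names `output_branch` and `worst_case_branches` are hypothetical and are not part of ensembl-genes.

```python
def output_branch(database: str) -> str:
    """Branch name for one exported database, following output_<database>."""
    return f"output_{database}"


def worst_case_branches(
    species: int = 310, releases_per_year: int = 4, years: int = 10
) -> int:
    """One branch per species per release, ignoring multiple assemblies per release."""
    return species * releases_per_year * years


if __name__ == "__main__":
    release = 104
    # Assembly suffixes as used in the comment for release 104.
    assemblies = {"homo_sapiens": 38, "mus_musculus": 39, "rattus_norvegicus": 6}
    for species_name, assembly in assemblies.items():
        print(output_branch(f"{species_name}_core_{release}_{assembly}"))
    print(worst_case_branches())
```

Supporting several assemblies for the same species and release would multiply the count further, which is why the estimate is labeled as ignoring that case.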
Looking at the flutter repo, that seems less bad than I thought it was going to be. The split between active and stale branches keeps it pretty clean, and the search function seems good. The downloading issue with folders is a valid one. Looks like the balance comes down on the side of branches. |
Looks to be working! https://github.com/related-sciences/ensembl-genes/actions/runs/1344344511 |
@ACastanza asked:
We should think about support for additional species, since it appears there will at least be demand for it.