Use Snakemake to build transcripts #70

Open
davmlaw opened this issue Feb 19, 2024 · 7 comments

Comments

davmlaw commented Feb 19, 2024

At the moment the build scripts rely on file-existence tests instead of proper dependency management.
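To illustrate the difference, here is a minimal sketch (not code from this repo) of a timestamp-based check: an output is rebuilt if it is missing or older than any of its inputs. The function name `needs_rebuild` is hypothetical. A workflow manager such as Snakemake tracks this kind of relationship across the whole pipeline, which a bare "does the file exist?" test cannot do.

```python
from pathlib import Path
from typing import Iterable


def needs_rebuild(target: Path, dependencies: Iterable[Path]) -> bool:
    """Return True if target is missing or older than any dependency.

    A plain existence test only covers the first condition, so a stale
    output silently survives when one of its inputs is updated.
    """
    if not target.exists():
        return True
    target_mtime = target.stat().st_mtime
    # Rebuild when any input was modified after the target was written.
    return any(dep.stat().st_mtime > target_mtime for dep in dependencies)
```

With an existence test alone, re-downloading a newer annotation file would not trigger a rebuild of the outputs derived from it.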

davmlaw commented Mar 7, 2024

It would be good to automate uploading releases, as this is pretty tedious. We could do:

gh release create <tag> --title "<release title>" --notes "<release notes>"
gh release upload <tag> <path/to/your/files/*>
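As a rough sketch, the two commands above could be wrapped in a small Python helper. This is only an illustration under assumptions: the function names, the `dry_run` flag and the example arguments are made up, and the only `gh` options used are the ones shown above. It is not the contents of any script in this repo.

```python
import subprocess
from typing import List, Sequence


def build_release_commands(tag: str, title: str, notes: str,
                           files: Sequence[str]) -> List[List[str]]:
    """Build the argument lists for the two gh commands shown above."""
    create = ["gh", "release", "create", tag, "--title", title, "--notes", notes]
    upload = ["gh", "release", "upload", tag, *[str(f) for f in files]]
    return [create, upload]


def publish_release(tag: str, title: str, notes: str,
                    files: Sequence[str], dry_run: bool = True) -> None:
    """Print the commands (dry run) or execute them with the gh CLI."""
    for command in build_release_commands(tag, title, notes, files):
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops at the first failing command.
            subprocess.run(command, check=True)
```

Passing the arguments as lists avoids shell quoting problems with titles or notes that contain spaces.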

davmlaw commented Mar 13, 2024

Made a script, "generate_transcript_data/github_release_upload.sh", which makes creating a release easier.

davmlaw added a commit that referenced this issue Aug 29, 2024

davmlaw commented Aug 30, 2024

Looking at the bash scripts, a lot of the complexity is due to looping over URLs and dealing with RefSeq URLs that have identical file names, e.g.:

"https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/105.20190906/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_genomic.gff.gz"
"https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/105.20201022/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_genomic.gff.gz"
"https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/105.20220307/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_genomic.gff.gz"

So it's not as easy as just downloading each file and carrying on. I think with Snakemake we should explicitly list everything in YAML files, and use that config to run one pipeline shared by everything.

We could make urls a dictionary, with the "nice name" for each URL as its key. That would allow us to move code into config, which would be a lot nicer.
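A minimal sketch of the dictionary idea, using the three URLs above. The key names, the `downloads` directory and the `download_path` helper are assumptions for illustration only, not the actual layout of the config. Because the key goes into the local path, the identical RefSeq file names no longer collide.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

_BASE = "https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606"
_FILE = "GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_genomic.gff.gz"

# Hypothetical "nice name" -> URL mapping (the key names are made up).
REFSEQ_GRCH37_URLS = {
    "GRCh37_RefSeq_105.20190906": f"{_BASE}/105.20190906/{_FILE}",
    "GRCh37_RefSeq_105.20201022": f"{_BASE}/105.20201022/{_FILE}",
    "GRCh37_RefSeq_105.20220307": f"{_BASE}/105.20220307/{_FILE}",
}


def download_path(name: str, url: str, download_dir: str = "downloads") -> str:
    """Return a local path that is unique per key, even if file names repeat."""
    filename = PurePosixPath(urlsplit(url).path).name
    return f"{download_dir}/{name}/{filename}"


if __name__ == "__main__":
    for name, url in REFSEQ_GRCH37_URLS.items():
        print(download_path(name, url))
```

In YAML the same mapping would be a plain key/value block, so adding a new annotation release becomes a one-line config change rather than a code change.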

davmlaw added a commit that referenced this issue Aug 30, 2024
…names (handle RefSeq's duplicated filenames)
davmlaw added a commit that referenced this issue Aug 30, 2024
davmlaw added a commit that referenced this issue Aug 30, 2024

davmlaw commented Aug 30, 2024

OK, I have started on this (in generate_transcript_data).

I wanted to run the code with different config files, but couldn't work out a way to do it. Snakemake seems to want only one config file, so I combined everything in "config/*.yaml" into "cdot_transcripts.yaml".

I'm having an issue at the moment with ambiguous rules for downloading files.
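The rules in question aren't shown in this thread, so the following is only a hypothetical illustration of how this kind of ambiguity typically arises. By default a Snakemake wildcard matches the regex `.+`, so two output patterns can both match the same target file. The sketch mimics that matching in plain Python (the rule names and patterns are invented) and shows how a constraint on the wildcard removes the overlap. In a Snakefile the usual fixes are `wildcard_constraints` or `ruleorder`.

```python
import re
from typing import Dict, List, Optional

# Hypothetical output patterns for two download rules (not from the real Snakefile).
HYPOTHETICAL_RULES = {
    "download_gff": "downloads/{name}.gff.gz",
    "download_any": "downloads/{name}.gz",
}


def pattern_to_regex(pattern: str, constraints: Optional[Dict[str, str]] = None):
    """Turn 'dir/{wildcard}.ext' into a regex; wildcards default to '.+'."""
    constraints = constraints or {}
    # re.split with a capture group alternates: literal, wildcard name, literal, ...
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)
        else:
            regex += f"(?P<{part}>{constraints.get(part, '.+')})"
    return re.compile(regex)


def matching_rules(target: str, rules: Dict[str, str],
                   constraints: Optional[Dict[str, str]] = None) -> List[str]:
    """Return the names of all rules whose output pattern matches target."""
    return [name for name, pattern in rules.items()
            if pattern_to_regex(pattern, constraints).fullmatch(target)]
```

Here `matching_rules("downloads/refseq_105.gff.gz", HYPOTHETICAL_RULES)` returns both rule names, while passing `{"name": r"[^.]+"}` as the constraint leaves only `download_gff`.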

davmlaw added a commit that referenced this issue Sep 2, 2024

davmlaw commented Sep 3, 2024

@tedil @holtgrewe - I've finished v1 of the Snakemake pipeline. Could you check it out, as it's the first one I've ever written:

https://github.com/SACGF/cdot/blob/main/generate_transcript_data/Snakefile
https://github.com/SACGF/cdot/blob/main/generate_transcript_data/cdot_transcripts.yaml

Happy to hear feedback / if I should have structured it a different way etc.

davmlaw added a commit that referenced this issue Sep 3, 2024

tedil commented Sep 3, 2024

Great, thank you! I will have a look when I am back from vacation

davmlaw commented Sep 3, 2024

Sure, no hurry, enjoy your time off
