Enhancement/hg38 update #128
… enhancement/hg38_update
I’ll have to make a separate function for gene bed generation because v111 has `five_prime_utr` and `three_prime_utr`, whereas v75 just has `UTR`. Currently the gene bed for hg38 is lacking UTRs due to this.
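The difference in UTR feature names described in this comment could be handled along these lines. This is only a minimal Python sketch, not the pipeline's actual R implementation: `normalize_feature` and `utr_rows` are hypothetical names, and collapsing both new labels to the generic `UTR` label is an assumption about what the gene bed needs.

```python
# Hypothetical helpers for illustration; the real logic lives in the
# pipeline's R scripts and may differ.

# Feature names used for UTRs: "UTR" (v75) and the split names (v111).
UTR_FEATURES = {"UTR", "five_prime_utr", "three_prime_utr"}


def normalize_feature(feature: str) -> str:
    """Collapse any UTR feature name to the single label 'UTR'."""
    return "UTR" if feature in UTR_FEATURES else feature


def utr_rows(gtf_lines):
    """Yield (chrom, start, end, feature) for UTR records in GTF lines.

    Coordinates are passed through as the 1-based GTF values; no
    conversion to 0-based BED coordinates is attempted here.
    """
    for line in gtf_lines:
        # Skip blank lines and GTF header/comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed GTF record
        feature = normalize_feature(fields[2])
        if feature == "UTR":
            yield fields[0], int(fields[3]), int(fields[4]), feature
```

With this approach the same extraction code sees `UTR` records regardless of which Ensembl version produced the GTF.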
Tested on clinical prod cluster, looks good!
I have tested this quite a bit on crater2 and am using it for validation samples currently, no major issues.
Addresses #129.
Detailed description of changes
We are switching to GRCh38, with Ensembl version 111 as the target. STAR-Fusion and FusionCatcher outputs are not based on Ensembl version 111, so they will be reannotated to the target version for consistency with the rest of the pipeline. We will be using the reference sequence with decoys from NCBI. Because this reference genome includes the `chr` prefix for every sequence, the process `FASTAREMOVEPREFIX` was added to create a new fasta when running with GRCh38. Logic was also added to gunzip the fasta file before further processing. The STAR-Fusion reference is updated to the latest GRCh38 plug-n-play reference. The `metafusion_blocklist` now points to a lifted-over version of the reference. The AGFusion container was updated to support the updated Ensembl versions (< 112). More chromosomes were removed from the rRNA bed processing. The metafusion reference generation scripts were updated.
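
The two preprocessing steps described above (decompressing the fasta, then dropping the `chr` prefix from sequence names) can be sketched in Python as follows. This is an illustration under stated assumptions, not the code of the `GUNZIP_FASTA` or `FASTAREMOVEPREFIX` processes: the function names are hypothetical, and the sketch only strips the prefix from header lines (it does not, for example, rename `chrM` to `MT`).

```python
import gzip


def open_maybe_gzip(path):
    """Open a fasta for text reading, decompressing if it is gzipped.

    Detection uses the two-byte gzip magic number rather than the
    file extension.
    """
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path, "rt")


def strip_chr_prefix(lines, prefix="chr"):
    """Yield fasta lines with `prefix` removed from header names.

    Only header lines (starting with '>') whose name begins with the
    prefix are changed; sequence lines pass through untouched.
    """
    marker = ">" + prefix
    for line in lines:
        if line.startswith(marker):
            yield ">" + line[len(marker):]
        else:
            yield line
```

A typical use would be to stream `strip_chr_prefix(open_maybe_gzip(path))` into a new file, so the original reference is left unmodified.
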
`bin/final_generate_v75_gene_bed.R` was changed to `bin/generate_gene_bed.R`, and logic to handle Ensembl's new annotation of UTRs was added. In `bin/make_gene_info_for_forte.R`, `gene_id` is used as the gene's symbol in cases where `gene_name` is missing.

Because the fasta now has to be preprocessed (`GUNZIP_FASTA` and `FASTAREMOVEPREFIX`), some logic had to change throughout the script. The fasta output is now a channel, whereas downstream logic was expecting a file object. For example, `Channel.of([id:params.genome], fasta)` was fine before, when `fasta == params.fasta`, but errors out when `fasta` is a channel. It also made sense to update several of the modules to use input conventions such as `tuple val(meta), path(fasta)` instead of `path(fasta)`, because the output of `GUNZIP_FASTA` and `FASTAREMOVEPREFIX` contains a meta object, and we would otherwise have to repeatedly modify the channel for older versions of those modules. As a result, many of the changes are simply updates to modules and small changes to how channels containing the references are passed into the processes.

PR checklist
- Code lints (`nf-core lint`).
- Test suite passes (`nextflow run . -profile test,docker --outdir <OUTDIR>`).
- `docs/usage.md` is updated.
- `docs/output.md` is updated.
- `CHANGELOG.md` is updated.
- `README.md` is updated (including new tool citations and authors/contributors).