Enhancement/hg38 update #128

Merged
merged 35 commits into from
Jan 13, 2025

@anoronh4 anoronh4 commented Sep 25, 2024

Addresses #129.

Detailed description of changes

We are switching to GRCh38, with Ensembl version 111 as the target. STARFusion and Fusioncatcher outputs are not based on Ensembl version 111, but they will be reannotated to the target version for consistency with the rest of the pipeline. We will be using the reference sequence with decoys from NCBI. Because this reference genome includes the chr prefix for every sequence, the process FASTAREMOVEPREFIX was added to create a new fasta when running with GRCh38. Logic was also added to gunzip the fasta file before further processing. The STARFusion reference is updated to the latest GRCh38 plug-n-play reference. The metafusion_blocklist now points to a lifted-over version of the reference. The AGFusion container was updated to support newer Ensembl versions (< 112). More chromosomes were removed from the rRNA bed processing.
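As a rough illustration of the prefix removal described above, the following Python sketch strips a leading chr from FASTA header lines. It is not the FASTAREMOVEPREFIX implementation; the function name `strip_chr_prefix` and the sample records are hypothetical, and no other renaming of sequences is assumed.

```python
import io


def strip_chr_prefix(lines, prefix="chr"):
    """Yield FASTA lines with `prefix` removed from the start of each header name.

    Hypothetical helper for illustration only. Sequence lines and headers
    that do not start with the prefix are passed through unchanged.
    """
    for line in lines:
        if line.startswith(">" + prefix):
            # Keep the ">" marker, drop the prefix, keep the rest of the header.
            yield ">" + line[1 + len(prefix):]
        else:
            yield line


# Made-up records: two prefixed sequences and one unprefixed contig.
fasta = io.StringIO(">chr1 test\nACGT\n>chrM\nNNNN\n>KI270706.1\nGGCC\n")
result = "".join(strip_chr_prefix(fasta))
print(result)
```

Only header lines are touched, so the sequence content of the new fasta is identical to the input.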

The metafusion reference generation scripts were updated. bin/final_generate_v75_gene_bed.R was renamed to bin/generate_gene_bed.R, and logic was added to handle Ensembl's new annotation of UTRs. In bin/make_gene_info_for_forte.R, gene_id is used as the gene's symbol in cases where gene_name is missing.
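The two annotation rules above can be sketched in Python as follows. The actual scripts are written in R and their logic is not shown here, so this is only an illustration: the names `UTR_ALIASES`, `normalize_feature`, and `gene_symbol` are hypothetical, and collapsing both new labels to a single UTR label is an assumption about how the difference between annotation versions could be reconciled.

```python
# Assumption: newer Ensembl annotations label UTRs as five_prime_utr and
# three_prime_utr, while v75 uses a single UTR feature type.
UTR_ALIASES = {"five_prime_utr": "UTR", "three_prime_utr": "UTR"}


def normalize_feature(feature):
    """Map the newer UTR labels to a common label; leave other types as-is."""
    return UTR_ALIASES.get(feature, feature)


def gene_symbol(attrs):
    """Return gene_name when present and non-empty, otherwise gene_id."""
    name = attrs.get("gene_name")
    return name if name else attrs["gene_id"]


# Made-up attribute dictionaries, as might be parsed from GTF records.
with_name = {"gene_id": "ENSG_EXAMPLE_1", "gene_name": "GENE1"}
without_name = {"gene_id": "ENSG_EXAMPLE_2"}

print(normalize_feature("five_prime_utr"), normalize_feature("exon"))
print(gene_symbol(with_name), gene_symbol(without_name))
```

Falling back to gene_id keeps every gene addressable by some symbol, so records without a gene_name are not dropped downstream.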

Because the fasta now has to be preprocessed (GUNZIP_FASTA and FASTAREMOVEPREFIX), some logic had to change throughout the script. The fasta output is now a channel, whereas downstream logic expected a file object. For example, Channel.of([id:params.genome], fasta) worked before, when fasta == params.fasta, but errors out when fasta is a channel. It also made sense to update several of the modules to use input conventions such as tuple val(meta), path(fasta) instead of path(fasta): the output of GUNZIP_FASTA and FASTAREMOVEPREFIX contains a meta object, and we would otherwise have to repeatedly modify the channel for older versions of those modules. As a result, many of the changes are simply updates to modules and small changes to how channels containing the references are passed into the processes.

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool, have you followed the pipeline conventions in the contribution docs?
  • If necessary, also make a PR on the mskcc/forte branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@anoronh4 anoronh4 linked an issue Sep 25, 2024 that may be closed by this pull request
@anoronh4 anoronh4 marked this pull request as ready for review October 9, 2024 18:38
@anoronh4 anoronh4 requested review from huyu335 and carynhale October 9, 2024 18:39
@pintoa1-mskcc

I’ll have to make a separate function for gene bed generation because v111 has “five_prime_utr” and “three_prime_utr”, whereas v75 just has “UTR”. Currently the gene bed for hg38 is lacking UTRs because of this.


@huyu335 huyu335 left a comment

Tested on clinical prod cluster, looks good!


@carynhale carynhale left a comment

I have tested this quite a bit on crater2 and am using it for validation samples currently, no major issues.

@anoronh4 anoronh4 requested a review from gongyixiao January 10, 2025 22:59
@anoronh4 anoronh4 merged commit d38564e into develop Jan 13, 2025
6 checks passed
Development

Successfully merging this pull request may close these issues.

update for GRCh38
5 participants