Enhancement/hg38 update #128

anoronh4 · 2024-09-25T21:35:02Z

Addresses #129.

Detailed description of changes

We are switching to GRCh38, targeting Ensembl version 111. STAR-Fusion and FusionCatcher outputs are not based on Ensembl version 111, so they will be reannotated to the target version for consistency with the rest of the pipeline. We will use the reference sequence with decoys from NCBI. Because this reference genome includes the chr prefix on every sequence name, the process FASTAREMOVEPREFIX was added to create a new fasta when running with GRCh38. Logic was also added to gunzip the fasta file before further processing. The STAR-Fusion reference is updated to the latest GRCh38 plug-n-play reference. The metafusion_blocklist points to a lifted-over version of the reference. The AGFusion container was updated to support the updated Ensembl versions (< 112). More chromosomes were removed from the rRNA bed processing.
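As a rough illustration of the prefix removal described above (together with the M to MT rename from a later commit), here is a minimal Python sketch. It is not the FASTAREMOVEPREFIX implementation, which is not shown in this PR description; the function name `strip_chr_prefix` and its behavior are assumptions made for illustration only.

```python
def strip_chr_prefix(lines, prefix="chr"):
    """Yield FASTA lines with `prefix` stripped from sequence names.

    Hypothetical helper: also renames the mitochondrial sequence from
    "M" to "MT" after the prefix has been removed.
    """
    for line in lines:
        if not line.startswith(">"):
            # Sequence lines pass through unchanged.
            yield line
            continue
        body = line[1:].rstrip("\n")
        # The sequence name is everything up to the first space.
        name, sep, desc = body.partition(" ")
        if name.startswith(prefix):
            name = name[len(prefix):]
        if name == "M":
            name = "MT"
        newline = "\n" if line.endswith("\n") else ""
        yield ">" + name + sep + desc + newline


if __name__ == "__main__":
    example = [">chr1 example\n", "ACGT\n", ">chrM\n", "GGCC\n"]
    print("".join(strip_chr_prefix(example)), end="")
```

Because the function works on any iterable of lines, it can be applied to an open file handle without loading the whole genome into memory.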

The metafusion reference generation scripts were updated. bin/final_generate_v75_gene_bed.R was changed to bin/generate_gene_bed.R, and logic to handle Ensembl's new annotation of UTRs was added. In bin/make_gene_info_for_forte.R, gene_id is used as the gene's symbol when gene_name is missing.
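The gene_id fallback can be sketched as follows. This is a Python illustration of the idea only, not the R code in bin/make_gene_info_for_forte.R; the function name `gene_symbol` and the regex-based attribute parsing are assumptions.

```python
import re

# Matches key "value" pairs in a GTF attribute column.
_ATTR = re.compile(r'(\S+) "([^"]*)"')


def gene_symbol(attributes):
    """Return gene_name if present and non-empty, otherwise gene_id.

    Hypothetical helper; `attributes` is the ninth column of a GTF record.
    """
    attrs = dict(_ATTR.findall(attributes))
    return attrs.get("gene_name") or attrs.get("gene_id")


if __name__ == "__main__":
    named = 'gene_id "ENSG00000223972"; gene_name "DDX11L1";'
    unnamed = 'gene_id "ENSG00000290825"; gene_biotype "lncRNA";'
    print(gene_symbol(named))
    print(gene_symbol(unnamed))
```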

Because the fasta now has to be preprocessed (GUNZIP_FASTA and FASTAREMOVEPREFIX), some logic had to change throughout the script. The fasta output is now a channel, whereas downstream logic expected a file object. For example, Channel.of([id:params.genome], fasta) was fine before when fasta == params.fasta, but errors out when fasta is a channel. It also made sense to update several of the modules to use input conventions such as tuple val(meta), path(fasta) instead of path(fasta), because the output of GUNZIP_FASTA and FASTAREMOVEPREFIX contains a meta object, and we would otherwise have to repeatedly modify the channel for older versions of those modules. As a result, many of the changes are simply updates to modules and small changes to how channels containing the references are passed into the processes.

PR checklist


pintoa1-mskcc · 2024-11-07T19:51:58Z

I’ll have to make a separate function for gene bed generation because v111 has “five_prime_utr” and “three_prime_utr”, whereas v75 just has “UTR”. Because of this, the gene bed for hg38 currently lacks UTRs.
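One possible way to handle both nomenclatures in a single code path is to collapse the feature labels before building the gene bed. The sketch below is Python rather than the R used in the pipeline, and the names `UTR_FEATURES` and `normalize_feature` are hypothetical; it only illustrates the idea and is not the approach taken in bin/generate_gene_bed.R.

```python
# Feature labels treated as UTRs: "UTR" (v75 style) plus the two
# labels named in the comment above for v111.
UTR_FEATURES = {"UTR", "five_prime_utr", "three_prime_utr"}


def normalize_feature(feature):
    """Map any UTR-like feature label to the single label "UTR".

    Other feature types (exon, CDS, ...) are returned unchanged.
    """
    return "UTR" if feature in UTR_FEATURES else feature


if __name__ == "__main__":
    for label in ["five_prime_utr", "three_prime_utr", "UTR", "exon"]:
        print(label, "->", normalize_feature(label))
```

Collapsing the labels discards the 5'/3' distinction, so this only fits if downstream steps treat all UTRs alike.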

huyu335

Tested on clinical prod cluster, looks good!

carynhale

I have tested this quite a bit on crater2 and am using it for validation samples currently, no major issues.

Anne Marie Noronha and others added 5 commits September 20, 2024 12:43

updating GRCh38 references

d19f77a

fix gtf channel for GRCh38 mode

f8069a2

update star/genomegenerate

16dd74e

adjustments to make GRCh38 run through

e30de97

Merge branch 'develop' into enhancement/hg38_update

b97fea0

anoronh4 requested a review from pintoa1-mskcc September 25, 2024 21:35

fix indentation

e206da4

anoronh4 linked an issue Sep 25, 2024 that may be closed by this pull request

update for GRCh38 #129

Open

anoronh4 and others added 18 commits September 25, 2024 21:39

fix agfusion download command

fa7f511

clean up view operator

20379c7

fix fillout pytest

39744f7

change pfam reference to stagnant release

01b2981

add poison pill to reference channels

7325589

update md5sum in test_profile test

c204119

update ensembl version to 112

7c5dc28

update CHANGELOG.md

d8e423f

change ensembl version to 111

4966057

update AGFusion to [email protected]

3bbaecf

fix conflicts

3ea6c20

Merge branch 'enhancement/hg38_update' of github.com:mskcc/forte into enhancement/hg38_update

29f10aa

add cpus in AGAT_SPADDINTRONS resources

cd06afc

fix indentation

3bfeb91

Set gene_id as gene_name for lncRNAs, remove NF transcripts from gene bed

9147321

Merge branch 'enhancement/hg38_update' of github.com:mskcc/forte into enhancement/hg38_update

6e09888

fix linting error, ensure no scientific notation in gene bed

ad50743

Merge branch 'enhancement/hg38_update' of github.com:mskcc/forte into enhancement/hg38_update

db34bad

anoronh4 marked this pull request as ready for review October 9, 2024 18:38

anoronh4 requested review from huyu335 and carynhale October 9, 2024 18:39

remove deprecated arriba installation

95e737e

anoronh4 and others added 6 commits October 10, 2024 10:15

update to decoy fasta

3d97ab8

Merge branch 'develop' into enhancement/hg38_update

32507d5

add storeDir directives for gunzip* and FASTAREMOVEPREFIX

7f8c7c0

Merge branch 'enhancement/hg38_update' of github.com:mskcc/forte into enhancement/hg38_update

6fcb29c

change chromosome M to MT in fasta

709317a

add idt_v2 baits for GRCh38 to reference config

a70828d

pintoa1-mskcc added 4 commits November 7, 2024 15:52

edit genebed generation to function with v111 nomenclature

612a80a

linting errors

1e0d3f1

line endings

e105305

Modify generate gene bed to one script

fb2f514

huyu335 approved these changes Nov 21, 2024


carynhale approved these changes Nov 22, 2024


anoronh4 requested a review from gongyixiao January 10, 2025 22:59

gongyixiao approved these changes Jan 13, 2025


anoronh4 merged commit d38564e into develop Jan 13, 2025
6 checks passed
