Add QUILT method #26

Merged
merged 36 commits into from
Apr 20, 2024
Changes from 35 commits
8c332bf
add quilt module
atrigila Mar 27, 2024
244b199
add quilt subworkflows
atrigila Mar 27, 2024
2c76c89
add basic test config
atrigila Mar 27, 2024
281eb03
add new bcftools modules
atrigila Mar 27, 2024
466e899
add quilt module and subworkflow config
atrigila Mar 27, 2024
a961307
update subworkflows
atrigila Mar 27, 2024
f89df4e
add quilt test to ci tests
atrigila Mar 27, 2024
2958564
add quilt subworkflow to main workflow
atrigila Mar 27, 2024
ae577d0
modify config modules quilt subworkflow
atrigila Mar 27, 2024
2e7bbc2
change id for chr
atrigila Apr 8, 2024
dd1fa41
create input channels for quilt
atrigila Apr 8, 2024
914f5fc
adapt quilt impute to new structure
atrigila Apr 8, 2024
31965b4
gather all quilt outputs in subdirectory
atrigila Apr 8, 2024
e7cf9dd
add ngen and buffer as external params with defaults
atrigila Apr 8, 2024
b7def1e
remove unused module
atrigila Apr 8, 2024
f59924b
add documentation
atrigila Apr 9, 2024
6bd5364
add new params to schema
atrigila Apr 9, 2024
e332761
fix linting issues
atrigila Apr 10, 2024
19c904d
Update CITATIONS.md
LouisLeNezet Apr 10, 2024
76f7d71
correct issues
atrigila Apr 10, 2024
6ef0335
Revert "fix linting issues"
atrigila Apr 10, 2024
c1e7c82
change local subworkflow directory structure
atrigila Apr 10, 2024
bf5e443
rename subworkflow to match nf-core standards
atrigila Apr 11, 2024
f909912
rename subworkflow and notes
atrigila Apr 11, 2024
2b6292c
improve outputs from chunks
atrigila Apr 11, 2024
ab15ae1
read fasta from previous channel
atrigila Apr 11, 2024
5557f5a
first nf-test
atrigila Apr 11, 2024
32499a4
add sample files for full test grch37
atrigila Apr 11, 2024
24415b5
Merge branch 'nf-core:dev' into add_quilt_4
atrigila Apr 16, 2024
1b4393c
simplify concatenation
atrigila Apr 18, 2024
be4c25d
allow get_region to use all regions
atrigila Apr 18, 2024
f69a8b2
update full panel
atrigila Apr 18, 2024
7f3ec2a
patch
atrigila Apr 18, 2024
5d38001
remove unused elements
atrigila Apr 19, 2024
b9b4873
reorder functions in impute quilt channel
atrigila Apr 19, 2024
0ed8708
index concat files
atrigila Apr 20, 2024
1 change: 1 addition & 0 deletions .github/workflows/ci.yml
@@ -30,6 +30,7 @@ jobs:
TEST_PROFILE:
- "test"
- "test_sim"
- "test_quilt"
steps:
- name: Check out pipeline code
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -18,6 +18,7 @@ Initial release of nf-core/phaseimpute, created with the [nf-core](https://nf-co
- Test impute and test sim works
- [#19](https://github.com/nf-core/phaseimpute/pull/19) - Changed reference panel to accept a csv, update modules and subworkflows (glimpse1/2 and shapeit5)
- [#20](https://github.com/nf-core/phaseimpute/pull/20) - Added automatic detection of vcf contigs for the reference panel and automatic renaming available
- [#26](https://github.com/nf-core/phaseimpute/pull/26) - Added QUILT method

### `Fixed`

16 changes: 14 additions & 2 deletions CITATIONS.md
@@ -10,9 +10,21 @@

## Pipeline tools

- [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
- [QUILT](https://pubmed.ncbi.nlm.nih.gov/34083788/)

> Andrews, S. (2010). FastQC: A Quality Control Tool for High Throughput Sequence Data [Online].
> Davies, R. W., Kucka, M., Su, D., Shi, S., Flanagan, M., Cunniff, C. M., ... & Myers, S. (2021). Rapid genotype imputation from sequence with reference panels. Nature Genetics, 53(7), 1104-1111.

- [GLIMPSE](https://www.nature.com/articles/s41588-020-00756-0)

> Rubinacci, S., Ribeiro, D. M., Hofmeister, R. J., & Delaneau, O. (2021). Efficient phasing and imputation of low-coverage sequencing data using large reference panels. Nature Genetics, 53(1), 120-126.

- [Shapeit](https://odelaneau.github.io/shapeit5/)

> Hofmeister, R. J., Ribeiro, D. M., Rubinacci, S., & Delaneau, O. (2023). Accurate rare variant phasing of whole-genome and whole-exome sequencing data in the UK Biobank. Nature Genetics. doi: https://doi.org/10.1038/s41588-023-01415-w

- [bcftools](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3198575/)

> Li, H. (2011). A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics, 27(21), 2987-2993.

- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)

8 changes: 8 additions & 0 deletions README.md
@@ -109,6 +109,14 @@ For further information or help, don't hesitate to get in touch on the [Slack `#

<!-- TODO nf-core: Add bibliography of tools and data used in your pipeline -->

You can cite one of the main imputation methods ([`QUILT`](https://github.com/rwdavies/QUILT)) as follows:

> **Rapid genotype imputation from sequence with reference panels.**
>
> Davies, R. W., Kucka, M., Su, D., Shi, S., Flanagan, M., Cunniff, C. M., Chan, Y. F., & Myers, S.
>
> _Nature Genetics_ 2021 June 03. doi: [10.1038/s41588-021-00877-0](https://doi.org/10.1038/s41588-021-00877-0)

An extensive list of references for the tools used by the pipeline can be found in the [`CITATIONS.md`](CITATIONS.md) file.

You can cite the `nf-core` publication as follows:
122 changes: 122 additions & 0 deletions conf/quilt_subworkflow.config
@@ -0,0 +1,122 @@
/*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    Config file for defining DSL2 per module options and publishing paths
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    Available keys to override module options:
        ext.args = Additional arguments appended to command in module.
        ext.args2 = Second set of arguments appended to command in module (multi-tool modules).
        ext.args3 = Third set of arguments appended to command in module (multi-tool modules).
        ext.prefix = File name prefix for output files.
----------------------------------------------------------------------------------------
*/

process {

    withName: CUSTOM_DUMPSOFTWAREVERSIONS {
        publishDir = [
            path: { "${params.outdir}/pipeline_info" },
            mode: params.publish_dir_mode,
            pattern: '*_versions.yml'
        ]
    }

    withName: 'NFCORE_PHASEIMPUTE:PHASEIMPUTE:MAKE_CHUNKS:GLIMPSE_CHUNK' {

        ext.prefix = { "${meta.id}_${meta.chr}" }

        publishDir = [
            [
                path: { "${params.outdir}/quilt_impute/${task.process.tokenize(':')[-1].tokenize('_')[0].toLowerCase()}_chunk" },
                mode: params.publish_dir_mode,
            ],
        ]
    }

    withName: 'NFCORE_PHASEIMPUTE:PHASEIMPUTE:MAKE_CHUNKS:BCFTOOLS_INDEX' {
        cpus = 2
        memory = 400.MB
        maxRetries = 2
    }

    withName: 'NFCORE_PHASEIMPUTE:PHASEIMPUTE:MAKE_CHUNKS:BCFTOOLS_INDEX_2' {
        ext.args = '--tbi'
        cpus = 2
        memory = 400.MB
        maxRetries = 2
    }

    withName: 'NFCORE_PHASEIMPUTE:PHASEIMPUTE:MAKE_CHUNKS:BCFTOOLS_INDEX_3' {
        ext.args = '--tbi'
        cpus = 2
        memory = 400.MB
        maxRetries = 2
    }

    withName: 'NFCORE_PHASEIMPUTE:PHASEIMPUTE:MAKE_CHUNKS:BCFTOOLS_VIEW' {
        ext.args = '-v snps -Oz'
        ext.prefix = { "${meta.id}_${meta.chr}_biallelic" }
        cpus = 2
        memory = 400.MB
        maxRetries = 2
    }

    withName: 'NFCORE_PHASEIMPUTE:PHASEIMPUTE:MAKE_CHUNKS:BCFTOOLS_NORM' {
        ext.args = '-m +any --output-type z'
        ext.prefix = { "${meta.id}_${meta.chr}_multiallelic" }
        cpus = 2
        memory = 400.MB
        maxRetries = 2
    }

    withName: 'NFCORE_PHASEIMPUTE:PHASEIMPUTE:MAKE_CHUNKS:BCFTOOLS_CONVERT' {
        ext.args = '--haplegendsample test'
        ext.prefix = { "${meta.id}_${meta.chr}_convert" }
        cpus = 2
        memory = 400.MB
        maxRetries = 2

        publishDir = [
            [
                path: { "${params.outdir}/quilt_impute/${task.process.tokenize(':')[-1].tokenize('_')[0].toLowerCase()}/convert" },
                mode: params.publish_dir_mode,
            ],
        ]
    }

    withName: 'NFCORE_PHASEIMPUTE:PHASEIMPUTE:IMPUTE_QUILT:QUILT_QUILT' {
        publishDir = [
            [
                path: { "${params.outdir}/quilt_impute/${task.process.tokenize(':')[-1].tokenize('_')[0].toLowerCase()}" },
                mode: params.publish_dir_mode,
            ],
        ]
    }

    withName: 'NFCORE_PHASEIMPUTE:PHASEIMPUTE:IMPUTE_QUILT:BCFTOOLS_INDEX' {
        ext.args = {[
            "--tbi",
        ].join(" ").trim()}
    }

    withName: 'NFCORE_PHASEIMPUTE:PHASEIMPUTE:VCF_CONCATENATE_BCFTOOLS:BCFTOOLS_CONCAT' {
        ext.args = {[
            "--ligate",
            "--output-type z",
        ].join(" ").trim()}

        cpus = 2
        memory = 1.GB
        maxRetries = 2

        publishDir = [
            [
                path: { "${params.outdir}/quilt_impute/${task.process.tokenize(':')[-1].tokenize('_')[0].toLowerCase()}/concat" },
                mode: params.publish_dir_mode,
            ],
        ]
    }

}
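
Note on the `publishDir` paths in this config: each one derives its tool directory from the process name via `task.process.tokenize(':')[-1].tokenize('_')[0].toLowerCase()`. The sketch below re-expresses that Groovy expression in Python purely for illustration; `publish_subdir` is a hypothetical helper and not part of the pipeline.

```python
def publish_subdir(process_name: str) -> str:
    """Python stand-in for the Groovy publishDir expression used in the config."""
    # tokenize(':')[-1] keeps the last component of the fully qualified name.
    # Groovy's tokenize() drops empty tokens, so empty strings are filtered out.
    simple_name = [tok for tok in process_name.split(":") if tok][-1]
    # tokenize('_')[0] keeps the tool prefix, e.g. BCFTOOLS from BCFTOOLS_CONCAT.
    tool = [tok for tok in simple_name.split("_") if tok][0]
    return tool.lower()


if __name__ == "__main__":
    for name in (
        "NFCORE_PHASEIMPUTE:PHASEIMPUTE:MAKE_CHUNKS:GLIMPSE_CHUNK",
        "NFCORE_PHASEIMPUTE:PHASEIMPUTE:MAKE_CHUNKS:BCFTOOLS_CONVERT",
        "NFCORE_PHASEIMPUTE:PHASEIMPUTE:IMPUTE_QUILT:QUILT_QUILT",
        "NFCORE_PHASEIMPUTE:PHASEIMPUTE:VCF_CONCATENATE_BCFTOOLS:BCFTOOLS_CONCAT",
    ):
        print(name.split(":")[-1], "->", publish_subdir(name))
```

Combined with the suffixes the config appends, this gives `quilt_impute/glimpse_chunk` for `GLIMPSE_CHUNK`, `quilt_impute/bcftools/convert` for `BCFTOOLS_CONVERT`, `quilt_impute/quilt` for `QUILT_QUILT`, and `quilt_impute/bcftools/concat` for `BCFTOOLS_CONCAT`.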
34 changes: 34 additions & 0 deletions conf/test_quilt.config
@@ -0,0 +1,34 @@
/*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    Nextflow config file for running minimal tests
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    Defines input files and everything required to run a fast and simple pipeline test.

    Use as follows:
        nextflow run nf-core/phaseimpute -profile test_quilt,<docker/singularity> --outdir <OUTDIR>

----------------------------------------------------------------------------------------
*/

params {
    config_profile_name = 'Minimal Quilt Test profile'
    config_profile_description = 'Minimal test dataset to check pipeline function using the tool QUILT'

    // Limit resources so that this can run on GitHub Actions
    max_cpus = 2
    max_memory = '2.GB'
    max_time = '1.h'

    // Input data
    input = "${projectDir}/tests/csv/sample_bam.csv"
    input_region = "${projectDir}/tests/csv/region.csv"

    // Genome references
    fasta = "https://raw.githubusercontent.com/nf-core/test-datasets/phaseimpute/data/reference_genome/21_22/hs38DH.chr21_22.fa"
    panel = "${projectDir}/tests/csv/panel.csv"
    phased = true

    // Impute parameters
    step = "impute"
    tools = "quilt"
}
45 changes: 28 additions & 17 deletions docs/output.md
@@ -12,37 +12,48 @@ The directories listed below will be created in the results directory after the

<!-- TODO nf-core: Write this documentation describing your workflow's output -->

## Pipeline overview
## Pipeline overview: QUILT imputation mode

The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps:
The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps:

- [FastQC](#fastqc) - Raw read QC
- [Glimpse Chunk](#glimpse-chunk) - Create chunks of the reference panel
- [Remove Multiallelics](#multiallelics) - Remove multiallelic sites from the reference panel
- [Convert](#convert) - Convert reference panel to .hap and .legend files
- [QUILT](#quilt) - Perform imputation
- [Concatenate](#concatenate) - Concatenate all imputed chunks into a single VCF.
- [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution
- [MultiQC](#multiqc) - Aggregate report describing results and QC from the whole pipeline
- [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution

### FastQC
### Glimpse Chunk

<details markdown="1">
<summary>Output files</summary>
- `quilt_impute/glimpse_chunk/`
  - `*.txt`: TXT file listing the chunks produced by Glimpse chunk.

- `fastqc/`
- `*_fastqc.html`: FastQC report containing quality metrics.
- `*_fastqc.zip`: Zip archive containing the FastQC report, tab-delimited data file and plot images.
[Glimpse chunk](https://odelaneau.github.io/GLIMPSE/) defines the chunks in which imputation is run. For further reading and documentation, see the [Glimpse documentation](https://odelaneau.github.io/GLIMPSE/glimpse1/commands.html). Once you have generated the chunks for your reference panel, you can skip the reference preparation step and submit this file directly for imputation.
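
As a rough illustration of what a chunk file carries, the sketch below parses chunk rows into chromosome and coordinate fields. It assumes a tab-separated layout of chunk ID, chromosome, buffered region, imputation region, and two size columns, with regions written as `chr:start-end`; that layout, the `Chunk` type, and the `parse_chunks` helper are assumptions for illustration, so check them against the file your run actually produces.

```python
import csv
import io
from typing import Iterator, NamedTuple


class Chunk(NamedTuple):
    chunk_id: str
    chrom: str
    start: int
    end: int


def parse_chunks(text: str) -> Iterator[Chunk]:
    """Yield one Chunk per non-empty row of a tab-separated chunk file."""
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row:
            continue  # skip blank lines
        # Assumed column order: ID, chromosome, buffered region, imputation region.
        chunk_id, chrom, _buffered, imputation = row[:4]
        # Regions are assumed to look like "chr21:16570000-16600000".
        span = imputation.rsplit(":", 1)[1]
        start, end = span.split("-")
        yield Chunk(chunk_id, chrom, int(start), int(end))


# Made-up rows for illustration only; not real pipeline output.
EXAMPLE = (
    "0\tchr21\tchr21:16570000-16610000\tchr21:16570000-16600000\t30001\t120\n"
    "1\tchr21\tchr21:16590000-16650000\tchr21:16600001-16650000\t50000\t200\n"
)

if __name__ == "__main__":
    for chunk in parse_chunks(EXAMPLE):
        print(chunk)
```

Keeping the coordinates as integers makes it straightforward to pass a start and end position per chunk to a downstream imputation step.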

</details>
### Convert

- `quilt_impute/bcftools/convert/`
  - `*.hap`: a .hap file for the reference panel.
  - `*.legend*`: a .legend file for the reference panel.

[bcftools](https://samtools.github.io/bcftools/bcftools.html) converts VCF files to .hap and .legend files. A .samples file is also generated. Once you have generated the .hap and .legend files for your reference panel, you can skip the reference preparation step and submit these files directly for imputation (to be developed).

### QUILT

[FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the [FastQC help pages](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/).
- `quilt_impute/quilt/`
  - `quilt.*.vcf.gz`: Imputed VCF for a specific chunk.
  - `quilt.*.vcf.gz.tbi`: TBI index for the imputed VCF of a specific chunk.

![MultiQC - FastQC sequence counts plot](images/mqc_fastqc_counts.png)
[QUILT](https://github.com/rwdavies/QUILT) performs the imputation. This directory contains one imputed VCF per chunk.

![MultiQC - FastQC mean quality scores plot](images/mqc_fastqc_quality.png)
### Concatenate

![MultiQC - FastQC adapter content plot](images/mqc_fastqc_adapter.png)
- `quilt_impute/bcftools/concat/`
  - `*.vcf.gz`: Imputed and ligated VCF for all the input samples.

:::note
The FastQC plots displayed in the MultiQC report shows _untrimmed_ reads. They may contain adapter sequence and potentially regions with low quality.
:::
[bcftools concat](https://samtools.github.io/bcftools/bcftools.html) produces a single VCF by ligating the per-chunk imputed VCFs.
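
Because `bcftools concat` expects its inputs in positional order, the per-chunk files need to be ordered by region start before they are ligated. The sketch below shows one way to do that, assuming chunk files are named `quilt.<chrom>.<start>.<end>.vcf.gz`; that naming scheme and the helper names are assumptions for illustration, and the pipeline's own ordering logic is not shown in this diff.

```python
import re
from typing import List, Tuple

# Assumed naming scheme: quilt.<chrom>.<start>.<end>.vcf.gz
CHUNK_NAME = re.compile(
    r"^quilt\.(?P<chrom>[^.]+)\.(?P<start>\d+)\.(?P<end>\d+)\.vcf\.gz$"
)


def chunk_key(filename: str) -> Tuple[str, int]:
    """Return (chromosome, numeric start) so starts sort as numbers, not text."""
    match = CHUNK_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected chunk file name: {filename}")
    return match["chrom"], int(match["start"])


def order_chunks(filenames: List[str]) -> List[str]:
    """Order chunk files by chromosome name, then by region start."""
    # Chromosome names are compared as plain strings here, which only groups
    # chunks per chromosome; it does not impose a karyotypic order.
    return sorted(filenames, key=chunk_key)


if __name__ == "__main__":
    files = [
        "quilt.chr21.10000000.10999999.vcf.gz",
        "quilt.chr21.9000000.9999999.vcf.gz",
    ]
    print(order_chunks(files))
```

A plain string sort would place the `10000000` chunk before the `9000000` one, which is why the start coordinate is converted to an integer first.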

### MultiQC
