nf-core · atrigila · Apr 20, 2024 · Mar 27, 2024
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -30,6 +30,7 @@ jobs:
         TEST_PROFILE:
           - "test"
           - "test_sim"
+          - "test_quilt"
     steps:
       - name: Check out pipeline code
         uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -18,6 +18,7 @@ Initial release of nf-core/phaseimpute, created with the [nf-core](https://nf-co
   - Test impute and test sim works
 - [#19](https://github.com/nf-core/phaseimpute/pull/19) - Changed reference panel to accept a csv, update modules and subworkflows (glimpse1/2 and shapeit5)
 - [#20](https://github.com/nf-core/phaseimpute/pull/20) - Added automatic detection of vcf contigs for the reference panel and automatic renaming available
+- [#26](https://github.com/nf-core/phaseimpute/pull/26) - Added QUILT method
 
 ### `Fixed`
 

diff --git a/CITATIONS.md b/CITATIONS.md
@@ -10,9 +10,21 @@
 
 ## Pipeline tools
 
-- [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
+- [QUILT](https://pubmed.ncbi.nlm.nih.gov/34083788/)
 
-  > Andrews, S. (2010). FastQC: A Quality Control Tool for High Throughput Sequence Data [Online].
+  > Davies, R. W., Kucka, M., Su, D., Shi, S., Flanagan, M., Cunniff, C. M., ... & Myers, S. (2021). Rapid genotype imputation from sequence with reference panels. Nature Genetics, 53(7), 1104-1111.
+
+- [GLIMPSE](https://www.nature.com/articles/s41588-020-00756-0)
+
+  > Rubinacci, S., Ribeiro, D. M., Hofmeister, R. J., & Delaneau, O. (2021). Efficient phasing and imputation of low-coverage sequencing data using large reference panels. Nature Genetics, 53(1), 120-126.
+
+- [SHAPEIT5](https://odelaneau.github.io/shapeit5/)
+
+  > Hofmeister, R. J., Ribeiro, D. M., Rubinacci, S., & Delaneau, O. (2023). Accurate rare variant phasing of whole-genome and whole-exome sequencing data in the UK Biobank. Nature Genetics. doi: https://doi.org/10.1038/s41588-023-01415-w
+
+- [bcftools](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3198575/)
+
+  > Li, H. (2011). A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics, 27(21), 2987-2993.
 
 - [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)
 

diff --git a/README.md b/README.md
@@ -109,6 +109,14 @@ For further information or help, don't hesitate to get in touch on the [Slack `#
 
 <!-- TODO nf-core: Add bibliography of tools and data used in your pipeline -->
 
+You can cite one of the main imputation methods ([`QUILT`](https://github.com/rwdavies/QUILT)) as follows:
+
+> **Rapid genotype imputation from sequence with reference panels.**
+>
+> Davies, R. W., Kucka, M., Su, D., Shi, S., Flanagan, M., Cunniff, C. M., Chan, Y. F., & Myers, S.
+>
+> _Nature Genetics_ 2021 June 03. doi: [10.1038/s41588-021-00877-0](https://doi.org/10.1038/s41588-021-00877-0)
+
 An extensive list of references for the tools used by the pipeline can be found in the [`CITATIONS.md`](CITATIONS.md) file.
 
 You can cite the `nf-core` publication as follows:

diff --git a/conf/quilt_subworkflow.config b/conf/quilt_subworkflow.config
@@ -0,0 +1,122 @@
+/*
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+    Config file for defining DSL2 per module options and publishing paths
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+    Available keys to override module options:
+        ext.args   = Additional arguments appended to command in module.
+        ext.args2  = Second set of arguments appended to command in module (multi-tool modules).
+        ext.args3  = Third set of arguments appended to command in module (multi-tool modules).
+        ext.prefix = File name prefix for output files.
+----------------------------------------------------------------------------------------
+*/
+
+process {
+
+    withName: CUSTOM_DUMPSOFTWAREVERSIONS {
+        publishDir = [
+            path: { "${params.outdir}/pipeline_info" },
+            mode: params.publish_dir_mode,
+            pattern: '*_versions.yml'
+        ]
+    }
+
+    withName: 'NFCORE_PHASEIMPUTE:PHASEIMPUTE:MAKE_CHUNKS:GLIMPSE_CHUNK' {
+
+        ext.prefix = { "${meta.id}_${meta.chr}" }
+
+        publishDir = [
+            [
+                path: { "${params.outdir}/quilt_impute/${task.process.tokenize(':')[-1].tokenize('_')[0].toLowerCase()}_chunk" },
+                mode: params.publish_dir_mode,
+            ],
+
+
+        ]
+    }
+
+    withName: 'NFCORE_PHASEIMPUTE:PHASEIMPUTE:MAKE_CHUNKS:BCFTOOLS_INDEX' {
+        cpus          = 2
+        memory        = 400.MB
+        maxRetries    = 2
+    }
+
+    withName: 'NFCORE_PHASEIMPUTE:PHASEIMPUTE:MAKE_CHUNKS:BCFTOOLS_INDEX_2' {
+        ext.args      = '--tbi'
+        cpus          = 2
+        memory        = 400.MB
+        maxRetries    = 2
+    }
+
+    withName: 'NFCORE_PHASEIMPUTE:PHASEIMPUTE:MAKE_CHUNKS:BCFTOOLS_INDEX_3' {
+        ext.args      = '--tbi'
+        cpus          = 2
+        memory        = 400.MB
+        maxRetries    = 2
+    }
+
+    withName: 'NFCORE_PHASEIMPUTE:PHASEIMPUTE:MAKE_CHUNKS:BCFTOOLS_VIEW' {
+        ext.args      = '-v snps -Oz'
+        ext.prefix    = { "${meta.id}_${meta.chr}_biallelic" }
+        cpus          = 2
+        memory        = 400.MB
+        maxRetries    = 2
+    }
+
+    withName: 'NFCORE_PHASEIMPUTE:PHASEIMPUTE:MAKE_CHUNKS:BCFTOOLS_NORM' {
+        ext.args      = '-m +any --output-type z'
+        ext.prefix    = { "${meta.id}_${meta.chr}_multiallelic" }
+        cpus          = 2
+        memory        = 400.MB
+        maxRetries    = 2
+    }
+
+    withName: 'NFCORE_PHASEIMPUTE:PHASEIMPUTE:MAKE_CHUNKS:BCFTOOLS_CONVERT' {
+        ext.args      = '--haplegendsample test'
+        ext.prefix    = { "${meta.id}_${meta.chr}_convert" }
+        cpus          = 2
+        memory        = 400.MB
+        maxRetries    = 2
+
+        publishDir = [
+            [
+                path: { "${params.outdir}/quilt_impute/${task.process.tokenize(':')[-1].tokenize('_')[0].toLowerCase()}/convert" },
+                mode: params.publish_dir_mode,
+            ],
+        ]
+    }
+
+    withName: 'NFCORE_PHASEIMPUTE:PHASEIMPUTE:IMPUTE_QUILT:QUILT_QUILT' {
+        publishDir = [
+            [
+                path: { "${params.outdir}/quilt_impute/${task.process.tokenize(':')[-1].tokenize('_')[0].toLowerCase()}" },
+                mode: params.publish_dir_mode,
+            ],
+        ]
+    }
+
+    withName: 'NFCORE_PHASEIMPUTE:PHASEIMPUTE:IMPUTE_QUILT:BCFTOOLS_INDEX' {
+        ext.args     = {[
+                        "--tbi",
+                        ].join(" ").trim()}
+    }
+
+
+    withName: 'NFCORE_PHASEIMPUTE:PHASEIMPUTE:VCF_CONCATENATE_BCFTOOLS:BCFTOOLS_CONCAT' {
+        ext.args = {[
+            "--ligate",
+            "--output-type z",
+        ].join(" ").trim()}
+
+        cpus = 2
+        memory = 1.GB
+        maxRetries    = 2
+
+        publishDir = [
+            [
+                path: { "${params.outdir}/quilt_impute/${task.process.tokenize(':')[-1].tokenize('_')[0].toLowerCase()}/concat" },
+                mode: params.publish_dir_mode,
+            ],
+        ]
+    }
+
+}
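
Reviewer note on the `publishDir` paths in `conf/quilt_subworkflow.config`: each path is derived from `task.process` via `tokenize(':')[-1].tokenize('_')[0].toLowerCase()`, which is easy to misread. The sketch below mirrors that expression in Python to show which output directories it yields. It assumes `task.process` holds the fully qualified process name exactly as written in the `withName` selectors; the helper name `publish_subdir` is hypothetical and not part of the pipeline.

```python
def publish_subdir(process_name: str) -> str:
    """Python mirror of task.process.tokenize(':')[-1].tokenize('_')[0].toLowerCase()."""
    # Groovy's tokenize() drops empty tokens, so empty pieces are filtered here too.
    last_scope = [t for t in process_name.split(":") if t][-1]  # e.g. GLIMPSE_CHUNK
    tool = [t for t in last_scope.split("_") if t][0]           # e.g. GLIMPSE
    return tool.lower()


# Process names and path suffixes copied from the config above.
# "{}" marks where the derived tool name is substituted.
selectors = {
    "NFCORE_PHASEIMPUTE:PHASEIMPUTE:MAKE_CHUNKS:GLIMPSE_CHUNK": "{}_chunk",
    "NFCORE_PHASEIMPUTE:PHASEIMPUTE:MAKE_CHUNKS:BCFTOOLS_CONVERT": "{}/convert",
    "NFCORE_PHASEIMPUTE:PHASEIMPUTE:IMPUTE_QUILT:QUILT_QUILT": "{}",
    "NFCORE_PHASEIMPUTE:PHASEIMPUTE:VCF_CONCATENATE_BCFTOOLS:BCFTOOLS_CONCAT": "{}/concat",
}

for name, suffix in selectors.items():
    # Path relative to params.outdir
    print("quilt_impute/" + suffix.format(publish_subdir(name)))
```

Under these assumptions the four published directories are `quilt_impute/glimpse_chunk`, `quilt_impute/bcftools/convert`, `quilt_impute/quilt` and `quilt_impute/bcftools/concat`, which is worth cross-checking against the directories listed in `docs/output.md`.
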
diff --git a/conf/test_quilt.config b/conf/test_quilt.config
@@ -0,0 +1,34 @@
+/*
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+    Nextflow config file for running minimal tests
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+    Defines input files and everything required to run a fast and simple pipeline test.
+
+    Use as follows:
+        nextflow run nf-core/phaseimpute -profile test_quilt,<docker/singularity> --outdir <OUTDIR>
+
+----------------------------------------------------------------------------------------
+*/
+
+params {
+    config_profile_name        = 'Minimal Quilt Test profile'
+    config_profile_description = 'Minimal test dataset to check pipeline function using the tool QUILT'
+
+    // Limit resources so that this can run on GitHub Actions
+    max_cpus   = 2
+    max_memory = '2.GB'
+    max_time   = '1.h'
+
+    // Input data
+    input        = "${projectDir}/tests/csv/sample_bam.csv"
+    input_region = "${projectDir}/tests/csv/region.csv"
+
+    // Genome references
+    fasta  = "https://raw.githubusercontent.com/nf-core/test-datasets/phaseimpute/data/reference_genome/21_22/hs38DH.chr21_22.fa"
+    panel  = "${projectDir}/tests/csv/panel.csv"
+    phased = true
+
+    // Impute parameters
+    step   = "impute"
+    tools  = "quilt"
+}
diff --git a/docs/output.md b/docs/output.md
@@ -12,37 +12,48 @@ The directories listed below will be created in the results directory after the
 
 <!-- TODO nf-core: Write this documentation describing your workflow's output -->
 
-## Pipeline overview
+## Pipeline overview: QUILT imputation mode
 
-The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps:
 The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps:
 
-- [FastQC](#fastqc) - Raw read QC
+- [Glimpse Chunk](#glimpse-chunk) - Create chunks of the reference panel
+- [Remove Multiallelics](#multiallelics) - Remove multiallelic sites from the reference panel
+- [Convert](#convert) - Convert reference panel to .hap and .legend files
+- [QUILT](#quilt) - Perform imputation
+- [Concatenate](#concatenate) - Concatenate all imputed chunks into a single VCF.
+- [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution
 - [MultiQC](#multiqc) - Aggregate report describing results and QC from the whole pipeline
 - [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution
 
-### FastQC
+### Glimpse Chunk
 
-<details markdown="1">
-<summary>Output files</summary>
+- `quilt_impute/glimpse_chunk/`
+  - `*.txt`: TXT file containing the chunks obtained from running GLIMPSE chunk.
 
-- `fastqc/`
-  - `*_fastqc.html`: FastQC report containing quality metrics.
-  - `*_fastqc.zip`: Zip archive containing the FastQC report, tab-delimited data file and plot images.
+[GLIMPSE chunk](https://odelaneau.github.io/GLIMPSE/) defines the chunks in which imputation is run. For further reading and documentation, see the [GLIMPSE documentation](https://odelaneau.github.io/GLIMPSE/glimpse1/commands.html). Once you have generated the chunks for your reference panel, you can skip the reference preparation step and submit this file directly for imputation.
 
-</details>
+### Convert
+
+- `quilt_impute/bcftools/convert/`
+  - `*.hap`: a .hap file for the reference panel.
+  - `*.legend*`: a .legend file for the reference panel.
+
+[bcftools](https://samtools.github.io/bcftools/bcftools.html) converts VCF files to .hap and .legend files. A .samples file is also generated. Once you have generated the .hap and .legend files for your reference panel, you can skip the reference preparation step and submit these files directly for imputation (to be developed).
+
+### QUILT
 
-[FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the [FastQC help pages](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/).
+- `quilt_impute/quilt/`
+  - `quilt.*.vcf.gz`: Imputed VCF for a specific chunk.
+  - `quilt.*.vcf.gz.tbi`: TBI index for the imputed VCF of a specific chunk.
 
-![MultiQC - FastQC sequence counts plot](images/mqc_fastqc_counts.png)
+[QUILT](https://github.com/rwdavies/QUILT) performs the imputation. This directory contains the imputed VCF for each chunk.
 
-![MultiQC - FastQC mean quality scores plot](images/mqc_fastqc_quality.png)
+### Concatenate
 
-![MultiQC - FastQC adapter content plot](images/mqc_fastqc_adapter.png)
+- `quilt_impute/bcftools/concat/`
+  - `*.vcf.gz`: Imputed and ligated VCF for all the input samples.
 
-:::note
-The FastQC plots displayed in the MultiQC report shows _untrimmed_ reads. They may contain adapter sequence and potentially regions with low quality.
-:::
+[bcftools concat](https://samtools.github.io/bcftools/bcftools.html) produces a single VCF by ligating the imputed per-chunk VCFs.
 
 ### MultiQC