Merge pull request #494 from nf-core/bs-seq-primer

add bs-seq-primer doc
nf-core · Dec 16, 2024 · f68d468 · f68d468
2 parents 27ec6f4 + 86a0cc2
commit f68d468
Show file tree

Hide file tree

Showing 6 changed files with 149 additions and 4 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -24,6 +24,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - 🔧 Add CI support for pipeline-level bwameth GPU Tests [#481](https://github.com/nf-core/methylseq/pull/478)
 - 🔧 create a test for samplesheet with technical replicates [#477](https://github.com/nf-core/methylseq/pull/486)
 - 🔧 Update README, docs/usage and docs/output docs [#487](https://github.com/nf-core/methylseq/pull/489)
+- 🔧 Add Bisulfite Sequencing & Three-Base Aligners primer doc [#405](https://github.com/nf-core/methylseq/pull/494)
 
 ## [v2.7.1](https://github.com/nf-core/methylseq/releases/tag/2.7.1) - [2024-10-27]
 

diff --git a/README.md b/README.md
@@ -27,6 +27,8 @@ The pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool
 
 On release, automated continuous integration tests run the pipeline on a full-sized dataset on the AWS cloud infrastructure. This ensures that the pipeline runs on AWS, has sensible resource allocation defaults set to run on real-world datasets, and permits the persistent storage of results to benchmark between pipeline releases and other analysis sources.The results obtained from the full-sized test can be viewed on the [nf-core website](https://nf-co.re/methylseq/results).
 
+> Read more about **Bisulfite Sequencing & Three-Base Aligners** used in this pipeline [here](./docs/bs-seq-primer.md)
+
 ## Pipeline Summary
 
 The pipeline allows you to choose between running either [Bismark](https://github.com/FelixKrueger/Bismark) or [bwa-meth](https://github.com/brentp/bwa-meth) / [MethylDackel](https://github.com/dpryan79/methyldackel).

diff --git a/docs/bs-seq-primer.md b/docs/bs-seq-primer.md
@@ -0,0 +1,132 @@
+# Bisulfite Sequencing & Three-Base Aligners Primer
+
+[Bisulfite sequencing](https://github.com/nf-core/methylseq) (**BS-seq**) is a widely-used technique to investigate **DNA methylation**, a crucial epigenetic modification that regulates gene expression without altering the DNA sequence.
+
+## **Principle of Bisulfite Sequencing**
+
+1. **Bisulfite Treatment**:
+
+   - During bisulfite treatment, **non-methylated cytosines** in DNA are converted to **uracils**, while **methylated cytosines** remain unaffected
+   - This chemical reaction forms the foundation for distinguishing methylated and unmethylated cytosines
+
+2. **PCR Amplification**:
+
+   - After bisulfite treatment, the DNA undergoes PCR amplification
+   - During this step, uracils are converted into **thymines**
+   - This conversion results in a reduced complexity of the DNA code, introducing computational challenges in downstream analysis
+
+3. **DNA Strands in Sequencing**:
+   - For a given genomic locus, bisulfite treatment and PCR amplification produce **four distinct DNA strands**:
+     - Bisulfite-converted forward strand (OT)
+     - Reverse complement of OT (CTOT)
+     - Bisulfite-converted reverse strand (OB)
+     - Reverse complement of OB (CTOB)
+   - Depending on library type all four strands may end up in a sequencing library, adding to the complexity of analysis
+
+## **Applications of Bisulfite Sequencing**
+
+Bisulfite sequencing is employed to study:
+
+- **Genome-wide DNA methylation patterns**: Understanding epigenetic regulation of genes
+- **Differential methylation**: Investigating differences between different samples (e.g. healthy versus diseased states)
+- **Epigenetic inheritance**: Studying methylation changes across generations
+
+## **Challenges in Mapping Bisulfite-Sequenced Reads**
+
+Mapping bisulfite-treated sequences to a reference genome presents several computational challenges:
+
+1. **Reduced DNA Complexity**:
+   - Due to the conversion of cytosines to thymines, the DNA code becomes less diverse, increasing the likelihood of ambiguous alignments
+2. **Multiple DNA Strands**:
+   - The presence of four possible DNA strands (and their combinations) for each genomic locus increases the alignment search space
+3. **Variable Methylation States**:
+   - Each read can theoretically represent any possible methylation state for a locus, further complicating the alignment process
+
+## Three-Base Aligners: Bismark and BWA-Meth
+
+Bisulfite sequencing converts unmethylated cytosines into uracils (later read as thymines during PCR), leading to a DNA code represented predominantly by three bases (A, G, and T).
+
+Aligning these three-base reads against a standard reference genome is non-trivial, as the original strand identity and methylation state are obscured by the conversion.
+
+To address these challenges, specialized “three-base aligners” have been developed to accurately map bisulfite-treated reads and infer their original strand context and methylation status. Here, we primarily summarise the aligners used in the nf-core/methylseq pipeline.
+
+### Bismark ([docs](https://felixkrueger.github.io/Bismark/); [publication](https://pmc.ncbi.nlm.nih.gov/articles/PMC3102221/))
+
+- Bismark resolves strand ambiguity by performing up to four parallel alignments
+- First, sequencing reads are converted _in silico_ to represent both forward and reverse strand conversions (C-to-T and G-to-A), mirroring fully bisulfite-converted versions of the reference genome
+- Each read set is then aligned using Bowtie2 (alternatively HISAT2 or minimap2) against equally converted references; this enables support for indels, local alignments, and bisulfite converted RNA-seq-type or long reads (e.g. EM-seq using Nanopore or Pac Bio reads)
+- By comparing these (up-to) four alignments, Bismark identifies each read’s correct strand origin
+
+This approach allows Bismark to handle directional, PBAT, amplicon, and non-directional libraries robustly and to accurately align reads that represent partially methylated cytosines.
+
+### BWA-Meth ([docs](https://github.com/brentp/bwa-meth); [publication](https://arxiv.org/abs/1401.1129))
+
+- BWA-Meth adapts the BWA-MEM algorithm for bisulfite data, providing efficient, flexible alignments with support for indels, local alignments, and streaming-based workflows.
+- By converting reads _in silico_ on the fly, BWA-Meth eliminates the need for intermediate files, reducing temporary storage requirements and simplifying the overall process.
+- [Under the hood](https://github.com/brentp/bwa-meth/blob/master/bwameth.py), BWA-Meth maps bisulfite-converted reads to an _in silico_ converted reference. A typical command:
+
+```bash
+python bwameth.py --reference ref.fa A.fq B.fq
+```
+
+is transparently translated into a command that pipes converted reads directly to bwa mem (or bwa-mem2) without creating temporary files:
+
+```bash
+bwa mem -pCMR ref.fa.bwameth.c2t '<python bwameth.py c2t A.fq B.fq'
+# or, if using BWA-MEM2 indexing:
+bwa-mem2 mem -pCMR ref.fa.bwameth.c2t '<python bwameth.py c2t A.fq B.fq'
+```
+
+Here, **`A.fq`** is converted **`C-to-T`** and **`B.fq`** is converted **`G-to-A`** on the fly, and both are streamed into the aligner. The output is the standard `SAM` alignment file.
+
+## Aligner Feature/Attribute Table
+
+| Feature/Attribute                     | **Bismark**                                                                                                                                              | **BWA-Meth**                                                                                                                           |
+| ------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------- |
+| **Core algorithm & approach**         | Uses Bowtie2/HISAT2/minimap2 for 3-letter alignments                                                                                                     | Uses BWA-MEM for 3-letter alignments                                                                                                   |
+| **Computational efficiency**          | More computationally intensive                                                                                                                           | Generally faster and more efficient due to single alignment step                                                                       |
+| **Output format & methylation calls** | Produces SAM/BAM, deduplication, direct methylation calls (CpG, CHG, CHH), bedGraph/coverage and cytosine context files with integrated QC and reporting | Produces standard SAM/BAM; requires external tools (e.g., MethylDackel) for methylation calling and QC                                 |
+| **Downstream analysis**               | Built-in methylation calling, filtering and (context-) reporting options                                                                                 | Requires external tools for methylation calling, offering more customization but added complexity                                      |
+| **Recommended when**                  | Accuracy and detailed methylation context are top priority, and ample computational resources are available                                              | Speed (including GPU support), scalability, and modular workflows are essential, or when integrating with existing BWA-based pipelines |
+
+**References**:
+
+- **Bismark [paper](https://pmc.ncbi.nlm.nih.gov/articles/PMC3102221/) and [official docs](https://felixkrueger.github.io/Bismark/)**
+- **BWA-Meth [paper](https://arxiv.org/abs/1401.1129) and [GitHub](https://github.com/brentp/bwa-meth)**
+- **[NVIDIA Parabricks `fq2bam_meth`](https://docs.nvidia.com/clara/parabricks/latest/documentation/tooldocs/man_fq2bam_meth.html)**
+
+## Additional Considerations for BS-seq:
+
+#### Incomplete conversion and controls:
+
+- **Incomplete bisulfite conversion**: Not all cytosines may be converted to uracil (and eventually to thymine) during the bisulfite treatment step. This partial conversion can lead to overestimation of methylation levels
+- **Spike-In controls**: Often, fully unmethylated lambda DNA or other known-conversion controls are included to estimate and correct for the efficiency of the bisulfite reaction
+
+#### Quality Control (QC) Metrics:
+
+- **M-bias Plots**: Analysis tools can generate M-bias plots to visualize methylation bias across read positions. This helps identify technical artefacts, or non-uniform conversion at biased positiones at the ends of reads
+- **Adapter- and quality-trimming**: Even more than standard sequencing data, bisulfite reads benefit from adapter removal and quality filtering. Tools like Trim Galore (often bundled with Bismark) ensure higher-quality alignments and more accurate methylation calls
+
+#### Post-alignment deduplication and bias correction:
+
+- **PCR duplication removal**: Bisulfite sequencing libraries often contain PCR duplicates. Removing these duplicates is crucial to avoid artificially inflated methylation signals (relevant for statistics in downstream analysis)
+
+#### Context-specific methylation:
+
+- While the focus often lies on CpG methylation (the most studied context in mammals), accurate mapping and downstream tools can also call methylation in CHG and CHH contexts, which are different sequence environments in which cytosine nucleotides can be found (e.g., “CpG” means a cytosine next to a guanine, while “CHG” and “CHH” indicate cytosines followed by different non-guanine nucleotides), and methylation patterns can vary depending on these contexts. This is particularly relevant for plant genomes where non-CpG methylation is biologically significant.
+
+Note: CpG, CHG, and CHH contexts.
+
+#### Large-scale analysis and cloud computing:
+
+- As whole-genome bisulfite sequencing (WGBS) datasets grow larger, computational efficiency, scalability, and memory usage become critical factors
+- **Cloud-based computing**: Tools that integrate into cloud-based workflows (e.g., AWS, GCP) and employ parallelization or GPU-acceleration (e.g., NVIDIA Parabricks for `fq2bam_meth`) can significantly reduce runtime for large projects
+
+#### Long-read bisulfite sequencing:
+
+- Although short-read Illumina sequencing predominates, emerging methods such as enzymatically converted methylation sequencing (EM-seq) are possible using long-read platforms (e.g., PacBio or Oxford Nanopore)
+- Aligning and calling methylation from longer reads introduces different challenges and opportunities, including better resolution of repetitive regions and phasing of haplotypes
+
+#### Experimental Design and Biological Replicates:
+
+- Careful experimental design, including replicates and controls, is essential. BS-seq can be influenced by technical variability, and replication ensures statistical robustness in methylation calling and differential methylation analyses
diff --git a/docs/output.md b/docs/output.md
@@ -25,7 +25,7 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
 - [MultiQC](#multiqc) - Aggregate report describing results and QC from the whole pipeline
 - [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution
 
-### Output Directories
+### Example output directories of nf-core/methylseq `-test` profile run
 
 #### Bismark
 
@@ -83,6 +83,8 @@ bwameth/
     └── logs
 ```
 
+### Detailed Output Descriptions
+
 ### FastQC
 
 <details markdown="1">

diff --git a/docs/usage.md b/docs/usage.md
@@ -16,6 +16,8 @@
 
 The nf-core/methylseq pipeline provides two distinct workflows for DNA methylation analysis. These workflows support different aligners and cater to a range of computational requirements.
 
+> Read more about **Bisulfite Sequencing & Three-Base Aligners** used in this pipeline [here](./docs/bs-seq-primer.md)
+
 ### Workflow: Bismark
 
 By default, the nf-core/methylseq pipeline uses [Bismark](http://www.bioinformatics.babraham.ac.uk/projects/bismark/) with [Bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) as the alignment tool. This configuration is optimized for most DNA methylation workflows and will run unless an alternative aligner is specified.
@@ -153,9 +155,15 @@ You can also generate such `YAML`/`JSON` files via [nf-core/launch](https://nf-c
 
 Additional arguments can be appended to a command in a module by specifying them within the module’s custom configuration. The configurations for modules and subworkflows used in the pipeline can be found in `conf/modules` or `conf/subworkflows`. A module’s publishDir path can also be customized in these configurations.
 
-For example, users working with unfinished genomes containing tens or even hundreds of thousands of scaffolds, contigs, or chromosomes often encounter errors when pre-sorting reads into individual chromosome files. These errors are typically caused by the operating system’s limit on the number of file handles that can be open simultaneously (usually 1024; to find out this limit on Linux, use the command: ulimit -a).
+For example, users working with unfinished genomes containing tens or even hundreds of thousands of scaffolds, contigs, or chromosomes often encounter errors when pre-sorting reads into individual chromosome files.
+
+These errors are typically caused by the operating system’s limit on the number of file handles that can be open simultaneously (usually 1024; to find out this limit on Linux, use the command: ulimit -a).
+
+To bypass this limitation, the `--scaffolds` option can be added as an additional `ext.args` in `conf/modules/bismark_methylationextractor.config`.
+
+This prevents methylation calls from being pre-sorted into individual chromosome files.
 
-To bypass this limitation, the `--scaffolds` option can be added as an additional `ext.args` in `conf/modules/bismark_methylationextractor.config`. This prevents methylation calls from being pre-sorted into individual chromosome files. Instead, all input files are temporarily merged into a single file (unless there is only one file), which is then sorted by both chromosome and position using the Unix sort command.
+Instead, all input files are temporarily merged into a single file (unless there is only one file), which is then sorted by both chromosome and position using the Unix sort command.
 
 For a detailed list of different options available, please refer to the official docs of:
 

diff --git a/nextflow.config b/nextflow.config
@@ -317,7 +317,7 @@ manifest {
         ],
         [
             name: 'Edmund Miller',
-            affiliation: 'Department of Biological Sciences and Center for Systems Biology, University of Texas at Dallas, 800 W Campbell Rd, Richardson, TX 75080, USA',
+            affiliation: 'Department of Biological Sciences and Center for Systems Biology, University of Texas at Dallas, Texas, USA',
             email: '[email protected]',
             github: 'https://github.com/edmundmiller',
             contribution: ['maintainer', 'contributor'], // List of contribution types ('author', 'maintainer' or 'contributor')