Metagenomics Remove Human Reads #93

Open: wants to merge 30 commits into base branch DEV_Metagenomics_rmHR_NF_conversion

Commits (30):
96b830a  Add files via upload (kieranmbrown, Apr 2, 2024)
919d2d2  Update Remove_Human_reads.config (kieranmbrown, Apr 4, 2024)
4259bc6  Create README.md (kieranmbrown, Apr 4, 2024)
1e3a7b3  moved (kieranmbrown, Apr 4, 2024)
4deeeae  moved (kieranmbrown, Apr 4, 2024)
9acfc4b  moved (kieranmbrown, Apr 4, 2024)
6382937  Create Estimate_Host_Reads.nf (kieranmbrown, Apr 16, 2024)
f33b5a0  Add files via upload (kieranmbrown, Apr 16, 2024)
cc40dd0  Rename Estimate_Host_Reads.nf to Estimate_Host_Reads.nf (kieranmbrown, Apr 16, 2024)
9e0f585  Update and rename Estimate_Host_reads.config to Estimate_Host_Reads.c… (kieranmbrown, Apr 16, 2024)
dd3e06b  Rename reference-database-info.md to reference-database-info.md (kieranmbrown, Apr 16, 2024)
44b755d  Rename unique_sample_ids.txt to unique_sample_ids.txt (kieranmbrown, Apr 16, 2024)
89ed0f4  Create README.md (kieranmbrown, Apr 16, 2024)
96e55ba  Update and rename Metagenomics/Estimate_host_reads_in_raw_data/Workfl… (kieranmbrown, Apr 16, 2024)
eb877e6  Rename Metagenomics/Estimate_host_reads_in_raw_data/Workflow_Document… (kieranmbrown, Apr 16, 2024)
444663b  Add files via upload (kieranmbrown, Apr 16, 2024)
0aae6d5  Update and rename Metagenomics/Estimate_host_reads_in_raw_data/Workfl… (kieranmbrown, Apr 23, 2024)
86e371d  Add files via upload (kieranmbrown, Apr 23, 2024)
2180a50  Update README.md (kieranmbrown, May 21, 2024)
579828e  Update README.md (kieranmbrown, May 21, 2024)
fea98b0  renamed folder (kieranmbrown, May 21, 2024)
b3b07c4  Delete txt (kieranmbrown, May 21, 2024)
75cac8b  Delete NF_MGEstHostReads-B directory (kieranmbrown, May 21, 2024)
4a8b16c  Update README.md (kieranmbrown, Jul 10, 2024)
7dfc889  Update Remove_Human_reads.config (kieranmbrown, Jul 10, 2024)
c800d50  Update Remove_Human_reads.config (kieranmbrown, Jul 10, 2024)
74a627e  fix link (kieranmbrown, Aug 20, 2024)
706d045  Update and rename Metagenomics/Remove_human_reads_from_raw_data/Workf… (kieranmbrown, Aug 31, 2024)
39769ef  fixed section headers (kieranmbrown, Aug 31, 2024)
00674b6  Update and rename nextflow.config (kieranmbrown, Sep 10, 2024)
@@ -0,0 +1,165 @@
# NF_MGRemoveHumanReads-A Workflow Information and Usage Instructions


## General workflow info
The current pipeline for how GeneLab identifies and removes human DNA in Illumina metagenomics sequencing data (MGRemoveHumanReads), [GL-DPPD-7105-A.md](../../Pipeline_GL-DPPD-7105_Versions/GL-DPPD-7105-A.md), is implemented as a [Nextflow](https://www.nextflow.io/docs/stable/index.html) DSL2 workflow and uses [Singularity](https://docs.sylabs.io/guides/3.10/user-guide/introduction.html) to run all tools in containers. This workflow (NF_MGRemoveHumanReads-A) is run from the command line interface (CLI) of any Unix-based system. While knowledge of creating or modifying Nextflow workflows is not required to run the workflow as-is, the [Nextflow documentation](https://www.nextflow.io/docs/stable/index.html) is a useful resource for users who wish to modify and/or extend the workflow.

## Utilizing the workflow

1. [Install Nextflow and Singularity](#1-install-nextflow-and-singularity)  
   1a. [Install Nextflow](#1a-install-nextflow)  
   1b. [Install Singularity](#1b-install-singularity)  
2. [Download the workflow template files](#2-download-the-workflow-template-files)  
3. [Modify the variables in the nextflow.config file](#3-modify-the-variables-in-the-nextflowconfig-file)  
4. [Run the workflow](#4-run-the-workflow)  


<br>

---

### 1. Install Nextflow and Singularity

#### 1a. Install Nextflow

To install Nextflow, follow the [Nextflow documentation page](https://www.nextflow.io/docs/latest/getstarted.html).
>
> Download and install Nextflow directly:
>
> ```bash
> curl -s https://get.nextflow.io | bash
> sudo mv nextflow /usr/local/bin
> ```

Or, if Conda is installed on your system (and you prefer to use it), install and activate the “genelab-utils” Conda package, which includes Nextflow, by running the following commands:
>
> ```bash
> conda install -n base -c conda-forge mamba
> ```
>
> You can read a quick intro to mamba [here](https://astrobiomike.github.io/unix/conda-intro#bonus-mamba-no-5) if desired.
>
> Once mamba is installed, you can install the genelab-utils conda package in a new environment with the following command:
>
> ```bash
> mamba create -n genelab-utils -c conda-forge -c bioconda -c defaults -c astrobiomike 'genelab-utils>=1.1.02'
>
> conda activate genelab-utils
> ```
>
<br>

#### 1b. Install Singularity

Singularity is a container platform that allows running containerized software. This enables the workflow to retrieve and use all software required for processing without needing to install each tool directly on the user's system.

We recommend installing Singularity system-wide as per the associated [documentation](https://docs.sylabs.io/guides/3.10/admin-guide/admin_quickstart.html).

> Note: Singularity is also available through [Anaconda](https://anaconda.org/conda-forge/singularity) if you are using Conda.
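If you do go the Conda route, a minimal sketch of installing Singularity from the conda-forge channel linked above (the environment name here is arbitrary) could look like:

```bash
# create a dedicated environment with Singularity from conda-forge
conda create -n singularity -c conda-forge singularity

# activate it before launching the workflow
conda activate singularity
```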

<br>

---


### 2. Download the workflow template files
All workflow files for removing human reads from metagenomics data are in the [workflow_code](workflow_code) directory. You can get a copy of these files either by downloading them from GitHub or by [cloning](https://docs.github.com/en/repositories/creating-and-managing-repositories/cloning-a-repository) the repository.
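For example, cloning the repository and moving into the workflow documentation directory might look like this (repository URL taken from the reference database link below; the exact subdirectory may differ depending on where this workflow lands after merging):

```bash
# clone the GeneLab repository and change into the workflow documentation directory
git clone https://github.com/nasa/GeneLab_Data_Processing.git
cd GeneLab_Data_Processing/Metagenomics/Remove_human_reads_from_raw_data/Workflow_Documentation
```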


If you are using Conda with genelab-utils, you can copy the workflow files to your system using the `GL-get-workflow` command:
> ```bash
> GL-get-workflow NF_MGRemoveHumanReads-A
> ```
>
> This will download the workflow into a directory called `NF_MGRemoveHumanReads-*/`, with the workflow version number at the end.
>
> Note: If you want an earlier version, you can provide it as an optional argument like so:
> ```bash
> GL-get-workflow NF_MGRemoveHumanReads-A --wanted-version 1.0.0
> ```

<br>

---

### 3. Modify the variables in the nextflow.config file
Once you've downloaded the workflow template, modify the variables in the [nextflow.config](workflow_code/nextflow.config) file as needed so they are consistent with the study you want to process and the machine you're using. In particular, you will need to indicate the path to your input data (raw reads) and the root directory where the kraken2 reference database should be stored (it will be set up automatically).
Confirm the following variables are appropriate for your data (an example of setting them at run time is shown after this list):
- `DL_kraken`
- `single_end`
- `specify_reads`
- `sample_id_list`
- `reads_dir`
- `PE_reads_suffix` or `SE_reads_suffix`
- `PE_reads_out_suffix` or `SE_reads_out_suffix`
- `kraken_output_dir`
- `human_db_path`
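As a minimal illustration (paths and values here are placeholders to adapt to your own data), these variables can be edited in [nextflow.config](workflow_code/nextflow.config) or overridden on the command line:

```bash
# override selected pipeline parameters at run time (double-dash = pipeline parameter)
nextflow run Remove_Human_Reads.nf \
    --reads_dir '../Raw_Sequence_Data/' \
    --single_end false \
    --specify_reads true \
    --sample_id_list 'unique-sample-IDs.txt' \
    --human_db_path '/path/to/kraken2/database/kraken2-human-db'
```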


If you want to process only certain read files, you will have to provide a text file containing a single-column list of unique sample identifiers (see the example below; if you are running the example dataset, this file is provided in the [workflow_code](workflow_code) directory [here](workflow_code/unique-sample-IDs.txt)).
> Note: If you are unfamiliar with how to specify paths, one place you can learn more is [here](https://astrobiomike.github.io/unix/getting-started#the-unix-file-system-structure).

**Example for how to create a single-column list of unique sample identifiers from your raw data file names**

For example, if you only want to process a subset of the read files within the reads directory and have paired-end read data for 2 samples located in `../Raw_Sequence_Data/` relative to your workflow directory, listing that directory would look like this:

```bash
ls ../Raw_Sequence_Data/
```

```
Sample-1_R1.fastq.gz
Sample-1_R2.fastq.gz
Sample-2_R1.fastq.gz
Sample-2_R2.fastq.gz
```

You would set up your `unique-sample-IDs.txt` file as follows:

```bash
cat unique-sample-IDs.txt
```

```
Sample-1
Sample-2
```
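One way to generate this file from the raw read file names themselves (a sketch assuming the paired-end naming shown above; adjust the suffix pattern to match your data):

```bash
# strip the read suffixes, deduplicate, and save the unique sample IDs
ls ../Raw_Sequence_Data/ | sed 's/_R[12].fastq.gz//' | sort -u > unique-sample-IDs.txt
```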

<br>

---

### 4. Run the workflow

While in the directory holding the Nextflow file, nextflow.config file, and other workflow files that you downloaded in [step 2](#2-download-the-workflow-template-files), here is one example command for running the workflow:

```bash
nextflow run *path/to/Remove_Human_Reads.nf* -resume -ansi-log false --DL_kraken false
```


* `-resume` – continues the workflow using cached results from the previous run
* `-ansi-log false` – prints each command being run to the screen instead of dynamically updating the log
* `--specify_reads false` – processes all reads in the reads directory, without requiring a sample ID list
* `--single_end true` – indicates reads are single-end
* `--DL_kraken true` – runs a process before the rest of the workflow to download and set up the necessary database
>
> Note: Nextflow options use a single-dash prefix (e.g., `-resume`), whereas pipeline parameters use a double-dash prefix (e.g., `--specify_reads`). All of the pipeline parameters can and should be set in the nextflow.config file to avoid typos or errors.
>
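For instance, a first run on a new system, where the database still needs to be downloaded and all reads in the reads directory should be processed, might look like this (assuming the same workflow file as in the example above):

```bash
# first run: download the kraken2 human database, then process all reads in reads_dir
nextflow run Remove_Human_Reads.nf -ansi-log false --DL_kraken true --specify_reads false
```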

See `nextflow -h` and [Nextflow's documentation](https://www.nextflow.io/docs/master/index.html) for more options and details.


---

## Reference database info
The database we use was built with kraken2 v2.1.1 as detailed below, and can be downloaded to run with this workflow (it's ~4.3 GB uncompressed). The steps for building it are described on the [reference database info page](https://github.com/nasa/GeneLab_Data_Processing/blob/master/Metagenomics/Remove_human_reads_from_raw_data/Workflow_Documentation/SW_MGRemoveHumanReads-A/reference-database-info.md).
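For orientation, a human-only kraken2 database of this kind is typically built with commands along these lines (a sketch only; the linked page documents the exact steps and version used):

```bash
# download the NCBI taxonomy and the human reference library, then build the database
kraken2-build --download-taxonomy --db kraken2-human-db
kraken2-build --download-library human --db kraken2-human-db
kraken2-build --build --db kraken2-human-db

# optionally remove intermediate files to save space
kraken2-build --clean --db kraken2-human-db
```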

---

Binary files not shown (12 files).
@@ -0,0 +1,161 @@


log.info """\
REMOVING HUMAN READS
===================================
Download DB: ${params.DL_kraken}
Single end reads: ${params.single_end}
Use SampleID file: ${params.specify_reads}
Outputs: ${params.human_db_path}
> **Reviewer comment (Contributor):** shouldn't this be the kraken_output_dir and not the human_db_path?

"""
.stripIndent()

// Process to set up the human reads database
process set_up_human_db {
tag "Downloading human reads database to ${params.human_db_path}\n"
publishDir path: "$projectDir"

output:
path "${params.human_db_name}/"

script:
"""
curl -L -o ${params.human_db_name}.tar.gz https://ndownloader.figshare.com/files/25627058

tar -xzvf ${params.human_db_name}.tar.gz
rm ${params.human_db_name}.tar.gz

"""

}

// Process for paired-end (PE) read analysis with Kraken2
process PE_kraken2 {

container params.kraken2container
tag "$sample_id"
publishDir "$params.kraken_output_dir", pattern: "*.{txt,tsv}"
publishDir "$params.kraken_output_dir/reads", pattern: "*.fastq.gz"

input:
path database
tuple val(sample_id), path(reads_ch)


output:
path "${sample_id}-kraken2-output.txt"
path "${sample_id}-kraken2-report.tsv"
path "${sample_id}*.gz"

script:
"""
kraken2 --db $database --gzip-compressed \
--threads 2 --use-names --paired \
--output ${sample_id}-kraken2-output.txt \
--report ${sample_id}-kraken2-report.tsv \
--unclassified-out "${sample_id}${params.PE_reads_out_suffix}" \
${reads_ch[0]} ${reads_ch[1]}

gzip ${sample_id}*.fastq
"""
}

// Process for single-end (SE) read analysis with Kraken2
process SE_kraken2 {

container params.kraken2container
tag "$sample_id"
publishDir "$params.kraken_output_dir", pattern: "*.{txt,tsv}"
publishDir "$params.kraken_output_dir/reads", pattern: "*.fastq.gz"

input:
path database
tuple val(sample_id), path(reads_ch)

output:
path "${sample_id}-kraken2-output.txt"
path "${sample_id}-kraken2-report.tsv"
path "${sample_id}${params.SE_reads_out_suffix}.gz"

script:
"""
kraken2 --db $database --gzip-compressed --threads 2 --use-names \
--output ${sample_id}-kraken2-output.txt \
--report ${sample_id}-kraken2-report.tsv \
--unclassified-out "${sample_id}${params.SE_reads_out_suffix}" \
${reads_ch[0]}

gzip ${sample_id}${params.SE_reads_out_suffix}
"""
}


// Main workflow logic
workflow {

// Conditionally download the human reads database if requested
if(params.DL_kraken == true){
log.info "\nPreparing to download new human reads database"
database_ch = set_up_human_db()
database_ch.view{"database path: ${it}"}
}

else {
log.info "\nAccessing previous human reads database"
database_ch = Channel.value(params.human_db_path)
database_ch.view{"database path: ${it}"}
}

// Process reads based on whether they are single-end or paired-end
if(params.single_end == true) {
log.info "\nReading Single-end data from ${params.reads_dir}\n"

// Specified reads handling (mimics Channel.fromFilePairs() method's output, but with SE)
if (params.specify_reads) {
reads_ch = Channel
.fromPath("${params.sample_id_list}")
.splitText()
.map { it.trim() }
.map { sample_id ->
def files = file("${params.reads_dir}${sample_id}${params.SE_reads_suffix}")
> **Reviewer comment (Contributor):** Don't assume that users will put the path separator on the end of the path they specify.
> Suggested change:
>   - def files = file("${params.reads_dir}${sample_id}${params.SE_reads_suffix}")
>   + def files = file("${params.reads_dir}/${sample_id}${params.SE_reads_suffix}")

return [sample_id, files]
}
}
else {
reads_ch = Channel
.fromPath("${params.reads_dir}/*${params.SE_reads_suffix}", checkIfExists: true)
.map { readfile ->
def sampleId = readfile.name.replaceAll("${params.SE_reads_suffix}\$", "")
return tuple(sampleId, readfile)
}
}
reads_ch.view{"reads: ${it}"}
output_ch = SE_kraken2(database_ch, reads_ch)
}
else {
log.info "\nReading Paired-end data from ${params.reads_dir}\n"

// Specified reads handling (mimics Channel.fromFilePairs() method's output)
if (params.specify_reads) {
reads_ch = Channel
.fromPath("${params.sample_id_list}")
.splitText()
.map { it.trim() }
.map { sample_id ->
def files = file("${params.reads_dir}${sample_id}${params.PE_reads_suffix}").toList().sort()
> **Reviewer comment (Contributor):** Suggested change:
>   - def files = file("${params.reads_dir}${sample_id}${params.PE_reads_suffix}").toList().sort()
>   + def files = file("${params.reads_dir}/${sample_id}${params.PE_reads_suffix}").toList().sort()

return [sample_id, files]
}
}
else {
reads_ch = Channel.fromFilePairs(params.reads_dir + "*" + params.PE_reads_suffix, checkIfExists: true)
> **Reviewer comment (Contributor):** Suggested change:
>   - reads_ch = Channel.fromFilePairs(params.reads_dir + "*" + params.PE_reads_suffix, checkIfExists: true)
>   + reads_ch = Channel.fromFilePairs(params.reads_dir + "/*" + params.PE_reads_suffix, checkIfExists: true)

}
reads_ch.view{"reads: ${it}"}
output_ch = PE_kraken2(database_ch, reads_ch)
}

// Average the percent-unclassified value (parsed from the first characters of each kraken2 report) across samples and log the final result
final_percent = output_ch[1]
.collect{(it.text[0..5]).toFloat()}
.average().trunc(2)
.view{"\nRESULT: ${it}% of input reads were unclassified, available in ${params.kraken_output_dir}/reads "}
}
@@ -0,0 +1,40 @@
//Variables to set:

params.DL_kraken = false //whether or not to download the human reads database as the first step

params.single_end = false // single-end reads (false if paired-end)

params.specify_reads = true //if true, only process reads specified by the sample_id_list

params.sample_id_list = "/path/to/unique_sample_ids.txt" //list of sample IDs to process if specify_reads is true

params.reads_dir = "$projectDir/example-reads_PE/" //directory to find sample reads

params.PE_reads_suffix = "_R{1,2}.fastq.gz" //raw read suffixes (region following the unique part of the sample names)
//e.g. for "Sample-1_R1_raw.fastq.gz" and "Sample-1_R2_raw.fastq.gz" this would be "_R{1,2}_raw.fastq.gz"

params.PE_reads_out_suffix = "_R#_raw_hrRemoved.fastq" //suffix to use for final (human reads removed) output files


params.SE_reads_suffix = "_raw.fastq.gz" //if single-end, set this. raw read suffixes which follow the unique part of sample name

params.SE_reads_out_suffix = "_raw_hrRemoved.fastq" //suffix to use for final (human reads removed) output files




//Only change if desired:

params.num_threads = 2
params.kraken_output_dir = "$projectDir/kraken2-outputs" //location to output files, relative to wd or full path
params.human_db_name = 'kraken2-human-db' //name of the kraken2 human database directory
params.human_db_path = "/path/to/kraken2/database/${params.human_db_name}"
singularity {
    enabled     = true
    autoConvert = true
    autoMounts  = true
}
// disable the other container engines at the top level (these settings have no effect inside the singularity scope)
docker.enabled       = false
podman.enabled       = false
shifter.enabled      = false
charliecloud.enabled = false
params.kraken2container = 'quay.io/biocontainers/kraken2:2.1.3--pl5321hdcf5f25_0'
@@ -0,0 +1,3 @@
Sample-1
Sample-2
Sample-3