Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metagenomics Remove Human Reads #93

Open
wants to merge 30 commits into
base: DEV_Metagenomics_rmHR_NF_conversion
Choose a base branch
from

Conversation

kieranmbrown
Copy link

changed WF name, fixed some documentation, and removed the Estimate Host Reads WF from PR @bnovak32 @asaravia-butler

would appreciate input on this!
…ow_Documentation/NF_MGEstHostReads-B/Estimate_Host_Reads.config to Metagenomics/Estimate_host_reads_in_raw_data/Workflow_Documentation/NF_MGEstHostReads-B/workflow_code/Estimate_Host_Reads.config
…ation/NF_MGEstHostReads-B/Estimate_Host_Reads.nf to Metagenomics/Estimate_host_reads_in_raw_data/Workflow_Documentation/NF_MGEstHostReads-B/workflow_code/Estimate_Host_Reads.nf
…ow_Documentation/NF_MGEstHostReads-B/workflow_code/Estimate_Host_Reads.config to Metagenomics/Estimate_host_reads_in_raw_data/Workflow_Documentation/NF_MGEstHostReads-B/workflow_code/config/Estimate_Host_Reads.config
Copy link
Contributor

@bnovak32 bnovak32 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some general comments: the code blocks in the README need to be executable as-is. Ideally, we want users to be able to run the workflow example by copy/paste of the provided code blocks. As currently written, the README attempts to combine the genelab-utils based instructions from the snakemake workflows with the nextflow instructions from the nextflow workflows. Unfortunately, that is redundant/confusing. Pick one or the other. Further comments are in-line on the README, nextflow, and config files.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file should be moved one directory up so that the instructions are easier to find. This is also how all of the other workflows are structured (README with instruction in the main workflow folder, not in "workflow_code")


Nextflow can be installed either through [Anaconda](https://anaconda.org/bioconda/nextflow) or as documented on the [Nextflow documentation page](https://www.nextflow.io/docs/latest/getstarted.html).

> Note: If you want to install Anaconda, we recommend installing a Miniconda, Python3 version appropriate for your system, as instructed by [Happy Belly Bioinformatics](https://astrobiomike.github.io/unix/conda-intro#getting-and-installing-conda).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems redundant with the instructions in the previous step, which already have the user installing conda/mamba


### 2. Install Nextflow and Singularity

#### 2a. Install Nextflow
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The genelab-utils conda package already includes Nextflow (version 22.10.6). Do we want to tell people to use that version (in which case we can update this text to say that Nextflow is included) or do we want to have separate instructions that make it clear that Nextflow installation is only needed if they want to use a different environment for running the workflow and not genelab-utils.



## General workflow info
The current pipeline for how GeneLab identifies and removes human DNA in Illumina metagenomics sequencing data (MGRemoveHumanReads), [GL-DPPD-7105-A.md](../../Pipeline_GL-DPPD-7105_Versions/GL-DPPD-7105-A.md), is implemented as a [NextFlow](https://www.nextflow.io/docs/stable/index.html) DSL2 workflow and utilizes [Singularity](https://docs.sylabs.io/guides/3.10/user-guide/introduction.html) run all tools in containers. This workflow (NF_MGRemoveHumanReads-A) is run using the command line interface (CLI) of any unix-based system. The workflow can be used even if you are unfamiliar with NextFlow and Singularity, but if you want to learn more about those, [this NextFlow tutorial](https://training.nextflow.io/basic_training/) within [NextFlow's documentation](https://www.nextflow.io/docs/stable/index.html) is a good place to start for that.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The current pipeline for how GeneLab identifies and removes human DNA in Illumina metagenomics sequencing data (MGRemoveHumanReads), [GL-DPPD-7105-A.md](../../Pipeline_GL-DPPD-7105_Versions/GL-DPPD-7105-A.md), is implemented as a [NextFlow](https://www.nextflow.io/docs/stable/index.html) DSL2 workflow and utilizes [Singularity](https://docs.sylabs.io/guides/3.10/user-guide/introduction.html) run all tools in containers. This workflow (NF_MGRemoveHumanReads-A) is run using the command line interface (CLI) of any unix-based system. The workflow can be used even if you are unfamiliar with NextFlow and Singularity, but if you want to learn more about those, [this NextFlow tutorial](https://training.nextflow.io/basic_training/) within [NextFlow's documentation](https://www.nextflow.io/docs/stable/index.html) is a good place to start for that.
The current pipeline for how GeneLab identifies and removes human DNA in Illumina metagenomics sequencing data (MGRemoveHumanReads), [GL-DPPD-7105-A.md](../../Pipeline_GL-DPPD-7105_Versions/GL-DPPD-7105-A.md), is implemented as a [NextFlow](https://www.nextflow.io/docs/stable/index.html) DSL2 workflow and utilizes [Singularity](https://docs.sylabs.io/guides/3.10/user-guide/introduction.html) to run all tools in containers. This workflow (NF_MGRemoveHumanReads-A) is run using the command line interface (CLI) of any unix-based system. While knowledge of creating or modifying Nextflow workflows is not required to run the workflow as-is, the [Nextflow documentation](https://www.nextflow.io/docs/stable/index.html) is a useful resource for users who wish to modify and/or extend the workflow.

.splitText()
.map { it.trim() }
.map { sample_id ->
def files = file("${params.reads_dir}${sample_id}${params.PE_reads_suffix}").toList().sort()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def files = file("${params.reads_dir}${sample_id}${params.PE_reads_suffix}").toList().sort()
def files = file("${params.reads_dir}/${sample_id}${params.PE_reads_suffix}").toList().sort()

}
}
else {
reads_ch = Channel.fromFilePairs(params.reads_dir + "*" + params.PE_reads_suffix, checkIfExists: true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
reads_ch = Channel.fromFilePairs(params.reads_dir + "*" + params.PE_reads_suffix, checkIfExists: true)
reads_ch = Channel.fromFilePairs(params.reads_dir + "/*" + params.PE_reads_suffix, checkIfExists: true)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

README refers to the file with uppercase "Reads", so it's confusing for it to be lowercase "reads". Also, Nextflow doesn't automatically pickup this file. Either add a nextflow.config that points to this one or rename this to "nextflow.config"


params.sample_id_list = "/workspace/GeneLab_Data_Processing/rmv/unique_sample_ids.txt" //list of sample IDs to proccess if specify_reads is true

params.reads_dir = "$projectDir/example-reads_PE/" //directory to find sample reads
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does "$projectDir" refer to? It's OK to rely on the user setting a variable, but then you have to tell them to do so in the README.

params.num_threads = 2
params.kraken_output_dir = "$projectDir/kraken2-outputs" //location to output files, relative to wd or full path
params.human_db_name = 'kraken2-human-db' //
params.human_db_path = "$projectDir/${params.human_db_name}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, the $projectDir isn't discussed in the README. In this case, it is also a good idea to make it clear to the user that the db path doesn't necessarily have to be the same as the read path and using "projectDir" for both is confusing, unless they're downloading the DB (which isn't the default behavior).

Download DB: ${params.DL_kraken}
Single end reads: ${params.single_end}
Use SampleID file: ${params.specify_reads}
Outputs: ${params.human_db_path}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this be the kraken_output_dir and not the human_db_path?

kieranmbrown and others added 2 commits August 20, 2024 11:50
Co-authored-by: Barbara Novak <[email protected]>
…low_Documentation/NF_MGRemoveHumanReads-A/workflow_code/README.md to Metagenomics/Remove_human_reads_from_raw_data/Workflow_Documentation/NF_MGRemoveHumanReads-A/README.md

corrected readme location and uploaded changes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants