Metagenomics Remove Human Reads #93

Open

kieranmbrown wants to merge 30 commits into nasa:DEV_Metagenomics_rmHR_NF_conversion from kieranmbrown:Metagenomics-RHR

kieranmbrown commented May 21, 2024

Changed the workflow name, fixed some documentation, and removed the Estimate Host Reads workflow from this PR. @bnovak32 @asaravia-butler

kieranmbrown added 26 commits

April 2, 2024 13:53


          Add files via upload

96b830a


          Update Remove_Human_reads.config

919d2d2


          Create README.md

4259bc6

would appreciate input on this!


          moved

1e3a7b3


          moved

4deeeae


          moved

9acfc4b


          Create Estimate_Host_Reads.nf


          Add files via upload

f33b5a0


          Rename Estimate_Host_Reads.nf to Estimate_Host_Reads.nf

cc40dd0


          Update and rename Estimate_Host_reads.config to Estimate_Host_Reads.config

9e0f585


          Rename reference-database-info.md to reference-database-info.md

dd3e06b


          Rename unique_sample_ids.txt to unique_sample_ids.txt

44b755d


          Create README.md

89ed0f4


          Update and rename Metagenomics/Estimate_host_reads_in_raw_data/Workflow_Documentation/NF_MGEstHostReads-B/Estimate_Host_Reads.config to Metagenomics/Estimate_host_reads_in_raw_data/Workflow_Documentation/NF_MGEstHostReads-B/workflow_code/Estimate_Host_Reads.config

96e55ba


          Rename Metagenomics/Estimate_host_reads_in_raw_data/Workflow_Documentation/NF_MGEstHostReads-B/Estimate_Host_Reads.nf to Metagenomics/Estimate_host_reads_in_raw_data/Workflow_Documentation/NF_MGEstHostReads-B/workflow_code/Estimate_Host_Reads.nf

eb877e6


          Add files via upload

444663b


          Update and rename Metagenomics/Estimate_host_reads_in_raw_data/Workflow_Documentation/NF_MGEstHostReads-B/workflow_code/Estimate_Host_Reads.config to Metagenomics/Estimate_host_reads_in_raw_data/Workflow_Documentation/NF_MGEstHostReads-B/workflow_code/config/Estimate_Host_Reads.config

0aae6d5


          Add files via upload

86e371d


          Update README.md

2180a50


          Update README.md

579828e


          renamed folder

fea98b0


          Delete txt

b3b07c4


          Delete NF_MGEstHostReads-B directory

75cac8b


          Update README.md

4a8b16c


          Update Remove_Human_reads.config

7dfc889


          Update Remove_Human_reads.config

c800d50

bnovak32 requested changes


Contributor

bnovak32 left a comment

Some general comments: the code blocks in the README need to be executable as-is. Ideally, we want users to be able to run the workflow example by copying and pasting the provided code blocks. As currently written, the README attempts to combine the genelab-utils based instructions from the Snakemake workflows with the Nextflow instructions from the Nextflow workflows, which is redundant and confusing. Pick one or the other. Further comments are in-line on the README, Nextflow, and config files.

...n_reads_from_raw_data/Workflow_Documentation/NF_MGRemoveHumanReads-A/workflow_code/README.md Outdated

Contributor

bnovak32 Aug 13, 2024

This file should be moved one directory up so that the instructions are easier to find. This is also how all of the other workflows are structured (README with instructions in the main workflow folder, not in "workflow_code").

...n_reads_from_raw_data/Workflow_Documentation/NF_MGRemoveHumanReads-A/workflow_code/README.md Outdated


		Nextflow can be installed either through [Anaconda](https://anaconda.org/bioconda/nextflow) or as documented on the [Nextflow documentation page](https://www.nextflow.io/docs/latest/getstarted.html).

		> Note: If you want to install Anaconda, we recommend installing a Miniconda, Python3 version appropriate for your system, as instructed by [Happy Belly Bioinformatics](https://astrobiomike.github.io/unix/conda-intro#getting-and-installing-conda).

Contributor

bnovak32 Aug 13, 2024

This seems redundant with the instructions in the previous step, which already have the user installing conda/mamba

...n_reads_from_raw_data/Workflow_Documentation/NF_MGRemoveHumanReads-A/workflow_code/README.md Outdated


		### 2. Install Nextflow and Singularity

		#### 2a. Install Nextflow

Contributor

bnovak32 Aug 13, 2024

The genelab-utils conda package already includes Nextflow (version 22.10.6). Do we want to tell people to use that version (in which case we can update this text to say that Nextflow is included), or do we want separate instructions that make it clear that installing Nextflow is only needed if they want to run the workflow in an environment other than genelab-utils?

...n_reads_from_raw_data/Workflow_Documentation/NF_MGRemoveHumanReads-A/workflow_code/README.md Outdated



		## General workflow info
		The current pipeline for how GeneLab identifies and removes human DNA in Illumina metagenomics sequencing data (MGRemoveHumanReads), [GL-DPPD-7105-A.md](../../Pipeline_GL-DPPD-7105_Versions/GL-DPPD-7105-A.md), is implemented as a [NextFlow](https://www.nextflow.io/docs/stable/index.html) DSL2 workflow and utilizes [Singularity](https://docs.sylabs.io/guides/3.10/user-guide/introduction.html) run all tools in containers. This workflow (NF_MGRemoveHumanReads-A) is run using the command line interface (CLI) of any unix-based system. The workflow can be used even if you are unfamiliar with NextFlow and Singularity, but if you want to learn more about those, [this NextFlow tutorial](https://training.nextflow.io/basic_training/) within [NextFlow's documentation](https://www.nextflow.io/docs/stable/index.html) is a good place to start for that.

Contributor

bnovak32 Aug 15, 2024

Suggested change

      
-           The current pipeline for how GeneLab identifies and removes human DNA in Illumina metagenomics sequencing data (MGRemoveHumanReads), [GL-DPPD-7105-A.md](../../Pipeline_GL-DPPD-7105_Versions/GL-DPPD-7105-A.md), is implemented as a [NextFlow](https://www.nextflow.io/docs/stable/index.html) DSL2 workflow and utilizes [Singularity](https://docs.sylabs.io/guides/3.10/user-guide/introduction.html) run all tools in containers. This workflow (NF_MGRemoveHumanReads-A) is run using the command line interface (CLI) of any unix-based system. The workflow can be used even if you are unfamiliar with NextFlow and Singularity, but if you want to learn more about those, [this NextFlow tutorial](https://training.nextflow.io/basic_training/) within [NextFlow's documentation](https://www.nextflow.io/docs/stable/index.html) is a good place to start for that.
+           The current pipeline for how GeneLab identifies and removes human DNA in Illumina metagenomics sequencing data (MGRemoveHumanReads), [GL-DPPD-7105-A.md](../../Pipeline_GL-DPPD-7105_Versions/GL-DPPD-7105-A.md), is implemented as a [NextFlow](https://www.nextflow.io/docs/stable/index.html) DSL2 workflow and utilizes [Singularity](https://docs.sylabs.io/guides/3.10/user-guide/introduction.html) to run all tools in containers. This workflow (NF_MGRemoveHumanReads-A) is run using the command line interface (CLI) of any unix-based system. While knowledge of creating or modifying Nextflow workflows is not required to run the workflow as-is, the [Nextflow documentation](https://www.nextflow.io/docs/stable/index.html) is a useful resource for users who wish to modify and/or extend the workflow.


..._raw_data/Workflow_Documentation/NF_MGRemoveHumanReads-A/workflow_code/Remove_Human_Reads.nf

+                      .splitText()
+                      .map { it.trim() }
+                      .map { sample_id ->
+                          def files = file("${params.reads_dir}${sample_id}${params.PE_reads_suffix}").toList().sort()

Contributor

bnovak32 Aug 15, 2024

Suggested change

      
-                       def files = file("${params.reads_dir}${sample_id}${params.PE_reads_suffix}").toList().sort()
+                       def files = file("${params.reads_dir}/${sample_id}${params.PE_reads_suffix}").toList().sort()

..._raw_data/Workflow_Documentation/NF_MGRemoveHumanReads-A/workflow_code/Remove_Human_Reads.nf

+                      }
+                  }
+                  else {
+                      reads_ch = Channel.fromFilePairs(params.reads_dir + "*" + params.PE_reads_suffix, checkIfExists: true)

Contributor

bnovak32 Aug 15, 2024

Suggested change

      
-                   reads_ch = Channel.fromFilePairs(params.reads_dir + "*" + params.PE_reads_suffix, checkIfExists: true)
+                   reads_ch = Channel.fromFilePairs(params.reads_dir + "/*" + params.PE_reads_suffix, checkIfExists: true)

..._data/Workflow_Documentation/NF_MGRemoveHumanReads-A/workflow_code/Remove_Human_reads.config Outdated

Contributor

bnovak32 Aug 15, 2024

README refers to the file with uppercase "Reads", so it's confusing for it to be lowercase "reads". Also, Nextflow doesn't automatically pick up this file. Either add a "nextflow.config" that points to this one or rename this file to "nextflow.config".
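
For the first option, a minimal sketch of what such a nextflow.config could look like. It assumes the parameter file is renamed to "Remove_Human_Reads.config" (uppercase "Reads") and kept in the same directory as nextflow.config; both the file name and the layout are assumptions for illustration, not what the PR currently contains.

```groovy
// nextflow.config
// Nextflow reads a file with this name automatically from the launch directory
// and from the directory containing the workflow script.
// Assumption: Remove_Human_Reads.config sits next to this file.
includeConfig 'Remove_Human_Reads.config'
```

Alternatively, the existing file could be left as-is and passed explicitly at launch with Nextflow's `-c` option, in which case the README would need to show that flag in its run command.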

..._data/Workflow_Documentation/NF_MGRemoveHumanReads-A/workflow_code/Remove_Human_reads.config Outdated


		params.sample_id_list = "/workspace/GeneLab_Data_Processing/rmv/unique_sample_ids.txt" //list of sample IDs to proccess if specify_reads is true

		params.reads_dir = "$projectDir/example-reads_PE/" //directory to find sample reads

Contributor

bnovak32 Aug 15, 2024

What does "$projectDir" refer to? It's OK to rely on the user setting a variable, but then you have to tell them to do so in the README.

..._data/Workflow_Documentation/NF_MGRemoveHumanReads-A/workflow_code/Remove_Human_reads.config Outdated

+              params.num_threads = 2
+              params.kraken_output_dir = "$projectDir/kraken2-outputs"  //location to output files, relative to wd or full path
+              params.human_db_name = 'kraken2-human-db'  //
+              params.human_db_path = "$projectDir/${params.human_db_name}"

Contributor

bnovak32 Aug 15, 2024

Again, $projectDir isn't discussed in the README. In this case, it is also a good idea to make it clear to the user that the DB path doesn't have to be the same as the reads path. Using "projectDir" for both is confusing unless they're downloading the DB (which isn't the default behavior).
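
A minimal sketch of a config that keeps the two locations visibly separate. For reference, projectDir is a built-in Nextflow constant for the directory containing the main workflow script, so the README could simply state that. The database location below ("/path/to/databases") is a placeholder, not a real path from this PR.

```groovy
// Example reads shipped alongside the workflow. projectDir is Nextflow's built-in
// constant for the directory that contains the main script.
params.reads_dir         = "$projectDir/example-reads_PE/"
params.kraken_output_dir = "$projectDir/kraken2-outputs"

// The database can live anywhere; it does not need to sit next to the reads.
// "/path/to/databases" is a placeholder for the user's own location.
params.human_db_name = 'kraken2-human-db'
params.human_db_path = "/path/to/databases/${params.human_db_name}"
```

Keeping a projectDir-relative default only for the bundled example data, and a clearly user-supplied path for the database, would make it obvious which values a user is expected to change.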

bnovak32 reviewed


..._raw_data/Workflow_Documentation/NF_MGRemoveHumanReads-A/workflow_code/Remove_Human_Reads.nf

+                  Download DB:        ${params.DL_kraken}
+                  Single end reads:   ${params.single_end}
+                  Use SampleID file:  ${params.specify_reads}
+                  Outputs:            ${params.human_db_path}

Contributor

bnovak32 Aug 15, 2024

Shouldn't this be the kraken_output_dir and not the human_db_path?

kieranmbrown and others added 2 commits

August 20, 2024 11:50


          fix link

74a627e

Co-authored-by: Barbara Novak <[email protected]>


          Update and rename Metagenomics/Remove_human_reads_from_raw_data/Workflow_Documentation/NF_MGRemoveHumanReads-A/workflow_code/README.md to Metagenomics/Remove_human_reads_from_raw_data/Workflow_Documentation/NF_MGRemoveHumanReads-A/README.md

706d045

corrected readme location and uploaded changes

kieranmbrown added 2 commits

August 31, 2024 15:24


          fixed section headers

39769ef


          Update and rename nextflow.config

00674b6
