diff --git a/.github/ISSUE_TEMPLATE/bug_report.yml b/.github/ISSUE_TEMPLATE/bug_report.yml
index d8f9579..9c53149 100644
--- a/.github/ISSUE_TEMPLATE/bug_report.yml
+++ b/.github/ISSUE_TEMPLATE/bug_report.yml
@@ -122,3 +122,22 @@ body:
       render: shell
     validations:
       required: false
+  - type: dropdown
+    id: run-demo
+    attributes:
+      label: Were you able to successfully run the latest version of the workflow with the demo data?
+      description: For CLI execution, were you able to successfully run the workflow using the demo data available in the [Install and run](./README.md#install-and-run) section of the `README.md`? For execution in the EPI2ME application, were you able to successfully run the workflow via the "Use demo data" button?
+      options:
+        - 'yes'
+        - 'no'
+        - other (please describe below)
+    validations:
+      required: true
+  - type: textarea
+    id: demo-other
+    attributes:
+      label: Other demo data information
+      render: shell
+    validations:
+      required: false
+
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
index 8134142..2ca2e89 100644
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -3,12 +3,12 @@ repos:
   hooks:
     - id: docs_readme
       name: docs_readme
-      entry: parse_docs -p docs -e .md -s 01_brief_description 02_introduction 03_compute_requirements 04_install_and_run 05_related_protocols 06_inputs 07_outputs 08_pipeline_overview 09_troubleshooting 10_FAQ 11_other -ot README.md -od output_definition.json -ns nextflow_schema.json
+      entry: parse_docs -p docs -e .md -s 01_brief_description 02_introduction 03_compute_requirements 04_install_and_run 05_related_protocols 06_input_example 06_input_parameters 07_outputs 08_pipeline_overview 09_troubleshooting 10_FAQ 11_other -ot README.md -od output_definition.json -ns nextflow_schema.json
       language: python
      always_run: true
      pass_filenames: false
      additional_dependencies:
-        - epi2melabs>=0.0.50
+        - epi2melabs>=0.0.51
    - id: build_models
      name: build_models
      entry: datamodel-codegen --strict-nullable --base-class workflow_glue.results_schema_helpers.BaseModel --use-schema-description --disable-timestamp --input results_schema.yml --input-file-type openapi --output bin/workflow_glue/results_schema.py
diff --git a/CHANGELOG.md b/CHANGELOG.md
index 78d8f16..cf2cbf2 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -2,7 +2,11 @@
 All notable changes to this project will be documented in this file.

 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
-and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+## [v1.0.3]
+### Fixed
+- Datatype inference error during CSV loading.

 ## [v1.0.2]
 ### Fixed
diff --git a/README.md b/README.md
index ab52e5e..95ff677 100644
--- a/README.md
+++ b/README.md
@@ -93,7 +93,30 @@ https://community.nanoporetech.com/docs/prepare/library_prep_protocols/ligation-



-## Inputs
+## Input example
+
+
+This workflow accepts either FASTQ or BAM files as input.
+
+The FASTQ or BAM input parameters for this workflow accept one of three cases: (i) the path to a single FASTQ or BAM file; (ii) the path to a top-level directory containing FASTQ or BAM files; (iii) the path to a directory containing one level of sub-directories which in turn contain FASTQ or BAM files. In the first and second cases (i and ii), a sample name can be supplied with `--sample`. In the last case (iii), the data is assumed to be multiplexed with the names of the sub-directories as barcodes. In this case, a sample sheet can be provided with `--sample_sheet`.
+
+```
+(i)                     (ii)                  (iii)
+input_reads.fastq   ─── input_directory  ─── input_directory
+                        ├── reads0.fastq     ├── barcode01
+                        └── reads1.fastq     │   ├── reads0.fastq
+                                             │   └── reads1.fastq
+                                             ├── barcode02
+                                             │   ├── reads0.fastq
+                                             │   ├── reads1.fastq
+                                             │   └── reads2.fastq
+                                             └── barcode03
+                                                 └── reads0.fastq
+```
+
+
+
+## Input parameters

 ### Input Options

@@ -139,7 +162,7 @@ https://community.nanoporetech.com/docs/prepare/library_prep_protocols/ligation-

 ## Outputs

-Outputs files may be aggregated including information for all samples or provided per sample. Per-sample files will be prefixed with respective aliases and represented below as {{ alias }}.
+Output files may be aggregated including information for all samples or provided per sample. Per-sample files will be prefixed with respective aliases and represented below as {{ alias }}.

 | Title | File path | Description | Per sample or aggregated |
 |-------|-----------|-------------|--------------------------|
diff --git a/bin/workflow_glue/aav_structures.py b/bin/workflow_glue/aav_structures.py
index a6e5c71..994acfa 100755
--- a/bin/workflow_glue/aav_structures.py
+++ b/bin/workflow_glue/aav_structures.py
@@ -334,10 +334,28 @@ def annotate_reads(aln_df, type_definitions, symmetry_threshold):
 def main(args):
     """Entry point."""
     # Load the BAM info file
-    df_bam = pl.read_csv(
+    schema = {
+        'Ref': pl.Utf8,
+        'Read': pl.Utf8,
+        'Pos': pl.UInt32,
+        'EndPos': pl.UInt32,
+        'ReadLen': pl.UInt32,
+        'Strand': pl.UInt8,
+        'IsSec': pl.UInt8,
+        'IsSup': pl.UInt8
+    }
+
+    df_bam = (pl.read_csv(
         source=args.bam_info,
         separator='\t',
-        columns=['Ref', 'Read', 'Pos', 'EndPos', 'ReadLen', 'Strand', 'IsSec', 'IsSup']
+        columns=list(schema.keys()),
+        dtypes=list(schema.values())
+    )
+    .with_columns([
+        pl.col('Strand').cast(pl.Boolean),
+        pl.col('IsSec').cast(pl.Boolean),
+        pl.col('IsSup').cast(pl.Boolean)
+    ])
     )

     # Get the ITR locations
diff --git a/bin/workflow_glue/check_sample_sheet.py b/bin/workflow_glue/check_sample_sheet.py
index fe4fc37..62e3483 100755
--- a/bin/workflow_glue/check_sample_sheet.py
+++ b/bin/workflow_glue/check_sample_sheet.py
@@ -43,7 +43,7 @@ def main(args):
     ]

     if not os.path.exists(args.sample_sheet) or not os.path.isfile(args.sample_sheet):
-        sys.stdout.write(f"Could not open sample sheet '{args.sample_sheet}'.")
+        sys.stdout.write("Could not open sample sheet file.")
         sys.exit()

     try:
diff --git a/bin/workflow_glue/contamination.py b/bin/workflow_glue/contamination.py
index fb4158d..dd16531 100755
--- a/bin/workflow_glue/contamination.py
+++ b/bin/workflow_glue/contamination.py
@@ -4,6 +4,7 @@
 from pathlib import Path
 import subprocess

+import numpy as np
 import pandas as pd

 from .util import wf_parser  # noqa: ABS101
@@ -62,7 +63,16 @@ def main(args):
     )]

     # Read the per-alignment read summaries
-    df_bam = pd.read_csv(args.bam_info, sep='\t', usecols=['Read', 'Ref', 'ReadLen'])
+    df_bam = pd.read_csv(
+        args.bam_info,
+        sep='\t',
+        usecols=['Read', 'Ref', 'ReadLen'],
+        dtype={
+            'Read': str,
+            'Ref': str,
+            'ReadLen': np.uint32
+        }
+    )
     # Assign reference category to alignments
     df_bam['contam_class'] = None
     df_bam.loc[df_bam.Ref == transgene_plasmid_name, 'contam_class'] = 'Transgene'
diff --git a/bin/workflow_glue/report.py b/bin/workflow_glue/report.py
index 1c6865e..2645e4d 100755
--- a/bin/workflow_glue/report.py
+++ b/bin/workflow_glue/report.py
@@ -8,6 +8,7 @@
 from ezcharts.components.reports import labs
 from ezcharts.layout.snippets import Grid, Tabs
 from ezcharts.layout.snippets.table import DataTable
+import numpy as np
 import pandas as pd

 from .util import get_named_logger, wf_parser  # noqa: ABS101
@@ -19,7 +20,14 @@ def plot_trucations(report, truncations_file):

     The truncations_file contains start and end positions of alignments
     that are fully contained within the ITR-ITR regions.
""" - df = pd.read_csv(truncations_file, sep='\t') + df = pd.read_csv( + truncations_file, sep='\t', + dtype={ + 'Read start': str, + 'Read end': np.uint32, + 'sample_id': str + } + ) with report.add_section("Truncations", "Truncations"): p( @@ -42,7 +50,16 @@ def plot_trucations(report, truncations_file): def plot_itr_coverage(report, coverage_file): """Make report section with ITR-ITR coverage of transgene cassette region.""" - df = pd.read_csv(coverage_file, sep=r"\s+") + df = pd.read_csv( + coverage_file, + sep=r"\s+", + dtype={ + 'ref': str, + 'pos': np.uint32, + 'depth': np.uint32, + 'strand': str, + 'sample_id': str + }) with report.add_section("ITR-ITR coverage", "Coverage"): p( @@ -72,7 +89,16 @@ def plot_contamination(report, class_counts): Two plots: (1) mapped/unmapped; (2) mapped reads per reference """ - df_class_counts = pd.read_csv(class_counts, sep='\t') + df_class_counts = pd.read_csv( + class_counts, + sep='\t', + dtype={ + 'Reference': str, + 'Number of alignments': np.uint32, + 'Percentage of alignments': np.float32, + 'sample_id': str + } + ) with report.add_section("Contamination", "Contamination"): p( @@ -114,7 +140,16 @@ def plot_contamination(report, class_counts): def plot_aav_structures(report, structures_file): """Make report section barplots detailing the AAV structures found.""" - df = pd.read_csv(structures_file, sep='\t') + df = pd.read_csv( + structures_file, + sep='\t', + dtype={ + 'Assigned_genome_type': str, + 'count': np.uint32, + 'percentage': np.float32, + 'sample_id': str + + }) with report.add_section("AAV Structures", "Structures"): p( diff --git a/bin/workflow_glue/truncations.py b/bin/workflow_glue/truncations.py index acdeb79..e410295 100755 --- a/bin/workflow_glue/truncations.py +++ b/bin/workflow_glue/truncations.py @@ -7,6 +7,7 @@ from pathlib import Path +import numpy as np import pandas as pd from .util import wf_parser # noqa: ABS101 @@ -48,6 +49,12 @@ def main(args): args.bam_info, sep='\t', usecols=['Read', 
'Ref', 'Pos', 'EndPos'], + dtype={ + 'Read': str, + 'Ref': str, + 'Pos': np.uint32, + 'EndPos': np.uint32 + }, chunksize=50000 ) as reader: for df_bam in reader: diff --git a/docs/06_input_example.md b/docs/06_input_example.md new file mode 100644 index 0000000..edb244c --- /dev/null +++ b/docs/06_input_example.md @@ -0,0 +1,18 @@ + +This workflow accepts either FASTQ or BAM files as input. + +The FASTQ or BAM input parameters for this workflow accept one of three cases: (i) the path to a single FASTQ or BAM file; (ii) the path to a top-level directory containing FASTQ or BAM files; (iii) the path to a directory containing one level of sub-directories which in turn contain FASTQ or BAM files. In the first and second cases (i and ii), a sample name can be supplied with `--sample`. In the last case (iii), the data is assumed to be multiplexed with the names of the sub-directories as barcodes. In this case, a sample sheet can be provided with `--sample_sheet`. + +``` +(i) (ii) (iii) +input_reads.fastq ─── input_directory ─── input_directory + ├── reads0.fastq ├── barcode01 + └── reads1.fastq │ ├── reads0.fastq + │ └── reads1.fastq + ├── barcode02 + │ ├── reads0.fastq + │ ├── reads1.fastq + │ └── reads2.fastq + └── barcode03 + └── reads0.fastq +``` \ No newline at end of file diff --git a/docs/06_inputs.md b/docs/06_inputs.md deleted file mode 100644 index 82c7317..0000000 --- a/docs/06_inputs.md +++ /dev/null @@ -1,38 +0,0 @@ -### Input Options - -| Nextflow parameter name | Type | Description | Help | Default | -|--------------------------|------|-------------|------|---------| -| fastq | string | FASTQ files to use in the analysis. | This accepts one of three cases: (i) the path to a single FASTQ file; (ii) the path to a top-level directory containing FASTQ files; (iii) the path to a directory containing one level of sub-directories which in turn contain FASTQ files. In the first and second case, a sample name can be supplied with `--sample`. 
In the last case, the data is assumed to be multiplexed with the names of the sub-directories as barcodes. In this case, a sample sheet can be provided with `--sample_sheet`. | | -| bam | string | BAM or unaligned BAM (uBAM) files to use in the analysis. | This accepts one of three cases: (i) the path to a single BAM file; (ii) the path to a top-level directory containing BAM files; (iii) the path to a directory containing one level of sub-directories which in turn contain BAM files. In the first and second case, a sample name can be supplied with `--sample`. In the last case, the data is assumed to be multiplexed with the names of the sub-directories as barcodes. In this case, a sample sheet can be provided with `--sample_sheet`. | | -| analyse_unclassified | boolean | Analyse unclassified reads from input directory. By default the workflow will not process reads in the unclassified directory. | If selected and if the input is a multiplex directory the workflow will also process the unclassified directory. | False | -| itr_fl_threshold | integer | The maximum number of bases missing from an ITR in order for it to be classed as a full length ITR. | For ITR1, this many bases can be missing from the end of the ITR region. For ITR2, this many bases can be missing from the start of the ITR region. | 100 | -| itr_backbone_threshold | integer | The maximum number of bases and alignment is allowed to extended outside of the ITR-ITR region for an associated read to not be classed as `backbone`. | Reads mapping to the transgene plasmid sometimes extend beyond the ITRs. This parameter sets a maximum number or bases after which the read is classified as `backbone`. | 20 | -| itr1_start | integer | The start position of ITR1. | | | -| itr1_end | integer | The end position of ITR2. | | | -| itr2_start | integer | The start position of ITR2. | | | -| itr2_end | integer | The end position of ITR2. 
| | | -| symmetry_threshold | integer | The threshold to consider whether the start or end positions on opposite strands are classed as symmetrical or asymmetrical. | For certain categories of AAV genome type we want to test whether alignments on both strands are symmetrical or asymmetrical (i.e. whether the start and end positions are approximately the same or not) This parameter sets the threshold for this comparison. | 10 | -| ref_host | string | The reference FASTA file for the host organism (.fasta/fasta.gz). | | | -| ref_helper | string | The helper plasmid FASTA file. | | | -| ref_rep_cap | string | The rep/cap plasmid FASTA file. | | | -| ref_transgene_plasmid | string | The transgene plasmid FASTA file. | | | -| basecaller_cfg | string | Name of the basecaller model that processed the signal data; used to select an appropriate Medaka model. | The basecaller configuration is used to automatically select the appropriate Medaka model. The automatic selection can be overridden with the 'medaka_model' parameters. Available models are: 'dna_r10.4.1_e8.2_400bps_hac@v3.5.2', 'dna_r10.4.1_e8.2_400bps_sup@v3.5.2', 'dna_r9.4.1_e8_fast@v3.4', 'dna_r9.4.1_e8_hac@v3.3', 'dna_r9.4.1_e8_sup@v3.3', 'dna_r10.4.1_e8.2_400bps_hac_prom', 'dna_r9.4.1_450bps_hac_prom', 'dna_r10.3_450bps_hac', 'dna_r10.3_450bps_hac_prom', 'dna_r10.4.1_e8.2_260bps_hac', 'dna_r10.4.1_e8.2_260bps_hac_prom', 'dna_r10.4.1_e8.2_400bps_hac', 'dna_r9.4.1_450bps_hac', 'dna_r9.4.1_e8.1_hac', 'dna_r9.4.1_e8.1_hac_prom'. | dna_r10.4.1_e8.2_400bps_sup@v3.5.2 | -| medaka_model | string | The name of the Medaka model to use. This will override the model automatically chosen based on the provided basecaller configuration. | The workflow will attempt to map the basecaller model (provided with 'basecaller_cfg') used to a suitable Medaka model. You can override this by providing a model with this option instead. 
| | - - -### Sample Options - -| Nextflow parameter name | Type | Description | Help | Default | -|--------------------------|------|-------------|------|---------| -| sample_sheet | string | A CSV file used to map barcodes to sample aliases. The sample sheet can be provided when the input data is a directory containing sub-directories with FASTQ files. | The sample sheet is a CSV file with, minimally, columns named `barcode` and `alias`. Extra columns are allowed. A `type` column is required for certain workflows and should have the following values; `test_sample`, `positive_control`, `negative_control`, `no_template_control`. | | -| sample | string | A single sample name for non-multiplexed data. Permissible if passing a single .fastq(.gz) file or directory of .fastq(.gz) files. | | | - - -### Miscellaneous Options - -| Nextflow parameter name | Type | Description | Help | Default | -|--------------------------|------|-------------|------|---------| -| threads | integer | Maximum number of CPU threads for a process to consume. Applies to the minimap2 mapping and the AAV structure determination stages. | A minimap2 and AAV structure determination process per sample will be will be run. This setting applies a maximum number of threads to be used for each of these. | 4 | -| disable_ping | boolean | Enable to prevent sending a workflow ping. | | False | - - diff --git a/docs/07_outputs.md b/docs/07_outputs.md index 0bfe8bd..9cac2ec 100644 --- a/docs/07_outputs.md +++ b/docs/07_outputs.md @@ -1,4 +1,4 @@ -Outputs files may be aggregated including information for all samples or provided per sample. Per-sample files will be prefixed with respective aliases and represented below as {{ alias }}. +Output files may be aggregated including information for all samples or provided per sample. Per-sample files will be prefixed with respective aliases and represented below as {{ alias }}. 
 | Title | File path | Description | Per sample or aggregated |
 |-------|-----------|-------------|--------------------------|
diff --git a/nextflow.config b/nextflow.config
index 450234c..8a226da 100644
--- a/nextflow.config
+++ b/nextflow.config
@@ -71,7 +71,7 @@ manifest {
     description = 'AAV plasmid quality control workflow'
     mainScript = 'main.nf'
     nextflowVersion = '>=23.04.2'
-    version = 'v1.0.2'
+    version = 'v1.0.3'
 }

 // used by default for "standard" (docker) and singularity profiles,
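The recurring change across `aav_structures.py`, `contamination.py`, `report.py`, and `truncations.py` is the same pattern: declare column dtypes at load time instead of relying on inference, which can mis-type a column (the v1.0.3 changelog entry). A minimal sketch of the pandas side of this pattern; the column names mirror the workflow's `bam_info` table, but the rows here are invented for illustration:

```python
# Sketch of the dtype-pinning pattern applied throughout this changeset:
# pandas infers dtypes per file (or per chunk), so an explicit mapping is
# passed to read_csv instead. Data below is made up for illustration.
import io

import numpy as np
import pandas as pd

tsv = (
    "Read\tRef\tPos\tEndPos\tReadLen\n"
    "read0\ttransgene_plasmid\t10\t250\t240\n"
    "read1\thelper_plasmid\t5\t400\t395\n"
)

df = pd.read_csv(
    io.StringIO(tsv),
    sep='\t',
    usecols=['Read', 'Ref', 'Pos', 'EndPos'],
    dtype={
        'Read': str,
        'Ref': str,
        'Pos': np.uint32,
        'EndPos': np.uint32
    },
)

# The numeric columns load directly as uint32 rather than the inferred int64.
print(df.dtypes['Pos'])  # uint32
```

Pinning the dtypes also makes failures explicit: a malformed value in `Pos` raises at load time instead of silently turning the column into `object`.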