Merge branch 'CW-3450_dtype_error' into 'dev'

Resolve CW-3450 "Dtype error"

Closes CW-3450

See merge request epi2melabs/workflows/wf-aav-qc!146
epi2me-labs · Feb 7, 2024 · 22efa10
2 parents be7dd1f + 5225e9d

Showing 13 changed files with 149 additions and 53 deletions.
diff --git a/.github/ISSUE_TEMPLATE/bug_report.yml b/.github/ISSUE_TEMPLATE/bug_report.yml
@@ -122,3 +122,22 @@ body:
       render: shell
     validations:
       required: false
+  - type: dropdown
+    id: run-demo
+    attributes:
+      label: Were you able to successfully run the latest version of the workflow with the demo data?
+      description: For CLI execution, were you able to successfully run the workflow using the demo data available in the [Install and run](./README.md#install-and-run) section of the `README.md`? For execution in the EPI2ME application, were you able to successfully run the workflow via the "Use demo data" button?
+      options:
+        - 'yes'
+        - 'no'
+        - other (please describe below)
+    validations:
+      required: true
+  - type: textarea
+    id: demo-other
+    attributes:
+      label: Other demo data information
+      render: shell
+    validations:
+      required: false
+
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -3,12 +3,12 @@ repos:
     hooks:
       - id: docs_readme
         name: docs_readme
-        entry: parse_docs -p docs -e .md -s 01_brief_description 02_introduction 03_compute_requirements 04_install_and_run 05_related_protocols 06_inputs 07_outputs 08_pipeline_overview 09_troubleshooting 10_FAQ 11_other -ot README.md -od output_definition.json -ns nextflow_schema.json
+        entry: parse_docs -p docs -e .md -s 01_brief_description 02_introduction 03_compute_requirements 04_install_and_run 05_related_protocols 06_input_example 06_input_parameters 07_outputs 08_pipeline_overview 09_troubleshooting 10_FAQ 11_other -ot README.md -od output_definition.json -ns nextflow_schema.json
         language: python
         always_run: true
         pass_filenames: false
         additional_dependencies:
-          - epi2melabs>=0.0.50
+          - epi2melabs>=0.0.51
       - id: build_models
         name: build_models
         entry: datamodel-codegen --strict-nullable --base-class workflow_glue.results_schema_helpers.BaseModel --use-schema-description --disable-timestamp --input results_schema.yml --input-file-type openapi --output bin/workflow_glue/results_schema.py

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -2,7 +2,11 @@
 All notable changes to this project will be documented in this file.
 
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
-and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html)
+
+## [v1.0.3]
+### Fixed
+- Datatype inference error during CSV loading. 
 
 ## [v1.0.2]
 ### Fixed 

diff --git a/README.md b/README.md
@@ -93,7 +93,30 @@ https://community.nanoporetech.com/docs/prepare/library_prep_protocols/ligation-
 
 
 
-## Inputs
+## Input example
+
+<!---Example of input directory structure, delete and edit as appropriate per workflow.--->
+This workflow accepts either FASTQ or BAM files as input.
+
+The FASTQ or BAM input parameters for this workflow accept one of three cases: (i) the path to a single FASTQ or BAM file; (ii) the path to a top-level directory containing FASTQ or BAM files; (iii) the path to a directory containing one level of sub-directories which in turn contain FASTQ or BAM files. In the first and second cases (i and ii), a sample name can be supplied with `--sample`. In the last case (iii), the data is assumed to be multiplexed with the names of the sub-directories as barcodes. In this case, a sample sheet can be provided with `--sample_sheet`.
+
+```
+(i)                     (ii)                 (iii)    
+input_reads.fastq   ─── input_directory  ─── input_directory
+                        ├── reads0.fastq     ├── barcode01
+                        └── reads1.fastq     │   ├── reads0.fastq
+                                             │   └── reads1.fastq
+                                             ├── barcode02
+                                             │   ├── reads0.fastq
+                                             │   ├── reads1.fastq
+                                             │   └── reads2.fastq
+                                             └── barcode03
+                                              └── reads0.fastq
+```
+
+
+
+## Input parameters
 
 ### Input Options
 
@@ -139,7 +162,7 @@ https://community.nanoporetech.com/docs/prepare/library_prep_protocols/ligation-
 
 ## Outputs
 
-Outputs files may be aggregated including information for all samples or provided per sample. Per-sample files will be prefixed with respective aliases and represented below as {{ alias }}.
+Output files may be aggregated including information for all samples or provided per sample. Per-sample files will be prefixed with respective aliases and represented below as {{ alias }}.
 
 | Title | File path | Description | Per sample or aggregated |
 |-------|-----------|-------------|--------------------------|

diff --git a/bin/workflow_glue/aav_structures.py b/bin/workflow_glue/aav_structures.py
@@ -334,10 +334,28 @@ def annotate_reads(aln_df, type_definitions, symmetry_threshold):
 def main(args):
     """Entry point."""
     # Load the BAM info file
-    df_bam = pl.read_csv(
+    schema = {
+        'Ref':  pl.Utf8,
+        'Read': pl.Utf8,
+        'Pos': pl.UInt32,
+        'EndPos': pl.UInt32,
+        'ReadLen': pl.UInt32,
+        'Strand': pl.UInt8,
+        'IsSec': pl.UInt8,
+        'IsSup': pl.UInt8
+    }
+
+    df_bam = (pl.read_csv(
         source=args.bam_info,
         separator='\t',
-        columns=['Ref', 'Read', 'Pos', 'EndPos', 'ReadLen', 'Strand', 'IsSec', 'IsSup']
+        columns=list(schema.keys()),
+        dtypes=list(schema.values())
+        )
+        .with_columns([
+            pl.col('Strand').cast(pl.Boolean),
+            pl.col('IsSec').cast(pl.Boolean),
+            pl.col('IsSup').cast(pl.Boolean)
+        ])
     )
 
     # Get the ITR locations
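
The hunk above replaces type inference with a declared schema and then casts the 0/1 flag columns to booleans. As a rough illustration of that idea only, here is a minimal standard-library sketch (not the polars code from the commit) that converts columns by name. The `SCHEMA` converters, the `load_bam_info` helper, and the example row are all hypothetical; only the column names come from the diff.

```python
import csv
import io

# Hypothetical schema mirroring the column names in the diff. The converters
# are plain Python callables, not polars dtypes.
SCHEMA = {
    "Ref": str,
    "Read": str,
    "Pos": int,
    "EndPos": int,
    "ReadLen": int,
    "Strand": int,
    "IsSec": int,
    "IsSup": int,
}
FLAG_COLUMNS = ("Strand", "IsSec", "IsSup")


def load_bam_info(handle):
    """Parse a tab-separated table, converting each column by name."""
    rows = []
    for record in csv.DictReader(handle, delimiter="\t"):
        # Only columns named in SCHEMA are kept; each is converted explicitly.
        row = {name: convert(record[name]) for name, convert in SCHEMA.items()}
        # Flags are stored as 0/1 in the file and exposed as booleans.
        for name in FLAG_COLUMNS:
            row[name] = bool(row[name])
        rows.append(row)
    return rows


# Made-up example data: the column order differs from SCHEMA, there is an
# extra column, and the read name looks numeric, which is the kind of value
# that can mislead automatic inference.
example = io.StringIO(
    "Read\tRef\tPos\tEndPos\tReadLen\tStrand\tIsSec\tIsSup\tExtra\n"
    "0001\tplasmid\t10\t250\t240\t1\t0\t0\tx\n"
)
rows = load_bam_info(example)
```

Keying the converters by column name means the result does not depend on the order of columns in the file, and a numeric-looking read name such as `0001` stays a string.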

diff --git a/bin/workflow_glue/check_sample_sheet.py b/bin/workflow_glue/check_sample_sheet.py
@@ -43,7 +43,7 @@ def main(args):
     ]
 
     if not os.path.exists(args.sample_sheet) or not os.path.isfile(args.sample_sheet):
-        sys.stdout.write(f"Could not open sample sheet '{args.sample_sheet}'.")
+        sys.stdout.write("Could not open sample sheet file.")
         sys.exit()
 
     try:

diff --git a/bin/workflow_glue/contamination.py b/bin/workflow_glue/contamination.py
@@ -4,6 +4,7 @@
 from pathlib import Path
 import subprocess
 
+import numpy as np
 import pandas as pd
 
 from .util import wf_parser  # noqa: ABS101
@@ -62,7 +63,16 @@ def main(args):
     )]
 
     # Read the per-alignment read summaries
-    df_bam = pd.read_csv(args.bam_info, sep='\t', usecols=['Read', 'Ref', 'ReadLen'])
+    df_bam = pd.read_csv(
+        args.bam_info,
+        sep='\t',
+        usecols=['Read', 'Ref', 'ReadLen'],
+        dtype={
+            'Read': str,
+            'Ref': str,
+            'ReadLen': np.uint32
+            }
+    )
     # Assign reference category to alignments
     df_bam['contam_class'] = None
     df_bam.loc[df_bam.Ref == transgene_plasmid_name, 'contam_class'] = 'Transgene'

diff --git a/bin/workflow_glue/report.py b/bin/workflow_glue/report.py
@@ -8,6 +8,7 @@
 from ezcharts.components.reports import labs
 from ezcharts.layout.snippets import Grid, Tabs
 from ezcharts.layout.snippets.table import DataTable
+import numpy as np
 import pandas as pd
 
 from .util import get_named_logger, wf_parser  # noqa: ABS101
@@ -19,7 +20,14 @@ def plot_trucations(report, truncations_file):
     The truncations_file contains start and end positions of alignments that are fully
     contained within the ITR-ITR regions.
     """
-    df = pd.read_csv(truncations_file, sep='\t')
+    df = pd.read_csv(
+        truncations_file, sep='\t',
+        dtype={
+            'Read start': str,
+            'Read end': np.uint32,
+            'sample_id': str
+        }
+    )
 
     with report.add_section("Truncations", "Truncations"):
         p(
@@ -42,7 +50,16 @@ def plot_trucations(report, truncations_file):
 
 def plot_itr_coverage(report, coverage_file):
     """Make report section with ITR-ITR coverage of transgene cassette region."""
-    df = pd.read_csv(coverage_file, sep=r"\s+")
+    df = pd.read_csv(
+        coverage_file,
+        sep=r"\s+",
+        dtype={
+            'ref': str,
+            'pos': np.uint32,
+            'depth': np.uint32,
+            'strand': str,
+            'sample_id': str
+        })
 
     with report.add_section("ITR-ITR coverage", "Coverage"):
         p(
@@ -72,7 +89,16 @@ def plot_contamination(report, class_counts):
 
     Two plots: (1) mapped/unmapped; (2) mapped reads per reference
     """
-    df_class_counts = pd.read_csv(class_counts, sep='\t')
+    df_class_counts = pd.read_csv(
+        class_counts,
+        sep='\t',
+        dtype={
+            'Reference': str,
+            'Number of alignments': np.uint32,
+            'Percentage of alignments': np.float32,
+            'sample_id': str
+        }
+    )
 
     with report.add_section("Contamination", "Contamination"):
         p(
@@ -114,7 +140,16 @@ def plot_contamination(report, class_counts):
 
 def plot_aav_structures(report, structures_file):
     """Make report section barplots detailing the AAV structures found."""
-    df = pd.read_csv(structures_file, sep='\t')
+    df = pd.read_csv(
+        structures_file,
+        sep='\t',
+        dtype={
+            'Assigned_genome_type': str,
+            'count': np.uint32,
+            'percentage': np.float32,
+            'sample_id': str
+
+        })
 
     with report.add_section("AAV Structures", "Structures"):
         p(

diff --git a/bin/workflow_glue/truncations.py b/bin/workflow_glue/truncations.py
@@ -7,6 +7,7 @@
 
 from pathlib import Path
 
+import numpy as np
 import pandas as pd
 
 from .util import wf_parser  # noqa: ABS101
@@ -48,6 +49,12 @@ def main(args):
         args.bam_info,
         sep='\t',
         usecols=['Read', 'Ref', 'Pos', 'EndPos'],
+        dtype={
+            'Read': str,
+            'Ref': str,
+            'Pos': np.uint32,
+            'EndPos': np.uint32
+        },
         chunksize=50000
     ) as reader:
         for df_bam in reader:
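
One plausible reason explicit types matter for a chunked read like the one above: when types are guessed chunk by chunk, two chunks of the same column can end up with different types. The sketch below is a toy model of that effect using only the standard library. `infer_type` is a deliberately simplified stand-in, not pandas' actual inference algorithm, and the read names are invented.

```python
def infer_type(values):
    """Toy inference: int if every value parses as an integer, else str."""
    try:
        for value in values:
            int(value)
    except ValueError:
        return str
    return int


def chunked(values, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


# Made-up 'Read' column: most identifiers happen to look numeric.
read_names = ["101", "102", "103", "read_a", "105", "106"]

# Inferring per chunk gives a different answer for each chunk.
inferred = [infer_type(chunk) for chunk in chunked(read_names, 3)]

# Declaring the type up front gives the same answer for every chunk.
declared = [str for _ in chunked(read_names, 3)]
```

Here the first chunk is guessed as `int` and the second as `str`, whereas the declared type is `str` throughout.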

diff --git a/docs/06_input_example.md b/docs/06_input_example.md
@@ -0,0 +1,18 @@
+<!---Example of input directory structure, delete and edit as appropriate per workflow.--->
+This workflow accepts either FASTQ or BAM files as input.
+
+The FASTQ or BAM input parameters for this workflow accept one of three cases: (i) the path to a single FASTQ or BAM file; (ii) the path to a top-level directory containing FASTQ or BAM files; (iii) the path to a directory containing one level of sub-directories which in turn contain FASTQ or BAM files. In the first and second cases (i and ii), a sample name can be supplied with `--sample`. In the last case (iii), the data is assumed to be multiplexed with the names of the sub-directories as barcodes. In this case, a sample sheet can be provided with `--sample_sheet`.
+
+```
+(i)                     (ii)                 (iii)    
+input_reads.fastq   ─── input_directory  ─── input_directory
+                        ├── reads0.fastq     ├── barcode01
+                        └── reads1.fastq     │   ├── reads0.fastq
+                                             │   └── reads1.fastq
+                                             ├── barcode02
+                                             │   ├── reads0.fastq
+                                             │   ├── reads1.fastq
+                                             │   └── reads2.fastq
+                                             └── barcode03
+                                              └── reads0.fastq
+```
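
The three input layouts described above can be told apart with a few filesystem checks. The following is a minimal sketch, not the workflow's own input handling (which is not part of this diff). The `classify_input` helper and the accepted file extensions are assumptions made for illustration.

```python
from pathlib import Path

# Assumed set of accepted extensions; the workflow's real list may differ.
READ_EXTENSIONS = (".fastq", ".fastq.gz", ".fq", ".fq.gz", ".bam")


def is_read_file(path):
    """Return True if `path` is a file with a FASTQ or BAM extension."""
    return path.is_file() and path.name.endswith(READ_EXTENSIONS)


def classify_input(path):
    """Return 'i', 'ii' or 'iii' for the three layouts in the text above."""
    path = Path(path)
    if is_read_file(path):
        return "i"  # a single FASTQ or BAM file
    if path.is_dir():
        entries = list(path.iterdir())
        if any(is_read_file(entry) for entry in entries):
            return "ii"  # a directory directly containing read files
        subdirs = [entry for entry in entries if entry.is_dir()]
        if any(is_read_file(f) for d in subdirs for f in d.iterdir()):
            return "iii"  # one level of sub-directories (barcodes)
    raise ValueError(f"Unrecognised input layout: {path}")
```

In this sketch a directory that holds both read files and barcode sub-directories is reported as case (ii); how the workflow itself treats such a mixed layout is not stated in the text.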
diff --git a/docs/06_inputs.md b/docs/06_inputs.md
diff --git a/docs/07_outputs.md b/docs/07_outputs.md
@@ -1,4 +1,4 @@
-Outputs files may be aggregated including information for all samples or provided per sample. Per-sample files will be prefixed with respective aliases and represented below as {{ alias }}.
+Output files may be aggregated including information for all samples or provided per sample. Per-sample files will be prefixed with respective aliases and represented below as {{ alias }}.
 
 | Title | File path | Description | Per sample or aggregated |
 |-------|-----------|-------------|--------------------------|

diff --git a/nextflow.config b/nextflow.config
@@ -71,7 +71,7 @@ manifest {
     description     = 'AAV plasmid quality control workflow'
     mainScript      = 'main.nf'
     nextflowVersion = '>=23.04.2'
-    version         = 'v1.0.2'
+    version         = 'v1.0.3'
 }
 
 // used by default for "standard" (docker) and singularity profiles,