Commit

Merge branch 'CW-3450_dtype_error' into 'dev'
Resolve CW-3450 "Dtype error"

Closes CW-3450

See merge request epi2melabs/workflows/wf-aav-qc!146
SamStudio8 committed Feb 7, 2024
2 parents be7dd1f + 5225e9d commit 22efa10
Showing 13 changed files with 149 additions and 53 deletions.
19 changes: 19 additions & 0 deletions .github/ISSUE_TEMPLATE/bug_report.yml
@@ -122,3 +122,22 @@ body:
       render: shell
     validations:
       required: false
+  - type: dropdown
+    id: run-demo
+    attributes:
+      label: Were you able to successfully run the latest version of the workflow with the demo data?
+      description: For CLI execution, were you able to successfully run the workflow using the demo data available in the [Install and run](./README.md#install-and-run) section of the `README.md`? For execution in the EPI2ME application, were you able to successfully run the workflow via the "Use demo data" button?
+      options:
+        - 'yes'
+        - 'no'
+        - other (please describe below)
+    validations:
+      required: true
+  - type: textarea
+    id: demo-other
+    attributes:
+      label: Other demo data information
+      render: shell
+    validations:
+      required: false

4 changes: 2 additions & 2 deletions .pre-commit-config.yaml
@@ -3,12 +3,12 @@ repos:
     hooks:
       - id: docs_readme
         name: docs_readme
-        entry: parse_docs -p docs -e .md -s 01_brief_description 02_introduction 03_compute_requirements 04_install_and_run 05_related_protocols 06_inputs 07_outputs 08_pipeline_overview 09_troubleshooting 10_FAQ 11_other -ot README.md -od output_definition.json -ns nextflow_schema.json
+        entry: parse_docs -p docs -e .md -s 01_brief_description 02_introduction 03_compute_requirements 04_install_and_run 05_related_protocols 06_input_example 06_input_parameters 07_outputs 08_pipeline_overview 09_troubleshooting 10_FAQ 11_other -ot README.md -od output_definition.json -ns nextflow_schema.json
         language: python
         always_run: true
         pass_filenames: false
         additional_dependencies:
-          - epi2melabs>=0.0.50
+          - epi2melabs>=0.0.51
       - id: build_models
         name: build_models
         entry: datamodel-codegen --strict-nullable --base-class workflow_glue.results_schema_helpers.BaseModel --use-schema-description --disable-timestamp --input results_schema.yml --input-file-type openapi --output bin/workflow_glue/results_schema.py
6 changes: 5 additions & 1 deletion CHANGELOG.md
@@ -2,7 +2,11 @@
 All notable changes to this project will be documented in this file.
 
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
-and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html)
+
+## [v1.0.3]
+### Fixed
+- Datatype inference error during CSV loading.
 
 ## [v1.0.2]
 ### Fixed
27 changes: 25 additions & 2 deletions README.md
@@ -93,7 +93,30 @@ https://community.nanoporetech.com/docs/prepare/library_prep_protocols/ligation-
 
 
 
-## Inputs
+## Input example
+
+<!---Example of input directory structure, delete and edit as appropriate per workflow.--->
+This workflow accepts either FASTQ or BAM files as input.
+
+The FASTQ or BAM input parameters for this workflow accept one of three cases: (i) the path to a single FASTQ or BAM file; (ii) the path to a top-level directory containing FASTQ or BAM files; (iii) the path to a directory containing one level of sub-directories which in turn contain FASTQ or BAM files. In the first and second cases (i and ii), a sample name can be supplied with `--sample`. In the last case (iii), the data is assumed to be multiplexed with the names of the sub-directories as barcodes. In this case, a sample sheet can be provided with `--sample_sheet`.
+
+```
+(i)                     (ii)                 (iii)
+input_reads.fastq   ─── input_directory  ─── input_directory
+                        ├── reads0.fastq     ├── barcode01
+                        └── reads1.fastq     │   ├── reads0.fastq
+                                             │   └── reads1.fastq
+                                             ├── barcode02
+                                             │   ├── reads0.fastq
+                                             │   ├── reads1.fastq
+                                             │   └── reads2.fastq
+                                             └── barcode03
+                                                 └── reads0.fastq
+```
+
+
+
+## Input parameters
 
 ### Input Options
 
@@ -139,7 +162,7 @@ https://community.nanoporetech.com/docs/prepare/library_prep_protocols/ligation-
 
 ## Outputs
 
-Outputs files may be aggregated including information for all samples or provided per sample. Per-sample files will be prefixed with respective aliases and represented below as {{ alias }}.
+Output files may be aggregated including information for all samples or provided per sample. Per-sample files will be prefixed with respective aliases and represented below as {{ alias }}.
 
 | Title | File path | Description | Per sample or aggregated |
 |-------|-----------|-------------|--------------------------|
22 changes: 20 additions & 2 deletions bin/workflow_glue/aav_structures.py
@@ -334,10 +334,28 @@ def annotate_reads(aln_df, type_definitions, symmetry_threshold):
 def main(args):
     """Entry point."""
     # Load the BAM info file
-    df_bam = pl.read_csv(
+    schema = {
+        'Ref': pl.Utf8,
+        'Read': pl.Utf8,
+        'Pos': pl.UInt32,
+        'EndPos': pl.UInt32,
+        'ReadLen': pl.UInt32,
+        'Strand': pl.UInt8,
+        'IsSec': pl.UInt8,
+        'IsSup': pl.UInt8
+    }
+
+    df_bam = (pl.read_csv(
         source=args.bam_info,
         separator='\t',
-        columns=['Ref', 'Read', 'Pos', 'EndPos', 'ReadLen', 'Strand', 'IsSec', 'IsSup']
+        columns=list(schema.keys()),
+        dtypes=list(schema.values())
     )
+        .with_columns([
+            pl.col('Strand').cast(pl.Boolean),
+            pl.col('IsSec').cast(pl.Boolean),
+            pl.col('IsSup').cast(pl.Boolean)
+        ])
+    )
 
     # Get the ITR locations
2 changes: 1 addition & 1 deletion bin/workflow_glue/check_sample_sheet.py
@@ -43,7 +43,7 @@ def main(args):
     ]
 
     if not os.path.exists(args.sample_sheet) or not os.path.isfile(args.sample_sheet):
-        sys.stdout.write(f"Could not open sample sheet '{args.sample_sheet}'.")
+        sys.stdout.write("Could not open sample sheet file.")
         sys.exit()
 
     try:
12 changes: 11 additions & 1 deletion bin/workflow_glue/contamination.py
@@ -4,6 +4,7 @@
 from pathlib import Path
 import subprocess
 
+import numpy as np
 import pandas as pd
 
 from .util import wf_parser  # noqa: ABS101
@@ -62,7 +63,16 @@ def main(args):
     )]
 
     # Read the per-alignment read summaries
-    df_bam = pd.read_csv(args.bam_info, sep='\t', usecols=['Read', 'Ref', 'ReadLen'])
+    df_bam = pd.read_csv(
+        args.bam_info,
+        sep='\t',
+        usecols=['Read', 'Ref', 'ReadLen'],
+        dtype={
+            'Read': str,
+            'Ref': str,
+            'ReadLen': np.uint32
+        }
+    )
     # Assign reference category to alignments
     df_bam['contam_class'] = None
     df_bam.loc[df_bam.Ref == transgene_plasmid_name, 'contam_class'] = 'Transgene'
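The commit does not say what triggered the datatype inference error, so the following is only a minimal sketch of the general problem that explicit `dtype` arguments guard against. It assumes a hypothetical two-row table whose read IDs happen to look numeric; the column names mirror the hunk above, while the data and the variable names `TSV`, `inferred` and `pinned` are invented for illustration.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical BAM info excerpt: the read IDs consist only of digits.
TSV = "Read\tRef\tReadLen\n0001\tplasmid\t1500\n0002\tplasmid\t2300\n"

# Without dtype, pandas infers each column's type from the values it sees,
# so the digit-only read IDs are parsed as integers.
inferred = pd.read_csv(io.StringIO(TSV), sep="\t")

# With dtype, the column types are fixed regardless of the file content.
pinned = pd.read_csv(
    io.StringIO(TSV),
    sep="\t",
    usecols=["Read", "Ref", "ReadLen"],
    dtype={"Read": str, "Ref": str, "ReadLen": np.uint32},
)

print(inferred["Read"].tolist())  # integers; the leading zeros are lost
print(pinned["Read"].tolist())    # the original strings are preserved
```

Pinning the types also keeps the result stable when a file is read in chunks, where inference would otherwise run separately on each chunk.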
43 changes: 39 additions & 4 deletions bin/workflow_glue/report.py
@@ -8,6 +8,7 @@
 from ezcharts.components.reports import labs
 from ezcharts.layout.snippets import Grid, Tabs
 from ezcharts.layout.snippets.table import DataTable
+import numpy as np
 import pandas as pd
 
 from .util import get_named_logger, wf_parser  # noqa: ABS101
@@ -19,7 +20,14 @@ def plot_trucations(report, truncations_file):
     The truncations_file contains start and end positions of alignments that are fully
     contained within the ITR-ITR regions.
     """
-    df = pd.read_csv(truncations_file, sep='\t')
+    df = pd.read_csv(
+        truncations_file, sep='\t',
+        dtype={
+            'Read start': str,
+            'Read end': np.uint32,
+            'sample_id': str
+        }
+    )
 
     with report.add_section("Truncations", "Truncations"):
         p(
@@ -42,7 +50,16 @@ def plot_trucations(report, truncations_file):
 
 def plot_itr_coverage(report, coverage_file):
     """Make report section with ITR-ITR coverage of transgene cassette region."""
-    df = pd.read_csv(coverage_file, sep=r"\s+")
+    df = pd.read_csv(
+        coverage_file,
+        sep=r"\s+",
+        dtype={
+            'ref': str,
+            'pos': np.uint32,
+            'depth': np.uint32,
+            'strand': str,
+            'sample_id': str
+        })
 
     with report.add_section("ITR-ITR coverage", "Coverage"):
         p(
@@ -72,7 +89,16 @@ def plot_contamination(report, class_counts):
     Two plots: (1) mapped/unmapped; (2) mapped reads per reference
     """
-    df_class_counts = pd.read_csv(class_counts, sep='\t')
+    df_class_counts = pd.read_csv(
+        class_counts,
+        sep='\t',
+        dtype={
+            'Reference': str,
+            'Number of alignments': np.uint32,
+            'Percentage of alignments': np.float32,
+            'sample_id': str
+        }
+    )
 
     with report.add_section("Contamination", "Contamination"):
         p(
@@ -114,7 +140,16 @@ def plot_contamination(report, class_counts):
 
 def plot_aav_structures(report, structures_file):
     """Make report section barplots detailing the AAV structures found."""
-    df = pd.read_csv(structures_file, sep='\t')
+    df = pd.read_csv(
+        structures_file,
+        sep='\t',
+        dtype={
+            'Assigned_genome_type': str,
+            'count': np.uint32,
+            'percentage': np.float32,
+            'sample_id': str
+
+        })
 
     with report.add_section("AAV Structures", "Structures"):
         p(
7 changes: 7 additions & 0 deletions bin/workflow_glue/truncations.py
@@ -7,6 +7,7 @@
 
 from pathlib import Path
 
+import numpy as np
 import pandas as pd
 
 from .util import wf_parser  # noqa: ABS101
@@ -48,6 +49,12 @@ def main(args):
         args.bam_info,
         sep='\t',
         usecols=['Read', 'Ref', 'Pos', 'EndPos'],
+        dtype={
+            'Read': str,
+            'Ref': str,
+            'Pos': np.uint32,
+            'EndPos': np.uint32
+        },
         chunksize=50000
     ) as reader:
         for df_bam in reader:
18 changes: 18 additions & 0 deletions docs/06_input_example.md
@@ -0,0 +1,18 @@
+<!---Example of input directory structure, delete and edit as appropriate per workflow.--->
+This workflow accepts either FASTQ or BAM files as input.
+
+The FASTQ or BAM input parameters for this workflow accept one of three cases: (i) the path to a single FASTQ or BAM file; (ii) the path to a top-level directory containing FASTQ or BAM files; (iii) the path to a directory containing one level of sub-directories which in turn contain FASTQ or BAM files. In the first and second cases (i and ii), a sample name can be supplied with `--sample`. In the last case (iii), the data is assumed to be multiplexed with the names of the sub-directories as barcodes. In this case, a sample sheet can be provided with `--sample_sheet`.
+
+```
+(i)                     (ii)                 (iii)
+input_reads.fastq   ─── input_directory  ─── input_directory
+                        ├── reads0.fastq     ├── barcode01
+                        └── reads1.fastq     │   ├── reads0.fastq
+                                             │   └── reads1.fastq
+                                             ├── barcode02
+                                             │   ├── reads0.fastq
+                                             │   ├── reads1.fastq
+                                             │   └── reads2.fastq
+                                             └── barcode03
+                                                 └── reads0.fastq
+```
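
The three input cases described in this new file can be sketched as a small path classifier. This is an illustration only, not the workflow's own input handling: `classify_input`, `is_read_file` and the `READ_SUFFIXES` tuple are hypothetical names, and the list of accepted file extensions is an assumption.

```python
from pathlib import Path

# Suffixes treated as read data in this sketch (an assumption, not the workflow's list).
READ_SUFFIXES = (".fastq", ".fastq.gz", ".fq", ".fq.gz", ".bam")


def is_read_file(path: Path) -> bool:
    """True if path is an existing file with one of the assumed read suffixes."""
    return path.is_file() and path.name.endswith(READ_SUFFIXES)


def classify_input(path: Path) -> str:
    """Return 'i', 'ii' or 'iii' for the three input layouts, else raise ValueError."""
    if is_read_file(path):
        return "i"  # single FASTQ or BAM file
    if not path.is_dir():
        raise ValueError(f"{path} is neither a read file nor a directory")
    top_level = any(is_read_file(p) for p in path.iterdir())
    one_level_down = any(
        is_read_file(f)
        for sub in path.iterdir() if sub.is_dir()
        for f in sub.iterdir()
    )
    if top_level and one_level_down:
        raise ValueError(f"{path} mixes top-level and per-barcode read files")
    if top_level:
        return "ii"  # directory of read files; one sample
    if one_level_down:
        return "iii"  # sub-directories act as barcodes
    raise ValueError(f"no read files found under {path}")
```

In cases (i) and (ii) a caller would pair the result with a single sample name, while in case (iii) the sub-directory names serve as barcodes.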
38 changes: 0 additions & 38 deletions docs/06_inputs.md

This file was deleted.

2 changes: 1 addition & 1 deletion docs/07_outputs.md
@@ -1,4 +1,4 @@
-Outputs files may be aggregated including information for all samples or provided per sample. Per-sample files will be prefixed with respective aliases and represented below as {{ alias }}.
+Output files may be aggregated including information for all samples or provided per sample. Per-sample files will be prefixed with respective aliases and represented below as {{ alias }}.
 
 | Title | File path | Description | Per sample or aggregated |
 |-------|-----------|-------------|--------------------------|
2 changes: 1 addition & 1 deletion nextflow.config
@@ -71,7 +71,7 @@ manifest {
     description = 'AAV plasmid quality control workflow'
     mainScript = 'main.nf'
     nextflowVersion = '>=23.04.2'
-    version = 'v1.0.2'
+    version = 'v1.0.3'
 }
 
 // used by default for "standard" (docker) and singularity profiles,
