Commit

Merge branch 'CW-3411' into 'dev'
better sanitisation of ref seq IDs [CW-3411]

Closes CW-3411

See merge request epi2melabs/workflows/wf-amplicon!64
julibeg committed Feb 16, 2024
2 parents 4ca717a + 869e4fe commit fcf55dd
Showing 2 changed files with 15 additions and 3 deletions.
7 changes: 7 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,13 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [unreleased]
### Fixed
- The workflow failing when there were tab characters in the FASTA header lines of reference sequences.

### Changed
- The way the reference sequence IDs are sanitised to prevent issues with special characters.

## [v1.0.3]
### Fixed
- The workflow failing when there was a whitespace in the name of the reference file.
11 changes: 8 additions & 3 deletions modules/local/variant-calling.nf
@@ -18,7 +18,9 @@ process sanitizeRefFile {
output: path "reference_sanitized_seqIDs.fasta"
script:
"""
sed '/^>/s/:\\|\\*\\| /_/g' reference.fasta > reference_sanitized_seqIDs.fasta
# use `sed` to replace each run of non-alphanumeric characters in the header lines
# with an underscore (the `2g` flag, a GNU sed extension, skips the first match,
# which is the leading `>`)
sed -E '/^>/s/[^[:alnum:]]+/_/2g' reference.fasta > reference_sanitized_seqIDs.fasta
"""
}
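
The effect of the new `sed` expression can be checked outside the workflow. The following is a minimal sketch, assuming GNU sed (combining a number with `g` in the `s` flags is a GNU extension); the helper name `sanitize_headers` and the FASTA record are made up for illustration and are not part of the commit.

```shell
# Hypothetical helper wrapping the sed expression from the diff (needs GNU sed).
sanitize_headers() {
    # On header lines only, replace each run of non-alphanumeric characters
    # with an underscore, starting at the second match (the first is `>`).
    sed -E '/^>/s/[^[:alnum:]]+/_/2g'
}

# Made-up record: the header contains a colon, an asterisk, a space and a tab.
printf '>ref:1*a b\tdescription\nACGT-N\n' | sanitize_headers
# prints:
# >ref_1_a_b_description
# ACGT-N
```

The sequence line is left untouched because the `/^>/` address restricts the substitution to header lines.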

@@ -175,8 +177,11 @@ workflow pipeline {
// subset the sanitized ref file
ref_id_map = Channel.empty()
| concat(
Channel.of(ref).splitFasta(record: [id: true]).map{ it.id }.collect(),
san_ref.splitFasta(record: [id: true]).map{ it.id }.collect()
// `splitFasta(record: [id: true])` does not split the header line at tab
// characters. We thus split again on whitespace here to make sure that we
// only keep the seq ID
Channel.of(ref).splitFasta(record: [id: true]).map{ it.id.split()[0] }.collect(),
san_ref.splitFasta(record: [id: true]).map{ it.id.split()[0] }.collect()
)
| toList
| map { it.transpose().collectEntries() as LinkedHashMap }
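
The reason for keeping only the first whitespace-delimited token of each header can be illustrated in shell. This is a sketch that uses `awk` in place of Groovy's `split()[0]`; the helper name `seq_ids` and the input records are hypothetical.

```shell
# Hypothetical helper: print the sequence ID of every FASTA header on stdin,
# i.e. the text between `>` and the first space or tab.
seq_ids() {
    # Stripping `>` from $0 makes awk re-split the line, so $1 is the bare ID.
    awk '/^>/ { sub(/^>/, ""); print $1 }'
}

# Made-up input: one header uses a tab before the description, one a space.
printf '>seq1\tsome description\nACGT\n>seq2 other\nTTGA\n' | seq_ids
# prints:
# seq1
# seq2
```

Both kinds of separator yield the bare ID, which is what the `ref_id_map` lookup needs for the original and the sanitized headers to line up.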
