Resolve "argparse rejig for pair"
ollenordesjo committed Nov 30, 2022
1 parent 4938bb3 commit 89b550c
Showing 5 changed files with 106 additions and 71 deletions.
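Summary of the change: the option definitions previously duplicated between `filter_pairs.py`, `pairs_from_summary.py` and `pair.py` are factored into per-module `add_args` functions, which the `pair` wrapper imports as `add_filter_args` and `add_pair_args` and chains onto its own parser. A minimal, abbreviated sketch of the pattern (one representative option per group; the real modules define many more):

    from argparse import ArgumentDefaultsHelpFormatter, ArgumentParser

    def add_filter_args(parser):
        # one of the options shared between filter_pairs and the pair wrapper
        parser.add_argument(
            "--bases_to_align", default=250, type=int,
            help="Number of bases from each read to attempt alignment.")
        return parser

    def add_pair_args(parser):
        # one of the options shared between pairs_from_summary and the pair wrapper
        parser.add_argument(
            "--max_time_between_reads", type=int, default=200000,
            help="Maximum time (seconds) between reads for them to be deemed a pair.")
        return parser

    def pair_argparser():
        # the wrapper keeps only its own positional and output option,
        # then pulls in both shared option sets
        parser = ArgumentParser(
            "Pair and filter reads from an unmapped bam.",
            formatter_class=ArgumentDefaultsHelpFormatter)
        parser.add_argument("bam", help="A bam file from dorado.")
        parser.add_argument("--output_dir", default="pairs_from_bam")
        parser = add_pair_args(parser)
        parser = add_filter_args(parser)
        return parser

    args = pair_argparser().parse_args(["reads.bam", "--bases_to_align", "300"])
    print(args.bam, args.output_dir, args.bases_to_align)

With this arrangement, options added to either sub-command are picked up by `duplex_tools pair` without further duplication.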
5 changes: 3 additions & 2 deletions CHANGELOG.md
@@ -9,13 +9,14 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Bug where an incorrect tag was used to get sequence lengths from .bam files (for dorado)
### Changed
- ProcessPoolExecutor -> ThreadPoolExecutor in filter_pairs for faster filtering
- Reporting duplex rate in duplex_tools pair

## [v0.2.16]
### Added
- Ability to use an unmapped bam (output from dorado) as input to pairs_from_summary
- Ability to use an unmapped bam (output from dorado) as input to filter_pairs
- Convenience script (`dorado pair unmapped.bam`) to call both pairs_from_summary and filter_pairs on a bam
- usage: dorado pair unmapped.bam
- Convenience script (`duplex_tools pair unmapped.bam`) to call both pairs_from_summary and filter_pairs on a bam
- usage: duplex_tools pair unmapped.bam

## [v0.2.15]
### Fixed
33 changes: 22 additions & 11 deletions README.md
@@ -30,35 +30,38 @@ The available sub-commands are:
* [split_on_adapter](./fillet.md) - split incorrectly concatenated duplex pairs into their component simplex reads (formerly `read_fillet`).
* pairs_from_summary - identify candidate duplex pairs from sequencing summary output by Guppy.
* filter_pairs - filter candidate pairs using basecall-to-basecall alignment.
* pair - a wrapper that runs `pairs_from_summary` followed by `filter_pairs`. Currently only compatible with dorado.


**Preparing duplex reads for Guppy basecalling**

To prepare reads for duplex calling, Duplex Tools provides two programs. The
first parses the sequencing summary output by the Guppy basecaller in order
first parses the sequencing summary output by the Guppy basecaller (or the metadata in a .bam or .sam from dorado) in order
to generate candidate pairs from examining simple read metrics. The second
program analyses the basecalls of candidate reads, checking for similarity.

To run the basic sequencing summary based pairing run the following:
To run the basic sequencing summary (or bam metadata) based pairing, run the following:

duplex_tools pairs_from_summary <sequencing_summary.txt> <output directory>
duplex_tools pairs_from_summary <sequencing_summary.txt/dorado.bam> <output directory>

The primary ouput of the above will be a text file named `pair_ids.txt` in the
The primary output of the above will be a text file named `pair_ids.txt` in the
user-specified output directory. Although this file can be given to Guppy to perform
duplex calling, we recommend running the second basecall-to-basecall alignment
filtering provided by the `filter_pairs` command:

duplex_tools filter_pairs <pair_ids.txt> <fastq directory>
duplex_tools filter_pairs <pair_ids.txt> <fastq directory/dorado.bam>

The first option here is the file described above and output by `pairs_from_summary`.
The second option should be specified as the Guppy (or MinKNOW) or dorado output directory
containing `fastq` or `bam` data --- the directory will be searched recursively for all `.fastq.gz`, `.fastq`, and `.sam/.bam` files.

The first option here is the fle described above and output by `pairs_from_summary`.
The second option should be specified as the Guppy (or MinKNOW) output directory
containing `fastq` data --- the directory will be search recursively for all `.fastq.gz`
and `.fastq` files. The output of this second command will be a file named
The output of this second command will be a file named
`pair_ids_filtered.txt` placed alongside the `pair_ids.txt` file.
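
Putting the two steps together for a dorado bam (the file and directory names below are placeholders):

    duplex_tools pairs_from_summary dorado_calls.bam pairs/
    duplex_tools filter_pairs pairs/pair_ids.txt dorado_calls.bam

After the second command, `pairs/pair_ids_filtered.txt` will sit next to `pairs/pair_ids.txt`.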

**Duplex basecalling with Guppy**

The file `pair_ids_filtered.txt` as prepared above can be used with the
original `.fast5` files produced during a sequencing run in order to calculate
original `.fast5`/`.pod5` files produced during a sequencing run in order to calculate
high quality duplex basecalls.

For example,
@@ -72,9 +75,17 @@ For example,
--duplex_pairing_file pair_ids_filtered.txt

will produce duplex basecalls using the read pairs stored in the
`pair_ids_filtered.txt` file using `.fast5` files found in the user
`pair_ids_filtered.txt` file using `.fast5`/`.pod5` files found in the user
provided MinKNOW output directory.

**Duplex basecalling with Dorado**

Please use `duplex_tools pair unmapped_dorado.bam`.
This will run both the pairing and the pairwise alignment-based filtering, producing a `pair_ids_filtered.txt` file that can be passed to dorado.
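
A sketch of the end-to-end dorado workflow (the dorado command line is an assumption here; the model argument, input layout and `--pairs` option may differ between dorado versions, so check the dorado documentation):

    duplex_tools pair unmapped_dorado.bam
    dorado duplex <model> <pod5 or fast5 directory> --pairs pairs_from_bam/pair_ids_filtered.txt > duplex_calls.sam

Here `pairs_from_bam` is the default `--output_dir` of `duplex_tools pair`.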


For more details, see https://github.com/nanoporetech/dorado.


### Help

30 changes: 18 additions & 12 deletions duplex_tools/filter_pairs.py
@@ -265,18 +265,8 @@ def align_all_pairs(
return alignment_scores_df


def argparser():
"""Create argument parser."""
parser = ArgumentParser(
"Filter candidate read pairs by basecall alignment.",
formatter_class=ArgumentDefaultsHelpFormatter,
parents=[duplex_tools._log_level()], add_help=False)
parser.add_argument(
"read_pairs",
help="Candidate space-separated read ID pairs, time ordered.")
parser.add_argument(
"reads_directory",
help="Directory to search of fastq(.gz) or .bam files.")
def add_args(parser):
"""Add arguments specific to this process."""
parser.add_argument(
"--bases_to_align", default=250, type=int,
help="Number of bases from each read to attempt alignment.")
@@ -316,6 +306,22 @@ def argparser():
return parser


def argparser():
"""Create argument parser."""
parser = ArgumentParser(
"Filter candidate read pairs by basecall alignment.",
formatter_class=ArgumentDefaultsHelpFormatter,
parents=[duplex_tools._log_level()], add_help=False)
parser.add_argument(
"read_pairs",
help="Candidate space-separated read ID pairs, time ordered.")
parser.add_argument(
"reads_directory",
help="Directory to search of fastq(.gz) or .bam files.")
parser = add_args(parser)
return parser


def main(args):
"""Entry point."""
filter_candidate_pairs_by_aligning(
70 changes: 40 additions & 30 deletions duplex_tools/pair.py
@@ -6,8 +6,12 @@

from argparse import ArgumentDefaultsHelpFormatter, ArgumentParser

import pysam

import duplex_tools
from duplex_tools.filter_pairs import add_args as add_filter_args
from duplex_tools.filter_pairs import filter_candidate_pairs_by_aligning
from duplex_tools.pairs_from_summary import add_args as add_pair_args
from duplex_tools.pairs_from_summary import find_pairs


@@ -20,6 +24,13 @@ def pair_and_align(input_bam,
min_length,
max_length,
output_dir,
align_threshold,
no_end_penalties,
penalty_open,
penalty_extend,
score_match,
score_mismatch,
threads,
**kwargs):
"""Pair and align reads from an unmapped bam.
@@ -33,6 +44,7 @@
:param min_length: see filter_pairs
:param max_length: see filter_pairs
"""
logger = duplex_tools.get_named_logger("Pair")
find_pairs(input_bam,
outdir=output_dir,
max_time_between_reads=max_time_between_reads,
@@ -45,50 +57,41 @@
bases_to_align=bases_to_align,
min_length=min_length,
max_length=max_length,
align_threshold=align_threshold,
no_end_penalties=no_end_penalties,
penalty_open=penalty_open,
penalty_extend=penalty_extend,
score_match=score_match,
score_mismatch=score_mismatch,
threads=threads
)

npairs = sum(1 for _ in open(f'{output_dir}/pair_ids_filtered.txt'))
nreads = pysam.AlignmentFile(input_bam, check_sq=False).count(
until_eof=True)
logger.info(f'Initial reads: {nreads}')
logger.info(f'Created pairs: {npairs}')
logger.info(f'Paired reads: {2 * npairs}')
logger.info(f'Approximate duplex rate: {2*100*npairs / nreads:.2f}%')
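# Illustrative numbers (not from a real run): 1000 input reads and 200 retained
# pairs give 400 paired reads, i.e. an approximate duplex rate of
# 2 * 100 * 200 / 1000 = 40%.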


def argparser():
"""Create argument parser."""
parser = ArgumentParser(
"Filter candidate read pairs by basecall alignment.",
formatter_class=ArgumentDefaultsHelpFormatter,
parents=[duplex_tools._log_level()], add_help=False)
parents=[duplex_tools._log_level()],
add_help=False)
parser.add_argument(
"bam",
help="A bam file from dorado.")
parser.add_argument(
"--output_dir",
help="The output directory", default='pairs_from_bam')
parser.add_argument(
"--max_time_between_reads", type=int, default=10000,
help=(
"Maximum time (seconds) between reads for them to be "
"deemed a pair."))
parser.add_argument(
"--max_seqlen_diff", type=float, default=0.4,
help=(
"Maximum ratio (a - b) / a, where a and b are the "
"sequence lengths of a putative pair."))
parser.add_argument(
"--max_abs_seqlen_diff", type=int, default=50000,
help=(
"Maximum sequence length difference between template and "
"complement"))
parser.add_argument(
"--min_qscore", type=float, default=0,
help=(
"The minimum simplex qscore required from both template and "
"complement"))
parser.add_argument(
"--bases_to_align", default=250, type=int,
help="Number of bases from each read to attempt alignment.")
parser.add_argument(
"--min_length", default=1, type=int,
help="Minimum length of template and complement.")
parser.add_argument(
"--max_length", default=float('inf'), type=float,
help="Maximum length of template and complement.")

parser = add_pair_args(parser)
parser = add_filter_args(parser)

return parser


Expand All @@ -103,4 +106,11 @@ def main(args):
bases_to_align=args.bases_to_align,
min_length=args.min_length,
max_length=args.max_length,
no_end_penalties=args.no_end_penalties,
align_threshold=args.align_threshold,
penalty_open=args.penalty_open,
penalty_extend=args.penalty_extend,
score_match=args.score_match,
score_mismatch=args.score_mismatch,
threads=args.threads,
)
39 changes: 23 additions & 16 deletions duplex_tools/pairs_from_summary.py
@@ -106,6 +108,8 @@ def take_column(x):
logger.info(
f"Found {ncandidate_pairs} pairs within {nstrands} reads. "
f"({frac_pairs:.1f}% of reads are part of a pair).")
logger.info('Values above 100% are allowed since reads can be either '
'template or complement')
logger.info(f'Writing files into {outdir} directory')

# Write
@@ -306,27 +308,16 @@ def calculate_metrics_for_next_strand(
return seqsummary


def argparser():
"""Create argument parser."""
parser = ArgumentParser(
"Create candidate pairs from sequencing summary.",
formatter_class=ArgumentDefaultsHelpFormatter,
parents=[duplex_tools._log_level()], add_help=False)

parser.add_argument(
"sequencing_summary",
help="Sequencing summary file.")
parser.add_argument(
"output",
help="Output directory.")
def add_args(parser):
"""Add arguments specific to this process."""
parser.add_argument(
"--prefix", default="pair",
help="")
parser.add_argument(
"--prepend_seqsummary_stem", action="store_true",
help="Add filename of the sequencing summary to output files.")
parser.add_argument(
"--max_time_between_reads", type=int, default=20,
"--max_time_between_reads", type=int, default=200000,
help=(
"Maximum time (seconds) between reads for them to be "
"deemed a pair."))
@@ -336,12 +327,12 @@ def argparser():
"Maximum ratio (a - b) / a, where a and b are the "
"sequence lengths of a putative pair."))
parser.add_argument(
"--max_abs_seqlen_diff", type=int, default=1000,
"--max_abs_seqlen_diff", type=int, default=5000,
help=(
"Maximum sequence length difference between template and "
"complement"))
parser.add_argument(
"--min_qscore", type=float, default=7,
"--min_qscore", type=float, default=6,
help=(
"The minimum simplex qscore required from both template and "
"complement"))
@@ -351,6 +342,22 @@
parser.add_argument(
"--match_barcodes", action="store_true",
help="Require putative pair to contain same barcodes.")
return parser


def argparser():
"""Create argument parser."""
parser = ArgumentParser(
"Create candidate pairs from sequencing summary.",
formatter_class=ArgumentDefaultsHelpFormatter,
parents=[duplex_tools._log_level()], add_help=False)
parser.add_argument(
"sequencing_summary",
help="Sequencing summary file.")
parser.add_argument(
"output",
help="Output directory.")
parser = add_args(parser)

return parser

