Resolve "argparse rejig for pair"
ollenordesjo committed Nov 30, 2022
1 parent 4938bb3 commit 89b550c
Showing 5 changed files with 106 additions and 71 deletions.
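Summary of the change: the option definitions previously duplicated between `filter_pairs.py`, `pairs_from_summary.py` and `pair.py` are factored into per-module `add_args` functions, which the `pair` wrapper imports as `add_filter_args` and `add_pair_args` and chains onto its own parser. A minimal, abbreviated sketch of the pattern (one representative option per group; the real modules define many more):

    from argparse import ArgumentDefaultsHelpFormatter, ArgumentParser

    def add_filter_args(parser):
        # one of the options shared between filter_pairs and the pair wrapper
        parser.add_argument(
            "--bases_to_align", default=250, type=int,
            help="Number of bases from each read to attempt alignment.")
        return parser

    def add_pair_args(parser):
        # one of the options shared between pairs_from_summary and the pair wrapper
        parser.add_argument(
            "--max_time_between_reads", type=int, default=200000,
            help="Maximum time (seconds) between reads for them to be deemed a pair.")
        return parser

    def pair_argparser():
        # the wrapper keeps only its own positional and output option,
        # then pulls in both shared option sets
        parser = ArgumentParser(
            "Pair and filter reads from an unmapped bam.",
            formatter_class=ArgumentDefaultsHelpFormatter)
        parser.add_argument("bam", help="A bam file from dorado.")
        parser.add_argument("--output_dir", default="pairs_from_bam")
        parser = add_pair_args(parser)
        parser = add_filter_args(parser)
        return parser

    args = pair_argparser().parse_args(["reads.bam", "--bases_to_align", "300"])
    print(args.bam, args.output_dir, args.bases_to_align)

With this arrangement, options added to either sub-command are picked up by `duplex_tools pair` without further duplication.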
5 changes: 3 additions & 2 deletions CHANGELOG.md
@@ -9,13 +9,14 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Bug where an incorrect tag was used to get sequence lengths from .bam files (for dorado)
### Changed
- ProcessPoolExecutor -> ThreadPoolExecutor in filter_pairs for faster filtering
- Reporting duplex rate in duplex_tools pair

## [v0.2.16]
### Added
- Ability to use an unmapped bam (output from dorado) as input to pairs_from_summary
- Ability to use an unmapped bam (output from dorado) as input to filter_pairs
- Convenience script (`dorado pair unmapped.bam`) to call both pairs_from_summary and filter_pairs on a bam
- usage: dorado pair unmapped.bam
- Convenience script (`duplex_tools pair unmapped.bam`) to call both pairs_from_summary and filter_pairs on a bam
- usage: duplex_tools pair unmapped.bam

## [v0.2.15]
### Fixed
33 changes: 22 additions & 11 deletions README.md
@@ -30,35 +30,38 @@ The available sub-commands are:
* [split_on_adapter](./fillet.md) - split incorrectly concatenated duplex pairs into their component simplex reads (formerly `read_fillet`).
* pairs_from_summary - identify candidate duplex pairs from sequencing summary output by Guppy.
* filter_pairs - filter candidate pairs using basecall-to-basecall alignment.
* pair - a wrapper that runs `pairs_from_summary` followed by `filter_pairs`. Currently only compatible with dorado.


**Preparing duplex reads for Guppy basecalling**

To prepare reads for duplex calling, Duplex Tools provides two programs. The
first parses the sequencing summary output by the Guppy basecaller in order
first parses the sequencing summary output by the Guppy basecaller (or the metadata in a .bam or .sam from dorado) in order
to generate candidate pairs from examining simple read metrics. The second
program analyses the basecalls of candidate reads, checking for similarity.

To run the basic sequencing summary based pairing run the following:
To run the basic sequencing summary (or bam metadata) based pairing, run the following:

duplex_tools pairs_from_summary <sequencing_summary.txt> <output directory>
duplex_tools pairs_from_summary <sequencing_summary.txt/dorado.bam> <output directory>

The primary ouput of the above will be a text file named `pair_ids.txt` in the
The primary output of the above will be a text file named `pair_ids.txt` in the
user-specified output directory. Although this file can be given to Guppy to perform
duplex calling, we recommend running the second basecall-to-basecall alignment
filtering provided by the `filter_pairs` command:

duplex_tools filter_pairs <pair_ids.txt> <fastq directory>
duplex_tools filter_pairs <pair_ids.txt> <fastq directory/dorado.bam>

The first option here is the file described above and output by `pairs_from_summary`.
The second option should be specified as the Guppy (or MinKNOW) or dorado output directory
containing `fastq` or `bam` data --- the directory will be searched recursively for all `.fastq.gz`, `.fastq`, and `.sam/.bam` files.

The first option here is the fle described above and output by `pairs_from_summary`.
The second option should be specified as the Guppy (or MinKNOW) output directory
containing `fastq` data --- the directory will be search recursively for all `.fastq.gz`
and `.fastq` files. The output of this second command will be a file named
The output of this second command will be a file named
`pair_ids_filtered.txt` placed alongside the `pair_ids.txt` file.
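
Putting the two steps together for a dorado bam (the file and directory names below are placeholders):

    duplex_tools pairs_from_summary dorado_calls.bam pairs/
    duplex_tools filter_pairs pairs/pair_ids.txt dorado_calls.bam

After the second command, `pairs/pair_ids_filtered.txt` will sit next to `pairs/pair_ids.txt`.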

**Duplex basecalling with Guppy**

The file `pair_ids_filtered.txt` as prepared above can be used with the
original `.fast5` files produced during a sequencing run in order to calculate
original `.fast5`/`.pod5` files produced during a sequencing run in order to calculate
high quality duplex basecalls.

For example,
@@ -72,9 +75,17 @@ For example,
--duplex_pairing_file pair_ids_filtered.txt

will produce duplex basecalls using the read pairs stored in the
`pair_ids_filtered.txt` file using `.fast5` files found in the user
`pair_ids_filtered.txt` file using `.fast5`/`.pod5` files found in the user
provided MinKNOW output directory.

**Duplex basecalling with Dorado**

Please use `duplex_tools pair unmapped_dorado.bam`.
This will run both the pairing and the pairwise alignment-based filtering, producing a `pair_ids_filtered.txt` file that can be passed to dorado.
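
A sketch of the end-to-end dorado workflow (the dorado command line is an assumption here; the model argument, input layout and `--pairs` option may differ between dorado versions, so check the dorado documentation):

    duplex_tools pair unmapped_dorado.bam
    dorado duplex <model> <pod5 or fast5 directory> --pairs pairs_from_bam/pair_ids_filtered.txt > duplex_calls.sam

Here `pairs_from_bam` is the default `--output_dir` of `duplex_tools pair`.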


For more details, see https://github.com/nanoporetech/dorado.


### Help

30 changes: 18 additions & 12 deletions duplex_tools/filter_pairs.py
@@ -265,18 +265,8 @@ def align_all_pairs(
return alignment_scores_df


def argparser():
"""Create argument parser."""
parser = ArgumentParser(
"Filter candidate read pairs by basecall alignment.",
formatter_class=ArgumentDefaultsHelpFormatter,
parents=[duplex_tools._log_level()], add_help=False)
parser.add_argument(
"read_pairs",
help="Candidate space-separated read ID pairs, time ordered.")
parser.add_argument(
"reads_directory",
help="Directory to search of fastq(.gz) or .bam files.")
def add_args(parser):
"""Add arguments specific to this process."""
parser.add_argument(
"--bases_to_align", default=250, type=int,
help="Number of bases from each read to attempt alignment.")
@@ -316,6 +306,22 @@ def argparser():
return parser


def argparser():
"""Create argument parser."""
parser = ArgumentParser(
"Filter candidate read pairs by basecall alignment.",
formatter_class=ArgumentDefaultsHelpFormatter,
parents=[duplex_tools._log_level()], add_help=False)
parser.add_argument(
"read_pairs",
help="Candidate space-separated read ID pairs, time ordered.")
parser.add_argument(
"reads_directory",
help="Directory to search of fastq(.gz) or .bam files.")
parser = add_args(parser)
return parser


def main(args):
"""Entry point."""
filter_candidate_pairs_by_aligning(
70 changes: 40 additions & 30 deletions duplex_tools/pair.py
@@ -6,8 +6,12 @@

from argparse import ArgumentDefaultsHelpFormatter, ArgumentParser

import pysam

import duplex_tools
from duplex_tools.filter_pairs import add_args as add_filter_args
from duplex_tools.filter_pairs import filter_candidate_pairs_by_aligning
from duplex_tools.pairs_from_summary import add_args as add_pair_args
from duplex_tools.pairs_from_summary import find_pairs


@@ -20,6 +24,13 @@ def pair_and_align(input_bam,
min_length,
max_length,
output_dir,
align_threshold,
no_end_penalties,
penalty_open,
penalty_extend,
score_match,
score_mismatch,
threads,
**kwargs):
"""Pair and align reads from an unmapped bam.
@@ -33,6 +44,7 @@
:param min_length: see filter_pairs
:param max_length: see filter_pairs
"""
logger = duplex_tools.get_named_logger("Pair")
find_pairs(input_bam,
outdir=output_dir,
max_time_between_reads=max_time_between_reads,
@@ -45,50 +57,41 @@
bases_to_align=bases_to_align,
min_length=min_length,
max_length=max_length,
align_threshold=align_threshold,
no_end_penalties=no_end_penalties,
penalty_open=penalty_open,
penalty_extend=penalty_extend,
score_match=score_match,
score_mismatch=score_mismatch,
threads=threads
)

npairs = sum(1 for _ in open(f'{output_dir}/pair_ids_filtered.txt'))
nreads = pysam.AlignmentFile(input_bam, check_sq=False).count(
until_eof=True)
logger.info(f'Initial reads: {nreads}')
logger.info(f'Created pairs: {npairs}')
logger.info(f'Paired reads: {2 * npairs}')
logger.info(f'Approximate duplex rate: {2*100*npairs / nreads:.2f}%')
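# Illustrative numbers (not from a real run): 1000 input reads and 200 retained
# pairs give 400 paired reads, i.e. an approximate duplex rate of
# 2 * 100 * 200 / 1000 = 40%.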


def argparser():
"""Create argument parser."""
parser = ArgumentParser(
"Filter candidate read pairs by basecall alignment.",
formatter_class=ArgumentDefaultsHelpFormatter,
parents=[duplex_tools._log_level()], add_help=False)
parents=[duplex_tools._log_level()],
add_help=False)
parser.add_argument(
"bam",
help="A bam file from dorado.")
parser.add_argument(
"--output_dir",
help="The output directory", default='pairs_from_bam')
parser.add_argument(
"--max_time_between_reads", type=int, default=10000,
help=(
"Maximum time (seconds) between reads for them to be "
"deemed a pair."))
parser.add_argument(
"--max_seqlen_diff", type=float, default=0.4,
help=(
"Maximum ratio (a - b) / a, where a and b are the "
"sequence lengths of a putative pair."))
parser.add_argument(
"--max_abs_seqlen_diff", type=int, default=50000,
help=(
"Maximum sequence length difference between template and "
"complement"))
parser.add_argument(
"--min_qscore", type=float, default=0,
help=(
"The minimum simplex qscore required from both template and "
"complement"))
parser.add_argument(
"--bases_to_align", default=250, type=int,
help="Number of bases from each read to attempt alignment.")
parser.add_argument(
"--min_length", default=1, type=int,
help="Minimum length of template and complement.")
parser.add_argument(
"--max_length", default=float('inf'), type=float,
help="Maximum length of template and complement.")

parser = add_pair_args(parser)
parser = add_filter_args(parser)

return parser


Expand All @@ -103,4 +106,11 @@ def main(args):
bases_to_align=args.bases_to_align,
min_length=args.min_length,
max_length=args.max_length,
no_end_penalties=args.no_end_penalties,
align_threshold=args.align_threshold,
penalty_open=args.penalty_open,
penalty_extend=args.penalty_extend,
score_match=args.score_match,
score_mismatch=args.score_mismatch,
threads=args.threads,
)
39 changes: 23 additions & 16 deletions duplex_tools/pairs_from_summary.py
@@ -106,6 +108,8 @@ def take_column(x):
logger.info(
f"Found {ncandidate_pairs} pairs within {nstrands} reads. "
f"({frac_pairs:.1f}% of reads are part of a pair).")
logger.info('Values above 100% are allowed since reads can be either '
'template or complement')
logger.info(f'Writing files into {outdir} directory')

# Write
@@ -306,27 +308,16 @@ def calculate_metrics_for_next_strand(
return seqsummary


def argparser():
"""Create argument parser."""
parser = ArgumentParser(
"Create candidate pairs from sequencing summary.",
formatter_class=ArgumentDefaultsHelpFormatter,
parents=[duplex_tools._log_level()], add_help=False)

parser.add_argument(
"sequencing_summary",
help="Sequencing summary file.")
parser.add_argument(
"output",
help="Output directory.")
def add_args(parser):
"""Add arguments specific to this process."""
parser.add_argument(
"--prefix", default="pair",
help="")
parser.add_argument(
"--prepend_seqsummary_stem", action="store_true",
help="Add filename of the sequencing summary to output files.")
parser.add_argument(
"--max_time_between_reads", type=int, default=20,
"--max_time_between_reads", type=int, default=200000,
help=(
"Maximum time (seconds) between reads for them to be "
"deemed a pair."))
@@ -336,12 +327,12 @@ def argparser():
"Maximum ratio (a - b) / a, where a and b are the "
"sequence lengths of a putative pair."))
parser.add_argument(
"--max_abs_seqlen_diff", type=int, default=1000,
"--max_abs_seqlen_diff", type=int, default=5000,
help=(
"Maximum sequence length difference between template and "
"complement"))
parser.add_argument(
"--min_qscore", type=float, default=7,
"--min_qscore", type=float, default=6,
help=(
"The minimum simplex qscore required from both template and "
"complement"))
@@ -351,6 +342,22 @@
parser.add_argument(
"--match_barcodes", action="store_true",
help="Require putative pair to contain same barcodes.")
return parser


def argparser():
"""Create argument parser."""
parser = ArgumentParser(
"Create candidate pairs from sequencing summary.",
formatter_class=ArgumentDefaultsHelpFormatter,
parents=[duplex_tools._log_level()], add_help=False)
parser.add_argument(
"sequencing_summary",
help="Sequencing summary file.")
parser.add_argument(
"output",
help="Output directory.")
parser = add_args(parser)

return parser

