Skip to content

Commit

Permalink
Squashed dev commits.
Browse files Browse the repository at this point in the history
  • Loading branch information
bsipos committed Sep 17, 2019
1 parent 58e588d commit 5848d13
Show file tree
Hide file tree
Showing 109 changed files with 428,174 additions and 659 deletions.
10 changes: 5 additions & 5 deletions .gitlab-ci.yml
Original file line number Diff line number Diff line change
Expand Up @@ -10,16 +10,16 @@ before_script:
- apt-get install -y software-properties-common
- apt-add-repository universe
- apt-get update
- apt-get install -y python-pip make python-numpy python-matplotlib python-biopython python-pandas mummer last-align
- pip install --upgrade pip sphinx sphinx-argparse sphinx_rtd_theme pytest pycmd futures packaging appdirs pysam
- hash -r pip
- pip install -e ./
- apt-get install -y python3 python3-pip make python3-numpy python3-matplotlib hmmer
- alias python=python3; pip3 install --upgrade pip sphinx sphinx-argparse sphinx_rtd_theme pytest pandas
- hash -r pip3
- alias python=python3; pip3 install -e ./


do_testing:
stage: test
script:
- make test
- alias python=python3; make test
except:
- tags

3 changes: 0 additions & 3 deletions Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -70,6 +70,3 @@ install: clean ## install the package to the active Python's site-packages

com: ## commit all changes to git
git commit -a

it: ## integration test
./scripts/cdna_classifier.py -s 95 -i fasta -b pychopper/tests/data/barcodes.fas pychopper/tests/data/ref.fas pychopper/tests/data/test_output.fas
159 changes: 108 additions & 51 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,64 +6,66 @@ Pychopper
=========
[![CircleCI](https://circleci.com/gh/nanoporetech/pychopper.svg?style=svg)](https://circleci.com/gh/nanoporetech/pychopper) [![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/pychopper/README.html)

A tool to identify full length cDNA reads. Primers have to be specified as they are
on the forward strand.
Pychopper v2 is a tool to identify, orient and trim full-length Nanopore cDNA reads. The tool is also able to rescue fused reads.

Background
----------
The general approach of Pychopper v2 is the following:

- Pychopper first identifies alignment hits of the primers across the length of the sequence. The default method for doing this is using `nhmmscan` with the pre-trained strand specific profile HMMs, included with the package. Alternatively, one can use the `edlib` backend, which uses a combination of global and local alignment to identify the primers within the read.
- After identifying the primer hits by either of the backends, the reads are divided into segments defined by two consecutive primer hits. The score of a segment is its length if the configuration of the flanking primer hits is valid (such as `SPP,-VNP` for forward reads) or zero otherwise.
- The segments are assigned to rescued reads using a dynamic programming algorithm maximizing the sum of used segment scores (hence the amount of rescued bases). A crucial observation about the algorithm is that if a segment is included as a rescued read, then the next segment must be excluded as one of the primer hits defining it was "used up" by the previous segment. This put constraints on the dynamic programming graph, as illustrated in the figure below. The arrows in read define the optimal path for rescuing two fused reads with the a total score of `l1 + l3`.

![dp_segmentation](/dp_segmentation.png)

- A crucial parameter of Pychopper v2 is `-q`, which determines the stringency of primer alignment (E-value in the case of the pHMM backend). This can be explicitly specified by the user, however by default it is optimized on a random sample of input reads to produce the maximum number of classified reads.

Getting Started
================

## Installation
## Dependencies

Install via pip:
The required Python packages are installed by either `pip` or `conda`. The profile HMM alignment backend depends on the latest [hmmer](http://hmmer.org/) package.
This can be easily installed using conda:

```bash
conda install -c bioconda hmmer
```
pip install git+https://github.com/nanoporetech/pychopper.git
```

Or clone the repository:

```
git clone https://github.com/nanoporetech/pychopper.git
```
## Installation

And install the package:
Install via pip:

```
python setup.py install
```bash
pip install git+https://github.com/nanoporetech/pychopper.git
```

Install the package in developer mode:
Or install from bioconda:

```
python setup.py develop
```bash
conda install -c bioconda pychopper
```

Run the tests:

```
```bash
make test
```

Build the documentation:

```
make docs
```

Issue `make help` to get a list of `make` targets.

## Usage

```
usage: cdna_classifier.py [-h] -b primers [-i input_format] [-g aln_params]
[-t target_length] [-s score_percentile]
[-n sample_size] [-r report_pdf] [-u unclass_output]
[-S stats_output] [-A scores_output] [-x]
[-l heu_stringency] [-T nr_cores]
usage: cdna_classifier.py [-h] [-b primers] [-g phmm_file] [-c config_file]
[-q cutoff] [-r report_pdf] [-u unclass_output]
[-w rescue_output] [-S stats_output]
[-Y autotune_nr] [-L autotune_samples]
[-A scores_output] [-m method] [-x rescue] [-p]
[-t threads] [-B batch_size]
input_fastx output_fastx
Tool to identify full length cDNA reads. Primers have to be specified as they
are on the forward strand.
Tool to identify, orient and rescue full-length cDNA reads.
positional arguments:
input_fastx Input file.
Expand All @@ -72,37 +74,89 @@ positional arguments:
optional arguments:
-h, --help show this help message and exit
-b primers Primers fasta.
-i input_format Input/output format (fastq).
-g aln_params Alignment parameters (match,
mismatch,gap_open,gap_extend).
-t target_length Number of bases to scan at each end (200).
-s score_percentile Score cutoff percentile (98).
-n sample_size Number of samples when calculating score cutoff
(100000).
-r report_pdf Report PDF.
-g phmm_file File with custom profile HMMs (None).
-c config_file File to specify primer configurations for each
direction (None).
-q cutoff Cutoff parameter (autotuned).
-r report_pdf Report PDF (cdna_classifier_report.pdf).
-u unclass_output Write unclassified reads to this file.
-w rescue_output Write rescued reads to this file.
-S stats_output Write statistics to this file.
-A scores_output Write alignment scores to this file.
-x Use more sensitive (and error prone) heuristic mode
(False).
-l heu_stringency Stringency in heuristic mode (0.25).
-Y autotune_nr Approximate number of reads used for tuning the cutoff
parameter (10000).
-L autotune_samples Number of samples taken when tuning cutoff parameter
(30).
-A scores_output Write alignment scores to this BED file.
-m method Detection method: phmm or edlib (phmm).
-x rescue Protocol-specific read rescue: DCS109 (None).
-p Keep primers, but trim the rest.
-t threads Number of threads to use (8).
-B batch_size Maximum number of reads processed in each batch
(1000000).
```

Example usage:
### Basic usage

Example usage with default PCS109/DCS109 primers using the default pHMM backend:

```bash
cdna_classifier.py -b cdna_barcodes.fas -r report.pdf -u unclassified.fq input.fq full_length_output.fq
cdna_classifier.py -r report.pdf -u unclassified.fq -w rescued.fq input.fq full_length_output.fq
```

Example usage in heuristic mode which is more sensitive (and more error prone):
Example usage with default PCS109/DCS109 primers using the edlib/parasail backend:

```bash
cdna_classifier.py -x -b cdna_barcodes.fas -r report.pdf -u unclassified.fq input.fq full_length_output.fq
cdna_classifier.py -m edlib -r report.pdf -u unclassified.fq -w rescued.fq input.fq full_length_output.fq
```
Example usage with default PCS109/DCS109 primers using the default pHMM backend:

```bash
cdna_classifier.py -r report.pdf -A aln_hits.bed -S statistics.tsv -u unclassified.fq -w rescued.fq input.fq full_length_output.fq
```

### Advanced usage

The fasta files with custom primers used by the `edlib/parasail` backend can be specified through `-b`, while the valid primer configurations are specified through `-c`:

```bash
cdna_classifier.py -m edlib -b custom_pimers.fas -c primer_config.txt input.fq full_length_output.fq
```
Where the contents of `primer_config.txt` looks like `+:MySSP,-MyVNP|-:MyVNP,-MySSP`.

The `pHMM` alignment backend takes a "compressed" profile HMM trained from a multiple sequence alignment using the [hmmer](http://hmmer.org/) package. Custom profile HMMs can be trained from a fastq of reads and a fasta file with the primer sequences using the [hammerpede](https://github.com/nanoporetech/hammerpede) package. The path to the custom profile HMM can be specified using `-g`:

```bash
cdna_classifier.py -m phmm -g MySSP_MyVNP.hmm -c primer_config.txt input.fq full_length_output.fq
```

Evaluation
==========

## Performance on SIRV E0 mix spike-in data

The primers have to specified as they are on the forward strand (see `data/cdna_barcodes.fas` for an example).
The score cutoffs for each primer are calculated by aligning them against random sequences and taking the `-s` percentile of the score distribution (98 by default).
Evaluation on 50k reads from a [SIRV](https://www.lexogen.com/sirvs) E0 mix dataset produced by the PCS109 protocol indicated good performance using both backends:

- More than 85% of the reads were classified, while the percent of classified and rescued reads was higher than 90%:

![sirv_stats](/evaluation/img/sirv_stats.png)

- The oriented reads came from the + and - strands in a roughly 1:1 proportion as expected:

![sirv_strand](/evaluation/img/sirv_strand.png)

- When mapping the oriented reads to the transcriptome, virtually all of them map in the forward direction as expected:

![sirv_map_strand](/evaluation/img/sirv_map_strand.png)

- Comparing the percent of reads covered by alignment before and after trimming, we observe that trimming removed the adapters and the primers:

![sirv_read_cov](/evaluation/img/sirv_read_cov.png)

- Comparing the percent of reference transcripts covered by alignment before and after trimming, we can observe that trimming did not change its value in most of the cases, hence it only rarely removed valid sequence portions:

![sirv_ref_cov](/evaluation/img/sirv_ref_cov.png)

The evaluation can be re-run by issuing `make` from the `evaluation` directory.

Contributing
================
Expand All @@ -116,7 +170,7 @@ Help

## Licence and Copyright

(c) 2018 Oxford Nanopore Technologies Ltd.
(c) 2019 Oxford Nanopore Technologies Ltd.

This Source Code Form is subject to the terms of the Mozilla Public
License, v. 2.0. If a copy of the MPL was not distributed with this
Expand All @@ -126,5 +180,8 @@ file, You can obtain one at http://mozilla.org/MPL/2.0/.

## References and Supporting Information

See the post announcing the tool at the Oxford Nanopore Technologies community [here](https://community.nanoporetech.com/posts/new-transcriptomics-analys).
Research Release
Research releases are provided as technology demonstrators to provide early access to features or stimulate Community development of tools. Support for this software will be minimal and is only provided directly by the developers. Feature requests, improvements, and discussions are welcome and can be implemented by forking and pull requests. However much as we would like to rectify every issue and piece of feedback users may have, the developers may have limited resource for support of this software. Research releases may be unstable and subject to rapid iteration by Oxford Nanopore Technologies.

See the post announcing the tools at the Oxford Nanopore Technologies Community [here](https://community.nanoporetech.com/posts/new-transcriptomics-analys).

4 changes: 0 additions & 4 deletions data/cdna_barcodes.fas

This file was deleted.

1 change: 1 addition & 0 deletions docs/.gitignore
Original file line number Diff line number Diff line change
@@ -1 +1,2 @@
*.rst
_build
Binary file added dp_segmentation.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added evaluation/._cdna_classifier_report.pdf
Binary file not shown.
3 changes: 3 additions & 0 deletions evaluation/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
*.bam
*.bai
*.fq
27 changes: 27 additions & 0 deletions evaluation/Makefile
Original file line number Diff line number Diff line change
@@ -0,0 +1,27 @@
READS="data/SIRV_E0_pcs109_25k.fq"
REF="data/sirv_transcriptome.fas"
CORES=8
TUN_SAMPLE=10000
IN=$(READS)
IN_EDLIB=edlib_classified_reads.fq
RES_EDLIB=edlib_rescued_reads.fq
IN_PHMM=phmm_classified_reads.fq
RES_PHMM=phmm_rescued_reads.fq

all:
@echo Reference is at: $(READS)
@echo Reads are at: $(REF)
# seqkit sample -n $(NR_READS) $(READS) > $(IN)
./scripts/process_reads.sh $(REF) $(IN) $(CORES) RAW
cdna_classifier.py -w $(RES_PHMM) -t $(CORES) -m phmm -Y $(TUN_SAMPLE) -r phmm_pychopper_report.pdf $(IN) $(IN_PHMM)
./scripts/process_reads.sh $(REF) $(IN_PHMM) $(CORES) TRIM_PHMM
./scripts/process_reads.sh $(REF) $(RES_PHMM) $(CORES) RES_PHMM
cdna_classifier.py -w $(RES_EDLIB) -t $(CORES) -m edlib -Y $(TUN_SAMPLE) -r edlib_pychopper_report.pdf $(IN) $(IN_EDLIB)
./scripts/process_reads.sh $(REF) $(IN_EDLIB) $(CORES) TRIM_EDLIB
./scripts/process_reads.sh $(REF) $(RES_EDLIB) $(CORES) RES_EDLIB

./scripts/compare.sh wspace_RAW RAW wspace_TRIM_PHMM PHMM
./scripts/compare.sh wspace_RAW RAW wspace_TRIM_EDLIB EDLIB

clean:
rm -fr wspace_* *.pdf *.fq cmp_*
Loading

0 comments on commit 5848d13

Please sign in to comment.