Skip to content

extract long read subsequences from a pair of primers

Notifications You must be signed in to change notification settings

Nucleomics-VIB/InSilico_PCR

Repository files navigation

(Nucleomics-VIB)

InSilico PCR

This Analysis aims at extracting genomic sequences mimicking 16S amplicon reads from a larger context long read set (gDNA reads or larger amplicons).

The motivation behind this analysis was to simulate 16S long amplicon sequencing on ONT platform using data obtained from the https://github.com/LomanLab/mockcommunity site as a large Promethion fastq archive but any other ONT or Pacbio fastq data can be used with this code (please acknowledge them when you would use this data too).

Data Availability

(reproduced from https://github.com/LomanLab/mockcommunity/edit/master/README.md)

Name Reads (M) Yield (G) FASTQ Run Folder Restarts FAST5
Zymo-PromethION-LOG-BB-SN 35.1 148 fastq.gz 64h run restarts download.sh, restarts.tar
Zymo-PromethION-EVEN-BB-SN 36.5 146 fastq.gz 64h run restarts download.sh, restarts.tar
Zymo-GridION-LOG-BB-SN 3.7 16 fastq.gz 48h run n/a signal.tar
Zymo-GridION-EVEN-BB-SN 3.5 14 fastq.gz 48h run n/a signal.tar

Method

The method used to extract sequences between primers was developed by Brian Bushnell based on his BBMap tools and is explained here.

The workflow is as follows:

  • split the data in smaller fastq bins for speed-up using parallel
  • search forward primer in all bins using BBMap msa.sh
  • search reverse primer in all bins using BBMap msa.sh
  • extract 'matching' regions using BBMAp cutprimers.sh
  • merge all results and keep only amplicons in a chosen size range (as set in the script)

Usage

REM: Snakemake does not always install well with bioconda and can be removed from the file environment.yaml if you encounter issues. It is not used in the current version of this code anyway and was added here for forward compatibility.

  • Install the required software using conca or manually based on the list provided in environment.yaml.

  • Set variables, names, and numeric limits in the top of the InSilico_PCR.sh script (adjust the number of threads to the available cores in your own machine)

  • Run the script with

    • demo data: ./InSilico_PCR.sh -d
    • your own read file present in the current folder: ./InSilico_PCR.sh <file_name>.fq.gz

Results obtained with the demo data

Future plans

This code should and will be changed to a snakemake pipeline in order to be more portable. The config.yaml file is the first step towards this transition.


Please send comments and feedback to [email protected]


Creative Commons License

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

About

extract long read subsequences from a pair of primers

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages