This Analysis aims at extracting genomic sequences mimicking 16S amplicon reads from a larger context long read set (gDNA reads or larger amplicons).
The motivation behind this analysis was to simulate 16S long amplicon sequencing on ONT platform using data obtained from the https://github.com/LomanLab/mockcommunity site as a large Promethion fastq archive but any other ONT or Pacbio fastq data can be used with this code (please acknowledge them when you would use this data too).
(reproduced from https://github.com/LomanLab/mockcommunity/edit/master/README.md)
Name | Reads (M) | Yield (G) | FASTQ | Run Folder | Restarts | FAST5 |
---|---|---|---|---|---|---|
Zymo-PromethION-LOG-BB-SN | 35.1 | 148 | fastq.gz | 64h run | restarts | download.sh, restarts.tar |
Zymo-PromethION-EVEN-BB-SN | 36.5 | 146 | fastq.gz | 64h run | restarts | download.sh, restarts.tar |
Zymo-GridION-LOG-BB-SN | 3.7 | 16 | fastq.gz | 48h run | n/a | signal.tar |
Zymo-GridION-EVEN-BB-SN | 3.5 | 14 | fastq.gz | 48h run | n/a | signal.tar |
The method used to extract sequences between primers was developed by Brian Bushnell based on his BBMap tools and is explained here.
The workflow is as follows:
- split the data in smaller fastq bins for speed-up using parallel
- search forward primer in all bins using BBMap msa.sh
- search reverse primer in all bins using BBMap msa.sh
- extract 'matching' regions using BBMAp cutprimers.sh
- merge all results and keep only amplicons in a chosen size range (as set in the script)
REM: Snakemake does not always install well with bioconda and can be removed from the file environment.yaml if you encounter issues. It is not used in the current version of this code anyway and was added here for forward compatibility.
-
Install the required software using conca or manually based on the list provided in environment.yaml.
-
Set variables, names, and numeric limits in the top of the InSilico_PCR.sh script (adjust the number of threads to the available cores in your own machine)
-
Run the script with
- demo data: ./InSilico_PCR.sh -d
- your own read file present in the current folder: ./InSilico_PCR.sh <file_name>.fq.gz
-
the results of a 16S In-Silico PCR experiment are presented in ZymoPromethion_even_Results.
-
the results of using other databases for the MetONTIIME classification are reported here.
This code should and will be changed to a snakemake pipeline in order to be more portable. The config.yaml file is the first step towards this transition.
Please send comments and feedback to [email protected]
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.