Scripts to carry out pre-processing, analysis, and re-analysis of Illumina 16S data using the RDS method.

mcnelsonphd/16S-RDS

# Introduction

This is the official project home for the scripts necessary to carry out the various pre-processing and analysis steps that are recommended and implemented in the RDS processing scheme published by Nelson et al., PLoS ONE 9: e94249. doi:10.1371/journal.pone.0094249.

The primary scripts for carrying out the full processing and analysis pipeline are V4_Preprocess and Qiime_Process; however, the user must run split_libraries_fastq.py themselves to demultiplex the reads before running Qiime_Process.

### 1. V4_Preprocess

V4_Preprocess is the first step of the analysis pipeline. It takes the three raw Undetermined files from the MiSeq, merges the reads with SeqPrep, identifies phiX reads with bowtie2 and removes them, and then uses a custom Biopython script to discard merged reads whose length falls outside the expected range for V4 amplicons (< 240 bp or > 265 bp). The final outputs of this step are two files, Good_Reads_filtered_DATE-TIME.fastq.gz and Good_Index_filtered_DATE-TIME.fastq.gz. Intermediate files are placed in a V4_Intermediates_DATE-TIME directory. The script uses a checkpointing scheme: if a run fails, it will attempt to restart after the last successfully completed step.

Usage:

cd /Directory_containing_raw_read_files
V4_Preprocess
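
The length-filtering step described above can be illustrated with a short sketch. This is not the repository's Length_Filter.py (which uses Biopython); it is a minimal standard-library version, and the names `read_fastq` and `length_filter` are hypothetical. It assumes plain four-line FASTQ records in a gzipped file and handles only the read file, whereas the real pipeline also has to keep the index file in sync with the reads that survive.

```python
import gzip

MIN_LEN = 240  # shortest merged read kept (bp)
MAX_LEN = 265  # longest merged read kept (bp)


def read_fastq(handle):
    """Yield (header, sequence, quality) from a four-line-per-record FASTQ handle."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header, sequence, quality


def length_filter(in_path, out_path, min_len=MIN_LEN, max_len=MAX_LEN):
    """Copy records with min_len <= length <= max_len; return (kept, dropped)."""
    kept = dropped = 0
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        for header, sequence, quality in read_fastq(src):
            if min_len <= len(sequence) <= max_len:
                dst.write(f"{header}\n{sequence}\n+\n{quality}\n")
                kept += 1
            else:
                dropped += 1
    return kept, dropped
```

Reads of exactly 240 or 265 bp are kept, matching the "< 240, > 265 bp" rejection rule stated above.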

Note: The gzip compression used by MiSeq Reporter differs from the gzip native to OS X, and possibly Linux. MD5 checksums are calculated for the original Undetermined files and again after the pre-processing steps complete. These values will likely differ because of the aforementioned compression differences. To compensate, the MD5 values of the uncompressed files are also calculated, since these should not change regardless of the compression system used.
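
To see why both checksums are useful, here is a minimal Python sketch of the two calculations. It is an illustration only, not the code V4_Preprocess runs; the function names `md5_of_file` and `md5_of_gzip_contents` are hypothetical.

```python
import gzip
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """MD5 of the bytes on disk (for a .gz file, the compressed stream)."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_of_gzip_contents(path, chunk_size=1 << 20):
    """MD5 of the decompressed payload, independent of how it was compressed."""
    digest = hashlib.md5()
    with gzip.open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two archives of the same FASTQ data made by different gzip implementations (or with different settings or header fields) will generally give different `md5_of_file` values but the same `md5_of_gzip_contents` value.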

### 2. Qiime_Process

Qiime_Process is the second step of the analysis pipeline. It takes demultiplexed sequences and runs a pre-defined set of QIIME analyses, principally reference OTU assignment followed by de novo OTU assignment of the reads that failed to be assigned to a reference OTU, as recommended and implemented in the RDS processing method. The user supplies the script with their QIIME demultiplexed sequences in FASTA format, the sample mapping file, and optionally their desired rarefaction depth. The script then proceeds through OTU assignment, chimera checking with ChimeraSlayer, OTU table filtering, taxonomy assignment, and simple alpha and beta diversity analyses of both un-normalized and normalized data.

Notes:

  1. The input sequence file can be named anything the user chooses, such as combined_run_seqs.fasta.
  2. If metadata is included in the sample mapping file, it will be added to the BIOM-formatted OTU tables.

Usage:

Print help: Qiime_Process -h
Basic:      Qiime_Process -s seqs.fna -m map.txt
Advanced:   Qiime_Process -s split_libs/seqs.fna -m map_with_metadata.txt -d 5000

### 3. Reanalyze_16S

Reanalyze_16S is an accessory script that users may or may not need to run. If the Qiime_Process script normalizes the samples to a value too small for the user's liking, or if the user wishes to re-run the normalization and the resulting alpha and beta diversity analyses with a different number of sequences, this script allows them to do so. To reanalyze using a new number of sequences for normalization, the user supplies the script with the path to the filtered OTU table, the sample mapping file, the phylogenetic tree, and the value to be used for rarefaction. The output files are placed in a subdirectory whose name includes the date and the new rarefaction value passed to the script.

Usage:

Print help: Reanalyze_16S -h
Standard:   Reanalyze_16S -o otu_table.biom -m map.txt -t Rep_Set_tree.tree -d 10000
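
For readers unfamiliar with what the rarefaction value controls, the sketch below shows the general idea of rarefying one sample to an even depth by random subsampling without replacement. It is a conceptual illustration under stated assumptions, not the code that Reanalyze_16S or QIIME runs: the function name `rarefy` and the dictionary layout (OTU ID mapped to read count) are hypothetical, and samples with fewer reads than the requested depth are assumed to be dropped.

```python
import random


def rarefy(counts, depth, seed=None):
    """Subsample one sample's OTU counts, without replacement, to `depth` reads.

    Returns None when the sample has fewer than `depth` reads in total.
    """
    total = sum(counts.values())
    if total < depth:
        return None  # sample is too shallow for this depth
    rng = random.Random(seed)
    # Expand the counts to one entry per read, then draw `depth` reads.
    reads = [otu for otu, n in counts.items() for _ in range(n)]
    rarefied = {}
    for otu in rng.sample(reads, depth):
        rarefied[otu] = rarefied.get(otu, 0) + 1
    return rarefied


# Example: a sample with 10,000 reads rarefied to 5,000 reads.
sample = {"OTU_1": 6000, "OTU_2": 3000, "OTU_3": 1000}
subsampled = rarefy(sample, 5000, seed=1)
```

A larger depth keeps more reads per sample but excludes more shallow samples, which is the trade-off behind choosing a different `-d` value.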

## Accessory Files

  • GreenGenes_References: contains V4, V4V5, and V1-V3 reference datasets as gzipped tar archives. Each archive holds an unaligned and an aligned set of reference sequences, as well as the taxonomy reference down to the species level.
  • Test_Data: contains a very small set of raw sequences for testing the implementation of the V4_Preprocess script.
  • phiX: contains the phiX index needed by bowtie2 as part of the V4_Preprocess script.
  • Length_Filter.py: a Biopython script that size-filters sequences as part of V4_Preprocess.
