An R package for guiding experimental determination of functional PAM sequences from CRISPR array spacers
The recent discovery and in-depth characterization of CRISPR-Cas9 and other CRISPR-Cas systems have led to a variety of technologies, including genome editing, genome modification, nucleic acid sensing, and next-generation antimicrobials. Although CRISPR-Cas systems are powerful tools for altering biology, they are often toxic when heterologously expressed in bacteria. Fortunately, about half of all sequenced bacteria encode at least one CRISPR-Cas system in their own genome, which provides an alternative to heterologous CRISPR-Cas systems for genome manipulation.
Endogenous CRISPR-Cas systems have been used to successfully edit the genomes of a few bacteria and archaea, but expansion of this method is hindered by the unique protospacer adjacent motif (PAM) sequence that each CRISPR-Cas system requires to target a DNA sequence. In other words, the PAM must be known in order to target an endogenous CRISPR-Cas system toward the genome that encodes it. However, the PAM is often also recognized during the spacer acquisition process, which adds new spacers to the endogenous CRISPR array. During this process, foreign nucleic acid invading an organism is surveyed for the presence of a PAM, and the adjacent DNA is then excised and inserted into the CRISPR array. As such, reversing this process in silico would allow determination of the PAM sequence. Past efforts to do so have primarily consisted of individual researchers generating nucleotide alignments between CRISPR array spacers and sequences in a database that varied between studies, and then manually curating the alignments to hypothesize a few potential PAMs. Other, more sophisticated approaches have built tools to find and present the nucleotide alignments to the user, but leave the user to generate PAM predictions from the alignment data.
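The idea of reading a PAM back out of the sequence next to each protospacer can be illustrated with a toy calculation. The sketch below is not Spacer2PAM's prediction or scoring method. It assumes a handful of invented flanking sequences, counts nucleotides at each flank position, and reports a consensus in which only well-conserved positions (by Shannon information content) are kept. The function names and the 1-bit threshold are hypothetical choices made for this example.

```r
# Hypothetical 5-nt sequences found immediately next to protospacer hits.
# These are invented for illustration and do not come from any real alignment.
flanks <- c("TGGAT", "AGGCA", "CGGTT", "GGGAC", "TGGCA", "AGGTG")

# Count A/C/G/T at each position; rows are bases, columns are positions.
position_counts <- function(seqs) {
  bases <- c("A", "C", "G", "T")
  chars <- do.call(rbind, strsplit(seqs, ""))
  counts <- apply(chars, 2, function(col) table(factor(col, levels = bases)))
  rownames(counts) <- bases
  counts
}

# Information content per position in bits (2 = fully conserved, 0 = uniform).
information_content <- function(counts) {
  freqs <- sweep(counts, 2, colSums(counts), "/")
  entropy <- apply(freqs, 2, function(p) -sum(ifelse(p > 0, p * log2(p), 0)))
  2 - entropy
}

# Keep the most frequent base where conservation reaches min_bits, else "N".
consensus_motif <- function(seqs, min_bits = 1) {
  counts <- position_counts(seqs)
  bits <- information_content(counts)
  top <- rownames(counts)[apply(counts, 2, which.max)]
  paste(ifelse(bits >= min_bits, top, "N"), collapse = "")
}

consensus_motif(flanks)
```

With these six invented flanks, positions 2 and 3 are always G while the others vary, so the consensus keeps only those two positions. Real alignment data are noisier, which is why filtering and scoring of the alignments matter.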
Here we present Spacer2PAM, a standardized in silico pipeline to predict PAM sequences for a given CRISPR-Cas system from annotated CRISPR array spacers. The tools in Spacer2PAM allow the user to manipulate and reformat CRISPR array spacer data and then predict PAM sequences from that data. Users may start with a FASTA file containing the CRISPR array spacers they wish to analyze (such as those from CRISPRCasdb) or with an annotated CSV file of CRISPR array spacers. Once the Spacer2PAM pipeline is run, the user is presented with a dataframe containing the statistics of their PAM prediction and a PDF file of a sequence logo annotated with the PAM prediction and score. Spacer2PAM is an easy-to-use pipeline for PAM prediction from CRISPR array spacers and is a key step toward enabling the use of endogenous CRISPR-Cas systems for genome engineering and other applications.
The user starts by passing the CRISPR-Cas system’s host organism name and a user-defined identifier to setCRISPRInfo, which sets the name of the CRISPR-Cas system and defines file output names. The user then chooses one of two options to input the CRISPR array spacer sequence data. If starting with a FASTA file containing each spacer as an individual sequence, the user may call FASTA2DF to arrange the spacer sequences and other user-supplied information about the CRISPR spacers into a dataframe suitable for downstream analysis with Spacer2PAM. We highly recommend that the user then call DF2FASTA to generate a FASTA file containing all the spacers. Although the user already has a FASTA file, doing so ensures that the title of each sequence is compatible with downstream Spacer2PAM functions. Alternatively, a user may start with a formatted dataframe containing the headers “Strain”, “Spacers”, “Array.Orientation”, “Repeat”, “Array”, and “Spacer” and pass it to DF2FASTA to generate a FASTA file containing the spacer sequences with the appropriate labels.
The user then submits the sequences in the FASTA file to BLAST for alignment. This can be done programmatically through FASTA2Alignment, which returns a dataframe summarizing the results that can be passed to joinSpacerDFandAlignmentDF to continue. FASTA2Alignment does not work for all CRISPR arrays because of the data size restrictions on the Entrez API. For arrays that exceed this size limit and generate an error message from FASTA2Alignment, we recommend using the BLAST web server. The user should use the BLASTn algorithm and exclude Eukaryotes (taxid:2759), which limits the alignment to relevant organisms and decreases both BLAST and Spacer2PAM computational time. Once the alignment is completed through the BLAST web server, the resulting hit table should be downloaded in .CSV format. The hit table file should then be passed to alignmentCSV2DF to convert it to a dataframe.
The resulting dataframe can then be passed to joinSpacerDFandAlignmentDF. This function joins the two dataframes, assigning spacer information to each alignment in the hit table. It also converts the accession number of each alignment to the genus and species name of the organism that encodes the aligned sequence using the taxonomizr package. The taxonomizr package requires the local download and setup of an SQL database, so the user should be prepared to store the database (65 GB at the time of writing) in a location that remains accessible while using joinSpacerDFandAlignmentDF.
The resulting dataframe is sufficient for PAM prediction by join2PAM, but we recommend calling Submit2Phaster first if the user plans to select the prophage prediction option in join2PAM. Submit2Phaster interacts with the PHASTER prophage prediction web server to submit a nonredundant list of accession numbers from the joined dataframe for prophage detection. Depending on the volume of traffic on the PHASTER server, prediction can take minutes to months to complete.
Lastly, the joined dataframe is passed to join2PAM. This function is the core of Spacer2PAM and predicts a PAM sequence from the alignments generated by BLAST. Multiple combinations of filter sets can be run sequentially with a single call of join2PAM. The output of join2PAM is a dataframe named collectionFrame that summarizes the filtering process and records the upstream and downstream predicted PAMs as well as their associated PAM scores. Details on PAM identification and scoring can be found in the vignette associated with Spacer2PAM.
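The data flow described above (spacer table, labeled FASTA, BLAST hit table, join) can be mimicked in base R to show what each intermediate object might look like. This is a sketch under stated assumptions and does not call any Spacer2PAM function. The spacer table uses the six documented column headers, but the values and the meaning assigned to each column are guesses. The `label` column, the `A1_S1` title format, the hit-table column names, and the `XX...` accessions are all hypothetical placeholders, and the column order of the hit table is assumed to follow the standard 12-column tabular BLAST layout.

```r
# Spacer table with the documented headers; values are invented, and the
# interpretation of each column (e.g. "Spacers" = sequence, "Spacer" = index)
# is an assumption.
spacers <- data.frame(
  Strain = "Example strain",
  Spacers = c("ACGTACGTACGTACGTACGTACGTACGTAC",
              "TTGACCGATAGCTAGGCTAACGATCGGATA"),
  Array.Orientation = "Forward",
  Repeat = "GTTTTAGAGCTATGCTGTTTTGAATGGTCCCAAAAC",
  Array = 1,
  Spacer = 1:2,
  stringsAsFactors = FALSE
)

# Hypothetical sequence titles; the real DF2FASTA title format may differ.
spacers$label <- paste0("A", spacers$Array, "_S", spacers$Spacer)

# Interleave ">title" and sequence lines to form FASTA text.
fasta_lines <- as.vector(rbind(paste0(">", spacers$label), spacers$Spacers))

# A made-up, header-less hit table with placeholder accessions.
hit_csv <- "A1_S1,XX000001.1,100,30,0,0,1,30,5021,5050,2e-08,56.5
A1_S1,XX000002.1,96.667,30,1,0,1,30,880,909,1e-06,50.1
A1_S2,XX000001.1,100,30,0,0,1,30,91200,91171,2e-08,56.5"

hits <- read.csv(
  text = hit_csv, header = FALSE, stringsAsFactors = FALSE,
  col.names = c("query", "subject", "pct_identity", "length", "mismatches",
                "gap_opens", "q_start", "q_end", "s_start", "s_end",
                "evalue", "bit_score")
)

# Attach spacer annotations to every alignment via the shared title.
joined <- merge(hits, spacers, by.x = "query", by.y = "label")

# Nonredundant accessions, the kind of list a prophage lookup would need.
unique_accessions <- unique(joined$subject)
```

The join key is the reason consistent FASTA titles matter: if the query names in the hit table do not match the spacer labels exactly, the merge silently drops those alignments.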
Using the devtools package, run the following command in R:
devtools::install_github("grybnicky/Spacer2PAM")
Once all dependencies are installed, follow the instructions to prepare the taxonomizr SQL database at https://cran.r-project.org/web/packages/taxonomizr/vignettes/usage.html.
Spacer2PAM has the following dependencies:
dplyr
ggplot2
ggseqlogo
taxonomizr
HelpersMG
httr
jsonlite
spatstat.utils
seqinr
readr
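Before running the pipeline, it may help to confirm which of the packages listed above are already available. The snippet below is a minimal sketch using only base R. The variable names are illustrative, and it only reports missing packages rather than installing anything.

```r
# Dependency names copied from the list above.
deps <- c("dplyr", "ggplot2", "ggseqlogo", "taxonomizr", "HelpersMG",
          "httr", "jsonlite", "spatstat.utils", "seqinr", "readr")

# requireNamespace() returns TRUE when a package can be loaded, FALSE otherwise.
is_installed <- vapply(deps, requireNamespace, logical(1), quietly = TRUE)
missing_deps <- deps[!is_installed]

if (length(missing_deps) > 0) {
  message("Missing packages: ", paste(missing_deps, collapse = ", "))
} else {
  message("All listed dependencies are available.")
}
```

Any packages reported as missing can then be installed in whatever way you normally install R packages.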