Software for RNA-Seq analysis on Windows, including creating sample-specific proteoform databases from genomic data
Spritz can be downloaded here.
Spritz uses Snakemake and Docker to install and run command-line tools for next-generation sequencing (NGS) analysis. These tools include an adapted version of SnpEff that annotates sequence variations and creates an annotated protein database in XML format. The combinatorics of producing full-length proteoforms from these annotations are implemented in mzLib's VariantApplication class.
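To illustrate what "combinatorics" means here, the following is a minimal Python sketch, not mzLib's actual VariantApplication code. It assumes each variant is a single-residue substitution given as a hypothetical `(position, ref, alt)` tuple with 1-based positions, and that no two variants share a position. The function names and the example sequence are made up for illustration.

```python
from itertools import combinations


def apply_substitutions(sequence, variants):
    """Apply (position, ref, alt) substitutions to a protein sequence.

    Positions are 1-based. Only single-residue substitutions are handled,
    which is a simplification made for this sketch.
    """
    residues = list(sequence)
    for position, ref, alt in variants:
        if residues[position - 1] != ref:
            raise ValueError(f"expected {ref} at position {position}")
        residues[position - 1] = alt
    return "".join(residues)


def enumerate_proteoforms(sequence, variants):
    """Yield every sequence obtained by applying any subset of the variants."""
    for size in range(len(variants) + 1):
        for subset in combinations(variants, size):
            yield apply_substitutions(sequence, subset)


# Hypothetical reference sequence and variants, for illustration only.
reference = "MKTAYIAK"
variants = [(3, "T", "S"), (6, "I", "V")]
proteoforms = list(enumerate_proteoforms(reference, variants))
```

With n independent variants this enumerates 2^n sequences, so a real implementation would need some limit on how many combinations it expands.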
Install Docker Desktop for Windows.
Allocate resources to Docker. There are two ways to do this, described in the Spritz wiki:
- The recommended method requires Windows 10 version 2004 and is more robust. Here, we allocate computer resources to Docker like any other program.
- The alternate method is available on all Windows versions but is less robust. Here, we allocate computer resources to Docker using a virtual machine that's packaged with Docker.
Launch Spritz.
Step 1: Input SRA accessions and/or add FASTQ files.
- SRAs are added with the button indicating single-end or paired-end.
- Single-end FASTQ filenames must match *_1.fastq. Paired-end files must share the same filename apart from the endings *_1.fastq and *_2.fastq.
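The naming rule for FASTQ files can be checked before launching a run. The sketch below is an illustrative helper, not part of Spritz; the function name `group_fastqs` is hypothetical, and it handles only the `.fastq` endings described above.

```python
import re
from collections import defaultdict

# Matches "<prefix>_1.fastq" or "<prefix>_2.fastq".
FASTQ_PATTERN = re.compile(r"^(?P<prefix>.+)_(?P<mate>[12])\.fastq$")


def group_fastqs(filenames):
    """Group FASTQ filenames by prefix and label each group's layout."""
    mates = defaultdict(set)
    for name in filenames:
        match = FASTQ_PATTERN.match(name)
        if match is None:
            raise ValueError(f"{name} does not end with _1.fastq or _2.fastq")
        mates[match["prefix"]].add(match["mate"])

    layout = {}
    for prefix, found in mates.items():
        if found == {"1"}:
            layout[prefix] = "single-end"
        elif found == {"1", "2"}:
            layout[prefix] = "paired-end"
        else:
            # Only a _2 file was supplied for this prefix.
            raise ValueError(f"{prefix}_2.fastq has no matching {prefix}_1.fastq")
    return layout
```

For example, `group_fastqs(["a_1.fastq", "b_1.fastq", "b_2.fastq"])` labels `a` as single-end and `b` as paired-end.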
Step 2: Create and customize your Spritz workflow.
Step 3: Run Spritz!
- Environment:
- Windows 10 recommended
- .NET Core 6.0
- 24 GB RAM recommended
- The installer (Spritz.msi) only works on Windows.
Spritz also works on the command line on Unix-like systems (Linux, macOS, WSL on Windows).
Add SRR629563 to the SRA list.
Create the Spritz workflow. Select "release-97" and "homo_sapiens."
Run Spritz!
Monitor progress in the Information textbox. The final database, final/combined.spritz.snpeff.protein.withmods.xml.gz, can be used to search MS/MS data with MetaMorpheus to find variant peptides and proteoforms, possibly with modifications. We recommend performing 1) Calibration, 2) Global PTM Discovery (G-PTM-D), and 3) Search tasks to get the best results.
The final database, final/combined.spritz.snpeff.protein.fasta, contains variant protein sequences and may be used in other search software, such as Proteome Discoverer, ProSight, and MASH Explorer.
The final database, final/combined.spritz.snpeff.protein.withdecoys.fasta, is ready for use in MSFragger. It contains variant protein sequences with decoy protein sequences appended.
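As a quick sanity check on the FASTA outputs, a short Python sketch like the one below can count entries and residues. It is a generic FASTA reader, not a Spritz utility; the function names `read_fasta` and `summarize` are hypothetical, and nothing is assumed about how Spritz formats headers or marks decoys.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a plain-text FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line and header is not None:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarize(path):
    """Return the number of entries and total residues in a FASTA file."""
    records = list(read_fasta(path))
    return {
        "entries": len(records),
        "residues": sum(len(sequence) for _, sequence in records),
    }
```

Calling `summarize("final/combined.spritz.snpeff.protein.fasta")` after a run should report a nonzero entry count. The file with decoys appended would be expected to report more entries than the one without.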
If you use Spritz, please cite:
Cesnik, A. J.; Miller, R. M.; Ibrahim, K.; Lu, L.; Millikin, R. J.; Shortreed, M. R.; Frey, B. L.; Smith, L. M. Spritz: A Proteogenomic Database Engine. J. Proteome Res. 2021, 20 (4), 1826-1834.
This pipeline uses the following tools:
- SRA: Leinonen, R.; et al. International Nucleotide Sequence Database Collaboration. The Sequence Read Archive. Nucleic Acids Res. 2011, 39 (Database issue), D19-21.
- fastp: Chen, S.; et al. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 2018, 34 (17), i884-i890.
- HISAT2: Kim, D.; et al. Graph-Based Genome Alignment and Genotyping with HISAT2 and HISAT-Genotype. Nat. Biotechnol. 2019, 37 (8), 907-915.
- SAMtools: Li, H.; et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 2009, 25 (16), 2078-2079.
- GATK: McKenna, A.; et al. The Genome Analysis Toolkit: A MapReduce Framework for Analyzing next-Generation DNA Sequencing Data. Genome Res. 2010, 20 (9), 1297-1303.
- SnpEff: Cingolani, P.; et al. A Program for Annotating and Predicting the Effects of Single Nucleotide Polymorphisms, SnpEff: SNPs in the Genome of Drosophila Melanogaster Strain W1118; Iso-2; Iso-3. Fly (Austin) 2012, 6 (2), 80-92.
- StringTie2: Kovaka, S.; et al. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biol. 2019, 20 (278), 1-13.