installation
piPipes provides a uniform interface for different organisms/genomes. Because GitHub limits the size of individual files, genome sequence and annotation files are not part of the repository and have to be downloaded. The user needs to perform an installation step that downloads these files and prepares them for the other pipelines.
To clone the directory from Github
git clone git@github.com:bowhan/piPipes.git
To make symbolic links
# enter the piPipes directory first; absolute paths keep the links valid
ln -s "$PWD/piPipes" "$HOME/bin/piPipes"
ln -s "$PWD/piPipes_debug" "$HOME/bin/piPipes_debug"
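The links are only useful if `$HOME/bin` is on your `PATH`. The following is a minimal sketch for checking that; `on_path` is an illustrative helper name and is not part of piPipes.

```bash
#!/usr/bin/env bash
# Sketch: check whether a directory is listed in PATH.
# on_path is a hypothetical helper, not a piPipes command.

on_path() {
    # wrap PATH and the directory in colons so only whole entries match
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

if on_path "$HOME/bin"; then
    echo "$HOME/bin is already on PATH"
else
    echo "add this line to your shell profile:"
    echo "export PATH=\"\$HOME/bin:\$PATH\""
fi
```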
piPipes has most of the third-party tools pre-compiled and included in the `bin` directory.
However, there are some tools the user will need to install.
# R
# Please follow instructions on http://www.r-project.org/ to install R
# if successfully installed:
$ which Rscript
/share/pkg/R/3.0.2/bin/Rscript
# Note: newer versions of R behave differently in read.table(). Please use a version earlier than R 3.1.0.
# http://stackoverflow.com/questions/22962917/barplot-failure-in-r-3-1-0-read-csv-converting-what-should-be-numerics-to-facto/23225932#23225932
# HTSeq-count
# Please follow instructions on http://www-huber.embl.de/users/anders/HTSeq/doc/install.html to install HTSeq-count
# if successfully installed:
$ which htseq-count
~/bin/htseq-count
# MACS2
# Please follow instructions on https://github.com/taoliu/MACS/blob/master/INSTALL.rst to install MACS2
# if successfully installed:
$ which macs2
~/bin/macs2
# Perl Module Statistics::Descriptive; install it through
cpan Statistics::Descriptive
# if successfully installed:
$ perl -MStatistics::Descriptive -e "print \"Installed.\\n\";"
Installed.
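The individual checks above can be combined into one script. This is a minimal sketch, assuming bash; the helper names `check_tool` and `check_perl_module` are illustrative and do not come from piPipes.

```bash
#!/usr/bin/env bash
# Sketch: report which of the external requirements listed above are available.
# check_tool and check_perl_module are hypothetical helpers, not piPipes commands.

check_tool() {
    # command -v prints the resolved path and fails if the tool is not found
    if path=$(command -v "$1" 2>/dev/null); then
        echo "found    $1 -> $path"
    else
        echo "MISSING  $1"
        return 1
    fi
}

check_perl_module() {
    # perl -M<Module> -e 1 exits non-zero when the module cannot be loaded
    if perl -M"$1" -e 1 2>/dev/null; then
        echo "found    Perl module $1"
    else
        echo "MISSING  Perl module $1"
        return 1
    fi
}

missing=0
for tool in Rscript htseq-count macs2; do
    check_tool "$tool" || missing=$((missing + 1))
done
check_perl_module Statistics::Descriptive || missing=$((missing + 1))
echo "$missing requirement(s) missing"
```

The script only checks that the tools can be found; it does not check versions, so the R version note above still has to be verified by hand.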
To install the genome in one step
piPipes install -g dm3 # fly genome dm3
piPipes install -g mm9 # mouse genome mm9
piPipes install -g hg19 # human genome
However, some computing clusters only have internet access on the 'head node', which should only be used to submit jobs. To separate downloading and preparation steps:
# on the head node: download only
piPipes install -g dm3 -D
# finish the work on a compute node
piPipes install -g dm3
# Some steps take advantage of multiple CPUs, so providing more than one CPU with `-c` might speed up the installation.
piPipes install -g dm3 -c 8
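When several genomes are needed, the two phases can be wrapped in a small script. The sketch below uses only the `-g`, `-D` and `-c` options shown above; `GENOMES`, `CPUS`, `PHASE`, `DRY_RUN`, `run` and `install_genomes` are illustrative names, not piPipes options. By default it only prints the commands it would run.

```bash
#!/usr/bin/env bash
# Sketch: two-phase installation for several genomes.
# GENOMES, CPUS, PHASE and DRY_RUN are illustrative variables, not piPipes options.

GENOMES="dm3 mm9"
CPUS=8
PHASE=${PHASE:-download}   # "download" on the head node, "prepare" on a compute node
DRY_RUN=${DRY_RUN:-1}      # 1: only print the commands; 0: actually run them

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

install_genomes() {
    for g in $GENOMES; do
        case "$1" in
            download) run piPipes install -g "$g" -D ;;          # head node
            prepare)  run piPipes install -g "$g" -c "$CPUS" ;;  # compute node
            *) echo "unknown phase: $1" >&2; return 1 ;;
        esac
    done
}

install_genomes "$PHASE"
```

Run it with `PHASE=download DRY_RUN=0` on the head node and with `PHASE=prepare DRY_RUN=0` inside a job on a compute node.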
- piPipes uses `wget --continue`, so the download resumes even if the installation is interrupted.
- During the installation, the user is prompted to define the length of siRNAs and piRNAs. Our lab uses 20-22 nt for fly/mouse siRNA, 23-29 nt for fly piRNA and 23-35 nt for mouse piRNA. The information is stored in files such as `common/dm3/variables`, and users can change the values manually.
Currently, piRNAs are best studied in Drosophila melanogaster and Mus musculus, and piPipes is optimized for those two species (assembly versions dm3 and mm9). For other organisms, some functions may not be available, due to either the relatively immature piRNA cluster annotation or the authors' limited knowledge. We would welcome cooperation with experts to make piPipes more generic in terms of the organisms it supports.
All the files for a specific genome are stored under `/path/to/piPipes/common/`. For example, fly files are stored under `/path/to/piPipes/common/dm3`.
Some annotation files that are small enough to fit on GitHub already come with piPipes. Most of them are in gzipped BED format.
For dm3, piPipes downloads the annotation from iGenome, which lacks chrU and X-TAS. piPipes therefore downloads chrU.fa from UCSC, and X-TAS.fa is kept in the GitHub repository because it is small. piPipes also ships with a few annotation files in BED and GTF format, mostly from flyBase.
For piRNA cluster annotation, piPipes uses the annotation from [Brennecke, et al., Cell, 2007](http://www.cell.com/fulltext/S0092-8674(07)00257-7).
For transposons, piPipes uses two annotations: one (transposon) from flyBase and one (repBase) from repBase. The transposon annotation has been used in the Zamore Lab since Li, et al., Cell, 2009.
For mm9, piPipes downloads the annotation from iGenome.
piPipes uses the piRNA cluster annotation from Li, et al., Mol Cell, 2013 and the transposon annotation from repBase.
For hg19, piPipes downloads the annotation from iGenome.
piPipes uses the piRNA cluster annotation from Rosenkranz, et al., BMC Bioinformatics, 2013 and the transposon annotation from repBase.
For piPipes to be fully functional on other genomes, the following steps should be completed:
- piRNA cluster annotation
Run proTRAC or piClust to get the piRNA cluster annotation in BED format.
Rosenkranz D and Zischler H. 2012. proTRAC--a software for probabilistic piRNA cluster detection, visualization and analysis. BMC Bioinformatics 13: 5.
Jung, I., Park, J. C. & Kim, S. piClust: A density based piRNA clustering algorithm. Comput Biol Chem (2014).
- Get gene structure annotations from the UCSC Table Browser or through the MySQL interface. We have already included those files for many organisms in the `common` folder.
- Edit the `genomic_features` file under the genome folder (like dm3 or mm9). See the `genomic_features` example at the end of this page.
- piPipes builds a bowtie index of the genome sequence for small RNA, a STAR index for long RNA and a Bowtie2 index for DNA. The genome sequence is named `$GENOME.fa`.
- piPipes builds a bowtie index of the rRNA for small RNA and a Bowtie2 index for normal RNA. The rRNA sequence is named `rRNA.fa`.
- piPipes builds a bowtie index of the repBase/transposon/piRNA cluster sequences for small RNA. They are named `${GENOME}.repBase.fa`, `${GENOME}.transposon.fa` and `${GENOME}.piRNAcluster.fa` respectively.
- piPipes builds a bowtie index of the transcriptome, repBase and piRNA cluster sequences for small RNA and a Bowtie2 index for long RNA, and uses eXpress to quantify them with SAM/BAM as the input.
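Before registering a cluster annotation produced by proTRAC or piClust, it can help to confirm that the file is well-formed BED. The following is a minimal sketch using `awk`; `check_bed` is a hypothetical helper, the file name in the usage line is only an example, and the `start < end` rule is an assumption that suits cluster intervals rather than a general BED requirement.

```bash
#!/usr/bin/env bash
# Sketch: basic sanity check for a (possibly gzipped) BED file.
# check_bed is a hypothetical helper, not a piPipes command.

check_bed() {
    # decompress when the name ends in .gz, otherwise read the file as-is
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | awk -F'\t' '
        /^(#|track|browser)/ { next }   # skip header lines
        NF < 3 { printf "line %d: fewer than 3 columns\n", NR; bad++; next }
        $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ {
            printf "line %d: non-integer coordinates\n", NR; bad++; next
        }
        $2 + 0 >= $3 + 0 { printf "line %d: start >= end\n", NR; bad++ }
        END { exit (bad > 0) }          # non-zero exit if any line was flagged
    '
}

# usage: check_bed my_clusters.bed && echo "BED looks fine"
```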
piPipes downloads annotations from iGenome (UCSC version), which usually include the genomic sequence (FASTA), rRNA (FASTA) and transcriptome (GTF) used by piPipes.
piPipes includes the repBase sequences (FASTA) in the GitHub repository for dm3 and mm9. For other genomes, please retrieve the repBase sequences and save them as `${GENOME}.repBase.fa` in the `common/${GENOME}` directory.
For example, run
# enter the directory unarchived from RepBase19.02.fasta.tar.gz
$ cat humrep.ref humsub.ref > ../hg19/hg19.repBase.fa
For genomes like dm3, there are transposon annotations from both repBase and flyBase; we call them repBase and transposon respectively.
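To see which of the sequence files named above are in place for a genome, something like the following sketch can be used. `check_genome_files` and its arguments are hypothetical, and the sketch assumes all five files sit directly in `common/${GENOME}`, which the text above only states explicitly for `${GENOME}.repBase.fa`.

```bash
#!/usr/bin/env bash
# Sketch: list which of the expected sequence files exist for a genome.
# check_genome_files is a hypothetical helper, not a piPipes command.

check_genome_files() {
    common_dir=$1   # e.g. /path/to/piPipes/common
    genome=$2       # e.g. hg19
    status=0
    for f in "$genome.fa" rRNA.fa \
             "$genome.repBase.fa" "$genome.transposon.fa" "$genome.piRNAcluster.fa"; do
        # -s is true only for files that exist and are not empty
        if [ -s "$common_dir/$genome/$f" ]; then
            echo "ok       $f"
        else
            echo "missing  $f"
            status=1
        fi
    done
    return $status
}

# usage: check_genome_files /path/to/piPipes/common hg19
```

A missing `${GENOME}.transposon.fa` is not necessarily an error, since a separate transposon annotation is only described for genomes like dm3.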
piPipes also includes a number of genomic features (BED). For a new genome, place the corresponding files in the `common/${GENOME}` directory and add them to the `TARGETS` array in `common/${GENOME}/genomic_features`. The following example shows the set-up for dm3:
# variables for small RNA pipeline intersecting
MASK=$COMMON_FOLDER/UCSC.rRNA+tRNA+nonCoding.bed6.gz
# tRNA, rRNA, nonCoding RNA (flyBase) from UCSC table browser
piRNA_Cluster=$COMMON_FOLDER/Brennecke.piRNAcluster.bed6.gz
# piRNA cluster defined in Brennecke, et al., Cell, 2007; no strand information
piRNA_Cluster_42AB=$COMMON_FOLDER/Brennecke.piRNAcluster.42AB.bed6.gz
# 42AB
piRNA_Cluster_20A=$COMMON_FOLDER/Brennecke.piRNAcluster.20A.bed6.gz
# 20A
piRNA_Cluster_flam=$COMMON_FOLDER/Brennecke.piRNAcluster.flam.bed6.gz
# flam
repeatMasker=$COMMON_FOLDER/UCSC.RepeatMask.bed
# RepeatMasker annotation obtained from UCSC
repeatMasker_IN_Cluster=$COMMON_FOLDER/UCSC.RepeatMask.inCluster.bed.gz
# repeat masker identified region that fall into piRNA cluster
repeatMasker_OUT_Cluster=$COMMON_FOLDER/UCSC.RepeatMask.outCluster.bed.gz
# repeat masker identified region that fall outside piRNA cluster
Trn=$COMMON_FOLDER/Zamore.transposon.bed.gz
# transposon region used in Li, et al., Cell, 2009. More conserved than repeat masker
Trn_IN_Cluster=$COMMON_FOLDER/Zamore.transposon.inCluster.bed.gz
# transposon region in cluster
Trn_OUT_Cluster=$COMMON_FOLDER/Zamore.transposon.outCluster.bed.gz
# transposon region out cluster
Trn_GROUP0=$COMMON_FOLDER/Zamore.transposon.group0.bed.gz
# transposons that failed to pass threshold in Li, et al., Cell, 2009. More conserved than repeat masker
Trn_GROUP1=$COMMON_FOLDER/Zamore.transposon.group1.bed.gz
# group 1 transposon in Li, et al., Cell, 2009, mainly germline
Trn_GROUP2=$COMMON_FOLDER/Zamore.transposon.group2.bed.gz
# group 2 transposon in Li, et al., Cell, 2009
Trn_GROUP3=$COMMON_FOLDER/Zamore.transposon.group3.bed.gz
# group 3 transposon in Li, et al., Cell, 2009, mainly somatic
flyBase_Gene=$COMMON_FOLDER/UCSC.flyBase.Genes.bed12.gz
# flyBase gene
flyBase_Exon=$COMMON_FOLDER/UCSC.flyBase.Exons.bed.gz
# flyBase exons
flyBase_Intron=$COMMON_FOLDER/UCSC.flyBase.Introns.bed.gz
# flyBase introns
flyBase_Intron_xRM=$COMMON_FOLDER/UCSC.flyBase.Introns_xRM.bed.gz
# flyBase introns that subtract repeatMasker
flyBase_5UTR=$COMMON_FOLDER/UCSC.flyBase.5UTR.bed.gz
# flyBase 5' UTR
flyBase_CDS=$COMMON_FOLDER/UCSC.flyBase.CDS.bed.gz
# flyBase CDS
flyBase_3UTR=$COMMON_FOLDER/UCSC.flyBase.3UTR.bed.gz
# flyBase 3' UTR
cisNATs=$COMMON_FOLDER/cisNATs.bed.gz
# cis-NATs
structural_loci=$COMMON_FOLDER/structured_loci.bed.gz
# structural loci
lincRNA=$COMMON_FOLDER/lincRNA.Young.bed6.gz
# linc RNA identified in 'Identification and properties of 1,119 candidate lincRNA loci in the Drosophila melanogaster genome. Genome Biol Evol. 2012;4(4):427-42.'
unannotated=$COMMON_FOLDER/unannotated_genome.bed.gz
# unannotated region, basically all the genome segments between annotations defined above
declare -a TARGETS=( \
"piRNA_Cluster" \
"piRNA_Cluster_42AB" \
"piRNA_Cluster_20A" \
"piRNA_Cluster_flam" \
"repeatMasker" \
"repeatMasker_IN_Cluster" \
"repeatMasker_OUT_Cluster" \
"Trn" \
"Trn_IN_Cluster" \
"Trn_OUT_Cluster" \
"Trn_GROUP1" \
"Trn_GROUP2" \
"Trn_GROUP3" \
"Trn_GROUP0" \
"flyBase_Gene" \
"flyBase_Exon" \
"flyBase_Intron" \
"flyBase_Intron_xRM" \
"flyBase_5UTR" \
"flyBase_CDS" \
"flyBase_3UTR" \
"cisNATs" \
"structural_loci" \
"lincRNA" \
"unannotated" )
# variables for small RNA direct mapping
declare -a DIRECT_MAPPING=( "transposon" "repBase" "piRNAcluster" )
# gtf files for rnaseq/deg/cage htseq-count
Genes_transposon_Cluster=$COMMON_FOLDER/dm3.genes+transposon+piRNACluster.gtf
Genes_repBase_Cluster=$COMMON_FOLDER/dm3.genes+repBase+piRNACluster.gtf
declare -a HTSEQ_TARGETS=( "Genes_transposon_Cluster" "Genes_repBase_Cluster" )
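Each entry in `TARGETS` is the name of a variable defined earlier in the file, so a typo in either place is easy to miss. The following is a minimal sketch, assuming bash, that resolves every name through indirect expansion and checks that the file exists. `check_targets` is a hypothetical helper, and setting `COMMON_FOLDER` to `common/dm3` in the usage comment is an assumption about how piPipes sets that variable.

```bash
#!/usr/bin/env bash
# Sketch: verify that every name in TARGETS is a defined variable
# whose value points to an existing file.
# check_targets is a hypothetical helper; it expects TARGETS and the
# per-feature variables to be defined already (e.g. by sourcing genomic_features).

check_targets() {
    local name path status=0
    for name in "${TARGETS[@]}"; do
        path=${!name}   # indirect expansion: value of the variable named $name
        if [ -z "$path" ]; then
            echo "$name: variable not set"
            status=1
        elif [ ! -e "$path" ]; then
            echo "$name: file not found: $path"
            status=1
        fi
    done
    return $status
}

# usage (COMMON_FOLDER value is an assumption):
#   COMMON_FOLDER=/path/to/piPipes/common/dm3
#   source "$COMMON_FOLDER/genomic_features"
#   check_targets && echo "all targets resolved"
```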