Installation
This document explains how to obtain piPipes from GitHub and how to install genome files.
To clone the repository from GitHub, you will need to have git
installed on your system.
If it is not installed, you will need to download and install git first.
# enter the directory where you store software
# the genome sequence and annotations will be stored under the piPipes directory
# so allow extra ~8.5 G for dm3 (fly), ~90 G for mm9 (mouse), ~131 G for hg19 (human)
git clone git@github.com:bowhan/piPipes.git
Alternatively, you can obtain piPipes from its release page. Note that without git you will
not be able to upgrade easily.
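Because the genome files are stored under the piPipes directory, it can be worth checking free disk space before choosing where to clone. The following is a minimal sketch, not part of piPipes: the function names are made up for illustration, and the sizes are the approximate figures quoted above, rounded up to whole gigabytes (dm6 is left out because no figure is given for it).

```shell
# Approximate space (in GB) needed per genome, rounded up from the figures above.
required_gb() {
    case "$1" in
        dm3)  echo 9 ;;
        mm9)  echo 90 ;;
        hg19) echo 131 ;;
        *)    echo "unknown genome: $1" >&2; return 1 ;;
    esac
}

# Free space, in whole GB, on the filesystem that holds directory $1.
# df -P prints one POSIX-format line per filesystem; column 4 is the
# available space in 1024-byte blocks because of -k.
free_gb() {
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Succeed only if directory $2 has enough free space for genome $1.
has_room_for() {
    local need
    need=$(required_gb "$1") || return 1
    [ "$(free_gb "$2")" -ge "$need" ]
}
```

For example, `has_room_for mm9 . || echo "not enough space here"` could be run in the directory where you plan to clone.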
|-- piPipes/ # directory
| |-- piPipes # main bash script to run
| |-- piPipes_debug # main bash script to run, debug mode
| |-- bin/ # binary executables
| |-- piPipes_smallRNA.sh # smallRNA seq pipeline, single sample mode
| |-- piPipes_smallRNA2.sh # smallRNA seq pipeline, dual sample mode
| |-- piPipes_RNASeq.sh # RNA-seq pipeline, single sample mode
| |-- piPipes_RNASeq2.sh # RNA-seq pipeline, dual sample mode
| |-- piPipes_DegradomeSeq.sh # Degradome-seq pipeline
| |-- piPipes_ChIPSeq.sh # ChIP-seq pipeline, single sample mode
| |-- piPipes_ChIPSeq2.sh # ChIP-seq pipeline, dual sample mode
| |-- piPipes_GenomeSeq.sh # Genomic Seq pipeline
| |-- ... # binaries like bowtie, STAR, cufflinks ...
| |-- src/ # source codes
| |-- bed2_to_bedGraph.cpp # piPipes source codes
| |-- third_party/ # source codes of other tools; use this if the precompiled ones don't work
| |-- ...
| |-- common/ # where annotations and sequences are stored
| |-- mm9/
| |-- dm3/
| |-- dm3.fa # genome sequence
| |-- genomic_features # very important configuration file, see below
| |-- Brennecke.piRNAcluster.bed6.gz # one of the annotation files, in BED format
| |-- BowtieIndex/
| |-- ...
| |-- dm6/
| |-- hg19/
| |-- genome_supported.txt # names of the genomes that have been installed
| |-- RepBase19.02.fasta.tar.gz # transposon consensus sequences from repBase
| |-- reformat_repBase_for_eXpress.sh # eXpress only takes the first token of Fasta name...
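To make symbolic links to the piPipes main scripts, so that you can run piPipes without explicitly typing the absolute path:
# enter the piPipes directory
# use absolute paths as link targets; a relative target would be resolved against $HOME/bin
ln -s "$PWD/piPipes" "$HOME/bin/piPipes"
ln -s "$PWD/piPipes_debug" "$HOME/bin/piPipes_debug"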
# if successfully done, when you type:
$ which piPipes
~/bin/piPipes
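The symbolic links only help if $HOME/bin exists and is listed in $PATH. The following is a small sketch for checking that; `path_contains` is a hypothetical helper name, not something piPipes provides.

```shell
# Succeed if directory $1 is one of the colon-separated entries in $2.
# When $2 is omitted, the current $PATH is used.
path_contains() {
    case ":${2-$PATH}:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}
```

A possible use is `path_contains "$HOME/bin" || echo 'add $HOME/bin to PATH in your shell profile'`. Wrapping both sides in colons makes the match exact, so /home/me/bin does not match /home/me/bin2.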
piPipes ships most of the third-party tools pre-compiled in the bin
directory. They are found automatically when
you run piPipes. To avoid mixing them with your own versions, we do not recommend adding /piPipes/bin
to $PATH.
However, some tools are hard to ship, so you will need to install them yourself if you have not already done so.
# 1. R
# Please follow instructions on http://www.r-project.org/ to install R
# if successfully installed:
$ which Rscript
~/bin/Rscript
# ! Note: newer versions of R behave differently in read.table().
# Please use a version earlier than R 3.1.0.
# http://stackoverflow.com/questions/22962917/barplot-failure-in-r-3-1-0-read-csv-converting-what-should-be-numerics-to-facto/23225932#23225932
# 2. HTSeq-count
# Please follow instructions on http://www-huber.embl.de/users/anders/HTSeq/doc/install.html
# to install HTSeq-count
# if successfully installed:
$ which htseq-count
~/bin/htseq-count
# 3. MACS2
# Please follow instructions on https://github.com/taoliu/MACS/blob/master/INSTALL.rst
# to install MACS2
# if successfully installed:
$ which macs2
~/bin/macs2
# 4. Perl Module Statistics::Descriptive; install it through
cpan Statistics::Descriptive
# if successfully installed:
$ perl -MStatistics::Descriptive -e "print \"Installed.\\n\";"
Installed.
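The individual checks above can be bundled into one helper. This is a minimal sketch under stated assumptions, not part of piPipes: `missing_tools` and `version_lt` are hypothetical names, and `version_lt` only handles plain dotted numeric versions such as 3.0.2.

```shell
# Print each command name from the arguments that is not found on $PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Succeed if dotted version $1 is strictly lower than $2 (for example 3.0.2 < 3.1.0).
# The two versions are sorted numerically field by field; $1 is lower
# when it differs from $2 and sorts first.
version_lt() {
    [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -t. -k1,1n -k2,2n -k3,3n | head -n 1)" = "$1" ]
}
```

For example, `missing_tools Rscript htseq-count macs2 perl` prints nothing when all four are installed. If Rscript is available, `version_lt "$(Rscript -e 'cat(as.character(getRversion()))')" 3.1.0` tests the R version constraint mentioned above.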
# to update, if you obtained piPipes with git
# enter the piPipes directory
git pull
# to discard local changes and match origin/master, if you obtained piPipes with git
# enter the piPipes directory
git reset --hard origin/master
piPipes provides a uniform interface for different organisms/genomes. Because of GitHub's file size limit, genome sequence and annotation files are not shipped in the repository and have to be downloaded. You will need to run an installation step to download the files and prepare them for the pipelines.
To install a specific genome in one step:
piPipes install -g dm3 # fly genome dm3
piPipes install -g dm6 # fly genome new release, BDGP6
piPipes install -g mm9 # mouse genome mm9
piPipes install -g hg19 # human genome hg19
Many computing clusters only have internet access on the 'head node', which should be used to submit jobs but not to run them. To separate the downloading and preparation steps:
# on the "head" node: with internet access but no computing power
piPipes install -g dm3 -D
# finish the work on a computing node
piPipes install -g dm3
# Some steps take advantage of multiple CPUs, so providing more than one CPU with `-c`
# accelerates the installation process.
piPipes install -g dm3 -c 8
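The two-stage workflow above can be wrapped in a small helper that prints the right command for each stage. This is only a sketch: `pipipes_install_cmd` and the stage names "download" and "build" are made up for illustration, and the only piPipes options used are -g, -D and -c as shown above.

```shell
# Print the piPipes install command for genome $1.
# $2 is "download" (head node, adds -D) or "build" (compute node).
# $3 is an optional CPU count passed to -c in the build stage.
pipipes_install_cmd() {
    local genome="$1" stage="$2" cpus="${3-}"
    case "$stage" in
        download)
            echo "piPipes install -g $genome -D" ;;
        build)
            if [ -n "$cpus" ]; then
                echo "piPipes install -g $genome -c $cpus"
            else
                echo "piPipes install -g $genome"
            fi ;;
        *)
            echo "unknown stage: $stage" >&2
            return 1 ;;
    esac
}
```

The printed command can be run directly on the head node for the download stage, or placed in whatever job script your cluster scheduler expects for the build stage.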
Notes:
- piPipes uses wget --continue, so downloading will resume if the installation is interrupted.
- piPipes only reruns steps that have not yet succeeded.
- During the installation, you will be prompted to define the lengths of siRNAs and piRNAs.
Our lab uses 20–22 nt for fly/mouse siRNA, 23–29 nt for fly piRNA and 23–35 nt for mouse piRNA.
This information is stored in the common/dm3/variables file (for dm3), and you can change the values manually later.
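To make the length ranges concrete, here is a small sketch that classifies a read length using the values our lab uses. It does not read or reproduce the format of the variables file; `classify_small_rna` and the species labels "fly" and "mouse" are hypothetical.

```shell
# Print "siRNA", "piRNA" or "other" for a read of length $2 (in nt).
# $1 is "fly" or "mouse"; only the upper piRNA bound differs between them.
classify_small_rna() {
    local species="$1" len="$2" pi_max
    case "$species" in
        fly)   pi_max=29 ;;
        mouse) pi_max=35 ;;
        *)     echo "unknown species: $species" >&2; return 1 ;;
    esac
    if [ "$len" -ge 20 ] && [ "$len" -le 22 ]; then
        echo siRNA
    elif [ "$len" -ge 23 ] && [ "$len" -le "$pi_max" ]; then
        echo piRNA
    else
        echo other
    fi
}
```

With these ranges, a 32 nt read counts as a piRNA in mouse but falls outside both classes in fly.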
Currently, Drosophila melanogaster and Mus musculus piRNAs are the best studied, so piPipes is optimized for those two species (assembly versions dm3 and mm9 from UCSC). For other organisms, some functions may not be available, either because the piRNA cluster annotation is relatively immature or because of the authors' limited knowledge. We would very much like to cooperate with experts to make piPipes more generic in terms of the organisms it supports. Please contact us if you would like to help.
All the files for a specific genome are stored under /path/to/piPipes/common/. For example, fly files are stored under /path/to/piPipes/common/dm3.
Some annotation files that are small enough are shipped with piPipes. Most of them are in gzipped BED format.
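Gzipped BED files can be inspected without unpacking them on disk. The helpers below are a minimal sketch with hypothetical names; they only rely on gzip, awk and head.

```shell
# Print the number of records (lines) in gzipped BED file $1.
bed_gz_records() {
    gzip -dc "$1" | awk 'END { print NR }'
}

# Print the first $2 records (default 5) of gzipped BED file $1.
bed_gz_head() {
    gzip -dc "$1" | head -n "${2:-5}"
}
```

For example, `bed_gz_records common/dm3/Brennecke.piRNAcluster.bed6.gz` reports how many intervals that annotation file holds.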
piPipes downloads the annotation from iGenome, which lacks chrU and X-TAS. piPipes therefore downloads chrU.fa from UCSC, and X-TAS.fa is included in the GitHub repository.
For piRNA cluster annotation, piPipes uses the one from Brennecke, et al., Cell, 2007.
For transposons, piPipes uses two different annotations: one (transposon) from flyBase and one (repBase) from repBase. The transposon annotation has been used in the Zamore Lab since Li, et al., Cell, 2009. The repBase annotation, however, separates the Long Terminal Repeats (LTRs) of a retrotransposon from the internal part, so LTR-derived sequences do not become multi-mappers simply because a transposon sequence contains two LTRs.
piPipes has incorporated the new assembly of the fruit fly genome, release 6.
# to install the new release
piPipes install -g dm6
Since it was only just released (July 2014), neither iGenome nor UCSC has incorporated it. We took most annotation files from flyBase. Several notes:
1. piRNA cluster
Using the converter tool provided by flyBase, we tried to convert the piRNA cluster annotation to the new assembly. However, 46 clusters could not be found in the new assembly, mostly due to "maps to more than one scaffold".
We currently keep only the 96 clusters that were mapped successfully, but we plan to use new, higher-depth data to annotate new clusters.
For more information, please read the file common/dm6/Brennecke.piRNAcluster.bed6.converted.failed
2. RepeatMasker
We ran RepeatMasker again with the following parameters to identify transposon sites.
Note that with -species drosophila, the transposon sequences come from repBase instead of flyBase.
# using flyBase transposon sequences
RepeatMasker \
-pa 24 \
-s \
-low \
-lib dmel-all-transposon-r6.01.fasta \
-gff dmel-all-chromosome-r6.01.fasta \
1> flyBase.stdout \
2> flyBase.stderr
# using repBase
RepeatMasker \
-pa 24 \
-s \
-low \
-species drosophila \
-gff dmel-all-chromosome-r6.01.fasta \
1> repBase.stdout \
2> repBase.stderr
3. GTF file
The GTF file obtained from flyBase (ftp://ftp.flybase.net/releases/FB2014_04/dmel_r6.01/gtf/dmel-all-r6.01.gtf.gz) cannot be processed correctly by gtfToGenePred from the Kent tools, due to the presence of "trans-splicing" in mdg4.
invalid gffGroup detected on line: 3R FlyBase CDS 21375060 21375912 3.000000 - 0 gene_id "FBgn0002781"; transcript_id "FBtr0084081";
GFF/GTF group FBtr0084081 on 3R+, this line is on 3R-, all group members must be on same seq and strand
# the affected trans-spliced transcripts include
FBtr0084079
FBtr0084080
FBtr0084081
FBtr0084082
FBtr0084083
FBtr0084084
FBtr0084085
FBtr0307759
FBtr0307760
We thus removed all the mdg4 annotations:
grep -v mdg4
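The error above arises because one transcript has features on both strands. The following sketch lists such transcripts in a GTF file so they can be reviewed before filtering; it is an illustration rather than the procedure we used, and `mixed_strand_transcripts` is a hypothetical name. It assumes a tab-separated GTF with the strand in column 7 and a transcript_id attribute in column 9.

```shell
# Print each transcript_id in GTF file $1 whose features are not all on the same strand.
mixed_strand_transcripts() {
    awk -F'\t' '
        $0 !~ /^#/ && match($9, /transcript_id "[^"]+"/) {
            # Skip the 15 characters of the key and opening quote, drop the closing quote.
            id = substr($9, RSTART + 15, RLENGTH - 16)
            if (!(id in strand)) {
                strand[id] = $7              # first strand seen for this transcript
            } else if (strand[id] != $7 && !(id in reported)) {
                reported[id] = 1             # report each transcript only once
                print id
            }
        }' "$1"
}
```

Run on the uncompressed flyBase GTF, this should report the trans-spliced transcripts of the kind listed above; those lines can then be removed, for example with grep -v as shown.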
piPipes downloads the annotation from iGenome.
piPipes uses the piRNA cluster annotation from Li, et al., Mol Cell, 2013 and transposon annotation from repBase.
piPipes downloads the annotation from iGenome.
piPipes uses the piRNA cluster annotation from Rosenkranz, et al., BMC Bioinformatics, 2013 and transposon annotation from repBase.
For piPipes to perform its full function on other genomes, the following steps should be completed:
1. Annotate piRNA clusters and provide the annotation in BED format. Provide the sequences in a file named ${GENOME}.piRNAcluster.fa.
Run proTRAC or piClust to produce the piRNA cluster annotation.
Rosenkranz D and Zischler H. 2012. proTRAC--a software for probabilistic piRNA cluster detection,
visualization and analysis. BMC Bioinformatics 13: 5.
Jung, I., Park, J. C. & Kim, S. piClust: A density based piRNA clustering algorithm.
Comput Biol Chem (2014).
2. Get gene structure annotations from the UCSC Table Browser or through the MySQL interface.
We have already included those files for many organisms in the common folder.
If the folder already exists, there is no need to do this step.
# this step has already been done for the following genomes
bosTau7
rn5
danRer7
TARI10
hg19
mm9
dm3
3. Edit the genomic_features file under the genome folder (like dm3 or mm9). See below.
4. The genome sequence should be provided and named ${GENOME}.fa.
piPipes builds a Bowtie index of the genome sequence for the small RNA pipeline, a STAR index for the RNA-seq and degradome pipelines, and a Bowtie2 index for the Genome-seq pipeline.
5. The rRNA sequence should be provided and named rRNA.fa.
piPipes builds a Bowtie index of the rRNA for small RNA and a Bowtie2 index for normal RNA.
6. The transposon consensus sequence should be provided and named ${GENOME}.repBase.fa.
piPipes builds a Bowtie index of the repBase/transposon/piRNA cluster sequences for small RNA.
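Before running the installation for a new genome, the file names required by the steps above can be checked in one pass. This is a minimal sketch; `missing_genome_files` is a hypothetical helper, and it only tests that each file exists and is not empty, not that its content is valid.

```shell
# Print the required files that are absent or empty in directory $2 for genome name $1.
missing_genome_files() {
    local genome="$1" dir="$2" f
    for f in "$genome.fa" rRNA.fa "$genome.repBase.fa" \
             "$genome.piRNAcluster.fa" genomic_features; do
        [ -s "$dir/$f" ] || printf '%s\n' "$f"
    done
}
```

For example, `missing_genome_files rn5 /path/to/piPipes/common/rn5` prints nothing once every file is in place.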
piPipes downloads annotations from iGenome (UCSC version), which usually include the genomic sequence (FASTA), rRNA (FASTA) and transcriptome (GTF) used by piPipes.
piPipes includes the repBase sequences (FASTA) for dm3 and mm9 in the GitHub repository. For other genomes, please retrieve the repBase sequences and save them as ${GENOME}.repBase.fa in the common/${GENOME} directory.
For example, run
# enter the directory unarchived from RepBase19.02.fasta.tar.gz
$ cat humrep.ref humsub.ref > ../hg19/hg19.repBase.fa
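The directory listing earlier notes that eXpress only takes the first token of a FASTA name, so long RepBase headers may need trimming. piPipes ships reformat_repBase_for_eXpress.sh in the common folder; the sketch below is not that script, only an illustration of the idea, and `fasta_first_token` is a hypothetical name.

```shell
# Keep only the first whitespace-delimited token of every FASTA header line.
# Sequence lines are passed through unchanged. Reads the files given as
# arguments, or standard input when no file is given.
fasta_first_token() {
    awk '/^>/ { print $1; next } { print }' "$@"
}
```

For example, `cat humrep.ref humsub.ref | fasta_first_token` would produce headers that contain the sequence name only.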
piPipes includes a number of genomic features (BED), listed in the genomic_features file under the directory of each genome.
To add your own, place the BED files in the common/${GENOME} directory and add them to the TARGETS array in common/${GENOME}/genomic_features.
Use the following example as a template:
# variables for small RNA pipeline intersecting
MASK=$COMMON_FOLDER/UCSC.rRNA+tRNA+nonCoding.bed6.gz
# tRNA, rRNA, nonCoding RNA (flyBase) from UCSC table browser
piRNA_Cluster=$COMMON_FOLDER/Brennecke.piRNAcluster.bed6.gz
# piRNA cluster defined in Brennecke, et al., Cell, 2007; no strand information
piRNA_Cluster_42AB=$COMMON_FOLDER/Brennecke.piRNAcluster.42AB.bed6.gz
# 42AB
piRNA_Cluster_20A=$COMMON_FOLDER/Brennecke.piRNAcluster.20A.bed6.gz
# 20A
piRNA_Cluster_flam=$COMMON_FOLDER/Brennecke.piRNAcluster.flam.bed6.gz
# flam
repeatMasker=$COMMON_FOLDER/UCSC.RepeatMask.bed
# RepeatMasker track obtained from UCSC
repeatMasker_IN_Cluster=$COMMON_FOLDER/UCSC.RepeatMask.inCluster.bed.gz
# RepeatMasker-identified regions that fall inside piRNA clusters
repeatMasker_OUT_Cluster=$COMMON_FOLDER/UCSC.RepeatMask.outCluster.bed.gz
# RepeatMasker-identified regions that fall outside piRNA clusters
Trn=$COMMON_FOLDER/Zamore.transposon.bed.gz
# transposon region used in Li, et al., Cell, 2009. More conserved than repeat masker
Trn_IN_Cluster=$COMMON_FOLDER/Zamore.transposon.inCluster.bed.gz
# transposon region in cluster
Trn_OUT_Cluster=$COMMON_FOLDER/Zamore.transposon.outCluster.bed.gz
# transposon region out cluster
Trn_GROUP0=$COMMON_FOLDER/Zamore.transposon.group0.bed.gz
# transposons that failed to pass threshold in Li, et al., Cell, 2009.
# More conserved than repeat masker
Trn_GROUP1=$COMMON_FOLDER/Zamore.transposon.group1.bed.gz
# group 1 transposon in Li, et al., Cell, 2009, mainly germline
Trn_GROUP2=$COMMON_FOLDER/Zamore.transposon.group2.bed.gz
# group 2 transposon in Li, et al., Cell, 2009
Trn_GROUP3=$COMMON_FOLDER/Zamore.transposon.group3.bed.gz
# group 3 transposon in Li, et al., Cell, 2009, mainly somatic
flyBase_Gene=$COMMON_FOLDER/UCSC.flyBase.Genes.bed12.gz
# flyBase gene
flyBase_Exon=$COMMON_FOLDER/UCSC.flyBase.Exons.bed.gz
# flyBase exons
flyBase_Intron=$COMMON_FOLDER/UCSC.flyBase.Introns.bed.gz
# flyBase introns
flyBase_Intron_xRM=$COMMON_FOLDER/UCSC.flyBase.Introns_xRM.bed.gz
# flyBase introns with repeatMasker regions subtracted
flyBase_5UTR=$COMMON_FOLDER/UCSC.flyBase.5UTR.bed.gz
# flyBase 5' UTR
flyBase_CDS=$COMMON_FOLDER/UCSC.flyBase.CDS.bed.gz
# flyBase CDS
flyBase_3UTR=$COMMON_FOLDER/UCSC.flyBase.3UTR.bed.gz
# flyBase 3' UTR
cisNATs=$COMMON_FOLDER/cisNATs.bed.gz
# cis-NATs
structural_loci=$COMMON_FOLDER/structured_loci.bed.gz
# structural loci
lincRNA=$COMMON_FOLDER/lincRNA.Young.bed6.gz
# linc RNA identified in 'Identification and properties of 1,119 candidate lincRNA loci in the
# Drosophila melanogaster genome. Genome Biol Evol. 2012;4(4):427-42.'
unannotated=$COMMON_FOLDER/unannotated_genome.bed.gz
# unannotated region, basically all the genome segments between annotations defined above
# TARGETS is used in small RNA-seq and degradome-seq pipeline
declare -a TARGETS=( \
"piRNA_Cluster" \
"piRNA_Cluster_42AB" \
"piRNA_Cluster_20A" \
"piRNA_Cluster_flam" \
"repeatMasker" \
"repeatMasker_IN_Cluster" \
"repeatMasker_OUT_Cluster" \
"Trn" \
"Trn_IN_Cluster" \
"Trn_OUT_Cluster" \
"Trn_GROUP1" \
"Trn_GROUP2" \
"Trn_GROUP3" \
"Trn_GROUP0" \
"flyBase_Gene" \
"flyBase_Exon" \
"flyBase_Intron" \
"flyBase_Intron_xRM" \
"flyBase_5UTR" \
"flyBase_CDS" \
"flyBase_3UTR" \
"cisNATs" \
"structural_loci" \
"lincRNA" \
"unannotated" )
# TARGETS_SHORT is used for "cis-Ping-Pong" analysis between degradome/small RNA.
# Since this step uses multi-threading itself, we are not able to run each feature simultaneously
# thus a few less important ones have been removed
declare -a TARGETS_SHORT=( \
"piRNA_Cluster" \
"piRNA_Cluster_42AB" \
"piRNA_Cluster_20A" \
"piRNA_Cluster_flam" \
"repeatMasker" \
"Trn" \
"Trn_GROUP1" \
"Trn_GROUP2" \
"Trn_GROUP3" \
"Trn_GROUP0" \
"flyBase_Gene" \
"flyBase_Exon" \
"flyBase_Intron_xRM" \
"flyBase_5UTR" \
"flyBase_3UTR" \
"lincRNA" )
# variables for small RNA direct mapping
declare -a DIRECT_MAPPING=( "transposon" "repBase" "piRNAcluster" )
# gtf files for rnaseq/deg/cage htseq-count
Genes_transposon_Cluster=$COMMON_FOLDER/dm3.genes+transposon+piRNACluster.gtf
Genes_repBase_Cluster=$COMMON_FOLDER/dm3.genes+repBase+piRNACluster.gtf
declare -a HTSEQ_TARGETS=( "Genes_transposon_Cluster" "Genes_repBase_Cluster" )
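Every name listed in TARGETS must match a variable defined earlier in the file, and that variable must point to an existing file. The following sketch checks both conditions. It is not part of piPipes; `missing_features` is a hypothetical helper. It uses eval for the variable lookup so that it also works in shells without bash indirect expansion, and therefore rejects anything that is not a plain variable name.

```shell
# Print every feature name whose variable is unset or empty, or whose file does not exist.
# Arguments: the feature names to check, one per argument.
missing_features() {
    local name path
    for name in "$@"; do
        # Only plain variable names may be handed to eval; report anything else.
        case "$name" in
            ''|[0-9]*|*[!A-Za-z0-9_]*) printf '%s\n' "$name"; continue ;;
        esac
        eval "path=\${$name:-}"      # look up the variable called $name
        if [ -z "$path" ] || [ ! -e "$path" ]; then
            printf '%s\n' "$name"
        fi
    done
}
```

In bash, assuming COMMON_FOLDER is set before the configuration is sourced, `missing_features "${TARGETS[@]}"` lists the entries that still need a variable or a file.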