
installation

bowhan edited this page Aug 15, 2014 · 31 revisions

piPipes installation and genome preparation

This document explains how to obtain piPipes from Github and how to install genome files.

To obtain piPipes

From Github

To clone the repository from GitHub, you need git installed on your system. If it is not installed, download and install git first.

# enter the directory where you keep software
# the genome sequence and annotations will be stored under the piPipes directory
# so allow extra ~8.5 G for dm3 (fly), ~90 G for mm9 (mouse), ~131 G for hg19 (human)
git clone git@github.com:bowhan/piPipes.git
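
Since the genome files end up under the piPipes directory, it can help to confirm there is enough room before cloning. The following is a minimal sketch, not part of piPipes: the function names `required_gb`, `free_gb` and `has_room_for` are hypothetical, and the sizes are simply the approximate figures quoted above, rounded up to whole gigabytes.

```shell
# Approximate space needed per genome, in whole GB, rounded up from the
# figures quoted above (~8.5 G dm3, ~90 G mm9, ~131 G hg19).
required_gb() {
    case "$1" in
        dm3)  echo 9 ;;
        mm9)  echo 90 ;;
        hg19) echo 131 ;;
        *)    echo "unknown genome: $1" >&2; return 1 ;;
    esac
}

# Free space, in whole GB, on the filesystem that holds directory $1.
free_gb() {
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Succeeds only if directory $2 has room for genome $1.
has_room_for() (
    need=$(required_gb "$1") || exit 1
    have=$(free_gb "$2")
    [ "$have" -ge "$need" ]
)

# Example (not run here):
#   has_room_for mm9 . || echo "not enough space for mm9 in this directory"
```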

From release page

Alternatively, you can obtain piPipes from its release page. Note that you will not be able to easily make upgrades without git.

Set up

To make symbolic links to the piPipes main scripts, so that the shell can find piPipes without you typing the absolute path:

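# enter the piPipes directory; $HOME/bin must exist and be listed in $PATH
# use absolute targets: a relative target would be resolved relative to $HOME/bin
ln -s "$PWD/piPipes" "$HOME/bin/piPipes"
ln -s "$PWD/piPipes_debug" "$HOME/bin/piPipes_debug"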
# if successfully done, when you type:
$ which piPipes
~/bin/piPipes
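
If `which piPipes` prints nothing, the usual cause is that `$HOME/bin` is not listed in `$PATH`. A small sketch for checking this follows; `path_contains` is a hypothetical helper name, not something piPipes provides.

```shell
# Succeeds if directory $1 is one of the colon-separated entries of $2
# (which defaults to the current $PATH).
path_contains() {
    case ":${2-$PATH}:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Example (not run here):
#   mkdir -p "$HOME/bin"
#   path_contains "$HOME/bin" || echo "please add $HOME/bin to your PATH"
```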

Other software

piPipes ships most of the third-party tools pre-compiled in its bin directory, and finds them automatically when it runs. To avoid mixing them with your own versions, we do not recommend adding /piPipes/bin to $PATH.

However, some tools are hard to ship, so you will need to install them yourself if you have not already done so.

# 1. R
# Please follow instructions on http://www.r-project.org/ to install R
#! if successfully installed:
$ which Rscript
~/bin/Rscript
#! Note: newer versions of R behave differently in read.table().
# Please use a version earlier than R 3.1.0.
# http://stackoverflow.com/questions/22962917/barplot-failure-in-r-3-1-0-read-csv-converting-what-should-be-numerics-to-facto/23225932#23225932
# Try to keep only one version of R in your system or $PATH
# Many of the "bugs" reported by our users were caused by multiple versions of R!

# FYI: in the installation pipeline, piPipes will try to install the following packages
# It would be nice if they are manually installed and confirmed
## from CRAN
RColorBrewer
ggplot2
ggthemes
gplots
multicore
scales
reshape
gridExtra
gdata
RCircos
## from Bioconductor
cummeRbund


# 2. HTSeq-count
# Please follow instructions on http://www-huber.embl.de/users/anders/HTSeq/doc/install.html
# to install HTSeq-count
#! if successfully installed:
$ which htseq-count
~/bin/htseq-count


# 3. MACS2
# Please follow instructions on https://github.com/taoliu/MACS/blob/master/INSTALL.rst
# to install MACS2
#! if successfully installed:
$ which macs2
~/bin/macs2


# 4. Perl Module Statistics::Descriptive; install it through
cpan Statistics::Descriptive
#! if successfully installed:
$ perl -MStatistics::Descriptive -e "print \"Installed.\\n\";"
Installed.
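
The four checks above can be gathered into one pass. This is a rough sketch under stated assumptions: the helper names (`have_tool`, `version_lt`, `report_missing`) are made up for illustration, and reading the R version with `getRversion()` is our choice here, not something the piPipes installer is known to do.

```shell
# Succeeds if an executable named $1 can be found on $PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Succeeds if dotted version $1 is strictly lower than dotted version $2.
# Fields are compared numerically, left to right; missing fields count as 0.
version_lt() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "."); nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            if (x[i] + 0 < y[i] + 0) exit 0
            if (x[i] + 0 > y[i] + 0) exit 1
        }
        exit 1
    }'
}

# Print one line per missing prerequisite; print nothing if all are present.
report_missing() {
    for tool in Rscript htseq-count macs2; do
        have_tool "$tool" || echo "missing: $tool"
    done
    perl -MStatistics::Descriptive -e 1 2>/dev/null ||
        echo "missing: Perl module Statistics::Descriptive"
}

# Example (not run here):
#   report_missing
#   r_version=$(Rscript -e 'cat(as.character(getRversion()))')
#   version_lt "$r_version" 3.1.0 || echo "R $r_version is 3.1.0 or newer"
```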

To update piPipes

# if you have git
# enter the piPipes directory
git pull

Reinstall (start from scratch)

# if you have git
# enter the piPipes directory
git reset --hard origin/master

To install genome

piPipes provides a uniform interface for different organisms/genomes. Because GitHub limits file size, genome sequence and annotation files have to be downloaded separately. The user needs to run an installation step that downloads the files and prepares them for the other pipelines.

To install a specific genome in one step:

piPipes install -g dm3    # fly genome dm3
piPipes install -g dm6    # fly genome new release, BDGP6
piPipes install -g mm9    # mouse genome mm9
piPipes install -g hg19   # human genome hg19

Many computing clusters only have internet access on the 'head node', which should only be used to submit jobs but not to run jobs. To separate downloading and preparation steps:

# on the head node: internet access but no computing power
piPipes install -g dm3 -D
# finish the work on a computing node
piPipes install -g dm3
# Some steps take advantage of multiple CPUs, so providing more than one CPU with `-c`
# accelerates the installation process.
piPipes install -g dm3 -c 8
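
When several genomes are needed, it can be convenient to write out the two-stage commands before running anything. The sketch below only prints commands and never calls piPipes; `install_plan` is a hypothetical helper, and it uses only the `-g`, `-D` and `-c` options shown above.

```shell
# Print, without running them, the two commands for a split installation of
# genome $1. If a CPU count is given as $2, it is passed to the second
# (preparation) command with -c.
install_plan() {
    echo "piPipes install -g $1 -D"
    if [ -n "${2-}" ]; then
        echo "piPipes install -g $1 -c $2"
    else
        echo "piPipes install -g $1"
    fi
}

# Example (not run here): list the commands for fly and mouse with 8 CPUs.
#   for g in dm3 mm9; do install_plan "$g" 8; done
```

The first printed line is meant for the head node and the second for a computing node; how the second one is submitted depends on your scheduler and is left out on purpose.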

Notes:

  • piPipes uses wget --continue so downloading will resume if the installation is disrupted.

  • piPipes also only runs steps that haven't succeeded.

  • During the installation, the user is prompted to define the length ranges of siRNAs and piRNAs. Our lab uses 20–22 nt for fly/mouse siRNA, 23–29 nt for fly piRNA and 23–35 nt for mouse piRNA. This information is stored in common/dm3/variables files and users can change the values manually later.
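
To make the length ranges concrete, here is a small sketch that labels a read length using the fly defaults quoted above. The function name `classify_length` and its argument order are assumptions for illustration; they are not the variable names piPipes stores in its variables files.

```shell
# Classify a read length in nt. Defaults are the fly ranges quoted above:
# 20-22 nt siRNA and 23-29 nt piRNA. Other bounds can be passed as
# $2..$5 (siRNA low, siRNA high, piRNA low, piRNA high).
classify_length() (
    len=$1
    si_lo=${2:-20}; si_hi=${3:-22}
    pi_lo=${4:-23}; pi_hi=${5:-29}
    if [ "$len" -ge "$si_lo" ] && [ "$len" -le "$si_hi" ]; then
        echo siRNA
    elif [ "$len" -ge "$pi_lo" ] && [ "$len" -le "$pi_hi" ]; then
        echo piRNA
    else
        echo other
    fi
)

# Example (not run here): the mouse piRNA range is 23-35 nt.
#   classify_length 32 20 22 23 35
```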

Genome Assembly Supported

Currently, Drosophila melanogaster and Mus musculus piRNAs are the best studied, and piPipes is optimized for those two species (assemblies dm3 and mm9 from UCSC). For other organisms, some functions may not be available, either because the piRNA cluster annotation is relatively immature or because the authors lack the necessary knowledge. We would like to work with experts to make piPipes support more organisms. Please contact us if you would like to help.

File organization

All the files for a specific genome are stored under /path/to/piPipes/common/. For example, fly files are stored under /path/to/piPipes/common/dm3. Most of them are in gzipped BED format.

dm3

piPipes downloads the annotation from iGenome, which lacks chrU and X-TAS. piPipes therefore downloads chrU.fa from UCSC, and X-TAS.fa is included in the GitHub repository.

For piRNA cluster annotation, piPipes uses the one from Brennecke, et al., Cell, 2007.

For transposons, piPipes uses two different annotations: the transposon sequences are from flyBase and the repBase sequences are from repBase. The transposon annotation has been used in the Zamore Lab since Li, et al., Cell, 2009. The repBase annotation, in contrast, separates the Long Terminal Repeats (LTRs) of a retrotransposon from the internal part, so LTR-derived sequences do not become multi-mappers simply because a transposon sequence contains two LTRs.

BDGP6 (Berkeley Drosophila Genome Project Release 6)

piPipes has incorporated the new assembly of fruitfly genome release 6.

# to install the new release
piPipes install -g dm6

Because release 6 had only just come out (July 2014), neither iGenome nor UCSC had incorporated it, so we took most of the annotation files from flyBase. Several notes:

1. piRNA cluster

Using the converter tool provided by flyBase, we tried to convert the piRNA cluster annotation to the new assembly. However, 46 clusters could not be placed in the new assembly, mostly due to "maps to more than one scaffold".

For now we keep only the 96 clusters that mapped successfully, but we plan to annotate new clusters using new, deeper data.

For more information, please read file common/dm6/Brennecke.piRNAcluster.bed6.converted.failed

2. Repeat Masker

We ran RepeatMasker again with the following parameters to identify transposon sites.

Note that with -species drosophila, RepeatMasker uses the transposon sequences from repBase instead of the sequences from flyBase.

# using flyBase transposon sequences
RepeatMasker \
	-pa 24 \
	-s \
	-low \
	-lib dmel-all-transposon-r6.01.fasta \
	-gff dmel-all-chromosome-r6.01.fasta \
	1> flyBase.stdout \
	2> flyBase.stderr
# using repBase
RepeatMasker \
	-pa 24 \
	-s \
	-low \
	-species drosophila \
	-gff dmel-all-chromosome-r6.01.fasta \
	1> repBase.stdout \
	2> repBase.stderr

3. GTF file

The gtf file obtained from flyBase ftp://ftp.flybase.net/releases/FB2014_04/dmel_r6.01/gtf/dmel-all-r6.01.gtf.gz cannot be correctly processed by gtfToGenePred from kent tools, due to the presence of "trans-splicing" of mdg4.

invalid gffGroup detected on line: 3R	FlyBase	CDS	21375060	21375912	3.000000	-	0	gene_id "FBgn0002781"; transcript_id "FBtr0084081";
GFF/GTF group FBtr0084081 on 3R+, this line is on 3R-, all group members must be on same seq and strand
# the trans-spliced transcripts include

FBtr0084079
FBtr0084080
FBtr0084081
FBtr0084082
FBtr0084083
FBtr0084084
FBtr0084085
FBtr0307759
FBtr0307760

We thus removed all the mdg4 annotations.

grep -v mdg4
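
Written out as a complete filter, the removal step might look like the sketch below. It assumes, as the command above implies, that every mdg4 record carries the string "mdg4" somewhere on its line; the function name `drop_mdg4` and the output file name in the example are hypothetical.

```shell
# Read GTF records on stdin and write to stdout every line that does not
# contain the string "mdg4" (a plain substring match). grep exits with 1
# when it selects no lines, which is not an error here, so only exit
# codes above 1 are reported as failure.
drop_mdg4() {
    grep -v 'mdg4' || [ "$?" -eq 1 ]
}

# Example (not run here):
#   gzip -dc dmel-all-r6.01.gtf.gz | drop_mdg4 > dmel-all-r6.01.no_mdg4.gtf
```

Because the match is a substring match, any other record that happens to mention "mdg4" is dropped as well; it is worth comparing line counts before and after.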

mm9

piPipes downloads the annotation from iGenome.

piPipes uses the piRNA cluster annotation from Li, et al., Mol Cell, 2013 and transposon annotation from repBase.

hg19

piPipes downloads the annotation from iGenome.

piPipes uses the piRNA cluster annotation from Rosenkranz, et al., BMC Bioinformatics, 2013 and transposon annotation from repBase.

other genomes

In order for piPipes to perform its full function on other genomes, the following steps should be completed:

1. Annotate the piRNA clusters and provide the annotation in BED format. Provide the cluster sequences in a file named ${GENOME}.piRNAcluster.fa.

Run proTRAC or piClust to produce piRNA cluster annotation.

	Rosenkranz D and Zischler H. 2012. proTRAC--a software for probabilistic piRNA cluster detection,
visualization and analysis. BMC Bioinformatics 13: 5.
	Jung, I., Park, J. C. & Kim, S. piClust: A density based piRNA clustering algorithm.
Comput Biol Chem (2014).

2. Get gene structure annotations from the UCSC table browser or through the MySQL interface. We have already included those files for many organisms in the common folder. If the folder for your genome already exists, you can skip this step.

# currently those genomes have been done for this step
bosTau7
rn5
danRer7
TARI10
hg19
mm9
dm3

3. Edit the genomic_features file under the genome folder (like dm3 or mm9). See below.

4. The genome sequence should be provided and named $GENOME.fa. piPipes builds a bowtie index of the genome sequence for the small RNA pipeline, a STAR index for the RNA-seq and degradome pipelines, and a Bowtie2 index for the Genome-seq pipeline.

5. The rRNA sequence should be provided and named rRNA.fa. piPipes builds a bowtie index of the rRNA for small RNA and a bowtie2 index for normal RNA.

6. The transposon consensus sequence should be provided and named ${GENOME}.repBase.fa. piPipes builds bowtie indexes of the repBase/transposon/piRNA cluster sequences for small RNA.
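
Before running an installation on a new genome, the file names listed in the steps above can be checked in one go. This is a minimal sketch, assuming the files sit directly in the genome folder; `missing_genome_files` is a hypothetical helper, not a piPipes command.

```shell
# Print the names of the expected input files that are absent from the
# genome directory $1 (for example common/rn5) for genome name $2.
# Prints nothing when all of them are present.
missing_genome_files() (
    dir=$1
    genome=$2
    for f in "$genome.fa" rRNA.fa "$genome.repBase.fa" \
             "$genome.piRNAcluster.fa" genomic_features; do
        [ -e "$dir/$f" ] || echo "$f"
    done
)

# Example (not run here):
#   missing_genome_files /path/to/piPipes/common/rn5 rn5
```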

Basic piPipes directory structure

|-- piPipes/ # top directory
|   |-- piPipes # main bash script to run
|   |-- piPipes_debug # main bash script to run, debug mode
|   |-- bin/ # binary executables
|       |-- piPipes_smallRNA.sh # smallRNA seq pipeline, single sample mode
|       |-- piPipes_smallRNA2.sh # smallRNA seq pipeline, dual sample mode
|       |-- piPipes_RNASeq.sh # RNA-seq pipeline, single sample mode
|       |-- piPipes_RNASeq2.sh # RNA-seq pipeline, dual sample mode
|       |-- piPipes_DegradomeSeq.sh # Degradome-seq pipeline
|       |-- piPipes_ChIPSeq.sh # ChIP-seq pipeline, single sample mode
|       |-- piPipes_ChIPSeq2.sh # ChIP-seq pipeline, dual sample mode
|       |-- piPipes_GenomeSeq.sh # Genomic Seq pipeline
|       |-- ... # binaries like bowtie, STAR, cufflinks ...
|   |-- src/ # source codes
|       |-- bed2_to_bedGraph.cpp # piPipes source codes
|       |-- third_party/ # source codes of other tools; use this if the precompiled ones don't work
|       |-- ...
|   |-- common/ # where annotations and sequences are stored
|       |-- mm9/
|       |-- dm3/
|           |-- dm3.fa # genome sequence
|           |-- genomic_features # very important configuration file, see below
|           |-- Brennecke.piRNAcluster.bed6.gz # one of the annotation files, in BED format
|           |-- BowtieIndex/
|           |-- ...
|       |-- dm6/
|       |-- hg19/
|       |-- genome_supported.txt # stores the names of the genomes that have been installed
|       |-- RepBase19.02.fasta.tar.gz # transposon consensus sequences from repBase
|       |-- reformat_repBase_for_eXpress.sh # eXpress only takes the first token of Fasta name...
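
As the last comment in the tree notes, eXpress reads only the first token of a FASTA name. One way to trim headers accordingly is sketched below; this is an illustration of the idea, not the contents of reformat_repBase_for_eXpress.sh, and `first_token_headers` is a hypothetical name.

```shell
# Keep only the first whitespace-delimited token of each FASTA header line;
# sequence lines pass through unchanged.
first_token_headers() {
    awk '/^>/ { print $1; next } { print }'
}

# Example (not run here; file names are placeholders):
#   first_token_headers < input.fa > output.fa
```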

common folder

piPipes downloads annotations from iGenome (UCSC version), which usually include the genomic sequence (fasta), rRNA (fasta) and transcriptome (gtf) used by piPipes. The GitHub repository includes the repBase sequences (fasta) for dm3 and mm9. For other genomes, please retrieve the repBase sequences and save them as ${GENOME}.repBase.fa in the common/${GENOME} directory. For example, run:

# enter the directory unarchived from RepBase19.02.fasta.tar.gz
$ cat humrep.ref humsub.ref > ../hg19/hg19.repBase.fa

genomic features

piPipes lists a number of genomic features (BED files) in the genomic_features file under the directory of each genome. For a new genome, put the BED files in the common/${GENOME} directory and add their variable names to the TARGETS array in common/${GENOME}/genomic_features. Use the following example as a template:

# variables for small RNA pipeline intersecting
	MASK=$COMMON_FOLDER/UCSC.rRNA+tRNA+nonCoding.bed6.gz
	# tRNA, rRNA, nonCoding RNA (flyBase) from UCSC table browser
	piRNA_Cluster=$COMMON_FOLDER/Brennecke.piRNAcluster.bed6.gz
	# piRNA cluster defined in Brennecke, et al., Cell, 2007; no strand information
	piRNA_Cluster_42AB=$COMMON_FOLDER/Brennecke.piRNAcluster.42AB.bed6.gz
	# 42AB
	piRNA_Cluster_20A=$COMMON_FOLDER/Brennecke.piRNAcluster.20A.bed6.gz
	# 20A
	piRNA_Cluster_flam=$COMMON_FOLDER/Brennecke.piRNAcluster.flam.bed6.gz
	# flam
	repeatMasker=$COMMON_FOLDER/UCSC.RepeatMask.bed
	# repeatMasker obtained from UCSC
	repeatMasker_IN_Cluster=$COMMON_FOLDER/UCSC.RepeatMask.inCluster.bed.gz
	# repeatMasker-identified regions that fall inside piRNA clusters
	repeatMasker_OUT_Cluster=$COMMON_FOLDER/UCSC.RepeatMask.outCluster.bed.gz
	# repeatMasker-identified regions that fall outside piRNA clusters
	Trn=$COMMON_FOLDER/Zamore.transposon.bed.gz
	# transposon region used in Li, et al., Cell, 2009. More conserved than repeat masker
	Trn_IN_Cluster=$COMMON_FOLDER/Zamore.transposon.inCluster.bed.gz
	# transposon region in cluster
	Trn_OUT_Cluster=$COMMON_FOLDER/Zamore.transposon.outCluster.bed.gz
	# transposon region out cluster
	Trn_GROUP0=$COMMON_FOLDER/Zamore.transposon.group0.bed.gz
	# transposons that failed to pass threshold in Li, et al., Cell, 2009.
	# More conserved than repeat masker
	Trn_GROUP1=$COMMON_FOLDER/Zamore.transposon.group1.bed.gz
	# group 1 transposon in Li, et al., Cell, 2009, mainly germline
	Trn_GROUP2=$COMMON_FOLDER/Zamore.transposon.group2.bed.gz
	# group 2 transposon in Li, et al., Cell, 2009
	Trn_GROUP3=$COMMON_FOLDER/Zamore.transposon.group3.bed.gz
	# group 3 transposon in Li, et al., Cell, 2009, mainly somatic
	flyBase_Gene=$COMMON_FOLDER/UCSC.flyBase.Genes.bed12.gz
	# flyBase gene
	flyBase_Exon=$COMMON_FOLDER/UCSC.flyBase.Exons.bed.gz
	# flyBase exons
	flyBase_Intron=$COMMON_FOLDER/UCSC.flyBase.Introns.bed.gz
	# flyBase introns
	flyBase_Intron_xRM=$COMMON_FOLDER/UCSC.flyBase.Introns_xRM.bed.gz
	# flyBase introns with repeatMasker regions subtracted
	flyBase_5UTR=$COMMON_FOLDER/UCSC.flyBase.5UTR.bed.gz
	# flyBase 5' UTR
	flyBase_CDS=$COMMON_FOLDER/UCSC.flyBase.CDS.bed.gz
	# flyBase CDS
	flyBase_3UTR=$COMMON_FOLDER/UCSC.flyBase.3UTR.bed.gz
	# flyBase 3' UTR
	cisNATs=$COMMON_FOLDER/cisNATs.bed.gz
	# cis-NATs
	structural_loci=$COMMON_FOLDER/structured_loci.bed.gz
	# structural loci
	lincRNA=$COMMON_FOLDER/lincRNA.Young.bed6.gz
	# linc RNA identified in 'Identification and properties of 1,119 candidate lincRNA loci in the
	# Drosophila melanogaster genome. Genome Biol Evol. 2012;4(4):427-42.'
	unannotated=$COMMON_FOLDER/unannotated_genome.bed.gz
	# unannotated regions, basically all the genome segments between the annotations defined above

# TARGETS is used in small RNA-seq and degradome-seq pipeline
	declare -a TARGETS=( \
	"piRNA_Cluster" \
	"piRNA_Cluster_42AB" \
	"piRNA_Cluster_20A" \
	"piRNA_Cluster_flam" \
	"repeatMasker" \
	"repeatMasker_IN_Cluster" \
	"repeatMasker_OUT_Cluster" \
	"Trn" \
	"Trn_IN_Cluster" \
	"Trn_OUT_Cluster" \
	"Trn_GROUP1" \
	"Trn_GROUP2" \
	"Trn_GROUP3" \
	"Trn_GROUP0" \
	"flyBase_Gene" \
	"flyBase_Exon" \
	"flyBase_Intron" \
	"flyBase_Intron_xRM" \
	"flyBase_5UTR" \
	"flyBase_CDS" \
	"flyBase_3UTR" \
	"cisNATs" \
	"structural_loci" \
	"lincRNA" \
	"unannotated" )

# TARGETS_SHORT is used for "cis-Ping-Pong" analysis between degradome/small RNA.
# Since this step uses multi-threading itself, we are not able to run each feature simultaneously
# thus a few less important ones have been removed
	declare -a TARGETS_SHORT=( \
	"piRNA_Cluster" \
	"piRNA_Cluster_42AB" \
	"piRNA_Cluster_20A" \
	"piRNA_Cluster_flam" \
	"repeatMasker" \
	"Trn" \
	"Trn_GROUP1" \
	"Trn_GROUP2" \
	"Trn_GROUP3" \
	"Trn_GROUP0" \
	"flyBase_Gene" \
	"flyBase_Exon" \
	"flyBase_Intron_xRM" \
	"flyBase_5UTR" \
	"flyBase_3UTR" \
	"lincRNA" )

# variables for small RNA direct mapping
	declare -a DIRECT_MAPPING=( "transposon" "repBase" "piRNAcluster" )

# gtf files for rnaseq/deg/cage htseq-count
	Genes_transposon_Cluster=$COMMON_FOLDER/dm3.genes+transposon+piRNACluster.gtf
	Genes_repBase_Cluster=$COMMON_FOLDER/dm3.genes+repBase+piRNACluster.gtf
	declare -a HTSEQ_TARGETS=( "Genes_transposon_Cluster" "Genes_repBase_Cluster" )
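
The arrays above hold variable names rather than paths, so each name has to be looked up to find its file. The sketch below shows one way to do that lookup and can serve as a sanity check after editing genomic_features. It is an assumption about usage, not a description of how piPipes resolves the names internally, and `resolve_targets` is a hypothetical helper.

```shell
# Print "name<TAB>path" for each variable name given as an argument, taking
# the path from the shell variable of that name (empty if it is unset).
# Names are validated first because the lookup goes through eval.
resolve_targets() (
    for name in "$@"; do
        case "$name" in
            ''|[0-9]*|*[!A-Za-z0-9_]*)
                echo "not a valid variable name: $name" >&2
                exit 1 ;;
        esac
        eval "path=\${$name-}"
        printf '%s\t%s\n' "$name" "$path"
    done
)

# Example (bash, not run here), after sourcing genomic_features with
# COMMON_FOLDER set:
#   resolve_targets "${TARGETS[@]}"
```

A name that prints with an empty path was added to an array without a matching variable definition.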