installation

bowhan edited this page Jan 22, 2015 · 31 revisions

piPipes installation and genome preparation

This document explains how to obtain piPipes from GitHub and how to install genome files.

To obtain and update piPipes

To obtain piPipes using command-line git

To clone the repository from GitHub, you need git installed on your system. If it is not, please download git here.

# The genome sequence and annotations will be stored under the piPipes directory
# so allow extra ~8.5 G for dm3 (fly), ~90 G for mm9 (mouse), ~131 G for hg19 (human)
git clone https://github.com/bowhan/piPipes.git
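Before cloning, it can help to confirm that the target filesystem has room for the genome you plan to install. The following is a minimal sketch and not part of piPipes: `has_free_gb` is a hypothetical helper that compares the free space reported by `df` with a required size in GB (use the approximate figures quoted above, rounded up).

```bash
# has_free_gb DIR GB: succeed if DIR's filesystem has at least GB gigabytes free.
# Hypothetical helper for illustration; piPipes does not provide it.
has_free_gb() {
    # df -P prints one POSIX-format line per filesystem; column 4 is free space in KB
    avail_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $(( $2 * 1024 * 1024 )) ]
}

# Example: mm9 needs roughly 90 G
# has_free_gb . 90 || echo "not enough free space for mm9" >&2
```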

To update piPipes

# If you have git, enter the piPipes directory and then type:
git pull

# occasionally, you might get an error message like:
git pull
Updating 42bf792..fe137aa
error: Untracked working tree file 'common/dm3/rRNA.fa' would be overwritten by merge.  Aborting
# this issue originates from the explicit inclusion of the rRNA.fa file in piPipes, which conflicts
# with the same file extracted from iGenome when you install the genome
# to solve "Untracked working tree file":
rm -f common/dm3/rRNA.fa && git pull


# you might also get an error like:
warning: Cannot merge binary files: common/dm3/structured_loci.bed.gz 
(HEAD vs. fe137aab4c81c6b0ff3f66cef68e3b7e396aba15)

Auto-merging common/dm3/structured_loci.bed.gz
CONFLICT (content): Merge conflict in common/dm3/structured_loci.bed.gz
Automatic merge failed; fix conflicts and then commit the result.

# this issue originates from a forced update of a file that is listed in .gitignore
# to solve "Merge conflict":
git checkout -- common/dm3/structured_loci.bed.gz 
git pull

To re-install (start from scratch)

# If you have git, enter the piPipes directory and then type:
git reset --hard origin/master
# to re-install a genome in a clean background, enter the common/ directory and do:
rm -rf bosTau7
git checkout -- bosTau7
piPipes install -g bosTau7

To obtain piPipes from release page

Alternatively, you can obtain piPipes from its release page. Note that you will not be able to make upgrades without git.

To set up piPipes

Make symbolic links to the piPipes scripts so that you can run piPipes without typing the absolute path:

# Enter the piPipes directory
ln -s $PWD/piPipes $HOME/bin/piPipes
ln -s $PWD/piPipes_debug $HOME/bin/piPipes_debug
# If successfully done:
$ which piPipes
~/bin/piPipes
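The `ln -s` commands above assume that `$HOME/bin` already exists and is on your `$PATH`. A minimal sketch that covers both assumptions is shown here; `link_launchers` is a hypothetical helper, not something piPipes ships.

```bash
# link_launchers SRC_DIR [BIN_DIR]: symlink piPipes and piPipes_debug into BIN_DIR.
# Hypothetical helper for illustration; BIN_DIR defaults to $HOME/bin.
link_launchers() (
    src_dir=$1
    bin_dir=${2:-$HOME/bin}
    mkdir -p "$bin_dir" || exit 1
    for name in piPipes piPipes_debug; do
        # -f replaces a stale link left by an earlier clone
        ln -sf "$src_dir/$name" "$bin_dir/$name" || exit 1
    done
    # "which piPipes" only succeeds if BIN_DIR is on PATH
    case ":$PATH:" in
        *":$bin_dir:"*) ;;
        *) echo "note: $bin_dir is not on PATH" >&2 ;;
    esac
)

# Example, run from inside the piPipes directory:
# link_launchers "$PWD"
```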

Other software

piPipes ships most third-party tools pre-compiled in its bin directory, and they are found automatically when you run piPipes. To avoid mixing them with your own versions, we do not recommend adding /piPipes/bin to $PATH. However, some tools are hard to ship, so you will need to install them yourself if you have not already done so.

# 1. R
# Please follow instructions on http://www.r-project.org/ to install R
#! if successfully installed:
$ which Rscript
~/bin/Rscript
# ! Note that newer versions of R behave differently in read.table().
# Please use a version earlier than R 3.1.0.
# http://stackoverflow.com/questions/22962917/barplot-failure-in-r-3-1-0-read-csv-converting-what-should-be-numerics-to-facto/23225932#23225932
# Try to keep only one version of R in your system or $PATH
# Many of the "bugs" reported by our users were caused by multiple versions of R!

# FYI: during installation, piPipes will try to install the following packages.
# It helps to install them manually beforehand and confirm that they load.
## from CRAN
RColorBrewer
ggplot2
ggthemes
gplots
parallel
scales
reshape
gridExtra
gdata
RCircos
## from Bioconductor
cummeRbund


# 2. HTSeq-count
# Please follow instructions on http://www-huber.embl.de/users/anders/HTSeq/doc/install.html
# to install HTSeq-count
# or if you have pip set up
pip install HTSeq
#! if successfully installed:
$ which htseq-count
~/bin/htseq-count
# HTSeq-count is used in RNA-seq pipeline; if you are not planning to use RNA-seq pipeline, you
# might not need it

# 3. MACS2
# Please follow instructions on https://github.com/taoliu/MACS/blob/master/INSTALL.rst
# to install MACS2
# or if you have pip
pip install macs2 # please run "macs2 callpeak -h" to see if the option --outdir is included...; if not, install it from github

#! if successfully installed:
$ which macs2
~/bin/macs2
# MACS2 is used in ChIP-seq pipeline; if you are not planning to use ChIP-seq pipeline, you
# might not need it

# 4. Perl Module Statistics::Descriptive; install it through
cpan Statistics::Descriptive
#! if successfully installed:
$ perl -MStatistics::Descriptive -e "print \"Installed.\\n\";"
Installed.
# This module is only used in genome-seq pipeline; if you are not planning to use genome-seq 
# pipeline, you might not need it

# 5. GNU awk
# GNU awk is heavily used in piPipes, but some versions of awk lack the GNU extensions,
# for example, the variable ARGIND. To test yours:
$ echo 1 | awk '{print ARGIND}'
# if it prints nothing, your awk does not define the ARGIND variable, which will cause
# issues when you run piPipes
# the easiest way to install gawk is to use linuxbrew
# https://github.com/Homebrew/linuxbrew
# Please follow their instructions to install linuxbrew, then install gawk with:
$ brew install gawk
# then make a symbolic link in the bin directory of piPipes so that piPipes uses it as "awk"
$ ln -s $HOME/.linuxbrew/bin/gawk /path/to/piPipes/bin/awk
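The individual checks above can be gathered into a few small functions and run before the first install. This is only a sketch: `version_lt`, `missing_commands`, and `awk_has_argind` are hypothetical helpers, not part of piPipes, and the R one-liner in the usage comments assumes `Rscript` is on your `$PATH`.

```bash
# All three functions are hypothetical helpers, not part of piPipes.

# version_lt A B: succeed if dotted version A is strictly lower than B.
version_lt() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = split(a, x, "."); m = split(b, y, ".")
        for (i = 1; i <= (n > m ? n : m); i++) {
            if (x[i] + 0 < y[i] + 0) exit 0
            if (x[i] + 0 > y[i] + 0) exit 1
        }
        exit 1   # equal versions do not count as lower
    }'
}

# missing_commands NAME...: print each NAME that is not found on PATH.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# awk_has_argind [AWK]: succeed if AWK (default: awk) prints something for ARGIND.
awk_has_argind() {
    out=$(echo 1 | "${1:-awk}" '{ print ARGIND }' 2>/dev/null)
    [ -n "$out" ]
}

# Example use:
# missing_commands Rscript htseq-count macs2 perl
# r_version=$(Rscript -e 'cat(as.character(getRversion()))')
# version_lt "$r_version" 3.1.0 || echo "R $r_version is not earlier than 3.1.0" >&2
# awk_has_argind || echo "awk lacks ARGIND; install gawk" >&2
```

The version comparison is done field by field as numbers, so 3.0.10 is treated as newer than 3.0.9.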

To install genome

piPipes provides a uniform interface for different organisms/genomes. Due to GitHub's limit on the size of a single file, genome sequences and annotations are downloaded separately. You will need to run an installation step that downloads the files and prepares them for the pipelines to use.

To install a specific genome in one step:

piPipes install -g dm3    # fly genome dm3
piPipes install -g dm6    # fly genome new release, BDGP6
piPipes install -g mm9    # mouse genome mm9
piPipes install -g hg19   # human genome hg19

Many computing clusters only have internet access on the 'head node', which should be used only to submit jobs, not to run them. To separate the downloading and preparation steps:

# under the "head" node: with internet access but no computing power
piPipes install -g dm3 -D
# finish the work under a computing node
piPipes install -g dm3
# Some steps take advantage of multiple CPUs, so providing more than one CPU with `-c`
# accelerates the installation process.
piPipes install -g dm3 -c 8

Notes:

  • piPipes uses wget --continue so downloading will resume if the installation is disrupted. piPipes also only runs steps that haven't succeeded.

  • During the installation, the user will be prompted to define the length of siRNAs and piRNAs for the genome to be installed. Our lab uses 20–22 nt for fly/mouse siRNA, 23–29 nt for fly piRNA and 23–35 nt for mouse piRNA. For dm3, this information is stored in the common/dm3/variables file, and users can change the values manually later.

  • The installation of R packages is NOT safe to run concurrently, so please install genomes one at a time rather than in parallel.
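Because installs should not overlap, several genomes can be installed one after another in a simple loop. The sketch below uses only the `-g` and `-c` options shown above; `install_genomes` is a hypothetical wrapper, and the `PIPIPES` variable is an assumption introduced here so that the loop can be dry-run.

```bash
# install_genomes CPUS GENOME...: install genomes one at a time, never in parallel.
# Hypothetical wrapper; PIPIPES is an assumed override used here for dry runs.
install_genomes() {
    cpus=$1
    shift
    for genome in "$@"; do
        # stop at the first failure so a broken install is noticed early
        ${PIPIPES:-piPipes} install -g "$genome" -c "$cpus" || return 1
    done
}

# Dry run: print the commands instead of running them
# PIPIPES="echo piPipes" install_genomes 8 dm3 mm9
```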

Genome assemblies supported

Currently, Drosophila melanogaster and Mus musculus piRNAs are the best studied, and piPipes is optimized for those two species (assembly versions dm3 and mm9 from UCSC). For other organisms, some functions in the pipelines may not be performed because the piRNA cluster annotation is relatively immature. Please contact us if you would like to contribute to the annotations of organisms that are poorly supported by piPipes.

File organization

All the files for a specific genome are stored under /path/to/piPipes/common/. For example, fly files are stored under /path/to/piPipes/common/dm3. Most of them are in gzipped BED format.

dm3

piPipes downloads the annotation from iGenome, which lacks chrU and X-TAS. piPipes therefore downloads chrU.fa from UCSC and keeps X-TAS.fa in the GitHub repository.

For piRNA cluster annotation, piPipes uses the one from Brennecke, et al., Cell, 2007.

For transposons, piPipes uses two different annotations: the "transposon" sequences are from flyBase and the "repBase" sequences are from repBase. The transposon annotation has been used in the Zamore Lab since Li, et al., Cell, 2009. The repBase annotation separates the Long Terminal Repeat (LTR) of a retrotransposon from the internal part, so LTR-derived sequences do not become multi-mappers simply because a transposon sequence contains two LTRs.

BDGP6 (Berkeley Drosophila Genome Project Release 6)

piPipes has incorporated the new assembly of the fruit fly genome, release 6.

# To install the new release, type:
piPipes install -g dm6

Since it was only just released (July 2014), neither iGenome nor UCSC has incorporated it. We took most of the annotation files from flyBase. Several notes:

1. piRNA cluster

Using the converter tool provided by flyBase, we tried to convert the piRNA cluster coordinates to the new assembly. However, 46 clusters could not be found in the new assembly, mostly due to "maps to more than one scaffold".

We now keep only the 96 clusters that were mapped successfully, but we plan to use new data with higher depth, and possibly new algorithms, to annotate new clusters.

For more information, please read the file common/dm6/Brennecke.piRNAcluster.bed6.converted.failed

2. Repeat Masker

We ran RepeatMasker with the following parameters to identify transposon sites in BDGP6.

Note that with -species drosophila, RepeatMasker uses the transposon sequences from repBase instead of the sequences from flyBase.

# Using flyBase transposon sequences
RepeatMasker \
	-pa 24 \
	-s \
	-low \
	-lib dmel-all-transposon-r6.01.fasta \
	-gff dmel-all-chromosome-r6.01.fasta \
	1> flyBase.stdout \
	2> flyBase.stderr
# Using repBase
RepeatMasker \
	-pa 24 \
	-s \
	-low \
	-species drosophila \
	-gff dmel-all-chromosome-r6.01.fasta \
	1> repBase.stdout \
	2> repBase.stderr

3. GTF file

The gtf file obtained from flyBase, ftp://ftp.flybase.net/releases/FB2014_04/dmel_r6.01/gtf/dmel-all-r6.01.gtf.gz, cannot be processed correctly by gtfToGenePred from the kent tools because of the "trans-splicing" of mdg4.

invalid gffGroup detected on line: 3R	FlyBase	CDS	21375060	21375912	3.000000	-	0	gene_id "FBgn0002781"; transcript_id "FBtr0084081";
GFF/GTF group FBtr0084081 on 3R+, this line is on 3R-, all group members must be on same seq and strand
# the full set of trans-spliced transcripts is

FBtr0084079
FBtr0084080
FBtr0084081
FBtr0084082
FBtr0084083
FBtr0084084
FBtr0084085
FBtr0307759
FBtr0307760

We thus removed all the mdg4 annotations.

grep -v mdg4
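The command above filters on the gene name. An alternative sketch, which is not what piPipes itself does, drops exactly the transcripts listed above by their IDs. `drop_transcripts` and the file names in the usage comment are hypothetical.

```bash
# drop_transcripts ID_FILE: read GTF on stdin, write lines that mention none of the IDs.
# ID_FILE holds one transcript ID per line (for example FBtr0084079).
drop_transcripts() {
    # -F: fixed strings, -f: read patterns from a file, -v: keep non-matching lines
    grep -v -F -f "$1"
}

# Usage (file names are placeholders):
# gzip -dc dmel-all-r6.01.gtf.gz | drop_transcripts trans_spliced_ids.txt > filtered.gtf
```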

mm9

piPipes downloads the annotation from iGenome.

piPipes uses the piRNA cluster annotation from Li, et al., Mol Cell, 2013 and transposon annotation from repBase.

hg19

piPipes downloads the annotation from iGenome.

piPipes uses the piRNA cluster annotation from Rosenkranz, et al., BMC Bioinformatics, 2013 and transposon annotation from repBase.

other genomes with iGenome support

In order for piPipes to perform its full function on other genomes, the following steps should be completed:

1. Annotate piRNA clusters and provide the annotation in BED format. Please also provide the sequences in a file named ${GENOME}.piRNAcluster.fa.

Run proTRAC or piClust to produce piRNA cluster annotation.

	Rosenkranz D and Zischler H. 2012. proTRAC--a software for probabilistic piRNA cluster detection,
visualization and analysis. BMC Bioinformatics 13: 5.
	Jung, I., Park, J. C. & Kim, S. piClust: A density based piRNA clustering algorithm.
Comput Biol Chem (2014).

2. Get gene structure annotations from the UCSC table browser or through the MySQL interface. We have already included those files for many organisms in the common folder. If the folder already exists, there is no need to do this step.

other genomes without iGenome support

We provide an option, -C, to install genomes that are not currently supported by iGenome:

-C  Custom genome installation. The user will need to create a folder 
    $PIPELINE_DIRECTORY/common/GENOME and provide the following files:
	$PIPELINE_DIRECTORY/common/GENOME/GENOME.fa --> genome sequence in fasta format
	$PIPELINE_DIRECTORY/common/GENOME/GENOME.transposon.fa --> transposon sequence in fasta format
	$PIPELINE_DIRECTORY/common/GENOME/GENOME.piRNAcluster.bed --> piRNA cluster in bed format
	$PIPELINE_DIRECTORY/common/GENOME/GENOME.genes.gtf --> genes annotation in gtf format
	$PIPELINE_DIRECTORY/common/GENOME/GENOME.hairpin.fa --> miRNA hairpin sequence in fasta format
	$PIPELINE_DIRECTORY/common/GENOME/GENOME.mature.fa --> miRNA sequence in fasta format
  *Note that if you obtain hairpin and mature sequences from miRBase, you can extract the sequences 
corresponding to your genome using $PIPELINE_DIRECTORY/bin/piPipes_extract_organiam_from_fa.py:
	$PIPELINE_DIRECTORY/bin/piPipes_extract_organiam_from_fa.py hairpin.fa dme > \
		$PIPELINE_DIRECTORY/common/dm3/dm3.hairpin.fa
	$PIPELINE_DIRECTORY/bin/piPipes_extract_organiam_from_fa.py mature.fa  dme > \
		$PIPELINE_DIRECTORY/common/dm3/dm3.mature.fa
  Then run: 
  
  piPipes install -g GENOME -C
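Before running the custom install, it may be worth checking that all six input files are in place. A minimal sketch follows; `missing_custom_files` is a hypothetical helper, and the file names it checks are the ones listed for -C above.

```bash
# missing_custom_files COMMON_DIR GENOME: print each required file that is absent.
# Hypothetical helper; the file names are the ones listed for -C above.
missing_custom_files() {
    dir=$1/$2
    for suffix in fa transposon.fa piRNAcluster.bed genes.gtf hairpin.fa mature.fa; do
        [ -f "$dir/$2.$suffix" ] || echo "$dir/$2.$suffix"
    done
}

# Example: list anything still missing for a genome named GENOME
# missing_custom_files "$PIPELINE_DIRECTORY/common" GENOME
```

An empty result means every file named in the option description was found.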

Then please create the genomic_features file according to the instructions at the end of this document.

Currently, this step has been completed for the following genomes:

bosTau7 rn5 danRer7 TARI10 hg19 mm9 dm3


3. Edit the genomic_features file under the genome folder. See the next section.

4. The genome sequences should be provided in a file named $GENOME.fa. piPipes builds a bowtie index of the genome sequence for the small RNA pipeline, a STAR index for the RNA-seq and degradome pipelines, and a Bowtie2 index for the Genome-seq pipeline.

5. The rRNA sequence should be provided in a file named rRNA.fa. piPipes builds a bowtie index of the rRNA for small RNA and a bowtie2 index for normal RNA.

6. The transposon consensus sequences should be provided in a file named ${GENOME}.repBase.fa. piPipes builds a bowtie index of the repBase/transposon/piRNA cluster sequences for small RNA.

Basic piPipes directory structure

|-- piPipes/ # top directory
|   |-- piPipes # main bash script to run
|   |-- piPipes_debug # main bash script to run, debug mode
|   |-- bin/ # binary executables
|       |-- piPipes_smallRNA.sh # smallRNA seq pipeline, single sample mode
|       |-- piPipes_smallRNA2.sh # smallRNA seq pipeline, dual sample mode
|       |-- piPipes_RNASeq.sh # RNA-seq pipeline, single sample mode
|       |-- piPipes_RNASeq2.sh # RNA-seq pipeline, dual sample mode
|       |-- piPipes_DegradomeSeq.sh # Degradome-seq pipeline
|       |-- piPipes_ChIPSeq.sh # ChIP-seq pipeline, single sample mode
|       |-- piPipes_ChIPSeq2.sh # ChIP-seq pipeline, dual sample mode
|       |-- piPipes_GenomeSeq.sh # Genomic Seq pipeline
|       |-- ... # binaries like bowtie, STAR, cufflinks ...
|   |-- src/ # source code
|       |-- bed2_to_bedGraph.cpp # piPipes source code
|       |-- third_party/ # source code of other tools; use this if the precompiled ones don't work
|       |-- ...
|   |-- common/ # where annotations and sequences are stored
|       |-- mm9/
|       |-- dm3/
|           |-- dm3.fa # genome sequence
|           |-- genomic_features # very important configuration file, see below
|           |-- Brennecke.piRNAcluster.bed6.gz # one of the annotation files, in bed format
|           |-- BowtieIndex/
|           |-- ...
|       |-- dm6/
|       |-- hg19/
|       |-- genome_supported.txt # stores the names of the genomes that have been installed
|       |-- RepBase19.02.fasta.tar.gz # transposon consensus sequences from repBase
|       |-- reformat_repBase_for_eXpress.sh # eXpress only takes the first token of Fasta name...

common folder

piPipes downloads annotations from iGenome (UCSC version), which usually include the genomic sequence (fasta), rRNA (fasta), and transcriptome (gtf) used by piPipes. piPipes includes the repBase sequences (fasta) in the GitHub repository for dm3 and mm9. For other genomes, please retrieve the repBase sequences and save them as ${GENOME}.repBase.fa in the common/${GENOME} directory. For example, run:

# Enter the directory unarchived from RepBase19.02.fasta.tar.gz
$ cat humrep.ref humsub.ref > ../hg19/hg19.repBase.fa

genomic features

piPipes lists a set of genomic features (BED files) in the genomic_features file under each genome's directory. To add your own, place the BED files in the common/${GENOME} directory and add them to the TARGETS array in common/${GENOME}/genomic_features. Use the following example as a template:

# variables for small RNA pipeline intersecting
	MASK=$COMMON_FOLDER/UCSC.rRNA+tRNA+nonCoding.bed6.gz
	# tRNA, rRNA, nonCoding RNA (flyBase) from UCSC table browser
	piRNA_Cluster=$COMMON_FOLDER/Brennecke.piRNAcluster.bed6.gz
	# piRNA cluster defined in Brennecke, et al,. Cell, 2007; no strand information
	piRNA_Cluster_42AB=$COMMON_FOLDER/Brennecke.piRNAcluster.42AB.bed6.gz
	# 42AB
	piRNA_Cluster_20A=$COMMON_FOLDER/Brennecke.piRNAcluster.20A.bed6.gz
	# 20A
	piRNA_Cluster_flam=$COMMON_FOLDER/Brennecke.piRNAcluster.flam.bed6.gz
	# flam
	repeatMasker=$COMMON_FOLDER/UCSC.RepeatMask.bed
	# repeatMasker obtained from UCSC
	repeatMasker_IN_Cluster=$COMMON_FOLDER/UCSC.RepeatMask.inCluster.bed.gz
	# repeatMasker-identified regions that fall inside piRNA clusters
	repeatMasker_OUT_Cluster=$COMMON_FOLDER/UCSC.RepeatMask.outCluster.bed.gz
	# repeatMasker-identified regions that fall outside piRNA clusters
	Trn=$COMMON_FOLDER/Zamore.transposon.bed.gz
	# transposon region used in Li, et al., Cell, 2009. More conserved than repeat masker
	Trn_IN_Cluster=$COMMON_FOLDER/Zamore.transposon.inCluster.bed.gz
	# transposon region in cluster
	Trn_OUT_Cluster=$COMMON_FOLDER/Zamore.transposon.outCluster.bed.gz
	# transposon region out cluster
	Trn_GROUP0=$COMMON_FOLDER/Zamore.transposon.group0.bed.gz
	# transposons that failed to pass threshold in Li, et al., Cell, 2009.
	# More conserved than repeat masker
	Trn_GROUP1=$COMMON_FOLDER/Zamore.transposon.group1.bed.gz
	# group 1 transposon in Li, et al., Cell, 2009, mainly germline
	Trn_GROUP2=$COMMON_FOLDER/Zamore.transposon.group2.bed.gz
	# group 2 transposon in Li, et al., Cell, 2009
	Trn_GROUP3=$COMMON_FOLDER/Zamore.transposon.group3.bed.gz
	# group 3 transposon in Li, et al., Cell, 2009, mainly somatic
	flyBase_Gene=$COMMON_FOLDER/UCSC.flyBase.Genes.bed12.gz
	# flyBase gene
	flyBase_Exon=$COMMON_FOLDER/UCSC.flyBase.Exons.bed.gz
	# flyBase exons
	flyBase_Intron=$COMMON_FOLDER/UCSC.flyBase.Introns.bed.gz
	# flyBase introns
	flyBase_Intron_xRM=$COMMON_FOLDER/UCSC.flyBase.Introns_xRM.bed.gz  
	# flyBase introns with repeatMasker regions subtracted
	flyBase_5UTR=$COMMON_FOLDER/UCSC.flyBase.5UTR.bed.gz
	# flyBase 5' UTR
	flyBase_CDS=$COMMON_FOLDER/UCSC.flyBase.CDS.bed.gz
	# flyBase CDS
	flyBase_3UTR=$COMMON_FOLDER/UCSC.flyBase.3UTR.bed.gz
	# flyBase 3' UTR
	cisNATs=$COMMON_FOLDER/cisNATs.bed.gz
	# cis-NATs
	structural_loci=$COMMON_FOLDER/structured_loci.bed.gz
	# structural loci
	lincRNA=$COMMON_FOLDER/lincRNA.Young.bed6.gz
	# linc RNA identified in 'Identification and properties of 1,119 candidate lincRNA loci in the
	# Drosophila melanogaster genome. Genome Biol Evol. 2012;4(4):427-42.'
	unannotated=$COMMON_FOLDER/unannotated_genome.bed.gz
	# unannotated region, basically all the genome segments between annotations defined above

# TARGETS is used in small RNA-seq and degradome-seq pipeline
	declare -a TARGETS=( \
	"piRNA_Cluster" \
	"piRNA_Cluster_42AB" \
	"piRNA_Cluster_20A" \
	"piRNA_Cluster_flam" \
	"repeatMasker" \
	"repeatMasker_IN_Cluster" \
	"repeatMasker_OUT_Cluster" \
	"Trn" \
	"Trn_IN_Cluster" \
	"Trn_OUT_Cluster" \
	"Trn_GROUP1" \
	"Trn_GROUP2" \
	"Trn_GROUP3" \
	"Trn_GROUP0" \
	"flyBase_Gene" \
	"flyBase_Exon" \
	"flyBase_Intron" \
	"flyBase_Intron_xRM" \
	"flyBase_5UTR" \
	"flyBase_CDS" \
	"flyBase_3UTR" \
	"cisNATs" \
	"structural_loci" \
	"lincRNA" \
	"unannotated" )

# TARGETS_SHORT is used for "cis-Ping-Pong" analysis between degradome/small RNA.
# Since this step uses multi-threading itself, we are not able to run each feature simultaneously
# thus a few less important ones have been removed
	declare -a TARGETS_SHORT=( \
	"piRNA_Cluster" \
	"piRNA_Cluster_42AB" \
	"piRNA_Cluster_20A" \
	"piRNA_Cluster_flam" \
	"repeatMasker" \
	"Trn" \
	"Trn_GROUP1" \
	"Trn_GROUP2" \
	"Trn_GROUP3" \
	"Trn_GROUP0" \
	"flyBase_Gene" \
	"flyBase_Exon" \
	"flyBase_Intron_xRM" \
	"flyBase_5UTR" \
	"flyBase_3UTR" \
	"lincRNA" )

# The following variables are for the pie chart, which gives reads information for genomic
# features that are mostly exclusive to each other. Different from the genomic feature count
# using TARGETS, reads mappable to genomic features in TARGETS_EXCLUSIVE will be partitioned.
# For example, if a read overlaps with a region annotated as both piRNA_Cluster and Repeats, 
# piRNA_Cluster and Repeats will each get half of the reads.
# Please see small RNA-seq pipeline document for more information.
	FivePrimeUTR=$flyBase_5UTR
	ThreePrimeUTR=$flyBase_3UTR
	CDS=$flyBase_CDS
	Intron=$flyBase_Intron_xRM
	Repeats=$repeatMasker
	tRNA_NonCoding=$COMMON_FOLDER/UCSC.rRNA+tRNA+nonCoding.bed6.gz
	declare -a TARGETS_EXCLUSIVE=(\
	"piRNA_Cluster" \
	"CDS" \
	"FivePrimeUTR" \
	"ThreePrimeUTR" \
	"Intron" \
	"Repeats" \
	"tRNA_NonCoding" \
	)
	
# variables for small RNA direct mapping
	declare -a DIRECT_MAPPING=( "transposon" "repBase" "piRNAcluster" )

# gtf files for rnaseq/deg/cage htseq-count
	Genes_transposon_Cluster=$COMMON_FOLDER/dm3.genes+transposon+piRNACluster.gtf
	Genes_repBase_Cluster=$COMMON_FOLDER/dm3.genes+repBase+piRNACluster.gtf
	declare -a HTSEQ_TARGETS=( "Genes_transposon_Cluster" "Genes_repBase_Cluster" )

For example:

	# put the bed files under the common/xxx folder
	#MASK is used to mask regions
 	MASK=$COMMON_FOLDER/region_I_want_to_mask.bed
	# some regions of interest
	piRNACluster=$COMMON_FOLDER/piRNAcluster.bed
	myGene=$COMMON_FOLDER/myGene.bed
	regionOfInterest=$COMMON_FOLDER/region1.bed
	# put them in an array in this way
	declare -a TARGETS=( \
	"piRNACluster" \
	"myGene" \
	"regionOfInterest" \
	)

# The following variables are for the pie chart, which gives reads information for genomic
# features that are mostly exclusive to each other. Different from the genomic feature count
# using TARGETS, reads mappable to genomic features in TARGETS_EXCLUSIVE will be partitioned.
# For example, if a read overlaps with a region annotated as both piRNA_Cluster and Repeats, 
# piRNA_Cluster and Repeats will each get half of the reads.
# Please see small RNA-seq pipeline document for more information.
	declare -a TARGETS_EXCLUSIVE=(\
	"piRNACluster" \
	"myGene" \
	"regionOfInterest" \
	)