installation

bowhan edited this page Jul 29, 2014 · 31 revisions

piPipes: Genome installation pipeline

piPipes provides a uniform interface for different organisms/genomes. Because GitHub limits the size of individual files, the genome sequence and most annotation files are not shipped with the repository and have to be downloaded separately. The user therefore needs to run an installation step that downloads these files and prepares them for the other pipelines.

To obtain piPipes

To clone the repository from GitHub

git clone git@github.com:bowhan/piPipes.git

To make symbolic links

# enter the piPipes directory first; absolute targets keep the links valid when followed from $HOME/bin
ln -s "$PWD/piPipes" "$HOME/bin/piPipes"
ln -s "$PWD/piPipes_debug" "$HOME/bin/piPipes_debug"
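
The links are only useful if $HOME/bin exists and is searched by the shell. A minimal sketch, assuming a bash-like shell and $HOME/bin as the link directory (as in the commands above); run it before creating the links:

```shell
# Create the link directory if it does not exist yet.
mkdir -p "$HOME/bin"

# Prepend $HOME/bin to PATH for this session unless it is already listed.
case ":$PATH:" in
    *":$HOME/bin:"*) ;;                      # already on PATH, nothing to do
    *) export PATH="$HOME/bin:$PATH" ;;
esac

# Prints the location of piPipes once the link exists, otherwise a hint.
command -v piPipes || echo "piPipes is not on PATH yet"
```

To make the change permanent, the export line can be added to a shell start-up file such as ~/.bashrc.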

piPipes has most of the third-party tools pre-compiled and included in the bin directory.

However, there are some tools the user will need to install.

# R
# Please follow instructions on http://www.r-project.org/ to install R
# if successfully installed:
$ which Rscript
/share/pkg/R/3.0.2/bin/Rscript
# ! Note: newer versions of R behave differently in read.table(). Please use a version earlier than R 3.1.0.
# http://stackoverflow.com/questions/22962917/barplot-failure-in-r-3-1-0-read-csv-converting-what-should-be-numerics-to-facto/23225932#23225932


# HTSeq-count
# Please follow instructions on http://www-huber.embl.de/users/anders/HTSeq/doc/install.html to install HTSeq-count
# if successfully installed:
$ which htseq-count
~/bin/htseq-count


# MACS2
# Please follow instructions on https://github.com/taoliu/MACS/blob/master/INSTALL.rst to install MACS2
# if successfully installed:
$ which macs2
~/bin/macs2


# Perl Module Statistics::Descriptive; install it through
cpan Statistics::Descriptive
# if successfully installed:
$ perl -MStatistics::Descriptive -e "print \"Installed.\\n\";"
Installed.
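
The individual checks above can be combined into one pass. The following is a sketch rather than part of piPipes: the helper names `version_lt` and `check_tool` are made up for illustration, and `sort -V` is assumed to be available (GNU coreutils and recent BSD/macOS provide it).

```shell
#!/usr/bin/env bash
# Succeeds if version $1 is strictly lower than version $2.
# Assumption: sort supports -V (version sort).
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Print where a tool was found, or report it as missing.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found:   $1 -> $(command -v "$1")"
    else
        echo "missing: $1"
        return 1
    fi
}

missing=0
for tool in Rscript htseq-count macs2; do
    check_tool "$tool" || missing=$((missing + 1))
done

# Same Perl module test as above, without the printed message.
if perl -MStatistics::Descriptive -e 1 2>/dev/null; then
    echo "found:   Statistics::Descriptive"
else
    echo "missing: Statistics::Descriptive"
    missing=$((missing + 1))
fi

# Warn when the installed R is 3.1.0 or newer.
if command -v Rscript >/dev/null 2>&1; then
    r_version=$(Rscript -e 'cat(as.character(getRversion()))')
    if ! version_lt "$r_version" 3.1.0; then
        echo "warning: R $r_version found; a version before 3.1.0 is recommended"
    fi
fi

echo "$missing dependency check(s) failed"
```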

To run the installation

To install the genome in one step

piPipes install -g dm3		# fly genome dm3
piPipes install -g mm9		# mouse genome mm9
piPipes install -g hg19		# human genome

However, some computing clusters only have internet access on the 'head node', which should only be used to submit jobs. To separate the download and preparation steps:

## on the "head" node: download only
piPipes install -g dm3 -D
# on a compute node: finish the preparation
piPipes install -g dm3

# Some steps can use multiple CPUs, so requesting more than one CPU with `-c` may speed up the installation.
piPipes install -g dm3 -c 8
  • piPipes uses wget --continue, so an interrupted download resumes where it stopped when the installation is run again.
  • During the installation, the user is prompted to define the lengths of siRNAs and piRNAs. Our lab uses 20–22 nt for fly/mouse siRNA, 23–29 nt for fly piRNA and 23–35 nt for mouse piRNA. These values are stored in common/dm3/variables (for dm3) and can be changed manually.
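
When several genomes are needed, the two-step pattern can be wrapped in small functions. This is only a sketch: `run`, `download_genomes`, `prepare_genomes` and the `DRY_RUN` switch are hypothetical names, and only the `-g`, `-D` and `-c` options shown above are used. How the second function is submitted to a compute node depends on the cluster's scheduler and is left out; because the installation prompts for the siRNA/piRNA lengths, it may still need an interactive session.

```shell
#!/usr/bin/env bash
# With DRY_RUN=1 the command is printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Head node: download only (-D), one genome after another.
download_genomes() {
    local genome
    for genome in "$@"; do
        run piPipes install -g "$genome" -D
    done
}

# Compute node: finish the preparation with the given number of CPUs (-c).
prepare_genomes() {
    local cpus=$1
    shift
    local genome
    for genome in "$@"; do
        run piPipes install -g "$genome" -c "$cpus"
    done
}

# Preview the commands without running them.
DRY_RUN=1 download_genomes dm3 mm9
DRY_RUN=1 prepare_genomes 8 dm3 mm9
```

Dropping `DRY_RUN=1` executes the same commands for real.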

Genome Assembly Supported

Currently, piRNAs are best studied in Drosophila melanogaster and Mus musculus, and piPipes is optimized for those two species (assembly versions dm3 and mm9). For other organisms, some functions may not be available, either because the piRNA cluster annotation is still immature or because of the authors' limited knowledge. We would welcome cooperation with experts to make piPipes more generic in terms of the organisms it supports.

File organization

All the files for a specific genome are stored under /path/to/piPipes/common/. For example, fly files are stored under /path/to/piPipes/common/dm3. Some annotation files that are small enough to fit on GitHub already come with piPipes. Most of them are in gzipped BED format.
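
To get a feel for the bundled annotation files, one can list them and look at their first records. A small sketch, assuming the repository was cloned to the location held in `PIPIPES_DIR` (a placeholder variable, not something piPipes defines); `peek_bed` is a helper introduced here for illustration:

```shell
#!/usr/bin/env bash
# PIPIPES_DIR is a placeholder; point it at the cloned repository.
PIPIPES_DIR=${PIPIPES_DIR:-/path/to/piPipes}

# Print the first N lines (default 5) of a gzipped BED file.
peek_bed() {
    gzip -dc "$1" | head -n "${2:-5}"
}

# Show the first three records of every gzipped BED file found for dm3.
for bed in "$PIPIPES_DIR"/common/dm3/*.bed*.gz; do
    [ -e "$bed" ] || continue    # the glob matched nothing
    echo "== $bed"
    peek_bed "$bed" 3
done
```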

dm3

piPipes downloads the annotation from iGenome, which lacks chrU and X-TAS. piPipes therefore downloads chrU.fa from UCSC, while X-TAS.fa is small enough to be included in the GitHub repository. piPipes also ships with a few annotation files in BED and GTF format, mostly from flyBase.

For piRNA cluster annotation, piPipes uses the annotation from [Brennecke, et al., Cell, 2007](http://www.cell.com/fulltext/S0092-8674(07)00257-7).

For transposons, piPipes uses two annotations: one from flyBase (called transposon) and one from repBase (called repBase). The transposon annotation has been used in the Zamore Lab since Li, et al., Cell, 2009.

mm9

piPipes downloads the annotation from iGenome.

piPipes uses the piRNA cluster annotation from Li, et al., Mol Cell, 2013 and transposon annotation from repBase.

hg19

piPipes downloads the annotation from iGenome.

piPipes uses the piRNA cluster annotation from Rosenkranz, et al., BMC Bioinformatics, 2013 and transposon annotation from repBase.

other genomes

For piPipes to perform its full function on other genomes, the following steps should be completed:

  1. Obtain a piRNA cluster annotation.

Run proTRAC or piClust to get the piRNA cluster annotation in BED format.

Rosenkranz D and Zischler H. 2012. proTRAC--a software for probabilistic piRNA cluster detection, visualization and analysis. BMC Bioinformatics 13: 5.

Jung, I., Park, J. C. & Kim, S. piClust: A density based piRNA clustering algorithm. Comput Biol Chem (2014).

  2. Get gene structure annotations from the UCSC Table Browser or through the MySQL interface. We have already included those files for many organisms in the common folder.

  3. Edit the genomic_features file under the genome folder (such as dm3 or mm9). See the genomic_features example below.

  4. piPipes builds a bowtie index of the genome sequence for small RNA, a STAR index for long RNA and a Bowtie2 index for DNA. The genome sequence is named $GENOME.fa.

  5. piPipes builds a bowtie index of the rRNA for small RNA and a Bowtie2 index for normal RNA. The rRNA sequence is named rRNA.fa.

  6. piPipes builds bowtie indexes of the repBase, transposon and piRNA cluster sequences for small RNA. They are named ${GENOME}.repBase.fa, ${GENOME}.transposon.fa and ${GENOME}.piRNAcluster.fa, respectively.

  7. piPipes builds a bowtie index of the transcriptome, repBase and piRNA cluster sequences for small RNA and a Bowtie2 index for long RNA, and uses eXpress to quantify them with SAM/BAM as the input.
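
Before preparing a new genome, it can help to confirm that the sequence files named in the steps above are in place. The check below is a hypothetical pre-flight sketch, not a piPipes command: `check_genome_inputs` and `GENOME_DIR` are invented names, and it assumes all FASTA files were collected in a single directory, which this page does not state for every file. A transposon file may legitimately be absent for genomes that only have a repBase annotation.

```shell
#!/usr/bin/env bash
# Report which of the expected FASTA inputs exist and are non-empty.
# Usage: check_genome_inputs GENOME DIR
check_genome_inputs() {
    local genome=$1 dir=$2 missing=0 f
    for f in "$genome.fa" rRNA.fa \
             "$genome.repBase.fa" "$genome.transposon.fa" "$genome.piRNAcluster.fa"; do
        if [ -s "$dir/$f" ]; then
            echo "ok       $f"
        else
            echo "missing  $f"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]    # succeed only when nothing is missing
}

# GENOME_DIR is a placeholder for wherever the FASTA files were collected.
GENOME_DIR=${GENOME_DIR:-/path/to/fasta/files}
check_genome_inputs dm3 "$GENOME_DIR" || echo "some inputs are missing or empty"
```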

common folder

piPipes downloads annotations from iGenome (UCSC version), which usually include the genomic sequence (FASTA), rRNA (FASTA) and transcriptome (GTF) used by piPipes. For dm3 and mm9, the repBase sequences (FASTA) are included in the GitHub repository. For other genomes, please retrieve the repBase sequences and save them as ${GENOME}.repBase.fa in the common/${GENOME} directory. For example, run

# enter the directory unarchived from RepBase19.02.fasta.tar.gz
$ cat humrep.ref humsub.ref > ../hg19/hg19.repBase.fa
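
A quick sanity check after the concatenation is to count the FASTA records in the result. A minimal sketch; `count_fasta_records` and the `REPBASE_FA` variable are illustrative names, and the default path simply mirrors the command above:

```shell
# Count the records (header lines starting with ">") in a FASTA file.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Placeholder path; adjust it to the file written by the cat command above.
repbase=${REPBASE_FA:-../hg19/hg19.repBase.fa}
if [ -s "$repbase" ]; then
    echo "$(count_fasta_records "$repbase") sequences in $repbase"
else
    echo "$repbase is missing or empty"
fi
```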

For genomes like dm3, there are transposon annotations from both repBase and flyBase; we call them repBase and transposon, respectively. piPipes also includes a number of genomic features (BED). Please place your own feature files in the common/${GENOME} directory as well and add them to the TARGETS array in common/${GENOME}/genomic_features, following the example below:

# variables for small RNA pipeline intersecting
	MASK=$COMMON_FOLDER/UCSC.rRNA+tRNA+nonCoding.bed6.gz
	# tRNA, rRNA, nonCoding RNA (flyBase) from UCSC table browser
	piRNA_Cluster=$COMMON_FOLDER/Brennecke.piRNAcluster.bed6.gz
	# piRNA cluster defined in Brennecke, et al., Cell, 2007; no strand information
	piRNA_Cluster_42AB=$COMMON_FOLDER/Brennecke.piRNAcluster.42AB.bed6.gz
	# 42AB
	piRNA_Cluster_20A=$COMMON_FOLDER/Brennecke.piRNAcluster.20A.bed6.gz
	# 20A
	piRNA_Cluster_flam=$COMMON_FOLDER/Brennecke.piRNAcluster.flam.bed6.gz
	# flam
	repeatMasker=$COMMON_FOLDER/UCSC.RepeatMask.bed
	# repeatMasker obtained from UCSC
	repeatMasker_IN_Cluster=$COMMON_FOLDER/UCSC.RepeatMask.inCluster.bed.gz
	# repeat masker identified region that fall into piRNA cluster
	repeatMasker_OUT_Cluster=$COMMON_FOLDER/UCSC.RepeatMask.outCluster.bed.gz
	# repeat masker identified region that fall outside piRNA cluster
	Trn=$COMMON_FOLDER/Zamore.transposon.bed.gz
	# transposon region used in Li, et al., Cell, 2009. More conserved than repeat masker
	Trn_IN_Cluster=$COMMON_FOLDER/Zamore.transposon.inCluster.bed.gz
	# transposon region in cluster
	Trn_OUT_Cluster=$COMMON_FOLDER/Zamore.transposon.outCluster.bed.gz
	# transposon region out cluster
	Trn_GROUP0=$COMMON_FOLDER/Zamore.transposon.group0.bed.gz
	# transposons that failed to pass threshold in Li, et al., Cell, 2009. More conserved than repeat masker
	Trn_GROUP1=$COMMON_FOLDER/Zamore.transposon.group1.bed.gz
	# group 1 transposon in Li, et al., Cell, 2009, mainly germline
	Trn_GROUP2=$COMMON_FOLDER/Zamore.transposon.group2.bed.gz
	# group 2 transposon in Li, et al., Cell, 2009
	Trn_GROUP3=$COMMON_FOLDER/Zamore.transposon.group3.bed.gz
	# group 3 transposon in Li, et al., Cell, 2009, mainly somatic
	flyBase_Gene=$COMMON_FOLDER/UCSC.flyBase.Genes.bed12.gz
	# flyBase gene
	flyBase_Exon=$COMMON_FOLDER/UCSC.flyBase.Exons.bed.gz
	# flyBase exons
	flyBase_Intron=$COMMON_FOLDER/UCSC.flyBase.Introns.bed.gz
	# flyBase introns
	flyBase_Intron_xRM=$COMMON_FOLDER/UCSC.flyBase.Introns_xRM.bed.gz  
	# flyBase introns that subtract repeatMasker
	flyBase_5UTR=$COMMON_FOLDER/UCSC.flyBase.5UTR.bed.gz
	# flyBase 5' UTR
	flyBase_CDS=$COMMON_FOLDER/UCSC.flyBase.CDS.bed.gz
	# flyBase CDS
	flyBase_3UTR=$COMMON_FOLDER/UCSC.flyBase.3UTR.bed.gz
	# flyBase 3' UTR
	cisNATs=$COMMON_FOLDER/cisNATs.bed.gz
	# cis-NATs
	structural_loci=$COMMON_FOLDER/structured_loci.bed.gz
	# structural loci
	lincRNA=$COMMON_FOLDER/lincRNA.Young.bed6.gz
	# linc RNA identified in 'Identification and properties of 1,119 candidate lincRNA loci in the Drosophila melanogaster genome. Genome Biol Evol. 2012;4(4):427-42.'
	unannotated=$COMMON_FOLDER/unannotated_genome.bed.gz
	# unannotated region, basically all the genome segments between annotations defined above
	declare -a TARGETS=( \
	"piRNA_Cluster" \
	"piRNA_Cluster_42AB" \
	"piRNA_Cluster_20A" \
	"piRNA_Cluster_flam" \
	"repeatMasker" \
	"repeatMasker_IN_Cluster" \
	"repeatMasker_OUT_Cluster" \
	"Trn" \
	"Trn_IN_Cluster" \
	"Trn_OUT_Cluster" \
	"Trn_GROUP1" \
	"Trn_GROUP2" \
	"Trn_GROUP3" \
	"Trn_GROUP0" \
	"flyBase_Gene" \
	"flyBase_Exon" \
	"flyBase_Intron" \
	"flyBase_Intron_xRM" \
	"flyBase_5UTR" \
	"flyBase_CDS" \
	"flyBase_3UTR" \
	"cisNATs" \
	"structural_loci" \
	"lincRNA" \
	"unannotated" )

# variables for small RNA direct mapping
	declare -a DIRECT_MAPPING=( "transposon" "repBase" "piRNAcluster" )

# gtf files for rnaseq/deg/cage htseq-count
	Genes_transposon_Cluster=$COMMON_FOLDER/dm3.genes+transposon+piRNACluster.gtf
	Genes_repBase_Cluster=$COMMON_FOLDER/dm3.genes+repBase+piRNACluster.gtf
	declare -a HTSEQ_TARGETS=( "Genes_transposon_Cluster" "Genes_repBase_Cluster" )
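
After editing genomic_features, it is easy to list a name in TARGETS without defining the matching variable, or to point a variable at a file that does not exist. The sketch below checks for both. It assumes that genomic_features is a bash fragment that can be sourced once COMMON_FOLDER is set, which the example above suggests but this page does not state; `validate_targets` and `my_feature` are hypothetical names.

```shell
#!/usr/bin/env bash
# Check that every name in TARGETS is a variable pointing to an existing file.
validate_targets() {
    local name path status=0
    for name in "${TARGETS[@]}"; do
        path=${!name:-}    # indirect expansion: value of the variable named $name
        if [ -z "$path" ]; then
            echo "unset:   $name"
            status=1
        elif [ ! -e "$path" ]; then
            echo "missing: $name -> $path"
            status=1
        fi
    done
    return "$status"
}

# Assumed location; genomic_features refers to $COMMON_FOLDER, so set it first.
COMMON_FOLDER=${COMMON_FOLDER:-/path/to/piPipes/common/dm3}
if [ -r "$COMMON_FOLDER/genomic_features" ]; then
    source "$COMMON_FOLDER/genomic_features"
    # A new feature is a variable plus an entry in TARGETS (my_feature is hypothetical).
    my_feature=$COMMON_FOLDER/my_feature.bed.gz
    TARGETS+=( "my_feature" )
    validate_targets || echo "fix the entries listed above"
else
    echo "no genomic_features file under $COMMON_FOLDER"
fi
```

The same pattern should carry over to HTSEQ_TARGETS, since it also lists variable names rather than paths.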