Skip to content

epifluidlab/DoGFinder

 
 

Repository files navigation

DoGFinder

DoGFinder software is a package that identifies DoGs and quantifies their expression levels, from RNA-seq data.

AUTHOR/SUPPORT

Yuval wiesel, Shalgi lab technion

Version

DoGFinder-1.0.1

Changes done in current version (3/11/20): Adding -S option in Get_DoGs and Get_DoGs_rpkm, Adding -g option in Get_DoGs_rpkm, Adjusting the function PE modification in init.py.

Getting Started

Prerequisites

  1. Bedtools: version 2.20.0 - 2.26.0
  2. samtools
  3. python 2 with setuptools

Note: DoGFinder uses RSeQC, hence it must be run in python 2

You should be able to run bedtools and samtools commands from command prompt

Installing

  1. Download DoGFinder-1.0.0 from GitHub
  2. Extract the files and enter the DoGFinder folder
  3. Run the command:
sudo python setup.py install

Note: If you don’t have root privilege you can run the following command:

python setup.py install --home=<dir>

where you can supply any directory you like for the --home option. Before you run DoGFinder you should export your path by export PATH.

  1. To check installation run:
python setup.py test

No errors should be reported

MANUAL

DoGFinder Input files:

  1. Mapped reads .bam file, Should be sorted and index using samtools

  2. GTF annotation file format, could be downloaded from the UCSC website: https://genome.ucsc.edu/

DoGFinder functions and usage:

Step 1: Get_loci_annotation

Construct a new annotation bed file from one or more GTF files. Overlapping genes on the same strand are merged to one locus. If more than one input annotation is supplied, gene boundaries are set to be the most inclusive possible.

Command example:
Get_loci_annotation -out /annotationdir -gtf ucsc_gtf_file.gtf,refseq_gtf_file.gtf
Inputs:

-gtf: comma separated GTF files

-Suffix: OPTIONAL. add suffix to output file: loci_annotation_suffix.bed

Output:

Annotation file: loci_annotation.bed, to be used as input for Get_DoGs

Note: we provided ready to use annotation files for human and mouse, see annotation below.

Step 2: Pre_Process

Preprocessing and downsampling of RNA-seq bam files. (Run this command before Get_DoGs)

The use of pre-processed bam files is required, especially in the case of paired end reads

Note: The downsampling is performed using a pseudorandom seed. To reproduce the same downsampled bam files run Pre_process again keeping the order of the input bam files constant.

Command example:
Pre_Process -Q 3 -bam rawdata/bam_control1.bam,rawdata/bam_control2.bam,rawdata/bam_HS1.bam -ref loci_annotation.bed
Inputs:

-ref Reference .bed gene/transcript annotation file for strandness check (can be loci_annotation.bed file)

-bam comma separated and sorted bam files

-Q number of cores to be used, should be the same as the number of input bam files

Output:

Processed read files, to be used as input for Get_DoGs: rawdata/bam_control1.sorted_DS.bam,rawdata/ bam_control2.sorted_DS.bam,... The output raw and downsampled files will be located at the same folder as the original bam files Note: if you have paired-end bam file, the out file will have PE in his name

Note: This step may take long time

Step 3: Get_DoGs

Finds DoG annotations, generates a .bed file

Command example:
Get_DoGs -out /outdir -bam my_sorted_bam.sorted_DS.bam -a loci_annotation.bed [ -s -minDoGLen 4000 -minDoGCov 0.6 -w 200 -mode F ]
Inputs:

-bam: bam file generated by Pre_Process function, run each bam file individually

-s: Flag is true if the library used is strand specific on the forward strand and Single-End, or if the library is strand specific (either forward or reverse) and Paired-End. Default is False.

-S: Single-End library only. Flag is true if the library used is strand specific on the reverse strand and Single-End. Default is False.

-minDoGLen: OPTIONAL. Minimal DoG length,the minimal number of bases downstream of gene end with a coverage of minDoGCov. Default value is 4000 bases.

-minDoGCov: OPTIONAL. Minimal DoG coverage, the minimal fraction of DoG region that is covered by reads. In the running example it requires coverage of 0.6, and will filter out a DoG or mark its end coordinate when the coverage drops below 60%. Default value is 0.6.

-w: OPTIONAL. Moving window size, the number of bases examined in each step of DoG extension. The moving window is 50% overlapping the previously-defined DoG region resulting in extension of w/2 bases in each step. Default is 200 bases.

-Suffix: OPTIONAL. add suffix to output file: Final_Dog_annotation_suffix.bed

-mode: OPTIONAL. F or W: F denotes that minDoGCov is computed over the entire DoG length. W denotes that minDoGCov is computed over the last moving window of the DoG, if you choose this mode you should also specify -wc for coverage cutoff over the window. Default is F.

Output:

DoG annotation file at /outdir/Final_Dog_annotation.bed, you can use this file as input for Get_DoGs_rpkm

Notes:
  • During the first run, this script removes intergenic reads by running bedtools intersect -v and saves a new file: my_sorted_bam_no_gene_sorted.bed in the same folder as my_sorted_bam.bam

  • In the subsequent runs, the script will search for my_sorted_bam_no_gene_sorted.bed in the same folder as my_sorted_bam.bam folder and if it is there the script will use this file and will not run intersect bed again.

  • DoG length is restricted to the distance to the proximal downstream gene, with consideration of strand information according to strandness / unstrandedness of the input data.

Step 4.1 [Optional]: Union_DoGs_annotation

If several DoG annotation files are generated, (e.g. for different replicates and/or treatments), one can use the Union_DoGs_annotation function to create a single DoG annotation bed file (e.g long DoGs). The union includes any DoG that appears in one of more input files, and DoGs in more than one file will be unified according to their maximal length.

Command example:
Union_DoGs_annotation -dog Final_Dog_annotation1.bed,Final_Dog_annotation2.bed  -out /outdir 
Inputs:

-dog: Comma separated list of all DoG annotation files

-out: output Dir.

Output:

/outdir/union_dog_annotation.bed

Step 4.2 [Optional]: Common_DoGs_annotation

If users have several replicates and want to find the DoGs that are common to all replicates, the Common_DoGs_annotation function generates a single DoG annotation bed file from the intersection of all input DoGs annotation files.

Command example:
Common_DoGs_annotation -comm Dog_annotation_replicate1.bed,Dog_annotation_replicate2.bed  -out /outdir 
Inputs:

-comm: Comma separated list of replicates DoGs annotation

-out: output Dir.

Output:

/outdir/common_dog_annotation.bed

Step 5: Get_DoGs_rpkm

Calculate DoGs rpkm, returns a .csv table

Command example:
Get_DoGs_rpkm -out /outdir -bam my_sorted_bam_DS.bam -s -dog Final_Dog_annotation.bed
Inputs:

-bam: bam file generated by the Pre_Process function, run each bam file separately.

-dog: the Final_Dog_annotation.bed file or union_dog_annotation.bed

-s: OPTIONAL. Flag is true if the library used is strand specific on the forward strand and Single-End, or if the library is strand specific (either forward or reverse) and Paired-End. Default is False.

-S: OPTIONAL. Flag is true if the library used is strand specific on the reverse strand and Single-End. Default is False.

-g: OPTIONAL. Specify a genome file that defines the expected chromosome order and lengths. File example was added to the annotations folder.

Output:

.csv table of DoGs and their RPKM, located at: /outdir/DoGs_rpkm_table.csv

Annotations:

Under the annotation/ folder of the DoGFinder package you can find ready to use mouse (mm9) and human (hg19) loci_annotation.bed files. These files were generated by running Get_loci_annotation on three tables downloaded from the UCSC database: ucsc genes (knownGene), RefSeq genes (refGene) and Ensembl genes (ensGene). When using these files please make sure that the genome version is compatible with the one your bam files were mapped to.

Example

The folder /tests/data contains the following GTF files: test_ens.gtf,test_refseq.gtf,test_ucsc.gtf

as well as 3 paired-end RNA-seq bam files: UN.bam,KCL.bam,HS.bam

Step 1: Create loci annotation file:

Create loci annotation file from 3 different GTF files

Get_loci_annotation -out $(pwd) -gtf test_ens.gtf,test_refseq.gtf,test_ucsc.gtf

You will get the loci_annotation.bed file with Hspa8 annotation constructed from all GTF files (which is the union of several different transcripts in the GTF files) :

chr9	40609067	40613283	ENSMUST00000015800&ENSMUST00000082857&ENSMUST00000083424&ENSMUST00000101927&ENSMUST00000117557&ENSMUST00000117870&ENSMUST00000127699&ENSMUST00000133964&ENSMUST00000138895&ENSMUST00000140984&ENSMUST00000149936&ENSMUST00000153847&NM_031165&uc009ozx.2&uc009ozy.2&uc009ozz.1	0	+

Step 2: Pre_Process on bam files:

Preprocessing and downsampling relatively to HS.bam,UN.bam,KCL.bam

Pre_Process -bam HS.bam,UN.bam,KCL.bam -Q 3 -ref loci_annotation.bed

Because the bam files are paired-end, for every input bam file you will get two output files:

For HS.bam:

HS.sorted_PE.bam : This is the modified paired-end bam file

HS.sorted_PE.sorted_DS.bam : This is the downsampled bam file, the one you should use as input for Get_DoGs

Step 3: Get_DoGs for every bam file

Find DoGs annotation for every bam file

Get_DoGs -out $(pwd) -bam UN.sorted_PE.sorted_DS.bam -suff UN -a loci_annotation.bed -s
Get_DoGs -out $(pwd) -bam HS.sorted_PE.sorted_DS.bam -suff HS -a loci_annotation.bed -s
Get_DoGs -out $(pwd) -bam KCL.sorted_PE.sorted_DS.bam -suff KCL -a loci_annotation.bed -s

You will get 3 DoG annotation files: Final_Dog_annotation_UN.bed,Final_Dog_annotation_HS.bed,Final_Dog_annotation_KCL.bed

For example in UN:

chr9	40613283	40620883	ENSMUST00000015800&ENSMUST00000082857&ENSMUST00000083424&ENSMUST00000101927&ENSMUST00000117557&ENSMUST00000117870&ENSMUST00000127699&ENSMUST00000133964&ENSMUST00000138895&ENSMUST00000140984&ENSMUST00000149936&ENSMUST00000153847&NM_031165&uc009ozx.2&uc009ozy.2&uc009ozz.1	0	+

Step 4: Union_DoGs_annotation

To get one DoG annotation from all 3 annotations:

Union_DoGs_annotation -dog Final_Dog_annotation_UN.bed,Final_Dog_annotation_KCL.bed,Final_Dog_annotation_HS.bed -out $(pwd)

You will get one union annotation file, union_dog_annotation.bed:

chr9	40613283	40679683	ENSMUST00000015800&ENSMUST00000082857&ENSMUST00000083424&ENSMUST00000101927&ENSMUST00000117557&ENSMUST00000117870&ENSMUST00000127699&ENSMUST00000133964&ENSMUST00000138895&ENSMUST00000140984&ENSMUST00000149936&ENSMUST00000153847&NM_031165&uc009ozx.2&uc009ozy.2&uc009ozz.1	0	+

Step 5: Get_DoGs_rpkm

Calculate DoGs rpkm for every bam file (you can use also the Final_Dog_annotation files):

Get_DoGs_rpkm -out $(pwd) -bam UN.sorted_PE.sorted_DS.bam -s -dog union_dog_annotation.bed -suff UN -g genome_file.txt
Get_DoGs_rpkm -out $(pwd) -bam HS.sorted_PE.sorted_DS.bam -s -dog union_dog_annotation.bed -suff HS -g genome_file.txt
Get_DoGs_rpkm -out $(pwd) -bam KCL.sorted_PE.sorted_DS.bam -s -dog union_dog_annotation.bed -suff KCL -g genome_file.txt

You will get 3 DoG RPKM tables: DoGs_rpkm_table_UN.csv,DoGs_rpkm_table_HS.csv,DoGs_rpkm_table_KCL.csv

In UN, for example:

DoG_name chromosome start end DoG_length strand DoG_rpkm
ENSMUST00000015800&ENSMUST00000082857&ENSMUST00000083424&ENSMUST00000101927&ENSMUST00000117557&ENSMUST00000117870&ENSMUST00000127699&ENSMUST00000133964&ENSMUST00000138895&ENSMUST00000140984&ENSMUST00000149936&ENSMUST00000153847&NM_031165&uc009ozx.2&uc009ozy.2&uc009ozz.1 chr9 40613283 40679683,66400 + 7249.06668929

These csv (tab-delimited) files can be further processed or opened in excel, MATLAB , etc.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 100.0%