DoGFinder software is a package that identifies DoGs and quantifies their expression levels, from RNA-seq data.
Yuval wiesel, Shalgi lab technion
DoGFinder-1.0.1
Changes done in current version (3/11/20): Adding -S option in Get_DoGs and Get_DoGs_rpkm, Adding -g option in Get_DoGs_rpkm, Adjusting the function PE modification in init.py.
- Bedtools: version 2.20.0 - 2.26.0
- samtools
- python 2 with setuptools
Note: DoGFinder uses RSeQC, hence it must be run in python 2
You should be able to run bedtools and samtools commands from command prompt
- Download DoGFinder-1.0.0 from GitHub
- Extract the files and enter the DoGFinder folder
- Run the command:
sudo python setup.py install
Note: If you don’t have root privilege you can run the following command:
python setup.py install --home=<dir>
where you can supply any directory you like for the --home option. Before you run DoGFinder you should export your path by export PATH.
- To check installation run:
python setup.py test
No errors should be reported
-
Mapped reads .bam file, Should be sorted and index using samtools
-
GTF annotation file format, could be downloaded from the UCSC website: https://genome.ucsc.edu/
Construct a new annotation bed file from one or more GTF files. Overlapping genes on the same strand are merged to one locus. If more than one input annotation is supplied, gene boundaries are set to be the most inclusive possible.
Get_loci_annotation -out /annotationdir -gtf ucsc_gtf_file.gtf,refseq_gtf_file.gtf
-gtf: comma separated GTF files
-Suffix: OPTIONAL. add suffix to output file: loci_annotation_suffix.bed
Annotation file: loci_annotation.bed, to be used as input for Get_DoGs
Note: we provided ready to use annotation files for human and mouse, see annotation below.
Preprocessing and downsampling of RNA-seq bam files. (Run this command before Get_DoGs)
Note: The downsampling is performed using a pseudorandom seed. To reproduce the same downsampled bam files run Pre_process again keeping the order of the input bam files constant.
Pre_Process -Q 3 -bam rawdata/bam_control1.bam,rawdata/bam_control2.bam,rawdata/bam_HS1.bam -ref loci_annotation.bed
-ref Reference .bed gene/transcript annotation file for strandness check (can be loci_annotation.bed file)
-bam comma separated and sorted bam files
-Q number of cores to be used, should be the same as the number of input bam files
Processed read files, to be used as input for Get_DoGs: rawdata/bam_control1.sorted_DS.bam,rawdata/ bam_control2.sorted_DS.bam,... The output raw and downsampled files will be located at the same folder as the original bam files Note: if you have paired-end bam file, the out file will have PE in his name
Finds DoG annotations, generates a .bed file
Get_DoGs -out /outdir -bam my_sorted_bam.sorted_DS.bam -a loci_annotation.bed [ -s -minDoGLen 4000 -minDoGCov 0.6 -w 200 -mode F ]
-bam: bam file generated by Pre_Process function, run each bam file individually
-s: Flag is true if the library used is strand specific on the forward strand and Single-End, or if the library is strand specific (either forward or reverse) and Paired-End. Default is False.
-S: Single-End library only. Flag is true if the library used is strand specific on the reverse strand and Single-End. Default is False.
-minDoGLen: OPTIONAL. Minimal DoG length,the minimal number of bases downstream of gene end with a coverage of minDoGCov. Default value is 4000 bases.
-minDoGCov: OPTIONAL. Minimal DoG coverage, the minimal fraction of DoG region that is covered by reads. In the running example it requires coverage of 0.6, and will filter out a DoG or mark its end coordinate when the coverage drops below 60%. Default value is 0.6.
-w: OPTIONAL. Moving window size, the number of bases examined in each step of DoG extension. The moving window is 50% overlapping the previously-defined DoG region resulting in extension of w/2 bases in each step. Default is 200 bases.
-Suffix: OPTIONAL. add suffix to output file: Final_Dog_annotation_suffix.bed
-mode: OPTIONAL. F or W: F denotes that minDoGCov is computed over the entire DoG length. W denotes that minDoGCov is computed over the last moving window of the DoG, if you choose this mode you should also specify -wc for coverage cutoff over the window. Default is F.
DoG annotation file at /outdir/Final_Dog_annotation.bed, you can use this file as input for Get_DoGs_rpkm
-
During the first run, this script removes intergenic reads by running bedtools intersect -v and saves a new file: my_sorted_bam_no_gene_sorted.bed in the same folder as my_sorted_bam.bam
-
In the subsequent runs, the script will search for my_sorted_bam_no_gene_sorted.bed in the same folder as my_sorted_bam.bam folder and if it is there the script will use this file and will not run intersect bed again.
-
DoG length is restricted to the distance to the proximal downstream gene, with consideration of strand information according to strandness / unstrandedness of the input data.
If several DoG annotation files are generated, (e.g. for different replicates and/or treatments), one can use the Union_DoGs_annotation function to create a single DoG annotation bed file (e.g long DoGs). The union includes any DoG that appears in one of more input files, and DoGs in more than one file will be unified according to their maximal length.
Union_DoGs_annotation -dog Final_Dog_annotation1.bed,Final_Dog_annotation2.bed -out /outdir
-dog: Comma separated list of all DoG annotation files
-out: output Dir.
/outdir/union_dog_annotation.bed
If users have several replicates and want to find the DoGs that are common to all replicates, the Common_DoGs_annotation function generates a single DoG annotation bed file from the intersection of all input DoGs annotation files.
Common_DoGs_annotation -comm Dog_annotation_replicate1.bed,Dog_annotation_replicate2.bed -out /outdir
-comm: Comma separated list of replicates DoGs annotation
-out: output Dir.
/outdir/common_dog_annotation.bed
Calculate DoGs rpkm, returns a .csv table
Get_DoGs_rpkm -out /outdir -bam my_sorted_bam_DS.bam -s -dog Final_Dog_annotation.bed
-bam: bam file generated by the Pre_Process function, run each bam file separately.
-dog: the Final_Dog_annotation.bed file or union_dog_annotation.bed
-s: OPTIONAL. Flag is true if the library used is strand specific on the forward strand and Single-End, or if the library is strand specific (either forward or reverse) and Paired-End. Default is False.
-S: OPTIONAL. Flag is true if the library used is strand specific on the reverse strand and Single-End. Default is False.
-g: OPTIONAL. Specify a genome file that defines the expected chromosome order and lengths. File example was added to the annotations folder.
.csv table of DoGs and their RPKM, located at: /outdir/DoGs_rpkm_table.csv
Under the annotation/ folder of the DoGFinder package you can find ready to use mouse (mm9) and human (hg19) loci_annotation.bed files. These files were generated by running Get_loci_annotation on three tables downloaded from the UCSC database: ucsc genes (knownGene), RefSeq genes (refGene) and Ensembl genes (ensGene). When using these files please make sure that the genome version is compatible with the one your bam files were mapped to.
The folder /tests/data contains the following GTF files: test_ens.gtf,test_refseq.gtf,test_ucsc.gtf
as well as 3 paired-end RNA-seq bam files: UN.bam,KCL.bam,HS.bam
Create loci annotation file from 3 different GTF files
Get_loci_annotation -out $(pwd) -gtf test_ens.gtf,test_refseq.gtf,test_ucsc.gtf
You will get the loci_annotation.bed file with Hspa8 annotation constructed from all GTF files (which is the union of several different transcripts in the GTF files) :
chr9 40609067 40613283 ENSMUST00000015800&ENSMUST00000082857&ENSMUST00000083424&ENSMUST00000101927&ENSMUST00000117557&ENSMUST00000117870&ENSMUST00000127699&ENSMUST00000133964&ENSMUST00000138895&ENSMUST00000140984&ENSMUST00000149936&ENSMUST00000153847&NM_031165&uc009ozx.2&uc009ozy.2&uc009ozz.1 0 +
Preprocessing and downsampling relatively to HS.bam,UN.bam,KCL.bam
Pre_Process -bam HS.bam,UN.bam,KCL.bam -Q 3 -ref loci_annotation.bed
Because the bam files are paired-end, for every input bam file you will get two output files:
For HS.bam:
HS.sorted_PE.bam : This is the modified paired-end bam file
HS.sorted_PE.sorted_DS.bam : This is the downsampled bam file, the one you should use as input for Get_DoGs
Find DoGs annotation for every bam file
Get_DoGs -out $(pwd) -bam UN.sorted_PE.sorted_DS.bam -suff UN -a loci_annotation.bed -s
Get_DoGs -out $(pwd) -bam HS.sorted_PE.sorted_DS.bam -suff HS -a loci_annotation.bed -s
Get_DoGs -out $(pwd) -bam KCL.sorted_PE.sorted_DS.bam -suff KCL -a loci_annotation.bed -s
You will get 3 DoG annotation files: Final_Dog_annotation_UN.bed,Final_Dog_annotation_HS.bed,Final_Dog_annotation_KCL.bed
For example in UN:
chr9 40613283 40620883 ENSMUST00000015800&ENSMUST00000082857&ENSMUST00000083424&ENSMUST00000101927&ENSMUST00000117557&ENSMUST00000117870&ENSMUST00000127699&ENSMUST00000133964&ENSMUST00000138895&ENSMUST00000140984&ENSMUST00000149936&ENSMUST00000153847&NM_031165&uc009ozx.2&uc009ozy.2&uc009ozz.1 0 +
To get one DoG annotation from all 3 annotations:
Union_DoGs_annotation -dog Final_Dog_annotation_UN.bed,Final_Dog_annotation_KCL.bed,Final_Dog_annotation_HS.bed -out $(pwd)
You will get one union annotation file, union_dog_annotation.bed:
chr9 40613283 40679683 ENSMUST00000015800&ENSMUST00000082857&ENSMUST00000083424&ENSMUST00000101927&ENSMUST00000117557&ENSMUST00000117870&ENSMUST00000127699&ENSMUST00000133964&ENSMUST00000138895&ENSMUST00000140984&ENSMUST00000149936&ENSMUST00000153847&NM_031165&uc009ozx.2&uc009ozy.2&uc009ozz.1 0 +
Calculate DoGs rpkm for every bam file (you can use also the Final_Dog_annotation files):
Get_DoGs_rpkm -out $(pwd) -bam UN.sorted_PE.sorted_DS.bam -s -dog union_dog_annotation.bed -suff UN -g genome_file.txt
Get_DoGs_rpkm -out $(pwd) -bam HS.sorted_PE.sorted_DS.bam -s -dog union_dog_annotation.bed -suff HS -g genome_file.txt
Get_DoGs_rpkm -out $(pwd) -bam KCL.sorted_PE.sorted_DS.bam -s -dog union_dog_annotation.bed -suff KCL -g genome_file.txt
You will get 3 DoG RPKM tables: DoGs_rpkm_table_UN.csv,DoGs_rpkm_table_HS.csv,DoGs_rpkm_table_KCL.csv
In UN, for example:
DoG_name chromosome start end DoG_length strand DoG_rpkm
ENSMUST00000015800&ENSMUST00000082857&ENSMUST00000083424&ENSMUST00000101927&ENSMUST00000117557&ENSMUST00000117870&ENSMUST00000127699&ENSMUST00000133964&ENSMUST00000138895&ENSMUST00000140984&ENSMUST00000149936&ENSMUST00000153847&NM_031165&uc009ozx.2&uc009ozy.2&uc009ozz.1 chr9 40613283 40679683,66400 + 7249.06668929
These csv (tab-delimited) files can be further processed or opened in excel, MATLAB , etc.