Skip to content

sib-swiss/pwmscan

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

80 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

PWMScan
============================================================================
PWMScan is a tool to scan whole genomes of common model organisms with a
position-specific weight matrix (PWM).



Description
============================================================================

The Position Weight Matrix (PWM) is the most commonly used model to describe
the DNA binding motif of a transcription factor. A PWM contains weights for
each base at each motif position. A PWM score can be computed for any base
sequence of the same length by simply summing up the corresponding weights
from the PWM.

PWMScan is a software package with a Web interface.

PWMScan can use two alternative search engines:

 - bowtie, a fast memory-efficient short read aligner using indexed genomes
 - matrix_scan, a C program using a conventional search algorithm

Basically, two approaches are used to scan large DNA sequences such as genomes
with a PWM:

 1) use a fast string matching algorithm, Bowtie, to scan the genome as follows:

     - given a PWM model and a cut-off value, generate all possible matches/tags
       that represent the given PWM along with the corresponding scores;
     - map the list of tags to a reference genome or a set of DNA sequences,
       using a fast string-matching algorithm (e.g bowtie).

 2) use a conventional search algorithm, matrix_scan, that has been optimized for
    rapid score computation and drop-off strategy.
    matrix_scan first rescales the scoring matrix so that the maximum score at each
    position is set to zero. When scanning the genome, for each position along the
    sequence, it computes the sum of weights (scores) and drops out as soon as the
    score is below the cut-off value.
    In order to speed up the scanning process, the matrix_scan program pre-computes
    the PWM scores in both forward and reverse directions for all possible nucleotide
    words of a given length. In addition, In case the PWM is longer than the word
    size, a core region within the PWM is defined such that it minimizes the sum
    of weights for rapid drop-off. The lateral positions are ranked in decreasing
    order of importance.

The Bowtie-based approach is more efficient for short PWMs and very low p-values
(of the order of 10-5 or less).
The matrix_scan program can be executed in parallel by processing individual
chromosomes in parallel on multiple CPU-cores via GNU parallel.

The Web interface automatically chooses the most suitable method.

We use integer log likelihoods or integer log-odds as the internal working
format of PWMs. The PWM has one column for each nucleotide in DNA sequences,
and it has one row for each position in the pattern. The scores at each
position are calculated as the sum of integer log likelihoods (log-odds).

We provide several tools for matrix format conversion to integer log-odds,
in particular for JASPAR, TRANSFAC, real PWMs, letter probability matrices (LPMs),
and position frequency matrices (PFMs).
These tools are included in the perl_tools directory.

The match list is provided in BEDdetail format, with the following fields:

 1- chromosome name (e.g. chr1, chrX, chrM)
 2- starting position of the matching sequence
 3- ending position of the matching sequence
 4- matching nucleotide sequence
 5- integer score of the matching sequence
 6- strand (either '+' or '-')
 7- PWM name
 7- p-value of the match

BEDdetail is an extension of BED format that is used to enhance the track display page.
For PWMScan, we use BEDdetail format to include the name of the PWM as well as the p-value
associated to the motifs identified by PWMScan.

For a complete description of BED and BEDdetail format, please refer to:

 https://genome.ucsc.edu/FAQ/FAQformat.html#format1
 https://genome.ucsc.edu/FAQ/FAQformat.html#format1.7



Programs and utilities
============================================================================

The main programs are the following:

 - matrix_prob          Compute the cumulative PWM score distribution
                        given a background model, and a PWM.
                        Used for PWM score computation and conversion.

 - mba                  Matrix Branch-and-bound Algorithm (mba) generates a list
                        of all matching sequences given an integer PWM and a cut-off.
                        MBA is the prior step to using Bowtie.

 - matrix_scan          Scan a set of sequence files with a PWM and a cut-off value.

 - bowtie2bed           Convert the BOWTIE output into BED format.

 - mscan2bed            Convert the matrix_scan output into BED format.

 - mscan_bed2sga        Convert the BED file from the PWMSCan pipeline into SGA format.

 - filterOverlaps       Filter out overlapping matches for BED format.

 - seq_extract_bcomp    Extract BED regions from a set of FASTA-formatted sequences.
                        The extracted sequences are written to standard output.
                        Optionally, the program computes and outputs the base composition,
                        in which case sequences can be extracted directly from the FASTA
                        input or, as for the extraction mode, specified in a BED file.

 - pwm_scoring          Score a set of nucleotide sequences in FASTA format, based on
                        matches to either an integer PWM or a base probability matrix.

The Bowtie software is available on SourceForge.net for all UNIX-based platforms:

  - bowtie           http://bowtie-bio.sourceforge.net/index.shtml

SGA is a single-line-oriented and tab-delimited format with the following five
obligatory fields:

  1. Sequence name/ID (Char String),
  2. Feature (Char String),
  3. Sequence Position (Integer),
  4. Strand (+/- or 0),
  5. Tag Counts (Integer).

Additional fields may be added containing application-specific information used by other programs.
In the case of ChIP-seq data, SGA files represent genome-wide read count distributions from one or
several experiments.

SGA is the working format of our ChIP-seq data analysis programs (sourceforge.net/projects/chip-seq).

Matrix format conversion utilities are the following:

  - jasparconvert.pl         Convert a JASPAR matrix file into MEME format (integer log-odds).

  - transfaconvert.pl        Convert a TRANSFAC matrix file into MEME format (integer log-odds).

  - pwmconvert.pl            Convert a real or SSA-formatted PWM into integer plain-text format.

  - lpmconvert.pl            Convert a Letter Probability Matrix (LPM) file into an integer PWM.

  - pfmconvert.pl            Convert a Position Frequency Matrix (PFM) file into either a PWM or a LPM.

  - pwm2lpmconvert.pl        Convert a Position Weight Matrix (PWM) to letter probability format (LPM).

The jasparconvert.pl and transfaconvert.pl scripts are based on an original implementation
by William Stafford Noble and Timothy L. Bailey (1999) for MEME (http://meme-suite.org/).

We also provide three (bash) shell wrapper scripts that embed the entire analysis pipeline:

  - pwm_scan                 Scan a genome with a PWM and a p-value using either Bowtie or matrix_scan
  - pwm_scan_ucsc            Scan a genome with a PWM and a p-value using either Bowtie or matrix_scan
                             (Adapted script from pwm_mscan_wrapper to deal with UCSC chromosome files chr*.fa)
  - pwm_bowtie_wrapper       Scan a genome with a PWM and a p-value using Bowtie
  - pwm_mscan_wrapper        Scan a genome with a PWM and a p-value using matrix_scan
  - pwm_mscan_wrapper_ucsc   Scan a genome with a PWM and a p-value using matrix_scan
                             (Adapted script from pwm_mscan_wrapper to deal with UCSC chromosome files chr*.fa)

  - pwmlib_scan           Scan a genome with a PWM Collection using either Bowtie or matrix_scan

The pwmlib_scan tool is used to scan a genome with a collection of PWMs coming from a database.
Accepted motif database formats are the MEME libraries as well as the motif libraries (both log-odds
and letter-probability formats) provided by the PWMScan Web interface.

Description of these formats is provided at https://epd.expasy.org/pwmtools/pwmlib.html.
The PWMScan library collections are available at: https://epd.expasy.org/pwmtools/data/pwmlib/.

As an example, you find the JASPAR CORE 2018 Collection for vertebrates in the pwmlibs sub-directory,
provided in the three supported formats: meme, log-odds, and letter probability.

Please, refer to the README files in the examples sub-directory for more details on how to use
the programs as well as the wrapper scripts.

For matrix conversion into integer log-odds, we also provide the following bash script:

  - pwm_convert           Convert JASPAR, TRANSFAC, PFM, LPM and real PWM formats to integer log-odds



A few conventions about the genome files
============================================================================

PWMScan works with both Bowtie-formatted genome files and chromosome sequences in FASTA format.

By default, chromosome sequence files are downloaded from NCBI (RefSeq) and are called chrom*.seq.

Bowtie index files are generated from the chromosome sequence files using the bowtie-build command,
as follow:

  ls -1 -v GENOME_DIR/chrom*.seq | xargs cat > h_sapiens_hg19.fa
  bowtie-build -f h_sapiens_hg19.fa h_sapiens_hg19

By convention, the bowtie index file name starts with the species name (in lower case letters, e.g.: h_sapiens)
followed by an underscore and the UCSC assembly name (e.g. hg19) (see the BOWTIEIDX tables defined in the
wrapper scripts).

All genome sequence files have a FASTA header that is formatted as follows:

  >chr|NC_000001|NC_000001.10 some text

The sequence identifier includes three words concatenated via a UNIX pipe '|': the word 'chr' followed by
the NCBI RefSeq partial and full accession identifiers. The accession records indicate sequence identity.
In total, the FASTA header includes 2 pipe delimiters.
The sequence identifier is parsed by most of the PWMScan programs and utilities.

For each genome assembly, we provide a table for NCBI RefSeq dentifier to chromosome number conversion
(genomes_chr_NC_gi.tar.gz) as well as a table for FASTA sequence identifier to chromosome number
conversion (genomes_chr_hdr.tar.gz).
The chr_NC_gi files have been downloaded from NCBI whereas the chr_hdr files have been locally-generated.
These tables are used to convert the match lists from PWMScan to BED format. In particular, the chr_NC_gi
table is used by the conversion program mscan2bed (for matrix_scan) whereas the chr_hdr table is used by
the bowtie2bed program to convert the Bowtie output to BED.

The chr_hdr table can be changed to adapt to other FASTA header format conventions.

It is important to remark that the matrix_scan program assumes by default the 2 pipe-delimited FASTA header rule.
This behaviour can be changed via the optional argument -n[--pipes] of matrix_scan (e.g. '-n 3' for three pipes).
If the header does not include pipes, the program takes the header as it is (e.g. it is equivalent to '-n 0').

The genome tables can be found in the genomedb sub-directory together with the chromosome sequence files for hg19.

Here is an example for the hg19 assembly:

chr_NC_gi
#Chr    Accession.ver   gi
1       NC_000001.10    224384768
2       NC_000002.11    224384767
3       NC_000003.11    224384766
4       NC_000004.11    224384765
5       NC_000005.9     224384764
6       NC_000006.11    224384763
7       NC_000007.13    224384762
8       NC_000008.10    224384761
9       NC_000009.11    224384760
10      NC_000010.10    224384759
11      NC_000011.9     224384758
12      NC_000012.11    224384757
13      NC_000013.10    224384756
14      NC_000014.8     224384755
15      NC_000015.9     224384754
16      NC_000016.9     224384753
17      NC_000017.10    224384752
18      NC_000018.9     224384751
19      NC_000019.9     224384750
20      NC_000020.10    224384749
21      NC_000021.8     224384748
22      NC_000022.10    224384747
X       NC_000023.10    224384746
Y       NC_000024.9     224384745
M       NC_012920.1     251831106


chr_hdr
#Chr    Sequence Header                 Assembly
1       chr|NC_000001|NC_000001.10      hg19
2       chr|NC_000002|NC_000002.11      hg19
3       chr|NC_000003|NC_000003.11      hg19
4       chr|NC_000004|NC_000004.11      hg19
5       chr|NC_000005|NC_000005.9       hg19
6       chr|NC_000006|NC_000006.11      hg19
7       chr|NC_000007|NC_000007.13      hg19
8       chr|NC_000008|NC_000008.10      hg19
9       chr|NC_000009|NC_000009.11      hg19
10      chr|NC_000010|NC_000010.10      hg19
11      chr|NC_000011|NC_000011.9       hg19
12      chr|NC_000012|NC_000012.11      hg19
13      chr|NC_000013|NC_000013.10      hg19
14      chr|NC_000014|NC_000014.8       hg19
15      chr|NC_000015|NC_000015.9       hg19
16      chr|NC_000016|NC_000016.9       hg19
17      chr|NC_000017|NC_000017.10      hg19
18      chr|NC_000018|NC_000018.9       hg19
19      chr|NC_000019|NC_000019.9       hg19
20      chr|NC_000020|NC_000020.10      hg19
21      chr|NC_000021|NC_000021.8       hg19
22      chr|NC_000022|NC_000022.10      hg19
X       chr|NC_000023|NC_000023.10      hg19
Y       chr|NC_000024|NC_000024.9       hg19
M       chr|NC_012920|NC_012920.1       hg19

 Note that the second field of the chr_hdr file corresponds to the chromosome sequence identifier in the FASTA files. Given that the Bowtie indices are generated starting from the chromosome FASTA files, the Bowtie output will include the sequence identifier as specified in the second field of the chr_hdr file. To convert to BED format, which uses chromosome numbers, programs such as bowtie2bed need to read the chr_hdr table.

Chromosome files downloaded from UCSC are named chr*.fa. If this is the case, the adapted wrapper scripts pwm_scan_ucsc and pwm_mscan_wrapper_ucsc should be used instead of the deafult ones (pwm_scan and pwm_mscan_wrapper). The script pwm_bowtie_wrapper script should in principle work with genome index files built from UCSC genome assemblies, provided the chr_hdr file is changed according to the UCSC chromosome header convention, which is the following:

  >chr1

An example for the mouse genome mm10 downloaded from UCSC would then be the following:

chr_NC_gi
#Chr    Accession.ver   gi      Assembly
1       chr1    372099109       mm10
2       chr2    372099108       mm10
3       chr3    372099107       mm10
4       chr4    372099106       mm10
5       chr5    372099105       mm10
6       chr6    372099104       mm10
7       chr7    372099103       mm10
8       chr8    372099102       mm10
9       chr9    372099101       mm10
10      chr10   372099100       mm10
11      chr11   372099099       mm10
12      chr12   372099098       mm10
13      chr13   372099097       mm10
14      chr14   372099096       mm10
15      chr15   372099095       mm10
16      chr16   372099094       mm10
17      chr17   372099093       mm10
18      chr18   372099092       mm10
19      chr19   372099091       mm10
X       chrX    372099090       mm10
Y       chrY    372099089       mm10
M       chrM    34538597        mm10

chr_hdr
#Chr    Sequence Header  Assembly
1       chr1             mm10
2       chr2             mm10
3       chr3             mm10
4       chr4             mm10
5       chr5             mm10
6       chr6             mm10
7       chr7             mm10
8       chr8             mm10
9       chr9             mm10
10      chr10            mm10
11      chr11            mm10
12      chr12            mm10
13      chr13            mm10
14      chr14            mm10
15      chr15            mm10
16      chr16            mm10
17      chr17            mm10
18      chr18            mm10
19      chr19            mm10
X       chrX             mm10
Y       chrY             mm10
M       chrM             mm10


Files and paths
---------------------------------------------------------------------------
The tar files genomes_chr_NC_gi.tar.gz and genomes_chr_hdr.tar.gz include a list of chr_NC_gi and chr_hdr
tables for several common model organisms such as human, mouse, fruit fly, worm, zebrafish, yeast, and more.

To check the content, do the following:

  cd genomedb
  tar -tvzf genomes_chr_NC_gi.tar.gz
  tar -tvzf genomes_chr_hdr.tar.gz

To extract the files, execute the following commands:

  cd genomedb
  tar -xvzf genomes_chr_NC_gi.tar.gz
  tar -xvzf genomes_chr_hdr.tar.gz

The default genome root location (variable genome_root_dir) for conversion programs and scripts is the following:

  genome_root_dir=/home/local/db/genome

which corresponds to the root directory of the entire genome assembly data.

When the PWMScan package is installed, the genome files for hg19 are in the genomedb sub-directory, which is the
genome root directory used by our application examples:

  genome_root_dir=genomedb

The genome root directory can be changed via the command line option -d <genome-root-dir> in all five PWMScan bash
wrapper scripts (pwm_scan, pwm_scan,_ucsc, pwm_mscan_wrapper, pwm_mscan_wrapper_ucsc, and pwm_bowtie_wrapper).

By convention, the Bowtie index files are located in the bowtie sub-directory, whereas the chromosome
sequence files for a given assembly are stored in the corresponding sub-directory (e.g. hg19), that is:

  bowtie_dir=$genome_root_dir/bowtie
  assembly_dir=$genome_root_dir/<assembly_name>

More details are given in the README file in the genomedb sub-directory.

NB - The path to both the PWMScan binaries and scripts is defined by the bin_dir variable:

  bin_dir=/home/local/bin

This variable is hard-coded in the original bash wrapper scripts, and is changed upon installation
via the Makefile, so that all installed scripts in bin_dir have the correct binary pathname.

When the PWMScan package is installed, the binary directory for the PWMScan-specific programs is defined as:

  bin_dir=bin


PWMScan pipeline and application examples
============================================================================

Given a PWM and a p-value, the PWMScan pipeline includes the following steps:

  1) Compute the integer cut-off value, given a PWM and a p-value;

  2) Scan the genome using either matrix_scan or Bowtie;

  3) Convert the match list into BEDdetail format.

In the examples sub-directory, we describe some application examples exploiting both methods, and
provide detailed step-by-step instructions on how to execute the PWMScan pipeline (examples/README.txt).

The README file is available in both text and html format at:

  https://epd.expasy.org/pwmtools/README.txt
  https://epd.expasy.org/pwmtools/README.html



Web Interface
============================================================================

PWMScan has a web interface which is freely available at:

   https://epd.expasy.org/pwmtools/pwmscan.php

Key features of the Web interface are the following:

  - Menu-driven access to genomes of more than 20 model organisms
  - Access to large collections of PWMs from MEME and other databases
  - Custom PWMs are supplied by copy&paste or file upload
  - Support of various PWM formats: JASPAR, TRANSFAC, plain text, etc.
  - Cut-off values defined as PWM match scores, match percentage, or p-values
  - Output provided in various formats: BEDdetail, SGA, FPS, etc.
  - Direct links to the UCSC genome browser for visualization of results
  - Action buttons to transfer match list to downstream analysis tools
    (ChIP-Seq and motif analysis tools)

The Web interface doesn't support upload of user-supplied FASTA sequence files.