Project Leader Vincent Ferretti
Author Ivan Borozan
CSSSCL is a Python package that uses Combined Sequence Similarity Scores for accurate taxonomic CLassification of long and short reads.
Distributor ID: Debian/Ubuntu
Description: Debian GNU/Linux 8.1 (jessie) / Ubuntu 12.04.3 LTS / 14.04.1 / 19.04
Release: 8.1 64-bit / 12.04 64-bit / 14.04 64-bit
Codename: jessie / precise / trusty
Python = 2.7.9 (biopython==1.67, Cython==0.29.10, numpy==1.9.2, pymongo==2.9.5, pysam==0.15.2, python-dateutil==2.5.3, scikit-learn==0.17.1, scipy==0.15.1, six==1.12.0)
We have set up three ways to install the cssscl package:
- Quick deployment using Docker (small file).
- Installation in a Python virtual environment (recommended; see below).
- System wide installation from the source code (see the Installation Guide below).
We recommend installing the cssscl package with Python's virtual environment tool, which keeps the dependencies required by cssscl in a separate directory and keeps your global Python dist- or site-packages directory clean and manageable, as shown below:
Note: if any of the following packages (jellyfish, BLAST or plzip) are already installed on your system, make sure that they are in your executable search path (i.e. the PATH variable), as shown in the examples below:
BLAST
# e.g. PATH_TO_YOUR_BLAST=/home/user_x/blast/ncbi-blast-2.2.30+/bin
$ export PATH=$PATH:PATH_TO_YOUR_BLAST
jellyfish
# e.g. PATH_TO_YOUR_jellyfish=/home/user_x/jellyfish-1.1.12/bin
$ export PATH=$PATH:PATH_TO_YOUR_jellyfish
plzip
# e.g. PATH_TO_YOUR_plzip=/home/user_x/plzip-1.1/plzip
$ export PATH=$PATH:PATH_TO_YOUR_plzip
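After exporting the paths above, a quick sanity check can confirm that each tool is actually reachable. The following is a minimal sketch that scans the directories listed in PATH. The helper find_on_path is hypothetical (it is not part of cssscl), and the executable names blastn, jellyfish and plzip are assumptions about how the tools are named on your system.

```python
import os


def find_on_path(name, search_path=None):
    """Return the full path of executable `name`, or None if it is not found."""
    if search_path is None:
        search_path = os.environ.get("PATH", "")
    for directory in search_path.split(os.pathsep):
        if not directory:
            continue  # skip empty PATH entries
        candidate = os.path.join(directory, name)
        # Accept only regular files that the current user may execute.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


if __name__ == "__main__":
    # Assumed executable names; adjust them to match your installation.
    for tool in ("blastn", "jellyfish", "plzip"):
        location = find_on_path(tool)
        print("%-10s %s" % (tool, location or "NOT FOUND in PATH"))
```

The lookup is written by hand rather than with shutil.which so that the same script also runs under the Python 2.7 interpreter that cssscl requires.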
Step 1. Install dependencies on Debian and Ubuntu
In order to compile cssscl on Debian GNU/Linux 8.1 and Ubuntu 12.04 LTS the following packages need to be installed:
$ sudo apt-get update
$ sudo apt-get install build-essential g++ libxml2-dev libxslt-dev gfortran libopenblas-dev liblapack-dev
Step 2. Download the cssscl package
# use wget
$ wget --no-check-certificate https://github.com/oicr-ibc/cssscl/archive/master.tar.gz
$ tar -zxvf master.tar.gz; mv cssscl-master cssscl
or use git clone (note that sudo apt-get install git is required for git access):
# use git clone
$ git clone https://github.com/oicr-ibc/cssscl.git
Step 3. Check that all packages necessary to run cssscl are installed and available by running the cssscl_check_pre_installation.sh script (Ubuntu/Debian distributions only).
$ cd cssscl
$ ./cssscl_check_pre_installation.sh
Note: when source cssscl/scripts/export.sh is shown on the screen, follow the on-screen instructions to run the export.
Note: for more information regarding the cssscl_check_pre_installation.sh script see here.
Step 4. In the cssscl directory, create a virtual environment (e.g. name it csssclvenv)
$ virtualenv csssclvenv
Step 5. To begin using the virtual environment, it first needs to be activated as shown below:
$ source csssclvenv/bin/activate
Step 6. Install cssscl as root
$ sudo pip install .
Note: this will install all the Python modules necessary for running the cssscl package in the cssscl/csssclvenv/ directory.
Step 7. Configure cssscl
$ cssscl configure
Accept all the default values by pressing [ENTER] at each prompt.
Note: If you are done working in the virtual environment, you can deactivate it as shown below.
$ deactivate
If you would like to run the cssscl program again (and you have deactivated the Python virtual environment) you will need to activate it again as shown above.
Install the cssscl package directly to your global Python dist- or site-packages directory as shown below (CAUTION: some of the Python packages on your system might be updated if required by the cssscl package):
Note: if any of the following packages (jellyfish, BLAST or plzip) are already installed on your system, make sure that they are in your executable search path (i.e. the PATH variable), as shown in the examples below:
BLAST
# e.g. PATH_TO_YOUR_BLAST=/home/user_x/blast/ncbi-blast-2.2.30+/bin
$ export PATH=$PATH:PATH_TO_YOUR_BLAST
jellyfish
# e.g. PATH_TO_YOUR_jellyfish=/home/user_x/jellyfish-1.1.12/bin
$ export PATH=$PATH:PATH_TO_YOUR_jellyfish
plzip
# e.g. PATH_TO_YOUR_plzip=/home/user_x/plzip-1.1/plzip
$ export PATH=$PATH:PATH_TO_YOUR_plzip
Step 1. Install dependencies on Debian and Ubuntu
Python: Only Python 2.7.3+ is supported. No support for Python 3 at the moment.
In order to compile cssscl on Debian GNU/Linux 8.1 and Ubuntu 12.04 LTS the following packages need to be installed:
$ sudo apt-get update
$ sudo apt-get install build-essential python2.7 python2.7-dev g++ libxml2-dev libxslt-dev gfortran libopenblas-dev liblapack-dev
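The interpreter constraint stated in this step (Python 2.7.3 or later, no Python 3) can be verified before going further. The sketch below is illustrative only. The function is_supported_python is a hypothetical helper that is not shipped with cssscl, and it simply encodes the version rule as written above.

```python
import sys


def is_supported_python(version_info):
    """True for Python 2.7.3 or later in the 2.x series; False for Python 3."""
    version = tuple(version_info[:3])
    return version[0] == 2 and version >= (2, 7, 3)


if __name__ == "__main__":
    current = tuple(sys.version_info[:3])
    if is_supported_python(current):
        print("Python %d.%d.%d is supported" % current)
    else:
        print("Python %d.%d.%d is NOT supported" % current)
```

Run it with the interpreter you intend to use for the installation, for example python2.7.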
Step 2. Download the cssscl package
# use wget
$ wget --no-check-certificate https://github.com/oicr-ibc/cssscl/archive/master.tar.gz
$ tar -zxvf master.tar.gz; mv cssscl-master cssscl
or use git clone (note that sudo apt-get install git is required for git access):
# use git clone
$ git clone https://github.com/oicr-ibc/cssscl.git
Step 3. Check that all packages necessary to run cssscl are installed and available by running the cssscl_check_pre_installation.sh script (Ubuntu/Debian distributions only).
$ cd cssscl
$ ./cssscl_check_pre_installation.sh
Note: when source cssscl/scripts/export.sh is shown on the screen, follow the on-screen instructions to run the export.
Note: for more information regarding the cssscl_check_pre_installation.sh script please see here.
Step 4. Install cssscl as root
$ sudo pip install .
Step 5. Configure cssscl
$ cssscl configure
Accept all the default values by pressing [ENTER] at each prompt.
Additional instructions for non-automated installation of third party software necessary for running the cssscl package
If the cssscl_check_pre_installation.sh script (see the installation subsections above) fails, please read the information below on manually installing the individual third party software:
Necessary Python modules:
- BioPython - Tools for biological computation.
- PyMongo - Python module needed for working with MongoDB (PyMongo = 2.8)
- Sklearn - Machine Learning in Python
- Numpy - NumPy is the fundamental package for scientific computing with Python
- Cython - Cython is an optimising static compiler for both the Python programming language and the extended Cython programming language (based on Pyrex)
- SciPy - SciPy is a Python-based ecosystem of open-source software for mathematics, science, and engineering.
Installing python modules using pip manually:
$ pip install cython==0.29.10
$ pip install numpy==1.9.2
$ pip install pymongo==2.9.5
$ pip install biopython==1.67
$ pip install scikit-learn==0.17.1
$ pip install scipy==0.15.1
Third party software:
BLAST (version 2.2.30+ and higher) Basic Local Alignment Search Tool. http://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs&DOC_TYPE=Download
JELLYFISH (version 1.1.+ but not 2.0.+) JELLYFISH is a tool for fast, memory-efficient counting of k-mers in DNA. http://www.cbcb.umd.edu/software/jellyfish/
PLZIP (version 1.1+) Plzip is a massively parallel (multi-threaded) lossless data compressor based on the lzlib compression library, with a user interface similar to the one of lzip, bzip2 or gzip. http://download.savannah.gnu.org/releases/lzip/plzip/
Note that the classification results in the paper were obtained using Plzip 1.1 with Lzlib 1.5.
To compile Plzip 1.1 and Lzlib 1.5:
Step 1. Download lzlib-1.5.tar.gz
$ wget --no-check-certificate http://download.savannah.gnu.org/releases/lzip/lzlib/lzlib-1.5.tar.gz
Step 2. Install lzlib-1.5:
$ gunzip lzlib-1.5.tar.gz
$ tar -xvf lzlib-1.5.tar
$ cd lzlib-1.5
$ ./configure
$ make
$ make install
Step 3. Download Plzip 1.1
$ wget --no-check-certificate http://download.savannah.gnu.org/releases/lzip/plzip/plzip-1.1.tar.gz
Step 4. Install Plzip
$ gunzip plzip-1.1.tar.gz
$ tar -xvf plzip-1.1.tar
$ cd plzip-1.1
$ ./configure
$ make
$ make install
For more information about plzip consult: http://www.nongnu.org/lzip/manual/plzip_manual.html
and for memory required to compress and decompress: http://www.nongnu.org/lzip/manual/plzip_manual.html#Memory-requirements
Make sure that JELLYFISH, BLAST and Plzip are in your executable search path (see the examples below):
# for example
$ export PATH=$PATH:PATH_TO_BLAST/blast/ncbi-blast-2.2.30+/bin
$ export PATH=$PATH:PATH_TO_jellyfish/jellyfish-1.1.12/bin
$ export PATH=$PATH:PATH_TO_plzip/plzip-1.1/plzip
Install MongoDB
MongoDB should be installed using the following set of instructions:
Ubuntu 12.04.3 LTS /14.04.1
First add the 10gen GPG key, the public gpg key used for signing these packages. It should be possible to import the key into apt's public keyring with a command like this:
$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv 7F0CEB10
Add this line verbatim to your /etc/apt/sources.list (it is a repository entry, not a shell command):
deb http://downloads-distro.mongodb.org/repo/ubuntu-upstart dist 10gen
In order to complete the installation of the packages, you need to update the sources and then install the desired package
$ sudo apt-get update
$ sudo apt-get install mongodb-10gen=2.4.14
Ubuntu 19.04
$ sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 9DA31620334BD75D9DCB49F368818C72E52529D4
$ echo "deb [ arch=amd64 ] https://repo.mongodb.org/apt/ubuntu bionic/mongodb-org/4.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.0.list
$ sudo apt update
$ sudo apt-get install -y mongodb-org
Start mongo service
$ sudo service mongod start
Debian
$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv 7F0CEB10
$ echo 'deb http://downloads-distro.mongodb.org/repo/ubuntu-upstart dist 10gen' | tee -a /etc/apt/sources.list
$ apt-get update
$ apt-get install mongodb-10gen=2.4.14
Note: the uninstall script below will only work if you installed cssscl with the command sudo pip install . as shown in the Installation section above.
$ cd cssscl/
$ ./cssscl_uninstall.sh
Download taxon data:
https://drive.google.com/open?id=1okbaJkv6IgvWf8R1A97CX9lq10wV_INY
$ tar -zxvf taxon.tar.gz
Download test/train data:
https://drive.google.com/open?id=1glzuBJAqf5MPuO5_ivaFFLnZxpjnamFc
$ tar -zxvf test_data.tar.gz
Example 1 - run the cssscl classifier without the optimization using the taxon data and the test set provided
Step 1. Build the necessary databases from the training set
$ cssscl build_dbs -btax -c -blast -nt 2 PATH_TO/test_data/TRAIN.fa PATH_TO/taxon/
(the whole process should take ~ 37 min using 2 CPUs)
By default, all databases are written to the directory where TRAIN.fa resides (note that all paths in the examples above are absolute/full paths to the files/directories). The above command builds three databases (the BLAST, compression and k-mer databases) for the sequences in the training set.
The cssscl build_dbs module requires two positional arguments:
1. a file in the FASTA format (e.g. TRAIN.fa as in the example above) that specifies the collection of reference genomes composing the training set.
2. a directory (taxon/ in the example above) that specifies the location where the taxon data is stored. More specifically, the directory should contain the following files: gi_taxid_nucl.dmp, names.dmp and nodes.dmp. These files can be downloaded from the NCBI taxonomy database at ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/.
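Before starting a long build_dbs run, it can help to confirm that the taxon directory really contains the three files listed above. The following is a minimal sketch of such a check. The helper missing_taxon_files is hypothetical and is not part of the cssscl command line.

```python
import os

# File names taken from the description of the taxon directory above.
REQUIRED_TAXON_FILES = ("gi_taxid_nucl.dmp", "names.dmp", "nodes.dmp")


def missing_taxon_files(taxon_dir):
    """Return the required taxonomy files that are absent from `taxon_dir`."""
    return [name for name in REQUIRED_TAXON_FILES
            if not os.path.isfile(os.path.join(taxon_dir, name))]
```

Calling missing_taxon_files("PATH_TO/taxon/") returns an empty list when all three files are present; otherwise it returns the names of the files that still need to be downloaded.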
The information about the additional optional arguments used in the command line above is provided here.
For more information please consult the cssscl build_dbs help page by typing:
$ cssscl build_dbs --help
Step 2. Perform the classification of the sequences in the test set
# use cssscl to classify sequences in TEST.fa
$ cssscl classify -c -blast blastn -tax genus -nt 2 PATH_TO/test_data/test/TEST.fa PATH_TO/test_data/
(the whole process should take ~ 29 min using 2 CPUs)
Note that in the above example the output file cssscl_results_genus.txt with the classification results will be located in the directory where TEST.fa resides.
Note: For the test set data provided above, the values of the parameters used in the model have already been optimized and are included as part of the test set (see the optimum_kmer directory in the test_set/ directory provided). Thus, for the test dataset, the optimization does not need to be performed prior to running the classifier. To run the classifier with the optimization stage performed first, see the optimization steps below.
The cssscl classify module requires two positional arguments:
1. a file with test data with sequences in the FASTA format for classification (e.g. TEST.fa as in the example above)
2. a directory where the databases (built using the training set) reside
Note: This will run the classifier with all the similarity measures (including the compression and the blast measure) as described in: Borozan I et al. "Integrating alignment-based and alignment-free sequence similarity measures for biological sequence classification." Bioinformatics. 2015 Jan 7. pii: btv006.
The information about the additional optional arguments used in the command line above is provided here.
For more information please consult the cssscl classify help page by typing:
$ cssscl classify --help
Step 1. Build the necessary databases from the training set
Note: Only do this if you have not already built the databases in Example 1 above.
$ cssscl build_dbs -btax -c -blast -nt 2 PATH_TO/test_data/TRAIN.fa PATH_TO/taxon/
(the whole process should take ~ 37 min using 2 CPUs)
Step 2. Perform the classification of the sequences in the test set by optimizing the cssscl parameter values first
$ cssscl classify -c -blast blastn -opt -tax genus -nt 8 PATH_TO/test_data/test/TEST.fa PATH_TO/test_data/
More information about the optimization can be found here.
Note that the optimization phase will take considerably longer when the -c (compression) argument is used, as mentioned in the section Note regarding the compression measure below.
The information about the additional optional arguments used in the command line above is provided here.
Using the compression measure considerably slows down both the optimization and the classification, because the running time complexity is ~ O(n*n) for the optimization phase and ~ O(n*m) for the classification phase, where n and m are the numbers of sequences in the training and test sets, respectively. Thus the compression measure should only be used with smaller genome databases (e.g. viruses) and/or with smaller datasets (i.e. a smaller number of reads/contigs to classify).
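To get a feel for how quickly these costs grow, the stated complexities can be turned into a rough count of pairwise comparisons. This is a back-of-the-envelope sketch only. The function compression_comparisons is hypothetical, it ignores constant factors and sequence lengths, and the real number of operations performed by cssscl may differ.

```python
def compression_comparisons(n_train, n_test):
    """Estimate pairwise comparison counts implied by the stated complexities."""
    return {
        "optimization": n_train * n_train,    # ~ O(n*n), training set against itself
        "classification": n_train * n_test,   # ~ O(n*m), test set against training set
    }
```

For example, compression_comparisons(1000, 5000) gives 1,000,000 comparisons for the optimization phase and 5,000,000 for the classification phase, while a tenfold larger training set raises the optimization estimate a hundredfold.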
Licensed under the GNU General Public License, Version 3.0. See LICENSE for more details.
Copyright 2015 The Ontario Institute for Cancer Research.
This project is supported by the Ontario Institute for Cancer Research (OICR) through funding provided by the government of Ontario, Canada.