HexSE is a Python module designed to simulate sequence evolution along a phylogeny while considering the coding context of the nucleotides. The ultimate porpuse of HexSE is to account for multiple selection preasures on Overlapping Reading Frames.
HexSE
is a Python package developed under version 3.6.9.
The following dependencies are required (as listed in the setup.py
file):
scipy
numpy
biopython
pyyaml
To install HexSE
, there are three options.
- Clone repository:
$ git clone https://github.com/PoonLab/HexSE
$ cd HexSE
$ sudo python3 setup.py install
- Create virtual environment (ideal to ensure the right versions for the dependencies):
$ git clone https://github.com/PoonLab/HexSE
$ cd HexSE
$ python3 -m venv venv
$ source ./venv/bin/activate
$ python3 setup.py install
- Install with
pip
(without clonning):
$ python3 -m pip install --upgrade git+https://github.com/PoonLab/HexSE
To run HexSE
from anywhere in your Linux or macOS system, you can add HexSE's bin
folder to your PATH
as following:
$ PATH=<path_to_cloned_repository>/bin:$PATH
You should now be able to run your simulations from any directory in your system the command hexse
:
$ hexse <path_to_sequence_file> <path_to_phylogenetic_tree> <path_to_configuration_file> --outfile <path_to_alignment_file> --logfile <path_to_log_file>
Alternatively, in Linux you can create a symbolic link to hexse/bin/hexse/hexse.sh
in a folder from your PATH
(find folders in your PATH
with qecho $PATH$
):
ln -s <path_to_cloned_repository>/bin/hexse.sh hexse
HexSE
requires users to provide a nucleotide sequence (either in fasta
or genbank
format), a phylogenetic tree (in newick
format), and a configuration file in YAML format specifying the substitution biases and the coordinates of the Open Reading Frames (ORFs) in the sequence.
HexSE runs as a Python module from the terminal with command line arguments as following:
$ python3 -m hexse.run_simulation <path_to_sequence_file> <path_to_phylogenetic_tree> <path_to_configuration_file> --outfile <path_to_alignment_file> --logfile <path_to_log_file>
Configuration file
The configuration file contains the parameters that make the simulation possible. It must include a global substitution rate (global_rate
), transition-transvertion rate ratio (kappa
), mutation rates (mu
), stationary nucleotide frequencies (pi
), and locations of the coding sequences (orfs
) with specifications of the distrbution from which dN/dS values (omega
) are drawn (i.e. omega_classes
, omega_shape
, omega_dist
):
global_rate: 0.05
kappa: 0.3
pi:
# Note that keys for pi values HAVE to be the letters A, T, G or C
A: 0.25
T: 0.25
G: 0.25
C: 0.25
orfs:
# Location of gene C on HBV
1815,2454:
omega_classes: 5
omega_shape: 2.5
omega_dist: gamma
# From circular genome, this gene is specified as two fragments separated by a colon
2308, 3182; 0, 1625:
omega_classes: 4
omega_shape: 1.7
omega_dist: gamma
mu:
classes: 2
shape: 1.0
dist: lognorm
circular: true
To create a configuration file from a .gb
file, users can use gb_to_yaml.py
available at ./accesory_scripts/
.
Note: To declare coding aequences resulting from spliced genes or circular genomes, coordinates must be specified as the same orf
separated by a colon (;
), in the exact order than the ribosome would read it. For example, a gene from a circular genome encoded from position 2308
to 1625
, is defined in the configuration file as:
2308, 3182; 0, 1625:
omega_classes: 4
omega_shape: 1.7
omega_dist: gamma
Test files are provided in ./tests/fisxtures/
.
For a HexSE
test run on HBV genome along a phylogeny with 100 tips, use:
$ python3 -m hexse.run_simulation tests/fixtures/NC_003977.2_HBV.gb tests/fixtures/100_tree.newick tests/fixtures/conf_complete_HBV.yaml --logfile test_HBV.hexse.log --outfile HBV_align.hexse.fasta
HexSE
creates two files:
- Fasta file: sequence alignment in
fasta
format with as many mutated sequences as tips on the phylogeny. - Log file: with relevant information about the run. Includes location of input and output files, parameters for the run, and open reading frames (ORFs) in the sequence. Aditionally, it includes the regions in which the genome was divided based on their coding context, which are represented by tuples of
0
s and1
s. Overlapping regions can be recognise if more than one1
is present in the tuple.
Log file example:
INFO:root:
Simulation started at: 2022-10-05 13:24:53.786127
INFO:root:
FILES
Sequence: tests/fixtures/NC_003977.2_HBV.gb
Configuration: tests/fixtures/conf_complete_HBV.yaml
Phylo Tree: tests/fixtures/100_tree.newick
Alignment: HBV_align.hexse.fasta
PARAMETERS
Pi: {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}
Global rate: 0.05
Kappa: 0.3
Rate classes info:
{ 'classes': 2,
'dist': 'lognorm',
'shape': 1.0,
'values': {'mu1': 0.2615782918648644, 'mu2': 1.3871429788350027}}
INFO:root:
VALID ORFs
[ [[1375, 1840]],
[[2308, 3182], [0, 1625]],
[[2849, 3182], [0, 837]],
[[1815, 2454]]]
Total ORFs: 4
ORFs INFO
{ '+0': [ { 'coords': [[1375, 1840]],
'omega_classes': 5,
'omega_shape': 1.5,
'omega_values': [ 0.11355787594057443,
0.28563496606388417,
0.47595127509121704,
0.74137416016263,
1.383481722705885],
'orf_map': array([1, 0, 0, 0])},
{ 'coords': [[2308, 3182], [0, 1625]],
'omega_classes': 2,
'omega_shape': 2,
'omega_values': [0.37931535691266643, 1.2206846430881086],
'orf_map': array([0, 0, 1, 0])}],
'+1': [ { 'coords': [[2849, 3182], [0, 837]],
'omega_classes': 3,
'omega_shape': 1.2,
'omega_values': [ 0.11098924219663184,
0.36293140817456776,
0.9660793495713997],
'orf_map': array([0, 0, 0, 1])}],
'+2': [ { 'coords': [[1815, 2454]],
'omega_classes': 4,
'omega_shape': 1.0,
'omega_values': [ 0.05478151305786287,
0.190700742494181,
0.3999999999999998,
0.9545177444479566],
'orf_map': array([0, 1, 0, 0])}],
'-0': [],
'-1': [],
'-2': []}
INFO:root:
REGIONS
{ (0, 0, 1, 0): {'coords': [[837, 1374], [2454, 2848]], 'len': 933},
(0, 0, 1, 1): {'coords': [[0, 836], [2849, 3181]], 'len': 1170}, # Overlapping region on a circular genome
(0, 1, 0, 0): {'coords': [[1840, 2307]], 'len': 468}, # Non-overlapping region
(0, 1, 1, 0): {'coords': [[2308, 2453]], 'len': 146},
(1, 0, 0, 0): {'coords': [[1625, 1814]], 'len': 190},
(1, 0, 1, 0): {'coords': [[1375, 1624]], 'len': 250}, # Overlapping region
(1, 1, 0, 0): {'coords': [[1815, 1839]], 'len': 25}}
INFO:root:
Simulation Ended at: 2022-10-05 13:25:47.001636
Simulation lasted: 0:00:53.215509 seconds
HexSE
includes a set of scripts located at ./accesory_scripts/
that might be usefult to prepare information for a run and to process the alignment afterwards.
-
gb_to_yaml.py
: Creates a YAML configuration file by identifying the Coding Sequences (CDSs) on agenbank
file and sets default values tomu
,kappa
,global_rate
, andpi
. -
get_orfs.py
: Obtains the Open Reading Frames (ORF) of a genome based on itsgenbank
annotations. -
extract_cds
: From a sequence alignment infasta
file, extracts a specific region between two nucleotide locations.
Gene overlap occurs when two or more genes are encoded by the same nucleotides. This phenomenon is found in all taxonomic domains, but is particularly common in viruses, where it may provide a mechanism to increase the information content of compact genomes. The presence of overlapping reading frames (OvRFs) can skew estimates of selection based on the rates of non-synonymous and synonymous substitutions, since a substitution that is synonymous in one reading frame may be non-synonymous in another, and vice versa.
To understand the impact of OvRFs on molecular evolution, HexSE implemented a versatile simulation model of nucleotide sequence evolution along a phylogeny with an arbitrary distribution of reading frames in linear or circular genomes. We use a custom data structure to track the substitution rates at every nucleotide site, which is determined by the stationary nucleotide frequencies, transition bias, and the distribution of selection biases (dN/dS) in the respective reading frames.