Skip to content

Commit

Permalink
Updating readme for 2.0.0
Browse files Browse the repository at this point in the history
  • Loading branch information
alex-wave committed Oct 20, 2023
1 parent d8cc567 commit 2c295d2
Showing 1 changed file with 92 additions and 54 deletions.
146 changes: 92 additions & 54 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,23 +1,106 @@
# OMAmer
# OMAmer - tree-driven and alignment-free protein assignment to subfamilies

OMAmer is a novel alignment-free protein family assignment method, which limits over-specific subfamily assignments and is suited to phylogenomic databases with thousands of genomes. It is based on an innovative method using evolutionnary-informed k-mers for alignment-free mapping to ancestral protein subfamilies. Whilst able to reject non-homologous family-level assignments, it has provided better and quicker subfamily-level assignments than a method based on closest sequences (using DIAMOND).
OMAmer is a novel alignment-free protein family assignment method, which limits over-specific subfamily assignments and is suited to phylogenomic databases with thousands of genomes. It is based on an innovative method using evolutionary-informed _k_-mers for alignment-free mapping to ancestral protein subfamilies. Whilst able to reject non-homologous family-level assignments, it has provided better and quicker subfamily-level assignments than a method based on closest sequences (using DIAMOND).

# Installation
Requires Python >= 3.6. Download the package from the PyPI, resolving the dependencies by using ``pip install omamer``.
Requires Python >= 3.8. Download the package from the PyPI, resolving the dependencies by using ``pip install omamer``.

Alternatively, clone this repository and install manually.

Note: Python 3.12 is currently not supported, until the `numba` package is updated ([issue](https://github.com/numba/numba/issues/9197)).

# Pre-Built Databases

Pre-built databases are available for the latest OMA release from the [download section on the OMA Browser website](https://omabrowser.org/oma/current).

- LUCA: https://omabrowser.org/All/LUCA.h5
- Metazoa: https://omabrowser.org/All/Metazoa.h5
- _Metazoa: https://omabrowser.org/All/Metazoa.h5
- _Viridiplantae_: https://omabrowser.org/All/Viridiplantae.h5
- _Saccharomyceta_: https://omabrowser.org/All/Saccharomyceta.h5
- _Primates_: https://omabrowser.org/All/Primates.h5

Their names indicate the root-taxon parameter used. Other non-required parameters were left to default.

Note: databases included in the [Zenodo upload](https://zenodo.org/record/4593702) from the manuscript are not supported by the most recent version of OMAmer. We recommend using the most recent release with databases built on the most recent OMA browser release.



# omamer search - Searching a Database
Assign proteins to families and subfamilies in a pre-existing database.
## Usage
Required arguments: ``--db``, ``--query``

usage: omamer search [-h] -d DB -q QUERY [--threshold THRESHOLD] [--family_alpha FAMILY_ALPHA] [-fo] [-n TOP_N_FAMS] [--reference_taxon REFERENCE_TAXON]
[-o OUT] [--include_extant_genes] [-c CHUNKSIZE] [-t {0,1,2,3,4,5,6,7,8}] [--log_level {debug,info,warning}] [--silent]

## Arguments
### Quick reference table

| Short Flag | Flag | Default | Description |
|:-----------|:---------------------|:-----------------------|:------------|
| [``-d``](#markdown-header-d) | [``--db``](#markdown-header--db) || Path to existing database (including filename)
| [``-q``](#markdown-header-q) | [``--query``](#markdown-header--query) || Path to FASTA formatted sequences
| | [``--threshold``](#markdown-header--threshold) | 0.1 | Threshold applied on the OMAmer-score that is used to vary the specificity of predicted HOGs. The lower the theshold the more (over-)specific predicted HOGs will be.
| | [``--family_alpha``](#markdown-header--family_alpha) | 1e-6 | Significance threshold used when filtering families.
| [``-fo``](#markdown-header-fo) | [``--family_only``](#markdown-header--family_only) | False | If set, only place at the family level. Useful for certain analysis. Note: `subfamily_medianseqlen` in the results is for the family level.
| [``-n``](#markdown-header-n) | [``--family_only``](#markdown-header--top_n_fams) | 1 | Number of top level families to place into. By default, placed into only the best scoring family.
<!--| | [``--reference_taxon``](#markdown-header--reference_taxon) || The placement is stopped when reaching a HOG with the reference taxon (must exist in the OMA database). This is a complementary option to vary the specificity of predicted HOGs.-->
| [``-o``](#markdown-header-o) | [``--out``](#markdown-header--db) | stdout | Path to output. If not set, defaults to stdout.
| | [``--include_extant_genes``](#markdown-header--include_extant_genes)||Include extant gene IDs as comma separated entry in results
| [``-c``](#markdown-header-c) | [``--chunksize``](#markdown-header--chunksize) |10000| Number of queries to process at once.
| [``-t``](#markdown-header-t) | [``--nthreads``](#markdown-header--db) |1|Number of threads to use
| | [``--log_level``](#markdown-header--db) |info| Logging level (options debug, info, warning)
| | [``--silent``](#markdown-header--silent) || Set to silence the output.

# Output

Output is in the form of a tab-seperated value file (TSV), with metadata added to the header using ``!<tag>: <value>``. A parser can be imported for further analysis in python as ``from omamer.results_reader import results_reader``.

## Output Columns

#### Query sequence identifier (`qseqid`)
The sequence identifier from the input FASTA-formatted sequences.

#### Predicted HOG identifier (`hogid`)
The identifier of the hierarchical orthologous group (HOG) in OMA, which you can access through the OMA browser search bar or its REST API (https://omabrowser.org/api/docs).

A HOG identifier is composed of the root-HOG identifier (following “HOG:” and before the first dot), which is followed by its sub-HOGs (before each subsequent dot). For example, for subfamily HOG:0487954.3l.27l, HOG:0487954 is the root-HOG (HOG without-parent), HOG:0487954.3l is its child and HOG:0487954.3l.27l its grandchild.

#### Predicted HOG taxonomic level (`hoglevel`)
The taxonomic level that the predicted HOG is defined at.

#### Family p-value (`family_p`)
p-value of having as many or more of k-mers in common under a binomial distribution. Reported in negative natural log units.

#### Family count (`family_count`)
Count of _k_-mers in common with the family / root level HOG.

#### Family normalised count (`family_normcount`)
Family count, normalised by the expected number of hits for the query's sequence length, with the family's _k_-mer content.

#### Sub-Family score (`subfamily_score`)
The OMAmer-score of the predicted HOG. At the subfamily level, this score captures the excess of similarity that is shared between the query and a given HOG, thus excluding the similarity with regions conserved in more ancestral HOGs.

#### Sub-Family count (`subfamily_count`)
Count of _k_-mers in common with the sub-family / HOG.

#### Query sequence length (`qseqlen`)
Count of _k_-mers in common with the sub-family / HOG.

#### Sub-Family median sequence length (`subfamily_medianseqlen`)
Median length of the sequences that are present in the predicted HOG. In the case of family-only placement, this is instead reported at the root-HOG level.

#### Query sequence overlap (`qseq_overlap`)
The proportion of the query sequence overlapping with _k_-mers of reference root-HOGs. This may be helpful to reject partially homologous matches that are problematic in some applications.

#### Sub-Family gene set (`subfamily_geneset`)
Optionally printed (see ``--include_extant_genes``). Comma-seperated list of extant gene IDs of predicted HOG. The [OMA browser](https://omabrowser.org) can be used to find out more information. In particular, using the [REST API](https://omabrowser.org/api/docs), or via the [Python API Client](https://github.com/DessimozLab/pyomadb).

<!-- #### Closest taxon from reference taxon (`closest_taxa`)
The taxon from the predicted HOG that is closest from the reference taxon (given one was provided). This option provides a mean to evaluate the performance of OMAmer placement given some knowledge of the query taxonomy is available.
-->


# omamer mkdb - Building a Database
This is currently reliant on the OMA browser's database file and the species phylogeny of HOGs. Building using OrthoXML files available shortly.
- https://omabrowser.org/All/OmaServer.h5
Expand Down Expand Up @@ -45,57 +128,14 @@ Required arguments: ``--db``, ``--oma_path``
| [``--oma_path``](#markdown-header--oma_path)||Path to a directory with both OmaServer.h5 and speciestree.nwk
| [``--log_level``](#markdown-header--log_level)|info|Logging level

# omamer search - Searching a Database
Assign proteins to families and subfamilies in a pre-existing database.
## Usage
Required arguments: ``--db``, ``--query``

usage: omamer search [-h] --db DB --query QUERY [--score {default,sensitive}] [--threshold THRESHOLD] [--reference_taxon REFERENCE_TAXON] [--out OUT]
[--include_extant_genes] [--chunksize CHUNKSIZE] [--nthreads NTHREADS] [--log_level {debug,info,warning}]

## Arguments
### Quick reference table

| Flag | Default | Description |
|:--------------------|:----------------------|:-----------|
| [``--db``](#markdown-header--db) || Path to existing database (including filename)
| [``--query``](#markdown-header--query) || Path to FASTA formatted sequences
| [``--score``](#markdown-header--score) |default| Type of OMAmer-score to use. Options are "default" and "sensitive".
| [``--threshold``](#markdown-header--threshold) |0.05| Threshold applied on the OMAmer-score that is used to vary the specificity of predicted HOGs. The lower the theshold the more (over-)specific predicted HOGs will be.
| [``--reference_taxon``](#markdown-header--reference_taxon) || The placement is stopped when reaching a HOG with the reference taxon (must exist in the OMA database). This is a complementary option to vary the specificity of predicted HOGs.
| [``--out``](#markdown-header--db) |stdout| Path to output (default stdout)
| [``--include_extant_genes``](#markdown-header--include_extant_genes)||Include extant gene IDs as comma separated entry in results
| [``--chunksize``](#markdown-header--chunksize) |10000| Number of queries to process at once.
| [``--nthreads``](#markdown-header--db) |1|Number of threads to use
| [``--log_level``](#markdown-header--db) |info| Logging level

# Output columns

#### Query sequence identifier
The sequence identifier from the input fasta

#### Predicted HOG identifier
The identifier of the hierarchical orthologous group (HOG) in OMA, which you can access through the OMA browser search bar or its REST API (https://omabrowser.org/api/docs).

A HOG identifier is composed of the root-HOG identifier (following “HOG:” and before the first dot), which is followed by its sub-HOGs (before each subsequent dot). For example, for subfamily HOG:0487954.3l.27l, HOG:0487954 is the root-HOG (HOG without-parent), HOG:0487954.3l is its child and HOG:0487954.3l.27l its grandchild.

#### Closest taxon from reference taxon
The taxon from the predicted HOG that is closest from the reference taxon (given one was provided). This option provides a mean to evaluate the performance of OMAmer placement given some knowledge of the query taxonomy is available.

#### Overlap-score
The fraction of the query sequence overlapping with k-mers of reference root-HOGs. This score aims to help reject partial homologous matches that are problematic in some applications.

#### Family-level OMAmer-score
The OMAmer-score of the predicted root-HOG. At the family level, this score measures the sequence similarity between the query and a given root-HOG.

#### Subfamily-level OMAmer-score
The OMAmer-score of the predicted HOG. At the subfamily level, this score captures the excess of similarity that is shared between the query and a given HOG, thus excluding the similarity with regions conserved in more ancestral HOGs.

#### Subfamily gene set
Extant gene IDs of predicted HOG, which you can look for in the OMA browser search bar or its REST API (https://omabrowser.org/api/docs).

# Change log

#### Version 2.0.0
- Major update of database format and search code to improve overall memory useage. Most standard runs with LUCA-level database will run on a machine with 16GB RAM.
- Update to the scoring algorithm for root-level HOG / family assignments, to allow for significance testing. This estimates a binomial distribution for each family, so that we can compute the probability of matching at least as many k-mers as we have observed by chance, for each family that has a match to a given query.
- UX improvements - more feedback during interactive search runs, whilst maintaining small log files.

#### Version 0.2.5
- Fixes an issue when storing the pre-conputed statistics

Expand Down Expand Up @@ -134,5 +174,3 @@ You should have received a copy of the GNU Lesser General Public License along w
Victor Rossier, Alex Warwick Vesztrocy, Marc Robinson-Rechavi, Christophe Dessimoz, OMAmer: tree-driven and alignment-free protein assignment to subfamilies outperforms closest sequence approaches, Bioinformatics, 2021;, btab219, https://doi.org/10.1093/bioinformatics/btab219

Code used for that paper is available here: [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.4593702.svg)](https://doi.org/10.5281/zenodo.4593702)


0 comments on commit 2c295d2

Please sign in to comment.