Releases: DessimozLab/omamer
v2.0.4
What's Changed
- [FIX] freeze numpy dependency to <2 (issue #34)
- [ADD] experimental support for building omamer databases from orthoxml/fasta files
- Bump pypa/gh-action-pypi-publish from 1.8.12 to 1.9.0 by @dependabot in #33
Full Changelog: v2.0.3...v2.0.4
v2.0.3
v2.0.2
- changed method for hiding taxa in build process. Now takes a file containing taxa to hide on separate lines.
- checks and improved feedback for root taxon and requested taxa to hide.
- root taxon set by default to the root level in speciestree.nwk (previously hard-coded to default to LUCA)
v2.0.1
What's Changed
- remove dependency for filehash library
- return a better error message when trying to build an omamer database without the build dependencies installed
- minor fixes
- Bump actions/checkout from 3 to 4 by @dependabot in #24
Full Changelog: v2.0.0...v2.0.1
v2.0.0
- Major update of database format and search code to improve overall memory usage. Most standard runs with a LUCA-level database will run on a machine with 16GB RAM.
- Update to the scoring algorithm for root-level HOG / family assignments, to allow for significance testing. This estimates a binomial distribution for each family, so that we can compute the probability of matching at least as many k-mers as we have observed by chance, for each family that has a match to a given query.
- UX improvements - more feedback during interactive search runs, whilst maintaining small log files.
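The significance test described in the scoring bullet can be illustrated with a small sketch. This is not OMAmer's implementation: the function names, the per-family match probabilities and the threshold value are all hypothetical, and it assumes k-mers match independently. It computes the probability of matching at least as many k-mers as observed by chance, then keeps only the families whose probability falls below a chosen threshold.

```python
import math


def binomial_tail(n_kmers: int, n_matched: int, p_match: float) -> float:
    """P(X >= n_matched) for X ~ Binomial(n_kmers, p_match)."""
    if n_matched <= 0:
        return 1.0
    return sum(
        math.comb(n_kmers, k) * p_match**k * (1.0 - p_match) ** (n_kmers - k)
        for k in range(n_matched, n_kmers + 1)
    )


def significant_families(n_kmers, matches, p_by_family, alpha):
    """Keep families whose chance-match probability is at most alpha.

    matches: family -> number of query k-mers found in that family
    p_by_family: family -> assumed probability that a random k-mer matches
    """
    kept = {}
    for family, n_matched in matches.items():
        p_value = binomial_tail(n_kmers, n_matched, p_by_family[family])
        if p_value <= alpha:
            kept[family] = p_value
    return kept


# Hypothetical query with 50 k-mers and two candidate families.
matches = {"small_family": 20, "huge_family": 28}
p_by_family = {"small_family": 0.01, "huge_family": 0.539}
kept = significant_families(50, matches, p_by_family, alpha=1e-6)
print(sorted(kept))
```

In this made-up example the very large family has more raw k-mer matches, yet it is filtered out because that many matches are expected by chance, whereas the smaller family is retained.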
Brief overview of major changes to OMAmer
The OMAmer placement algorithm consists of two steps: placing a query sequence into a protein family (root level HOG in OMA), before placing it into a sub-family. The original OMAmer publication focused on providing better and faster subfamily-level assignments than methods based on closest-sequence. Recently, the group has developed OMArk, a software package for proteome (protein-coding gene repertoire) quality assessment. The original OMAmer method was developed using a smaller taxonomic range than required for OMArk, which meant that the largest gene families were much smaller and less diverse in k-mer content. The largest HOG in OMA (November 2022 release) contains over 101,000 proteins and represents 53.9% of the k-mer index, based on the 6-mers that OMAmer uses by default. This means that a random protein sequence is very likely to be associated with this HOG.
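As a rough illustration of the scale involved, the sketch below estimates how many k-mers of a random sequence would hit the largest HOG by chance. It makes two simplifying assumptions that are not taken from OMAmer itself: that the 53.9% share of the k-mer index can be read as the probability that a single random 6-mer matches this HOG, and that overlapping k-mers match independently. The function names and the 300-residue query length are hypothetical.

```python
def n_kmers(seq_len: int, k: int = 6) -> int:
    """Number of overlapping k-mers in a sequence of length seq_len."""
    return max(seq_len - k + 1, 0)


def expected_chance_matches(seq_len: int, p_match: float, k: int = 6) -> float:
    """Expected number of k-mers matching by chance (binomial mean n * p)."""
    return n_kmers(seq_len, k) * p_match


def prob_at_least_one_match(seq_len: int, p_match: float, k: int = 6) -> float:
    """Probability that at least one k-mer matches by chance."""
    return 1.0 - (1.0 - p_match) ** n_kmers(seq_len, k)


# Hypothetical 300-residue random sequence against the largest HOG.
P_LARGEST_HOG = 0.539  # share of the k-mer index quoted above, used as p
print(n_kmers(300))
print(expected_chance_matches(300, P_LARGEST_HOG))
print(prob_at_least_one_match(300, P_LARGEST_HOG))
```

Under these assumptions, more than half of the k-mers of an unrelated sequence would be expected to match the largest HOG, which is why a raw match count cannot be used on its own.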
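To account for such chance matches, we developed a scoring mechanism based on the binomial distribution. For each family, we estimated the probability of a random k-mer matching. We can then compute the probability of matching, by chance, at least as many k-mers as were observed for the query.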
This is then used to filter the list of families which have an overlap with the query sequence (argument “--family-alpha”, default
v0.2.6
v0.2.5
small patch release that fixes an issue with the previous version when building a new database from scratch.
Full Changelog: v0.2.4...v0.2.5