v2.0.0
- Major update of the database format and search code to reduce overall memory usage. Most standard runs with a LUCA-level database will run on a machine with 16GB RAM.
- Update to the scoring algorithm for root-level HOG / family assignments, to allow for significance testing. This estimates a binomial distribution for each family so that, for each family that has a match to a given query, we can compute the probability of matching at least as many k-mers as we observed purely by chance.
- UX improvements - more feedback during interactive search runs, whilst maintaining small log files.
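
The significance test described in the scoring item above can be illustrated with a minimal sketch. This is not OMAmer's implementation: the function names (`binom_sf`, `filter_families`), the per-family match probabilities, the hit counts, and the threshold value are all hypothetical and chosen only for illustration. The sketch assumes each k-mer of the query matches a family independently with a fixed per-family probability, and that a family is kept when its chance-match probability falls below the threshold.

```python
import math


def binom_sf(k_obs: int, n: int, p: float) -> float:
    """Return P(X >= k_obs) for X ~ Binomial(n, p)."""
    if k_obs <= 0:
        return 1.0
    if k_obs > n:
        return 0.0
    return sum(
        math.comb(n, k) * p**k * (1.0 - p) ** (n - k)
        for k in range(k_obs, n + 1)
    )


def filter_families(hits, n_kmers, match_prob, alpha):
    """Keep families whose chance-match probability is below alpha.

    hits:       family id -> number of query k-mers matching that family
    n_kmers:    number of k-mers in the query
    match_prob: family id -> assumed probability that a random k-mer matches
    alpha:      significance threshold (illustrative value, not a documented default)
    """
    kept = {}
    for family, k_obs in hits.items():
        p_value = binom_sf(k_obs, n_kmers, match_prob[family])
        if p_value < alpha:
            kept[family] = p_value
    return kept


# Hypothetical query with 100 k-mers and two candidate families.
hits = {"HOG_big": 55, "HOG_small": 12}
match_prob = {"HOG_big": 0.5, "HOG_small": 0.01}
kept = filter_families(hits, n_kmers=100, match_prob=match_prob, alpha=1e-6)
print(kept)
```

In this toy example the large family matches more k-mers in absolute terms, but that count is close to what chance alone would produce, so only the small family passes the filter.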
Brief overview of major changes to OMAmer
The OMAmer placement algorithm consists of two steps: placing a query sequence into a protein family (root-level HOG in OMA), then placing it into a subfamily. The original OMAmer publication focused on providing better and faster subfamily-level assignments than methods based on the closest sequence. Recently, the group has developed OMArk, a software package for proteome (protein-coding gene repertoire) quality assessment. The original OMAmer method was developed using a smaller taxonomic range than OMArk requires, which meant that the largest gene families were much smaller and less diverse in k-mer content. The largest HOG in OMA (November 2022 release) contains over 101,000 proteins and represents 53.9% of the k-mer index, based on the 6-mers that OMAmer uses by default. This means that a random protein sequence is very likely to be associated with this HOG.
To account for this, we developed a scoring mechanism based on the binomial distribution. For each family, we estimated the probability of a random k-mer matching. We can then compute the probability of matching at least as many k-mers as we have observed by chance.
This is then used to filter the list of families which have an overlap with the query sequence (argument “--family-alpha”, default