Release notes

A number of major modifications to the training pipeline that produces the machine learning model
Return to a random forest classifier with new parameters num_estimators=12, max_features=0.05 and min_samples_split=10
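
As a rough illustration, the parameters above could be passed to a random forest as sketched below. The notes do not name the library, so this assumes scikit-learn's RandomForestClassifier, and it assumes that num_estimators corresponds to scikit-learn's n_estimators argument. RF_PARAMS and build_classifier are hypothetical names used only for this example.

```python
# Hyperparameters taken from the release notes.
# Assumption: "num_estimators" maps to scikit-learn's "n_estimators".
RF_PARAMS = {
    "n_estimators": 12,       # number of trees in the forest
    "max_features": 0.05,     # fraction of features considered at each split
    "min_samples_split": 10,  # minimum samples needed to split an internal node
}


def build_classifier():
    """Return an untrained random forest configured with RF_PARAMS.

    The import is done lazily so that the parameter dictionary can be
    inspected without scikit-learn installed.
    """
    from sklearn.ensemble import RandomForestClassifier

    return RandomForestClassifier(**RF_PARAMS)
```

Calling build_classifier() returns an unfitted estimator; fitting it on the curated training set is a separate step.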

Curation description

Match designations with GISAID
Extract spike sequences from whole genomes
Identify spike sequences with no ambiguity
Call spike variants (SNPs, insertions and deletions)
Get variant counts per lineage
Calculate which variants occur at a frequency of at least 0.60 per lineage
Merge lineages into sets based on overlapping mutation thresholds
Calculate set names and precision (updated code, now calculates recombinant lineage precision accurately)
Translate amino acid mutations to nucleotide positions
If a minor mutation occurs at a position that conflicts with the consensus spike haplotype (CSH) of another lineage set, mask it out
Create a sequence hash to only supply unique sequences to the model for training
If any sequences contain fewer than 60% of the CSH mutations for a given lineage set, do not put them forward for training
After all the filtering steps and sequence hashing, remove any lineage sets from training that have fewer than 5 representative sequences
Run random forest training on the final set of lineage sets
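
The thresholding, hashing and filtering steps in the list above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the pipeline's actual code: every function and constant name is hypothetical, the 0.60 frequency is treated as an inclusive lower bound, SHA-256 stands in for whatever hash the pipeline uses, and a duplicated sequence is kept only for the first lineage set it appears under.

```python
import hashlib
from collections import Counter, defaultdict

FREQ_THRESHOLD = 0.60    # variant frequency per lineage (from the notes)
MIN_CSH_FRACTION = 0.60  # minimum share of CSH mutations a sequence must carry
MIN_SEQUENCES = 5        # minimum representative sequences per lineage set


def consensus_variants(variant_sets):
    """Return the variants present in at least FREQ_THRESHOLD of the sequences.

    variant_sets holds one collection of variant labels per sequence.
    Assumption: the threshold is inclusive (>=).
    """
    total = len(variant_sets)
    counts = Counter(v for variants in variant_sets for v in set(variants))
    return {v for v, n in counts.items() if n / total >= FREQ_THRESHOLD}


def sequence_hash(seq):
    """Hash a sequence string so that duplicates can be detected cheaply."""
    return hashlib.sha256(seq.encode("ascii")).hexdigest()


def select_training_records(records, csh_by_set):
    """Deduplicate sequences, apply the CSH filter, then drop small lineage sets.

    records: iterable of (lineage_set, sequence, variants) tuples.
    csh_by_set: maps each lineage set to its set of CSH mutations.
    """
    seen = set()
    kept = defaultdict(list)
    for lineage_set, seq, variants in records:
        digest = sequence_hash(seq)
        if digest in seen:
            continue  # only unique sequences are put forward
        seen.add(digest)
        csh = csh_by_set[lineage_set]
        if csh and len(csh & set(variants)) / len(csh) < MIN_CSH_FRACTION:
            continue  # sequence carries too few of the CSH mutations
        kept[lineage_set].append(seq)
    # Remove lineage sets with fewer than MIN_SEQUENCES remaining sequences.
    return {s: seqs for s, seqs in kept.items() if len(seqs) >= MIN_SEQUENCES}
```

In this sketch the size filter runs last, so a lineage set is judged on the sequences that survive deduplication and the CSH filter, matching the order given in the list.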