hedgehog v1.4.1

@aineniamh aineniamh released this 15 Aug 13:37
b26d43d

Release notes

  • Major modifications to the training pipeline that produces the machine learning model
  • Return to a random forest classifier with new parameters: num_estimators=12, max_features=0.05 and min_samples_split=10
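For orientation, the parameters above can be written down as a small sketch. It assumes the classifier is scikit-learn's RandomForestClassifier, which the notes do not state, and that num_estimators corresponds to scikit-learn's n_estimators keyword. The helper features_per_split and the feature count of 1000 are hypothetical and only illustrate how a fractional max_features is interpreted.

```python
# Parameters as listed in the release notes. Assumption: "num_estimators"
# maps to scikit-learn's "n_estimators" keyword.
RF_PARAMS = {
    "n_estimators": 12,       # number of trees in the forest
    "max_features": 0.05,     # fraction of features tried at each split
    "min_samples_split": 10,  # minimum samples needed to split a node
}


def features_per_split(n_features, max_features=RF_PARAMS["max_features"]):
    """Features considered per split when max_features is a float.

    Mirrors scikit-learn's documented rule for a float max_features:
    max(1, int(max_features * n_features)).
    """
    return max(1, int(max_features * n_features))


# Hypothetical feature count, for illustration only.
print(features_per_split(1000))

# With scikit-learn installed, the parameters could be passed on as:
#   from sklearn.ensemble import RandomForestClassifier
#   model = RandomForestClassifier(**RF_PARAMS)
```

With max_features=0.05, each split looks at only about one feature in twenty, so individual trees are deliberately decorrelated.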

Curation description

  • Match designations with GISAID
  • Extract spike sequences from whole genomes
  • Identify spike sequences with no ambiguity
  • Call spike variants (SNPs, insertions and deletions)
  • Get variant counts per lineage
  • Calculate which variants occur at a frequency of at least 0.60 per lineage
  • Merge lineages into sets based on overlapping mutation thresholds
  • Calculate set names and precision (updated code, now calculates recombinant lineage precision accurately)
  • Translate amino acid mutations to nucleotide positions
  • If a minor mutation occurs at a position that conflicts with the consensus spike haplotype (CSH) of another lineage set, mask it out
  • Create a sequence hash to only supply unique sequences to the model for training
  • If any sequences contain fewer than 60% of the CSH mutations for a given lineage set, do not put them forward for training
  • After all the filtering steps and sequence hashing, remove any lineage sets from training that have fewer than 5 representative sequences
  • Run random forest training on the final set of lineage sets
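The filtering steps in this list can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the hedgehog training code: the function names, the record layout (lineage set, sequence, variants), the use of SHA-256 for the sequence hash, and deduplicating across all lineage sets rather than within each set are all hypothetical choices. Only the three thresholds come from the notes, and the 0.60 frequency is treated here as a minimum.

```python
import hashlib
from collections import Counter, defaultdict

MIN_VARIANT_FREQ = 0.60  # variant frequency per lineage (from the notes)
MIN_CSH_FRACTION = 0.60  # minimum share of CSH mutations (from the notes)
MIN_SEQS_PER_SET = 5     # minimum representative sequences (from the notes)


def consensus_variants(variant_sets, min_freq=MIN_VARIANT_FREQ):
    """Return variants seen in at least min_freq of a lineage's sequences."""
    counts = Counter(v for variants in variant_sets for v in set(variants))
    total = len(variant_sets)
    return {v for v, n in counts.items() if n / total >= min_freq}


def sequence_hash(seq):
    """Hash a sequence so that identical sequences collapse to one key."""
    return hashlib.sha256(seq.upper().encode()).hexdigest()


def select_training_records(records, csh_by_set):
    """Deduplicate, apply the CSH filter, then drop small lineage sets.

    records: iterable of (lineage_set, sequence, variants) tuples.
    csh_by_set: dict mapping lineage_set to its set of CSH mutations.
    """
    seen = set()
    kept = defaultdict(list)
    for lineage_set, seq, variants in records:
        key = sequence_hash(seq)
        if key in seen:
            continue  # only unique sequences go forward
        seen.add(key)
        csh = csh_by_set[lineage_set]
        shared = len(csh & set(variants))
        if csh and shared / len(csh) < MIN_CSH_FRACTION:
            continue  # too few CSH mutations for this lineage set
        kept[lineage_set].append(seq)
    # Remove lineage sets with too few representative sequences.
    return {s: seqs for s, seqs in kept.items() if len(seqs) >= MIN_SEQS_PER_SET}
```

The order follows the list: hashing first, then the CSH filter, then the minimum-count check, so a lineage set is judged on what survives the earlier filters.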