
pangolin v4.0

@aineniamh aineniamh released this 01 Apr 12:51
· 109 commits to master since this release
20eb73e

Release notes

pangolin has recently had a major code overhaul, which should help with maintainability going forward. Several of the changes will affect users, and I wanted to flag them here before the release:

  • Notably, the default mode is shifting from pangoLEARN to UShER. If you routinely run large numbers of sequences through pangolin, be aware that this update will slow pangolin down on large datasets. You may want to consider parallelisation, using the optional UShER assignment cache file (accessed with the --add-assignment-cache and --use-assignment-cache flags), or using the --analysis-mode pangoLEARN flag.
  • The pangoLEARN model now being trained is a random forest rather than a decision tree, so the confidence scores reflect the assignment probability from the random forest model, rather than the number of suitable categories as was the case with the decision tree model.
  • Changes to dependencies: We’re rationalising the pangoLEARN repository and the file accessed from pango-designation into a single repository called pangolin-data, so pangoLEARN and pango-designation are no longer needed as dependencies.
  • Changes to versioning: pangolin-data will share its version number with the pango-designation tag, which is also the lineages version in the UShER protobuf file and the pangoLEARN model. This gives a less convoluted versioning system than before.
  • There’s been confusion around the word None when a sequence fails to be assigned a lineage, so we’ve changed the reporting from None to Unassigned, which we think is clearer.
  • The output csv file has changed slightly. We’ve tried to keep it as consistent as possible, but the versioning has changed, so the columns reflect that. We’ve added the pangolin version and moved QC notes into their own column, separate from other notes, to make it easier to access specific bits of information. The new output format is described here: https://cov-lineages.org/resources/pangolin/output.html
  • Because the data files will now live in a different repository (pangolin-data), there isn’t any natural backwards compatibility. We intend to keep training the decision tree model for the next couple of releases, giving a buffer period in which the data for 3.0 and 4.0 are both available with their respective inputs. We have been training the decision tree model for a tandem pangoLEARN and pangolin-data release this week. This will likely be phased out eventually as everyone migrates to v4.0.
  • Expanded lineage flag feature from issue #261 by @baileyglen
  • Optional use of the assignment cache to speed up UShER, with pangolin --use-assignment-cache.
  • The minimap2 mapping command has been altered slightly: it now runs with --score-N=0 and the -x asm20 preset.
  • Alignment, sequence QC, scorpio classification and designation hash assignment are now run on all sequences (previously these steps were only run on sequences that passed QC). Bear in mind that low quality sequences may fail to map, and that sequences that fail QC will still not receive a final lineage call, but information about the calls scorpio provides may be available.
  • When updating, please bear in mind that installation via this repo will always provide the latest version (check the tagged releases). Installation through bioconda may lag behind before the latest version is available. Similarly, the pangolin web application is maintained by the Centre for Genomic Pathogen Surveillance (for which we are very grateful!), and users should be aware that there may also be a short lag before it is updated.
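
For the parallelisation suggested in the first point above, one possible approach is to split the input FASTA into chunks and run pangolin on each chunk separately. The following is a minimal Python sketch of the splitting step only. The function names read_fasta_records and chunk_records and the chunk size are hypothetical choices for illustration and are not part of pangolin.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def read_fasta_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield one FASTA record (header line plus its sequence lines) at a time."""
    record: List[str] = []
    for line in lines:
        # A new header closes the record collected so far.
        if line.startswith(">") and record:
            yield "".join(record)
            record = []
        # Skip blank lines; make sure every kept line ends with a newline.
        if line.strip():
            record.append(line if line.endswith("\n") else line + "\n")
    if record:
        yield "".join(record)


def chunk_records(records: Iterable[str], chunk_size: int) -> Iterator[List[str]]:
    """Group records into lists holding at most chunk_size records each."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


# Small in-memory example; in practice the lines would come from an open file.
example = ">seq1\nACGT\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n"
chunks = list(chunk_records(read_fasta_records(example.splitlines(keepends=True)), 2))
```

Each chunk could then be written to its own FASTA file and passed to a separate pangolin process, with the per-chunk csv outputs concatenated afterwards. How those processes are launched (a job scheduler, a workflow manager, a process pool) depends on your environment and is left out of this sketch.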

How to report issues

Please report any software issues to this repository and we will address them as soon as possible. Please report any lineage misassignments or new lineage requests to pango-designation.