Increase precision or recall

Alessio Milanese edited this page Mar 17, 2019 · 2 revisions

The input parameters that have an effect on the resulting taxonomic profile are:

  • minimum alignment length (-l). Minimum length of the alignment between the read and the marker genes (not the same as the read length). The default value is 75. Higher values produce fewer false positives (fewer reads pass the filter), while lower values recruit more reads, which allows detection of low-abundance bugs at the cost of more false positives. Note that this parameter has to be tuned according to the average read length; we suggest choosing a value between 45 and 100.
  • type of read counts (-y). There are three possible values: base.coverage, insert.raw_counts, insert.scaled_counts (default). The insert.* values count the number of inserts (reads) that map to the gene: raw_counts measures the absolute number of reads, while scaled_counts weights the read counts by the gene length. base.coverage measures the average base coverage of the gene.
  • marker genes cutoff (-g). Every mOTU is composed of 10 marker genes, and the read count of the mOTU is calculated as the median of the read counts of the genes that are different from zero. The parameter -g defines the minimum number of genes that have to be different from zero. The default value is 3 and possible values are between 1 and 10. With -g 1, the detection of a single gene is enough to consider the mOTU present in the sample (this detects low-abundance species but also more false positives). With -g 6, on the other hand, only mOTUs with at least 6 detected genes are counted, which reduces the false positives.
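
The effect of the marker genes cutoff can be illustrated with a minimal Python sketch. It is not the tool's actual code: the function name `motu_read_count` is hypothetical, and returning 0 for a mOTU that falls below the cutoff is an assumption made for illustration.

```python
from statistics import median


def motu_read_count(gene_counts, min_genes=3):
    """Illustrative read count of one mOTU from its marker gene counts.

    gene_counts: read counts of the marker genes of the mOTU (10 values).
    min_genes:   plays the role of -g, the minimum number of genes with a
                 non-zero count (between 1 and 10).
    Assumption: a mOTU below the cutoff is reported with a count of 0.
    """
    if not 1 <= min_genes <= 10:
        raise ValueError("min_genes must be between 1 and 10")
    # Only genes with a non-zero read count contribute to the median.
    detected = [count for count in gene_counts if count > 0]
    if len(detected) < min_genes:
        return 0.0
    return float(median(detected))


# Three of the ten marker genes are detected.
counts = [0, 0, 4, 0, 2, 0, 0, 9, 0, 0]
print(motu_read_count(counts, min_genes=3))  # counted with the default cutoff
print(motu_read_count(counts, min_genes=6))  # filtered out with a stricter cutoff
```

With three detected genes, the mOTU passes the default cutoff but is dropped with -g 6, which is the trade-off between sensitivity and precision described above.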

The result with the highest sensitivity is obtained with -g 1 -l 30 -y base.coverage, which allows detection of low-abundance bugs (at the cost of more false positives).

The result with the highest precision is obtained with -g 6 -l 100 -y insert.scaled_counts, which reduces the false positives to a minimum.
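
The two parameter combinations can be kept as named presets. The following Python sketch only assembles an argument list; the `motus profile -s <reads>` invocation for a single reads file is an assumption about how the tool is called, and `PRESETS`, `build_profile_command` and `sample.fq` are hypothetical names used for illustration.

```python
# Parameter combinations taken from the two recommendations above.
PRESETS = {
    "sensitivity": {"-g": "1", "-l": "30", "-y": "base.coverage"},
    "precision": {"-g": "6", "-l": "100", "-y": "insert.scaled_counts"},
}


def build_profile_command(reads_path, preset):
    """Return the argument list for one profiling run.

    Assumption: the profiler is invoked as `motus profile -s <reads>`;
    adjust the first elements to match your installation.
    """
    if preset not in PRESETS:
        raise ValueError(f"unknown preset: {preset}")
    args = ["motus", "profile", "-s", reads_path]
    for flag, value in PRESETS[preset].items():
        args += [flag, value]
    return args


# Print the command line for the high-precision preset.
print(" ".join(build_profile_command("sample.fq", "precision")))
```

The returned list can be passed to `subprocess.run` if the command should be executed directly from Python.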