Outlier detection #11006

aleene · 2024-11-11T19:25:48Z

Problem

When analysing the nutritional values of a category, we often find outliers: products whose nutritional values do not fit the category. This can have several causes: the values were transcribed incorrectly, the product belongs to another category, the values are given per serving, or the values are simply unrealistic.

Proposed solution

These outliers are easy to find visually in a scatterplot of two nutritional values: an outlier lies far from the other products. It would however be nice to detect them automatically, which can be done by using the statistics of a category.

By using the 10% and 90% percentiles of the category we can define minimum and maximum outlier limits. These limits define the nutritional envelope of a category. The usual rule is based on the quartiles and the interquartile range, but the same can be done with these percentiles. First calculate the distance between the 90% and 10% percentiles. This interpercentile range (IPR) is then used to define the lower envelope, the 10% percentile minus the IPR (or zero if that is negative), and the upper envelope, the 90% percentile plus the IPR (or 100 if that is exceeded).

Any product above the upper limit or below the lower limit should be looked at in detail and, if possible, repaired (values or category). This outlier detection should be automated and result in a data quality error. The product can then be repaired or flagged (folksonomy).
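A minimal sketch of the envelope calculation described above, in Python, for one nutrient within one category. The function names, the product dictionaries and their `code` values are hypothetical and only serve the illustration. The sketch assumes values are in g per 100 g, so that clamping to 0 and 100 makes sense, and it assumes linear interpolation between data points for the percentiles.

```python
from statistics import quantiles


def nutritional_envelope(values, floor=0.0, ceiling=100.0):
    """Return the (lower, upper) outlier limits for a list of nutrient values.

    Assumption: values are in g per 100 g, hence the 0..100 clamp.
    """
    # quantiles(..., n=10) returns the 9 decile cut points;
    # the first is the 10% percentile, the last the 90% percentile.
    deciles = quantiles(values, n=10, method="inclusive")
    p10, p90 = deciles[0], deciles[-1]
    ipr = p90 - p10  # interpercentile range
    lower = max(floor, p10 - ipr)
    upper = min(ceiling, p90 + ipr)
    return lower, upper


def find_outliers(products, nutrient):
    """Return the products whose value for `nutrient` lies outside the envelope."""
    values = [p[nutrient] for p in products if nutrient in p]
    lower, upper = nutritional_envelope(values)
    return [
        p for p in products
        if nutrient in p and not (lower <= p[nutrient] <= upper)
    ]


# Made-up example data for a single category (not real products).
sugars = [8, 9, 10, 10, 11, 11, 12, 12, 13, 14, 60]
products = [
    {"code": f"example-{i}", "sugars_100g": v} for i, v in enumerate(sugars)
]

lower, upper = nutritional_envelope(sugars)
outliers = find_outliers(products, "sugars_100g")
print(lower, upper)                    # envelope limits
print([p["code"] for p in outliers])   # products to review
```

In this made-up data set the 10% and 90% percentiles are 9 and 14, so the IPR is 5 and the envelope runs from 4 to 19. Only the product with 60 g of sugars falls outside it. Nutrients that are not expressed in g per 100 g, such as energy, would need a different ceiling.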

Additional context

It is impossible to apply this to the entire database: it would result in far too many errors. It is better to start with selected, cleaned-up categories. These cleaned-up categories could be marked by a flag in the taxonomy.

aleene added the ✨ Feature Features or enhancements to Open Food Facts server label Nov 11, 2024

teolemon added this to 🍊 Open Food Facts Server issues Nov 11, 2024

github-project-automation bot moved this to To discuss and validate in 🍊 Open Food Facts Server issues Nov 11, 2024

CharlesNepote added the averages by categories (Generating & Leveraging average nutrition values by category) and 🧽 Data quality (https://wiki.openfoodfacts.org/Quality) labels Nov 13, 2024

github-project-automation bot moved this to To do in 🧽 Ensuring Data Quality Nov 13, 2024

github-project-automation bot added this to 🧽 Ensuring Data Quality Nov 13, 2024
