Outlier detection #11006

Open
aleene opened this issue Nov 11, 2024 · 0 comments
Labels
averages by categories Generating & Leveraging average nutrition values by category 🧽 Data quality https://wiki.openfoodfacts.org/Quality ✨ Feature Features or enhancements to Open Food Facts server


aleene commented Nov 11, 2024

Problem

When analysing the nutritional values of a category, we often find outliers: products whose nutritional values do not match the rest of the category. There are several possible causes: the values were transcribed incorrectly, the product belongs to another category, the values are given per serving, or the product has unrealistic values.

Proposed solution

These outliers are easy to spot visually in a scatterplot of two nutritional values: an outlier lies far from the other products. It would be better, however, to detect them automatically, which can be done using the statistics of the category. From the 10th and 90th percentiles of the category we can define minimum and maximum outlier limits. Any product above or below these limits should be examined in detail and, if possible, repaired (values or category). The limits define the nutritional envelope of the category. Outlier limits are usually based on the quartiles, but the same approach works with percentiles. First calculate the distance between the 90th and 10th percentiles. This interpercentile range (IPR) then defines the lower envelope, the 10th percentile minus the IPR (or zero if that is negative), and the upper envelope, the 90th percentile plus the IPR (or 100 if that is higher). This outlier detection should be automated and result in a data quality error. The product can then be repaired or flagged (folksonomy).
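
The envelope calculation could be sketched roughly as follows. This is only an illustration in Python, not the server implementation: the function names `nutrient_envelope` and `find_outliers`, the product dictionaries, and the `sugars_100g` field used in the usage note are assumptions made for the example. The clamping to 0 and 100 follows the "(or zero)" and "(or 100)" in the description, and presumes values expressed in g per 100 g; the choice of the "inclusive" percentile method is also an assumption.

```python
from statistics import quantiles


def nutrient_envelope(values, lower_bound=0.0, upper_bound=100.0):
    """Return the (low, high) outlier limits for one nutrient of a category.

    The bounds default to 0 and 100, which assumes values in g per 100 g
    (hypothetical defaults, taken from the "(or zero)" / "(or 100)" wording).
    """
    # quantiles(n=10) returns the nine decile cut points:
    # the first is the 10th percentile, the last is the 90th percentile.
    deciles = quantiles(values, n=10, method="inclusive")
    p10, p90 = deciles[0], deciles[-1]

    # Interpercentile range (IPR): distance between the 90th and 10th percentiles.
    ipr = p90 - p10

    # Extend one IPR on each side, clamped to the allowed range.
    low = max(lower_bound, p10 - ipr)
    high = min(upper_bound, p90 + ipr)
    return low, high


def find_outliers(products, nutrient):
    """Return the products whose value for `nutrient` lies outside the envelope."""
    # Products without a value for this nutrient are ignored.
    with_value = [p for p in products if p.get(nutrient) is not None]
    low, high = nutrient_envelope([p[nutrient] for p in with_value])
    return [p for p in with_value if not (low <= p[nutrient] <= high)]
```

Calling `find_outliers(products, "sugars_100g")` on the products of one category would return the candidates to review. Note that the outliers themselves are included when the percentiles are computed, which is one reason to start from categories that have already been cleaned up.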

Additional context

It is not practical to apply this to the entire database at once: it would result in way too many errors. It is better to start with selected, cleaned-up categories. These cleaned-up categories could be marked by a flag in the taxonomy.

@aleene aleene added the ✨ Feature Features or enhancements to Open Food Facts server label Nov 11, 2024
@github-project-automation github-project-automation bot moved this to To discuss and validate in 🍊 Open Food Facts Server issues Nov 11, 2024
@CharlesNepote CharlesNepote added averages by categories Generating & Leveraging average nutrition values by category 🧽 Data quality https://wiki.openfoodfacts.org/Quality labels Nov 13, 2024
Projects
Status: To discuss and validate
Status: To do