fix: improve ingredients extraction #8942

Merged
stephanegigandet merged 8 commits into main from fix-3030-wildcards-in-ingredients-list
Sep 8, 2023

Conversation

benbenben2
Collaborator

What

0850003875088

  • Lngredietns, correct typo, ok
  • *Organic ingredients, updated taxonomy, ok

0099482485375

  • text wrongly treated as ingredients, added to phrases_after_ingredients_list (this same text appears in 272 products)

0856500004013, 0856500004037, 0029737700144

  • wrong extraction, re-extracted, ok

0029737700144

  • non-gmo soybean oil is unknown, added in taxonomy, ok

0029737210070
0073872746109
8006013990644
0008005958876
0008005985179
0008005958043
0008005959101

  • contributors question, folksonomy engine. key: ingredient_list:multiple, value: 'nb of ingredients list'

2252559300126

  • *SPICES AND OR VEGETABLE POWDER as last ingredient when that flavor, added to phrases_after_ingredients_list (this text appears in this single product only), ok
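
For readers unfamiliar with phrases_after_ingredients_list, here is a minimal Python sketch of the idea: cut the extracted text at the first phrase known to follow an ingredients list. It is an illustration only, not the Perl code in lib/ProductOpener/Ingredients.pm. The function name `cut_after_ingredients`, the dictionary layout, and the two regex patterns are assumptions made for this example.

```python
import re

# Hypothetical, abbreviated phrase list for illustration only.
# The two entries mirror the phrases discussed in this PR.
PHRASES_AFTER_INGREDIENTS_LIST = {
    "en": [
        r"\*\s*organic ingredients?",
        r"\*\s*spices and\s*/?\s*or vegetable powder",
    ],
}


def cut_after_ingredients(text: str, lang: str) -> str:
    """Drop everything from the first trailing phrase onwards (assumed behavior)."""
    for phrase in PHRASES_AFTER_INGREDIENTS_LIST.get(lang, []):
        # Also swallow the separator (".", "," or ";") that precedes the phrase.
        pattern = r"\s*[.,;]?\s*" + phrase + r".*$"
        match = re.search(pattern, text, flags=re.IGNORECASE | re.DOTALL)
        if match:
            text = text[: match.start()]
    return text.strip()


if __name__ == "__main__":
    print(cut_after_ingredients("Oats, sugar*, salt. *Organic ingredients", "en"))
    print(cut_after_ingredients("Rice, salt, *SPICES AND OR VEGETABLE POWDER", "en"))
```

The first call prints `Oats, sugar*, salt` and the second prints `Rice, salt`. Languages without an entry are returned unchanged apart from whitespace trimming.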

Question

Is there a way to test the image extraction locally?

Related issue(s) and discussion

Comment

Thanks @bredowmax for collecting so many examples. That is a great help.

@benbenben2 benbenben2 self-assigned this Sep 3, 2023
@benbenben2 benbenben2 requested a review from a team as a code owner September 3, 2023 09:09
@github-actions github-actions bot added 🧬 Taxonomies https://wiki.openfoodfacts.org/Global_taxonomies 🥗 Ingredients labels 🥗🔍 Ingredients analysis https://wiki.openfoodfacts.org/Ingredients_Extraction_and_Analysis labels Sep 3, 2023
@codecov-commenter

codecov-commenter commented Sep 3, 2023

Codecov Report

Attention: Patch coverage is 83.33333% with 1 line in your changes missing coverage. Please review.

Project coverage is 46.07%. Comparing base (bda3567) to head (fa8a207).
Report is 1140 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| lib/ProductOpener/Ingredients.pm | 83.33% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8942      +/-   ##
==========================================
+ Coverage   46.03%   46.07%   +0.03%     
==========================================
  Files          64       64              
  Lines       19795    19824      +29     
  Branches     4791     4798       +7     
==========================================
+ Hits         9113     9133      +20     
- Misses       9496     9512      +16     
+ Partials     1186     1179       -7     


Contributor

@stephanegigandet stephanegigandet left a comment

Thank you @benbenben2 , it looks great, but could you add a test for those in unit/ingredients.t so that we don't break it in the future?

@benbenben2
Collaborator Author

Tests

> Thank you @benbenben2 , it looks great, but could you add a test for those in unit/ingredients.t so that we don't break it in the future?

@stephanegigandet, yes, that is what my question was about (whether there is a way to test extraction of the ingredients list from an image). I created tests for the cut_ingredients_text_for_lang subroutine (which is related to picture extraction). They are in a new file called ingredients_extract.t.

%ignore_phrases bug fix

%ignore_phrases were NOT removed from the text. I assume that it was a bug. It is fixed now.
See for example German "inklusive": https://world.openfoodfacts.org/product/5000396000467, also Mirabelle

Additionally, I reviewed and removed 'na|n/a|not applicable' from the list:
-> too many false positives for "na"
-> some false positives for "n/a" (41 values in total)
-> 0 occurrences of "not applicable"

I kept the FR phrases, although the text in the image is often unrelated to ingredients:
non applicable -> 1 occurrence
non concerné -> 15 occurrences

I kept the DE phrases, as they seem fairly consistent:
inklusive -> 29 occurrences
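
To make the %ignore_phrases fix concrete, here is a small Python sketch of the intended behavior: phrases listed for a language are stripped from the text. It is not the actual Perl implementation. The function name `remove_ignore_phrases` and the use of word-boundary matching are assumptions, since the exact matching rule is not shown in this thread. The phrase lists contain only the DE and FR values kept above.

```python
import re

# Phrases kept after the review above (EN entries were removed).
IGNORE_PHRASES = {
    "de": ["inklusive"],
    "fr": ["non applicable", "non concerné"],
}


def remove_ignore_phrases(text: str, lang: str) -> str:
    """Strip ignore phrases for the given language (assumed whole-word matching)."""
    phrases = IGNORE_PHRASES.get(lang)
    if not phrases:
        # No phrases configured for this language: leave the text untouched.
        return text
    pattern = r"\b(?:" + "|".join(re.escape(p) for p in phrases) + r")\b"
    cleaned = re.sub(pattern, " ", text, flags=re.IGNORECASE)
    # Collapse the whitespace left behind by the removal.
    return re.sub(r"\s+", " ", cleaned).strip()


if __name__ == "__main__":
    print(remove_ignore_phrases("Zucker, Kakaobutter inklusive", "de"))
    print(repr(remove_ignore_phrases("non concerné", "fr")))
```

The first call prints `Zucker, Kakaobutter`; the second prints `''`, because the whole text was an ignore phrase. English text passes through unchanged in this sketch because the EN entries were dropped.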

@sonarcloud

sonarcloud bot commented Sep 8, 2023

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 0 (rating A)

No Coverage information
No Duplication information

Contributor

@stephanegigandet stephanegigandet left a comment

Thank you!

@stephanegigandet stephanegigandet merged commit ddd8177 into main Sep 8, 2023
13 checks passed
@stephanegigandet stephanegigandet deleted the fix-3030-wildcards-in-ingredients-list branch September 8, 2023 13:30
@alexgarel
Member

Amazing @benbenben2 ! 💪

Labels
🥗🔍 Ingredients analysis https://wiki.openfoodfacts.org/Ingredients_Extraction_and_Analysis 🥗 Ingredients labels 🧬 Taxonomies https://wiki.openfoodfacts.org/Global_taxonomies 🧪 tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ignore explanations at end of list of ingredients like ". *Organic Ingredients"
4 participants