Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Duplicate proofs & prices sent to Open Prices #5689

Closed
raphodn opened this issue Oct 10, 2024 · 39 comments
Closed

Duplicate proofs & prices sent to Open Prices #5689

raphodn opened this issue Oct 10, 2024 · 39 comments
Assignees
Labels

Comments

@raphodn
Copy link
Member

raphodn commented Oct 10, 2024

What

I'm seeing regularly dozens of duplicate prices being uploaded to Open Prices.
When looking at the source, it indicated "Smoothie - Open Food Facts"

Some users are reporting that the prices are not uploaded immediately, but stay dormant in the "Pending contributions", then lost or uploaded multiple times.

cc @raphael0202 @tradmangh who've faced this issue

Screenshot/Mockup/Before-After

Image

@monsieurtanuki
Copy link
Contributor

@raphodn I'll have a look at it.
Btw, again, I don't think it makes sense on the server side to keep duplicate prices (same user, location, barcode, price, ...), but that's another story.

@monsieurtanuki monsieurtanuki self-assigned this Oct 11, 2024
@raphodn
Copy link
Member Author

raphodn commented Oct 11, 2024

You're right for the duplicate prices : openfoodfacts/open-prices#422

We need to decide if it's a silent ignore by the server, or if it returns an error. If the latter, then will it be easy to remove from the "Pending contributions" ?

@monsieurtanuki
Copy link
Contributor

@raphodn I guess a duplicate price would rather trigger a warning: you don't add the duplicate price, you return a warning as a String result, and then it's up to me to decide what to do with it in Smoothie.
Unfortunately in Smoothie we don't display anything to the user related to background tasks, so I guess we would just ignore the warning.
If it's an error we would try over and over again...

@monsieurtanuki
Copy link
Contributor

@raphodn I wasn't able to reproduce the current issue systematically, though I have to admit that sometimes I don't understand why the prices weren't added immediately.

What can be done:

  1. try to reproduce the bug in a more systematic way, in order to understand it and to fix it
  2. call more often the "background task run refresh" - so far Smoothie was oriented on products, not that much on prices, so we trigger the re-run mostly on product pages, not on prices pages
  3. implement a specific background task queue for prices and/or for quick "not an image upload" tasks - so far we only have one queue, with complicated business rules specific to product edits that could be simplified for prices
  4. give the user more feedback about the background task queue (like the current download operations in a browser)

@raphael0202
Copy link

raphael0202 commented Oct 11, 2024

I managed to reproduce the issue while adding several prices to an existing proof (that had no prices).
In the task queue, there was an error from the API ("product_id can only have letters or digits", or something similar).

This this price for instance: https://prices.openfoodfacts.org/products/4974507095172

@tradmangh
Copy link

Even though its somehow a bug in the mobile app, it creates a data quality problem. Shouldn´t that be handled by a general data cleaning process, maybe within robotoff?

@monsieurtanuki
Copy link
Contributor

I managed to reproduce the issue while adding several prices to an existing proof (that had no prices). In the task queue, there was an error from the API ("product_id can only have letters or digits", or something similar).

This this price for instance: https://prices.openfoodfacts.org/products/4974507095172

@raphael0202 That's interesting, because it looks like a bug on the server side:

  • that doesn't mean that there are no bugs in Smoothie
  • I've just tested on .net to add a price to 4974507095172 but couldn't find that "product_id" error
  • btw in my tests I've managed to delete a proof while there were still prices attached to it - not nice @raphodn!
  • that may mean that we're not resilient in Smoothie, because if a single "create price" action fails, we replay the whole "add proof + add all prices" action => I can improve that, splitting it in different background tasks.

One price without a proof, one price with a proof:
Image

@raphodn
Copy link
Member Author

raphodn commented Oct 11, 2024

Fixing the "delete a proof with prices" issue ASAP.
edit : done here : openfoodfacts/open-prices#517

But it's not related to our problem in hand ^^

@raphodn
Copy link
Member Author

raphodn commented Oct 11, 2024

I also created an issue to detect/ignore duplicate proofs, but it'll be a bit more complex to implement (we'll need to calculate and store the image hash) : openfoodfacts/open-prices#514

@monsieurtanuki
Copy link
Contributor

there was an error from the API ("product_id can only have letters or digits", or something similar).

@raphodn Any idea what it's about?

@monsieurtanuki
Copy link
Contributor

Could it be related to non-food products?
Image

@raphodn
Copy link
Member Author

raphodn commented Oct 11, 2024

product_id can only have letters or digits

It's strange, I tried to add prices with strange product_codes (with spaces, with padded 0, with X at the end) and they all work.
The Price.product_id is the Foreign Key that points to the Product db, it can be None or an ID.
If a product_id field is passed in the POST body it is simply ignored.

and how 4974507095172 looks like in the prod db

{
'_state': <django.db.models.base.ModelState at 0x7f9254b92c50>,
 'id': 3421508,
 'code': '4974507095172',
 'source': None,
 'source_last_synced': None,
 'product_name': None,
 'image_url': None,
 'product_quantity': None,
 'product_quantity_unit': None,
 'categories_tags': [],
 'brands': None,
 'brands_tags': [],
 'labels_tags': [],
 'nutriscore_grade': None,
 'ecoscore_grade': None,
 'nova_group': None,
 'unique_scans_n': 0,
 'price_count': 3,
 'location_count': 1,
 'user_count': 1,
 'proof_count': 1,
 'created': datetime.datetime(2024, 8, 19, 17, 14, 38, 104732, tzinfo=datetime.timezone.utc),
 'updated': None
}

@monsieurtanuki
Copy link
Contributor

@raphodn As an old-timer that still works with RDBMS, let me ask you this: is your "create single price" operation atomic?
I mean, if anything fails during the operation, do you roll back any database modification? (e.g. price addition)
Because if you do add a price and still return an error, that's normal for me to replay over and over again the "add price". Which will bring you hundreds of Katzenzungen like this: https://prices.openfoodfacts.org/products/4068706054488

@monsieurtanuki
Copy link
Contributor

@raphodn Looks like you send some null fields in your "product" answer while I expect at least an empty array, e.g. for label_tags. That's what I understand if I see the docs.
That would explain a lot: correctly adding a price, sending back the price with the product, and we cannot read the product, which causes an exception. And we try again later.
Let me try to fix this on the off-dart side.
Anyway, if we want to fix this quickly this needs to be addressed on the server side too.

@raphodn
Copy link
Member Author

raphodn commented Oct 11, 2024

The product doesn't exist in OFF : https://fr.openfoodfacts.org/produit/4974507095172
So it's normal that there are plenty of None or empty fields 🤷

https://prices.openfoodfacts.org/api/v1/products/code/4974507095172

When adding a price, we check if the product_code is associated with an existing Product (in the Open Prices Product table / as we sync daily with the OFF products)

  • if yes, we set the product_id
  • if no, we create it (and try to fetch in the background from OFF API ; and hope that one day a contributor will add it in OFF, and we'll populate it in a future sync)

What's missing here is a an actual POST body that reproduces the "product_id can only have letters or digits" error.

@monsieurtanuki
Copy link
Contributor

So it's normal that there are plenty of None or empty fields 🤷

Agreed, but I expected an empty array instead of null for labels_tags.

the "product_id can only have letters or digits"

Btw I expected it to be an int, which probably excludes letters, doesn't it?

we sync daily with the OFF products

Actually, you may have to check in Beauty, Pet Food and Products too.

@raphodn
Copy link
Member Author

raphodn commented Oct 11, 2024

Agreed, but I expected an empty array instead of null for labels_tags.

Where do you see that ? in both the message above and the API response it displays an empty list.

Btw I expected it to be an int, which probably excludes letters, doesn't it?

Yes the product_id should the Product DB internal ID. it's not set by the user anywhere. So I still don't see where the issue comes from.. ^^

Actually, you may have to check in Beauty, Pet Food and Products too.

we already do :) and store the flavor in the source field
see https://prices.openfoodfacts.org/products?source=opf for example

@tradmangh
Copy link

Another one that needs to get cleaned up ... https://prices.openfoodfacts.org/users/professordoc

@monsieurtanuki
Copy link
Contributor

@raphodn That's my interpretation: brands can be null and brands_tags cannot (but can be empty).
Image

Anyway, in order to avoid those kinds of misunderstandings, should I switch all fields as nullable?

@monsieurtanuki
Copy link
Contributor

@raphodn Not sure if you meant "labels_tags cannot be null with the current implementation", therefore here's an example:

{
    "id": 7095,
    "product_id": 3482592,
    "location_id": 3129,
    "proof_id": 3317,
    "product": {
        "id": 3482592,
        "code": "4068706054488",
        "source": "off",
        "product_name": "Katzenzungen",
        "image_url": null,
        "product_quantity": 0,
        "product_quantity_unit": null,
        "categories_tags": [
            "en:snacks",
            "en:sweet-snacks",
            "en:biscuits-and-cakes",
            "en:biscuits-and-crackers",
            "en:biscuits",
            "en:cat-tongue"
        ],
        "brands": "Choceur, Stollwerck",
        "brands_tags": [
            "choceur",
            "stollwerck"
        ],
        "labels_tags": null,
        "nutriscore_grade": "unknown",
        "ecoscore_grade": "c",
        "nova_group": null,
        "unique_scans_n": 0,
        "price_count": 1,
        "location_count": 0,
        "user_count": 0,
        "proof_count": 0,
        "created": "2024-09-24T17:14:12.605450Z",
        "updated": "2024-09-24T17:14:12.697090Z"
    },
    "location": {
        "id": 3129,
        "type": "OSM",
        "osm_id": 4966187139,
        "osm_type": "NODE",
        "osm_name": "Carrefour Market",
        "osm_display_name": "Carrefour Market, 37, Rue de Lyon, Quartier des Quinze-Vingts, Paris 12e Arrondissement, Paris, Île-de-France, France métropolitaine, 75012, France",
        "osm_tag_key": null,
        "osm_tag_value": null,
        "osm_address_postcode": "75012",
        "osm_address_city": "Paris",
        "osm_address_country": "France",
        "osm_address_country_code": null,
        "osm_lat": 48.8484997,
        "osm_lon": 2.3709505,
        "website_url": null,
        "price_count": 29,
        "user_count": 3,
        "product_count": 4,
        "proof_count": 17,
        "created": "2024-02-14T14:46:34.432318Z",
        "updated": "2024-08-28T00:53:19.638393Z"
    },
    "proof": {
        "id": 3317,
        "location_id": null,
        "file_path": "0004/Gti5y7BA52.jpg",
        "mimetype": "image/jpeg",
        "type": "PRICE_TAG",
        "image_thumb_path": "0004/Gti5y7BA52.400.jpg",
        "location_osm_id": null,
        "location_osm_type": null,
        "date": "2024-01-02",
        "currency": "USD",
        "price_count": 0,
        "owner": "openfoodfacts-dart",
        "created": "2024-10-11T14:01:49.534898Z",
        "updated": "2024-10-11T14:01:49.736850Z"
    },
    "product_code": "4068706054488",
    "product_name": null,
    "category_tag": null,
    "labels_tags": null,
    "origins_tags": null,
    "price": 3.99,
    "price_is_discounted": false,
    "price_without_discount": null,
    "price_per": null,
    "currency": "USD",
    "location_osm_id": 4966187139,
    "location_osm_type": "NODE",
    "date": "2024-01-02",
    "owner": "openfoodfacts-dart",
    "created": "2024-10-11T14:01:50.029748Z",
    "updated": "2024-10-11T14:01:50.037938Z"
}

@monsieurtanuki
Copy link
Contributor

Another one that needs to get cleaned up ... https://prices.openfoodfacts.org/users/professordoc

@tradmangh Well, the guy obviously loves Katzenzungen: who are we to judge? 😉

@raphodn
Copy link
Member Author

raphodn commented Oct 11, 2024

Not sure if you meant "labels_tags cannot be null with the current implementation", therefore here's an example:

I see... OFF doesn't return that field for this product_code 😅 : https://world.openfoodfacts.org/api/v2/product/4068706054488
edit : got it wrong, but here's other examples :

so indeed the default value isn't managed well during the sync with OFF 😬

@raphodn
Copy link
Member Author

raphodn commented Oct 11, 2024

Another one that needs to get cleaned up ... https://prices.openfoodfacts.org/users/professordoc

I'll run some scripts tonight or this weekend to cleanup these (and others) duplicates

@monsieurtanuki
Copy link
Contributor

I'll run some scripts tonight or this weekend to cleanup these (and others) duplicates

Of course you know that as long as professordoc uses the app, Katzenzungen will be added something like every 5 seconds.
The first thing to do would be for you to fix how you return labels_tags, with an "if null return [ ]" clause.
Let me check if that would be enough, at least for that product.

The gods forced him to roll an immense boulder up a hill only for it to roll back down every time it neared the top, repeating this action for eternity.
https://en.wikipedia.org/wiki/Sisyphus

@monsieurtanuki
Copy link
Contributor

"Il faut imaginer Sisyphe heureux"
Albert Camus

@raphodn
Copy link
Member Author

raphodn commented Oct 11, 2024

The first thing to do would be for you to fix how you return labels_tags, with an "if null return [ ]" clause.

How is this linked to the current issue ?

@monsieurtanuki
Copy link
Contributor

How is this linked to the current issue ?

I explained that quickly above, let's focus on that again.

You manage to add a price, and return the price. Cool.
Except that the price you return is not "valid", because the product inside is not valid (null labels_tags).
Therefore I consider it failed:

  • I'll try later
  • each time, you add a price

@monsieurtanuki
Copy link
Contributor

@raphodn I confirm that with a [ ] instead of a null for labels_tags, my test works for 4068706054488 (in .net env), and fails with null for labels_tags.

@raphodn
Copy link
Member Author

raphodn commented Oct 11, 2024

because the product inside is not valid (null labels_tags).

oh wow 👀 that's kinda extreme.

did a quick check, there's 132791 products concerned with the labels_tags=None ; 126213 with categories_tags=None ; 59504 with brands_tags=None

did a quick SQL update to fix them all in prod.
will fix the sync script sometime this weekend.

@monsieurtanuki
Copy link
Contributor

did a quick SQL update to fix them all in prod.
will fix the sync script sometime this weekend.

Great!

Obviously we need to be more resilient in off-dart.
Following are the Product fields.

NOT NULLABLE

  • String code;
  • int productId;
  • List categoriesTags;
  • List brandsTags;
  • List labelsTags;
  • DateTime created;
  • int uniqueScansNumber;

I guess that for those List<String> that wouldn't harm to consider all of them nullable.
Therefore, we would be left with only those NOT NULLABLE fields - are we safe with them?

  • String code;
  • int productId;
  • DateTime created;
  • int uniqueScansNumber;

Nullable fields

  • int? priceCount;
  • int? locationCount;
  • int? userCount;
  • int? proofCount;
  • Flavor? source;
  • String? name;
  • int? quantity;
  • String? quantityUnit;
  • String? brands;
  • String? imageURL;
  • String? nutriscoreGrade;
  • String? ecoscoreGrade;
  • int? novaGroup;
  • DateTime? updated;

@monsieurtanuki
Copy link
Contributor

I've just created #5693 because the "not uploaded immediately" part has nothing to do with the duplicates.

@raphodn
Copy link
Member Author

raphodn commented Oct 13, 2024

ok everything should be fixed :)

the product/user/location stat counts will be updated tonight (every start of week)

@monsieurtanuki
Copy link
Contributor

@raphodn Splendid! Still needed: what to do with future duplicates, right?

@raphodn
Copy link
Member Author

raphodn commented Oct 13, 2024

Indeed ! but it's more a feature than a bug, so I will probably prioritize other stuff before.

The related issues are:

@monsieurtanuki
Copy link
Contributor

@raphodn So, what now? Should we close the current issue? Is there anything you expect to be coded or fixed in Smoothie or in off-dart?

@raphodn
Copy link
Member Author

raphodn commented Oct 14, 2024

Well I'd expect any re-user of the API (web, mobile, third-party) to not retry calling the API if the server responds with a success message (20X). In our case a POST on /proofs & /prices, that returns a 201.

Re-trying to create the price, although the server has responded a 201 that the price has indeed been created, but the answer body is not 100% on-par with the API specs, that's something to be fixed.. the API will evolve, or an edge-case will happen, and the problem of the mobile sending dozens of identical prices will resurface 🤷

@monsieurtanuki
Copy link
Contributor

@raphodn Sounds fair. Then of course it means that the status code is always reliable, as it would be from now on the only thing that matters.

@monsieurtanuki
Copy link
Contributor

@raphodn I may limit my changes to createPrice.

That's the only case where I pretend to care about the returned JSON when it's not really the case, and the impact on the server is dramatic (prices constantly added) if the JSON seems corrupted (according to my standards).

Other not relevant cases:

  • uploadProof: I really care about the JSON, e.g. the id, that I will reuse for the createPrice calls
  • authentication: I really care about the returned token
  • "get" calls: what would that mean to say "it's OK, the status is OK, it's just that I cannot read the result"

You know that we cannot be as responsive with an app as with a web app, given that the users upgrade their apps when they want.
Probably, now that we have a rather stable API, breaking changes would be better in new "version" number, so far we're always using /api/v1/.

@monsieurtanuki
Copy link
Contributor

Mainly fixed within Prices; minor related fix in openfoodfacts/openfoodfacts-dart#987.

@github-project-automation github-project-automation bot moved this from Backlog to Done in 💸 Open Prices Oct 28, 2024
@github-project-automation github-project-automation bot moved this from 💬 To discuss and validate to 🎊 Done in 🤳🥫 The Open Food Facts mobile app (Android & iOS) Oct 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

No branches or pull requests

5 participants