schema.org author retrieval: author from WebPage not returned #1380

jayaddison · 2024-11-13T14:18:42Z

Pre-filing checks

I have searched for open issues that report the same problem
I have checked that the bug affects the latest version of the library

The URL of the recipe(s) that are not being scraped correctly

https://www.rewe.de/rezepte/wintergemuese-gnocchi-pfanne-kraeuterdip/

Discovered during discussion at #1378 (comment)

The results you expect to see

The author name should be available from the schema.org metadata on the recipe page by accessing the author field, in particular from the WebPage item.

The results (including any Python error messages) that you are seeing

The author field returns a null/empty value.


jayaddison · 2024-11-13T14:27:48Z

This part of the code seems intended to handle this case -- a Recipe metadata item contained inside a WebPage item:

recipe-scrapers/recipe_scrapers/_schemaorg.py

Lines 100 to 106 in 43093df

    
        # If the item is a webpage and describes a recipe entity, use the entity as our datasource
        if self._contains_schematype(item, "WebPage"):
            main_entity = item.get("mainEntity", {})
            if self._contains_schematype(main_entity, "Recipe"):
                self.format = syntax
                self.data = main_entity
                return

Does that code ever run, though?

jayaddison · 2024-11-13T14:34:20Z

> Does that code ever run, though?

For this webpage, weirdly, no, because there are two Recipe entities. This seems valid: one of them is at the top level and carries the aggregate review scores. The other is nested within the WebPage metadata and contains the bulk of the recipe info.

Approximately:

[
  {"@type": "Recipe", ...},
  {
    "@type": "WebPage",
    "mainEntity": {
      "@type": "Recipe",
      "recipeIngredient": [
        ...
      ],
      ...
    }
  },
  ...
]

Our SchemaOrg implementation is latching onto the first Recipe item -- the one that doesn't contain much apart from review data.
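
A minimal sketch of the behaviour described above, not the library's actual code. It assumes the top-level Recipe check happens before the WebPage check; `contains_schematype` and `first_match` are hypothetical stand-ins, and the JSON-LD values are invented for illustration.

```python
import json

# Simplified stand-in for the page's JSON-LD; all values are invented.
JSON_LD = json.loads("""
[
  {"@type": "Recipe",
   "aggregateRating": {"@type": "AggregateRating", "ratingValue": 4.5}},
  {"@type": "WebPage",
   "mainEntity": {
     "@type": "Recipe",
     "author": {"@type": "Person", "name": "Example Author"},
     "recipeIngredient": ["gnocchi"]
   }}
]
""")


def contains_schematype(item, schematype):
    """Hypothetical stand-in for SchemaOrg._contains_schematype."""
    itemtype = item.get("@type", "")
    itemtypes = itemtype if isinstance(itemtype, list) else [itemtype]
    return schematype.lower() in [t.lower() for t in itemtypes]


def first_match(items):
    """Return the first Recipe found, stopping as soon as one matches."""
    for item in items:
        if contains_schematype(item, "Recipe"):
            return item
        if contains_schematype(item, "WebPage"):
            main_entity = item.get("mainEntity", {})
            if contains_schematype(main_entity, "Recipe"):
                return main_entity
    return {}


selected = first_match(JSON_LD)
print(selected.get("author"))  # None: the review-only Recipe was selected
```

Because the loop returns on the first match, the WebPage branch is never reached for this input.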

jayaddison · 2024-11-13T14:36:56Z

Two options that I can think of:

1. We could use the functionality provided by #1365 ("AbstractScraper: provide attributes to override default metadata parsers") to adjust our parsing slightly for this specific recipe website.
2. We could try to handle this generically in SchemaOrg by accumulating recipe properties across multiple entities found on the page.

I'm going to take a break for a while here and will look at this again soon (next day or two probably).

hhursev · 2024-11-13T23:27:52Z

That's a pretty good find! My gut feeling tells me we should go with the second option that you've proposed, where this is handled in SchemaOrg.

I feel like instead of setting self.data to the schema "findings" and exiting the __init__ method right after, as we do now, we can pass these "findings" through a cleverer _update_data method (or simply self.data.update(...) in the beginning) and not exit __init__ when this happens, removing the "return" statements altogether.
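
A rough sketch of that idea, under the assumption that a plain dict update is enough. `is_type` and `accumulate_recipe_data` are hypothetical names, not existing library functions, and the sample data is invented.

```python
def is_type(item, schematype):
    """Hypothetical helper: True if item is a dict whose @type includes schematype."""
    if not isinstance(item, dict):
        return False
    itemtype = item.get("@type", "")
    itemtypes = itemtype if isinstance(itemtype, list) else [itemtype]
    return schematype in itemtypes


def accumulate_recipe_data(items):
    """Merge every Recipe entity found, instead of returning on the first one."""
    data = {}
    for item in items:
        # Unwrap a Recipe that is described by a WebPage item.
        if is_type(item, "WebPage"):
            item = item.get("mainEntity", {})
        if is_type(item, "Recipe"):
            # Keys from later entities overwrite keys from earlier ones.
            data.update(item)
    return data


items = [
    {"@type": "Recipe", "aggregateRating": {"ratingValue": 4.5}},
    {
        "@type": "WebPage",
        "mainEntity": {
            "@type": "Recipe",
            "author": {"@type": "Person", "name": "Example Author"},
        },
    },
]
data = accumulate_recipe_data(items)
print(sorted(data))  # ['@type', 'aggregateRating', 'author']
```

One trade-off with a plain update is that later entities silently overwrite earlier values for the same key, so the order of items on the page affects the result.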

jayaddison · 2024-11-14T01:15:09Z

Sounds good to me. Let's also try to group those updates by the @id of the item they're creating/updating -- AFAICT that is the identifier to indicate whether two objects refer to the same entity or different items (e.g. if multiple people are mentioned in the metadata).
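
A sketch of what grouping by @id could look like. `merge_by_id` is a hypothetical helper and the entities are invented; whether the rewe.de metadata actually carries @id values on both Recipe items has not been checked here.

```python
from collections import defaultdict


def merge_by_id(entities):
    """Group entity properties by @id; entities without an @id stay separate."""
    merged = defaultdict(dict)
    for position, entity in enumerate(entities):
        # Without an @id we cannot tell whether two items describe the same
        # thing, so fall back to a per-item placeholder key.
        key = entity.get("@id") or f"_:anonymous-{position}"
        merged[key].update(entity)
    return dict(merged)


entities = [
    {"@id": "https://example.com/r#recipe", "@type": "Recipe",
     "aggregateRating": {"ratingValue": 4.5}},
    {"@id": "https://example.com/r#recipe", "@type": "Recipe",
     "author": {"@id": "https://example.com/#alice"}},
    {"@id": "https://example.com/#alice", "@type": "Person", "name": "Alice"},
    {"@type": "Person", "name": "Bob"},
]

merged = merge_by_id(entities)
recipe = merged["https://example.com/r#recipe"]
print(len(merged))     # 3
print(sorted(recipe))  # ['@id', '@type', 'aggregateRating', 'author']
```

The two Recipe items collapse into one record, while the two Person items remain distinct.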

jayaddison added the bug label Nov 13, 2024

jayaddison mentioned this issue Nov 13, 2024

Addition: Support for rewe.de #1378

Open

jayaddison linked a pull request Nov 15, 2024 that will close this issue

schema.org parser: allow aggregation from multiple Recipe entities #1381

Open
