schema.org author retrieval: author from WebPage not returned #1380

jayaddison opened this issue Nov 13, 2024 · 5 comments · May be fixed by #1381

@jayaddison
Collaborator

Pre-filing checks

  • I have searched for open issues that report the same problem
  • I have checked that the bug affects the latest version of the library

The URL of the recipe(s) that are not being scraped correctly

Discovered during discussion at #1378 (comment)

The results you expect to see

The author name should be available from the schema.org metadata on the recipe page via the author field; in particular, from the WebPage item.

The results (including any Python error messages) that you are seeing

The author field returns a null/empty value.

jayaddison added the bug label Nov 13, 2024
@jayaddison
Collaborator Author

This part of the code seems intended to handle this case -- a Recipe metadata item contained inside a WebPage item:

# If the item is a webpage and describes a recipe entity, use the entity as our datasource
if self._contains_schematype(item, "WebPage"):
    main_entity = item.get("mainEntity", {})
    if self._contains_schematype(main_entity, "Recipe"):
        self.format = syntax
        self.data = main_entity
        return

Does that code ever run, though?

@jayaddison
Collaborator Author

> Does that code ever run, though?

For this webpage, weirdly, no: there are two Recipe entities. This seems valid; one of them is at the top level and mentions aggregate review scores. The other is within the WebPage metadata and contains the bulk of the recipe info.

Approximately:

[
  {"@type": "Recipe", ...},
  {
    "@type": "WebPage",
    "mainEntity": {
      "@type": "Recipe",
      "recipeIngredient": [
        ...
      ],
      ...
    }
  },
  ...
]

Our SchemaOrg implementation is latching onto the first Recipe item -- the one that doesn't contain much apart from review data.
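
To make the failure mode concrete, here is a minimal sketch of first-match selection. It is not the library's actual code: `has_type` and `first_recipe` are hypothetical helpers, and the sample values are invented to mirror the structure above.

```python
def has_type(item, schematype):
    """Return True if a JSON-LD-style dict declares the given @type."""
    declared = item.get("@type", [])
    if isinstance(declared, str):
        declared = [declared]
    return schematype.lower() in (t.lower() for t in declared)


def first_recipe(items):
    """Hypothetical first-match selection: stop at the first Recipe found."""
    for item in items:
        if has_type(item, "Recipe"):
            return item
        if has_type(item, "WebPage"):
            main_entity = item.get("mainEntity", {})
            if has_type(main_entity, "Recipe"):
                return main_entity
    return {}


# Invented sample data: a sparse top-level Recipe precedes the detailed one.
items = [
    {"@type": "Recipe", "aggregateRating": {"ratingValue": "4.5"}},
    {
        "@type": "WebPage",
        "mainEntity": {
            "@type": "Recipe",
            "author": {"@type": "Person", "name": "Example Author"},
            "recipeIngredient": ["1 cup flour"],
        },
    },
]

selected = first_recipe(items)
print("author" in selected)  # prints False: the sparse Recipe was chosen
```

Because the loop returns as soon as it sees the top-level Recipe, the WebPage branch is never reached for this input.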

@jayaddison
Collaborator Author

Two options that I can think of:

I'm going to take a break for a while here and will look at this again soon (next day or two probably).

@hhursev
Owner

hhursev commented Nov 13, 2024

That's a pretty good find! My gut feeling tells me we should go with the second option that you've proposed, where this is handled in SchemaOrg.

I feel like instead of setting self.data to the schema "findings" and exiting the __init__ method right after like we do:

we can pass these "findings" through a cleverer _update_data method (or simply call self.data.update(...) at the beginning) and not exit __init__ when this happens, removing the "return" statements altogether.
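
A rough sketch of that idea, assuming the metadata items are plain dicts: instead of returning at the first match, every Recipe-typed finding is merged into one dict. The names `has_type` and `collect_recipe_data` are hypothetical and the sample values are invented; this is not the project's code.

```python
def has_type(item, schematype):
    """Return True if a JSON-LD-style dict declares the given @type."""
    declared = item.get("@type", [])
    if isinstance(declared, str):
        declared = [declared]
    return schematype.lower() in (t.lower() for t in declared)


def collect_recipe_data(items):
    """Merge every Recipe finding rather than stopping at the first one."""
    data = {}
    for item in items:
        if has_type(item, "Recipe"):
            data.update(item)
        elif has_type(item, "WebPage"):
            main_entity = item.get("mainEntity", {})
            if has_type(main_entity, "Recipe"):
                data.update(main_entity)
    return data


# Invented sample data with one sparse and one detailed Recipe.
items = [
    {"@type": "Recipe", "aggregateRating": {"ratingValue": "4.5"}},
    {
        "@type": "WebPage",
        "mainEntity": {
            "@type": "Recipe",
            "author": {"@type": "Person", "name": "Example Author"},
            "recipeIngredient": ["1 cup flour"],
        },
    },
]

merged = collect_recipe_data(items)
print(sorted(merged))
```

One caveat with a plain `dict.update`: when two findings share a key, the later value silently replaces the earlier one, so the order of the items matters.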

@jayaddison
Collaborator Author

Sounds good to me. Let's also try to group those updates by the @id of the item they're creating/updating -- AFAICT that is the identifier to indicate whether two objects refer to the same entity or different items (e.g. if multiple people are mentioned in the metadata).
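
As an illustration of grouping by @id, the sketch below merges items that share an identifier and keeps items without one separate, since nothing indicates that two unidentified items describe the same entity. `group_by_id` is a hypothetical helper and the URLs and values are made up.

```python
from collections import defaultdict


def group_by_id(items):
    """Merge items sharing an @id; keep items lacking an @id apart."""
    grouped = defaultdict(dict)
    anonymous = []
    for item in items:
        item_id = item.get("@id")
        if item_id is None:
            anonymous.append(dict(item))
        else:
            grouped[item_id].update(item)
    return dict(grouped), anonymous


# Invented sample data: two fragments of one recipe, plus two people.
items = [
    {"@id": "https://example.test/r#recipe", "@type": "Recipe", "name": "Soup"},
    {"@id": "https://example.test/r#recipe", "aggregateRating": {"ratingValue": "4.5"}},
    {"@id": "https://example.test/#alice", "@type": "Person", "name": "Alice"},
    {"@type": "Person", "name": "Bob"},
]

by_id, without_id = group_by_id(items)
print(len(by_id), len(without_id))  # prints: 2 1
```

Here the two recipe fragments collapse into a single entry, while the two Person items stay distinct.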
