Scraper accuracy re-review: schema.org multi-entity aggregates #1382

Open
18 tasks
jayaddison opened this issue Nov 15, 2024 · 1 comment

Comments

@jayaddison
Collaborator

After I coded up an initial implementation that aggregates schema.org Recipe info from multiple entities in HTML page metadata (see #1381), a number of tests began failing.

This issue tracks a re-review to determine what the expected test values should really be, and to make corrections where necessary, for the following affected scraper fields:

  • tests.RecipeTestCase.tests/test_data/argiro.gr/argiro.json [image]
  • tests.RecipeTestCase.tests/test_data/argiro.gr/argiro.json [instructions_list]
  • tests.RecipeTestCase.tests/test_data/argiro.gr/argiro.json [category]
  • tests.RecipeTestCase.tests/test_data/argiro.gr/argiro.json [description]
  • tests.RecipeTestCase.tests/test_data/argiro.gr/argiro.json [cuisine]
  • tests.RecipeTestCase.tests/test_data/argiro.gr/argiro.json [instructions_list vs instructions comparison]
  • tests.RecipeTestCase.tests/test_data/ethanchlebowski.com/ethanchlebowski.json [ingredients]
  • tests.RecipeTestCase.tests/test_data/ethanchlebowski.com/ethanchlebowski.json [instructions_list]
  • tests.RecipeTestCase.tests/test_data/ethanchlebowski.com/ethanchlebowski.json [title]
  • tests.RecipeTestCase.tests/test_data/ethanchlebowski.com/ethanchlebowski.json [total_time]
  • tests.RecipeTestCase.tests/test_data/ethanchlebowski.com/ethanchlebowski.json [yields]
  • tests.RecipeTestCase.tests/test_data/ethanchlebowski.com/ethanchlebowski.json [ingredient_groups]
  • tests.RecipeTestCase.tests/test_data/ethanchlebowski.com/ethanchlebowski.json [cook_time]
  • tests.RecipeTestCase.tests/test_data/ethanchlebowski.com/ethanchlebowski.json [prep_time]
  • tests.RecipeTestCase.tests/test_data/ethanchlebowski.com/ethanchlebowski.json [instructions_list vs instructions comparison]
  • tests.RecipeTestCase.tests/test_data/goodfooddiscoveries.com/goodfooddiscoveries.json [image]
  • tests.RecipeTestCase.tests/test_data/womensweeklyfood.com.au/womensweeklyfood_1.json [image]
  • tests.RecipeTestCase.tests/test_data/womensweeklyfood.com.au/womensweeklyfood_2.json [image]
@jayaddison
Collaborator Author

Initial findings:

  • Argiro: this is a case where data is provided both as JSON-LD (ld+json) and as HTML microdata (itemprop, itemtype HTML attributes). The data contained in each varies slightly, although the core recipe fields appear mostly consistent. The JSON-LD appears more complete.

  • Ethan Chlebowski: on this website, the recipe for a meal is sometimes presented as multiple nested recipes; in the Huevos Rancheros test case, component recipes are provided for the salsa and for the pinto beans. The multiple schema.org entities that we find on the page are therefore one entity per recipe. This is tricky: we return a single recipe per URL at the moment, so I don't know what we can do about this, unless we can extract each recipe individually by referencing its URI anchor fragment within the webpage?

  • Good Food Discoveries: the page has multiple entities. The first one contains the bulk of the information about the recipe; the second one contains mostly/only review data.

  • Womens Weekly: for the recipes on this site, both ld+json and microdata (itemprop, ...) are again provided. However, the only field where they seem to overlap is the image URL, and the values are fairly similar (the resizing/scaling parameters in the URLs differ).
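
To make the findings above easier to reproduce, here is a minimal sketch of how multiple Recipe entities on one page could be collected and compared. It is not the implementation from #1381: it uses only the Python standard library, it reads JSON-LD only (no microdata, no `@graph` handling), and the names `JsonLdCollector`, `recipe_entities`, `merge_first_wins` and `same_image_ignoring_query` are hypothetical. The image comparison also assumes that the resizing/scaling parameters live in the URL query string, which may not hold for every site.

```python
import json
from html.parser import HTMLParser
from urllib.parse import urlsplit


class JsonLdCollector(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def recipe_entities(html):
    """Return every top-level JSON-LD entity whose @type includes Recipe."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    entities = []
    for block in collector.blocks:
        data = json.loads(block)
        candidates = data if isinstance(data, list) else [data]
        for item in candidates:
            if not isinstance(item, dict):
                continue
            types = item.get("@type")
            types = types if isinstance(types, list) else [types]
            if "Recipe" in types:
                entities.append(item)
    return entities


def merge_first_wins(entities):
    """Aggregate entities: the first non-empty value seen for a key is kept."""
    merged = {}
    for entity in entities:
        for key, value in entity.items():
            if key not in merged and value not in (None, "", [], {}):
                merged[key] = value
    return merged


def same_image_ignoring_query(url_a, url_b):
    """Compare two image URLs by host and path only (assumption: sizing is in the query)."""
    a, b = urlsplit(url_a), urlsplit(url_b)
    return (a.netloc, a.path) == (b.netloc, b.path)


# Hypothetical page: one main entity plus one that carries only review data.
html = """<html><head>
<script type="application/ld+json">{"@type": "Recipe", "name": "Soup", "recipeYield": "4"}</script>
<script type="application/ld+json">{"@type": "Recipe", "name": "", "aggregateRating": {"ratingValue": "4.5"}}</script>
</head></html>"""

merged = merge_first_wins(recipe_entities(html))
```

A first-wins merge like this seems to suit the Good Food Discoveries shape (one main entity plus a review-only entity), but it would blend the salsa and pinto beans recipes into the main one on the Ethan Chlebowski page, which is why that case probably needs separate handling.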
