Scraper accuracy re-review: schema.org multi-entity aggregates #1382

jayaddison · 2024-11-15T15:56:58Z

After coding up an initial implementation of aggregation of schema.org Recipe info from multiple entities in HTML page metadata (see #1381), a number of tests have begun failing.

This issue tracks a double-check of what the expected test values should really be, with corrections made where necessary, for the following affected scraper fields:

jayaddison · 2024-11-18T18:33:43Z

Initial findings:

Argiro: this is a case where data is provided as both ld+json (JSON Linked Data) and also HTML microdata (itemprop, itemtype HTML attributes). The data contained in each varies slightly; for the core recipe fields it appears mostly consistent. The linked data (JSON format) appears more complete.
Ethan Chlebowski: on this website, the recipe for a meal is sometimes presented as multiple nested recipes -- and indeed in the Huevos Rancheros test case, component recipes are provided for the salsa and pinto beans respectively. The multiple schema.org entities that we find on the page, therefore, are one entity for each of those recipes. This is tricky: we return a single recipe per URL at the moment, so I don't know what we can do about this -- unless we can extract them by referencing each one individually within the webpage using their URI anchor fragments?
Good Food Discoveries: has multiple entities; the first one contains the bulk of the information about the recipe, the second one contains mostly/only review data.
Womens Weekly: for the recipes on this site, both ld+json and microdata (itemprop, ...) are again provided. However, the only case where they overlap seems to be image URLs, and the results are fairly similar (the resizing/scaling parameters in the URLs differ).
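
As a rough sketch of the anchor-fragment idea raised for Ethan Chlebowski above: if each Recipe entity on a page carried an `@id` ending in a distinct URI fragment, a caller could pick one recipe by passing a URL with that fragment. Everything below is an assumption made for illustration: the sample payload, the `example.test` URLs, and the helper names `recipes_by_fragment` and `select_recipe` are hypothetical, and whether the real pages expose usable `@id` values has not been checked here.

```python
import json
from urllib.parse import urldefrag

# Hypothetical ld+json payload; the @id values and names are illustrative only.
LD_JSON = """
{
  "@context": "https://schema.org",
  "@graph": [
    {"@type": "Recipe", "@id": "https://example.test/huevos-rancheros#recipe",
     "name": "Huevos Rancheros"},
    {"@type": "Recipe", "@id": "https://example.test/huevos-rancheros#salsa",
     "name": "Salsa"},
    {"@type": ["Recipe"], "@id": "https://example.test/huevos-rancheros#pinto-beans",
     "name": "Pinto Beans"},
    {"@type": "WebPage", "@id": "https://example.test/huevos-rancheros"}
  ]
}
"""


def recipes_by_fragment(ld_json_text):
    """Map each Recipe entity's @id fragment to the entity itself."""
    data = json.loads(ld_json_text)
    # Entities may sit in an @graph array, a top-level list, or a single object.
    if isinstance(data, dict):
        entities = data.get("@graph", [data])
    else:
        entities = data
    found = {}
    for entity in entities:
        if not isinstance(entity, dict):
            continue
        # @type may be a single string or a list of strings.
        types = entity.get("@type", [])
        if isinstance(types, str):
            types = [types]
        if "Recipe" not in types:
            continue
        _, fragment = urldefrag(entity.get("@id", ""))
        if fragment:
            found[fragment] = entity
    return found


def select_recipe(ld_json_text, url):
    """Return the Recipe whose @id fragment matches the URL's fragment, else None."""
    _, fragment = urldefrag(url)
    return recipes_by_fragment(ld_json_text).get(fragment)
```

With this sketch, `select_recipe(LD_JSON, "https://example.test/huevos-rancheros#salsa")` would return the salsa entity, while a URL without a fragment returns `None`, leaving the choice of a default recipe to the caller.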
