Developer guide docs #862

Merged: 23 commits merged into hhursev:main from the developer-guide-docs branch on Oct 1, 2023

Conversation

@strangetom (Collaborator) commented Sep 16, 2023

This PR will add some detailed developer guidance documentation. See #861 and #617 for discussion.

I plan to cover the following:

  • Introductory docs
  • Step by step guide
  • In depth: debugging during development
  • In depth: HTML scraping (best practice, common patterns etc.)
  • In depth: Ingredient groups
  • In depth: Scraper functions (all possible functions, which are optional or mandatory, what their content should be, return types, ...)

It will take me a bit of time to write all these, but I want to open this draft PR to get feedback as I do it. I do expect this to cause some discussion as the aim is to document things that have not necessarily been written down before.

All opinions and contributions are welcome.

@strangetom mentioned this pull request Sep 16, 2023
* ensuring the author is **attributed** correctly,
* representing the recipes **accurately** and **authentically**

Sometimes it is simple and straightforward to achieve all these goals, and sometimes it is more difficult (which is why this library exists). Where some interpretation or creativity is required to scrape a recipe, we should always keep those goals in mind. Occasionally, that might mean that we can't support a particular website.
Collaborator

+1 @strangetom - this is a nice way of explaining the goals. I agree with placing accuracy and authenticity in the same line item, too - I was thinking about exactly that during a commute recently and figured that those two reduce to more-or-less the same idea.

Collaborator Author

This was very much inspired by a comment you made on a recent PR :)

@jayaddison (Collaborator), Sep 19, 2023

Thanks! I hope they're useful guidelines - I don't want to have too much influence, though! 😄

@hhursev @bfcarpio @brett as some of the highest-total-committers here, do these more-or-less match the aims you have in mind when writing scrapers? And/or any suggestions for adjustments?

Edit: commit frequency was probably the wrong description; all-time total commits

Comment on lines 40 to 44
To add ingredient group support to a scraper, the `ingredient_groups` function needs to be overridden in the scraper class. There are three main things to consider when implementing ingredient group support:

1. The groups and their contents are not present in the Recipe schema. They must be extracted from the recipe HTML.
2. The ingredients found in `ingredients()` and `ingredient_groups()` must be the same. This may sound obvious but there can sometimes be minor differences between the ingredients in the schema and the ingredients in the HTML.
3. Not all recipes on a website will use ingredient groups, so the implementation must support recipes that do and recipes that don't have ingredient groups. For recipes that don't have ingredient groups, the output should be the same as the default implementation (i.e. a single `IngredientGroup` with `purpose=None` and `ingredients=ingredients()`).
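To make the three points above a little more concrete, here is a rough sketch of what an `ingredient_groups()` override could look like. The CSS class names are invented, and the `IngredientGroup` import path and constructor are assumptions based on the discussion in this thread, so treat it as an illustration rather than the library's exact API:

```python
from recipe_scrapers._grouping_utils import IngredientGroup  # assumed import path


class ExampleScraper:
    # Sketch of a method on a scraper class that already provides
    # self.soup (BeautifulSoup of the page) and self.ingredients().

    def ingredient_groups(self):
        groups = []
        # Hypothetical CSS classes; every site marks up its groups differently.
        for section in self.soup.select(".ingredient-group"):
            heading = section.select_one(".ingredient-group-title")
            purpose = heading.get_text(strip=True) if heading else None
            ingredients = [
                li.get_text(strip=True) for li in section.select("li.ingredient")
            ]
            groups.append(IngredientGroup(purpose=purpose, ingredients=ingredients))

        if not groups:
            # Recipes without groupings fall back to the default single-group output.
            return [IngredientGroup(purpose=None, ingredients=self.ingredients())]
        return groups
```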
Collaborator

Trying to make some of the wording here a bit more direct. I don't think my suggestions are perfect, so I think some more edits may be needed:

Suggested change:

```diff
- To add ingredient group support to a scraper, the `ingredient_groups` function needs to be overridden in the scraper class. There are three main things to consider when implementing ingredient group support:
- 1. The groups and their contents are not present in the Recipe schema. They must be extracted from the recipe HTML.
- 2. The ingredients found in `ingredients()` and `ingredient_groups()` must be the same. This may sound obvious but there can sometimes be minor differences between the ingredients in the schema and the ingredients in the HTML.
- 3. Not all recipes on a website will use ingredient groups, so the implementation must support recipes that do and recipes that don't have ingredient groups. For recipes that don't have ingredient groups, the output should be the same as the default implementation (i.e. a single `IngredientGroup` with `purpose=None` and `ingredients=ingredients()`).
+ Adding ingredient group support to a scraper involves overriding the `ingredient_groups` function for it. There are three important points to consider:
+ 1. The schema.org Recipe format does not support groupings - so scraping HTML is going to be required in the implementation.
+ 2. The `group_ingredients` function requires that the schema.org ingredients list matches the ingredients found from the HTML - so it's important to make sure that these match.
+ 3. Not all recipes on a website will use ingredient groups, so the implementation must degrade gracefully in cases where groupings aren't available. For recipes that don't have ingredient groups, the output should be the same as the default implementation (i.e. a single `IngredientGroup` with `purpose=None` and `ingredients=ingredients()`).
```

Collaborator Author

I don't agree with the change to the 2nd point. It isn't that the group_ingredients function needs them to match (in fact we added functionality to handle the cases where there isn't an exact match between the two sets of ingredients), it's more that the scraper should be presenting the same thing in two different ways: as a single list or in groupings.

Collaborator

Ok, got it. So it's more of a suggestion to the developer to sanity-check whether both methods return the same information?

Collaborator Author

That was originally my thinking when writing this.

However, we added a test to the ScraperTest base class that checks that both methods return the same ingredients, so the wording here should reflect that. I'll have a bit more of a think on how to phrase this.
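For reference, the kind of consistency check being described might look roughly like the following; the test name and the `self.scraper` attribute are hypothetical, and the real ScraperTest implementation may differ:

```python
def test_ingredient_groups_match_ingredients(self):
    # Flatten the grouped view back into a plain list of ingredient strings.
    grouped = [
        ingredient
        for group in self.scraper.ingredient_groups()
        for ingredient in group.ingredients
    ]
    # Both views should contain the same ingredients, just arranged differently.
    self.assertEqual(sorted(grouped), sorted(self.scraper.ingredients()))
```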

@jayaddison (Collaborator)

I'm looking forward to the HTML scraping best practices section when that's ready :)

One thing that I'll mention from experience: in web browsers -- and I admit BeautifulSoup isn't a browser, but it does end up parsing HTML into element trees in similar ways -- there can be significant performance differences based on how elements are found using CSS selectors. It's almost like SQL database queries: sometimes it's possible to rewrite a query in a way that produces identical results but in a much more efficient and faster way.

Performance isn't top of our priority list, and I think that's fine, but giving readers an awareness that each CSS query could result in processing a really large number of HTML elements (especially given the layout of some websites) -- could be nice.

I'm not sure I have detailed advice there other than something along the lines of: "use #id selectors when possible because they may allow unique index lookup, don't be afraid to filter by both class-and-element -- but also try to make your query concise so that when you read it again in six months, or have to adjust it based on a website layout change, it makes some logical sense to you".
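As a rough illustration of that advice (the markup and class names here are made up):

```python
from bs4 import BeautifulSoup

html = """
<div id="recipe">
  <ul class="ingredients">
    <li class="ingredient">1 egg</li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Anchor on an id where one exists and keep the rest of the selector short.
concise = soup.select("#recipe li.ingredient")

# A vaguer selector happens to return the same elements here, but it asks the
# parser to consider far more of the document on a large page, and is harder
# to re-read six months later.
broad = soup.select("div ul li")

assert concise == broad
```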

@jayaddison (Collaborator)

> use #id selectors when possible because they may allow unique index lookup

On second thought: I'm not completely sure whether lxml provides this optimization, to be honest. I'll try to check that in the documentation/code.

Much of this is irrelevant in our case at the moment, though; we tend to use BeautifulSoup operations like find and find_all that don't depend on CSS selectors.

The few cases where we use CSS selection (usually via soup.select) are implemented by lxml using SoupSieve, I think. Note that lxml can also use the cssselect library to implement some CSS selection operations.

As far as I can tell, SoupSieve doesn't build an index of elements-by-ID, so I don't think that using an ID selector (#foo) would make a difference there. In the case of cssselect (that I don't believe we use, currently), element-by-ID queries are transformed into XPath attribute equality checks - so it depends on the underlying etree/XPath implementation (native Python, in most cases, probably) how efficient those queries will be.

tl;dr - I don't think my advice around using CSS identity selectors is at all relevant, please disregard that.

(maybe some day we'll have a large enough dataset to examine and improve performance and efficiency of the scrapers, but today it's tricky)
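For readers following along, the two styles mentioned above look like this on a toy document (markup invented for the example):

```python
from bs4 import BeautifulSoup

html = '<div class="ingredients"><p>1 egg</p></div><div class="notes"></div>'
soup = BeautifulSoup(html, "html.parser")

# The find/find_all style most scrapers in this project use:
first = soup.find("div", {"class": "ingredients"})
every = soup.find_all("div", {"class": "ingredients"})

# The equivalent CSS-selector style, which goes through SoupSieve:
first_css = soup.select_one("div.ingredients")
every_css = soup.select("div.ingredients")

assert first == first_css
assert every == every_css
```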

@strangetom (Collaborator Author)

Thanks for the feedback @jayaddison. The scraping guide is up next but will probably take me a little while to get to.


### `author() -> str`

Returns the author of the recipe. This is typically a person's name but can sometimes be an organization or the name of the website the recipe came from. If the recipe does not explicitly define an author, this should return the name of the website.
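A minimal sketch of the fallback behaviour described above, assuming a scraper that exposes the parsed schema.org data as `self.schema` (as existing scrapers do); the website name is a placeholder:

```python
def author(self):
    # Prefer the author declared in the recipe's schema.org metadata.
    author = self.schema.author()
    if author:
        return author
    # Fall back to the name of the website when no author is given.
    return "Example Cooking Site"  # placeholder site name
```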
Collaborator

Thanks for documenting this - I agree with it (although, similar to the scraper goals section, I may also have influenced it) 👍


> [!IMPORTANT]
>
> Occasionally, a recipe will have steps that have new lines within them. At the moment, this library cannot distinguish between a newline within a step and a newline between steps.
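To illustrate the ambiguity: once the instructions are flattened into a single newline-separated string, the two (made-up) cases below are indistinguishable to anything that splits on newlines:

```python
# Two separate steps joined with a newline:
two_steps = "Preheat the oven to 180C.\nButter the baking tin."

# One step that happens to contain a newline:
one_step = "Preheat the oven to 180C.\n(For fan ovens, reduce to 160C.)"

# Splitting on "\n" treats both strings the same way.
print(two_steps.split("\n"))  # ['Preheat the oven to 180C.', 'Butter the baking tin.']
print(one_step.split("\n"))   # ['Preheat the oven to 180C.', '(For fan ovens, reduce to 160C.)']
```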
Collaborator

+1 on this note also

Comment on lines 117 to 118
> [!NOTE]
> Is the return type correct here?
Collaborator

The int type does have the benefit of being straightforward to emit and consume -- and in some cases it's a direct typecast of the value that was presented on the HTML page.

On the other hand, I suppose timedelta could be a more structured alternative?

(I'm not sure I have strong opinions about this - maybe I should have, but I don't. the only strong opinion I have is that if we do change it, we should migrate all three *_time fields during the v15 migration)

Collaborator Author

I wasn't sure if a recipe would ever specify a time to sub-minute resolution, in which case we would want to use float.
int is probably fine, so I'll leave it like this.

There might be a job to do to check how well the existing scrapers we have align with the return types I've documented here.
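For context on the trade-off being discussed: one reason `int` minutes is an easy default is that a consumer wanting a structured duration can wrap the value themselves (illustration only):

```python
from datetime import timedelta

total_time_minutes = 90  # e.g. the value a scraper's total_time() returns

# Consumers who prefer a structured duration can convert the int themselves;
# going the other way (timedelta -> int minutes) is just as short.
as_timedelta = timedelta(minutes=total_time_minutes)
back_to_minutes = int(as_timedelta.total_seconds() // 60)

print(as_timedelta)     # 1:30:00
print(back_to_minutes)  # 90
```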

Comment on lines 317 to 319
> [!WARNING]
>
> As far as I can tell, only the `AllRecipes` scraper actually implements this. Should we encourage more scrapers to use it, or should we remove the feature?
Collaborator

Nice catch 👍

I wonder.. I think we could check how many scrapers might support schema.org reviews if we added it as a schema field. That could take us from one to many fairly quickly.

I'll try to do that analysis soon (note: if I haven't responded to this within a day, then it probably means I've forgotten about it).

Collaborator

> I wonder.. I think we could check how many scrapers might support schema.org reviews if we added it as a schema field. That could take us from one to many fairly quickly.

I had an idea for a way to run a naive, basic analysis. Based on the command below, 25 of our scrapers look like they contain Review objects as structured data:

```
$ grep -rwl '@type.*"Review"' tests/test_data | wc -l
25
```

...so maybe 10% of scrapers could return reviews if we implement that there? That would seem valuable enough to open a feature request for.
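Roughly the same count can be reproduced in Python, for anyone without grep to hand (the layout of tests/test_data is assumed here):

```python
import re
from pathlib import Path

# Count test-data files that declare schema.org Review objects, mirroring the
# grep command above.
pattern = re.compile(r'@type.*"Review"')
matching_files = [
    path
    for path in Path("tests/test_data").rglob("*")
    if path.is_file() and pattern.search(path.read_text(errors="ignore"))
]
print(len(matching_files))
```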

@strangetom (Collaborator Author), Sep 23, 2023

If we can extract this from the schema, then I think this would be worth doing. 👍

Collaborator

Great - I've filed #870 to track that, and will take a look into more of the details soon.

@strangetom (Collaborator Author)

I've added my first pass at the HTML scraping guide. It turned out to be harder to write than I first thought because I wasn't sure what to actually add to it. It will need some further tweaking, but if anybody has any suggestions I would appreciate them.


The Beautiful Soup documentation for `select` is [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors-through-the-css-property). MDN has a guide on CSS selectors [here](https://developer.mozilla.org/en-US/docs/Learn/CSS/Building_blocks/Selectors).

## Finding elements using a partial attribute
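Since the body of that section isn't quoted in this thread, here is the kind of matching it refers to, shown two ways; both are standard Beautiful Soup features, and the class name is just an example:

```python
import re

from bs4 import BeautifulSoup

html = '<ul><li class="wprm-recipe-ingredient">1 egg</li></ul>'
soup = BeautifulSoup(html, "html.parser")

# CSS attribute selector via select(): match a substring of the class attribute.
by_css = soup.select('li[class*="recipe-ingredient"]')

# The same partial match using find_all() with a compiled regular expression.
by_regex = soup.find_all("li", class_=re.compile("recipe-ingredient"))

assert by_css == by_regex
```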
Collaborator

The HTML scraping guide is looking pretty good to me already :)

I'd suggest calling this section "Finding elements using string pattern matching"

Collaborator

...or attribute pattern matching, I suppose (I see your point now that the matching is specific to HTML attributes)

@jayaddison (Collaborator)

Not something I do as often as I should, but an idea for something we could suggest in the developer guide: doing a manual test of the scraper on another recipe from the same website after developing it.

To be honest that's probably also something we should do routinely during code review (again, I admit to not doing that often; sometimes I do).
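As a concrete version of that suggestion, a quick manual check after writing a scraper can be as simple as pointing the library's `scrape_me` entry point at a second recipe from the same site and eyeballing the output (the URL below is a placeholder):

```python
from recipe_scrapers import scrape_me

# Try the new scraper against a different recipe from the same website.
scraper = scrape_me("https://www.example.com/recipes/another-recipe")

print(scraper.title())
print(scraper.total_time())
print(scraper.ingredients())
print(scraper.instructions())
```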

@strangetom (Collaborator Author)

I'm marking this as ready for review and merge. I've not written the debugging guide yet, but I don't think I'm going to have time in the near future to get to it, so I think it's better to get what is written merged and then add it in the future.

@strangetom marked this pull request as ready for review September 30, 2023 17:38
@jknndy (Collaborator) left a comment

Pushed a commit with a bunch of small adjustments to spelling, grammar and some syntax consistency changes, but this looks great to me!

Thanks for all the time spent on this, it is hugely helpful (wish it was around when I started contributing!)

@jayaddison (Collaborator)

Brilliant, thanks both - I would generally agree that it makes sense to merge what we have here so far and incrementally update from there. Merging in a moment 👍

(@hhursev: would these be publishable using GitHub pages?)

@jayaddison merged commit 4547716 into hhursev:main Oct 1, 2023
15 checks passed
@strangetom deleted the developer-guide-docs branch October 1, 2023 10:40
strangetom added a commit to strangetom/recipe-scrapers that referenced this pull request Nov 12, 2023