Developer guide docs #862

Merged: 23 commits merged into hhursev:main from the developer-guide-docs branch on Oct 1, 2023

Conversation

@strangetom (Collaborator) commented Sep 16, 2023

This PR will add some detailed developer guidance documentation. See #861 and #617 for discussion.

I plan to cover the following:

  • Introductory docs
  • Step by step guide
  • In depth: debugging during development
  • In depth: HTML scraping (best practice, common patterns etc.)
  • In depth: Ingredient groups
  • In depth: Scraper functions (all possible functions, which are optional or mandatory, what their content should be, return types, ...)

It will take me a bit of time to write all these, but I want to open this draft PR to get feedback as I do it. I do expect this to cause some discussion as the aim is to document things that have not necessarily been written down before.

All opinions and contributions are welcome.

@strangetom mentioned this pull request Sep 16, 2023
* ensuring the author is **attributed** correctly,
* representing the recipes **accurately** and **authentically**

Sometimes it is simple and straightforward to achieve all these goals, and sometimes it is more difficult (which is why this library exists). Where some interpretation or creativity is required to scrape a recipe, we should always keep those goals in mind. Occasionally, that might mean that we can't support a particular website.
Collaborator

+1 @strangetom - this is a nice way of explaining the goals. I agree with placing accuracy and authenticity in the same line item, too - I was thinking about exactly that during a commute recently and figured that those two reduce to more-or-less the same idea.

Collaborator Author

This was very much inspired by a comment you made on a recent PR :)

@jayaddison (Collaborator), Sep 19, 2023

Thanks! I hope they're useful guidelines - I don't want to have too much influence, though! 😄

@hhursev @bfcarpio @brett as some of the highest-total-committers here, do these more-or-less match the aims you have in mind when writing scrapers? And/or any suggestions for adjustments?

Edit: commit frequency was probably the wrong description; all-time total commits

Comment on lines 40 to 44
To add ingredient group support to a scraper, the `ingredient_groups` function needs to be overridden in the scraper class. There are three main things to consider when implementing ingredient group support:

1. The groups and their contents are not present in the Recipe schema. They must be extracted from the recipe HTML.
2. The ingredients found in `ingredients()` and `ingredient_groups()` must be the same. This may sound obvious but there can sometimes be minor differences between the ingredients in the schema and the ingredients in the HTML.
3. Not all recipes on a website will use ingredient groups, so the implementation must support recipes that do and recipes that don't have ingredient groups. For recipes that don't have ingredient groups, the output should be the same as the default implementation (i.e. a single `IngredientGroup` with `purpose=None` and `ingredients=ingredients()`).
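To make the three points above a little more concrete, here is a rough sketch of what an `ingredient_groups()` override could look like. The CSS class names are invented, and the `IngredientGroup` import path and constructor are assumptions based on the discussion in this thread, so treat it as an illustration rather than the library's exact API:

```python
from recipe_scrapers._grouping_utils import IngredientGroup  # assumed import path


class ExampleScraper:
    # Sketch of a method on a scraper class that already provides
    # self.soup (BeautifulSoup of the page) and self.ingredients().

    def ingredient_groups(self):
        groups = []
        # Hypothetical CSS classes; every site marks up its groups differently.
        for section in self.soup.select(".ingredient-group"):
            heading = section.select_one(".ingredient-group-title")
            purpose = heading.get_text(strip=True) if heading else None
            ingredients = [
                li.get_text(strip=True) for li in section.select("li.ingredient")
            ]
            groups.append(IngredientGroup(purpose=purpose, ingredients=ingredients))

        if not groups:
            # Recipes without groupings fall back to the default single-group output.
            return [IngredientGroup(purpose=None, ingredients=self.ingredients())]
        return groups
```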
Collaborator

Trying to make some of the wording here a bit more direct. I don't think my suggestions are perfect, so I think some more edits may be needed:

Suggested change:

```diff
- To add ingredient group support to a scraper, the `ingredient_groups` function needs to be overridden in the scraper class. There are three main things to consider when implementing ingredient group support:
- 1. The groups and their contents are not present in the Recipe schema. They must be extracted from the recipe HTML.
- 2. The ingredients found in `ingredients()` and `ingredient_groups()` must be the same. This may sound obvious but there can sometimes be minor differences between the ingredients in the schema and the ingredients in the HTML.
- 3. Not all recipes on a website will use ingredient groups, so the implementation must support recipes that do and recipes that don't have ingredient groups. For recipes that don't have ingredient groups, the output should be the same as the default implementation (i.e. a single `IngredientGroup` with `purpose=None` and `ingredients=ingredients()`).
+ Adding ingredient group support to a scraper involves overriding the `ingredient_groups` function for it. There are three important points to consider:
+ 1. The schema.org Recipe format does not support groupings - so scraping HTML is going to be required in the implementation.
+ 2. The `group_ingredients` function requires that the schema.org ingredients list matches the ingredients found from the HTML - so it's important to make sure that these match.
+ 3. Not all recipes on a website will use ingredient groups, so the implementation must degrade gracefully in cases where groupings aren't available. For recipes that don't have ingredient groups, the output should be the same as the default implementation (i.e. a single `IngredientGroup` with `purpose=None` and `ingredients=ingredients()`).
```

Collaborator Author

I don't agree with the change to the 2nd point. It isn't that the group_ingredients function needs them to match (in fact we added functionality to handle the cases where there isn't an exact match between the two sets of ingredients), it's more that the scraper should be presenting the same thing in two different ways: as a single list or in groupings.

Collaborator

Ok, got it. So it's more of a suggestion to the developer to sanity-check whether both methods return the same information?

Collaborator Author

That was originally my thinking when writing this.

However, we added a test to the ScraperTest base class that checks that both methods return the same ingredients, so the wording here should reflect that. I'll have a bit more of a think on how to phrase this.
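For reference, the kind of consistency check being described might look roughly like the following; the test name and the `self.scraper` attribute are hypothetical, and the real ScraperTest implementation may differ:

```python
def test_ingredient_groups_match_ingredients(self):
    # Flatten the grouped view back into a plain list of ingredient strings.
    grouped = [
        ingredient
        for group in self.scraper.ingredient_groups()
        for ingredient in group.ingredients
    ]
    # Both views should contain the same ingredients, just arranged differently.
    self.assertEqual(sorted(grouped), sorted(self.scraper.ingredients()))
```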

@jayaddison (Collaborator)

I'm looking forward to the HTML scraping best practices section when that's ready :)

One thing that I'll mention from experience: in web browsers -- and I admit BeautifulSoup isn't a browser, but it does end up parsing HTML into element trees in similar ways -- there can be significant performance differences based on how elements are found using CSS selectors. It's almost like SQL database queries: sometimes it's possible to rewrite a query in a way that produces identical results but in a much more efficient and faster way.

Performance isn't top of our priority list, and I think that's fine, but giving readers an awareness that each CSS query could result in processing a really large number of HTML elements (especially given the layout of some websites) -- could be nice.

I'm not sure I have detailed advice there other than something along the lines of: "use #id selectors when possible because they may allow unique index lookup, don't be afraid to filter by both class-and-element -- but also try to make your query concise so that when you read it again in six months, or have to adjust it based on a website layout change, it makes some logical sense to you".
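As a rough illustration of that advice (the markup and class names here are made up):

```python
from bs4 import BeautifulSoup

html = """
<div id="recipe">
  <ul class="ingredients">
    <li class="ingredient">1 egg</li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Anchor on an id where one exists and keep the rest of the selector short.
concise = soup.select("#recipe li.ingredient")

# A vaguer selector happens to return the same elements here, but it asks the
# parser to consider far more of the document on a large page, and is harder
# to re-read six months later.
broad = soup.select("div ul li")

assert concise == broad
```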

@jayaddison (Collaborator)

> use #id selectors when possible because they may allow unique index lookup

On second thought: I'm not completely sure whether lxml provides this optimization, to be honest. I'll try to check that in the documentation/code.

Much of this is irrelevant in our case at the moment, though; we tend to use BeautifulSoup operations like find and find_all that don't depend on CSS selectors.

The few cases where we use CSS selection (usually via soup.select) are implemented by lxml using SoupSieve, I think. Note that lxml can also use the cssselect library to implement some CSS selection operations.

As far as I can tell, SoupSieve doesn't build an index of elements-by-ID, so I don't think that using an ID selector (#foo) would make a difference there. In the case of cssselect (that I don't believe we use, currently), element-by-ID queries are transformed into XPath attribute equality checks - so it depends on the underlying etree/XPath implementation (native Python, in most cases, probably) how efficient those queries will be.

tl;dr - I don't think my advice around using CSS identity selectors is at all relevant, please disregard that.

(maybe some day we'll have a large enough dataset to examine and improve performance and efficiency of the scrapers, but today it's tricky)
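For readers following along, the two styles mentioned above look like this on a toy document (markup invented for the example):

```python
from bs4 import BeautifulSoup

html = '<div class="ingredients"><p>1 egg</p></div><div class="notes"></div>'
soup = BeautifulSoup(html, "html.parser")

# The find/find_all style most scrapers in this project use:
first = soup.find("div", {"class": "ingredients"})
every = soup.find_all("div", {"class": "ingredients"})

# The equivalent CSS-selector style, which goes through SoupSieve:
first_css = soup.select_one("div.ingredients")
every_css = soup.select("div.ingredients")

assert first == first_css
assert every == every_css
```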

@strangetom (Collaborator Author)

Thanks for the feedback @jayaddison. The scraping guide is up next but will probably take me a little while to get to.


### `author() -> str`

Returns the author of the recipe. This is typically a person's name but can sometimes be an organization or the name of the website the recipe came from. If the recipe does not explicitly define an author, this should return the name of the website.
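A minimal sketch of the fallback behaviour described above, assuming a scraper that exposes the parsed schema.org data as `self.schema` (as existing scrapers do); the website name is a placeholder:

```python
def author(self):
    # Prefer the author declared in the recipe's schema.org metadata.
    author = self.schema.author()
    if author:
        return author
    # Fall back to the name of the website when no author is given.
    return "Example Cooking Site"  # placeholder site name
```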
Collaborator

Thanks for documenting this - I agree with it (although, similar to the scraper goals section, I may also have influenced it) 👍


> [!IMPORTANT]
>
> Occasionally, a recipe will have steps that have new lines within them. At the moment, this library cannot distinguish between a newline within a step and a newline between steps.
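To illustrate the ambiguity: once the instructions are flattened into a single newline-separated string, the two (made-up) cases below are indistinguishable to anything that splits on newlines:

```python
# Two separate steps joined with a newline:
two_steps = "Preheat the oven to 180C.\nButter the baking tin."

# One step that happens to contain a newline:
one_step = "Preheat the oven to 180C.\n(For fan ovens, reduce to 160C.)"

# Splitting on "\n" treats both strings the same way.
print(two_steps.split("\n"))  # ['Preheat the oven to 180C.', 'Butter the baking tin.']
print(one_step.split("\n"))   # ['Preheat the oven to 180C.', '(For fan ovens, reduce to 160C.)']
```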
Collaborator

+1 on this note also

Comment on lines 117 to 118
> [!NOTE]
> Is the return type correct here?
Collaborator

The int type does have the benefit of being straightforward to emit and consume -- and in some cases it's a direct typecast of the value that was presented on the HTML page.

On the other hand, I suppose timedelta could be a more structured alternative?

(I'm not sure I have strong opinions about this - maybe I should have, but I don't. the only strong opinion I have is that if we do change it, we should migrate all three *_time fields during the v15 migration)

Collaborator Author

I wasn't sure if a recipe would ever specify a time to sub-minute resolution, in which case we would want to use float.
int is probably fine, so I'll leave it like this.

There might be a job to do to check how well the existing scrapers we have align with the return types I've documented here.
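For context on the trade-off being discussed: one reason `int` minutes is an easy default is that a consumer wanting a structured duration can wrap the value themselves (illustration only):

```python
from datetime import timedelta

total_time_minutes = 90  # e.g. the value a scraper's total_time() returns

# Consumers who prefer a structured duration can convert the int themselves;
# going the other way (timedelta -> int minutes) is just as short.
as_timedelta = timedelta(minutes=total_time_minutes)
back_to_minutes = int(as_timedelta.total_seconds() // 60)

print(as_timedelta)     # 1:30:00
print(back_to_minutes)  # 90
```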

Comment on lines 317 to 319
> [!WARNING]
>
> As far as I can tell, only the `AllRecipes` scraper actually implements this. Should we encourage more scrapers to use it, or should we remove the feature?
Collaborator

Nice catch 👍

I wonder.. I think we could check how many scrapers might support schema.org reviews if we added it as a schema field. That could take us from one to many fairly quickly.

I'll try to do that analysis soon (note: if I haven't responded to this within a day, then it probably means I've forgotten about it).

Collaborator

> I wonder.. I think we could check how many scrapers might support schema.org reviews if we added it as a schema field. That could take us from one to many fairly quickly.

I had an idea for a way to run a naive, basic analysis. Based on the command below, 25 of our scrapers look like they contain Review objects as structured data:

```
$ grep -rwl '@type.*"Review"' tests/test_data | wc -l
25
```

...so maybe 10% of scrapers could return reviews if we implement that there? That would seem valuable enough to open a feature request for.
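Roughly the same count can be reproduced in Python, for anyone without grep to hand (the layout of tests/test_data is assumed here):

```python
import re
from pathlib import Path

# Count test-data files that declare schema.org Review objects, mirroring the
# grep command above.
pattern = re.compile(r'@type.*"Review"')
matching_files = [
    path
    for path in Path("tests/test_data").rglob("*")
    if path.is_file() and pattern.search(path.read_text(errors="ignore"))
]
print(len(matching_files))
```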

@strangetom (Collaborator Author), Sep 23, 2023

If we can extract this from the schema, then I think this would be worth doing. 👍

Collaborator

Great - I've filed #870 to track that, and will take a look into more of the details soon.

@strangetom (Collaborator Author)

I've added my first pass at the HTML scraping guide. It turned out to be harder to write than I first thought because I wasn't sure what to actually add to it. It will need some further tweaking, but if anybody has any suggestions I would appreciate them.


The Beautiful Soup documentation for `select` is [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors-through-the-css-property). MDN has a guide on CSS selectors [here](https://developer.mozilla.org/en-US/docs/Learn/CSS/Building_blocks/Selectors).

## Finding elements using a partial attribute
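Since the body of that section isn't quoted in this thread, here is the kind of matching it refers to, shown two ways; both are standard Beautiful Soup features, and the class name is just an example:

```python
import re

from bs4 import BeautifulSoup

html = '<ul><li class="wprm-recipe-ingredient">1 egg</li></ul>'
soup = BeautifulSoup(html, "html.parser")

# CSS attribute selector via select(): match a substring of the class attribute.
by_css = soup.select('li[class*="recipe-ingredient"]')

# The same partial match using find_all() with a compiled regular expression.
by_regex = soup.find_all("li", class_=re.compile("recipe-ingredient"))

assert by_css == by_regex
```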
Collaborator

The HTML scraping guide is looking pretty good to me already :)

I'd suggest calling this section "Finding elements using string pattern matching"

Collaborator

...or attribute pattern matching, I suppose (I see your point now that the matching is specific to HTML attributes)

@jayaddison (Collaborator)

Not something I do as often as I should, but an idea for something we could suggest in the developer guide: doing a manual test of the scraper on another recipe from the same website after developing it.

To be honest that's probably also something we should do routinely during code review (again, I admit to not doing that often; sometimes I do).
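As a concrete version of that suggestion, a quick manual check after writing a scraper can be as simple as pointing the library's `scrape_me` entry point at a second recipe from the same site and eyeballing the output (the URL below is a placeholder):

```python
from recipe_scrapers import scrape_me

# Try the new scraper against a different recipe from the same website.
scraper = scrape_me("https://www.example.com/recipes/another-recipe")

print(scraper.title())
print(scraper.total_time())
print(scraper.ingredients())
print(scraper.instructions())
```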

@strangetom (Collaborator Author)

I'm marking this as ready for review and merge. I've not written the debugging guide yet, but I don't think I'm going to have time in the near future to get to it, so I think it's better to get what is written merged and then add it in the future.

@strangetom marked this pull request as ready for review September 30, 2023 17:38
@jknndy (Collaborator) left a comment

Pushed a commit with a bunch of small adjustments to spelling, grammar and some syntax consistency changes, but this looks great to me!

Thanks for all the time spent on this, it is hugely helpful (wish it was around when I started contributing!)

@jayaddison (Collaborator)

Brilliant, thanks both - I would generally agree that it makes sense to merge what we have here so far and incrementally update from there. Merging in a moment 👍

(@hhursev: would these be publishable using GitHub pages?)

@jayaddison merged commit 4547716 into hhursev:main Oct 1, 2023
15 checks passed
@strangetom deleted the developer-guide-docs branch October 1, 2023 10:40
strangetom added a commit to strangetom/recipe-scrapers that referenced this pull request Nov 12, 2023