-
Notifications
You must be signed in to change notification settings - Fork 531
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Developer guide docs #862
Developer guide docs #862
Conversation
…ocs folder. Remove [!NOTE] syntax from in depth guide draft notes.
Added additional information to the language section of the documentation. Corrected a few spelling items & double spaces
* ensuring the author is **attributed** correctly, | ||
* representing the recipes **accurately** and **authentically** | ||
|
||
Sometimes it is simple and straightforward to achieve all these goals, and sometimes it is more difficult (which is why this library exists). Where some interpretation or creativity is required to scrape a recipe, we should always keep those goals in mind. Occasionally, that might mean that we can't support a particular website. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 @strangetom - this is a nice way of explaining the goals. I agree with placing accuracy and authenticity in the same line item, too - I was thinking about exactly that during a commute recently and figured that those two reduce to more-or-less the same idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was very much inspired by comment you made on a recent PR :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I hope they're useful guidelines - I don't want to have too much influence, though! 😄
@hhursev @bfcarpio @brett as some of the highest-total-committers here, do these more-or-less match the aims you have in mind when writing scrapers? And/or any suggestions for adjustments?
Edit: commit frequency was probably the wrong description; all-time total commits
To add ingredient group support to a scraper, the `ingredient_group` function needs to be overridden in the scraper class. The are three main things to consider when implementing ingredient groups support: | ||
|
||
1. The groups and their contents are not present in the Recipe schema. They must be extracted from the recipe HTML. | ||
2. The ingredients found in `ingredients()` and `ingredient_groups()` must be the same. This may sound obvious but there can sometimes minor differences in the ingredients in the schema and the ingredients in the HTML. | ||
3. Not all recipes on a website will use ingredient groups, so the implementation must support recipes that do and recipes that don't have ingredient groups. For recipes that don't have ingredient groups, the output should be the same as default implementation (i.e. a single `IngredientGroup` with `purpose=None` and `ingredients=ingredients()`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Trying to make some of the wording here a bit more direct. I don't think my suggestions are perfect, so I think some more edits may be needed:
To add ingredient group support to a scraper, the `ingredient_group` function needs to be overridden in the scraper class. The are three main things to consider when implementing ingredient groups support: | |
1. The groups and their contents are not present in the Recipe schema. They must be extracted from the recipe HTML. | |
2. The ingredients found in `ingredients()` and `ingredient_groups()` must be the same. This may sound obvious but there can sometimes minor differences in the ingredients in the schema and the ingredients in the HTML. | |
3. Not all recipes on a website will use ingredient groups, so the implementation must support recipes that do and recipes that don't have ingredient groups. For recipes that don't have ingredient groups, the output should be the same as default implementation (i.e. a single `IngredientGroup` with `purpose=None` and `ingredients=ingredients()`). | |
Adding ingredient group support to a scraper involves overriding the `ingredient_group` function for it. There are three important points to consider: | |
1. The schema.org Recipe format does not support groupings - so scraping HTML is going to be required in the implementation. | |
2. The `group_ingredients` function requires that the schema.org ingredients list matches the ingredients found from the HTML - so it's important to make sure that these match. | |
3. Not all recipes on a website will use ingredient groups, so the implementation must degrade gracefully in cases where groupings aren't available. For recipes that don't have ingredient groups, the output should be the same as default implementation (i.e. a single `IngredientGroup` with `purpose=None` and `ingredients=ingredients()`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't agree with the change to the 2nd point. It isn't that the group_ingredients
function needs them to match (in fact we added functionally to handle the cases where there isn't an exact match between the two sets of ingredients), it's more than the scraper should be presenting the same thing in two different ways: a single list or in groupings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, got it. So it's more of a suggestion to the developer to sanity-check whether both methods return the same information?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was original my thinking when writing this.
However we added a test the ScraperTest
base class that checks that both methods return the same ingredients, so the wording here should reflect that. I'll have a bit more of a think on how the phrase this.
I'm looking forward to the HTML scraping best practices section when that's ready :) One thing that I'll mention from experience: in web browsers -- and I admit Performance isn't top of our priority list, and I think that's fine, but giving readers an awareness that each CSS query could result in processing a really large number of HTML elements (especially given the layout of some websites) -- could be nice. I'm not sure I have detailed advice there other than something along the lines of: "use |
On second thought: I'm not completely sure whether |
Much of this is irrelevant in our case at the moment, though; we tend to use The few cases where we use CSS selection (usually via As far as I can tell, tl;dr - I don't think my advice around using CSS identity selectors is at all relevant, please disregard that. (maybe some day we'll have a large enough dataset to examine and improve performance and efficiency of the scrapers, but today it's tricky) |
Thanks for the feedback @jayaddison. The scraping guide is up next but will probably take me a little while to get to. |
|
||
### `author() -> str` | ||
|
||
Returns the author of the recipe. This is typically a person's name but can sometimes be an organization or the name of the website the recipe came from. If the recipe does not explicitly define an author, this should return the name of the website. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for documenting this - I agree with (alhough similar to the scraper goals section, also maybe influenced) this 👍
|
||
> [!IMPORTANT] | ||
> | ||
> Occasionally, a recipe will have steps that have new lines within them. At the moment, this library cannot distinguish between a newline within a step and a newline between steps. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on this note also
> [!NOTE] | ||
> Is the return type correct here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The int
type does have the benefit of being straightforward to emit and consume -- and in some cases it's a direct typecast of the value that was presented on the HTML page.
On the other hand, I suppose timedelta
could be a more structured alternative?
(I'm not sure I have strong opinions about this - maybe I should have, but I don't. the only strong opinion I have is that if we do change it, we should migrate all three *_time
fields during the v15
migration)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't sure if a recipe would ever specify a time to sub-minute resolution, in which case we would want to use float
.
int
is probably fine, so I'll leave it like this.
There might be a job to do to check how well the existing scrapers we have align with the return types I've documented here.
> [!WARNING] | ||
> | ||
> As far as I can tell, only the `AllRecipes` scraper actually implements this. Should we encourage more scrapers to use it, or should we remove the feature? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch 👍
I wonder.. I think we could check how many scrapers might support schema.org
reviews if we added it as a schema field. That could take us from one to many fairly quickly.
I'll try to do that analysis soon (note: if I haven't responded to this within a day, then it probably means I've forgotten about it).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder.. I think we could check how many scrapers might support
schema.org
reviews if we added it as a schema field. That could take us from one to many fairly quickly.
I had an idea for a way to run a naive, basic analysis. Based on the commands below, 25 of our scrapers look like they contain Review
objects as structured data:
$ grep -rwl '@type.*"Review"' tests/test_data | wc -l
25
...so maybe 10% of scrpers could return reviews
if we implement that there? That would seem valuable enough to open a feature request for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we can extract this from the schema, then I think this is would be worth doing. 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great - I've filed #870 to track that, and will take a look into more of the details soon.
I've added my first pass at the HTML scraping guide. It turned out to be harder to write than I first thought because I wasn't sure what to actually add to it. It will need some further tweaking, but if anybody has some suggestion I would appreciate them. |
|
||
The Beautiful Soup documentation for `select` is [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors-through-the-css-property). MDN has a guide on CSS selectors [here](https://developer.mozilla.org/en-US/docs/Learn/CSS/Building_blocks/Selectors). | ||
|
||
## Finding elements using a partial attribute |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The HTML scraping guide is looking pretty good to me already :)
I'd suggest calling this section "Finding elements using string pattern matching"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...or attribute pattern matching, I suppose (I see your point now that the matching is specific to HTML attributes)
Not something I do as often as I should, but an idea for something we could suggest in the developer guide: doing a manual test of the scraper on another recipe from the same website after developing it. To be honest that's probably also something we should do routinely during code review (again, I admit to not often doing that. sometimes I do). |
I'm marking this as ready for review and merge. I've not written the debugging guide yet, but I don't think I've going to have time in the near future to get to it, so I think it's better to get what is written merged and then add it in the future. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pushed a commit with a bunch of small adjustments to spelling, grammar and some syntax consistency changes but looks great to me!
Thanks for all the time spent on this it is hugely helpful (wish it was around when I started contributing!)
Brilliant, thanks both - I would generally agree that it makes sense to merge what we have here so far and incrementally update from there. Merging in a moment 👍 (@hhursev: would these be publishable using GitHub pages?) |
This PR will add some details developer guidance documentation. See #861 and #617 for discussion.
I plan to cover the following:
It will take me a bit of time to write all these, but I want to open this draft PR to get feedback as I do it. I do expect this to cause some discussion as the aim is to document things that have not necessarily been written down before.
All opinions and contributions are welcome.