diff --git a/README.rst b/README.rst
index b0158c3fd..5da9660a2 100644
--- a/README.rst
+++ b/README.rst
@@ -375,7 +375,7 @@ Contribute
 If you spot a design change (or something else) that makes the scraper unable to work for a given site - please fire an issue asap.
-If you are programmer PRs with fixes are warmly welcomed and acknowledged with a virtual beer.
+If you are a programmer, PRs with fixes are warmly welcomed and acknowledged with a virtual beer. You can find documentation on how to develop scrapers `here <https://github.com/hhursev/recipe-scrapers/tree/main/docs>`_.
 
 If you want a scraper for a new site added
 
@@ -395,6 +395,8 @@ If you want a scraper for a new site added
 - **ClassName**: The name of the new scraper class.
 - **URL**: The URL of an example recipe from the target site. The content will be stored in `test_data` to be used with the test class.
 
+  You can find a more detailed guide `here <https://github.com/hhursev/recipe-scrapers/blob/main/docs/how-to-develop-scraper.md>`_.
+
 For Devs / Contribute
 ---------------------
diff --git a/docs/README.md b/docs/README.md
new file mode 100644
index 000000000..71425a1c3
--- /dev/null
+++ b/docs/README.md
@@ -0,0 +1,20 @@
+# Scraper Development Guide
+
+## Goal
+
+This library has the goals of:
+
+* making recipe information **accessible**,
+* ensuring the author is **attributed** correctly,
+* representing the recipes **accurately** and **authentically**.
+
+Sometimes it is simple and straightforward to achieve all these goals, and sometimes it is more difficult (which is why this library exists). Where some interpretation or creativity is required to scrape a recipe, we should always keep those goals in mind. Occasionally, that might mean that we can't support a particular website.
+
+## Contents
+
+* [How To: Develop a New Scraper](how-to-develop-scraper.md)
+* In Depth Guides:
+  * [Debugging](in-depth-guide-debugging.md) (coming soon)
+  * [HTML Scraping](in-depth-guide-html-scraping.md)
+  * [Ingredient Groups](in-depth-guide-ingredient-groups.md)
+  * [Scraper Functions](in-depth-guide-scraper-functions.md)
diff --git a/docs/how-to-develop-scraper.md b/docs/how-to-develop-scraper.md
new file mode 100644
index 000000000..a918d69c9
--- /dev/null
+++ b/docs/how-to-develop-scraper.md
@@ -0,0 +1,226 @@
+# How To: Develop a New Scraper
+
+## 1. Find a website
+
+If you have found a website you want to scrape the recipes from, first of all check to see if the website is already supported.
+
+The project [README](https://github.com/hhursev/recipe-scrapers/blob/main/README.rst) has a list of the hundreds of websites already supported.
+
+You can also check from within Python:
+
+```python
+>>> from recipe_scrapers import SCRAPERS
+```
+
+`SCRAPERS` is a dict where the keys are the hostnames of the supported websites and the values are the scraper classes for each supported website.
+
+```python
+>>> from recipe_scrapers import SCRAPERS
+>>> SCRAPERS.get("bbcgoodfood.com")
+recipe_scrapers.bbcgoodfood.BBCGoodFood
+```
+
+It's a good idea to file an [issue](https://github.com/hhursev/recipe-scrapers/issues/new/choose) on GitHub to track support for the website, and to indicate whether you are working on it.
+
+## 2. Fork the recipe-scrapers repository and clone
+
+If this is your first time contributing to this repository then you will need to create a fork of the repository and clone it to your computer.
+
+To create a fork, click the Fork button near the top of the page on the project GitHub page. This will create a copy of the repository under your GitHub user.
+
+You can then clone the fork to your computer and set it up for development.
+
+**Clone the repository**, replacing `<username>` with your username
+
+```shell
+$ git clone git@github.com:<username>/recipe-scrapers.git
+$ cd recipe-scrapers
+```
+
+**Create a virtual environment, activate it and install dependencies**
+```shell
+$ python -m venv .venv --upgrade-deps
+$ source .venv/bin/activate
+$ pip install -r requirements-dev.txt
+$ pip install pre-commit
+$ pre-commit install
+```
+
+**Check that everything is working by running the tests**
+```shell
+$ python -m unittest
+```
+This will run all the tests for all the scrapers. You should not see any errors or failures.
+
+## 3. Identify a recipe and generate the scraper and test file
+
+To develop the scraper for the website, first identify a recipe. This will be used to create the test case that will validate that the scraper is working correctly.
+
+Next, find out if the website supports [Recipe Schema](https://schema.org/Recipe). If the website does support Recipe Schema, this will make creating the scraper straightforward. If not, supporting the site will be more complex but still possible.
+
+```python
+>>> from recipe_scrapers import scrape_me
+>>> scraper = scrape_me(URL, wild_mode=True)
+>>> scraper.schema.data
+{'@context': 'https://schema.org',
+ '@type': 'Recipe',
+ ...
+}
+```
+
+If Recipe Schema is available, then `scraper.schema.data` will return a dict containing information about the recipe.
+
+If Recipe Schema is not available, then `scraper.schema.data` will return an empty dict.
+
+Next, generate the scraper class and test files by running this command:
+
+```shell
+$ python generate.py <ClassName> <URL>
+```
+
+This will generate a file for the scraper named `<ClassName>`, containing basic code that you will need to modify. It will also download the recipe at `<URL>` and create a test case.
+
+You can find the generated scraper class in the `recipe_scrapers/` directory, in a file named the same as `<ClassName>` but in all lower case. The generated scraper class will look something like this:
+
+```python
+from ._abstract import AbstractScraper
+
+
+class ScraperName(AbstractScraper):
+    @classmethod
+    def host(cls):
+        return "websitehost.com"
+
+    def author(self):
+        return self.schema.author()
+
+    def title(self):
+        return self.schema.title()
+
+    def category(self):
+        return self.schema.category()
+
+    def total_time(self):
+        return self.schema.total_time()
+
+    def yields(self):
+        return self.schema.yields()
+
+    def image(self):
+        return self.schema.image()
+
+    def ingredients(self):
+        return self.schema.ingredients()
+
+    def instructions(self):
+        return self.schema.instructions()
+
+    def ratings(self):
+        return self.schema.ratings()
+
+    def cuisine(self):
+        return self.schema.cuisine()
+
+    def description(self):
+        return self.schema.description()
+```
+
+The generated scraper class will automatically be populated with functions that assume the Recipe Schema is available, regardless of whether it is or not.
+
+## 4. Add functionality to the scraper
+
+If the website supports Recipe Schema, then this is mostly done for you already. You can check if the output from each function is what you would expect from the recipe by using the scraper.
+
+```python
+>>> from recipe_scrapers import scrape_me
+>>> scraper = scrape_me(URL)
+>>> scraper.title()
+"..."
+>>> scraper.ingredients()
+[
+    "...",
+    "..."
+]
+# etc.
+```
+
+Some additional functionality may be required in the scraper functions to make the output match the recipe on the website.
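+
+For example, if a site appended its own name to every schema title, you could override `title()` to strip it off. This is only a sketch for a hypothetical site; the `" | Website Name"` suffix is invented for illustration:
+
+```python
+    def title(self):
+        # Hypothetical example: the schema title ends with " | Website Name",
+        # so remove that suffix before returning the title.
+        title = self.schema.title()
+        suffix = " | Website Name"
+        if title.endswith(suffix):
+            title = title[: -len(suffix)]
+        return title
+```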
+
+An in-depth guide on all the functions a scraper can support and what their output should be can be found [here](in-depth-guide-scraper-functions.md). The automatically generated scraper does not include all of these functions by default, so you may wish to add some of the additional functions listed if the website can support them.
+
+If the website does not support Recipe Schema, or the schema does not include all of the recipe information, then you can scrape the information out of the website HTML. Each scraper has a `bs4.BeautifulSoup` object made available in `self.soup` which contains the parsed HTML. This can be used to extract the recipe information needed.
+
+An example of a scraper that uses this approach is [Przepisy](https://github.com/hhursev/recipe-scrapers/blob/main/recipe_scrapers/przepisy.py).
+
+The [BeautifulSoup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.html) is a good resource for getting started with extracting information from HTML. A guide to common patterns and best practices used in this library can be found [here](in-depth-guide-html-scraping.md).
+
+Some helper functions are available in the `_utils.py` file. These are functions that are commonly needed when extracting information from HTML, such as `normalize_string()`.
+
+## 5. Create the test
+
+A test case was automatically created when the scraper class was created. It can be found in the `tests/` directory with the name `test_lowercaseclassname.py`. The HTML from the URL used to generate the scraper is also downloaded to the `tests/test_data/` directory.
+
+The generated test case will look something like this:
+
+```python
+from recipe_scrapers.newscraperclass import NewScraperClass
+from tests import ScraperTest
+
+
+class TestNewScraperScraper(ScraperTest):
+
+    scraper_class = NewScraperClass
+
+    def test_host(self):
+        self.assertEqual("websitehost.com", self.harvester_class.host())
+
+    def test_author(self):
+        self.assertEqual(None, self.harvester_class.author())
+
+    def test_title(self):
+        self.assertEqual(None, self.harvester_class.title())
+
+    def test_category(self):
+        self.assertEqual(None, self.harvester_class.category())
+
+    def test_total_time(self):
+        self.assertEqual(None, self.harvester_class.total_time())
+
+    def test_yields(self):
+        self.assertEqual(None, self.harvester_class.yields())
+
+    def test_image(self):
+        self.assertEqual(None, self.harvester_class.image())
+
+    def test_ingredients(self):
+        self.assertEqual(None, self.harvester_class.ingredients())
+
+    def test_instructions(self):
+        self.assertEqual(None, self.harvester_class.instructions())
+
+    def test_ratings(self):
+        self.assertEqual(None, self.harvester_class.ratings())
+
+    def test_cuisine(self):
+        self.assertEqual(None, self.harvester_class.cuisine())
+
+    def test_description(self):
+        self.assertEqual(None, self.harvester_class.description())
+```
+
+You will need to modify each of these functions to replace `None` with the correct output from the scraper for the recipe at the URL you used to generate the test. You should not do any processing of the scraper output in the test case; any processing belongs in the scraper itself.
+
+This test case should only have tests for the functions that the scraper implements. You may need to add or remove tests depending on the implementation of the scraper.
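+
+For example, a couple of filled-in tests might look like this (the expected values here are hypothetical; yours should be copied from the recipe page you used to generate the test):
+
+```python
+    def test_title(self):
+        self.assertEqual("Monster cupcakes", self.harvester_class.title())
+
+    def test_total_time(self):
+        self.assertEqual(50, self.harvester_class.total_time())
+```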
+
+You can check whether your scraper is passing the tests by running
+
+```shell
+$ python -m unittest tests.test_myscraper
+```
+
+> [!NOTE]
+> It is also recommended that you manually test the scraper with a couple of different recipes from the website, to check that there aren't any special cases the scraper will need to handle. You don't need to create test cases for each of these.
+
+## 6. Open a pull request
+
+Once you have finished developing the scraper and test case, you can commit the files to git and push them to GitHub. You should also update the README.rst to list the site, alphabetically, under the [Scrapers available for:](https://github.com/hhursev/recipe-scrapers#scrapers-available-for) header.
+
+After you have pushed the changes to GitHub, you can open a pull request in the [recipe-scrapers project](https://github.com/hhursev/recipe-scrapers/pulls). Your changes will undergo some automatic tests (no different from running all the tests in the project, but this time on all supported platforms and using all supported Python versions) and be reviewed by other project contributors.
diff --git a/docs/in-depth-guide-debugging.md b/docs/in-depth-guide-debugging.md
new file mode 100644
index 000000000..e472804ab
--- /dev/null
+++ b/docs/in-depth-guide-debugging.md
@@ -0,0 +1,6 @@
+# In Depth Guide: Debugging
+
+> **Draft**
+> This in depth guide is intended to give more guidance on debugging scrapers during development and handling exceptions.
+>
+> To be populated at a later date.
diff --git a/docs/in-depth-guide-html-scraping.md b/docs/in-depth-guide-html-scraping.md
new file mode 100644
index 000000000..b302b01ea
--- /dev/null
+++ b/docs/in-depth-guide-html-scraping.md
@@ -0,0 +1,178 @@
+# In Depth Guide: HTML Scraping
+
+The preferred method of scraping recipe information from a web page is to use the schema.org Recipe data. This is a machine-readable, structured format intended to provide a standardised method of extracting information. However, whilst most recipe websites use the schema.org Recipe format, not all do, and for those websites that do, it does not always include all the information we are looking for. In these cases, we can use HTML scraping to extract the information from the HTML markup.
+
+## `soup`
+
+Each scraper has a `BeautifulSoup` object that can be accessed using the `self.soup` attribute. The `BeautifulSoup` object is a representation of the web page HTML that has been parsed into a format that we can query and extract information from.
+
+The [Beautiful Soup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is the best resource for learning how to use `BeautifulSoup` objects to interact with HTML documents.
+
+This guide covers a number of common patterns that are used in this library.
+
+## Finding a single element
+
+The `self.soup.find()` function returns the first element matching the arguments. This is useful if you are trying to extract some information that should only occur once, for example the prep time or total time.
+
+```python
+# To find a particular element
+self.soup.find("h1")  # Returns the first h1 element
+
+# To find an element with a particular class (note the underscore at the end of class_)
+self.soup.find(class_="total-time")  # Returns the first element with the total-time class
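+
+# You can also match on arbitrary attributes via the attrs dict
+# (the attribute name and value here are purely illustrative)
+self.soup.find("div", attrs={"data-attribute": "value"})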
+
+# To find an element with a particular ID
+self.soup.find(id="total-time")
+
+# You can include multiple arguments to be more specific
+# To find the first h1 element with "title" class
+self.soup.find("h1", class_="title")
+```
+
+`self.soup.find()` returns a `bs4.element.Tag` object. Usually we just want the text from the selected element, and the best way to get that is to use `.get_text()`.
+
+```python
+title_tag = self.soup.find("h1")  # bs4.element.Tag object
+title_text = title_tag.get_text()  # str
+```
+
+`.get_text()` will get the text from all child elements, as it would appear in your browser, so there is no need to iterate through all the children, call `.get_text()` on each one, then join the results afterwards.
+
+As an example, consider one of the ingredients in [this recipe](https://rainbowplantlife.com/instant-pot-jackfruit-curry/#wprm-recipe-container-5618). The markup looks like this:
+
+```html
+<li class="wprm-recipe-ingredient">
+  <span class="wprm-recipe-ingredient-amount">1</span>
+  <span class="wprm-recipe-ingredient-unit">tablespoon</span>
+  <span class="wprm-recipe-ingredient-name">coconut oil</span>
+  <span class="wprm-recipe-ingredient-notes">(or oil of choice)</span>
+</li>
+```
+
+We can select this element using its tag and class (we're pretending this recipe only has this one ingredient), and extract the text like so:
+
+```python
+ingredient_tag = self.soup.find("li", class_="wprm-recipe-ingredient")
+ingredient_text = ingredient_tag.get_text()
+# '1 tablespoon coconut oil (or oil of choice)'
+```
+
+The Beautiful Soup documentation for `find` is [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find).
+
+### Normalizing strings
+
+A convenience function called `normalize_string()` is provided in the `_utils` package. This function will convert any characters escaped for HTML to their actual character (e.g. `&amp;` to `&`) and remove unnecessary white space. It is best practice to always use this when extracting text from the HTML.
+
+```python
+from ._utils import normalize_string
+
+# ...
+ingredient_tag = self.soup.find("li", class_="wprm-recipe-ingredient")
+ingredient_text = normalize_string(ingredient_tag.get_text())
+```
+
+### Getting yields
+
+A convenience function called `get_yields()` is provided in the `_utils` package. This function accepts a `str` or `bs4.element.Tag` and will return the yield, handling many of the common formats yields can appear in and normalizing them to a standard format.
+
+```python
+from ._utils import get_yields
+
+# ...
+yield_tag = self.soup.find(class_="wprm-recipe-servings")
+yield_text = get_yields(yield_tag)
+# or
+yield_text = get_yields(yield_tag.get_text())
+# both return '4 servings'
+```
+
+### Getting times
+
+A convenience function called `get_minutes()` is provided in the `_utils` package. This function accepts a `str` or `bs4.element.Tag` and will return the number of minutes as an `int`. This function handles a number of common formats that times can be expressed in.
+
+```python
+from ._utils import get_minutes
+
+# ...
+prep_time_tag = self.soup.find(class_="wprm-recipe-prep_time-minutes")
+prep_time_value = get_minutes(prep_time_tag)
+# or
+prep_time_value = get_minutes(prep_time_tag.get_text())
+# both return 25
+```
+
+## Finding multiple elements
+
+Some information in a recipe, such as the ingredients or instructions, comes in the form of lists, where we need to find multiple elements with the same attributes. We can use `self.soup.find_all()` for this. `find_all` takes the same arguments as `find`, but returns a list of `bs4.element.Tag` objects containing all the matching elements.
+
+Using the same site as above, we can find all the ingredients like so:
+
+```python
+ingredient_tags = self.soup.find_all("li", class_="wprm-recipe-ingredient")
+ingredients_text = [normalize_string(tag.get_text()) for tag in ingredient_tags]
+"""
+[
+    '2 (20-ounce // 565g) cans of jackfruit (in water or brine)*',
+    '1 tablespoon coconut oil (or oil of choice)',
+    '1 1/2 teaspoons cumin seeds',
+    '1 1/2 teaspoons black mustard seeds (can substitute brown mustard seeds)',
+    '1 large yellow onion, diced',
+    ...
+]
+"""
+```
+
+The Beautiful Soup documentation for `find_all` is [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all).
+
+## Using CSS selectors
+
+If you are already familiar with CSS selectors, then you can use `select()` to achieve the same result as `find_all()`, or `select_one()` to achieve the same result as `find()`.
+
+```python
+ingredient_tags = self.soup.select("li.wprm-recipe-ingredient")  # Match all li elements with the wprm-recipe-ingredient class
+```
+
+The Beautiful Soup documentation for `select` is [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors-through-the-css-property). MDN has a guide on CSS selectors [here](https://developer.mozilla.org/en-US/docs/Learn/CSS/Building_blocks/Selectors).
+
+## Finding elements using a partial attribute
+
+Sometimes you might want to find elements using a part of an attribute. This is particularly helpful for websites that automatically generate CSS in a way that appends a random string to the end of class names.
+
+An example of this is [cooking.nytimes.com](https://cooking.nytimes.com/recipes/1024605-cumin-and-cashew-yogurt-rice). If we wanted to select the yield element from this page, we could use the class `ingredients_recipeYield__DN65p`. However, when the website is updated in the future, the `DN65p` at the end of the class name is likely to change, so we only want to use part of the class name.
+
+There are two ways we can do this:
+
+### Using `find`
+
+Instead of using a string in the arguments we pass to `find`, we can use a regular expression.
+
+```python
+import re
+
+yield_tag = self.soup.find(class_=re.compile("ingredients_recipeYield"))
+yield_text = yield_tag.get_text()
+# Yield:4 servings
+```
+
+### Using `select`
+
+CSS also supports partial attribute matching. MDN has a useful guide [here](https://developer.mozilla.org/en-US/docs/Web/CSS/Attribute_selectors).
+
+```python
+# Select elements where class contains 'ingredients_recipeYield'
+yield_tags = self.soup.select("[class*='ingredients_recipeYield']")
+
+# Select elements where class starts with 'ingredients_recipeYield'
+yield_tags = self.soup.select("[class^='ingredients_recipeYield']")
+```
diff --git a/docs/in-depth-guide-ingredient-groups.md b/docs/in-depth-guide-ingredient-groups.md
new file mode 100644
index 000000000..ebdf934d6
--- /dev/null
+++ b/docs/in-depth-guide-ingredient-groups.md
@@ -0,0 +1,123 @@
+# In Depth Guide: Ingredient Groups
+
+Sometimes a website will format lists of ingredients using groups, where each group contains the ingredients needed for a particular aspect of the recipe. Recipe Schema has no way to represent these groupings, so all of the ingredients are presented as a single list and information about the groupings is lost.
+
+Some examples of recipes that have ingredient groups are:
+
+* https://cooking.nytimes.com/recipes/1024570-green-salad-with-warm-goat-cheese-salade-de-chevre-chaud
+* https://lifestyleofafoodie.com/air-fryer-frozen-french-fries
+* https://www.bbcgoodfood.com/recipes/monster-cupcakes
+
+Not all websites use ingredient groups, and those that do may not use them for every recipe.
+
+This library allows a scraper to return the ingredients in the groups defined in the recipe, if the functionality is added to that scraper.
+
+## `ingredient_groups()`
+
+The `ingredient_groups()` function returns a list of `IngredientGroup` objects. Each `IngredientGroup` is a dataclass that represents a group of ingredients:
+
+```python
+@dataclass
+class IngredientGroup:
+    ingredients: List[str]
+    purpose: Optional[str] = None  # this group of ingredients is {purpose} (e.g. "For the dressing")
+```
+
+The *purpose* is the ingredient group heading, such as *"For the dressing"*, *"For the sauce"* etc. The *ingredients* list contains the ingredients that make up that group.
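+
+For example, a group from a salad recipe might look like this (the values are invented for illustration):
+
+```python
+IngredientGroup(
+    purpose="For the dressing",
+    ingredients=["2 tbsp olive oil", "1 tbsp lemon juice"],
+)
+```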
+
+This dataclass is defined in `_grouping_utils.py` and should be imported from there.
+
+```python
+from ._grouping_utils import IngredientGroup
+```
+
+The `ingredient_groups()` function has a default implementation in the `AbstractScraper` class that returns a single `IngredientGroup` object with `purpose` set to `None` and `ingredients` set to the output from the scraper's `ingredients()` function.
+
+Adding ingredient group support to a scraper involves overriding the `ingredient_groups()` function in that scraper. There are three important points to consider:
+
+1. The schema.org Recipe format does not support groupings - so scraping from the HTML is required in the implementation.
+2. The ingredients found in `ingredients()` and `ingredient_groups()` should be the same, because we're presenting the same set of ingredients, just in a different way. There can sometimes be minor differences between the ingredients in the schema and the ingredients in the HTML, which need to be handled.
+3. Not all recipes on a website will use ingredient groups, so the implementation must degrade gracefully in cases where groupings aren't available. For recipes that don't have ingredient groups, the output should be the same as the default implementation (i.e. a single `IngredientGroup` with `purpose=None` and `ingredients=ingredients()`).
+
+In many cases the structure in which ingredients and group headings appear in the HTML is very similar across websites. Some helper functions have been developed to make the implementation easier.
+
+## _grouping_utils.py
+
+The `_grouping_utils.py` file contains a helper function (`group_ingredients(...)`) that will handle the extraction of ingredient groups and their headings from the HTML, make sure the ingredients in the groups match those returned from `.ingredients()`, and then return the groups.
+
+The `group_ingredients()` function takes four arguments:
+
+```python
+def group_ingredients(
+    ingredients_list: List[str],
+    soup: BeautifulSoup,
+    group_heading: str,
+    group_element: str,
+) -> List[IngredientGroup]:
+```
+
+* `ingredients_list` is the output from the `.ingredients()` function of the scraper class. This is used to make the ingredients found in the HTML match those returned by `.ingredients()`.
+* `soup` is the `BeautifulSoup` object for the scraper. The ingredient groups are extracted from this.
+* `group_heading` is the CSS selector for the group headings. This selector must only match the group headings in the recipe HTML.
+* `group_element` is the CSS selector for the ingredients. This selector must only match the ingredients in the recipe HTML.
+
+### Example
+
+Many recipe blogs use WordPress and the WP Recipe Maker plugin. This means they often use the same HTML elements and CSS classes to represent the same things. One such example is https://rainbowplantlife.com.
+
+If we look at the recipe https://rainbowplantlife.com/vegan-pasta-salad/:
+
+The group headings in this recipe are all `h4` headings inside an element with the class `wprm-recipe-ingredient-group`. Therefore we can select all ingredient group headings with the selector: `.wprm-recipe-ingredient-group h4`
+
+The ingredients are all elements with the class `wprm-recipe-ingredient`. Therefore we can select all ingredients with the selector: `.wprm-recipe-ingredient`
+
+The implementation in the scraper looks like this:
+
+```python
+from ._abstract import AbstractScraper
+from ._grouping_utils import group_ingredients
+
+
+class RainbowPlantLife(AbstractScraper):
+    ...
+
+    def ingredient_groups(self):
+        return group_ingredients(
+            self.ingredients(),
+            self.soup,
+            ".wprm-recipe-ingredient-group h4",
+            ".wprm-recipe-ingredient",
+        )
+```
+
+That is all that is required to add support for ingredient groups to this scraper.
+
+Some other examples of scrapers that support ingredient groups are:
+
+* [BudgetBytes](https://github.com/hhursev/recipe-scrapers/blob/main/recipe_scrapers/budgetbytes.py)
+* [NYTimes](https://github.com/hhursev/recipe-scrapers/blob/main/recipe_scrapers/nytimes.py)
+* [PickUpLimes](https://github.com/hhursev/recipe-scrapers/blob/main/recipe_scrapers/pickuplimes.py)
+* [RealFoodTesco](https://github.com/hhursev/recipe-scrapers/blob/main/recipe_scrapers/realfoodtesco.py)
+
+### What if `group_ingredients()` doesn't work?
+
+The `group_ingredients` function relies on being able to identify all the group headings with a single CSS selector and all the ingredients with a single CSS selector. However, this is not always possible - it depends on how the website lays out its HTML.
+
+In these cases, supporting ingredient groups may still be possible. The `group_ingredients()` helper is only that: an optional helper. You can always implement custom grouping logic yourself by overriding `ingredient_groups()` directly in your scraper if you can't find suitable CSS selectors for the ingredients and groups.
+
+An example of a scraper that supports ingredient groups without using the `group_ingredients()` helper is [NIHHealthyEating](https://github.com/hhursev/recipe-scrapers/blob/main/recipe_scrapers/nihhealthyeating.py).
+
+## Tests
+
+When adding ingredient group support to a scraper, we need to create two test cases:
+
+- A test case for a recipe that **does not** group the ingredients
+- A test case for a recipe that **does** group the ingredients
+
+This is because not all the recipes on a website will have ingredient groups, and the scraper does not know beforehand whether a given recipe does or not. Therefore, the scraper must handle both cases.
+
+In addition to the usual tests a scraper requires, the tests also need to check that the groups and the ingredients in each group are correct for the recipe. For the test case where there are no ingredient groups, this should check for a single `IngredientGroup` object with `purpose=None` and `ingredients` set to the output from the scraper's `ingredients()` function. For the test case with ingredient groups, the output should match the groups in the recipe.
+
+Each test case will automatically inherit a test that checks to make sure the same ingredients are found in `.ingredients()` and in the groups returned from `.ingredient_groups()`, so there is no need to write this test in the scraper test cases.
diff --git a/docs/in-depth-guide-scraper-functions.md b/docs/in-depth-guide-scraper-functions.md
new file mode 100644
index 000000000..82ed2be05
--- /dev/null
+++ b/docs/in-depth-guide-scraper-functions.md
@@ -0,0 +1,306 @@
+# In Depth Guide: Scraper Functions
+
+Each website scraper has a number of functions that return information about the recipe that has been scraped. Due to the variability in how recipes are written, not all of these functions are applicable to every recipe. They fall into three categories:
+
+1. Mandatory functions
+
+   These functions can be expected to be available for all Scraper classes and, combined, provide the majority of the information for a recipe.
+
+2. Inherited functions
+
+   These functions are always available for all Scraper classes.
+   They are implemented in the `AbstractScraper` base class and rarely require overriding in the Scraper class.
+
+3. Optional functions
+
+   These functions provide extra information about a recipe on the particular websites that support them.
+
+All of the examples below come from https://www.bbcgoodfood.com/recipes/monster-cupcakes.
+
+```py
+>>> from recipe_scrapers import scrape_me
+>>> scraper = scrape_me("https://www.bbcgoodfood.com/recipes/monster-cupcakes")
+```
+
+## Mandatory functions
+
+### `author() -> str`
+
+Returns the author of the recipe. This is typically a person's name but can sometimes be an organization or the name of the website the recipe came from. If the recipe does not explicitly define an author, this should return the name of the website.
+
+```py
+>>> scraper.author()
+'Good Food team'
+```
+
+### `host() -> str`
+
+Returns the host of the website the Scraper class is for. This is a constant `str` set in each Scraper class.
+
+```py
+>>> scraper.host()
+'bbcgoodfood.com'
+```
+
+### `description() -> str`
+
+Returns a description of the recipe. This is normally a sentence or short paragraph describing the recipe. Often the website defines the description, but sometimes it has to be inferred from the page content.
+
+```py
+>>> scraper.description()
+'Let your little monsters do their worst, decorating these spooky Halloween treats'
+```
+
+### `image() -> str`
+
+Returns the URL to the main image associated with the recipe, usually a photograph of the completed recipe.
+
+```py
+>>> scraper.image()
+'https://images.immediate.co.uk/production/volatile/sites/30/2020/08/recipe-image-legacy-id-405483_12-cee017a.jpg?resize=768,574'
+```
+
+### `ingredients() -> List[str]`
+
+Returns the ingredients needed to make the recipe as a `list` of `str`. Each element of the list is usually a single sentence stating an ingredient, the required amount and any additional comments. The elements of the list should mirror the ingredients written on the website and should not include non-ingredient sentences such as sub-headings.
+
+```py
+>>> scraper.ingredients()
+[
+    '250g self-raising flour',
+    '25g cocoa powder',
+    '175g light muscovado sugar',
+    '85g unsalted butter , melted',
+    '5 tbsp vegetable or sunflower oil',
+    '150g pot fat-free natural yogurt',
+    '1 tsp vanilla extract',
+    '3 large eggs',
+    '85g unsalted butter , softened',
+    '1 tbsp milk',
+    '½ tsp vanilla extract',
+    '200g icing sugar , sifted',
+    'food colourings (optional)',
+    'sweets and sprinkles, to decorate'
+]
+```
+
+### `instructions() -> str`
+
+Returns a single `str` containing all instruction steps. Where a recipe has multiple instructions, each step is separated in the returned `str` by a newline character (`\n`).
+
+```py
+>>> scraper.instructions()
+'Heat oven to 190C/170C fan/gas 5 and line a 12-hole muffin tin with deep cake cases. Put all the cake ingredients into a large bowl and beat together with electric hand beaters until smooth. Spoon the mix into the cases, then bake for 20 mins until risen and a skewer inserted into the middle comes out dry. Cool completely on a rack. Can be made up to 3 days ahead and kept in an airtight container, or frozen for up to 1 month.\nFor the frosting, work the butter, milk and vanilla into the icing sugar until creamy and pale. Colour with food colouring, if using, then create your own gruesome monster faces using sweets and sprinkles.'
+```
+
+> [!IMPORTANT]
+>
+> Occasionally, a recipe will have steps that have new lines within them.
+> At the moment, this library cannot distinguish between a newline within a step and a newline between steps.
+
+### `title() -> str`
+
+Returns the title of the recipe, usually a short sentence or phrase.
+
+```py
+>>> scraper.title()
+'Monster cupcakes'
+```
+
+### `total_time() -> int`
+
+Returns the total time required to complete the recipe, in minutes.
+
+```py
+>>> scraper.total_time()
+50
+```
+
+### `yields() -> str`
+
+Returns the number of items or servings the recipe will make. This `str` includes the quantity and unit of the yield, for example: 4 servings, 6 items, 12 cookies.
+
+```py
+>>> scraper.yields()
+'12 items'
+```
+
+## Inherited functions
+
+### `canonical_url() -> str`
+
+Returns the canonical URL for the scraped recipe. The canonical URL is the direct URL (defined by the website) at which the recipe can be found. This URL will generally not contain any query parameters or fragments, except those required to load the recipe.
+
+```py
+>>> scraper.canonical_url()
+'https://www.bbcgoodfood.com/recipes/monster-cupcakes'
+```
+
+### `ingredient_groups() -> List[IngredientGroup]`
+
+Returns a `list` of groups of ingredients. Some recipes on some websites will present the ingredients in groups, where each group contains the ingredients required for a particular aspect of the recipe.
+
+Each element of the returned `list` is an `IngredientGroup` object. An `IngredientGroup` object has a `purpose` (for example, *for the sauce*) and a `list` of ingredients.
+
+> [!IMPORTANT]
+>
+> All scrapers inherit this function. By default, it returns a single group with a purpose of `None` and the ingredients set to the output of `ingredients()`. This function should be overridden in scrapers for websites that use ingredient groups. See [this guide](in-depth-guide-ingredient-groups.md) for help on implementing this.

```py
>>> scraper.ingredient_groups()
[
    IngredientGroup(
        ingredients=[
            '250g self-raising flour',
            '25g cocoa powder',
            '175g light muscovado sugar',
            '85g unsalted butter , melted',
            '5 tbsp vegetable or sunflower oil',
            '150g pot fat-free natural yogurt',
            '1 tsp vanilla extract',
            '3 large eggs'
        ],
        purpose=None),
    IngredientGroup(
        ingredients=[
            '85g unsalted butter , softened',
            '1 tbsp milk',
            '½ tsp vanilla extract',
            '200g icing sugar , sifted',
            'food colourings (optional)',
            'sweets and sprinkles, to decorate'
        ],
        purpose='For the frosting and decorating')
]
```
+
+### `instructions_list() -> List[str]`
+
+Returns a `list` of instructions. This `list` is generated by splitting the output of `instructions()` on newline characters.
+
+```py
+>>> scraper.instructions_list()
+[
+    'Heat oven to 190C/170C fan/gas 5 and line a 12-hole muffin tin with deep cake cases. Put all the cake ingredients into a large bowl and beat together with electric hand beaters until smooth. Spoon the mix into the cases, then bake for 20 mins until risen and a skewer inserted into the middle comes out dry. Cool completely on a rack. Can be made up to 3 days ahead and kept in an airtight container, or frozen for up to 1 month.',
+    'For the frosting, work the butter, milk and vanilla into the icing sugar until creamy and pale. Colour with food colouring, if using, then create your own gruesome monster faces using sweets and sprinkles.'
+]
+```
+
+### `language() -> str`
+
+Returns the language of the recipe page, as defined within the page's HTML.
+This is typically a two-letter BCP 47 language code, such as 'en' for English or 'de' for German,
+but may also include the dialect or variation, such as 'en-US' for American English.
+
+For a comprehensive list of BCP 47 language codes, refer to this GitHub Gist:
+https://gist.github.com/typpo/b2b828a35e683b9bf8db91b5404f1bd1
+
+```py
+>>> scraper.language()
+'en'
+```
+
+### `links() -> List[Dict[str, str]]`
+
+Returns a `list` of all links found in the page HTML, defined in `<a>` elements. For each link, the attributes of the HTML element are returned as a `dict`.
+
+```py
+>>> scraper.links()
+[
+    {
+        'class': ['popup-toggle-button'],
+        'aria-label': 'Toggle main-navigation popup',
+        'aria-haspopup': 'true',
+        'href': '#main-navigation-popup'
+    },
+    ...  # etc.
+]
+```
+
+### `site_name() -> str | None`
+
+Returns the website name, as defined in the page's HTML. If the page does not define this, this function returns `None`.
+
+```py
+>>> scraper.site_name()
+None
+```
+
+### `to_json() -> Dict[str, Any]`
+
+Returns the output of all functions implemented by this scraper as a `dict`.
+
+```py
+>>> scraper.to_json()
+{
+    'author': 'Good Food team',
+    'canonical_url': 'https://www.bbcgoodfood.com/recipes/monster-cupcakes',
+    'category': 'Treat',
+    ...  # etc.
+}
+```
+
+## Optional functions
+
+### `category() -> str`
+
+Returns a semi-structured field that can contain a mix of cuisine type (for example, country names), mealtime (breakfast/dinner/etc.) and dietary properties (gluten-free, vegetarian). The value is defined by the website, so it may overlap with other scraper functions (e.g. `cuisine()`).
+
+```py
+>>> scraper.category()
+'Treat'
+```
+
+### `cook_time() -> int`
+
+Returns the time to cook the recipe in minutes, excluding any time to prepare ingredients.
+
+```py
+>>> scraper.cook_time()
+20
+```
+
+### `cuisine() -> str`
+
+Returns the cuisine of the recipe.
+
+```py
+>>> scraper.cuisine()
+# Not implemented!
+```
+
+### `nutrients() -> Dict[str, str]`
+
+Returns a `dict` of nutrition information. Each nutrition item is usually given per unit of yield, e.g. per serving, per item. The keys of the `dict` are the nutrients (including calories) and the values are the amount of that nutrient, including the unit.
+
+```py
+>>> scraper.nutrients()
+{
+    'calories': '389 calories',
+    'fatContent': '19 grams fat',
+    'saturatedFatContent': '9 grams saturated fat',
+    'carbohydrateContent': '53 grams carbohydrates',
+    'sugarContent': '36 grams sugar',
+    'fiberContent': '1 grams fiber',
+    'proteinContent': '5 grams protein',
+    'sodiumContent': '0.3 milligram of sodium'
+}
+```
+
+### `prep_time() -> int`
+
+Returns the time to prepare the ingredients for the recipe in minutes.
+
+```py
+>>> scraper.prep_time()
+30
+```
+
+### `ratings() -> float`
+
+Returns the recipe rating. When available, this is usually the average of all the ratings given to the recipe on the website.
+
+```py
+>>> scraper.ratings()
+# Not implemented!
+```
+
+### `reviews() -> List[Dict[str, str]]`
+
+Returns a `list` of reviews about the recipe from the website. Each review is a `dict` containing the reviewer's name (`str`) and their review (`str`).
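+
+For a website where this is implemented, the output has this general shape (the key names, reviewer names and review text below are invented for illustration):
+
+```py
+>>> scraper.reviews()
+[
+    {'author': 'Jane', 'review': 'Great recipe - the kids loved decorating them!'},
+    {'author': 'Sam', 'review': 'Came out well, though I reduced the sugar a little.'}
+]
+```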