# Functions
```{r, echo = FALSE}
options(width = 60)
```
So far, we have been writing code to handle specific situations, such as subsetting a single data set, often using other people's functions. When you want to reuse code, it is unwise to simply copy and paste it and make minor changes to handle the new data. Instead, we want something that can take multiple values and perform the same action (subset, aggregate, make a plot, webscrape, etc.) on those values. We've used lots of other people's functions throughout this book, and in this chapter we'll learn how to create our own.
Think of a function like a stapler - you put the paper in, push down, and it staples the paper together. It doesn't matter what papers you are using; it always staples them together. If you needed to buy a new stapler every time you needed to staple something (i.e., copying and pasting code), you'd quickly have way too many staplers (and waste a bunch of money).
An important benefit is that you can use a function again and again to help solve other problems. Let's imagine you need to clean crime data from 10 different cities. Most cities' crime data is very similar, so writing the code for one gets you most of the way there for the other 9 cities. Your code will probably work for the other cities with only minor changes (for example, column names probably differ across agencies). However, copying and pasting code quickly becomes a terrible solution - functions work much better. If you copied and pasted the code 10 times and then found a bug, you'd have to fix the bug 10 times. With a function, you would change the code once.
## A simple function
We'll start with a simple function that takes a number and returns that number plus the number 2.
```{r}
add_2 <- function(number) {
  number <- number + 2
  return(number)
}
```
The syntax (how we write it) of a function is:
function_name <- function(parameters)
{
code
return(output)
}
There are five essential parts of a function:
+ function_name - This is just the name we give to the function. It can be anything but, as when making other objects, choose a name that makes it easy to remember what the function does.
+ parameters - Here is where we say what goes into the function. In most cases you will want to put some data in and expect something new out. For example, for the function `mean()` you put a vector of numbers in the () section, and it returns the mean of those numbers. Here is also where you can put any options that affect how the code is run.
+ code - This is the code you write to do the thing you want the function to do. In the above example our code is `number <- number + 2`. For any number inputted, our code adds 2 to it and assigns the result back into the object *number*.
+ return - This is something new in this book. Here you use the `return()` function, and inside the () you put the object you want to be outputted. In our example we have *number* inside the `return()` as that's what we want to come out of the function. It is not always necessary to end your function with `return()`, but it is highly recommended to do so to make sure you're outputting what you want to output.
+ The final piece is the structure of your function. After the function_name (whatever it is you call it) you always need the text `<- function()` where the parameters (if any) are in the (). After the closing parenthesis put a `{`, and at the very end of the function, after the `return()`, close those squiggly brackets with a `}`. The `<- function()` tells R that you are making a function rather than some other type of object. And the `{` and `}` tell R that all the code in between is part of that function.
Our function here adds 2 to any number we input.
```{r}
add_2(2)
```
```{r}
add_2(5)
```
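As noted above, `return()` is not strictly required: R returns the value of the last expression evaluated inside a function. The sketch below compares the two styles. The names `add_2_explicit` and `add_2_implicit` are made up for this illustration and are not used elsewhere in the book.

```{r}
# Explicit version: the value is handed back with return()
add_2_explicit <- function(number) {
  number <- number + 2
  return(number)
}

# Implicit version: the last evaluated expression is what comes out
add_2_implicit <- function(number) {
  number + 2
}

# Both versions give the same answer for the same input
add_2_explicit(5)
add_2_implicit(5)
identical(add_2_explicit(5), add_2_implicit(5))
```

Both styles work; writing `return()` simply makes it easier to see at a glance what the function outputs.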
## Adding parameters
Let's add a single parameter, which multiplies the result by 5 if selected.
```{r}
add_2 <- function(number, times_5 = FALSE) {
  number <- number + 2
  return(number)
}
```
Now we have added a parameter called `times_5` to the () part of the function and set it to be FALSE by default. Right now it doesn't do anything, so we need to add code to say what happens if it is TRUE (remember that in R, TRUE and FALSE must always be written in all capital letters and without quotes).
```{r}
add_2 <- function(number, times_5 = FALSE) {
  number <- number + 2
  if (times_5 == TRUE) {
    number <- number * 5
  }
  return(number)
}
```
Now our code says that if the parameter `times_5` is TRUE, then do the thing in the squiggly brackets `{}` below. Note that we use the same squiggly brackets as when making the entire function. That just tells R that the code in those brackets belongs together. Let's try out our function.
```{r}
add_2(2)
```
It returns 4, as expected. Since the parameter `times_5` defaults to FALSE, we don't need to specify that parameter if we want it to stay FALSE. When we don't tell the function that we want it to be TRUE, the code in our "if statement" doesn't run. When we set `times_5` to TRUE, it runs that code.
```{r}
add_2(2, times_5 = TRUE)
```
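The same default-value mechanism works for any kind of parameter, not only TRUE/FALSE switches. The sketch below is a hypothetical generalization of `add_2()`: the function name `add_and_multiply` and its parameters `add` and `multiply_by` are assumptions made for illustration and are not part of the original example. The defaults reproduce what `add_2()` does, and each default can be overridden by name.

```{r}
# Hypothetical generalization of add_2(): both the amount added and
# the multiplier have defaults, so only `number` is required
add_and_multiply <- function(number, add = 2, multiply_by = 1) {
  number <- number + add
  number <- number * multiply_by
  return(number)
}

add_and_multiply(2)                   # uses both defaults: (2 + 2) * 1
add_and_multiply(2, multiply_by = 5)  # overrides one default: (2 + 2) * 5
add_and_multiply(2, add = 10)         # overrides the other: (2 + 10) * 1
```

Naming the parameter in the call (as in `multiply_by = 5`) means you only have to mention the defaults you want to change.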
## Making a function to scrape recipes {#recipes-function}
In Section \@ref(scraping-one-page) we wrote some code to scrape data from the website [All Recipes](https://www.allrecipes.com/) for a recipe. We are going to turn that code into a function here. The benefit is that our input to the function will be a URL, and then it will print out the ingredients and directions for that recipe. If we want multiple recipes (and for webscraping you usually will want to scrape multiple pages), we just change the URL we input without changing the code at all.
We used the `rvest` package so we need to tell R we want to use it again.
```{r}
library(rvest)
```
Let's start by writing a shell of the function - everything but the code. We can call it *scrape_recipes* (though any name would work), add in the `<- function()`, and put URL in the () since the input for the function is the URL of the page with the recipe we want. For this function we won't return any object; we will just print things to the console, so we don't need `return()`. Don't forget the `{` after the end of the `function()` and the `}` at the very end of the function.
```{r}
scrape_recipes <- function(URL) {
}
```
Now we need to add the code that takes the URL, scrapes the website, and assigns the ingredients part of the page to an object called *ingredients* and the directions part to an object called *directions*. Since we have the code from an earlier lesson, we can copy and paste that code into the function and make a small change to get a working function.
```{r}
scrape_recipes <- function(URL) {
  brownies <- read_html("https://www.allrecipes.com/recipe/25080/mmmmm-brownies/")
  ingredients <- html_nodes(brownies, ".ingredients-item-name")
  ingredients <- html_text(ingredients)
  directions <- html_nodes(brownies, ".instructions-section-item")
  directions <- html_text(directions)
  directions <- trimws(directions)
}
```
The part inside the () of `read_html()` is the URL of the page we want to scrape. This is the part of the function that will change based on our input. We want whatever input is in the URL parameter to be the URL we scrape. So let's change the URL of the brownies recipe we scraped previously to simply say URL (without quotes).
```{r}
scrape_recipes <- function(URL) {
  brownies <- read_html(URL)
  ingredients <- html_nodes(brownies, ".ingredients-item-name")
  ingredients <- html_text(ingredients)
  directions <- html_nodes(brownies, ".instructions-section-item")
  directions <- html_text(directions)
  directions <- trimws(directions)
}
```
To make this function print something to the console, we need to specifically tell it to do so in the code. We do this using the `print()` function. Let's first print the ingredients and then the directions. We'll add that to the final lines of the function.
```{r}
scrape_recipes <- function(URL) {
  brownies <- read_html(URL)
  ingredients <- html_nodes(brownies, ".ingredients-item-name")
  ingredients <- html_text(ingredients)
  directions <- html_nodes(brownies, ".instructions-section-item")
  directions <- html_text(directions)
  directions <- trimws(directions)
  print(ingredients)
  print(directions)
}
```
Now we can try it for a new recipe, this one for "The Best Lemon Bars" at this [link](https://www.allrecipes.com/recipe/10294/the-best-lemon-bars/).
```{r, eval = FALSE}
scrape_recipes("https://www.allrecipes.com/recipe/10294/the-best-lemon-bars/")
```
```{r, echo = FALSE}
knitr::include_graphics("images/functions.PNG")
```
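Printing is convenient for reading a recipe, but printed text cannot be reused later in your code. One possible alternative, sketched here on the assumption that you want to keep the results, is to return them in a named list. The function name `scrape_recipe_list` and the small HTML snippet `example_page` are made up for this illustration. The snippet stands in for a real page so that the example runs without an internet connection, and it assumes the page uses the same two CSS classes as the code above.

```{r}
library(rvest)

# Hypothetical variant of scrape_recipes() that returns the results
# in a named list instead of printing them
scrape_recipe_list <- function(URL) {
  page <- read_html(URL)
  ingredients <- html_nodes(page, ".ingredients-item-name")
  ingredients <- html_text(ingredients)
  directions <- html_nodes(page, ".instructions-section-item")
  directions <- html_text(directions)
  directions <- trimws(directions)
  return(list(ingredients = ingredients, directions = directions))
}

# Made-up stand-in for a recipe page. read_html() also accepts a
# string of HTML, so no website is contacted here.
example_page <- '<html><body>
  <span class="ingredients-item-name">1 cup sugar</span>
  <span class="ingredients-item-name">2 eggs</span>
  <div class="instructions-section-item"> Mix everything. </div>
</body></html>'

recipe <- scrape_recipe_list(example_page)
recipe$ingredients
recipe$directions
```

With a real recipe, you would pass the page's URL instead of `example_page`, exactly as with `scrape_recipes()`.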
In the next lesson we'll use "for loops" to scrape multiple recipes very quickly.