-
Notifications
You must be signed in to change notification settings - Fork 44
/
Copy path0_02_R_prerequisites.Rmd
108 lines (55 loc) · 9.1 KB
/
0_02_R_prerequisites.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
# Programming prerequisites {#prerequisites}
This chapter gives a quick overview of the prerequisite R skills needed to use this book. These are covered in the [Introduction to Exploratory Data Analysis with R](https://dzchilds.github.io/eda-for-bio) course. We assume you are comfortable using these skills, so you may need to spend revising them if you feel that you're a little rusty.
## Starting a **learnr** tutorial
You will be using learnr **learnr** tutorials to gain practical experience of what you are learning about learn each week. These interactive tutorials contain three main components:
1 - **Static Text**: This provides background information for revision purposes or instructions to do something.
2 - **Code boxes**: These are interactive boxes that allow you to execute R code and see the results.
3 - **Quizzes**: These are multiple-choice/multiple-answer questions designed to check your understanding.
You will be given a visual 'walk-through' of how to run a **learnr** tutorial in week one of the course. The first tutorial aims to be self-describing---it provides a stand-alone introduction to how to use the tutorials.
<!-- ## Starting a new R script in RStudio {#new-project-setup} -->
<!-- You will need to prepare and run your own R scripts to make best use of the book. The process of setting things up to do this varies slightly depending on whether you are starting a brand new script 'from scratch' or using one of the provided template scripts. -->
<!-- The first three set up steps are the same regardless of whether you're creating a new script or using one of the provided templates: -->
<!-- **Step 1.** Decide which folder you will use to store your project files such as R scripts and CSV data files. Be sure to choose a location that is in your personal user space and not some random system location. Of course, you can create the folder you want to use if it doesn't already exist! -->
<!-- **Step 2.** Open up RStudio. Remember, RStudio and R are not the same---don't open the stand-alone version of R. RStudio provides a lot of convenient functionality that makes it easier to work with R. Basically, R runs inside RStudio, which means you don't need to launch R separately because RStudio 'talks' to it for you. -->
<!-- **Step 3.** Use RStudio to set your working directory. Do this via the RStudio menu system: **Session ▶ Set Working Directory ▶ Choose Working Directory...**. Be sure to choose a sensible location. The working directory is where R will look for data files and R scripts. -->
<!-- Remember: 'directory' is just another name for 'folder'. You pick your working directory by selecting a folder rather than a file in the file-choose dialogue box. -->
<!-- ```{block, type='do-something'} -->
<!-- **Using your system file explorer vs. RStudio** -->
<!-- When we say 'system file explorer' we mean the 'Windows File Explorer' on Windows or the 'Finder' on a Mac. You will need to use your system file explorer alongside RStudio to get set up. We assume you're comfortable enough with the system file explorer that you can to use it move files around and create new folders. We won't provide detailed guidance about how to do this kind of thing. In contrast, when we describe an action that takes place in RStudio we will tell you exactly which menu sequence to use. -->
<!-- ``` -->
<!-- What you do next will depend on whether you need to work with a new script or one of the templates. -->
<!-- ##### Starting a brand new, empty R script {-} -->
<!-- **Step 4.** Open up a new R script using the RStudio menu system: **File ▶ New File ▶ R Script**. This will create an empty, unnamed R script. Don't create any other kind of file. This will not have been save at this point! -->
<!-- **Step 5.** Add any important 'set up' chunks of code (and comments!) before doing anything else. There are a few of things that regularly appear at the start of a script, e.g. we often start by loading packages with `library`: -->
<!-- ```{r, eval=FALSE} -->
<!-- # load and attach the packages we want to use ---- -->
<!-- # 1. 'dplyr' for data manipulation -->
<!-- library(dplyr) -->
<!-- # 2. 'ggplot2' for plotting -->
<!-- library(ggplot2) -->
<!-- ``` -->
<!-- **Step 6.** Now run the preamble section of the script, i.e. highlight everything and hit **Ctrl+Enter**. If the `library` commands produced errors the relevant package probably aren't installed yet. Install the package (see below) and try rerunning the script. -->
<!-- **Step 7.** Look at the label of the tab the script lives in. This will be called something like *Untitled1* and the label text will be red. This is RStudio signalling that the file has not yet been saved. So... after the preamble part of the script is working, save the script inside the folder of your working directory. -->
<!-- That's it. Now you're ready to start developing the new script. -->
<!-- ##### Using a template script {-} -->
<!-- **Step 4.** Use your the system file explorer to copy the template script to from wherever it is stored to the folder you've set as the working directory. Alternatively, you could put it inside a sub folder of the working directory. It doesn't matter as long as you keep it somewhere inside the working directory as that is where you project 'lives'. -->
<!-- **Step 5.** The template scripts all contain, at a minimum, at least some set up R code at the top. Typically this will load packages using one or more `library` statements. Run this section of the script, i.e. highlight everything and hit **Ctrl+Enter**. If the `library` commands produced errors the relevant package probably aren't installed yet. Install the package (see below) and try rerunning the script. -->
## Using packages
R packages extend the basic functionality of R so that we can do more with it. In a nutshell, an R package bundles together R code, data, and documentation in a standardised way that is easy to use and share with other users. This book uses a subset of the [tidyverse](https://www.tidyverse.org) ecosystem of packages: the `readr` package for reading data into R, the `dplyr` package for data manipulation, and the `ggplot2` package for making plots. We need to understand how R's package system works to use these.
Here's the key point: Installing a package, and then loading and attaching the package, are different operations. We only have to install a package once onto our computer, but we have to load and attach the package every time we want to use it in a new R session (i.e. every time we start RStudio). If that doesn't make any sense, revise the [package system](https://dzchilds.github.io/eda-for-bio/packages.html) chapter Exploratory Data Analysis in R book.
Installing a package can be done via the `install.packages` function, e.g. use this code to install the **dplyr** package:
```{r, eval=FALSE}
install.packages("dplyr")
```
Alternatively, you can use RStudio's menu interface via the packages tab in the bottom right window.
Either way is fine. However, the `install.packages` route should be carried by typing the install commands directly into the Console (this is pretty much the only time we work this way). **Do not leave `install.packages` statements in your R scripts.** We only have to install a package once onto our computer to make it available. Because installing packages can be slow, we'd rather not do that every time we have to run a script.
Loading and attaching a package so that it can actually be used happens via the `library` function, e.g.
```{r, eval=FALSE}
library("dplyr")
```
We do usually leave `library` statements at the beginning of scripts to ensure that all the package functions we need are available to the rest of the script.
## Reading data into R
Last year we made extensive use of 'built in' data sets that reside inside R. This meant we could use the data without getting bogged down trying to read it into R. We'll carry on doing that at times as we work through the book and the accompanying **learnr** tutorials. However, we don't have the luxury of this short cut when we work with our own data, so we'll work towards adopting more realistic practices as we go.
In 'real world' data analysis, when we need to work with data, we typically save a copy of it into some kind of file on our computer and then read that file into R. The data sets we use in this course are stored as a Comma Separated Value ('CSV') text files. The base R `read.csv` or the `read_csv` function from the **readr** package can be used to read in such files.
If we use `read.csv` the resulting R data object is a data frame. If we use `read_csv` we end up with a 'tibble' which can be thought of as the tidyverse version of a data frame. Either is fine, the differences don't matter in this book. A data frame is a table-like object that collects together different variables, storing each of them as a named column. We can access the data inside the data frame by referring to particular columns and rows, or manipulate the whole data frame with a package like **dplyr**.
If that last paragraph was confusing, it would be a good idea to work through the [data frames](https://dzchilds.github.io/eda-for-bio/data-frames.html) chapter of the Exploratory Data Analysis in R book.