Stéphane Guillou 6 December 2018
This workshop assumes no prior technical knowledge and focuses on:
- using a language interactively in a Command Line Interface
- introducing concepts and tools central to scientific programming
- working with source code efficiently
- automating tasks and enhancing reproducibility with small programs
This seminar will focus on the R language for a hands-on practice and extensive examples, but other languages will be presented, highlighting the differences between options.
The best way to learn is to practise, so we will quickly get into coding by all typing and executing the same commands. This is called "live-coding".
Today, we will use R and RStudio. They are available on the library's training computers, but if you want to use your own laptop, please follow the instructions here: https://gitlab.com/stragu/CDS/blob/master/R/Installation.md
Let's start by adding our names and some info about us in our collaborative pad.
See how we can use some code to format our document? This is called Markdown, which is a concise language designed to easily create HTML documents. We are already coding!
Programming allows you to use a programming language to build programs and execute tasks. There are many programs around, both Open Source and proprietary, that researchers can use to execute tasks by using a graphical user interface – or GUI – and clicking on buttons. You might however find that the programs you have been recommended do not help you with the specific task that you want to achieve, or at least do not do it efficiently.
Programming allows you to automate tasks by, for example, using loops to repeat a task many times, and creating functions to avoid writing the same code many times over.
Another added benefit to programming is that you can share your code with a team or with peers, and include it in a publication, in order to increase reproducibility.
Finally, because of the text-based nature of your program, you can use tools to keep track of version history and of authorship when collaborating with others.
There are many languages you can use. They differ with their syntax, what they were designed for, and the size of the community that is actively using them.
In scientific computing, R and Python are two very popular languages. R was designed as a language for statistical analysis to be used interactively, whereas Python is more generalist.
A GUI gives the user visual elements to interact with the program, for example a button to click on in order to do a task.
A command-line interface, or CLI, takes an input (a command), executes it and sometimes prints some output.
Languages like R and Python can be used interactively in a command-line interface, which allows us to experiment and quickly get some results. However, if we need to create more complex series of steps for our data processing, it becomes more comfortable to work with scripts.
An integrated development environment, or IDE, is a program that allows to write code more comfortably. Code is only text, and can therefore always be written in the most simple text editor, but an IDE makes it easier to work with code, test it, execute it, and sometimes find help and visualise output.
RStudio is an IDE primarily designed to work on R code.
Let's open RStudio and start using the R language interactively.
The most simple thing we can do with R is using it as a calculator, with the help of binary operators.
For example, try executing the following commands:
R can be used like a calculator. Try the following commands:
3 * 4
10 / 2
11^6
We can store data by creating objects, and assigning values to them with the assignement operator <-
:
num1 <- 42
num2 <- x / 9
str1 <- "Hello world!"
You can use the shortcut Alt + - to type the assignement operator quicker.
Different objects have different classes and will be handled differently.
An R function looks like this:
<functionname>(<argument(s)>)
In the console, we write a command and then evaluate it by pressing Enter.
For example, try running the following command:
sqrt(x = num1)
There are two main ways to find help about a function inside RStudio:
- the shortcut command:
?functionname
- the keyboard shortcut: press F1 with your cursor in a function name
Exercise 1 - Use the help pages to find out what these functions do:
c()
sum()
rm()
length()
list.files()
c()
concatenates the arguments into a vector.
sum()
returns the sum of several numbers.
seq()
creates a sequence of numbers from the parameters you give it.
rm()
removes an object from your environment.
length()
returns the length of a vector.
list.files()
lists all the files in a directory.
Let's do some more complex operations by combining two functions:
ls()
lists the objects in the current R environment. For example, try running the ls()
function after executing the command num3 <- 42
.
You can remove all the objects in the environment by using ls()
as the value for the list
argument:
rm(list = ls())
Arguments can be named, or can be automatically matched if used in order.
Base R contains quite a few functions, and you can already build most things from scratch.
However, other people have already developped many functions for specific uses, and made them available to all via packages.
We can install packages with a command. For example, to install the package praise
:
install.packages("praise")
We now need to load the package to access its functions:
library(praise) # load the package
praise() # use a function
## [1] "You are geometric!"
We can create our own custom function, so we can reuse code easily:
greeting <- function(name) {
paste0(name, " is my name. Hello World! :)")
}
Use Shift + Enter to go to the line without executing the command.
If I try running my function without a value for the argument name
, I will get an error.
I can include a default value if I want to:
greeting <- function(name = "Jane Doe") {
paste0(name, " is my name. Hello World! :)")
}
If you want to execute something many times, you can use a loop:
for (i in 1:100) {
print(":D")
}
A logical operator returns a boolean: TRUE or FALSE.
Common ones are ==
for equality, !=
for inequality, >
for greater than, <
for smaller than, >=
for greater or equal, and <=
for smaller or equal.
Try those examples:
1 == 1
## [1] TRUE
1 != 1
## [1] FALSE
3 > 4
## [1] FALSE
7 <= c(7, 8, 6)
## [1] TRUE TRUE FALSE
With these tools, we can check for conditions, for example in an if
statement:
number = rnorm(1)
if (number > 0) {
print("Positive")
} else {
print("Negative")
}
## [1] "Positive"
Indexing allows use to access a part of an object.
fact <- c("People", "are", "alright")
fact[1]
## [1] "People"
fact[3] <- "hopeless"
fact[4] <- "right"
fact[nchar(fact) < 7]
## [1] "People" "are" "right"
Indexing in R starts at 1. Most other languages start at 0.
Let's put what we just learned to use with a concrete example!
File > New Project > New Directory > New Project
Working from a script is a lot more comfortable, and allows us to create a succession of steps to reproduce our process – in other words, a little program.
The RStudio source pane helps us writing code thanks to:
- handy shortcuts
- syntax highlighting
- automatic indentation
- feedback on code errors
- commenting
After building a script, we can execute in one click of a button.
You can create a script from the File > New File > R Script.
We are going to work on a dataset of ten books from Project Gutenberg. We want to get some numbers on them, and to search for the frequency of some terms. But we want to make sure that we automate the process, as we are likely to do that on many more books.
The data archive is located at https://cloudstor.aarnet.edu.au/plus/s/rqOEQQCjcwBv36x/download and we can use the following commands to download and unzip it:
download.file(url = "https://cloudstor.aarnet.edu.au/plus/s/O4XHRi4VqiaAmMK/download",
destfile = "data.zip")
unzip("data.zip")
Anthem <- readLines("data/AynRand_Anthem.txt")
stringr
is a useful library to work with strings.
library(stringr)
You can find a handy cheatsheet for this package here: https://github.com/rstudio/cheatsheets/raw/master/strings.pdf
The books all have a header and a footer that we want to remove.
start <- str_which(Anthem, "START OF TH") + 1
end <- str_which(Anthem, "END OF TH") - 1
Anthem_sub <- Anthem[start:end]
We also want to convert everything to lowercase:
Anthem_lower <- str_to_lower(Anthem_sub)
And remove the empty lines:
Anthem_clean <- Anthem_lower[Anthem_lower != ""]
We want to find how characters the book contains:
nchar(Anthem_clean)
sum(nchar(Anthem_clean)) # base R
sum(str_length(Anthem_clean)) # stringr
There are often many ways to do one thing.
We also want to find specific terms, and count them:
# find terms visually
str_view(Anthem_clean, "the")
# count specific terms
str_count(Anthem_clean, "love")
sum(str_count(Anthem_clean, "love"))
How many words does the book contain?
str_count(Anthem_clean, "\\w+")
sum(str_count(Anthem_clean, "\\w+"))
Regular expressions, or RegEx, can be used to detect any pattern and create complex queries.
And how many lines?
length(Anthem_clean)
## [1] 1575
We want to create functions so we can reuse the code easily, on any book.
We can separate our process in two steps:
- Importing and cleaning the data
- Analysing the data
Let's start with the function to import and clean up a book:
import_clean <- function(file) {
book <- readLines(file)
start <- str_which(book, "START OF TH") + 1
end <- str_which(book, "END OF TH") - 1
book <- book[start:end]
book <- str_to_lower(book)
}
There's one more step missing: removing the empty strings. Can you add it to the function?
import_clean <- function(file) {
book <- readLines(file)
start <- str_which(book, "START OF TH") + 1
end <- str_which(book, "END OF TH") - 1
book <- book[start:end]
book <- str_to_lower(book)
book <- book[book != ""]
}
Let's test our function:
JaneEyre <- import_clean("data/CharlotteBronte_JaneEyre.txt")
Now, let's create the function that analyses the data. It needs to output a list that contains our stats:
book_stats <- function(book) {
characters <- sum(str_length(book))
words <- sum(str_count(book, "\\w+"))
lines <- length(book)
results <- list(char_count = characters,
word_count = words,
line_count = lines)
results
}
We've got the characters, the words and the lines. How would you add the search term data to your function?
book_stats <- function(book, search_term) {
characters <- sum(str_length(book))
words <- sum(str_count(book, "\\w+"))
lines <- length(book)
terms <- sum(str_count(book, search_term))
results <- list(char_count = characters,
word_count = words,
line_count = lines,
term_count = terms)
results
}
Let's test our second function on the book we just imported:
book_stats(JaneEyre, "love")
## $char_count
## [1] 1009326
##
## $word_count
## [1] 189462
##
## $line_count
## [1] 16397
##
## $term_count
## [1] 235
We will now use our functions on all our files, and store the data into a dataframe. But to save us some typing, we will use a loop.
First, let's create an empty dataframe:
data <- data.frame()
Then, let build our for
loop. We want to add an extra variable in there: the name of the file the data comes from.
for (filename in list.files("data")) {
book <- import_clean(paste0("data/", filename))
results <- book_stats(book, "love")
results <- append(list(file = filename), results)
data <- rbind(data, results, stringsAsFactors = FALSE)
}
After executing our loop, we can visualise our dataset with:
View(data)
dplyr
is a useful package to manipulate data.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data <- data %>%
mutate(word_length = char_count / word_count,
love_score = term_count / word_count) %>%
arrange(love_score)
View(data)
ggplot2
is a useful package to visualise data.
library(ggplot2)
ggplot(data,
aes(x = reorder(file, love_score),
y = love_score)) +
geom_point() +
coord_flip() +
ylim(c(0, NA))