Introduction to scientific programming

Stéphane Guillou 6 December 2018

Welcome!

This workshop assumes no prior technical knowledge and focuses on:

using a language interactively in a Command Line Interface
introducing concepts and tools central to scientific programming
working with source code efficiently
automating tasks and enhancing reproducibility with small programs

This seminar will focus on the R language for a hands-on practice and extensive examples, but other languages will be presented, highlighting the differences between options.

The best way to learn is to practise, so we will quickly get into coding by all typing and executing the same commands. This is called "live-coding".

Installation

Today, we will use R and RStudio. They are available on the library's training computers, but if you want to use your own laptop, please follow the instructions here: https://gitlab.com/stragu/CDS/blob/master/R/Installation.md

Introduction

Let's start by adding our names and some info about us in our collaborative pad.

See how we can use some code to format our document? This is called Markdown, which is a concise language designed to easily create HTML documents. We are already coding!

Why programming?

Programming allows you to use a programming language to build programs and execute tasks. There are many programs around, both Open Source and proprietary, that researchers can use to execute tasks by using a graphical user interface – or GUI – and clicking on buttons. You might however find that the programs you have been recommended do not help you with the specific task that you want to achieve, or at least do not do it efficiently.

Programming allows you to automate tasks by, for example, using loops to repeat a task many times, and creating functions to avoid writing the same code many times over.

Another added benefit to programming is that you can share your code with a team or with peers, and include it in a publication, in order to increase reproducibility.

Finally, because of the text-based nature of your program, you can use tools to keep track of version history and of authorship when collaborating with others.

What language?

There are many languages you can use. They differ with their syntax, what they were designed for, and the size of the community that is actively using them.

In scientific computing, R and Python are two very popular languages. R was designed as a language for statistical analysis to be used interactively, whereas Python is more generalist.

GUI vs CLI

A GUI gives the user visual elements to interact with the program, for example a button to click on in order to do a task.

A command-line interface, or CLI, takes an input (a command), executes it and sometimes prints some output.

Languages like R and Python can be used interactively in a command-line interface, which allows us to experiment and quickly get some results. However, if we need to create more complex series of steps for our data processing, it becomes more comfortable to work with scripts.

IDE

An integrated development environment, or IDE, is a program that allows to write code more comfortably. Code is only text, and can therefore always be written in the most simple text editor, but an IDE makes it easier to work with code, test it, execute it, and sometimes find help and visualise output.

RStudio is an IDE primarily designed to work on R code.

Programming with R

Getting to know R

Let's open RStudio and start using the R language interactively.

The most simple thing we can do with R is using it as a calculator, with the help of binary operators.

For example, try executing the following commands:

R can be used like a calculator. Try the following commands:

3 * 4
10 / 2
11^6

Objects

We can store data by creating objects, and assigning values to them with the assignement operator <-:

num1 <- 42
num2 <- x / 9
str1 <- "Hello world!"

You can use the shortcut Alt + - to type the assignement operator quicker.

Different objects have different classes and will be handled differently.

Using functions

An R function looks like this:

<functionname>(<argument(s)>)

In the console, we write a command and then evaluate it by pressing Enter.

For example, try running the following command:

sqrt(x = num1)

Finding help

There are two main ways to find help about a function inside RStudio:

the shortcut command: ?functionname
the keyboard shortcut: press F1 with your cursor in a function name

Exercise 1 - Use the help pages to find out what these functions do:

c()
sum()
rm()
length()
list.files()

c() concatenates the arguments into a vector.

sum() returns the sum of several numbers.

seq() creates a sequence of numbers from the parameters you give it.

rm() removes an object from your environment.

length() returns the length of a vector.

list.files() lists all the files in a directory.

Let's do some more complex operations by combining two functions:

ls() lists the objects in the current R environment. For example, try running the ls() function after executing the command num3 <- 42.

You can remove all the objects in the environment by using ls() as the value for the list argument:

rm(list = ls())

Arguments can be named, or can be automatically matched if used in order.

Packages

Base R contains quite a few functions, and you can already build most things from scratch.

However, other people have already developped many functions for specific uses, and made them available to all via packages.

We can install packages with a command. For example, to install the package praise:

install.packages("praise")

We now need to load the package to access its functions:

library(praise) # load the package
praise() # use a function

## [1] "You are geometric!"

Custom functions

We can create our own custom function, so we can reuse code easily:

greeting <- function(name) {
  paste0(name, " is my name. Hello World! :)")
}

Use Shift + Enter to go to the line without executing the command.

If I try running my function without a value for the argument name, I will get an error.

I can include a default value if I want to:

greeting <- function(name = "Jane Doe") {
  paste0(name, " is my name. Hello World! :)")
}

Loops

If you want to execute something many times, you can use a loop:

for (i in 1:100) {
  print(":D")
}

Logical operators, booleans and if statements

A logical operator returns a boolean: TRUE or FALSE.

Common ones are == for equality, != for inequality, > for greater than, < for smaller than, >= for greater or equal, and <= for smaller or equal.

Try those examples:

1 == 1

## [1] TRUE

1 != 1

## [1] FALSE

3 > 4

## [1] FALSE

7 <= c(7, 8, 6)

## [1]  TRUE  TRUE FALSE

With these tools, we can check for conditions, for example in an if statement:

number = rnorm(1)
if (number > 0) {
  print("Positive")
} else {
  print("Negative")
}

## [1] "Positive"

Indexing

Indexing allows use to access a part of an object.

fact <- c("People", "are", "alright")
fact[1]

## [1] "People"

fact[3] <- "hopeless"
fact[4] <- "right"
fact[nchar(fact) < 7]

## [1] "People" "are"    "right"

Indexing in R starts at 1. Most other languages start at 0.

An example project

Let's put what we just learned to use with a concrete example!

Creating a project

File > New Project > New Directory > New Project

Creating a script

Working from a script is a lot more comfortable, and allows us to create a succession of steps to reproduce our process – in other words, a little program.

The RStudio source pane helps us writing code thanks to:

handy shortcuts
syntax highlighting
automatic indentation
feedback on code errors
commenting

After building a script, we can execute in one click of a button.

You can create a script from the File > New File > R Script.

Our data

We are going to work on a dataset of ten books from Project Gutenberg. We want to get some numbers on them, and to search for the frequency of some terms. But we want to make sure that we automate the process, as we are likely to do that on many more books.

The data archive is located at https://cloudstor.aarnet.edu.au/plus/s/rqOEQQCjcwBv36x/download and we can use the following commands to download and unzip it:

download.file(url = "https://cloudstor.aarnet.edu.au/plus/s/O4XHRi4VqiaAmMK/download",
              destfile = "data.zip")
unzip("data.zip")

Import a book

Anthem <- readLines("data/AynRand_Anthem.txt")

Clean up the data

stringr is a useful library to work with strings.

library(stringr)

You can find a handy cheatsheet for this package here: https://github.com/rstudio/cheatsheets/raw/master/strings.pdf

The books all have a header and a footer that we want to remove.

start <- str_which(Anthem, "START OF TH") + 1
end <- str_which(Anthem, "END OF TH") - 1
Anthem_sub <- Anthem[start:end]

We also want to convert everything to lowercase:

Anthem_lower <- str_to_lower(Anthem_sub)

And remove the empty lines:

Anthem_clean <- Anthem_lower[Anthem_lower != ""]

Analyse the book

We want to find how characters the book contains:

nchar(Anthem_clean)
sum(nchar(Anthem_clean)) # base R
sum(str_length(Anthem_clean)) # stringr

There are often many ways to do one thing.

We also want to find specific terms, and count them:

# find terms visually
str_view(Anthem_clean, "the")

# count specific terms
str_count(Anthem_clean, "love")
sum(str_count(Anthem_clean, "love"))

How many words does the book contain?

str_count(Anthem_clean, "\\w+")
sum(str_count(Anthem_clean, "\\w+"))

Regular expressions, or RegEx, can be used to detect any pattern and create complex queries.

And how many lines?

length(Anthem_clean)

## [1] 1575

Creating functions

We want to create functions so we can reuse the code easily, on any book.

We can separate our process in two steps:

Importing and cleaning the data
Analysing the data

Let's start with the function to import and clean up a book:

import_clean <- function(file) {
  book <- readLines(file)
  start <- str_which(book, "START OF TH") + 1
  end <- str_which(book, "END OF TH") - 1
  book <- book[start:end]
  book <- str_to_lower(book)
}

Challenge N

Instructions

There's one more step missing: removing the empty strings. Can you add it to the function?

Solution

import_clean <- function(file) {
  book <- readLines(file)
  start <- str_which(book, "START OF TH") + 1
  end <- str_which(book, "END OF TH") - 1
  book <- book[start:end]
  book <- str_to_lower(book)
  book <- book[book != ""]
}

Let's test our function:

JaneEyre <- import_clean("data/CharlotteBronte_JaneEyre.txt")

Now, let's create the function that analyses the data. It needs to output a list that contains our stats:

book_stats <- function(book) {
  characters <- sum(str_length(book))
  words <- sum(str_count(book, "\\w+"))
  lines <- length(book)
  results <- list(char_count = characters,
                  word_count = words,
                  line_count = lines)
  results
}

Challenge N

Instructions

We've got the characters, the words and the lines. How would you add the search term data to your function?

Solution

book_stats <- function(book, search_term) {
  characters <- sum(str_length(book))
  words <- sum(str_count(book, "\\w+"))
  lines <- length(book)
  terms <- sum(str_count(book, search_term))
  results <- list(char_count = characters,
                  word_count = words,
                  line_count = lines,
                  term_count = terms)
  results
}

Let's test our second function on the book we just imported:

book_stats(JaneEyre, "love")

## $char_count
## [1] 1009326
## 
## $word_count
## [1] 189462
## 
## $line_count
## [1] 16397
## 
## $term_count
## [1] 235

Loop through all files

We will now use our functions on all our files, and store the data into a dataframe. But to save us some typing, we will use a loop.

First, let's create an empty dataframe:

data <- data.frame()

Then, let build our for loop. We want to add an extra variable in there: the name of the file the data comes from.

for (filename in list.files("data")) {
  book <- import_clean(paste0("data/", filename))
  results <- book_stats(book, "love")
  results <- append(list(file = filename), results)
  data <- rbind(data, results, stringsAsFactors = FALSE)
}

After executing our loop, we can visualise our dataset with:

View(data)

Create new variables

dplyr is a useful package to manipulate data.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

data <- data %>% 
  mutate(word_length = char_count / word_count,
         love_score = term_count / word_count) %>% 
  arrange(love_score)
View(data)

Visualise

ggplot2 is a useful package to visualise data.

library(ggplot2)
ggplot(data,
       aes(x = reorder(file, love_score),
           y = love_score)) +
  geom_point() +
  coord_flip() +
  ylim(c(0, NA))

Files

intro_to_programming.md

Latest commit

History

intro_to_programming.md

File metadata and controls

Introduction to scientific programming

Welcome!

Installation

Introduction

Why programming?

What language?

GUI vs CLI

IDE

Programming with R

Getting to know R

Objects

Using functions

Finding help

Packages

Custom functions

Loops

Logical operators, booleans and if statements

Indexing

An example project

Creating a project

Creating a script

Our data

Import a book

Clean up the data

Analyse the book

Creating functions

Challenge N

Instructions

Solution

Challenge N

Instructions

Solution

Loop through all files

Create new variables

Visualise

Experiment with code!

Other tools

Python: another language

Git: revision control and collaboration

Bash: control a computer

What next?