03a-tidying-data.Rmd

---
knit: bookdown::preview_chapter
---

# Tidying Common Data Formats

## Terminology

- A __variable__ is a quantity, quality, or property that you can measure. For the grad programs, these would be all the column headers.
- An __observation__ is a set of measurements made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object). An observation will contain several values, each associated with a different variable. I’ll sometimes refer to an observation as a data point. For the grad programs, this is institution, and program, uniquley define the observation.
- A value is the state of a variable when you measure it. The value of a variable typically changes from observation to observation.

##  Tidy tabular form

__Tabular data__ is a set of values, each associated with a variable and an observation. Tabular data is __tidy__ if each value is placed in its own `cell`, each variable in its own column, and each observation in its own row.

### Wide format

The grad programs data in tabular form. It also is in **wide** tidy form, because there are multiple columns containing different variables.

```{r readgrad}
library(emo)
library(tidyverse)
grad <- read_csv("data/graduate-programs.csv")
grad %>% top_n(10)
```

What's good about the format?

- Rows contain information about each department of an institution
- Columns contain types of information, like average number of publications, average number of citations, % completion, 

It makes it easy to make summaries:

```{r gradsummary}
grad %>% count(subject)
grad %>% filter(subject == "economics") %>%
  summarise(m=mean(NumStud), s=sd(NumStud))
grad %>% filter(subject == "economics") %>%
  ggplot(aes(x=NumStud, y=MedianTimetoDegree)) +
  geom_point() + theme(aspect.ratio=1)
```

### Long format

It can also be useful - for the process of data analysis - to arrange data into **long** tidy form, where each value is uniquely identified.

```{r longformat}
grad %>% gather(variable, value, -subject, -Inst)
```

## Raw data examples

For each of these data examples, **let's try together to identify the variables and the observations** - some are HARD!

### Gene expression experiment

`r ji("thinking")`

```{r readgenes}
genes <- read_csv("data/genes.csv")
genes 
```

### Melbourne weather

What are the variables? Observations?

`r ji("scared")`

```{r readmelb}
melbtemp <- read_fwf("data/ASN00086282.dly", 
   col_positions=fwf_widths(c(11, 4, 2, 4, 
        rep(c(5, 1, 1, 1), 31))))
melbtemp %>% select(X1, X2, X3, X4, X5, X9, 
          X13, X17, X21, X25, X29, X33)
```

### TB incidence

This is current tuberculosis data taken from [WHO](http://www.who.int/tb/country/data/download/en/), the case notifications table.

`r set.seed(2018); ji("sick")`

```{r readTB}
tb <- read_csv("data/TB_notifications_2018-03-18.csv") %>%
  select(country, year, starts_with("new_sp_")) %>%
  filter(year>1996, year<2012)
tb %>% top_n(20)
```

### French fries

10 week sensory experiment, 12 individuals assessed taste of french fries on several scales (how potato-y, buttery, grassy, rancid, paint-y do they taste?), fried in one of 3 different oils, replicated twice. First few rows:

```{r loadff}
load("data/french_fries.rda")
french_fries %>% top_n(10)
```

![](images/french_fries.png)

What are the variables? Observations?

### Contingency table

This data is collated from this story: [41% Of Fliers Think You’re Rude If You Recline Your Seat](http://fivethirtyeight.com/datalab/airplane-etiquette-recline-seat/)

```{r readfly}
fly <- read_csv("data/fly_tbl.csv")
fly
```

What are the variables?

## ABS Datapack

The Australian Bureau of Statistics (ABS) collects, maintains and delivers data and official statistics on a wide range of economic, social, population and environmental matters of importance to Australia. There are many different access points for data, but primarily aggregated data is the main type available. Examples at accessing the Census data from the ABS can be found in the `eechidna` package. 

1. The individual `csv` files must be held locally. They come from a zip file and can be downloaded from: https://datapacks.censusdata.abs.gov.au/datapacks/
2. Select: 2016 Census Datapacks, General Community Profile, Commonwealth Electoral Divisons
3. Download for all of Australia
4. Unzip the package - its necessary, because the data is delivered in many small csv files. There is also the license information detailing appropriate usage, and detailed information about the formats.

```{r eval=FALSE}
G1_Main <- read_csv(here::here("data/2016 Census GCP Commonwealth Electoral Divisions for AUST/", "2016Census_G01_AUS_CED.csv"))
```

## Messy vs tidy

Messy data is messy in its own way. You can make unique solutions, but then another data set comes along, and you have to again make a unique solution. 

Tidy data can be though of as legos. Once you have this form, you can put it together in so many different ways, to make different analyses and visualizations. <!--Pedantic comment: legos isn't a word. the plural of lego is lego. -->

![](images/lego.png)

## Tidy verbs

- `gather`: specify the **keys** (identifiers) and the **values** (measures) to make long form (used to be called melting)
- `spread`: move a variable from a row into columns (used to be called casting)
- `separate`: split a character vector into columns

## Tidying genes data

```{r wranglegenes}
genes
genes_long <- genes %>% 
  gather(variable, expr, -id) 
genes_long
```
In this example, we identified the columns as containing expression levels of a gene during different experimental conditions (variables). We have _gathered_ these variables into a single column, with the expression levels in a corresponding column. The `-id` argument to `gather()` indicates that we do not want the `id` column to be gathered; this has the effect of repeating the 3 values of gene for each variable. 


### Separate columns

The variables in the previous example are a cryptic combination of treatment, timepoint, and replicate. Using `separate()`, we can separate this information out into a tidy (and much more human friendly) representation of the experimental conditions:

```{r wranglegenes1}
genes_long %>%
  separate(variable, c("trt", "leftover"), "-") 
```
First, we separate the treatment from the rest of the information. 

Next, we separate the remainder into timepoint and replicate variables. 

```{r wranglegenes2}
genes_long %>%
  separate(variable, c("trt", "leftover"), "-") %>%
  separate(leftover, c("time", "rep"), "\\.") 
```
This data is tidy, in that observations are in rows and variables are in columns, but it is not as easy to read. We can examine the different features of this data set using `spread()`.

### Now spread to examine different aspects

There are several different features of the data which we might want to explore. Each feature requires a slight reconfiguration of the dataset, but these are easily accomplished using `spread()`. 

#### Examine treatments against each other

```{r treatments}
genes_long <- genes_long %>%
  separate(variable, c("trt", "leftover"), "-") %>%
  separate(leftover, c("time", "rep"), "\\.") 
genes_long %>%
  spread(trt, expr) %>%
  ggplot(aes(x=WI, y=WM, colour=id, shape = id)) + geom_point()
```

Generally, some negative association within each gene, WM is low if WI is high. <!--Not really... at best, there's a negative association for Gene 2, but even that is pushing it. If you don't consider each gene separately, there's a positive association. -->
 
<!-- The color scheme is INCREDIBLY difficult to see... -->

#### Examine replicates against each other

```{r replicates}
genes_long %>%
  spread(rep, expr) %>%
  ggplot(aes(x=R1, y=R4, colour=id)) + 
  geom_point() + 
  coord_equal()
```

Very roughly, replicate 4 is like replicate 1, eg if one is low, the other is low. That's a good thing, that the replicates are fairly similar.

## Try this yourself

Here is a little data set to practice gather, spread and separate on.

```{r practicekoala}
kb <- read_csv("data/koala_bilby.csv")
kb
```

Try to do these by yourself, but the code is in the Rmd file if you need help.

### Exercise 1

Gather the data into long form, naming the two new variables, `label` and `count`

```{r practicekoala2, echo=FALSE}
kb_long <- kb %>% gather(label, count, -ID)
kb_long
```

### Exercise 2

Separate the labels into two new variables, `animal`, `state`

```{r practicekoala3, echo=FALSE}
kb_long <- kb_long %>%
  separate(label, c("animal", "state"))
kb_long
```

### Exercise 3

Spread the long form data into wide form, where the columns are the states. 
```{r practicekoala4, echo=FALSE}
kb_long %>% spread(state, count)
```

### Exercise 4

Spread the long form data into wide form, where the columns are the animals. 

```{r practicekoala5, echo=FALSE}
kb_long %>% spread(animal, count)
```

## Tidying Melbourne weather

This is seriously messy!

Read the data first, just the columns that we need. The data is in fixed column width format, which means a certain number of columns are designated for the particular records, eg columns 1-11 contains the station id, columns 12-15 contain the year, ...

```{r practicemelb}
melbtemp <- read_fwf("data/ASN00086282.dly", 
   col_positions=fwf_widths(c(11, 4, 2, 4, 
        rep(c(5, 1, 1, 1), 31)), 
        col_names = c("station", "year", "month",
              "variable", paste0("X", 5:128))))
melbtemp
```

X5, X9, ... contain the temperature and precipitation records. The remaining columns have data quality flags. We want to ignore the data quality flags for now, and select every fourth variable.

### Sometime easier to index select

- `[...]` allows indexing of elements of a vector, or collection of numbers
- `c(1:4, seq(5,128,4))` means collect items 1 through 4, and then every 4th item until the 128'th

```{r practicemelb2}
melbtemp <- melbtemp[,c(1:4, seq(5, 128, 4))] 
melbtemp
```

### Make long form

```{r practicemelb3}
melbtemp_long <- melbtemp %>% 
  gather(day, value, X5:X125)
melbtemp_long
```

This is basically what we need. The day variable doesn't have the right values in it, but we can fix this later.

### Use spread to look at different things

But there are a couple of things that I need to fix to make it possible to do these plots. We will discuss these operations next week. <!-- Next week? -->

```{r practicemelb4}
melbtemp_long <- melbtemp_long %>% 
  filter(variable %in% c("TMIN", "TMAX", "PRCP")) %>%
  mutate(value = ifelse(value == -9999, NA, value))
```

### What are the numbers?

- What is the range of temperature and precipitation?
- What units could this be in?
- Why are some values -9999?

Go to the data source: [https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt](https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt)


### Now use spread

#### Max vs Min

```{r practicemelb5}
melbtemp_long %>% spread(variable, value) %>%
  ggplot(aes(x=TMIN, y=TMAX)) + geom_point() + 
  coord_equal() + geom_abline(slope=1, intercept=0)
```

Oh, TMAX is always higher than TMIN. This is a good thing.

#### Winter vs Summer

```{r practicemelb6}
melbtemp_long %>% 
  filter(variable == "TMAX") %>%
  spread(month, value) %>%
  ggplot(aes(x=`07`, y=`01`)) + geom_point() + 
  xlab("July") + ylab("Jan") +
  coord_equal() + geom_abline(slope=1, intercept=0)
```

Same day of the month, same year, January has higher maximum temperatures than July. Except two days, the July max exceeded the January max. (This comparison is a bit whacky, though right?)


## Exercises

1. [41% Of Fliers Think You’re Rude If You Recline Your Seat](http://fivethirtyeight.com/datalab/airplane-etiquette-recline-seat/). In the following table, V1 is a response to the question "Is it rude to recline your seat on a plane?", and V2 is the response to the question "Do you ever recline your seat when you fly?". The data is in the form of a contingency table.

```{r practicefly}
fly_tbl <- read_csv("data/fly_tbl.csv")
fly_tbl
```

a. What are the variables and observations in this data?
b. Put the data in tidy long form (using the names `V2` as the key variable, and `count` as the value). 

You can get the data from  [data/fly_tbl.csv](http://dmac.dicook.org/lectures/data/fly_tbl.csv) 

```{r practicefly2, echo=FALSE, eval=FALSE}
library(tidyverse)
fly_tbl <- read_csv("data/fly_tbl.csv")
fly_tbl %>% gather(V2, count, -V1)
```

2. Your job is to tidy the TB incidence data. 

- Write down the steps that will be needed go from raw data to tidy
- Write the code piece by piece

You can get the data from  [TB_notifications_2018-03-18.csv](http://dmac.dicook.org/lectures/data/TB_notifications_2018-03-18.csv) 

```{r practiceTB, echo=FALSE, eval=FALSE}
tb <- read_csv("data/TB_notifications_2018-03-18.csv") %>%
  select(country, year, starts_with("new_sp_")) %>%
  filter(year>1996, year<2012)
tb %>% top_n(20)
tb_tidy <- tb %>% 
  gather(stuff, count, starts_with("new_sp_")) %>%
  separate(stuff, c("stuff1", "stuff2", "genderage")) %>%
  separate(genderage, c("gender", "age"), sep=1) %>%
  select(-stuff1, -stuff2)
tb_tidy
```

3. For the data set, `rates.csv`, 

```{r practicerates}
rates <- read_csv("data/rates.csv")
head(rates)
```

a. What are the variables and observations?
b. Gather the five currencies, AUD, GBP, JPY, CNY, CAD, make it into tidy long form.
c. Make line plots of the currencies, like the plot below, and describe the similarities and differences between the currencies. 

You can get the data from [rates.csv](http://dmac.netlify.com/lectures/data/rates.csv) 

```{r echo=FALSE, results='hide', fig.show='hide'}
library(tidyverse)
rates <- read_csv("data/rates.csv")
ggplot(rates, aes(x=date, y=AUD)) + geom_line()
```

```{r practicerates2, echo=FALSE, results='hide', fig.width=6, fig.height=10}
rates_long <- rates %>% select(date, AUD, GBP, JPY, CNY, CAD) %>%
  gather(currency, rate, -date)
ggplot(rates_long, aes(x=date, y=rate)) + 
  geom_line() +
  facet_wrap(~currency, ncol=1, scales="free_y")
```

4. This is tough! From the web site [Department of Infrastructure, Regional Development and Cities](https://bitre.gov.au/publications/ongoing/airport_traffic_data.aspx) download the latest "Airport Traffic Data 1985–86 to ...". It is reasonably small.

Extract the "Airport Passengers" table into R with this code

```{r practiceair}
library(readxl)
passengers <- read_xlsx(here::here("data", "WebAirport_FY_1986-2019.xlsx"), sheet=3, skip=6)
passengers
```

Tidy the data, to produce a data set with these columns
    - airport: all of the airports. 
    - year 
    - type_of_flight: DOMESTIC, INTERNATIONAL
    - bound: IN or OUT

```{r practiceair2, eval=FALSE, echo=FALSE}
passengers <- passengers %>% 
  select(-X__1, -X__2, -Rank, -starts_with("TOTAL")) %>%
  rename(INBOUND_DOM=INBOUND, INBOUND_INTL=INBOUND__1, 
         OUTBOUND_DOM=OUTBOUND, OUTBOUND_INTL=OUTBOUND__1,
         INBOUND_TOTL=INBOUND__2, OUTBOUND_TOTL=OUTBOUND__2) %>%
  gather(where, amount, starts_with("IN"), starts_with("OUT")) 
```