forked from bioinformatics-core-shared-training/basicr
-
Notifications
You must be signed in to change notification settings - Fork 40
/
Session2.3-data-manipulation.Rmd
400 lines (274 loc) · 11.9 KB
/
Session2.3-data-manipulation.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
---
title: "Introduction to Solving Biological Problems Using R - Day 2"
author: Mark Dunning, Suraj Menon and Aiora Zabala. Original material by Robert Stojnić,
Laurent Gatto, Rob Foy, John Davey, Dávid Molnár and Ian Roberts
date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
output:
html_notebook:
toc: yes
toc_float: yes
---
# 3. Data Manipulation Techniques
## Motivation
- So far we have been lucky that all our data have been in the same file:
+ This is not usually the case
+ Dataset may be spread over several files
+ This takes longer, and is harder, than many people realise
+ We need to combine before doing an analysis
## Combining data from multiple sources: Gene Clustering Example
- R has powerful functions to combine heterogeneous data sources into a single data set
- Gene clustering example data:
+ Gene expression values in ***gene.expression.txt***
+ Gene information in ***gene.description.txt***
+ Patient information in ***cancer.patients.txt***
- A breast cancer dataset with numerous patient characteristics:
+ We will concentrate on ***ER status*** (positive / negative)
+ What genes show a statistically-significant different change between ER groups?
## Analysis goals
- We will show how to lookup a particular gene in the dataset
- Also, how to look-up genes in a given genomic region
- Perform a "sanity-check" to see if a previously-known gene exhibits a difference in our dataset
- How many genes on chromosome 8 are differentially-expressed?
- Create a heatmap to cluster the samples and reveal any subgroups in the data
+ do the subgroups agree with our prior knowledge about the samples
## Peek at the data
- `gene.expression.txt` is a tab-delimited file, so we can use `read.delim` to import it
- here the `head` function is used as a convenient way to see the first six rows of the resulting data frame
```{r}
normalizedValues <- read.delim("gene.expression.txt")
head(normalizedValues)
```
- `r nrow(normalizedValues)` rows and `r ncol(normalizedValues)` columns
+ One row for each gene:
+ Rows are named according to particular technology used to make measurement
+ The names of each row can be returned by `rownames(normalizedValues)`; giving a vector
+ One column for each patient:
+ The names of each column can be returned by `colnames(normalizedValues)`; giving a vector
```{r}
geneAnnotation <- read.delim("gene.description.txt",stringsAsFactors = FALSE)
head(geneAnnotation)
```
- `r nrow(geneAnnotation)` rows and `r ncol(geneAnnotation)` columns
- One for each gene
- Includes mapping between manufacturer ID and Gene name
```{r}
patientMetadata <- read.delim("cancer.patients.txt",stringsAsFactors = FALSE)
head(patientMetadata)
```
- One for each patient in the study
- Each column is a different characteristic of that patient
+ e.g. whether a patient is ER positive (value of 1) or negative (value of 0)
```{r}
table(patientMetadata$er)
```
## Ordering and sorting
To get a feel for these data, we will look at how we can subset and order
- R allows us to do the kinds of filtering, sorting and ordering operations you might be familiar with in Excel
- For example, if we want to get information about patients that are ER negative
+ these are indicated by an entry of ***0*** in the `er` column
```{r eval=FALSE}
patientMetadata$er == 0
```
We can do the comparison within the square brackets
- Remembering to include a `,` to index the columns as well
- Best practice to create a new variable and leave the original data frame untouched
```{r}
erNegPatients <- patientMetadata[patientMetadata$er == 0,]
head(erNegPatients)
```
or
```{r}
View(erNegPatients)
```
Sorting is supported by the **`sort()`** function
- Given a vector, it will return a sorted version of the same length
```{r}
sort(erNegPatients$grade)
```
- But this is not useful in all cases
+ We have lost the extra information that we have about the patients
- Instead, we can use **`order()`**
- Given a vector, `order()` will give a set of numeric values which will give an ordered version of the vector
+ default is smallest --> largest
```{r}
myvec <- c(90,100,40,30,80,50,60,20,10,70)
myvec
order(myvec)
```
- i.e. number in position 9 is the smallest, number in position 8 is the second smallest:
```{r}
myvec[9]
myvec[8]
```
N.B. `order` will also work on character vectors
```{r}
firstName <- c("Adam", "Eve", "John", "Mary", "Peter", "Paul", "Joanna", "Matthew", "David", "Sally")
order(firstName)
```
- We can use the result of `order()` to perform a subset of our original vector
- The result is an ordered vector
```{r}
myvec.ord <- myvec[order(myvec)]
myvec.ord
```
- Implication: We can use `order` on a particular column of a data frame, and use the result to sort all the rows
- We might want to select the youngest ER negative patients for a follow-up study
- Here we order the `age` column and use the result to re-order the rows in the data frame
```{r}
erNegPatientsByAge <- erNegPatients[order(erNegPatients$age),]
head(erNegPatientsByAge)
```
- can change the behaviour of `order` to be Largest --> Smallest
```{r}
erNegPatientsByAge <- erNegPatients[order(erNegPatients$age,decreasing = TRUE),]
head(erNegPatientsByAge)
```
- we can write the result to a file if we wish
```{r eval=FALSE}
write.table(erNegPatientsByAge, file="erNegativeSubjectsByAge.txt", sep="\t")
```
## Exercise: Exercise7
- Imagine we want to know information about chromosome 8 genes that have been measured.
1. Create a new data frame containing information on genes on Chromosome 8
2. Order the rows in this data frame according to start position, and write the results to a file
```{r}
## Your Answer Here ###
```
### Alternative:
- you might find the function `subset` a bit easier to use
+ no messing around with square brackets
+ no need to remember row and column indices
+ no need for `$` operator to access columns
- more advanced packages like dplyr use a similar approach
+ you'll find out about this on our intermediate course
```{r}
chr8Genes <- subset(geneAnnotation, Chromosome=="chr8")
head(chr8Genes)
```
## Retrieving data for a particular gene
- Gene `ESR1` is known to be hugely-different between ER positive and negative patient
+ let's check that this is evident in our dataset
+ if not, something has gone wrong!
- First step is to locate this gene in our dataset
- We can use `==` to do this, but there are some alternatives that are worth knowing about
## Character matching in R
- `match()` and `grep()` are often used to find particular matches
+ CAUTION: by default, match will only return the ***first*** match!
```{r}
match("D", LETTERS)
grep("F", rep(LETTERS,2))
match("F", rep(LETTERS,2))
```
- `grep` can also do partial matching
+ can also do complex matching using "regular expressions"
```{r}
month.name
grep("ary",month.name)
grep("ber",month.name)
```
- `%in%` will return a logical if each element is contained in a shortened list
```{r}
month.name %in% c("May", "June")
```
## Retrieving data for a particular gene
- Find the name of the ID that corresponds to gene ***ESR1*** using `match`
+ mapping between IDs and genes is in the ***genes*** data frame
+ ID in first column, gene name in the second
- Save this ID as a variable
```{r}
rowInd <- match("ESR1", geneAnnotation$HUGO.gene.symbol)
geneAnnotation[rowInd,]
myProbe <- geneAnnotation$probe[rowInd]
myProbe
```
Now, find which row in our expression matrix is indexed by this ID
- recall that the rownames of the expression matrix are the probe IDs
- save the expression values as a variable
```{r}
match(myProbe, rownames(normalizedValues))
normalizedValues[match(myProbe, rownames(normalizedValues)), 1:10]
myGeneExpression <- normalizedValues[match(myProbe,rownames(normalizedValues)),]
class(myGeneExpression)
```
## Relating to patient characteristics
We have expression values and want to visualise them against our categorical data
- use a boxplot, for example
- however, we have to first make sure our values are treat as numeric data
- as we created the subset of a data frame, the result was also a data frame
+ use `as.numeric` to create a vector that we can plot
+ various `as.` functions exist to convert between various data types
```{r}
boxplot(as.numeric(myGeneExpression) ~ patientMetadata$er)
```
- In this case there is a clear difference, so we probably don't even need a p-value to convince ourselves of the difference
+ in real-life, we would probably test lots of genes and implement some kind of multiple-testing
+ e.g. `p.adjust` (`?p.adjust`)
```{r}
t.test(as.numeric(myGeneExpression) ~ patientMetadata$er)
```
## Complete script
```{r}
geneAnnotation <- read.delim("gene.description.txt",stringsAsFactors = FALSE)
patientMetadata <- read.delim("cancer.patients.txt",stringsAsFactors = FALSE)
normalizedValues <- read.delim("gene.expression.txt")
rowInd <- match("ESR1", geneAnnotation$HUGO.gene.symbol)
myProbe <- geneAnnotation$probe[rowInd]
myGeneExpression <- normalizedValues[match(myProbe,rownames(normalizedValues)),]
boxplot(as.numeric(myGeneExpression) ~ patientMetadata$er)
t.test(as.numeric(myGeneExpression) ~ patientMetadata$er)
```
## Exercise: Exercise 8
Repeat the same steps we performed for the gene ESR1, but for GATA3:
- Try and make as few changes as possible from the ESR1 script
- Can you see why making a markdown document is useful for analysis?
```{r}
### Your Answer Here ###
```
## Extra Discussion
This example has been simplified by the fact that the columns in the expression matrix are in the same order as the patient metadata. This would normally be the case for data obtained in a public repository such as Gene Expression Omnibus
```{r}
colnames(normalizedValues)
patientMetadata$samplename
```
There is a quick shortcut to check that these names are the same using the `all` function
```{r}
colnames(normalizedValues) == patientMetadata$samplename
all(colnames(normalizedValues) == patientMetadata$samplename)
```
Let's say that our metadata have been re-ordered by ER status and age, and not by patient ID
```{r}
patientMetadata <- patientMetadata[order(patientMetadata$er,patientMetadata$age),]
patientMetadata
```
- If we run the same code as before to produce the boxplot and perform the t-test we would get a very different result.
- This should make use immediately suspicious, as the ESR1 gene is known to be highly differentially-expressed in the contrast we are making
- Such sanity checks are important to check to your code
```{r}
rowInd <- match("ESR1", geneAnnotation$HUGO.gene.symbol)
myProbe <- geneAnnotation$probe[rowInd]
myGeneExpression <- normalizedValues[match(myProbe,rownames(normalizedValues)),]
boxplot(as.numeric(myGeneExpression) ~ patientMetadata$er)
t.test(as.numeric(myGeneExpression) ~ patientMetadata$er)
```
If we run the same check as before on the column names and patient IDs, we see that it fails:-
```{r}
all(colnames(normalizedValues) == patientMetadata$samplename)
```
A solution is to use `match` again. Specifically, we want to know where each column in the expression matrix can be found in the patient metadata. The result is a vector, each item of which is an index for a particular row in the patient metadata
```{r}
match(colnames(normalizedValues),patientMetadata$samplename)
```
The vector we have just generated can then by used to re-order the rows in the patient metadata
```{r}
patientMetadata <- patientMetadata[match(colnames(normalizedValues),patientMetadata$samplename),]
patientMetadata
all(colnames(normalizedValues) == patientMetadata$samplename)
```
And we can now proceed to perform the analysis and can the result we expect
```{r}
rowInd <- match("ESR1", geneAnnotation$HUGO.gene.symbol)
myProbe <- geneAnnotation$probe[rowInd]
myGeneExpression <- normalizedValues[match(myProbe,rownames(normalizedValues)),]
boxplot(as.numeric(myGeneExpression) ~ patientMetadata$er)
t.test(as.numeric(myGeneExpression) ~ patientMetadata$er)
```