---
title: "Clustering and classification"
author: "Mark Dunning; mark 'dot' dunning 'at' cruk.cam.ac.uk, Oscar Rueda; oscar 'dot' rueda 'at' cruk.cam.ac.uk"
date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
output:
html_notebook:
toc: yes
toc_float: yes
---
# Description
This tutorial covers the application of a number of exploratory analysis techniques to real-life microarray experiments:
- Clustering
+ and making a heatmap
- Principal Components Analysis
- Classification
- Survival
However, the techniques presented could be used on other types of data where you have genomic observations in rows and samples in columns.
# Data Pre-processing
We will re-use the colon cancer data in GSE33126, which we saw how to import into R yesterday.
```{r cache=TRUE,message=FALSE,warning=FALSE}
library(GEOquery)
url <- "ftp://ftp.ncbi.nih.gov/pub/geo/DATA/SeriesMatrix/GSE33126/"
filenm <- "data/GSE33126_series_matrix.txt.gz"
if(!file.exists(filenm)) download.file(paste(url, filenm, sep=""), destfile=filenm)
colonData <- getGEO(filename=filenm)
colonData
```
We will log$_2$ transform the expression data to put it on a more convenient scale for analysis
```{r}
exprs (colonData) <- log2 (exprs(colonData))
boxplot(exprs(colonData),outline=FALSE)
```
## Filtering the data
A first step in the analysis of microarray data is often to remove any uninformative probes.
We can do this because typically only around 50% of genes will be expressed, and even fewer
than this will be differentially expressed. Including such non-informative genes in a visualisation will obscure any biological differences that exist. As we mentioned yesterday, the `genefilter` package contains a suite
of tools for the filtering of microarray data. The `varFilter` function allows probes with low
variance to be removed from the dataset. The metric used to decide which probes to remove
is the Inter-Quartile Range (IQR), and by default half of the probes are removed. Both the
function used to do the filtering and the cut-off can be specified by the user.
```{r}
library (genefilter)
dim (colonData)
varFiltered <- varFilter (colonData)
dim (varFiltered)
nrow (colonData) / nrow (varFiltered)
```
## Clustering... or Classification?
- **Unsupervised:** classes unknown, want to discover them from the data (cluster analysis)
- **Supervised:** classes are predefined, want to use a (training or learning) set of labelled objects to form a classifier for classification of future observations
![](images/clus-versus-class.png)
# Clustering
## Why do clustering for genomic data?
- Clustering leads to readily interpretable figures and can be helpful for identifying patterns in time or space.
- We can cluster samples (columns)
+ e.g. identification of new / unknown tumor classes using gene expression profiles
- We can cluster genes (rows)
+ e.g. using large numbers of yeast experiments to identify groups of co-regulated genes
+ we can cluster genes to reduce redundancy (i.e. variable selection) in predictive models
### Subtype Discovery
- There are plenty of examples of using clustering to discover subtypes in the literature
![Perou et al. Molecular portraits of human breast tissues](images/perou.png)
### Clustering as QA
- Also used as a quality assessment tool
+ to check for outlier samples
![](images/array-clustering.png)
- Can check within / between experiment variability and potential confounding factors (batch effect etc)
![](images/annotated-clustering.png)
### Clustering Overview
- Steps in a Cluster Analysis
1. Preprocess the data
2. Choose a *dissimilarity* measure
3. Choose a cluster algorithm
4. Select the number of clusters
5. Validate the procedure
When clustering genes, it is common to pre-process:
- normalise
- filter; remove genes with low variability across samples and many missing values
- (possibly impute missing values)
- standardise; e.g. to zero mean and unit variance:
+ $y_g^* = (y_g - \mu_g)/\sigma_g$
+ subtract the mean expression level and divide by the standard deviation (a sketch is given below)
+ the algorithm may already do this
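As a minimal sketch of that standardisation step in R (using the filtered data from above; `scale` works column-wise, so we transpose, scale, and transpose back):
```{r}
## Standardise each gene (row) to zero mean and unit variance
exprsStd <- t(scale(t(exprs(varFiltered))))
rowMeans(exprsStd)[1:3]        # row means should now be ~0
apply(exprsStd[1:3, ], 1, sd)  # row standard deviations should be ~1
```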
### How to compute similarity
Common *Similarity* / *Dissimilarity* measures include
- Correlation coefficient; *scale invariant*
+ Pearson's correlation;
+ Spearman correlation of ranks
- Distance: scale-dependent
+ Euclidean distance; $d(x,y) = \sqrt{\sum_i (x_i - y_i)^2}$
+ City block (Manhattan) distance; $d(x,y) = \sum_i | x_i - y_i |$
+ and others.....
- ***Warning***: Don't get too hung-up on the choice of measure
+ Clustering is primarily an exploratory tool
+ If your data has signal, it should show regardless of which measure you choose
+ Don't just pick the one which "looks the best"
### How to compute similarity, in R
The `dist` function can be used to calculate a variety of similarity measures
- As usual, we can do `?dist`.
- The input is a matrix of numerical values
- Let's test with some sample data
+ we use `set.seed` here to make sure we always get the same values
```{r}
set.seed(10032016)
myMatrix <- matrix(rnorm(1000),ncol=10)
colnames(myMatrix) <- LETTERS[1:10]
head(myMatrix)
```
- The default is to compute distances from the *rows* of the matrix
+ i.e. which would normally be the genes we are measuring
- Note that the output matrix has some values that appear to be missing
+ why do you think that might be?
```{r eval=FALSE}
d <- dist(myMatrix)
d
```
- The more common use of clustering is to cluster the columns (*samples*)
- To do this, we have to *transpose* the matrix; using the function `t`
```{r}
d <- dist(t(myMatrix))
d
```
- Changing the type of distance measure can be done by changing the `method` argument
+ as always, check the help page for full list of options `?dist`
```{r}
?dist
d.man <- dist(t(myMatrix),method="manhattan")
d.man
```
### How to calculate correlation in R
- This time we use the `cor` function to calculate correlation (`?cor`)
- What type of correlation is this computing?
+ use the help page to find out
- What do you notice about the output, compared to that of `dist`?
```{r}
cor(myMatrix)
```
- Clustering algorithms will expect input in distance matrix form
- We can convert using `as.dist`
+ recall that various `as.` functions exist to convert between different types of data, e.g. `as.numeric`, `as.character` etc.
- A *higher correlation* between two samples means they are *more similar*
- ...so the *distance* between them should be *smaller*
+ `abs` will calculate the absolute value
```{r}
corMat <- as.dist(1-abs(cor(myMatrix)))
corMat
```
### Correlation versus Distance
- The main choice is whether to use a distance-based metric (e.g. Euclidean) or correlation
- A simple toy example of three genes
+ which genes seem to be closest?
![](images/cor-vs-dist.png)
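A quick illustration of the difference, as a sketch with made-up values for three "genes":
```{r}
## Toy example (made-up values): same pattern, different scale
g1 <- c(1, 2, 3, 4, 5)
g2 <- g1 * 10            # same shape as g1, 10x the magnitude
g3 <- rev(g1)            # opposite pattern to g1
cor(g1, g2)              # correlation = 1: g1 and g2 appear identical
dist(rbind(g1, g2, g3))  # Euclidean distance: g1 is much closer to g3 than to g2
```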
## Calculating a distance matrix for gene expression data
When clustering samples, each entry in the distance matrix should be the pairwise distance between two samples. As the `ExpressionSet` object has genes in the rows and samples in the columns, we have to transpose the expression matrix.
**N.B. to calculate the distances between samples, we have to transpose the expression matrix (e.g. using the function `t`). If we do not do this, R will try to compute distances between all genes, which may take a long time or exceed the available memory.**
```{r}
euc.dist <- dist (t(exprs(varFiltered)))
euc.dist
```
Or, if we wanted measures of correlation, we use the `cor` function as above
```{r}
corMat <- cor(exprs(varFiltered))
corMat
cor.dist <- as.dist(1 - corMat)
```
We are now ready to use a clustering algorithm
## Hierarchical methods for clustering
- Start with *n* sample (or $p$ gene) clusters
- At each step, *merge* the two closest clusters using a measure of *between-cluster* dissimilarity which reflects the shape of the clusters
- The distance between the clusters is defined by the method used (e.g. in *complete linkage*, the distance is defined as the distance between the furthest pair of points in the two clusters)
![](images/hierarichal-alg.png)
- We also have some control over the definition of distance used when creating a new cluster
+ this will alter the *shape* of the dendrogram
+ some methods will give more compact clusters
![](images/distance-metrics.png)
- Dendrograms are good visual guides, but the x-axis is *arbitrary*
- Nodes can be reordered
+ Closer on dendrogram $\ne$ more similar
![](images/cluster-caution.png)
### Performing hierarchical clustering, in R
- The function to use here is `hclust` (`?hclust`)
- It takes a distance matrix that you computed previously
```{r}
clust <- hclust(d)
clust
```
The summary provided when we print the object doesn't give much useful information
- we can access particular parts of the output with a `$`
```{r}
names(clust)
clust$order
```
- The standard plotting function has been extended to visualise the result of the clustering
```{r}
plot(clust)
```
- We can change the name of the method used to construct the clusters
```{r}
clust.ward <- hclust(d,method = "ward.D")
par(mfrow=c(1,2))
plot(clust)
plot(clust.ward)
```
******
Exercise:
- Apply hierarchical clustering to the Euclidean and correlation distance matrices from the colon cancer data that you computed previously
- Do you see the same clustering arising from the different distance matrices?
```{r}
## Your answer here ##
```
******
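One possible approach to this exercise (a sketch using the two distance matrices computed earlier):
```{r}
## Cluster the samples using each distance matrix and compare the dendrograms
clust.euclid <- hclust(euc.dist)
clust.cor <- hclust(cor.dist)
par(mfrow = c(1, 2))
plot(clust.euclid, main = "Euclidean distance")
plot(clust.cor, main = "1 - correlation")
```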
The default plotting for a dendrogram labels the "leaves" with the column names from the input matrix, in our case the sample names from GEO. This may make the interpretation of the dendrogram difficult, as it may not be obvious which sample group each sample belongs to. We can alter the appearance of the dendrogram so that sample groups appear in the labels.
- Can use the `labels` argument to specify a vector that you want to use to label each leaf
+ the vector must be in the same order as the columns in the matrix that you did clustering on
```{r}
plot(clust,labels=paste("Sample", 1:10))
```
******
Exercise:
- Add text labels to indicate which sample group (tumour or normal) that each sample belongs to
+ below is a reminder of how to retrieve a matrix with all the sample metadata
+ does the clustering make sense in relation to the sample groups in the experiment?
+ do any samples appear incorrectly clustered?
******
```{r}
pd <- pData(colonData)
View(pd)
## Your answer here ##
```
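One possible solution (a sketch, assuming the `source_name_ch1` column of the metadata holds the tissue type, as it does for this dataset):
```{r}
## Label each leaf with the sample group rather than the GEO accession
plot(clust.euclid, labels = as.character(pd$source_name_ch1))
```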
The `WGCNA` package in Bioconductor provides methods for finding clusters of correlated genes, which we will not be looking at in this tutorial. However, the package is of interest as it provides other visualisation methods for dendrograms, which allow colours to be overlaid to distinguish sample groups.
```{r message=FALSE}
library(WGCNA)
```
We need to create a vector of colours; one for each sample in our dataset
- we already know how to extract the sample group information from the metadata
- we can start by creating a vector containing just one colour name
+ e.g. using the `rep` function
+ then replace the entries corresponding to a tumour with a different value
```{r}
SampleGroup <- pData(colonData)$source_name_ch1
SampleGroup
groupColours <- rep("yellow",length(SampleGroup))
groupColours[SampleGroup == "tumor"] <- "blue"
groupColours
```
```{r}
clust.euclid = hclust(euc.dist)
plotDendroAndColors(clust.euclid,colors=groupColours)
```
Alternatively, one might re-set the *levels* of the vector to be the colours we want
```{r}
groupColours <- SampleGroup
levels(groupColours) <- c("yellow","blue")
```
## Extracting data from the clustering
If we want to interpret the data presented in a clustering analysis, we need a way of extracting which samples are grouped together, or to determine the optimal grouping of samples.
- One intuitive way of assigning groups is to *cut* the dendrogram at a particular height on the y-axis.
- We can do this manually on the plot, or use the `cutree` function to return the labels of samples that belong to the same group when the dendrogram is cut at the specified height, `h`.
- Alternatively, we can specify how many groups, `k`, we want to create.
```{r}
library(cluster)
plot(clust.euclid)
abline(h = 120, col = "red")
cutree(clust.euclid, h = 120)
cutree(clust.euclid, k = 3)
table(cutree(clust.euclid, k = 3), SampleGroup)
```
## Summary of clustering
- Useful as ***exploratory / visualisation*** tools
- Choice of metric, methods and parameters usually guided by prior knowledge about the question
+ The result is guided by what you are looking for
- Validate using different algorithms
- Use prior Biological knowledge, or your sample meta data to see if the clusters make sense
- Be aware
+ Clustering will always produce something. It cannot NOT work
- Clustering is a generic technique that can be applied to many forms of data
+ We will use it on Microarrays, but it can be used on RNA-seq, ChIP, or other NGS data
- Quality assessment of samples
+ checking for batch effects and outliers
+ check effectiveness of normalisation / transformation
## Producing a heatmap
A heatmap is often used to visualise differences between samples.
- Each row represents a gene and each column is an array and coloured cells indicate the expression levels of genes.
- Both samples and genes with similar expression profile are clustered together.
- The code to draw the heatmap is actually quite straightforward
+ the trick is to select what genes to use
Drawing a heatmap in R uses a lot of memory and can take a long time.
- Reducing the amount of data to be plotted is usually recommended.
- Including too many non-informative genes can also make it difficult to spot patterns.
- Typically, data are filtered to include the genes which tell us the most about the biological variation.
+ alternatively, use a pre-selected list of genes of interest
+ e.g. a particular pathway, or gene-list from another study
- In an un-supervised setting, the selection of such genes is done without using prior knowledge about the sample groupings.
- The `rowSds` function from `genefilter` provides a convenient way of calculating the variability of each gene
```{r}
geneVar = rowSds(exprs(colonData))
sd(exprs(colonData)[1,])
geneVar[1]
sd(exprs(colonData)[2,])
geneVar[2]
length(geneVar)
```
Next we can select which genes have the highest variance (say the top 100)
```{r}
highVarGenes <- order(geneVar, decreasing = TRUE)[1:100]
```
- A basic heatmap can now be constructed by selecting the appropriate rows from the expression matrix
+ the `heatmap` function expects a `matrix` object, so we have to convert
+ as we will see it can be customised in lots of ways
+ `labCol` allows us to label the columns in the plot (default is to use the column names from the original matrix used for clustering)
```{r}
heatmap (as.matrix(exprs(colonData)[highVarGenes, ]),labCol = SampleGroup)
```
******
Exercise:
- Does the clustering of samples look correct?
+ are there any issues you can spot?
******
## Customising the heatmap
From this plot we can already discern patterns in the data
In a similar way to adding colours to a dendrogram (with `plotDendroAndColors`), we can add a colour bar underneath the sample dendrogram in the heatmap
- the argument to look for is `ColSideColors`
+ only one band of colour is supported though
```{r}
heatmap (as.matrix(exprs(colonData)[highVarGenes, ]),
labCol = SampleGroup,
ColSideColors = as.character(groupColours))
```
- However, the rows are named according to the row names in the expression matrix
+ which in this case are manufacturer identifiers and do not help the interpretation
******
Exercise:
- How can we add gene names to the heatmap?
+ first find out which column in the feature data corresponds to the gene symbol
+ save this as a vector
+ find out which argument to `heatmap` you need to change (one possible solution is sketched after the code chunk below)
******
```{r}
features <- fData(colonData)
View(features)
## Your answer here ##
```
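One possible solution (a sketch; the gene symbol column is assumed to be called `Symbol` here, so check `colnames(features)` to confirm the exact name for this platform):
```{r eval=FALSE}
geneSymbols <- as.character(features$Symbol)   # assumed column name; may differ on other platforms
heatmap(as.matrix(exprs(colonData)[highVarGenes, ]),
        labCol = SampleGroup,
        labRow = geneSymbols[highVarGenes],
        ColSideColors = as.character(groupColours))
```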
- The default options for the heatmap are to cluster both the genes (rows) and samples (columns).
- However, sometimes we might want to specify a particular order. For example, we might
want to order the columns according to sample groups.
- We can do this by re-ordering the input matrix manually and setting the `Colv` argument to `NA`. This tells the heatmap function not to cluster the columns.
```{r}
heatmap (as.matrix(exprs(colonData)[highVarGenes, ]),
labCol = SampleGroup , Colv=NA)
```
In this plot we set the column order according to the Sample Group
```{r}
heatmap (as.matrix(exprs(colonData)[highVarGenes, order(SampleGroup)]),
labCol = SampleGroup[order(SampleGroup)], Colv = NA)
```
Alternatively, a pre-calculated dendrogram could be used.
```{r}
clus.ward <- hclust(euc.dist, method = "ward.D")
heatmap (as.matrix(exprs(colonData)[highVarGenes, ]) ,
Colv = as.dendrogram(clus.ward) , labCol = SampleGroup )
```
The colours used to display the gene expression values can also be modified. For this, we
can use the `RColorBrewer` package which has functions for creating pre-defined palettes. The
function `display.brewer.all` can be used to display the palettes available through this
package.
**You should avoid using the traditional red / green colour scheme as it may be difficult for people with colour-blindness to interpret!**
```{r}
library (RColorBrewer)
display.brewer.all()
hmcol <- brewer.pal(11 , "RdBu")
heatmap (as.matrix(exprs(colonData)[highVarGenes, ]) ,
ColSideColors = as.character(groupColours) , col=hmcol)
```
## Other packages that produce heatmaps
One drawback of the standard `heatmap` function is that it only allows one "track" of colours below the dendrogram. We might wish to display various sample groupings using this feature. The `heatmap.plus` package allows us to do just this.
```{r}
library(heatmap.plus)
colourMatrix <- matrix(nrow=length(SampleGroup),ncol=2)
Patient <- pd$characteristics_ch1.1
patientCol <- rep(rainbow(n=length(unique(Patient))),each=2)
colourMatrix[,1] <- as.character(groupColours)
colourMatrix[,2] <- patientCol
heatmap.plus (as.matrix(exprs(colonData)[highVarGenes, ]) ,
ColSideColors = as.matrix(colourMatrix) , col=hmcol)
```
Another alternative is provided by the `gplots` package. The `heatmap.2` function can be used in the same fashion as `heatmap`. The plots produced include a colour legend for the cells in the heatmap. By default, a density plot of each column is also produced.
```{r}
library(gplots)
heatmap.2 (as.matrix(exprs(colonData)[highVarGenes, ]) ,
ColSideColors = as.character(groupColours) , col=hmcol)
```
We can turn-off the column density if we wish.
```{r}
heatmap.2 (as.matrix(exprs(colonData)[highVarGenes, ]) ,
ColSideColors = as.character(groupColours) , col=hmcol,trace="none")
```
## (Extra) What is the correct number of clusters?
A *Silhouette plot* can be used to choose the optimal number of clusters. For each sample, we calculate a value that quantifies how well it 'fits' the cluster that it has been assigned to. If the value is around 1, then the sample closely fits other samples in the same cluster. However, if the value is around 0 the sample could belong to another cluster. In the silhouette plot, the values for each cluster are plotted together and ordered from largest to smallest. The number of samples belonging to each group is also displayed.
The `silhouette` function is used to calculate this measure, and requires the output from a clustering method and the corresponding distance matrix.
```{r}
silhouette(cutree(clust.euclid, k = 2),euc.dist)
```
```{r fig.width=12}
par(mfrow=c(2,2))
plot(silhouette(cutree(clust.euclid, k = 2), euc.dist), col = "red", main = paste("k=", 2))
plot(silhouette(cutree(clust.euclid, k = 3), euc.dist), col = "red", main = paste("k=", 3))
plot(silhouette(cutree(clust.euclid, k = 4), euc.dist), col = "red", main = paste("k=", 4))
plot(silhouette(cutree(clust.euclid, k = 5), euc.dist), col = "red", main = paste("k=", 5))
```
The `pvclust` package is also able to provide some assessment of whether the clusters you get are significant or not.
- `install.packages("pvclust")` (a brief sketch of its use is given after the code chunk below)
- If we have prior knowledge (as we do for this dataset) we might be tempted to use a supervised method that takes the number of expected clusters into account
+ we can compare to what we obtain from an unsupervised method
+ more on this later, but here is a taster of what such supervised methods can do
```{r}
library(cluster)
supervised.clus <- pam(euc.dist,k=2)
clusplot(supervised.clus)
supervised.clus$clustering
table(supervised.clus$clustering,SampleGroup)
```
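Returning to `pvclust`, a minimal sketch of how it might be applied to these data (assuming the package has been installed; `nboot` is kept small here purely to reduce the run time):
```{r eval=FALSE}
library(pvclust)
## pvclust clusters the columns (samples) and attaches bootstrap p-values to each node
pv <- pvclust(exprs(varFiltered), method.dist = "euclidean", nboot = 100)
plot(pv)
pvrect(pv, alpha = 0.95)  # highlight clusters with high bootstrap support
```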
# Principal Components Analysis
## Why Use PCA
- In genomic data, we have a large number of variables which are often highly correlated
- PCA is one of many *dimension-reduction* techniques that can remove redundancy and give a smaller, more manageable set of variables. In summary, it can help us to:
- Reduce *dimensionality* of the data
- Decrease the *redundancy* of the data
- Filter the *noise* in the data
- *Compress* the data
## PCA Example: Whisky
The data: 86 malt whiskies scored for 12 different taste categories
(https://www.mathstat.strath.ac.uk/outreach/nessie/nessie_whisky.html)
- Q. Can we find which whiskies have a similar taste profile?
- Q. What are the factors that determine the taste profile?
- Q. What other types of whisky are similar to my favourite?
```{r echo=FALSE}
download.file("https://www.mathstat.strath.ac.uk/outreach/nessie/datasets/whiskies.txt",destfile = "whiskies.txt")
whisky <- read.csv("whiskies.txt")
scores <- whisky[,c(3:14)]
rownames(scores) <- whisky$Distillery
scores
```
## What does PCA tell us for these data?
Allows us to visualise the dataset in two (perhaps three) dimensions and see the main sources of variation in the data
- in this case, which samples taste most similar
![](images/whisky-pca.png)
It is particularly useful to overlay known sample attributes to the plot
- can explain the similarities between whiskies
![](images/whisky-pca-annotated.png)
## More-relevant example
- MAQC microarrays with mixtures of Brain and Reference RNA (UHRR) at different concentrations
- Run in different labs
+ what do you notice?
![](images/MAQC-pca.png)
So, PCA can be used as a quality assessment tool and can inform us of any factors we need to account for in the analysis
- In this case we could correct for the batch effect, or add a batch factor to the linear model for analysis
![](images/MAQC-pca-corrected.png)
## What is going-on here?
A two-dimensional case:
- Find the direction of most variation (PC1)
- The direction that explains the next largest amount of variation is orthogonal (at 90 degrees) to it (PC2)
- Each point on the original x and y axes is a linear combination of PC1 and PC2
- Higher-dimensional data can have many more components: PC3, PC4, ...
+ although they will become increasingly less influential
![pca-explained](images/pca-explained.png)
## Running the PCA
The `prcomp` function does the hard-work for us
- It is computing *eigenvectors* and *eigenvalues* from our input matrix
+ or rather the covariance matrix
- We have 12 components (as we had 12 variables)
- The components are ordered according to the amount of variance they *explain*
```{r}
scores
pca <- prcomp(scores)
pca
names(pca)
summary(pca)
```
## Dissecting the PCA results
The variable loadings are given by the `$rotation` matrix, which has one row for each variable (taste category) and one column for each principal component.
- The principal components are ordered according to the amount of variance explained.
- The actual values of a particular component have no absolute meaning,
+ but their relative values can inform us about the relationships between variables.
+ e.g. Smoky seems to be important for PC1
+ Smoky and Floral are opposite ends of the spectrum for PC1
```{r}
pca$rotation
```
## How many components?
We look at the variance explained by each component and judge where it "drops off"
- Sometimes called a "scree-plot"
```{r}
plot(pca)
```
![scree](images/scree.jpg)
The new co-ordinate system is given by `pca$x`
- One row for each of our whiskies, one column for each component
- The original coordinates are a linear combination of the new ones
+ an exercise for the reader (a sketch follows the next code chunk)
```{r}
pca$x[1:3,1:5]
```
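As a sketch of that "exercise for the reader": because `prcomp` centres the data by default, the original scores can be recovered by rotating `pca$x` back and adding the column means stored in `pca$center`.
```{r}
## Reconstruct the original taste scores from the principal components
reconstructed <- sweep(pca$x %*% t(pca$rotation), 2, pca$center, "+")
all.equal(as.matrix(scores), reconstructed, check.attributes = FALSE)
```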
Plotting the first and second columns as a scatter plot gives the plot we saw above
```{r}
plot(pca$x[,1],pca$x[,2],
pch=16)
```
### Digression about ggplot2
The `ggplot2` package was actually used to create the plots above.
- Beyond the scope of the course, but you should check out our [Intermediate R](http://bioinformatics-core-shared-training.github.io/r-intermediate/) course for more details
- Essentially, it allows us to map between the variables in our data and the characteristics of the plot
+ x- and y-axis position, colour, shape, size of points
```{r}
df <- data.frame(whisky,pca$x)
df
library(ggplot2)
p <- ggplot(df, aes(x=PC1,y=PC2,label=Distillery)) + geom_point() + geom_text(alpha=0.3)
p
```
```{r}
p <- ggplot(df, aes(x=PC1,y=PC2,label=Distillery,cex=Smoky,col=as.factor(Floral))) + geom_point() + geom_text(cex=5,alpha=0.3)
p
```
## Applying PCA to gene expression data
In the above example, we had several whiskies and had made several observations about each
- Each row was a whisky, and the observations were in columns
In the gene expression case, we have biological samples and the measurements we have made are gene expression levels
- So for PCA we need to transpose the matrix
+ like we did prior to clustering
- We can then proceed as before
```{r}
pca.geneExpression <- prcomp(t(exprs(varFiltered)))
summary(pca.geneExpression)
plot(pca.geneExpression)
```
This time, the matrix `$rotation` has one row for each gene and one column for each component
- We are more interested in the new (projected) coordinates and how they relate to the (known) sample groups
+ remember these are found in `$x`
```{r}
head(pca.geneExpression$rotation)
head(pca.geneExpression$x)
```
Quite often, we plot the first few PCs against each other to visualise the sample relationships.
- This can be achieved with a scatter plot with the first principal component on the x-axis and the second principal component on the y-axis.
+ the samples clearly seem to separate, but is this separation consistent with the metadata?
```{r}
plot(pca.geneExpression$x[,1],pca.geneExpression$x[,2])
```
We can improve this plot by adding colours for the different sample groups and choosing a different plotting character.
```{r}
plot(pca.geneExpression$x[,1],pca.geneExpression$x[,2],
pch=16,col=as.character(groupColours))
```
We might also add a legend to explain the different colours.
```{r}
plot(pca.geneExpression$x[,1],pca.geneExpression$x[,2],
pch=16,col=as.character(groupColours))
legend("bottomright",fill=c("blue","yellow"),legend=c("tumour","normal"))
text(pca.geneExpression$x[,1],pca.geneExpression$x[,2]-0.01,labels = pd$geo_accession)
```
Happily, the first component seems to separate the tumours from normals
- We can also see this with a boxplot
- To find what the other components relate to, we would have to use other information in the metadata
+ there doesn't seem to be anything useful though
+ we should try and capture as much metadata as possible; batch, age of patient, tumour purity...
```{r}
boxplot(pca.geneExpression$x[,1] ~ SampleGroup)
```
# Classification
The Bioconductor project has a collection of example datasets. Often these are used as examples to illustrate a particular package or functionality, or to accompany the analysis presented in a publication. For example, several datasets are presented to accompany the `genefu`
package which has functions useful for the classification of breast cancer patients based on
expression profiles.
An experimental dataset can be installed and loaded as with any other Bioconductor package. The data itself is saved as an object in the package. You will need to see the documentation for the package to find out the relevant object name. The full list of datasets available
through Bioconductor can be found [here](http://bioconductor.org/packages/release/BiocViews.html#___ExperimentData)
```{r}
library(breastCancerVDX)
library(breastCancerTRANSBIG)
data(vdx)
data(transbig)
dim(vdx)
dim(transbig)
annotation(vdx)
annotation(transbig)
```
If we want any classifiers to be reproducible and applicable to other datasets, it is sensible
to exclude probes that do not have sufficient annotation from the analysis. For this, we can
use the genefilter package as before. The `nsFilter` function performs this annotation-based
filtering as well as variance filtering. The output of the function includes details about how
many probes were removed at each stage of the filtering.
```{r}
library (genefilter)
vdx.filt <- nsFilter(vdx)
vdx.filt
vdx.filt <- vdx.filt[[1]]
```
We now format the vdx data for `pamr` and train a classifier to predict ER status. For extra clarity in the results, it might be useful to rename the binary ER status used in the data package to something more descriptive.
```{r}
library(pamr)
dat <- exprs(vdx.filt)
gN <- as.character(fData(vdx.filt)$Gene.symbol)
gI <- featureNames (vdx.filt)
sI <- sampleNames (vdx.filt)
erStatus <- pData (vdx)$er
erStatus <- gsub(0, "ER-", erStatus)
erStatus <- gsub(1, "ER+", erStatus)
```
Fitting the model
```{r}
train.dat <- list ( x = dat , y = erStatus , genenames = gN ,
geneid = gI , sampleid = sI )
model <- pamr.train(train.dat ,n.threshold = 100)
model
```
We can perform cross-validation using the `pamr.cv` function. Printing the output of this function shows a table of how many genes were used at each threshold, and the number of classification errors. Both these values need to be taken into account when choosing a suitable threshold. The `pamr.plotcv` function can assist with this by producing a diagnostic plot which shows how the error changes with the number of genes. In the plot produced by this function there are two panels; the top one shows the errors in the whole dataset and the bottom one considers each class separately. In each panel, the x-axis corresponds to the threshold (and the number of genes at each threshold) whereas the y-axis is the number of misclassifications.
```{r fig.width=12,fig.height=12}
model.cv <- pamr.cv(model , train.dat , nfold = 10)
model.cv
pamr.plotcv(model.cv)
```
In the following sections, feel free to experiment with different values of the threshold
(which we will call `Delta`)
The misclassifications can easily be visualised as a 'confusion table'. This simply tabulates the classes assigned to each sample against the original label assigned to the sample, e.g. misclassifications are samples that we thought were 'ER+' but have been assigned to the 'ER-' group by the classifier, or 'ER-' samples assigned as 'ER+' by the classifier.
```{r}
Delta <- 8
pamr.confusion(model.cv , Delta)
```
A visual representation of the class separation can be obtained using the `pamr.plotcvprob`
function. For each sample there are two circles representing the probability of that sample being classified ER- (red) or ER+ (green).
```{r}
pamr.plotcvprob(model , train.dat , Delta )
```
There are a couple of ways of extracting the details of the genes that have been used in the classifier. We can list their names using the `pamr.listgenes` function, which in our case just returns the microarray probe names. We can, however, use these IDs to query the `featureData` stored with the original `vdx` object. We can also plot the expression values for each gene, coloured according to the class label.
```{r, fig.width=12,fig.height=12}
pamr.listgenes(model , train.dat , Delta )
classifierGenes <- pamr.listgenes(model , train.dat , Delta )[,1]
pamr.geneplot(model , train.dat ,Delta)
```
***You may get an error message `Error in plot.new(): Figure margins too large` when trying to produce the gene plot. If this occurs, try increasing the size of your plotting window, or decrease the number of genes by increasing the threshold. Alternatively, the following code will write the plots to a pdf.***
```{r}
pdf ("classifierProfiles.pdf")
for (i in 1:length(classifierGenes)) {
Symbol <- fData(vdx.filt)[classifierGenes[i] , "Gene.symbol"]
boxplot(exprs(vdx.filt)[classifierGenes[i], ] ~ erStatus ,
main = Symbol )
}
dev.off()
```
Use the genes identified by the classifier to produce a heatmap to confirm that they
separate the samples as expected.
```{r}
symbols <- fData(vdx.filt)[classifierGenes , "Gene.symbol"]
heatmap(exprs(vdx.filt)[classifierGenes, ] , labRow = symbols )
```
## Testing the model
We can now test the classifier on an external dataset. We choose the `transbig` dataset for simplicity, as it was generated on the same microarray platform.
```{r}