---
title: "Lesson 4 - basic R plotting"
author: "Mik Black"
date: "6 November 2015"
output: html_document
---
## R graphics
- R provides multiple "systems" for producing graphics
- We're going to focus on the oldest (and arguably easiest) of these: R base graphics
- These capabilities are included with base R (i.e., what you get when you install the R software), and don't require any additional packages to be loaded.
---
## But first... data summaries
- One of the reasons for generating plots is to produce a visual summary of the data
- this helps us (and others) to understand key features such as _centre_ and _spread_.
- In addition to graphical methods, we can also use tools to produce numerical summaries of this information.
- For example: mean, median, standard deviation, variance, range, min, max,
interquartile range...
---
## Examples: non-graphical data summaries
```{r}
## Let's load the data and take a look at it first
data <- read.csv('dummy_data.csv')
## What type of object have we created?
class(data)
## What are its dimensions?
dim(data)
## What are the variable names?
names(data)
## What do the first few rows look like?
head(data)
## What does the start of the BMI variable look like?
## (note the use of the $ operator to access the BMI information from the data object)
head(data$BMI)
## What if we "attach" the data?
attach(data)
## Look, no more $ sign!
head(BMI)
## Summarize the entire data frame
summary(data)
## Summarize just the BMI variable
summary(BMI)
## Calculate the mean of the BMI variable
mean(BMI)
## Oops, we need to get rid of the NA values
mean(BMI, na.rm = TRUE)
## Calculate the median of the BMI variable
median(BMI, na.rm = TRUE)
## Calculate the standard deviation of the BMI variable
sd(BMI, na.rm = TRUE)
```
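The effect of ```na.rm``` is easy to check on a small made-up vector. The following is a minimal sketch: the values in ```toy_bmi``` are hypothetical and are not taken from ```dummy_data.csv```.

```r
## Hypothetical BMI-like values with one missing entry (not from dummy_data.csv)
toy_bmi <- c(22.5, 31.0, NA, 27.4, 24.1)

## A single NA makes the mean NA by default
mean(toy_bmi)

## Drop the NA values before calculating
toy_mean   <- mean(toy_bmi, na.rm = TRUE)
toy_median <- median(toy_bmi, na.rm = TRUE)
toy_sd     <- sd(toy_bmi, na.rm = TRUE)

## Count how many values are missing
n_missing <- sum(is.na(toy_bmi))
```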
---
## Other summaries
- the above are just a few summaries that can be produced.
- try: IQR, max, min, range...
- We can also summarize categorical data using tables via the ```table``` command.
- ```prop.table``` can then be used to calculate proportions.
```{r}
## Make a table of the ethnicity data
## Note that informative class labels can be a good thing...
table(ETHCLASS)
## Include the NA information
table(ETHCLASS, useNA = 'always')
## Diabetes status
table(DIABETES)
## Two-way table of ethnicity and diabetes
table(ETHCLASS, DIABETES)
## Add dimension names
table(ETHCLASS, DIABETES, dnn = c("Ethnicity","Diabetes"))
## We can also flip this via the transpose operator:
t( table(ETHCLASS, DIABETES) )
## Use the prop.table command to calculate proportions (cells sum to 1):
prop.table( table(ETHCLASS, DIABETES) )
## You can also calculate proportions across rows (rows sum to 1):
prop.table( table(ETHCLASS, DIABETES), 1 )
## or down columns (columns sum to 1):
prop.table( table(ETHCLASS, DIABETES), 2 )
```
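To confirm what the second argument of ```prop.table``` does, the sums can be checked directly. This is a small sketch using hypothetical vectors (```toy_group``` and ```toy_status``` are invented for illustration and are not in the lesson data).

```r
## Hypothetical categorical data (not from dummy_data.csv)
toy_group  <- c("A", "A", "B", "B", "B", "C")
toy_status <- c("No", "Yes", "No", "No", "Yes", "No")

toy_tab <- table(toy_group, toy_status, dnn = c("Group", "Status"))

## Overall proportions: all cells together sum to 1
total_sum <- sum(prop.table(toy_tab))

## Row proportions (margin 1): each row sums to 1
row_sums <- rowSums(prop.table(toy_tab, 1))

## Column proportions (margin 2): each column sums to 1
col_sums <- colSums(prop.table(toy_tab, 2))
```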
---
## Plotting
- Need to use the right sort of plot for the data we have
- One continuous variable: histogram (```hist```), boxplot (```boxplot```),
density plot (```plot``` and ```density``` - see below)
- Two continuous variables: scatterplot (```plot```)
- One (or more) categorical variables: barplots (```barplot```)
- One continuous and one (or more) categorical variable: boxplots (```boxplot```)
---
## Histograms - single continuous variable
```{r}
## Generate histogram of age data
hist( AGECOL )
```
## Making plots prettier
- most base R plotting methods share parameters relating to plot aesthetics (e.g.,
axis labels/limits, title, text size, plotting colours etc)
- this makes it very easy to customize plots, and make them look fairly nice
```{r}
## Add title and axis labels
hist( AGECOL , main = "Histogram of Age", xlab = "Age (years)")
## Add some colour...
hist( AGECOL , main = "Histogram of Age", xlab = "Age (years)", col='lightblue')
## Extend y axis
hist( AGECOL , main = "Histogram of Age", xlab = "Age (years)", ylim = c(0,400))
```
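Rather than guessing a value for ```ylim```, one option is to compute the histogram first and base the limit on the tallest bar. The sketch below assumes simulated ages (```toy_age``` is hypothetical), and the 10% headroom is an arbitrary choice.

```r
## Simulated ages, for illustration only (not from dummy_data.csv)
set.seed(1)
toy_age <- round(runif(200, min = 18, max = 90))

## Compute the histogram without drawing it
h <- hist(toy_age, plot = FALSE)

## Height of the tallest bar, plus 10% headroom
y_top <- max(h$counts) * 1.1

## Draw the histogram using the computed y axis limit
hist(toy_age, main = "Histogram of Age", xlab = "Age (years)",
     col = "lightblue", ylim = c(0, y_top))
```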
---
## Boxplots - single continuous variable
```{r}
## Boxplot of BMI
boxplot(BMI)
```
## Boxplots - continuous & categorical variables
```{r}
## Boxplot of BMI vs Diabetes Status
boxplot(BMI ~ DIABETES, xlab = "Diabetes status", ylab = "BMI", main = "BMI vs diabetes status")
## Change the colours and box widths
## Note: col defines the box colour, and border defines the border colour
boxplot(BMI ~ DIABETES, xlab = "Diabetes status", ylab = "BMI", main = "BMI vs diabetes status",
boxwex = 0.2, col = "lightblue", border = "blue")
```
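The formula interface also accepts a ```data``` argument, which avoids the need for ```attach()```. A minimal sketch, assuming a simulated data frame called ```toy``` (hypothetical, not the lesson data):

```r
## Simulated data frame, for illustration only (not from dummy_data.csv)
set.seed(1)
toy <- data.frame(
  bmi    = rnorm(60, mean = 27, sd = 4),
  status = rep(c("No", "Yes"), each = 30)
)

## The formula is evaluated inside `toy`, so no attach() or $ is needed
bp <- boxplot(bmi ~ status, data = toy, xlab = "Status", ylab = "BMI",
              boxwex = 0.2, col = "lightblue", border = "blue")

## boxplot() invisibly returns the statistics it drew; the rows of bp$stats are
## lower whisker, lower hinge, median, upper hinge, upper whisker
bp$stats
bp$names
```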
---
## Scatterplots - two continuous variables
```{r}
## Scatterplot of Age vs BMI
plot(AGECOL, BMI)
## Change plotting symbol, size and colour
plot(AGECOL, BMI, pch = 16, col = "blue", cex = 0.5)
## Use SEX variable to denote colour (has to be numeric, or actual colours)
## R's default colours are: 1 = black, 2 = red, 3 = green, 4 = blue, 5 = cyan, 6 = magenta,
## 7 = yellow, 8 = grey. Larger numbers recycle (9 = black), and 0 = background (usually white)
plot(AGECOL, BMI, col = SEX, pch = 16)
## Adjust y axis limits and add a legend
plot(AGECOL, BMI, col = SEX, pch = 16, ylim = c(18, 80))
legend(20, 80, c("Male", "Female"), fill = c("black","red"))
```
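If the grouping variable is stored as text rather than numbers, one approach is to convert it to a factor and use the level codes to look up colours. The sketch below uses simulated data; ```toy_age```, ```toy_bmi``` and ```toy_sex``` are hypothetical names, and the coding of the real ```SEX``` variable is not assumed here.

```r
## Simulated data, for illustration only (not from dummy_data.csv)
set.seed(1)
toy_age <- runif(50, min = 18, max = 90)
toy_bmi <- rnorm(50, mean = 27, sd = 4)
toy_sex <- sample(c("Male", "Female"), 50, replace = TRUE)

## Convert the labels to a factor with an explicit level order
toy_sex_f <- factor(toy_sex, levels = c("Male", "Female"))

## Look up one colour per point: level 1 -> black, level 2 -> red
sex_cols   <- c("black", "red")
point_cols <- sex_cols[as.integer(toy_sex_f)]

plot(toy_age, toy_bmi, col = point_cols, pch = 16,
     xlab = "Age (years)", ylab = "BMI")

## Build the legend from the same levels and colours, so they stay in sync
legend("topright", legend = levels(toy_sex_f), col = sex_cols, pch = 16)
```

Using the same ```sex_cols``` vector for both the points and the legend means the legend cannot disagree with the plot.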
---
## Barplots - one or more categorical variables
```{r}
## Summarize ethnicity data as a table:
table(ETHCLASS)
## Represent this information as a barplot
barplot( table(ETHCLASS) )
## Generate 2-way table for ethnicity and diabetes
table(ETHCLASS, DIABETES)
## Represent this information as a (stacked) barplot
barplot( table(ETHCLASS, DIABETES), xlab = "Diabetes status", ylab = "Frequency" )
## Same information, unstacked
barplot( table(ETHCLASS, DIABETES), xlab = "Diabetes status", ylab = "Frequency", beside = TRUE )
## What if we want to look at it the other way around?
## Use the transpose function, t(), to flip the table
t( table(ETHCLASS, DIABETES) )
## Then make a barplot
barplot( t(table(ETHCLASS, DIABETES)), beside = TRUE )
## What if we're more interested in proportions?
prop.table( table(ETHCLASS, DIABETES) )
## Make the corresponding barplot
barplot( prop.table(table(ETHCLASS, DIABETES)) )
## Or we can make the columns sum to one
prop.table( table(ETHCLASS, DIABETES), 2)
barplot( prop.table(table(ETHCLASS, DIABETES), 2) )
```
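Barplots of two-way tables are hard to read without a key. As a sketch, the ```legend.text``` argument of ```barplot``` can add one from the table's row names; the data below are hypothetical and the colours are an arbitrary choice.

```r
## Hypothetical categorical data (not from dummy_data.csv)
toy_group  <- c("A", "A", "B", "B", "B", "C", "C", "A")
toy_status <- c("No", "Yes", "No", "No", "Yes", "No", "Yes", "No")
toy_tab <- table(toy_group, toy_status)

## Rows of the table become the bars within each cluster;
## legend.text = TRUE labels them using the table's row names
bar_mids <- barplot(toy_tab, beside = TRUE, legend.text = TRUE,
                    col = c("lightblue", "blue", "darkblue"),
                    xlab = "Status", ylab = "Frequency")
```

```barplot``` also returns the x positions of the bar midpoints (stored here in ```bar_mids```), which can be useful for adding text or error bars afterwards.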
---
## Density plots
- Sometimes it is useful to represent continuous data as a density, rather than
a histogram.
```{r}
## Histogram of BMI data
hist(BMI)
## Density plot of BMI data
## Note: have to remove NA values, or the density() function gives an error.
plot( density(BMI, na.rm = TRUE) )
```
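The two views can also be combined by drawing the histogram on the density scale and adding the density curve on top. This is a minimal sketch with simulated values (```toy_bmi``` is hypothetical).

```r
## Simulated BMI-like values with some NAs, for illustration only
set.seed(1)
toy_bmi <- c(rnorm(300, mean = 27, sd = 4), NA, NA)

## freq = FALSE puts the histogram on the density scale (bar areas sum to 1)
hist(toy_bmi, freq = FALSE, main = "BMI: histogram and density", xlab = "BMI")

## Add the density estimate on top of the histogram
toy_dens <- density(toy_bmi, na.rm = TRUE)
lines(toy_dens, col = "blue", lwd = 2)
```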
---
## Multiple plots per page
- Arranging multiple plots on a single figure can be useful
- In base R graphics this is accomplished by setting either the ```mfrow``` or
```mfcol``` argument of the ```par()``` function.
- both specify a grid of plots: which argument you use depends on the order in which you want to generate the plots.
- ```mfrow``` fills the grid row-by-row (left to right)
- ```mfcol``` fills the grid column-by-column (top to bottom)
```{r}
## Specify a 1 x 2 (1 row, 2 columns) grid of plots, and fill by rows, left to right
par(mfrow = c(1, 2))
## Place a histogram of BMI in the first cell of the first row
hist(BMI)
## Place a density plot of BMI in the second cell of the first row
plot(density(BMI, na.rm = TRUE))
```
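The column-wise fill order of ```mfcol``` is easiest to see on a 2 x 2 grid. The sketch below uses simulated values (```toy_x``` is hypothetical) and also saves the previous settings so that the layout can be reset afterwards.

```r
## Simulated values, for illustration only (not from dummy_data.csv)
set.seed(1)
toy_x <- rnorm(100)

## par() returns the previous settings, so they can be restored later
old_par <- par(mfcol = c(2, 2))

## With mfcol, the grid is filled down the first column, then the second
hist(toy_x, main = "1: top left")
boxplot(toy_x, main = "2: bottom left")
plot(density(toy_x), main = "3: top right")
plot(toy_x, main = "4: bottom right")

## Go back to the previous layout
par(old_par)
```

Without the final ```par(old_par)```, later plots on the same device would continue to be drawn on the 2 x 2 grid.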
---
## Saving plots
- R can output graphics in a number of formats: pdf, jpeg, png etc
- In RStudio you can also save your plots via the "Export" menu in the "Plots" tab.
- The following code opens a PDF file, writes a histogram to that file, and then
closes the connection to the file.
```{r}
## Open PDF file for writing, and specify dimensions
## Note that PDF files use dimensions based on inches, whereas png and
## jpeg files use pixels
pdf(file = 'bmi_histogram.pdf', height = 6, width = 6)
## Write histogram of BMI to file
hist(BMI)
## Close the file
dev.off()
## For multi-plot figures, you need to adjust the dimensions
pdf(file = 'bmi_histogram_density.pdf', height = 6, width = 12)
par(mfrow = c(1, 2))
hist(BMI)
plot(density(BMI, na.rm = TRUE))
dev.off()
```
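Saving to PNG follows the same open, plot, close pattern, but with dimensions in pixels. This is a sketch only: it writes simulated data to a temporary file (```out_file``` is a hypothetical name), and it assumes your R installation has PNG support.

```r
## Write to a temporary file so nothing in the working directory is overwritten
out_file <- tempfile(fileext = ".png")

## png() dimensions are in pixels by default
png(filename = out_file, width = 600, height = 600)
hist(rnorm(100), main = "Simulated values", xlab = "Value")
dev.off()

## Check that the file was written
file.exists(out_file)
```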