-
Notifications
You must be signed in to change notification settings - Fork 0
/
Corr.Rmd
76 lines (59 loc) · 3.85 KB
/
Corr.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
---
title: "Correlation: Variance, Covariance, and Plotting"
author: "Kim Wiggins"
output:
pdf_document: default
Date: July 4, 2021
---
```{r libraries, include=FALSE}
library(KernSmooth)
library(MVA)
load(file = "Correlations.RData")
```
Park names, park sizes and visitors printed below:
```{r, include=TRUE}
parks
```
Use the ```cov()``` and ```cor()``` commands to view the sample covariance matrix and the sample correlation matrices.
```{r}
cov(parks)
```
*Covariance matrix, Parks data*
```{r}
cor(parks)
```
*Correlation Matrix, Parks Data*
The correlation matrix displays the Pearson correlation coefficient between two variables to demonstrate how strongly those variables are related to each other. As the scaled form of covariance, correlation coefficients are standardized (non-scalar) and dimensionless (unit-free), and can thus be interpreted easily on a scale from -1 to +1. Covariance values vary from negative infinity to positive infinity. While covariance indicates the direction of the relationship only, correlation indicates direction and strength, resulting in a proportion metric measuring how much on average these variables differ with regard to one another. </br>
Because the covariance matrix isn't scaled, we can only interpret direction. The size of the park is measured on a different scale than the visitor count, so interpretation regarding strength of the direction is not appropriate using the covariance matrix.
In this case, the covariance matrix tells us that the relationship is positive, but the correlation coefficient of 0.173 being so close to 0 indicates that the linear relationship is not strong. The variables may have some other strong relationship that is not linear, and we must take care to not interpret a 0 (or values close to 0) as indicating lack of any relationship - only lack of a strong linear one.
We can also draw a matrix of plots containing histograms and boxplots of park sizes and number of visitors, making sure to give appropriate titles and axes labels for the histograms and clearly interpret the graph/s.
```{r}
layout(matrix(1:2, nc=2))
# 1 by 2 rows
boxplot(parks$size, main = "Size")
boxplot(parks$visitors, main = "Visitors")
```
Examination of the boxplots indicate two outliers: one for each variable. Visual inspection of the elements reveal the potential outliers to be Yellowstone for Size, and Great Smoky for visitors.
Be cautious to consider these *potential* outliers - they could be anomalies or belong to a bimodal distribution and actually be quite typical of the data's behavior.
Histograms
```{r}
par(mfrow=c(2,1))
hist(parks$size, main = "Park Acreage, in Thousands", xlab = "Size", prob= T)
hist(parks$visitors, main = "Park Annual Visitors, in Thousands", xlab = "Visitors", prob = T)
```
Histograms indicate that most values for Size fall within the 0-1000 range. Visitor ranges indicate the majority of observations fall between 1 and 4, with a potential outlier in the 8-10 range.
Explore for possible outliers. Use appropriate graphs to draw conclusions. Calculate correlation coefficient between park size and number of park visitors before and after removing possible outliers. Comment on your findings.
```{r}
outpark = match(parkname <- c("Great Smoky", "Yellostone"), rownames(parks))
bvbox(parks, mtitle = "Bivariate Boxplot, Parks", xlab = size.lab, ylab = visitors.lab)
text(parks$size[outpark], parks$visitors[outpark], labels = parkname, pos = c(2,2,4,2,2))
```
The outliers are labeled, and in this case, will be removed from future analysis. In future cases, that is likely not the best practice, but for the purposes of comparing before and after correlations here, we will remove them.
After exluding the outliers, the correlation coefficient increases from 0.173 to 0.399.
```{r}
with(parks, cor(size, visitors))
```
```{r}
with(parks, cor(size[-outpark], visitors[-outpark]))
```
```{r}