forked from KimmoVehkalahti/IODS-project
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathchapter3.Rmd
112 lines (87 loc) · 5.85 KB
/
chapter3.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
# Chapter 3 Logistic regression
##read the data
Description: this data is the student alcohol consumption data, and there are 37 variables in this data. I am going to select four variables from this data for my analysis.
```{r}
url = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt"
alc = read.csv(url, sep=",")
dim(alc)
colnames(alc)
```
## analysis the relationship between high alcohol consumption and age, failures, absences and sex
Four variables that I selected:
Age, failures, absences and sex
Hypothesis:
1. frade, sex, failures and absences can all predict the high alcohol consumption.
## Numerically and graphically explore the distribution of the data
comments: the cross tabulation and charts showed that the distribution of age, failures, absences and sex for high/low alcohol use, and they showed that for age, the proportion of age 17 is the biggest proportion (35) of high alcohol use than other age group, but not that different from other groups; and male are more likely to be high alcohol use compared to female; failures 0 are more likely to be high alcohol use; absences 0 is more likely to be high use. These may partly comfirmed our Hypothesis: only failures, absences and sex have statistical significant relationship with alcohol consumption
Cross tabulation
```{r}
malc <- glm(high_use ~ age + failures + absences + sex, data = alc, family = "binomial")
summary(malc)
summary(alc)
table(high_use = alc$high_use, age = alc$age)
table(high_use = alc$high_use, sex = alc$sex)
table(high_use = alc$high_use, failures = alc$failures)
table(high_use = alc$high_use, absences = alc$absences)
```
graphically
```{r}
library(dplyr)
library(ggplot2)
malc <- glm(high_use ~ failures + absences + sex, data = alc, family = "binomial")
g1 <- ggplot(data = alc, aes(x = alc_use, fill = sex))
g1 + geom_bar()
alc <- mutate(alc, high_use = alc_use > 2)
g2 <- ggplot (alc, aes(high_use))
g2 + facet_wrap("age") + geom_bar()
g2 + facet_wrap("failures") + geom_bar()
g2 + facet_wrap("absences") + geom_bar()
g2 + facet_wrap("sex") + geom_bar()
library(tidyr)
glimpse(alc)
gather(alc) %>% glimpse
gather(alc) %>% ggplot(aes(value))+ facet_wrap("key", scales = "free")+ geom_bar()
```
## logistic regression to explore the relationship between high/low alcohol consumption and age, failures, absences and sex
Interpretation: the summary of the fitted model showed that failures, absences and sex have statistically significant relationship with high/low alcohol use (the p value of failures, absences and sex are all below 0.05), while age do not have statistically relationship with high/low alcohol use (the p value of age is above 0.05); interpret coefficients as odds ratio, use intercept of age(-3.82) + estimate od age (0.11) * the students is age what (since the original data can not see the category of the varibles, so it cannot calculated now); the odds ratio of age is 1.12 (>1), with 95% CI being 0.0007 and 0.6274, it means that age 15 is more likely to become alcohol consumption compared to other age, the odds ratio of failures is 1.46(>1), with 95% CI being 1.0796 and 1.9778, it means that the participates who are failures are more likely to become alcohol consumption, the odds ratio of absences is 1.07(>1), with 95% CI being 1.0383 and 1.1138, it means that the participates who absences are more likely to be alcohol consumption than those who not absences, the odds ratio of sex is 2.69(>1), it means that male are more likely to be alcohol consumption. These also partly confirmed our hypothesis.
```{r}
malc <- glm(high_use ~ age + failures + absences + sex, data = alc, family = "binomial")
summary(malc)
OR <- coef (malc) %>% exp
CI <- confint(malc) %>% exp
cbind (OR, CI)
```
## Using failures, absences and sex (have statistically significant) with high/low alcohol consumption to explore the predictive power of model and crosstab
comments: The results showed that the model can well predict the target variables, the accuracy is 74.35%. The probabilities of the first ten participants are 0.18, 0.16, 0.50, 0.14, 0.16, 0.44, 0.28, 0.18, 0.28, 0.28. And there is 12 + 86 = 98 individuals, about 25.65%, individuals are inaccurately calssified. A strategy that reveals, on average, a 30% chance of getting the guess right, and for this model, 74.35% chance can get the explanary variables for target variables right, so this model is more better than simple guessing strategy.
```{r}
malc <- glm(high_use ~ failures + absences + sex, data = alc, family = "binomial")
summary(malc)
probabilities <- predict (malc, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability > 0.5)
select (alc, failures, absences, sex, high_use, probability, prediction) %>% tail(382)
table(high_use = alc$high_use, prediction = alc$prediction)
library(dplyr)
library(ggplot2)
galc <- ggplot (alc, aes(x = age, y = high_use, col = prediction))
galc + geom_point()
galc <- ggplot (alc, aes(x = failures, y = high_use, col = prediction))
galc + geom_point()
galc <- ggplot (alc, aes(x = absences, y = high_use, col = prediction))
galc + geom_point()
galc <- ggplot (alc, aes(x = sex, y = high_use, col = prediction))
galc + geom_point()
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins
```
# 10-fold cross-validation
Comments: the prediction error of my model is not higher but equal to the model introduced in DataCamp. Since, I choose age, failures, absences and sex as selected variables, but the age do not have statistically relationship with high/low alcohol use, so the error is 0.2670157, so the variables used in my model is bigger to Datacamp.
```{r}
library(boot)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
loss_func(class = alc$high_use, prob = alc$probability)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = malc, K = 10)
cv$delta[1]
```