forked from dataquestio/solutions
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Mission327Solutions.Rmd
137 lines (104 loc) · 4.02 KB
/
Mission327Solutions.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
---
title: "Solutions for Guided Project: Exploring NYC Schools Survey Data"
author: "Rose Martin"
data: "January 22, 2019"
output: html_document
---
**Here are suggested solutions to the questions in the Data Cleaning With R Guided Project: Exploring NYC Schools Survey Data.**
Load the packages you'll need for your analysis
```{r}
library(readr)
library(dplyr)
library(stringr)
library(purrr)
library(tidyr)
library(ggplot2)
```
Import the data into R.
```{r}
combined <- read_csv("combined.csv")
survey <- read_tsv("survey_all.txt")
survey_d75 <- read_tsv("survey_d75.txt")
```
Filter `survey` data to include only high schools and select columns needed for analysis based on the data dictionary.
```{r}
survey_select <- survey %>%
filter(schooltype == "High School") %>%
select(dbn:aca_tot_11)
```
Select columns needed for analysis from `survey_d75`.
```{r}
survey_d75_select <- survey_d75 %>%
select(dbn:aca_tot_11)
```
Combine `survey` and `survey_d75` data frames.
```{r}
survey_total <- survey_select %>%
bind_rows(survey_d75_select)
```
Rename `survey_total` variable `dbn` to `DBN` so can use as key to join with the `combined` data frame.
```{r}
survey_total <- survey_total %>%
rename(DBN = dbn)
```
Join the `combined` and `survey_total` data frames. Use `left_join()` to keep only survey data that correspond to schools for which we have data in `combined`.
```{r}
combined_survey <- combined %>%
left_join(survey_total, by = "DBN")
```
Create a correlation matrix to look for interesting relationships between pairs of variables in `combined_survey` and convert it to a tibble so it's easier to work with using tidyverse tools.
```{r}
cor_mat <- combined_survey %>% ## interesting relationshipsS
select(avg_sat_score, saf_p_11:aca_tot_11) %>%
cor(use = "pairwise.complete.obs")
cor_tib <- cor_mat %>%
as_tibble(rownames = "variable")
```
Look for correlations of other variables with `avg_sat_score` that are greater than 0.25 or less than -0.25 (strong correlations).
```{r}
strong_cors <- cor_tib %>%
select(variable, avg_sat_score) %>%
filter(avg_sat_score > 0.25 | avg_sat_score < -0.25)
```
Make scatter plots of those variables with `avg_sat_score` to examine relationships more closely.
```{r}
create_scatter <- function(x, y) {
ggplot(data = combined_survey) +
aes_string(x = x, y = y) +
geom_point(alpha = 0.3) +
theme(panel.background = element_rect(fill = "white"))
}
x_var <- strong_cors$variable[2:5]
y_var <- "avg_sat_score"
map2(x_var, y_var, create_scatter)
```
Reshape the data so that you can investigate differences in student, parent, and teacher responses to survey questions.
```{r}
# combined_survey_gather <- combined_survey %>%
# gather(key = "survey_question", value = score, saf_p_11:aca_tot_11)
combined_survey_gather <- combined_survey %>%
pivot_longer(cols = saf_p_11:aca_tot_11,
names_to = "survey_question",
values_to = "score")
```
Use `str_sub()` to create new variables, `response_type` and `question`, from the `survey_question` variable.
```{r}
combined_survey_gather <- combined_survey_gather %>%
mutate(response_type = str_sub(survey_question, 4, 6)) %>%
mutate(question = str_sub(survey_question, 1, 3))
```
Replace `response_type` variable values with names "parent", "teacher", "student", "total" using `if_else()` function.
```{r}
combined_survey_gather <- combined_survey_gather %>%
mutate(response_type = ifelse(response_type == "_p_", "parent",
ifelse(response_type == "_t_", "teacher",
ifelse(response_type == "_s_", "student",
ifelse(response_type == "_to", "total", "NA")))))
```
Make a boxplot to see if there appear to be differences in how the three groups of responders (parents, students, and teachers) answered the four questions.
```{r}
combined_survey_gather %>%
filter(response_type != "total") %>%
ggplot(aes(x = question, y = score, fill = response_type)) +
geom_boxplot()
```