---
title: "Solutions for Guided Project: Exploratory Visualization of Forest Fire Data"
author: "Rose Martin"
output: html_document
---
# Exploring Data Through Visualizations: Independent Investigations
Load the packages and data we'll need for the project.
```{r}
library(tidyverse)
forest_fires <- read_csv("forestfires.csv")
```
# The Importance of Forest Fire Data
```{r}
# What columns are in the dataset?
colnames(forest_fires)
```
We know that the columns correspond to the following information:
* **X**: X-axis spatial coordinate within the Montesinho park map: 1 to 9
* **Y**: Y-axis spatial coordinate within the Montesinho park map: 2 to 9
* **month**: Month of the year: 'jan' to 'dec'
* **day**: Day of the week: 'mon' to 'sun'
* **FFMC**: Fine Fuel Moisture Code index from the FWI system: 18.7 to 96.20
* **DMC**: Duff Moisture Code index from the FWI system: 1.1 to 291.3
* **DC**: Drought Code index from the FWI system: 7.9 to 860.6
* **ISI**: Initial Spread Index from the FWI system: 0.0 to 56.10
* **temp**: Temperature in Celsius degrees: 2.2 to 33.30
* **RH**: Relative humidity in percentage: 15.0 to 100
* **wind**: Wind speed in km/h: 0.40 to 9.40
* **rain**: Outside rain in mm/m2: 0.0 to 6.4
* **area**: The burned area of the forest (in ha): 0.00 to 1090.84
A single row corresponds to the location of a fire and some characteristics of the fire itself. Higher water presence is typically associated with less fire spread, so we might expect the water-related variables (`DMC` and `rain`) to be related to `area`.
# Data Processing
`month` and `day` are character variables, but we know that there is an inherent order to them. We'll convert these variables into factors so that they'll be sorted into the correct order when we plot them.
```{r}
forest_fires %>% pull(month) %>% unique()
```
```{r}
forest_fires %>% pull(day) %>% unique()
```
This guided project will assume that Sunday is the first day of the week, but feel free to adjust the levels to whatever order is comfortable for you. Ultimately, the levels just help us arrange the resulting plots in an order that makes sense to us.
```{r}
month_order <- c("jan", "feb", "mar",
                 "apr", "may", "jun",
                 "jul", "aug", "sep",
                 "oct", "nov", "dec")
dow_order <- c("sun", "mon", "tue", "wed", "thu", "fri", "sat")
forest_fires <- forest_fires %>%
  mutate(
    month = factor(month, levels = month_order),
    day = factor(day, levels = dow_order)
  )
```
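To see why the factor levels matter, here is a minimal sketch using toy values (the `toy_*` names and values are made up for illustration and are not taken from the dataset). It compares how months sort as plain characters versus as a factor with explicit levels, and shows a Monday-first alternative for the day of the week.

```{r}
# Toy values for illustration only; these are not rows from forestfires.csv.
toy_months <- c("sep", "aug", "mar", "aug")

# As plain character data, the months sort alphabetically.
sort(unique(toy_months))

# With explicit levels, they sort in calendar order instead.
toy_month_levels <- c("jan", "feb", "mar", "apr", "may", "jun",
                      "jul", "aug", "sep", "oct", "nov", "dec")
toy_months_fct <- factor(toy_months, levels = toy_month_levels)
sort(unique(toy_months_fct))

# A Monday-first alternative for the day of the week.
toy_days <- factor(c("sun", "mon", "sat"),
                   levels = c("mon", "tue", "wed", "thu", "fri", "sat", "sun"))
levels(toy_days)
```

ggplot2 uses the factor levels to order a discrete axis, which is why the conversion above affects the plots later on.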
# When Do Most Forest Fires Occur?
We need to create a summary tibble that counts the number of fires that occur in each month. Then, we'll be able to use this tibble in a visualization. We can treat `month` and `day` as two different grouping variables, so our code to produce the tibbles and plots will look similar.
## Month Level
```{r}
fires_by_month <- forest_fires %>%
  group_by(month) %>%
  summarize(total_fires = n())
fires_by_month %>%
  ggplot(aes(x = month, y = total_fires)) +
  geom_col() +
  labs(
    title = "Number of forest fires in data by month",
    y = "Fire count",
    x = "Month"
  )
```
## Day of Week Level
```{r}
fires_by_dow <- forest_fires %>%
  group_by(day) %>%
  summarize(total_fires = n())
fires_by_dow %>%
  ggplot(aes(x = day, y = total_fires)) +
  geom_col() +
  labs(
    title = "Number of forest fires in data by day of the week",
    y = "Fire count",
    x = "Day of the week"
  )
```
We see a massive spike in fires in August and September, as well as a smaller spike in March. Fires seem to be more frequent on the weekend.
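The counts behind these bar charts come from a group-and-count step. As a small sketch on a toy tibble (invented values, hypothetical `toy_*` names), the `group_by()` plus `summarize(n())` pattern can also be written with dplyr's `count()`, which should give the same totals.

```{r}
library(dplyr)

# Toy data for illustration only; not rows from forestfires.csv.
toy_fires <- tibble(month = c("aug", "aug", "sep", "mar", "aug"))

# The group_by() + summarize() pattern used in the solution.
toy_counts_summarize <- toy_fires %>%
  group_by(month) %>%
  summarize(total_fires = n())

# count() is a shortcut for the same grouping and counting.
toy_counts_count <- toy_fires %>%
  count(month, name = "total_fires")

# Check that both approaches agree on the totals.
identical(toy_counts_summarize$total_fires, toy_counts_count$total_fires)
```

Either form works for the plots above; the explicit `summarize()` version is easier to extend if we later want more than one summary column.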
# Plotting Other Variables Against Time
```{r}
forest_fires_long <- forest_fires %>%
  pivot_longer(
    cols = c("FFMC", "DMC", "DC",
             "ISI", "temp", "RH",
             "wind", "rain"),
    names_to = "data_col",
    values_to = "value"
  )
forest_fires_long %>%
  ggplot(aes(x = month, y = value)) +
  geom_boxplot() +
  facet_wrap(vars(data_col), scales = "free_y") +
  labs(
    title = "Variable changes over month",
    x = "Month",
    y = "Variable value"
  )
```
# Examining Forest Fire Severity
We are trying to see how each of the variables in the dataset relates to `area`. We can reuse the long-format version of the data we created earlier, since it works directly with `facet_wrap()`.
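To make the long format concrete, here is a minimal sketch on a toy tibble (values are invented and the `toy_*` names are hypothetical). Each pivoted column turns into a `data_col`/`value` pair, while `area` is repeated on every resulting row, which is what lets us plot `value` against `area` in one facet per variable.

```{r}
library(dplyr)
library(tidyr)

# Toy data for illustration only; values are invented.
toy_wide <- tibble(
  area = c(0.5, 12.0),
  temp = c(18.0, 25.1),
  wind = c(4.0, 2.2)
)

# Each non-area column becomes a (data_col, value) pair,
# and area is repeated on every resulting row.
toy_long <- toy_wide %>%
  pivot_longer(
    cols = c("temp", "wind"),
    names_to = "data_col",
    values_to = "value"
  )

toy_long
```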
```{r}
forest_fires_long %>%
  ggplot(aes(x = value, y = area)) +
  geom_point() +
  facet_wrap(vars(data_col), scales = "free_x") +
  labs(
    title = "Relationships between other variables and area burned",
    x = "Value of column",
    y = "Area burned (hectare)"
  )
```
# Outlier Problems
It seems that there are two rows where `area` is so large that they distort the scale of the visualization. Let's make a similar visualization that excludes these observations so that we can better see how each variable relates to `area`.
```{r}
forest_fires_long %>%
  filter(area < 300) %>%
  ggplot(aes(x = value, y = area)) +
  geom_point() +
  facet_wrap(vars(data_col), scales = "free_x") +
  labs(
    title = "Relationships between other variables and area burned (area < 300)",
    x = "Value of column",
    y = "Area burned (hectare)"
  )
```
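When applying a cutoff like this, it can help to check how many rows it drops. The sketch below uses a toy tibble with invented values (hypothetical `toy_*` names) to count the rows on each side of the `area < 300` threshold.

```{r}
library(dplyr)

# Toy data for illustration only; values are invented.
toy_area <- tibble(area = c(0, 3.2, 45.0, 310.5, 980.0))

# How many rows would the area < 300 filter drop?
toy_dropped <- toy_area %>%
  filter(area >= 300) %>%
  nrow()

# How many rows remain for plotting?
toy_kept <- toy_area %>%
  filter(area < 300) %>%
  nrow()

c(dropped = toy_dropped, kept = toy_kept)
```

If we run the same check on the real data, the wide `forest_fires` tibble is the better choice: in `forest_fires_long` each fire appears once per pivoted variable, so the row counts there are multiplied by eight.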