-
Notifications
You must be signed in to change notification settings - Fork 10
/
Copy path02b-x01-weather.Rmd
181 lines (133 loc) · 6.05 KB
/
02b-x01-weather.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
---
knit: bookdown::preview_chapter
---
### Examples
#### Weather in Ames
Problem
What does today's temperature tell us about tomorrow's?
This question is inspired by Rob Hyndman's answer to it for temperatures in Melbourne as detailed in [@hyndman1996] and [@hyndman1996b].
Data source
The website Weather Underground [https://www.wunderground.com/](https://www.wunderground.com/)
provides current and historic weather information for weather stations all across the globe.
For historic data, each day is summarised by a set of statistics, such as temperature, humidity, pressure, and wind speeds and compared to averages.
The process
We want to scrape the daily information from Weather Underground for a particular station and some time frame.
First, we write a scraper:
```{r KAMWscraper}
# functions
getWeather <- function(station = "KAMW", year = 2015, month = 1, day = 1) {
numbers <- "[-]*[0-9\\.]*[e0-9]*"
units <- "[°a-zA-Z ]+"
require(stringr)
require(rvest)
url <-
sprintf(
"https://www.wunderground.com/history/airport/%s/%s/%s/%s/DailyHistory.html", station, year, month, day
)
html <- read_html(url)
tabs <- html %>% html_nodes("table")
day.stats <- html_table(tabs[[1]], fill = TRUE)
if (is.null(day.stats$Actual)) return(NULL) # no data available
day.stats$Unit <- str_extract(day.stats$Actual, pattern = units)
day.stats$Actual <- as.numeric(str_extract(day.stats$Actual, pattern = numbers))
day.stats$Average <- as.numeric(str_extract(day.stats$Average, pattern = numbers))
day.stats$RecordYear <-
str_extract(day.stats$Record, pattern = "\\([0-9]{4}\\)")
day.stats$RecordYear <- gsub("[\\(\\)]","", day.stats$RecordYear)
day.stats$Record <-
as.numeric(str_extract(day.stats$Record, pattern = numbers))
day.stats$Station <- station
day.stats$Date <- sprintf("%s/%s/%s",year,month,day)
names(day.stats)[1] <- "Statistics"
day.stats
}
```
And now we scrape:
```{r scrapeKAMW, eval=FALSE}
######################
library(lubridate)
# download data for
last <- as.Date("2016-03-01")
first <- as.Date("2008-02-21")
days <- seq(first, last, "day")
filename <- "data/KAMW.csv"
station <- "KAMW"
for (dateIDX in length(days):1) {
# browser()
date <- days[dateIDX]
res <- getWeather(station, year(date), month(date), mday(date))
write.table(res, file=filename, append=file.exists(filename),
col.names=!file.exists(filename),
row.names=F, sep=",")
}
```
And now it's time for the analysis. Create a vector of the previous days temperatures and plot in a scatterplot. Alternatively, a two-dimensional histogram (using hex binning) also works well in this example, because the focus here is on the high density areas:
```{r analyseKAMW, warning=FALSE}
weather <- read.csv("data/KAMW.csv", stringsAsFactors = FALSE)
weather$Actual <- as.numeric(weather$Actual)
weather$Average <- as.numeric(weather$Average)
weather$Record <- as.numeric(weather$Record)
temps <- subset(weather, Statistics=="Max Temperature")
temps$Previous <- c(temps$Actual[-1], NA)
library(ggplot2)
ggplot(data = temps, aes(x = Previous, y = Actual)) +
geom_jitter(alpha = I(0.5)) +
geom_smooth() +
xlab("Yesterday's maximum temperature") +
ylab("Today's maximum temperature") +
coord_equal()
ggplot(data = temps, aes(x = Previous, y = Actual)) +
geom_hex() +
xlab("Yesterday's maximum temperature") +
ylab("Today's maximum temperature") +
coord_equal()
```
What we see is a strong linear pattern, i.e. tomorrow's maximum temperature is close to what we see today, making it a very strong predictor. There is a peak in density at temperatures around 30 degree Celsius (shown in the binned scatterplot by the light blue tiles). With higher temperatures we see less variability, while at lower temperatures this relationship is more variable.
#### Your turn
+ The pattern discussed involves maximum temperatures. How would you expect this pattern to change for averages? Explain your expectation first, then check.
+ Reformat the weather data set in such a way that you can easily compare Maximum and minimum temperatures in a scatterplot.
+ How do temperatures behave over the course of a year? Organize the data correspondingly and plot.
+ How does the temperature pattern compare to another place in the world, say e.g. Melbourne, Australia?
***Solution for last question***
Scrape the data
```{r KAMWsolution, eval=FALSE}
last <- as.Date("2016-03-01")
first <- as.Date("1995-01-01")
last <- days[dateIDX]
first <- as.Date("1995-01-01")
days <- seq(first, last, "day")
filename <- "data/YMML.csv"
station <- "YMML"
for (dateIDX in length(days):1) {
# browser()
date <- days[dateIDX]
res <- getWeather(station, year(date), month(date), mday(date))
write.table(res, file=filename, append=file.exists(filename),
col.names=!file.exists(filename),
row.names=F, sep=",")
}
```
and plot the data:
```{r plotKAMW, warning=FALSE, eval=FALSE}
#weather <- read.csv("data/YMML.csv", stringsAsFactors = FALSE)
weather <- read_csv("data/YMML.csv")
temps <- subset(weather, Statistics=="Max Temperature")
temps$Previous <- c(temps$Actual[-1], NA)
ggplot(data=temps) +
geom_hex(aes(x=Previous, y=Actual)) + #, fill=log(..value..))) +
xlab("Yesterday's maximum temperature") +
ylab("Today's maximum temperature") +
coord_equal()
```
In the paper by [@hyndman1996b] conditional densities are used: given today's temperature what is the density distribution for tomorrow's temperatures?
We can use high-density region (HDR) boxplots to show the results. The location of the dot indicates the location of the density's highest mode.
```{r HDR, fig.width=6, fig.height=6}
# bin yesterday's temperatures and compute densities of today's temperatures for each level
library(lubridate)
#temps$Month <- month(temps$Date)
#subtemps <- subset(temps, Month %in% c(12,1,2))
subtemps <- na.omit(temps[,c("Previous", "Actual")])
library(hdrcde)
temps.cde <- with(subtemps, cde(Previous, Actual, nxmargin=40))
plot(temps.cde,xlab="Yesterday's maximum temperature",ylab="Today's maximum temperature",plot.fn="hdr", prob=c(50, 95))
```