# GTFS data manipulation and visualization
GTFS data is frequently used in analyses that share a few common elements. The AOP team has developed the [`{gtfstools}`](https://github.com/ipeaGIT/gtfstools) R package, which provides several functions that help tackle repetitive tasks and operations and facilitate feed manipulation and exploration.
In this chapter, we'll go through some of the most frequently used package features. To do this, we will use a sample of the SPTrans feed presented in the previous chapter, which is included in the package installation.
## Reading and manipulating GTFS files
Reading GTFS files with `{gtfstools}` is done with the `read_gtfs()` function, which receives a string with the file path. The package represents a feed as a list of `data.table`s, a high-performance version of `data.frame`s. Throughout this chapter, we will refer to this list of tables as a *GTFS object*. By default, the function reads all `.txt` tables in the feed:
```{r}
# loads the package
library(gtfstools)
# points to path of the sample gtfs data installed in {gtfstools}
path <- system.file("extdata/spo_gtfs.zip", package = "gtfstools")
# reads the gtfs
gtfs <- read_gtfs(path)
# checks the tables inside the gtfs object
names(gtfs)
```
We can see that each `data.table` within the GTFS object is named according to the table it represents, without the `.txt` extension. This configuration allows us to select and manipulate each table individually. The code below, for example, lists the first 6 rows of the `trips` table:
```{r}
head(gtfs$trips)
```
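When only a few tables are needed, reading the whole feed can be avoided. The sketch below relies on the `files` argument of `read_gtfs()`, which takes table names without the `.txt` extension. The object name `partial_gtfs` is ours and is used for illustration only:

```{r}
library(gtfstools)

# path to the sample feed shipped with {gtfstools}
sample_path <- system.file("extdata/spo_gtfs.zip", package = "gtfstools")

# reads only the trips and stop_times tables
partial_gtfs <- read_gtfs(sample_path, files = c("trips", "stop_times"))

# checks which tables were read
names(partial_gtfs)
```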
Tables within a GTFS object can be easily manipulated using the `{dplyr}` or `{data.table}` packages, for example. In this book, we opted to use the `{data.table}` syntax. This package offers several useful features, primarily for manipulating tables with a large number of records, such as updating columns by reference, very fast row subsets and efficient data aggregation[^datatable_info]. For example, we can use the code below to add 100 seconds to all the headways listed in the `frequencies` table and later reverse this change:
[^datatable_info]: For more details on `{data.table}` usage and syntax, please check <https://rdatatable.gitlab.io/data.table/index.html>.
```{r}
# saves original headways
original_headway <- gtfs$frequencies$headway_secs
head(gtfs$frequencies, 3)
# updates the headways
gtfs$frequencies[, headway_secs := headway_secs + 100]
head(gtfs$frequencies, 3)
# restores the original headway
gtfs$frequencies[, headway_secs := original_headway]
head(gtfs$frequencies, 3)
```
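Updating by reference means that `:=` changes the table in place, without creating a copy. The toy example below (a made-up two-row table standing in for `frequencies`, not data from the feed) illustrates the consequence: a second name bound to the same table sees the change, whereas an explicit `copy()` does not.

```{r}
library(data.table)

# toy table with hypothetical values
freq <- data.table(trip_id = c("a", "b"), headway_secs = c(600, 900))

alias <- freq          # another name for the same table, no copy is made
snapshot <- copy(freq) # independent copy

# updates the headways by reference
freq[, headway_secs := headway_secs + 100]

alias$headway_secs    # also updated
snapshot$headway_secs # unchanged
```

If the original values need to be preserved, taking a `copy()` of the table before modifying it is a common precaution.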
After editing a GTFS object in R, we often want to use the processed GTFS to perform different analyses. In order to do this, we frequently need the GTFS file in `.zip` format again, and not as a list of tables in an R session. To transform GTFS objects that exist in an R session into GTFS files saved to disk, `{gtfstools}` includes the `write_gtfs()` function. To use this function, we only need to pass the object that should be written to disk and the file path where it should be written to:
```{r}
# points to the path where the GTFS should be written to
export_path <- tempfile("new_gtfs", fileext = ".zip")
# writes the GTFS to the path
write_gtfs(gtfs, path = export_path)
# lists files within the feed
zip::zip_list(export_path)[, c("filename", "compressed_size", "timestamp")]
```
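A quick way to check the written file is to read it back and compare it with the object it came from. The snippet below is a minimal sketch of such a round trip; the names `sample_gtfs`, `roundtrip_path` and `roundtrip_gtfs` are ours and are used for illustration only:

```{r}
library(gtfstools)

# reads the sample feed and writes it to a temporary .zip file
sample_path <- system.file("extdata/spo_gtfs.zip", package = "gtfstools")
sample_gtfs <- read_gtfs(sample_path)
roundtrip_path <- tempfile("roundtrip_gtfs", fileext = ".zip")
write_gtfs(sample_gtfs, path = roundtrip_path)

# reads the written file back and compares the tables in both objects
roundtrip_gtfs <- read_gtfs(roundtrip_path)
setequal(names(sample_gtfs), names(roundtrip_gtfs))
```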
## Calculating trip speed
GTFS files are often used in public transport routing applications and to inform the timetable of different routes in a given region to potential passengers. Feeds must, therefore, accurately describe the schedule and the operational speed of public transport trips.
To calculate the average speed of the trips described in a feed, `{gtfstools}` includes the `get_trip_speed()` function. By default, the function returns the speed (in km/h) of all trips included in the feed, but one can choose to calculate the speed of selected trips with the `trip_id` parameter:
```{r}
# calculates the speeds of all trips
speeds <- get_trip_speed(gtfs)
head(speeds)
nrow(speeds)
# calculates the speeds of two specific trips
speeds <- get_trip_speed(gtfs, trip_id = c("CPTM L07-0", "2002-10-0"))
speeds
```
To calculate the speed of a trip, we need to know its length and how long it takes to travel from its first to its last stop. Behind the scenes, `get_trip_speed()` uses two other functions from the `{gtfstools}` toolset: `get_trip_length()` and `get_trip_duration()`. The usage of both is very similar to what has been shown before, returning the length/duration of all trips by default or of a few selected trips if desired. Below, we show their default behavior:
```{r}
# calculates the length of all trips
lengths <- get_trip_length(gtfs, file = "shapes")
head(lengths)
# calculates the duration of all trips
durations <- get_trip_duration(gtfs)
head(durations)
```
Just as `get_trip_speed()` returns speeds in km/h by default, `get_trip_length()` returns lengths in km and `get_trip_duration()` returns the duration in minutes. These units can be adjusted with the `unit` parameter, present in all three functions.
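The relationship between the three functions can be checked by hand. The sketch below uses the `unit` parameter to get the length of a trip in km and its duration in hours, divides one by the other and places the result next to the output of `get_trip_speed()`. We assume that the outputs hold the values in columns named `length`, `duration` and `speed`; the two figures are expected to agree, up to floating point differences:

```{r}
library(gtfstools)

sample_path <- system.file("extdata/spo_gtfs.zip", package = "gtfstools")
sample_gtfs <- read_gtfs(sample_path)
trip <- "CPTM L07-0"

# length in km and duration in hours of a single trip
trip_length <- get_trip_length(
  sample_gtfs,
  trip_id = trip,
  file = "shapes",
  unit = "km"
)
trip_duration <- get_trip_duration(sample_gtfs, trip_id = trip, unit = "h")

# speed computed by hand, in km/h
manual_speed <- trip_length$length / trip_duration$duration

# speed as reported by the package, in km/h
reported_speed <- get_trip_speed(sample_gtfs, trip_id = trip, unit = "km/h")

c(manual = manual_speed, reported = reported_speed$speed)
```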
## Combining and filtering feeds
The tasks of processing and manipulating GTFS files are often performed manually, which may increase the chances of leaving minor inconsistencies or errors in the data. A common issue in some GTFS feeds is the presence of duplicate records in the same table. SPTrans' feed, for example, contains duplicate records both in `agency.txt` and in `calendar.txt`:
```{r}
gtfs$agency
gtfs$calendar
```
`{gtfstools}` includes the `remove_duplicates()` function to keep only unique entries in all tables of the feed. This function takes a GTFS object as input and returns the same object without duplicates:
```{r}
no_dups_gtfs <- remove_duplicates(gtfs)
no_dups_gtfs$agency
no_dups_gtfs$calendar
```
We often have to deal with multiple feeds describing the same study area, such as when the bus and the rail systems of a single city are described in separate GTFS files. In such cases, we may want to merge both files into a single feed to reduce the data processing effort. To help us with that, `{gtfstools}` includes the `merge_gtfs()` function. The example below shows the output of merging SPTrans' feed (without duplicate entries) with EPTC's feed:
```{r}
# reads Porto Alegre's GTFS
poa_path <- system.file("extdata/poa_gtfs.zip", package = "gtfstools")
poa_gtfs <- read_gtfs(poa_path)
poa_gtfs$agency
no_dups_gtfs$agency
# combines Porto Alegre's and São Paulo's GTFS objects
combined_gtfs <- merge_gtfs(no_dups_gtfs, poa_gtfs)
# check results
combined_gtfs$agency
```
We can see that the tables of both feeds are combined into a single one. This is the case when two (or more) GTFS objects contain the same table (`agency`, in the example). When a particular table is present in only one of the feeds, the function copies this table to the output. That's the case of the `frequencies` table, in our example, which exists only in SPTrans' feed:
```{r}
names(poa_gtfs)
names(no_dups_gtfs)
names(combined_gtfs)
identical(no_dups_gtfs$frequencies, combined_gtfs$frequencies)
```
Filtering feeds to keep only a few entries within each table is another operation that frequently comes up when dealing with GTFS data. Feeds are often used to describe large-scale public transport networks, which may result in complex and slow data manipulation, analysis and sharing. Thus, planners and researchers often work with subsets of feeds. If we want to measure the performance of a transport network during the morning peak, for example, we can filter our GTFS data to keep only the observations related to trips that run within this period.
`{gtfstools}` includes several functions to filter GTFS data. They are:
- `filter_by_agency_id()`;
- `filter_by_route_id()`;
- `filter_by_service_id()`;
- `filter_by_shape_id()`;
- `filter_by_stop_id()`;
- `filter_by_trip_id()`;
- `filter_by_route_type()`;
- `filter_by_weekday()`;
- `filter_by_time_of_day()`; and
- `filter_by_sf()`.
### Filtering by identifiers
The first seven functions from the above list work very similarly. They take as input a vector of identifiers and return a GTFS object whose table entries are related to the specified ids. The example below demonstrates this functionality with `filter_by_trip_id()`:
```{r}
# checks pre-filter object size
utils::object.size(gtfs)
head(gtfs$trips[, .(trip_id, trip_headsign, shape_id)])
# keeps entries related to the two specified ids
filtered_gtfs <- filter_by_trip_id(
  gtfs,
  trip_id = c("CPTM L07-0", "CPTM L07-1")
)
# checks post-filter object size
utils::object.size(filtered_gtfs)
head(filtered_gtfs$trips[, .(trip_id, trip_headsign, shape_id)])
unique(filtered_gtfs$shapes$shape_id)
```
We can see from the code snippet above that the function not only filters `trips`, but all other tables containing a column that relates to `trip_id` in any way. The shapes of trips `CPTM L07-0` and `CPTM L07-1`, for example, are respectively described by `shape_id`s `17846` and `17847`. Therefore, these are the only shape identifiers kept in the filtered GTFS.
The function also supports the opposite behavior: instead of keeping the entries related to the specified identifiers, we can drop them. To do this, we need to set the `keep` argument to `FALSE`:
```{r}
# removes entries related to two trips from the feed
filtered_gtfs <- filter_by_trip_id(
  gtfs,
  trip_id = c("CPTM L07-0", "CPTM L07-1"),
  keep = FALSE
)
head(filtered_gtfs$trips[, .(trip_id, trip_headsign, shape_id)])
head(unique(filtered_gtfs$shapes$shape_id))
```
We can see that the specified trips, as well as their shapes, are not present in the filtered GTFS anymore. The same logic, demonstrated here with `filter_by_trip_id()`, applies to the functions that filter GTFS objects by `agency_id`, `route_id`, `service_id`, `shape_id`, `stop_id` and `route_type`.
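As an illustration of that shared logic, the sketch below filters the sample feed with `filter_by_route_id()`. To avoid hard-coding an identifier, it simply picks the first route listed in the `routes` table; the names `sample_gtfs`, `some_route` and `route_gtfs` are ours:

```{r}
library(gtfstools)

sample_path <- system.file("extdata/spo_gtfs.zip", package = "gtfstools")
sample_gtfs <- read_gtfs(sample_path)

# picks the first route listed in the feed, purely for illustration
some_route <- sample_gtfs$routes$route_id[1]

# keeps only the entries related to that route
route_gtfs <- filter_by_route_id(sample_gtfs, route_id = some_route)

# the trips table now refers to a single route
unique(route_gtfs$trips$route_id)
nrow(route_gtfs$trips)
```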
### Filtering by day of the week and time of the day
Another common operation when dealing with GTFS data is subsetting feeds to keep only the services that operate during certain times of the day or days of the week. To do this, the package includes the `filter_by_weekday()` and `filter_by_time_of_day()` functions.
`filter_by_weekday()` takes as input the days of the week whose services should be kept (or dropped). The function also includes a `combine` parameter, which defines how multi-day filters should work. When this argument receives the value `"and"`, only services that operate on every single specified day are kept. When it receives the value `"or"`, services that operate on at least one of the days are kept:
```{r}
# keeps services that operate on both saturday AND sunday
filtered_gtfs <- filter_by_weekday(
  no_dups_gtfs,
  weekday = c("saturday", "sunday"),
  combine = "and"
)
filtered_gtfs$calendar[, c("service_id", "sunday", "saturday")]
# keeps services that operate EITHER on saturday OR on sunday
filtered_gtfs <- filter_by_weekday(
  no_dups_gtfs,
  weekday = c("sunday", "saturday"),
  combine = "or"
)
filtered_gtfs$calendar[, c("service_id", "sunday", "saturday")]
```
`filter_by_time_of_day()`, on the other hand, takes the beginning and the end of a time window and keeps (or drops) the entries related to the trips that run within this window. The behavior of this function depends on whether a `frequencies` table is included in the feed or not: the `stop_times` timetable of trips listed in `frequencies` must not be filtered, because, as [previously mentioned](4_dados_gtfs.en.qmd#frequencies.txt), it works as a reference that describes the time between consecutive stops, and the departure and arrival times listed there should not be taken literally. If a trip is not listed in `frequencies`, however, its `stop_times` entries are filtered according to the specified time window. Let's see how the function works with some examples:
```{r}
# keeps trips that run within the 5am to 6am window
filtered_gtfs <- filter_by_time_of_day(gtfs, from = "05:00:00", to = "06:00:00")
head(filtered_gtfs$frequencies)
head(filtered_gtfs$stop_times[, c("trip_id", "departure_time", "arrival_time")])
# save the frequencies table and remove it from the original gtfs
frequencies <- gtfs$frequencies
gtfs$frequencies <- NULL
filtered_gtfs <- filter_by_time_of_day(gtfs, from = "05:00:00", to = "06:00:00")
head(filtered_gtfs$stop_times[, c("trip_id", "departure_time", "arrival_time")])
```
Filtering the `stop_times` table can work in two different ways. One is to keep trips that *cross* the specified time window intact. The other is to keep only the timetable entries that take place *inside* this window (default behavior). This behavior is controlled by the `full_trips` parameter, as shown below (please pay attention to the times and stops present in each example):
```{r}
# keeps any trips that cross the 5am to 6am window intact
filtered_gtfs <- filter_by_time_of_day(
  gtfs,
  from = "05:00:00",
  to = "06:00:00",
  full_trips = TRUE
)
head(
  filtered_gtfs$stop_times[
    ,
    c("trip_id", "departure_time", "arrival_time", "stop_sequence")
  ]
)
# keeps only the timetable entries that happen inside the 5am to 6am window
filtered_gtfs <- filter_by_time_of_day(
  gtfs,
  from = "05:00:00",
  to = "06:00:00",
  full_trips = FALSE
)
head(
  filtered_gtfs$stop_times[
    ,
    c("trip_id", "departure_time", "arrival_time", "stop_sequence")
  ]
)
```
### Filtering using a spatial extent
Finally, `{gtfstools}` also includes a function that allows one to filter a GTFS object using a spatial polygon. `filter_by_sf()` takes an `sf`/`sfc` object (spatial representation created by the [`{sf}`](https://r-spatial.github.io/sf/) package), or its bounding box, and keeps the entries related to trips selected by their position in relation to that spatial polygon. Although this might seem complicated, this filtering process is fairly easy to grasp once we illustrate it with an example. To demonstrate this function, we are going to filter SPTrans' feed using the bounding box of shape `68962`. With the code snippet below we show the spatial distribution of unfiltered data along with the bounding box in red:
```{r}
#| label: fig-shapes_distribution
#| fig-cap: Spatial distribution of shapes overlaid by the bounding box of shape `68962`
library(ggplot2)
# creates a polygon with the bounding box of shape 68962
shape_68962 <- convert_shapes_to_sf(gtfs, shape_id = "68962")
bbox <- sf::st_bbox(shape_68962)
bbox_geometry <- sf::st_as_sfc(bbox)
# creates a geometry with all the shapes described in the gtfs
all_shapes <- convert_shapes_to_sf(gtfs)
ggplot() +
  geom_sf(data = all_shapes) +
  geom_sf(data = bbox_geometry, fill = NA, color = "red") +
  theme_minimal()
```
Please note that we have used the `convert_shapes_to_sf()` function, also included in `{gtfstools}`, to convert the shapes described in the feed into an `sf` spatial object. By default, `filter_by_sf()` keeps all entries related to trips that intersect with the specified polygon:
```{r}
#| label: fig-intersect_distribution
#| fig-cap: Spatial distribution of shapes that intersect with the bounding box of shape `68962`
filtered_gtfs <- filter_by_sf(gtfs, bbox)
filtered_shapes <- convert_shapes_to_sf(filtered_gtfs)
ggplot() +
  geom_sf(data = filtered_shapes) +
  geom_sf(data = bbox_geometry, fill = NA, color = "red") +
  theme_minimal()
```
We can, however, specify different spatial operations to filter the feed. The code below shows how we can keep the entries related to trips that are *contained by* the specified polygon:
```{r}
#| label: fig-contained_distribution
#| fig-cap: Spatial distribution of shapes contained by the bounding box of shape `68962`
filtered_gtfs <- filter_by_sf(gtfs, bbox, spatial_operation = sf::st_contains)
filtered_shapes <- convert_shapes_to_sf(filtered_gtfs)
ggplot() +
  geom_sf(data = filtered_shapes) +
  geom_sf(data = bbox_geometry, fill = NA, color = "red") +
  theme_minimal()
```
## Validating GTFS data
Transport planners and researchers often want to assess the quality of the GTFS data they are producing or using in their analyses. Are feeds structured following the [best practices](https://github.com/MobilityData/GTFS_Schedule_Best-Practices) adopted by the larger GTFS community? Are tables and columns adequately formatted? Is the information described by the feed reasonable (trip speeds, stop locations, etc.)? These are some of the questions that may arise when dealing with GTFS data.
To answer these and other questions, `{gtfstools}` includes the `validate_gtfs()` function. This function works as a wrapper around MobilityData's [Canonical GTFS Validator](https://github.com/MobilityData/gtfs-validator), which requires Java version 11 or higher to run[^java11_chap3].
[^java11_chap3]: For more information on how to check the installed version of Java in your computer and on how to install the required version, please check [Chapter 3](3_calculando_acesso.en.qmd#installing-r5r).
Using `validate_gtfs()` is very simple. First, we need to download the validator. To do this, we use the `download_validator()` function, included in the package, which receives the path to the directory where the validator should be saved to and the version of the validator that should be downloaded (defaults to the latest available). The function returns the path to the downloaded validator:
```{r}
tmpdir <- tempdir()
validator_path <- download_validator(tmpdir)
validator_path
```
The second (and final) step consists of actually validating the GTFS data with `validate_gtfs()`. This function supports GTFS data in different formats: i) as a GTFS object in an R session; ii) as a path to a local GTFS file in `.zip` format; iii) as a URL pointing to a feed; or iv) as a directory containing unzipped GTFS tables. The function also takes a path to a directory where the validation result should be saved to and the path to the validator that should be used in the process. In the example below, we validate SPTrans' feed from its path:
```{r}
output_dir <- tempfile("gtfs_validation")
validate_gtfs(
  path,
  output_path = output_dir,
  validator_path = validator_path
)
list.files(output_dir)
```
We can see that the validation process generates a few output files:
- `report.html`, shown in @fig-report, which summarizes the validation result in a nicely formatted HTML page (only available with validator version 3.1.0 or higher);
- `report.json`, which summarizes the same information, but in JSON format, which can be used to programmatically parse and process the results;
- `system_errors.json`, which summarizes any system errors that may have happened during the validation process and may compromise the results; and
- `validation_stderr.txt`, which lists informative messages sent by the validator tool, including a list of the tests conducted, any error messages, etc.[^validation_stderr].
[^validation_stderr]: Informative messages may also be listed in the `validation_stdout.txt` file. Whether messages are listed in this file or in `validation_stderr.txt` depends on the validator version.
```{r}
#| echo: false
#| label: fig-report
#| fig-cap: Validation report example
knitr::include_graphics("images/validator_report.png")
```
## `{gtfstools}` workflow example: spatial visualization of headways
We have shown in previous sections that `{gtfstools}` offers a large toolset to process and analyze GTFS files. The package, however, also includes many other functions that could not be shown in this book due to space constraints[^gtfstools_ref].
[^gtfstools_ref]: The complete list of functions available in `{gtfstools}` can be checked at <https://ipeagit.github.io/gtfstools/reference/index.html>.
In this final section of the chapter, we illustrate how to use the package to make more complex analyses. To do this, we present a workflow that combines several `{gtfstools}` functions to answer the following question: how are the times between vehicles operating the same route (the headways) spatially distributed in SPTrans' GTFS?
First, we need to define the scope of our analysis. In this example, we are only going to consider the services operating during the morning peak, between 7am and 9am, on a typical Tuesday. Thus, we need to filter our feed:
```{r}
gtfs <- read_gtfs(path)
# filters the GTFS
filtered_gtfs <- gtfs |>
  remove_duplicates() |>
  filter_by_weekday("tuesday") |>
  filter_by_time_of_day(from = "07:00:00", to = "09:00:00")
# checking the result
filtered_gtfs$frequencies[trip_id == "2105-10-0"]
filtered_gtfs$calendar
```
Next, we need to calculate the headways within this time interval. This information can be found in the `frequencies` table, though there is a detail we have to pay attention to: each trip is associated with more than one headway, as shown above (one entry for the 7am to 7:59am interval and another for the 8am to 8:59am interval). To solve this, we are going to calculate the *average* headway from 7am to 9am.
The first few `frequencies` rows in SPTrans' feed seem to suggest that the headways are always associated with one-hour intervals, but this is neither a rule set in the official specification nor necessarily a practice adopted by other feed producers. Thus, we have to calculate the average headways weighted by the duration of each headway. To do this, we need to multiply each headway by the length of the time interval during which it is valid, sum these products for each trip, and then divide this amount by the total time interval (two hours, in our case). To calculate the time intervals within which the headways are valid, we first use the `convert_time_to_seconds()` function to calculate the start and end time of each interval in seconds and then subtract the former from the latter:
```{r}
filtered_gtfs <- convert_time_to_seconds(filtered_gtfs)
# checks what the results look like for a particular trip id
filtered_gtfs$frequencies[trip_id == "2105-10-0"]
filtered_gtfs$frequencies[, time_interval := end_time_secs - start_time_secs]
```
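To see why the weighting matters, consider a made-up trip (the numbers below are illustrative and not taken from the feed) whose headway is 600 seconds during 90 minutes and 1200 seconds during the remaining 30 minutes of a two-hour window:

```{r}
# hypothetical headways (in seconds) and the duration of their intervals
toy_headways <- c(600, 1200)
toy_intervals <- c(90 * 60, 30 * 60)

# simple mean, which ignores how long each headway is valid
mean(toy_headways)

# weighted mean: sum(headway * interval) / sum(interval)
sum(toy_headways * toy_intervals) / sum(toy_intervals)

# same calculation with base R's weighted.mean()
weighted.mean(x = toy_headways, w = toy_intervals)
```

The simple mean is 900 seconds, whereas the weighted mean is 750 seconds, because the shorter headway is valid for most of the window.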
Then we calculate the average headway:
```{r}
average_headway <- filtered_gtfs$frequencies[,
  .(average_headway = weighted.mean(x = headway_secs, w = time_interval)),
  by = trip_id
]
average_headway[trip_id == "2105-10-0"]
head(average_headway)
```
Now we need to generate each trip geometry and join this data to the average headways. To do this, we will use the `get_trip_geometry()` function, which returns the spatial geometries of the trips in the feed. This function allows us to specify which trips we want to generate the geometries of, so we are only going to apply the procedure to the trips present in the average headways table:
```{r}
selected_trips <- average_headway$trip_id
geometries <- get_trip_geometry(
  filtered_gtfs,
  trip_id = selected_trips,
  file = "shapes"
)
head(geometries)
```
Finally, we need to join the average headway data to the geometries and then configure the map as desired. In the example below, the color and line width of each trip geometry vary with its headway:
```{r}
#| message: false
#| label: fig-headways_spatial_dist
#| fig-cap: Headways spatial distribution in SPTrans' GTFS
geoms_with_headways <- merge(
  geometries,
  average_headway,
  by = "trip_id"
)
ggplot(geoms_with_headways) +
  geom_sf(aes(color = average_headway, size = average_headway), alpha = 0.8) +
  scale_color_gradient(high = "#132B43", low = "#56B1F7") +
  labs(color = "Average headway", size = "Average headway") +
  theme_minimal()
```
As we can see, `{gtfstools}` makes the analysis of GTFS feeds a simple task that requires only basic knowledge of table manipulation packages (such as `{data.table}` or `{dplyr}`). The example shown in this section illustrates how one could use many of the package's functions together to reveal important aspects of public transport systems specified in the GTFS format.