Get v1 data, partial implementation #18

e-kotov · 2024-08-07T13:31:31Z

Hi @Robinlovelace , I have finished some of the work on v1 data and tried to envision how it would be united with current code for v2, as well as how I would suggest the v2 code may be changed and improved upon. Could you please review what I have for v1 (and minor changes to v2) below?

Setup:

remotes::install_github("Robinlovelace/spanishoddata@get-v1-data")
Sys.setenv(SPANISH_OD_DATA_DIR = "/your/cache/path")

Available data / metadata

V1 Data

Fetch available data for version 1 with an optional check for local files:

v1 <- spod_available_data_v1(check_local_files = TRUE)

This function changes the default local paths of the MITMA data so that the files are arranged in a Hive-style layout, for faster filtering and/or import with duckdb/arrow.
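To illustrate what a Hive-style layout could look like, here is a minimal sketch. The helper name `hive_path()`, the table name, and the `year=/month=/day=` partition keys are assumptions for illustration only, not necessarily what the branch actually uses.

```r
# Hypothetical helper (not part of the package): build a Hive-style local
# path for one daily file. The partition keys are assumed for illustration.
hive_path <- function(cache_dir, table, date, file_name) {
  date <- as.Date(date)
  file.path(
    cache_dir,
    table,
    paste0("year=", format(date, "%Y")),
    paste0("month=", as.integer(format(date, "%m"))),
    paste0("day=", as.integer(format(date, "%d"))),
    file_name
  )
}

# Example call with made-up directory, table, and file names
hive_path("cache", "od_districts", "2020-03-02", "data.csv.gz")
```

With `key=value` folder names like these, tools that understand Hive partitioning can skip whole folders when filtering by date.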

Function Wrapping Suggestion

Consider wrapping both functions below into a single function with a version argument, for example, spod_available_data(ver = 1). Alternatively, you could name them spod_available_data_v1() and spod_available_data_v2(), or spod_get_metadata_v1() and spod_get_metadata_v2(). Another option could be spod_metadata_v1() and spod_metadata_v2(). The current function handles the XML file cache by checking when it was last downloaded and fetches it automatically if it is older than one day. It might make sense to add an argument to load the latest cached XML for offline use.
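The one-day cache rule and the suggested offline argument could be sketched roughly as follows. `xml_needs_refresh()` and its `offline` argument are hypothetical names used only to show the logic; they are not the functions in this branch.

```r
# Hypothetical helper (not the package's actual function): decide whether
# the cached XML should be downloaded again.
xml_needs_refresh <- function(xml_path, max_age_days = 1, offline = FALSE) {
  if (!file.exists(xml_path)) {
    # Nothing cached: offline mode cannot proceed, online mode must download
    if (offline) stop("No cached XML found and offline = TRUE: ", xml_path)
    return(TRUE)
  }
  # Offline mode always reuses the latest cached copy, however old it is
  if (offline) return(FALSE)
  age_days <- as.numeric(
    difftime(Sys.time(), file.mtime(xml_path), units = "days")
  )
  age_days > max_age_days
}
```

A caller would download a fresh XML only when this returns `TRUE`, and otherwise read the cached file.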

V2 Data

Fetch metadata for version 2:

v2 <- spod_get_metadata()

I have not modified this function much, but I plan to add the path-changing feature for v2 data as well if it is appreciated in v1. The few changes I have made include updating the XML file name so it differs from v1, renaming the file mask to data_links, and removing the current_timestamp argument from the XML fetching function, as the current timestamp should be used automatically.

The naming of both functions is still up for discussion, and I am open to suggestions.

Implemented Functions for V1 Data

Get Latest V1 File List

v1_file <- spod_get_latest_v1_file_list()

This is mainly for internal use or for advanced users, so I would keep this explicit, long name and rename the v2 equivalent to match, or even combine the two functions into one.

Available Data for V1

v1 <- spod_available_data_v1() # with optional file check for local files as noted in the beginning of the message

Zones Loader

The zones loader accepts English parameter names and their variations, which are converted to Spanish using internal helper functions. The loader functions load the shapefiles, check the geometry validity, and replace the uppercase ID with lowercase 'id'. In the future, I plan to attach metadata such as municipality names.

zones_municip <- spod_get_zones_v1("muni")
zones_municip <- spod_get_zones_v1("municip")
zones_municip <- spod_get_zones_v1("municipalities")
zones_districts <- spod_get_zones_v1("dist")
zones_districts <- spod_get_zones_v1("distr")
zones_districts <- spod_get_zones_v1("districts")
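The English-to-Spanish conversion could be done with a simple lookup, as in the sketch below. `normalise_zones()` is a hypothetical name, and the Spanish target values ("municipios", "distritos") are assumptions; the internal helpers in the branch may differ.

```r
# Hypothetical normaliser (not the package's internal helper): map the
# accepted English variants onto one assumed Spanish name per zoning type.
normalise_zones <- function(zones) {
  lookup <- c(
    muni = "municipios", municip = "municipios",
    municipalities = "municipios", municipios = "municipios",
    dist = "distritos", distr = "distritos",
    districts = "distritos", distritos = "distritos"
  )
  key <- tolower(zones)
  if (length(key) != 1 || !key %in% names(lookup)) {
    stop("Unknown zones value: ", paste(zones, collapse = ", "), call. = FALSE)
  }
  unname(lookup[key])
}

normalise_zones("muni")
normalise_zones("districts")
```

An explicit lookup keeps the list of accepted spellings in one place and fails early on anything unexpected.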

V2 Zones

For v2 zones, more adjustments are needed, such as attaching accompanying names and population numbers metadata that are available alongside the files. This is a TODO.

Download Function

The new download function handles most non-spatial data, automatically checks for the required data version, accepts multiple types of zoning, and processes date arguments automatically.

spod_download_data(type = "od", zones = "districts", dates = "2020032[0-5]")
spod_download_data(type = "od", zones = "municip", dates = "20200302_20200305")
spod_download_data(type = "od", zones = "municip", dates = "2020-03-02_2020-03-05")
spod_download_data(type = "od", zones = "municip", dates = c(start = "2020-03-02", end = "2020-03-05"))

To download any two dates without interpreting them as a range, use the following. More than two dates are always interpreted as an arbitrary sequence and not converted to a range.

spod_download_data(type = "od", zones = "municip", dates = c("2020-03-02", "2020-03-05"))
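A rough sketch of how the non-regex forms of the dates argument could be expanded is shown below. `parse_one_date()` and `expand_dates()` are hypothetical names, and regex patterns such as "2020032[0-5]" are deliberately left out of this sketch; the real internal function is more complete.

```r
# Hypothetical helpers (not the package's internals).
parse_one_date <- function(x) {
  # Accept both "20200302" and "2020-03-02"
  fmt <- if (grepl("^[0-9]{8}$", x)) "%Y%m%d" else "%Y-%m-%d"
  d <- as.Date(x, format = fmt)
  if (is.na(d)) stop("Cannot parse date: ", x, call. = FALSE)
  d
}

expand_dates <- function(dates) {
  # Named start/end vector: treat as an inclusive range
  if (length(dates) == 2 && setequal(names(dates), c("start", "end"))) {
    return(seq(parse_one_date(dates[["start"]]),
               parse_one_date(dates[["end"]]), by = "day"))
  }
  # Single "start_end" string: split on the underscore and expand
  if (length(dates) == 1 && grepl("_", dates, fixed = TRUE)) {
    parts <- strsplit(dates, "_", fixed = TRUE)[[1]]
    if (length(parts) != 2) stop("Expected 'start_end': ", dates, call. = FALSE)
    return(seq(parse_one_date(parts[1]), parse_one_date(parts[2]), by = "day"))
  }
  # Anything else: individual dates, never converted to a range
  do.call(c, lapply(unname(dates), parse_one_date))
}

expand_dates("20200302_20200305")                           # four consecutive days
expand_dates(c(start = "2020-03-02", end = "2020-03-05"))   # same four days
expand_dates(c("2020-03-02", "2020-03-05"))                 # only the two dates
```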

I think this advanced handling of dates would make the package more accessible for beginners, as I do not expect everyone to easily grasp what a regex is, and they may be more comfortable using a simpler dates argument.

There are also testthat tests for the underlying function that handles the magic behind the dates argument. These include tests for data incompatibility (i.e. the helper functions check whether the requested date range spans multiple data versions and stop with a warning).
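The version check could look something like the sketch below. `detect_data_version()` is a hypothetical name, the date boundaries in `ranges` are placeholders rather than the real coverage of each data version, and the failure is signalled here with `stop()`.

```r
# Hypothetical helper (not the package's internals): return the single data
# version that covers all requested dates, or fail if none does.
detect_data_version <- function(dates, ranges) {
  dates <- as.Date(dates)
  covers_all <- vapply(
    ranges,
    function(r) all(dates >= r[1] & dates <= r[2]),
    logical(1)
  )
  if (!any(covers_all)) {
    stop("Requested dates do not fit within a single data version.",
         call. = FALSE)
  }
  names(ranges)[which(covers_all)[1]]
}

# Placeholder boundaries for illustration only
ranges <- list(
  v1 = as.Date(c("2020-01-01", "2021-12-31")),
  v2 = as.Date(c("2022-01-01", "2099-12-31"))
)

detect_data_version(c("2020-03-02", "2020-03-05"), ranges)
```

A request that mixes dates from both placeholder ranges, such as `c("2021-12-31", "2022-01-01")`, would stop with the error above.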

Next Steps

The next step is to implement an equivalent of spod_get_zones() and similar functions for v1 data that would rely on the already downloaded data or download it automatically if it is missing. These functions will reuse the mechanics that I implemented for the download function, as they would simply wrap it. I would appreciate your feedback on the current implementation and naming of the functions, as well as on my suggestions for the overall structure of the package.

We can discuss during our next scheduled meeting.

Robinlovelace · 2024-08-07T14:03:13Z

Good stuff @e-kotov, thanks for the clear statement of changes and options, will test in due course, likely on Friday.

e-kotov · 2024-08-07T14:19:59Z

@Robinlovelace great! Take your time. Meanwhile, I made some last-minute additions to finally pass the R CMD check. We now have a snapshot of metadata bundled with the package (compressed XMLs are tiny and xml2::read_xml() reads them directly from gz just fine). These can be used for future internal tests of some other functions too.

Robinlovelace · 2024-08-07T14:21:18Z

Sounds good, more soon!

Robinlovelace · 2024-08-09T12:55:50Z

OK, going to give this a quick test and hopefully review now. Have quite limited time, ~15 minutes, due to family commitments later in the day.

Robinlovelace · 2024-08-09T13:07:43Z

Update here: looking great!

Robinlovelace

This is an epic set of changes Egor, many thanks! Big +1, all tested on README and no further comment in limited time I have today. Let's get it merged to keep ball rolling but will leave that to you.

DESCRIPTION

NAMESPACE

R/download_data.R

R/folders.R

R/get.R

R/get_v1_data.R

e-kotov · 2024-08-09T14:11:24Z

Ok, merging, but new big updates are coming up)

e-kotov · 2024-08-12T15:52:39Z

Partially fulfils #5

Robinlovelace · 2024-08-12T16:36:11Z

1 version down, 1 to go!

e-kotov added 24 commits July 31, 2024 14:11

add basic v1 data retrieval functions

b6d4602

clean get v1 zones functions, revised metadata update if file older than 1 day

f07eef6

Merge branch 'main' into get-v1-data

b75c0c7

cleanup get zones v1 workflow

d8cb82b

separate raw cache data and clean data

674a7a7

internal utils to handle different data arguments and expand date regex

38bff6b

skeleton for the get v1 od function

ea6706b

use temp dir if SPANISH_OD_DATA_DIR is not set. also give a warning when temp used

8cbc1be

unset of sys.getenv did not work, plug back the if statement

e05ef8d

more detailed warning and suggestsions on temp dir use

a56b2fb

fixes to metadata retrieval, first draft of smart download function

0e0798d

update spod_download_tables to be spod_download_data, various usage examples with date arguments

a6d542a

cleanup remains of download_table

e40bcef

revamped internal utils maily for dates and data version handling

e90f61e

clean multi-purpose download function

94b641b

moved download function away from v1 specific file

455119d

add tests for critical date handling internal function

3f223a6

quiet options and error warning messages in spod_get_dta_dir and spod_get_metadata

6a37562

update docs

5f9091a

ensure newlines in the end of files

17017a3

move current_timestamp from arguments to body of get xml 2

7621740

rename type to zones in get_zones v1 to unify naming between get_zones and download data

cee4eb6

fixes to arguments and docs to pass cmd check

3e166ee

more fixes to pass r cmd check

88c69d8

Robinlovelace self-assigned this Aug 7, 2024

bundle xml files for tests to avoid data download. copy them to test dir for successful tests

797d8ec

e-kotov added this to the v1 data (2020-2021) support milestone Aug 7, 2024

move data type check after the version detection

d1a1e52

Robinlovelace approved these changes Aug 9, 2024

e-kotov merged commit 2ce6f68 into main Aug 9, 2024
2 checks passed
