Differences in how datetime columns are treated with arrow=True #487

Open
theroggy opened this issue Oct 17, 2024 · 20 comments · May be fixed by OSGeo/gdal#11213
Labels
question Further information is requested

Comments

@theroggy
Member

theroggy commented Oct 17, 2024

When arrow is used, there seem to be several issues/differences in how datetimes are treated compared to how this behaves without arrow. Most of the differences seem to occur when reading.

To test this, I created a PR that adds test cases demonstrating this behaviour: #486. Some more details can be found in a comment further down in this thread.

Most likely (some of) these issues/behaviours would best be solved/changed upstream in GDAL...

@rouault

@theroggy theroggy added the question Further information is requested label Oct 17, 2024
@theroggy theroggy changed the title Using arrow=True with naive datetimes... Using arrow=True with naive datetimes? Oct 17, 2024
@jorisvandenbossche
Member

Do you know if it is the writing or the reading that adds the timezone information?

And do you see this happen for different file formats (i.e. it's not GPKG-specific)?

@jorisvandenbossche
Member

jorisvandenbossche commented Oct 18, 2024

See also OSGeo/gdal#8460, an issue I raised earlier about timezone support. The original issue is fixed, but there is some potentially relevant discussion related to GPKG as well.

@theroggy
Member Author

theroggy commented Oct 18, 2024

I had a deeper and broader look... and it's a bit complicated, as is typical with datetimes and timezones :-). Based on this, it looks like there are some issues when reading datetimes with timezones as well. I added some more test cases to the tests in PR #486.

For datetimes WITHOUT timezone:

This looks OK for most file formats: they just stay naive datetimes after being written and read again.

Only for GPKG is there something wrong: the datetime is written fine (the naive datetime is kept without changes to the time), but when read the datetime is assumed to be UTC...

  • 2020-01-01T09:00:00.123 becomes 2020-01-01T09:00:00.123+00:00
  • this doesn't seem right, as it gives the impression the time is actually in UTC, so when applications localize the time for visualisation the time will be wrong.

For datetimes WITH timezone:

  • For GPKG, the datetimes are written as they are, but when read they are converted to UTC.
    • E.g. 2020-01-01T09:00:00.123-05:00 becomes 2020-01-01T14:00:00.123+00:00
    • The time stays technically the same, but I don't think it is practical that this happens?
  • For FlatGeobuf, the hours stay the same, but the timezone information is just dropped. Not sure whether this happens on read or on write.
    • E.g. 2020-01-01T09:00:00.123-05:00 becomes 2020-01-01T09:00:00.123
    • Timezone info is lost, which doesn't sound ideal. Maybe FlatGeobuf doesn't support timezone info?
  • For GeoJSON, the datetimes are written as they are, which is OK, but when read the hours are changed while the timezone offset is retained, so you end up with wrong times.
    • E.g. 2020-01-01T09:00:00.123-05:00 becomes 2020-01-01T04:00:00.123-05:00
    • This changes the information quite a lot.
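
For reference, a minimal sketch of the GPKG round trip described above, using pyogrio and geopandas (the file path is hypothetical, and the exact output depends on the GDAL/pyogrio versions in use):

```python
import geopandas as gpd
import pandas as pd
import pyogrio
from shapely.geometry import Point

# One row with a timezone-naive datetime.
gdf = gpd.GeoDataFrame(
    {"dt": pd.to_datetime(["2020-01-01T09:00:00.123"])},
    geometry=[Point(0, 0)],
    crs="EPSG:4326",
)
pyogrio.write_dataframe(gdf, "/tmp/naive_datetime.gpkg")  # hypothetical path

# Without Arrow the naive datetime round-trips unchanged;
# with Arrow it comes back flagged as UTC (2020-01-01 09:00:00.123+00:00).
print(pyogrio.read_dataframe("/tmp/naive_datetime.gpkg", use_arrow=False)["dt"])
print(pyogrio.read_dataframe("/tmp/naive_datetime.gpkg", use_arrow=True)["dt"])
```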

@theroggy theroggy changed the title Using arrow=True with naive datetimes? Issues with arrow=True for datetime columns? Oct 18, 2024
rouault added a commit to rouault/gdal that referenced this issue Oct 19, 2024
@rouault
Contributor

rouault commented Oct 19, 2024

Only for GPKG, the datetime is written fine (naive datetime is kept without changes to the time), but when read the datetime is assumed to be UTC...

Format limitation. Complicated... In theory GPKG only supports UTC datetimes. As an extension, OGR supports datetimes without a timezone or with other timezones. But Arrow reading makes things complicated, as Arrow imposes a common timezone (or absence of timezone) for a whole column. And we have no way to know that for a GPKG without reading the content of the entire column, which we don't do for obvious performance reasons. So, as a conformant GPKG is assumed to be UTC, we optimistically lie by reporting an Arrow UTC datetime column, which breaks here. I don't see any way to fix that without doing a full scan, or developing some extension to GPKG to store the single timezone if OGR writes the file and all values have the same timezone info.
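
For illustration, the column-level constraint looks like this in pyarrow (just a sketch of the Arrow type system, unrelated to GDAL internals):

```python
import pyarrow as pa

# The timezone (or its absence) is part of the column *type*, so it applies
# to every value in the column; Arrow has no per-value timezone.
naive_col = pa.array([1577869200123], type=pa.timestamp("ms"))           # no timezone
utc_col = pa.array([1577869200123], type=pa.timestamp("ms", tz="UTC"))   # whole column UTC
print(naive_col.type, utc_col.type)  # timestamp[ms] timestamp[ms, tz=UTC]
```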

For GPKG, the datetimes are written as they are, but when read they are converted to UTC.

Answered above

For FlatGeobuf, the hours stay the same, but the timezone information is just dropped. Not sure if it is when read or written.

In FlatGeobuf, datetimes are written as ISO 8601 strings, but within a column you can have a mix of different timezones, or values with/without timezones. On reading with Arrow there is a similar issue as with GPKG: we must decide on an Arrow column type without having the info. Here we just drop the timezone. I don't see any way to fix that without doing a full scan, or developing some extension to FlatGeobuf to store the single timezone if OGR writes the file and all values have the same timezone info (CC @bjornharrtell)

For GeoJSON, the datetimes are written as they are which is OK, but when read the hours are changed and the timezone offset is retained, so you end up with wrong times.

OK, that one is a true bug. Addressed per OSGeo/gdal#11049.
But in my own defence, I can't find anything in https://arrow.apache.org/docs/format/Columnar.html that explains exactly which numeric value you're supposed to write in a timestamp-with-timezone column (whether it is UTC, which I believe it must be and probably makes the most sense, or relative to the timezone, which is what was written).

@bjornharrtell

Not sure I fully understand, but so a workaround for FlatGeobuf is to just use/assume UTC?

@rouault
Contributor

rouault commented Oct 19, 2024

Not sure I fully understand, but so a workaround for FlatGeobuf is to just use/assume UTC?

In Arrow, you must indicate the timezone of a datetime column for the whole column: without timezone, with timezone UTC, with timezone UTC+XXX, etc. The FlatGeobuf format, when just reading its header, doesn't convey that information. Hence, on reading with Arrow, a timezone-naive Arrow column is reported, and we compute a timestamp value ignoring the timezone (we could have made the decision to write the value as UTC too. Not sure why I didn't do that. Probably because of the confusion I had with how Arrow actually stores timestamps).
If the user indeed creates their FlatGeobuf file with datetimes already in UTC, the impact would likely be smaller, in that the Arrow timestamp value would match the UTC value.

@theroggy
Member Author

Indeed complicated.

How about using the first row as heuristic?

  • if no timezone: assume no timezone for everything. If other rows would have a timezone, convert everything to the current local timezone of the machine so the datetimes stay consistent over all rows?
  • if timezone: use this timezone and convert all rows to this timezone if other rows would have other timezones

@theroggy theroggy changed the title Issues with arrow=True for datetime columns? Differences in how datetime columns are treated with arrow=True Oct 19, 2024
@kylebarron
Contributor

kylebarron commented Oct 19, 2024

In case it's useful at all, in my (non-GDAL) FlatGeobuf-to-Arrow reader I intend to take option 1: assign a UTC time zone to the timestamp column and convert the hours into UTC (though I do currently see a bug in my code)

@rouault
Contributor

rouault commented Oct 19, 2024

How about using the first row as heuristic?

Such a complication would have to be implemented in all drivers that support datetimes with potentially non-UTC timezones... At least all those of interest from an Arrow perspective (perhaps spread the load over all driver maintainers 😂 ?). In some cases there might be a performance implication even in just reading one row (thinking of datasets with a huge number of layers, where the layer schema is established at opening time).

  • if no timezone: assume no timezone for everything. If other rows would have a timezone, convert everything to the current local timezone of the machine so the datetimes stay consistent over all rows?

I would be nervous about that part. Absence of timezone might mean either "timezone is truly unknown" or "timezone is local time". And even in the latter case, that would mean that any scripting based on that would be dependent on the local timezone of the machine.

if timezone: use this timezone and convert all rows to this timezone if other rows would have other timezones

that one would be non controversial

@theroggy
Member Author

theroggy commented Oct 19, 2024

How about using the first row as heuristic?

In some cases there might be a performance implication even in just reading one row (thinking of datasets with a huge number of layers, where the layer schema is established at opening time).

Is this a common case?

  • if no timezone: assume no timezone for everything. If other rows would have a timezone, convert everything to the current local timezone of the machine so the datetimes stay consistent over all rows?

I would be nervous about that part. Absence of timezone might mean either "timezone is truly unknown" or "timezone is local time". And even in the latter case, that would mean that any scripting based on that would be dependent on the local timezone of the machine.

I agree it isn't ideal, but I couldn't think of anything better. In general I think it is a pretty bad idea to mix naive and timezone-aware dates in one column, so at least emitting a warning when this is encountered sounds like a good idea.

@jorisvandenbossche
Member

jorisvandenbossche commented Oct 19, 2024

And even in the latter case, that would mean that any scripting based on that would be dependent on the local timezone of the machine

Yes, I personally would not like to see anything that is system dependent.

If going with a "first row heuristic", I would rather raise an error if anything later is inconsistent (regarding with vs without timezone), i.e. if you have a mix of tz-aware and tz-naive timestamps.
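
A rough sketch of what such a first-row heuristic with an error on inconsistency could look like on the consumer side, assuming pandas 2.x and that the raw values are available as ISO-8601 strings (the helper name is hypothetical):

```python
import pandas as pd

def parse_datetime_column(values):
    """Hypothetical first-row heuristic for a column of ISO-8601 datetime strings."""
    aware = [pd.Timestamp(v).tz is not None for v in values]
    if any(aware) and not all(aware):
        # Mixed naive/aware values: refuse rather than guess (keeps results system-independent).
        raise ValueError("column mixes tz-naive and tz-aware datetimes")
    if not aware[0]:
        # Everything is naive: parse as naive datetimes.
        return pd.to_datetime(values, format="ISO8601")
    # Everything is aware: normalise via UTC (handles mixed offsets), then
    # present the values in the timezone of the first row.
    first_tz = pd.Timestamp(values[0]).tz
    return pd.to_datetime(values, format="ISO8601", utc=True).tz_convert(first_tz)
```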

But in my own defence, I can't find anything in https://arrow.apache.org/docs/format/Columnar.html that explains exactly which numeric value you're supposed to write in a timestamp-with-timezone column (whether it is UTC, which I believe it must be and probably makes the most sense, or relative to the timezone, which is what was written)

Yeah, fully agreed that page is not ideal... Right now the information about the actual types (not just the physical layout) lives in Schema.fbs; for this topic it is https://github.com/apache/arrow/blob/54697eaefeb151ad195f3a5812aafa48ba77c45e/format/Schema.fbs#L284-L302
(see apache/arrow#42011 for an issue I opened a while ago about improving that)
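
In other words, for a timestamp column with a timezone the stored integers are UTC epoch values; a quick pyarrow check (a sketch) makes this visible:

```python
import pyarrow as pa
from datetime import datetime, timedelta, timezone

# 2020-01-01T09:00:00-05:00, i.e. 2020-01-01T14:00:00 UTC.
ts = datetime(2020, 1, 1, 9, 0, tzinfo=timezone(timedelta(hours=-5)))
arr = pa.array([ts], type=pa.timestamp("s", tz="-05:00"))

# The physical value is the UTC epoch; the tz on the type is only metadata
# describing how to display/localize it.
print(arr.cast(pa.int64()))  # [1577887200] == 2020-01-01T14:00:00Z
```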

@rouault
Contributor

rouault commented Oct 19, 2024

that one would be non controversial

Actually I was almost forgetting that OGRLayer::GetArrowStream() has a TIMEZONE option (cf. https://gdal.org/en/latest/doxygen/classOGRLayer.html#a3ffa8511632cbb7cff06a908e6668f55), which you can set to UTC (the GeoPackage driver sets it to UTC if the user hasn't explicitly specified it). So at least for layers where all records have a known timezone, this is the way to get values normalized to UTC.
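
A sketch of how that option can be passed from the GDAL Python bindings, assuming a GDAL version where the Arrow stream helpers are available (file name is hypothetical):

```python
from osgeo import ogr

ds = ogr.Open("example.gpkg")  # hypothetical file
lyr = ds.GetLayer(0)

# Ask GDAL to normalise datetime values to UTC in the Arrow stream
# (this is what the GeoPackage driver defaults to anyway).
stream = lyr.GetArrowStreamAsPyArrow(["TIMEZONE=UTC"])
print(stream.schema)
for batch in stream:
    ...  # consume pyarrow record batches
```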

@theroggy
Member Author

theroggy commented Nov 3, 2024

@rouault thinking a bit further on this... would it be possible, instead of discovering the (lack of) timezone up-front per table/layer, to do it when actually reading the data, and then use the first row being read (or whatever other heuristic, as there would typically be more rows available to do something with) to, if needed, change the Arrow column's time zone info after reading?

This would:

  • avoid having to discover the timezone up-front
  • avoid issues when reading data via a SQL query... as I suppose discovering the (lack of) timezone up-front would still lead to issues when a SQL query is used?

@rouault
Contributor

rouault commented Nov 5, 2024

to instead of discovering the (lack of) timezone up-front per table/layer to rather do it when actually reading the data and then using the first row being read (or whatever other heuristic, as there would typically be more rows available to do something with) to (if needed), change the arrow column time zone info after reading?

you can't do that once a user has requested the schema, and logically users should start by reading the schema before reading data
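
That ordering is inherent to the Arrow stream contract; from the consumer side it looks like this (a pyarrow-only sketch, not GDAL code): the schema is obtained once, up-front, and every batch must conform to it.

```python
import pyarrow as pa

schema = pa.schema([("dt", pa.timestamp("ms", tz="UTC"))])
batches = [pa.record_batch([pa.array([0], pa.timestamp("ms", tz="UTC"))], names=["dt"])]

reader = pa.RecordBatchReader.from_batches(schema, batches)
print(reader.schema)   # fixed before any data is read
for batch in reader:   # each batch must match that schema; the tz cannot be revised later
    print(batch)
```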

@jorisvandenbossche
Member

jorisvandenbossche commented Nov 5, 2024

Indeed, I also don't directly see an option to properly solve this on the GDAL side (in the end it's a limitation of the GPKG file format, combined with an Arrow stream that needs to have its schema defined)

(I suppose in theory GDAL could already read the first batch and cache that when creating the ArrowArrayStream object, but that will also give unexpected performance behaviour?)

One other alternative I can think of is to have an option in GetArrowStream to read timestamp columns as strings; the consumer of the Arrow stream (pyogrio, in the case of read_dataframe and converting to geopandas) could then handle them as it likes (we could use this "first row heuristic" and quite performantly parse the full column to timestamps).
But that is maybe a bit specific to GPKG, where the source is using strings as the representation? (For other file formats that might again be more complicated, or it might be less clear what the result should be.)

@jorisvandenbossche
Member

One other alternative I can think of is to have an option in GetArrowStream to read timestamp columns as strings

And this is actually what we do on our side in the non-Arrow reader (which was implemented in #253), where we essentially use OGR_F_GetFieldAsString to retrieve the data of a datetime field inside an OGR_L_GetNextFeature loop.
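
Sketched with the GDAL Python bindings, that loop looks roughly like this (field and file names are hypothetical; the C API calls mentioned above are the equivalents):

```python
from osgeo import ogr

ds = ogr.Open("example.gpkg")  # hypothetical file
lyr = ds.GetLayer(0)
dt_idx = lyr.GetLayerDefn().GetFieldIndex("dt")  # hypothetical field name

values = []
feat = lyr.GetNextFeature()
while feat is not None:
    # Equivalent of OGR_F_GetFieldAsString: the datetime is formatted as an
    # OGR-style string, keeping each value's own (possibly absent) timezone.
    values.append(feat.GetFieldAsString(dt_idx))
    feat = lyr.GetNextFeature()
```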

@rouault
Contributor

rouault commented Nov 5, 2024

where we essentially use OGR_F_GetFieldAsString to retrieve the data of a datetime field inside an OGR_L_GetNextFeature loop

General remark: be careful: the GetFieldAsXXXX() methods are essentially casts from the internal representation stored in the OGRField structure to the requested output type. So in the case of GeoPackage, that will do "text value in GPKG" -> OGRField.DateTime -> OGR-style DateTime string formatting. If the first conversion stage was lossy (which shouldn't be the case for GeoPackage), GetFieldAsString() won't recover the initial version.

Unless you really want the original datetime values from the GeoPackage, I don't think there's an issue with the current GetArrowStream() implementation for GeoPackage, which will by default return values converted to UTC if they weren't stored as UTC in the GeoPackage (but they normally should be if strictly following the GPKG spec).

@rouault
Contributor

rouault commented Nov 5, 2024

that will by default return values converted to UTC if they weren't stored as UTC in the GeoPackage

I was forgetting the case of unknown timezone or local timezone. For GeoPackage, a solution is to issue an ExecuteSQL("SELECT CAST(the_datetime_column AS VARCHAR) AS the_datetime_column, ... FROM ...") and call GetArrowStream() on the resulting layer.
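
A sketch of that workaround with the Python bindings (layer, column and file names are hypothetical):

```python
from osgeo import ogr

ds = ogr.Open("example.gpkg")  # hypothetical file

# Cast the datetime column to text in SQL so the Arrow stream exposes the
# stored strings instead of a timestamp column with an assumed timezone.
sql_lyr = ds.ExecuteSQL("SELECT CAST(dt AS VARCHAR) AS dt, geom FROM my_layer")
stream = sql_lyr.GetArrowStreamAsPyArrow()
print(stream.schema)  # dt now comes through as a string field
ds.ReleaseResultSet(sql_lyr)
```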

@theroggy
Member Author

theroggy commented Nov 5, 2024

that will by default return values converted to UTC if they weren't stored as UTC in the GeoPackage

I was forgetting the case of unknown timezone or local timezone.

It's indeed the naive timezone cases that are problematic: at least for GeoPackage and FlatGeobuf, as far as we know...

rouault added a commit to rouault/gdal that referenced this issue Nov 5, 2024
DATETIME_AS_STRING=YES/NO. Defaults to NO. Added in GDAL 3.11.
Whether DateTime fields should be returned as a (normally ISO-8601
formatted) string by drivers. The aim is to be able to handle mixed
timezones (or timezone naive values) in the same column.
All drivers must honour that option, and potentially fallback to the
OGRLayer generic implementation if they cannot (which is the case for the
Arrow, Parquet and ADBC drivers).
When DATETIME_AS_STRING=YES, the TIMEZONE option is ignored.

Fixes geopandas/pyogrio#487
@rouault
Contributor

rouault commented Nov 5, 2024

What about OSGeo/gdal#11213 ?
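
If that lands, consuming the new option could look roughly like this (a sketch assuming a GDAL build that includes the proposed DATETIME_AS_STRING option from OSGeo/gdal#11213; file name is hypothetical):

```python
from osgeo import ogr

ds = ogr.Open("example.gpkg")  # hypothetical file
lyr = ds.GetLayer(0)

# With the proposed option, DateTime fields come back as (normally ISO-8601)
# strings, so a consumer such as pyogrio can handle mixed or missing timezones itself.
stream = lyr.GetArrowStreamAsPyArrow(["DATETIME_AS_STRING=YES"])
print(stream.schema)
```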
