ogr2ogr: e.g. for GPKG files, naive datetimes in src are represented as being UTC in dst file #11212

Open
theroggy opened this issue Nov 5, 2024 · 5 comments

@theroggy
Contributor

theroggy commented Nov 5, 2024

What is the bug?

Datetime columns are not always treated correctly/consistently:

  • For GeoPackage, when a file is "translated" to a GeoPackage file:
    • Naive datetimes (no timezone information) are interpreted/written as being UTC in the destination file, which changes the datetime information significantly. This only happens when the data is internally processed via arrow (more info). With the classic internal processing (e.g. when explodecollection,... is specified) this does not occur.
    • If the source GeoPackage has a column with a localized timezone (e.g. +4), the datetime in that column is converted to UTC. This is not ideal as the localization information is lost, but the time stays ~correct, so this problem isn't as significant.
  • For FlatGeobuf, the timezone information is dropped when this is the destination file format and arrow is used (see the output below).
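
To illustrate why the first case is the more serious one, here is a minimal sketch using only the Python standard library. It does not call GDAL; the assumption that the naive value was really recorded at UTC+4 is made up for the example.

```python
from datetime import datetime, timedelta, timezone

# A naive wall-clock value, as stored in the source column.
naive = datetime(2021, 1, 1, 0, 0, 0)

# Assumption for illustration only: the value was really recorded at UTC+4.
local_tz = timezone(timedelta(hours=4))
intended = naive.replace(tzinfo=local_tz)

# What the destination file ends up with: the same digits, labelled as UTC.
labelled_utc = naive.replace(tzinfo=timezone.utc)

# The two values now denote different instants, four hours apart.
shift = labelled_utc - intended
print(shift)

# By contrast, converting an aware value to UTC keeps the instant
# and only loses the original offset.
converted = intended.astimezone(timezone.utc)
print(converted.isoformat(), converted == intended)
```

Labelling a naive value as UTC silently moves it in time unless it happened to be UTC already, whereas converting an aware value to UTC only discards the offset.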

This issue is already being discussed in the context of the integration in pyogrio via the C API here. I'm posting it here as well for completeness' sake and to document that the same behaviour occurs in ogr2ogr when arrow is used internally.

Steps to reproduce the issue

UPDATE: added FlatGeobuf to the script, plus a comparison with versus without arrow

import warnings
from pathlib import Path
import geopandas as gpd
import pandas as pd
from osgeo import gdal
from shapely import Point

warnings.filterwarnings("ignore")
gdal.UseExceptions()

input_gdf = gpd.GeoDataFrame(
    data={
        "datetime_naive": pd.to_datetime(["2021-01-01 00:00:00", "2021-01-01 00:00:00", "2021-01-01 00:00:00"]),
        "datetime_utc": pd.to_datetime(["2021-01-01 00:00:00+00:00", "2021-01-01 00:00:00+00:00", "2021-01-01 00:00:00+00:00"]),
        "datetime_tz_local": pd.to_datetime(["2021-01-01 00:00:00+04:00", "2021-01-01 00:00:00+04:00", "2021-01-01 00:00:00+04:00"]),
    },
    geometry=[Point(0, 0), Point(0, 0), Point(0, 0)],
    crs=31370,
)

for suffix in [".gpkg", ".fgb"]:
    for arrow in ["YES", "NO"]:
        gdal.SetConfigOption("OGR2OGR_USE_ARROW_API", arrow)
        src = Path(f"C:/temp/src_arrow-{arrow}{suffix}")
        src.unlink(missing_ok=True)
        input_gdf.to_file(src)

        dst = Path(f"C:/temp/dst_arrow-{arrow}{suffix}")
        dst.unlink(missing_ok=True)
        ds_output = gdal.VectorTranslate(srcDS=src, destNameOrDestDS=dst)
        ds_output = None

        src_gdf = gpd.read_file(src)
        dst_gdf = gpd.read_file(dst)

        print(f"=== result for {suffix}, {arrow=} ===")
        print(src_gdf.drop(columns=["geometry"]))
        print(dst_gdf.drop(columns=["geometry"]))

Output:

=== result for .gpkg, arrow='YES' ===
  datetime_naive              datetime_utc         datetime_tz_local
0     2021-01-01 2021-01-01 00:00:00+00:00 2021-01-01 00:00:00+04:00
1     2021-01-01 2021-01-01 00:00:00+00:00 2021-01-01 00:00:00+04:00
2     2021-01-01 2021-01-01 00:00:00+00:00 2021-01-01 00:00:00+04:00
             datetime_naive              datetime_utc         datetime_tz_local
0 2021-01-01 00:00:00+00:00 2021-01-01 00:00:00+00:00 2020-12-31 20:00:00+00:00
1 2021-01-01 00:00:00+00:00 2021-01-01 00:00:00+00:00 2020-12-31 20:00:00+00:00
2 2021-01-01 00:00:00+00:00 2021-01-01 00:00:00+00:00 2020-12-31 20:00:00+00:00
=== result for .gpkg, arrow='NO' ===
  datetime_naive              datetime_utc         datetime_tz_local
0     2021-01-01 2021-01-01 00:00:00+00:00 2021-01-01 00:00:00+04:00
1     2021-01-01 2021-01-01 00:00:00+00:00 2021-01-01 00:00:00+04:00
2     2021-01-01 2021-01-01 00:00:00+00:00 2021-01-01 00:00:00+04:00
  datetime_naive              datetime_utc         datetime_tz_local
0     2021-01-01 2021-01-01 00:00:00+00:00 2021-01-01 00:00:00+04:00
1     2021-01-01 2021-01-01 00:00:00+00:00 2021-01-01 00:00:00+04:00
2     2021-01-01 2021-01-01 00:00:00+00:00 2021-01-01 00:00:00+04:00
=== result for .fgb, arrow='YES' ===
  datetime_naive              datetime_utc         datetime_tz_local
0     2021-01-01 2021-01-01 00:00:00+00:00 2021-01-01 00:00:00+04:00
1     2021-01-01 2021-01-01 00:00:00+00:00 2021-01-01 00:00:00+04:00
2     2021-01-01 2021-01-01 00:00:00+00:00 2021-01-01 00:00:00+04:00
  datetime_naive datetime_utc datetime_tz_local
0     2021-01-01   2021-01-01        2021-01-01
1     2021-01-01   2021-01-01        2021-01-01
2     2021-01-01   2021-01-01        2021-01-01
=== result for .fgb, arrow='NO' ===
  datetime_naive              datetime_utc         datetime_tz_local
0     2021-01-01 2021-01-01 00:00:00+00:00 2021-01-01 00:00:00+04:00
1     2021-01-01 2021-01-01 00:00:00+00:00 2021-01-01 00:00:00+04:00
2     2021-01-01 2021-01-01 00:00:00+00:00 2021-01-01 00:00:00+04:00
  datetime_naive              datetime_utc         datetime_tz_local
0     2021-01-01 2021-01-01 00:00:00+00:00 2021-01-01 00:00:00+04:00
1     2021-01-01 2021-01-01 00:00:00+00:00 2021-01-01 00:00:00+04:00
2     2021-01-01 2021-01-01 00:00:00+00:00 2021-01-01 00:00:00+04:00
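
The differences in the output above can be classified with a small helper. This is only a sketch: the function name `classify` and its category labels are hypothetical, and the strings are the values from the output written out in full (pandas prints midnight-only naive columns as bare dates).

```python
from datetime import datetime


def classify(src: str, dst: str) -> str:
    """Describe how a datetime value changed between source and destination."""
    a = datetime.fromisoformat(src)
    b = datetime.fromisoformat(dst)
    a_aware = a.tzinfo is not None
    b_aware = b.tzinfo is not None

    # Naive vs. aware mismatch: timezone information was invented or lost.
    if a_aware != b_aware:
        return "timezone added" if b_aware else "timezone dropped"
    # Both naive: only the wall-clock digits can be compared.
    if not a_aware:
        return "unchanged" if a == b else "value changed"
    # Both aware: compare the instant first, then the offset.
    if a != b:
        return "instant changed"
    if a.utcoffset() != b.utcoffset():
        return "same instant, offset changed"
    return "unchanged"


# .gpkg with arrow: naive column, then the +04:00 column
print(classify("2021-01-01 00:00:00", "2021-01-01 00:00:00+00:00"))
print(classify("2021-01-01 00:00:00+04:00", "2020-12-31 20:00:00+00:00"))
# .fgb with arrow: the +04:00 column
print(classify("2021-01-01 00:00:00+04:00", "2021-01-01 00:00:00"))
```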

Versions and provenance

  • OS: Windows 11
  • gdal version: 3.9.2, installed from conda-forge

Additional context

No response

@jratike80
Collaborator

Naive datetimes and times with localized timezones are not supported in the GeoPackage standard, so if you aim at interoperability, don't use them even though GDAL allows it.

@rouault
Member

rouault commented Nov 6, 2024

Slightly related improvement in #11213, but it won't fix that particular use case.

@theroggy
Contributor Author

theroggy commented Nov 6, 2024

@jratike80 it is a bit broader than GeoPackage: for other file formats the treatment of datetimes is sometimes not ideal either, typically when arrow is used under the hood. For some of the issues there are technical complications that explain why this is the case, as discussed in geopandas/pyogrio#487.

The file formats I tested were GeoJSON, GeoJSONSeq, FlatGeobuf and GeoPackage. I added some extra relevant cases to the reproduction script above:

  • FlatGeobuf files, as their datetime columns also seem to be treated a bit weirdly in the arrow case
  • With arrow being used or without: the issues are specific to arrow being used

@jratike80
Collaborator

Doesn't the GeoPackage creation option DATETIME_FORMAT=[WITH_TZ/UTC] https://gdal.org/en/latest/drivers/vector/gpkg.html#dataset-creation-options change anything?

@rouault
Member

rouault commented Nov 6, 2024

Doesn't the GeoPackage creation option DATETIME_FORMAT=[WITH_TZ/UTC] https://gdal.org/en/latest/drivers/vector/gpkg.html#dataset-creation-options change anything?

No, it won't. The issue with Arrow is that Arrow DateTime columns must declare their timezone (or the absence of one) at the column level, whereas GeoPackage DATETIME_FORMAT=WITH_TZ (the default) allows creating rows in the same column with different timezones. So when reading back with Arrow, as we don't know whether a single timezone is used or a mix, we normalize to UTC.
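
The column-level constraint can be sketched with the standard library alone. This is not GDAL's or Arrow's actual code; the helper name `to_single_timezone_column` and the sample rows are hypothetical and only show why a column with mixed per-row offsets ends up as UTC.

```python
from datetime import datetime, timedelta, timezone


def to_single_timezone_column(values):
    """Normalize timezone-aware values to UTC so one timezone fits the whole column."""
    for v in values:
        if v.tzinfo is None:
            # astimezone() on a naive value would silently assume system local time.
            raise ValueError("naive datetime: no offset to convert from")
    return [v.astimezone(timezone.utc) for v in values]


# Rows as a row-oriented format may store them: one offset per value.
rows = [
    datetime(2021, 1, 1, 0, 0, tzinfo=timezone(timedelta(hours=4))),
    datetime(2021, 1, 1, 0, 0, tzinfo=timezone.utc),
    datetime(2021, 1, 1, 0, 0, tzinfo=timezone(timedelta(hours=-5))),
]

offsets_before = {v.utcoffset() for v in rows}   # three distinct offsets
normalized = to_single_timezone_column(rows)
offsets_after = {v.utcoffset() for v in normalized}  # a single offset: UTC

for original, utc in zip(rows, normalized):
    # The instant is preserved; only the per-row offset is lost.
    print(original.isoformat(), "->", utc.isoformat(), original == utc)
```

The instants survive the conversion, which matches the datetime_tz_local column in the output above. A naive column has no offset to convert from, which is why the helper refuses naive values instead of guessing.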
