Fix schema violation for timezone data (#259) #278

lr4d · 2020-04-14T12:21:14Z

Supersedes #260

I tried to rebase this branch on top of master but it was quite the adventure to do so.
I excluded the changes to the changelog and it seems like some removal of comments is also missing, but the important parts should be there.

cc @ged-steponavicius

codecov · 2020-04-14T12:27:19Z

Codecov Report

Merging #278 into master will decrease coverage by 0.01%.
The diff coverage is 93.75%.

@@            Coverage Diff             @@
##           master     #278      +/-   ##
==========================================
- Coverage   89.88%   89.86%   -0.02%     
==========================================
  Files          39       39              
  Lines        3746     3749       +3     
  Branches      911      915       +4     
==========================================
+ Hits         3367     3369       +2     
  Misses        223      223              
- Partials      156      157       +1
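
The header reports a 0.01% decrease while the table shows -0.02%. A minimal sketch, assuming coverage is computed as hits divided by hits + misses + partials (which matches the Lines row above), reproduces both figures. How Codecov itself rounds is an assumption here, not something the report states.

```python
# Figures copied from the Coverage Diff table above.
master = {"hits": 3367, "misses": 223, "partials": 156}
branch = {"hits": 3369, "misses": 223, "partials": 157}


def coverage_percent(counts):
    """Covered lines as a percentage of all tracked lines (assumed formula)."""
    total = counts["hits"] + counts["misses"] + counts["partials"]
    return 100.0 * counts["hits"] / total


before = coverage_percent(master)
after = coverage_percent(branch)
delta = after - before

# Two ways of shortening the same unrounded change to two decimals.
rounded = round(delta, 2)
truncated = int(abs(delta) * 100) / 100

print(f"master: {before:.2f}%  branch: {after:.2f}%")
print(f"unrounded change: {delta:+.4f}")
print(f"rounded: {rounded:+.2f}  truncated magnitude: {truncated:.2f}")
```

Under that assumption the unrounded change is roughly -0.019 percentage points, which rounds to -0.02 and truncates to 0.01, so the two numbers in the report need not contradict each other.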

Impacted Files	Coverage Δ
kartothek/core/common_metadata.py	`95.17% <93.33%> (-0.22%)`	⬇️
kartothek/core/testing.py	`73.43% <100.00%> (-1.57%)`	⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update ba6215e...b7ea769.

marco-neumann-by · 2020-04-16T13:29:31Z

kartothek/core/common_metadata.py

@@ -78,6 +77,35 @@ def _schema_compat(self):

            schema = schema.remove_metadata()
            md = {b"pandas": _dict_to_binary(pandas_metadata)}
+            # https://github.com/JDASoftwareGroup/kartothek/issues/259
+
+            if schema is not None:


why can it be None here? 3 lines above we just call schema = schema.remove_metadata(). Can this return None?

marco-neumann-by · 2020-04-16T13:30:17Z

kartothek/core/common_metadata.py

+
+            if schema is not None:
+                fields = []
+                for f in schema:


short comment before the entire massaging would be nice. This function just does more and more and it is hard to follow along.

marco-neumann-by · 2020-04-16T13:30:48Z

kartothek/core/common_metadata.py

-
-        fields.append(f)
-    return pa.schema(fields, schema.metadata)
+    return schema


a direct return from the line above would work now as well.

reference-data/arrow-compat/batch_generate_references.sh

marco-neumann-by · 2020-04-16T13:32:33Z

tests/serialization/test_arrow_compat.py

@@ -53,4 +52,7 @@ def test_arrow_compat(arrow_version, reference_store, mocker):
    if arrow_version in ("0.14.1", "0.15.0", "0.16.0") and not ARROW_LARGER_EQ_0141:
        orig = orig.astype({"null": float})

+    if LooseVersion(arrow_version) < "0.13.0":


can you add a short comment on the reason? Is this because timezones cannot be preserved with these old versions?

lr4d · 2020-04-16

Because the changes from @ged-steponavicius fix #259, enabling proper compatibility of datetimes with timezones (i.e. these columns), and we currently only support pyarrow >= 0.13.0.

fjetter · 2020-04-16

Then why do we need to check < "0.13.0"? That's dead code, isn't it?

lr4d · 2020-04-16

arrow_version here refers to the arrow version with which the reference data file was created.

lr4d · 2020-04-16

Suggested change

-    if LooseVersion(arrow_version) < "0.13.0":
+    if LooseVersion(arrow_version) < "0.13.0":  # gh-259: compatibility of datetimes with timezones only supported in kartothek versions with pyarrow >= 0.13.0

marco-neumann-by · 2020-04-16

suggestion looks good to me, thanks :)
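
The point of the exchange above is that `arrow_version` is the version that wrote a reference file, not the installed pyarrow, so the `< "0.13.0"` branch can still be reached. A minimal sketch of that kind of gate follows. The helper names and the `"0.12.1"` entry are hypothetical, the version parsing is a simplified stand-in for `LooseVersion`, and what the real test does inside the branch is not shown in this excerpt.

```python
import re


def version_tuple(version):
    """Turn '0.13.0' into (0, 13, 0) so versions compare numerically."""
    return tuple(int(part) for part in re.findall(r"\d+", version))


# Threshold taken from the review thread; the constant name is hypothetical.
MIN_TZ_COMPAT = version_tuple("0.13.0")


def reference_predates_tz_fix(arrow_version):
    """True if the reference file was written by a pyarrow older than 0.13.0.

    `arrow_version` describes the file on disk, not the installed library.
    """
    return version_tuple(arrow_version) < MIN_TZ_COMPAT


# "0.12.1" is a made-up example; the other versions appear in the test diff.
for written_with in ["0.12.1", "0.14.1", "0.15.0", "0.16.0"]:
    if reference_predates_tz_fix(written_with):
        print(written_with, "-> old reference file, needs special handling")
    else:
        print(written_with, "-> timezone columns expected to round-trip")

# Plain string comparison would order versions incorrectly:
print("0.9.0" < "0.13.0")  # compares character by character, not numerically
```

Comparing parsed tuples rather than raw strings avoids the ordering problem shown on the last line, which is the same reason the test wraps `arrow_version` in `LooseVersion`.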

lr4d · 2020-04-16T15:05:32Z

This PR now also includes the appropriate changelog entry and I added some commits to update the reference Parquet files.

@ged-steponavicius feel free to push to this branch if you'd like to address the review comments

ged-steponavicius and others added 3 commits April 14, 2020 15:42

fixed schema violation for timestamp with tz

6ac3fa4

update reference Parquet generation scripts

720ac97

update reference Parquet files

b7ea769

lr4d force-pushed the tz_bug_fix_rebased branch from 16d2233 to b7ea769 Compare April 14, 2020 13:43

lr4d changed the title ~~WIP: Tz bug fix rebased~~ Fix schema violation for timezone data (#259) Apr 14, 2020

lr4d marked this pull request as ready for review April 14, 2020 13:45

marco-neumann-by suggested changes Apr 16, 2020

lr4d mentioned this pull request May 13, 2020

update reference Parquet generation scripts #280

Merged
