Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fixed schema violation for timestamp with tz #260

Open
wants to merge 24 commits into
base: master
Choose a base branch
from

Conversation

ged-steponavicius
Copy link

Description:

fixes issue #259

@lr4d
Copy link
Collaborator

lr4d commented Mar 19, 2020

The black code formatter is causing tests to be red. I suggests you install the pre-commit hooks or run black on the code files

Copy link
Collaborator

@fjetter fjetter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution!

Overall LGTM. Can you add an entry to the changelog, please? see https://github.com/JDASoftwareGroup/kartothek/blob/master/CHANGES.rst

kartothek/core/common_metadata.py Outdated Show resolved Hide resolved
kartothek/core/common_metadata.py Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Mar 22, 2020

Codecov Report

Merging #260 into master will increase coverage by 0.11%.
The diff coverage is 94.73%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #260      +/-   ##
==========================================
+ Coverage   89.72%   89.84%   +0.11%     
==========================================
  Files          39       39              
  Lines        3729     3741      +12     
  Branches      906      909       +3     
==========================================
+ Hits         3346     3361      +15     
+ Misses        224      223       -1     
+ Partials      159      157       -2     
Impacted Files Coverage Δ
kartothek/core/testing.py 75.00% <ø> (+1.56%) ⬆️
kartothek/io/eager.py 77.09% <ø> (ø)
kartothek/core/common_metadata.py 95.17% <93.33%> (-0.22%) ⬇️
kartothek/core/index.py 85.63% <100.00%> (+0.91%) ⬆️
kartothek/io_components/metapartition.py 91.41% <0.00%> (+0.13%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5bc0a53...d2faed5. Read the comment docs.

@ged-steponavicius
Copy link
Author

@lr4d @fjetter

Tests are now passing for pyarrow 0.15 and 0.16

I have done an investigation into earlier pyarrow versions, how they store schema internally appears to be significantly different from later versions. I could not pinpoint it to the changelog entry.

key = "some_uuid/some_table/_common_metadata"
pq_file = pq.ParquetFile(store.open(key))

0.13.0 does not store timezone information:

pq_file.schema #pyarrow==0.13.0
<pyarrow._parquet.ParquetSchema object at 0x000001F0B544E608>
datetime64: INT64 TIMESTAMP_MICROS
datetime64_tz: INT64 TIMESTAMP_MICROS
datetime64_utc: INT64 TIMESTAMP_MICROS

0.14.0 aligns to UTC:

pq_file.schema #pyarrow==0.14.0
<pyarrow._parquet.ParquetSchema object at 0x000001FCE4087148>
datetime64: INT64 Timestamp(isAdjustedToUTC=true, timeUnit=microseconds)
datetime64_tz: INT64 Timestamp(isAdjustedToUTC=true, timeUnit=microseconds)
datetime64_utc: INT64 Timestamp(isAdjustedToUTC=true, timeUnit=microseconds)

I'm not sure what's our the best approach, any suggestions?

PS. sorry for multiple commits, this is my first PR

fjetter and others added 3 commits March 23, 2020 10:06
kartothek/core/common_metadata.py Outdated Show resolved Hide resolved
kartothek/core/common_metadata.py Outdated Show resolved Hide resolved
CHANGES.rst Outdated Show resolved Hide resolved
Copy link
Collaborator

@fjetter fjetter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the github diff is a bit confused due to the many merge commits. We probably should squash this on merge.

tests/core/test_common_metadata.py Outdated Show resolved Hide resolved
tests/serialization/test_arrow_compat.py Show resolved Hide resolved
Comment on lines +139 to +141
keys = np.array(list(self.index_dct.keys()))
labeled_array = pa.array(keys, type=self.dtype)
return np.array(labeled_array.to_pandas(date_as_object=date_as_object))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should already be on master. Is github fooling me or what's happening here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have pulled the latest master before committing, I think GitHub is confused

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the rebase got a bit messed up. @ged-steponavicius could you perhaps try rebasing again on master?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

6 participants