
Use pyarrow instead of fastparquet to write parquet data #17

Closed
ravwojdyla opened this issue Feb 12, 2022 · 3 comments

Comments

@ravwojdyla
Contributor

ravwojdyla commented Feb 12, 2022

pyarrow is the default pandas parquet engine, and by default it also works better across the ecosystem (including pyspark). Specifically, the genes.snappy.parquet data can't be read by pyspark 3.2.0, due to:

org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64 (TIMESTAMP(NANOS,false))
at org.apache.spark.sql.errors.QueryCompilationErrors$.illegalParquetTypeError(QueryCompilationErrors.scala:1284)

Btw, fastparquet has a Spark-compatible mode for timestamps: times="int96".

Also from https://fastparquet.readthedocs.io/en/latest/releasenotes.html#id2:

nanosecond resolution times: the new extended “logical” types system supports nanoseconds alongside the previous millis and micros. We now emit these for the default pandas time type, and produce full parquet schema including both “converted” and “logical” type information. Note that all output has isAdjustedToUTC=True, i.e., these are timestamps rather than local time. The time-zone is stored in the metadata, as before, and will be successfully recreated only in fastparquet and (py)arrow. Otherwise, the times will appear to be UTC. For compatibility with Spark, you may still want to use times="int96" when writing.

@dhimmel
Member

dhimmel commented Feb 14, 2022

Thanks for reporting this issue and the PR!

IIRC I selected fastparquet over pyarrow because it seemed lighter weight, as per this comment:

fastparquet library was only about 1.1mb, while pyarrow library was 176mb

Switching to pyarrow adds about 30 seconds to build times, as per https://github.com/related-sciences/ensembl-genes/pull/18/files#r805887127. The fact that fastparquet is part of the dask GitHub organization also gave me confidence in it.

I agree compatibility is paramount for the parquet outputs from this repo. One solution would be to add times="int96" with fastparquet, but I'm guessing that there might be future issues like this. @ravwojdyla is that your reasoning for switching to pyarrow... that it is more likely to be more compatible in the future?

@ravwojdyla
Contributor Author

@dhimmel I personally trust pyarrow more; it also seems to have sounder defaults, and, as you mentioned, there might be other compatibility issues in the future.

@dhimmel
Member

dhimmel commented Feb 14, 2022

Okay, rerunning exports with the pyarrow engine for pandas.DataFrame.to_parquet in this build.
