
Use pyarrow instead of fastparquet to write parquet data #17

Closed
ravwojdyla opened this issue Feb 12, 2022 · 3 comments

Comments

@ravwojdyla
Contributor

ravwojdyla commented Feb 12, 2022

pyarrow is the default pandas parquet engine, and by default it also works better across the ecosystem (including pyspark). Specifically, the genes.snappy.parquet data can't be read by pyspark 3.2.0, due to:

org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64 (TIMESTAMP(NANOS,false))
at org.apache.spark.sql.errors.QueryCompilationErrors$.illegalParquetTypeError(QueryCompilationErrors.scala:1284)

Btw, fastparquet has a Spark-compatible mode for timestamps: times="int96".

Also from https://fastparquet.readthedocs.io/en/latest/releasenotes.html#id2:

nanosecond resolution times: the new extended “logical” types system supports nanoseconds alongside the previous millis and micros. We now emit these for the default pandas time type, and produce full parquet schema including both “converted” and “logical” type information. Note that all output has isAdjustedToUTC=True, i.e., these are timestamps rather than local time. The time-zone is stored in the metadata, as before, and will be successfully recreated only in fastparquet and (py)arrow. Otherwise, the times will appear to be UTC. For compatibility with Spark, you may still want to use times="int96" when writing.

@dhimmel
Member

dhimmel commented Feb 14, 2022

Thanks for reporting this issue and the PR!

IIRC I selected fastparquet over pyarrow because it seemed lighter weight, as per this comment:

fastparquet library was only about 1.1mb, while pyarrow library was 176mb

Switching to pyarrow adds about 30 seconds to build times, as per https://github.com/related-sciences/ensembl-genes/pull/18/files#r805887127. The fact that fastparquet is part of the dask GitHub organization also gave me confidence in it.

I agree compatibility is paramount for the parquet outputs from this repo. One solution would be to add times="int96" with fastparquet, but I'm guessing that there might be future issues like this. @ravwojdyla is that your reasoning for switching to pyarrow... that it is more likely to be more compatible in the future?

@ravwojdyla
Contributor Author

@dhimmel I personally trust pyarrow more; it also seems to have sounder defaults, and, as you mentioned, there might be other compatibility issues in the future.

@dhimmel
Member

dhimmel commented Feb 14, 2022

Okay, rerunning exports with the pyarrow engine for pandas.DataFrame.to_parquet in this build.
