Use pyarrow instead of fastparquet to write parquet data #17
Comments
Thanks for reporting this issue and the PR! IIRC I selected fastparquet over pyarrow since it seemed lighter weight as per this comment:
Switching to pyarrow is adding about 30 seconds to build times as per https://github.com/related-sciences/ensembl-genes/pull/18/files#r805887127. It also gave me confidence that fastparquet was part of the dask GitHub organization. I agree compatibility is paramount for the parquet outputs from this repo. One solution would be to add
@dhimmel I personally trust pyarrow more. It also seems to have sounder defaults and, as you mentioned, there might be other issues.
Okay, rerunning exports with the pyarrow engine for
pyarrow is the default pandas parquet engine, and by default it also works better across the ecosystem (including pyspark). Specifically, genes.snappy.parquet data can't be read by pyspark 3.2.0, due to:
Btw, fastparquet has a Spark-compatible mode for timestamps: times="int96".
Also from https://fastparquet.readthedocs.io/en/latest/releasenotes.html#id2:
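Separately from the release notes linked above, here is a minimal sketch of the two write options discussed in this issue. It is not this repo's actual export code: the helper names `parquet_write_options` and `write_genes` are hypothetical, and the explicit snappy compression is an assumption based on the `genes.snappy.parquet` file name. It relies on `pandas.DataFrame.to_parquet` forwarding extra keyword arguments (such as `times`) to the selected engine.

```python
from pathlib import Path


def parquet_write_options(engine: str = "pyarrow") -> dict:
    """Build keyword arguments for pandas.DataFrame.to_parquet (hypothetical helper)."""
    if engine not in ("pyarrow", "fastparquet"):
        raise ValueError(f"unsupported parquet engine: {engine!r}")
    # Assumption: snappy compression, matching the genes.snappy.parquet name.
    options = {"engine": engine, "compression": "snappy"}
    if engine == "fastparquet":
        # fastparquet-only option mentioned in the issue: write timestamps
        # as int96, the Spark-compatible mode.
        options["times"] = "int96"
    return options


def write_genes(df, path, engine: str = "pyarrow") -> None:
    """Write a pandas DataFrame to parquet with the chosen engine (hypothetical helper)."""
    df.to_parquet(Path(path), **parquet_write_options(engine))
```

Calling `write_genes(df, "genes.snappy.parquet")` would use the pyarrow engine, while passing `engine="fastparquet"` would keep fastparquet but add `times="int96"`. Whether either output is readable by pyspark 3.2.0 would still need to be checked against the real data.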