I'm experiencing an issue with the Hudi configuration for the Parquet compression codec. Despite setting "hoodie.parquet.compression.codec": "GZIP" in my Hudi write options, the output files in my data lake do not appear to be compressed; I only see standard Parquet files.
Configuration:
Hudi Version: 1.0.0-beta2
Spark Version: 3.4
Java Version: OpenJDK 11
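For reference, a minimal sketch of how this codec option is typically passed to a PySpark Hudi write (the table name, key fields, and base path below are hypothetical, and the Hudi Spark bundle is assumed to be on the classpath):

from pyspark.sql import SparkSession

# Hypothetical local session; the Hudi Spark bundle must be on the classpath,
# e.g. via --packages org.apache.hudi:hudi-spark3.4-bundle_2.12:1.0.0-beta2
spark = SparkSession.builder.appName("hudi-gzip-test").getOrCreate()

df = spark.createDataFrame([(1, "a", 1000)], ["id", "name", "ts"])

hudi_options = {
    "hoodie.table.name": "gzip_test",                 # hypothetical table name
    "hoodie.datasource.write.recordkey.field": "id",  # hypothetical record key
    "hoodie.datasource.write.precombine.field": "ts", # hypothetical precombine field
    "hoodie.parquet.compression.codec": "GZIP",       # the codec option in question
}

(
    df.write.format("hudi")
    .options(**hudi_options)
    .mode("overwrite")
    .save("/tmp/hudi/gzip_test")                      # hypothetical base path
)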
@soumilshah1995 Did you try checking the compression on these files?
They should be gzip-compressed. You can verify with parquet-tools, or with something like the snippet below:
import pyarrow.parquet as pq

# Specify the path to your Parquet file
parquet_file = 'your_file.parquet'

# Read the metadata
metadata = pq.read_metadata(parquet_file)

# Print overall metadata
print(metadata)

# Iterate through the row groups and print compression information
for i in range(metadata.num_row_groups):
    row_group_metadata = metadata.row_group(i)
    print(f'Row Group {i}:')
    for j in range(row_group_metadata.num_columns):
        column_metadata = row_group_metadata.column(j)
        print(f'  Column {j}:')
        print(f'    Compression: {column_metadata.compression}')
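If the option took effect, each column chunk in the output above should report GZIP as its compression codec. Note that Parquet applies compression per column chunk inside the file, so gzip-compressed files keep the plain .parquet extension; you will not see separate .gz files. The Java parquet-tools utility should print the same codec information via parquet-tools meta your_file.parquet.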