wr.s3.to_parquet() incorrectly identifies empty columns #2974

Open
coldenol opened this issue Sep 27, 2024 · 1 comment
Labels: bug (Something isn't working)

Comments


Describe the bug

When I try to write a dataframe using the wr.s3.to_parquet() method, it often throws the following error for some columns:

UndetectedType Impossible to infer the equivalent Athena data type for the billingaddress column. It is completely empty (only null values) and has a too generic data type (object). Please, cast this columns with a more deterministic data type (e.g. df['billingaddress'] = df['billingaddress'].astype('string')) or pass the column schema as argument(e.g. dtype={'billingaddress': 'string'}
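
For reference, the two fixes the error message suggests look roughly like the sketch below; the sample dataframe and bucket path are placeholder assumptions, not taken from my actual data:

import awswrangler as wr
import pandas as pd

# Placeholder dataframe with an all-null object column, mirroring the error above
df = pd.DataFrame({"id": [1, 2], "billingaddress": [None, None]})

# Option 1: cast the problematic column explicitly before writing
df["billingaddress"] = df["billingaddress"].astype("string")

# Option 2: pass the column schema via the dtype argument (requires dataset=True)
wr.s3.to_parquet(
    df=df,
    path="s3://my-bucket/prefix/",  # placeholder path
    dataset=True,
    dtype={"billingaddress": "string"},
)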

How to Reproduce

At first I tried to pre-validate the dataframe before passing it to the method, but that only helped partially:

dtype = {}
for column in data.columns:
    if data[column].isnull().all():
        dtype[column] = "string"
data = data.astype(dtype)

I then discovered that a dataframe column the method reported as “completely empty” actually contains a few values, say 3 values per 500 dataframe records.

It seems that the method only analyzes a limited sample of rows to decide whether a column is completely empty, so if a column has only a few non-null values there is a high probability it will be treated as completely empty.
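
A quick check like the following is enough to confirm whether a column reported as completely empty really contains values (the column name is the one from the error above, and the counts reflect my case of roughly 3 values per 500 rows):

# Count non-null values in the column awswrangler reported as completely empty
print(data["billingaddress"].notna().sum())  # a small non-zero count, e.g. 3
print(len(data))                             # ~500 rows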

The following code is what I ended up using as a working workaround, with the downside that the data type is rigidly forced to string.

def _validate_dtypes(self, data: pd.DataFrame) -> pd.DataFrame:
    # awswrangler cannot infer the data type of an entirely (or mostly) null
    # column, so we need to set the data type for such columns manually

    dtype = {}

    for column in data.columns:
        non_null_count = data[column].notnull().sum()
        total_count = len(data[column])

        if non_null_count == 0:
            dtype[column] = "string"
        # awswrangler also treats a column that has only a few non-null values as empty;
        # this is a "less than 10% non-null values" workaround
        elif non_null_count / total_count < 0.1:
            dtype[column] = "string"

    data = data.astype(dtype)

    return data
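
For context, here is roughly how the helper above is used before writing; the bucket path, database, and table names below are placeholders, not part of the original report:

# Hypothetical usage of the workaround above before writing to S3
data = self._validate_dtypes(data)

wr.s3.to_parquet(
    df=data,
    path="s3://my-bucket/my-table/",  # placeholder path
    dataset=True,
    database="my_database",           # placeholder Glue database
    table="my_table",                 # placeholder Glue table
)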

I would appreciate a fix for this problem.
Thank you for the great tool!

Expected behavior

No response

Your project

No response

Screenshots

No response

OS

Linux

Python version

3.9

AWS SDK for pandas version

awswrangler-3.9.1-py3-none-any.whl

Additional context

No response

coldenol added the bug (Something isn't working) label on Sep 27, 2024

Tian-2017 commented Oct 5, 2024

Hey Denis, sounds like a plan. I encountered a similar problem and was wondering if this logic could be incorporated into the _df_to_table function here.

However, it might not be necessary to build a dtype = {} mapping. Instead, the logic could handle only columns whose dtype is 'object', to improve efficiency.

for column in df.select_dtypes(include=['object']).columns:
    # Treat a column as effectively empty if it is all null or less than 10% non-null,
    # and default it to 'string' so type inference does not fail
    non_null_ratio = df[column].notnull().sum() / len(df) if len(df) else 0
    if non_null_ratio < 0.1:
        df[column] = df[column].astype('string')
