wr.s3.to_parquet() incorrectly identifies empty columns #2974

Open
coldenol opened this issue Sep 27, 2024 · 1 comment
Labels: bug (Something isn't working)

Comments


Describe the bug

When I try to write a dataframe using the wr.s3.to_parquet() method, it often throws the following error for some columns:

UndetectedType Impossible to infer the equivalent Athena data type for the billingaddress column. It is completely empty (only null values) and has a too generic data type (object). Please, cast this columns with a more deterministic data type (e.g. df['billingaddress'] = df['billingaddress'].astype('string')) or pass the column schema as argument(e.g. dtype={'billingaddress': 'string'}
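
For reference, the two fixes the error message suggests look roughly like the sketch below; the sample dataframe and bucket path are placeholder assumptions, not taken from my actual data:

import awswrangler as wr
import pandas as pd

# Placeholder dataframe with an all-null object column, mirroring the error above
df = pd.DataFrame({"id": [1, 2], "billingaddress": [None, None]})

# Option 1: cast the problematic column explicitly before writing
df["billingaddress"] = df["billingaddress"].astype("string")

# Option 2: pass the column schema via the dtype argument (requires dataset=True)
wr.s3.to_parquet(
    df=df,
    path="s3://my-bucket/prefix/",  # placeholder path
    dataset=True,
    dtype={"billingaddress": "string"},
)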

How to Reproduce

At first I tried to pre-validate the dataframe before passing it to the method, but that only helped partially:

dtype = {}
for column in data.columns:
    if data[column].isnull().all():
        dtype[column] = "string"
data = data.astype(dtype)

I then discovered that a dataframe column the method reported as “completely empty” actually contains a few values, say 3 values per 500 dataframe records.

It seems that the method only analyzes a limited sample of rows to decide whether a column is completely empty, so if a column has only a few non-null values there is a high probability it will be treated as completely empty.
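
A quick check like the following is enough to confirm whether a column reported as completely empty really contains values (the column name is the one from the error above, and the counts reflect my case of roughly 3 values per 500 rows):

# Count non-null values in the column awswrangler reported as completely empty
print(data["billingaddress"].notna().sum())  # a small non-zero count, e.g. 3
print(len(data))                             # ~500 rows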

The following code is what I ended up using as a working workaround, with the downside that the data type is rigidly forced to string.

def _validate_dtypes(self, data: pd.DataFrame) -> pd.DataFrame:
    # awswrangler cannot infer the data type of an entirely (or mostly) null
    # column, so we need to set the data type for such columns manually

    dtype = {}

    for column in data.columns:
        non_null_count = data[column].notnull().sum()
        total_count = len(data[column])

        if non_null_count == 0:
            dtype[column] = "string"
        # awswrangler also treats a column that has only a few non-null values as empty;
        # this is a "less than 10% non-null values" workaround
        elif non_null_count / total_count < 0.1:
            dtype[column] = "string"

    data = data.astype(dtype)

    return data
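
For context, here is roughly how the helper above is used before writing; the bucket path, database, and table names below are placeholders, not part of the original report:

# Hypothetical usage of the workaround above before writing to S3
data = self._validate_dtypes(data)

wr.s3.to_parquet(
    df=data,
    path="s3://my-bucket/my-table/",  # placeholder path
    dataset=True,
    database="my_database",           # placeholder Glue database
    table="my_table",                 # placeholder Glue table
)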

I would appreciate a fix for this problem.
Thank you for the great tool!

Expected behavior

No response

Your project

No response

Screenshots

No response

OS

Linux

Python version

3.9

AWS SDK for pandas version

awswrangler-3.9.1-py3-none-any.whl

Additional context

No response

coldenol added the bug (Something isn't working) label on Sep 27, 2024

Tian-2017 commented Oct 5, 2024

Hey Denis, sounds like a plan. I encountered a similar problem and was wondering if this logic could be incorporated into the _df_to_table function here.

However, it might not be necessary to build a dtype = {} mapping. Instead, the logic could handle only columns whose dtype is 'object', to improve efficiency.

for column in df.select_dtypes(include=['object']).columns:
    # Treat a column as effectively empty if it is all null or less than 10% non-null,
    # and default it to 'string' so type inference does not fail
    non_null_ratio = df[column].notnull().sum() / len(df) if len(df) else 0
    if non_null_ratio < 0.1:
        df[column] = df[column].astype('string')
