Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: Series.to_dummies raises for PyArrow when nulls are present #1037

Open
MarcoGorelli opened this issue Sep 21, 2024 · 11 comments · May be fixed by #1040
Open

[Bug]: Series.to_dummies raises for PyArrow when nulls are present #1037

MarcoGorelli opened this issue Sep 21, 2024 · 11 comments · May be fixed by #1040

Comments

@MarcoGorelli
Copy link
Member

Describe the bug

Calling to_dummies on a Series with nulls present raises an error

Steps or code to reproduce the bug

# ruff: noqa

# todo:
# - Series.rename, as alias for Series.alias. it's OK

import pyarrow as pa
import narwhals as nw

df = nw.from_native(pa.table({'a': ['x', 'y', None]}), eager_only=True)
print(df['a'].to_dummies().to_native())

Expected results

pyarrow.Table
a_null: uint8
a_x: uint8
a_y: uint8
----
a_null: [[0,0,1]]
a_x: [[1,0,0]]
a_y: [[0,1,0]]

Actual results

Traceback (most recent call last):
  File "/home/marcogorelli/polars-api-compat-dev/t.py", line 11, in <module>
    print(df['a'].to_dummies().to_native())
          ^^^^^^^^^^^^^^^^^^^^
  File "/home/marcogorelli/polars-api-compat-dev/narwhals/series.py", line 2289, in to_dummies
    self._compliant_series.to_dummies(separator=separator, drop_first=drop_first),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/marcogorelli/polars-api-compat-dev/narwhals/_arrow/series.py", line 661, in to_dummies
    columns[da.indices, np.arange(len(da))] = 1
    ~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid in
dices

Please run narwhals.show_version() and enter the output below.

1.8.2

Relevant log output

No response

@MarcoGorelli MarcoGorelli added bug: incorrect result Something isn't working pyarrow labels Sep 21, 2024
@FBruzzesi
Copy link
Member

I am taking a look at this. Currently also pandas-like fails to add a column for none, yet without raising an error.

Yet hardest part is mapping column names and ordering of such 😂

@MarcoGorelli
Copy link
Member Author

MarcoGorelli commented Sep 21, 2024

😄 thanks for taking a look

🤔 i do think the Polars default output looks a bit odd here, and where I spotted this was in Formulaic, and they wouldn't include the "null" column in the output anyway. wondering if we should add a required drop_nulls keyword argument

@FBruzzesi FBruzzesi linked a pull request Sep 21, 2024 that will close this issue
10 tasks
@FBruzzesi
Copy link
Member

i do think the Polars default output looks a bit odd here,

Agree 👌

I opened the PR, but pandas behavior is so inconsistent and hard to remap. Tests are failing because for int values, adding the dummy_na=True flag (which adds a null column at the end, even if there aren't any) seems to trigger some casting to float. Therefore the column names change from a_1, a_2,... to a_1.0, a_2.0,....

Do you think that this should be reported upstream?

@MarcoGorelli
Copy link
Member Author

Upstream to Polars? Yes I think so that might be good

@FBruzzesi
Copy link
Member

FBruzzesi commented Sep 21, 2024

I meant the pandas behavior that converts int to float for dummy_na=True. Example:

import pandas as pd

data = [1, 2, 1]
pd.get_dummies(pd.Series([1,2,1]), dummy_na=True)
     1.0    2.0    NaN
0   True  False  False
1  False   True  False
2   True  False  False

while:

pd.get_dummies(pd.Series([1,2,1]))
       1      2
0   True  False
1  False   True
2   True  False

What I mean is that by adding the extra argument, the name of the first two columns will change as well even though no nulls/nans are present.

This does not happen with nullable types. I am not sure how we should go about it honestly.

@MarcoGorelli
Copy link
Member Author

ah i see - well pandas allows for column names to be floats, so I don't think pandas would consider this a bug nor something that needs changing

the string "null" in Polars on the other hand...maybe that should at least be configurable, with a dummy_null argument? If not in Polars, we could at least add it here, given how problematic it seems to be to have dummy_null=True behave consistently across libraries

@FBruzzesi
Copy link
Member

Edited the snippet above to better explain what I mean for pandas

@MarcoGorelli
Copy link
Member Author

thanks, but i still think the response would be "this is just what pandas does" (😭 )

@FBruzzesi
Copy link
Member

Alright, then let's discuss how we should behave 😂
As you can see in #1040 doctests are failing because we have ints in our example. We can either convert_dtypes or use strings in the example. However I would rather add some disclaimer note in the method docstring. WDYT?

@MarcoGorelli
Copy link
Member Author

i discussed this a bit with polars devs, they're open to having something like dummy_null in Polars

I think the Polars API should be revised further, because currently, if you get

shape: (3, 3)
┌──────┬──────┬───────┐
│ _bar ┆ _foo ┆ _null │
│ ---  ┆ ---  ┆ ---   │
│ u8   ┆ u8   ┆ u8    │
╞══════╪══════╪═══════╡
│ 0    ┆ 1    ┆ 0     │
│ 1    ┆ 0    ┆ 0     │
│ 0    ┆ 0    ┆ 1     │
└──────┴──────┴───────┘

then you have no way to know if it came from

pl.Series(['foo', 'bar', 'null']).to_dummies()

or

pl.Series(['foo', 'bar', None]).to_dummies()

I'm more inclined still to just follow the pandas default here (!), and loudly document that we slightly differ from Polars here (my prediction is that the Polars API will have to change for this function anyway)

@FBruzzesi
Copy link
Member

Thanks @MarcoGorelli , shall we wait for them to decide on the polars API? IIRC this was needed for formulaic, so you can set the pace for how soon we need this one out.

I added some commits because I figured out of to workaround pandas issue 😂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants