Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: dask namespace concat method #840

Merged
merged 11 commits into from
Aug 28, 2024
Merged

Conversation

benrutter
Copy link
Contributor

What type of PR is this? (check all applicable)

  • πŸ’Ύ Refactor
  • ✨ Feature
  • πŸ› Bug Fix
  • πŸ”§ Optimization
  • πŸ“ Documentation
  • βœ… Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below.

I've added in a concat method returning a DaskLazyFrame. One thing that wasn't tested for the concat_test was whether an error gets thrown trying to horizontally concat frames of different length. I'm not 100% sure what happens within polars/pandas when you do this, but dask will throw a confusing assertion error when you try to access anything within the dataframe, so I thought putting in a handy check here was worthwhile.

@github-actions github-actions bot added enhancement New feature or request internal labels Aug 21, 2024
Copy link
Member

@FBruzzesi FBruzzesi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey thanks for the PR, that was fast :)

I left a couple of comments, the main concern is about triggering computation on behalf of the user, which we would like to avoid as much as possible.

I would also double check the behaviour of the default join="outer" argument for dd.concat.

Finally, regarding the missing test, feel free to add it to get higher/better coverage of edge cases.

narwhals/_dask/namespace.py Outdated Show resolved Hide resolved
narwhals/_dask/namespace.py Outdated Show resolved Hide resolved
@benrutter
Copy link
Contributor Author

Cheers @FBruzzesi!

Think you're right around the behaviour of "outer" by default. In theory, I think it only comes into play on "vertical" merges (otherwise, mismatching indexes throw errors) where columns are checked to be matching already, but that sounds a little like an error waiting to happen.

Just looked at adding another test, but looks like its totally fine to horizontally concatenate different length frames outside of dask. For now, I've taken out the check for matching lengths- it'll avoid any computing for figuring out length, and pass over handling the error to dask, which is probably preferable? (let me know if I'm missing a trick on how to check concats are likely to be valid though)

@benrutter
Copy link
Contributor Author

Also, not very related, but quick question on tests! Looks like ubuntu tests are failing at the moment because they require 100% coverage and there's something like 99.7% (macos requires 90% and windows 95%).

Is it intentionally that they're all different? If it's be handy I can bundle updating to make them all match in with this PR.

@FBruzzesi
Copy link
Member

Hey sorry I am from mobile, so this could be a sloppy comment. I will try to do my best.

Think you're right around the behaviour of "outer" by default. In theory, I think it only comes into play on "vertical" merges (otherwise, mismatching indexes throw errors) where columns are checked to be matching already, but that sounds a little like an error waiting to happen.

Alright, to me the join keyword resonates more on a horizontal concat but apparently not. I am happy to keep the check on the column names.

Just looked at adding another test, but looks like its totally fine to horizontally concatenate different length frames outside of dask. For now, I've taken out the check for matching lengths- it'll avoid any computing for figuring out length, and pass over handling the error to dask, which is probably preferable? (let me know if I'm missing a trick on how to check concats are likely to be valid though)

Maybe we can just let dask run its behavior without manually checking for dataframe lengths. This is what we are doing for pandas, yet I would try to see how Polars LazyFrame behaves (i.e. if it throws an error, and if it does it most likely happens only at collect time).

@benrutter
Copy link
Contributor Author

Thanks @FBruzzesi, comment is really helpful! At the moment, the changes I've made since my initial commit (have squashed the commits together for neatness):

  • keyword argument of "inner" now passed into dd.concat - I think this is the desired behaviour?
  • I've taken out length checks to avoid unnecessary compute and just let dask do its own thing with checks
  • Extra tweak: I also swapped the import dask.dataframe as dd # ignore-banned-import for dd = get_dask_dataframe() (I think that's better practice maybe?)

On the join="inner" vs join="outer" behaviour, where you thinking around any specific cases? I can add some extra tests in if there's an example that you have in mind.

@FBruzzesi
Copy link
Member

FBruzzesi commented Aug 22, 2024

  • keyword argument of "inner" now passed into dd.concat - I think this is the desired behaviour?

Thanks! I am still a bit unsure about the difference. I can run some tests later today or most likely tomorrow, unless you have some examples ready to show the difference in behavior.

  • I've taken out length checks to avoid unnecessary compute and just let dask do its own thing with checks

I can still see the check using len?!

  • Extra tweak: I also swapped the import dask.dataframe as dd # ignore-banned-import for dd = get_dask_dataframe() (I think that's better practice maybe?)

We recently changed all those (check #788), so I would suggest to keep it as before:

- dd = get_dask_dataframe()
+ import dask.dataframe as dd  # ignore-banned-import

On the join="inner" vs join="outer" behaviour, where you thinking around any specific cases? I can add some extra tests in if there's an example that you have in mind.

Back to the first point

@benrutter
Copy link
Contributor Author

benrutter commented Aug 22, 2024

Well that's embarrassing! Looks like my "neat squashing" was just a "neat deleting a commit" for the validation check removal 🫠 All taken out now.

image

Ok, I've done some playing around and actually looks like dask can concatenate on axis 1 with different lengths anyway (I don't know if this works with expression based columns though, definitely was getting an assertion error previously). That's pretty handy because it makes the outer/inner behaviour really clear for both!

I'll explain them here, and hopefully you can give me a steer on the desired Narwhals behaviour? (I think I know but not 100% sure)

import dask.dataframe

left = dd.from_dict({"a": [1, 2], "b": [1, 2]}, npartitions=1)
right = dd.from_dict({"a": [1, 2, 3], "b": [1, 2, 3], "c": [1, 2, 3]}, npartitions=1)

# inner on axis 0 (vertical)
dd.concat([right, left], axis=0, join="inner").compute()
#   a b
# 0 1 1
# 1 2 2
# 2 3 3
# 0 1 1
# 1 2 2

# outer on axis 0 (vertical)
dd.concat([right, left], axis=0, join="outer").compute()
#   a b c
# 0 1 1 1.0
# 1 2 2 2.0
# 2 3 3 3.0
# 0 1 1 NaN
# 1 2 2 NaN


# inner on axis 1 (horizontal)
dd.concat([right, left], axis=1, join="inner").compute()
#   a b c a b
# 0 1 1 1 1 1
# 1 2 2 2 2 2

# outer on axis 1 (horizontal)
dd.concat([right, left], axis=1, join="outer").compute()
#   a b c a   b
# 0 1 1 1 1.0 1.0
# 1 2 2 2 2.0 2.0
# 2 3 3 3 NaN Nan

Hopefully that makes it a little clearer? Think I muddied the water a little by being confused before on the behaviour for axis=1, sorry!

Looking at the other examples in narwhals I think that when axis=0 (vertical concat) we'd want join=inner and when 'axis=1' (horizontal concat) we'd want join=outer?

I've implemented the above behaviour for now - would appreciate feedback on if that's logical though! πŸ™

@FBruzzesi
Copy link
Member

FBruzzesi commented Aug 22, 2024

Awesome, that makes it much more clear and saved me a bunch of time, thanks a lot.

I am checking polars LazyFrame behavior, which is what we should aim to replicate:

Edit:

TL;DR

  • how="vertical" behavior should be easy to replicate
  • how="horizontal" we could end up with a stricter behavior for dask than Polars LazyFrame, but I think that's still ok.

how="vertical"

import polars as pl

data1 = {"a": [1, 2, 3], "b": [1, 2, 3], "c": [1, 2, 3]}
data2 = {"a": [1, 2], "b": [1, 2]}

df1 = pl.LazyFrame(data1)
df2 = pl.LazyFrame(data2)

df = pl.concat([df1, df2], how="vertical")  # this doesn't raise just yet, but then when we collect
df.collect()

ShapeError: unable to append to a DataFrame of width 3 with a DataFrame of width 2

So I think we can raise early if columns are not same for vertical, exactly how you are already doing. Once the check is done, join keyword does not matter anymore since the columns will already be the same.

Edit: Actually the column order also matter in polars, i.e. the following fails:

data1 = {"a": [1, 2, 3], "b": [1, 2, 3]}
data2 = {"b": [1, 2], "a": [1, 2]}

df1 = pl.LazyFrame(data1)
df2 = pl.LazyFrame(data2)

df = pl.concat([df1, df2], how="vertical").collect()

ShapeError: unable to vstack, column names don't match: "a" and "b"

how="horizontal"

First check is done on the duplicate column names, before collect time:

data1 = {"a": [1, 2, 3], "b": [1, 2, 3]}
data2 = {"b": [1, 2], "a": [1, 2]}

df1 = pl.LazyFrame(data1)
df2 = pl.LazyFrame(data2)

df = pl.concat([df1, df2], how="horizontal")

DuplicateError: Column with name 'a' has more than one occurrence

This such check should be fairly easy to achieve in dask as well.

If the column names are ok, then, actually the different length does not matter:

data1 = {"a": [1, 2, 3], "b": [1, 2, 3]}
data2 = {"c": [1, 200], "d": [1, 2]}

df1 = pl.LazyFrame(data1)
df2 = pl.LazyFrame(data2)

df = pl.concat([df1, df2], how="horizontal")
df.collect()
shape: (3, 4)
β”Œβ”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”
β”‚ a   ┆ b   ┆ c    ┆ d    β”‚
β”‚ --- ┆ --- ┆ ---  ┆ ---  β”‚
β”‚ i64 ┆ i64 ┆ i64  ┆ i64  β”‚
β•žβ•β•β•β•β•β•ͺ═════β•ͺ══════β•ͺ══════║
β”‚ 1   ┆ 1   ┆ 1    ┆ 1    β”‚
β”‚ 2   ┆ 2   ┆ 200  ┆ 2    β”‚
β”‚ 3   ┆ 3   ┆ null ┆ null β”‚
β””β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”˜

This behavior I was not able to replicate in dask, meaning that at compute time if the length is different I am getting an AssertionError with whatever parameter I am trying to play with.

I think this is ok as we can be more restrictive than polars, but I would like to end up with the same result (cc: @MarcoGorelli )

Hope this helps

Final edit: I was not able to fully replicate what you have done @benrutter (I am working with dask '2024.6.2')

Update narwhals/_dask/namespace.py

Co-authored-by: Francesco Bruzzesi <[email protected]>

import change

inner kwarg

doh!

dynamic join

dumb typo

validation
@benrutter
Copy link
Contributor Author

benrutter commented Aug 23, 2024

Thanks @FBruzzesi that's very helpful! I've made a couple more changes off the basis of that:

  • Changed the validation to check for order as well as column names matching
  • Added validation for horizontal concat to ensure column names aren't duplicated

I've tested adding in some new tests, which I this would be ideal, here's the tests I added:

# extra condition for vertical concat to check that column order is required
with pytest.raises((Exception, TypeError)):
    reversed_data_right = dict(reversed(data_right.items()))
    reversed_df_right = nw.from_native(constructor(reversed_data_right)).lazy()
    nw.concat([df_left, reversed_df_right], how="vertical").collect()

Only that actually doesn't throw an error for most of them. And then the same again with:

# extra condition to check that duplicate columns are banned
    with pytest.raises(Exception, match="occurence"):
        duplicate_columns = nw.from_native(constructor(data_right | data))
        nw.concat([df_left, df_right], how="horizontal")

Which also doesn't throw an error. I'm down to try and address this for other dataframe types if Narwhals want to be strict on those conditions, but I guess it'd probably be better as a seperate issue/pr, so I've just left out those tests for now.

Edit: I think you found the reason I was seeing inconsistent behaviour with dask! Initially I was trying on the latest version, but the examples I shared where from an installation of 2024.2 before expressions was opt-out.

Copy link
Member

@FBruzzesi FBruzzesi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adjusting on all fronts and already planning some further tests.

I commented on some error raise since:

  • in other dataframe we raise AssertionError for empty list of items
  • other errors are certainly not type related, so we can keep AssertionError for now.

I think we can merge after that :)

narwhals/_dask/namespace.py Outdated Show resolved Hide resolved
narwhals/_dask/namespace.py Outdated Show resolved Hide resolved
narwhals/_dask/namespace.py Outdated Show resolved Hide resolved
narwhals/_dask/namespace.py Outdated Show resolved Hide resolved
@benrutter
Copy link
Contributor Author

Awesome thanks @FBruzzesi - I've made those changes (as an FYI I also raised an issue with over on dask/dask to check around the expected behaviour for concatenating on axis=1 dask/dask#11343)

Copy link
Member

@FBruzzesi FBruzzesi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logic is 100% in place, there are a couple of issues with coverage for:

  • the new checks we discussed
  • the variables introduced for mypy type hints

I commented on how to deal with these cases - as soon as all CI is green we are ready to merge

narwhals/_dask/namespace.py Outdated Show resolved Hide resolved
narwhals/_dask/namespace.py Outdated Show resolved Hide resolved
narwhals/_dask/namespace.py Outdated Show resolved Hide resolved
Copy link
Member

@FBruzzesi FBruzzesi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ready to ship it? πŸš€ Thanks @benrutter

image

@FBruzzesi FBruzzesi merged commit a0af8ee into narwhals-dev:main Aug 28, 2024
21 checks passed
@MarcoGorelli
Copy link
Member

nice one, thanks all! πŸ™Œ

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants