replicate pandas ngroup behavior #19481

Open
hutch3232 opened this issue Oct 27, 2024 · 0 comments
Labels
enhancement New feature or an improvement of an existing feature

Description

Loving polars and trying to migrate all of my pandas code over. This FR would simplify one aspect of my migration.

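It would be very nice to have functionality like pandas's ngroup method, which numbers the groups from 0 to (number of groups - 1) and gives every row the number of its group.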

A couple of posts where users ask how to accomplish this:
https://stackoverflow.com/questions/75483708/replicate-pandas-ngroup-behaviour-in-polars
https://stackoverflow.com/questions/74600568/compute-pandas-n-group-in-polars-and-assign-new-id

To be fair, there already is a viable approach to this, but it is more complicated:

import pandas as pd
import polars as pl

data = {
    'customer_id': ['001', '002', '001', '002', '002', '003', '001', '003', '002'],
    'other_column': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
}
df = pl.DataFrame(data)

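df = (
    df
    # add a 0-based row number column named "index"
    .with_row_index()
    .with_columns(
        # first row number within each customer_id group, broadcast to every row;
        # dense-ranking those values and subtracting 1 gives 0-based group ids
        group_index = pl
        .first("index")
        .over("customer_id")
        .rank("dense") - 1
    )
)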

print(df)
┌───────┬─────────────┬──────────────┬─────────────┐
│ index ┆ customer_id ┆ other_column ┆ group_index │
│ ---   ┆ ---         ┆ ---          ┆ ---         │
│ u32   ┆ str         ┆ str          ┆ i64         │
╞═══════╪═════════════╪══════════════╪═════════════╡
│ 0     ┆ 001         ┆ a            ┆ 0           │
│ 1     ┆ 002         ┆ b            ┆ 1           │
│ 2     ┆ 001         ┆ c            ┆ 0           │
│ 3     ┆ 002         ┆ d            ┆ 1           │
│ 4     ┆ 002         ┆ e            ┆ 1           │
│ 5     ┆ 003         ┆ f            ┆ 2           │
│ 6     ┆ 001         ┆ g            ┆ 0           │
│ 7     ┆ 003         ┆ h            ┆ 2           │
│ 8     ┆ 002         ┆ i            ┆ 1           │
└───────┴─────────────┴──────────────┴─────────────┘

df = pd.DataFrame(data)

df["group_index"] = df.groupby("customer_id").ngroup()

print(df)
  customer_id other_column  group_index
0         001            a            0
1         002            b            1
2         001            c            0
3         002            d            1
4         002            e            1
5         003            f            2
6         001            g            0
7         003            h            2
8         002            i            1
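
One detail worth spelling out: the polars workaround above numbers groups in order of first appearance, while pandas's ngroup by default numbers them in the order of the sorted group keys (passing sort=False to groupby should give first-appearance order instead). The two agree for this data only because '001', '002', '003' happen to first appear in sorted order. A minimal plain-Python sketch of both numberings, where the function names and the `shuffled` list are hypothetical illustrations and not part of either library:

```python
def ngroup_first_seen(keys):
    """Number each group by the order in which its key first appears."""
    ids = {}
    # setdefault stores the next unused id the first time a key is seen
    return [ids.setdefault(k, len(ids)) for k in keys]


def ngroup_sorted(keys):
    """Number each group by the position of its key among the sorted distinct keys."""
    ids = {k: i for i, k in enumerate(sorted(set(keys)))}
    return [ids[k] for k in keys]


customer_id = ['001', '002', '001', '002', '002', '003', '001', '003', '002']
print(ngroup_first_seen(customer_id))  # [0, 1, 0, 1, 1, 2, 0, 2, 1]
print(ngroup_sorted(customer_id))      # [0, 1, 0, 1, 1, 2, 0, 2, 1]

# Keys that do not first appear in sorted order make the two numberings differ.
shuffled = ['b', 'a', 'b', 'c', 'a']
print(ngroup_first_seen(shuffled))  # [0, 1, 0, 2, 1]
print(ngroup_sorted(shuffled))      # [1, 0, 1, 2, 0]
```

A built-in polars equivalent would need to pick one of these orderings, or expose both.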

My use case is prepping data in polars, converting it with to_torch(), and then doing what are essentially group-by sums with torch.scatter_add, which requires an index to sum over: https://pytorch.org/docs/stable/generated/torch.scatter_add.html
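
To make the use case concrete, here is a plain-Python stand-in for what a 1-D scatter-add into a zero-initialised output does with such an index: each src[i] is accumulated into out[index[i]]. It deliberately avoids importing torch; the helper name scatter_add_1d is hypothetical, and the values list is made-up illustration data, not from the example above.

```python
def scatter_add_1d(index, src, num_groups):
    """Accumulate src[i] into out[index[i]], starting from an all-zero output."""
    out = [0.0] * num_groups
    for i, g in enumerate(index):
        out[g] += src[i]
    return out


# group ids as produced by ngroup for the customer_id column
group_index = [0, 1, 0, 1, 1, 2, 0, 2, 1]
# hypothetical per-row values to be summed within each group
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]

# ids are contiguous and 0-based, so the output needs max id + 1 slots
sums = scatter_add_1d(group_index, values, max(group_index) + 1)
print(sums)  # [11.0, 20.0, 14.0]
```

This is why contiguous, 0-based group ids matter: they double as positions in the output. With real tensors the corresponding call would presumably look like torch.zeros(num_groups).scatter_add(0, index, src), with index as an int64 tensor.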

hutch3232 added the enhancement label on Oct 27, 2024