replicate `pandas` `ngroup` behavior #19481

hutch3232 · 2024-10-27T15:34:55Z

Description

I'm loving polars and am trying to migrate all of my pandas code over. This feature request would simplify one aspect of that migration.

It would be very nice to have functionality like pandas's `ngroup` method, which assigns each row the integer number of the group it belongs to.

A couple posts where users are asking how to accomplish this:
https://stackoverflow.com/questions/75483708/replicate-pandas-ngroup-behaviour-in-polars
https://stackoverflow.com/questions/74600568/compute-pandas-n-group-in-polars-and-assign-new-id

To be fair, there already is a viable approach to this, but it is more complicated:

import pandas as pd
import polars as pl

data = {
    'customer_id': ['001', '002', '001', '002', '002', '003', '001', '003', '002'],
    'other_column': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
}
df = pl.DataFrame(data)

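# Take each group's first row index, then dense-rank those values to get
# consecutive 0-based group numbers (ordered by first appearance).
df = (
    df
    .with_row_index()
    .with_columns(
        group_index=pl.first("index").over("customer_id").rank("dense") - 1
    )
)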

print(df)
┌───────┬─────────────┬──────────────┬─────────────┐
│ index ┆ customer_id ┆ other_column ┆ group_index │
│ ---   ┆ ---         ┆ ---          ┆ ---         │
│ u32   ┆ str         ┆ str          ┆ i64         │
╞═══════╪═════════════╪══════════════╪═════════════╡
│ 0     ┆ 001         ┆ a            ┆ 0           │
│ 1     ┆ 002         ┆ b            ┆ 1           │
│ 2     ┆ 001         ┆ c            ┆ 0           │
│ 3     ┆ 002         ┆ d            ┆ 1           │
│ 4     ┆ 002         ┆ e            ┆ 1           │
│ 5     ┆ 003         ┆ f            ┆ 2           │
│ 6     ┆ 001         ┆ g            ┆ 0           │
│ 7     ┆ 003         ┆ h            ┆ 2           │
│ 8     ┆ 002         ┆ i            ┆ 1           │
└───────┴─────────────┴──────────────┴─────────────┘

# pandas equivalent: ngroup() numbers the groups directly.
df = pd.DataFrame(data)
df["group_index"] = df.groupby("customer_id").ngroup()

print(df)
  customer_id other_column  group_index
0         001            a            0
1         002            b            1
2         001            c            0
3         002            d            1
4         002            e            1
5         003            f            2
6         001            g            0
7         003            h            2
8         002            i            1

My use case is passing polars-prepped data through to_torch() and then computing what are essentially group-by sums with torch.scatter_add, which requires an index to sum over: https://pytorch.org/docs/stable/generated/torch.scatter_add.html
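To make the use case concrete, here is a minimal plain-Python sketch of how a group index drives a scatter-add style sum. The helper names `ngroup_first_seen` and `scatter_add_1d` and the `values` column are hypothetical and only for illustration. `scatter_add_1d` is a stand-in for what `torch.scatter_add` does along one dimension, not the torch implementation. Groups are numbered in order of first appearance, as in the polars workaround above. pandas sorts group keys by default, which happens to give the same numbering for this data.

```python
customer_id = ['001', '002', '001', '002', '002', '003', '001', '003', '002']
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]  # hypothetical numeric column


def ngroup_first_seen(keys):
    """Number each key by the order in which its group first appears (0-based)."""
    ids = {}
    # setdefault stores len(ids) only for unseen keys, so ids grow 0, 1, 2, ...
    return [ids.setdefault(key, len(ids)) for key in keys]


def scatter_add_1d(index, src, n_groups):
    """Add src[i] into out[index[i]], mimicking a 1-D scatter-add."""
    out = [0.0] * n_groups
    for i, value in zip(index, src):
        out[i] += value
    return out


group_index = ngroup_first_seen(customer_id)
sums = scatter_add_1d(group_index, values, max(group_index) + 1)

print(group_index)  # [0, 1, 0, 1, 1, 2, 0, 2, 1]
print(sums)         # [11.0, 20.0, 14.0]
```

With tensors, the same idea would look roughly like `torch.zeros(n_groups).scatter_add(0, index, src)`, where `index` is an int64 tensor built from the group index column.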

hutch3232 added the enhancement label on Oct 27, 2024