nth function/method for groupby #3128

Open
samukweku opened this issue Aug 2, 2021 · 3 comments · May be fixed by #3404

Comments

@samukweku
Contributor

samukweku commented Aug 2, 2021

It would be useful to have an nth function, similar to pandas' groupby nth, whose selected row or rows can be combined with other aggregations when by is present:

from datatable import dt, f, by

data = {'x': [1, 1, 1, 2, 2, 3],
        'y': [1, 2, 3, 1, 2, 1],
        'n': [3, 2, 1, 1, 2, 1]}

DT = dt.Frame(data)

DT

   |     x      y      n
   | int32  int32  int32
-- + -----  -----  -----
 0 |     1      1      3
 1 |     1      2      2
 2 |     1      3      1
 3 |     2      1      1
 4 |     2      2      2
 5 |     3      1      1
[6 rows x 3 columns]

The nth function/method should return the row or rows along with other aggregations:

DT[:, {'sum': f.y.sum(), 'nth' : f.y.nth(1)}, 'x']

   |     x    sum    nth
   | int32  int64  int32
-- + -----  -----  -----
 0 |     1      6      2
 1 |     2      3      2
 2 |     3      1     NA
[3 rows x 3 columns]
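The requested semantics can be sketched in plain Python, independent of datatable. This is only a reference model of the proposal, not datatable code: the helper name `nth_per_group` is hypothetical, `None` stands in for NA, and support for negative indices is an assumption that the issue does not state.

```python
from collections import defaultdict


def nth_per_group(keys, values, n):
    """Return {key: n-th value in that key's group, or None if it is absent}."""
    groups = defaultdict(list)
    for key, value in zip(keys, values):
        groups[key].append(value)
    # None plays the role of NA when a group is too short to have row n.
    # Accepting negative n (counting from the end) is an assumption.
    return {
        key: (vals[n] if -len(vals) <= n < len(vals) else None)
        for key, vals in groups.items()
    }


x = [1, 1, 1, 2, 2, 3]
y = [1, 2, 3, 1, 2, 1]

# Sum of y per group, computed alongside the n-th value.
sums = defaultdict(int)
for key, value in zip(x, y):
    sums[key] += value

nth = nth_per_group(x, y, 1)
rows = [(key, sums[key], nth[key]) for key in sums]
print(rows)  # [(1, 6, 2), (2, 3, 2), (3, 1, None)]
```

Under this model, `nth_per_group(x, y, 0)` behaves like first() and `nth_per_group(x, y, -1)` like last().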

One way to currently implement this would be to do a cbind:

dt.cbind(DT[:, {"sum": f.y.sum()}, "x"], DT[1, 'y', by('x', add_columns=False)], force=True)

which might not align row for row, since cbind matches rows by position rather than by the value of x; a much safer option would be a left join on x.
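The alignment risk can be illustrated in plain Python. The data below is hypothetical and differs from the frame above: it assumes the group x=1 has only one row, so that group has no second row and is missing from the nth result. Padding the shorter result by position (as cbind with force=True does) then attaches values to the wrong groups, while a lookup by key keeps them in place.

```python
from itertools import zip_longest

# Hypothetical per-group results, as (x, value) pairs.
sums = [(1, 5), (2, 3), (3, 15)]  # every group is present
nth = [(2, 2), (3, 8)]            # x=1 has no second row, so it is absent

# Positional combination: rows are paired by index and the shorter
# side is padded with None (standing in for NA).
positional = [
    (key, total, hit[1] if hit is not None else None)
    for (key, total), hit in zip_longest(sums, nth)
]

# Left join on the key: every group from `sums` is kept, and the
# nth value is looked up by x rather than by row position.
lookup = dict(nth)
joined = [(key, total, lookup.get(key)) for key, total in sums]

print(positional)  # [(1, 5, 2), (2, 3, 8), (3, 15, None)]  values shifted
print(joined)      # [(1, 5, None), (2, 3, 2), (3, 15, 8)]  values aligned
```

In the frame from this issue the short group happens to come last, so the positional result looks correct there by coincidence.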

The implementation in #2176 selects a single row or slice, and since datatable evaluates i before j, the results won't come out right when combined with aggregations. Also, sequences are not currently accepted in i; it would be nice if the nth function could select a list of rows.

@st-pasha
Contributor

st-pasha commented Aug 2, 2021

So, this is supposed to be similar to first() or last(), but instead it selects the "n-th" row?

@samukweku
Contributor Author

@st-pasha, yes. It can be any row, and if a group has no such row, NA is returned for that group.

@samukweku
Contributor Author

WIP

This was referenced Jan 3, 2023