Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] nth function #3404

Open
wants to merge 21 commits into
base: main
Choose a base branch
from
Open

[ENH] nth function #3404

wants to merge 21 commits into from

Conversation

samukweku
Copy link
Contributor

Implement dt.nth(cols, n) function to return the nth row (also per group) for the specified columns. If n goes out of bounds, NA-row is returned.

Closes #3128

@samukweku samukweku added the new feature Feature requests for new functionality label Jan 3, 2023
@samukweku samukweku self-assigned this Jan 3, 2023
@oleksiyskononenko
Copy link
Contributor

Thanks @samukweku. So the issue on this PR is that skipna doesn't work? Why tests fail with "got an unexpected keyword argument n"?

@samukweku samukweku changed the title Samukweku/nth function [ENH] nth function Jan 4, 2023
@oleksiyskononenko
Copy link
Contributor

oleksiyskononenko commented Jan 10, 2023

I guess this implementation has an issue even with no skipna. Look at the example below

from datatable import dt, f, by
DT = dt.Frame([1, 1, 2])
DT[:, f.C0.nth(0), by(f.C0)]

Triggers

AssertionError: Assertion 'i < nrows()' failed in src/core/column.cc, line 236

UPD: I guess this issue has never been fixed: #3346 (comment)
I propose not to close PR's when there is something wrong with your commits, just let me know and we work together on fixing it. Otherwise, we loose a conversation and some issues like the one above.

@oleksiyskononenko
Copy link
Contributor

As for the skipna, I've been looking into this and it seems that if we just pass the validity mask to Nth_ColumnImpl we could have issues with negative n. Because to figure out which row is the row -X, we first need to establish all the other row ids (skipping missings) and then calculate the X'th row from the end.

So what we need to do is, first, to apply the validity mask to the original frame (filtering out "all" or "any" missings).
However, since the rows are already grouped, we may need to adjust the group by...

@samukweku
Copy link
Contributor Author

my bad ... I am lost on the explanation regarding skipna

@samukweku
Copy link
Contributor Author

@oleksiyskononenko /home/sam/datatable/docs/api/fexpr/nth.rst: WARNING: document isn't included in any toctree where is the toctree?

@oleksiyskononenko
Copy link
Contributor

@samukweku ah, you just need to add nth() function to the api, just like we do for all the other functions, see index-api.rst.

@samukweku
Copy link
Contributor Author

@oleksiyskononenko what do you suggest is the way forward for this PR? drop the skipna parameter and let the user handle it instead? Do you mind showing me an implementation for skipna assuming a positive n, so I can understand how FExpr_all and any would be used

@samukweku
Copy link
Contributor Author

Looking further at pandas' implementation of dropna, the dropna is applied to the entire dataframe (including the groupby column), before the nth function is applied. What we should be doing for nth and any other relevant function is to apply this per column; if the user wants it per row, that should be applied in the i section as a filter, same way we'd do if we were removing nulls from the dataframe:

Using the example from pandas' home page

from datatable import dt, f, by
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 1, 2],
                                   'B': [np.nan, 2, 3, 4, 5]}, columns=['A', 'B'])
g = df.groupby('A')

DT = dt.Frame(df)

In [5]: df
Out[5]: 
   A    B
0  1  NaN
1  1  2.0
2  2  3.0
3  1  4.0
4  2  5.0

In [6]: DT
Out[6]: 
   |     A        B
   | int64  float64
-- + -----  -------
 0 |     1       NA
 1 |     1        2
 2 |     2        3
 3 |     1        4
 4 |     2        5
[5 rows x 2 columns]

without skipping nulls:

In [7]: g.nth(0)
Out[7]: 
     B
A     
1  NaN
2  3.0

In [9]: DT[:, dt.nth(f.B,0), 'A']
Out[9]: 
   |     A        B
   | int64  float64
-- + -----  -------
 0 |     1       NA
 1 |     2        3
[2 rows x 2 columns]

In [10]: DT[:, dt.nth(f.B,1), 'A']
Out[10]: 
   |     A        B
   | int64  float64
-- + -----  -------
 0 |     1        2
 1 |     2        5
[2 rows x 2 columns]

In [11]: g.nth(1)
Out[11]: 
     B
A     
1  2.0
2  5.0

In [12]: g.nth(-1)
Out[12]: 
     B
A     
1  4.0
2  5.0

In [13]: DT[:, dt.nth(f.B,-1), 'A']
Out[13]: 
   |     A        B
   | int64  float64
-- + -----  -------
 0 |     1        4
 1 |     2        5
[2 rows x 2 columns]

skipping nulls:

In [14]: g.nth(0, dropna='any')
Out[14]: 
     B
A     
1  2.0
2  3.0

In [21]: g.nth(0, dropna='all')
Out[21]: 
     B
A     
1  NaN
2  3.0

In [42]: DT[~((f[:]==None).rowall()), :][:, dt.nth(f.B, 0), f.A]
Out[42]: 
   |     A        B
   | int64  float64
-- + -----  -------
 0 |     1       NA
 1 |     2        3
[2 rows x 2 columns]

In [43]: DT[~((f[:]==None).rowany()), :][:, dt.nth(f.B, 0), f.A]
Out[43]: 
   |     A        B
   | int64  float64
-- + -----  -------
 0 |     1        2
 1 |     2        3
[2 rows x 2 columns]

In [54]: g.nth(3, dropna='any')
Out[54]: 
    B
A    
1 NaN
2 NaN

In [55]: DT[~((f[:]==None).rowany()), :][:, dt.nth(f.B, 3), f.A]
Out[55]: 
   |     A        B
   | int64  float64
-- + -----  -------
 0 |     1       NA
 1 |     2       NA
[2 rows x 2 columns]

It'd probably wont be a bad idea to implement a dropna function for use in the i section; a better option (which I hope to get to sometime) is to implement methods for ==, !=, >, ... which would make chaining easier, while making the computation clearer and cleaner to the eyes -> DT[~f[:].eq(None).rowany(),:].

That was a digression; my point is in Pandas, they treat nth as per row, we should treat it as per column; if the user wishes something per row, it should go into the i section (and rightly so).

Of course, another option would be via cumcount, and subsequent filtering:

In [51]: DT[:, f[:].extend(dt.cumcount()), f.A]
Out[51]: 
   |     A        B     C0
   | int64  float64  int64
-- + -----  -------  -----
 0 |     1       NA      0
 1 |     1        2      1
 2 |     1        4      2
 3 |     2        3      0
 4 |     2        5      1
[5 rows x 3 columns]

# fetch second row per column
In [52]: DT[:, f[:].extend(dt.cumcount()), f.A][f[-1]==1, :-1]
Out[52]: 
   |     A        B
   | int64  float64
-- + -----  -------
 0 |     1        2
 1 |     2        5
[2 rows x 2 columns]

# fetch first row per column
In [53]: DT[:, f[:].extend(dt.cumcount()), f.A][f[-1]==0, :-1]
Out[53]: 
   |     A        B
   | int64  float64
-- + -----  -------
 0 |     1       NA
 1 |     2        3
[2 rows x 2 columns]

Allowing the skipna per column allows us to also extend to first and last implementation to fetch first non-null value, if the user desires that.

@samukweku
Copy link
Contributor Author

@oleksiyskononenko revisiting the issue of skipna per column or rowwise ^^^^^^^^^^^^^^^

@samukweku
Copy link
Contributor Author

samukweku commented Mar 2, 2023

@oleksiyskononenko, @sh1ng , @st-pasha - need ur help with how to use FExpr_row functions -

Workframe inputs = arg_->evaluate_n(ctx);
      Grouping gmode = inputs.get_grouping_mode();
      colvec columns;
      size_t ncols = inputs.ncols();
      size_t nrows = 1;
      columns.reserve(ncols);
      for (size_t i = 0; i < ncols; ++i) {
        Column col = inputs.retrieve_column(i);
        xassert(i == 0 || nrows == col.nrows());
        nrows = col.nrows();
        columns.emplace_back(col);        
      }

      Column col_out = FExpr_RowAll::apply_function(std::move(columns), nrows, ncols);

Error message:

src/core/expr/fexpr_nth.cc: In member functiondt::expr::Workframe dt::expr::FExpr_Nth<SKIPNA>::evaluate_n(dt::expr::EvalContext&) const’:
src/core/expr/fexpr_nth.cc:77:52: error: cannot call member functionvirtual Column dt::expr::FExpr_RowAll::apply_function(colvec&&, dt::expr::size_t, dt::expr::size_t) constwithout object
   77 |       Column col_out = FExpr_RowAll::apply_function(std::move(columns), nrows, ncols);

what is the correct way of using FExpr_RowAll?

@samukweku
Copy link
Contributor Author

samukweku commented Apr 16, 2023

@oleksiyskononenko this is dependent on PR #3404 and #3444. once those PRS are concluded, I can pick up on this

@samukweku
Copy link
Contributor Author

@oleksiyskononenko figured out how to implement SKIPNA='any' or SKIPNA='all', similar to what pandas implements. If you've got some time to review, after the build completes. thanks

@samukweku
Copy link
Contributor Author

@oleksiyskononenko just checking in, waiting for your feedback

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
new feature Feature requests for new functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

nth function/method for groupby
2 participants