[ENH] nth function #3404
Thanks @samukweku. So the issue on this PR is that |
I guess this implementation has an issue even with no
from datatable import dt, f, by
DT = dt.Frame([1, 1, 2])
DT[:, f.C0.nth(0), by(f.C0)]
Triggers AssertionError: Assertion 'i < nrows()' failed in src/core/column.cc, line 236
UPD: I guess this issue has never been fixed: #3346 (comment) |
As for the So what we need to do is, first, to apply the validity mask to the original frame (filtering out "all" or "any" missings). |
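The two-step idea above (filter rows through a validity mask first, then take the nth row per group) can be sketched in plain Python. This is only an illustration of the semantics, not the datatable implementation: the function name `nth_skipna`, its parameters, and the sample rows are all hypothetical, and `None` stands in for a missing value.

```python
def nth_skipna(rows, key_index, n, skipna=None):
    """Hypothetical helper: nth row per group after an optional NA filter.

    rows      -- list of tuples; None marks a missing value
    key_index -- position of the grouping column inside each tuple
    skipna    -- None (keep every row), "any" or "all"
    """
    # Step 1: apply the validity mask to the whole frame.
    if skipna == "any":
        rows = [r for r in rows if not any(v is None for v in r)]
    elif skipna == "all":
        rows = [r for r in rows if not all(v is None for v in r)]
    elif skipna is not None:
        raise ValueError("skipna must be None, 'any' or 'all'")

    # Step 2: group the surviving rows, keeping their original order.
    groups = {}
    for r in rows:
        groups.setdefault(r[key_index], []).append(r)

    # Step 3: pick the nth row per group; out of bounds gives None.
    # Note: a group whose rows were all filtered out does not appear here.
    out = {}
    for key, grp in groups.items():
        idx = n if n >= 0 else len(grp) + n
        out[key] = grp[idx] if 0 <= idx < len(grp) else None
    return out


rows = [(1, None), (1, 2.0), (2, 3.0), (1, 4.0), (2, 5.0)]
first_any = nth_skipna(rows, 0, 0, skipna="any")
first_all = nth_skipna(rows, 0, 0, skipna="all")
```

With `skipna="any"` the row `(1, None)` is dropped before indexing, so group 1 starts at `(1, 2.0)`; with `skipna="all"` no row is entirely missing, so nothing is dropped.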
my bad ... I am lost on the explanation regarding skipna |
@oleksiyskononenko |
@samukweku ah, you just need to add |
@oleksiyskononenko what do you suggest is the way forward for this PR? drop the skipna parameter and let the user handle it instead? Do you mind showing me an implementation for skipna assuming a positive |
Looking further at pandas' implementation of Using the example from pandas' home page
from datatable import dt, f, by
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 2, 1, 2],
'B': [np.nan, 2, 3, 4, 5]}, columns=['A', 'B'])
g = df.groupby('A')
DT = dt.Frame(df)
In [5]: df
Out[5]:
A B
0 1 NaN
1 1 2.0
2 2 3.0
3 1 4.0
4 2 5.0
In [6]: DT
Out[6]:
| A B
| int64 float64
-- + ----- -------
0 | 1 NA
1 | 1 2
2 | 2 3
3 | 1 4
4 | 2 5
[5 rows x 2 columns]
without skipping nulls:
In [7]: g.nth(0)
Out[7]:
B
A
1 NaN
2 3.0
In [9]: DT[:, dt.nth(f.B,0), 'A']
Out[9]:
| A B
| int64 float64
-- + ----- -------
0 | 1 NA
1 | 2 3
[2 rows x 2 columns]
In [10]: DT[:, dt.nth(f.B,1), 'A']
Out[10]:
| A B
| int64 float64
-- + ----- -------
0 | 1 2
1 | 2 5
[2 rows x 2 columns]
In [11]: g.nth(1)
Out[11]:
B
A
1 2.0
2 5.0
In [12]: g.nth(-1)
Out[12]:
B
A
1 4.0
2 5.0
In [13]: DT[:, dt.nth(f.B,-1), 'A']
Out[13]:
| A B
| int64 float64
-- + ----- -------
0 | 1 4
1 | 2 5
[2 rows x 2 columns]
skipping nulls:
In [14]: g.nth(0, dropna='any')
Out[14]:
B
A
1 2.0
2 3.0
In [21]: g.nth(0, dropna='all')
Out[21]:
B
A
1 NaN
2 3.0
In [42]: DT[~((f[:]==None).rowall()), :][:, dt.nth(f.B, 0), f.A]
Out[42]:
| A B
| int64 float64
-- + ----- -------
0 | 1 NA
1 | 2 3
[2 rows x 2 columns]
In [43]: DT[~((f[:]==None).rowany()), :][:, dt.nth(f.B, 0), f.A]
Out[43]:
| A B
| int64 float64
-- + ----- -------
0 | 1 2
1 | 2 3
[2 rows x 2 columns]
In [54]: g.nth(3, dropna='any')
Out[54]:
B
A
1 NaN
2 NaN
In [55]: DT[~((f[:]==None).rowany()), :][:, dt.nth(f.B, 3), f.A]
Out[55]:
| A B
| int64 float64
-- + ----- -------
0 | 1 NA
1 | 2 NA
[2 rows x 2 columns]
It probably wouldn't be a bad idea to implement a dropna function for use in the That was a digression; my point is in Pandas, they treat Of course, another option would be via cumcount, and subsequent filtering:
In [51]: DT[:, f[:].extend(dt.cumcount()), f.A]
Out[51]:
| A B C0
| int64 float64 int64
-- + ----- ------- -----
0 | 1 NA 0
1 | 1 2 1
2 | 1 4 2
3 | 2 3 0
4 | 2 5 1
[5 rows x 3 columns]
# fetch second row per group
In [52]: DT[:, f[:].extend(dt.cumcount()), f.A][f[-1]==1, :-1]
Out[52]:
| A B
| int64 float64
-- + ----- -------
0 | 1 2
1 | 2 5
[2 rows x 2 columns]
# fetch first row per group
In [53]: DT[:, f[:].extend(dt.cumcount()), f.A][f[-1]==0, :-1]
Out[53]:
| A B
| int64 float64
-- + ----- -------
0 | 1 NA
1 | 2 3
[2 rows x 2 columns]
Allowing the skipna per column allows us to also extend to |
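The cumcount-and-filter alternative shown above can be sketched without datatable to make the mechanics explicit. This is a rough illustration only: the `cumcount` helper below is a hypothetical pure-Python stand-in for `dt.cumcount()`, and the lists mirror the `A` and `B` columns from the example, with `None` standing in for NA.

```python
def cumcount(keys):
    """Running 0-based position of each row within its group (hypothetical helper)."""
    seen = {}
    counts = []
    for k in keys:
        counts.append(seen.get(k, 0))   # how many rows of this group came before
        seen[k] = counts[-1] + 1
    return counts


A = [1, 1, 2, 1, 2]
B = [None, 2.0, 3.0, 4.0, 5.0]

positions = cumcount(A)

# Keep the rows whose within-group position equals the requested n.
first_rows = [(a, b) for a, b, c in zip(A, B, positions) if c == 0]
second_rows = [(a, b) for a, b, c in zip(A, B, positions) if c == 1]
```

Two differences from an nth-style function are worth keeping in mind with this approach: a negative `n` would need the group sizes as well, and a group shorter than `n + 1` rows simply contributes no row instead of an NA row.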
@oleksiyskononenko revisiting the issue of skipna per column or rowwise ^^^^^^^^^^^^^^^ |
d0dd1a8
to
f9a3234
Compare
@oleksiyskononenko, @sh1ng, @st-pasha - need your help with how to use
Workframe inputs = arg_->evaluate_n(ctx);
Grouping gmode = inputs.get_grouping_mode();
colvec columns;
size_t ncols = inputs.ncols();
size_t nrows = 1;
columns.reserve(ncols);
for (size_t i = 0; i < ncols; ++i) {
Column col = inputs.retrieve_column(i);
xassert(i == 0 || nrows == col.nrows());
nrows = col.nrows();
columns.emplace_back(col);
}
Column col_out = FExpr_RowAll::apply_function(std::move(columns), nrows, ncols);
Error message:
src/core/expr/fexpr_nth.cc: In member function ‘dt::expr::Workframe dt::expr::FExpr_Nth<SKIPNA>::evaluate_n(dt::expr::EvalContext&) const’:
src/core/expr/fexpr_nth.cc:77:52: error: cannot call member function ‘virtual Column dt::expr::FExpr_RowAll::apply_function(colvec&&, dt::expr::size_t, dt::expr::size_t) const’ without object
77 | Column col_out = FExpr_RowAll::apply_function(std::move(columns), nrows, ncols);
what is the correct way of using |
67890c5
to
7435bf5
Compare
@oleksiyskononenko this is dependent on PR #3404 and #3444. Once those PRs are concluded, I can pick up on this |
Co-authored-by: oleksiyskononenko <[email protected]>
7435bf5
to
0bc78d9
Compare
@oleksiyskononenko figured out how to implement SKIPNA='any' or SKIPNA='all', similar to what pandas implements. If you've got some time, please review after the build completes. Thanks |
@oleksiyskononenko just checking in, waiting for your feedback |
Implement dt.nth(cols, n) function to return the nth row (also per group) for the specified columns. If n goes out of bounds, an NA-row is returned.
Closes #3128
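The behaviour described here (nth row per group, with an NA result when `n` is out of bounds) can be illustrated with a small pure-Python sketch. It is not the C++ implementation from this PR: the function `nth_per_group` and its signature are hypothetical, `None` plays the role of NA, and treating a negative `n` as counting from the end of the group is an assumption based on the `nth(-1)` examples in the conversation.

```python
from collections import defaultdict


def nth_per_group(keys, values, n):
    """Hypothetical helper: map each group key to the nth value in that group."""
    groups = defaultdict(list)
    for k, v in zip(keys, values):
        groups[k].append(v)          # original row order is kept within a group

    result = {}
    for k, rows in groups.items():
        # Assumption: negative n counts from the end of the group.
        idx = n if n >= 0 else len(rows) + n
        # Out-of-bounds n yields None (standing in for NA) instead of an error.
        result[k] = rows[idx] if 0 <= idx < len(rows) else None
    return result


A = [1, 1, 2, 1, 2]
B = [None, 2.0, 3.0, 4.0, 5.0]

first = nth_per_group(A, B, 0)
last = nth_per_group(A, B, -1)
out_of_bounds = nth_per_group(A, B, 3)
```

The explicit bounds check is the point of the sketch: every group returns exactly one entry, so the result always has one row per group even when some groups are shorter than the requested position.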