[ENH] nth function #3346

Closed
wants to merge 83 commits into from

samukweku
Contributor

@samukweku samukweku commented Sep 3, 2022

Implement dt.nth(cols, n=0) function to return the nth row (also per group) for the specified columns. If n goes out of bounds, NA-row is returned.
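The behaviour described above can be sketched in plain Python. This is only an illustration of the intended semantics, not the datatable implementation: the helper name `nth_per_group` is hypothetical, `None` stands in for NA, and treating negative `n` as counting from the end of the group is an assumption borrowed from Python indexing.

```python
from collections import defaultdict

def nth_per_group(keys, values, n=0):
    """Return {group key: n-th value of that group, or None if n is out of bounds}."""
    groups = defaultdict(list)
    for key, value in zip(keys, values):
        groups[key].append(value)
    result = {}
    for key, rows in groups.items():
        # Assumption: negative n counts from the end of the group.
        in_bounds = -len(rows) <= n < len(rows)
        result[key] = rows[n] if in_bounds else None
    return result

keys = ["a", "a", "b", "a", "b"]
values = [1, 2, 3, 4, 5]
print(nth_per_group(keys, values, n=0))  # {'a': 1, 'b': 3}
print(nth_per_group(keys, values, n=2))  # {'a': 4, 'b': None}
```

Group `b` has only two rows, so `n=2` falls out of bounds there and yields the NA placeholder.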

Closes #3128

@samukweku samukweku added the new feature Feature requests for new functionality label Sep 3, 2022
@samukweku samukweku self-assigned this Sep 3, 2022
@samukweku
Contributor Author

@oleksiyskononenko this is an MVP implementation of the nth function, similar to the nth functions in pandas and dplyr. Feedback is appreciated before I add docs.

@oleksiyskononenko
Contributor

I wonder if you want to replace first()/last() with nth()? Like nth(0) would mean first() and nth(-1) would mean last()?

@samukweku
Contributor Author

@oleksiyskononenko not a bad idea, as it is more generic.

@oleksiyskononenko
Contributor

oleksiyskononenko commented Sep 8, 2022

So if you look at how the first() / last() functions work, you will see that it is a purely virtual column and not even a rowindex on the source column: https://github.com/h2oai/datatable/blob/main/src/core/expr/head_reduce_unary.cc#L112-L165

FirstLast_ColumnImpl::get_element() just gets the first or last element from a group. For the nth(n=...) function, you need to compare n against the group size and return NA if it is out of bounds.
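A rough Python sketch of that idea follows. The class `NthVirtualColumn` and its `group_offsets` layout are hypothetical and only mimic the spirit of a virtual column: nothing is precomputed, and each element is resolved on demand by comparing `n` with the group size. `None` stands in for NA.

```python
class NthVirtualColumn:
    """Hypothetical virtual column: one element per group, computed lazily."""

    def __init__(self, source, group_offsets, n):
        self.source = source          # values already ordered by group
        self.offsets = group_offsets  # group g spans offsets[g]:offsets[g + 1]
        self.n = n

    def nrows(self):
        return len(self.offsets) - 1

    def get_element(self, g):
        start, end = self.offsets[g], self.offsets[g + 1]
        size = end - start
        # Translate a negative n into a position counted from the group's end.
        pos = self.n if self.n >= 0 else size + self.n
        if pos < 0 or pos >= size:
            return None  # out of bounds for this group: NA
        return self.source[start + pos]

source = [1, 2, 4, 3, 5]   # group 0 -> [1, 2, 4], group 1 -> [3, 5]
offsets = [0, 3, 5]
col = NthVirtualColumn(source, offsets, n=-3)
print([col.get_element(g) for g in range(col.nrows())])  # [1, None]
```

The only state kept is a reference to the source data, the group offsets and `n`, so no per-group buffer is allocated.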

When the column is already grouped, first() / last() immediately return either an NA column (for a zero-row frame) or the source column itself, because the first or last element of each group of a grouped column is just that grouped column: https://github.com/h2oai/datatable/blob/main/src/core/expr/head_reduce_unary.cc#L179-L182

So for the nth(n=...) function in that grouped case: n can only be 0 or -1; otherwise, an NA column should be returned.
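As a small illustration of this special case, here is a hedged Python sketch. The function name `nth_of_grouped_column` is hypothetical, and the input list is assumed to already hold exactly one value per group, with `None` standing in for NA.

```python
def nth_of_grouped_column(column, n):
    """`column` is assumed to hold exactly one value per group."""
    if n in (0, -1):
        # Every group has a single element, which is both its first and last.
        return list(column)
    # Any other n is out of bounds for one-element groups.
    return [None] * len(column)

print(nth_of_grouped_column([10, 20, 30], 0))   # [10, 20, 30]
print(nth_of_grouped_column([10, 20, 30], -1))  # [10, 20, 30]
print(nth_of_grouped_column([10, 20, 30], 1))   # [None, None, None]
```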

I suggest we convert the implementation in this PR to something similar to what we have for first() / last().

@samukweku
Contributor Author

@oleksiyskononenko what's the disadvantage of using a rowindex for this function? Performance? Or something else?

@oleksiyskononenko
Contributor

oleksiyskononenko commented Sep 8, 2022

Yeah, the current implementation allocates as much memory as is needed for

Buffer buf = Buffer::mem(gby.size() * sizeof(int32_t));

When you make it virtual, as is done for first() / last(), you won't really need additional memory.
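To make the memory cost concrete, here is a hedged Python sketch of the rowindex-style approach. It is not the PR's C++ code: the function `build_nth_rowindex`, the `NA_ROW` marker and the `group_offsets` layout are all assumptions for illustration. The point is that one integer per group must be allocated up front, which a virtual column avoids.

```python
from array import array

NA_ROW = -1  # hypothetical marker meaning "no such row in this group"

def build_nth_rowindex(group_offsets, n):
    """Materialize one row index per group (the approach that needs a buffer)."""
    ngroups = len(group_offsets) - 1
    rowindex = array("i", [NA_ROW]) * ngroups  # one C int per group
    for g in range(ngroups):
        start, end = group_offsets[g], group_offsets[g + 1]
        size = end - start
        pos = n if n >= 0 else size + n
        if 0 <= pos < size:
            rowindex[g] = start + pos
    return rowindex

offsets = [0, 3, 5]  # group 0 spans rows 0..2, group 1 spans rows 3..4
rowindex = build_nth_rowindex(offsets, n=2)
print(list(rowindex))                     # [2, -1]
print(len(rowindex) * rowindex.itemsize)  # bytes held by the buffer
```

The buffer grows linearly with the number of groups, which is the overhead the comment above refers to.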

@samukweku
Contributor Author

Thanks @oleksiyskononenko

sammychoco and others added 7 commits September 15, 2022 19:25
While #2562 is still WIP, we need to have proper documentation for both `Expr` and `FExpr` input/return types. This PR fixes it.
Cumulative functions access the column's data in parallel. If the input column happens to be a `Latent_ColumnImpl`, materialization of such a column happens in parallel. However, `Latent_ColumnImpl::materialize()` is not a thread-safe method, and invoking it from multiple threads results in a segfault.

In this PR we `vivify()` the source column, i.e. materialization, if necessary, happens in one thread only, before processing it in parallel.

Closes #3345
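The pattern described in this commit message can be sketched in Python. This is an analogy, not datatable's code: `LatentColumn` and `parallel_chunk_sums` are hypothetical names, and the up-front `materialize()` call plays the role that `vivify()` plays in the message above.

```python
from concurrent.futures import ThreadPoolExecutor

class LatentColumn:
    """Hypothetical lazy column whose materialize() is not thread-safe."""

    def __init__(self, compute):
        self._compute = compute
        self._data = None
        self.materialize_calls = 0

    def materialize(self):
        # Unsynchronized check-then-set: unsafe if first called from many threads.
        if self._data is None:
            self.materialize_calls += 1
            self._data = self._compute()
        return self._data

    def get_element(self, i):
        return self.materialize()[i]

def parallel_chunk_sums(column, nrows, nchunks=4):
    # Materialize once, in a single thread, before any worker starts.
    column.materialize()
    bounds = [(k * nrows // nchunks, (k + 1) * nrows // nchunks)
              for k in range(nchunks)]

    def chunk_sum(bound):
        start, end = bound
        return sum(column.get_element(i) for i in range(start, end))

    with ThreadPoolExecutor(max_workers=nchunks) as pool:
        return list(pool.map(chunk_sum, bounds))

col = LatentColumn(lambda: list(range(8)))
print(parallel_chunk_sums(col, nrows=8))  # [1, 5, 9, 13]
print(col.materialize_calls)              # 1
```

Because the data already exists when the workers start, they only ever read it and never race to create it.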
As we eliminated Travis from our building pipeline (#3042), its status badge stopped working. In this PR we replace it with the AppVeyor's badge.
@samukweku
Contributor Author

@oleksiyskononenko I made changes to the nth function based on your feedback. I also added a skipna parameter to return non-null values.
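One possible reading of the skipna parameter is sketched below in Python. The exact semantics in the PR are not spelled out here, so this is an assumption: a boolean flag that drops NA values (represented as `None`) from the group before the nth element is picked. The function name `nth` is used only for illustration.

```python
def nth(rows, n=0, skipna=False):
    """Pick the n-th value of one group; None stands in for NA."""
    if skipna:
        # Assumed behaviour: ignore missing values before indexing.
        rows = [v for v in rows if v is not None]
    if -len(rows) <= n < len(rows):
        return rows[n]
    return None  # n is out of bounds

group = [None, 7, None, 9]
print(nth(group, n=0))                # None
print(nth(group, n=0, skipna=True))   # 7
print(nth(group, n=-1, skipna=True))  # 9
print(nth(group, n=2, skipna=True))   # None
```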

When you have some time, kindly have a look at the PR; your feedback is always appreciated.

@samukweku
Contributor Author

I made a blunder on this and should have just done a git pull 🤦

Review comment on src/core/expr/fexpr.cc (outdated, resolved)
Review comment on src/core/expr/fexpr_nth.cc (outdated, resolved)
samukweku and others added 2 commits September 20, 2022 17:39
This PR implements column's aliasing as proposed in #2684. We couldn't name the method `.as()` though, because `as` is a built-in python keyword — hence, we use `.alias()` instead. Column aliasing is now also available in the group-by clause.

Closes #2504
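The naming constraint mentioned in this commit message is easy to check with the standard library. In the sketch below, `ColumnExpr` is a hypothetical stand-in rather than datatable's `FExpr`, and `parses` is a helper introduced only for this illustration.

```python
import keyword

class ColumnExpr:
    """Hypothetical stand-in for a column expression; not datatable's FExpr."""

    def __init__(self, name):
        self.name = name

    def alias(self, new_name):
        # Return a renamed copy instead of mutating the original expression.
        return ColumnExpr(new_name)

def parses(source):
    """Report whether Python can parse `source` as an expression."""
    try:
        compile(source, "<sketch>", "eval")
        return True
    except SyntaxError:
        return False

print(keyword.iskeyword("as"))              # True
print(parses("col.as('total')"))            # False
print(parses("col.alias('total')"))         # True
print(ColumnExpr("A").alias("total").name)  # total
```

Since `as` is a reserved word, an attribute access such as `col.as(...)` cannot even be parsed, which is why a different method name is required.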
oleksiyskononenko and others added 25 commits January 3, 2023 12:07
Fix casting `void` columns to categoricals and add corresponding tests.

WIP for #1691
Allow column names to be missing when detecting a header in CSV files.

Closes #3363
Implement casting of boolean, integer, float, date, time and string columns to categoricals.

WIP for #1691
- use `col`/`cols` as a parameter name when dt function supports single/multiple column(s);
- convert `dt.shift()` documentation into standard format;
- cosmetics.
Implement `dt.categories()` to get categories for categorical columns.

WIP for #1691
…#3368)

- fix "See also" section for categorical types;
- improve `cbind()`/`rbind()` documentation.
It seems that `dt.categories()` is the first datatable function that needs to produce `Grouping::GtoFEW` columns. That's because the number of categories in a categorical column could be anything between `0` and `nrows - 1`. Currently, datatable doesn't really support `Grouping::GtoFEW`, but may need it for the cases when `dt.categories()` is combined with other f-expressions, or when `dt.categories()` is applied to columns that have a different number of underlying categories.

In this PR we
- add some basic support for `Grouping::GtoFEW` grouping mode;
- adjust `dt.categories()` to produce `Grouping::GtoFEW` columns, which in the case of an uneven number of rows are promoted to `Grouping::GtoALL`;
- do minor refactoring in `dt.alias()` function.

WIP for #1691
Implement `dt.codes()` to get integer codes for categorical columns. Since we currently don't support unsigned column types, the internal representation of codes has been also changed to signed integers.

WIP for #1691
…tegorical columns (#3372)

In this PR we 
- implement casts from `dt.cat*(...)` to all of the basic types;
- as a consequence, support for converting categorical columns to CSV has been added.

WIP for #1691
Implement most of the statistical functions for categorical columns. Once we implement sorting of categorical columns, we could add the missing `dt.mode()`.

WIP for #1691
#3344)

Enhance `dt.fillna()` to support filling missing values with a particular value, that could be a scalar, sequence or an `FExpr`.

WIP for #3279
Since python `3.6` reached its end of life, update documentation to `3.7+`.

Closes #3376
…and `cummax()` (#3381)

Add `reverse` parameter to control direction of cumulative function's calculations: 
- when `False`, calculation is done from top to bottom (default);
- when `True`, calculation is done from bottom to top.
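The two directions listed above can be sketched with a cumulative sum in plain Python. The function `cumsum` here is a hypothetical helper, not datatable's implementation, and keeping the result in the original row order when `reverse=True` is an assumption of this sketch.

```python
from itertools import accumulate

def cumsum(values, reverse=False):
    """Cumulative sum, top to bottom by default, bottom to top if reverse."""
    if reverse:
        # Accumulate from the last row upward, then restore the row order.
        return list(accumulate(reversed(values)))[::-1]
    return list(accumulate(values))

print(cumsum([1, 2, 3, 4]))                # [1, 3, 6, 10]
print(cumsum([1, 2, 3, 4], reverse=True))  # [10, 9, 7, 4]
```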

Closes #3279
It seems that `na_position` parameter was missing in the `dt.sort()` documentation for some reason, though it was referred to in the examples. In this PR we fix this issue.
It appears as though we never initialized `na_position_` in the case of `dt.by()`, and this resulted in some random data corruption for columns, that contain missing values. As of this PR, we initialize `na_position_` to `NaPosition::FIRST` to be consistent with what we declare in `dt.by()` [documentation](https://datatable.readthedocs.io/en/latest/api/dt/by.html):

```
The default behavior of groupby is to sort the groups in the ascending order, with NA values appearing before any other values.
```

Also, we switch to python 3.8 for testing debug wheels, so that we keep track of the status of the mentioned groupby tests.
 
Closes #3331
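The documented group ordering (ascending, with NA before any other value) can be illustrated with a small Python sketch. The helper `sorted_group_keys` is hypothetical and uses `None` as the NA placeholder; it says nothing about how datatable sorts internally.

```python
def sorted_group_keys(keys):
    """Order distinct group keys ascending, with NA (None) first."""
    # The first tuple field is False for NA, so NA sorts before real values.
    return sorted(set(keys), key=lambda k: (k is not None, 0 if k is None else k))

print(sorted_group_keys([3, None, 1, 3, None, 2]))  # [None, 1, 2, 3]
```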
- make the codebase compatible with Python 3.11 changes;
- only use the required adopted code from `python/pythoncapi-compat`;
- remove obsolete code for the older Python versions;
- switch to Python 3.11 on AppVeyor for Windows and Linux;
- add Python 3.11 support to Jenkins.
 
Closes #3374
)

- the fix for #3390 has been already pushed as a part of #3388, here we merely add a corresponding test;
- as of #2472, `f[:]` excludes the groupby columns, in this PR we make the corresponding adjustments to the docs. 

Closes #3390
@samukweku
Contributor Author

I just could not resolve the conflicts, so I moved this to #3403.

@samukweku samukweku closed this Jan 3, 2023
@samukweku samukweku deleted the samukweku/nth branch January 3, 2023 02:42
@st-pasha st-pasha removed this from the Release 1.1.0 milestone Nov 14, 2023
Labels: new feature (Feature requests for new functionality)
Projects: none yet
Linked issue (may be closed by merging this pull request): nth function/method for groupby
3 participants