[ENH] nth function #3346
@oleksiyskononenko MVP implementation of the |
I wonder if you want to replace |
@oleksiyskononenko not a bad idea, as it is more generic. |
So if you look at how the
In the case the column is grouped, So for the I suggest we convert implementation on this PR to something similar we have for |
@oleksiyskononenko what's the disadvantage of using a rowindex for this function? Performance? Or something else? |
Yeah, current implementation uses as much memory as needed for Buffer buf = Buffer::mem(gby.size() * sizeof(int32_t)); When you do it virtual as for |
Thanks @oleksiyskononenko |
While #2562 is still WIP, we need to have proper documentation for both `Expr` and `FExpr` input/return types. This PR fixes it.
Cumulative functions access the column's data in parallel. If the input column is a `Latent_ColumnImpl`, it ends up being materialized in parallel too. However, `Latent_ColumnImpl::materialize()` is not thread-safe, and invoking it from multiple threads results in a segfault. In this PR we `vivify()` the source column, i.e. materialization, if necessary, happens in one thread only, before the column is processed in parallel. Closes #3345
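The fix described above (materialize once, then go parallel) can be illustrated with a small pure-Python sketch. This is not datatable code: `LazyColumn`, `materialize_calls` and `chunked_sums` are hypothetical names made up for the illustration, and the real `vivify()` lives in the C++ core.

```python
from concurrent.futures import ThreadPoolExecutor


class LazyColumn:
    """Hypothetical stand-in for a latent column: data is produced on first use."""

    def __init__(self, producer):
        self._producer = producer
        self._data = None
        self.materialize_calls = 0

    def materialize(self):
        # Deliberately not thread-safe: nothing guards the check-then-set,
        # so concurrent callers could both run the producer.
        if self._data is None:
            self.materialize_calls += 1
            self._data = self._producer()
        return self._data


def chunked_sums(col, nchunks=4):
    # Materialize in the calling thread only, before any worker starts.
    data = col.materialize()
    size = (len(data) + nchunks - 1) // nchunks

    def work(i):
        # Workers only read the already materialized data.
        return sum(data[i * size:(i + 1) * size])

    with ThreadPoolExecutor(max_workers=nchunks) as pool:
        return list(pool.map(work, range(nchunks)))
```

Because the workers never call `materialize()` themselves, the unsafe method is invoked exactly once, no matter how many threads process the data.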
As we eliminated Travis from our build pipeline (#3042), its status badge stopped working. In this PR we replace it with the AppVeyor badge.
@oleksiyskononenko made changes to the When you have some time, kindly have a look at the PR; your feedback is always appreciated. |
I made a blunder on this and should have just done a git pull 🤦 |
Fix casting `void` columns to categoricals and add corresponding tests. WIP for #1691
Implement casting of boolean, integer, float, date, time and string columns to categoricals. WIP for #1691
- use `col`/`cols` as a parameter name when dt function supports single/multiple column(s);
- convert `dt.shift()` documentation into standard format;
- cosmetics.
It seems that `dt.categories()` is the first datatable function that needs to produce `Grouping::GtoFEW` columns. That's because the number of categories in a categorical column could be anything between `0` and `nrows - 1`. Currently, datatable doesn't really support `Grouping::GtoFEW`, but may need it for the cases when `dt.categories()` is combined with other f-expressions, or when `dt.categories()` is applied to columns that have different numbers of underlying categories. In this PR we
- add some basic support for `Grouping::GtoFEW` grouping mode;
- adjust `dt.categories()` to produce `Grouping::GtoFEW` columns, which in the case of an uneven number of rows are promoted to `Grouping::GtoALL`;
- do minor refactoring in `dt.alias()` function.

WIP for #1691
Implement most of the statistical functions for categorical columns. Once we implement sorting of categorical columns, we could add the missing `dt.mode()`. WIP for #1691
Since Python `3.6` reached its end of life, update documentation to `3.7+`. Closes #3376
It seems that `na_position` parameter was missing in the `dt.sort()` documentation for some reason, though it was referred to in the examples. In this PR we fix this issue.
It appears that we never initialized `na_position_` in the case of `dt.by()`, and this resulted in random data corruption for columns that contain missing values. As of this PR, we initialize `na_position_` to `NaPosition::FIRST` to be consistent with what we declare in `dt.by()` [documentation](https://datatable.readthedocs.io/en/latest/api/dt/by.html):
```
The default behavior of groupby is to sort the groups in the ascending order, with NA values appearing before any other values.
```
Also, we switch to Python 3.8 for testing debug wheels, so that we keep track of the status of the mentioned groupby tests. Closes #3331
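The ordering that the documentation quote describes (ascending, with NA values first) can be sketched in plain Python. The helper name `sort_na_first` is hypothetical, and `None` stands in for NA; this only illustrates the ordering rule, not how datatable's groupby is implemented.

```python
def sort_na_first(values):
    """Sort ascending, placing None (standing in for NA) before all other values."""
    # The first tuple element is False for None and True otherwise, so None
    # sorts first; the second element orders the remaining values. The 0 used
    # for None is a placeholder that is never compared with a real value.
    return sorted(values, key=lambda v: (v is not None, 0 if v is None else v))
```

For example, `sort_na_first([3, None, 1])` places the `None` entry ahead of `1` and `3`.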
- make codebase compatible with Python 3.11 changes;
- only use the required adopted code from `python/pythoncapi-compat`;
- remove obsolete code for the older Python versions;
- switch to Python 3.11 on AppVeyor for Windows and Linux;
- add Python 3.11 support to Jenkins;

Closes #3374
just could not resolve the conflicts - moved to #3403 |
Implement the `dt.nth(cols, n=0)` function to return the nth row (also per group) for the specified columns. If `n` goes out of bounds, an NA row is returned.

Closes #3128
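The per-group semantics described here can be sketched in pure Python. This is not the datatable implementation: `nth_per_group` is a hypothetical helper, `None` stands in for an NA row, and treating a negative `n` as out of bounds is an assumption of the sketch, since the description does not say how negative values are handled.

```python
def nth_per_group(rows, key, n=0):
    """Return {group: n-th row of that group}, with None when n is out of bounds."""
    groups = {}
    for row in rows:
        # dicts keep insertion order, so groups appear in first-seen order
        groups.setdefault(row[key], []).append(row)
    return {
        # Assumption: only 0 <= n < group size counts as in bounds.
        group: members[n] if 0 <= n < len(members) else None
        for group, members in groups.items()
    }
```

With `n=0` every non-empty group yields its first row; with a larger `n`, groups that have fewer than `n + 1` rows yield the NA placeholder instead of raising an error.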