[ENH] Refactor count/countna to use FExpr #3440

Merged
oleksiyskononenko merged 126 commits into main from samukweku/fexpr_count_countna on Apr 22, 2023

@samukweku samukweku commented Mar 10, 2023

  • convert count() and countna() to FExpr;
  • fix countna() when applied to grouped columns;
  • change ReduceUnary_ColumnImpl for more type flexibility.

WIP for #2562
Closes #3441

@samukweku samukweku self-assigned this Mar 10, 2023
@samukweku samukweku changed the title [ENH] FExpr count/countna [ENH] Refactor count/countnato use FExpr Mar 10, 2023
@samukweku samukweku changed the title [ENH] Refactor count/countnato use FExpr [ENH] Refactor count/countna to use FExpr Mar 10, 2023
@samukweku
Contributor Author

@oleksiyskononenko I still need your help on PR #3404: could you show me how to use the FExpr_RowAll function, and give me your feedback on my comments regarding the skipna logic?

@samukweku
Contributor Author

@oleksiyskononenko I also need your help diagnosing why the tests fail on AppVeyor but pass locally.

src/core/expr/fexpr_count_countna.cc
// we just want the total number of rows
bool count_all_rows = arg_->get_expr_kind() == Kind::None;

if (count_all_rows && !COUNTNA) {
Contributor

countna() could also be called with no columns; in that case we currently return 0 for each group. Is this case covered in this PR?

Contributor Author

is it useful though?

Contributor

I don't know, but it may be good for consistency.

Contributor Author

Yeah, I don't think it fits in. We are replicating SQL here, where count(*) returns the number of all rows, whereas count(column) counts the non-null rows for that column.
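
For reference, the SQL behaviour mentioned here can be checked with a small sketch. It uses Python's built-in sqlite3 module and a made-up table `t` with a nullable column `v`; both names are illustrative assumptions and have nothing to do with datatable's internals.

```python
import sqlite3

# Hypothetical in-memory table with one nullable column.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (v INTEGER)")
con.executemany("INSERT INTO t (v) VALUES (?)", [(None,), (3,), (5,)])

# COUNT(*) counts every row; COUNT(v) skips rows where v IS NULL.
all_rows, non_null = con.execute("SELECT COUNT(*), COUNT(v) FROM t").fetchone()
con.close()

print(all_rows, non_null)  # 3 2
```

SQL has no direct counterpart of countna(), which is why the no-argument case is debated in this thread.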

Contributor Author

countna() with no arguments does not mean anything.

Contributor

Well, but if someone is already using it, it makes no sense to remove this functionality.

Contributor

@oleksiyskononenko oleksiyskononenko Mar 14, 2023

I don’t think anyone is using countna() directly; the use case would be countna(cols), where cols could become None at some point. My feeling is that returning 0 or None in this case is better than throwing an error.
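
The behaviour under discussion can be sketched in plain Python. This is only an illustration under stated assumptions: `grouped_counts` is a hypothetical helper, not a datatable function, and it picks the "return 0 rather than raise" option for the no-column case.

```python
from collections import defaultdict

def grouped_counts(keys, values=None):
    """Return {group: (count, countna)} for a hypothetical grouped reducer.

    With values=None, count is the number of rows in each group
    (like SQL COUNT(*)) and countna is 0 instead of an error.
    """
    rows = defaultdict(int)
    missing = defaultdict(int)
    for i, key in enumerate(keys):
        rows[key] += 1
        if values is not None and values[i] is None:
            missing[key] += 1
    if values is None:
        return {k: (n, 0) for k, n in rows.items()}
    return {k: (n - missing[k], missing[k]) for k, n in rows.items()}

G = [1, 1, 1, 2, 2, 2]
V = [None, None, None, None, 3, 5]

with_column = grouped_counts(G, V)
no_column = grouped_counts(G)
print(with_column)  # {1: (0, 3), 2: (2, 1)}
print(no_column)    # {1: (3, 0), 2: (3, 0)}
```

Code that builds `cols` dynamically then keeps working when `cols` ends up as None, which is the use case described above.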

@@ -83,6 +83,7 @@ class FExpr_SumProd : public FExpr_Func {
case SType::INT16:
case SType::INT32:
case SType::INT64:
col.cast_inplace(SType::INT64);
Contributor

@oleksiyskononenko oleksiyskononenko Mar 18, 2023

You can do this, but then you also need to do the same for floats, etc., so it is better to revert the changes to what they were before this PR.

Contributor Author

Made the changes, but AppVeyor still fails. I'm particularly confused about the errors because they do not show up when I run the tests locally.

Contributor

@oleksiyskononenko oleksiyskononenko Mar 19, 2023

This failure is not related to your PR:

    def test_dt_count_na2():
        DT = dt.Frame(G=[1,1,1,2,2,2], V=[None, None, None, None, 3, 5])
        EXP = dt.Frame(G=[1,2], V1=[3,1], V2=[1,0])
>       RES = DT[:, [dt.countna(f.V), dt.countna(dt.mean(f.V))], dt.by(f.G)]
E       AssertionError: Assertion 'i < nrows()' failed in src\core\column.cc, line 251
DT         = <Frame#229a5d3da80 6x2>

You need to disable this test until #3417 is resolved.
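
As a sanity check on the expected frame in that test, the numbers can be reproduced by hand. This is a pure-Python sketch, not datatable code: the helper names are hypothetical, and it assumes that mean() skips missing values and yields a missing value for an all-missing group, which is what EXP implies.

```python
def countna(values):
    # Number of missing (None) entries.
    return sum(v is None for v in values)

def mean(values):
    # Mean of the non-missing entries, or None if there are none.
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None

G = [1, 1, 1, 2, 2, 2]
V = [None, None, None, None, 3, 5]

# Collect the values of V per group, preserving group order.
groups = {}
for g, v in zip(G, V):
    groups.setdefault(g, []).append(v)

V1 = [countna(vals) for vals in groups.values()]          # countna(f.V)
V2 = [countna([mean(vals)]) for vals in groups.values()]  # countna(mean(f.V))
print(V1, V2)  # [3, 1] [1, 0]
```

The expected values are therefore consistent; the assertion in the traceback comes from column.cc, not from the test data.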

@samukweku samukweku force-pushed the samukweku/fexpr_count_countna branch from 1f1ce67 to 95dc7ed Compare March 18, 2023 07:04

public:
ReduceUnary_ColumnImpl(Column &&col, const Groupby& gby)
-   : Virtual_ColumnImpl(gby.size(), col.stype()),
+   : Virtual_ColumnImpl(gby.size(), stype_from<T_OUT>),
Contributor

You need to pass the stype here just as before; otherwise it is not possible to get it properly from T_OUT. What was the reason for changing it?

Contributor Author

The change was made to allow passing int64 for count, instead of casting, say, a float to int64, as we have done for the other reducers so far (sum/prod/mean). For min/max we did not need to cast, since the result stays within the same data type. Another option would be to pass an explicit stype here instead of deriving it from T_OUT.

Contributor

Well, but the resulting column for count() has the int64 stype, so col.stype() will work in this case too. Anyway, stype_from<T_OUT> can only give you a guess.

Contributor Author

Why does stype_from<T_OUT> not work properly? Maybe I am missing something in my interpretation of the function? Or is Type::from the right way to go?

Contributor

@oleksiyskononenko oleksiyskononenko Mar 19, 2023

The way to go is to use col.stype() as it was before.

stype_from<T> will not work because there is no strict way to get the stype from the underlying C++ type. Different column types may share the same C++ type: date32/int32, time64/int64, bool8/int8, etc. stype_from<T> only gives you a guess.
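
The ambiguity can be illustrated with a small sketch. The table below is a hypothetical stand-in built only from the pairs named in this comment, written in Python for brevity; it is not datatable's actual type registry.

```python
# Illustrative stype -> C++ element type table (only the pairs named above).
STYPE_TO_CTYPE = {
    "bool8": "int8_t",
    "int8": "int8_t",
    "int32": "int32_t",
    "date32": "int32_t",
    "int64": "int64_t",
    "time64": "int64_t",
}

# Invert the table: each C++ type maps back to every stype that uses it.
ctype_to_stypes = {}
for stype, ctype in STYPE_TO_CTYPE.items():
    ctype_to_stypes.setdefault(ctype, []).append(stype)

# Any C++ type with more than one candidate cannot be mapped back uniquely.
ambiguous = {c: s for c, s in ctype_to_stypes.items() if len(s) > 1}
print(ambiguous)
```

Going from stype to C++ type is a function, but the reverse direction has several candidates per C++ type, so a helper that takes only T has to pick one of them.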

Contributor Author

So it is OK to cast, say, a float64 to int64 for the count() function? No need to worry about precision?

Contributor

But why do we need to cast it for count?

Contributor

All reducers are separate classes, and casting was needed for some of them independently of this PR. What I am saying is that you need to restore the original implementation for those reducers; otherwise it will not work when only the C++ type is passed.

Note, tests are failing not for count…

@oleksiyskononenko
Contributor

oleksiyskononenko commented Mar 19, 2023

@samukweku I guess we need to go through this PR and revert all the changes related to reducers other than count(). Only minor changes should be left that make other reducers compatible with the new ReduceUnary_ColumnImpl, i.e. calling ReduceUnary_ColumnImpl<T, T> instead of just ReduceUnary_ColumnImpl<T>.

The current way of determining the resulting column stype/casting is the way to go, and the changes on this PR are not going to work for the reasons explained above.

@samukweku samukweku force-pushed the samukweku/fexpr_count_countna branch from 109c76f to 8397a8f Compare April 20, 2023 23:28
@oleksiyskononenko
Contributor

@samukweku I'm not sure why dozens of old commits are being pushed to all your PRs?

@oleksiyskononenko
Contributor

May I ask you not to push anything onto this branch? As you asked me to show you how to handle several review comments, I was going to push some changes here and suddenly got +60 old commits to deal with.

Comment on lines 59 to 63
return make<int32_t>(std::move(col), gby, is_grouped);
case SType::DATE32:
return make<int32_t>(std::move(col), gby, is_grouped);
case SType::INT64:
return make<int64_t>(std::move(col), gby, is_grouped);
Contributor

Not sure why this change was introduced; we don't need additional treatment of date32/time64, as internally they are the same as int32/int64.

@oleksiyskononenko oleksiyskononenko added this to the Release 1.1.0 milestone Apr 21, 2023
@oleksiyskononenko oleksiyskononenko added refactor Internal code changes, clean-ups or reorganizations that are not externally visible FIX Fix for an issue bug Any bugs / errors in datatable; however for severe bugs use [segfault] label documentation test Add new tests, or fix existing tests labels Apr 21, 2023
@oleksiyskononenko
Contributor

@samukweku Please take a look at the changes, and let me know if you have any questions.

@samukweku
Contributor Author

@oleksiyskononenko My apologies; it must have happened when I rebased the branch.

@oleksiyskononenko
Contributor

@samukweku To merge main, simply run git merge main while on the destination branch. It will then create just one commit and keep things simple. Anyway, I guess I've resolved all the conflicts and the PR is ready for your review.

@samukweku
Contributor Author

Looks good, thanks @oleksiyskononenko. I need an explanation of this code in reduceunary:

    ReduceUnary_ColumnImpl(Column &&col, const Groupby& gby)
      : ReduceUnary_ColumnImpl(std::move(col), gby, col.stype())
    {}


    ColumnImpl *clone() const override {
      return new ReduceUnary_ColumnImpl(Column(col_), Groupby(gby_), this->stype());
    }

The first part of the code gets its type directly from the column, while the other part seems to get it from the SType provided? An explanation of clone and the preceding code would be helpful.

@oleksiyskononenko
Contributor

@samukweku The first part of the code you're referring to is a constructor that is needed for the case when stype_out is the same as col.stype(). It calls the main constructor, which accepts three parameters. This is exactly what you've been asking about: the way we need to deal with mean, min/max, etc.
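
The relationship between the two constructors and clone() can be modelled roughly as follows. This is a simplified Python sketch with hypothetical class names, not the actual C++ class; the reason given for clone() passing the stored stype is an interpretation of the snippet above rather than something stated in the thread.

```python
class Column:
    """Minimal stand-in for a column that only knows its stype."""
    def __init__(self, stype):
        self.stype = stype

class ReduceUnarySketch:
    """Hypothetical model of a reducer column with an explicit output stype."""

    def __init__(self, col, gby, stype_out=None):
        # Omitting stype_out models the two-argument constructor, which
        # forwards col's own stype to the three-argument one.
        self.col = col
        self.gby = gby
        self.stype = col.stype if stype_out is None else stype_out

    def clone(self):
        # Pass the stored output stype explicitly: it may differ from
        # self.col.stype, so it should not be re-derived from the column.
        return ReduceUnarySketch(self.col, self.gby, self.stype)

same = ReduceUnarySketch(Column("float64"), None)              # e.g. min/max
counted = ReduceUnarySketch(Column("float64"), None, "int64")  # e.g. count
print(same.stype, counted.stype, counted.clone().stype)  # float64 int64 int64
```

In this model a clone built from the column alone would report float64 for the count case, which is why the copy carries the output stype along.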

@samukweku
Contributor Author

Thanks for the explanation, @oleksiyskononenko. LGTM to merge.

@oleksiyskononenko oleksiyskononenko merged commit dbbc664 into main Apr 22, 2023
@oleksiyskononenko oleksiyskononenko deleted the samukweku/fexpr_count_countna branch April 22, 2023 03:27
Labels
bug Any bugs / errors in datatable; however for severe bugs use [segfault] label documentation FIX Fix for an issue refactor Internal code changes, clean-ups or reorganizations that are not externally visible test Add new tests, or fix existing tests
Development

Successfully merging this pull request may close these issues.

dt.countna() returns wrong results for grouped columns
2 participants