[ENH] Refactor count/countna to use FExpr #3440

samukweku · 2023-03-10T11:14:44Z

convert count() and countna() to FExpr;
fix countna() when applied to grouped columns;
change ReduceUnary_ColumnImpl for more type flexibility.

WIP for #2562
Closes #3441

samukweku · 2023-03-11T00:14:21Z

@oleksiyskononenko I still need your help on PR #3404: could you show me how to use the FExpr_RowAll function, and give me your feedback on my comments regarding the skipna logic?

samukweku · 2023-03-11T00:21:53Z

@oleksiyskononenko I also need your help diagnosing why the tests fail on AppVeyor but pass locally.

oleksiyskononenko · 2023-03-14T05:26:39Z

src/core/expr/fexpr_count_countna.cc

+      // we just want the total number of rows
+      bool count_all_rows = arg_->get_expr_kind() == Kind::None;
+
+      if (count_all_rows && !COUNTNA) {


countna() could also be called with no columns; in that case we currently return 0 for each group. Is this case covered in this PR?

Is it useful, though?

I don't know, but it may be good for consistency.

Yeah, I don't think it fits in. We are replicating SQL here, where count(*) returns the number of rows, whereas count(column) counts the non-null rows for that column.

countna() with no arguments does not mean anything.

Well, but if someone is already using it, it makes no sense to remove this functionality.

I don't think anyone is calling countna() with no arguments directly. The use case could be countna(cols), where cols could become None at some point. My feeling is that returning 0 or None is better in this case than throwing an error.
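The semantics debated in this thread can be sketched in plain C++17, independent of datatable. This is only an illustration under stated assumptions: count_rows, count_valid, count_na and count_na_no_column are hypothetical helper names, a column is modelled as a vector of std::optional values, and groups are given as offsets. None of it is the actual datatable implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical model: a column is a vector of optional values (nullopt = NA);
// group g spans the half-open row range [offsets[g], offsets[g + 1]).
using Col = std::vector<std::optional<int>>;
using Offsets = std::vector<std::size_t>;

// count() with no column: like SQL count(*), the number of rows per group.
std::vector<int64_t> count_rows(const Offsets& offsets) {
  assert(!offsets.empty());
  std::vector<int64_t> out;
  for (std::size_t g = 0; g + 1 < offsets.size(); ++g) {
    out.push_back(static_cast<int64_t>(offsets[g + 1] - offsets[g]));
  }
  return out;
}

// count(col): like SQL count(column), the number of non-NA values per group.
std::vector<int64_t> count_valid(const Col& col, const Offsets& offsets) {
  assert(!offsets.empty() && offsets.back() <= col.size());
  std::vector<int64_t> out;
  for (std::size_t g = 0; g + 1 < offsets.size(); ++g) {
    int64_t n = 0;
    for (std::size_t i = offsets[g]; i < offsets[g + 1]; ++i) {
      if (col[i].has_value()) ++n;
    }
    out.push_back(n);
  }
  return out;
}

// countna(col): the number of NA values per group (rows minus valid values).
std::vector<int64_t> count_na(const Col& col, const Offsets& offsets) {
  std::vector<int64_t> out = count_rows(offsets);
  const std::vector<int64_t> valid = count_valid(col, offsets);
  for (std::size_t g = 0; g < out.size(); ++g) out[g] -= valid[g];
  return out;
}

// countna() with no column: there is nothing to inspect, so this sketch
// returns 0 per group, the behaviour the thread describes as current.
std::vector<int64_t> count_na_no_column(const Offsets& offsets) {
  assert(!offsets.empty());
  return std::vector<int64_t>(offsets.size() - 1, int64_t{0});
}
```

With two groups of three rows each and the values NA, NA, NA | NA, 3, 5, count_rows gives 3 and 3, count_valid gives 0 and 2, and count_na gives 3 and 1.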

oleksiyskononenko · 2023-03-18T03:08:37Z

src/core/expr/fexpr_sumprod.cc

@@ -83,6 +83,7 @@ class FExpr_SumProd : public FExpr_Func {
        case SType::INT16:
        case SType::INT32:
        case SType::INT64:
+          col.cast_inplace(SType::INT64);


You can do this, but then you also need to do the same for floats, etc., so it is better to revert these changes to what was there before this PR.

Made the changes, but AppVeyor still fails. I'm particularly confused about the errors because they do not show up when I run the tests locally.

This failure is not related to your PR:

    def test_dt_count_na2():
        DT = dt.Frame(G=[1,1,1,2,2,2], V=[None, None, None, None, 3, 5])
        EXP = dt.Frame(G=[1,2], V1=[3,1], V2=[1,0])
    >   RES = DT[:, [dt.countna(f.V), dt.countna(dt.mean(f.V))], dt.by(f.G)]
    E   AssertionError: Assertion 'i < nrows()' failed in src\core\column.cc, line 251

    DT = <Frame#229a5d3da80 6x2>

You need to disable this test until #3417 is resolved.

oleksiyskononenko · 2023-03-19T03:54:15Z

src/core/column/reduce_unary.h


  public:
    ReduceUnary_ColumnImpl(Column &&col, const Groupby& gby)
-      : Virtual_ColumnImpl(gby.size(), col.stype()),
+      : Virtual_ColumnImpl(gby.size(), stype_from<T_OUT>),


You need to pass the stype here just as before; otherwise it is not possible to properly get it from T_OUT. What was the reason for changing it?

The change was made to allow passing int64 for count instead of casting, say, a float to int64, as we have done so far for the other reducers (sum/prod/mean). For min/max we did not need to cast, since the result stays within the same data type. Another option would be to pass an explicit stype here instead of deriving it from T_OUT.

Well, but the resulting column for count() has the int64 stype, so col.stype() will work in this case too. Anyway, stype_from<T_OUT> can only give you a guess.

Why does stype_from<T_OUT> not work properly? Maybe I am missing something in my interpretation of the function? Or is Type::from the right way to go?

The way to go is to use col.stype() as it was before.

stype_from<T> will not work because there is no strict way to get the stype from the underlying C++ type. Different column types may share the same C++ type: date32/int32, time64/int64, bool8/int8, etc. stype_from<T> only gives you a guess.
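The point about stype_from<T> being only a guess can be illustrated with a small, self-contained C++17 sketch. Everything here is an assumption made for illustration: the reduced SType enum, ctype_of, guess_stype and round_trips are hypothetical names, not datatable's real definitions. The sketch only shows that when several storage types share one C++ type, the reverse mapping cannot be exact.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical, reduced set of storage types (not datatable's real enum).
enum class SType { BOOL8, INT8, INT32, INT64, DATE32, TIME64, FLOAT64 };

// Forward map: every storage type has exactly one underlying C++ type.
template <SType S> struct ctype_of;
template <> struct ctype_of<SType::BOOL8>   { using type = int8_t;  };
template <> struct ctype_of<SType::INT8>    { using type = int8_t;  };
template <> struct ctype_of<SType::INT32>   { using type = int32_t; };
template <> struct ctype_of<SType::DATE32>  { using type = int32_t; };
template <> struct ctype_of<SType::INT64>   { using type = int64_t; };
template <> struct ctype_of<SType::TIME64>  { using type = int64_t; };
template <> struct ctype_of<SType::FLOAT64> { using type = double;  };

// Reverse map: it can return only one storage type per C++ type,
// so for int8_t, int32_t and int64_t the answer is merely a guess.
template <typename T>
constexpr SType guess_stype() {
  if constexpr (std::is_same_v<T, int8_t>) {
    return SType::INT8;
  } else if constexpr (std::is_same_v<T, int32_t>) {
    return SType::INT32;
  } else if constexpr (std::is_same_v<T, int64_t>) {
    return SType::INT64;
  } else {
    static_assert(std::is_same_v<T, double>, "unsupported C++ type");
    return SType::FLOAT64;
  }
}

// True only if going stype -> C++ type -> stype lands on the same stype.
template <SType S>
constexpr bool round_trips() {
  return guess_stype<typename ctype_of<S>::type>() == S;
}

// INT32 survives the round trip, DATE32 silently degrades to INT32.
static_assert(round_trips<SType::INT32>(), "int32 is recovered");
static_assert(!round_trips<SType::DATE32>(), "date32 is lost");
```

In this model a date32 column pushed through the reverse map would come back as int32, which is why the reviewer asks for the stype to be passed explicitly rather than derived from the template parameter.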

So it is OK to cast, say, a float64 to int64 for the count() function? No need to worry about precision?

But why do we need to cast it for count?

All reducers are separate classes, and casting was needed for some of them outside of this PR. What I am saying is that you need to restore the original implementation for those reducers; otherwise it will not work when only the C++ type is passed.

Note that the tests are failing not for count…

oleksiyskononenko · 2023-03-19T04:17:04Z

@samukweku I guess we need to go through this PR and revert all the changes related to reducers other than count(). Only minor changes should be left that make other reducers compatible with the new ReduceUnary_ColumnImpl, i.e. calling ReduceUnary_ColumnImpl<T, T> instead of just ReduceUnary_ColumnImpl<T>.

The current way of determining the resulting column stype/casting is the way to go, and the changes on this PR are not going to work for the reasons explained above.

oleksiyskononenko · 2023-04-21T20:57:16Z

@samukweku I'm not sure why dozens of old commits have been pushed to all your PRs.

oleksiyskononenko · 2023-04-21T21:13:12Z

May I ask you not to push anything onto this branch? As you asked me to show you how to handle several review comments, I was going to push some changes here and suddenly got +60 old commits to deal with.

oleksiyskononenko · 2023-04-21T21:29:07Z

src/core/expr/fexpr_minmax.cc

-        return make<int32_t>(std::move(col), gby, is_grouped);
        case SType::DATE32:
          return make<int32_t>(std::move(col), gby, is_grouped);
        case SType::INT64:
-        return make<int64_t>(std::move(col), gby, is_grouped);


Not sure why this change was introduced; we don't need any additional treatment of date32/time64, as internally they are the same as int32/int64.

oleksiyskononenko · 2023-04-21T21:32:42Z

@samukweku Please take a look at the changes, and let me know if you have any questions.

samukweku · 2023-04-21T22:52:00Z

@oleksiyskononenko My apologies; it must have happened when I rebased the branch.

oleksiyskononenko · 2023-04-21T22:54:10Z

@samukweku To merge main, simply run git merge main while on the destination branch. It will then create just one commit, which keeps things simple. Anyway, I believe I've resolved all the conflicts and the PR is ready for your review.

samukweku · 2023-04-21T23:03:21Z

Looks good, thanks @oleksiyskononenko. I need an explanation of this code in ReduceUnary_ColumnImpl:

    ReduceUnary_ColumnImpl(Column &&col, const Groupby& gby)
      : ReduceUnary_ColumnImpl(std::move(col), gby, col.stype())
    {}


    ColumnImpl *clone() const override {
      return new ReduceUnary_ColumnImpl(Column(col_), Groupby(gby_), this->stype());
    }

The first part of the code gets its type directly from the column, while the other part seems to get it from the SType provided. An explanation of clone() and the preceding code would be helpful.

oleksiyskononenko · 2023-04-22T00:18:55Z

@samukweku The first part of the code you're referring to is a constructor that is needed for the case when stype_out is the same as col.stype(). It calls the main constructor, which accepts three parameters. This is exactly what you've been asking about: the way we need to deal with mean, min/max, etc.
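The pattern described above, a two-argument constructor delegating to a three-argument one and clone() forwarding the stored output stype, can be sketched as follows. This is a simplified, hypothetical model: Column, SType and ReduceSketch are stand-ins invented for illustration, and the group count replaces the real Groupby object. It is not the code merged in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-ins, reduced to what the sketch needs.
enum class SType { INT32, INT64, FLOAT64 };

struct Column {
  SType stype_;
  std::vector<double> data;
  SType stype() const { return stype_; }
};

class ReduceSketch {
  Column col_;
  std::size_t ngroups_;
  SType stype_out_;

 public:
  // Main constructor: the caller states the output stype explicitly,
  // e.g. INT64 for a count over a FLOAT64 column.
  ReduceSketch(Column&& col, std::size_t ngroups, SType stype_out)
    : col_(std::move(col)), ngroups_(ngroups), stype_out_(stype_out) {}

  // Convenience constructor for reducers whose output stype equals the
  // input stype (min/max style). It delegates to the main constructor.
  // std::move(col) is only a cast here; the column is moved from later,
  // inside the main constructor, so reading col.stype() is still safe.
  ReduceSketch(Column&& col, std::size_t ngroups)
    : ReduceSketch(std::move(col), ngroups, col.stype()) {}

  SType stype() const { return stype_out_; }
  std::size_t nrows() const { return ngroups_; }

  // clone() must pass the stored output stype on. Using the two-argument
  // constructor here would reset it to the input column's stype.
  std::unique_ptr<ReduceSketch> clone() const {
    auto copy = std::make_unique<ReduceSketch>(Column(col_), ngroups_, stype_out_);
    assert(copy->stype() == stype_out_);  // output stype survives cloning
    return copy;
  }
};
```

In this model a count-like reducer built over a FLOAT64 column with an explicit INT64 output keeps reporting INT64 after being cloned, while a min/max-like reducer built with the two-argument form simply inherits the column's stype.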

samukweku · 2023-04-22T01:51:47Z

Thanks for the explanation, @oleksiyskononenko. LGTM to merge.

samukweku requested a review from oleksiyskononenko March 10, 2023 11:14

samukweku self-assigned this Mar 10, 2023

samukweku changed the title ~~[ENH] FExpr count/countna~~ [ENH] Refactor count/countnato use FExpr Mar 10, 2023

samukweku changed the title ~~[ENH] Refactor count/countnato use FExpr~~ [ENH] Refactor count/countna to use FExpr Mar 10, 2023

oleksiyskononenko reviewed Mar 10, 2023

src/core/column/countna.h (outdated)

src/core/expr/fexpr_count_countna.cc (outdated)

src/core/expr/head_reduce_unary.cc (outdated)

src/core/column/countna.h (outdated)

oleksiyskononenko reviewed Mar 11, 2023

src/core/column/countna.h (outdated)

oleksiyskononenko reviewed Mar 11, 2023

src/core/column/countna.h (outdated)

samukweku mentioned this pull request Mar 12, 2023

dt.countna() returns wrong results for grouped columns #3441

Closed

oleksiyskononenko reviewed Mar 13, 2023

src/core/expr/fexpr_count_countna.cc (outdated)