-
Notifications
You must be signed in to change notification settings - Fork 157
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ENH] Refactor count/countna to use FExpr #3440
Conversation
@oleksiyskononenko i still need ur help on this PR #3404 --- if you can show me how to use |
@oleksiyskononenko Also need your help in diagnosing why the tests fail in appveyor but work locally? |
src/core/expr/fexpr_count_countna.cc
Outdated
// we just want the total number of rows | ||
bool count_all_rows = arg_->get_expr_kind() == Kind::None; | ||
|
||
if (count_all_rows && !COUNTNA) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
countna()
could also be called with no cols
, in such a case we are currently returning 0
for each group. Is this case covered on this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it useful though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know, but may be good for consistency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea, I dont think it fits in. we are replicating SQL here, where count(*) returns all rows, whereas count(column) counts the non null rows for that column.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
countna() does not mean anything
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, but if someone already using it, it makes no sense to remove this functionality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don’t think anyone is using countna()
, the use-case could be countna(cols)
, where cols
could become None
at some point. My feeling is that returning 0
or None
is better in this case, than throwing an error.
src/core/expr/fexpr_sumprod.cc
Outdated
@@ -83,6 +83,7 @@ class FExpr_SumProd : public FExpr_Func { | |||
case SType::INT16: | |||
case SType::INT32: | |||
case SType::INT64: | |||
col.cast_inplace(SType::INT64); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can do this, but then you need also to do the same for floats, etc, so better to revert the changes to what it was before this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
made the changes - appveyor still fails - i'm particularly confused about the errors because they do not show when I run the tests locally
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This failure is not related to your PR:
def test_dt_count_na2():
DT = dt.Frame(G=[1,1,1,2,2,2], V=[None, None, None, None, 3, 5])
EXP = dt.Frame(G=[1,2], V1=[3,1], V2=[1,0])
> RES = DT[:, [dt.countna(f.V), dt.countna(dt.mean(f.V))], dt.by(f.G)]
E AssertionError: Assertion 'i < nrows()' failed in src\core\column.cc, line 251
DT = <Frame#229a5d3da80 6x2>
You need to disable this test until #3417 is resolved.
1f1ce67
to
95dc7ed
Compare
src/core/column/reduce_unary.h
Outdated
|
||
public: | ||
ReduceUnary_ColumnImpl(Column &&col, const Groupby& gby) | ||
: Virtual_ColumnImpl(gby.size(), col.stype()), | ||
: Virtual_ColumnImpl(gby.size(), stype_from<T_OUT>), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to pass stype
here just like it was before, otherwise it is not possible to properly get it from T_OUT
. I'm not sure what was the reason to change it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the change was made here to allow passing int64 for count, instead of casting say a float to int64, as we did for the other reducers so far(sum/prod/mean). For min/max, we did not need to cast since it is within the same data type. Another option would be to pass an explicit stype here, instead of deriving the stype from T_OUT
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, but the resulting column for count()
has int64
stype, so col.stype()
will work in this case too. Anyways, stype_from<T_OUT>
can only give you a guess.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why does stype_from<T_OUT>
not work properly? maybe something I am missing from my interpretation of the function? or is Type::from
d right way to go?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way to go is to use col.stype()
as it was before.
stype_from<T>
will not work because there is no strict way to get stype
from the underlying type. We may have different columns to have the same C++ type: date32
/int32
, time64
/int64
, boo8
/int8
, etc. stype_from<T>
only gives you a guess.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so it is ok to cast say a float64 to int64 for the count() function? no need to worry about precision?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But why do we need to cast it for count?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All reducers are separate classes, casting was needed for some of them outside of this PR. What I say is that you need to restore the original implementation for those reducers, otherwise it will not work with only passing the C++ type.
Note, tests are failing not for count…
@samukweku I guess we need to go through this PR and revert all the changes related to reducers other than The current way of determining the resulting column stype/casting is the way to go, and the changes on this PR are not going to work for the reasons explained above. |
109c76f
to
8397a8f
Compare
@samukweku I'm not sure why we need dozens of the old commits been pushed to all your PRs? |
May I ask you not to push anything onto this branch? As you asked me to show you how to handle several review comments, I was going to push some changes here and suddenly got +60 old commits to deal with. |
src/core/expr/fexpr_minmax.cc
Outdated
return make<int32_t>(std::move(col), gby, is_grouped); | ||
case SType::DATE32: | ||
return make<int32_t>(std::move(col), gby, is_grouped); | ||
case SType::INT64: | ||
return make<int64_t>(std::move(col), gby, is_grouped); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure why this change was introduced, we don't need additional treatment of date32/time64 as internally there are the same as int32/int64.
@samukweku Please take a look at the changes, and let me know if you have any questions. |
@oleksiyskononenko my apologies; must have been when I rebased the branch. apologies on that |
@samukweku To merge |
looks good. thanks @oleksiyskononenko . need an explanation of the code in reduceunary:
the first part of the code gets its type directly from the column, while the other part seems to get it from the SType provided? maybe an explanation of clone and the preceding code would be helpful. |
@samukweku The first part of the code you're referring to is a constructor, that is needed for the case when |
Thanks for the explanation @oleksiyskononenko LGTM to merge |
count()
andcountna()
toFExpr
;countna()
when applied to grouped columns;ReduceUnary_ColumnImpl
for more type flexibility.WIP for #2562
Closes #3441