Higher computation time for new versions #67
Comments
@athawk81 Any ideas?
Florian, I would definitely encourage you to try again with the latest version. If you still have a problem, please let us know. Kind regards, Ian. On Mon, Jun 8, 2015 at 5:10 AM, Florian Laws wrote:
Ian Clarke
I'm not too thrilled that later versions dropped support for bagging, but we'll try.
@athawk81 can comment on why we did that, but I believe that in our testing on a variety of datasets it didn't help predictive accuracy. There have been many improvements to predictive accuracy in more recent versions, including a flexible hyper-parameter optimizer that you should try.
Ok, we discussed it, here is the situation: There have been a number of bugfixes, some of which have probably resulted in an increase in training time, but with significant benefits to predictive accuracy. Our use case for QuickML is 2-class classification, and the approach it uses for this (see TreeBuilder.createTwoClassCategoricalNode()) is quite a bit more efficient than TreeBuilder.createNClassCategoricalNode(), which builds the "inset" progressively based on the impact on the split's score. Regarding bagging, this would be easy enough to add back, but if you are in a position to contribute improvements, I might hold off for a few days as we have a significant refactor that's likely to be merged soon.
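For readers unfamiliar with the idea, here is a minimal sketch of what building an "inset" progressively from the split score could look like. It is not QuickML's code: the class `GreedyInsetSketch`, its method names, the `counts[value][class]` layout, and the choice of Gini impurity as the score are all assumptions made for illustration.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch: names and scoring are illustrative, not QuickML's API.
class GreedyInsetSketch {

    // Gini impurity of a class-count vector; 0 for an empty branch.
    static double gini(int[] classCounts) {
        int total = 0;
        for (int c : classCounts) total += c;
        if (total == 0) return 0.0;
        double sumSquares = 0.0;
        for (int c : classCounts) {
            double p = (double) c / total;
            sumSquares += p * p;
        }
        return 1.0 - sumSquares;
    }

    // Split score: negated size-weighted impurity of both branches (higher is better).
    // Assumes at least one training instance overall.
    static double score(int[] in, int[] out) {
        int nIn = 0, nOut = 0;
        for (int c : in) nIn += c;
        for (int c : out) nOut += c;
        double n = nIn + nOut;
        return -(nIn / n * gini(in) + nOut / n * gini(out));
    }

    // Moves one value's class counts into the inset (sign = +1) or back out (sign = -1).
    static void move(int[] valueCounts, int[] in, int[] out, int sign) {
        for (int c = 0; c < valueCounts.length; c++) {
            in[c] += sign * valueCounts[c];
            out[c] -= sign * valueCounts[c];
        }
    }

    // counts[v][c] = number of instances with attribute value v and class c.
    // Greedily adds the value that most improves the score until nothing helps.
    static Set<Integer> buildInset(int[][] counts) {
        int numClasses = counts[0].length;
        int[] in = new int[numClasses];
        int[] out = new int[numClasses];
        for (int[] row : counts) {
            for (int c = 0; c < numClasses; c++) out[c] += row[c];
        }
        Set<Integer> inset = new LinkedHashSet<>();
        double bestScore = score(in, out);
        while (true) {
            int bestValue = -1;
            for (int v = 0; v < counts.length; v++) {
                if (inset.contains(v)) continue;
                move(counts[v], in, out, +1);   // tentatively add v
                double s = score(in, out);
                move(counts[v], in, out, -1);   // undo
                if (s > bestScore + 1e-12) {
                    bestScore = s;
                    bestValue = v;
                }
            }
            if (bestValue < 0) break;           // no remaining value improves the split
            inset.add(bestValue);
            move(counts[bestValue], in, out, +1);
        }
        return inset;
    }
}
```

In this sketch every round rescores each remaining attribute value, so an attribute with k distinct values can need on the order of k² score evaluations. Whether that matches the actual cost profile of createNClassCategoricalNode() would need to be checked against the source.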
Testing with quickml 0.7.14 shows that training times are still much longer than with 0.0.8.11,
@florianlaws Looks like the solution here is to reimplement createNClassCategoricalNode() in a more efficient way. Unfortunately our focus is elsewhere right now and so we don't have the resources to allocate to this, but it shouldn't be a significant amount of work (I would guess less than a day for a proficient Java programmer, although we'd need to decide on an algorithm). Do you have anyone that could take a whack at it? We'd certainly provide whatever assistance we can.
I am comparing two versions of quickdt, namely 0.0.8.11 and 0.2.2.1.
In the following plot, we have the old version (0.0.8.11) in blue and a new one (0.2.2.1) in red. In the old version, the computation time is linear w.r.t. the number of samples; but in the newer version, the time grows exponentially.
Any idea what explains this result?
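One way to put a number on "linear" versus faster-than-linear growth is to fit a straight line to log(time) against log(number of samples). The helper below is a sketch under that assumption; `ScalingExponentSketch` and `logLogSlope` are hypothetical names, and the timings have to come from your own measurements.

```java
// Hypothetical helper: estimates how training time scales with sample count.
class ScalingExponentSketch {

    // Least-squares slope of log(seconds) against log(sampleCounts).
    // Requires positive inputs and at least two distinct sample counts.
    static double logLogSlope(double[] sampleCounts, double[] seconds) {
        int m = sampleCounts.length;
        double sumX = 0.0, sumY = 0.0, sumXX = 0.0, sumXY = 0.0;
        for (int i = 0; i < m; i++) {
            double x = Math.log(sampleCounts[i]);
            double y = Math.log(seconds[i]);
            sumX += x;
            sumY += y;
            sumXX += x * x;
            sumXY += x * y;
        }
        return (m * sumXY - sumX * sumY) / (m * sumXX - sumX * sumX);
    }
}
```

A slope near 1 indicates linear scaling and a slope near 2 indicates quadratic scaling. Genuinely exponential growth does not give a straight line on a log-log plot, so the fitted slope would keep rising as larger sample counts are added.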