
Higher computation time for new versions #67

ADiegoCAlonso opened this issue Sep 11, 2014 · 8 comments

@ADiegoCAlonso

I am comparing two versions of quickdt, namely:

  • 0.0.8.11: RF >> numThreads: 8, numTrees: 8, useBag: true, DT >> minProb: 0.95,
  • 0.2.2.1: RF >> numThreads: 8, numTrees: 8, sizeBag: 2000000, DT >> maxDepth: 2147483647, minInstances: 0, minScore: 1.0E-6, ignoreProb: 0.5

In the following plot, we have the old version (0.0.8.11) in blue and a new one (0.2.2.1) in red. In the old version, the computation time is linear w.r.t. the number of samples; but in the newer version, the time grows exponentially.

[image: computation_time plot]

Any idea what might explain this result?
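One way to put a number on what the plot shows (a standalone sketch, not part of quickdt; the class and method names are made up for illustration): estimate the exponent b in time ≈ a·n^b from pairs of timing measurements. A roughly constant exponent near 1 means linear scaling, a constant exponent near 2 means quadratic, and an exponent that keeps increasing with n points at genuinely exponential growth.

```java
public class ScalingCheck {
    /**
     * Estimate the exponent b in time ~= a * n^b from two timing
     * measurements (n1, t1) and (n2, t2), using a log-log slope.
     */
    static double scalingExponent(long n1, double t1, long n2, double t2) {
        return Math.log(t2 / t1) / Math.log((double) n2 / n1);
    }

    public static void main(String[] args) {
        // Time doubles when n doubles -> exponent ~1.0 (linear, like 0.0.8.11).
        System.out.println(scalingExponent(100_000, 10.0, 200_000, 20.0)); // 1.0
        // Time quadruples when n doubles -> exponent ~2.0 (superlinear).
        System.out.println(scalingExponent(100_000, 10.0, 200_000, 40.0)); // 2.0
    }
}
```

Measuring a few (sample count, training time) pairs for each version and comparing the fitted exponents would make the linear-vs-superlinear claim concrete.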

@sanity commented Sep 17, 2014

@athawk81 Any ideas?

@florianlaws

I am a colleague of Diego and would like to follow up on the issue. It seems that this is probably not a memory usage issue.
I created a sampling profile using jvisualvm, and it looks like the code is spending a lot of its time in TreeBuilder.createNClassCategoricalNode().
We're using quite a few features that correspond to words in a text, i.e. categorical features with a lot of levels. I wonder if there are some inefficiencies inside that method. Note that this is still on version 0.2.2.1. I'll try to reproduce this with the most current version when I have the time.
[image: jvisualvm sampling profile screenshot (2015-06-08)]

@sanity commented Jun 8, 2015

Florian, I would definitely encourage you to try again with the latest version, as there have been many improvements since version 0.2.2.1, including efficiency improvements.

If you still have a problem please let us know.

Kind regards,

Ian.


@florianlaws

I'm not too thrilled that later versions dropped support for bagging, but we'll try.

@sanity commented Jun 8, 2015

@athawk81 can comment on why we did that, but I believe that in our testing on a variety of datasets it didn't help predictive accuracy. There have been many improvements to predictive accuracy in more recent versions, including a flexible hyper-parameter optimizer that you should try.

@sanity commented Jun 8, 2015

Ok, we discussed it; here is the situation:

There have been a number of bugfixes, some of which have probably resulted in an increase in training time, but with significant benefits to predictive accuracy.

Our use case for QuickML is 2-class classification, so the approach it uses for that (see TreeBuilder.createTwoClassCategoricalNode()) is quite a bit more efficient than TreeBuilder.createNClassCategoricalNode(), which builds the "inset" progressively based on each level's impact on the split's score.
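To illustrate why that progressive approach gets expensive with high-cardinality word features, here is a minimal sketch of greedy inset construction (my own simplification for illustration, not quickdt's actual code; the score function here is an arbitrary placeholder): each round re-scores every remaining level, so with k levels the worst case is on the order of k² score evaluations.

```java
import java.util.*;

public class GreedyInsetSketch {
    /**
     * Greedy "inset" construction for a categorical split: repeatedly move
     * the level that most improves the split score into the in-set, stopping
     * when no single addition helps. With k levels, each round scans up to k
     * candidates and re-scores each one, so the worst case is O(k^2) score
     * evaluations -- painful when levels are words from a large vocabulary.
     */
    static Set<String> greedyInset(List<String> levels,
            java.util.function.Function<Set<String>, Double> score) {
        Set<String> inset = new HashSet<>();
        double best = score.apply(inset);
        boolean improved = true;
        while (improved && inset.size() < levels.size()) {
            improved = false;
            String bestLevel = null;
            for (String level : levels) {          // O(k) candidates per round...
                if (inset.contains(level)) continue;
                inset.add(level);
                double s = score.apply(inset);     // ...each re-scored
                inset.remove(level);
                if (s > best) { best = s; bestLevel = level; }
            }
            if (bestLevel != null) { inset.add(bestLevel); improved = true; }
        }
        return inset;
    }
}
```

For example, with a toy score that rewards "a" and "b" but heavily penalizes "c", the greedy loop selects exactly {"a", "b"} and stops.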

Regarding bagging, this would be easy enough to add back, but if you are in a position to contribute improvements, I might hold off for a few days, as we have a significant refactor that's likely to be merged soon.

@florianlaws

Testing with quickml 0.7.14 shows that training times are still much longer than with 0.0.8.11, too long to be useful for us. So we don't really need bagging until the training time issues are solved.

@sanity commented Jun 15, 2015

@florianlaws Looks like the solution here is to reimplement createNClassCategoricalNode() in a more efficient way. Unfortunately our focus is elsewhere right now, so we don't have the resources to allocate to this, but it shouldn't be a significant amount of work (I would guess less than a day for a proficient Java programmer, although we'd need to decide on an algorithm).

Do you have anyone that could take a whack at it? We'd certainly provide whatever assistance we can.
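For the two-class case there is a well-known O(k log k) alternative to the greedy subset search (the classic CART result: sort levels by their positive-class proportion, and the optimal binary partition is always a prefix of that ordering, so only k-1 cuts need scoring). A sketch of that idea, using a simple weighted Gini impurity rather than quickdt's actual scorer; extending it to the N-class case would need a heuristic ordering, e.g. by the probability of one reference class.

```java
import java.util.*;

public class SortedCategoricalSplit {
    /** Two-class Gini impurity for (positive, negative) counts. */
    static double gini(int pos, int neg) {
        int n = pos + neg;
        if (n == 0) return 0;
        double p = pos / (double) n;
        return 2 * p * (1 - p);
    }

    /**
     * Returns the in-set of the best binary partition of the levels.
     * counts maps level -> {positiveCount, negativeCount}. Levels are
     * sorted by positive-class rate; only the k-1 prefix cuts of that
     * ordering are scored, giving O(k log k) instead of O(k^2).
     */
    static Set<String> bestInset(Map<String, int[]> counts) {
        List<String> levels = new ArrayList<>(counts.keySet());
        levels.sort(Comparator.comparingDouble(
            l -> counts.get(l)[0] / (double) (counts.get(l)[0] + counts.get(l)[1])));

        int totPos = 0, totNeg = 0;
        for (int[] c : counts.values()) { totPos += c[0]; totNeg += c[1]; }

        int inPos = 0, inNeg = 0, total = totPos + totNeg;
        double bestImpurity = Double.MAX_VALUE;
        int bestCut = 0;
        for (int i = 0; i < levels.size() - 1; i++) {   // only k-1 prefix cuts
            int[] c = counts.get(levels.get(i));
            inPos += c[0]; inNeg += c[1];
            int outPos = totPos - inPos, outNeg = totNeg - inNeg;
            double impurity = ((inPos + inNeg) * gini(inPos, inNeg)
                             + (outPos + outNeg) * gini(outPos, outNeg)) / total;
            if (impurity < bestImpurity) { bestImpurity = impurity; bestCut = i; }
        }
        return new HashSet<>(levels.subList(0, bestCut + 1));
    }
}
```

For example, with levels "a" (9 positive, 1 negative), "b" (1/9), and "c" (8/2), the sorted order is b, c, a, and the cleanest cut isolates {"b"} from {"c", "a"}.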
