
Higher computation time for new versions #67

ADiegoCAlonso opened this issue Sep 11, 2014 · 8 comments

@ADiegoCAlonso

I am comparing two versions of quickdt, namely:

  • 0.0.8.11: RF >> numThreads: 8, numTrees: 8, useBag: true, DT >> minProb: 0.95,
  • 0.2.2.1: RF >> numThreads: 8, numTrees: 8, sizeBag: 2000000, DT >> maxDepth: 2147483647, minInstances: 0, minScore: 1.0E-6, ignoreProb: 0.5

In the following plot, we have the old version (0.0.8.11) in blue and a new one (0.2.2.1) in red. In the old version, the computation time is linear w.r.t. the number of samples; but in the newer version, the time grows exponentially.

[image: computation_time plot]

Any idea what might explain this result?
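One way to put a number on what the plot shows (a standalone sketch, not part of quickdt; the class and method names are made up for illustration): estimate the exponent b in time ≈ a·n^b from pairs of timing measurements. A roughly constant exponent near 1 means linear scaling, a constant exponent near 2 means quadratic, and an exponent that keeps increasing with n points at genuinely exponential growth.

```java
public class ScalingCheck {
    /**
     * Estimate the exponent b in time ~= a * n^b from two timing
     * measurements (n1, t1) and (n2, t2), using a log-log slope.
     */
    static double scalingExponent(long n1, double t1, long n2, double t2) {
        return Math.log(t2 / t1) / Math.log((double) n2 / n1);
    }

    public static void main(String[] args) {
        // Time doubles when n doubles -> exponent ~1.0 (linear, like 0.0.8.11).
        System.out.println(scalingExponent(100_000, 10.0, 200_000, 20.0)); // 1.0
        // Time quadruples when n doubles -> exponent ~2.0 (superlinear).
        System.out.println(scalingExponent(100_000, 10.0, 200_000, 40.0)); // 2.0
    }
}
```

Measuring a few (sample count, training time) pairs for each version and comparing the fitted exponents would make the linear-vs-superlinear claim concrete.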

@sanity commented Sep 17, 2014

@athawk81 Any ideas?

@florianlaws

I am a colleague of Diego and would like to follow up on the issue. It seems that this is probably not a memory usage issue.
I created a sampling profile using jvisualvm, and it looks like the code is spending a lot of its time in TreeBuilder.createNClassCategoricalNode().
We're using quite a few features that correspond to words in a text, i.e. categorical features with a lot of levels. I wonder if there are some inefficiencies inside that method. Note that this is still on version 0.2.2.1. I'll try to reproduce this with the most current version when I have the time.
[image: jvisualvm sampling profile screenshot (2015-06-08)]

@sanity commented Jun 8, 2015

Florian, I would definitely encourage you to try again with the latest version, as there have been many improvements since version 0.2.2.1, including efficiency improvements.

If you still have a problem please let us know.

Kind regards,

Ian.


@florianlaws

I'm not too thrilled that later versions dropped support for bagging, but we'll try.

@sanity commented Jun 8, 2015

@athawk81 can comment on why we did that, but I believe that in our testing on a variety of datasets it didn't help predictive accuracy. There have been many improvements to predictive accuracy in more recent versions, including a flexible hyper-parameter optimizer that you should try.

@sanity commented Jun 8, 2015

Ok, we discussed it; here is the situation:

There have been a number of bugfixes, some of which have probably resulted in an increase in training time, but with significant benefits to predictive accuracy.

Our use case for QuickML is 2-class classification, so the approach it uses for that (see TreeBuilder.createTwoClassCategoricalNode()) is quite a bit more efficient than TreeBuilder.createNClassCategoricalNode(), which builds the "inset" progressively based on each level's impact on the split's score.
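To illustrate why that progressive approach gets expensive with high-cardinality word features, here is a minimal sketch of greedy inset construction (my own simplification for illustration, not quickdt's actual code; the score function here is an arbitrary placeholder): each round re-scores every remaining level, so with k levels the worst case is on the order of k² score evaluations.

```java
import java.util.*;

public class GreedyInsetSketch {
    /**
     * Greedy "inset" construction for a categorical split: repeatedly move
     * the level that most improves the split score into the in-set, stopping
     * when no single addition helps. With k levels, each round scans up to k
     * candidates and re-scores each one, so the worst case is O(k^2) score
     * evaluations -- painful when levels are words from a large vocabulary.
     */
    static Set<String> greedyInset(List<String> levels,
            java.util.function.Function<Set<String>, Double> score) {
        Set<String> inset = new HashSet<>();
        double best = score.apply(inset);
        boolean improved = true;
        while (improved && inset.size() < levels.size()) {
            improved = false;
            String bestLevel = null;
            for (String level : levels) {          // O(k) candidates per round...
                if (inset.contains(level)) continue;
                inset.add(level);
                double s = score.apply(inset);     // ...each re-scored
                inset.remove(level);
                if (s > best) { best = s; bestLevel = level; }
            }
            if (bestLevel != null) { inset.add(bestLevel); improved = true; }
        }
        return inset;
    }
}
```

For example, with a toy score that rewards "a" and "b" but heavily penalizes "c", the greedy loop selects exactly {"a", "b"} and stops.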

Regarding bagging, this would be easy enough to add back, but if you are in a position to contribute improvements, I might hold off for a few days, as we have a significant refactor that's likely to be merged soon.

@florianlaws

Testing with quickml 0.7.14 shows that training times are still much longer than with 0.0.8.11, too long to be useful for us. So we don't really need bagging until the training time issues are solved.

@sanity commented Jun 15, 2015

@florianlaws Looks like the solution here is to reimplement createNClassCategoricalNode() in a more efficient way. Unfortunately our focus is elsewhere right now, so we don't have the resources to allocate to this, but it shouldn't be a significant amount of work (I would guess less than a day for a proficient Java programmer, although we'd need to decide on an algorithm).

Do you have anyone that could take a whack at it? We'd certainly provide whatever assistance we can.
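For the two-class case there is a well-known O(k log k) alternative to the greedy subset search (the classic CART result: sort levels by their positive-class proportion, and the optimal binary partition is always a prefix of that ordering, so only k-1 cuts need scoring). A sketch of that idea, using a simple weighted Gini impurity rather than quickdt's actual scorer; extending it to the N-class case would need a heuristic ordering, e.g. by the probability of one reference class.

```java
import java.util.*;

public class SortedCategoricalSplit {
    /** Two-class Gini impurity for (positive, negative) counts. */
    static double gini(int pos, int neg) {
        int n = pos + neg;
        if (n == 0) return 0;
        double p = pos / (double) n;
        return 2 * p * (1 - p);
    }

    /**
     * Returns the in-set of the best binary partition of the levels.
     * counts maps level -> {positiveCount, negativeCount}. Levels are
     * sorted by positive-class rate; only the k-1 prefix cuts of that
     * ordering are scored, giving O(k log k) instead of O(k^2).
     */
    static Set<String> bestInset(Map<String, int[]> counts) {
        List<String> levels = new ArrayList<>(counts.keySet());
        levels.sort(Comparator.comparingDouble(
            l -> counts.get(l)[0] / (double) (counts.get(l)[0] + counts.get(l)[1])));

        int totPos = 0, totNeg = 0;
        for (int[] c : counts.values()) { totPos += c[0]; totNeg += c[1]; }

        int inPos = 0, inNeg = 0, total = totPos + totNeg;
        double bestImpurity = Double.MAX_VALUE;
        int bestCut = 0;
        for (int i = 0; i < levels.size() - 1; i++) {   // only k-1 prefix cuts
            int[] c = counts.get(levels.get(i));
            inPos += c[0]; inNeg += c[1];
            int outPos = totPos - inPos, outNeg = totNeg - inNeg;
            double impurity = ((inPos + inNeg) * gini(inPos, inNeg)
                             + (outPos + outNeg) * gini(outPos, outNeg)) / total;
            if (impurity < bestImpurity) { bestImpurity = impurity; bestCut = i; }
        }
        return new HashSet<>(levels.subList(0, bestCut + 1));
    }
}
```

For example, with levels "a" (9 positive, 1 negative), "b" (1/9), and "c" (8/2), the sorted order is b, c, a, and the cleanest cut isolates {"b"} from {"c", "a"}.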
