fsb: fast & simple boosting

A fast and simple implementation of Gradient Boosting Machines (GBM).

Goals:

  1. Fast: handle 40M rows * 500 features within 12 hours.
  2. Simple: the fewer lines of code, the better (target <= 3000).
  3. Modular/extensible for further improvements.

How to Use

  1. Install folly and thrift.
  2. Modify Makefile and boosting.sh so that FOLLY and THRIFT point to the right locations.
  3. Run make
  4. Run boosting.sh

Algorithms:

  1. Pre-bucketing (data compression).
  2. Bucket sort to build a histogram, then a linear scan to find the best split (see the sketch after this list).
  3. Hints and an intelligent choice of the number of buckets per feature.
  4. Stochastic gradient boosting machine.
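A minimal sketch of step 2, assuming squared-error loss: one bucket-sort pass accumulates per-bucket gradient statistics, then one linear scan over bucket boundaries picks the split with the largest gain. The function names and the gain formula are illustrative assumptions, not the repo's actual code.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct SplitResult {
  int bucket = -1;    // go left if bucketId <= bucket; -1 means no valid split
  double gain = 0.0;  // reduction in squared error
};

// One column, already pre-bucketized into numBuckets bucket ids.
SplitResult bestSplit(const std::vector<uint8_t>& bucketIds,
                      const std::vector<double>& gradients,
                      int numBuckets) {
  std::vector<double> sum(numBuckets, 0.0);
  std::vector<int> cnt(numBuckets, 0);
  double total = 0.0;
  for (std::size_t i = 0; i < bucketIds.size(); ++i) {  // O(d) bucket-sort pass
    sum[bucketIds[i]] += gradients[i];
    cnt[bucketIds[i]] += 1;
    total += gradients[i];
  }
  SplitResult best;
  const int n = static_cast<int>(bucketIds.size());
  double leftSum = 0.0;
  int leftCnt = 0;
  for (int b = 0; b + 1 < numBuckets; ++b) {  // O(k) scan over boundaries
    leftSum += sum[b];
    leftCnt += cnt[b];
    const int rightCnt = n - leftCnt;
    if (leftCnt == 0 || rightCnt == 0) continue;
    const double rightSum = total - leftSum;
    // Gain of splitting vs. a single leaf, under the squared-error criterion.
    const double gain = leftSum * leftSum / leftCnt +
                        rightSum * rightSum / rightCnt - total * total / n;
    if (gain > best.gain) {
      best.gain = gain;
      best.bucket = b;
    }
  }
  return best;
}
```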

Features:

  1. Correctness (model + feature importances, "fimps").
  2. Deterministic randomness (a sketch follows this list).
  3. Easily extensible to a wide variety of similar algorithms: random forest, bagging, and GBM, for both classification and regression; regression takes priority.
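One way to read "deterministic randomness": every random choice (example and feature subsampling) flows from a fixed base seed, so a rerun reproduces the model exactly. The sketch below is hypothetical; the seeding scheme and function name are assumptions, not the repo's code.

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Derive each tree's sampling RNG from a fixed base seed so that
// reruns with the same config produce the identical subsample.
std::vector<bool> sampleExamples(int numExamples, double rate,
                                 uint64_t baseSeed, int treeIndex) {
  std::mt19937_64 rng(baseSeed + static_cast<uint64_t>(treeIndex));
  std::bernoulli_distribution keep(rate);
  std::vector<bool> selected(numExamples);
  for (int i = 0; i < numExamples; ++i) {
    selected[i] = keep(rng);
  }
  return selected;
}
```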

New features:

  1. byte/short: two layers of storage, saving both memory and CPU (sketched below).
  2. Taking hints from previous feature importances: the top 1/3 of features are stored as shorts, the rest as bytes.
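A minimal sketch of the two-layer storage, assuming a per-column choice between 8-bit and 16-bit bucket ids; the names are illustrative, not the repo's DataSet internals. Columns hinted as important get more buckets (finer split candidates) at twice the per-value cost.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// A column stores bucket ids as bytes when 256 buckets suffice, and as
// shorts when the feature was hinted as important and gets more buckets.
struct Column {
  bool wide = false;               // false -> 8-bit ids, true -> 16-bit ids
  std::vector<uint8_t> narrowIds;  // used when !wide (<= 256 buckets)
  std::vector<uint16_t> wideIds;   // used when wide  (<= 65536 buckets)

  uint16_t bucket(std::size_t i) const {
    return wide ? wideIds[i] : narrowIds[i];
  }
};
```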

Parameters (collected in the struct sketch after this list):

  • m: number of trees
  • n: number of leaves per tree
  • r: example sampling rate
  • s: feature sampling rate
  • d: number of data points
  • f: number of features
  • k: number of buckets
  • ml: minimum number of data points per leaf
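For illustration only, the training parameters above map onto a plain struct like the one below; the repo's actual Config reads them from its own config format, so the names and defaults here are assumptions.

```cpp
// Hypothetical grouping of the training parameters listed above;
// the default values are illustrative, not the repo's.
struct TrainConfig {
  int m = 100;     // number of trees
  int n = 16;      // number of leaves per tree
  double r = 0.2;  // example sampling rate
  double s = 1.0;  // feature sampling rate
  int k = 256;     // number of buckets per feature
  int ml = 100;    // minimum number of data points per leaf
};
```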

Complexity:

Memory: max(f * d1 * 8, [f * d, 2 * f * d)) bytes, i.e. the larger of the raw doubles held for the bucketization sample (d1 rows at 8 bytes per value) and the compressed columns (1 to 2 bytes per value, depending on the byte/short mix). For the 40M-row, 500-feature goal, the compressed columns take roughly 20 to 40 GB.

Algorithmic:

  1. Bucketization: O(f * d1 * log(d1)), sorting the first d1 rows per feature to choose bucket boundaries.
  2. Reading the remaining data: O(f * d2 * log(k)), binary-searching each of the remaining d2 values into one of k buckets.
  3. Single best split: O(f' * d' + f' * k), where f' and d' are the sampled feature and example counts.
  4. Trees, where S is the cost of one best-split pass:
  • depth-k balanced tree: k * S
  • single n-leaves tree: #splits: (2n - 3), O(S * n * log(n)) (roughly)

Example workload: d: 20M, example sampling: 4M, feature sampling rate:

Components:

  • Config: specifies the data format and training parameters
  • DataSet: column-wise storage, with self-compression
  • Tree: works on both compressed and raw data
  • TreeRegressor: k-leaf regression tree
  • GbmFun: the function to extend for different types of loss (see the sketch below)
  • Gbm: gradient boosting machine
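A hedged sketch of what a GbmFun-style loss extension could look like for squared error; the method names and signatures are illustrative assumptions, not the repo's actual interface. Supporting a different loss (e.g. logistic for classification) would mean swapping these two methods, which matches the extensibility goal above.

```cpp
#include <cstddef>
#include <vector>

// Illustrative least-squares "GbmFun": a loss supplies pseudo-residuals
// (the negative gradient) and the optimal constant value per leaf.
struct LeastSquaresFun {
  // For squared error the pseudo-residual is simply y - prediction.
  void computeGradients(const std::vector<double>& y,
                        const std::vector<double>& pred,
                        std::vector<double>& grad) const {
    for (std::size_t i = 0; i < y.size(); ++i) {
      grad[i] = y[i] - pred[i];
    }
  }

  // The best constant for a leaf is the mean residual of its examples.
  double leafValue(const std::vector<double>& grad,
                   const std::vector<int>& leafExamples) const {
    if (leafExamples.empty()) return 0.0;
    double sum = 0.0;
    for (int idx : leafExamples) sum += grad[idx];
    return sum / static_cast<double>(leafExamples.size());
  }
};
```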
