No variables selected on high-dimensional data #26

Open
Gambleruin opened this issue Apr 15, 2018 · 1 comment

Comments

@Gambleruin

I used the stabs package on high-dimensional gene expression data. Regardless of how I tried (lars.lasso would not work on my input, but glmnet.lasso works just fine), every time I run it the result shows 'no variables selected'. For example:

stab_lasso <- stabsel(x = train_dat, y = tr_target, fitfun = glmnet.lasso, cutoff = 0.75, PFER = 0.2)

My data has extremely high dimensionality: 6248 columns but only 224 rows (observations).

If I use glmnet.lasso_maxCoef instead, the algorithm simply returns the first 45 genes as selected, which is clearly not right.

What am I doing wrong? Has anyone tried this on real-world high-dimensional data, rather than only on low-dimensional data?

@hofnerb
Owner

hofnerb commented Apr 16, 2018

The problem is two-fold:

  1. Genetic data is often known to have little influence (and, on top of that, to be correlated). Thus, results tend to be quite unstable. This is further aggravated by the relatively small sample size. This was also observed by others.

  2. Your PFER is rather low. You would accept on average only 0.2 false positive variables, which, given the size of your data set and the instability of the results discussed in 1., is very low. The corresponding (uncorrected) type 1 error (= significance level) would be as low as 3.02e-05; see the output from your code, or

stabsel_parameters(p = 6248, cutoff = 0.75, PFER = 0.2)
# Stability selection with unimodality assumption
# 
# Cutoff: 0.75; q: 34; PFER (*):  0.189 
#    (*) or expected number of low selection probability variables
# PFER (specified upper bound):  0.2 
# PFER corresponds to signif. level 3.02e-05 (without multiplicity adjustment)
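
For comparison (purely illustrative, not a recommendation), a less strict bound yields a larger q and a larger per-comparison significance level:

## same setup as above, but allowing on average up to 2 falsely selected variables
stabsel_parameters(p = 6248, cutoff = 0.75, PFER = 2)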

Please have a look at the literature, e.g.

citation("stabs")[[2]]

i.e.,

  Benjamin Hofner, Luigi Boccuto and Markus Goeker
  (2015). Controlling false discoveries in
  high-dimensional situations: Boosting with stability
  selection. BMC Bioinformatics, 16:144.
  doi:10.1186/s12859-015-0575-3

A BibTeX entry for LaTeX users is

  @Article{Hofner:StabSel:2015,
    title = {Controlling false discoveries in high-dimensional situations: Boosting with stability selection},
    author = {Benjamin Hofner and Luigi Boccuto and Markus G\"oker},
    journal = {{BMC Bioinformatics}},
    year = {2015},
    volume = {16},
    pages = {144},
    url = {http://dx.doi.org/10.1186/s12859-015-0575-3},
  }

where we give some advice on the choice of the PFER (by relating it to the usual type 1 error rate) and on the best way to set the parameters (i.e., fix q and modify any of the others, which comes without any computational burden if you call stabsel on the result of the original call to stabsel); see

?stabsel.stabsel
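
For instance, something along these lines re-uses the selection frequencies already computed for your stab_lasso object (the cutoff and PFER values below are illustrative only, not recommendations):

## update the fitted object from above without re-running the subsampling
stab_lasso_relaxed <- stabsel(stab_lasso, cutoff = 0.6, PFER = 1)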

With respect to your issue with glmnet.lasso_maxCoef, could you please provide a minimal working example, i.e., code and (simulated or real) data that allow us to replicate the issue?
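
Something along the following lines would be sufficient (a sketch only; the simulated data, seed, and dimensions are mere placeholders):

library("stabs")
library("glmnet")
set.seed(1234)
## placeholder data with the same shape as described above
x <- matrix(rnorm(224 * 6248), nrow = 224)
colnames(x) <- paste0("gene", seq_len(ncol(x)))
y <- rnorm(224)
## your original call applied to the simulated data; substitute
## fitfun = glmnet.lasso_maxCoef to demonstrate the behaviour you describe
stabsel(x = x, y = y, fitfun = glmnet.lasso, cutoff = 0.75, PFER = 0.2)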
