No variables selected on high-dimensional data #26

Open
Gambleruin opened this issue Apr 15, 2018 · 1 comment

Comments

@Gambleruin

I used the stabs package on high-dimensional gene expression data. Regardless of how I tried (lars.lasso would not work on my input, but glmnet.lasso works just fine), every time I run it the result shows 'no variables selected'. For example:

stab_lasso <- stabsel(x = train_dat, y = tr_target, fitfun = glmnet.lasso, cutoff = 0.75, PFER = 0.2)

My data has extremely high dimensionality: 6248 columns but only 224 rows (observations).

If I use glmnet.lasso_maxCoef instead, the algorithm simply returns the first 45 genes as selected, which is clearly not right.

What am I doing wrong? Has anyone tried this on real-world high-dimensional data, rather than only on low-dimensional data?

@hofnerb
Owner

hofnerb commented Apr 16, 2018

The problem is two-fold:

  1. Genetic data is often known to have little influence (and, on top of that, to be correlated). Thus, results tend to be quite unstable. This is further aggravated by the relatively small sample size. This was also observed by others.

  2. Your PFER is rather low. You would accept on average only 0.2 false positive variables, which, given the size of your data set and the instability of the results discussed in 1., is very low. The corresponding (uncorrected) type 1 error (= significance level) would be as low as 3.02e-05; see the output from your code, or

stabsel_parameters(p = 6248, cutoff = 0.75, PFER = 0.2)
# Stability selection with unimodality assumption
# 
# Cutoff: 0.75; q: 34; PFER (*):  0.189 
#    (*) or expected number of low selection probability variables
# PFER (specified upper bound):  0.2 
# PFER corresponds to signif. level 3.02e-05 (without multiplicity adjustment)
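
For comparison (purely illustrative, not a recommendation), a less strict bound yields a larger q and a larger per-comparison significance level:

## same setup as above, but allowing on average up to 2 falsely selected variables
stabsel_parameters(p = 6248, cutoff = 0.75, PFER = 2)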

Please have a look at the literature, e.g.

citation("stabs")[[2]]

i.e.,

  Benjamin Hofner, Luigi Boccuto and Markus Goeker
  (2015). Controlling false discoveries in
  high-dimensional situations: Boosting with stability
  selection. BMC Bioinformatics, 16:144.
  doi:10.1186/s12859-015-0575-3

A BibTeX entry for LaTeX users is

  @Article{Hofner:StabSel:2015,
    title = {Controlling false discoveries in high-dimensional situations: Boosting with stability selection},
    author = {Benjamin Hofner and Luigi Boccuto and Markus G\"oker},
    journal = {{BMC Bioinformatics}},
    year = {2015},
    volume = {16},
    pages = {144},
    url = {http://dx.doi.org/10.1186/s12859-015-0575-3},
  }

where we give some advice on the choice of the PFER (by relating it to the usual type 1 error rate) and on the best way to set the parameters (i.e., fix q and modify any of the others, which comes without any computational burden if you call stabsel on the result of the original call to stabsel); see

?stabsel.stabsel
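
For instance, something along these lines re-uses the selection frequencies already computed for your stab_lasso object (the cutoff and PFER values below are illustrative only, not recommendations):

## update the fitted object from above without re-running the subsampling
stab_lasso_relaxed <- stabsel(stab_lasso, cutoff = 0.6, PFER = 1)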

With respect to your issue with glmnet.lasso_maxCoef, could you please provide a minimal working example, i.e., code and (simulated or real) data that allow us to replicate the issue?
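
Something along the following lines would be sufficient (a sketch only; the simulated data, seed, and dimensions are mere placeholders):

library("stabs")
library("glmnet")
set.seed(1234)
## placeholder data with the same shape as described above
x <- matrix(rnorm(224 * 6248), nrow = 224)
colnames(x) <- paste0("gene", seq_len(ncol(x)))
y <- rnorm(224)
## your original call applied to the simulated data; substitute
## fitfun = glmnet.lasso_maxCoef to demonstrate the behaviour you describe
stabsel(x = x, y = y, fitfun = glmnet.lasso, cutoff = 0.75, PFER = 0.2)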
