Skip to content

Reference-free celltype and phenotype annotation in single cell RNAseq by learning geneset representations

Notifications You must be signed in to change notification settings

DavisLaboratory/scDECAF

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

49 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

scDECAF - single cell disentanglement by canonical factors

Fig1_scDECAF_github

scDECAF is a statistical learning algorithm to identify cell types, states and programs in single-cell gene expression data using vector representation of gene sets. scDECAF improves biological interpretation by selecting a subset of most biologically relevant programs.

Installation

(Requires R >= 4.0.0)

install.packages("devtools")
devtools::install_github("DavisLaboratory/scDECAF")

Examples

See notebooks in the reproducibility repository

Quick start

scDECAF takes the followings as input:

data: A numeric matrix of log-normalised single cell gene expression (SCT normalisation from seurat, scran- or scanpy- normalised data). Rows are genes, columns are the cells.

genesetlist: A list of lists. Each element of the list is a list of gene IDs or symbols (depending on rownames(data)) in a gene set. The outer list has to be named.

hvg: Character vector of highly variable genes in data. If the data is already subsetted on HVGs, then set this to rownames(data).

embedding: A numeric matrix 2-D or higher dimensional embedding of the cells, e.g. UMAP, PCA, PHATE, Diffusion components etc. Rows are cells, columns are the dimension of the data in the reduced dimension space.

min_gs_size: Scalar. Minimum number of genes in a gene set (after considering hvgs).

lambda: Shrinkage regulariser penalty.

n_components: Scalar. This is number of components in the CCA model. Has to be smaller than the number of gene sets in genesetlist or the prunned genesetlist.

k: Scalar. Number of nearest neighbors for cell type refinement.

# sparse selection of most relevant genesets
# also plots number of genesets surviving the sparsity threshold
selected_gs <- pruneGenesets(data = x, genesetlist = my_genesets, hvg = hvg,
                            embedding = cell_embedding, min_gs_size = 3, lambda = exp(-3))
                            


# print selected genesets
as.character(selected_gs)



# print ranking/importance of geneset
head(attributes(selected_gs)$"glmnet_coef")



# subset on selected genesets from the full genesets list and prepare gene-geneset assignment binary matrix
rownames(cell_embedding) = cell_names
target <- genesets2ids(x[match(hvg, rownames(x)),], my_genesets[selected_gs])



# compute geneset scores per cell for the sparse set of selected genesets. `K` is number of components in the CCA model. 
ann_res <- scDECAF(data = x, gs = target, standardize = FALSE, 
                   hvg = hvg, k = 10, embedding = cell_embedding,
                   n_components = ncol(target) - 1, max_iter = 2, thresh = 0.5)
                   


# get geneset scores per cell for the sparse set of genesets
scores_constrained = attributes(ann_res)$raw_scores


You can now store the scores to your data container (sce, seuratObj, anndata etc) and visualise the scores per cell.

About

Reference-free celltype and phenotype annotation in single cell RNAseq by learning geneset representations

Topics

Resources

Stars

Watchers

Forks

Packages

No packages published

Languages