An R implementation of the Gene Frequency - Inverse Cell Frequency (GF-ICF) method for single-cell data normalization (Gambardella et al. 2019). The package also includes Phenograph clustering with the Louvain method, using the RcppAnnoy library from uwot and a naive but fast parallel implementation of Jaccard coefficient estimation using RcppParallel. It also supports data reduction with either Principal Component Analysis (PCA) or Latent Semantic Analysis (LSA) before applying t-SNE or UMAP for single-cell data visualization.
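To make the idea behind the normalization concrete, here is a minimal base-R sketch of a TF-IDF-style gene frequency / inverse cell frequency weighting. It is an illustration only: the function name gficf_sketch, the toy counts, the smoothing constants and the final unit-length rescaling are assumptions made for this example, and the package's own function may differ in its details.

```r
# Toy count matrix: genes in rows, cells in columns (made-up numbers).
counts <- matrix(
  c(10, 0, 3,
     0, 5, 0,
     2, 2, 2,
     0, 0, 8),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("cell", 1:3))
)

# Hypothetical helper, not part of the gficf package.
gficf_sketch <- function(counts) {
  # Gene frequency: each count divided by the total counts of its cell.
  gf <- sweep(counts, 2, colSums(counts), "/")
  # Inverse cell frequency: genes detected in few cells get a larger weight.
  # The +1 smoothing terms are an assumption of this sketch.
  n_cells <- ncol(counts)
  n_expressing <- rowSums(counts > 0)
  icf <- log((n_cells + 1) / (n_expressing + 1)) + 1
  # icf has one entry per gene and is recycled down the rows of gf.
  w <- gf * icf
  # Rescale every cell (column) to unit Euclidean length.
  sweep(w, 2, sqrt(colSums(w^2)), "/")
}

norm <- gficf_sketch(counts)
```

In this toy matrix, gene4 is detected in a single cell, so within cell3 it ends up with a larger weight than gene3, which is detected in every cell.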
Examples & Functionality:
- General use case scenario (normalization and clustering). See the example.
- Embed new cells in an already existing embedded space and classify them. See the example.
- Identify active pathways in a group of cells. See the example.
- Identify marker genes across clusters. See the example.
Sep. 24 2019 New functionality: Classify cells using GF-ICF transformation and the K-nn algorithm. See the example.
Sep. 09 2019 Version 0.3.1: Save and load gficf objects, support for Leiden, and a few bug fixes.
Aug. 24 2019 Support for binary packages for OSX and Windows (only R>=3.5)
Aug. 22 2019 New functionality: Identify marker genes across clusters of cells. See the example.
Aug. 20 2019 RcppParallel Mann–Whitney U test (Benchmarks against R implementation)
Aug. 13 2019 New functionality: Identify active pathways in a group of cells. See the example.
Aug. 12 2019 RcppParallel Jaccard estimation in Phenograph (20X speed boost with 6 cores)
Jul. 26 2019 New functionality: Embed new cells in an already existing embedded space. See the example.
Jul. 12 2019. Paper Accepted and now available HERE.
Jul. 03 2019. Version 0.1 with example on Tabula Muris.
# Install required bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE)) {install.packages("BiocManager")}
BiocManager::install(setdiff(c("edgeR", "BiocParallel", "fgsea", "biomaRt"), rownames(installed.packages())), update = FALSE)
# install gficf package
install.packages(pkgs = "gficf",repos = c("https://dibbelab.github.io/Rrepo/","https://cloud.r-project.org"))
gficf makes use of Rcpp, RcppParallel and RcppGSL, so you have to carry out a few extra steps before being able to build this package. The steps are reported below for each platform.
You need the gsl development library to successfully install the RcppGSL library.
On Ubuntu/Debian systems this can be done by running the following command from the terminal:
sudo apt-get install libgsl-dev
- On macOS, open a terminal and run
xcode-select --install
to install the command line developer tools.
- We then need to install the gsl libraries. This can be done via Homebrew. So, still from the terminal, run
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
and then use Homebrew to install gsl with the following command:
brew install gsl
- On Windows, skip this first step if you are using RStudio, because it will prompt you automatically. Otherwise install Rtools and ensure
path\to\Rtools\bin
is on your PATH.
- Download the gsl library for Windows from SourceForge and extract it in
C:\
or wherever you prefer.
- Open R/RStudio and, before installing the package from GitHub, execute the following command in the R console.
# Change the path if you installed the gsl library somewhere other than the default path.
# Be sure to use the format '"path/to/gsl-xxx_mingw-xxx/gsl-xxx-static"'
# This way the " characters are maintained and any spaces in the path are preserved.
# For example, for gsl-2.2.1 compiled with mingw-6.2.0:
Sys.setenv(GSL_LIBS = '"C:/gsl-2.2.1_mingw-6.2.0/gsl-2.2.1-static"')
Then run the following commands in the R console:
if (!requireNamespace("devtools", quietly = TRUE)) { install.packages("devtools") }
devtools::install_github("dibbelab/gficf")
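Before running the install commands above, it can help to check from R whether the GSL setup described for each platform is actually visible. The helper below is only a rough sketch: the name gsl_visible is made up for this example, and the check assumes that gsl-config is on the PATH on Linux/macOS or that the GSL_LIBS environment variable has been set on Windows, as described in the steps above. It does not prove that the build will succeed.

```r
# Hypothetical helper, not part of the gficf package.
gsl_visible <- function() {
  # gsl-config is installed with the GSL development files on Linux and macOS.
  has_gsl_config <- nzchar(Sys.which("gsl-config"))
  # On Windows the steps above rely on the GSL_LIBS environment variable instead.
  has_gsl_libs <- nzchar(Sys.getenv("GSL_LIBS"))
  unname(has_gsl_config || has_gsl_libs)
}

if (!gsl_visible()) {
  message("GSL not found: follow the platform-specific steps before installing from GitHub.")
}
```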
In the gficf package, the function clustcells implements the Phenograph algorithm, which is a clustering method designed for high-dimensional single-cell data analysis. It works by creating a graph ("network") representing phenotypic similarities between cells, calculating the Jaccard coefficient between nearest-neighbor sets, and then identifying communities in this graph using the well-known Louvain method or the Leiden algorithm.
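The graph-building step can be sketched in a few lines of base R. This is a brute-force illustration with exact Euclidean distances on made-up data, not the package's approximate, parallel implementation; the function name knn_jaccard and the choice k = 2 are assumptions for the example.

```r
set.seed(1)
# Toy data: 6 cells x 2 features forming two well-separated groups (made-up numbers).
x <- rbind(matrix(rnorm(6, mean = 0, sd = 0.1), ncol = 2),
           matrix(rnorm(6, mean = 5, sd = 0.1), ncol = 2))

# Hypothetical helper, not part of the gficf package.
knn_jaccard <- function(x, k) {
  d <- as.matrix(dist(x))  # exact pairwise Euclidean distances
  n <- nrow(x)
  # k nearest neighbours of every cell, excluding the cell itself.
  nn <- lapply(seq_len(n), function(i) setdiff(order(d[i, ]), i)[seq_len(k)])
  # Jaccard coefficient: shared neighbours divided by all distinct neighbours.
  jac <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      jac[i, j] <- length(intersect(nn[[i]], nn[[j]])) /
        length(union(nn[[i]], nn[[j]]))
    }
  }
  jac
}

jac <- knn_jaccard(x, k = 2)
```

Cells in the same group share neighbours and get a positive coefficient, while cells from different groups share none and get zero. These coefficients are what a community detection algorithm would then use as edge weights.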
This implementation of Phenograph uses approximate nearest neighbors found using the RcppAnnoy library available through the uwot package. The supported distance metrics for KNN (set by the dist.method parameter) are:
- Euclidean (default)
- Cosine
- Manhattan
- Hamming
Please note that Hamming support is a lot slower than the other metrics. It is not recommended if you have more than a few hundred features, and even then expect the index building phase to take several minutes in situations where the Euclidean metric would take only a few seconds.
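For reference, the four metrics listed above can be written down for a single pair of vectors as follows. These are the textbook definitions, given only to show what each metric measures; the function name pair_distance and the lowercase method labels are assumptions for this sketch and are not necessarily the values accepted by dist.method, and the nearest-neighbor library may scale some of them differently internally.

```r
# Hypothetical helper, not part of the gficf package.
pair_distance <- function(a, b,
                          method = c("euclidean", "cosine", "manhattan", "hamming")) {
  method <- match.arg(method)
  switch(method,
    euclidean = sqrt(sum((a - b)^2)),                               # straight-line distance
    cosine    = 1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2))), # 1 - cosine similarity
    manhattan = sum(abs(a - b)),                                    # sum of absolute differences
    hamming   = sum(a != b)                                         # number of differing positions
  )
}

a <- c(1, 0, 2, 0)
b <- c(1, 1, 0, 0)
sapply(c("euclidean", "cosine", "manhattan", "hamming"),
       function(m) pair_distance(a, b, m))
```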
After the Jaccard coefficients among cells have been computed (custom RcppParallel implementation), community detection is performed using either igraph or the native Seurat implementation.
All supported community detection algorithms (set by the community.algo parameter) are:
- Louvain classic (default)
- Louvain with modularity optimization (native C++ function imported from Seurat)
- Louvain algorithm with multilevel refinement (native C++ function imported from Seurat)
- Leiden algorithm from Traag et al. 2019 (needs to be installed via sudo -H pip install leidenalg igraph)
- Walktrap
- Fastgreedy
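As a rough illustration of what the igraph-based options do, the sketch below runs three of these algorithms directly through igraph on a tiny weighted graph. It does not call clustcells; the toy edge list is made up, and the weights simply stand in for Jaccard coefficients. The code is guarded so that it does nothing if igraph is not installed.

```r
have_igraph <- requireNamespace("igraph", quietly = TRUE)

if (have_igraph) {
  # Two triangles (cells 1-3 and 4-6) joined by one weak edge.
  edges <- data.frame(
    from   = c(1, 1, 2, 4, 4, 5, 3),
    to     = c(2, 3, 3, 5, 6, 6, 4),
    weight = c(1, 1, 1, 1, 1, 1, 0.1)
  )
  g <- igraph::graph_from_data_frame(edges, directed = FALSE,
                                     vertices = data.frame(name = 1:6))

  # Each call returns a communities object; the "weight" edge attribute
  # is used by default. Memberships are reported in vertex order 1..6.
  louvain  <- as.integer(igraph::membership(igraph::cluster_louvain(g)))
  walktrap <- as.integer(igraph::membership(igraph::cluster_walktrap(g)))
  greedy   <- as.integer(igraph::membership(igraph::cluster_fast_greedy(g)))
}
```

On a graph this small all three methods should separate the two triangles. On real data they can disagree, which is one reason the algorithm is exposed as a parameter.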
Apart from the R man pages, you may be interested in the following readings:
- A description of UMAP using algorithmic terminology similar to t-SNE, rather than the more topological approach of the UMAP publication.
- Some examples of the output of UMAP on some datasets, compared to t-SNE.
- Some results of running UMAP on the simple datasets from How to Use t-SNE Effectively.