clustertree
is a fast and extensible R package for estimating the empirical cluster tree, a hierarchical representation of high-density clusters, defined (recursively) as the connected subsets of:
- The Robust Single Linkage (RSL) algorithm from:
Chaudhuri, Kamalika, and Sanjoy Dasgupta. "Rates of convergence for the cluster tree." Advances in Neural Information Processing Systems. 2010.
- The KNN and Mutual KNN variants from Algorithm 2 analyzed in:
Chaudhuri, Kamalika, et al. "Consistent procedures for cluster tree estimation and pruning." IEEE Transactions on Information Theory 60.12 (2014): 7900-7912.
Additional tools for working with the tree(s) are also provided. Namely,
- Stuetzle's Runt pruning technique is provided, from:
Stuetzle, Werner. "Estimating the cluster tree of a density by analyzing the minimal spanning tree of a sample." Journal of classification 20.1 (2003): 025-047.
- Eldridge et. al's Merge Distortion metric is also provided, from:
Eldridge, Justin, Mikhail Belkin, and Yusu Wang. "Beyond hartigan consistency: Merge distortion metric for hierarchical clustering." Conference on Learning Theory. 2015.
The tree(s) returned are by default standard hclust
objects. Support for densities estimated via KDE techniques are planned for the future. For more information regarding the utility of this package and of the cluster tree itself, see the Usage and Additional References sections, respectively.
The package currently only exists on github. The installation options are as follows:
-
Installed directly from the repo with help from the devtools package, i.e.
install.packages("devtools") devtools::install_github("peekxc/clustertree")
-
Download the most recent successful build from AppVeyor
The package is actively developing. A release candidate for CRAN will be created eventually.
library("clustertree")
data("iris")
x <- as.matrix(iris[, 1:4])
Run Robust Single Linkage
ct <- clustertree(x, k = 15L, alpha = sqrt(2), estimator = "RSL")
ct
Cluster tree object of: 150 objects.
Estimator used: Robust Single Linkage
Parameters: k = 15, alpha = 1.4142, dim = 4
Plot the cluster tree. Like other hierarchical clustering algorithms, the main tree is stored internally as an 'hclust' object
plot(ct)
is(ct$hc, "hclust")
TRUE
You can also use either of the two linkage criteria studied From Algorithm 2 in [2] listed above:
ct2 <- clustertree(x, k = 15L, alpha = sqrt(2), estimator = "KNN")
ct3 <- clustertree(x, k = 15L, alpha = sqrt(2), estimator = "mutual KNN")
Unlike other hierarchical algorithms, it's possible that these do not form complete hierarchies. This can happen when there is a sufficiently low density areas separating high density clusters. For example, it's well known the Iris setosa species is clearly separable from the other two species. This is reflected in the Mutual KNN graph, the sparser of the two estimators. These disjoint connected components are also stored as trees, i.e.
Iris Setosa tree
length(ct3$hc) ## == 2
ct3$hc[[1]]
...
Cluster method : mutual knn
Number of objects: 50
ct3$hc[[2]]
...
Cluster method : mutual knn
Number of objects: 100
Typical hierarchical clustering structures represent every singleton as a possible cluster, but obviously, not every singleton will be in a disjoint high density cluster. One approach to making these modes more apparent is to specify a threshold to to use as a means of 'runt pruning' from [3] above.
This can significantly simplify the tree:
ct_simplified <- runt_prune(ct, 2)
plot(ct_simplified[[1]]) ## Three detected modes of density
The cluster tree theory itself has a long history. For a brief overview of the definition, see section 11.13 of:
Hartigan, John A., and J. A. Hartigan. Clustering algorithms. Vol. 209. New York: Wiley, 1975. for an overview of what Hartigan refers to as the density-contour tree, and brief overview of the background and associated concepts of formal high-density clustering.
Established notions of consistency of the above class of estimators can be found in:
Hartigan, John A. "Consistency of single linkage for high-density clusters." Journal of the American Statistical Association 76.374 (1981): 388-394.
This package is part of the Google Summer of Code 2017 initiative, under the R Project.