This package provides a set of algorithms for data clustering.
Pkg.add("Clustering")
Currently working algorithms:
- Kmeans
- Affinity Propagation
To be available:
- K medoids
- DP means
- ISO Data
Interfaces:
# perform K-means (centers are updated inplace)
result = kmeans!(x, centers, opts)
# perform K-means based on a given set of inital centers
result = kmeans(x, init_centers, opts)
# perform K-means to get K centers
result = kmeans(x, k, opts)
result = kmeans(x, k)
All these methods return an instance of KmeansResult
, it is defined as
type KmeansResult{T<:FloatingPoint}
centers::Matrix{T} # cluster centers (d x k)
assignments::Vector{Int} # assignments (n)
costs::Vector{T} # costs of the resultant assignments (n)
counts::Vector{Int} # number of samples assigned to each cluster (k)
cweights::Vector{T} # cluster weights (k)
total_cost::Float64 # total cost (i.e. objective) (k)
iterations::Int # number of elapsed iterations
converged::Bool # whether the procedure converged
end
Options:
Note: options are specified using keyword arguments.
name | descrption | default value |
---|---|---|
max_iters | maximum number of iterations | 100 |
tol | tolerable objv change at convergence | 1.0e-6 |
weights | sample weights (a vector or nothing) | nothing |
display | verbosity (:none , :final , or :iter ) |
:iter |
x = rand(100, 10000) # a set of 10000 samples (each of dimension 100)
k = 50 # the number of clustering
result = kmeans(x, k; max_iter=50, display=:iter)
Affinity Propagation is an algorithm that uses loopy belief propagation to run MAP inference to identify some exemplars. Unlike kmeans, the exemplars are chosen from the original samples. After the algorithm returns, every sample will be assigned to one of the exemplars.
The input of the algorithm is a similarity matrix S
.
Unlike kmeans, you don't need to (and cannot) specify the number of
clusters. But the diagonal values of S
will affect how many
clusters you will get at the end. Specifically, S[i,j]
could be
interpreted as the tendency of assigning point i
to point j
(when j
is an exemplar). So
S
need NOT to be symmetricS[i,i]
represents the willingness of assigned pointi
to itself. So generally larger diagonal values forS
means more clusters. For example, ifS[i,i]==max(S)
for alli
, then every point will be an exemplar itself.
Usually, assigning the diagonal of S
to be the median of all the
rest entries could lead to reasonable results.
Interfaces:
result = affinity_propagation(S, opts)
where the following options could be specified using keyword arguments
max_iter::Integer = 500, # max number of iterations
n_stop_check::Integer = 10, # stop if exemplars not changed for this number of iterations
damp::FloatingPoint = 0.5, # damping factor for message updating, 0 means no damping
display::Symbol = :iter # whether progress is shown
the returning value is a struct that looks like this:
type AffinityPropagationResult
exemplar_index ::Vector{Int} # index for exemplars (centers)
assignments ::Vector{Int} # assignments for each point
iterations ::Int # number of iterations executed
converged ::Bool # converged or not
end