
Commit

Merge branch 'kmeans'
rasbt committed Apr 26, 2016
2 parents 11db7ed + b11aaf4 commit c1993cf
Showing 17 changed files with 1,271 additions and 14 deletions.
4 changes: 2 additions & 2 deletions .travis.yml
@@ -6,11 +6,11 @@ matrix:
- python: 3.5
env: LATEST="false" TENSORFLOW="false" COVERAGE="false" NUMPY_VERSION="1.10.4" SCIPY_VERSION="0.17" SKLEARN_VERSION="0.17" PANDAS_VERSION="0.17.1" MATPLOTLIB_VERSION="1.5.1"
- python: 3.5
env: LATEST="true" TENSORFLOW="false" COVERAGE="false"
env: LATEST="true" TENSORFLOW="false" COVERAGE="true"
- python: 3.4
env: LATEST="true" TENSORFLOW="false" COVERAGE="false"
- python: 3.4
env: LATEST="true" TENSORFLOW="true" COVERAGE="false"
env: LATEST="true" TENSORFLOW="true" COVERAGE="true"
- python: 2.7
env: LATEST="true" TENSORFLOW="false" COVERAGE="false"
- python: 2.7
5 changes: 5 additions & 0 deletions docs/mkdocs.yml
@@ -55,6 +55,8 @@ pages:
- user_guide/feature_extraction/PrincipalComponentAnalysis.md
- user_guide/feature_extraction/LinearDiscriminantAnalysis.md
- user_guide/feature_extraction/RBFKernelPCA.md
- cluster:
- user_guide/cluster/Kmeans.md
- evaluate:
- user_guide/evaluate/confusion_matrix.md
- user_guide/evaluate/plot_decision_regions.md
@@ -74,6 +76,7 @@ pages:
- user_guide/data/mnist_data.md
- user_guide/data/load_mnist.md
- user_guide/data/wine_data.md
- user_guide/data/three_blobs_data.md
- file_io:
- user_guide/file_io/find_filegroups.md
- user_guide/file_io/find_files.md
@@ -103,6 +106,8 @@ pages:
- api_subpackages/mlxtend.data.md
- api_subpackages/mlxtend.evaluate.md
- api_subpackages/mlxtend.feature_selection.md
- api_subpackages/mlxtend.feature_extraction.md
- api_subpackages/mlxtend.cluster.md
- api_subpackages/mlxtend.file_io.md
- api_subpackages/mlxtend.general_plotting.md
- api_subpackages/mlxtend.preprocessing.md
1 change: 1 addition & 0 deletions docs/sources/CHANGELOG.md
@@ -7,6 +7,7 @@
##### New Features

- New TensorFlow estimator for Linear Regression ([`tf_regressor.TfLinearRegression`](./user_guide/tf_regressor/TfLinearRegression.md))
- New k-means clustering estimator ([`cluster.Kmeans`](./user_guide/cluster/Kmeans.md))

##### Changes

27 changes: 16 additions & 11 deletions docs/sources/USER_GUIDE_INDEX.md
@@ -1,23 +1,24 @@
# User Guide Index

## `classifier`

- [`EnsembleVoteClassifier`](user_guide/classifier/EnsembleVoteClassifier.md)
- [`StackingClassifier`](user_guide/classifier/StackingClassifier.md) (new in 0.3.1dev)
- [`StackingClassifier`](user_guide/classifier/StackingClassifier.md)
- [`Perceptron`](user_guide/classifier/Perceptron.md)
- [`Adaline`](user_guide/classifier/Adaline.md)
- [`LogisticRegression`](user_guide/classifier/LogisticRegression.md)
- [`NeuralNetMLP`](user_guide/classifier/NeuralNetMLP.md)
- [`SoftmaxRegression`](user_guide/classifier/SoftmaxRegression.md) (new in 0.3.1dev)
- [`SoftmaxRegression`](user_guide/classifier/SoftmaxRegression.md)

## `tf_classifier` (TensorFlow Classifier)
- [`TfSoftmaxRegression`](user_guide/tf_classifier/TfSoftmaxRegression.md) (new in 0.3.1dev)
- [`TfMultiLayerPerceptron`](user_guide/tf_classifier/TfMultiLayerPerceptron.md) (new in 0.3.1dev)
- [`TfSoftmaxRegression`](user_guide/tf_classifier/TfSoftmaxRegression.md)
- [`TfMultiLayerPerceptron`](user_guide/tf_classifier/TfMultiLayerPerceptron.md)

## `regressor`

- [`LinearRegression`](user_guide/regressor/LinearRegression.md)
- [`StackingRegressor`](user_guide/regressor/StackingRegressor.md) (new in 0.3.1dev)
- [`StackingRegressor`](user_guide/regressor/StackingRegressor.md)

## `tf_regressor` (TensorFlow Regressor)
- [`TfLinearRegression`](user_guide/tf_regressor/TfLinearRegression.md) (new in 0.4.1dev)

## `regression_utils`
- [`plot_linear_regression`](user_guide/regression_utils/plot_linear_regression.md)
@@ -26,9 +27,12 @@
- [`SequentialFeatureSelector`](user_guide/feature_selection/SequentialFeatureSelector.md)

## `feature_extraction`
- [`PrincipalComponentAnalysis`](user_guide/feature_extraction/PrincipalComponentAnalysis.md) (new in 0.3.1dev)
- [`LinearDiscriminantAnalysis`](user_guide/feature_extraction/LinearDiscriminantAnalysis.md) (new in 0.3.1dev)
- [`RBFKernelPCA`](user_guide/feature_extraction/RBFKernelPCA.md) (new in 0.3.1dev)
- [`PrincipalComponentAnalysis`](user_guide/feature_extraction/PrincipalComponentAnalysis.md)
- [`LinearDiscriminantAnalysis`](user_guide/feature_extraction/LinearDiscriminantAnalysis.md)
- [`RBFKernelPCA`](user_guide/feature_extraction/RBFKernelPCA.md)

## `cluster`
- [`Kmeans`](user_guide/cluster/Kmeans.md) (new in 0.4.1dev)

## `evaluate`
- [`confusion_matrix`](user_guide/evaluate/confusion_matrix.md)
@@ -42,7 +46,7 @@
- [`minmax_scaling`](user_guide/preprocessing/minmax_scaling.md)
- [`shuffle_arrays_unison`](user_guide/preprocessing/shuffle_arrays_unison.md)
- [`standardize`](user_guide/preprocessing/standardize.md)
- [`one-hot_encoding`](user_guide/preprocessing/one-hot_encoding.md) (new in 0.3.1dev)
- [`one-hot_encoding`](user_guide/preprocessing/one-hot_encoding.md)

## `data`
- [`autompg_data`](user_guide/data/autompg_data.md)
@@ -51,6 +55,7 @@
- [`mnist_data`](user_guide/data/mnist_data.md)
- [`load_mnist`](user_guide/data/load_mnist.md)
- [`wine_data`](user_guide/data/wine_data.md)
- [`three_blobs`](user_guide/data/three_blobs_data.md)

## `file_io`
- [`find_filegroups`](user_guide/file_io/find_filegroups.md)
399 changes: 399 additions & 0 deletions docs/sources/user_guide/cluster/Kmeans.ipynb

Large diffs are not rendered by default.

311 changes: 311 additions & 0 deletions docs/sources/user_guide/data/three_blobs_data.ipynb

Large diffs are not rendered by default.

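Since the notebook diff is not rendered, here is a minimal sketch of how the new loader is intended to be used. It assumes `three_blobs_data()` follows the same convention as the other `mlxtend.data` loaders (e.g. `wine_data`) and returns the feature array `X` together with the true blob labels `y`:

```python
import numpy as np
from mlxtend.data import three_blobs_data

# Assumption: like the other mlxtend.data loaders, three_blobs_data()
# returns a NumPy feature array X and an integer label array y.
X, y = three_blobs_data()

print('X shape:', X.shape)      # (n_samples, n_features)
print('labels:', np.unique(y))  # expected: three distinct blob labels
```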
9 changes: 9 additions & 0 deletions mlxtend/cluster/__init__.py
@@ -0,0 +1,9 @@
# Sebastian Raschka 2014-2016
# mlxtend Machine Learning Library Extensions
# Author: Sebastian Raschka <sebastianraschka.com>
#
# License: BSD 3 clause

from .kmeans import Kmeans

__all__ = ["Kmeans"]
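With the `__init__.py` above re-exporting the class, the documented import path for the estimator is simply:

```python
# Kmeans is re-exported by mlxtend/cluster/__init__.py
from mlxtend.cluster import Kmeans
```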
114 changes: 114 additions & 0 deletions mlxtend/cluster/base.py
@@ -0,0 +1,114 @@
# Sebastian Raschka 2014-2016
# mlxtend Machine Learning Library Extensions
#
# Base Cluster (Clustering Parent Class)
# Author: Sebastian Raschka <sebastianraschka.com>
#
# License: BSD 3 clause

import numpy as np
from sys import stderr
from time import time


class _BaseCluster(object):

"""Parent Class Base Cluster
A base class that is implemented by
clustering child classes.
"""
def __init__(self, print_progress=0, random_seed=None):
self.print_progress = print_progress
self.random_seed = random_seed
self._is_fitted = False

def fit(self, X):
"""Learn cluster centroids from training data.
Parameters
----------
X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Training vectors, where n_samples is the number of samples and
n_features is the number of features.
Returns
-------
self : object
"""
self._is_fitted = False
self._check_array(X=X)
if self.random_seed is not None:
np.random.seed(self.random_seed)
self._fit(X=X)
self._is_fitted = True
return self

def _fit(self, X):
# Implemented in child class
pass

def predict(self, X):
"""Predict cluster labels of X.
Parameters
----------
X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Training vectors, where n_samples is the number of samples and
n_features is the number of features.
Returns
-------
cluster_labels : array-like, shape = [n_samples]
Predicted cluster labels.
"""
self._check_array(X=X)
if not self._is_fitted:
raise AttributeError('Model is not fitted, yet.')
return self._predict(X)

def _predict(self, X):
# Implemented in child class
pass

def _shuffle(self, arrays):
"""Shuffle arrays in unison."""
r = np.random.permutation(len(arrays[0]))
return [ary[r] for ary in arrays]

def _print_progress(self, iteration, cost=None, time_interval=10):
if self.print_progress > 0:
s = '\rIteration: %d/%d' % (iteration, self.n_iter)
if cost:
s += ' | Cost %.2f' % cost
if self.print_progress > 1:
if not hasattr(self, 'ela_str_'):
self.ela_str_ = '00:00:00'
if not iteration % time_interval:
ela_sec = time() - self.init_time_
self.ela_str_ = self._to_hhmmss(ela_sec)
s += ' | Elapsed: %s' % self.ela_str_
if self.print_progress > 2:
if not hasattr(self, 'eta_str_'):
self.eta_str_ = '00:00:00'
if not iteration % time_interval:
eta_sec = ((ela_sec / float(iteration)) *
self.n_iter - ela_sec)
self.eta_str_ = self._to_hhmmss(eta_sec)
s += ' | ETA: %s' % self.eta_str_
stderr.write(s)
stderr.flush()

def _to_hhmmss(self, sec):
m, s = divmod(sec, 60)
h, m = divmod(m, 60)
return "%d:%02d:%02d" % (h, m, s)

def _check_array(self, X):
if isinstance(X, list):
raise ValueError('X must be a numpy array')
if not len(X.shape) == 2:
raise ValueError('X must be a 2D array. Try X[:, numpy.newaxis]')
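The base class follows a template-method pattern: the public `fit`/`predict` methods handle input checks, seeding, and the fitted-state flag, while child classes only override the private `_fit`/`_predict` hooks. A minimal hypothetical subclass (not part of this commit) sketching that contract:

```python
import numpy as np


class _MeanCluster(_BaseCluster):
    """Toy subclass: a single cluster centered on the column means."""

    def _fit(self, X):
        # Called by _BaseCluster.fit after _check_array and seeding.
        self.center_ = X.mean(axis=0)
        return self

    def _predict(self, X):
        # Called by _BaseCluster.predict after the fitted-state check.
        return np.zeros(X.shape[0], dtype=int)
```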
105 changes: 105 additions & 0 deletions mlxtend/cluster/kmeans.py
@@ -0,0 +1,105 @@
# Sebastian Raschka 2014-2016
# mlxtend Machine Learning Library Extensions
#
# Estimator for k-means clustering
# Author: Sebastian Raschka <sebastianraschka.com>
#
# License: BSD 3 clause

from .base import _BaseCluster
import numpy as np
from time import time
from scipy.spatial.distance import euclidean


class Kmeans(_BaseCluster):
""" K-means clustering class.
Added in 0.4.1dev
Parameters
------------
k : int
Number of clusters
max_iter : int (default: 10)
Maximum number of iterations during cluster assignment.
Cluster re-assignment stops automatically when the algorithm
has converged.
random_seed : int (default: None)
Set random state for the initial centroid assignment.
print_progress : int (default: 0)
Prints progress in fitting to stderr.
0: No output
1: Iterations elapsed
2: 1 plus time elapsed
3: 2 plus estimated time until completion
Attributes
-----------
centroids_ : 2d-array, shape = [k, n_features]
Feature values of the k cluster centroids.
clusters_ : dictionary
The cluster assignments stored as a Python dictionary;
the dictionary keys denote the cluster indices and the items are
Python lists of the sample indices that were assigned to each
cluster.
iterations_ : int
Number of iterations until convergence.
"""

def __init__(self, k, max_iter=10, random_seed=None, print_progress=0):
super(Kmeans, self).__init__(print_progress=print_progress,
random_seed=random_seed)
self.k = k
self.max_iter = max_iter

def _fit(self, X):
"""Learn cluster centroids from training data.
Called in self.fit
"""
self.iterations_ = 0
n_samples = X.shape[0]

# initialize centroids
idx = np.random.choice(n_samples, self.k, replace=False)
self.centroids_ = X[idx]

for _ in range(self.max_iter):

# assign samples to cluster centroids
self.clusters_ = {i: [] for i in range(self.k)}
for sample_idx, cluster_idx in enumerate(
self._get_cluster_idx(X=X, centroids=self.centroids_)):
self.clusters_[cluster_idx].append(sample_idx)

# recompute centroids
new_centroids = np.array([np.mean(X[self.clusters_[k]], axis=0)
for k in sorted(self.clusters_.keys())])

# stop if cluster assignment doesn't change
if (self.centroids_ == new_centroids).all():
break
else:
self.centroids_ = new_centroids

self.iterations_ += 1

return self

def _get_cluster_idx(self, X, centroids):
for sample_idx, sample in enumerate(X):
dist = [euclidean(sample, c) for c in centroids]
yield np.argmin(dist)

def _predict(self, X):
"""Predict cluster labels of X.
Called in self.predict
"""
pred = np.array([idx for idx in self._get_cluster_idx(X=X,
centroids=self.centroids_)])
return pred
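A minimal usage sketch of the new estimator, combining it with the `three_blobs_data` loader added in the same commit (assuming the loader returns a 2D feature array `X` and ground-truth labels `y`):

```python
import numpy as np
from mlxtend.data import three_blobs_data
from mlxtend.cluster import Kmeans

X, y = three_blobs_data()

km = Kmeans(k=3, max_iter=50, random_seed=1, print_progress=0)
km.fit(X)

print('Iterations until convergence:', km.iterations_)
print('Centroids:\n', km.centroids_)

# Cluster labels for the training data (any 2D NumPy array works)
y_clust = km.predict(X)
print('Samples per cluster:', np.bincount(y_clust))
```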
