
Commit

Merge branch 'kmeans'
rasbt committed Apr 26, 2016
2 parents 11db7ed + b11aaf4 commit c1993cf
Showing 17 changed files with 1,271 additions and 14 deletions.
4 changes: 2 additions & 2 deletions .travis.yml
@@ -6,11 +6,11 @@ matrix:
- python: 3.5
env: LATEST="false" TENSORFLOW="false" COVERAGE="false" NUMPY_VERSION="1.10.4" SCIPY_VERSION="0.17" SKLEARN_VERSION="0.17" PANDAS_VERSION="0.17.1" MATPLOTLIB_VERSION="1.5.1"
- python: 3.5
env: LATEST="true" TENSORFLOW="false" COVERAGE="false"
env: LATEST="true" TENSORFLOW="false" COVERAGE="true"
- python: 3.4
env: LATEST="true" TENSORFLOW="false" COVERAGE="false"
- python: 3.4
env: LATEST="true" TENSORFLOW="true" COVERAGE="false"
env: LATEST="true" TENSORFLOW="true" COVERAGE="true"
- python: 2.7
env: LATEST="true" TENSORFLOW="false" COVERAGE="false"
- python: 2.7
5 changes: 5 additions & 0 deletions docs/mkdocs.yml
@@ -55,6 +55,8 @@ pages:
- user_guide/feature_extraction/PrincipalComponentAnalysis.md
- user_guide/feature_extraction/LinearDiscriminantAnalysis.md
- user_guide/feature_extraction/RBFKernelPCA.md
- cluster:
- user_guide/cluster/Kmeans.md
- evaluate:
- user_guide/evaluate/confusion_matrix.md
- user_guide/evaluate/plot_decision_regions.md
@@ -74,6 +76,7 @@ pages:
- user_guide/data/mnist_data.md
- user_guide/data/load_mnist.md
- user_guide/data/wine_data.md
- user_guide/data/three_blobs_data.md
- file_io:
- user_guide/file_io/find_filegroups.md
- user_guide/file_io/find_files.md
@@ -103,6 +106,8 @@ pages:
- api_subpackages/mlxtend.data.md
- api_subpackages/mlxtend.evaluate.md
- api_subpackages/mlxtend.feature_selection.md
- api_subpackages/mlxtend.feature_extraction.md
- api_subpackages/mlxtend.cluster.md
- api_subpackages/mlxtend.file_io.md
- api_subpackages/mlxtend.general_plotting.md
- api_subpackages/mlxtend.preprocessing.md
1 change: 1 addition & 0 deletions docs/sources/CHANGELOG.md
@@ -7,6 +7,7 @@
##### New Features

- New TensorFlow estimator for Linear Regression ([`tf_regressor.TfLinearRegression`](./user_guide/tf_regressor/TfLinearRegression.md))
- New k-means clustering estimator ([`cluster.Kmeans`](./user_guide/cluster/Kmeans.md))

##### Changes

27 changes: 16 additions & 11 deletions docs/sources/USER_GUIDE_INDEX.md
@@ -1,23 +1,24 @@
# User Guide Index

## `classifier`

- [`EnsembleVoteClassifier`](user_guide/classifier/EnsembleVoteClassifier.md)
- [`StackingClassifier`](user_guide/classifier/StackingClassifier.md) (new in 0.3.1dev)
- [`StackingClassifier`](user_guide/classifier/StackingClassifier.md)
- [`Perceptron`](user_guide/classifier/Perceptron.md)
- [`Adaline`](user_guide/classifier/Adaline.md)
- [`LogisticRegression`](user_guide/classifier/LogisticRegression.md)
- [`NeuralNetMLP`](user_guide/classifier/NeuralNetMLP.md)
- [`SoftmaxRegression`](user_guide/classifier/SoftmaxRegression.md) (new in 0.3.1dev)
- [`SoftmaxRegression`](user_guide/classifier/SoftmaxRegression.md)

## `tf_classifier` (TensorFlow Classifier)
- [`TfSoftmaxRegression`](user_guide/tf_classifier/TfSoftmaxRegression.md) (new in 0.3.1dev)
- [`TfMultiLayerPerceptron`](user_guide/tf_classifier/TfMultiLayerPerceptron.md) (new in 0.3.1dev)
- [`TfSoftmaxRegression`](user_guide/tf_classifier/TfSoftmaxRegression.md)
- [`TfMultiLayerPerceptron`](user_guide/tf_classifier/TfMultiLayerPerceptron.md)

## `regressor`

- [`LinearRegression`](user_guide/regressor/LinearRegression.md)
- [`StackingRegressor`](user_guide/regressor/StackingRegressor.md) (new in 0.3.1dev)
- [`StackingRegressor`](user_guide/regressor/StackingRegressor.md)

## `tf_regressor` (TensorFlow Regressor)
- [`TfLinearRegression`](user_guide/tf_regressor/TfLinearRegression.md) (new in 0.4.1dev)

## `regression_utils`
- [`plot_linear_regression`](user_guide/regression_utils/plot_linear_regression.md)
@@ -26,9 +27,12 @@
- [`SequentialFeatureSelector`](user_guide/feature_selection/SequentialFeatureSelector.md)

## `feature_extraction`
- [`PrincipalComponentAnalysis`](user_guide/feature_extraction/PrincipalComponentAnalysis.md) (new in 0.3.1dev)
- [`LinearDiscriminantAnalysis`](user_guide/feature_extraction/LinearDiscriminantAnalysis.md) (new in 0.3.1dev)
- [`RBFKernelPCA`](user_guide/feature_extraction/RBFKernelPCA.md) (new in 0.3.1dev)
- [`PrincipalComponentAnalysis`](user_guide/feature_extraction/PrincipalComponentAnalysis.md)
- [`LinearDiscriminantAnalysis`](user_guide/feature_extraction/LinearDiscriminantAnalysis.md)
- [`RBFKernelPCA`](user_guide/feature_extraction/RBFKernelPCA.md)

## `cluster`
- [`Kmeans`](user_guide/cluster/Kmeans.md) (new in 0.4.1dev)

## `evaluate`
- [`confusion_matrix`](user_guide/evaluate/confusion_matrix.md)
@@ -42,7 +46,7 @@
- [`minmax_scaling`](user_guide/preprocessing/minmax_scaling.md)
- [`shuffle_arrays_unison`](user_guide/preprocessing/shuffle_arrays_unison.md)
- [`standardize`](user_guide/preprocessing/standardize.md)
- [`one-hot_encoding`](user_guide/preprocessing/one-hot_encoding.md) (new in 0.3.1dev)
- [`one-hot_encoding`](user_guide/preprocessing/one-hot_encoding.md)

## `data`
- [`autompg_data`](user_guide/data/autompg_data.md)
@@ -51,6 +55,7 @@
- [`mnist_data`](user_guide/data/mnist_data.md)
- [`load_mnist`](user_guide/data/load_mnist.md)
- [`wine_data`](user_guide/data/wine_data.md)
- [`three_blobs`](user_guide/data/three_blobs_data.md)

## `file_io`
- [`find_filegroups`](user_guide/file_io/find_filegroups.md)
399 changes: 399 additions & 0 deletions docs/sources/user_guide/cluster/Kmeans.ipynb

Large diffs are not rendered by default.

311 changes: 311 additions & 0 deletions docs/sources/user_guide/data/three_blobs_data.ipynb

Large diffs are not rendered by default.

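Since the notebook diff is not rendered, here is a minimal sketch of how the new loader is intended to be used. It assumes `three_blobs_data()` follows the same convention as the other `mlxtend.data` loaders (e.g. `wine_data`) and returns the feature array `X` together with the true blob labels `y`:

```python
import numpy as np
from mlxtend.data import three_blobs_data

# Assumption: like the other mlxtend.data loaders, three_blobs_data()
# returns a NumPy feature array X and an integer label array y.
X, y = three_blobs_data()

print('X shape:', X.shape)      # (n_samples, n_features)
print('labels:', np.unique(y))  # expected: three distinct blob labels
```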
9 changes: 9 additions & 0 deletions mlxtend/cluster/__init__.py
@@ -0,0 +1,9 @@
# Sebastian Raschka 2014-2016
# mlxtend Machine Learning Library Extensions
# Author: Sebastian Raschka <sebastianraschka.com>
#
# License: BSD 3 clause

from .kmeans import Kmeans

__all__ = ["Kmeans"]
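With the `__init__.py` above re-exporting the class, the documented import path for the estimator is simply:

```python
# Kmeans is re-exported by mlxtend/cluster/__init__.py
from mlxtend.cluster import Kmeans
```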
114 changes: 114 additions & 0 deletions mlxtend/cluster/base.py
@@ -0,0 +1,114 @@
# Sebastian Raschka 2014-2016
# mlxtend Machine Learning Library Extensions
#
# Base Cluster (Clustering Parent Class)
# Author: Sebastian Raschka <sebastianraschka.com>
#
# License: BSD 3 clause

import numpy as np
from sys import stderr
from time import time


class _BaseCluster(object):

"""Parent Class Base Cluster
A base class that is implemented by
clustering child classes.
"""
def __init__(self, print_progress=0, random_seed=None):
self.print_progress = print_progress
self.random_seed = random_seed
self._is_fitted = False

def fit(self, X):
"""Learn cluster centroids from training data.
Parameters
----------
X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Training vectors, where n_samples is the number of samples and
n_features is the number of features.
Returns
-------
self : object
"""
self._is_fitted = False
self._check_array(X=X)
if self.random_seed is not None:
np.random.seed(self.random_seed)
self._fit(X=X)
self._is_fitted = True
return self

def _fit(self, X):
# Implemented in child class
pass

def predict(self, X):
"""Predict cluster labels of X.
Parameters
----------
X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Training vectors, where n_samples is the number of samples and
n_features is the number of features.
Returns
-------
cluster_labels : array-like, shape = [n_samples]
Predicted cluster labels.
"""
self._check_array(X=X)
if not self._is_fitted:
raise AttributeError('Model is not fitted, yet.')
return self._predict(X)

def _predict(self, X):
# Implemented in child class
pass

def _shuffle(self, arrays):
"""Shuffle arrays in unison."""
r = np.random.permutation(len(arrays[0]))
return [ary[r] for ary in arrays]

def _print_progress(self, iteration, cost=None, time_interval=10):
if self.print_progress > 0:
s = '\rIteration: %d/%d' % (iteration, self.n_iter)
if cost:
s += ' | Cost %.2f' % cost
if self.print_progress > 1:
if not hasattr(self, 'ela_str_'):
self.ela_str_ = '00:00:00'
if not iteration % time_interval:
ela_sec = time() - self.init_time_
self.ela_str_ = self._to_hhmmss(ela_sec)
s += ' | Elapsed: %s' % self.ela_str_
if self.print_progress > 2:
if not hasattr(self, 'eta_str_'):
self.eta_str_ = '00:00:00'
if not iteration % time_interval:
eta_sec = ((ela_sec / float(iteration)) *
self.n_iter - ela_sec)
self.eta_str_ = self._to_hhmmss(eta_sec)
s += ' | ETA: %s' % self.eta_str_
stderr.write(s)
stderr.flush()

def _to_hhmmss(self, sec):
m, s = divmod(sec, 60)
h, m = divmod(m, 60)
return "%d:%02d:%02d" % (h, m, s)

def _check_array(self, X):
if isinstance(X, list):
raise ValueError('X must be a numpy array')
if not len(X.shape) == 2:
raise ValueError('X must be a 2D array. Try X[:, numpy.newaxis]')
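The base class follows a template-method pattern: the public `fit`/`predict` methods handle input checks, seeding, and the fitted-state flag, while child classes only override the private `_fit`/`_predict` hooks. A minimal hypothetical subclass (not part of this commit) sketching that contract:

```python
import numpy as np


class _MeanCluster(_BaseCluster):
    """Toy subclass: a single cluster centered on the column means."""

    def _fit(self, X):
        # Called by _BaseCluster.fit after _check_array and seeding.
        self.center_ = X.mean(axis=0)
        return self

    def _predict(self, X):
        # Called by _BaseCluster.predict after the fitted-state check.
        return np.zeros(X.shape[0], dtype=int)
```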
105 changes: 105 additions & 0 deletions mlxtend/cluster/kmeans.py
@@ -0,0 +1,105 @@
# Sebastian Raschka 2014-2016
# mlxtend Machine Learning Library Extensions
#
# Estimator for k-means clustering
# Author: Sebastian Raschka <sebastianraschka.com>
#
# License: BSD 3 clause

from .base import _BaseCluster
import numpy as np
from time import time
from scipy.spatial.distance import euclidean


class Kmeans(_BaseCluster):
""" K-means clustering class.
Added in 0.4.1dev
Parameters
------------
k : int
Number of clusters
max_iter : int (default: 10)
Maximum number of iterations during cluster assignment.
Cluster re-assignment stops automatically when the algorithm
has converged.
random_seed : int (default: None)
Set random state for the initial centroid assignment.
print_progress : int (default: 0)
Prints progress in fitting to stderr.
0: No output
1: Iterations elapsed
2: 1 plus time elapsed
3: 2 plus estimated time until completion
Attributes
-----------
centroids_ : 2d-array, shape = [k, n_features]
Feature values of the k cluster centroids.
clusters_ : dictionary
The cluster assignments stored as a Python dictionary;
the dictionary keys denote the cluster indices and the items are
Python lists of the sample indices that were assigned to each
cluster.
iterations_ : int
Number of iterations until convergence.
"""

def __init__(self, k, max_iter=10, random_seed=None, print_progress=0):
super(Kmeans, self).__init__(print_progress=print_progress,
random_seed=random_seed)
self.k = k
self.max_iter = max_iter

def _fit(self, X):
"""Learn cluster centroids from training data.
Called in self.fit
"""
self.iterations_ = 0
n_samples = X.shape[0]

# initialize centroids
idx = np.random.choice(n_samples, self.k, replace=False)
self.centroids_ = X[idx]

for _ in range(self.max_iter):

# assign samples to cluster centroids
self.clusters_ = {i: [] for i in range(self.k)}
for sample_idx, cluster_idx in enumerate(
self._get_cluster_idx(X=X, centroids=self.centroids_)):
self.clusters_[cluster_idx].append(sample_idx)

# recompute centroids
new_centroids = np.array([np.mean(X[self.clusters_[k]], axis=0)
for k in sorted(self.clusters_.keys())])

# stop if cluster assignment doesn't change
if (self.centroids_ == new_centroids).all():
break
else:
self.centroids_ = new_centroids

self.iterations_ += 1

return self

def _get_cluster_idx(self, X, centroids):
for sample_idx, sample in enumerate(X):
dist = [euclidean(sample, c) for c in centroids]
yield np.argmin(dist)

def _predict(self, X):
"""Predict cluster labels of X.
Called in self.predict
"""
pred = np.array([idx for idx in self._get_cluster_idx(X=X,
centroids=self.centroids_)])
return pred
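A minimal usage sketch of the new estimator, combining it with the `three_blobs_data` loader added in the same commit (assuming the loader returns a 2D feature array `X` and ground-truth labels `y`):

```python
import numpy as np
from mlxtend.data import three_blobs_data
from mlxtend.cluster import Kmeans

X, y = three_blobs_data()

km = Kmeans(k=3, max_iter=50, random_seed=1, print_progress=0)
km.fit(X)

print('Iterations until convergence:', km.iterations_)
print('Centroids:\n', km.centroids_)

# Cluster labels for the training data (any 2D NumPy array works)
y_clust = km.predict(X)
print('Samples per cluster:', np.bincount(y_clust))
```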
