Merge branch 'yzhao062:master' into dep_kpca

yzhao062 · Nov 14, 2023 · bbe2c4d
2 parents 302f377 + b95b82a

Showing 31 changed files with 1,078 additions and 284 deletions.
diff --git a/.github/workflows/testing-cron.yml b/.github/workflows/testing-cron.yml
@@ -28,7 +28,7 @@ jobs:
     - name: Install dependencies
       run: |
         python -m pip install --upgrade pip
-        pip install -r requirements_ci.txt
+        pip install -r docs/requirements.txt
         pip install pytest
         pip install coverage
         pip install coveralls

diff --git a/.github/workflows/testing.yml b/.github/workflows/testing.yml
@@ -33,7 +33,7 @@ jobs:
     - name: Install dependencies
       run: |
         python -m pip install --upgrade pip
-        pip install -r requirements_ci.txt
+        pip install -r docs/requirements.txt
         pip install pytest
         pip install coverage
         pip install coveralls

diff --git a/.readthedocs.yaml b/.readthedocs.yaml
@@ -0,0 +1,22 @@
+# .readthedocs.yaml
+# Read the Docs configuration file
+# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details
+
+# Required
+version: 2
+
+# Set the version of Python and other tools you might need
+build:
+  os: ubuntu-22.04
+  tools:
+    python: "3.11"
+
+# Build documentation in the docs/ directory with Sphinx
+sphinx:
+  configuration: docs/conf.py
+
+# We recommend specifying your dependencies to enable reproducible builds:
+# https://docs.readthedocs.io/en/stable/guides/reproducible-builds.html
+python:
+  install:
+    - requirements: docs/requirements.txt
diff --git a/.travis.yml b/.travis.yml
diff --git a/CHANGES.txt b/CHANGES.txt
@@ -177,3 +177,6 @@ v<1.0.8>, <03/08/2023> -- Improve clone compatibility (#471).
 v<1.0.8>, <03/08/2023> -- Add QMCD detector (#452).
 v<1.0.8>, <03/08/2023> -- Optimized ECDF and drop Statsmodels dependency (#467).
 v<1.0.9>, <03/19/2023> -- Hot fix for errors in ECOD and COPOD due to the issue of scipy.
+v<1.1.0>, <06/19/2023> -- Further integration of PyThresh.
+v<1.1.1>, <07/03/2023> -- Bump up sklearn requirement and some hot fixes.
+v<1.1.1>, <10/24/2023> -- Add deep isolation forest (#506)
diff --git a/README.rst b/README.rst
@@ -58,7 +58,7 @@ Python Outlier Detection (PyOD)
 
 -----
 
-**News**: We just released a 45-page, the most comprehensive `anomaly detection benchmark paper <https://www.andrew.cmu.edu/user/yuezhao2/papers/22-neurips-adbench.pdf>`_.
+**News**: We have a 45-page `anomaly detection benchmark paper <https://www.andrew.cmu.edu/user/yuezhao2/papers/22-neurips-adbench.pdf>`_, the most comprehensive to date.
 The fully `open-sourced ADBench <https://github.com/Minqi824/ADBench>`_ compares 30 anomaly detection algorithms on 57 benchmark datasets.
 
 **For time-series outlier detection**, please use `TODS <https://github.com/datamllab/tods>`_.
@@ -70,7 +70,7 @@ multivariate data. This exciting yet challenging field is commonly referred as
 or `Anomaly Detection <https://en.wikipedia.org/wiki/Anomaly_detection>`_.
 
 PyOD includes more than 40 detection algorithms, from classical LOF (SIGMOD 2000) to
-the latest ECOD (TKDE 2022). Since 2017, PyOD has been successfully used in numerous academic researches and
+the latest ECOD and DIF (TKDE 2022 and 2023). Since 2017, PyOD has been successfully used in numerous academic researches and
 commercial products with more than `10 million downloads <https://pepy.tech/project/pyod>`_.
 It is also well acknowledged by the machine learning community with various dedicated posts/tutorials, including
 `Analytics Vidhya <https://www.analyticsvidhya.com/blog/2019/02/outlier-detection-python-pyod/>`_,
@@ -156,6 +156,7 @@ NeurIPS 2022 paper `ADBench: Anomaly Detection Benchmark Paper <https://www.andr
 * `ADBench Benchmark <#adbench-benchmark>`_
 * `Model Save & Load <#model-save--load>`_
 * `Fast Train with SUOD <#fast-train-with-suod>`_
+* `Thresholding Outlier Scores <#thresholding-outlier-scores>`_
 * `Implemented Algorithms <#implemented-algorithms>`_
 * `Quick Start for Outlier Detection <#quick-start-for-outlier-detection>`_
 * `How to Contribute <#how-to-contribute>`_
@@ -198,9 +199,10 @@ Alternatively, you could clone and run setup.py file:
 * numpy>=1.19
 * numba>=0.51
 * scipy>=1.5.1
-* scikit_learn>=0.20.0
+* scikit_learn>=0.22.0
 * six
 
+
 **Optional Dependencies (see details below)**\ :
 
 * combo (optional, required for models/combination.py and FeatureBagging)
@@ -327,7 +329,25 @@ and  `SUOD example <https://github.com/yzhao062/pyod/blob/master/examples/suod_e
     clf = SUOD(base_estimators=detector_list, n_jobs=2, combination='average',
                verbose=False)
 
+----
+
+Thresholding Outlier Scores
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+A more data-driven approach can be taken when setting the contamination level.
+By using a thresholding method, guessing an arbitrary value can be replaced
+with tested techniques for separating inliers and outliers. Refer to
+`PyThresh <https://github.com/KulikDM/pythresh>`_ for
+a more in-depth look at thresholding.
+
 
+.. code-block:: python
+
+    from pyod.models.knn import KNN
+    from pyod.models.thresholds import FILTER
+
+    # Set the outlier detection and thresholding methods
+    clf = KNN(contamination=FILTER())
 
 
 ----
@@ -337,7 +357,7 @@ and  `SUOD example <https://github.com/yzhao062/pyod/blob/master/examples/suod_e
 Implemented Algorithms
 ^^^^^^^^^^^^^^^^^^^^^^
 
-PyOD toolkit consists of three major functional groups:
+PyOD toolkit consists of four major functional groups:
 
 **(i) Individual Detection Algorithms** :
 
@@ -373,6 +393,7 @@ Proximity-Based      SOD                 Subspace Outlier Detection
 Proximity-Based      ROD                 Rotation-based Outlier Detection                                                                        2020   [#Almardeny2020A]_
 Outlier Ensembles    IForest             Isolation Forest                                                                                        2008   [#Liu2008Isolation]_
 Outlier Ensembles    INNE                Isolation-based Anomaly Detection Using Nearest-Neighbor Ensembles                                      2018   [#Bandaragoda2018Isolation]_
+Outlier Ensembles    DIF                 Deep Isolation Forest for Anomaly Detection                                                             2023   [#Xu2023Deep]_
 Outlier Ensembles    FB                  Feature Bagging                                                                                         2005   [#Lazarevic2005Feature]_
 Outlier Ensembles    LSCP                LSCP: Locally Selective Combination of Parallel Outlier Ensembles                                       2019   [#Zhao2019LSCP]_
 Outlier Ensembles    XGBOD               Extreme Boosting Based Outlier Detection **(Supervised)**                                               2018   [#Zhao2018XGBOD]_
@@ -411,8 +432,43 @@ Combination          Median            Simple combination by taking the median o
 Combination          majority Vote     Simple combination by taking the majority vote of the labels (weights can be used)                     2015   [#Aggarwal2015Theoretical]_
 ===================  ================  =====================================================================================================  =====  ========================================
 
-
-**(iii) Utility Functions**:
+**(iii) Outlier Detection Score Thresholding Methods**:
+
+==================================  ================  ================================================================ ====================================================================================================================
+Type                                Abbr              Algorithm                                                        Documentation                                    
+==================================  ================  ================================================================ ====================================================================================================================
+Kernel-Based                        AUCP              Area Under Curve Percentage                                      `AUCP <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.AUCP>`_
+Statistical Moment-Based            BOOT              Bootstrapping                                                    `BOOT <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.BOOT>`_ 
+Normality-Based                     CHAU              Chauvenet's Criterion                                            `CHAU <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.CHAU>`_
+Linear Model                        CLF               Trained Linear Classifier                                        `CLF <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.CLF>`_
+Cluster-Based                       CLUST             Clustering Based                                                 `CLUST <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.CLUST>`_
+Kernel-Based                        CPD               Change Point Detection                                           `CPD <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.CPD>`_
+Transformation-Based                DECOMP            Decomposition                                                    `DECOMP <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.DECOMP>`_
+Normality-Based                     DSN               Distance Shift from Normal                                       `DSN <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.DSN>`_
+Curve-Based                         EB                Elliptical Boundary                                              `EB <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.EB>`_
+Kernel-Based                        FGD               Fixed Gradient Descent                                           `FGD <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.FGD>`_
+Filter-Based                        FILTER            Filtering Based                                                  `FILTER <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.FILTER>`_
+Curve-Based                         FWFM              Full Width at Full Minimum                                       `FWFM <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.FWFM>`_
+Statistical Test-Based              GESD              Generalized Extreme Studentized Deviate                          `GESD <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.GESD>`_
+Filter-Based                        HIST              Histogram Based                                                  `HIST <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.HIST>`_
+Quantile-Based                      IQR               Inter-Quartile Region                                            `IQR <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.IQR>`_
+Statistical Moment-Based            KARCH             Karcher mean (Riemannian Center of Mass)                         `KARCH <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.KARCH>`_
+Statistical Moment-Based            MAD               Median Absolute Deviation                                        `MAD <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.MAD>`_
+Statistical Test-Based              MCST              Monte Carlo Shapiro Tests                                        `MCST <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.MCST>`_
+Ensembles-Based                     META              Meta-model Trained Classifier                                    `META <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.META>`_
+Transformation-Based                MOLL              Friedrichs' Mollifier                                            `MOLL <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.MOLL>`_
+Statistical Test-Based              MTT               Modified Thompson Tau Test                                       `MTT <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.MTT>`_
+Linear Model                        OCSVM             One-Class Support Vector Machine                                 `OCSVM <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.OCSVM>`_
+Quantile-Based                      QMCD              Quasi-Monte Carlo Discrepancy                                    `QMCD <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.QMCD>`_
+Linear Model                        REGR              Regression Based                                                 `REGR <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.REGR>`_
+Neural Networks                     VAE               Variational Autoencoder                                          `VAE <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.VAE>`_
+Curve-Based                         WIND              Topological Winding Number                                       `WIND <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.WIND>`_
+Transformation-Based                YJ                Yeo-Johnson Transformation                                       `YJ <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.YJ>`_
+Normality-Based                     ZSCORE            Z-score                                                          `ZSCORE <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.ZSCORE>`_
+==================================  ================  ================================================================ ====================================================================================================================
+
+
+**(iv) Utility Functions**:
 
 ===================  ======================  =====================================================================================================================================================  ======================================================================================================================================
 Type                 Name                    Function                                                                                                                                               Documentation
@@ -630,6 +686,8 @@ Reference
 
 .. [#Wang2020adVAE] Wang, X., Du, Y., Lin, S., Cui, P., Shen, Y. and Yang, Y., 2019. adVAE: A self-adversarial variational autoencoder with Gaussian anomaly prior knowledge for anomaly detection. *Knowledge-Based Systems*.
 
+.. [#Xu2023Deep] Xu, H., Pang, G., Wang, Y., Wang, Y., 2023. Deep isolation forest for anomaly detection. *IEEE Transactions on Knowledge and Data Engineering*.
+
 .. [#You2017Provable] You, C., Robinson, D.P. and Vidal, R., 2017. Provable self-representation based outlier detection in a union of subspaces. In Proceedings of the IEEE conference on computer vision and pattern recognition.
 
 .. [#Zenati2018Adversarially] Zenati, H., Romain, M., Foo, C.S., Lecouat, B. and Chandrasekhar, V., 2018, November. Adversarially learned anomaly detection. In 2018 IEEE International conference on data mining (ICDM) (pp. 727-736). IEEE.
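
Note on the README hunk above: the new "Thresholding Outlier Scores" section contrasts a hand-picked contamination value with a threshold derived from the scores themselves. The sketch below is only a rough illustration of that contrast in plain Python. It is not how PyOD or PyThresh implement thresholding; the helper names ``labels_from_contamination`` and ``labels_from_mad``, the sample scores, and the multiplier ``k=3.0`` are all assumptions made for this example, and the median-absolute-deviation rule is just one simple stand-in for a data-driven thresholder.

```python
import statistics


def labels_from_contamination(scores, contamination):
    """Flag the highest-scoring `contamination` fraction as outliers (1).

    Hypothetical helper: mimics the idea of a fixed, user-supplied rate.
    """
    n_outliers = int(round(contamination * len(scores)))
    if n_outliers == 0:
        return [0] * len(scores)
    # The n-th largest score becomes the cutoff.
    cutoff = sorted(scores, reverse=True)[n_outliers - 1]
    return [1 if s >= cutoff else 0 for s in scores]


def labels_from_mad(scores, k=3.0):
    """Flag scores more than k MADs above the median as outliers (1).

    Hypothetical helper: the cutoff comes from the score distribution,
    not from a guessed contamination rate. k=3.0 is an assumed default.
    """
    med = statistics.median(scores)
    mad = statistics.median([abs(s - med) for s in scores])
    cutoff = med + k * mad
    return [1 if s > cutoff else 0 for s in scores]


# Made-up outlier scores: nine ordinary points and one extreme point.
scores = [1.00, 1.10, 0.90, 1.20, 1.05, 0.80, 1.15, 0.95, 9.00, 1.02]

fixed = labels_from_contamination(scores, contamination=0.3)
adaptive = labels_from_mad(scores)
print("fixed rate :", fixed)
print("MAD rule   :", adaptive)
```

On these sample scores the fixed 30% rate is forced to flag three points, while the MAD rule flags only the single extreme score. That difference is the motivation the README text gives for replacing a guessed contamination value with a thresholding method.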

diff --git a/docs/about.rst b/docs/about.rst
@@ -5,10 +5,10 @@ About us
 Core Development Team
 ---------------------
 
-Yue Zhao (Ph.D. Student @ Carnegie Mellon University):
+Yue Zhao (Assistant Professor @ USC, Ph.D. @ CMU):
 
 - Initialized the project in 2017
-- `Homepage <https://www.andrew.cmu.edu/user/yuezhao2/>`_
+- `Homepage <https://viterbi-web.usc.edu/~yzhao010/>`_
 - `LinkedIn (Yue Zhao) <https://www.linkedin.com/in/yzhao062/>`_
 
 Zain Nasrullah (Data Scientist at RBC; MSc in Computer Science from University of Toronto):

diff --git a/docs/example.rst b/docs/example.rst
@@ -191,6 +191,45 @@ please navigate to **"/notebooks/Model Combination.ipynb"**
         Combination by AOM ROC:0.9257, precision @ rank n:0.4844
         Combination by MOA ROC:0.9263, precision @ rank n:0.4688
 
+Thresholding Example
+--------------------
+
+
+Full example: `threshold_example.py <https://github.com/yzhao062/Pyod/blob/master/examples/threshold_example.py>`_
+
+1. Import models
+
+    .. code-block:: python
+
+        from pyod.models.knn import KNN   # kNN detector
+        from pyod.models.thresholds import FILTER  # Filter thresholder
+        from pyod.utils.data import generate_data  # synthetic data generator
+
+
+2. Generate sample data with :func:`pyod.utils.data.generate_data`:
+
+    .. code-block:: python
+
+        contamination = 0.1  # fraction of outliers
+        n_train = 200  # number of training points
+        n_test = 100  # number of testing points
+
+        X_train, X_test, y_train, y_test = generate_data(
+            n_train=n_train, n_test=n_test, contamination=contamination)
+
+3. Initialize a :class:`pyod.models.knn.KNN` detector, fit the model, and make
+   the prediction.
+
+    .. code-block:: python
+
+        # train kNN detector and apply FILTER thresholding
+        clf_name = 'KNN'
+        clf = KNN(contamination=FILTER())
+        clf.fit(X_train)
+
+        # get the prediction labels and outlier scores of the training data
+        y_train_pred = clf.labels_  # binary labels (0: inliers, 1: outliers)
+        y_train_scores = clf.decision_scores_  # raw outlier scores
+
 .. rubric:: References
 
 .. bibliography::
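
Note on the docs/example.rst hunk above: the added example stops once ``y_train_pred`` is obtained. Because ``generate_data`` also returns ground-truth labels (``y_train``), one natural follow-up is to check how well the thresholded labels match them. The sketch below shows that comparison in plain Python; the helper name ``precision_recall`` and the sample label lists are hypothetical, and it does not use any PyOD utility.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 marks an outlier.

    Hypothetical helper for illustration; both inputs are equal-length
    sequences of 0/1 values.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true outliers flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # inliers flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # outliers missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels standing in for y_train and clf.labels_.
y_true_demo = [0, 0, 1, 1, 0]
y_pred_demo = [0, 1, 1, 0, 0]
print(precision_recall(y_true_demo, y_pred_demo))
```

With a real detector, ``y_train`` and ``clf.labels_`` from the example would take the place of the two demo lists.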