diff --git a/docs/_static/extreme_value_theory_for_anomaly_detection.html b/docs/_static/nb_01_0_intro_anomaly_detection.html similarity index 50% rename from docs/_static/extreme_value_theory_for_anomaly_detection.html rename to docs/_static/nb_01_0_intro_anomaly_detection.html index f47f174..346cfe6 100644 --- a/docs/_static/extreme_value_theory_for_anomaly_detection.html +++ b/docs/_static/nb_01_0_intro_anomaly_detection.html @@ -3,7 +3,34 @@ -extreme_value_theory_for_anomaly_detection +nb_01_0_intro_anomaly_detection + + - -
-
+
-
-
-
%%capture
+
+
-%set_random_seed 12 -
+
+
+ +
+ + + +
+

$\newcommand{\vect}[1]{{\mathbf{\boldsymbol{#1}} }}$ +$\newcommand{\amax}{{\text{argmax}}}$ +$\newcommand{\P}{{\mathbb{P}}}$ +$\newcommand{\E}{{\mathbb{E}}}$ +$\newcommand{\R}{{\mathbb{R}}}$ +$\newcommand{\Z}{{\mathbb{Z}}}$ +$\newcommand{\N}{{\mathbb{N}}}$ +$\newcommand{\C}{{\mathbb{C}}}$ +$\newcommand{\abs}[1]{{ \left| #1 \right| }}$ +$\newcommand{\simpl}[1]{{\Delta^{#1} }}$

-
+
-
-
+
-
-
-
%load_latex_macros
-
+
+
+
+

Introduction to Anomaly Detection

Snow

-
-
-
import tensorflow as tf
-import tensorflow_probability as tfp
-from tensorflow import keras
-import numpy as np
-import pandas as pd
-import matplotlib.pyplot as plt
-import os
-import logging
-from sklearn.preprocessing import StandardScaler
-from typing import Protocol, Sequence, Union, Tuple, List, TypeVar, Callable
-from matplotlib.animation import FuncAnimation
-from celluloid import Camera
-from IPython.core.display import HTML 
-
-tfd = tfp.distributions
+
import numpy as np
+
+
+import matplotlib
+from matplotlib import pyplot as plt
+from matplotlib.patches import Ellipse
+
+from tfl_training_anomaly_detection.exercise_tools import evaluate, visualize_mahalanobis
+
+from ipywidgets import interact
+
+from sklearn.metrics import f1_score, precision_score, recall_score
+
+%matplotlib inline
+matplotlib.rcParams['figure.figsize'] = (5, 5)
 
@@ -13176,214 +13352,363 @@
-

Snow

-
Extreme Value Theory for Anomaly Detection
+

What is an Anomaly?

-

Extreme Value Theory and Anomaly Detection

Extreme value theory (EVT) usually deals with tails of univariate probability distributions, i.e. with events that are rare because they are large. Anomaly detection deals with events that are rare but not necessarily large. Therefore, the topic of anomaly detection is more general than EVT.

- +
+ +
-

Despite their narrower applicability, EVT techniques are still a valuable addition to the anomaly detection practitioner's toolbox. In several situations, anomalies directly correspond to large deviations from some (possibly running) mean, e.g. in sensor data or in intrusion attacks characterized by the number of calls.

+

Grubbs, 1969:

+

An outlying observation, or "outlier," is one that appears to deviate markedly from other members of +the sample in which it occurs.

+
+

Hawkins, 1980:

+

An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was +generated by a different mechanism.

+
+

Chandola et al., 2009:

+

Anomalies are patterns in data that do not conform to a well defined notion of normal behavior

+
-

However, even for entirely different definitions of anomaly, most detection algorithms will produce a scalar outlier score for each datapoint. -EVT can then be used as a probabilistic framework for analyzing the univariate distribution of outlier scores and help determine meaningful thresholds for separating anomalous from normal cases.

- +

Anomalies Can be Hard to Detect

+ +
+
+
+
+
+
+

Practical Relevance of Anomaly Detection

-

EVT in a Nutshell

There are two fundamental theorems of extreme value theory on which most results in that field are based. The first is concerned with the asymptotic distribution of block maxima of a sequence of i.i.d. random variables. The second one gives an expression for the distribution of excesses over a threshold.

-

We will first state these theorems (in their standard formulation in the literature), then see how they can be applied to anomaly detection and after that highlight ideas of their proofs as well as some theoretical consequences.

-

From now on let $X_1, X_2, ...$ be a sequence of 1-dimensional i.i.d. random variables with cumulative distribution function $F$. Let $X$ also be a r.v. with the same c.d.f.

+

Predictive Maintenance

    +
  • Determine condition of in-service equipment
  • +
  • Optimize maintenance cycle
  • +
  • Too frequent inspections cause unnecessary costs and downtime
  • +
  • Too infrequent inspections can lead to failures or even breaking of the equipment
    + +
  • +
+

Anomaly Detection: Sensor data can provide valuable information about the condition of the component. Increasingly abnormal readings may indicate wear of the equipment.

-

We define the n-block maximum as the random variable

-$$ -M_n := \max \{X_1, ..., X_n\}. -$$

Given a threshold $u$, the excess over the threshold is given by $X-u$.
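As a quick illustration of these two objects, here is a minimal numpy sketch on toy data (not part of the original notebook's code):

import numpy as np

x = np.random.standard_normal(1000)          # i.i.d. toy sample
n = 50                                        # block size
block_maxima = x.reshape(-1, n).max(axis=1)   # one maximum M_n per block of n points

u = 2.0                                       # threshold
excesses = x[x > u] - u                       # excesses X - u of the points above u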

-

In EVT, we are typically interested in approximating $P(M_n<z)$ for large $n$ and in approximating the distribution of excesses $P(X-u < y \mid X > u)$ for large $u$.

+

Fraud Detection

    +
  • Identify fraudulent transactions, e.g. credit card
  • +
  • Prevent criminal activities
  • +
  • Avoid financial or other damages for the involved parties
  • +
+
+ +

Anomaly Detection: Fraudulent transactions can often be identified through unusual destinations, amounts, +or network topology (over several transactions).

-

The Fisher–Tippett–Gnedenko theorem characterizes the possible limits of renormalized block maxima.

-

If there exist sequences of real numbers $a_n>0, b_n$ such that the probability distributions -$$ -P\left(\frac{M_n-b_n}{a_n}<z \right) -$$ -converge to a non-degenerate distribution $G(z)$, then $G(z)$ must be of the following form:

-\begin{equation} - P\left(\frac{M_n-b_n}{a_n}<z \right) \xrightarrow[n\rightarrow \infty]{} G(z; \xi, \mu, \sigma) = \exp \left\{ -\left( 1 + \xi \left( \frac{z - \mu}{\sigma} \right) \right)^{- \frac{1}{\xi} } \right\} -\end{equation}

where $\xi, \mu \in \mathbb{R}$ and $\sigma >0$. This family of functions is called the Generalized Extreme Value (GEV) distributions.
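As a hedged illustration (the helper name gev_cdf is our own and not part of the notebook's code), the GEV cdf above can be written directly in numpy, including the Gumbel limit for $\xi \to 0$:

import numpy as np

def gev_cdf(z, xi, mu=0.0, sigma=1.0):
    # cdf G(z; xi, mu, sigma) as stated above; for xi -> 0 it reduces to the Gumbel cdf
    s = (np.asarray(z, dtype=float) - mu) / sigma
    if abs(xi) < 1e-12:
        return np.exp(-np.exp(-s))
    t = 1.0 + xi * s
    t = np.where(t > 0, t, np.nan)                 # mask points outside the support
    inside = np.exp(-t ** (-1.0 / xi))
    # outside the support: cdf is 0 left of the boundary (xi > 0), 1 right of it (xi < 0)
    return np.where(np.isnan(t), 0.0 if xi > 0 else 1.0, inside)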

+

Intrusion Detection

    +
  • Detect attacks against a network
  • +
  • Protect nodes against unauthorized access
  • +
+
+ +

Anomaly Detection: Malicious connections can leave unusual footprints, e.g. the protocol used, ports, number of packets, IP, duration, etc.

-

The Pickands–Balkema–De Haan theorem states that under the same conditions as above and for a threshold $u \in \mathbb{R}$ going to infinity, the distribution of excesses over the threshold $u$ converges to a Generalized Pareto Distribution (GPD), i.e.

-\begin{equation} -P(X-u < y \mid X > u) \xrightarrow[u \rightarrow \infty]{} H(y; \xi, \tilde{\sigma})=1 - \left( 1 + \frac{\xi \ y}{\tilde{\sigma}} \right)^{-\frac{1}{\xi}} \ -\end{equation}

where $y>0$ and $1 + \frac{\xi \ y}{\tilde{\sigma}} >0$. The parameter $\xi$ takes the same value as for the GEV.
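Analogously, here is a minimal numpy sketch of the GPD excess cdf $H(y; \xi, \tilde{\sigma})$ (our own helper, with the exponential limit for $\xi \to 0$ included):

import numpy as np

def gpd_cdf(y, xi, sigma_tilde=1.0):
    # cdf H(y; xi, sigma_tilde) of excesses, valid for y > 0 and 1 + xi*y/sigma_tilde > 0
    y = np.asarray(y, dtype=float)
    if abs(xi) < 1e-12:
        return 1.0 - np.exp(-y / sigma_tilde)   # exponential limit for xi -> 0
    return 1.0 - (1.0 + xi * y / sigma_tilde) ** (-1.0 / xi)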

-

We have highlighted the dependence of the limiting distributions on the parameters in both cases. In applications, these parameters will be estimated based on data.

+

Relevance of Unsupervised Machine Learning in AD

    +
  • Due to the difficulty of identifying anomalies one often has no labeled data available
  • +
  • Even if labels are available, anomalies are rare and the data sets are heavily imbalanced
  • +
  • Often, we don't want to restrict the system to anomalies that we have encountered in the past
  • +
  • The information that is available heavily influences the applicable techniques:
      +
    • Is the distribution of nominal data known?
    • +
    • Is there clean data (without anomalies) for training?
    • +
    • Do we have labeled anomalies for evaluation?
    • +
    • How large is the proportion of anomalies?
    • +
    • How much noise is in the data?
    • +
    +
  • +
-

Practical Significance of EVT Theorems

Before we analyze the consequences of the above distributions in detail, let us discuss their practical significance. The distinctive feature of the GEV and GPD distributions is that they are of a very restricted form, belonging to a three and two parameter function family respectively. This motivates to model distributions of block maxima for finite but large $n$ by the GEV distribution

-$$ - P\left(\frac{M_n-b_n}{a_n}<z \right) \approx G(z; \xi, \mu, \sigma) \Longleftrightarrow - P\left( M_n < z \right) \approx G(z; \xi, \mu\prime , \sigma\prime) -$$

where $\mu\prime=b_n+a_n \mu$ and $\sigma\prime=a_n \sigma$. Thus, fitting the coefficients $\xi, \mu\prime, \sigma\prime$ to the observed values of $M_n$, e.g. by maximum likelihood estimation, also finds the "best" values of the renormalizing constants $a_n$ and $b_n$.
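A minimal sketch of such a fit with scipy, as a swapped-in alternative to the tensorflow-probability based fitting used later in this notebook (note that scipy's genextreme uses the shape convention $c = -\xi$; the data here is a placeholder):

import numpy as np
from scipy.stats import genextreme

block_maxima = np.random.gumbel(loc=0.0, scale=1.0, size=200)  # placeholder block maxima
c_hat, mu_hat, sigma_hat = genextreme.fit(block_maxima)        # MLE fit of (c, loc, scale)
xi_hat = -c_hat                                                # back to the xi convention used here
print(xi_hat, mu_hat, sigma_hat)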

-

Similarly, fitting $\xi, \tilde{\sigma}$ to observed excesses of a finite threshold $u$ also finds the best renormalizing constants for the GPD.

+

Question

Where do you think you can benefit from anomaly detection?

+
    +
  • Which problem do you want to solve?
  • +
  • How does it translate into an anomaly detection problem?
  • +
  • What data is available (dimensionality, time dependence, $\ldots$)?
      +
    • Clean data (without anomalies) available?
    • +
    • Labeled anomalies available?
    • +
    • Proportion of outliers?
    • +
    +
  • +
-

In the context of AD, modeling the distributions of $M_n$ or $X-u$ is useful for finding thresholds on outlier scores with probabilistic interpretations or for predicting the occurrence rates and sizes of anomalies.

+

Contamination Framework

+
    +
  • Unsupervised Scenario
  • +
  • Two distributions:
      +
    • $F_0$ generates normal points
    • +
    • $F_1$ generates anomalies
    • +
    • $p$ relative frequency of $F_1$
    • +
    +
  • +
  • Data set $D \stackrel{\text{IID}}{\sim} F=(1-p)F_0 + pF_1$
  • +
+

Task: Estimate if given $x$ is anomalous

+

Assumptions:

+
    +
  • Few: $p \ll 1/2$
  • +
  • Outlying: $F_0$ and $F_1$ do not overlap too much
  • +
  • Sparse: $F_1$ is less clustered than $F_0$
  • +
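To make the contamination model $F=(1-p)F_0 + pF_1$ above concrete, here is a minimal sampling sketch with Gaussians standing in for $F_0$ and $F_1$ (toy choices, not the training data used below):

import numpy as np

rng = np.random.default_rng(0)
p, n = 0.03, 1000
is_anomaly = rng.random(n) < p                       # Bernoulli(p) component labels
x = np.where(is_anomaly,
             rng.normal(5.0, 2.0, size=n),           # F_1: anomalies
             rng.normal(0.0, 1.0, size=n))           # F_0: nominal points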
-

For example, given some complex outlier score based on sensor data of a factory process, we might be interested in the probability that this outlier score exceeds a certain threshold within a month. This could be achieved by fitting a GEV distribution to observed frequencies of monthly maxima of the score.

+

Does the Contamination Framework Always Apply?

+
+
+
+
+
+

No!

+
    +
  • We might have clean data without anomalies available for training
  • +
  • In an adversarial scenario, like fraud detection, the opponent might change her behavior over time to evade detection +$\Rightarrow$ $F_1$ might not be well-defined
  • +
  • The degree to which the three assumptions are true can vary for each specific problem
  • +
  • Some assumptions might even be false in some scenarios
  • +
-

The GPD can be used to directly estimate a cumulative univariate distribution $F(z)$ for large enough $z$. Then one could use it to determine the anomaly threshold $z_{\text{th}}$ by defining an anomalous upper quantile. E.g. solving $F(z_\text{th}) = 0.99$ for $z_{\text{th}}$ (where $F$ was obtained by fitting the GPD to some outlier score) would declare approximately 1% of data points as anomalous. We will describe this in more detail below.

+

Evaluation Metrics

    +
  • Accuracy is not a good measure in anomaly detection:
      +
    • $1\%$ anomalies $\Rightarrow$ always predicting nominal gives $99\%$ accuracy!
    • +
    +
  • +
  • Better measures are precision, recall and $F_1$
  • +
  • The confusion matrix divides a test set according to the predictions and ground truth
  • +
+ + + + + + + + + + + + + + + + + + + +
Confusion Matrix | Actual Nominal | Actual Anomaly
Predicted Nominal | True Negative (TN) | False Negative (FN)
Predicted Anomaly | False Positive (FP) | True Positive (TP)
-

EVT in Action

Let us give a quick example of fitting a GEV to data and extracting insight from it. For that we will use the NYC taxi calls dataset - a collection of taxi call counts per 30-minute interval, recorded over more than a year.

+

Precision, Recall

    +
  • Precision is defined as
  • +
+$$\frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}$$

It estimates the probability that an observation really is anomalous given that the detection system predicted it to be.

+
    +
  • Recall is defined as
  • +
+$$\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}$$

It estimates the probability that an observation will be predicted to be anomalous given that it really is.

-
-
+
+
+

$F_1$ Score

    +
  • $F_1$ is defined as the harmonic mean of precision and recall
  • +
+$$2\cdot \frac{\text{Precision}\cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

+

It balances between precision and recall.

-
-
-
taxi_csv = os.path.join("..", 'data','nyc_taxi','nyc_taxi.csv')
-taxi_df = pd.read_csv(taxi_csv)
-taxi_df['time'] = [x.time() for x in pd.to_datetime(taxi_df['timestamp'])]
-taxi_df['date'] = [x.date() for x in pd.to_datetime(taxi_df['timestamp'])]
-taxi_df.rename(columns={"value": "n_calls"}, inplace=True)
-taxi_df.drop(columns=["timestamp"], inplace=True)
-
+
+
+
+
+
+

Evaluating Thresholds

    +
  • Most anomaly detection algorithms output an anomaly score where higher values mean more anomalous.
  • +
  • We need to set a decision threshold $\tau$ in order to compute precision, recall and $F_1$. +
  • +
  • The precision-recall (PR) curve plots the pairs
  • +
+$$\{(\text{Recall}(\tau), \text{Precision}(\tau)) \mid \tau_{\text{min}} \leq \tau \leq \tau_{\text{max}}\}$$
    +
  • The receiver operating characteristic (ROC) curve plots the true positive rate (TPR) against the false positive rate (FPR) over all possible thresholds (a code sketch of both curves follows after this list)
      +
    • $\mathrm{TPR(\tau)} = \frac{\mathrm{TP(\tau)}}{\mathrm{TP(\tau)}+\mathrm{FN(\tau)}}$
    • +
    • $\mathrm{FPR(\tau)} = \frac{\mathrm{FP(\tau)}}{\mathrm{FP}(\tau)+\mathrm{TN}(\tau)}$
    • +
    +
  • +
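A hedged sketch of computing both curves with sklearn; the arrays y and scores are assumed to hold ground-truth labels and anomaly scores, as in the synthetic example of this notebook:

from sklearn.metrics import precision_recall_curve, roc_curve

# y: 1 = anomaly, 0 = nominal; scores: higher = more anomalous
precision, recall, pr_thresholds = precision_recall_curve(y, scores)
fpr, tpr, roc_thresholds = roc_curve(y, scores)

plt.plot(recall, precision, label="PR curve")
plt.plot(fpr, tpr, label="ROC curve")
plt.legend()
plt.show()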
-
+
+
+
+

Cost Matrix

Choosing the optimal threshold does not only depend on the values of our metrics but also on the costs associated with the confusion matrix. Like precision and recall, the confusion matrix has to be understood as a function of the threshold: for each threshold we obtain different numbers of true positives, false positives, and so on. The associated costs, e.g. for the false positives, are the expected costs that a false positive prediction generates (profits are represented as negative costs). Unlike the confusion matrix, the cost matrix does not depend on $\tau$.

+ + + + + + + + + + + + + + + + + + + +
Cost Matrix | Actual Nominal | Actual Anomaly
Predicted Nominal | Cost of TN (CTN) | Cost of FN (CFN)
Predicted Anomaly | Cost of FP (CFP) | Cost of TP (CTP)
+

Our goal is to set the threshold such that the expected empirical costs will be minimized

+$$ +\begin{align*} +\tau_{\text{opt}}=\arg\min_\tau \frac{\mathrm{TP}(\tau)\cdot \mathrm{CTP} + \mathrm{FP}(\tau)\cdot \mathrm{CFP} + \mathrm{TN}(\tau)\cdot \mathrm{CTN} + \mathrm{FN}(\tau)\cdot \mathrm{CFN}}{N} +\end{align*} +$$

where $N$ is the total number of samples.
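A minimal sketch of this threshold search; the cost values and the arrays y and scores are placeholders, not values given in this notebook:

import numpy as np

def empirical_cost(y_true, scores, tau, c_tp=1.0, c_fp=5.0, c_tn=0.0, c_fn=50.0):
    # confusion matrix entries as a function of the threshold tau
    pred = scores >= tau
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    tn = np.sum(~pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    return (tp * c_tp + fp * c_fp + tn * c_tn + fn * c_fn) / len(y_true)

taus = np.linspace(scores.min(), scores.max(), 200)
costs = [empirical_cost(y, scores, t) for t in taus]
tau_opt = taus[int(np.argmin(costs))]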

-
-
+
+
+
+
+

Our First Anomaly Detection Approach

Let's have a look at a simple probabilistic anomaly score.

+
    +
  • If the distribution of nominal data is known then we can use $-\log p(x)$, also known as the surprise.
  • +
  • If only the covariance $\Sigma$ and the mean $\mu$ are known, the Mahalanobis distance to the mean $\sqrt{(x-\mu)^T\Sigma^{-1}(x-\mu)}$ can be used (a sketch with a full covariance matrix follows after this list).
      +
    • Only applicable if the nominal distribution is unimodal and centered around the mean.
    • +
    • Extension for mixture models with means $\mu_1,\ldots,\mu_k$ and covariance matrices $\Sigma_1,\ldots,\Sigma_k$: $\min_{1\leq i \leq k}\sqrt{(x-\mu_i)^T\Sigma_i^{-1}(x-\mu_i)}$
    • +
    +
  • +
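A hedged sketch of both scores listed above with a full covariance matrix (the helper names are our own; the notebook itself uses a diagonal covariance below):

import numpy as np

def mahalanobis(x, mu, Sigma):
    # Mahalanobis distance of each row of x (shape (n, d)) to the mean mu under covariance Sigma
    diff = np.atleast_2d(x) - mu
    Sigma_inv = np.linalg.inv(Sigma)
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, Sigma_inv, diff))

def min_mahalanobis(x, mus, Sigmas):
    # mixture extension: minimum Mahalanobis distance over the components
    return np.min([mahalanobis(x, m, S) for m, S in zip(mus, Sigmas)], axis=0)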
-
-
-
taxi_df.head()
-
+
+
+
+
+
+

Motivation of Mahalanobis Distance

    +
  • The Mahalanobis distance is motivated by the surprise of a Gaussian: +\begin{align*} +-\log p(x) &= -\log \frac{1}{(2\pi)^{\frac{m}{2}}\sqrt{|\det(\Sigma)|}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)\\ +&= \frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu) + c +\end{align*}
  • +
  • Since monotonic transformations (such as $\sqrt{\cdot}$ or adding a constant) don't change the outlier ranking, this score is equivalent to the Mahalanobis distance
  • +
-
+
+
+
+

Exercise

Try the outlier scores for yourself in a simple synthetic scenario. We have prepared the function evaluate for you. Try to find the optimal threshold for the dataset.

+
+
-
# Helper functions for normalizing data. In most cases it will be enough to use the normalize function
-
-def normalize_data(data: Sequence) -> np.ndarray:
-    scaler = StandardScaler()
-    return scaler.fit_transform(data)
-
+
nominal = np.random.normal(0, [1, 1.5], size=(300, 2))
+anomaly = np.random.normal(5, 2, size=(10, 2))
 
-def normalize_series(series: pd.Series) -> pd.DataFrame:
-    data = series.values.reshape(-1, 1)
-    normalized_data = normalize_data(data).reshape(-1)
-    return pd.Series(normalized_data, index=series.index)
+data = np.concatenate([nominal, anomaly], axis=0)
+y = np.zeros(310)
+y[-10:] = 1
 
+plt.scatter(data[:, 0], data[:,1], c=y)
+plt.gca().set_aspect('equal')
+plt.show()
+
-def normalize_df(data_frame: pd.DataFrame): - normalized_data = normalize_data(data_frame) - return pd.DataFrame(normalized_data, columns=data_frame.columns, index=data_frame.index) +
+
+
+
+
-T = TypeVar("T") +
-def normalize(data: T) -> T: - if isinstance(data, np.ndarray): - return normalize_data(data) - elif isinstance(data, pd.Series): - return normalize_series(data) - elif isinstance(data, pd.DataFrame): - return normalize_df(data) - else: - raise ValueError(f"Unsupported data type: {data.__class__.__name__}") -
-
-
+
+
-
-
- -
-
-
taxi_df_normalized = taxi_df
-taxi_df_normalized["n_calls"] = normalize(taxi_df_normalized["n_calls"])
-taxi_df_normalized.head()
-
-
-
-
-
-

A glance at the tensorflow probability API

For solving the exercises in this notebook you will need to use basic properties of tensorflow probability distributions. They have a very intuitive and convenient API - you get access to the probability density, cdf, quantile function and so on.

+
+
+
+ +
+
Mean: [0.12698056 0.00663621]
+Std: [1.3049857 1.7497876]
+
-
-
-
-
-
-
sample_gev = get_gev(0.5)
+
+
-print(f"Probability density: {sample_gev.prob([1, 0.3])}") -print(f"Cdf: {sample_gev.cdf([1, 0.3])}") -print(f"Quantile: {sample_gev.quantile([0.5, 0.9])}") -print(f"Trainable vars:\n {sample_gev.trainable_variables}") -
+
+
+
+

Question

How did the contamination influence the parameter estimation?

-
-
-

Exercise 1: playing around with GEV parameters

Plot the GEV probability distribution for different values of $\xi, \mu$ and $\sigma$. How do they differ qualitatively?

-

What are the domains of definition of $z$ in the above analytic expression for $G(z)$? What values should the c.d.f. $G(z)$ take outside these domains and how does this affect fitting $\xi, \mu, \sigma$ from data by maximum likelihood estimation?

-

What expression for the GEV do we get in the limit $\xi \longrightarrow 0$?

-

The three qualitatively different shapes of the GEV have their own names. For $\xi >0$ we get the Fréchet Distribution, for $\xi<0$ the reverse Weibull distribution and for $\xi=0$ the Gumbel distribution. Note that using the Gumbel distribution in tensorflow probability is not exactly the same as using GEV with $\xi=0$ due to rounding errors. Try it out!

+

Compute scores and evaluate

@@ -13449,38 +13762,36 @@

Exercise 1: playing arou
-
gev = get_gev(xi=1e-5, sigma=2)
-arr = np.linspace(-5, 5)
-
-pdf = gev.prob(arr)
-plt.plot(arr, pdf)
-plt.show()
+
# Mahalanobis distance from the mean of N(mu, Sigma)
+scores = np.sqrt(((data - mu) * (1/Sigma_diag) * (data - mu)).sum(axis=1)) 
+curves = evaluate(y, scores)
 
+
+
+ +
+ + + +
+ +
+
-
-
-

Solution Exercise 1:

The cdf of the GEV distribution is well defined when $1 + \xi \left( \frac{z - \mu}{\sigma} \right) > 0$. This is equivalent to

-$$ - z > \mu - \frac{\sigma}{\xi} \qquad \text{if $\xi>0$} -$$

and -$$ - z < \mu - \frac{\sigma}{\xi} \qquad \text{if $\xi<0$}. -$$

-

Thus, for $\xi>0$, the distribution has a left boundary: the probability of points lying to the left of it is zero, and the value of the cdf there is zero.

-

For $\xi<0$ there is a right boundary: the probability of points lying to the right of it is zero, and the value of the cdf there is 1.

-

As $\xi$ moves to zero from below, the right boundary is pushed to infinity. Similarly, if it approaches zero from above, the left boundary is pushed to negative infinity. At exactly $\xi=0$, the GEV becomes the Gumbel distribution which is well defined for all $z$.

+
-

We can group the numbers of calls according to the dates, thereby obtaining daily maxima and minima of calls. One way of detecting anomalies in the NYC taxi data set is by fitting a GEV to the distribution of these daily maxima. Here is a histogram of the (normalized) maxima:

+

Choose a threshold

@@ -13490,9 +13801,37 @@

Solution Exercise 1: -

-
-
- -
-
-
plt.hist(daily_grouped_normalized["max"], density=True, bins=40)
-plt.title("Daily maxima of n_calls/(30 minutes)")
-plt.show()
-
- -
-
-
- -
-
-
-

Q: Can you already spot the obvious anomalies? What caused them?

-

A: See below

-

Q: Which of the three qualitatively different shapes would make "physical" sense for the taxi calls data?

-

A: The Weibull shape - thus we expect $\xi<0$.

- -
-
-
-
-
- -
-
-
maxima = daily_grouped_normalized["max"]
-maxima[(maxima > 1.8) | (maxima < -3.5)]
-
- -
-
-
- -
-
-
-
    -
  • 02/11 - NY marathon
  • -
  • 25/12 - Christmas
  • -
  • 27/01 - Snowstorm
  • -
  • 01/01 - New Years
  • -
  • 06/09 - Columbus day (big parade)
  • -
- -
-
-
-
-
-

Fitting the GEV

Now let us infer the parameters of the GEV from the data using maximum likelihood estimation. We will perform gradient descent on the negative log likelihood of the GEV. Here is a very simple training loop, written out in detail, for a suitable initial choice of the shape parameter $\xi$ (called "concentration" in tensorflow probability):

- -
-
-
-
-
- -
-
-
# we are going to be a bit fancy and show an animation of the function as it is being fitted
-
-daily_max = daily_grouped_normalized["max"].values
-
-optimizer = keras.optimizers.SGD(learning_rate=2e-4)
-losses = []
-
-sample_gev = get_gev(xi=-0.1, trainable_xi=True)
-
-fig = plt.figure(dpi=200, figsize=(4.5, 3))
-camera = Camera(fig)
-
-for step in range(100):
-    with tf.GradientTape() as tape:
-        loss = - tf.math.reduce_sum(sample_gev.log_prob(daily_max))
-    gradients = tape.gradient(loss, sample_gev.trainable_variables)
-    optimizer.apply_gradients(zip(gradients, sample_gev.trainable_variables))
-    losses.append(loss)
-    
-    bins = plt.hist(daily_max, bins=40, density=True, color="C0")[1]
-    pdf = sample_gev.prob(bins)
-    plt.plot(bins, pdf, color="orange")
-    ax = plt.gca()
-    ax.text(0.5, 1.01, f"{step=}, Loss={loss}", transform=ax.transAxes)
-    camera.snap()
-
-plt.close()
-plt.figure()
-plt.plot(losses)
-plt.title("Negative Log Likelihood")
-plt.xlabel("gradient steps")
-plt.show()
-
- -
-
-
- -
-
-
-

Seems like after 100 steps we have already converged. Let us have a quick look at the result

- -
-
-
-
-
- -
-
-
bin_positions = plt.hist(daily_max, density=True, bins=25)[1]
-plt.plot(bin_positions, sample_gev.prob(bin_positions))
-plt.show()
-
- -
-
-
- -
-
-
- -
-
-
HTML(camera.animate().to_html5_video())
-
- -
-
-
- -
-
-
- -
-
-
sample_gev.trainable_variables
-
- -
-
-
- -
-
-
-

Well, we probably can do better...

- -
-
-
-
-
-

Exercise 2.1: MLE for the generalized extreme value distribution

Find a better fit using:

-
    -
  1. Removing the obvious anomalies
  2. -
  3. Profiling in the shape parameter $\xi$ or using different initial values/learning rates for inferring $\xi$
  4. -
-

Feel free to improve the code by defining new functions and so on!

-

Evaluate the quality of your fit by visual inspection of a histogram and a Q-Q plot (more on that below).

-

You can also use other statistical tools that you are familiar with.

-

Use the fitted model to find "anomalies" for taxi calls corresponding to probabilities of less than $0.01$.

- -
-
-
-
-
-

Q-Q Plot

A Q-Q plot is useful for visually comparing two distributions, or comparing a distribution with a dataset. We are interested in the latter. In a Q-Q plot the quantiles of one distribution are plotted against the quantiles of another. For a dataset, the natural choice of quantiles is given simply by the sorted data itself. The data points then roughly correspond to the $\frac{k}{n+1}$th percentiles, where $n$ is the number of samples and $k=1,...,n$ (these are often called plotting positions and other choices for them are possible). The corresponding theoretical quantiles from some specified c.d.f. $F$ are then given by $q_k \ \text{s.t} \ F(q_k) = \frac{k}{n+1}$ (in our applications, $F$ will generally be injective and the $q_k$ uniquely defined).

-

If the distribution is a good fit for the data, the resulting line will be close to the diagonal. Below we ask you to complete a simple implementation of the Q-Q plot for tensorflow-like distributions.

- -
-
-
-
-
- -
-
-
ArrayLike = Sequence[Union[float, tf.Tensor]]
-
-
-class TFDistributionProtocol(Protocol):
-    name: str
-    trainable_variables: Tuple[tf.Variable]
-        
-    def quantile(self, prob: ArrayLike) -> ArrayLike: ...    
-
- -
-
-
- -
-
-
- -
-
-
def qqplot(data: ArrayLike, dist: TFDistributionProtocol):
-    num_observations = len(data)
-    observed_quantiles = sorted(data)
-    plotting_positions = np.arange(1, num_observations + 1) / (num_observations + 1)
-    theoretical_quantiles = dist.quantile(plotting_positions)
-    
-    plot_origin = (theoretical_quantiles[0], observed_quantiles[0])
-    plt.plot(theoretical_quantiles, observed_quantiles)
-    plt.plot(theoretical_quantiles, theoretical_quantiles) # adding a diagonal for visual comparison
-    plt.xlabel(f"Theoretical quantiles of {dist.name}")
-    plt.ylabel(f"Observed quantiles")
-    
-
- -
-
-
- -
-
-
-

Solution of exercise 2.1

-
-
-
-
-
- -
-
-
# setting up functions for normal and profile likelihood fit
-
-def fit_dist(data: ArrayLike, dist: TFDistributionProtocol, num_steps=100, lr=1e-4, 
-             plot_losses=True, return_animation=True) -> Union[float, Tuple[float, HTML]]:
-    optimizer = keras.optimizers.SGD(learning_rate=lr)
-    losses = []
-    
-    if return_animation:
-        fig = plt.figure(dpi=200, figsize=(4.5, 3))
-        camera = Camera(fig)    
-
-    for step in range(num_steps):
-        with tf.GradientTape() as tape:
-            loss = - tf.math.reduce_sum(dist.log_prob(data))
-        if np.isnan(loss.numpy()):
-            logging.warning(f"Encountered nan after {step} steps")
-            break
-        
-        gradients = tape.gradient(loss, dist.trainable_variables)
-        optimizer.apply_gradients(zip(gradients, dist.trainable_variables))
-        losses.append(loss)
-        
-        if return_animation:
-            bins = plt.hist(data, bins=50, density=True, color="C0")[1]
-            pdf = dist.prob(bins)
-            plt.plot(bins, pdf, color="orange")
-            ax = plt.gca()
-            ax.text(0.5, 1.01, f"{step=}, Loss={round(loss.numpy(), 2)}", transform=ax.transAxes)
-            camera.snap()
-    
-
-    if plot_losses:
-        plt.close()
-        plt.figure()
-        plt.plot(losses)
-        plt.title("Negative Log Likelihood")
-        plt.xlabel("gradient steps")
-        plt.show()
-    
-    result = losses[-1]
-    if return_animation:
-        result = result, HTML(camera.animate().to_html5_video())
-    return result
-
-def profile_fit_dist(data: ArrayLike, dist_factory: Callable[[float], TFDistributionProtocol], xi_values: Sequence[float], 
-                     num_steps=100, lr=1e-4) -> Tuple[float, TFDistributionProtocol]:
-    """
-    Fits the distribution to the data for each candidate xi in xi_values (keeping xi fixed during each fit)
-    and returns the tuple (minimal_loss, optimal_dist).
-    """
-    minimal_loss = np.infty
-    optimal_dist = None
-    for xi in xi_values:
-        dist = dist_factory(xi)
-        loss = fit_dist(data, dist, num_steps=num_steps, lr=lr, plot_losses=False, return_animation=False)
-        if loss < minimal_loss:
-            minimal_loss = loss
-            optimal_dist = dist
-    if optimal_dist is None:
-        raise RuntimeError(f"Could not find optimal dist, probably due to divergences during fit. "  
-                           "Try to find a better choice for xi_values")
-    return minimal_loss, optimal_dist
-
- -
-
-
- -
-
-
- -
-
-
# removing obvious anomalies
-daily_max_without_anomalies = daily_max[np.logical_and( daily_max < 1.8, daily_max > -3.5)]
-
-plt.hist(daily_max_without_anomalies, density=True, bins=40)
-plt.title("Daily maxima without anomalies")
-plt.show()
-
- -
-
-
- -
-
-
- -
-
-
# Example with profile likelihood
-
-xi_values = np.linspace(-0.3, -0.5, 30)
-dist_factory = lambda xi: get_gev(xi, trainable_xi=False)
-min_loss, optimal_gev = profile_fit_dist(daily_max_without_anomalies, dist_factory, xi_values, num_steps=80)
-print(f"Minimal loss: {min_loss}")
-print(f"Optimal xi: {optimal_gev.concentration}")
-optimal_gev.trainable_variables
-
- -
-
-
- -
-
-
- -
-
-
bin_positions = plt.hist(daily_max_without_anomalies, density=True, bins=40)[1]
-plt.plot(bin_positions, optimal_gev.prob(bin_positions))
-plt.title("Result from profile likelihood")
-plt.show()
-
- -
-
-
- -
-
-
- -
-
-
# Solving with gradient descent on xi
-daily_max_gev = get_gev(xi=-0.4)
-final_loss, animation = fit_dist(daily_max_without_anomalies, daily_max_gev)
-
- -
-
-
- -
-
-
- -
-
-
animation
-
- -
-
-
- -
-
-
- -
-
-
# Here the values found by fitting
-daily_max_gev.trainable_variables
-
- -
-
-
- -
-
-
- -
-
-
# and here the qqplot
-qqplot(daily_max_without_anomalies, daily_max_gev)
-
- -
-
-
- -
-
-
-

Solution exercise 2.1 - Finding anomalies from the GEV

-
-
-
-
-
- -
-
-
#The fit looks quite good, apart from the lower region, which we are not really interested in. 
-#Let us find the anomalies corresponding to the upper 1% quantile
-upper_percentile = 0.99
-upper_quantile = daily_max_gev.quantile(upper_percentile).numpy()
-upper_quantile
-
- -
-
-
- -
-
-
- -
-
-
# and here the anomalies above this threshold
-daily_grouped_normalized["max"][daily_grouped_normalized["max"] > upper_quantile]
-
- -
-
-
- -
-
-
-

In addition to the obvious anomalies found above, we caught Independence Day (more precisely, the day before it). We also have the probabilistic interpretation that on 99% of days the maximal number of calls per 30 minutes will not exceed the threshold found above (the threshold should be rescaled back to the original scale for this statement to hold).

- -
-
-
-
-
-

Estimating the uncertainty

One benefit of the probabilistic approach is that we get confidence intervals almost for free. These can be used to estimate the robustness of our analysis (e.g. the determination of anomalies and the quality of the fit).

- -
-
-
-
-
-

Since we fitted our parameters using MLE, whose estimators are known to be asymptotically normal, we get uncertainty estimates from the second derivatives of the loss function (the observed Fisher information). Fortunately, tensorflow makes this extremely easy for us.

- -
-
-
-
-
- -
-
-
def observed_fisher_information(data: ArrayLike, dist: TFDistributionProtocol) -> tf.Tensor:
-    with tf.GradientTape() as t2:
-        with tf.GradientTape() as t1:
-            nll = - tf.math.reduce_sum(dist.log_prob(data))
-        # conversion needed b/c trainable_vars is a tuple, so gradients and jacobians are tuples too
-        g = tf.convert_to_tensor(  
-            t1.gradient(nll, dist.trainable_variables)
-        )
-    return tf.convert_to_tensor(t2.jacobian(g, dist.trainable_variables))
-
- -
-
-
- -
-
-
- -
-
-
def mle_std_deviations(data: ArrayLike, dist: TFDistributionProtocol) -> tf.Tensor:
-    observed_information_matrix = observed_fisher_information(data, dist)
-    mle_covariance_matrix = tf.linalg.inv(observed_information_matrix)
-    variances = tf.linalg.tensor_diag_part(mle_covariance_matrix)
-    return tf.math.sqrt(variances)
-
- -
-
-
- -
-
-
-

Exercise 2.2: Uncertainty in GEV

Using the above functions, include error bars into the Q-Q plots of the maximum likelihood estimates of the GEV distribution found above.

- -
-
-
-
-
-

Solution Exercise 2.2

-
-
-
-
-
- -
-
-
# finding the stddevs and adding/subtracting them from the values found from fitting
-std_devs = mle_std_deviations(daily_max_without_anomalies, daily_max_gev)
-print(f"Found std_devs: {std_devs}")
-
-coeff_fitted = tf.convert_to_tensor(daily_max_gev.trainable_variables)
-coeff_upper = coeff_fitted + std_devs
-coeff_lower = coeff_fitted - std_devs
-
-# creating GEVs corresponding to the boundaries of the confidence intervals found above
-gev_upper = get_gev(*coeff_upper)
-gev_lower = get_gev(*coeff_lower)
-
- -
-
-
- -
-
-
- -
-
-
# The qqplots for the original GEV and the GEVs at the boundaries
-
-qqplot(daily_max_without_anomalies, daily_max_gev)
-qqplot(daily_max_without_anomalies, gev_upper)
-qqplot(daily_max_without_anomalies, gev_lower)
-
- -
-
-
- -
-
-
-

Exercise 3: GEV for minima

Now let us repeat the same analysis fitting the distribution of the daily minima using the same strategy. Since minima for a univariate random variable $X$ correspond to maxima of $-X$, all we have to do is to fit a GEV to the minima multiplied by -1.

- -
-
-
-
-
-

Solution of exercise 3

-
-
-
-
-
- -
-
-
neg_minima_series = -daily_grouped_normalized["min"]
-neg_daily_min = neg_minima_series.values
-
-plt.hist(neg_daily_min, density=True, bins=40)
-plt.title("Daily minima * (-1)")
-plt.show()
-
- -
-
-
- -
-
-
- -
-
-
# identifying obvious anomalies
-neg_minima_series[(neg_minima_series>2) | (neg_minima_series<-2)]
-
- -
-
-
- -
-
-
- -
-
-
# - 01/01 - New Year
-# - 01-02/11 - Marathon
-# - 26-27/01 - Snowstorm
-
- -
-
-
- -
-
-
- -
-
-
neg_minima_without_anomalies = neg_daily_min[np.logical_and(neg_daily_min<2, neg_daily_min>-2)]
-plt.hist(neg_minima_without_anomalies, density=True, bins=40)
-plt.title("Daily minima * (-1) without obvious anomalies")
-plt.show()
-
- -
-
-
- -
-
-
- -
-
-
daily_min_gev = get_gev(xi=-0.3)
-final_loss, animation = fit_dist(neg_minima_without_anomalies, daily_min_gev)
-
- -
-
-
- -
-
-
- -
-
-
animation
-
- -
-
-
- -
-
-
- -
-
-
qqplot(neg_minima_without_anomalies, daily_min_gev)
-
- -
-
-
- -
-
-
- -
-
-
# Fit looks good in the region we are interested in, let us find the 1% quantile and the corresponding anomalies
-
- -
-
-
- -
-
-
- -
-
-
upper_quantile = daily_min_gev.quantile(0.99).numpy()
-upper_quantile
-
- -
-
-
- -
-
-
- -
-
-
neg_minima_series[neg_minima_series>upper_quantile]
-
- -
-
-
- -
-
-
- -
-
-
# Only one non-obvious anomaly is found in the upper quantile; it is caused by the snowstorm responsible for the obvious anomalies we have seen above.
-
- -
-
-
- -
-
-
-

Comparison with Z-Test

-
-
-
-
-
- -
-
-
daily_means = daily_grouped_normalized["sum"]
-
-plt.plot(daily_means.values)
-plt.axhline(y=2., color='r', linestyle='-')
-plt.axhline(y=-2., color='r', linestyle='-')
-plt.show()
-
- -
-
-
- -
-
-
-

The big question here is: where to put the threshold? Clearly the assumption of a Gaussian distribution underlying the sum of daily calls is incorrect - the distribution seems skewed.

- -
-
-
-
-
- -
-
-
plt.hist(daily_means, bins=25)
-plt.show()
-
- -
-
-
- -
-
-
-

We can detect some anomalies with the Z-test, of course, but the probabilistic interpretation is going to be flawed.

- -
-
-
-
-
- -
-
-
daily_means[np.abs(daily_means) > 2]
-
- -
-
-
- -
-
-
-

A look back at the theory

So, what have we really done and why does it make sense to use the GEV for such problems? What kind of guarantees does the Fisher–Tippett–Gnedenko theorem give us about the quality of the fit?

- -
-
-
-
-
-

Well, the truth is, not too many. First notice the following exact equality:

-$$ -P(M_n < z) = P(X_1< z \text{ and } X_2 < z ... \text{ and } X_n < z) = F^n(z) -$$

So, if we know the cumulative distribution, there is no need to resort to the GEV. Typically, of course, we do not know it. The above equality implies:

-$$ -\lim P(M_n < z) = - \begin{cases} - 0 & \text{if}\ F(z) < 1 \\ - 1 & \text{otherwise} - \end{cases} -$$ -
-
-
-
-
-

We actually always know the exact limit of the distribution of the block-maxima! It is degenerate (either a step function or identically zero). In fact, this degenerate distribution can be seen as a limit of the GEV. It would correspond to normalizing constants $a_n=1, \ b_n=0$.

- -
-
-
-
-
-

While this observation is very simple and the difference between the cdf of block maxima $P(M_n < z)$ and its degenerate limit does decrease as $n$ increases, this limiting distribution is unexpressive and fitting it to data does not provide probabilistic insight.

- -
-
-
-
-
-

Q: How many parameters does the exact limit of $F^n$ have? What would we get if we fit it to data?

-

A:

- -
-
-
-
-
-

Introducing the normalizing constants $a_n$ and $b_n$ might allow the distribution of renormalized block maxima to converge to something non-trivial. It also might not.

- -
-
-
-
-
-

In applications we usually care about modeling $M_n$ for a _fixed $n_0$_ (or maybe for a few selected $n_i$). An arbitrary series of $a_n$ and $b_n$ that at some point helps convergence does not directly address our needs. In fact, this is also not what we do - by fitting the GEV parameters to data for our selected $n_0$ we automatically find the best $a_{n_0}$ and $b_{n_0}$ that minimize the difference between $F^{n_0}(z)$ and $G(z)$.

- -
-
-
-
-
-

Clearly $G(z)$ is much more expressive than the degenerate exact limit and could potentially provide a good fit.

- -
-
-
-
-
-

So, the convergence that we really care about is to answer the question:

-

How well do the best fits of $G(z)$ for fixed $n$ - let us call them $G_n(z)$ - approximate the distributions $F^n(z)$ as $n$ increases? One could e.g. be interested in the infinity norm

-$$ -\Delta_n := \sup_z | F^n(z) - G_n(z) | -$$ -
-
-
-
-
-

This is not the same as asking how well $G(z)$ approximates some rescaled variant of $F^n(z)$ with $n$-dependent normalization constants! That would be

-$$ -\tilde{\Delta}_n(a_n, b_n) := \sup_z |F^n(a_n z + b_n) - G(z) | -$$ -
-
-
-
-
-

In the latter question, the choice of normalization constants matters, in the former it does not - they are implicitly determined by the best fit for each $n$. Since for $\Delta_n$ the $a_n, b_n$ have been optimized, one could reasonably expect a relation of the type

-$$ -\Delta_n \approx \min_{a_n, b_n} \tilde{\Delta}_n(a_n, b_n) -$$

to hold.

- -
-
-
-
-
-

It is easy to see that if convergence to a GEV holds for some normalizing sequences $a_n, b_n$, then for any other sequences $\tilde{a}_n, \tilde{b}_n$ for which there exist $a>0$ and $b$ such that

-$$ -\lim_{n\rightarrow \infty} \frac{\tilde{a}_n}{a_n} = a \quad,\quad \lim_{n \rightarrow \infty} \frac{b_n-\tilde{b}_n}{a_n} = b -$$

the rescaled $\frac{M_n-\tilde{b}_n}{\tilde{a}_n}$ also converges to a GEV of the same type (with the same $\xi$). This is often phrased by saying that the distribution $F$ belongs to the domain of attraction of a fixed GEV type. However, the error rates $\tilde{\Delta}_n(\tilde{a}_n, \tilde{b}_n)$ would differ from those associated with $a_n, b_n$.

- -
-
-
-
-
-

Unfortunately, theoretical bounds for the quantity of interest $\Delta_n$ are hard to come by - we are not aware of any. They also highly depend on the fitting procedure, which is non-trivial, as we have seen above. There are some bounds for quantities of the type $\tilde{\Delta}_n(\tilde{a}_n, \tilde{b}_n)$ (see the annotated literature reference) but they are rather loose and not really helpful in practice. Therefore, the EVT theorems are more of a motivation for selecting distribution families for fitting than a rigorous approach with guarantees. In practice the convergence and fit tend to work pretty well, though.

- -
-
-
-
-
-

Exercise 4 (theoretical, bonus): outlining the proof of the Fisher–Tippett–Gnedenko theorem

One may wonder how the statement of the Fisher–Tippett–Gnedenko theorem is obtained without providing bounds on convergence. The reason is that the limiting distribution of (renormalized) maxima must have a very special property - it must be max-stable. It is instructive to go through a part of the proof to get a feeling for the EVT theorems. We will do so in this exercise.

-

Definition: A cumulative distribution function $D(z)$ is called max-stable iff for all $n\in\mathbb{N} \ \exists \ \alpha_n>0, \beta_n \in \mathbb{R}$ such that

-$$ -D^n(z) = D(\alpha_n z + \beta_n) -$$

Prove that from $\lim_{n\rightarrow \infty} P\left( \frac{M_n - b_n}{a_n} < z \right) = G(z)$ follows that $G(z)$ is max-stable.

-

This goes a long way towards proving the first EVT theorem. One can easily compute that the GEV distribution is max-stable and with more effort one can also prove that any max-stable distribution belongs to the GEV family. Thus, the proof of the theorem is very implicit and does not involve any convergence rates or bounds.

- -
-
-
-
-
-

Exercise 5: increase the block size

According to the line of thought above, increasing the block-size before determining the maxima should improve convergence. Of course, it also decreases the number of points for fitting so it increases variance. We will analyze uncertainties of the fitted GEV below.

-

Repeat the fit of the GEV for 2-day maxima/minima. What do you think about the result?

-

Hint: use the .reshape method of numpy arrays on the already computed daily maxima/minima

- -
-
-
-
-
-

Solution exercise 5:

-
-
-
-
-
- -
-
-
bidaily_maxima = daily_max_without_anomalies.reshape(-1, 2).max(axis=1)
-
-plt.hist(bidaily_maxima, bins=40)
-plt.title("Bidaily maxima")
-plt.show()
-
- -
-
-
- -
-
-
- -
-
-
bidaily_gev = get_gev(xi=-0.5)
-loss, animation = fit_dist(bidaily_maxima, bidaily_gev, lr=3e-4, num_steps=100)
-
- -
-
-
- -
-
-
- -
-
-
bidaily_gev.trainable_variables
-
- -
-
-
- -
-
-
-

The shape parameter should be independent of the block size (it is not affected by $a_n$ and $b_n$). Of course, since we find it by fitting, we shouldn't be surprised to obtain a slightly different value.

- -
-
-
-
-
- -
-
-
animation
-
- -
-
-
- -
-
-
-

We get a better fit than before (less than half of the loss with half as many data points), but we have higher variance in the very important shape parameter $\xi$.

- -
-
-
-
-
- -
-
-
std_devs_daily = mle_std_deviations(daily_max_without_anomalies, daily_max_gev)
-std_devs_bidaily = mle_std_deviations(bidaily_maxima, bidaily_gev)
-
-print("Daily stddevs:")
-print(std_devs_daily.numpy())
-print("Biaily stddevs:")
-print(std_devs_bidaily.numpy())
-
- -
-
-
- -
-
-
-

Peaks over threshold (PoT)

So far we have only used the first theorem of EVT. As you might have noticed above, it can be somewhat wasteful in terms of data efficiency. Since the GEV is fitted on block maxima, a huge number of data points remains unused for parameter estimation. The second theorem of EVT gives rise to a more data-efficient approach.

- -
-
-
-
-
-

Exercise 6 (theoretical, bonus): deriving the second theorem of EVT

Use the approximations $\ln(1+x) \approx x$ for $|x| \ll 1$ and $F(z) \approx 1$ for large enough $z$ to derive

-\begin{equation} -P(X-u < y \mid X > u) \approx 1 - \left( 1 + \frac{\xi \ y}{\tilde{\sigma}} \right)^{-\frac{1}{\xi}} \label{GPD-approx-original} -\end{equation}

for large enough $u$ (this is a slightly less formal derivation of the Pickands–Balkema–De Haan theorem). One could equivalently write

-\begin{equation} -P(X-u > y \mid X > u) \approx \left( 1 + \frac{\xi \ y}{\tilde{\sigma}} \right)^{-\frac{1}{\xi}} -\end{equation}

What is the relation between $\tilde{\sigma}$ and the normalizing coefficients of the first theorem of EVT?

- -
-
-
-
-
-

The above equation can be used to estimate the entire tail of the cdf $F$ of $X$ from a sample of size $N$ obtained by sampling repeatedly from $F$. First note that for a single $u$ we can approximate the cdf through the sample statistics as:

-\begin{equation} -1-F(u) = P(X>u) \approx \frac{N_u}{N} -\end{equation}

where $N_u$ is the number of samples with values above $u$. Interpreting $u$ as a threshold, we will call those samples peaks over threshold (PoT) and $N_u$ is simply their count.
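In code this amounts to a one-line empirical estimate (sample and u are assumed to be defined; toy sketch only):

N = len(sample)
N_u = np.sum(sample > u)            # number of peaks over the threshold u
tail_prob_at_u = N_u / N            # empirical estimate of P(X > u)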

- -
-
-
-
-
-

Q: What should $u$ and the data set fulfill in order for the above approximation to be accurate?

-

A: $u$ should be small enough that many data points are larger than it. Then the approximation $P(X>u) \approx \frac{N_u}{N}$ holds (the empirical estimate $\frac{N_u}{N}$ has low variance).

- -
-
-
-
-
-

Now we can perform a series of approximations for $z>u$ to get to the tail-distribution. First using $P(X>u) \approx \frac{N_u}{N}$ we get

-$$ -P(X>z) = P(X>z \cap X>u) = P(X>z \mid X>u) P(X>u) \approx \frac{N_u}{N} P(X>z \mid X>u) -$$ -
-
-
-
-
-

Now we use the GPD theorem to approximate

-$$ -P(X>z \mid X>u) = P(X-u > z -u \mid X>u) \approx - \left( 1 + \frac{\xi (z-u)}{\tilde{\sigma}} \right)^{-\frac{1}{\xi}} -$$ -
-
-
-
-
-

Putting everything together gives

-$$ -P(X>z) \approx \frac{N_u}{N} \left( 1 + \frac{\xi (z-u)}{\tilde{\sigma}} \right)^{-\frac{1}{\xi}} -$$ -
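A hedged numpy sketch of this tail estimator; xi and sigma_tilde are assumed to come from a GPD fitted to the excesses, and the helper name is our own:

def gpd_tail_prob(z, u, xi, sigma_tilde, sample):
    # P(X > z) for z > u, combining the empirical estimate of P(X > u) with the GPD tail
    N_u = np.sum(sample > u)
    return (N_u / len(sample)) * (1.0 + xi * (z - u) / sigma_tilde) ** (-1.0 / xi)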
-
-
-
-
-

Q: Intuitively, what does $u$ need to fulfill for both approximations to hold?

-

A: $u$ should be small enough such that the approximation $P(X>u) \approx \frac{N_u}{N}$ holds and sufficiently large such that the generalized pareto distribution is a good estimate of the tail of the distribution for values larger than $u$. Intuitively, it should be at the beginning of the tail, where for values larger than $u$ only the tail behavior plays a role - i.e. no more local extrema or other specifics of the underlying distribution of the data.

- -
-
-
-
-
-

Exercise 7: Using the GPD for anomaly detection

This exercise lets you explore the second theorem of EVT for anomaly detection. Here we let you calculate and code on your own, without giving too many hints. You can follow the GEV-fitting code above for solving this exercise. Feel free to ask for hints if you are stuck!

-
    -
  1. Using the results above, find an approximation of the upper quantile $z_q$ such that $P(X>z_q) < q$ (assuming $z_q > u$).
  2. -
  3. What is the relation of this quantile to the quantile of the generalized pareto distribution?
  4. -
  5. Select a threshold $u$ and fit the generalized pareto distribution to the peaks over this threshold using tensorflow-probability and the same tricks that were used above for fitting the GEV distribution. You might want to use the profile likelihood fitting.
  6. -
  7. Determine anomalies from the quantile function.
  8. -
  9. What advantages do you see in fitting the GPD with PoT compared to fitting GEV distribution using block-maxima for anomaly detection? What are the disadvantages?
  10. -
  11. Check the quality of your fit and perform an uncertainty analysis as above for the GEV.
  12. -
- -
-
-
-
-
-

Solution Exercise 7:

-
-
-
-
-
- -
-
-
# We define the creation of the GPD analogous to the GEV above
-
-def get_gpd(xi: float,  sigma=1., trainable_xi=True):
-    xi, sigma = np.array([xi, sigma]).astype(float)
-    if trainable_xi:
-        xi = tf.Variable(xi, name="xi")
-    return tfd.GeneralizedPareto(
-        loc=0,
-        scale=tf.Variable(sigma, name='sigma'),
-        concentration=xi
-    )
-
- -
-
-
- -
-
-
- -
-
-
# GPD is fit directly on the thresholded data, no need for grouping
-
-n_calls = taxi_df_normalized["n_calls"].values
-
+
+
-
-
-
- -
-
-
+
-
-
-
plt.hist(n_calls, bins=40)
-plt.show()
-
-
-
-
-
-
-
-
-
-
# seems like u=1 gives a good value for the beginning of "tail behaviour"
 
-u = 1
-thresholded_n_calls = n_calls[n_calls>u] - u
-plt.hist(thresholded_n_calls, bins=50)
-plt.show()
-
-
-
+
+ +
- -
-
-
- -
-
-
# fitting the gpd. We need a small lr to not hit singularities
-# We bypass fitting xi here, instead using the xi found from fitting the GEV above. 
-# Theory suggests that it should be close to the optimal value. 
-# We could also profile around it or try full gradient, of course. The latter is brittle
+
+
+
-xi_gev = daily_max_gev.concentration.numpy() -print(f"Using xi={xi_gev}") -gpd = get_gpd(xi=xi_gev, sigma=1, trainable_xi=False) -loss, animation = fit_dist(cleaned_thresholded_calls, gpd, lr=5e-6, num_steps=100) -
-
-
+
+
-
-
- -
-
-
animation
-
-
-
-
- -
-
-
-
+y_test = np.zeros(data_test.shape[0]) +y_test[-10:] = 1 -
-
-
# and the anomalies lying above it. We find the same ones
+scores_test = np.sqrt(((data_test - mu) * (1/Sigma_diag) * (data_test - mu)).sum(axis=1)) 
 
-n_calls = taxi_df_normalized["n_calls"]
-taxi_df_normalized[taxi_df_normalized["n_calls"] > q]
+visualize_mahalanobis(data_test, y_test, scores_test, mu, Sigma_diag, thr_opt)
 
-
-
-
-

Results:

We found new candidates for anomalies (or rare events). The 10.01.2015 was the day following the Charlie Hebdo related terrorist attacks, when there was a large march in Paris. Maybe there was additional movement across New York's large Jewish community. See e.g. this article

-

We could not find events that could have caused the large numbers of calls on the 18/10/2014 and the 22/11/2014.

-

We also now have a probabilistic model for the tail of n_calls/30 minutes which might be useful for planning taxi availabilities on a more granular level than just per-day.

- -
-
-
-
-
-

So far, we have completely ignored the time-series aspect of our data set. When using EVT for time series, as will often be the case in practice, seasonality, trends and so on need to be taken into account.

- -
-
-
-
-
-

We have already seen a treatment of these topics for time series forecasting. Without going into details, we want to mention that the time-dependency can be to some extent taken into account in EVT by allowing for time-dependent parameters $\xi(t), \mu(t), \sigma(t)$.

- -
-
-
-
-
-

There exist multiple strategies for finding these time dependent functions from data, the most straightforward one being MLE-fitting with sliding windows over a sample. One could also easily include known modulations into the MLE fitting, e.g. something like

-$$ -\mu(t) : = \mu_0 \sin(t) -$$

might do the job if one knows that the underlying mean varies with $\sin(t)$. Then one only needs to fit $\mu_0$.
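A purely hypothetical sketch of the sliding-window variant; the helper fit_gev_mle is not defined anywhere in this notebook and stands for any routine returning fitted $(\xi, \mu, \sigma)$ for a window of block maxima:

window = 30                                   # e.g. 30 daily maxima per window
params_over_time = []
for start in range(0, len(daily_max) - window + 1):
    # hypothetical helper: fit a GEV to the maxima inside the current window
    params_over_time.append(fit_gev_mle(daily_max[start:start + window]))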

- -
-
-
-
-
-

Confusion about EVT for anomaly detection

Unfortunately, there are some incorrect claims about applications of EVT in the AD literature. The claims often involve an incorrect analysis of EVT for multivariate and multimodal distributions.

- -
-
-
-
-
-

Note that the EVT theorems apply to univariate distributions. They also ignore multimodality as only tail-behaviour plays a role for them.

- -
-
-
-
-
-

Results of EVT in one dimension cannot be directly transferred to higher dimensions, even for Gaussians. The cdf of the Mahalanobis radius is simply not dimension independent; see here for an exact expression for it. Attempting such a transfer leads to a bad fit, which is sometimes called a failure of classical EVT. Similar approaches and resulting claims have been tried on Gaussian mixtures.

- -
-
-
-
-
-

EVT for outlier scores

The NYC taxi data is very simple, so we could apply EVT to it directly. For multidimensional data these techniques don't work out of the box. However, as mentioned in the beginning, virtually all AD algorithms produce a 1-dimensional score which can then be given a probabilistic meaning through EVT. We will explore this approach in the last exercise of this section.
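A hedged sketch of this idea with scipy, as a stand-in for the tensorflow-probability based fitting used above; scores is assumed to be a 1-dimensional array of outlier scores from any detector:

import numpy as np
from scipy.stats import genpareto

u = np.quantile(scores, 0.90)                              # threshold at the start of the tail
excesses = scores[scores > u] - u
xi_hat, _, sigma_hat = genpareto.fit(excesses, floc=0.0)   # scipy's shape parameter corresponds to xi here

# threshold exceeded by roughly 1% of the peaks over u, i.e. ~0.1% of all points
z_th = u + genpareto.ppf(0.99, xi_hat, loc=0.0, scale=sigma_hat)
anomalies = scores > z_th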

+
+
-
-
-
-
-
-

Things we have omitted

There are many ways to extend the ideas presented here:

+
-
-
-
-
-
-

The PoT method can be adapted to work on streams in a memory-efficient way, by automatically stripping off obvious anomalies and adjusting the threshold.

-
-
-
-
-
-

We have seen how MLE with gradient descent is brittle and subject to divergences. There is a lot of literature containing bags of tricks for finding the MLE estimators for GEV and GPD distributions in a smarter, more robust way.

+
+
-
-
-
-
-

One can also give up on MLE and use goodness-of-fit objectives to minimize the difference with the empirical cdf given by the data.

-
-
-
-
-

Generally, there is a large body of literature on EVT, although more in the engineering/math directions than for AD.

-
-
-
-

Exercise 8

Using the anomaly scores from a data set and algorithm from yesterday (you can choose your favorite), perform an EVT analysis along the lines of what was done above. What are your conclusions? In which situations can such an analysis be useful in practical situations?

-
-
-

Solution of exercise 8:

Left to the reader

+

Summary

    +
  • Anomalies are patterns in data that do not conform to a well defined notion of normal behavior.
  • +
  • Detecting anomalies can be very valuable in a broad spectrum of industry sectors and company divisions.
  • +
  • Anomaly detection uses mostly unsupervised techniques.
  • +
  • Outlier scores measure the degree of outlyingness.
  • +
  • If some statistical properties of the nominal distribution are known then the surprise or the Mahalanobis distance can be used as an outlier score.
  • +
  • Evaluation metrics: precision, recall, $F_1$, ROC (AUC), PR (AUC).
  • +
@@ -14962,23 +13986,9 @@

Solution of exercise 8:

Snow

-
Thank you for your attention; this concludes the anomaly detection training.
-
We will be happy to see you in another Transferlab training soon!
-
-
-

-
-
- -
-
-
 
-
-
-
@@ -14989,5 +13999,9 @@

Solution of exercise 8:

diff --git a/docs/_static/extended_intro_and_ad_taxonomy.html b/docs/_static/nb_01_1_intro_and_ad_taxonomy.html
similarity index 99%
rename from docs/_static/extended_intro_and_ad_taxonomy.html
rename to docs/_static/nb_01_1_intro_and_ad_taxonomy.html
index 02738d6..3bb6dc4 100644
--- a/docs/_static/extended_intro_and_ad_taxonomy.html
+++ b/docs/_static/nb_01_1_intro_and_ad_taxonomy.html
@@ -3,7 +3,7 @@
-extended_intro_and_ad_taxonomy
+nb_01_1_intro_and_ad_taxonomy

$\newcommand{\vect}[1]{{\mathbf{\boldsymbol{#1}} }}$ +$\newcommand{\amax}{{\text{argmax}}}$ +$\newcommand{\P}{{\mathbb{P}}}$ +$\newcommand{\E}{{\mathbb{E}}}$ +$\newcommand{\R}{{\mathbb{R}}}$ +$\newcommand{\Z}{{\mathbb{Z}}}$ +$\newcommand{\N}{{\mathbb{N}}}$ +$\newcommand{\C}{{\mathbb{C}}}$ +$\newcommand{\abs}[1]{{ \left| #1 \right| }}$ +$\newcommand{\simpl}[1]{{\Delta^{#1} }}$

import numpy as np
+import itertools as it
+from tqdm import tqdm
+
+import matplotlib
+from matplotlib import pyplot as plt
+import plotly.express as px
+import pandas as pd
+
+import ipywidgets as widgets
+
+from tfl_training_anomaly_detection.exercise_tools import evaluate, get_kdd_data, get_house_prices_data, create_distributions, contamination, \
+perform_rkde_experiment, get_mnist_data
+
+from ipywidgets import interact
+
+from sklearn.metrics import roc_auc_score, average_precision_score
+from sklearn.model_selection import RandomizedSearchCV
+from sklearn.preprocessing import MinMaxScaler
+from sklearn.preprocessing import LabelBinarizer
+from sklearn.ensemble import IsolationForest
+from sklearn import metrics
+from sklearn.model_selection import train_test_split
+from sklearn.decomposition import PCA
+from sklearn.neighbors import KernelDensity
+
+from tfl_training_anomaly_detection.vae import VAE, build_decoder_mnist, build_encoder_minst, build_contaminated_minst
+
+from tensorflow import keras
+
+%matplotlib inline
+matplotlib.rcParams['figure.figsize'] = (5, 5)
+

Anomaly Detection via Density Estimation

Idea: Estimate the density of $F_0$. Areas of low density are anomalous.

  • Often $p$ is too small to estimate a complete mixture model
  • Takes into account that $F_1$ might not be well-defined
  • The estimation procedure needs to be robust against contamination if no clean training data is available
Kernel Density Estimation

  • Non-parametric method
  • Can represent almost arbitrarily shaped densities
  • Each training point "spreads" a fraction of the probability mass as specified by the kernel function
Definition

Definition:

  • $K: \mathbb{R} \to \mathbb{R}$ kernel function
    • $K(r) \geq 0$ for all $r\in \mathbb{R}$
    • $\int_{-\infty}^{\infty} K(r) dr = 1$
  • $h > 0$ bandwidth
  • Bandwidth is the most crucial parameter

Definition:

+
+

Let $D = \{x_1,\ldots,x_N\}\subset \mathbb{R}^p$. The KDE with kernel $K$ and bandwidth $h$ is +$KDE_h(x, D) = \frac{1}{N}\sum_{i=1}^N \frac{1}{h^p}K\left(\frac{|x-x_i|}{h}\right)$
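To make the formula concrete, here is a minimal NumPy sketch (not part of the training package) that evaluates $KDE_h(x, D)$ exactly as written above, using a univariate Gaussian profile for $K$; the data array is made up, and a properly normalized multivariate Gaussian kernel would use a dimension-dependent constant instead of the one-dimensional one.

import numpy as np

def gaussian_profile(r: np.ndarray) -> np.ndarray:
    # Univariate kernel K: R -> R with K(r) >= 0 and integral K(r) dr = 1
    return np.exp(-r ** 2 / 2) / np.sqrt(2 * np.pi)

def kde_value(x: np.ndarray, D: np.ndarray, h: float) -> float:
    """Evaluate KDE_h(x, D) = 1/N * sum_i 1/h^p * K(|x - x_i| / h)."""
    N, p = D.shape
    r = np.linalg.norm(x - D, axis=1) / h   # |x - x_i| / h for every training point
    return float(np.mean(gaussian_profile(r) / h ** p))

D = np.random.randn(500, 2)                 # hypothetical training sample
print(kde_value(np.zeros(2), D, h=0.5))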

Effect of bandwidth and kernel

Exercise

Play with the parameters!

dists = create_distributions(dim=2, dim_irrelevant=0)
+
+sample_train = dists['Double Blob'].sample(500)
+X_train = sample_train[-1]
+y_train = [0]*len(X_train)
+
+plt.scatter(X_train[:,0], X_train[:,1], c = 'blue', s=10)
+plt.show()
+
2023-04-21 23:25:27.607732: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
+To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
# Helper function
+def fit_kde(kernel: str, bandwidth: float, X_train: np.array) -> KernelDensity:
+    """ Fit KDE
+    
+    @param kernel: kernel
+    @param bandwidth: bandwidth
+    @param X_train: training data
+    """
+    kde = KernelDensity(kernel=kernel, bandwidth=bandwidth)
+    kde.fit(X_train)
+    return kde
+
+def visualize_kde(kde: KernelDensity, bandwidth: float, X_test: np.array, y_test: np.array) -> None:
+    """Plot KDE
+    
+    @param kde: KDE
+    @param bandwidth: bandwidth
+    @param X_test: test data
+    @param y_test: test label
+    """
+    fig, axis = plt.subplots(figsize=(5, 5))
+
+    lin = np.linspace(-10, 10, 50)
+    grid_points = list(it.product(lin, lin))
+    ys, xs = np.meshgrid(lin, lin)
+    # The score function of sklearn returns log-densities
+    scores = np.exp(kde.score_samples(grid_points)).reshape(50, 50)
+    colormesh = axis.contourf(xs, ys, scores)
+    fig.colorbar(colormesh)
+    axis.set_title('Density Contours (Bandwidth={})'.format(bandwidth))
+    axis.set_aspect('equal')
+    color = ['blue' if i ==0 else 'red' for i in y_test]
+    plt.scatter(X_test[:, 0], X_test[:, 1], c=color)
+    plt.show()
+

Choose KDE Parameters

ker = None
+bdw = None
+@interact(
+    kernel=['gaussian', 'tophat', 'epanechnikov', 'exponential', 'linear', 'cosine'],
+    bandwidth=(.1, 10.)
+)
+def set_kde_params(kernel: str, bandwidth: float) -> None:
+    """Helper funtion to set widget parameters
+    
+    @param kernel: kernel
+    @param bandwidth: bandwidth
+    """
+    global ker, bdw
+
+    ker = kernel
+    bdw = bandwidth
+
kde = fit_kde(ker, bdw, X_train)
+visualize_kde(kde, bdw, X_train, y_train)

Bandwidth Selection

The bandwidth is the most important parameter of a KDE model. A wrongly adjusted value will lead to over- or under-smoothing of the density curve.

+

A common method to select a bandwidth is maximum log-likelihood cross validation.
$$h_{\textrm{llcv}} = \arg\max_{h}\frac{1}{k}\sum_{i=1}^k\sum_{y\in D_i}\log\left(\frac{k}{N(k-1)}\sum_{x\in D_{-i}}K_h(x, y)\right)$$
where $D_{-i}$ is the data without the $i$th cross-validation fold $D_i$.
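As a sketch of this selection rule with off-the-shelf tools (the grid of candidate bandwidths is a made-up choice): scikit-learn's GridSearchCV can be used directly, since KernelDensity.score returns the log-likelihood of held-out data.

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

X = np.random.randn(200, 2)                       # hypothetical training sample

grid = GridSearchCV(
    KernelDensity(kernel='gaussian'),
    {'bandwidth': np.linspace(0.1, 2.0, 20)},     # candidate bandwidths
    cv=5,                                         # score = log-likelihood of the held-out fold
)
grid.fit(X)
h_llcv = grid.best_params_['bandwidth']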


Exercises


ex no.1: Noisy sinusoidal

# Generate example
+dists = create_distributions(dim=2)
+
+distribution_with_anomalies = contamination(
+    nominal=dists['Sinusoidal'],
+    anomaly=dists['Blob'],
+    p=0.05
+)
+
+# Train data
+sample_train = dists['Sinusoidal'].sample(500)
+X_train = sample_train[-1].numpy()
+
+# Test data
+sample_test = distribution_with_anomalies.sample(500)
+X_test = sample_test[-1].numpy()
+y_test = sample_test[0].numpy()
+
+scatter = plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test)
+handles, _ = scatter.legend_elements()
+plt.legend(handles, ['Nominal', 'Anomaly'])
+plt.gca().set_aspect('equal')
+plt.show()
+

TODO: Define the search space for the kernel and the bandwidth

param_space = {
+    'kernel': ['gaussian', 'tophat', 'epanechnikov', 'exponential', 'linear', 'cosine'], # Add available kernels
+    'bandwidth': np.linspace(0.1, 10, 100), # Define Search space for bandwidth parameter
+}
+
def hyperopt_by_score(X_train: np.array, param_space: dict, cv: int=5):
+    """Performs hyperoptimization by score
+    
+    @param X_train: data
+    @param param_space: parameter space
+    @param cv: number of cv folds
+    """
+    kde = KernelDensity()
+
+    search = RandomizedSearchCV(
+        estimator=kde,
+        param_distributions=param_space,
+        n_iter=100,
+        cv=cv,
+        scoring=None  # use the estimator's internal scoring function, i.e. the log-probability of the validation set for KDE
+    )
+
+    search.fit(X_train)
+    return search.best_params_, search.best_estimator_
+

Run the code below to perform hyperparameter optimization.

params, kde = hyperopt_by_score(X_train, param_space)
+
+print('Best parameters:')
+for key in params:
+    print('{}: {}'.format(key, params[key]))
+
+test_scores = -kde.score_samples(X_test)
+test_scores = np.where(test_scores == np.inf, np.max(test_scores[np.isfinite(test_scores)])+1, test_scores)
+
+curves = evaluate(y_test, test_scores)
+
/Users/fariedabuzaid/Projects/tfl-training-practical-anomaly-detection1/tfl-training-practical-anomaly-detection/lib/python3.9/site-packages/sklearn/model_selection/_search.py:922: UserWarning: One or more of the test scores are non-finite: [-583.74541248 -463.01696684 -482.86882001 -499.81601835 -546.99909867
+ -562.2118973  -404.18630801 -441.90964061 -508.5352788           -inf
+ -455.89237528 -474.60064732 -460.72114131 -523.02652677 -599.66745977
+ -469.21060725 -649.46845268 -484.62263455 -534.71782939 -385.81385264
+ -498.35959278 -666.08926837 -402.40788348 -616.13224704 -514.42363587
+ -486.8817315  -496.70674387 -576.02539342 -519.26203456          -inf
+ -476.17747255          -inf -566.20708475 -633.59732893 -460.43606272
+ -519.25900889 -631.67434369 -502.48986055 -539.33069415 -611.61260259
+ -416.6039647  -406.4145171  -466.23139723 -485.14283562 -490.13677541
+ -637.14220215 -559.25833087 -507.88071232 -486.91719244 -562.75298394
+ -414.54025058 -420.71904273 -385.56190503 -503.34657184 -526.01859062
+ -519.97321291 -404.56775515 -532.78639414 -411.43385054 -486.28358272
+ -606.08910552 -581.33532785 -405.37961147 -510.35741871 -667.85545766
+ -593.96240838 -624.98459821 -498.05695285 -587.68751473 -523.85639014
+ -535.3897791  -587.05218198 -492.07924826 -608.63368405 -500.8619376
+          -inf -596.29256226 -485.75889656 -481.95324848 -431.57291947
+ -505.76999431 -485.82228871 -659.05544596          -inf          -inf
+ -641.6261979  -405.38065655 -613.68889441 -600.98775327 -457.76369778
+ -531.39767679 -497.24276193 -603.57527611 -518.98183963 -484.21261874
+ -453.4834935  -617.71569138          -inf -490.16773265 -510.35538669]
+  warnings.warn(
+/Users/fariedabuzaid/Projects/tfl-training-practical-anomaly-detection1/tfl-training-practical-anomaly-detection/lib/python3.9/site-packages/sklearn/model_selection/_search.py:929: RuntimeWarning: invalid value encountered in subtract
+  array_stds = np.sqrt(np.average((array -
Best parameters:
+kernel: linear
+bandwidth: 1.0
+
visualize_kde(kde, params['bandwidth'], X_test, y_test)
+

Exercise: Isolate anomalies in house prices


You are a company responsible for estimating house prices around Ames, Iowa, specifically around the College area. But there is a problem: houses from a nearby area, 'Veenker', are often included in your dataset. You want to build an anomaly detection algorithm that filters out, one by one, every point that comes from the wrong neighborhood. You have been able to isolate an X_train dataset which, you are sure, contains only houses from the College area. Following the previous example, test your ability to isolate anomalies in new incoming data (X_test) with KDE.

+

Advanced exercise:
What happens if the contamination comes from other areas? You can choose among the following names:

+

OldTown, Veenker, Edwards, MeadowV, Somerst, NPkVill, BrDale, Gilbert, NridgHt, Sawyer, Blmngtn, Blueste

X_train, X_test, y_test = get_house_prices_data(neighborhood = 'CollgCr', anomaly_neighborhood='Veenker')
+X_train
+
     LotArea  SalePrice  OverallCond
0       8461     163990            5
1      10335     204000            5
2       9548     237000            6
3       9245     145000            5
4      15523     133500            6
..       ...        ...          ...
115     9240     287000            5
116    11317     180000            5
117     4426     141000            5
118     9066     230000            5
119     7990     110000            6

120 rows × 3 columns

# Total data
+train_test_data = pd.concat([X_train, X_test], ignore_index=True)
+y_total = [0] * len(X_train) + y_test
+
+fig = px.scatter_3d(train_test_data, x='LotArea', y='OverallCond', z='SalePrice', color=y_total)
+
+fig.show()
+

Solution

# When the data are highly inhomogeneous, like in this case, it is often beneficial
+# to rescale them before applying any anomaly detection or clustering technique.
+scaler = MinMaxScaler()
+X_train_rescaled = scaler.fit_transform(X_train)
+
param_space = {
+    'kernel': ['gaussian', 'tophat', 'epanechnikov', 'exponential', 'linear', 'cosine'], # Add available kernels
+    'bandwidth': np.linspace(0.1, 10, 100), # Define Search space for bandwidth parameter
+}
+params, kde = hyperopt_by_score(X_train_rescaled, param_space)
+
/Users/fariedabuzaid/Projects/tfl-training-practical-anomaly-detection1/tfl-training-practical-anomaly-detection/lib/python3.9/site-packages/sklearn/model_selection/_search.py:922: UserWarning:
+
+One or more of the test scores are non-finite: [ -45.25900425  -42.41544626 -105.24271623  -76.62262594  -58.16553273
+  -52.66553234 -183.01584751 -119.35431748  -98.69187821 -168.03134231
+ -111.03706006  -62.91122917 -234.261542   -168.22031548 -220.86820955
+  -92.32919548 -154.69523369 -227.42928936  -84.284484   -142.67107811
+ -117.16427593 -113.6970155  -227.52633839 -161.95663169  -41.52019748
+ -136.74875435 -152.97004208 -128.4919998    19.20525133          -inf
+  -98.87203905          -inf -159.80490645 -162.17445626  -67.13120683
+ -117.6390077  -140.62181147  -50.67220638 -237.44899903 -197.22483009
+ -123.50498573 -188.99783275 -101.07642949  -72.83784089 -229.7863771
+ -132.07211645 -168.26256671 -230.06793251 -135.31495507 -187.61056982
+ -147.31823309          -inf -146.00497938  -33.29831913 -194.93892381
+  -96.25876027 -178.48444701 -123.12220664  -83.77069893 -199.88529605
+ -170.61800732 -186.74828407 -134.23720459  -35.0511072  -131.81801061
+ -224.69026195 -164.15751703          -inf -217.86235331  -79.81216211
+ -124.69089389  -13.75418293 -192.81244316 -167.46002124  -72.58312108
+ -160.42768007          -inf -229.19908446 -159.69783332 -199.44038951
+ -196.43550278 -135.24648056  -71.43898844 -191.77357892 -177.71452367
+ -153.08130804  -64.75586096 -151.5744935  -104.68544216 -107.00124511
+ -192.57805657    5.68816971 -158.41708204  -30.07922034 -203.15690702
+ -165.74543603 -155.17305524 -243.42948442 -142.66462637 -179.45090448]
+
+/Users/fariedabuzaid/Projects/tfl-training-practical-anomaly-detection1/tfl-training-practical-anomaly-detection/lib/python3.9/site-packages/sklearn/model_selection/_search.py:929: RuntimeWarning:
+
+invalid value encountered in subtract
print('Best parameters:')
+for key in params:
+    print('{}: {}'.format(key, params[key]))
+
+X_test_rescaled = scaler.transform(X_test)
+test_scores = -kde.score_samples(X_test_rescaled)
+test_scores = np.where(test_scores == np.inf, np.max(test_scores[np.isfinite(test_scores)])+1, test_scores)
+curves = evaluate(y_test, test_scores)
+
Best parameters:
+kernel: epanechnikov
+bandwidth: 0.4
+

The Curse of Dimensionality

The flexibility of KDE comes at a price. The dependency on the dimensionality of the data is quite unfavorable.

+

Theorem [Stone, 1982]
Any estimator that is consistent$^*$ for the class of all $k$-fold differentiable pdfs over $\mathbb{R}^d$ has a convergence rate of at most
$$
\frac{1}{n^{\frac{k}{2k+d}}}
$$

$^*$Consistency = for all pdfs $p$ in the class: $\lim_{n\to\infty}|KDE_h(x, D) - p(x)|_\infty = 0$ with probability $1$.
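To get a feeling for how harsh this bound is, the small snippet below prints the exponent $k/(2k+d)$ for $k=2$ (twice differentiable pdfs) in a few dimensions; the required sample size for a target accuracy $\varepsilon$ grows roughly like $\varepsilon^{-(2k+d)/k}$.

# Convergence rate n**(-k/(2k+d)) for k = 2 (twice differentiable pdfs)
k = 2
for d in [1, 2, 5, 10, 20]:
    exponent = k / (2 * k + d)
    print(f"d={d:2d}: error ~ n^(-{exponent:.2f}), i.e. n ~ eps^(-{1 / exponent:.1f}) for accuracy eps")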


Exercise

  • The very slow convergence in high dimensions does not necessarily mean that we will see bad results in high-dimensional anomaly detection with KDE.
  • Especially if the anomalies are very outlying.
  • However, in cases where the contours of the nominal distribution are non-convex, we can run into problems.

We take a look at a higher-dimensional version of our previous data set.

+ +
+
+
+
+
+ +
+
+
dists = create_distributions(dim=3)
+
+distribution_with_anomalies = contamination(
+    nominal=dists['Sinusoidal'],
+    anomaly=dists['Blob'],
+    p=0.01
+)
+
+sample = distribution_with_anomalies.sample(500)
+
+y = sample[0]
+X = sample[-1]
+
fig = px.scatter_3d(x=X[:, 0], y=X[:, 1], z=X[:, 2], color=y)
+fig.show()
+
# Fit KDE on high dimensional examples 
+rocs = []
+auprs = []
+bandwidths = []
+
+param_space = {
+        'kernel': ['gaussian'],
+        'bandwidth': np.linspace(0.1, 100, 1000), # Define Search space for bandwidth parameter
+    }
+
+kdes = {}
+dims = np.arange(2,16)
+for d in tqdm(dims):
+    # Generate d dimensional distributions
+    dists = create_distributions(dim=d)
+
+    distribution_with_anomalies = contamination(
+        nominal=dists['Sinusoidal'],
+        anomaly=dists['Blob'],
+        p=0
+    )
+
+    # Train on clean data
+    sample_train = dists['Sinusoidal'].sample(500)
+    X_train = sample_train[-1].numpy()
+    # Test data
+    sample_test = distribution_with_anomalies.sample(500)
+    X_test = sample_test[-1].numpy()
+    y_test = sample_test[0].numpy()
+
+    # Optimize bandwidth
+    params, kde = hyperopt_by_score(X_train, param_space)
+    kdes[d] = (params, kde)
+    
+    bandwidths.append(params['bandwidth'])
+
+    test_scores = -kde.score_samples(X_test)
+    test_scores = np.where(test_scores == np.inf, np.max(test_scores[np.isfinite(test_scores)])+1, test_scores)
+
+    
+
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14/14 [00:39<00:00,  2.84s/it]
# Plot cross section of pdf 
+fig, axes = plt.subplots(nrows=2, ncols=7, figsize=(15, 5))
+for d, axis in tqdm(list(zip(kdes, axes.flatten()))):
+    
+    params, kde = kdes[d]
+
+    lin = np.linspace(-10, 10, 50)
+    grid_points = list(it.product(*([[0]]*(d-2)), lin, lin))
+    ys, xs = np.meshgrid(lin, lin)
+    # The score function of sklearn returns log-densities
+    scores = np.exp(kde.score_samples(grid_points)).reshape(50, 50)
+    colormesh = axis.contourf(xs, ys, scores)
+    axis.set_title("Dim = {}".format(d))
+    axis.set_aspect('equal')
+    
+
+# Plot evaluation
+print('Crossection of the KDE at (0,...,0, x, y)')
+plt.show()
+
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14/14 [00:01<00:00,  8.39it/s]
Crossection of the KDE at (0,...,0, x, y)

Robustness

Another drawback of KDE in the context of anomaly detection is that it is not robust against contamination of the data.

Definition
The breakdown point of an estimator is the smallest fraction of observations that need to be changed so that we can move the estimate arbitrarily far away from the true value.


Example: The sample mean has a breakdown point of $0$. Indeed, for a sample of $x_1,\ldots, x_n$ we only need to change a single value in order to move the sample mean in any way we want. That means that the breakdown point is smaller than $\frac{1}{n}$ for every $n\in\mathbb{N}$.


Robust Statistics

There are robust replacements for the sample mean:

+
  • Median of means: Split the dataset into $S$ equally sized subsets $X_1,\ldots, X_S$ and compute $\mathrm{median}(\overline{X_1},\ldots, \overline{X_S})$ (see the sketch below)
  • M-estimation: The mean in a normed vector space is the value that minimizes the sum of squared distances,
    $\overline{X} = \arg\min_{y}\sum_{x\in X}|x-y|^2$.
    M-estimation replaces the quadratic loss with a more robust loss function.
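A minimal sketch of the median-of-means estimator mentioned above (the helper and the toy sample are hypothetical, not part of exercise_tools): a handful of gross outliers can spoil at most a few blocks, so their block means are voted down by the median.

import numpy as np

def median_of_means(x: np.ndarray, n_blocks: int = 10, seed: int = 0) -> float:
    """Median of the means of n_blocks random, equally sized blocks of x."""
    idx = np.random.default_rng(seed).permutation(len(x))
    blocks = np.array_split(idx, n_blocks)
    return float(np.median([x[b].mean() for b in blocks]))

x = np.concatenate([np.random.normal(0, 1, 500), [1e6] * 5])   # contaminated sample
print(x.mean(), median_of_means(x))   # the plain mean explodes, the MoM estimate stays near 0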

Huber loss

Switch from a quadratic to a linear loss at a prescribed threshold.

import numpy as np
+
+
+def huber(error: float, threshold: float):
+    """Huber loss
+    
+    @param error: base error
+    @param threshold: threshold for linear transition
+    """
+    test = (np.abs(error) <= threshold)
+    return (test * (error**2)/2) + ((1-test)*threshold*(np.abs(error) - threshold/2))
+
+x = np.linspace(-5, 5)
+y = huber(x, 1)
+
+plt.plot(x, y)
+plt.gca().set_title("Huber Loss")
+plt.show()
+

Hampel loss

A more complex loss function that depends on three parameters $0 < a < b < r$.

import numpy as np
+
+def single_point_hampel(error: float, a: float, b: float, r: float):
+    """Hampel loss
+    
+    @param error: base error
+    @param a: 1st threshold parameter
+    @param b: 2nd threshold parameter
+    @param r: 3rd threshold parameter
+    """
+    if abs(error) <= a:
+        return error**2/2
+    elif a < abs(error) <= b:
+        return (1/2 *a**2 + a* (abs(error)-a))
+    elif  b < abs(error) <= r:
+        return a * (2*b-a+(abs(error)-b) * (1+ (r-abs(error))/(r-b)))/2
+    else:
+        return a*(b-a+r)/2
+
+hampel = np.vectorize(single_point_hampel)
+
+x = np.linspace(-10.1, 10.1)
+y = hampel(x, a=1.5, b=3.5, r=8)
+
+plt.plot(x, y)
+plt.gca().set_title("Hampel Loss")
+plt.show()
+

KDE is a Mean

+

Kernel as scalar product:

+
  • Let $K$ be a radial monotonic$^\ast$ kernel over $\mathbb{R}^n$.
  • For $x\in\mathbb{R}^n$ let $\phi_x = K(\cdot, x)$.
  • Vector space over the linear span of $\{\phi_x \mid x\in\mathbb{R}^n\}$:
    • Pointwise addition and scalar multiplication.
  • Define the scalar product $\langle \phi_x, \phi_y\rangle = K(x,y)$.
  • Advantage: The scalar product is computable.
  • Call this the reproducing kernel Hilbert space (RKHS) of $K$.
  • $\mathrm{KDE}_h(\cdot, D) = \frac{1}{N}\sum_{i=1}^N K_h(\cdot, x_i) = \frac{1}{N}\sum_{i=1}^N\phi_{x_i}$
    • where $K_h(x,y) = \frac{1}{h}K\left(\frac{|x-y|}{h}\right)$

$^*$All kernels that we have seen are radial and monotonic


Exercise

We compare the performance of different approaches to recovering the nominal distribution under contamination. Here, we use code by Humbert et al. to replicate the results in the referenced paper on median-of-means KDE. More details on rKDE can be found in this paper by Kim and Scott.

# =======================================================
+#   Parameters
+# =======================================================
+algos = [
+    'kde',
+    'mom-kde', # Median-of-Means
+    'rkde-huber', # robust KDE with huber loss
+    'rkde-hampel', # robust KDE with hampel loss
+]
+
+dataset = 'house-prices'
+dataset_options = {'neighborhood': 'CollgCr', 'anomaly_neighborhood': 'Edwards'}
+
+outlierprop_range = [0.01, 0.02, 0.03, 0.05, 0.07, 0.1, 0.2, 0.3, 0.4, 0.5]
+kernel = 'gaussian'
+
auc_scores = perform_rkde_experiment(
+    algos,
+    dataset,
+    dataset_options,
+    outlierprop_range,
+    kernel,
+)
+
Dataset:  house-prices
+Outlier prop: 0.01 (1 / 10)
+downsample outliers
+Finding best bandwidth...
+Algo:  kde
+Algo:  mom-kde
+Algo:  rkde-huber
+Stop at 4 iterations
+Stop at 2 iterations
+Algo:  rkde-hampel
+Stop at 4 iterations
+Stop at 100 iterations
+
+Outlier prop: 0.02 (2 / 10)
+downsample outliers
+Finding best bandwidth...
+Algo:  kde
+Algo:  mom-kde
+Algo:  rkde-huber
+Stop at 4 iterations
+Stop at 2 iterations
+Algo:  rkde-hampel
+Stop at 4 iterations
+Stop at 10 iterations
+
+Outlier prop: 0.03 (3 / 10)
+downsample outliers
+Finding best bandwidth...
+Algo:  kde
+Algo:  mom-kde
+Algo:  rkde-huber
+Stop at 3 iterations
+Stop at 2 iterations
+Algo:  rkde-hampel
+Stop at 3 iterations
+Stop at 100 iterations
+
+Outlier prop: 0.05 (4 / 10)
+downsample outliers
+Finding best bandwidth...
+Algo:  kde
+Algo:  mom-kde
+Algo:  rkde-huber
+Stop at 5 iterations
+Stop at 3 iterations
+Algo:  rkde-hampel
+Stop at 5 iterations
+Stop at 13 iterations
+
+Outlier prop: 0.07 (5 / 10)
+downsample outliers
+Finding best bandwidth...
+Algo:  kde
+Algo:  mom-kde
+Algo:  rkde-huber
+Stop at 4 iterations
+Stop at 2 iterations
+Algo:  rkde-hampel
+Stop at 4 iterations
+Stop at 100 iterations
+
+Outlier prop: 0.1 (6 / 10)
+downsample outliers
+Finding best bandwidth...
+Algo:  kde
+Algo:  mom-kde
+Algo:  rkde-huber
+Stop at 4 iterations
+Stop at 3 iterations
+Algo:  rkde-hampel
+Stop at 4 iterations
+Stop at 100 iterations
+
+Outlier prop: 0.2 (7 / 10)
+downsample outliers
+Finding best bandwidth...
+Algo:  kde
+Algo:  mom-kde
+Algo:  rkde-huber
+Stop at 5 iterations
+Stop at 3 iterations
+Algo:  rkde-hampel
+Stop at 5 iterations
+Stop at 100 iterations
+
+Outlier prop: 0.3 (8 / 10)
+downsample outliers
+Finding best bandwidth...
+Algo:  kde
+Algo:  mom-kde
+Algo:  rkde-huber
+Stop at 4 iterations
+Stop at 3 iterations
+Algo:  rkde-hampel
+Stop at 4 iterations
+Stop at 15 iterations
+
+Outlier prop: 0.4 (9 / 10)
+downsample outliers
+Finding best bandwidth...
+Algo:  kde
+Algo:  mom-kde
+Algo:  rkde-huber
+Stop at 4 iterations
+Stop at 2 iterations
+Algo:  rkde-hampel
+Stop at 4 iterations
+Stop at 100 iterations
+
+Outlier prop: 0.5 (10 / 10)
+downsample inliers
+Finding best bandwidth...
+Algo:  kde
+Algo:  mom-kde
+Algo:  rkde-huber
+Stop at 3 iterations
+Stop at 2 iterations
+Algo:  rkde-hampel
+Stop at 3 iterations
+Stop at 100 iterations
fig, ax = plt.subplots(figsize=(7, 5))
+for algo, algo_data in auc_scores.groupby('algo'):
+    x = algo_data.groupby('outlier_prop').mean().index
+    y = algo_data.groupby('outlier_prop').mean()['auc_anomaly']
+    ax.plot(x, y, 'o-', label=algo)
+plt.legend()
+plt.xlabel('outlier_prop')
+plt.ylabel('auc_score')
+plt.title('Comparison of rKDE against contamination')
+
Text(0.5, 1.0, 'Comparison of rKDE against contamination')

Try using different neighborhoods for contamination. Which robust KDE algorithm performs better overall? Choose among the following options:

+

OldTown, Veenker, Edwards, MeadowV, Somerst, NPkVill, BrDale, Gilbert, NridgHt, Sawyer, Blmngtn, Blueste

+

You can also change the kernel type: gaussian, tophat, epanechnikov, exponential, linear, or cosine.


Summary

  • Kernel density estimation is a non-parametric method to estimate a pdf from a sample.
  • The bandwidth is the most important parameter.
  • Converges to the true pdf as $n\to\infty$.
    • The convergence rate depends exponentially on the dimension.
  • KDE is sensitive to contamination:
    • In a contaminated setting one can employ methods from robust statistics to obtain robust estimates.

Implementations


Snow

diff --git a/docs/_static/intro_anomaly_detection.html b/docs/_static/nb_03_anomaly_detection_via_isolation.html
similarity index 78%
rename from docs/_static/intro_anomaly_detection.html
rename to docs/_static/nb_03_anomaly_detection_via_isolation.html
index c40796b..271b64f 100644
--- a/docs/_static/intro_anomaly_detection.html
+++ b/docs/_static/nb_03_anomaly_detection_via_isolation.html
@@ -3,7 +3,7 @@
-intro_anomaly_detection
+nb_03_anomaly_detection_via_isolation

%%capture
-%set_random_seed 12

$\newcommand{\vect}[1]{{\mathbf{\boldsymbol{#1}} }}$ +$\newcommand{\amax}{{\text{argmax}}}$ +$\newcommand{\P}{{\mathbb{P}}}$ +$\newcommand{\E}{{\mathbb{E}}}$ +$\newcommand{\R}{{\mathbb{R}}}$ +$\newcommand{\Z}{{\mathbb{Z}}}$ +$\newcommand{\N}{{\mathbb{N}}}$ +$\newcommand{\C}{{\mathbb{C}}}$ +$\newcommand{\abs}[1]{{ \left| #1 \right| }}$ +$\newcommand{\simpl}[1]{{\Delta^{#1} }}$

%load_latex_macros

Anomaly Detection via Isolation

Snow

-
-
@@ -13153,17 +13302,34 @@
import numpy as np
-
+import itertools as it
+from tqdm import tqdm
 
 import matplotlib
 from matplotlib import pyplot as plt
-from matplotlib.patches import Ellipse
+import plotly.express as px
+import pandas as pd
+
+import ipywidgets as widgets
 
-from tfl_training_anomaly_detection.exercise_tools import evaluate, visualize_mahalanobis
+from tfl_training_anomaly_detection.exercise_tools import evaluate, get_kdd_data, get_house_prices_data, create_distributions, contamination, \
+perform_rkde_experiment, get_mnist_data
 
 from ipywidgets import interact
 
-from sklearn.metrics import f1_score, precision_score, recall_score
+from sklearn.metrics import roc_auc_score, average_precision_score
+from sklearn.model_selection import RandomizedSearchCV
+from sklearn.preprocessing import MinMaxScaler
+from sklearn.preprocessing import LabelBinarizer
+from sklearn.ensemble import IsolationForest
+from sklearn import metrics
+from sklearn.model_selection import train_test_split
+from sklearn.decomposition import PCA
+from sklearn.neighbors import KernelDensity
+
+from tfl_training_anomaly_detection.vae import VAE, build_decoder_mnist, build_encoder_minst, build_contaminated_minst
+
+from tensorflow import keras
 
 %matplotlib inline
 matplotlib.rcParams['figure.figsize'] = (5, 5)
@@ -13176,450 +13342,706 @@
 
-

Snow

-
Anomaly Detection
+

Anomaly Detection via Isolation

Idea: An anomaly should allow "simple" descriptions that distinguish it from the rest of the data.

  • Descriptions: Conjunction of single attribute tests, i.e. $X_i \leq c$ or $X_i > c$.
  • Example: $X_1 \leq 1.2 \text{ and } X_5 > -3.4 \text{ and } X_7 \leq 5.6$.
  • Complexity of description: Number of conjunctions.

Moreover, we assume that a short random description will have a significantly larger chance of isolating an anomaly than isolating any nominal point.

  • Choose random isolating descriptions and compute an anomaly score from the average complexity.
-

Introduction to Anomaly Detection

+

Isolation Tree

Isolation Forest (iForest) implements this idea by generating an ensemble of random decision trees.
Each tree is built as follows:

Input: Data set (subsample) $X$, maximal height $h$

  • Randomly choose a feature $i$ and a split value $s$ (in the range of the data)
  • Recursively build subtrees on $X_L = \{x\in X\mid x_i \leq s\}$ and $X_R = X\setminus X_L$
  • Stop if the remaining data set has $\leq 1$ point or the maximal height is reached
  • Store the test $x_i\leq s$ for inner nodes and $|X|$ for leaf nodes
-

What is an Anomaly?

+

Visualization

Isolation Tree as Partition Diagram

Isolation Depth

Depth of an observation $x$ in an isolation tree is defined as the expected number of tests needed to isolate $x$.

+

Input: Observation $x$

+
  • ${\ell} =$ length of path from root to leaf according to tests
  • ${n} =$ size of remaining data set in leaf node
  • ${c(n)} =$ expected length of a path in a BST with $n$ nodes $= {O}(\log n)$
  • ${h(x)} = \ell + c(n)$
Isolation Depth of Outlier (red) and nominal (blue)
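The sketch below (a hypothetical helper, for illustration only) computes $h(x) = \ell + c(n)$ for a single observation by growing only the random branch that contains $x$, which is all that matters for its path length; the correction $c(n)$ uses the usual harmonic-number approximation.

import numpy as np

EULER_GAMMA = 0.5772156649

def c(n: int) -> float:
    """Approximate expected path length in a BST with n nodes (the c(n) correction term)."""
    return 2 * (np.log(n - 1) + EULER_GAMMA) - 2 * (n - 1) / n if n > 1 else 0.0

def isolation_depth(x, X, height_limit, depth=0, rng=None):
    """h(x) = path length l plus c(n) for the data left in the leaf."""
    rng = np.random.default_rng(0) if rng is None else rng
    n = len(X)
    if n <= 1 or depth >= height_limit:
        return depth + c(n)
    i = rng.integers(X.shape[1])                 # random feature
    lo, hi = X[:, i].min(), X[:, i].max()
    if lo == hi:                                 # cannot split on a constant feature
        return depth + c(n)
    s = rng.uniform(lo, hi)                      # random split value in the range of the data
    side = X[:, i] <= s
    branch = X[side] if x[i] <= s else X[~side]
    return isolation_depth(x, branch, height_limit, depth + 1, rng)

X = np.random.default_rng(1).normal(0, 1, size=(256, 2))
print(isolation_depth(np.array([0.0, 0.0]), X, height_limit=20))   # nominal point: large depth
print(isolation_depth(np.array([8.0, 8.0]), X, height_limit=20))   # outlier: small depth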

Isolation Forest

    +
  • Train $k$ isolation trees on subsamples of size $N$
  • +
-

Anomalies Can be Hard to Detect

- -
+ + + + + + + + +
Isolation depth of nominal point (left) and outlier (right)
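For reference, a minimal usage sketch with scikit-learn's IsolationForest on made-up data; note that score_samples returns the negated anomaly score, so lower values mean more anomalous.

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_nominal = rng.normal(0, 1, size=(500, 2))
X_query = np.array([[0.0, 0.0], [6.0, 6.0]])          # one nominal-looking point, one outlier

forest = IsolationForest(n_estimators=100, max_samples=256, random_state=0)
forest.fit(X_nominal)

print(forest.score_samples(X_query))                   # the outlier gets the lower (more negative) score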
-

Practical Relevance of Anomaly Detection

+

Variants of Isolation Forest

-

Predictive Maintenance

    -
  • Determine condition of in-service equipment
  • -
  • Optimize maintenance cycle
  • -
  • Too frequent inspections cause unnecessary costs and downtime
  • -
  • Too infrequent inspections can lead to failures or even breaking of the equipment
    - -
  • +

    Variant: Random Robust Cut Forest

    New Rule to Choose Split Test:

  • $\ell_i$: length of the $i$th component of the bounding box around the current data set
  • Choose dimension $i$ with probability $\frac{\ell_i}{\sum_j \ell_j}$ (see the sketch below)
  • More robust against "noise dimensions"
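A small sketch of this split rule (the helper below is hypothetical): the split dimension is drawn with probability proportional to the bounding-box side lengths, so nearly constant noise dimensions are rarely chosen.

import numpy as np

def rrcf_split_dimension(X: np.ndarray, rng=None) -> int:
    """Draw the split dimension i with probability l_i / sum_j l_j."""
    rng = np.random.default_rng() if rng is None else rng
    lengths = X.max(axis=0) - X.min(axis=0)        # bounding-box side lengths l_i
    return int(rng.choice(X.shape[1], p=lengths / lengths.sum()))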
    -

    Anomaly Detection: Sensory data can provide valuable information about the condition of the component. Increasingly -abnormal readings may indicate a wear of the equipment.

-

Fraud Detection

    -
  • Identify fraudulent transactions, e.g. credit card
  • -
  • Prevent criminal activities
  • -
  • Avoid financial or other damages for the involved parties
  • -
- -

Anomaly Detection: Fraudulent transactions can often be identified through unusual destinations, amounts, -or network topology (over several transactions).

- + +
-

Intrusion Detection

    -
  • Detect attacks against a network
  • -
  • Protect nodes against unauthorized access
  • +

    Variant: Extended Isolation Forest

    New split criterion:

  • Uniformly choose a normal and an orthogonal hyperplane through the data (see the sketch below)
  • Removes a bias that was empirically observed when plotting the outlier score of iForest on low-dimensional data sets
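A sketch of such a split (hypothetical helper): draw a random direction on the unit sphere and a random offset inside the bounding box, and partition the data by the resulting hyperplane instead of an axis-parallel cut.

import numpy as np

def random_hyperplane_split(X: np.ndarray, rng=None) -> np.ndarray:
    """Boolean mask of the points on the 'left' side of a random hyperplane through the data."""
    rng = np.random.default_rng() if rng is None else rng
    normal = rng.normal(size=X.shape[1])
    normal /= np.linalg.norm(normal)                      # uniform direction on the sphere
    offset = rng.uniform(X.min(axis=0), X.max(axis=0))    # point inside the bounding box
    return (X - offset) @ normal <= 0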
    - -

Anomaly Detection
Malicious connections can leave unusual footprints, e.g., used protocol, ports, number of packets, IP, duration, etc.

    - + +
-

Relevance of Unsupervised Machine Learning in AD

    -
  • Due to the difficulty of identifying anomalies one often has no labeled data available
  • -
  • Even if labels are available, anomalies are rare and the data sets are heavily imbalanced
  • -
  • Often, we don't want to restrict the system to anomalies that we have encountered in the past
  • -
  • The information that is available heavily influences the applicable techniques:
      -
    • Is the distribution of nominal data known?
    • -
    • Is there clean data (without anomalies) for training?
    • -
    • Do we have labeled anomalies for evaluation?
    • -
    • How large is the proportion of anomalies?
    • -
    • How much noise is in the data?
    • -
    -
  • -
+

Exercise: Network Security

In the final exercise of today you will have to develop an anomaly detection system for network traffic.

+

Briefing

A large e-commerce company A is experiencing downtime due to attacks on their infrastructure. You were instructed to develop a system that can detect malicious connections to the infrastructure. It is planned that suspicious clients will be banned.

+

Another data science team already prepared the connection data of the last year for you. They also separated a test set and manually identified and labeled attacks in that data.

+

The Data

We will work on a version of the classic KDD99 data set.

+

Kddcup 99 Data Set

======================

+

The KDD Cup '99 dataset was created by processing the tcp dump portions of the 1998 DARPA Intrusion Detection System (IDS) Evaluation dataset, created by MIT Lincoln Lab [1]. The artificial data (described on the dataset's homepage <https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html>_) was generated using a closed network and hand-injected attacks to produce a large number of different types of attack with normal activity in the background.

-
-
-
-
-
-

Question

Where do you think you can benefit from anomaly detection?

+
=========================   ====================================================
+Samples total               976158
+Dimensionality              41
+Features                    string (str), discrete (int), continuous (float)
+Targets                     str, 'normal.' or name of the anomaly type
+Proportion of Anomalies     1%
+=========================   ====================================================
+
+
+
+

Task

You will have to develop the system on your own. In particular, you will have to

    -
  • Which problem do you want to solve?
  • -
  • How does it translate into an anomaly detection problem?
  • -
  • What data is available (dimensionality, time dependence, $\ldots$)?
      -
    • Clean data (without anomalies) available?
    • -
    • Labeled anomalies available?
    • -
    • Proportion of outliers?
    • -
    -
  • +
  • Explore the data.
  • +
  • Choose an algorithm.
  • +
  • Find a good detection threshold.
  • +
  • Evaluate and summarize your results.
  • +
  • Estimate how much A could save through the use of your system.
-
-
-

Contamination Framework

-
    -
  • Unsupervised Scenario
  • -
  • Two distributions:
      -
    • $F_0$ generates normal points
    • -
    • $F_1$ generates anomalies
    • -
    • $p$ relative frequency of $F_1$
    • -
    -
  • -
  • Data set $D \stackrel{\text{IID}}{\sim} F=(1-p)F_0 + pF_1$
  • -
-

Task: Estimate if given $x$ is anomalous

-

Assumptions:

-
    -
  • Few: $p \ll 1/2$
  • -
  • Outlying: $F_0$ and $F_1$ do not overlap too much
  • -
  • Sparse: $F_1$ is less clustered than $F_0$
  • -
+
+
+ +
+
+
X_train,X_test,y_test = get_kdd_data()
+
+
+
-

Does the Contamination Framework Always Apply?

+

Explore Data

-
-
-

No!

-
    -
  • We might have clean data without anomalies available for training
  • -
  • In an adversarial scenario, like fraud detection, the opponent might change her behavior over time to evade detection -$\Rightarrow$ $F_1$ might not be well-defined
  • -
  • The degree to which the three assumptions are true can vary for each specific problem
  • -
  • Some assumptions might even be false in some scenarios
  • -
+
+
-
-
-
-
-
-

Evaluation Metrics

    -
  • Accuracy is not a good measure in anomaly detection:
      -
    • $1\%$ anomalies $\Rightarrow$ always predicting nominal gives $99\%$ accuracy!
    • -
    -
  • -
  • Better measures are precision, recall and $F_1$
  • -
  • The confusion matrix divides a test set according to the predictions and ground truth
  • -
- - - - - - - - - - - - - - - - - - - -
Confusion Matrix    | Actual Nominal       | Actual Anomaly
Predicted Nominal   | True Negative (TN)   | False Negative (FN)
Predicted Anomaly   | False Positive (FP)  | True Positive (TP)
+
+
+
#
+# Add your exploration code
+#
+X_train = pd.DataFrame(X_train)
+X_test = pd.DataFrame(X_test)
+
+
-
-
-
-

Precision, Recall

    -
  • Precision is defined as
  • -
-$$\frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}$$

It estimates the probability that an observation really is anomalous given that the detection system predicted it to be.

-
    -
  • Recall is defined as
  • -
-$$\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}$$

It estimates the probability that an observation will be predicted to be anomalous given that it really is.

-
-
-
-
-

$F_1$ Score

    -
  • $F_1$ is defined as the harmonic mean of precision and recall
  • -
-$$2\cdot \frac{\text{Precision}\cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

-

It balances between precision and recall.

+
+
-
-
-
-
-
-

Evaluating Thresholds

    -
  • Most anomaly detection algorithms output an anomaly score where higher values mean more anomalous.
  • -
  • We need to set a decision threshold $\tau$ in order to compute precision, recall and $F_1$. -
  • -
  • The precision-recall (PR) curve plots the pairs
  • -
-$$\{(\text{Recall}(\tau), \text{Precision}(\tau)) \mid \tau_{\text{min}} \leq \tau \leq \tau_{\text{max}}\}$$
    -
  • The receiver-operator-characteristics curve (ROC) plots the true positive rate (TPR) against the false positive rate (FPR) for the possible thresholds
      -
    • $\mathrm{TPR(\tau)} = \frac{\mathrm{TP(\tau)}}{\mathrm{TP(\tau)}+\mathrm{FN(\tau)}}$
    • -
    • $\mathrm{FPR(\tau)} = \frac{\mathrm{FP(\tau)}}{\mathrm{FP}(\tau)+\mathrm{TN}(\tau)}$
    • -
    -
  • -
+
+
+
# get description
+X_train.describe()
+
+
-
-
-
-

Cost Matrix

Choosing the optimal threshold does not only depend on the values of our metrics but also on the cost associated with the confusion matrix. Similarly to precision, recall, etc., the confusion matrix has to be understood as a function of the threshold: for each threshold we obtain different numbers of true positives, false positives, and so on. The associated costs, e.g. for the false positives, are the expected costs that a falsely positive prediction generates (profits are represented as negative costs). Unlike the confusion matrix, the cost matrix does not depend on $\tau$.

Cost Matrix         | Actual Nominal     | Actual Anomaly
Predicted Nominal   | Cost of TN (CTN)   | Cost of FN (CFN)
Predicted Anomaly   | Cost of FP (CFP)   | Cost of TP (CTP)
-

Our goal is to set the threshold such that the expected empirical costs will be minimized

$$
\begin{align*}
\tau_{\text{opt}}=\arg\min_\tau \frac{\mathrm{TP}(\tau)\cdot \mathrm{CTP} + \mathrm{FP}(\tau)\cdot \mathrm{CFP} + \mathrm{TN}(\tau)\cdot \mathrm{CTN} + \mathrm{FN}(\tau)\cdot \mathrm{CFN}}{N}
\end{align*}
$$

Where $N$ is the total number of samples.
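A minimal sketch of this cost-based threshold selection (the helper and the cost values are made up): evaluate the empirical cost for every candidate threshold and keep the cheapest one.

import numpy as np

def optimal_threshold(scores, y_true, costs, thresholds):
    """Return the threshold tau minimizing (TP*CTP + FP*CFP + TN*CTN + FN*CFN) / N."""
    y_true = np.asarray(y_true).astype(bool)
    best_tau, best_cost = None, np.inf
    for tau in thresholds:
        y_pred = np.asarray(scores) > tau
        tp = np.sum(y_pred & y_true)
        fp = np.sum(y_pred & ~y_true)
        fn = np.sum(~y_pred & y_true)
        tn = np.sum(~y_pred & ~y_true)
        cost = (tp * costs['CTP'] + fp * costs['CFP'] + tn * costs['CTN'] + fn * costs['CFN']) / len(y_true)
        if cost < best_cost:
            best_tau, best_cost = tau, cost
    return best_tau

# Example: a missed anomaly (false negative) is much more expensive than a false alarm
costs = {'CTP': 0.0, 'CFP': 1.0, 'CTN': 0.0, 'CFN': 10.0}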

-
-
-
-
-
-

Our First Anomaly Detection Approach

Let's have a look at a simple probabilistic anomaly score.

-
    -
  • If the distribution of nominal data is known then we can use $-\log p(x)$, also known as the surprise.
  • -
  • If only the covariance $\Sigma$ and the mean $\mu$ are known, the Mahalanobis distance to the mean, $\sqrt{(x-\mu)^T\Sigma^{-1}(x-\mu)}$, can be used.
      -
    • Only applicable if nominal distribution is unimodally centered around the mean.
    • -
    • Extension for mixture models with means $\mu_1,\ldots,\mu_k$ and covariance matrices $\Sigma_1,\ldots,\Sigma_k$: $\min_{1\leq i \leq k}\sqrt{(x-\mu_i)^T\Sigma_i^{-1}(x-\mu_i)}$
    • -
    -
  • -
+
+
-
-
-
-
-
-

Motivation of Mahalanobis Distance

    -
  • The Mahalanobis distance is motivated by the surprise of a Gaussian:
\begin{align*}
-\log p(x) &= -\log \frac{1}{(2\pi)^{\frac{m}{2}}\sqrt{|\det(\Sigma)|}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)\\
&= \frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu) + c
\end{align*}
  • Since monotone transformations (such as $\sqrt{\cdot}$ or adding a constant) do not change the outlier ranking, this is equivalent to the Mahalanobis distance
+
+ + +
[Output of X_train.describe(): count, unique, top, and freq for all 41 columns; count = 80524 for every column]

4 rows × 41 columns

+
-
-
-
-
-

Exercise

Try the outlier scores for yourself in a simple synthetic scenario. We have prepared the function evaluate for you. Try to find the optimal threshold for the dataset.

+
-
nominal = np.random.normal(0, [1, 1.5], size=(300, 2))
-anomaly = np.random.normal(5, 2, size=(10, 2))
-
-data = np.concatenate([nominal, anomaly], axis=0)
-y = np.zeros(310)
-y[-10:] = 1
-
-plt.scatter(data[:, 0], data[:,1], c=y)
-plt.gca().set_aspect('equal')
-plt.show()
+
# get better description
+X_train.drop(columns=[1,2,3]).astype(float).describe()
 
+
+
+ +
+ + +
[Output of X_train.drop(columns=[1,2,3]).astype(float).describe(): count, mean, std, min, 25%, 50%, 75%, max for the 38 numeric columns]

8 rows × 38 columns

+
+
-
-
-

Fit a Gaussian

+
-
mu = data.mean(axis=0)
-Sigma_diag = data.std(axis=0) # assumes independant components
-print('Mean: {}\nStd: {}'.format(mu, Sigma_diag))
+
# Check for NaNs
+print("Number of NaNs: {}".format(X_train.isna().sum().sum()))
 
-
-
-
-

Question

How did the contamination influence the parameter estimation?

+
+
+
+ +
+
Number of NaNs: 0
+
-
-
-
-

Compute scores and evaluate

+
-
# Mahalanobis distance from the mean of N(mu, Sigma)
-scores = np.sqrt(((data - mu) * (1/Sigma_diag) * (data - mu)).sum(axis=1)) 
-curves = evaluate(y, scores)
+
#
+# Add your preparation code here
+#
+
+# Encode string features
+binarizer = LabelBinarizer()
+one_hots = None
+one_hots_test = None
+for i in [1, 2, 3]:
+    binarizer.fit(X_train[[i]].astype(str))
+    if one_hots is None:
+        one_hots = binarizer.transform(X_train[[i]].astype(str))
+        one_hots_test = binarizer.transform(X_test[[i]].astype(str))
+    else:
+        one_hots = np.concatenate([one_hots, binarizer.transform(X_train[[i]].astype(str))], axis=1)
+        one_hots_test = np.concatenate([one_hots_test, binarizer.transform(X_test[[i]].astype(str))], axis=1)
+
+X_train.drop(columns=[1,2,3], inplace=True)
+X_train_onehot = pd.DataFrame(np.concatenate([X_train.values, one_hots], axis=1))
+
+X_test.drop(columns=[1,2,3], inplace=True)
+X_test_onehot = pd.DataFrame(np.concatenate([X_test.values, one_hots_test], axis=1))
 
-
-
-
-

Choose a threshold

- -
-
-
def visualize_mahalanotis(data, y, scores, mu, sigma_diag, thr):
-    _, axes = plt.subplots(figsize=(6, 6))
-
-    # Visualize Data
-    scatter_gt = axes.scatter(data[:, 0], data[:,1], c=y)
-    plt.scatter(mu[0], mu[1], color='red')
-    axes.set_title('Ground Truth')
-    handles, _ = scatter_gt.legend_elements()
-    axes.legend(handles, ['Nominal', 'Anomaly'])
-    axes.set_aspect('equal')
-    # Draw descicion contour
-    descion_border = Ellipse(
-        mu,
-        width=2*np.sqrt(sigma_diag[0])*thr,
-        height=2*np.sqrt(sigma_diag[1])*thr,
-        color='red',
-        fill=False
-    )
-    axes.add_patch(descion_border)
-    
-    # Evaluate threshold
-    y_pred = scores >  thr
-
-    precision = precision_score(y, y_pred)
-    recall = recall_score(y, y_pred)
-    f1 = f1_score(y, y_pred)
-    
-    axes.set_title("Precision: {}\nRecall: {}\nF1: {}".format(precision, recall, f1))
-    
-    plt.tight_layout()
-    plt.show()
+
# Encode y
+y_test_bin = np.where(y_test == b'normal.', 0, 1)
 
@@ -13632,86 +14054,162 @@

Question

-
thr = None
-
-@interact(threshold=(0., 6.))
-def set_threshold(threshold):
-    global thr
-    thr = threshold
-    plt.show()
+
# Remove suspicious data
+# This step is not strictly necessary but can improve performance
+suspicious = X_train_onehot.apply(lambda col: (col - col.mean()).abs() > 4 * col.std() if col.std() > 1 else False)
+suspicious = suspicious.any(axis=1)  # 4-sigma rule
+print('filtering {} suspicious data points'.format(suspicious.sum()))
+X_train_clean = X_train_onehot[~suspicious]
 
+
+
+ +
+ +
+
filtering 2932 suspicious data points
+
+
+
+ +
+
+ +
+
+
+

Summary

  • Isolation Forest empirically shows very good performance up to relatively high dimensions
  • It is relatively robust against contamination
  • Usually little need for hyperparameter tuning

Implementations

+ +
+
+
+
+
+

Choose Algorithm

+
+
-
visualize_mahalanobis(data, y, scores, mu, Sigma_diag, thr)
+
# TODO: implement proper model selection
+iforest = IsolationForest()
+iforest.fit(X_train_clean)
 
+
+
+ +
+ + + +
+
IsolationForest()
-
-
-

Task: Find optimal threshold and evaluate on test set.

Choose good threshold. You may write additional code to determine the threshold.

+
+
+
-
thr_opt = 3.2 # 
+
# find best threshold
+X_test_onehot, X_val_onehot, y_test_bin, y_val_bin = train_test_split(X_test_onehot, y_test_bin, test_size=.5)
+y_score = -iforest.score_samples(X_val_onehot).reshape(-1)
 
+
+
+
+

Evaluate Solution

+
+
-
data_test = np.concatenate([np.random.normal(0, [1, 1.5], size=(300, 2)), np.random.normal(3, 1.5, size=(10, 2))])
+
#
+# Insert evaluation code
+#
 
-y_test = np.zeros(data_test.shape[0])
-y_test[-10:] = 1
+# calculate scores if any anomaly is present
+if np.any(y_val_bin == 1):
+    eval = evaluate(y_val_bin, y_score)
+    prec, rec, thr = eval['PR']
+    f1s = 2 * (prec * rec)/(prec + rec)
+    threshold = thr[np.argmax(f1s)]
 
-scores_test = np.sqrt(((data_test - mu) * (1/Sigma_diag) * (data_test - mu)).sum(axis=1)) 
+    y_score = -iforest.score_samples(X_test_onehot).reshape(-1)
+    y_pred = np.where(y_score < threshold, 0, 1)
 
-visualize_mahalanotis(data_test, y_test, scores_test, mu, Sigma_diag, thr_opt)
+    print('Precision: {}'.format(metrics.precision_score(y_test_bin, y_pred)))
+    print('Recall: {}'.format(metrics.recall_score(y_test_bin, y_pred)))
+    print('F1: {}'.format(metrics.f1_score(y_test_bin, y_pred)))
 
+
+
+ +
+ + + +
+ +
+ +
+ +
+ +
+
Precision: 0.9413489736070382
+Recall: 0.9846625766871165
+F1: 0.9625187406296852
+
+
-
-
-

Summary

    -
  • Anomalies are patterns in data that do not conform to a well defined notion of normal behavior.
  • -
  • Detecting anomalies can be very valuable in a broad spectrum of industry sectors and company divisions.
  • -
  • Anomaly detection uses mostly unsupervised techniques.
  • -
  • Outlier scores measure the degree of outlyingness.
  • -
  • If some statistical properties of the nominal distribution are known then the surprise or the Mahalanobis distance can be used as an outlier score.
  • -
  • Evaluation metrics: precision, recall, $F_1$, ROC (AUC), PR (AUC).
  • -
+
@@ -13719,6 +14217,19 @@

Summary

+
+
+
+ +
+
+
 
+
+ +
+
+
+
diff --git a/docs/_static/anomaly_detection_approaches.html b/docs/_static/nb_04_anomaly_detection_via_reconstruction.html
similarity index 63%
rename from docs/_static/anomaly_detection_approaches.html
rename to docs/_static/nb_04_anomaly_detection_via_reconstruction.html
index af2d0f3..8191009 100644
--- a/docs/_static/anomaly_detection_approaches.html
+++ b/docs/_static/nb_04_anomaly_detection_via_reconstruction.html
@@ -3,7 +3,7 @@
-anomaly_detection_approaches
- -
- -
-
- -
-
-
- -
-
-
kde = fit_kde(ker, bdw, X_train)
-visualize_kde(kde, bdw, X_train, y_train)
-
- -
-
-
- -
-
- -
- - - -
- -
- -
- -
-
- -
-
-
-

Bandwidth Selection

The bandwidth is the most important parameter of a KDE model. A wrongly adjusted value will lead to over- or -under-smoothing of the density curve.

-

A common method to select a bandwidth is maximum log-likelihood cross validation. -$$h_{\textrm{llcv}} = \arg\max_{h}\frac{1}{k}\sum_{i=1}^k\sum_{y\in D_i}\log\left(\frac{k}{N(k-1)}\sum_{x\in D_{-i}}K_h(x, y)\right)$$ -where $D_{-i}$ is the data without the $i$th cross validation fold $D_i$.

- -
-
-
-
-
-

Exercises

-
-
-
-
-
-

ex no.1: Noisy sinusoidal

- -
-
-
-
-
- -
-
-
# Generate example
-dists = create_distributions(dim=2)
-
-distribution_with_anomalies = contamination(
-    nominal=dists['Sinusoidal'],
-    anomaly=dists['Blob'],
-    p=0.05
-)
-
-# Train data
-sample_train = dists['Sinusoidal'].sample(500)
-X_train = sample_train[-1].numpy()
-
-# Test data
-sample_test = distribution_with_anomalies.sample(500)
-X_test = sample_test[-1].numpy()
-y_test = sample_test[0].numpy()
-
-scatter = plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test)
-handels, _ = scatter.legend_elements()
-plt.legend(handels, ['Nominal', 'Anomaly'])
-plt.gca().set_aspect('equal')
-plt.show()
-
- -
-
-
- -
-
- -
- - - -
- -
- -
- -
-
- -
-
-
-

TODO: Define the search space for the kernel and the bandwidth

-
-
-
-
-
- -
-
-
param_space = {
-    'kernel': ['gaussian', 'tophat', 'epanechnikov', 'exponential', 'linear', 'cosine'], # Add available kernels
-    'bandwidth': np.linspace(0.1, 10, 100), # Define Search space for bandwidth parameter
-}
-
- -
-
-
- -
-
-
- -
-
-
def hyperopt_by_score(X_train: np.array, param_space: dict, cv: int=5):
-    """Performs hyperoptimization by score
-    
-    @param X_train: data
-    @param param_space: parameter space
-    @param cv: number of cv folds
-    """
-    kde = KernelDensity()
-
-    search = RandomizedSearchCV(
-        estimator=kde,
-        param_distributions=param_space,
-        n_iter=100,
-        cv=cv,
-        scoring=None # use estimators internal scoring function, i.e. the log-probability of the validation set for KDE
-    )
-
-    search.fit(X_train)
-    return search.best_params_, search.best_estimator_
-
- -
-
-
- -
-
-
-

Run the code below to perform hyperparameter optimization.

- -
-
-
-
-
- -
-
-
params, kde = hyperopt_by_score(X_train, param_space)
-
-print('Best parameters:')
-for key in params:
-    print('{}: {}'.format(key, params[key]))
-
-test_scores = -kde.score_samples(X_test)
-test_scores = np.where(test_scores == np.inf, np.max(test_scores[np.isfinite(test_scores)])+1, test_scores)
-
-curves = evaluate(y_test, test_scores)
-
- -
-
-
- -
-
- -
- -
-
/Users/fariedabuzaid/Projects/tfl-training-practical-anomaly-detection1/tfl-training-practical-anomaly-detection/lib/python3.9/site-packages/sklearn/model_selection/_search.py:922: UserWarning: One or more of the test scores are non-finite: [-583.74541248 -463.01696684 -482.86882001 -499.81601835 -546.99909867
- -562.2118973  -404.18630801 -441.90964061 -508.5352788           -inf
- -455.89237528 -474.60064732 -460.72114131 -523.02652677 -599.66745977
- -469.21060725 -649.46845268 -484.62263455 -534.71782939 -385.81385264
- -498.35959278 -666.08926837 -402.40788348 -616.13224704 -514.42363587
- -486.8817315  -496.70674387 -576.02539342 -519.26203456          -inf
- -476.17747255          -inf -566.20708475 -633.59732893 -460.43606272
- -519.25900889 -631.67434369 -502.48986055 -539.33069415 -611.61260259
- -416.6039647  -406.4145171  -466.23139723 -485.14283562 -490.13677541
- -637.14220215 -559.25833087 -507.88071232 -486.91719244 -562.75298394
- -414.54025058 -420.71904273 -385.56190503 -503.34657184 -526.01859062
- -519.97321291 -404.56775515 -532.78639414 -411.43385054 -486.28358272
- -606.08910552 -581.33532785 -405.37961147 -510.35741871 -667.85545766
- -593.96240838 -624.98459821 -498.05695285 -587.68751473 -523.85639014
- -535.3897791  -587.05218198 -492.07924826 -608.63368405 -500.8619376
-          -inf -596.29256226 -485.75889656 -481.95324848 -431.57291947
- -505.76999431 -485.82228871 -659.05544596          -inf          -inf
- -641.6261979  -405.38065655 -613.68889441 -600.98775327 -457.76369778
- -531.39767679 -497.24276193 -603.57527611 -518.98183963 -484.21261874
- -453.4834935  -617.71569138          -inf -490.16773265 -510.35538669]
-  warnings.warn(
-/Users/fariedabuzaid/Projects/tfl-training-practical-anomaly-detection1/tfl-training-practical-anomaly-detection/lib/python3.9/site-packages/sklearn/model_selection/_search.py:929: RuntimeWarning: invalid value encountered in subtract
-  array_stds = np.sqrt(np.average((array -
-
-
-
- -
- -
-
Best parameters:
-kernel: linear
-bandwidth: 1.0
-
-
-
- -
- - - -
- -
- -
- -
-
- -
-
-
- -
-
-
visualize_kde(kde, params['bandwidth'], X_test, y_test)
-
- -
-
-
- -
-
- -
- - - -
- -
- -
- -
-
- -
-
-
-

Exercise: Isolate anomalies in house prices

-
-
-
-
-
-

You work for a company responsible for estimating house prices around Ames, Iowa, specifically in the college area. But there is a problem: houses from a nearby area, 'Veenker', are often included in your dataset. You want to build an anomaly detection algorithm that filters out, one by one, every point that comes from the wrong neighborhood. You have been able to isolate an X_train dataset which, you are sure, contains only houses from the college area. Following the previous example, test your ability to isolate anomalies in new incoming data (X_test) with KDE.

-

Advanced exercise: -What happens if the contamination comes from other areas? You can choose among the following names:

-

OldTown, Veenker, Edwards, MeadowV, Somerst, NPkVill, BrDale, Gilbert, NridgHt, Sawyer, Blmngtn, Blueste

- -
-
-
-
-
- -
-
-
X_train, X_test, y_test = get_house_prices_data(neighborhood = 'CollgCr', anomaly_neighborhood='Veenker')
-X_train
-
- -
-
-
- -
-
- -
- - -
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
LotAreaSalePriceOverallCond
084611639905
1103352040005
295482370006
392451450005
4155231335006
............
11592402870005
116113171800005
11744261410005
11890662300005
11979901100006
-

120 rows × 3 columns

-
- -
- -
-
- -
-
-
- -
-
-
# Total data
-train_test_data = X_train.append(X_test, ignore_index=True)
-y_total = [0] * len(X_train) + y_test
-
-fig = px.scatter_3d(train_test_data, x='LotArea', y='OverallCond', z='SalePrice', color=y_total)
-
-fig.show()
-
- -
-
-
- -
-
- -
- - -
-
- -
- -
- - -
- -
- -
-
- -
-
-
-

Solution

-
-
-
-
-
- -
-
-
# When data are highly in-homogeneous, like in this case, it is often beneficial 
-# to rescale them before applying any anomaly detection or clustering technique.
-scaler = MinMaxScaler()
-X_train_rescaled = scaler.fit_transform(X_train)
-
- -
-
-
- -
-
-
- -
-
-
param_space = {
-    'kernel': ['gaussian', 'tophat', 'epanechnikov', 'exponential', 'linear', 'cosine'], # Add available kernels
-    'bandwidth': np.linspace(0.1, 10, 100), # Define Search space for bandwidth parameter
-}
-params, kde = hyperopt_by_score(X_train_rescaled, param_space)
-
- -
-
-
- -
-
- -
- -
-
/Users/fariedabuzaid/Projects/tfl-training-practical-anomaly-detection1/tfl-training-practical-anomaly-detection/lib/python3.9/site-packages/sklearn/model_selection/_search.py:922: UserWarning:
-
-One or more of the test scores are non-finite: [ -45.25900425  -42.41544626 -105.24271623  -76.62262594  -58.16553273
-  -52.66553234 -183.01584751 -119.35431748  -98.69187821 -168.03134231
- -111.03706006  -62.91122917 -234.261542   -168.22031548 -220.86820955
-  -92.32919548 -154.69523369 -227.42928936  -84.284484   -142.67107811
- -117.16427593 -113.6970155  -227.52633839 -161.95663169  -41.52019748
- -136.74875435 -152.97004208 -128.4919998    19.20525133          -inf
-  -98.87203905          -inf -159.80490645 -162.17445626  -67.13120683
- -117.6390077  -140.62181147  -50.67220638 -237.44899903 -197.22483009
- -123.50498573 -188.99783275 -101.07642949  -72.83784089 -229.7863771
- -132.07211645 -168.26256671 -230.06793251 -135.31495507 -187.61056982
- -147.31823309          -inf -146.00497938  -33.29831913 -194.93892381
-  -96.25876027 -178.48444701 -123.12220664  -83.77069893 -199.88529605
- -170.61800732 -186.74828407 -134.23720459  -35.0511072  -131.81801061
- -224.69026195 -164.15751703          -inf -217.86235331  -79.81216211
- -124.69089389  -13.75418293 -192.81244316 -167.46002124  -72.58312108
- -160.42768007          -inf -229.19908446 -159.69783332 -199.44038951
- -196.43550278 -135.24648056  -71.43898844 -191.77357892 -177.71452367
- -153.08130804  -64.75586096 -151.5744935  -104.68544216 -107.00124511
- -192.57805657    5.68816971 -158.41708204  -30.07922034 -203.15690702
- -165.74543603 -155.17305524 -243.42948442 -142.66462637 -179.45090448]
-
-/Users/fariedabuzaid/Projects/tfl-training-practical-anomaly-detection1/tfl-training-practical-anomaly-detection/lib/python3.9/site-packages/sklearn/model_selection/_search.py:929: RuntimeWarning:
-
-invalid value encountered in subtract
-
-
-
-
- -
-
- -
-
-
- -
-
-
print('Best parameters:')
-for key in params:
-    print('{}: {}'.format(key, params[key]))
-
-X_test_rescaled = scaler.transform(X_test)
-test_scores = -kde.score_samples(X_test_rescaled)
-test_scores = np.where(test_scores == np.inf, np.max(test_scores[np.isfinite(test_scores)])+1, test_scores)
-curves = evaluate(y_test, test_scores)
-
- -
-
-
- -
-
- -
- -
-
Best parameters:
-kernel: epanechnikov
-bandwidth: 0.4
-
-
-
- -
- - - -
- -
- -
- -
-
- -
-
-
-

The Curse of Dimensionality

The flexibility of KDE comes at a price: its dependence on the dimensionality of the data is quite unfavorable.

-
-

Theorem [Stone, 1982]: Any estimator that is consistent$^*$ for the class of all $k$-times differentiable pdfs over $\mathbb{R}^d$ has a convergence rate of at most

$$\frac{1}{n^{\frac{k}{2k+d}}}$$
-

$^*$Consistency = for all pdfs $p$ in the class: $\lim_{n\to\infty}\sup_x|\mathrm{KDE}_h(x, D) - p(x)| = 0$ with probability $1$.
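To get a feeling for what this rate means in practice, here is a small back-of-the-envelope calculation (our own illustration, not part of the theorem): the sample size needed to push the bound $n^{-k/(2k+d)}$ below a target error grows very quickly with $d$.

import numpy as np

# error ~ n**(-k/(2k+d))  =>  n = eps**(-(2k+d)/k) for a target error eps
def samples_for_error(eps: float, d: int, k: int = 2) -> float:
    return eps ** (-(2 * k + d) / k)

for d in [1, 2, 5, 10, 20]:
    print(f"d={d:2d}: n ~ {samples_for_error(0.1, d):.1e} samples for error ~ 0.1 (k=2)")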

- -
-
-
-
-
-

Exercise

    -
  • The very slow convergence in high dimensions does not necessarily mean that we will see bad results in high dimensional anomaly detection with KDE.
  • -
  • Especially if the anomalies are very outlying.
  • -
  • However, in cases where the contours of the nominal distribution are non-convex, we can run into problems.
  • -
-

We take a look at a higher dimensional version of our previous data set.

- -
-
-
-
-
- -
-
-
dists = create_distributions(dim=3)
-
-distribution_with_anomalies = contamination(
-    nominal=dists['Sinusoidal'],
-    anomaly=dists['Blob'],
-    p=0.01
-)
-
-sample = distribution_with_anomalies.sample(500)
-
-y = sample[0]
-X = sample[-1]
-
- -
-
-
- -
-
-
- -
-
-
fig = px.scatter_3d(x=X[:, 0], y=X[:, 1], z=X[:, 2], color=y)
-fig.show()
-
- -
-
-
- -
-
- -
- - -
- -
- -
-
- -
-
-
- -
-
-
# Fit KDE on high dimensional examples 
-rocs = []
-auprs = []
-bandwidths = []
-
-param_space = {
-        'kernel': ['gaussian'],
-        'bandwidth': np.linspace(0.1, 100, 1000), # Define Search space for bandwidth parameter
-    }
-
-kdes = {}
-dims = np.arange(2,16)
-for d in tqdm(dims):
-    # Generate d dimensional distributions
-    dists = create_distributions(dim=d)
-
-    distribution_with_anomalies = contamination(
-        nominal=dists['Sinusoidal'],
-        anomaly=dists['Blob'],
-        p=0
-    )
-
-    # Train on clean data
-    sample_train = dists['Sinusoidal'].sample(500)
-    X_train = sample_train[-1].numpy()
-    # Test data
-    sample_test = distribution_with_anomalies.sample(500)
-    X_test = sample_test[-1].numpy()
-    y_test = sample_test[0].numpy()
-
-    # Optimize bandwidth
-    params, kde = hyperopt_by_score(X_train, param_space)
-    kdes[d] = (params, kde)
-    
-    bandwidths.append(params['bandwidth'])
-
-    test_scores = -kde.score_samples(X_test)
-    test_scores = np.where(test_scores == np.inf, np.max(test_scores[np.isfinite(test_scores)])+1, test_scores)
-
-    
-
- -
-
-
- -
-
- -
- -
-
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14/14 [00:39<00:00,  2.81s/it]
-
-
-
- -
-
- -
-
-
- -
-
-
# Plot cross section of pdf 
-fig, axes = plt.subplots(nrows=2, ncols=7, figsize=(15, 5))
-for d, axis in tqdm(list(zip(kdes, axes.flatten()))):
-    
-    params, kde = kdes[d]
-
-    lin = np.linspace(-10, 10, 50)
-    grid_points = list(it.product(*([[0]]*(d-2)), lin, lin))
-    ys, xs = np.meshgrid(lin, lin)
-    # The score function of sklearn returns log-densities
-    scores = np.exp(kde.score_samples(grid_points)).reshape(50, 50)
-    colormesh = axis.contourf(xs, ys, scores)
-    axis.set_title("Dim = {}".format(d))
-    axis.set_aspect('equal')
-    
-
-# Plot evaluation
-print('Crossection of the KDE at (0,...,0, x, y)')
-plt.show()
-
- -
-
-
- -
-
- -
- -
-
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14/14 [00:01<00:00,  7.66it/s]
-
-
-
- -
- -
-
Crossection of the KDE at (0,...,0, x, y)
-
-
-
- -
- - - -
- -
- -
- -
-
- -
-
-
-

Robustness

Another drawback of KDE in the context of anomaly detection is that it is not robust against contamination of the data.

-
-

Definition -The breakdown point of an estimator is the smallest fraction of observations that need to be changed so that we can -move the estimate arbitrarily far away from the true value.

-
- -
-
-
-
-
-

Example: The sample mean has a breakdown point of $0$. Indeed, for a sample of $x_1,\ldots, x_n$ we only need to -change a single value in order to move the sample mean in any way we want. That means that the breakdown point is -smaller than $\frac{1}{n}$ for every $n\in\mathbb{N}$.
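A tiny numerical illustration of this (our own example): corrupting a single observation drags the sample mean arbitrarily far away, while the median barely moves.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)

x_corrupted = x.copy()
x_corrupted[0] = 1e6  # move a single observation far away

print(f"mean:   {x.mean():.3f} -> {x_corrupted.mean():.3f}")            # explodes
print(f"median: {np.median(x):.3f} -> {np.median(x_corrupted):.3f}")    # almost unchanged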

- -
-
-
-
-
-

Robust Statistics

There are robust replacements for the sample mean:

-
    -
  • Median of means: split the dataset into $S$ equally sized subsets $X_1,\ldots, X_S$ and compute $\mathrm{median}(\overline{X_1},\ldots, \overline{X_S})$ (see the sketch after this list).
  • -
  • M-estimation: the mean in a normed vector space is the value that minimizes the sum of squared distances,
    $\overline{X} = \arg\min_{y}\sum_{x\in X}\|x-y\|^2$.
    M-estimation replaces the quadratic loss with a more robust loss function.
  • -
- -
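A minimal sketch of the median-of-means estimator referenced in the list above (plain NumPy, our own illustration):

import numpy as np

def median_of_means(x: np.ndarray, n_blocks: int = 10, seed: int = 0) -> float:
    """Split the sample into n_blocks equally sized subsets and return the median of the per-block means."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    blocks = np.array_split(idx, n_blocks)
    return float(np.median([x[b].mean() for b in blocks]))

# A single corrupted point affects only one block mean,
# so the median over blocks stays close to the true mean.
x = np.random.default_rng(1).normal(size=1000)
x[0] = 1e6
print(x.mean(), median_of_means(x))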
-
-
-
-
-

Huber loss

Switch from quadratic to linear loss at a prescribed threshold.

- -
-
-
-
-
- -
-
-
import numpy as np
-
-
-def huber(error: float, threshold: float):
-    """Huber loss
-    
-    @param error: base error
-    @param threshold: threshold for linear transition
-    """
-    test = (np.abs(error) <= threshold)
-    return (test * (error**2)/2) + ((1-test)*threshold*(np.abs(error) - threshold/2))
-
-x = np.linspace(-5, 5)
-y = huber(x, 1)
-
-plt.plot(x, y)
-plt.gca().set_title("Huber Loss")
-plt.show()
-
- -
-
-
- -
-
- -
- - - -
- -
- -
- -
-
- -
-
-
-

Hampel loss

A more complex loss function that depends on three parameters $0 < a < b < r$.

- -
-
-
-
-
- -
-
-
import numpy as np
-
-def single_point_hampel(error: float, a: float, b: float, r: float):
-    """Hampel loss
-    
-    @param error: base error
-    @param a: 1st threshold parameter
-    @param b: 2nd threshold parameter
-    @param r: 3rd threshold parameter
-    """
-    if abs(error) <= a:
-        return error**2/2
-    elif a < abs(error) <= b:
-        return (1/2 *a**2 + a* (abs(error)-a))
-    elif  b < abs(error) <= r:
-        return a * (2*b-a+(abs(error)-b) * (1+ (r-abs(error))/(r-b)))/2
-    else:
-        return a*(b-a+r)/2
-
-hampel = np.vectorize(single_point_hampel)
-
-x = np.linspace(-10.1, 10.1)
-y = hampel(x, a=1.5, b=3.5, r=8)
-
-plt.plot(x, y)
-plt.gca().set_title("Hampel Loss")
-plt.show()
-
- -
-
-
- -
-
- -
- - - -
- -
- -
- -
-
- -
-
-
-

KDE is a Mean


-

Kernel as scalar product: -

-
    -
  • Let $K$ be a radial monotonic$^\ast$ kernel over $\mathbb{R}^n$.
  • -
  • For $x\in\mathbb{R}^n$ let $\phi_x = K(\cdot, x)$.
  • -
  • Vector space over the linear span of $\{\phi_x \mid x\in\mathbb{R}^n\}$:
      -
    • Pointwise addition and scalar multiplication.
    • -
    -
  • -
  • Define the scalar product $\langle \phi_x, \phi_y\rangle = K(x,y)$.
  • -
  • Advantage: Scalar product is computable
  • -
  • Call this the reproducing kernel Hilbert space (RKHS) of $K$.
  • -
  • $\mathrm{KDE}_h(\cdot, D) = \frac{1}{N}\sum_{i=1}^N K_h(\cdot, x_i) = \frac{1}{N}\sum_{i=1}^N\phi_{x_i}$
      -
    • where $K_h(x,y) = \frac{1}{h}K\left(\frac{|x-y|}{h}\right)$
    • -
    -
  • -
-
-

$^*$All kernels that we have seen are radial and monotonic
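The identity $\mathrm{KDE}_h(\cdot, D) = \frac{1}{N}\sum_{i=1}^N\phi_{x_i}$ can be checked numerically (a small sanity check, our own sketch): evaluating scikit-learn's KernelDensity at a query point gives the same number as the plain average of normalized Gaussian kernel values.

import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
h = 0.5
z = np.array([[0.3, -0.1]])  # query point

# sklearn's KDE ...
kde = KernelDensity(kernel='gaussian', bandwidth=h).fit(X)
p_sklearn = np.exp(kde.score_samples(z))[0]

# ... equals the plain mean of (normalized) Gaussian kernel evaluations
d = X.shape[1]
sq_dists = ((X - z) ** 2).sum(axis=1)
kernel_vals = np.exp(-sq_dists / (2 * h ** 2)) / (2 * np.pi * h ** 2) ** (d / 2)
p_mean = kernel_vals.mean()

print(p_sklearn, p_mean)  # agree up to floating point error

Because the estimate is a mean of feature maps, the robust replacements discussed above (median-of-means, M-estimation with Huber or Hampel losses) can be applied to it; this is exactly what mom-kde and rkde do in the experiment below.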

- -
-
-
-
-
-

Exercise

We compare the performance of different approaches to recovering the nominal distribution under contamination. Here, we use code by Humbert et al. to replicate the results in the referenced paper on median-of-means KDE. More details on rKDE can be found in this paper by Kim and Scott.

- -
-
-
-
-
- -
-
-
# =======================================================
-#   Parameters
-# =======================================================
-algos = [
-    'kde',
-    'mom-kde', # Median-of-Means
-    'rkde-huber', # robust KDE with huber loss
-    'rkde-hampel', # robust KDE with hampel loss
-]
-
-dataset = 'house-prices'
-dataset_options = {'neighborhood': 'CollgCr', 'anomaly_neighborhood': 'Edwards'}
-
-outlierprop_range = [0.01, 0.02, 0.03, 0.05, 0.07, 0.1, 0.2, 0.3, 0.4, 0.5]
-kernel = 'gaussian'
-
- -
-
-
- -
-
-
- -
-
-
auc_scores = perform_rkde_experiment(
-    algos,
-    dataset,
-    dataset_options,
-    outlierprop_range,
-    kernel,
-)
-
- -
-
-
- -
-
- -
- -
-
Dataset:  house-prices
-
-
-
- -
- - -
- -
- -
- -
-
-Outlier prop: 0.01 (1 / 10)
-downsample outliers
-Finding best bandwidth...
-Algo:  kde
-Algo:  mom-kde
-Algo:  rkde-huber
-Stop at 4 iterations
-Stop at 2 iterations
-Algo:  rkde-hampel
-Stop at 4 iterations
-Stop at 100 iterations
-
-Outlier prop: 0.02 (2 / 10)
-downsample outliers
-Finding best bandwidth...
-Algo:  kde
-Algo:  mom-kde
-Algo:  rkde-huber
-Stop at 4 iterations
-Stop at 2 iterations
-Algo:  rkde-hampel
-Stop at 4 iterations
-Stop at 10 iterations
-
-Outlier prop: 0.03 (3 / 10)
-downsample outliers
-Finding best bandwidth...
-Algo:  kde
-Algo:  mom-kde
-Algo:  rkde-huber
-Stop at 3 iterations
-Stop at 2 iterations
-Algo:  rkde-hampel
-Stop at 3 iterations
-Stop at 100 iterations
-
-Outlier prop: 0.05 (4 / 10)
-downsample outliers
-Finding best bandwidth...
-Algo:  kde
-Algo:  mom-kde
-Algo:  rkde-huber
-Stop at 5 iterations
-Stop at 3 iterations
-Algo:  rkde-hampel
-Stop at 5 iterations
-Stop at 13 iterations
-
-Outlier prop: 0.07 (5 / 10)
-downsample outliers
-Finding best bandwidth...
-Algo:  kde
-Algo:  mom-kde
-Algo:  rkde-huber
-Stop at 4 iterations
-Stop at 2 iterations
-Algo:  rkde-hampel
-Stop at 4 iterations
-Stop at 100 iterations
-
-Outlier prop: 0.1 (6 / 10)
-downsample outliers
-Finding best bandwidth...
-Algo:  kde
-Algo:  mom-kde
-Algo:  rkde-huber
-Stop at 4 iterations
-Stop at 3 iterations
-Algo:  rkde-hampel
-Stop at 4 iterations
-Stop at 100 iterations
-
-Outlier prop: 0.2 (7 / 10)
-downsample outliers
-Finding best bandwidth...
-Algo:  kde
-Algo:  mom-kde
-Algo:  rkde-huber
-Stop at 5 iterations
-Stop at 3 iterations
-Algo:  rkde-hampel
-Stop at 5 iterations
-Stop at 100 iterations
-
-Outlier prop: 0.3 (8 / 10)
-downsample outliers
-Finding best bandwidth...
-Algo:  kde
-Algo:  mom-kde
-Algo:  rkde-huber
-Stop at 4 iterations
-Stop at 3 iterations
-Algo:  rkde-hampel
-Stop at 4 iterations
-Stop at 15 iterations
-
-Outlier prop: 0.4 (9 / 10)
-downsample outliers
-Finding best bandwidth...
-Algo:  kde
-Algo:  mom-kde
-Algo:  rkde-huber
-Stop at 4 iterations
-Stop at 2 iterations
-Algo:  rkde-hampel
-Stop at 4 iterations
-Stop at 100 iterations
-
-Outlier prop: 0.5 (10 / 10)
-downsample inliers
-Finding best bandwidth...
-Algo:  kde
-Algo:  mom-kde
-Algo:  rkde-huber
-Stop at 3 iterations
-Stop at 2 iterations
-Algo:  rkde-hampel
-Stop at 3 iterations
-Stop at 100 iterations
-
-
-
- -
-
- -
-
-
- -
-
-
fig, ax = plt.subplots(figsize=(7, 5))
-for algo, algo_data in auc_scores.groupby('algo'):
-    x = algo_data.groupby('outlier_prop').mean().index
-    y = algo_data.groupby('outlier_prop').mean()['auc_anomaly']
-    ax.plot(x, y, 'o-', label=algo)
-plt.legend()
-plt.xlabel('outlier_prop')
-plt.ylabel('auc_score')
-plt.title('Comparison of rKDE against contamination')
-
- -
-
-
- -
-
- -
- - - -
-
Text(0.5, 1.0, 'Comparison of rKDE against contamination')
-
- -
- -
- - - -
- -
- -
- -
-
- -
-
-
-

Try using different neighborhoods for contamination. Which robust KDE algorithm performs better overall? Choose among the following options:

-

OldTown, Veenker, Edwards, MeadowV, Somerst, NPkVill, BrDale, Gilbert, NridgHt, Sawyer, Blmngtn, Blueste

-

You can also change the kernel type: gaussian, tophat, epanechnikov, exponential, linear or cosine.

- -
-
-
-
-
-

Summary

    -
  • Kernel density estimation is a non-parametric method to estimate a pdf from a sample.
  • -
  • Bandwidth is the most important parameter.
  • -
  • Converges to the true pdf if $n\to\infty$.
      -
    • Convergence exponentially depends on the dimension.
    • -
    -
  • -
  • KDE is sensitive to contamination:
      -
    • In a contaminated setting one can employ methods from robust statistics to obtain robust estimates.
    • -
    -
  • -
-

Implementations

- -
-
-
-
-
-

Anomaly Detection via Isolation

Idea: An anomaly should allow "simple" descriptions that distinguish it from the rest of the data.

-
    -
  • Descriptions: Conjunction of single attribute tests, i.e. -$X_i \leq c$ or $X_i > c$.
  • -
  • Example: $X_1 \leq 1.2 \text{ and } X_5 > -3.4 \text{ and } X_7 \leq 5.6$.
  • -
  • Complexity of description: Number of conjunctions.
  • -
-

Moreover, we assume that a short random description will have a significantly larger chance of isolating an anomaly than of isolating any nominal point.

-
    -
  • Choose random isolating descriptions and compute anomaly score from average complexity.
  • -
- -
-
-
-
-
-

Isolation Tree

Isolation Forest (iForest) implements this idea by generating an ensemble of random decision trees. Each tree is built as follows (a minimal code sketch follows the list below):

-
-

Input: Data set (subsample) $X$, maximal height $h$

-
    -
  • Randomly choose feature $i$ and split value $s$ (in range of data)
  • -
  • Recursively build subtrees on $X_L = \{x\in X\mid x_i \leq s\}$ and $X_R = X\setminus X_L$
  • -
  • Stop if the remaining data set has size $\leq 1$ or the maximal height is reached
  • -
  • Store test $x_i\leq s$ for inner nodes and $|X|$ for leaf nodes
  • -
-
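A minimal sketch of this tree-building procedure (our own simplified implementation; scikit-learn's IsolationForest, used later in this notebook, is the production-ready version):

import numpy as np

def build_itree(X: np.ndarray, height: int, max_height: int, rng: np.random.Generator) -> dict:
    """Recursively build one isolation tree.

    Inner nodes store the random test (feature, split value); leaf nodes store
    the number of remaining points, as described in the list above.
    """
    if len(X) <= 1 or height >= max_height:
        return {'size': len(X)}                      # leaf node
    i = rng.integers(X.shape[1])                     # random feature
    lo, hi = X[:, i].min(), X[:, i].max()
    if lo == hi:                                     # cannot split on this feature
        return {'size': len(X)}
    s = rng.uniform(lo, hi)                          # random split value in range of data
    mask = X[:, i] <= s
    return {
        'feature': i,
        'split': s,
        'left': build_itree(X[mask], height + 1, max_height, rng),
        'right': build_itree(X[~mask], height + 1, max_height, rng),
    }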
- -
-
-
-
-
-

Visualization

[Figure: Isolation Tree as Partition Diagram]
-
-
-
-
-
-

Isolation Depth


-

Input: Observation $x$

-
    -
  • ${\ell} = $ length of path from root to leaf according to tests
  • -
  • ${n} = $ size of remaining data set in leaf node
  • -
  • ${c(n)} =$ expected length of a path in a BST with $n$ nodes $={O}(\log n)$
  • -
  • ${h(x)} = \ell + c(n)$
  • -
-
[Figure: Isolation Depth of Outlier (red) and nominal (blue)]
-
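A sketch of the corresponding path-length computation for trees built with the sketch above; $c(n)$ is approximated via harmonic numbers, as in the original iForest paper.

import numpy as np

EULER_GAMMA = 0.5772156649

def c(n: int) -> float:
    """Expected path length of an unsuccessful search in a BST over n points."""
    if n <= 1:
        return 0.0
    return 2.0 * (np.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def isolation_depth(x: np.ndarray, node: dict, depth: int = 0) -> float:
    """h(x) = length of the path to the leaf plus c(size of the leaf)."""
    if 'feature' not in node:                        # reached a leaf
        return depth + c(node['size'])
    child = 'left' if x[node['feature']] <= node['split'] else 'right'
    return isolation_depth(x, node[child], depth + 1)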
-
-
-
-
-

Isolation Forest

    -
  • Train $k$ isolation trees on subsamples of size $N$
  • -
- -
-
-
-
-
[Figure: Isolation depth of nominal point (left) and outlier (right)]
-
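Combining the two sketches above into a forest: train $k$ trees on subsamples of size $N$ and turn the average isolation depth into the usual anomaly score $s(x) = 2^{-\mathbb{E}[h(x)]/c(N)}$, so scores close to $1$ indicate anomalies. Again a simplified illustration, not the library implementation.

import numpy as np

# uses build_itree, isolation_depth and c from the sketches above

def fit_iforest(X: np.ndarray, n_trees: int = 100, subsample: int = 256, seed: int = 0):
    rng = np.random.default_rng(seed)
    N = min(subsample, len(X))
    max_height = int(np.ceil(np.log2(N)))            # common default height limit
    trees = [
        build_itree(X[rng.choice(len(X), size=N, replace=False)], 0, max_height, rng)
        for _ in range(n_trees)
    ]
    return trees, N

def iforest_score(x: np.ndarray, trees: list, N: int) -> float:
    mean_depth = np.mean([isolation_depth(x, tree) for tree in trees])
    return 2.0 ** (-mean_depth / c(N))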
-
-
-
-
-

Variants of Isolation Forest

-
-
-
-
-
-

Variant: Random Robust Cut Forest

New Rule to Choose Split Test:

-
    -
  • $\ell_i$: length of the $i$th component of the bounding box around current data set
  • -
  • Choose dimension $i$ with probability $\frac{\ell_i}{\sum_j \ell_j}$
  • -
  • More robust against "noise dimensions"
  • -
- -
-
-
-
-
-
- -
-
-
-
-
-
-

Variant: Extended Isolation Forest

New split criterion:

-
    -
  • Uniformly choose a random normal vector and split along the corresponding orthogonal hyperplane through the data (oblique instead of axis-parallel cuts)
  • -
  • Removes a bias that was empirically observed when plotting the outlier score of iForest on low dimensional data sets
  • -
-
- -
-
-
-
-
-
-

Exercise: Network Security

In the final exercise of today you will have to develop an anomaly detection system for network traffic.

-

Briefing

A large e-commerce company A is experiencing downtime due to attacks on their infrastructure. -You were instructed to develop a system that can detect malicious connections to the infrastructure. -It is planned that suspicious clients will be banned.

-

Another data science team already prepared the connection data of the last year for you. They also separated a test set and manually identified and labeled attacks in that data.

-

The Data

We will work on a version of the classic KDD99 data set.

-

Kddcup 99 Data Set


-

The KDD Cup '99 dataset was created by processing the tcp dump portions -of the 1998 DARPA Intrusion Detection System (IDS) Evaluation dataset, -created by MIT Lincoln Lab [1]. The artificial data (described on the dataset's -homepage <https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html>_) was -generated using a closed network and hand-injected attacks to produce a -large number of different types of attack with normal activity in the -background.

- -
=========================   ====================================================
-Samples total               976158
-Dimensionality              41
-Features                    string (str), discrete (int), continuous (float)
-Targets                     str, 'normal.' or name of the anomaly type
-Proportion of Anomalies     1%
-=========================   ====================================================
-
-
-
-

Task

You will have to develop the system on your own. In particular, you will have to

-
    -
  • Explore the data.
  • -
  • Choose an algorithm.
  • -
  • Find a good detection threshold.
  • -
  • Evaluate and summarize your results.
  • -
  • Estimate how much A could save through the use of your system.
  • -
- -
-
-
-
-
- -
-
-
X_train,X_test,y_test = get_kdd_data()
-
- -
-
-
- -
-
-
-

Explore Data

-
-
-
-
-
- -
-
-
#
-# Add your exploration code
-#
-X_train = pd.DataFrame(X_train)
-X_test = pd.DataFrame(X_test)
-
- -
-
-
- -
-
-
- -
-
-
# get description
-X_train.describe()
-
- -
-
-
- -
-
- -
- - -
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
0123456789...31323334353637383940
count80524805248052480524805248052480524805248052480524...805248052480524.080524.080524.080524.080524.080524.080524.080524.0
unique2027351103028974923219...256256101.097.0101.054.089.050.0101.0101.0
top0b'tcp'b'http'b'SF'10500000...2552551.00.00.00.00.00.00.00.0
freq7109862139494517537459281395880522805168052380057...347084595652140.051505.027038.040645.076024.075584.072942.073012.0
-

4 rows × 41 columns

-
- -
- -
-
- -
-
-
- -
-
-
# get better description
-X_train.drop(columns=[1,2,3]).astype(float).describe()
-
- -
-
-
- -
-
- -
- - -
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
0456789101112...31323334353637383940
count80524.0000008.052400e+048.052400e+0480524.00000080524.00000080524.00000080524.00000080524.00000080524.00000080524.000000...80524.00000080524.00000080524.00000080524.00000080524.00000080524.00000080524.00000080524.00000080524.00000080524.000000
mean212.6134321.233283e+033.335463e+030.0000250.0002730.0000370.0450180.0002360.6947740.034474...152.213849201.3416370.8406780.0561000.1542890.0233430.0094110.0084160.0571950.055349
std1335.7888403.889103e+043.920498e+040.0049840.0281910.0105720.8600650.0231070.4605064.447875...103.50490188.0144820.3113670.1791410.3075650.0495120.0898340.0867730.2241170.218304
min0.0000000.000000e+000.000000e+000.0000000.0000000.0000000.0000000.0000000.0000000.000000...0.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.000000
25%0.0000001.460000e+021.050000e+020.0000000.0000000.0000000.0000000.0000000.0000000.000000...40.000000169.0000000.9100000.0000000.0000000.0000000.0000000.0000000.0000000.000000
50%0.0000002.320000e+023.920000e+020.0000000.0000000.0000000.0000000.0000001.0000000.000000...175.000000255.0000001.0000000.0000000.0100000.0000000.0000000.0000000.0000000.000000
75%0.0000003.170000e+022.013000e+030.0000000.0000000.0000000.0000000.0000001.0000000.000000...255.000000255.0000001.0000000.0100000.0900000.0300000.0000000.0000000.0000000.000000
max30190.0000005.135678e+065.134218e+061.0000003.0000003.00000030.0000004.0000001.000000884.000000...255.000000255.0000001.0000001.0000001.0000001.0000001.0000001.0000001.0000001.000000
-

8 rows × 38 columns

-
- -
- -
-
- -
-
-
- -
-
-
# Check for NaNs
-print("Number of NaNs: {}".format(X_train.isna().sum().sum()))
-
- -
-
-
- -
-
- -
- -
-
Number of NaNs: 0
-
-
-
- -
-
- -
-
-
- -
-
-
#
-# Add your preperation code here
-#
-
-# Encode string features
-binarizer = LabelBinarizer()
-one_hots = None
-one_hots_test = None
-for i in [1, 2, 3]:
-    binarizer.fit(X_train[[i]].astype(str))
-    if one_hots is None:
-        one_hots = binarizer.transform(X_train[[i]].astype(str))
-        one_hots_test = binarizer.transform(X_test[[i]].astype(str))
-    else:
-        one_hots = np.concatenate([one_hots, binarizer.transform(X_train[[i]].astype(str))], axis=1)
-        one_hots_test = np.concatenate([one_hots_test, binarizer.transform(X_test[[i]].astype(str))], axis=1)
-
-X_train.drop(columns=[1,2,3], inplace=True)
-X_train_onehot = pd.DataFrame(np.concatenate([X_train.values, one_hots], axis=1))
-
-X_test.drop(columns=[1,2,3], inplace=True)
-X_test_onehot = pd.DataFrame(np.concatenate([X_test.values, one_hots_test], axis=1))
-
- -
-
-
- -
-
-
- -
-
-
# Encode y
-y_test_bin = np.where(y_test == b'normal.', 0, 1)
-
- -
-
-
- -
-
-
- -
-
-
# Remove suspicious data
-# This step is not strictly neccessary but can improve performance
-suspicious = X_train_onehot.apply(lambda col: (col - col.mean()).abs() > 4 * col.std() if col.std() > 1 else False)
-suspicious = suspicious.any(axis=1)# 4 sigma rule
-print('filtering {} suspicious data points'.format(suspicious.sum()))
-X_train_clean = X_train_onehot[~suspicious]
-
- -
-
-
- -
-
- -
- -
-
filtering 2951 suspicious data points
-
-
-
- -
-
-
-

Summary

    -
  • Isolation Forest empirically shows very good performance up to relatively high dimensions
  • -
  • It is relatively robust against contamination
  • -
  • Usually little need for hyperparameter tuning
  • -
-

Implementations

+

Anomaly Detection via Reconstruction Error

Snow

-
-
-
-
-
-

Choose Algorithm

@@ -15994,108 +13328,44 @@

Choose Algorithm
-
# TODO: implement proper model selection
-iforest = IsolationForest()
-iforest.fit(X_train_clean)
-
- -
-

-
- -
-
- -
- - - -
-
IsolationForest()
-
- -
- -
-
+
import numpy as np
+import itertools as it
+from tqdm import tqdm
 
-
-
-
+import matplotlib +from matplotlib import pyplot as plt +import plotly.express as px +import pandas as pd -
-
-
# find best threshold
-X_test_onehot, X_val_onehot, y_test_bin, y_val_bin = train_test_split(X_test_onehot, y_test_bin, test_size=.5)
-y_score = -iforest.score_samples(X_val_onehot).reshape(-1)
-
+import ipywidgets as widgets -
-
-
+from tfl_training_anomaly_detection.exercise_tools import evaluate, get_kdd_data, get_house_prices_data, create_distributions, contamination, \ +perform_rkde_experiment, get_mnist_data -
-
-
-
+from ipywidgets import interact -
-
-
#
-# Insert evaluation code
-#
+from sklearn.metrics import roc_auc_score, average_precision_score
+from sklearn.model_selection import RandomizedSearchCV
+from sklearn.preprocessing import MinMaxScaler
+from sklearn.preprocessing import LabelBinarizer
+from sklearn.ensemble import IsolationForest
+from sklearn import metrics
+from sklearn.model_selection import train_test_split
+from sklearn.decomposition import PCA
+from sklearn.neighbors import KernelDensity
 
-# calculate scores if any anomaly is present
-if np.any(y_val_bin == 1):
-    eval = evaluate(y_val_bin, y_score)
-    prec, rec, thr = eval['PR']
-    f1s = 2 * (prec * rec)/(prec + rec)
-    threshold = thr[np.argmax(f1s)]
+from tfl_training_anomaly_detection.vae import VAE, build_decoder_mnist, build_encoder_minst, build_contaminated_minst
 
-    y_score = -iforest.score_samples(X_test_onehot).reshape(-1)
-    y_pred = np.where(y_score < threshold, 0, 1)
+from tensorflow import keras
 
-    print('Precision: {}'.format(metrics.precision_score(y_test_bin, y_pred)))
-    print('Recall: {}'.format(metrics.recall_score(y_test_bin, y_pred)))
-    print('F1: {}'.format(metrics.f1_score(y_test_bin, y_pred)))
+%matplotlib inline
+matplotlib.rcParams['figure.figsize'] = (5, 5)
 
-
-
- -
- - - -
- -
- -
- -
- -
-
Precision: 0.8794520547945206
-Recall: 0.9968944099378882
-F1: 0.9344978165938865
-
-
-
- -
-
-
@@ -16534,7 +13815,7 @@

Create Model&#
-
2023-04-19 10:31:23.055180: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
+
2023-04-21 23:27:54.188016: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
 
@@ -16543,65 +13824,65 @@

Create Model&#
Epoch 1/30
-493/493 [==============================] - 46s 91ms/step - loss: 43.9802 - reconstruction_loss: 36.0458 - kl_loss: 0.3257
+493/493 [==============================] - 35s 70ms/step - loss: 42.5097 - reconstruction_loss: 34.5901 - kl_loss: 0.6970
 Epoch 2/30
-493/493 [==============================] - 48s 98ms/step - loss: 31.6917 - reconstruction_loss: 29.0970 - kl_loss: 1.9113
+493/493 [==============================] - 34s 69ms/step - loss: 31.0599 - reconstruction_loss: 28.5956 - kl_loss: 2.1009
 Epoch 3/30
-493/493 [==============================] - 47s 96ms/step - loss: 30.1662 - reconstruction_loss: 27.5706 - kl_loss: 2.4761
+493/493 [==============================] - 34s 69ms/step - loss: 30.1241 - reconstruction_loss: 27.5431 - kl_loss: 2.4653
 Epoch 4/30
-493/493 [==============================] - 47s 96ms/step - loss: 29.7152 - reconstruction_loss: 26.5361 - kl_loss: 2.9571
+493/493 [==============================] - 34s 68ms/step - loss: 29.7932 - reconstruction_loss: 27.0414 - kl_loss: 2.6726
 Epoch 5/30
-493/493 [==============================] - 47s 96ms/step - loss: 28.9498 - reconstruction_loss: 25.2538 - kl_loss: 3.5752
+493/493 [==============================] - 35s 71ms/step - loss: 29.5701 - reconstruction_loss: 26.7072 - kl_loss: 2.8093
 Epoch 6/30
-493/493 [==============================] - 47s 95ms/step - loss: 28.5414 - reconstruction_loss: 24.6374 - kl_loss: 3.8247
+493/493 [==============================] - 34s 68ms/step - loss: 29.4335 - reconstruction_loss: 26.5000 - kl_loss: 2.8876
 Epoch 7/30
-493/493 [==============================] - 51s 104ms/step - loss: 28.3047 - reconstruction_loss: 24.3351 - kl_loss: 3.9369
+493/493 [==============================] - 37s 75ms/step - loss: 29.3880 - reconstruction_loss: 26.3282 - kl_loss: 2.9572
 Epoch 8/30
-493/493 [==============================] - 52s 105ms/step - loss: 28.1957 - reconstruction_loss: 24.1238 - kl_loss: 4.0250
+493/493 [==============================] - 34s 68ms/step - loss: 29.2236 - reconstruction_loss: 26.2072 - kl_loss: 3.0092
 Epoch 9/30
-493/493 [==============================] - 52s 106ms/step - loss: 28.0770 - reconstruction_loss: 23.9851 - kl_loss: 4.0686
+493/493 [==============================] - 33s 67ms/step - loss: 29.1663 - reconstruction_loss: 26.0906 - kl_loss: 3.0563
 Epoch 10/30
-493/493 [==============================] - 54s 109ms/step - loss: 28.0162 - reconstruction_loss: 23.8709 - kl_loss: 4.1166
+493/493 [==============================] - 33s 67ms/step - loss: 29.0762 - reconstruction_loss: 25.9887 - kl_loss: 3.0914
 Epoch 11/30
-493/493 [==============================] - 54s 110ms/step - loss: 27.9992 - reconstruction_loss: 23.7733 - kl_loss: 4.1561
+493/493 [==============================] - 33s 67ms/step - loss: 29.0782 - reconstruction_loss: 25.8939 - kl_loss: 3.1354
 Epoch 12/30
-493/493 [==============================] - 54s 110ms/step - loss: 27.9469 - reconstruction_loss: 23.6861 - kl_loss: 4.1964
+493/493 [==============================] - 33s 68ms/step - loss: 28.9933 - reconstruction_loss: 25.7886 - kl_loss: 3.1714
 Epoch 13/30
-493/493 [==============================] - 56s 113ms/step - loss: 27.9455 - reconstruction_loss: 23.6237 - kl_loss: 4.2143
+493/493 [==============================] - 34s 68ms/step - loss: 29.0340 - reconstruction_loss: 25.7205 - kl_loss: 3.2132
 Epoch 14/30
-493/493 [==============================] - 59s 119ms/step - loss: 27.8618 - reconstruction_loss: 23.5419 - kl_loss: 4.2608
+493/493 [==============================] - 34s 69ms/step - loss: 28.9053 - reconstruction_loss: 25.6475 - kl_loss: 3.2307
 Epoch 15/30
-493/493 [==============================] - 59s 120ms/step - loss: 27.7890 - reconstruction_loss: 23.4880 - kl_loss: 4.2745
+493/493 [==============================] - 33s 68ms/step - loss: 28.8405 - reconstruction_loss: 25.5563 - kl_loss: 3.2630
 Epoch 16/30
-493/493 [==============================] - 59s 120ms/step - loss: 27.7563 - reconstruction_loss: 23.4388 - kl_loss: 4.3066
+493/493 [==============================] - 34s 68ms/step - loss: 28.8373 - reconstruction_loss: 25.5013 - kl_loss: 3.2902
 Epoch 17/30
-493/493 [==============================] - 58s 118ms/step - loss: 27.6718 - reconstruction_loss: 23.3834 - kl_loss: 4.3206
+493/493 [==============================] - 34s 69ms/step - loss: 28.8136 - reconstruction_loss: 25.4411 - kl_loss: 3.3181
 Epoch 18/30
-493/493 [==============================] - 57s 117ms/step - loss: 27.6985 - reconstruction_loss: 23.3397 - kl_loss: 4.3462
+493/493 [==============================] - 35s 71ms/step - loss: 28.7450 - reconstruction_loss: 25.3897 - kl_loss: 3.3306
 Epoch 19/30
-493/493 [==============================] - 58s 118ms/step - loss: 27.7143 - reconstruction_loss: 23.3263 - kl_loss: 4.3451
+493/493 [==============================] - 34s 69ms/step - loss: 28.8071 - reconstruction_loss: 25.3357 - kl_loss: 3.3655
 Epoch 20/30
-493/493 [==============================] - 58s 117ms/step - loss: 27.6340 - reconstruction_loss: 23.2739 - kl_loss: 4.3725
+493/493 [==============================] - 33s 67ms/step - loss: 28.6384 - reconstruction_loss: 25.2765 - kl_loss: 3.3830
 Epoch 21/30
-493/493 [==============================] - 58s 117ms/step - loss: 27.6250 - reconstruction_loss: 23.2324 - kl_loss: 4.3844
+493/493 [==============================] - 33s 67ms/step - loss: 28.6931 - reconstruction_loss: 25.2556 - kl_loss: 3.3926
 Epoch 22/30
-493/493 [==============================] - 58s 117ms/step - loss: 27.6293 - reconstruction_loss: 23.1890 - kl_loss: 4.4027
+493/493 [==============================] - 33s 67ms/step - loss: 28.6249 - reconstruction_loss: 25.2039 - kl_loss: 3.4114
 Epoch 23/30
-493/493 [==============================] - 58s 117ms/step - loss: 27.6002 - reconstruction_loss: 23.1515 - kl_loss: 4.4319
+493/493 [==============================] - 33s 67ms/step - loss: 28.6697 - reconstruction_loss: 25.1551 - kl_loss: 3.4301
 Epoch 24/30
-493/493 [==============================] - 58s 117ms/step - loss: 27.6394 - reconstruction_loss: 23.1220 - kl_loss: 4.4499
+493/493 [==============================] - 33s 67ms/step - loss: 28.6615 - reconstruction_loss: 25.1161 - kl_loss: 3.4372
 Epoch 25/30
-493/493 [==============================] - 53s 108ms/step - loss: 27.5510 - reconstruction_loss: 23.1140 - kl_loss: 4.4538
+493/493 [==============================] - 33s 67ms/step - loss: 28.6019 - reconstruction_loss: 25.0742 - kl_loss: 3.4602
 Epoch 26/30
-493/493 [==============================] - 34s 70ms/step - loss: 27.5861 - reconstruction_loss: 23.0813 - kl_loss: 4.4622
+493/493 [==============================] - 33s 67ms/step - loss: 28.5390 - reconstruction_loss: 25.0427 - kl_loss: 3.4624
 Epoch 27/30
-493/493 [==============================] - 34s 70ms/step - loss: 27.5400 - reconstruction_loss: 23.0504 - kl_loss: 4.4710
+493/493 [==============================] - 34s 68ms/step - loss: 28.5158 - reconstruction_loss: 25.0139 - kl_loss: 3.4798
 Epoch 28/30
-493/493 [==============================] - 34s 69ms/step - loss: 27.5267 - reconstruction_loss: 23.0238 - kl_loss: 4.4779
+493/493 [==============================] - 33s 68ms/step - loss: 28.5666 - reconstruction_loss: 24.9772 - kl_loss: 3.4934
 Epoch 29/30
-493/493 [==============================] - 34s 70ms/step - loss: 27.5144 - reconstruction_loss: 23.0092 - kl_loss: 4.4840
+493/493 [==============================] - 34s 68ms/step - loss: 28.5249 - reconstruction_loss: 24.9418 - kl_loss: 3.5118
 Epoch 30/30
-493/493 [==============================] - 34s 70ms/step - loss: 27.5164 - reconstruction_loss: 22.9949 - kl_loss: 4.5029
+493/493 [==============================] - 33s 67ms/step - loss: 28.4691 - reconstruction_loss: 24.8969 - kl_loss: 3.5315
 

@@ -16693,7 +13974,7 @@

Inspect Result -

@@ -16704,7 +13985,7 @@

Inspect Result -

@@ -16715,7 +13996,7 @@

Inspect Result -

@@ -16770,12 +14051,12 @@

Inspect Result +
@@ -16806,9 +14087,86 @@

Inspect Result -
+
+ +
+ +
+ + +
diff --git a/docs/_static/anomaly_detection_on_time_series.html b/docs/_static/nb_05_anomaly_detection_on_time_series.html similarity index 99% rename from docs/_static/anomaly_detection_on_time_series.html rename to docs/_static/nb_05_anomaly_detection_on_time_series.html index 786528f..f20d449 100644 --- a/docs/_static/anomaly_detection_on_time_series.html +++ b/docs/_static/nb_05_anomaly_detection_on_time_series.html @@ -3,7 +3,7 @@ -anomaly_detection_on_time_series +nb_05_anomaly_detection_on_time_series + + + + + + + + + + + +
+
+ +
+
+

Extreme Value Theory for Anomaly Detection +Snow

+ +
+
+
+
+
+ +
+
+
import tensorflow as tf
+import tensorflow_probability as tfp
+from tensorflow import keras
+import numpy as np
+import pandas as pd
+import matplotlib.pyplot as plt
+import os
+import logging
+from sklearn.preprocessing import StandardScaler
+from typing import Protocol, Sequence, Union, Tuple, List, TypeVar, Callable
+from matplotlib.animation import FuncAnimation
+from celluloid import Camera
+from IPython.core.display import HTML
+
+tfd = tfp.distributions
+
+ +
+
+
+ +
+
+
+
+

Despite the smaller applicability of EVT techniques, they are still a valuable addition to the anomaly detectionist's +toolbox. In several situations, anomalies directly correspond to large deviations from some (possibly running) mean - e.g. for sensor data, intrusion attacks based on the number of calls and others.

+ +
+
+
+
+
+

However, even for entirely different definitions of anomaly, most detection algorithms will produce a scalar outlier score for each datapoint. +EVT can then be used as a probabilistic framework for analyzing the univariate distribution of outlier scores and help determine meaningful thresholds for separating anomalous from normal cases.

+ +
+
+
+
+
+

EVT in a Nutshell

There are two fundamental theorems of extreme value theory on which most results in that field are based. The first is concerned with the asymptotic distribution of block maxima of a sequence of i.i.d. random variables. The second one gives an expression for the distribution of excesses over a threshold.

+

We will first state these theorems (in their standard formulation in the literature), then see how they can be applied to anomaly detection and after that highlight ideas of their proofs as well as some theoretical consequences.

+

From now on let $X_1, X_2, ...$ be a sequence of 1-dimensional i.i.d. random variables with cumulative distribution function $F$. Let $X$ also be a r.v. with the same c.d.f.

+ +
+
+
+
+
+

We define the n-block maximum as the random variable

+$$ +M_n := \max \{X_1, ..., X_n\}. +$$

Given a threshold $u$, the excess over the threshold is given by $X-u$.

+

In EVT, we are typically interested in approximating $P(M_n<z)$ for large $n$ and in approximating the distribution of excesses $P(X-u < y \mid X > u)$ for large $u$.
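As a quick numerical illustration of the two objects just defined (our own sketch): block maxima are obtained by chopping a sample into blocks of length $n$ and taking the maximum of each block; threshold excesses are the amounts by which observations overshoot $u$.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)  # i.i.d. sample X_1, ..., X_N

# n-block maxima M_n: maximum within each consecutive block of length n
n = 100
block_maxima = x[: len(x) // n * n].reshape(-1, n).max(axis=1)

# excesses over a threshold u: X - u for the observations with X > u
u = 2.0
excesses = x[x > u] - u

print(block_maxima[:5], excesses[:5])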

+ +
+
+
+
+
+

The Fisher-Tippett-Gnedenko theorem characterizes the possible limits of renormalized block maxima.

+

If there exist sequences of real numbers $a_n>0, b_n$ such that the probability distributions +$$ +P\left(\frac{M_n-b_n}{a_n}<z \right) +$$ +converge to a non-degenerate distribution $G(z)$, then $G(z)$ must be of the following form:

+\begin{equation} + P\left(\frac{M_n-b_n}{a_n}<z \right) \xrightarrow[n\rightarrow \infty]{} G(z; \xi, \mu, \sigma) = \exp \left\{ -\left( 1 + \xi \left( \frac{z - \mu}{\sigma} \right) \right)^{- \frac{1}{\xi} } \right\} +\end{equation}

where $\xi, \mu \in \mathbb{R}$ and $\sigma >0$. This function family is called the Generalized Extreme Value distributions (GEV).

+ +
+
+
+
+
+

The Pickands–Balkema–De Haan theorem states that under the same conditions as above and for a threshold $u \in \mathbb{R}$ going to infinity, the distribution of excesses over the threshold $u$ converges to a Generalized Pareto Distribution (GPD), i.e.

+\begin{equation} +P(X-u < y \mid X > u) \xrightarrow[u \rightarrow \infty]{} H(y; \xi, \tilde{\sigma})=1 - \left( 1 + \frac{\xi \ y}{\tilde{\sigma}} \right)^{-\frac{1}{\xi}} \ +\end{equation}

where $y>0$ and $1 + \frac{\xi \ y}{\tilde{\sigma}} >0$. The parameter $\xi$ takes the same value as for the GEV.

+

We have highlighted the dependence of the limiting distributions on the parameters in both cases. In applications, these parameters will be estimated based on data.

+ +
+
+
+
+
+

Practical Significance of EVT Theorems

Before we analyze the consequences of the above distributions in detail, let us discuss their practical significance. The distinctive feature of the GEV and GPD distributions is that they are of a very restricted form, belonging to a three- and a two-parameter function family, respectively. This motivates modeling the distribution of block maxima for finite but large $n$ by the GEV distribution

+$$ + P\left(\frac{M_n-b_n}{a_n}<z \right) \approx G(z; \xi, \mu, \sigma) \Longleftrightarrow + P\left( M_n < z \right) \approx G(z; \xi, \mu\prime , \sigma\prime) +$$

where $\mu\prime=b_n+a_n \mu$ and $\sigma\prime=a_n \sigma$. Thus, fitting the coefficients $\xi, \mu\prime, \sigma\prime$ to the observed values of $M_n$, e.g. by maximum likelihood estimation, also finds the "best" values of the renormalizing constants $a_n$ and $b_n$.

+

Similarly, fitting $\xi, \tilde{\sigma}$ to observed excesses of a finite threshold $u$ also finds the best renormalizing constants for the GPD.

+ +
+
+
+
+
+

In the context of AD, modeling the distributions of $M_n$ or $X-u$ is useful for finding thresholds on outlier scores with probabilistic interpretations or for predicting the occurrence rates and sizes of anomalies.

+ +
+
+
+
+
+

For example, given some complex outlier score based on sensor data of a factory process, we might be interested in the probability that this outlier score exceeds a certain threshold within a month. This could be achieved by fitting a GEV distribution to observed frequencies of monthly maxima of the score.

+ +
+
+
+
+
+

The GPD can be used to directly estimate a cumulative univariate distribution $F(z)$ for large enough $z$. Then one could use it to determine the anomaly threshold $z_{\text{th}}$ by defining an anomalous upper quantile. E.g. solving $F(z_\text{th}) = 0.99$ for $z_{\text{th}}$ (where $F$ was obtained by fitting the GPD to some outlier score) would declare approximately 1% of data points as anomalous. We will describe this in more detail below.
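As a small illustration of this thresholding idea (our own sketch, using the closed form of the fitted tail rather than any particular library): with the standard peaks-over-threshold estimator $F(z) \approx 1 - p_u\left(1 + \frac{\xi (z-u)}{\tilde\sigma}\right)^{-1/\xi}$ for $z>u$, where $p_u$ is the empirical fraction of observations exceeding $u$ (p_exceed in the code), the equation $F(z_{\text{th}})=q$ can be solved analytically.

import numpy as np

def gpd_tail_quantile(q: float, u: float, xi: float, sigma: float, p_exceed: float) -> float:
    """Threshold z with F(z) = q under the peaks-over-threshold tail model (xi != 0)."""
    return u + sigma / xi * (((1 - q) / p_exceed) ** (-xi) - 1)

# Hypothetical numbers, for illustration only: 2% of the outlier scores exceed u = 3.0,
# and the fitted GPD parameters are xi = 0.1, sigma = 0.5.
z_th = gpd_tail_quantile(q=0.99, u=3.0, xi=0.1, sigma=0.5, p_exceed=0.02)
print(f"declare scores above {z_th:.3f} anomalous")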

+ +
+
+
+
+
+

EVT in Action

Let us give a quick example of fitting a GEV on data and extracting insight from it. For that we will use the NYC taxi calls dataset, a collection of taxi calls per 30-minute interval that was collected for over a year.

+ +
+
+
+
+
+ +
+
+
taxi_csv = os.path.join("..", 'data','nyc_taxi','nyc_taxi.csv')
+taxi_df = pd.read_csv(taxi_csv)
+taxi_df['time'] = [x.time() for x in pd.to_datetime(taxi_df['timestamp'])]
+taxi_df['date'] = [x.date() for x in pd.to_datetime(taxi_df['timestamp'])]
+taxi_df.rename(columns={"value": "n_calls"}, inplace=True)
+taxi_df.drop(columns=["timestamp"], inplace=True)
+
+ +
+
+
+ +
+
+
+ +
+
+
taxi_df.head()
+
+ +
+
+
+ +
+
+ +
+ + +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
n_callstimedate
01084400:00:002014-07-01
1812700:30:002014-07-01
2621001:00:002014-07-01
3465601:30:002014-07-01
4382002:00:002014-07-01
+
+ +
+ +
+
+ +
+
+
+ +
+
+
# Helper functions for normalizing data. In most cases it will be enough to use the normalize function
+
+def normalize_data(data: Sequence) -> np.ndarray:
+    scaler = StandardScaler()
+    return scaler.fit_transform(data)
+
+
+def normalize_series(series: pd.Series) -> pd.DataFrame:
+    data = series.values.reshape(-1, 1)
+    normalized_data = normalize_data(data).reshape(-1)
+    return pd.Series(normalized_data, index=series.index)
+
+
+def normalize_df(data_frame: pd.DataFrame):
+    normalized_data = normalize_data(data_frame)
+    return pd.DataFrame(normalized_data, columns=data_frame.columns, index=data_frame.index)
+
+
+T = TypeVar("T")
+
+
+def normalize(data: T) -> T:
+    if isinstance(data, np.ndarray):
+        return normalize_data(data)
+    elif isinstance(data, pd.Series):
+        return normalize_series(data)
+    elif isinstance(data, pd.DataFrame):
+        return normalize_df(data)
+    else:
+        raise ValueError(f"Unsupported data type: {data.__class__.__name__}")
+
+ +
+
+
+ +
+
+
+ +
+
+
taxi_df_normalized = taxi_df
+taxi_df_normalized["n_calls"] = normalize(taxi_df_normalized["n_calls"])
+taxi_df_normalized.head()
+
+ +
+
+
+ +
+
+ +
+ + +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
n_callstimedate
0-0.61874500:00:002014-07-01
1-1.01029100:30:002014-07-01
2-1.28654901:00:002014-07-01
3-1.51049601:30:002014-07-01
4-1.63097102:00:002014-07-01
+
+ +
+ +
+
+ +
+
+
+

We can define a trainable GEV with tensorflow probability as follows

+ +
+
+
+
+
+ +
+
+
def get_gev(xi: float, mu=0., sigma=1., trainable_xi=True):
+    xi, mu, sigma = np.array([xi, mu, sigma]).astype(float)
+    if trainable_xi:
+        xi = tf.Variable(xi, name="xi")
+    return tfd.GeneralizedExtremeValue(
+        loc=tf.Variable(mu, name='mu'),
+        scale=tf.Variable(sigma, name='sigma'),
+        concentration=xi,
+     )
+
+ +
+
+
+ +
+
+
+

A glance at the tensorflow probability API

For solving the exercises in this notebook you will need to use basic properties of tensorflow probability distributions. They have a very intuitive and convenient API - you get access to the probability density, cdf, quantile function and so on.

+ +
+
+
+
+
+ +
+
+
sample_gev = get_gev(0.5)
+
+print(f"Probability density: {sample_gev.prob([1, 0.3])}")
+print(f"Cdf: {sample_gev.cdf([1, 0.3])}")
+print(f"Quantile: {sample_gev.quantile([0.5, 0.9])}")
+print(f"Trainable vars:\n {sample_gev.trainable_variables}")
+
+ +
+
+
+ +
+
+ +
+ +
+
Probability density: [0.18997937 0.30868637]
+Cdf: [0.64118039 0.46947339]
+Quantile: [0.40224482 4.16156525]
+Trainable vars:
+ (<tf.Variable 'xi:0' shape=() dtype=float64, numpy=0.5>, <tf.Variable 'mu:0' shape=() dtype=float64, numpy=0.0>, <tf.Variable 'sigma:0' shape=() dtype=float64, numpy=1.0>)
+
+
+
+ +
+ +
+
2023-04-21 23:45:44.619670: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
+To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
+2023-04-21 23:45:44.655801: W tensorflow/python/util/util.cc:348] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
+
+
+
+ +
+
+ +
+
+
+

Exercise 1: playing around with GEV parameters

Plot the GEV probability distribution for different values of $\xi, \mu$ and $\sigma$. How do they differ qualitatively?

+

What are the domains of definition of $z$ in the above analytic expression for $G(z)$? What values should the c.d.f. $G(z)$ take outside these domains and how does this affect fitting $\xi, \mu, \sigma$ from data by maximum likelihood estimation?

+

What expression for the GEV do we get in the limit $\xi \longrightarrow 0$?

+

The three qualitatively different shapes of the GEV have their own names. For $\xi >0$ we get the Fréchet Distribution, for $\xi<0$ the reverse Weibull distribution and for $\xi=0$ the Gumbel distribution. Note that using the Gumbel distribution in tensorflow probability is not exactly the same as using GEV with $\xi=0$ due to rounding errors. Try it out!

+ +
+
+
+
+
+ +
+
+
gev = get_gev(xi=1e-5, sigma=2)
+arr = np.linspace(-5, 5)
+
+pdf = gev.prob(arr)
+plt.plot(arr, pdf)
+plt.show()
+
+ +
+
+
+ +
+
+ +
+ + + +
+ +
+ +
+ +
+
+ +
+
+
+

Solution Exercise 1:

The cdf of the GEV distribution is well defined when $1 + \xi \left( \frac{z - \mu}{\sigma} \right) > 0$. This is equivalent to

+$$ + z > \mu - \frac{\sigma}{\xi} \qquad \text{if $\xi>0$} +$$

and
$$
 z < \mu - \frac{\sigma}{\xi} \qquad \text{if $\xi<0$},
$$
which, because $\xi<0$, is the same as $z < \mu + \frac{\sigma}{|\xi|}$.

+

Thus, for $\xi>0$, the distribution has a left boundary, the probability of points lying to the left of it is zero. The value of the cdf there is zero.

+

For $\xi<0$ there is a right boundary, the probability of points lying to the right of it is zero and the value of the cdf is 1.

+

As $\xi$ moves to zero from below, the right boundary is pushed to infinity. Similarly, if it approaches zero from above, the left boundary is pushed to negative infinity. At exactly $\xi=0$, the GEV becomes the Gumbel distribution which is well defined for all $z$.

+ +
+
+
+
+
+

We can group the numbers of calls according to the dates, thereby obtaining daily maxima and minima of calls. One way of detecting anomalies in the NYC taxi data set is by fitting a GEV to the distribution of these daily maxima. Here is a histogram plot of the (normalized) maxima:

+ +
+
+
+
+
+ +
+
+
daily_grouped = taxi_df.groupby("date")["n_calls"].agg(["max", "min", "sum"])
+daily_grouped["diff"] = daily_grouped["max"] - daily_grouped["min"]
+daily_grouped.head()
+
+ +
+
+
+ +
+
+ +
+ + +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
maxminsumdiff
date
2014-07-011.795669-1.8840282.7904923.679696
2014-07-021.691045-1.8233581.0140523.514403
2014-07-032.139658-1.756635-2.3722373.896293
2014-07-040.481677-1.709367-25.0806062.191043
2014-07-050.438732-1.819178-24.6619682.257910
+
+ +
+ +
+
+ +
+
+
+ +
+
+
daily_grouped_normalized = normalize(daily_grouped)
+daily_grouped_normalized.head()
+
+ +
+
+
+ +
+
+ +
+ + +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
maxminsumdiff
date
2014-07-010.960450-0.7140180.2087091.261851
2014-07-020.718429-0.1552740.0758440.838539
2014-07-031.7561840.459211-0.1774271.816547
2014-07-04-2.0791440.894527-1.875853-2.550535
2014-07-05-2.178485-0.116786-1.844541-2.379291
+
+ +
+ +
+
+ +
+
+
+ +
+
+
plt.hist(daily_grouped_normalized["max"], density=True, bins=40)
+plt.title("Daily maxima of n_calls/(30 minutes)")
+plt.show()
+
+ +
+
+
+ +
+
+ +
+ + + +
+ +
+ +
+ +
+
+ +
+
+
+

Q: Can you already spot the obvious anomalies? What caused them?

+

A: See below

+

Q: Which of the three qualitatively different shapes would make "physical" sense for the taxi calls data?

+

A: The Weibull shape - thus we expect $\xi<0$.

+ +
+
+
+
+
+ +
+
+
maxima = daily_grouped_normalized["max"]
+maxima[(maxima > 1.8) | (maxima < -3.5)]
+
+ +
+
+
+ +
+
+ +
+ + + +
+
date
+2014-09-06    1.885529
+2014-11-02    4.827113
+2014-12-25   -4.166655
+2015-01-01    1.839858
+2015-01-27   -4.010309
+Name: max, dtype: float64
+
+ +
+ +
+
+ +
+
+
+
    +
  • 02/11 - NY marathon
  • +
  • 25/12 - Christmas
  • +
  • 27/01 - Snowstorm
  • +
  • 01/01 - New Years
  • +
  • 06/09 - Columbus day (big parade)
  • +
+ +
+
+
+
+
+

Fitting the GEV

Now let us infer the parameters of the GEV from the data using maximum likelihood estimation. We will perform gradient descent on the negative log likelihood of the GEV. Here is a very simple training loop, written out in detail, for a suitable initial choice of the shape parameter $\xi$ (called "concentration" in tensorflow probability):

+ +
+
+
+
+
+ +
+
+
# we are going to be a bit fancy and show an animation of the function as it is being fitted
+
+daily_max = daily_grouped_normalized["max"].values
+
+optimizer = keras.optimizers.SGD(learning_rate=2e-4)
+losses = []
+
+sample_gev = get_gev(xi=-0.1, trainable_xi=True)
+
+fig = plt.figure(dpi=200, figsize=(4.5, 3))
+camera = Camera(fig)
+
+for step in range(100):
+    with tf.GradientTape() as tape:
+        loss = - tf.math.reduce_sum(sample_gev.log_prob(daily_max))
+    gradients = tape.gradient(loss, sample_gev.trainable_variables)
+    optimizer.apply_gradients(zip(gradients, sample_gev.trainable_variables))
+    losses.append(loss)
+    
+    bins = plt.hist(daily_max, bins=40, density=True, color="C0")[1]
+    pdf = sample_gev.prob(bins)
+    plt.plot(bins, pdf, color="orange")
+    ax = plt.gca()
+    ax.text(0.5, 1.01, f"{step=}, Loss={loss}", transform=ax.transAxes)
+    camera.snap()
+
+plt.close()
+plt.figure()
+plt.plot(losses)
+plt.title("Negative Log Likelihood")
+plt.xlabel("gradient steps")
+plt.show()
+
+ +
+
+
+ +
+
+ +
+ + + +
+ +
+ +
+ +
+
+ +
+
+
+

Seems like after 100 steps we have already converged. Let us have a quick look at the result

+ +
+
+
+
+
+ +
+
+
bin_positions = plt.hist(daily_max, density=True, bins=25)[1]
+plt.plot(bin_positions, sample_gev.prob(bin_positions))
+plt.show()
+
+ +
+
+
+ +
+
+ +
+ + + +
+ +
+ +
+ +
+
+ +
+
+
+ +
+
+
sample_gev.trainable_variables
+
+ +
+
+
+ +
+
+ +
+ + + +
+
(<tf.Variable 'xi:0' shape=() dtype=float64, numpy=-0.19605527604088632>,
+ <tf.Variable 'mu:0' shape=() dtype=float64, numpy=-0.3901123713168851>,
+ <tf.Variable 'sigma:0' shape=() dtype=float64, numpy=1.1369220029896918>)
+
+ +
+ +
+
+ +
+
+
+

Well, we probably can do better...

+ +
+
+
+
+
+

Exercise 2.1: MLE for the generalized extreme value distribution

Find a better fit using:

+
    +
  1. Removing the obvious anomalies
  2. +
  3. Profiling in the shape parameter $\xi$ or using different initial values/learning rates for inferring $\xi$
  4. +
+

Feel free to improve the code by defining new functions and so on!

+

Evaluate the quality of your fit by visual inspection of a histogram and a Q-Q plot (more on that below).

+

You can also use other statistical tools that you are familiar with.

+

Use the fitted model to find "anomalies" for taxi calls corresponding to probabilities of less than $0.01$.

+ +
+
+
+
+
+

Q-Q Plot

A Q-Q plot is useful for visually comparing two distributions, or for comparing a distribution with a dataset; we are interested in the latter. In a Q-Q plot the quantiles of one distribution are plotted against the quantiles of the other. For a dataset, the natural choice of quantiles is simply the sorted data itself: the $k$-th sorted data point roughly corresponds to the quantile at probability $\frac{k}{n+1}$, where $n$ is the number of samples and $k=1,...,n$ (these probabilities are often called plotting positions, and other choices for them are possible). The corresponding theoretical quantiles of some specified c.d.f. $F$ are then given by $q_k \ \text{s.t.} \ F(q_k) = \frac{k}{n+1}$ (in our applications, $F$ will generally be strictly increasing, so the $q_k$ are uniquely defined).

+

If the distribution is a good fit of the data, the resulting line will be close to the diagonal. Below we ask you to complete a simple implementation of the Q-Q plot for tensorflow-like distributions

+ +
+
+
+
+
+ +
+
+
ArrayLike = Sequence[Union[float, tf.Tensor]]
+
+
+class TFDistributionProtocol(Protocol):
+    name: str
+    trainable_variables: Tuple[tf.Variable]
+        
+    def quantile(self, prob: ArrayLike) -> ArrayLike: ...    
+
+ +
+
+
+ +
+
+
+ +
+
+
def qqplot(data: ArrayLike, dist: TFDistributionProtocol):
+    num_observations = len(data)
+    observed_quantiles = sorted(data)
+    plotting_positions = np.arange(1, num_observations + 1) / (num_observations + 1)
+    theoretical_quantiles = dist.quantile(plotting_positions)
+    
+    plot_origin = (theoretical_quantiles[0], observed_quantiles[0])
+    plt.plot(theoretical_quantiles, observed_quantiles)
+    plt.plot(theoretical_quantiles, theoretical_quantiles) # adding a diagonal for visual comparison
+    plt.xlabel(f"Theoretical quantiles of {dist.name}")
+    plt.ylabel(f"Observed quantiles")
+    
+
+ +
+
+
+ +
+
+
+

Solution of exercise 2.1

+
+
+
+
+
+ +
+
+
# setting up functions for normal and profile likelihood fit
+
def fit_dist(data: ArrayLike, dist: TFDistributionProtocol, num_steps=100, lr=1e-4, 
+             plot_losses=True, return_animation=True) -> Union[float, Tuple[float, HTML]]:
+    """Fits dist to data by gradient descent on the negative log likelihood and returns the final loss,
+    or the tuple (final_loss, animation) if return_animation=True."""
+    optimizer = keras.optimizers.SGD(learning_rate=lr)
+    losses = []
+    
+    if return_animation:
+        fig = plt.figure(dpi=200, figsize=(4.5, 3))
+        camera = Camera(fig)    
+
+    for step in range(num_steps):
+        with tf.GradientTape() as tape:
+            loss = - tf.math.reduce_sum(dist.log_prob(data))
+        if np.isnan(loss.numpy()):
+            logging.warning(f"Encountered nan after {step} steps")
+            break
+        
+        gradients = tape.gradient(loss, dist.trainable_variables)
+        optimizer.apply_gradients(zip(gradients, dist.trainable_variables))
+        losses.append(loss)
+        
+        if return_animation:
+            bins = plt.hist(data, bins=50, density=True, color="C0")[1]
+            pdf = dist.prob(bins)
+            plt.plot(bins, pdf, color="orange")
+            ax = plt.gca()
+            ax.text(0.5, 1.01, f"{step=}, Loss={round(loss.numpy(), 2)}", transform=ax.transAxes)
+            camera.snap()
+    
+
+    if plot_losses:
+        plt.close()
+        plt.figure()
+        plt.plot(losses)
+        plt.title("Negative Log Likelihood")
+        plt.xlabel("gradient steps")
+        plt.show()
+    
+    result = losses[-1]
+    if return_animation:
+        result = result, HTML(camera.animate().to_html5_video())
+    return result
+
+def profile_fit_dist(data: ArrayLike, dist_factory: Callable[[float], TFDistributionProtocol], xi_values: Sequence[float], 
+                     num_steps=100, lr=1e-4) -> Tuple[float, TFDistributionProtocol]:
+    """
+    Profile likelihood fit: for each candidate xi, fits the remaining parameters of the distribution
+    created by dist_factory(xi) and returns the tuple (minimal_loss, optimal_dist).
+    """
+    minimal_loss = np.inf
+    optimal_dist = None
+    for xi in xi_values:
+        dist = dist_factory(xi)
+        loss = fit_dist(data, dist, num_steps=num_steps, lr=lr, plot_losses=False, return_animation=False)
+        if loss < minimal_loss:
+            minimal_loss = loss
+            optimal_dist = dist
+    if optimal_dist is None:
+        raise RuntimeError(f"Could not find optimal dist, probably due to divergences during fit. "  
+                           "Try to find a better choice for xi_values")
+    return minimal_loss, optimal_dist
+
+ +
+
+
+ +
+
+
+ +
+
+
# removing obvious anomalies
+daily_max_without_anomalies = daily_max[np.logical_and( daily_max < 1.8, daily_max > -3.5)]
+
+plt.hist(daily_max_without_anomalies, density=True, bins=40)
+plt.title("Daily maxima without anomalies")
+plt.show()
+
+ +
+
+
+ +
+
+ +
+ + + +
+ +
+ +
+ +
+
+ +
+
+
+ +
+
+
# Example with profile likelihood
+
+xi_values = np.linspace(-0.3, -0.5, 30)
+dist_factory = lambda xi: get_gev(xi, trainable_xi=False)
+min_loss, optimal_gev = profile_fit_dist(daily_max_without_anomalies, dist_factory, xi_values, num_steps=80)
+print(f"Minimal loss: {min_loss}")
+print(f"Optimal xi: {optimal_gev.concentration}")
+optimal_gev.trainable_variables
+
+ +
+
+
+ +
+
+ +
+ +
+
Minimal loss: 252.73708563392216
+Optimal xi: -0.44482758620689655
+
+
+
+ +
+ + + +
+
(<tf.Variable 'mu:0' shape=() dtype=float64, numpy=-0.2125311941252912>,
+ <tf.Variable 'sigma:0' shape=() dtype=float64, numpy=0.896378037612195>)
+
+ +
+ +
+
+ +
+
+
+ +
+
+
bin_positions = plt.hist(daily_max_without_anomalies, density=True, bins=40)[1]
+plt.plot(bin_positions, optimal_gev.prob(bin_positions))
+plt.title("Result from profile likelihood")
+plt.show()
+
+ +
+
+
+ +
+
+ +
+ + + +
+ +
+ +
+ +
+
+ +
+
+
+ +
+
+
# Solving with gradient descent on xi
+daily_max_gev = get_gev(xi=-0.4)
+final_loss = fit_dist(daily_max_without_anomalies, daily_max_gev, return_animation=False)
+
+ +
+
+
+ +
+
+ +
+ + + +
+ +
+ +
+ +
+
+ +
+
+
+ +
+
+
# Here the values found by fitting
+daily_max_gev.trainable_variables
+
+ +
+
+
+ +
+
+ +
+ + + +
+
(<tf.Variable 'xi:0' shape=() dtype=float64, numpy=-0.4490658772874973>,
+ <tf.Variable 'mu:0' shape=() dtype=float64, numpy=-0.21531606790499422>,
+ <tf.Variable 'sigma:0' shape=() dtype=float64, numpy=0.905020869171821>)
+
+ +
+ +
+
+ +
+
+
+ +
+
+
# and here the qqplot
+qqplot(daily_max_without_anomalies, daily_max_gev)
+
+ +
+
+
+ +
+
+ +
+ + + +
+ +
+ +
+ +
+
+ +
+
+
+

Solution exercise 2.1 - Finding anomalies from the GEV

+
+
+
+
+
+ +
+
+
#The fit looks quite good, apart from the lower region, which we are not really interested in. 
+#Let us find the anomalies corresponding to the upper 1% quantile
+upper_percentile = 0.99
+upper_quantile = daily_max_gev.quantile(upper_percentile).numpy()
+upper_quantile
+
+ +
+
+
+ +
+
+ +
+ + + +
+
1.544639556561257
+
+ +
+ +
+
+ +
+
+
+ +
+
+
# and here the anomalies above this threshold
+daily_grouped_normalized["max"][daily_grouped_normalized["max"] > upper_quantile]
+
+ +
+
+
+ +
+
+ +
+ + + +
+
date
+2014-07-03    1.756184
+2014-09-06    1.885529
+2014-11-02    4.827113
+2015-01-01    1.839858
+Name: max, dtype: float64
+
+ +
+ +
+
+ +
+
+
+

In addition to the obvious anomalies found above, we caught Independence Day (more precisely, the evening before it). We also have a probabilistic interpretation: on 99% of days, the maximal number of calls per 30 minutes will not exceed the threshold found above (the threshold should be rescaled back to the original scale for this statement to hold; see the sketch below).

+ +
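Mapping the threshold back to the original scale only requires inverting the normalization. A minimal sketch, assuming the normalization was a standardization and that the fitted scaler is available under the hypothetical name calls_scaler (it is not defined in this notebook):

# Hypothetical sketch: map the 99%-threshold back to raw call counts.
# `calls_scaler` stands for the (assumed) StandardScaler used for the normalization.
threshold_original = calls_scaler.inverse_transform([[upper_quantile]])[0, 0]
print(f"On 99% of days, calls per 30 minutes stay below ~{threshold_original:.0f}")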
+
+
+
+
+

Estimating the uncertainty

One benefit of the probabilistic approach is that we get confidence intervals almost for free. These can be used to estimate the robustness of our analysis (e.g. the determination of anomalies and the quality of the fit).

+ +
+
+
+
+
+

Since we fitted our parameters by MLE, and the maximum likelihood estimator is known to be asymptotically normal, we get uncertainty estimates from the second derivatives of the loss function (the observed Fisher information). Fortunately, tensorflow makes this extremely easy for us.

+ +
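Concretely, if $\hat{\theta} = (\hat{\xi}, \hat{\mu}, \hat{\sigma})$ denotes the maximum likelihood estimate and $\ell(\theta)$ the log likelihood of the data, asymptotic normality of the MLE suggests the approximation

$$
\hat{\theta} \overset{\text{approx.}}{\sim} \mathcal{N}\left(\theta^\ast,\; I_{\text{obs}}(\hat{\theta})^{-1}\right),
\qquad
I_{\text{obs}}(\hat{\theta}) = -\nabla^2_{\theta}\, \ell(\theta)\Big|_{\theta=\hat{\theta}},
$$

so the standard deviations of the individual parameters are the square roots of the diagonal entries of $I_{\text{obs}}(\hat{\theta})^{-1}$. This is exactly what the two helper functions below compute.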
+
+
+
+
+ +
+
+
def observed_fisher_information(data: ArrayLike, dist: TFDistributionProtocol) -> tf.Tensor:
+    with tf.GradientTape() as t2:
+        with tf.GradientTape() as t1:
+            nll = - tf.math.reduce_sum(dist.log_prob(data))
+        # conversion needed b/c trainable_vars is a tuple, so gradients and jacobians are tuples too
+        g = tf.convert_to_tensor(  
+            t1.gradient(nll, dist.trainable_variables)
+        )
+    return tf.convert_to_tensor(t2.jacobian(g, dist.trainable_variables))
+
+ +
+
+
+ +
+
+
+ +
+
+
def mle_std_deviations(data: ArrayLike, dist: TFDistributionProtocol) -> tf.Tensor:
+    observed_information_matrix = observed_fisher_information(data, dist)
+    mle_covariance_matrix = tf.linalg.inv(observed_information_matrix)
+    variances = tf.linalg.tensor_diag_part(mle_covariance_matrix)
+    return tf.math.sqrt(variances)
+
+ +
+
+
+ +
+
+
+

Exercise 2.2: Uncertainty in GEV

Using the above functions, include error bars into the Q-Q plots of the maximum likelihood estimates of the GEV distribution found above.

+ +
+
+
+
+
+

Solution Exercise 2.2

+
+
+
+
+
+ +
+
+
# finding the stddevs and adding/subtracting them from the values found from fitting
+std_devs = mle_std_deviations(daily_max_without_anomalies, daily_max_gev)
+print(f"Found std_devs: {std_devs}")
+
+coeff_fitted = tf.convert_to_tensor(daily_max_gev.trainable_variables)
+coeff_upper = coeff_fitted + std_devs
+coeff_lower = coeff_fitted - std_devs
+
+# creating GEVs corresponding to the boundaries of the confidence intervals found above
+gev_upper = get_gev(*coeff_upper)
+gev_lower = get_gev(*coeff_lower)
+
+ +
+
+
+ +
+
+ +
+ +
+
Found std_devs: [0.02618123 0.06607897 0.04556484]
+
+
+
+ +
+ +
+
2023-04-21 23:46:03.313721: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
+
+
+
+ +
+
+ +
+
+
+ +
+
+
# The qqplots for the original GEV and the GEVs at the boundaries
+
+qqplot(daily_max_without_anomalies, daily_max_gev)
+qqplot(daily_max_without_anomalies, gev_upper)
+qqplot(daily_max_without_anomalies, gev_lower)
+
+ +
+
+
+ +
+
+ +
+ + + +
+ +
+ +
+ +
+
+ +
+
+
+

Exercise 3: GEV for minima

Now let us repeat the same analysis fitting the distribution of the daily minima using the same strategy. Since minima for a univariate random variable $X$ correspond to maxima of $-X$, all we have to do is to fit a GEV to the minima multiplied by -1.

+ +
+
+
+
+
+

Solution of exercise 3

+
+
+
+
+
+ +
+
+
neg_minima_series = -daily_grouped_normalized["min"]
+neg_daily_min = neg_minima_series.values
+
+plt.hist(neg_daily_min, density=True, bins=40)
+plt.title("Daily minima * (-1)")
+plt.show()
+
+ +
+
+
+ +
+
+ +
+ + + +
+ +
+ +
+ +
+
+ +
+
+
+ +
+
+
# identifying obvious anomalies
+neg_minima_series[(neg_minima_series>2) | (neg_minima_series<-2)]
+
+ +
+
+
+ +
+
+ +
+ + + +
+
date
+2014-11-01   -4.168684
+2014-11-02   -2.561467
+2015-01-01   -2.683568
+2015-01-26    3.202483
+2015-01-27    3.442703
+Name: min, dtype: float64
+
+ +
+ +
+
+ +
+
+
+ +
+
+
# - 01/01 - New Year
+# - 01-02/11 - Marathon
+# - 26-27/01 - Snowstorm
+
+ +
+
+
+ +
+
+
+ +
+
+
neg_minima_without_anomalies = neg_daily_min[np.logical_and(neg_daily_min<2, neg_daily_min>-2)]
+plt.hist(neg_minima_without_anomalies, density=True, bins=40)
+plt.title("Daily minima * (-1) without obvious anomalies")
+plt.show()
+
+ +
+
+
+ +
+
+ +
+ + + +
+ +
+ +
+ +
+
+ +
+
+
+ +
+
+
daily_min_gev = get_gev(xi=-0.3)
+final_loss = fit_dist(neg_minima_without_anomalies, daily_min_gev, return_animation=False)
+
+ +
+
+
+ +
+
+ +
+ + + +
+ +
+ +
+ +
+
+ +
+
+
+ +
+
+
qqplot(neg_minima_without_anomalies, daily_min_gev)
+
+ +
+
+
+ +
+
+ +
+ + + +
+ +
+ +
+ +
+
+ +
+
+
+ +
+
+
# Fit looks good in the region we are interested in, let us find the 1% quantile and the corresponding anomalies
+
+ +
+
+
+ +
+
+
+ +
+
+
upper_quantile = daily_min_gev.quantile(0.99).numpy()
+upper_quantile
+
+ +
+
+
+ +
+
+ +
+ + + +
+
1.5803475989870672
+
+ +
+ +
+
+ +
+
+
+ +
+
+
neg_minima_series[neg_minima_series>upper_quantile]
+
+ +
+
+
+ +
+
+ +
+ + + +
+
date
+2015-01-26    3.202483
+2015-01-27    3.442703
+2015-01-28    1.755855
+Name: min, dtype: float64
+
+ +
+ +
+
+ +
+
+
+ +
+
+
# Only one non-obvious anomaly is found in the upper quantile; it is caused by the snowstorm responsible for the obvious anomalies we have seen above.
+
+ +
+
+
+ +
+
+
+

Comparison with Z-Test

+
+
+
+
+
+ +
+
+
daily_means = daily_grouped_normalized["sum"]
+
+plt.plot(daily_means.values)
+plt.axhline(y=2., color='r', linestyle='-')
+plt.axhline(y=-2., color='r', linestyle='-')
+plt.show()
+
+ +
+
+
+ +
+
+ +
+ + + +
+ +
+ +
+ +
+
+ +
+
+
+

The big question here is: where to put the threshold? Clearly the assumption of a Gaussian distribution underlying the sum of daily calls is incorrect - the distribution seems skewed.

+ +
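One quick way to back up the claim about skewness is to compute the sample skewness or run a normality test; a small sketch, assuming scipy is available in the environment (it is not imported above):

# Sketch (assumes scipy): quantify the skew and test for normality
from scipy import stats

print(f"Sample skewness: {stats.skew(daily_means.values):.2f}")
print(stats.normaltest(daily_means.values))  # D'Agostino-Pearson normality test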
+
+
+
+
+ +
+
+
plt.hist(daily_means, bins=25)
+plt.show()
+
+ +
+
+
+ +
+
+ +
+ + + +
+ +
+ +
+ +
+
+ +
+
+
+

We can detect some anomalies with the Z-test, of course, but the probabilistic interpretation is going to be flawed.

+ +
+
+
+
+
+ +
+
+
daily_means[np.abs(daily_means) > 2]
+
+ +
+
+
+ +
+
+ +
+ + + +
+
date
+2014-11-01    2.802000
+2014-11-27   -2.192533
+2014-12-25   -3.743349
+2014-12-26   -2.452098
+2015-01-26   -3.786365
+2015-01-27   -5.330402
+Name: sum, dtype: float64
+
+ +
+ +
+
+ +
+
+
+

A look back at the theory

So, what have we really done, and why does it make sense to use the GEV for such problems? What kind of guarantees does the Fisher-Tippett-Gnedenko theorem give us about the quality of the fit?

+ +
+
+
+
+
+

Well, the truth is, not too many. First notice the following exact equality:

+$$ +P(M_n < z) = P(X_1< z \text{ and } X_2 < z ... \text{ and } X_n < z) = F^n(z) +$$

So, if we know the cumulative distribution, there is no need to resort to the GEV. Typically, of course, we do not know it. The above equality implies:

+$$ +\lim P(M_n < z) = + \begin{cases} + 0 & \text{if}\ F(z) < 1 \\ + 1 & \text{otherwise} + \end{cases} +$$ +
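The exact equality above is easy to verify numerically; here is a small sketch using a standard normal as an example for $F$ (scipy is assumed to be available):

# Sketch: empirical cdf of block maxima vs. the exact F^n for a standard normal F
from scipy import stats

n, n_blocks = 50, 10_000
block_maxima = np.random.randn(n_blocks, n).max(axis=1)

z = np.linspace(1, 4, 200)
empirical_cdf = (block_maxima[:, None] < z).mean(axis=0)

plt.plot(z, empirical_cdf, label="empirical cdf of $M_n$")
plt.plot(z, stats.norm.cdf(z) ** n, "--", label="$F^n(z)$")
plt.legend()
plt.show()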
+
+
+
+
+

We actually always know the exact limit of the distribution of the block maxima! It is degenerate (either a step function or identically zero). In fact, this degenerate distribution can be seen as a limit of the GEV. It would correspond to normalizing constants $a_n=1, \ b_n=0$.

+ +
+
+
+
+
+

While this observation is very simple and the difference between the cdf of block maxima $P(M_n < z)$ and its degenerate limit does decrease as $n$ increases, this limiting distribution is unexpressive and fitting it to data does not provide probabilistic insight.

+ +
+
+
+
+
+

Q: How many parameters does the exact limit of $F^n$ have? What would we get if we fit it to data?

+

A: The degenerate limit is a step function with a single parameter - the location of the jump, i.e. the upper end point of the support of $F$ (or it is identically zero if $F(z) < 1$ for all $z$). Fitting it to data would essentially just place the jump near the largest observed value and would tell us nothing probabilistic about the tail.

+ +
+
+
+
+
+

Introducing the normalizing constants $a_n$ and $b_n$ might allow the distribution of renormalized block maxima to converge to something non-trivial. It also might not.

+ +
+
+
+
+
+

In applications we usually care about modeling $M_n$ for a _fixed $n_0$_ (or maybe for a few selected $n_i$). An arbitrary series of $a_n$ and $b_n$ that at some point helps convergence does not directly address our needs. In fact, this is also not what we do - by fitting the GEV parameters to data for our selected $n_0$ we automatically find the best $a_{n_0}$ and $b_{n_0}$ that minimize the difference between $F^{n_0}(z)$ and $G(z)$.

+ +
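In other words, writing $G_\xi$ for the standardized GEV cdf, the fitted location and scale play the role of the normalizing constants for the chosen block size:

$$
F^{n_0}(z) = P(M_{n_0} < z) \approx G_{\xi}\left(\frac{z - \mu}{\sigma}\right),
\qquad a_{n_0} = \sigma, \quad b_{n_0} = \mu .
$$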
+
+
+
+
+

Clearly $G(z)$ is much more expressive than the degenerate exact limit and could potentially provide a good fit.

+ +
+
+
+
+
+

So, the convergence question that we really care about is the following:

+

How well do the best fits of $G(z)$ for fixed $n$ - let us call them $G_n(z)$ - approximate the distributions $F^n(z)$ as $n$ increases? One could e.g. be interested in the infinity norm

+$$ +\Delta_n := \sup_z | F^n(z) - G_n(z) | +$$ +
+
+
+
+
+

This is not the same as asking how well $G(z)$ approximates some rescaled variant of $F^n(z)$ with $n$-dependent normalization constants! That would be

+$$ +\tilde{\Delta}_n(a_n, b_n) := \sup_z |F^n(a_n z + b_n) - G(z) | +$$ +
+
+
+
+
+

In the latter question, the choice of normalization constants matters, in the former it does not - they are implicitly determined by the best fit for each $n$. Since for $\Delta_n$ the $a_n, b_n$ have been optimized, one could reasonably expect a relation of the type

+$$ +\Delta_n \approx \min_{a_n, b_n} \tilde{\Delta}_n(a_n, b_n) +$$

to hold.

+ +
+
+
+
+
+

It is easy to see that if convergence to a GEV is possible with some normalizing sequences $a_n, b_n$, then for any other sequences $\tilde{a}_n, \tilde{b}_n$ for which there exist constants $a>0$ and $b$ such that

+$$ +\lim_{n\rightarrow \infty} \frac{\tilde{a}_n}{a_n} = a \quad,\quad \lim_{n \rightarrow \infty} \frac{b_n-\tilde{b}_n}{a_n} = b +$$

the rescaled $\frac{M_n-\tilde{b}_n}{\tilde{a}_n}$ also converges to a GEV of the same type (with the same $\xi$). This is often phrased by saying that a distribution $F$ belongs to the domain of attraction of a single GEV type. However, the error rates $\tilde{\Delta}_n(\tilde{a}_n, \tilde{b}_n)$ would differ from those associated with $a_n, b_n$.

+ +
+
+
+
+
+

Unfortunately, theoretical bounds for the quantity of interest $\Delta_n$ are hard to come by - we are not aware of any. They also highly depend on the fitting procedure, which is non-trivial, as we have seen above. There are some bounds for quantities of the type $\tilde{\Delta}_n(\tilde{a}_n, \tilde{b}_n)$ (see the annotated literature reference) but they are rather loose and not really helpful in practice. Therefore, the EVT theorems are more of a motivation for selecting distribution families for fitting than a rigorous approach with guarantees. In practice the convergence and fit tend to work pretty well, though.

+ +
+
+
+
+
+

Exercise 4 (theoretical, bonus): outlining the proof of the Fisher-Tippett-Gnedenko theorem

One may wonder how the statement of the Fisher-Tippett-Gnedenko theorem is obtained without providing bounds on convergence. The reason is that the limiting distribution of (renormalized) maxima must have a very special property - it must be max-stable. It is instructive to go through a part of the proof to get a feeling for the EVT theorems. We will do so in this exercise.

+

Definition: A cumulative distribution function $D(z)$ is called max-stable iff for all $n\in\mathbb{N} \ \exists \ \alpha_n>0, \beta_n \in \mathbb{R}$ such that

+$$ +D^n(z) = D(\alpha_n z + \beta_n) +$$

Prove that from $\lim_{n\rightarrow \infty} P\left( \frac{M_n - b_n}{a_n} < z \right) = G(z)$ follows that $G(z)$ is max-stable.

+

This goes a long way towards proving the first EVT theorem. One can easily compute that the GEV distribution is max-stable and with more effort one can also prove that any max-stable distribution belongs to the GEV family. Thus, the proof of the theorem is very implicit and does not involve any convergence rates or bounds.

+ +
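For instance, for the Gumbel member of the GEV family ($\xi = 0$) max-stability can be verified directly:

$$
G(z) = \exp\left(-e^{-\frac{z-\mu}{\sigma}}\right)
\quad\Rightarrow\quad
G^n(z) = \exp\left(-n\, e^{-\frac{z-\mu}{\sigma}}\right)
= \exp\left(-e^{-\frac{z-\mu-\sigma\ln n}{\sigma}}\right)
= G(z - \sigma\ln n),
$$

i.e. the definition holds with $\alpha_n = 1$ and $\beta_n = -\sigma\ln n$.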
+
+
+
+
+

Exercise 5: increase the block size

According to the line of thought above, increasing the block-size before determining the maxima should improve convergence. Of course, it also decreases the number of points for fitting so it increases variance. We will analyze uncertainties of the fitted GEV below.

+

Repeat the fit of the GEV for 2-day maxima/minima. What do you think about the result?

+

Hint: use the .reshape method of numpy arrays on the already computed daily maxima/minima

+ +
+
+
+
+
+

Solution exercise 5:

+
+
+
+
+
+ +
+
+
bidaily_maxima = daily_max_without_anomalies.reshape(-1, 2).max(axis=1)
+
+plt.hist(bidaily_maxima, bins=40)
+plt.title("Bidaily maxima")
+plt.show()
+
+ +
+
+
+ +
+
+ +
+ + + +
+ +
+ +
+ +
+
+ +
+
+
+ +
+
+
bidaily_gev = get_gev(xi=-0.5)
+loss = fit_dist(bidaily_maxima, bidaily_gev, lr=3e-4, num_steps=100, return_animation=False)
+
+ +
+
+
+ +
+
+ +
+ + + +
+ +
+ +
+ +
+
+ +
+
+
+ +
+
+
bidaily_gev.trainable_variables
+
+ +
+
+
+ +
+
+ +
+ + + +
+
(<tf.Variable 'xi:0' shape=() dtype=float64, numpy=-0.4352171270495099>,
+ <tf.Variable 'mu:0' shape=() dtype=float64, numpy=0.07379174390548413>,
+ <tf.Variable 'sigma:0' shape=() dtype=float64, numpy=0.7643633709060534>)
+
+ +
+ +
+
+ +
+
+
+

The shape parameter should be independent of the block size (it is not affected by $a_n$ and $b_n$). Of course, since we find it by fitting, we shouldn't be surprised to obtain a slightly different value.

+ +
+
+
+
+
+

We get a better fit than before (less than half of the loss with half as many data points), but we have higher variance in the very important shape parameter $\xi$:

+ +
+
+
+
+
+ +
+
+
std_devs_daily = mle_std_deviations(daily_max_without_anomalies, daily_max_gev)
+std_devs_bidaily = mle_std_deviations(bidaily_maxima, bidaily_gev)
+
+print("Daily stddevs:")
+print(std_devs_daily.numpy())
+print("Bidaily stddevs:")
+print(std_devs_bidaily.numpy())
+
+ +
+
+
+ +
+
+ +
+ +
+
Daily stddevs:
+[0.02618123 0.06607897 0.04556484]
+Bidaily stddevs:
+[0.03952068 0.07958698 0.05549258]
+
+
+
+ +
+
+ +
+
+
+

Peaks over threshold (PoT)

So far we have only used the first theorem of EVT. As you might have noticed above, it can be somewhat wasteful in terms of data efficiency. Since the GEV is fitted on block maxima, a huge number of data points remains unused for parameter estimation. The second theorem of EVT gives rise to a more data-efficient approach.

+ +
+
+
+
+
+

Exercise 6 (theoretical, bonus): deriving the second theorem of EVT

Use the approximation $\ln(1+x) \approx x$ for $|x| \ll 1$ and $F(z) \approx 1$ for large enough $z$ to derive

+\begin{equation} +P(X-u < y \mid X > u) \approx 1 - \left( 1 + \frac{\xi \ y}{\tilde{\sigma}} \right)^{-\frac{1}{\xi}} \label{GPD-approx-original} +\end{equation}

for large enough $u$ (this is a slightly less formal derivation of the Pickands-Balkema-de Haan theorem). One could equivalently write

+\begin{equation} +P(X-u > y \mid X > u) \approx \left( 1 + \frac{\xi \ y}{\tilde{\sigma}} \right)^{-\frac{1}{\xi}} +\end{equation}

What is the relation between $\tilde{\sigma}$ and the normalizing coefficients of the first theorem of EVT?

+ +
+
+
+
+
+

The above equation can be used to estimate the entire tail of the cdf $F$ of $X$ from a sample of size $N$ obtained by sampling repeatedly from $F$. First note that for a single $u$ we can approximate the cdf through the sample statistics as:

+\begin{equation} +1-F(u) = P(X>u) \approx \frac{N_u}{N} +\end{equation}

where $N_u$ is the number of samples with values above $u$. Interpreting $u$ as a threshold, we will call those samples peaks over threshold (PoT) and $N_u$ is simply their count.

+ +
+
+
+
+
+

Q: What should $u$ and the data set fulfill in order for the above approximation to be accurate?

+

A: It should be small enough that many data points lie above it. Then the approximation $P(X>u) \approx \frac{N_u}{N}$ is reliable (the estimator has low variance).

+ +
+
+
+
+
+

Now we can perform a series of approximations for $z>u$ to get to the tail-distribution. First using $P(X>u) \approx \frac{N_u}{N}$ we get

+$$ +P(X>z) = P(X>z \cap X>u) = P(X>z \mid X>u) P(X>u) \approx \frac{N_u}{N} P(X>z \mid X>u) +$$ +
+
+
+
+
+

Now we use the GPD theorem to approximate

+$$ +P(X>z \mid X>u) = P(X-u > z -u \mid X>u) \approx + \left( 1 + \frac{\xi (z-u)}{\tilde{\sigma}} \right)^{-\frac{1}{\xi}} +$$ +
+
+
+
+
+

Putting everything together gives

+$$ +P(X>z) \approx \frac{N_u}{N} \left( 1 + \frac{\xi (z-u)}{\tilde{\sigma}} \right)^{-\frac{1}{\xi}} +$$ +
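Solving this relation for $z$ gives the corresponding upper quantile (often called a return level): for a target tail probability $q$ and $\xi \neq 0$,

$$
P(X > z_q) \approx q
\quad\Longleftrightarrow\quad
z_q \approx u + \frac{\tilde{\sigma}}{\xi}\left[\left(\frac{q\,N}{N_u}\right)^{-\xi} - 1\right].
$$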
+
+
+
+
+

Q: Intuitively, what does $u$ need to fulfill for both approximations to hold?

+

A: $u$ should be small enough such that the approximation $P(X>u) \approx \frac{N_u}{N}$ holds and sufficiently large such that the generalized pareto distribution is a good estimate of the tail of the distribution for values larger than $u$. Intuitively, it should be at the beginning of the tail, where for values larger than $u$ only the tail behavior plays a role - i.e. no more local extrema or other specifics of the underlying distribution of the data.

+ +
+
+
+
+
+

Exercise 7: Using the GPD for anomaly detection

This exercise lets you explore the second theorem of EVT for anomaly detection. Here we let you calculate and code on your own, without giving too many hints. You can follow the GEV-fitting code above for solving this exercise. Feel free to ask for hints if you are stuck!

+
  1. Using the results above, find an approximation of the upper quantile $z_q$ such that $P(X>z_q) < q$ (assuming $z_q > u$).
  2. What is the relation of this quantile to the quantile of the generalized pareto distribution?
  3. Select a threshold $u$ and fit the generalized pareto distribution to the peaks over this threshold using tensorflow-probability and the same tricks that were used above for fitting the GEV distribution. You might want to use the profile likelihood fitting.
  4. Determine anomalies from the quantile function.
  5. What advantages do you see in fitting the GPD with PoT compared to fitting GEV distribution using block-maxima for anomaly detection? What are the disadvantages?
  6. Check the quality of your fit and perform an uncertainty analysis as above for the GEV.
+ +
+
+
+
+
+

Solution Exercise 7:

+
+
+
+
+
+ +
+
+
# We define the creation of the GPD analogous to the GEV above
+
+def get_gpd(xi: float,  sigma=1., trainable_xi=True):
+    xi, sigma = np.array([xi, sigma]).astype(float)
+    if trainable_xi:
+        xi = tf.Variable(xi, name="xi")
+    return tfd.GeneralizedPareto(
+        loc=0,
+        scale=tf.Variable(sigma, name='sigma'),
+        concentration=xi
+    )
+
+ +
+
+
+ +
+
+
+ +
+
+
# GPD is fit directly on the thresholded data, no need for grouping
+
+n_calls = taxi_df_normalized["n_calls"].values
+
+ +
+
+
+ +
+
+
+ +
+
+
plt.hist(n_calls, bins=40)
+plt.show()
+
+ +
+
+
+ +
+
+ +
+ + + +
+ +
+ +
+ +
+
+ +
+
+
+ +
+
+
# seems like u=1 gives a good value for the beginning of "tail behaviour"
+
+u = 1
+thresholded_n_calls = n_calls[n_calls>u] - u
+plt.hist(thresholded_n_calls, bins=50)
+plt.show()
+
+ +
+
+
+ +
+
+ +
+ + + +
+ +
+ +
+ +
+
+ +
+
+
+ +
+
+
# obvious anomalies
+taxi_df_normalized[taxi_df_normalized["n_calls"]> u+1]
+
+ +
+
+
+ +
+
+ +
+ + +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
        n_calls     time      date
134     2.139658    19:00:00  2014-07-03
3261    2.186926    22:30:00  2014-09-06
3262    2.195573    23:00:00  2014-09-06
5954    3.467197    01:00:00  2014-11-02
5955    2.892920    01:30:00  2014-11-02
8833    2.076538    00:30:00  2015-01-01
8834    2.175830    01:00:00  2015-01-01
+ +
+ +
+
+ +
+
+
+ +
+
+
# we filter out some calls on (one day before) independence day, Columbus day, the marathon and New Year
+
+cleaned_thresholded_calls = thresholded_n_calls[thresholded_n_calls < 1]
+plt.hist(cleaned_thresholded_calls, bins=50, density=True)
+plt.title("Thresholded calls without obvious anomalies")
+plt.show()
+
+ +
+
+
+ +
+
+ +
+ + + +
+ +
+ +
+ +
+
+ +
+
+
+ +
+
+
# fitting the GPD; we need a small learning rate to avoid hitting singularities
+# We bypass fitting xi here and instead reuse the xi found from fitting the GEV above.
+# Theory suggests that it should be close to the optimal value.
+# We could also profile around it or try full gradient descent on xi, of course; the latter is brittle.
+
+
+xi_gev = daily_max_gev.concentration.numpy()
+print(f"Using xi={xi_gev}")
+
+gpd = get_gpd(xi=xi_gev, sigma=1, trainable_xi=False)
+loss = fit_dist(cleaned_thresholded_calls, gpd, lr=5e-6, num_steps=100, return_animation=False)
+
+ +
+
+
+ +
+
+ +
+ +
+
Using xi=-0.4490658772874973
+
+
+
+ +
+ + + +
+ +
+ +
+ +
+
+ +
+
+
+ +
+
+
# now the qqplot and the stddev
+
+std = mle_std_deviations(cleaned_thresholded_calls, gpd)
+
+fitted_coeff = tf.convert_to_tensor(gpd.trainable_variables)
+coeff_upper = fitted_coeff + std
+coeff_lower = fitted_coeff - std
+
+gpd_upper = get_gpd(gpd.concentration.numpy(), *coeff_upper.numpy())
+gpd_lower = get_gpd(gpd.concentration.numpy(), *coeff_lower.numpy())
+
+ +
+
+
+ +
+
+
+ +
+
+
qqplot(cleaned_thresholded_calls, gpd)
+qqplot(cleaned_thresholded_calls, gpd_upper)
+qqplot(cleaned_thresholded_calls, gpd_lower)
+
+ +
+
+
+ +
+
+ +
+ + + +
+ +
+ +
+ +
+
+ +
+
+
+ +
+
+
# finding the threshold corresponding to probability of 0.01 of the non-conditioned tail
+# For that, we rescale with our estimate of N_u
+
+N_u = len(cleaned_thresholded_calls)/len(thresholded_n_calls)
+percentile = 1- N_u*0.01
+
+q = u + gpd.quantile(percentile).numpy()
+q
+
+ +
+
+
+ +
+
+ +
+ + + +
+
1.877599494949569
+
+ +
+ +
+
+ +
+
+
+ +
+
+
# and the anomalies lying above it. We find the same ones as before, plus a few new candidates
+
+n_calls = taxi_df_normalized["n_calls"]
+taxi_df_normalized[taxi_df_normalized["n_calls"] > q]
+
+ +
+
+
+ +
+
+ +
+ + +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
        n_calls     time      date
134     2.139658    19:00:00  2014-07-03
3261    2.186926    22:30:00  2014-09-06
3262    2.195573    23:00:00  2014-09-06
3263    1.920468    23:30:00  2014-09-06
5279    1.943813    23:30:00  2014-10-18
5942    1.910956    19:00:00  2014-11-01
5954    3.467197    01:00:00  2014-11-02
5955    2.892920    01:30:00  2014-11-02
6959    1.921620    23:30:00  2014-11-22
8833    2.076538    00:30:00  2015-01-01
8834    2.175830    01:00:00  2015-01-01
8835    1.903751    01:30:00  2015-01-01
9310    1.896978    23:00:00  2015-01-10
9311    1.911389    23:30:00  2015-01-10
10310   1.969465    19:00:00  2015-01-31
+
+ +
+ +
+
+ +
+
+
+

Results:

We found new candidates for anomalies (or rare events). The 10.01.2015 was the day following the Charlie Hebdo related terrorist attacks, and there was a large march in Paris. Maybe there was additional movement across New York's large Jewish community; see e.g. this article

+

We could not find events that could have caused the large numbers of calls on the 18/10/2014 and the 22/11/2014.

+

We also now have a probabilistic model for the tail of n_calls/30 minutes which might be useful for planning taxi availabilities on a more granular level than just per-day.

+ +
+
+
+
+
+

So far, we have completely ignored the time-series aspect of our data set. When using EVT for time series, as will often be the case in practice, seasonality, trends and so on need to be taken into account.

+ +
+
+
+
+
+

We have already seen a treatment of these topics for time series forecasting. Without going into details, we want to mention that the time-dependency can be to some extent taken into account in EVT by allowing for time-dependent parameters $\xi(t), \mu(t), \sigma(t)$.

+ +
+
+
+
+
+

There exist multiple strategies for finding these time dependent functions from data, the most straightforward one being MLE-fitting with sliding windows over a sample. One could also easily include known modulations into the MLE fitting, e.g. something like

+$$ +\mu(t) : = \mu_0 \sin(t) +$$

might do the job if one knows that the underlying mean varies with $\sin(t)$. Then, one only needs to fit $\mu_0$, as sketched below.

+ +
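A minimal sketch of this idea (not part of the original code; it assumes that tfd.GeneralizedExtremeValue from tensorflow-probability is available, and uses toy data and an arbitrary learning rate): only $\mu_0$ is trainable, while the $\sin(t)$ modulation is fixed.

# Sketch: GEV with time-dependent location mu(t) = mu_0 * sin(t); only mu_0 is fitted
mu_0 = tf.Variable(1.0, dtype=tf.float64, name="mu_0")

def nll_time_dependent(t, z, xi=-0.4, sigma=1.0):
    # t: time stamps, z: block maxima observed at those times
    dist = tfd.GeneralizedExtremeValue(loc=mu_0 * tf.sin(t), scale=sigma, concentration=xi)
    return -tf.reduce_sum(dist.log_prob(z))

t_obs = tf.constant(np.linspace(0, 20, 200))                 # toy time stamps
z_obs = tf.constant(0.5 * np.sin(np.linspace(0, 20, 200)))   # toy observations
optimizer = keras.optimizers.SGD(learning_rate=1e-3)
for _ in range(100):
    with tf.GradientTape() as tape:
        loss = nll_time_dependent(t_obs, z_obs)
    optimizer.apply_gradients([(tape.gradient(loss, mu_0), mu_0)])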
+
+
+
+
+

Confusion about EVT for anomaly detection

Unfortunately, there are some incorrect claims about applications of EVT in the AD literature. The claims often involve an incorrect analysis of EVT for multivariate and multimodal distributions.

+ +
+
+
+
+
+

Note that the EVT theorems apply to univariate distributions. They also ignore multimodality as only tail-behaviour plays a role for them.

+ +
+
+
+
+
+

Results of EVT from one dimension cannot be directly transferred to higher dimensions, even for Gaussians. The cdf of the Mahalanobis radius is simply not dimension independent; see here for an exact expression for it. Attempting such a transfer leads to a bad fit and is sometimes called a failure of classical EVT. Similar approaches and resulting claims have been tried on Gaussian mixtures. The dimension dependence is illustrated in the sketch below.

+ +
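To see the dimension dependence concretely: for a $d$-dimensional Gaussian, the squared Mahalanobis radius follows a $\chi^2_d$ distribution, so its cdf changes with $d$. A small sketch (assuming scipy is available):

# Sketch: cdf of the Mahalanobis radius R of a d-dimensional Gaussian (R^2 ~ chi^2_d)
from scipy import stats

r = np.linspace(0, 6, 200)
for d in [1, 2, 5, 10, 20]:
    plt.plot(r, stats.chi2.cdf(r ** 2, df=d), label=f"d={d}")
plt.xlabel("Mahalanobis radius r")
plt.ylabel("P(R < r)")
plt.legend()
plt.show()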
+
+
+
+
+

EVT for outlier scores

The NYC taxi data is very simple, so we could apply EVT to it directly. For multidimensional data, these techniques don't work out of the box. However, as mentioned in the beginning, virtually all AD algorithms produce a one-dimensional score, which can then be given a probabilistic meaning through EVT. We will explore this approach in the last exercise of this section.

+ +
+
+
+
+
+

Things we have omitted

There are many ways to extend the ideas presented here

+ +
+
+
+
+
+

The PoT method can be adapted to work on streams in a memory-efficient way, by automatically stripping off obvious anomalies and adjusting the threshold.

+ +
+
+
+
+
+

We have seen how MLE with gradient descent is brittle and subject to divergences. There is a lot of literature containing bags of tricks for finding the MLE estimators for GEV and GPD distributions in a smarter, more robust way.

+ +
+
+
+
+
+

One can also give up on MLE and use goodness-of-fit objectives to minimize the difference with the empirical cdf given by the data.

+ +
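A minimal sketch of this idea for the GPD from the PoT section above (a Cramer-von-Mises style objective; the learning rate and number of steps are arbitrary choices, not values from the original material):

# Sketch: fit the GPD scale by minimizing the squared distance to the empirical cdf
sorted_data = np.sort(cleaned_thresholded_calls)
empirical_cdf = np.arange(1, len(sorted_data) + 1) / (len(sorted_data) + 1)

gof_gpd = get_gpd(xi=xi_gev, sigma=1.0, trainable_xi=False)
optimizer = keras.optimizers.SGD(learning_rate=1e-3)

for _ in range(200):
    with tf.GradientTape() as tape:
        loss = tf.reduce_sum((gof_gpd.cdf(sorted_data) - empirical_cdf) ** 2)
    gradients = tape.gradient(loss, gof_gpd.trainable_variables)
    optimizer.apply_gradients(zip(gradients, gof_gpd.trainable_variables))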
+
+
+
+
+

Generally, there is a large body of literature on EVT, although more in the engineering/math directions than for AD.

+ +
+
+
+
+
+

Exercise 8

Using the anomaly scores from a data set and algorithm from yesterday (you can choose your favorite), perform an EVT analysis along the lines of what was done above. What are your conclusions? In which situations can such an analysis be useful in practical situations?

+ +
+
+
+
+
+

Solution of exercise 8:

Left to the reader

+ +
+
+
+
+
+

Snow

+
Thank you for your attention; this concludes the anomaly detection training.
+
We will be happy to see you in another Transferlab training soon!
+
+
+
+
+
+ +
+
+
 
+
+ +
+
+
+ +
+
+
+ + + + + + + + + diff --git a/docs/exercises.rst b/docs/exercises.rst new file mode 100644 index 0000000..e3b3f06 --- /dev/null +++ b/docs/exercises.rst @@ -0,0 +1,18 @@ + +Exercises +========= + + +* `nb_01_0_intro_anomaly_detection <_static/nb_01_0_intro_anomaly_detection.html>`_ + +* `nb_01_1_intro_and_ad_taxonomy <_static/nb_01_1_intro_and_ad_taxonomy.html>`_ + +* `nb_02_anomaly_detection_via_density_estimation <_static/nb_02_anomaly_detection_via_density_estimation.html>`_ + +* `nb_03_anomaly_detection_via_isolation <_static/nb_03_anomaly_detection_via_isolation.html>`_ + +* `nb_04_anomaly_detection_via_reconstruction <_static/nb_04_anomaly_detection_via_reconstruction.html>`_ + +* `nb_05_anomaly_detection_on_time_series <_static/nb_05_anomaly_detection_on_time_series.html>`_ + +* `nb_06_extreme_value_theory_for_anomaly_detection <_static/nb_06_extreme_value_theory_for_anomaly_detection.html>`_ diff --git a/docs/intro.rst b/docs/intro.rst index f038a72..d5ff943 100644 --- a/docs/intro.rst +++ b/docs/intro.rst @@ -1,12 +1,10 @@ About this documentation ======================== -Welcome to the documentation of the Example: tfl-training-anomaly-detection training. +Welcome to the documentation of the tfl-training-anomaly-detection. It contains the executed exercise notebooks and the documentation of the source code. Note that some links to source code files inside the rendered exercises might be broken, since the linking mechanism is different in jupyter and the rendered documentation. -TODO: Add more information about the training here! - diff --git a/notebooks/nb_01_0_intro_anomaly_detection.ipynb b/notebooks/nb_01_0_intro_anomaly_detection.ipynb index c3d62f1..c3e0c8c 100644 --- a/notebooks/nb_01_0_intro_anomaly_detection.ipynb +++ b/notebooks/nb_01_0_intro_anomaly_detection.ipynb @@ -2,7 +2,7 @@ "cells": [ { "cell_type": "code", - "execution_count": 1, + "execution_count": null, "metadata": { "hide_input": true, "init_cell": true, @@ -28,7 +28,7 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": null, "metadata": { "hide_input": true, "init_cell": true, @@ -40,178 +40,14 @@ "remove-input-nbconv" ] }, - "outputs": [ - { - "data": { - "text/html": [ - "" - ], - "text/plain": [ - "" - ] - }, - "execution_count": 2, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "%presentation_style" ] }, { "cell_type": "code", - "execution_count": 3, + "execution_count": null, "metadata": { "hide_input": true, "init_cell": true, @@ -234,7 +70,7 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": null, "metadata": { "hide_input": true, "init_cell": true, @@ -246,85 +82,27 @@ "remove-cell" ] }, - "outputs": [ - { - "data": { - "text/markdown": [ - "\n", - "$\\newcommand{\\vect}[1]{{\\mathbf{\\boldsymbol{#1}} }}$\n", - "$\\newcommand{\\amax}{{\\text{argmax}}}$\n", - "$\\newcommand{\\P}{{\\mathbb{P}}}$\n", - "$\\newcommand{\\E}{{\\mathbb{E}}}$\n", - "$\\newcommand{\\R}{{\\mathbb{R}}}$\n", - "$\\newcommand{\\Z}{{\\mathbb{Z}}}$\n", - "$\\newcommand{\\N}{{\\mathbb{N}}}$\n", - "$\\newcommand{\\C}{{\\mathbb{C}}}$\n", - "$\\newcommand{\\abs}[1]{{ \\left| #1 \\right| }}$\n", - "$\\newcommand{\\simpl}[1]{{\\Delta^{#1} }}$\n" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "%load_latex_macros" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%%capture\n", - "\n", - "%load_ext autoreload\n", - 
"%autoreload 2\n", - "%matplotlib inline\n", - "%load_ext tfl_training_anomaly_detection" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, "outputs": [], "source": [ - "%presentation_style" + "%load_latex_macros" ] }, { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, "source": [ - "%%capture\n", "\n", - "%set_random_seed 12" + "# Introduction to Anomaly Detection\n", + "\"Snow\"" ] }, { "cell_type": "code", "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%load_latex_macros" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, "outputs": [], "source": [ "import numpy as np\n", @@ -342,30 +120,10 @@ "\n", "%matplotlib inline\n", "matplotlib.rcParams['figure.figsize'] = (5, 5)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "slide" - } - }, - "source": [ - "\"Snow\"\n", - "
Anomaly Detection
" - ] - }, - { - "cell_type": "markdown", + ], "metadata": { - "slideshow": { - "slide_type": "slide" - } - }, - "source": [ - "# Introduction to Anomaly Detection" - ] + "collapsed": false + } }, { "cell_type": "markdown", diff --git a/notebooks/nb_01_1_intro_and_ad_taxonomy.ipynb b/notebooks/nb_01_1_intro_and_ad_taxonomy.ipynb index bc77b27..5952cef 100644 --- a/notebooks/nb_01_1_intro_and_ad_taxonomy.ipynb +++ b/notebooks/nb_01_1_intro_and_ad_taxonomy.ipynb @@ -91,8 +91,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "\"Snow\"\n", - "
Taxonomy of Anomaly Detection Approaches
" + "# Taxonomy of Anomaly Detection Approaches\n", + "\"Snow\"\n" ] }, { @@ -372,7 +372,7 @@ "metadata": { "celltoolbar": "Slideshow", "kernelspec": { - "display_name": "Python 3", + "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, @@ -386,7 +386,20 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.9.4" + "version": "3.9.16" + }, + "toc": { + "base_numbering": 1, + "nav_menu": {}, + "number_sections": true, + "sideBar": true, + "skip_h1_title": false, + "title_cell": "Table of Contents", + "title_sidebar": "Contents", + "toc_cell": false, + "toc_position": {}, + "toc_section_display": true, + "toc_window_display": false } }, "nbformat": 4, diff --git a/notebooks/nb_02_anomaly_detection_approaches.ipynb b/notebooks/nb_02_anomaly_detection_approaches.ipynb deleted file mode 100644 index cf552fd..0000000 --- a/notebooks/nb_02_anomaly_detection_approaches.ipynb +++ /dev/null @@ -1,2493 +0,0 @@ -{ - "cells": [ - { - "cell_type": "code", - "execution_count": null, - "id": "c634d79e", - "metadata": { - "hide_input": true, - "init_cell": true, - "slideshow": { - "slide_type": "skip" - }, - "tags": [ - "remove-input", - "remove-output", - "remove-input-nbconv", - "remove-output-nbconv" - ] - }, - "outputs": [], - "source": [ - "%%capture\n", - "\n", - "%load_ext autoreload\n", - "%autoreload 2\n", - "%matplotlib inline\n", - "%load_ext tfl_training_anomaly_detection" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "596df825", - "metadata": { - "hide_input": true, - "init_cell": true, - "slideshow": { - "slide_type": "skip" - }, - "tags": [ - "remove-input", - "remove-input-nbconv" - ] - }, - "outputs": [], - "source": [ - "%presentation_style" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "158112af", - "metadata": { - "hide_input": true, - "init_cell": true, - "slideshow": { - "slide_type": "skip" - }, - "tags": [ - "remove-input", - "remove-output", - "remove-input-nbconv", - "remove-output-nbconv" - ] - }, - "outputs": [], - "source": [ - "%%capture\n", - "\n", - "%set_random_seed 12" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "97c8e783", - "metadata": { - "hide_input": true, - "init_cell": true, - "slideshow": { - "slide_type": "skip" - }, - "tags": [ - "remove-input-nbconv", - "remove-cell" - ] - }, - "outputs": [], - "source": [ - "%load_latex_macros" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d10c2fd2", - "metadata": {}, - "outputs": [], - "source": [ - "%%capture\n", - "\n", - "%load_ext autoreload\n", - "%autoreload 2\n", - "%matplotlib inline\n", - "%load_ext tfl_training_anomaly_detection" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "88b83c4c", - "metadata": {}, - "outputs": [], - "source": [ - "%presentation_style" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "86090b84", - "metadata": {}, - "outputs": [], - "source": [ - "%%capture\n", - "\n", - "%set_random_seed 12" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "653f2b7d", - "metadata": {}, - "outputs": [], - "source": [ - "%load_latex_macros" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "46db38bf", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "outputs": [], - "source": [ - "import numpy as np\n", - "import itertools as it\n", - "from tqdm import tqdm\n", - "\n", - "import matplotlib\n", - "from matplotlib import pyplot as 
plt\n", - "import plotly.express as px\n", - "import pandas as pd\n", - "\n", - "import ipywidgets as widgets\n", - "\n", - "from tfl_training_anomaly_detection.exercise_tools import evaluate, get_kdd_data, get_house_prices_data, create_distributions, contamination, \\\n", - "perform_rkde_experiment, get_mnist_data\n", - "\n", - "from ipywidgets import interact\n", - "\n", - "from sklearn.metrics import roc_auc_score, average_precision_score\n", - "from sklearn.model_selection import RandomizedSearchCV\n", - "from sklearn.preprocessing import MinMaxScaler\n", - "from sklearn.preprocessing import LabelBinarizer\n", - "from sklearn.ensemble import IsolationForest\n", - "from sklearn import metrics\n", - "from sklearn.model_selection import train_test_split\n", - "from sklearn.decomposition import PCA\n", - "from sklearn.neighbors import KernelDensity\n", - "\n", - "from tfl_training_anomaly_detection.vae import VAE, build_decoder_mnist, build_encoder_minst, build_contaminated_minst\n", - "\n", - "from tensorflow import keras\n", - "\n", - "%matplotlib inline\n", - "matplotlib.rcParams['figure.figsize'] = (5, 5)\n" - ] - }, - { - "cell_type": "markdown", - "id": "7cc2f648", - "metadata": { - "slideshow": { - "slide_type": "slide" - } - }, - "source": [ - "\"Snow\"\n", - "
Anomaly Detection
" - ] - }, - { - "cell_type": "markdown", - "id": "cb85606a", - "metadata": { - "slideshow": { - "slide_type": "slide" - } - }, - "source": [ - "# Anomaly Detection via Density Estimation\n", - "**Idea:** Estimate the density of $F_0$. Areas of low density are anomalous.\n", - "- Often $p$ is too small to estimate complete mixture model\n", - "- Takes into account that $F_1$ might not be well-defined\n", - "- Estimation procedure needs to be robust against contamination if no clean training data is available" - ] - }, - { - "cell_type": "markdown", - "id": "ad028f22", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "## Kernel Density Estimation\n", - "- Non-parametric method\n", - "- Can represent almost arbitrarily shaped densities\n", - "- Each training point \"spreads\" a fraction of the probability mass as specified by the kernel function" - ] - }, - { - "cell_type": "markdown", - "id": "8a5f8585", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "\n", - "---\n", - "\n", - "\n", - "**Definition:**\n", - "- $K: \\mathbb{R} \\to \\mathbb{R}$ kernel function\n", - " - $K(r) \\geq 0$ for all $r\\in \\mathbb{R}$\n", - "\t- $\\int_{-\\infty}^{\\infty} K(r) dr = 1$\n", - "- $h > 0$ bandwidth\n", - "- Bandwidth is the most crucial parameter\n", - "---\n", - "\n", - "\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "id": "9847bb69", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "\n", - "---\n", - "**Definition:**\n", - "Let $D = \\{x_1,\\ldots,x_N\\}\\subset \\mathbb{R}^p$. The KDE with kernel $K$ and bandwidth $h$ is\n", - "$KDE_h(x, D) = \\frac{1}{N}\\sum_{i=1}^N \\frac{1}{h^p}K\\left(\\frac{|x-x_i|}{h}\\right)$\n", - "\n", - "---\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
Effect of bandwidth and kernel
" - ] - }, - { - "cell_type": "markdown", - "id": "434d0f3e", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "### Exercise\n", - "Play with the parameters!" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "10d29d49", - "metadata": { - "hideCode": false - }, - "outputs": [], - "source": [ - "dists = create_distributions(dim=2, dim_irrelevant=0)\n", - "\n", - "sample_train = dists['Double Blob'].sample(500)\n", - "X_train = sample_train[-1]\n", - "y_train = [0]*len(X_train)\n", - "\n", - "plt.scatter(X_train[:,0], X_train[:,1], c = 'blue', s=10)\n", - "plt.show()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "639512ea", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "outputs": [], - "source": [ - "# Helper function\n", - "def fit_kde(kernel: str, bandwidth: float, X_train: np.array):\n", - " \"\"\" Fit KDE\n", - " \n", - " @param kernel: kernel\n", - " @param bandwidth: bandwidth\n", - " @param x_train: data\n", - " \"\"\"\n", - " kde = KernelDensity(kernel=kernel, bandwidth=bandwidth)\n", - " kde.fit(X_train)\n", - " return kde\n", - "\n", - "def visualize_kde(kde: KernelDensity, bandwidth: float, X_test: np.array, y_test: np.array):\n", - " \"\"\"Plot KDE\n", - " \n", - " @param kde: KDE\n", - " @param bandwidth: bandwidth\n", - " @param X_test: test data\n", - " @param y_test: test label\n", - " \"\"\"\n", - " fig, axis = plt.subplots(figsize=(5, 5))\n", - "\n", - " lin = np.linspace(-10, 10, 50)\n", - " grid_points = list(it.product(lin, lin))\n", - " ys, xs = np.meshgrid(lin, lin)\n", - " # The score function of sklearn returns log-densities\n", - " scores = np.exp(kde.score_samples(grid_points)).reshape(50, 50)\n", - " colormesh = axis.contourf(xs, ys, scores)\n", - " fig.colorbar(colormesh)\n", - " axis.set_title('Density Conturs (Bandwidth={})'.format(bandwidth))\n", - " axis.set_aspect('equal')\n", - " color = ['blue' if i ==0 else 'red' for i in y_test]\n", - " plt.scatter(X_test[:, 0], X_test[:, 1], c=color)\n", - " plt.show()" - ] - }, - { - "cell_type": "markdown", - "id": "d2fa638a", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "# Choose KDE Parameters" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "89e5ef25", - "metadata": { - "hideCode": false, - "slideshow": { - "slide_type": "fragment" - } - }, - "outputs": [], - "source": [ - "ker = None\n", - "bdw = None\n", - "@interact(\n", - " kernel=['gaussian', 'tophat', 'epanechnikov', 'exponential', 'linear', 'cosine'],\n", - " bandwidth=(.1, 10.)\n", - ")\n", - "def set_kde_params(kernel: str, bandwidth: float):\n", - " \"\"\"Helper funtion to set widget parameters\n", - " \n", - " @param kernel: kernel\n", - " @param bandwidth: bandwidth\n", - " \"\"\"\n", - " global ker, bdw\n", - "\n", - " ker = kernel\n", - " bdw = bandwidth" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "909c4c5c", - "metadata": { - "slideshow": { - "slide_type": "fragment" - } - }, - "outputs": [], - "source": [ - "kde = fit_kde(ker, bdw, X_train)\n", - "visualize_kde(kde, bdw, X_train, y_train)" - ] - }, - { - "cell_type": "markdown", - "id": "0ddc2d6c", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "## Bandwidth Selection\n", - "The bandwidth is the most important parameter of a KDE model. 
A wrongly adjusted value will lead to over- or\n", - "under-smoothing of the density curve.\n", - "\n", - "A common method to select a bandwidth is maximum log-likelihood cross validation.\n", - "$$h_{\\textrm{llcv}} = \\arg\\max_{h}\\frac{1}{k}\\sum_{i=1}^k\\sum_{y\\in D_i}\\log\\left(\\frac{k}{N(k-1)}\\sum_{x\\in D_{-i}}K_h(x, y)\\right)$$\n", - "where $D_{-i}$ is the data without the $i$th cross validation fold $D_i$." - ] - }, - { - "cell_type": "markdown", - "id": "df4068e5", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "# Exercises" - ] - }, - { - "cell_type": "markdown", - "id": "94c87f47", - "metadata": {}, - "source": [ - "ex no.1: Noisy sinusoidal" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0238d830", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "outputs": [], - "source": [ - "# Generate example\n", - "dists = create_distributions(dim=2)\n", - "\n", - "distribution_with_anomalies = contamination(\n", - " nominal=dists['Sinusoidal'],\n", - " anomaly=dists['Blob'],\n", - " p=0.05\n", - ")\n", - "\n", - "# Train data\n", - "sample_train = dists['Sinusoidal'].sample(500)\n", - "X_train = sample_train[-1].numpy()\n", - "\n", - "# Test data\n", - "sample_test = distribution_with_anomalies.sample(500)\n", - "X_test = sample_test[-1].numpy()\n", - "y_test = sample_test[0].numpy()\n", - "\n", - "scatter = plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test)\n", - "handels, _ = scatter.legend_elements()\n", - "plt.legend(handels, ['Nominal', 'Anomaly'])\n", - "plt.gca().set_aspect('equal')\n", - "plt.show()" - ] - }, - { - "cell_type": "markdown", - "id": "57d255dc", - "metadata": { - "solution2": "hidden", - "solution2_first": true - }, - "source": [ - "## TODO: Define the search space for the kernel and the bandwidth" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b02d0c88", - "metadata": { - "solution2": "hidden" - }, - "outputs": [], - "source": [ - "param_space = {\n", - " 'kernel': ['gaussian', 'tophat', 'epanechnikov', 'exponential', 'linear', 'cosine'], # Add available kernels\n", - " 'bandwidth': np.linspace(0.1, 10, 100), # Define Search space for bandwidth parameter\n", - "}" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d095bdb2", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "outputs": [], - "source": [ - "def hyperopt_by_score(X_train: np.array, param_space: dict, cv: int=5):\n", - " \"\"\"Performs hyperoptimization by score\n", - " \n", - " @param X_train: data\n", - " @param param_space: parameter space\n", - " @param cv: number of cv folds\n", - " \"\"\"\n", - " kde = KernelDensity()\n", - "\n", - " search = RandomizedSearchCV(\n", - " estimator=kde,\n", - " param_distributions=param_space,\n", - " n_iter=100,\n", - " cv=cv,\n", - " scoring=None # use estimators internal scoring function, i.e. the log-probability of the validation set for KDE\n", - " )\n", - "\n", - " search.fit(X_train)\n", - " return search.best_params_, search.best_estimator_" - ] - }, - { - "cell_type": "markdown", - "id": "79ed34cc", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "source": [ - "Run the code below to perform hyperparameter optimization." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "01513b81", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "outputs": [], - "source": [ - "params, kde = hyperopt_by_score(X_train, param_space)\n", - "\n", - "print('Best parameters:')\n", - "for key in params:\n", - " print('{}: {}'.format(key, params[key]))\n", - "\n", - "test_scores = -kde.score_samples(X_test)\n", - "test_scores = np.where(test_scores == np.inf, np.max(test_scores[np.isfinite(test_scores)])+1, test_scores)\n", - "\n", - "curves = evaluate(y_test, test_scores)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ec8cf537", - "metadata": {}, - "outputs": [], - "source": [ - "visualize_kde(kde, params['bandwidth'], X_test, y_test)" - ] - }, - { - "cell_type": "markdown", - "id": "2d598cda", - "metadata": {}, - "source": [ - "### Exercise: Isolate anomalies in house prices" - ] - }, - { - "cell_type": "markdown", - "id": "965b4783", - "metadata": {}, - "source": [ - "You are a company resposible to estimate house prices around Ames, Iowa, specifically around college area. But there is a problem: houses from a nearby area, 'Veenker', are often included in your dataset. You want to build an anomaly detection algorithm that filters one by one every point that comes from the wrong neighborhood. You have been able to isolate an X_train dataset which, you are sure, contains only houses from College area. Following the previous example, test your ability to isolate anomalies in new incoming data (X_test) with KDE.\n", - "\n", - "Advanced exercise:\n", - "What happens if the contamination comes from other areas? You can choose among the following names:\n", - "\n", - "OldTown, Veenker, Edwards, MeadowV, Somerst, NPkVill, BrDale, Gilbert, NridgHt, Sawyer, Blmngtn, Blueste" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "1fd5a1af", - "metadata": {}, - "outputs": [], - "source": [ - "X_train, X_test, y_test = get_house_prices_data(neighborhood = 'CollgCr', anomaly_neighborhood='Veenker')\n", - "X_train" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "288e044a", - "metadata": {}, - "outputs": [], - "source": [ - "# Total data\n", - "train_test_data = X_train.append(X_test, ignore_index=True)\n", - "y_total = [0] * len(X_train) + y_test\n", - "\n", - "fig = px.scatter_3d(train_test_data, x='LotArea', y='OverallCond', z='SalePrice', color=y_total)\n", - "\n", - "fig.show()" - ] - }, - { - "cell_type": "markdown", - "id": "dfc80809", - "metadata": {}, - "source": [ - "### Solution" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b336ebbd", - "metadata": { - "hideCode": false, - "hidePrompt": false - }, - "outputs": [], - "source": [ - "# When data are highly in-homogeneous, like in this case, it is often beneficial \n", - "# to rescale them before applying any anomaly detection or clustering technique.\n", - "scaler = MinMaxScaler()\n", - "X_train_rescaled = scaler.fit_transform(X_train)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0159cf66", - "metadata": { - "hideCode": false - }, - "outputs": [], - "source": [ - "param_space = {\n", - " 'kernel': ['gaussian', 'tophat', 'epanechnikov', 'exponential', 'linear', 'cosine'], # Add available kernels\n", - " 'bandwidth': np.linspace(0.1, 10, 100), # Define Search space for bandwidth parameter\n", - "}\n", - "params, kde = hyperopt_by_score(X_train_rescaled, param_space)" - ] - }, - { - "cell_type": "code", - "execution_count": null, 
- "id": "c6217804", - "metadata": { - "hideCode": false - }, - "outputs": [], - "source": [ - "print('Best parameters:')\n", - "for key in params:\n", - " print('{}: {}'.format(key, params[key]))\n", - "\n", - "X_test_rescaled = scaler.transform(X_test)\n", - "test_scores = -kde.score_samples(X_test_rescaled)\n", - "test_scores = np.where(test_scores == np.inf, np.max(test_scores[np.isfinite(test_scores)])+1, test_scores)\n", - "curves = evaluate(y_test, test_scores)" - ] - }, - { - "cell_type": "markdown", - "id": "ad2238bb", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "## The Curse of Dimensionality\n", - "The flexibility of KDE comes at a price. The dependency on the dimensionality of the data is quite unfavorable.\n", - "\n", - "---\n", - "*Theorem* [Stone, 1982]\n", - "Any estimator that is consistent$^*$ with the class of all $k$-fold differentiable pdfs over $\\mathbb{R}^d$ has a\n", - "convergence rate of at most\n", - "\n", - "$$\n", - "\\frac{1}{n^{\\frac{k}{2k+d}}}\n", - "$$\n", - "\n", - "\n", - "---\n", - "\n", - "$^*$Consistency = for all pdfs $p$ in the class: $\\lim_{n\\to\\infty}|KDE_h(x, D) - p(x)|_\\infty = 0$ with probability $1$." - ] - }, - { - "cell_type": "markdown", - "id": "65268b84", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "# Exercise\n", - "- The very slow convergence in high dimensions does not necessary mean that we will see bad results in high dimensional anomaly detection with KDE.\n", - "- Especially if the anomalies are very outlying.\n", - "- However, in cases where contours of the nominal distribution are non-convex we can run into problems.\n", - "\n", - "We take a look at a higher dimensional version of out previous data set." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "75d8b1a5", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "outputs": [], - "source": [ - "dists = create_distributions(dim=3)\n", - "\n", - "distribution_with_anomalies = contamination(\n", - " nominal=dists['Sinusoidal'],\n", - " anomaly=dists['Blob'],\n", - " p=0.01\n", - ")\n", - "\n", - "sample = distribution_with_anomalies.sample(500)\n", - "\n", - "y = sample[0]\n", - "X = sample[-1]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "44acf871", - "metadata": {}, - "outputs": [], - "source": [ - "fig = px.scatter_3d(x=X[:, 0], y=X[:, 1], z=X[:, 2], color=y)\n", - "fig.show()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "58b7d48b", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "outputs": [], - "source": [ - "# Fit KDE on high dimensional examples \n", - "rocs = []\n", - "auprs = []\n", - "bandwidths = []\n", - "\n", - "param_space = {\n", - " 'kernel': ['gaussian'],\n", - " 'bandwidth': np.linspace(0.1, 100, 1000), # Define Search space for bandwidth parameter\n", - " }\n", - "\n", - "kdes = {}\n", - "dims = np.arange(2,16)\n", - "for d in tqdm(dims):\n", - " # Generate d dimensional distributions\n", - " dists = create_distributions(dim=d)\n", - "\n", - " distribution_with_anomalies = contamination(\n", - " nominal=dists['Sinusoidal'],\n", - " anomaly=dists['Blob'],\n", - " p=0\n", - " )\n", - "\n", - " # Train on clean data\n", - " sample_train = dists['Sinusoidal'].sample(500)\n", - " X_train = sample_train[-1].numpy()\n", - " # Test data\n", - " sample_test = distribution_with_anomalies.sample(500)\n", - " X_test = sample_test[-1].numpy()\n", - " y_test = 
sample_test[0].numpy()\n", - "\n", - " # Optimize bandwidth\n", - " params, kde = hyperopt_by_score(X_train, param_space)\n", - " kdes[d] = (params, kde)\n", - " \n", - " bandwidths.append(params['bandwidth'])\n", - "\n", - " test_scores = -kde.score_samples(X_test)\n", - " test_scores = np.where(test_scores == np.inf, np.max(test_scores[np.isfinite(test_scores)])+1, test_scores)\n", - "\n", - " " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "cc679493", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "outputs": [], - "source": [ - "# Plot cross section of pdf \n", - "fig, axes = plt.subplots(nrows=2, ncols=7, figsize=(15, 5))\n", - "for d, axis in tqdm(list(zip(kdes, axes.flatten()))):\n", - " \n", - " params, kde = kdes[d]\n", - "\n", - " lin = np.linspace(-10, 10, 50)\n", - " grid_points = list(it.product(*([[0]]*(d-2)), lin, lin))\n", - " ys, xs = np.meshgrid(lin, lin)\n", - " # The score function of sklearn returns log-densities\n", - " scores = np.exp(kde.score_samples(grid_points)).reshape(50, 50)\n", - " colormesh = axis.contourf(xs, ys, scores)\n", - " axis.set_title(\"Dim = {}\".format(d))\n", - " axis.set_aspect('equal')\n", - " \n", - "\n", - "# Plot evaluation\n", - "print('Crossection of the KDE at (0,...,0, x, y)')\n", - "plt.show()" - ] - }, - { - "cell_type": "markdown", - "id": "020500ed", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "## Robustness\n", - "Another drawback of KDE in the context of anomaly detection is that it is not robust against contamination of the data\n", - "\n", - "---\n", - "**Definition**\n", - "The *breakdown point* of an estimator is the smallest fraction of observations that need to be changed so that we can\n", - "move the estimate arbitrarily far away from the true value.\n", - "\n", - "---" - ] - }, - { - "cell_type": "markdown", - "id": "74cf4c13", - "metadata": { - "slideshow": { - "slide_type": "fragment" - } - }, - "source": [ - "**Example**: The sample mean has a breakdown point of $0$. Indeed, for a sample of $x_1,\\ldots, x_n$ we only need to\n", - "change a single value in order to move the sample mean in any way we want. That means that the breakdown point is\n", - "smaller than $\\frac{1}{n}$ for every $n\\in\\mathbb{N}$." - ] - }, - { - "cell_type": "markdown", - "id": "4efce06e", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "## Robust Statistics\n", - "There are robust replacements for the sample mean:\n", - "- Median of means: Split the dataset into $S$ equally sized subsets $X_1,\\ldots, X_S$ and compute\n", - "$\\mathrm{median}(\\overline{X_1},\\ldots, \\overline{X_S})$\n", - "- M-estimation: The mean in a normed vector space is the value that minimizes the squared distances\n", - "
\n", - "$\\overline{X} = \\min_{y}\\sum_{x\\in X}|x-y|^2$\n", - "
\n", - "M-estimation replaces the quadratic loss with a more robust loss function." - ] - }, - { - "cell_type": "markdown", - "id": "ac5655c6", - "metadata": { - "slideshow": { - "slide_type": "fragment" - } - }, - "source": [ - "### Huber loss\n", - "Switch from quadratic to linear loss at prescribed threshold" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b6861d26", - "metadata": { - "hideCode": false, - "slideshow": { - "slide_type": "-" - } - }, - "outputs": [], - "source": [ - "import numpy as np\n", - "\n", - "\n", - "def huber(error: float, threshold: float):\n", - " \"\"\"Huber loss\n", - " \n", - " @param error: base error\n", - " @param threshold: threshold for linear transition\n", - " \"\"\"\n", - " test = (np.abs(error) <= threshold)\n", - " return (test * (error**2)/2) + ((1-test)*threshold*(np.abs(error) - threshold/2))\n", - "\n", - "x = np.linspace(-5, 5)\n", - "y = huber(x, 1)\n", - "\n", - "plt.plot(x, y)\n", - "plt.gca().set_title(\"Huber Loss\")\n", - "plt.show()" - ] - }, - { - "cell_type": "markdown", - "id": "3017109f", - "metadata": {}, - "source": [ - "### Hampel loss\n", - "More complex loss function. Depends on 3 parameters 0 < a < b< r" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "69f95fad", - "metadata": {}, - "outputs": [], - "source": [ - "import numpy as np\n", - "\n", - "def single_point_hampel(error: float, a: float, b: float, r: float):\n", - " \"\"\"Hampel loss\n", - " \n", - " @param error: base error\n", - " @param a: 1st threshold parameter\n", - " @param b: 2nd threshold parameter\n", - " @param r: 3rd threshold parameter\n", - " \"\"\"\n", - " if abs(error) <= a:\n", - " return error**2/2\n", - " elif a < abs(error) <= b:\n", - " return (1/2 *a**2 + a* (abs(error)-a))\n", - " elif b < abs(error) <= r:\n", - " return a * (2*b-a+(abs(error)-b) * (1+ (r-abs(error))/(r-b)))/2\n", - " else:\n", - " return a*(b-a+r)/2\n", - "\n", - "hampel = np.vectorize(single_point_hampel)\n", - "\n", - "x = np.linspace(-10.1, 10.1)\n", - "y = hampel(x, a=1.5, b=3.5, r=8)\n", - "\n", - "plt.plot(x, y)\n", - "plt.gca().set_title(\"Hampel Loss\")\n", - "plt.show()" - ] - }, - { - "cell_type": "markdown", - "id": "c4ed645c", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "## KDE is a Mean\n", - "\n", - "---\n", - "\n", - "**Kernel as scalar product:**\n", - "\n", - "\n", - "- Let $K$ be a radial monotonic$^\\ast$ kernel over $\\mathbb{R}^n$.\n", - "- For $x\\in\\mathbb{R}^n$ let $\\phi_x = K(\\cdot, x)$.\n", - "- Vector space over the linear span of $\\{\\phi_x \\mid x\\in\\mathbb{R}^n\\}$:\n", - " - Pointwise addition and scalar multiplication.\n", - "- Define the scalar product $\\langle \\phi_x, \\phi_y\\rangle = K(x,y)$.\n", - "- Advantage: Scalar product is computable\n", - "- Call this the reproducing kernel Hilbert space (RKHS) of $K$.\n", - "- $\\mathrm{KDE}_h(\\cdot, D) = \\frac{1}{N}\\sum_{i=1}^N K_h(\\cdot, x_i) = \\frac{1}{N}\\sum_{i=1}^N\\phi_{x_i}$\n", - " - where $K_h(x,y) = \\frac{1}{h}K\\left(\\frac{|x-y|}{h}\\right)$\n", - "\n", - "---\n", - "\n", - "$^*$All kernels that we have seen are radial and monotonic" - ] - }, - { - "cell_type": "markdown", - "id": "aa0e6487", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "# Exercise\n", - "We compare the performance of different approaches to recover the nominal distribution under contamination.\n", - "Here, we use code by [Humbert et al.](https://github.com/lminvielle/mom-kde) to 
replicate\n", - "the results in the referenced paper on median-of-mean KDE. More details on rKDE can instead be found in this paper by [Kim and Scott.](https://arxiv.org/abs/1107.3133#:~:text=We%20propose%20a%20method%20for,ideas%20from%20classical%20M%2Destimation.)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "58243128", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "outputs": [], - "source": [ - "# =======================================================\n", - "# Parameters\n", - "# =======================================================\n", - "algos = [\n", - " 'kde',\n", - " 'mom-kde', # Median-of-Means\n", - " 'rkde-huber', # robust KDE with huber loss\n", - " 'rkde-hampel', # robust KDE with hampel loss\n", - "]\n", - "\n", - "dataset = 'house-prices'\n", - "dataset_options = {'neighborhood': 'CollgCr', 'anomaly_neighborhood': 'Edwards'}\n", - "\n", - "outlierprop_range = [0.01, 0.02, 0.03, 0.05, 0.07, 0.1, 0.2, 0.3, 0.4, 0.5]\n", - "kernel = 'gaussian'" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "da628e2f", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "outputs": [], - "source": [ - "auc_scores = perform_rkde_experiment(\n", - " algos,\n", - " dataset,\n", - " dataset_options,\n", - " outlierprop_range,\n", - " kernel,\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "280ed959", - "metadata": {}, - "outputs": [], - "source": [ - "fig, ax = plt.subplots(figsize=(7, 5))\n", - "for algo, algo_data in auc_scores.groupby('algo'):\n", - " x = algo_data.groupby('outlier_prop').mean().index\n", - " y = algo_data.groupby('outlier_prop').mean()['auc_anomaly']\n", - " ax.plot(x, y, 'o-', label=algo)\n", - "plt.legend()\n", - "plt.xlabel('outlier_prop')\n", - "plt.ylabel('auc_score')\n", - "plt.title('Comparison of rKDE against contamination')" - ] - }, - { - "cell_type": "markdown", - "id": "bd659a73", - "metadata": {}, - "source": [ - "Try using different neighborhoods for contamination. Which robust KDE algorithm performs better overall? Choose among the following options:\n", - "\n", - "OldTown, Veenker, Edwards, MeadowV, Somerst, NPkVill, BrDale, Gilbert, NridgHt, Sawyer, Blmngtn, Blueste\n", - "\n", - "You can also change the kernel type: gaussian, tophat, epechenikov, exponential, linear or cosine, " - ] - }, - { - "cell_type": "markdown", - "id": "76312d0d", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "# Summary\n", - "- Kernel density estimation is a non-parametric method to estimate a pdf from a sample.\n", - "- Bandwidth is the most important parameter.\n", - "- Converges to the true pdf if $n\\to\\infty$.\n", - " - Convergence exponentially depends on the dimension.\n", - "- KDE is sensitive to contamination:\n", - " - In a contaminated setting one can employ methods from robust statistics to obtain robust estimates.\n", - " \n", - "## Implementations\n", - "- Sklearn: [sklearn.neighbors.KernelDensity](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KernelDensity.html#sklearn.neighbors.KernelDensity)\n", - "- Statsmodels: [statsmodels.nonparametric.kernel_density.KDEMultivariate](https://www.statsmodels.org/dev/generated/statsmodels.nonparametric.kernel_density.KDEMultivariate.html)\n", - "- FastKDE: [link](https://pypi.org/project/fastkde/), offers automatic bandwidth and kernel selection." 
- ] - }, - { - "cell_type": "markdown", - "id": "71e46caa", - "metadata": { - "slideshow": { - "slide_type": "slide" - } - }, - "source": [ - "# Anomaly Detection via Isolation\n", - "**Idea:** An anomaly should allow \"simple\" descriptions that distinguish it from the rest of the data.\n", - "\n", - "- Descriptions: Conjunction of single attribute tests, i.e.\n", - " $X_i \\leq c$ or $X_i > c$.\n", - "- Example: $X_1 \\leq 1.2 \\text{ and } X_5 > -3.4 \\text{ and }\tX_7 \\leq 5.6$.\n", - "- Complexity of description: Number of conjunctions.\n", - "\n", - "Moreover, we assume that a short random descriptions will have a significantly larger chance of isolating an anomaly\n", - "than isolating any nominal point.\n", - "\n", - "- Choose random isolating descriptions and compute anomaly score from average complexity." - ] - }, - { - "cell_type": "markdown", - "id": "f9ce4a9f", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "## Isolation Tree\n", - "Isolation Forest (iForest) implements this idea by generating an ensemble of random decision trees.\n", - "Each tree is built as follows:\n", - "\n", - "---\n", - "**Input:** Data set (subsample) $X$, maximal height $h$\n", - "- Randomly choose feature $i$ and split value $s$ (in range of data)\n", - "- Recursively build subtrees on $X_L = \\{x\\in X\\mid x_i \\leq s\\}$ and $X_R = X\\setminus X_L$\n", - "- Stop if remaining data set $ \\leq 1$ or maximal height reached\n", - "- Store test $x_i\\leq s$ for inner nodes and $|X|$ for leaf nodes\n", - "\n", - "---" - ] - }, - { - "cell_type": "markdown", - "id": "16dabe54", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "## Visualization\n", - "
\n", - "\n", - " Isolation Tree as Partition Diagram\n", - "
" - ] - }, - { - "cell_type": "markdown", - "id": "39fc2c5b", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "## Isolation Depth\n", - "\n", - "---\n", - "**Input:** Observation $x$\n", - "- ${\\ell} = $ length of path from root to leaf according to tests\n", - "- ${n} = $ size of remaining data set in leaf node\n", - "- ${c(n)} =$ expected length of a path in a BST with $n$ nodes $={O}(\\log n)$\n", - "- ${h(x)} = \\ell + c(n)$\n", - "\n", - "---\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
Isolation Depth of Outlier (red) and nominal (blue)
\n" - ] - }, - { - "cell_type": "markdown", - "id": "4659e094", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "## Isolation Forest\n", - "- Train $k$ isolation trees on subsamples of size $N$" - ] - }, - { - "cell_type": "markdown", - "id": "2423122d", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
Isolation depth of nominal point (left) and outlier (right)
" - ] - }, - { - "cell_type": "markdown", - "id": "f00a372e", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "source": [ - "# Variants of Isolation Forest" - ] - }, - { - "cell_type": "markdown", - "id": "702f9985", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "source": [ - "## Variant: Random Robust Cut Forest\n", - "**New Rule to Choose Split Test:**\n", - "- $\\ell_i$: length of the $i$th component of the bounding box around current data set\n", - "- Choose dimension $i$ with probability $\\frac{\\ell_i}{\\sum_j \\ell_j}$\n", - "- More robust against \"noise dimensions\"" - ] - }, - { - "cell_type": "markdown", - "id": "ef0a2568", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "source": [ - "
\n", - "\n", - "
" - ] - }, - { - "cell_type": "markdown", - "id": "ce7f67a8", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "source": [ - "## Variant: Extended Isolation Forest\n", - "**New split criterion:**\n", - "- Uniformly choose a normal and an orthogonal hyperplane through the data\n", - "- Removes a bias that was empirically observed when plotting the outlier score of iForest on low dimensional data sets\n", - "\n", - "
\n", - "\n", - "
" - ] - }, - { - "cell_type": "markdown", - "id": "f144c10d", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "# Exercise: Network Security\n", - "\n", - "In the final exercise of today you will have to develop an anomaly detection system for network traffic.\n", - "\n", - "## Briefing\n", - "A large e-commerce company __A__ is experiencing downtime due to attacks on their infrastructure.\n", - "You were instructed to develop a system that can detect malicious connections to the infrastructure.\n", - "It is planned that suspicious clients will be banned.\n", - "\n", - "Another data science team already prepared the connection data of the last year for you. They also separated a test set and manually identified and labeled attacks in that data.\n", - "\n", - "## The Data\n", - "We will work on a version of the classic KDD99 data set.\n", - "\n", - "### Kddcup 99 Data Set\n", - "----------------------------\n", - "The KDD Cup '99 dataset was created by processing the tcp dump portions\n", - "of the 1998 DARPA Intrusion Detection System (IDS) Evaluation dataset,\n", - "created by MIT Lincoln Lab [1]. The artificial data (described on the `dataset's\n", - "homepage `_) was\n", - "generated using a closed network and hand-injected attacks to produce a\n", - "large number of different types of attack with normal activity in the\n", - "background.\n", - "\n", - " ========================= ====================================================\n", - " Samples total 976158\n", - " Dimensionality 41\n", - " Features string (str), discrete (int), continuous (float)\n", - " Targets str, 'normal.' or name of the anomaly type\n", - " Proportion of Anomalies 1%\n", - " ========================= ====================================================\n", - "\n", - "----------------------------------\n", - "\n", - "## Task\n", - "You will have to develop the system on your own. In particular, you will have to\n", - "- Explore the data.\n", - "- Choose an algorithm.\n", - "- Find a good detection threshold.\n", - "- Evaluate and summarize your results.\n", - "- Estimate how much __A__ could save through the use of your system." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "91e31b17", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "outputs": [], - "source": [ - "X_train,X_test,y_test = get_kdd_data()" - ] - }, - { - "cell_type": "markdown", - "id": "9a47f0c1", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "source": [ - "# Explore Data" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5044d551", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "outputs": [], - "source": [ - "#\n", - "# Add your exploration code\n", - "#\n", - "X_train = pd.DataFrame(X_train)\n", - "X_test = pd.DataFrame(X_test)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f2f4d539", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "outputs": [], - "source": [ - "# get description\n", - "X_train.describe()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d1483fbf", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "outputs": [], - "source": [ - "# get better description\n", - "X_train.drop(columns=[1,2,3]).astype(float).describe()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8ac2ddd4", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "outputs": [], - "source": [ - "# Check for NaNs\n", - "print(\"Number of NaNs: {}\".format(X_train.isna().sum().sum()))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "763c76dc", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "outputs": [], - "source": [ - "#\n", - "# Add your preperation code here\n", - "#\n", - "\n", - "# Encode string features\n", - "binarizer = LabelBinarizer()\n", - "one_hots = None\n", - "one_hots_test = None\n", - "for i in [1, 2, 3]:\n", - " binarizer.fit(X_train[[i]].astype(str))\n", - " if one_hots is None:\n", - " one_hots = binarizer.transform(X_train[[i]].astype(str))\n", - " one_hots_test = binarizer.transform(X_test[[i]].astype(str))\n", - " else:\n", - " one_hots = np.concatenate([one_hots, binarizer.transform(X_train[[i]].astype(str))], axis=1)\n", - " one_hots_test = np.concatenate([one_hots_test, binarizer.transform(X_test[[i]].astype(str))], axis=1)\n", - "\n", - "X_train.drop(columns=[1,2,3], inplace=True)\n", - "X_train_onehot = pd.DataFrame(np.concatenate([X_train.values, one_hots], axis=1))\n", - "\n", - "X_test.drop(columns=[1,2,3], inplace=True)\n", - "X_test_onehot = pd.DataFrame(np.concatenate([X_test.values, one_hots_test], axis=1))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "1d363320", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "outputs": [], - "source": [ - "# Encode y\n", - "y_test_bin = np.where(y_test == b'normal.', 0, 1)\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ef0cf969", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "outputs": [], - "source": [ - "# Remove suspicious data\n", - "# This step is not strictly neccessary but can improve performance\n", - "suspicious = X_train_onehot.apply(lambda col: (col - col.mean()).abs() > 4 * col.std() if col.std() > 1 else False)\n", - "suspicious = suspicious.any(axis=1)# 4 sigma rule\n", - "print('filtering {} suspicious data points'.format(suspicious.sum()))\n", - "X_train_clean = X_train_onehot[~suspicious]" - ] - }, - { - "cell_type": "markdown", - "id": "201f1dfb", - "metadata": {}, - "source": [ - "# Summary\n", - "- Isolation Forest 
empirically shows very good performance up to relatively high dimensions\n", - "- It is relatively robust against contamination\n", - "- Usually little need for hyperparameter tuning\n", - "\n", - "## Implementations\n", - "- Sklearn: [sklearn.ensemble.IsolationForest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html)\n", - "- Extended Isolation Forest: [variant](https://github.com/sahandha/eif)\n", - "- Random Robust Cut Forest: [variant](https://github.com/kLabUM/rrcf)" - ] - }, - { - "cell_type": "markdown", - "id": "80805d83", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "source": [ - "# Choose Algorithm" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "13d462c9", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "outputs": [], - "source": [ - "# TODO: implement proper model selection\n", - "iforest = IsolationForest()\n", - "iforest.fit(X_train_clean)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "1877e13d", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "outputs": [], - "source": [ - "# find best threshold\n", - "X_test_onehot, X_val_onehot, y_test_bin, y_val_bin = train_test_split(X_test_onehot, y_test_bin, test_size=.5)\n", - "y_score = -iforest.score_samples(X_val_onehot).reshape(-1)" - ] - }, - { - "cell_type": "markdown", - "id": "7338159f", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "source": [ - "# Evaluate Solution" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "920afcf1", - "metadata": { - "hideCode": false, - "slideshow": { - "slide_type": "skip" - } - }, - "outputs": [], - "source": [ - "#\n", - "# Insert evaluation code\n", - "#\n", - "\n", - "# calculate scores if any anomaly is present\n", - "if np.any(y_val_bin == 1):\n", - " eval = evaluate(y_val_bin, y_score)\n", - " prec, rec, thr = eval['PR']\n", - " f1s = 2 * (prec * rec)/(prec + rec)\n", - " threshold = thr[np.argmax(f1s)]\n", - "\n", - " y_score = -iforest.score_samples(X_test_onehot).reshape(-1)\n", - " y_pred = np.where(y_score < threshold, 0, 1)\n", - "\n", - " print('Precision: {}'.format(metrics.precision_score(y_test_bin, y_pred)))\n", - " print('Recall: {}'.format(metrics.recall_score(y_test_bin, y_pred)))\n", - " print('F1: {}'.format(metrics.f1_score(y_test_bin, y_pred)))" - ] - }, - { - "cell_type": "markdown", - "id": "6287222a", - "metadata": { - "slideshow": { - "slide_type": "slide" - } - }, - "source": [ - "# Anomaly Detection via Reconstruction Error\n", - "**Idea:** Embed the data into low dimensional space and reconstruct it again.\n", - "\t\t\tGood embedding of nominal data $\\Rightarrow$ high reconstruction error indicates anomaly.\n", - "\n", - "**Autoencoder:**\n", - "- Parametric family of encoders: $f_\\phi: \\mathbb{R}^d \\to \\mathbb{R}^{\\text{low}}$\n", - "- Parametric family of decoders: $g_\\theta: \\mathbb{R}^{\\text{low}} \\to \\mathbb{R}^{d}$\n", - "- Reconstruction error of $(f_\\phi, g_\\theta)$ on $x$: $|x - g_\\theta(f_\\phi(x))|$\n", - "- Given data set $D$, find $\\phi,\\theta$ that minimize $\\sum_{x\\in D} L(|x- g_\\theta(f_\\phi(x))|) $\n", - " for some loss function $L$.\n" - ] - }, - { - "cell_type": "markdown", - "id": "ebb5094e", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "# Visualization\n", - "
\n", - "\n", - " Autoencoder Schema\n", - "
" - ] - }, - { - "cell_type": "markdown", - "id": "565a43e9", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "## Neural Networks\n", - "Neural networks are very well suited for finding low dimensional representations of data. Hence they are a popular choice for the encoder and the decoder.\n", - "\n", - "---\n", - "**Artificial Neuron with $N$ inputs:** $y = \\sigma\\left(\\sum_i^N w_i X_i + b\\right)$\n", - "\n", - "- $\\sigma$: nonlinear activation-function (applied component wise).\n", - "- $b$ bias\n", - "---" - ] - }, - { - "cell_type": "markdown", - "id": "e658847c", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
Isolation depth of nominal point and anomaly
" - ] - }, - { - "cell_type": "markdown", - "id": "09243719", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "# Neural Networks\n", - "\n", - "Neural networks combine many artificial neurons into a complex network. These networks are usually organized in layers\n", - "where the result of each layer is the input for the next layer. Some commonly used layers are:" - ] - }, - { - "cell_type": "markdown", - "id": "3376f048", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "
\n", - "\n", - "
" - ] - }, - { - "cell_type": "markdown", - "id": "d8794297", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "## Variational Autoencoders\n", - "An important extension of autoencoders that relates the idea to density estimation.\n", - "More precisely, we define a generative model for our data using latent variables and combine the maximum likelihood\n", - "estimation of the parameters with a simultaneous posterior estimation of the latents through amortized stochastic\n", - "variational inference. We use a decoder network to transform the latent variables into the data distribution, and an\n", - "encoder network to compute the posterior distribution of the latents given the data." - ] - }, - { - "cell_type": "markdown", - "id": "6b5efeac", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "---\n", - "**Definition:**\n", - "The model uses an observed variable $X$ (the data) and a latent variable $Z$ (the defining features of $X$). We assume\n", - "both $P(Z)$ and $P(X\\mid Z)$ to be normally distributed. More precisely\n", - "\n", - "- $P(Z) = \\mathcal{N}(0, I)$\n", - "- $P(X\\mid Z) = \\mathcal{N}(\\mu_\\phi(Z), I)$\n", - "\n", - "where $\\mu_\\phi$ is a neural network parametrized with $\\phi$.\n", - "We use variational inference to perform posterior inference on $Z$ given $X$. We assume that the distribution $P(Z\\mid X)$\n", - "to be relatively well approximated by a Gaussian and use the posterior approximation:\n", - "- $q(X\\mid Z) = \\mathcal{N}(\\mu_\\psi(X), \\sigma_\\psi(X))$\n", - "\n", - "$\\mu_\\psi$ and $\\sigma_\\psi$ are neural networks parameterized with $\\psi$\n", - "\n", - "---" - ] - }, - { - "cell_type": "markdown", - "id": "76d013d3", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "
\n", - "\n", - "
" - ] - }, - { - "cell_type": "markdown", - "id": "81661bf9", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "Given a data set $D$ we minimize the (amortized) Kullback-Leibler divergence between our posterior approximation and the\n", - "true posterior:\n", - "\\begin{align*}\n", - " D_{KL}(q(z\\mid x),p(z\\mid x)) &= E_{x\\sim X, z\\sim q(Z\\mid x)}\\left[\\log\\left(\\frac{q(z \\mid x)}{p(z \\mid X)}\\right)\\right] \\\\\n", - " &= E_{x\\sim X, z\\sim q(Z\\mid X)}\\left[\\log\\left(\\frac{q(z \\mid x)}{\\frac{p(x \\mid z)p(z)}{p(x)}}\\right)\\right] \\\\\n", - " &= E_{x\\sim X, z\\sim q(Z\\mid x)}\\left[\\log\\left(\\frac{q(z \\mid x)}{p(x \\mid z)p(z)}\\right) + \\log(p(x))\\right] \\\\\n", - " &= E_{x\\sim X, z\\sim q(Z\\mid x)}\\left[\\log\\left(\\frac{q(z \\mid x)}{p(x \\mid z)p(z)}\\right)\\right] + E_{x\\sim X}[\\log(p(x))]\\\\\n", - "\\end{align*}\n", - "\n", - "Now we can define\n", - "\n", - "\\begin{align*}\n", - " \\mathrm{ELBO}(q(z\\mid x),p(z\\mid x)) &:= E_{x\\sim X}[\\log(p(x))] - D_{KL}(q(z\\mid x),p(z\\mid x)) \\\\\n", - " &= -E_{x\\sim X, z\\sim q(Z\\mid x)}\\left[\\log\\left(\\frac{q(z \\mid x)}{p(x \\mid z)p(z)}\\right)\\right]\n", - "\\end{align*}\n", - "\n", - "Note that we can evaluate the expression inside the expectation of the final RHS of the\n", - "equation and we can obtain unbiased estmates of the expectation via sampling.\n", - "Let us further try to understand the ELBO as an optimization objective. On one hand, maximizing the ELBO with respect to the parameters in $q$ is equivalent to\n", - "minimizing the KL divergence between $p$ and $q$. On the other hand, maximizing the ELBO with\n", - "respect to the parameters in $p$ can be understood as raising a lower bound for the likelihood of the\n", - "generative model $p(x)$. Hence, the optimization tries to find an encoder and a decoder pair such that\n", - "it simultaneously provides a good generative explanation of the data and a good approximation of the posterior\n", - "distribution of the latent variables." - ] - }, - { - "cell_type": "markdown", - "id": "7131628c", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "# Exercise" - ] - }, - { - "cell_type": "markdown", - "id": "05f8c5a6", - "metadata": {}, - "source": [ - "# The MNIST Data Set\n", - "MNIST is one of the most iconic data sets in the history of machine learning.\n", - "It contains 70000 samples of $28\\times 28$ grayscale images of handwritten digits.\n", - "Because of its moderate complexity and good visualizability it is well suited to study the behavior of machine learning\n", - "algorithms in higher dimensional spaces.\n", - "\n", - "While originally created for classification (optical character recognition), we can build an anomaly detection data set\n", - "by corrupting some of the images.\n" - ] - }, - { - "cell_type": "markdown", - "id": "28979a57", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "source": [ - "# Pre-processing\n", - "We first need to obtain the MNIST data set and prepare an anomaly detection set from it.\n", - "Note that the data set is n row vector format.\n", - "Therefore, we work with $28\\times 28 = 784$ dimensional data points." 
- ] - }, - { - "cell_type": "markdown", - "id": "d2508225", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "source": [ - "## Load MNIST Data Set" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a95d8450", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "outputs": [], - "source": [ - "mnist = get_mnist_data()\n", - "\n", - "data = mnist['data']\n", - "print('data.shape: {}'.format(data.shape))\n", - "target = mnist['target'].astype(int)" - ] - }, - { - "cell_type": "markdown", - "id": "f6b6bde9", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "source": [ - "## Build contaminated Data Sets\n", - "We prepared a function that does the job for us.\n", - "It corrupts a prescribed portion of the data by introducing a rotation, noise or a blackout of some part of the image.\n", - "\n", - "First, we need to transform the data into image format." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "954d3762", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "outputs": [], - "source": [ - "X = data.reshape(-1, 28, 28, 1)/255" - ] - }, - { - "cell_type": "markdown", - "id": "f4d6c089", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "source": [ - "### Train/Test-Split\n", - "We will only corrupt the test set, hence we will perform the train-test split beforehand.\n", - "We separate a relatively small test set so that we can use as much as possible from the data to obtain high quality\n", - "representations." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a0fb9c57", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "outputs": [], - "source": [ - "test_size = .1\n", - "X_train, X_test, target_train, target_test = train_test_split(X, target, test_size=test_size)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d6ad3d43", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "outputs": [], - "source": [ - "X_test, y_test = build_contaminated_minst(X_test)\n", - "\n", - "# Visualize contamination\n", - "anomalies = X_test[y_test != 0]\n", - "selection = np.random.choice(len(anomalies), 25)\n", - "\n", - "fig, axes = plt.subplots(nrows=5, ncols=5, figsize=(5, 5))\n", - "for img, ax in zip(anomalies[selection], axes.flatten()):\n", - " ax.imshow(img, 'gray')\n", - " ax.axis('off')\n", - "plt.show()" - ] - }, - { - "cell_type": "markdown", - "id": "3c459a8a", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "source": [ - "# Autoencoder\n", - "Let us finally train an autoencoder model. We replicate the model given in the\n", - "[Keras documentation](https://keras.io/examples/generative/vae/) and apply it in a synthetic outlier detection scenario\n", - "based on MNIST.\n", - "\n", - "in the vae package we provide the implementation of the VAE. Please take a look into the source code to see how\n", - "the minimization of the KL divergence is implemented." 
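For orientation while reading `vae.py`, the following is a generic sketch of how such an objective is commonly implemented (as in the Keras VAE example): a per-image reconstruction term plus the closed-form KL divergence between $\mathcal{N}(\mu, \sigma^2)$ and $\mathcal{N}(0, I)$. The function below is an illustrative stand-in, not the code from `vae.py`.

```python
import tensorflow as tf

def vae_objective(x, x_reconstructed, z_mean, z_log_var):
    """Typical VAE training objective: reconstruction plus closed-form KL to N(0, I).

    Generic sketch of the standard formulation, not the code in vae.py.
    """
    # sum of squared errors per image as the reconstruction term
    reconstruction = tf.reduce_sum(tf.square(x - x_reconstructed), axis=[1, 2, 3])
    # KL( N(mu, sigma^2) || N(0, I) ) in closed form
    kl = -0.5 * tf.reduce_sum(1.0 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=1)
    return tf.reduce_mean(reconstruction + kl)
```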
- ] - }, - { - "cell_type": "markdown", - "id": "c1af1c41", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "source": [ - "## Create Model" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c90996a0", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "outputs": [], - "source": [ - "latent_dim = 3\n", - "vae = VAE(decoder=build_decoder_mnist(latent_dim=latent_dim), encoder=build_encoder_minst(latent_dim=latent_dim))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "efb89bdd", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "outputs": [], - "source": [ - "## Inspect model architecture\n", - "vae.encoder.summary()\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "68b219e9", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "outputs": [], - "source": [ - "## Inspect model architecture\n", - "vae.decoder.summary()\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "01b43aff", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "outputs": [], - "source": [ - "# train model\n", - "n_epochs = 30\n", - "\n", - "vae.compile(optimizer=keras.optimizers.Adam(learning_rate=.001))\n", - "history = vae.fit(X_train, epochs=n_epochs, batch_size=128)" - ] - }, - { - "cell_type": "markdown", - "id": "e1519875", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "source": [ - "## Inspect Result" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "80ab41fd", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "outputs": [], - "source": [ - "import matplotlib.pyplot as plt\n", - "\n", - "\n", - "def plot_latent_space(vae: VAE, n: int=10, figsize: float=10):\n", - " \"\"\"Plot sample images from 2D slices of latent space\n", - " \n", - " @param vae: vae model\n", - " @param n: sample nXn images per slice\n", - " @param figsize: figure size\n", - " \n", - " \"\"\"\n", - " for perm in [[0, 1, 2], [1, 2, 0], [2, 1, 0]]:\n", - " # display a n*n 2D manifold of digits\n", - " digit_size = 28\n", - " scale = 1.0\n", - " figure = np.zeros((digit_size * n, digit_size * n))\n", - " # linearly spaced coordinates corresponding to the 2D plot\n", - " # of digit classes in the latent space\n", - " grid_x = np.linspace(-scale, scale, n)\n", - " grid_y = np.linspace(-scale, scale, n)[::-1]\n", - "\n", - " for i, yi in enumerate(grid_y):\n", - " for j, xi in enumerate(grid_x):\n", - " z_sample = np.array([[xi, yi, 0]])\n", - " z_sample[0] = z_sample[0][perm]\n", - " x_decoded = vae.decoder.predict(z_sample)\n", - " digit = x_decoded[0].reshape(digit_size, digit_size)\n", - " figure[\n", - " i * digit_size : (i + 1) * digit_size,\n", - " j * digit_size : (j + 1) * digit_size,\n", - " ] = digit\n", - "\n", - " plt.figure(figsize=(figsize, figsize))\n", - " start_range = digit_size // 2\n", - " end_range = n * digit_size + start_range\n", - " pixel_range = np.arange(start_range, end_range, digit_size)\n", - " sample_range_x = np.round(grid_x, 1)\n", - " sample_range_y = np.round(grid_y, 1)\n", - " plt.xticks(pixel_range, sample_range_x)\n", - " plt.yticks(pixel_range, sample_range_y)\n", - " plt.xlabel(\"z[{}]\".format(perm[0]))\n", - " plt.ylabel(\"z[{}]\".format(perm[1]))\n", - " plt.gca().set_title('z[{}] = 0'.format(perm[2]))\n", - " plt.imshow(figure, cmap=\"Greys_r\")\n", - " plt.show()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "bdb0f67d", - "metadata": { - "slideshow": 
{ - "slide_type": "skip" - } - }, - "outputs": [], - "source": [ - "plot_latent_space(vae)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0d6a5b6f", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "outputs": [], - "source": [ - "# Principal components\n", - "pca = PCA()\n", - "latents = vae.encoder.predict(X_train)[2]\n", - "pca.fit(latents)\n", - "\n", - "kwargs = {'x_{}'.format(i): (-1., 1.) for i in range(latent_dim)}\n", - "\n", - "\n", - "@widgets.interact(**kwargs)\n", - "def explore_latent_space(**kwargs):\n", - " \"\"\"Widget to explore latent space from given start position\n", - " \"\"\"\n", - " center_img = pca.transform(np.zeros([1,latent_dim]))\n", - "\n", - " latent_rep_pca = center_img + np.array([[kwargs[key] for key in kwargs]])\n", - " latent_rep = pca.inverse_transform(latent_rep_pca)\n", - " img = vae.decoder(latent_rep).numpy().reshape(28, 28)\n", - "\n", - " fig, ax = plt.subplots()\n", - " ax.axis('off')\n", - " ax.axis('off')\n", - "\n", - " ax.imshow(img,cmap='gray', vmin=0, vmax=1)\n", - " plt.show()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6f9fb82f", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "outputs": [], - "source": [ - "latents = vae.encoder.predict(X_train)[2]\n", - "scatter = px.scatter_3d(x=latents[:, 0], y=latents[:, 1], z=latents[:, 2], color=target_train)\n", - "\n", - "scatter.show()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ea370a83", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "outputs": [], - "source": [ - "latents = vae.encoder.predict(X_test)[2]\n", - "scatter = px.scatter_3d(x=latents[:, 0], y=latents[:, 1], z=latents[:, 2], color=y_test)\n", - "\n", - "scatter.show()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "dee0a98e", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "outputs": [], - "source": [ - "X_test, X_val, y_test, y_val = train_test_split(X_test, y_test)\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "65c957f8", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "outputs": [], - "source": [ - "n_samples = 10\n", - "\n", - "s = np.random.choice(range(len(X_val)), n_samples)\n", - "s = X_val[s]\n", - "#s = [X_train_img[i] for i in s]\n", - "\n", - "fig, axes = plt.subplots(nrows=2, ncols=n_samples, figsize=(10, 2))\n", - "for img, ax_row in zip(s, axes.T):\n", - " x = vae.decoder.predict(vae.encoder.predict(img.reshape(1, 28, 28, 1))[2]).reshape(28, 28)\n", - " diff = x - img.reshape(28, 28)\n", - " error = (diff * diff).sum()\n", - " ax_row[0].axis('off')\n", - " ax_row[1].axis('off')\n", - " ax_row[0].imshow(img,cmap='gray', vmin=0, vmax=1)\n", - " ax_row[1].imshow(x, cmap='gray', vmin=0, vmax=1)\n", - " ax_row[1].set_title('E={:.1f}'.format(error))\n", - "\n", - "plt.tight_layout()\n", - "plt.show()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "350edb6c", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "outputs": [], - "source": [ - "from sklearn import metrics\n", - "y_test_bin = y_test.copy()\n", - "y_test_bin[y_test != 0] = 1\n", - "y_val_bin = y_val.copy()\n", - "y_val_bin[y_val != 0] = 1\n", - "# Evaluate\n", - "reconstruction = vae.decoder.predict(vae.encoder(X_val)[2])\n", - "rerrors = (reconstruction - X_val).reshape(-1, 28*28)\n", - "rerrors = (rerrors * rerrors).sum(axis=1)\n", - "\n", - "# Let's calculate scores if any anomaly is 
present\n", - "if np.any(y_val_bin == 1):\n", - " eval = evaluate(y_val_bin.astype(int), rerrors.astype(float))\n", - " pr, rec, thr = eval['PR']\n", - " f1s = (2 * ((pr * rec)[:-1]/(pr + rec)[:-1]))\n", - " threshold = thr[np.argmax(f1s)]\n", - " print('Optimal threshold: {}'.format(threshold))\n", - "\n", - " reconstruction = vae.decoder.predict(vae.encoder(X_test)[2])\n", - " reconstruction_error = (reconstruction - X_test).reshape(-1, 28*28)\n", - " reconstruction_error = (reconstruction_error * reconstruction_error).sum(axis=1)\n", - "\n", - "\n", - " classification = (reconstruction_error > threshold).astype(int)\n", - "\n", - " print('Precision: {}'.format(metrics.precision_score(y_test_bin, classification)))\n", - " print('Recall: {}'.format(metrics.recall_score(y_test_bin, classification)))\n", - " print('F1: {}'.format(metrics.f1_score(y_test_bin, classification)))\n", - "\n", - " metrics.confusion_matrix(y_test_bin, classification)\n", - "else:\n", - " reconstruction_error = None\n" - ] - }, - { - "cell_type": "markdown", - "id": "c8c5568d", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "source": [ - "## Sort Data by Reconstruction Error" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9b304ec8", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "outputs": [], - "source": [ - "if reconstruction_error is not None:\n", - " combined = list(zip(X_test, reconstruction_error))\n", - " combined.sort(key = lambda x: x[1])\n" - ] - }, - { - "cell_type": "markdown", - "id": "555fd7f3", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "source": [ - "## Show Top Autoencoder Outliers" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d51d7a5c", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "outputs": [], - "source": [ - "if reconstruction_error is not None:\n", - " n_rows = 10\n", - " n_cols = 10\n", - " n_samples = n_rows*n_cols\n", - "\n", - " samples = [c[0] for c in combined[-n_samples:]]\n", - "\n", - " fig, axes = plt.subplots(nrows=n_rows, ncols=n_cols, figsize=(2*n_cols, 2*n_rows))\n", - " for img, ax in zip(samples, axes.reshape(-1)):\n", - " ax.axis('off')\n", - " ax.imshow(img.reshape((28,28)), cmap='gray', vmin=0, vmax=1)\n", - "\n", - " plt.show()\n" - ] - }, - { - "cell_type": "markdown", - "id": "85cd3f88", - "metadata": {}, - "source": [ - "# Summary\n", - "- Autoencoders are the most prominent reconstruction error based anomaly detection method.\n", - "- Can provide high quality results on high dimensional data.\n", - "- Architecture is highly adaptable to the data (fully connected, CNN, attention,...).\n", - "- Sensitive to contamination.\n", - "- Variational autoencoder are an important variant the improves the interpretability of the latent space.\n", - "\n", - "## Implementations\n", - "- Keras: see vae.py or [here](https://keras.io/examples/generative/vae/)\n", - "- Pytorch: [example implementation](https://colab.research.google.com/github/smartgeometry-ucl/dl4g/blob/master/variational_autoencoder.ipynb)\n", - "- Pyro (pytorch based probabilistic programming language): [example implementation](https://pyro.ai/examples/vae.html)" - ] - }, - { - "cell_type": "markdown", - "id": "b1dd207a", - "metadata": { - "slideshow": { - "slide_type": "slide" - } - }, - "source": [ - "\"Snow\"" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "67dab0f6", - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - 
"celltoolbar": "Hide code", - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.16" - }, - "toc": { - "base_numbering": 1, - "nav_menu": {}, - "number_sections": true, - "sideBar": true, - "skip_h1_title": false, - "title_cell": "Table of Contents", - "title_sidebar": "Contents", - "toc_cell": false, - "toc_position": {}, - "toc_section_display": true, - "toc_window_display": false - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/notebooks/nb_02_anomaly_detection_via_density_estimation.ipynb b/notebooks/nb_02_anomaly_detection_via_density_estimation.ipynb new file mode 100644 index 0000000..9dfbe29 --- /dev/null +++ b/notebooks/nb_02_anomaly_detection_via_density_estimation.ipynb @@ -0,0 +1,1166 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "id": "c634d79e", + "metadata": { + "hide_input": true, + "init_cell": true, + "slideshow": { + "slide_type": "skip" + }, + "tags": [ + "remove-input", + "remove-output", + "remove-input-nbconv", + "remove-output-nbconv" + ] + }, + "outputs": [], + "source": [ + "%%capture\n", + "\n", + "%load_ext autoreload\n", + "%autoreload 2\n", + "%matplotlib inline\n", + "%load_ext tfl_training_anomaly_detection" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "596df825", + "metadata": { + "hide_input": true, + "init_cell": true, + "slideshow": { + "slide_type": "skip" + }, + "tags": [ + "remove-input", + "remove-input-nbconv" + ] + }, + "outputs": [], + "source": [ + "%presentation_style" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "158112af", + "metadata": { + "hide_input": true, + "init_cell": true, + "slideshow": { + "slide_type": "skip" + }, + "tags": [ + "remove-input", + "remove-output", + "remove-input-nbconv", + "remove-output-nbconv" + ] + }, + "outputs": [], + "source": [ + "%%capture\n", + "\n", + "%set_random_seed 12" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "97c8e783", + "metadata": { + "hide_input": true, + "init_cell": true, + "slideshow": { + "slide_type": "skip" + }, + "tags": [ + "remove-input-nbconv", + "remove-cell" + ] + }, + "outputs": [], + "source": [ + "%load_latex_macros" + ] + }, + { + "cell_type": "markdown", + "id": "7cc2f648", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "# Anomaly Detection via Density Estimation\n", + "\"Snow\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "import numpy as np\n", + "import itertools as it\n", + "from tqdm import tqdm\n", + "\n", + "import matplotlib\n", + "from matplotlib import pyplot as plt\n", + "import plotly.express as px\n", + "import pandas as pd\n", + "\n", + "import ipywidgets as widgets\n", + "\n", + "from tfl_training_anomaly_detection.exercise_tools import evaluate, get_kdd_data, get_house_prices_data, create_distributions, contamination, \\\n", + "perform_rkde_experiment, get_mnist_data\n", + "\n", + "from ipywidgets import interact\n", + "\n", + "from sklearn.metrics import roc_auc_score, average_precision_score\n", + "from sklearn.model_selection import RandomizedSearchCV\n", + "from sklearn.preprocessing import MinMaxScaler\n", + "from sklearn.preprocessing import 
LabelBinarizer\n", + "from sklearn.ensemble import IsolationForest\n", + "from sklearn import metrics\n", + "from sklearn.model_selection import train_test_split\n", + "from sklearn.decomposition import PCA\n", + "from sklearn.neighbors import KernelDensity\n", + "\n", + "from tfl_training_anomaly_detection.vae import VAE, build_decoder_mnist, build_encoder_minst, build_contaminated_minst\n", + "\n", + "from tensorflow import keras\n", + "\n", + "%matplotlib inline\n", + "matplotlib.rcParams['figure.figsize'] = (5, 5)\n" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "markdown", + "id": "cb85606a", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "# Anomaly Detection via Density Estimation\n", + "**Idea:** Estimate the density of $F_0$. Areas of low density are anomalous.\n", + "- Often $p$ is too small to estimate complete mixture model\n", + "- Takes into account that $F_1$ might not be well-defined\n", + "- Estimation procedure needs to be robust against contamination if no clean training data is available" + ] + }, + { + "cell_type": "markdown", + "id": "ad028f22", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "## Kernel Density Estimation\n", + "- Non-parametric method\n", + "- Can represent almost arbitrarily shaped densities\n", + "- Each training point \"spreads\" a fraction of the probability mass as specified by the kernel function" + ] + }, + { + "cell_type": "markdown", + "id": "8a5f8585", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "**Definition**\n", + "\n", + "---\n", + "\n", + "\n", + "**Definition:**\n", + "- $K: \\mathbb{R} \\to \\mathbb{R}$ kernel function\n", + " - $K(r) \\geq 0$ for all $r\\in \\mathbb{R}$\n", + "\t- $\\int_{-\\infty}^{\\infty} K(r) dr = 1$\n", + "- $h > 0$ bandwidth\n", + "- Bandwidth is the most crucial parameter\n", + "---\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "9847bb69", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "**Definition:**\n", + "\n", + "---\n", + "Let $D = \\{x_1,\\ldots,x_N\\}\\subset \\mathbb{R}^p$. The KDE with kernel $K$ and bandwidth $h$ is\n", + "$KDE_h(x, D) = \\frac{1}{N}\\sum_{i=1}^N \\frac{1}{h^p}K\\left(\\frac{|x-x_i|}{h}\\right)$\n", + "\n", + "---\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Effect of bandwidth and kernel
" + ] + }, + { + "cell_type": "markdown", + "id": "434d0f3e", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "### Exercise\n", + "Play with the parameters!" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "10d29d49", + "metadata": { + "hideCode": false + }, + "outputs": [], + "source": [ + "dists = create_distributions(dim=2, dim_irrelevant=0)\n", + "\n", + "sample_train = dists['Double Blob'].sample(500)\n", + "X_train = sample_train[-1]\n", + "y_train = [0]*len(X_train)\n", + "\n", + "plt.scatter(X_train[:,0], X_train[:,1], c = 'blue', s=10)\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "639512ea", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [], + "source": [ + "# Helper function\n", + "def fit_kde(kernel: str, bandwidth: float, X_train: np.array) -> KernelDensity:\n", + " \"\"\" Fit KDE\n", + " \n", + " @param kernel: kernel\n", + " @param bandwidth: bandwidth\n", + " @param x_train: data\n", + " \"\"\"\n", + " kde = KernelDensity(kernel=kernel, bandwidth=bandwidth)\n", + " kde.fit(X_train)\n", + " return kde\n", + "\n", + "def visualize_kde(kde: KernelDensity, bandwidth: float, X_test: np.array, y_test: np.array) -> None:\n", + " \"\"\"Plot KDE\n", + " \n", + " @param kde: KDE\n", + " @param bandwidth: bandwidth\n", + " @param X_test: test data\n", + " @param y_test: test label\n", + " \"\"\"\n", + " fig, axis = plt.subplots(figsize=(5, 5))\n", + "\n", + " lin = np.linspace(-10, 10, 50)\n", + " grid_points = list(it.product(lin, lin))\n", + " ys, xs = np.meshgrid(lin, lin)\n", + " # The score function of sklearn returns log-densities\n", + " scores = np.exp(kde.score_samples(grid_points)).reshape(50, 50)\n", + " colormesh = axis.contourf(xs, ys, scores)\n", + " fig.colorbar(colormesh)\n", + " axis.set_title('Density Conturs (Bandwidth={})'.format(bandwidth))\n", + " axis.set_aspect('equal')\n", + " color = ['blue' if i ==0 else 'red' for i in y_test]\n", + " plt.scatter(X_test[:, 0], X_test[:, 1], c=color)\n", + " plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "d2fa638a", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "# Choose KDE Parameters" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "89e5ef25", + "metadata": { + "hideCode": false, + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "ker = None\n", + "bdw = None\n", + "@interact(\n", + " kernel=['gaussian', 'tophat', 'epanechnikov', 'exponential', 'linear', 'cosine'],\n", + " bandwidth=(.1, 10.)\n", + ")\n", + "def set_kde_params(kernel: str, bandwidth: float) -> None:\n", + " \"\"\"Helper funtion to set widget parameters\n", + " \n", + " @param kernel: kernel\n", + " @param bandwidth: bandwidth\n", + " \"\"\"\n", + " global ker, bdw\n", + "\n", + " ker = kernel\n", + " bdw = bandwidth" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "909c4c5c", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "kde = fit_kde(ker, bdw, X_train)\n", + "visualize_kde(kde, bdw, X_train, y_train)" + ] + }, + { + "cell_type": "markdown", + "id": "0ddc2d6c", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "## Bandwidth Selection\n", + "The bandwidth is the most important parameter of a KDE model. 
A wrongly adjusted value will lead to over- or\n", + "under-smoothing of the density curve.\n", + "\n", + "A common method to select a bandwidth is maximum log-likelihood cross validation.\n", + "$$h_{\\textrm{llcv}} = \\arg\\max_{h}\\frac{1}{k}\\sum_{i=1}^k\\sum_{y\\in D_i}\\log\\left(\\frac{k}{N(k-1)}\\sum_{x\\in D_{-i}}K_h(x, y)\\right)$$\n", + "where $D_{-i}$ is the data without the $i$th cross validation fold $D_i$." + ] + }, + { + "cell_type": "markdown", + "id": "df4068e5", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "# Exercises" + ] + }, + { + "cell_type": "markdown", + "id": "94c87f47", + "metadata": {}, + "source": [ + "ex no.1: Noisy sinusoidal" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0238d830", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [], + "source": [ + "# Generate example\n", + "dists = create_distributions(dim=2)\n", + "\n", + "distribution_with_anomalies = contamination(\n", + " nominal=dists['Sinusoidal'],\n", + " anomaly=dists['Blob'],\n", + " p=0.05\n", + ")\n", + "\n", + "# Train data\n", + "sample_train = dists['Sinusoidal'].sample(500)\n", + "X_train = sample_train[-1].numpy()\n", + "\n", + "# Test data\n", + "sample_test = distribution_with_anomalies.sample(500)\n", + "X_test = sample_test[-1].numpy()\n", + "y_test = sample_test[0].numpy()\n", + "\n", + "scatter = plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test)\n", + "handels, _ = scatter.legend_elements()\n", + "plt.legend(handels, ['Nominal', 'Anomaly'])\n", + "plt.gca().set_aspect('equal')\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "57d255dc", + "metadata": { + "solution2": "hidden", + "solution2_first": true + }, + "source": [ + "## TODO: Define the search space for the kernel and the bandwidth" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b02d0c88", + "metadata": { + "solution2": "hidden" + }, + "outputs": [], + "source": [ + "param_space = {\n", + " 'kernel': ['gaussian', 'tophat', 'epanechnikov', 'exponential', 'linear', 'cosine'], # Add available kernels\n", + " 'bandwidth': np.linspace(0.1, 10, 100), # Define Search space for bandwidth parameter\n", + "}" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d095bdb2", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [], + "source": [ + "def hyperopt_by_score(X_train: np.array, param_space: dict, cv: int=5):\n", + " \"\"\"Performs hyperoptimization by score\n", + " \n", + " @param X_train: data\n", + " @param param_space: parameter space\n", + " @param cv: number of cv folds\n", + " \"\"\"\n", + " kde = KernelDensity()\n", + "\n", + " search = RandomizedSearchCV(\n", + " estimator=kde,\n", + " param_distributions=param_space,\n", + " n_iter=100,\n", + " cv=cv,\n", + " scoring=None # use estimators internal scoring function, i.e. the log-probability of the validation set for KDE\n", + " )\n", + "\n", + " search.fit(X_train)\n", + " return search.best_params_, search.best_estimator_" + ] + }, + { + "cell_type": "markdown", + "id": "79ed34cc", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "Run the code below to perform hyperparameter optimization." 
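To connect the cross-validation formula above to code, here is a small didactic re-implementation of likelihood cross-validation for the bandwidth. It assumes a Gaussian kernel and toy data and is only a sketch; the notebook itself relies on `hyperopt_by_score` with `RandomizedSearchCV`, which uses the same held-out log-likelihood as its score.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KernelDensity

def llcv_bandwidth(X, bandwidths, k=5):
    """Return the bandwidth maximizing the k-fold held-out log-likelihood."""
    mean_scores = []
    for h in bandwidths:
        fold_scores = []
        for train_idx, val_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
            kde = KernelDensity(kernel="gaussian", bandwidth=h).fit(X[train_idx])
            fold_scores.append(kde.score_samples(X[val_idx]).mean())
        mean_scores.append(np.mean(fold_scores))
    return bandwidths[int(np.argmax(mean_scores))]

# toy usage
X_toy = np.random.default_rng(0).normal(size=(300, 2))
print(llcv_bandwidth(X_toy, np.linspace(0.1, 2.0, 20)))
```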
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "01513b81", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [], + "source": [ + "params, kde = hyperopt_by_score(X_train, param_space)\n", + "\n", + "print('Best parameters:')\n", + "for key in params:\n", + " print('{}: {}'.format(key, params[key]))\n", + "\n", + "test_scores = -kde.score_samples(X_test)\n", + "test_scores = np.where(test_scores == np.inf, np.max(test_scores[np.isfinite(test_scores)])+1, test_scores)\n", + "\n", + "curves = evaluate(y_test, test_scores)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ec8cf537", + "metadata": {}, + "outputs": [], + "source": [ + "visualize_kde(kde, params['bandwidth'], X_test, y_test)" + ] + }, + { + "cell_type": "markdown", + "id": "2d598cda", + "metadata": {}, + "source": [ + "### Exercise: Isolate anomalies in house prices" + ] + }, + { + "cell_type": "markdown", + "id": "965b4783", + "metadata": {}, + "source": [ + "You are a company resposible to estimate house prices around Ames, Iowa, specifically around college area. But there is a problem: houses from a nearby area, 'Veenker', are often included in your dataset. You want to build an anomaly detection algorithm that filters one by one every point that comes from the wrong neighborhood. You have been able to isolate an X_train dataset which, you are sure, contains only houses from College area. Following the previous example, test your ability to isolate anomalies in new incoming data (X_test) with KDE.\n", + "\n", + "Advanced exercise:\n", + "What happens if the contamination comes from other areas? You can choose among the following names:\n", + "\n", + "OldTown, Veenker, Edwards, MeadowV, Somerst, NPkVill, BrDale, Gilbert, NridgHt, Sawyer, Blmngtn, Blueste" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1fd5a1af", + "metadata": {}, + "outputs": [], + "source": [ + "X_train, X_test, y_test = get_house_prices_data(neighborhood = 'CollgCr', anomaly_neighborhood='Veenker')\n", + "X_train" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "288e044a", + "metadata": {}, + "outputs": [], + "source": [ + "# Total data\n", + "train_test_data = X_train.append(X_test, ignore_index=True)\n", + "y_total = [0] * len(X_train) + y_test\n", + "\n", + "fig = px.scatter_3d(train_test_data, x='LotArea', y='OverallCond', z='SalePrice', color=y_total)\n", + "\n", + "fig.show()" + ] + }, + { + "cell_type": "markdown", + "id": "dfc80809", + "metadata": {}, + "source": [ + "### Solution" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b336ebbd", + "metadata": { + "hideCode": false, + "hidePrompt": false + }, + "outputs": [], + "source": [ + "# When data are highly in-homogeneous, like in this case, it is often beneficial \n", + "# to rescale them before applying any anomaly detection or clustering technique.\n", + "scaler = MinMaxScaler()\n", + "X_train_rescaled = scaler.fit_transform(X_train)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0159cf66", + "metadata": { + "hideCode": false + }, + "outputs": [], + "source": [ + "param_space = {\n", + " 'kernel': ['gaussian', 'tophat', 'epanechnikov', 'exponential', 'linear', 'cosine'], # Add available kernels\n", + " 'bandwidth': np.linspace(0.1, 10, 100), # Define Search space for bandwidth parameter\n", + "}\n", + "params, kde = hyperopt_by_score(X_train_rescaled, param_space)" + ] + }, + { + "cell_type": "code", + "execution_count": null, 
+ "id": "c6217804", + "metadata": { + "hideCode": false + }, + "outputs": [], + "source": [ + "print('Best parameters:')\n", + "for key in params:\n", + " print('{}: {}'.format(key, params[key]))\n", + "\n", + "X_test_rescaled = scaler.transform(X_test)\n", + "test_scores = -kde.score_samples(X_test_rescaled)\n", + "test_scores = np.where(test_scores == np.inf, np.max(test_scores[np.isfinite(test_scores)])+1, test_scores)\n", + "curves = evaluate(y_test, test_scores)" + ] + }, + { + "cell_type": "markdown", + "id": "ad2238bb", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "## The Curse of Dimensionality\n", + "The flexibility of KDE comes at a price. The dependency on the dimensionality of the data is quite unfavorable.\n", + "\n", + "\n", + "*Theorem* [Stone, 1982]\n", + "Any estimator that is consistent$^*$ with the class of all $k$-fold differentiable pdfs over $\\mathbb{R}^d$ has a\n", + "convergence rate of at most\n", + "\n", + "$$\n", + "\\frac{1}{n^{\\frac{k}{2k+d}}}\n", + "$$\n", + "\n", + "\n", + "\n", + "\n", + "$^*$Consistency = for all pdfs $p$ in the class: $\\lim_{n\\to\\infty}|KDE_h(x, D) - p(x)|_\\infty = 0$ with probability $1$." + ] + }, + { + "cell_type": "markdown", + "id": "65268b84", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "# Exercise\n", + "- The very slow convergence in high dimensions does not necessary mean that we will see bad results in high dimensional anomaly detection with KDE.\n", + "- Especially if the anomalies are very outlying.\n", + "- However, in cases where contours of the nominal distribution are non-convex we can run into problems.\n", + "\n", + "We take a look at a higher dimensional version of out previous data set." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "75d8b1a5", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [], + "source": [ + "dists = create_distributions(dim=3)\n", + "\n", + "distribution_with_anomalies = contamination(\n", + " nominal=dists['Sinusoidal'],\n", + " anomaly=dists['Blob'],\n", + " p=0.01\n", + ")\n", + "\n", + "sample = distribution_with_anomalies.sample(500)\n", + "\n", + "y = sample[0]\n", + "X = sample[-1]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "44acf871", + "metadata": {}, + "outputs": [], + "source": [ + "fig = px.scatter_3d(x=X[:, 0], y=X[:, 1], z=X[:, 2], color=y)\n", + "fig.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "58b7d48b", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [], + "source": [ + "# Fit KDE on high dimensional examples \n", + "rocs = []\n", + "auprs = []\n", + "bandwidths = []\n", + "\n", + "param_space = {\n", + " 'kernel': ['gaussian'],\n", + " 'bandwidth': np.linspace(0.1, 100, 1000), # Define Search space for bandwidth parameter\n", + " }\n", + "\n", + "kdes = {}\n", + "dims = np.arange(2,16)\n", + "for d in tqdm(dims):\n", + " # Generate d dimensional distributions\n", + " dists = create_distributions(dim=d)\n", + "\n", + " distribution_with_anomalies = contamination(\n", + " nominal=dists['Sinusoidal'],\n", + " anomaly=dists['Blob'],\n", + " p=0\n", + " )\n", + "\n", + " # Train on clean data\n", + " sample_train = dists['Sinusoidal'].sample(500)\n", + " X_train = sample_train[-1].numpy()\n", + " # Test data\n", + " sample_test = distribution_with_anomalies.sample(500)\n", + " X_test = sample_test[-1].numpy()\n", + " y_test = sample_test[0].numpy()\n", + 
"\n", + " # Optimize bandwidth\n", + " params, kde = hyperopt_by_score(X_train, param_space)\n", + " kdes[d] = (params, kde)\n", + " \n", + " bandwidths.append(params['bandwidth'])\n", + "\n", + " test_scores = -kde.score_samples(X_test)\n", + " test_scores = np.where(test_scores == np.inf, np.max(test_scores[np.isfinite(test_scores)])+1, test_scores)\n", + "\n", + " " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cc679493", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [], + "source": [ + "# Plot cross section of pdf \n", + "fig, axes = plt.subplots(nrows=2, ncols=7, figsize=(15, 5))\n", + "for d, axis in tqdm(list(zip(kdes, axes.flatten()))):\n", + " \n", + " params, kde = kdes[d]\n", + "\n", + " lin = np.linspace(-10, 10, 50)\n", + " grid_points = list(it.product(*([[0]]*(d-2)), lin, lin))\n", + " ys, xs = np.meshgrid(lin, lin)\n", + " # The score function of sklearn returns log-densities\n", + " scores = np.exp(kde.score_samples(grid_points)).reshape(50, 50)\n", + " colormesh = axis.contourf(xs, ys, scores)\n", + " axis.set_title(\"Dim = {}\".format(d))\n", + " axis.set_aspect('equal')\n", + " \n", + "\n", + "# Plot evaluation\n", + "print('Crossection of the KDE at (0,...,0, x, y)')\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "020500ed", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "## Robustness\n", + "Another drawback of KDE in the context of anomaly detection is that it is not robust against contamination of the data\n", + "\n", + "\n", + "**Definition**\n", + "The *breakdown point* of an estimator is the smallest fraction of observations that need to be changed so that we can\n", + "move the estimate arbitrarily far away from the true value.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "74cf4c13", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "**Example**: The sample mean has a breakdown point of $0$. Indeed, for a sample of $x_1,\\ldots, x_n$ we only need to\n", + "change a single value in order to move the sample mean in any way we want. That means that the breakdown point is\n", + "smaller than $\\frac{1}{n}$ for every $n\\in\\mathbb{N}$." + ] + }, + { + "cell_type": "markdown", + "id": "4efce06e", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "## Robust Statistics\n", + "There are robust replacements for the sample mean:\n", + "- Median of means: Split the dataset into $S$ equally sized subsets $X_1,\\ldots, X_S$ and compute\n", + "$\\mathrm{median}(\\overline{X_1},\\ldots, \\overline{X_S})$\n", + "- M-estimation: The mean in a normed vector space is the value that minimizes the squared distances\n", + "
\n", + "$\\overline{X} = \\min_{y}\\sum_{x\\in X}|x-y|^2$\n", + "
\n", + "M-estimation replaces the quadratic loss with a more robust loss function." + ] + }, + { + "cell_type": "markdown", + "id": "ac5655c6", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "### Huber loss\n", + "Switch from quadratic to linear loss at prescribed threshold" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b6861d26", + "metadata": { + "hideCode": false, + "slideshow": { + "slide_type": "-" + } + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "\n", + "\n", + "def huber(error: float, threshold: float):\n", + " \"\"\"Huber loss\n", + " \n", + " @param error: base error\n", + " @param threshold: threshold for linear transition\n", + " \"\"\"\n", + " test = (np.abs(error) <= threshold)\n", + " return (test * (error**2)/2) + ((1-test)*threshold*(np.abs(error) - threshold/2))\n", + "\n", + "x = np.linspace(-5, 5)\n", + "y = huber(x, 1)\n", + "\n", + "plt.plot(x, y)\n", + "plt.gca().set_title(\"Huber Loss\")\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "3017109f", + "metadata": {}, + "source": [ + "### Hampel loss\n", + "More complex loss function. Depends on 3 parameters 0 < a < b< r" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "69f95fad", + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "\n", + "def single_point_hampel(error: float, a: float, b: float, r: float):\n", + " \"\"\"Hampel loss\n", + " \n", + " @param error: base error\n", + " @param a: 1st threshold parameter\n", + " @param b: 2nd threshold parameter\n", + " @param r: 3rd threshold parameter\n", + " \"\"\"\n", + " if abs(error) <= a:\n", + " return error**2/2\n", + " elif a < abs(error) <= b:\n", + " return (1/2 *a**2 + a* (abs(error)-a))\n", + " elif b < abs(error) <= r:\n", + " return a * (2*b-a+(abs(error)-b) * (1+ (r-abs(error))/(r-b)))/2\n", + " else:\n", + " return a*(b-a+r)/2\n", + "\n", + "hampel = np.vectorize(single_point_hampel)\n", + "\n", + "x = np.linspace(-10.1, 10.1)\n", + "y = hampel(x, a=1.5, b=3.5, r=8)\n", + "\n", + "plt.plot(x, y)\n", + "plt.gca().set_title(\"Hampel Loss\")\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "c4ed645c", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "## KDE is a Mean\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "**Kernel as scalar product:**\n", + "\n", + "- Let $K$ be a radial monotonic$^\\ast$ kernel over $\\mathbb{R}^n$.\n", + "- For $x\\in\\mathbb{R}^n$ let $\\phi_x = K(\\cdot, x)$.\n", + "- Vector space over the linear span of $\\{\\phi_x \\mid x\\in\\mathbb{R}^n\\}$:\n", + " - Pointwise addition and scalar multiplication.\n", + "- Define the scalar product $\\langle \\phi_x, \\phi_y\\rangle = K(x,y)$.\n", + "- Advantage: Scalar product is computable\n", + "- Call this the reproducing kernel Hilbert space (RKHS) of $K$.\n", + "- $\\mathrm{KDE}_h(\\cdot, D) = \\frac{1}{N}\\sum_{i=1}^N K_h(\\cdot, x_i) = \\frac{1}{N}\\sum_{i=1}^N\\phi_{x_i}$\n", + " - where $K_h(x,y) = \\frac{1}{h}K\\left(\\frac{|x-y|}{h}\\right)$\n", + "\n", + "\n", + "\n", + "$^*$All kernels that we have seen are radial and monotonic" + ] + }, + { + "cell_type": "markdown", + "id": "aa0e6487", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "# Exercise\n", + "We compare the performance of different approaches to recover the nominal distribution under contamination.\n", + "Here, we use code by [Humbert et al.](https://github.com/lminvielle/mom-kde) to 
replicate\n", + "the results in the referenced paper on median-of-mean KDE. More details on rKDE can instead be found in this paper by [Kim and Scott.](https://arxiv.org/abs/1107.3133#:~:text=We%20propose%20a%20method%20for,ideas%20from%20classical%20M%2Destimation.)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "58243128", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [], + "source": [ + "# =======================================================\n", + "# Parameters\n", + "# =======================================================\n", + "algos = [\n", + " 'kde',\n", + " 'mom-kde', # Median-of-Means\n", + " 'rkde-huber', # robust KDE with huber loss\n", + " 'rkde-hampel', # robust KDE with hampel loss\n", + "]\n", + "\n", + "dataset = 'house-prices'\n", + "dataset_options = {'neighborhood': 'CollgCr', 'anomaly_neighborhood': 'Edwards'}\n", + "\n", + "outlierprop_range = [0.01, 0.02, 0.03, 0.05, 0.07, 0.1, 0.2, 0.3, 0.4, 0.5]\n", + "kernel = 'gaussian'" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "da628e2f", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [], + "source": [ + "auc_scores = perform_rkde_experiment(\n", + " algos,\n", + " dataset,\n", + " dataset_options,\n", + " outlierprop_range,\n", + " kernel,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "280ed959", + "metadata": {}, + "outputs": [], + "source": [ + "fig, ax = plt.subplots(figsize=(7, 5))\n", + "for algo, algo_data in auc_scores.groupby('algo'):\n", + " x = algo_data.groupby('outlier_prop').mean().index\n", + " y = algo_data.groupby('outlier_prop').mean()['auc_anomaly']\n", + " ax.plot(x, y, 'o-', label=algo)\n", + "plt.legend()\n", + "plt.xlabel('outlier_prop')\n", + "plt.ylabel('auc_score')\n", + "plt.title('Comparison of rKDE against contamination')" + ] + }, + { + "cell_type": "markdown", + "id": "bd659a73", + "metadata": {}, + "source": [ + "Try using different neighborhoods for contamination. Which robust KDE algorithm performs better overall? Choose among the following options:\n", + "\n", + "OldTown, Veenker, Edwards, MeadowV, Somerst, NPkVill, BrDale, Gilbert, NridgHt, Sawyer, Blmngtn, Blueste\n", + "\n", + "You can also change the kernel type: gaussian, tophat, epechenikov, exponential, linear or cosine, " + ] + }, + { + "cell_type": "markdown", + "id": "76312d0d", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "# Summary\n", + "- Kernel density estimation is a non-parametric method to estimate a pdf from a sample.\n", + "- Bandwidth is the most important parameter.\n", + "- Converges to the true pdf if $n\\to\\infty$.\n", + " - Convergence exponentially depends on the dimension.\n", + "- KDE is sensitive to contamination:\n", + " - In a contaminated setting one can employ methods from robust statistics to obtain robust estimates.\n", + " \n", + "## Implementations\n", + "- Sklearn: [sklearn.neighbors.KernelDensity](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KernelDensity.html#sklearn.neighbors.KernelDensity)\n", + "- Statsmodels: [statsmodels.nonparametric.kernel_density.KDEMultivariate](https://www.statsmodels.org/dev/generated/statsmodels.nonparametric.kernel_density.KDEMultivariate.html)\n", + "- FastKDE: [link](https://pypi.org/project/fastkde/), offers automatic bandwidth and kernel selection." 
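+    ,
+    "\n",
+    "\n",
+    "As a quick reference, a minimal sketch of KDE-based anomaly scoring with scikit-learn is shown below. The toy arrays and the bandwidth grid are made up for illustration; in the exercises above, the `hyperopt_by_score` helper plays a similar role to the grid search here.\n",
+    "\n",
+    "```python\n",
+    "import numpy as np\n",
+    "from sklearn.model_selection import GridSearchCV\n",
+    "from sklearn.neighbors import KernelDensity\n",
+    "\n",
+    "rng = np.random.default_rng(0)\n",
+    "X_train = rng.normal(size=(500, 2))   # toy nominal data\n",
+    "X_test = rng.normal(size=(100, 2))    # toy data to score\n",
+    "\n",
+    "# choose the bandwidth by cross-validated log-likelihood\n",
+    "grid = GridSearchCV(KernelDensity(kernel='gaussian'),\n",
+    "                    {'bandwidth': np.linspace(0.1, 2.0, 20)}, cv=5)\n",
+    "kde = grid.fit(X_train).best_estimator_\n",
+    "\n",
+    "# negative log-density as outlier score: larger means more anomalous\n",
+    "scores = -kde.score_samples(X_test)\n",
+    "```"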
+ ] + }, + { + "cell_type": "markdown", + "id": "b1dd207a", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "\"Snow\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "67dab0f6", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "celltoolbar": "Hide code", + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.16" + }, + "toc": { + "base_numbering": 1, + "nav_menu": {}, + "number_sections": true, + "sideBar": true, + "skip_h1_title": false, + "title_cell": "Table of Contents", + "title_sidebar": "Contents", + "toc_cell": false, + "toc_position": {}, + "toc_section_display": true, + "toc_window_display": false + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/notebooks/nb_03_anomaly_detection_via_isolation.ipynb b/notebooks/nb_03_anomaly_detection_via_isolation.ipynb new file mode 100644 index 0000000..e1bd35b --- /dev/null +++ b/notebooks/nb_03_anomaly_detection_via_isolation.ipynb @@ -0,0 +1,706 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "id": "c634d79e", + "metadata": { + "hide_input": true, + "init_cell": true, + "slideshow": { + "slide_type": "skip" + }, + "tags": [ + "remove-input", + "remove-output", + "remove-input-nbconv", + "remove-output-nbconv" + ] + }, + "outputs": [], + "source": [ + "%%capture\n", + "\n", + "%load_ext autoreload\n", + "%autoreload 2\n", + "%matplotlib inline\n", + "%load_ext tfl_training_anomaly_detection" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "596df825", + "metadata": { + "hide_input": true, + "init_cell": true, + "slideshow": { + "slide_type": "skip" + }, + "tags": [ + "remove-input", + "remove-input-nbconv" + ] + }, + "outputs": [], + "source": [ + "%presentation_style" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "158112af", + "metadata": { + "hide_input": true, + "init_cell": true, + "slideshow": { + "slide_type": "skip" + }, + "tags": [ + "remove-input", + "remove-output", + "remove-input-nbconv", + "remove-output-nbconv" + ] + }, + "outputs": [], + "source": [ + "%%capture\n", + "\n", + "%set_random_seed 12" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "97c8e783", + "metadata": { + "hide_input": true, + "init_cell": true, + "slideshow": { + "slide_type": "skip" + }, + "tags": [ + "remove-input-nbconv", + "remove-cell" + ] + }, + "outputs": [], + "source": [ + "%load_latex_macros" + ] + }, + { + "cell_type": "markdown", + "id": "7cc2f648", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "# Anomaly Detection via Isolation\n", + "\"Snow\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "import numpy as np\n", + "import itertools as it\n", + "from tqdm import tqdm\n", + "\n", + "import matplotlib\n", + "from matplotlib import pyplot as plt\n", + "import plotly.express as px\n", + "import pandas as pd\n", + "\n", + "import ipywidgets as widgets\n", + "\n", + "from tfl_training_anomaly_detection.exercise_tools import evaluate, get_kdd_data, get_house_prices_data, create_distributions, contamination, \\\n", + "perform_rkde_experiment, get_mnist_data\n", + "\n", + "from 
ipywidgets import interact\n", + "\n", + "from sklearn.metrics import roc_auc_score, average_precision_score\n", + "from sklearn.model_selection import RandomizedSearchCV\n", + "from sklearn.preprocessing import MinMaxScaler\n", + "from sklearn.preprocessing import LabelBinarizer\n", + "from sklearn.ensemble import IsolationForest\n", + "from sklearn import metrics\n", + "from sklearn.model_selection import train_test_split\n", + "from sklearn.decomposition import PCA\n", + "from sklearn.neighbors import KernelDensity\n", + "\n", + "from tfl_training_anomaly_detection.vae import VAE, build_decoder_mnist, build_encoder_minst, build_contaminated_minst\n", + "\n", + "from tensorflow import keras\n", + "\n", + "%matplotlib inline\n", + "matplotlib.rcParams['figure.figsize'] = (5, 5)\n" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "markdown", + "id": "71e46caa", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "# Anomaly Detection via Isolation\n", + "**Idea:** An anomaly should allow \"simple\" descriptions that distinguish it from the rest of the data.\n", + "\n", + "- Descriptions: Conjunction of single attribute tests, i.e.\n", + " $X_i \\leq c$ or $X_i > c$.\n", + "- Example: $X_1 \\leq 1.2 \\text{ and } X_5 > -3.4 \\text{ and }\tX_7 \\leq 5.6$.\n", + "- Complexity of description: Number of conjunctions.\n", + "\n", + "Moreover, we assume that a short random descriptions will have a significantly larger chance of isolating an anomaly\n", + "than isolating any nominal point.\n", + "\n", + "- Choose random isolating descriptions and compute anomaly score from average complexity." + ] + }, + { + "cell_type": "markdown", + "id": "f9ce4a9f", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "## Isolation Tree\n", + "Isolation Forest (iForest) implements this idea by generating an ensemble of random decision trees.\n", + "Each tree is built as follows:\n", + "\n", + "\n", + "**Input:** Data set (subsample) $X$, maximal height $h$\n", + "- Randomly choose feature $i$ and split value $s$ (in range of data)\n", + "- Recursively build subtrees on $X_L = \\{x\\in X\\mid x_i \\leq s\\}$ and $X_R = X\\setminus X_L$\n", + "- Stop if remaining data set $ \\leq 1$ or maximal height reached\n", + "- Store test $x_i\\leq s$ for inner nodes and $|X|$ for leaf nodes\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "16dabe54", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "## Visualization\n", + "
\n", + "\n", + " Isolation Tree as Partition Diagram\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "39fc2c5b", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "## Isolation Depth\n", + "Depth of an observation $x$ in an isolation tree is defined as the expected number of tests needed to isolate $x$.\n", + "\n", + "\n", + "**Input:** Observation $x$\n", + "- ${\\ell} = $ length of path from root to leaf according to tests\n", + "- ${n} = $ size of remaining data set in leaf node\n", + "- ${c(n)} =$ expected length of a path in a BST with $n$ nodes $={O}(\\log n)$\n", + "- ${h(x)} = \\ell + c(n)$\n", + "\n", + "\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Isolation Depth of Outlier (red) and nominal (blue)
\n" + ] + }, + { + "cell_type": "markdown", + "id": "4659e094", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "## Isolation Forest\n", + "- Train $k$ isolation trees on subsamples of size $N$" + ] + }, + { + "cell_type": "markdown", + "id": "2423122d", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Isolation depth of nominal point (left) and outlier (right)
" + ] + }, + { + "cell_type": "markdown", + "id": "f00a372e", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "# Variants of Isolation Forest" + ] + }, + { + "cell_type": "markdown", + "id": "702f9985", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "## Variant: Random Robust Cut Forest\n", + "**New Rule to Choose Split Test:**\n", + "- $\\ell_i$: length of the $i$th component of the bounding box around current data set\n", + "- Choose dimension $i$ with probability $\\frac{\\ell_i}{\\sum_j \\ell_j}$\n", + "- More robust against \"noise dimensions\"" + ] + }, + { + "cell_type": "markdown", + "id": "ef0a2568", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "
\n", + "\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "ce7f67a8", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "## Variant: Extended Isolation Forest\n", + "**New split criterion:**\n", + "- Uniformly choose a normal and an orthogonal hyperplane through the data\n", + "- Removes a bias that was empirically observed when plotting the outlier score of iForest on low dimensional data sets\n", + "\n", + "
\n", + "\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "f144c10d", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "# Exercise: Network Security\n", + "\n", + "In the final exercise of today you will have to develop an anomaly detection system for network traffic.\n", + "\n", + "## Briefing\n", + "A large e-commerce company __A__ is experiencing downtime due to attacks on their infrastructure.\n", + "You were instructed to develop a system that can detect malicious connections to the infrastructure.\n", + "It is planned that suspicious clients will be banned.\n", + "\n", + "Another data science team already prepared the connection data of the last year for you. They also separated a test set and manually identified and labeled attacks in that data.\n", + "\n", + "## The Data\n", + "We will work on a version of the classic KDD99 data set.\n", + "\n", + "### Kddcup 99 Data Set\n", + "======================\n", + "\n", + "The KDD Cup '99 dataset was created by processing the tcp dump portions\n", + "of the 1998 DARPA Intrusion Detection System (IDS) Evaluation dataset,\n", + "created by MIT Lincoln Lab [1]. The artificial data (described on the `dataset's\n", + "homepage `_) was\n", + "generated using a closed network and hand-injected attacks to produce a\n", + "large number of different types of attack with normal activity in the\n", + "background.\n", + "\n", + " ========================= ====================================================\n", + " Samples total 976158\n", + " Dimensionality 41\n", + " Features string (str), discrete (int), continuous (float)\n", + " Targets str, 'normal.' or name of the anomaly type\n", + " Proportion of Anomalies 1%\n", + " ========================= ====================================================\n", + "\n", + "\n", + "## Task\n", + "You will have to develop the system on your own. In particular, you will have to\n", + "- Explore the data.\n", + "- Choose an algorithm.\n", + "- Find a good detection threshold.\n", + "- Evaluate and summarize your results.\n", + "- Estimate how much __A__ could save through the use of your system." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "91e31b17", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [], + "source": [ + "X_train,X_test,y_test = get_kdd_data()" + ] + }, + { + "cell_type": "markdown", + "id": "9a47f0c1", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "# Explore Data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5044d551", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [], + "source": [ + "#\n", + "# Add your exploration code\n", + "#\n", + "X_train = pd.DataFrame(X_train)\n", + "X_test = pd.DataFrame(X_test)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f2f4d539", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [], + "source": [ + "# get description\n", + "X_train.describe()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d1483fbf", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [], + "source": [ + "# get better description\n", + "X_train.drop(columns=[1,2,3]).astype(float).describe()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8ac2ddd4", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [], + "source": [ + "# Check for NaNs\n", + "print(\"Number of NaNs: {}\".format(X_train.isna().sum().sum()))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "763c76dc", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [], + "source": [ + "#\n", + "# Add your preperation code here\n", + "#\n", + "\n", + "# Encode string features\n", + "binarizer = LabelBinarizer()\n", + "one_hots = None\n", + "one_hots_test = None\n", + "for i in [1, 2, 3]:\n", + " binarizer.fit(X_train[[i]].astype(str))\n", + " if one_hots is None:\n", + " one_hots = binarizer.transform(X_train[[i]].astype(str))\n", + " one_hots_test = binarizer.transform(X_test[[i]].astype(str))\n", + " else:\n", + " one_hots = np.concatenate([one_hots, binarizer.transform(X_train[[i]].astype(str))], axis=1)\n", + " one_hots_test = np.concatenate([one_hots_test, binarizer.transform(X_test[[i]].astype(str))], axis=1)\n", + "\n", + "X_train.drop(columns=[1,2,3], inplace=True)\n", + "X_train_onehot = pd.DataFrame(np.concatenate([X_train.values, one_hots], axis=1))\n", + "\n", + "X_test.drop(columns=[1,2,3], inplace=True)\n", + "X_test_onehot = pd.DataFrame(np.concatenate([X_test.values, one_hots_test], axis=1))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1d363320", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [], + "source": [ + "# Encode y\n", + "y_test_bin = np.where(y_test == b'normal.', 0, 1)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ef0cf969", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [], + "source": [ + "# Remove suspicious data\n", + "# This step is not strictly neccessary but can improve performance\n", + "suspicious = X_train_onehot.apply(lambda col: (col - col.mean()).abs() > 4 * col.std() if col.std() > 1 else False)\n", + "suspicious = suspicious.any(axis=1)# 4 sigma rule\n", + "print('filtering {} suspicious data points'.format(suspicious.sum()))\n", + "X_train_clean = X_train_onehot[~suspicious]" + ] + }, + { + "cell_type": "markdown", + "id": "201f1dfb", + "metadata": {}, + "source": [ + "# Summary\n", + "- Isolation Forest 
empirically shows very good performance up to relatively high dimensions\n", + "- It is relatively robust against contamination\n", + "- Usually little need for hyperparameter tuning\n", + "\n", + "## Implementations\n", + "- Sklearn: [sklearn.ensemble.IsolationForest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html)\n", + "- Extended Isolation Forest: [variant](https://github.com/sahandha/eif)\n", + "- Random Robust Cut Forest: [variant](https://github.com/kLabUM/rrcf)" + ] + }, + { + "cell_type": "markdown", + "id": "80805d83", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "# Choose Algorithm" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "13d462c9", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [], + "source": [ + "# TODO: implement proper model selection\n", + "iforest = IsolationForest()\n", + "iforest.fit(X_train_clean)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1877e13d", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [], + "source": [ + "# find best threshold\n", + "X_test_onehot, X_val_onehot, y_test_bin, y_val_bin = train_test_split(X_test_onehot, y_test_bin, test_size=.5)\n", + "y_score = -iforest.score_samples(X_val_onehot).reshape(-1)" + ] + }, + { + "cell_type": "markdown", + "id": "7338159f", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "# Evaluate Solution" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "920afcf1", + "metadata": { + "hideCode": false, + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [], + "source": [ + "#\n", + "# Insert evaluation code\n", + "#\n", + "\n", + "# calculate scores if any anomaly is present\n", + "if np.any(y_val_bin == 1):\n", + " eval = evaluate(y_val_bin, y_score)\n", + " prec, rec, thr = eval['PR']\n", + " f1s = 2 * (prec * rec)/(prec + rec)\n", + " threshold = thr[np.argmax(f1s)]\n", + "\n", + " y_score = -iforest.score_samples(X_test_onehot).reshape(-1)\n", + " y_pred = np.where(y_score < threshold, 0, 1)\n", + "\n", + " print('Precision: {}'.format(metrics.precision_score(y_test_bin, y_pred)))\n", + " print('Recall: {}'.format(metrics.recall_score(y_test_bin, y_pred)))\n", + " print('F1: {}'.format(metrics.f1_score(y_test_bin, y_pred)))" + ] + }, + { + "cell_type": "markdown", + "id": "b1dd207a", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "\"Snow\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "67dab0f6", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "celltoolbar": "Hide code", + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.16" + }, + "toc": { + "base_numbering": 1, + "nav_menu": {}, + "number_sections": true, + "sideBar": true, + "skip_h1_title": false, + "title_cell": "Table of Contents", + "title_sidebar": "Contents", + "toc_cell": false, + "toc_position": {}, + "toc_section_display": true, + "toc_window_display": false + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/notebooks/nb_04_anomaly_detection_via_reconstruction.ipynb 
b/notebooks/nb_04_anomaly_detection_via_reconstruction.ipynb new file mode 100644 index 0000000..b8700b1 --- /dev/null +++ b/notebooks/nb_04_anomaly_detection_via_reconstruction.ipynb @@ -0,0 +1,982 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "id": "c634d79e", + "metadata": { + "hide_input": true, + "init_cell": true, + "slideshow": { + "slide_type": "skip" + }, + "tags": [ + "remove-input", + "remove-output", + "remove-input-nbconv", + "remove-output-nbconv" + ] + }, + "outputs": [], + "source": [ + "%%capture\n", + "\n", + "%load_ext autoreload\n", + "%autoreload 2\n", + "%matplotlib inline\n", + "%load_ext tfl_training_anomaly_detection" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "596df825", + "metadata": { + "hide_input": true, + "init_cell": true, + "slideshow": { + "slide_type": "skip" + }, + "tags": [ + "remove-input", + "remove-input-nbconv" + ] + }, + "outputs": [], + "source": [ + "%presentation_style" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "158112af", + "metadata": { + "hide_input": true, + "init_cell": true, + "slideshow": { + "slide_type": "skip" + }, + "tags": [ + "remove-input", + "remove-output", + "remove-input-nbconv", + "remove-output-nbconv" + ] + }, + "outputs": [], + "source": [ + "%%capture\n", + "\n", + "%set_random_seed 12" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "97c8e783", + "metadata": { + "hide_input": true, + "init_cell": true, + "slideshow": { + "slide_type": "skip" + }, + "tags": [ + "remove-input-nbconv", + "remove-cell" + ] + }, + "outputs": [], + "source": [ + "%load_latex_macros" + ] + }, + { + "cell_type": "markdown", + "id": "7cc2f648", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "# Anomaly Detection via Reconstruction Error\n", + "\"Snow\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "import numpy as np\n", + "import itertools as it\n", + "from tqdm import tqdm\n", + "\n", + "import matplotlib\n", + "from matplotlib import pyplot as plt\n", + "import plotly.express as px\n", + "import pandas as pd\n", + "\n", + "import ipywidgets as widgets\n", + "\n", + "from tfl_training_anomaly_detection.exercise_tools import evaluate, get_kdd_data, get_house_prices_data, create_distributions, contamination, \\\n", + "perform_rkde_experiment, get_mnist_data\n", + "\n", + "from ipywidgets import interact\n", + "\n", + "from sklearn.metrics import roc_auc_score, average_precision_score\n", + "from sklearn.model_selection import RandomizedSearchCV\n", + "from sklearn.preprocessing import MinMaxScaler\n", + "from sklearn.preprocessing import LabelBinarizer\n", + "from sklearn.ensemble import IsolationForest\n", + "from sklearn import metrics\n", + "from sklearn.model_selection import train_test_split\n", + "from sklearn.decomposition import PCA\n", + "from sklearn.neighbors import KernelDensity\n", + "\n", + "from tfl_training_anomaly_detection.vae import VAE, build_decoder_mnist, build_encoder_minst, build_contaminated_minst\n", + "\n", + "from tensorflow import keras\n", + "\n", + "%matplotlib inline\n", + "matplotlib.rcParams['figure.figsize'] = (5, 5)\n" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "markdown", + "id": "6287222a", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "# Anomaly Detection via Reconstruction Error\n", + "**Idea:** Embed the data into low dimensional space and reconstruct it 
again.\n", + "\t\t\tGood embedding of nominal data $\\Rightarrow$ high reconstruction error indicates anomaly.\n", + "\n", + "**Autoencoder:**\n", + "- Parametric family of encoders: $f_\\phi: \\mathbb{R}^d \\to \\mathbb{R}^{\\text{low}}$\n", + "- Parametric family of decoders: $g_\\theta: \\mathbb{R}^{\\text{low}} \\to \\mathbb{R}^{d}$\n", + "- Reconstruction error of $(f_\\phi, g_\\theta)$ on $x$: $|x - g_\\theta(f_\\phi(x))|$\n", + "- Given data set $D$, find $\\phi,\\theta$ that minimize $\\sum_{x\\in D} L(|x- g_\\theta(f_\\phi(x))|) $\n", + " for some loss function $L$.\n" + ] + }, + { + "cell_type": "markdown", + "id": "ebb5094e", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "# Visualization\n", + "
\n", + "\n", + " Autoencoder Schema\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "565a43e9", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "## Neural Networks\n", + "Neural networks are very well suited for finding low dimensional representations of data. Hence they are a popular choice for the encoder and the decoder.\n", + "\n", + "\n", + "**Artificial Neuron with $N$ inputs:** $y = \\sigma\\left(\\sum_i^N w_i X_i + b\\right)$\n", + "\n", + "- $\\sigma$: nonlinear activation-function (applied component wise).\n", + "- $b$ bias\n" + ] + }, + { + "cell_type": "markdown", + "id": "e658847c", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Artificial neuron
" + ] + }, + { + "cell_type": "markdown", + "id": "09243719", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "# Neural Networks\n", + "\n", + "Neural networks combine many artificial neurons into a complex network. These networks are usually organized in layers\n", + "where the result of each layer is the input for the next layer. Some commonly used layers are:" + ] + }, + { + "cell_type": "markdown", + "id": "3376f048", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "
\n", + "\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "d8794297", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "## Variational Autoencoders\n", + "An important extension of autoencoders that relates the idea to density estimation.\n", + "More precisely, we define a generative model for our data using latent variables and combine the maximum likelihood\n", + "estimation of the parameters with a simultaneous posterior estimation of the latents through amortized stochastic\n", + "variational inference. We use a decoder network to transform the latent variables into the data distribution, and an\n", + "encoder network to compute the posterior distribution of the latents given the data." + ] + }, + { + "cell_type": "markdown", + "id": "6b5efeac", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "**Definition:**\n", + "\n", + "\n", + "The model uses an observed variable $X$ (the data) and a latent variable $Z$ (the defining features of $X$). We assume\n", + "both $P(Z)$ and $P(X\\mid Z)$ to be normally distributed. More precisely\n", + "\n", + "- $P(Z) = \\mathcal{N}(0, I)$\n", + "- $P(X\\mid Z) = \\mathcal{N}(\\mu_\\phi(Z), I)$\n", + "\n", + "where $\\mu_\\phi$ is a neural network parametrized with $\\phi$.\n", + "We use variational inference to perform posterior inference on $Z$ given $X$. We assume that the distribution $P(Z\\mid X)$\n", + "to be relatively well approximated by a Gaussian and use the posterior approximation:\n", + "- $q(X\\mid Z) = \\mathcal{N}(\\mu_\\psi(X), \\sigma_\\psi(X))$\n", + "\n", + "$\\mu_\\psi$ and $\\sigma_\\psi$ are neural networks parameterized with $\\psi$\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "76d013d3", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "
\n", + "\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "81661bf9", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "Given a data set $D$ we minimize the (amortized) Kullback-Leibler divergence between our posterior approximation and the\n", + "true posterior:\n", + "\\begin{align*}\n", + " D_{KL}(q(z\\mid x),p(z\\mid x)) &= E_{x\\sim X, z\\sim q(Z\\mid x)}\\left[\\log\\left(\\frac{q(z \\mid x)}{p(z \\mid X)}\\right)\\right] \\\\\n", + " &= E_{x\\sim X, z\\sim q(Z\\mid X)}\\left[\\log\\left(\\frac{q(z \\mid x)}{\\frac{p(x \\mid z)p(z)}{p(x)}}\\right)\\right] \\\\\n", + " &= E_{x\\sim X, z\\sim q(Z\\mid x)}\\left[\\log\\left(\\frac{q(z \\mid x)}{p(x \\mid z)p(z)}\\right) + \\log(p(x))\\right] \\\\\n", + " &= E_{x\\sim X, z\\sim q(Z\\mid x)}\\left[\\log\\left(\\frac{q(z \\mid x)}{p(x \\mid z)p(z)}\\right)\\right] + E_{x\\sim X}[\\log(p(x))]\\\\\n", + "\\end{align*}\n", + "\n", + "Now we can define\n", + "\n", + "\\begin{align*}\n", + " \\mathrm{ELBO}(q(z\\mid x),p(z\\mid x)) &:= E_{x\\sim X}[\\log(p(x))] - D_{KL}(q(z\\mid x),p(z\\mid x)) \\\\\n", + " &= -E_{x\\sim X, z\\sim q(Z\\mid x)}\\left[\\log\\left(\\frac{q(z \\mid x)}{p(x \\mid z)p(z)}\\right)\\right]\n", + "\\end{align*}\n", + "\n", + "Note that we can evaluate the expression inside the expectation of the final RHS of the\n", + "equation and we can obtain unbiased estmates of the expectation via sampling.\n", + "Let us further try to understand the ELBO as an optimization objective. On one hand, maximizing the ELBO with respect to the parameters in $q$ is equivalent to\n", + "minimizing the KL divergence between $p$ and $q$. On the other hand, maximizing the ELBO with\n", + "respect to the parameters in $p$ can be understood as raising a lower bound for the likelihood of the\n", + "generative model $p(x)$. Hence, the optimization tries to find an encoder and a decoder pair such that\n", + "it simultaneously provides a good generative explanation of the data and a good approximation of the posterior\n", + "distribution of the latent variables." + ] + }, + { + "cell_type": "markdown", + "id": "7131628c", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "# Exercise" + ] + }, + { + "cell_type": "markdown", + "id": "05f8c5a6", + "metadata": {}, + "source": [ + "# The MNIST Data Set\n", + "MNIST is one of the most iconic data sets in the history of machine learning.\n", + "It contains 70000 samples of $28\\times 28$ grayscale images of handwritten digits.\n", + "Because of its moderate complexity and good visualizability it is well suited to study the behavior of machine learning\n", + "algorithms in higher dimensional spaces.\n", + "\n", + "While originally created for classification (optical character recognition), we can build an anomaly detection data set\n", + "by corrupting some of the images.\n" + ] + }, + { + "cell_type": "markdown", + "id": "28979a57", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "# Pre-processing\n", + "We first need to obtain the MNIST data set and prepare an anomaly detection set from it.\n", + "Note that the data set is n row vector format.\n", + "Therefore, we work with $28\\times 28 = 784$ dimensional data points." 
+ ] + }, + { + "cell_type": "markdown", + "id": "d2508225", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "## Load MNIST Data Set" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a95d8450", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [], + "source": [ + "mnist = get_mnist_data()\n", + "\n", + "data = mnist['data']\n", + "print('data.shape: {}'.format(data.shape))\n", + "target = mnist['target'].astype(int)" + ] + }, + { + "cell_type": "markdown", + "id": "f6b6bde9", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "## Build contaminated Data Sets\n", + "We prepared a function that does the job for us.\n", + "It corrupts a prescribed portion of the data by introducing a rotation, noise or a blackout of some part of the image.\n", + "\n", + "First, we need to transform the data into image format." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "954d3762", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [], + "source": [ + "X = data.reshape(-1, 28, 28, 1)/255" + ] + }, + { + "cell_type": "markdown", + "id": "f4d6c089", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "### Train/Test-Split\n", + "We will only corrupt the test set, hence we will perform the train-test split beforehand.\n", + "We separate a relatively small test set so that we can use as much as possible from the data to obtain high quality\n", + "representations." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a0fb9c57", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [], + "source": [ + "test_size = .1\n", + "X_train, X_test, target_train, target_test = train_test_split(X, target, test_size=test_size)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d6ad3d43", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [], + "source": [ + "X_test, y_test = build_contaminated_minst(X_test)\n", + "\n", + "# Visualize contamination\n", + "anomalies = X_test[y_test != 0]\n", + "selection = np.random.choice(len(anomalies), 25)\n", + "\n", + "fig, axes = plt.subplots(nrows=5, ncols=5, figsize=(5, 5))\n", + "for img, ax in zip(anomalies[selection], axes.flatten()):\n", + " ax.imshow(img, 'gray')\n", + " ax.axis('off')\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "3c459a8a", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "# Autoencoder\n", + "Let us finally train an autoencoder model. We replicate the model given in the\n", + "[Keras documentation](https://keras.io/examples/generative/vae/) and apply it in a synthetic outlier detection scenario\n", + "based on MNIST.\n", + "\n", + "in the vae package we provide the implementation of the VAE. Please take a look into the source code to see how\n", + "the minimization of the KL divergence is implemented." 
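+    ,
+    "\n",
+    "\n",
+    "For orientation while reading that code: for a diagonal Gaussian posterior $\\mathcal{N}(\\mu, \\mathrm{diag}(\\sigma^2))$ and a standard normal prior, the KL term has the closed form $\\frac{1}{2}\\sum_j \\left(\\mu_j^2 + \\sigma_j^2 - \\log\\sigma_j^2 - 1\\right)$. The snippet below is a plain-NumPy sketch of this textbook expression; it is not copied from `vae.py` and may differ from that implementation in details.\n",
+    "\n",
+    "```python\n",
+    "import numpy as np\n",
+    "\n",
+    "def kl_to_standard_normal(mu, log_var):\n",
+    "    # KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions\n",
+    "    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)\n",
+    "\n",
+    "# sanity check: the KL vanishes when the posterior equals the prior\n",
+    "print(kl_to_standard_normal(np.zeros((1, 3)), np.zeros((1, 3))))  # [0.]\n",
+    "```"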
+ ] + }, + { + "cell_type": "markdown", + "id": "c1af1c41", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "## Create Model" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c90996a0", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [], + "source": [ + "latent_dim = 3\n", + "vae = VAE(decoder=build_decoder_mnist(latent_dim=latent_dim), encoder=build_encoder_minst(latent_dim=latent_dim))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "efb89bdd", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [], + "source": [ + "## Inspect model architecture\n", + "vae.encoder.summary()\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "68b219e9", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [], + "source": [ + "## Inspect model architecture\n", + "vae.decoder.summary()\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "01b43aff", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [], + "source": [ + "# train model\n", + "n_epochs = 30\n", + "\n", + "vae.compile(optimizer=keras.optimizers.Adam(learning_rate=.001))\n", + "history = vae.fit(X_train, epochs=n_epochs, batch_size=128)" + ] + }, + { + "cell_type": "markdown", + "id": "e1519875", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "## Inspect Result" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "80ab41fd", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [], + "source": [ + "import matplotlib.pyplot as plt\n", + "\n", + "\n", + "def plot_latent_space(vae: VAE, n: int=10, figsize: float=10):\n", + " \"\"\"Plot sample images from 2D slices of latent space\n", + " \n", + " @param vae: vae model\n", + " @param n: sample nXn images per slice\n", + " @param figsize: figure size\n", + " \n", + " \"\"\"\n", + " for perm in [[0, 1, 2], [1, 2, 0], [2, 1, 0]]:\n", + " # display a n*n 2D manifold of digits\n", + " digit_size = 28\n", + " scale = 1.0\n", + " figure = np.zeros((digit_size * n, digit_size * n))\n", + " # linearly spaced coordinates corresponding to the 2D plot\n", + " # of digit classes in the latent space\n", + " grid_x = np.linspace(-scale, scale, n)\n", + " grid_y = np.linspace(-scale, scale, n)[::-1]\n", + "\n", + " for i, yi in enumerate(grid_y):\n", + " for j, xi in enumerate(grid_x):\n", + " z_sample = np.array([[xi, yi, 0]])\n", + " z_sample[0] = z_sample[0][perm]\n", + " x_decoded = vae.decoder.predict(z_sample)\n", + " digit = x_decoded[0].reshape(digit_size, digit_size)\n", + " figure[\n", + " i * digit_size : (i + 1) * digit_size,\n", + " j * digit_size : (j + 1) * digit_size,\n", + " ] = digit\n", + "\n", + " plt.figure(figsize=(figsize, figsize))\n", + " start_range = digit_size // 2\n", + " end_range = n * digit_size + start_range\n", + " pixel_range = np.arange(start_range, end_range, digit_size)\n", + " sample_range_x = np.round(grid_x, 1)\n", + " sample_range_y = np.round(grid_y, 1)\n", + " plt.xticks(pixel_range, sample_range_x)\n", + " plt.yticks(pixel_range, sample_range_y)\n", + " plt.xlabel(\"z[{}]\".format(perm[0]))\n", + " plt.ylabel(\"z[{}]\".format(perm[1]))\n", + " plt.gca().set_title('z[{}] = 0'.format(perm[2]))\n", + " plt.imshow(figure, cmap=\"Greys_r\")\n", + " plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bdb0f67d", + "metadata": { + "slideshow": 
{ + "slide_type": "skip" + } + }, + "outputs": [], + "source": [ + "plot_latent_space(vae)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0d6a5b6f", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [], + "source": [ + "# Principal components\n", + "pca = PCA()\n", + "latents = vae.encoder.predict(X_train)[2]\n", + "pca.fit(latents)\n", + "\n", + "kwargs = {'x_{}'.format(i): (-1., 1.) for i in range(latent_dim)}\n", + "\n", + "\n", + "@widgets.interact(**kwargs)\n", + "def explore_latent_space(**kwargs):\n", + " \"\"\"Widget to explore latent space from given start position\n", + " \"\"\"\n", + " center_img = pca.transform(np.zeros([1,latent_dim]))\n", + "\n", + " latent_rep_pca = center_img + np.array([[kwargs[key] for key in kwargs]])\n", + " latent_rep = pca.inverse_transform(latent_rep_pca)\n", + " img = vae.decoder(latent_rep).numpy().reshape(28, 28)\n", + "\n", + " fig, ax = plt.subplots()\n", + " ax.axis('off')\n", + " ax.axis('off')\n", + "\n", + " ax.imshow(img,cmap='gray', vmin=0, vmax=1)\n", + " plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6f9fb82f", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [], + "source": [ + "latents = vae.encoder.predict(X_train)[2]\n", + "scatter = px.scatter_3d(x=latents[:, 0], y=latents[:, 1], z=latents[:, 2], color=target_train)\n", + "\n", + "scatter.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ea370a83", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [], + "source": [ + "latents = vae.encoder.predict(X_test)[2]\n", + "scatter = px.scatter_3d(x=latents[:, 0], y=latents[:, 1], z=latents[:, 2], color=y_test)\n", + "\n", + "scatter.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dee0a98e", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [], + "source": [ + "X_test, X_val, y_test, y_val = train_test_split(X_test, y_test)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "65c957f8", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [], + "source": [ + "n_samples = 10\n", + "\n", + "s = np.random.choice(range(len(X_val)), n_samples)\n", + "s = X_val[s]\n", + "#s = [X_train_img[i] for i in s]\n", + "\n", + "fig, axes = plt.subplots(nrows=2, ncols=n_samples, figsize=(10, 2))\n", + "for img, ax_row in zip(s, axes.T):\n", + " x = vae.decoder.predict(vae.encoder.predict(img.reshape(1, 28, 28, 1))[2]).reshape(28, 28)\n", + " diff = x - img.reshape(28, 28)\n", + " error = (diff * diff).sum()\n", + " ax_row[0].axis('off')\n", + " ax_row[1].axis('off')\n", + " ax_row[0].imshow(img,cmap='gray', vmin=0, vmax=1)\n", + " ax_row[1].imshow(x, cmap='gray', vmin=0, vmax=1)\n", + " ax_row[1].set_title('E={:.1f}'.format(error))\n", + "\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "350edb6c", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [], + "source": [ + "from sklearn import metrics\n", + "y_test_bin = y_test.copy()\n", + "y_test_bin[y_test != 0] = 1\n", + "y_val_bin = y_val.copy()\n", + "y_val_bin[y_val != 0] = 1\n", + "# Evaluate\n", + "reconstruction = vae.decoder.predict(vae.encoder(X_val)[2])\n", + "rerrors = (reconstruction - X_val).reshape(-1, 28*28)\n", + "rerrors = (rerrors * rerrors).sum(axis=1)\n", + "\n", + "# Let's calculate scores if any anomaly is 
present\n", + "if np.any(y_val_bin == 1):\n", + " eval = evaluate(y_val_bin.astype(int), rerrors.astype(float))\n", + " pr, rec, thr = eval['PR']\n", + " f1s = (2 * ((pr * rec)[:-1]/(pr + rec)[:-1]))\n", + " threshold = thr[np.argmax(f1s)]\n", + " print('Optimal threshold: {}'.format(threshold))\n", + "\n", + " reconstruction = vae.decoder.predict(vae.encoder(X_test)[2])\n", + " reconstruction_error = (reconstruction - X_test).reshape(-1, 28*28)\n", + " reconstruction_error = (reconstruction_error * reconstruction_error).sum(axis=1)\n", + "\n", + "\n", + " classification = (reconstruction_error > threshold).astype(int)\n", + "\n", + " print('Precision: {}'.format(metrics.precision_score(y_test_bin, classification)))\n", + " print('Recall: {}'.format(metrics.recall_score(y_test_bin, classification)))\n", + " print('F1: {}'.format(metrics.f1_score(y_test_bin, classification)))\n", + "\n", + " metrics.confusion_matrix(y_test_bin, classification)\n", + "else:\n", + " reconstruction_error = None\n" + ] + }, + { + "cell_type": "markdown", + "id": "c8c5568d", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "## Sort Data by Reconstruction Error" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9b304ec8", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [], + "source": [ + "if reconstruction_error is not None:\n", + " combined = list(zip(X_test, reconstruction_error))\n", + " combined.sort(key = lambda x: x[1])\n" + ] + }, + { + "cell_type": "markdown", + "id": "555fd7f3", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "## Show Top Autoencoder Outliers" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d51d7a5c", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [], + "source": [ + "if reconstruction_error is not None:\n", + " n_rows = 10\n", + " n_cols = 10\n", + " n_samples = n_rows*n_cols\n", + "\n", + " samples = [c[0] for c in combined[-n_samples:]]\n", + "\n", + " fig, axes = plt.subplots(nrows=n_rows, ncols=n_cols, figsize=(2*n_cols, 2*n_rows))\n", + " for img, ax in zip(samples, axes.reshape(-1)):\n", + " ax.axis('off')\n", + " ax.imshow(img.reshape((28,28)), cmap='gray', vmin=0, vmax=1)\n", + "\n", + " plt.show()\n" + ] + }, + { + "cell_type": "markdown", + "id": "85cd3f88", + "metadata": {}, + "source": [ + "# Summary\n", + "- Autoencoders are the most prominent reconstruction error based anomaly detection method.\n", + "- Can provide high quality results on high dimensional data.\n", + "- Architecture is highly adaptable to the data (fully connected, CNN, attention,...).\n", + "- Sensitive to contamination.\n", + "- Variational autoencoder are an important variant the improves the interpretability of the latent space.\n", + "\n", + "## Implementations\n", + "- Keras: see vae.py or [here](https://keras.io/examples/generative/vae/)\n", + "- Pytorch: [example implementation](https://colab.research.google.com/github/smartgeometry-ucl/dl4g/blob/master/variational_autoencoder.ipynb)\n", + "- Pyro (pytorch based probabilistic programming language): [example implementation](https://pyro.ai/examples/vae.html)" + ] + }, + { + "cell_type": "markdown", + "id": "b1dd207a", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "\"Snow\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "67dab0f6", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + 
"celltoolbar": "Hide code", + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.16" + }, + "toc": { + "base_numbering": 1, + "nav_menu": {}, + "number_sections": true, + "sideBar": true, + "skip_h1_title": false, + "title_cell": "Table of Contents", + "title_sidebar": "Contents", + "toc_cell": false, + "toc_position": {}, + "toc_section_display": true, + "toc_window_display": false + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/notebooks/nb_03_anomaly_detection_on_time_series.ipynb b/notebooks/nb_05_anomaly_detection_on_time_series.ipynb similarity index 99% rename from notebooks/nb_03_anomaly_detection_on_time_series.ipynb rename to notebooks/nb_05_anomaly_detection_on_time_series.ipynb index 1e31aa7..ab4f61f 100644 --- a/notebooks/nb_03_anomaly_detection_on_time_series.ipynb +++ b/notebooks/nb_05_anomaly_detection_on_time_series.ipynb @@ -278,18 +278,13 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "\"Snow\"\n", - "
Anomaly Detection on Time Series
\n" + "# Anomaly Detection on Time Series\n", + "\"Snow\"\n" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, "outputs": [], "source": [ "import pandas as pd\n", @@ -304,7 +299,10 @@ "from sklearn.linear_model import LinearRegression\n", "\n", "matplotlib.rcParams['figure.figsize'] = (15, 5)\n" - ] + ], + "metadata": { + "collapsed": false + } }, { "cell_type": "markdown", @@ -1107,7 +1105,7 @@ } }, "source": [ - "## Summery\n", + "## Summary\n", "- Time series data differs from point data in the sense that the readings are in general not independent of each other.\n", "- Stationarity is a central concept in time series analysis.\n", "- A time series can contain trends and seasonalities which demand special attention when one tries to identify anomalies.\n", @@ -1270,7 +1268,7 @@ "## The ARMA Model\n", "Both models can be combined building the $\\mathrm{ARMA}(p, q)$ model:\n", "\n", - "$$ \\left(1 - \\sum_{i=0}^p\\phi_i L^i\\right)X_t = \\left(1 + \\sum_{j=1}^q \\psi_j L^j\\right)\\epsilon_t $$" + "$$ \\left(1 - \\sum_{i=1}^p\\phi_i L^i\\right)X_t = \\left(1 + \\sum_{j=1}^q \\psi_j L^j\\right)\\epsilon_t $$" ] }, { @@ -1296,7 +1294,7 @@ "First, the $\\mathrm{ARIMA}(p, d, q)$ model applies d-th order differencing before applying $\\mathrm{ARMA}(p, q)$\n", "to remove trends from the time series\n", "\n", - "$$ \\left(1 - \\sum_{i=0}^p\\phi_iL^i\\right)(1-L)^dX_t = \\left(1 + \\sum_{j=1}^q \\psi_jL^j\\right)\\epsilon_t $$" + "$$ \\left(1 - \\sum_{i=1}^p\\phi_iL^i\\right)(1-L)^dX_t = \\left(1 + \\sum_{j=1}^q \\psi_jL^j\\right)\\epsilon_t $$" ] }, { @@ -1329,7 +1327,7 @@ "$\\mathrm{SARIMA}(p, d, q)(P, D, Q)_m$ can be written as\n", "\n", "\\begin{align*}\n", - " &\\phantom{=..} \\left(1 - \\sum_{i=0}^p\\phi_iL^i\\right)\\left(1 - \\sum_{i=0}^P\\Phi_iL^{im}\\right)(1-L)^d(1-L^m)^DX_t \\\\\n", + " &\\phantom{=..} \\left(1 - \\sum_{i=1}^p\\phi_iL^i\\right)\\left(1 - \\sum_{i=1}^P\\Phi_iL^{im}\\right)(1-L)^d(1-L^m)^DX_t \\\\\n", " &= \\left(1 + \\sum_{j=1}^q \\psi_jL^j\\right)\\left(1 + \\sum_{j=1}^Q \\Psi_jL^{jm}\\right)\\epsilon_t\n", "\\end{align*}\n" ] diff --git a/notebooks/nb_04_extreme_value_theory_for_anomaly_detection.ipynb b/notebooks/nb_06_extreme_value_theory_for_anomaly_detection.ipynb similarity index 99% rename from notebooks/nb_04_extreme_value_theory_for_anomaly_detection.ipynb rename to notebooks/nb_06_extreme_value_theory_for_anomaly_detection.ipynb index 5fe0155..9cb6d97 100644 --- a/notebooks/nb_04_extreme_value_theory_for_anomaly_detection.ipynb +++ b/notebooks/nb_06_extreme_value_theory_for_anomaly_detection.ipynb @@ -1,13 +1,20 @@ { "cells": [ { - "cell_type": "code", - "execution_count": null, + "cell_type": "markdown", "metadata": { "slideshow": { - "slide_type": "skip" + "slide_type": "slide" } }, + "source": [ + "Extreme Value Theory for Anomaly Detection\n", + "\"Snow\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, "outputs": [], "source": [ "import tensorflow as tf\n", @@ -22,22 +29,13 @@ "from typing import Protocol, Sequence, Union, Tuple, List, TypeVar, Callable\n", "from matplotlib.animation import FuncAnimation\n", "from celluloid import Camera\n", - "from IPython.core.display import HTML \n", + "from IPython.core.display import HTML\n", "\n", "tfd = tfp.distributions" - ] - }, - { - "cell_type": "markdown", + ], "metadata": { - "slideshow": { - "slide_type": "slide" - } - }, - "source": [ - "\"Snow\"\n", - "
Extreme Value Theory for Anomaly Detection
" - ] + "collapsed": false + } }, { "cell_type": "markdown", diff --git a/notebooks/snippets/hint_housing_data.md b/notebooks/snippets/hint_housing_data.md deleted file mode 100644 index a1ea3f8..0000000 --- a/notebooks/snippets/hint_housing_data.md +++ /dev/null @@ -1,52 +0,0 @@ -## Hint for Exercise 1 - -You should encode the categorical values and remove nans. - -Then, create a train-test split, normalize the data frame and train -some sklear model, for example a linear regression, on the target column. - -The results can be visualized with `plt.scatter`, you should also -compute some metrics like the mean squared error and the r2 score. - -We found writing functions with the following signatures useful (you can -ignore the `Protocol` part, it's just a way to define a type hint): - -```python - -class SKlearnModelProtocol(Protocol): - def fit(self, X: pd.DataFrame, y: pd.DataFrame): - ... - - def predict(self, X: pd.DataFrame) -> np.ndarray: - ... - - -def get_categorical_columns(df: pd.DataFrame) -> list["str"]: - pass - - -def one_hot_encode_categorical( - df: pd.DataFrame, columns: list[str] = None -) -> pd.DataFrame: - pass - - -def train_sklearn_regression_model( - model: SKlearnModelProtocol, df: pd.DataFrame, target_column: str -) -> SKlearnModelProtocol: - pass - - -def remove_nans(df: pd.DataFrame) -> pd.DataFrame: - pass - - -def get_normalized_train_test_df(df: pd.DataFrame, test_size: float = 0.2) -> tuple[pd.DataFrame, pd.DataFrame]: - pass - - -def evaluate_model( - model: SKlearnModelProtocol, X_test: pd.DataFrame, y_test: pd.DataFrame -) -> np.ndarray: - pass -``` \ No newline at end of file diff --git a/notebooks/snippets/solution_housing_data.py b/notebooks/snippets/solution_housing_data.py deleted file mode 100644 index 8f3d24e..0000000 --- a/notebooks/snippets/solution_housing_data.py +++ /dev/null @@ -1,77 +0,0 @@ -class SKlearnModelProtocol(Protocol): - def fit(self, X: pd.DataFrame, y: pd.DataFrame): - ... - - def predict(self, X: pd.DataFrame) -> np.ndarray: - ... 
- - -def get_categorical_columns(df: pd.DataFrame): - return df.select_dtypes(include=["object", "category"]).columns - - -def one_hot_encode_categorical( - df: pd.DataFrame, columns: list[str] = None -) -> pd.DataFrame: - columns = columns or get_categorical_columns(df) - for column in columns: - df = pd.concat([df, pd.get_dummies(df[column], prefix=column)], axis=1) - df = df.drop(column, axis=1) - return df - - -def train_sklearn_regression_model( - model: SKlearnModelProtocol, df: pd.DataFrame, target_column: str -): - X = df.drop(target_column, axis=1) - y = df[target_column] - model.fit(X, y) - return model - - -def remove_nans(df: pd.DataFrame): - # count rows with nans - nans_count = df.isna().sum().sum() - if nans_count > 0: - print(f"Warning: {nans_count} NaNs were found and removed") - return df.dropna() - - -def get_normalized_train_test_df(df: pd.DataFrame, test_size: float = 0.2): - df = remove_nans(df) - df, _ = normalize_and_get_scaler(df) - df = one_hot_encode_categorical(df) - train_df = df.sample(frac=1 - test_size) - test_df = df.drop(train_df.index) - return train_df, test_df - - -def evaluate_model( - model: SKlearnModelProtocol, X_test: pd.DataFrame, y_test: pd.DataFrame -): - y_pred = model.predict(X_test) - print(f"Mean squared error: {mean_squared_error(y_test, y_pred)}") - print(f"R2 score: {r2_score(y_test, y_pred)}") - return y_pred - - -# normalize and split data -train_df, test_df = get_normalized_train_test_df(housing_df) - -# train -trained_model = train_sklearn_regression_model( - LinearRegression(), train_df, "median_house_value" -) - -# evaluate -y_pred = evaluate_model( - trained_model, - test_df.drop("median_house_value", axis=1), - test_df["median_house_value"], -) - -# visualize results -plt.scatter(test_df["median_house_value"], y_pred) -plt.xlabel("True median house value") -plt.ylabel("Predicted median house value") -plt.show()
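
Editor's note on the thresholding step in the autoencoder hunk above: the notebook picks a decision threshold by maximizing F1 over the validation precision-recall curve and then applies it to the test reconstruction errors. The notebook's `evaluate` helper and the `rerrors` array are not shown in this patch, so the following is only a minimal, self-contained sketch of that step on synthetic outlier scores; the synthetic score distributions, class balance, and the `1e-12` division guard are assumptions, not part of the original code.

```python
# Minimal sketch: pick the F1-optimal threshold from a precision-recall curve,
# mirroring the thresholding step in the autoencoder cell above.
# The synthetic scores and labels below are illustrative assumptions only.
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(42)
scores = np.concatenate([rng.normal(0.0, 1.0, 950),   # "normal" points: low scores
                         rng.normal(4.0, 1.0, 50)])   # anomalies: high scores
labels = np.concatenate([np.zeros(950, dtype=int), np.ones(50, dtype=int)])

# precision/recall have one more entry than thresholds, hence the [:-1] slices,
# matching the slicing used in the notebook cell.
precision, recall, thresholds = precision_recall_curve(labels, scores)
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best_threshold = thresholds[np.argmax(f1)]

predicted = (scores > best_threshold).astype(int)
print(f"F1-optimal threshold: {best_threshold:.3f}")
```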
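
Editor's note on the SARIMA hunk in `nb_05_anomaly_detection_on_time_series.ipynb`: the corrected $\mathrm{SARIMA}(p, d, q)(P, D, Q)_m$ formula is typically put to work for anomaly detection by flagging points whose model residuals are unusually large. As an illustration only, here is a small sketch using `statsmodels` (the course notebooks may rely on a different library); the synthetic weekly-seasonal series, the injected spike, the model orders, and the z-score cutoff of 4 are all assumptions.

```python
# Hedged sketch: fit a seasonal ARIMA and flag points with large residuals.
# Data, model orders, and the cutoff are illustrative assumptions.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
t = np.arange(365)
y = 10 * np.sin(2 * np.pi * t / 7) + rng.normal(scale=1.0, size=t.size)
y[200] += 15  # injected anomaly

# SARIMA(1, 0, 1)(1, 0, 1)_7: weekly seasonality, no differencing needed here
result = ARIMA(y, order=(1, 0, 1), seasonal_order=(1, 0, 1, 7)).fit()
residuals = y - result.predict()

z = (residuals - residuals.mean()) / residuals.std()
print("Flagged indices:", np.where(np.abs(z) > 4)[0])
```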
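
Editor's note on the deleted `solution_housing_data.py` above: it calls a `normalize_and_get_scaler` helper that is defined elsewhere in the notebook and does not appear in this patch. Purely as a hypothetical reconstruction of what such a helper might look like, assuming a `StandardScaler` applied to the numeric columns only, one possibility is:

```python
# Hypothetical sketch of the normalize_and_get_scaler helper referenced in the
# deleted solution above; this is not the original implementation.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def normalize_and_get_scaler(df: pd.DataFrame) -> tuple[pd.DataFrame, StandardScaler]:
    """Standardize numeric columns and return the fitted scaler for reuse."""
    df = df.copy()
    numeric_cols = df.select_dtypes(include=np.number).columns
    scaler = StandardScaler()
    df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
    return df, scaler
```

Returning the fitted scaler keeps the option open to apply the same transformation to held-out data later, which would explain why `get_normalized_train_test_df` in the deleted solution discards it with `_`.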