diff --git a/docs/_static/extreme_value_theory_for_anomaly_detection.html b/docs/_static/nb_01_0_intro_anomaly_detection.html similarity index 50% rename from docs/_static/extreme_value_theory_for_anomaly_detection.html rename to docs/_static/nb_01_0_intro_anomaly_detection.html index f47f174..346cfe6 100644 --- a/docs/_static/extreme_value_theory_for_anomaly_detection.html +++ b/docs/_static/nb_01_0_intro_anomaly_detection.html @@ -3,7 +3,34 @@
-%%capture
+
%load_latex_macros
-
import tensorflow as tf
-import tensorflow_probability as tfp
-from tensorflow import keras
-import numpy as np
-import pandas as pd
-import matplotlib.pyplot as plt
-import os
-import logging
-from sklearn.preprocessing import StandardScaler
-from typing import Protocol, Sequence, Union, Tuple, List, TypeVar, Callable
-from matplotlib.animation import FuncAnimation
-from celluloid import Camera
-from IPython.core.display import HTML
-
-tfd = tfp.distributions
+import numpy as np
+
+
+import matplotlib
+from matplotlib import pyplot as plt
+from matplotlib.patches import Ellipse
+
+from tfl_training_anomaly_detection.exercise_tools import evaluate, visualize_mahalanobis
+
+from ipywidgets import interact
+
+from sklearn.metrics import f1_score, precision_score, recall_score
+
+%matplotlib inline
+matplotlib.rcParams['figure.figsize'] = (5, 5)
Extreme value theory (EVT) usually deals with tails of univariate probability distributions, i.e. with events that are rare because they are large. Anomaly detection deals with events that are rare but not necessarily large. Therefore, the topic of anomaly detection is more general than EVT.
Despite the narrower applicability of EVT techniques, they are still a valuable addition to the anomaly detection practitioner's toolbox. In several situations, anomalies directly correspond to large deviations from some (possibly running) mean, e.g. for sensor data or intrusion attacks based on the number of calls.
+Grubbs, 1969:
An outlying observation, or "outlier," is one that appears to deviate markedly from other members of the sample in which it occurs.
+
Hawkins, 1980:
An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism.
+
Chandola et al., 2009:
+Anomalies are patterns in data that do not conform to a well defined notion of normal behavior
+
However, even for entirely different definitions of anomaly, most detection algorithms will produce a scalar outlier score for each datapoint. EVT can then be used as a probabilistic framework for analyzing the univariate distribution of outlier scores and to help determine meaningful thresholds for separating anomalous from normal cases.
There are two fundamental theorems of extreme value theory on which most results in that field are based. The first is concerned with the asymptotic distribution of block maxima of a sequence of i.i.d. random variables. The second gives an expression for the distribution of excesses over a threshold.
-We will first state these theorems (in their standard formulation in the literature), then see how they can be applied to anomaly detection and after that highlight ideas of their proofs as well as some theoretical consequences.
-From now on let $X_1, X_2, ...$ be a sequence of 1-dimensional i.i.d. random variables with cumulative distribution function $F$. Let $X$ also be a r.v. with the same c.d.f.
Anomaly Detection: Sensor data can provide valuable information about the condition of a component. Increasingly abnormal readings may indicate wear of the equipment.
We define the n-block maximum as the random variable
$$
M_n := \max \{X_1, ..., X_n\}.
$$
Given a threshold $u$, the excess over the threshold is given by $X-u$.
-In EVT, we are typically interested in approximating $P(M_n<z)$ for large $n$ and in approximating the distribution of excesses $P(X-u < y \mid X > u)$ for large $u$.
Anomaly Detection: Fraudulent transactions can often be identified through unusual destinations, amounts, or network topology (over several transactions).
The Fisher-Tippett-Gnedenko theorem characterizes the possible limits of renormalized block maxima.
If there exist sequences of real numbers $a_n>0$, $b_n$ such that the probability distributions
$$
P\left(\frac{M_n-b_n}{a_n}<z \right)
$$
converge to a non-degenerate distribution $G(z)$, then $G(z)$ must be of the following form:
\begin{equation}
P\left(\frac{M_n-b_n}{a_n}<z \right) \xrightarrow[n\rightarrow \infty]{} G(z; \xi, \mu, \sigma) = \exp \left\{ -\left( 1 + \xi \left( \frac{z - \mu}{\sigma} \right) \right)^{- \frac{1}{\xi} } \right\}
\end{equation}
where $\xi, \mu \in \mathbb{R}$ and $\sigma >0$. This family of functions is called the Generalized Extreme Value (GEV) distributions.
Anomaly Detection: Malicious connections can leave unusual footprints, e.g. the protocol used, ports, number of packets, IP, duration, etc.
The Pickands–Balkema–De Haan theorem states that under the same conditions as above and for a threshold $u \in \mathbb{R}$ going to infinity, the distribution of excesses over the threshold $u$ converges to a Generalized Pareto Distribution (GPD), i.e.
\begin{equation}
P(X-u < y \mid X > u) \xrightarrow[u \rightarrow \infty]{} H(y; \xi, \tilde{\sigma}) = 1 - \left( 1 + \frac{\xi \, y}{\tilde{\sigma}} \right)^{-\frac{1}{\xi}}
\end{equation}
where $y>0$ and $1 + \frac{\xi \, y}{\tilde{\sigma}} >0$. The parameter $\xi$ takes the same value as for the GEV.
-We have highlighted the dependence of the limiting distributions on the parameters in both cases. In applications, these parameters will be estimated based on data.
Before we analyze the consequences of the above distributions in detail, let us discuss their practical significance. The distinctive feature of the GEV and GPD distributions is that they are of a very restricted form, belonging to a three- and a two-parameter function family, respectively. This motivates modeling distributions of block maxima for finite but large $n$ by the GEV distribution
$$
P\left(\frac{M_n-b_n}{a_n}<z \right) \approx G(z; \xi, \mu, \sigma) \Longleftrightarrow P\left( M_n < z \right) \approx G(z; \xi, \mu', \sigma')
$$
where $\mu'=b_n+a_n \mu$ and $\sigma'=a_n \sigma$. Thus, fitting the coefficients $\xi, \mu', \sigma'$ to the observed values of $M_n$, e.g. by maximum likelihood estimation, also finds the "best" values of the renormalizing constants $a_n$ and $b_n$.
-Similarly, fitting $\xi, \tilde{\sigma}$ to observed excesses of a finite threshold $u$ also finds the best renormalizing constants for the GPD.
+Where do you think you can benefit from anomaly detection?
+In the context of AD, modeling the distributions of $M_n$ or $X-u$ is useful for finding thresholds on outlier scores with probabilistic interpretations or for predicting the occurrence rates and sizes of anomalies.
Task: Estimate whether a given $x$ is anomalous
+Assumptions:
+For example, given some complex outlier score based on sensor data of a factory process, we might be interested in the probability that this outlier score exceeds a certain threshold within a month. This could be achieved by fitting a GEV distribution to observed frequencies of monthly maxima of the score.
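As a quick illustration of this idea, here is a minimal sketch (not part of the original notebook) that assumes scipy is available; the synthetic `daily_scores` and the threshold value 5 are hypothetical. It fits a GEV to 30-day block maxima and reads off the probability that a monthly maximum exceeds the threshold. Note that scipy's shape parameter has the opposite sign convention to the $\xi$ used here.

import numpy as np
from scipy.stats import genextreme  # scipy's shape parameter c corresponds to -xi

rng = np.random.default_rng(0)
daily_scores = rng.gumbel(loc=0.0, scale=1.0, size=3 * 360)    # hypothetical daily outlier scores
monthly_maxima = daily_scores.reshape(-1, 30).max(axis=1)      # 30-day block maxima

c, loc, scale = genextreme.fit(monthly_maxima)                 # MLE fit of the GEV
p_exceed = 1.0 - genextreme.cdf(5.0, c, loc=loc, scale=scale)  # P(monthly maximum > 5)
print(f"shape={c:.3f}, loc={loc:.3f}, scale={scale:.3f}, P(M > 5) = {p_exceed:.4f}")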
+No!
+The GPD can be used to directly estimate a cumulative univariate distribution $F(z)$ for large enough $z$. Then one could use it to determine the anomaly threshold $z_{\text{th}}$ by defining an anomalous upper quantile. E.g. solving $F(z_\text{th}) = 0.99$ for $z_{\text{th}}$ (where $F$ was obtained by fitting the GPD to some outlier score) would declare approximately 1% of data points as anomalous. We will describe this in more detail below.
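A minimal sketch of this thresholding recipe (synthetic `scores`, scipy assumed; the notebook itself uses tensorflow probability later): fit a GPD to the excesses over a tail threshold $u$ and solve for the score value above which roughly 1% of points fall.

import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(1)
scores = rng.exponential(scale=1.0, size=5000)      # hypothetical outlier scores
u = np.quantile(scores, 0.90)                       # start of the tail
excesses = scores[scores > u] - u

xi, _, sigma = genpareto.fit(excesses, floc=0.0)    # fit GPD to the excesses
tail_frac = len(excesses) / len(scores)             # estimate of P(X > u)
target = 0.01                                       # desired anomaly rate
z_th = u + genpareto.ppf(1.0 - target / tail_frac, xi, loc=0.0, scale=sigma)
print(f"threshold z_th = {z_th:.3f}, flags {np.mean(scores > z_th):.3%} of points")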
| Confusion Matrix | Actual Nominal | Actual Anomaly |
|---|---|---|
| Predicted Nominal | True Negative (TN) | False Negative (FN) |
| Predicted Anomaly | False Positive (FP) | True Positive (TP) |
Let us give a quick example of fitting a GEV on data and extracting insight from it. For that we will use the NYC taxi calls dataset - a collection of taxi call counts per 30-minute interval, collected over more than a year.
Precision: It estimates the probability that an observation really is anomalous, given that the detection system predicted it to be.
Recall: It estimates the probability that an observation will be predicted to be anomalous, given that it really is.
F1 score: It balances between precision and recall.
-taxi_csv = os.path.join("..", 'data','nyc_taxi','nyc_taxi.csv')
-taxi_df = pd.read_csv(taxi_csv)
-taxi_df['time'] = [x.time() for x in pd.to_datetime(taxi_df['timestamp'])]
-taxi_df['date'] = [x.date() for x in pd.to_datetime(taxi_df['timestamp'])]
-taxi_df.rename(columns={"value": "n_calls"}, inplace=True)
-taxi_df.drop(columns=["timestamp"], inplace=True)
-
Choosing the optimal threshold depends not only on the values of our metrics but also on the costs associated with the confusion matrix. Similarly to precision, recall, etc., the confusion matrix has to be understood as a function of the threshold: for each threshold we obtain different numbers of true positives, false positives, and so on. The associated costs, e.g. for the false positives, are the expected costs that a falsely positive prediction generates (profits are represented as negative costs). Unlike the confusion matrix, the cost matrix does not depend on $\tau$.
| Cost Matrix | Actual Nominal | Actual Anomaly |
|---|---|---|
| Predicted Nominal | Cost of TN (CTN) | Cost of FN (CFN) |
| Predicted Anomaly | Cost of FP (CFP) | Cost of TP (CTP) |
Our goal is to set the threshold such that the expected empirical costs will be minimized
$$
\tau_{\text{opt}}=\arg\min_\tau \frac{\mathrm{TP}(\tau)\cdot \mathrm{CTP} + \mathrm{FP}(\tau)\cdot \mathrm{CFP} + \mathrm{TN}(\tau)\cdot \mathrm{CTN} + \mathrm{FN}(\tau)\cdot \mathrm{CFN}}{N}
$$
where $N$ is the total number of samples.
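A minimal sketch of this cost-based threshold selection (the cost values and the helper name `optimal_threshold` are made up for illustration): sweep candidate thresholds, compute the confusion-matrix counts, and keep the threshold with the lowest average cost.

import numpy as np

def optimal_threshold(scores, y_true, c_tp=1.0, c_fp=5.0, c_tn=0.0, c_fn=50.0):
    """Return the threshold tau that minimizes the empirical expected cost."""
    best_tau, best_cost = None, np.inf
    for tau in np.unique(scores):
        y_pred = scores > tau
        tp = np.sum(y_pred & (y_true == 1))
        fp = np.sum(y_pred & (y_true == 0))
        fn = np.sum(~y_pred & (y_true == 1))
        tn = np.sum(~y_pred & (y_true == 0))
        cost = (tp * c_tp + fp * c_fp + tn * c_tn + fn * c_fn) / len(y_true)
        if cost < best_cost:
            best_tau, best_cost = tau, cost
    return best_tau, best_cost

In practice one would typically restrict the candidate thresholds to a coarser grid, or vectorize the confusion-matrix counts as in a precision-recall curve computation.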
Let's have a look at a simple probabilistic anomaly score.
+taxi_df.head()
-
Try the outlier scores for yourself in a simple synthetic scenario. We have prepared the function evaluate for you. Try to find the optimal threshold for the dataset.
+# Helper functions for normalizing data. In most cases it will be enough to use the normalize_series / normalize_df functions below
-
-def normalize_data(data: Sequence) -> np.ndarray:
- scaler = StandardScaler()
- return scaler.fit_transform(data)
-
+nominal = np.random.normal(0, [1, 1.5], size=(300, 2))
+anomaly = np.random.normal(5, 2, size=(10, 2))
-def normalize_series(series: pd.Series) -> pd.DataFrame:
- data = series.values.reshape(-1, 1)
- normalized_data = normalize_data(data).reshape(-1)
- return pd.Series(normalized_data, index=series.index)
+data = np.concatenate([nominal, anomaly], axis=0)
+y = np.zeros(310)
+y[-10:] = 1
+plt.scatter(data[:, 0], data[:,1], c=y)
+plt.gca().set_aspect('equal')
+plt.show()
+
-def normalize_df(data_frame: pd.DataFrame):
- normalized_data = normalize_data(data_frame)
- return pd.DataFrame(normalized_data, columns=data_frame.columns, index=data_frame.index)
+
taxi_df_normalized = taxi_df
-taxi_df_normalized["n_calls"] = normalize_series(taxi_df_normalized["n_calls"])
-taxi_df_normalized.head()
-
We can define a trainable GEV with tensorflow probability as following
+Fit a Gaussian
def get_gev(xi: float, mu=0., sigma=1., trainable_xi=True):
- xi, mu, sigma = np.array([xi, mu, sigma]).astype(float)
- if trainable_xi:
- xi = tf.Variable(xi, name="xi")
- return tfd.GeneralizedExtremeValue(
- loc=tf.Variable(mu, name='mu'),
- scale=tf.Variable(sigma, name='sigma'),
- concentration=xi,
- )
+mu = data.mean(axis=0)
+Sigma_diag = data.var(axis=0)  # diagonal of the covariance matrix (assumes independent components)
+print('Mean: {}\nVar: {}'.format(mu, Sigma_diag))
For solving the exercises in this notebook you will need to use basic properties of tensorflow probability distributions. They have a very intuitive and convenient API - you get access to the probability density, cdf, quantile function and so on.
+sample_gev = get_gev(0.5)
+
How did the contamination influence the parameter estimation?
-Plot the GEV probability distribution for different values of $\xi, \mu$ and $\sigma$. How do they differ qualitatively?
-What are the domains of definition of $z$ in the above analytic expression for $G(z)$? What values should the c.d.f. $G(z)$ take outside these domains and how does this affect fitting $\xi, \mu, \sigma$ from data by maximum likelihood estimation?
-What expression for the GEV do we get in the limit $\xi \longrightarrow 0$?
-The three qualitatively different shapes of the GEV have their own names. For $\xi >0$ we get the Fréchet Distribution, for $\xi<0$ the reverse Weibull distribution and for $\xi=0$ the Gumbel distribution. Note that using the Gumbel distribution in tensorflow probability is not exactly the same as using GEV with $\xi=0$ due to rounding errors. Try it out!
+Compute scores and evaluate
gev = get_gev(xi=1e-5, sigma=2)
-arr = np.linspace(-5, 5)
-
-pdf = gev.prob(arr)
-plt.plot(arr, pdf)
-plt.show()
+# Mahalanobis distance from the mean of N(mu, Sigma)
+scores = np.sqrt(((data - mu) * (1/Sigma_diag) * (data - mu)).sum(axis=1))
+curves = evaluate(y, scores)
The cdf of the GEV distribution is well defined when $1 + \xi \left( \frac{z - \mu}{\sigma} \right) > 0$. This is equivalent to
$$
z > \mu - \frac{\sigma}{\xi} \qquad \text{if } \xi>0
$$
and
$$
z < \mu - \frac{\sigma}{\xi} \qquad \text{if } \xi<0.
$$
-Thus, for $\xi>0$, the distribution has a left boundary, the probability of points lying to the left of it is zero. The value of the cdf there is zero.
-For $\xi<0$ there is a right boundary, the probability of points lying to the right of it is zero and the value of the cdf is 1.
-As $\xi$ moves to zero from below, the right boundary is pushed to infinity. Similarly, if it approaches zero from above, the left boundary is pushed to negative infinity. At exactly $\xi=0$, the GEV becomes the Gumbel distribution which is well defined for all $z$.
We can group the numbers of calls according to the dates, thereby obtaining daily maxima and minima of calls. One way of detecting anomalies in the NYC taxi data set is by fitting a GEV to the distribution of these daily maxima. Here a histogram plot of the (normalized) maxima:
+Choose a threshold
daily_grouped = taxi_df.groupby("date")["n_calls"].agg(["max", "min", "sum"])
-daily_grouped["diff"] = daily_grouped["max"] - daily_grouped["min"]
-daily_grouped.head()
+def visualize_mahalanobis(data, y, scores, mu, sigma_diag, thr):
+ _, axes = plt.subplots(figsize=(6, 6))
+
+ # Visualize Data
+ scatter_gt = axes.scatter(data[:, 0], data[:,1], c=y)
+ plt.scatter(mu[0], mu[1], color='red')
+ axes.set_title('Ground Truth')
+ handles, _ = scatter_gt.legend_elements()
+ axes.legend(handles, ['Nominal', 'Anomaly'])
+ axes.set_aspect('equal')
+ # Draw decision contour
+ decision_border = Ellipse(
+ mu,
+ width=2*np.sqrt(sigma_diag[0])*thr,
+ height=2*np.sqrt(sigma_diag[1])*thr,
+ color='red',
+ fill=False
+ )
+ axes.add_patch(decision_border)
+
+ # Evaluate threshold
+ y_pred = scores > thr
+
+ precision = precision_score(y, y_pred)
+ recall = recall_score(y, y_pred)
+ f1 = f1_score(y, y_pred)
+
+ axes.set_title("Precision: {}\nRecall: {}\nF1: {}".format(precision, recall, f1))
+
+ plt.tight_layout()
+ plt.show()
daily_grouped_normalized = normalize_df(daily_grouped)
-daily_grouped_normalized.head()
+thr = None
+
+@interact(threshold=(0., 6.))
+def set_threshold(threshold):
+ global thr
+ thr = threshold
+ plt.show()
plt.hist(daily_grouped_normalized["max"], density=True, bins=40)
-plt.title("Daily maxima of n_calls/(30 minutes)")
-plt.show()
-
Q: Can you already spot the obvious anomalies? What caused them?
-A: See below
-Q: Which of the three qualitatively different shapes would make "physical" sense for the taxi calls data?
-A: The Weibull shape - thus we expect $\xi<0$.
- -maxima = daily_grouped_normalized["max"]
-maxima[(maxima > 1.8) | (maxima < -3.5)]
-
Now let us infer the parameters of the GEV from the data using maximum likelihood estimation. We will perform gradient descent on the negative log-likelihood of the GEV. Here is a very simple training loop, written out in detail, for a suitable initial choice of the shape parameter $\xi$ (it is called "concentration" in tensorflow probability):
- -# we are going to be a bit fancy and show an animation of the function as it is being fitted
-
-daily_max = daily_grouped_normalized["max"].values
-
-optimizer = keras.optimizers.SGD(learning_rate=2e-4)
-losses = []
-
-sample_gev = get_gev(xi=-0.1, trainable_xi=True)
-
-fig = plt.figure(dpi=200, figsize=(4.5, 3))
-camera = Camera(fig)
-
-for step in range(100):
- with tf.GradientTape() as tape:
- loss = - tf.math.reduce_sum(sample_gev.log_prob(daily_max))
- gradients = tape.gradient(loss, sample_gev.trainable_variables)
- optimizer.apply_gradients(zip(gradients, sample_gev.trainable_variables))
- losses.append(loss)
-
- bins = plt.hist(daily_max, bins=40, density=True, color="C0")[1]
- pdf = sample_gev.prob(bins)
- plt.plot(bins, pdf, color="orange")
- ax = plt.gca()
- ax.text(0.5, 1.01, f"{step=}, Loss={loss}", transform=ax.transAxes)
- camera.snap()
-
-plt.close()
-plt.figure()
-plt.plot(losses)
-plt.title("Negative Log Likelihood")
-plt.xlabel("gradient steps")
-plt.show()
-
Seems like after 100 steps we have already converged. Let us have a quick look at the result
- -bin_positions = plt.hist(daily_max, density=True, bins=25)[1]
-plt.plot(bin_positions, sample_gev.prob(bin_positions))
-plt.show()
-
HTML(camera.animate().to_html5_video())
-
sample_gev.trainable_variables
-
Well, we probably can do better...
- -Find a better fit using:
-Feel free to improve the code by defining new functions and so on!
-Evaluate the quality of your fit by visual inspection of a histogram and a Q-Q plot (more on that below).
-You can also use other statistical tools that you are familiar with.
-Use the fitted model to find "anomalies" for taxi calls corresponding to probabilities of less than $0.01$.
- -A Q-Q plot is useful for visually comparing two distributions, or comparing a distribution with a dataset. We are interested in the latter. In a Q-Q plot the quantiles of one distribution are plotted against the quantiles of another. For a dataset, the natural choice of quantiles is given simply by the sorted data itself. The data points then roughly correspond to the $\frac{k}{n+1}$th percentiles, where $n$ is the number of samples and $k=1,...,n$ (these are often called plotting positions and other choices for them are possible). The corresponding theoretical quantiles from some specified c.d.f. $F$ are then given by $q_k \ \text{s.t} \ F(q_k) = \frac{k}{n+1}$ (in our applications, $F$ will generally be injective and the $q_k$ uniquely defined).
-If the distribution is a good fit of the data, the resulting line will be close to the diagonal. Below we ask you to complete a simple implementation of the Q-Q plot for tensorflow-like distributions
- -ArrayLike = Sequence[Union[float, tf.Tensor]]
-
-
-class TFDistributionProtocol(Protocol):
- name: str
- trainable_variables: Tuple[tf.Variable]
-
- def quantile(self, prob: ArrayLike) -> ArrayLike: ...
-
def qqplot(data: ArrayLike, dist: TFDistributionProtocol):
- num_observations = len(data)
- observed_quantiles = sorted(data)
- plotting_positions = np.arange(1, num_observations + 1) / (num_observations + 1)
- theoretical_quantiles = dist.quantile(plotting_positions)
-
- plot_origin = (theoretical_quantiles[0], observed_quantiles[0])
- plt.plot(theoretical_quantiles, observed_quantiles)
- plt.plot(theoretical_quantiles, theoretical_quantiles) # adding a diagonal for visual comparison
- plt.xlabel(f"Theoretical quantiles of {dist.name}")
- plt.ylabel(f"Observed quantiles")
-
-
# setting up functions for normal and profile likelihood fit
-
-def fit_dist(data: ArrayLike, dist: TFDistributionProtocol, num_steps=100, lr=1e-4,
- plot_losses=True, return_animation=True) -> Union[float, Tuple[float, HTML]]:
- optimizer = keras.optimizers.SGD(learning_rate=lr)
- losses = []
-
- if return_animation:
- fig = plt.figure(dpi=200, figsize=(4.5, 3))
- camera = Camera(fig)
-
- for step in range(num_steps):
- with tf.GradientTape() as tape:
- loss = - tf.math.reduce_sum(dist.log_prob(data))
- if np.isnan(loss.numpy()):
- logging.warning(f"Encountered nan after {step} steps")
- break
-
- gradients = tape.gradient(loss, dist.trainable_variables)
- optimizer.apply_gradients(zip(gradients, dist.trainable_variables))
- losses.append(loss)
-
- if return_animation:
- bins = plt.hist(data, bins=50, density=True, color="C0")[1]
- pdf = dist.prob(bins)
- plt.plot(bins, pdf, color="orange")
- ax = plt.gca()
- ax.text(0.5, 1.01, f"{step=}, Loss={round(loss.numpy(), 2)}", transform=ax.transAxes)
- camera.snap()
-
-
- if plot_losses:
- plt.close()
- plt.figure()
- plt.plot(losses)
- plt.title("Negative Log Likelihood")
- plt.xlabel("gradient steps")
- plt.show()
-
- result = losses[-1]
- if return_animation:
- result = result, HTML(camera.animate().to_html5_video())
- return result
-
-def profile_fit_dist(data: ArrayLike, dist_factory: Callable[[float], TFDistributionProtocol], xi_values: Sequence[float],
- num_steps=100, lr=1e-4) -> Tuple[float, TFDistributionProtocol]:
- """
- Fits the distribution produced by dist_factory for each value in xi_values (profile likelihood over xi)
- and returns the tuple (minimal_loss, optimal_dist).
- """
- minimal_loss = np.infty
- optimal_dist = None
- for xi in xi_values:
- dist = dist_factory(xi)
- loss = fit_dist(data, dist, num_steps=num_steps, lr=lr, plot_losses=False, return_animation=False)
- if loss < minimal_loss:
- minimal_loss = loss
- optimal_dist = dist
- if optimal_dist is None:
- raise RuntimeError(f"Could not find optimal dist, probably due to divergences during fit. "
- "Try to find a better choice for xi_values")
- return minimal_loss, optimal_dist
-
# removing obvious anomalies
-daily_max_without_anomalies = daily_max[np.logical_and( daily_max < 1.8, daily_max > -3.5)]
-
-plt.hist(daily_max_without_anomalies, density=True, bins=40)
-plt.title("Daily maxima without anomalies")
-plt.show()
-
# Example with profile likelihood
-
-xi_values = np.linspace(-0.3, -0.5, 30)
-dist_factory = lambda xi: get_gev(xi, trainable_xi=False)
-min_loss, optimal_gev = profile_fit_dist(daily_max_without_anomalies, dist_factory, xi_values, num_steps=80)
-print(f"Minimal loss: {min_loss}")
-print(f"Optimal xi: {optimal_gev.concentration}")
-optimal_gev.trainable_variables
-
bin_positions = plt.hist(daily_max_without_anomalies, density=True, bins=40)[1]
-plt.plot(bin_positions, optimal_gev.prob(bin_positions))
-plt.title("Result from profile likelihood")
-plt.show()
-
# Solving with gradient descent on xi
-daily_max_gev = get_gev(xi=-0.4)
-final_loss, animation = fit_dist(daily_max_without_anomalies, daily_max_gev)
-
animation
-
# Here the values found by fitting
-daily_max_gev.trainable_variables
-
# and here the qqplot
-qqplot(daily_max_without_anomalies, daily_max_gev)
-
#The fit looks quite good, apart from the lower region, which we are not really interested in.
-#Let us find the anomalies corresponding to the upper 1% quantile
-upper_percentile = 0.99
-upper_quantile = daily_max_gev.quantile(upper_percentile).numpy()
-upper_quantile
-
# and here the anomalies above this threshold
-daily_grouped_normalized["max"][daily_grouped_normalized["max"] > upper_quantile]
-
In addition to the obvious anomalies found above, we caught Independence Day (one day before it). We also have the probabilistic interpretation that on 99% of days the maximal number of calls per 30 minutes will not exceed the threshold found above (it should be rescaled back to the original value for this statement to hold).
- -One benefit of the probabilistic approach is that we get confidence intervals almost for free. These can be used to estimate the robustness of our analysis (e.g. the determination of anomalies and the quality of the fit).
- -Since we fitted our functions using MLE, which is known to be approximately normal, we get uncertainty estimates from the second derivatives of the loss function. Fortunately, tensorflow makes this extremely easy for us.
- -def observed_fisher_information(data: ArrayLike, dist: TFDistributionProtocol) -> tf.Tensor:
- with tf.GradientTape() as t2:
- with tf.GradientTape() as t1:
- nll = - tf.math.reduce_sum(dist.log_prob(data))
- # conversion needed b/c trainable_vars is a tuple, so gradients and jacobians are tuples too
- g = tf.convert_to_tensor(
- t1.gradient(nll, dist.trainable_variables)
- )
- return tf.convert_to_tensor(t2.jacobian(g, dist.trainable_variables))
-
def mle_std_deviations(data: ArrayLike, dist: TFDistributionProtocol) -> tf.Tensor:
- observed_information_matrix = observed_fisher_information(data, dist)
- mle_covariance_matrix = tf.linalg.inv(observed_information_matrix)
- variances = tf.linalg.tensor_diag_part(mle_covariance_matrix)
- return tf.math.sqrt(variances)
-
Using the above functions, include error bars into the Q-Q plots of the maximum likelihood estimates of the GEV distribution found above.
- -# finding the stddevs and adding/substracting them from the values found from fitting
-std_devs = mle_std_deviations(daily_max_without_anomalies, daily_max_gev)
-print(f"Found std_devs: {std_devs}")
-
-coeff_fitted = tf.convert_to_tensor(daily_max_gev.trainable_variables)
-coeff_upper = coeff_fitted + std_devs
-coeff_lower = coeff_fitted - std_devs
-
-# creating GEVs corresponding to the boundaries of the confidence intervals found above
-gev_upper = get_gev(*coeff_upper)
-gev_lower = get_gev(*coeff_lower)
-
# The qqplots for the original GEV and the GEVs at the boundaries
-
-qqplot(daily_max_without_anomalies, daily_max_gev)
-qqplot(daily_max_without_anomalies, gev_upper)
-qqplot(daily_max_without_anomalies, gev_lower)
-
Now let us repeat the same analysis fitting the distribution of the daily minima using the same strategy. Since minima for a univariate random variable $X$ correspond to maxima of $-X$, all we have to do is to fit a GEV to the minima multiplied by -1.
- -neg_minima_series = -daily_grouped_normalized["min"]
-neg_daily_min = neg_minima_series.values
-
-plt.hist(neg_daily_min, density=True, bins=40)
-plt.title("Daily minima * (-1)")
-plt.show()
-
# identifying obvious anomalies
-neg_minima_series[(neg_minima_series>2) | (neg_minima_series<-2)]
-
# - 01/01 - New Year
-# - 01-02/11 - Marathon
-# - 26-27/01 - Snowstorm
-
neg_minima_without_anomalies = neg_daily_min[np.logical_and(neg_daily_min<2, neg_daily_min>-2)]
-plt.hist(neg_minima_without_anomalies, density=True, bins=40)
-plt.title("Daily minima * (-1) without obvious anomalies")
-plt.show()
-
daily_min_gev = get_gev(xi=-0.3)
-final_loss, animation = fit_dist(neg_minima_without_anomalies, daily_min_gev)
-
animation
-
qqplot(neg_minima_without_anomalies, daily_min_gev)
-
# Fit looks good in the region we are interested in, let us find the 1% quantile and the corresponding anomalies
-
upper_quantile = daily_min_gev.quantile(0.99).numpy()
-upper_quantile
-
neg_minima_series[neg_minima_series>upper_quantile]
-
# Only one non-obvious anomaly is found in the upper quantile; it is caused by the snowstorm responsible for the obvious anomalies we have seen above.
-
daily_means = daily_grouped_normalized["sum"]
-
-plt.plot(daily_means.values)
-plt.axhline(y=2., color='r', linestyle='-')
-plt.axhline(y=-2., color='r', linestyle='-')
-plt.show()
-
The big question here is: where to put the threshold? Clearly the assumption of a Gaussian distribution underlying the sum of daily calls is incorrect - the distribution seems skewed.
- -plt.hist(daily_means, bins=25)
-plt.show()
-
We can detect some anomalies with the Z-test, of course, but the probabilistic interpretation is going to be flawed.
- -daily_means[np.abs(daily_means) > 2]
-
So, what have we really done and why does it make sense to use the GEV for such problems? What kind of guarantees does the Fisher-Tipett-Genedenko theorem give us about the quality of the fit?
- -Well, the truth is, not too many. First notice the following exact equality:
$$
P(M_n < z) = P(X_1< z \text{ and } X_2 < z \text{ and } ... \text{ and } X_n < z) = F^n(z)
$$
So, if we know the cumulative distribution, there is no need to resort to the GEV. Typically, of course, we do not know it. The above equality implies:
$$
\lim_{n\rightarrow\infty} P(M_n < z) =
\begin{cases}
0 & \text{if}\ F(z) < 1 \\
1 & \text{otherwise}
\end{cases}
$$
We actually always know the exact limit of the distribution of the block maxima! It is degenerate (either a step function or identically zero). In fact, this degenerate distribution can be seen as a limit of the GEV. It would correspond to normalizing constants $a_n=1, \ b_n=0$.
- -While this observation is very simple and the difference between the cdf of block maxima $P(M_n < z)$ and its degenerate limit does decrease as $n$ increases, this limiting distribution is unexpressive and fitting it to data does not provide probabilistic insight.
- -Q: How many parameters does the exact limit of $F^n$ have? What would we get if we fit it to data?
-A:
- -Introducing the normalizing constants $a_n$ and $b_n$ might allow the distribution of renormalized block maxima to converge to something non-trivial. It also might not.
- -In applications we usually care about modeling $M_n$ for a _fixed $n_0$_ (or maybe for a few selected $n_i$). An arbitrary series of $a_n$ and $b_n$ that at some point helps convergence does not directly address our needs. In fact, this is also not what we do - by fitting the GEV parameters to data for our selected $n_0$ we automatically find the best $a_{n_0}$ and $b_{n_0}$ that minimize the difference between $F^{n_0}(z)$ and $G(z)$.
- -Clearly $G(z)$ is much more expressive than the degenerate exact limit and could potentially provide a good fit.
- -So, the convergence that we really care about is to answer the question:
-How well do the best fits of $G(z)$ for fixed $n$ - let us call them $G_n(z)$ - approximate the distributions $F^n(z)$ as $n$ increases? One could e.g. be interested in the infinity norm
-$$ -\Delta_n := \sup_z | F^n(z) - G_n(z) | -$$ -This is not the same as asking how well $G(z)$ approximates some rescaled variant of $F^n(z)$ with $n$-dependent normalization constants! That would be
-$$ -\tilde{\Delta}_n(a_n, b_n) := \sup_z |F^n(a_n z + b_n) - G(z) | -$$ -In the latter question, the choice of normalization constants matters, in the former it does not - they are implicitly determined by the best fit for each $n$. Since for $\Delta_n$ the $a_n, b_n$ have been optimized, one could reasonably expect a relation of the type
-$$ -\Delta_n \approx \min_{a_n, b_n} \tilde{\Delta}_n(a_n, b_n) -$$to hold.
- -It is easy to see that given some normalizing sequences $a_n, b_n$, the convergence to a GEV is possible, than with other sequences $\tilde{a}_n, \tilde{b}_n$ with some $a>0, b$ such that
-$$ -\lim_{n\rightarrow \infty} \frac{\tilde{a}_n}{a_n} = a \quad,\quad \lim_{n \rightarrow \infty} \frac{b_n-\tilde{b}_n}{a_n} = b -$$the rescaled $\frac{M_n-\tilde{b}_n}{\tilde{a_n}}$ also converges to a GEV of the same type (with the same $\xi$). This is often formulated that a distribution $F$ has a fixed domain of attraction. However, the error rates $\tilde{\Delta}_n(\tilde{a}_n, \tilde{b}_n)$ would be different from those associated to $a_n, b_n$.
- -Unfortunately, theoretical bounds for the quantity of interest $\Delta_n$ are hard to come by - we are not aware of any. They also highly depend on the fitting procedure, which is non-trivial, as we have seen above. There are some bounds for quantities of the type $\tilde{\Delta}_n(\tilde{a}_n, \tilde{b}_n)$ (see the annotated literature reference) but they are rather loose and not really helpful in practice. Therefore, the EVT theorems are more of a motivation for selecting distribution families for fitting than a rigorous approach with guarantees. In practice the convergence and fit tend to work pretty well, though.
One may wonder how the statement of the Fisher-Tippett-Gnedenko theorem is obtained without providing bounds on convergence. The reason is that the limiting distribution of (renormalized) maxima must have a very special property - it must be max-stable. It is instructive to go through a part of the proof to get a feeling for the EVT theorems. We will do so in this exercise.
-Definition: A cumulative distribution function $D(z)$ is called max-stable iff for all $n\in\mathbb{N} \ \exists \ \alpha_n>0, \beta_n \in \mathbb{R}$ such that
$$
D^n(z) = D(\alpha_n z + \beta_n)
$$
Prove that from $\lim_{n\rightarrow \infty} P\left( \frac{M_n - b_n}{a_n} < z \right) = G(z)$ it follows that $G(z)$ is max-stable.
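As an illustration of the definition (not a solution to the exercise), max-stability can be checked directly for the Gumbel member ($\xi = 0$) of the GEV family, $G(z) = \exp\left(-e^{-(z-\mu)/\sigma}\right)$:
$$
G^n(z) = \exp\left(-n\, e^{-(z-\mu)/\sigma}\right) = \exp\left(-e^{-(z-\mu-\sigma\ln n)/\sigma}\right) = G(z - \sigma \ln n),
$$
i.e. the definition holds with $\alpha_n = 1$ and $\beta_n = -\sigma \ln n$.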
-This goes a long way towards proving the first EVT theorem. One can easily compute that the GEV distribution is max-stable and with more effort one can also prove that any max-stable distribution belongs to the GEV family. Thus, the proof of the theorem is very implicit and does not involve any convergence rates or bounds.
- -According to the line of thought above, increasing the block-size before determining the maxima should improve convergence. Of course, it also decreases the number of points for fitting so it increases variance. We will analyze uncertainties of the fitted GEV below.
-Repeat the fit of the GEV for 2-day maxima/minima. What do you think about the result?
-Hint: use the .reshape method of numpy arrays on the already computed daily maxima/minima
- -bidaily_maxima = daily_max_without_anomalies.reshape(-1, 2).max(axis=1)
-
-plt.hist(bidaily_maxima, bins=40)
-plt.title("Bidaily maxima")
-plt.show()
-
bidaily_gev = get_gev(xi=-0.5)
-loss, animation = fit_dist(bidaily_maxima, bidaily_gev, lr=3e-4, num_steps=100)
-
bidaily_gev.trainable_variables
-
The shape parameter should be independent of the size of the block (it is not affected by $a_n$ and $b_n$) . -Of course, since we find it from fitting, we shouldn't be surprised to find a slightly different value.
- -animation
-
We get a better fit than before (less than half of the loss with half as many data points). -But we have higher variance in the very important shape parameter $\xi$
- -std_devs_daily = mle_std_deviations(daily_max_without_anomalies, daily_max_gev)
-std_devs_bidaily = mle_std_deviations(bidaily_maxima, bidaily_gev)
-
-print("Daily stddevs:")
-print(std_devs_daily.numpy())
-print("Biaily stddevs:")
-print(std_devs_bidaily.numpy())
-
So far we have only used the first theorem of EVT. As you might have noticed above, it can be somewhat wasteful when it comes to data efficiency. Since the GEV is fitted on block maxima, a huge number of data points remain unused for parameter estimation. The second theorem of EVT gives rise to a more efficient approach.
Use the approximations $\ln(1+x) \approx x$ for $|x| \ll 1$ and $F(z) \approx 1$ for large enough $z$ to derive.
\begin{equation}
P(X-u < y \mid X > u) \approx 1 - \left( 1 + \frac{\xi \, y}{\tilde{\sigma}} \right)^{-\frac{1}{\xi}} \label{GPD-approx-original}
\end{equation}
for large enough $u$ (this is a slightly less formal derivation of the Pickands–Balkema–De Haan theorem). One could equivalently write
\begin{equation}
P(X-u > y \mid X > u) \approx \left( 1 + \frac{\xi \, y}{\tilde{\sigma}} \right)^{-\frac{1}{\xi}}
\end{equation}
What is the relation between $\tilde{\sigma}$ and the normalizing coefficients of the first theorem of EVT?
- -The above equation can be used to estimate the entire tail of the cdf $F$ of $X$ from a sample of size $N$ obtained by sampling repeatedly from $F$. First note that for a single $u$ we can approximate the cdf through the sample statistics as:
-\begin{equation} -1-F(u) = P(X>u) \approx \frac{N_u}{N} -\end{equation}where $N_u$ is the number of samples with values above $u$. Interpreting $u$ as a threshold, we will call those samples peaks over threshold (PoT) and $N_u$ is simply their count.
- -Q: What should $u$ and the data set fulfill in order for the above approximation to be accurate?
-A: It should be small enough such that many data points are larger than it. Then the approximation in $P(X>u) \approx \frac{N_u}{N}$ holds (the estimator is not too biased).
- -Now we can perform a series of approximations for $z>u$ to get to the tail-distribution. First using $P(X>u) \approx \frac{N_u}{N}$ we get
-$$ -P(X>z) = P(X>z \cap X>u) = P(X>z \mid X>u) P(X>u) \approx \frac{N_u}{N} P(X>z \mid X>u) -$$ -Now we use the GDP theorem to approximate
-$$ -P(X>z \mid X>u) = P(X-u > z -u \mid X>u) \approx - \left( 1 + \frac{\xi (z-u)}{\tilde{\sigma}} \right)^{-\frac{1}{\xi}} -$$ -Putting everything together gives
-$$ -P(X>z) \approx \frac{N_u}{N} \left( 1 + \frac{\xi (z-u)}{\tilde{\sigma}} \right)^{-\frac{1}{\xi}} -$$ -Q: Intuitively, what does $u$ need to fulfill for both approximations to hold?
-A: $u$ should be small enough such that the approximation $P(X>u) \approx \frac{N_u}{N}$ holds and sufficiently large such that the generalized pareto distribution is a good estimate of the tail of the distribution for values larger than $u$. Intuitively, it should be at the beginning of the tail, where for values larger than $u$ only the tail behavior plays a role - i.e. no more local extrema or other specifics of the underlying distribution of the data.
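The resulting tail estimate is easy to implement directly. Below is a minimal sketch (the helper name `pot_tail_probability` is ours, not from the notebook); it assumes $\xi$ and $\tilde{\sigma}$ have already been fitted to the excesses over $u$.

import numpy as np

def pot_tail_probability(z, u, xi, sigma, n_exceed, n_total):
    """Approximate P(X > z) for z > u via (N_u / N) * (1 + xi*(z-u)/sigma)^(-1/xi)."""
    z = np.asarray(z, dtype=float)
    if np.isclose(xi, 0.0):                       # xi -> 0 limit: exponential tail
        gpd_sf = np.exp(-(z - u) / sigma)
    else:
        gpd_sf = (1.0 + xi * (z - u) / sigma) ** (-1.0 / xi)
    return (n_exceed / n_total) * gpd_sf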
- -This exercise lets you explore the second theorem of EVT for anomaly detection. Here we let you calculate and code on your own, without giving too many hints. You can follow the GEV-fitting code above for solving this exercise. Feel free to ask for hints if you are stuck!
-# We define the creation of the GPD analogous to the GEV above
-
-def get_gpd(xi: float, sigma=1., trainable_xi=True):
- xi, sigma = np.array([xi, sigma]).astype(float)
- if trainable_xi:
- xi = tf.Variable(xi, name="xi")
- return tfd.GeneralizedPareto(
- loc=0,
- scale=tf.Variable(sigma, name='sigma'),
- concentration=xi
- )
-
# GPD is fit directly on the thresholded data, no need for grouping
-
-n_calls = taxi_df_normalized["n_calls"].values
-
plt.hist(n_calls, bins=40)
-plt.show()
-
# seems like u=1 gives a good value for the beginning of "tail behaviour"
-u = 1
-thresholded_n_calls = n_calls[n_calls>u] - u
-plt.hist(thresholded_n_calls, bins=50)
-plt.show()
-
# obvious anomalies
-taxi_df_normalized[taxi_df_normalized["n_calls"]> u+1]
-
# we filter out some calls on (one day before) independence day, Columbus day, the marathon and New Year
-
-cleaned_thresholded_calls = thresholded_n_calls[thresholded_n_calls < 1]
-plt.hist(cleaned_thresholded_calls, bins=50, density=True)
-plt.title("Thresholded calls without obvious anomalies")
-plt.show()
+visualize_mahalanobis(data, y, scores, mu, Sigma_diag, thr)
# fitting the gpd. We need a small lr to not hit singularities
-# We bypass fitting xi here, instead using the xi found from fitting the GEV above.
-# Theory suggests that it should be close to the optimal value.
-# We could also profile around it or try full gradient, of course. The latter is brittle
+
+
+
-xi_gev = daily_max_gev.concentration.numpy()
-print(f"Using xi={xi_gev}")
-gpd = get_gpd(xi=xi_gev, sigma=1, trainable_xi=False)
-loss, animation = fit_dist(cleaned_thresholded_calls, gpd, lr=5e-6, num_steps=100)
-
-
-
+
animation
-
# now the qqplot and the stddev
-
-std = mle_std_deviations(cleaned_thresholded_calls, gpd)
-
-fitted_coeff = tf.convert_to_tensor(gpd.trainable_variables)
-coeff_upper = fitted_coeff + std
-coeff_lower = fitted_coeff - std
-
-gpd_upper = get_gpd(gpd.concentration.numpy(), *coeff_upper.numpy())
-gpd_lower = get_gpd(gpd.concentration.numpy(), *coeff_lower.numpy())
-
Choose a good threshold. You may write additional code to determine the threshold.
-qqplot(cleaned_thresholded_calls, gpd)
-qqplot(cleaned_thresholded_calls, gpd_upper)
-qqplot(cleaned_thresholded_calls, gpd_lower)
+thr_opt = 3.2 #
# finding the threshold corresponding to probability of 0.01 of the non-conditioned tail
-# For that, we rescale with our estimate of N_u
-
-N_u = len(cleaned_thresholded_calls)/len(thresholded_n_calls)
-percentile = 1- N_u*0.01
-
-q = u + gpd.quantile(percentile).numpy()
-q
-
data_test = np.concatenate([np.random.normal(0, [1, 1.5], size=(300, 2)), np.random.normal(3, 1.5, size=(10, 2))])
y_test = np.concatenate([np.zeros(300), np.ones(10)])  # ground-truth labels: the last 10 points are anomalies
-
# and the anomalies lying above it. We find the same ones
+scores_test = np.sqrt(((data_test - mu) * (1/Sigma_diag) * (data_test - mu)).sum(axis=1))
-n_calls = taxi_df_normalized["n_calls"]
-taxi_df_normalized[taxi_df_normalized["n_calls"] > q]
+visualize_mahalanobis(data_test, y_test, scores_test, mu, Sigma_diag, thr_opt)
We found new candidates for anomalies (or rare events). 10.01.2015 was the day following the Charlie Hebdo-related terrorist attacks, when there was a large march in Paris. Maybe there was additional movement across New York's large Jewish community. See e.g. this article.
-We could not find events that could have caused the large numbers of calls on the 18/10/2014 and the 22/11/2014.
-We also now have a probabilistic model for the tail of n_calls/30 minutes which might be useful for planning taxi availabilities on a more granular level than just per-day.
- -So far, we have completely ignored the time-series aspect of our data set. When using EVT for time series, as will often be the case in practice, seasonality, trends and so on need to be taken into account.
- -We have already seen a treatment of these topics for time series forecasting. Without going into details, we want to mention that the time-dependency can be to some extent taken into account in EVT by allowing for time-dependent parameters $\xi(t), \mu(t), \sigma(t)$.
- -There exist multiple strategies for finding these time dependent functions from data, the most straightforward one being MLE-fitting with sliding windows over a sample. One could also easily include known modulations into the MLE fitting, e.g. something like
$$
\mu(t) := \mu_0 \sin(t)
$$
might do the job if one knows that the underlying mean varies with $\sin(t)$. Then one only needs to fit $\mu_0$.
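A minimal sketch of this idea on synthetic data (all parameter values are hypothetical and scipy is assumed; only the amplitude $\mu_0$ is fitted, with shape and scale kept fixed for simplicity):

import numpy as np
from scipy.stats import genextreme
from scipy.optimize import minimize_scalar

t = np.linspace(0.0, 8 * np.pi, 400)
true_mu0, sigma, c = 2.0, 1.0, 0.0            # Gumbel member (xi = 0), hypothetical values
maxima = genextreme.rvs(c, loc=true_mu0 * np.sin(t), scale=sigma, random_state=3)

def neg_log_likelihood(mu0):
    # negative log-likelihood with the seasonal location mu(t) = mu0 * sin(t)
    return -np.sum(genextreme.logpdf(maxima, c, loc=mu0 * np.sin(t), scale=sigma))

fit = minimize_scalar(neg_log_likelihood, bounds=(0.0, 5.0), method="bounded")
print(f"fitted mu_0 = {fit.x:.3f} (true value {true_mu0})")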
- -Unfortunately, there are some incorrect claims about applications of EVT in the AD literature. The claims often involve an incorrect analysis of EVT for multivariate and multimodal distributions.
- -Note that the EVT theorems apply to univariate distributions. They also ignore multimodality as only tail-behaviour plays a role for them.
- -Results of EVT from one dimension cannot be directly transferred to higher dimensions, even for Gaussians. The cdf of the Mahalanobis radius is simply not dimension independent, see here for an exact expression for it. The attempt to do that leads to a bad fit and is sometimes called failure of classical EVT. Similar approaches and resulting claims have been tried on Gaussian mixtures.
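To see the dimension dependence concretely: for a $d$-dimensional Gaussian the squared Mahalanobis distance is $\chi^2$-distributed with $d$ degrees of freedom, so e.g. the radius containing 99% of the probability mass grows with $d$. A small sketch, assuming scipy is available:

import numpy as np
from scipy.stats import chi2

for d in (2, 5, 10, 50):
    radius_99 = np.sqrt(chi2.ppf(0.99, df=d))  # 99% Mahalanobis radius in d dimensions
    print(f"d={d:3d}: 99% Mahalanobis radius = {radius_99:.2f}")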
The NYC taxi data is very simple, so we could apply EVT to it directly. For multidimensional data these techniques don't work out of the box. However, as mentioned in the beginning, virtually all AD algorithms will produce a 1-dimensional score which can then be given a probabilistic meaning through EVT. We will explore this approach in the last exercise of this section.
The PoT method can be adapted to work on streams in a memory-efficient way, by automatically stripping off obvious anomalies and adjusting the threshold.
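A rough sketch of such a streaming variant (our own simplification with periodic refits; published algorithms such as SPOT/DSPOT handle initialization and incremental updates more carefully):

import numpy as np
from scipy.stats import genpareto

class StreamingPoT:
    def __init__(self, calibration, q=0.001, u_quantile=0.98, refit_every=100):
        # q must be smaller than the fraction of points exceeding u
        self.q, self.refit_every = q, refit_every
        self.u = np.quantile(calibration, u_quantile)
        self.n_total = len(calibration)
        self.excesses = list(calibration[calibration > self.u] - self.u)
        self._refit()

    def _refit(self):
        xi, _, sigma = genpareto.fit(np.asarray(self.excesses), floc=0.0)
        tail_frac = len(self.excesses) / self.n_total
        self.z_q = self.u + genpareto.ppf(1.0 - self.q / tail_frac, xi, loc=0.0, scale=sigma)

    def update(self, x):
        """Return True if x is flagged as anomalous; otherwise absorb it into the model."""
        self.n_total += 1
        if x > self.z_q:
            return True                           # anomaly: stripped off, not used for fitting
        if x > self.u:
            self.excesses.append(x - self.u)      # new peak over threshold
            if len(self.excesses) % self.refit_every == 0:
                self._refit()                     # periodically adjust the anomaly threshold
        return False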
We have seen how MLE with gradient descent is brittle and subject to divergences. There is a lot of literature containing bags of tricks for finding the MLE estimators for GEV and GPD distributions in a smarter, more robust way.
+ -One can also give up on MLE and use goodness-of-fit objectives to minimize the difference with the empirical cdf given by the data.
Generally, there is a large body of literature on EVT, although more in the engineering/math directions than for AD.
Using the anomaly scores from a data set and algorithm from yesterday (you can choose your favorite), perform an EVT analysis along the lines of what was done above. What are your conclusions? In which situations can such an analysis be useful in practical situations?
-Left to the reader
+
-
import numpy as np
+import itertools as it
+from tqdm import tqdm
+
+import matplotlib
+from matplotlib import pyplot as plt
+import plotly.express as px
+import pandas as pd
+
+import ipywidgets as widgets
+
+from tfl_training_anomaly_detection.exercise_tools import evaluate, get_kdd_data, get_house_prices_data, create_distributions, contamination, \
+perform_rkde_experiment, get_mnist_data
+
+from ipywidgets import interact
+
+from sklearn.metrics import roc_auc_score, average_precision_score
+from sklearn.model_selection import RandomizedSearchCV
+from sklearn.preprocessing import MinMaxScaler
+from sklearn.preprocessing import LabelBinarizer
+from sklearn.ensemble import IsolationForest
+from sklearn import metrics
+from sklearn.model_selection import train_test_split
+from sklearn.decomposition import PCA
+from sklearn.neighbors import KernelDensity
+
+from tfl_training_anomaly_detection.vae import VAE, build_decoder_mnist, build_encoder_minst, build_contaminated_minst
+
+from tensorflow import keras
+
+%matplotlib inline
+matplotlib.rcParams['figure.figsize'] = (5, 5)
+
Idea: Estimate the density of $F_0$. Areas of low density are anomalous.
+Definition
+Definition:
+Definition:
Let $D = \{x_1,\ldots,x_N\}\subset \mathbb{R}^p$. The KDE with kernel $K$ and bandwidth $h$ is
$$
KDE_h(x, D) = \frac{1}{N}\sum_{i=1}^N \frac{1}{h^p}K\left(\frac{|x-x_i|}{h}\right)
$$
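A minimal numpy sketch of this definition with a Gaussian kernel (the notebook itself relies on sklearn's KernelDensity; the helper name `kde_gaussian` is ours):

import numpy as np

def kde_gaussian(x, data, h):
    """Evaluate the Gaussian KDE at query points x (shape (m, p)) given data (shape (N, p))."""
    x, data = np.atleast_2d(x), np.atleast_2d(data)
    p = data.shape[1]
    diffs = x[:, None, :] - data[None, :, :]                 # (m, N, p)
    sq_dist = np.sum(diffs ** 2, axis=-1)                    # |x - x_i|^2
    kernel = np.exp(-0.5 * sq_dist / h ** 2) / (2 * np.pi) ** (p / 2)
    return kernel.mean(axis=1) / h ** p                      # average of the scaled kernels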
++ | + |
Play with the parameters!
+ +dists = create_distributions(dim=2, dim_irrelevant=0)
+
+sample_train = dists['Double Blob'].sample(500)
+X_train = sample_train[-1]
+y_train = [0]*len(X_train)
+
+plt.scatter(X_train[:,0], X_train[:,1], c = 'blue', s=10)
+plt.show()
+
# Helper function
+def fit_kde(kernel: str, bandwidth: float, X_train: np.array) -> KernelDensity:
+ """ Fit KDE
+
+ @param kernel: kernel
+ @param bandwidth: bandwidth
+ @param x_train: data
+ """
+ kde = KernelDensity(kernel=kernel, bandwidth=bandwidth)
+ kde.fit(X_train)
+ return kde
+
+def visualize_kde(kde: KernelDensity, bandwidth: float, X_test: np.array, y_test: np.array) -> None:
+ """Plot KDE
+
+ @param kde: KDE
+ @param bandwidth: bandwidth
+ @param X_test: test data
+ @param y_test: test label
+ """
+ fig, axis = plt.subplots(figsize=(5, 5))
+
+ lin = np.linspace(-10, 10, 50)
+ grid_points = list(it.product(lin, lin))
+ ys, xs = np.meshgrid(lin, lin)
+ # The score function of sklearn returns log-densities
+ scores = np.exp(kde.score_samples(grid_points)).reshape(50, 50)
+ colormesh = axis.contourf(xs, ys, scores)
+ fig.colorbar(colormesh)
+ axis.set_title('Density Contours (Bandwidth={})'.format(bandwidth))
+ axis.set_aspect('equal')
+ color = ['blue' if i ==0 else 'red' for i in y_test]
+ plt.scatter(X_test[:, 0], X_test[:, 1], c=color)
+ plt.show()
+
ker = None
+bdw = None
+@interact(
+ kernel=['gaussian', 'tophat', 'epanechnikov', 'exponential', 'linear', 'cosine'],
+ bandwidth=(.1, 10.)
+)
+def set_kde_params(kernel: str, bandwidth: float) -> None:
+ """Helper funtion to set widget parameters
+
+ @param kernel: kernel
+ @param bandwidth: bandwidth
+ """
+ global ker, bdw
+
+ ker = kernel
+ bdw = bandwidth
+
kde = fit_kde(ker, bdw, X_train)
+visualize_kde(kde, bdw, X_train, y_train)
+
The bandwidth is the most important parameter of a KDE model. A wrongly adjusted value will lead to over- or under-smoothing of the density curve.
A common method to select a bandwidth is maximum log-likelihood cross validation:
$$
h_{\textrm{llcv}} = \arg\max_{h}\frac{1}{k}\sum_{i=1}^k\sum_{y\in D_i}\log\left(\frac{k}{N(k-1)}\sum_{x\in D_{-i}}K_h(x, y)\right)
$$
where $D_{-i}$ is the data without the $i$th cross-validation fold $D_i$.
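A minimal sketch of this selection rule with sklearn (not the notebook's hyperopt_by_score helper; KernelDensity.score returns the total held-out log-likelihood, which is equivalent up to constants for equally sized folds):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KernelDensity

def llcv_bandwidth(X, bandwidths, k=5):
    """Return the bandwidth with the highest mean held-out log-likelihood."""
    folds = list(KFold(n_splits=k, shuffle=True, random_state=0).split(X))
    scores = []
    for h in bandwidths:
        fold_scores = []
        for train_idx, val_idx in folds:
            kde = KernelDensity(kernel="gaussian", bandwidth=h).fit(X[train_idx])
            fold_scores.append(kde.score(X[val_idx]))   # held-out log-likelihood
        scores.append(np.mean(fold_scores))
    return bandwidths[int(np.argmax(scores))]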
+ +ex no.1: Noisy sinusoidal
+ +# Generate example
+dists = create_distributions(dim=2)
+
+distribution_with_anomalies = contamination(
+ nominal=dists['Sinusoidal'],
+ anomaly=dists['Blob'],
+ p=0.05
+)
+
+# Train data
+sample_train = dists['Sinusoidal'].sample(500)
+X_train = sample_train[-1].numpy()
+
+# Test data
+sample_test = distribution_with_anomalies.sample(500)
+X_test = sample_test[-1].numpy()
+y_test = sample_test[0].numpy()
+
+scatter = plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test)
+handels, _ = scatter.legend_elements()
+plt.legend(handels, ['Nominal', 'Anomaly'])
+plt.gca().set_aspect('equal')
+plt.show()
+
param_space = {
+ 'kernel': ['gaussian', 'tophat', 'epanechnikov', 'exponential', 'linear', 'cosine'], # Add available kernels
+ 'bandwidth': np.linspace(0.1, 10, 100), # Define Search space for bandwidth parameter
+}
+
def hyperopt_by_score(X_train: np.array, param_space: dict, cv: int=5):
+ """Performs hyperoptimization by score
+
+ @param X_train: data
+ @param param_space: parameter space
+ @param cv: number of cv folds
+ """
+ kde = KernelDensity()
+
+ search = RandomizedSearchCV(
+ estimator=kde,
+ param_distributions=param_space,
+ n_iter=100,
+ cv=cv,
+ scoring=None # use estimators internal scoring function, i.e. the log-probability of the validation set for KDE
+ )
+
+ search.fit(X_train)
+ return search.best_params_, search.best_estimator_
+
Run the code below to perform hyperparameter optimization.
+ +params, kde = hyperopt_by_score(X_train, param_space)
+
+print('Best parameters:')
+for key in params:
+ print('{}: {}'.format(key, params[key]))
+
+test_scores = -kde.score_samples(X_test)
+test_scores = np.where(test_scores == np.inf, np.max(test_scores[np.isfinite(test_scores)])+1, test_scores)
+
+curves = evaluate(y_test, test_scores)
+
visualize_kde(kde, params['bandwidth'], X_test, y_test)
+
You are a company responsible for estimating house prices around Ames, Iowa, specifically in the College Creek ('CollgCr') area. But there is a problem: houses from a nearby area, 'Veenker', are often included in your dataset. You want to build an anomaly detection algorithm that filters out, one by one, every point that comes from the wrong neighborhood. You have been able to isolate an X_train dataset which, you are sure, contains only houses from the College Creek area. Following the previous example, test your ability to isolate anomalies in new incoming data (X_test) with KDE.
+Advanced exercise: +What happens if the contamination comes from other areas? You can choose among the following names:
+OldTown, Veenker, Edwards, MeadowV, Somerst, NPkVill, BrDale, Gilbert, NridgHt, Sawyer, Blmngtn, Blueste
+ +X_train, X_test, y_test = get_house_prices_data(neighborhood = 'CollgCr', anomaly_neighborhood='Veenker')
+X_train
+
# Total data
+train_test_data = pd.concat([X_train, X_test], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
+y_total = [0] * len(X_train) + y_test
+
+fig = px.scatter_3d(train_test_data, x='LotArea', y='OverallCond', z='SalePrice', color=y_total)
+
+fig.show()
+
# When data are highly in-homogeneous, like in this case, it is often beneficial
+# to rescale them before applying any anomaly detection or clustering technique.
+scaler = MinMaxScaler()
+X_train_rescaled = scaler.fit_transform(X_train)
+
param_space = {
+ 'kernel': ['gaussian', 'tophat', 'epanechnikov', 'exponential', 'linear', 'cosine'], # Add available kernels
+ 'bandwidth': np.linspace(0.1, 10, 100), # Define Search space for bandwidth parameter
+}
+params, kde = hyperopt_by_score(X_train_rescaled, param_space)
+
print('Best parameters:')
+for key in params:
+ print('{}: {}'.format(key, params[key]))
+
+X_test_rescaled = scaler.transform(X_test)
+test_scores = -kde.score_samples(X_test_rescaled)
+test_scores = np.where(test_scores == np.inf, np.max(test_scores[np.isfinite(test_scores)])+1, test_scores)
+curves = evaluate(y_test, test_scores)
+
The flexibility of KDE comes at a price. The dependency on the dimensionality of the data is quite unfavorable.
Theorem [Stone, 1982]: Any estimator that is consistent$^*$ for the class of all $k$-fold differentiable pdfs over $\mathbb{R}^d$ has a convergence rate of at most
$$
\frac{1}{n^{\frac{k}{2k+d}}}
$$
$^*$Consistency = for all pdfs $p$ in the class: $\lim_{n\to\infty}|KDE_h(x, D) - p(x)|_\infty = 0$ with probability $1$.
We take a look at a higher-dimensional version of our previous data set.
+ +dists = create_distributions(dim=3)
+
+distribution_with_anomalies = contamination(
+ nominal=dists['Sinusoidal'],
+ anomaly=dists['Blob'],
+ p=0.01
+)
+
+sample = distribution_with_anomalies.sample(500)
+
+y = sample[0]
+X = sample[-1]
+
fig = px.scatter_3d(x=X[:, 0], y=X[:, 1], z=X[:, 2], color=y)
+fig.show()
+
# Fit KDE on high dimensional examples
+rocs = []
+auprs = []
+bandwidths = []
+
+param_space = {
+ 'kernel': ['gaussian'],
+ 'bandwidth': np.linspace(0.1, 100, 1000), # Define Search space for bandwidth parameter
+ }
+
+kdes = {}
+dims = np.arange(2,16)
+for d in tqdm(dims):
+ # Generate d dimensional distributions
+ dists = create_distributions(dim=d)
+
+ distribution_with_anomalies = contamination(
+ nominal=dists['Sinusoidal'],
+ anomaly=dists['Blob'],
+ p=0
+ )
+
+ # Train on clean data
+ sample_train = dists['Sinusoidal'].sample(500)
+ X_train = sample_train[-1].numpy()
+ # Test data
+ sample_test = distribution_with_anomalies.sample(500)
+ X_test = sample_test[-1].numpy()
+ y_test = sample_test[0].numpy()
+
+ # Optimize bandwidth
+ params, kde = hyperopt_by_score(X_train, param_space)
+ kdes[d] = (params, kde)
+
+ bandwidths.append(params['bandwidth'])
+
+ test_scores = -kde.score_samples(X_test)
+ test_scores = np.where(test_scores == np.inf, np.max(test_scores[np.isfinite(test_scores)])+1, test_scores)
+
+
+
# Plot cross section of pdf
+fig, axes = plt.subplots(nrows=2, ncols=7, figsize=(15, 5))
+for d, axis in tqdm(list(zip(kdes, axes.flatten()))):
+
+ params, kde = kdes[d]
+
+ lin = np.linspace(-10, 10, 50)
+ grid_points = list(it.product(*([[0]]*(d-2)), lin, lin))
+ ys, xs = np.meshgrid(lin, lin)
+ # The score function of sklearn returns log-densities
+ scores = np.exp(kde.score_samples(grid_points)).reshape(50, 50)
+ colormesh = axis.contourf(xs, ys, scores)
+ axis.set_title("Dim = {}".format(d))
+ axis.set_aspect('equal')
+
+
+# Plot evaluation
+print('Crossection of the KDE at (0,...,0, x, y)')
+plt.show()
+
Another drawback of KDE in the context of anomaly detection is that it is not robust against contamination of the data
+Definition +The breakdown point of an estimator is the smallest fraction of observations that need to be changed so that we can +move the estimate arbitrarily far away from the true value.
+ +Example: The sample mean has a breakdown point of $0$. Indeed, for a sample of $x_1,\ldots, x_n$ we only need to +change a single value in order to move the sample mean in any way we want. That means that the breakdown point is +smaller than $\frac{1}{n}$ for every $n\in\mathbb{N}$.
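A small numeric sketch of this point (hypothetical data): corrupting a single observation moves the sample mean arbitrarily far, while the median barely changes.

import numpy as np

x = np.random.default_rng(4).normal(size=100)
x_corrupted = x.copy()
x_corrupted[0] = 1e6                                           # move one point far away

print("mean:  ", x.mean(), "->", x_corrupted.mean())           # explodes
print("median:", np.median(x), "->", np.median(x_corrupted))   # nearly unchanged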
+ +There are robust replacements for the sample mean:
+Switch from quadratic to linear loss at prescribed threshold
+ +import numpy as np
+
+
+def huber(error: float, threshold: float):
+ """Huber loss
+
+ @param error: base error
+ @param threshold: threshold for linear transition
+ """
+ test = (np.abs(error) <= threshold)
+ return (test * (error**2)/2) + ((1-test)*threshold*(np.abs(error) - threshold/2))
+
+x = np.linspace(-5, 5)
+y = huber(x, 1)
+
+plt.plot(x, y)
+plt.gca().set_title("Huber Loss")
+plt.show()
+
More complex loss function. It depends on three parameters $0 < a < b < r$.
+ +import numpy as np
+
+def single_point_hampel(error: float, a: float, b: float, r: float):
+ """Hampel loss
+
+ @param error: base error
+ @param a: 1st threshold parameter
+ @param b: 2nd threshold parameter
+ @param r: 3rd threshold parameter
+ """
+ if abs(error) <= a:
+ return error**2/2
+ elif a < abs(error) <= b:
+ return (1/2 *a**2 + a* (abs(error)-a))
+ elif b < abs(error) <= r:
+ return a * (2*b-a+(abs(error)-b) * (1+ (r-abs(error))/(r-b)))/2
+ else:
+ return a*(b-a+r)/2
+
+hampel = np.vectorize(single_point_hampel)
+
+x = np.linspace(-10.1, 10.1)
+y = hampel(x, a=1.5, b=3.5, r=8)
+
+plt.plot(x, y)
+plt.gca().set_title("Hampel Loss")
+plt.show()
+
Kernel as scalar product:
+$^*$All kernels that we have seen are radial and monotonic
We compare the performance of different approaches to recovering the nominal distribution under contamination. Here, we use code by Humbert et al. to replicate the results in the referenced paper on median-of-means KDE. More details on rKDE can be found in this paper by Kim and Scott.
+ +# =======================================================
+# Parameters
+# =======================================================
+algos = [
+ 'kde',
+ 'mom-kde', # Median-of-Means
+ 'rkde-huber', # robust KDE with huber loss
+ 'rkde-hampel', # robust KDE with hampel loss
+]
+
+dataset = 'house-prices'
+dataset_options = {'neighborhood': 'CollgCr', 'anomaly_neighborhood': 'Edwards'}
+
+outlierprop_range = [0.01, 0.02, 0.03, 0.05, 0.07, 0.1, 0.2, 0.3, 0.4, 0.5]
+kernel = 'gaussian'
+
auc_scores = perform_rkde_experiment(
+ algos,
+ dataset,
+ dataset_options,
+ outlierprop_range,
+ kernel,
+)
+
fig, ax = plt.subplots(figsize=(7, 5))
+for algo, algo_data in auc_scores.groupby('algo'):
+ x = algo_data.groupby('outlier_prop').mean().index
+ y = algo_data.groupby('outlier_prop').mean()['auc_anomaly']
+ ax.plot(x, y, 'o-', label=algo)
+plt.legend()
+plt.xlabel('outlier_prop')
+plt.ylabel('auc_score')
+plt.title('Comparison of rKDE against contamination')
+
Try using different neighborhoods for contamination. Which robust KDE algorithm performs better overall? Choose among the following options:
+OldTown, Veenker, Edwards, MeadowV, Somerst, NPkVill, BrDale, Gilbert, NridgHt, Sawyer, Blmngtn, Blueste
You can also change the kernel type: gaussian, tophat, epanechnikov, exponential, linear or cosine.
+ +
+