[role="xpack"]
[[ml-ad-troubleshooting]]
= Troubleshooting {ml} {anomaly-detect} and frequently asked questions
++++
<titleabbrev>Troubleshooting and FAQ</titleabbrev>
++++

Use the information in this section to troubleshoot common problems and find
answers for frequently asked questions.

[discrete]
[[ml-ad-restart-failed-jobs]]
== How to restart failed {anomaly-jobs}

include::ml-restart-failed-jobs.asciidoc[]

[discrete]
[[faq-methods]]
== What {ml} methods are used for {anomaly-detect}?

For detailed information, refer to the paper
https://www.ijmlc.org/papers/398-LC018.pdf[Anomaly Detection in Application Performance Monitoring Data]
by Thomas Veasey and Stephen Dodson, as well as our webinars on
https://www.elastic.co/elasticon/conf/2018/sf/the-math-behind-elastic-machine-learning[The Math behind Elastic Machine Learning] and
https://www.elastic.co/elasticon/conf/2017/sf/machine-learning-and-statistical-methods-for-time-series-analysis[Machine Learning and Statistical Methods for Time Series Analysis].

Further papers cited in the C++ code:

* http://arxiv.org/pdf/1109.2378.pdf[Modern hierarchical, agglomerative clustering algorithms]
* https://www.cs.umd.edu/~mount/Projects/KMeans/pami02.pdf[An Efficient k-Means Clustering Algorithm: Analysis and Implementation]
* http://www.stat.columbia.edu/~madigan/PAPERS/techno.pdf[Large-Scale Bayesian Logistic Regression for Text Categorization]
* https://www.cs.cmu.edu/~dpelleg/download/xmeans.pdf[X-means: Extending K-means with Efficient Estimation of the Number of Clusters]

[discrete]
[[faq-features]]
== What are the input features used by the model?

All input features are specified by the user, for example, using
https://www.elastic.co/guide/en/machine-learning/current/ml-functions.html[diverse statistical functions]
like count or mean over the data of interest.

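For example, the following sketch of a job configuration defines a single
detector that models the mean of a metric, split by a categorical field. The
job name, field names, and bucket span (`example-response-times`,
`responsetime`, `airline`, `15m`) are illustrative placeholders rather than
values taken from this documentation:

[source,console]
----
PUT _ml/anomaly_detectors/example-response-times
{
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [
      {
        "function": "mean",
        "field_name": "responsetime",
        "by_field_name": "airline"
      }
    ],
    "influencers": [ "airline" ]
  },
  "data_description": {
    "time_field": "@timestamp"
  }
}
----

In this sketch, `mean(responsetime)` split by `airline` is the only input
feature; the model only analyzes the fields referenced in the detector and
influencer configuration.
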
[discrete]
[[faq-data]]
== Does the data used by the model only include customers' data?

Yes. Only the data specified in the {anomaly-job} configuration are used for
detection.

[discrete]
[[faq-output-score]]
== What does the model output score represent? How is it generated and calibrated?

The ensemble model generates a probability value, which is then mapped to an
anomaly severity score between 0 and 100. The lower the probability of the
observed data, the higher the severity score. Refer to the
<<ml-ad-explain,advanced concept documentation>> for details. Calibration (also
called normalization) happens on two levels:

. Within the same metric/partition, the scores are re-normalized “back in time”
within the window specified by the `renormalization_window_days` parameter.
This is why, for example, both `record_score` and `initial_record_score` exist.
. Across multiple partitions, scores are renormalized as described in
https://www.elastic.co/blog/changes-to-elastic-machine-learning-anomaly-scoring-in-6-5[this blog post].

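As a sketch of how these scores can be inspected, the following request
retrieves the highest-scoring records for a job by using the get records API.
The job name (`example-response-times`) is an illustrative placeholder:

[source,console]
----
GET _ml/anomaly_detectors/example-response-times/results/records
{
  "sort": "record_score",
  "desc": true,
  "page": { "size": 10 }
}
----

In the response, `record_score` is the current (renormalized) value, while
`initial_record_score` is the value calculated when the bucket was first
processed.
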
[discrete]
[[faq-model-update]]
== Is the model static or updated periodically?

It is an online model that is updated continuously. Old parts of the model are
pruned based on the `model_prune_window` parameter (usually 30 days).

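Both this pruning window and the renormalization window described in the
previous answer are set in the job configuration. The following sketch shows
where they live; the job name and the values are illustrative only:

[source,console]
----
PUT _ml/anomaly_detectors/example-pruning-job
{
  "analysis_config": {
    "bucket_span": "15m",
    "model_prune_window": "30d",
    "detectors": [
      {
        "function": "mean",
        "field_name": "responsetime"
      }
    ]
  },
  "data_description": {
    "time_field": "@timestamp"
  },
  "renormalization_window_days": 30
}
----
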
[discrete]
[[faq-model-performance]]
== Is the performance of the model monitored?

There is a set of benchmarks to monitor the performance of the {anomaly-detect}
algorithms and to ensure no regression occurs as the methods are continuously
developed and refined. They are called "data scenarios" and consist of three
things:

* a dataset (stored as an {es} snapshot),
* a {ml} config ({anomaly-detect}, {dfanalysis}, {transform}, or {infer}),
* an arbitrary set of static assertions (bucket counts, anomaly scores, accuracy
value, and so on).

Performance metrics are collected from every scenario run and persisted in an
Elastic Cloud cluster. This information is then used to track performance over
time and across different builds, mainly to detect any regressions in
performance (both result quality and compute time).

On the customer side, the situation is different. There is no conventional way
to monitor the model performance as it is unsupervised. Usually,
operationalization of the model output includes one or several of the following
steps:

* Creating alerts for influencers, buckets, or records based on a certain
anomaly score.
* Using the forecasting feature to predict the development of the metric of
interest in the future, as shown in the example below.
* Using one or a combination of multiple {anomaly-jobs} to identify the
significant anomaly influencers.

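As an example of the forecasting step, a forecast can be started for an
existing job with the forecast API; the job name and duration below are
placeholders:

[source,console]
----
POST _ml/anomaly_detectors/example-response-times/_forecast
{
  "duration": "3d"
}
----

The resulting predictions are stored alongside the job results and can be
reviewed in the *Single Metric Viewer* in {kib}.
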
[discrete]
[[faq-model-accuracy]]
== How can you measure the accuracy of the unsupervised {ml} model?

For each record in a given time series, the {anomaly-detect} model provides an
anomaly severity score, 95% confidence intervals, and an actual value. This data
is stored in an index and can be retrieved by using the get records API. With
this information, you can use standard measures to assess prediction accuracy,
interval calibration, and so on. {es} aggregations can be used to compute these
statistics.

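As a sketch, the following aggregation computes the daily maximum
`record_score` for one job directly from the shared results index; the job name
is a placeholder and querying `.ml-anomalies-*` requires sufficient privileges:

[source,console]
----
GET .ml-anomalies-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "job_id": "example-response-times" } },
        { "term": { "result_type": "record" } }
      ]
    }
  },
  "aggs": {
    "daily": {
      "date_histogram": { "field": "timestamp", "calendar_interval": "1d" },
      "aggs": {
        "max_score": { "max": { "field": "record_score" } }
      }
    }
  }
}
----
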
The purpose of {anomaly-detect} is to achieve the best ranking of periods where
an anomaly happened. A practical way to evaluate this is to keep track of real
incidents and see how well they correlate with the predictions of
{anomaly-detect}.

[discrete]
[[faq-model-drift]]
== Can the {anomaly-detect} model experience model drift?

The Elastic {anomaly-detect} model continuously learns and adapts to changes in
the time series. These changes can take the form of slow drifts as well as
sudden jumps. Therefore, we take great care to manage the adaptation to changing
data characteristics. There is always a fine trade-off between fitting anomalous
periods (over-fitting) and not learning new normal behavior. The following are
the main approaches Elastic uses to manage this trade-off:

* Learning the optimal decay rate based on measuring the bias in the forecast
and the moments of the error distribution.
* Allowing continuous small drifts in periodic patterns. This is achieved by
continuously minimizing the mean prediction error over the last iteration with
respect to a small bounded time shift.
* Testing whether the time series has undergone a sudden change if the
predictions are significantly wrong over a long period of time. Hypothesis
testing is used to test for different types of changes, such as scaling of
values, shifting of values, and large time shifts in periodic patterns such as
daylight saving time.
* Running continuous hypothesis tests on time windows of various lengths to test
for significant evidence of new or changed periodic patterns, and updating the
model if the null hypothesis of unchanged features is rejected.
* Accumulating error statistics on calendar days and continuously testing
whether predictive calendar features need to be added or removed from the model.

[discrete]
[[faq-minimum-data]]
== What is the minimum amount of data for an {anomaly-job}?

Elastic {ml} needs a minimum amount of data to be able to build an effective
model for {anomaly-detect}.

* For sampled metrics such as `mean`, `min`, `max`, and `median`, the minimum
data amount is either eight non-empty bucket spans or two hours, whichever is
greater.
* For all other non-zero/null metrics and count-based quantities, it is four
non-empty bucket spans or two hours, whichever is greater.
* For the `count` and `sum` functions, empty buckets matter and therefore the
requirement is the same as for sampled metrics: eight bucket spans or two
hours, whichever is greater.
* For the `rare` function, it is typically around 20 bucket spans. It can be
faster for population models, but it depends on the number of people that
interact per bucket.

For example, with a 30-minute bucket span, a `mean` detector needs at least
eight non-empty buckets, that is, four hours of data.

Rules of thumb:

* more than three weeks for periodic data or a few hundred buckets for
non-periodic data
* at least as much data as you want to forecast

[discrete]
[[faq-data-integrity]]
== Are there any checks or processes to ensure data integrity?

The Elastic {ml} algorithms are programmed to work with missing and noisy data
and use denoising and data imputation techniques based on the learned
statistical properties.