[Algorithm] Implement real-time anomaly detection for metrics #7
Thanks! For the algorithms, we can install the package directly or add it as a git submodule from your repository.
For now, although the algorithms have not been evaluated yet, I personally prefer SPOT. It can clearly show dynamic upper and lower bounds (some commercial products, like Datadog, have this feature), which is more user-friendly than the others. Of course, Prophet is also an option.
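(For illustration only, a minimal sketch of what dynamic upper/lower bounds could look like. This is a toy rolling-quantile threshold, not the actual SPOT algorithm, and the class name is invented here.)

```python
from collections import deque


class RollingBoundsDetector:
    """Toy dynamic-bounds detector (NOT the SPOT algorithm): the upper and
    lower thresholds are simply rolling quantiles over a sliding window."""

    def __init__(self, window: int = 200, q: float = 0.99):
        self.history = deque(maxlen=window)
        self.q = q

    def update(self, value: float) -> bool:
        """Feed one point; return True if it falls outside the current bounds."""
        is_anomaly = False
        if len(self.history) >= 30:  # small warm-up before judging anything
            ordered = sorted(self.history)
            upper = ordered[int(self.q * (len(ordered) - 1))]
            lower = ordered[int((1 - self.q) * (len(ordered) - 1))]
            is_anomaly = value > upper or value < lower
        self.history.append(value)
        return is_anomaly
```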
That is great; let's keep this preference in mind and test it. I will provide you with a gRPC implementation for exporting metrics if needed, but for very early testing, a simple generator function will suffice to mock a stream.
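(A generator along these lines would do for mocking; the signal shape and the function name are just placeholders, not the real exporter.)

```python
import math
import random
import time
from typing import Iterator


def mock_metric_stream(interval: float = 1.0, anomaly_rate: float = 0.01) -> Iterator[float]:
    """Hypothetical stand-in for the gRPC exporter: an endless noisy sine
    wave with occasional spikes, so detectors can be tested locally."""
    t = 0
    while True:
        value = 10 + 5 * math.sin(t / 20) + random.gauss(0, 0.5)
        if random.random() < anomaly_rate:
            value += random.choice([-1, 1]) * random.uniform(10, 20)
        yield value
        t += 1
        time.sleep(interval)
```

Downstream code can simply iterate over `mock_metric_stream()` until the real exporter is wired in.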
Recently, I have been working on the benchmark. I realize that a single detector may not generalize well enough, so I am trying to introduce AutoML-related techniques to automatically select the best among different detectors.
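(One very simple form the selection step could take, assuming a labelled benchmark series is available; the function signature and scoring choice below are placeholders for illustration, not the actual AutoML approach being evaluated.)

```python
from typing import Callable, Dict, List, Sequence


def select_best_detector(
    candidates: Dict[str, Callable[[Sequence[float]], List[bool]]],
    series: Sequence[float],
    labels: Sequence[bool],
) -> str:
    """Toy model-selection step: score each candidate detector on a labelled
    benchmark series and return the name of the best one by F1 score."""

    def f1(pred: List[bool]) -> float:
        tp = sum(p and l for p, l in zip(pred, labels))
        fp = sum(p and not l for p, l in zip(pred, labels))
        fn = sum((not p) and l for p, l in zip(pred, labels))
        if tp == 0:
            return 0.0
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return 2 * precision * recall / (precision + recall)

    return max(candidates, key=lambda name: f1(candidates[name](series)))
```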
Many commercial vendors use such techniques to provide reliable results; I believe it's the right direction to go. Good luck, and keep me updated so we can collaborate.
@Fengrui-Liu Are the metrics algorithms trained incrementally online or periodically offline? (I checked the SPOT paper and they say both are doable, but I don't know the actual tradeoff.) I'm considering this in terms of orchestration. When many models (one or more per metric stream) need to be trained at the same time, it introduces overhead on a single engine node that also has other computation to do (log analysis, ingestion, inference, etc.). Python cannot handle that much easily without multiprocessing, and that will most likely lead to unmaintainable code. So we had best scale the learners out, either with a periodic learning task scheduler (Airflow) or by assigning continuous learning tasks to 1-N analyzer nodes. The final design will have the engine core, data ingestion, and analyzers (the actual learner workers) each as standalone modules, so it naturally has the basis for scaling and each part can work/die independently.
So far, all the algorithms we have implemented are trained incrementally.
Exactly, computational cost also needs to be considered. I think this can be a reason why those commercial products do not deploy complex models. But in my opinion,
Both options are OK for our detectors for now. For periodic detection, we can use
This can be achieved by instantiating objects. @Superskyyy
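(A sketch of the "instantiating objects" idea: one detector instance per metric stream, created lazily as new metric names arrive. `SomeDetector` is a placeholder for whichever detector class ends up being chosen.)

```python
from typing import Dict


class SomeDetector:
    """Placeholder for whatever incremental detector is chosen (SPOT, etc.)."""

    def update(self, value: float) -> bool:
        return False  # a real detector would score the point here


detectors: Dict[str, SomeDetector] = {}


def on_data_point(metric_name: str, value: float) -> bool:
    """Route each incoming point to the detector owned by its metric stream,
    creating the instance on first sight of a new metric name."""
    detector = detectors.setdefault(metric_name, SomeDetector())
    return detector.update(value)
```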
Good insights. I'm deciding to move away from Airflow (it was never intended for streaming ETL purposes); we will rely only on a simple MQ to implement the orchestration. In the end, this is a secondary system to a secondary system (the monitoring platform) and should be as simple to learn as possible.
Yes, this is intended behaviour; the SkyWalking metrics exporter natively supports partial subscription.
Standalone modules are a common pattern in today's containerized deployments. In our case, each node communicates only via a Redis task queue; they don't even need to know of each other's existence. In a local machine installation, everything will still be bundled together without any remote nodes (which I'm implementing right now; ideal for testing and the first release).
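(A minimal sketch of that Redis-based orchestration, assuming the redis-py client; the queue name and payload shape are invented for illustration and may not match the real implementation.)

```python
import json

import redis  # redis-py client

QUEUE = "anomaly-detection:tasks"  # hypothetical queue name
r = redis.Redis(host="localhost", port=6379)


def submit_task(metric_name: str, action: str = "train") -> None:
    """Called by the engine core to hand work off to analyzer nodes."""
    r.rpush(QUEUE, json.dumps({"metric": metric_name, "action": action}))


def analyzer_loop() -> None:
    """Run on each analyzer node; blocks until a task arrives, so nodes can
    join or die independently without knowing about each other."""
    while True:
        _, raw = r.blpop(QUEUE)
        task = json.loads(raw)
        # ... dispatch to the appropriate detector instance here ...
        print(f"processing {task['action']} for {task['metric']}")
```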
The goals of our project are: