[Algorithm] Drain output aggregate log cluster count over time #11
Logs usually carry service information, and you need to define what the cluster concept means. Aggregating logs over time seems a practical way to detect abnormal status; this could be done simply through MAL + LAL + Alerting, rather than shipping logs to the AI core. We should define the boundaries of the AIEngine.
By log cluster, I mean the ones below. Each line is a cluster. Taking the first line as an example: with log timestamps from the OAP log index, a counter function can return the full trend for this log cluster, e.g. [104, 208, 306, 407, 200, 900, ...], and then it's just a plotting job.
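To make the counter idea concrete, here is a minimal sketch of such a counter function. The record shape `(timestamp_seconds, cluster_id)` and the function name are hypothetical, standing in for whatever the OAP log index actually returns; the point is only that per-cluster counts bucketed by time yield the trend series to plot.

```python
from collections import Counter

def cluster_trend(log_records, cluster_id, bucket_seconds=60):
    """Count occurrences of one log cluster per time bucket.

    log_records: iterable of (timestamp_seconds, cluster_id) tuples,
    a hypothetical shape for rows read from the OAP log index.
    Returns counts ordered by bucket start time.
    """
    buckets = Counter(
        int(ts // bucket_seconds) * bucket_seconds
        for ts, cid in log_records
        if cid == cluster_id
    )
    return [count for _, count in sorted(buckets.items())]

records = [(0, "c1"), (10, "c1"), (65, "c1"), (70, "c2"), (130, "c1")]
print(cluster_trend(records, "c1"))  # [2, 1, 1]
```

Each element of the returned list is one point of the plotted trend for that cluster.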
Thank you for the discussion! @wu-sheng As you said, the AIOps version can be switched on only when users decide to cluster logs from this service, because it's merely a byproduct of the log clustering algorithm as explained above. I do not intend this to compromise or replace OAP-side functionality; the advantage is that the rule-based solution and the AI solution can easily work together.
My major question is: why is this related to the UI, and how? I am not following the logical relationship between this UI proposal and your explanation of the anomalous log detection mechanism.
If you are going to do scanning reading, a.k.a. a full-table scan, it would perform poorly, and I am afraid you can't get all the data in time.
The AIEngine could send metrics to OAP, and OAP could export (changes required) raw log data to the AIEngine.
I think I see the point of confusion! Let me clarify. The TL;DR: a cluster means a group of logs in a service/pod; a template means the log format, without parameters, for that cluster. The naming of clusters is not related to service clusters. It's a general data science term that refers to a group of entities (here, logs within a service/instance) that are very similar, most likely differing only in their parameters. Think of it as a reverse process: reducing 10,000,000 logs to their base parts and sorting them into fewer than 100 groups (clusters), so humans can see quickly and clearly which types of logs may contribute to a service failure.
We strictly analyze service by service (or pod by pod). For example, from the output of the Drain algorithm, I picked 3 examples.
Say we have 4 logs like below.
The result will be 3 different groups (clusters). I hope this clarifies the questions; thanks for reading. @wu-sheng
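The 4-logs-to-3-clusters example above can be illustrated with a crude stand-in for Drain. This is not the Drain algorithm itself (which builds a parse tree incrementally); it is only a toy that masks numeric parameters to show what grouping into templates means. The log lines are invented for illustration.

```python
import re
from collections import defaultdict

def toy_template(line):
    """Crude stand-in for Drain: mask numeric tokens with <*> so that
    lines differing only in parameters share one template."""
    return re.sub(r"\b\d+\b", "<*>", line)

logs = [
    "Connected to 10.0.0.1",       # same template as the next line
    "Connected to 10.0.0.2",
    "Disk usage at 91 percent",
    "Service started",
]
clusters = defaultdict(list)
for line in logs:
    clusters[toy_template(line)].append(line)

print(len(clusters))  # 3
```

Four logs collapse into three clusters because the two "Connected to" lines differ only in their IP parameter.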
Yes, thanks for the explanation. The flow is clear to me now.
According to the flow, there is some background you should know. Considering production environment performance, there are not many logs, as debug/info logs are disabled.
Point understood; I will bear this in mind when testing, especially for tracebacks.
We will use the Redis HyperLogLog algorithm to do the counting job and directly send the counter metrics to OAP. |
This will be accomplished by keeping track of the template - service - log mapping during the clustering process and updating it to a topic, then doing aggregation.
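The track-and-flush step described above can be sketched in plain Python. All names here are hypothetical, and the topic is simulated as a list; a real deployment would publish to an actual message topic and could back the counters with Redis as mentioned above.

```python
from collections import defaultdict

class ClusterCounter:
    """Sketch (hypothetical names): count how many logs fall into each
    (service, template) pair during clustering, then flush the counters
    as metric messages to a topic for downstream aggregation."""

    def __init__(self):
        self.counts = defaultdict(int)

    def observe(self, service, template):
        # Called once per log line as the clusterer assigns a template.
        self.counts[(service, template)] += 1

    def flush(self, topic):
        # Emit one metric message per (service, template), then reset.
        for (service, template), n in self.counts.items():
            topic.append({"service": service, "template": template, "count": n})
        self.counts.clear()

topic = []
cc = ClusterCounter()
cc.observe("svc-a", "Connected to <*>")
cc.observe("svc-a", "Connected to <*>")
cc.observe("svc-a", "Service started")
cc.flush(topic)
print(len(topic))  # 2
```

Flushing on an interval turns the mapping into a stream of counter metrics that OAP can aggregate and store like any other metric.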
Once we have the log clusters learnt by Drain, we can enable anomaly detection without additional algorithms by simply plotting them over time.
The idea is simple: if the count for a type of log surges or suddenly drops at some points in time, it may be an anomaly given its content; it is up to the human operator to decide further (it may just be a normal increase in access).
So we essentially generate a metric for the clustered logs, one per cluster, and plot them in the SkyWalking UI.
See below for what I mean.
It's just an idea for now; since most of the integration work is on the UI, the metrics calculation should probably also be done on the SkyWalking side before visualization.
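The "surge or sudden drop" heuristic described earlier can be sketched as a trailing-window check. This is a naive illustration, not part of any SkyWalking API; the function name, window, and factor are invented, and the series reuses the example trend values from earlier in the thread.

```python
def looks_anomalous(series, window=3, factor=2.0):
    """Flag the latest count if it deviates from the mean of the
    preceding `window` points by more than `factor` times in either
    direction (naive heuristic for the surge/drop idea)."""
    if len(series) <= window:
        return False  # not enough history to judge
    baseline = sum(series[-window - 1:-1]) / window
    latest = series[-1]
    if baseline == 0:
        return latest > 0  # any activity after total silence is notable
    return latest > factor * baseline or latest < baseline / factor

print(looks_anomalous([104, 208, 306, 407, 200, 900]))  # True
```

A flagged point is only a hint for the operator to inspect that cluster's content, matching the "human decides" framing above.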