[Algorithm] Implement incremental clustering of streaming logs based on Drain #5
FYI @Liangshumin
Drain initial
Has the Drain method been implemented already? Or are we going to implement it ourselves?
@wu-sheng It could be more than a minimal modification, considering the original code base's LOC. I'm thinking of keeping a fork here in the org for our features; sometime in the future we can cherry-pick and contribute these features back upstream. It seems unnecessary to rewrite it entirely, given that the core parts are already very well written and continuously receiving updates.
An interesting thing I found earlier: Drain has an updated version (in its journal paper) called DAG Drain. This more recent paper describes an auto parameter tuning method with a DAG implementation; do take a deep look. @Liangshumin
Could you share from what perspective we are going to change it? Could we try to do
About
In the short run (before 0.1.0) I will only ask @Liangshumin to optimize the input layer to use a cache lookup so we can speed it up a bit more; that should be doable by wrapping the library without needing to submodule it. Anyway, anything can be overridden, so that's not a problem. In the longer run, I or someone else will work on rewriting some key methods in Cython or parallelizing the tree state with multiprocessing. Considering the original algorithm's core is under 500 lines of code, that would amount to a major rewrite, and I'd like to contribute those changes back upstream. @wu-sheng BTW, could you advise how many logs per second a medium-scale SkyWalking deployment typically receives? I don't have a good sense of this, but I'd like to understand the requirements so we can avoid under- or over-optimizing. It's already a very fast algorithm, but it could be extremely fast with the optimizations above.
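For reference, here is a minimal sketch of what that input-layer cache could look like. This is not the actual implementation; it assumes a hypothetical Drain object exposing `add_log_message(line)` that returns a cluster object with a `size` counter (real Drain ports differ in naming). Exact-duplicate raw lines skip the prefix-tree search entirely:

```python
from collections import OrderedDict

class CachedDrain:
    """Sketch of the proposed input-layer cache: identical raw lines
    bypass the Drain tree search via an LRU lookup table."""

    def __init__(self, drain, max_entries=65536):
        self.drain = drain
        self.cache = OrderedDict()  # raw line -> matched cluster (LRU order)
        self.max_entries = max_entries

    def add_log_message(self, line):
        cluster = self.cache.get(line)
        if cluster is not None:
            self.cache.move_to_end(line)  # refresh LRU position
            cluster.size += 1             # assumed per-cluster counter
            return cluster
        # Cache miss: fall through to the full tree search.
        cluster = self.drain.add_log_message(line)
        self.cache[line] = cluster
        if len(self.cache) > self.max_entries:
            self.cache.popitem(last=False)  # evict least-recently-used entry
        return cluster
```

Since log streams are highly repetitive, the hit rate on exact duplicates should be substantial, which is why this wrapper alone could yield a noticeable speedup without touching the upstream code.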
Fair enough, since most research-reproduction code projects are under MIT.
Don't worry about performance. All open source starts from an MVP, and then we run benchmarks to see its capability.
That's true, thank you for the advice. Anyway, I just tested it: the algorithm alone can handle at least 10k+ raw logs per second on a 5-million-record dataset with 200+ patterns, which seems stable enough.
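The informal test above can be reproduced with a throughput loop along these lines (a sketch, again assuming a hypothetical `add_log_message` entry point):

```python
import time

def benchmark(drain, lines):
    """Feed raw lines through the matcher and report lines per second."""
    start = time.perf_counter()
    for line in lines:
        drain.add_log_message(line)
    elapsed = time.perf_counter() - start
    return len(lines) / elapsed

# e.g. rate = benchmark(drain, open("logs.txt").read().splitlines())
```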
BTW, for exporting logs to this engine, we could provide a throughput-limit sampling mechanism, such as 10k/s as the maximum export rate, which would make the AI engine's payload predictable.
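One way to realize that limit is a simple token bucket on the export side. The following is a hypothetical sketch, not an existing SkyWalking API: records are admitted at up to `rate` per second and the excess is dropped (sampled out), keeping the engine's inbound load bounded.

```python
import time

class ThroughputLimiter:
    """Token-bucket sampler: admit at most `rate` records/second."""

    def __init__(self, rate=10_000, burst=None):
        self.rate = rate
        self.capacity = burst or rate   # max tokens accumulated while idle
        self.tokens = float(self.capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # over the 10k/s budget: drop this record
```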
The DAG version of Drain seems unnecessary and would not provide much enhancement. I'm formalizing the algorithm implementation in our code base.
After the initial evaluation phase, we will start our log analysis feature by implementing the Drain method.
The goal of this algorithm is to ingest a stream of raw log records and produce the most likely match for each log among the learnt template clusters.
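As a usage illustration of that ingest-and-match loop, here is what the streaming interface looks like with the open-source Drain3 port (shown only for illustration; the exact class and method names in our fork may differ):

```python
from drain3 import TemplateMiner  # open-source Drain port, for illustration

miner = TemplateMiner()
for raw_line in [
    "connected to 10.0.0.1",
    "connected to 10.0.0.2",
    "disk full on /dev/sda1",
]:
    # Each record is matched to (or creates) a learnt template cluster.
    result = miner.add_log_message(raw_line)
    print(result["cluster_id"], result["template_mined"])
```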
- Drain initial evaluation
- Drain implementation
- Drain tuning & experiments
Based on the second paper, auto parameter tuning is definitely a huge plus for our use case; we should pursue implementing it.