Performance issue with number_observed #67
Comments
One way to mitigate this issue would be to extend the work done in #64 for observation expressions to comparison expressions (see cti-pattern-matcher/stix2matcher/matcher.py, lines 1368 to 1383 in bcd37ee).
It might also be better to evaluate all [...]. I probably won't have time to do it in the foreseeable future.
When working with real low-level data (such as data from EDRs or Sysmon), we are experiencing a huge performance degradation. As an example, 15754 observables are generated out of 100 original observables. In another example, 189023 instances are generated out of 300 original observables. When comparing the two examples above with a version without instance duplication, the timing is as follows:
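Independently of the timings, the instance counts quoted in this comment already imply the size of the blow-up. A quick calculation, using only those two pairs of numbers (the variable names are illustrative):

```python
# (original observables, generated instances) as quoted in the comment above
examples = [(100, 15754), (300, 189023)]

# Expansion factor: generated instances per original observable
factors = [generated / original for original, generated in examples]

for (original, generated), factor in zip(examples, factors):
    print(f"{original} observables -> {generated} instances ({factor:.1f}x)")
```

That is roughly a 158-fold expansion in the first case and a 630-fold expansion in the second.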
The `number_observed` attribute of observed-data incurs a linear cost to the matcher, which seems to be related to the way it is used internally to multiply the SDO. Below are some profiles of the same test but using 1000 vs 10000 `number_observed`. The profiles look a bit different just because the profiler includes more in the call graph due to longer execution time.
The examples below use 1000 & 10000 just to be illustrative with a single SDO (and it makes it easier to capture the relevant bits in the profile). I realize that is extreme, but smaller values of `number_observed` in a larger SDO list could also add up.
1 SDO with 1000 number_observed * 50 patterns
1 SDO with 10000 number_observed * 50 patterns
The SDO list looks like
Where `number_observed` is changed between the two tests above.
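To make the linear cost concrete, here is a minimal toy model, not the matcher's actual code. The function names (`expand`, `count_expanded`, `count_weighted`, `is_ipv4`) and the sample observation are hypothetical and are not the SDO list used in the tests above. It contrasts materializing one instance per unit of `number_observed` with evaluating each observation once and weighting the result by its count.

```python
# Toy model only: these names are hypothetical and do not come from stix2matcher.

def expand(observations):
    """Materialize one instance per unit of number_observed (linear in the count)."""
    instances = []
    for obs in observations:
        instances.extend([obs["objects"]] * obs.get("number_observed", 1))
    return instances


def count_expanded(observations, predicate):
    """Evaluate the predicate once per materialized instance."""
    return sum(1 for objects in expand(observations) if predicate(objects))


def count_weighted(observations, predicate):
    """Evaluate the predicate once per observation, weighted by its count."""
    return sum(
        obs.get("number_observed", 1)
        for obs in observations
        if predicate(obs["objects"])
    )


calls = {"n": 0}


def is_ipv4(objects):
    calls["n"] += 1  # count how many times the predicate is evaluated
    return any(o.get("type") == "ipv4-addr" for o in objects.values())


# Made-up observation for illustration; not the SDO list from the issue.
sdos = [{"number_observed": 1000,
         "objects": {"0": {"type": "ipv4-addr", "value": "198.51.100.1"}}}]

expanded_total = count_expanded(sdos, is_ipv4)
expanded_calls = calls["n"]

calls["n"] = 0
weighted_total = count_weighted(sdos, is_ipv4)
weighted_calls = calls["n"]

print(expanded_total, expanded_calls, weighted_total, weighted_calls)
```

In this toy, both approaches report the same total, but the expanded version evaluates the predicate once per duplicate while the weighted version evaluates it once per observation. Whether such weighting preserves the matcher's semantics for every pattern operator is not established here.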