Performance issue with number_observed #67

isstabb · 2020-09-23T01:49:33Z

The number_observed attribute of observed-data incurs a linear cost in the matcher, which seems to be related to the way it is used internally to multiply the SDO. Below are profiles of the same test using a number_observed of 1000 vs 10000. The profiles look a bit different only because the profiler includes more of the call graph when execution takes longer.

The examples below use 1000 & 10000 just to be illustrative with a single SDO (and it makes it easier to capture the relevant bits in the profile). I realize that is extreme, but smaller values of number_observed in a larger SDO list could also add up.

1 SDO with 1000 number_observed * 50 patterns

1 SDO with 10000 number_observed * 50 patterns

The SDO list looks like

[
    {
        "id": "observed-data--107c9a2d-12e9-4599-8a0c-2021a88b472d",
        "type": "observed-data",
        "created_by_ref": "identity--f431f809-377b-45e0-aa1c-6a4751cae3ee",
        "last_observed": "2020-08-25T20:01:28.567Z",
        "first_observed": "2020-08-25T20:01:28.567Z",
        "number_observed": 10000,
        "created": "2020-08-26T13:23:57.728Z",
        "modified": "2020-08-26T13:23:57.728Z",
        "objects": {
            "0": {
                "type": "windows-registry-key",
                "key": "HKLM\\SYSTEM\\CurrentControlSet\\Control\\MiniNt",
            },
            "1": {
                "type": "process",
                "name": "powershell.exe",
                "pid": 8816,
                "x_ecs_entity_id": "{747f3d96-6e04-5f45-9d00-000000003800}",
                "binary_ref": "3",
                "x_ecs_event_ref": "6",
            },
            "2": {"type": "process", "child_refs": ["1"]},
            "3": {
                "type": "file",
                "name": "powershell.exe",
                "parent_directory_ref": "4",
            },
            "4": {
                "type": "directory",
                "path": "C:\\Windows\\System32\\WindowsPowerShell\\v1.0",
            },
            "5": {
                "type": "x-ecs-host",
                "hostname": "MSEDGEWIN10",
                "os_name": "Windows 10 Enterprise Evaluation",
                "os_version": "10.0",
                "os_platform": "windows",
                "ip": ["fe80::c50d:519f:96a4:e108", "10.0.2.15"],
                "name": "MSEDGEWIN10",
                "id": "747f3d96-68a7-43f1-8cbe-e8d6dadd0358",
                "mac": ["08:00:27:e6:e5:59"],
                "architecture": "x86_64",
            },
            "6": {
                "type": "x-event",
                "code": 12,
                "provider": "Microsoft-Windows-Sysmon",
                "created": "2020-08-25T20:01:28.591Z",
                "kind": "event",
                "module": "sysmon",
                "action": "CreateKey",
            },
        },
    }
]

Where number_observed is changed between the two tests above.
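
To illustrate why the cost grows linearly, here is a minimal sketch. It is not the matcher's actual code: `expand`, `count_evaluations`, and `make_sdo` are hypothetical helpers, and the sketch assumes each observed-data is turned into number_observed logical instances that are each evaluated separately.

```python
def expand(sdos):
    # Assumption: one logical instance is produced per unit of number_observed.
    for sdo in sdos:
        for _ in range(sdo.get("number_observed", 1)):
            yield sdo


def count_evaluations(sdos):
    # Counts how many times a (stand-in) comparison is evaluated.
    evaluations = 0
    for instance in expand(sdos):
        # Stand-in for evaluating one comparison expression on one instance.
        _ = instance["objects"]["1"]["name"] == "powershell.exe"
        evaluations += 1
    return evaluations


def make_sdo(number_observed):
    # Trimmed-down version of the SDO above, for illustration only.
    return {
        "type": "observed-data",
        "number_observed": number_observed,
        "objects": {"1": {"type": "process", "name": "powershell.exe"}},
    }


print(count_evaluations([make_sdo(1000)]))   # 1000 evaluations for one SDO
print(count_evaluations([make_sdo(10000)]))  # 10000 evaluations for one SDO
```

Under that assumption, the work per pattern is proportional to the sum of number_observed over all SDOs, not to the number of SDOs.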


clslgrnc · 2020-10-06T19:13:45Z

One way to mitigate this issue would be to extend the work done in #64 for observation expressions to comparison expressions.
All exitComparisonExpression* should be modified to work with generators, so that obs_ids in the following is a generator:

cti-pattern-matcher/stix2matcher/matcher.py

Lines 1368 to 1383 in bcd37ee

    
    def exitObservationExpressionSimple(self, ctx):
        """
        Consumes a the results of the inner comparison expression.  See
        exitComparisonExpression().
        Produces: a generator of 1-tuples of the IDs.  At this stage, the root
        Cyber Observable object IDs are no longer needed, and are dropped.
        This is a preparatory transformative step, so that higher-level
        processing has consistent structures to work with (always generator of
        tuples).
        """
        debug_label = u"exitObservationExpression (simple)"
        obs_ids = self.__pop(debug_label)
        obs_id_tuples = ((obs_id,) for obs_id in obs_ids.keys())
        self.__push(obs_id_tuples, debug_label)

It might also be better to evaluate all exitComparisonExpression* without expanding the number_observed and only duplicate the observed-data in exitObservationExpressionSimple (if it makes sense).

I probably won't have time to do it in the foreseeable future.
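
The second suggestion above (evaluate the comparison once per observed-data and only duplicate at the observation-expression stage) could be sketched roughly as follows. This is only an illustration under assumptions: `match_once` and `expand_late` are hypothetical names, not functions in stix2matcher, and the real matcher may need to distinguish individual instances rather than repeat the same 1-tuple.

```python
from itertools import islice


def match_once(observations, predicate):
    # Evaluate the comparison once per observed-data, ignoring number_observed.
    for obs_id, obs in observations.items():
        if predicate(obs):
            yield obs_id


def expand_late(obs_ids, observations):
    # Lazily produce one 1-tuple per observed instance, only when consumed.
    for obs_id in obs_ids:
        for _ in range(observations[obs_id].get("number_observed", 1)):
            yield (obs_id,)


calls = []  # records each predicate evaluation, to show how often it runs


def is_powershell(obs):
    calls.append(obs)
    return any(o.get("name") == "powershell.exe" for o in obs["objects"].values())


observations = {
    0: {
        "number_observed": 10000,
        "objects": {"1": {"type": "process", "name": "powershell.exe"}},
    }
}

tuples = expand_late(match_once(observations, is_powershell), observations)
first_three = list(islice(tuples, 3))  # consume only three instances
print(first_three, len(calls))
```

Because both stages are generators, the predicate runs once for the single observed-data, and only as many instance tuples are materialized as the consumer actually asks for.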

dennispo · 2021-03-14T15:56:36Z

When working with real low-level data (like data from EDRs or Sysmon) we are experiencing a huge performance degradation. As an example, there are 15754 observables generated out of 100 original observables. In another example, there are 189023 instances generated out of 300 original observables.

When comparing the two examples above with a version without instance duplication, the timings are as follows:

| Improvement | Time measured for 100 observed_data | % of improvement | Time measured for 300 observed_data | % of improvement |
| --- | --- | --- | --- | --- |
| basic | 0:04:59.641 | | 1:00:40.532 | |
| events deduplication | 0:00:06.043 | 97.98% | 0:00:15.945 | 99.56% |
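
As a quick check, the percentages reported above can be reproduced from the timings, assuming "% of improvement" means the relative reduction in wall-clock time. The helper names below are made up for this sketch.

```python
from datetime import timedelta


def parse_duration(text):
    # Parse an "H:MM:SS.mmm" string into a timedelta.
    hours, minutes, seconds = text.split(":")
    return timedelta(hours=int(hours), minutes=int(minutes), seconds=float(seconds))


def improvement_percent(before, after):
    # Relative reduction in elapsed time, as a percentage rounded to 2 places.
    before_s = parse_duration(before).total_seconds()
    after_s = parse_duration(after).total_seconds()
    return round((1 - after_s / before_s) * 100, 2)


print(improvement_percent("0:04:59.641", "0:00:06.043"))  # 100 observed_data
print(improvement_percent("1:00:40.532", "0:00:15.945"))  # 300 observed_data
```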

dennispo mentioned this issue Mar 14, 2021

Performance boost #72

Merged
