Feature: DNS Cache datasource #3679
Conversation
Force-pushed from 54f3e8a to c7cddcc
I'll review this tomorrow; I just wanted to give a heads up on the feature flag. Alon and I discussed having a feature flag per "enabled feature that would enrich the proc tree", or just enriching if that event is being traced. I have also thought about a multi-value feature flag for proctree enriching, something like --proctree enrich:socket,dns,file,bleh,blah, together with all the other proctree cmdline options available. Long story short: I don't think we should have a feature flag for this particular dns cache feature. It's either one flag that would allow us to enable many proctree enrichers (such as this one), OR none is needed (tracing the event that causes the enrichment would be enough). What are your thoughts?
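The multi-value flag shape proposed above (`--proctree enrich:socket,dns,file`) could be parsed as shown in this minimal sketch. The function name and error text are illustrative only; this is not an implemented tracee flag.

```go
package main

import (
	"fmt"
	"strings"
)

// parseEnrichOption splits a proctree option of the proposed shape
// "enrich:socket,dns,file" into its enricher names. It rejects options
// that do not start with the "enrich:" prefix.
func parseEnrichOption(opt string) ([]string, error) {
	const prefix = "enrich:"
	if !strings.HasPrefix(opt, prefix) {
		return nil, fmt.Errorf("unknown proctree option: %q", opt)
	}
	var names []string
	for _, name := range strings.Split(strings.TrimPrefix(opt, prefix), ",") {
		if name = strings.TrimSpace(name); name != "" {
			names = append(names, name)
		}
	}
	return names, nil
}

func main() {
	names, err := parseEnrichOption("enrich:socket,dns,file")
	fmt.Println(names, err) // [socket dns file] <nil>
}
```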
I think that since this doesn't enrich or touch the process tree (at least not currently), it has no reason to be part of its feature flags. I'm not sure where this would plug into the process tree (at least not directly); the closest the process tree could get to it would be through open fds, and then, if those fds are network sockets, you could enrich the connection with the relevant dns data. I also don't like simply equating the enable flag with enabling the relevant dns event (ideally I would have a signal eBPF program here, but I didn't want to delay the PR further), because it obfuscates how the dns cache should be enabled. If a user tries to access the dns cache but it's empty because the relevant event isn't being traced, that would be confusing.
Having a per-process list of resolved DNS entries would allow blocking network connections of a process that has resolved a blacklisted DNS name (just one example, there are many). Think of the process tree as a place to save all this information (opened files, queried names, IPs communicated with, etc.).
This is the problem we need to solve (for data sources & process tree features). I believe we should solve it now, or else it might be too late once we have multiple features done.
We also need to discuss signal events and their future, and think about what we really want from them. For the process tree they are helpful for busy pipelines, for example, but they couldn't live without the regular events logic in place as well. Either way, when we agreed to have them, we missed the discussion of "up to what point" and "what for".
Alright, gave a good first look. I believe it's solid. I left comments here and there. Either way, the code per se is good IMO. There are major discussions about why we keep data, what for, etc., but the logic is solid and makes sense.
left comments
Force-pushed from c7cddcc to 0be7012
I want to add configuration for the cache LRU expiry time and max size before merging.
Force-pushed from a953ac2 to c11f71d
@rafaeldtinoco I've added an alternative implementation of the cache backend using a tree with indexed nodes (tell me if there is a better/more correct way to describe it). Please review and contrast it with the original implementation, which I've also iterated upon. I believe the tree cache is more correct in how it caches results, but I'm not sure if it will be good enough performance-wise on the querying side. Anyway, let's decide and I'll remove the irrelevant commits before merging.
Force-pushed from 223b3ac to 77dd275
Force-pushed from 31424d8 to a964997
I like the entire code, but I don't like that we are trying to optimize using multiple data structures before even knowing this is needed, and that we are mixing the index from queries and answers (raised a thought in the code). I'm particularly worried about all the possible queries/answers combinations.
From what I see there are 2 possible initial ways to do this, based on the following requirements (on which I believe we are on the same page):
From the way you implemented, I believe you tried to address lots of valid concerns:
I think that we should have a complete and simpler approach before trying all (or any) optimizations. I have created pseudo code mixed with garbage mixed with Go at the following gist. NOTE: I'm not suggesting you use my code (especially because it does not work). I wanted to think, so I started playing a bit, that's all. It might give you ideas if you accept my feedback. I don't think we should worry so much about access time if the LRU entry is already hashed (and even in recursive walks, like my example, it might be fast enough). So, in the end, I'm suggesting that you take a simpler approach until something else is proven to be needed. With all that said, I like everything else very much, to be honest. It's just that this part "does not fit yet" with the rest, which was very good IMO.
@rafaeldtinoco I have the intuition that the simpler approach, as it is, will accidentally break on unexpected queries. I would like to continue with this approach anyway, improving its documentation as you mentioned.
OK, I can accept that just because you're trying hard to optimize things, but please expand the tests then. You can keep tracee open for some hours, compress the events yaml file using gzip, and use a gzip reader for the test. I would like to have many different queries tested. I would also go for a "needle in the haystack" test, trying to find a specific complex-case query (multiple aliases, multiple IPs, and even a reverse lookup for the same IPs) among everything that was cached. Also, anytime something is overwritten I would like it to be logged in debug mode. I would feel more confident with that approach, and possibly others you have in mind.
Force-pushed from a964997 to c45fec3
Force-pushed from eec8de9 to 8aba682
Leaving comments for us to talk. We're getting there...
Introduce a generic set data structure, including a thread safe version (with a mutex).
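A generic set with a mutex-guarded variant, as the commit describes, could be sketched like this. The type and method names are illustrative and may not match tracee's actual types package.

```go
package main

import (
	"fmt"
	"sync"
)

// Set is a minimal generic set over any comparable element type.
type Set[T comparable] struct {
	items map[T]struct{}
}

func NewSet[T comparable](items ...T) Set[T] {
	s := Set[T]{items: make(map[T]struct{}, len(items))}
	for _, it := range items {
		s.items[it] = struct{}{}
	}
	return s
}

func (s Set[T]) Put(item T)      { s.items[item] = struct{}{} }
func (s Set[T]) Has(item T) bool { _, ok := s.items[item]; return ok }
func (s Set[T]) Len() int        { return len(s.items) }

// MutexSet is the thread-safe version: the same set guarded by an RWMutex.
type MutexSet[T comparable] struct {
	mu  sync.RWMutex
	set Set[T]
}

func NewMutexSet[T comparable]() *MutexSet[T] {
	return &MutexSet[T]{set: NewSet[T]()}
}

func (m *MutexSet[T]) Put(item T) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.set.Put(item)
}

func (m *MutexSet[T]) Has(item T) bool {
	m.mu.RLock()
	defer m.mu.RUnlock()
	return m.set.Has(item)
}

func main() {
	s := NewSet("a", "b")
	s.Put("c")
	fmt.Println(s.Has("b"), s.Len()) // true 3
}
```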
The proto_dns argument's type was *trace.ProtoDNS instead of the non-pointer trace.ProtoDNS. This is an issue when using the event in "everything is an event" mode, since type conversion isn't done between tracee-ebpf and tracee-rules.
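The failure mode described in that commit can be illustrated in miniature: when an argument value is stored behind an interface as a pointer, a consumer asserting the non-pointer type fails even though the data is present. ProtoDNS here is a stand-in struct, not the real trace.ProtoDNS.

```go
package main

import "fmt"

// ProtoDNS stands in for trace.ProtoDNS, trimmed to one field.
type ProtoDNS struct {
	Questions []string
}

func main() {
	// Argument stored as a pointer, as the buggy event definition did.
	var arg interface{} = &ProtoDNS{Questions: []string{"example.com"}}

	// A consumer expecting the non-pointer type fails the assertion.
	_, ok := arg.(ProtoDNS)
	fmt.Println("value assertion ok:", ok) // false

	// Only the pointer assertion succeeds.
	p, ok := arg.(*ProtoDNS)
	fmt.Println("pointer assertion ok:", ok, p.Questions[0]) // true example.com
}
```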
Force-pushed from b4cbc42 to 9ebacee
Add an optionally enabled cache in tracee which tracks dns queries and subsequent answers through dns packet events. Internally the cache is organized as a bidirectional graph of nodes representing dns queries. Initial questions (query roots) are stored in an LRU, so there is a maximum amount of query graphs possible (5000 by default, but it is configurable). The cache is exposed as a data source in signatures. A usage example and e2e test with a signature is included.
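At a high level, the structure described above (a graph of query nodes whose roots live in a bounded LRU) could be sketched as follows. This is only an assumed shape built from the description in this PR, using stdlib container/list for the LRU instead of whatever library tracee actually uses; all names are illustrative.

```go
package main

import (
	"container/list"
	"fmt"
)

// dnsNode is one vertex in the bidirectional query graph: a DNS name or IP,
// linked to the nodes it resolved to (next) and was resolved from (prev).
type dnsNode struct {
	value string
	next  map[string]*dnsNode
	prev  map[string]*dnsNode
}

// queryCache keeps query-root nodes in a small LRU so the number of live
// query graphs is bounded (the PR describes 5000 by default, configurable).
type queryCache struct {
	max   int
	order *list.List               // front = most recently used
	roots map[string]*list.Element // root name -> element holding *dnsNode
}

func newQueryCache(max int) *queryCache {
	return &queryCache{max: max, order: list.New(), roots: map[string]*list.Element{}}
}

// root returns the graph root for a query name, creating it (and evicting
// the least recently used root) if needed.
func (c *queryCache) root(name string) *dnsNode {
	if el, ok := c.roots[name]; ok {
		c.order.MoveToFront(el)
		return el.Value.(*dnsNode)
	}
	n := &dnsNode{value: name, next: map[string]*dnsNode{}, prev: map[string]*dnsNode{}}
	c.roots[name] = c.order.PushFront(n)
	if c.order.Len() > c.max { // evict least recently used root
		el := c.order.Back()
		delete(c.roots, el.Value.(*dnsNode).value)
		c.order.Remove(el)
	}
	return n
}

// addAnswer links query -> answer in both directions.
func (c *queryCache) addAnswer(query, answer string) {
	q := c.root(query)
	a := &dnsNode{value: answer, next: map[string]*dnsNode{}, prev: map[string]*dnsNode{}}
	q.next[answer] = a
	a.prev[query] = q
}

func main() {
	c := newQueryCache(2)
	c.addAnswer("example.com", "93.184.216.34")
	fmt.Println(len(c.root("example.com").next)) // 1
}
```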
Force-pushed from 9ebacee to 43eedc4
LGTM. Thanks for addressing/replying to all concerns.
1. Explain what the PR does
9ebacee docs: add dns cache
0b78579 feature: add a dns:ip cache
96b581f fix(dns): wrong dns argument type
0bfbe8f feature: add generic set data structure
2. Explain how to test it
An instrumentation test and in-code e2e test were added.
3. Other comments
The PR currently uses a processor on the net_packet_dns event to populate the cache. This meant that I had to add an additional processEvent call in the deriveEvents stage of the pipeline. A possible improvement in the future is to migrate (or add) a control plane signal for this cache.
The cache currently uses the second node based design proposed.
Resolves #3678