Feature: DNS Cache datasource #3679
Conversation
Force-pushed from 54f3e8a to c7cddcc
I'll review this tomorrow; I just wanted to give a heads up on the feature flag. Alon and I discussed having a feature flag per "enabled feature that would enrich the proc tree", or just enriching if that event is being traced. I have also thought about a multi-value feature flag for proctree enriching, something like --proctree enrich:socket,dns,file,bleh,blah, together with all the other proctree cmdline options available. Long story short: I don't think we should have a feature flag for this particular dns cache feature. It's either one flag that would allow us to enable many proctree enrichers (such as this one), OR none is needed (tracing the event that causes the enrichment would be enough). What are your thoughts?
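The multi-value flag shape proposed above (`--proctree enrich:socket,dns,file`) could be parsed as shown in this minimal sketch. The function name and error text are illustrative only; this is not an implemented tracee flag.

```go
package main

import (
	"fmt"
	"strings"
)

// parseEnrichOption splits a proctree option of the proposed shape
// "enrich:socket,dns,file" into its enricher names. It rejects options
// that do not start with the "enrich:" prefix.
func parseEnrichOption(opt string) ([]string, error) {
	const prefix = "enrich:"
	if !strings.HasPrefix(opt, prefix) {
		return nil, fmt.Errorf("unknown proctree option: %q", opt)
	}
	var names []string
	for _, name := range strings.Split(strings.TrimPrefix(opt, prefix), ",") {
		if name = strings.TrimSpace(name); name != "" {
			names = append(names, name)
		}
	}
	return names, nil
}

func main() {
	names, err := parseEnrichOption("enrich:socket,dns,file")
	fmt.Println(names, err) // [socket dns file] <nil>
}
```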
I think that since this doesn't enrich or touch the process tree (at least not currently), it has no reason to be part of its feature flags. I'm not sure where this would plug into the process tree (at least not directly); the closest the process tree could get to it would be through open fds, and then, if those fds are network sockets, you could enrich the connection with the relevant dns data. I also don't like simply equating the enable flag with enabling the relevant dns event (ideally I would have a signal eBPF program here, but I didn't want to delay the PR further), because it obfuscates how the dns cache should be enabled. If a user tries to access the dns cache but it's empty because the relevant event isn't being traced, that would be confusing.
Having a per-process list of resolved DNS entries would allow blocking network connections of a process that has resolved a blacklisted DNS name (just one example, there are many). Think of the process tree as a place to save all this information (opened files, queried names, IPs communicated with, etc.).
This is the problem we need to solve (for data sources & process tree features). I believe we should solve it now, or else it might be too late once we have multiple features done.
We also need to discuss signal events and their future, and think about what we really want from them. For the process tree they are helpful for busy pipelines, for example, but they couldn't live without the regular events logic in place as well. Either way, when we agreed to have them, we missed the discussion of "up to what point" and "what for".
Alright, gave a good first look. I believe it's solid. I left comments here and there. Either way, the code per se is good IMO. There are major discussions about why we keep data, what for, etc., but the logic is solid and makes sense.
left comments
Force-pushed from c7cddcc to 0be7012
I want to add configuration for the cache LRU expiry time and max size before merging.
Force-pushed from a953ac2 to c11f71d
@rafaeldtinoco I've added an alternative implementation of the cache backend using a tree with indexed nodes (tell me if there is a better/more correct way to describe it). Please review and contrast it with the original implementation, which I've also iterated upon. I believe the tree cache is more correct in how it caches results, but I'm not sure if it will be good enough performance-wise on the querying side. Anyway, let's decide and I'll remove the irrelevant commits before merging.
Force-pushed from 223b3ac to 77dd275
Force-pushed from 31424d8 to a964997
I like the entire code, but I don't like that we are trying to optimize using multiple data structures before even knowing this is needed, and that we are mixing the index from queries and answers (raised a thought in the code). I'm particularly worried about all the possible queries/answers combinations.
From what I see there are 2 possible initial ways to do this, based on the following requirements (on which I believe we are on the same page):
From the way you implemented, I believe you tried to address lots of valid concerns:
I think that we should have a complete and simpler approach before trying all (or any) optimizations. I have created pseudo code mixed with garbage mixed with Go at the following gist. NOTE: I'm not suggesting you use my code (especially because it does not work). I wanted to think, so I started playing a bit, that's all. It might give you ideas if you accept my feedback. I don't think we should worry so much about access time if the LRU entry is already hashed (and even in recursive walks, like my example, it might be fast enough). So, in the end, I'm suggesting that you take a simpler approach until something else is proven to be needed. With all that said, I like everything else very much, to be honest. It's just that this part "does not fit yet" with the rest, which was very good IMO.
@rafaeldtinoco I have the intuition that the simpler approach, as it is, will accidentally break on unexpected queries. I would like to continue with this approach anyway, improving its documentation as you mentioned.
OK, I can accept that just because you're trying hard to optimize things, but please expand the tests then. You can keep tracee open for some hours, compress the events yaml file using gzip, and use a gzip reader for the test. I would like to have many different queries tested. I would also go for a "needle in the haystack" test, trying to find a specific complex-case query (multiple aliases, multiple IPs, and even a reverse lookup for the same IPs) among everything that was cached. Also, anytime something is overwritten I would like it to be logged in debug mode. I would feel more confident with that approach, and possibly others you have in mind.
Force-pushed from a964997 to c45fec3
Force-pushed from eec8de9 to 8aba682
Leaving comments for us to talk. We're getting there...
Introduce a generic set data structure, including a thread safe version (with a mutex).
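A generic set with a mutex-guarded variant, as the commit describes, could be sketched like this. The type and method names are illustrative and may not match tracee's actual types package.

```go
package main

import (
	"fmt"
	"sync"
)

// Set is a minimal generic set over any comparable element type.
type Set[T comparable] struct {
	items map[T]struct{}
}

func NewSet[T comparable](items ...T) Set[T] {
	s := Set[T]{items: make(map[T]struct{}, len(items))}
	for _, it := range items {
		s.items[it] = struct{}{}
	}
	return s
}

func (s Set[T]) Put(item T)      { s.items[item] = struct{}{} }
func (s Set[T]) Has(item T) bool { _, ok := s.items[item]; return ok }
func (s Set[T]) Len() int        { return len(s.items) }

// MutexSet is the thread-safe version: the same set guarded by an RWMutex.
type MutexSet[T comparable] struct {
	mu  sync.RWMutex
	set Set[T]
}

func NewMutexSet[T comparable]() *MutexSet[T] {
	return &MutexSet[T]{set: NewSet[T]()}
}

func (m *MutexSet[T]) Put(item T) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.set.Put(item)
}

func (m *MutexSet[T]) Has(item T) bool {
	m.mu.RLock()
	defer m.mu.RUnlock()
	return m.set.Has(item)
}

func main() {
	s := NewSet("a", "b")
	s.Put("c")
	fmt.Println(s.Has("b"), s.Len()) // true 3
}
```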
The proto_dns argument's type was *trace.ProtoDNS instead of the non-pointer trace.ProtoDNS. This is an issue when using the event in "everything is an event" mode, since type conversion isn't done between tracee-ebpf and tracee-rules.
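The failure mode described in that commit can be illustrated in miniature: when an argument value is stored behind an interface as a pointer, a consumer asserting the non-pointer type fails even though the data is present. ProtoDNS here is a stand-in struct, not the real trace.ProtoDNS.

```go
package main

import "fmt"

// ProtoDNS stands in for trace.ProtoDNS, trimmed to one field.
type ProtoDNS struct {
	Questions []string
}

func main() {
	// Argument stored as a pointer, as the buggy event definition did.
	var arg interface{} = &ProtoDNS{Questions: []string{"example.com"}}

	// A consumer expecting the non-pointer type fails the assertion.
	_, ok := arg.(ProtoDNS)
	fmt.Println("value assertion ok:", ok) // false

	// Only the pointer assertion succeeds.
	p, ok := arg.(*ProtoDNS)
	fmt.Println("pointer assertion ok:", ok, p.Questions[0]) // true example.com
}
```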
Force-pushed from b4cbc42 to 9ebacee
Add an optionally enabled cache in tracee which tracks dns queries and subsequent answers through dns packet events. Internally the cache is organized as a bidirectional graph of nodes representing dns queries. Initial questions (query roots) are stored in an LRU, so there is a maximum amount of query graphs possible (5000 by default, but it is configurable). The cache is exposed as a data source in signatures. A usage example and e2e test with a signature is included.
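At a high level, the structure described above (a graph of query nodes whose roots live in a bounded LRU) could be sketched as follows. This is only an assumed shape built from the description in this PR, using stdlib container/list for the LRU instead of whatever library tracee actually uses; all names are illustrative.

```go
package main

import (
	"container/list"
	"fmt"
)

// dnsNode is one vertex in the bidirectional query graph: a DNS name or IP,
// linked to the nodes it resolved to (next) and was resolved from (prev).
type dnsNode struct {
	value string
	next  map[string]*dnsNode
	prev  map[string]*dnsNode
}

// queryCache keeps query-root nodes in a small LRU so the number of live
// query graphs is bounded (the PR describes 5000 by default, configurable).
type queryCache struct {
	max   int
	order *list.List               // front = most recently used
	roots map[string]*list.Element // root name -> element holding *dnsNode
}

func newQueryCache(max int) *queryCache {
	return &queryCache{max: max, order: list.New(), roots: map[string]*list.Element{}}
}

// root returns the graph root for a query name, creating it (and evicting
// the least recently used root) if needed.
func (c *queryCache) root(name string) *dnsNode {
	if el, ok := c.roots[name]; ok {
		c.order.MoveToFront(el)
		return el.Value.(*dnsNode)
	}
	n := &dnsNode{value: name, next: map[string]*dnsNode{}, prev: map[string]*dnsNode{}}
	c.roots[name] = c.order.PushFront(n)
	if c.order.Len() > c.max { // evict least recently used root
		el := c.order.Back()
		delete(c.roots, el.Value.(*dnsNode).value)
		c.order.Remove(el)
	}
	return n
}

// addAnswer links query -> answer in both directions.
func (c *queryCache) addAnswer(query, answer string) {
	q := c.root(query)
	a := &dnsNode{value: answer, next: map[string]*dnsNode{}, prev: map[string]*dnsNode{}}
	q.next[answer] = a
	a.prev[query] = q
}

func main() {
	c := newQueryCache(2)
	c.addAnswer("example.com", "93.184.216.34")
	fmt.Println(len(c.root("example.com").next)) // 1
}
```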
Force-pushed from 9ebacee to 43eedc4
LGTM. Thanks for addressing/replying to all concerns.
1. Explain what the PR does
9ebacee docs: add dns cache
0b78579 feature: add a dns:ip cache
96b581f fix(dns): wrong dns argument type
0bfbe8f feature: add generic set data structure
2. Explain how to test it
An instrumentation test and in-code e2e test were added.
3. Other comments
The PR currently uses a processor on the net_packet_dns event to populate the cache. This meant that I had to add an additional processEvent call in the deriveEvents stage of the pipeline. A possible improvement in the future is to migrate (or add) a control plane signal for this cache.
The cache currently uses the second node based design proposed.
Resolves #3678