-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Seccomp] Switch to refcount logic for kernels >= 5.9 #346
Conversation
As seen from our CI builds, somehow this is trying (and failing) to use |
Oh, I do see now: we're declaring the struct type with such fields unconditionally. We should wrap this in compile-time checks like we do for the usage. Also, let's not include the |
What type of wrapping are you thinking? We are using refcount API and expecting that |
The reason I'm asking since |
Oh now I see, wrapping the definition around the kernel version :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made two comments. One is about a major bug to fix. The other is about code cleanup.
struct p_fake_seccomp_filter *p_filter = (struct p_fake_seccomp_filter *)p_task->seccomp.filter; | ||
|
||
if (p_filter) | ||
refcount_dec(&p_filter->refs); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is unsafe to just dec
. We have to implement the logic of __put_seccomp_filter
here. If another part of the kernel releases the only other remaining reference while we were holding ours, then it becomes our responsibility to release the associated resources (that other part of the kernel couldn't do that because of our extra reference). Please re-read my comments on the issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Testing this special case (race with release by another part of the kernel) is tricky, though. And it's not good for us to have untested code in LKRG. Can you come up with a good test case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be able to implement seccomp_free we would need to define full seccomp structure because of the potential BPF:
https://elixir.bootlin.com/linux/latest/source/kernel/seccomp.c#L523
static inline void seccomp_filter_free(struct seccomp_filter *filter)
{
if (filter) {
bpf_prog_destroy(filter->prog);
kfree(filter);
}
}
Maybe that's the only way? In that case, I guess a bit more research is needed before merging the PR...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, right. This may make the whole approach infeasible.
Maybe we should keep the dec
and document the potential resource leak in a comment? Not that I like it...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another option is for us to fix only the get/put inconsistency bug, and not the symbol lookup failures our users are occasionally seeing - if so, we could pair inc
with __put_seccomp_filter
. Or we could have that as the primary approach, and have a fallback for when symbol lookup fails (either allow for resource leak then, or not monitor seccomp).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no mechanism for removing filters once they have been applied to a process
I was wondering where this phrase came from. Found it only in https://lwn.net/Articles/599931/ from 2014. I wonder if this still holds.
When process dies, that's the only way to remove filters.
If so, what happens if the task dies while we held the seccomp filter reference?
In theory we do not need any sync logic since we have guarantee that as long as process is not dead, seccomp filters cannot be gone. However, filters can be modified
Does the refcount increase help us in any way in case the filter is modified? This is not locking against concurrent access anyway, and if you say the filter can't be gone at this time anyway, then the refcount increase is not needed.
Let's be consistent. By your logic, it appears that either we don't need even the inc/dec or we need the full refcount decrease logic (including resource freeing on a possible decrease to zero). The middle ground of inc/dec only does not make sense under assumption that the filter can't be gone at this time.
However, there's the case of task dying. For this case, the inc/dec could reduce the impact from accessing freed memory to a resource leak.
Unless you convince me that this can't happen even without refcount inc (task dying and filter freed from under us), I suggest we split this PR in two - at first pair inc
with __put_seccomp_filter
. At least this would be consistent with what the kernel itself does, where it merely has inc
in __get_seccomp_filter
. So this is also a code state we might end up reverting to if we later find that whatever logic we come up with in the second PR is flawed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for delay in this thread. I did some homework and per my understanding, the process can't be killed while we execute the logic if seccomp verification in the context of the process which we execute on in the kernel. If we are interrupting / hooking kernel call from the current process, and we verify their seccomp rules, the process can't die while is in the kernel. SIGKILL will be added to the list of signals, but process won't die (if other thread decides to die). It means we are safe here. The corner case is if we decide to verify not-currently-running-in-the-kernel process (which can happen in paranoid mode).
Filtered can be only freed when process is dying and this is addressed by the comment above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The corner case is if we decide to verify not-currently-running-in-the-kernel process (which can happen in paranoid mode).
Well, even if you're correct about everything, we do have this mode and thus this corner case, and it is wrong for us to be doing especially unsafe things when in paranoid mode.
I'm really tempted to just exclude our seccomp support on 5.9+ for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather add the check for paranoid mode that if operates on non-current context to not perform seccomp verification but still do it otherwise. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather add the check for paranoid mode that if operates on non-current context to not perform seccomp verification but still do it otherwise. What do you think?
I'm not entirely happy about that, but I'm OK with it if the code ends up looking sane.
P_SYM_INIT(__put_seccomp_filter) | ||
#else | ||
P_SYM_INIT(put_seccomp_filter) | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we shouldn't lookup these symbols on 5.9+ at all. We should even exclude our pointer variables that would hold those. Since we don't actually call those functions on 5.9+, then why still have a dependency on being able to look them up at all. Besides, not having them on 5.9+ would avoid us inadvertently calling them (with a bug elsewhere in the code).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I overlooked that. Thx
Oh, also this PR should be just one commit, not several. So next time you update, please amend the first commit and force-push. You can easily squash the second commit into the first one by marking it a |
Adam, as I wrote in a review comment here: Another option is for us to fix only the get/put inconsistency bug, and not the symbol lookup failures our users are occasionally seeing - if so, we could pair I think we should do just that in this PR, which would fix #338 and only it. Let's not try to remove the dependency on this symbol at this same time. We can approach that as a separate task/PR. |
Starting from kernel 5.9+ function 'put_seccomp_filter' has been inlined and unavailble for hooking. However, internal not-inline function was used to mitigate the problem. Unfortunately, there is no equivalent counter-part function for new hook and the old one looks incompatible which we overlooked. This patch is switching the logic for kernels 5.9+ to custom implementation of refcount logic and it should address the issue reported as lkrg-org#338
I'm not sure if I did it correctly, but I think it works |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks OK'ish to me now, except that the inc/dec should be unneeded if your understanding described in comments here is correct and is insufficient otherwise. However, I suppose it's fine to keep in case the understanding is incorrect, so that impact is likely reduced to a resource leak... unless the struct layout ever changes and we inc/dec something wrong.
I think we should condition this part of #if defined(CONFIG_SECCOMP)
void (*p_get_seccomp_filter)(struct task_struct *p_task);
#if LINUX_VERSION_CODE >= KERNEL_VERSION(5,9,0)
void (*p_put_seccomp_filter)(struct seccomp_filter *p_filter);
#else
void (*p_put_seccomp_filter)(struct task_struct *p_task);
#endif
#endif on |
Done via de5bdc2 |
Starting from kernel 5.9+ function 'put_seccomp_filter' has been inlined and unavailable for hooking. However, internal not-inline function was used to mitigate the problem. Unfortunately, there is no equivalent counter-part function for new hook and the old one looks incompatible which we overlooked. This patch is switching the logic for kernels 5.9+ to custom implementation of refcount logic and it should address the issue reported as #338
How Has This Been Tested?
Verified on AlmaLinux 8.8 (4.18.0-477.27.2.el8_8.x86_64) and Ubuntu 22.04.4 LTS (6.3.0-060300-generic). Likely more comprehensive tests are needed.