-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Seccomp] Switch to refcount logic for kernels >= 5.9 #346
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -34,6 +34,43 @@ static struct kretprobe p_seccomp_kretprobe = { | |
.data_size = sizeof(struct p_seccomp_data), | ||
}; | ||
|
||
int p_lkrg_seccomp_init(void) { | ||
|
||
#if LINUX_VERSION_CODE < KERNEL_VERSION(5,9,0) | ||
P_SYM_INIT(get_seccomp_filter) | ||
P_SYM_INIT(put_seccomp_filter) | ||
#endif | ||
|
||
return P_LKRG_SUCCESS; | ||
|
||
#if LINUX_VERSION_CODE < KERNEL_VERSION(5,9,0) | ||
p_sym_error: | ||
return P_LKRG_GENERAL_ERROR; | ||
#endif | ||
} | ||
|
||
void p_lkrg_seccomp_filter_get(struct task_struct *p_task) { | ||
#if LINUX_VERSION_CODE < KERNEL_VERSION(5,9,0) | ||
P_SYM(p_get_seccomp_filter)(p_task); | ||
#else | ||
struct p_fake_seccomp_filter *p_filter = (struct p_fake_seccomp_filter *)p_task->seccomp.filter; | ||
|
||
if (p_filter) | ||
refcount_inc(&p_filter->refs); | ||
#endif | ||
} | ||
|
||
void p_lkrg_seccomp_filter_put(struct task_struct *p_task) { | ||
#if LINUX_VERSION_CODE < KERNEL_VERSION(5,9,0) | ||
P_SYM(p_put_seccomp_filter)(p_task); | ||
#else | ||
struct p_fake_seccomp_filter *p_filter = (struct p_fake_seccomp_filter *)p_task->seccomp.filter; | ||
|
||
if (p_filter) | ||
refcount_dec(&p_filter->refs); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think it is unsafe to just There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Testing this special case (race with release by another part of the kernel) is tricky, though. And it's not good for us to have untested code in LKRG. Can you come up with a good test case? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. To be able to implement seccomp_free we would need to define full seccomp structure because of the potential BPF:
Maybe that's the only way? In that case, I guess a bit more research is needed before merging the PR... There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Oh, right. This may make the whole approach infeasible. Maybe we should keep the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Another option is for us to fix only the get/put inconsistency bug, and not the symbol lookup failures our users are occasionally seeing - if so, we could pair There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I was wondering where this phrase came from. Found it only in https://lwn.net/Articles/599931/ from 2014. I wonder if this still holds.
If so, what happens if the task dies while we held the seccomp filter reference?
Does the refcount increase help us in any way in case the filter is modified? This is not locking against concurrent access anyway, and if you say the filter can't be gone at this time anyway, then the refcount increase is not needed. Let's be consistent. By your logic, it appears that either we don't need even the inc/dec or we need the full refcount decrease logic (including resource freeing on a possible decrease to zero). The middle ground of inc/dec only does not make sense under assumption that the filter can't be gone at this time. However, there's the case of task dying. For this case, the inc/dec could reduce the impact from accessing freed memory to a resource leak. Unless you convince me that this can't happen even without refcount inc (task dying and filter freed from under us), I suggest we split this PR in two - at first pair There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sorry for delay in this thread. I did some homework and per my understanding, the process can't be killed while we execute the logic if seccomp verification in the context of the process which we execute on in the kernel. If we are interrupting / hooking kernel call from the current process, and we verify their seccomp rules, the process can't die while is in the kernel. SIGKILL will be added to the list of signals, but process won't die (if other thread decides to die). It means we are safe here. The corner case is if we decide to verify not-currently-running-in-the-kernel process (which can happen in paranoid mode). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Well, even if you're correct about everything, we do have this mode and thus this corner case, and it is wrong for us to be doing especially unsafe things when in paranoid mode. I'm really tempted to just exclude our seccomp support on 5.9+ for now. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would rather add the check for paranoid mode that if operates on non-current context to not perform seccomp verification but still do it otherwise. What do you think? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I'm not entirely happy about that, but I'm OK with it if the code ends up looking sane. |
||
#endif | ||
} | ||
|
||
/* | ||
* x86-64 syscall ABI: | ||
* *rax - syscall_number | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we shouldn't lookup these symbols on 5.9+ at all. We should even exclude our pointer variables that would hold those. Since we don't actually call those functions on 5.9+, then why still have a dependency on being able to look them up at all. Besides, not having them on 5.9+ would avoid us inadvertently calling them (with a bug elsewhere in the code).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I overlooked that. Thx