[Seccomp] Switch to refcount logic for kernels >= 5.9 #346

Adam-pi3 · 2024-07-06T02:06:44Z

Starting from kernel 5.9+ function 'put_seccomp_filter' has been inlined and unavailable for hooking. However, internal not-inline function was used to mitigate the problem. Unfortunately, there is no equivalent counter-part function for new hook and the old one looks incompatible which we overlooked. This patch is switching the logic for kernels 5.9+ to custom implementation of refcount logic and it should address the issue reported as #338

How Has This Been Tested?

Verified on AlmaLinux 8.8 (4.18.0-477.27.2.el8_8.x86_64) and Ubuntu 22.04.4 LTS (6.3.0-060300-generic). Likely more comprehensive tests are needed.

solardiz · 2024-07-06T12:51:39Z

As seen from our CI builds, somehow this is trying (and failing) to use refcount_t on CentOS 7, I don't immediately see why.

solardiz · 2024-07-06T12:54:40Z

this is trying (and failing) to use refcount_t on CentOS 7, I don't immediately see why.

Oh, I do see now: we're declaring the struct type with such fields unconditionally. We should wrap this in compile-time checks like we do for the usage. Also, let's not include the users field, which we don't use and which varies across kernels versions.

Adam-pi3 · 2024-07-06T17:32:59Z

We should wrap this in compile-time checks like we do for the usage

What type of wrapping are you thinking? We are using refcount API and expecting that refcount_t is available. If that's not the case, should we use atomic?

Adam-pi3 · 2024-07-06T17:34:32Z

The reason I'm asking since atomic instead of refcount are in a very old kernels. If something is < 5.9 it should use normally exposed API, only new kernel (which guarantees to have refcount) is using new logic. It doesn't make sense that compilation is failing for kernels 3.10.xxx

Adam-pi3 · 2024-07-06T17:36:38Z

Oh now I see, wrapping the definition around the kernel version :)

solardiz

I made two comments. One is about a major bug to fix. The other is about code cleanup.

solardiz · 2024-07-06T18:13:18Z

src/modules/exploit_detection/syscalls/p_seccomp/p_seccomp.c

+   struct p_fake_seccomp_filter *p_filter = (struct p_fake_seccomp_filter *)p_task->seccomp.filter;
+
+   if (p_filter)
+      refcount_dec(&p_filter->refs);


I think it is unsafe to just dec. We have to implement the logic of __put_seccomp_filter here. If another part of the kernel releases the only other remaining reference while we were holding ours, then it becomes our responsibility to release the associated resources (that other part of the kernel couldn't do that because of our extra reference). Please re-read my comments on the issue.

Testing this special case (race with release by another part of the kernel) is tricky, though. And it's not good for us to have untested code in LKRG. Can you come up with a good test case?

To be able to implement seccomp_free we would need to define full seccomp structure because of the potential BPF:
https://elixir.bootlin.com/linux/latest/source/kernel/seccomp.c#L523

static inline void seccomp_filter_free(struct seccomp_filter *filter) { if (filter) { bpf_prog_destroy(filter->prog); kfree(filter); } }

Maybe that's the only way? In that case, I guess a bit more research is needed before merging the PR...

Oh, right. This may make the whole approach infeasible.

Maybe we should keep the dec and document the potential resource leak in a comment? Not that I like it...

Another option is for us to fix only the get/put inconsistency bug, and not the symbol lookup failures our users are occasionally seeing - if so, we could pair inc with __put_seccomp_filter. Or we could have that as the primary approach, and have a fallback for when symbol lookup fails (either allow for resource leak then, or not monitor seccomp).

no mechanism for removing filters once they have been applied to a process

I was wondering where this phrase came from. Found it only in https://lwn.net/Articles/599931/ from 2014. I wonder if this still holds.

When process dies, that's the only way to remove filters.

If so, what happens if the task dies while we held the seccomp filter reference?

In theory we do not need any sync logic since we have guarantee that as long as process is not dead, seccomp filters cannot be gone. However, filters can be modified

Does the refcount increase help us in any way in case the filter is modified? This is not locking against concurrent access anyway, and if you say the filter can't be gone at this time anyway, then the refcount increase is not needed.

Let's be consistent. By your logic, it appears that either we don't need even the inc/dec or we need the full refcount decrease logic (including resource freeing on a possible decrease to zero). The middle ground of inc/dec only does not make sense under assumption that the filter can't be gone at this time.

However, there's the case of task dying. For this case, the inc/dec could reduce the impact from accessing freed memory to a resource leak.

Unless you convince me that this can't happen even without refcount inc (task dying and filter freed from under us), I suggest we split this PR in two - at first pair inc with __put_seccomp_filter. At least this would be consistent with what the kernel itself does, where it merely has inc in __get_seccomp_filter. So this is also a code state we might end up reverting to if we later find that whatever logic we come up with in the second PR is flawed.

Sorry for delay in this thread. I did some homework and per my understanding, the process can't be killed while we execute the logic if seccomp verification in the context of the process which we execute on in the kernel. If we are interrupting / hooking kernel call from the current process, and we verify their seccomp rules, the process can't die while is in the kernel. SIGKILL will be added to the list of signals, but process won't die (if other thread decides to die). It means we are safe here. The corner case is if we decide to verify not-currently-running-in-the-kernel process (which can happen in paranoid mode).
Filtered can be only freed when process is dying and this is addressed by the comment above.

The corner case is if we decide to verify not-currently-running-in-the-kernel process (which can happen in paranoid mode).

Well, even if you're correct about everything, we do have this mode and thus this corner case, and it is wrong for us to be doing especially unsafe things when in paranoid mode.

I'm really tempted to just exclude our seccomp support on 5.9+ for now.

I would rather add the check for paranoid mode that if operates on non-current context to not perform seccomp verification but still do it otherwise. What do you think?

I would rather add the check for paranoid mode that if operates on non-current context to not perform seccomp verification but still do it otherwise. What do you think?

I'm not entirely happy about that, but I'm OK with it if the code ends up looking sane.

solardiz · 2024-07-06T18:16:25Z

src/modules/exploit_detection/syscalls/p_seccomp/p_seccomp.c

+   P_SYM_INIT(__put_seccomp_filter)
+#else
+   P_SYM_INIT(put_seccomp_filter)
+#endif


I think we shouldn't lookup these symbols on 5.9+ at all. We should even exclude our pointer variables that would hold those. Since we don't actually call those functions on 5.9+, then why still have a dependency on being able to look them up at all. Besides, not having them on 5.9+ would avoid us inadvertently calling them (with a bug elsewhere in the code).

Yes, I overlooked that. Thx

solardiz · 2024-07-06T18:18:54Z

Oh, also this PR should be just one commit, not several. So next time you update, please amend the first commit and force-push. You can easily squash the second commit into the first one by marking it a fixup in git rebase -i.

solardiz · 2024-07-06T21:17:46Z

Adam, as I wrote in a review comment here:

Another option is for us to fix only the get/put inconsistency bug, and not the symbol lookup failures our users are occasionally seeing - if so, we could pair inc with __put_seccomp_filter.

I think we should do just that in this PR, which would fix #338 and only it.

Let's not try to remove the dependency on this symbol at this same time. We can approach that as a separate task/PR.

Starting from kernel 5.9+ function 'put_seccomp_filter' has been inlined and unavailble for hooking. However, internal not-inline function was used to mitigate the problem. Unfortunately, there is no equivalent counter-part function for new hook and the old one looks incompatible which we overlooked. This patch is switching the logic for kernels 5.9+ to custom implementation of refcount logic and it should address the issue reported as lkrg-org#338

Adam-pi3 · 2024-08-01T22:24:06Z

Oh, also this PR should be just one commit, not several. So next time you update, please amend the first commit and force-push. You can easily squash the second commit into the first one by marking it a fixup in git rebase -i.

I'm not sure if I did it correctly, but I think it works

solardiz

Looks OK'ish to me now, except that the inc/dec should be unneeded if your understanding described in comments here is correct and is insufficient otherwise. However, I suppose it's fine to keep in case the understanding is incorrect, so that impact is likely reduced to a resource leak... unless the struct layout ever changes and we inc/dec something wrong.

solardiz · 2024-08-11T22:44:54Z

I think we should condition this part of p_lkrg_main.h:

#if defined(CONFIG_SECCOMP)
   void (*p_get_seccomp_filter)(struct task_struct *p_task);
#if LINUX_VERSION_CODE >= KERNEL_VERSION(5,9,0)
   void (*p_put_seccomp_filter)(struct seccomp_filter *p_filter);
#else
   void (*p_put_seccomp_filter)(struct task_struct *p_task);
#endif
#endif

on #if LINUX_VERSION_CODE < KERNEL_VERSION(5,9,0) (and indeed remove the >= part from inside of this piece).

solardiz · 2024-08-11T22:53:11Z

I think we should condition this part of p_lkrg_main.h:

Done via de5bdc2

Adam-pi3 requested a review from solardiz July 6, 2024 02:06

Adam-pi3 assigned solardiz Jul 6, 2024

solardiz requested changes Jul 6, 2024

View reviewed changes

Adam-pi3 force-pushed the main branch from d555fef to d41fb2b Compare August 1, 2024 22:22

solardiz approved these changes Aug 11, 2024

View reviewed changes

solardiz merged commit 38b3b11 into lkrg-org:main Aug 11, 2024
18 of 19 checks passed

solardiz mentioned this pull request Aug 11, 2024

Inconsistent use of get_seccomp_filter vs. __put_seccomp_filter #338

Closed

Adam-pi3 mentioned this pull request Aug 22, 2024

Can't find '__put_seccomp_filter' #225

Closed

solardiz mentioned this pull request Oct 6, 2024

False positive "seccomp filter pointer corruption" on Linux 6.11.0-1-default x86_64 Opensuse tumbleweed #354

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Seccomp] Switch to refcount logic for kernels >= 5.9 #346

[Seccomp] Switch to refcount logic for kernels >= 5.9 #346

Adam-pi3 commented Jul 6, 2024

solardiz commented Jul 6, 2024

solardiz commented Jul 6, 2024

Adam-pi3 commented Jul 6, 2024

Adam-pi3 commented Jul 6, 2024

Adam-pi3 commented Jul 6, 2024

solardiz left a comment

solardiz Jul 6, 2024

solardiz Jul 6, 2024

Adam-pi3 Jul 6, 2024 •

edited

Loading

solardiz Jul 6, 2024

solardiz Jul 6, 2024

solardiz Jul 9, 2024

Adam-pi3 Jul 30, 2024

solardiz Jul 30, 2024

Adam-pi3 Jul 31, 2024

solardiz Jul 31, 2024

solardiz Jul 6, 2024

Adam-pi3 Jul 6, 2024

solardiz commented Jul 6, 2024

solardiz commented Jul 6, 2024

Adam-pi3 commented Aug 1, 2024

solardiz left a comment

solardiz commented Aug 11, 2024

solardiz commented Aug 11, 2024

[Seccomp] Switch to refcount logic for kernels >= 5.9 #346

[Seccomp] Switch to refcount logic for kernels >= 5.9 #346

Conversation

Adam-pi3 commented Jul 6, 2024

How Has This Been Tested?

solardiz commented Jul 6, 2024

solardiz commented Jul 6, 2024

Adam-pi3 commented Jul 6, 2024

Adam-pi3 commented Jul 6, 2024

Adam-pi3 commented Jul 6, 2024

solardiz left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Adam-pi3 Jul 6, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

solardiz commented Jul 6, 2024

solardiz commented Jul 6, 2024

Adam-pi3 commented Aug 1, 2024

solardiz left a comment

Choose a reason for hiding this comment

solardiz commented Aug 11, 2024

solardiz commented Aug 11, 2024

Adam-pi3 Jul 6, 2024 •

edited

Loading