Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Seccomp] Switch to refcount logic for kernels >= 5.9 #346

Merged
merged 1 commit into from
Aug 11, 2024

Conversation

Adam-pi3
Copy link
Collaborator

@Adam-pi3 Adam-pi3 commented Jul 6, 2024

Starting from kernel 5.9+ function 'put_seccomp_filter' has been inlined and unavailable for hooking. However, internal not-inline function was used to mitigate the problem. Unfortunately, there is no equivalent counter-part function for new hook and the old one looks incompatible which we overlooked. This patch is switching the logic for kernels 5.9+ to custom implementation of refcount logic and it should address the issue reported as #338

How Has This Been Tested?

Verified on AlmaLinux 8.8 (4.18.0-477.27.2.el8_8.x86_64) and Ubuntu 22.04.4 LTS (6.3.0-060300-generic). Likely more comprehensive tests are needed.

@solardiz
Copy link
Contributor

solardiz commented Jul 6, 2024

As seen from our CI builds, somehow this is trying (and failing) to use refcount_t on CentOS 7, I don't immediately see why.

@solardiz
Copy link
Contributor

solardiz commented Jul 6, 2024

this is trying (and failing) to use refcount_t on CentOS 7, I don't immediately see why.

Oh, I do see now: we're declaring the struct type with such fields unconditionally. We should wrap this in compile-time checks like we do for the usage. Also, let's not include the users field, which we don't use and which varies across kernels versions.

@Adam-pi3
Copy link
Collaborator Author

Adam-pi3 commented Jul 6, 2024

We should wrap this in compile-time checks like we do for the usage

What type of wrapping are you thinking? We are using refcount API and expecting that refcount_t is available. If that's not the case, should we use atomic?

@Adam-pi3
Copy link
Collaborator Author

Adam-pi3 commented Jul 6, 2024

The reason I'm asking since atomic instead of refcount are in a very old kernels. If something is < 5.9 it should use normally exposed API, only new kernel (which guarantees to have refcount) is using new logic. It doesn't make sense that compilation is failing for kernels 3.10.xxx

@Adam-pi3
Copy link
Collaborator Author

Adam-pi3 commented Jul 6, 2024

Oh now I see, wrapping the definition around the kernel version :)

Copy link
Contributor

@solardiz solardiz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made two comments. One is about a major bug to fix. The other is about code cleanup.

struct p_fake_seccomp_filter *p_filter = (struct p_fake_seccomp_filter *)p_task->seccomp.filter;

if (p_filter)
refcount_dec(&p_filter->refs);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is unsafe to just dec. We have to implement the logic of __put_seccomp_filter here. If another part of the kernel releases the only other remaining reference while we were holding ours, then it becomes our responsibility to release the associated resources (that other part of the kernel couldn't do that because of our extra reference). Please re-read my comments on the issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing this special case (race with release by another part of the kernel) is tricky, though. And it's not good for us to have untested code in LKRG. Can you come up with a good test case?

Copy link
Collaborator Author

@Adam-pi3 Adam-pi3 Jul 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be able to implement seccomp_free we would need to define full seccomp structure because of the potential BPF:
https://elixir.bootlin.com/linux/latest/source/kernel/seccomp.c#L523

static inline void seccomp_filter_free(struct seccomp_filter *filter)
{
	if (filter) {
		bpf_prog_destroy(filter->prog);
		kfree(filter);
	}
}

Maybe that's the only way? In that case, I guess a bit more research is needed before merging the PR...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, right. This may make the whole approach infeasible.

Maybe we should keep the dec and document the potential resource leak in a comment? Not that I like it...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option is for us to fix only the get/put inconsistency bug, and not the symbol lookup failures our users are occasionally seeing - if so, we could pair inc with __put_seccomp_filter. Or we could have that as the primary approach, and have a fallback for when symbol lookup fails (either allow for resource leak then, or not monitor seccomp).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no mechanism for removing filters once they have been applied to a process

I was wondering where this phrase came from. Found it only in https://lwn.net/Articles/599931/ from 2014. I wonder if this still holds.

When process dies, that's the only way to remove filters.

If so, what happens if the task dies while we held the seccomp filter reference?

In theory we do not need any sync logic since we have guarantee that as long as process is not dead, seccomp filters cannot be gone. However, filters can be modified

Does the refcount increase help us in any way in case the filter is modified? This is not locking against concurrent access anyway, and if you say the filter can't be gone at this time anyway, then the refcount increase is not needed.

Let's be consistent. By your logic, it appears that either we don't need even the inc/dec or we need the full refcount decrease logic (including resource freeing on a possible decrease to zero). The middle ground of inc/dec only does not make sense under assumption that the filter can't be gone at this time.

However, there's the case of task dying. For this case, the inc/dec could reduce the impact from accessing freed memory to a resource leak.

Unless you convince me that this can't happen even without refcount inc (task dying and filter freed from under us), I suggest we split this PR in two - at first pair inc with __put_seccomp_filter. At least this would be consistent with what the kernel itself does, where it merely has inc in __get_seccomp_filter. So this is also a code state we might end up reverting to if we later find that whatever logic we come up with in the second PR is flawed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for delay in this thread. I did some homework and per my understanding, the process can't be killed while we execute the logic if seccomp verification in the context of the process which we execute on in the kernel. If we are interrupting / hooking kernel call from the current process, and we verify their seccomp rules, the process can't die while is in the kernel. SIGKILL will be added to the list of signals, but process won't die (if other thread decides to die). It means we are safe here. The corner case is if we decide to verify not-currently-running-in-the-kernel process (which can happen in paranoid mode).
Filtered can be only freed when process is dying and this is addressed by the comment above.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The corner case is if we decide to verify not-currently-running-in-the-kernel process (which can happen in paranoid mode).

Well, even if you're correct about everything, we do have this mode and thus this corner case, and it is wrong for us to be doing especially unsafe things when in paranoid mode.

I'm really tempted to just exclude our seccomp support on 5.9+ for now.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather add the check for paranoid mode that if operates on non-current context to not perform seccomp verification but still do it otherwise. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather add the check for paranoid mode that if operates on non-current context to not perform seccomp verification but still do it otherwise. What do you think?

I'm not entirely happy about that, but I'm OK with it if the code ends up looking sane.

P_SYM_INIT(__put_seccomp_filter)
#else
P_SYM_INIT(put_seccomp_filter)
#endif
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we shouldn't lookup these symbols on 5.9+ at all. We should even exclude our pointer variables that would hold those. Since we don't actually call those functions on 5.9+, then why still have a dependency on being able to look them up at all. Besides, not having them on 5.9+ would avoid us inadvertently calling them (with a bug elsewhere in the code).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I overlooked that. Thx

@solardiz
Copy link
Contributor

solardiz commented Jul 6, 2024

Oh, also this PR should be just one commit, not several. So next time you update, please amend the first commit and force-push. You can easily squash the second commit into the first one by marking it a fixup in git rebase -i.

@solardiz
Copy link
Contributor

solardiz commented Jul 6, 2024

Adam, as I wrote in a review comment here:

Another option is for us to fix only the get/put inconsistency bug, and not the symbol lookup failures our users are occasionally seeing - if so, we could pair inc with __put_seccomp_filter.

I think we should do just that in this PR, which would fix #338 and only it.

Let's not try to remove the dependency on this symbol at this same time. We can approach that as a separate task/PR.

Starting from kernel 5.9+ function 'put_seccomp_filter' has been inlined
and unavailble for hooking. However, internal not-inline function was used
to mitigate the problem. Unfortunately, there is no equivalent counter-part
function for new hook and the old one looks incompatible which we overlooked.
This patch is switching the logic for kernels 5.9+ to custom implementation
of refcount logic and it should address the issue reported as lkrg-org#338
@Adam-pi3
Copy link
Collaborator Author

Adam-pi3 commented Aug 1, 2024

Oh, also this PR should be just one commit, not several. So next time you update, please amend the first commit and force-push. You can easily squash the second commit into the first one by marking it a fixup in git rebase -i.

I'm not sure if I did it correctly, but I think it works

Copy link
Contributor

@solardiz solardiz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks OK'ish to me now, except that the inc/dec should be unneeded if your understanding described in comments here is correct and is insufficient otherwise. However, I suppose it's fine to keep in case the understanding is incorrect, so that impact is likely reduced to a resource leak... unless the struct layout ever changes and we inc/dec something wrong.

@solardiz
Copy link
Contributor

I think we should condition this part of p_lkrg_main.h:

#if defined(CONFIG_SECCOMP)
   void (*p_get_seccomp_filter)(struct task_struct *p_task);
#if LINUX_VERSION_CODE >= KERNEL_VERSION(5,9,0)
   void (*p_put_seccomp_filter)(struct seccomp_filter *p_filter);
#else
   void (*p_put_seccomp_filter)(struct task_struct *p_task);
#endif
#endif

on #if LINUX_VERSION_CODE < KERNEL_VERSION(5,9,0) (and indeed remove the >= part from inside of this piece).

@solardiz solardiz merged commit 38b3b11 into lkrg-org:main Aug 11, 2024
18 of 19 checks passed
@solardiz
Copy link
Contributor

I think we should condition this part of p_lkrg_main.h:

Done via de5bdc2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants