[Seccomp] Switch to refcount logic for kernels >= 5.9 #346

Merged
merged 1 commit into from
Aug 11, 2024
33 changes: 13 additions & 20 deletions src/modules/exploit_detection/p_exploit_detection.c
@@ -424,7 +424,7 @@ static notrace void p_dump_creds(struct p_cred *p_where, const struct cred *p_fr
 #if defined(CONFIG_SECCOMP)
 static notrace void p_dump_seccomp(struct p_seccomp *p_sec, struct task_struct *p_task, char p_force) {

-   P_SYM(p_get_seccomp_filter)(p_task);
+   p_lkrg_seccomp_filter_get(p_task);
    p_sec->sec.mode = p_task->seccomp.mode; // Mode
    p_sec->sec.filter = p_task->seccomp.filter; // Filter
 #if LINUX_VERSION_CODE >= KERNEL_VERSION(5,11,0)
@@ -437,12 +437,7 @@ static notrace void p_dump_seccomp(struct p_seccomp *p_sec, struct task_struct *
    p_sec->flag = 0;
    if (p_force)
       p_sec->flag_sync_thread = 0;
-#if LINUX_VERSION_CODE >= KERNEL_VERSION(5,9,0)
-   P_SYM(p_put_seccomp_filter)(p_task->seccomp.filter);
-#else
-   P_SYM(p_put_seccomp_filter)(p_task);
-#endif
-
+   p_lkrg_seccomp_filter_put(p_task);
 }
 #endif

@@ -1373,8 +1368,12 @@ static int p_cmp_tasks(struct p_ed_process *p_orig, struct task_struct *p_curren

 #if defined(CONFIG_SECCOMP)
    /* Seccomp */
+#if LINUX_VERSION_CODE < KERNEL_VERSION(5,9,0)
    if (p_orig->p_ed_task.p_sec.flag) { // SECCOMP was enabled so it make sense to compare...
-      P_SYM(p_get_seccomp_filter)(p_current);
+#else
+   if (p_orig->p_ed_task.p_sec.flag && current == p_current) { // SECCOMP was enabled so it make sense to compare...
+#endif
+      p_lkrg_seccomp_filter_get(p_current);

 #if LINUX_VERSION_CODE >= KERNEL_VERSION(5,11,0)
       if (test_task_syscall_work(p_current,SECCOMP) != p_orig->p_ed_task.p_sec.flag) {
@@ -1408,11 +1407,7 @@ static int p_cmp_tasks(struct p_ed_process *p_orig, struct task_struct *p_curren

       P_CMP_PTR(p_orig->p_ed_task.p_sec.sec.filter, p_current->seccomp.filter, "seccomp filter")

-#if LINUX_VERSION_CODE >= KERNEL_VERSION(5,9,0)
-      P_SYM(p_put_seccomp_filter)(p_current->seccomp.filter);
-#else
-      P_SYM(p_put_seccomp_filter)(p_current);
-#endif
+      p_lkrg_seccomp_filter_put(p_current);
    }
 #endif

@@ -1980,13 +1975,11 @@ int p_exploit_detection_init(void) {
    P_SYM_INIT(__kernel_text_address)
    P_SYM_INIT(mm_find_pmd)
 #if defined(CONFIG_SECCOMP)
-   P_SYM_INIT(get_seccomp_filter)
-#if LINUX_VERSION_CODE >= KERNEL_VERSION(5,9,0)
-#define p___put_seccomp_filter p_put_seccomp_filter
-   P_SYM_INIT(__put_seccomp_filter)
-#else
-   P_SYM_INIT(put_seccomp_filter)
-#endif
+   if (P_LKRG_SUCCESS != p_lkrg_seccomp_init()) {
+      p_print_log(P_LOG_FATAL, "Can't initialize seccomp() logic");
+      p_ret = P_LKRG_GENERAL_ERROR;
+      goto p_exploit_detection_init_out;
+   }
 #endif

 #ifdef CONFIG_SECURITY_SELINUX
37 changes: 37 additions & 0 deletions src/modules/exploit_detection/syscalls/p_seccomp/p_seccomp.c
@@ -34,6 +34,43 @@ static struct kretprobe p_seccomp_kretprobe = {
    .data_size = sizeof(struct p_seccomp_data),
 };

+int p_lkrg_seccomp_init(void) {
+
+#if LINUX_VERSION_CODE < KERNEL_VERSION(5,9,0)
+   P_SYM_INIT(get_seccomp_filter)
+   P_SYM_INIT(put_seccomp_filter)
+#endif
Contributor

I think we shouldn't look up these symbols on 5.9+ at all. We should even exclude the pointer variables that would hold them. Since we don't actually call those functions on 5.9+, why still depend on being able to look them up? Besides, not having them on 5.9+ would keep us from inadvertently calling them (through a bug elsewhere in the code).

Collaborator Author

Yes, I overlooked that. Thx


+   return P_LKRG_SUCCESS;
+
+#if LINUX_VERSION_CODE < KERNEL_VERSION(5,9,0)
+p_sym_error:
+   return P_LKRG_GENERAL_ERROR;
+#endif
+}
+
+void p_lkrg_seccomp_filter_get(struct task_struct *p_task) {
+#if LINUX_VERSION_CODE < KERNEL_VERSION(5,9,0)
+   P_SYM(p_get_seccomp_filter)(p_task);
+#else
+   struct p_fake_seccomp_filter *p_filter = (struct p_fake_seccomp_filter *)p_task->seccomp.filter;
+
+   if (p_filter)
+      refcount_inc(&p_filter->refs);
+#endif
+}
+
+void p_lkrg_seccomp_filter_put(struct task_struct *p_task) {
+#if LINUX_VERSION_CODE < KERNEL_VERSION(5,9,0)
+   P_SYM(p_put_seccomp_filter)(p_task);
+#else
+   struct p_fake_seccomp_filter *p_filter = (struct p_fake_seccomp_filter *)p_task->seccomp.filter;
+
+   if (p_filter)
+      refcount_dec(&p_filter->refs);
Contributor

I think it is unsafe to just dec. We have to implement the logic of __put_seccomp_filter here. If another part of the kernel releases the only other remaining reference while we were holding ours, then it becomes our responsibility to release the associated resources (that other part of the kernel couldn't do that because of our extra reference). Please re-read my comments on the issue.

Contributor

Testing this special case (race with release by another part of the kernel) is tricky, though. And it's not good for us to have untested code in LKRG. Can you come up with a good test case?

Collaborator Author

@Adam-pi3, Jul 6, 2024

To be able to implement seccomp_filter_free we would need to define the full seccomp filter structure, because of the potential BPF program:
https://elixir.bootlin.com/linux/latest/source/kernel/seccomp.c#L523

static inline void seccomp_filter_free(struct seccomp_filter *filter)
{
	if (filter) {
		bpf_prog_destroy(filter->prog);
		kfree(filter);
	}
}

Maybe that's the only way? In that case, I guess a bit more research is needed before merging the PR...

Contributor

Oh, right. This may make the whole approach infeasible.

Maybe we should keep the dec and document the potential resource leak in a comment? Not that I like it...

Contributor

Another option is for us to fix only the get/put inconsistency bug, and not the symbol lookup failures our users are occasionally seeing - if so, we could pair inc with __put_seccomp_filter. Or we could have that as the primary approach, and have a fallback for when symbol lookup fails (either allow for resource leak then, or not monitor seccomp).

Contributor

> no mechanism for removing filters once they have been applied to a process

I was wondering where this phrase came from. I found it only in https://lwn.net/Articles/599931/ from 2014. I wonder if this still holds.

> When process dies, that's the only way to remove filters.

If so, what happens if the task dies while we hold the seccomp filter reference?

> In theory we do not need any sync logic since we have guarantee that as long as process is not dead, seccomp filters cannot be gone. However, filters can be modified

Does the refcount increase help us in any way in case the filter is modified? This is not locking against concurrent access anyway, and if you say the filter can't be gone at this time anyway, then the refcount increase is not needed.

Let's be consistent. By your logic, it appears that either we don't need even the inc/dec, or we need the full refcount decrease logic (including resource freeing on a possible decrease to zero). The middle ground of inc/dec only does not make sense under the assumption that the filter can't be gone at this time.

However, there's the case of the task dying. For this case, the inc/dec could reduce the impact from accessing freed memory to a resource leak.

Unless you convince me that this can't happen even without the refcount inc (task dying and filter freed from under us), I suggest we split this PR in two: at first, pair inc with __put_seccomp_filter. At least this would be consistent with what the kernel itself does, where it merely has inc in __get_seccomp_filter. So this is also a code state we might end up reverting to if we later find that whatever logic we come up with in the second PR is flawed.

Collaborator Author

Sorry for the delay in this thread. I did some homework and, per my understanding, the process can't be killed while we execute the seccomp verification logic in the kernel in the context of that same process. If we are interrupting / hooking a kernel call from the current process and we verify its seccomp rules, the process can't die while it is in the kernel. SIGKILL will be added to the list of signals, but the process won't die (if another thread decides to die). It means we are safe here. The corner case is if we decide to verify a not-currently-running-in-the-kernel process (which can happen in paranoid mode).
Filters can only be freed when the process is dying, and this is addressed by the comment above.

Contributor

> The corner case is if we decide to verify a not-currently-running-in-the-kernel process (which can happen in paranoid mode).

Well, even if you're correct about everything, we do have this mode and thus this corner case, and it is wrong for us to be doing especially unsafe things when in paranoid mode.

I'm really tempted to just exclude our seccomp support on 5.9+ for now.

Collaborator Author

I would rather add a check so that, when paranoid mode operates on a non-current context, we do not perform seccomp verification, but still do it otherwise. What do you think?

Contributor

> I would rather add a check so that, when paranoid mode operates on a non-current context, we do not perform seccomp verification, but still do it otherwise. What do you think?

I'm not entirely happy about that, but I'm OK with it if the code ends up looking sane.

+#endif
+}
+
 /*
  * x86-64 syscall ABI:
  * *rax - syscall_number
9 changes: 9 additions & 0 deletions src/modules/exploit_detection/syscalls/p_seccomp/p_seccomp.h
@@ -28,6 +28,15 @@ struct p_seccomp_data {
    ktime_t entry_stamp;
 };

+#if LINUX_VERSION_CODE >= KERNEL_VERSION(5,9,0)
+struct p_fake_seccomp_filter {
+   refcount_t refs;
+};
+#endif
+
+int p_lkrg_seccomp_init(void);
+void p_lkrg_seccomp_filter_get(struct task_struct *p_task);
+void p_lkrg_seccomp_filter_put(struct task_struct *p_task);

 int p_seccomp_ret(struct kretprobe_instance *p_ri, struct pt_regs *p_regs);
 int p_seccomp_entry(struct kretprobe_instance *p_ri, struct pt_regs *p_regs);