Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Freezing OP-TEE fails during suspend #4581

Open
cgellner opened this issue Apr 28, 2021 · 15 comments
Open

Freezing OP-TEE fails during suspend #4581

cgellner opened this issue Apr 28, 2021 · 15 comments

Comments

@cgellner
Copy link

When trying to suspend to RAM the OP-TEE TA user-space tasks using OP-TEE might fail to freeze

PM: suspend entry (s2idle)
Filesystems sync: 0.000 seconds
Freezing user space processes ... 
Freezing of tasks failed after 20.005 seconds (1 tasks refusing to freeze, wq_busy=0):
task:optee_example_s state:R  running task     stack:    0 pid:  128 ppid:     1 flags:0x00000001
[<807d3e24>] (__schedule) from [<c01bae20>] (0xc01bae20)
OOM killer enabled.
Restarting tasks ... done.
PM: suspend exit

According to my understanding this happens if a TA, triggered by the task to freeze, is trying to communicate to tee-supplicant, but tee-supplicant has been frozen first.

In order to reproduce OP-TEE QEMU environment can be used.
In order to increase the probability of the TA trying to communicate to tee-supplicant during suspend, the below patch should be applied to secure_storage_ta.c in optee_examples.

index edaf6a3..39f5ed2 100644
--- a/secure_storage/ta/secure_storage_ta.c
+++ b/secure_storage/ta/secure_storage_ta.c
@@ -256,11 +256,19 @@ TEE_Result TA_InvokeCommandEntryPoint(void __unused *session,
 				      uint32_t param_types,
 				      TEE_Param params[4])
 {
+	TEE_Result res;
+	size_t i;
+
 	switch (command) {
 	case TA_SECURE_STORAGE_CMD_WRITE_RAW:
 		return create_raw_object(param_types, params);
 	case TA_SECURE_STORAGE_CMD_READ_RAW:
-		return read_raw_object(param_types, params);
+		for (i=0; i<400; i++) {
+			res = read_raw_object(param_types, params);
+			if (res != TEE_SUCCESS)
+				break;
+		}
+		return res;
 	case TA_SECURE_STORAGE_CMD_DELETE:
 		return delete_object(param_types, params);
 	default:
--

Afterwards you can reproduce by executing
(optee_example_secure_storage &); sleep 1; echo -n mem > /sys/power/state
from non-secure world.

During debugging I could see, that while kernel trying to freeze the task is looping in
drivers/tee/supp.c - optee_supp_thrd_req
while (wait_for_completion_interruptible(&req->c))
Note: In my understanding other cases might be possible as well e.g.

  • optee_cq_wait_for_completion waiting for available OP-TEE Thread
  • one TA waiting in OPTEE_MSG_RPC_WAIT_QUEUE_SLEEP (wq_sleep) for other TA calling OPTEE_MSG_RPC_WAIT_QUEUE_WAKEUP (wq_wakeup). But second TA is blocked due to one of the other cases.

When I modified kernel/power/process.c - try_to_freeze_tasks - not to freeze tee-supplicant when freezing user tasks, the issue is not reproducible and suspend will happen after the TA did complete its work.
I wonder if there is a generic kernel mechanism to avoid freezing of user-space helper tasks like tee-supplicant.
How is this solved for fuse-FS? Isn't it using some user-space tasks for processing as well?

When I modified drivers/tee/supp.c - optee_supp_thrd_req - to call try_to_freeze in the wait_for_completion_interruptible loop mentioned above, the issue is not reproducible and suspend will happen although TA did not yet complete its work. TA will continue after the resume.
I'm not clear, if it is legal from OP-TEE perspective to suspend while TA is still executing.

@etienne-lms
Copy link
Contributor

Thanks a lot for the description. Pretty clear and detailed.
This all makes sense. I think the last proposal is the right way.

@cgellner
Copy link
Author

In order to reproduce the other two situations where wait_for_completion can get stuck, you need multiple parallel requests.

To allow multiple requests you can apply the following patch to secure_storage in optee_examples (in addition to the reproduction patch mentioned above)

diff --git a/secure_storage/ta/user_ta_header_defines.h b/secure_storage/ta/user_ta_header_defines.h
index cc69e72..83e762d 100644
--- a/secure_storage/ta/user_ta_header_defines.h
+++ b/secure_storage/ta/user_ta_header_defines.h
@@ -36,7 +36,7 @@
 #define TA_UUID                                TA_SECURE_STORAGE_UUID
-#define TA_FLAGS                       (TA_FLAG_EXEC_DDR | TA_FLAG_SINGLE_INSTANCE)
+#define TA_FLAGS                       (TA_FLAG_EXEC_DDR | TA_FLAG_SINGLE_INSTANCE | TA_FLAG_MULTI_SESSION)
 #define TA_STACK_SIZE                  (2 * 1024)
 #define TA_DATA_SIZE                   (32 * 1024)
--

Afterwards you can reproduce the issue in QEMU v7 using the following command

for I in $(seq 1 3); do (optee_example_secure_storage &); done; sleep 3; echo -n mem > /sys/power/state

In QEMU v7 you could additionally schedule resuming e.g. after 30sec by prepending

echo "+30" > /sys/devices/platform/9010000.pl031/rtc/rtc0/wakealarm

Afterwards you should get a output similar to below having 3 stuck tasks

- Read back the object
PM: suspend entry (s2idle)
Filesystems sync: 0.000 seconds
Freezing user space processes ...
Freezing of tasks failed after 20.008 seconds (3 tasks refusing to freeze, wq_busy=0):
task:optee_example_s state:R  running task     stack:    0 pid:  124 ppid:     1 flags:0x00000001
[<807d3e24>] (__schedule) from [<841c4000>] (0x841c4000)
task:optee_example_s state:D stack:    0 pid:  126 ppid:     1 flags:0x00000001
[<807d3e24>] (__schedule) from [<807d41d0>] (schedule+0x60/0x120)
[<807d41d0>] (schedule) from [<807d7ffc>] (schedule_timeout+0x1f4/0x340)
[<807d7ffc>] (schedule_timeout) from [<807d56a0>] (wait_for_completion+0x94/0xfc)
[<807d56a0>] (wait_for_completion) from [<80692134>] (optee_cq_wait_for_completion+0x14/0x60)
[<80692134>] (optee_cq_wait_for_completion) from [<806924dc>] (optee_do_call_with_arg+0x14c/0x154)
[<806924dc>] (optee_do_call_with_arg) from [<80692edc>] (optee_shm_unregister+0x78/0xcc)
[<80692edc>] (optee_shm_unregister) from [<80690a9c>] (tee_shm_release+0x88/0x174)
[<80690a9c>] (tee_shm_release) from [<8057f89c>] (dma_buf_release+0x44/0xb0)
[<8057f89c>] (dma_buf_release) from [<8028e4e8>] (__dentry_kill+0x110/0x17c)
[<8028e4e8>] (__dentry_kill) from [<80276cfc>] (__fput+0xc0/0x234)
[<80276cfc>] (__fput) from [<80140b1c>] (task_work_run+0x90/0xbc)
[<80140b1c>] (task_work_run) from [<8010b1c8>] (do_work_pending+0x4a0/0x5a0)
[<8010b1c8>] (do_work_pending) from [<801000cc>] (slow_work_pending+0xc/0x20)
Exception stack(0x843f5fb0 to 0x843f5ff8)
5fa0:                                     00000000 7ef63448 fffffffe 00000000
5fc0: 7ef63448 76f163b0 7ef63448 00000006 7ef63448 7ef634e0 7ef63438 00000000
5fe0: 00000006 7ef63400 76e74833 76dff856 800e0130 00000004
task:optee_example_s state:D stack:    0 pid:  128 ppid:     1 flags:0x00000001
[<807d3e24>] (__schedule) from [<807d41d0>] (schedule+0x60/0x120)
[<807d41d0>] (schedule) from [<807d7ffc>] (schedule_timeout+0x1f4/0x340)
[<807d7ffc>] (schedule_timeout) from [<807d56a0>] (wait_for_completion+0x94/0xfc)
[<807d56a0>] (wait_for_completion) from [<8069359c>] (optee_handle_rpc+0x554/0x710)
[<8069359c>] (optee_handle_rpc) from [<806924cc>] (optee_do_call_with_arg+0x13c/0x154)
[<806924cc>] (optee_do_call_with_arg) from [<80692910>] (optee_invoke_func+0x110/0x190)
[<80692910>] (optee_invoke_func) from [<8068fe3c>] (tee_ioctl+0x113c/0x1244)
[<8068fe3c>] (tee_ioctl) from [<802892ec>] (sys_ioctl+0xe0/0xa24)
[<802892ec>] (sys_ioctl) from [<80100060>] (ret_fast_syscall+0x0/0x54)
Exception stack(0x8424ffa8 to 0x8424fff0)
ffa0:                   00000000 7eb67584 00000003 8010a403 7eb67438 7eb675fc
ffc0: 00000000 7eb67584 7eb67604 00000036 7eb67448 7eb674e0 7eb67438 00000000
ffe0: 76ef7030 7eb6742c 76ee6469 76e83178
OOM killer enabled.
Restarting tasks ... done.
PM: suspend exit
sh: write error: Device or resource busy

The first back-trace is stuck while waiting for tee-supplicant (see initial message)

The second back-trace shows optee_cq_wait_for_completion waiting for available OP-TEE Thread

The last back-trace seems to be at OPTEE_MSG_RPC_WAIT_QUEUE_SLEEP as this is the only other occurrence of wait_for_completion below optee_handle_rpc. It seems to be waiting for the single instance to become available.

@cgellner
Copy link
Author

Although it isn't directly related and it won't get stuck I wanted to mention one additional scenario influenced by freezing the tasks.

The msleep_interruptible in handle_rpc_func_cmd_wait should be affected by the fake signals of the suspend.
As it is interruptible it should not get stuck, but from my understanding the wait time might be significantly shorter than expected.

@github-actions
Copy link

This issue has been marked as a stale issue because it has been open (more than) 30 days with no activity. Remove the stale label or add a comment, otherwise this issue will automatically be closed in 5 days. Note, that you can always re-open a closed issue at any time.

@github-actions github-actions bot added the Stale label Jun 11, 2021
@cgellner
Copy link
Author

I proposed a set of patches via mailing lists
[RFC PATCH 0/3] tee: optee: Allow to freeze when tee-supplicant is frozen

Currently waiting for some more feedback

@github-actions
Copy link

This issue has been marked as a stale issue because it has been open (more than) 30 days with no activity. Remove the stale label or add a comment, otherwise this issue will automatically be closed in 5 days. Note, that you can always re-open a closed issue at any time.

@github-actions github-actions bot added the Stale label Jul 12, 2021
@cgellner
Copy link
Author

Beside the response of @jenswi-linaro I didn't see any feedback on RFC tagged patches

Therefore I send patches again - this time without RFC tag
[PATCH 0/3] tee: optee: Allow to freeze when tee-supplicant is frozen

@github-actions github-actions bot removed the Stale label Jul 17, 2021
@github-actions
Copy link

This issue has been marked as a stale issue because it has been open (more than) 30 days with no activity. Remove the stale label or add a comment, otherwise this issue will automatically be closed in 5 days. Note, that you can always re-open a closed issue at any time.

@etienne-lms
Copy link
Contributor

Can this thread be re-open and tagged enhancement?

@cgellner
Copy link
Author

cgellner commented Mar 4, 2022

Hi,
I have resend the request to [email protected]; [email protected] last August.

Do you know if the mentioned "suspend" problem has been solved differently ?
I didn't test with most recent OP-TEE or kernel.

I have seen that latest kernel has some small changes (787c80cc7b22804aa370f04a19f9fe0fa98b1e49)
So I believe patch 3/3 can't be applied, but I assume the issue is still there in the new code (still using wait_for_completion)

In case there is some interest in this patch I could

  • reproduce issue again in latest version
  • rework the patch set so it can be applied on latest kernel version
  • resend the patch request

But before starting I would like to understand if there is some general interest in this patch at all or if failing to freeze during suspend is no problem from your perspective.

@jforissier
Copy link
Contributor

Hi @cgellner,

Technically speaking I can't say if your patches are correct because I don't know the Linux suspend mechanisms well enough. However it seems you are addressing a real issue with these patches, so I think they deserve more attention.

In case there is some interest in this patch I could

* reproduce issue again in latest version

* rework the patch set so it can be applied on latest kernel version

* resend the patch request

@jenswi-linaro I think it is the way to go, isn't it? Maybe CC some experts of the "freeze" interface, too?

@jenswi-linaro
Copy link
Contributor

I agree and the same problem here that I don't know that area well enough.

@victorchenpan
Copy link

I have meet the problem too. When tee-supplicant has been frozen, and then other task request tee-supplicant, it can't enter freezing flow. So kernel can't finish freeze flow, and will not enter suspend flow.

Your patch can solve the problem.

@jforissier
Copy link
Contributor

The last trace of @cgellner 's patchset I could see on the Linux Kernel Mailing list is https://lore.kernel.org/all/[email protected]/ (August 2021) and it has never gathered any Acked-by, Reviewed-by or Tested-by apparently (at least there is no such tags in the patches themselves). @victorchenpan are you subscribed to the LKML? Would you be able to provide feedback if the series was to be re-posted? Or perhaps even reply to the already sent emails by following the "Reply instructions" on the ML page?

@baiqiang728
Copy link

Hi, I have resend the request to [email protected]; [email protected] last August.

Do you know if the mentioned "suspend" problem has been solved differently ? I didn't test with most recent OP-TEE or kernel.

I have seen that latest kernel has some small changes (787c80cc7b22804aa370f04a19f9fe0fa98b1e49) So I believe patch 3/3 can't be applied, but I assume the issue is still there in the new code (still using wait_for_completion)

In case there is some interest in this patch I could

  • reproduce issue again in latest version
  • rework the patch set so it can be applied on latest kernel version
  • resend the patch request

But before starting I would like to understand if there is some general interest in this patch at all or if failing to freeze during suspend is no problem from your perspective.

Thanks for your patch @cgellner , I met the same issue. After I update your patch, it seems that it works. However, if I do more test , such as stress test for STR when xtest is running. It may cause CPU rcu stall issue. Do you have similar phenomena ?

Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants