
regions_mm: vmh_free: Prevent null pointer dereference #9776

Merged: 1 commit, Jan 17, 2025

Conversation

softwarecki
Collaborator

The vmh_free function may dereference a null pointer if an invalid pointer is passed to it. Before referencing the allocator array item, check that it has been initialized.

Add a k_panic() call to the error handling code to help detect potential memory release errors.

@lgirdwood
Member

@kv2019i needed for v2.12 or is VMH not enabled?

lyakh
lyakh previously requested changes Jan 16, 2025

Review comment on zephyr/lib/regions_mm.c (outdated):

    tr_err(&zephyr_tr, "Unable to free %p! %d", ptr, ret);
    k_panic();

Again, not sure I understand why this is needed. Should we panic in all error cases? Or at least in all cases where we cannot directly propagate the error up the stack? Maybe we should use or add an IPC notification for some such cases.

@kv2019i
Collaborator

kv2019i commented Jan 16, 2025

@lgirdwood wrote:

@kv2019i needed for v2.12 or is VMH not enabled ?

VMH is now enabled for all Intel ACE platforms (CONFIG_VIRTUAL_HEAP=y), so this does affect the release. It's not clear under which circumstances this failure can happen (i.e. how serious it is), so to err on the safe side, I'll postpone the release tag until we have clarity.

@lgirdwood
Member

@lgirdwood wrote:

@kv2019i needed for v2.12 or is VMH not enabled ?

VMH is now enabled for all Intel ACE platforms (CONFIG_VIRTUAL_HEAP=y), so this does affect the release. It's not clear under which circumstances this failure can happen (i.e. how serious it is), so to err on the safe side, I'll postpone the release tag until we have clarity.

@softwarecki can you provide details on the reproduction, i.e. can it impact v2.12? Thanks!

@softwarecki
Collaborator Author

softwarecki commented Jan 16, 2025

@lgirdwood @kv2019i @lyakh
The issue reproduces when a response with a payload is sent for an IPC that was directed to a secondary core. The secondary core allocates memory to hold the payload and queues the response for IPC. The primary core removes this message from the queue, sends it, and tries to free the memory. Since each core has a separate VMH allocator, selected based on the core id, the pointer to be freed does not belong to the selected allocator. The allocator, searching for the sys_mem_blocks to which the pointer belongs, goes through the entire .physical_blocks_allocators array. The current memory allocation configuration defines only 9 blocks out of 10 possible, so the last element of the .physical_blocks_allocators array is NULL, which the code was not prepared for.

The vmh_free function returns an error if the heap belongs to another core, the given pointer is invalid (does not belong to the allocator), there is an error in the code determining the size of the block to be freed, or memory unmapping fails. I figured that the log entry is easy to miss. Stopping the firmware at this point will draw attention to a critical problem related to memory allocation. Otherwise, we will have a slowly progressing memory leak.

@lyakh
Collaborator

lyakh commented Jan 16, 2025

Since each core has a separate VMH allocator, selected based on the core id, the pointer to be freed does not belong to the selected allocator.

@softwarecki That's a very good explanation, thank you! Now the issues are becoming clearer, so we can think how to fix them.

  1. You said there are 10 slots, but only 9 are used. I understand those 9 used ones are currently determined by static_hp_buffers, which has 9 array entries. That's why only 9 are allocated. In principle, yes, the configuration can have fewer slots too, and the code should be prepared for that. I see it is already checked in vmh_alloc(), but for some reason the loop uses a continue (sof/zephyr/lib/regions_mm.c, lines 419 to 420 at fcbca60):

     if (!heap->physical_blocks_allocators[mem_block_iterator])
             continue;

     If a NULL entry is encountered, shouldn't it be a break? Same for the check in your proposed fix.
  2. Optionally you could consider adding the number of used slots in the array to struct vmh_heap
  3. You've explained that the primary core cannot free memory allocated by a secondary core, and currently exactly that is happening. Do we have a fix yet for that very problem - will the affected secondary core eventually free the buffer, or is it leaked?
  4. Could you also modify the commit message a bit - the function takes two pointers as arguments, so it wasn't clear to me which pointer was invalid. Maybe specifically say "second argument" or "pointer to be freed" or something similar?
  5. I still don't understand why the second commit in this series is needed. If you agree that it isn't really necessary for this fix and might not even be a good improvement, maybe we can drop it; that would resolve my request for changes.

The vmh_free function may have referenced a null pointer if an invalid
pointer to be freed was passed to it. Before referencing the allocator
array item, check if it has been initialized.

Signed-off-by: Adrian Warecki <[email protected]>
@lyakh
Collaborator

lyakh commented Jan 16, 2025

@lyakh

  1. Do we have a fix yet for that very problem

#9777

Ah! So it doesn't just "simplify" a flow, it also fixes a bug! Got it, would be good to have it mentioned there too.

@lyakh lyakh dismissed their stale review January 16, 2025 16:55

issue clarified, unneeded k_panic() addition removed

@softwarecki
Collaborator Author

@lyakh

  1. Both variants seem correct; please ask the author @dabekjakub why he chose this one.
  2. As above.
  3. ipc4: Simplify ipc response sending memory allocations #9777 Yea, this is a silent fix, please don't tell anyone. 🤫
  4. Fixed.
  5. zephyr: alloc: virtual_heap_free: Panic on deallocations errors #9779

@kv2019i
Collaborator

kv2019i commented Jan 16, 2025

@softwarecki wrote:

3. [ipc4: Simplify ipc response sending memory allocations #9777](https://github.com/thesofproject/sof/pull/9777) Yea, this is a silent fix, please don't tell anyone. 🤫

Sorry, I need your help with this one. This is getting more than the usual attention, as we were supposed to tag final 2.12 today and this is the only issue left. As far as I can tell, this invalid free can only happen with ipc4_get_large_config_module_instance() if sent to a module running on a non-zero core?

I'm puzzled why we haven't hit this in CI before. The NULL dereference is on read, so it should create a normal Zephyr panic when hit.

Not a mainstream case, but I guess we need to have this in the release.

@lyakh
Collaborator

lyakh commented Jan 17, 2025

@softwarecki wrote:

3. [ipc4: Simplify ipc response sending memory allocations #9777](https://github.com/thesofproject/sof/pull/9777) Yea, this is a silent fix, please don't tell anyone. 🤫

Sorry, I need your help with this one. This is getting more than the usual attention, as we were supposed to tag final 2.12 today and this is the only issue left. As far as I can tell, this invalid free can only happen with ipc4_get_large_config_module_instance() if sent to a module running on a non-zero core?

I'm puzzled why we haven't hit this in CI before. The NULL dereference is on read, so it should create a normal Zephyr panic when hit.

Not a mainstream case, but I guess we need to have this in the release.

@kv2019i It has to be an IPC response with payload sent by a secondary core. I guess we don't have those in our tests. Maybe we should add some. Which IPCs return payload? Something like get-config? Can we test some of those on secondary cores?

@lyakh
Collaborator

lyakh commented Jan 17, 2025

Both variants seem correct, please ask the author @dabekjakub why he chose this one.

@softwarecki with continue you support sparsely populated arrays, e.g. if somebody uses a VMH configuration with 0-sized slots, or if some slots failed to allocate and at some point we have a version that then leaves them empty... I haven't checked the code just now, but probably none of those are currently possible, so I'd make it explicit that the first empty slot means the end of VMH slots, and then use a break. But maybe there are valid use-cases... It would be good to make that explicit at least.

@lyakh
Collaborator

lyakh commented Jan 17, 2025

Not a mainstream case, but I guess we need to have this in the release.

@kv2019i with my incomplete knowledge of all IPC4 response cases, I'd tend to agree.

@kv2019i
Collaborator

kv2019i commented Jan 17, 2025

Update on this: I've been reviewing the impact of the bug this is addressing, and it seems this case cannot be hit in configurations we are shipping in public releases. Given this and the CI issues with getting this merged, I'm leaning toward excluding this from the 2.12 final tag today. @lgirdwood @softwarecki @lyakh please add if you have further data points to consider.

@lyakh
Collaborator

lyakh commented Jan 17, 2025

Update on this: I've been reviewing the impact of the bug this is addressing, and it seems this case cannot be hit in configurations we are shipping in public releases. Given this and the CI issues with getting this merged, I'm leaning toward excluding this from the 2.12 final tag today. @lgirdwood @softwarecki @lyakh please add if you have further data points to consider.

@kv2019i It looks like this bug cannot trigger with any of the topologies included in official SOF releases, so to trigger it you'd need a custom topology. It still doesn't feel great, but well, if a bug cannot be triggered, it doesn't exist...

@kv2019i kv2019i removed this from the v2.12 milestone Jan 17, 2025
@kv2019i kv2019i merged commit 898adc7 into thesofproject:main Jan 17, 2025
46 of 49 checks passed