Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dram data storage: fast_get()/_put() #9765

Draft
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

jsarha
Copy link
Contributor

@jsarha jsarha commented Jan 7, 2025

@lyakh , @singalsu , @lrgirdwo This is the current situation with fast_get()/_put(), and its usage in SRC module.

Copy link
Collaborator

@singalsu singalsu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Were you able to check MCPS impact before/after. I guess it should not impact?

@jsarha
Copy link
Contributor Author

jsarha commented Jan 7, 2025

Looks good to me. Were you able to check MCPS impact before/after. I guess it should not impact?

No there was no difference:
No SRC
ll_thread2 stack 18.8% load 5.9%

44.1kHz to 48kHz SRC coefs in SRAM
ll_thread2 stack 18.8% load 16.1%

44.1kHz to 48kHz SRC coefs in DRAM
ll_thread2 stack 18.8% load 23.1%

44.1kHz to 48kHz SRC coefs copied from DRAM to SRAM
ll_thread2 stack 18.8% load 16.1%

The full measurements log is here:
src_tests_with_nocodec.txt

@singalsu
Copy link
Collaborator

singalsu commented Jan 7, 2025

Looks good to me. Were you able to check MCPS impact before/after. I guess it should not impact?

No there was no difference: No SRC ll_thread2 stack 18.8% load 5.9%

44.1kHz to 48kHz SRC coefs in SRAM ll_thread2 stack 18.8% load 16.1%

44.1kHz to 48kHz SRC coefs in DRAM ll_thread2 stack 18.8% load 23.1%

44.1kHz to 48kHz SRC coefs copied from DRAM to SRAM ll_thread2 stack 18.8% load 16.1%

The full measurements log is here: src_tests_with_nocodec.txt

Thanks, that's very good news! Actually I'm surprised that that load without copy to SRAM isn't higher.

@jsarha
Copy link
Contributor Author

jsarha commented Jan 8, 2025

Thanks, that's very good news! Actually I'm surprised that that load without copy to SRAM isn't high

It is. There was three tests, coefs in SRAM all the time (16.1% load), coefs in DRAM no copying (23.1% load), and coefs in DRAM but copied to SRAM before use (16.1% load). An for comparison, same pipelines without need for SRC 5.9% load.

Copy link
Collaborator

@lyakh lyakh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks quite good! Just a couple of simple mostly cosmetic changes

Kconfig.sof Outdated
module data from DRAM to SRAM when the data is needed and freeing
the SRAM when the data is not needed anymore. If multiple module
instances need the same chunk the same copy is used with reference
counting. Source is src/lib/gast-get.c. The option should be selected
Copy link
Collaborator

@lyakh lyakh Jan 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo "fast-get.c"


#include <stddef.h>

void *fast_get(const void * const dram_ptr, size_t size);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

must return a const pointer too

#include <stddef.h>

void *fast_get(const void * const dram_ptr, size_t size);
void fast_put(void *sram_ptr);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const void *

/* for cmocka environment */
#ifndef EXPORT_SYMBOL
#define EXPORT_SYMBOL(a)
#endif
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mmh, it is already defined in posix/include/rtos/symbol.h - isn't that enough? Need to

#include <rtos/symbol.h>

then it should work

const void *dram_ptr;
void *sram_ptr;
size_t size;
uint32_t refcount;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not just unsigned int?

{
void *ret;

(void)state; /* unused */
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does ARG_UNUSED() work in cmocka?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its not defined in cmocka environment. Its coming from Zephyr side, and there its defined as
#define ARG_UNUSED(x) (void)(x)

I could in make some similar hack header for cmocka code, but open coded void cast is already used in several places cmocka tests.

copy[0][i] = fast_get(testdata[i], sizeof(testdata[0]));

for (i = 0; i < ARRAY_SIZE(copy[0]); i++)
copy[1][i] = fast_get(testdata[i], sizeof(testdata[0]));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could also test unmatched allocation - with a wrong size. Also you could add a test for too large an allocation - don't think malloc() will succeed a 1GiB allocation even if it's available, IIRC there was a limit. But maybe I'm wrong. This can be an incremental PR, this test set is already quite good!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I can add one more test. But I do not think the allocation limit should be enforced by fast-get-code.

@@ -593,6 +594,60 @@ int src_param_set(struct comp_dev *dev, struct comp_data *cd)
return 0;
}

static void debug_print(struct comp_dev *dev, const char *name, const struct src_stage *s)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left-over

stage_dst1 = rmalloc(SOF_MEM_ZONE_RUNTIME, 0, SOF_MEM_CAPS_RAM,
2 * sizeof(*stage_dst1));
if (!stage_dst1) {
comp_err(dev, "src_allocate_copy_stages(): failed allocate stages");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please don't use function name in logging, Zephyr does it already

const struct src_stage *stage_src1,
const struct src_stage *stage_src2)
{
struct src_stage *stage_dst1;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe just use an array as struct src_stage *stage_dst and then stage_dst[0].coefs etc. but up to you too, just an idea

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't bother with the change, not yet anyway. I have a hunch that there will be more rounds coming.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we put the ifdef FAST_GET at the function level i.e. if FAST_GET=n then src_allocate_copy_stages() does what it does today, if =y then it does the fast get logic

@lgirdwood
Copy link
Member

@jsarha can you list any dependency PRs that need to be merged first here. Thanks !

lyakh added 2 commits January 9, 2025 10:12
With this data can be assigned to a separate read-only section, which
then will be kept in DRAM without copying it to SRAM on module
instantiation.

Signed-off-by: Guennadi Liakhovetski <[email protected]>
src coefficients take a lot of space, keep them in DRAM to only copy
used sets in SRAM.

Signed-off-by: Guennadi Liakhovetski <[email protected]>
@lyakh
Copy link
Collaborator

lyakh commented Jan 9, 2025

@jsarha can you list any dependency PRs that need to be merged first here. Thanks !

@lgirdwood I think all the dependencies have been merged now. It just needs to switch to using a newer version of my commits from #9721

@lgirdwood
Copy link
Member

@jsarha can you list any dependency PRs that need to be merged first here. Thanks !

@lgirdwood I think all the dependencies have been merged now. It just needs to switch to using a newer version of my commits from #9721

@lyakh just some conflicts now.

Jyri Sarha added 3 commits January 15, 2025 11:38
Implement fast_get() and fast_put() functions. The purpose of these
functions is to maintain shared SRAM copies of data stored in DRAM.

fast_get()

First checks if there is already an SRAM copy of the same DRAM chunk.

If there isn't reserve an SRAM chunk of the same size and copy the
contents there, store the both pointers, size and reference count to
in an internal data structure, and return the SRAM pointer.

If there is, return the pointer to the existing SRAM copy and
increase the reference count.

fast_put()

Look up the internal data record based on the SRAM address and
decrement reference count. Free the SRAM chunk and the data record if
reference count reaches zero,

Signed-off-by: Jyri Sarha <[email protected]>
And simple module tests for fast_get and fast_put() implemented in
src/lib/fast-get/fast-get.c.

Signed-off-by: Jyri Sarha <[email protected]>
The SRC coefficients are loaded to DRAM and commit copies the
coefficients to SRAM when they are needed. The copying is done using
fast_get() and the copy is released with fast_put() when its not
needed anymore.

Signed-off-by: Jyri Sarha <[email protected]>
Copy link
Contributor Author

@jsarha jsarha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review comment applied and commits rebased on top of the latest #9721

{
void *ret;

(void)state; /* unused */
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its not defined in cmocka environment. Its coming from Zephyr side, and there its defined as
#define ARG_UNUSED(x) (void)(x)

I could in make some similar hack header for cmocka code, but open coded void cast is already used in several places cmocka tests.

copy[0][i] = fast_get(testdata[i], sizeof(testdata[0]));

for (i = 0; i < ARRAY_SIZE(copy[0]); i++)
copy[1][i] = fast_get(testdata[i], sizeof(testdata[0]));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I can add one more test. But I do not think the allocation limit should be enforced by fast-get-code.

const struct src_stage *stage_src1,
const struct src_stage *stage_src2)
{
struct src_stage *stage_dst1;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't bother with the change, not yet anyway. I have a hunch that there will be more rounds coming.

@jsarha jsarha force-pushed the dram_data_storage branch from 1c74172 to d7f2279 Compare January 15, 2025 15:22
@jsarha
Copy link
Contributor Author

jsarha commented Jan 15, 2025

After the rebase something has happened to the performance of the case where coefficients are left in DRAM and not copied to SRAM. The 44.1kHz playback load of that case has dropped from 23.1% to 18.8%. @lyakh , any idea where this is coming? This result is from MTL so it may not be important, but it is still interesting. Also, now the build fails if SRC is not built as module.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants