

Realm: More robust handling of case when -ll:nsize allocation fails #1785

Open
manopapad opened this issue Oct 29, 2024 · 4 comments
Labels
Realm Issues pertaining to Realm

Comments

manopapad (Contributor) commented Oct 29, 2024

So it looks like currently, when we ask for an -ll:nsize that is too large, we get a warning and then the run continues without that allocation at all.

For example, running legate with:

--cpus=1 --gpus=1 --omps=1 --ompthreads=28 --utility=2 --sysmem=256 --numamem=308442 --fbmem=14184 --zcmem=128 --regmem=0

we see:

[0 - 7f6783618000]    0.000000 {4}{numa}: insufficient memory in NUMA node 0 (323424878592 > 67007963136 bytes) - skipping allocation
[0 - 7f6783618000]    0.000000 {4}{numa}: insufficient memory in NUMA node 1 (323424878592 > 56011509760 bytes) - skipping allocation
[0 - 7f6783618000]    0.000736 {4}{threads}: reservation ('OMP0 proc 1d00000000000003 (worker 16)') cannot be satisfied

then later the available memories are:

[0 - 7fe255c00000]    0.231816 {2}{legate.mapper}: Memories on rank 0:
[0 - 7fe255c00000]    0.231842 {2}{legate.mapper}:   1e00000000000000 (SYSTEM_MEM): 268435456 bytes
[0 - 7fe255c00000]    0.231848 {2}{legate.mapper}:   1e00000000000001 (SYSTEM_MEM): 0 bytes
[0 - 7fe255c00000]    0.231855 {2}{legate.mapper}:   1e00000000000002 (GPU_FB_MEM): 14873001984 bytes
[0 - 7fe255c00000]    0.231867 {2}{legate.mapper}:   1e00000000000003 (GPU_DYNAMIC_MEM): 15655829504 bytes
[0 - 7fe255c00000]    0.231876 {2}{legate.mapper}:   1e00000000000004 (Z_COPY_MEM): 134217728 bytes
[0 - 7fe255c00000]    0.231882 {2}{legate.mapper}:   1e00000000000005 (FILE_MEM): 0 bytes

I think this should either be an error, or Realm should proceed and just allocate the memory but allow it to span multiple NUMA domains.
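The second option could look something like a greedy split of the requested pool across the available NUMA nodes. Below is a minimal sketch of that idea (not Realm's actual API; the function name and the per-node sizes, taken from the log above, are illustrative):

```python
def split_across_numa_nodes(requested, node_free):
    """Greedily split a requested allocation across NUMA nodes.

    requested: total bytes asked for (e.g. via -ll:nsize)
    node_free: dict mapping node id -> free bytes on that node
    Returns a dict of node id -> bytes to allocate there, or None
    if even the combined free memory cannot satisfy the request.
    """
    if requested > sum(node_free.values()):
        return None
    plan = {}
    remaining = requested
    # Take from the nodes with the most free memory first.
    for node, free in sorted(node_free.items(), key=lambda kv: -kv[1]):
        take = min(free, remaining)
        if take:
            plan[node] = take
            remaining -= take
        if remaining == 0:
            break
    return plan

# Numbers from the log above: 323424878592 bytes requested,
# two NUMA nodes with 67007963136 and 56011509760 bytes free.
requested = 323424878592
free = {0: 67007963136, 1: 56011509760}
print(split_across_numa_nodes(requested, free))  # None: both nodes combined are still too small
```

In this particular run even the sum of both nodes falls short, so spanning domains would not have helped here; the error path would still be taken.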

@apryakhin apryakhin added the Realm Issues pertaining to Realm label Oct 29, 2024
eddy16112 (Contributor) commented Oct 29, 2024

@manopapad Do the following behaviors sound good to you?

  1. If -ll:nsize or -ll:ncpu is set but NUMA is not available on the machine, we treat it as an error.
  2. If -ll:nsize is too large, we treat it as an error. What about -ll:ncpu? Should we oversubscribe, as we do for -ll:cpu?

manopapad (Contributor, Author) commented:

Let's bring this up for discussion at an upcoming Realm/Legion meeting.

eddy16112 (Contributor) commented:

Based on the discussion in the meeting, we will let Realm crash in both cases.

muraj commented Oct 30, 2024

Just to be clear, I do not approve of crashing inside Realm unless it's a true bug that Realm engineering needs to deal with. That said, we have cases today (like specifying -ll:gpu when the CUDA module is not loaded, or specifying too large a value for -ll:fsize) where we crash. To keep behavior consistent and predictable, these different cases should respond in the same way. For now, we'll crash with an error message. Later, once we reach better consensus on how to propagate errors while maintaining compatibility, we'll return an error and turn off the feature but otherwise limp on, or use some other mechanism we work out.
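The two policies above (fail fast now, limp on later) can be sketched side by side. This is an illustrative Python mock, not Realm's actual API; the function name, parameters, and messages are assumptions for the sake of the example:

```python
import sys

def init_numa_memory(requested, free_bytes, fail_fast=True):
    """Try to reserve a NUMA memory pool, applying one of two error policies.

    fail_fast=True  -> crash with an error message (the near-term plan).
    fail_fast=False -> warn, disable the feature, and limp on (the longer-term idea).
    Returns the number of bytes actually reserved.
    """
    if requested <= free_bytes:
        return requested  # allocation succeeds as asked
    msg = (f"insufficient NUMA memory: requested {requested} bytes, "
           f"only {free_bytes} available")
    if fail_fast:
        sys.exit(f"FATAL: {msg}")  # crash with an error message
    print(f"WARNING: {msg} - feature disabled", file=sys.stderr)
    return 0  # limp on with no NUMA pool

# Limp-on mode: the run continues, but with no NUMA allocation
# (sizes taken from the log earlier in this issue).
print(init_numa_memory(323424878592, 67007963136, fail_fast=False))  # 0
```

The limp-on branch is essentially what Realm does today ("skipping allocation"); the difference under discussion is whether the caller gets a hard error instead of only a warning.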
