Using ULT stack to store RPC lineage #300

mdorier · 2024-12-21T14:41:37Z

@carns This PR is an alternative to disabling ABT_keys for lineage tracking. It tries to solve the performance problem by using the end of a ULT's stack to write the current RPC ID. I'm also writing a magic number before it so we can check if the RPC ID has been set (rather than getting garbage), and these values are wiped at the end of the ULT so if the stack is reused, we don't risk having valid information there.

While we rarely use lineage information for RPCs themselves (i.e. knowing the parent of an RPC), I use lineage information all the time when it comes to tracking performance of bulk transfers, so I'd like to try fixing the performance problem rather than making lineage tracking optional and off by default. Could you give this PR a try with hpctoolkit?

codecov · 2024-12-21T14:48:00Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 57.97%. Comparing base (ce62b9e) to head (093a203).

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #300      +/-   ##
==========================================
+ Coverage   57.93%   57.97%   +0.04%     
==========================================
  Files          70       71       +1     
  Lines       10196    10202       +6     
  Branches     1341     1339       -2     
==========================================
+ Hits         5907     5915       +8     
+ Misses       3453     3451       -2     
  Partials      836      836

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

src/margo-rpc-lineage.h

carns · 2025-01-07T22:38:09Z

src/margo-rpc-lineage.h

+
+static inline int margo_lineage_set(hg_id_t current_rpc_id) {
+    __MARGO_LINEAGE_COMMON;
+    memcpy(stackend - 8 - sizeof(hg_id_t), magic, 8);


Minor stylistic comment. It looks odd for one sizeof() to use the type and the other sizeof() to use the variable (visually it makes you think that they are different values). Maybe just use the same style all the way through? I usually prefer the latter but the former might make more sense here since erase() doesn't have a variable name to use.

carns · 2025-01-07T22:40:37Z

Nifty solution to stash the information in the stack :) I would not have thought of that.

I would have a mild preference for fixing the key functions in ABT if it were straightforward (since they are intuitive for this use case), but this is probably the more practical performance solution. I'll run some tests with a profiler and report back.

carns · 2025-01-08T15:58:42Z

Unfortunately, based on an initial hpctoolkit profile this looks like it may actually be more costly than the ABT key based implementation:

This is part of the "bottom up" view from hpcviewer. It shows ABT_thread_get_attr() consuming 3.7% CPU as a result of calls from lineage_set() and lineage_erase(), almost all attributed to posix_memalign() calls triggered by ABT.

Interestingly I think I'm getting more detail on my current hpctoolkit runs than I was before. I'm going to try a clean build without this PR and see if I can find more information about costs in the original implementation. If it indicates something straightforward in the key path that we could improve then maybe we consider that, otherwise this may just be an overhead that we need to pay for now and move on to some higher-value issues.

mdorier · 2025-01-08T16:12:39Z

It's the ABTI_thread_attr_dup that's causing an allocation. We shouldn't need to duplicate the thread attribute and free it later, Argobots should give us an ABT_thread_get_stack_info(ABT_thread, void** address, size_t* size); to just get the info we need.

carns · 2025-01-08T16:36:10Z

FWIW, I am not able to resolve anything further below ABT_key_set() (as in the ABT_thread_get_attr() call above). In this example it happened to only show up as 1.5% CPU time but I think more typically I have seen 2.5% in the past:

So maybe false alarm about seeing more detail in hpctoolkit this time. If the get_attr() is explicitly using posix_memalign() rather than a simple malloc that could be the culprit in the previous path; I don't know if it has a specific reason for creating aligned memory.

carns · 2025-01-08T17:00:27Z

Hrm. The default key table size is tunable in Argobots via ABT_KEY_TABLE_SIZE and will have an impact on the initial table allocation (which must be done on the first key_set() call for a work unit if it doesn't yet exist, if I'm reading the code correctly in https://github.com/pmodels/argobots/blob/main/src/include/abti_key.h#L225 ). This parameter has a complicated impact on how memory is allocated both at table creation time and key creation time. I'll do a quick check to see what this parameter does.

carns · 2025-01-08T18:36:12Z

Setting ABT_KEY_TABLE_SIZE=0 seems to offer a minor improvement (down to 1.1% CPU) but the results are noisy. I think I'd like to pause this for now and come back to it later.

using ULT stack to store RPC lineage

fe536a9

mdorier assigned carns Dec 21, 2024

github-advanced-security bot found potential problems Dec 21, 2024

View reviewed changes

src/margo-rpc-lineage.h Dismissed Show dismissed Hide dismissed

src/margo-rpc-lineage.h Fixed Show fixed Hide fixed

fixed memory leak and codeql

093a203

carns mentioned this pull request Jan 7, 2025

disable rpc tracing by default #299

Closed

carns reviewed Jan 7, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Using ULT stack to store RPC lineage #300

Using ULT stack to store RPC lineage #300

mdorier commented Dec 21, 2024

codecov bot commented Dec 21, 2024 •

edited

Loading

carns Jan 7, 2025

carns commented Jan 7, 2025

carns commented Jan 8, 2025

mdorier commented Jan 8, 2025

carns commented Jan 8, 2025

carns commented Jan 8, 2025

carns commented Jan 8, 2025

Using ULT stack to store RPC lineage #300

Are you sure you want to change the base?

Using ULT stack to store RPC lineage #300

Conversation

mdorier commented Dec 21, 2024

codecov bot commented Dec 21, 2024 • edited Loading

Codecov Report

carns Jan 7, 2025

Choose a reason for hiding this comment

carns commented Jan 7, 2025

carns commented Jan 8, 2025

mdorier commented Jan 8, 2025

carns commented Jan 8, 2025

carns commented Jan 8, 2025

carns commented Jan 8, 2025

codecov bot commented Dec 21, 2024 •

edited

Loading