Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using ULT stack to store RPC lineage #300

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open

Conversation

mdorier
Copy link
Contributor

@mdorier mdorier commented Dec 21, 2024

@carns This PR is an alternative to disabling ABT_keys for lineage tracking. It tries to solve the performance problem by using the end of a ULT's stack to write the current RPC ID. I'm also writing a magic number before it so we can check if the RPC ID has been set (rather than getting garbage), and these values are wiped at the end of the ULT so if the stack is reused, we don't risk having valid information there.

While we rarely use lineage information for RPCs themselves (i.e. knowing the parent of an RPC), I use lineage information all the time when it comes to tracking performance of bulk transfers, so I'd like to try fixing the performance problem rather than making lineage tracking optional and off by default. Could you give this PR a try with hpctoolkit?

Copy link

codecov bot commented Dec 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 57.97%. Comparing base (ce62b9e) to head (093a203).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #300      +/-   ##
==========================================
+ Coverage   57.93%   57.97%   +0.04%     
==========================================
  Files          70       71       +1     
  Lines       10196    10202       +6     
  Branches     1341     1339       -2     
==========================================
+ Hits         5907     5915       +8     
+ Misses       3453     3451       -2     
  Partials      836      836              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

src/margo-rpc-lineage.h Dismissed Show dismissed Hide dismissed
src/margo-rpc-lineage.h Fixed Show fixed Hide fixed

static inline int margo_lineage_set(hg_id_t current_rpc_id) {
__MARGO_LINEAGE_COMMON;
memcpy(stackend - 8 - sizeof(hg_id_t), magic, 8);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor stylistic comment. It looks odd for one sizeof() to use the type and the other sizeof() to use the variable (visually it makes you think that they are different values). Maybe just use the same style all the way through? I usually prefer the latter but the former might make more sense here since erase() doesn't have a variable name to use.

@carns
Copy link
Member

carns commented Jan 7, 2025

Nifty solution to stash the information in the stack :) I would not have thought of that.

I would have a mild preference for fixing the key functions in ABT if it were straightforward (since they are intuitive for this use case), but this is probably the more practical performance solution. I'll run some tests with a profiler and report back.

@carns
Copy link
Member

carns commented Jan 8, 2025

Unfortunately, based on an initial hpctoolkit profile this looks like it may actually be more costly than the ABT key based implementation:

image

This is part of the "bottom up" view from hpcviewer. It shows ABT_thread_get_attr() consuming 3.7% CPU as a result of calls from lineage_set() and lineage_erase(), almost all attributed to posix_memalign() calls triggered by ABT.

Interestingly I think I'm getting more detail on my current hpctoolkit runs than I was before. I'm going to try a clean build without this PR and see if I can find more information about costs in the original implementation. If it indicates something straightforward in the key path that we could improve then maybe we consider that, otherwise this may just be an overhead that we need to pay for now and move on to some higher-value issues.

@mdorier
Copy link
Contributor Author

mdorier commented Jan 8, 2025

It's the ABTI_thread_attr_dup that's causing an allocation. We shouldn't need to duplicate the thread attribute and free it later, Argobots should give us an ABT_thread_get_stack_info(ABT_thread, void** address, size_t* size); to just get the info we need.

@carns
Copy link
Member

carns commented Jan 8, 2025

FWIW, I am not able to resolve anything further below ABT_key_set() (as in the ABT_thread_get_attr() call above). In this example it happened to only show up as 1.5% CPU time but I think more typically I have seen 2.5% in the past:

image

So maybe false alarm about seeing more detail in hpctoolkit this time. If the get_attr() is explicitly using posix_memalign() rather than a simple malloc that could be the culprit in the previous path; I don't know if it has a specific reason for creating aligned memory.

@carns
Copy link
Member

carns commented Jan 8, 2025

Hrm. The default key table size is tunable in Argobots via ABT_KEY_TABLE_SIZE and will have an impact on the initial table allocation (which must be done on the first key_set() call for a work unit if it doesn't yet exist, if I'm reading the code correctly in https://github.com/pmodels/argobots/blob/main/src/include/abti_key.h#L225 ). This parameter has a complicated impact on how memory is allocated both at table creation time and key creation time. I'll do a quick check to see what this parameter does.

@carns
Copy link
Member

carns commented Jan 8, 2025

Setting ABT_KEY_TABLE_SIZE=0 seems to offer a minor improvement (down to 1.1% CPU) but the results are noisy. I think I'd like to pause this for now and come back to it later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants