
AMD: parse the architecture as supplied by gcnArchName #11244

Open · Haus1 wants to merge 1 commit into master from amd-rework-version

Conversation

@Haus1 commented Jan 14, 2025

The value provided by minor is truncated for AMD, so parse the value returned by gcnArchName for an accurate ID.

We can also use the common value for GCN4, gfx800, to avoid missing compatible devices.

This is a follow-up to #11209 and will change the behavior of CDNA3, CDNA, VEGA and GCN4 as they should now be recognized as expected. Of those I only have access to a GCN4 device for testing.
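
For reference, the string in question comes from the HIP device properties. A minimal sketch (not the PR's actual code) of how the two values compare on an AMD device:

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    hipDeviceProp_t prop;
    if (hipGetDeviceProperties(&prop, 0) != hipSuccess) {
        return 1;
    }
    // major/minor lose the stepping (e.g. gfx906 shows up as 9.0), while
    // gcnArchName carries the full target ID, e.g. "gfx906:sramecc+:xnack-".
    printf("major.minor = %d.%d\n", prop.major, prop.minor);
    printf("gcnArchName = %s\n", prop.gcnArchName);
    return 0;
}
```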

@github-actions bot added the Nvidia GPU (issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library) labels on Jan 14, 2025
@JohannesGaessler (Collaborator) commented

I don't know at all whether this is the correct way to do it. @IMbackK, your input would be appreciated.

@IMbackK (Collaborator) commented Jan 16, 2025

Yes, this is more correct; the current code misses the arch stepping part.
But I have never seen a device report gfx800: rocblas only supports and checks for gfx803, which is reported by Fiji and all Polaris variants, and the only other variant I am aware of is gfx802, which is not supported by rocblas (or any ROCm component).

Thus NAK on the change to the gfx8 define.
I will try this PR out on my devices (I have access to gfx803, gfx900, gfx906, gfx908 and gfx1030).

@IMbackK (Collaborator) commented Jan 16, 2025

There's also a snag in this PR regarding gfx90a: it reports 9.1 as major.minor, but its gcnArchName is gfx90a, which this PR won't parse correctly. The same goes for others like gfx90c.

So the current code is not correct, but this PR has too many issues to serve as an improvement as is.
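
The snag comes from the stepping field being hexadecimal: gfx90a is major 9, minor 0, stepping 0xa. A hedged sketch of a parser that accounts for this (illustrative only; parse_gfx_version is a hypothetical helper, not the PR's code):

```cpp
#include <cctype>
#include <cstdlib>
#include <cstring>
#include <string>

// Parse "gfx" names where the last two characters are hex digits for minor
// and stepping, e.g. "gfx90a" -> 9/0/0xa, "gfx1030" -> 10/3/0.
static bool parse_gfx_version(const char * name, int & major, int & minor, int & step) {
    if (strncmp(name, "gfx", 3) != 0) {
        return false;
    }
    std::string body(name + 3);
    body = body.substr(0, body.find(':')); // drop target-ID features like ":xnack-"
    if (body.size() < 3) {
        return false;
    }
    auto hex = [](char c) -> int {
        if (isdigit((unsigned char) c)) return c - '0';
        if (c >= 'a' && c <= 'f')       return c - 'a' + 10;
        return -1; // also rejects generic targets such as "gfx10-3-generic"
    };
    minor = hex(body[body.size() - 2]);
    step  = hex(body[body.size() - 1]);
    major = atoi(body.substr(0, body.size() - 2).c_str());
    return major > 0 && minor >= 0 && step >= 0;
}
```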

@Haus1 (Author) commented Jan 16, 2025

It appears this returns the full target ID as defined in https://github.com/ROCm/clr/blob/amd-staging/rocclr/device/device.cpp around line 125. This will need to be expanded upon in order to parse out the xnack status and to handle the addition of generic targets.

If it were possible to retrieve the version stepping directly, that would be preferable to parsing it out of a string. Would the xnack status be of any use here, or can it just be ignored?
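
For illustration, the target ID referenced above is the processor name followed by colon-separated feature flags, e.g. "gfx90a:sramecc+:xnack-". A small sketch that separates the two (the names here are hypothetical):

```cpp
#include <sstream>
#include <string>
#include <vector>

struct target_id {
    std::string              processor; // e.g. "gfx90a"
    std::vector<std::string> features;  // e.g. {"sramecc+", "xnack-"}
};

static target_id split_target_id(const std::string & id) {
    target_id out;
    std::stringstream ss(id);
    std::getline(ss, out.processor, ':');
    for (std::string f; std::getline(ss, f, ':');) {
        out.features.push_back(f); // xnack among them; see the reply below
    }
    return out;
}
```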

@IMbackK (Collaborator) commented Jan 16, 2025

xnack can be ignored since we don't use hipMallocManaged-allocated memory. Short of the user recompiling the whole ROCm stack with non-default flags, only gfx942 and gfx90a can end up in xnack+ mode.
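
To make the distinction concrete: xnack governs demand paging for migratable allocations, which the HIP backend here does not use. An illustrative contrast, not ggml's actual allocator:

```cpp
#include <hip/hip_runtime.h>
#include <cstddef>

void * alloc_device(size_t nbytes) {
    void * ptr = nullptr;
    // ggml-style path: ordinary device memory, unaffected by xnack mode.
    if (hipMalloc(&ptr, nbytes) != hipSuccess) {
        return nullptr;
    }
    // By contrast, hipMallocManaged(&ptr, nbytes) returns memory that can
    // migrate between host and device on page faults, and that migration
    // is what xnack+ enables on supported GPUs.
    return ptr;
}
```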

@Haus1 (Author) commented Jan 16, 2025

Yeah, they certainly don't make enabling xnack easy. On Linux the kernel module also needs to be patched to prevent it from rejecting the device.

@Haus1 force-pushed the amd-rework-version branch from 7bd1195 to 468296f on January 17, 2025
@github-actions bot added the script (script related), testing (everything test related), and python (python script changes) labels on Jan 17, 2025
@Haus1 force-pushed the amd-rework-version branch from 468296f to 9620bce on January 17, 2025
@Haus1 (Author) commented Jan 17, 2025

This will now work with all the IDs AMD has in staging and will gracefully fall back to the old way if it fails. Please let me know if I've missed anything.

Would it be better to submit backend changes like this to ggml first?

@Haus1 force-pushed the amd-rework-version branch from 9620bce to f77ea24 on January 18, 2025
@IMbackK (Collaborator) commented Jan 22, 2025

This is more correct, but still NAK on the gfx800 change. No gfx800 ever existed; gfx803 has instructions that gfx802 does not have, so we can't simply treat all gfx8 devices as gfx800.

LLVM suggests that at some point gfx802 was considered gfx800, but gfx803 was always a distinct ISA: https://github.com/ROCm/llvm-project/blob/656552edc693e2bb4abc9258399c39d190fce2b3/amd/comgr/src/comgr-metadata.cpp#L469

It's also inconsistent: various sub-variants exist that are ISA-compatible. All gfx103x versions share the same ISA and can execute each other's code, and the same goes for gfx900 and gfx90c; either you also fold all of these, or you unfold gfx8. At the moment nothing checks for gfx8 anyhow; any code path taken for gfx8 is taken because it's < gfx900, so this folding doesn't do anything for you.

I am also rethinking whether this approach is the right one to take; it's overly complicated. On ROCm the target architecture is always known at compile time to client code anyhow, so parsing the ISA at runtime only makes sense for the ROCm runtime itself.

I thus suggest that instead of this complicated solution we simply set the cc to the lowest one of each group of GPUs in ggml/src/ggml-cuda/vendors/hip.h (i.e. GCN/CDNA/RDNA1/2/3), and when we find the need to add an optimization that requires this to be subdivided, simply do so.

Edit: I was forgetting that we also use cc in host code here, so this PR is indeed the best solution, aside from the gfx8 inconsistency.

@JohannesGaessler (Collaborator) commented

I assume the Python script and the C file with the duplicated code were for testing and are intended to be removed prior to merging. If this is done, I am willing to approve this PR (after checking on my own machine that it works).

@IMbackK (Collaborator) commented Jan 22, 2025

The Python script needs to go somewhere, since we will have to regenerate the table fairly often as AMD releases new devices.

I tested this PR on gfx908 and gfx1030 and it worked fine there.

@JohannesGaessler (Collaborator) commented

If these files are needed long term, please add short blurbs that explain why they are needed and how to use them.

@Haus1 (Author) commented Jan 26, 2025

Thanks for the feedback. Are the datasheets still publicly available for these architectures? I double-checked against the compiler's feature and product lists while trying to understand the concerns about gfx8, but that only showed up as an issue for gfx9, so the exercise seemed fruitless. I'll remove it nevertheless, as it can always be revisited later.

The Python script isn't necessary and was just a convenience I thought others might find helpful. AMD violates the guidelines included where these values are defined, so the only way to verify this works is to test all 100+ variations.

@Haus1 force-pushed the amd-rework-version branch 2 times, most recently from c3f41aa to bc59003 on January 26, 2025
Commit message: The value provided by minor doesn't include stepping for AMD; parse the value returned by gcnArchName instead to retrieve an accurate ID.
@IMbackK (Collaborator) left a comment

@Haus1 Thank you, this looks good now and works well on my machines (gfx908, gfx1030).

The gfx8 folding will always be unnecessary, as we will simply check for < 900, just like we have to check >= 1030 && < 1100 to hit all the gfx103x variants, which are all ISA-identical.
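
As a sketch, the range checks in question (over raw gfx version numbers, ignoring any vendor offset ggml may apply to cc internally):

```cpp
// Illustrative predicates; the exact constants in ggml may differ.
static bool is_gfx8_or_older(int cc) { return cc < 900; }
static bool is_gfx103x(int cc)       { return cc >= 1030 && cc < 1100; } // all ISA-identical
```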

It would be good to have the test case that tests against the generated table, and the means to regenerate the table when AMD changes it, but this doesn't have to land right now.

Therefore I ACK this as is.
