use MPS and explicitly disable autocast & GradScaler for non-CUDA #654
Conversation
Hmm, I don't know who is best suited to review this PR or who else is interested in running open_clip on M1/M2 Macs for that matter 🤔 @gabrielilharco Could you take a look?
Sorry, I don't have access to that kind of hardware so can't test it myself @EIFY
@gabrielilharco Do you know if any of the owners do? If not, can I get an external M1/M2 Mac user to endorse instead?
So, supporting mps and other non-cuda/cpu devices is a worthwhile goal, but I'm not sure 'this' is the best approach. For autocast, should we rely on the amp (precision) arg to determine whether or not to try to use autocast? If autocast is used with mps it should crash instead of falling back (in my opinion), so that it's more clear it doesn't work. For the initialization of device, it's probably better to explicitly pass a device str to the fn that will be sensibly merged with the distributed env. For mps, distributed doesn't make sense, but I wouldn't say we want to default to mps if mps is available? It's a tossup on M1 whether you want to use it vs. the CPU, so we should likely err towards being explicit rather than implicit here...
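A minimal sketch of the "be explicit" suggestion above, assuming a hypothetical `resolve_device` helper (the name and signature are illustrative, not open_clip's actual API):

```python
import torch

def resolve_device(device: str, distributed: bool = False) -> torch.device:
    # Validate an explicitly requested device string instead of silently
    # defaulting to MPS whenever it happens to be available.
    if device.startswith("mps"):
        if distributed:
            raise ValueError("distributed training is not supported on MPS")
        if not torch.backends.mps.is_available():
            raise ValueError("MPS was requested but is not available in this PyTorch build")
    return torch.device(device)
```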
@rwightman Falling back is the current behavior for both autocast & grad_scaler, see the two warnings PyTorch emits in that case. But I actually agree: I would rather the training code crash when autocast is requested on a device that doesn't support it, so that the failure is explicit. Similarly for the current handling of grad_scaler.
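For example, a guard along these lines would fail loudly instead of warning (a sketch only; `check_amp_support` is a hypothetical helper, and the messages assume open_clip's `--precision` flag):

```python
def check_amp_support(device_type: str, precision: str) -> None:
    # Crash up front when AMP is requested on a device that can't use the
    # CUDA autocast/GradScaler path, rather than silently training in fp32.
    if precision.startswith("amp") and device_type != "cuda":
        raise ValueError(
            f"--precision {precision} is not supported on device '{device_type}'; "
            "use --precision fp32 (or run on CUDA) instead."
        )
```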
`device = "mps"` makes both training and inference ~ an order of magnitude faster on newer Macs with torch==2.2.0.dev20231002:

Training: MPS vs. CPU, RN50 model (username redacted)
(benchmark screenshot)

Inference:
(benchmark screenshot)
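For reference, this is the kind of opt-in the numbers above correspond to (illustrative only; the review discussion favors passing the device explicitly rather than auto-detecting it):

```python
import torch

# Use the Metal (MPS) backend when available, otherwise fall back to CPU.
device = "mps" if torch.backends.mps.is_available() else "cpu"

model = torch.nn.Linear(512, 512).to(device)
x = torch.randn(8, 512, device=device)
with torch.no_grad():
    y = model(x)
print(y.device)  # e.g. mps:0 on an Apple-silicon Mac with a recent PyTorch build
```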
This PR also includes explicit handling of autocast & GradScaler for non-CUDA devices. Currently open_clip is hardcoded to use the CUDA versions (`torch.cuda.amp.autocast` and `torch.cuda.amp.GradScaler` respectively), which get disabled with warnings. With this PR we issue our own warnings and explain the rationales & consequences. I also tried `torch.cpu.amp.autocast`, but training with it failed with the stated attn_mask & query dtype mismatch.
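Roughly, the gating could look like the sketch below (`make_amp_tools` and its warning text are illustrative, not the exact code in this PR):

```python
import warnings
from contextlib import nullcontext

import torch
from torch.cuda.amp import GradScaler

def make_amp_tools(device_type: str, precision: str):
    # Return an autocast context factory and a GradScaler, enabling them only
    # when AMP was requested and the device supports the CUDA AMP path;
    # otherwise warn and fall back to plain fp32.
    use_amp = precision == "amp"
    if use_amp and device_type != "cuda":
        warnings.warn(
            f"AMP autocast/GradScaler are disabled on '{device_type}'; training will run in fp32."
        )
        use_amp = False
    autocast = (lambda: torch.autocast(device_type="cuda")) if use_amp else nullcontext
    scaler = GradScaler(enabled=use_amp)
    return autocast, scaler

# Usage sketch:
#   autocast, scaler = make_amp_tools(device.type, precision)
#   with autocast():
#       loss = ...
#   scaler.scale(loss).backward()
```

`GradScaler(enabled=False)` is a supported no-op mode, so the training loop can still call `scaler.scale(...)` / `scaler.step(...)` / `scaler.update()` unconditionally.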