Use Multi GPU (MirroredStrategy) training with XLA and AMP (Mixed Precision) #20763
Comments
Hi @keshusharmamrt, thanks for reporting this. I ran your code with …
Hi @dhantule, thanks for your reply.
… gist. I have also tried to run this code on our own cluster, where we have two GPUs, and I see the same behaviour there (I tried this with the official TensorFlow image …).
Interestingly, it works when we have a single GPU 😄, on Colab as well as on our cluster. One more thing: I also tried the "JAX" backend for Keras, and it seems to work by following this tutorial. I was wondering if it is possible to use multi-GPU with XLA and AMP with the TensorFlow backend of Keras?
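Roughly what the working JAX-backend setup looks like — a minimal sketch, assuming Keras 3 with the `jax` backend and its GPU wheels installed (the tutorial link above is not reproduced here):

```python
import os

# The backend must be selected before Keras is imported.
os.environ["KERAS_BACKEND"] = "jax"

import keras

# AMP works the same way on the JAX backend.
keras.mixed_precision.set_global_policy("mixed_float16")

# Data-parallel training across all local JAX devices (GPUs);
# JAX compiles the train step with XLA itself, so no extra jit flag is needed here.
keras.distribution.set_distribution(keras.distribution.DataParallel())
```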
Please use the same version of … Then you could use it by …
Let me know if this works for you.
Hi @sampathweb, thanks for the reply.
Hi, I've encountered some issues while trying to perform multi-GPU training with XLA (Accelerated Linear Algebra) and AMP (Automatic Mixed Precision).
I'm reaching out to understand whether it's possible to use multi-GPU training with XLA and AMP together.
If so, I'd like guidance on which versions of TensorFlow and Keras I should use, or how to modify my code to make this work.
Background:
In earlier versions of TensorFlow (prior to 2.11), we were able to successfully train models using multiple GPUs with both XLA and AMP enabled. However, with TensorFlow versions after 2.11, I have not been able to run training with multi-GPU + XLA + AMP.
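For concreteness, the combination in question is roughly the following — a minimal sketch of the standard toggles, with model and data omitted (the full training code follows below under "Code Snippet"):

```python
import tensorflow as tf
import keras

# AMP: compute in float16, keep variables in float32.
keras.mixed_precision.set_global_policy("mixed_float16")

# XLA: either globally via the graph optimizer ...
tf.config.optimizer.set_jit(True)
# ... or per model via model.compile(..., jit_compile=True).

# Multi-GPU data parallelism on the TensorFlow backend.
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    pass  # build and compile the model inside this scope
```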
Issues Encountered with Different Versions:
I used tf-keras=2.15 for all of these tests.
1. tensorflow=2.17.1/2.16.2 and keras=3.8.0:
Error Message:
RuntimeError: Exception encountered when calling Cond.call() merge_call called while defining a new graph or a tf.function. This can often happen if the function fn passed to strategy.run() contains a nested @tf.function, and the nested @tf.function contains a synchronization point, such as aggregating gradients (e.g, optimizer.apply_gradients), or if the function fn uses a control flow statement which contains a synchronization point in the body. Such behaviors are not yet supported. Instead, please avoid nested tf.functions or control flow statements that may potentially cross a synchronization boundary, for example, wrap the fn passed to strategy.run or the entire strategy.run inside a tf.function or move the control flow out of fn. If you are subclassing a tf.keras.Model, please avoid decorating overridden methods test_step and train_step in tf.function
2. tensorflow=2.17.1/2.16.2 and keras=3.5.0:
Issue:
Training gets stuck after a few epochs and does not progress.
3. tensorflow=2.17.1/2.16.2 and keras=3.0.5:
Error Message:
UnimplementedError: We failed to lift variable creations out of this tf.function, so this tf.function cannot be run on XLA. A possible workaround is to move variable creation outside of the XLA compiled function.
4. tensorflow=2.18.0: also gives a similar error with keras versions 3.0, 3.5 and 3.6.
5. Using TF_USE_LEGACY_KERAS=1: training gets stuck after some time, similar to keras>3. I have tried this with various TensorFlow versions but hit the same hang. (See the sketch after this list for how the flag is set.)
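For case 5, a trivial sketch of how the legacy-Keras switch is set — it has to happen before TensorFlow is imported, with the tf-keras package installed:

```python
import os

# Must be set before importing tensorflow so that tf.keras
# resolves to the legacy Keras 2 (tf-keras) implementation.
os.environ["TF_USE_LEGACY_KERAS"] = "1"

import tensorflow as tf
print(tf.keras.__version__)  # expected to report a 2.x version
```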
Code Snippet:
Here's a simplified version of the code I'm using. This example is adapted from the Keras documentation on distributed training:
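A minimal sketch of such a setup, adapted from the Keras distributed-training guide — the model, data and hyperparameters below are placeholders for illustration, not the exact ones from the real run:

```python
import numpy as np
import tensorflow as tf
import keras

# AMP: float16 compute, float32 variables.
keras.mixed_precision.set_global_policy("mixed_float16")

# Replicate the model across all visible GPUs.
strategy = tf.distribute.MirroredStrategy()
print("Number of devices:", strategy.num_replicas_in_sync)

def get_dataset(batch_size=64):
    # Random stand-in data, purely for illustration.
    x = np.random.rand(1024, 784).astype("float32")
    y = np.random.randint(0, 10, size=(1024,))
    return tf.data.Dataset.from_tensor_slices((x, y)).batch(batch_size)

with strategy.scope():
    # Model and optimizer must be created inside the strategy scope.
    model = keras.Sequential([
        keras.Input(shape=(784,)),
        keras.layers.Dense(256, activation="relu"),
        keras.layers.Dense(256, activation="relu"),
        # Keep the final layer in float32 for numerical stability under AMP.
        keras.layers.Dense(10, activation="softmax", dtype="float32"),
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(1e-3),
        loss=keras.losses.SparseCategoricalCrossentropy(),
        metrics=["accuracy"],
        jit_compile=True,  # XLA-compile the train/test step.
    )

model.fit(get_dataset(), epochs=2)
```

On a single GPU this runs as expected; with two or more GPUs under MirroredStrategy it fails or hangs as described in the version-by-version list above.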
Can someone suggest what changes I should make to the code, or which versions of Keras and TensorFlow to use, to make training with multi-GPU + XLA + AMP work?
Or is it not possible to train using multi-GPU + XLA + AMP?