The sample demo has some bugs. #401
Comments
Hi @MoFHeka, could you please help with it?
Any update on this? I am also facing this issue.
@sosixyz Could you please provide a minimal reproducible example?
@alykhantejani Most TFRA users run GPU sync training without PS, so few people are aware of this issue. See Line 267 in bbce3c7.
@sosixyz This could be caused by the ShadowVariable not being created on the right device. There may be a way to fix this bug. Here is the key code: Line 212 in bbce3c7.
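As a generic illustration of the device-placement idea only (not the actual TFRA fix; create_shadow_like_variable is a hypothetical helper), variable creation can be pinned with an explicit tf.device scope so the variable lands on the intended device:

```python
import tensorflow as tf

# Hypothetical sketch: the real fix would have to be applied where TFRA
# creates its ShadowVariable, not in user code.
def create_shadow_like_variable(name, shape, device="/CPU:0"):
    # Variables created inside a tf.device scope are placed on that device;
    # in a PS cluster the device string would name the intended job/task.
    with tf.device(device):
        return tf.Variable(tf.zeros(shape), name=name, trainable=True)

shadow = create_shadow_like_variable("shadow_demo", shape=(8, 16))
print(shadow.device)  # e.g. /job:localhost/replica:0/task:0/device:CPU:0
```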
Thanks for the response, I'll try to take a closer look here. If using sync training with GPUs, you would need GPUs with large on-device memory, correct? I thought for this reason the PS strategy would be more common.
I should note I'm using an in-mem cluster for testing with 2 PS and passing these device names to
@alykhantejani Don't worry about the memory: the DE all-to-all embedding layer shards the entire embedding across the different worker ranks. You can also use a CPU embedding table, but the DE HKV backend would be the best solution, since it can use both GPU memory and host memory for embedding storage. In most situations, 2T of host memory is enough.
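For context, a minimal sketch of that sharded all-to-all setup under Horovod; the HvdAllToAllEmbedding class and its embedding_size/devices parameters follow my reading of the TFRA demos, so check the exact names against the release you use:

```python
import tensorflow as tf
import horovod.tensorflow as hvd
import tensorflow_recommenders_addons.dynamic_embedding as de

hvd.init()  # one process per rank; each rank holds a shard of the table

# Assumed API sketch: every rank stores part of the dynamic embedding table,
# and lookups are exchanged all-to-all, so no single GPU holds the whole table.
embedding = de.keras.layers.HvdAllToAllEmbedding(
    embedding_size=16,
    key_dtype=tf.int64,
    value_dtype=tf.float32,
    initializer=tf.keras.initializers.RandomNormal(stddev=0.1),
    devices=["CPU"],  # keep this rank's shard in host memory
    name="sharded_user_embedding",
)

ids = tf.constant([[1], [42], [7]], dtype=tf.int64)
vectors = embedding(ids)  # shape: (3, 1, 16)
```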
Thank you for your reply! The reproducible code is copied from https://github.com/tensorflow/recommenders-addons/blob/master/demo/dynamic_embedding/movielens-1m-keras-ps/movielens-1m-keras-ps.py.
Thank you for your reply! I will try the sample demo later. I found in the Keras guidance that when using the
Yes, you're right. Usually tf.keras.utils.experimental.DatasetCreator is used to dispatch input data to the different workers. But this is a simple demo after all, so I got lazy. Could you please contribute a more complete demo if it's convenient for you?
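For reference, a minimal sketch of feeding model.fit through DatasetCreator under ParameterServerStrategy; here `strategy`, `build_model`, and the feature names are placeholders rather than code from the demo:

```python
import tensorflow as tf

def dataset_fn(input_context):
    # Each worker calls this to build its own dataset; per-worker sharding or
    # shuffling can be driven by input_context.
    batch_size = input_context.get_per_replica_batch_size(256)
    features = {"movie_id": tf.range(1000, dtype=tf.int64),
                "user_id": tf.range(1000, dtype=tf.int64)}
    labels = tf.random.uniform((1000,), maxval=5, dtype=tf.float32)
    ds = tf.data.Dataset.from_tensor_slices((features, labels))
    return ds.shuffle(1000).repeat().batch(batch_size).prefetch(tf.data.AUTOTUNE)

# `strategy` is assumed to be the ParameterServerStrategy built from the
# demo's cluster resolver; `build_model` stands in for the demo's model code.
with strategy.scope():
    model = build_model()
    model.compile(optimizer="adam", loss="mse")

model.fit(
    tf.keras.utils.experimental.DatasetCreator(dataset_fn),
    epochs=1,
    steps_per_epoch=100,  # required when the input is a DatasetCreator
)
```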
@MoFHeka Is there any example anywhere that does synchronous multi-worker training with large dynamic embeddings?
@alykhantejani
@MoFHeka I meant using TFRA specifically, especially using host memory rather than GPU memory (as GPU devices are expensive).
@alykhantejani If you want to place the embedding in host memory, set the parameter devices=["CPU"] when you create the embedding layer. If you want to use both host memory and device memory for the embedding, use HKV: pass the HKV creator when you assign the hash table backend. Here is an explanation of how sync distributed training works: #365
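A hedged sketch of the two options described above; BasicEmbedding, HkvHashTableCreator, HkvHashTableConfig and their fields reflect my reading of TFRA and may differ between versions, so treat the exact names as assumptions:

```python
import tensorflow as tf
import tensorflow_recommenders_addons.dynamic_embedding as de

# Option 1: place the dynamic embedding table entirely in host memory.
cpu_embedding = de.keras.layers.BasicEmbedding(
    embedding_size=16,
    devices=["CPU"],  # hash table lives on the CPU (host memory)
    name="cpu_only_embedding",
)

# Option 2 (assumed API): the HKV backend keeps hot values in GPU HBM and
# spills the rest to host memory. The config fields below are my best
# understanding and should be checked against the TFRA version in use.
hkv_embedding = de.keras.layers.BasicEmbedding(
    embedding_size=16,
    kv_creator=de.HkvHashTableCreator(
        config=de.HkvHashTableConfig(
            init_capacity=1024 * 1024,
            max_capacity=64 * 1024 * 1024,
            max_hbm_for_values=4 * 1024**3,  # bytes of GPU memory for values
        )
    ),
    name="hkv_embedding",
)
```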
@MoFHeka Hello, I tried to use tf.keras.utils.experimental.DatasetCreator, but a new bug arose. I only changed the data input function; when I use keras.Embedding the demo runs well. The demo is:
I met the same issue. Setting with_unique=False makes training work, but it doesn't seem correct. Have you solved this problem?
@kefault Sorry, I'll sort out PS support when I have time.
tensorflow: 2.8.0
tfra: 0.6.0
According to the file, I ran `sh start.sh`, and the bug showed up.