GCP Tutorial may not be using the GPU #52
Comments
@ervteng Can you speak to this? We had validated internally that we were using the GPU on GCP.
Hi @Sohojoe, the obstacle-tower-env doesn't have a TensorFlow requirement, as it doesn't install ml-agents. You can check the GPU usage on the instance.
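For reference, one way to confirm from Python that the GPU build of TensorFlow is actually being used (a minimal sketch, not a command from this thread; watching `nvidia-smi` on the instance shows live utilization):

```python
# Minimal sketch: list the devices TensorFlow 1.x can see.
# A "/device:GPU:0" entry should appear when the GPU build and driver are working.
import tensorflow as tf
from tensorflow.python.client import device_lib

print(device_lib.list_local_devices())
print("GPU available:", tf.test.is_gpu_available())
```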
@ervteng - I have a GTX 1080 locally. How many training steps per second do you see when running it? I guess the default TensorFlow does not include CPU optimizations, and that is why it shows the warning.
@ervteng what do you mean by that? Then what does this mean in the README?
Also, I remember that my TensorFlow version was overwritten with 1.7.1 when running the install.
@kwea123 Obstacle Tower installs a special version of ml-agents that doesn't specify TensorFlow in its install requirements file; Obstacle Tower does need TensorFlow to run. The normal ml-agents specifies TensorFlow 1.7.x, as this is required for running trained models from within Unity, which Obstacle Tower doesn't need.
@Sohojoe you are correct, the README is wrong (and we'll fix it). The newest versions of OTC no longer use ML-Agents in its entirety and don't require TensorFlow. Dopamine does require TensorFlow, but as far as I know it will work with most recent versions. I'm getting about 45.61 steps per second on a T4 on GCP, but it's using only about 10% of the GPU. In our past testing, we found that the OTC environment tends to be CPU-bound. What CPU do you have on your desktop machine? I'm curious to see how we can get the environments training faster.
I have an i7-8700K @ 3.7GHz, which has 6 cores / 12 threads.

A big help to performance would be to support multiple instances of the environment within the Unity level. I regularly train with 128 concurrent agents, and I'm reading some papers where they go up to 2048. I made a modification to ml-agents in my dev branch of marathon-envs which enables one to set --num-agents=128 on the command line to specify the number of agents. I would be happy to work on a PR, but it does require the environment to work relative to its spawn position.

I have also been working on adapting large-scale-curiosity to work with Obstacle Tower, as it supports instancing via MPI. I have been able to get it training on Windows at 400-500 fps, but it is not learning yet. Also, MPI on Windows is not very stable and I've only been able to get 16-24 instances running (this should not be a problem on Linux servers). My code is here.
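Multiple environment processes can already be run side by side from Python (a minimal sketch, assuming the worker_id and retro arguments of ObstacleTowerEnv; this is not the --num-agents modification described above):

```python
# Minimal sketch: run several Obstacle Tower processes in parallel by giving
# each one a distinct worker_id (each worker_id uses its own communication port).
from obstacle_tower_env import ObstacleTowerEnv

NUM_ENVS = 4  # illustrative; far fewer than the 128 concurrent agents mentioned above

envs = [
    ObstacleTowerEnv('./ObstacleTower/obstacletower', worker_id=i, retro=True)
    for i in range(NUM_ENVS)
]

obs = [env.reset() for env in envs]
# ... drive the environments from a vectorized training loop ...
for env in envs:
    env.close()
```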
When following the GCP Tutorial, I see a TensorFlow warning that the installed version of TensorFlow is not optimized for the CPU.
Given that the cloud instance does include the optimized version of TensorFlow, I wonder if installing obstacle-tower-env overrides it. If so, it may have installed the unoptimized, CPU-only TensorFlow, as ml-agents has the requirement
'tensorflow>=1.7,<1.8'
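A quick way to confirm which TensorFlow build ended up installed (a sketch, not part of the original report):

```python
# Minimal sketch: check the installed TensorFlow version and whether it was
# built with CUDA support (the CPU-only wheel reports False here).
import tensorflow as tf

print(tf.__version__)                # e.g. 1.7.1 if the pinned dependency won
print(tf.test.is_built_with_cuda())  # False suggests the CPU-only package
```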
The training speed also seems slow: about 56 steps per second, compared with 130 steps per second on my home PC.
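To separate environment cost from training cost, raw environment throughput can be timed on its own (a minimal sketch, assuming the retro Gym interface of ObstacleTowerEnv):

```python
# Minimal sketch: time raw environment steps per second with random actions,
# independent of any learning code.
import time
from obstacle_tower_env import ObstacleTowerEnv

env = ObstacleTowerEnv('./ObstacleTower/obstacletower', retro=True)
env.reset()

n_steps = 1000
start = time.time()
for _ in range(n_steps):
    obs, reward, done, info = env.step(env.action_space.sample())
    if done:
        env.reset()
print("%.1f env steps per second" % (n_steps / (time.time() - start)))
env.close()
```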