GCP Tutorial may not be using the GPU #52

Open
Sohojoe opened this issue Feb 14, 2019 · 8 comments

@Sohojoe

Sohojoe commented Feb 14, 2019

When following the GCP Tutorial, I see a TensorFlow warning that the installed TensorFlow build is not optimized for the CPU.

Given that the cloud instance does include the optimized version of TensorFlow, I wonder if installing obstacle-tower-env overrides it. If so, it may have installed the unoptimized, CPU-only TensorFlow, since ml-agents has the requirement 'tensorflow>=1.7,<1.8'.

Training seems slow: 56 steps per second, compared with 130 steps per second on my home PC:

[screenshot, 2019-02-13]

@awjuliani
Contributor

@ervteng Can you speak to this? We had validated internally that we were using GCP/GPU.

@ervteng
Contributor

ervteng commented Feb 19, 2019

Hi @Sohojoe, the obstacle-tower-env doesn't have a TensorFlow requirement as it doesn't install ml-agents. You can check the GPU usage with nvidia-smi. What type of GPU are you running locally?
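
For anyone who wants to log GPU usage during training rather than read it off the screen, here is a minimal sketch that wraps nvidia-smi. It relies on the `--query-gpu` and `--format=csv,noheader,nounits` options of nvidia-smi; the function names and dictionary keys are my own, and the parser assumes every field is numeric (some GPUs may report `[N/A]` for a field, which this sketch does not handle).

```python
import shutil
import subprocess

# Field names as listed by `nvidia-smi --help-query-gpu`.
QUERY = "utilization.gpu,memory.used,memory.total"


def parse_gpu_stats(csv_text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, used, total = (field.strip() for field in line.split(","))
        stats.append({
            "utilization_pct": int(util),
            "memory_used_mib": int(used),
            "memory_total_mib": int(total),
        })
    return stats


def query_gpu_stats():
    """Return GPU stats, or None when nvidia-smi is not on the PATH."""
    if shutil.which("nvidia-smi") is None:
        return None
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_gpu_stats(out)


if __name__ == "__main__":
    print(query_gpu_stats())
```

Calling `query_gpu_stats()` every few seconds while training runs shows whether utilization stays low even though the GPU is in use.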

@Sohojoe
Author

Sohojoe commented Feb 20, 2019

@ervteng - I have a GTX 1080 locally. How many training steps per second do you see?

Running nvidia-smi shows that it is using the GPU, so I was wrong:

[image]

I guess the default TensorFlow build does not include CPU optimizations, and that is why it shows the warning:

[image]

@kwea123

kwea123 commented Feb 20, 2019

@ervteng what do you mean by "the obstacle-tower-env doesn't have a TensorFlow requirement as it doesn't install ml-agents"? Then what does this mean in the README?

Requirements
The Obstacle Tower environment runs on Mac OS X, Windows, or Linux.

Python dependencies (also in setup.py):

Unity ML-Agents v0.6
OpenAI Gym
Pillow

Also, I remember that my TensorFlow version was overwritten with 1.7.1 when I ran `pip install -e .` from this repo. I then re-installed 1.9.0 and had no problem running the Obstacle Tower environment...
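
A quick way to see whether an install step swapped the TensorFlow package is to list what pip has installed before and after. The sketch below is only an illustration: it assumes Python 3.8+ for `importlib.metadata`, it assumes the CPU and GPU builds are published as the separate distributions `tensorflow` and `tensorflow-gpu` (as they were around the time of this thread), and the helper names are hypothetical. `in_range` mirrors the 'tensorflow>=1.7,<1.8' pin quoted earlier in the thread by comparing major and minor version only.

```python
from importlib import metadata  # Python 3.8+


def installed_versions(names=("tensorflow", "tensorflow-gpu")):
    """Map each distribution name to its installed version, or None."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def in_range(version, lower=(1, 7), upper=(1, 8)):
    """True when lower <= version < upper, comparing (major, minor) only."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return lower <= (major, minor) < upper


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(name, version)
```

With these helpers, `in_range("1.7.1")` is true and `in_range("1.9.0")` is false, which matches the two versions mentioned above.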

@Sohojoe
Author

Sohojoe commented Feb 20, 2019

@kwea123 Obstacle Tower installs a special version of ml-agents that doesn't specify TensorFlow in its install requirements file.

Obstacle Tower does need TensorFlow to run.

The normal ml-agents specifies TensorFlow 1.7.x, as this is required for running the trained models from within Unity. Obstacle Tower doesn't need this.

@kwea123

kwea123 commented Feb 20, 2019

@Sohojoe Oh, I see. Sorry for the misunderstanding @ervteng

@ervteng
Contributor

ervteng commented Feb 20, 2019

@Sohojoe you are correct, the README is wrong (and we'll fix it). The newest versions of OTC no longer use ML-Agents in its entirety and don't require TensorFlow. Dopamine does require TensorFlow, but as far as I know it will work with most recent versions.

I'm getting about 45.61 steps per second on a T4 on GCP, but it's using only about 10% of the GPU. In our past testing, we found that the OTC environment tends to be CPU-bound. What CPU do you have on your desktop machine? I'm curious to see how we can get the environments training faster.
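
To compare numbers like the ones quoted in this thread on equal terms, it helps to time the environment step alone, without the learner. Here is a minimal sketch; `steps_per_second` is a hypothetical helper, and `step_fn` is whatever callable advances your environment by one step (how to build it for a given environment is left as an assumption).

```python
import time


def steps_per_second(step_fn, n_steps=200):
    """Call step_fn n_steps times and return the measured throughput."""
    start = time.perf_counter()
    for _ in range(n_steps):
        step_fn()
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very coarse timers.
    return n_steps / max(elapsed, 1e-9)


if __name__ == "__main__":
    # Stand-in workload; replace with a real environment step.
    print(steps_per_second(lambda: sum(range(1000))))
```

If the environment-only rate is close to the training rate, the simulator rather than the GPU is the likely bottleneck, which would be consistent with the CPU-bound behaviour described above.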

@Sohojoe
Author

Sohojoe commented Feb 21, 2019

I have an i7-8700K @ 3.7GHz, which has 6 cores / 12 threads.

A big help to performance would be to support multiple instances of the environment within the Unity level. I regularly train with 128 concurrent agents and I'm reading some papers where they go up to 2048. I made a modification to ml-agents in my dev branch of marathon-envs which enables one to set --num-agents=128 in the command line to specify the number of agents. I would be happy to work on a PR. But, it does require the environment to work relative to its spawn position.

I have also been working on adapting large-scale-curiosity to work with Obstacle Tower, as it supports instancing via MPI. I have been able to get it training on Windows at 400-500 fps, but it is not learning yet. Also, MPI on Windows is not very stable and I've only been able to get 16-24 instances running (but this should not be a problem on Linux servers). My code is here

[image]
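
As a rough illustration of why running several environment instances raises aggregate throughput, here is a sketch that steps a few independent instances concurrently and reports the combined rate. It is not the MPI setup or the in-Unity agent instancing described above: it uses Python threads, and `StandInEnv` is a hypothetical placeholder whose `step()` only sleeps to imitate waiting on a simulator. All names here are assumptions made for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class StandInEnv:
    """Hypothetical placeholder for a real environment; step() just sleeps."""

    def __init__(self, worker_id, step_seconds=0.001):
        self.worker_id = worker_id
        self.step_seconds = step_seconds

    def step(self):
        time.sleep(self.step_seconds)  # imitates waiting on the simulator


def run_worker(worker_id, n_steps):
    """Step one environment instance n_steps times; return the step count."""
    env = StandInEnv(worker_id)
    for _ in range(n_steps):
        env.step()
    return n_steps


def aggregate_throughput(n_envs, n_steps):
    """Run n_envs workers concurrently; return (total steps, steps/second)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_envs) as pool:
        totals = list(pool.map(run_worker, range(n_envs), [n_steps] * n_envs))
    elapsed = time.perf_counter() - start
    return sum(totals), sum(totals) / max(elapsed, 1e-9)


if __name__ == "__main__":
    for n in (1, 4, 16):
        total, rate = aggregate_throughput(n, n_steps=50)
        print(n, "envs:", total, "steps,", round(rate), "steps/s")
```

With a real simulator each instance is a separate process competing for CPU cores, so scaling will flatten much sooner than it does with this sleep-based stand-in.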

4 participants