NOTE: Currently the simulator makes a few assumptions:
- Homogeneous cluster setup
- The volume of model-gradient transfer is assumed to equal the model size saved in checkpoints (`model_factory`)
- Parameter Server / Worker frameworks (all-reduce is not yet implemented)
- Synchronous SGD
Execution

- Before the execution, what's needed?
  Job trace: the job trace to simulate. For each job, the simulator needs the following information (a hypothetical example trace is sketched after this list):
  - `job_id`: for tracking
  - `num_gpu`: GPU requirement
  - `submit_time`: when the job is submitted. The simulator is event-based and discrete-time, so time values start from 0 and are in seconds.
  - `iterations`: the number of training iterations. Used in the network-cost calculation for data-parallel jobs.
  - `model_name`: the model trained by that job. This is used to estimate GPU memory usage and network costs.
  - `duration`: how long the job will run. This is used by the simulator to generate the job-completion event.
  - `interval`: the submission interval from this job to the next job
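The exact trace-file format expected by the simulator's parser is not spelled out above, so the following is only a hypothetical sketch that writes the fields listed there into a CSV; adjust the column names and order to match the actual trace loader.

```python
# Hypothetical sketch: build a minimal job trace as CSV using the fields above.
# The real parser in the simulator may expect different column names/ordering.
import csv

jobs = [
    # job_id, num_gpu, submit_time, iterations, model_name, duration, interval
    {"job_id": 1, "num_gpu": 4, "submit_time": 0,   "iterations": 1000,
     "model_name": "resnet50", "duration": 3600, "interval": 120},
    {"job_id": 2, "num_gpu": 8, "submit_time": 120, "iterations": 5000,
     "model_name": "vgg16",    "duration": 7200, "interval": 300},
]

with open("example_trace.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(jobs[0].keys()))
    writer.writeheader()
    writer.writerows(jobs)
```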
- How to run the simulator?

  A simple example of the execution command is:

  `python execute.py`
  Inside the execute file, the following options are necessary (a fuller example command is sketched after these lists):
  - `--cluster_spec`: infrastructure spec file
  - `--trace_file`: job trace
  - `--scheme`: placement scheme
  - `--schedule`: scheduler

  Optional inputs:
  - `--print`: print debug information
  - `--log_path`: the output path of the logs (cluster, job). The default is a time-stamped folder under the current path.
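For instance, a complete invocation might look like the sketch below. The input file names are placeholders, and the `--flag value` argument form is an assumption; check `execute.py` for the exact syntax it expects (e.g., `--flag=value`).

```python
# Hypothetical driver: invoke the simulator with the documented flags.
# cluster_spec.csv and trace.csv are placeholder file names.
import subprocess

subprocess.run([
    "python", "execute.py",
    "--cluster_spec", "cluster_spec.csv",
    "--trace_file", "trace.csv",
    "--scheme", "yarn",
    "--schedule", "fifo",
    "--log_path", "./results",
], check=True)
```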
- What are the placement and scheduling algorithms provided?

  Placement:
  - `yarn`: get GPUs from the same server nodes under the same switch
  Scheduling:
  - `fifo`
  - `sjf`: smallest-job-first, in terms of GPU requirement

  TODO below:
  - `lpjf`: longest-pending-job-first
  - `shorest`: shortest-remaining-time job first
  - `shorest-gpu`: shortest-remaining-GPU-time job first
  - `dlas`: discretized LAS (time-based only)

    In `jobs.py`, you need to specify `num_queue` and `queue_limit` for `MLFQ` (also for `dlas-gpu` and `gittins`); a minimal sketch of how such thresholds drive queue selection appears after this list. For example:
```python
# Example 1: there are two queues, and the threshold for Q1 is 3600 seconds
self.queue_limit = [3600]

# Example 2: there are four queues, and the thresholds are 3600, 7200, and 18000 seconds
self.queue_limit = [3600, 7200, 18000]
```
  - `dlas-gpu`: discretized LAS (GPU-time-based)
  - `gittins`: discretized Gittins index (GPU-time-based)
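The following is a minimal, self-contained illustration of how `queue_limit`-style thresholds are commonly used to pick a job's queue from its attained service in a discretized-LAS / MLFQ scheme. It is not the simulator's implementation; the function name `select_queue` and the notion of "attained service" (running time, or running time multiplied by `num_gpu` for the GPU-time-based variants) are assumptions made for illustration only.

```python
# Illustrative sketch (not the simulator's code): map a job's attained
# (GPU-)time onto MLFQ queues using queue_limit-style thresholds.

def select_queue(attained_service, queue_limit):
    """Return the queue index for a job given its attained (GPU-)time.

    queue_limit holds the upper thresholds of all queues except the last,
    e.g. [3600, 7200, 18000] defines four queues.
    """
    for i, limit in enumerate(queue_limit):
        if attained_service < limit:
            return i
    return len(queue_limit)  # the last (lowest-priority) queue is unbounded


queue_limit = [3600, 7200, 18000]        # four queues
print(select_queue(600, queue_limit))    # -> 0 (highest priority)
print(select_queue(5000, queue_limit))   # -> 1
print(select_queue(90000, queue_limit))  # -> 3 (lowest priority)
```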
- What's the output?

  Based on `--log_path`, all the output files are placed in that folder (e.g., `result-20190210-12-20-37`), including:
  - `cluster.csv`: cluster-level resource utilization info at each event point
  - `jobs.csv`: the job execution information

  The output logs are defined in `log.py`; you can modify that file to adjust the output information. A small post-processing sketch is given below.
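For convenience, the per-run CSV logs can be inspected with pandas. This is only a hypothetical post-processing sketch: the folder name is the example from above, and the CSV columns are not documented here, so they are printed rather than assumed.

```python
# Hypothetical post-processing sketch: load the logs written under --log_path.
# Nothing is assumed beyond the file names cluster.csv and jobs.csv above.
import os
import pandas as pd

log_path = "result-20190210-12-20-37"  # replace with your --log_path folder

jobs = pd.read_csv(os.path.join(log_path, "jobs.csv"))
cluster = pd.read_csv(os.path.join(log_path, "cluster.csv"))

print("jobs.csv columns:", list(jobs.columns))
print("cluster.csv columns:", list(cluster.columns))
print("number of job records:", len(jobs))
```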
Contact: [email protected], James Bulman ([email protected])