NOTE: Currently the simulator makes a few assumptions:
- Homogeneous cluster setup
- The volume of model-gradient transfer is assumed to equal the model size saved in checkpoints (`model_factory`)
- Parameter Server / Worker frameworks (all-reduce is not yet implemented)
- Synchronous SGD
Execution

- Before the execution, what's needed?
  Job trace: the job trace to simulate. For each job, the simulator needs the following information (a hypothetical example trace is sketched after this list):
  - `job_id`: for tracking
  - `num_gpu`: GPU requirement
  - `submit_time`: when the job is submitted. The simulator is event-based and discrete-time, so time values start from 0 and are in seconds.
  - `iterations`: the number of training iterations. Used in the network-cost calculation for data-parallel jobs.
  - `model_name`: the model trained by that job. This is used to estimate GPU memory usage and network costs.
  - `duration`: how long the job will run. This is used by the simulator to generate the job-completion event.
  - `interval`: the submission interval from this job to the next job
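The exact trace-file format expected by the simulator's parser is not spelled out above, so the following is only a hypothetical sketch that writes the fields listed there into a CSV; adjust the column names and order to match the actual trace loader.

```python
# Hypothetical sketch: build a minimal job trace as CSV using the fields above.
# The real parser in the simulator may expect different column names/ordering.
import csv

jobs = [
    # job_id, num_gpu, submit_time, iterations, model_name, duration, interval
    {"job_id": 1, "num_gpu": 4, "submit_time": 0,   "iterations": 1000,
     "model_name": "resnet50", "duration": 3600, "interval": 120},
    {"job_id": 2, "num_gpu": 8, "submit_time": 120, "iterations": 5000,
     "model_name": "vgg16",    "duration": 7200, "interval": 300},
]

with open("example_trace.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(jobs[0].keys()))
    writer.writeheader()
    writer.writerows(jobs)
```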
- How to run the simulator?

  A simple example of the execution command is:

  `python execute.py`
  Inside the execute file, the following options are necessary (a fuller example command is sketched after these lists):
  - `--cluster_spec`: infrastructure spec file
  - `--trace_file`: job trace
  - `--scheme`: placement scheme
  - `--schedule`: scheduler

  Optional inputs:
  - `--print`: print debug information
  - `--log_path`: the output path of the logs (cluster, job). The default is a time-stamped folder under the current path.
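For instance, a complete invocation might look like the sketch below. The input file names are placeholders, and the `--flag value` argument form is an assumption; check `execute.py` for the exact syntax it expects (e.g., `--flag=value`).

```python
# Hypothetical driver: invoke the simulator with the documented flags.
# cluster_spec.csv and trace.csv are placeholder file names.
import subprocess

subprocess.run([
    "python", "execute.py",
    "--cluster_spec", "cluster_spec.csv",
    "--trace_file", "trace.csv",
    "--scheme", "yarn",
    "--schedule", "fifo",
    "--log_path", "./results",
], check=True)
```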
- What are the placement and scheduling algorithms provided?

  Placement:
  - `yarn`: get GPUs from the same server nodes under the same switch
  Scheduling:
  - `fifo`
  - `sjf`: smallest-job-first, in terms of GPU requirement

  TODO below:
  - `lpjf`: longest-pending-job-first
  - `shorest`: shortest-remaining-time job first
  - `shorest-gpu`: shortest-remaining-GPU-time job first
  - `dlas`: discretized LAS (time-based only)

    In `jobs.py`, you need to specify `num_queue` and `queue_limit` for `MLFQ` (also for `dlas-gpu` and `gittins`); a minimal sketch of how such thresholds drive queue selection appears after this list. For example:
```python
# Example 1: there are two queues, and the threshold for Q1 is 3600 seconds
self.queue_limit = [3600]

# Example 2: there are four queues, and the thresholds are 3600, 7200, and 18000 seconds
self.queue_limit = [3600, 7200, 18000]
```
  - `dlas-gpu`: discretized LAS (GPU-time-based)
  - `gittins`: discretized Gittins index (GPU-time-based)
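The following is a minimal, self-contained illustration of how `queue_limit`-style thresholds are commonly used to pick a job's queue from its attained service in a discretized-LAS / MLFQ scheme. It is not the simulator's implementation; the function name `select_queue` and the notion of "attained service" (running time, or running time multiplied by `num_gpu` for the GPU-time-based variants) are assumptions made for illustration only.

```python
# Illustrative sketch (not the simulator's code): map a job's attained
# (GPU-)time onto MLFQ queues using queue_limit-style thresholds.

def select_queue(attained_service, queue_limit):
    """Return the queue index for a job given its attained (GPU-)time.

    queue_limit holds the upper thresholds of all queues except the last,
    e.g. [3600, 7200, 18000] defines four queues.
    """
    for i, limit in enumerate(queue_limit):
        if attained_service < limit:
            return i
    return len(queue_limit)  # the last (lowest-priority) queue is unbounded


queue_limit = [3600, 7200, 18000]        # four queues
print(select_queue(600, queue_limit))    # -> 0 (highest priority)
print(select_queue(5000, queue_limit))   # -> 1
print(select_queue(90000, queue_limit))  # -> 3 (lowest priority)
```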
- What's the output?

  Based on `--log_path`, all the output files are placed in that folder (e.g., `result-20190210-12-20-37`), including:
  - `cluster.csv`: cluster-level resource utilization info at each event point
  - `jobs.csv`: the job execution information

  The output logs are defined in `log.py`; you can modify that file to adjust the output information. A small post-processing sketch is given below.
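For convenience, the per-run CSV logs can be inspected with pandas. This is only a hypothetical post-processing sketch: the folder name is the example from above, and the CSV columns are not documented here, so they are printed rather than assumed.

```python
# Hypothetical post-processing sketch: load the logs written under --log_path.
# Nothing is assumed beyond the file names cluster.csv and jobs.csv above.
import os
import pandas as pd

log_path = "result-20190210-12-20-37"  # replace with your --log_path folder

jobs = pd.read_csv(os.path.join(log_path, "jobs.csv"))
cluster = pd.read_csv(os.path.join(log_path, "cluster.csv"))

print("jobs.csv columns:", list(jobs.columns))
print("cluster.csv columns:", list(cluster.columns))
print("number of job records:", len(jobs))
```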
Contact: [email protected], James Bulman ([email protected])