Display GPU Usage For Training Job

The arena top job command shows the GPU resource consumption of training jobs.

1. Display the GPU resource consumption of all training jobs:

$ arena top job
NAME                           STATUS     TRAINER  AGE  GPU(Requested)  GPU(Allocated)  NODE
dawnbench-1x1-v4               PENDING    MPIJOB   20d  100             0               N/A
dawnbench-1x1-v7               SUCCEEDED  MPIJOB   12h  1               0               192.168.8.3
dawnbench-1x1-v6               SUCCEEDED  MPIJOB   12h  1               0               192.168.8.3
tf-distributed-test            FAILED     TFJOB    11m  0               0               N/A
tf-git                         SUCCEEDED  TFJOB    14m  0               0               N/A
dawnbench-1x1-v3               SUCCEEDED  MPIJOB   1d   0               0               N/A
dawnbench-1x1-v2               SUCCEEDED  MPIJOB   12h  0               0               N/A
dawnbench-1x1-v1               SUCCEEDED  MPIJOB   12h  0               0               N/A
mpi-test                       SUCCEEDED  MPIJOB   12h  0               0               N/A
elastic-training               SUCCEEDED  ETJOB    40d  0               0               N/A
horovod-resnet50-v2-4x8-fluid  SUCCEEDED  MPIJOB   1h   0               0               N/A
horovod-resnet50-v2-4x8-nfs    SUCCEEDED  MPIJOB   2h   0               0               N/A

Total Allocated/Requested GPUs of Training Jobs: 0/0
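
If you want to post-process this table in a script, the following is a minimal sketch, assuming the column layout shown above (GPU(Requested) in the fifth column, GPU(Allocated) in the sixth). The function name sum_gpus is hypothetical and not part of arena. It sums every row regardless of job status, so its result need not match the total line that arena prints.

```shell
# sum_gpus (hypothetical helper, not part of arena):
# reads `arena top job` table text on stdin and prints "allocated/requested".
# Only rows whose 5th and 6th fields are plain integers are counted, which
# skips the header, blank lines, and the trailing "Total ..." line.
sum_gpus() {
  awk '
    $5 ~ /^[0-9]+$/ && $6 ~ /^[0-9]+$/ { req += $5; alloc += $6 }
    END { printf "%d/%d\n", alloc, req }
  '
}

# Usage (assumes arena is installed and configured):
#   arena top job | sum_gpus
```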

2. Display the GPU resource consumption of a single training job:

$ arena top job dawnbench-1x1-v6
Name:      dawnbench-1x1-v6
Status:    SUCCEEDED
Namespace: default
Priority:  N/A
Trainer:   MPIJOB
Duration:  12h

Instances:
  NAME                             STATUS     GPU(Request)  NODE         GPU(DeviceIndex)  GPU(DutyCycle)  GPU_MEMORY(Used/Total)
  ----                             ------     ------------  ----         ----------------  --------------  ---------------
  dawnbench-1x1-v6-launcher-686cl  Completed  0             192.168.8.3  N/A               N/A             N/A

GPUs:
  Allocated/Requested GPUs of Job: 0/1
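
To use the per-job summary line in a script, one option is to extract the two numbers from it. This is a minimal sketch, assuming the summary line has exactly the form shown above; job_gpu_ratio is a hypothetical helper name, not an arena command.

```shell
# job_gpu_ratio (hypothetical helper, not part of arena):
# reads `arena top job <name>` output on stdin and prints
# "<allocated> <requested>" taken from the "Allocated/Requested GPUs of Job" line.
# Prints nothing if no such line is found.
job_gpu_ratio() {
  sed -n 's|^ *Allocated/Requested GPUs of Job: *\([0-9][0-9]*\)/\([0-9][0-9]*\) *$|\1 \2|p'
}

# Usage (assumes arena is installed and configured):
#   arena top job dawnbench-1x1-v6 | job_gpu_ratio
```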

3. To monitor a training job in real time, add the "-r" flag:

$ arena top job dawnbench-1x1-v6 -r

Name:      dawnbench-1x1-v6
Status:    SUCCEEDED
Namespace: default
Priority:  N/A
Trainer:   MPIJOB
Duration:  12h

Instances:
  NAME                             STATUS     GPU(Request)  NODE         GPU(DeviceIndex)  GPU(DutyCycle)  GPU_MEMORY(Used/Total)
  ----                             ------     ------------  ----         ----------------  --------------  ---------------
  dawnbench-1x1-v6-launcher-686cl  Completed  0             192.168.8.3  N/A               N/A             N/A

GPUs:
  Allocated/Requested GPUs of Job: 0/1
------------------------------------------- 2021-02-22 17:42:25 ----------------------------------------------------
Name:      dawnbench-1x1-v6
Status:    SUCCEEDED
Namespace: default
Priority:  N/A
Trainer:   MPIJOB
Duration:  12h

Instances:
  NAME                             STATUS     GPU(Request)  NODE         GPU(DeviceIndex)  GPU(DutyCycle)  GPU_MEMORY(Used/Total)
  ----                             ------     ------------  ----         ----------------  --------------  ---------------
  dawnbench-1x1-v6-launcher-686cl  Completed  0             192.168.8.3  N/A               N/A             N/A

GPUs:
  Allocated/Requested GPUs of Job: 0/1
------------------------------------------- 2021-02-22 17:42:27 ----------------------------------------------------
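
In the output above, each snapshot is followed by a dashed separator line carrying a timestamp. The sketch below condenses such a stream into one line per snapshot. It assumes the separator and "Status:" formats are exactly as shown; snapshot_status is a hypothetical helper name, not part of arena.

```shell
# snapshot_status (hypothetical helper, not part of arena):
# reads `arena top job <name> -r` output on stdin and, for every timestamped
# separator line, prints "<date> <time> <status>" using the most recent
# "Status:" value seen before that separator.
snapshot_status() {
  awk '
    /^Status:/ { status = $2 }
    /^-+ [0-9-]+ [0-9:]+ -+$/ { print $2, $3, status }
  '
}

# Usage (assumes arena is installed and configured):
#   arena top job dawnbench-1x1-v6 -r | snapshot_status
```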