Skip to content

Get training job details

You can use arena get to get the training job details, we will introduce some usages about arena get.

1. Get the training job details.

$ arena get mpi-dist

Name:        mpi-dist
Status:      SUCCEEDED
Namespace:   default
Priority:    N/A
Trainer:     MPIJOB
Duration:    10m

Instances:
NAME                     STATUS     AGE  IS_CHIEF  GPU(Requested)  NODE
----                     ------     ---  --------  --------------  ----
mpi-dist-launcher-6fwhd  Completed  10m  true      0               cn-beijing.192.168.1.112

Tensorboard:
Your tensorboard will be available on:
http://192.168.1.101:30600

2. you cant get the training job details with json(or yaml) format only add option -o json(or -o yaml).

$ arena get mpi-dist -o json
{
    "name": "mpi-dist",
    "namespace": "default",
    "duration": "618s",
    "status": "SUCCEEDED",
    "trainer": "mpijob",
    "tensorboard": "http://192.168.1.101:30600",
    "chiefName": "mpi-dist-launcher-6fwhd",
    "instances": [
        {
            "ip": "172.27.0.10",
            "status": "Completed",
            "name": "mpi-dist-launcher-6fwhd",
            "age": "601s",
            "node": "cn-beijing.192.168.1.112",
            "nodeIP": "192.168.1.112",
            "chief": true,
            "requestGPUs": 0,
            "gpuMetrics": {}
        }
    ],
    "priority": "N/A",
    "requestGPUs": 1,
    "allocatedGPUs": 0
}

3. -e can display the events of training job instances.

$ arena get tf-standalone-test -e

Name:        tf-standalone-test
Status:      PENDING
Namespace:   default
Priority:    N/A
Trainer:     TFJOB
Duration:    9s

Instances:
NAME                        STATUS             AGE  IS_CHIEF  GPU(Requested)  NODE
----                        ------             ---  --------  --------------  ----
tf-standalone-test-chief-0  ContainerCreating  9s   true      0               N/A

Tensorboard:
Your tensorboard will be available on:
http://192.168.1.101:32221

Events:
SOURCE                          TYPE    AGE  MESSAGE
------                          ----    ---  -------
pod/tf-standalone-test-chief-0  Normal  9s   [Scheduled] Successfully assigned default/tf-standalone-test-chief-0 to cn-beijing.192.168.1.112
pod/tf-standalone-test-chief-0  Normal  9s   [Pulling] Pulling image "registry.cn-beijing.aliyuncs.com/ai-samples/tensorflow:1.5.0-devel-gpu"

4. When the status of training job is RUNNING(or COMPLETEDED), -g can display the gpu utilization of training job(this option depends prometheus).

$ arena get tf-standalone-test -g

Name:        tf-standalone-test
Status:      RUNNING
Namespace:   default
Priority:    N/A
Trainer:     TFJOB
Duration:    6s

Instances:
NAME                        STATUS   AGE  IS_CHIEF  GPU(Requested)  NODE
----                        ------   ---  --------  --------------  ----
tf-standalone-test-chief-0  Running  6s   true      1               cn-beijing.192.168.1.112

GPUs:
INSTANCE                    NODE(IP)       GPU(Requested)  GPU(IndexId)  GPU(DutyCycle)  GPU Memory(Used/Total)
--------                    --------       --------------  ------------  --------------  ----------------------
tf-standalone-test-chief-0  192.168.1.112  1               N/A           N/A             N/A

Allocated/Requested GPUs of Job: 1/1

Tensorboard:
Your tensorboard will be available on:
http://192.168.1.101:31869

5. You can use --type to specify the training job type.

$ arena get tf-standalone-test --type tf

Name:        tf-standalone-test
Status:      RUNNING
Namespace:   default
Priority:    N/A
Trainer:     TFJOB
Duration:    6m

Instances:
NAME                        STATUS   AGE  IS_CHIEF  GPU(Requested)  NODE
----                        ------   ---  --------  --------------  ----
tf-standalone-test-chief-0  Running  6m   true      1               cn-beijing.192.168.1.112

Tensorboard:
Your tensorboard will be available on:
http://192.168.1.101:31869