List training jobs

To get the names of all training jobs, use the arena list command.

1. List all training jobs.

$ arena list

NAME                           STATUS     TRAINER  DURATION  GPU(Requested)  GPU(Allocated)  NODE
mpi-dist                       SUCCEEDED  MPIJOB   10m       1               N/A             192.168.1.112
elastic-training               SUCCEEDED  ETJOB    2d        0               N/A             N/A
horovod-resnet50-v2-4x8-fluid  SUCCEEDED  MPIJOB   1h        0               N/A             N/A
horovod-resnet50-v2-4x8-nfs    SUCCEEDED  MPIJOB   2h        0               N/A             N/A

As you can see, the output contains two training job types: MPIJOB and ETJOB.

2. To list only the training jobs of a given type, use -T or --type to specify the job type. For example, the following command lists all MPI training jobs:

$ arena list -T mpi

NAME                           STATUS     TRAINER  DURATION  GPU(Requested)  GPU(Allocated)  NODE
mpi-dist                       SUCCEEDED  MPIJOB   10m       1               N/A             192.168.1.112
horovod-resnet50-v2-4x8-fluid  SUCCEEDED  MPIJOB   1h        0               N/A             N/A
horovod-resnet50-v2-4x8-nfs    SUCCEEDED  MPIJOB   2h        0               N/A             N/A

3. By default, arena list lists the training jobs in the default namespace. To list the training jobs in another namespace, use -n or --namespace. The following command lists all training jobs in the namespace test:

$ arena list -n test

NAME                           STATUS     TRAINER  DURATION  GPU(Requested)  GPU(Allocated)  NODE
mpi-dist-1                     SUCCEEDED  MPIJOB   10m       1               N/A             192.168.1.112

Note

  • --all-namespaces or -A will list all training jobs in all namespaces.
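When arena list is called from a script, the filters described above can be assembled programmatically. The following is a minimal sketch, not part of arena itself: the helper name arena_list_args is hypothetical, and it uses only the flags shown above (--type, --namespace, --all-namespaces). Treating --namespace and --all-namespaces as mutually exclusive is a precaution taken by this sketch, not documented arena behavior.

```python
import shlex
from typing import List, Optional


def arena_list_args(job_type: Optional[str] = None,
                    namespace: Optional[str] = None,
                    all_namespaces: bool = False) -> List[str]:
    """Build the argument list for `arena list` (hypothetical helper)."""
    # Assumption of this sketch: a single namespace and "all namespaces"
    # are contradictory requests, so reject the combination early.
    if namespace and all_namespaces:
        raise ValueError("choose either one namespace or all namespaces")
    args = ["arena", "list"]
    if job_type:
        args += ["--type", job_type]        # same as -T
    if namespace:
        args += ["--namespace", namespace]  # same as -n
    if all_namespaces:
        args.append("--all-namespaces")     # same as -A
    return args


if __name__ == "__main__":
    # Print the command line instead of running it, so the sketch
    # works without an arena installation.
    print(shlex.join(arena_list_args(job_type="mpi", namespace="test")))
```

The returned list can be passed directly to subprocess.run, which avoids shell quoting issues with job type or namespace values.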

4. To get the output of arena list in JSON (or YAML) format, use -o json (or -o yaml).

$ arena list -o json
[
    {
        "name": "mpi-dist",
        "namespace": "default",
        "duration": "618s",
        "status": "SUCCEEDED",
        "trainer": "mpijob",
        "tensorboard": "http://192.168.1.101:30600",
        "chiefName": "mpi-dist-launcher-6fwhd",
        "instances": [
            {
                "ip": "172.27.0.10",
                "status": "Completed",
                "name": "mpi-dist-launcher-6fwhd",
                "age": "601s",
                "node": "cn-beijing.192.168.1.112",
                "nodeIP": "192.168.1.112",
                "chief": true,
                "requestGPUs": 0,
                "gpuMetrics": {}
            }
        ],
        "priority": "N/A",
        "requestGPUs": 1,
        "allocatedGPUs": 0
    },
    {
        "name": "elastic-training",
        "namespace": "default",
        "duration": "250342s",
        "status": "SUCCEEDED",
        "trainer": "etjob",
        "tensorboard": "",
        "chiefName": "",
        "instances": [],
        "priority": "N/A",
        "requestGPUs": 0,
        "allocatedGPUs": 0
    },
    {
        "name": "horovod-resnet50-v2-4x8-fluid",
        "namespace": "default",
        "duration": "5388s",
        "status": "SUCCEEDED",
        "trainer": "mpijob",
        "tensorboard": "",
        "chiefName": "",
        "instances": [],
        "priority": "N/A",
        "requestGPUs": 0,
        "allocatedGPUs": 0
    },
    {
        "name": "horovod-resnet50-v2-4x8-nfs",
        "namespace": "default",
        "duration": "7242s",
        "status": "SUCCEEDED",
        "trainer": "mpijob",
        "tensorboard": "",
        "chiefName": "",
        "instances": [],
        "priority": "N/A",
        "requestGPUs": 0,
        "allocatedGPUs": 0
    }
]
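The JSON output is convenient for scripting. The following is a minimal sketch of how it might be consumed, assuming the field names shown in the sample above (name, duration, trainer) and durations of the form "618s". The function names are hypothetical, and SAMPLE is a trimmed copy of the output above so the example runs without a cluster.

```python
import json
import subprocess
from collections import Counter
from typing import Dict, List


def load_jobs(json_text: str) -> List[dict]:
    """Parse the text printed by `arena list -o json` into a list of jobs."""
    return json.loads(json_text)


def duration_seconds(job: dict) -> int:
    """Convert a duration such as "618s" to an integer number of seconds."""
    # Assumption: durations always end in "s", as in the sample output.
    return int(job["duration"].rstrip("s"))


def count_by_trainer(jobs: List[dict]) -> Dict[str, int]:
    """Count jobs per trainer type, e.g. mpijob or etjob."""
    return dict(Counter(job["trainer"] for job in jobs))


def fetch_jobs() -> List[dict]:
    """Run `arena list -o json`; needs arena on PATH and cluster access."""
    result = subprocess.run(["arena", "list", "-o", "json"],
                            capture_output=True, text=True, check=True)
    return load_jobs(result.stdout)


# Trimmed copy of the sample output above, used in place of a live cluster.
SAMPLE = """[
  {"name": "mpi-dist", "duration": "618s", "trainer": "mpijob"},
  {"name": "elastic-training", "duration": "250342s", "trainer": "etjob"},
  {"name": "horovod-resnet50-v2-4x8-fluid", "duration": "5388s", "trainer": "mpijob"}
]"""

if __name__ == "__main__":
    jobs = load_jobs(SAMPLE)
    print(count_by_trainer(jobs))
    print(max(jobs, key=duration_seconds)["name"])
```

To work against a real cluster, replace load_jobs(SAMPLE) with fetch_jobs().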