List training jobs
To see the names of all training jobs, use the arena list command.
1. List all training jobs.
$ arena list
NAME                           STATUS     TRAINER  DURATION  GPU(Requested)  GPU(Allocated)  NODE
mpi-dist                       SUCCEEDED  MPIJOB   10m       1               N/A             192.168.1.112
elastic-training               SUCCEEDED  ETJOB    2d        0               N/A             N/A
horovod-resnet50-v2-4x8-fluid  SUCCEEDED  MPIJOB   1h        0               N/A             N/A
horovod-resnet50-v2-4x8-nfs    SUCCEEDED  MPIJOB   2h        0               N/A             N/A
As the output shows, there are two trainer types: MPIJOB and ETJOB.
2. To list only the training jobs of a given type, use -T or --type to specify the job type. For example, the following command lists all MPI training jobs:
$ arena list -T mpi
NAME                           STATUS     TRAINER  DURATION  GPU(Requested)  GPU(Allocated)  NODE
mpi-dist                       SUCCEEDED  MPIJOB   10m       1               N/A             192.168.1.112
horovod-resnet50-v2-4x8-fluid  SUCCEEDED  MPIJOB   1h        0               N/A             N/A
horovod-resnet50-v2-4x8-nfs    SUCCEEDED  MPIJOB   2h        0               N/A             N/A
3. By default, arena list lists the training jobs in the default namespace. To list the training jobs in another namespace, use -n or --namespace. The following command lists all training jobs in the namespace test:
$ arena list -n test
NAME        STATUS     TRAINER  DURATION  GPU(Requested)  GPU(Allocated)  NODE
mpi-dist-1  SUCCEEDED  MPIJOB   10m       1               N/A             192.168.1.112
Note: --all-namespaces or -A lists the training jobs in all namespaces.
4. To get the output of arena list in JSON (or YAML) format, use -o json (or -o yaml).
$ arena list -o json
[
  {
    "name": "mpi-dist",
    "namespace": "default",
    "duration": "618s",
    "status": "SUCCEEDED",
    "trainer": "mpijob",
    "tensorboard": "http://192.168.1.101:30600",
    "chiefName": "mpi-dist-launcher-6fwhd",
    "instances": [
      {
        "ip": "172.27.0.10",
        "status": "Completed",
        "name": "mpi-dist-launcher-6fwhd",
        "age": "601s",
        "node": "cn-beijing.192.168.1.112",
        "nodeIP": "192.168.1.112",
        "chief": true,
        "requestGPUs": 0,
        "gpuMetrics": {}
      }
    ],
    "priority": "N/A",
    "requestGPUs": 1,
    "allocatedGPUs": 0
  },
  {
    "name": "elastic-training",
    "namespace": "default",
    "duration": "250342s",
    "status": "SUCCEEDED",
    "trainer": "etjob",
    "tensorboard": "",
    "chiefName": "",
    "instances": [],
    "priority": "N/A",
    "requestGPUs": 0,
    "allocatedGPUs": 0
  },
  {
    "name": "horovod-resnet50-v2-4x8-fluid",
    "namespace": "default",
    "duration": "5388s",
    "status": "SUCCEEDED",
    "trainer": "mpijob",
    "tensorboard": "",
    "chiefName": "",
    "instances": [],
    "priority": "N/A",
    "requestGPUs": 0,
    "allocatedGPUs": 0
  },
  {
    "name": "horovod-resnet50-v2-4x8-nfs",
    "namespace": "default",
    "duration": "7242s",
    "status": "SUCCEEDED",
    "trainer": "mpijob",
    "tensorboard": "",
    "chiefName": "",
    "instances": [],
    "priority": "N/A",
    "requestGPUs": 0,
    "allocatedGPUs": 0
  }
]
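The JSON output is convenient for scripting. The following Python sketch shows one way to build an arena list command from the flags described above and to group the returned jobs by trainer. It is an illustration, not part of arena: the helper names (build_list_command, summarize, list_jobs) are hypothetical, it assumes that -o json can be combined with -T, -n, or -A in one invocation, and it assumes that the duration field always has the "<seconds>s" form seen in the sample output.

```python
import json
import subprocess


def build_list_command(job_type=None, namespace=None, all_namespaces=False):
    """Assemble an `arena list -o json` command line (hypothetical helper).

    Assumption: the -T, -n and -A flags can be combined with -o json.
    """
    cmd = ["arena", "list", "-o", "json"]
    if job_type:
        cmd += ["-T", job_type]
    if all_namespaces:
        # -A covers every namespace, so a single namespace is ignored here.
        cmd.append("-A")
    elif namespace:
        cmd += ["-n", namespace]
    return cmd


def summarize(jobs):
    """Group (name, duration in seconds) pairs by trainer.

    Assumption: "duration" looks like "618s", as in the sample output.
    """
    summary = {}
    for job in jobs:
        seconds = int(job["duration"].rstrip("s"))
        summary.setdefault(job["trainer"], []).append((job["name"], seconds))
    return summary


def list_jobs(**kwargs):
    """Run arena and return the parsed job list (requires arena on PATH)."""
    result = subprocess.run(
        build_list_command(**kwargs),
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


# Two entries trimmed from the sample output above, so the sketch
# can be tried without a cluster.
SAMPLE = """
[
  {"name": "mpi-dist", "namespace": "default", "duration": "618s",
   "status": "SUCCEEDED", "trainer": "mpijob"},
  {"name": "elastic-training", "namespace": "default", "duration": "250342s",
   "status": "SUCCEEDED", "trainer": "etjob"}
]
"""

if __name__ == "__main__":
    for trainer, entries in summarize(json.loads(SAMPLE)).items():
        print(trainer, entries)
```

On a machine with arena installed, list_jobs(job_type="mpi", namespace="test") would run the equivalent of arena list -o json -T mpi -n test, and its result can be passed directly to summarize.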