Skip to content

Submit a elastic training job(tensorflow)

This guide walks through the steps to submit a elastic training job with horovod.

1. Build image for training environment

You can use the following image directly.

registry.cn-hangzhou.aliyuncs.com/ai-samples/horovod:0.20.0-tf2.3.0-torch1.6.0-mxnet1.6.0.post0-py3.7-cuda10.1

In addition, you can also build your own image with the help of this document elastic-training-sample-image.

2. Submit a elastic training job. Example code from tensorflow2_mnist_elastic.py.

$ arena submit etjob \
    --name=elastic-training \
    --gpus=1 \
    --workers=3 \
    --max-workers=9 \
    --min-workers=1 \
    --image=registry.cn-hangzhou.aliyuncs.com/ai-samples/horovod:0.20.0-tf2.3.0-torch1.6.0-mxnet1.6.0.post0-py3.7-cuda10.1 \
    --working-dir=/examples \
    "horovodrun
    -np \$((\${workers}*\${gpus}))
    --min-np \$((\${minWorkers}*\${gpus}))
    --max-np \$((\${maxWorkers}*\${gpus}))
    --host-discovery-script /usr/local/bin/discover_hosts.sh
    python /examples/elastic/tensorflow2_mnist_elastic.py
    "

configmap/elastic-training-etjob created
configmap/elastic-training-etjob labeled
trainingjob.kai.alibabacloud.com/elastic-training created
INFO[0000] The Job elastic-training has been submitted successfully
INFO[0000] You can run `arena get elastic-training --type etjob` to check the job status

3. List then job.

$ arena list
NAME              STATUS   TRAINER  AGE  NODE
elastic-training  RUNNING  ETJOB    52s  192.168.0.116

4. Get the job details.

$ arena get elastic-training

STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 1m

NAME              STATUS   TRAINER  AGE  INSTANCE                   NODE
elastic-training  RUNNING  ETJOB    1m   elastic-training-launcher  192.168.0.116
elastic-training  RUNNING  ETJOB    1m   elastic-training-worker-0  192.168.0.114
elastic-training  RUNNING  ETJOB    1m   elastic-training-worker-1  192.168.0.116
elastic-training  RUNNING  ETJOB    1m   elastic-training-worker-2  192.168.0.116

5. Check logs of the job.

$ arena logs elastic-training --tail 10

Tue Sep  8 08:32:50 2020[1]<stdout>:Step #2170  Loss: 0.021992
Tue Sep  8 08:32:50 2020[0]<stdout>:Step #2180  Loss: 0.000902
Tue Sep  8 08:32:50 2020[1]<stdout>:Step #2180  Loss: 0.023190
Tue Sep  8 08:32:50 2020[2]<stdout>:Step #2180  Loss: 0.013149
Tue Sep  8 08:32:51 2020[0]<stdout>:Step #2190  Loss: 0.029536
Tue Sep  8 08:32:51 2020[2]<stdout>:Step #2190  Loss: 0.017537
Tue Sep  8 08:32:51 2020[1]<stdout>:Step #2190  Loss: 0.018273
Tue Sep  8 08:32:51 2020[2]<stdout>:Step #2200  Loss: 0.038399
Tue Sep  8 08:32:51 2020[0]<stdout>:Step #2200  Loss: 0.007017
Tue Sep  8 08:32:51 2020[1]<stdout>:Step #2200  Loss: 0.017495

6. Scaleout your job. the following sample command Will add one worker into jobs.

$ arena scaleout etjob --name="elastic-training" --count=1 --timeout=1m

configmap/elastic-training-1599548177-scaleout created
configmap/elastic-training-1599548177-scaleout labeled
scaleout.kai.alibabacloud.com/elastic-training-1599548177 created
INFO[0000] The scaleout job elastic-training-1599548177 has been submitted successfully

7. Get your job details. We can see new worker(elastic-training-worker-3) has been "RUNNING".

$ arena get elastic-training

STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 2m

NAME              STATUS   TRAINER  AGE  INSTANCE                   NODE
elastic-training  RUNNING  ETJOB    2m   elastic-training-launcher  192.168.0.116
elastic-training  RUNNING  ETJOB    2m   elastic-training-worker-0  192.168.0.114
elastic-training  RUNNING  ETJOB    2m   elastic-training-worker-1  192.168.0.116
elastic-training  RUNNING  ETJOB    2m   elastic-training-worker-2  192.168.0.116
elastic-training  RUNNING  ETJOB    2m   elastic-training-worker-3  192.168.0.117

8. Check logs of the job.

$ arena logs elastic-training --tail 10

Tue Sep  8 08:33:33 2020[1]<stdout>:Step #3140  Loss: 0.014412
Tue Sep  8 08:33:33 2020[0]<stdout>:Step #3140  Loss: 0.004425
Tue Sep  8 08:33:33 2020[3]<stdout>:Step #3150  Loss: 0.000513
Tue Sep  8 08:33:33 2020[2]<stdout>:Step #3150  Loss: 0.062282
Tue Sep  8 08:33:33 2020[1]<stdout>:Step #3150  Loss: 0.020650
Tue Sep  8 08:33:33 2020[0]<stdout>:Step #3150  Loss: 0.008056
Tue Sep  8 08:33:34 2020[3]<stdout>:Step #3160  Loss: 0.002170
Tue Sep  8 08:33:34 2020[2]<stdout>:Step #3160  Loss: 0.009676
Tue Sep  8 08:33:34 2020[1]<stdout>:Step #3160  Loss: 0.051425
Tue Sep  8 08:33:34 2020[0]<stdout>:Step #3160  Loss: 0.023769

9. Scalein your job. Will remove one worker from current jobs.

$ arena scalein etjob --name="elastic-training" --count=1 --timeout=1m

configmap/elastic-training-1599554041-scalein created
configmap/elastic-training-1599554041-scalein labeled
scalein.kai.alibabacloud.com/elastic-training-1599554041 created
INFO[0000] The scalein job elastic-training-1599554041 has been submitted successfully

10. Get your job details. We can see that elastic-training-worker-3 has been removed.

$ arena get elastic-training

STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 3m

NAME              STATUS   TRAINER  AGE  INSTANCE                   NODE
elastic-training  RUNNING  ETJOB    3m   elastic-training-launcher  192.168.0.116
elastic-training  RUNNING  ETJOB    3m   elastic-training-worker-0  192.168.0.114
elastic-training  RUNNING  ETJOB    3m   elastic-training-worker-1  192.168.0.116
elastic-training  RUNNING  ETJOB    3m   elastic-training-worker-2  192.168.0.116
```

11. Check logs of the job.

$ arena logs elastic-training --tail 10

Tue Sep  8 08:34:43 2020[0]<stdout>:Step #5210  Loss: 0.005627
Tue Sep  8 08:34:43 2020[2]<stdout>:Step #5220  Loss: 0.002142
Tue Sep  8 08:34:43 2020[1]<stdout>:Step #5220  Loss: 0.002978
Tue Sep  8 08:34:43 2020[0]<stdout>:Step #5220  Loss: 0.011404
Tue Sep  8 08:34:44 2020[2]<stdout>:Step #5230  Loss: 0.000689
Tue Sep  8 08:34:44 2020[1]<stdout>:Step #5230  Loss: 0.024597
Tue Sep  8 08:34:44 2020[0]<stdout>:Step #5230  Loss: 0.040936
Tue Sep  8 08:34:44 2020[0]<stdout>:Step #5240  Loss: 0.000125
Tue Sep  8 08:34:44 2020[2]<stdout>:Step #5240  Loss: 0.026498
Tue Sep  8 08:34:44 2020[1]<stdout>:Step #5240  Loss: 0.000308