Submit a TensorFlow job with a specified role sequence
A distributed TensorFlow job has several roles: Worker, PS, Chief, and Evaluator. Sometimes you need to control the order in which these roles are created; for example, you may need to create the "Worker" role first and then the "PS" role. This guide shows how to do that.
1. Suppose you want to submit a distributed TensorFlow job that has four roles (Worker, PS, Chief, and Evaluator), and the roles must start in the order "Worker,Chief,PS,Evaluator". You only need to add the "--role-sequence" option when submitting the job, as in the following command:
$ arena submit tfjob \
--name=tf-distributed-test \
--role-sequence "Worker,Chief,PS,Evaluator" \
--chief \
--evaluator \
--gpus=1 \
--workers=1 \
--worker-image=cheyang/tf-mnist-distributed:gpu \
--ps-image=cheyang/tf-mnist-distributed:cpu \
--ps=1 \
--tensorboard \
--tensorboard-image="registry.cn-hongkong.aliyuncs.com/ai-samples/tensorflow:1.12.0-devel" \
"python /app/main.py"
The option "--role-sequence Worker,Chief,PS,Evaluator" is equivalent to "--role-sequence w,c,p,e", where "w" stands for "Worker", "c" for "Chief", "p" for "PS", and "e" for "Evaluator".
- Make sure that at least one pod belonging to the tfjob "tf-distributed-test" has the annotation "job-role-sequence=Worker,Chief,PS,Evaluator":
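The short-name mapping described above can be sketched as a small shell function. This is only an illustration of the mapping, not arena's own code; the function name `expand_role_sequence` is hypothetical, and it assumes the short names are exactly the lowercase letters listed above.

```shell
# Hypothetical helper (not part of arena): expand the short role names
# w, c, p, e into Worker, Chief, PS, Evaluator. Other names pass through.
# The body runs in a subshell, so IFS and the variables stay local.
expand_role_sequence() (
    set -f          # do not glob-expand the argument while splitting it
    IFS=','         # split the sequence on commas
    result=""
    for role in $1; do
        case "$role" in
            w) role="Worker" ;;
            c) role="Chief" ;;
            p) role="PS" ;;
            e) role="Evaluator" ;;
        esac
        # Join the expanded names with commas again.
        result="${result:+$result,}$role"
    done
    printf '%s\n' "$result"
)

# Usage:
#   expand_role_sequence w,c,p,e    # prints Worker,Chief,PS,Evaluator
```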
$ kubectl get po -l tf-job-name=tf-distributed-test
NAME                              READY   STATUS              RESTARTS   AGE
tf-distributed-test-chief-0       0/1     ContainerCreating   0          5m47s
tf-distributed-test-evaluator-0   0/1     ContainerCreating   0          5m47s
tf-distributed-test-ps-0          1/1     Running             0          5m47s
tf-distributed-test-worker-0      0/1     ContainerCreating   0          5m47s
$ kubectl get po tf-distributed-test-worker-0 -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    job-role-sequence: Worker,Chief,PS,Evaluator
    kubernetes.io/psp: ack.privileged
    requestGPUsOfJobOwner: "3"
  creationTimestamp: 2021-02-22T03:07:49Z
  ....
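Instead of reading the full pod YAML, the annotation can be printed for every pod of the job in one call. The following is a minimal sketch using kubectl's jsonpath output; the function name `job_role_sequences` is hypothetical, and it assumes the pods carry the "tf-job-name" label used in the command above.

```shell
# Hypothetical helper: for each pod of the given tfjob, print the pod name
# and its job-role-sequence annotation (empty if the pod has none).
job_role_sequences() {
    kubectl get po -l "tf-job-name=$1" \
        -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.job-role-sequence}{"\n"}{end}'
}

# Usage:
#   job_role_sequences tf-distributed-test
```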
- You can also verify the role sequence in the tf-operator logs. First, list the pods in the "arena-system" namespace:
$ kubectl get po -n arena-system
NAME                                READY   STATUS    RESTARTS   AGE
et-operator-576887864c-lvmrs        1/1     Running   1          19d
mpi-operator-66b4cf9b76-kl2fm       1/1     Running   0          26d
pytorch-operator-8545c46f98-cffgw   1/1     Running   4          26d
tf-job-dashboard-78478bfc45-msbzn   1/1     Running   0          19d
tf-job-operator-554d594cff-5vxfg    1/1     Running   0          101m
Then query the logs of the pod "tf-job-operator-554d594cff-5vxfg":
$ kubectl logs tf-job-operator-554d594cff-5vxfg -n arena-system | grep "the Role Sequence" | tail -n 1
{"filename":"tensorflow/controller.go:453","job":"default.tf-distributed-test","level":"info","msg":"the Role Sequence of job tf-distributed-test is: [Worker Chief PS Evaluator]","time":"2021-02-01T13:22:23Z","uid":"7db02629-4591-4e0c-a938-c6e4a1cfc074"}
As you can see, the order in which tf-operator handles the tfjob roles matches the sequence you specified.
If you don't want to specify the role sequence every time you submit a tfjob, you can save it in the arena configuration file "~/.arena/config", for example:
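To compare the logged sequence with the one you passed, the role list can be pulled out of the log message with sed. This is a sketch that assumes the message format shown in the log line above; the function name `latest_role_sequence` is hypothetical.

```shell
# Hypothetical helper: read tf-operator log lines on stdin and print the role
# sequence from the most recent "the Role Sequence of job <name> is: [...]"
# message for the given job, joined with commas.
# $1 is the tfjob name (assumed to contain no regex-special characters
# other than "-").
latest_role_sequence() {
    sed -n "s/.*the Role Sequence of job $1 is: \[\([^]]*\)].*/\1/p" \
        | tail -n 1 \
        | tr ' ' ','
}

# Usage:
#   kubectl logs tf-job-operator-554d594cff-5vxfg -n arena-system \
#       | latest_role_sequence tf-distributed-test
```

For the log line shown above, this prints "Worker,Chief,PS,Evaluator".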
tfjob_role_sequence = Worker,PS,Chief,Evaluator
or
tfjob_role_sequence = w,p,c,e
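If you script your environment setup, the entry can be written without opening an editor. The sketch below is a hypothetical helper, not an arena command; it assumes the "key = value" line format shown above and that the file should contain a single tfjob_role_sequence entry.

```shell
# Hypothetical helper: set tfjob_role_sequence in an arena config file,
# replacing an existing entry or appending a new one.
# $1 is the config file path, $2 is the role sequence.
set_tfjob_role_sequence() {
    config_file=$1
    sequence=$2
    mkdir -p "$(dirname "$config_file")"
    touch "$config_file"
    # Keep every line except an existing tfjob_role_sequence entry.
    # grep exits non-zero when nothing is left, which is fine here.
    grep -v '^[[:space:]]*tfjob_role_sequence[[:space:]]*=' "$config_file" \
        > "$config_file.tmp" || true
    printf 'tfjob_role_sequence = %s\n' "$sequence" >> "$config_file.tmp"
    mv "$config_file.tmp" "$config_file"
}

# Usage:
#   set_tfjob_role_sequence ~/.arena/config "Worker,PS,Chief,Evaluator"
```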