TensorFlow job with gang scheduling enabled
Arena supports distributed TensorFlow training with gang scheduling through scheduler-plugins. To enable gang scheduling, add the --gang parameter to the submit command. Learn more: https://help.aliyun.com/document_detail/178169.html
When running distributed TensorFlow, the instances of a job should be scheduled all together or not at all, because a partly scheduled job can hold resources while it waits for its remaining instances. Gang scheduling provides this all-or-nothing behavior.
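The all-or-nothing rule can be illustrated with a toy shell function. This is only a sketch of the idea, not how scheduler-plugins is implemented: the name gang_admit is hypothetical, and it assumes a single pool of free GPUs and ignores per-node placement.

```shell
# Toy illustration of all-or-nothing admission (hypothetical helper).
# $1 = free GPUs in the cluster, remaining arguments = GPUs requested by each pod.
gang_admit() {
  free=$1
  shift
  total=0
  for req in "$@"; do
    total=$((total + req))
  done
  # Admit the job only if every pod fits at the same time.
  if [ "$total" -le "$free" ]; then
    echo "schedule all $# pods"
  else
    echo "keep all $# pods pending"
  fi
}
```

For the job used on this page (1 PS without a GPU, 2 workers with 1 GPU each), gang_admit 1 0 1 1 keeps all three pods pending, while gang_admit 2 0 1 1 schedules all of them.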
Warning
Limitation: when using gang scheduling, the tensorboard feature doesn't work well.
The following command is an example. It defines 2 workers and 1 PS, and each worker has 1 GPU. The source code for the workers and the PS is pulled from a git repository, and tensorboard is enabled.
$ arena submit tf \
--name=tf-dist-git \
--gpus=1 \
--workers=2 \
--worker-image=tensorflow/tensorflow:1.5.0-devel-gpu \
--sync-mode=git \
--sync-source=https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git \
--ps=1 \
--ps-image=tensorflow/tensorflow:1.5.0-devel \
--gang \
--tensorboard \
"python code/tensorflow-sample-code/tfjob/docker/v1alpha2/distributed-mnist/main.py --log_dir=/training_logs --data_dir=code/tensorflow-sample-code/data"
service/tf-dist-git-tensorboard created
deployment.apps/tf-dist-git-tensorboard created
tfjob.kubeflow.org/tf-dist-git created
INFO[0002] The Job tf-dist-git has been submitted successfully
INFO[0002] You can run `arena get tf-dist-git --type tfjob` to check the job status
If there are not enough resources, all instances of the job stay PENDING. Without gang scheduling, you would instead see some pods RUNNING while others are PENDING.
$ arena get tf-dist-git
Name: tf-dist-git
Status: PENDING
Namespace: default
Priority: N/A
Trainer: TFJOB
Duration: 4s
Instances:
NAME STATUS AGE IS_CHIEF GPU(Requested) NODE
---- ------ --- -------- -------------- ----
tf-dist-git-ps-0 Pending 4s false 0 N/A
tf-dist-git-worker-0 Pending 4s true 1 N/A
tf-dist-git-worker-1 Pending 4s false 1 N/A
Tensorboard:
Your tensorboard will be available on:
http://10.0.0.80:31029
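To check this state without reading the table by eye, the output of arena get can be summarized with a small helper. This is a minimal sketch: the function name summarize_instances is hypothetical, and it assumes instance rows start with the job name followed by "-" and carry the state in the second column, as in the sample output above.

```shell
# Count pending instances in `arena get <job>` output read from stdin.
# Assumption: instance rows begin with "<job>-" and the state is column 2.
summarize_instances() {
  awk -v job="$1" '
    index($1, job "-") == 1 {
      total++
      if ($2 == "Pending") pending++
    }
    END { printf "pending=%d total=%d\n", pending, total }
  '
}
```

It could be used as arena get tf-dist-git | summarize_instances tf-dist-git. With gang scheduling and insufficient resources, the pending count should equal the total.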
When there are enough resources, the instances become RUNNING.
$ arena get tf-dist-git
Name: tf-dist-git
Status: RUNNING
Namespace: default
Priority: N/A
Trainer: TFJOB
Duration: 50s
Instances:
NAME STATUS AGE IS_CHIEF GPU(Requested) NODE
---- ------ --- -------- -------------- ----
tf-dist-git-ps-0 Running 0s false 0 cn-beijing.10.0.0.84
tf-dist-git-worker-0 Running 50s true 1 cn-beijing.10.0.0.83
tf-dist-git-worker-1 Running 50s false 1 cn-beijing.10.0.0.85
Tensorboard:
Your tensorboard will be available on:
http://10.0.0.80:31029
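Rather than rerunning arena get by hand, the wait can be scripted. The sketch below uses only the arena get command shown above and assumes its output contains a "Status:" line as in the samples. The function name wait_for_running and its default retry values are hypothetical.

```shell
# Poll `arena get` until the job-level Status line reports RUNNING.
# $1 = job name, $2 = max attempts (default 30), $3 = seconds between attempts (default 10).
wait_for_running() {
  name=$1
  attempts=${2:-30}
  interval=${3:-10}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Take the second field of the first "Status:" line.
    status=$(arena get "$name" | awk '$1 == "Status:" { print $2; exit }')
    if [ "$status" = "RUNNING" ]; then
      echo "$name is RUNNING"
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "$name did not reach RUNNING after $attempts attempts" >&2
  return 1
}
```

For example, wait_for_running tf-dist-git 60 5 checks every 5 seconds for up to 60 attempts and returns a non-zero exit code if the job is still not RUNNING.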