Training job with image pull secret
You can pull the images for training jobs, including TensorBoard images, from a private registry by setting image pull secrets. Assume the following images are stored in your private registry.
# pytorch
registry.cn-beijing.aliyuncs.com/ai-samples/pytorch-with-tensorboard-secret:1.5.1-cuda10.1-cudnn7-runtime
# tf
registry.cn-beijing.aliyuncs.com/ai-samples/tensorflow:1.5.0-devel-gpu
# mpi
registry.cn-beijing.aliyuncs.com/ai-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5
# tensorboard (--tensorboard-image)
registry.cn-beijing.aliyuncs.com/ai-samples/tensorflow:1.12.0-devel
Create Image Pull Secrets
Create a docker-registry Secret with kubectl. It serves as the image pull secret (imagePullSecrets) in the following examples.
$ kubectl create secret docker-registry <REG_SECRET> --docker-server=<REGISTRY> --docker-username=<USERNAME> --docker-password=<PASSWORD> --docker-email=<EMAIL>
Note
- REG_SECRET: the name of the secret, which you can choose yourself.
- REGISTRY: the address of your private registry.
- USERNAME: the username for your private registry.
- PASSWORD: the password for your private registry.
- EMAIL: your email address (optional).
For example, use the following command to create an image pull secret:
$ kubectl create secret docker-registry \
lumo-secret \
--docker-server=registry.cn-huhehaote.aliyuncs.com \
--docker-username=******@test.aliyunid.com \
--docker-password=******
secret/lumo-secret created
You can check that the secret was created.
$ kubectl get secrets | grep lumo-secret
lumo-secret kubernetes.io/dockerconfigjson 1 52s
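To make the kubernetes.io/dockerconfigjson type in the output above more concrete, the following sketch builds the kind of JSON payload such a secret holds, offline and without touching a cluster. The registry address matches the example above; the username and password (demo-user, demo-pass) are placeholders assumed for illustration, not real credentials.

```shell
#!/bin/sh
# Placeholder credentials, assumed for illustration only.
registry="registry.cn-huhehaote.aliyuncs.com"
username="demo-user"
password="demo-pass"

# The "auth" field is the base64 encoding of "username:password".
auth=$(printf '%s:%s' "$username" "$password" | base64)

# Assemble a .dockerconfigjson document for a single registry.
dockerconfig=$(printf '{"auths":{"%s":{"username":"%s","password":"%s","auth":"%s"}}}' \
  "$registry" "$username" "$password" "$auth")

echo "$dockerconfig"
```

On a real cluster, the stored payload can be decoded for comparison:
$ kubectl get secret lumo-secret --output="jsonpath={.data.\.dockerconfigjson}" | base64 --decode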
Submit a tfjob with imagePullSecrets
Submit the job with --image-pull-secrets to specify the imagePullSecrets.
1. Submit a TensorFlow job. The following command is an example:
$ arena submit tf \
--name=tf-git-with-secret \
--working-dir=/root \
--gpus=1 \
--image=registry.cn-beijing.aliyuncs.com/ai-samples/tensorflow:1.5.0-devel-gpu \
--sync-mode=git \
--sync-source=https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git \
--data=training-data:/mnist_data \
--tensorboard \
--tensorboard-image=registry.cn-beijing.aliyuncs.com/ai-samples/tensorflow:1.12.0-devel \
--logdir=/mnist_data/tf_data/logs \
--image-pull-secrets=lumo-secret \
"python code/tensorflow-sample-code/tfjob/docker/mnist/main.py --log_dir /mnist_data/tf_data/logs --data_dir /mnist_data/tf_data/"
Note
- If you need to use several imagePullSecrets, you can pass --image-pull-secrets multiple times, like:
$ arena submit tf \
--name=tf-git-with-secret \
... \
--image-pull-secrets=lumo-secret \
--image-pull-secrets=king-secret \
--image-pull-secrets=test-secret ...
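When the secret names are kept in a single comma-separated value, the repeated flags described above can be generated rather than typed by hand. The sketch below is one way to do that in POSIX shell; the variable names (secrets, flags) are illustrative, and the flags are only printed, not passed to arena.

```shell
#!/bin/sh
# Secret names from the note above, as one comma-separated value.
secrets="lumo-secret,king-secret,test-secret"

# Build one --image-pull-secrets flag per name by splitting on commas.
flags=""
old_ifs=$IFS
IFS=','
for s in $secrets; do
  flags="$flags --image-pull-secrets=$s"
done
IFS=$old_ifs
flags=${flags# }   # drop the leading space

# Print the generated flags.
echo "$flags"
```

The result can be appended unquoted to a submit command, for example arena submit tf --name=tf-git-with-secret ... $flags, so that the shell splits it back into separate arguments.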
2. Get the details of the job.
$ arena get tf-git-with-secret
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 17s
NAME STATUS TRAINER AGE INSTANCE NODE
tf-git-with-secret RUNNING TFJOB 17s tf-git-with-secret-chief-0 172.16.0.202
Your tensorboard will be available on:
http://172.16.0.198:30080
Submit a mpijob with imagePullSecrets
Submit the MPI job with --image-pull-secrets to specify the imagePullSecrets.
1. Submit an MPI job. The following command is an example:
$ arena submit mpi \
--name=mpi-dist-with-secret \
--gpus=1 \
--workers=2 \
--image=registry.cn-beijing.aliyuncs.com/ai-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
--env=GIT_SYNC_BRANCH=cnn_tf_v1.9_compatible \
--sync-mode=git \
--sync-source=https://github.com/tensorflow/benchmarks.git \
--tensorboard \
--tensorboard-image=registry.cn-beijing.aliyuncs.com/ai-samples/tensorflow:1.12.0-devel \
--image-pull-secrets=lumo-secret \
"mpirun python code/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod --train_dir=/training_logs --summary_verbosity=3 --save_summaries_steps=10"
2. Get the details of the job.
$ arena get mpi-dist-with-secret
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 9m
NAME STATUS TRAINER AGE INSTANCE NODE
mpi-dist-with-secret RUNNING MPIJOB 9m mpi-dist-with-secret-launcher-v8sgt 172.16.0.201
mpi-dist-with-secret RUNNING MPIJOB 9m mpi-dist-with-secret-worker-0 172.16.0.201
mpi-dist-with-secret RUNNING MPIJOB 9m mpi-dist-with-secret-worker-1 172.16.0.202
Your tensorboard will be available on:
http://172.16.0.198:30450
Submit a pytorchjob with imagePullSecrets
Submit the PyTorch job with --image-pull-secrets to specify the imagePullSecrets.
1. Submit a PyTorch job. The following command is an example:
$ arena submit pytorch \
--name=pytorch-git-with-secret \
--gpus=1 \
--working-dir=/root \
--image=registry.cn-beijing.aliyuncs.com/ai-samples/pytorch-with-tensorboard-secret:1.5.1-cuda10.1-cudnn7-runtime \
--sync-mode=git \
--sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
--data=training-data:/mnist_data \
--tensorboard \
--tensorboard-image=registry.cn-beijing.aliyuncs.com/ai-samples/tensorflow:1.12.0-devel \
--logdir=/mnist_data/pytorch_data/logs \
--image-pull-secrets=lumo-secret \
"python /root/code/mnist-pytorch/mnist.py --epochs 10 --backend nccl --dir /mnist_data/pytorch_data/logs --data /mnist_data/pytorch_data/"
2. Get the details of the job.
$ arena get pytorch-git-with-secret
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 2m
NAME STATUS TRAINER AGE INSTANCE NODE
pytorch-git-with-secret RUNNING PYTORCHJOB 2m pytorch-git-with-secret-master-0 172.16.0.202
Your tensorboard will be available on:
http://172.16.0.198:31155
Load imagePullSecrets from arena configuration file
If you don't want to pass --image-pull-secrets every time you submit a job, you can set the secrets in the Arena configuration file instead.
Open the file ~/.arena/config (create it if it does not exist) and add the following configuration:
imagePullSecrets=lumo-secret,king-secret
Note
--image-pull-secrets overrides the imagePullSecrets value set in ~/.arena/config.
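The configuration step above can also be scripted. The following sketch writes the imagePullSecrets line into the Arena configuration file while preserving any other lines already in it. ARENA_CONFIG is a hypothetical override variable introduced here so the script can be tried against a scratch file; Arena itself is not assumed to read it.

```shell
#!/bin/sh
# Default path comes from the text above; ARENA_CONFIG is a hypothetical
# override used only by this script, not an Arena setting.
config="${ARENA_CONFIG:-$HOME/.arena/config}"

# Create the directory and file if they do not exist yet.
mkdir -p "$(dirname "$config")"
touch "$config"

# Keep every line except an existing imagePullSecrets entry,
# then append the new value so the key appears exactly once.
grep -v '^imagePullSecrets=' "$config" > "$config.tmp" || true
echo 'imagePullSecrets=lumo-secret,king-secret' >> "$config.tmp"
mv "$config.tmp" "$config"

cat "$config"
```

Running the script a second time leaves a single imagePullSecrets line in place, so it is safe to rerun after changing the list of secret names.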