Skip to content

Pytorch Training Job with specified cleaning task policy

You can specify the clean-up policy of pod after the pytorch job finished.

1. Submit a pytorch job, and specify --clean-task-policy as All. After the job finished (SUCCEEDED or FAILED), all instances (pods) will be deleted.

➜ arena --loglevel info submit pytorch \
    --name=pytorch-clean-policy \
    --gpus=1 \
    --image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \
    --sync-mode=git \
    --sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
    --clean-task-policy=All \
    "python /root/code/mnist-pytorch/mnist.py --backend gloo"

configmap/pytorch-clean-policy-pytorchjob created
configmap/pytorch-clean-policy-pytorchjob labeled
pytorchjob.kubeflow.org/pytorch-clean-policy created
INFO[0000] The Job pytorch-clean-policy has been submitted successfully
INFO[0000] You can run `arena get pytorch-clean-policy --type pytorchjob` to check the job status

Note

  • All: represent that the all instances(pods) will be deleted after the job is finished
  • None(default): represent that the all instances(pods) will be retained after the job is finished

2. Get the job details. After the job is finished, the instance python-clean-policy-master-0 has been deleted.

➜ arena get pytorch-clean-policy
STATUS: SUCCEEDED
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 37s

NAME  STATUS  TRAINER  AGE  INSTANCE  NODE