PyTorch Training Job with Datasets

This example shows how to use Arena to submit a distributed PyTorch job and mount an NFS data volume. The sample downloads its source code from a git URL.

1. Set up an NFS server. (Refer to: https://www.cnblogs.com/weifeng1463/p/10037803.html)

step1: Install nfs-utils.

➜ yum install nfs-utils -y

step2: Create the local directory on the NFS server.

➜ mkdir -p /root/nfs/data

step3: Configure the NFS server. The export below grants every client read-write access, and no_root_squash lets remote root keep root privileges on the export.

➜ cat /etc/exports
/root/nfs/data *(rw,no_root_squash)

step4: Start the NFS server.

➜ systemctl start nfs;  systemctl start rpcbind
➜ systemctl enable nfs
Created symlink from /etc/systemd/system/multi-user.target.wants/nfs-server.service to /usr/lib/systemd/system/nfs-server.service.
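If you edit /etc/exports later, you can reload the export table without restarting the service. A minimal sketch (the verbose output shown is typical):

➜ exportfs -rav
exporting *:/root/nfs/data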

2. Download the training data to the NFS shared directory.

step1: Get the export list of the NFS server with showmount; 172.16.0.200 is the host IP of the NFS server.

➜ showmount -e 172.16.0.200
Export list for 172.16.0.200:
/root/nfs/data *
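If you prepare the data from a machine other than the NFS server itself, mount the export first. A sketch, assuming the mount point /mnt/nfs:

➜ mkdir -p /mnt/nfs
➜ mount -t nfs 172.16.0.200:/root/nfs/data /mnt/nfs

The steps below run directly on the NFS server, so no mount is needed there.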

step2: Enter the shared directory.

➜ cd /root/nfs/data

step3: Confirm you are in the shared directory, then place the training data there.

➜ pwd
/root/nfs/data

step4: List the files; the MNIST directory holds the training data we need.

➜ ll
total 8.0K
drwxr-xr-x 4  502 games 4.0K Jun 17 16:05 data
drwxr-xr-x 4 root root  4.0K Jun 23 15:17 MNIST
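If the dataset is not already present, one hedged way to fetch the raw MNIST files into the shared directory (the mirror URL and the MNIST/raw layout are assumptions; match whatever layout your training script expects):

➜ mkdir -p /root/nfs/data/MNIST/raw && cd /root/nfs/data/MNIST/raw
➜ for f in train-images-idx3-ubyte train-labels-idx1-ubyte \
         t10k-images-idx3-ubyte t10k-labels-idx1-ubyte; do
    wget https://ossci-datasets.s3.amazonaws.com/mnist/$f.gz && gunzip $f.gz
  done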

3. Create a PV.

# Note: copy-pasting may break the YAML indentation
➜ cat nfs-pv.yaml 
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pytorchdata
  labels:
    pytorchdata: nas-mnist
spec:
  persistentVolumeReclaimPolicy: Retain
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteMany
  nfs:
    server: 172.16.0.200
    path: "/root/nfs/data"

Create the PV with kubectl.

➜ kubectl create -f nfs-pv.yaml
persistentvolume/pytorchdata created

Check the PV we just created.

➜ kubectl get pv | grep pytorchdata
pytorchdata   10Gi       RWX            Retain           Bound    default/pytorchdata    7m38s
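To double-check that the PV points at the right NFS export, a quick sketch; the Source section of the output should list the server and path from nfs-pv.yaml:

➜ kubectl describe pv pytorchdata | grep -A 4 'Source:'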

4. Create a PVC based on the PV created above.

Create the PVC manifest. The selector's matchLabels must match the pytorchdata: nas-mnist label on the PV so that the claim binds to that volume.

➜ cat nfs-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pytorchdata
  annotations:
    description: "this is the mnist demo"
    owner: Tom
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
  selector:
    matchLabels:
      pytorchdata: nas-mnist

Create the PVC with kubectl.

➜ kubectl create -f nfs-pvc.yaml
persistentvolumeclaim/pytorchdata created

Check the PVC.

➜ kubectl get pvc | grep pytorchdata
pytorchdata   Bound    pytorchdata   10Gi       RWX                           2m3s
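The claim requests 5Gi but binds the entire 10Gi PV: a PV is bound as a whole, never partially allocated. To confirm which PV backs the claim, a small sketch:

➜ kubectl get pvc pytorchdata -o jsonpath='{.spec.volumeName}'
pytorchdata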

5. Check the data volume.

➜ arena data list
NAME         ACCESSMODE     DESCRIPTION             OWNER  AGE
pytorchdata  ReadWriteMany  this is the mnist demo  Tom    2m

6. Submit the PyTorch job, using --data pvc_name:container_path to mount the distributed storage volume.

➜ arena --loglevel info submit pytorch \
    --name=pytorch-data \
    --gpus=1 \
    --workers=2 \
    --image=registry.cn-beijing.aliyuncs.com/ai-samples/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \
    --sync-mode=git \
    --sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
    --data=pytorchdata:/mnist_data \
    "python /root/code/mnist-pytorch/mnist.py --backend gloo --data /mnist_data/data"

configmap/pytorch-data-pytorchjob created
configmap/pytorch-data-pytorchjob labeled
pytorchjob.kubeflow.org/pytorch-data created
INFO[0000] The Job pytorch-data has been submitted successfully
INFO[0000] You can run `arena get pytorch-data --type pytorchjob` to check the job status
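While the job runs, you can stream its training logs through Arena (see arena logs --help for the flags your version supports):

➜ arena logs pytorch-data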

7. Get the details of the job with arena get.

# Get the details of this job
➜ arena get pytorch-data
STATUS: SUCCEEDED
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 56s

NAME          STATUS     TRAINER     AGE  INSTANCE               NODE
pytorch-data  SUCCEEDED  PYTORCHJOB  1m   pytorch-data-master-0  172.16.0.210
pytorch-data  SUCCEEDED  PYTORCHJOB  1m   pytorch-data-worker-0  172.16.0.210

8. Get the status of volume pytorchdata in pytorch-data-master-0 with kubectl describe.

➜ kubectl describe pod pytorch-data-master-0 | grep pytorchdata -C 3

[Screenshot: kubectl describe output showing the pytorchdata volume mounted in the pod.]
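If the screenshot is unavailable, an equivalent check reads the claim name straight from the pod spec (a sketch; output will look roughly like this):

➜ kubectl get pod pytorch-data-master-0 -o yaml | grep -B 1 claimName
  persistentVolumeClaim:
    claimName: pytorchdata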