TensorFlow job with datasets
Arena allows you to mount multiple data volumes into training jobs. The following example mounts a data volume
into a training job.
1. You need to create /data on the NFS server and prepare the MNIST data.
$ mkdir -p /nfs
$ mount -t nfs -o vers=4.0 NFS_SERVER_IP:/ /nfs
$ mkdir -p /nfs/data
$ cd /nfs/data
$ wget https://raw.githubusercontent.com/cheyang/tensorflow-sample-code/master/data/t10k-images-idx3-ubyte.gz
$ wget https://raw.githubusercontent.com/cheyang/tensorflow-sample-code/master/data/t10k-labels-idx1-ubyte.gz
$ wget https://raw.githubusercontent.com/cheyang/tensorflow-sample-code/master/data/train-images-idx3-ubyte.gz
$ wget https://raw.githubusercontent.com/cheyang/tensorflow-sample-code/master/data/train-labels-idx1-ubyte.gz
$ cd /
$ umount /nfs
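If you want to sanity-check the dataset, you can remount the share and test the gzip archives. This is an optional check, using the same mount options as above:
$ mount -t nfs -o vers=4.0 NFS_SERVER_IP:/ /nfs
$ ls -lh /nfs/data
$ gunzip -t /nfs/data/*.gz && echo "all archives OK"
$ umount /nfs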
2. Create a Persistent Volume. Modify NFS_SERVER_IP to your NFS server's address.
$ cat nfs-pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: tfdata
  labels:
    tfdata: nas-mnist
spec:
  persistentVolumeReclaimPolicy: Retain
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: NFS_SERVER_IP
    path: "/data"
$ kubectl create -f nfs-pv.yaml
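You can confirm that the volume was registered and is still unbound. The output below is illustrative; exact columns depend on your kubectl version:
$ kubectl get pv tfdata
NAME     CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS   REASON   AGE
tfdata   10Gi       RWX            Retain           Available                                   10s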
3. Create a Persistent Volume Claim.
$ cat nfs-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tfdata
  annotations:
    description: "this is the mnist demo"
    owner: Tom
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
  selector:
    matchLabels:
      tfdata: nas-mnist
$ kubectl create -f nfs-pvc.yaml
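Check that the claim has bound to the volume created in step 2 (again, illustrative output):
$ kubectl get pvc tfdata
NAME     STATUS   VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
tfdata   Bound    tfdata   10Gi       RWX                           8s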
Note: it is recommended to set the description and owner annotations, because arena data list displays them.
4. Check the data volume.
$ arena data list
NAME    ACCESSMODE     DESCRIPTION             OWNER  AGE
tfdata  ReadWriteMany  this is the mnist demo  Tom    43d
5. Now we can submit a distributed training job with arena. It will download the source code from GitHub and mount the data volume tfdata to /mnist_data.
$ arena submit tf --name=tf-dist-data \
--gpus=1 \
--workers=2 \
--workerImage=tensorflow/tensorflow:1.5.0-devel-gpu \
--syncMode=git \
--syncSource=https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git \
--ps=1 \
--psImage=tensorflow/tensorflow:1.5.0-devel \
--tensorboard \
--data=tfdata:/mnist_data \
"python code/tensorflow-sample-code/tfjob/docker/v1alpha2/distributed-mnist/main.py --log_dir /training_logs --data_dir /mnist_data"
Note: --data specifies the data volume to mount into all tasks of the job, in the format <volume name>:<mount path>. Here the volume tfdata is mounted to /mnist_data inside each container.
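Since Arena supports multiple data volumes, the --data flag can be repeated, once per volume. A sketch of such a submission, assuming a second, hypothetical PVC named tflogs exists for the training logs:
$ arena submit tf --name=tf-dist-multi \
              --gpus=1 \
              --workers=2 \
              --workerImage=tensorflow/tensorflow:1.5.0-devel-gpu \
              --syncMode=git \
              --syncSource=https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git \
              --ps=1 \
              --psImage=tensorflow/tensorflow:1.5.0-devel \
              --data=tfdata:/mnist_data \
              --data=tflogs:/training_logs \
              "python code/tensorflow-sample-code/tfjob/docker/v1alpha2/distributed-mnist/main.py --log_dir /training_logs --data_dir /mnist_data"
With this layout the job reads the dataset from one volume and persists its logs to another, so the logs survive after the job's pods are gone.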
6. From the logs, we can see that the training data is read from /mnist_data instead of being downloaded directly from the internet.
$ arena logs tf-dist-data
...
Extracting /mnist_data/train-images-idx3-ubyte.gz
Extracting /mnist_data/train-labels-idx1-ubyte.gz
Extracting /mnist_data/t10k-images-idx3-ubyte.gz
Extracting /mnist_data/t10k-labels-idx1-ubyte.gz
...
Accuracy at step 960: 0.9753
Accuracy at step 970: 0.9739
Accuracy at step 980: 0.9756
Accuracy at step 990: 0.9777
Adding run metadata for 999
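You can also check the overall job status with arena get. The output below is abbreviated and illustrative; the actual values depend on your cluster:
$ arena get tf-dist-data
STATUS: RUNNING
NAMESPACE: default
TRAINING DURATION: 2m
...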