KServe job with custom serving runtime
This guide walks through the steps to deploy and serve a custom serving runtime with KServe.
1. Setup
Follow the KServe Guide to install KServe.
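If you do not have KServe installed yet, the upstream quick-install script is one option. A minimal sketch, assuming a Serverless-mode install (the pinned release below is an assumption; check the KServe documentation for the current version):

$ curl -s "https://raw.githubusercontent.com/kserve/kserve/release-0.11/hack/quick_install.sh" | bash

This installs KServe together with its Istio, Knative, and cert-manager dependencies.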
2. Submit your serving job to KServe
First, create a PVC named 'training-data' and download the 'bloom-560m' model from Hugging Face into it, as sketched below.
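A minimal sketch of those two steps (the PVC size and the one-off downloader pod are illustrative assumptions; adjust them to your cluster):

$ cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 20Gi
---
# One-off helper pod that downloads bloom-560m into the PVC.
apiVersion: v1
kind: Pod
metadata:
  name: model-downloader
spec:
  restartPolicy: Never
  containers:
  - name: downloader
    image: python:3.10
    command: ["bash", "-c"]
    args:
    - pip install -q huggingface_hub &&
      python -c "from huggingface_hub import snapshot_download;
      snapshot_download('bigscience/bloom-560m', local_dir='/mnt/models/bloom-560m')"
    volumeMounts:
    - name: model-store
      mountPath: /mnt/models
  volumes:
  - name: model-store
    persistentVolumeClaim:
      claimName: training-data
EOF

After the pod completes, the model files sit under bloom-560m/ at the PVC root, which the predictor below reads as /mnt/models/bloom-560m.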
Then deploy an InferenceService with a predictor that loads the BLOOM model with text-generation-inference.
$ arena serve kserve \
--name=bloom-560m \
--image=ghcr.io/huggingface/text-generation-inference:1.0.2 \
--gpus=1 \
--cpu=12 \
--memory=50Gi \
--port=8080 \
--env=STORAGE_URI=pvc://training-data \
"text-generation-launcher --disable-custom-kernels --model-id /mnt/models/bloom-560m --num-shard 1 -p 8080"
inferenceservice.serving.kserve.io/bloom-560m created
INFO[0010] The Job bloom-560m has been submitted successfully
INFO[0010] You can run `arena serve get bloom-560m --type kserve -n default` to check the job status
3. Check the status of the KServe job
$ arena serve list
NAME        TYPE    VERSION  DESIRED  AVAILABLE  ADDRESS                                      PORTS
bloom-560m  KServe  00001    1        1          http://bloom-560m.default-group.example.com  :80
$ arena serve get bloom-560m
Name: bloom-560m
Namespace: default
Type: KServe
Version: 00001
Desired: 1
Available: 1
Age: 7m
Address: http://bloom-560m.default.example.com
Port: :80
GPU: 1
LatestRevision: bloom-560m-predictor-00001
LatestPrecent: 100
Instances:
NAME                                                    STATUS   AGE  READY  RESTARTS  GPU  NODE
----                                                    ------   ---  -----  --------  ---  ----
bloom-560m-predictor-00001-deployment-56b8bdbf87-sg8v8  Running  7m   2/2    0         1    192.168.5.241
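You can also inspect the underlying InferenceService directly with kubectl; this is plain KServe, not arena-specific:

$ kubectl get inferenceservice bloom-560m -n default

The READY column should report True once the predictor revision is up.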
4. Perform inference
You can curl the ingress gateway's external IP, using the Host header to route the request to the InferenceService.
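If INGRESS_HOST is not set yet, resolve it from the ingress gateway service. A common way, assuming a Serverless install with the Istio gateway in the istio-system namespace:

$ export INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}')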
$ curl -H "Host: bloom-560m.default.example.com" http://${INGRESS_HOST}:80/generate \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17}}' \
-H 'Content-Type: application/json'
{"generated_text":" Deep Learning is a new type of machine learning that is used to solve complex problems."}
5. Update the InferenceService with the canary rollout strategy
Add the canaryTrafficPercent field to the predictor component and update the command to use the new model path /mnt/models/bloom-560m-v2.
$ arena serve update kserve \
--name bloom-560m \
--canary-traffic-percent=10 \
"text-generation-launcher --disable-custom-kernels --model-id /mnt/models/bloom-560m-v2 --num-shard 1 -p 8080"
After rolling out the canary model, traffic is split between the latest ready revision 2 and the previously rolled out revision 1.
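To see the split, you can read the traffic targets off the underlying Knative Service; the bloom-560m-predictor name below is inferred from the revision names above:

$ kubectl get ksvc bloom-560m-predictor \
    -o jsonpath='{range .status.traffic[*]}{.revisionName}{"\t"}{.percent}{"\n"}{end}'

This should print bloom-560m-predictor-00002 with 10 and bloom-560m-predictor-00001 with 90.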
$ arena serve get bloom-560m
Name: bloom-560m
Namespace: default
Type: KServe
Version: 00002
Desired: 2
Available: 2
Age: 26m
Address: http://bloom-560m.default.example.com
Port: :80
LatestRevision: bloom-560m-predictor-00002
LatestPrecent: 10
PrevRevision: bloom-560m-predictor-00001
PrevPrecent: 90
Instances:
NAME                                                    STATUS   AGE  READY  RESTARTS  GPU  NODE
----                                                    ------   ---  -----  --------  ---  ----
bloom-560m-predictor-00001-deployment-56b8bdbf87-sg8v8  Running  19m  2/2    0         1    192.168.5.241
bloom-560m-predictor-00002-deployment-84dbb64cc4-647wx  Running  2m   2/2    0         1    192.168.5.239
6. Promote the canary model
If the canary model is healthy and passes your tests, you can promote it by setting --canary-traffic-percent to 100.
$ arena serve update kserve \
--name bloom-560m \
--canary-traffic-percent=100
Now all traffic goes to revision 2, which serves the new model. The pods for revision 1 automatically scale down to 0, since they no longer receive traffic.
$ arena serve get bloom-560m
Name: bloom-560m
Namespace: default
Type: KServe
Version: 00002
Desired: 2
Available: 2
Age: 26m
Address: http://bloom-560m.default.example.com
Port: :80
LatestRevision: bloom-560m-predictor-00002
LatestPrecent: 100
Instances:
NAME                                                    STATUS       AGE  READY  RESTARTS  GPU  NODE
----                                                    ------       ---  -----  --------  ---  ----
bloom-560m-predictor-00001-deployment-56b8bdbf87-sg8v8  Terminating  22m  1/2    0         0    192.168.5.241
bloom-560m-predictor-00002-deployment-84dbb64cc4-647wx  Running      5m   2/2    0         1    192.168.5.239
7. Delete the KServe job
$ arena serve delete bloom-560m
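Note that arena serve delete removes the InferenceService but not the PVC you created in step 2; if the model data is no longer needed, delete it separately:

$ kubectl delete pvc training-data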