
Commit

Add local model cache tutorial and update blog (#434)
* Add local model cache doc

Signed-off-by: Dan Sun <[email protected]>

* Add local model cache tutorial and update blog

Signed-off-by: Dan Sun <[email protected]>

* Add vllm section

Signed-off-by: Dan Sun <[email protected]>

---------

Signed-off-by: Dan Sun <[email protected]>
yuzisun authored Dec 25, 2024
1 parent 2ff73bd commit 5bdede6
Showing 9 changed files with 315 additions and 47 deletions.
68 changes: 25 additions & 43 deletions docs/blog/articles/2024-12-13-KServe-0.14-release.md
@@ -13,6 +13,7 @@ Inline with the features documented in issue [#3270](https://github.com/kserve/k
* The clients are asynchronous
* Support for HTTP/2 (via [httpx](https://www.python-httpx.org/) library)
* Support Open Inference Protocol v1 and v2
* Allow clients to send and receive tensor data in binary format for HTTP/REST requests; see the [binary tensor data extension docs](https://kserve.github.io/website/0.14/modelserving/data_plane/binary_tensor_data_extension/).

As usual, the version 0.14.0 of the KServe Python SDK is [published to PyPI](https://pypi.org/project/kserve/0.14.0/) and available to install via `pip install`.
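
For example, to install this specific release:

```bash
pip install kserve==0.14.0
```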

@@ -39,57 +40,28 @@ Modelcars is one implementation option for supporting OCI images for model storage

Using volume mounts based on OCI artifacts is the optimal implementation, but this is only [recently possible since Kubernetes 1.31](https://kubernetes.io/blog/2024/08/16/kubernetes-1-31-image-volume-source/) as a native alpha feature. KServe can now evolve to use this new Kubernetes feature.
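
For reference, the upstream feature surfaces as a new `image` volume source in the pod spec. Below is a minimal sketch, assuming a Kubernetes 1.31+ cluster with the `ImageVolume` feature gate enabled; the image references are hypothetical placeholders and this is not yet how KServe mounts models:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: image-volume-example
spec:
  containers:
    - name: model-server
      image: registry.example.com/model-server:latest   # hypothetical serving image
      volumeMounts:
        - name: model
          mountPath: /mnt/models
          readOnly: true
  volumes:
    - name: model
      image:                                            # alpha image volume source (Kubernetes 1.31)
        reference: registry.example.com/models/my-model:v1   # hypothetical OCI model artifact
        pullPolicy: IfNotPresent
```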

## Introducing model cache
## Introducing Model Cache

With models increasing in size, especially LLMs, pulling from storage each time a pod is created can result in unmanageable start-up times. Although OCI storage also has the benefit of model caching, its capabilities are not flexible, since the management is delegated to the cluster.

The Model Cache was proposed as another alternative to enhance KServe usability with big models, released in KServe v0.14 as an **alpha** feature. It relies on a PV for storing models and provides control over which models to store in the cache. The feature was designed mainly to use the node filesystem as storage. Read the [design document for the details](https://docs.google.com/document/d/1nao8Ws3tonO2zNAzdmXTYa0hECZNoP2SV_z9Zg0PzLA/edit).
The Model Cache was proposed as another alternative to enhance KServe usability with big models, released in KServe v0.14 as an **alpha** feature.
In this release, local node storage is used for storing models, and the `LocalModelCache` custom resource provides control over which models to store in the cache.
The local model cache state can always be rebuilt from the models stored on persistent storage such as a model registry or S3.
Read the [design document for the details](https://docs.google.com/document/d/1nao8Ws3tonO2zNAzdmXTYa0hECZNoP2SV_z9Zg0PzLA/edit).

The model cache is currently disabled by default. To enable it, modify the `localmodel.enabled` field in the `inferenceservice-config` ConfigMap.
![!localmodelcache](../../images/localmodelcache.png)

You start by creating a node group as follows:

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: LocalModelNodeGroup
metadata:
  name: nodegroup1
spec:
  persistentVolumeSpec:
    accessModes:
      - ReadWriteOnce
    volumeMode: Filesystem
    capacity:
      storage: 2Gi
    hostPath:
      path: /models
      type: ""
    persistentVolumeReclaimPolicy: Delete
    storageClassName: standard
  persistentVolumeClaimSpec:
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 2Gi
    storageClassName: standard
    volumeMode: Filesystem
    volumeName: kserve
```

Then, you can specify to store and cache a model with the following resource:

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterLocalModel
metadata:
  name: iris
spec:
  modelSize: 1Gi
  nodeGroup: nodegroup1
  sourceModelUri: gs://kfserving-examples/models/sklearn/1.0/model
```

By caching the models, you get the following benefits:

- Minimize the time it takes for LLM pods to start serving requests.
- Share the same storage across pods scheduled on the same GPU node.
- Scale your AI workload efficiently without worrying about slow model server container startup.
The model cache is currently disabled by default. To enable it, modify the `localmodel.enabled` field in the `inferenceservice-config` ConfigMap.

You can follow the [local model cache tutorial](../../modelserving/storage/modelcache/localmodel.md) to cache LLMs on the local NVMe drives of your GPU nodes and deploy LLMs with `InferenceService`, loading models from the local cache to accelerate the container startup.

<!--
Related tickets:
@@ -122,6 +94,15 @@ Related tickets:
* Implement Huggingface model download in storage initializer [#3584](https://github.com/kserve/kserve/pull/3584)
-->
## Hugging Face vLLM backend changes
* Update vLLM backend to 0.6.1 [#3948](https://github.com/kserve/kserve/pull/3948)
* Support `trust_remote_code` flag for vLLM [#3729](https://github.com/kserve/kserve/pull/3729)
* Support text embedding task in the Hugging Face server [#3743](https://github.com/kserve/kserve/pull/3743)
* Add health endpoint for vLLM backend [#3850](https://github.com/kserve/kserve/pull/3850)
* Added `hostIPC` field to `ServingRuntime` CRD, for supporting more than one GPU in Serverless mode [#3791](https://github.com/kserve/kserve/issues/3791)
* Support shared memory volume for vLLM backend [#3910](https://github.com/kserve/kserve/pull/3910)

## Other Changes

This release also includes several enhancements and changes:
@@ -130,9 +111,10 @@ This release also includes several enhancements and changes:
* New flag to automount the service account token [#3979](https://github.com/kserve/kserve/pull/3979)
* TLS support for inference loggers [#3837](https://github.com/kserve/kserve/issues/3837)
* Allow PVC storage to be mounted in ReadWrite mode via an annotation [#3687](https://github.com/kserve/kserve/issues/3687)
* Support HTTP Headers passing for KServe python custom runtimes [#3669](https://github.com/kserve/kserve/pull/3669)

### What's Changed?
* Added `hostIPC` field to `ServingRuntime` CRD, for supporting more than one GPU in Serverless mode [#3791](https://github.com/kserve/kserve/issues/3791)
* Ray is now an optional dependency [#3834](https://github.com/kserve/kserve/pull/3834)
* Support for Python 3.12 is added, while support for Python 3.8 is removed [#3645](https://github.com/kserve/kserve/pull/3645)

For complete details on the new features and updates, visit our [official release notes](https://github.com/kserve/kserve/releases/tag/v0.14.0).
Binary file added docs/images/localmodelcache.png
25 changes: 25 additions & 0 deletions docs/modelserving/storage/modelcache/jobstoragecontainer.yaml
@@ -0,0 +1,25 @@
apiVersion: "serving.kserve.io/v1alpha1"
kind: ClusterStorageContainer
metadata:
name: hf-hub
spec:
container:
name: storage-initializer
image: kserve/storage-initializer:latest
env:
- name: HF_TOKEN # Option 2 for authenticating with HF_TOKEN
valueFrom:
secretKeyRef:
name: hf-secret
key: HF_TOKEN
optional: false
resources:
requests:
memory: 100Mi
cpu: 100m
limits:
memory: 1Gi
cpu: "1"
supportedUriFormats:
- prefix: hf://
workloadType: localModelDownloadJob
212 changes: 212 additions & 0 deletions docs/modelserving/storage/modelcache/localmodel.md
@@ -0,0 +1,212 @@
# KServe Local Model Cache

By caching LLMs locally, the `InferenceService` startup time can be greatly improved. For deployments with more than one replica,
the local persistent volume can serve multiple pods with the warmed-up model cache.

- `LocalModelCache` is a KServe custom resource that specifies which model from persistent storage to cache on the local storage of the Kubernetes nodes.
- `LocalModelNodeGroup` is a KServe custom resource that manages the node group for caching the models and the local persistent storage.
- `LocalModelNode` is a KServe custom resource that tracks the status of the models cached on a given node.

In this example, we demonstrate how you can cache models from the HF Hub on the local NVMe disk volumes of your Kubernetes nodes.

## Create the LocalModelNodeGroup

Create the `LocalModelNodeGroup` using a local persistent volume with the specified local NVMe volume path.

- The `storageClassName` should be set to `local-storage`.
- The `nodeAffinity` should specify which nodes cache the model, using a node selector.
- The local path should be specified on the PV as the local storage location for caching the models.

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: LocalModelNodeGroup
metadata:
  name: workers
spec:
  storageLimit: 1.7T
  persistentVolumeClaimSpec:
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 1700G
    storageClassName: local-storage
    volumeMode: Filesystem
    volumeName: models
  persistentVolumeSpec:
    accessModes:
      - ReadWriteOnce
    volumeMode: Filesystem
    capacity:
      storage: 1700G
    local:
      path: /models
    nodeAffinity:
      required:
        nodeSelectorTerms:
          - matchExpressions:
              - key: nvidia.com/gpu-product
                operator: In
                values:
                  - NVIDIA-A100-SXM4-80GB
```
## Configure Local Model Download Job Namespace
Before creating the `LocalModelCache` resource to cache the models, you need to make sure the credentials are configured in the download job namespace.
The download jobs are created in the configured namespace `kserve-localmodel-jobs`. In this example we are caching models from the HF Hub, so the HF token secret should be created beforehand in the same namespace,
along with the storage container configuration.
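
If the namespace does not exist yet, it can be created up front (the name below is the configured job namespace mentioned above):

```bash
kubectl create namespace kserve-localmodel-jobs
```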

Create the HF Hub token secret.
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: hf-secret
  namespace: kserve-localmodel-jobs
type: Opaque
stringData:
  HF_TOKEN: xxxx # fill in the hf hub token
```

Create the HF Hub `ClusterStorageContainer`, which refers to the HF Hub secret.

```yaml
apiVersion: "serving.kserve.io/v1alpha1"
kind: ClusterStorageContainer
metadata:
  name: hf-hub
spec:
  container:
    name: storage-initializer
    image: kserve/storage-initializer:latest
    env:
      - name: HF_TOKEN # Option 2 for authenticating with HF_TOKEN
        valueFrom:
          secretKeyRef:
            name: hf-secret
            key: HF_TOKEN
            optional: false
    resources:
      requests:
        memory: 100Mi
        cpu: 100m
      limits:
        memory: 1Gi
        cpu: "1"
  supportedUriFormats:
    - prefix: hf://
  workloadType: localModelDownloadJob
```
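
Both manifests can then be applied to the cluster. The file names below assume you saved them as the example files shipped with this tutorial (`secret.yaml` and `jobstoragecontainer.yaml` under `docs/modelserving/storage/modelcache/`):

```bash
kubectl apply -f secret.yaml
kubectl apply -f jobstoragecontainer.yaml
```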


## Create the LocalModelCache

Create the `LocalModelCache`, specifying the source model storage URI from which the models are pre-downloaded to the local NVMe volumes to warm up the cache.

- `sourceModelUri` is the persistent storage location from which the model is downloaded for the local cache.
- `nodeGroups` indicates which nodes should cache the model.


```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: LocalModelCache
metadata:
  name: meta-llama3-8b-instruct
spec:
  sourceModelUri: hf://meta-llama/meta-llama-3-8b-instruct
  modelSize: 10Gi
  nodeGroups:
    - workers
```
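
Apply the manifest (saved here as `localmodelcache.yaml`, matching the example file that accompanies this tutorial):

```bash
kubectl apply -f localmodelcache.yaml
```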

After `LocalModelCache` is created, KServe creates the download jobs on each node in the group to cache the model in local storage.

```bash
kubectl get jobs meta-llama3-8b-instruct-kind-worker -n kserve-localmodel-jobs
NAME                                  STATUS     COMPLETIONS   DURATION   AGE
meta-llama3-8b-instruct-kind-worker   Complete   1/1           4m21s      5d17h
```

The download job is created using the provisioned PV/PVC.
```bash
kubectl get pvc meta-llama3-8b-instruct -n kserve-localmodel-jobs
NAME                      STATUS   VOLUME                             CAPACITY   ACCESS MODES   STORAGECLASS    VOLUMEATTRIBUTESCLASS   AGE
meta-llama3-8b-instruct   Bound    meta-llama3-8b-instruct-download   10Gi       RWO            local-storage   <unset>                 9h
```

## Check the LocalModelCache Status

`LocalModelCache` shows the model download status for each node in the group.

```bash
kubectl get localmodelcache meta-llama3-8b-instruct -oyaml
```
```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: LocalModelCache
metadata:
  name: meta-llama3-8b-instruct
spec:
  modelSize: 10Gi
  nodeGroups:
    - workers
  sourceModelUri: hf://meta-llama/meta-llama-3-8b-instruct
status:
  copies:
    available: 1
    total: 1
  nodeStatus:
    kind-worker: NodeDownloaded
```

`LocalModelNode` shows the download status of each model expected to be cached on the given node.

```bash
kubectl get localmodelnode kind-worker -oyaml
```

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: LocalModelNode
metadata:
  name: kind-worker
spec:
  localModels:
    - modelName: meta-llama3-8b-instruct
      sourceModelUri: hf://meta-llama/meta-llama-3-8b-instruct
status:
  modelStatus:
    meta-llama3-8b-instruct: ModelDownloaded
```

## Deploy InferenceService using the LocalModelCache

Finally, you can deploy the LLM with `InferenceService` using the local model cache, provided the model has previously been cached
by a `LocalModelCache` resource whose source model URI matches the `InferenceService` storage URI.

The model cache is currently disabled by default. To enable it, modify the `localmodel.enabled` field in the `inferenceservice-config` ConfigMap.
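
A minimal sketch of enabling it is shown below, assuming the local model settings live in a JSON-valued `localModel` data key of the ConfigMap; check the ConfigMap shipped with your KServe version for the exact key name and the remaining defaults before patching:

```bash
# Merge-patch the KServe ConfigMap to turn on the local model cache.
# The key name and JSON layout are assumptions; verify against your installed inferenceservice-config.
kubectl patch configmap inferenceservice-config -n kserve --type merge \
  -p '{"data":{"localModel":"{\"enabled\": true, \"jobNamespace\": \"kserve-localmodel-jobs\"}"}}'
```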

=== "Yaml"

    ```yaml
    kubectl apply -f - <<EOF
    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: huggingface-llama3
    spec:
      predictor:
        model:
          modelFormat:
            name: huggingface
          args:
            - --model_name=llama3
            - --model_id=meta-llama/meta-llama-3-8b-instruct
          storageUri: hf://meta-llama/meta-llama-3-8b-instruct
          resources:
            limits:
              cpu: "6"
              memory: 24Gi
              nvidia.com/gpu: "1"
            requests:
              cpu: "6"
              memory: 24Gi
              nvidia.com/gpu: "1"
    EOF
    ```
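
Once the `InferenceService` is ready, you can send a test request. The sketch below assumes the Hugging Face runtime's OpenAI-compatible completions endpoint and uses placeholder `INGRESS_HOST`/`INGRESS_PORT` values for your ingress setup:

```bash
SERVICE_HOSTNAME=$(kubectl get inferenceservice huggingface-llama3 -o jsonpath='{.status.url}' | cut -d "/" -f 3)

curl -v http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/completions \
  -H "Host: ${SERVICE_HOSTNAME}" \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3", "prompt": "Write a short poem about caching models locally.", "max_tokens": 64}'
```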
9 changes: 9 additions & 0 deletions docs/modelserving/storage/modelcache/localmodelcache.yaml
@@ -0,0 +1,9 @@
apiVersion: serving.kserve.io/v1alpha1
kind: LocalModelCache
metadata:
  name: meta-llama3-8b-instruct
spec:
  sourceModelUri: hf://meta-llama/meta-llama-3-8b-instruct
  modelSize: 10Gi
  nodeGroups:
    - workers
8 changes: 8 additions & 0 deletions docs/modelserving/storage/modelcache/secret.yaml
@@ -0,0 +1,8 @@
apiVersion: v1
kind: Secret
metadata:
  name: hf-secret
  namespace: kserve-localmodel-jobs
type: Opaque
stringData:
  HF_TOKEN: xxxx # fill in the hf hub token
33 changes: 33 additions & 0 deletions docs/modelserving/storage/modelcache/storage.yaml
@@ -0,0 +1,33 @@
apiVersion: serving.kserve.io/v1alpha1
kind: LocalModelNodeGroup
metadata:
  name: workers
spec:
  storageLimit: 10Gi
  persistentVolumeClaimSpec:
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 10Gi
    storageClassName: local-storage
    volumeMode: Filesystem
    volumeName: models
  persistentVolumeSpec:
    accessModes:
      - ReadWriteOnce
    volumeMode: Filesystem
    capacity:
      storage: 10Gi
    local:
      path: /models
    persistentVolumeReclaimPolicy: Delete
    storageClassName: local-storage
    nodeAffinity:
      required:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                  - kind-worker
