
Commit

Add local model cache tutorial and update blog (#434)
* Add local model cache doc

Signed-off-by: Dan Sun <[email protected]>

* Add local model cache tutorial and update blog

Signed-off-by: Dan Sun <[email protected]>

* Add vllm section

Signed-off-by: Dan Sun <[email protected]>

---------

Signed-off-by: Dan Sun <[email protected]>
yuzisun authored Dec 25, 2024
1 parent 2ff73bd commit 5bdede6
Showing 9 changed files with 315 additions and 47 deletions.
68 changes: 25 additions & 43 deletions docs/blog/articles/2024-12-13-KServe-0.14-release.md
@@ -13,6 +13,7 @@ Inline with the features documented in issue [#3270](https://github.com/kserve/k
* The clients are asynchronous
* Support for HTTP/2 (via [httpx](https://www.python-httpx.org/) library)
* Support Open Inference Protocol v1 and v2
* Allow clients to send and receive tensor data in binary format for HTTP/REST requests; see the [binary tensor data extension docs](https://kserve.github.io/website/0.14/modelserving/data_plane/binary_tensor_data_extension/).

As usual, the version 0.14.0 of the KServe Python SDK is [published to PyPI](https://pypi.org/project/kserve/0.14.0/) and available to install via `pip install`.
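
For example, to install this specific release:

```bash
pip install kserve==0.14.0
```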

@@ -39,57 +40,28 @@ Modelcars is one implementation option for supporting OCI images for model storage

Using volume mounts based on OCI artifacts is the optimal implementation, but this is only [recently possible since Kubernetes 1.31](https://kubernetes.io/blog/2024/08/16/kubernetes-1-31-image-volume-source/) as a native alpha feature. KServe can now evolve to use this new Kubernetes feature.
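
For reference, the upstream feature surfaces as a new `image` volume source in the pod spec. Below is a minimal sketch, assuming a Kubernetes 1.31+ cluster with the `ImageVolume` feature gate enabled; the image references are hypothetical placeholders and this is not yet how KServe mounts models:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: image-volume-example
spec:
  containers:
    - name: model-server
      image: registry.example.com/model-server:latest   # hypothetical serving image
      volumeMounts:
        - name: model
          mountPath: /mnt/models
          readOnly: true
  volumes:
    - name: model
      image:                                            # alpha image volume source (Kubernetes 1.31)
        reference: registry.example.com/models/my-model:v1   # hypothetical OCI model artifact
        pullPolicy: IfNotPresent
```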

## Introducing model cache
## Introducing Model Cache

With models increasing in size, especially LLMs, pulling from storage each time a pod is created can result in unmanageable start-up times. Although OCI storage also has the benefit of model caching, its capabilities are not flexible, since the management is delegated to the cluster.

The Model Cache was proposed as another alternative to enhance KServe usability with big models, released in KServe v0.14 as an **alpha** feature. It relies on a PV for storing models and provides control over which models to store in the cache. The feature was designed mainly to use the node filesystem as storage. Read the [design document for the details](https://docs.google.com/document/d/1nao8Ws3tonO2zNAzdmXTYa0hECZNoP2SV_z9Zg0PzLA/edit).
The Model Cache was proposed as another alternative to enhance KServe usability with big models, released in KServe v0.14 as an **alpha** feature.
In this release, local node storage is used for storing models, and the `LocalModelCache` custom resource provides control over which models to store in the cache.
The local model cache state can always be rebuilt from the models stored on persistent storage such as a model registry or S3.
Read the [design document for the details](https://docs.google.com/document/d/1nao8Ws3tonO2zNAzdmXTYa0hECZNoP2SV_z9Zg0PzLA/edit).

The model cache is currently disabled by default. To enable it, modify the `localmodel.enabled` field in the `inferenceservice-config` ConfigMap.
![!localmodelcache](../../images/localmodelcache.png)

You start by creating a node group as follows:

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: LocalModelNodeGroup
metadata:
  name: nodegroup1
spec:
  persistentVolumeSpec:
    accessModes:
      - ReadWriteOnce
    volumeMode: Filesystem
    capacity:
      storage: 2Gi
    hostPath:
      path: /models
      type: ""
    persistentVolumeReclaimPolicy: Delete
    storageClassName: standard
  persistentVolumeClaimSpec:
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 2Gi
    storageClassName: standard
    volumeMode: Filesystem
    volumeName: kserve
```

Then, you can specify to store and cache a model with the following resource:

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterLocalModel
metadata:
  name: iris
spec:
  modelSize: 1Gi
  nodeGroup: nodegroup1
  sourceModelUri: gs://kfserving-examples/models/sklearn/1.0/model
```

By caching the models, you get the following benefits:

- Minimize the time it takes for LLM pods to start serving requests.
- Share the same storage across pods scheduled on the same GPU node.
- Scale your AI workload efficiently without worrying about slow model server container startup.
The model cache is currently disabled by default. To enable it, modify the `localmodel.enabled` field in the `inferenceservice-config` ConfigMap.

You can follow the [local model cache tutorial](../../modelserving/storage/modelcache/localmodel.md) to cache LLMs on the local NVMe drives of your GPU nodes and deploy LLMs with `InferenceService`, loading models from the local cache to accelerate the container startup.

<!--
Related tickets:
@@ -122,6 +94,15 @@ Related tickets:
* Implement Huggingface model download in storage initializer [#3584](https://github.com/kserve/kserve/pull/3584)
-->
## Hugging Face vLLM backend changes
* Update vLLM backend to 0.6.1 [#3948](https://github.com/kserve/kserve/pull/3948)
* Support `trust_remote_code` flag for vLLM [#3729](https://github.com/kserve/kserve/pull/3729)
* Support text embedding task in the Hugging Face server [#3743](https://github.com/kserve/kserve/pull/3743)
* Add health endpoint for vLLM backend [#3850](https://github.com/kserve/kserve/pull/3850)
* Added `hostIPC` field to `ServingRuntime` CRD, for supporting more than one GPU in Serverless mode [#3791](https://github.com/kserve/kserve/issues/3791)
* Support shared memory volume for vLLM backend [#3910](https://github.com/kserve/kserve/pull/3910)

## Other Changes

This release also includes several enhancements and changes:
@@ -130,9 +111,10 @@ This release also includes several enhancements and changes:
* New flag to automount the service account token [#3979](https://github.com/kserve/kserve/pull/3979)
* TLS support for inference loggers [#3837](https://github.com/kserve/kserve/issues/3837)
* Allow PVC storage to be mounted in ReadWrite mode via an annotation [#3687](https://github.com/kserve/kserve/issues/3687)
* Support HTTP Headers passing for KServe python custom runtimes [#3669](https://github.com/kserve/kserve/pull/3669)

### What's Changed?
* Added `hostIPC` field to `ServingRuntime` CRD, for supporting more than one GPU in Serverless mode [#3791](https://github.com/kserve/kserve/issues/3791)
* Ray is now an optional dependency [#3834](https://github.com/kserve/kserve/pull/3834)
* Support for Python 3.12 is added, while support for Python 3.8 is removed [#3645](https://github.com/kserve/kserve/pull/3645)

For complete details on the new features and updates, visit our [official release notes](https://github.com/kserve/kserve/releases/tag/v0.14.0).
Binary file added docs/images/localmodelcache.png
25 changes: 25 additions & 0 deletions docs/modelserving/storage/modelcache/jobstoragecontainer.yaml
@@ -0,0 +1,25 @@
apiVersion: "serving.kserve.io/v1alpha1"
kind: ClusterStorageContainer
metadata:
name: hf-hub
spec:
container:
name: storage-initializer
image: kserve/storage-initializer:latest
env:
- name: HF_TOKEN # Option 2 for authenticating with HF_TOKEN
valueFrom:
secretKeyRef:
name: hf-secret
key: HF_TOKEN
optional: false
resources:
requests:
memory: 100Mi
cpu: 100m
limits:
memory: 1Gi
cpu: "1"
supportedUriFormats:
- prefix: hf://
workloadType: localModelDownloadJob
212 changes: 212 additions & 0 deletions docs/modelserving/storage/modelcache/localmodel.md
@@ -0,0 +1,212 @@
# KServe Local Model Cache

By caching LLMs locally, the `InferenceService` startup time can be greatly improved. For deployments with more than one replica,
the local persistent volume can serve multiple pods with the warmed-up model cache.

- `LocalModelCache` is a KServe custom resource that specifies which model from persistent storage to cache on the local storage of the Kubernetes nodes.
- `LocalModelNodeGroup` is a KServe custom resource that manages the node group for caching the models and the local persistent storage.
- `LocalModelNode` is a KServe custom resource that tracks the status of the models cached on a given node.

In this example, we demonstrate how you can cache models from the HF Hub on the local NVMe disk volumes of your Kubernetes nodes.

## Create the LocalModelNodeGroup

Create the `LocalModelNodeGroup` using a local persistent volume with the specified local NVMe volume path.

- The `storageClassName` should be set to `local-storage`.
- The `nodeAffinity` should specify which nodes cache the model, using a node selector.
- The local path should be specified on the PV as the local storage location for caching the models.

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: LocalModelNodeGroup
metadata:
  name: workers
spec:
  storageLimit: 1.7T
  persistentVolumeClaimSpec:
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 1700G
    storageClassName: local-storage
    volumeMode: Filesystem
    volumeName: models
  persistentVolumeSpec:
    accessModes:
      - ReadWriteOnce
    volumeMode: Filesystem
    capacity:
      storage: 1700G
    local:
      path: /models
    nodeAffinity:
      required:
        nodeSelectorTerms:
          - matchExpressions:
              - key: nvidia.com/gpu-product
                operator: In
                values:
                  - NVIDIA-A100-SXM4-80GB
```
## Configure Local Model Download Job Namespace
Before creating the `LocalModelCache` resource to cache the models, you need to make sure the credentials are configured in the download job namespace.
The download jobs are created in the configured namespace `kserve-localmodel-jobs`. In this example we are caching models from the HF Hub, so the HF token secret should be created beforehand in the same namespace,
along with the storage container configuration.
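
If the namespace does not exist yet, it can be created up front (the name below is the configured job namespace mentioned above):

```bash
kubectl create namespace kserve-localmodel-jobs
```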

Create the HF Hub token secret.
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: hf-secret
  namespace: kserve-localmodel-jobs
type: Opaque
stringData:
  HF_TOKEN: xxxx # fill in the hf hub token
```

Create the HF Hub `ClusterStorageContainer`, which refers to the HF Hub secret.

```yaml
apiVersion: "serving.kserve.io/v1alpha1"
kind: ClusterStorageContainer
metadata:
  name: hf-hub
spec:
  container:
    name: storage-initializer
    image: kserve/storage-initializer:latest
    env:
      - name: HF_TOKEN # Option 2 for authenticating with HF_TOKEN
        valueFrom:
          secretKeyRef:
            name: hf-secret
            key: HF_TOKEN
            optional: false
    resources:
      requests:
        memory: 100Mi
        cpu: 100m
      limits:
        memory: 1Gi
        cpu: "1"
  supportedUriFormats:
    - prefix: hf://
  workloadType: localModelDownloadJob
```
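
Both manifests can then be applied to the cluster. The file names below assume you saved them as the example files shipped with this tutorial (`secret.yaml` and `jobstoragecontainer.yaml` under `docs/modelserving/storage/modelcache/`):

```bash
kubectl apply -f secret.yaml
kubectl apply -f jobstoragecontainer.yaml
```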


## Create the LocalModelCache

Create the `LocalModelCache`, specifying the source model storage URI from which the models are pre-downloaded to the local NVMe volumes to warm up the cache.

- `sourceModelUri` is the persistent storage location from which the model is downloaded for the local cache.
- `nodeGroups` indicates which nodes should cache the model.


```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: LocalModelCache
metadata:
  name: meta-llama3-8b-instruct
spec:
  sourceModelUri: hf://meta-llama/meta-llama-3-8b-instruct
  modelSize: 10Gi
  nodeGroups:
    - workers
```
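
Apply the manifest (saved here as `localmodelcache.yaml`, matching the example file that accompanies this tutorial):

```bash
kubectl apply -f localmodelcache.yaml
```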

After `LocalModelCache` is created, KServe creates the download jobs on each node in the group to cache the model in local storage.

```bash
kubectl get jobs meta-llama3-8b-instruct-kind-worker -n kserve-localmodel-jobs
NAME                                  STATUS     COMPLETIONS   DURATION   AGE
meta-llama3-8b-instruct-kind-worker   Complete   1/1           4m21s      5d17h
```

The download job is created using the provisioned PV/PVC.
```bash
kubectl get pvc meta-llama3-8b-instruct -n kserve-localmodel-jobs
NAME                      STATUS   VOLUME                             CAPACITY   ACCESS MODES   STORAGECLASS    VOLUMEATTRIBUTESCLASS   AGE
meta-llama3-8b-instruct   Bound    meta-llama3-8b-instruct-download   10Gi       RWO            local-storage   <unset>                 9h
```

## Check the LocalModelCache Status

`LocalModelCache` shows the model download status for each node in the group.

```bash
kubectl get localmodelcache meta-llama3-8b-instruct -oyaml
```
```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: LocalModelCache
metadata:
  name: meta-llama3-8b-instruct
spec:
  modelSize: 10Gi
  nodeGroups:
    - workers
  sourceModelUri: hf://meta-llama/meta-llama-3-8b-instruct
status:
  copies:
    available: 1
    total: 1
  nodeStatus:
    kind-worker: NodeDownloaded
```

`LocalModelNode` shows the download status of each model expected to be cached on the given node.

```bash
kubectl get localmodelnode kind-worker -oyaml
```

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: LocalModelNode
metadata:
  name: kind-worker
spec:
  localModels:
    - modelName: meta-llama3-8b-instruct
      sourceModelUri: hf://meta-llama/meta-llama-3-8b-instruct
status:
  modelStatus:
    meta-llama3-8b-instruct: ModelDownloaded
```

## Deploy InferenceService using the LocalModelCache

Finally, you can deploy the LLM with `InferenceService` using the local model cache, provided the model has previously been cached
by a `LocalModelCache` resource whose source model URI matches the `InferenceService` storage URI.

The model cache is currently disabled by default. To enable it, modify the `localmodel.enabled` field in the `inferenceservice-config` ConfigMap.
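
A minimal sketch of enabling it is shown below, assuming the local model settings live in a JSON-valued `localModel` data key of the ConfigMap; check the ConfigMap shipped with your KServe version for the exact key name and the remaining defaults before patching:

```bash
# Merge-patch the KServe ConfigMap to turn on the local model cache.
# The key name and JSON layout are assumptions; verify against your installed inferenceservice-config.
kubectl patch configmap inferenceservice-config -n kserve --type merge \
  -p '{"data":{"localModel":"{\"enabled\": true, \"jobNamespace\": \"kserve-localmodel-jobs\"}"}}'
```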

=== "Yaml"

    ```yaml
    kubectl apply -f - <<EOF
    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: huggingface-llama3
    spec:
      predictor:
        model:
          modelFormat:
            name: huggingface
          args:
            - --model_name=llama3
            - --model_id=meta-llama/meta-llama-3-8b-instruct
          storageUri: hf://meta-llama/meta-llama-3-8b-instruct
          resources:
            limits:
              cpu: "6"
              memory: 24Gi
              nvidia.com/gpu: "1"
            requests:
              cpu: "6"
              memory: 24Gi
              nvidia.com/gpu: "1"
    EOF
    ```
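
Once the `InferenceService` is ready, you can send a test request. The sketch below assumes the Hugging Face runtime's OpenAI-compatible completions endpoint and uses placeholder `INGRESS_HOST`/`INGRESS_PORT` values for your ingress setup:

```bash
SERVICE_HOSTNAME=$(kubectl get inferenceservice huggingface-llama3 -o jsonpath='{.status.url}' | cut -d "/" -f 3)

curl -v http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/completions \
  -H "Host: ${SERVICE_HOSTNAME}" \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3", "prompt": "Write a short poem about caching models locally.", "max_tokens": 64}'
```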
9 changes: 9 additions & 0 deletions docs/modelserving/storage/modelcache/localmodelcache.yaml
@@ -0,0 +1,9 @@
apiVersion: serving.kserve.io/v1alpha1
kind: LocalModelCache
metadata:
  name: meta-llama3-8b-instruct
spec:
  sourceModelUri: hf://meta-llama/meta-llama-3-8b-instruct
  modelSize: 10Gi
  nodeGroups:
    - workers
8 changes: 8 additions & 0 deletions docs/modelserving/storage/modelcache/secret.yaml
@@ -0,0 +1,8 @@
apiVersion: v1
kind: Secret
metadata:
  name: hf-secret
  namespace: kserve-localmodel-jobs
type: Opaque
stringData:
  HF_TOKEN: xxxx # fill in the hf hub token
33 changes: 33 additions & 0 deletions docs/modelserving/storage/modelcache/storage.yaml
@@ -0,0 +1,33 @@
apiVersion: serving.kserve.io/v1alpha1
kind: LocalModelNodeGroup
metadata:
  name: workers
spec:
  storageLimit: 10Gi
  persistentVolumeClaimSpec:
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 10Gi
    storageClassName: local-storage
    volumeMode: Filesystem
    volumeName: models
  persistentVolumeSpec:
    accessModes:
      - ReadWriteOnce
    volumeMode: Filesystem
    capacity:
      storage: 10Gi
    local:
      path: /models
    persistentVolumeReclaimPolicy: Delete
    storageClassName: local-storage
    nodeAffinity:
      required:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                  - kind-worker
