Data Processing Ray library updates, readme prereq updates #90

Merged · 4 commits · Jan 13, 2025
@@ -14,7 +14,7 @@

image:
repository: rayproject/ray
tag: 2.7.1-py310-gpu
tag: 2.40.0-py312-gpu
pullPolicy: IfNotPresent

nameOverride: "kuberay"
@@ -24,7 +24,7 @@ imagePullSecrets: []

head:
groupName: headgroup
rayVersion: 2.7.1
rayVersion: 2.40.0
enableInTreeAutoscaling: true
autoscalerOptions:
resources:
@@ -48,7 +48,7 @@ head:
num-cpus: '0' # Prevent tasks from being scheduled on the head
image:
repository: rayproject/ray
tag: 2.7.1-py310
tag: 2.40.0-py312
pullPolicy: IfNotPresent
containerEnv:
- name: RAY_memory_monitor_refresh_ms
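The tag and `rayVersion` changes above move the KubeRay cluster from Ray 2.7.1 on Python 3.10 to Ray 2.40.0 on Python 3.12. A quick way to confirm the rollout picked up the new image — a minimal sketch, assuming `kubectl` access to the namespace where the RayCluster runs and the standard KubeRay labels:

```shell
# Hedged sketch: list RayCluster resources and the image the head pod is actually running.
kubectl get raycluster --all-namespaces

kubectl get pods --all-namespaces -l ray.io/node-type=head \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'
```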
@@ -9,6 +9,8 @@ Depending on the infrastructure you provisioned, the data preparation step takes
- This guide was developed to be run on the [playground AI/ML platform](/platforms/gke-aiml/playground/README.md). If you are using a different environment, the scripts and manifests will need to be modified for that environment.
- A bucket containing the processed data from the [Data Processing example](/use-cases/model-fine-tuning-pipeline/data-processing/ray/README.md)

> NOTE: If you did not execute the data processing example, follow [these instructions](/use-cases/prerequisites/processed-data.md) to load the processed data into the bucket.

## Preparation

- Accept Llama 3.1 on Vertex AI license agreement terms
@@ -46,37 +48,6 @@ Depending on the infrastructure you provisioned, the data preparation step takes

> The Llama 3.1 API on Vertex AI is in preview; it is only available in `us-central1`

## Data Preparation (Optional)

To execute this scenario without going through the [Data Processing example](/use-cases/model-fine-tuning-pipeline/data-processing/ray/README.md), we have a processed dataset that you can use.

Select a path between **Full dataset** and **Smaller dataset (subset)**. The smaller dataset is a quicker way to experience the pipeline, but it will produce a less than ideal fine-tuned model.

- If you would like to use the **Smaller dataset (subset)**, set the variable below.

```shell
DATASET_SUBSET=-subset
```

- Download the Hugging Face CLI library

```shell
pip3 install -U "huggingface_hub[cli]==0.26.2"
```

- Download the processed dataset CSV file from Hugging Face and copy it into the GCS bucket

```shell
PROCESSED_DATA_REPO=gcp-acp/flipkart-preprocessed${DATASET_SUBSET}

${HOME}/.local/bin/huggingface-cli download --repo-type dataset ${PROCESSED_DATA_REPO} --local-dir ./temp

gcloud storage cp ./temp/flipkart.csv \
gs://${MLP_DATA_BUCKET}/flipkart_preprocessed_dataset/flipkart.csv && \

rm ./temp/flipkart.csv
```

## Build the container image

- Build the container image using Cloud Build and push the image to Artifact Registry
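The exact build command for this step is collapsed in the diff; a typical Cloud Build invocation looks roughly like the sketch below, where the Artifact Registry path is a placeholder rather than this repository's real image name.

```shell
# Hedged sketch — substitute your own region, Artifact Registry repository, and image name.
# ${MLP_PROJECT_ID} comes from the environment configuration sourced during Preparation.
gcloud builds submit . \
  --tag us-central1-docker.pkg.dev/${MLP_PROJECT_ID}/llm-finetuning/data-preparation:latest
```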
@@ -24,6 +24,7 @@ The data processing step takes approximately 18-20 minutes.
## Prerequisites

- This guide was developed to be run on the [playground AI/ML platform](/platforms/gke-aiml/playground/README.md). If you are using a different environment, the scripts and manifests will need to be modified for that environment.
- The raw data that will be processed in this example. Follow [these instructions](/use-cases/prerequisites/raw-data.md) to load the data into the bucket.

## Preparation

@@ -43,54 +44,6 @@ The data processing step takes approximately 18-20 minutes.

> You should see the various variables populated with the information specific to your environment.

## Data Preparation

Select a path between **Full dataset** and **Smaller dataset (subset)**. The smaller dataset is a quicker way to experience the pipeline, but it will produce a less than ideal fine-tuned model.

- **Full dataset** Download the raw data CSV file from [Kaggle](https://kaggle.com) and store it in the bucket created in the previous step.

- You will need the Kaggle CLI to download the file. The Kaggle CLI can be installed using the following command in Cloud Shell:

```shell
pip3 install --user kaggle
```

For more details, you can read these [instructions](https://github.com/Kaggle/kaggle-api#installation).

- To use the CLI you must create an API token. To create the token, register on [kaggle.com](https://kaggle.com) if you don't already have an account. Go to `kaggle.com/settings > API > Create New Token`; the downloaded file should be stored in `$HOME/.kaggle/kaggle.json`. Note: you will have to create the directory `$HOME/.kaggle`. After the configuration is done, you can run the following command to download the dataset and copy it to the GCS bucket:

```shell
kaggle datasets download --unzip PromptCloudHQ/flipkart-products && \

gcloud storage cp flipkart_com-ecommerce_sample.csv \
gs://${MLP_DATA_BUCKET}/flipkart_raw_dataset/flipkart_com-ecommerce_sample.csv && \

rm flipkart_com-ecommerce_sample.csv
```

- Alternatively, you can [download the dataset](https://www.kaggle.com/datasets/PromptCloudHQ/flipkart-products) directly from the Kaggle website and copy it to the bucket.

- **Smaller dataset (subset)** Download the raw data CSV from Hugging Face.

- Download the Hugging Face CLI library

```shell
pip3 install -U "huggingface_hub[cli]==0.26.2"
```

- Download the preprocessed dataset CSV file from Hugging Face and copy it into the GCS bucket

```shell
RAW_DATA_REPO=gcp-acp/flipkart-raw-subset

${HOME}/.local/bin/huggingface-cli download --repo-type dataset ${RAW_DATA_REPO} --local-dir ./temp

gcloud storage cp ./temp/flipkart_com-ecommerce_sample.csv \
gs://${MLP_DATA_BUCKET}/flipkart_raw_dataset/flipkart_com-ecommerce_sample.csv && \

rm ./temp/flipkart_com-ecommerce_sample.csv
```

## Build the container image

- Build the container image using Cloud Build and push the image to Artifact Registry
@@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.

FROM python:3.10.14-slim-bullseye as build-stage
FROM python:3.12.8-slim-bullseye as build-stage

ENV PATH=/venv/bin:${PATH}
ENV PYTHONDONTWRITEBYTECODE=1
@@ -262,10 +262,11 @@ def run_remote():
# Ray runtime env
runtime_env = {
"pip": [
"google-cloud-storage==2.16.0",
"spacy==3.7.4",
"jsonpickle==3.0.3",
"pandas==2.2.1",
"google-cloud-storage==2.19.0",
"spacy==3.7.6",
"jsonpickle==4.0.1",
"pandas==2.2.3",
"pydantic==2.10.5",
],
"env_vars": {"PIP_NO_CACHE_DIR": "1", "PIP_DISABLE_PIP_VERSION_CHECK": "1"},
}
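These pins are installed on the Ray workers at runtime through the job's `runtime_env`, so they need to be compatible with the cluster's Ray 2.40.0 / Python 3.12 image. The same pinned set could also be supplied at submission time via the Ray Jobs CLI — a hedged sketch with placeholder address and entrypoint, not the command this repository actually uses:

```shell
# Hedged sketch: submit a job whose runtime_env installs the same pinned packages on the workers.
# The dashboard address and the entrypoint script are placeholders.
ray job submit \
  --address http://ray-cluster-kuberay-head-svc:8265 \
  --working-dir . \
  --runtime-env-json '{"pip": ["google-cloud-storage==2.19.0", "spacy==3.7.6", "jsonpickle==4.0.1", "pandas==2.2.3", "pydantic==2.10.5"]}' \
  -- python preprocessing.py
```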
@@ -1,9 +1,9 @@
fsspec==2024.3.1
gcsfs==2024.3.1
google-cloud-storage==2.16.0
jsonpickle==3.0.3
pandas==2.2.1
ray==2.7.1
ray[client]==2.7.1
spacy==3.7.4
fsspec==2024.12.0
gcsfs==2024.12.0
google-cloud-storage==2.19.0
jsonpickle==4.0.1
pandas==2.2.3
ray==2.40.0
ray[client]==2.40.0
spacy==3.7.6
thejsonlogger==0.0.3
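The `ray` and `ray[client]` pins above must match the cluster's `rayVersion` (2.40.0), since Ray Client generally requires the client and server versions to agree, and the remaining packages mirror the `runtime_env` pins in the job code. A minimal client-side check, assuming the requirements are installed into a Python 3.12 environment:

```shell
# Hedged sketch: install the pinned requirements and confirm the Ray and Python versions.
pip3 install -r requirements.txt
python3 -c "import ray, sys; print(ray.__version__, sys.version.split()[0])"
```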
35 changes: 3 additions & 32 deletions use-cases/model-fine-tuning-pipeline/fine-tuning/pytorch/README.md
@@ -9,7 +9,9 @@ The resulting fine-tuned model is, Built with Meta Llama 3.1, using the data
## Prerequisites

- This guide was developed to be run on the [playground AI/ML platform](/platforms/gke-aiml/playground/README.md). If you are using a different environment, the scripts and manifests will need to be modified for that environment.
- A bucket containing the processed and prepared data from the [Data Preparation example](/use-cases/model-fine-tuning-pipeline/data-preparation/gemma-it/README.md)
- A bucket containing the prepared data from the [Data Preparation example](/use-cases/model-fine-tuning-pipeline/data-preparation/gemma-it/README.md)

> NOTE: If you did not execute the data preparation example, follow [these instructions](/use-cases/prerequisites/prepared-data.md) to load the dataset into the bucket.

## Preparation

@@ -37,37 +39,6 @@ The resulting fine-tuned model is, Built with Meta Llama 3.1, using the data
HF_TOKEN=
```

## Data Preparation

To execute this scenario without going through the [Data Preparation example](/use-cases/model-fine-tuning-pipeline/data-preparation/gemma-it/README.md), we have a prepared dataset that you can use.

Select a path between **Full dataset** and **Smaller dataset (subset)**. The smaller dataset is a quicker way to experience the pipeline, but it will produce a less than ideal fine-tuned model.

- If you would like to use the **Smaller dataset (subset)**, set the variable below.

```sh
DATASET_SUBSET=-subset
```

- Download the Hugging Face CLI library

```sh
pip3 install -U "huggingface_hub[cli]==0.26.2"
```

- Download the prepared dataset from Hugging Face and copy it into the GCS bucket

```sh
DATAPREP_REPO=gcp-acp/flipkart-dataprep${DATASET_SUBSET}

${HOME}/.local/bin/huggingface-cli download --repo-type dataset ${DATAPREP_REPO} --local-dir ./temp

gcloud storage cp -R ./temp/* \
gs://${MLP_DATA_BUCKET}/dataset/output && \

rm -rf ./temp
```

## Build the container image

- Build the container image using Cloud Build and push the image to Artifact Registry
79 changes: 6 additions & 73 deletions use-cases/model-fine-tuning-pipeline/model-eval/README.md
@@ -8,8 +8,14 @@ for this activity, the first is to send prompts to the fine-tuned model, the sec
## Prerequisites

- This guide was developed to be run on the [playground AI/ML platform](/platforms/gke-aiml/playground/README.md). If you are using a different environment, the scripts and manifests will need to be modified for that environment.
- A bucket containing the prepared data from the [Data Preparation example](/use-cases/model-fine-tuning-pipeline/data-preparation/gemma-it/README.md)

> NOTE: If you did not execute the data preparation example, follow [these instructions](/use-cases/prerequisites/prepared-data.md) to load the dataset into the bucket.

- A bucket containing the model weights from the [Fine tuning example](/use-cases/model-fine-tuning-pipeline/fine-tuning/pytorch/README.md)

> NOTE: If you did not execute the fine-tuning example, follow [these instructions](/use-cases/prerequisites/fine-tuned-model.md) to load the model into the bucket.

## Preparation

- Clone the repository and change directory to the guide directory
@@ -28,79 +34,6 @@ for this activity, the first is to send prompts to the fine-tuned model, the sec

> You should see the various variables populated with the information specific to your environment.

## Data Preparation (Optional)

To execute this scenario without going through the [Fine tuning example](/use-cases/model-fine-tuning-pipeline/fine-tuning/pytorch/README.md), we have a prepared dataset and a fine-tuned model that you can use.

Select a path between **Full dataset** and **Smaller dataset (subset)**. The smaller dataset is a quicker way to experience the pipeline, but the evaluation results would be on a smaller sample.

- If you would like to use the **Smaller dataset (subset)**, set the variable below.

```sh
DATASET_SUBSET=-subset
```

- Download the Hugging Face CLI library

```sh
pip3 install -U "huggingface_hub[cli]==0.26.2"
```

- Download the prepared dataset from Hugging Face and copy it into the GCS bucket

```sh
DATAPREP_REPO=gcp-acp/flipkart-dataprep${DATASET_SUBSET}

${HOME}/.local/bin/huggingface-cli download --repo-type dataset ${DATAPREP_REPO} --local-dir ./temp

gcloud storage cp -R ./temp/* \
gs://${MLP_DATA_BUCKET}/dataset/output && \

rm -rf ./temp
```

- Download the fine-tuned model from Hugging Face and copy it into the GCS bucket.

> NOTE: Due to the limitations of Cloud Shell’s storage and the size of our model, we need to run this job on the cluster to perform the transfer to GCS.

- Get credentials for the GKE cluster

```sh
gcloud container fleet memberships get-credentials ${MLP_CLUSTER_NAME} --project ${MLP_PROJECT_ID}
```

- Replace the respective variables required for the job

```sh
MODEL_REPO=gcp-acp/Llama-gemma-2-9b-it-ft

sed \
-i -e "s|V_KSA|${MLP_MODEL_EVALUATION_KSA}|" \
-i -e "s|V_BUCKET|${MLP_MODEL_BUCKET}|" \
-i -e "s|V_MODEL_REPO|${MODEL_REPO}|" \
manifests/transfer-to-gcs.yaml
```

- Deploy the job

```sh
kubectl apply --namespace ${MLP_KUBERNETES_NAMESPACE} \
-f manifests/transfer-to-gcs.yaml
```

- Trigger the wait for job completion (the job will take ~5 minutes to complete)

```sh
kubectl --namespace ${MLP_KUBERNETES_NAMESPACE} wait \
--for=condition=complete --timeout=900s job/transfer-to-gcs
```

- Example output of the job completion

```sh
job.batch/transfer-to-gcs condition met
```

## Build the container image

- Build the container image using Cloud Build and push the image to Artifact Registry
2 changes: 1 addition & 1 deletion use-cases/prerequisites/prepared-data.md
@@ -24,7 +24,7 @@ Select a path between **Full dataset** and **Smaller dataset (subset)**. The sma
- Download the Hugging Face CLI library

```sh
pip3 install -U "huggingface_hub[cli]==0.26.2"
pip3 install -U "huggingface_hub[cli]==0.27.1"
```

- Download the prepared dataset from Hugging Face and copy it into the GCS bucket
2 changes: 1 addition & 1 deletion use-cases/prerequisites/processed-data.md
@@ -24,7 +24,7 @@ Select a path between **Full dataset** and **Smaller dataset (subset)**. The sma
- Download the Hugging Face CLI library

```sh
pip3 install -U "huggingface_hub[cli]==0.26.2"
pip3 install -U "huggingface_hub[cli]==0.27.1"
```

- Download the processed dataset CSV file from Hugging Face and copy it into the GCS bucket
2 changes: 1 addition & 1 deletion use-cases/prerequisites/raw-data.md
@@ -43,7 +43,7 @@ Select a path between **Full dataset** and **Smaller dataset (subset)**. The sma
- Download the Hugging Face CLI library

```sh
pip3 install -U "huggingface_hub[cli]==0.26.2"
pip3 install -U "huggingface_hub[cli]==0.27.1"
```

- Download the preprocessed dataset CSV file from Hugging Face and copy it into the GCS bucket