Data Processing Ray library updates, readme prereq updates #90

Merged · 4 commits · Jan 13, 2025
@@ -14,7 +14,7 @@

image:
repository: rayproject/ray
tag: 2.7.1-py310-gpu
tag: 2.40.0-py312-gpu
pullPolicy: IfNotPresent

nameOverride: "kuberay"
@@ -24,7 +24,7 @@ imagePullSecrets: []

head:
groupName: headgroup
rayVersion: 2.7.1
rayVersion: 2.40.0
enableInTreeAutoscaling: true
autoscalerOptions:
resources:
@@ -48,7 +48,7 @@ head:
num-cpus: '0' # Prevent tasks from being scheduled on the head
image:
repository: rayproject/ray
tag: 2.7.1-py310
tag: 2.40.0-py312
pullPolicy: IfNotPresent
containerEnv:
- name: RAY_memory_monitor_refresh_ms
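The tag and `rayVersion` changes above move the KubeRay cluster from Ray 2.7.1 on Python 3.10 to Ray 2.40.0 on Python 3.12. A quick way to confirm the rollout picked up the new image — a minimal sketch, assuming `kubectl` access to the namespace where the RayCluster runs and the standard KubeRay labels:

```shell
# Hedged sketch: list RayCluster resources and the image the head pod is actually running.
kubectl get raycluster --all-namespaces

kubectl get pods --all-namespaces -l ray.io/node-type=head \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'
```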
@@ -9,6 +9,8 @@ Depending on the infrastructure you provisioned, the data preparation step takes
- This guide was developed to be run on the [playground AI/ML platform](/platforms/gke-aiml/playground/README.md). If you are using a different environment, the scripts and manifests will need to be modified for that environment.
- A bucket containing the processed data from the [Data Processing example](/use-cases/model-fine-tuning-pipeline/data-processing/ray/README.md)

> NOTE: If you did not execute the data processing example, follow [these instructions](/use-cases/prerequisites/processed-data.md) to load the processed data into the bucket.

## Preparation

- Accept Llama 3.1 on Vertex AI license agreement terms
@@ -46,37 +48,6 @@ Depending on the infrastructure you provisioned, the data preparation step takes

> The Llama 3.1 API on Vertex AI is in preview; it is only available in `us-central1`

## Data Preparation (Optional)

To execute this scenario without going through the [Data Processing example](/use-cases/model-fine-tuning-pipeline/data-processing/ray/README.md), we have a processed dataset that you can use.

Select a path between **Full dataset** and **Smaller dataset (subset)**. The smaller dataset is a quicker way to experience the pipeline, but it will produce a less than ideal fine-tuned model.

- If you would like to use the **Smaller dataset (subset)**, set the variable below.

```shell
DATASET_SUBSET=-subset
```

- Download the Hugging Face CLI library

```shell
pip3 install -U "huggingface_hub[cli]==0.26.2"
```

- Download the processed dataset CSV file from Hugging Face and copy it into the GCS bucket

```shell
PROCESSED_DATA_REPO=gcp-acp/flipkart-preprocessed${DATASET_SUBSET}

${HOME}/.local/bin/huggingface-cli download --repo-type dataset ${PROCESSED_DATA_REPO} --local-dir ./temp

gcloud storage cp ./temp/flipkart.csv \
gs://${MLP_DATA_BUCKET}/flipkart_preprocessed_dataset/flipkart.csv && \

rm ./temp/flipkart.csv
```

## Build the container image

- Build the container image using Cloud Build and push the image to Artifact Registry
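The exact build command for this step is collapsed in the diff; a typical Cloud Build invocation looks roughly like the sketch below, where the Artifact Registry path is a placeholder rather than this repository's real image name.

```shell
# Hedged sketch — substitute your own region, Artifact Registry repository, and image name.
# ${MLP_PROJECT_ID} comes from the environment configuration sourced during Preparation.
gcloud builds submit . \
  --tag us-central1-docker.pkg.dev/${MLP_PROJECT_ID}/llm-finetuning/data-preparation:latest
```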
@@ -24,6 +24,7 @@ The data processing step takes approximately 18-20 minutes.
## Prerequisites

- This guide was developed to be run on the [playground AI/ML platform](/platforms/gke-aiml/playground/README.md). If you are using a different environment, the scripts and manifests will need to be modified for that environment.
- The raw data that will be processed in this example. Follow [these instructions](/use-cases/prerequisites/raw-data.md) to load the data into the bucket.

## Preparation

@@ -43,54 +44,6 @@ The data processing step takes approximately 18-20 minutes.

> You should see the various variables populated with the information specific to your environment.

## Data Preparation

Select a path between **Full dataset** and **Smaller dataset (subset)**. The smaller dataset is a quicker way to experience the pipeline, but it will produce a less than ideal fine-tuned model.

- **Full dataset** Download the raw data CSV file from [Kaggle](https://kaggle.com) and store it in the bucket created in the previous step.

- You will need the Kaggle CLI to download the file. The Kaggle CLI can be installed using the following command in Cloud Shell:

```shell
pip3 install --user kaggle
```

For more details, you can read these [instructions](https://github.com/Kaggle/kaggle-api#installation).

- To use the CLI you must create an API token. To create the token, register on [kaggle.com](https://kaggle.com) if you don't already have an account. Go to `kaggle.com/settings > API > Create New Token`; the downloaded file should be stored in `$HOME/.kaggle/kaggle.json`. Note: you will have to create the directory `$HOME/.kaggle`. After the configuration is done, you can run the following command to download the dataset and copy it to the GCS bucket:

```shell
kaggle datasets download --unzip PromptCloudHQ/flipkart-products && \

gcloud storage cp flipkart_com-ecommerce_sample.csv \
gs://${MLP_DATA_BUCKET}/flipkart_raw_dataset/flipkart_com-ecommerce_sample.csv && \

rm flipkart_com-ecommerce_sample.csv
```

- Alternatively, you can [download the dataset](https://www.kaggle.com/datasets/PromptCloudHQ/flipkart-products) directly from the Kaggle website and copy it to the bucket.

- **Smaller dataset (subset)** Download the raw data CSV from Hugging Face.

- Download the Hugging Face CLI library

```shell
pip3 install -U "huggingface_hub[cli]==0.26.2"
```

- Download the preprocessed dataset CSV file from Hugging Face and copy it into the GCS bucket

```shell
RAW_DATA_REPO=gcp-acp/flipkart-raw-subset

${HOME}/.local/bin/huggingface-cli download --repo-type dataset ${RAW_DATA_REPO} --local-dir ./temp

gcloud storage cp ./temp/flipkart_com-ecommerce_sample.csv \
gs://${MLP_DATA_BUCKET}/flipkart_raw_dataset/flipkart_com-ecommerce_sample.csv && \

rm ./temp/flipkart_com-ecommerce_sample.csv
```

## Build the container image

- Build the container image using Cloud Build and push the image to Artifact Registry
@@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.

FROM python:3.10.14-slim-bullseye as build-stage
FROM python:3.12.8-slim-bullseye as build-stage

ENV PATH=/venv/bin:${PATH}
ENV PYTHONDONTWRITEBYTECODE=1
@@ -262,10 +262,11 @@ def run_remote():
# Ray runtime env
runtime_env = {
"pip": [
"google-cloud-storage==2.16.0",
"spacy==3.7.4",
"jsonpickle==3.0.3",
"pandas==2.2.1",
"google-cloud-storage==2.19.0",
"spacy==3.7.6",
"jsonpickle==4.0.1",
"pandas==2.2.3",
"pydantic==2.10.5",
],
"env_vars": {"PIP_NO_CACHE_DIR": "1", "PIP_DISABLE_PIP_VERSION_CHECK": "1"},
}
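These pins are installed on the Ray workers at runtime through the job's `runtime_env`, so they need to be compatible with the cluster's Ray 2.40.0 / Python 3.12 image. The same pinned set could also be supplied at submission time via the Ray Jobs CLI — a hedged sketch with placeholder address and entrypoint, not the command this repository actually uses:

```shell
# Hedged sketch: submit a job whose runtime_env installs the same pinned packages on the workers.
# The dashboard address and the entrypoint script are placeholders.
ray job submit \
  --address http://ray-cluster-kuberay-head-svc:8265 \
  --working-dir . \
  --runtime-env-json '{"pip": ["google-cloud-storage==2.19.0", "spacy==3.7.6", "jsonpickle==4.0.1", "pandas==2.2.3", "pydantic==2.10.5"]}' \
  -- python preprocessing.py
```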
@@ -1,9 +1,9 @@
fsspec==2024.3.1
gcsfs==2024.3.1
google-cloud-storage==2.16.0
jsonpickle==3.0.3
pandas==2.2.1
ray==2.7.1
ray[client]==2.7.1
spacy==3.7.4
fsspec==2024.12.0
gcsfs==2024.12.0
google-cloud-storage==2.19.0
jsonpickle==4.0.1
pandas==2.2.3
ray==2.40.0
ray[client]==2.40.0
spacy==3.7.6
thejsonlogger==0.0.3
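The `ray` and `ray[client]` pins above must match the cluster's `rayVersion` (2.40.0), since Ray Client generally requires the client and server versions to agree, and the remaining packages mirror the `runtime_env` pins in the job code. A minimal client-side check, assuming the requirements are installed into a Python 3.12 environment:

```shell
# Hedged sketch: install the pinned requirements and confirm the Ray and Python versions.
pip3 install -r requirements.txt
python3 -c "import ray, sys; print(ray.__version__, sys.version.split()[0])"
```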
35 changes: 3 additions & 32 deletions use-cases/model-fine-tuning-pipeline/fine-tuning/pytorch/README.md
@@ -9,7 +9,9 @@ The resulting fine-tuned model is, Built with Meta Llama 3.1, using the data
## Prerequisites

- This guide was developed to be run on the [playground AI/ML platform](/platforms/gke-aiml/playground/README.md). If you are using a different environment, the scripts and manifests will need to be modified for that environment.
- A bucket containing the processed and prepared data from the [Data Preparation example](/use-cases/model-fine-tuning-pipeline/data-preparation/gemma-it/README.md)
- A bucket containing the prepared data from the [Data Preparation example](/use-cases/model-fine-tuning-pipeline/data-preparation/gemma-it/README.md)

> NOTE: If you did not execute the data preparation example, follow [these instructions](/use-cases/prerequisites/prepared-data.md) to load the dataset into the bucket.

## Preparation

@@ -37,37 +39,6 @@ The resulting fine-tuned model is, Built with Meta Llama 3.1, using the data
HF_TOKEN=
```

## Data Preparation

To execute this scenario without going through the [Data Preparation example](/use-cases/model-fine-tuning-pipeline/data-preparation/gemma-it/README.md), we have a prepared dataset that you can use.

Select a path between **Full dataset** and **Smaller dataset (subset)**. The smaller dataset is a quicker way to experience the pipeline, but it will produce a less than ideal fine-tuned model.

- If you would like to use the **Smaller dataset (subset)**, set the variable below.

```sh
DATASET_SUBSET=-subset
```

- Download the Hugging Face CLI library

```sh
pip3 install -U "huggingface_hub[cli]==0.26.2"
```

- Download the prepared dataset from Hugging Face and copy it into the GCS bucket

```sh
DATAPREP_REPO=gcp-acp/flipkart-dataprep${DATASET_SUBSET}

${HOME}/.local/bin/huggingface-cli download --repo-type dataset ${DATAPREP_REPO} --local-dir ./temp

gcloud storage cp -R ./temp/* \
gs://${MLP_DATA_BUCKET}/dataset/output && \

rm -rf ./temp
```

## Build the container image

- Build the container image using Cloud Build and push the image to Artifact Registry
79 changes: 6 additions & 73 deletions use-cases/model-fine-tuning-pipeline/model-eval/README.md
@@ -8,8 +8,14 @@ for this activity, the first is to send prompts to the fine-tuned model, the sec
## Prerequisites

- This guide was developed to be run on the [playground AI/ML platform](/platforms/gke-aiml/playground/README.md). If you are using a different environment, the scripts and manifests will need to be modified for that environment.
- A bucket containing the prepared data from the [Data Preparation example](/use-cases/model-fine-tuning-pipeline/data-preparation/gemma-it/README.md)

> NOTE: If you did not execute the data preparation example, follow [these instructions](/use-cases/prerequisites/prepared-data.md) to load the dataset into the bucket.

- A bucket containing the model weights from the [Fine tuning example](/use-cases/model-fine-tuning-pipeline/fine-tuning/pytorch/README.md)

> NOTE: If you did not execute the fine-tuning example, follow [these instructions](/use-cases/prerequisites/fine-tuned-model.md) to load the model into the bucket.

## Preparation

- Clone the repository and change directory to the guide directory
@@ -28,79 +34,6 @@ for this activity, the first is to send prompts to the fine-tuned model, the sec

> You should see the various variables populated with the information specific to your environment.

## Data Preparation (Optional)

To execute this scenario without going through the [Fine tuning example](/use-cases/model-fine-tuning-pipeline/fine-tuning/pytorch/README.md), we have a prepared dataset and a fine-tuned model that you can use.

Select a path between **Full dataset** and **Smaller dataset (subset)**. The smaller dataset is a quicker way to experience the pipeline, but the evaluation results would be on a smaller sample.

- If you would like to use the **Smaller dataset (subset)**, set the variable below.

```sh
DATASET_SUBSET=-subset
```

- Download the Hugging Face CLI library

```sh
pip3 install -U "huggingface_hub[cli]==0.26.2"
```

- Download the prepared dataset from Hugging Face and copy it into the GCS bucket

```sh
DATAPREP_REPO=gcp-acp/flipkart-dataprep${DATASET_SUBSET}

${HOME}/.local/bin/huggingface-cli download --repo-type dataset ${DATAPREP_REPO} --local-dir ./temp

gcloud storage cp -R ./temp/* \
gs://${MLP_DATA_BUCKET}/dataset/output && \

rm -rf ./temp
```

- Download the fine-tuned model from Hugging Face and copy it into the GCS bucket.

> NOTE: Due to the limitations of Cloud Shell’s storage and the size of our model, we need to run this job on the cluster to perform the transfer to GCS.

- Get credentials for the GKE cluster

```sh
gcloud container fleet memberships get-credentials ${MLP_CLUSTER_NAME} --project ${MLP_PROJECT_ID}
```

- Replace the respective variables required for the job

```sh
MODEL_REPO=gcp-acp/Llama-gemma-2-9b-it-ft

sed \
-i -e "s|V_KSA|${MLP_MODEL_EVALUATION_KSA}|" \
-i -e "s|V_BUCKET|${MLP_MODEL_BUCKET}|" \
-i -e "s|V_MODEL_REPO|${MODEL_REPO}|" \
manifests/transfer-to-gcs.yaml
```

- Deploy the job

```sh
kubectl apply --namespace ${MLP_KUBERNETES_NAMESPACE} \
-f manifests/transfer-to-gcs.yaml
```

- Trigger the wait for job completion (the job will take ~5 minutes to complete)

```sh
kubectl --namespace ${MLP_KUBERNETES_NAMESPACE} wait \
--for=condition=complete --timeout=900s job/transfer-to-gcs
```

- Example output of the job completion

```sh
job.batch/transfer-to-gcs condition met
```

## Build the container image

- Build the container image using Cloud Build and push the image to Artifact Registry
2 changes: 1 addition & 1 deletion use-cases/prerequisites/prepared-data.md
@@ -24,7 +24,7 @@ Select a path between **Full dataset** and **Smaller dataset (subset)**. The sma
- Download the Hugging Face CLI library

```sh
pip3 install -U "huggingface_hub[cli]==0.26.2"
pip3 install -U "huggingface_hub[cli]==0.27.1"
```

- Download the prepared dataset from Hugging Face and copy it into the GCS bucket
2 changes: 1 addition & 1 deletion use-cases/prerequisites/processed-data.md
@@ -24,7 +24,7 @@ Select a path between **Full dataset** and **Smaller dataset (subset)**. The sma
- Download the Hugging Face CLI library

```sh
pip3 install -U "huggingface_hub[cli]==0.26.2"
pip3 install -U "huggingface_hub[cli]==0.27.1"
```

- Download the processed dataset CSV file from Hugging Face and copy it into the GCS bucket
2 changes: 1 addition & 1 deletion use-cases/prerequisites/raw-data.md
@@ -43,7 +43,7 @@ Select a path between **Full dataset** and **Smaller dataset (subset)**. The sma
- Download the Hugging Face CLI library

```sh
pip3 install -U "huggingface_hub[cli]==0.26.2"
pip3 install -U "huggingface_hub[cli]==0.27.1"
```

- Download the preprocessed dataset CSV file from Hugging Face and copy it into the GCS bucket