Update processorca for new rules and patch potential TTFT exploit. Enable GPU for Offline scenario in Llama-v2 Reference Implementation. (mlcommons#1544)

* Add environment files for GPU

* Update flags to enable running on device=gpu

* Enable BS>1 on GPU, update processorca for new rules and first token workaround

* Dump outputs and add script to consolidate into single pickle file for analysis

* Add option to continue run if prior session was killed

* Small comment fix, fix default flags to match original reference implementation

* Generalize launch scripts for outside users, update README

* Minor fix: comments and default arguments

* Add accuracy target to README

* Add calibration dataset generation to processorca.py

* Update language/llama2-70b/README.md

Co-authored-by: Zhihan Jiang <[email protected]>

* Make calibration rng seed a kwarg

---------

Co-authored-by: Zhihan Jiang <[email protected]>
nv-alicheng and nvzhihanj authored Jan 8, 2024
1 parent 678ed4f commit 94b0cc4
Showing 11 changed files with 514 additions and 43 deletions.
48 changes: 48 additions & 0 deletions language/llama2-70b/Dockerfile
@@ -0,0 +1,48 @@
# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04
SHELL ["/bin/bash", "-c"]

ENV LC_ALL=C.UTF-8
ENV LANG=C.UTF-8

ENV TZ=US/Pacific
ENV DEBIAN_FRONTEND=noninteractive

RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
RUN rm -rf /var/lib/apt/lists/* && rm /etc/apt/sources.list.d/* \
    && apt update \
    && apt install -y --no-install-recommends build-essential autoconf \
        libtool git ccache curl wget pkg-config sudo ca-certificates \
        automake libssl-dev bc python3-dev python3-pip google-perftools \
        gdb libglib2.0-dev clang sshfs libre2-dev libboost-dev \
        libnuma-dev numactl sysstat sshpass ntpdate less iputils-ping \
    && apt -y autoremove \
    && apt remove -y cmake \
    && apt install -y --no-install-recommends pkg-config zip g++ zlib1g-dev \
        unzip libarchive-dev
RUN apt install -y --no-install-recommends rsync

# Install setuptools
RUN python3 -m pip install --upgrade pip \
    && python3 -m pip install --upgrade setuptools wheel virtualenv

# Install conda
WORKDIR /tmp
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-py310_23.5.2-0-Linux-x86_64.sh \
    && bash Miniconda3-* -b -p /opt/miniconda3
ENV PATH="$PATH:/opt/miniconda3/bin"
RUN conda create -n llama2-70b python=3.10
RUN chmod -R 777 /opt/miniconda3
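For reference, a manual build-and-enter flow for this image might look like the sketch below; the image tag `llama2-70b-env` and the mount paths are illustrative assumptions (the `launch.sh` script added in this commit wraps this for you):

```
# Build the image from the repo root (path assumed; adjust to your checkout)
docker build -t llama2-70b-env -f language/llama2-70b/Dockerfile language/llama2-70b

# Enter the container; --gpus=all requires the NVIDIA Container Toolkit on the host
docker run --rm -it --gpus=all -v ${PWD}:${PWD} -w ${PWD} llama2-70b-env bash
```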
79 changes: 76 additions & 3 deletions language/llama2-70b/README.md
@@ -7,6 +7,9 @@


## Prepare environment

For a CPU-only run:

```
conda create -n llama2-70b python=3.9
conda activate llama2-70b
@@ -26,9 +29,35 @@ git merge llm-server
python -m pip install .
```

For a GPU-based run:

A Dockerfile is provided, along with scripts to help launch the container. First, add any Docker volume mounts you want to
`launch.sh`. There is a section at the top of the file that looks like this:
```
# Add any volume mounts here with the following syntax
# /path/to/src:/path/to/dir/in/container
MOUNTS=(
    $MLCOMMONS_REPO_PATH:$MLCOMMONS_REPO_PATH
)
```

For example, if you have a RAID volume mounted at `/raid/data` on your local machine, you can expose it at the same path in the container like so:
```
# Add any volume mounts here with the following syntax
# /path/to/src:/path/to/dir/in/container
MOUNTS=(
    $MLCOMMONS_REPO_PATH:$MLCOMMONS_REPO_PATH
    /raid/data:/raid/data
)
```
Once you have added all your mounts, launch the container with `bash launch.sh`.
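Under the hood, a launch script like this typically expands the `MOUNTS` array into `-v` flags on a `docker run` invocation. A minimal sketch under that assumption (the real `launch.sh` may differ, and the image tag `llama2-70b-env` is illustrative):

```
# Turn each /path/on/host:/path/in/container entry into a -v flag
DOCKER_VOLUMES=()
for mount in "${MOUNTS[@]}"; do
    DOCKER_VOLUMES+=("-v" "$mount")
done

# Image tag is an assumption; match it to your build
docker run --rm -it --gpus=all "${DOCKER_VOLUMES[@]}" llama2-70b-env bash
```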

Inside the container, set up the environment with `bash build.sh`. This installs all the dependencies from the
CPU-only setup, as well as GPU-enabled builds of applicable libraries such as PyTorch.


## Get Model
- + For now, MLCommons is not hosting the checkpoing, so you must first go to [llama2-request-link](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) and make a request, sign in to huggingface (if you don't have account, you'd need to create one). **Please note your authentication credentials** as you may be required to provide them when cloninng below
+ + For now, MLCommons is not hosting the checkpoint, so you must first go to [llama2-request-link](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) and make a request, then sign in to Hugging Face (if you don't have an account, you'll need to create one). **Please note your authentication credentials**, as you may be required to provide them when cloning below.
+ Requires Git Large Files Storage
```
export CHECKPOINT_PATH=${PWD}/Llama-2-70b-chat-hf
@@ -49,7 +78,7 @@ EXPORT_DIR=${PWD}/processed-openorca
export DATASET_PATH=${PWD}/processed-data.pkl
# Process the dataset according to the Taskforce's agreed criteria
- python3 processorca.py --dataset_pq_path=${OPENORCA_PARQUET} --model_dir=${CHECKPOINT_PATH} --seqlen_limit=2048 --export_dir=${EXPORT_DIR} --num_total_samples=24576
+ python3 processorca.py --dataset_pq_path=${OPENORCA_PARQUET} --model_dir=${CHECKPOINT_PATH} --seqlen_limit=1024 --export_dir=${EXPORT_DIR} --num_total_samples=24576
mv ${EXPORT_DIR}/open_orca_gpt4_tokenized_llama.sampled_24576.pkl ${DATASET_PATH}
```
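Before kicking off a benchmark, it can be worth sanity-checking the processed pickle. This optional check (not part of the reference scripts) assumes `DATASET_PATH` is still exported from the step above:

```
python3 - <<'EOF'
import os
import pandas as pd

# Load the processed dataset and confirm the sample count and schema
df = pd.read_pickle(os.environ["DATASET_PATH"])
print(len(df))           # expect 24576 samples
print(list(df.columns))  # inspect the tokenized input/output columns
EOF
```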
@@ -65,11 +94,24 @@ python -u main.py --scenario Offline \
    --user-conf user.conf \
    --total-sample-count 24576 \
    --device cpu \
    --dataset-path ${DATASET_PATH} \
    --output-log-dir offline-logs
```

For a GPU-based run:
```
python3 -u main.py --scenario Offline \
    --model-path ${CHECKPOINT_PATH} \
    --mlperf-conf mlperf.conf \
    --user-conf user.conf \
    --total-sample-count 24576 \
    --dataset-path ${DATASET_PATH} \
    --output-log-dir offline-logs \
    --dtype float32 \
    --device cuda:0 2>&1 | tee offline_performance_log.log
```
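When either performance run finishes, LoadGen writes a summary into the `--output-log-dir` directory; assuming the standard LoadGen file naming, the headline numbers can be pulled out with a quick grep:

```
grep -E "Result is|Samples per second" offline-logs/mlperf_log_summary.txt
```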

### Server
```
python -u main.py --scenario Server \
@@ -82,13 +124,17 @@ python -u main.py --scenario Server \
    --output-log-dir server-logs
```

The ServerSUT was not tested for GPU runs.


## Run Accuracy Benchmarks

### Offline
```
OUTPUT_LOG_DIR=offline-accuracy-logs
mkdir -p "run_outputs" # The script will dump all the outputs to 'run_outputs'.
python -u main.py --scenario Offline \
    --model-path ${CHECKPOINT_PATH} \
    --accuracy \
@@ -105,8 +151,23 @@ if [ -e ${ACCURACY_LOG_FILE} ]; then
    python evaluate-accuracy.py --checkpoint-path ${CHECKPOINT_PATH} \
        --mlperf-accuracy-file ${ACCURACY_LOG_FILE} --dataset-file ${DATASET_PATH} --dtype int32
fi
# Optional: Create a pickled pandas DataFrame that is the original dataset with extra columns with output data from the
# accuracy run. The following columns will be added:
# - "gen_output_tok_id": A list of ints representing the tokenized output sequence.
# - "gen_output_text": A str representing the untokenized output sequence.
# - "gen_output_tok_len": An int representing the number of output tokens.
# - "rouge1": The rouge1 score for this sample
# - "rouge2": The rouge2 score for this sample
# - "rougeL": The rougeL score for this sample
# This file will by default be saved to 'full_output.pkl'. You can modify this with --output-pkl-path.
python consolidate_results.py --dataset-path ${DATASET_PATH} --model-dir ${CHECKPOINT_PATH}
```
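Once `consolidate_results.py` has written the consolidated DataFrame, the per-sample columns documented above can be aggregated directly. A small sketch, assuming the default `full_output.pkl` output path:

```
python3 - <<'EOF'
import pandas as pd

df = pd.read_pickle("full_output.pkl")
# Mean per-sample ROUGE scores, using the columns listed in the comments above
print(df[["rouge1", "rouge2", "rougeL"]].mean())
# Distribution of generated output lengths in tokens
print(df["gen_output_tok_len"].describe())
EOF
```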

For the GPU run, the above steps have been automated in `run_accuracy.sh`. You can also modify the script to use
`--device cpu` to adapt it to a CPU-only run.


### Server
```
OUTPUT_LOG_DIR=server-accuracy-logs
@@ -129,3 +190,15 @@ fi
fi
```

The ServerSUT was not tested for GPU runs. You can try setting `--device cuda:0`, but YMMV.


## Accuracy Target
Running the GPU implementation in FP32 precision produced the following accuracy targets (normalized from a 0.0-1.0
scale to a 0-100 scale):
- Rouge1: 43.88
- Rouge2: 21.7108
- RougeL: 28.2502
- RougeLsum: 41.4821

This was run on an 8xH100 node; total runtime was ~4.5 days.
