Update processorca for new rules and patch potential TTFT exploit. Enable GPU for Offline scenario in Llama-v2 Reference Implementation. (mlcommons#1544)

* Add environment files for GPU

* Update flags to enable running on device=gpu

* Enable BS>1 on GPU, update processorca for new rules and first token workaround

* Dump outputs and add script to consolidate into single pickle file for analysis

* Add option to continue run if prior session was killed

* Small comment fix, fix default flags to match original reference implementation

* Generalize launch scripts for outside users, update README

* Minor fix: comments and default arguments

* Add accuracy target to README

* Add calibration dataset generation to processorca.py

* Update language/llama2-70b/README.md

Co-authored-by: Zhihan Jiang <[email protected]>

* Make calibration rng seed a kwarg

---------

Co-authored-by: Zhihan Jiang <[email protected]>
nv-alicheng and nvzhihanj authored Jan 8, 2024
1 parent 678ed4f commit 94b0cc4
Showing 11 changed files with 514 additions and 43 deletions.
48 changes: 48 additions & 0 deletions language/llama2-70b/Dockerfile
@@ -0,0 +1,48 @@
# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04
SHELL ["/bin/bash", "-c"]

ENV LC_ALL=C.UTF-8
ENV LANG=C.UTF-8

ENV TZ=US/Pacific
ENV DEBIAN_FRONTEND=noninteractive

RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
RUN rm -rf /var/lib/apt/lists/* && rm /etc/apt/sources.list.d/* \
    && apt update \
    && apt install -y --no-install-recommends build-essential autoconf \
        libtool git ccache curl wget pkg-config sudo ca-certificates \
        automake libssl-dev bc python3-dev python3-pip google-perftools \
        gdb libglib2.0-dev clang sshfs libre2-dev libboost-dev \
        libnuma-dev numactl sysstat sshpass ntpdate less iputils-ping \
    && apt -y autoremove \
    && apt remove -y cmake \
    && apt install -y --no-install-recommends pkg-config zip g++ zlib1g-dev \
        unzip libarchive-dev
RUN apt install -y --no-install-recommends rsync

# Install setuptools
RUN python3 -m pip install --upgrade pip \
    && python3 -m pip install --upgrade setuptools wheel virtualenv

# Install conda
WORKDIR /tmp
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-py310_23.5.2-0-Linux-x86_64.sh \
    && bash Miniconda3-* -b -p /opt/miniconda3
ENV PATH="$PATH:/opt/miniconda3/bin"
RUN conda create -n llama2-70b python=3.10
RUN chmod -R 777 /opt/miniconda3
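For reference, a manual build-and-enter flow for this image might look like the sketch below; the image tag `llama2-70b-env` and the mount paths are illustrative assumptions (the `launch.sh` script added in this commit wraps this for you):

```
# Build the image from the repo root (path assumed; adjust to your checkout)
docker build -t llama2-70b-env -f language/llama2-70b/Dockerfile language/llama2-70b

# Enter the container; --gpus=all requires the NVIDIA Container Toolkit on the host
docker run --rm -it --gpus=all -v ${PWD}:${PWD} -w ${PWD} llama2-70b-env bash
```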
79 changes: 76 additions & 3 deletions language/llama2-70b/README.md
@@ -7,6 +7,9 @@


## Prepare environment

For a CPU-only run:

```
conda create -n llama2-70b python=3.9
conda activate llama2-70b
@@ -26,9 +29,35 @@ git merge llm-server
python -m pip install .
```

For a GPU-based run:

A Dockerfile is provided, along with scripts to help launch the container. First, add any Docker volume mounts you want to
`launch.sh`. There is a section at the top of the file that looks like this:
```
# Add any volume mounts here with the following syntax
# /path/to/src:/path/to/dir/in/container
MOUNTS=(
    $MLCOMMONS_REPO_PATH:$MLCOMMONS_REPO_PATH
)
```

For example, if you have a RAID volume mounted at `/raid/data` on your local machine, you can expose it at the same path in the container like so:
```
# Add any volume mounts here with the following syntax
# /path/to/src:/path/to/dir/in/container
MOUNTS=(
    $MLCOMMONS_REPO_PATH:$MLCOMMONS_REPO_PATH
    /raid/data:/raid/data
)
```
Once you have added all your mounts, launch the container with `bash launch.sh`.
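Under the hood, a launch script like this typically expands the `MOUNTS` array into `-v` flags on a `docker run` invocation. A minimal sketch under that assumption (the real `launch.sh` may differ, and the image tag `llama2-70b-env` is illustrative):

```
# Turn each /path/on/host:/path/in/container entry into a -v flag
DOCKER_VOLUMES=()
for mount in "${MOUNTS[@]}"; do
    DOCKER_VOLUMES+=("-v" "$mount")
done

# Image tag is an assumption; match it to your build
docker run --rm -it --gpus=all "${DOCKER_VOLUMES[@]}" llama2-70b-env bash
```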

Inside the container, set up the environment with `bash build.sh`. This installs all the dependencies from the
CPU-only setup, as well as GPU-enabled builds of applicable libraries such as PyTorch.


## Get Model
- + For now, MLCommons is not hosting the checkpoing, so you must first go to [llama2-request-link](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) and make a request, sign in to huggingface (if you don't have account, you'd need to create one). **Please note your authentication credentials** as you may be required to provide them when cloninng below
+ + For now, MLCommons is not hosting the checkpoint, so you must first go to [llama2-request-link](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) and make a request, then sign in to Hugging Face (if you don't have an account, you'll need to create one). **Please note your authentication credentials**, as you may be required to provide them when cloning below.
+ Requires Git Large Files Storage
```
export CHECKPOINT_PATH=${PWD}/Llama-2-70b-chat-hf
@@ -49,7 +78,7 @@ EXPORT_DIR=${PWD}/processed-openorca
export DATASET_PATH=${PWD}/processed-data.pkl
# Process the dataset according to the Taskforce's agreed criteria
- python3 processorca.py --dataset_pq_path=${OPENORCA_PARQUET} --model_dir=${CHECKPOINT_PATH} --seqlen_limit=2048 --export_dir=${EXPORT_DIR} --num_total_samples=24576
+ python3 processorca.py --dataset_pq_path=${OPENORCA_PARQUET} --model_dir=${CHECKPOINT_PATH} --seqlen_limit=1024 --export_dir=${EXPORT_DIR} --num_total_samples=24576
mv ${EXPORT_DIR}/open_orca_gpt4_tokenized_llama.sampled_24576.pkl ${DATASET_PATH}
```
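Before kicking off a benchmark, it can be worth sanity-checking the processed pickle. This optional check (not part of the reference scripts) assumes `DATASET_PATH` is still exported from the step above:

```
python3 - <<'EOF'
import os
import pandas as pd

# Load the processed dataset and confirm the sample count and schema
df = pd.read_pickle(os.environ["DATASET_PATH"])
print(len(df))           # expect 24576 samples
print(list(df.columns))  # inspect the tokenized input/output columns
EOF
```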
@@ -65,11 +94,24 @@ python -u main.py --scenario Offline \
    --user-conf user.conf \
    --total-sample-count 24576 \
    --device cpu \
    --dataset-path ${DATASET_PATH} \
    --output-log-dir offline-logs
```

For a GPU-based run:
```
python3 -u main.py --scenario Offline \
    --model-path ${CHECKPOINT_PATH} \
    --mlperf-conf mlperf.conf \
    --user-conf user.conf \
    --total-sample-count 24576 \
    --dataset-path ${DATASET_PATH} \
    --output-log-dir offline-logs \
    --dtype float32 \
    --device cuda:0 2>&1 | tee offline_performance_log.log
```
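When either performance run finishes, LoadGen writes a summary into the `--output-log-dir` directory; assuming the standard LoadGen file naming, the headline numbers can be pulled out with a quick grep:

```
grep -E "Result is|Samples per second" offline-logs/mlperf_log_summary.txt
```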

### Server
```
python -u main.py --scenario Server \
@@ -82,13 +124,17 @@ python -u main.py --scenario Server \
    --output-log-dir server-logs
```

The ServerSUT was not tested for GPU runs.


## Run Accuracy Benchmarks

### Offline
```
OUTPUT_LOG_DIR=offline-accuracy-logs
mkdir -p "run_outputs" # The script will dump all the outputs to 'run_outputs'.
python -u main.py --scenario Offline \
    --model-path ${CHECKPOINT_PATH} \
    --accuracy \
@@ -105,8 +151,23 @@ if [ -e ${ACCURACY_LOG_FILE} ]; then
    python evaluate-accuracy.py --checkpoint-path ${CHECKPOINT_PATH} \
        --mlperf-accuracy-file ${ACCURACY_LOG_FILE} --dataset-file ${DATASET_PATH} --dtype int32
fi
# Optional: Create a pickled pandas DataFrame that is the original dataset with extra columns with output data from the
# accuracy run. The following columns will be added:
# - "gen_output_tok_id": A list of ints representing the tokenized output sequence.
# - "gen_output_text": A str representing the untokenized output sequence.
# - "gen_output_tok_len": An int representing the number of output tokens.
# - "rouge1": The rouge1 score for this sample
# - "rouge2": The rouge2 score for this sample
# - "rougeL": The rougeL score for this sample
# This file will by default be saved to 'full_output.pkl'. You can modify this with --output-pkl-path.
python consolidate_results.py --dataset-path ${DATASET_PATH} --model-dir ${CHECKPOINT_PATH}
```
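Once `consolidate_results.py` has written the consolidated DataFrame, the per-sample columns documented above can be aggregated directly. A small sketch, assuming the default `full_output.pkl` output path:

```
python3 - <<'EOF'
import pandas as pd

df = pd.read_pickle("full_output.pkl")
# Mean per-sample ROUGE scores, using the columns listed in the comments above
print(df[["rouge1", "rouge2", "rougeL"]].mean())
# Distribution of generated output lengths in tokens
print(df["gen_output_tok_len"].describe())
EOF
```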

For the GPU run, the above steps have been automated in `run_accuracy.sh`. You can also modify the script to use
`--device cpu` to adapt it to a CPU-only run.


### Server
```
OUTPUT_LOG_DIR=server-accuracy-logs
@@ -129,3 +190,15 @@ fi
fi
```

The ServerSUT was not tested for GPU runs. You can try setting `--device cuda:0`, but YMMV.


## Accuracy Target
Running the GPU implementation in FP32 precision produced the following accuracy targets (normalized from a 0.0-1.0
scale to a 0-100 scale):
- Rouge1: 43.88
- Rouge2: 21.7108
- RougeL: 28.2502
- RougeLsum: 41.4821

This was run on an 8xH100 node; total runtime was ~4.5 days.
