
How to analyze large models like Llama 3 70B that require model parallelism? #907

ccchow opened this issue Jun 28, 2024 · 3 comments

ccchow commented Jun 28, 2024

The model engine is built from Llama 3 70B with tensor parallelism tp=2 and pipeline parallelism pp=2, and is deployed with the Triton launch script below:
python3 scripts/launch_triton_server.py --world_size 4 --model_repo=llama_ifb

In this case, how can I use model-analyzer to analyze this parallelized model/deployment?
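
For reference, this is roughly the workflow I had in mind. Since Model Analyzer presumably cannot start the 4-rank MPI world itself, I assume its remote launch mode is what's needed here; the endpoint values below are just the Triton defaults, and I haven't verified any of this end to end:

# Launch the tp=2/pp=2 engine as before (4 MPI ranks, one per GPU)
python3 scripts/launch_triton_server.py --world_size 4 --model_repo=llama_ifb

# Sketch: point Model Analyzer at the already-running server in remote mode
model-analyzer profile \
    --model-repository=llama_ifb \
    --profile-models=ensemble \
    --triton-launch-mode=remote \
    --triton-http-endpoint=localhost:8000 \
    --triton-grpc-endpoint=localhost:8001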

nv-braf (Contributor) commented Jul 8, 2024

Are you able to run this model on PA or GenAI-Perf?

ccchow commented Jul 9, 2024

I was able to use perf_analyzer to instrument Llama 3 70B on 4×A100 GPUs (TensorRT-LLM backend) via a launched Triton server, as below:

python3 scripts/launch_triton_server.py --world_size 4 --model_repo=llama_ifb/
perf_analyzer -m ensemble --measurement-interval 10000 --concurrency-range <start:end:step> --input-data input.json

I'm wondering how I can tune the Triton model config using Model Analyzer in this case.
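
To make the question concrete, the sweep I am hoping to express looks something like the config below (a sketch only: the concurrency range is arbitrary, the perf_analyzer flags just mirror my command above, and I have not verified which options are valid for an ensemble on the trtllm backend, nor whether remote mode needs the server started with --model-control-mode=explicit):

# Sketch of a Model Analyzer config sweeping client concurrency in remote mode
cat > config.yaml <<'EOF'
model_repository: llama_ifb
triton_launch_mode: remote
profile_models:
  ensemble:
    parameters:
      concurrency:
        start: 1
        stop: 64
        step: 8
perf_analyzer_flags:
  measurement-interval: 10000
  input-data: input.json
EOF

model-analyzer profile -f config.yaml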

Thanks.

LanceB57 commented

I'm in a very similar predicament, but with 8×H100 GPUs. I'm getting pretty underwhelming results and would also like to know how to use model-analyzer, as I'm fairly new to Triton.
