
How to analyze large models like Llama 3 70B that require model parallelism? #907

ccchow opened this issue Jun 28, 2024 · 3 comments

ccchow commented Jun 28, 2024

The model engine is built from Llama 3 70B with tensor parallelism tp=2 and pipeline parallelism pp=2, and is deployed with the Triton launch script below:
python3 scripts/launch_triton_server.py --world_size 4 --model_repo=llama_ifb

In this case, how can I use model-analyzer to analyze this parallelized model/deployment?
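
For reference, this is roughly the workflow I had in mind. Since Model Analyzer presumably cannot start the 4-rank MPI world itself, I assume its remote launch mode is what's needed here; the endpoint values below are just the Triton defaults, and I haven't verified any of this end to end:

# Launch the tp=2/pp=2 engine as before (4 MPI ranks, one per GPU)
python3 scripts/launch_triton_server.py --world_size 4 --model_repo=llama_ifb

# Sketch: point Model Analyzer at the already-running server in remote mode
model-analyzer profile \
    --model-repository=llama_ifb \
    --profile-models=ensemble \
    --triton-launch-mode=remote \
    --triton-http-endpoint=localhost:8000 \
    --triton-grpc-endpoint=localhost:8001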

nv-braf (Contributor) commented Jul 8, 2024

Are you able to run this model on PA or GenAI-Perf?

ccchow commented Jul 9, 2024

I was able to use perf_analyzer to instrument Llama 3 70B on 4×A100 GPUs (TensorRT-LLM backend) via a launched Triton server, as below:

python3 scripts/launch_triton_server.py --world_size 4 --model_repo=llama_ifb/
perf_analyzer -m ensemble --measurement-interval 10000 --concurrency-range <start:end:step> --input-data input.json

I'm wondering how I can tune the Triton model config using Model Analyzer in this case.
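
To make the question concrete, the sweep I am hoping to express looks something like the config below (a sketch only: the concurrency range is arbitrary, the perf_analyzer flags just mirror my command above, and I have not verified which options are valid for an ensemble on the trtllm backend, nor whether remote mode needs the server started with --model-control-mode=explicit):

# Sketch of a Model Analyzer config sweeping client concurrency in remote mode
cat > config.yaml <<'EOF'
model_repository: llama_ifb
triton_launch_mode: remote
profile_models:
  ensemble:
    parameters:
      concurrency:
        start: 1
        stop: 64
        step: 8
perf_analyzer_flags:
  measurement-interval: 10000
  input-data: input.json
EOF

model-analyzer profile -f config.yaml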

Thanks.

LanceB57 commented

I'm in a very similar predicament, but with 8×H100 GPUs. I'm getting pretty underwhelming results and would also like to know how to use model-analyzer, as I'm fairly new to Triton.
