
How to load a pytorch model with model analyzer #699

Open · benhgm opened this issue Jun 5, 2023 · 16 comments

@benhgm

benhgm commented Jun 5, 2023

Hi, I am trying to use Model Analyzer to analyze an ensemble model that contains two Python models and one ONNX model. The Python models use PyTorch to perform some preprocessing and postprocessing functions.

However, when I use the following command, I get a "ModuleNotFoundError: No module named 'torch'" error.
model-analyzer profile \
    --model-repository=/model_repository \
    --profile-models=ensemble_model --triton-launch-mode=docker \
    --triton-http-endpoint=localhost:8000 --triton-grpc-endpoint=localhost:8003 --triton-metrics-url=localhost:8002 \
    --output-model-repository-path=/model_analyzer_outputs/ \
    --override-output-model-repository \
    --run-config-search-mode quick \
    --triton-output-path triton_log.txt \
    --triton-docker-image devel

How do I make sure that the Docker container spun up by Model Analyzer has PyTorch installed?

@tgerdesnv
Collaborator

tgerdesnv commented Jun 5, 2023

Hi @benhgm, by default when Model Analyzer is run with --triton-launch-mode=docker, the container spun up will be the matching xx.yy-py3 Triton Server image from NVIDIA NGC. It looks like you have supplied a custom Docker image called devel that will be used instead. What does that container contain? That image needs to have the Triton Server executable plus anything else needed to run the model. If you need something special in the container, the easiest way is to build off of the NGC container. If you don't need anything special, you can omit the --triton-docker-image option.
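For example, a minimal sketch of such a custom image (assuming the 23.02 release referenced later in this thread, and that PyTorch is the only missing dependency) could look like this:

# Hypothetical Dockerfile: extend the NGC Triton Server image and add PyTorch
FROM nvcr.io/nvidia/tritonserver:23.02-py3

# Install the packages the Python-backend models import at load time
RUN python3 -m pip install torch

It would then be built with something like docker build -t tritonserver-torch:23.02 . and passed to Model Analyzer via --triton-docker-image.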

@benhgm
Author

benhgm commented Jun 6, 2023

Hi @tgerdesnv thanks for the tip!

To clarify, the command that gives the ModuleNotFoundError does not include the --triton-docker-image devel flag; I had incorrectly included it in the command I pasted.

To provide some context, devel is an image that I built to serve my models in Triton Server; it has all the dependencies that I need for my models to work. However, when I run the command with --triton-docker-image devel, I get the error message docker.errors.ImageNotFound: 404 Client Error for http+docker://localhost/v1.41/images/create?tag=latest&fromImage=devel: Not Found ("pull access denied for devel, repository does not exist or may require 'docker login': denied: requested access to the resource is denied").

My experience with Docker is still limited, so here are some questions I have:

  1. When I pass a value to the --triton-docker-image flag, do I provide the image ID, the repository name, or the tag?
  2. When I run docker images inside the tritonserver SDK container, started with docker run -it --gpus all -v /var/run/docker.sock:/var/run/docker.sock --net=host nvcr.io/nvidia/tritonserver:23.02-py3-sdk (including mounting all the volumes I need), I get a bash: docker: command not found error. How do I then pass an existing Docker image to the --triton-docker-image flag?

Thanks for your time and help, greatly appreciate it.

@nv-braf
Contributor

nv-braf commented Jun 6, 2023

Support for custom local Docker images was not added until the 23.03 release. Can you try running on that (or a newer version) and let me know if you are still seeing an issue? Thanks.
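As a sketch of how that would be used (assuming a locally built image tagged devel:latest and a 23.03 or newer SDK container), the flag takes an image name, optionally with a tag, rather than an image ID:

# Hypothetical invocation with a local image referenced by name:tag
model-analyzer profile \
    --model-repository=/model_repository \
    --profile-models=ensemble_model \
    --triton-launch-mode=docker \
    --triton-docker-image devel:latest \
    --output-model-repository-path=/model_analyzer_outputs/ \
    --override-output-model-repository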

@benhgm
Author

benhgm commented Jun 7, 2023

Hi @nv-braf, I have changed to the 23.03 release. When I start up the local Docker image instance, I get an error in my Triton log file: tritonserver: unrecognized option '--metrics-interval-ms=1000'. I did not pass that flag anywhere, so I am not sure how it got there. Note that my local Docker image is based on the 21.08 release.

NVIDIA Release 21.08 (build 26170506)

Copyright (c) 2018-2021, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

tritonserver: unrecognized option '--metrics-interval-ms=1000'
Usage: tritonserver [options]
  --help
	Print usage
  --log-verbose <integer>
	Set verbose logging level. Zero (0) disables verbose logging
	and values >= 1 enable verbose logging.
  --log-info <boolean>
	Enable/disable info-level logging.
  --log-warning <boolean>
	Enable/disable warning-level logging.
  --log-error <boolean>
	Enable/disable error-level logging.
  --id <string>
	Identifier for this server.
  --model-store <string>
	Equivalent to --model-repository.
  --model-repository <string>
	Path to model repository directory. It may be specified
	multiple times to add multiple model repositories. Note that if a model
	is not unique across all model repositories at any time, the model
	will not be available.
  --exit-on-error <boolean>
	Exit the inference server if an error occurs during
	initialization.
  --strict-model-config <boolean>
	If true model configuration files must be provided and all
	required configuration settings must be specified. If false the model
	configuration may be absent or only partially specified and the
	server will attempt to derive the missing required configuration.
  --strict-readiness <boolean>
	If true /v2/health/ready endpoint indicates ready if the
	server is responsive and all models are available. If false
	/v2/health/ready endpoint indicates ready if server is responsive even if
	some/all models are unavailable.
  --allow-http <boolean>
	Allow the server to listen for HTTP requests.
  --http-port <integer>
	The port for the server to listen on for HTTP requests.
  --http-thread-count <integer>
	Number of threads handling HTTP requests.
  --allow-grpc <boolean>
	Allow the server to listen for GRPC requests.
  --grpc-port <integer>
	The port for the server to listen on for GRPC requests.
  --grpc-infer-allocation-pool-size <integer>
	The maximum number of inference request/response objects
	that remain allocated for reuse. As long as the number of in-flight
	requests doesn't exceed this value there will be no
	allocation/deallocation of request/response objects.
  --grpc-use-ssl <boolean>
	Use SSL authentication for GRPC requests. Default is false.
  --grpc-use-ssl-mutual <boolean>
	Use mututal SSL authentication for GRPC requests. Default is
	false.
  --grpc-server-cert <string>
	File holding PEM-encoded server certificate. Ignored unless
	--grpc-use-ssl is true.
  --grpc-server-key <string>
	File holding PEM-encoded server key. Ignored unless
	--grpc-use-ssl is true.
  --grpc-root-cert <string>
	File holding PEM-encoded root certificate. Ignore unless
	--grpc-use-ssl is false.
  --grpc-infer-response-compression-level <string>
	The compression level to be used while returning the infer
	response to the peer. Allowed values are none, low, medium and high.
	By default, compression level is selected as none.
  --grpc-keepalive-time <integer>
	The period (in milliseconds) after which a keepalive ping is
	sent on the transport. Default is 7200000 (2 hours).
  --grpc-keepalive-timeout <integer>
	The period (in milliseconds) the sender of the keepalive
	ping waits for an acknowledgement. If it does not receive an
	acknowledgment within this time, it will close the connection. Default is
	20000 (20 seconds).
  --grpc-keepalive-permit-without-calls <boolean>
	Allows keepalive pings to be sent even if there are no calls
	in flight (0 : false; 1 : true). Default is 0 (false).
  --grpc-http2-max-pings-without-data <integer>
	The maximum number of pings that can be sent when there is
	no data/header frame to be sent. gRPC Core will not continue sending
	pings if we run over the limit. Setting it to 0 allows sending pings
	without such a restriction. Default is 2.
  --grpc-http2-min-recv-ping-interval-without-data <integer>
	If there are no data/header frames being sent on the
	transport, this channel argument on the server side controls the minimum
	time (in milliseconds) that gRPC Core would expect between receiving
	successive pings. If the time between successive pings is less than
	this time, then the ping will be considered a bad ping from the peer.
	Such a ping counts as a ‘ping strike’. Default is 300000 (5
	minutes).
  --grpc-http2-max-ping-strikes <integer>
	Maximum number of bad pings that the server will tolerate
	before sending an HTTP2 GOAWAY frame and closing the transport.
	Setting it to 0 allows the server to accept any number of bad pings.
	Default is 2.
  --allow-sagemaker <boolean>
	Allow the server to listen for Sagemaker requests. Default
	is false.
  --sagemaker-port <integer>
	The port for the server to listen on for Sagemaker requests.
	Default is 8080.
  --sagemaker-safe-port-range <<integer>-<integer>>
	Set the allowed port range for endpoints other than the
	SageMaker endpoints.
  --sagemaker-thread-count <integer>
	Number of threads handling Sagemaker requests. Default is 8.
  --allow-metrics <boolean>
	Allow the server to provide prometheus metrics.
  --allow-gpu-metrics <boolean>
	Allow the server to provide GPU metrics. Ignored unless
	--allow-metrics is true.
  --metrics-port <integer>
	The port reporting prometheus metrics.
  --trace-file <string>
	Set the file where trace output will be saved.
  --trace-level <string>
	Set the trace level. OFF to disable tracing, MIN for minimal
	tracing, MAX for maximal tracing. Default is OFF.
  --trace-rate <integer>
	Set the trace sampling rate. Default is 1000.
  --model-control-mode <string>
	Specify the mode for model management. Options are "none",
	"poll" and "explicit". The default is "none". For "none", the server
	will load all models in the model repository(s) at startup and will
	not make any changes to the load models after that. For "poll", the
	server will poll the model repository(s) to detect changes and will
	load/unload models based on those changes. The poll rate is
	controlled by 'repository-poll-secs'. For "explicit", model load and unload
	is initiated by using the model control APIs, and only models
	specified with --load-model will be loaded at startup.
  --repository-poll-secs <integer>
	Interval in seconds between each poll of the model
	repository to check for changes. Valid only when --model-control-mode=poll is
	specified.
  --load-model <string>
	Name of the model to be loaded on server startup. It may be
	specified multiple times to add multiple models. Note that this
	option will only take affect if --model-control-mode=explicit is true.
  --pinned-memory-pool-byte-size <integer>
	The total byte size that can be allocated as pinned system
	memory. If GPU support is enabled, the server will allocate pinned
	system memory to accelerate data transfer between host and devices
	until it exceeds the specified byte size. If 'numa-node' is configured
	via --host-policy, the pinned system memory of the pool size will be
	allocated on each numa node. This option will not affect the
	allocation conducted by the backend frameworks. Default is 256 MB.
  --cuda-memory-pool-byte-size <<integer>:<integer>>
	The total byte size that can be allocated as CUDA memory for
	the GPU device. If GPU support is enabled, the server will allocate
	CUDA memory to minimize data transfer between host and devices
	until it exceeds the specified byte size. This option will not affect
	the allocation conducted by the backend frameworks. The argument
	should be 2 integers separated by colons in the format <GPU device
	ID>:<pool byte size>. This option can be used multiple times, but only
	once per GPU device. Subsequent uses will overwrite previous uses for
	the same GPU device. Default is 64 MB.

@tgerdesnv
Collaborator

@benhgm Are you able to move to a newer version of Triton Server, ideally 23.03 to match your SDK version (or move both to the latest 23.05 release)? As you have observed, using different versions between the two can cause incompatibilities.
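For instance (a sketch, assuming both images are pulled from NGC), the server and SDK tags would be kept on the same release:

# Hypothetical: keep the server and SDK containers on matching releases
docker pull nvcr.io/nvidia/tritonserver:23.03-py3
docker pull nvcr.io/nvidia/tritonserver:23.03-py3-sdk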

@benhgm
Author

benhgm commented Jun 8, 2023

@tgerdesnv thanks for the advice! I tried that and was able to run a full analysis on my ensemble model. I got some very nice results and reports, but there is now one small error/glitch, where Model Analyzer twice reported No GPU metric corresponding to tag 'gpu_used_memory' found in the model's measurement. Possibly comparing measurements across devices.

From the message, I guess this is because I ran the analysis over a multi-GPU instance, and if I set the --gpus flag to a specific GPU UUID, I will be able to get these metrics. I will try it out and update if I face the same error.

Otherwise, how can I enable GPU metrics reporting even on a multi-GPU instance?

@nv-braf
Contributor

nv-braf commented Jun 8, 2023

This warning occurs when a measurement returned from Perf Analyzer does not contain an expected GPU metric, in this case the amount of memory used by the GPU.
Yes, please try to specify the GPU you want to profile on with the --gpus flag and let me know if this doesn't remove the warning.
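A minimal sketch of that invocation (the UUID below is a placeholder; the other flags follow the earlier command):

# Hypothetical: restrict profiling to a single GPU by UUID
model-analyzer profile \
    --model-repository=/model_repository \
    --profile-models=ensemble_model \
    --triton-launch-mode=docker \
    --gpus GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx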

@benhgm
Author

benhgm commented Jun 9, 2023

@nv-braf hello, I tried setting --gpus to the UUID of the GPU I want to use, and the model analysis began with the correct GPU that I specified. However, at the end of the analysis, I got the same error messages. Are there any other workarounds I can try?

@nv-braf
Contributor

nv-braf commented Jun 9, 2023

Are measurements being taken? Are the charts/data being output correctly at the end of the profile? If so, then it's probably safe to ignore this warning message.

@tgerdesnv
Collaborator

Have you specified CPU_ONLY anywhere in the original configuration? Do the resulting output model configurations have KIND_CPU or KIND_GPU under instance_group?

My concern if you are getting no GPU metrics is that nothing is actually running on the GPU.
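For reference, a minimal config.pbtxt fragment with a GPU instance_group (the instance count and GPU index here are illustrative):

# Hypothetical config.pbtxt fragment: one model instance on GPU 0
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]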

@benhgm
Author

benhgm commented Jun 12, 2023

Hi @tgerdesnv, you make a good point. I realised that although I had set KIND_GPU for all my models, in my pre- and post-processing models I did not explicitly move the models to the GPU with .to(torch.device("cuda")).

However, my main inference model (a CNN) has always been set to run on the GPU, so I am puzzled as to why no GPU metrics were recorded for it.
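For illustration, a minimal sketch of a Python-backend preprocessing model that places its work on the device Triton assigns to the instance; the tensor names and the normalization step are hypothetical, while the class and pb_utils calls follow the standard Python backend interface:

import torch
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # Use the GPU Triton assigned to this instance when the instance kind is GPU
        if args["model_instance_kind"] == "GPU":
            self.device = torch.device(f"cuda:{args['model_instance_device_id']}")
        else:
            self.device = torch.device("cpu")

    def execute(self, requests):
        responses = []
        for request in requests:
            # "INPUT" and "OUTPUT" are placeholder tensor names
            in_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT")
            x = torch.from_numpy(in_tensor.as_numpy()).to(self.device)

            y = (x - x.mean()) / (x.std() + 1e-6)  # hypothetical preprocessing

            out_tensor = pb_utils.Tensor("OUTPUT", y.cpu().numpy())
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses

Work that never touches the GPU will not show up in the GPU metrics, which is consistent with the warnings above.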

@tgerdesnv
Collaborator

Can you answer @nv-braf's question?

Are measurements being taken? Are the charts/data being output correctly at the end of the profile? If so, then it's probably safe to ignore this warning message.

Those warnings may show up if any of the models are running entirely on the CPU.

@benhgm
Author

benhgm commented Jun 12, 2023

Hi @tgerdesnv @nv-braf, my apologies, I missed the other question.

Yes, I am getting latency and throughput measurements; those are fine. I was just wondering how to make the GPU metrics appear.

@tgerdesnv I understand what you mean. However, in an ensemble model where I have a pipeline of preprocessing model -> CNN -> postprocessing model and only the CNN is on the GPU, should I expect GPU metrics to be recorded from the CNN even though the pre- and post-processing models are on the CPU?

@nv-braf
Contributor

nv-braf commented Jun 12, 2023

As long as you have not set the cpu_only flag, I would expect the composing config to gather GPU metrics, and they should be shown in the summary report. Can you confirm that you are not seeing any GPU metrics (like GPU utilization or GPU memory usage) in the summary report table?

@benhgm
Author

benhgm commented Jun 13, 2023

Hi @nv-braf, yes, I confirm that I did not use the cpu_only flag, and I did not see any GPU metrics.

@riyajatar37003

riyajatar37003 commented May 13, 2024

I am running the examples/add_sub model with a local launch mode and CPU instances, but I am getting the following error log in the Docker container:

root@cfbe7ff7cf1e:/app/ma# model-analyzer profile \
    --model-repository /app/ma/examples/quick-start \
    --profile-models add_sub \
    --output-model-repository-path /app/ma/output11 \
    --export-path profile_results --triton-launch-mode=local
[Model Analyzer] Starting a local Triton Server
[Model Analyzer] Loaded checkpoint from file /app/ma/checkpoints/0.ckpt
[Model Analyzer] GPU devices match checkpoint - skipping server metric acquisition
[Model Analyzer] Starting a local Triton Server
[Model Analyzer] Model add_sub load failed: [StatusCode.INTERNAL] failed to load 'add_sub', failed to poll from model repository
[Model Analyzer] Model readiness failed for model add_sub. Error [StatusCode.UNAVAILABLE] failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:8001: Failed to connect to remote host: Connection refused
[Model Analyzer] Model readiness failed for model add_sub. Error [StatusCode.UNAVAILABLE] failed to connect to all addresses; last error: UNKNOWN: ipv6:%5B::1%5D:8001: Failed to connect to remote host: Connection refused
[Model Analyzer] Saved checkpoint to /app/ma/checkpoints/1.ckpt
Traceback (most recent call last):
  File "/opt/app_venv/bin/model-analyzer", line 8, in <module>
    sys.exit(main())
  File "/opt/app_venv/lib/python3.10/site-packages/model_analyzer/entrypoint.py", line 278, in main
    analyzer.profile(
  File "/opt/app_venv/lib/python3.10/site-packages/model_analyzer/analyzer.py", line 124, in profile
    self._profile_models()
  File "/opt/app_venv/lib/python3.10/site-packages/model_analyzer/analyzer.py", line 242, in _profile_models
    self._model_manager.run_models(models=[model])
  File "/opt/app_venv/lib/python3.10/site-packages/model_analyzer/model_manager.py", line 118, in run_models
    self._check_for_ensemble_model_incompatibility(models)
  File "/opt/app_venv/lib/python3.10/site-packages/model_analyzer/model_manager.py", line 189, in _check_for_ensemble_model_incompatibility
    model_config = ModelConfig.create_from_profile_spec(
  File "/opt/app_venv/lib/python3.10/site-packages/model_analyzer/triton/model/model_config.py", line 270, in create_from_profile_spec
    model_config_dict = ModelConfig.create_model_config_dict(
  File "/opt/app_venv/lib/python3.10/site-packages/model_analyzer/triton/model/model_config.py", line 92, in create_model_config_dict
    config = ModelConfig._get_default_config_from_server(
  File "/opt/app_venv/lib/python3.10/site-packages/model_analyzer/triton/model/model_config.py", line 149, in _get_default_config_from_server
    config = client.get_model_config(model_name, config.client_max_retries)
  File "/opt/app_venv/lib/python3.10/site-packages/model_analyzer/triton/client/grpc_client.py", line 79, in get_model_config
    model_config_dict = self._client.get_model_config(model_name, as_json=True)
  File "/opt/app_venv/lib/python3.10/site-packages/tritonclient/grpc/_client.py", line 593, in get_model_config
    raise_error_grpc(rpc_error)
  File "/opt/app_venv/lib/python3.10/site-packages/tritonclient/grpc/_utils.py", line 77, in raise_error_grpc
    raise get_error_grpc(rpc_error) from None
tritonclient.utils.InferenceServerException: [StatusCode.UNAVAILABLE] failed to connect to all addresses; last error: UNKNOWN: ipv6:%5B::1%5D:8001: Failed to connect to remote host: Connection refused
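For context (a sketch, assuming the Python-backend add_sub example that ships with Model Analyzer), the "failed to poll from model repository" error typically means Triton did not find a valid model layout at the given path, which would be expected to look like:

/app/ma/examples/quick-start/
└── add_sub/
    ├── config.pbtxt
    └── 1/
        └── model.py

The subsequent "Connection refused" errors follow from the server shutting down after that load failure.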
