v14: latest libraries
Compared to the previous stable (v13.2) release:
## General
- External models are no longer packaged.
### vsmlrt.py
- Plugin invocation order in the `get_plugin_path()` function is sorted to reduce memory consumption.
- Added support for RIFE v4.7 ~ v4.16 (lite, ensemble) models.
- Added support for SCUNet models for image denoising (see the sketch below).
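For illustration, a minimal sketch of invoking one of the newly supported RIFE models; the `RIFEModel.v4_7` member name is assumed to follow the existing `v4_x` naming in vsmlrt.py, and the blank source clip is a stand-in. The new SCUNet models are invoked analogously through their own wrapper.

```python
import vapoursynth as vs
from vsmlrt import RIFE, RIFEModel, Backend

core = vs.core

# Stand-in source; RIFE expects an RGBS or RGBH clip.
src = core.std.BlankClip(format=vs.RGBS, width=1920, height=1088, length=240)

# 2x frame interpolation with the newly supported RIFE v4.7 model.
flt = RIFE(src, multi=2, model=RIFEModel.v4_7, backend=Backend.TRT(fp16=True))

flt.set_output()
```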
## TRT
### plugin and runtime libraries
- Upgraded to TensorRT 10.0.1.
- Maxwell and Pascal GPUs are no longer supported. Other backends still support these GPUs.
- Reduced GPU memory usage for dynamically shaped engines when the actual tile size is smaller than the maximum tile size set during engine building.
- Reduced engine build time.
- Added long path support for engines on Windows.
- cuDNN is no longer a strict runtime dependency.
### vsmlrt.py
- The cuDNN tactic is no longer enabled by default.
- TF32 acceleration is disabled by default.
- The maximum workspace is set to `None`, i.e. the total memory size of the GPU.
- Added parameters `builder_optimization_level`, `max_aux_streams`, `bf16` (#64), `custom_env`, `custom_args`, `short_path` and `engine_folder` (#90), as sketched after this list:
  - `builder_optimization_level`: "adjust how long TensorRT should spend searching for tactics with potentially better performance" link
  - `max_aux_streams`: within-inference multi-streaming; "if enabled, TensorRT will run some layers on the auxiliary streams in parallel to the layers running on the main stream, ..., may increase the memory consumption, ..." link
  - `bf16`: "TensorRT supports the bfloat16 (brain float) floating point format on NVIDIA Ampere and later architectures ... Note that not all layers support bfloat16." link
  - `custom_env`, `custom_args`: custom environment variables and arguments for the trtexec engine build.
  - `short_path`: whether to shorten the engine name.
    - On Windows, this could be useful in addressing the maximum path length limitation, and it is enabled by default.
  - `engine_folder`: used to specify a custom directory for engines.
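A minimal sketch of the new knobs, assuming they map one-to-one onto `Backend.TRT` keyword arguments as listed above; the model, source clip, folder path, and environment variable are illustrative.

```python
import vapoursynth as vs
from vsmlrt import DPIR, DPIRModel, Backend

core = vs.core
src = core.std.BlankClip(format=vs.RGBS, width=1920, height=1080)

backend = Backend.TRT(
    fp16=True,
    builder_optimization_level=3,      # spend longer searching for faster tactics
    max_aux_streams=0,                 # disable within-inference multi-streaming
    bf16=False,                        # bfloat16 kernels (Ampere and later only)
    short_path=True,                   # shorten engine names (default on Windows)
    engine_folder="/tmp/trt-engines",  # illustrative custom engine directory
    custom_env={"CUDA_MODULE_LOADING": "LAZY"},  # illustrative extra build env
    custom_args=[],                    # extra trtexec arguments, if any
)

flt = DPIR(src, strength=5.0, model=DPIRModel.drunet_color, backend=backend)
```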
### known issues
- According to the documentation, "There is an up to 4x performance regression for networks containing GridSample ops compared to TensorRT 9.2." This affects RIFE and SAFA models.
- trtexec may report errors like:

      [E] Error[9]: Skipping tactic 0xded5318b4a444b84 due to exception Cask convolution execution
      [E] Error[2]: [virtualMemoryBuffer.cpp::nvinfer1::StdVirtualMemoryBufferImpl::resizePhysical::140] Error Code 2: OutOfMemory (no further information)

  This issue has been submitted to NVIDIA.
## ORT
- Upgraded to ONNX Runtime v1.18.0.
### interface
- The `ORT_*` backends now support fp16 I/O. The semantics of the `fp16` flag in these backends is as follows (a sketch follows this list):
  - Enabling `fp16` will use a built-in quantization that converts an fp32 onnx to an fp16 onnx. If the input video is of half-precision floating-point format, the generated fp16 onnx will use fp16 input. The output format can be controlled by the `output_format` option (0 = fp32, 1 = fp16).
  - Disabling `fp16` will not use the built-in quantization. However, if the onnx file itself uses fp16 for computation, the actual computation will be done in fp16. In this case, the input video format should match the input format of the onnx, and the output format is inferred from the onnx.
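A minimal sketch of the first case, assuming `fp16` and `output_format` are passed to the `ORT_CUDA` backend constructor; the model and clip are illustrative.

```python
import vapoursynth as vs
from vsmlrt import Waifu2x, Waifu2xModel, Backend

core = vs.core

# Half-precision input clip, so the quantized fp16 onnx takes fp16 input.
src = core.std.BlankClip(format=vs.RGBH, width=1920, height=1080)

flt = Waifu2x(
    src, noise=-1, scale=2,
    model=Waifu2xModel.upconv_7_anime_style_art_rgb,
    # fp16=True applies the built-in fp32 -> fp16 quantization;
    # output_format=1 requests fp16 output (0 would give fp32).
    backend=Backend.ORT_CUDA(fp16=True, output_format=1),
)
```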
### CUDA
- Reduced execution overhead.
- Added support for TF32 acceleration. This is disabled by default.
- Added experimental `prefer_nhwc` flag to reduce the number of layout transformations when using tensor cores. This is disabled by default (see the sketch below).
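A one-line sketch of opting in; `prefer_nhwc` is assumed to be a keyword argument of `Backend.ORT_CUDA`, and it mainly targets tensor-core (fp16) inference.

```python
from vsmlrt import Backend

# NHWC layout avoids some NCHW <-> NHWC transposes around tensor-core kernels.
backend = Backend.ORT_CUDA(fp16=True, prefer_nhwc=True)
```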
## OV
- Upgraded to OpenVINO 2024.2.0.
- Added experimental `OV_NPU` backend for Intel NPUs (see the sketch below).
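A minimal sketch of selecting the new backend, assuming it is exposed as `Backend.OV_NPU()` after the `OV_CPU`/`OV_GPU` pattern; the model and clip are illustrative.

```python
import vapoursynth as vs
from vsmlrt import Waifu2x, Waifu2xModel, Backend

core = vs.core
src = core.std.BlankClip(format=vs.RGBS, width=1280, height=720)

# Run inference on an Intel NPU through OpenVINO.
flt = Waifu2x(src, noise=-1, scale=2,
              model=Waifu2xModel.upconv_7_anime_style_art_rgb,
              backend=Backend.OV_NPU())
```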
## MIGX
- Added support for the MIGraphX backend for AMD GPUs. Currently this backend is Linux only (see the sketch below).
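A minimal sketch, assuming the new backend is exposed as `Backend.MIGX` with an `fp16` flag like the other backends; fp16 I/O is available per the vsmigx contribution listed under community contributions.

```python
import vapoursynth as vs
from vsmlrt import DPIR, DPIRModel, Backend

core = vs.core
src = core.std.BlankClip(format=vs.RGBH, width=1920, height=1080)

# MIGraphX-backed inference on AMD GPUs (Linux only), with fp16 input/output.
flt = DPIR(src, strength=5.0, model=DPIRModel.drunet_color,
           backend=Backend.MIGX(fp16=True))
```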
## Community contributions
- `scripts/vsmlrt.py`: update esrgan janai models by @hooke007 in #53
- `scripts/vsmlrt.py`: add more esrgan janai models by @hooke007 in #82
- `vsmigx`: allow fp16 input & output by @abihf in #86
- `scripts/vsmlrt.py`: fix fp16 precision issues of RIFE v2 representations by @charlessuh in #66 (comment)
## Benchmark
NVIDIA GeForce RTX 3090, 10496 shaders @ 1695 MHz, driver 552.22, Windows Server 2022, Python 3.11.9, vapoursynth-classic R57.A8
1920x1080 RGBS, TRT backend, CUDA graphs enabled, fp16
Measurements: FPS / Device Memory (MB)
model | 1 stream | 2 streams | 3 streams |
---|---|---|---|
dpir color | 10.99 / 1715.172 | 11.62 / 3048.540 | 11.64 / 4381.912 |
waifu2x upconv_7_{anime_style_art_rgb, photo} | 22.38 / 2016.352 | 32.66 / 3734.880 | 32.54 / 5453.404 |
waifu2x cunet / cugan | 12.41 / 4359.284 | 15.53 / 8363.392 | 15.47 / 12367.504 |
waifu2x swin_unet | 3.80 / 7304.332 | 4.06 / 14392.408 | 4.06 / 21276.380 |
real-esrgan (v2/v3, xsx2) | 16.65 / 955.480 | 22.53 / 1645.904 | 22.49 / 2336.324 |
scunet color | 4.20 / 2847.708 | 4.33 / 6646.884 | 4.33 / 9792.736 |
Also check benchmarks from previous pre-releases v14.test4 (NVIDIA RTX 2080 Ti/3090/4090 GPUs) and v14.test3 (NVIDIA RTX 4090 and AMD RX 7900 XTX GPUs).
This release uses CUDA 12.4.1, cuDNN 8.9.7, TensorRT 10.0.1, ONNX Runtime v1.18.0, OpenVINO 2024.2.0 and ncnn 20220915 b16f8ca.
Full Changelog: v13.2...v14