v14: latest libraries


Compared to the previous stable (v13.2) release:

General

vsmlrt.py

  • The plugin loading order in the get_plugin_path() function is now sorted to reduce memory consumption.
  • Added support for RIFE v4.7 ~ v4.16 (lite, ensemble) models.
  • Added support for SCUNet models for image denoising (both shown in the sketch below).
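
For orientation, a minimal sketch of calling the newly supported models through vsmlrt.py. The enum member spellings (RIFEModel.v4_16_lite, SCUNetModel.scunet_color_real_psnr) and the clip setup are illustrative assumptions; consult vsmlrt.py for the exact names.

```python
import vapoursynth as vs
from vsmlrt import RIFE, RIFEModel, SCUNet, SCUNetModel, Backend

core = vs.core
clip = core.std.BlankClip(format=vs.RGBS, width=1920, height=1080)

# Frame interpolation with one of the newly supported RIFE models
# (enum member name assumed).
interpolated = RIFE(clip, multi=2, model=RIFEModel.v4_16_lite,
                    backend=Backend.TRT(fp16=True))

# Image denoising with the new SCUNet support (model member name assumed).
denoised = SCUNet(clip, model=SCUNetModel.scunet_color_real_psnr,
                  backend=Backend.TRT(fp16=True))
```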

TRT

plugin and runtime libraries

  • Upgraded to TensorRT 10.0.1.
  • Maxwell and Pascal GPUs are no longer supported. Other backends still support these GPUs.
  • Reduced GPU memory usage for dynamically shaped engines when the actual tile size is smaller than the maximum tile size set during engine building.
  • Reduced engine build time.
  • Added long path support for engines on Windows.
  • cuDNN is no longer a strict runtime dependency.

vsmlrt.py

  • The cuDNN tactic is no longer enabled by default.
  • TF32 acceleration is disabled by default.
  • The default maximum workspace size is now None, which allows using up to the total memory size of the GPU.
  • Added parameters builder_optimization_level, max_aux_streams, bf16 (#64), custom_env, custom_args, short_path and engine_folder (#90), shown in the sketch after this list:
    • builder_optimization_level: "adjust how long TensorRT should spend searching for tactics with potentially better performance" (TensorRT documentation)
    • max_aux_streams: within-inference multi-streaming; "if enabled, TensorRT will run some layers on the auxiliary streams in parallel to the layers running on the main stream, ..., may increase the memory consumption, ..." (TensorRT documentation)
    • bf16: "TensorRT supports the bfloat16 (brain float) floating point format on NVIDIA Ampere and later architectures ... Note that not all layers support bfloat16." (TensorRT documentation)
    • custom_env, custom_args: custom environment variables and arguments for the trtexec engine build.
    • short_path: whether to shorten the engine name.
      • On Windows, this helps address the maximum path length limitation, and is enabled by default.
    • engine_folder: specifies a custom directory for engines.
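
A minimal sketch of these parameters on the TRT backend, with a DPIR call for context; the values are illustrative, not recommendations from this release.

```python
import vapoursynth as vs
from vsmlrt import DPIR, DPIRModel, Backend

core = vs.core
clip = core.std.BlankClip(format=vs.RGBS, width=1920, height=1080)

backend = Backend.TRT(
    fp16=True,
    builder_optimization_level=3,  # longer tactic search may yield faster engines
    max_aux_streams=None,          # let TensorRT decide on auxiliary streams
    bf16=False,                    # bfloat16 kernels (Ampere and later only)
    custom_env={},                 # extra environment variables for trtexec
    custom_args=[],                # extra command-line arguments for trtexec
    short_path=True,               # shorten engine names (default on Windows)
    engine_folder="engines",       # hypothetical custom engine directory
)

denoised = DPIR(clip, strength=5.0, model=DPIRModel.drunet_color, backend=backend)
```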

known issues

  • According to the documentation, there is an up to 4x performance regression for networks containing "GridSample" ops compared to TensorRT 9.2. This affects the RIFE and SAFA models.

  • trtexec may report errors like:

    • [E] Error[9]: Skipping tactic 0xded5318b4a444b84 due to exception Cask convolution execution
    • [E] Error[2]: [virtualMemoryBuffer.cpp::nvinfer1::StdVirtualMemoryBufferImpl::resizePhysical::140] Error Code 2: OutOfMemory (no further information)

    This issue has been submitted to NVIDIA.

ORT

  • Upgraded to ONNX Runtime v1.18.0.

interface

  • The ORT_* backends now support fp16 I/O. The semantics of the fp16 flag in these backends are as follows (see the sketch after this list):
    • Enabling fp16 uses a built-in quantization that converts an fp32 onnx to an fp16 onnx. If the input video is in a half-precision floating-point format, the generated fp16 onnx will use fp16 input. The output format can be controlled by the output_format option (0 = fp32, 1 = fp16).
    • Disabling fp16 skips the built-in quantization. However, if the onnx file itself uses fp16 for computation, the actual computation will still be done in fp16. In this case, the input video format should match the input format of the onnx, and the output format is inferred from the onnx.
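
For illustration, a sketch of both modes using the generic vsmlrt inference() helper with the ORT_CUDA backend; the model path is hypothetical, and passing output_format on the backend is an assumption.

```python
import vapoursynth as vs
from vsmlrt import inference, Backend

core = vs.core

# fp16 enabled: vsmlrt quantizes the fp32 onnx to fp16; the half-precision
# input clip makes the generated onnx take fp16 input, and output_format=1
# requests fp16 output as well (0 = fp32).
clip16 = core.std.BlankClip(format=vs.RGBH, width=1920, height=1080)
out_fp16 = inference(clip16, "model_fp32.onnx",  # hypothetical model path
                     backend=Backend.ORT_CUDA(fp16=True, output_format=1))

# fp16 disabled: no quantization; the clip format must match the input
# format declared in the onnx, and the output format follows the onnx.
clip32 = core.std.BlankClip(format=vs.RGBS, width=1920, height=1080)
out_fp32 = inference(clip32, "model_fp32.onnx",
                     backend=Backend.ORT_CUDA(fp16=False))
```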

CUDA

  • Reduced execution overhead.
  • Added support for TF32 acceleration. This is disabled by default.
  • Added experimental prefer_nhwc flag to reduce the number of layout transformations when using tensor cores. This is disabled by default (see the sketch below).
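
A minimal sketch of the new flags, assuming both are exposed as fields on Backend.ORT_CUDA; prefer_nhwc is named above, while tf32 as the field name is an assumption.

```python
from vsmlrt import Backend

# Both flags default to off in this release.
backend = Backend.ORT_CUDA(
    fp16=True,
    tf32=False,        # TF32 acceleration (field name assumed)
    prefer_nhwc=True,  # experimental: fewer layout transformations on tensor cores
)
```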

OV

  • Upgraded to OpenVINO 2024.2.0.
  • Added experimental OV_NPU backend for Intel NPUs.

MIGX

  • Added support for the MIGraphX backend for AMD GPUs. Currently this backend is Linux only (see the sketch below).
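
Both experimental backends added in this release (OV_NPU above and MIGX here) are selected like any other vsmlrt backend. A minimal sketch; the fp16 field on MIGX is assumed from the fp16 I/O contribution listed below.

```python
import vapoursynth as vs
from vsmlrt import SCUNet, SCUNetModel, Backend

core = vs.core
clip = core.std.BlankClip(format=vs.RGBS, width=1280, height=720)

npu = Backend.OV_NPU()          # Intel NPUs via OpenVINO (experimental)
migx = Backend.MIGX(fp16=True)  # AMD GPUs via MIGraphX (experimental, Linux only)

# SCUNetModel member name assumed, as in the sketch under "General" above.
denoised = SCUNet(clip, model=SCUNetModel.scunet_color_real_psnr, backend=migx)
```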

Community contributions

  • scripts/vsmlrt.py: update esrgan janai models by @hooke007 in #53
  • scripts/vsmlrt.py: add more esrgan janai models by @hooke007 in #82
  • vsmigx: allow fp16 input & output by @abihf in #86
  • scripts/vsmlrt.py: fix fp16 precision issues of RIFE v2 representations by @charlessuh in #66 (comment)

Benchmark

NVIDIA GeForce RTX 3090, 10496 shaders @ 1695 MHz, driver 552.22, Windows Server 2022, Python 3.11.9, vapoursynth-classic R57.A8

1920x1080 RGBS, TRT backend, CUDA graphs enabled, fp16

Measurements: FPS / Device Memory (MB)

| model | 1 stream | 2 streams | 3 streams |
|---|---|---|---|
| dpir color | 10.99 / 1715.172 | 11.62 / 3048.540 | 11.64 / 4381.912 |
| waifu2x upconv_7_{anime_style_art_rgb, photo} | 22.38 / 2016.352 | 32.66 / 3734.880 | 32.54 / 5453.404 |
| waifu2x cunet / cugan | 12.41 / 4359.284 | 15.53 / 8363.392 | 15.47 / 12367.504 |
| waifu2x swin_unet | 3.80 / 7304.332 | 4.06 / 14392.408 | 4.06 / 21276.380 |
| real-esrgan (v2/v3, xsx2) | 16.65 / 955.480 | 22.53 / 1645.904 | 22.49 / 2336.324 |
| scunet color | 4.20 / 2847.708 | 4.33 / 6646.884 | 4.33 / 9792.736 |

Also check benchmarks from previous pre-releases v14.test4 (NVIDIA RTX 2080 Ti/3090/4090 GPUs) and v14.test3 (NVIDIA RTX 4090 and AMD RX 7900 XTX GPUs).


This release uses CUDA 12.4.1, cuDNN 8.9.7, TensorRT 10.0.1, ONNX Runtime v1.18.0, OpenVINO 2024.2.0 and ncnn 20220915 b16f8ca.

Full Changelog: v13.2...v14