
Releases: AmusementClub/vs-mlrt

v12: latest CUDA libraries

01 Nov 10:57

Compared to v11, this release updates the CUDA dependencies to CUDA 11.8.0, cuDNN 8.6.0 and TensorRT 8.5.1:

  • Added support for the NVIDIA 40 series GPUs.
  • Added support for RIFE on the trt backend.

Known issues

  • Performance of the OV_CPU and ORT_CUDA(fp16=True) backends for RIFE is lower than expected; this is under investigation. Please use ORT_CPU or ORT_CUDA(fp16=False) for now.
  • The NCNN_VK backend does not support RIFE.

Installation Notes

For some advanced features, vsmlrt.py requires the numpy and onnx packages to be available. You might need to run pip install onnx numpy.

Benchmark

previous benchmark

Configuration: NVIDIA RTX 3090, driver 526.47, Windows Server 2019, VapourSynth R60, Python 3.11.0, 1080p, fp16

Backends: ort-cuda, trt from vs-mlrt v12.

For the trt backend, the engine is built without the CUDA_MODULE_LOADING=LAZY environment variable; the variable is then set during benchmarking to reduce device memory consumption.

Data format: fps / GPU memory usage (MB)

rife(model=44, 1920x1088)

| backend  | 1 stream   | 2 streams  |
|----------|------------|------------|
| ort-cuda | 53.62/1771 | 83.34/2748 |
| trt      | 71.30/626  | 107.3/962  |

dpir color

| backend  | 1 stream   | 2 streams  |
|----------|------------|------------|
| ort-cuda | 4.64/3230  |            |
| trt      | 10.32/1992 | 11.61/3475 |

waifu2x upconv_7

| backend  | 1 stream   | 2 streams   |
|----------|------------|-------------|
| ort-cuda | 11.07/5916 | 15.04/10899 |
| trt      | 18.38/2092 | 31.64/3848  |

waifu2x cunet

| backend  | 1 stream   | 2 streams  |
|----------|------------|------------|
| ort-cuda | 4.63/8541  | 5.32/16148 |
| trt      | 11.44/4771 | 15.59/8972 |

realesrgan v2/v3

| backend  | 1 stream   | 2 streams  |
|----------|------------|------------|
| ort-cuda | 8.84/2283  | 11.10/4202 |
| trt      | 14.59/1324 | 21.37/2174 |

v11: RIFE support

26 Oct 00:37

Added support for the RIFE video frame interpolation algorithm.

There are two APIs for RIFE:

  • vsmlrt.RIFE is a high-level API for interpolating a clip. Set the multi argument to specify the fps factor, and remember to perform scene detection on the input clip first (see the sketch after this list).
  • vsmlrt.RIFEMerge is a novel temporal std.MaskedMerge-like interface for RIFE. Use it if you want to precisely control the frames and/or time points of the interpolation.
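
A minimal sketch of the high-level API, assuming the ORT_CUDA backend and the misc plugin for scene detection; the source filter and file name are illustrative:

    import vapoursynth as vs
    from vsmlrt import RIFE, Backend

    core = vs.core
    clip = core.lsmas.LWLibavSource("input.mkv")  # illustrative source filter
    # RIFE models operate on fp32 RGB frames
    clip = core.resize.Bicubic(clip, format=vs.RGBS, matrix_in_s="709")
    # mark scene changes so RIFE does not interpolate across cuts
    clip = core.misc.SCDetect(clip)
    # multi=2 doubles the frame rate
    interpolated = RIFE(clip, multi=2, backend=Backend.ORT_CUDA(fp16=False))
    interpolated.set_output()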

Known issues

  • vstrt doesn't support RIFE for the moment [1]. The next release of TensorRT should include RIFE support, and we will release v12 when that happens.

  • The vstrt backend also doesn't yet support the latest RTX 4000 series GPUs. This will be fixed after upgrading to the upcoming TensorRT 8.5 release. RTX 4000 series GPU owners, please use the other CUDA backends for now.

  • Users of the OV_GPU backend may experience errors like Exceeded max size of memory object allocation: Requested 11456040960 bytes but max alloc size is 4294959104 bytes. Please consider tiling for now (see the sketch after the footnotes).

    The reason is that the openvino library follows the OpenCL standard's restriction on memory object allocation (CL_DEVICE_MAX_MEM_ALLOC_SIZE). For most existing Intel GPUs (Gen9 and later), the driver imposes a maximum allocation size of ~4 GiB [2].

  1. It's missing grid_sample operator support; see https://github.com/onnx/onnx-tensorrt/blob/main/docs/operators.md.

  2. This value is derived from here, which states that devices not supporting sharedSystemMemCapabilities have a maximum allowed allocation size of 4294959104 bytes.
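
A hedged sketch of the tiling workaround through vsmlrt.py (the tiles parameter splits each frame into independently inferred sub-regions; DPIR and its settings are purely illustrative):

    from vsmlrt import DPIR, Backend

    # clip: an RGBS clip; two tiles roughly halve each memory object,
    # keeping allocations under the ~4 GiB OpenCL limit
    output = DPIR(clip, strength=5, tiles=2, backend=Backend.OV_GPU(fp16=True))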

v11.test

23 Sep 07:08
Pre-release

Internal testing only.

Added support for the RIFE video frame interpolation algorithm. Some features are still being implemented. The Python RIFE model wrapper interface is still subject to change.

Known issue

  • Users of the OV_GPU backend may experience errors like Exceeded max size of memory object allocation: Requested 11456040960 bytes but max alloc size is 4294959104 bytes. Please consider tiling for now.

    The reason is that the openvino library follows the OpenCL standard's restriction on memory object allocation (CL_DEVICE_MAX_MEM_ALLOC_SIZE). For most existing Intel GPUs (Gen9 and later), the driver imposes a maximum allocation size of ~4 GiB [1].

  1. This value is derived from here, which states that devices not supporting sharedSystemMemCapabilities have a maximum allowed allocation size of 4294959104 bytes.

Model Release 20220923, RIFE model

23 Sep 07:22
Pre-release

New modules (compared to previous model release):

  • RIFE v4.0 from vs-rife v2.0.0. rife/rife_v4.0.onnx, config: fastmode=True, ensemble=False
  • RIFE v4.2, v4.3, v4.4, v4.5, v4.6, v4.7, v4.8, v4.9, v4.10 from Practical-RIFE. rife/rife_{v4.2,v4.3,v4.4,v4.5,v4.6,v4.7,v4.8,v4.9,v4.10}.onnx, config: fastmode=True, ensemble=False
  • Other provided RIFE models can be found here, including the v2 representation of the RIFE v4.7-v4.10 models. Sorry for the inconvenience.

Notes:

  • For RIFE on ort-gpu, vs-mlrt v11 or later is suggested for best performance. As of v11, only ov-cpu, ort-cpu, ort-cuda and trt (pending a new TensorRT release) support RIFE. Specifically, ncnn-vk does not support RIFE due to the missing gridsample op.

v10: new Vulkan-based vsncnn (AMD GPUs supported)

15 Sep 11:02

Release Highlight

Added Vulkan-based AMD GPU support with the new vsncnn-vk backend.

Major features

  • Introduced the ncnn-based vsncnn plugin, which supports any GPU with Vulkan support (NVIDIA, AMD, and Intel integrated & discrete).
    • Good news for AMD GPU users! vs-mlrt has finally achieved full platform coverage: from x86 CPUs to GPUs of all three major vendors.
    • Please refer to the benchmark below for performance details. TL;DR: it's comparable to vsort-cuda on most networks (except waifu2x-cunet), but (significantly) slower than vstrt. Owing to its C++ implementation, it's generally faster than Python-based ncnn implementations.
    • Hint: if your GPU has enough memory, consider setting num_streams>1 to extract more performance (see the sketch after this list).
    • Even though it's possible to use software-based Vulkan implementations (as we did in the GHA tests), for CPU-only inference it's much better to use vsov-cpu (or vsort-cpu).
  • Introduced a new, smaller Vulkan-based GPU binary package (vsmlrt-windows-x64-vk.v10.7z) that only includes vsov-{cpu,gpu}, vsort-cpu and vsncnn-vk. Use this if you only have an Intel/AMD GPU, or don't want to download 1 GB of data in exchange for a backend that is merely 2~8x faster. Now there shouldn't be any reason not to use vs-mlrt.
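
A hedged sketch of the new backend through vsmlrt.py (parameter names follow the wrapper; the DPIR model choice is illustrative):

    from vsmlrt import DPIR, DPIRModel, Backend

    # works on any Vulkan-capable GPU (NVIDIA, AMD, Intel);
    # num_streams=2 trades extra GPU memory for throughput
    backend = Backend.NCNN_VK(fp16=True, num_streams=2)
    output = DPIR(clip, strength=5, model=DPIRModel.drunet_color, backend=backend)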

Benchmark

Configuration: NVIDIA RTX 3090, driver 516.94, Windows Server 2019, VapourSynth R60, Python 3.10.7, 1080p, fp16

Backends: ncnn-vk, ort-cuda, trt from vs-mlrt v10, dpir-ncnn v2.0.0, w2xncnnvk r2

Data format: fps / GPU memory usage (MB)

dpir color

| backend   | 1 stream   | 2 streams  |
|-----------|------------|------------|
| ncnn-vk   | 4.33/3347  | 4.72/6119  |
| ort-cuda  | 4.56/3595  |            |
| trt       | 10.64/2595 | 11.10/4593 |
| dpir-ncnn | 3.68/3326  |            |

waifu2x upconv_7

| backend   | 1 stream   | 2 streams   |
|-----------|------------|-------------|
| ncnn-vk   | 9.46/6820  | 14.71/13468 |
| ort-cuda  | 12.10/6411 | 13.98/11273 |
| trt       | 21.32/3317 | 29.10/5053  |
| w2xncnnvk | 6.68/6931  | 12.70/13626 |

waifu2x cunet

| backend   | 1 stream   | 2 streams  |
|-----------|------------|------------|
| ncnn-vk   | 1.46/11908 | 1.53/23574 |
| ort-cuda  | 4.85/8793  | 5.18/16231 |
| trt       | 11.60/4960 | 15.60/9057 |
| w2xncnnvk | 1.38/11966 | 1.58/23687 |

realesrgan v2/v3

| backend  | 1 stream   | 2 streams  |
|----------|------------|------------|
| ncnn-vk  | 7.23/2781  | 8.35/5330  |
| ort-cuda | 9.05/2669  | 10.18/4539 |
| trt      | 15.93/1667 | 19.58/2543 |

v10.pre

14 Sep 10:20
Pre-release

This is a pre-release for testing & benchmarking purposes only.
For production use, please use the official v10 release.

Release Highlight

Added Vulkan-based AMD GPU support with the new vsncnn-vk backend.

Major features

  • Introduced the ncnn-based vsncnn plugin, which supports any GPU with Vulkan support (NVIDIA, AMD, and Intel integrated & discrete). Good news for AMD GPU users! vs-mlrt has finally achieved full platform coverage: from x86 CPUs to GPUs of all three major vendors.
  • Introduced a new, smaller Vulkan-based GPU binary package (vsmlrt-windows-x64-vk.v10.pre.7z) that only includes vsov-{cpu,gpu}, vsort-cpu and vsncnn-vk. Use this if you only have an Intel/AMD GPU, or don't want to download 1 GB of data in exchange for a backend that is merely 3x faster. Now there shouldn't be any reason not to use vs-mlrt.

v9.2

07 Aug 07:48

Fixed issues

  • In vs-mlrt v9 and v9.1 on Windows, the ORT_CUDA backend may fail with an out-of-memory error when processing a non-initial frame. This has been fixed, and performance should be improved.
  • The use_cuda_graph parameter of the ORT_CUDA backend now works properly on Windows. However, using it is currently not recommended.

Full Changelog: v9.1...v9.2

v9.1

28 Jul 07:25

Bugfix release for v9. Recommended update for v9 users.
Please see the v9 release notes for all the major new features.

  • Fixed ort_cuda fp16 inference for the CUGAN(version=2) model.

    A new parameter, fp16_blacklist_ops, is introduced in the ort and ov backends to work around other issues possibly related to reduced precision.

    Please still carefully review the output of fp16-accelerated CUGAN(version=2).

  • Conform with CUGAN(version=2)'s dynamic range compression. This feature is enabled by setting conformance=True (the default) in the CUGAN wrapper in vsmlrt.py, and is implemented as:

    clip = clip.std.Expr("x 0.7 * 0.15 +")
    clip = CUGAN(clip, version=2)
    clip = clip.std.Expr("x 0.15 - 0.7 /")

Known issues

  • These two issues are fixed in the v9.2 release.
    • The ORT_CUDA backend allocates memory during inference. This degrades performance and may result in an out-of-memory error.
    • Parameter use_cuda_graph of the ORT_CUDA backend is broken on Windows.

Full Changelog: v9...v9.1

v9 Major release: Intel GPU support & much more

25 Mar 10:46

This is a major release.

  • Added support for Intel GPUs (both discrete [Xe Arc series] and integrated [Gen 8+ on Broadwell+])

    • In vsmlrt.py, this corresponds to the OV_GPU backend.
    • The openvino library is now dynamically linked because of the integration of oneDNN for GPU.
  • Added support for RealESRGANv3 and cugan-pro models.

  • Upgraded CUDA toolkit to 11.7.0, TensorRT to 8.4.1 and cuDNN to 8.4.1. It is now possible to build TRT engines for CUGAN, waifu2x cunet and upresnet10 models on RTX 2000 and RTX 3000 series GPUs.

  • The trt backend in the vsmlrt.py wrapper now creates a log file for trtexec output in the TEMP directory (this only works when using the bundled trtexec.exe). The log file is only retained if trtexec fails, and the vsmlrt exception message will include the full path of the log file. If you want the log to go to a specific file, set the environment variable TRTEXEC_LOG_FILE to the absolute path of the log file. If you don't want this behavior, set log=False when creating the backend (e.g. vsmlrt.Backend.TRT(log=False)).

  • The cuda bundles now include VC runtime DLLs as well, so trtexec.exe should run even on systems without proper VC runtime redistributable packages installed (e.g. freshly installed Windows).

  • The ov backend can now configure model compilation via config. Available configurations can be found here.

    • Example:

      core.ov.Model(..., config = lambda: dict(CPU_THROUGHPUT_STREAMS=core.num_threads, CPU_BIND_THREAD="NO"))

      This configuration may be useful in improving processor utilization, at the expense of significantly increased memory consumption (only try this if you have a huge number of cores underutilized by the default settings).

      The equivalent form for the python wrapper is

      backend = vsmlrt.Backend.OV_CPU(num_streams=core.num_threads, bind_thread=False)
  • When using the vsmlrt.py wrapper, it will no longer create temporary onnx files (e.g. when using non-default alpha CUGAN parameters). Instead, the modified ONNX network is passed directly into the various ML runtime filters, which now support (network_path=b'raw onnx protobuf serialization', path_is_serialization=True) for this. This feature also opens the door for generating ONNX on the fly (e.g. ever dreamed of GPU-accelerated 2d-convolution or std.Expr?); see the sketch below.
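
A hedged sketch of the raw-serialization path (assumes the Python onnx package; the file name and the in-memory edit step are illustrative):

    import onnx

    model = onnx.load("model.onnx")  # load an existing network
    # ... modify the in-memory ONNX graph here ...
    raw = model.SerializeToString()  # raw onnx protobuf serialization
    clip = core.ort.Model(clip, network_path=raw, path_is_serialization=True, provider="CUDA")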

Update Instructions

  1. Delete the previous vsmlrt-cuda, vsov, vsort and vstrt directories and vsov.dll, vsort.dll and vstrt.dll from your VS plugins directory, and then extract the newly released files. (Specifically, do not leave files from the previous version and just overwrite with the new release, as the new release might have removed some files in those four directories.)
  2. Replace vsmlrt.py in your Python package directory.
  3. Update the models directories by overwriting them with the new release. (Models are generally append-only. We will make special notices and bump the model release tag if we change any previously released models.)

Compatibility Notes

vsmlrt.py in this release is not compatible with binaries from previous releases; only script-level compatibility is maintained. Generally, please make sure to upgrade the filters and vsmlrt.py as a whole.

We strive to maintain script source-level compatibility as much as possible (i.e. there won't be an api4-style breakage), which means scripts written for v7 (for example) will continue to function for the foreseeable future. Minor issues (like the non-monotonic denoise setting of cugan) will be documented instead of fixed with a breaking change.

Known issue

CUGAN(version=2) (a.k.a. cugan-pro) may produce a blank clip when using the ORT_CUDA(fp16) backend. This is fixed in the v10 release.

Full Changelog: v8...v9

v8: latest CUDA libraries and ~10% faster

12 Mar 06:43
  • This release upgrades the CUDA libraries to their latest versions. Models are observed to be accelerated by ~1.1x.
  • vsmlrt.CUGAN() now accepts a new parameter, alpha, which controls the strength of filtering (see the sketch below). Setting alpha to a non-default value requires the Python onnx package (but this might change in the future).
  • Added a tf32 parameter to the trt backend in vsmlrt.py. TF32 acceleration is enabled by default on Ampere GPUs, mostly benefits fp32 inference, and has no effect on other architectures.
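
A hedged sketch of the new alpha parameter (the values and backend choice are illustrative; parameter names follow vsmlrt.py):

    from vsmlrt import CUGAN, Backend

    # alpha tweaks filtering strength; non-default values rewrite the
    # ONNX weights on the fly, hence the onnx package requirement
    output = CUGAN(clip, noise=-1, scale=2, alpha=0.85, backend=Backend.TRT(fp16=True))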