oneDNN implements most of the heavy lifting for CNN inference, and the major DL frameworks, including OpenVINO, are all built on top of oneDNN. Why, then, does OpenVINO have better performance than the other DL frameworks? Tickets: 67678
---
Based on ticket 67678, we can answer this by comparing the inference performance of some well-known networks between OpenVINO and PaddlePaddle.

**Compile Paddle:** https://www.paddlepaddle.org.cn/documentation/docs/zh/1.5/beginners_guide/install/compile/compile_Ubuntu.html

Note: compiling Paddle may open a lot of files at the same time; Ubuntu 18.04's default limit on simultaneously open files is 1024 (see `ulimit -a`), so this value needs to be raised.
**Inference using Paddle**

There is a repo for inference using Paddle:

```sh
git clone https://github.com/PaddlePaddle/Paddle-Inference-Demo.git
cd Paddle-Inference-Demo/c++/lib
# create the paddle_inference softlink
ln -sf ~/paddle-venv/Paddle/_build/paddle_inference_install_dir paddle_inference
cd resnet50
# change compile.sh: `WITH_GPU=OFF`
./compile.sh
# run.sh will download/extract/run the resnet50 model (folder resnet50)
chmod +x ./run.sh
./run.sh
# we can re-trigger the test manually with a different config
./build/resnet50_test --model_file resnet50/inference.pdmodel --params_file resnet50/inference.pdiparams --repeats 10 --batch_size 10
```

But only one core (and one thread) is used during the whole process; on average, one image takes 84 ms to infer with Paddle's ResNet-50.
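For context, Paddle's C++ inference `Config` does expose CPU threading knobs; below is a minimal sketch of how the single-thread default could be widened. The thread count of 4 is an assumption chosen to match the 4-core machine used below, and the wiring is illustrative, not taken from the demo's own code:

```cpp
#include "paddle_inference_api.h"  // Paddle Inference C++ API

int main() {
    paddle_infer::Config config;
    config.SetModel("resnet50/inference.pdmodel",
                    "resnet50/inference.pdiparams");
    config.DisableGpu();
    // By default the CPU math library runs single-threaded, which matches
    // the one-core behaviour observed above; allow 4 threads instead
    // (assumed value, matching the 4-core test machine).
    config.SetCpuMathLibraryNumThreads(4);
    config.EnableMKLDNN();  // route supported ops through oneDNN
    auto predictor = paddle_infer::CreatePredictor(config);
    return 0;
}
```

**Compile and install OpenVINO:**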
```sh
# use a virtual env to install the model optimizer's python dependencies
virtualenv ov
. ./ov/bin/activate
cd ~/openvino/_build/install/tools/model_optimizer/install_prerequisites/
./install_prerequisites_caffe.sh
./install_prerequisites_onnx.sh
./install_prerequisites_tf.sh
```

**Inference using OpenVINO**

Prepare (see https://github.com/openvinotoolkit/open_model_zoo/blob/master/tools/model_tools/README.md):

```sh
git clone [email protected]:openvinotoolkit/open_model_zoo.git
cd open_model_zoo/tools/downloader
pip install -r ./requirements-pytorch.in
pip install -r ./requirements-caffe2.in
pip install -r ./requirements-tensorflow.in
```

Download and convert the model:

```sh
cd open_model_zoo/tools/downloader
./downloader.py --name resnet-50-pytorch
./converter.py --name resnet-50-pytorch --precisions FP32
```

Inference with benchmark_app:

```sh
$ ./benchmark_app -m /home/hddl/open_model_zoo/tools/downloader/public/resnet-50-pytorch/FP32/resnet-50-pytorch.xml -d CPU
```
```
Count:      2804 iterations
Duration:   60108.46 ms
Latency:    81.84 ms
Throughput: 46.65 FPS
```

Since benchmark_app uses all 4 cores (8 hyper-threads), the expected throughput is roughly 4 × 1000 / 81.84 ≈ 48.9 FPS, close to the measured 46.65 FPS.

```sh
$ ./benchmark_app -m /home/hddl/open_model_zoo/tools/downloader/public/resnet-50-pytorch/FP32/resnet-50-pytorch.xml -d CPU -nstreams=1 -nthreads=1
```
```
Count:      741 iterations
Duration:   60143.24 ms
Latency:    80.29 ms    <== Paddle takes 84 ms
Throughput: 12.32 FPS
```

Luocheng's method to show per-layer performance: at utils.py line 82, add `config.enable_profile()`.
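As an aside, the single-stream, single-thread setup from the second benchmark_app run above can also be requested directly from the Inference Engine C++ API. A minimal sketch under the 2021-era API; the config key strings are InferenceEngine's documented ones, the model path abbreviates the converted model above, and input handling is omitted:

```cpp
#include <inference_engine.hpp>

int main() {
    InferenceEngine::Core core;
    auto network = core.ReadNetwork(
        "public/resnet-50-pytorch/FP32/resnet-50-pytorch.xml");
    // Mirror `benchmark_app -nstreams=1 -nthreads=1`: one throughput
    // stream, pinned to a single CPU thread.
    auto exec = core.LoadNetwork(network, "CPU",
        {{"CPU_THROUGHPUT_STREAMS", "1"}, {"CPU_THREADS_NUM", "1"}});
    auto request = exec.CreateInferRequest();
    request.Infer();  // one synchronous inference (inputs left default here)
    return 0;
}
```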
---
**Paddle framework is heavy for inference**

Since Paddle reuses the same framework for both training and inference, it treats inference the same as a training forward pass, which limits the optimization that can be done at the framework level. OpenVINO, on the other hand, carries no such burden: it is designed specifically for inference, so its framework cost is much lower than Paddle's.

This framework-level cost may not be a big deal for training, where the heavy computation dominates the overall latency, but it becomes considerable when inference runs a light-weight CNN like MobileNetV2: each convolution layer in such networks executes extremely fast (they are depth-wise or 1x1 convolutions), while the framework cost stays the same for every kind of convolution.

**std::string is a bad choice as a map key**

The Paddle framework uses std::string as the key of its map structures, even when the caller actually passes a string literal as the key; this requires run-time string construction and hashing on every lookup.
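To illustrate (with made-up map contents, not Paddle's actual code): with a plain `std::map<std::string, T>`, every lookup by string literal must construct a temporary std::string; a transparent comparator (`std::less<>`, C++14) enables heterogeneous lookup that skips the temporary (std::unordered_map needs a transparent hash, available from C++20, for the same trick):

```cpp
#include <iostream>
#include <map>
#include <string>
#include <string_view>

int main() {
    // Plain map: find("conv2d") builds a temporary std::string
    // (allocation + copy) at every call site that passes a literal.
    std::map<std::string, int> ops{{"conv2d", 1}, {"relu", 2}};
    auto a = ops.find("conv2d");

    // Transparent comparator: the literal (or a string_view) is compared
    // directly against the stored keys, with no temporary std::string.
    std::map<std::string, int, std::less<>> ops2{{"conv2d", 1}, {"relu", 2}};
    auto b = ops2.find("conv2d");
    auto c = ops2.find(std::string_view{"relu"});

    std::cout << a->second << b->second << c->second << "\n";
}
```

**Performance analysis**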
```
$ sudo -E perf stat -B -e cache-references,cache-misses,cycles,instructions sleep 1

 Performance counter stats for 'sleep 1':

        65,532   cache-references
        30,131   cache-misses          # 45.979 % of all cache refs
     1,329,247   cycles
     1,050,063   instructions          # 0.79 insn per cycle

   1.001454689 seconds time elapsed
   0.001199000 seconds user
   0.000000000 seconds sys

$ sudo -E perf stat -B -e cache-references,cache-misses,cycles,instructions sleep 10

 Performance counter stats for 'sleep 10':

        77,823   cache-references
        35,103   cache-misses          # 45.106 % of all cache refs
     1,515,698   cycles
     1,050,428   instructions          # 0.69 insn per cycle

  10.001294553 seconds time elapsed
   0.001234000 seconds user
   0.000000000 seconds sys
```