Shape inference API overhead #10

usstq · 2021-12-01T06:19:56Z

usstq
Dec 1, 2021
Maintainer

shape infer has following API interface, and caller has to construct vector & map which seems to be heavy, how to test if that's a problem?

template <class T>
void shape_infer(const Bucketize* op,
                 const std::vector<T>& input_shapes,
                 std::vector<T>& output_shapes,
                 const std::map<size_t, std::shared_ptr<ngraph::runtime::HostTensor>>& constant_data)

usstq · 2021-12-01T08:38:32Z

usstq
Dec 1, 2021
Maintainer Author

We can build a test case for answering this question, here we do not only test shape_inferernce(), but the shapeInfer() callback in MKLDNN plugin so the API overhead will be included.

case1: do all work entirely in CPU plugin's shapeInfer() call back, to make it easier, we make a fake shapeInfer.
case2: in shapeInfer(), call shape inference instead

calculate the time difference as the cost of API. than we got a estimate.

test case using split op

here we constructed following test (following the implementation in MKLDNNReshapeNode::shapeInfer() )

using VectorDims = std::vector<size_t>;

std::vector<VectorDims> shapeInfer0(const op::v1::Split* op, VectorDims* dims_in, int dims_cnt, int64_t* axes_values) {
    NODE_VALIDATION_CHECK(op, (dims_cnt == 2));

    const auto& data_ps = dims_in[0];
    const auto& axis_ps = dims_in[1];

    NODE_VALIDATION_CHECK(op, axis_ps.size() == 0, "'axis' input must be a scalar. Got: ", axis_ps.size());

    auto each_output_shape = data_ps;
    const auto data_rank = data_ps.size();

    auto num_splits = op->get_num_splits();

    auto axis = ov::normalize_axis(op, axes_values[0], ov::Rank(data_rank));

    const auto dimension_at_axis = data_ps[axis];

    NODE_VALIDATION_CHECK(op,
                          dimension_at_axis % num_splits == 0,
                          "Dimension of data input shape along 'axis': ",
                          dimension_at_axis,
                          " must be evenly divisible by 'num_splits' attribute value: ",
                          num_splits);

    each_output_shape[axis] = dimension_at_axis / num_splits;

    std::vector<VectorDims> ret;
    for (size_t i = 0; i < num_splits; ++i)
        ret.push_back(each_output_shape);

    return ret;
}

std::vector<VectorDims> shapeInfer1(const op::v1::Split* op, VectorDims* dims_in, int dims_cnt, int64_t* axes_values) {
    std::vector<StaticShape> input_shapes = {dims_in[0], dims_in[1]};

    std::map<size_t, std::shared_ptr<ngraph::runtime::HostTensor>> input_values = {
        {1, std::make_shared<ngraph::runtime::HostTensor>(ngraph::element::Type_t::i32, ov::Shape{}, axes_values)}};

    std::vector<StaticShape> output_shapes = {StaticShape{}};

    shape_infer(op, input_shapes, output_shapes, input_values);

    std::vector<VectorDims> result(output_shapes.size());
    std::transform(output_shapes.begin(), output_shapes.end(), result.begin(), [](const ov::StaticShape& s) {
        return s.to_shape();
    });
    return result;
}

TEST(StaticShapeInferenceTest, SplitV1_S2) {
    std::vector<VectorDims> dims_in = {{2, 8, 4}, {}};
    std::vector<VectorDims> ret;
    int64_t axes_values[2] = {2, 2};
    auto op = build_split(PartialShape({2, 8, 4}), {}, 4);

    PERF_TEST(ret = shapeInfer0(op.get(), dims_in.data(), dims_in.size(), axes_values));
    PERF_TEST(ret = shapeInfer1(op.get(), dims_in.data(), dims_in.size(), axes_values));
}

test result

I use sudo cpupower frequency-set -d 2.0GHz -u 2.0GHz -g performance to fix my CPU frequency and set PERFN=100000 to get stable result.

shapeInfer0: 339 ns
shapeInfer1: 2044 ns

if we put PERF_TEST into shapeInfer1(), just around shape_infer() API, the result is ~1000ns, so surprisingly, current shape_inference API overhead is considerable comparing to the simplest possible implementation.

Here you may say, shapeInfer0() is not template generic code, but still, it's so surprising to see that generic code is slower than best possible implemented by so big margin.

2 replies

usstq Dec 2, 2021
Maintainer Author

Anaylze with vtune

vtune shows following hot spots:

in shapeinfer1()

85 shape_infer:
- 57 get_data_as_int64
  - 28 Constant OP constructor
  - 13 cast_vector
  - 10 Node destructor
  - 3 Constant OP destructor
  - 0.1 map.at
- 11 vector::push_back
- 4 normalize_axis
- ...
26 make_shared:
25 transform:
05 vector destructor
02 StaticShape constructor
02 map constructor
...

in shapeinfer0():

15 vector::push_back
4 normalize_axis
1 vector destructor
...

So we can see, too many dynamic STL container or C++ object is constructed and destructed, in shapeinfer0(), it's just a pointer to int64_t, but in shapeinfer1(), to pass it to shape_inference, SharedPtr/HostTensor/Constant are constructed and destructed, too many wrappers causes dramatic overhead.

usstq Dec 2, 2021
Maintainer Author

Then how to improve the efficiency of generic code ?

The answer is zero-copy (not only of the data, but also of the shape). which is to wrap existing objects instead of copying them between vector/map/shared_ptr(which heavily depends on memory allocator).

check this answer: https://stackoverflow.com/questions/60151514/using-stdvector-as-view-on-to-raw-memory

for example, to archive performance comparable to best possible implementation, without explicitly required uniformed API interface, we can:

just pass the input shapes one by one in as const-reference arguments instead of copy them into a vector or using container
the shape type can be any generic random-access container for dimension.
for passing const dependency ?

usstq · 2021-12-02T05:29:55Z

usstq
Dec 2, 2021
Maintainer Author

@jane-intel According to this test, I'm really concerned about the potential performance gain that the generic code misses, further more, the generic code is also harder to understand and maintain than it's pure static version.

0 replies

usstq · 2021-12-09T05:08:38Z

usstq
Dec 9, 2021
Maintainer Author

Improvement suggestion

I did some effort today and I'm able to make the generic code as fast as the static version, the key idea is:

instead of inventing StaticShape/StaticDimension with PartialShape/Dimension compatible interface, using traits-like template to access methods incompatible with static shape/dimension (like get_length/rank/get_interval), in this way, call site can pass in their own Shape/Dimension container type instead of copying those representation into StaticShape.
instead of passing pointer through HostTensor/SharedPtr/Map, just pass a pointer wrapped in span which is very light weighted zero-copy container.

template <typename T>
class span {
    T* ptr_;
    std::size_t len_;

public:
    span(T* ptr, std::size_t len) noexcept : ptr_{ptr}, len_{len} {}

    T& operator[](int i) noexcept {
        return ptr_[i];
    }

    T const& operator[](int i) const noexcept {
        return ptr_[i];
    }

    std::size_t size() const noexcept {
        return len_;
    }

    T* begin() noexcept {
        return ptr_;
    }

    T* end() noexcept {
        return ptr_ + len_;
    }
};

namespace ov {
namespace op {
namespace v1 {

template <typename T>
bool is_static(const T& t) {
    return true;
}

template <>
bool is_static<ov::PartialShape>(const ov::PartialShape& t) {
    return t.is_static();
}

template <>
bool is_static<ov::Dimension>(const ov::Dimension& t) {
    return t.is_static();
}

template <typename T>
struct rank_type {
    using type = ov::Rank;
};

template <>
struct rank_type<VectorDims> {
    using type = ov::Rank;
};

template <typename SHAPE>
typename rank_type<SHAPE>::type rank_of(const SHAPE& s) {
    return s.rank();
}

template <>
ov::Rank rank_of<ov::PartialShape>(const PartialShape& s) {
    return s.rank();
}

template <>
ov::Rank rank_of<VectorDims>(const VectorDims& s) {
    return ov::Rank(s.size());
}

template <typename T>
int64_t get_length(const T& t) {
    return t;
}

template <>
int64_t get_length<ov::Dimension>(const Dimension& t) {
    return t.get_length();
}

template <typename T>
const ov::Interval get_interval(const T& t) {
    OPENVINO_UNREACHABLE("[shape infer] get_interval.");
    return ov::Interval();
}

template <>
const ov::Interval get_interval<ov::Dimension>(const Dimension& t) {
    return t.get_interval();
}

template <typename T>
typename T::value_type cast_dim(ov::Dimension d) {
    return d;
}

template <>
VectorDims::value_type cast_dim<VectorDims>(ov::Dimension d) {
    OPENVINO_UNREACHABLE("[shape infer] cast_dim.");
    return 0;
}

template <typename T>
T cast_pshape(ov::PartialShape d) {
    return d;
}

template <>
VectorDims cast_pshape<VectorDims>(ov::PartialShape d) {
    OPENVINO_UNREACHABLE("[shape infer] cast_pshape.");
    return {};
}

template <typename T>
void shape_inferX(const Split* op,
                  const span<T>& input_shapes,
                  std::vector<T>& output_shapes,
                  span<int64_t> axes_values) {
    using DimType = typename std::iterator_traits<typename T::iterator>::value_type;
    NODE_VALIDATION_CHECK(op, (input_shapes.size() == 2));

    output_shapes.clear();

    const auto& data_ps = input_shapes[0];
    const auto& axis_ps = input_shapes[1];

    const auto& axis_rank = rank_of(axis_ps);

    NODE_VALIDATION_CHECK(op, axis_rank.get_length() == 0, "'axis' input must be a scalar. Got: ", axis_rank);

    auto each_output_shape = data_ps;
    const auto& data_rank = rank_of(data_ps);

    auto num_splits = op->get_num_splits();
    if (axes_values.size() && is_static(data_rank)) {
        NODE_VALIDATION_CHECK(op,
                              axes_values.size() == 1,
                              "a scalar axis value is expected. Got: ",
                              axes_values.size(),
                              " axes");

        auto axis = ov::normalize_axis(op, axes_values[0], data_rank);

        if (is_static(data_ps[axis])) {
            const auto dimension_at_axis = get_length(data_ps[axis]);

            NODE_VALIDATION_CHECK(op,
                                  dimension_at_axis % num_splits == 0,
                                  "Dimension of data input shape along 'axis': ",
                                  dimension_at_axis,
                                  " must be evenly divisible by 'num_splits' attribute value: ",
                                  num_splits);

            each_output_shape[axis] = dimension_at_axis / num_splits;
        } else {
            const auto dim_interval_at_axis = get_interval(data_ps[axis]);
            NODE_VALIDATION_CHECK(op,
                                  dim_interval_at_axis.get_max_val() >= static_cast<int64_t>(num_splits),
                                  "The interval maximum of the dimension for data "
                                  "input shape along 'axis' must be "
                                  "greater or equal to 'num_splits' attribute. Got: ",
                                  dim_interval_at_axis,
                                  " and ",
                                  num_splits);

            auto dim_interval_at_axis_min =
                static_cast<int64_t>(dim_interval_at_axis.get_min_val() * (1.0f / num_splits));
            auto dim_interval_at_axis_max = dim_interval_at_axis.get_max_val();
            if (dim_interval_at_axis.has_upper_bound()) {
                dim_interval_at_axis_max = static_cast<int64_t>(dim_interval_at_axis_max * (1.0f / num_splits));
            }
            each_output_shape[axis] = cast_dim<T>(Dimension(dim_interval_at_axis_min, dim_interval_at_axis_max));
        }
    } else {
        each_output_shape = cast_pshape<T>(ov::PartialShape::dynamic(data_rank));
    }

    for (size_t i = 0; i < num_splits; ++i)
        output_shapes.push_back(each_output_shape);
}

}  // namespace v1
}  // namespace op
}  // namespace ov

std::vector<VectorDims> shapeInfer1(const op::v1::Split* op, VectorDims* dims_in, int dims_cnt, int64_t* axes_values) {
    std::vector<VectorDims> ret;

    shape_inferX(op, span<VectorDims>(dims_in, 2), ret, span<int64_t>(axes_values, 1));

    return ret;
}

0 replies

jane-intel · 2021-12-09T09:27:26Z

jane-intel
Dec 9, 2021

@usstq Thank you for the suggestion! It seems to me, that proposed approach relies on the fact that data is already available in int64 type but this is not true -- CPU will mostly pass int32 type, so to use the code you suggested CPU have to convert data on their own outside of this code.

As I see from the vtune profiling -- major time is taken for

Constant operator creation here:

57 get_data_as_int64
    28 Constant OP constructor
    13 cast_vector
    10 Node destructor
    3 Constant OP destructor
    0.1 map.at

Constant here is only created to make use of its cast_vector method which is replicable. Lets introduce our own cast vector which works with raw data which could be taken from HostTensor via get_data_ptr method. With that we will skip dependency on the Constant operator entirely. In case data type of HostTensor mathes our desired type -- int64 we can skip casting.

HostTensor creation

26 make_shared

HostTensor creation has room for optimization too.

These are duplicate check, one of which could be omitted: here and here
Also m_descriptor.size() looks like to could use some optimization link --
get_shape here uses mutex -- better to use get_partial_shape().to_shape()

shape_size(get_shape()) * m_element_type.size();

Casting from StaticShape to ov::Shape to CPU shape representation:

25 transform:

This piece is significant because Split has several outputs I believe and there are two conversions there from Static shape to regular ov::Shape and then to plugin representation. So I propose to treat this with lower priority and when we will try to optimize that consider this comment.

With that I conclude -- optimization could take place without interface changes.
Let's keep using HostTensor to encapsulate inputs of different types and try to make it faster.
And once again, thank you for all the data collection and suggestions that you've made!

0 replies

usstq · 2021-12-17T08:13:52Z

usstq
Dec 17, 2021
Maintainer Author

shape_inference opt

we rely on gtest's timing as metric, and we add a big loop wrapping shape_inference() internally to magnify proportion of the cost caused by shape_inference(). following test is used as a target for profiling (use vtune).

PERFN=10000 ./cpuUnitTests --gtest_filter="StaticShapeInferenceTest*"

shape_inference hotspot : "ov::as_type"

ov::as_type is called so many times in shape_inference(), since it needs to be called for almost each supported op, this is makes ov::as_type becomes a hotspot. and this can be avoided by build a map statically, key is the type_info and value is corresponding shape_infer functions, at runtime, we just call find() once to get the function and then call it.

total time saved:

770ms => 170ms

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Shape inference API overhead #10

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 5 comments 2 replies

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

Shape inference API overhead #10

usstq Dec 1, 2021 Maintainer

Replies: 5 comments · 2 replies

usstq Dec 1, 2021 Maintainer Author

test case using split op

test result

usstq Dec 2, 2021 Maintainer Author

Anaylze with vtune

usstq Dec 2, 2021 Maintainer Author

Then how to improve the efficiency of generic code ?

usstq Dec 2, 2021 Maintainer Author

usstq Dec 9, 2021 Maintainer Author

Improvement suggestion

jane-intel Dec 9, 2021

usstq Dec 17, 2021 Maintainer Author

shape_inference opt

shape_inference hotspot : "ov::as_type"

usstq
Dec 1, 2021
Maintainer

Replies: 5 comments 2 replies

usstq
Dec 1, 2021
Maintainer Author

usstq Dec 2, 2021
Maintainer Author

usstq Dec 2, 2021
Maintainer Author

usstq
Dec 2, 2021
Maintainer Author

usstq
Dec 9, 2021
Maintainer Author

jane-intel
Dec 9, 2021

usstq
Dec 17, 2021
Maintainer Author