-
We can build a test case to answer this question. Here we test case 1: do all the work entirely in the CPU plugin's shapeInfer() callback. To make it easier, we write a fake shapeInfer and take the time difference as the cost of the API, which gives us an estimate. The test case uses the Split op; we constructed the following test (following the existing implementation):

using VectorDims = std::vector<size_t>;
// case 1: the simplest possible hand-written shape inference for Split,
// working directly on plain VectorDims without any dynamic-shape machinery
std::vector<VectorDims> shapeInfer0(const op::v1::Split* op, VectorDims* dims_in, int dims_cnt, int64_t* axes_values) {
    NODE_VALIDATION_CHECK(op, (dims_cnt == 2));
    const auto& data_ps = dims_in[0];
    const auto& axis_ps = dims_in[1];
    NODE_VALIDATION_CHECK(op, axis_ps.size() == 0, "'axis' input must be a scalar. Got: ", axis_ps.size());
    auto each_output_shape = data_ps;
    const auto data_rank = data_ps.size();
    auto num_splits = op->get_num_splits();
    auto axis = ov::normalize_axis(op, axes_values[0], ov::Rank(data_rank));
    const auto dimension_at_axis = data_ps[axis];
    NODE_VALIDATION_CHECK(op,
                          dimension_at_axis % num_splits == 0,
                          "Dimension of data input shape along 'axis': ",
                          dimension_at_axis,
                          " must be evenly divisible by 'num_splits' attribute value: ",
                          num_splits);
    each_output_shape[axis] = dimension_at_axis / num_splits;
    std::vector<VectorDims> ret;
    for (size_t i = 0; i < num_splits; ++i)
        ret.push_back(each_output_shape);
    return ret;
}
// case 2: the same shape inference routed through the generic shape_infer()
// API, including construction of its vector/map arguments
std::vector<VectorDims> shapeInfer1(const op::v1::Split* op, VectorDims* dims_in, int dims_cnt, int64_t* axes_values) {
    std::vector<StaticShape> input_shapes = {dims_in[0], dims_in[1]};
    std::map<size_t, std::shared_ptr<ngraph::runtime::HostTensor>> input_values = {
        {1, std::make_shared<ngraph::runtime::HostTensor>(ngraph::element::Type_t::i32, ov::Shape{}, axes_values)}};
    std::vector<StaticShape> output_shapes = {StaticShape{}};
    shape_infer(op, input_shapes, output_shapes, input_values);
    std::vector<VectorDims> result(output_shapes.size());
    std::transform(output_shapes.begin(), output_shapes.end(), result.begin(), [](const ov::StaticShape& s) {
        return s.to_shape();
    });
    return result;
}
TEST(StaticShapeInferenceTest, SplitV1_S2) {
    std::vector<VectorDims> dims_in = {{2, 8, 4}, {}};
    std::vector<VectorDims> ret;
    int64_t axes_values[2] = {2, 2};
    auto op = build_split(PartialShape({2, 8, 4}), {}, 4);
    PERF_TEST(ret = shapeInfer0(op.get(), dims_in.data(), dims_in.size(), axes_values));
    PERF_TEST(ret = shapeInfer1(op.get(), dims_in.data(), dims_in.size(), axes_values));
}

test result: shapeInfer0 takes 339 ns. If we put PERF_TEST inside shapeInfer1(), wrapping just the shape_infer() API call, the result is ~1000 ns. So, surprisingly, the overhead of the current shape_inference API is considerable compared to the simplest possible implementation. You may object that shapeInfer0() is not generic template code, but it is still surprising that the generic code is slower than the best possible implementation by such a big margin.
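For reference, PERF_TEST itself is not shown in this thread; a minimal sketch of such a macro, assuming it repeats the expression PERFN times (read from the environment, as the profiling run in a later comment suggests) and reports the average cost per iteration, could look like:

#include <chrono>
#include <cstdio>
#include <cstdlib>

// Sketch of a PERF_TEST macro (assumption, not the actual implementation):
// run the expression n times and print the average nanoseconds per iteration.
#define PERF_TEST(expr)                                                         \
    do {                                                                        \
        const char* env_ = std::getenv("PERFN");                               \
        const long n_ = env_ ? std::atol(env_) : 10000;                        \
        const auto t0_ = std::chrono::high_resolution_clock::now();            \
        for (long i_ = 0; i_ < n_; ++i_) {                                     \
            expr;                                                               \
        }                                                                       \
        const auto t1_ = std::chrono::high_resolution_clock::now();            \
        const auto ns_ =                                                        \
            std::chrono::duration_cast<std::chrono::nanoseconds>(t1_ - t0_)    \
                .count();                                                       \
        std::printf("%s : %lld ns/iter\n", #expr, (long long)(ns_ / n_));      \
    } while (0)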
-
@jane-intel According to this test, I'm really concerned about the potential performance gain that the generic code misses; furthermore, the generic code is also harder to understand and maintain than its pure static version.
-
Improvement suggestion: I put in some effort today and was able to make the generic code as fast as the static version. The key idea is to route every dynamic-shape-specific operation (rank queries, static-ness checks, interval handling, Dimension casts) through small template helpers that are specialized for the static VectorDims case, so the compiler can fold the dynamic branches away entirely:
// minimal non-owning view over a contiguous buffer (a pre-C++20 stand-in
// for std::span), used to pass shapes and axis values without copies
template <typename T>
class span {
    T* ptr_;
    std::size_t len_;

public:
    span(T* ptr, std::size_t len) noexcept : ptr_{ptr}, len_{len} {}
    T& operator[](int i) noexcept {
        return ptr_[i];
    }
    T const& operator[](int i) const noexcept {
        return ptr_[i];
    }
    std::size_t size() const noexcept {
        return len_;
    }
    T* begin() noexcept {
        return ptr_;
    }
    T* end() noexcept {
        return ptr_ + len_;
    }
};
namespace ov {
namespace op {
namespace v1 {

// trait helpers: the primary templates serve the static (VectorDims) case,
// the specializations handle the dynamic ov::PartialShape/ov::Dimension case
template <typename T>
bool is_static(const T& t) {
    return true;
}
template <>
bool is_static<ov::PartialShape>(const ov::PartialShape& t) {
    return t.is_static();
}
template <>
bool is_static<ov::Dimension>(const ov::Dimension& t) {
    return t.is_static();
}

template <typename T>
struct rank_type {
    using type = ov::Rank;
};
template <>
struct rank_type<VectorDims> {
    using type = ov::Rank;
};
template <typename SHAPE>
typename rank_type<SHAPE>::type rank_of(const SHAPE& s) {
    return s.rank();
}
template <>
ov::Rank rank_of<ov::PartialShape>(const PartialShape& s) {
    return s.rank();
}
template <>
ov::Rank rank_of<VectorDims>(const VectorDims& s) {
    return ov::Rank(s.size());
}

template <typename T>
int64_t get_length(const T& t) {
    return t;
}
template <>
int64_t get_length<ov::Dimension>(const Dimension& t) {
    return t.get_length();
}

// interval/cast helpers are only reachable on the dynamic path, so the
// static (VectorDims) instantiations deliberately trap
template <typename T>
const ov::Interval get_interval(const T& t) {
    OPENVINO_UNREACHABLE("[shape infer] get_interval.");
    return ov::Interval();
}
template <>
const ov::Interval get_interval<ov::Dimension>(const Dimension& t) {
    return t.get_interval();
}

template <typename T>
typename T::value_type cast_dim(ov::Dimension d) {
    return d;
}
template <>
VectorDims::value_type cast_dim<VectorDims>(ov::Dimension d) {
    OPENVINO_UNREACHABLE("[shape infer] cast_dim.");
    return 0;
}

template <typename T>
T cast_pshape(ov::PartialShape d) {
    return d;
}
template <>
VectorDims cast_pshape<VectorDims>(ov::PartialShape d) {
    OPENVINO_UNREACHABLE("[shape infer] cast_pshape.");
    return {};
}
// generic shape inference for Split, instantiable with either the dynamic
// ov::PartialShape or the plugin's static VectorDims
template <typename T>
void shape_inferX(const Split* op,
                  const span<T>& input_shapes,
                  std::vector<T>& output_shapes,
                  span<int64_t> axes_values) {
    using DimType = typename std::iterator_traits<typename T::iterator>::value_type;
    NODE_VALIDATION_CHECK(op, (input_shapes.size() == 2));
    output_shapes.clear();
    const auto& data_ps = input_shapes[0];
    const auto& axis_ps = input_shapes[1];
    const auto& axis_rank = rank_of(axis_ps);
    NODE_VALIDATION_CHECK(op, axis_rank.get_length() == 0, "'axis' input must be a scalar. Got: ", axis_rank);
    auto each_output_shape = data_ps;
    const auto& data_rank = rank_of(data_ps);
    auto num_splits = op->get_num_splits();
    if (axes_values.size() && is_static(data_rank)) {
        NODE_VALIDATION_CHECK(op,
                              axes_values.size() == 1,
                              "a scalar axis value is expected. Got: ",
                              axes_values.size(),
                              " axes");
        auto axis = ov::normalize_axis(op, axes_values[0], data_rank);
        if (is_static(data_ps[axis])) {
            // static dimension: plain integer division, same as shapeInfer0
            const auto dimension_at_axis = get_length(data_ps[axis]);
            NODE_VALIDATION_CHECK(op,
                                  dimension_at_axis % num_splits == 0,
                                  "Dimension of data input shape along 'axis': ",
                                  dimension_at_axis,
                                  " must be evenly divisible by 'num_splits' attribute value: ",
                                  num_splits);
            each_output_shape[axis] = dimension_at_axis / num_splits;
        } else {
            // dynamic dimension: divide the interval bounds instead
            const auto dim_interval_at_axis = get_interval(data_ps[axis]);
            NODE_VALIDATION_CHECK(op,
                                  dim_interval_at_axis.get_max_val() >= static_cast<int64_t>(num_splits),
                                  "The interval maximum of the dimension for data "
                                  "input shape along 'axis' must be "
                                  "greater or equal to 'num_splits' attribute. Got: ",
                                  dim_interval_at_axis,
                                  " and ",
                                  num_splits);
            auto dim_interval_at_axis_min =
                static_cast<int64_t>(dim_interval_at_axis.get_min_val() * (1.0f / num_splits));
            auto dim_interval_at_axis_max = dim_interval_at_axis.get_max_val();
            if (dim_interval_at_axis.has_upper_bound()) {
                dim_interval_at_axis_max = static_cast<int64_t>(dim_interval_at_axis_max * (1.0f / num_splits));
            }
            each_output_shape[axis] = cast_dim<T>(Dimension(dim_interval_at_axis_min, dim_interval_at_axis_max));
        }
    } else {
        // axis value unknown or rank dynamic: outputs are fully dynamic
        each_output_shape = cast_pshape<T>(ov::PartialShape::dynamic(data_rank));
    }
    for (size_t i = 0; i < num_splits; ++i)
        output_shapes.push_back(each_output_shape);
}

}  // namespace v1
}  // namespace op
}  // namespace ov
std::vector<VectorDims> shapeInfer1(const op::v1::Split* op, VectorDims* dims_in, int dims_cnt, int64_t* axes_values) {
    std::vector<VectorDims> ret;
    shape_inferX(op, span<VectorDims>(dims_in, 2), ret, span<int64_t>(axes_values, 1));
    return ret;
}
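As a quick sanity check (a sketch, assuming the declarations above are in scope), the VectorDims instantiation should reproduce shapeInfer0's result exactly:

// Sketch: the static instantiation of shape_inferX should match the
// hand-written shapeInfer0 for the same Split fixture ({2,8,4} split into
// 4 pieces along axis 2 gives four {2,8,1} shapes).
TEST(StaticShapeInferenceTest, SplitV1_S2_Generic) {
    std::vector<VectorDims> dims_in = {{2, 8, 4}, {}};
    int64_t axes_values[1] = {2};
    auto op = build_split(PartialShape({2, 8, 4}), {}, 4);
    std::vector<VectorDims> ret;
    ov::op::v1::shape_inferX(op.get(), span<VectorDims>(dims_in.data(), 2), ret,
                             span<int64_t>(axes_values, 1));
    ASSERT_EQ(ret.size(), 4u);
    for (const auto& s : ret)
        EXPECT_EQ(s, VectorDims({2, 8, 1}));
}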
-
@usstq Thank you for the suggestion! It seems to me that the proposed approach relies on the data already being available as int64, but this is not true: the CPU plugin will mostly pass int32 data, so to use the code you suggested the CPU plugin would have to convert the data on its own outside of this code. As I see from the vtune profiling, the major time is taken for
Constant here is only created to make use of its
HostTensor creation has room for optimization too.
This piece is significant because Split has several outputs, I believe, and there are two conversions there: from StaticShape to regular ov::Shape and then to the plugin representation. So I propose to treat this with lower priority, and when we try to optimize it we can consider this comment. With that I conclude: the optimization could take place without interface changes.
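To make the int32 constraint concrete, here is a minimal sketch (the helper name is hypothetical) of the widening copy the CPU plugin would need before calling the proposed span<int64_t>-based interface:

#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: widen the plugin's int32 axis data to the int64
// buffer that the proposed shape_inferX() interface expects. This is an
// extra per-call allocation and copy that an int32-aware path would avoid.
std::vector<int64_t> widen_axes(const int32_t* src, std::size_t count) {
    return std::vector<int64_t>(src, src + count);
}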
-
shape_inference opt: we rely on gtest's timing as the metric, and we add a big loop wrapping shape_inference() internally to magnify the proportion of the cost caused by shape_inference(). The following run is used as the target for profiling (with vtune):

PERFN=10000 ./cpuUnitTests --gtest_filter="StaticShapeInferenceTest*"

shape_inference hotspot: ov::as_type
total time: 770 ms => 170 ms
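For context, ov::as_type is OpenVINO's type_info-based downcast (avoiding dynamic_cast); a sketch of the kind of call pattern that makes it a per-inference hotspot follows. The dispatcher shape and header paths here are assumptions, not the actual implementation; only ov::as_type itself is the real API:

#include <openvino/core/type.hpp>
#include <openvino/op/split.hpp>

// Illustrative dispatcher sketch (assumption): every shape-inference call
// walks a chain of ov::as_type downcasts to find the concrete op type, so
// the repeated type checks accumulate across nodes and iterations.
void infer_dispatch_example(ov::Node* node) {
    if (auto* split = ov::as_type<ov::op::v1::Split>(node)) {
        (void)split;  // Split-specific shape inference would run here
    }
    // ... further ov::as_type checks for other op types ...
}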
-
The shape_infer API has the following interface, and the caller has to construct a vector and a map, which seems heavy. How can we test whether that's a problem?
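One way to test it, reusing the PERF_TEST approach from the first comment: time only the construction of the vector and map arguments (the same ones shapeInfer1 above builds), without calling shape_infer(), and compare that against the full ~1000 ns call. A sketch, with construct_args as a hypothetical name and dims_in/axes_values being the fixtures from the earlier test:

// Sketch: isolate the cost of building shape_infer()'s argument containers.
auto construct_args = [&]() {
    std::vector<StaticShape> input_shapes = {dims_in[0], dims_in[1]};
    std::map<size_t, std::shared_ptr<ngraph::runtime::HostTensor>> input_values = {
        {1, std::make_shared<ngraph::runtime::HostTensor>(ngraph::element::Type_t::i32,
                                                          ov::Shape{}, axes_values)}};
    std::vector<StaticShape> output_shapes = {StaticShape{}};
    (void)output_shapes;
};
PERF_TEST(construct_args());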