NPUW: Spatial execution #26880
Conversation
- Updated Compute patterns for GQ to handle GPTQ models
- Added a "COMPUTE" preset as a possible option to "NPUW_ONLINE_ISOLATE" - an alias to all known patterns
- Introduced a new option NPUW_SPATIAL to enable spatial dim
- Exposed the isolate tag for Groups, Subgraphs, and Functions
- Made a placeholder to plug the spatial optimization code in

The spatial execution now works for select cases.
Force-pushed from 2fda433 to 689962a
@@ -2068,7 +2194,6 @@ ov::npuw::Partitioning ov::npuw::getPartitioning(const std::shared_ptr<ov::Model
         p.saveTinyConstants(func_group);
         p.saveScaleFactors(func_group);
         p.createFunction(func_group);
-        p.optimize(func_group);
Why is it gone for CWAI?
Since the whole idea behind CWAI is to keep dequant params in the model, and the whole idea of optimize is to work on partitioned/folded subgraphs. There are changes that may serve both (e.g., host-side regather, PMM, etc.), so maybe it makes sense to bring optimize back to CWAI, or just separate it somehow. Actually, DQ shouldn't work on the CWAI-d model - patterns won't match.
    NPUW_ASSERT(from.size() == to.size());

    // Sub-byte views are not supported here
    NPUW_ASSERT(type != ov::element::u4 && type != ov::element::i4);
Wouldn't it be needed when we do spatial for subgraphs with weights? Do you mind adding a FIXME here?
There's no difference. We only tile activations, we don't tile weights. So subgraphs with weights will work over activation ranges.
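For context, a minimal sketch of the byte-offset math such a view relies on (a hypothetical helper, not the PR's code): the offset is counted in whole bytes, which is only well-defined when an element occupies at least one byte. For u4/i4, two elements share a byte, so `from[d]` can land mid-byte (and, to my understanding, `get_strides()` itself rejects sub-byte types).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

#include <openvino/runtime/tensor.hpp>

// Hypothetical helper: compute the data pointer of a view starting at `from`
// inside `src`. Valid only for byte-aligned element types.
static uint8_t* view_data_ptr(const ov::Tensor& src, const std::vector<std::size_t>& from) {
    auto* base = static_cast<uint8_t*>(src.data());
    const auto strides = src.get_strides();  // byte strides per dimension
    std::size_t byte_offset = 0u;
    for (std::size_t d = 0u; d < from.size(); d++) {
        byte_offset += from[d] * strides[d];  // whole bytes - breaks for u4/i4
    }
    return base + byte_offset;
}
```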
        view_shape.push_back(to[d] - from[d]);
    }

    const auto strides = src->get_strides();
I'm not sure if remote tensors support strides (as far as I know L0 tensors don't). So it should be OK for weightless subgraphs. However, with inputs/outputs/intermediate tensors added to remote memory, wouldn't it break?
Likely I'm wrong. Tensors are already created at this point in the right memory. It's ov::Tensor, so it should be fine to get a view here.
> I'm not sure if remote tensors support strides (as far as I know L0 tensors don't)

It works perfectly there - tested. Actually, L0 tensors are plain buffers, and strides are just logical distances within one big plain buffer. There's no reason for this not to work.
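A small illustration of that point, as a sketch using the public ov::Tensor ROI constructor (not the PR's code): the view shares the parent's flat allocation and differs only in shape, offset, and strides.

```cpp
#include <openvino/runtime/tensor.hpp>

int main() {
    // One plain buffer: a 4x8x16 f32 tensor.
    ov::Tensor full(ov::element::f32, ov::Shape{4, 8, 16});

    // A spatial sub-range along dim 1, rows [2, 6). No copy is made: the ROI
    // tensor keeps the parent's strides, i.e. strides are just logical
    // distances inside the same flat allocation.
    ov::Tensor view(full, ov::Coordinate{0, 2, 0}, ov::Coordinate{4, 6, 16});

    // view.get_shape() == {4, 4, 16}; view's data points into full's buffer.
    return 0;
}
```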
    std::vector<ov::SoPtr<ov::ITensor>> inputs;   // # of elements - # of graph-side inputs
    std::vector<ov::SoPtr<ov::ITensor>> outputs;  // # of elements - # of subgraph outputs
Why is it different: graph-side inputs and subgraph outputs?
Subgraph inputs formally include closures. Graph-side inputs are only the connections in the graph (which we care about).
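To illustrate the distinction (hypothetical names, not the actual NPUW structures): a subgraph's complete input list is its graph-side inputs plus the closure of folded weights and scales, and only the former participate in spatial I/O.

```cpp
#include <vector>

#include <openvino/runtime/itensor.hpp>
#include <openvino/runtime/so_ptr.hpp>

// Hypothetical sketch of the split described above.
struct FunctionCallIO {
    // Graph-side inputs: activations flowing between subgraphs; the only
    // inputs that are tiled spatially.
    std::vector<ov::SoPtr<ov::ITensor>> graph_inputs;

    // Closure: per-call constants (folded weights, scales, zero points).
    // Formally also subgraph inputs, but never tiled.
    std::vector<ov::SoPtr<ov::ITensor>> closure;
};
```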
    // Now set the spatial outputs
    for (std::size_t out_idx = 0u; out_idx < num_outputs; out_idx++) {
        const auto& oport = comp_model_desc.compiled_model->outputs()[out_idx];
        r->set_tensor(oport,
Why do we even set outputs for the infer request?
We ask a tile to calculate the exact region in the existing buffer. Otherwise it'd be a copy.
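A sketch of the idea under assumed names (`req`, `oport`, `full_out`, `dim`, `offset`, `stencil` are all illustrative): binding the tile's output port to a view of the final buffer makes the tile write its region in place, so no post-run copy is needed.

```cpp
#include <cstddef>

#include <openvino/runtime/infer_request.hpp>
#include <openvino/runtime/tensor.hpp>

// Sketch: `req` is an ov::InferRequest for one tile, `oport` its output port,
// `full_out` the preallocated full-size result; `offset`/`stencil` describe
// the current tile along spatial dimension `dim`.
void bind_tile_output(ov::InferRequest& req,
                      const ov::Output<const ov::Node>& oport,
                      const ov::Tensor& full_out,
                      std::size_t dim, std::size_t offset, std::size_t stencil) {
    const auto& shape = full_out.get_shape();
    ov::Coordinate begin(shape.size(), 0u);
    ov::Coordinate end(shape);
    begin[dim] = offset;
    end[dim] = offset + stencil;

    // The tile computes straight into this region of the existing buffer.
    ov::Tensor out_view(full_out, begin, end);
    req.set_tensor(oport, out_view);
}
```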
    // Now process the tail, if required
    if (spatial.tail_size) {
        // Copy the sub-ranges to spatial inputs
        // NOTE: tails buffers are read from/written to at 0th offset!
Then why is it offset for params below, and 0 for outputs? Is it related to the whole sparse representation?
The tail is a buffer allocated to stencil size. We copy the tail (by definition, less than stencil size) into a dedicated buffer. So we read at an offset (from the larger buffer) and write at 0 (into the smaller buffer).
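A sketch of that tail copy under assumed names (`full_in` is the full activation, `tail_in` a stencil-sized scratch tensor; not the PR's code): the tail is read at its real offset in the large buffer and written at offset 0 of the smaller one.

```cpp
#include <cstddef>

#include <openvino/runtime/tensor.hpp>

// Sketch: copy the tail sub-range of `full_in` (starting at `offset` along
// `dim`, with `tail_size` < stencil size) into the front of `tail_in`.
void copy_tail(const ov::Tensor& full_in, ov::Tensor& tail_in,
               std::size_t dim, std::size_t offset, std::size_t tail_size) {
    // Read at `offset` from the larger buffer...
    ov::Coordinate src_begin(full_in.get_shape().size(), 0u);
    ov::Coordinate src_end(full_in.get_shape());
    src_begin[dim] = offset;
    src_end[dim] = offset + tail_size;
    ov::Tensor src_view(full_in, src_begin, src_end);

    // ...and write at 0 into the smaller, stencil-sized one.
    ov::Coordinate dst_begin(tail_in.get_shape().size(), 0u);
    ov::Coordinate dst_end(tail_in.get_shape());
    dst_end[dim] = tail_size;
    ov::Tensor dst_view(tail_in, dst_begin, dst_end);

    src_view.copy_to(dst_view);
}
```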
    // Now set the tail tensors
    for (std::size_t out_idx = 0u; out_idx < num_outputs; out_idx++) {
        const auto& oport = comp_model_desc.compiled_model->outputs()[out_idx];
        r->set_tensor(oport, m_spatial_io[real_idx].output_tails.at(out_idx));
Why outputs only?
Inputs are set in the loop above (I just messed up the comment).