
NPUW: Spatial execution #26880

Merged 10 commits into openvinotoolkit:master from dm/npuw_spatial on Oct 9, 2024

Conversation

@dmatveev (Contributor) commented Oct 2, 2024

Details:

  • When enabled, mainly saves compile time (dynamic range is not enabled yet)
  • More details in the issue

Tickets:

  • E-140516

@github-actions github-actions bot added category: NPU (OpenVINO NPU plugin) and category: NPUW (NPUW plugin) labels Oct 2, 2024
- Updated Compute patterns for GQ to handle GPTQ models
- Added a "COMPUTE" preset as a possible option to "NPUW_ONLINE_ISOLATE"
  - An alias to all known patterns
- Introduced a new option NPUW_SPATIAL to enable spatial dim
- Exposed the isolate tag for Groups, Subgraphs, and Functions
- Made a placeholder to plug the spatial optimization code in
The spatial execution now works for select cases.
@dmatveev dmatveev marked this pull request as ready for review October 8, 2024 11:26
@dmatveev dmatveev requested review from a team as code owners October 8, 2024 11:26
@dmatveev dmatveev self-assigned this Oct 8, 2024
@@ -2068,7 +2194,6 @@ ov::npuw::Partitioning ov::npuw::getPartitioning(const std::shared_ptr<ov::Model
p.saveTinyConstants(func_group);
p.saveScaleFactors(func_group);
p.createFunction(func_group);
p.optimize(func_group);
Contributor:
Why is it gone for CWAI?

@dmatveev (Author):

Since the whole idea behind CWAI is to keep dequant params in the model, and the whole idea of optimize is to work on partitioned/folded subgraphs.

There are changes that may serve both (e.g., host-side regather, PMM, etc.), so maybe it makes sense to bring optimize back to CWAI, or just separate it somehow. Actually, DQ shouldn't work on the CWAI-d model - patterns won't match.

NPUW_ASSERT(from.size() == to.size());

// Sub-byte views are not supported here
NPUW_ASSERT(type != ov::element::u4 && type != ov::element::i4);
Contributor:

Wouldn't it be needed when we do spatial for subgraphs with weights? Do you mind adding a FIXME here?

@dmatveev (Author):

There's no difference. We only tile activations, we don't tile weights, so subgraph weights will work over activation ranges.
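
For illustration, here is a minimal sketch (not the PR's actual helper; names are assumed) of taking a zero-copy spatial view over an activation tensor via ov::Tensor's ROI constructor, and of why sub-byte types are asserted out:

#include <openvino/core/except.hpp>
#include <openvino/runtime/tensor.hpp>

// Take a zero-copy view over the spatial range [from, to) of an activation
// tensor. Weights are never tiled, so they don't go through this path.
ov::Tensor activation_view(const ov::Tensor& src,
                           const ov::Coordinate& from,
                           const ov::Coordinate& to) {
    // u4/i4 pack two elements per byte, so an element offset may not be
    // byte-aligned - hence the sub-byte assert in the code above.
    OPENVINO_ASSERT(src.get_element_type() != ov::element::u4 &&
                    src.get_element_type() != ov::element::i4);
    return ov::Tensor(src, from, to);  // strided ROI view, no data copy
}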

view_shape.push_back(to[d] - from[d]);
}

const auto strides = src->get_strides();
Contributor:

I'm not sure if remote tensors support strides (as far as I know, L0 tensors don't), so it should be OK for weightless subgraphs. However, with ins/outs/intermediate tensors being added to remote memory, wouldn't it break?

Contributor:

Likely I'm wrong. Tensors are already created at this point in the right memory. It's ov::Tensor, so it should be fine to get a view here.

@dmatveev (Author):

> I'm not sure if remote tensors support strides (as far as I know L0 tensors don't)

It works perfectly there - tested. Actually, L0 tensors are plain buffers, and strides are just logical distances within a big plain buffer. There's no reason for this not to work.
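
To make the point concrete, a minimal sketch (illustrative names, not PR code): for a plain buffer, a strided view is just the base pointer plus a byte offset computed from the parent's strides.

#include <cstddef>
#include <vector>

// Byte offset of element `from` in a flat buffer, given per-dimension
// strides in bytes: strides are just logical distances within the buffer.
std::size_t view_byte_offset(const std::vector<std::size_t>& from,
                             const std::vector<std::size_t>& byte_strides) {
    std::size_t offset = 0u;
    for (std::size_t d = 0u; d < from.size(); d++) {
        offset += from[d] * byte_strides[d];
    }
    return offset;
}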

Comment on lines +102 to +103
std::vector<ov::SoPtr<ov::ITensor>> inputs; // # of elements - # of graph-side inputs
std::vector<ov::SoPtr<ov::ITensor>> outputs; // # of elements - # of subgraph outputs
Contributor:

Why are they different: graph-side inputs vs. subgraph outputs?

@dmatveev (Author) commented Oct 9, 2024:

Subgraph inputs formally include closures; graph-side inputs are only the connections in the graph (which are what we care about).
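
A hedged illustration of that distinction (the field comments are assumptions based on this thread, not the PR's actual definitions):

#include <vector>
#include <openvino/runtime/itensor.hpp>
#include <openvino/runtime/so_ptr.hpp>

// Spatial I/O only tracks what participates in tiling.
struct SpatialIO {
    std::vector<ov::SoPtr<ov::ITensor>> inputs;   // graph-side inputs: graph connections only
    std::vector<ov::SoPtr<ov::ITensor>> outputs;  // all subgraph outputs
    // Closure tensors (folded weights/scales) are formally subgraph inputs
    // too, but they are never tiled, so they are not tracked here.
};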

// Now set the spatial outputs
for (std::size_t out_idx = 0u; out_idx < num_outputs; out_idx++) {
const auto& oport = comp_model_desc.compiled_model->outputs()[out_idx];
r->set_tensor(oport,
Contributor:

Why do we even set outputs for the infer request?

@dmatveev (Author) commented Oct 9, 2024:

We ask a tile to calculate the exact region in the existing buffer; otherwise it'd be a copy.
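
In public-API terms, a minimal sketch of the idea (names assumed, not the PR's code): binding a view of the caller's buffer as the subrequest output makes the tile write its region in place.

#include <openvino/runtime/infer_request.hpp>
#include <openvino/runtime/tensor.hpp>

// Bind a zero-copy view of the full output buffer as the tile's output,
// so the result lands at the right offset directly instead of being
// produced in a scratch tensor and copied out afterwards.
void bind_tile_output(ov::InferRequest& req,
                      const ov::Output<const ov::Node>& port,
                      const ov::Tensor& full_output,
                      const ov::Coordinate& from,
                      const ov::Coordinate& to) {
    req.set_tensor(port, ov::Tensor(full_output, from, to));
}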

// Now process the tail, if required
if (spatial.tail_size) {
// Copy the sub-ranges to spatial inputs
// NOTE: tail buffers are read from/written to at 0th offset!
Contributor:

Then why, below, is it an offset for params and 0 for outputs? Is it related to the whole sparse representation?

@dmatveev (Author):

The tail is a buffer allocated at stencil size. We copy the tail (by definition, less than stencil size) into this dedicated buffer, so we read at an offset (from the larger buffer) and write at 0 (into the smaller buffer).
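
A minimal sketch of that tail copy (a single spatial dimension and all names are assumptions):

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <openvino/runtime/tensor.hpp>

// The tail buffer is allocated at full stencil size; only `tail_size`
// rows of it are valid. Read at `offset` rows into the large buffer,
// write at row 0 of the tail buffer.
void copy_tail(const ov::Tensor& full, ov::Tensor& tail,
               std::size_t offset, std::size_t tail_size,
               std::size_t row_bytes) {
    const auto* src = static_cast<const std::uint8_t*>(full.data()) + offset * row_bytes;
    std::memcpy(tail.data(), src, tail_size * row_bytes);
}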

// Now set the tail tensors
for (std::size_t out_idx = 0u; out_idx < num_outputs; out_idx++) {
const auto& oport = comp_model_desc.compiled_model->outputs()[out_idx];
r->set_tensor(oport, m_spatial_io[real_idx].output_tails.at(out_idx));
Contributor:

Why outputs only?

@dmatveev (Author) commented Oct 9, 2024:

Inputs are set in the loop above (I've just messed up the comment).

@dmatveev dmatveev added this pull request to the merge queue Oct 9, 2024
@dmatveev dmatveev added this to the 2024.5 milestone Oct 9, 2024
Merged via the queue into openvinotoolkit:master with commit cbf42f3 Oct 9, 2024
131 checks passed
@dmatveev dmatveev deleted the dm/npuw_spatial branch October 9, 2024 22:24
Labels
category: NPU (OpenVINO NPU plugin) · category: NPUW (NPUW plugin)