Bulk message creation #480
@ktf Can you provide more context here? What are you running? How many messages, and what size are they? What impact does this have?
Should the message count be a run-time or compile-time argument? E.g.

```cpp
std::vector<MessagePtr> CreateMessages(size_t n, std::vector<size_t> const& sizes, std::vector<Alignment> const& alignments);
assert(n == sizes.size());
assert(n == alignments.size());
```

vs

```cpp
template <size_t N>
std::array<MessagePtr, N> CreateMessages(std::array<size_t, N> const& sizes, std::array<Alignment, N> const& alignments);
// or spans, I guess
```

In case of a failure to create a single msg, the whole call is expected to fail?
I guess you are asking for a bulk API on a single transport only?
This is ITS readout, but it's a general issue IMHO. Whenever we use the shallow copy to increase parallelism on the same data when running in shared memory, we need to copy messages one by one, as done in: https://github.com/AliceO2Group/AliceO2/blob/dev/Framework/Core/src/DataProcessingDevice.cxx#L637. It would be good if there was a way to reduce the impact of the mutex.
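To make the pattern concrete, the per-message copy loop being discussed looks roughly like the following. This is a minimal sketch, not the actual O2 code; the `fair::mq` names and the exact `Copy` signature are assumptions that may vary between FairMQ versions.

```cpp
#include <fairmq/TransportFactory.h>
#include <utility>
#include <vector>

// Shallow-copy a set of shared-memory messages one by one. Each
// CreateMessage + Copy pair goes through the transport, so every copied
// message pays the shared-memory segment mutex once.
std::vector<fair::mq::MessagePtr> ShallowCopyAll(fair::mq::TransportFactory& transport,
                                                 std::vector<fair::mq::MessagePtr> const& input)
{
    std::vector<fair::mq::MessagePtr> copies;
    copies.reserve(input.size());
    for (auto const& msg : input) {
        auto copy = transport.CreateMessage();
        copy->Copy(*msg); // shallow copy: bumps the ref count on the shared buffer
        copies.push_back(std::move(copy));
    }
    return copies;
}
```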
Runtime.
I would say so (on failure, the whole call should fail).
You mean that all the messages of the bulk are created on the same transport? Yes. See the linked line above for what I actually want to optimise.
Since this mutex is internal to the boost shm library, we can only avoid it by making one big allocation from the boost shmem segment and then chunking it up according to the requested fmq msgs. However, when the fmq msgs are destroyed, we would have to track and delay the deallocation on the boost shmem segment until all msgs that the large allocation was chunked into are destroyed. This will not be trivial, and we will have to see in measurements how it compares performance-wise. It could be that we only shift the overhead to msg destruction time - which may of course still help your use case. But just to set the expectations right. It would of course be a different story if boost shmem also provided a bulk allocation API, but I don't know that (yet).
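To picture the scheme, here is an illustrative sketch of the chunking idea, with plain heap memory standing in for the boost shmem segment; `Chunk` and `ChunkAllocate` are made-up names, and alignment is ignored for brevity.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// A slice of a larger allocation; the shared block pointer keeps the large
// allocation alive until the last chunk referencing it is destroyed.
struct Chunk {
    void* data;
    std::size_t size;
    std::shared_ptr<void> block;
};

// One big allocation (one mutex acquisition in the shmem case), chunked up
// according to the requested sizes.
std::vector<Chunk> ChunkAllocate(std::vector<std::size_t> const& sizes)
{
    std::size_t total = 0;
    for (auto s : sizes) { total += s; }

    // The deleter models the delayed deallocation on the segment: it runs
    // only when the last Chunk referencing the block goes away.
    std::shared_ptr<void> block(::operator new(total),
                                [](void* p) { ::operator delete(p); });

    std::vector<Chunk> chunks;
    chunks.reserve(sizes.size());
    auto* cursor = static_cast<std::byte*>(block.get());
    for (auto s : sizes) {
        chunks.push_back({cursor, s, block}); // each chunk shares ownership
        cursor += s;
    }
    return chunks; // the block is freed when the last Chunk goes away
}
```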
There is a bulk API, but it may have fragmentation effects (positive or negative), as it will attempt to put all buffers contiguously. We also need to see how it deals with (custom) alignment.
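For reference, the Boost.Interprocess documentation sketches the bulk call roughly as below; this is adapted from the docs example, and the docs mark `allocate_many` as an experimental API, so treat it as illustrative.

```cpp
#include <boost/interprocess/managed_shared_memory.hpp>
#include <new>    // std::nothrow
#include <vector>

int main()
{
    using namespace boost::interprocess;
    shared_memory_object::remove("demo_segment");
    managed_shared_memory shm(create_only, "demo_segment", 65536);

    // Allocate 16 buffers of 100 bytes each in a single call; Boost will
    // attempt to carve them out of contiguous memory. Non-throwing version.
    managed_shared_memory::multiallocation_chain chain;
    shm.allocate_many(std::nothrow, 100, 16, chain);
    if (chain.empty()) { return 1; } // bulk allocation failed

    // Traverse the chain and collect the individual buffers.
    std::vector<void*> buffers;
    while (!chain.empty()) {
        void* buf = chain.pop_front();
        buffers.push_back(buf);
    }

    // Each buffer can still be deallocated individually.
    for (void* buf : buffers) { shm.deallocate(buf); }
    shared_memory_object::remove("demo_segment");
    return 0;
}
```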
Ok, but how about we simplify the problem? In the end, what I need to do is to shallow copy messages. How about we optimise that case?
Yeah, this is what I am still wondering: why do you see an allocation at all? I guess you are hitting this case?
Hmm, even if we speculatively pre-allocate those 2-byte buffers to pay the overhead only once every couple of
Yes, I suppose in this particular case we are hitting it, assuming UR means UnmanagedRegion. Are those allocations in shared memory? That could explain the fragmentation as well, no?
Yes, they are in shared memory, and they certainly contribute to overall fragmentation. Either we make a bulk call for Copy to store them together, or, since in this case we know the exact data size we need (a header to store the ref count... but still an unknown number of them), we could make use of a pool allocator.
I would definitely use a pool allocator for those. We probably have thousands of them being created for each timeframe. It would actually be good if we could also have a bulk allocator for messages below 256 bytes or so, since every other message in our data model is a header of around 80 bytes.
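To illustrate the idea: since all ref-count headers have the same known size, a pool can recycle freed slots through a free list without touching the general allocator. A generic sketch, not FairMQ's actual allocator; `FixedPool` is a made-up name, and thread safety and per-slot alignment are left out.

```cpp
#include <cstddef>
#include <vector>

class FixedPool {
public:
    explicit FixedPool(std::size_t slotSize, std::size_t slotsPerBlock = 1024)
        : mSlotSize(slotSize < sizeof(void*) ? sizeof(void*) : slotSize)
        , mSlotsPerBlock(slotsPerBlock)
    {}

    ~FixedPool() { for (auto* b : mBlocks) { ::operator delete(b); } }

    void* Allocate()
    {
        if (!mFreeList) { Grow(); }
        void* slot = mFreeList;
        mFreeList = *static_cast<void**>(slot); // advance to the next free slot
        return slot;
    }

    void Deallocate(void* slot)
    {
        *static_cast<void**>(slot) = mFreeList; // push the slot back on the list
        mFreeList = slot;
    }

private:
    void Grow()
    {
        // One coarse allocation per block; slots come from the free list after.
        auto* block = static_cast<char*>(::operator new(mSlotSize * mSlotsPerBlock));
        mBlocks.push_back(block);
        for (std::size_t i = 0; i < mSlotsPerBlock; ++i) {
            Deallocate(block + i * mSlotSize); // thread every slot onto the free list
        }
    }

    std::size_t mSlotSize;
    std::size_t mSlotsPerBlock;
    void* mFreeList = nullptr;
    std::vector<char*> mBlocks;
};
```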
Small update on this. In the dev branch there is now a first implementation using pools. For reference counting messages from an UnmanagedRegion, it will internally store the counts in a separate segment (1 per UnmanagedRegion) and use a pool to track them. Technically this still syncs on every message, but it should be faster since it is 1. in a separate segment from the main data and 2. uses a pool. This should also certainly improve fragmentation in the main data segment, as now nothing is allocated there for UR messages. The segment size is 10,000,000 bytes by default, which should be good for up to 5,000,000 messages (probably a bit less). If you expect that many or more, you can bump the size in the region configuration: https://github.com/FairRootGroup/FairMQ/blob/dev/fairmq/UnmanagedRegion.h#L137. Could you give it a try and provide some feedback on how this behaves with the bottlenecks you observe? A further step could be to make use of cached pools to reduce synchronization; these could be effective with a bulk message copy API.
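For anyone trying this out, bumping the size would look something like the snippet below. This is hypothetical usage: the `rcSegmentSize` field and the `CreateUnmanagedRegion` overload taking a `RegionConfig` are assumptions based on the linked header and may differ between FairMQ versions.

```cpp
#include <fairmq/UnmanagedRegion.h>
#include <utility>

// Hypothetical: field name taken from the linked UnmanagedRegion.h.
fair::mq::RegionConfig cfg;
cfg.rcSegmentSize = 50000000; // default is 10,000,000 bytes

// Hypothetical overload; `transport`, `regionSize`, and `callback` are
// placeholders. Check your FairMQ version for the exact signature.
auto region = transport->CreateUnmanagedRegion(regionSize, callback, std::move(cfg));
```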
Thanks! I will try ASAP.