
Tracking progress in implementation of Dynamic Graph struct #3

Open
wolfram77 opened this issue Dec 29, 2024 · 9 comments

@wolfram77
Member

wolfram77 commented Dec 29, 2024

I just implemented a concurrent arena allocator which, given a memory pool, hands out fixed-size memory blocks. It also maintains a list of freed memory blocks, and new allocation requests are serviced from this freed list whenever possible.

However, in order to be thread-safe, i.e., to support memory allocation calls from multiple threads concurrently, we use an atomic_flag as a mutex to limit access to the freed list to one thread at a time. To ensure that no thread has to wait too long, we check if the flag is set to busy, and if so, we simply return a block from the pool (which uses an atomic add). However, it is also possible that there is no free space left in the pool. In that case, we repeatedly retry acquiring the mutex (yielding the current thread on failure), and once it is obtained, fetch a memory block from the freed list.

Freeing a memory block, however, must always go through the freed list, so we just retry repeatedly (yielding on failure), and once the lock is acquired, append the memory block to the freed list.
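
To make the scheme concrete, here is a minimal sketch of the above, assuming a pool of fixed-size blocks; the struct and member names are illustrative (not the actual implementation), and error handling is omitted:

  #include <atomic>
  #include <cstddef>
  #include <thread>
  #include <vector>

  struct ConcurrentArena {
    char*  pool;                                   // backing memory pool
    size_t blockSize;                              // fixed size of each block
    size_t capacity;                               // number of blocks in the pool
    std::atomic<size_t> used {0};                  // bump pointer into the pool
    std::vector<void*>  freed;                     // blocks returned by free()
    std::atomic_flag    busy = ATOMIC_FLAG_INIT;   // "mutex" guarding the freed list

    ConcurrentArena(char *p, size_t bs, size_t cap)
    : pool(p), blockSize(bs), capacity(cap) {}

    void* alloc() {
      // If the freed list is busy, take the fast path: bump-allocate from the pool.
      if (busy.test_and_set(std::memory_order_acquire)) {
        size_t i = used.fetch_add(1);
        if (i < capacity) return pool + i*blockSize;
        // Pool exhausted: spin (with yield) until the freed list is available.
        while (busy.test_and_set(std::memory_order_acquire)) std::this_thread::yield();
      }
      // We hold the lock here: serve from the freed list, else fall back to the pool.
      void *p = nullptr;
      if (!freed.empty()) { p = freed.back(); freed.pop_back(); }
      else { size_t i = used.fetch_add(1); if (i < capacity) p = pool + i*blockSize; }
      busy.clear(std::memory_order_release);
      return p;
    }

    void free(void *p) {
      // Freeing always appends to the freed list, so it must acquire the lock.
      while (busy.test_and_set(std::memory_order_acquire)) std::this_thread::yield();
      freed.push_back(p);
      busy.clear(std::memory_order_release);
    }
  };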

The results look like the following when (several) allocations are performed sequentially:

- malloc(): ~10s
- arena.alloc(): ~5s

- free(): ~2s
- arena.free(): ~1s

However, when allocations are done in parallel using 64 threads, the results look quite different:

- malloc(): ~20s
- arena.alloc(): ~20s

- free(): ~1s
- arena.free(): too long

It appears that there is high contention in our concurrent arena allocator. Sad! We could try using per-thread freed lists to resolve this, but then we would have to implement some stealing mechanism to fetch memory blocks that have been freed by other threads. Might as well use libc malloc() instead - particularly for large memory allocations.

In fact, we have another recursive arena allocator which utilizes multiple pools to hand out smaller memory blocks, but it is not thread-safe (it is intended for a single thread). A suitable approach, then, might be to use malloc() for large allocations, and fall back to a per-thread recursive arena allocator for smaller allocations - we may now call it a multi arena allocator instead. LGTM.

See #4 for the code.
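
For illustration only (this is not the code in #4), the large/small dispatch could look roughly like the following, with a trivial bump arena standing in for the per-thread recursive arena; the names and thresholds are assumptions:

  #include <cstddef>
  #include <cstdlib>
  #include <omp.h>

  const size_t SMALL_LIMIT = 2048;       // assumed threshold between arena and libc
  const size_t ARENA_BYTES = 64 * 1024;  // assumed per-thread pool size
  const int    MAX_THREADS = 256;        // assumed upper bound on threads

  struct ThreadArena {       // trivial bump arena, standing in for the recursive arena
    char*  pool = nullptr;
    size_t used = 0;
    void* alloc(size_t bytes) {
      if (!pool) pool = (char*) std::malloc(ARENA_BYTES);
      if (used + bytes > ARENA_BYTES) return std::malloc(bytes);  // pool full: libc
      void *p = pool + used; used += bytes; return p;
    }
  };

  struct MultiArena {
    ThreadArena arenas[MAX_THREADS];     // one private arena per thread

    void* alloc(size_t bytes) {
      if (bytes > SMALL_LIMIT) return std::malloc(bytes);      // large: libc malloc()
      return arenas[omp_get_thread_num()].alloc(bytes);        // small: own arena
    }
  };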

@wolfram77
Member Author

Let's think about what a graph would require:

  • Concurrent allocators of various sizes, which can quickly free and allocate new memory blocks; they had better be multi-level too.
  • Ensure that vertex access is efficient, by using SoA vectors (see the sketch after this list).
  • Ensure that edge weights are stored in a separate array.
  • Delay merging of edges into a single array until the graph is finalized.
  • Select a different merging strategy, based on the number of edges inserted / deleted.
  • Perform fast deletion of edges, i.e., deletions come first.
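
As a rough sketch of the SoA vertex layout mentioned above (the names are illustrative, not the actual struct in this repo):

  #include <cstddef>
  #include <vector>

  template <class K, class E>
  struct DiGraphSoA {
    std::vector<char>   exists;      // whether each vertex is present
    std::vector<size_t> degrees;     // out-degree of each vertex
    std::vector<K*>     edgeKeys;    // per-vertex edge target arrays (arena-allocated)
    std::vector<E*>     edgeValues;  // per-vertex edge weights, kept in a separate array
  };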

@wolfram77
Member Author

Measuring performance of CSR read vs Graph read:

image

@wolfram77
Member Author

Now able to handle symmetric graphs:
output-graph-openmp--load-undirected-with-double-edges-in-edgelist.log

@wolfram77
Member Author

Now able to convert CSR to DiGraph:
output-graph-openmp--convert-csr-to-digraph.log

@wolfram77
Member Author

Interesting to see the runtime split of duplicateIfOmpW(), which is used to convert the graph from CSR to DiGraph. Reserving space for the graph and for the edges appears to be the largest bottleneck, followed by updateOmpU().

Logs: output-graph-openmp--profile-duplicate-graph.log

image

@wolfram77
Member Author

wolfram77 commented Jan 19, 2025

Currently attempting to load into an arena graph, but have hit a wall with a double-free error while reserving memory for the edges.

Image

The relevant code is given below:

  // Reserve space for each vertex's edges in the arena-based graph, in parallel.
  #pragma omp parallel for schedule(dynamic, 2048)
  for (K u=0; u<S; ++u) {
    if (!x.hasVertex(u)) continue;      // skip vertices not present in the source graph x
    a.allocateEdges(u, x.degree(u));    // reserve capacity for u's edges in the target graph a
  }

Arena-based digraph is now working. The double-free issue was due to using a single allocator across all threads. Another issue we faced was a failure to populate the edges, resulting in an abrupt program crash. This was due to using __builtin_clz() for computing the capacity of the slots, but __builtin_clz() only works on 32-bit integers, not size_t. Below are the results after fixing.

Image
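
For reference, __builtin_clz() takes an unsigned int (32 bits), so computing a power-of-two capacity from a 64-bit size_t needs __builtin_clzll() instead. A sketch of such a rounding helper (not the exact function used in this repo):

  #include <cstddef>

  // Smallest power of two >= n, valid for 64-bit sizes (n > 0).
  inline size_t slotCapacity(size_t n) {
    if (n <= 1) return 1;
    return size_t(1) << (64 - __builtin_clzll(n - 1));   // clzll handles 64-bit operands
  }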

@wolfram77
Member Author

Allocations up to 2KB are handled by the arena allocator (with a capacity of 8 * 64KB); the rest are managed by libc.
output-graph-openmp--approach-csr-graph-no-sort.log

Image

Image

@wolfram77
Member Author

wolfram77 commented Jan 23, 2025

Allocations up to 8KB are handled by the arena allocator (with a capacity of 10 * 512KB); the rest are managed by libc.
BTW, why is resizing the vertex arrays so slow, particularly for k-mer graphs?
output-graph-openmp--approach-duplicate-8192.log

Image

Image

Note

We also try to minimize the memory held by the arena allocators. Tested on the sk-2005 graph. The best config seems to be using the arena allocators for allocations up to 8KB, with the capacity of each allocator being 512KB.

@wolfram77
Member Author

Allocations up to 8KB are handled by the arena allocator (with a capacity of 10 * 512KB); the rest are managed by libc.
Now we are managing the vertex arrays ourselves, so we can initialize them in parallel.
The improved locality of the variables seems to improve the performance of add-edges as well.
output-graph-openmp--approach-duplicate-8192-custom-arrays.log

Image

Image

Note

vector.resize() is sequential and is a bottleneck for graphs with a large number of vertices (such as k-mer graphs).
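
A sketch of the workaround described above: allocate the vertex arrays manually and fill them with an OpenMP loop, instead of relying on std::vector::resize() (which value-initializes elements on a single thread). The helper name and chunk size are illustrative; T is assumed to be trivially copyable.

  #include <cstddef>
  #include <cstdlib>
  #include <omp.h>

  // Allocate an array of n elements and fill it in parallel (T assumed trivial).
  template <class T>
  T* parallelFilledArray(size_t n, T init) {
    T *a = (T*) std::malloc(n * sizeof(T));
    #pragma omp parallel for schedule(static, 2048)
    for (size_t i = 0; i < n; ++i)
      a[i] = init;                  // pages are touched by multiple threads
    return a;
  }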
