
Tracking progress in implementation of Dynamic Graph struct #3

Open
wolfram77 opened this issue Dec 29, 2024 · 9 comments

@wolfram77
Member

wolfram77 commented Dec 29, 2024

I just implemented a concurrent arena allocator which, given a memory pool, hands out fixed-size memory blocks. It also maintains a list of freed memory blocks, and new allocation requests are serviced from this freed list whenever possible.

However, in order to be thread-safe, i.e., to support memory allocation calls from multiple threads concurrently, we use an atomic_flag as a mutex to limit access to the freed list to one thread at a time. To ensure that no thread has to wait too long, we check if the flag is set to busy, and if so, we simply return a block from the pool (which uses an atomic add). However, it is also possible that there is no free space left in the pool. In that case, we repeatedly retry acquiring the mutex (yielding the current thread on failure), and once it is obtained, fetch a memory block from the freed list.

Freeing a memory block, however, must always go through the freed list, so we just retry repeatedly (yielding on failure), and once the lock is acquired, append the memory block to the freed list.
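
To make the scheme concrete, here is a minimal sketch of the above, assuming a pool of fixed-size blocks; the struct and member names are illustrative (not the actual implementation), and error handling is omitted:

  #include <atomic>
  #include <cstddef>
  #include <thread>
  #include <vector>

  struct ConcurrentArena {
    char*  pool;                                   // backing memory pool
    size_t blockSize;                              // fixed size of each block
    size_t capacity;                               // number of blocks in the pool
    std::atomic<size_t> used {0};                  // bump pointer into the pool
    std::vector<void*>  freed;                     // blocks returned by free()
    std::atomic_flag    busy = ATOMIC_FLAG_INIT;   // "mutex" guarding the freed list

    ConcurrentArena(char *p, size_t bs, size_t cap)
    : pool(p), blockSize(bs), capacity(cap) {}

    void* alloc() {
      // If the freed list is busy, take the fast path: bump-allocate from the pool.
      if (busy.test_and_set(std::memory_order_acquire)) {
        size_t i = used.fetch_add(1);
        if (i < capacity) return pool + i*blockSize;
        // Pool exhausted: spin (with yield) until the freed list is available.
        while (busy.test_and_set(std::memory_order_acquire)) std::this_thread::yield();
      }
      // We hold the lock here: serve from the freed list, else fall back to the pool.
      void *p = nullptr;
      if (!freed.empty()) { p = freed.back(); freed.pop_back(); }
      else { size_t i = used.fetch_add(1); if (i < capacity) p = pool + i*blockSize; }
      busy.clear(std::memory_order_release);
      return p;
    }

    void free(void *p) {
      // Freeing always appends to the freed list, so it must acquire the lock.
      while (busy.test_and_set(std::memory_order_acquire)) std::this_thread::yield();
      freed.push_back(p);
      busy.clear(std::memory_order_release);
    }
  };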

The results look like the following when (several) allocations are performed sequentially:

- malloc(): ~10s
- arena.alloc(): ~5s

- free(): ~2s
- arena.free(): ~1s

However, when allocations are done in parallel using 64 threads, the results look quite different:

- malloc(): ~20s
- arena.alloc(): ~20s

- free(): ~1s
- arena.free(): too long

It appears that there is high contention in our concurrent arena allocator. Sad! We could try using per-thread freed lists to resolve this, but then we would have to implement some stealing mechanism to fetch memory blocks that have been freed by other threads. Might as well use libc malloc() instead - particularly for large memory allocations.

In fact, we have another recursive arena allocator which utilizes multiple pools to hand out smaller memory blocks, but it is not thread-safe (it is intended for a single thread). A suitable approach, then, might be to use malloc() for large allocations, and fall back to a per-thread recursive arena allocator for smaller allocations - we may now call it a multi arena allocator instead. LGTM.

See #4 for the code.
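
For illustration only (this is not the code in #4), the large/small dispatch could look roughly like the following, with a trivial bump arena standing in for the per-thread recursive arena; the names and thresholds are assumptions:

  #include <cstddef>
  #include <cstdlib>
  #include <omp.h>

  const size_t SMALL_LIMIT = 2048;       // assumed threshold between arena and libc
  const size_t ARENA_BYTES = 64 * 1024;  // assumed per-thread pool size
  const int    MAX_THREADS = 256;        // assumed upper bound on threads

  struct ThreadArena {       // trivial bump arena, standing in for the recursive arena
    char*  pool = nullptr;
    size_t used = 0;
    void* alloc(size_t bytes) {
      if (!pool) pool = (char*) std::malloc(ARENA_BYTES);
      if (used + bytes > ARENA_BYTES) return std::malloc(bytes);  // pool full: libc
      void *p = pool + used; used += bytes; return p;
    }
  };

  struct MultiArena {
    ThreadArena arenas[MAX_THREADS];     // one private arena per thread

    void* alloc(size_t bytes) {
      if (bytes > SMALL_LIMIT) return std::malloc(bytes);      // large: libc malloc()
      return arenas[omp_get_thread_num()].alloc(bytes);        // small: own arena
    }
  };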

@wolfram77
Member Author

Let's think about what a graph would require:

  • Concurrent allocators of various sizes, which can quickly free and allocate new memory blocks; they had better be multi-level too.
  • Ensure that vertex access is efficient, by using SoA vectors (see the sketch after this list).
  • Ensure that edge weights are stored in a separate array.
  • Delay merging of edges into a single array until the graph is finalized.
  • Select a different merging strategy, based on the number of edges inserted / deleted.
  • Perform fast deletion of edges, i.e., deletions come first.
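
As a rough sketch of the SoA vertex layout mentioned above (the names are illustrative, not the actual struct in this repo):

  #include <cstddef>
  #include <vector>

  template <class K, class E>
  struct DiGraphSoA {
    std::vector<char>   exists;      // whether each vertex is present
    std::vector<size_t> degrees;     // out-degree of each vertex
    std::vector<K*>     edgeKeys;    // per-vertex edge target arrays (arena-allocated)
    std::vector<E*>     edgeValues;  // per-vertex edge weights, kept in a separate array
  };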

@wolfram77
Member Author

Measuring performance of CSR read vs Graph read:

image

@wolfram77
Member Author

Now able to handle symmetric graphs:
output-graph-openmp--load-undirected-with-double-edges-in-edgelist.log

@wolfram77
Member Author

Now able to convert CSR to DiGraph:
output-graph-openmp--convert-csr-to-digraph.log

@wolfram77
Member Author

Interesting to see the runtime split of duplicateIfOmpW(), which is used to convert the graph from CSR to DiGraph. Reserving space for the graph and for the edges appears to be the largest bottleneck, followed by updateOmpU().

Logs: output-graph-openmp--profile-duplicate-graph.log

image

@wolfram77
Member Author

wolfram77 commented Jan 19, 2025

Currently attempting to load into an arena graph, but have hit a wall with a double-free error while reserving memory for the edges.

Image

The relevant code is given below:

  // Reserve space for each vertex's edges in the arena-based graph, in parallel.
  #pragma omp parallel for schedule(dynamic, 2048)
  for (K u=0; u<S; ++u) {
    if (!x.hasVertex(u)) continue;      // skip vertices not present in the source graph x
    a.allocateEdges(u, x.degree(u));    // reserve capacity for u's edges in the target graph a
  }

Arena-based digraph is now working. The double-free issue was due to using a single allocator across all threads. Another issue we faced was a failure to populate the edges, resulting in an abrupt program crash. This was due to using __builtin_clz() for computing the capacity of the slots, but __builtin_clz() only works on 32-bit integers, not size_t. Below are the results after fixing.

Image
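
For reference, __builtin_clz() takes an unsigned int (32 bits), so computing a power-of-two capacity from a 64-bit size_t needs __builtin_clzll() instead. A sketch of such a rounding helper (not the exact function used in this repo):

  #include <cstddef>

  // Smallest power of two >= n, valid for 64-bit sizes (n > 0).
  inline size_t slotCapacity(size_t n) {
    if (n <= 1) return 1;
    return size_t(1) << (64 - __builtin_clzll(n - 1));   // clzll handles 64-bit operands
  }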

@wolfram77
Member Author

Allocations up to 2KB are handled by the arena allocator (with a capacity of 8 * 64KB); the rest are managed by libc.
output-graph-openmp--approach-csr-graph-no-sort.log

Image

Image

@wolfram77
Member Author

wolfram77 commented Jan 23, 2025

Allocations up to 8KB are handled by the arena allocator (with a capacity of 10 * 512KB); the rest are managed by libc.
BTW, why is resizing the vertex arrays so slow, particularly for k-mer graphs?
output-graph-openmp--approach-duplicate-8192.log

Image

Image

Note

We also try to minimize the memory held by the arena allocators. Tested on the sk-2005 graph. The best config seems to be using the arena allocators for allocations up to 8KB, with the capacity of each allocator being 512KB.

@wolfram77
Member Author

Allocations up to 8KB are handled by the arena allocator (with a capacity of 10 * 512KB); the rest are managed by libc.
Now we are managing the vertex arrays ourselves, so we can initialize them in parallel.
The improved locality of the variables seems to improve the performance of add-edges as well.
output-graph-openmp--approach-duplicate-8192-custom-arrays.log

Image

Image

Note

vector.resize() is sequential and is a bottleneck for graphs with a large number of vertices (such as k-mer graphs).
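
A sketch of the workaround described above: allocate the vertex arrays manually and fill them with an OpenMP loop, instead of relying on std::vector::resize() (which value-initializes elements on a single thread). The helper name and chunk size are illustrative; T is assumed to be trivially copyable.

  #include <cstddef>
  #include <cstdlib>
  #include <omp.h>

  // Allocate an array of n elements and fill it in parallel (T assumed trivial).
  template <class T>
  T* parallelFilledArray(size_t n, T init) {
    T *a = (T*) std::malloc(n * sizeof(T));
    #pragma omp parallel for schedule(static, 2048)
    for (size_t i = 0; i < n; ++i)
      a[i] = init;                  // pages are touched by multiple threads
    return a;
  }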
