
I'm DIYing a katago-like project. Can I get some advice on backend choices/multithreading? #1014

Open
Garbage123King opened this issue Jan 17, 2025 · 3 comments


Garbage123King commented Jan 17, 2025

1. I use libtorch as the backend for forward propagation because it is easy to use, but I have run into trouble. Due to what seem to be caching or synchronization issues, forward propagation slows down by more than 50x, from 0.0001 seconds per iteration to 0.01 seconds per iteration, which makes self-play nearly impossible. Should I keep working with libtorch, or should I switch to CUDA sooner?

2. I use a single thread to handle forward-propagation requests from all 128 game threads. This thread uses a queue protected by a mutex. Each of the 128 threads runs its own game simulation and, during each simulation, waits for the neural-network thread to return a result. The neural-network thread runs a forward pass when the queue is empty or the batch size reaches 128, and then returns the results to each thread via promise.set_value. I have measured that the multithreading part of my code does not introduce much delay; the main delay is still the forward-propagation cost described in point 1. Still, I would like to ask: should I change my multithreading approach?
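For concreteness, here is a minimal sketch of the request/batching pattern described in point 2, assuming a mutex-protected queue, a condition variable, and std::promise/std::future for the replies; the names (EvalRequest, Evaluator, evaluate_batch) are illustrative, not from any existing project:

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <future>
#include <mutex>
#include <utility>
#include <vector>

// One pending evaluation: encoded input planes plus a channel for the reply.
struct EvalRequest {
    std::vector<float> features;
    std::promise<std::vector<float>> result;
};

class Evaluator {
public:
    // Called by each game thread; blocks until the network thread replies.
    std::vector<float> query(std::vector<float> features) {
        EvalRequest req{std::move(features), {}};
        std::future<std::vector<float>> fut = req.result.get_future();
        {
            std::lock_guard<std::mutex> lock(mu_);
            queue_.push_back(std::move(req));
        }
        cv_.notify_one();
        return fut.get();
    }

    // Run by the single neural-network thread.
    void loop() {
        for (;;) {
            std::vector<EvalRequest> batch;
            {
                std::unique_lock<std::mutex> lock(mu_);
                cv_.wait(lock, [this] { return !queue_.empty(); });
                // Take requests until the queue drains or the batch is full.
                while (!queue_.empty() && batch.size() < kMaxBatch) {
                    batch.push_back(std::move(queue_.front()));
                    queue_.pop_front();
                }
            }
            auto outputs = evaluate_batch(batch);  // one forward pass per batch
            for (std::size_t i = 0; i < batch.size(); ++i)
                batch[i].result.set_value(std::move(outputs[i]));
        }
    }

private:
    // Placeholder; a real implementation would run the libtorch/CUDA forward pass.
    std::vector<std::vector<float>> evaluate_batch(const std::vector<EvalRequest>& batch) {
        return std::vector<std::vector<float>>(batch.size());
    }

    static constexpr std::size_t kMaxBatch = 128;
    std::mutex mu_;
    std::condition_variable cv_;
    std::deque<EvalRequest> queue_;
};
```

The game threads block in query() on fut.get(), while the network thread drains whatever is queued (up to 128 requests) and answers the whole batch with a single forward pass.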

Garbage123King changed the title from “I'm DIYing a katago-like project. Can I get some advice on backend choices/multithreading?” to “katago use pytorch to train but doesn't use a libtorch backend?” on Jan 17, 2025
lightvector (Owner) commented Jan 17, 2025

I think the old version of your question was more useful. :)

KataGo has various custom backends partly because it was following what Leela Zero did, and because having a few different backends of different kinds makes it possible to run on different hardware in different modes without having to install other dependencies. There's not necessarily a big advantage to doing all that work if you have something like libtorch working.

Garbage123King changed the title from “katago use pytorch to train but doesn't use a libtorch backend?” back to “I'm DIYing a katago-like project. Can I get some advice on backend choices/multithreading?” on Jan 17, 2025
lightvector (Owner) commented

Ah looks like you switched the question back?

Anyways, for answer 1 - you would probably want to investigate whether this is a proportional slowdown of some sort or if it's just a fixed overhead. E.g. is it actually 50x slower, or is it exactly the same speed with 0.01 seconds of fixed overhead? If it's a fixed overhead, you can often mitigate that just by playing even more games in parallel and using a far larger batch size. AlphaZero-style data generation is almost infinitely parallelizable.
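One rough way to check this is to time the forward pass alone at a few batch sizes: if the per-call time barely changes between batch 1 and batch 128, the cost is mostly fixed overhead and larger batches will amortize it. A sketch, assuming a TorchScript model exported as model.pt and a libtorch version that provides torch::cuda::synchronize() (illustrative code, not KataGo's):

```cpp
#include <torch/script.h>
#include <torch/torch.h>
#include <chrono>
#include <iostream>

int main() {
    torch::NoGradGuard no_grad;  // inference only
    torch::jit::script::Module model = torch::jit::load("model.pt");
    model.to(torch::kCUDA);
    model.eval();

    for (int64_t batch : {1, 8, 32, 128}) {
        auto x = torch::randn({batch, 4, 19, 19},
                              torch::TensorOptions().device(torch::kCUDA));
        model.forward({x});              // warm-up pass (lazy init, autotuning)
        torch::cuda::synchronize();

        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < 100; ++i) model.forward({x});
        torch::cuda::synchronize();      // wait for queued GPU work to finish
        auto t1 = std::chrono::steady_clock::now();

        std::cout << "batch " << batch << ": "
                  << std::chrono::duration<double>(t1 - t0).count() / 100
                  << " s per forward pass\n";
    }
    return 0;
}
```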

For 2 - That sounds like a sensible architecture except for the part where the neural network thread only runs once the batch size reaches 128. If you only have 128 game threads, that sounds like it would introduce substantial delay, because each time it would have to wait for every thread to want another neural network query, slowing everything down to the speed of the slowest game thread. Also, what happens when one game finishes and stops needing neural network queries, so you only have 127? How does the thread work if it still waits for >= 128?

If you've measured this and you think it's performing well anyways, then maybe this doesn't matter. Still, the way to address it might be similar to answer 1 - have even more game threads relative to the neural network batch size, so that there are always enough queries ready to send in a batch without having to wait for stragglers. This architecture seems otherwise quite fine, and if you think the delay is due to libtorch rather than anything related to the threading and waiting here, then it would still likely work just fine with whatever replacement for libtorch you choose.


Garbage123King commented Jan 22, 2025

Thank you for your answer. In fact, the neural network thread checks (batch.size() >= 128 || queue.empty()), meaning that as long as the queue is empty, propagation proceeds immediately even if the batch size is less than 128. This is easy and useful, isn't it? When I clarified this, I was amazed by the details ChatGPT generated, haha...

However, the real issue is that I can only achieve a batch size of 126-128 on nearly every forward pass when using the CPU as the device instead of the GPU. Each thread reports an average of 0.007 seconds per visit (with each MCTS performing 100 or 600 visits), which is barely acceptable, but I don't want to use the CPU. Here comes the problem: when using the GPU, it's slower than the CPU. Every time a forward pass occurs (i.e., when the batch queue is empty), the batch size is only between 10 and 70, and each thread reports an average of 0.05 seconds per visit, which is completely unacceptable. Yet when I do a forward pass with a random tensor of shape (128, 4, 19, 19), it takes only 0.0003 seconds. Then I made a key experiment: I cut out the forward pass entirely and even stopped the threads from sending messages. That revealed the real issue: merely constructing or modifying the input-state tensor is already slow on its own, tens or hundreds of times slower than that 0.0003 seconds. Here are some methods I tried:

1. If I first store the board state in a CPU tensor and then convert it to a kCUDA tensor using .to(device), the conversion time for one visit jumps to 0.008 seconds when the tensor size is (1, 4, 19, 19). With (128, 4, 19, 19), the conversion time skyrockets to 0.07 seconds.

2. When constructing the game-state object, I initialize a tensor of the required shape directly on kCUDA, then clone the root's state tensor before each visit. That way I can manipulate the tensor directly and discard it after the visit. For each selection, a function like make_move sets the board state starting from the root, where make_move uses a statement like tensor[row][col] = 1.0f. However, the tensor's .clone() is quite slow, and tensor[row][col] = 1.0f is much slower than a typical CPU instruction, which adds up to significant delay.

3. Building on method 2, I removed the cloning step and added an unmake function to restore the MCTS to the root state, so cloning is no longer needed. But this didn't solve the problem either: during one visit, make + unmake took 0.014-0.018 seconds in total.
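One common mitigation for all three of the approaches above, sketched here with illustrative names and assuming CUDA is available, is to build the whole input batch in an ordinary CPU tensor, where per-element writes are plain memory stores rather than per-element GPU operations, and then copy the batch to the device once per forward pass:

```cpp
#include <torch/torch.h>

// Build the batch on the CPU: accessor writes are plain memory stores.
torch::Tensor build_batch_on_cpu(int64_t batch_size) {
    torch::Tensor batch = torch::zeros(
        {batch_size, 4, 19, 19},
        torch::TensorOptions().dtype(torch::kFloat32));
    // Pinned (page-locked) host memory speeds up the host-to-device copy
    // and allows a non-blocking transfer (requires CUDA to be available).
    batch = batch.pin_memory();

    auto a = batch.accessor<float, 4>();
    // Example: set one feature-plane entry for the first position in the batch.
    a[0][0][3][3] = 1.0f;
    return batch;
}

// One bulk host-to-device copy per forward pass, instead of many tiny ones.
torch::Tensor to_device(const torch::Tensor& cpu_batch) {
    return cpu_batch.to(torch::kCUDA, /*non_blocking=*/true);
}
```

Pinned memory and non_blocking are optional refinements; the main point is to replace many tiny device-side writes and transfers with a single bulk copy per batch.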

In summary, I realized that PyTorch is something I can't control. Perhaps I really should learn how to write the neural network in CUDA myself. Although that would be a much bigger project and would significantly delay my DIY timeline, I may have no other choice. Interestingly, KataGo seems to train its networks with PyTorch but run them with a CUDA backend. How is that achieved?

Off-topic: with Torch, even if you just sleep for a while and do nothing, the next forward pass is slower. I asked about this here, and someone told me that there's not much control over such low-level issues. StackOverflow question link.
