Ants at a picnic

In his paper about Dynamic Tree Splitting (DTS), Robert Hyatt likened the processors working on a branch of the search tree to ants eating crumbs at a picnic. They eat one crumb together, and it gets smaller and smaller until it disappears completely. Then the ants turn their attention to another crumb and start eating. This continues as long as there are crumbs. Making the processors work together like this requires careful design. Specifically, the processors are all equal; there are no masters or slaves, only peers. One processor may start working on a branch, but another one may complete it. The DTS paper explains how this is done.

While reading the DTS paper again, I noticed that there are many similarities between the supercomputers of the 1980s and modern GPUs. The shared memory architecture, for example, makes the DTS algorithm behave similarly on both of them. In contrast, on clusters, one must take the communication costs into account. But on GPUs the communication happens through the memory, so it can be as fast as the memory accesses themselves. On the other hand, there are differences. GPUs typically have more complicated memory hierarchies than the old Cray supercomputers did (I am not an expert on this, but it appears to me that their memory hierarchy was quite flat).

The memory hierarchy causes some problems on the GPU. Ideally, each processor, or in the case of a GPU each thread, should have its own copy of the tree state. This is required because the processors must be equal: any one of them must be able to complete the search and return the value to the root of the tree. This takes some memory. Unfortunately, the memory consumption is greater than with my current approach, where I have one search stack per block of threads. I need one stack per thread. The stacks will be smaller, but not so much smaller that this would compensate for the increased number of stacks. And I will have to use global memory. A typical stack has hundreds of positions, each taking 48 bytes. This could mean tens of kilobytes per stack. While normal stack operations yield the same number of writes to global memory, splitting the tree causes extra operations because the stacks must be copied. I have not tested this yet, so I cannot say how bad this overhead is. Anyway, the finer granularity of the search will guarantee better occupancy of the multiprocessors, not only in theory but also in practice, meaning that the processors perform fewer futile operations and spend most of their time searching nodes. Currently, for example, when a processor has processed its stack, it simply waits for the others to complete theirs. This is clearly inefficient, and a DTS-like approach can definitely alleviate the problem.
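As a rough sketch of what such a split could look like with per-thread stacks (all sizes, names, and types here are my assumptions for illustration, not the engine's actual layout), the extra cost is essentially one copy of the positions up to the split point:

```cpp
#include <cstring>

// Hypothetical sizes; the real engine uses 48-byte positions and
// stacks with hundreds of entries per thread.
constexpr int MAX_PLY = 8;

struct Position { unsigned long long bb[6]; };  // six bitboards, 48 bytes

struct SearchStack {
    Position pos[MAX_PLY];  // one position per ply
    int top;                // current ply
};

// Splitting the tree: an idle thread joins the search by copying the
// owner's stack up to the split ply, after which both threads continue
// independently. On a GPU this copy goes through global memory, which
// is the extra overhead discussed above.
void split(const SearchStack& owner, SearchStack& helper, int split_ply) {
    std::memcpy(helper.pos, owner.pos, sizeof(Position) * (split_ply + 1));
    helper.top = split_ply;
}

// Small self-check: after the split, the helper shares the owner's
// positions up to the split ply.
bool split_demo() {
    SearchStack owner{}, helper{};
    owner.pos[0].bb[0] = 42;
    owner.top = 5;
    split(owner, helper, 2);
    return helper.top == 2 && helper.pos[0].bb[0] == 42;
}
```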

It is difficult to say how successful my implementation will be in realizing the potential of the DTS algorithm. It will most probably take me a month or so to get this thing working at some level. I am travelling for the next two weeks, so I do not have access to a CUDA-capable computer. But I still have time to develop the ideas further.


Anatomy of the alpha-beta search

Chess is primarily a tactical game. Exact calculation of move series is important. That is most likely the reason why applications of probabilistic methods to chess engines have failed so far. I do not try to change this, because I believe that there is no replacement for accurate analysis. Most probably, I will use some version of alpha-beta pruning in my chess engine. The basic algorithm is serial, so one must carefully design how to best utilize parallelism.
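To make the serial dependency concrete, here is a minimal alpha-beta sketch in negamax form over a toy tree (the `Node` type is purely illustrative, not the engine's board representation). Each searched move tightens the (alpha, beta) window for its siblings, and that dependency between siblings is what a parallel search has to work around.

```cpp
#include <vector>
#include <algorithm>

// Toy game tree: a leaf carries a static value, an inner node only
// has children. Values are from the side to move's point of view.
struct Node {
    int value;                   // used at the leaves
    std::vector<Node> children;  // empty => leaf
};

int alphabeta(const Node& n, int alpha, int beta) {
    if (n.children.empty()) return n.value;
    for (const Node& child : n.children) {
        int score = -alphabeta(child, -beta, -alpha);
        if (score >= beta) return beta;  // beta cut-off: remaining moves pruned
        alpha = std::max(alpha, score);  // improved bound narrows siblings' window
    }
    return alpha;
}

// A root with three leaf children valued 3, -2 and 5 (for the child's
// side to move); the root's best score is max(-3, 2, -5) = 2.
int demo_value() {
    Node root{0, {Node{3, {}}, Node{-2, {}}, Node{5, {}}}};
    return alphabeta(root, -100, 100);
}
```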

Simply splitting the search so that different moves are assigned to different threads is inefficient for two reasons. First, the search of a move may improve the alpha bound. Then, if one has already started searching the other moves with the original bound, some effort is wasted. Second, the search of a move may cause a beta cut-off. Then, searching all the other moves is completely wasted effort. Typically, parallel algorithms try to minimize this futile work.

Because move ordering is usually good in modern chess engines, thanks to iterative deepening, transposition tables, and various other techniques, beta cut-offs happen with the first move in about 90% of cases. That is why one should avoid splitting the work at Cut nodes (nodes where a child may cause a beta cut-off). Instead, splitting the work at All nodes (nodes where all children must be searched) is preferred. However, it cannot be known for sure which nodes are of which type. With perfect move ordering it would be possible, but we cannot achieve that. And if we did achieve it, we would not need to search at all.

It would be possible to implement one of the existing parallel search algorithms. However, they do not scale that well to a large number of processors. Probably the best-known algorithm is Dynamic Tree Splitting (DTS) by Robert Hyatt and the Cray Blitz team. Many programmers consider it difficult to implement. One cannot use the typical recursive structure of function calls in the search with it. For me that is not a problem, because the structure of my program is not recursive. That offers great flexibility, because any node can be searched by any thread. So I am thinking about implementing a DTS-like algorithm on the GPU.


Breaking News: 400M positions per second!

I did it. It was not so difficult after all. Because it was obvious that the global memory writes were slowing down the performance, I tried to figure out how to reduce their number. First, I noticed that I could not get rid of the six store operations for the six bitboards per position, because there are no instructions that write more than 64 bits at once. That gives an upper bound of 87.5M nps. I was not happy with that. Then I thought about ways to avoid putting the positions in the global memory at all. Because the shared memory is limited, I could not use it for the whole position stack. But what about the last ply? Well, you can guess...

So after eliminating the global memory writes for the last ply, I got huge performance benefits. Because only the second-to-last ply is written to the global memory, there are surely enough instructions (handling the last ply) to hide the memory latency. This also means that I can cope with fewer threads, as the performance converges towards the upper bound much faster with high instruction-level parallelism. Below are some initial results. I used 128 threads per block, because I found that to be the optimal number for the current version of the move generator.
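The shape of the optimization can be sketched in plain C++ with a toy move generator (a fixed branching factor stands in for the real chess one; the write counter is a stand-in for global memory traffic, not anything from the actual kernel): at the last ply only the move count is needed, so no positions are pushed at all.

```cpp
#include <cstdint>

// Toy model: every position has BRANCH moves, so perft(d) = BRANCH^d.
// The real engine generates chess moves, but the recursion shape is
// the same.
constexpr uint64_t BRANCH = 4;

uint64_t stack_writes = 0;  // stand-in for global-memory position writes

uint64_t perft(int depth) {
    if (depth == 0) return 1;
    if (depth == 1) return BRANCH;  // last ply: just count the moves,
                                    // no positions are written anywhere
    uint64_t nodes = 0;
    for (uint64_t m = 0; m < BRANCH; ++m) {
        ++stack_writes;             // push child position onto the stack
        nodes += perft(depth - 1);
    }
    return nodes;
}

// Helper: writes incurred by one perft(depth) call from a clean start.
uint64_t count_writes(int depth) {
    stack_writes = 0;
    (void)perft(depth);
    return stack_writes;
}
```

With BRANCH = 4 and depth 3, the leaves (64 positions) are never written; only the 4 + 16 = 20 positions of the upper plies are, which is where the large saving comes from.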

Performance in Mnps vs. number of blocks.

The line is not smooth in the beginning, because with different numbers of blocks, the multiprocessors get different numbers of blocks to process. Multiples of 60 blocks yield the best performance. That is four times the number of multiprocessors (15) on the GTX 480. I am not sure why this is good.

But now that I have a working and efficient move generator, I can start to think about the next big problem, which is the parallel search.


Approaching the performance goal

I have continued testing the move generator. After some obvious optimizations, I got the speed up to 2.5M nodes per second with one multiprocessor. Then it was time to give all the multiprocessors work to do. The approach I used was to give each multiprocessor a separate stack to operate on, because I noticed that performance suffers if different multiprocessors write to global memory addresses close to each other. Now one can give each multiprocessor its own starting position and let it process that. For perft(6), for example, this means calling perft(5) for each child position of the initial position.
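A sketch of this root splitting, again with a toy fixed-branching "game" in place of real chess moves (the loop body stands in for the work one block would do on its own stack):

```cpp
#include <cstdint>
#include <vector>

// Toy stand-in for the move generator: every position has BRANCH
// moves, so perft(d) = BRANCH^d.
constexpr uint64_t BRANCH = 4;

uint64_t perft(int depth) {
    if (depth == 0) return 1;
    uint64_t nodes = 0;
    for (uint64_t m = 0; m < BRANCH; ++m) nodes += perft(depth - 1);
    return nodes;
}

// Root splitting: perft(depth) equals the sum of perft(depth - 1) over
// the root's children. On the GPU, each multiprocessor takes one child
// and its own stack, so the blocks never touch each other's memory.
uint64_t split_perft(int depth) {
    std::vector<uint64_t> per_block(BRANCH);
    for (uint64_t child = 0; child < BRANCH; ++child)  // one "block" per child
        per_block[child] = perft(depth - 1);
    uint64_t total = 0;
    for (uint64_t n : per_block) total += n;           // combine the results
    return total;
}
```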

Below is a figure that shows the scaling with more than one multiprocessor. A block has 512 threads and the number of blocks varies from 2 to 26 in the test. The line is not straight in the beginning, because not all multiprocessors have work until there are at least 15 blocks, i.e. 7.5k threads in this case.

Scaling with several multiprocessors. Speed (Mnps) vs. number of threads.

Further increasing the number of blocks allows more efficient hiding of the memory latency. Below is another figure that shows the scaling up to 400 blocks. Now we can see performance figures that are of the correct order of magnitude.

Scaling with a large number of blocks. Speed (nps) vs. number of blocks.

The maximum speed achieved is about 62M nps. I had calculated the absolute upper bound for the GTX 480 to be 175M nps if 48 bytes are written per position. I think that, in practice, the compiler messes things up and produces twice as many writes. I might have to change the data types used in the kernel to get the correct number of writes. With the larger number of global memory writes, the upper bound becomes 87.5M nps. Yet another thing that slows things down is how I have currently implemented the control parameters, e.g. depth and node count: I have a separate stack for them in global memory. Eventually, I want to embed the control parameters into the position stack.


Perft running on the GPU

I finally got the move generator working properly, i.e. producing correct node counts, on the GPU, at least for one multiprocessor. I am still having trouble getting it to work on multiple multiprocessors. But the good news is that with one multiprocessor, the scaling with the number of threads is almost perfectly linear. Look at the picture below. It shows the perft(5) speed with different numbers of threads (192-576). With larger numbers of threads, something breaks and the performance decreases dramatically. I think this happens because the compiler runs out of registers and puts the variables in the local memory, which is painfully slow.

Speed (million nodes per second) with varying numbers of threads. The numbers are averages of 10 runs of perft(5).

The linear scaling means, I think, that the global memory speed cannot be the factor limiting performance with one multiprocessor. Otherwise we would see a curve that bends downwards from the linear trend and converges to a horizontal line. But that does not happen here. Indeed, at 1.5 million nodes per second and 48 bytes per position, writing the positions requires only about 72 MB/s of bandwidth, while 177 GB/s is available. Something else is limiting the performance.

I used the compiler option to generate the PTX assembler code. It is over 15 000 lines, and about half of them are instructions (not labels or debug information). In other words, the kernel is quite long. It also has quite a few branches. I still suspect that the compiler messes things up by putting things in the local memory. I might have to write the assembler code manually, but that is really a last resort. First I will try to find compiler options that generate faster code. I will also consider other ways to arrange the computation.


Joy of Debugging

Nothing is more annoying than debugging a program that you think should work properly. After a few hours of writing functions to print out bitboards and differences between bitboards, I finally managed to get correct results for perft(). At first, capturing promotions were generated twice: first as promotions and then as captures. After I fixed that, I still had a bug in the castling rights. I got too many castling moves. Because I do not have make() and unmake() functions, I need to make sure I clear the appropriate castling rights whenever a king or a rook moves (both captures and non-captures) or a rook gets captured (including by capturing promotions). I had to go through every possible case, and that is about 900 lines of code. Why so many lines? Let me explain.
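For comparison, a common way to compress exactly this logic (not what my code does, which enumerates the cases explicitly) is a per-square mask of the castling rights that survive a move touching that square. Masking with both the from and the to square covers king moves, rook moves, rook captures, and capturing promotions in one line. A sketch, with an assumed square numbering (a1 = 0, h1 = 7, a8 = 56, h8 = 63):

```cpp
#include <cstdint>

// Castling rights as a 4-bit mask.
enum : uint8_t { WK = 1, WQ = 2, BK = 4, BQ = 8 };

// For each square, the rights that survive when a piece moves from or
// to that square. Only the king and rook home squares remove anything.
uint8_t rights_mask(int sq) {
    switch (sq) {
        case 0:  return uint8_t(~WQ);         // a1 rook
        case 4:  return uint8_t(~(WK | WQ));  // white king on e1
        case 7:  return uint8_t(~WK);         // h1 rook
        case 56: return uint8_t(~BQ);         // a8 rook
        case 60: return uint8_t(~(BK | BQ));  // black king on e8
        case 63: return uint8_t(~BK);         // h8 rook
        default: return 0xFF;                 // other squares keep all rights
    }
}

// One expression handles every case: king moves, rook moves, rook
// captures, even capturing promotions, because both squares are masked.
uint8_t update_castling(uint8_t rights, int from, int to) {
    return rights & rights_mask(from) & rights_mask(to);
}
```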

A simple move generator can be much less than 900 lines of code. But my move generator has move making and MVV-LVA move ordering embedded in it. In addition, I added a strict legality check to the move generator. The purpose is to have enough instructions between global memory writes. Because the memory speed is the limiting factor, it makes sense to do something useful while waiting for the write operations to finish. It is important to note that global memory writes do not block the execution of other instructions that do not depend on the written data.
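For reference, the MVV-LVA scoring itself is tiny; a sketch (the piece codes are assumptions, not the engine's actual encoding):

```cpp
// MVV-LVA: most valuable victim, least valuable attacker. Captures are
// scored so that, e.g., pawn takes queen is tried before queen takes
// pawn. Higher score = search the capture earlier.
enum Piece { PAWN = 0, KNIGHT, BISHOP, ROOK, QUEEN, KING };

int mvv_lva(Piece attacker, Piece victim) {
    // The victim dominates the score; the attacker breaks ties in
    // favour of the cheaper piece.
    return victim * 8 + (KING - attacker);
}
```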

Now I am planning to test the code with different numbers of threads, beginning with a single thread and slowly increasing the number. If I record the nodes per second for each thread count, I can plot a graph that shows how the algorithm scales. I hope I can do this during this week.