Thinking differently

Most chess engines seem to rely on a move generator that produces a full list of moves. Then it is possible to sort the moves so that the best candidates are searched first. However, move lists take memory space, and that is a problem for my GPU move generator. There are 16 kB of shared memory per thread group. If a thread group contains 256 threads, each thread has 64 bytes of memory. Of that, 32 bytes are reserved for the current position, so only 32 bytes are left for moves. That is not enough for a move list. One has to think differently.

As mentioned in my previous posts, I implemented a traditional bitboard-based move generator that produces one move at a time on the GPU. Unfortunately, the performance is not so great. For comparison, I had a move generator that outputs a full move list at once. The former was four times slower than the latter. It seems that there are two reasons for that. First, the move generator is more complex, because it has to take the previous move as an input to regenerate the state of the move generator and to produce the next move. Second, it does unnecessary work, because one move needs only one destination square, although a bitboard representing all of them is generated for each move. I think it is time to revisit the design I had for the original CUDA chess move generator.

The basic idea is to have a few generic numbers represent the state of the move generator. For example: one for the source square (1-64), one for the direction (1-8) and one for the steps in that direction (1-7). Then the numbers are increased as the moves are produced. A similar idea was used in the early dedicated chess computers. And of course, one can use bit twiddling tricks to pack all the numbers in one integer. Tests are required to see how to implement the move generator most efficiently.
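As a sketch, the state triple could be packed into one integer and advanced like an odometer as moves are produced. This is C rather than shader code, and the field widths and names are my own illustrative assumptions, not a final design:

```c
/* Sketch: pack the move-generator state (source square, direction, step)
 * into one integer. 6 bits for the square (0-63), 3 bits for the
 * direction (0-7), 3 bits for the step (0-6). */
#include <stdint.h>

#define STEP_BITS 3
#define DIR_BITS  3
#define DIR_SHIFT STEP_BITS
#define SQ_SHIFT  (STEP_BITS + DIR_BITS)

static inline uint32_t pack_state(uint32_t sq, uint32_t dir, uint32_t step)
{
    return (sq << SQ_SHIFT) | (dir << DIR_SHIFT) | step;
}

static inline uint32_t state_sq(uint32_t s)   { return s >> SQ_SHIFT; }
static inline uint32_t state_dir(uint32_t s)  { return (s >> DIR_SHIFT) & ((1u << DIR_BITS) - 1); }
static inline uint32_t state_step(uint32_t s) { return s & ((1u << STEP_BITS) - 1); }

/* Advance to the next (square, direction, step) triple like an odometer:
 * steps 0-6, directions 0-7, squares 0-63. */
static inline uint32_t next_state(uint32_t s)
{
    uint32_t sq = state_sq(s), dir = state_dir(s), step = state_step(s);
    if (++step > 6) {
        step = 0;
        if (++dir > 7) { dir = 0; ++sq; }
    }
    return pack_state(sq, dir, step);
}
```

Incrementing the packed integer directly would even work for the step field alone; the helper above just makes the carry between fields explicit.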


Refining the plan

During the last two weeks, I have been thinking about ways to better utilize the GPU in the move generator. There are two problems: divergence and insufficient processor utilization. These problems are opposites in a way: a large number of threads handling different positions produces divergent execution paths, but on the other hand it is difficult to get good processor utilization with a small number of threads. In addition, one has to take into account that more threads use more shared memory, at least in my current plan.

There are basically four kinds of move generators: sliding pieces, pawns, kings, and knights. It seems to me that trying to write a generic move generator for all pieces, in an attempt to avoid divergence, wastes processor time. For example, the knight move generator requires only a simple table look-up. The sliding-piece generator is a little more complicated, because it must mask blocked paths and may output several destination squares per direction. Pawns capture in different directions than they move, and one has to take into account double pawn moves, en passant, and promotions. The king move generator requires complicated handling of castling (e.g. the king cannot move over a square where it would be in check). I have decided to split the move generator into four parts, so that I can optimize each part separately. The problem then is that I need lots of positions, so that each move generator has something to do.
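To illustrate how cheap the knight case is, here is a C sketch of the table look-up, assuming a little-endian square numbering (a1 = 0). The table construction is included only for completeness; on the GPU the table would be precomputed and read-only:

```c
/* Sketch of the knight generator's single table look-up. The square
 * numbering (a1 = 0, h8 = 63) is an assumption for illustration. */
#include <stdint.h>

static uint64_t knight_attacks[64];

void init_knight_table(void)
{
    const int df[8] = { 1,  2,  2,  1, -1, -2, -2, -1 };
    const int dr[8] = { 2,  1, -1, -2, -2, -1,  1,  2 };
    for (int sq = 0; sq < 64; ++sq) {
        int f = sq & 7, r = sq >> 3;
        uint64_t bb = 0;
        for (int i = 0; i < 8; ++i) {
            int nf = f + df[i], nr = r + dr[i];
            if (nf >= 0 && nf < 8 && nr >= 0 && nr < 8)
                bb |= 1ULL << (nr * 8 + nf);
        }
        knight_attacks[sq] = bb;
    }
}

/* All pseudo-legal knight moves from 'sq': one look-up and one mask. */
static inline uint64_t knight_moves(int sq, uint64_t own_pieces)
{
    return knight_attacks[sq] & ~own_pieces;
}
```

Compared with this, a generic generator that also handles ray blocking, pawn directions, and castling would drag the knight case through a lot of dead code.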

I will attempt to use 256 positions per thread group (with 256 threads as well). Then there is a chance of having enough threads to fully utilize the processors. Each of the four move generators can use 64 threads to handle a different group of positions. Divergence is not a problem when the threads in the same move generator belong to the same wave (or warp, in CUDA terminology). If one move generator runs out of work, the threads in its wave can help with another move generator. The worst-case scenario is 253 positions for one move generator and only one position for each of the other three. Then the three generators with only one position waste 63 threads each, while the one with 253 positions must run four times, if the others finish too late to help it. I think this is quite rare, but I still have to think of something clever to avoid idle threads.

Shared memory usage will obviously be one of the major limiting factors. With 256 positions to store, we cannot afford 8 bitboards per position, because that would take 16 kB, which is our entire shared memory budget. I think the best solution is to use only 4 bitboards (8 kB in total). As I explained in an earlier post, it is possible to pack the flag bits into the 4 bitboards. However, because the packing and unpacking are computationally expensive, I came up with a plan where the flag bits are not stored in the positions at all. They have to be stored in the move data structures anyway, because one has to be able to undo a move and restore the flag bits, so why waste extra space elsewhere? This means that there always has to be a move associated with a position, but that is the case anyway, because we are generating one move at a time, and the move generator takes the previous move as an input parameter.
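For concreteness, here is one way a 32-byte, 4-bitboard position could look: a quad-bitboard layout where each square's 4-bit piece code is spread across the four boards. This particular layout is my assumption for illustration, not necessarily the packing from the earlier post:

```c
/* Quad-bitboard sketch of a 4-bitboard position (32 bytes each, so
 * 8 kB for 256 positions). Bit i of bb[k] holds bit k of the piece
 * code on square i; code 0 means an empty square. */
#include <stdint.h>

typedef struct {
    uint64_t bb[4];
} Position;

static inline void set_piece(Position *p, int sq, unsigned code) /* code: 0-15 */
{
    uint64_t bit = 1ULL << sq;
    for (int k = 0; k < 4; ++k) {
        if (code & (1u << k)) p->bb[k] |= bit;
        else                  p->bb[k] &= ~bit;
    }
}

static inline unsigned piece_at(const Position *p, int sq)
{
    unsigned code = 0;
    for (int k = 0; k < 4; ++k)
        code |= (unsigned)((p->bb[k] >> sq) & 1) << k;
    return code;
}
```

With 16 piece codes there is room for the 12 piece types plus empty, but not for castling or en passant flags, which is exactly why those flags must live in the move data structure instead.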

The rest of the shared memory can be used for storing move stacks and split points. There is no room for full move stacks: one move takes 8 bytes, so even a single move per position (256 moves) takes 2 kB. I plan to keep only the last three moves per position in shared memory (6 kB in total) and store the rest in global memory. The three moves are stored in a ring buffer that acts as a cache. I hope that the latency of the global memory accesses can be hidden by the work done in the last 2 plies. I will try to implement this in the next few weeks.
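A minimal C sketch of that three-entry ring buffer, with a plain array standing in for the per-position stack in global memory (the names and the spill interface are illustrative):

```c
/* Three-move ring buffer per position; moves deeper than three plies
 * are spilled to 'global_stack', which stands in for global memory. */
#include <stdint.h>

#define RING_SIZE 3

typedef struct {
    uint64_t ring[RING_SIZE]; /* last three moves */
    int top;                  /* total number of moves pushed (the ply) */
} MoveStack;

static void push_move(MoveStack *s, uint64_t move, uint64_t *global_stack)
{
    int slot = s->top % RING_SIZE;
    if (s->top >= RING_SIZE)
        global_stack[s->top - RING_SIZE] = s->ring[slot]; /* spill the oldest */
    s->ring[slot] = move;
    s->top++;
}

static uint64_t pop_move(MoveStack *s, const uint64_t *global_stack)
{
    s->top--;
    int slot = s->top % RING_SIZE;
    uint64_t move = s->ring[slot];
    if (s->top >= RING_SIZE)
        s->ring[slot] = global_stack[s->top - RING_SIZE]; /* refill from global */
    return move;
}
```

The point of the scheme is that a push only touches global memory when descending past three plies, and a pop only refills from global memory on the way back up, so the traffic overlaps with the search work done in the shallow plies.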


Compute shader move generation progressing

I have now written the basic move generation routines with DirectX 11 compute shaders. I use bitboards and small look-up tables that contain the moves for each piece-square pair, plus ray tables for sliding pieces. Those tables take about 6 kB. I initially put them in shared memory, but then realized that, since the tables are small, they might fit in the L1 cache, which could be even faster. Tests proved that this was the case: shared memory turned out to be slower than the cache, so I moved the tables to texture memory. This frees space in shared memory that can be used for other purposes.

There is not enough space to store a generated move list for each position. It seems that I have to use more than 32 bits per move, because there must be enough information to undo the move and to account for special moves like promotion. Restoring the flags for a position takes 16 bits. The source and destination squares take 6 bits each. The moving piece, captured piece, and promoted piece take 3 bits each. That is already 37 bits. I might as well reserve 64 bits per move, because 32 bits is the smallest natural element size on many GPU architectures. Then it is possible to include some extra information in the move data structure.
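The 37-bit layout above could be packed into a 64-bit move along these lines. Only the field widths (16 flag bits, 6 + 6 square bits, 3 bits per piece) come from the text; the field order and names are my assumptions:

```c
/* Sketch of a 64-bit move encoding: 37 bits of move data, 27 bits free
 * for extra information (e.g. move-generator state). */
#include <stdint.h>

enum {
    MOVE_TO_SHIFT    = 0,   /* 6 bits: destination square */
    MOVE_FROM_SHIFT  = 6,   /* 6 bits: source square */
    MOVE_PIECE_SHIFT = 12,  /* 3 bits: moving piece */
    MOVE_CAPT_SHIFT  = 15,  /* 3 bits: captured piece */
    MOVE_PROMO_SHIFT = 18,  /* 3 bits: promoted piece */
    MOVE_FLAGS_SHIFT = 21   /* 16 bits: castling rights, en passant, ... */
};

static inline uint64_t make_move(unsigned from, unsigned to, unsigned piece,
                                 unsigned captured, unsigned promoted,
                                 unsigned flags)
{
    return ((uint64_t)to       << MOVE_TO_SHIFT)    |
           ((uint64_t)from     << MOVE_FROM_SHIFT)  |
           ((uint64_t)piece    << MOVE_PIECE_SHIFT) |
           ((uint64_t)captured << MOVE_CAPT_SHIFT)  |
           ((uint64_t)promoted << MOVE_PROMO_SHIFT) |
           ((uint64_t)flags    << MOVE_FLAGS_SHIFT);
}

static inline unsigned move_from(uint64_t m)  { return (m >> MOVE_FROM_SHIFT) & 63; }
static inline unsigned move_to(uint64_t m)    { return (m >> MOVE_TO_SHIFT) & 63; }
static inline unsigned move_flags(uint64_t m) { return (m >> MOVE_FLAGS_SHIFT) & 0xFFFF; }
```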

My idea is to store only the moves that are actually made. The move generator must therefore produce one move at a time and deduce the next move to generate from the previous one. This requires some extra work per move compared to generating the whole move list at once. However, the extra work pays off, because the GPU can be better utilized. The move order must also be static. I think this is not a serious issue, because this approach is used near the leaf nodes. Higher in the tree we can use more sophisticated move ordering algorithms and global memory.

The shared memory will contain the move stacks and the current position for each group of threads working on the same position. In addition, there will be information about the split points. Split points are points in the search tree where the work is distributed between two or more groups of threads. It is not necessary to store the positions at the split points, only the last move generated and the search bounds. There will probably be a separate list of split points; an alternative is to interleave the split points with the move stacks. I will make more concrete plans when I get the move generation part working fast enough.