Chess on a GPGPU: Memory vs. speed

The trade-off between memory and speed is quite common in many computer algorithms. It is often possible to use more memory to gain speed. On the other hand, sometimes the memory space is limited and one must be content with the memory that is available. On the GPU, the situation is complex, because there are different types of memory. There is plenty of global memory, but it is slow. And there is very little on-chip memory, called shared memory, that is fast. Then there are other memory types that might be cached. But the shared memory is the most preffered choise for the move generator at least.

There is either 16 kB or 48 kB of shared memory. On some GPUs, it can be selected, whether there is 16 or 48 kB of shared memory, with more or less registers per thread, respectively. To play it safe, we try to fit the core of the algorithms in 16 kB. Because every thread must have the position they are working on in the shared memory, it makes sense to squeeze the position in as small number of bits as possible without degrading the performance of the algorithms too much. In the end, it will be a compromise between memory and speed, and testing is required to find the best solution.

So how many bits are required for unambiguously describing a chess position? There are 32 pieces at most, and they can be in 64 (= 6th power of 2) different squares. There are exceptions like pawns that can only be on 48 different squares, but we do not go in to such details here. In total, the piece information takes 32 x 6 bits = 192 bits or 24 bytes. This is not enough, however, because we cannot tell the piece types. Initially it is possible to fix the pieces to certain indices in the piece array, e.g. white queen is always at index 4. But the piece types change due to promotions and there might be captures that remove the pieces off the board. So we must keep track of the piece types. For this purpose 3 more bits are required per piece. Notice that it is not required to store the piece color, because we can set the array so that first 16 indices in the array are for white pieces and the next 16 indices for black pieces. Eventually, we have 32 x 9 bit = 288 bits in total. In addition, we need 4 bits to store the castling rights, 1 bit to store whose turn it is, and 4 bits for the en passant (EP) square (4 bits is enough, because we know the rank, only the file must be determined with 3 bits; and one extra bit is required to tell if the EP was valid). This means the total number of bits is 297 which is 37 bytes and a bit. In a practical implementation, we might end up using 16-bits per piece, because 9 bits do not align well with the byte boundaries. This leads to 32 x 16 bits = 512 bits or 64 bytes. The additional flags can be hidden in the empty high bits.

How does this compare with the bitboards? I take chess engine Crafty as an example. It's source code is publicly available. It uses 7 bitboards per side for the board, plus some extra information that we can ignore in this context. These 14 bitboards take, of course, 64 bits each. In total this is 14 x 64 bits = 896 bits which is 112 bytes. It is possible to use less bitboards if we are willing to abandon rotated bitboards. The absolute minimum number, I think, is 5 bitboards (white, pawns, knights, bishop-like pieces, rook-like pieces) plus two variables to tell where the kings are (I would not waste bitboards for that). That takes 5 x 64 bits + 16 bits = 336 bits or 42 bytes. This sounds good to me. Although it is more than 37 bytes, it is certainly less than 64 bytes. The downside is that we do not have rotated bitboards.

Now, let me confirm that we can get all pieces (except kings) out of the five bitboards I mentioned:

wP = w & P;
wN = w & N;
wQ = w & B & R;
wB = w & B & ~wQ;
wR = w & R & ~wQ;
bP = ~w & P;
bN = ~w & N;
bQ = ~w & B & R;
bB = ~w & B & ~bQ;
bR = ~w & R & ~bQ;

where w = white, b = black, P = pawn, N = knight, B = bishop (edit: includes queens), R = rook (edit: includes queens), and Q = queen.

So how much space does this take for several threads? Let us assume 64 threads. Then 64 x 42 bytes = 2688 bytes or about 2.6 kB. That is very nice. It leaves plenty of space for the move stacks. The move data structure must contain all the data required to undo the move. That includes 6 bits for source and destination squares, 3 bits for the captures piece, 2 bits for castling rights, 4 bits for EP square, and 2 bits for promotion piece. This is 23 bits. So strictly only 3 bytes are required. Because of the data alignment, it might be better to use 4 bytes per move which lets us position the flags more conveniently. So for 64 threads, the stack takes 64 x 4 bytes = 256 bytes per ply. In theory, the quiescence search can be as long as 30 ply. Adding this to the end of a regular search of 20 ply means a 50 ply long sequence. This takes 50 x 256 bytes = 12800 bytes or 12.5 kB. We are still under 16 kB (2.6 kB + 12.5 kB = 15.1 kB).

I will now seriously consider re-writing the move generator for the bitboards. The only questions is whether I can cope without the rotated bitboards or not. And thanks for Error323 for your comments about using bitboards. Those made the re-think my decisions about the data structures. This is one of the reasons I write the blog. I get other opinions that can improve the outcome of this project.

2 comments:

Error3232/06/2013 4:59 PM
Won't you need a separate bitboard for the queen itself? I don't see how "wQ = w & B & R;" becomes the queen bitboard. Could you clarify?

As for rotated bitboards. You could try using magics instead for all the sliding pieces (R,B,Q). This will require 64*4096*2*64 bit (4 MByte), which is a lot. Not sure if that's a good idea.

See: http://chessprogramming.wikispaces.com/Magic+Bitboards

Ps. Are you hosting the code somewhere public?
Samuel2/06/2013 6:15 PM
Yes, I can clarify. I forgot to explain the B and R properly. They contain bishop-like pieces and rook-like pieces, not only bishops and rooks. Queens belong to both categories, bishop-like pieces and rook-like pieces. Then the queens are calculated as an intersection of those two sets (= B & R).

I thought about the bitboard-implementation more today. Currently the problem is fitting the move list in the shared memory if I generate all the moves at once. If I cannot find a solution to this problem, I might have to abandon bitboard again.

I do not have the code public yet. When I have it working properly, I might publish it.

2/06/2013

Memory vs. speed

2 comments: