Global memory accesses are slow. Using shared memory instead can yield up to two orders of magnitude of speedup. However, shared memory is limited in size. These facts lead us to a design principle: use shared memory instead of global memory, even at the expense of extra operations. For example, it does not matter if we execute three times as many instructions, as long as we avoid memory operations that are a hundred times slower. So let us see how this can be done.
For each thread, we would like to fit the stack of search function parameters and the current position into shared memory. The function parameters include the remaining depth, the alpha value, the move information, and a pointer for returning to the previous depth. These parameters are, of course, implementation specific. To save space, we do not keep the positions in the stack, only the moves. In addition, the beta value is not needed in a zero-window search, because it is always alpha + 1. Even alpha need not be in the stack, because it does not change during the search; if it changed, that would be a fail-high.
Different representations of the position take different amounts of memory. Let us compare three of them (without additional flags, etc.): the simple 8x8 board, the 0x88 board, and bitboards. In its simplest form, the 8x8 board takes 64 × 4 bits (enough to encode the pieces), which is 256 bits or 32 bytes. It might be necessary to complement this with a piece list for efficiency reasons. The 0x88 board takes double the space of the simple 8x8 board, or four times as much if we reserve one byte per square. Finally, bitboards take 64 bits each, and we need 6-12 of them (white, pawns, knights, bishops, rooks, kings, and possibly black, queens, plus four rotated bitboards). That is 384-768 bits or 48-96 bytes. Perhaps surprisingly, the simplest board representation seems the most attractive one, because it is the smallest.
Moves can be packed into a few bytes. The source and destination squares take 6 bits each. Then, for undoing moves, it is necessary to store the captured piece, the en passant square, the castling rights, and the promoted piece. These take 4, 6, 2, and 4 bits, respectively. In total this is 28 bits, so in practice a move takes 32 bits or 4 bytes.
Fitting the stacks into shared memory is going to limit either the depth of the stack or the number of threads. The number of threads should not be too small, because that would hurt performance. An initial guess is 64 threads per multiprocessor; this can be revised once the implementation is tested. The depth is limited to 32 ply. In practice, this should be enough, although the quiescence search could theoretically go deeper.
The following table summarizes the total memory consumption so far (assuming 6-byte stack entries: a 4-byte move plus one byte each for the depth and the return link):

| Item      | Amount               | Total  |
|-----------|----------------------|--------|
| Positions | 64 × 32 bytes        | 2 kB   |
| Stacks    | 64 × 32 × 6 bytes    | 12 kB  |
| **Total** |                      | 14 kB  |
For each multiprocessor, we probably also need a set of lookup tables for the move generator. Because the rest already takes about 14 kB, and on many devices the upper limit for shared memory is 16 kB, only about 2 kB remain. That is not enough for any 64 × 64 tables. Instead, it is possible to use only small arrays of offsets for the directions in which each piece moves.
I will refine the plan when I go into the details of move generation.