Global memory accesses are slow. Using shared memory instead can yield up to two orders of magnitude of speedup. However, shared memory is limited in size. These facts lead us to a design principle: use shared memory instead of global memory, even at the expense of extra operations. For example, it does not matter if we execute three times as many instructions, as long as we avoid memory operations that are a hundred times slower. So let us see how this can be done.
For each thread, we would like to fit the stack of search function parameters and the current position into shared memory. The function parameters include the remaining depth, the alpha value, the move information, and a pointer for returning to the previous depth. These parameters are, of course, implementation specific. To save space, we do not keep the positions in the stack, only the moves. In addition, the beta value is not needed in a zero-window search, because it is always alpha + 1. Even alpha need not be in the stack, because it does not change during the search; if it changed, that would be a fail-high.
Different representations of the position take different amounts of memory. Let us compare three of them (without additional flags, etc.): the simple 8x8 board, the 0x88 board, and bitboards. In its simplest form, the 8x8 board takes 64 × 4 bits (enough to encode the pieces), which is 256 bits or 32 bytes. It might be necessary to complement this with a piece list for efficiency reasons. The 0x88 board takes double the space of the simple 8x8 board, or four times as much if we reserve one byte per square. Finally, bitboards take 64 bits each, and we need 6-12 of them (white, pawns, knights, bishops, rooks, kings, and possibly black, queens, plus four rotated bitboards). That is 384-768 bits or 48-96 bytes. Perhaps surprisingly, the simplest board representation seems the most attractive one, because it is the smallest.
Moves can be packed into a few bytes. The source and destination squares take 6 bits each. Then, for undoing moves, it is necessary to store the captured piece, the en passant square, the castling rights, and the promoted piece. These take 4, 6, 2, and 4 bits, respectively. In total this is 28 bits, so in practice a move takes 32 bits or 4 bytes.
Fitting the stacks into shared memory is going to limit either the depth of the stack or the number of threads. The number of threads should not be too small, because that would hurt performance. An initial guess is 64 threads per multiprocessor; this can be revised once the implementation is tested. The depth is limited to 32 ply. In practice, this should be enough, although the quiescence search could theoretically go deeper.
The following table summarizes the total memory consumption so far (assuming 6-byte stack entries: a 4-byte move plus one byte each for the depth and the return link):

| Item      | Amount               | Total  |
|-----------|----------------------|--------|
| Positions | 64 × 32 bytes        | 2 kB   |
| Stacks    | 64 × 32 × 6 bytes    | 12 kB  |
| **Total** |                      | 14 kB  |
For each multiprocessor, we probably also need a set of lookup tables for the move generator. Because the rest already takes about 14 kB, and on many devices the upper limit for shared memory is 16 kB, only about 2 kB remain. That is not enough for any 64 × 64 tables. Instead, it is possible to use only small arrays of offsets for the directions in which each piece moves.
I will refine the plan when I go into the details of move generation.