I have been thinking about the GPU implementation again. The biggest issue is the memory consumption. I have previously shown that a position can be packed in about 40 bytes by using bitboards, so that everything can be unpacked with a few bit operations. Then the positions can reside in the shared memory or even in the registers. But move lists are the problem. To be exact, there are two kinds of move lists, those generated by the move generation routine for each position and those that are put in the stack when recursively calling the search. I call the former move lists and latter ones move stacks.
Moves in the move stack require more information, because they are used for undoing moves and that involves restoring the position as it was before the move, including castling rights, en passant squares, and half-move counters for the 50-move rule. But for move lists, less than that is sufficient, because the original position is known. The least amount of information that is required for unambiguously defining a move in a known position is: to- and from-squares plus possible promotion. This can be fitted in two bytes. This is enough for move lists produced during the move generation. With two bytes per move and average of 40 moves per position, this means about 80 bytes per ply. Then, for example a 15-ply search takes 1200 bytes for move lists and we are out of shared memory if there are more than 40 threads. So I have to think about doing something differently.
What I have been thinking is that I write a kernel that takes a position from the global memory, generates the moves, makes them, and writes the positions back to the global memory. The kernel is launched several times, always taking the previously generated positions as input and producing output that is one ply deeper. The problem with this approach is that there are typically tens of position to be written while there is only one position to read from the memory. The global memory bandwidth on my GPU (NVIDIA GeForce GTX 480: 177 Gb/s) is sufficient for writing three positions per clock cycle. So, given the average of 40 positions per thread, this means about 13 clock cycles per thread, assuming a large number of threads are running and amortized cost calculation is valid. I think this is of the same order of magnitude that it takes to generate a move, so it could be possible to hide the latency quite well.
Eventually, I might end up using the Kogge-Stone-style move generator, because it does not require lookup tables for sliding pieces. For other pieces, the lookup tables take just a few hundred bytes of shared memory. Since the positions can be in registers during the processing, the shared memory consumption is not that important. The move generator does not produce a move list, but makes the moves directly and writes the resulting positions to the global memory. Surprisingly, this might be the fastest way to do it.
I still have to find a way to implement the search, because this architecture is fine for perft(), but not for alpha-beta search as such.