Sometimes it is useful to look things from a different perspective. This applies to chess programming as well as other things. When programming for the CPU, you usually try to minimize the instruction count in order to optimize the performance. In GPU programming, it is often better to minimize the memory usage, because memories and caches are small, but on the other hand, basic arithmetic instructions are relatively cheap. The bottleneck is the memory speed, not the execution of the instructions. As a side product, I have learned new ways to do things. These might be useful in the future, even when optimizing a CPU chess engine.
I have implemented a prototype for move generation. It currently runs on the CPU, but it should be straightforward to port it to the GPU. The main reason for this is that it does not use any lookup tables, so it does not need shared memory and I do not have to struggle with the memory limit. The side product is that the CPU version is very cache friendly. Of course, the instruction count is larger than with the lookup table approaches, but that does not matter. The generator is bitboard-based and it uses the Kogge-Stone algorithm for sliding piece move generation. So, basically, there are lots of shift- and xor-operations and stuff like you usually see in bitboard engines.
I decided to pack the position in 48 bytes or six 64-bit variables. This is a reasonable compromise between size and instructions required for unpacking the information. It is important to keep the size small, because the positions will be written in the global memory on the GPU. Now the six variables take three 128-bit cache lines or three memory operations per position. Packing the positions tighter would not help unless I can fit everything under 256 bits resulting in only two memory operations. But that I cannot do. Well, actually, the smallest number of bytes for a position (including side to move, en passant square, castling rights, and counter for the 50-move rule) that I have found is 32 bytes or 256 bits, but it is not very practical. So I am happy with 48 bytes. The six bitboards are white pieces, pawns, knights, bishops+queens, rooks+queens, and kings. The extra information required is packed in the lowest and highest eight bits of the pawn-bitboard, because they are never used for pawns in legal positions.
The idea of writing the positions into the global memory lead to another idea. I can interleave the global memory writes with the move generation code. I do not generate move lists at all. I make the move at once when generating it and store the resulting position. Again, no shared memory is required. A position is so small that, ideally, it can fit in registers. A nice side product is that I do not need a data structure for a move, and I do not have to care about packing and unpacking information in it. This actually saves instructions, although that was not the goal here. It also saves us a few ifs, because I do not have to have branches for the special moves both the move generator and in the make() and unmake()-functions. There are no make() and unmake()-functions, so there cannot be ifs in them.
Yet another data structure, that is missing from my implementation, is the board with 64 squares telling which piece is on each of them. This is useful when determining if a piece is captured and which piece it is. But I do not need that, because my move generator will produce the captures so that it always knows which piece is captured. The capture move generator is designed so that it produces the captures in the commonly used Most Valuable Victim - Least Valuable Aggressor (MVV-LVA) order. This can be done by using the bitboards only.
I also get rid of the bit scan operations, because I do not need the square numbers to index the lookup tables. That is simply an unnecessary intermediate step, because the resulting positions consist of bitboards, and not square indices. There are relatively efficient ways to extract the square number of the least significant bit, like the de Bruijn-multiplication which requires a 64-element lookup table, but if I can live without it, even better.
I have to say I am very satisfied with the software architecture right now. Let's see if that changes when I have a chance to test the move generator performance.