Most chess engines seem to rely on a move generator that produces a list of moves. Then, it is possible to sort the moves so that the best candidates are searched first. However, move lists take memory space. That is a problem for my GPU move generator. There is 16 kB of shared memory for a thread group. If a thread group contains 256 threads, each thread has 64 bytes of memory. 32 bytes of it is reserved for the current positions, so only 32 bytes is left for moves. That is not enough for a move list. One has to think differently.
As mentioned in my previous posts, I implemented a traditional bitboard-based move generator that produces one move at a time on the GPU. Unfortunately, the performance is not so great. For comparison, I had a move generator that outputs a full move list at once. The former was four times slower than the latter. It seems that there are two reasons for that. First, the move generator is more complex, because it has to take the previous move as an input to regenerate the state of the move generator and to produce the next move. Second, it does unnecessary work, because one move needs only one destination square, although a bitboard representing all of them is generated for each move. I think it is time to revisit the design I had for the original CUDA chess move generator.
The basic idea is to have a few generic numbers represent the state of the move generator. For example: one for the source square (1-64), one for the direction (1-8) and one for the steps in that direction (1-7). Then the numbers are increased as the moves are produced. A similar idea was used in the early dedicated chess computers. And of course, one can use bit twiddling tricks to pack all the numbers in one integer. Tests are required to see how to implement the move generator most efficiently.