I have written the basic move generation routines with DirectX 11 compute shaders now. I use bitboards and small look-up tables that contain moves for each piece-square pair and ray tables for sliding pieces. Those tables take about 6 kB. I initially put them in the shared memory, but then began to think, that since the tables are small, they could fit in the L1 cache which could be even faster. Tests proved that this was the case. Shared memory is slower than the cache, so I moved the tables to texture memory. This frees space in the shared memory that can be used for other purposes.
There is not enough space to store generated move lists for each position. It seems that I have to use over 32 bits per move, because there must be enough information to undo the move and account for special moves like promotion. Restoring the flags for a position takes 16 bits. Then source and destination squares take 6 bits each. Moving piece, captured piece, and promoted piece take 3 bits each. These are 37 bits already. I might as well reserve 64 bits per move, because 32 bits is the smallest natural element size on many GPU architectures. Then it is possible to include some extra information in the move data structures.
My idea is to store only the moves that are actually made. So the move generation must be able to produce only one move at a time, and deduce the next move to generate from the previous move. This requires some extra work per move compared to the generation of the whole move list at once. However, the extra work pays off, because the GPU can be better utilized. The move order must also be static. I think this is not a serious issue, because this approach is used near the leaf nodes. Higher in the tree we can use more sophisticated move ordering algorithms and global memory.
The shared memory will contain the move stacks and the current position for each group of thread working on the same position. In addition, there will be information about the split points. Split points mean points in the search tree, where the work is distributed between two or more groups of threads. It is not necessary to store the positions at the split points, but only the last move generated and search bounds. There will probably be a separate list of split points. An alternative is to interleave the split points in the move stacks. I will make more concrete plans, when I get the move generation part working fast enough.