I did it. It was not so difficult after all. Because it was obvious that the global memory writes were hurting performance, I tried to figure out how to reduce their number. First, I noticed that I could not get rid of the six store operations for the six bitboards per position, because there are no instructions to write more than 64 bits at once. That gives an upper bound of 87.5M nps. I was not happy with that. Then I thought about ways to avoid putting the positions in global memory. Because shared memory is limited, I could not use it for the whole position stack. But what about the last ply? Well, you can guess...
So after eliminating the global memory writes for the last ply, I got a huge performance benefit. Because only the second-to-last ply is written to global memory, there are surely enough instructions (handling the last ply) to hide the memory latency. This also means that I can get by with fewer threads, as the performance converges towards the upper bound much faster with high instruction-level parallelism. Below are some initial results. I used 128 threads per block, because I found that to be the optimal number for the current version of the move generator.

Figure: Performance in Mnps vs. number of blocks.
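The idea of keeping the last ply out of the position stack can be sketched on the host in plain C++. This is only an illustration under loud assumptions, not the actual kernel: `make_child` is a hypothetical stand-in for a real move generator, the branching factor of 3 is a toy value, the `std::vector` plays the role of the position stack in global memory, and the local variable `child` plays the role of registers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Six 64-bit bitboards per position, as in the post. Their meaning is not
// given in the post, so they are left anonymous here.
struct Position {
    std::uint64_t bb[6];
};

constexpr int kBranching = 3;  // toy branching factor (assumption, not chess)

// Hypothetical stand-in for a move generator: derives child number `move`
// from `parent` by mixing the bitboards. Unsigned overflow wraps by design.
Position make_child(const Position& parent, int move) {
    Position child = parent;
    for (int i = 0; i < 6; ++i) {
        child.bb[i] = child.bb[i] * 6364136223846793005ULL
                    + static_cast<std::uint64_t>(move + i + 1);
    }
    return child;
}

struct Counters {
    std::uint64_t leaves = 0;        // positions reached at the last ply
    std::uint64_t stack_stores = 0;  // 64-bit stores into the position stack
};

// `stack` stands in for the position stack in global memory.
void search(std::vector<Position>& stack, int ply, int depth,
            bool keep_last_ply_local, Counters& c) {
    for (int m = 0; m < kBranching; ++m) {
        // The child is first built in a local variable.
        Position child = make_child(stack[static_cast<std::size_t>(ply)], m);
        const bool last = (ply + 1 == depth);
        if (last && keep_last_ply_local) {
            ++c.leaves;  // counted without ever touching the stack
            continue;
        }
        stack[static_cast<std::size_t>(ply) + 1] = child;  // six 64-bit stores
        c.stack_stores += 6;
        if (last) {
            ++c.leaves;
        } else {
            search(stack, ply + 1, depth, keep_last_ply_local, c);
        }
    }
}

// Runs the toy search to `depth` plies and returns the counters.
Counters run(int depth, bool keep_last_ply_local) {
    assert(depth >= 1);
    std::vector<Position> stack(static_cast<std::size_t>(depth) + 1,
                                Position{{1, 2, 3, 4, 5, 6}});
    Counters c;
    search(stack, 0, depth, keep_last_ply_local, c);
    return c;
}
```

With a depth of 4, both `run(4, false)` and `run(4, true)` count 81 leaf positions, but the number of stack stores drops from 720 to 234, because the last ply holds most of the positions in the tree.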
The line in the plot is not smooth at the beginning because, with different numbers of blocks, the multiprocessors get different numbers of blocks to process. Multiples of 60 blocks yield the best performance. That is four times the number of multiprocessors (15) on the GTX 480. I am not sure why this is good.
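A rough way to see the unevenness is a simplified load-balance model. It assumes that all blocks take equally long and that blocks are spread over the 15 multiprocessors as evenly as possible, which the real hardware scheduler does not guarantee. The function names are made up for this sketch. Note that the model only favours multiples of 15, so it does not explain why 60 in particular is best.

```cpp
#include <cassert>

constexpr int kMultiprocessors = 15;  // GTX 480, as stated in the post

// Largest number of blocks any single multiprocessor receives, assuming
// the blocks are spread as evenly as possible (simplifying assumption).
int max_blocks_per_sm(int blocks) {
    assert(blocks >= 0);
    return (blocks + kMultiprocessors - 1) / kMultiprocessors;  // ceiling
}

// Fraction of the machine kept busy if the run lasts as long as the
// busiest multiprocessor: blocks / (15 * max blocks per multiprocessor).
double balance(int blocks) {
    assert(blocks > 0);
    return static_cast<double>(blocks) /
           (kMultiprocessors * max_blocks_per_sm(blocks));
}
```

In this model 60 blocks give a balance of 1.0, while 61 blocks give about 0.81 because one multiprocessor has to process a fifth block. The penalty for an extra block shrinks as the block count grows (601 blocks give about 0.98), which is consistent with the curve being rough mainly at the beginning.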
But now that I have a working and efficient move generator, I can start to think about the next big problem, which is the parallel search.