I finally got the move generator working properly on the GPU, i.e. producing correct node counts. At least on one multiprocessor; I still have trouble getting it to work on multiple multiprocessors. But the good news is that with one multiprocessor the scaling with the number of threads is almost perfectly linear. Look at the picture below. It shows the perft(5) speed with different numbers of threads (192-576). With larger numbers of threads, something breaks and the performance decreases dramatically. I think this happens because the compiler runs out of registers and puts variables in local memory, which is painfully slow.
Figure: Speed (million nodes per second) with varying numbers of threads. The numbers are averages of 10 runs of perft(5).
The linear scaling means, I think, that global memory speed cannot be the limiting factor for performance with one multiprocessor. Otherwise we would see a curve that bends away from the linear trend and flattens out towards a horizontal line. But that does not happen here. Indeed, at 1.5 million nodes per second and 48 bytes per position, writing the positions requires only about 72 MB/s of bandwidth, while 177 GB/s is available. Something else is limiting the performance.
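As a quick sanity check of that arithmetic, here is a minimal Python sketch. The function and variable names are only for illustration; the numbers are the ones quoted above, with MB and GB taken as 10^6 and 10^9 bytes. It assumes each position is written to global memory exactly once and ignores reads, so it is a rough lower bound on the traffic, not a measurement.

```python
def required_bandwidth(nodes_per_second, bytes_per_position):
    """Bytes per second needed to write every generated position once."""
    return nodes_per_second * bytes_per_position


nodes_per_second = 1.5e6   # perft speed quoted in the text
bytes_per_position = 48    # size of one stored position
available = 177e9          # quoted global memory bandwidth, bytes per second

needed = required_bandwidth(nodes_per_second, bytes_per_position)
fraction = needed / available

print(f"needed: {needed / 1e6:.0f} MB/s")
print(f"fraction of available bandwidth: {fraction:.4%}")
```

Even if the real traffic were many times higher than this estimate, it would still be a tiny fraction of what the memory bus can deliver, which fits the linear scaling seen in the figure.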
I used the compiler option that generates PTX assembly code. The output is over 15 000 lines, and about half of them are instructions (not labels or debug information). In other words, the kernel is quite long. It also has quite a lot of branches. But I still suspect that the compiler messes things up by putting things in local memory. I might have to write the assembly code manually, but that is really a last resort. First I will try to find compiler options that generate faster code. I will also consider other ways to arrange the computation.
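Counting things in a 15 000 line PTX file by hand is tedious, so a small script helps. The following is a rough Python sketch, not a real PTX parser: it classifies lines with simple patterns, and the function name `summarize_ptx` and the embedded sample are made up for illustration. One caveat is that it only sees local-memory accesses that are already explicit in the PTX as `ld.local` / `st.local`. PTX uses virtual registers, so spills introduced later by the assembler stage would not show up here; for those, the assembler's verbose output (`nvcc -Xptxas -v`) is the place to look.

```python
import re
from collections import Counter

# Heuristic patterns; an optional predicate such as "@%p1" or "@!%p1" may
# precede an instruction.
LABEL = re.compile(r"^[$\w]+:$")
LOCAL_ACCESS = re.compile(r"^(@!?%\w+\s+)?(ld|st)\.local\b")
BRANCH = re.compile(r"^(@!?%\w+\s+)?bra\b")


def summarize_ptx(text):
    """Count labels, directives, instructions, branches and local accesses."""
    counts = Counter()
    for raw in text.splitlines():
        line = raw.split("//", 1)[0].strip()  # drop line comments
        if not line:
            continue
        if LABEL.match(line):
            counts["labels"] += 1
        elif line.startswith(".") or line in ("{", "}", "(", ")"):
            counts["directives"] += 1  # .reg, .loc, .param, braces, ...
        else:
            counts["instructions"] += 1
            if LOCAL_ACCESS.match(line):
                counts["local_accesses"] += 1
            if BRANCH.match(line):
                counts["branches"] += 1
    return counts


# Hypothetical PTX fragment, only to exercise the function.
SAMPLE = """
.version 7.0
.target sm_52
.visible .entry demo(
    .param .u64 demo_param_0
)
{
    .local .align 4 .b8 __local_depot0[64];
    .reg .b32 %r<4>;
    .reg .pred %p<2>;
    ld.param.u64 %rd1, [demo_param_0];
    st.local.u32 [%rd2], %r1;
    ld.local.u32 %r2, [%rd2];
    setp.eq.s32 %p1, %r2, 0;
    @%p1 bra $L__BB0_2;
    add.s32 %r3, %r2, 1;
$L__BB0_2:
    ret;
}
"""

print(dict(summarize_ptx(SAMPLE)))
```

To try it on real compiler output, read the generated file into a string and pass it to `summarize_ptx`. A large `local_accesses` count relative to `instructions` would support the local-memory suspicion, while a count near zero would point towards spills made by the assembler or towards something else entirely.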