I have continued testing the move generator. After some obvious optimizations, I got the speed up to 2.5M nodes per second with one multiprocessor. Then it was time to give all the processors work to do. The approach I used was to give each processor a separate stack to operate on. I noticed that performance suffers if different multiprocessors write to global memory addresses close to each other, so the stacks must be kept well apart. Each multiprocessor can then be given its own starting position to process. For example, perft(6) can be split by calling perft(5) for each child position of the initial position.
Below is a figure that shows the scaling with more than one multiprocessor. A block has 512 threads, and the number of blocks varies from 2 to 26 in the test. The line is not straight in the beginning, because not all multiprocessors have work until there are at least 15 blocks, i.e. 7.5k threads in this case.

[Figure: Scaling with several multiprocessors. Speed (Mnps) vs. number of threads.]
Further increasing the number of blocks allows more efficient hiding of the memory latency. Below is another figure that shows the scaling up to 400 blocks. Now we can see performance figures that are of the correct order of magnitude.

[Figure: Scaling with a large number of blocks. Speed (nps) vs. number of blocks.]
The maximum speed achieved is about 62M nps. I had calculated the absolute upper bound for the GTX 480 to be 175M nps if 48 bytes were written for each position. In practice, I think the compiler produces twice as many writes as necessary; I may have to change the data types used in the kernel to get the correct number of writes. With the doubled number of global memory writes, the upper bound drops to 87.5M nps. Yet another thing that slows things down is my current implementation of the control parameters, e.g. depth and node count: I keep them in a separate stack in global memory. Eventually, I want to embed the control parameters into the position stack.