After experimenting with the multiprocessor occupancy calculator that ships with the CUDA SDK, I have concluded that a warp must not use more than 1 KB of shared memory. Then, other things being optimal, 100% occupancy can be achieved with at least six warps running on one multiprocessor. This is challenging, because a position takes several 8-byte bitboards, and moves take 4 bytes each. Move lists are required both for storing the current search path and for move generation.
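To get a feeling for how tight this budget is, here is a rough back-of-the-envelope sketch in plain C. Only the 1 KB limit, the 8-byte bitboards, and the 4-byte moves come from the text above. The `Position` layout with eight bitboards and the helper `moves_that_fit` are assumptions made up for illustration, not the actual data structures of this project.

```c
#include <stddef.h>
#include <stdint.h>

/* Budget from the text: at most 1 KB of shared memory per warp. */
enum { WARP_SHARED_BYTES = 1024 };

/* ASSUMPTION: eight bitboards per position (for example six piece
   types plus two colour sets). The real count may differ. */
enum { NUM_BITBOARDS = 8 };

typedef struct {
    uint64_t bb[NUM_BITBOARDS]; /* 8 bytes each */
} Position;

typedef uint32_t Move;          /* 4 bytes per move, as in the text */

/* Hypothetical helper: how many moves fit into the budget once
   num_positions positions have been stored in it? */
static inline size_t moves_that_fit(size_t budget_bytes, size_t num_positions)
{
    size_t used = num_positions * sizeof(Position);
    if (used >= budget_bytes) {
        return 0;
    }
    return (budget_bytes - used) / sizeof(Move);
}
```

Under these assumptions, `moves_that_fit(WARP_SHARED_BYTES, 1)` leaves room for 240 moves next to a single 64-byte position, and that space has to cover both the search path and the generated move lists for the whole warp.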
I do not want to use large look-up tables for move generation, because memory is so restricted. Magic bitboards, and even rotated bitboards, seem to require too much memory. I will try a Kogge-Stone move generator, which is probably the fastest way to generate moves without look-up tables. Currently, I have most of the move generator written in CUDA C, and it compiles without errors. I have not had time to write the perft() function yet, because I am quite busy at work. This GPU chess project is moving forward, but it takes time.
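For readers who have not seen the technique, here is a minimal sketch of a Kogge-Stone occluded fill for two of the eight sliding directions. It is not the code from this project. It assumes the common bitboard mapping where a1 is bit 0, h1 is bit 7, and a8 is bit 56, and the function names are my own choice for illustration. The code is plain C, so it should also work as CUDA device code if the functions are marked `__device__`.

```c
#include <stdint.h>

/* ASSUMPTION: a1 = bit 0, h1 = bit 7, a8 = bit 56. */
#define NOT_A_FILE UINT64_C(0xFEFEFEFEFEFEFEFE)

/* Flood the sliders in gen northwards through the empty squares in
   pro, in three doubling steps (1, 2 and 4 ranks) and with no
   look-up tables. */
static inline uint64_t north_fill(uint64_t gen, uint64_t pro)
{
    gen |= pro & (gen << 8);
    pro &= pro << 8;
    gen |= pro & (gen << 16);
    pro &= pro << 16;
    gen |= pro & (gen << 32);
    return gen;
}

/* Same idea eastwards. Masking out the a-file stops the fill from
   wrapping around from the h-file to the next rank. */
static inline uint64_t east_fill(uint64_t gen, uint64_t pro)
{
    pro &= NOT_A_FILE;
    gen |= pro & (gen << 1);
    pro &= pro << 1;
    gen |= pro & (gen << 2);
    pro &= pro << 2;
    gen |= pro & (gen << 4);
    return gen;
}

/* Attacks are the fill shifted one more step, so that the first
   blocker in each direction is included as a possible capture. */
static inline uint64_t north_attacks(uint64_t sliders, uint64_t empty)
{
    return north_fill(sliders, empty) << 8;
}

static inline uint64_t east_attacks(uint64_t sliders, uint64_t empty)
{
    return (east_fill(sliders, empty) << 1) & NOT_A_FILE;
}
```

For example, with a rook on a1 and a blocker on a4, `north_attacks` returns the squares a2, a3, and a4. The whole thing uses only shifts and bitwise operations on 64-bit registers, which is why it fits a memory-constrained GPU better than table-based generators.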
I found a way to parallelize the capture generation part of the Kogge-Stone generator. This will speed up the quiescence search. I think it is possible to apply the same kind of optimization to parts of the evaluation function. That way, the computing power is used where it matters most: near the leaf nodes.
I will try to write the performance test function (perft) this week, and I will write more about it when I have results.