I did some tests with compute shaders. I was pleasantly surprised that even the very modest GPU in my laptop (AMD Radeon 7500G) can do over 60 billion SIMD operations per second. That is more than most multicore CPUs can achieve even today. For example, eight cores running at 4 GHz can achieve only 32 billion operations per second, assuming one operation per core per clock cycle. It is true that on the CPU these can be 128-bit SSE vector operations, but the GPU can do similar operations as well.
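The comparison above is simple arithmetic, and a small sketch makes the hidden assumption visible. The one-operation-per-cycle figure is my simplification for the CPU side, not a measured value, and the variable names are only illustrative.

```python
# Back-of-the-envelope comparison of peak operation rates.
CORES = 8
CLOCK_HZ = 4_000_000_000        # 4 GHz
OPS_PER_CYCLE = 1               # assumption: one operation per core per cycle

cpu_ops_per_second = CORES * CLOCK_HZ * OPS_PER_CYCLE
gpu_ops_per_second = 60_000_000_000   # "over 60 billion" measured on the laptop GPU

ratio = gpu_ops_per_second / cpu_ops_per_second

print(f"CPU: {cpu_ops_per_second / 1e9:.0f} billion ops/s")
print(f"GPU: {gpu_ops_per_second / 1e9:.0f} billion ops/s")
print(f"GPU/CPU ratio: {ratio:.3f}")
```

Changing OPS_PER_CYCLE shows how quickly the picture shifts if the CPU retires more than one operation per cycle.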
I implemented the position packing and unpacking tests (see the previous blog post) with a compute shader. The idea of packing is to fit all the information required for a chess position into 4 bitboards of 64 bits each when storing it in global memory. The position is unpacked when it is needed in the computations. With flag bits (turn, castling rights, etc.) included, the GPU could do 178 million pack-unpack pairs per second. Without flag bits (pieces only), the figure was 2.85 billion pack-unpack pairs per second. They take about 500 and 20 instructions, respectively. Let's say that an unpacked position takes 50% more space. That means one extra global memory fetch for reading a position and another for writing it. Global memory accesses can take hundreds of clock cycles, so packing may or may not be worthwhile, depending on the GPU.
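To make the pieces-only case concrete, here is a minimal sketch of one way to fit a board into 4 bitboards: each square gets a 4-bit code, and bit k of that code is stored in bitboard k. The piece codes and the function names are hypothetical, chosen for this illustration; this is not necessarily the encoding used in the tests, and it leaves out the flag bits entirely.

```python
# Hypothetical 4-bit codes: 0 = empty, 1..6 = white P N B R Q K,
# 9..14 = black p n b r q k (bit 3 doubles as a colour bit).
EMPTY = 0
PIECES = "PNBRQK"
CODE = {p: i + 1 for i, p in enumerate(PIECES)}
CODE.update({p.lower(): i + 9 for i, p in enumerate(PIECES)})


def pack(squares):
    """Pack 64 per-square codes (0..15) into four 64-bit bit planes."""
    planes = [0, 0, 0, 0]
    for sq, code in enumerate(squares):
        for bit in range(4):
            if (code >> bit) & 1:
                planes[bit] |= 1 << sq
    return planes


def unpack(planes):
    """Rebuild the 64 per-square codes from the four bit planes."""
    squares = []
    for sq in range(64):
        code = 0
        for bit in range(4):
            code |= ((planes[bit] >> sq) & 1) << bit
        squares.append(code)
    return squares


def start_position():
    """Standard starting position, square 0 = a1, square 63 = h8."""
    back_rank = "RNBQKBNR"
    squares = [EMPTY] * 64
    for file in range(8):
        squares[file] = CODE[back_rank[file]]
        squares[8 + file] = CODE["P"]
        squares[48 + file] = CODE["p"]
        squares[56 + file] = CODE[back_rank[file].lower()]
    return squares


board = start_position()
packed = pack(board)
assert unpack(packed) == board   # round trip is lossless
print([hex(p) for p in packed])
```

The packed form is 4 x 8 = 32 bytes. On a GPU the per-square loops would be replaced by a handful of bitwise operations on the whole 64-bit words, which is presumably where the roughly 20-instruction figure for the pieces-only case comes from.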
I have also sketched a plan for shared memory usage. The idea is to keep everything under the 16 kB limit. Look-up tables for move generation take 7 kB. Another 7 kB is reserved for move stacks. 1 kB is reserved for positions, which means that there are 16 positions per thread group (64 bytes each). This means that there will be 4 threads working on one position. I think that with careful design, there is enough work for them.
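The budget can be checked with a few lines of arithmetic. The group size of 64 threads below is not a separate design decision; it is just what 16 positions times 4 threads per position implies. The dictionary keys and variable names are illustrative only.

```python
# Shared memory budget per thread group, in bytes.
KB = 1024
SHARED_LIMIT = 16 * KB

budget = {
    "lookup_tables": 7 * KB,   # move generation tables
    "move_stacks": 7 * KB,
    "positions": 1 * KB,
}

POSITIONS_PER_GROUP = 16
THREADS_PER_POSITION = 4

used = sum(budget.values())
free = SHARED_LIMIT - used
bytes_per_position = budget["positions"] // POSITIONS_PER_GROUP
stack_bytes_per_position = budget["move_stacks"] // POSITIONS_PER_GROUP
threads_per_group = POSITIONS_PER_GROUP * THREADS_PER_POSITION

print(f"used {used} of {SHARED_LIMIT} bytes, {free} bytes free")
print(f"{bytes_per_position} bytes per position")
print(f"{stack_bytes_per_position} bytes of move stack per position")
print(f"{threads_per_group} threads per group")
```

At 64 bytes per slot, a position that takes 50% more space than the 32-byte packed form (48 bytes) fits with room to spare, and 1 kB of the 16 kB stays unallocated.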