Chess on a GPGPU: Breaking News: 400M positions per second!

5/14/2013


I did it. It was not so difficult after all. Because it was obvious that the global memory writes were slowing things down, I tried to figure out how to reduce their number. First, I noticed that I could not get rid of the six store operations for the six bitboards per position, because there are no instructions to write more than 64 bits at once. That gives an upper bound of 87.5M nps. I was not happy with that. Then I thought about ways to avoid putting the positions in global memory at all. Because shared memory is limited, I could not use it for the whole position stack. But what about the last ply? Well, you can guess...

So after eliminating the global memory writes for the last ply, I got a huge performance benefit. Because only the second-to-last ply is written to global memory, there are surely enough instructions (handling the last ply) to hide the memory latency. This also means that I can cope with fewer threads, as the performance converges towards the upper bound much faster with high instruction-level parallelism. Below are some initial results. I used 128 threads per block, because I found that to be the optimal number for the current version of the move generator.
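
As a rough illustration of the last-ply idea (not the actual GPU code), here is a CPU-side C++ sketch. It assumes a toy tree with a fixed branching factor; `make_child`, `Counters`, and both `walk_*` functions are hypothetical names invented for this example. The `std::vector` stands in for the position stack in global memory, and a local variable stands in for on-chip storage. The sketch only counts how many 64-bit stores go to the stack in each variant.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Six 64-bit bitboards per position, as described in the post.
struct Position {
    std::uint64_t bb[6];
};

// Hypothetical stand-in for a move generator: derives child number `i`
// from `parent` by scrambling one bitboard. This is not real chess.
inline Position make_child(const Position& parent, unsigned i) {
    Position child = parent;
    child.bb[i % 6] ^= 0x9E3779B97F4A7C15ull * (i + 1);
    return child;
}

struct Counters {
    std::uint64_t leaves = 0;        // positions reached at the last ply
    std::uint64_t stack_stores = 0;  // 64-bit stores into the position stack
};

// Baseline: every generated position, including those at the last ply,
// is written to the stack (the stand-in for global memory).
void walk_store_all(std::vector<Position>& stack, unsigned ply,
                    unsigned depth, unsigned branching, Counters& c) {
    if (ply == depth) { ++c.leaves; return; }
    for (unsigned i = 0; i < branching; ++i) {
        stack[ply + 1] = make_child(stack[ply], i);
        c.stack_stores += 6;  // one store per bitboard
        walk_store_all(stack, ply + 1, depth, branching, c);
    }
}

// Variant: positions at the last ply live only in a local variable
// and are never written to the stack.
void walk_skip_last(std::vector<Position>& stack, unsigned ply,
                    unsigned depth, unsigned branching, Counters& c) {
    assert(depth >= 1 && ply < depth);
    for (unsigned i = 0; i < branching; ++i) {
        Position child = make_child(stack[ply], i);
        if (ply + 1 == depth) { ++c.leaves; continue; }  // counted, not stored
        stack[ply + 1] = child;
        c.stack_stores += 6;
        walk_skip_last(stack, ply + 1, depth, branching, c);
    }
}
```

With a stack of `depth + 1` positions, a branching factor of 4, and a depth of 3, both variants visit the same 64 leaves, but the baseline performs 504 stack stores and the variant only 120. The saving grows with the branching factor, since the last ply holds most of the positions in the tree.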

Performance in Mnps vs. number of blocks.

The line is not smooth in the beginning because, with different numbers of blocks, the multiprocessors get different numbers of blocks to process. Multiples of 60 blocks yield the best performance. That is four times the number of multiprocessors (15) on the GTX 480. I am not sure why this is good.
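
The uneven distribution can be put into numbers with a simple model. The C++ sketch below assumes that all blocks cost the same and are spread as evenly as possible over the multiprocessors, so the run time is set by the busiest multiprocessor. The function names are made up for this example, and the model is an idealization, not a measurement.

```cpp
#include <cassert>

// Largest number of blocks any single multiprocessor receives when
// `blocks` equal-cost blocks are spread as evenly as possible over `sms`.
constexpr unsigned max_blocks_per_sm(unsigned blocks, unsigned sms) {
    return (blocks + sms - 1) / sms;  // ceiling division
}

// Fraction of the total multiprocessor time spent on useful work,
// assuming the busiest multiprocessor determines the run time.
double utilization(unsigned blocks, unsigned sms) {
    assert(blocks > 0 && sms > 0);
    return static_cast<double>(blocks) /
           (static_cast<double>(sms) * max_blocks_per_sm(blocks, sms));
}
```

In this model, `utilization(60, 15)` is 1.0, while `utilization(61, 15)` drops to about 0.81, because one multiprocessor gets a fifth block while the others wait. The model reaches 1.0 at every multiple of 15, so it accounts for the jagged line but not for why four blocks per multiprocessor is the best choice.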

But now that I have a working and efficient move generator, I can start to think about the next big problem, which is the parallel search.

9 comments:

Ankan, 5/14/2013 8:53 PM
Congratulations. 400 Million nodes per second is fast!
Samuel, 5/15/2013 10:36 AM
Thanks. Unfortunately, I forgot to switch on the capture move generation while testing the performance. The 400M nps is for non-captures. With all the moves, it's closer to 200M nps. Although captures and promotions are just a small fraction of all moves, generating them takes time, because they require more branches and move ordering is implicitly included.
Anonymous, 5/19/2013 12:13 PM
Congratulations, sounds like you are on the right track :)

Are these moves already sorted with MVV-LVA?
Anonymous, 5/19/2013 12:38 PM
Do you already have a clue what kind of search algorithm you are going to implement? I am not sure if the classic approach of parallel alpha-beta fits on GPUs...

--
Srdja
Samuel, 5/19/2013 12:40 PM
Thanks.

Yes, the capturing moves are already sorted in the MVV-LVA order. That is why it takes about twice the time it takes to generate the non-capturing moves. But I cannot easily change that because the move ordering is kind of integrated in the algorithm.

I think I can push the move generation performance even further, but currently, the most important question is how to implement the parallel version of alpha-beta search.
Ankan, 5/20/2013 8:51 PM
Hi Srdja,

Did you delete your zeta chess blog and github project? I can't find them anymore.

I am also planning to try writing a chess engine that runs on a GPU. I will attempt it on the new Kepler GPU with dynamic parallelism. But I need to finish my CPU engine first :-/

Regards,
-Ankan