I did it. It was not so difficult after all. Since it was obvious that global memory writes slow down the performance, I tried to figure out how to reduce their number. First, I noticed that I could not get rid of the six store operations for the six bitboards per position, because there are no instructions to write more than 64 bits at once. That gives an upper bound of 87.5M nps. I was not happy with that. Then I thought about ways to avoid putting the positions in global memory at all. Because shared memory is limited, I could not use it for the whole position stack. But what about the last ply? Well, you can guess...
So after eliminating the global memory writes for the last ply, I got a huge performance benefit. Because only the second-to-last ply is written to global memory, there are surely enough instructions (handling the last ply) to hide the memory latency. This also means that I could cope with fewer threads, as the performance converges towards the upper bound much faster with high instruction-level parallelism. So below are some initial results. I used 128 threads per block, because I found out that it is the optimal number for the current version of the move generator.
Performance in Mnps vs. number of blocks.

The line is not smooth in the beginning, because with different numbers of blocks, the multiprocessors get different numbers of blocks to process. Multiples of 60 blocks yield the best performance. That is four times the number of multiprocessors (15) on the GTX 480. I am not sure why this is good.
But now that I have a working and efficient move generator, I can start to think about the next big problem, which is the parallel search.
Congratulations. 400 Million nodes per second is fast!
Thanks. Unfortunately, I forgot to switch on the capture move generation while testing the performance. The 400M nps is for non-captures. With all the moves, it's closer to 200M nps. Although captures and promotions are just a small fraction of all moves, generating them takes time, because they require more branches and move ordering is implicitly included.
Congratulations, sounds like you are on the right track :)
Are these moves already sorted with MVV-LVA?
Do you already have a clue what kind of search algorithm you are going to implement? I am not sure the classic approach of parallel alpha-beta fits on GPUs...
--
Srdja
Thanks.
Yes, the capturing moves are already sorted in MVV-LVA order. That is why generating them takes about twice as long as generating the non-capturing moves. But I cannot easily change that, because the move ordering is kind of integrated in the algorithm.
I think I can push the move generation performance even further, but currently, the most important question is how to implement the parallel version of alpha-beta search.
sounds like a new kind of move generator :)
I can only give you the advice to take care of the SIMD nature of the threads inside one warp. The algorithm I used is divided into 7 phases, which is not very SIMD friendly... running 32 threads in a warp compared to running one, I achieve a speedup of only 4x... so I made something terribly wrong ;)
Maybe LIFO stacks could do the job for distributing work, both on a warp level and on a global level across warps... just a thought...
Good luck,
Srdja
I don't think there is anything that special about the move generator. It uses well-known techniques.
The SIMD nature is taken care of in my code by splitting the search into one-ply stages. The same code is run over and over again for all the threads. There is no recursion or anything else that would cause severe divergence issues (smaller problems still occur).
The 4x speed-up with 32 threads sounds like a divergence issue or a memory bank conflict. However, I think it is very difficult to achieve a 32x speed-up anyway.
Hi Srdja,
Did you delete your zeta chess blog and github project? I can't find them anymore.
I am also planning to try writing a chess engine running on a GPU. I will attempt it on the new Kepler GPU with dynamic parallelism. Before that I need to finish my CPU engine first :-/
Regards,
-Ankan
Yes, I deleted the blog and the repo... Zeta played chess, but the design was not optimized for running on a GPU... I don't think it is a good example of how to do it...
Good luck, Ankan,
Srdja