Chess on a GPGPU: March 2013

3/21/2013

Without exceptions...

...the world would be perfect. Unfortunately, it is not. Specifically, the rules of chess could be elegant, but there are some special moves that make things annoyingly difficult. Those are the double pawn move, promotion, en passant, and castling. In addition, making sure the side to move is not in check after the move can cause headaches. I think my move generator is only half as fast as it could be without the exceptions.

I am trying to find ways to handle the special moves like normal moves. This is not easy, because some moves require extra work. In castling, two pieces move. In promotions, the piece type changes during the move. In en passant, the captured piece is on a different square than the destination square of the moving piece. I do not want to make the make() and unmake() functions much more complicated. In a typical chess engine, there are ifs for testing for the special cases. I want to avoid ifs even if thread divergence is not an issue.

I tried to fit the move data in 32 bits. This means all the data required for restoring the position after undoing the move. I could do it easily without the special moves. But with them, I think that I need more. A general move might look like this:

Field:  captured pc. 2 | piece 2 | to sq. 2 | from sq. 2 | captured pc. 1 | piece 1 | to sq. 1 | from sq. 1
Bits:   4              | 4       | 6        | 6          | 4              | 4       | 6        | 6

That is 40 bits. So is it enough? Let's see. We have to- and from-squares, plus moving piece and captured piece. Two times. Obviously, regular moves (captures or not) require only one set of the fields. The other set can then be set to point to a dummy square and piece. Then, special moves require more fields. Castling consists of one king move and one rook move without captured pieces. En passant consists of a pawn move which captures the en passant square and of another move where an empty square captures the opponent's pawn (sounds funny, but it works). Promotion consists of a pawn move (can be a capture or not) and of another move where the promoted piece captures the pawn on the promotion square (again, sounds odd but it works).
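The two-part scheme above can be sketched in plain C as follows. This is only an illustration under my own assumptions, not the engine's actual code: the piece codes, the 65-entry board with a scratch dummy square at index 64, and the names SubMove, Move, sub_move(), make_move(), unmake_move() and same_squares() are all hypothetical. A dummy index of 64 would not fit in a 6-bit square field, so a packed version would need some other convention for the dummy square.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical piece codes and board layout, chosen only for this sketch. */
enum { EMPTY = 0, W_PAWN, W_ROOK, W_QUEEN, W_KING, B_PAWN };
enum { NUM_SQUARES = 64, DUMMY_SQ = 64, BOARD_SIZE = 65 }; /* a1 = 0 ... h8 = 63 */

typedef struct { uint8_t from, to, piece, captured; } SubMove;
typedef struct { SubMove part[2]; } Move;

SubMove sub_move(int from, int to, int piece, int captured) {
    SubMove s;
    assert(from <= DUMMY_SQ && to <= DUMMY_SQ);
    s.from = (uint8_t)from;
    s.to = (uint8_t)to;
    s.piece = (uint8_t)piece;
    s.captured = (uint8_t)captured;
    return s;
}

/* Apply both halves in order. There is no test for the move type:
   an unused half simply writes EMPTY to the dummy square. */
void make_move(uint8_t *board, const Move *m) {
    for (int i = 0; i < 2; ++i) {
        board[m->part[i].from] = EMPTY;
        board[m->part[i].to] = m->part[i].piece;
    }
}

/* Undo the halves in reverse order. The dummy square is scratch space:
   it may end up holding a piece code, but it is never read. */
void unmake_move(uint8_t *board, const Move *m) {
    for (int i = 1; i >= 0; --i) {
        board[m->part[i].from] = m->part[i].piece;
        board[m->part[i].to] = m->part[i].captured;
    }
}

/* Compare only the 64 real squares, ignoring the dummy square. */
int same_squares(const uint8_t *a, const uint8_t *b) {
    return memcmp(a, b, NUM_SQUARES) == 0;
}
```

In this sketch, white short castling is the pair (e1 to g1 with the king, h1 to f1 with the rook), en passant is a pawn move plus a half with piece EMPTY that "captures" the opponent's pawn, and promotion is a pawn move plus a half where the queen "captures" the pawn on the promotion square. The second half of a regular move is sub_move(DUMMY_SQ, DUMMY_SQ, EMPTY, EMPTY).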

Then, some extra fields are required for restoring the position. These include castling rights, the en passant square, and the move counter for the 50-move rule. Finally, in my implementation, I need to store the piece list indices for captured pieces, which is an extra 4 bits per captured piece. So the eventual move structure could be:

Field:  move counter | EP sq. | castling rights | captured pc. 2 | piece 2 | to sq. 2 | from sq. 2 | captured pc. 1 | piece 1 | to sq. 1 | from sq. 1
Bits:   8            | 4      | 4               | 4+4            | 4       | 6        | 6          | 4+4            | 4       | 6        | 6

This is not the tightest packing possible, but it takes 64 bits, which is convenient for alignment reasons. For the GPU, it is better to pack things like this, even though a few extra instructions are required to unpack them, because memory reads take time, and going over the 1 kB per warp limit reduces the multiprocessor occupancy rate.
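As a rough sketch of what packing and unpacking such a 64-bit move could look like, the C code below follows the field widths of the table, with "from sq. 1" in the lowest bits and the move counter in the highest. The bit order, the split of each "4+4" field into a piece code and a piece list index, and all names (MovePart, MoveInfo, pack_move(), unpack_move()) are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

/* One (from, to, piece, captured) set: 6 + 6 + 4 + 4 + 4 = 24 bits. */
typedef struct {
    uint8_t from, to;          /* 6 bits each */
    uint8_t piece;             /* 4 bits */
    uint8_t captured;          /* 4 bits: captured piece code */
    uint8_t captured_index;    /* 4 bits: piece list index of the captured piece */
} MovePart;

typedef struct {
    MovePart part[2];
    uint8_t castling;          /* 4 bits */
    uint8_t ep;                /* 4 bits */
    uint8_t counter;           /* 8 bits, for the 50-move rule */
} MoveInfo;

static uint64_t pack_part(const MovePart *p) {
    assert(p->from < 64 && p->to < 64);
    assert(p->piece < 16 && p->captured < 16 && p->captured_index < 16);
    return  (uint64_t)p->from
         | ((uint64_t)p->to << 6)
         | ((uint64_t)p->piece << 12)
         | ((uint64_t)p->captured << 16)
         | ((uint64_t)p->captured_index << 20);
}

static MovePart unpack_part(uint64_t bits) {
    MovePart p;
    p.from = (uint8_t)(bits & 63);
    p.to = (uint8_t)((bits >> 6) & 63);
    p.piece = (uint8_t)((bits >> 12) & 15);
    p.captured = (uint8_t)((bits >> 16) & 15);
    p.captured_index = (uint8_t)((bits >> 20) & 15);
    return p;
}

/* Set 1 in bits 0-23, set 2 in bits 24-47, then 4 + 4 + 8 bits of state. */
uint64_t pack_move(const MoveInfo *m) {
    assert(m->castling < 16 && m->ep < 16);
    return  pack_part(&m->part[0])
         | (pack_part(&m->part[1]) << 24)
         | ((uint64_t)m->castling << 48)
         | ((uint64_t)m->ep << 52)
         | ((uint64_t)m->counter << 56);
}

MoveInfo unpack_move(uint64_t word) {
    MoveInfo m;
    m.part[0] = unpack_part(word);
    m.part[1] = unpack_part(word >> 24);
    m.castling = (uint8_t)((word >> 48) & 15);
    m.ep = (uint8_t)((word >> 52) & 15);
    m.counter = (uint8_t)(word >> 56);
    return m;
}
```

Unpacking costs one shift and one mask per field, which is the "few extra instructions" trade-off mentioned above.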

3/19/2013

Back to the drawing board

In a previous post I described the Kogge-Stone-style move generator I had implemented. Unfortunately, I forgot that I must store the generated moves somewhere in the shared memory. If a position takes 64 bytes, I have only 960 bytes left below the 1 kB limit. If a move takes 4 bytes, then I can have at most 240 moves in memory. Obviously, generating all the moves at once is not an option if I want to do more than 1-ply searches. I might have to use the one-move-at-a-time move generation routine that I mentioned earlier. I got it working on the CPU at least.

The main idea is to have an index or a pointer to where we left off during the move generation. We generate a move, make the move, call the recursive search, unmake the move, and then generate the next move based on the pointer. The challenge is to construct the pointer so that it can be incremented to get the next move. There can be 16 pieces per side, so 4 bits determine the piece. Then there can be at most eight directions in which to move the piece (knights, queens, and kings). That takes 3 bits. Finally, the sliding pieces can go at most seven steps in one direction. That takes 3 more bits. So in total the pointer has 10 bits. With 10 bits one can represent 1024 numbers. That is a lot more than there are moves in a position. So skipping over the invalid pointer values is important for efficiency.
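One possible layout for such a 10-bit pointer is sketched below. It assumes that the step count sits in the lowest bits, the direction in the middle, and the piece index on top, so that a plain increment moves one step further along the same ray. The layout and the names (ptr_make(), ptr_next_dir() and so on) are hypothetical. The skipping functions show how whole ranges of invalid values can be jumped over at once, for example when a ray is blocked or a piece has been captured.

```c
#include <assert.h>

/* Assumed layout: [ piece: 4 bits | direction: 3 bits | step: 3 bits ] */
enum { STEP_BITS = 3, DIR_BITS = 3, PIECE_BITS = 4 };
enum { STEP_MASK = 7, DIR_MASK = 7, PIECE_MASK = 15 };
enum { PTR_END = 1 << (STEP_BITS + DIR_BITS + PIECE_BITS) };  /* 1024 */

unsigned ptr_make(unsigned piece, unsigned dir, unsigned step) {
    assert(piece <= PIECE_MASK && dir <= DIR_MASK && step <= STEP_MASK);
    return (piece << (DIR_BITS + STEP_BITS)) | (dir << STEP_BITS) | step;
}

unsigned ptr_piece(unsigned p) { return p >> (DIR_BITS + STEP_BITS); }
unsigned ptr_dir(unsigned p)   { return (p >> STEP_BITS) & DIR_MASK; }
unsigned ptr_step(unsigned p)  { return p & STEP_MASK; }

/* One step further along the same ray. */
unsigned ptr_next_step(unsigned p) { return p + 1; }

/* Ray blocked: fill the step bits with ones and add one, which lands
   on step 0 of the next direction (or of the next piece). */
unsigned ptr_next_dir(unsigned p) { return (p | STEP_MASK) + 1; }

/* Piece finished or missing: skip all of its remaining directions. */
unsigned ptr_next_piece(unsigned p) {
    return (p | ((1u << (DIR_BITS + STEP_BITS)) - 1)) + 1;
}
```

Iteration stops when the pointer reaches PTR_END. A non-sliding piece such as a knight would call ptr_next_dir() after every move, since it never takes a second step in the same direction.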

In addition, I have turned my attention to the 0x88-board representation. It is memory friendly, because it does not require large look-up tables. Basically, what is required, is a list of directions for each piece type. I am looking for an elegant solution to combine the 0x88-board representation with the pointer-based move generator. There are lots of interesting problems to solve.
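For readers unfamiliar with the 0x88 scheme, here is a minimal sketch in C. A square index is 16 * rank + file, so any step that leaves the board sets one of the bits in the mask 0x88, and a single AND replaces all edge checks. The direction lists are the standard 0x88 offsets. The function names and the empty-board counting helpers are made up for this example and ignore blocking pieces.

```c
#include <assert.h>

/* Standard 0x88 offsets: one rank is 16, one file is 1. */
static const int ROOK_DIRS[4]    = { 16, -16, 1, -1 };
static const int BISHOP_DIRS[4]  = { 17, 15, -15, -17 };
static const int KNIGHT_JUMPS[8] = { 33, 31, 18, 14, -14, -18, -31, -33 };

int sq0x88(int file, int rank) {
    assert(file >= 0 && file < 8 && rank >= 0 && rank < 8);
    return 16 * rank + file;
}

/* A square is on the board exactly when neither 0x88 bit is set.
   The cast makes the test well defined for negative indices too. */
int on_board(int sq) { return ((unsigned)sq & 0x88u) == 0; }

/* Number of squares a slider reaches on an empty board. */
int count_slider_targets(int from, const int *dirs, int ndirs) {
    int count = 0;
    for (int d = 0; d < ndirs; ++d)
        for (int sq = from + dirs[d]; on_board(sq); sq += dirs[d])
            ++count;
    return count;
}

/* Number of squares a leaper (knight, king) reaches on an empty board. */
int count_leaper_targets(int from, const int *jumps, int njumps) {
    int count = 0;
    for (int j = 0; j < njumps; ++j)
        count += on_board(from + jumps[j]);
    return count;
}
```

The only tables needed are the short direction lists, which is what makes the representation attractive when memory is tight.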

3/11/2013

More low level stuff

After playing with the multiprocessor occupancy calculator provided with the CUDA-SDK, I have concluded that a warp must not take more than 1 kB of shared memory. Then, other things being optimal, 100% occupancy can be achieved with at least six warps running on one multiprocessor. This is challenging, because a position takes several 8-byte bitboards, and moves take 4 bytes each. Move lists are required both for storing the current search path and for move generation.

I do not want to use large look-up tables for the move generation, because the memory space is restricted. It seems that magic bitboards and even rotated bitboards require too much memory. I will try a Kogge-Stone move generator. It is probably the fastest way to generate moves without look-up tables. Currently, I have most of the move generator written in CUDA-C and it compiles without errors. I have not had time to write the perft()-function yet, because I am currently quite busy at work. This GPU chess project is going forward, but it takes time.
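To illustrate the idea, the following C sketch shows the well-known Kogge-Stone occluded fill for two of the eight directions. It assumes the common bitboard mapping with a1 as bit 0 and h8 as bit 63, so north is a left shift by 8 and east a left shift by 1. The function names are my own, and the real generator may be organized differently.

```c
#include <stdint.h>

#define NOT_A_FILE UINT64_C(0xfefefefefefefefe)

/* Flood the sliders in 'gen' northwards through the empty squares in
   'pro' in three doubling steps (1, 2, then 4 ranks) instead of seven. */
uint64_t north_fill(uint64_t gen, uint64_t pro) {
    gen |= pro & (gen << 8);
    pro &= pro << 8;
    gen |= pro & (gen << 16);
    pro &= pro << 16;
    gen |= pro & (gen << 32);
    return gen;
}

/* Attacked squares: one more shift, so the first blocker is included. */
uint64_t north_attacks(uint64_t sliders, uint64_t empty) {
    return north_fill(sliders, empty) << 8;
}

/* Same scheme eastwards; masking the a-file stops wrap-around from
   the h-file to the next rank. */
uint64_t east_fill(uint64_t gen, uint64_t pro) {
    pro &= NOT_A_FILE;
    gen |= pro & (gen << 1);
    pro &= pro << 1;
    gen |= pro & (gen << 2);
    pro &= pro << 2;
    gen |= pro & (gen << 4);
    return gen;
}

uint64_t east_attacks(uint64_t sliders, uint64_t empty) {
    return (east_fill(sliders, empty) << 1) & NOT_A_FILE;
}
```

The code uses only shifts, ANDs and ORs on registers, with no table reads, and it handles all sliders in the bitboard at the same time.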

I found a way to parallelize the capture generation part of the Kogge-Stone generator. This will speed up the quiescence search. I think that it is possible to use the same kind of optimization with parts of the evaluation function. Then, the computing power is best utilized where it matters the most: near the leaf nodes.

I will try to write the performance test function this week. I will write more about that when I have results.

3/04/2013

Low level programming

The source code of a chess engine often has plenty of bit twiddling in it. In particular, engines using bitboards require lots of bit operations like exclusive ors, ands, complements, and shifts to the left or right. These operations can be implemented in a high-level programming language, most often C/C++. Modern C compilers are quite smart and do all kinds of optimizations, especially if the compiler is asked to optimize the code. Therefore it is not strictly necessary to use assembler code in the engine. In addition, assembler code can be quite difficult to debug.
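A few typical examples of such bit operations, written in portable C, are shown below. The function names are only illustrative, and a real engine would often use compiler intrinsics for counting bits instead of a loop.

```c
#include <stdint.h>

/* Isolate the lowest set bit: AND with the two's complement. */
uint64_t lowest_bit(uint64_t b) { return b & (~b + 1); }

/* Clear the lowest set bit. */
uint64_t clear_lowest_bit(uint64_t b) { return b & (b - 1); }

/* Count the set bits by clearing them one at a time. */
int pop_count(uint64_t b) {
    int n = 0;
    while (b != 0) {
        b = clear_lowest_bit(b);
        ++n;
    }
    return n;
}

/* Toggle the from- and to-squares of a piece bitboard with one XOR. */
uint64_t move_piece(uint64_t pieces, int from, int to) {
    return pieces ^ ((UINT64_C(1) << from) | (UINT64_C(1) << to));
}
```

Applying move_piece() twice with the same squares restores the original bitboard, which is why XOR is popular for make() and unmake().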

Unfortunately, the nvcc-compiler used for compiling CUDA-C does not seem to be that smart. It uses lots of registers, does stupid type conversions, and uses more instructions than necessary. The register usage is perhaps the most severe problem, because the occupancy rate of the multiprocessors depends on the number of registers per thread. I have tried a simple test program written in CUDA-C and examined the PTX-assembler code produced by the compiler. Then I wrote the same program directly in PTX-assembler, and got the same result with half the register count and half the instruction count. This means a speedup by a factor of two or more. I will certainly write the core of the engine in PTX-assembler, because wasting computing power for nothing feels bad.

So what to write in assembler? I think the functions that are called most often. These include the make() and unmake() functions, move generation, search, and evaluation functions. Initially, I will write a perft()-function to test the speed and validity of the move generation. This has to be faster than the CPU move generator, or otherwise I have to re-design the whole thing. I will let you know when I have something working.
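For reference, the structure of a perft() function is sketched below in C. Since a full chess move generator is far too long for an example, a toy game stands in for chess: a position is a pile of tokens and a move removes one or two of them. The toy game and all names (ToyPosition, toy_next_move() and so on) are invented for this sketch. Only the shape of the recursion (generate, make, recurse, unmake) carries over to a chess engine.

```c
#include <stdint.h>

/* Toy stand-in for a chess position: a pile of tokens. */
typedef struct { int tokens; } ToyPosition;

/* Return the move after 'prev' (0 = start), or 0 when there are none.
   A move is the number of tokens to remove: 1 or 2. */
int toy_next_move(const ToyPosition *p, int prev) {
    int m = prev + 1;
    return (m <= 2 && m <= p->tokens) ? m : 0;
}

void toy_make(ToyPosition *p, int m)   { p->tokens -= m; }
void toy_unmake(ToyPosition *p, int m) { p->tokens += m; }

/* Count the leaf nodes of the move tree at the given depth. */
uint64_t perft(ToyPosition *p, int depth) {
    uint64_t nodes = 0;
    if (depth == 0)
        return 1;
    for (int m = toy_next_move(p, 0); m != 0; m = toy_next_move(p, m)) {
        toy_make(p, m);
        nodes += perft(p, depth - 1);
        toy_unmake(p, m);
    }
    return nodes;
}
```

Comparing the node counts against known correct values checks the validity of the move generator, and timing the call measures its speed.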