12/22/2014

Tests with compute shaders

I did some tests with compute shaders. I am pleasantly surprised that even with the very modest GPU of my laptop (AMD Radeon 7500G), it is possible to do over 60 billion SIMD operations per second. That is more than most multicore CPUs can achieve even today. For example, eight cores running at 4 GHz can only achieve 32 billion operations per second. It is true that on the CPU these can be 128-bit SSE vector operations, but the GPU can do similar wide operations as well.

I implemented the position packing and unpacking tests (see the previous blog post) with a compute shader. The idea of the packing was to fit all the information required for a chess position in 4 bitboards of 64 bits when storing it in the global memory. The position is unpacked when it is needed in the computations. With flag bits (turn, castling rights, etc.) included, the GPU could do 178 million pack-unpack pairs per second. Without flag bits (only pieces), the figure was 2.85 billion pack-unpack pairs. They take about 500 and 20 instructions, respectively. Let's say that an unpacked position takes 50% more space. That means one extra global memory fetch for reading a position and another for writing it. Global memory accesses can take hundreds of clock cycles, so the packing might or might not pay off, depending on the GPU.

I have also sketched a plan for shared memory usage. The idea is to keep everything under the 16 kB limit. Look-up tables for move generation take 7 kB. Another 7 kB is reserved for move stacks. 1 kB is reserved for positions, which means that there are 16 positions per thread group. This means that there will be 4 threads working on one position. I think that with careful design there is enough work for all of them.

12/11/2014

New ideas

It has been a while since I wrote anything about GPU chess. I still have not abandoned the idea of writing a GPU chess engine that could beat the CPU engines. With DirectX 11 and its compute shaders, general-purpose computation on GPUs is brought one step closer to a wide audience. You no longer need NVIDIA hardware and CUDA; any modern graphics card will do. I am eager to try out how compute shaders could be used for chess algorithms.

Recently, I did some "bit twiddling" and found a more compact way to store chess positions. 128 bits is a natural alignment boundary on many GPU architectures, because four 32-bit floating point numbers fit in it. In graphics, those four floats represent the x-, y-, z-, and w-coordinates in a homogeneous coordinate system, or the red, green, blue, and alpha channels in the screen buffer. We can pack a chess position neatly into two of these 128-bit vectors. I will explain how.

We start with seven bitboards (64 bits each). These are pawns, knights, bishops, rooks, queens, kings, and white pieces. In addition, we have a few bits for other stuff, such as the turn bit, castling rights, en passant square, and the 50-move rule counter. It is a commonly used trick in chess engines to pack queens into the bitboards for bishops and rooks by bitwise OR, so that we end up with two bitboards instead of three, namely bishop-queens and rook-queens. Queens can be recovered by bitwise ANDing the two bitboards. This arrangement makes sense also from the move generation point of view, because queens move like bishops and rooks. However, this is just an additional benefit. The packing trick works for any other set of bitboards as long as they contain disjoint sets of bits. We might as well have pawn-kings and pawn-knights for additional packing. We still have five bitboards and the flag bits. Packing these already packed bitboards together with the same technique is not possible, because the bit sets are no longer disjoint.

However, if we start by packing three bitboards together, a more compact representation is possible. Let the six bitboards representing the pieces have the names p, N, B, R, Q, K. Then let's make three new bitboards s = p | N | B, t = B | Q | R, and u = R | K | p (here "|" means bitwise OR). Now it is possible to recover the original bitboards using only bitwise operations. For example, p = s & u or Q = t & ~s & ~u (here "~" means bitwise NOT and "&" is bitwise AND). We have effectively packed six bitboards into three. Then there is the bitboard for white pieces. We cannot pack it, because it is not disjoint from the others. And finally we have the flags, which is kind of annoying, since the data would otherwise fit in exactly four bitboards.

Packing additional data into the packed bitboards is still possible. Imagine three circles, all intersecting each other. Each of these circles represents one of the aforementioned bitboards s, t, and u. The intersection between any two of the circles represents the bitboard for the piece they have in common. The areas of the circles left after cutting off the intersections represent the bitboards for the other three pieces. In the center there is a seventh area, the intersection of all three circles. That is always empty if the bitboards are defined as above. However, we can utilize this intersection area to store the additional data. It is a bit tricky, but possible. We need to make sure that the bit sets stay disjoint. First, notice that there are at most 32 pieces on the board. That leaves at least 32 empty squares. We can use those for storing the data while keeping the disjointness condition. The challenge is that the empty squares may change. Finding out which squares are empty is easy: ~(s | t | u). We can use, for example, the lowest 32 of those bits. Bit by bit, we construct a 64-bit bitboard that contains the flag bits and then bitwise OR it with each of the three bitboards. These are the final three bitboards, and with the bitboard for white pieces, we have exactly four bitboards, or 256 bits, or 32 bytes for one chess position. Unpacking proceeds in reverse order, first finding the flag squares by bitwise ANDing all three bitboards, then extracting the flags, and so on.

I have tested the validity of this approach on the CPU, and the tests work in the sense that all the information about a chess position is packed into 256 bits and unpacked without errors. A few-years-old 3.2 GHz processor could pack and unpack about 10 million positions per second in one thread. Most of the time was consumed in packing the flag bits into the empty squares, as explained above. Without the flags, the packing and unpacking was five times faster. On the GPU, I would consider packing all the data that goes to the global memory, or that is transferred between the GPU and the CPU. In move generation and other local computations, I would use unpacked positions.

This packing thing was just one idea. I am still trying to find a good scheme for going through search trees.