Chess on a GPGPU

5/20/2020

A Different Approach

It has been quiet here for a while. During the last few months, I have regained some interest in computer chess. If I ever want to succeed in writing a GPU chess program, it must be faster than a CPU reference implementation. So I wrote new CPU-side move generation code from scratch. The purpose was to make it as fast as possible. That would set the bar. I wrote the test in the form of a "perft" program that calculates the number of positions at a given depth from a given position. I also compared it to H. G. Muller's qperft (https://home.hccnet.nl/h.g.muller/dwnldpage.html).

I used the traditional bitboard presentations for the positions, where each piece type has a 64-bit field representing squares occupied by that piece type. In addition, I needed to store a bitboard for color, and some extra information, such as whose move it is, castling rights, and en passant square. This all fits nicely is 64 bytes (8 bitboards), which is a common cache line size. I could pack them even tighter, but then would need extra instructions for reading and writing them. In addition, I later added a 64-bit hash key there too.

I have implemented a few alternative sliding piece attack algorithms in the past. I considered "magic bitboards" (https://www.chessprogramming.org/Magic_Bitboards), rotated bitboards (https://www.chessprogramming.org/Rotated_Bitboards), and the ray attacks approach (https://www.chessprogramming.org/Rotated_Bitboards). I ended up using the last one. Magic bitboard need quite large lookup tables, which pollute the cache quite easily (won't fit in L1, but need L2). Rotated bitboards, on the other hand, require extra instructions to update the directional bitboards and also some extra space too. The classical ray attacks approach has smaller lookup tables (fit in L1 cache) and doesn't need extra bitboards per position. I may do some measurements later. It may also depend on the other code, which one is the fastest (e.g. hash tables may pollute the cache anyway).

Then there is the handling of the king, which affects the performance. The lazy and easy approach would be to generate moves, so that the king is like any other piece, and only when a move is executed that captures a king, the position was marked invalid. This requires generating moves one ply deeper than necessary, if king captures were not generated in the first place. Because the branching factor of chess positions can be quite high, this would yield an order of magnitude slower code. Even if only captures to the king's square were tested, this would make it impossible to use bulk counting in leaf nodes. The bulk counting means that at the last ply, the move generation would only need the count the moves, but not execute them. These were the reasons, why I ended up implementing the much more complex move generator, that doesn't produce king captures.

To avoid generating king captures, the pinned pieces must be recognized and taken into account in move generation. Also it is necessary to detect checks and respond to them appropriately. That's why, before generating any moves, there is a preparation step that finds if there are pieces that give check (possible are 0, 1, 2 such pieces), and also find pins in each four directions (vertical, horizontal, diagonal, and anti-diagonal). These are stored in bitboards that are easy to use in move generation. Additionally, the area protected by any of opponent's piece is calculated. Then in the move generation, the pinned bitboards are used for restricting the directions in which moves are allowed for pinned pieces (pinned sliding pieces may still move in the direction of the pin). In the case of a check, there is a special check evasion move generation. Three types of moves are allowed: king moves to squares not belonging to the protection area, capturing the checking piece, and moving a piece in between the checker and the king. The latter two are only possible, when there is only one checker. The last one is valid only for sliding piece checks. There are a few caveats in this approach. E.g. the king should be removed from the board when calculating the protection area. Otherwise the squares x-ray attacked by the checker through the king would still be valid squares to move to. Also, there is a special case with en passant if the king and a rook or a queen are on the same row, but on opposite sides, of the pawns involved in en passant. The possible pin is not recognized by normal pin code, because there are two pawns between the king and the attacker. En passant should be illegal in this case, because the move removes both pawns simultaneously, leaving the king in check.

After navigating through the jungle of special cases, I finally managed to implement a move generator that is accurate and fast. I tested the move counts in several known "perft positions". After fxing a handful of bugs, I get the correct results in all tested positions. My initial implementation was single-threaded. Depending on the positions, it produced 200-300 million positions per second on my computer with Intel i7 9700K 8-core processor (on a single core at this point). Then I decided to make it multi-threaded. I implemented a version of the perft function that uses work-stealing for distributing the work on all cores. I was quite happy to get a fairly good scaling on the eight cores, the result being almost 2.1 billion positions per second in a middle game position.

Yet another speed up came, when I implemented a hash table. This proved to be a bit more challenging. First, I needed a hash key for each position. I needed a good hash, because the implementation relied on the 64-bit hash keys being different for different positions. Because of the birthday problem (https://en.wikipedia.org/wiki/Birthday_problem), the probability that two hash keys collide is greater than what the intuition would say, I needed to calculate the likelyhood of hash key collisions. Fortunately, the hash table size would limit the severity of the problem, because not all generated positions would fit in there. I calculated that even in the bad cases, the probability would in the order of one to ten million. That's enough for me. I ended up using the Zobrist keys that are commonly used in the chess engines (https://en.wikipedia.org/wiki/Zobrist_hashing). I update it when executing moves, so it adds very little extra work.

Any reasonable-sized hash table would most probably produce a cache miss for each probed position. That's why I thought it would be wise to take most out of each cache line read. The hash table entries were 16 bytes each, containing the hash key and the payload data (64-bit position counter). The lowest bits of the hash key were used for determining the index of the table, in which the hash table entry was placed. It was obvious that with a move generator producing over 2 billion positions per second, the hash table would be full in a blink of an eye. So I needed a replacement strategy for colliding entries. In this case, it was easy to use the position count covered by a hash entry as a criterion. The table would keep the entry, which had higher position count. Still for each read 16-byte entry, there would be 3 other entries (48 bytes of 64 bytes cache line) that would be read to cache for nothing. So I decided to make the hash table kind of 4-way associative. If there was a write collision, we would also check the other 3 slots in the same cache line. If any of those were empty or had worse position count, we would replace it. And when reading, we would also read those adjacent slots in the hash table for almost no extra cost (one cache line read anyway). Comparing the hash keys would reveal which of the positions was the correct one, if any. This approach improved the hit rate, when the table was filling up.

In my initial implementation, the hash table polluted the caches and trashed the performance as I was afraid of. Not only hash table accesses were memory bound, but also the normal move generation, because the lookup tables were kicked out of cache. Then I made a few adjustments based on profiling. First, I removed the hash reads and writes from the two lowest levels of the search tree. This reduced the memory traffic so that only hash table accesses waited for memory, but not the normal move generation. Another adjustment was not using pointer in hash table accesses. In initially, I had returned a pointer to the hash table entry, when probing for a read. That lead to indirection that made the program actually read the memory location twice: once when finding the correct entry, and once when using the data in it. This was revealed by looking at the compiled assembler code. I changed this so that the hash entry was returned by value, which apparently went nicely in one 8-byte register, because hash-part of the 16-byte entry was not used. This gave the performance boost I had hoped for from the hash table. An open question was still, if I should do a shared hash table for all the workers or a private one for each. After testing this, it was obvious that a shared hash table would beat individual ones any time, even with the additional locks to protect the table from concurrent access to same entries. I was even prepared to do a fancy system with bitfields accessed with atomics to protect entries with finer granularity, but in the end using just one mutex for the whole table was sufficient. The situation may be different if there were more than eight workers, but now locks took so little time that it was difficult to find in the measurements.

In the end, I got nice results, especially in deeper searches. As an example in one position, the multi-threaded, hash table-backed perft counted in 10 ply some 14.6 trillion positions in just 81 seconds. This means over 180 billion positions per second. However, it should be noted that many of these come from the hash table. In these deeper searches, hash table hit rate is over 90 %. Still, here is the standard to beat with GPU.

2/15/2015

Thinking differently

Most chess engines seem to rely on a move generator that produces a list of moves. Then, it is possible to sort the moves so that the best candidates are searched first. However, move lists take memory space. That is a problem for my GPU move generator. There is 16 kB of shared memory for a thread group. If a thread group contains 256 threads, each thread has 64 bytes of memory. 32 bytes of it is reserved for the current positions, so only 32 bytes is left for moves. That is not enough for a move list. One has to think differently.

As mentioned in my previous posts, I implemented a traditional bitboard-based move generator that produces one move at a time on the GPU. Unfortunately, the performance is not so great. For comparison, I had a move generator that outputs a full move list at once. The former was four times slower than the latter. It seems that there are two reasons for that. First, the move generator is more complex, because it has to take the previous move as an input to regenerate the state of the move generator and to produce the next move. Second, it does unnecessary work, because one move needs only one destination square, although a bitboard representing all of them is generated for each move. I think it is time to revisit the design I had for the original CUDA chess move generator.

The basic idea is to have a few generic numbers represent the state of the move generator. For example: one for the source square (1-64), one for the direction (1-8) and one for the steps in that direction (1-7). Then the numbers are increased as the moves are produced. A similar idea was used in the early dedicated chess computers. And of course, one can use bit twiddling tricks to pack all the numbers in one integer. Tests are required to see how to implement the move generator most efficiently.

1/24/2015

Refining the plan

During the last two weeks, I have been thinking about ways to utilize better the GPU in the move generator. There are two problems: divergence and insufficient processor utilization. These problems are opposite in a way. A large number of threads handling different positions produce divergent execution paths, but on the other hand it is difficult to get good processor utilization with a small number of threads. In addition, one has to take into account that more threads use more shared memory, at least in my current plan.

There are basically four kinds of more generators: sliding pieces, pawns, kings, and knights. It seems to me that trying to write a generic move generator for all pieces, in an attempt to avoid divergence, is wasting processor time. For example, knight move generator requires only a simple table look-up. Sliding pieces generator is a little more complicated, requiring masking blocked paths and outputting potentially several destination squares per direction. Pawn moves have different capture and moving directions, and one has to take into account double pawn moves, en passant and promotions. King move generator requires complicated handling of castling (e.g. the king cannot move over a square where it were in check). I have decided to split the move generator in four parts, so I can optimize each part separately. The problem then is that I need lots of positions, so that each move generator has something to do.

I will make an attempt to use 256 positions per thread group (256 threads also). Then there is a chance that there are enough threads for the full utilization of the processors. Each of the four move generators can use 64 threads to handle a different group of positions. Divergence is not a problem when the threads in the same move generator belong to the same wave (or warp in CUDA terminology). If one move generator runs out of work, the threads in the wave can help with another move generator. The worst case scenario is that there are 253 positions for one move generator and only one position for each of the other three generators. Then the three generators with only one position waste 63 threads each, while the one with the most threads must run four times, if the others finish too late to help it. I think this is quite rare, but I still have to think about something clever to avoid idle threads.

Shared memory usage will obviously be one on the major limiting factors. With 256 positions to store, we cannot afford 8 bitboards, because that would take 16 kB, which is all of our budget for the shared memory. I think the best solution would be using 4 bitboards only (in total 8 kB). As I explained in an earlier post, it is possible to pack the flag bits in the 4 bitboards. However, because the packing and unpacking is computationally expensive, I came up with a plan, where the flags bits are not stored in the positions. They have to be stored in the move data structures anyway, because one has to be able to undo a move and restore the flag bits. So why waste extra space elsewhere. The means that there always has to be a move associated with a position, but that is the case anyway, because we are generating one move at a time, and the move generator takes the previous move as an input parameter.

The rest of the shared memory can be used for storing move stacks and split points. There is no room for full move stacks, because one move takes 8 bytes and 256 moves take 2 kB. I plan to store only three last moves per position (in total 6 kB), and store the rest in the global memory. The three moves are stored in a ring buffer that act as a cache. I hope that latency of the the memory accesses can be hidden by the work that is done in the last 2 ply. I will try to implement this in the next few weeks.

1/10/2015

Compute shader move generation progressing

I have written the basic move generation routines with DirectX 11 compute shaders now. I use bitboards and small look-up tables that contain moves for each piece-square pair and ray tables for sliding pieces. Those tables take about 6 kB. I initially put them in the shared memory, but then began to think, that since the tables are small, they could fit in the L1 cache which could be even faster. Tests proved that this was the case. Shared memory is slower than the cache, so I moved the tables to texture memory. This frees space in the shared memory that can be used for other purposes.

There is not enough space to store generated move lists for each position. It seems that I have to use over 32 bits per move, because there must be enough information to undo the move and account for special moves like promotion. Restoring the flags for a position takes 16 bits. Then source and destination squares take 6 bits each. Moving piece, captured piece, and promoted piece take 3 bits each. These are 37 bits already. I might as well reserve 64 bits per move, because 32 bits is the smallest natural element size on many GPU architectures. Then it is possible to include some extra information in the move data structures.

My idea is to store only the moves that are actually made. So the move generation must be able to produce only one move at a time, and deduce the next move to generate from the previous move. This requires some extra work per move compared to the generation of the whole move list at once. However, the extra work pays off, because the GPU can be better utilized. The move order must also be static. I think this is not a serious issue, because this approach is used near the leaf nodes. Higher in the tree we can use more sophisticated move ordering algorithms and global memory.

The shared memory will contain the move stacks and the current position for each group of thread working on the same position. In addition, there will be information about the split points. Split points mean points in the search tree, where the work is distributed between two or more groups of threads. It is not necessary to store the positions at the split points, but only the last move generated and search bounds. There will probably be a separate list of split points. An alternative is to interleave the split points in the move stacks. I will make more concrete plans, when I get the move generation part working fast enough.

12/22/2014

Tests with compute shaders

I did some tests with compute shaders. I am pleasantly surprised that even with the very modest GPU of my laptop (AMD Radeon 7500G), it is possible to do over 60 billion SIMD operations in a second. That is more than most multicore CPUs can achieve even today. For example, eight cores running at 4 GHz, can only achieve 32 billion operations. It is true that on the CPU these can be 128-bit vector SSE operations, but the GPU can do similar operations as well.

I implemented the position packing and unpacking tests (see previous blog post) with a compute shader. The idea of packing was to fit all the information required for a chess position in 4 bitboards of 64 bits, when storing it in the global memory. The position is unpacked when it is needed in the computations. With flag bits (turn, castling rights, etc.) included, the GPU could do 178 million pack-unpack pairs in a second. Without flag bits (only pieces), the figure was 2,85 billion pack-unpack pairs. They take about 500 and 20 instructions, respectively. Let's say that an unpacked position takes 50 % more space. It means one extra global memory fetch for reading a position and another for writing it. Global memory accesses can take hundreds of clock cycles. So the packing might be useful or not, depending on the GPU.

I have also sketched a plan for shared memory usage. The idea is to keep everything in under the 16 kB limit. Look-up tables for move generation take 7 kB. Another 7 kB are reserved for move stacks. 1 kB is reserved for positions, which means that there are 16 positions per thread group. This means that there will be 4 threads working on one position. I think that with careful design, there is enough work for them.

12/11/2014

New ideas

It has been a while since I wrote something about GPU chess. I still have not abandoned the idea of writing a GPU chess engine that could beat the CPU engines. With DirectX 11 and its compute shaders general purpose computation on the GPUs is brought one step closer to the wide audience. You do not need NVIDIA and CUDA anymore. Any modern graphics card will do. I am eager to try how the compute shaders could be used for chess algorithms.

Recently, I did some "bit twiddling" and found a more compact way to store chess positions. 128 bits is a natural alignment boundary on many GPU architectures, because four 32-bit floating point numbers fit in it. When computing graphics, those four floats represent x-, y-, z-, and w-coordinates in a homogenous coordinate system or red, green, blue, and alpha channels in the screen buffer. We may pack a chess position neatly in two of these 128-bit vectors. I will explain how.

We start with seven bitboards (64 bits each). These are pawns, knights, bishops, rooks, queens, kings, and white pieces. In addition, we have a few bits for other stuff, such as the turn bit, castling rights, en passant square, and 50-move rule counter. It is a commonly used trick in chess engines to pack queens into bitboards for bishops and rooks by bitwise OR, so that we end up with two bitboards instead of three, namely bishop-queens and rook-queens. Queens can be found by using bitwise AND between the two bitboards. This arrangement makes sense also from the move generation point of view, because queens move like bishops ans rooks. However, this is just an additional benefit. The packing trick works for any other set of bitboards as long as they contain disjoint sets of bits. We might as well have pawn-kings and pawn-knights for additional packing. We still have five bitboards and flag bits. Packing these already packed bitboards together with the same technique is not possible, because the bitsets are no longer disjoint.

However, if we start with packing three bitboards together, a more compact representation is possible. Let the six bitboards representing the pieces have names p, N, B, R, Q, K. Then let's make three new bitboards s = p | N | B, t = B | Q | R, and u = R | K | p (here "|" means bitwise OR). Now it is possible to find the original bitboards by just using bitwise operations. For example, p = s & u or Q = t & ~s & ~u (here "~" means bitwise NOT and "&" is bitwise AND). We have effectively packed six bitboards into three. Then there is the bitboard for white pieces. We cannot pack it, because it is not disjoint of the others. And finally we have the flags, which is kind of annoying, since the data would otherwise fit in exactly four bitboards.

Packing additional data into the packed bitboards in still possible. Imagine three circles, all intersecting each other. Each of these circles represent one of the aforementioned bitboards, s, t, and u. The intersection area between any two of the circles represent bitboard for a piece that they have in common. The areas of the circles left after cutting off of the intersections areas represent the bitboards for the other three pieces. In the center, there is a seventh area that is the intersection of all the three circles. That is always empty if the bitboards are defined as above. However, we can utilize this intersection area by using it to store the additional data. It is a bit tricky, but possible. We need to make sure that the bits are disjoint sets. First, notice that there are at most 32 pieces on the board. That leaves at least 32 empty squares. We can use those for storing data and keep the disjointness condition. The challenge is that the empty squares may change. Finding out which squares are empty is easy: ~(s | t | u). We can use for example the lowest 32 of those bits. Bit by bit we construct a 64-bit bitboard that contains the flag bits and then bitwise OR it with each of the three bitboards. These are the final three bitboards, and with the bitboard for white pieces, we have exactly four bitboards or 256 bits or 32 bytes for one chess position. Unpacking progresses in a reversed order, first finding out the flag bits by bitwise ANDing all the bitboards, then extracting the flags, etc.

I have tested the validity of this approach on the CPU, and the tests work fine in the sense that all the valid information about a chess positions is packed into 256 bits, and unpacked without errors. A few years old 3.2 GHz processor could pack and unpack about 10M positions per second in one thread. Most of the time was consumed in packing the flag bits into the empty squares, as explained above. Without the flags, the packing and unpacking was five times faster. On the GPU, I would consider packing all the data that goes to the global memory, or that is transferred between the GPU and the CPU. In move generation and local computations, I would use unpacked positions.

This packing thing was just one idea. I am still trying to find a good scheme for going through search trees.

6/27/2013

Solving chess - facts and fiction

Because it is summer, I decided to take a vacation and spend some time with topics not directly related to my GPU chess project. One of the interesting topics I have been thinking is solving chess. I know it is not realistic with the current level of technology, but with future technologies, it might be possible. At least we can speculate what it would take to do that.

FICTION 1:Solving chess requires an evaluation function. It does not. Many people familiar with computer chess programs know that the computer uses a rough estimation of the value of the positions in its calculations, such as "white is up 0.11 pawns". This evaluation is not always accurate and one may thus suspect the accuracy of the solution if a computer is to solve chess. But the point is that, conceptually, every variation is calculated to the very end, meaning either a checkmate or a draw by one of the rules. Thus, there are only three evaluations: win for white, draw, or win for black. None of these involve any heuristics. They are the true values of the position. Another way to think about this is imagining a checkmate search that begins from the initial positions and tries to find a checkmate for white in any number of moves. It would take a very long time, but eventually, it would announce that there is a forced checkmate in X moves or that there is none. No evaluation required here.

FACT 1: Chess is solvable. Currently, this is merely a theoretical concept, because we lack computing power and memory. But it is fairly easy to show that there is a finite number of chess games, which ultimately can be enumerated and the best play for both sides can be found. First, there is a strict upper bound for moves that a game can last. The 50-move rule dictates that every 50th move, one side has to either move a pawn or make a capture. These are irreversible moves. This implies that, if the game does not end in another way, eventually the pawns cannot move or they promote to pieces, and all the pieces will be captured. Then there is no way to avoid a draw by the 50-move rule. Second, there is a finite number of moves in each position. There are only so many pieces that can move to so many squares. These two facts prove that there is a limited number of games. In his seminal paper "Programming a Computer for Playing Chess", Claude E. Shannon estimated the number of all games to be around 10¹²⁰ (Philosophical Magazine, Ser. 7, Vol. 41, No. 314, March 1950).

In fact, the number of legal chess positions is much smaller. Shannon estimated this to be around 10⁴³. This estimate was rough and ignored promotions. Currently, the best proven upper bound is around 10^46.7. It seems that better than enumerating all the games is listing all the positions and finding if they are winning or not. This can be done by a technique called retrograde analysis which is widely used for generating endgame tablebases. Given the types of pieces on the board, it is possible to list all the checkmates. Then working the moves backwards, all the positions from which the checkmate positions can be achieved with one move are listed and marked as wins in 1 ply for the checkmating side. Then all the positions (where it is the defender's turn) from which one can reach only positions previously listed as wins in 1 ply are marked as wins in 2 plies. And the process is repeated until all the positions have be examined. This is a simplified explanation of the process, but you probably get the idea. This has been done for all the material combinations with 7 pieces or less (see e.g. Lomonosov Endgame Tablebases).

FICTION 2: Chess can be solved with current hardware. There have been hoaxes that claim that they have nearly solved chess. They are either good natured, such as the April Fool's joke by ChessBase, or exploit people's good faith to collect money, such as the infamous Nikola Chess. It is very common to be confused with big numbers that are given with the exponent notation. For example, 10⁴⁷ and 10⁵³ seem to be quite close to each other. But their ratio is 1 to 1 000 000. Not exactly close. While modern GPU's can perform as much as 10¹²-10¹³ operations per second, this is lightyears away from what is required. Imagine we had money to buy a million GPU's with 10¹³ flops capacity. That yields 10¹⁹ flops in total. Assume we had an extremely powerful algorithm for retrograde analysis, taking only 10 operations per position. This means, we can handle 10¹⁸ positions per second. Assume we had enough memory for all the position we need. How long it would take to go through all the chess positions? About 1 600 000 000 000 000 000 000 years. The universe estimated to be about 13 000 000 000 years old. Can you count the zeros?

FACT 2: It is difficult to solve chess even in the near future. For years, the so-called Moore's law was true: the computing power doubles every 18 months. Currently, as the fastest CPUs are running at 4 GHz, we seem to have hit the ceiling. While I was studying physics at a university in the beginning of the millennium, a professor said that he expected the upper limit to be around 10 Ghz. Above that, the transistors must be so small that the quantum phenomena begin to interfere with the computing. In nanometer scale, tunneling of the electrons happens with high enough probablity to make the processing unreliable. Speed of light is another issue. Information cannot travel faster than light, so the devices must be small to make them fast. Light propagates a meter in about 3.3 nanoseconds. This will become an issue if we need lots of memory, that cannot be located on the same chip as the processor. Clever computer architecture might alleviate this, but not too much. And the real problem that seems to limit the performance is the heat. Processors produce a lot of it. The more there are transistors, the more heat is emitted and the better cooling is required.

However, let us assume for a moment that Moore's law still worked. With increasing trend towards more parallel machines, we might have some overtime. Or perhaps a quantum computer is invented. One never knows. Currently, world's most powerful supercomputers achieve 10¹⁶ flops. Let us assume here the more realistic 100 operations per position. This means 10¹⁴ positions per second. Let us assume we could run the retrograde algorithm on the supercomputer for four months, i.e. 10 million seconds. Then, we can examine 10²¹ positions. Now, increase the computing power by using the Moore's law. When do we reach the 10^46.7 positions? Simple math reveals: after 85 of the 18 months periods or 127.5 years or around year 2141. Not in our life time.

IN CONCLUSION: Heuristic approaches to chess will rule for years to come. Developing chess algorithms that use heuristic evaluation functions is still meaningful. Future chess engines will have deeper searches, more accurate evaluation, and perhaps we will see the rise of GPU chess. I want to be part of that.