Solving chess - facts and fiction

Because it is summer, I decided to take a vacation and spend some time with topics not directly related to my GPU chess project. One of the interesting topics I have been thinking about is solving chess. I know it is not realistic with the current level of technology, but with future technologies, it might be possible. At least we can speculate about what it would take to do that.

FICTION 1: Solving chess requires an evaluation function. It does not. Many people familiar with computer chess programs know that the computer uses a rough estimate of the value of a position in its calculations, such as "white is up 0.11 pawns". This evaluation is not always accurate, and one may thus suspect the accuracy of the solution if a computer is to solve chess. But the point is that, conceptually, every variation is calculated to the very end, meaning either a checkmate or a draw by one of the rules. Thus, there are only three evaluations: win for white, draw, or win for black. None of these involve any heuristics. They are the true values of the position. Another way to think about this is to imagine a checkmate search that begins from the initial position and tries to find a checkmate for white in any number of moves. It would take a very long time, but eventually it would announce that there is a forced checkmate in X moves or that there is none. No evaluation is required here.

FACT 1: Chess is solvable. Currently, this is merely a theoretical concept, because we lack the computing power and memory. But it is fairly easy to show that there is a finite number of chess games, which can ultimately be enumerated and the best play for both sides found. First, there is a strict upper bound on the number of moves a game can last. The 50-move rule dictates that a draw can be claimed unless, within every 50 moves, a pawn is moved or a capture is made. Pawn moves and captures are irreversible. This implies that, if the game does not end in another way, eventually the pawns cannot move or they promote to pieces, and all the pieces will be captured. Then there is no way to avoid a draw by the 50-move rule. Second, there is a finite number of moves in each position. There are only so many pieces that can move to so many squares. These two facts prove that there is a limited number of games. In his seminal paper "Programming a Computer for Playing Chess" (Philosophical Magazine, Ser. 7, Vol. 41, No. 314, March 1950), Claude E. Shannon estimated the number of all games to be around 10^120.

In fact, the number of legal chess positions is much smaller. Shannon estimated this to be around 10^43. This estimate was rough and ignored promotions. Currently, the best proven upper bound is around 10^46.7. It seems, then, that a better approach than enumerating all the games is to list all the positions and determine whether they are winning or not. This can be done with a technique called retrograde analysis, which is widely used for generating endgame tablebases. Given the types of pieces on the board, it is possible to list all the checkmate positions. Then, working the moves backwards, all the positions from which a checkmate position can be reached with one move are listed and marked as wins in 1 ply for the checkmating side. Then all the positions (where it is the defender's turn) from which one can reach only positions previously marked as wins in 1 ply are marked as wins in 2 plies. The process is repeated until all the positions have been examined. This is a simplified explanation, but you probably get the idea. This has been done for all the material combinations with 7 pieces or less (see e.g. the Lomonosov Endgame Tablebases).
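The backward marking scheme is easiest to see on a toy game. Below is a minimal backward-induction solver (my own illustration in C++, not tablebase code) for a subtraction game: a pile of n stones, a move removes 1, 2, or 3 stones, and the player who cannot move (n = 0) loses. The win/loss labels propagate from the terminal positions exactly as described above, just without the chess-specific move generation.

```cpp
#include <cassert>
#include <vector>

enum Value { LOSS, WIN };

// Label every pile size from the terminal position upwards:
// a position is a WIN if some move leads to a LOSS position,
// otherwise it is a LOSS (all moves lead to WIN positions).
std::vector<Value> solve(int max_n) {
    std::vector<Value> v(max_n + 1, LOSS);  // n == 0 is a terminal loss
    for (int n = 1; n <= max_n; ++n) {
        for (int take = 1; take <= 3 && take <= n; ++take) {
            if (v[n - take] == LOSS) { v[n] = WIN; break; }
        }
    }
    return v;
}
```

For this game the solver rediscovers the known pattern: the side to move loses exactly when n is divisible by 4.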

FICTION 2: Chess can be solved with current hardware. There have been hoaxes claiming to have nearly solved chess. They are either good-natured, such as the April Fools' joke by ChessBase, or exploit people's good faith to collect money, such as the infamous Nikola Chess. It is very easy to be confused by big numbers given in exponent notation. For example, 10^47 and 10^53 seem to be quite close to each other. But their ratio is 1 to 1 000 000. Not exactly close. While modern GPUs can perform as many as 10^12-10^13 operations per second, this is light years away from what is required. Imagine we had the money to buy a million GPUs with 10^13 flops capacity each. That yields 10^19 flops in total. Assume we had an extremely efficient algorithm for retrograde analysis, taking only 10 operations per position. This means we can handle 10^18 positions per second. Assume we had enough memory for all the positions we need. How long would it take to go through all the chess positions? About 1 600 000 000 000 000 000 000 years. The universe is estimated to be about 13 000 000 000 years old. Can you count the zeros?
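A quick sanity check of the arithmetic above, using the post's own assumptions (10^46.7 positions, 10^18 positions per second):

```cpp
#include <cassert>
#include <cmath>

// Years needed to enumerate 10^log10_positions positions at a rate
// of 10^log10_rate positions per second. Working in log10 keeps the
// huge numbers inside double range.
double years_to_enumerate(double log10_positions, double log10_rate) {
    const double seconds_per_year = 3.156e7;  // ~365.25 days
    double log10_seconds = log10_positions - log10_rate;
    return std::pow(10.0, log10_seconds) / seconds_per_year;
}
```

With 10^46.7 positions at 10^18 positions per second this comes out at roughly 1.6 × 10^21 years, which is the figure quoted above.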

FACT 2: It is difficult to solve chess even in the near future. For years, the so-called Moore's law held true: computing power doubles every 18 months. Currently, as the fastest CPUs are running at 4 GHz, we seem to have hit the ceiling. While I was studying physics at a university in the beginning of the millennium, a professor said that he expected the upper limit to be around 10 GHz. Above that, the transistors must be so small that quantum phenomena begin to interfere with the computing. At the nanometer scale, tunneling of electrons happens with high enough probability to make the processing unreliable. The speed of light is another issue. Information cannot travel faster than light, so devices must be small to be fast. Light propagates a meter in about 3.3 nanoseconds. This becomes an issue if we need lots of memory that cannot be located on the same chip as the processor. Clever computer architecture might alleviate this, but not by much. And the real problem that seems to limit the performance is heat. Processors produce a lot of it. The more transistors there are, the more heat is emitted and the better cooling is required.

However, let us assume for a moment that Moore's law still held. With the increasing trend towards more parallel machines, we might buy some extra time. Or perhaps a quantum computer will be invented. One never knows. Currently, the world's most powerful supercomputers achieve 10^16 flops. Let us assume here the more realistic 100 operations per position. This means 10^14 positions per second. Let us assume we could run the retrograde algorithm on the supercomputer for four months, i.e. 10 million seconds. Then we can examine 10^21 positions. Now, increase the computing power according to Moore's law. When do we reach 10^46.7 positions? Simple math reveals: after about 85 of the 18-month periods, or 127.5 years, or around the year 2141. Not in our lifetime.
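The same estimate can be checked mechanically. A small helper (my own sketch, using the post's numbers) that counts the doublings needed:

```cpp
#include <cassert>
#include <cmath>

// Number of capacity doublings needed to go from 10^log10_now to
// 10^log10_target positions per run, assuming one doubling per
// Moore's-law period.
double doubling_periods(double log10_target, double log10_now) {
    return (log10_target - log10_now) / std::log10(2.0);
}
```

doubling_periods(46.7, 21.0) evaluates to about 85.4 periods; at 18 months per period that is roughly 128 years, which is where the year 2141 comes from.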

IN CONCLUSION: Heuristic approaches to chess will rule for years to come. Developing chess algorithms that use heuristic evaluation functions is still meaningful. Future chess engines will have deeper searches, more accurate evaluation, and perhaps we will see the rise of GPU chess. I want to be part of that.


Hashing schemes

Eventually, I will have to implement a transposition table for the chess engine. That requires accessing global memory during the search. This, however, should not be a problem if I do not use the transposition table at the leaf nodes or in the quiescence search. Transpositions occur most often in the deepest branches, but the subtrees are smallest near the leaf nodes; most of the benefit comes from nodes that are roots of larger subtrees. And fortunately, probing the transposition table is always cheaper than move generation, which involves global memory writes.

I have been thinking about the best way to implement hashing. Many chess engines use Zobrist keys for hashing. They require serializing the bitboards and using a table of random values for each piece and square. Although simple and effective, I do not like this approach, because it requires lookup tables (about 6 kB), which I have tried to eliminate. I would like to find an alternative hashing scheme. It does not matter if the hashing function is more complicated, as long as it does not consume the precious memory.

The good thing about Zobrist keys is that each square and each piece has an equal contribution to the final hash value, so positions are spread uniformly in the hash table. I would like to achieve similar results with fewer random numbers, which could be hard-coded as constants in the hashing function. I could interleave the hash key calculation with the global memory writes, so I would get the keys for free.
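One possible table-free scheme, sketched below purely as an illustration (this is not the engine's code, and the constants are generic splitmix64-style ones): run each of the six bitboards through a 64-bit mixing function with hard-coded constants and XOR the results together. No 6 kB Zobrist table is needed; the handful of constants lives in the code itself.

```cpp
#include <cassert>
#include <cstdint>

// splitmix64-style finalizer: an invertible 64-bit mixer, so two
// different inputs can never produce the same output.
uint64_t mix(uint64_t x) {
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}

// Hash a position given as six bitboards. A distinct additive
// constant per bitboard keeps, e.g., the pawn and knight boards
// from cancelling each other out in the XOR.
uint64_t hash_position(const uint64_t bb[6]) {
    uint64_t h = 0;
    for (int i = 0; i < 6; ++i)
        h ^= mix(bb[i] + 0x9e3779b97f4a7c15ULL * (uint64_t)(i + 1));
    return h;
}
```

Unlike Zobrist hashing, this cannot be updated incrementally per move; it has to be recomputed from the bitboards, which is why interleaving it with the global memory writes matters.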


Ants at a picnic

In his paper about Dynamic Tree Splitting (DTS), Robert Hyatt likened the processors working on a branch of the search tree to ants eating crumbs at a picnic. They eat one crumb together, and it gets smaller and smaller until it totally disappears. Then the ants turn their attention to another crumb and start eating. This continues as long as there are crumbs. Making the processors work together like this requires careful design. Specifically, the processors are all equal; there are no masters or slaves, only peers. One processor may start working on a branch, but another one may complete it. The DTS paper explains how this is done.

While reading the DTS paper again, I noticed that there are many similarities between the supercomputers of the 1980s and modern GPUs. The shared memory architecture, for example, makes the DTS algorithm act similarly on both of them. In contrast, on clusters one must take into account the communication costs. But on GPUs the communication happens through memory, so it can be as fast as the memory accesses themselves. On the other hand, there are differences. GPUs typically have more complicated memory hierarchies than the ancient Cray supercomputers (I am not an expert on this, but it appears to me that their memory hierarchy was quite flat).

The memory hierarchy causes some problems on the GPU. Ideally, each processor, or in the case of a GPU each thread, should have a copy of the tree state. This is required because the processors must be equal: any one of them must be able to complete the search and return the value to the root of the tree. This takes some memory. Unfortunately, the memory consumption is greater than with my current approach, where I have one search stack per block of threads. I need one stack per thread. The stacks will be smaller, but not so much smaller that it would compensate for the increased number of stacks. And I will have to use global memory. A typical stack holds hundreds of positions, each of them taking 48 bytes. This could mean tens of kilobytes per stack. While the normal stack operations yield the same number of writes to global memory, splitting the tree causes extra operations, because the stacks must be copied. I have not tested this yet, so I cannot say whether this overhead is bad or not. Anyway, the finer granularity of the search will guarantee better occupancy of the multiprocessors, not only in theory but also in practice, meaning that the processors do fewer futile operations and spend most of their time searching nodes. Now, for example, when a processor has processed its stack, it simply waits for the others to complete theirs. This is clearly inefficient, and the DTS-like approach can definitely alleviate the problem.

It is difficult to say how successful my implementation will be in realizing the potential of the DTS algorithm. It will most probably take me a month or so to get this thing working at some level. I am travelling for the next two weeks, so I do not have access to a CUDA-capable computer. But I still have time to develop the ideas further.


Anatomy of the alpha-beta search

Chess is primarily a tactical game. Exact calculation of move series is important. That is most likely the reason why applications of probabilistic methods to chess engines have failed so far. I do not try to change this, because I believe that there is no replacement for accurate analysis. Most probably, I will use some version of alpha-beta pruning in my chess engine. The basic algorithm is serial, so one must carefully design how to best utilize parallelism.

Simply splitting the search so that different moves are given to different threads is inefficient for two reasons. First, the search of a move may improve the alpha bound. Then, if one has already started searching the other moves with the original bound, some effort is wasted. Second, the search of a move may cause a beta cut-off. Then searching all the other moves is totally wasted effort. Typically, parallel algorithms try to minimize this futile work.
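Both hazards are visible in a minimal serial alpha-beta in negamax form (the toy tree and the code below are my own illustration, not engine code): every searched child may raise alpha, and a score at or above beta terminates the node at once, so siblings started in parallel with stale bounds may be wasted work.

```cpp
#include <cassert>
#include <vector>
#include <algorithm>

// A tree node: leaves carry a static score (from the perspective of
// the side to move at the leaf); inner nodes carry children.
struct Node {
    int score;
    std::vector<Node> children;  // empty => leaf
};

int alphabeta(const Node& n, int alpha, int beta) {
    if (n.children.empty()) return n.score;
    for (const Node& c : n.children) {
        int v = -alphabeta(c, -beta, -alpha);  // negamax recursion
        alpha = std::max(alpha, v);            // hazard 1: alpha may rise
        if (alpha >= beta) break;              // hazard 2: beta cut-off,
                                               // remaining moves skipped
    }
    return alpha;
}
```

A parallel split must either re-search with the improved alpha or accept the wasted nodes, which is exactly the trade-off the algorithms below wrestle with.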

Because the move ordering is usually good in modern chess engines, thanks to iterative deepening, transposition tables, and various other techniques, beta cut-offs happen with the first move in about 90% of cases. That is why one should avoid splitting the work at Cut nodes (nodes whose children may cause a beta cut-off). Instead, splitting the work at All nodes (nodes all of whose children must be searched) is preferred. However, it cannot be known for sure which nodes are of which type. With a perfect move ordering it would be possible, but we cannot achieve that. And if we did achieve it, we would not need to search at all.

It could be possible to implement one of the existing parallel search algorithms. However, they do not scale that well to a large number of processors. Probably the best known algorithm is Dynamic Tree Splitting (DTS) by Robert Hyatt and the Cray Blitz team. Many programmers consider it difficult to implement. One cannot use the typical recursive structure of function calls in the search with it. For me that is not a problem, because the structure of my program is not recursive. That offers great flexibility, because any node can be searched by any thread. So I am thinking about implementing a DTS-like algorithm on the GPU.


Breaking News: 400M positions per second!

I did it. It was not so difficult after all. Because it was obvious that the global memory writes slow down the performance, I tried to figure out how to reduce their number. First, I noticed that I could not get rid of the six store operations for the six bitboards per position, because there are no instructions to write more than 64 bits at once. That gives the upper bound of 87.5M nps. I was not happy with that. Then I thought about ways to avoid putting the positions in the global memory. Because the shared memory is limited, I could not use it for the whole position stack. But what about the last ply? Well, you can guess...

So after eliminating the global memory writes for the last ply, I got huge performance benefits. Because only the second-to-last ply is written to the global memory, there are surely enough instructions (handling the last ply) to hide the memory latency. This also means that I can cope with fewer threads, as the performance converges towards the upper bound much faster with high instruction-level parallelism. Below are some initial results. I used 128 threads per block, because I found out that it is the optimal number for the current version of the move generator.

Performance in Mnps vs. number of blocks.

The line is not smooth in the beginning, because with different numbers of blocks, the multiprocessors get different numbers of blocks to process. Multiples of 60 blocks yield the best performance. That is four times the number of multiprocessors (15) on the GTX 480. I am not sure why this is good.

But now that I have a working and efficient move generator, I can start to think about the next big problem, which is the parallel search.


Approaching the performance goal

I have continued testing the move generator. After some obvious optimizations, I got the speed up to 2.5M nodes per second with one multiprocessor. Then it was time to give all the multiprocessors work to do. The approach I used was to give each of them a separate stack to operate on, because I noticed that it slows down the performance if different multiprocessors write to global memory addresses close to each other. Now one can give each multiprocessor its own starting position and let it process that. For perft(6), for example, this could mean calling perft(5) for each child position of the initial position.

Below is a figure that shows the scaling with more than one multiprocessor. A block has 512 threads and the number of blocks varies from 2 to 26 in the test. The line is not straight in the beginning, because not all multiprocessors have work until there are at least 15 blocks, i.e. 7.5k threads in this case.

Scaling with several multiprocessors. Speed (Mnps) vs. number of threads.

Further increasing the number of blocks allows more efficient hiding of the memory latency. Below is another figure that shows the scaling up to 400 blocks. Now we can see performance figures that are of the correct order of magnitude.

Scaling with a large number of blocks. Speed (nps) vs. number of blocks.

The maximum speed achieved is about 62M nps. I had calculated the absolute upper bound for the GTX 480 to be 175M nps if 48 bytes were written per position. I think that, in practice, the compiler messes things up and produces twice as many writes. I might have to change the data types used in the kernel to get the correct number of writes. With the larger number of global memory writes, the upper bound becomes 87.5M nps. Yet another thing that slows things down is how I have currently implemented the control parameters, e.g. depth and node count. I have a separate stack for them in the global memory. Eventually, I want to embed the control parameters into the position stack.


Perft running on the GPU

I finally got the move generator working properly, i.e. producing correct node counts, on the GPU. At least for one multiprocessor. I still have trouble getting it to work on multiple multiprocessors. But the good news is that, with one multiprocessor, the scaling with the number of threads is almost perfectly linear. Look at the picture below. It shows the perft(5) speed with different numbers of threads (192-576). With larger numbers of threads, something breaks and the performance decreases dramatically. I think this happens because the compiler runs out of registers and puts the variables in local memory, which is painfully slow.

Speed (million nodes per second) with varying numbers of threads. The numbers are averages of 10 runs of perft(5).

The linear scaling means, I think, that the global memory speed cannot be the factor limiting performance with one multiprocessor. Otherwise we would see a curve that bends downwards from the linear trend and converges to a horizontal line. But that does not happen here. Indeed, at 1.5 million nodes per second and 48 bytes per position, writing the positions requires only about 72 MB/s of bandwidth, when 177 GB/s is available. Something else is limiting the performance.

I used the compiler option to generate the PTX assembler code. That is over 15 000 lines, and about half of them are instructions (not labels or debug information). In other words, the kernel is quite long. It also has quite a few branches. But I still suspect that the compiler messes things up by putting things in local memory. I might have to write the assembler code manually. That is really the last resort. First, I will try to find compiler options that generate faster code. I will also consider other ways to arrange the computation.


Joy of Debugging

Nothing is more annoying than debugging a program that you think should work properly. After a few hours of writing functions to print out bitboards and differences between bitboards, I finally managed to get correct results for perft(). At first, capturing promotions were generated twice: first as promotions and then as captures. After I fixed that, I still had a bug with castling rights: I got too many castling moves. Because I do not have make() and unmake() functions, I need to make sure I clear the appropriate castling rights whenever a king or a rook moves (both captures and non-captures) or a rook gets captured (including by a capturing promotion). I had to go through every possible case, and that is about 900 lines of code. Why that many lines? Let me explain.

A simple move generator can be much less than 900 lines of code. But my move generator has move making and MVV-LVA move ordering embedded in it. In addition, I added a strict legality check to the move generator. The purpose is to have enough instructions between global memory writes. Because the memory speed is the limiting factor, it makes sense to do something useful while waiting for the write operations to finish. It is important to notice that global memory writes do not block the execution of other instructions that do not depend on the written data.

Now I am planning to test the code with different numbers of threads, beginning with a single thread and gradually increasing the count. If I record the nodes per second for the different numbers of threads, I can plot a graph that shows how the algorithm scales. I hope I can do this during this week.


Still debugging

I have been quite busy during the week. The little time I had, I spent debugging the code. It is surprisingly easy to introduce bugs in the move generator code when you make small changes to it. I had to implement code for printing out the positions. That helped a lot, and I found out that I had copied the wrong masks for preventing wrap-around in the Kogge-Stone algorithm (the A- and H-files were swapped).
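For reference, this is what the correct wrap mask looks like in a standard Kogge-Stone occluded fill towards the east (a plain C++ sketch, not my CUDA code): shifting east by one must never carry a bit from the H-file onto the A-file of the next rank, so anything that would land on the A-file is masked off with ~A-file.

```cpp
#include <cassert>
#include <cstdint>

const uint64_t NOT_A_FILE = 0xfefefefefefefefeULL;  // all squares except a1..a8

// Kogge-Stone occluded fill east: 'gen' are the sliders, 'empty' the
// empty squares. Masking the propagator with ~A-file blocks wrap-around
// at every doubling step (shifts by 1, 2, and 4).
uint64_t east_occluded_fill(uint64_t gen, uint64_t empty) {
    empty &= NOT_A_FILE;
    gen |= empty & (gen << 1);
    empty &= empty << 1;
    gen |= empty & (gen << 2);
    empty &= empty << 2;
    gen |= empty & (gen << 4);
    return gen;
}

// East attacks: shift the fill one more step east (including the
// first occupied square, i.e. a possible capture target).
uint64_t east_attacks(uint64_t sliders, uint64_t empty) {
    return NOT_A_FILE & (east_occluded_fill(sliders, empty) << 1);
}
```

Swapping the mask to the H-file here (the bug I had) lets bits wrap from h1 to a2, which is exactly the kind of error that only shows up as a wrong perft count.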

Now that I have the sequential code working, I must still figure out how to arrange the parallel execution in an optimal way. I might just launch a constant number of threads and let them handle a bunch of positions. The basic idea is to have a kind of queue of positions for processing. The threads take the positions at the beginning of the queue, generate their successors, and put those at the end of the queue. As long as there are positions below the maximum depth, the threads continue to process them. The obvious problem is that this is a breadth-first-search approach, so the size of the memory becomes the limiting factor, and perft(6) is already too much.

So, instead of a queue, I could use a stack. Then the threads take the positions on top of the stack and put the new positions back on top. Care must be taken to handle the new and old positions properly. The size of the memory area required is then dependent on the number of threads. This is more manageable, and the threads are probably less divergent, as they handle similar types of positions.


Eliminating lookup tables

Sometimes it is useful to look at things from a different perspective. This applies to chess programming as well as to other things. When programming for the CPU, you usually try to minimize the instruction count in order to optimize performance. In GPU programming, it is often better to minimize memory usage, because memories and caches are small, while basic arithmetic instructions are relatively cheap. The bottleneck is the memory speed, not the execution of the instructions. As a side product, I have learned new ways to do things. These might be useful in the future, even when optimizing a CPU chess engine.

I have implemented a prototype for move generation. It currently runs on the CPU, but it should be straightforward to port it to the GPU. The main reason for this is that it does not use any lookup tables, so it does not need shared memory and I do not have to struggle with the memory limit. The side product is that the CPU version is very cache friendly. Of course, the instruction count is larger than with the lookup table approaches, but that does not matter. The generator is bitboard-based and it uses the Kogge-Stone algorithm for sliding piece move generation. So, basically, there are lots of shift- and xor-operations and stuff like you usually see in bitboard engines.

I decided to pack the position in 48 bytes or six 64-bit variables. This is a reasonable compromise between size and instructions required for unpacking the information. It is important to keep the size small, because the positions will be written in the global memory on the GPU. Now the six variables take three 128-bit cache lines or three memory operations per position. Packing the positions tighter would not help unless I can fit everything under 256 bits resulting in only two memory operations. But that I cannot do. Well, actually, the smallest number of bytes for a position (including side to move, en passant square, castling rights, and counter for the 50-move rule) that I have found is 32 bytes or 256 bits, but it is not very practical. So I am happy with 48 bytes. The six bitboards are white pieces, pawns, knights, bishops+queens, rooks+queens, and kings. The extra information required is packed in the lowest and highest eight bits of the pawn-bitboard, because they are never used for pawns in legal positions.
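The spare-bits trick can be sketched as follows (the field layout below is my own guess for illustration, not necessarily the engine's): since pawns never stand on ranks 1 and 8, the low and high bytes of the pawn bitboard give 16 free bits, enough for the side to move (1 bit), castling rights (4), the en passant file plus a valid flag (4), and the 50-move counter (7).

```cpp
#include <cassert>
#include <cstdint>

struct Extras {
    int stm;       // side to move, 0 or 1
    int castling;  // 4 castling-right bits
    int ep_file;   // en passant file, 0-7
    int ep_valid;  // 1 if an en passant capture is possible
    int halfmove;  // 50-move-rule counter, 0-127 fits in 7 bits
};

// Low byte: stm (bit 0), castling (bits 1-4), ep file (bits 5-7).
// High byte: ep valid (bit 0), halfmove counter (bits 1-7).
uint64_t pack(uint64_t pawns, Extras e) {
    uint64_t lo = (uint64_t)e.stm | ((uint64_t)e.castling << 1)
                | ((uint64_t)e.ep_file << 5);
    uint64_t hi = (uint64_t)e.ep_valid | ((uint64_t)e.halfmove << 1);
    return (pawns & 0x00ffffffffffff00ULL) | lo | (hi << 56);
}

Extras unpack(uint64_t packed) {
    uint64_t lo = packed & 0xff, hi = packed >> 56;
    return { (int)(lo & 1), (int)((lo >> 1) & 0xf), (int)(lo >> 5),
             (int)(hi & 1), (int)(hi >> 1) };
}
```

Masking with 0x00ffffffffffff00 before reading the actual pawns back out is all the "unpacking" the move generator needs.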

The idea of writing the positions into the global memory led to another idea. I can interleave the global memory writes with the move generation code. I do not generate move lists at all. I make each move immediately when generating it and store the resulting position. Again, no shared memory is required. A position is so small that, ideally, it can fit in registers. A nice side product is that I do not need a data structure for a move, and I do not have to care about packing and unpacking information in it. This actually saves instructions, although that was not the goal here. It also saves a few ifs, because I do not need branches for the special moves in either the move generator or the make() and unmake() functions. Indeed, there are no make() and unmake() functions at all, so there cannot be any ifs in them.

Yet another data structure missing from my implementation is the board of 64 squares telling which piece is on each of them. This is useful when determining whether a piece is captured and which piece it is. But I do not need it, because my move generator produces the captures so that it always knows which piece is captured. The capture generator is designed to produce the captures in the commonly used Most Valuable Victim - Least Valuable Aggressor (MVV-LVA) order. This can be done by using the bitboards only.
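A reduced sketch of the idea (my own illustration, cut down to a single aggressor type, white pawns, so it stays short): loop over victim bitboards from most to least valuable and emit all captures of that victim type before moving on. A full generator adds an inner loop from the least to the most valuable aggressor.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

enum Piece { PAWN, KNIGHT, BISHOP, ROOK, QUEEN };  // victim types

const uint64_t NOT_A = 0xfefefefefefefefeULL;
const uint64_t NOT_H = 0x7f7f7f7f7f7f7f7fULL;

// Squares attacked by white pawns: north-east (<< 9) and north-west
// (<< 7) shifts, masked against wrap-around at the board edges.
uint64_t pawn_attacks(uint64_t pawns) {
    return ((pawns << 9) & NOT_A) | ((pawns << 7) & NOT_H);
}

// Victim piece types in MVV order, using only bitboard tests: no
// 64-square board array is consulted to identify the captured piece.
std::vector<Piece> mvv_order(uint64_t pawns, const uint64_t victims[5]) {
    std::vector<Piece> order;
    uint64_t att = pawn_attacks(pawns);
    for (int v = QUEEN; v >= PAWN; --v)   // most valuable victim first
        if (att & victims[v]) order.push_back((Piece)v);
    return order;
}
```

The victim's type is known because each victim bitboard is intersected separately; that is the whole trick that makes the square-indexed board redundant.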

I also got rid of the bit scan operations, because I do not need square numbers to index lookup tables. That is simply an unnecessary intermediate step, because the resulting positions consist of bitboards, not square indices. There are relatively efficient ways to extract the square number of the least significant bit, like the de Bruijn multiplication, which requires a 64-element lookup table, but if I can live without it, even better.
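For contrast, this is the kind of de Bruijn bit scan that can now be left out; it is the well-known table/multiplier pair from the computer chess literature, shown here only to make the saved 64-entry table concrete:

```cpp
#include <cassert>
#include <cstdint>

// 64-entry lookup table paired with the de Bruijn constant below.
const int index64[64] = {
     0, 47,  1, 56, 48, 27,  2, 60,
    57, 49, 41, 37, 28, 16,  3, 61,
    54, 58, 35, 52, 50, 42, 21, 44,
    38, 32, 29, 23, 17, 11,  4, 62,
    46, 55, 26, 59, 40, 36, 15, 53,
    34, 51, 20, 43, 31, 22, 10, 45,
    25, 39, 14, 33, 19, 30,  9, 24,
    13, 18,  8, 12,  7,  6,  5, 63
};

// Index of the least significant set bit; bb must be non-zero.
// bb ^ (bb - 1) isolates the LS1B and all bits below it; the
// multiplication maps that pattern to a unique 6-bit table index.
int bit_scan_forward(uint64_t bb) {
    const uint64_t debruijn64 = 0x03f79d71b4cb0a89ULL;
    return index64[((bb ^ (bb - 1)) * debruijn64) >> 58];
}
```

Every call costs a table load; in a design whose outputs are bitboards anyway, skipping the square index skips that load entirely.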

I have to say I am very satisfied with the software architecture right now. Let's see if that changes when I have a chance to test the move generator performance.


Bits and pieces

I have been thinking about the GPU implementation again. The biggest issue is the memory consumption. I have previously shown that a position can be packed into about 40 bytes by using bitboards, so that everything can be unpacked with a few bit operations. Then the positions can reside in the shared memory or even in the registers. But the move lists are a problem. To be exact, there are two kinds of move lists: those generated by the move generation routine for each position, and those that are put on the stack when recursively calling the search. I call the former move lists and the latter move stacks.

Moves in the move stack require more information, because they are used for undoing moves, and that involves restoring the position as it was before the move, including castling rights, en passant squares, and half-move counters for the 50-move rule. But for move lists, less than that is sufficient, because the original position is known. The least amount of information required to unambiguously define a move in a known position is the to- and from-squares plus a possible promotion. This can be fitted into two bytes. That is enough for the move lists produced during move generation. With two bytes per move and an average of 40 moves per position, this means about 80 bytes per ply. Then, for example, a 15-ply search takes 1200 bytes for move lists, and we are out of shared memory if there are more than 40 threads. So I have to do something differently.
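The two-byte encoding works out as 6 + 6 + 4 bits; a sketch (the field order is arbitrary here, any split that fits 16 bits works):

```cpp
#include <cassert>
#include <cstdint>

enum Promo { NONE, KNIGHT, BISHOP, ROOK, QUEEN };

// 16-bit move: from-square in bits 0-5, to-square in bits 6-11,
// promotion piece (if any) in bits 12-15.
uint16_t encode_move(int from, int to, Promo p) {
    return (uint16_t)(from | (to << 6) | ((int)p << 12));
}

int   from_sq(uint16_t m) { return m & 63; }
int   to_sq(uint16_t m)   { return (m >> 6) & 63; }
Promo promo(uint16_t m)   { return (Promo)(m >> 12); }
```

Castling and en passant need no extra fields in a move list, because the known parent position disambiguates them from the from/to squares alone.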

What I have been thinking is that I write a kernel that takes a position from the global memory, generates the moves, makes them, and writes the resulting positions back to the global memory. The kernel is launched several times, always taking the previously generated positions as input and producing output that is one ply deeper. The problem with this approach is that there are typically tens of positions to be written while there is only one position to read from the memory. The global memory bandwidth on my GPU (NVIDIA GeForce GTX 480: 177 GB/s) is sufficient for writing three positions per clock cycle. So, given the average of 40 positions per thread, this means about 13 clock cycles per thread, assuming a large number of threads are running and the amortized cost calculation is valid. I think this is of the same order of magnitude as it takes to generate a move, so it could be possible to hide the latency quite well.

Eventually, I might end up using the Kogge-Stone-style move generator, because it does not require lookup tables for sliding pieces. For other pieces, the lookup tables take just a few hundred bytes of shared memory. Since the positions can be in registers during the processing, the shared memory consumption is not that important. The move generator does not produce a move list, but makes the moves directly and writes the resulting positions to the global memory. Surprisingly, this might be the fastest way to do it.

I still have to find a way to implement the search, because this architecture is fine for perft(), but not for alpha-beta search as such.


Learning from Experts

Last week, I was listening to two presentations given by two guys working at NVIDIA Research. The presentations were about building tree structures on the GPU and about fast ray tracing on the GPU. These provided some insights into GPU programming and also assured me that I am on the right track. The presentation about ray tracing, in particular, gave me some fresh ideas. Using global memory is not impossible if the memory latency is hidden by a large instruction count and a sufficient number of threads. Another idea is to split the program into small subprograms, because smaller kernels take fewer registers and it is easier to make them branchless. This means that multiple kernel launches are required. Kernel launches are slow, so the subtasks must be large enough, or there must be a large number of threads, to compensate for the overhead. The challenge is to split the chess engine into small kernels.

I found another very useful presentation, by Vasily Volkov: Better Performance at Lower Occupancy. The main observation is that memory latency often limits performance, which is why maximizing multiprocessor occupancy is not the same as maximizing instruction throughput. Surprisingly, lower occupancy often leads to higher throughput. It is clear that global memory is slow, but it might not be as clear that even the shared memory has noticeable latency. It is about six times slower than the registers on the GPU that I have used (NVIDIA GeForce GTX 480). This explains why I got only about one sixth of the instruction throughput in my tests. It is better to use registers whenever possible. On the other hand, the number of registers per multiprocessor is limited. The more threads there are on a multiprocessor, the fewer registers there are per thread. That is why it is sometimes useful to have fewer threads on a multiprocessor, which means more registers for each thread and faster execution, because shared memory is accessed less often. Simple, but not necessarily obvious.

I have come to the conclusion that thread-level parallelism is the only way to compete with modern CPUs, which have multiple cores at higher clock rates than the GPUs. For example, the GTX 480 card has 15 multiprocessors, and if each of them runs one warp of 32 threads, that is 480 threads in parallel. This means 960 instructions per clock cycle if the instructions are multiply-adds, or 480 instructions per clock cycle for other simple instructions. With the shader clock rate of 1.4 GHz, this means either 1.345 Tflops (mentioned in the documentation as the peak computing power) or half of that in practice. If we are able to compute only one execution path per warp, then, most of the time at least, we are computing only 15 instructions per cycle. That is only 21 Gflops. Modern CPUs with multiple cores can achieve that with more flexibility. The order-of-magnitude speedup from thread-level parallelism is the only thing that justifies the use of GPUs.

The global memory bandwidth on the GTX 480 is 177 GB/s. That is roughly 128 bytes per clock cycle (at 1.4 GHz) for all the threads together. So, if I store chess positions in the global memory, loading one position costs about one clock cycle of aggregate bandwidth, assuming a board size of 128 bytes. If several threads work on the same position, this is a tolerable delay. For example, in the move generator I might run 256 threads in parallel to generate the moves in a position. The problem is writing the results back to the global memory. This can take considerably longer, depending on the number of moves. Assuming that the move generation itself takes tens of instructions per move, this might be a major factor limiting the performance. Actually, it is possible to estimate an upper bound by assuming that the memory traffic dominates all the other execution costs. If we budget, say, 1 kB of traffic per position (the position itself plus its move list), that is about 8 clock cycles per position. Dividing the clock rate of 1.4 GHz by 8 gives us 175 million positions per second. The order of magnitude is what we hoped for, if the goal is to produce 100 million positions per second. Not bad at all. Now I only have to implement it.
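The bandwidth arithmetic above can be written out as a small calculation. This is only a sketch of the estimate; the function name and the 1 kB per-position traffic budget are my own assumptions, not measured values.

```cpp
#include <cassert>

// Upper-bound estimate for positions processed per second, assuming the
// global memory traffic dominates all other costs. All threads share the
// aggregate bandwidth.
double positionsPerSecond (double bandwidthGBps, double clockGHz,
                           double bytesPerPosition)
{
    double bytesPerCycle = bandwidthGBps / clockGHz;         // aggregate
    double cyclesPerPosition = bytesPerPosition / bytesPerCycle;
    return clockGHz * 1e9 / cyclesPerPosition;               // positions/s
}
```

With the GTX 480 figures (177 GB/s, 1.4 GHz) and a 1 kB budget per position, this reproduces the roughly 175 million positions per second mentioned above.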


Recently, I have been writing a CPU chess engine. There are several reasons why I am doing this instead of working on the GPU chess engine. First, I need a reference against which to compare the GPU engine. Second, I can test things more easily with the CPU engine, because it is easier to debug the code and, e.g., set up test matches to tune the evaluation function. Third, I might end up with quite a strong CPU engine, which is an achievement in itself.

The CPU chess engine now has a fully functional move generator, which produces correct results in all the test cases I could find for perft(). The speed is a modest 8-12 million nodes per second (no bulk counting; every leaf node is visited) on a single core of a 3.2 GHz CPU. I might try to optimize this, but it is not the first thing on my priority list.

I also implemented a simple evaluation function. It is based on material, mobility, king safety, and pawn structure. I have not tried to tune the evaluation, because I will replace the evaluation function with a more principled approach. I have a general idea of how the evaluation function should work, but I have not had time to work out the details yet. This is the most enjoyable part of the whole project, so I am in no rush.

Then, I have the alpha-beta search (with quiescence search) implemented with iterative deepening at the root level. The principal variation is stored and searched first. The full principal variation search is not yet implemented, because I want to be sure I get the transposition table working properly first. I have tried null-move pruning and razoring at one ply from the leaf nodes, but currently they are not included.

The transposition tables are implemented using Zobrist keys for hashing. The keys are calculated for the initial position and updated incrementally in the make()- and unmake()-functions. This does not add much to the instruction count, so it is fast. I have tested the transposition tables with the perft()-function and I get the correct results, but much faster. I use two transposition tables, one with an always-replace strategy and another with a depth-preferred strategy. Tests with the perft()-function show about a 15 % speedup compared to either strategy alone. I have not yet combined the hash tables with the alpha-beta search, because I want to be absolutely sure I do it properly. One must be careful with the type of the entries and with the mate scores.
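The incremental update can be sketched as follows. This is a minimal illustration, assuming 2 colors, 6 piece types, and 64 squares; the names (zobristTable, initZobrist, updateHash) are my own, not from the engine's actual source.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// One random 64-bit key per (color, piece type, square), plus one key
// for the side to move.
static uint64_t zobristTable[2][6][64];
static uint64_t zobristSideToMove;

void initZobrist ()
{
    std::mt19937_64 rng (0xC0FFEE);    // fixed seed for reproducibility
    for (int c = 0; c < 2; c++)
        for (int p = 0; p < 6; p++)
            for (int s = 0; s < 64; s++)
                zobristTable[c][p][s] = rng ();
    zobristSideToMove = rng ();
}

// A quiet move touches only three keys: XOR the piece out of the
// from-square, into the to-square, and flip the side to move. unmake()
// applies the exact same XORs, because XOR is its own inverse.
uint64_t updateHash (uint64_t hash, int color, int piece, int from, int to)
{
    hash ^= zobristTable[color][piece][from];
    hash ^= zobristTable[color][piece][to];
    hash ^= zobristSideToMove;
    return hash;
}
```

Captures, promotions, castling, and en passant XOR a few more keys, but the principle is the same, which is why the cost per move stays small.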

Although the development is still far from completed, I already have a chess engine that is able to play decent chess. I have played against it a few times now, and have not been able to beat it yet. It is strange that the engine seems to understand some positional concepts that I have not programmed into the evaluation function yet. I find two possible explanations for this: 1) the observed positional understanding follows from tactical considerations, and/or 2) some commonly used evaluation terms are overlapping, so the missing terms are implicitly included in the existing terms. This is interesting and raises a question: how to implement an evaluation function with (almost) orthogonal terms?


Without exceptions...

...the world would be perfect. Unfortunately, it is not. Specifically, the rules of chess could be elegant, but there are some special moves that make things annoyingly difficult: the double pawn move, promotion, en passant, and castling. In addition, making sure the side to move is not in check after the move can cause headaches. I think my move generator is only half as fast as it could be without these exceptions.

I am trying to find ways to handle the special moves like normal moves. This is not easy. Some moves require extra work. In castling, two pieces move. In promotion, the piece type changes during the move. In en passant, the captured piece is on a different square than the destination square of the moving piece. I do not want to make the make()- and unmake()-functions much more complicated. In a typical chess engine, there are ifs testing for the special cases. I want to avoid ifs even where thread divergence is not an issue.

I tried to fit the move data in 32 bits, meaning all the data required for restoring the position after undoing the move. I could do it easily without the special moves, but with them, I think I need more. A general move might look like this:

captured pc. 2 | piece 2 | to sq. 2 | from sq. 2 | captured pc. 1 | piece 1 | to sq. 1 | from sq. 1
4 bits         | 4 bits  | 6 bits   | 6 bits     | 4 bits         | 4 bits  | 6 bits   | 6 bits

That is 40 bits. Is it enough? Let's see. We have the to- and from-squares, plus the moving piece and the captured piece, two times. Obviously, regular moves (captures or not) require only one set of the fields. The other set can then be set to point to a dummy square and piece. The special moves require both sets. Castling consists of one king move and one rook move, without captured pieces. En passant consists of a pawn move which captures on the en passant square, and of another move where an empty square captures the opponent's pawn (sounds funny, but it works). Promotion consists of a pawn move (a capture or not) and of another move where the promoted piece captures the pawn on the promotion square (again, sounds odd, but it works).

Then, some extra fields are required for restoring the position. These include the castling rights, the en passant square, and the move counter for the 50-move rule. Finally, in my implementation, I need to store the piece list indices of the captured pieces, which is an extra 4 bits per captured piece. So the eventual move structure could be:

move counter | EP sq. | castling rights | captured pc. 2 | piece 2 | to sq. 2 | from sq. 2 | captured pc. 1 | piece 1 | to sq. 1 | from sq. 1
8 bits       | 4 bits | 4 bits          | 4+4 bits       | 4 bits  | 6 bits   | 6 bits     | 4+4 bits       | 4 bits  | 6 bits   | 6 bits

This is not the tightest packing possible, but it takes 64 bits, which is convenient for alignment reasons. For the GPU, it is better to pack things like this, although a few extra instructions are required to unpack them, because memory reads take time, and going over the 1 kB per warp limit reduces the multiprocessor occupancy rate.
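The packing and unpacking can be sketched like this. The field order and names are my own illustration; only the field widths come from the layout above.

```cpp
#include <cassert>
#include <cstdint>

// All move fields, unpacked. The 4+4-bit captured fields hold the captured
// piece type plus its piece list index.
struct MoveFields {
    unsigned from1, to1, piece1, captured1;   // sub-move 1
    unsigned from2, to2, piece2, captured2;   // sub-move 2
    unsigned castling, epSquare, counter;     // restoration data
};

// Pack all fields into one 64-bit word, lowest field first.
uint64_t packMove (const MoveFields &m)
{
    uint64_t v = 0;
    int shift = 0;
    auto put = [&] (uint64_t field, int bits) { v |= field << shift; shift += bits; };
    put (m.from1, 6); put (m.to1, 6); put (m.piece1, 4); put (m.captured1, 8);
    put (m.from2, 6); put (m.to2, 6); put (m.piece2, 4); put (m.captured2, 8);
    put (m.castling, 4); put (m.epSquare, 4); put (m.counter, 8);
    return v;                                  // shift is now exactly 64
}

// Unpacking is a shift and a mask per field; a few cheap ALU instructions.
unsigned unpackFrom1 (uint64_t v)   { return v & 63; }
unsigned unpackTo1 (uint64_t v)     { return (v >> 6) & 63; }
unsigned unpackCounter (uint64_t v) { return (v >> 56) & 255; }
```

The widths sum to exactly 64 bits, so one aligned 8-byte load or store moves the whole thing.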


Back to the drawing board

In a previous post I wrote about the Kogge-Stone-style move generator I had implemented. Unfortunately, I forgot that I must store the generated moves somewhere in the shared memory. If a position takes 64 bytes, I have only 960 bytes left below the 1 kB limit. If a move takes 4 bytes, that means I can have at most 240 moves in memory. Obviously, generating all the moves at once is not an option if I want to do more than 1-ply searches. I might have to use the one-move-at-a-time generation routine that I mentioned earlier. I have it working on the CPU at least.

The main idea is to have an index or a pointer to where we left off during the move generation. We generate a move, make the move, call the recursive search, unmake the move, and then generate the next move based on the pointer. The challenge is to construct the pointer so that it can simply be incremented to get the next move. There can be 16 pieces per side, so 4 bits determine the piece. Then there are at most eight directions in which to move a piece (knights, queens, and kings). That takes 3 bits. Finally, the sliding pieces can go at most seven steps in one direction. That takes 3 more bits. So in total the pointer has 10 bits. With 10 bits one can represent 1024 numbers, which is a lot more than there are moves in any position. So skipping over the invalid pointer values is important for efficiency.
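The 10-bit pointer described above can be encoded and decoded with a few shifts and masks. This is a minimal sketch; the function names are mine.

```cpp
#include <cassert>

// 4 bits piece index (0-15), 3 bits direction (0-7), 3 bits step (0-7).
int makePointer (int piece, int direction, int step)
{
    return (piece << 6) | (direction << 3) | step;
}

int pointerPiece (int ptr)     { return ptr >> 6; }
int pointerDirection (int ptr) { return (ptr >> 3) & 7; }
int pointerStep (int ptr)      { return ptr & 7; }
```

Incrementing the pointer walks through steps first, then directions, then pieces, which is exactly the nesting order of the move generation loops.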

In addition, I have turned my attention to the 0x88 board representation. It is memory-friendly, because it does not require large look-up tables. Basically, all that is required is a list of directions for each piece type. I am looking for an elegant way to combine the 0x88 board representation with the pointer-based move generator. There are lots of interesting problems to solve.


More low level stuff

After playing with the multiprocessor occupancy calculator provided with the CUDA SDK, I have concluded that a warp must not take more than 1 kB of shared memory. Then, other things being optimal, 100 % occupancy can be achieved with at least six warps running on one multiprocessor. This is challenging, because a position takes several 8-byte bitboards, and moves take 4 bytes each. Move lists are required both for storing the current search path and for move generation.

I do not want to use large look-up tables for the move generation, because the memory space is restricted. It seems that magic bitboards and even rotated bitboards require too much memory. I will try a Kogge-Stone move generator. It is probably the fastest way to generate moves without look-up tables. Currently, I have most of the move generator written in CUDA C and it compiles without errors. I have not had time to write the perft()-function yet, because I am currently quite busy at work. This GPU chess project is going forward, but it takes time.

I found a way to parallelize the capture generation part of the Kogge-Stone generator. This will speed up the quiescence search. I think that it is possible to use the same kind of optimization with parts of the evaluation function. Then, the computing power is best utilized where it matters the most: near the leaf nodes.

I will try to write the performance test function this week. I will write more about it when I have results.


Low level programming

The source code of a chess engine often has plenty of bit twiddling in it. In particular, engines using bitboards require lots of bit operations: exclusive ors, ands, complements, and shifts to the left or right. These operations can be implemented in a high-level programming language, most often C/C++. Modern C compilers are quite smart and do all kinds of optimizations, especially when asked to optimize the code. Therefore it is not strictly necessary to use assembly code in the engine. In addition, assembly code can be quite difficult to debug.

Unfortunately, the nvcc compiler used for compiling CUDA C does not seem to be that smart. It uses lots of registers, does unnecessary type conversions, and uses more instructions than necessary. The register usage is perhaps the most severe problem, because the occupancy rate of the multiprocessors depends on the number of registers per thread. I wrote a simple test program in CUDA C and examined the PTX assembly code produced by the compiler. Then I wrote the same program directly in PTX assembly, and got the same result with half the register count and half the instruction count. This means a speedup by a factor of two or more. I will certainly write the core of the engine in PTX assembly, because wasting computing power for nothing feels bad.

So what should be written in assembly? I think, the functions that are called most often. These include the make()- and unmake()-functions, move generation, search, and evaluation. Initially, I will write a perft()-function to test the speed and validity of the move generation. It has to be faster than the CPU move generator, or otherwise I will have to re-design the whole thing. I will let you know when I have something working.


Wild ideas

Sometimes small steps and patience are all you need for success. But not always. Man did not go to the moon by only taking small steps. At first, someone had to have the wild idea that it is possible to go to the moon. Then, someone had to invent the rocket engine. Only after the ideas and the technology were there could the basic engineering work begin. Similarly, I do not believe that a GPU can play good chess by just applying the known algorithms and making small improvements. Either the GPUs must change or a new algorithm has to be invented. While waiting for the former, I will try to do the latter.

One idea I have been playing with is using only zero window (or null window) searches on the GPU. A zero window search gives a boolean result: the position is either better or worse than the given bound. It could be possible to launch hundreds of zero window queries from different positions with different bounds. The positions and bounds would be chosen so that the expected utility of the gained information is maximized. The partial game tree would be stored and updated by the CPU.
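To make the boolean nature of a zero window query concrete, here is a toy null-window negamax on a hard-coded tree rather than on chess positions. The Node structure and the function name are my own illustration, not part of the engine.

```cpp
#include <cassert>
#include <vector>

// A toy game tree node: leaves carry a static value from the perspective
// of the side to move at that leaf.
struct Node {
    int value;
    std::vector<Node> children;   // empty for leaf nodes
};

// Null-window negamax: answers "is the value of this node at least
// 'bound'?" without computing the exact value. A node's value is at least
// 'bound' iff some child's value (from the opponent's view) is at most
// -bound, i.e. the child fails the question "at least 1 - bound?".
bool atLeast (const Node &n, int bound)
{
    if (n.children.empty ())
        return n.value >= bound;
    for (const Node &child : n.children)
        if (!atLeast (child, 1 - bound))
            return true;
    return false;
}
```

Running such queries with different roots and bounds yields one bit of information each, which is the kind of result that is easy to collect from many GPU threads at once.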

I think it makes sense to use both the CPU and the GPU, because each has its own strengths. The CPU is more flexible, so it is better suited for control tasks. The GPU, on the other hand, is a workhorse that can be given mundane number-crunching tasks. A simple idea would be to search the principal variation (PV) nodes with the CPU and let the GPU search the other nodes with zero windows. However, that is not a very efficient use of computing power. The subtrees given to the GPU must be large enough to compensate for the extra time required for GPU kernel launches. The sizes of the subtrees must also be similar, because the slowest thread determines the performance of the whole search. In addition, the subtrees must have bounds from the PV nodes before they can be run.

There are sophisticated algorithms for splitting the search tree dynamically (see, e.g., the DTS high performance parallel tree search algorithm), but they usually require more communication between the processors than is possible on the GPU. If there is no communication during the search, we must somehow estimate the subtree sizes before the search in order to balance the workload. It might be possible to do this by utilizing information from shallower searches. Shallow searches and iterative deepening are often useful for move ordering purposes; we could use the same data for load balancing as well.

Hopefully I can find time for coding in the near future. Then I can tell more.


Transposition table and hashing issues

The transposition table is an essential part of an efficient chess engine. It stores the results of searched subtrees in a hash table. Then, if the search reaches the root of such a subtree by a different move order, it can use the stored results directly instead of repeating the search. It is easy, however, to introduce nasty bugs with its implementation. Two kinds of errors arise with transposition table code: (1) hash collisions, and (2) invalid use of hashed search results. In addition, the transposition table must, most probably, reside in the global memory. That is not only slow, but also allows a third class of errors: corruption of the hash entries by concurrent actions of multiple threads. Let us examine the possible errors one by one and then sketch a plan for the implementation.

Hash collisions occur when two positions are either given the same hash key or mapped to the same slot in the hash table. The latter case is much more common than the former if the hash keys are long enough. It is also easier to handle: simply storing the hash keys in the transposition table and comparing them is enough to avoid these errors. The case where the hash keys are the same is more difficult. It can be avoided to some extent by producing the hash keys properly, e.g. by Zobrist hashing (see http://chessprogramming.wikispaces.com/Zobrist+Hashing). However, even with good hashing, the probability of two positions getting the same key is non-zero. With 32-bit keys, there are only about 4 billion possible keys. Then, if we have 4 million positions in the transposition table, the probability that a new position gets the same key as one already stored is about 1:1000. This might sound like a small number, but if you repeat this a million times, you get a thousand errors or so. That is enough to completely ruin the search. An easy solution is to use longer keys. With 64-bit keys, these errors occur far more rarely, but they still do occur, given a large enough transposition table and a search efficient enough to visit billions of nodes during a game. So one must design the transposition table carefully, so that the probability of these errors is minimized.
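The 1:1000 figure above follows from a simple ratio, which can be checked with a few lines. This is only the per-position estimate from the text (stored positions divided by the size of the key space), not a full birthday-paradox analysis; the function name is mine.

```cpp
#include <cassert>
#include <cmath>

// Probability that a freshly hashed position receives a key that is
// already held by one of 'storedPositions' entries, with keys of
// 'keyBits' bits: roughly storedPositions / 2^keyBits.
double perPositionCollisionProbability (double storedPositions, double keyBits)
{
    return storedPositions / std::pow (2.0, keyBits);
}
```

With 4 million stored positions, 32-bit keys give about 0.001 (the 1:1000 of the text), while 64-bit keys give about 2e-13, which is why the longer keys are so much safer.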

The hashed results can also be used for the wrong purposes. It is, of course, necessary to check that the stored entry is still valid and deep enough. Then, it is also necessary to check the search bounds. In particular, zero window searches do not give accurate results, but only fail high or low. It is then necessary to compare the search bounds stored in the transposition entry to the current ones. Sometimes it is still possible to use the transposition table entry for move ordering, even when the stored score cannot be used in the current search.
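The probing rules above can be sketched in code, assuming the usual three entry types: an exact score, a lower bound from a fail-high, and an upper bound from a fail-low. The names are illustrative, not from my engine.

```cpp
#include <cassert>

enum EntryType { EXACT, LOWER_BOUND, UPPER_BOUND };

struct TTEntry {
    int depth;        // remaining depth of the stored search
    int score;
    EntryType type;
};

// Returns true if the stored result may be used directly in a search with
// window [alpha, beta] and remaining depth 'depth'; *score then holds the
// usable value. A false return does not make the entry worthless: its best
// move can still be used for move ordering.
bool usableScore (const TTEntry &e, int depth, int alpha, int beta, int *score)
{
    if (e.depth < depth)
        return false;                                   // stored search too shallow
    if (e.type == EXACT) { *score = e.score; return true; }
    if (e.type == LOWER_BOUND && e.score >= beta)  { *score = e.score; return true; }
    if (e.type == UPPER_BOUND && e.score <= alpha) { *score = e.score; return true; }
    return false;
}
```

Mate scores need extra care on top of this (they must be adjusted relative to the current ply), which is one of the reasons I want to integrate this slowly and carefully.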

So where in the search tree should the transposition table be used? Its purpose is to save computation time. Two factors affect the potential saving: the number of nodes in the subtree corresponding to the hash entry, and the probability of the search probing the entry. The number of nodes in the subtree is greatest near the root. However, the probability of hitting the same node is zero for searches of fewer than 3 plies. After that, the probability increases gradually, as long as the transposition table is large enough. Imagine an infinitely large transposition table with perfect hashing. Then the probability of finding a position approaches 100 % as the depth of the search increases. Why? Eventually the search will visit all the positions that are required to prove the absolute result, i.e. a win for either side or a draw. This is only a theoretical consideration, because in practice the transposition tables are of finite size and the hashing is less than perfect. With a small transposition table, the probability of probing a node decreases once the number of stored nodes becomes comparable to the size of the table. This happens because we run out of slots in the hash table, and collisions lead to discarding data (either the just-searched result or the stored entry). This is one reason why saving leaf nodes in the transposition table does not sound good, the other being that the subtree is only one node plus possible quiescence search nodes.

I have made simple tests with a transposition table on the CPU. I have written a version of the perft()-function (see previous posts for an explanation) that uses a transposition table. The transposition table entries consist of four unsigned ints, which take 16 bytes of memory in total. Two of them are hash keys: the first is the primary key used for indexing, and the second is for checking the validity of the hash entry in case of a collision. I use the Zobrist hashing scheme. The other two values are the depth and the score. These are included just for testing, because perft does not return a score. In addition, the 128-bit entry size corresponds to the memory alignment of the global memory. I ran the tests with 256 MB and 1 GB transposition tables, with 16 M and 64 M entries, respectively. These correspond to possible global memory sizes on modern GPUs. I think we can use as much memory as possible, because global memory access is slow regardless of the table size. The newest GPUs have caches for the global memory, but due to the random access pattern of the hash probes, they are of little help. The results look like this (initial position, 1 GB table):

perft (7) = 3195075629 at 39.93 Mnps
depth 1 hash table: 42.477849 % hits, 18.217728 % collisions
depth 2 hash table: 58.034033 % hits, 4.928729 % collisions
depth 3 hash table: 34.097889 % hits, 1.510128 % collisions
depth 4 hash table: 33.115211 % hits, 0.386183 % collisions
depth 5 hash table: 0.000000 % hits, 0.237530 % collisions
depth 6 hash table: 0.000000 % hits, 0.000000 % collisions
depth 7 hash table: 0.000000 % hits, 0.000000 % collisions

The depth refers to the remaining depth in the tree traversal. Leaf nodes were not stored in the transposition table. The node count is wrong (it should be 3,195,901,860) because of hash collisions that went undetected. It is clear that 32-bit hash keys are not sufficient for avoiding errors. Another thing that can be seen is that the transposition table is at its most useful when the number of nodes is still much smaller than the table size. There are about 2 million nodes at depth 2. For the 64-million-entry transposition table this results in the best hit rate. At depth 1, there are already about 24 million nodes, which results in lots of collisions, reducing the hit rate.

It should be noted that the hit rates are different for different search algorithms. The numbers above are not representative of typical transposition table hit rates with alpha-beta search, because the perft()-function visits all the nodes in the search tree. Hit rates are smaller when the visited nodes are selected based on the alpha and beta bounds.


Parallel evaluation

The limited shared memory and thread divergence issues make it difficult to achieve efficient parallelism at the thread level. Therefore, a viable alternative is to let all the threads in a warp work on one position. Then there is more memory per position, and divergence is not an issue, because all the threads in a warp follow the same execution path. The challenge is to find enough work for all the threads. So, how to do this? In the following, I present my initial thoughts.

Most of the nodes are leaf nodes. Imagine a search tree with a constant branching factor of 2. Then it is easy to see that the number of nodes per level follows the series 1, 2, 4, 8, 16, 32, 64, 128, ..., and the sum of all the previous numbers is the last number minus one. It follows that most of the nodes in this tree are leaf nodes. With a larger branching factor, the ratio of leaf nodes to other nodes is even larger. Although there are cases where the local branching factor is one, it is a safe assumption that a typical chess search tree consists mostly of leaf nodes. We can utilize this piece of information by doing the evaluation in parallel. The evaluation function is mostly called at leaf nodes, so doing it faster helps the overall performance.
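The counting argument above is easy to verify numerically. A small sketch, with function names of my own choosing:

```cpp
#include <cassert>

// Number of nodes at depth d in a complete tree with branching factor b.
long long nodesAtDepth (long long b, int d)
{
    long long n = 1;
    for (int i = 0; i < d; i++)
        n *= b;
    return n;
}

// All nodes strictly above depth d, i.e. the interior of the tree.
long long interiorNodes (long long b, int d)
{
    long long total = 0;
    for (int i = 0; i < d; i++)
        total += nodesAtDepth (b, i);
    return total;
}
```

For b = 2 the leaves outnumber the interior by exactly one; for a chess-like branching factor of around 35 the leaves outnumber the interior roughly 34 to 1, so speeding up the evaluation at the leaves pays off almost everywhere.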

If we have an array of pieces stored, we could just let each thread handle one element in the array, i.e. one piece. But then, in the endgame, most of the threads do nothing. That does not sound too good. In addition, the evaluation methods for different pieces differ from each other. Using threads for the elementary bit operations on the bitboards is not wise either, because the GPU has native lsb- and popcount-instructions for finding the least significant bit and counting the set bits, respectively. I am not even sure that I will eventually use bitboards. It is probably better if the threads work together on somewhat larger features of the position.

I have come up with an evaluation function that, in its simplicity, could look something like:

int Board::evaluate ()
{
    int score = 0;
    evaluate_targets ();
    for (int i = 0; i < number_of_pieces; i++)
    {
        int j = pieces[i].getPieceType ();
        int k = pieces[i].getPosition ();
        score += material_score[j];
        score += placement_score[j][k];
    }
    return score;
}

This seems pretty simple. Every piece contributes its material value and its placement score to the total score. The trick is that the placement score table is dynamic: it depends mostly on the pawn structure and the position of the king, and it is calculated in the evaluate_targets()-function. Its values could be stored in a hash table, but it is probably cheaper to calculate them in parallel, because the hash table would not fit into shared memory.

I have had no time to think this more thoroughly yet, but I will do that in the near future.


Move generation code

I noticed that move lists take too much space. Assuming a move takes 4 bytes, a move list can take hundreds of bytes per thread. In a typical middlegame position, there can be 40 moves. That is already 160 bytes. In the worst case, there can be over 200 moves, or 800 bytes. Multiply this by the number of threads, e.g. 64, and you get a big number: 64 x 800 bytes = 51200 bytes or 50 kB. That is not very nice. The only practical option I can see is to generate one move at a time. Then we have to keep track of where we left off the last time we generated a move.

A naive move generator could look roughly like this:

int Board::generateList (Move *moveList)
{
    int index = 0;
    for (int i = 0; i < number_of_pieces; i++)
    {
        Piece p = pieces[i];
        int fromSquare = p.getSquare ();
        for (int j = 0; j < p.getMaxNoDirections (); j++)
        {
            int direction = p.getDirection (j);
            for (int k = 1; k <= p.getMaxNoSteps (direction); k++)
            {
                int toSquare = fromSquare + direction * k;
                Move move (p, fromSquare, toSquare);
                if (validMove (move))
                    moveList[index++] = move;
            }
        }
    }
    return index;
}

The code should be quite easy to read. It loops through all the pieces, then all the directions for each piece, and finally all the steps in that direction. The validity of each move is tested before adding it to the list of moves. What bitboards do is speed up the two inner loops and the validity test by generating only legal (or pseudo-legal) moves.

Unfortunately, we do not have enough memory for the entire move list. The move generation could easily be split so that one call generates moves for only one piece. We might still need as many as 27 entries in the list (a queen in the middle of an almost empty board). This requires 6912 bytes, or 6.75 kB, for 64 threads. That could be quite ok, but there is still a divergence problem if the threads generate different numbers of moves. The threads then spend different amounts of time generating the moves, and the slowest thread determines the pace of the others.

It is possible to arrange the move generation in another way:

int Board::generateOne (int ptr, Move &move)
{
    while (ptr < max_ptr)
    {
        int i = (ptr >> 6);
        int j = (ptr >> 3) & 7;
        int k = ptr & 7;
        Piece p = pieces[i];
        int direction = p.getDirection (j);
        int fromSquare = p.getSquare ();
        int toSquare = fromSquare + direction * k;
        move = Move (p, fromSquare, toSquare);
        if (validMove (move))
            return ptr + 1;
        ptr++;
    }
    return 0;
}

I have omitted some details here to make it easier to get the main idea. Basically, I have collapsed the for-loops into one loop and a pointer that keeps track of the progress. The function returns only one move and a pointer telling where to continue the move generation. The highest bits of the pointer are the piece index, the next 3 bits are the direction index, and the lowest 3 bits are the step count. Quite a few of the index combinations are invalid, and in a practical implementation it is important to skip over them to avoid spending too many iterations in the while-loop. My current move generator uses this approach to generate one move at a time, but it has more elaborate routines for skipping over invalid pointer values.

I have also been thinking about the option of using each multiprocessor to calculate one branch at a time, so that the search path is the same for every thread in a warp. Then there is no divergence problem, and even branching can be allowed without too much penalty. However, in this approach the massive parallelism becomes only ordinary parallelism at two levels. Threads inside a warp are used for speeding up operations like move generation and evaluation. At the higher level, warps are used like parallel processors when searching the game tree. I think the main question is whether there is enough to do for the 32 threads inside a warp. Well, there are at most 32 pieces on the board, so each thread could evaluate one piece, but is it wise to use the threads like this?


Memory vs. speed

The trade-off between memory and speed is quite common in computer algorithms. It is often possible to use more memory to gain speed. On the other hand, sometimes the memory space is limited and one must be content with the memory that is available. On the GPU, the situation is complex, because there are several types of memory. There is plenty of global memory, but it is slow. And there is very little on-chip memory, called shared memory, which is fast. Then there are other memory types that might be cached. But the shared memory is the most preferred choice, for the move generator at least.

There is either 16 kB or 48 kB of shared memory per multiprocessor. On some GPUs, one can select whether there is 16 or 48 kB of shared memory, with more or fewer registers per thread, respectively. To play it safe, we try to fit the core of the algorithms in 16 kB. Because every thread must have the position it is working on in shared memory, it makes sense to squeeze the position into as few bits as possible without degrading the performance of the algorithms too much. In the end, it will be a compromise between memory and speed, and testing is required to find the best solution.

So how many bits are required for unambiguously describing a chess position? There are at most 32 pieces, and each can be on one of 64 (= 2^6) squares. There are exceptions, such as pawns, which can only be on 48 different squares, but we will not go into such details here. In total, the square information takes 32 x 6 bits = 192 bits or 24 bytes. This is not enough, however, because we cannot tell the piece types. Initially, it is possible to fix the pieces to certain indices in the piece array, e.g. the white queen is always at index 4. But piece types change due to promotions, and captures remove pieces from the board. So we must keep track of the piece types, which requires 3 more bits per piece. Note that it is not necessary to store the piece color, because we can arrange the array so that the first 16 indices are for white pieces and the next 16 for black pieces. Eventually, we have 32 x 9 bits = 288 bits in total. In addition, we need 4 bits for the castling rights, 1 bit for the side to move, and 4 bits for the en passant (EP) square (4 bits is enough, because we know the rank, so only the file must be determined with 3 bits, and one extra bit tells whether the EP is valid). This brings the total to 297 bits, which is 37 bytes and a bit. In a practical implementation, we might end up using 16 bits per piece, because 9 bits do not align well with byte boundaries. This leads to 32 x 16 bits = 512 bits or 64 bytes. The additional flags can be hidden in the unused high bits.
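The 16-bits-per-piece packing can be sketched like this. The layout (square in the low 6 bits, type in the next 3, flags in the high bits) is my own illustration of the idea, not a fixed design.

```cpp
#include <cassert>
#include <cstdint>

// 6 bits square + 3 bits piece type; bits 9-15 are free for flags.
uint16_t packPiece (int square, int type)
{
    return (uint16_t) (square | (type << 6));
}

int pieceSquare (uint16_t p) { return p & 63; }
int pieceType (uint16_t p)   { return (p >> 6) & 7; }

// Example of hiding a one-bit flag (say, the side to move) in the top bit
// of one array element without disturbing the piece data.
uint16_t setFlag (uint16_t p) { return (uint16_t) (p | 0x8000); }
bool flagSet (uint16_t p)     { return (p & 0x8000) != 0; }
```

The castling rights (4 bits) and the EP file (4 bits) fit in the high bits of a few other elements in the same way, so the whole position stays within the 64-byte array.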

How does this compare with bitboards? I take the chess engine Crafty as an example, because its source code is publicly available. It uses 7 bitboards per side for the board, plus some extra information that we can ignore in this context. These 14 bitboards take 64 bits each, for a total of 14 x 64 bits = 896 bits or 112 bytes. It is possible to use fewer bitboards if we are willing to abandon rotated bitboards. The absolute minimum, I think, is 5 bitboards (white, pawns, knights, bishop-like pieces, rook-like pieces) plus two variables telling where the kings are (I would not waste bitboards on that). That takes 5 x 64 bits + 16 bits = 336 bits or 42 bytes. This sounds good to me. Although it is more than 37 bytes, it is certainly less than 64 bytes. The downside is that we do not have rotated bitboards.

Now, let me confirm that we can get all pieces (except kings) out of the five bitboards I mentioned:

wP = w & P;
wN = w & N;
wQ = w & B & R;
wB = w & B & ~wQ;
wR = w & R & ~wQ;
bP = ~w & P;
bN = ~w & N;
bQ = ~w & B & R;
bB = ~w & B & ~bQ;
bR = ~w & R & ~bQ;

where w = white, b = black, P = pawn, N = knight, B = bishop (including queens), R = rook (including queens), and Q = queen.
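The identities above are easy to sanity-check. Here are the five bitboards for the standard starting position, using the convention a1 = bit 0 ... h8 = bit 63 (the convention is mine, just for this illustration):

```cpp
#include <cstdint>

typedef uint64_t U64;

inline U64 bit(int sq) { return 1ULL << sq; }  // a1 = 0, ..., h8 = 63

// The five bitboards for the starting position (B and R include queens).
const U64 w = 0x000000000000FFFFULL;                // all white pieces, ranks 1-2
const U64 P = 0x00FF00000000FF00ULL;                // pawns on ranks 2 and 7
const U64 N = bit(1) | bit(6) | bit(57) | bit(62);  // knights on b1, g1, b8, g8
const U64 B = bit(2) | bit(5) | bit(3)              // bishops c1, f1 + queen d1
            | bit(58) | bit(61) | bit(59);          // bishops c8, f8 + queen d8
const U64 R = bit(0) | bit(7) | bit(3)              // rooks a1, h1 + queen d1
            | bit(56) | bit(63) | bit(59);          // rooks a8, h8 + queen d8

// Kings are kept in two plain variables, not in bitboards.
const int whiteKing = 4;   // e1
const int blackKing = 60;  // e8

// For example, w & B & R leaves exactly bit(3): the white queen on d1.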

So how much space does this take for several threads? Let us assume 64 threads. Then 64 x 42 bytes = 2688 bytes or about 2.6 kB. That is very nice. It leaves plenty of space for the move stacks. The move data structure must contain all the data required to undo the move. That includes 6 bits each for the source and destination squares, 3 bits for the captured piece, 2 bits for the castling rights, 4 bits for the EP square, and 2 bits for the promotion piece. This is 23 bits, so strictly only 3 bytes are required. Because of data alignment, it might be better to use 4 bytes per move, which lets us position the flags more conveniently. So for 64 threads, the stack takes 64 x 4 bytes = 256 bytes per ply. In theory, the quiescence search can be as long as 30 ply. Adding this to the end of a regular search of 20 ply gives a 50-ply sequence, which takes 50 x 256 bytes = 12800 bytes or 12.5 kB. We are still under 16 kB (2.6 kB + 12.5 kB = 15.1 kB).
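A possible 32-bit packing of these 23 bits, with the field order and accessor names chosen only for illustration:

```cpp
#include <cstdint>

// One move in 32 bits, using the field widths from the text:
// from (6) | to (6) | captured piece type (3) | old castling bits (2)
// | old en passant bits (4) | promotion piece (2) = 23 bits used.
struct Move {
    uint32_t data;
};

inline Move packMove(unsigned from, unsigned to, unsigned captured,
                     unsigned castling, unsigned ep, unsigned promo) {
    Move m;
    m.data = from | (to << 6) | (captured << 12)
           | (castling << 15) | (ep << 17) | (promo << 21);
    return m;
}

inline unsigned moveFrom(Move m)     { return m.data & 0x3F; }
inline unsigned moveTo(Move m)       { return (m.data >> 6) & 0x3F; }
inline unsigned moveCaptured(Move m) { return (m.data >> 12) & 0x7; }
inline unsigned moveCastling(Move m) { return (m.data >> 15) & 0x3; }
inline unsigned moveEp(Move m)       { return (m.data >> 17) & 0xF; }
inline unsigned movePromo(Move m)    { return (m.data >> 21) & 0x3; }
```

Nine bits remain unused, which is where alignment buys the freedom to place flags conveniently.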

I will now seriously consider re-writing the move generator for bitboards. The only question is whether I can cope without the rotated bitboards or not. And thanks to Error323 for the comments about using bitboards; they made me re-think my decisions about the data structures. This is one of the reasons I write this blog: I get other opinions that can improve the outcome of this project.


Thoughts about selective search

Because the branching factor affects the search so much, it makes sense to spend some time examining the different strategies used for pruning off some moves. Terminating the search before the search horizon might be risky, but on the other hand, some lines must be examined beyond the horizon in the quiescence search. The search depth will then be position-dependent in any case.

Null-move pruning is one technique often used in chess engines. It is based on the assumption that if a position fails high after one passes a move and lets the opponent move twice, then the position is likely to fail high also if the move were actually made and a complete search performed. This is usually a safe assumption. An exception is the so-called zugzwang positions, where the obligation to move is actually a disadvantage. Zugzwangs occur mostly in the endgame, so the null move should not be used there, unless one is extremely careful. The null-move search can be done with a shallower depth, so it saves a lot of time when the remaining depth is still large. One should not allow consecutive null moves, because that would reduce the search to nothing.
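To make the shape of the algorithm concrete, here is a toy sketch: the "position" is just a material count and a side to move, every real move gains material for the mover (so there is no zugzwang by construction), and the null move passes the turn and searches with a reduced depth and a null window. Everything here is illustrative structure, not engine code:

```cpp
// Toy model of null-move pruning. Pos, the move set, and the evaluation
// are all invented for this sketch.
struct Pos { int material; int side; };  // material from white's view; side: +1 white, -1 black

const int NULL_MOVE_R = 2;  // depth reduction for the null-move search

int evaluate(const Pos& p) { return p.side * p.material; }  // from the side to move

int search(Pos p, int depth, int alpha, int beta) {
    if (depth == 0) return evaluate(p);

    // Null move: pass the turn and search a reduced depth with a null window.
    // Depth >= 3 also prevents two consecutive null moves in this toy.
    if (depth >= 3) {
        Pos np = p; np.side = -np.side;
        int score = -search(np, depth - 1 - NULL_MOVE_R, -beta, -beta + 1);
        if (score >= beta) return beta;  // fail high without trying a real move
    }

    for (int d = 1; d <= 2; ++d) {       // each "move" wins d units for the mover
        Pos np = p;
        np.material += p.side * d;
        np.side = -np.side;
        int score = -search(np, depth - 1, -beta, -alpha);
        if (score >= beta) return beta;
        if (score > alpha) alpha = score;
    }
    return alpha;
}
```

With a huge material advantage the null-move search already fails high, and the whole subtree is cut off at the cost of one reduced-depth search.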

Another widely used technique is pruning branches near the horizon. If the position is quiet and would fail high with a clear margin if the search were terminated at the current depth, the search can be terminated. This is usually safe only one or two plies from the horizon; otherwise there are too many opportunities for the score to drop below beta. Also, the margin must be set high enough, and the position must be quiet (not in the middle of a capturing sequence).

Quiescence search is launched at the horizon depth. If the position is quiet, it calls the evaluation function directly. Otherwise, it searches through capturing sequences, and probably some checks, to see if the score would differ from the positional evaluation. One must be careful to include only relevant moves in this search, or the search tree can explode.

These ideas could be brought one step further by adding some expert knowledge into the move selection. The choice of moves at each depth would depend on the potential of those moves. A full search would be done only for the first few plies. The rest of the search would be like quiescence search. Then there would be two kinds of evaluation functions: one for positional evaluation and another for move selection. The quality of the evaluation functions is then critical to the quality of the search.

But these ideas are still just ideas. I will experiment with selective search strategies after an implementation of the alpha-beta search is running on the GPU.


Testing and setting performance goals

Recently, I have been writing a move generator for the CPU. I am writing it so that, with minor changes, it can be run on the GPU. This CPU move generator has two purposes: (1) testing that the move generator produces the correct moves, and (2) setting a goal, in terms of nodes per second, that the GPU move generator must beat.

It is easy to write bugs in move generator code. Debugging CPU code is easier than GPU code: in the GPU code, one must implement the parallel execution properly, in addition to the move generator itself. So the first step is to get the move generator working properly on the CPU, and then port the code to the GPU.

I have written move generators before, so why not use one of them? Because none of them were written with parallel execution in mind. They take too much memory to fit into the shared memory. That is why I had to completely re-design the move generator. I try to minimize the memory consumption so that 32 or 64 threads can run in parallel on one streaming multiprocessor, with no global memory accesses except for reading the initial position.

Currently, most of the code has been written, but there are bugs. I use a perft() function to test the move generator. It generates all the moves up to a specified depth and counts the nodes. The node counts can be compared to publicly available correct numbers. If the numbers match, it is very likely that the move generator is working properly. My perft() function looks like this:

unsigned int Board::perft (unsigned int depth)
{
 if (depth == 0) return 1;

 Move m;
 unsigned short ptr = generate (0, m);
 unsigned int nodes = 0;
 while (ptr)
 {
  make (m);
  nodes += perft (depth - 1);
  unmake (m);
  ptr = generate (ptr, m);
 }

 return nodes;
}
There is one peculiar feature in the move generator routine: it generates only one move per function call and returns a pointer that can be given back to the generator to produce the next move. If it returns zero, there are no legal moves left. The performance test includes the make() and unmake() functions for making a move and undoing it.

So how efficient is it now? I have tested the current, still buggy version of the generator by calling perft(6) and timing it. The result is about 10.2 million nodes per second on one core (Intel(R) Core(TM) i7 CPU 930 @ 2.8 GHz). My computer has eight logical cores in total, so it should achieve about 80 Mnps if all of them were used. So, this is the minimum speed the GPU must beat to be faster than the CPU. I hope I can make it 100 Mnps. That reminds me of the first version of IBM's chess computer Deep Blue back in 1996, which evaluated 100 M positions per second (see Deep Blue in Wikipedia). To be accurate, positions evaluated is different from moves generated, but the magic number is the same. That is the goal, anyway.


Existing parallel search algorithms

There are several parallel search algorithms that could be utilized in a chess engine. So, to avoid reinventing the wheel, it is necessary to study the previous work on the topic. It is often assumed that the parallel processors are general-purpose processors that can communicate with each other. Although the newest GPUs are more flexible than before, we cannot assume that the existing algorithms work on them as such. Let us examine some of them and see if they would work on the GPU.

One of the simplest ideas is to share the hash table, or the transposition table as it is called in the context of a chess engine. Then all processors start at the root. When one thread finishes the search of a position, it stores the result in the transposition table. Other threads can then read the result from the table instead of searching the subtree themselves. If the processors finish at slightly different times, their search paths begin to diverge as they read the transposition table or search the positions themselves. This is a non-deterministic way of sharing the workload between the processors. The memory accesses must be atomic, or the critical sections must be protected, when writing the results to the transposition table. CUDA supports atomic operations that could be used in the implementation of the transposition table. However, this approach scales quite badly to a large number of processors, so it is not useful for the GPU as a means of parallelization. Still, a transposition table will probably be implemented, and if it resides in the global memory, it will act as a shared transposition table.
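One way to avoid explicit locking in a shared table is the well-known lockless hashing trick described by Hyatt and Mann and used in Crafty: store the key XORed with the data next to the data, so that a torn write by a concurrent thread merely makes the checksum fail and the entry is ignored. A minimal sketch (the entry layout and sizes are illustrative):

```cpp
#include <cstdint>

// Lockless transposition table entry: instead of a semaphore, the first
// word holds key ^ data. If another thread interleaves a write, the
// checksum no longer matches and the probe simply reports a miss.
struct TTEntry {
    uint64_t check;  // key ^ data
    uint64_t data;   // packed score, depth, best move, bound type
};

const int TT_SIZE = 1 << 16;  // number of entries; illustrative size

void ttStore(TTEntry* table, uint64_t key, uint64_t data) {
    TTEntry& e = table[key & (TT_SIZE - 1)];
    e.check = key ^ data;
    e.data  = data;
}

bool ttProbe(const TTEntry* table, uint64_t key, uint64_t& data) {
    const TTEntry& e = table[key & (TT_SIZE - 1)];
    if ((e.check ^ e.data) != key) return false;  // empty, replaced, or torn entry
    data = e.data;
    return true;
}
```

A corrupted entry is indistinguishable from a miss, which is exactly the desired failure mode: the searcher just searches the subtree itself.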

Many parallel chess algorithms are based on tree splitting: the alpha-beta search tree is split between the processors. The simplest idea is to give the children of the root node to different processors and search them with a full window. However, if the moves are ordered from best to worst, which is common when the iterative deepening strategy is applied, the first move produces a good alpha bound, and almost all of the remaining searches would fail low if they were searched serially. Deeper in the tree, if a move ordering scheme is used, the first move either produces a good alpha bound or fails high. That is why it is not beneficial to let the search proceed with the other moves before the search of the first one is completed. Most often, the other moves are then searched with a zero window, because they are expected to fail low. There are lots of algorithms that utilize these ideas: mandatory work first, principal variation splitting, enhanced principal variation splitting, young brothers wait, jamboree, dynamic tree search, etc. The differences between these algorithms are mainly in how they share the work in the parts of the search tree that are not in the "leftmost branches". These algorithms require some way to communicate which subtrees are available for the processors to search. On the GPU, this could be a memory location containing a stack of search tasks. However, some kind of semaphores would be needed to prevent corruption of the stack by concurrent accesses.

Instead of splitting the search tree, it is also possible to partition the search window. Each processor then does a kind of aspiration search with its assigned window partition. Most of these searches fail either high or low, and eventually we know in which partition the correct evaluation lies. If the windows are small enough, this piece of information is sufficient for choosing the best move. In particular, if the searches are done with a null window, the results should be accurate. Due to its simplicity, this sounds like a good initial plan for the parallelization of the search on the GPU. Let us say we have an evaluation granularity of 1/100th of a pawn, and assume 64 threads on each of 15 multiprocessors, i.e. 960 threads in total. This allows the search windows to cover the range from -4.80 to +4.79 in steps of 0.01, which is usually enough to find the correct evaluation. But it is probably better to use the search power in a more efficient way.
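The thread-to-window mapping is then essentially a one-liner. A sketch with the numbers from above, in centipawns (the names are mine):

```cpp
// Each of the 960 threads gets a one-centipawn null window (alpha, alpha + 1):
// together they cover -480 ... +479 centipawns, i.e. -4.80 to +4.79 pawns.
const int THREADS_PER_MP = 64;
const int NUM_MP         = 15;
const int TOTAL_THREADS  = THREADS_PER_MP * NUM_MP;  // 960

// Thread t performs a null-window search with this alpha bound.
int windowAlpha(int threadId) {
    return threadId - TOTAL_THREADS / 2;  // thread 0 -> -480, thread 959 -> +479
}
```

The true score then lies between the highest fail-high and the lowest fail-low among the threads.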

If we use the iterative deepening strategy, we already have a pretty good guess about the evaluation of the best move under the root from the previous depth. Then it might be enough to use a narrow window, e.g. [score-0.32, score+0.31], and split the search below the root between the multiprocessors by using this window for each of them. All searches are null-window searches anyway, so a better alpha bound from the first move cannot speed up the rest of the search. In total there would then be 64 x N threads, where N is the number of legal moves in the root position. We might have to restart the search if one of the searches falls outside the window. But then we can probably assign more than one multiprocessor to the wider-window search, because most of the moves will probably have failed low already.

These are just my initial thoughts and they might change while I am reading more about the parallel search algorithms.


About evaluation

The evaluation function is very important, because in most cases, the search must be terminated before the game ends in a checkmate or draw. One way to think about the evaluation function is that it indicates the probability that one side is winning. Assuming perfect play, there is no chance in chess as such. If we had enough computing power, it would be possible to say with absolute certainty for each position if it is a draw or win for white or win for black. But we do not have that power, so we have to resort to the heuristic approach.

Most chess engines use the pawn as the basic unit of evaluation. Everything else being equal, white being up one pawn yields the score +1.00. Similarly, if black is up a pawn, the score is -1.00. For a grandmaster, a one-pawn advantage is usually just enough to win the game. And if the score equals zero, the position is a dead draw. Checkmates, on the other hand, would result in an evaluation of infinity. In practice, chess engines use a finite number for checkmates, but one so big that it cannot be confused with any other evaluation.
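A common convention, shown here only as an illustration of the principle (the constants are invented), is to reserve a band of scores near the maximum for checkmates and to encode the distance to mate in the score itself, so that faster mates score higher:

```cpp
#include <cstdlib>

// Scores in centipawns. Mate scores live in a band just below INFINITE,
// far above any positional evaluation, so the two can never be confused.
const int INFINITE   = 32000;
const int MATE_VALUE = 31000;  // mate delivered at the root
const int MAX_PLY    = 512;    // deepest ply the search can reach

// Score for delivering mate at the given ply, from the winner's view.
int mateScore(int ply) { return MATE_VALUE - ply; }

// Any score inside the reserved band is a mate score.
bool isMateScore(int score) { return abs(score) > MATE_VALUE - MAX_PLY; }

// Recover the distance to mate in plies from a mate score.
int pliesToMate(int score) { return MATE_VALUE - abs(score); }
```

A mate in 3 thus outranks a mate in 7, and both outrank any material advantage.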

The evaluation function is where most of the expert knowledge in a chess engine is located. Evaluating chess positions is difficult even for experienced human players. It is not possible to invent straightforward rules that apply to every position. Usually what separates a good player from a really strong player is the specific knowledge of different types of positions and which plans work in them. It is possible to program some of this information in a chess engine, but human beings and computers think in a different way, making this a difficult task.

One theory of evaluation is based on three factors: material, space, and time. The player with more material is usually more likely to win. The player with more space is also more likely to win, because he has more options to maneuver his pieces. And third, the player who is ahead in development, or has the initiative, is more likely to win. In addition, there are factors like pawn structure, piece mobility, and king safety. These are often explained through the three-factor model (material, space, time), but I think the model is not as well suited for computer chess as it is for human players.

I suggest a different model for the evaluation in a chess engine. It is also based on three concepts, but they differ from those presented above: force, path, and targets. Force is approximately equal to material; it refers to the pieces and their latent values. Path means that the pieces have routes on the board. It is close to the concept of space, but more specific to the piece type: what is a path for a knight is not usually a path for a bishop. And finally, targets refer to weak or otherwise important points in the opponent's position. These are related to path and force in that the force strikes the targets via the path. If one of these factors is missing, the strike cannot succeed.

First, the force or material, is evaluated by using a simple piece value table:

Piece         Value
Pawn          1.00
Knight        3.00
Bishop        3.00
Rook          5.00
Queen         9.00
Bishop pair   0.50

This is a refined version of the traditional 1-3-3-5-9 scoring. It includes a bonus for the side with the bishop pair, because it is widely recognized at the grandmaster level that a pair of bishops is slightly better than a bishop and a knight, or two knights.

Paths are calculated backwards from the targets. A path may mean an open or semi-open file for a rook, a diagonal for a bishop, an outpost square for a knight, or a free path to the promotion square for a passed pawn.

As already mentioned, a target is a weak or important point in the opponent's position. Weak points include e.g. isolated and backward pawns. Important points include the central squares, the bases of pawn chains, and the squares near the king. These lists are not exhaustive, but may be extended as new factors affecting the position become apparent.

This is an initial attempt to evaluate the position. A more sophisticated function can be implemented when the chess engine is otherwise working fine.


Moving pieces

An efficient move generator is an essential part of a strong chess engine. The rules of chess determine the movement of the different pieces. Most pieces move quite nicely, but there are exceptions which make the implementation of a move generator more difficult than it would otherwise be. The purpose of this post is to examine how the pieces move and how to implement a generic move generator.

But first, it is appropriate to mention that it is not always necessary to generate all moves for all positions during the search. Many chess engines do, while others generate the moves in stages, e.g. checks, captures, and other moves. If the search fails high after a move, it is not necessary to go through the rest of the moves, and the effort spent generating them is wasted. On the other hand, generating one move at a time is probably more difficult and adds extra instructions. However, in a zero-window search, it can happen that only one move is searched at every other ply, so separate move generation routines could be used for those plies. But for simplicity, we try to construct a generic move generator.

It would be nice to have a move generator without branching. This is difficult, but I think it can be done. If branching cannot be avoided, the threads inside a warp will have to execute all the branches in many cases, and some of the benefits of parallelism are lost. So we try to replace ifs with table lookups. That is why we need to see if the piece movement can be generalized. The following table lists some properties of the pieces:

Piece   Sliding   Directions   Max. moves   Specialities
Pawn    No        3            4            Capture and non-capture different, promotion, double move, en passant
King    No        8            8            Castling, checks

So basically, pawns are the nastiest pieces, and kings are also bad. The majority of the problems with the kings can be avoided by loosening the requirements: checks can be handled separately, and not as a part of move generation.

Generic move rules could consist of the piece to move, the change in the board index (= direction), the maximum number of steps it can "slide" (one for non-sliding pieces), a condition for the move to be valid, and an extra action to take in addition to moving the piece. This leads to a table like:

Piece   Direction   Max. steps   Condition                                                       Action
Pawn    -8          1            Destination square empty
Pawn    -16         1            Only 2nd row pawn, destination square and middle square empty   Set EP square
Pawn    -7          1            Destination square occupied by enemy piece or EP square
Pawn    -9          1            Destination square occupied by enemy piece or EP square
Pawn    -8          1            7th row pawn, destination square empty                          Change to knight
Pawn    -8          1            7th row pawn, destination square empty                          Change to bishop
Pawn    -8          1            7th row pawn, destination square empty                          Change to rook
Pawn    -8          1            7th row pawn, destination square empty                          Change to queen
Pawn    -7          1            7th row pawn, destination square occupied by enemy piece        Change to knight
Pawn    -7          1            7th row pawn, destination square occupied by enemy piece        Change to bishop
Pawn    -7          1            7th row pawn, destination square occupied by enemy piece        Change to rook
Pawn    -7          1            7th row pawn, destination square occupied by enemy piece        Change to queen
Pawn    -9          1            7th row pawn, destination square occupied by enemy piece        Change to knight
Pawn    -9          1            7th row pawn, destination square occupied by enemy piece        Change to bishop
Pawn    -9          1            7th row pawn, destination square occupied by enemy piece        Change to rook
Pawn    -9          1            7th row pawn, destination square occupied by enemy piece        Change to queen
Rook    -8          7                                                                            Clear a castling flag
Rook    -1          7                                                                            Clear a castling flag
Rook    1           7                                                                            Clear a castling flag
Rook    8           7                                                                            Clear a castling flag
King    -9          1                                                                            Clear castling flags
King    -8          1                                                                            Clear castling flags
King    -7          1                                                                            Clear castling flags
King    -1          1                                                                            Clear castling flags
King    1           1                                                                            Clear castling flags
King    7           1                                                                            Clear castling flags
King    8           1                                                                            Clear castling flags
King    9           1                                                                            Clear castling flags
King    -2          1            Long castling allowed                                           Clear castling flags, move rook
King    2           1            Short castling allowed                                          Clear castling flags, move rook

The moves could be generated by lookups into such a table. In addition, it is required that the destination square is not off the board. Implementing the conditions is the hard part. Usually they can be formulated as tests of whether certain bits are set. The actions in the table can be seen as removing one piece and adding another.
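As a sketch, a few rows of such a table can be written directly as data, with the conditions and actions encoded as small enums that the move generator would interpret (the encoding and names are my own illustration):

```cpp
#include <cstdint>

// Condition and action codes the generator would interpret; only the ones
// needed for the sample rows below are listed.
enum Cond : uint8_t { C_NONE, C_DEST_EMPTY, C_DOUBLE_PUSH, C_DEST_ENEMY_OR_EP };
enum Act  : uint8_t { A_NONE, A_SET_EP, A_CLEAR_CASTLE };

// One row of the rule table: piece type, board-index delta, slide length,
// validity condition, and extra action.
struct MoveRule {
    int8_t  piece;     // piece type index (0 = pawn, 3 = rook in this sketch)
    int8_t  delta;     // change in the board index
    uint8_t maxSteps;  // 1 for non-sliders, up to 7 for sliders
    Cond    cond;
    Act     act;
};

// White pawn push, double push, one capture direction, and one rook ray:
const MoveRule rules[] = {
    { 0,  -8, 1, C_DEST_EMPTY,       A_NONE         },
    { 0, -16, 1, C_DOUBLE_PUSH,      A_SET_EP       },
    { 0,  -7, 1, C_DEST_ENEMY_OR_EP, A_NONE         },
    { 3,  -8, 7, C_NONE,             A_CLEAR_CASTLE },
};
```

Each rule fits in five bytes, so even the full table is tiny compared to the shared-memory budget.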

To avoid ifs in the control flow, a move can be made so that if the condition is false, the move has no effect on any variable; otherwise the move is made and the appropriate variables are updated. When returning from the search, the move is undone, and the execution continues from the next move at the previous level. The return pointer is then actually an index into the move generation table.
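The conditional update can be done branchlessly by turning the condition into a bit mask and blending the old and new values, so that every thread in the warp executes the same instructions whether or not its move applies. A minimal sketch:

```cpp
#include <cstdint>

// Branchless select: if cond is true the new value is taken, otherwise the
// old one, without any conditional jump.
inline uint32_t blend(uint32_t oldVal, uint32_t newVal, bool cond) {
    uint32_t mask = ~(static_cast<uint32_t>(cond) - 1U);  // cond ? 0xFFFFFFFF : 0
    return (oldVal & ~mask) | (newVal & mask);
}
```

Applying a move then becomes a series of such blends over the affected variables, with the move's validity condition as `cond`.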

All this might sound confusing, but to avoid branching, I think this is necessary.


Initial plan for shared memory usage

Global memory accesses are slow. A speed-up of up to two orders of magnitude can be achieved by using shared memory. However, shared memory is limited. These facts lead us to a design principle: use shared memory instead of global memory, even at the expense of extra operations. For example, it does not matter if there are three times as many instructions if we can avoid memory operations that are a hundred times slower. So let us see how this can be done.

For each thread, we would like to fit the stack of search function parameters and the current position into shared memory. The function parameters include the remaining depth, the alpha value, the move information, and a pointer for returning to the previous depth. These parameters are implementation-specific, of course. To save space, we do not keep positions in the stack, only the moves. In addition, the beta value is not necessary in a zero-window search, because it is always alpha + 1. Even alpha need not be in the stack, because it does not change during the search; if it changed, that would be a fail high.

Different representations of the position take different amounts of memory. Let us compare three of them (without additional flags etc.): the simple 8x8 board, the 0x88 board, and bitboards. In its simplest form, the 8x8 board takes 64 times 4 bits (enough for the pieces), which is 256 bits or 32 bytes. It might be necessary to complement this with a piece list for efficiency reasons. The 0x88 board takes double the space of the simple 8x8 board, or four times as much if we reserve one byte per square. Finally, bitboards take 64 bits each, and we need 6-12 of them (white, pawns, knights, bishops, rooks, kings, and possibly black and queens, plus 4 rotated bitboards). That is 384-768 bits or 48-96 bytes. Surprisingly, the simplest board representation seems the most attractive, because it is the smallest.
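The 64 x 4-bit board packs two squares per byte. A sketch of the accessors (the piece coding is illustrative; 16 codes are enough for empty plus six piece types per side):

```cpp
#include <cstdint>

// 8x8 board in 32 bytes: square 2k in the low nibble of byte k,
// square 2k+1 in the high nibble.
struct Board8x8 {
    uint8_t sq[32];

    unsigned get(unsigned s) const {
        return (sq[s >> 1] >> ((s & 1) * 4)) & 0xF;
    }
    void set(unsigned s, unsigned piece) {
        unsigned shift = (s & 1) * 4;
        // Keep the other nibble, overwrite this one.
        sq[s >> 1] = static_cast<uint8_t>((sq[s >> 1] & (0xF0 >> shift)) | (piece << shift));
    }
};
```

The nibble shuffling costs a few extra instructions per access, which is exactly the trade the design principle above endorses.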

Moves can be packed into a few bytes. The source and destination squares take 6 bits each. Then, for undoing moves, it is necessary to store the captured piece, the en passant square, the castling rights, and the promoted piece. These take 4, 6, 2, and 4 bits, respectively. In total this is 28 bits, so in practice a move takes 32 bits or 4 bytes.

Fitting the stack in the shared memory is going to limit the depth of the stack or the number of threads. The number of threads should not be too small, because that would negatively affect the performance. An initial guess is 64 threads per multiprocessor. This can be changed when the implementation is tested. The depth is limited to 32 ply. In practice, this should be enough, although the quiescence search could theoretically go deeper.

The following table summarizes the total memory consumption so far:

Variable           Size       Number    Total memory consumption
Current position   32 bytes   64        2 kB
Moves              4 bytes    64 x 32   8 kB
Pointer            2 bytes    64 x 32   4 kB

For each multiprocessor, we probably need a set of lookup tables for the move generator. Because the other data already take about 14 kB, and on many devices the upper limit for shared memory is 16 kB, we have only about 2 kB available. That is not enough for any 64 x 64 tables. Instead, it is possible to use only arrays of offsets for pieces moving in different directions.
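For example, knight moves can be generated from a small offset array; on a plain 8x8 board one extra check is needed so that moves do not wrap around the board edge. A sketch (square indexing a1 = 0 ... h8 = 63 is my convention here):

```cpp
#include <cstdlib>

// Knight move deltas on an 8x8 board indexed a1 = 0 ... h8 = 63.
// Eight ints instead of a 64 x 64 table: this easily fits in 2 kB
// together with the offsets for the other pieces.
const int knightOffsets[8] = { -17, -15, -10, -6, 6, 10, 15, 17 };

int knightMoveCount(int sq) {
    int count = 0;
    for (int i = 0; i < 8; ++i) {
        int to = sq + knightOffsets[i];
        if (to < 0 || to > 63) continue;          // off the top or bottom
        if (abs(to % 8 - sq % 8) > 2) continue;   // wrapped around an edge
        ++count;
    }
    return count;
}
```

A knight on e4 has all eight moves, while a knight in the corner has only two, which the wrap check catches correctly.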

I will refine the plan when I go into the details of move generation.


Examining the search tree

Almost all chess engines go through a search tree, where the nodes are positions and the edges are chess moves between them. The children of a node are all the positions that can be reached from that position with one legal move. Sometimes the legality requirement is loosened to allow pseudo-legal moves for efficiency reasons. For example, instead of checking for checkmate, it is often sufficient to assign the king a very high value and allow it to be "captured". However, the basic structure of the search tree is the same.

In typical middlegame positions, there are around 40 legal moves. The number of legal moves varies between zero (stalemates and checkmates) and 218 (some artificially constructed positions). This is the branching factor of the tree; let us call it B. The number of nodes at depth D can then be estimated with B^D. Obviously, this number grows very quickly as D increases. The following table illustrates this:

Depth   Node count          Time at 10M nodes / s
1       40                  0.000004 s
2       1 600               0.00016 s
3       64 000              0.0064 s
4       2 560 000           0.256 s
5       102 400 000         10.24 s
6       4 096 000 000       6 min 49.6 s
7       163 840 000 000     4 h 33 min 4 s
8       6 553 600 000 000   7 d 14 h 2 min 40 s

Consider the time taken by the full traversal of the search tree, as listed in the table. The speed of 10 million nodes per second is realistic for modern chess engines. How is it then possible that they routinely reach depths like 15 or 20 ply in tournament play? They only have a few minutes per move, so they should only be able to go 5-6 ply deep. Half of the answer is that the table represents the worst case, with bad move ordering. The backbone of most chess engines is some version of the alpha-beta search. If the move order is optimal, it is necessary to search only one child at every other level of the tree, so the node count becomes B^(D/2); in other words, the search can go twice as deep. But multiplying the depths in the table by two gives only 10-12 ply. That is still less than expected, even if the move order were optimal, which in practice it is not.
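The arithmetic behind the table and the B^(D/2) estimate is easy to reproduce:

```cpp
// Reproducing the table: nodes = B^D, and with perfect move ordering an
// alpha-beta search visits roughly B^(D/2) nodes, i.e. a depth-8 tree
// costs about as much as a depth-4 full traversal.
const double B   = 40.0;    // branching factor
const double NPS = 1.0e7;   // 10 million nodes per second

double nodeCount(int depth) {
    double n = 1.0;
    for (int i = 0; i < depth; ++i) n *= B;
    return n;
}

double secondsAt(int depth) { return nodeCount(depth) / NPS; }
```

With these two lines one can check, for instance, that depth 8 with perfect ordering (B^4 nodes) takes about a quarter of a second instead of a week.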

Where does the extra depth come from? There are many contributing factors. One is the transposition table: about 15-20 % of the positions encountered during the search happen to be among those already searched. If the results for these positions can be fetched directly from the hash table, the number of positions to examine can be reduced by a factor of around 3 for a 5-6 ply search. This might mean an extra ply for the search. In the endgame, transposition tables are even more beneficial and can yield tremendous speed-ups.

Another factor is selectivity. While pruning some moves before the maximum depth is reached is risky and might not produce correct results, most good chess engines do it: the benefit of searching deeper outweighs the occasional miscalculation. If the pruning strategy is well designed, such mistakes can be almost completely avoided. Human intuition says that most moves in most positions are just dumb. However, it is far from easy to implement this intuition in a chess engine.