Still debugging

I have been quite busy during the week. The little time I had I spent debugging the code. It is surprisingly easy to introduce bugs in the move generator code if you make small changes to it. I had to implement code for printing out the positions. That helped a lot and found out that I had copied wrong masks for preventing the wrap-around in the Kogge-Stone algorithm (A- and H-files swapped).

Now that I have the sequential code working, I must still figure out how to arrange the parallel execution in an optimal way. I might just launch a constant number of threads and let them handle a bunch of positions. The basic idea is to have a kind of queue of positions for processing. The threads take the positions in the beginning of the queue and generate their successors and put them to the end of the queue. As long as there are positions with less than the maximum depth, the threads continue to process them. The obvious problem is that this is a kind of breadth-first-search approach, so the size of memory becomes the limiting factor, and perft(6) is already too much.

So, instead of a queue, I could use stack. Then, the threads take the positions on top of the stack and put the new stuff on top of the stack. Care must be taken to handle the new and old positions properly. The size of the memory area required is then dependent on the number of threads. This is more managable and probably the threads are less divergent as they handle similar types of positions.


Eliminating lookup tables

Sometimes it is useful to look things from a different perspective. This applies to chess programming as well as other things. When programming for the CPU, you usually try to minimize the instruction count in order to optimize the performance. In GPU programming, it is often better to minimize the memory usage, because memories and caches are small, but on the other hand, basic arithmetic instructions are relatively cheap. The bottleneck is the memory speed, not the execution of the instructions. As a side product, I have learned new ways to do things. These might be useful in the future, even when optimizing a CPU chess engine.

I have implemented a prototype for move generation. It currently runs on the CPU, but it should be straightforward to port it to the GPU. The main reason for this is that it does not use any lookup tables, so it does not need shared memory and I do not have to struggle with the memory limit. The side product is that the CPU version is very cache friendly. Of course, the instruction count is larger than with the lookup table approaches, but that does not matter. The generator is bitboard-based and it uses the Kogge-Stone algorithm for sliding piece move generation. So, basically, there are lots of shift- and xor-operations and stuff like you usually see in bitboard engines.

I decided to pack the position in 48 bytes or six 64-bit variables. This is a reasonable compromise between size and instructions required for unpacking the information. It is important to keep the size small, because the positions will be written in the global memory on the GPU. Now the six variables take three 128-bit cache lines or three memory operations per position. Packing the positions tighter would not help unless I can fit everything under 256 bits resulting in only two memory operations. But that I cannot do. Well, actually, the smallest number of bytes for a position (including side to move, en passant square, castling rights, and counter for the 50-move rule) that I have found is 32 bytes or 256 bits, but it is not very practical. So I am happy with 48 bytes. The six bitboards are white pieces, pawns, knights, bishops+queens, rooks+queens, and kings. The extra information required is packed in the lowest and highest eight bits of the pawn-bitboard, because they are never used for pawns in legal positions.

The idea of writing the positions into the global memory lead to another idea. I can interleave the global memory writes with the move generation code. I do not generate move lists at all. I make the move at once when generating it and store the resulting position. Again, no shared memory is required. A position is so small that, ideally, it can fit in registers. A nice side product is that I do not need a data structure for a move, and I do not have to care about packing and unpacking information in it. This actually saves instructions, although that was not the goal here. It also saves us a few ifs, because I do not have to have branches for the special moves both the move generator and in the make() and unmake()-functions. There are no make() and unmake()-functions, so there cannot be ifs in them.

Yet another data structure, that is missing from my implementation, is the board with 64 squares telling which piece is on each of them. This is useful when determining if a piece is captured and which piece it is. But I do not need that, because my move generator will produce the captures so that it always knows which piece is captured. The capture move generator is designed so that it produces the captures in the commonly used Most Valuable Victim - Least Valuable Aggressor (MVV-LVA) order. This can be done by using the bitboards only.

I also get rid of the bit scan operations, because I do not need the square numbers to index the lookup tables. That is simply an unnecessary intermediate step, because the resulting positions consist of bitboards, and not square indices. There are relatively efficient ways to extract the square number of the least significant bit, like the de Bruijn-multiplication which requires a 64-element lookup table, but if I can live without it, even better.

I have to say I am very satisfied with the software architecture right now. Let's see if that changes when I have a chance to test the move generator performance.


Bits and pieces

I have been thinking about the GPU implementation again. The biggest issue is the memory consumption. I have previously shown that a position can be packed in about 40 bytes by using bitboards, so that everything can be unpacked with a few bit operations. Then the positions can reside in the shared memory or even in the registers. But move lists are the problem. To be exact, there are two kinds of move lists, those generated by the move generation routine for each position and those that are put in the stack when recursively calling the search. I call the former move lists and latter ones move stacks.

Moves in the move stack require more information, because they are used for undoing moves and that involves restoring the position as it was before the move, including castling rights, en passant squares, and half-move counters for the 50-move rule. But for move lists, less than that is sufficient, because the original position is known. The least amount of information that is required for unambiguously defining a move in a known position is: to- and from-squares plus possible promotion. This can be fitted in two bytes. This is enough for move lists produced during the move generation. With two bytes per move and average of 40 moves per position, this means about 80 bytes per ply. Then, for example a 15-ply search takes 1200 bytes for move lists and we are out of shared memory if there are more than 40 threads. So I have to think about doing something differently.

What I have been thinking is that I write a kernel that takes a position from the global memory, generates the moves, makes them, and writes the positions back to the global memory. The kernel is launched several times, always taking the previously generated positions as input and producing output that is one ply deeper. The problem with this approach is that there are typically tens of position to be written while there is only one position to read from the memory. The global memory bandwidth on my GPU (NVIDIA GeForce GTX 480: 177 Gb/s) is sufficient for writing three positions per clock cycle. So, given the average of 40 positions per thread, this means about 13 clock cycles per thread, assuming a large number of threads are running and amortized cost calculation is valid. I think this is of the same order of magnitude that it takes to generate a move, so it could be possible to hide the latency quite well.

Eventually, I might end up using the Kogge-Stone-style move generator, because it does not require lookup tables for sliding pieces. For other pieces, the lookup tables take just a few hundred bytes of shared memory. Since the positions can be in registers during the processing, the shared memory consumption is not that important. The move generator does not produce a move list, but makes the moves directly and writes the resulting positions to the global memory. Surprisingly, this might be the fastest way to do it.

I still have to find a way to implement the search, because this architecture is fine for perft(), but not for alpha-beta search as such.


Learning from Experts

Last week, I was listening to two presentations given by two guys working at NVIDIA Research. The presentations where about buidling tree structures on the GPU and about fast ray tracing on the GPU. These provided some insights on GPU programming and also assured me that I am on the right track. On the other hand, the presentation about ray tracing gave me some fresh ideas. Using global memory is not impossible if the memory latency is hidden by a large instruction count and sufficient number of threads. Another idea is to split the program in small subprograms, because smaller kernels take less registers and it is easier to make them branchless. This means that multiple launches of kernels are required. Kernel launches are slow, so the subtasks must be large enough or there must be a large amount of threads to compensate for the overhead. The challenge is to split the chess engine into small kernels.

I found another very useful presentation by Vasily Volkov: Better Performance at Lower Occupancy. The main observation is that memory latency often limits the performance, which is why maximizing multiprocessor occupancy is not the same as maximizing the instruction throughput. Surprisingly, lower occupancy often leads to higher throughput. It is clear that global memory is slow, but it might not be that clear that even the shared memory has noticable latency. It is about six times slower than the registers on the GPU that I have used (NVIDIA GeForce GTX 480). This explains why I got only about one sixth of the instruction throughput in my tests. It is better to use registers whenever possible. On the other hand, the number of register per multiprocessor is limited. The more there are threads on a multiprocessors, the less there are registers per thread. That is why it is sometimes useful to have less threads on a multiprocessor which means more registers for each thread and faster execution, because shared memory is accessed less often. Simple, but not necessarily obvious.

I have come to the conclusion that thread level parellellism is the only way to compete with modern CPUs with multiple cores at higher clock rates than the GPUs. For example the GTX 480 card has 15 multiprocessors, and if each of them runs one warp with 32 threads, this means 480 threads in parallel. This means 960 instructions per clock cycle, if the instructions are multipy-adds or 480 instructions per clock cycle with other simple instructions. With the shader clock rate of 1.4 GHz, this means either 1.345 Tflops (mentioned in the documentations as the peak computing power) or half of it in practice. If we are able to compute only one execution path per warp, then, most of the time at least, we are computing only 15 instructions per cycle. That is only 21 Gflops. Modern CPUs with multiple cores can achieve that with more flexibility. The order of magnitude speed up when using thread level parallellism is the only thing that justifies the use of GPUs.

The global memory speed on GTX 480 is 177 Gb/s. That means roughly 128 bits per clock cycle (@ 1.4 GHz) for all the threads together. So, if I store chess positions in the global memory, I need roughly 8 clock cycles to load it from the memory, assuming the board size of 128 bytes. If several threads work on the same position, this is a tolerable delay. For example, in the move generator I might run 256 threads in parallel to generate the moves in a position. The problem is writing the results back to the global memory. This can take hunders of clock cycles, depending on the number of moves. Assuming that the move generation itself takes tens of instructions per move, this might be a major factor limiting the performance. Actually, it could be possible to estimate the upper bound by assuming that writing the results dominates all the other execution costs. Above, we calculated that the writes take 8 clock cycles per position. Dividing the clock rate 1.4 GHz by 8 gives us 175 million positions per second. The order of magnitude is what we hoped for, if the goal is to produce 100 million positions per second. Not bad at all. Now I only have to implement it.


Recently, I have been writing a CPU chess engine. There are several reasons why I am doing this instead of the GPU chess engine. First, I need a reference against which to compare the GPU engine. Second, I can more easily test things with the CPU engine, because it is easier to debug the code and, e.g. set up test matches to tune the evaluation function. Third, I might end up with quite a strong CPU engine, which as such is an achievement.

The CPU chess engine has now fully functional move generator, which produces correct results in all the test cases I could find for perft(). The speed is modest 8-12 million nodes per second (no bulk counting; every leaf node is visited) on single core 3.2 GHz CPU. I might try to optimize this, but it is not the first thing on my priority list.

I also implemented a simple evaluation function. It is based on material, mobility, king safety, and pawn structure. I have not tried to tune the evaluation, because I will replace the evaluation function with a more principled approach. I have a general idea how the evaluation function should work, but I have not had time to work out the details yet. This is the most enjoyable part of the whole project, so I am in no rush.

Then, I have the alpha-beta search (with quienscence search) implemented with the iterative deepening at the root level. The principal variation is stored and it is searched first. The full principal variation search is not yet implemented, because I want to be sure I get the transposition table working properly first. I have tried the null-move and razoring at one ply from the leaf nodes, but currently they are not included.

The transposition tables are implemented by using the Zobrist keys for hashing. They are first calculated for the initial positions and updated in make() and unmake()-functions. This does not add much to the instruction count, so it is fast. I have tested the transposition tables with perft()-function and I get the correct results, but much faster. I use two transposition tables, one with the always-replace-strategy and another with the depth-preferred-strategy. Tests with perft()-function show about 15 % speed up compared to either of the strategies alone. I have not yet combined the hash tables with the alpha-beta search, because I want to be absolutely sure I do it properly. One must be careful with the type of the entries and the mate scores.

Although the development is still far from completed, I already have a chess engine that is able to play decent chess. I have played against it a few times now, and have not been able to beat it yet. It is strange that the engine seems to understand some positional concepts that I have not programmed into the evaluation function yet. I find two possible explanations for this: 1) the observed positional understanding follows from tactical considerations, and/or 2) some commonly used evaluation terms are overlapping, so the missing terms are implicitly included in the existing terms. This is interesting and raises a question: how to implement an evaluation function with (almost) orthogonal terms?