Wild ideas

Sometimes small steps and patience are all you need for success. But not always. Man did not go to the moon by taking only small steps. At first, someone had to have the wild idea that going to the moon is possible. Then, someone had to invent the rocket engine. Only after the ideas and the technology were there could the basic engineering work begin. Similarly, I do not believe that a GPU can play good chess by just applying the known algorithms and making small improvements. Either the GPUs must change or a new algorithm has to be invented. While waiting for the former, I will try to do the latter.

One idea I have been playing with is using only zero window (or null window) search on the GPU. A zero window search gives a boolean result: the position is either better or worse than the given bound. It could be possible to launch hundreds of zero window queries from different positions with different bounds. The positions and bounds would be chosen so that the expected utility of the gained information is maximized. The partial game tree would be stored and updated by the CPU.
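To make the idea concrete, a zero window probe is just an alpha-beta call with a window of one point: searching with the window (bound - 1, bound) tells whether the score reaches the bound. Below is a minimal sketch on a toy game tree; the Node structure and all the names are my own illustration, not engine code.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy game tree for illustration: a node is either an interior node
// with children, or a leaf carrying a score for the side to move.
struct Node { std::vector<int> children; int value; };

static std::vector<Node> tree = { {{1, 2}, 0}, {{}, 3}, {{}, 5} };

int negamax (int n, int alpha, int beta)
{
    if (tree[n].children.empty ())
        return tree[n].value;
    for (int c : tree[n].children) {
        int score = -negamax (c, -beta, -alpha);
        alpha = std::max (alpha, score);
        if (alpha >= beta)
            break;                      // cutoff
    }
    return alpha;
}

// Zero window probe: is the true score of node n at least 'bound'?
bool failsHigh (int n, int bound)
{
    return negamax (n, bound - 1, bound) >= bound;
}
```

Hundreds of such probes, each with its own position and bound, could then be batched into one GPU launch while the CPU keeps the partial game tree.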

I think it makes sense to use both the CPU and the GPU, because both of them have their own strengths. The CPU is more flexible, so it is better suited for control tasks. The GPU, on the other hand, is a workhorse that can be given the mundane number crunching tasks. A simple idea would be to search the principal variation (PV) nodes with the CPU and let the GPU search the other nodes with a zero window. However, that is not a very efficient use of computing power. The subtrees given to the GPU must be large enough to compensate for the extra time required for GPU kernel launches. The sizes of the subtrees must also be similar, because the slowest thread determines the performance of the whole search. In addition, the subtrees must have bounds from the PV nodes before they are run.

There are sophisticated algorithms for splitting the search tree dynamically (see e.g. The DTS high performance parallel tree search algorithm), but they usually require more communication between the processors than is possible on the GPU. If there is no communication during the search, we must somehow estimate the subtree sizes before the search in order to balance the workload. It might be possible to do this by utilizing information from shallower searches. Shallow searches and iterative deepening are often useful for move ordering purposes; we could use the same data for load balancing as well.
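As a rough illustration of such load balancing, the node counts from a shallow search could serve as size estimates, and the subtrees could be assigned greedily to the least-loaded warp, largest first (the classic largest-processing-time heuristic). This is only a sketch with invented names, not tested GPU code.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Assign each subtree (size estimated from a shallow search) to the
// currently least-loaded warp, largest subtrees first (LPT heuristic).
std::vector<int> balance (std::vector<long long> sizes, int warps)
{
    std::vector<long long> load (warps, 0);
    std::vector<int> owner (sizes.size ());
    std::vector<int> order (sizes.size ());
    for (int i = 0; i < (int) sizes.size (); i++)
        order[i] = i;
    std::sort (order.begin (), order.end (),
               [&] (int a, int b) { return sizes[a] > sizes[b]; });
    for (int i : order) {
        int w = (int) (std::min_element (load.begin (), load.end ())
                       - load.begin ());
        owner[i] = w;
        load[w] += sizes[i];
    }
    return owner;
}

// The slowest warp determines the search time, so the figure of merit
// is the largest total load given to any warp.
long long maxLoad (const std::vector<long long> &sizes, int warps)
{
    std::vector<int> owner = balance (sizes, warps);
    std::vector<long long> load (warps, 0);
    for (int i = 0; i < (int) sizes.size (); i++)
        load[owner[i]] += sizes[i];
    return *std::max_element (load.begin (), load.end ());
}
```

The quality of the balance then depends entirely on how well the shallow-search node counts predict the real subtree sizes.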

Hopefully I can find time for coding in the near future. Then I can tell more.


Transposition table and hashing issues

The transposition table is an essential part of an efficient chess engine. It stores the results of searched subtrees in a hash table. Then, if the search reaches the root of such a subtree by a different move order, it can use the stored results directly instead of repeating the search. It is easy, however, to introduce nasty bugs in its implementation. There are two kinds of errors that arise with transposition table code: (1) hash collisions, and (2) invalid use of hashed search results. In addition, the transposition table must, most probably, reside in the global memory. That is not only slow, but also allows a third class of errors, i.e. corruption of the hash entries by concurrent actions of multiple threads. Let us examine the possible errors one by one and then sketch a plan for the implementation.

Hash collisions occur when two positions are either given the same hash key or mapped to the same slot in the hash table. The latter case is much more common than the former if the hash keys are long enough. It is also easier to avoid: simply storing the hash keys in the transposition table and comparing them is enough. The case where the hash keys themselves are the same is more difficult. It can be avoided to some extent by producing the hash keys properly, e.g. by Zobrist hashing (see http://chessprogramming.wikispaces.com/Zobrist+Hashing). However, even with good hashing, the probability of two positions being given the same key is non-zero. With 32-bit keys, there are only 4 billion possible keys. Then, if we have 4 million positions in the transposition table, the probability of a new position being given a key already given to another position is about 1:1000. This might sound like a small number, but if you repeat this a million times, you get a thousand errors or so. That is enough to completely ruin the search. An easy solution is to use longer keys. With 64-bit keys these kinds of errors occur far more rarely, but they still do occur, given a large enough transposition table and a search that is efficient enough to visit billions of nodes during the game. So one must design the transposition table carefully so that the probability of these errors is minimized.
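These back-of-the-envelope numbers can be checked with the birthday approximation: among n random keys of b bits, the expected number of colliding pairs is roughly n^2 / 2^(b+1). A quick sketch (the function name is mine):

```cpp
#include <cassert>
#include <cmath>

// Expected number of colliding key pairs among n random keys of
// 'bits' bits, by the birthday approximation n^2 / 2^(bits + 1).
double expectedCollisions (double n, int bits)
{
    return n * n / std::ldexp (1.0, bits + 1);
}
```

With 4 million positions and 32-bit keys this gives on the order of two thousand colliding pairs, in line with the estimate above, while with 64-bit keys the expectation stays below one even for billions of positions.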

The hashed results can also be used for the wrong purposes. It is, of course, necessary to check that the stored entry is still valid and deep enough. It is also necessary to check the search bounds. In particular, zero window searches do not give accurate results, but only fail high or low, so the search window stored in the transposition entry must be compared to the current one. Sometimes it is still possible to use the transposition table entry for move ordering even when the stored score cannot be used in the current search.

So where in the search tree should the transposition table be used? Its purpose is to save computation time. Two factors affect the potential saving: the number of nodes in the subtree corresponding to the hash entry, and the probability of the search probing the entry. The number of nodes in the subtree is greatest near the root. However, the probability of hitting the same node is zero for fewer than 3 plies. After that, the probability increases gradually, as long as the transposition table is large enough. Imagine an infinitely large transposition table with perfect hashing. Then the probability of finding a position approaches 100 % as the depth of the search increases. Why? Eventually the search will visit all the positions that are required to prove the absolute result, i.e. a win for either side or a draw. This is only a theoretical consideration, because in practice the transposition tables are of finite size and the hashing is less than perfect. With a small transposition table, the probability of probing a node decreases once the number of stored nodes becomes comparable to the size of the table. This happens because we run out of places in the hash table, and the collisions lead to discarding data (either the just searched position or the transposition table entry). This is one reason why saving leaf nodes in the transposition table does not sound good (the other reason being that the subtree is only one node plus possible quiescence search nodes).

I have made simple tests with a transposition table on the CPU. I have written a version of the perft ()-function (see previous posts for an explanation) that uses a transposition table. The transposition table entries consist of four unsigned ints, which take 16 bytes of memory in total. Two of them are hash keys. The first is the primary key, used for indexing, and the second is for checking the validity of the hash entry in the case of a collision. I use the Zobrist hashing scheme. The other two values are the depth and the score. These are included just for testing, because perft does not return a score. In addition, the 128-bit hash entry size corresponds to the memory alignment of the global memory. I ran the tests with 256 MB and 1 GB transposition tables with 16 M and 64 M entries, respectively. These correspond to possible global memory sizes on modern GPUs. I think we can use as much memory as possible, because the global memory access is slow regardless of the table size. The newest GPUs have caches for the global memory, but due to the random access pattern of the hash probes they are of little help. The results look like this (initial position, 1 GB table):

perft (7) = 3195075629 at 39.93 Mnps
depth 1 hash table: 42.477849 % hits, 18.217728 % collisions
depth 2 hash table: 58.034033 % hits, 4.928729 % collisions
depth 3 hash table: 34.097889 % hits, 1.510128 % collisions
depth 4 hash table: 33.115211 % hits, 0.386183 % collisions
depth 5 hash table: 0.000000 % hits, 0.237530 % collisions
depth 6 hash table: 0.000000 % hits, 0.000000 % collisions
depth 7 hash table: 0.000000 % hits, 0.000000 % collisions

The depth refers to the remaining depth in the tree traversal. Leaf nodes were not stored in the transposition table. The node count is wrong (it should be 3,195,901,860) because of hash collisions that went undetected. It is clear that 32-bit hash keys are not sufficient for avoiding errors. Another thing that can be seen is that the utility of the transposition table is at its best when the number of nodes is still much less than the table size. There are about 2 million nodes at depth 2. For the 64 million entry transposition table this results in the best hit rate. At depth 1, there are already about 24 million nodes, which results in lots of collisions, reducing the hit rate.

It should be noted that the hit rates are different for different search algorithms. The numbers above are not representative of typical transposition table hit rates with alpha-beta search algorithms, because the perft ()-function visits all the nodes in the search tree. Hit rates are smaller when the visited nodes are selected based on the alpha and beta bounds.
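To sketch the entry layout described above in code (all the names are mine, and the table is made tiny here for illustration): the primary key selects the slot, and the second key catches index collisions when probing.

```cpp
#include <cassert>
#include <cstdint>

// 16-byte entry: two keys, depth and score, as described above.
struct HashEntry {
    std::uint32_t key1;    // primary key, also used for indexing
    std::uint32_t key2;    // verification key
    std::uint32_t depth;
    std::uint32_t score;
};

const unsigned int TABLE_BITS = 16;    // 64 K entries, small for testing
static HashEntry table[1u << TABLE_BITS];

void store (std::uint32_t key1, std::uint32_t key2,
            std::uint32_t depth, std::uint32_t score)
{
    HashEntry &e = table[key1 & ((1u << TABLE_BITS) - 1)];
    e.key1 = key1; e.key2 = key2; e.depth = depth; e.score = score;
}

// Returns the stored score, or -1 if the entry does not match or is
// not deep enough for the current search.
int probe (std::uint32_t key1, std::uint32_t key2, std::uint32_t depth)
{
    const HashEntry &e = table[key1 & ((1u << TABLE_BITS) - 1)];
    if (e.key1 != key1 || e.key2 != key2)
        return -1;                     // empty slot or collision
    if (e.depth < depth)
        return -1;                     // result too shallow
    return (int) e.score;
}
```

A real implementation would of course also need the bound type of the stored score and some protection against concurrent writes from multiple threads.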


Parallel evaluation

The limited shared memory and the thread divergence issues make it difficult to achieve efficient parallelism at the thread level. Therefore, a viable alternative is to let all the threads in a warp work on one position. Then there is more memory per position, and divergence is not an issue, because all the threads in a warp follow the same execution path. The challenge is to find enough work for all the threads. So, how to take care of this? In the following I present my initial thoughts.

Most of the nodes are leaf nodes. Imagine a search tree with a constant branching factor of 2. It is easy to see that the number of nodes per level follows the series 1, 2, 4, 8, 16, 32, 64, 128, ..., and the sum of all the previous numbers is the last number minus one. It follows from this observation that most of the nodes in this tree are leaf nodes. With a larger branching factor, the ratio of leaf nodes to other nodes is even larger. Although there are cases where the local branching factor is one, it is a safe assumption that a typical chess search tree consists mostly of leaf nodes. We can utilize this piece of information by doing the evaluation in parallel. The evaluation function is called mostly at leaf nodes, so making it faster helps the overall performance.
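The arithmetic is easy to verify: a uniform tree with branching factor b and depth d has b^d leaves but only (b^d - 1)/(b - 1) interior nodes, so roughly a fraction (b - 1)/b of all nodes are leaves. Small helpers (names are mine) to check this:

```cpp
#include <cassert>

// b^d by repeated multiplication; fine for the small numbers used here.
long long power (long long b, int d)
{
    long long r = 1;
    while (d-- > 0)
        r *= b;
    return r;
}

long long leafNodes (long long b, int d)     { return power (b, d); }
long long interiorNodes (long long b, int d) { return (power (b, d) - 1) / (b - 1); }
```

With b = 2 and d = 7 this reproduces the series above: 128 leaves against 127 interior nodes. With a chess-like branching factor of 35, the leaf nodes outnumber the rest by more than thirty to one.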

If we have an array of pieces stored, we could just let each thread handle one element in the array, i.e. one piece. But then, in the endgame, most of the threads would do nothing. That does not sound too good. In addition, the evaluation methods for different pieces differ from each other. Using threads for the elementary bit operations on the bitboards is not wise either, because the GPU has native lsb and popcount instructions for finding the least significant bit and counting the set bits, respectively. I am not even sure if I will eventually use bitboards. It is probably better if the threads work together on somewhat larger features of the position.

I have come up with an evaluation function that, in its simplicity, could look something like:

int Board::evaluate ()
{
    int score = 0;
    evaluate_targets ();
    for (int i = 0; i < number_of_pieces; i++) {
        int j = pieces[i].getPieceType ();
        int k = pieces[i].getPosition ();
        score += material_score[j];
        score += placement_score[j][k];
    }
    return score;
}

This seems pretty simple. Every piece contributes its material value and its placement score to the total score. The trick is that the placement score table is dynamic. It depends mostly on the pawn structure and the position of the king, and it is calculated in the evaluate_targets ()-function. Its values could be stored in a hash table, but it is probably cheaper to calculate them in parallel, because the hash table would not fit into shared memory.

I have had no time to think this more thoroughly yet, but I will do that in the near future.


Move generation code

I noticed that move lists take too much space. Assuming a move takes 4 bytes, a move list can take hundreds of bytes per thread. In a typical middlegame position, there can be 40 moves. That is already 160 bytes. In the worst case, there can be over 200 moves, which is 800 bytes. Multiply this by the number of threads, e.g. 64, and you get a big number: 64 x 800 bytes = 51200 bytes or 50 kB. That is not very nice. The only practical option that I can see is to generate one move at a time. Then we have to keep track of where we left off the last time we generated a move.

A naive move generator could look roughly like this:

int Board::generateList (Move *moveList)
{
    int index = 0;
    for (int i = 0; i < number_of_pieces; i++) {
        Piece p = pieces[i];
        int fromSquare = p.getSquare ();
        for (int j = 0; j < p.getMaxNoDirections (); j++) {
            int direction = p.getDirection (j);
            for (int k = 0; k < p.getMaxNoSteps (direction); k++) {
                int toSquare = fromSquare + direction * k;
                Move move (p, fromSquare, toSquare);
                if (validMove (move))
                    moveList[index++] = move;
            }
        }
    }
    return index;
}

The code should be quite easy to read. It loops through all the pieces, then all the directions for the piece, and finally all the steps in that direction. The validity of each move is tested before adding it to the list. What bitboards do is speed up the two inner loops and the validity test by generating only legal (or pseudo-legal) moves.

Unfortunately, we do not have enough memory for the entire move list. The move generation could easily be split so that one call generates the moves for only one piece. We might still need as many as 27 moves in the list (a queen in the middle of an almost empty board). This requires 6912 bytes or 6.75 kB for 64 threads. That could be quite ok, but there is still a divergence problem if the threads generate different numbers of moves. The threads then spend different amounts of time generating the moves, and the slowest thread determines the pace of the others.

It is possible to arrange the move generation in another way:

int Board::generateOne (int ptr, Move &move)
{
    while (ptr < max_ptr) {
        int i = (ptr >> 6);
        int j = (ptr >> 3) & 7;
        int k = ptr & 7;
        Piece p = pieces[i];
        int direction = p.getDirection (j);
        int fromSquare = p.getSquare ();
        int toSquare = fromSquare + direction * k;
        move = Move (p, fromSquare, toSquare);
        ptr++;
        if (validMove (move))
            return ptr;
    }
    return 0;
}

I have omitted some details here, so that it is easier to get the main idea. Basically, I have collapsed the for loops into one loop with a pointer that keeps track of the progress. The function returns only one move and a pointer telling where to continue the move generation. The highest bits in the pointer are the piece index, the next 3 bits are the direction index, and the lowest 3 bits are the steps. Many of the combinations of the indices are invalid, and in a practical implementation it is important to skip over them to avoid spending too many iterations in the while loop. My current move generator uses this approach to generate one move at a time, but it has more elaborate routines for skipping over invalid pointer values.
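The pointer arithmetic can also be written out as small helpers that just restate the bit layout above (the names are mine):

```cpp
#include <cassert>

// Piece index in the high bits, then 3 bits of direction index and
// 3 bits of step count, exactly as described in the text.
int makePtr (int piece, int dir, int step) { return (piece << 6) | (dir << 3) | step; }
int ptrPiece (int ptr) { return ptr >> 6; }
int ptrDir (int ptr)   { return (ptr >> 3) & 7; }
int ptrStep (int ptr)  { return ptr & 7; }
```

Because the fields are packed from the most significant bits down, incrementing the pointer walks through steps first, then directions, then pieces, which is exactly the nesting order of the original loops.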

I have also been thinking about the option of using each multiprocessor to calculate one branch at a time, so that the search path is the same for every thread in a warp. Then there is no divergence problem, and even branching can be allowed without too much penalty. However, the massive parallelism becomes only ordinary parallelism at two levels in this approach. Threads inside a warp are used for speeding up operations like move generation and evaluation. On the higher level, warps are used like parallel processors when searching the game tree. I think the main problem is whether there is enough to do for the 32 threads inside a warp. Well, there are at most 32 pieces on the board, so each thread could evaluate one piece, but is it wise to use the threads like this?


Memory vs. speed

The trade-off between memory and speed is quite common in many computer algorithms. It is often possible to use more memory to gain speed. On the other hand, sometimes the memory space is limited and one must be content with the memory that is available. On the GPU, the situation is complex, because there are different types of memory. There is plenty of global memory, but it is slow. And there is very little on-chip memory, called shared memory, which is fast. Then there are other memory types that might be cached. But the shared memory is the preferred choice, for the move generator at least.

There is either 16 kB or 48 kB of shared memory. On some GPUs, one can select whether there is 16 or 48 kB of shared memory, with the rest of the on-chip memory used as L1 cache. To play it safe, we try to fit the core of the algorithms in 16 kB. Because every thread must have the position it is working on in shared memory, it makes sense to squeeze the position into as few bits as possible without degrading the performance of the algorithms too much. In the end, it will be a compromise between memory and speed, and testing is required to find the best solution.

So how many bits are required for unambiguously describing a chess position? There are at most 32 pieces, and they can be on 64 (= 2^6) different squares. There are exceptions, like pawns that can only be on 48 different squares, but we will not go into such details here. In total, the piece information takes 32 x 6 bits = 192 bits or 24 bytes. This is not enough, however, because we cannot tell the piece types. Initially it is possible to fix the pieces to certain indices in the piece array, e.g. the white queen is always at index 4. But the piece types change due to promotions, and captures remove pieces from the board. So we must keep track of the piece types, which requires 3 more bits per piece. Notice that it is not necessary to store the piece color, because we can arrange the array so that the first 16 indices are for white pieces and the next 16 for black pieces. Eventually, we have 32 x 9 bits = 288 bits in total. In addition, we need 4 bits for the castling rights, 1 bit for whose turn it is, and 4 bits for the en passant (EP) square (4 bits is enough, because we know the rank; only the file must be determined with 3 bits, and one extra bit is required to tell whether the EP is valid). This means the total number of bits is 297, which is 37 bytes and a bit. In a practical implementation, we might end up using 16 bits per piece, because 9 bits do not align well with byte boundaries. This leads to 32 x 16 bits = 512 bits or 64 bytes. The additional flags can be hidden in the unused high bits.
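A 16-bit piece encoding along these lines could look like the following sketch. The field order is my choice: the square in the low 6 bits, the type in the next 3 bits, and the remaining bits free for flags.

```cpp
#include <cassert>

// Pack one piece into the low 9 bits of a 16-bit word:
// bits 0-5 hold the square, bits 6-8 hold the piece type.
unsigned int packPiece (unsigned int type, unsigned int square)
{
    return (type << 6) | square;
}

unsigned int pieceType (unsigned int p)   { return (p >> 6) & 7; }
unsigned int pieceSquare (unsigned int p) { return p & 63; }
```

The round trip recovers both fields, and even the largest type/square combination stays within 9 bits, leaving the high bits of the 16-bit word free for the castling, turn, and EP flags mentioned above.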

How does this compare with bitboards? I take the chess engine Crafty as an example, since its source code is publicly available. It uses 7 bitboards per side for the board, plus some extra information that we can ignore in this context. These 14 bitboards take 64 bits each, or 14 x 64 bits = 896 bits in total, which is 112 bytes. It is possible to use fewer bitboards if we are willing to abandon rotated bitboards. The absolute minimum, I think, is 5 bitboards (white, pawns, knights, bishop-like pieces, rook-like pieces) plus two variables telling where the kings are (I would not waste bitboards on that). That takes 5 x 64 bits + 16 bits = 336 bits or 42 bytes. This sounds good to me. Although it is more than 37 bytes, it is certainly less than 64 bytes. The downside is that we do not have rotated bitboards.

Now, let me confirm that we can get all pieces (except kings) out of the five bitboards I mentioned:

wP = w & P;
wN = w & N;
wQ = w & B & R;
wB = w & B & ~wQ;
wR = w & R & ~wQ;
bP = ~w & P;
bN = ~w & N;
bQ = ~w & B & R;
bB = ~w & B & ~bQ;
bR = ~w & R & ~bQ;

where w = white, b = black, P = pawn, N = knight, B = bishop (edit: includes queens), R = rook (edit: includes queens), and Q = queen.
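As a sanity check, the decoding rules can be verified on the initial position. The five bitboard constants below are my own encoding with bit 0 = a1 and bit 63 = h8; B and R include the queens, as described above.

```cpp
#include <cassert>
#include <cstdint>

typedef std::uint64_t U64;

// Initial position, bit 0 = a1, bit 63 = h8.
const U64 w = 0x000000000000FFFFULL;   // all white pieces
const U64 P = 0x00FF00000000FF00ULL;   // all pawns
const U64 N = 0x4200000000000042ULL;   // all knights
const U64 B = 0x2C0000000000002CULL;   // bishops + queens
const U64 R = 0x8900000000000089ULL;   // rooks + queens

const U64 wQ = w & B & R;              // white queen
const U64 bQ = ~w & B & R;             // black queen
const U64 wB = w & B & ~wQ;            // white bishops
const U64 bR = ~w & R & ~bQ;           // black rooks
```

Note that ~w also covers the empty squares, but that does no harm: the intersection with P, N, B, or R always masks the result down to actual pieces.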

So how much space does this take for several threads? Let us assume 64 threads. Then 64 x 42 bytes = 2688 bytes or about 2.6 kB. That is very nice. It leaves plenty of space for the move stacks. The move data structure must contain all the data required to undo the move. That includes 6 bits each for the source and destination squares, 3 bits for the captured piece, 2 bits for the castling rights, 4 bits for the EP square, and 2 bits for the promotion piece. This is 23 bits, so strictly only 3 bytes are required. Because of data alignment, it might be better to use 4 bytes per move, which lets us position the flags more conveniently. So for 64 threads, the stack takes 64 x 4 bytes = 256 bytes per ply. In theory, the quiescence search can be as long as 30 plies. Adding this to the end of a regular search of 20 plies means a 50-ply sequence, which takes 50 x 256 bytes = 12800 bytes or 12.5 kB. We are still under 16 kB (2.6 kB + 12.5 kB = 15.1 kB).
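A possible 32-bit packing of those 23 move bits could look like this; the field order and names are my choice:

```cpp
#include <cassert>

// from: 6 bits, to: 6 bits, captured piece: 3 bits, castling rights:
// 2 bits, EP square: 4 bits, promotion piece: 2 bits -- 23 bits total.
unsigned int packMove (unsigned int from, unsigned int to,
                       unsigned int captured, unsigned int castling,
                       unsigned int ep, unsigned int promotion)
{
    return from | (to << 6) | (captured << 12) | (castling << 15)
                | (ep << 17) | (promotion << 21);
}

unsigned int moveFrom (unsigned int m)     { return m & 63; }
unsigned int moveTo (unsigned int m)       { return (m >> 6) & 63; }
unsigned int moveCaptured (unsigned int m) { return (m >> 12) & 7; }
```

Everything fits below bit 23, so the top 9 bits of a 4-byte move are free for whatever extra flags turn out to be convenient.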

I will now seriously consider re-writing the move generator for bitboards. The only question is whether I can cope without the rotated bitboards or not. And thanks to Error323 for the comments about using bitboards. Those made me re-think my decisions about the data structures. This is one of the reasons I write this blog: I get other opinions that can improve the outcome of this project.


Thoughts about selective search

Because the branching factor affects the search so much, it makes sense to spend some time examining the different strategies used for pruning off some moves. Terminating the search before the search horizon might be risky, but on the other hand, some lines must be examined beyond the horizon in the quiescence search. The search depth will then be position-dependent in any case.

Null-move pruning is one technique often used in chess engines. It is based on the assumption that if a position fails high after one passes a move and lets the opponent move twice, then the position is likely to fail high also if the move were actually made and a complete search performed. This is usually a safe assumption. Exceptions are the so-called zugzwang positions, where the obligation to move is actually a disadvantage. Zugzwangs occur mostly in the endgame, so the null move should not be used there, unless one is extremely careful. The null-move search can be done with a shallower depth, so it will save a lot of time when the remaining depth is still large. One should not allow consecutive null moves, because that would reduce the search to nothing.
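In code, the null-move step sits at the top of the search function, before the normal move loop. The skeleton below only shows the control flow; Position, the evaluate stub, and the reduction constant R = 2 are stand-ins of mine, not a working engine.

```cpp
#include <cassert>

// Stand-in position: a static score plus the two flags the null-move
// conditions need. Only the control flow below matters.
struct Position { int score; bool inCheck; bool endgame; };

int evaluate (const Position &p) { return p.score; }

const int R = 2;                       // common null-move depth reduction

int search (Position p, int alpha, int beta, int depth, bool nullAllowed)
{
    if (depth <= 0)
        return evaluate (p);

    // Null move: pass and let the opponent move twice, searched with a
    // reduced depth and a zero window around beta. Skipped in the
    // endgame (zugzwang danger), when in check, and right after
    // another null move.
    if (nullAllowed && !p.inCheck && !p.endgame && depth > R) {
        int score = -search (p, -beta, -beta + 1, depth - 1 - R, false);
        if (score >= beta)
            return beta;               // fail high without a real search
    }

    // ... the normal move loop would go here; the stub just evaluates.
    return evaluate (p);
}
```

The zero window around beta is enough here, because the null-move search only needs to answer whether the position still fails high.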

Another widely used technique is pruning branches near the horizon. If the position is quiet and would fail high with a clear margin were the search terminated at the current depth, the search can be terminated. This is usually safe only one or two plies from the horizon; otherwise there are too many opportunities for the score to drop below beta. Also, the margin must be set high enough, and the position must be quiet (not in the middle of a capturing sequence).

Quiescence search is launched at the horizon depth. If the position is quiet, it calls the evaluation function directly. Otherwise, it searches through capturing sequences, and probably some checks, to see if the score would differ from the positional evaluation. One must be careful to include only relevant moves in the search, or otherwise the search tree can explode.
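The quiet-position logic at the top of quiescence search is usually written as a stand-pat check. Below is a skeleton with stand-in types of my own; the capture loop itself is omitted.

```cpp
#include <cassert>

// Stand-in position carrying only a static score.
struct Position { int staticScore; };

int evaluate (const Position &p) { return p.staticScore; }

int quiescence (Position p, int alpha, int beta)
{
    int standPat = evaluate (p);        // positional evaluation first
    if (standPat >= beta)
        return beta;                    // quiet and already failing high
    if (standPat > alpha)
        alpha = standPat;               // the side to move can stand pat
    // ... search capturing moves (and perhaps checks) here ...
    return alpha;
}
```

The stand-pat score acts as a lower bound because the side to move can usually decline all captures, which is exactly the assumption that breaks down in zugzwang.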

These ideas could be brought one step further by adding some expert knowledge into the move selection. The choice of moves at each depth would depend on the potential of those moves. A full search would be done only for the first few plies. The rest of the search would be like quiescence search. Then there would be two kinds of evaluation functions: one for positional evaluation and another for move selection. The quality of the evaluation functions is then critical to the quality of the search.

But these ideas are still just ideas. I will experiment with selective search strategies after an implementation of the alpha-beta search is running on the GPU.


Testing and setting performance goals

Recently, I have been writing a move generator for the CPU. I write it so that with minor changes, it can be run on the GPU. This CPU move generator has two purposes: (1) testing that the move generator produces the correct moves, and (2) setting a goal, in terms of nodes per second, that the GPU move generator must beat.

It is easy to write bugs in the move generator code. Debugging the CPU code is easier than debugging the GPU code, because in the GPU code one must implement the parallel execution properly in addition to the move generator itself. So the first step is to get the move generator working properly on the CPU, and then port the code to the GPU.

I have written move generators before, so why not use one of them? None of them were written with parallel execution in mind. They take too much memory to fit into the shared memory. That is why I had to completely re-design the move generator. I try to minimize the memory consumption so that 32 or 64 threads can run in parallel on one streaming multiprocessor, and no global memory accesses are required, except for reading the initial position.

Currently, most of the code has been written, but there are bugs. I use a perft() -function to test the move generator. It generates all the moves up to a specified depth and counts the nodes. The node counts can be compared to publicly available correct numbers. If the numbers match, it is very likely that the move generator is working properly. My perft() -function looks like this:

unsigned int Board::perft (unsigned int depth)
{
    if (depth == 0)
        return 1;

    Move m;
    unsigned short ptr = generate (0, m);
    unsigned int nodes = 0;
    while (ptr) {
        make (m);
        nodes += perft (depth - 1);
        unmake (m);
        ptr = generate (ptr, m);
    }

    return nodes;
}

There is one peculiar feature in the move generator routine. It generates only one move per function call and returns a pointer that can be given to the generator to calculate the next move. If it returns zero, there are no legal moves left. The performance test includes the make() and unmake() -functions for making a move and undoing it.

So how efficient is it now? I have tested the current, still buggy version of the generator by calling perft (6) and timing it. The result is about 10.2 million nodes per second on one core (Intel(R) Core(TM) i7 CPU 930 @ 2.8 GHz). My computer has eight cores in total, so it should achieve about 80 Mnps if all cores were used. This is the minimum speed the GPU must beat to be faster than the CPUs. I hope I can make it 100 Mnps. That reminds me of the first version of IBM's chess computer Deep Blue back in 1996. It evaluated 100 M positions per second (see Deep Blue in Wikipedia). To be accurate, positions evaluated is different from moves generated, but the magic number is the same. That's the goal anyway.