Chess on a GPGPU: Move generation code

I noticed that move lists take too much space. Assuming a move takes 4 bytes, then a move list can take hundreds of bytes per thread. In a typical middlegame position, there can be 40 moves. This is already 160 bytes. In the worst case, there can be over 200 moves. That is 800 bytes. Multiply this by the number of threads, e.g. 64, and you have a big number. In the example 64 x 800 bytes = 51200 bytes or 50 kB. That is not very nice. The only practical option, that I can see, is to generate one move at a time. Then we have to keep track where we left last time we generated a move.

A naive move generator could look roughly like this:

int Board::generateList (Move *moveList)
{
    int index = 0;
    for (int i = 0; i < number_of_pieces; i++)
    {
        Piece p = pieces[i];
        int fromSquare = p.getSquare ();
        for (int j = 0; j < p.getMaxNoDirections (); j++)
        {
            int direction = p.getDirection (j);
            for (int k = 0; k < p.getMaxNoSteps (direction); k++)
            {                
                int toSquare = fromSquare + direction * k;
                Move move (p, fromSquare, toSquare);
                if (validMove (move))
                {
                    moveList[index] = move;
                    index++;
                }
            }
        }
    }
    return index;
}

The code should be quite easy to read. It loops though all the pieces, then all directions for the piece, and finally all the steps in that direction. Finally the validity of the move is tested before adding it to the list of moves. What bitboards do is speed up the two inner loops and the validity test by only generating legal (or pseudo-legal) moves.

Unfortunately, we do not have enough memory for the entire move list. The move generation could be easily split so that it generates moves for only one piece in one call. We might still need as many as 27 moves in the list (queen in the middle of an almost empty board). This requires 6912 bytes or 6.75 kB for 64 threads. That could be quite ok, but there is still a divergence problem, if the threads have different numbers of moves generated. The threads then spend different times generating the moves, and the slowest thread determines the pace of the others.

It is possible to arrange the move generation in another way:

int Board::generateOne (int ptr, Move &move)
{
    while (ptr < max_ptr)
    {
        int i = (ptr >> 6);
        int j = (ptr >> 3) & 7;
        int k = ptr & 7;
        Piece p = pieces[i];
        int direction = p.getDirection (j);
        int fromSquare = p.getSquare ();
        int toSquare = fromSquare + direction * k;
        move = Move (p, fromSquare, toSquare);
        if  (validMove (move))
        {
            return ptr+1;
        }
        ptr++;
    }
    return 0;
}

I have omitted some details here, so that it is easier to get the main idea here. Basically, I have collapsed the for loops into one loop and a pointer that keeps track of the progress. The function returns only one move and a pointer where to continue the move generation. The highest bits in the pointer are the piece index, next 3 bits are the direction index, and the lowest 3 bits are the steps. Quite many of the combinations of the indices are invalid and in a practical implementation it is important to skip over them to avoid spending too many iterations in the while-loop. My current move generator uses this approach to generate one move at a time, but it has more elaborate routines for skipping over invalid pointer values.

I have also been thinking about the option that I use each multiprocessor to calculate one branch at a time so that the search path is the same for every thread in a warp. Then, there is no divergence problem, and even branching can be allowed without too much penalty. However, the massive parallellism becomes only ordinary parallellism at two levels in this approach. Threads inside a warp are used for speeding up the operations like move generation and evaluation. On the higher level, warps are used like parallel processors when searching the game tree. I think the main problem is then if there is enough to do for the 32 threads inside a warp. Well, there is at most 32 pieces on the board, so each thread could evaluate one piece, but is it wise to the threads like this?

2 comments:

Anonymous2/09/2013 11:35 AM
maybe a table driven move generator could avoid some of the checks for illegal pointers...

http://chessprogramming.wikispaces.com/Table-driven+Move+Generation#Conditional%20Linked%20List

--
Srdja
Samuel2/10/2013 5:22 PM
Yes, it could. Actually, I am using my own version of the table driven move generator. I cannot use large arrays, but I am using tables as much as I can.

2/07/2013

Move generation code

2 comments: