Testing and setting performance goals

Recently, I have been writing a move generator for the CPU. I wrote it so that, with minor changes, it can run on the GPU. This CPU move generator serves two purposes: (1) testing that the move generator produces the correct moves, and (2) setting a goal, in nodes per second, that the GPU move generator must beat.

It is easy to introduce bugs in move generator code, and debugging CPU code is easier than debugging GPU code: on the GPU one must implement the parallel execution correctly in addition to the move generator itself. So the first step is to get the move generator working properly on the CPU, and then port the code to the GPU.

I have written move generators before, so why not use one of them? None of them were written with parallel execution in mind, and they take too much memory to fit into shared memory. That is why I had to completely redesign the move generator. I try to minimize memory consumption so that 32 or 64 threads can run in parallel on one streaming multiprocessor, with no global memory accesses required except for reading the initial position.

Currently, most of the code has been written, but there are still bugs. I use a perft() function to test the move generator. It generates all the moves up to a specified depth and counts the nodes. The node counts can be compared to publicly available correct numbers; if they match, it is very likely that the move generator works properly. My perft() function looks like this:

unsigned int Board::perft (unsigned int depth)
{
 if (depth == 0) return 1;

 Move m;
 unsigned short ptr = generate (0, m);
 unsigned int nodes = 0;
 while (ptr)
 {
  make (m);
  nodes += perft (depth - 1);
  unmake (m);
  ptr = generate (ptr, m);
 }

 return nodes;
}
There is one peculiar feature in the move generator routine: it generates only one move per call and returns a pointer that can be passed back to the generator to produce the next move. If it returns zero, there are no legal moves left. The performance test includes the make() and unmake() functions for making a move and undoing it.

So how efficient is it now? I have tested the current buggy version of the generator by calling perft(6) and timing it. The result is about 10.2 million nodes per second on one core (Intel(R) Core(TM) i7 CPU 930 @ 2.8 GHz). My computer has eight logical cores in total, so it should achieve about 80 Mnps if all cores were used. So this is the minimum speed the GPU must beat to be faster than the CPU. I hope I can reach 100 Mnps. That reminds me of the first version of IBM's chess computer Deep Blue back in 1996, which evaluated 100 million positions per second (Deep Blue in Wikipedia). To be accurate, positions evaluated is different from moves generated, but the magic number is the same. That's the goal anyway.


  1. In move generation, bitboards and magics really show their power as far as I know (also in evaluation). Might it be worth considering a bitboard using two 32-bit integers per board?

    Oh, and a quick fixme: Deep Blue was actually capable of evaluating 200M positions per second.

  2. The main problem with bitboards is memory consumption. On the GPU it is crucial to fit everything into shared memory. The benefits of bitboards are lost if you have to use global memory; it's about 100 times slower that way.

    And from Wikipedia: "It was capable of evaluating 200 million positions per second, twice as fast as the 1996 version." So the 1996 version evaluated 100M positions as I said. You are perhaps referring to the 1997 version.