Some parallel computer chess algorithms require that the threads communicate with each other. One processor might act as a master and the others as slaves, or there might be a pool of jobs from which each thread fetches a task when it becomes idle. On the GPU, however, communication between threads is limited.
In CUDA-C, it is possible to use the __syncthreads() command to synchronize the execution of threads that belong to the same block. It blocks a thread until all the threads in the block have reached that point and all pending memory accesses have completed.
In addition, there are several atomic instructions for global and shared memory accesses. These operations perform an operation and update the memory without other threads interfering with the process. Such instructions are useful in parallel computing, because they allow implementing all kinds of advanced synchronization structures, such as semaphores, mutexes, and monitors.
If a thread must communicate something to the other threads, the only practical option is to write the data to a memory location that the other threads can read, with the other threads checking whether there is anything new in that location. In a chess engine, the transposition table is a natural way to share information between threads running a search algorithm.
However, at some point during the search, the workload must be split between the threads, after which there is very little communication between them. Because a large number of threads can run in parallel, it is a challenge to find tasks that take approximately equal time to execute. Most likely each thread runs the same search algorithm, but with different parameters. If the algorithm is alpha-beta search, the parameters are:
- depth, and
- search window.
The workload depends on both of these, but the depth probably has the most significant effect. That is why it does not seem like a good idea to assign the threads searches of different depths. On the other hand, there should then be hundreds of positions to search to the same depth. It is not enough to take one position and let the threads search the positions following the different moves in that position, because the move count per position is typically only a few dozen. In addition, the improving bounds cannot be utilized during the search, which makes this approach less beneficial.
One idea is to split the search window between the threads. If the granularity of the evaluation function is chosen appropriately, it might be possible to perform only null-window searches. The purpose of each search is to prove that the search from the position does not produce the score that is the null-window alpha bound for that thread. Then one of these null-window searches returns its alpha while the others fail, and we have found the correct score. Null-window searches are much faster than full-window searches, and they can be run in parallel without communication. This seems to me like an attractive choice for the GPU chess algorithm.