Branching on the GPU

Branching intructions are an essential ingredient in every computer program that does something really useful. In C-language if-statements as well as for- and while-loops have branching instructions under the hood. The control flow may change depending on some conditions. At the processor level that unpredictability in the control flow requires special handling.

Modern CPUs use pipelining which means that instructions which take several clock cycles to complete are started before the previous ones are done. That way the CPU can execute an instruction at every clock cycle or so. Braching instructions are a problem, because it is not possible to know in advance which instruction follows them. The program may need to execute the part after the if-statement or it may need to skip it. For the pipelining this means problems. The following instructions cannot be processed before the branching instruction is completed. This could mean a delay in the execution, unless there were a clever trick. The processor "predicts" which branch will be executed and puts the instructions in the pipeline. If the prediction was correct, the pipelining continues as if there were no branching. If it fails, then the incomplete instructions in the pipeline are just discarded, but we lose nothing compared to the case where we waited for the branching instruction to be completed.

Since branching causes problems even in the case of one processor, it is expected that it causes even more problems in parallel computing. And indeed, all the threads in a warp must execute the same instructions. If there is a if-statement, it is possible that the control flow diverges. If the condition inside the if-statement is evaluated as true in all threads or if it is evaluated as false in all threads, everything is fine. But the tricky case is when some threads require that the instructions inside the if-clause are executed while other threads require that they are not. What happens in this case is that all threads execute all instructions, but those that should not have executed them, discard the results. This is waste of clock cycles, but the program runs correctly. A similar thing happens when the number of threads is not divisible to 32 and a partial warp is run. The full warp is executed, but the results of some threads are discarded.

How bad is this branching problem? Basically, it prevents using the traditional alpha-beta search with recursive calls, because the execution paths of the threads quickly diverge. That means that the execution becomes sequential as all threads must wait for the current one to complete before starting their execution. I have two ideas how to get around this problem:

  • Re-write the recursion as an iterative algorithm with a stack.
  • Let all the threads in a warp to follow the same path in the search tree and utilize the parallellism only at the leaf nodes for evaluation.

So how to ulitize these ideas?

First, it is always possible to re-write recursive algorithms as iterative ones. This is easy to see if you consider how recursive function calls are actually implemented in the operating systems. There is a stack into which things are pushed when the function call is started and from which things are popped, when the function call finishes. On the other hand, the instructions are executed iteratively on one processor, no matter what is the structure of the function calls in the high level code. All this can be done explicitly by the programmer in the kernel. If every step in the search tree traversial is made identical, it does not matter if threads in the same warps follow different paths. The implementation is not easy, and the code will be messy, but the main thing is that it works.

Utilizing the warp level parallellism only at the leaf level is the second best option. Then the system is not massively parellel anymore, because there is usually only a handful of streaming multiprocessors on the GPU. Traditional parallel algorithms may then work best, although the limited communication capabilities between the processors can pose severe restrictions to this. The leaf node evaluation, if the parallellism can be fully utilized, will be about 30 times faster than with a single processor. That is a lot. If the evaluation is the bottleneck in the processing, then this might be a good idea. But if the slowdown is caused by some part of the search itself, things will not be that easy, although, e.g. I can imagine threads doing move generation in parallel.

Next I will consider the limited memory access on the GPU

No comments:

Post a Comment