4/08/2013

Learning from Experts

Last week, I was listening to two presentations given by two guys working at NVIDIA Research. The presentations where about buidling tree structures on the GPU and about fast ray tracing on the GPU. These provided some insights on GPU programming and also assured me that I am on the right track. On the other hand, the presentation about ray tracing gave me some fresh ideas. Using global memory is not impossible if the memory latency is hidden by a large instruction count and sufficient number of threads. Another idea is to split the program in small subprograms, because smaller kernels take less registers and it is easier to make them branchless. This means that multiple launches of kernels are required. Kernel launches are slow, so the subtasks must be large enough or there must be a large amount of threads to compensate for the overhead. The challenge is to split the chess engine into small kernels.

I found another very useful presentation by Vasily Volkov: Better Performance at Lower Occupancy. The main observation is that memory latency often limits the performance, which is why maximizing multiprocessor occupancy is not the same as maximizing the instruction throughput. Surprisingly, lower occupancy often leads to higher throughput. It is clear that global memory is slow, but it might not be that clear that even the shared memory has noticable latency. It is about six times slower than the registers on the GPU that I have used (NVIDIA GeForce GTX 480). This explains why I got only about one sixth of the instruction throughput in my tests. It is better to use registers whenever possible. On the other hand, the number of register per multiprocessor is limited. The more there are threads on a multiprocessors, the less there are registers per thread. That is why it is sometimes useful to have less threads on a multiprocessor which means more registers for each thread and faster execution, because shared memory is accessed less often. Simple, but not necessarily obvious.

I have come to the conclusion that thread level parellellism is the only way to compete with modern CPUs with multiple cores at higher clock rates than the GPUs. For example the GTX 480 card has 15 multiprocessors, and if each of them runs one warp with 32 threads, this means 480 threads in parallel. This means 960 instructions per clock cycle, if the instructions are multipy-adds or 480 instructions per clock cycle with other simple instructions. With the shader clock rate of 1.4 GHz, this means either 1.345 Tflops (mentioned in the documentations as the peak computing power) or half of it in practice. If we are able to compute only one execution path per warp, then, most of the time at least, we are computing only 15 instructions per cycle. That is only 21 Gflops. Modern CPUs with multiple cores can achieve that with more flexibility. The order of magnitude speed up when using thread level parallellism is the only thing that justifies the use of GPUs.

The global memory speed on GTX 480 is 177 Gb/s. That means roughly 128 bits per clock cycle (@ 1.4 GHz) for all the threads together. So, if I store chess positions in the global memory, I need roughly 8 clock cycles to load it from the memory, assuming the board size of 128 bytes. If several threads work on the same position, this is a tolerable delay. For example, in the move generator I might run 256 threads in parallel to generate the moves in a position. The problem is writing the results back to the global memory. This can take hunders of clock cycles, depending on the number of moves. Assuming that the move generation itself takes tens of instructions per move, this might be a major factor limiting the performance. Actually, it could be possible to estimate the upper bound by assuming that writing the results dominates all the other execution costs. Above, we calculated that the writes take 8 clock cycles per position. Dividing the clock rate 1.4 GHz by 8 gives us 175 million positions per second. The order of magnitude is what we hoped for, if the goal is to produce 100 million positions per second. Not bad at all. Now I only have to implement it.

No comments:

Post a Comment