GPU Memories

In CUDA architecture, there are several different types of memory. To determine which memory type to use, one must answer the following questions:

  • How much memory do I need?
  • How fast memory access is required?
  • Do I access the memory randomly or in order?
  • Do I want to share the content of the memory with multiple threads?

If we always got what we wanted, the answers would be: "as much as possible", "as fast as possible", "randomly", "yes". But the reality is what we have to cope with. Big memories cannot be fast. Random accesses cost. Sharing contents of the memory with other threads requires extra effort. So, in order to use them efficiently, it is important to know what are the properties of the different memory types.

Global memory can be used for large arrays of data. In addition, it can be accessed by all the threads. However, it is painfully slow. A read or write can take hundreds of clock cycles. By utilizing the memory alignment, it is possible that up to 16 threads that belong to the same half-warp issue only one instruction to access memory inside one alignment unit. Still, it is better to avoid global memory accesses whenever possible, and use faster memories in the computation. A good strategy is to load the data from the global memory to the other memories, compute, and write the data back to the global memory in the end.

CUDA also has something called "local memory" which is not that local at all. It is used for arrays and other stuff in the kernels that does not fit into registers. However, it is as slow as the global memory, so it should be avoided at all costs. If you program with CUDA-C, you do not have full control over what the nvcc-compiler puts into the local memory. By using the PTX-assembler to write the kernels it is possible to avoid having anything in the local memory, if you use the registers sparingly.

There are also memories that are read-only for the kernels. Constant memory can be used for some values that are assigned before the kernel is run. It is also located in the device memory like the global and local memories, but it is cached, so accesses are faster. Another cached memory type is the texture memory, which is cached based on the 2-D spatial locality. It can be a good option if the memory accesses by the kernel can utilize that kind of locality. It can be expected that texture memory accesses are optimized on the graphics processing unit for the obvious reasons, so this type of memory is superior to global or constant memories if no writes are required.

The fastest memory is located on the chips. This is called shared memory. It is very fast, but there are limitations. First, the size is very small, usually only tens of kilobytes per multiprocessor. Second, the memory cannot be shared between the multiprocessors (I wonder why it is called "shared memory" then? Perhaps for the same reason, the "local memory" is called local). Third, there are memory banks that cannot be accessed by the threads at the same time, so the access patterns must be carefully planned.

What does this all mean from the chess engine's point of view? Obviously, we want to use the registers and shared memory for the search algorithm. A few dozens of kilobytes of memory shared between 32 threads per multiprocessor leaves about a kilobyte of memory space per thread. This is not much, if we have to fit the stack of positions and other variables in it. It is better to pack the information tight, because a few extra operations to unpack them will be much less than issuing global memory accesses.

There is not an obvious way to implement a fast transposition table. The shared memory size is too small to hold the table in it. Most probably, the only way is to use the global memory. This is still reasonable at higher levels in the search tree, where the search takes at least thousands of clock cycles. Then probing the transposition table can still take hundreds of clock cycles and with a reasonble hit rate, we can save time. Experimentation is required here. However, we cannot just naively count the number of positions searched and try to minimize that, because at lower levels of the tree, a full search might be faster than probing the transposition table.

Here are some thoughts about memories on the GPU. I will probably return to this subject, when I start experimenting with the actual implementaion of the search algorithm.

No comments:

Post a Comment