In his paper about Dynamic Tree Splitting (DTS), Robert Hyatt likened the processors working on a branch of the search tree to ants eating crumbs at a picnic. They eat one crumb together and it gets smaller and smaller until it totally disappears. Then the ants turn their attention to another crumb and start eating. This continues while there are crumbs. Making the processors work together like this requires careful design. Specifically, the processors are all equal; there are no master nor slaves, but only peers. One procesor may start working on a branch, but another one may complete it. The DTS paper explains how this is done.
While reading the DTS paper again, I noticed that there are many similarities between the supercomputers in the 80's and the modern GPUs. The shared memory architecture for example makes the DTS algorithm act similarly on both of them. In contrast, on clusters, one must take into account the communication costs. But on GPUs the communication happens through the memory, so it can be as fast as the memory accesses are. On the other hand, there are differences. GPUs have typically more complicated memory hierarchies than the ancient Cray supercomputers (I am not expert on this, but it appears to me that the memory hierarchy was quite flat).
The memory hierarchy causes some problems on the GPU. Ideally, each processor, or in the case of a GPU a thread, should have a copy of the tree state. This is required because the processors must be equal. Anyone of them must be able to complete the search and return the value to the root of the tree. This takes some memory. Unfortunately, the memory consumption is greater than with my current approach were I have one search stack per block of threads. I need one stack per thread. The stacks will be smaller, but not that much smaller that it would compensate for the increased number of stacks. And I will have to use global memory. A typical stack has hundreds of positions, each of them taking 48 bytes. This could mean tens of kilobytes per stack. While the normal stack operations yield the same number of writes to global memory, splitting the tree causes extra operations as the stacks must be copied. I have not tested this yet, so I cannot say if this overhead is bad or not. Anyway, the finer granularity of the search will guarantee better occupancy of the multiprocessors, not only theoretically, but also in practice, meaning that the processors do less futile operations and spend most of their time searching nodes. Now, for example, when a processor has processed its stack, it simply waits for the others to complete theirs. This is clearly inefficient and the DTS-like approach can definitely alleviate the problem.
It is difficult to say how successful my implementation will be in realizing the potential of the DTS algorithm. It will most probably take me a month or so to get this thing working at some level. I am travelling for the next two weeks, so I do not have access to a CUDA-capable computer. But I still have time to develop the ideas further.