Sometimes small steps and patience is all you need for success. But not always. Man did not go to the moon by only taking small steps. At first, someone had to have the wild idea that it is possible to go to the moon. Then, someone had to invent the rocket engine. Only after the ideas and the technology were there could the basic engineering work begin. Similarly, I do not believe that a GPU can play good chess by just applying the known algorithms and making small improvements. Either the GPUs must change or a new algorithm has to invented. While waiting for the former I will try to do the latter.
One idea I have been playing with is using only zero window (or null window) search on the GPU. A zero window search gives a boolean result. The position is either better or worse than the given bound. It could be possible to launch hundreds of zero window queries from different positions with different bounds. The positions and bounds would be such that the expected utility of the gained information is maximized. The partial game tree would be stored and updated by the CPU.
I think it makes sense to use both the CPU and the GPU, because both of them have their own strengths. The CPU is more flexible and it is better to use it for control tasks. The GPU, on the other hand, is a work horse that can be given the mundane number crunching tasks. A simple idea would be to search the principal variation (PV) nodes with the CPU and let the GPU search the other nodes with the zero window. However, that is not very efficient use of computing power. The subtrees given to the GPU must be large enough to compensate for the extra time required for GPU kernel launches. The sizes of the subtrees must also be similar, because the slowest thread determines the performance of the whole search. In addition, the subtrees must have bounds from the PV nodes before they are run.
There are sophisticated algorithms for splitting the search treee dynamically (see e.g. The DTS high performance parallel tree search algorithm), but they usually require more communication between the processors than possible on the GPU. If there is no communication during the search, we must somehow estimate the subtree sizes before the search in order to balance the workload. It might be possible to do this by utilizing information from shallower searches. Shallow searches or iterative deepening are often useful for move ordering purposes. We could use the same data for load balancing as well.
Hopefully I can find time for coding in the near future. Then I can tell more.