There are several parallel search algorithms that could be used in a chess engine, so to avoid reinventing the wheel it is worth studying the previous work on the topic. These algorithms usually assume that the parallel processors are general-purpose processors that can communicate with each other. Although the newest GPUs are more flexible than before, we cannot assume that the existing algorithms work on them as such. Let us examine some of them and see whether they would work on the GPU.
One of the simplest ideas is to share the hash table, or transposition table as it is called in the context of a chess engine. All processors can then start at the root. When one thread finishes the search of a position, it stores the result in the transposition table. Other threads can then read the result from the table instead of searching the subtree themselves. If the processors finish at slightly different times, their search paths begin to diverge, because some of them find a result in the transposition table while others have to search the position themselves. This is a non-deterministic way of sharing the workload between the processors. When results are written to the transposition table, either the memory accesses must be atomic or the critical sections must be protected. CUDA supports atomic operations, which could be used in the implementation of the transposition table. However, this approach scales quite badly to a large number of processors, so it is not useful on the GPU as a means of parallelization. Still, a transposition table will probably be implemented, and if it resides in the global memory, it will act as a shared transposition table.
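To make the idea concrete, here is a minimal host-side sketch in Python, not GPU code. The class name `SharedTranspositionTable`, the entry layout and the replacement rule are my own assumptions for illustration, and a plain lock stands in for the atomic operations that a CUDA version would use.

```python
import threading


class SharedTranspositionTable:
    """Fixed-size table shared by all search threads (illustrative only)."""

    def __init__(self, size):
        self.size = size
        self.slots = [None] * size      # each slot holds (key, depth, score)
        self.lock = threading.Lock()    # stand-in for atomic memory accesses

    def store(self, key, depth, score):
        """Store a result unless a deeper result for the same key exists."""
        with self.lock:
            index = key % self.size
            old = self.slots[index]
            # Assumed rule: overwrite other positions unconditionally,
            # overwrite the same position only with an equal or deeper search.
            if old is None or old[0] != key or old[1] <= depth:
                self.slots[index] = (key, depth, score)

    def probe(self, key, depth):
        """Return a stored score searched to at least `depth`, else None."""
        with self.lock:
            entry = self.slots[key % self.size]
        if entry is not None and entry[0] == key and entry[1] >= depth:
            return entry[2]
        return None
```

A thread would call `probe` before searching a position and `store` after finishing it. Threads that arrive later get the stored score and skip the subtree, which is the source of the diverging search paths described above.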
Many of the parallel chess algorithms are based on tree splitting: the alpha-beta search tree is split between the processors. The simplest idea is to give the children of the root node to different processors and search them with a full window. However, if the moves are ordered from the best to the worst, which is common when the iterative deepening strategy is applied, the first move is going to produce a good alpha bound, and almost all of the remaining moves would fail low if they were searched serially. Deeper in the tree, if a move ordering scheme is used, the first move either produces a good alpha bound or fails high. That is why it is not beneficial to let the search proceed with the other moves before the search on the first one is completed. Most often, the other moves are searched with a zero window, because they are expected to fail low. There are many algorithms that utilize the ideas mentioned above: mandatory work first, principal variation splitting, enhanced principal variation splitting, young brothers wait, jamboree, dynamic tree search, etc. These algorithms differ mainly in how they share the work in the parts of the search tree that are not in the "leftmost branches". They all require some way to communicate which subtrees are available for the processors to search. On the GPU, this could be a memory location that contains a stack of search tasks. However, some kind of semaphore or lock is needed to prevent concurrent accesses from corrupting the stack.
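The stack of search tasks can be sketched as follows. This is a host-side Python illustration under my own assumptions, not GPU code: the names `TaskStack` and `run_workers` are hypothetical, a `threading.Lock` plays the role of the semaphore, and the `search` argument is a placeholder for whatever searches one subtree.

```python
import threading


class TaskStack:
    """Stack of subtrees waiting to be searched; a lock guards every access."""

    def __init__(self, tasks=()):
        self._tasks = list(tasks)
        self._lock = threading.Lock()

    def push(self, task):
        with self._lock:
            self._tasks.append(task)

    def pop(self):
        """Return the most recently pushed task, or None if the stack is empty."""
        with self._lock:
            return self._tasks.pop() if self._tasks else None


def run_workers(stack, num_workers, search):
    """Let `num_workers` threads drain the stack; return (task, value) pairs."""
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            task = stack.pop()
            if task is None:
                # Simplification: stop on an empty stack. A real engine must
                # also wait for tasks that busy workers may still push.
                return
            value = search(task)
            with results_lock:
                results.append((task, value))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Because every `push` and `pop` happens under the lock, no task is lost or handed out twice. The price is that all workers contend for one lock, which is exactly the kind of serialization that may hurt on a GPU.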
Instead of splitting the search tree, it is also possible to partition the search window. Each processor then does a kind of aspiration search with its assigned window partition. Most of these searches fail either high or low, and eventually we know in which window partition the correct evaluation lies. If the windows are small enough, this piece of information is sufficient for choosing the best move. In particular, if the searches are done with a null window, the results should be accurate. Due to its simplicity, this sounds like a good initial plan for the parallelization of the search on the GPU. Let us say that the evaluation granularity is 1/100th of a pawn, and assume that we have 64 threads on each of 15 multiprocessors, for a total of 960 threads. This allows the search windows to range from -4.80 to +4.79 in steps of 0.01, which is usually enough to find the correct evaluation. But it is probably better to use the search power in a more efficient way.
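The arithmetic above can be checked with a small Python sketch. The function names are hypothetical, scores are assumed to be integers in centipawns, and I assume that the search results are consistent with each other, so that every bound at or below the true score fails high and every bound above it fails low.

```python
def null_window_bounds(multiprocessors=15, threads_per_mp=64):
    """One integer bound (in centipawns) per thread, centred on zero."""
    total = multiprocessors * threads_per_mp
    lowest = -(total // 2)
    return [lowest + i for i in range(total)]


def locate_score(bounds, failed_high):
    """Combine the per-thread results into a half-open interval [low, high).

    failed_high[i] is True when the search against bounds[i] failed high,
    meaning the true score is at least bounds[i]. None marks an open end.
    """
    reached = sum(failed_high)  # results are assumed monotone, so a count suffices
    if reached == 0:
        return (None, bounds[0])      # the score is below every bound
    if reached == len(bounds):
        return (bounds[-1], None)     # the score is at or above every bound
    return (bounds[reached - 1], bounds[reached])
```

With the default arguments there are 960 bounds running from -480 to +479 centipawns. A score inside that range is pinned down to a single centipawn, while a score outside it is only known to lie beyond one end, in which case the search would have to be repeated with shifted bounds.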
If we use the iterative deepening strategy, we already have a pretty good guess about the evaluation of the best move under the root from the previous depth. Then it might be enough to use a narrow window, e.g. [score-0.32, score+0.31], and to split the search below the root between the multiprocessors, one root move per multiprocessor, using this same window for each of them. All searches are null-window searches anyway, so a better alpha bound from the first move cannot speed up the rest of the search. In total there would be 64 x N threads, where N is the number of legal moves in the root position. We might have to restart the search if one of the searches falls out of the window. But then we can probably assign more than one multiprocessor to the wider window search, because most of the moves have probably failed low already.
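A rough Python sketch of this plan is shown here. The function names, the centipawn integer scores and the restart rules are my own assumptions: I treat a move that fails high on the topmost bound as having fallen out of the window, and I restart with a lower window only if every move fails low on the lowest bound.

```python
def root_bounds(prev_score, threads_per_mp=64):
    """Bounds from prev_score - 32 to prev_score + 31, one per thread."""
    half = threads_per_mp // 2
    return [prev_score - half + i for i in range(threads_per_mp)]


def total_threads(root_moves, threads_per_mp=64):
    """64 x N threads: one multiprocessor's worth of threads per root move."""
    return threads_per_mp * len(root_moves)


def summarize(bounds, fail_high_by_move):
    """Pick the best root move or report that a restart is needed.

    fail_high_by_move[move][i] is True when that move's null-window search
    against bounds[i] failed high, i.e. its score is at least bounds[i].
    """
    best_move, best_count = None, 0
    above_window = []
    for move, flags in fail_high_by_move.items():
        count = sum(flags)               # number of bounds the move reached
        if count == len(bounds):
            above_window.append(move)    # score is at or beyond the top bound
        if count > best_count:           # ties keep the earlier move
            best_move, best_count = move, count
    if above_window:
        return ("restart_higher", above_window)
    if best_move is None:
        return ("restart_lower", list(fail_high_by_move))
    return ("ok", best_move, bounds[best_count - 1])
```

In the `restart_higher` case only the listed moves need a wider search, which is why more than one multiprocessor could be given to each of them. Moves that merely fail low while another move lands inside the window are simply worse and need no further work.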
These are just my initial thoughts and they might change while I am reading more about the parallel search algorithms.