Chess is primarily a tactical game. Exact calculations of move series are important. That is most likely the reason why application of probabilistic methods to chess engines has failed so far. I do not try to change this, because I believe that there is no replacement for accurate analysis. Most probably, I will use some version of the alpha-beta pruning in my chess engine. The basic algorithm is serial so one must carefully design how to best utilize parallellism.
Simply splitting the search so that different moves are given for different threads is inefficient for two reasons. First, the search of a move may improve the alpha bound. Then, if one has already started searching the other moves with the original bound, some effort is wasted. Second, the search of a move may cause a beta cut-off. Then, searching all the other moves is totally wasted effort. Typically, parallel algorithms try to minimize the futile work.
Because the move ordering is usually good in modern chess engines due to the iterative deepening, transposition tables, and various other techniques, the beta cut-offs happen with the first move in about 90 % of the cases. That is why one should avoid splitting the work at Cut nodes (nodes whose children may cause beta cutoff). Instead, the splitting work at All nodes (nodes whose all children must be searched) is preffered. However, it cannot be known for sure which nodes are of which type. With a perfect move ordering, it is possible, but we cannot achieve that. And if we did achieve it, we would not need to search at all.
It could be possible to implement one of the existing parallel search algorithms. However, they do not scale that well for a large number of processors. Probably the best know algorithm is the Dynamic Tree Splitting (DTS) by Robert Hyatt and the Cray Blitz team. Many programmers consider it difficult to implement. One cannot use the typical recursive structure of function calls in the search with it. For me that is not a problem, because the structure of my program is not recursive. That offers great flexibility, because any node can be searched by any thread. So I am thinking about implementing a DTS-like algorithm on the GPU.