I have been quite busy during the week. The little time I had I spent debugging the code. It is surprisingly easy to introduce bugs in the move generator code if you make small changes to it. I had to implement code for printing out the positions. That helped a lot and found out that I had copied wrong masks for preventing the wrap-around in the Kogge-Stone algorithm (A- and H-files swapped).
Now that I have the sequential code working, I must still figure out how to arrange the parallel execution in an optimal way. I might just launch a constant number of threads and let them handle a bunch of positions. The basic idea is to have a kind of queue of positions for processing. The threads take the positions in the beginning of the queue and generate their successors and put them to the end of the queue. As long as there are positions with less than the maximum depth, the threads continue to process them. The obvious problem is that this is a kind of breadth-first-search approach, so the size of memory becomes the limiting factor, and perft(6) is already too much.
So, instead of a queue, I could use stack. Then, the threads take the positions on top of the stack and put the new stuff on top of the stack. Care must be taken to handle the new and old positions properly. The size of the memory area required is then dependent on the number of threads. This is more managable and probably the threads are less divergent as they handle similar types of positions.