First, I must clarify something. The application programming interface (API) that I will be using is NVIDIA's CUDA (see http://www.nvidia.com/object/cuda_home_new.html). It is easy to learn and use, and provides a relatively efficient way of utilizing GPUs. The programming language is basically C/C++ with some extensions. However, mindless coding will not lead to applications that use the GPU optimally. It is even possible for an application to run slower on a GPU than on the CPU. One must be careful and design the program so that it can benefit from the massively parallel hardware of the GPU.
One aspect of programming for parallel machines is the instruction architecture. On GPUs, the most common architecture is SIMD, or single instruction multiple data. That means that each parallel processor gets the same instruction, but processes different data elements. This is nice when one needs to decide the color of a block of pixels on the screen, because all of them are processed in basically the same way. Only the screen coordinates change between the pixels. However, this is not so convenient when coding a chess program. While the static evaluation of positions might be assigned to the parallel processors in much the same way as colors are determined for the pixels on the screen, the construction and traversal of the dynamic search tree is not that straightforward.
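To make the pixel example concrete, here is a minimal sketch in CUDA, assuming a grayscale image stored row by row in one byte per pixel. The names shadePixels and renderGradient, the 16x16 launch shape, and the gradient formula are my own illustrative choices, and error checking is left out to keep it short. How the threads are launched is the topic of the next paragraph; the point here is only that every thread runs the same code on a different pixel.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: every thread executes these same instructions,
// but each one works on a different pixel, picked by its own (x, y).
__global__ void shadePixels(unsigned char* image, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;  // threads past the image edge do nothing

    // Horizontal gradient: the intensity depends only on the x coordinate.
    image[y * width + x] = (unsigned char)((256 * x) / width);
}

// Hypothetical host helper: allocates the image on the GPU, runs the
// kernel over all pixels, and copies the result back to the CPU.
std::vector<unsigned char> renderGradient(int width, int height)
{
    std::vector<unsigned char> host(static_cast<size_t>(width) * height);
    unsigned char* dev = nullptr;
    cudaMalloc(&dev, host.size());

    dim3 block(16, 16);  // assumed launch shape, not tuned
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    shadePixels<<<grid, block>>>(dev, width, height);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(host.data(), dev, host.size(), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

Nothing in the kernel loops over pixels. The only thing that differs between threads is the coordinate pair they compute for themselves, which is exactly the "same instruction, different data" pattern.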
In CUDA the parallel processors are called streaming multiprocessors. Each of them executes threads in groups of 32, and such a group is called a warp. All threads in a warp execute the same instructions at the same time. To be exact, in CUDA the architecture is called SIMT, or single instruction multiple threads. The programmer writes the code that a single thread executes, the so-called kernel. The kernel is then launched for a large number of threads at once. There can, and usually should, be many more than 32 threads. The threads are grouped into warps, and the warps can be given to different multiprocessors. When a multiprocessor is assigned more than one warp, CUDA takes care of scheduling the warps. From the programmer's point of view, all the threads run in parallel.
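The relationship between a kernel, its threads, and the warps can be sketched as follows. This is only an illustration under my own assumptions: the names recordWarpIds and launchAndCollect are made up, the threads are launched in one-dimensional groups of threadsPerBlock (part of CUDA's launch syntax), and error checking is omitted. Each thread simply writes down which warp of its group it belongs to, using CUDA's built-in warpSize variable.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: the code that one single thread executes.
// Each thread stores the index of the warp it belongs to within its group.
__global__ void recordWarpIds(int* warpIds, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        warpIds[i] = threadIdx.x / warpSize;        // warpSize is 32 on current GPUs
}

// Hypothetical host helper: launches n threads in total and returns
// the warp index that each of them recorded.
std::vector<int> launchAndCollect(int n, int threadsPerBlock)
{
    std::vector<int> host(n);
    int* dev = nullptr;
    cudaMalloc(&dev, n * sizeof(int));

    // Enough groups of threadsPerBlock threads to cover all n elements.
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    recordWarpIds<<<blocks, threadsPerBlock>>>(dev, n);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(host.data(), dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

Calling launchAndCollect(256, 128) starts 256 threads. Within each group of 128, threads 0 to 31 report warp 0, threads 32 to 63 report warp 1, and so on. The code never says which multiprocessor runs which warp, or in what order; that scheduling is left to CUDA.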
So what about branching? If-statements are something every programmer needs. I will write about that soon.