1/31/2013

Existing parallel search algorithms

There are several parallel search algorithms that could be utilized in a chess engine. So, to avoid reinventing the wheel, it is necessary to study the previous work on the topic. These algorithms usually assume that the parallel processors are general purpose processors that can communicate with each other. Although the newest GPUs are more flexible than their predecessors, we cannot assume that the existing algorithms work on them as such. Let us examine some of them and see if they would work on the GPU.

One of the simplest ideas is to share the hash table, or transposition table as it is called in the context of a chess engine. Then all processors can start at the root. When one thread finishes the search of a position, it stores the result in the transposition table. Other threads can then read the result from the table instead of searching the subtree themselves. If the processors finish at slightly different times, their search paths begin to diverge as they either read the results from the transposition table or search the positions themselves. This is a non-deterministic way of sharing the workload between the processors. The memory accesses must be atomic, or the critical sections must otherwise be protected, when writing the results to the transposition table. CUDA supports atomic operations which could be used in the implementation of the transposition table. However, this approach scales quite badly to a large number of processors, so it is not useful for the GPU as a means of parallelization. Still, a transposition table will probably be implemented, and if it resides in the global memory, it will act as a shared transposition table.
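As a first idea of how the atomic operations could be used, here is a minimal sketch of a lockless store into a transposition table in global memory. The entry layout (a 32-bit verification key plus 32 bits of packed score and depth) and the names ttEntries and ttStore are my own assumptions, not a final design; 64-bit atomics require compute capability 1.2 or higher.

    // Transposition table in global memory; the size is arbitrary here.
    __device__ unsigned long long ttEntries[1 << 20];

    __device__ void ttStore(unsigned int key, unsigned int data)
    {
        unsigned int index = key & ((1 << 20) - 1);
        unsigned long long entry = ((unsigned long long)key << 32) | data;
        // The whole entry is written with one atomic exchange, so a concurrent
        // reader never sees a half-updated key/data pair.
        atomicExch(&ttEntries[index], entry);
    }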

A large part of the parallel chess algorithms is based on tree splitting. The alpha-beta search tree is split between the processors. The simplest idea is to give the children of the root node to different processors and search them with a full window. However, if the moves are ordered from the best to the worst, which is common if the iterative deepening strategy is applied, the first move is going to produce a good alpha bound, and almost all of the remaining searches would fail low if they were searched serially. Deeper in the tree, if a move ordering scheme is used, the first move either produces a good alpha bound or fails high. That is why it is not beneficial to let the search proceed with the other moves before the search of the first one is completed. Most often, the other moves are searched with a zero window, because they are expected to fail low. There are lots of algorithms that utilize the ideas mentioned above: mandatory work first, principal variation splitting, enhanced principal variation splitting, young brothers wait, jamboree, dynamic tree splitting etc. The differences between these algorithms are mainly in how they share the work in the parts of the search tree that are not in the "leftmost branches". These algorithms require some way to communicate which subtrees are available for the processors to search. On the GPU, this could be a memory location that contains a stack of search tasks. However, some kind of semaphore is needed to prevent the corruption of the stack by concurrent accesses.
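Instead of a classical semaphore, the stack of search tasks could be protected with atomic operations. The sketch below assumes the stack is filled before the worker threads start popping from it; the Task type and the names taskStack, taskTop and popTask are hypothetical.

    struct Task { int positionIndex; int depth; int alpha; };

    __device__ Task taskStack[1024];   // filled by the host or a setup kernel
    __device__ int  taskTop;           // number of tasks currently in the stack

    __device__ bool popTask(Task *out)
    {
        int index = atomicSub(&taskTop, 1) - 1;   // claim the topmost task
        if (index < 0) {
            atomicAdd(&taskTop, 1);               // the stack was empty, undo
            return false;
        }
        *out = taskStack[index];
        return true;
    }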

Instead of splitting the search tree, it is also possible to partition the search window. Then each processor does a kind of aspiration search with its assigned window partition. Most of these searches fail either high or low, and eventually we know in which window partition the correct evaluation lies. If the windows are small enough, this piece of information is sufficient for choosing the best move. In particular, if the searches are done with a null window, the results should be accurate. Due to its simplicity, this sounds like a good initial plan for the parallelization of the search on the GPU. Let us say we have an evaluation granularity of 1/100th of a pawn. Then assume that we have 64 threads on 15 multiprocessors, which makes 960 threads in total. This allows the search windows to cover evaluations from -4.80 to +4.79 in 0.01 increments. This is usually enough to find the correct evaluation. But it is probably better to use the search power in a more efficient way.
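A sketch of what this mapping from threads to null windows could look like is given below. The Position type and the search() routine are placeholders for whatever null-window search ends up being implemented; the scores are in hundredths of a pawn.

    __global__ void windowSearch(const Position *root, int depth, int *failedHigh)
    {
        int tid   = blockIdx.x * blockDim.x + threadIdx.x;   // 0 .. 959
        int alpha = -480 + tid;                               // one centipawn per thread
        // Null window [alpha, alpha + 1]: the result only tells whether the
        // true score is above or below alpha.
        failedHigh[tid] = (search(root, depth, alpha, alpha + 1) > alpha);
    }

    // Launched e.g. as windowSearch<<<15, 64>>>(...); the true score lies where
    // the results switch from "failed high" to "failed low".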

If we use the iterative deepening strategy, we already have a pretty good guess about the evaluation of the best move at the root from the previous depth. Then, it might be enough to use a narrow window, e.g. [score-0.32...score+0.31], and split the search below the root between the multiprocessors by using this window for each of them. All searches are null-window searches anyway, so a better alpha bound from the first move cannot speed up the rest of the search. In total there would be 64 x N threads, where N is the number of legal moves in the root position. We might have to restart the search if the score falls outside the window. But then we can probably assign more than one multiprocessor to the wider-window search, because most of the moves have probably failed low already.
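Continuing the sketch above, this two-dimensional split could be expressed as one block per root move and one null window per thread. Again, Position, makeRootMove() and search() are hypothetical placeholders, and the scores are in hundredths of a pawn.

    __global__ void rootSplit(const Position *root, const int *rootMoves,
                              int depth, int prevScore, int *results)
    {
        int move  = rootMoves[blockIdx.x];          // one root move per block
        int alpha = prevScore - 32 + threadIdx.x;   // 64 windows: [score-0.32, score+0.31]

        Position pos = *root;
        makeRootMove(&pos, move);
        // Negamax convention: the child is searched with the sign-flipped window.
        results[blockIdx.x * blockDim.x + threadIdx.x] =
            -search(&pos, depth - 1, -(alpha + 1), -alpha);
    }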

These are just my initial thoughts and they might change while I am reading more about the parallel search algorithms.

1/30/2013

About evaluation

The evaluation function is very important, because in most cases the search must be terminated before the game ends in a checkmate or a draw. One way to think about the evaluation function is that it indicates the probability that one side is winning. Assuming perfect play, there is no element of chance in chess as such. If we had enough computing power, it would be possible to say with absolute certainty for each position whether it is a draw, a win for white, or a win for black. But we do not have that power, so we have to resort to the heuristic approach.

Most of the commercial chess engines use pawns as the basic unit of evaluation. Everything else being equal, white being up one pawn yields the score +1.00. Similarly, if black is up a pawn, the score is -1.00. For a grandmaster, a one pawn advantage is usually just enough to win the game. And if the score equals zero, the position is a dead draw. Checkmates, on the other hand, correspond to an infinite evaluation. In practice, chess engines use a finite number for checkmates, but one so large that it cannot be confused with any other evaluation.
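In code this could look something like the snippet below; the names and the exact mate value are my own assumptions, the only requirement being that the mate score dominates every positional evaluation.

    // Evaluation in hundredths of a pawn (centipawns).
    const int PAWN_VALUE = 100;      // +1.00 for a one pawn advantage
    const int MATE_SCORE = 1000000;  // finite, but larger than any positional score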

The evaluation function is where most of the expert knowledge in a chess engine is located. Evaluating chess positions is difficult even for experienced human players. It is not possible to invent straightforward rules that apply to every position. Usually what separates a good player from a really strong player is the specific knowledge of different types of positions and which plans work in them. It is possible to program some of this information in a chess engine, but human beings and computers think in a different way, making this a difficult task.

One theory of evaluation is based on three factors: material, space, and time. The player with more material is usually more likely to win. The player with more space is also more likely to win, because he has more options to maneuver his pieces. And third, the player who is ahead in development or has the initiative is more likely to win. In addition, there are factors like pawn structure, piece mobility, and king safety. These are often explained through the three factor model (material, space, time), but I think the model is not as well-suited for computer chess as it is for human players.

I suggest a different model for the evaluation in a chess engine. It is also based on three concepts, but these differ from the ones presented above. They are: force, path, and targets. Force is approximately equal to material. It refers to the pieces and their latent values. Path means that the pieces have routes on the board. It is close to the concept of space, but it is more specific to the piece type. What is a path for a knight is not usually a path for a bishop. And finally, targets refer to weak or otherwise important points in the opponent's position. These are related to path and force in the sense that the force strikes the targets via the path. If one of these factors is missing, the strike cannot succeed.

First, force, or material, is evaluated by using a simple piece value table:

Piece          Value
Pawn           1.00
Knight         3.25
Bishop         3.25
Rook           5.00
Queen          9.75
King           Infinity
Bishop pair    0.50

This is a refined version of the traditional 1-3-3-5-9 scoring. It includes a bonus for the side with the bishop pair, because it is widely recognized at the grandmaster level that a pair of bishops is slightly better than a bishop and a knight or two knights.
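As a piece of code, the table could be stored as a small array in centipawns; the enumeration below is a hypothetical piece encoding, and the king is handled separately because its value is effectively infinite.

    enum { EMPTY, PAWN, KNIGHT, BISHOP, ROOK, QUEEN, KING };

    __constant__ int pieceValue[7]   = { 0, 100, 325, 325, 500, 975, 0 };
    __constant__ int bishopPairBonus = 50;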

Paths are calculated backwards from the targets. A path may mean an open or semi-open file for a rook, or a diagonal for a bishop. For a knight it may mean an outpost square. For a passed pawn it may mean a free path to the promotion square.

As already mentioned, a target is a weak or important point in the opponent's position. Weak points include, e.g., isolated and backward pawns. Important points include the central squares, the bases of pawn chains, and the squares near the king. These lists are not exhaustive, but may be extended as new factors affecting the position become apparent.

This is an initial attempt to evaluate the position. A more sophisticated function can be implemented when the chess engine is otherwise working fine.

1/29/2013

Moving pieces

An efficient move generator is an essential part of a strong chess engine. The rules of chess determine the movement of the different pieces. Most pieces move quite nicely, but there are exceptions which make the implementation of a move generator more difficult than it would otherwise be. The purpose of this post is to examine how the pieces move and how to implement a generic move generator.

But first, it is appropriate to mention here that it is not always necessary to generate all moves for all positions during the search. Many chess engines do this, while others generate the moves in stages, e.g. checks, captures, and other moves. If the search fails high after a move, it is not necessary to go through the rest of the moves, and the effort used in generating them is wasted. On the other hand, generating one move at a time will probably be difficult and add extra instructions. However, in a zero-window search with good move ordering, it can happen that only one move needs to be searched at every other ply, so it would be possible to use a separate move generation routine for those plies. But for simplicity, we try to construct a generic move generator.

It would be nice to have a move generator without branching. This is difficult, but I think it can be done. If branching cannot be avoided, the threads inside a warp will in many cases have to execute all the branches. Some of the benefits of parallelism are then lost. So we try to replace ifs with table lookups. That is why we need to see if the piece movement can be generalized. The following table lists some properties of the pieces:

Piece    Sliding   Directions   Max. moves   Specialities
Pawn     No        3            4            Capture and non-capture different, promotion, double move, en passant
Knight   No        8            8
Bishop   Yes       4            13
Rook     Yes       4            14           Castling
Queen    Yes       8            27
King     No        8            8            Castling, checks

So basically, pawns are the nastiest pieces, and kings are also bad. The majority of the problems with the kings can be avoided by loosening the requirements. Checks can be handled separately and not as a part of move generation.

Generic move rules could consist of the piece to move, change in the board index (= direction), maximum steps it can "slide" (this equals one for non-sliding pieces), a condition for the move to be valid, and an extra action to take in addition to moving the piece. This leads to a table like:

Piece   | Direction | Max. steps | Condition                                                      | Action
Pawn    | -8        | 1          | Destination square empty                                       |
Pawn    | -16       | 1          | Only 2nd row pawn, destination square and middle square empty | Set EP square
Pawn    | -7        | 1          | Destination square occupied by enemy piece or EP square       |
Pawn    | -9        | 1          | Destination square occupied by enemy piece or EP square       |
Pawn    | -8        | 1          | 7th row pawn, destination square empty                        | Change to knight
Pawn    | -8        | 1          | 7th row pawn, destination square empty                        | Change to bishop
Pawn    | -8        | 1          | 7th row pawn, destination square empty                        | Change to rook
Pawn    | -8        | 1          | 7th row pawn, destination square empty                        | Change to queen
Pawn    | -7        | 1          | 7th row pawn, destination square occupied by enemy piece      | Change to knight
Pawn    | -7        | 1          | 7th row pawn, destination square occupied by enemy piece      | Change to bishop
Pawn    | -7        | 1          | 7th row pawn, destination square occupied by enemy piece      | Change to rook
Pawn    | -7        | 1          | 7th row pawn, destination square occupied by enemy piece      | Change to queen
Pawn    | -9        | 1          | 7th row pawn, destination square occupied by enemy piece      | Change to knight
Pawn    | -9        | 1          | 7th row pawn, destination square occupied by enemy piece      | Change to bishop
Pawn    | -9        | 1          | 7th row pawn, destination square occupied by enemy piece      | Change to rook
Pawn    | -9        | 1          | 7th row pawn, destination square occupied by enemy piece      | Change to queen
Knight  | -17       | 1          |                                                                |
Knight  | -15       | 1          |                                                                |
Knight  | -10       | 1          |                                                                |
Knight  | -6        | 1          |                                                                |
Knight  | 6         | 1          |                                                                |
Knight  | 10        | 1          |                                                                |
Knight  | 15        | 1          |                                                                |
Knight  | 17        | 1          |                                                                |
Bishop  | -9        | 7          |                                                                |
Bishop  | -7        | 7          |                                                                |
Bishop  | 7         | 7          |                                                                |
Bishop  | 9         | 7          |                                                                |
Rook    | -8        | 7          |                                                                | Clear a castling flag
Rook    | -1        | 7          |                                                                | Clear a castling flag
Rook    | 1         | 7          |                                                                | Clear a castling flag
Rook    | 8         | 7          |                                                                | Clear a castling flag
Queen   | -9        | 7          |                                                                |
Queen   | -8        | 7          |                                                                |
Queen   | -7        | 7          |                                                                |
Queen   | -1        | 7          |                                                                |
Queen   | 1         | 7          |                                                                |
Queen   | 7         | 7          |                                                                |
Queen   | 8         | 7          |                                                                |
Queen   | 9         | 7          |                                                                |
King    | -9        | 1          |                                                                | Clear castling flags
King    | -8        | 1          |                                                                | Clear castling flags
King    | -7        | 1          |                                                                | Clear castling flags
King    | -1        | 1          |                                                                | Clear castling flags
King    | 1         | 1          |                                                                | Clear castling flags
King    | 7         | 1          |                                                                | Clear castling flags
King    | 8         | 1          |                                                                | Clear castling flags
King    | 9         | 1          |                                                                | Clear castling flags
King    | -2        | 1          | Long castling allowed                                          | Clear castling flags, move rook
King    | 2         | 1          | Short castling allowed                                         | Clear castling flags, move rook

The moves could be generated by lookups into such a table. In addition, it is required that the destination square is not off the board. Implementing the conditions is the hard part. Usually they can be formulated as tests of whether certain bits are set. The actions in the table can be seen as removing one piece and adding another piece.
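To give an idea of what one row of the table might look like in code, here is a rough sketch; the field names and the idea of encoding conditions and actions as small integer codes are my own assumptions.

    struct MoveRule {
        signed char piece;       // piece type this rule applies to
        signed char direction;   // change in the board index per step
        signed char maxSteps;    // 1 for non-sliding pieces, up to 7 for sliders
        signed char condition;   // index of a condition test (0 = always allowed)
        signed char action;      // index of an extra action (0 = none)
    };

    // For example, the first pawn rule of the table could be written as
    // { PAWN, -8, 1, COND_DEST_EMPTY, ACTION_NONE }.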

To avoid ifs in the control flow, a move can be made so that if the condition is false, the move has no effect on any variable. Otherwise the move is made and the appropriate variables are updated. When returning from the search, the move is undone, and the execution continues from the next move at the previous level. The return pointer is then actually an index into the move generation table.
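A minimal sketch of this conditional update, assuming a simple one-word-per-square board: the condition is turned into a bit mask of all ones or all zeros, so an illegal move leaves every variable unchanged without any branch.

    __device__ void makeMoveMasked(int *board, int from, int to, int legal)
    {
        int mask  = -legal;                      // legal = 1 -> all ones, 0 -> all zeros
        int piece = board[from];
        board[to]   = (board[to] & ~mask) | (piece & mask);
        board[from] =  board[from] & ~mask;      // the source square is emptied only if legal
    }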

All this might sound confusing, but to avoid branching, I think this is necessary.

1/28/2013

Initial plan for shared memory usage

Global memory accesses are slow. A speed-up of up to two orders of magnitude can be achieved by using shared memory. However, the shared memory is limited. These facts lead us to the design principle: use shared memory instead of global memory, even at the expense of extra operations. For example, it does not matter if there are three times as many instructions if we can avoid memory operations that are a hundred times slower. So let us see how this can be done.

For each thread, we would like to fit the stack of search function parameters and the current position into shared memory. The function parameters include the remaining depth, the alpha value, the move information, and a pointer for returning to the previous depth. These parameters are implementation specific, of course. We try to save space, so we do not keep the positions in the stack, only the moves. In addition, the beta value is not necessary in the zero-window search, because it is always alpha + 1. Even alpha need not be in the stack, because it does not change during the search; if it changed, that would be a fail high.

Different representations of the position take different amounts of memory. Let us compare three of them (without additional flags etc.): the simple 8x8 board, the 0x88 board, and bitboards. In its simplest form, the 8x8 board takes 64 times 4 bits (enough for the pieces), which is 256 bits or 32 bytes. It might be necessary to complement this with a piece list for efficiency reasons. The 0x88 board takes double the space compared to the simple 8x8 board, or four times that if we reserve one byte per square. Finally, bitboards take 64 bits each, and we need 6-12 of them (white, pawns, knights, bishops, rooks, kings, and possibly black, queens, plus 4 rotated bitboards). That is 384-768 bits or 48-96 bytes. Surprisingly, the simplest board representation seems the most attractive one, because it is the smallest one.
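Packing two squares per byte means the square contents have to be extracted with a few shifts and masks; a sketch of such accessors is given below, assuming the piece codes fit in 4 bits.

    // board points to 32 bytes: squares 0..63, two squares per byte.
    __device__ int getPiece(const unsigned char *board, int sq)
    {
        unsigned char b = board[sq >> 1];
        return (sq & 1) ? (b >> 4) : (b & 0x0F);
    }

    __device__ void setPiece(unsigned char *board, int sq, int piece)
    {
        unsigned char b = board[sq >> 1];
        board[sq >> 1] = (sq & 1) ? ((b & 0x0F) | (piece << 4))
                                  : ((b & 0xF0) | piece);
    }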

Moves can be packed into a few bytes. The source and destination squares take 6 bits each. Then, for undoing moves, it is necessary to store the captured piece, the en passant square, the castling rights, and the promoted piece. These take 4, 6, 2, and 4 bits, respectively. In total this is 28 bits. So, in practice, a move takes 32 bits or 4 bytes.
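A sketch of such packing is shown below; the layout of the fields within the 32 bits is arbitrary, only the bit counts from the paragraph above matter.

    typedef unsigned int Move;

    __device__ Move packMove(int from, int to, int captured, int epSquare,
                             int castling, int promotion)
    {
        return  (from      & 0x3F)              // 6 bits
             | ((to        & 0x3F) <<  6)       // 6 bits
             | ((captured  & 0x0F) << 12)       // 4 bits
             | ((epSquare  & 0x3F) << 16)       // 6 bits
             | ((castling  & 0x03) << 22)       // 2 bits
             | ((promotion & 0x0F) << 24);      // 4 bits, 28 bits in total
    }

    __device__ int moveTo(Move m) { return (m >> 6) & 0x3F; }   // one example accessor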

Fitting the stack in the shared memory is going to limit the depth of the stack or the number of threads. The number of threads should not be too small, because that would negatively affect the performance. An initial guess is 64 threads per multiprocessor. This can be changed when the implementation is tested. The depth is limited to 32 ply. In practice, this should be enough, although the quiescence search could theoretically go deeper.

The following table summarizes the total memory consumption so far:

Variable           Size       Number    Total memory consumption
Current position   32 bytes   64        2 kB
Moves              4 bytes    64 x 32   8 kB
Pointer            2 bytes    64 x 32   4 kB

For each multiprocessor, we probably need a set of lookup tables for the move generator. Because the other data already takes about 14 kB, and on many devices the upper limit for the shared memory is 16 kB, we have only about 2 kB of space available. That is not enough for any 64-by-64 tables. Instead, it is possible to use only arrays of offsets for pieces moving in different directions.
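The budget above could translate into shared memory declarations roughly like the following fragment, placed inside the search kernel; the array names are placeholders, and the sizes match the 14 kB estimate.

    #define THREADS 64
    #define MAX_PLY 32

    __shared__ unsigned char  positions[THREADS][32];        // packed boards, 32 bytes each: 2 kB
    __shared__ unsigned int   moveStack[THREADS][MAX_PLY];   // 4-byte packed moves:          8 kB
    __shared__ unsigned short retStack [THREADS][MAX_PLY];   // 2-byte return indices:        4 kB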

I will refine the plan when I go into the details of move generation.

1/24/2013

Examining the search tree

Almost all chess engines go through a search tree, where the nodes are positions and edges are chess moves between the positions. The children of a node are all the positions that can be reached from that position with one legal chess move. Sometimes the legality requirement is loosened to allow pseudolegal moves for efficiency reasons. For example instead of checking for checkmate, it is often sufficient to assign the king a very high value and allow it to be "captured". However, the basic structure of the search tree is the same.

In typical middlegame positions there are around 40 legal moves. The number of legal moves varies between zero (stalemates and checkmates) and 218 (some artificially constructed positions) depending on the position. This is the branching factor of the tree. Let us call it B. Then, the number of nodes at depth D can be estimated with B^D. It is obvious that this number becomes very large quickly when D increases. The following table illustrates this:

Depth   Node count          Time at 10M nodes / s
1       40                  0.000004 s
2       1 600               0.00016 s
3       64 000              0.0064 s
4       2 560 000           0.256 s
5       102 400 000         10.24 s
6       4 096 000 000       6 min 49.6 s
7       163 840 000 000     4 h 33 min 4 s
8       6 553 600 000 000   7 d 14 h 2 min 40 s

Consider the time taken by the full traversal of the search tree as listed in the table. The 10 million nodes per second speed is realistic for modern chess engines. How is it then possible that they routinely reach depths like 15 or 20 ply in tournament play? They only have a few minutes per move, so they should be able to go only 5-6 ply deep. Half of the answer is that the table represents only the worst case scenario with bad move ordering. The backbone of most chess engines is some version of the alpha-beta search. If the move order is optimal, it is necessary to search only one child at every other level of the tree. This means that the node count becomes B^(D/2), or in other words, the search can be twice as deep. But multiplying the depths in the table by two gives 10-12 ply. That is still less than expected, even if the move order were optimal, which it is not in practice.

Where does the extra depth come from? There are many factors that contribute to it. One factor is the transposition table. About 15-20 % of the positions encountered during the search happen to be among those already searched. If the results for these positions can be fetched directly from the hash table, the number of positions to examine can be reduced by a factor of about three for a 5-6 ply search. This might mean an extra ply for the search. In the endgame, transposition tables are even more beneficial and can yield tremendous speed-ups.

Another factor is selectivity. While pruning some moves before the maximum depth is reached is risky and might not produce the correct results, most good chess engines do it. The benefit of searching deeper outweighs the occasional miscalculation. If the pruning strategy is well designed, such mistakes can be almost completely avoided. Human intuition says that most moves in most positions are just dumb. However, it is far from easy to implement this intuition in a chess engine.

1/23/2013

Floating point vs. integer arithmetics

Eventually, the chess engine will run on the GPU, with the CUDA-C code (assuming that is the language of choice) compiled into machine instructions. Some instructions take longer than others. While fully utilizing the parallel threads and planning the memory accesses carefully are the most important things, the instructions used also count when optimizing the performance. One important thing to notice about the GPU is that the same operations with different data types take different amounts of time.

GPUs have been designed for fast floating point arithmetic. Unfortunately, chess engines do integer arithmetic and bit twiddling. It should be clear that the nominal top speeds (in floating-point operations per second, or FLOPS) given for GPUs in the technical specifications cannot be achieved with integer operations. Not even all floating point operations can achieve that; there are special instructions like multiply-add that perform two operations at once to produce the supercomputer-like teraflops figures in the data sheets.

So let us compare integer operations with 32-bit and 64-bit floating point operations (assuming CUDA compute capability 2.0). The table shows the clock cycles per thread taken by each instruction:

Operation      32-bit float   64-bit float   32-bit integer
Add            1              2              1
Multiply       1              2              2
Multiply-add   1              2              2
Logical op.    N/A            N/A            1
Shift          N/A            N/A            2
Type conv.     1              1              1

The situation is not as bad as it could be. Additions and logical operations are fast. Multiplications and shifts are only twice as slow. Multiply-add is useful in the address calculations and it is executed with the same cost as a simple multiply.

The situation is not optimal for the bitboard representation often used by chess engines. The all-important shift operations cost twice as much as the other operations, and there is no native 64-bit integer type. Other board representations may be more efficient, but it is hard to say without testing.
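To illustrate the cost, here is a sketch of how a single 64-bit bitboard shift can be composed from 32-bit operations, roughly the kind of code that has to be generated for long long shifts; the struct and function names are made up, and the shift amount is assumed to be between 1 and 31.

    struct Bitboard64 { unsigned int lo, hi; };

    __device__ Bitboard64 shiftLeft(Bitboard64 b, int s)   // assumes 0 < s < 32
    {
        Bitboard64 r;
        r.hi = (b.hi << s) | (b.lo >> (32 - s));   // carry bits move from lo to hi
        r.lo =  b.lo << s;
        return r;
    }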

One thing that should be avoided is unnecessary casting between data types. Although a cast operation takes only one clock cycle per thread, these additional instructions pile up and slow down the execution. The compiler might be stupid in the sense that it sometimes adds cast instructions when they are not really necessary. This can be seen by examining the PTX-assembler code produced by the compiler. It might be required that the inner loops of the search algorithm are written directly in assembler to avoid this waste of instructions.

In conclusion, it can be said that the speed of integer arithmetic is not going to be an issue.

1/22/2013

Communication between threads

Some parallel computer chess algorithms require that the threads can communicate. One processor might act as a master and the others as slaves. Or there might be a pool of jobs from which each thread fetches a task when it becomes idle. However, on the GPU, communication between the threads is limited.

In CUDA-C, it is possible to use the __syncthreads() command to synchronize the execution of the threads that belong to the same block. It blocks the execution of the threads until all of them have reached that point and their pending memory accesses have completed.

In addition, there are several atomic instructions for global and shared memory accesses. These can be used for performing an operation and updating the memory without other threads interfering with the process. Such instructions are useful in parallel computing, because they allow the implementation of all kinds of advanced synchronization structures like semaphores, mutexes, or monitors.

If the threads must communicate something to the other threads, then the only practical option is to write the data to a memory location that can be read by the other threads, and the other threads check if there is anything new in that memory location. In a chess engine, the transposition table is a natural way to mediate information between threads running a search algorithm.

However, at some point during the search, the workload must be split between the threads. After that, there will be very little communication between them. Because there is a large number of threads that can be run in parallel, it is a challenge to find tasks that take approximately an equal amount of time to execute. Probably each thread is running the same search algorithm, but with different parameters. If the algorithm is alpha-beta search, then the parameters are:

  • position,
  • depth, and
  • search window.

The workload depends on all of these, but the depth probably has the most significant effect. That is why it does not sound like a good idea to assign searches of different depths to different threads. On the other hand, there should be hundreds of positions to be searched to the same depth. It is not enough to take one position and let the threads search the positions following the different moves in that position, because the move count per position is typically only a few dozen. In addition, the improving bounds cannot be utilized during the search, which makes this approach less beneficial.

One idea is to split the search window between the threads. If the granularity of the evaluation function is selected appropriately, it might be possible to make only null-window searches. The purpose of each search is to prove that the search from the position does not produce the score that is the null-window alpha bound for that thread. Then one of these null-window searches returns its alpha bound while the others fail, and we have found the correct score. The null-window searches are much faster than full searches, and they can be run in parallel without communication. This seems to me like an attractive choice for the GPU chess algorithm.

1/21/2013

GPU Memories

In CUDA architecture, there are several different types of memory. To determine which memory type to use, one must answer the following questions:

  • How much memory do I need?
  • How fast must the memory access be?
  • Do I access the memory randomly or in order?
  • Do I want to share the content of the memory with multiple threads?

If we always got what we wanted, the answers would be: "as much as possible", "as fast as possible", "randomly", and "yes". But reality is what we have to cope with. Big memories cannot be fast. Random accesses cost. Sharing the contents of the memory with other threads requires extra effort. So, in order to use the memories efficiently, it is important to know the properties of the different memory types.

Global memory can be used for large arrays of data. In addition, it can be accessed by all the threads. However, it is painfully slow: a read or write can take hundreds of clock cycles. By utilizing memory alignment, up to 16 threads belonging to the same half-warp can access memory inside one alignment unit with a single memory transaction. Still, it is better to avoid global memory accesses whenever possible, and use the faster memories in the computation. A good strategy is to load the data from the global memory into the other memories, compute, and write the data back to the global memory at the end.
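As a minimal sketch of that strategy (the kernel and array names are placeholders, and the block is assumed to have at most 256 threads):

    __global__ void compute(const int *gdata, int *gresult)
    {
        __shared__ int tile[256];
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        tile[threadIdx.x] = gdata[i];    // consecutive threads read consecutive words: coalesced
        __syncthreads();

        // ... work on the data in shared memory ...

        gresult[i] = tile[threadIdx.x];  // one coalesced write back at the end
    }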

CUDA also has something called "local memory" which is not that local at all. It is used for arrays and other stuff in the kernels that does not fit into registers. However, it is as slow as the global memory, so it should be avoided at all costs. If you program with CUDA-C, you do not have full control over what the nvcc-compiler puts into the local memory. By using the PTX-assembler to write the kernels it is possible to avoid having anything in the local memory, if you use the registers sparingly.

There are also memories that are read-only for the kernels. Constant memory can be used for some values that are assigned before the kernel is run. It is also located in the device memory like the global and local memories, but it is cached, so accesses are faster. Another cached memory type is the texture memory, which is cached based on the 2-D spatial locality. It can be a good option if the memory accesses by the kernel can utilize that kind of locality. It can be expected that texture memory accesses are optimized on the graphics processing unit for the obvious reasons, so this type of memory is superior to global or constant memories if no writes are required.

The fastest memory is located on the chips. This is called shared memory. It is very fast, but there are limitations. First, the size is very small, usually only tens of kilobytes per multiprocessor. Second, the memory cannot be shared between the multiprocessors (I wonder why it is called "shared memory" then? Perhaps for the same reason, the "local memory" is called local). Third, there are memory banks that cannot be accessed by the threads at the same time, so the access patterns must be carefully planned.

What does all this mean from the chess engine's point of view? Obviously, we want to use the registers and the shared memory for the search algorithm. A few dozen kilobytes of memory shared between 32 threads per multiprocessor leaves about a kilobyte of memory space per thread. This is not much if we have to fit the stack of positions and other variables in it. It is better to pack the information tightly, because a few extra operations for unpacking it cost much less than issuing global memory accesses.

There is no obvious way to implement a fast transposition table. The shared memory is too small to hold the table, so most probably the only way is to use the global memory. This is still reasonable at the higher levels of the search tree, where the search takes at least thousands of clock cycles. There, probing the transposition table can take hundreds of clock cycles and, with a reasonable hit rate, still save time. Experimentation is required here. However, we cannot just naively count the number of positions searched and try to minimize that, because at the lower levels of the tree a full search might be faster than probing the transposition table.

Here are some thoughts about the memories on the GPU. I will probably return to this subject when I start experimenting with the actual implementation of the search algorithm.

1/14/2013

Branching on the GPU

Branching instructions are an essential ingredient in every computer program that does something really useful. In the C language, if-statements as well as for- and while-loops have branching instructions under the hood. The control flow may change depending on some conditions. At the processor level, that unpredictability in the control flow requires special handling.

Modern CPUs use pipelining, which means that instructions taking several clock cycles to complete are started before the previous ones are done. That way the CPU can execute roughly one instruction per clock cycle. Branching instructions are a problem, because it is not possible to know in advance which instruction follows them. The program may need to execute the part after the if-statement, or it may need to skip it. For the pipelining this means trouble: the following instructions cannot be processed before the branching instruction is completed. This could mean a delay in the execution, were it not for a clever trick. The processor "predicts" which branch will be executed and puts its instructions in the pipeline. If the prediction is correct, the pipelining continues as if there were no branching. If it fails, the incomplete instructions in the pipeline are simply discarded, and we lose nothing compared to the case where we waited for the branching instruction to complete.

Since branching causes problems even in the case of a single processor, it can be expected to cause even more problems in parallel computing. And indeed, all the threads in a warp must execute the same instructions. If there is an if-statement, it is possible that the control flow diverges. If the condition inside the if-statement evaluates to true in all threads, or to false in all threads, everything is fine. The tricky case is when some threads require that the instructions inside the if-clause are executed while others require that they are not. What happens then is that all threads execute all the instructions, and those that should not have executed them discard the results. This is a waste of clock cycles, but the program runs correctly. A similar thing happens when the number of threads is not divisible by 32 and a partial warp is run: the full warp is executed, but the results of some threads are discarded.

How bad is this branching problem? Basically, it prevents using the traditional alpha-beta search with recursive calls, because the execution paths of the threads quickly diverge. The execution then becomes sequential, as all threads must wait for the current one to complete before starting their own execution. I have two ideas for getting around this problem:

  • Re-write the recursion as an iterative algorithm with a stack.
  • Let all the threads in a warp follow the same path in the search tree and utilize the parallelism only at the leaf nodes for evaluation.

So how can these ideas be utilized?

First, it is always possible to re-write a recursive algorithm as an iterative one. This is easy to see if you consider how recursive function calls are actually implemented at the machine level. There is a stack onto which things are pushed when a function call starts and from which things are popped when the call finishes. On the other hand, the instructions are executed iteratively on one processor, no matter what the structure of the function calls is in the high level code. All this can be done explicitly by the programmer in the kernel. If every step in the search tree traversal is made identical, it does not matter if the threads in the same warp follow different paths. The implementation is not easy, and the code will be messy, but the main thing is that it works.
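Below is a rough sketch of what such an iterative null-window search could look like. It only illustrates the control flow with an explicit stack; Position, generateMoves(), makeMove(), unmakeMove() and evaluate() are hypothetical placeholders, a real version would pack undo information into the moves, and the memory layout questions from the earlier posts are ignored here.

    #define MAX_PLY   32
    #define MAX_MOVES 256
    #define INF       1000000

    typedef struct { int moves[MAX_MOVES]; int count; int next; } Frame;

    __device__ int nullWindowSearch(Position *pos, int maxDepth, int rootAlpha)
    {
        Frame stack[MAX_PLY + 1];
        int alphaAt[MAX_PLY + 1], best[MAX_PLY + 1];
        int ply = 0;

        alphaAt[0] = rootAlpha;
        best[0]    = -INF;
        generateMoves(pos, &stack[0]);

        for (;;) {
            Frame *f = &stack[ply];

            if (f->next == f->count) {                // this node is finished
                if (ply == 0) return best[0];         // root: > rootAlpha means fail high
                int value = best[ply];
                ply--;                                // back up to the parent
                f = &stack[ply];
                unmakeMove(pos, f->moves[f->next - 1]);
                if (-value > best[ply]) best[ply] = -value;          // negamax
                if (best[ply] > alphaAt[ply]) f->next = f->count;    // cutoff
                continue;
            }

            int move = f->moves[f->next++];           // descend along the next move
            makeMove(pos, move);
            if (ply + 1 == maxDepth) {                // horizon: evaluate and undo
                int value = -evaluate(pos);           // evaluate() is from the side to move
                unmakeMove(pos, move);
                if (value > best[ply]) best[ply] = value;
                if (best[ply] > alphaAt[ply]) f->next = f->count;
            } else {                                  // push a new ply
                ply++;
                alphaAt[ply] = -alphaAt[ply - 1] - 1; // the null window flips its sign
                best[ply]    = -INF;
                generateMoves(pos, &stack[ply]);
            }
        }
    }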

Utilizing the warp level parallelism only at the leaf nodes is the second best option. Then the system is not massively parallel anymore, because there is usually only a handful of streaming multiprocessors on the GPU. Traditional parallel algorithms may then work best, although the limited communication capabilities between the processors can pose severe restrictions on this. The leaf node evaluation, if the parallelism can be fully utilized, will be about 30 times faster than with a single processor. That is a lot. If the evaluation is the bottleneck in the processing, then this might be a good idea. But if the slowdown is caused by some part of the search itself, things will not be that easy, although I can imagine, for example, threads doing move generation in parallel.

Next, I will consider the limited memory access on the GPU.

1/11/2013

About SIMD architecture

First, I must clarify something. The application programming interface (API) that I will be using is NVIDIA's CUDA (see http://www.nvidia.com/object/cuda_home_new.html). It is easy to learn and use, and it provides a relatively efficient way of utilizing GPUs. The programming language is basically C/C++ with some extensions. However, mindless coding will not lead to applications that utilize the GPU optimally. It is even possible for an application to run slower on a GPU than on the CPU. One must be careful and design the program so that it can benefit from the massive parallelism of the GPU.

One aspect of programming for parallel machines is the instruction architecture. On GPUs, the most common architecture is SIMD, or single instruction multiple data. That means that each parallel processor gets the same instruction but processes different data elements. This is nice when one needs to decide the color of a block of pixels on the screen, because all of them are processed in basically the same way; only the screen coordinates change between the pixels. However, it is not so convenient when coding a chess program. While the static evaluation of the positions might be assigned to the parallel processors in a similar way as the determination of colors is done for the pixels on the screen, the construction and traversal of the dynamic search tree is not that straightforward.

In CUDA the parallel processors are called streaming multiprocessors. Each of them runs a group of threads simultaneously. The size of this group is 32 threads, and it is called a warp. All threads in a warp execute the same instructions at the same time. To be exact, in CUDA the architecture is called SIMT, or single instruction multiple threads. The programmer writes the code that a single thread executes, the so-called kernel. The kernel is then launched for a larger set of threads at once. There can, and usually should, be many more than 32 threads. The threads are grouped into warps, and the warps can be given to different multiprocessors. When a multiprocessor is assigned more than one warp, CUDA takes care of the scheduling of the warps. From the programmer's point of view, all the threads run in parallel.
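As a minimal illustration of the model (the kernel name, the arrays, and the per-thread work are just placeholders): the kernel below is the code that one thread executes, and the launch configuration creates 15 blocks of 64 threads, which the hardware splits into warps of 32.

    __global__ void evaluatePositions(const int *positions, int *scores)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // unique index of this thread
        scores[i] = positions[i] * 2;                    // dummy per-thread work
    }

    // On the host side, for example:
    // evaluatePositions<<<15, 64>>>(d_positions, d_scores);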

So what about branching? If-statements are something every programmer needs. I will write about that soon.

1/10/2013

What is the goal?

I decided to start writing a blog about a topic that I am interested in: designing and implementing a chess engine on a general purpose graphics processing unit (GPGPU). Modern GPUs have a lot of computing power, even more than central processing units. On the other hand, the playing strength of a chess engine depends on the available computing resources, in addition to its search algorithm and evaluation function. That is why the idea of combining the power of modern GPGPUs with an efficient chess program comes readily to mind.

So why aren't there already chess engines utilizing GPUs? The answer is simple. While the graphics processing units have taken many steps towards general purpose computation, they are still far from common CPUs. They are less flexible and require special algorithms to fully utilize the massively parallel computing power they possess.

There are several limitations to GPUs. To name a few, here is a list:

  • Single instruction multiple data (SIMD) architecture
  • Limited memory access
  • Limited communication between processing units
  • Inefficient branching
  • Bias towards floating point computation

In addition, the most popular chess algorithms are inherently sequential, and simple parallelization attempts fail. There are some chess algorithms that can utilize more than one processor, but almost none of them are designed for massively parallel computing.

In short, the goal of this blog is to find the most efficient way of implementing a chess engine on a GPGPU.