It has been quiet here for a while. During the last few months, I have regained some interest in computer chess. If I ever want to succeed in writing a GPU chess program, it must be faster than a CPU reference implementation. So I wrote new CPU-side move generation code from scratch. The purpose was to make it as fast as possible. That would set the bar. I wrote the test in the form of a "perft" program that calculates the number of positions at a given depth from a given position. I also compared it to H. G. Muller's qperft (https://home.hccnet.nl/h.g.muller/dwnldpage.html).
I used the traditional bitboard presentations for the positions, where each piece type has a 64-bit field representing squares occupied by that piece type. In addition, I needed to store a bitboard for color, and some extra information, such as whose move it is, castling rights, and en passant square. This all fits nicely is 64 bytes (8 bitboards), which is a common cache line size. I could pack them even tighter, but then would need extra instructions for reading and writing them. In addition, I later added a 64-bit hash key there too.
I have implemented a few alternative sliding piece attack algorithms in the past. I considered "magic bitboards" (https://www.chessprogramming.org/Magic_Bitboards), rotated bitboards (https://www.chessprogramming.org/Rotated_Bitboards), and the ray attacks approach (https://www.chessprogramming.org/Rotated_Bitboards). I ended up using the last one. Magic bitboard need quite large lookup tables, which pollute the cache quite easily (won't fit in L1, but need L2). Rotated bitboards, on the other hand, require extra instructions to update the directional bitboards and also some extra space too. The classical ray attacks approach has smaller lookup tables (fit in L1 cache) and doesn't need extra bitboards per position. I may do some measurements later. It may also depend on the other code, which one is the fastest (e.g. hash tables may pollute the cache anyway).
Then there is the handling of the king, which affects the performance. The lazy and easy approach would be to generate moves, so that the king is like any other piece, and only when a move is executed that captures a king, the position was marked invalid. This requires generating moves one ply deeper than necessary, if king captures were not generated in the first place. Because the branching factor of chess positions can be quite high, this would yield an order of magnitude slower code. Even if only captures to the king's square were tested, this would make it impossible to use bulk counting in leaf nodes. The bulk counting means that at the last ply, the move generation would only need the count the moves, but not execute them. These were the reasons, why I ended up implementing the much more complex move generator, that doesn't produce king captures.
To avoid generating king captures, the pinned pieces must be recognized and taken into account in move generation. Also it is necessary to detect checks and respond to them appropriately. That's why, before generating any moves, there is a preparation step that finds if there are pieces that give check (possible are 0, 1, 2 such pieces), and also find pins in each four directions (vertical, horizontal, diagonal, and anti-diagonal). These are stored in bitboards that are easy to use in move generation. Additionally, the area protected by any of opponent's piece is calculated. Then in the move generation, the pinned bitboards are used for restricting the directions in which moves are allowed for pinned pieces (pinned sliding pieces may still move in the direction of the pin). In the case of a check, there is a special check evasion move generation. Three types of moves are allowed: king moves to squares not belonging to the protection area, capturing the checking piece, and moving a piece in between the checker and the king. The latter two are only possible, when there is only one checker. The last one is valid only for sliding piece checks. There are a few caveats in this approach. E.g. the king should be removed from the board when calculating the protection area. Otherwise the squares x-ray attacked by the checker through the king would still be valid squares to move to. Also, there is a special case with en passant if the king and a rook or a queen are on the same row, but on opposite sides, of the pawns involved in en passant. The possible pin is not recognized by normal pin code, because there are two pawns between the king and the attacker. En passant should be illegal in this case, because the move removes both pawns simultaneously, leaving the king in check.
After navigating through the jungle of special cases, I finally managed to implement a move generator that is accurate and fast. I tested the move counts in several known "perft positions". After fxing a handful of bugs, I get the correct results in all tested positions. My initial implementation was single-threaded. Depending on the positions, it produced 200-300 million positions per second on my computer with Intel i7 9700K 8-core processor (on a single core at this point). Then I decided to make it multi-threaded. I implemented a version of the perft function that uses work-stealing for distributing the work on all cores. I was quite happy to get a fairly good scaling on the eight cores, the result being almost 2.1 billion positions per second in a middle game position.
Yet another speed up came, when I implemented a hash table. This proved to be a bit more challenging. First, I needed a hash key for each position. I needed a good hash, because the implementation relied on the 64-bit hash keys being different for different positions. Because of the birthday problem (https://en.wikipedia.org/wiki/Birthday_problem), the probability that two hash keys collide is greater than what the intuition would say, I needed to calculate the likelyhood of hash key collisions. Fortunately, the hash table size would limit the severity of the problem, because not all generated positions would fit in there. I calculated that even in the bad cases, the probability would in the order of one to ten million. That's enough for me. I ended up using the Zobrist keys that are commonly used in the chess engines (https://en.wikipedia.org/wiki/Zobrist_hashing). I update it when executing moves, so it adds very little extra work.
Any reasonable-sized hash table would most probably produce a cache miss for each probed position. That's why I thought it would be wise to take most out of each cache line read. The hash table entries were 16 bytes each, containing the hash key and the payload data (64-bit position counter). The lowest bits of the hash key were used for determining the index of the table, in which the hash table entry was placed. It was obvious that with a move generator producing over 2 billion positions per second, the hash table would be full in a blink of an eye. So I needed a replacement strategy for colliding entries. In this case, it was easy to use the position count covered by a hash entry as a criterion. The table would keep the entry, which had higher position count. Still for each read 16-byte entry, there would be 3 other entries (48 bytes of 64 bytes cache line) that would be read to cache for nothing. So I decided to make the hash table kind of 4-way associative. If there was a write collision, we would also check the other 3 slots in the same cache line. If any of those were empty or had worse position count, we would replace it. And when reading, we would also read those adjacent slots in the hash table for almost no extra cost (one cache line read anyway). Comparing the hash keys would reveal which of the positions was the correct one, if any. This approach improved the hit rate, when the table was filling up.
In my initial implementation, the hash table polluted the caches and trashed the performance as I was afraid of. Not only hash table accesses were memory bound, but also the normal move generation, because the lookup tables were kicked out of cache. Then I made a few adjustments based on profiling. First, I removed the hash reads and writes from the two lowest levels of the search tree. This reduced the memory traffic so that only hash table accesses waited for memory, but not the normal move generation. Another adjustment was not using pointer in hash table accesses. In initially, I had returned a pointer to the hash table entry, when probing for a read. That lead to indirection that made the program actually read the memory location twice: once when finding the correct entry, and once when using the data in it. This was revealed by looking at the compiled assembler code. I changed this so that the hash entry was returned by value, which apparently went nicely in one 8-byte register, because hash-part of the 16-byte entry was not used. This gave the performance boost I had hoped for from the hash table. An open question was still, if I should do a shared hash table for all the workers or a private one for each. After testing this, it was obvious that a shared hash table would beat individual ones any time, even with the additional locks to protect the table from concurrent access to same entries. I was even prepared to do a fancy system with bitfields accessed with atomics to protect entries with finer granularity, but in the end using just one mutex for the whole table was sufficient. The situation may be different if there were more than eight workers, but now locks took so little time that it was difficult to find in the measurements.
In the end, I got nice results, especially in deeper searches. As an example in one position, the multi-threaded, hash table-backed perft counted in 10 ply some 14.6 trillion positions in just 81 seconds. This means over 180 billion positions per second. However, it should be noted that many of these come from the hash table. In these deeper searches, hash table hit rate is over 90 %. Still, here is the standard to beat with GPU.