Nothing is more annoying than debugging a program that you think should work properly. After a few hours of writing functions to print out bitboards and the differences between them, I finally got the correct results from perft(). At first, capturing promotions were generated twice: first as promotions and then as captures. After I fixed that, I still had a bug with castling rights: too many castling moves were generated. Because I did not have separate make() and unmake() functions, I needed to make sure the move generator itself clears the appropriate castling rights whenever a king or a rook moves (both captures and non-captures) or a rook gets captured (including by capturing promotions). I had to go through every possible case, and the result is about 900 lines of code. Why that many lines? Let me explain.
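To illustrate the castling-rights rule described above, here is a minimal C++ sketch. It is not the code from my generator: the bit encoding, the square numbering (a1 = 0 up to h8 = 63) and the names keepMask() and updateCastlingRights() are all assumptions made up for this example. The idea is that one mask lookup on both the origin and the destination square covers king moves, rook moves and rook captures in the same way.

```cpp
#include <cstdint>

// Hypothetical encoding: one bit per castling right.
constexpr std::uint8_t WHITE_KINGSIDE  = 1;
constexpr std::uint8_t WHITE_QUEENSIDE = 2;
constexpr std::uint8_t BLACK_KINGSIDE  = 4;
constexpr std::uint8_t BLACK_QUEENSIDE = 8;
constexpr std::uint8_t ALL_RIGHTS      = 15;

// Assumed square numbering: a1 = 0, b1 = 1, ..., h8 = 63.
constexpr int A1 = 0, E1 = 4, H1 = 7, A8 = 56, E8 = 60, H8 = 63;

// Rights that survive when a move starts from or lands on the given square.
// Only the king and rook home squares clear anything.
constexpr std::uint8_t keepMask(int sq) {
    return static_cast<std::uint8_t>(
        sq == E1 ? ALL_RIGHTS & ~(WHITE_KINGSIDE | WHITE_QUEENSIDE) :
        sq == H1 ? ALL_RIGHTS & ~WHITE_KINGSIDE :
        sq == A1 ? ALL_RIGHTS & ~WHITE_QUEENSIDE :
        sq == E8 ? ALL_RIGHTS & ~(BLACK_KINGSIDE | BLACK_QUEENSIDE) :
        sq == H8 ? ALL_RIGHTS & ~BLACK_KINGSIDE :
        sq == A8 ? ALL_RIGHTS & ~BLACK_QUEENSIDE :
                   ALL_RIGHTS);
}

// Masking with the origin square handles a king or rook leaving home.
// Masking with the destination square handles a rook being captured at home,
// including by a capturing promotion, because only the square matters.
constexpr std::uint8_t updateCastlingRights(std::uint8_t rights, int from, int to) {
    return static_cast<std::uint8_t>(rights & keepMask(from) & keepMask(to));
}
```

For example, with these assumptions a pawn capturing from g7 (square 54) to h8 clears only the black kingside right, whether or not it promotes, and a move such as e2 to e4 leaves all rights untouched.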
A simple move generator can be much shorter than 900 lines of code. But my move generator has the making of the move and MVV-LVA move ordering embedded in it. In addition, I added a strict legality check to the move generator. The purpose is to have enough instructions between global memory writes. Because memory speed is the limiting factor, it makes sense to do something useful while waiting for the write operations to complete. It is important to note that global memory writes do not block the execution of other instructions that do not depend on the written data.
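For readers who have not met MVV-LVA (most valuable victim, least valuable attacker), here is a small C++ sketch of the ordering rule only. In my generator the ordering is embedded in the generation itself, so this is a different and simpler technique: it scores captures and sorts a list. The Piece enum, the Capture struct and the function names are hypothetical and exist only for this example.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical piece ranking, cheapest first.
enum Piece { PAWN, KNIGHT, BISHOP, ROOK, QUEEN, KING };

// Hypothetical capture record: who captures what.
struct Capture {
    Piece attacker;
    Piece victim;
};

// The victim is the major key (bigger is better) and the attacker is the
// minor key (smaller is better). Multiplying by 8 keeps the keys separate,
// because there are fewer than 8 piece types.
constexpr int mvvLvaScore(Piece attacker, Piece victim) {
    return 8 * static_cast<int>(victim) - static_cast<int>(attacker);
}

// Sort captures so that the most promising ones come first.
inline void sortMvvLva(std::vector<Capture>& captures) {
    std::stable_sort(captures.begin(), captures.end(),
                     [](const Capture& a, const Capture& b) {
                         return mvvLvaScore(a.attacker, a.victim) >
                                mvvLvaScore(b.attacker, b.victim);
                     });
}
```

With this scoring, a pawn taking a queen comes before a knight taking a queen, which in turn comes before any capture of a rook. A generator that emits captures victim by victim and attacker by attacker produces the same order without a separate sorting pass.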
Now I am planning to test the code with different numbers of threads, beginning with a single thread and slowly increasing the thread count. If I record the nodes per second for each thread count, I can plot a graph that shows how the algorithm scales. I hope I can do this sometime this week.