The source code of a chess engine often has plenty of bit twiddling in it. In particular, engines that use bitboards need many bit operations such as exclusive ORs, ANDs, complements, and left or right shifts. These operations can be written in a high-level programming language, most often C or C++. Modern C compilers are quite smart and perform all kinds of optimizations, especially when asked to optimize the code. It is therefore not strictly necessary to use assembly code in the engine. In addition, assembly code can be quite difficult to debug.
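To make the bit operations concrete, here is a minimal C++ sketch of a few typical bitboard helpers. The square mapping (a1 = bit 0, h8 = bit 63) and all function names are assumptions chosen for illustration, not code from the engine.

```cpp
#include <cstdint>

// One bit per square. Assumed mapping: a1 = bit 0, b1 = bit 1, ..., h8 = bit 63.
using Bitboard = std::uint64_t;

constexpr Bitboard FILE_A = 0x0101010101010101ULL;
constexpr Bitboard FILE_H = 0x8080808080808080ULL;

// Bitboard with only the given square set (shift).
constexpr Bitboard square_bb(int sq) { return Bitboard{1} << sq; }

// Move a piece from one square to an empty square (exclusive OR).
// Applying the same XOR again undoes the move.
constexpr Bitboard move_piece(Bitboard pieces, int from, int to) {
    return pieces ^ (square_bb(from) | square_bb(to));
}

// Single pushes of white pawns onto empty squares (shift, complement, AND).
constexpr Bitboard white_pawn_pushes(Bitboard pawns, Bitboard occupied) {
    return (pawns << 8) & ~occupied;
}

// Squares attacked by white pawns. The file masks stop the shifts
// from wrapping around the board edges.
constexpr Bitboard white_pawn_attacks(Bitboard pawns) {
    return ((pawns & ~FILE_A) << 7) | ((pawns & ~FILE_H) << 9);
}

// Compile-time checks on the starting position of the white pawns.
static_assert(white_pawn_pushes(0xFF00ULL, 0xFFFF00000000FFFFULL) == 0xFF0000ULL,
              "every pawn on rank 2 can step to rank 3");
static_assert(move_piece(0xFF00ULL, 12, 28) == 0x1000EF00ULL,
              "e2-e4 clears bit 12 and sets bit 28");
```

Each helper compiles to a handful of integer instructions, which is why this style is attractive for both CPUs and GPUs.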
Unfortunately, the nvcc compiler used for CUDA C does not seem to be that smart. It uses lots of registers, does needless type conversions, and emits more instructions than necessary. Register usage is perhaps the most severe problem, because the occupancy of the multiprocessors depends on the number of registers per thread. I tried a simple test program written in CUDA C and examined the PTX assembly code produced by the compiler. Then I wrote the same program directly in PTX assembly and got the same result with half the register count and half the instruction count. This suggests a speedup of a factor of two or more. I will certainly write the core of the engine in PTX assembly, because wasting computing power for nothing feels bad.
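The link between registers per thread and occupancy can be sketched with a little arithmetic. The C++ snippet below is a simplified model only: the `SmLimits` struct and `resident_threads` function are hypothetical names, the limits used (32768 registers and 1536 resident threads per multiprocessor) are assumed example values rather than the figures of any particular card discussed here, and the register counts of 40 and 20 are made up to illustrate a halving. Real hardware also allocates registers with some granularity, which this ignores.

```cpp
#include <cstdint>

// Assumed per-multiprocessor limits (illustrative values, not measured).
struct SmLimits {
    std::uint32_t registers_per_sm;    // size of the register file
    std::uint32_t max_threads_per_sm;  // hard cap on resident threads
};

// Simplified model: resident threads are limited either by the register
// file or by the hard thread cap, whichever is smaller.
constexpr std::uint32_t resident_threads(SmLimits sm, std::uint32_t regs_per_thread) {
    return (sm.registers_per_sm / regs_per_thread < sm.max_threads_per_sm)
               ? sm.registers_per_sm / regs_per_thread
               : sm.max_threads_per_sm;
}

constexpr SmLimits kExampleSm{32768, 1536};

// With 40 registers per thread the register file is the bottleneck.
static_assert(resident_threads(kExampleSm, 40) == 819, "about 53% occupancy");
// Halving to 20 registers per thread lets the thread cap be reached.
static_assert(resident_threads(kExampleSm, 20) == 1536, "full occupancy");
```

Under these assumed numbers, halving the register count nearly doubles the number of threads a multiprocessor can keep in flight, on top of any gain from executing fewer instructions.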
So what should be written in assembly? I think the functions that are called most often. These include the make() and unmake() functions, move generation, search, and evaluation. Initially, I will write a perft() function to test the speed and validity of the move generation. It has to be faster than the CPU move generator; otherwise I will have to redesign the whole thing. I will let you know when I have something working.
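For readers unfamiliar with perft, the idea is to count the leaf nodes of the move tree to a fixed depth, using make() and unmake() around each recursive call. The C++ sketch below shows the shape of that loop on a deliberately tiny stand-in: a lone king on an empty board. `ToyPosition`, `king_targets`, and the function signatures are hypothetical; a real engine would plug in its full move generator, and none of this is the PTX code planned above.

```cpp
#include <cassert>
#include <cstdint>

using Bitboard = std::uint64_t;

constexpr Bitboard FILE_A = 0x0101010101010101ULL;
constexpr Bitboard FILE_H = 0x8080808080808080ULL;

// Hypothetical toy position: a lone king on an otherwise empty board.
struct ToyPosition {
    Bitboard king;
};

// All squares a king on `king` can step to (a1 = bit 0 assumed).
constexpr Bitboard king_targets(Bitboard king) {
    return ((king | ((king & ~FILE_A) >> 1) | ((king & ~FILE_H) << 1)) |
            ((king | ((king & ~FILE_A) >> 1) | ((king & ~FILE_H) << 1)) << 8) |
            ((king | ((king & ~FILE_A) >> 1) | ((king & ~FILE_H) << 1)) >> 8)) &
           ~king;
}

// In this toy, make and unmake are the same XOR of the from and to squares.
inline void make(ToyPosition& pos, Bitboard from_to) { pos.king ^= from_to; }
inline void unmake(ToyPosition& pos, Bitboard from_to) { pos.king ^= from_to; }

// Count leaf nodes of the move tree at the given depth.
inline std::uint64_t perft(ToyPosition& pos, int depth) {
    assert(depth >= 0);
    if (depth == 0) return 1;

    std::uint64_t nodes = 0;
    const Bitboard from = pos.king;
    Bitboard targets = king_targets(from);
    while (targets != 0) {
        const Bitboard to = targets & (~targets + 1);  // isolate lowest set bit
        targets &= targets - 1;                        // clear that bit
        make(pos, from | to);
        nodes += perft(pos, depth - 1);
        unmake(pos, from | to);                        // restore the position
    }
    return nodes;
}
```

With the king on a1, `perft(pos, 1)` counts its three legal steps, and the position is unchanged after the call because every make() is paired with an unmake(). Comparing such counts against known values is what makes perft useful for validating a move generator, and timing the call gives the speed figure.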