And I usually prototype the C in Python first :D
The most common reason is using architecture-dependent instructions that the compiler doesn't generate well, or doesn't generate at all. Examples are SIMD (auto-vectorization is nice, but far from perfect) and DSPs that have specific multiple-and-accumulate instructions or flags that change the behavior of the accumulate register.
In a project I'm currently working on, inlining was still inferior to fully native ASM. LLVM generated unnecessary stack loading in the prologue, and the completely unused memory access had something like a 4% speed penalty.
On a related note, I have a dream one day to discover a real example where undefined behavior can be used constructively as a license for the compiler to optimize: the following post alludes to this idea but assembly dumps at the bottom show that the compiler is not taking advantage of the information encoded into the undefined behavior:
More seriously, an annotation language for expressing properties that are supposed to be true at various points of the program can be useful to transmit information from the programmer to the compiler and enabling optimizations that would otherwise require difficult full-program analysis. And these annotations can be used to analyze the program too!
Though, it does seem to always store the old stack pointer in r7, even though it doesn't restore from r7, and even though my inline assembly block specifies r7 on the clobber list. That might be a bug, but it's a single 'add', so who cares.
So writing in C is fine for smaller projects, but those are easy enough in assembly anyway, especially on the old architecture designed for human-produced asm.
Back when I started programming (1986), Basic, C, Pascal and Forth were seen as prototyping languages, before coding the real applications in Z80, 6502 Assembly.
You can tell where the devs with Ph.Ds in compiler design are putting all of their effort :)
I think that's also affecting where the optimization effort goes to a great extent - it's more likely to be invested on the type of code people are more likely to use in critical inner loops to be run in places that might saturate large numbers of cores...