It usually is inline assembly, but you write it in C first and copy the disassembly as a starting point. We do this in the embedded world a lot because we're constantly switching architectures, and nobody memorizes the instruction set of every architecture :)
The most common reason is using architecture-dependent instructions that the compiler doesn't generate well, or doesn't generate at all. Examples are SIMD (auto-vectorization is nice, but far from perfect) and DSPs that have specific multiple-and-accumulate instructions or flags that change the behavior of the accumulate register.
In a project I'm currently working on, inlining was still inferior to fully native ASM. LLVM generated unnecessary stack loading in the prologue, and the completely unused memory access had something like a 4% speed penalty.
A lot of missed optimization opportunities come from the impracticability of communicating to the compiler information known to the developer. Basic examples include “the two lvalues here do not alias” and “the int variable x here is always positive, x/8 can be compiled as straightforward right shift instead of a more complex sequence of instructions”. There are various source-level workarounds including using restrict when applicable, copying lvalues to local variables whose address is never taken to make the lack of aliasing clearer, and casting to unsigned int before dividing. In the worst cases, you have to make the program less readable in order to improve the generated code, with no guarantees that the change makes the generated code better for all compilers (some compilers might still miss the optimization opportunity and generate additional instructions for the copies/casts).
On a related note, I have a dream one day to discover a real example where undefined behavior can be used constructively as a license for the compiler to optimize: the following post alludes to this idea but assembly dumps at the bottom show that the compiler is not taking advantage of the information encoded into the undefined behavior:
More seriously, an annotation language for expressing properties that are supposed to be true at various points of the program can be useful to transmit information from the programmer to the compiler and enabling optimizations that would otherwise require difficult full-program analysis. And these annotations can be used to analyze the program too!
Yes, although I make fun of LLVM for the non-optimal code, I really doubt anyone would consider them 'bugs' -- and I did not. They aren't bugs, they are just optimizations that we haven't found a good way to automatically identify yet.
Though, it does seem to always store the old stack pointer in r7, even though it doesn't restore from r7, and even though my inline assembly block specifies r7 on the clobber list. That might be a bug, but it's a single 'add', so who cares.
When you're writing on an architecture with only poor C compilers, it makes a lot of sense to prototype in C and hand compile it. For example, as a hobby, I write on a platform where program memory is limited to 16k and the only available C compiler is sdcc, which outputs relatively large binaries.
So writing in C is fine for smaller projects, but those are easy enough in assembly anyway, especially on the old architecture designed for human-produced asm.