I'm betting that the vast majority of memset() uses are to zero memory, so it might not have been tested much with other fill patterns.
Compared to MSVC, I've found GCC's inline-assembler feature a huge pain to use correctly and the few times it was needed, I decided to use a separate file with pure Asm instead (also with Intel syntax) and link that in. Supposedly it's more powerful, but having to essentially understand part of the register allocation process just to use it is a big turn-off. In contrast, MSVC "just works" --- the syntax is far more pleasant and it seems to know automatically what gets modified by the code you added.
I think basic operations like these, bulk fill and copy, are best performed by specific instructions that the hardware can do in the best way for the implementation. The nios2 memset() is over two dozen instructions, most of which are just shuffling bits around. Intel did this right with REP MOVSB/STOSB.
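For anyone who hasn't seen it, here is a rough sketch of what a REP STOSB fill looks like from C with GCC-style extended asm. The helper name `fill_bytes` is made up for illustration, and the plain loop is only there so the snippet still builds on non-x86 targets; this is not how any particular libc does it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: fill n bytes at dst with the low byte of c. */
static void *fill_bytes(void *dst, int c, size_t n)
{
    assert(dst != NULL || n == 0);
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    void *d = dst;
    /* REP STOSB stores AL to [RDI], RCX times, advancing RDI each time.
       "+D" and "+c" say RDI and RCX are both read and modified,
       "a" places c in EAX (so AL holds the fill byte), and the
       "memory" clobber tells the compiler that memory was written. */
    __asm__ volatile ("rep stosb"
                      : "+D"(d), "+c"(n)
                      : "a"(c)
                      : "memory");
#else
    /* Portable fallback for other architectures. */
    unsigned char *p = dst;
    while (n--)
        *p++ = (unsigned char)c;
#endif
    return dst;
}
```

The whole fill is one instruction plus register setup. Whether it beats a tuned vector loop depends on the CPU and the size of the buffer, so measure before relying on it.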
As for the rules for VC++, they are conservative - see https://msdn.microsoft.com/en-us/library/k1a8ss06.aspx. In general, you're right that it's much nicer to use than the shit gcc foists on you, but these conservative rules are symptomatic of its reluctance to integrate C and asm very well.
That's one problem gcc doesn't have - my complaint isn't that you can, just that it makes you do it the hard way. But that's totally unnecessary... a few years ago I used CodeWarrior for PowerPC. It was very pleasant, and did a good job of making it easy to have C and asm coexist. You allocated registers by declaring variables as `register', and the compiler would figure out how to put them somewhere sensible on entry to each asm block, and spill the ones that were modified, if that even proved necessary. You could leave one asm block having just modified a register variable, assert that it had done what you expected, then start another one, along the lines of:
register int fred=0;
asm addi fred,1
asm addi fred,1
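For comparison, a sketch of roughly the same thing in GCC extended asm on x86 (AT&T syntax). The function name `bump_twice` is invented for the example, and the `#else` branch only exists so it compiles on other targets. Here the programmer, not the compiler, has to state that `fred` is read and written ("+r") and that the flags are clobbered ("cc").

```c
#include <assert.h>

/* Hypothetical example: increment a variable in two separate asm blocks. */
static int bump_twice(void)
{
    int fred = 0;
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    /* "+r": fred lives in a register chosen by the compiler and is
       both an input and an output of the asm statement. */
    __asm__ ("addl $1, %0" : "+r"(fred) : : "cc");
    assert(fred == 1);   /* check the first block did what we expected */
    __asm__ ("addl $1, %0" : "+r"(fred) : : "cc");
#else
    /* Fallback so the sketch builds on non-x86 targets. */
    fred += 1;
    assert(fred == 1);
    fred += 1;
#endif
    return fred;
}
```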
The inline Assembly syntax of PC compilers always felt natural and easy to remember.
Every time I look at gcc inline Assembly, I get this feeling it is impossible to get right without having the manual page always open.
Fortunately there are still some people doing development on UNIX and sharing their software tools.
As a general rule, to which the usual caveats for general rules apply, when it comes to gcc-style inline asm, just say no. Demand more from your tools.
That said, for longer pieces of code, I definitely avoid inline-asm and just use a separate assembly file. There's no debating it's a cleaner solution when you have more than one or a few lines of assembly. And really, besides those obscure CPU instructions, your written assembly is usually not much different than the stuff generated from some C code. It's worth using C wherever it's viable.
> Stores in %4 the initial pattern shifted left by 8 bytes.
Very interesting write up
uint32 fill16 = (fill << 8) | fill;
uint32 fill32 = (fill16 << 16) | fill16;
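To make the two lines above concrete, here is a minimal sketch of a word-at-a-time fill built on that replication step. The names `replicate_byte` and `fill_words` are hypothetical, and this is only an illustration of the idea, not the nios2 routine or any libc's actual memset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy one byte into all four bytes of a 32-bit word. */
static uint32_t replicate_byte(uint8_t fill)
{
    uint32_t fill16 = ((uint32_t)fill << 8) | fill;   /* 0x00AB -> 0xABAB */
    uint32_t fill32 = (fill16 << 16) | fill16;        /* -> 0xABABABAB */
    return fill32;
}

/* Hypothetical fill routine: bytes until aligned, then whole words,
   then any trailing bytes. */
static void *fill_words(void *dst, uint8_t fill, size_t n)
{
    assert(dst != NULL || n == 0);
    unsigned char *p = dst;
    uint32_t word = replicate_byte(fill);

    /* Leading bytes until p is 4-byte aligned. */
    while (n > 0 && ((uintptr_t)p & 3u) != 0) {
        *p++ = fill;
        n--;
    }
    /* Bulk part: a 4-byte memcpy, which compilers typically lower
       to a single word store. */
    while (n >= 4) {
        memcpy(p, &word, sizeof word);
        p += 4;
        n -= 4;
    }
    /* Trailing bytes. */
    while (n > 0) {
        *p++ = fill;
        n--;
    }
    return dst;
}
```

Note the shifts are by 8 and 16 bits, so the pattern ends up in every byte of the word regardless of endianness.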
My own experience with writing memset / memcpy equivalents 10 years ago on x86 was that gcc did a better job of it. Don’t try and outsmart the compiler for this kind of common case unless you know the compiler generates inefficient code - compiler writers have generally got there before you.
Using the intrinsic provided significant performance improvements, and we got still more when the rest of the inner loop was rewritten purely in assembly.