
How we found that the Linux nios2 memset() implementation had a bug - ingve
http://free-electrons.com/blog/how-we-found-that-the-linux-nios2-memset-implementation-had-a-bug/
======
userbinator
_We were quite surprised to find a bug in some common code for the NIOS II
architecture: we were assuming it would have already been tested on enough
platforms and with enough compilers /situations to not have such issues._

I'm betting on the vast majority of memset() uses being to zero memory, so it
might've not been tested much with other fill patterns.

Compared to MSVC, I've found GCC's inline-assembler feature a huge pain to use
correctly and the few times it was needed, I decided to use a separate file
with pure Asm instead (also with Intel syntax) and link that in. Supposedly
it's more powerful, but having to essentially understand part of the register
allocation process just to use it is a big turn-off. In contrast, MSVC "just
works" \--- the syntax is far more pleasant and it seems to know automatically
what gets modified by the code you added.

I think basic operations like these, bulk fill and copy, are best performed be
specific instructions that the hardware can do in the best way for the
implementation. The nios2 memset() is over two dozen instructions, most of
which are just shuffling bits around. Intel did this right with REP
MOVSB/STOSB.

~~~
to3m
There was exactly this bug in Android for a long time - memset always cleared
the memory to zero. I can't find a good link to this, but try maybe
[https://review.source.android.com/#patch,sidebyside,14699,1,...](https://review.source.android.com/#patch,sidebyside,14699,1,libc/memset.c),
not that Firefox will let me connect to check.

As for the rules for VC++, they are conservative - see
[https://msdn.microsoft.com/en-
us/library/k1a8ss06.aspx](https://msdn.microsoft.com/en-
us/library/k1a8ss06.aspx). In general, you're right that it's much nicer to
use than the shit gcc foists on you, but these conservative rules are
symptomatic of its reluctance to integrate C and asm very well.

That's one problem gcc doesn't have - my complaint isn't that you can, just
that it makes you do it the hard way. But that's totally unnecessary... a few
years ago I used CodeWarrior for PowerPC. It was very pleasant, and did a good
job of making it easy to have C and asm coexist. You allocated registers by
declaring variables as `register', and the compiler would figure out how to
put them somewhere sensible on entry to each asm block, and spill the ones
that were modified, if that even proved necessary. You could leave one asm
block having just modified a register variable, assert that it had done what
you expected, then start another one, along the lines of:

    
    
        register int fred=0;
        asm addi fred,1
        assert(fred==1);
        asm addi fred,1
    

I wish VC++ would do something like that; you can sort-of do it, but it's
impossible to request that a given value goes somewhere in particular
(register, frame relative, etc.) and this can stymie you sometimes. But they
decided to remove inline assembly language altogether for x64, so that's one
way of solving the problem. Still nicer than what gcc does.

~~~
pjmlp
I don't have any need to type inline Assembly in the type of work I do, but
used to back in the MS-DOS days.

The inline Assembly syntax of PC compilers always felt natural and easy to
remember.

Every time I look at gcc inline Assembly, I get this feeling it is impossible
to get right without having the manual page always open.

------
to3m
Here's another horrid gcc inline asm bug:
[http://robertoconcerto.blogspot.co.uk/2013/03/my-hardest-
bug...](http://robertoconcerto.blogspot.co.uk/2013/03/my-hardest-bug.html)

As a general rule, to which the usual caveats for general rules apply, when it
comes to gcc-style inline asm, just say no. Demand more from your tools.

~~~
DSMan195276
Personally I don't find inline-asm with gcc to be all that bad. It is
definitely a very nice way of accessing other-wise inaccessible CPU
instructions - and those tend to be generally trivial to implement (IE. Single
asm instruction, along with some input and output operands). I admit that on
the rare occasion that I do use it, I almost always go reread the
documentation. But still, the compiled end result is tons better then having
to call a completely separate function in your code every time you just want
to use one single CPU instruction.

That said, for longer pieces of code, I definitely avoid inline-asm and just
use a separate assembly file. There's no debating it's a cleaner solution when
you have more then one or a few lines of assembly. And really, besides those
obscure CPU instructions, your written assembly is usually not much different,
then the stuff generated from some C code. It's worth using C wherever it's
viable.

------
atomic_lock

      >  Stores in %4 the initial pattern shifted left by 8 bytes.

Shouldn't that be 8 _bits_?

Very interesting write up

------
asveikau
It's been my experience that it's easy to write GCC inline asm that works the
first few times you use it, but subtly breaks depending on what the optimizer
does with the code around it. Those input/output/clobber arguments can be
tricky to get right and wrongness can be hard to spot.

~~~
pja
Yeah, and as one of the comments to that blog post says, for bit twiddling
operations like this, modern compilers are good enough to generate the optimal
assembler the vast majority of the time: If you write the equivalent C code

    
    
        uint32 fill16 = (fill << 8) | fill;
        uint32 fill32 = (fill16 << 16) | fill16;
    

(copied from the comment, so hopefully correct!) you get the benefit of a) not
having to fiddle with abstruse register allocation semantics but b) you’re
letting gcc use its register allocator to set things up optimally based on the
surrounding optimised code.

My own experience with writing memset / memcpy equivalents 10 years ago on x86
was that gcc did a better job of it. Don’t try and outsmart the compiler for
this kind of common case unless you know the compiler generates inefficient
code - compiler writers have generally got there before you.

~~~
slededit
No compiler (MSVC, gcc, icc) outputs the "bts" instruction for operations
like: bitfield[val / 32] = val % 32; which could implicitly perform the
modulo.

Using the intrinsic provided significant performance improvements, and we got
still more when the rest of the inner loop was rewritten purely in assembly.

------
Drdrdrq
Isn't this exactly the kind of function that you would want thoroughly covered
with unit tests?

~~~
atomic_lock
Altera's software quality isn't exactly shining. Lots of unresolved bugs,
badly documented compiler macros, ...

