
x86 Assembly Primer for C Programmers - frozeneskimo
http://speakerdeck.com/u/vsergeev/p/x86-assembly-primer-for-c-programmers
======
_delirium
This is just about the opening example, so not really the main point of the
slides, but: Is it actually still faster on modern machines to use this repnz
version? One analysis a few years ago found that the naive C implementation,
when compiled with gcc optimizations, was actually faster than that inline-asm
implementation; the inline-asm implementation has fewer instructions, but
doesn't execute faster: <http://canonical.org/~kragen/strlen-utf8.html>
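
For anyone who wants to measure this themselves, here is a minimal timing sketch in C. It is not the benchmark from the linked analysis. The names `naive_strlen` and `time_strlen` are made up for illustration, and `clock()` is coarse, so treat the numbers as a rough indication only:

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical name: the straightforward byte-at-a-time loop. */
size_t naive_strlen(const char *s)
{
    assert(s != NULL);
    const char *p = s;
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}

/* Any strlen-like function can be passed to the harness below. */
typedef size_t (*strlen_fn)(const char *);

/* Calls fn(s) `reps` times and returns the elapsed CPU time in seconds.
 * The results are accumulated into a volatile variable to make it harder
 * for the compiler to discard the calls. */
double time_strlen(strlen_fn fn, const char *s, unsigned long reps)
{
    volatile size_t sink = 0;
    clock_t start = clock();
    for (unsigned long i = 0; i < reps; i++)
        sink += fn(s);
    clock_t end = clock();
    (void)sink;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Passing the inline-asm version and the plain C version through the same `time_strlen` call, at the same optimization level and on the same input, would give a like-for-like comparison on a given machine.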

~~~
frozeneskimo
This is a great question. You can also find some of glibc's even more
optimized versions of strlen() here:

[http://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/i386...](http://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/i386/i486/strlen.S)

[http://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/i386...](http://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/i386/i586/strlen.S)

It's counter-intuitive that so many instructions can be dedicated to something
as seemingly simple as strlen(), but it does highlight the complexity of
modern processors. On that note, I don't know which implementation of strlen()
is fastest, and I'm reluctant to guess: it's very difficult to predict the
effects of caches, instruction scheduling, pipelines, etc. just by looking at
the source, especially when other code is also involved. A benchmark like the
one in the link you posted could probably give some useful results, and help
justify the code above too.

Someone with a better understanding of instruction scheduling, pipelines, and
other optimizations can probably give a better answer to this than me.

~vsergeev/frozeneskimo

~~~
tcas
Wow, those examples took me a while to go through and understand; it's amazing
how far you can tune a piece of code to the hardware. I'm curious to
benchmark the implementations and see the performance difference.

Maybe you or someone else knows the answer to this though: it seems like they
are processing 4 bytes of the string at a time. If they read past the end
(i.e. the NUL byte is in byte position 1, 2, or 3 of the word), isn't that
technically undefined behavior? They are only reading the memory, but I feel
like valgrind or another tool would spit out an error if that happened. The
read is aligned, so it won't trigger a page fault, but it seems like an unsafe
optimization.
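
To make the question concrete, here is a rough C sketch of the 4-bytes-at-a-time idea. It is not the glibc code. `word_strlen` is a made-up name, and the zero-byte test is the well-known `(w - 0x01010101) & ~w & 0x80808080` bit trick, which may differ from what the linked assembly does:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of a word-at-a-time strlen.
 * Note: the 4-byte load below can read up to 3 bytes past the terminator.
 * That is outside what the C standard guarantees for an arbitrary string;
 * it relies on an aligned 4-byte read never crossing a page boundary. */
size_t word_strlen(const char *s)
{
    assert(s != NULL);
    const char *p = s;

    /* Go byte by byte until p is 4-byte aligned. */
    while (((uintptr_t)p & 3u) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }

    /* Scan one aligned 32-bit word at a time. */
    for (;;) {
        uint32_t w;
        memcpy(&w, p, sizeof w); /* aligned load; may over-read */
        /* Nonzero exactly when at least one byte of w is zero. */
        if (((w - 0x01010101u) & ~w & 0x80808080u) != 0)
            break;
        p += 4;
    }

    /* The terminator is somewhere in this word; find the exact byte. */
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}
```

The over-read happens on the `memcpy` line whenever the terminator is not the last byte of its word, which is exactly the case a memory checker could flag.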

~~~
frozeneskimo
Yeah, I see your point. Like you said, an aligned 4-byte read never crosses a
page boundary, so it can't cause a page fault that a byte-by-byte scan
wouldn't, and it's technically "ok". I can only assume that at this level the
corner is safely cut for the sake of performance.

Another, more trivial example of something like this is in the repnz-based
strlen() (slides 7-8), where %ecx is loaded with 0xffffffff, which
technically limits the routine to scanning strings up to 4 gigabytes in
length. It's a valid assumption that the string is under 4GB (especially on a
strictly 32-bit system), but the point is that it's a semantically different
routine from the C-based one.
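
Here is a rough C model of that counter behavior, written from the documented semantics of `repne scasb` and not copied from the slides. `counted_strlen` and its `count` parameter are made up for illustration; passing 0xffffffff as `count` corresponds to the %ecx value mentioned above:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a `repne scasb` scan for a zero byte.
 * `count` plays the role of the initial %ecx value. The scan stops at the
 * terminator or when the counter reaches zero, whichever comes first. */
size_t counted_strlen(const char *s, uint32_t count)
{
    assert(s != NULL);
    assert(count != 0);

    uint32_t ecx = count;
    const unsigned char *edi = (const unsigned char *)s;

    while (ecx != 0) {
        ecx--;              /* the counter drops once per byte examined */
        if (*edi++ == 0)    /* stop once the zero byte has been seen */
            break;
    }

    /* count - ecx is the number of bytes examined, terminator included.
     * With count == 0xffffffff this equals ~ecx, hence "~ecx - 1". */
    return (size_t)(count - ecx) - 1;
}
```

With a small `count` the limit becomes visible: `counted_strlen("hello", 3)` gives 2 and not 5, because the scan gives up before reaching the terminator and still reports a length.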

~vsergeev/frozeneskimo

~~~
VMG
(signatures are frowned upon around here)

~~~
frozeneskimo
(I see, my bad. Habit of mine. Thanks)

------
ben0x539
Would be nice if it was 64bit. Seems a bit late.

~~~
rmcclellan
64 bit x86 is very similar to 32 bit. The differences are covered on slides
191-193 in this deck.

The biggest difference for me is the difference in the calling convention. In
32 bit, all arguments generally are placed on the stack for "standard" calls.
In 64 bit, different OSes have different conventions:

[http://en.wikipedia.org/wiki/X86_calling_conventions#x86-64_...](http://en.wikipedia.org/wiki/X86_calling_conventions#x86-64_calling_conventions)
(note that OS X and Linux use the "System V" calling conventions)
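
As a small illustration, here is a plain C function with six integer arguments. `sum6` is a made-up name. The comment lists where the arguments are passed under each convention, as summarized on the linked page:

```c
#include <assert.h>

/* Hypothetical example: where the six integer arguments travel.
 *
 *   32-bit cdecl:          a..f all on the stack, pushed right to left
 *   System V x86-64:       a=rdi b=rsi c=rdx d=rcx e=r8 f=r9
 *   Microsoft x64:         a=rcx b=rdx c=r8  d=r9, e and f on the stack
 *
 * The C source is identical in all three cases; only the generated
 * call sequence differs. */
long sum6(long a, long b, long c, long d, long e, long f)
{
    return a + b + c + d + e + f;
}
```

Compiling this with `-S` on each platform and reading the generated assembly is an easy way to see the difference first-hand.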

~~~
xpaulbettsx
I was about to call BS on that last sentence but it looks like I was the one
full of it. I thought that the amd64 calling convention was standardized,
bummer.

On Windows, amd64 is much better than x86 in this regard because of the bevy
of x86 calling conventions that are still around. Only _one_ amd64 calling
convention.

~~~
brigade
To be fair, the System V document is AMD's convention and it was Microsoft
that decided to design an incompatible (and worse) ABI.

~~~
xpaulbettsx
Sucks. I wonder if AMD wrote that later, after Microsoft had made up their own
and were dependent on it. Dave Cutler was involved in amd64 _really_ early in
the process of Clawhammer (mainly because he hates Intel with a passion!)

------
burstlag
The slides look like they have good information, but I'd really love to see
that speech that (I suppose) went with it.

------
tene
All I get is the speakerdeck main page, with "You are not authorized to access
this."

Does anyone have a working link to the content?

~~~
frozeneskimo
Sorry, it's back up. Apparently updating the PDF broke speakerdeck,
permanently marking the presentation as "unpublished", even though it is
public. I had to delete and re-upload. Probably shouldn't have updated the PDF
in the first place, though.

Content is available here as well: <https://github.com/vsergeev/apfcp>

------
iab
I wish I could upvote this indefinitely. What a great resource, thanks for
your efforts!

------
Duckpaddle2
Great resource, thanks for it!

