
Computers are fast - bdcravens
http://jvns.ca/blog/2014/05/12/computers-are-fast/
======
personalcompute
I particularly enjoyed the writing style in this article, largely because of
the extent to which the author provided unverified, loose figures - CPU time
distributions etc. In my experience people are usually extremely hesitant to
publish uninformed, fast, and incomplete conclusions, despite those
conclusions being, in my opinion, still extremely valuable. They may not be
perfectly correct, but a small conclusion like that is often much better than
the practically nonexistent data I start off with, and it lets me read the
article far faster than if I had to slow down and draw those fuzzy
conclusions myself. There is a misconception that when writing you can do
only two things: state a fact or say something false. In reality it is a
gray gradient, and when the reader starts off knowing nothing, that gray is
many times superior. Anyway, awesome job - I really want to see more of this
writing style in publications like personal blogs.

[In case it isn't clear, I'm referring to statements like "So I _think_ that
means that it spends 32% of its time accessing RAM, and the other 68% of its
time doing calculations.", and "So we’ve learned that cache misses can make
your code 40 times slower." (comment made in the context of a single non-
comprehensive datapoint)]

~~~
kijin
Agreed. At first, the writing style (too! many! exclamation! points! Are you
Unidan or what?) made me think that the article wasn't serious. The fact that
I immediately thought, "you're just going to be benchmarking your disk!"
didn't help, either. But the author did take the disk performance into
account, and the rest of the article was surprisingly interesting as well.
It's a quick and dirty exposition for a quick and dirty suite of benchmarks.

~~~
StavrosK
I don't know, I love her enthusiasm.

~~~
mrec
Same here. I thought it helped to dispel the conception that only Very Serious
People should be interested in this stuff.

------
exDM69
I posted the following as a comment to the blog; I'll duplicate it here in
case someone wants to discuss it:

This program is so easy on the CPU that it should be entirely limited by
memory bandwidth and the CPU should be pretty much idle. The theoretical upper
limit ("speed of light") should be around 50 gigabytes per second for modern
CPU and memory.

In order to get closer to the SOL figure, try adding hints for prefetching the
data closer to the CPU. Use mmap and give the operating system hints to load
the data from disk to memory using madvise and/or posix_fadvise. This should
probably be done once per big chunk (several megabytes) because the system
calls are so expensive.
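
A minimal sketch of that chunked hinting (my own illustration, not code from
the post; the chunk size and names are arbitrary), assuming a plain read-only
mmap of the file:

    
    
      #include <stdio.h>
      #include <fcntl.h>
      #include <unistd.h>
      #include <sys/mman.h>
      #include <sys/stat.h>
      
      /* Sum the bytes of a file, asking the kernel for the *next* chunk
       * while we work on the current one, so it is (hopefully) already in
       * the page cache by the time we get there. */
      int main(int argc, char **argv) {
          int fd = open(argv[1], O_RDONLY);
          struct stat st;
          fstat(fd, &st);
          size_t size = st.st_size;
          const size_t CHUNK = 8 << 20;            /* 8 MB hint granularity */
      
          unsigned char *data = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
          madvise(data, size, MADV_SEQUENTIAL);    /* overall access pattern */
      
          unsigned char sum = 0;
          for (size_t off = 0; off < size; off += CHUNK) {
              size_t len = off + CHUNK < size ? CHUNK : size - off;
              if (off + CHUNK < size) {            /* hint the next chunk */
                  size_t ahead = off + 2 * CHUNK < size ? CHUNK
                                                        : size - off - CHUNK;
                  madvise(data + off + CHUNK, ahead, MADV_WILLNEED);
              }
              for (size_t i = 0; i < len; i++)
                  sum += data[off + i];
          }
          printf("The answer is: %d\n", sum);
          munmap(data, size);
          close(fd);
          return 0;
      }
    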

Then try to make sure that the data is as close to the CPU as possible,
preferably in the first level of the cache hierarchy. This is done with
prefetching instructions (the "streaming" part of SSE that everyone always
forgets). For GCC/Clang, you could use __builtin_prefetch. This should be
done several cache lines ahead, because the time to actually process the data
should be next to nothing compared to fetching it from the caches.
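
For the in-cache part, a sketch of what such a prefetching loop might look
like (the 512-byte lookahead and once-per-cache-line placement here are
illustrative guesses to be tuned, not measured values):

    
    
      #include <stddef.h>
      
      /* Sum bytes, software-prefetching a few cache lines ahead of the
       * current position; issue one hint per 64-byte line, not per byte. */
      unsigned char sum_with_prefetch(const unsigned char *data, size_t n) {
          unsigned char sum = 0;
          for (size_t i = 0; i < n; i++) {
              if ((i & 63) == 0 && i + 512 < n)
                  __builtin_prefetch(data + i + 512, 0, 0); /* read, low locality */
              sum += data[i];
          }
          return sum;
      }
    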

Because this is limited by memory bandwidth, it should be possible to do some
more computation for the same price. So while you're at it, you can compute
the sum, the product, a CRC sum, a hash value (perhaps with several hash
functions) at the same cost (if you count only time and exclude the power
consumption of the CPU).

~~~
sanxiyn
Yup, __builtin_prefetch certainly, if one wants to get it faster. (Compilers
can't do this themselves, because they don't know how much they should
prefetch ahead, and if you get that wrong it's worse than no prefetching.)

I am less sure about madvise. It seems to me default heuristics should work
fine for this case, and as you said, system calls are expensive.

~~~
exDM69
> I am less sure about madvise. It seems to me default heuristics should work
> fine for this case, and as you said, system calls are expensive.

System calls are expensive but so are page faults or having to access the
disk. If you can avoid page faults by using madvise to prefetch from disk to
memory, it should be worth it. In particular, the first run with cold caches
should be faster.

However, the operating system may be smart enough to realize that we're doing
a sequential access and speculatively read ahead, in which case the madvise
calls would be wasted time.

The same happens with CPU caches too: the CPU's internal prefetcher is pretty
good at recognizing a sequential access and grabbing the next cache line in
advance. A few naively placed __builtin_prefetches don't seem to help here (I
just tried this out).

Prefetching hints work a lot better in non-sequential access patterns (linked
lists, etc).
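
The classic case is pointer chasing, where the hardware prefetcher has nothing
regular to latch onto. A sketch with a hypothetical node type (purely for
illustration; it only pays off when there is enough per-node work to overlap
with the fetch):

    
    
      #include <stddef.h>
      
      /* Start pulling the next node into cache while working on the current
       * one; the hardware can't guess where the next node lives. */
      struct node { struct node *next; long value; };
      
      long list_sum(const struct node *head) {
          long sum = 0;
          for (const struct node *n = head; n != NULL; n = n->next) {
              if (n->next)
                  __builtin_prefetch(n->next, 0, 1);
              sum += n->value;
          }
          return sum;
      }
    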

------
krick
Pretty naïve, I'm surprised to see it here. Not that this is a pointless
study, but it's pretty easy to guess these numbers if you know roughly how
long it takes to use a register, L1, L2, RAM, or the hard drive (and you
should). Exactly how long a given task takes is a task-specific question,
because it depends on which optimization techniques can be used and which
cannot, so unless you are interested specifically in summation mod 256, this
information isn't of much use; "processing" is much broader than "adding
modulo 256".

But it's nice that somewhere somebody else understood that _computers are
fast_. Seriously, no irony here, because it's about time people realized what
a disastrous state modern computing is in. Your home PC processes gigabytes of
data in a matter of seconds; the amount of computation (relative to its cost)
it is capable of would have driven a scientist crazy 60 years ago, and it gets
_wasted_. It's the year 2014 and you _have to_ wait for your computer. It's so
much faster than you, but you are waiting for it! What an irony! You don't
even want to add up a gigabyte of numbers, you want to close a tab in your
browser or whatever, and there are quite a few processes running in the
background that actually need to be running right now to do something useful;
unfortunately the OS doesn't know which ones. Unneeded data sits cached in
RAM and you wait while the OS fetches a memory page from the HDD. But, well,
after 20 layers of abstraction it's pretty hard to do only useful
computations, so you make your user wait for some computationally simple
stuff to finish.

About every time I write code I feel guilty.

~~~
userbinator
I agree completely. Hardware has gotten _orders of magnitude_ faster, yet the
experience, as a user, hasn't changed nearly that much. You'll probably find
this comparison interesting:
[http://hallicino.hubpages.com/hub/_86_Mac_Plus_Vs_07_AMD_Dua...](http://hallicino.hubpages.com/hub/_86_Mac_Plus_Vs_07_AMD_DualCore_You_Wont_Believe_Who_Wins)

I think the saying that programmer time is more expensive than machine time
is, like the quote about premature optimisation, responsible for promoting an
inherently wasteful culture in programming when people are taught to take them
at face value; looking at it another way, power isn't free, and when machine
time translates to _user_ time, then the situation definitely changes. I
always keep in mind that users are using the software I create to _do work_ ,
and their time is just as valuable if not more so than my own, especially if
there are more of them. To me, it's a question of balancing the tradeoffs ---
it's not worth spending a day to optimise software that will save a few
minutes across all its users over its lifetime, but it is very well worth it
to spend a week to optimise something so that it may save a year or more of
its users' time. Whenever a programmer complains about certain tools (e.g.
compiler, IDE, etc.) being slow, I keep that in mind to mention the next time
he/she flippantly dismisses a time-saving optimisation using the "my time is
more expensive" excuse. Programmers are users too. :-)

~~~
couchand
Found this gem in the linked article: _Windows Vista demands enough real
estate on your hard drive that you could easily fit 30,000 full-length novels
into it._

A nitpick: a day of work, say six hours, is only 360 minutes. If you can
optimize some task to save a few minutes per user, it only takes a few
hundred users to be worth it.

However, I think the worst optimizations to leave on the table are in an area
many developer-types won't think about: usability. We can save much more of a
user's time by creating simple, intuitive interfaces than we can by checking
cache misses (in most modern development).

~~~
claudius
> We can save much more of a user's time by creating simple, intuitive
> interfaces than we can by checking cache misses (in most modern development).

And at least for regularly-used software, you can save even more user time by
optimising not for the first five minutes of usage but for long-term usage.
Keyboard shortcuts are a good example of this: they are mostly not simple and
certainly not intuitive per se (why should Alt+Tab, in particular, switch
windows?), yet once you know them, they do save a lot of time.

------
userbinator
_I wrote a new version of bytesum_mmap.c [...]and it took about 20 seconds. So
we’ve learned that cache misses can make your code 40 times slower_

What's being benchmarked here is not (the CPU's) cache misses, but a lot of
other things, including the kernel's filesystem cache code, the page fault
handler, and the prefetcher (both software and hardware). The prefetcher is
what's making this so much faster than it would otherwise be if each one of
those accesses were full cache misses. If cache misses really were _only_ 40
times slower, performance profiles would be very different than they are
today!

Here are some interesting numbers on cache latencies in (not so) recent Intel
CPUs:

[https://software.intel.com/en-
us/forums/topic/287236](https://software.intel.com/en-us/forums/topic/287236)

 _I’m also kind of amazed by how fast C is._

For me, one of the points that this article seems to imply is that modern
hardware can be extremely fast, but in our efforts to save "programmer time",
we've sacrificed an order of magnitude or more of that.

~~~
skrebbel
> _For me, one of the points that this article seems to imply is that modern
> hardware can be extremely fast, but in our efforts to save "programmer
> time", we've sacrificed an order of magnitude or more of that._

Sounds like a good deal!

~~~
zxcdw
To whom, and in what situations? To businesses (hardware and software), sure:
they get products out faster and are motivated to push new stuff out. To end
users, not so much, because they wait more and their batteries drain faster.
They also pay more for hardware to have a "good experience".

~~~
goldfeld
They also have an order of magnitude more software to choose from, which
tends to drive the cost of software down through competition (though I'll
grant that most of the free models we're stuck with today suck, and I wish
paying at least $10 for good software were still a common thing). More
importantly, a lower barrier to entry for developers allows a much longer tail
of niche applications to be produced, and perhaps more importantly still,
experts and hobbyists from other domains can go and create software using all
the knowledge from their "main" field.

------
dbaupp
Interesting investigation!

I experimented with getting the Rust compiler to vectorise this itself, and
it seems LLVM does a pretty good job automatically; e.g. on my computer
(x86-64), running `rustc -O bytesum.rs` optimises the core of the addition:

    
    
      fn inner(x: &[u8]) -> u8 {
          let mut s = 0;
          for b in x.iter() {
              s += *b;
          }
          s
      }
    

to

    
    
      .LBB0_6:
      	movdqa	%xmm1, %xmm2
      	movdqa	%xmm0, %xmm3
      	movdqu	-16(%rsi), %xmm0
      	movdqu	(%rsi), %xmm1
      	paddb	%xmm3, %xmm0
      	paddb	%xmm2, %xmm1
      	addq	$32, %rsi
      	addq	$-32, %rdi
      	jne	.LBB0_6
    

I can convince clang to automatically vectorize the inner loop in [1] to
equivalent code (by passing -O3), but I can't seem to get GCC to do anything
but a byte-by-byte traversal.

[1]:
[https://github.com/jvns/howcomputer/blob/master/bytesum.c](https://github.com/jvns/howcomputer/blob/master/bytesum.c)

~~~
pbsd
GCC vectorizes the sum fine if the input is _unsigned bytes_ , cf [1]. My
guess is some odd interaction between integer promotions and the backend.

[1] [http://goo.gl/B7KX1V](http://goo.gl/B7KX1V)
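
Presumably the unsigned version looks something like this (my reconstruction,
not necessarily the exact code behind the short link):

    
    
      #include <stddef.h>
      
      /* With unsigned bytes throughout, GCC's autovectorizer reportedly turns
       * this into a paddb loop; the plain/signed char version may not get the
       * same treatment. */
      unsigned char bytesum(const unsigned char *data, size_t n) {
          unsigned char sum = 0;
          for (size_t i = 0; i < n; i++)
              sum += data[i];
          return sum;
      }
    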

~~~
jxf
This link seems to be breaking the HN comments for this page. :)

~~~
dbaupp
It's a shortened link (because the true link is so long, as you discovered),
maybe you have a browser extension that expands them?

~~~
jxf
Ah, you're right, I do (Tactical URL Expander for Chrome)! That'd sure explain
it.

------
cessor
I like the "free" style of the article. Here is another conclusion: In my
professional life I have heard many, many excuses in the name of performance.
"We don't need the third normal form, after all, normalized databases are less
performant, because of the joins". Optimizing for performance should not mean
to make it just as fast as it could possibly run, but to make it just fast
enough.

Julia's article is a good example of this. Of course, the goal appears to be
to develop a feeling for what tends to make a program fast or slow, and for
how slow it will be or how fast it can get; yet I'd like to point out that
this...

[https://github.com/jvns/howcomputer/blob/master/bytesum_intr...](https://github.com/jvns/howcomputer/blob/master/bytesum_intrinsics.c)

... might be 0.1 seconds faster than the original code when started "already
loaded into RAM", which she says runs in 0.6 seconds. Yet this last piece of
code is far more complicated and harder to read. Code like this

Line 11: __m128i vk0 = _mm_set1_epi8(0);

might be idiomatic, fast, and give you a great sense of mastery, but you
can't even pronounce it and its purpose does not become clear in any way.

Writing the code this way may make it faster, but it also makes it 1000x
harder to maintain. I'd rather sacrifice 0.1 seconds of running time and save
3 days of development time instead.

~~~
bithush
Depends whether that 0.1 seconds is going to add up to a significant amount.
Obviously these are just tests, but in a production system running 24/7 that
0.1 seconds per run can add up to a lot. The code might be ugly, but that is
when comments are most important.

~~~
pjc50
Or it might add up to next to nothing, if a full run is eight hours. Or it
might finish early and then be blocked waiting for some other part of the
system.

The moral of "computers are fast" is that guessing about bottlenecks and
twiddling with the lowest level of code is unlikely to help; you need to start
at the top with a profiler, and start asking the questions "do we need to
compute this at all?", "can we make it O(n log n) or better?", and "can we
partition this to scale horizontally?"

~~~
jononor
Not only at the top, but by profiling a task that is relevant to the user. It
is trivial to find some code that could be "sped up", but if speeding it up
does not bring any value to the user, what is the point?

------
chroma
For an in-depth presentation on how we got to this point (cache misses
dominating performance), there's an informative and interesting talk by Cliff
Click called _A Crash Course in Modern Hardware_ :
[http://www.infoq.com/presentations/click-crash-course-
modern...](http://www.infoq.com/presentations/click-crash-course-modern-
hardware)

The talk starts just after 4 minutes in.

------
chpatrick
It's 1.08s on my computer for one line of Python, which is respectable:

    
    
      python2 -m timeit -v -n 1 -s "import numpy" "numpy.memmap('1_gb_file', mode='r').sum()"                                                                                    
      raw times: 1.08 1.09 1.08

~~~
hesselink
Isn't a lot of numpy implemented in C?

~~~
chpatrick
Sure, but the point is that you can get some pretty nice speeds without
writing any C yourself.

~~~
sampo
But only for operations that someone else has already written the C code for.
If you come up with some new algorithm that cannot be reduced to operations
already implemented in C, then you need to write the fast C implementation
yourself.

------
nkurz
1/4 second to plow through 1 GB of memory is certainly fast compared to some
things (like a human reader), but it seems oddly slow relative to what a
modern computer should be capable of. Sure, it's a lot faster than a human,
but that's only 4 GB/s! A number of comments here have mentioned adding some
prefetch statements, but for linear access like this that's usually not going
to help much. The real issue (if I may be so bold) is all the TLB misses.
Let's measure.

Here's the starting point on my test system, an Intel Sandy Bridge E5-1620
with 1600 MHz quad-channel RAM:

    
    
      $ perf stat bytesum 1gb_file
      Size: 1073741824
      The answer is: 4
      Performance counter stats for 'bytesum 1gb_file':
    
      262,315 page-faults         #    1.127 M/sec
      835,999,671 cycles          #    3.593 GHz
      475,721,488 stalled-cycles-frontend   #   56.90% frontend cycles idle
      328,373,783 stalled-cycles-backend    #   39.28% backend  cycles idle
      1,035,850,414 instructions            #    1.24  insns per cycle
      0.232998484 seconds time elapsed
    

Hmm, those 260,000 page-faults don't look good. And we've got 40% idle cycles
on the backend. Let's try switching to 1 GB hugepages to see how much of a
difference it makes:

    
    
      $ perf stat hugepage 1gb_file
      Size: 1073741824
      The answer is: 4
      Performance counter stats for 'hugepage 1gb_file':
    
      132 page-faults               #    0.001 M/sec
      387,061,957 cycles                    #    3.593 GHz
      185,238,423 stalled-cycles-frontend   #   47.86% frontend cycles idle
      87,548,536 stalled-cycles-backend     #   22.62% backend  cycles idle
      805,869,978 instructions              #    2.08  insns per cycle
      0.108025218 seconds time elapsed
    

It's entirely possible that I've done something stupid, but the checksum
comes out right, and the 10 GB/s read speed is getting closer to what I'd
expect for this machine. Using these 1 GB pages for the contents of a file is
a bit tricky, since they need to be allocated off the hugetlbfs filesystem,
which does not allow writes and requires that the pages be allocated at boot
time. My solution was to run one program that creates a shared map and copies
the file in, pause that program, and then have the bytesum program read the
copy that lives in the 1 GB pages.

Now that we've got the page faults out of the way, the prefetch suggestion
becomes more useful:

    
    
      $ perf stat hugepage_prefetch 1gb_file
      Size: 1073741824
      The answer is: 4
    
      Performance counter stats for 'hugepage_prefetch 1gb_file':
      132 page-faults               #    0.002 M/sec
      265,037,039 cycles            #    3.592 GHz
      116,666,382 stalled-cycles-frontend   #   44.02% frontend cycles idle
      34,206,914 stalled-cycles-backend     #   12.91% backend  cycles idle
      579,326,557 instructions              #    2.19  insns per cycle
      0.074032221 seconds time elapsed
    

That gets us up to 14.5 GB/s, which is more reasonable for a single-stream
read on a single core. Based on prior knowledge of this machine, I'm issuing
one prefetch 512B ahead per 128B double-cacheline. Why one per 128B? Because
the hardware "buddy prefetcher" is grabbing two lines at a time. Why do
prefetches help? Because the hardware "stream prefetcher" doesn't know that
it's dealing with 1 GB pages, and otherwise won't prefetch across 4K
boundaries.
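
A sketch of that prefetch placement (not the actual benchmark code, just the
pattern described: one hint 512 bytes ahead for every 128 bytes processed),
here bolted onto a plain SSE2 byte sum:

    
    
      #include <stddef.h>
      #include <emmintrin.h>
      
      /* One prefetch 512 B ahead per 128 B (two cache lines) of data, with
       * the summing itself done 16 bytes at a time via paddb. */
      unsigned char sum_prefetch(const unsigned char *data, size_t n) {
          __m128i acc = _mm_setzero_si128();
          size_t i = 0;
          for (; i + 128 <= n; i += 128) {
              _mm_prefetch((const char *)(data + i + 512), _MM_HINT_T0);
              for (size_t j = 0; j < 128; j += 16)
                  acc = _mm_add_epi8(acc,
                          _mm_loadu_si128((const __m128i *)(data + i + j)));
          }
          unsigned char lanes[16], sum = 0;
          _mm_storeu_si128((__m128i *)lanes, acc);
          for (int k = 0; k < 16; k++) sum += lanes[k];  /* fold the lanes */
          for (; i < n; i++) sum += data[i];             /* tail bytes */
          return sum;
      }
    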

What would it take to speed it up further? I'm not sure. Suggestions (and
independent confirmations or refutations) welcome. The most I've been able to
reach in other circumstances is about 18 GB/s by doing multiple streams with
interleaved reads, which allows the processor to take better advantage of open
RAM banks. The next limiting factor (I think) is the number of line fill
buffers (10 per core) combined with the cache latency in accordance with
Little's Law.

~~~
exDM69
Good stuff! The low numbers (4 GB/s, one tenth of the available memory
bandwidth) that we're seeing sound like a lot of time is wasted in page faults
(and indeed, perf stat confirms it). However, the solution you propose sounds
difficult: you need a special file system and need to know what kind of page
sizes the CPU supports.

Can you think of ways to reduce the number of page faults from inside the
application itself? Or methods that would be portable to architectures with
different page sizes?

I tried a simple call to madvise and posix_fadvise to inform the operating
system ahead of time that I am going to need the memory but that did not have
any effect on the number of page faults.

Any other tips for squeezing some more perf out? Did you happen to do any
cache miss stats on your benchmarks?

~~~
pbsd
You get a similar effect to huge pages by passing MAP_POPULATE to mmap, which
pre-faults all pages itself before returning.
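
A sketch of that (MAP_POPULATE is Linux-specific; error handling omitted):

    
    
      #include <fcntl.h>
      #include <unistd.h>
      #include <sys/mman.h>
      #include <sys/stat.h>
      
      /* Map a file read-only and pre-fault every page up front, instead of
       * taking one minor fault per 4 KB page during the scan itself. */
      static unsigned char *map_populated(const char *path, size_t *size) {
          int fd = open(path, O_RDONLY);
          struct stat st;
          fstat(fd, &st);
          *size = st.st_size;
          void *p = mmap(NULL, *size, PROT_READ,
                         MAP_PRIVATE | MAP_POPULATE, fd, 0);
          close(fd);            /* the mapping stays valid after close */
          return p;
      }
    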

~~~
nkurz
What do you think MAP_POPULATE is actually doing here? Unless it's changing
the page size, I don't see how it would be significantly reducing the number
of TLB misses. Is it perhaps doing the preloading in a separate thread on a
different core? And the timing happens to work out so that the L3 cache is
getting filled at the same rate it's being drained?

~~~
exDM69
> What do you think MAP_POPULATE is actually doing here? Unless it's changing
> the page size, I don't see how it would be significantly reducing the number
> of TLB misses.

I think that MAP_POPULATE here will fill the page table with entries rather
than leaving the page table empty and letting the CPU fault (almost) every
time a new page is accessed. That would be about 200k fewer interrupts for a
1 GB file.

MAP_POPULATE will probably also do the whole disk read in one go rather than
in a lazy+speculative manner.

Page size is probably not affected and neither is the number of TLB misses. I
did find in my testing that the size of the file (and the mapping) will affect
the page size: a 4 GB file had significantly fewer page fault interrupts than
a 500 MB file.

And obviously, MAP_POPULATE is bad if physical memory is getting exhausted.

~~~
nkurz
I came across this link, which helped me understand the process a bit better:
[http://kolbusa.livejournal.com/108622.html](http://kolbusa.livejournal.com/108622.html).
So yes, the main savings seems to be that the page table is created in a tight
loop rather than ad hoc. Given the number of pages in the scan, it's still
going to be a TLB miss for each page, but it will be just a lookup (no context
switch).

 _in my testing that the size of the file (and the mapping) will affect the
page size_

I'm doubtful of this, although it might depend on how you have "transparent
huge pages" configured. But even then, I don't think Linux currently supports
huge pages for file-backed memory. I think something else might be happening
that causes the difference you see. Maybe just the fact that the active TLB
can no longer fit in L1?

 _And obviously, MAP_POPULATE is bad if physical memory is getting exhausted._

I'm confused by this, but it does appear to be the case. It seems strange to
me that MAP_POPULATE|MAP_NONBLOCK is no longer possible. I was slow to realize
this may be closely related to Linus's recent post:
[https://plus.google.com/+LinusTorvalds/posts/YDKRFDwHwr6](https://plus.google.com/+LinusTorvalds/posts/YDKRFDwHwr6)

------
mrb
The author's SSE code is a _terribly_ overcomplicated way of summing up every
byte. The code uses PMADDWD (a multiply and add?!), and strangely tries to
interleave hardcoded 0s and 1s into registers with PUNPCKHBW/PUNPCKLBW, huh?

All the author needs is PADDB (add packed bytes).
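
A sketch of what the whole thing reduces to with intrinsics (mine, not the
pastebin in the reply below; compilers turn the inner loop into a single
paddb per 16 bytes):

    
    
      #include <stddef.h>
      #include <emmintrin.h>
      
      /* Sum bytes mod 256: one paddb per 16 bytes, then fold the 16 lanes.
       * No widening, no pmaddwd, no unpacking needed. */
      unsigned char sum_paddb(const unsigned char *data, size_t n) {
          __m128i acc = _mm_setzero_si128();
          size_t i = 0;
          for (; i + 16 <= n; i += 16)
              acc = _mm_add_epi8(acc,
                      _mm_loadu_si128((const __m128i *)(data + i)));
          unsigned char lanes[16], sum = 0;
          _mm_storeu_si128((__m128i *)lanes, acc);
          for (int k = 0; k < 16; k++) sum += lanes[k];
          for (; i < n; i++) sum += data[i];   /* leftover tail bytes */
          return sum;
      }
    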

~~~
mrb
Here is how it should be done with PADDB:
[http://pastebin.com/MY9tENpW](http://pastebin.com/MY9tENpW) This is 20%
faster than the author's version on my computer: 0.210 sec vs. 0.260 sec to
process 1GiB. The tight loop is simple:

    
    
      400710:	66 0f fc 04 07       	paddb  (%rdi,%rax,1),%xmm0
      400715:	48 83 c0 10          	add    $0x10,%rax
      400719:	48 39 c6             	cmp    %rax,%rsi
      40071c:	77 f2                	ja     400710 <sum_array+0x10>
    

Compare this to the author's complex version:

    
    
      400720:	66 0f 6f 14 07       	movdqa (%rdi,%rax,1),%xmm2
      400725:	48 83 c0 10          	add    $0x10,%rax
      400729:	48 39 c6             	cmp    %rax,%rsi
      40072c:	66 0f 6f c2          	movdqa %xmm2,%xmm0
      400730:	66 0f 68 d4          	punpckhbw %xmm4,%xmm2
      400734:	66 0f 60 c4          	punpcklbw %xmm4,%xmm0
      400738:	66 0f f5 d1          	pmaddwd %xmm1,%xmm2
      40073c:	66 0f f5 c1          	pmaddwd %xmm1,%xmm0
      400740:	66 0f fe c3          	paddd  %xmm3,%xmm0
      400744:	66 0f fe c2          	paddd  %xmm2,%xmm0
      400748:	66 0f 6f d8          	movdqa %xmm0,%xmm3
      40074c:	77 d2                	ja     400720 <sum_array+0x20>

------
ChuckMcM
Nice. I remember the first time I really internalized how fast computers
were, even when people claimed they were slow. At the time I had a "slow" 133
MHz machine, but we kept finding things it was doing that it didn't need to,
and by the time we had worked through those, it was idling a lot while doing
our task.

The interesting observation is that computers got so fast so quickly that
software became wasteful and inefficient. Why optimize when you can just throw
CPU cycles or memory at the problem? What made that observation interesting
for me was that it suggested the next 'era' of computers, after Moore's law
stopped, was going to be about who could erase that sort of inefficiency the
fastest.

I expect there won't be as much time in the second phase, and at the end
you'll have approached some sort of limit of compute efficiency.

And hats off for perf, that is a really cool tool.

------
bane
It's pretty clear that we're wasting unbelievably huge amounts of computing
power with the towering stacks of abstraction we keep building.

So let's make this interesting: assuming a ground-up rewrite of an entire,
highly optimized web application stack - from the metal on up - how many of
today's boxes full of server hardware could really be handled by just one?
Two? A dozen?

I'd be willing to bet that a modern machine with well-written, on-the-metal
software could outperform a regular rack full of the same machines running all
the nonsense we run today.

Magnified over the entire industry, how much power and space are being wasted?
What's the dollar amount on that?

What's the developer difference to accomplish this? 30% time?

What costs more? All the costs of potentially millions of wasted machines,
power, and cooling, or millions of man-hours writing better code?

~~~
chadgeidel
I'll take that bet.

We gain a lot with "the huge stack of abstraction". The OS gives us a ton of
"goodies" for free (memory abstraction, process/thread safety, networking,
etc) and the languages/libraries give us more goodies (the ability to focus on
higher-level tasks rather than "bit twiddling"). One could also point to the
fact that your team is taking advantage of hundreds (thousands?) of domain
experts in every aspect of the stack to get the "best" solution.

I would argue that it's not "30% time". It's the accumulated time of each
level of abstraction you are using combined. It is very likely the case that
one team couldn't rewrite the web application stack "from the metal up" and
offer significant improvements.

~~~
bane
Still, it's pretty shocking that billions of people might have to wait a
noticeable amount of time, often measurable in seconds, for a web page to go
from request to final render. We're doing something wrong.

Servers and desktops are impressively more powerful today than when this [1]
was first created, which loads almost as fast as I can take my finger off of
the mouse button that clicked the link to take me there

1 -
[http://info.cern.ch/hypertext/WWW/TheProject.html](http://info.cern.ch/hypertext/WWW/TheProject.html)

yet I was able to actually count "one onethousand two onethousand three
onethousand" before www.youtube.com finished rendering. I don't know, is 2.xx
seconds times a billion people more efficient than a few hundred developers
spending 2x or 3x longer to write efficient code?

~~~
chadgeidel
I would say that it's a combination of both. I personally don't automatically
assume the delay when clicking a link is due to the number of abstractions
between my mouse and the server-side hardware that a website is running on.

In the YouTube example, you have very real speed-of-light constraints. I
fired up the Chrome debugger and loaded the main site. I had over 100 requests
in the first couple of seconds. Even with a very low-latency connection,
assuming that your browser can "batch" the requests together and the server
responds instantaneously, there is still an overhead of at least tens of
milliseconds for each request (or group of requests).

Reducing that time requires reducing the number of requests, parallelizing or
delaying the loading of "ancillary data", assuming that javascript/images are
already loaded, etc. All of which has nothing to do with the speed of the
server or the client.

------
infogulch
Nice writeup! I like how even simplistic approaches to performance can easily
show clear differences! However! I noticed you use many (many!) exclamation
points! It gave me the impression that you used one too many caffeine patches!
[1]

[1]:
[https://www.youtube.com/watch?v=UR4DzHo5hz8](https://www.youtube.com/watch?v=UR4DzHo5hz8)

~~~
bithush
Shows how excited they were. I always like to see blog posts where the poster
is excited about the post!

------
zokier
> So I think that means that it spends 32% of its time accessing RAM, and the
> other 68% of its time doing calculations

I'm not sure you can actually draw such a conclusion, because of pipelining
etc. I'd assume that the CPU is doing memory transfers simultaneously while
doing the calculations.

I also think that only the first movdqa instruction is accessing RAM, the
others are shuffling data from one register to another inside the CPU. I'd
venture a _guess_ that the last movdqa is shown taking so much time because of
a pipeline stall. That would probably be the first place I'd look for further
optimization.

On the other hand, I don't have a clue about assembly programming or low-level
optimization, so take my comments with a chunk of salt.

~~~
userbinator
It spends the other 68% of its time doing calculations _and waiting for the
results of the memory accesses_.

I personally don't like using per-instruction timings like the one presented
in the article to measure performance; on a pipelined, superscalar/out-of-
order CPU, the fact that one instruction takes a long time to execute doesn't
matter so much since others can execute "around it" if they don't depend on
it.

On the other hand, macrobenchmarks like the timing of the execution of a whole
program, _are_ very useful.

 _I timed it, and it took 0.5 seconds!!! [...] So our program now runs twice
as fast_

0.5s down from 0.6s is not "twice as fast", it is a 16.67% improvement. I'm
not sure where the 0.25s was pulled from.

~~~
zokier
> It spends the other 68% of its time doing calculations and waiting for the
> results of the memory accesses.

It is a bit odd though: the code is processing data at about 4 GB/s, and a
modern system should have a lot more RAM bandwidth (e.g. about 10 GB/s for
DDR3-1333). It feels like there should still be significant room for
optimization.

> 0.5s down from 0.6s is not "twice as fast", it is a 16.67% improvement. I'm
> not sure where the 0.25s was pulled from.

I think the 0.5 was a typo (a missing '2') and should instead be 0.25.

~~~
sanxiyn
Yes, there is room for optimization. It seems to me bytesum_intrinsics.c
doesn't saturate RAM bandwidth because the loop is not unrolled. You should
unroll the loop in addition to using SIMD so that SIMD loads complete while
SIMD adds are done, hiding memory latency. Otherwise you wait on loads.
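
Something like this, as a sketch of the idea (two independent accumulators so
one paddb is not waiting on the register the other is still loading into; not
code from the post):

    
    
      #include <stddef.h>
      #include <emmintrin.h>
      
      /* 2x-unrolled byte sum with independent accumulators, so the load
       * feeding one add can complete while the other add executes. */
      unsigned char sum_unrolled(const unsigned char *data, size_t n) {
          __m128i a0 = _mm_setzero_si128(), a1 = _mm_setzero_si128();
          size_t i = 0;
          for (; i + 32 <= n; i += 32) {
              a0 = _mm_add_epi8(a0,
                      _mm_loadu_si128((const __m128i *)(data + i)));
              a1 = _mm_add_epi8(a1,
                      _mm_loadu_si128((const __m128i *)(data + i + 16)));
          }
          a0 = _mm_add_epi8(a0, a1);
          unsigned char lanes[16], sum = 0;
          _mm_storeu_si128((__m128i *)lanes, a0);
          for (int k = 0; k < 16; k++) sum += lanes[k];
          for (; i < n; i++) sum += data[i];
          return sum;
      }
    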

~~~
userbinator
_so that SIMD loads complete while SIMD adds are done_

That's almost certainly already being done by the hardware. Loop unrolling has
been counterproductive ever since ~Sandy Bridge or so, and probably hasn't
been that great of an idea since the post-P4 days:

[http://www.agner.org/optimize/blog/read.php?i=142#142](http://www.agner.org/optimize/blog/read.php?i=142#142)

 _It is so important to economize the use of the micro-op cache that I would
give the advice never to unroll loops._

~~~
sanxiyn
This is great advice, but I think it doesn't apply here. If code normally
fits in the uop cache but unrolled code doesn't, you are punished. But here
both the normal code and the unrolled code fit in the uop cache.

Loop unrolling may be counterproductive, but loop unrolling for vectorization
almost certainly isn't, even on current CPUs. It really can't be done by the
hardware, and all compilers unroll loops for vectorization, even if you
disable loop unrolling in general.

~~~
nkurz
You are correct that both fit in the uop cache, but I think 'userbinator' is
correct that loop unrolling is of very limited benefit here (and almost
everywhere else). Because of speculative execution, processors are basically
blind to loops. They will keep executing ahead along the most likely path,
even if this means they are simultaneously executing different iterations of
a loop. The loads will pre-execute just fine across the loop boundary. While
there are cases where reducing the amount of loop logic will speed things up,
it really can be done in hardware!

------
sanxiyn
I wonder why GCC does not autovectorize the loop in bytesum.c even with
-Ofast. With the autovectorizer, GCC should make the plain loop as fast as the
SIMD intrinsics. The autovectorizer can't handle complex cases, but this is
about as simple as it gets.

Anyone have ideas?

~~~
jokoon
I don't think compilers are able to know when to use SIMD.

Isn't vectorization different from SIMD?

~~~
dbaupp
No, vectorization is converting scalar operations (i.e. one piece of data at a
time, e.g. summing one byte at a time) to vector ones (operating on multiple
pieces of data at once, e.g. summing 16 bytes in parallel), and these
operations are exactly what SIMD is.

------
enjoy-your-stay
The first time I realised how fast computers _could_ be was when I first
booted up BeOS on my old single-core AMD machine, probably clocked below 1
GHz.

The thing booted in less than 10 seconds and did everything quickly and
smoothly - compiling code, loading files, playing media and browsing the web
(on a dial-up modem back then).

It performed so unbelievably well compared to the Windows and even the Linux
of the day that it made me wonder what the other OSes were doing differently.

Now my 4-core, SSD-equipped MacBook Pro has the same feeling of raw
performance, but it took a lot of hardware to get there.

------
userbinator
One of the things I've always wanted is autovectorisation by the CPU - imagine
if there was a REP ADDSB/W/D/Q instruction (and naturally, repeated variants
of the other ALU operations.) It could make use of the full memory bandwidth
of any processor by reading and summing entire cache lines the fastest way the
current microarchitecture can, and it'd also be future-proof in that future
models may make this faster if they e.g. introduce a wider memory bus. Before
the various versions of SSE there was MMX, and now AVX, so the fastest way to
do something like sum bytes in memory changes with each processor model; but
with autovectorisation in hardware, programs wouldn't need to be recompiled to
take advantage of things like wider buses.

Of course, the reason why "string ALU instructions" haven't been present may
just be because most programs wouldn't need them and only some would receive a
huge performance boost, but then again, the same could be said for the AES
extensions and various other special-purpose instructions like CRC32...

~~~
TheLoneWolfling
This is why I'm in favor of client-side compilation.

Devs compile to an intermediate language or bytecode of some sort, which then
gets compiled client-side on installation / first use / update of the client
runtime. Kind of like Java, but instead of being JITted (which causes
inconsistent performance) it's compiled ahead of time and cached.

That way things can be optimized for your specific architecture / computer.

Of course, you can actually do this at the kernel level.

------
thegeomaster
Anyone notice how the author is all excited? Got me in a good mood, reading
this.

------
tejbirwason
Great post. If you want to dig even deeper, you can learn certain nuances of
the underlying assembly language: loop unrolling, reducing the number of
memory accesses, reducing the number of branch instructions per loop iteration
by rewriting the loop, and rearranging instructions or register usage to
reduce the dependencies between instructions.

I took a CPSC course last year, and for one of the labs we improved the
performance of the fread and fwrite C library calls by playing with the
underlying assembly. We maintained a leaderboard with the fastest times
achieved, and it was a lot of fun to gain insight into the low-level mechanics
of system calls.

I dug up the link to the lab description -
[http://www.ugrad.cs.ubc.ca/~cs261/2013w2/labs/lab4.html](http://www.ugrad.cs.ubc.ca/~cs261/2013w2/labs/lab4.html)

------
cgag
The rest of her blog is great as well; I really like her stuff about OS
development with Rust.

------
okso
Naïve Python3 is not as fast as Numpy, but pretty elegant:

    
    
      def main(filename):
          d = open(filename, 'rb').read()
          result = sum(d) % 256
          print("The answer is: ", result)

~~~
jdiez17
Note that sum(d) will generate a huge number, possibly using lots of memory
and processing power. A better option would be:

    
    
        from functools import reduce  # needed on Python 3
        def main(filename):
          d = open(filename, 'rb').read()
          result = reduce(lambda i, j: (i + j) % 256, d)
          print("The answer is: ", result)
    

Note how this is similar to the squaring algorithm used in cryptography:
[http://en.wikipedia.org/wiki/Exponentiation_by_squaring](http://en.wikipedia.org/wiki/Exponentiation_by_squaring)

~~~
okso
That approach would be better suited to a low-level language such as C, where
there is a risk of overflow.

Python handles long integers transparently, and the overhead of calling the
lambda function and managing the additional variables in Python is probably
much more time-consuming than handling a long integer.

------
hyp0

      I timed it, and it took 0.5 seconds!!!
      So our program now runs twice as fast,
    

minor typo above: time is later stated as 0.25. super neat!

~~~
Udo
I wrote a comment pointing that out on the author's blog, but it appears to
have since been deleted :/

------
sjtrny
But not fast enough

