

How I Got a 2x Speedup With One Line of Code - naftaliharris
http://www.naftaliharris.com/blog/2x-speedup-with-one-line-of-code/

======
chrismorgan
I'm sad to see that the one line of code, with its magic number and
comparatively inscrutable contents, is entirely undocumented. Please, if doing
things like that: a good commit message is necessary but not sufficient. (Yes,
`git blame` exists, but you're not so likely to use it as a comment in the
code.) A blog post is handy but not sufficient. A good inline comment is
invaluable when returning to the code later if the situation changes, as it
typically will in time for such precise, empirically derived performance
tweaks.

    
    
       // This single line boosts performance by a factor of around 2 on GCC.
       // The optimal lookahead distance i+3 was chosen by experimentation.
       // See http://www.naftaliharris.com/blog/2x-speedup-with-one-line-of-code/

~~~
thejosh
Referencing a URL in documentation, especially a blog post that could die at
any time, isn't good.

~~~
chrismorgan
I agree with your general sentiment and was initially inclined to make the
comment more verbose and not include the URL; I would certainly not be content
with _just_ a URL. Still, in this case I would consider it a fair compromise,
primarily because the inclusion of the URL is non-essential but value-adding.
That is, the two important points—the purpose of the code and the reason for
the number 3—are captured by the lines above, while the blog post adds more
information which could potentially be of use later.
Furthermore, the author of the code is the author of the blog post, which
increases the probability of its survival.

~~~
chris_wot
I'd put the detailed explanation into the commit message. Or you could add it
to the README, or make a NOTES file and explain it there.

------
susi22
You can get another 2x speedup if you choose the pivot better. Right now it
seems you're doing median of 3.

When you're doing order statistics (i.e. selection/quickselect) with 1M
elements you should be very, very close to the theoretical optimum (1.5N) in
terms of complexity. See my comment here:

[https://news.ycombinator.com/item?id=6629117](https://news.ycombinator.com/item?id=6629117)

If you're interested in choosing the best strategy.

HTH

~~~
naftaliharris
Neat! A quick skim of the referenced article in that comment
([http://ieeexplore.ieee.org/xpls/icp.jsp?arnumber=6193457](http://ieeexplore.ieee.org/xpls/icp.jsp?arnumber=6193457))
shows that it solves a problem I'd wondered about for a while now, about how
to optimally choose the pivots when computing order stats. (Yeah, I'm just
doing median of 3 right now). If you want to sort the whole list, then clearly
you want the pivots to be as near to the median as possible. But if you just
want an order stat, then that's not necessarily true; if you wanted, say, the
10th percentile, my gut feeling would be that you should aim for your first
pivot to be roughly the 20th percentile or so. I'll read that article more
closely and see if I can improve even more.
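
To make that intuition concrete, here's a rough sketch (not the lazysort code,
and the referenced paper presumably derives the truly optimal strategy) of
biasing a quickselect pivot toward the target rank k, with plain doubles
standing in for the real PyObject pointers:

    
    
        /* Hypothetical sketch: pick a quickselect pivot for a[0..n-1] when
         * we want the element of rank k, by indexing a small sorted sample
         * proportionally to k/n instead of always taking its median. */
        #include <stdlib.h>
        
        static int cmp_double(const void *a, const void *b)
        {
            double x = *(const double *)a, y = *(const double *)b;
            return (x > y) - (x < y);
        }
        
        static double biased_pivot(const double *a, size_t n, size_t k)
        {
            enum { S = 9 };                 /* small sample size, arbitrary */
            double sample[S];
            for (int i = 0; i < S; i++)
                sample[i] = a[(size_t)rand() % n];
            qsort(sample, S, sizeof sample[0], cmp_double);
        
            /* Median-of-sample would be sample[S/2]; a 10th-percentile query
             * instead aims near the low end of the sample. */
            size_t idx = (size_t)((double)k / (double)n * (S - 1) + 0.5);
            return sample[idx];
        }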

Thanks, that's really helpful!

------
akkartik
Great article. But it seems somehow unfair to characterize this as a 'one-line
change'. He had to run various performance experiments that will be invisible
to someone else reading the code, and the optimal prefetch distance might
silently shift over time, like say if each object grows larger. It's a little
like the Carmack magic number in that respect:
[https://en.wikipedia.org/wiki/Fast_inverse_square_root](https://en.wikipedia.org/wiki/Fast_inverse_square_root)

I've had this idea for some time that our _representation_ of programs is
fundamentally broken, because it excludes a detailed characterization of the
_input_ space. I took a first stab at a solution with
[http://akkartik.name/post/tracing-tests](http://akkartik.name/post/tracing-tests),
which lets me write all my unit tests at the top level, showing how
each piece of functionality augments the space of inputs my program can
address. See, for example,
[https://github.com/akkartik/wart/blob/3039fe6d03/literate/02...](https://github.com/akkartik/wart/blob/3039fe6d03/literate/023tokenize#L83).
I suspect reading other people's 'code' could be an order of magnitude easier
and more convenient if 'code' included the right information. I'm still
working to flesh this idea out.

~~~
kabouseng
>>It's a little like the Carmack magic number in that respect:
[https://en.wikipedia.org/wiki/Fast_inverse_square_root](https://en.wikipedia.org/wiki/Fast_inverse_square_root)

Thanks, that was a great read :D

------
liyanchang
> No way, substantial speedups can really only come from algorithm changes.

I've often found the reverse to be true: that implementation details matter
most.

The best fixes tend to be altering my code so that the compiler/CPU can be
smarter. In this case, the author hinted to the cache. Other times, I've found
that looping in a different order helps my cache hit rate (see the sketch at
the end of this comment). Other times, properly naming variables and not
reusing the same name allows the compiler to do smarter things.

Other times, when I've tried a new algorithm, it's slower because of bad
constants, or because I've accidentally messed up the implementation enough to
negate all the benefits.

As a complete side note, you need to be careful when using "older" algorithms,
i.e. those published pre-2000. I once implemented, based on a Hoare paper, a
doubly linked list for sparse matrices, and it was significantly slower than
the naive method, simply because CPU caches have gotten really good at
sequential access while predicting linked-list memory locations has not.
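
To make the loop-order point above concrete, here's a toy sketch (mine, not
from the article): both functions compute the same sum, but the row-major
version walks memory sequentially while the column-major version strides by a
whole row and, for large N, misses the cache far more often:

    
    
        #include <stddef.h>
        
        #define N 4096
        
        /* Cache-friendly: consecutive addresses, so cache lines and the
         * hardware prefetcher do all the work. */
        double sum_row_major(double a[N][N])
        {
            double s = 0.0;
            for (size_t i = 0; i < N; i++)
                for (size_t j = 0; j < N; j++)
                    s += a[i][j];
            return s;
        }
        
        /* Same arithmetic, but each access strides N*8 bytes, so nearly
         * every load touches a new cache line. */
        double sum_col_major(double a[N][N])
        {
            double s = 0.0;
            for (size_t j = 0; j < N; j++)
                for (size_t i = 0; i < N; i++)
                    s += a[i][j];
            return s;
        }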

~~~
pkolaczk
This is all true, but I've never seen cache optimizations like that cause more
than a 2x difference on average, and 5x in very rare extreme cases. This is
very far from what you can get from algorithmic big-O changes, where it is
easy to get 1000x speedups.

~~~
liyanchang
That's true. Having been briefly in academia, I used to evaluate algorithms
based only on big-O, and just a bit of practical experience taught me that
there tends to be a sweet spot, where the O(n^2.7) algorithm is both much
faster than your naive implementation and simpler to code, with better
constants and better cache behaviour than the O(n^2.7003) one.

------
nly
I don't think this gain came from pre-fetching anything. The compiler may
simply have decided to treat the entire function, or even the entire file,
differently because of that one line.

I compiled (x86, GCC 4.8) lazysort.c both with and without the prefetch
intrinsic, and there were substantial differences in the code surrounding
calls into Python, but no change at all to the guts of the partition or
sorting routines.

Regardless of what is responsible, it's not good to leave something like this
in your code without really understanding what it did. Something else may have
occurred incidentally from this change that could have led you to a better
solution.

~~~
deweerdt
> I don't think this gain came from pre-fetching anything.

It may have, and it's possible that it doesn't show up on some CPUs. See this
thread for an example of how difficult it is to find a good generic
prefetching size:
[https://lkml.org/lkml/2013/10/11/534](https://lkml.org/lkml/2013/10/11/534)

------
sp332
There are only two really difficult problems in computer science: naming
things, cache invalidation, and off-by-one errors.

~~~
sillysaurus2
[http://i.imgur.com/LfqyX3p.gif](http://i.imgur.com/LfqyX3p.gif)

~~~
ramchip
I don't get it... that's a cellular automaton joke?

~~~
adamnemecek
Getting the joke requires understanding the Game of Life and this meme:

[http://knowyourmeme.com/memes/boardroom-suggestion](http://knowyourmeme.com/memes/boardroom-suggestion)

------
wffurr
>> The moral of the story, at least for me, is that you can occasionally get
substantial improvements by understanding what is actually happening under the
hood of your application, rather than fundamentally changing your application
itself.

Also known as the Law of Leaky Abstractions: all abstractions are leaky.

~~~
Peaker
Is parametric polymorphism leaky?

~~~
wffurr
Method resolution order.

~~~
Peaker
I said _parametric_ polymorphism. That doesn't involve any method resolution
order.

~~~
wffurr
OK - different usage of "abstraction".

If you are talking about abstraction as in factoring similar bits of code out
into one place, whether via polymorphism or extracting functions, then no,
that is not a "leaky abstraction".

I am talking about abstraction as in creating a simplified / unified interface
over some complex thing. Abstracting the idea of something from its concrete
realizations into its own concept. OSI networking layers, drivers for
hardware, modeling real world systems, etc.

If you want to be pedantic instead of snappy, you can change my statement to
"all abstraction layers are leaky".

------
Scaevolus
Note that `ob_item[i+3]` could read past the end of the array.

~~~
jzwinck
This jumped out at me too. GCC's documentation says:

Data prefetch does not generate faults if addr is invalid, but the address
expression itself must be valid. For example, a prefetch of p->next does not
fault if p->next is not a valid address, but evaluation faults if p is not a
valid address.

This seems to make the code as presented incorrect (i.e. it results in
undefined behavior).
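
A minimal sketch of how the evaluation itself could be kept in bounds (the
loop shape and the name n are stand-ins, not the actual lazysort code):

    
    
        #include <stddef.h>
        
        /* Guard the array read itself: the address passed to
         * __builtin_prefetch may point anywhere without faulting, but
         * evaluating ob_item[i + 3] past the end of the array is the part
         * that would be undefined behavior. */
        static void partition_like(void **ob_item, size_t n)
        {
            for (size_t i = 0; i < n; i++) {
                if (i + 3 < n)
                    __builtin_prefetch(ob_item[i + 3]);
                /* ... compare ob_item[i] against the pivot, swap, etc. ... */
            }
        }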

------
AznHisoka
I got a 1000 X speedup by removing a sleep(100000) line once. Not as
impressive though.

~~~
wiredfool
I got a 2GB reduction in memory usage and an orders-of-magnitude speedup by
adding an _ in one line once.

~~~
Dylan16807
Care to explain that? I don't get it.

~~~
wiredfool
It was a long, involved bug chase, but some Django middleware had an optional
cache of some S3 information. The actual variable was self._something, with an
@property on self.something so that the getter would backfill the metadata if
it didn't exist.

In the code, it was checking if self.something existed, and then doing
something else if it didn't. This triggered the lazy getter, and had the
effect of pulling in the metadata for a 300k-item S3 bucket.

Checking for self._something instead correctly returned None; the code then
checked S3 for the item and uploaded it directly.

------
Someone
I would expect that that magic number 3 is related to the size of a cache line
relative to the size of an array element and the time spent in each loop. You
will want to prefetch the next cache line so that it is just available by the
time you want to start processing it (and, ideally, not a moment earlier
because that would push data from the cache that could still be used before
the requested data is needed).

A robust implementation should figure out the size of a cache line and the
load delay, and derive the magic offset from that.
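
Something along these lines, perhaps (a guess at the shape, not code from the
article; _SC_LEVEL1_DCACHE_LINESIZE is Linux/glibc-specific and the cycle
counts are placeholders you would have to measure):

    
    
        #include <stddef.h>
        #include <unistd.h>
        
        /* Sketch of deriving the prefetch offset instead of hard-coding 3:
         * issue the prefetch roughly (miss latency / work per iteration)
         * elements ahead so the cache line arrives just as it is needed. */
        static long prefetch_distance(size_t elem_size)
        {
            long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE); /* bytes per L1 line */
            if (line <= 0)
                line = 64;                   /* common fallback on x86 */
        
            long miss_latency  = 200;        /* cycles for a DRAM fetch (placeholder) */
            long work_per_iter = 70;         /* cycles per partition step (placeholder) */
            long dist = (miss_latency + work_per_iter - 1) / work_per_iter;
        
            /* For contiguous elements smaller than a line, a distance shorter
             * than line/elem_size would mostly prefetch the line you are
             * already using, so take at least one full line's worth. */
            long per_line = line / (long)(elem_size ? elem_size : 1);
            return dist > per_line ? dist : per_line;
        }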

------
temujin
Er, are you aware of the C++ standard library's std::nth_element() function?
(Even my "pure C" programs tend to be compiled with g++ to gain access to that
and std::sort().)

~~~
nly
Yes, this is exactly what the author is trying to port to Python.

Mandatory reference link:
[http://en.cppreference.com/w/cpp/algorithm/nth_element](http://en.cppreference.com/w/cpp/algorithm/nth_element)
. Note linear complexity.

There's also std::partial_sort:

[http://en.cppreference.com/w/cpp/algorithm/partial_sort](http://en.cppreference.com/w/cpp/algorithm/partial_sort)

------
Amadou
_Now, I didn't just "know" to prefetch the PyObject three indices forward in
the array. I arrived at this value through experimentation._

There is a good chance that 3 is specific to the hardware you tested on.
Different systems will have different memory latencies (and NUMA systems can
make it even more complicated). It isn't likely that 3 will turn into a
pathological case on any other hardware, but it may well end up being less
optimal than say 2 or 4.

------
m_mueller
This sort of thing is why porting naive x86 code to the GPU programming model
(scalar programs aka 'kernels' streamed through the device using a
configuration) often results in an amazing 15-20x speedup over 6 cores, while
the expected speedup is only 5-7x. x86 compilers often don't get the
prefetching right in loops, relying only on their heuristics, so you have to
talk to them. Then you begin to doubt the compiler more and more, carving out
a few percentage points here and there by adding vector intrinsics, loop
unrolling and so on. And before you notice it, your codebase is bigger, more
complex and more error-prone than if you had ported it to the GPU from the
beginning.

Please note - of course this insight doesn't apply in your case, since your
algorithm is meant for general-purpose usage, not HPC programming for systems
where you have the hardware under your control. I'm simply urging people to
whom the latter applies to think about whether their algorithms might be
suitable for the GPU instead - adding lots of prefetch instructions and loop
unrolls is a sign of that.

------
gridspy
Nice article.

One thing that wasn't clear to me in your explanation is that you're sorting a
list of pointers to objects. To sort that list you're comparing two of the
objects themselves.

    
    
        // follows pointer to access *ob_item[i]
        IF_LESS_THAN(ob_item[i], pivot) 
    

So the point of __builtin_prefetch is to dereference this pointer in advance
and so avoid the latency of the 12 million reads scattered all over memory.
Nice.

Another useful thing to do here is to see if you can find a "streaming
operator" to dismiss *ob_item[i] and ob_item[i] from cache after the
comparison. They won't be needed again for a while.
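
For instance, GCC's __builtin_prefetch accepts an optional third "locality"
argument in that spirit; a sketch, with ob_item and i standing in for the real
loop variables:

    
    
        /* __builtin_prefetch(addr, rw, locality): locality 0 means the data
         * has no temporal locality, i.e. it need not be kept in cache after
         * use.  Whether this actually helps here would have to be measured. */
        static inline void prefetch_streaming(void **ob_item, long i)
        {
            __builtin_prefetch(ob_item[i + 3], 0 /* read */, 0 /* non-temporal */);
        }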

Another good article on this optimisation (mentioned by OP):

[http://scripts.mit.edu/~birge/blog/accelerating-code-using-g...](http://scripts.mit.edu/~birge/blog/accelerating-code-using-gccs-prefetch-extension/)

------
minimax
This is a great post. I guess the idea here is that you can start bringing
future objects to be compared up through the cache hierarchy while you are
doing the comparison on the current object. If that's the case, then I think
the speedup from prefetching will depend on the speed of the comparison which
in turn will depend on the type of objects in the collection.

In your post it says you have a collection of 10MM elements but you didn't say
what they were. Are they just ints or something else?

~~~
naftaliharris
Thanks!

Oh, thanks for pointing out the oversight! They were floats generated with
random.random() and then shuffled with random.shuffle(). Shuffling them
removes any locality that might have arisen when the floats were being
generated, even though the distribution remains uniform on (0, 1). (And
indeed, empirically the code runs faster if you don't shuffle them.)

Yeah, I'm not sure how well this would work if you were sorting different
objects. I would guess that it depends on two things--how large the objects
are in memory, and how long it takes to compare them.

~~~
bmm6o
I was thinking about that when I read the article too: you've obviously done a
lot of testing, but the runtime's allocator algorithm is going to have an
effect on this, and allocating N floats in a loop will have different
performance than if the floats were allocated over time (with other
allocations mixed in) or if the objects were larger. Especially if the
comparison function requires accessing object members from other memory
locations... The magic value 3 might have a lot to do with the number of
objects that fit in the cache or a page.

Anyway, it's an interesting article and an interesting library.

~~~
osmala
The prefetch will always pull in the same amount of data no matter what the
size of the Python object in question is. That leads to two results: 1) You
can miss if your cache line size is less than the Python object size. 2) You
can miss if your Python object is not aligned to a cache line and the
comparison touches the part that sits in the next cache line.

If these two conditions apply then there is a limit of 2x on the speedup, and
increasing the Python object size reduces the effect of the prefetch, though
it still helps.

Another thing: for the prefetch distance I'd pick 5 based on his testing data.
1) It has no downside in the current tests. 2) Because the prefetch is a
constant size, any prefetch that hurts because you picked a distance of 5
instead of 3 is unlikely to have any meaningful effect at any prefetch
distance. 3) It allows some improvement in comparison speed, e.g. if there
were a JIT on the Python side to specialize the test. 4) It allows a nice
factor of improvement in core execution speed relative to memory latency. 5)
It lets you handle page misses better.

~~~
bmm6o
The prefetch may take the same amount of time, but the number of objects that
it pulls into cache depends on their size and allocation pattern. If they are
large or were allocated over time (= from different memory blocks) fewer will
be pulled in. It seems to me that this would make prefetch less effective.
That is, if the thing you are trying to prefetch was pulled in because it was
close enough to the last one, prefetch is a no-op.

------
Nimi
Funny, I just experimented with __builtin_prefetch this week, and got no
speedup. Does anyone know whether the kernel list.h always triggers hardware
prefetching when going over a list?

BTW, my case wasn't the problem the kernel maintainers encountered with small
lists, detailed here:

[http://lwn.net/Articles/444336/](http://lwn.net/Articles/444336/)

------
alextingle
I'm pretty sure you could have saved yourself a lot of work by just using
profile guided optimisation (PGO). It's mature & ready to go in GCC at least.

(Integrating it into your build process is a little bit more challenging, I
admit. I've set it up in a large dev environment used by multiple large
projects, and it was well, well worth the effort.)

------
jheriko
it's one of those things... micro-optimisation becomes the big one once you do
enough algorithm optimisation. even the humble memory copy becomes faster from
prefetching and clever use of registers... and from actually making the
algorithm theoretically worse. :P

also, note that it is possible to remove the magic from your magic number with
some thought about latency, the size and alignment of your data etc. fetching
a cache line takes a fairly predictable number of cycles.

opposite to the prefetch are the non-temporal reads/writes, where you don't
pollute the cache, to prevent big one-off copies from damaging the performance
of other code...

------
joosters
Interesting post! I wonder if the author experimented with different gcc
command-line optimisation options as well? gcc _might_ be able to insert some
prefetches by itself with some settings?

~~~
naftaliharris
Thanks! I tried -fprefetch-loop-arrays before figuring out that it only
prefetches the array memory (i.e., memory with addresses like &ob_item[i]),
rather than the memory referenced by the array (i.e., memory with addresses
like ob_item[i]). And empirically, adding -fprefetch-loop-arrays got me no
speedups.
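
The difference in code, with a hypothetical helper just to illustrate (not the
actual patch):

    
    
        /* &ob_item[i+3] is a slot in the contiguous pointer array; that's
         * what -fprefetch-loop-arrays can prefetch, and it's already cheap.
         * ob_item[i+3] is the PyObject that slot points to, scattered on the
         * heap; that's the load that actually misses. */
        static inline void prefetch_slot_vs_object(void **ob_item, long i)
        {
            __builtin_prefetch(&ob_item[i + 3]);  /* the array slot        */
            __builtin_prefetch(ob_item[i + 3]);   /* the pointed-to object */
        }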

I didn't try any other GCC command line arguments because I honestly didn't
know what other relevant options there were.

------
mattholtom
Came expecting
[http://thedailywtf.com/Articles/The-Speedup-Loop.aspx](http://thedailywtf.com/Articles/The-Speedup-Loop.aspx),
was disappointed...

------
jheriko
also, read this:
[http://www.gamedev.net/page/resources/_/technical/graphics-p...](http://www.gamedev.net/page/resources/_/technical/graphics-programming-and-theory/graphics-programming-black-book-r1698)

------
clintonc
I get at least that good with

    
    
        DEFINT A-Z

