// This single line boosts performance by a factor of around 2 on GCC.
// The optimal lookahead distance i+3 was chosen by experimentation.
// See http://www.naftaliharris.com/blog/2x-speedup-with-one-line-of-code/
Not like this.
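(For context: the line those comments annotate sits in the partition step of a quickselect over an array of pointers to Python objects. The following is a rough paraphrase, not the actual lazysort.c source; partition, items, and less_than are stand-ins.)

    #include <cstddef>
    #include <utility>

    // Paraphrase of the idea: while partitioning an array of pointers,
    // hint the object three slots ahead so it is (hopefully) in cache by
    // the time the comparison dereferences it. In the real code the
    // comparison is a (much more expensive) Python object comparison.
    template <typename T>
    std::size_t partition(T** items, std::size_t n, const T* pivot,
                          bool (*less_than)(const T*, const T*)) {
        std::size_t store = 0;
        for (std::size_t i = 0; i < n; ++i) {
            // The one line. Note that evaluating items[i + 3] can read
            // past the end of the array; that is the undefined-behavior
            // concern raised further down the thread.
            __builtin_prefetch(items[i + 3]);
            if (less_than(items[i], pivot))
                std::swap(items[i], items[store++]);
        }
        return store;
    }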
Yes, the +3 deserves an explanation (and something better than "because of experimentation"). My guess is that it's big enough to get the next cache line and small enough not to fetch something that won't be needed.
And I find it funny when people on HN are shibeing __builtin_prefetch, like "wow, I just heard about caches and prefetching data, how amazing".
When you're doing order statistics (i.e. selection/quickselect) with 1M elements, you should be very, very close to the theoretical optimum (1.5N) for complexity. If you're interested in choosing the best strategy, see my comment here:
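For anyone unfamiliar with quickselect, a minimal version looks like this (a textbook sketch, not the commenter's code; note that plain quickselect with random pivots needs more than 1.5N comparisons in expectation, and getting near that bound takes pivot sampling along the lines of Floyd-Rivest):

    #include <cstddef>
    #include <random>
    #include <utility>
    #include <vector>

    // Textbook quickselect: after it returns, v[k] holds the k-th
    // smallest element. std::nth_element (mentioned below) does the
    // same job.
    double quickselect(std::vector<double>& v, std::size_t k) {
        static std::mt19937 rng{std::random_device{}()};
        std::size_t lo = 0, hi = v.size();  // half-open range [lo, hi)
        while (hi - lo > 1) {
            std::size_t p = std::uniform_int_distribution<std::size_t>(lo, hi - 1)(rng);
            std::swap(v[p], v[hi - 1]);     // move pivot to the end
            double pivot = v[hi - 1];
            std::size_t store = lo;
            for (std::size_t i = lo; i + 1 < hi; ++i)
                if (v[i] < pivot) std::swap(v[i], v[store++]);
            std::swap(v[store], v[hi - 1]); // pivot into its final place
            if (k == store) return v[k];
            if (k < store) hi = store; else lo = store + 1;
        }
        return v[k];
    }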
Thanks, that's really helpful!
I've had this idea for some time that our representation of programs is fundamentally broken, because it excludes a detailed characterization of the input space. I took a first stab at a solution with http://akkartik.name/post/tracing-tests, which lets me write all my unit tests at the top level, showing how each piece of functionality augments the space of inputs my program can address. See, for example, https://github.com/akkartik/wart/blob/3039fe6d03/literate/02.... I suspect reading other people's 'code' could be an order of magnitude easier and more convenient if 'code' included the right information. I'm still working to flesh this idea out.
Thanks, that was a great read :D
I've often found the reverse to be true: that implementation details matter most.
The best fixes tend to be altering my code so that the compiler/CPU can be smarter. In this case, the author hinted the cache with a prefetch. Other times, I've found that looping in a different order helps my cache hit rate (sketched below). Other times, properly naming variables and not reusing the same name allows the compiler to do smarter things.
Other times, when I've tried a new algorithm, it's been slower because of bad constants, or because I've accidentally messed up the implementation enough to negate all the benefits.
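The loop-order point in a nutshell (a generic illustration, not from the article): with a row-major matrix, iterating rows-then-columns streams through memory, while columns-then-rows takes a large stride on every access:

    #include <cstddef>
    #include <vector>

    // A row-major n*n matrix stored in one contiguous vector.
    // sum_by_rows walks memory sequentially (cache-friendly);
    // sum_by_cols strides by n doubles per step, touching a new
    // cache line on almost every iteration (cache-hostile).
    double sum_by_rows(const std::vector<double>& m, std::size_t n) {
        double s = 0;
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j)
                s += m[i * n + j];
        return s;
    }

    double sum_by_cols(const std::vector<double>& m, std::size_t n) {
        double s = 0;
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t i = 0; i < n; ++i)
                s += m[i * n + j];  // stride of n elements per access
        return s;
    }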
As a complete side note, you need to be careful when using "older" algorithms, i.e. those published pre-2000. I once implemented, based on a Hoare paper, a doubly linked list for sparse matrices, and it was significantly slower than the naive method, simply because CPU caches have gotten really good at streaming through contiguous memory, while predicting linked-list memory locations has not gotten any easier.
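A generic illustration of that effect (not the commenter's sparse-matrix code): each step of a list traversal is a load that depends on the previous load, so the hardware prefetcher has nothing to work with, whereas an array traversal is perfectly predictable:

    #include <vector>

    struct Node { double value; Node* next; };

    // Pointer chasing: each load of n->next depends on the previous
    // one, so the hardware prefetcher can't guess the next address.
    double sum_list(const Node* n) {
        double s = 0;
        for (; n != nullptr; n = n->next) s += n->value;
        return s;
    }

    // Contiguous traversal: addresses are predictable, so the
    // prefetcher streams the data in ahead of the loop.
    double sum_array(const std::vector<double>& v) {
        double s = 0;
        for (double x : v) s += x;
        return s;
    }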
I compiled (x86, GCC 4.8) lazysort.c both with and without the prefetch intrinsic, and there were substantial differences in the code surrounding calls into Python, but no change at all to the guts of the partition or sorting routines.
Regardless of what is responsible, it's not good to leave something like this in your code without really understanding what it did. There may have been something else that occurred incidentally from this change that could have led you to a better solution.
It may, and it's possible that it doesn't show up on some CPUs. See this thread for an example of how difficult it is to find a good generic prefetch distance:
Also known as the rule of leaky abstractions, as in all abstractions are leaky.
If you are talking about abstracting, i.e. taking away, similar bits of code into one place in programming, whether it's via polymorphism or refactoring functions, then no, that is not a "leaky abstraction".
I am talking about abstraction as in creating a simplified / unified interface over some complex thing. Abstracting the idea of something from its concrete realizations into its own concept. OSI networking layers, drivers for hardware, modeling real world systems, etc.
If you want to be pedantic instead of snappy, you can change my statement to "all abstraction layers are leaky".
Data prefetch does not generate faults if addr is invalid, but the address expression itself must be valid. For example, a prefetch of p->next does not fault if p->next is not a valid address, but evaluation faults if p is not a valid address.
This seems to make the code as presented incorrect (i.e. it results in undefined behavior).
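One way to keep the hint without the out-of-bounds read is to clamp the index before the address expression is evaluated; a sketch (how the real lazysort.c loop tracks its bounds may differ):

    #include <algorithm>
    #include <cstddef>

    // Guarded prefetch: items[idx] is only evaluated for an index that
    // is actually inside the array, so the address expression is always
    // valid. The prefetched *target* may still point anywhere;
    // __builtin_prefetch never faults on that. Assumes n > 0.
    inline void prefetch_ahead(void* const* items, std::size_t i,
                               std::size_t n, std::size_t dist) {
        std::size_t idx = std::min(i + dist, n - 1);  // clamp to last slot
        __builtin_prefetch(items[idx]);
    }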
In the code, it was checking if self.something existed, and then doing something else if it didn't. This triggered the lazy getter, which had the effect of pulling in the metadata for a 300k-item S3 bucket.
Checking for self._something correctly returned None; the code then checked S3 for the item and uploaded it directly.
Title: I saved the world!!!#$^%!1111
Content: I was about to step on an ant today and narrowly missed it, saving the universe and all of existence for that ant.
A robust implementation should figure out the size of a cache line and the load delay, and derive the magic offset from that.
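As a sketch of what deriving it could look like (the latency and per-element cost figures below are made-up placeholders that would have to be measured; the sysconf query is Linux/glibc-specific):

    #include <cstddef>
    #include <unistd.h>  // sysconf (Linux/glibc)

    // Rough derivation of a prefetch distance: issue the hint far
    // enough ahead that one full miss latency is hidden behind the
    // per-element work. miss_ns and work_ns are illustrative guesses,
    // not constants you can rely on.
    std::size_t prefetch_distance(double miss_ns = 90.0, double work_ns = 30.0) {
        long line_size = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
        if (line_size <= 0) line_size = 64;  // fall back where unsupported
        // One way to use the line size: when prefetching the pointer
        // array itself, distances below line_size / sizeof(void*) land
        // on a line that is usually already in flight.
        std::size_t floor_d = static_cast<std::size_t>(line_size) / sizeof(void*);
        std::size_t d = static_cast<std::size_t>(miss_ns / work_ns + 0.5);
        return d > floor_d ? d : floor_d;
    }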
Mandatory reference link: http://en.cppreference.com/w/cpp/algorithm/nth_element . Note linear complexity.
There's also std::partial_sort:
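Usage of both, for anyone who hasn't met them (standard C++; unrelated to the article's Python extension):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    void example(std::vector<double>& v) {
        std::size_t k = v.size() / 2;

        // nth_element: afterwards v[k] holds the k-th smallest element,
        // with everything before it <= v[k]. Average-case linear time.
        std::nth_element(v.begin(), v.begin() + k, v.end());

        // partial_sort: fully sorts only the first k elements, in
        // O(n log k); handy when you need just the smallest k in order.
        std::partial_sort(v.begin(), v.begin() + k, v.end());
    }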
There is a good chance that 3 is specific to the hardware you tested on. Different systems will have different memory latencies (and NUMA systems can make it even more complicated). It isn't likely that 3 will turn into a pathological case on any other hardware, but it may well end up being less optimal than, say, 2 or 4.
Please note: of course this insight doesn't apply in your case, since your algorithm is meant for general-purpose usage, not HPC programming on systems where you have the hardware under your control. I'm simply urging people to whom the latter applies to think about whether their algorithms might be suitable for a GPU instead; doing lots of prefetch instructions and loop unrolls is a sign of that.
One thing that wasn't clear to me in your explanation is that you're sorting a list of pointers to objects. To sort that list, you're comparing the objects themselves.
// follows pointer to access *ob_item[i]
Another useful thing to do here is to see if you can find a "streaming operator" to dismiss *ob_item[i] and ob_item[i] from cache after the comparison. They won't be needed again for a while.
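The closest thing with GCC's builtin is the third ("locality") argument; whether it actually keeps the line from lingering in cache is hardware-specific (a sketch, not from the article's code):

    // __builtin_prefetch(addr, rw, locality): locality 0 means the data
    // is non-temporal, i.e. it need not be kept around after one use.
    // Using it on the object a few slots ahead asks the CPU to bring
    // the line in for the comparison and then let it age out quickly.
    inline void prefetch_once(const void* p) {
        __builtin_prefetch(p, /*rw=*/0, /*locality=*/0);
    }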
Another good article on this optimisation (mentioned by OP)
In your post it says you have a collection of 10MM elements but you didn't say what they were. Are they just ints or something else?
Oh, thanks for pointing out the oversight! They were floats generated with random.random() and then shuffled with random.shuffle. Shuffling them removes any locality that might have arisen when the floats were being generated, even though the distribution remains uniform on (0, 1). (And indeed, empirically the code runs faster if you don't shuffle them.)
Yeah, I'm not sure how well this would work if you were sorting different objects. I would guess that it depends on two things: how large the objects are in memory, and how long it takes to compare them.
Anyway, it's an interesting article and an interesting library.
If these two conditions apply, then there is a limit of 2x on the speedup, and increasing the Python object size reduces the effect of the prefetch (though it still helps).
Another note about prefetch distance: I'd pick 5, based on his testing data.
1) It shows no downside in the current tests.
2) Since a prefetch always fetches a fixed-size chunk, any workload that is hurt by picking a distance of 5 instead of 3 is unlikely to see a meaningful benefit at any prefetch distance.
3) It leaves headroom for improvements in comparison speed, e.g. if there were a JIT on the Python side to specialize the comparison.
4) It leaves room for a nice factor of improvement in core execution speed relative to memory latency.
5) It lets you handle page misses better.
BTW, my case wasn't the problem the kernel maintainers encountered with small lists, detailed here:
(Integrating it into your build process is a little bit more challenging, I admit. I've set it up in a large dev environment used by multiple large projects, and it was well, well worth the effort.)
Also, note that it is possible to remove the magic from your magic number with some thought about latency, the size and alignment of your data, etc. Fetching a cache line takes a fairly predictable number of cycles.
The opposite of the prefetch is non-temporal reads/writes, where you deliberately don't pollute the cache, to prevent big one-off copies from damaging the performance of other code...
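A sketch of the store side with SSE intrinsics (assumes x86, 16-byte-aligned buffers, and n a multiple of 4):

    #include <cstddef>
    #include <xmmintrin.h>  // SSE: _mm_load_ps, _mm_stream_ps, _mm_sfence

    // Copy a large float buffer with non-temporal stores: the written
    // data bypasses the cache, so a big one-off copy doesn't evict the
    // working set of surrounding code. Assumes both pointers are
    // 16-byte aligned and n is a multiple of 4.
    void stream_copy(float* dst, const float* src, std::size_t n) {
        for (std::size_t i = 0; i < n; i += 4)
            _mm_stream_ps(dst + i, _mm_load_ps(src + i));
        _mm_sfence();  // make the streaming stores visible to other cores
    }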
I didn't try any other GCC command line arguments because I honestly didn't know what other relevant options there were.