Edit: Here's the full comments: https://queue.acm.org/fullcomments.cfm?id=1814327
Yeah, an incidentally-optimal-with-a-certain-architecture algorithm is great as long as you can count on the architecture remaining constant.
"Cache-oblivious" algorithms are one seemingly better approach, since they're optimal to some degree across a wide range of cache structures - but they're not theoretically as fast as purely in-memory structures, if the data somehow fits in a single memory type.
"Architecture aware" trees are yet another scheme, if the program can discover the architecture.
Of course, a lot of algorithms are going to GPUs and how these structures relate to GPU processing is another thing I'd be curious about.
That's a theoretical problem that might happen sometime in the future.
Meanwhile the author's approach brings a 10x speedup in real-world deployments?
Why the snark? Theoretical work is vital for understanding some problems, but practical applications are where the real problem expresses itself, and where real-world constraints manifest themselves.
Demonstrating that in the real world a specific data structure performs far better than the theoretically optimal one is not old news, though. That's novel, unless you can find prior work that presents the exact same findings.
That's one of the main problems with ivory tower types: claiming that a novel poem is nothing new because all the words in it were already known and even published in a few dictionaries.
Because the author didn't discover anything stunningly novel, not even slightly, yet bragged rather big.
Read the comments, the full list, to see many others saying the same thing. His 'discovery' isn't.
I honestly don't understand your complaint. Did the author discover a novel data structure? No. But did the author discover that a data structure performs far better in the real world than theoretically optimal data structures that fail to take real-world constraints into account? Why, yes.
> Read the comments, the full list, to see many others saying the same thing. His 'discovery' isn't.
Sounds like the nay-sayers are entirely oblivious to real-world problems. That's one of the main complaints regarding the ivory-tower mentality, and how some very clever people fail to fulfill their potential by focusing on theoretical frameworks that have no relation to reality. Some algorithms may present the lowest theoretical time complexity within fictitious frameworks where all memory regions are accessible in constant time and all ops take the same time to perform, but that is far from what we have in reality. In reality, cache misses penalize performance, and cache locality is fundamental to reducing that penalty. It means nothing that an algorithm is proven optimal under unrealistic assumptions if another algorithm is found to perform far better in the real world.
Which has been known since forever. It's not a new discovery. (Edit: by "since forever" I mean the '70s, and quite possibly earlier.)
It's not even remotely surprising, because anyone who knows anything about memory architectures will appreciate it.
Please read the full comments in the original ACM article. Or at least a sample.
Also read https://www.goodreads.com/book/show/12485556-what-every-prog... (which can also be found online) which explains very clearly why this is predictable.
I've read the comments, and I'm a bit surprised you suggest it, because a) there are far more supportive comments and praise than criticism, and b) the complaints are in general entirely oblivious to the content - more outraged by the perceived tone than by the actual finding regarding real-world performance.
Take, for instance, the absurd and entirely oblivious jab asking how the proposed algorithm performs on a Turing machine (?!), and, from there, criticising the author for pointing out a major flaw in CS education.
As the author stated in the comments section:
> The response to my article broadly falls in two camps, which can be paraphrased as: Physisists arguing that bridges and buildings is simply not their job vs. people who try to build bridges and buildings, asking "whos job is it then ?"
>The obvious answer would be "The Software Engineers".
>Maybe we ought to educate some of those...
Here's an extract from the comments which actually says it well:
The claim "an algorithm that has been on the books as "optimal" for 46 years, which has been analyzed in excruciating detail by geniuses like Knuth and taught in all computer science courses in the world, can be optimized to run 10 times faster" is half wrong and half obvious.
[which is what people here too have been saying]
Is half wrong because you cannot talk about optimizing an algorithm in terms of "time", you can optimize its asymptotic computational complexity number of steps, which you did not. Is half obvious because the fact that heaps do not match well with virtual memory, with caches and with multiple execution pipelines in modern CPUs is something known since decades.
[repeating what people have said here too, that once you have caches, memory accesses become very inconsistent. The time difference between a L1 cache hit and a main mem access can be 50 times, or even more. So much more than a single order of magnitude speed difference. If it hits disk via VM, OMG IT HURTS IT HURTS... That's why an 'inferior' algorithm that uses memory in a nice way can beat a technically superior algo that doesn't]
Who decided to use a pure heap in that code was not a very good programmer, you proved to be a better one [I agree here], you did not change a little bit in the history of algorithms, a good CS class may help you about the fact that you think and claim you did, there is a long way before being able to even name Knuth.
Algorithm analysis can't, AFAIK, take much about caches into account. It's just too difficult. So analysis is valuable but limited (though with very important real-world value), and you also need to account for the fact that cache/VM effects are huge.
The author didn't discover anything we - or you - aren't aware of, and he presented it badly.
L2 cache was available in consumer hardware as far back as the '486 in the early '90s; CS students were being taught about the effects of having a memory hierarchy as far back as then. PHK wasn't pointing out a flaw in CS education, he was merely demonstrating how out of touch he was.
This wasn't a new realization, even in 2010.
The neat thing is that Bentley's string could probably be reversed in cache today.
Yeah, those big Zen 2 Epyc CPUs have 256 Megabytes of L3 cache. Now. That's more than like, either of my first two computers had in RAM.
Edit: 8-byte registers * 168 total regs = 1344 bytes, so a) you were right, and b) guessing it was a VIC-20?
Choosing an algorithm based on the architecture it will run on and the data sets it will operate on is the right thing to do, period. That's engineering.
If you have a different architecture, you choose a different algorithm. You don't go looking for some mutant yeti super-algorithm that somehow runs well on all of them. This is just a symptom of computer science's obsession with generalization, which is rather counter-productive.
> Of course, a lot of algorithms are going to GPUs and how these structures relate to GPU processing is another thing I'd be curious about.
GPUs are a good example for how you need to craft an algorithm to fit the architecture, not the other way around. The way you can program GPUs today is different from five and ten years ago.
The full quote is:
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%."
People forget that last bit.
This article seems to pop up about every other year, and pretty much garner the same general reactions every time.
A lot of CS majors immediately go into defensive mode, starting to spew O(blablabla) formulas while thumbing their textbooks.
That actually just underscores the main point I tried to communicate in that article: CS education simplified the computing platform used for algorithmic analysis so much that it has (almost) lost relevance.
The big-O has little to do with real-world performance any more, because of the layers of caching and mapping going on. In practice O(n²) can easily be faster than O(n) for all relevant n, entirely depending on ordering of access in the implementation.
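The access-ordering effect is easy to see even with a toy model. Here's a minimal sketch (Python; it simulates a hypothetical direct-mapped cache of 64 lines of 64 bytes - the sizes are illustrative assumptions, not any real CPU) counting misses for the same O(n) traversal of a 256x256 matrix of 8-byte elements, row-major vs column-major:

```python
def count_misses(addresses, line_size=64, num_lines=64):
    """Simulate a tiny direct-mapped cache and count misses."""
    tags = [None] * num_lines          # one tag per cache slot
    misses = 0
    for addr in addresses:
        line = addr // line_size       # which memory line this byte is in
        slot = line % num_lines        # direct-mapped: line -> fixed slot
        if tags[slot] != line:
            tags[slot] = line          # evict whatever was there
            misses += 1
    return misses

N = 256  # a 256x256 matrix of 8-byte elements
row_major = [(r * N + c) * 8 for r in range(N) for c in range(N)]
col_major = [(r * N + c) * 8 for c in range(N) for r in range(N)]

print(count_misses(row_major))  # 8192: one miss per 64-byte line
print(count_misses(col_major))  # 65536: every single access misses
```

Same work, same big-O, an 8x difference in misses purely from access order - and real hardware piles prefetching and TLB effects on top of that.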
Also, CS education tends to gloss over important footnotes. Take quicksort's worst-case performance, which even Wikipedia describes as "rare". It is not. Sorted data are very, very common.
Originally researchers were more careful about these issues, and they operated with several distinct big-O's. Re-read TAoCP about sorting on tape-drives for a classic example.
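The quicksort footnote is easy to demonstrate. A quick sketch (plain Python, textbook Lomuto partition with a first-element pivot and no randomization - the version usually taught first) counting comparisons on shuffled vs already-sorted input:

```python
import random
import sys

def quicksort(a, lo, hi, counter):
    """Lomuto partition with the first element as pivot (no randomization)."""
    if lo >= hi:
        return
    pivot = a[lo]
    i = lo
    for j in range(lo + 1, hi + 1):
        counter[0] += 1               # count element comparisons
        if a[j] < pivot:
            i += 1
            a[i], a[j] = a[j], a[i]
    a[lo], a[i] = a[i], a[lo]
    quicksort(a, lo, i - 1, counter)
    quicksort(a, i + 1, hi, counter)

sys.setrecursionlimit(5000)           # sorted input recurses n deep
random.seed(1)
n = 1000

shuffled = list(range(n))
random.shuffle(shuffled)
random_comps = [0]
quicksort(shuffled, 0, n - 1, random_comps)

pre_sorted = list(range(n))
sorted_comps = [0]
quicksort(pre_sorted, 0, n - 1, sorted_comps)

print(random_comps[0])   # roughly n*log2(n) comparisons
print(sorted_comps[0])   # exactly n*(n-1)/2 = 499500 comparisons
```

On random input it does on the order of ten thousand comparisons; on sorted input, half a million - the "rare" worst case, triggered by the most common input shape there is.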
Some pull out "Cache-Oblivious Algorithms", usually having never actually tried to implement, benchmark or use any of them.
They are, as a rule, horribly complex, and seldom faster in practice, because they hide a very large performance constant in their complexity.
More often than not there is no available reference implementation, so people cannot easily try them out, and none of them are fast enough to compensate for the embuggerance of the software patents which cover them.
There are usually one or two who claim that "Virtual Memory" is no longer a relevant concern and that one should just put "enough RAM" in the machine.
Throwing hardware at the problem is a nice job if you can get it, and a nice hobby if you can afford it.
However, even if the money is there, what about the energy consumption?
Our climate is rapidly deteriorating because of greenhouse gasses from our energy-production, and more and more algorithms run on battery-power.
If I were writing a CS-thesis or CS-textbook today, it would be all about big-E() notation, and how it only weakly correlates with big-O() notation.
The one after that is that all performance work decreases code comprehension. There are plenty of techniques that improve both, and these, as I discovered, fly below the radar of the anti-perf crowd.
Best algorithm performance article I've read on HN for a long time!
I'm not asking this to be snarky, again I'm a dropout and don't have a ton of insight into this sort of thing - do most CS programs not cover this in at least some detail at some point?
He didn't even try to go down the path of "we're focusing on big-O, so we're only ignoring it for this particular analysis" - he just pretended the term didn't exist.
So, what did you do then? From a theoretical/CS point of view, you're just optimizing for less VM paging. There is nothing wrong with that - quite the opposite: you're demonstrating knowledge of your machine's architecture and using it to keep memory latency low, both of which are good things. But this isn't new, or unheard of. I recently looked something up w.r.t. tree data structures, and the 1997 8th edition of Cormen, Leiserson and Rivest even mentions how bad it is to access data from a spinning disk, and how that should be reduced (I didn't read that section in detail, because that was not what I was interested in).
However, maybe sometimes people forget that, or they apply their knowledge badly (e.g. optimizing for "number of compares" when memory latency is dominant). With regard to that, I really liked the article: I do a lot of runtime analysis of assembler code across various platforms, and as such I am painfully aware of how complex even decades-old pipelines are - it's painful that developers tend to forget this.
On the one hand I find (parts of) the linked article funny; on the other hand it's painful that you assume all other people are ignorant, when this is not the case.
Regarding big-E: That's also not new, but probably not as popular as it should/could be. I recall a friend mentioning that part of her PhD research was energy complexity of sorting algorithms. That was ~2008.
IMO, for a multitude of reasons, the virtualized abstraction has hidden the cost of these deoptimizations.
I believe that with the public cloud dropping a comprehensive bill on the doorstep of these enterprises, compute will become a utility that is optimized once again, with virtualization at the heart of it all.
> CS education simplified the computing platform used for algorithmic analysis so much that it has (almost) lost relevance.
Do you have any constructive advice? How would you relax the model assumptions?
What do you use to estimate algorithm speed?
Maybe have them download and sort a 32GB data set on a 64GB SD card on a RPi3 board ?
The smart ones will realize that downloading can be their first pass over the data, and all of them will be befuddled by the interesting performance anomalies when photo-grade SD cards go on sabbaticals, if you annoy their simplistic flash-adaptation layers.
If they characterize "the trouble", they may find that obsolete sequential algorithms for tapes run faster than random access algorithms in that environment.
...which is precisely the kind of thing companies expect a CS-graduate to be good at.
Randomly selecting the pivot is an additional step easily skipped when first introducing the algorithm.
And the computation is much less trivial.
Still, an interesting and important observation - page sizes change every decade or two, and almost exclusively upwards, so this is worth pursuing if you need to squeeze more speed from your heap.
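For concreteness, here's a small sketch (Python; assuming 8-byte heap entries, 4 KiB pages, and the usual 1-based indexing - all illustrative assumptions) of why a plain implicit binary heap hurts once it outgrows one page: a root-to-leaf walk lands on a fresh page at almost every one of the deeper levels.

```python
PAGE = 4096
ENTRY = 8                      # assume 8-byte heap entries

def pages_on_path(n):
    """Count distinct 4 KiB pages touched walking root -> leftmost leaf
    in an implicit (array-backed) binary heap, 1-based indexing."""
    pages = set()
    i = 1
    while i <= n:
        pages.add((i * ENTRY) // PAGE)
        i *= 2                 # left child of node i lives at index 2*i
    return len(pages)

# One page holds 512 entries, i.e. the top ~9 levels of the heap; below
# that, every level of the traversal lands on a page of its own:
print(pages_on_path(2**20))    # 13 distinct pages for a million entries
```

The article's B-heap avoids exactly this by packing whole subtrees into a page, so the pages touched per operation shrink by roughly the log of the per-page fanout - which is why it wins as soon as some of those pages aren't resident.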
Everyone needs to "squeeze more speed from their heap". CPU-bound problems are quite rare in modern software - despite the legions of software engineers who will stare at a CPU profile that says 100% utilisation, and declare "it's CPU bound, nothing to be done".
(this is a pet peeve of mine, in case you couldn't tell)
I think the point of the article is that
> memory latency
Is actually a hierarchy of its own, and non-linear depending on access patterns.
Instruction fetch latency, branch mis-predict latency, cache hit/miss latency (L1 through L3), cache pressure, register pressure, TLB hit/miss, VM pressure, etc. all become significant at scale.
Just knowing big-O/big-Ω characteristics isn't enough.
I'm interested to know: what do you think that CPU is fully engaged in doing, if it's not making progress toward a solution?
The last time I had to optimize a B-tree-based k/v store, getting this right could literally boost performance in test cases by 40%.
Ok, question from someone clueless without recent low-level programming experience - is this in fact a normal situation on modern systems?
Because I had the general impression that, precisely because the gulf between storage and memory speeds has widened so much, virtual memory is more of a historical appendix than anything.
My laptop has 8 GB of memory, which I think is fairly small these days, so how likely is it that I would be using just slightly more than that - between 100% and 100.01% of my memory?
...I went and looked for the date of this article, and it's almost 10 years old, so perhaps things have changed...
These are typical performance levels of a TLB:
size: 12 bits – 4,096 entries
hit time: 0.5 – 1 clock cycle
miss penalty: 10 – 100 clock cycles
miss rate: 0.01 – 1% (20–40% for sparse/graph applications)
If a TLB hit takes 1 clock cycle, a miss takes 30 clock cycles, and the miss rate is 1%, the effective memory cycle rate is an average of 1 × 0.99 + (1 + 30) × 0.01 = 1.30 (1.30 clock cycles per memory access).
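Those numbers plug straight into the standard effective-access-time formula; a quick sketch in Python:

```python
def effective_access_cycles(hit_time, miss_penalty, miss_rate):
    """Average clock cycles per memory access for given TLB behavior."""
    return hit_time * (1.0 - miss_rate) + (hit_time + miss_penalty) * miss_rate

print(effective_access_cycles(1, 30, 0.01))  # ~1.30, matching the arithmetic above
print(effective_access_cycles(1, 30, 0.20))  # ~7.0 at a sparse-workload miss rate
```

Note how fast it degrades: the same machine is effectively five times slower per access once a sparse/graph workload pushes the miss rate to 20%.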
So as you can imagine, the total address space reserved on a machine can easily go into the terabyte range. But that doesn't matter, because every process gets its own lookup table.
So suddenly we are at 90% memory and everything is slow as molasses.
Keep your options open by keeping your code portable.
The pages are also zeroed on demand when committed, which ensures thread creation doesn't first require 8 MB of pages to be zeroed (there are workarounds for that as well).
Great question. Where I work, it's common to have swap disabled (or only use zram-backed swap, so pages may be stored compressed in RAM but never actually get sent to SSD much less spinning disk) and mlock the executables into RAM. So in my experience it's uncommon for memory accesses to wait for secondary storage (because we've decided to avoid it).
On the other hand, I think the author's definition is poor:
> "VM pressure, measured in the amount of address space not resident in primary memory"
By this definition, the systems I just described have lots of VM pressure. Thread stacks contribute: there might be anywhere from 64 KiB up to 4 MiB of address space reserved for a thread's stack (plus a 4 KiB guard page), but none of that is actually backed by real physical RAM until it's first accessed. So often a thread stack will only occupy 4 KiB of actual RAM (one page).
I'd say however a more useful definition of memory pressure is based on time spent waiting for something to be paged in. See also the recent Linux "pressure stall information" work. By this definition, those non-backed-pages don't contribute to memory pressure much. (Minor page faults, which find a page of physical RAM to match the virtual page and write zeroes to it, are typically fast unless physical RAM is actually hard to find or (at least on Linux) there's heavy contention on the process's mmap_sem.)
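For reference, the PSI interface mentioned above exposes exactly that wait-time-based view via /proc/pressure/memory. A small sketch parsing a sample in that line format (the numeric values here are made up, but the field layout matches the real files):

```python
def parse_psi(text):
    """Parse the line format used by Linux /proc/pressure/{memory,io,cpu}."""
    result = {}
    for line in text.strip().splitlines():
        kind, rest = line.split(None, 1)        # "some" or "full"
        result[kind] = {key: float(value)
                        for key, value in
                        (field.split("=") for field in rest.split())}
    return result

# Sample in the real file's shape; the numbers themselves are made up:
sample = """\
some avg10=0.00 avg60=0.12 avg300=0.05 total=417963
full avg10=0.00 avg60=0.08 avg300=0.03 total=205871
"""
psi = parse_psi(sample)
print(psi["some"]["avg60"])   # % of recent time some task stalled on memory
```

The "some"/"full" split and the avg windows give you memory pressure as time actually lost to waiting, rather than as address-space bookkeeping.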
> Because I had the general impression that ... that virtual memory is more of a historical appendix than anything.
Virtual memory is still useful for lots of things besides using SSD or disk to extend RAM: allocating address space before backing RAM (as I just described for thread stacks; also iirc ASAN/TSAN take advantage of this), zram-backed swap, and memory-mapped file I/O come to mind.
I think what the author says here is not terribly common:
> A 300GB backing store, memory mapped on a machine with 16GB or less RAM is quite typical. The user paid for 64bit of address-space, and I am not afraid to use it.
It's common for the author's application because he's designed it that way. He doesn't supply any data to make us think it's common in other applications so his advice may not be generally applicable. And he's also not using what I'd consider to be a correct definition of memory pressure. So plenty of reasons to suspect true memory pressure is an "extreme corner case" for most people (in applications other than Varnish), despite his saying "this is pretty much the real-world behavior seen."
I'm unsure from a quick skim whether the B-heap he's describing is large and resides in the memory-mapped on-disk pages, or small and resides in anonymous pages. If the latter, it may be better to simply mlock() it in RAM so you can guarantee there are no page faults while it's accessed.
I'd hate to have this discussion with the author because he sounds insufferably arrogant. Another commenter pointed out that this "paper" (I'd say rant) didn't go over well with the ACM crowd, and no wonder why...
I think what's even more interesting are cases where algorithms with worse asymptotic complexity perform better than algorithms with a better asymptotic complexity. You see this all the time in the database world, since you are often better off optimizing for disk operations than for pure asymptotic complexity.
The main example that comes to mind is that if you are performing a GROUP BY on a dataset that won't fit in memory, it's usually faster to calculate the GROUP BY with mergesort than it is with a hash-table. Hash tables are notoriously bad for disk performance. Every lookup or insert becomes one disk read or write. This is in contrast to mergesort where you can take two sorted files, and in one pass merge the two files together. This leads to a much smaller total number of disk reads and writes.
Two other cases like this that come to mind are Fibonacci heaps vs binary heaps and matrix multiplication algorithms.
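A minimal sketch of the merge side of that trade-off (Python; the run contents and the `merge_group_sum` helper are made up for illustration): each pre-sorted run is consumed strictly front to back, which is exactly the sequential I/O pattern disks reward, instead of one random bucket access per row.

```python
import heapq
from itertools import groupby

def merge_group_sum(runs):
    """GROUP BY via merge: combine pre-sorted (key, value) runs and sum
    values per key in one streaming pass over each run."""
    merged = heapq.merge(*runs)   # lazily merges the sorted streams
    return [(key, sum(v for _, v in group))
            for key, group in groupby(merged, key=lambda kv: kv[0])]

# Two sorted runs, as an external mergesort would leave them on disk:
run_a = [("apple", 1), ("cherry", 2), ("cherry", 3)]
run_b = [("apple", 4), ("banana", 5)]
print(merge_group_sum([run_a, run_b]))
# [('apple', 5), ('banana', 5), ('cherry', 5)]
```

The hash-table version has better asymptotic complexity on paper, but every probe is a potential random disk read; the merge does strictly sequential passes, so it wins as soon as the data no longer fits in memory.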
What's not true in the real world is that the actual value of the constant can be ignored. Memory or disk access on the one hand, and simple arithmetic operations on the other, are both O(1) but the difference is significant.
You probably think this never happened, but it used to happen all the time. The Linux scheduler used to traverse a linked list of processes (because Linus didn't know any better). Once the number of processes being scheduled exceeded something like 64, it slowed to a crawl, consuming most of the quanta just rescheduling.
They are all the same.
Good link to a similar discussion: An Introduction to Cache-Oblivious Data Structures - https://news.ycombinator.com/item?id=16317613
Both indexes will show a query plan that’s index only and will hit the same number of records. Locality of memory in that index (since Y had repeats) was hugely important.
I also can't help but appreciate the annecdote of abstraction hiding exactly what needs to be shown for performance computing considerations.
Databases have used B-trees and variants for at least 50 years, not binary trees, so obviously we know this.
Much ado about absolutely nothing.
Haven't heard the term and googling around, I am certain the author isn't talking about what I am finding at the top.
It took Netflix until 2015 to beat out porn & piracy as the top total-bandwidth consumer on the public internet.
1 - https://learnbonds.com/news/netflix-inc-nflx-now-uses-more-b...
Learned something new.
Modern languages have similar structures; Scala's Vector type, for example, is an RRB-tree under the hood: https://docs.scala-lang.org/overviews/collections/performanc...
For more fun: https://youtu.be/sPhpelUfu8Q
Moreover if you want random access just use Arrays (especially now that 2.13 has immutable ones). The only use case for Vector is when you don’t know the collection size up front and need random access. Though even then it honestly might just be a better idea to build up a list and then convert it to an array.
The question is: In the context of a block-oriented store, like a VM system, when is b-heap better or worse than b-tree and variants?
Note, I said “and variants” deliberately because of course plain b-tree is pretty simple. Just comparing O between b-heap and b-tree isn’t enough because B-tree variants can have the same O characteristics, I think.