You're doing it wrong: B-heap 10x faster than binary heap (2010) (freebsd.dk)
229 points by eyegor 31 days ago | 95 comments

This article appeared in the ACM a while ago[0] and it didn't go down entirely well (some of the snarky comments I remember seem to be missing there, but some remain [see Edit below]). He just laid out the data to match the memory architecture and... that's it, really. Important but hardly a new field of computer science.

[0] https://queue.acm.org/detail.cfm?id=1814327

Edit: Here's the full comments: https://queue.acm.org/fullcomments.cfm?id=1814327

From comment in link: "So you reinvented the T-tree, plus some Straw Man graphs and some comments on how you're smarter than the entire CS field. Congrats?"

Yeah, an incidentally-optimal-with-a-certain-architecture algorithm is great as long as you can count on the architecture remaining constant.

"Cache-oblivious" structures are another, seemingly better approach, since they're optimal to some degree across a wide range of cache structures - but they're not theoretically as fast as pure in-memory structures, if the data somehow fits in a single memory type.

"Architecture aware" trees are yet another scheme, if the program can discover the architecture.

Of course, a lot of algorithms are going to GPUs and how these structures relate to GPU processing is another thing I'd be curious about.

> Yeah, an incidentally-optimal-with-a-certain-architecture algorithm is great as long as you can count on the architecture remaining constant.

That's a theoretical problem that might happen sometime in the future.

Meanwhile the author's approach brings a 10x speedup in real-world deployments?

Why the snark? Theoretical work is vital to understanding some problems, but practical applications are where the real problem expresses itself, and where real-world constraints manifest themselves.

It's not a bad idea. I did it myself in '98, down to making data fit in the L3 cache. Memory optimisation isn't really new.

The concept of cache misses and the importance of cache locality is nothing new. That's HPC 101.

Demonstrating that in the real world a specific data structure performs far better than the theoretically optimal data structure is not. That's novel, unless you can find any prior work that presents the exact same findings.

That's one of the main problems with ivory tower types: claiming that a novel poem is nothing new because all the words in it were already known and even published in a few dictionaries.

I'm calling BS because there was no L3 cache in 1998; the Pentium Pro only had L2.

All the world was not x86. The DEC Alpha definitely had 3-level cache systems by the mid 90s, and I would not be surprised if other RISC systems did as well.

> Why the snark?

Because the author didn't discover anything stunningly novel, not even slightly, but bragged rather big.

Read the comments, the full list, to see many others saying the same thing. His 'discovery' isn't.

> Because the author didn't discover anything stunningly novel, not even slightly, but bragged rather big.

I honestly don't understand your complaint. Did the author discover a novel data structure? No. But did the author discover that a data structure performs far better in the real world than theoretically optimal data structures that fail to take into account real-world constraints? Why, yes.

> Read the comments, the full list, to see many others saying the same thing. His 'discovery' isn't.

Sounds like the naysayers are entirely oblivious to real-world problems. That's one of the main complaints regarding the ivory-tower mentality and how some very clever people fail to fulfill their potential by focusing on theoretical frameworks that have no relation to reality. Some algorithms may present the lowest theoretical time complexity within fictitious theoretical frameworks, where all memory regions are accessible in constant time and all ops take the same time to perform, but that is far from what we have in reality. In reality cache misses penalize performance, and cache locality is fundamental to reducing that penalty. It means little that an algorithm is proven optimal under unrealistic assumptions if another algorithm is found to perform far better in the real world.

> But did the author discover that a data structure performs far better in the real world than theoretically optimal data structures that fail to take into account real-world constraints? Why, yes.

Which has been known about since forever. It's not a new discovery. (Edit: by 'since forever' I mean the '70s, and quite possibly earlier.)

It's not even remotely surprising, because anyone who knows anything about memory architectures will appreciate it.

Please read the full comments in the original ACM article. Or at least a sample.

Also read https://www.goodreads.com/book/show/12485556-what-every-prog... (which can also be found online) which explains very clearly why this is predictable.

> Please read the full comments in the original ACM article. Or at least a sample.

I've read the comments and I'm a bit surprised you suggest it because a) there are far more supportive comments and praise than criticism, b) the complaints in general are entirely oblivious to the content and are more outraged by the perceived tone than by the actual finding regarding real-world performance.

Take for instance the absurd and entirely oblivious jab asking how the proposed algorithm performs on a Turing machine (?!) and, from there, criticising the author for pointing out a major flaw in CS education.

As the author stated in the comments section:

> The response to my article broadly falls in two camps, which can be paraphrased as: physicists arguing that bridges and buildings are simply not their job vs. people who try to build bridges and buildings, asking "whose job is it then?"

>The obvious answer would be "The Software Engineers".

>Maybe we ought to educate some of those...

I'm going to reply because you read the comments as I requested and I appreciate that. I don't have much time so this will be brief.

Here's an extract from the comments which actually says it well:


The claim "an algorithm that has been on the books as "optimal" for 46 years, which has been analyzed in excruciating detail by geniuses like Knuth and taught in all computer science courses in the world, can be optimized to run 10 times faster" is half wrong and half obvious.

[which is what people here too have been saying]

It's half wrong because you cannot talk about optimizing an algorithm in terms of "time"; you can optimize its asymptotic computational complexity (number of steps), which you did not. It's half obvious because the fact that heaps do not match well with virtual memory, with caches, and with the multiple execution pipelines in modern CPUs has been known for decades.

[repeating what people have said here too, that once you have caches, memory accesses become very inconsistent. The time difference between a L1 cache hit and a main mem access can be 50 times, or even more. So much more than a single order of magnitude speed difference. If it hits disk via VM, OMG IT HURTS IT HURTS... That's why an 'inferior' algorithm that uses memory in a nice way can beat a technically superior algo that doesn't]

Whoever decided to use a pure heap in that code was not a very good programmer, and you proved to be a better one [I agree here], but you did not change the history of algorithms one little bit; a good CS class may help you with the fact that you think and claim you did. There is a long way to go before you can even invoke the name of Knuth.


Analysing algorithms can't, AFAIK, take much about caches into account. It's just too difficult. So analysis is valuable but limited (though with very important real-world value), and you also need to account for the fact that cache/VM effects are huge.

The author didn't discover anything we - or you - aren't aware of, and he presented it badly.


Yeah these people's point about the "reception" of this article would be better if it wasn't clear that most of the critical comments lack the technical knowledge to actually critique his argument.

> ...for pointing a major flaw in CS education.

L2 cache was available in consumer hardware as far back as the '486 in the early '90s; CS students were being taught about the effects of having a memory hierarchy as far back as then. PHK wasn't pointing out a flaw in CS education, he was merely demonstrating how out of touch he was.

Once upon a time, Jon Bentley wrote one of his Programming Pearls (IIRC) about two string reversal algorithms; both were Big-O identical, but one finished in seconds on a sufficiently large string and he killed the other after several minutes. The punchline was that the second was exactly pessimal with regard to paged VM.

This wasn't a new realization, even in 2010.

The neat thing is that Bentley's string could probably be reversed in cache today.

> The neat thing is that Bentley's string could probably be reversed in cache today.

Yeah, those big Zen 2 Epyc CPUs have 256 Megabytes of L3 cache. Now. That's more than like, either of my first two computers had in RAM.

Zen 2 EPYC CPUs have more than a third as much storage in their register file as my first computer had RAM.

Are you including the shadow registers? I suspect EPYCs actually have quite a bit more.

Edit: 8-byte registers * 168 total regs = 1344 bytes, so a) you were and b) guessing it was a VIC-20?

I was working off a figure of 180 in the Zen 2 core for 1440. It was the TRS-80 Model III with 4K, though I did acquire a VIC-20 a little later too.

> Yeah, an incidentally-optimal-with-a-certain-architecture algorithm is great as long as you can count on the architecture remaining constant.

Choosing an algorithm based on the architecture it will run on and the data sets it will operate on is the right thing to do, period. That's engineering.

If you have a different architecture, you choose a different algorithm. You don't go looking for some mutant yeti super-algorithm that somehow runs well on all of them. This is just a symptom of computer science's obsession with generalization, which is rather counter-productive.

> Of course, a lot of algorithms are going to GPUs and how these structures relate to GPU processing is another thing I'd be curious about.

GPUs are a good example for how you need to craft an algorithm to fit the architecture, not the other way around. The way you can program GPUs today is different from five and ten years ago.

I suspect that the paper would have gone down much better if it hadn't been phrased in such a snarky way. If he had just presented his results as "the B-heap is faster under VM pressure", it would have been accepted.

Well, his irritating writing style aside, he’s pinpointing one instance of a much larger problem in software development: the pervasive belief that “just enough” understanding is the right amount of understanding. I can’t count how many times I’ve been shot down in design meetings when I even bring up efficiency with “optimization is the root of all evil”. In most organizations, nothing matters except time to market (including user experience, efficiency and even accuracy) because “how long did it take to program” is the easiest thing for lazy managers to measure. Until organizations stop hiring people with only a vague notion of what they’re doing to deliver software, software will continue to get slower, buggier, and more unpredictable.

> "premature optimization is the root of all evil"

The full quote is:

"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%."

People forget that last bit.

But I do like the optimization of the quote. From now on I'll quote it as "We should forget about small efficiencies 97% of the time."

Oh, I know... but don't try telling them that.

I guess it is that season again ? :-)

This article seems to pop up about every other year, and pretty much garner the same general reactions every time.

A lot of CS-majors immediately go in to defensive mode, starting to spew O(blablabla) formulas while thumbing their text-books.

That actually just underscores the main point I tried to communicate in that article: CS education simplified the computing platform used for algorithmic analysis so much that it has (almost) lost relevance.

The big-O has little to do with real-world performance any more, because of the layers of caching and mapping going on. In practice O(n²) can easily be faster than O(n) for all relevant n, entirely depending on ordering of access in the implementation.
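A toy illustration (my example, not from the article): both loops below do the same O(n²) work, but in a language with contiguous 2-D arrays (C, numpy) the row-major walk touches consecutive addresses while the column-major walk strides across cache lines, and can be several times slower for large matrices. Plain Python lists of lists only approximate that layout, so treat this purely as a sketch of the two access patterns:

```python
def sum_row_major(matrix):
    # Visits elements in the order they sit in memory:
    # adjacent accesses share cache lines and prefetch well.
    total = 0
    for row in matrix:
        for value in row:
            total += value
    return total

def sum_col_major(matrix):
    # Same arithmetic, same big-O, but every access jumps to a
    # different row's storage, defeating spatial locality.
    total = 0
    for col in range(len(matrix[0])):
        for row in range(len(matrix)):
            total += matrix[row][col]
    return total
```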

Also, CS education tends to gloss over important footnotes. Take quicksort's worst-case performance, which even Wikipedia describes as "rare". It is not. Sorted data are very, very common.

Originally researchers were more careful about these issues, and they operated with several distinct big-O's. Re-read TAoCP about sorting on tape-drives for a classic example.

Some pull out "Cache-Oblivious Algorithms", usually having never actually tried to implement, benchmark or use any of them.

They are, as a rule, horribly complex, and seldom faster in practice, because they hide a very large performance constant in their complexity.

More often than not there is no available reference implementation, so people cannot easily try them out, and none of them are fast enough to compensate for the embuggerance of the software patents which cover them.

There are usually one or two who claim that "Virtual Memory" is no longer a relevant concern and that one should just put "enough RAM" in the machine.

Throwing hardware at the problem is a nice job if you can get it, and a nice hobby if you can afford it.

However, even if the money is there, what about the energy consumption ?

Our climate is rapidly deteriorating because of greenhouse gasses from our energy-production, and more and more algorithms run on battery-power.

If I were writing a CS-thesis or CS-textbook today, it would be all about big-E() notation, and how it only weakly correlates with big-O() notation.

The next biggest lie that everyone believes is that everything is premature optimization. For a while I saw this even when our customers were unhappy with the speed of the app. Probably the first bad case of cognitive dissonance I encountered in the field.

The one after that is that all performance work decreases code comprehension. There are plenty of techniques that do both, and these, as I discovered, fly below the radar of the anti-perf crowd.

This is applicable to any system that does paging and stores and retrieves data from those pages, whether that system be a Database or an OS...

Best algorithm performance article I've read on HN for a long time!

I dropped out of a not very good program, but even in my time there we did learn that O(N^2) can (and almost certainly will) be faster than O(N) on most architectures due to data locality and large constants hidden by Big O notation.

I'm not asking this to be snarky, again I'm a dropout and don't have a ton of insight into this sort of thing - do most CS programs not cover this in at least some detail at some point?

Mine not only did not (2006-2010), when the professor was demonstrating something he dropped a constant I knew from prior experience would overpower the big-O in question, then danced around it when I tried to ask about it.

Didn't even try to go down the path of "we're focusing on big-O, so we're only ignoring it for this particular analysis", he just pretended the term didn't exist.

Interesting, thanks. I suppose it isn't necessarily part of a core curriculum (though I did not make it through the curriculum, which is why I was curious).

Hm, maybe it's been too long since I had some AlgoDat lectures, but I recall we learned that memory hierarchy was rather important, and that having data in cache can make a huge difference. Of course for the undergrad course, learning proper analysis was more important, since the whole thing about big-O is how often $STUFF happens relative to the input size (or at least that was my big take away). That $STUFF is swappable, and in the lecture we usually used "number of compares". The idea was that a student could later analyze for other things, e.g. for loads, cache-misses, page-faults, cycles or a combination thereof. You're saying yourself: "they operated with several distinct big-O's.".

So, what did you do then? From a theoretical/CS point of view, you're just optimizing for less VM paging. There is nothing wrong with that, quite the opposite: you're demonstrating knowledge of your machine's architecture and using it to keep memory latency low, both of which are good things. But this isn't new, or unheard of. I recently looked something up w.r.t. tree data structures, and the 1997 8th edition of Cormen, Leiserson and Rivest even mentions how bad it is to access data from a spinning disk, and how that should be reduced (I didn't read that section in detail, because that was not what I was interested in).

However, maybe sometimes people forget that, or the application of their knowledge is bad (e.g. optimizing for "number of compares" when memory latency is dominant). With regard to that, I really liked the article: I do a lot of runtime analysis of assembler code across various platforms, and as such I am painfully aware how complex even decades-old pipelines are - it's a pain that developers tend to forget about this.

On the one hand I find (parts of) the linked article funny, on the other hand it's painful that you assume all other people to be ignorant, when this is not the case.

Regarding big-E: That's also not new, but probably not as popular as it should/could be. I recall a friend mentioning that part of her PhD research was energy complexity of sorting algorithms. That was ~2008.

That’s the difference between a data structures and algorithms class versus a systems class: in the former the constant in front of the Big O doesn’t matter, while in the latter it does. Any good CS curriculum would have both courses.

I think the problem was that most companies ran internal data centers and were inured to the cost of internal bloat.

IMO for a multitude of reasons this has hidden the cost of these deoptimizations due to the virtualized abstraction.

I believe with public cloud dropping a comprehensive bill on the doorstep of these enterprises, compute will become a utility that will be optimized once again with virtualization at the heart of it all.

It seems like the majority of your comment focuses on your distaste for Big-O complexity, a tool used for algorithm analysis.

> CS education simplified the computing platform used for algorithmic analysis so much that it has (almost) lost relevance.

Do you have any constructive advice? How would you relax the model assumptions?

What do you use to estimate algorithm speed?

My most constructive advice is that CS education should "go concrete" often enough to de-mythologize the predictive powers of big-O notation, and have the course contain at least one exercise which shows how misleading it can be in practice.

Maybe have them download and sort a 32GB data set on a 64GB SD card on a RPi3 board ?

The smart ones will realize that downloading can be their first pass over the data, and all of them will be befuddled by the interesting performance anomalies when photo-grade SD cards go on sabbaticals, if you annoy their simplistic flash-adaptation layers.

If they characterize "the trouble", they may find that obsolete sequential algorithms for tapes run faster than random access algorithms in that environment.

...which is precisely the kind of thing companies expect a CS-graduate to be good at.

Why do you think quicksort's worst-case performance occurs when the data is already sorted?

Quicksort's worst-case occurs when all elements always land on one side of the pivot. If the pivot is naively chosen as the first or last element in the range, this will be the case for already-sorted data.

Randomly selecting the pivot is an additional step easily skipped when first introducing the algorithm.
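For the curious, a minimal sketch of that failure mode (my own illustration, not anyone's library code): a quicksort that naively takes the first element as pivot, counting comparisons against the pivot.

```python
def quicksort_counting(items):
    # Quicksort with the naive first-element pivot; returns the
    # sorted list and the number of pivot comparisons performed.
    if len(items) <= 1:
        return list(items), 0
    pivot, rest = items[0], items[1:]
    left = [x for x in rest if x < pivot]
    right = [x for x in rest if x >= pivot]
    sorted_left, c_left = quicksort_counting(left)
    sorted_right, c_right = quicksort_counting(right)
    # Each element of `rest` was compared against the pivot once.
    return sorted_left + [pivot] + sorted_right, c_left + c_right + len(rest)
```

On already-sorted input of n items, every element lands on one side of the pivot, so the count comes out to n(n-1)/2 (4950 for n = 100) rather than the expected ~n log n.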

I don't think this really justifies GP's gripes about Wikipedia or sorted data occurring a lot in the real world. To get the first-element pivot selection quicksort to regularly interact with the sorted data, we would need some standard library to actually ship a quicksort that always picks the first element as the pivot. I don't think any of these exist.

absl::btree_map and absl::btree_set are drop-in replacements for std::map and std::set that use b-trees under the covers. Worth noting that this simulation doesn't factor in pressure on the CPU cache, which means that even if you aren't under VM pressure, b-trees can still offer a speedup. They are also gentler to the allocator.


A lot of people seem to be missing that Varnish is a two-billion-dollar company, Fastly, largely but far from entirely as a result of PHK doing a good job of optimizing it. This is not a question of freshman homework quibbling.

Constant speedup, but you need to optimize for page size; if you optimized for the wrong page size, you may get a lower constant, perhaps even < 1.

And the computation is much less trivial.

Still, an interesting and important observation - page sizes change every decade or two, and almost exclusively upwards, so this is worth pursuing if you need to squeeze more speed from your heap.

> if you need to squeeze more speed from your heap

Everyone needs to "squeeze more speed from their heap". CPU-bound problems are quite rare in modern software - despite the legions of software engineers who will stare at a CPU profile that says 100% utilisation, and declare "it's CPU bound, nothing to be done".

(this is a pet peeve of mine, in case you couldn't tell)

Latency kills: memory latency, network latency, and storage latency. In that order.

> Latency kills: memory latency, network latency, and storage latency.

I think the point of the article is that

> memory latency

Is actually a hierarchy of its own, and non-linear depending on access patterns.

Instruction fetch latency, branch mis-predict latency, cache hit/miss latency (L1 through L3), cache pressure, register pressure, TLB hit/miss, VM pressure, etc. all become significant at scale.

Just knowing big-O/big-Ω characteristics isn't enough.

Why would someone see 100% CPU and say there’s nothing to be done? CPU is the part we have the most control over. In my experience it’s the opposite where you can’t get people to engage you or their own brains. We aren’t CPU bound, nothing to do!

I’m interested to know what you think that CPU is fully engaged doing if it’s not making progress toward a solution?

In addition to optimizing for page size, you want to optimize for cache-line alignment to minimize the amount of false sharing that happens in the CPU cache. If you're reading these data structures (safely) from multiple threads, ensuring your blocks are (say) 4080 bytes in a 4096-byte page can dramatically increase performance, because otherwise the address bits used by the cache might line up across multiple blocks.

The last time I had to optimize a b-tree based k/v store getting this right literally could boost performance in test cases by 40%.

Is there any reason to not just produce a larger binary that contains a code path for every possible (well, probable) page size? Only one code path will be taken at runtime, so it's not like it'll harm cache locality.

"VM pressure, measured in the amount of address space not resident in primary memory"

Ok, question from someone clueless without recent low-level programming experience - is this in fact a normal situation on modern systems?

Because I had the general impression that exactly because the gulf between storage and memory speeds has widened so much, that virtual memory is more of a historical appendix than anything.

My laptop has 8 GB of memory, which I think is fairly small these days, so how likely is it that I would be using approaching, but less than, 8 MB more - between 100% and 100.1% of my memory?

...I went and looked for the date of this article, and it's almost 10 years old, so perhaps things have changed...

Everyone here saying paging is uncommon on modern systems with large RAM is correct, but there is another factor. TLB misses are expensive, and since they involve extra memory accesses by the MMU, their effect is pretty analogous to paging. So the slowdowns still exist, but they're not quite as pronounced, since each miss only costs you a few main-memory accesses rather than going all the way to secondary storage. They do happen rather more often than paging, though, due to the small size of the TLB.

From https://en.wikipedia.org/wiki/Translation_lookaside_buffer

These are typical performance levels of a TLB:

size: 12 bits – 4,096 entries

hit time: 0.5 – 1 clock cycle

miss penalty: 10 – 100 clock cycles

miss rate: 0.01 – 1% (20–40% for sparse/graph applications)

If a TLB hit takes 1 clock cycle, a miss takes 30 clock cycles, and the miss rate is 1%, the effective memory cycle rate is an average of 1 × 0.99 + (1 + 30) × 0.01 = 1.30 (1.30 clock cycles per memory access).
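Those numbers plug straight into the weighted-average formula; a quick sketch:

```python
def effective_access_cycles(hit_cycles, miss_penalty, miss_rate):
    # Average cost per memory access, as in the worked example
    # above: hits pay hit_cycles, misses pay hit_cycles plus the
    # extra miss penalty.
    return (hit_cycles * (1 - miss_rate)
            + (hit_cycles + miss_penalty) * miss_rate)
```

With the same figures (1-cycle hit, 30-cycle penalty, 1% miss rate) this gives 1.30 cycles per access; at the 20% miss rate quoted for sparse/graph workloads, the same TLB costs 7 cycles per access.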

Yes, but users rarely see real amount of virtual memory used (reserved, committed, etc). For example Chrome rendering process right now on my machine uses 267 MB of physical memory, 478 MB of committed virtual memory, and 2.1 GB of virtual address space.

So as you can imagine, the total address space reserved on a machine can easily go into the terabyte range. But that doesn't matter, because every process gets its own lookup table.

Poul-Henning Kamp is rather famous for writing code that uses mmap and a straightforward data structure for giant blobs of data and arguing that the kernel memory manager should handle the details. To my knowledge, most people dealing with data larger than memory use more active techniques.

A common failure mode for this is that devs and less sophisticated Ops people see that the machine is only “using” 60% of available memory, and they carve out a chunk for another process or a feature. They don’t get that file caching and mmapping are providing them a lot of their performance, and aren’t clearly accounted for by ‘free’.

So suddenly we are at 90% memory and everything is slow as molasses.

Or say things like, "had 90% of their CPU available for twiddling their digital thumbs". CPU usage is irrelevant if all your processes are blocked on i/o.

If portability is a goal, the mmap approach will require a higher investment with more documentation on system tuning to recognize the benefit. But if you're deploying to one architecture with strong support for mmap, go for it.

"One architecture". Like Linux on x86-64? That's the only one that matters for 99% of us.

You have no idea how much money businesses have lost in the long run, because they said that about IBM mainframes.

Keep your options open by keeping your code portable.

This could mean 'address space reserved but not committed'. Committed address space is often much smaller than reserved address space - think of how the stack starts out with very little actual memory in use, and then grows as unused pages are first touched. That mechanism ensures that spinning up a temporary thread with a shallow stack doesn't eat up 8mb of physical memory instantly.

The pages are also zeroed on-demand when committed, which ensures thread creation doesn't first require 8mb of pages to be zeroed (There are workarounds for that as well).

> Ok, question from someone clueless without recent low-level programming experience - is this in fact a normal situation on modern systems?

Great question. Where I work, it's common to have swap disabled (or only use zram-backed swap, so pages may be stored compressed in RAM but never actually get sent to SSD much less spinning disk) and mlock the executables into RAM. So in my experience it's uncommon for memory accesses to wait for secondary storage (because we've decided to avoid it).

On the other hand, I think the author's definition is poor:

> "VM pressure, measured in the amount of address space not resident in primary memory"

By this definition, the systems I just described have lots of VM pressure. Thread stacks contribute: there might be 64 kiB on up to 4 MiB of address space reserved for a thread's stack (plus a 4 KiB guard page) but none of that is actually backed by real physical RAM until it's first accessed. So often the thread stacks will only occupy 4 KiB of actual RAM (one page).

I'd say however a more useful definition of memory pressure is based on time spent waiting for something to be paged in. See also the recent Linux "pressure stall information" work. By this definition, those non-backed-pages don't contribute to memory pressure much. (Minor page faults, which find a page of physical RAM to match the virtual page and write zeroes to it, are typically fast unless physical RAM is actually hard to find or (at least on Linux) there's heavy contention on the process's mmap_sem.)

> Because I had the general impression that ... that virtual memory is more of a historical appendix than anything.

Virtual memory is still useful for lots of things besides using SSD or disk to extend RAM: allocating address space before backing RAM (as I just described for thread stacks; also iirc ASAN/TSAN take advantage of this), zram-backed swap, and memory-mapped file I/O come to mind.

I think what the author says here is not terribly common:

> A 300GB backing store, memory mapped on a machine with 16GB or less RAM is quite typical. The user paid for 64bit of address-space, and I am not afraid to use it.

It's common for the author's application because he's designed it that way. He doesn't supply any data to make us think it's common in other applications so his advice may not be generally applicable. And he's also not using what I'd consider to be a correct definition of memory pressure. So plenty of reasons to suspect true memory pressure is an "extreme corner case" for most people (in applications other than Varnish), despite his saying "this is pretty much the real-world behavior seen."

I'm unsure from a quick skim if the B-heap he's describing is large and resides in the memory-mapped on-disk pages, or if it's small and resides in anonymous pages. If the latter, it may be better to simply mlock() it in RAM so you can guarantee there are no page faults while it's accessed.

I'd hate to have this discussion with the author because he sounds insufferably arrogant. Another commenter pointed out that this "paper" (I'd say rant) didn't go over well with the ACM crowd, and no wonder why...

The tl;dr is that for many use cases, you can get order-of-magnitude improvements by optimizing constant factors. In their case, they made a disk-optimized version of a binary heap by gathering nodes into pages. This is better because it's faster to read one 4 KB page from disk than it is to perform multiple 64-byte reads from disk.
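This isn't PHK's exact B-heap layout, but a back-of-envelope sketch shows why clustering tree levels into pages pays off:

```python
import math

def pages_touched_binary_heap(leaf, entries_per_page):
    # Distinct pages hit on the leaf-to-root path of a 1-indexed
    # implicit binary heap stored as a flat array: the parent of
    # node i is i // 2, and node i lives on page i // entries_per_page.
    pages = set()
    node = leaf
    while node >= 1:
        pages.add(node // entries_per_page)
        node //= 2
    return len(pages)

# ~1M entries, 512 entries per 4 KB page (8-byte items): a worst-case
# path in the flat layout touches 13 distinct pages, while clustering
# ~9 levels of the 21-level tree per page would need only
# ceil(21 / 9) = 3 page accesses.
print(pages_touched_binary_heap(2**20, 512))  # 13
print(math.ceil(21 / 9))                      # 3
```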

I think what's even more interesting are cases where algorithms with worse asymptotic complexity perform better than algorithms with a better asymptotic complexity. You see this all the time in the database world, since you are often better off optimizing for disk operations than for pure asymptotic complexity.

The main example that comes to mind is that if you are performing a GROUP BY on a dataset that won't fit in memory, it's usually faster to calculate the GROUP BY with mergesort than it is with a hash-table. Hash tables are notoriously bad for disk performance. Every lookup or insert becomes one disk read or write. This is in contrast to mergesort where you can take two sorted files, and in one pass merge the two files together. This leads to a much smaller total number of disk reads and writes.
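A sketch of the sort-based side of that comparison (assuming the runs were already sorted, e.g. by an external mergesort; on disk each run would be read strictly sequentially):

```python
import heapq

def group_count_sorted_runs(*runs):
    # One-pass GROUP BY COUNT over pre-sorted runs. heapq.merge
    # consumes each run in order, so this is all streaming access;
    # no random lookups the way a hash table would need.
    result = []
    for key in heapq.merge(*runs):
        if result and result[-1][0] == key:
            result[-1][1] += 1
        else:
            result.append([key, 1])
    return result
```

For example, group_count_sorted_runs([1, 1, 3], [1, 2, 3]) yields [[1, 3], [2, 1], [3, 2]].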

Two other cases like this that come to mind are Fibonacci heaps vs binary heaps[0] and matrix multiplication algorithms[1].

[0] https://en.wikipedia.org/wiki/Fibonacci_heap#Practical_consi...

[1] https://en.wikipedia.org/wiki/Coppersmith%E2%80%93Winograd_a...

The log(n) that dominates in search algorithms relates to the height of the tree. The larger block size doesn’t just improve the locality of reference for finding a node, it also substantially reduces the height of the tree.
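The height reduction is easy to quantify with a back-of-envelope sketch (the fan-outs here are assumptions, with 512 standing in for one 4 KiB page of 8-byte entries):

```python
import math

# Height of an n-key tree with fan-out b is roughly ceil(log_b n).
# Raising the fan-out from 2 (binary) to one node per page collapses
# the height, and with it the number of page touches per lookup.
n = 1_000_000
heights = {b: math.ceil(math.log(n, b)) for b in (2, 16, 512)}
print(heights)
```

Going from about 20 levels at fan-out 2 to 3 levels at fan-out 512 is the height reduction the comment refers to.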

The problem is that the memory model most used in algorithms-performance research is not true in the real world: O(1) memory access. In the real world memory access is not an O(1) operation; it takes a different amount of time to access memory in the different levels of the cache hierarchy.


Memory access time is bounded by a constant: the access time of the last level in the hierarchy (which would be the SSD or HD in this case). So access time is indeed O(1).

What's not true in the real world is that the actual value of the constant can be ignored. Memory or disk access on the one hand, and simple arithmetic operations on the other, are both O(1) but the difference is significant.

No, it isn't bounded by the access time of the last level in the hierarchy. You could miss in the TLB before you even get to the storage hierarchy. Your TLB miss may even require a page table walk with more misses. You could even get rescheduled because your quantum was exceeded.

You probably think this never happened but it used to always happen. The Linux scheduler used to traverse a linked list of pages (because Linus didn't know any better). Once the number of processes being scheduled exceeded something like 64, it slowed to a crawl consuming most of the quanta just rescheduling.

Yeah, there are finer points to consider to get an actual bound, and there might be quirks and infelicities on some platforms. But this is not about competing processes on the system. For the purposes of asymptotic analysis the access time is still O(1), i.e. bounded by a constant. If it weren't, what would it be?

O(constant) = O(constant - 1) = O(1)

They are all the same.

Isn't this just a version of a cache-oblivious algorithm, using van Emde Boas ordering to make maximal use of the bytes per pull into the cache? Or is there more to it that I'm not seeing?

Good link to a similar discussion: An Introduction to Cache-Oblivious Data Structures - https://news.ycombinator.com/item?id=16317613

This reminds me of a mongo optimization problem I faced years ago. Given the query {x:A, y:B}, where x is a scalar and y is an array, it was 40x faster to index [x, y] than [y, x].

Both indexes will show a query plan that's index-only and will hit the same number of records. Locality of memory in that index (since y had repeats) was hugely important.
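A toy model of that locality effect, not MongoDB itself, just sorted composite keys (which is roughly what an index is):

```python
from itertools import product

# With key order (x, y), all index entries for one scalar x value are
# contiguous; with (y, x), they are scattered across the index.
docs = list(product(range(3), range(4)))  # (x, y) pairs; y has repeats

idx_xy = sorted((x, y) for x, y in docs)
idx_yx = sorted((y, x) for x, y in docs)

# positions of the x == 1 entries in each index ordering
pos_xy = [i for i, (x, _) in enumerate(idx_xy) if x == 1]
pos_yx = [i for i, (_, x) in enumerate(idx_yx) if x == 1]
print(pos_xy)  # one contiguous run
print(pos_yx)  # spread across the index
```

A scan over a contiguous run stays within few pages; a scattered one touches many, which is the same page-locality story as the article's B-heap.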

I have a less negative reaction than most commenters seem to. I am not fazed by the snarky know-it-all attitude of the author, and it actually makes his point more concise and clear IMO.

I also can't help but appreciate the anecdote of abstraction hiding exactly what needs to be shown for performance computing considerations.

The problems start in the first sentence. "Optimal" doesn't mean what Poul thinks, so he's attacking a straw man. The data structure he poo-poos is CS 101 material. Anyone doing serious development knows better and uses structures appropriate for the underlying hardware (which, e.g., on an FPGA might be vastly different from an embedded processor, and typically different from a workstation).

Databases have used B-trees and variants for at least 50 years, not binary trees, so obviously we know this.

Much ado about absolutely nothing.

EDIT: typos

for a long time i used a variation of this as an interview question for performance analysis positions at google. instead of focusing on I/O costs i was focusing on cache miss costs, but it's pretty much the same observation at a different scale. it's always seemed like an excellent educational example of the difference between theoretical and practical concepts in computer engineering, such as big-O and turing machine vs. practical problem sizes and hierarchical storage/cache capacities/costs.

Can I get some clarity on pink bits?

Haven't heard the term and googling around, I am certain the author isn't talking about what I am finding at the top.

> Today Varnish is used by all sorts, from FaceBook, Wikia and SlashDot to obscure sites you have surely never heard about, many of which serve mostly pink bits, lots of pink bits.

Adult content.

It's hard to overstate the importance of adult content to progress in networking, specifically.

Adult content & companies have always been quick to utilize new technologies: JPEGs, the first online payments, massive video services, etc. Its presence is even felt in places you might not expect, like reddit and imgur. I remember reading that at one point 50% of reddit's traffic was adult in nature.

It took Netflix until 2015 to beat out porn & piracy as the biggest bandwidth consumer on the public internet[1].

1 - https://learnbonds.com/news/netflix-inc-nflx-now-uses-more-b...

That's exactly what I was finding... totally surprised.

Learned something new.

... on the adult sites you've found? )

Websites with lots of pink bits are exactly what you think they are. Images are served through CDNs and thus caches like Varnish.

Hm... this is interesting, but I’m wondering whether it makes more sense to compare b-heap to other block-oriented data structures? I’m thinking of b-tree and it’s variants.

Not a benchmark, but RRB-trees are pretty neat. They give effectively O(1) for all operations.

In modern languages, types like Scala's Vector are RRB-trees under the hood: https://docs.scala-lang.org/overviews/collections/performanc...

For more fun: https://youtu.be/sPhpelUfu8Q

Unfortunately, Scala's Vector type actually sucks big time with regard to performance:


Moreover if you want random access just use Arrays (especially now that 2.13 has immutable ones). The only use case for Vector is when you don’t know the collection size up front and need random access. Though even then it honestly might just be a better idea to build up a list and then convert it to an array.

I think binary heap -> B-heap is analogous to balanced binary tree -> B-tree. It applies a similar idea, but provides O(1) lookup for the minimum element, O(1) average insert, and O(n) search [1], instead of the B-tree's O(log n) min-lookup, O(log n) insert, and O(log n) search [2]. So a B-tree would not perform well in applications where a heap is desirable.

[1] https://en.wikipedia.org/wiki/Binary_heap [2] https://en.wikipedia.org/wiki/B-tree
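Python's heapq shows that trade-off concretely: the minimum is free, but arbitrary lookup degenerates to a scan. A minimal sketch:

```python
import heapq

# A heap is not a search tree: it keeps only the heap invariant
# (parent <= children), so the minimum is cheap but arbitrary
# membership tests must scan the backing array.
h = []
for x in [9, 4, 7, 1, 8]:
    heapq.heappush(h, x)  # O(log n) insert

print(h[0])     # the minimum is always at index 0: O(1)
print(7 in h)   # arbitrary lookup is a linear scan: O(n)
```

A B-tree would make that membership test O(log n) but give up the O(1) min access, which is why the two structures serve different workloads.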

I mean in the context of block-oriented access, like VM as discussed in the article. That a block-aware data structure performs much better than a block-unaware one isn’t that surprising or interesting, at least to me.

The question is: In the context of a block-oriented store, like a VM system, when is b-heap better or worse than b-tree and variants?

Note, I said “and variants” deliberately because of course plain b-tree is pretty simple. Just comparing O between b-heap and b-tree isn’t enough because B-tree variants can have the same O characteristics, I think.

I mean, isn’t array vs linked list similar? In theory linked list has a lot of good applications, but in real life it doesn’t.

Indeed, this is why for C++ people are generally told to use std::vector (i.e. arrays) when some other data structure may in theory be more appropriate.
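A rough stdlib analogue of the same advice (collections.deque is block-linked internally, so it is only a stand-in for a true linked list):

```python
from collections import deque

# A Python list is a contiguous array, so indexing is one address
# computation. deque must walk node-by-node from the nearer end to
# reach the middle, the classic linked-list access pattern.
arr = list(range(100_000))
linked = deque(arr)

print(arr[50_000])     # O(1): direct address computation
print(linked[50_000])  # O(n): traversal to the middle
```

Both return the same value; the difference is the memory traffic it takes to get there, which is the whole point of the array-first advice.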

Out of pure curiosity: "... That is where it all went pear-shaped: that model is totally bogus today. ..." I agree that most concepts about operating systems are dated, but where are the good resources that give an overview without being overwhelming?

Hey, I object to the ZX81, C64, and TRS-80 being called toys!


So, is it then impossible to criticize the collective political decisions of the Jewish State of Israel without being anti-Semitic?

No, of course not. Criticism of the state of Israel is entirely separate from "detesting most jews."

More than a third of all Jews live in Israel, and about a third voted for the winning party. So if one detests Israel's policy, one detests the free democratic choice of a big share of the Jewish population.

It isn't 'most Jews'; for that conclusion one needs to look at the programs of the losing parties and at polls of the Jewish diaspora in other countries about their support for Israel's policy.

You're really reaching here.

If you don't like Israeli's government policies (and I don't either!), then yes, you/we disagree politically with some population that voted in that government. (Nevermind that governments never act as perfect proxies for their electoral base, so this itself is an oversimplification.) That is wholly different from detesting people.

If you can't understand that distinction, we can stop here.

Maybe personally I support Israel's government? They do what they have to do for their country to survive. Or not. That doesn't matter.

My point is that the government policy reflects the democratic choice of the people. Saying that one cannot detest the people for their collective choice needs at the very least explaining.

A bit of a specialized use case.

I wonder whether he realizes that CS uses this thing called O-notation; it's really useful. So by the very definition of complexity theory, a 10x constant factor counts for nothing. Nobody cares. Your implementation might be faster if it is optimized for a certain architecture, CPU instructions, CPU cache and so forth. It literally doesn't mean anything; it has certainly nothing to do with being more "efficient" or with Knuth not being able to make it more efficient. Optimality is discussed in O-notation, not some contrived laboratory constant factors in front of it.

People writing the AWS checks care a whole awful lot about that insignificant constant.
