Also, here's Drepper's paper compiled into a single PDF, which I think was edited a little later than the LWN publication and might have had some minor errata fixes: http://www.akkadia.org/drepper/cpumemory.pdf
Every programmer should have a basic understanding of latencies.
This is slightly out of date but still gives an idea of relative latencies:
How many programmers in 2012 are ever even going to need to cache-optimize their code? 25%? 10%? The content is fascinating, but the title is pretty wrong.
Let's say you have a giant in-memory balanced binary tree (for your data store, maybe) with mean depth of ... 27, say. And you're deployed on a machine with 4 DRAM channels populated with (for the sake of argument) 7-7-7-23 DDR3 memory. Give me a ballpark estimate for how many reads per second you can achieve from that tree.
My experience is that virtually no one can answer this question well, and most programmers' intuition is wildly off (they think it's much faster than it is). And even the ones who get it mostly right usually just use "100 cycles" as their assumption of DRAM latency and get tripped up on the subtleties (like the interaction between the number of channels and precharge overlap, or the already-active row latency for sequential access, which this question doesn't cover).
Reading this paper will teach you how to answer that question from first principles. Surely that's worth a few hours of your time. At the very least it's worth more than arguing on the internet about how you don't need to know it.
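For anyone who wants to check their intuition, here's a hedged back-of-envelope version (all numbers are assumptions, not measurements): 7-7-7-23 at DDR3-1333 means a ~666 MHz command clock, so tRCD + CL alone is ~21 ns, and a random-row access lands somewhere around 50 ns once precharge and controller overhead are included.

    #include <stdio.h>

    int main(void) {
        /* Assumed full round trip for a random (row-miss) DRAM read. */
        double ns_per_random_read = 50.0;
        int depth = 27;  /* each tree level is a dependent pointer chase */
        double ns_per_lookup = ns_per_random_read * depth;  /* ~1350 ns */
        printf("lookups/sec, single stream: %.0f\n", 1e9 / ns_per_lookup);
        /* ~740K/sec -- the 4 channels only help if you can keep several
           independent lookups in flight at once. */
        return 0;
    }

That's orders of magnitude below what most people guess, which is the point of the question.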
(edit: A bonus followup question -- hopefully just as practical -- would be: "So maybe binary trees aren't great. Given a node data size of N, what is the most appropriate fanout to use in a B tree implementation?". This pulls knowledge of caching architectures into the mix.)
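(A hedged sketch of how that bonus question might be attacked, with pointer sizes invented for illustration: pick the fanout so a node fills a whole number of 64-byte cache lines, since each extra line is roughly one more DRAM trip.)

    #define CACHE_LINE 64  /* assumed line size */

    /* A node with F keys of size n plus F+1 eight-byte child pointers
       occupies about F*n + (F+1)*8 bytes; solve for F filling one line.
       Scale up to a few lines (or the DRAM burst length) as measurement
       dictates. */
    static int fanout_for_key_size(int n) {
        return (CACHE_LINE - 8) / (n + 8);
    }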
That would be my reaction to those interview questions.
As for the follow-up question, my response would be to ask why you are optimizing your B tree for a specific architecture instead of using a cache-oblivious B tree. (Yes, I can think of several reasons why not to. But if the interviewer is asking me BS questions, I want to know whether their knowledge is similarly good.)
Of course in practice I'm generally happy to use BerkeleyDB - the B tree has never been a performance bottleneck for me except when it involved disk latency. And at that point a more fundamental reorganization of algorithms becomes necessary.
Cache oblivious B-trees aren't exactly oblivious to how DRAM behaves, and sometimes their operation actually can be counter to what works best in DRAM (though I agree that generally they tend to do better than data structures that are "oblivious" in a very different sense).
But let's ignore that for a moment. Okay, you aren't using BerkeleyDB, MySQL, or any of the other common B-trees out there; instead you've got your hands on a cache-oblivious B-tree. What is the approximately optimal fanout, and what is the approximate number of reads per second you can achieve with the 7-7-7-23 DDR3 memory (or memory of your choice that is commonly found in modern systems)?
The question isn't really about knowing the specifics of a particular memory infrastructure. It's about understanding the cost of going to memory, which is increasingly as important as understanding disk latency (you know how "disk is the new tape"? well, RAM is the new "disk").
Interestingly, a lot of databases these days are CPU bound, particularly when working with big data, but even in other cases. They aren't really CPU bound; it just looks like that because of how the accounting is done. When you poke under the covers, you see the CPU pipelines are comparatively idle because the CPUs are all tied up accessing memory.
Particularly if you are using AWS or other virtualization services, most web applications these days rely on servicing a majority of their requests entirely from memory. Being able to understand the performance characteristics of memory at least within an order of magnitude becomes increasingly important, because the average rate that you can service from memory becomes a critical metric for the scaling of your services.
It's more than even that, though. As concurrency becomes increasingly important, locks and atomics become central to a lot of work. The last few iterations of Moore's law have changed the principles of how locks and CAS impact performance. It used to be that locks were bad because you often burned up CPU going down to the kernel, but more than that because suddenly you were wasting CPU while waiting for something else to complete. CAS let you largely sidestep all that CPU wasting, and you were a happy man. Then CAS got integrated into locks so that locks only really sucked when you had contention, but now... the really expensive cost of a lock these days is the memory fence put around it. CAS is better because the fence is scoped to only the object being swapped, but even that can be horribly expensive if there is contention on the object. Having a clear understanding of just how big that cost is, relative to burning some more CPU, wasting some RAM, or even changing the constraints of your code, can be very important.
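(For concreteness, the CAS pattern in question looks roughly like this in C11 atomics; a sketch, not a tuned implementation:)

    #include <stdatomic.h>

    /* Lock-free increment: retry until our compare-and-swap wins.
       The ordering cost is scoped to this one location, which is why
       it beats a lock -- until contention makes the cache line
       ping-pong between cores. */
    void increment(atomic_int *counter) {
        int old = atomic_load_explicit(counter, memory_order_relaxed);
        while (!atomic_compare_exchange_weak(counter, &old, old + 1)) {
            /* 'old' now holds the current value; just retry. */
        }
    }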
Trust me. This stuff really matters for any context where performance matters, and it is only going to get more important in the coming years. Memory has pretty much already trumped CPU in terms of its importance for most applications, and the number of cases for which it is more important than disk is growing at a rapid rate.
However I've spent most of my life working within scripting languages. And the same is probably true of most programmers in the startup world.
If you're able to get away with using a scripting language, then your problems had better not be performance critical, because your swiss army knife of choice really sucks as a saw...
Second, very few startups avoid making decisions about performance, even in the early days. What? You're using a NoSQL store? Is that because you just hate yourself, or because you thought it'd be more efficient or scale better than an RDBMS? Why aren't you just storing it all in memory? Wait, you can run out of that? Just let it swap then! Actually, why not just store all the data in a flat file? You don't even need to sort it; you can just scan through it whenever you need to, because performance doesn't matter. Why are you using an associative array there? Do you care about having efficient lookups based on a key or something?
You see my point.
I get that some people are playing around in a segment of startup land where programming isn't really that important, but presumably your long term plan isn't for every startup you participate in to never go anywhere. What if you are a success? Do you still want to be on the technical side when you've now got a team of developers and a serious customer base? If not, that's fine and likely a great career plan, but you aren't a programmer in the professional sense (which is what the paper is referring to). If you are planning on staying on that side, either learn this stuff or expect to be cast aside and not grow with the organization, because you aren't ready for the bigger world you'd now be playing in.
To be clear here, I'm not suggesting you'd have to pick up a systems programming language. Heck, when I started my career two decades ago, when CPUs were a lot slower and interpreters weren't as efficient as they are now, there was already this notion that "oh, I'm working in an interpreted language which provides a layer of abstraction, so I don't need to know how computers work". I made a good start to my career (and met a lot of people who had been doing so for quite some time) essentially kicking those guys out of their jobs, because it turns out that scripting with an understanding of the computing consequences of your choices is applicable in a very broad set of situations, but scripting without that understanding is barely more useful than knowing how to use Excel.
High level languages actually can kick quite a bit of butt in performance critical situations, because they let you focus on the bigger picture of how you are executing, where the bigger performance wins can be had (I've seen "scripting languages" winning performance bake-offs for exactly this reason), but you have to know what the heck you are doing. People often make the mistake of thinking this won't matter. They tend to have very short careers.
However in general within a scripting language you are not trying to optimize things at the CPU cache level. You're worried about overall efficiency of your algorithms, avoiding disk, identifying performance bottlenecks, etc. But the pervasive use of hashes everywhere and memory hungry data structures makes careful use of CPU caches pretty much a lost cause.
Furthermore the language does so much behind your back that an analysis of performance from first principles is very unlikely to be right. Instead you need to benchmark.
PS: Wait, so you're hand-rolling your own caching system? That's nice; you're fired.
You can't defend your point without understanding how the hardware works, you know. Either it's a "terrible data structure" for DRAM stores (and you know why), or "this is all a waste of time" and it's not a terrible data structure by definition. Which is it?
EX: Let's say you want to store 1/2 to 100 million GUIDs. They should have a random distribution, so you don't have to go crazy: just start with a lookup table for the first 16 bits (sized to fit into L2 cache) and a radix tree on the leaf nodes (http://en.wikipedia.org/wiki/Radix_tree), or an AVL tree or whatever you want; it's just not that important, because these small trees are unlikely to be cached. You may want to use a middle layer of lookup tables to cut down on DRAM round trips, but again, that's the type of micro-optimization that starts to get really specific to your usage patterns, the amount of memory on your system, etc.
But wait, what if you're using timestamps? The first N digits are going to be the same, right? Again, knowing the DRAM timings is pointless: verify it's in the range you want, mask some middle bits, use that as an index... etc. Or, if it's a pure lookup and it's OK to give up ordering, you can flip the bit order to get a fairly random distribution. (Again, a micro-optimization based on data usage patterns, but it may help out.)
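(A minimal sketch of the shape being described, with names and sizes invented for illustration:)

    #include <stdint.h>

    /* Top-level table indexed by the first 16 bits of the key; each
       bucket is a small tree (radix, AVL, whatever), so the hot table
       can stay in L2 and a lookup costs roughly one DRAM trip. */
    struct node;                    /* per-bucket tree, details elided */
    struct node *top[1 << 16];

    /* For timestamp-like keys whose high bits barely vary, reversing
       the bit order spreads entries evenly over 'top' (at the price
       of losing ordering). */
    static uint64_t reverse_bits64(uint64_t v) {
        uint64_t r = 0;
        for (int i = 0; i < 64; i++, v >>= 1)
            r = (r << 1) | (v & 1);
        return r;
    }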
And here is the thing: if you're storing the data on an HDD, the same idea is still useful; just use a larger index in RAM. The important things are relative latencies and storage space, not specific timings.
PS: Still, this assumes speed is really important. Generally speaking, there are far more important things to worry about, like keeping all your data in RAM in the first place; space efficiency can be more important than access patterns. And you might be better off creating a batch of queries and ordering them to be more efficient vs. changing how your data is stored.
The problem with this question (and your original example) is that it's entirely contrived. A better in-memory data structure for what? String indexes? Integer keys? Complex objects? Metric data? Region data? What is your query load? Range queries? Exact queries? Substring queries? K-edit queries? Nearest neighbors? Etc.
Before worrying about DRAM performance on the particular machine you may be running your code on for a year or two in production, you need to answer those questions. The performance of your system will depend far more on your understanding of the workload (read vs. write), object types, and query types than on the data structure's fit to the DRAM. That is the last optimisation you make, after you have done everything else, because it is machine specific.
(Edit: signing off this ridiculous flame war. We're now in full circular-logic mode: examples showing the utility of DRAM knowledge are contrived because DRAM knowledge is never useful. Sigh. I'll just point out that, because I know how this stuff works, you are wrong. You can do your jobs without being a good hacker, but you'll never be a great hacker if you don't know how your machinery works.)
Trying to predict performance from first principles is a mug's game in any case; there will be things you didn't expect. It's far better to measure.
I never said that one must be ignorant of factors that can affect performance when measuring. Rather, don't try and predict from first principles without measuring.
Good luck with that interview at Oracle/Google/Facebook/Twitter/Activision/Adobe/Microsoft/AnyoneWhoKnowsAnythingAboutSoftware....
The window within which caching effects occur varies from machine to machine - specifically, certain access patterns fall off a cliff once you go beyond the window - but with appropriate design, you can take reliable advantage of caching behaviour without needing to target a specific cache size.
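The textbook illustration is a cache-oblivious algorithm that recurses until the sub-problem fits in whatever cache is present, without any cache size appearing in the code. A sketch (assuming a square row-major matrix, transposed in place by calling transpose(a, n, 0, n, 0, n)):

    /* Recursively split the index range; once a block fits in some
       cache level, the whole sub-problem runs out of that level --
       whatever its size happens to be on this machine. */
    void transpose(double *a, int n, int r0, int r1, int c0, int c1) {
        if (r1 - r0 <= 8 && c1 - c0 <= 8) {        /* small base case */
            for (int i = r0; i < r1; i++)
                for (int j = c0; j < c1; j++)
                    if (j > i) {                   /* swap each pair once */
                        double t = a[i * n + j];
                        a[i * n + j] = a[j * n + i];
                        a[j * n + i] = t;
                    }
        } else if (r1 - r0 >= c1 - c0) {
            int m = (r0 + r1) / 2;
            transpose(a, n, r0, m, c0, c1);
            transpose(a, n, m, r1, c0, c1);
        } else {
            int m = (c0 + c1) / 2;
            transpose(a, n, r0, r1, c0, m);
            transpose(a, n, r0, r1, m, c1);
        }
    }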
I think I did - 8 years ago. The problem isn't that I'm lazy, it's that if I have no reason to use something for a long enough period of time, I tend to forget it. Maybe you remember everything you read forever?
There's already a plethora of information that is worth a few hours of my time - the problem is there aren't enough hours in the day. I don't want to spend them on things I'll not use and soon forget.
Really, read it. It's good for you. Like vegetables.
Surely that's worth a few hours of your time. At the very least it's worth more than arguing on the internet about how you don't need to know it.
Regarding your interview question, the kind of optimization you're getting into isn't something 99% of 'programmers' will ever touch.
Tell me any of the sites you visit in which this knowledge would be critical to deploying whatever service/product they're offering? I would imagine it would be pretty close to 0.
When you start hitting massive amounts of traffic, these kinds of optimizations can save you a TON of money in equipment and bandwidth.
Not to mention that premature optimization may be bad, but even when you need it, it's better to just write a few algorithms and test them against real data.
For example, if posed with the question presented here and you realize it will not be as fast as you might hope, maybe you can make the choice to build a data structure with much less depth (maybe you can fit it all in a hash table?) and achieve huge performance wins without once touching assembler. (With a 27-deep tree, you need to make 27 reads from memory to find your data; reducing the depth reduces the required reads.)
(I am not a computer scientist, and I am not an expert in algorithms, but hopefully I'm not too far off here.)
The others are correct.
The entire point of coding in a language like PHP is that you don't think about the low level stuff. And even if you did take the time to truly understand the source of PHP and made some fabulous optimization, they could change implementation details in a future release and break what you did.
Bottom line is that your interview question has almost no correlation with whether someone would be a good PHP developer. You might as well ask what the capital of Macedonia is as that is just as good of a screen for potential applicants.
Unfortunately, even if it's paying down technical debt, if you're trying to squeeze out more cycles with your low-level code, a lot of the time it's issues on the algorithm level. When refactoring there, these kinds of optimizations should always be at the back of your head, even if it's just to explain why moving the for loop up 5 lines gave the function a 2% performance boost.
There is most certainly a point where money spent on hardware gives a better return, but all the hardware in the world couldn't net you acceptable runtimes using bubblesort on colossal datasets.
Nobody is denying that this stuff is essential for some very specialized software teams. But the claim that it's something every software engineer needs to know is utterly ridiculous.
Isn't that going to depend on a heck of a lot more than just the RAM latency? Like, for example, how much locality the data has and how big the caches are?
I love this low-level stuff, but I think it's pretty hard to estimate performance of anything from first principles on today's highly complex architectures. There is just too much interplay of confounding factors.
What you're saying is a bit like the "but I'll never need math in life" argument. You're going to say that math is a lot more fundamental to existence than computer memory is, of course, and that's true; but if you're looking at just computers, the way the memory system works is a fundamental part of the whole, not entirely unlike what math is to understanding nature. If you really want to know what you're dealing with and make sure you're not limiting yourself, you want to be inquisitive and go there.
Basically, I don't really get why the gut reaction to these sort of things (especially if they actually have meat; there's of course a lot of crummy pretenders) is so often to declare them pointless. Yes, you need to filter your information intake, but still, better err on the side of knowing too much.
    int j = 0, k = 0, l = 0;

    /* Fused: one pass over i updates j, k and l together. */
    for (int i = 0; i < 10; i++) {
        j += i;
        k += i;
        l += i;
    }

    /* Fissioned: three separate passes, one variable each.
       Depending on the cache behaviour of what j, k and l really are
       (scalars here, but imagine large arrays), one form can be
       measurably faster than the other... */
    for (int i = 0; i < 10; i++)
        j += i;
    for (int i = 0; i < 10; i++)
        k += i;
    for (int i = 0; i < 10; i++)
        l += i;
...but you can only know this if you understand what's going on underneath the hood. So if you don't, no need to worry; you won't even know when the moment of your utter irrelevance arrives.
Without imagination, you will not be able to solve certain problems, will not know where to even look when some arise, and at worst will be totally unaware that better solutions exist.
That said, the OP's context is that of an "interview style" question, which raises mental red flags for me. It's a common problem for interviewers to base their assessment of a candidate on whether or not the interviewee has the same subset of knowledge that they do. In that case, I'd be quick to point out that:
* knowledge != intelligence
* knowledge != capability
* knowledge != skills
* correlation != causation
I've read most of this during my undergrad OS course, and found it absolutely fascinating.
Let's take database queries as an example: Knowing something about how the database is implemented will help you avoid writing queries that are pathologically slow, even if the queries you need to write aren't critical path enough to benefit from going to great length to tune them to the implementation.
Aren't you making the same mistake as the author and confusing "every programmer" with "every low-level programmer whose work is highly dependent on speed optimization"? Which, from my understanding, is a minority of us. I don't even know how the APIs I use are implemented, let alone how much "fetching from L2 cache" they're doing.
But of course, I know very little about hardware. So I could be very mistaken. I would love to be proven wrong.
Either way, since the disk is so slow, virtual memory and paging are handled by the OS, not the hardware; so there's a lot less that the programmer can specifically control once you get that far down the memory hierarchy, as opposed to (say) avoiding cache misses.
A programmer who is aware of the latency issues around disks can figure out how to move data out of memory into files, and how to change algorithms to arrange for sequential access to data. One who is not aware has no way to figure out how to stop the month-end data processing run taking five days. (That is a real world example.)
edit: and I just realized I'm in the subthread about knowing about latencies being important, not the subthread about knowing the entire contents of the linked paper being important, so yes, I think we're agreeing.
Your claim was that once you're hitting problems with disk latency, there isn't a lot that a programmer can control. My point was that there actually is a ton that the programmer can control at that point.
In fact a programmer working in high level scripting languages benefits more from knowing about disk issues than everything else for the simple reason that they really don't have any control over everything else. Of course you hope that the people implementing the language understand everything else...
If I take data that was in RAM, explicitly move it into a file and then make sure that I access it sequentially, I've modified my memory access patterns in a way that controls page faults. I'm using the same amount of actual space in memory and on disk, I've just labeled some of it "file" rather than "RAM".
In fact if anything I'd argue it the other way. If you're using a high level scripting language, what I just said about page faults still works. However the language gives you so little control over what happens in memory that you can't control the cache misses.
The only other reason you're going to be having page faults is because you're trying to work with more data than can fit in memory. In this case, you will inevitably page fault. The only way to avoid that is to use less memory (or install more memory). So much less that all your data can fit in main memory at a given time.
Let me give a concrete example.
You are page faulting because your data is in a large disk-backed hash. By the nature of a hash, you can access any part of it. Convert that hash to a B-tree, and similar requests are more likely to be close in memory. This can be a big win if there is a non-random access pattern (for instance you are more likely to look up recent records than old ones). Performance may magically get much better. You haven't changed how much memory you need, your data set, etc. But page faults are reduced.
More drastically, rather than going through one data set doing random lookups in another, you can sort both data sets and then process them in parallel. This is a much, much bigger change in the program. However it can result in many orders of magnitude better performance.
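(The second change is essentially a merge join; a sketch with integer keys assumed and duplicates ignored:)

    /* Both inputs sorted by key: one sequential pass over each array
       replaces a random lookup per element -- sequential disk/memory
       access instead of a seek per probe. */
    void merge_join(const int *a, int na, const int *b, int nb,
                    void (*emit)(int key)) {
        int i = 0, j = 0;
        while (i < na && j < nb) {
            if      (a[i] < b[j]) i++;
            else if (a[i] > b[j]) j++;
            else { emit(a[i]); i++; j++; }
        }
    }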
I am drawing these from a concrete example. I was looking at a month-end job that did summary end of month statistics for a busy website. Substituting a B-tree for a hash made it go from 5 days to something like 3 days. Not bad for under 5 minutes work. (I just changed the parameters that opened up the database file to use a different format.) Later I took the time to completely rewrite it, and that version took about an hour. Through this process there were no hardware improvements.
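(In Berkeley DB terms, that five-minute change is more or less picking DB_BTREE instead of DB_HASH at open time. A sketch with error handling elided and the file name made up:)

    #include <db.h>

    void open_summary_db(DB **dbp) {
        db_create(dbp, NULL, 0);
        /* DB_BTREE keeps nearby keys on nearby pages, so a skewed
           access pattern touches far fewer pages than DB_HASH's
           deliberately scattered layout. */
        (*dbp)->open(*dbp, NULL, "summary.db", NULL,
                     DB_BTREE, DB_CREATE, 0664);
    }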
So you see, understanding disk performance issues is not useless for programmers. You have options for page faults other than asking for better hardware.
However, provided that your hash is greater than the size of your memory, you are still inevitably going to page fault if you expect to access every element of that hash. In that case, "the only way" to avoid page faults is to either increase the size of your memory or to use less of it.
understanding disk performance issues is not useless for programmers
And I dunno, man, you have big log files. I don't think that processing many-gigabyte log files is what most would call a common case, but I could be totally wrong here.
The answer is easy: almost everything in modern computers is about locality, spatial and temporal. As soon as the complexity of the programs you write is one level above "Hello World" (or, in Rails-speak, a simple controller method), this will become important.
It's easy to translate the concept of aligning data in DRAM (because it's faster) into Rails-ish behavior. Assume that you read lots of data from your database and you process it item by item. Inside your loop you perform another fetch from the database, again and again. If you joined the data instead (think: aligning), you would need fewer requests to the database.
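The in-memory analogue of that N+1 pattern, sketched in C with invented names:

    struct customer { long balance; };
    struct order    { struct customer *cust; };

    /* "N+1": each item triggers its own dependent fetch -- one random
       memory (or database) trip per iteration. */
    long total_chased(struct order **orders, int n) {
        long sum = 0;
        for (int i = 0; i < n; i++)
            sum += orders[i]->cust->balance;
        return sum;
    }

    /* "Joined": the needed column was gathered once into a dense
       array, so the loop streams sequentially through cache lines. */
    long total_joined(const long *balances, int n) {
        long sum = 0;
        for (int i = 0; i < n; i++)
            sum += balances[i];
        return sum;
    }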
Any system architecture that involves handling data will at some point come to the situation where the programmer has to think about ordering instructions, database queries, or even attributes inside a C struct (see the discussion about why short Ruby strings are faster than long ones...). Even if you keep nothing else from this document in mind, if you remember that sequential access and exploiting locality will increase performance, then the document made its point.
Take the hours, read the document, and probably forget most of it, but take it as an inspiration rather than an optimization guide.
In Java, I have no idea whether or not the fields in a copybook^H^H^H^H^H^H^H^H bean are going to be anywhere near contiguous, or not. Packing and unpacking stuff in a byte array / string gets tedious.
It's a shame low level integration isn't somehow easier in Java, like it is in C#.
So, Ulrich Drepper thinks everyone should know about the intricacies of DRAM refreshing. Zed Shaw thinks we should all know statistics. Some other schmoe thinks we should know about SEO. Et cetera.
Posts like these should be titled "I worked hard and feel good about myself for knowing X, therefore everyone should know it."
It is blisteringly stupid. Every programmer needs to know WHAT THEY NEED TO KNOW.
Nobody can know everything. Why don't we all just learn what we need, build the things which interest us, and stop telling others what THEY need?
Having said that... interesting stuff. Just needs a title change.
This article, however, is the real deal. Drepper's site (http://www.akkadia.org/drepper/) won't win any design competitions but is a goldmine of systems programming wisdom.
The title is just the author's opinion. Deal with it. And then read the paper anyway, because even if you don't use it directly it will help make you a better programmer.
Understanding the DRAM cycle is critical to performance tuning in modern code, and my experience is that virtually no application developers know how it works.
To explain what I mean, suppose you have a big 2-dimensional array in memory. As it works now, if you read a cell from memory, the area in memory around it is brought into the cache. Because of the way that arrays are represented in memory, what you get in the cache is just cells from the same row (or column). This means that if you need to access neighboring cells above or below the current cell, you're not making any use of the cache.

The way I see it, it might be useful if the system could somehow instruct the memory to bring a different-shaped block into the cache. Maybe there would be a whole part of memory intended for 2-dimensional arrays, where blocks of memory would be allocated as squares on the plane (instead of segments on a line). Of course, something like this would require support from the programming language, which would need to put 2-dimensional arrays in that special part of memory (and allocation would be much harder, now that you need to allocate 2-dimensional chunks).
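For what it's worth, you can get most of that effect today without new hardware by changing the layout yourself: store the array in small square tiles, so a few cache lines hold a 2-D neighborhood. A sketch, with the tile size an arbitrary assumption:

    #include <stddef.h>

    #define TILE 8  /* 8x8 doubles = 512 bytes = a few cache lines */

    /* Map (row, col) into a tiled layout; assumes ncols % TILE == 0.
       Cells above/below within a tile now sit in the same few lines. */
    static size_t tiled_index(size_t ncols, size_t r, size_t c) {
        size_t tiles_per_row = ncols / TILE;
        size_t tile = (r / TILE) * tiles_per_row + (c / TILE);
        return tile * TILE * TILE + (r % TILE) * TILE + (c % TILE);
    }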
The reason caches come in "lines" that are bigger than the word size of the processor is not an optimization (i.e., they're not deliberately bringing in nearby memory in the hope that it will be useful). The reason is that the logic associated with maintaining the cache (index and tag bits) is expensive and needs to be repeated for each "thing" you have in the cache. So by making the "things" bigger (e.g., the 512-bit cache line on modern CPUs), you reduce the overhead of the bookkeeping.
I do not think that is true. There is a considerable chance that that extra memory will be useful. The simplest example to think of is cache lines that contain program code. The other canonical example is the "for item in array do item = f(item)" loop.
In some sense, large cache lines are just cheap ways to implement speculative reads.
See for example http://dl.acm.org/citation.cfm?id=255272 (paywalled, but the abstract is clear enough):
"A significant observation is that, although increasing line sizes can result in a higher hit ratio, it also considerably increases traffic to main memory, thereby degrading the performance."
There also are CPUs that attempt to detect access patterns without help from the programmer, for example Power: http://chipdesignmag.com/display.php?articleId=3926&issu....
I think Intel CPUs have similar functionality, but these were the ones I found first (using "cache stride hints" as Google query), perhaps because Intel uses different terminology.
I'll just leave this here, PDF version for anyone who doesn't want to mess with the webpage reformatting.
The other day there was some post on front end/client side programming with a crazy amount of things everyone needs to know. Now apparently we also have to know the characteristics of the charging/discharging of capacitors in memory.
All in all, this is WAY too in-depth to have the title it has. The only way you have to know all this stuff as a software engineer is if you are working on the lowest level of something like an OS kernel or a video game engine.