Computers are fast (jvns.ca)
315 points by bdcravens on May 13, 2014 | 154 comments



I particularly enjoyed the writing style in this article, largely because of how freely the author offers loose, unverified figures - CPU time distributions and so on. In my experience, people are usually extremely hesitant to publish quick, incomplete conclusions, even though those conclusions are, in my opinion, still extremely valuable. They may not be perfectly correct, but a small conclusion like that is often far better than the practically nonexistent data I start off with, and it lets me read the article much faster than if I had to slow down and make those fuzzy estimates myself. There is a misconception that when writing you can do only two things: state a fact or make a false statement. In reality it is a gray gradient, and when the reader starts off knowing nothing, that gray is many times superior. Anyway, awesome job - I really want to see more of this writing style in personal blogs.

[In case it isn't clear, I'm referring to statements like "So I think that means that it spends 32% of its time accessing RAM, and the other 68% of its time doing calculations.", and "So we’ve learned that cache misses can make your code 40 times slower." (comment made in the context of a single non-comprehensive datapoint)]


The style works because she's going on an adventure of discovery and taking us along for the ride.

Loose guesses that you hope are good enough are more fun than exact calculations. It's fun watching Indiana Jones guess how much sand he needs to equal the weight of the statue, and it makes him seem more heroic. In that case he might have been better off taking his time and doing it properly, though.

Our minds are programmed to learn in a particular way:

- Try stuff

- See what happens

- Try to guess at a rule that would explain why it happened

- Test your guess by trying other stuff

... repeat until you know everything

It's fun sharing that process with another person, and a pretty good way to learn too -- especially where precision is not important. Technical writers should use it more.


It doesn't come across as a benchmark of a particular system or a performance tuning guide. It is a few simple experiments to get ballpark figures for just how fast a modern computer is and to trigger a couple of effects (such as cache misses) that can really slow down the system.

I think the real value is to show how to start having a play with things yourself.


Agreed. At first, the writing style (too! many! exclamation! points! Are you Unidan or what?) made me think that the article wasn't serious. The fact that I immediately thought, "you're just going to be benchmarking your disk!" didn't help, either. But the author did take the disk performance into account, and the rest of the article was surprisingly interesting as well. It's a quick and dirty exposition for a quick and dirty suite of benchmarks.


Her style is definitely different, but after discovering her last year and reading a ton of posts, I find it energizing. I recognize that feeling of excitement she's capturing in her writing style! I should be experiencing it more than I do! She's an inspiring writer. The exact details aren't really the point--they can be gotten easily enough. The process and the discovery are what she's documenting so well.


I don't know, I love her enthusiasm.


Same here. I thought it helped to dispel the conception that only Very Serious People should be interested in this stuff.


In fairness, it only takes one datapoint to learn that cache misses can in fact make your code 40 times slower.


I posted the following as a comment to the blog, I'll duplicate it here in case someone wants to discuss:

This program is so easy on the CPU that it should be entirely limited by memory bandwidth, and the CPU should be pretty much idle. The theoretical upper limit ("speed of light") should be around 50 gigabytes per second for a modern CPU and memory.

In order to get closer to the SOL figure, try adding hints for prefetching the data closer to the CPU. Use mmap and give the operating system hints to load the data from disk to memory using madvise and/or posix_fadvise. This should probably be done once per big chunk (several megabytes), because the system calls are so expensive.

Then try to make sure that the data is as close to the CPU as possible, preferably in the first level of the cache hierarchy. This is done with prefetching instructions (the "streaming" part of SSE that everyone always forgets). For GCC/Clang, you could use __builtin_prefetch. This should be done several cache lines ahead, because the time to actually process the data should be next to nothing compared to fetching stuff from the caches.

Because this is limited by memory bandwidth, it should be possible to do some more computation for the same price. So while you're at it, you can compute the sum, the product, a CRC sum, and a hash value (perhaps with several hash functions) at the same cost (if you count only time and exclude the power consumption of the CPU).
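To make the madvise part concrete, here's a minimal sketch of what I mean (the file name, the 8 MB chunk size, and the missing error handling are all placeholders; untested):

  /* sketch: mmap the file, hint the kernel, sum the bytes */
  #include <fcntl.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <unistd.h>

  int main(void) {
      int fd = open("1_gb_file", O_RDONLY);
      struct stat st;
      fstat(fd, &st);
      size_t len = (size_t)st.st_size;

      uint8_t *data = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
      madvise(data, len, MADV_SEQUENTIAL);      /* we'll scan this linearly */

      const size_t chunk = 8 * 1024 * 1024;
      uint8_t sum = 0;
      for (size_t off = 0; off < len; off += chunk) {
          size_t next = off + chunk;
          if (next < len) {                     /* ask for the next chunk ahead of time */
              size_t ahead = (len - next < chunk) ? len - next : chunk;
              madvise(data + next, ahead, MADV_WILLNEED);
          }
          size_t n = (len - off < chunk) ? len - off : chunk;
          for (size_t i = 0; i < n; i++)
              sum += data[off + i];
      }
      printf("The answer is: %u\n", (unsigned)sum);
      munmap(data, len);
      close(fd);
      return 0;
  }

The prefetch-instruction side is separate: you'd add __builtin_prefetch calls inside the inner loop, a few cache lines ahead of where you're reading.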


Yup, __builtin_prefetch certainly, if one wants to get it faster. (Compilers can't do this themselves, because they don't know how much they should prefetch ahead, and if you get that wrong it's worse than no prefetching.)

I am less sure about madvise. It seems to me default heuristics should work fine for this case, and as you said, system calls are expensive.


> I am less sure about madvise. It seems to me default heuristics should work fine for this case, and as you said, system calls are expensive.

System calls are expensive but so are page faults or having to access the disk. If you can avoid page faults by using madvise to prefetch from disk to memory, it should be worth it. In particular, the first run with cold caches should be faster.

However, the operating system may be smart and realize that we're doing sequential access, and may speculatively read ahead, in which case the madvise calls would be wasted time.

The same happens with CPU caches too: the CPU's internal prefetcher is pretty good at recognizing sequential access and grabbing the next cache line in advance. A few naively placed __builtin_prefetches don't seem to help here (I just tried this out).

Prefetching hints work a lot better in non-sequential access patterns (linked lists, etc).
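A contrived sketch of the non-sequential case I mean (the node type is made up purely for illustration):

  #include <stddef.h>

  /* hypothetical node type, purely for illustration */
  struct node {
      struct node *next;
      long value;
  };

  long sum_list(const struct node *head) {
      long total = 0;
      for (const struct node *n = head; n != NULL; n = n->next) {
          if (n->next)
              /* start pulling the next node's cache line in while we use this one;
                 real code would want more lookahead distance than a single node */
              __builtin_prefetch(n->next);
          total += n->value;
      }
      return total;
  }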


Also, TLB misses. Also the reason why mmap(2) doesn't always beat read(2).


You know you're using a machine properly when you don't just blow the TLB. You blow the TLB for the TLB. I was pretty skeptical when my coworkers insisted my code did this, until I collected some great Intel performance counter data and, indeed, I blew the 2nd level TLB. Good read: http://www.realworldtech.com/haswell-cpu


Pretty naïve; I'm surprised to see it here. Not that this is a pointless study, but it's pretty easy to guess these numbers if you know roughly how long it takes to use a register, L1, L2, RAM, or the hard drive (and you should). And exactly how long it would take is a task-specific question, because it depends more on which optimization techniques can be used for the task and which cannot. So unless you are interested specifically in summation mod 256, this information isn't of much use, as "processing" is much broader than "adding modulo 256".

But it's nice that somewhere, somebody else has understood that computers are fast. Seriously, no irony here, because it's about time people realized what a disastrous state modern computing is in. Your home PC processes gigabytes of data in a matter of seconds; the amount of computation it is capable of (relative to its cost) would have driven a scientist crazy 60 years ago, and it gets wasted. It's the year 2014 and you have to wait for your computer. It's so much faster than you, but you are waiting for it! What an irony! You don't even want to add up a gigabyte of numbers, you just want to close a tab in your browser or whatever, and of all the processes running in the background only a few actually have to be running right now to do something useful - unfortunately the OS doesn't know that. Unneeded data sits cached in RAM while you wait for the OS to fetch a memory page from the HDD. But, well, after 20 layers of abstraction it's pretty hard to do only useful computations, so you make your user wait for some computationally simple stuff to finish.

About every time I write code I feel guilty.


I agree completely. Hardware has gotten orders of magnitude faster, yet the experience, as a user, hasn't changed nearly that much. You'll probably find this comparison interesting: http://hallicino.hubpages.com/hub/_86_Mac_Plus_Vs_07_AMD_Dua...

I think the saying that programmer time is more expensive than machine time is, like the quote about premature optimisation, responsible for promoting an inherently wasteful culture in programming when people are taught to take them at face value; looking at it another way, power isn't free, and when machine time translates to user time, then the situation definitely changes.

I always keep in mind that users are using the software I create to do work, and their time is just as valuable if not more so than my own, especially if there are more of them. To me, it's a question of balancing the tradeoffs --- it's not worth spending a day to optimise software that will save a few minutes across all its users over its lifetime, but it is very well worth it to spend a week to optimise something so that it may save a year or more of its users' time.

Whenever a programmer complains about certain tools (e.g. compiler, IDE, etc.) being slow, I keep that in mind to mention the next time he/she flippantly dismisses a time-saving optimisation using the "my time is more expensive" excuse. Programmers are users too. :-)


"Hardware has gotten orders of magnitude faster, yet the experience, as a user, hasn't changed nearly that much."

I see what you mean, but I feel that this misses out an important fact - a lot of those speed increases are only available if you modify your code to use an optimized path. Memory I/O is a classic example - it has sped up a lot but only if you make sure that you aren't generating cache misses on every memory access. Branch-prediction-friendly code is another good example that can make a huge difference to the performance of a system.

That means that a lot of the performance gains that we have achieved (not all, certainly, but a substantial chunk) are not necessarily made available to naively-written code. It is not terribly surprising then that we sometimes struggle to achieve the speed-ups the hardware has made available to us.

This problem is only compounded by the fact that performance is one of the leakiest things going when we are talking about abstraction layers. It makes things extremely difficult to think about, because you need to know the intimate details of your entire software stack, which these days is huge, if you wish to optimize your code effectively.
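As a toy illustration of the kind of thing I mean (not from the article, just the classic example): the same additions done in two different traversal orders can have wildly different speeds.

  #include <stddef.h>

  #define N 4096

  /* Both functions do the same additions. The first walks memory sequentially
     (cache- and prefetcher-friendly); the second strides a whole row per access
     and spends most of its time waiting on cache misses. */
  long sum_row_major(const int (*m)[N]) {
      long s = 0;
      for (size_t i = 0; i < N; i++)
          for (size_t j = 0; j < N; j++)
              s += m[i][j];
      return s;
  }

  long sum_col_major(const int (*m)[N]) {
      long s = 0;
      for (size_t j = 0; j < N; j++)
          for (size_t i = 0; i < N; i++)
              s += m[i][j];
      return s;
  }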


Found this gem in the linked article: Windows Vista demands enough real estate on your hard drive that you could easily fit 30,000 full-length novels into it.

A nitpick: a day of work, say six hours, is only 3600 minutes. If you can optimize some task to save a few minutes per user, it only takes a few thousand users to be worth it.

However, I think the worst optimizations to leave on the table are in an area many developer-types won't think about: usability. We can save much more of a user's time by creating simple, intuitive interfaces than we can checking cache misses (in most modern development).


> We can save much more of a user's time by creating simple, intuitive interfaces than we can checking cache misses (in most modern development).

And at least for regularly-used software, you can save even more user time by not optimising for the first five minutes of usage but for the long-term usage. Keyboard shortcuts are a good example for this: They are mostly not simple and certainly not intuitive per se (why does Alt+Tab switch windows in particular?), yet once you know them, they do save a lot of time.


six hours, is only 3600 minutes

You mean 360.


> Windows Vista demands enough real estate on your hard drive that you could easily fit 30,000 full-length novels into it.

That's an apples to oranges comparison if I ever saw one. It says nothing about whether Vista is bloated for the features it provides, by itself.


In terms of balancing tradeoffs, I continue to find this cheat-sheet useful whenever the topic comes up.

http://xkcd.com/1205/


Exactly how I feel. We are piling abstraction on top of abstraction, backwards compatibility, hardware compatibility, software compatibility, unrelated UI/UX features, animations...

I believe modern computing follows the 80/20 rule in a bad way: we use 80 percent of the computing power to do the unrelated 20 percent of the job.

We need to rediscover the Unix philosophy for modern-day computing: software should do a single task, but do it well.

Also have a look at Zawinski's Law. http://en.wikipedia.org/wiki/Jamie_Zawinski#Zawinski.27s_law...


I wonder how much of the Unix philosophy is accidental success. 'Do one thing well' is a (relatively) easy way to make something that actually works correctly, so I guess I'm wondering if doing one thing is actually important, or if actually doing something in a consistent, reliable and predictable way is the more important aspect.

(Small programs that enable composition are great, but so is, for example, a program that can uncompress any format in place with only an option or two)


The Unix Philosophy is an attitude, not a philosophy. Unix doesn't have a philosophy: it has an attitude. An attitude that says a simple, half-done job is more virtuous than a complex, well-executed one. An attitude that asserts the programmer's time is more important than the user's time, even if there are thousands of users for every programmer. It's an attitude that praises the lowest common denominator.

• Small is beautiful.

• 10 percent of the work solves 90 percent of the problems.

• When faced with a choice, do whatever is simpler.

Which sounds good until you realize that these comments mean the exact same thing.

• A small program is more desirable than a program that is functional or correct.

• A shoddy job is perfectly acceptable.

• When faced with a choice, cop out.


And when you need to glue lots of things together that do just a single task, you end up spending much more (development and run) time at the joints, as in, constantly converting and parsing data again.


Unix pipes are very effective I believe.


Serializing and deserializing heavily structured data in order to fit it over a unix pipe takes a lot of effort - I'm often doing things where reading and writing the data takes up more code and processing time than the actual processing, which is simple once the data is in a sane data structure rather than a stream.

If your data structure is "a line of text, repeated x times" or "a blob representing an image", then unix pipes are very effective. For most everything else, they're not.


right, because you don't have to parse data coming over a pipe. :rolleyes:


> It's year 2014 and you have to wait for your computer. It's so much faster than you, but you are waiting for it!

The same is also evident on the web. In so many cases you need to wait for content, maybe a couple of KB of text, to load, even with tens of Mb/s at the last mile and tens of Gb/s at the datacenter.

http://en.wikipedia.org/wiki/Wirth%27s_law


This is just due to the ridiculous way the development ecosystem for the web has grown. Nobody seems to care about fast languages anymore.

Check out this website. http://forum.dlang.org/

Notice how amazingly fast it is. It's written in D.

Oh, but I forgot, the bottleneck is in the database so feel free to waste millions of cycles every request chasing pointers around and around the vast quantities of memory that your dynamic language consumes.

People don't even know what speed is anymore.


Just because it is written in D doesn't automatically make it fast. If you were to run that program on an underpowered host, with a database system that has to constantly swap and not enough CPU time available, it would be slow too.

And that is the bigger problem, people are running their websites on underpowered hardware in the first place. The limit should be the network connection, not the page rendering time... but unfortunately that isn't always the case.


Perhaps it's because some people are more productive with dynamic languages, and they value productivity over speed.


That's exactly it (and it's a message we've been hearing over and over for the past decade at least -- "worry about being productive, if it's slow, Moore's law will fix it for you in a few years").

The thing is, in some cases, all the productivity ends up costing a lot more. For small shops it will not be evident, but for bigger ones trying to scale, it'll be a major area of improvement (to use fewer cycles).

And that's OK... comes down to using the right tool for the right job, matching the skills of your developers with the challenge, how fast you want things delivered, etc, etc, etc.

I think the major issue people see is that when all of those slow programs are looked at from a 10,000-foot view as a single thing, it looks like a damn waste of resources. Thus comments like "computing (as a whole) in the last X years has been terrible", etc. We can delve into programmer culture, lack of skills, people not caring about wasting resources, etc, etc, etc.

I think the solution is better tooling. Let the programmer be productive, deliver things fast, etc, but build something more intelligent that will strip it down to only the necessary bits and deploy.


I find this hard to believe. Any time people write systems in a dynamic language, they inevitably run into issues as they scale, ending up wasting money (computers, electricity) and a lot of user time. This usually results in rewriting things over and over.

Performance is always important. If you can solve a problem in 10% more time using a performant language, then it makes sense to do so, and I've seen few problems that I can express in Python or Ruby (painfully slow) that I cannot express in C++ (highly performant) or Scala (moderately performant) in roughly that amount of extra time.


Time is not perfectly fungible. Saving time early on (prototyping and delivering products quickly) may result in spending a lot of time later on.


Do you have any actual data to back your wild assertions? Entire domains of computer science are dedicated to productivity languages and many are widely used with great success and little or no overhead relative to C. (edit: for example scipy)


I was stating an opinion based on my experience. I don't deny that tools like SciPy are very useful for experimental programming. However SciPy and Matlab are exceptions, where the end-user of the program is also the programmer, so the developer experience IS the user experience.

In most cases, when you are a professional developer, you are not the end-user of your product. In these cases, deliberately choosing a less performant language for the sake of developer productivity can mean you end up penalizing the end user.


Good point, and I think that's why languages like Java and C# are very popular these days - they hit a good balance between performance and developer productivity.


The danger with the "speed doesn't matter" attitude is that you end up with a 10x slowdown due to language choice, and then another 10x slowdown due to inefficient implementation, and you end up with a 100x slowdown, which might actually be significant even in situations where speed doesn't matter that much.


The sentiment you just wrote is one of the chief inspirations for the TechEmpower Framework Benchmarks project that I'm part of. Today, we routinely see huge amounts of computational horsepower evaporated in the name of developer efficiency. Yet the modern web development ecosystem is so plentiful and diverse that there are myriad high-productivity platforms and frameworks that exploit computational horsepower rather than squander it.

My opinion is that investing the modest effort to select a high-performance platform and framework gives you the freedom to write inefficient application code first and to optimize later. It is the process that allows you to avoid the most expensive kind of premature optimization I've seen by postponing optimization until well after an application's concept has been vetted by its first several thousand users.


To me, one of the really mind-boggling things in terms of waste is how we today basically create piles of complete systems just to run a single application. The system used to run a single application is filled with other applications, libraries and functions that are never used.

And the application being run is probably using shared libraries and other complex mechanisms designed for shared environments when no sharing will in fact happen.

Yes, these servers are virtual and just part of a file. But the overhead in space and energy is still mind-boggling.


Do you know what would make people instantly realise how fast computers are?

Reaction: a program that reads a camera and reacts to some user action by instantly moving a robot arm. Add to that an ultra-slow-motion camera so we can see afterwards what happened.


There's a passage in Steven Levy's book "Hackers" where he talks about a robot built by some students at MIT that could catch a ping pong ball thrown to it by a human. I don't recall what year this project was done, but I do recall what computer they used: a PDP-6.

So a computer built 50 years ago is already fast enough to pass your instant-reaction test. (Well, admittedly, maybe it didn't respond instantly, but if it could catch a thrown ping pong ball, it was probably pretty quick).

This comment isn't meant to scoff at your suggestion -- far from it -- I think an instant-reaction-bot will still impress most people (and rightfully so). I just wonder if we, as programmers and scientists and engineers, haven't been taking as much advantage of Moore's law as we ought to.


This demo of Sketchpad is pretty impressive, too. http://www.youtube.com/watch?v=USyoT_Ha_bA You look at what they were able to accomplish, and it just seems like we lost a lot of steam somewhere.


yayitswei posted a link to a robot that produces the effect I meant. A modern version of the catcher that could outperform a human would be nice too. Or the knife trick by the robot in Aliens.

Also, I saw some news about a Spanish university developing a robot arm to catch space debris.

The thing is we still don't realize how fast they really are. We say all the time that computers are dumb. The perception is somewhat distorted for most people.

Moore's law has been almost completely eaten by screen growth. That's the easiest, least imaginative way to spend it.


Reminds me of the robot that wins rock-paper-scissors every time: http://www.theguardian.com/technology/video/2013/nov/06/japa...


The "computers are fast" point really strikes a chord with me. Because so much of what programmers do seems to be about using abstractions and infrastructure to achieve things that are too slow when written naively in actual code. Just try plotting some 3D data using Python. Should I use matplotlib, or a game engine, or maybe OpenGL? Each option could be too slow, even though the machine can easily process the raw data almost instantly.


I wrote a new version of bytesum_mmap.c [...] and it took about 20 seconds. So we’ve learned that cache misses can make your code 40 times slower

What's being benchmarked here is not (the CPU's) cache misses, but a lot of other things, including the kernel's filesystem cache code, the page fault handler, and the prefetcher (both software and hardware). The prefetcher is what's making this so much faster than it would otherwise be if each one of those accesses were a full cache miss. If cache misses were only 40 times slower, performance profiles would be very different than they are today!

Here are some interesting numbers on cache latencies in (not so) recent Intel CPUs:

https://software.intel.com/en-us/forums/topic/287236

I’m also kind of amazed by how fast C is.

For me, one of the points that this article seems to imply is that modern hardware can be extremely fast, but in our efforts to save "programmer time", we've sacrificed an order of magnitude or more of that.


> For me, one of the points that this article seems to imply is that modern hardware can be extremely fast, but in our efforts to save "programmer time", we've sacrificed an order of magnitude or more of that.

Sounds like a good deal!


To whom, and in what situations? To businesses (hardware and software), sure: they get products out faster and are motivated to push new stuff out. To end users, not so much, because they wait more and their batteries drain faster. They also pay more for hardware to have a "good experience".


They also have an order of magnitude more software to choose from, which tends to drive the cost of software down through competition (though I'll grant that most free models we're stuck with today suck, and I wish paying at least $10 for good software was still a common thing.) More importantly, a lower barrier of entry to developing allows a much longer tail of niche applications to be produced, and perhaps even more importantly, experts and hobbyists from other domains can go and create software using all the knowledge from their "main" field.


Also, whose time is more valuable? The developer team's, or the combination of all their users?


> For me, one of the points that this article seems to imply is that modern hardware can be extremely fast, but in our efforts to save "programmer time", we've sacrificed an order of magnitude or more of that.

There are still plenty of reasons to write fast code. If the code ends up being run repeatedly and is known to be the bottleneck, optimizing a piece of code may save on hardware costs. And power and cooling costs.

Or if we're talking about a mobile device, doing something quickly and then powering off the CPU will save a lot of those precious milliwatts. (I do this in my day job)

In this case, there was a 10x improvement in the time it takes, allowing the CPU to be powered off for 2 seconds. That translates to longer battery life and happy customers.

While you're right in that CPU time is usually cheaper than programmer time, there's still a time and a place for writing fast code. Computers may be "fast enough" but the hardware is still expensive and the power it consumes is more important than ever.


Re: last paragraph, I agree; however, it's a cost / benefit thing. I work as a consultant, and while I want to make the most optimal software ever on the one hand, my customer also wants features. The more you optimize, the harder it gets and the more maintenance you'll need to do. It all depends on how much time (and money) you want to invest in optimization vs features.


> Re: last paragraph, I agree; however, it's a cost / benefit thing.

Performance optimization is most definitely a cost/benefit trade off thing.

In your business as a consultant, it will almost never make sense to optimize for performance.

On the other hand, in my business (sw engineering for the semiconductor industry), performance and power consumption are among the primary criteria by which our customers (consumer product manufacturers) choose between our product and our competitor's product. In our case, performance is a feature (if not the feature).

Most software engineers probably fall into the former category and do not have to care about the performance of the software they write. But it is dangerous to think that this will always be the case, and especially harmful to think that educational articles like the OP are wasted programmer time. Especially given that the OP seems to be focused on low-level operating system development, and who wouldn't want their operating systems, browsers, etc. to be faster?


Interesting investigation!

I experimented with getting the Rust compiler to vectorise things itself, and it seems LLVM does a pretty good job automatically; e.g. on my computer (x86-64), running `rustc -O bytesum.rs` optimises the core of the addition:

  fn inner(x: &[u8]) -> u8 {
      let mut s = 0;
      for b in x.iter() {
          s += *b;
      }
      s
  }
to

  .LBB0_6:
  	movdqa	%xmm1, %xmm2
  	movdqa	%xmm0, %xmm3
  	movdqu	-16(%rsi), %xmm0
  	movdqu	(%rsi), %xmm1
  	paddb	%xmm3, %xmm0
  	paddb	%xmm2, %xmm1
  	addq	$32, %rsi
  	addq	$-32, %rdi
  	jne	.LBB0_6
I can convince clang to automatically vectorize the inner loop in [1] to equivalent code (by passing -O3), but I can't seem to get GCC to do anything but a byte-by-byte traversal.

[1]: https://github.com/jvns/howcomputer/blob/master/bytesum.c


GCC vectorizes the sum fine if the input is unsigned bytes, cf [1]. My guess is some odd interaction between integer promotions and the backend.

[1] http://goo.gl/B7KX1V
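Roughly the shape I mean (my own reconstruction for illustration; the shortened link has the actual code):

  #include <stddef.h>
  #include <stdint.h>

  /* With an unsigned (or simply wider) accumulator, gcc -O3 seems happy to
     vectorize this; with an int8_t accumulator it refuses. The result is the
     same mod 256 either way. */
  uint8_t sum_bytes(const uint8_t *x, size_t n) {
      unsigned s = 0;
      for (size_t i = 0; i < n; i++)
          s += x[i];
      return (uint8_t)s;
  }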


I think it's just the type of the accumulator[1] that influences this, GCC seems to vectorize it fine for any type other than signed char/signed short.

[1] s in your code, result in the author's

Edit: The reason for the failure appears to be this:

    test.c:7:2: note: reduction: not commutative/associative: s_11 = (int8_t) _10;
Edit: GCC vectorizes this fine when compiling with -fwrapv, which gives you the semantics the author probably expected.


This link seems to be breaking the HN comments for this page. :)


It's a shortened link (because the true link is so long, as you discovered), maybe you have a browser extension that expands them?


Ah, you're right, I do (Tactical URL Expander for Chrome)! That'd sure explain it.


I like the "free" style of the article. Here is another conclusion: in my professional life I have heard many, many excuses in the name of performance. "We don't need the third normal form; after all, normalized databases are less performant, because of the joins." Optimizing for performance should not mean making it as fast as it could possibly run, but making it just fast enough.

Julia's article shows a good example of this. Of course, the goal appears to be to develop a feeling for what tends to make a program fast or slow, and for how slow it will be or how fast it can get; yet I'd like to point out that this...

https://github.com/jvns/howcomputer/blob/master/bytesum_intr...

... might be 0.1 seconds faster than the original code when run "already loaded into RAM", which she claims runs at 0.6 seconds. Yet this last piece of code is way more complicated and harder to read. Code like this

Line 11: __m128i vk0 = _mm_set1_epi8(0);

might be idiomatic, fast, and give you a great sense of mastery, but you can't even pronounce it, and its purpose does not become clear in any way.

Writing the code this way may make it faster, but it also makes it 1000x harder to maintain. I'd rather sacrifice 0.1 seconds of running time and save 3 days of development time instead.


I believe that the author meant to write 0.25 seconds, instead of 0.5 seconds. She says that the one step halves the running time, and later on she is investigating how the 0.25 seconds are split up.

A >50% speedup is not to be sniffed at! I completely agree with your point though, unless it is mission critical that this code runs as fast as possible, then you are better off keeping it simple!


Depends on whether that 0.1 seconds is going to add up to a significant amount. Obviously these are just tests, but in a production system running 24/7 that 0.1 seconds per run is going to add up to a lot. The code might be ugly, but that is when comments are most important.


Or it might add up to next to nothing, if a full run is eight hours. Or it might finish early and then be blocked waiting for some other part of the system.

The moral of "computers are fast" is that guessing about bottlenecks and twiddling with the lowest level of code is unlikely to help; you need to start at the top with a profiler, and start asking the questions "do we need to compute this at all?", "can we make it O(n log n) or better?", and "can we partition this to scale horizontally?"


Not only at the top, but by profiling a task that is relevant to the user. It is trivial to find some code that could be "sped up", but if it does not bring any value to the user, what is the point?


That's why you write a C++ class wrapping the SSE intrinsics :)


For an in-depth presentation on how we got to this point (cache misses dominating performance), there's an informative and interesting talk by Cliff Click called A Crash Course in Modern Hardware: http://www.infoq.com/presentations/click-crash-course-modern...

The talk starts just after 4 minutes in.


It's 1.08s on my computer for one line of Python, which is respectable:

  python2 -m timeit -v -n 1 -s "import numpy" "numpy.memmap('1_gb_file', mode='r').sum()"                                                                                    
  raw times: 1.08 1.09 1.08


Isn't a lot of numpy implemented in C?


Sure, but the point is that you can get some pretty nice speeds without writing any C yourself.


But only for operations that someone else has written the C code for you. If you come up with some new algorithm that cannot be reduced to operations already implemented in C, then you would need to write the fast C implementation yourself.


Python is implemented in C as well and it's still a dynamic language, which was the point here, I guess.


There is a large difference between "implemented in C" (presumably talking about the runtime being implemented in C) and "Python that calls C code". If Python is implemented in C as an interpreter#, then the interpreter is almost sure to be slower than whatever language/technology the interpreter was implemented with/on top of. This is pretty orthogonal to "Python that is implemented in <whatever>, and is doing most of the calculations by calling a C library". If you have a Python program that computes over megabytes of data, but it calls a C library to do the heavy lifting, then the performance of Python itself is probably going to be negligible if 99.9% of the time is spent on the computation in the C code.

# Though could be a VM or something.


So what? Do you live in an alternative universe where you cannot use libraries implemented in C? Then you also cannot use, for example, Node.js (based on libuv).


Where is the sum?


1/4 second to plow through 1 GB of memory is certainly fast compared to some things (like a human reader), but it seems oddly slow relative to what a modern computer should be capable of. Sure, it's a lot faster than a human, but that's only 4 GB/s! A number of comments here have mentioned adding some prefetch statements, but for linear access like this that's usually not going to help much. The real issue (if I may be so bold) is all the TLB misses. Let's measure.

Here's the starting point on my test system, an Intel Sandy Bridge E5-1620 with 1600 MHz quad-channel RAM:

  $ perf stat bytesum 1gb_file
  Size: 1073741824
  The answer is: 4
  Performance counter stats for 'bytesum 1gb_file':

  262,315 page-faults         #    1.127 M/sec
  835,999,671 cycles          #    3.593 GHz
  475,721,488 stalled-cycles-frontend   #   56.90% frontend cycles idle
  328,373,783 stalled-cycles-backend    #   39.28% backend  cycles idle
  1,035,850,414 instructions            #    1.24  insns per cycle
  0.232998484 seconds time elapsed
Hmm, those 260,000 page-faults don't look good. And we've got 40% idle cycles on the backend. Let's try switching to 1 GB hugepages to see how much of a difference it makes:

  $ perf stat hugepage 1gb_file
  Size: 1073741824
  The answer is: 4
  Performance counter stats for 'hugepage 1gb_file':

  132 page-faults               #    0.001 M/sec
  387,061,957 cycles                    #    3.593 GHz
  185,238,423 stalled-cycles-frontend   #   47.86% frontend cycles idle
  87,548,536 stalled-cycles-backend     #   22.62% backend  cycles idle
  805,869,978 instructions              #    2.08  insns per cycle
  0.108025218 seconds time elapsed
It's entirely possible that I've done something stupid, but the checksum comes out right, and the 10 GB/s read speed is getting closer to what I'd expect for this machine. Using these 1 GB pages for the contents of a file is a bit tricky, since they need to be allocated off the hugetlbfs filesystem, which does not allow writes and requires that the pages be allocated at boot time. My solution was to run one program that creates a shared map, copies the file in, and then pauses, and then have the bytesum program read the copy that uses the 1 GB pages.

Now that we've got the page faults out of the way, the prefetch suggestion becomes more useful:

  $ perf stat hugepage_prefetch 1gb_file
  Size: 1073741824
  The answer is: 4

  Performance counter stats for 'hugepage_prefetch 1gb_file':
 132 page-faults            #    0.002 M/sec
 265,037,039 cycles         #    3.592 GHz
 116,666,382 stalled-cycles-frontend   #   44.02% frontend cycles idle
 34,206,914 stalled-cycles-backend     #   12.91% backend  cycles idle
 579,326,557 instructions              #    2.19  insns per cycle
 0.074032221 seconds time elapsed
That gets us up to 14.5 GB/s, which is more reasonable for a single stream read on a single core. Based on prior knowledge of this machine, I'm issuing one prefetch 512B ahead per 128B double-cacheline. Why one per 128B? Because the hardware "buddy prefetcher" is grabbing two lines at a time. Why do prefetches help? Because the hardware "stream prefetcher" doesn't know that it's dealing with 1 GB pages, and otherwise won't prefetch across 4K boundaries.

What would it take to speed it up further? I'm not sure. Suggestions (and independent confirmations or refutations) welcome. The most I've been able to reach in other circumstances is about 18 GB/s by doing multiple streams with interleaved reads, which allows the processor to take better advantage of open RAM banks. The next limiting factor (I think) is the number of line fill buffers (10 per core) combined with the cache latency in accordance with Little's Law.


Good stuff! The low numbers that we're seeing (4 GB/s, one tenth the available memory bandwidth) suggest that a lot of time is wasted in page faults (and indeed, perf stat confirms it). However, the solution you propose sounds difficult, you need a special file system and need to know what kind of page sizes the CPU supports.

Can you think of ways to reduce the number of page faults from inside the application itself? Or methods that would be portable to architectures with different page sizes?

I tried a simple call to madvise and posix_fadvise to inform the operating system ahead of time that I am going to need the memory but that did not have any effect on the number of page faults.

Any other tips for squeezing some more perf out? Did you happen to do any cache miss stats on your benchmarks?


You get a similar effect to huge pages by passing MAP_POPULATE to mmap, which pre-faults all pages itself before returning.
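For concreteness, a minimal sketch of that (Linux-specific; error handling omitted):

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stddef.h>
  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <unistd.h>

  /* map a whole file read-only and pre-fault every page up front;
     returns MAP_FAILED on error (error handling otherwise omitted) */
  void *map_whole_file(const char *path, size_t *len_out) {
      int fd = open(path, O_RDONLY);
      struct stat st;
      fstat(fd, &st);
      void *p = mmap(NULL, st.st_size, PROT_READ,
                     MAP_PRIVATE | MAP_POPULATE, fd, 0);
      close(fd);                  /* the mapping stays valid after close() */
      *len_out = (size_t)st.st_size;
      return p;
  }

The pre-faulting happens inside the mmap() call itself, so the page faults move out of your timing loop.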


Ah, thanks!

Using MAP_POPULATE, I'm seeing this go from 104,276 page faults down to 130 page faults. However, I'm only seeing a modest increase in overall performance: it goes from 160 ms to 130 ms for my ~500 MB test file. MAP_POPULATE + MAP_NONBLOCK is as bad as not using MAP_POPULATE (same number of page faults).

Could the number of TLB misses from using huge pages be a major factor too? Page faults alone do not explain the huge perf difference. (edit: perf stat tells me I'm only missing 0.19% of TLB lookups and only 3% dcache misses)

I also tried with a 4 GB file and I see two interesting effects. First of all, using MAP_POPULATE is slower, which I presume is because physical memory is exhausted (6 GB memory total). Secondly, I see fewer than 100k page faults, which suggests that bigger pages (2M or 1G?) are being used (10x bigger file, slightly fewer page faults).


How about the following?

    HUGETLB_MORECORE=yes LD_PRELOAD=libhugetlbfs.so


This is well worth trying, but not directly applicable to reading from a file. This would help if you were allocating a buffer with malloc() and you wanted it to be backed by hugepages. But if you have control of the source, you are probably better off using MAP_HUGETLB with madvise() or mmap(), or calling get_huge_pages() directly instead of using the LD_PRELOAD.


What do you think MAP_POPULATE is actually doing here? Unless it's changing the page size, I don't see how it would be significantly reducing the number of TLB misses. Is it perhaps doing the preloading in a separate thread on a different core? And the timing happens to work out that so that the L3 cache is getting filled at the same rate it's being drained?


> What do you think MAP_POPULATE is actually doing here? Unless it's changing the page size, I don't see how it would be significantly reducing the number of TLB misses.

I think that MAP_POPULATE here will fill the page table with entries, rather than leaving the page table empty and letting the CPU fault (almost) every time a new page is accessed. That would be about 200k fewer interrupts for a 1 GB file.

MAP_POPULATE will probably also do the whole disk read in one go rather than in a lazy+speculative manner.

Page size is probably not affected, and neither is the number of TLB misses. I did see in my testing that the size of the file (and the mapping) will affect the page size: a 4 GB file had significantly fewer page fault interrupts than a 500 MB file.

And obviously, MAP_POPULATE is bad if physical memory is getting exhausted.


I came across this link, which helped me understand the process a bit better: http://kolbusa.livejournal.com/108622.html. So yes, the main savings seems to be that the page table is created in a tight loop rather than ad hoc. Given the number of pages in the scan, it's still going to be a TLB miss for each page, but it will be just a lookup (no context switch).

in my testing that the size of the file (and the mapping) will affect the page size

I'm doubtful of this, although it might depend on how you have "transparent huge pages" configured. But even then, I don't think Linux currently supports huge pages for file backed memory. I think something else might be happening that causes the difference you see. Maybe just the fact that the active TLB can no longer fit in L1?

And obviously, MAP_POPULATE is bad if physical memory is getting exhausted.

I'm confused by this, but this does appear to be the case. It seems strange to me that the MAP_POPULATE|MAP_NONBLOCK is no longer possible. I was slow to realize this may be closely related to Linus's recent post: https://plus.google.com/+LinusTorvalds/posts/YDKRFDwHwr6


It's moving page faults away from the main addition loop, which is what I'm interested in measuring anyway. It also reads the whole file in one go, instead of page by page with the default lazy approach.

The best wall times (that is, with OS time included) I get are obtained by reading L1-sized chunks into a small buffer instead of using mmap. YMMV.
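Roughly this shape, if anyone wants to reproduce it (32 KB is my guess at an L1-sized chunk; a sketch, not my exact benchmark code):

  #include <stdint.h>
  #include <stdio.h>

  int main(int argc, char **argv) {
      if (argc != 2) return 1;
      FILE *f = fopen(argv[1], "rb");
      if (!f) return 1;
      static uint8_t buf[32 * 1024];   /* "L1-sized" is a guess; tune for your cache */
      uint8_t sum = 0;
      size_t got;
      while ((got = fread(buf, 1, sizeof buf, f)) > 0)
          for (size_t i = 0; i < got; i++)
              sum += buf[i];
      printf("The answer is: %u\n", (unsigned)sum);
      fclose(f);
      return 0;
  }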


Those are sort of fuzzy concepts for me. At the level of the processor, what does "in one go" really mean? And what does it mean to read it if it's already in memory? Since there are only 512 standard TLB entries, there's no way that all of them can be 'hot' at a time with 4K pages.

For a 1 GB file, I get wall times of:

  Original:     .22 sec
  MAP_POPULATE: .17 sec
  Hugepages:    .11 sec
  Hugepages with prefetch: .07 sec
While I generally agree with the idea that mmap() is no faster than read()/fread(), I'm dubious that one could achieve equally good performance without using huge pages. What I don't understand is what MAP_POPULATE is doing that gets the speedup that it does. I've confirmed that it is not changing the number of TLB page walks. It stays at the expected ~250,000 per GB whether it's used or not.


> And what does it mean to read it if it's already in memory?

It means walking whatever structures the OS uses to keep things in cache. We generally don't know what they are, nor control them.

> What I don't understand is what MAP_POPULATE is doing that gets the speedup that it does.

MAP_POPULATE minimizes the number of page faults during the main loop, which are more expensive (and require a context switch) than TLB misses. Plus, TLB misses can be avoided in our loop, especially with such a friendly linear sweep.

The main problem here, in my view: trying to coax the OS into using memory the way we want. Huge pages surely help in that regard, but they help the most in code that we do not control. The sum itself over 1 GB of memory would be roughly the same speed, regardless of page size.

To put this to the test: generate 1 GB of random bytes on the fly, instead of reading them from a file, and do the same sum. Does the speed change much with the page size? I'd be interested in the results, especially if accompanied by fine-grained performance counter data.


Thanks for walking me through it. I've usually dealt with hugepages as buffers (as you mention in the last test) and haven't previously thought much about how they work as shared memory.

To put this to the test: generate 1 GB of random bytes on the fly, instead of reading them from a file, and do the same sum. Does the speed change much with the page size? I'd be interested in the results, especially if accompanied by fine-grained performance counter data.

Yes, I'm pretty sure this is the case, and had in fact been assuming that it is the major effect. It's a little trickier to measure than the case from the file, since you don't want to include the random number generation as part of the measurement. This essentially excludes the use of 'perf', but luckily 'likwid' works great for this.

I'll try to post some numbers here in the next hour or so. What performance counters are you interested in?


Cycle counts and DTLB_LOAD_* events should be enough here. Note that the random number generation also doubles as the "populate" flag in this case, since malloc returns uninitialized pages. I fully expect huge pages to be faster, all other things being equal, but I wonder by how much.


OK, I've tested, and I'm excited to report that you are about 98% correct that TLB misses are _not_ the main issue! For a prewarmed buffer, using 1 GB hugepages instead of standard 4 KB pages makes only about a 2% difference.

Here's the standard case:

  sudo likwid -C 1 -g 
    INSTR_RETIRED_ANY:FIXC0,
    CPU_CLK_UNHALTED_CORE:FIXC1,
    DTLB_LOAD_MISSES_WALK_COMPLETED:PMC0,
    DTLB_LOAD_MISSES_WALK_DURATION:PMC1  -m ./anonpage

  |RDTSC Runtime [s]                | 0.0732768   |
  |INSTR_RETIRED_ANY                | 5.78816e+08 |
  |CPU_CLK_UNHALTED_CORE            | 2.62541e+08 |
  |DTLB_LOAD_MISSES_WALK_COMPLETED  |   262211    |
  |DTLB_LOAD_MISSES_WALK_DURATION   | 5.52261e+06 |
And here is the version using 1GB hugepages:

  sudo likwid -C 1 -g 
    INSTR_RETIRED_ANY:FIXC0,
    CPU_CLK_UNHALTED_CORE:FIXC1,
    DTLB_LOAD_MISSES_WALK_COMPLETED:PMC0,
    DTLB_LOAD_MISSES_WALK_DURATION:PMC1  -m ./hugepage

   | RDTSC Runtime [s]               | 0.0716703   |
   | INSTR_RETIRED_ANY               | 5.78816e+08 |
   | CPU_CLK_UNHALTED_CORE           | 2.56794e+08 |
   | DTLB_LOAD_MISSES_WALK_COMPLETED |     63      |
   | DTLB_LOAD_MISSES_WALK_DURATION  |    4891     |
The hugepages are indeed faster by about the difference reported as DTLB_LOAD_MISSES_WALK_DURATION. This means that, as you surmised, the majority of the savings is not due to the avoidance of TLB misses per se. I need to think about this more.


Interesting. I wonder if the slowdown of small pages is a result of excessive pointer chasing on the kernel side, which intuitively does a better job of trashing the TLB than sequential accesses would.


> Any other tips for squeezing some more perf out?

Ditch the OS and go into huge unreal mode ala LoseThos. Disable interrupts in the loop. No MMU means no page faults, no TLB, no interrupts, you can't possibly get any faster than that! :-)

More seriously, I think 14.5 GB/s is getting close to the limits of memory bandwidth on a single core already. Multicore CPUs' memory controllers are optimised for multiple cores accessing memory, so a single one trying to saturate the bus might not perform so well.


Seriously, someone should take TempleOS (if it's open source), or else make something similar, and market it as some form of really high-speed server; I'm sure there'd be demand. Of course, that's assuming it would be possible to achieve decent security at the software level to ensure safe operation without memory protection.


I haven't explored it, but you might check out http://www.returninfinity.com/baremetalnode.html

You can also get pretty far on modern Linux by reserving cores for computation with NOHZ_FULL: http://www.breakage.org/2013/11/nohz_fullgodmode/


Something Snabb Switch-like, then, where they disable the Linux task scheduler on 6 out of 8 cores, and do all the scheduling manually: https://news.ycombinator.com/item?id=7250505


Places where people care about this (like HPC) already have customized kernels that run with large pages by default and don't suffer from the poor quality of TempleOS.

See https://software.sandia.gov/trac/kitten for an example.


What exactly do you mean by poor quality? Have you ever run the OS? Do you have a reference to some results that reveal some aspect of it that is poor?


Hah, that's how Novell Netware got such high performance in file serving back in the day (no protected memory).

Funny how things come full circle sometimes...


However, the solution you propose sounds difficult, you need a special file system and need to know what kind of page sizes the CPU supports.

Yes, I've definitely offered a "proof of concept" rather than a solution. But it has gotten simpler with recent Linux kernel versions: http://lwn.net/Articles/533499/.

One difficulty in emulating Julia's particular benchmark is that (at least on Linux) transparent hugepages can't be used for file-backed memory. It's easier if you are just allocating an anonymous buffer, and even easier if you don't need to share that buffer between unrelated processes.

Can you think of ways to reduce the number of page faults from inside the application itself?

"Page faults" is an overloaded term (used for semi-related OS and hardware level events), so probably best to avoid it (even though I was using it, and even though 'perf' does). I was actually using 'likwid' to do most of the measuring, and switched to 'perf' for the post since it's more generally available. Our goal here is to avoid TLB misses, or the consequent 'page walks'.

With that in mind, the subgoal becomes allocating memory that uses hugepages. If you are OK with the default 2MB pages, this is reasonably straightforward. 'libhugetlbfs' (a library distributed independently of the 'hugetlbfs' distributed with the kernel) may make this easier.

Or methods that would be portable to architectures with different page sizes?

The number of page sizes supported is not a big issue. There are only a few sizes out there. In practice, all 64-bit systems will support the 2MB hugepages, and this is almost always the default hugepage size. In this particular case with a linear scan of memory, there is almost no difference in performance between these and the harder to specify 1GB pages.

I tried a simple call to madvise and posix_fadvise to inform the operating system ahead of time that I am going to need the memory but that did not have any effect on the number of page faults.

This is the overloading of the terms getting you. Presuming a system with ample free memory, after the first run you won't have any OS level page faults. But anytime you try to have a working set of more than 512 4K regions in RAM, you will have TLB misses aka "page exceptions".

The flag that should help here is MAP_HUGETLB. You should be able to use it with mmap(), but not (to my knowledge) if you are doing a file backed allocation.
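If it helps, this is the anonymous-buffer shape I have in mind (a sketch; it assumes huge pages have been reserved by the admin, e.g. via /proc/sys/vm/nr_hugepages):

  #define _GNU_SOURCE
  #include <stddef.h>
  #include <sys/mman.h>

  /* anonymous buffer backed by huge pages (2 MB by default on x86-64);
     fails unless huge pages have been reserved beforehand */
  void *alloc_huge(size_t len) {
      void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
      return (p == MAP_FAILED) ? NULL : p;
  }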

Any other tips for squeezing some more perf out?

Not really, learning these is my goal as well. For me, it's probably going to involve getting more familiar with the "uncore" performance monitors. On Sandy Bridge and later, these are accessed via PCI rather than MSRs, and are difficult to use from 'perf'. 'likwid' supports them better, and is definitely worth exploring.

The other place to get further gains is by trying to understand the actual memory access patterns, and trying to get them to read full DRAM pages each time a bank is opened. John McCalpin (author of the Stream benchmark) goes into it here: http://blogs.utexas.edu/jdm4372/2010/11/09/optimizing-amd-op...

Did you happen to do any cache miss stats on your benchmarks?

This is directly related to the performance improvement, but I didn't look at the numbers separately. Glancing now, it looks like practically everything is hitting L3, but a lot of things are missing L1. This is probably worth exploring further. I think one issue is that the hardware prefetcher deposits things in L2 (~12 cycles), but not L1 (~6 cycles).


Naively, it sounds about right just on dimensional grounds. CPU clock: ~ 2GHz, so that means

    (2 GHz) / (4 GB / s) = 0.5 instructions per byte
For a 64 bit processor, you might imagine the best you can do is, say, 2 instructions per 8 bytes (load 64 bits, add+store accumulator).

But of course in practice, the CPU is occupied by other things (i.e. operating system), so I think it's still pretty amazing to get this close to the "theoretical" maximum efficiency.


I'm not sure if this is the right way to look at the problem. With the right numbers, it does provide a theoretical limit, but the problem is that there are other lower I/O limits that get in the way first.

As a stab at the right numbers, on Sandy Bridge you can issue two 16B loads per cycle. Since modern processors are superscalar, in that same cycle you can issue the two vector adds. So your actual CPU limit is more like 32B per cycle (.03 cycles per byte).


Very interesting; of course I expect there are many refinements to be made. As a physicist my first reaction is always to mash numbers together based on dimension. If I can get within one or two orders of magnitude in a problem I know nothing about, I'm pretty happy ;)


I agree completely with the thought process, but I think there might be better 'dimensions' to be used. The one that I'd recommend here is based on Little's Law, which gives a limit on throughput based on the number and latency of outstanding requests: Concurrency = Latency * Bandwidth.

It turns out that each core can only have 10 requests for memory outstanding (line fill buffers). Since in the end these requests are coming from RAM, each request has a latency of about 100 cycles. Since each request is for 64B, this gives us a maximum throughput of about:

  Bandwidth = 10 * 64B in flight / 100 cycles
  Bandwidth = about 6B per cycle
At a 3.5 GHz clock frequency, this would suggest that we have a hard cap at about 3.5 billion cycles/s * 6 bytes/cycle = 21 GB/s, which is mighty close to the actual limit! The numbers are fudged a little bit because it's not actually a flat 100 cycle latency to access RAM, but I think this limit is more relevant and indicative here than the instruction count.


Can you share the prefetching code you wrote? I tried to use __builtin_prefetch, but couldn't figure out how to make it faster.


My approach was somewhat ugly. I went with this:

  for (int i = 0; i < n; i += 8*16) {
      __builtin_prefetch(&a[i + 512]);
      for (int j = 0; j < 8; j++) {
          __m128i v = _mm_load_si128((__m128i *)&a[i + j*16]);
          __m128i vl = _mm_unpacklo_epi8(v, vk0);
          __m128i vh = _mm_unpackhi_epi8(v, vk0);
          vsum = _mm_add_epi32(vsum, _mm_madd_epi16(vl, vk1));
          vsum = _mm_add_epi32(vsum, _mm_madd_epi16(vh, vk1));
      }
  }
The goal is to issue one prefetch for each 128B block of data you read. There are probably better ways to do this than what I did. I'm hoping the compiler did something reasonable, and haven't really looked at the generated assembly.

Also, if it is indeed the case that TLB misses are the major factor (and I think it is), I don't think you will have much success by adding prefetch alone. Trying it right now, I get a slight slowdown with just the prefetch. It may be only in combination with hugepages that you get a positive effect.


I profiled my program looking for TLB misses, and got 0 dTLB misses (https://gist.github.com/jvns/a42ff6f48c659cfc4600).


I think 'perf' is probably lying to you. Although maybe it's not a lie, as your Gist does contain that line '<not supported> dTLB-misses'. Perf tries very hard to be non-CPU specific, and thus doesn't do a great job of handling the CPU specific stuff.

What processor are you running this on? If Intel, you might have luck with some of the more Intel specific wrappers here: https://github.com/andikleen/pmu-tools

You also might have better luck with 'likwid': https://code.google.com/p/likwid/

Here are the arguments I was giving it to check:

  sudo likwid -C 1 \
       -g INSTR_RETIRED_ANY:FIXC0,CPU_CLK_UNHALTED_CORE:FIXC1,CPU_CLK_UNHALTED_REF:FIXC2,DTLB_LOAD_MISSES_WALK_COMPLETED:PMC0 \
       ./bytesum 1gb_file

  |        INSTR_RETIRED_ANY        | 7.38826e+08 |
  |      CPU_CLK_UNHALTED_CORE      | 5.42765e+08 |
  |      CPU_CLK_UNHALTED_REF       | 5.42753e+08 |
  | DTLB_LOAD_MISSES_WALK_COMPLETED | 1.04509e+06 |

  sudo likwid -C 1 \
       -g INSTR_RETIRED_ANY:FIXC0,CPU_CLK_UNHALTED_CORE:FIXC1,CPU_CLK_UNHALTED_REF:FIXC2,DTLB_LOAD_MISSES_WALK_COMPLETED:PMC0 \
       ./hugepage_prefetch 1gb_file

  |        INSTR_RETIRED_ANY        | 5.79098e+08 |
  |      CPU_CLK_UNHALTED_CORE      | 2.63809e+08 |
  |      CPU_CLK_UNHALTED_REF       | 2.63809e+08 |
  | DTLB_LOAD_MISSES_WALK_COMPLETED |    11970    |
The other main advantage of 'likwid' is that it allows you to profile just a section of the code, rather than the program as a whole. For odd political reasons, 'perf' doesn't make this possible.

ps. I think your 'argc' check is off by one. Since the name of the program is in argv[0], and argc is the length of argv, you want to check 'argc != 2' to confirm that a filename has been given.
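
For reference, a check along those lines might look like this (an illustrative sketch, not the post's actual code):

  #include <stdio.h>

  int main(int argc, char **argv) {
      if (argc != 2) {                     /* argv[0] is the program name */
          fprintf(stderr, "usage: %s <file>\n", argv[0]);
          return 1;
      }
      /* ... open and sum argv[1] ... */
      return 0;
  }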


yeah, I tried prefetching like this: https://github.com/jvns/howcomputer/blob/master/bytesum_pref... (using MAP_POPULATE instead of hugepages) and got a slight slowdown


> That gets us up to 14.5 GB/s, which is more reasonable for a single stream read on a single core.

If you want to see roughly the limit of memory speed on your machine, try booting into memtest86+; in addition to testing memory, it also reports memory bandwidth.


The author's SSE code is a terribly overcomplicated way of summing up every byte. The code is using PMADDWD (a multiply and add?!), and is strangely interleaving hardcoded 0s and 1s into registers with PUNPCKHBW/PUNPCKLBW, huh?

All the author needs is PADDB (add packed bytes).


Here is how it should be done with PADDB: http://pastebin.com/MY9tENpW This is 20% faster than the author's version on my computer: 0.210 sec vs. 0.260 sec to process 1GiB. The tight loop is simple:

  400710:	66 0f fc 04 07       	paddb  (%rdi,%rax,1),%xmm0
  400715:	48 83 c0 10          	add    $0x10,%rax
  400719:	48 39 c6             	cmp    %rax,%rsi
  40071c:	77 f2                	ja     400710 <sum_array+0x10>
Compare this to the author's complex version:

  400720:	66 0f 6f 14 07       	movdqa (%rdi,%rax,1),%xmm2
  400725:	48 83 c0 10          	add    $0x10,%rax
  400729:	48 39 c6             	cmp    %rax,%rsi
  40072c:	66 0f 6f c2          	movdqa %xmm2,%xmm0
  400730:	66 0f 68 d4          	punpckhbw %xmm4,%xmm2
  400734:	66 0f 60 c4          	punpcklbw %xmm4,%xmm0
  400738:	66 0f f5 d1          	pmaddwd %xmm1,%xmm2
  40073c:	66 0f f5 c1          	pmaddwd %xmm1,%xmm0
  400740:	66 0f fe c3          	paddd  %xmm3,%xmm0
  400744:	66 0f fe c2          	paddd  %xmm2,%xmm0
  400748:	66 0f 6f d8          	movdqa %xmm0,%xmm3
  40074c:	77 d2                	ja     400720 <sum_array+0x20>
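
In case the pastebin disappears, here is a minimal sketch of the PADDB idea (my own reconstruction, not necessarily the pastebin's code; it assumes a 16-byte-aligned buffer whose length is a multiple of 16):

  #include <emmintrin.h>
  #include <stddef.h>
  #include <stdint.h>

  /* Sum all bytes mod 256. Byte lanes wrap mod 256 on their own,
     which is exactly the result we want, so packed byte adds suffice. */
  uint8_t sum_bytes(const uint8_t *a, size_t n) {
      __m128i acc = _mm_setzero_si128();
      for (size_t i = 0; i < n; i += 16)
          acc = _mm_add_epi8(acc, _mm_load_si128((const __m128i *)&a[i]));

      uint8_t lanes[16], s = 0;
      _mm_storeu_si128((__m128i *)lanes, acc);
      for (int j = 0; j < 16; j++)
          s += lanes[j];                   /* uint8_t arithmetic wraps mod 256 */
      return s;
  }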


Do you even lift?


Nice. I remember the first time I really internalized how fast computers were, even when people claimed they were slow. At the time I had a "slow" 133MHz machine, but we kept finding things it was doing that it didn't need to, and by the time we had worked through those it was idling a lot while doing our task.

The interesting observation is that computers got so fast, so quickly, that software became wasteful and inefficient. Why optimize when you can just throw CPU cycles or memory at the problem? What made that observation interesting for me was that it suggested the next 'era' of computers, after Moore's law stopped, was going to be about who could erase that sort of inefficiency the fastest.

I expect there won't be as much time in the second phase, and at the end you'll have approached some sort of limit of compute efficiency.

And hats off for perf, that is a really cool tool.


It's pretty clear that we're wasting unbelievably huge amounts of computing power with the huge stacks of abstraction we keep piling up.

So let's make this interesting: assuming a ground-up rewrite of an entire, highly optimized web application stack, from the metal on up, how many ordinary server boxes could be handled by just one? Two? A dozen?

I'd be willing to bet that a modern machine with well-written, on-the-metal software could outperform a whole rack of the same machines running all the nonsense we run today.

Magnified over the entire industry, how much power and space are being wasted? What's the dollar amount on that?

What's the difference in developer effort to accomplish this? 30% more time?

What costs more: the potentially millions of wasted machines, plus their power and cooling, or the millions of man-hours spent writing better code?


I'll take that bet.

We gain a lot with "the huge stack of abstraction". The OS gives us a ton of "goodies" for free (memory abstraction, process/thread safety, networking, etc) and the languages/libraries give us more goodies (the ability to focus on higher-level tasks rather than "bit twiddling"). One could also point to the fact that your team is taking advantage of hundreds (thousands?) of domain experts in every aspect of the stack to get the "best" solution.

I would argue that it's not "30% time". It's the accumulated time of each level of abstraction you are using combined. It is very likely the case that one team couldn't rewrite the web application stack "from the metal up" and offer significant improvements.


Still, it's pretty shocking that billions of people have to wait a noticeable amount of time, often measured in seconds, for a web page to go from request to final render. We're doing something wrong.

Server and desktop computing power has grown impressively since this [1] was first created, a page that loads almost as fast as I can take my finger off of the mouse button that clicked the link to take me there,

1 - http://info.cern.ch/hypertext/WWW/TheProject.html

yet I was able to actually count "one onethousand two onethousand three onethousand" before www.youtube.com finished rendering. I don't know, is 2.xx seconds times a billion people more efficient than a few hundred developers spending 2x or 3x longer to write efficient code?


I would say that it's a combination of both. I personally don't automatically assume the delay when clicking a link is due to the number of abstractions between my mouse and the server-side hardware that a website is running on.

In the YouTube example, you have very real speed-of-light constraints. I fired up the Chrome debugger and loaded the main site: over 100 requests in the first couple of seconds. Even with a very low-latency connection, a browser that can "batch" requests together, and instantaneous server responses, there is still a request/response overhead of at least tens of milliseconds for each request (or group of requests).

Reducing that time means reducing the number of requests (assuming javascript/images are already cached), and parallelizing or delaying the loading of "ancillary data", etc. None of which has anything to do with the speed of the server or the client.


Could we do this by devoting 30% of industry-wide developer time to creating faster infrastructure? Yes, perhaps we could.

Would it only cost an individual developer 30% of his/her time to write on-the-metal software rather than use all the abstractions? No way. 80%, maybe. Those abstractions save us an amazing amount of work. That's why we use them.


Running Varnish is fun for these reasons. And of course reading Poul-Henning Kamp's clearly-stated opinions. :-)


Nice writeup! I like how even simplistic approaches to performance can easily show clear differences! However! I noticed you use many (many!) exclamation points! It gave me the impression that you used one too many caffeine patches! [1]

[1]: https://www.youtube.com/watch?v=UR4DzHo5hz8


Shows how excited they were. I always like to see blog posts where the poster is excited about the post!


> So I think that means that it spends 32% of its time accessing RAM, and the other 68% of its time doing calculations

Not sure you can draw that conclusion, actually, because of pipelining etc. I'd assume that the CPU is doing memory transfers simultaneously while doing the calculations.

I also think that only the first movdqa instruction is accessing RAM, the others are shuffling data from one register to another inside the CPU. I'd venture a guess that the last movdqa is shown taking so much time because of a pipeline stall. That would probably be the first place I'd look for further optimization.

On the other hand, I don't have a clue about assembly programming or low-level optimization, so take my comments with a chunk of salt.


It spends the other 68% of its time doing calculations and waiting for the results of the memory accesses.

I personally don't like using per-instruction timings like the one presented in the article to measure performance; on a pipelined, superscalar/out-of-order CPU, the fact that one instruction takes a long time to execute doesn't matter so much since others can execute "around it" if they don't depend on it.

On the other hand, macrobenchmarks, like timing the execution of a whole program, are very useful.
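
For concreteness, whole-program timing can be as simple as wrapping the pass over the data in a monotonic clock; a minimal sketch (the sum_array signature here is my assumption, not taken from the post):

  #include <time.h>

  unsigned char sum_array(const unsigned char *a, long n);   /* assumed to exist */

  double time_whole_pass(const unsigned char *a, long n) {
      struct timespec t0, t1;
      clock_gettime(CLOCK_MONOTONIC, &t0);
      volatile unsigned char sink = sum_array(a, n);          /* keep the call live */
      clock_gettime(CLOCK_MONOTONIC, &t1);
      (void)sink;
      return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
  }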

> I timed it, and it took 0.5 seconds!!! [...] So our program now runs twice as fast

0.5s down from 0.6s is not "twice as fast", it is a 16.67% improvement. I'm not sure where the 0.25s was pulled from.


Unless 0.5 is a typo and it should have actually read 0.25. Then the last line makes more sense too. Except of course 20/0.25 = 80x and not 20x as claimed in step 4...

But never mind, it's a fun post.


> It spends the other 68% of its time doing calculations and waiting for the results of the memory accesses.

It is a bit odd, though: the code is processing data at about 4GB/s, and a modern system should have a lot more RAM bandwidth (e.g. about 10GB/s for DDR3-1333). It feels like there should still be significant room for optimization.

> 0.5s down from 0.6s is not "twice as fast", it is a 16.67% improvement. I'm not sure where the 0.25s was pulled from.

I think the 0.5 was a typo (missing '2') and should instead be 0.25.


Yes, there is room for optimization. It seems to me bytesum_intrinsics.c doesn't saturate RAM bandwidth because the loop is not unrolled. You should unroll the loop, in addition to using SIMD, so that SIMD loads can complete while SIMD adds are being done, hiding memory latency. Otherwise you wait on the loads.
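
A minimal sketch of the kind of unrolling being suggested, with two independent accumulators so consecutive adds don't form one long dependency chain (illustrative only; it uses plain byte adds for brevity and assumes a 16-byte-aligned buffer whose length is a multiple of 32):

  #include <emmintrin.h>
  #include <stddef.h>

  __m128i sum_unrolled(const unsigned char *a, size_t n) {
      __m128i acc0 = _mm_setzero_si128(), acc1 = _mm_setzero_si128();
      for (size_t i = 0; i < n; i += 32) {
          acc0 = _mm_add_epi8(acc0, _mm_load_si128((const __m128i *)&a[i]));
          acc1 = _mm_add_epi8(acc1, _mm_load_si128((const __m128i *)&a[i + 16]));
      }
      return _mm_add_epi8(acc0, acc1);   /* caller reduces the 16 byte lanes mod 256 */
  }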


> so that SIMD loads complete while SIMD adds are done

That's almost certainly already being done by the hardware. Loop unrolling has been counterproductive ever since ~Sandy Bridge or so, and probably hasn't been that great of an idea since the post-P4 days:

http://www.agner.org/optimize/blog/read.php?i=142#142

> It is so important to economize the use of the micro-op cache that I would give the advice never to unroll loops.


This is great advice, but I think it doesn't apply here. If code normally fits in the uop cache but the unrolled code doesn't, you get punished. But here both the normal and the unrolled code fit in the uop cache.

Loop unrolling may be counterproductive, but loop unrolling for vectorization almost certainly isn't, even in the current CPUs. It really can't be done by the hardware, and all compilers unroll loops for vectorization, even if you disable loop unrolling in general.


You are correct that both fit in the uop cache, but I think 'userbinator' is right that loop unrolling is of very limited benefit here (and almost everywhere else). Because of speculative execution, processors are basically blind to loops. They will keep executing ahead along the most likely path, even if this means they are simultaneously executing different iterations of a loop. The loads will pre-execute just fine across the loop boundary. While there are cases where reducing the amount of loop logic will speed things up, it really can be done in hardware!


I tested the program with -funroll-loops (which did unroll the sum_array loop) with no measurable difference in performance. I guess there are no easy shortcuts available anymore, you'd need to delve into the assembly to push for more performance.


I wonder why GCC does not autovectorize the loop in bytesum.c, even with -Ofast. With the autovectorizer, GCC should be able to make the plain loop as fast as the SIMD intrinsics. The autovectorizer can't handle complex cases, but this is as simple as it gets.

Anyone have ideas?


Maybe I'm just oldschool but I found that auto-vectorization hardly ever does what one thinks it should do. I view it as a nice tool in my belt but I don't ever expect it to magically speed things up for me by large factors. If I want real SIMD I get down to intrinsics or assembler myself.


pbsd investigated above[1] and found that the problem was the use of signed chars.

[1]: https://news.ycombinator.com/item?id=7737582


Argh, this is annoying indeed. And yes, the compiler is correct not to autovectorize in this case, then.


Why?


Different integer wrapping semantics I assume.
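
To make the difference concrete, the two shapes under discussion look roughly like this (my own sketch, not necessarily the exact shape of bytesum.c):

  #include <stddef.h>

  /* Signed accumulator: converting an out-of-range sum back to signed char
     is implementation-defined, so mod-256 wrapping isn't guaranteed. */
  signed char sum_signed(const signed char *a, size_t n) {
      signed char s = 0;
      for (size_t i = 0; i < n; i++) s += a[i];
      return s;
  }

  /* Unsigned accumulator: wrap-around mod 256 is fully defined,
     which is exactly what a byte sum mod 256 wants. */
  unsigned char sum_unsigned(const unsigned char *a, size_t n) {
      unsigned char s = 0;
      for (size_t i = 0; i < n; i++) s += a[i];
      return s;
  }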


I wonder, though, because signed overflow is already undefined in C. The compiler could do anything there and call it correct.


I don't think compilers are able to know when to use SIMD.

Isn't vectorization different from SIMD?


No, vectorization is converting scalar operations (i.e. one piece of data at a time, e.g. summing one byte at a time) to vector ones (operating on multiple pieces of data at once, e.g. summing 16 bytes in parallel), and these operations are exactly what SIMD is.


Compilers certainly can, although they aren't exactly good at it.


The first time I realised how fast computers could be was when I first booted BeOS on my old single-core AMD machine, probably clocked at less than 1 GHz.

The thing booted in less than 10 seconds and did everything quickly and smoothly: compiling code, loading files, playing media and browsing the web (over a dial-up modem back then).

It performed so unbelievably well compared to Windows and even Linux of the day that it made me wonder what the other OSes were doing differently.

Now my 4-core, SSD-equipped MacBook Pro has the same feeling of raw performance, but it took a lot of hardware to get there.


One of the things I've always wanted is autovectorisation by the CPU - imagine if there was a REP ADDSB/W/D/Q instruction (and naturally, repeated variants of the other ALU operations.) It could make use of the full memory bandwidth of any processor by reading and summing entire cache lines the fastest way the current microarchitecture can, and it'd also be future-proof in that future models may make this faster if they e.g. introduce a wider memory bus. Before the various versions of SSE there was MMX, and now AVX, so the fastest way to do something like sum bytes in memory changes with each processor model; but with autovectorisation in hardware, programs wouldn't need to be recompiled to take advantage of things like wider buses.

Of course, the reason why "string ALU instructions" haven't been present may just be because most programs wouldn't need them and only some would receive a huge performance boost, but then again, the same could be said for the AES extensions and various other special-purpose instructions like CRC32...


This is why I'm in favor of client-side compilation.

Devs compile to an intermediate language or bytecode of some sort, which then gets compiled on installation / first use / update of the client runtime client-side. Kind of like Java, but instead of being JITted (which causes inconsistent performance) it's compiled and cached.

That way things can be optimized for your specific architecture / computer.

Of course, you can actually do this at the kernel level.


Anyone notice how the author is all excited? Got me in a good mood, reading this.


Great post. If you want to dig in even deeper, you can learn certain nuances of the underlying assembly: loop unrolling, reducing the number of memory accesses, reducing the number of branch instructions per loop iteration by rewriting the loop, and rearranging instructions or register usage to reduce dependencies between instructions.

I took a CPSC course last year and for one of the labs we improved the performance of fread and fwrite C library calls by playing with the underlying assembly. We maintained a leader board with the fastest times achieved and it was a lot of fun to gain insight into the low level mechanics of system calls.

I dug up the link to the lab description: http://www.ugrad.cs.ubc.ca/~cs261/2013w2/labs/lab4.html


The rest of her blog is great as well; I really like her stuff about OS dev with Rust.


Naïve Python3 is not as fast as Numpy, but pretty elegant:

  def main(filename):
      d = open(filename, 'rb').read()
      result = sum(d) % 256
      print("The answer is: ", result)


Note that sum(d) will generate a huge number, possibly using lots of memory and processing power. A better option would be:

  from functools import reduce  # needed in Python 3

  def main(filename):
      d = open(filename, 'rb').read()
      result = reduce(lambda i, j: (i + j) % 256, d)
      print("The answer is: ", result)
Note how this is similar in spirit to exponentiation by squaring in cryptography, where you reduce modulo n at each step instead of computing the full number first: http://en.wikipedia.org/wiki/Exponentiation_by_squaring


Wouldn't you need at least a 2^55 byte (36 petabyte) file for the 'sum' variable to overflow a 64-bit integer? I think such a limitation is far preferable to an extra modulo operation for each and every byte. I'm not sure, though, how Python does its integers, so the threshold for switching to bignums might be lower.


That approach would be better suited to a low-level language such as C, where overflow is a real risk.

Python handles long integers transparently, and the overhead of calling the lambda and managing the extra variables in Python is probably much more time-consuming than handling a long integer.


  I timed it, and it took 0.5 seconds!!!
  So our program now runs twice as fast,
Minor typo above: the time is later stated as 0.25. Super neat!


I wrote a comment pointing that out on the author's blog, but it appears to have since been deleted :/


But not fast enough



