Hacker News new | past | comments | ask | show | jobs | submit login
Do you know how much your computer can do in a second? (computers-are-fast.github.io)
557 points by luu on Oct 25, 2015 | hide | past | web | favorite | 174 comments

Alternatively: do you know how much your computer could do in a second, but isn't, because the majority of software is so full of inefficiency?

In my experience, this is something that a lot of developers don't really comprehend. Many of them will have some idea about theoretical time complexity, but then see nothing wrong with what should be a very trivial operation taking several seconds of CPU time on a modern computer. One of the things I like to do is tell them that such a period of time corresponds to several billion instructions, and then ask them to justify what it is about that operation that needs that amount of instructions. Another thing is to show them some demoscene productions.

I got a few of these questions wrong because I don't use Python, but I could probably say with reasonable confidence how fast these operations could be.

Related articles:


http://hallicino.hubpages.com/hub/_86_Mac_Plus_Vs_07_AMD_Dua... (I know title is a BuzzFeed-ism, but this article came from before that era.)

It's because the extra slowness is supposed to be buying us software thats safer and easier to write. The argument has never been about how fast can we be, it's about being fast enough where the cost of talent completely annihilates the cost of compute.

If I were writing some python, which means I've already made that trade off, you wouldn't much like my answer if you started asking me such questions. Which is a shame because I'd probably rather have quite a lot of respect for the same point if we were working lower down the stack.

Slightly related: back in 2003, pg has written an essay about programming languages. A few paragraphs talk about wasting cycles and how it also occurs in other fields than programming: http://www.paulgraham.com/hundred.html

> Alternatively: do you know how much your computer could do in a second, but isn't, because the majority of software is so full of inefficiency?

Amusingly, in the first example, the compiler's static analysis concludes that the entire for loop can be omitted. I have no idea how they came up with that precise number...

Presumably they passed flags to gcc (or whichever compiler they used) to disable these optimizations, otherwise, you're right, they wouldn't make much sense if the code wasn't actually being run as presented.

Interestingly, there's quite a bit that's needed to ensure you bypass some compiler optimizations. Here's an ey-opening (for me) talk[1] on benchmarking C++ and compilers which touched on it.

1: https://www.youtube.com/watch?v=nXaxk27zwlk

I had the same problem: of course an empty loop will run at the maximum because any worthwhile compiler would optimize the loop away.

The real head-scratcher is why the Python version, whose loop is:

  def f(NUMBER):
      for _ in xrange(NUMBER):

runs one tenth as fast! Is it due to the xrange construct? I'm sure that there must be a saner way to make a loop in Python that doesn't bloat to ten times as many cycles as in C. . .

It runs at 1/10 speed because it's interpreted bytecode, and it allocates a new PyInt object for each number in the loop.

You can get it back close to native speed by adding a type annotation and compiling the python file with Cython: http://docs.cython.org/src/userguide/language_basics.html#in...

    def f(NUMBER):
        cdef int i
        for i in range(NUMBER):

Not just close to native speed, this will get you the exact same C code for the inner loop.

I wonder if [x for x in xrange(NUMBER)] would be any faster. Yes it allocates an array but list comprehension is executed differently by python. I recall a talk by some Dropbox guy stating that it was one of their optimization strategies.

Didn't find the original talk but found this: https://wiki.python.org/moin/PythonSpeed/PerformanceTips#Loo...

Nitpick: that creates a list, not an array. It could be faster, but it's not the same benchmark. When you get to the point where the speed of loops matters it's better to reach for Cython, Numba, PyPy, etc.

Why wonder when you can test?

    % python -mtimeit '[x for x in xrange(1000)]'
    10000 loops, best of 3: 70.2 usec per loop
    % python -mtimeit 'for x in xrange(1000): pass'
    10000 loops, best of 3: 29 usec per loop

    % python -mtimeit '[x for x in xrange(100000)]'
    100 loops, best of 3: 7.03 msec per loop
    % python -mtimeit 'for x in xrange(100000): pass'
    100 loops, best of 3: 2.5 msec per loop

10x is about what you'd expect for the interpreter overhead in a simple snippet like that.

Unfortunately python doesn't use a tagged pointers optimization even though I believe pointers are always aligned so there are some spare bits that could be used to indicate pointer vs. int.

I think the only reason they haven't is the legacy C interface; it's a shame they didn't do it for Python 3, but I guess it could have delayed conversion of c libraries even more.

Specifically, an interpreter without JIT. I'd reckon that LuaJIT or Javascript could come much closer to C.

For me, running their sample script using the PyPy JIT resulted in a 12x speedup compared to cpython.

Though that still wasn't as fast as gcc on my system (core i5 3.4ghz; using 1e8 rounds):

* cpython 2.7.4 - 1.131s

* pypy 2.6.0 - 0.090s

* gcc w/ "-O2" - 0.050s

Right, those are usually called JIT compilers, not interpreters.

The distinction is not all that useful, and is blurred nowadays. You could say that CPython compiles code to bytecode, but the bytecode is then interpreted; however, none of that matters, because it's doing very little optimization and that means it's slow. Javascript is typically also not compiled ahead-of-time to native code, but it can still be very fast because of JIT; at the same time it's still very much a dynamic language which makes it share features with interpreted languages.

I'd say JIT is neither about compilation nor interpretation, but about optimization.

Agree 100%.

As a hobby I like to hack 6502 assembly language on original hardware, I have Beebs, Ataris, VIC20 and C=64. 2Mhz is the limit. It is amazing to modern eyes how much computation on these systems you can get done once you strip away all the layers of abstraction and indirection modern systems are saddled with.

The situation gets even more extreme when you consider graphics cards, and their ability to do thousands of operations per screen pixel per frame.

I like your Mac Plus article; that reminds us that the user experience has not got necessarily got better with all this processing power.

The situation gets even more extreme when you consider FPGAs. Having done Mandelbrot set as a toy example, it worked faster than a graphics card at producing high-resolution images and zooms.

I am a programmer, I also like to play games, and make them.

I own a ASUS gaming laptop, and I got astonished by how badly newer games run, even games with graphics that are like of 20 years ago (example: I really like business simulation games, some use isometric graphics with some extremely low poly and untextured models for the walls and floors and some objects, like plants and tables) run slow on my machine.

Even some 2D games run badly, Axiom Verge had several slowdowns on my machine, Luftrausers is virtually unplayable (when there is heavy firefighting on the screen, the game starts to lag severely, things start to teleport around the screen), Wasteland 2 runs like crap and bugs out like crazy (transparent walls, grass growing on random places, people disappearing).

Even games from legendary companies in the technical sense are not doing well, for example I tried to run the new Wolfenstein game on my machine, even at the lowest settings possible the thing go at 3 FPS.

Meanwhile I can play emulated Wii games (Xenoblade for example) with 60 FPS, Stalker series (that in my opinion have amazing graphics) I could even place some mods to make it prettier without losing too much FPS, games using Source engine can confortably run with all the bells and whistles, and so on...

But try to play a new game, even a 2D one, say, Pillars of Eternity. The thing manages to lag for no reason.

I noticed one reason for this is the abuse of Garbage Collectors, many of the games I had problems with has Garbage Collectors, and I see the memory swinging all over the place (a particularly bad one: Kerbal Space Program, it loads everything on launch, then unloads things, then load again as needed, when I launch it, it hits 3gb of memory, then as I use it slowly decrease, then it starts increasing again for no reason, I assume due to memory leaks, and crashes), I even started to play games with the windows task manager on the second monitor, with apps ordered by memory use, this way I know when they will run out of memory and crash (thus I can save first).

And then, there are the devs attitude toward optimization, I finished two days ago some ARM NEON code for my iPhone project (I am doing some freelance iOS coding now), I decided to discuss it on IRC, and some devs got pissed off, they tried to convince me that Apple native APIs were better, and that I was doing my design all wrong, and that instead of using a single ASM function that do all the operations I need inside a single loop, I should have used a collection of Apple APIs that would involve 4 different objects layered to the same thing...

When I explained that my original code (that was using OpenCV), was too slow according to the profiler (it was using 60% of the CPU time), they replied that I was "optmizing the wrong place".

Also when I mention I plan to make it work on iPod 5 and iPhone 4, people counter-argue that I should just ignore those and make it run on iPhone 6 only, because that is the proper solution for not having enough CPU (instead of getting the CPU manual and doing decent code).

Righteous rant, and absolutely right: there isn't really any excuse for a 2D game being slow these days. I suspect some of this is due to unpredictability of GPU performance; that's why the nvidia driver ships with 300Mb of compatibility shims.

A lot of modern games run slow because devs are using common 2D/3D engines which focus on cross-compatibility and ease of use over speed.

Testing in a lot of basic 2D engines also focuses on low performance, e.g. for an RPG where you'll have a dozen sprites moving around, instead of say 500 ships/weapon shots in a shmup.

Some of the work of Distributed.net ( http://www.distributed.net/Main_Page ) is wonderful. Does anyone know if this idea could be more than it is currently, where computers (more than ever) are sitting idly and not contributing their cycles in any meaningful way? Even 5 minutes of raw CPU 100% usage per device could do some serious computation. Theoretically superseding modern supercomputers.

Computers more than ever rely on not being at 100% CPU all the time, because the increased power consumption and heat dissipation is a problem. Instead it's all about the "race to idle": do the work and then go to sleep for a few miliseconds to cool down.

Case in point: with my MBP battery, I can get 8 hours of browsing the Web, reading articles, watching a YouTube video or another. But if I spin a parallel build that uses 100% of all cores for about 15 minutes, I eat through half of my battery life.

I wonder what's the sleep-state/consumption curve. Linear or not.

The difference between idle and full power is very large in modern CPUs - a typical laptop one can consume an average of a few watts at idle, but several dozen at full load; furthermore, they can switch between these states thousands of times per second. This is why CPU power circuitry is not easy to design, since it has to keep up with the very fast changes in current as the CPU transitions between power states.

Incidentally, this is also why you may be able to hear audible sounds from a computer when it's idle or running some particular process - the wakeups/sleeps are happening at a frequency in the hearing range, and the components like capacitors and coils can act as tiny speakers.

boinc, folding@home, gimps, etc. have all done this for over a decade.

Thing is that when boinc and co were popular all those cycles were really wasted. Today computers just enter power-saving states when not in use saving electricity and in effect money.

Distributed.net is the only program capable of maxing out all my 8 cores. Not even transcoding video does that. Pretty cool.

This is because Intel is tricking you with the core count. I'm guessing you have an i5/i7/whatever with 8 cores but just two memory channels. Since it only takes two or three cores to saturate those memory channels, you will never be able to max out 8 cores on anything that processes much more than $CACHE_SIZE (~4MB) of data. So you can use 8 cores for stuff like finding primes or bruteforcing RC5 (like distributed.net), but not much else.

Ah! Thanks. What would I go for if I wanted to actually use that much data? POWER8?

Depends on what you want and how much money you want to spend. If your workload is embarrasingly parallel, it could be cheaper to buy several servers with a CPU like the Xeon E5-1630 v3 (Edit: which has just 4 cores and 4 mem channels). Or, as you say, go the POWER8/SPARC route, which also includes the recent option of a single x86 server with up to 8x Xeon E7-8893 v3. For the latter option you may hit NUMA issues as well; depends on your workload.

Your choice boils down to the classic "message passing" vs. "shared memory" (think MPI vs. OpenMP) architectural choice. What the optimal solution is depends on the specific application, as well as how far you want to scale it.

Is it efficient if it takes so long to get to market that the product is irrelevant by the time it gets there?

Is it efficient if the product is buggy and insecure because it's inherently more difficult to program in assembly compared to Python or Java?

Is it efficient if the product is not extensible and inflexible because every engineering decision went towards efficiency, which is inherently opposed to flexibility? (Early binding is inherently more efficient than late binding for all the same reasons it is inherently less flexible.)

Or do you insist on defining "efficiency" solely to mean "machine efficiency" instead of thinking like an engineer and understanding the concept of trade-offs where no decision is inherently worse than any other?

Is it efficient to wait 6 weeks for results rather than half a week because you preferred Python over C++ during the half-day (half-week) respectively of coding?

Is it efficient to use thousands of kWh in electricity rather than just hundreds?

As always, the answer obviously depends on your problem.

What gets up my nose is when people make it seem like a specific engineering decision is the only moral decision, and that making any other set of trade-offs is evidence of some degeneracy, such as laziness or stupidity.

In this context, it's as if they think that, well, "Every cycle's sacred/Every cycle's great/If a cycle's wasted/God gets quite irate" or similar. It's quite annoying to listen to the more extreme end of this spectrum rattle on about how horrible it is that people no longer use self-modifying code with overlapping opcodes and instruction scheduling which optimizes the drum seek time...

Oh, absolutely. But equally absolutely, there are no absolutes. There’s still plenty of areas where fast code is infinitely more valuable than anything else, including developer time. As I said, whether you should optimise for the former or the latter strongly depends on your problem at hand and external circumstances. PhD students are cheap, Xeons aren’t.

These people are rarer and rarer. On HN I always see the top comments excusing poor performance with having supposedly better productivity and security.

It really seems that many devs are allergic to the word "optimisation" and lash out when they see it. No one says you have to write your code in assembly folks, but do run it in a profiles and make sure it's now slow.

I think that attitude is great. It leaves room for the people who think "OK this exists. Now how about making GOOD version of it."

Problems start when everybody too perfectionist or everybody is too quick and dirty.

This attitude makes sense within a robust Free Software environment, but if the inefficient component is tied to a proprietary service using a closed protocol, then the ability of any individual to improve that component is very limited.

I fully agree with you, and will add one thing:

My way leaves room for "OK, this is crap. It isn't worth making a good version of. At least we didn't waste too much time."


It's amazing how somethings are extremely inefficient, and even worse, nobody questions themselves why this is like this

Most of 'this is taking too long' is because of IO, still, sometimes it's done in a very inefficient way as well

>is because of IO

Having moved to very fast SSDs and ramdrives in some cases, it amazes me how many applications can't take advantage of them. It's terrible when you have an app that appears to be IO bound on disk and move it to fast drives only to get a 2x speedup, while even a single cpu core isn't maxed out.

I bought a gaming laptop with a "theoretically" midrange GPU (it is in practice a shitty GPU, nVidia took a true midrange GPU, removed half of the chip, and overclocked it, hoping it will stay with midrange performance, instead the thing just overheats) and a Core i7...

It is a waste of money, I rarely see 100% CPU use, the few times I saw 100% CPU use was demoscene stuff, or some russian compression programs (that can do amazing stuff, like compress 60gb downloads into 20gb, but take 2 hours at 100% CPU to decompress).

It is evne worse that some games that theoretically would need all the CPU (games that has lots of entitites, or are physics or AI intensive), instead just have thousands of cache misses while they loop through their entities in the memory, and the CPU gets used like 20%.

Or they are single-core, even if made recently.

Even with ssds you can have issues like reading one byte at a time, saturating the sata link, etc

I remember fixing a vim plugin that did one system call per character to be received because the IPC returned some number of xml objects, not even wrapped in a top level one, so the plugin didn't want to block by reading past the last > until it processed everything before it.

I fixed that to use non-blocking pipes and it was much better.

C# is a pretty good example of where the tradeoff between speed and safety can be a good choice.

I wrote a non-trivial stock trading app in C#, and for sake of this conversation I added some loop counting. it analyses about 1.5million ticks per-second (against trade rules). This is a single thread running on a 3.2ghz core i5, no I/O.

That's exactly what got me interested in learning x86 assembly language. It's pretty neat to see how fast a generic loop can be when the entire executable is <10 bytes.

The inefficiency of computing platforms is like the amount of congestion. Both expand to the level of being just barely bearable.

If a computer is too slow nobody will use it; if it's only irritatingly slow, some will use it and; if it's slow but still bearable enough, everybody grumbled but still uses it. Development costs and building programs with higher level languages and using big, high-level frameworks saves in development but increases the cost of using the system. The result is that some majority of programs will always be optimized only to the point where they are barely usable.

Similarly, when traffic is always as congested as the people on roads can bear: those will drop out who really hate congestion or prefer to do their trip at another trip but they will be replaced by people who are willing to pay the time tax again. When space is freed on the highway or arterial road, some people will figure out it's worth the trouble and head out.

This means you can't make the user experience faster by getting faster computers and you can't build more roads and lanes to get rid of congestion.

But there's more.

Because a large mass of congested traffic is more difficult to separate and channel into exits congestion on large roads tends to be worse. If you have a village main street with one lane each way on a road, and that road gets traffic congestion that's much less total congestion, area-wise, and it will clear up pretty quickly once the head of traffic is able to exit somewhere. Even so with a grid of 1+1 lane streets, like in a small town. Because the small road gets quickly congested it will only swallow a limited number of cars which will keep a bound on the maximum amount of congested area. Not so with a 10+10 lane highway.

Similarly to computer programs. Slow programs in the early times when everything was simpler could be made faster with sufficient effort. The programs might still struggle in some parts but you could at least invest in the effort to streamline the core operations and make the worst things fast enough. In contrast, consider an enterprise Java program that depends on dozens or hundreds of libraries that amount to a running image of the size of half-a to a whole gigabyte. It takes ten seconds to start and opening each window takes seconds because there's slack everywhere. Everything runs top of something else and even simple operations involve running mundate pieces of code for seconds in wallclock time. Because the computer that's fast enough to run the program in the first place also has the computing power to run a lot of crap in a similar manner. So the amount of crap just increases as the computing power increases, and getting rid of that crap becomes an unsurmountable task. So it becomes close to impossible to make such a big program run fast because you would have to shrink everything between the application and the hardware in half or in one tenth to get some sort of speed. Coincidentally, old computer systems often had faster user interface latency and responded to user inputs in a very timely manner. They could do the cheap things really fast even if the total lack of computer power still made the programs spend seconds on the heavy processing.

There's an element of truth to what you say, but I think software efficiency will increase in importance over time, especially when the limits on Moore's law are too expensive to overcome.

What I'd like to know is how this plays out at the operating system level. We can run cachegrind on a program, but it doesn't take into account the effects of task switching of the OS (as far as I know).

I think it would be lovely in these multicore days if we could dedicate a core to a process and have as close to 0 task switching as possible on that core.

That's close, but I think it is only half the equation. The other half would be granting exclusivity to that process for the core, not just affinity.

Isn't that the isolcpus command it mentions at the bottom?

Btw: this was just a quick google search, I've never done this. But I'm pretty sure I've heard Martin Thompson discuss it.

You are absolutely right. I stopped reading before I got to the bottom. That's really neat!

>Many of them will have some idea about theoretical time complexity, but then see nothing wrong with what should be a very trivial operation taking several seconds of CPU time on a modern computer.

In my experience developers who over-optimize, or who optimize with disgusting hacks, are a vastly bigger problem than developers who don't optimize at all.

There are a vast number of problems where a several second wait for the user is almost irrelevant compared to the damage done to the code base by optimizing those few seconds away.

So your opinion is "If it's slow it's okay, because it's safer than optimizing?".

I'm sorry but I have to disagree completely...

I'm currently rewriting the node-postgres module from scratch and my implementation is between 3 and >1000 times faster than the original (this is not even a joke). And you know why? Because the original actually contains those hacks you're talking about and not the other way around.

And that's where your actual noticable inefficiencies far too often stem from: Inexperienced developers who don't know how to do it the easy way - inefficiencies due to huge workarounds that have hidden costs.

E.g.: Not using builtin methods but writing buggy workarounds. Or even very basic stuff like: Using expensive functions within loops instead of hoisting those outside of it.

I wouldn't even call those "optimizations".

>So your opinion is "If it's slow it's okay, because it's safer than optimizing?".

No. My opinion is that if it's slow you need to check to see if it's actually a problem for the users first before optimizing. Unless...

>And that's where your actual noticable inefficiencies far too often stem from: Inexperienced developers who don't know how to do it the easy way

I already said this below:

>Cleaning up dirty code that incidentally causes performance improvements is fine

> No. My opinion is that if it's slow you need to check to see if it's actually a problem for the users first before optimizing.

Not only. Think of all these seconds of computing time, translated into watts, translated into coal or whatever pollution from electricity production, multiplied by the number of users your program has.

I think programmers should also consider the environment, even if it'd have more impact if everyone started by just not owning a car.

Unless we're talking about optimizing stuff that actually takes 100% of the CPU's or graphics card's power, it'd have more impact if just one person started not owning a car. Or even merely used it a bit less.

Appreciate the sentiment but it's not going to achieve anything. "Tut-tut" attitude is a lot less efficient than simply showing people why it's worth optimizing for their own sake. If it's worth optimizing at all, that is - GP's point is that it's not always worth it. Optimization has a cost. Development cost and maintenance cost. What's the point of making a program faster if nobody can ever really work on that piece of the code again?

By the way, the hour the programmer stays in the office overtime to optimize that code, using computers and lights and other machines, probably damages the environment more than your example before...

> By the way, the hour the programmer stays in the office overtime to optimize that code, using computers and lights and other machines, probably damages the environment more than your example before...

What if we're talking about the file-copy API in some widely distributed operating system? That's going to be multiplied by millions and millions of people, and therefore, being an obtuse devil's advocate, i argue that indeed it would have an appreciable impact on the environment.

If, however, you're talking about some relatively infrequently used GUI element in an application which has merely a million users, then okay, indeed, go home and don't waste another hour of lights and computers at your office :).

Nobody (hopefully!) is defending important library optimizations.

The kind of unnecessary optimizations I think about as being undesirable are when you spend a week's worth of engineering effort and go through a potentially risky data migration just in order to improve the performance of a command-line production environment management tool that is run once per few months, that only takes a few minutes to run anyway, and whose data is cached server-side by the application running it anyway.

Actual example from my software engineering career.

But for library code, that potentially thousands or even millions of systems in production worldwide are going to be using? Hell yes. That is worth the time to improve. And kudos to you on improving Node postgres performance. That's awesome.

Not one-tenth of the problem as people who wrap trivial computations in "enterprise" crapware and use 100x the CPU they really need just to relay a few bytes of data from system A to system B. If you have ever used Java and XML I am looking at you.

This has a real cost associated with it, servers, datacentre space, power and cooling, system administrators to look after it all, don't come for free. Not just money, the environment too.

>Not one-tenth of the problem as people who wrap trivial computations in "enterprise" crapware and use 100x the CPU they really need

I still doubt that the CPU/power/cooling cost of having systems infested with hacky enterprise garbage is its biggest drawback.

That crap also causes expensive bugs and huge maintenance costs (e.g. relatively simple programs that require a team of 12 programmers a month to implement the simplest feature).

Yes, but this "enterprise" crapware is the transaction system of your bank, and those "few bytes" are your last paycheck. Have some respect for the software that makes the world around you function.

... Which they did perfectly well in the 60s on a mainframe. Why does it take - conservatively estimating - a million times more CPU cycles now?

They're most likely still running all their transaction processing on a single mainframe (+another one for redundancy) with something like IBM CICS [1, 2]. And they most likely also have a lot more transactions than in the 60s (card payments, internet banking, etc.).

[1] http://www.ibm.com/software/htp/cics/ [2] https://en.wikipedia.org/wiki/CICS - yes, "Initial release: 1968; 47 years ago"

Most banks and payment systems have old, reliable codebases, a lot of them even in COBOL. Banks don't play with those systems often, and they don't use monstrocities like J2EE for them.

It's the other kinds of enterprise systems that are crap and deserve no respect. Intranet apps in big corporations, the BS systems that always mess up your bills in services companies, slow as molasses web portals for big enterprises and organisations etc, ERP and CRM apps...

I've worked on an old COBOL codebase and "reliable" is not a word I associate with it.

It is front page news when it goes wrong, tho', look at NatWest. So objectively speaking, it is pretty reliable.

The grandparent post isn't talking about over-optimizing. You shouldn't have to add disgusting hacks to complete a trivial operation in less than a second.

Yet both the efficiency the great-grandparent is talking about and the efficiency hacks that the grandparent is talking about exist.

This is evidence of how hard it is to do everything right, especially if you have more than one axis along which you are trying to be right.

>The grandparent post isn't talking about over-optimizing.

Nonetheless, warning bells go off inside my head whenever I hear that opinion expressed without the caveat "if it isn't a problem for the user, leave it the fuck alone".

>You shouldn't have to add disgusting hacks to complete a trivial operation in less than a second.

If a trivial operation takes more than a second that's almost always indicative of a bigger underlying problem of crappy code.

Cleaning up dirty code that incidentally causes performance improvements is fine, but changing code specifically to make performance improvements without improving its overall quality leads to a downward spiral of shit.

There was an article on how much can your computer do 400 cycles (a cache miss) recently.

I have noticed that the size of L2 caches hasn't increased anywhere near the same rate as other memory capacity.


LAME. What you want is a javascript heavy web UI driven by nodejs server, all running on same underpowered embedded platform. At least 5 levels of abstraction above decency, because hey, we can iterate quickly and hardware is cheap!!1

... wherein you learn how slow Python is, and learn that the author severely underestimates how fast optimized C can be.

Many of these questions are heavily dependent on the OS you're running and the filesystem used, and of course the heavy emphasis on Python makes it hard to make good guesses if you've never written a significant amount of it. I mean, I have no idea how much attention was paid to the development of Python's JSON parser; it's trivial to write a low-quality parser using regexes for scanning, OTOH it could be a C plugin with a high-quality scanner, and I could reasonably expect 1000x differences in performance.

Interpreted languages tend to have less predictable performance profiles because there can be a large variance in the amount of attention paid to different idioms, and some higher-level constructs can be much more expensive than a simple reading suggests. Higher level languages also usually make elegant but incredibly inefficient implementations much more likely.

Python's JSON parser will obviously create Python objects as its output. There is a limit of how much you can gain with clever C string parsing when you still have to create a PyObject* for every item that you parsed. Because of this, I don't think you can gain 1000x performance with C optimizations unless the parser is really horrible (unlikely, considering the widespread use of JSON).

There are some speed comparisons of Python json parsers here http://stackoverflow.com/questions/706101/python-json-decodi...

Yajl (Yet Another JSON Library) seems to go about 10x faster than the standard library json

Fwiw I know of some benchmarks of Perl deserializers, some written in pure perl, others in C (but still producing Perl data structures), and a speed difference of a faktor 100 is not uncommon.

Well, I'm not a python programmer, and I guessed 14/18 correctly. Some of the questions are hard to know without knowing details of the libraries - JSON, as you point out, for instance (it's also one I got wrong, expecting better throughput). But since each question lists the actual performance numbers after you answer, you can get a feel for how big the overhead is in trivial pure python.

But yeah, some of the questions were poorly chosen. For example, on my machine, using gcc -O2 as the author specifies, the first program always executes instantly, since gcc optmizes that loop to nothing. That leads me to believe he may be running a mac (IIRC those come with absurdly out of date gcc versions as a result of the GPLv3 dispute?) or something else fishy is going on.

One interesting thing to note is that the memory access latency on his machine is almost an order of magnitude faster than you might expect based on "Latency Numbers Every Programmer Should Know" (https://gist.github.com/jboner/2841832) - and that's not a coincidence; those 100ns have always been a bit conservative, and the latency number has slightly improved in the past two decades.

gcc is little more than a symlink to clang for recent versions of Xcode, so something else fishy is going on.

The first example (sum.c) is mistaken: it says the number of iterations per second is 550,000,000, but actually any compiler with -O will remove the loop entirely (since the sum variable is not used for anything), so the execution time does not depend on the number at all. The answer is limited only by the size of the integer, and the program will always take far less than one second.

You're completely right!

Other readers, check it out for yourself with

gcc -g sum.c ; echo 'disas main' | gdb ./a.out

gcc -g -O2 sum.c ; echo 'disas main' | gdb ./a.out

The best way to do this, which I didn't know about until recently, is

    gcc -fverbose-asm -g -Wa,-adhln=sum.lst sum.c

Pro tip:

    gcc -S -o- sum.c

Oh yeah, that assembly already existed in order to create the binary in the first place!

Thanks for the tip.

(The gdb disassembly shows memory offsets, which might be helpful for some purposes.)

If the function returned s (the sum number), would it not optimize it away?

Well so gcc -O2 will just replace the loop with s = NUMBER.

I also confirmed this behavior with disassembly, and now I'm wondering what the basis of this optimization is. Can anyone explain the logic that gcc would have used to do this?

As a human, I can see that the loop has no effects except to increase s by 1 each of NUMBER times, and that the result of doing so would increase s by 1 * NUMBER, and that s started as 0, so s ends up as NUMBER. (Also the loop variable i is incremented NUMBER times so it ends up as NUMBER too, but its value is never referred to again.)

I'm wondering how the optimizer observed and represented these facts, and how much more general its attempts at optimizing this loop were.

Not 100% sure but this precise optimization for LLVM is probably part of ScalarEvolution, which tries to figure out how values in a loop evolve: http://llvm.org/devmtg/2009-10/ScalarEvolutionAndLoopOptimiz...

LLVM also has a pass dedicated to detecting loop idioms: http://llvm.org/docs/doxygen/html/LoopIdiomRecognize_8cpp_so...

(I'm not at all familiar with how GCC is architected, sorry)

Cool, thanks for the references!

Funny but true, many people think computers are "smart". When I am confronted with this statement in the general public I always remind them: "Computers are fast, at well computing, humans not so much. Computers however are stupid, they follow my directions exactly as I give them and will keep doing the same stupid thing until I figure out my mistake."

When we have a computer that can read the original post and give estimates and comment here on HN I will be impressed. Until then it's just a faster z80 to me, amazing, don't get me wrong, the things we can do today with the power at our disposal starts to feel like magic. [1]

All that said it makes me sad when I find code that someone didn't bother to think through or even use the profiling tools available to maximize the amount of resources it's consuming. It's true that "premature optimization is the root of all evil"[2] however at some point it can be worth you time to review your assumptions and crappy code and give it a tune up.[3]

[1] https://en.wikipedia.org/wiki/Clarke%27s_three_laws [2] https://en.wikiquote.org/wiki/Donald_Knuth [3] http://ubiquity.acm.org/article.cfm?id=1513451

"Premature optimization is the root of all evil," _is_ true, the active word here being "premature." Once something is working (and hopefully testable), one shouldn't hesitate to dig in and optimize anything and everything. This is not sinful!

Thank god they're stupid, can you imagine a smart robot that can think a billion times faster than us? It takes us a whole life to generate new knowledge (PhD) but it would take them just seconds. Now imagine all that knowledge accumulated in a couple of days, a week or a month. The last century alone has brought us exponential discoveries with all the technological advancements on our side.

No, we can't even comprehend.

You're describing a system that can quickly solve a large number of problems, and you conclude that this is undesirable somehow?

As a developer, I think we'd all be better off if all software was developed on 5-year-old machines, databases pre-loaded with a million records, and Redis and Memcached swapped out with instances that use disk, not RAM.

Not related to performance, but please let's add "on machines connected with average DSL speeds and sporting medium-resolution screens."

… and phones with a 1G per month data cap, and no connectivity half the day.

1G? That's so generous. Try surviving on 20MB a month.

It's possible, but you really notice whenever things fail to be cached. (I'm looking at you google maps!)

You can save an offline version of a Google Map in the smartphone app: https://support.google.com/gmm/answer/3273567?hl=en

Yes, but it will get deleted if your phone runs out of space (at least, that's what I'm assuming happened to the map, because I did download a local cache before leaving the hotel).

I like the idea, but I feel like as soon as I'm caring about performance and looking at Python code, something has gone terribly wrong.

But the basics are pretty much the same, no matter if it is python or assembly: does this one-liner run entirely or mostly in L1 cache, or does it have to wait for RAM access repeatedly? Does it have to wait for disk or does it have to wait for network? Repeatedly? People who fail at this won't be able to understand the difference between situations where python or not doesn't matter much and those where it does.

Understanding that "computers are fast" (even in python!) is a very important step towards understanding where we make them slow and whether that is because of waste or because the task is naturally expensive.

Based on your skepticism i assume that you just haven't had much exposure to people who are really bad at these thing, despite having all the formal education (and the paycheck to match). "I'm working in ${absurdly high level language}, of course i'm not supposed to care for performance" is what they tell you before venturing off to make a perfectly avoidable performance blunder that would be crippling even in fully vectorized assembly, followed by a few days spent transforming all their code a different, but perfectly equivalent syntactic representation that looks a bit faster.

Good points.

& probably there are more python coders out there that could benefit from developing this kind of thinking than C programmers, so it makes sense from that perspective, too.

(Side note: It's a trickier exercise in python than in C, which is itself a trickier exercise than plain assembly.)

Not really. You can write Python fast and slow, so if you do write Python code, you can perfectly well care about performance. Learning C takes months to years while adapting my Python code takes minutes to days.

> "Learning C takes months to years"

Where did you get this information from? Learning C is easy. K&R is under 300 pages long, you could get through it in a weekend:


After that you could pick up another more modern book, or could just learn as you go.

C would make a great video game, it's easy to learn and difficult to master.

Sure, I'd go along with that.

What I'd also say is that you don't need to be a master of C to write faster code than Python. I understand the appeal of keeping a codebase in a single language, but it's worth remembering you pay a cost for doing so.

You can't get through K&R in a weekend unless you skip the exercises.

The decision on whether to skip the exercises or not should be based on how you learn.

I've found the most effective way for me to learn a new language is to watch or read some introductions to a language (to get a feel for idiomatic usage), then explore further through playing with small projects written in that language. That's what works for me. K&R is sold as a concise introduction to C, so it has value for learners like me even with skipping the exercises. Of course for people who prefer to learn methodically doing the exercises will be more beneficial.

If you have CS background, lots of experience and are willing to manually allocate memory, maybe...

No, you don't even need that. C was either my first or second programming language, learnt it around the time I learnt VB and I didn't feel overwhelmed, was probably around 16 at the time, a time when my main experience with computers was playing computer games. Part of the reason for its popularity is its simplicity.

Memory management isn't as hard to grasp as you might think. It's really just an extra step when it comes to handling variables, if you make plans to handle errors then it is straightforward to manage, especially for small libraries, which is what you're likely to write to link up to Python.

You should really check out how random people code in the real world. C is absolutely not the right language for "Dad" while Python is very easy to start with and do advanced stuff with. I am not talking about "programmers" but "people who manage to automate things on their computer".

I've programmed in Python before, it's a good language, very broad library ecosystem, and yes it's a good first language as well, but if you're already familiar with imperative-style programming from Python you are not going to struggle with C, I am confident you could learn it quickly. If you've put off learning C because it's too 'hardcore' or some nonsense then try it out, you may be surprised to learn you like it.

Author makes a comment that "If we just run /bin/true, we can do 500 of them in a second" -- this is very platform dependent -- i think Linux' process creation is supposed to be 1-2 orders of magnitude faster than Windows, for example (i don't have the exact numbers though).

The implementation of true also makes a difference! Not quite an order of magnitude difference, though (except for the shell builtin).

    method           Hz comment
    empty file      500 an empty file gets passed to /bin/sh
    dynamic libc   1000 "int main { return 0; }" -> gcc
    static libc    1500 the same, but with "gcc -static"
    assembly       2000 see below
    bash builtin 150000 avoids hitting the kernel or filesystem
The empty file is the "traditional" implementation of true on Unix.

The assembly solution was my attempt at doing the littlest amount possible, because libc initialization still takes time:

    .globl _start
    	movl $1, %eax 		# %eax = SYS_exit
    	xorl %ebx, %ebx		# %ebx = 0 (exit status)
    	int $0x80

This depends in part on how big the process that's forking is. http://canonical.org/~kragen/sw/dev3/server.s manages to get quite a bit more than 2000 forks per second out of Linux, which might be in part because it only has two to four virtual memory pages mapped. (see http://canonical.org/~kragen/sw/dev3/httpdito-readme for more details.)

Processes and files on Windows are heavyweight. I recall once speeding up a script by about 100 times by changing something of the form

    a >>$LOG
    b >>$LOG
    c >>$LOG

    } >>$LOG

I got 10 / 18, that's a pass! I learned some of these numbers from doing a lot of stress tests on the game I work on.

I think the really big thing is to actually create some infrastructure around your product to run performance tests whenever you're developing a feature. That's the only way you're ever going to good data.

As an example, the SQL tests will act very differently depending on if the table was in the buffer pool, or it had to be fetched from disk (I wrote my own tool to run tests on MySQL if anyone is interested, https://github.com/arianitu/sql-stress)

14 / 18 and I don't really program python (e.g. have no idea what the bcrypt lib's defaults in python are...) - but performance is something I've always cared about, and most of these are things you might happen to know.

I'm surprised by the poor memory performance in his tests; my machine get's around an order of magnitude better performance in terms of throughput; which leads me to believe he's compiling using a very outdated gcc, and/or has really slow memory (laptops- you never know), and/or (reasonable, since he only mentioned -O2, but depends on the bitness of the compiler) he's compiling in "compatibility with 80386" mode.

I think it's odd that people still haven't quite figured that one out yet. People use "-O2" all over the place, when that's rarely faster than "-O3", and they leave out one of the simplest optimization options the compiler has - "-march=native".

> have no idea what the bcrypt lib's defaults in python are

It defaults to 12, IIRC.

Ya, I'm pretty proficient with MySQL and was surprised to get both of the SQL questions wrong (although close), while getting 13 / 18 overall. For the second SQL test, I expect that the second of the two iterations it managed took significantly less time than the first, since the data should have been cached. I guessed 100 for that one, thinking that it would probably manage more than 10 and 100 would give me more leeway - shouldn't have over-thought it.

Even when you get past all the indirection and interpreted languages people use there is STILL usually 12x - 100x the speed left on the table.

Even in a native program heap allocations can slow something down to 1/7th.

After that memory ordering for cache locality can still gain 10x - 25x speedups.

After that proper SIMD use (if dealing with bulk numeric computations) can buy another 7x (that's the most I've gotten out of AVX and ISPC.

Then proper parallelism and concurrency are still on the table (but you better believe that the concurrency can be very difficult to make scale).

The divide between how fast software can potentially run and how fast most software actually runs is mind blowing.

The grep one is tricky. If no characters match, it's fast.

But if some match, and if it is ignoring case, it's much slower. It's actually faster to read the whole file into memory, lowercase it and check with python for index of match.

Assuming your pattern is static, it shouldn't be much slower. String matching can be done in linear time with some preprocessing: check out Knuth-Morris-Pratt and Boyer-Moore algorithms.

Basically, the idea is that you build a deterministic finite state automaton and try feeding the string through it. Each character would cause exactly one automaton transition. Therefore, you can do the whole thing in O(n) after you pay the cost of preprocessing to build the automaton, with a quite tiny constant for small patterns.

Actually Boyer-Moore (somewhat counter intuitively) is faster for longer search strings than for short ones.

It makes sense if you think about a search string that has the same length as the searched string. If they don't match you can find that out with a single character comparison.

It's much slower because you are checking the first character in all zeros string, but checking multiple characters in the case that the first ones match using Boyer-Moore.

Grep is not that naive, afaik. https://news.ycombinator.com/item?id=2393587

Deceptively simple old tools are often full of gems.

Is this still the case if you set LANG=C on the command, so that you're lowercasing ASCII bytes instead of Unicode?

Edit: There was a bug. No it's actually slower when the locale is set to C.

I have 4434 files with about 100 words of text in them.

here is python func that searches utf8:

    def find_sp(paths):
        for path in paths:
            with open(path,'rb') as f:
                if b"snow" in f.read().lower():
                    yield True
                    yield False

    %timeit [o for o in find_sp(glob.glob('./*'))]
    10 loops, best of 3: 35.2 ms per loop
Here is the bash one:

    $>echo $LANG
    $>time grep -Riq "snow" .

    real	0m0.015s
    user	0m0.005s
     sys	0m0.010s

However, if you have a lot of files and you are doing the searching for them in a not very smart way, it takes a lot longer.

    time for f in ./*txt; do grep -iq "snow" $f;done

    real	0m2.307s
    user	0m0.337s
     sys	0m2.113s

In the python example you're running a function, while in the grep example you're launching a program (grep) from your shell.

To make the comparison more fair you'd store the python program and time starting and running both from a shell.

Interesting enough, but kinda predictable. I ran most of the tests using PyPy 2.7 on OS X for fun. As expected PyPy performed vastly better in almost all tests, as they are all loop heavy, so the JIT can get to work.

As an example, for the first test I got:

pypy test.py 1000000000 1.01s user 0.03s system 99% cpu 1.042 total

python test.py 55000000 1.02s user 0.01s system 99% cpu 1.038 total

So about 18 times faster. On most tests PyPy was 3-10 times faster than cPython. So what does this tell us? Nothing really, the benchmarks are not really indicative of anything you would do with Python. Oh, and PyPy is very fast at some stuff.

I don't think they're benchmarks though. I think the great part about this piece is that it gives people more intuition about computer speeds in specific use cases to identify bottlenecks better. If you have a complex operation like serving a web page, and you measure each part of the process, this page gives you a feel for what the ideal cases of file IO, memory access, computation, serialization and network access are so you can sort of tell what to fix a lot faster. Essentially a broader version of Numbers Every Computer Programmer Should Know.

pypy fixes python's slow class instantiation. The HTTPRequest request parsing is probably much faster on pypy because it creates a new class within the loop.

Algorithms matter. Do you know how Vim inserts text?

It's exponential. It's worse than a shell loop spawning a new echo process every iteration.


one of my fav performance bugs: https://bugzilla.gnome.org/show_bug.cgi?id=172099

Reported: 2005-03-30, unpatched to this day, because parsing opened files on the fly recursively with O(2^n​) complexity is enough.

The first C result is absurd, not sure how the author could have gotten it.

First of all, the code as written will just optimize to nothing, so we need to add an asm("" : "=g" (s) : "0" (s)) in the loop to stop strength reduction and autovectorization, and we need to return the final value to stop dead code elimination.

Once that is done, the result is more than 2 billion iterations per second on a ~3 GHz Intel desktop CPU, while the author gives an absurd value of 500m iterations which could not have been possibly obtained with any recent Intel Xeon/Core i5/i7 CPU.

BTW, the assembly code produced is this:


add $0x1,%edx

add $0x1,%esi

cmp %eax,%edx

jne 1b

Which is unlikely to take more than 1/2 cycles to execute on any reasonable CPU as my test data in fact shows.

Well there's always flags to prevent compiler optimizations, or maybe the example was purposefully presented in readable C, not whatever hack you'd need to do to bypass optimization. Inline assembly isn't exactly C anymore.

But yeah, I was surprised by the number of operations per second too. I was thinking it had to be over a billion.

It's pretty amazing how much computation you can buy for less than a cup of coffee.

For less than 20 cents (in quantity, perhaps) you can buy a chip that out-performs the personal computers available in the early 80s. Of course you have to add peripherals to bring it to true parity, but you can probably have a working board for about five bucks that'll run rings around an Apple II or a vintage PC. The keyboard and monitor are the most expensive components.

Likewise, memory. Recently I was thinking about doing some optimization and reorganization of some data for a hardware management project, when I realized that the data, for the entire life of the project, would fit into the CACHE of the processor it runs on. Projecting out five or six years, it would always fit. I stopped optimizing.

Most of the time, the most valuable resource is the time of the person involved. Shaving milliseconds of response time rarely matters, shaving an hour of dev time does. (There are big exceptions to this when you are resource-constrained, as in video games, or hardware environments that need to use minimal memory or cycles for cost reasons).

Premature optimization still remains a great evil.

> you can probably have a working board for about five bucks that'll run rings around an Apple II or a vintage PC.

Hell, you can do even better than that. Assuming that CHIP (https://www.kickstarter.com/projects/1598272670/chip-the-wor...) delivers on it's Kickstarter, for $9 you get a 1 GHz CPU and 512 MB RAM. That's roughly on par with an average home PC from about 2002-2003.

If you bump your budget up to $40, you get a Raspberry Pi 2 with a quad core 1 GHz chip and 1 GB of RAM. Now we're talking parity with an typical home PCs from 10 years ago, or less.

I was kinda stunned when I found out how much my computer can actually do. I've been playing with Halide[0] and I wrote a simple bilinear demosaic implementation in it and when I started I could process ~80 Megapixels/s.

After optimising the scheduling a bit (which thanks to Halide is only 6 lines of code), I got that up to 640MP/s.

When I scheduled it for my Iris 6100 (integrated) GPU through Metal (replace the 6 lines of CPU schedule with 6 lines of GPU schedule), I got that up to ~800MP/s.

Compare this to naïvely written C and the difference is massive.

I think it's amazing that my laptop can process nearly a gigapixel worth of data in under a second. Meanwhile it takes ~7s to load and render The Verge.

[0]: http://halide-lang.org/

Yes, actually. In one second it can parse the first ~33 million numbers for primes using a Sieve of Eratosthenes. This requires about 115MiB of RAM.

You'll be happy to know that computers can go even faster, and it doesn't need anywhere near 3.5 (= 115e6/33e6) bytes per number: you can use a single bit for each one (3.9 MiB), or only store numbers that aren't obviously composite (e.g. only odd numbers gives half that, and using a 30-wheel gives 1.0 MiB).

In any case, you can do a lot better than merely 33 million: e.g. http://primesieve.org/ uses some seriously optimised code and parallelism to count the primes below some number between 10 billion and 100 billion in a streaming fashion (meaning very small memory use). For non-streaming/caching the results, I'm not sure how primesieve does, but my own primal[0] (which is heavily inspired by primesieve) can find the primes below 5 billion and store everything in memory in 1 second using ~170 MiB of RAM on my laptop (and it doesn't support any parallelism, at the moment), and the primes below 500 million in ~0.75 seconds on a Nexus 5, and ~1 second on a Nexus S (although both devices give very inconsistent timings).

[0]: https://github.com/huonw/primal

It would be helpful to know the default work factor for the bcrypt hash in that Python library, since none was provided. Apparently it's 12: https://pypi.python.org/pypi/bcrypt/2.0.0

I guessed it was one, and answered a couple orders of magnitude off. I'm going to give myself partial credit, since I actually thought about the work factor before answering. But not enough to go read the docs, unlike you!

I run through loops with multiple inner loops playing around with football stats. It will hit 4+ million combinations in under a second - all on an 8 yr old Q6600 processor. Just amazes me every time the power we have available to us.

This is actually a useful site for learning about the costs of code. What would be more useful is if a multi-language version were developed which I imagine could turn into a pretty cool open source project.

This is extremely cool. Someone needs to do this for all the languages that are commonly used. Knowing general speeds of various calls for javascript, ruby, elixir, etc. would be great for web development.

There's a number of benchmarks at [1]. It would be nice if somebody would compile+run them on an AltJS environment and publish the result for different browsers.

[1] http://benchmarksgame.alioth.debian.org/

I do not think that it is that useful. While i have seen people greatly overestimate computing complexity of simple tasks like comparing numbers, i do not believe knowing how long such instructions take gives you much more. Especially interpreted JIT compiled code is not exactly predictable in performance and optimizing single instructions not very useful. However it is a fun game and does indeed give a better rough assessment for performance in various implementations. But if you really like to optimize your code these are just some hints. You will have to run benchmarks for every use case.

A high-level look at the relative speeds of common JS operations and patterns would be very useful. I often wonder if I'm wasting time thinking about how to optimize something when it won't make any practical difference.

Computers are fast? Try ray-tracing, or physics simulations in general :)

Silly mortals and their non-N-body problems!

So if it's all so fast what does atom (latest) do with it all?

The first time I clicked on this link, I thought it was a joke, because the page never loaded. I think there was some network issue at my ISP and things weren't routing properly, but it tried to load for like 40 minutes. When I finally clicked back on the tab and saw the URL, I laughed and closed it. Clicked back again today from Hacker Newsletter and saw it was actually a thing :P

Anyone else been struggling to get their suite (https://github.com/kamalmarhubi/one-second) running without modification?

I've had to modify the python code in a few places ... don't know why it isn't working out of the box - feel like I must be doing something wrong.

With Python this is usually a version mismatch - 2.x and 3.x are subtly incompatible.

Exacerbated by the fact the repo uses `/usr/bin/env python` instead of explicitly python2 or python3 - which means it will use python 2.x on any PEP394-compliant system, and python 3.x on e.g. Arch.

Yes, to some extent, and some of the examples are astounding: E.g., I wrote some simple C code for solving systems of linear equations, and for 20 equations in 20 unknowns I got 10,000 solutions a second on a 1.8 GHz single core processor. Fantastic.

3/18 , most of the time I picked something 10x slower than the lower num. Guess I'm stuck in the past :(

It's fast enough to do something useful before the light from the screen reaches your eye.

Thanks for the post! The part on serialisation blew my socks off - big eye opener

>new laptop with a fast SSD

or is it macbook with the fastest consumer grade ssd on the market (until yesterday I think)? :)




Awesome! I will definitely include this in interview questions, it's a very good way to check how much someone knows about computers.

That's probably a bad idea. This falls into the realms of pointless trivia and needlessly-specific experience in a narrow domain. If you're actually worried about optimization, these aren't the questions you would ask anyway.

It might depend on how wrong the answers are. If you ask, "How many HTTP requests per second can Python's standard library parse on a modern machine?" then answers in the range of 100 to 1 million might be acceptable, but if the answer is "10" or "1" or "1 billion", then you know the person doesn't have much of a clue, about Python in the former case or about computers in the latter case.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact