
Do you know how much your computer can do in a second? - luu
http://computers-are-fast.github.io/
======
userbinator
Alternatively: do you know how much your computer _could_ do in a second, but
isn't, because the majority of software is so full of inefficiency?

In my experience, this is something that a lot of developers don't really
comprehend. Many of them will have some idea about theoretical time
complexity, but then see nothing wrong with what should be a very trivial
operation taking several _seconds_ of CPU time on a modern computer. One of
the things I like to do is tell them that such a period of time corresponds to
several _billion_ instructions, and then ask them to justify what it is about
that operation that needs that many instructions. Another thing is to
show them some demoscene productions.
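The back-of-the-envelope arithmetic behind that claim is easy to check. A quick sketch (the clock speed and instructions-per-cycle figures are assumptions, picked conservatively):

```python
clock_hz = 3e9        # ~3 GHz core (assumed)
ipc = 2               # conservative instructions retired per cycle (assumed)
seconds = 3           # "several seconds" of CPU time

instructions = clock_hz * ipc * seconds
print("%.1e instructions" % instructions)  # 1.8e+10, i.e. ~18 billion
```

Modern cores can retire 4+ instructions per cycle in good conditions, so this is a lower bound.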

I got a few of these questions wrong because I don't use Python, but I could
probably say with reasonable confidence how fast these operations _could_ be.

Related articles:

[https://en.wikipedia.org/wiki/Wirth%27s_law](https://en.wikipedia.org/wiki/Wirth%27s_law)

[http://hallicino.hubpages.com/hub/_86_Mac_Plus_Vs_07_AMD_Dua...](http://hallicino.hubpages.com/hub/_86_Mac_Plus_Vs_07_AMD_DualCore_You_Wont_Believe_Who_Wins)
(I know title is a BuzzFeed-ism, but this article came from before that era.)

~~~
bikeshack
Some of the work of Distributed.net (
[http://www.distributed.net/Main_Page](http://www.distributed.net/Main_Page) )
is wonderful. Does anyone know whether this idea could become more than it
currently is? Computers (more than ever) are sitting idle, not contributing
their cycles in any meaningful way. Even 5 minutes of 100% CPU usage per
device could do some serious computation, theoretically superseding modern
supercomputers.

~~~
pjc50
Computers more than ever _rely_ on not being at 100% CPU all the time, because
the increased power consumption and heat dissipation is a problem. Instead
it's all about the "race to idle": do the work and then go to sleep for a few
milliseconds to cool down.

~~~
reubenmorais
Case in point: on my MBP battery, I can get 8 hours of browsing the web,
reading articles, and watching the odd YouTube video. But if I spin up a
parallel build that pegs all cores at 100% for about 15 minutes, I eat through
half of my battery life.

~~~
agumonkey
I wonder what's the sleep-state/consumption curve. Linear or not.

------
barrkel
... wherein you learn how slow Python is, and learn that the author severely
underestimates how fast optimized C can be.

Many of these questions are heavily dependent on the OS you're running and the
filesystem used, and of course the heavy emphasis on Python makes it hard to
make good guesses if you've never written a significant amount of it. I mean,
I have no idea how much attention was paid to the development of Python's JSON
parser; it's trivial to write a low-quality parser using regexes for scanning,
OTOH it could be a C plugin with a high-quality scanner, and I could
reasonably expect 1000x differences in performance.

Interpreted languages tend to have less predictable performance profiles
because there can be a large variance in the amount of attention paid to
different idioms, and some higher-level constructs can be much more expensive
than a simple reading suggests. Higher level languages also usually make
elegant but incredibly inefficient implementations much more likely.

~~~
Matumio
Python's JSON parser will obviously create Python objects as its output. There
is a limit to how much you can gain with clever C string parsing when you
still have to create a PyObject* for every item that you parsed. Because of
this, I don't think you can gain 1000x performance with C optimizations unless
the parser is really horrible (unlikely, considering the widespread use of
JSON).

~~~
tim333
There are some speed comparisons of Python JSON parsers here:
[http://stackoverflow.com/questions/706101/python-json-decodi...](http://stackoverflow.com/questions/706101/python-json-decoding-performance)

Yajl (Yet Another JSON Library) seems to go about 10x faster than the standard
library json.
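For a rough feel of the stdlib parser's raw throughput, a quick `timeit` sketch (the synthetic document and iteration count here are just illustrative; absolute numbers depend entirely on your machine and document shape):

```python
import json
import timeit

# A small synthetic document: 1000 objects with a few fields each.
doc = json.dumps([{"id": i, "name": "item-%d" % i, "tags": ["a", "b"]}
                  for i in range(1000)])

n = 200
elapsed = timeit.timeit(lambda: json.loads(doc), number=n)
print("%.0f parses/sec of a ~%d KB document" % (n / elapsed, len(doc) // 1024))
```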

------
comex
The first example (sum.c) is mistaken: it says the number of iterations per
second is 550,000,000, but actually any compiler with -O will remove the loop
entirely (since the sum variable is not used for anything), so the execution
time does not depend on the number at all. The answer is limited only by the
size of the integer, and the program will always take far less than one
second.

~~~
schoen
You're completely right!

Other readers, check it out for yourself with

gcc -g sum.c ; echo 'disas main' | gdb ./a.out

gcc -g -O2 sum.c ; echo 'disas main' | gdb ./a.out

~~~
dima55
Pro tip:

    gcc -S -o- sum.c

~~~
schoen
Oh yeah, that assembly already existed in order to create the binary in the
first place!

Thanks for the tip.

(The gdb disassembly shows memory offsets, which might be helpful for some
purposes.)

------
kator
Funny but true: many people think computers are "smart". When I am confronted
with this statement in the general public I always remind them: "Computers are
fast at, well, computing; humans, not so much. Computers, however, are stupid:
they follow my directions exactly as I give them and will keep doing the same
stupid thing until I figure out my mistake."

When we have a computer that can read the original post and give estimates and
comment here on HN I will be impressed. Until then it's just a faster z80 to
me, amazing, don't get me wrong, the things we can do today with the power at
our disposal starts to feel like magic. [1]

All that said, it makes me sad when I find code that someone didn't bother to
think through, or even run through the available profiling tools, so it
consumes far more resources than it should. It's true that "premature
optimization is the root of all evil"[2], but at some point it can be worth
your time to review your assumptions and crappy code and give it a tune-up.[3]

[1]
[https://en.wikipedia.org/wiki/Clarke%27s_three_laws](https://en.wikipedia.org/wiki/Clarke%27s_three_laws)
[2]
[https://en.wikiquote.org/wiki/Donald_Knuth](https://en.wikiquote.org/wiki/Donald_Knuth)
[3]
[http://ubiquity.acm.org/article.cfm?id=1513451](http://ubiquity.acm.org/article.cfm?id=1513451)

~~~
Eleutheria
Thank god they're stupid. Can you imagine a smart robot that can think a
billion times faster than us? It takes us a whole lifetime to generate new
knowledge (a PhD), but it would take them just seconds. Now imagine all that
knowledge accumulating over a couple of days, a week, or a month. The last
century alone has brought us exponential discoveries with all the
technological advancements on our side.

No, we can't even comprehend.

~~~
Retra
You're describing a system that can quickly solve a large number of problems,
and you conclude that this is undesirable somehow?

------
blakecaldwell
As a developer, I think we'd all be better off if all software was developed
on 5-year-old machines, databases pre-loaded with a million records, and Redis
and Memcached swapped out with instances that use disk, not RAM.

~~~
candeira
Not related to performance, but please let's add "on machines connected with
average DSL speeds and sporting medium-resolution screens."

~~~
kps
… and phones with a 1G per month data cap, and no connectivity half the day.

~~~
Tyr42
1G? That's so generous. Try surviving on 20MB a month.

It's possible, but you really notice whenever things fail to be cached. (I'm
looking at you google maps!)

~~~
zymhan
You can save an offline version of a Google Map in the smartphone app:
[https://support.google.com/gmm/answer/3273567?hl=en](https://support.google.com/gmm/answer/3273567?hl=en)

~~~
Tyr42
Yes, but it will get deleted if your phone runs out of space (at least, that's
what I'm assuming happened to the map, because I did download a local cache
before leaving the hotel).

------
dilap
I like the idea, but I feel like as soon as I'm caring about performance and
looking at Python code, something has gone terribly wrong.

~~~
usrusr
But the basics are pretty much the same, no matter if it is python or
assembly: does this one-liner run entirely or mostly in L1 cache, or does it
have to wait for RAM access repeatedly? Does it have to wait for disk or does
it have to wait for network? Repeatedly? People who fail at this won't be able
to understand the difference between situations where python or not doesn't
matter much and those where it does.

Understanding that "computers are fast" (even in python!) is a very important
step towards understanding where we make them slow and whether that is because
of waste or because the task is naturally expensive.

Based on your skepticism I assume that you just haven't had much exposure to
people who are really bad at these things, despite having all the formal
education (and the paycheck to match). "I'm working in ${absurdly high level
language}, of course I'm not supposed to care about performance" is what they
tell you before venturing off to make a perfectly avoidable performance
blunder that would be crippling even in fully vectorized assembly, followed by
a few days spent transforming all their code into a different, but perfectly
equivalent, syntactic representation that looks a bit faster.

~~~
dilap
Good points.

& probably there are more python coders out there that could benefit from
developing this kind of thinking than C programmers, so it makes sense from
that perspective, too.

(Side note: It's a trickier exercise in python than in C, which is itself a
trickier exercise than plain assembly.)

------
exacube
Author makes a comment that "If we just run /bin/true, we can do 500 of them
in a second" -- this is very platform-dependent. I think Linux's process
creation is supposed to be 1-2 orders of magnitude faster than Windows's, for
example (I don't have the exact numbers, though).

~~~
LukeShu
The implementation of true also makes a difference! Not quite an order of
magnitude difference, though (except for the shell builtin).

    method           Hz comment
    --------------------------------------------------------
    empty file      500 an empty file gets passed to /bin/sh
    dynamic libc   1000 "int main { return 0; }" -> gcc
    static libc    1500 the same, but with "gcc -static"
    assembly       2000 see below
    bash builtin 150000 avoids hitting the kernel or filesystem
The empty file is the "traditional" implementation of true on Unix.

The assembly solution was my attempt at doing the least amount possible,
because libc initialization still takes time:

    .globl _start
    _start:
        movl $1, %eax       # %eax = SYS_exit
        xorl %ebx, %ebx    # %ebx = 0 (exit status)
        int $0x80
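A portable way to get a feel for the spawn rate from Python; a no-op Python child is used here as a stand-in for /bin/true, so interpreter startup dominates and the rate will be far lower than a real /bin/true would give:

```python
import subprocess
import sys
import time

def spawns_per_second(cmd, seconds=0.5):
    """Repeatedly spawn `cmd` for roughly `seconds` and report the rate."""
    count = 0
    start = time.perf_counter()
    while time.perf_counter() - start < seconds:
        subprocess.run(cmd, check=True)
        count += 1
    return count / (time.perf_counter() - start)

print("%.0f processes/sec" % spawns_per_second([sys.executable, "-c", ""]))
```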

~~~
kragen
This depends in part on how big the process that's forking is.
[http://canonical.org/~kragen/sw/dev3/server.s](http://canonical.org/~kragen/sw/dev3/server.s)
manages to get quite a bit more than 2000 forks per second out of Linux, which
might be in part because it only has two to four virtual memory pages mapped.
(See [http://canonical.org/~kragen/sw/dev3/httpdito-readme](http://canonical.org/~kragen/sw/dev3/httpdito-readme)
for more details.)

------
rcconf
I got 10 / 18, that's a pass! I learned some of these numbers from doing a lot
of stress tests on the game I work on.

I think the really big thing is to actually build some infrastructure around
your product to run performance tests whenever you're developing a feature.
That's the only way you're ever going to get good data.

As an example, the SQL tests will act very differently depending on whether
the table was in the buffer pool or had to be fetched from disk. (I wrote my
own tool to run tests on MySQL, if anyone is interested:
[https://github.com/arianitu/sql-stress](https://github.com/arianitu/sql-stress))

~~~
emn13
14 / 18 and I don't really program python (e.g. have no idea what the bcrypt
lib's defaults in python are...) - but performance is something I've always
cared about, and most of these are things you might happen to know.

I'm surprised by the poor memory performance in his tests; my machine gets
around an order of magnitude better throughput, which leads me to believe he's
compiling with a very outdated gcc, and/or has really slow memory (laptops,
you never know), and/or (plausible, since he only mentioned -O2, but it
depends on the bitness of the compiler) he's compiling in "compatible with
80386" mode.

I think it's odd that people still haven't quite figured that one out yet.
People use "-O2" all over the place, when that's rarely faster than "-O3", and
they leave out one of the simplest optimization options the compiler has -
"-march=native".

~~~
falcolas
> have no idea what the bcrypt lib's defaults in python are

It defaults to 12, IIRC.

------
lqdc13
The grep one is tricky. If no characters match, it's fast.

But if some match, and it's ignoring case, it's much slower. It's actually
faster to read the whole file into memory, lowercase it, and check for the
index of a match in Python.
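The read-lowercase-find approach is only a few lines in Python (a sketch; it assumes the file fits in memory and that bytes-level case folding is acceptable):

```python
def find_case_insensitive(path, needle):
    """Return the byte offset of the first case-insensitive match, or -1."""
    with open(path, "rb") as f:
        data = f.read().lower()   # one pass to fold case for the whole file
    return data.find(needle.lower().encode())
```

Whether this actually beats `grep -i` depends heavily on the grep implementation and locale; GNU grep's case-insensitive matching is known to be slow in multibyte locales.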

~~~
mehrdada
Assuming your pattern is static, it shouldn't be much slower. String matching
can be done in linear time with some preprocessing: check out Knuth-Morris-
Pratt and Boyer-Moore algorithms.

Basically, the idea is that you build a deterministic finite state automaton
and try feeding the string through it. Each character would cause exactly one
automaton transition. Therefore, you can do the whole thing in O(n) after you
pay the cost of preprocessing to build the automaton, with a quite tiny
constant for small patterns.
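A minimal KMP sketch in Python shows the single-transition-per-character idea (an explicit DFA table would trade more preprocessing memory for an even simpler scan loop):

```python
def kmp_search(pattern, text):
    """Return the start indices of every occurrence of `pattern` in `text`,
    in O(len(text)) time after O(len(pattern)) preprocessing."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    # fail[i] = length of the longest proper prefix of pattern[:i+1]
    # that is also a suffix of it (the KMP failure function).
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k

    # Scan: each text character causes a bounded number of fallbacks,
    # amortizing to one automaton step per character.
    matches = []
    k = 0
    for i, c in enumerate(text):
        while k > 0 and c != pattern[k]:
            k = fail[k - 1]
        if c == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k - 1]
    return matches
```

For example, `kmp_search("aba", "ababa")` finds the overlapping matches `[0, 2]`.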

~~~
weinzierl
Actually Boyer-Moore (somewhat counter intuitively) is faster for longer
search strings than for short ones.

It makes sense if you think about a search string that has the same length as
the searched string. If they don't match you can find that out with a single
character comparison.

------
CyberDildonics
Even when you get past all the indirection and interpreted languages people
use there is STILL usually 12x - 100x the speed left on the table.

Even in a native program heap allocations can slow something down to 1/7th.

After that memory ordering for cache locality can still gain 10x - 25x
speedups.

After that, proper SIMD use (if dealing with bulk numeric computations) can buy
another 7x (that's the most I've gotten out of AVX and ISPC).

Then proper parallelism and concurrency are still on the table (but you better
believe that the concurrency can be very difficult to make scale).

The divide between how fast software can potentially run and how fast most
software actually runs is mind blowing.

------
scandinavian
Interestingly enough, but kinda predictable: I ran most of the tests using
PyPy 2.7 on OS X for fun. As expected, PyPy performed vastly better in almost
all tests, as they are all loop-heavy, so the JIT can get to work.

As an example, for the first test I got:

    pypy test.py 1000000000  1.01s user 0.03s system 99% cpu 1.042 total
    python test.py 55000000  1.02s user 0.01s system 99% cpu 1.038 total

So about 18 times faster. On most tests PyPy was 3-10 times faster than
CPython. So what does this tell us? Nothing really; the benchmarks are not
really indicative of anything you would do with Python. Oh, and PyPy is very
fast at some stuff.

~~~
rockmeamedee
I don't think they're benchmarks though. I think the great part about this
piece is that it gives people more intuition about computer speeds in specific
use cases to identify bottlenecks better. If you have a complex operation like
serving a web page, and you measure each part of the process, this page gives
you a feel for what the ideal cases of file IO, memory access, computation,
serialization and network access are so you can sort of tell what to fix a lot
faster. Essentially a broader version of Numbers Every Computer Programmer
Should Know.

------
sdkmvx
Algorithms matter. Do you know how Vim inserts text?

It's exponential. It's worse than a shell loop spawning a new echo process
every iteration.

[http://www.galexander.org/vim_sucks.html](http://www.galexander.org/vim_sucks.html)

~~~
rasz_pl
one of my fav performance bugs:
[https://bugzilla.gnome.org/show_bug.cgi?id=172099](https://bugzilla.gnome.org/show_bug.cgi?id=172099)

Reported: 2005-03-30, unpatched to this day, because parsing opened files on
the fly, recursively, with O(2^n) complexity is apparently good enough.

------
devit
The first C result is absurd, not sure how the author could have gotten it.

First of all, the code as written will just optimize to nothing, so we need to
add an asm("" : "=g" (s) : "0" (s)) in the loop to stop strength reduction and
autovectorization, and we need to return the final value to stop dead code
elimination.

Once that is done, the result is more than 2 billion iterations per second on
a ~3 GHz Intel desktop CPU, while the author gives an absurd value of 500m
iterations which could not have been possibly obtained with any recent Intel
Xeon/Core i5/i7 CPU.

BTW, the assembly code produced is this:

    1:
        add $0x1,%edx
        add $0x1,%esi
        cmp %eax,%edx
        jne 1b

which is unlikely to take more than 1-2 cycles per iteration on any reasonable
CPU, as my test data in fact shows.

~~~
CydeWeys
Well there's always flags to prevent compiler optimizations, or maybe the
example was purposefully presented in readable C, not whatever hack you'd need
to do to bypass optimization. Inline assembly isn't exactly C anymore.

But yeah, I was surprised by the number of operations per second too. I was
thinking it had to be over a billion.

------
kabdib
It's pretty amazing how much computation you can buy for less than a cup of
coffee.

For less than 20 cents (in quantity, perhaps) you can buy a chip that
outperforms the personal computers available in the early 80s. Of course you have
to add peripherals to bring it to true parity, but you can probably have a
working board for about five bucks that'll run rings around an Apple II or a
vintage PC. The keyboard and monitor are the most expensive components.

Likewise, memory. Recently I was thinking about doing some optimization and
reorganization of some data for a hardware management project, when I realized
that the data, for the entire life of the project, would fit into the CACHE of
the processor it runs on. Projecting out five or six years, it would _always_
fit. I stopped optimizing.

Most of the time, the most valuable resource is the time of the person
involved. Shaving milliseconds of response time rarely matters, shaving an
hour of dev time does. (There are big exceptions to this when you are
resource-constrained, as in video games, or hardware environments that need to
use minimal memory or cycles for cost reasons).

Premature optimization still remains a great evil.

~~~
Merad
> you can probably have a working board for about five bucks that'll run rings
> around an Apple II or a vintage PC.

Hell, you can do even better than that. Assuming that CHIP
([https://www.kickstarter.com/projects/1598272670/chip-the-wor...](https://www.kickstarter.com/projects/1598272670/chip-the-worlds-first-9-computer/description))
delivers on its Kickstarter, for $9 you get a 1 GHz CPU and 512 MB RAM. That's
roughly on par with an average home PC from about 2002-2003.

If you bump your budget up to $40, you get a Raspberry Pi 2 with a quad-core 1
GHz chip and 1 GB of RAM. Now we're talking parity with a typical home PC from
10 years ago, or less.

------
Veratyr
I was kinda stunned when I found out how much my computer can actually do.
I've been playing with Halide[0] and I wrote a simple bilinear demosaic
implementation in it and when I started I could process ~80 Megapixels/s.

After optimising the scheduling a bit (which thanks to Halide is only 6 lines
of code), I got that up to 640MP/s.

When I scheduled it for my Iris 6100 (integrated) GPU through Metal (replace
the 6 lines of CPU schedule with 6 lines of GPU schedule), I got that up to
~800MP/s.

Compare this to naïvely written C and the difference is massive.

I think it's amazing that my laptop can process nearly a gigapixel worth of
data in under a second. Meanwhile it takes ~7s to load and render The Verge.

[0]: [http://halide-lang.org/](http://halide-lang.org/)

------
suprjami
Yes, actually. In one second it can sieve the first ~33 million numbers for
primes using a Sieve of Eratosthenes. This requires about 115 MiB of RAM.

~~~
dbaupp
You'll be happy to know that computers can go even faster, and it doesn't need
anywhere near 3.5 (= 115e6/33e6) bytes per number: you can use a single bit
for each one (3.9 MiB), or only store numbers that aren't obviously composite
(e.g. only odd numbers gives half that, and using a 30-wheel gives 1.0 MiB).

In any case, you can do a _lot_ better than merely 33 million: e.g.
[http://primesieve.org/](http://primesieve.org/) uses some seriously optimised
code and parallelism to count the primes below some number between 10 billion
and 100 billion in a streaming fashion (meaning very small memory use). For
non-streaming/caching the results, I'm not sure how primesieve does, but my
own primal[0] (which is heavily inspired by primesieve) can find the primes
below 5 billion and store everything in memory in 1 second using ~170 MiB of
RAM on my laptop (and it doesn't support any parallelism, at the moment), and
the primes below 500 million in ~0.75 seconds on a Nexus 5, and ~1 second on a
Nexus S (although both devices give very inconsistent timings).

[0]: [https://github.com/huonw/primal](https://github.com/huonw/primal)
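The odds-only trick is compact in any language. A plain-Python sketch (one byte per odd candidate rather than one bit, so roughly limit/2 bytes; the bit-packed and wheel variants shrink this further):

```python
def count_primes(limit):
    """Count primes strictly below `limit`, sieving odd numbers only
    (one byte per odd candidate, i.e. ~limit/2 bytes of memory)."""
    if limit <= 2:
        return 0
    n = limit // 2                # sieve[i] stands for the odd number 2*i + 1
    sieve = bytearray([1]) * n
    sieve[0] = 0                  # 1 is not prime
    i = 1
    while (2 * i + 1) ** 2 < limit:
        if sieve[i]:
            p = 2 * i + 1
            start = (p * p) // 2  # index of p*p, the first useful multiple
            # Zero out every p-th odd number from p*p upward.
            sieve[start::p] = bytearray(len(range(start, n, p)))
        i += 1
    return 1 + sum(sieve)         # the +1 counts the prime 2
```

For example, `count_primes(100)` gives 25. The bytearray slice assignment is the closest pure Python gets to the tight marking loop a C sieve would use.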

------
ademarre
It would be helpful to know the default work factor for the bcrypt hash in
that Python library, since none was provided. Apparently it's 12:
[https://pypi.python.org/pypi/bcrypt/2.0.0](https://pypi.python.org/pypi/bcrypt/2.0.0)
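bcrypt isn't in the standard library, but the shape of the knob is easy to demonstrate with stdlib PBKDF2 as a stand-in (this is not bcrypt's algorithm; the point is only that a bcrypt work factor of 12 means 2^12 rounds, so each +1 roughly doubles the time):

```python
import hashlib
import time

def timed_kdf(log2_rounds):
    """Derive a key with 2**log2_rounds iterations; return the elapsed seconds."""
    start = time.perf_counter()
    hashlib.pbkdf2_hmac("sha256", b"hunter2", b"some-fixed-salt!", 2 ** log2_rounds)
    return time.perf_counter() - start

# Each +1 to the cost parameter roughly doubles the runtime.
for cost in (10, 12, 14):
    print("cost=%2d: %.4fs" % (cost, timed_kdf(cost)))
```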

~~~
vessenes
I guessed it was one, and answered a couple orders of magnitude off. I'm going
to give myself partial credit, since I actually thought about the work factor
before answering. But not enough to go read the docs, unlike you!

------
quaffapint
I run through loops with multiple inner loops playing around with football
stats. It will hit 4+ million combinations in under a second - all on an 8 yr
old Q6600 processor. Just amazes me every time the power we have available to
us.

------
Uptrenda
This is actually a useful site for learning about the costs of code. What
would be more useful is if a multi-language version were developed which I
imagine could turn into a pretty cool open source project.

------
tchow
This is extremely cool. Someone needs to do this for all the languages that
are commonly used. Knowing general speeds of various calls for javascript,
ruby, elixir, etc. would be great for web development.

~~~
amelius
There's a number of benchmarks at [1]. It would be nice if somebody would
compile+run them on an AltJS environment and publish the result for different
browsers.

[1]
[http://benchmarksgame.alioth.debian.org/](http://benchmarksgame.alioth.debian.org/)

------
amelius
Computers are fast? Try ray-tracing, or physics simulations in general :)

~~~
FLUX-YOU
Silly mortals and their non-N-body problems!

------
eklavya
So if it's all so fast, what does Atom (latest) do with it all?

------
cweagans
The first time I clicked on this link, I thought it was a joke, because the
page never loaded. I think there was some network issue at my ISP and things
weren't routing properly, but it tried to load for like 40 minutes. When I
finally clicked back on the tab and saw the URL, I laughed and closed it.
Clicked back again today from Hacker Newsletter and saw it was actually a
thing :P

------
kristopolous
Anyone else been struggling to get their suite
([https://github.com/kamalmarhubi/one-second](https://github.com/kamalmarhubi/one-second))
running without modification?

I've had to modify the python code in a few places ... don't know why it isn't
working out of the box - feel like I must be doing something wrong.

~~~
thedufer
With Python this is usually a version mismatch - 2.x and 3.x are subtly
incompatible.

~~~
mappu
Exacerbated by the fact the repo uses `/usr/bin/env python` instead of
explicitly python2 or python3 - which means it will use python 2.x on any
PEP394-compliant system, and python 3.x on e.g. Arch.

------
graycat
Yes, to some extent, and some of the examples are astounding. E.g., I wrote
some simple C code for solving systems of linear equations, and for 20
equations in 20 unknowns I got 10,000 solutions a second on a 1.8 GHz
single-core processor. Fantastic.
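Plain Python with no BLAS gets nowhere near that rate, but the algorithm itself is small. A sketch of Gaussian elimination with partial pivoting (assumes the system is non-singular):

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting.
    `A` is a list of row lists; inputs are copied, not modified."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix [A | b]
    for col in range(n):
        # Partial pivoting: bring the largest entry in this column to the top.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back-substitution on the now upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x
```

For example, `solve([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0])` gives x ≈ [0.8, 1.4].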

------
skimpycompiler
3/18; most of the time I picked something 10x slower than the right number.
Guess I'm stuck in the past :(

------
ctdonath
It's fast enough to do something useful before the light from the screen
reaches your eye.

------
em0ney
Thanks for the post! The part on serialisation blew my socks off - big eye
opener

------
rasz_pl
>new laptop with a fast SSD

or is it a MacBook with the fastest consumer-grade SSD on the market (until
yesterday, I think)? :)

------
introvertmac
nice

------
rkwasny
Awesome! I will definitely include this in interview questions, it's a very
good way to check how much someone knows about computers.

~~~
forgottenpass
That's probably a bad idea. This falls into the realms of pointless trivia and
needlessly-specific experience in a narrow domain. If you're actually worried
about optimization, these aren't the questions you would ask anyway.

~~~
kragen
It might depend on how wrong the answers are. If you ask, "How many HTTP
requests per second can Python's standard library parse on a modern machine?"
then answers in the range of 100 to 1 million might be acceptable, but if the
answer is "10" or "1" or "1 billion", then you know the person doesn't have
much of a clue, about Python in the former case or about computers in the
latter case.
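A rough way to put a number on it yourself (this leans on `http.client.parse_headers`, an undocumented but long-stable stdlib helper; the request below is synthetic):

```python
import io
import timeit
from http.client import parse_headers

RAW = (b"GET /index.html HTTP/1.1\r\n"
       b"Host: example.com\r\n"
       b"User-Agent: test\r\n"
       b"Accept: */*\r\n"
       b"\r\n")

def parse_request(raw):
    """Split off the request line, then parse the header block."""
    stream = io.BytesIO(raw)
    method, path, version = stream.readline().split()
    headers = parse_headers(stream)
    return method, path, dict(headers)

n = 10000
secs = timeit.timeit(lambda: parse_request(RAW), number=n)
print("%.0f requests/sec" % (n / secs))
```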

