
How much your computer can do in a second (2015) - MindGods
http://computers-are-fast.github.io
======
markstos
Computers are also slow, and sometimes getting slower, because real-world
applications are built with ever more layers of abstraction. While the
hardware at the bottom of the stack keeps getting faster, the top layer is
sometimes slower because there are so many layers and so much complexity in
the stack.

Take this review of terminal emulator benchmarks:
[https://lwn.net/Articles/751763/](https://lwn.net/Articles/751763/)

Uxterm, released in 1994, is the clear winner.

A "modern" terminal like Gnome Terminal has more than 10x the latency.

Booted an Electron app lately?

~~~
PragmaticPulp
A few extra seconds of app startup time or an extra frame of latency in your
terminal is a small price to pay for the absolutely massive difference in
functionality of newer apps.

It’s fun to see how fast apps were when everything was extremely simple, but
you can’t dismiss the fact that it was orders of magnitude less capable than
modern software.

The bottom line is that modern apps are fast _enough_. I’ve never once thought
that my terminal latency is too high or that a few extra seconds of app
startup time made any impact on my day whatsoever.

The difference between the fastest and slowest terminals on that list is less
than 30ms, or 3/100ths of a second. Literally less than a blink of an eye.
It’s a fun study and I enjoyed the article, but the reality is that it just
doesn’t matter at all.

Anyone suggesting that computers felt faster a decade ago is misremembering
the past. Try using a 2008-era computer with a mechanical HDD and you’ll
quickly realize that apps did not launch faster back in the day.

~~~
antepodius
Modern programs are 100 times slower, but they definitely aren't 100 times
more functional. And speed is part of functionality and UX! I don't want to
have to use a high-powered workstation or a gaming laptop to get reasonable
responsiveness in my text editor. Many JavaScript-powered websites with
'fancy' canvas features are janky and slow as shit on hardware that's only 10
years old, whereas websites that use plain HTML are still snappy and fast.

That 'doesn't matter at all' attitude is exactly the problem when our current
culture involves building layers and layers and mountains of stacks on top of
software. That 30 ms compounds, because everyone working on each layer thinks
'hey, this is a bit slow, but it doesn't matter because it's only
milliseconds.' Given this, it's obvious that modern software will be as slow
and inefficient as it possibly can be before normal people notice and complain
(usually only because they've used other equivalent software that's smoother
and faster; it's easy to get used to jank, like low FPS in a video game).

I think that people undervalue the cost of all these layers of abstraction.

~~~
PragmaticPulp
> Modern programs are 100 times slower, but they definitely aren't 100 times
> more functional.

My IntelliJ IDE is easily 100 times more functional than early text editors.

As for slower: Any relative comparisons are still missing the point. It’s not
a relative question. It’s an absolute question: Is this fast enough?

Let’s be honest, none of us are sitting around waiting for our complex IDEs to
render characters on the screen. No one really cares if it’s 1ms or 30ms or
even 200ms.

Likewise, I don’t care if it takes 100ms or 10s to open the IDE because I’m
not quitting it and re-launching it all day. I launch it, leave it open, and
that’s that.

People like to glorify the good old days when apps were supposedly faster to
launch, but they forget the convenience of simply leaving apps open with our
oodles of RAM and letting our computers rely on suspend/wake. The longest
delays in my workflow are getting my laptop out of the bag and typing in the
password, or maybe downloading things from the internet.

Terminal latency or app launch time just don’t even register on the list of
delays during my day.

~~~
justin66
> Let’s be honest, none of us are sitting around waiting for our complex IDEs
> to render characters on the screen. No one really cares if it’s 1ms or 30ms
> or even 200ms.

Waiting longer for a character to render to the screen than it takes to send
an IP packet to a different continent and receive a reply may make you happy
for some reason, but UI research indicates most people will react negatively
to this.

It's worse than that. With something like Visual Studio, for example, we are
_sitting around waiting_ for local variables to update in the debugger's watch
window, if we're experienced enough to know that we need to do that. This kind
of thing is a genuine user interface bug: if a person doesn't realize what a
slow, awful piece of software they're dealing with, they'll _step step step_
past a variable update in the debugger, never see it happen on their screen
because that only works when you pause and give the watch window time to catch
up, and assume it's their program that has the problem.

People can get acclimated to dealing with very slow software, but they
shouldn't have to when all the hardware performance to make it better already
exists.

~~~
aardvark291
> No one really cares if it’s 1ms or 30ms or even 200ms.

You would definitely care about 200ms latency.

~~~
PragmaticPulp
Have you looked at actual end-to-end latency numbers for common operations?
I’m not talking about theoretical transfer times between buffers or carefully
structured synthetic benchmarks.

Using fast-twitch PC games as the gold standard, most people are looking at
60-80ms of latency on a local PC. The online game streaming services hover
around 150ms, which is noticeable but still entirely usable. (Source:
[https://www.pcgamer.com/heres-how-stadias-input-lag-compares-to-native-pc-gaming/](https://www.pcgamer.com/heres-how-stadias-input-lag-compares-to-native-pc-gaming/) )

That’s why I say that 1ms vs 30ms of terminal latency is a non-issue. When I’m
typing, I’m not on a tight feedback loop with each character. I know what I’m
typing, so I’m not waiting for specific letters to appear on the order of a
blink of an eye.

Most of us are using 60Hz monitors (17ms per frame) with terminals that have
20-30ms of lag, with OSes that introduce slightly more delay, and so on. Then
we SSH into remote servers with 50ms, or 100ms, or 200ms of latency or more.
The total delay between hitting a key and seeing the letter on screen could
easily be 200ms on the regular for an SSH session, yet our typing isn’t
falling apart.

~~~
justin66
That PC Gamer article is not an endorsement of the idea you're pushing that
all this latency doesn't matter. "Singleplayer games are mostly fine to play
through the cloud, but any cloud gaming platform is going to be a no-sell for
people who only play multiplayer games, even with a good connection."

I don't want to give that magazine article too much credence, but when it
comes to user experience, we as an industry ought to try for more than "mostly
fine" or "isn't falling apart."

------
closeparen
This is so important. I see so many people farting around in the problems of
highly scalable distributed cloud systems with thousands of nodes, not
realizing that _single-digit QPS per node is insanity, why don't you look
there first?_ Computers are fast.

~~~
colecut
I think they do this because their priority is decentralization and not speed

~~~
yachtman
No, they do this because horizontal scalability is more general. Once you
cross the threshold of what your meganode can handle, you have to rewrite your
code from the bottom up.

------
ghj
The empty loop in python was surprisingly slow (68,000,000 iterations per
second).

What is it actually doing here? A 3ghz cpu has 3 billion cycles per second. So
it's spending an average of 44 cycles to increment an integer and compare???

(also fun fact, python integers are 28 bytes, but it still doesn't really
explain the slowness: `import sys; sys.getsizeof(123456)`)

~~~
teraflop
I just tested this on a Linux box I happened to have running, using the "perf"
tool. By my measurements, each iteration takes about 104 instructions, of
which 23 are conditional branches, and completes in about 31 cycles.

(That's after subtracting about 30 million cycles of startup overhead. Tested
with Python 2.7.9 on an Intel i3-4160 processor.)

Remember, Python is a bytecode-interpreted language. Each iteration of the
loop involves multiple bytecode operations:

    
    
          2           0 SETUP_LOOP              20 (to 23)
                      3 LOAD_GLOBAL              0 (xrange)
                      6 LOAD_FAST                0 (NUMBER)
                      9 CALL_FUNCTION            1
                     12 GET_ITER
                >>   13 FOR_ITER                 6 (to 22)
                     16 STORE_FAST               1 (_)
        
          3          19 JUMP_ABSOLUTE           13
                >>   22 POP_BLOCK
                >>   23 LOAD_CONST               0 (None)
                     26 RETURN_VALUE
    

Executing each of those instructions means fetching it, jumping to the
implementation of the appropriate opcode, and then updating the VM's state --
or, in the case of FOR_ITER, calling the native C function that advances the
iterator.

Frankly, it's impressive that it's as fast as it is.
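
The core of that dispatch overhead is easy to picture. Here is a minimal,
hypothetical fetch/dispatch loop in C -- the opcodes are invented for
illustration, but CPython's ceval.c has the same overall switch-on-opcode
shape. Every iteration pays for the fetch, the dispatch, and the VM-state
update before any useful work happens:

    #include <stdio.h>
    
    /* Hypothetical opcodes, invented for illustration only. */
    enum { OP_PUSH, OP_ADD, OP_JUMP_IF_LT, OP_HALT };
    
    int main(void) {
        /* Count to 3: push 0, then add 1 while the top of stack < 3. */
        int code[] = { OP_PUSH, 0, OP_ADD, 1, OP_JUMP_IF_LT, 3, 2, OP_HALT };
        int stack[16], sp = -1, pc = 0;
    
        for (;;) {
            switch (code[pc]) {     /* fetch + dispatch: paid on every opcode */
            case OP_PUSH:
                stack[++sp] = code[pc + 1];
                pc += 2;
                break;
            case OP_ADD:
                stack[sp] += code[pc + 1];
                pc += 2;
                break;
            case OP_JUMP_IF_LT:     /* operands: limit, jump target */
                pc = (stack[sp] < code[pc + 1]) ? code[pc + 2] : pc + 3;
                break;
            case OP_HALT:
                printf("result: %d\n", stack[sp]);   /* prints 3 */
                return 0;
            }
        }
    }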

~~~
f1refly
> tested with python 2.7.9

Oof.

[https://www.python.org/doc/sunset-python-2/](https://www.python.org/doc/sunset-python-2/)

------
alister
> _How many times can we download google.com in a second?_ _Exact answer: 4_

I was a little surprised by this one. Sure, network connections are going
to be _much_ slower than local operations, but where's the time going? Is it
mostly the network latency? Google has a lean webpage, so I assume it has
nothing to do with Google specifically.

~~~
Junk_Collector
If your average connection has 100 ms of round-trip latency to google.com
(50 ms would be extremely good, but doable these days), then you have a hard
upper limit of 10 loads per second just to connect to the system. Then you
have to move the data, store it in memory, parse it, and run it.
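
A back-of-the-envelope sketch of that ceiling in C (the round-trip counts per
fetch are assumptions; see the reply below for measured numbers):

    #include <stdio.h>
    
    /* Hard upper bound on sequential fetches per second, ignoring
       transfer, parsing, and server time entirely. */
    static double max_fetches_per_sec(double rtt_ms, int round_trips) {
        return 1000.0 / (rtt_ms * round_trips);
    }
    
    int main(void) {
        /* 100 ms RTT, one round trip per load: the 10 loads/sec ceiling. */
        printf("%.1f\n", max_fetches_per_sec(100.0, 1));  /* 10.0 */
        /* A fresh HTTPS connection costs roughly 4 round trips. */
        printf("%.1f\n", max_fetches_per_sec(100.0, 4));  /* 2.5 */
        return 0;
    }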

~~~
ummonk
I'd be shocked if my latency to Google were even as high as 50 ms. That said,
a download takes three round trips for HTTP and four for HTTPS, which adds up
if the connection isn't kept alive between downloads.

Edit: just tested and got these numbers for myself:

7-13ms to ping google.com

50-80ms to curl [http://google.com](http://google.com)

90-130ms to curl [https://google.com](https://google.com)

140-160ms to curl [http://google.com](http://google.com)
[http://google.com](http://google.com) [http://google.com](http://google.com)
[http://google.com](http://google.com) [http://google.com](http://google.com)

180-220ms to curl [https://google.com](https://google.com)
[https://google.com](https://google.com)
[https://google.com](https://google.com)
[https://google.com](https://google.com)
[https://google.com](https://google.com)

So I should be able to download something like 20-30 copies of google.com per
second with keep-alive, and ~10 without.

~~~
Junk_Collector
You're absolutely right and speeds to Google have gotten much faster over
time. For what it's worth, I just ran a quick ping test and got 26ms average
latency to google.com so I should have checked before I posted. Thanks for
keeping me in check.

Still, this depends on where you are, your ISP, and other general factors, and
it's still within an order of magnitude, which was the point of the "quiz".

------
bigdict
Great idea, but this is complicated by having to model the Python interpreter.

~~~
Asraelite
Agreed, it would be better to stick to a low level language for this. It's
made even worse by the fact that some operations will be internally delegated
to C code while others are pure Python.

------
markstos
In the real world, when I type close to 100 wpm on keyhero.com, the
highlighting of the current word can't even keep up with a human typist.

~~~
jrockway
I don't have this problem. I did some profiling and it's not doing much of
anything expensive. Maybe there's some ad that I blocked or something that
degrades performance for you.

~~~
monkpit
Or the fact that you’re not using the same computer.

------
dang
If curious see also

2017
[https://news.ycombinator.com/item?id=13960183](https://news.ycombinator.com/item?id=13960183)

2015
[https://news.ycombinator.com/item?id=10445927](https://news.ycombinator.com/item?id=10445927)

~~~
saagarjha
I'm not sure if you've decided to switch up the usual "past discussion" with
"if curious see also", but FWIW the latter suggests to me "here is more
material that is tangentially relevant that you might find interesting" while
the former is very direct in what it's claiming to link to.

~~~
dang
I'm just looking for concise wording that no one will misunderstand as somehow
critical of the repost.

The intention is simply to link to interesting things that readers might be
curious to look at. If the repost were bad we'd mark it [dupe] instead.

~~~
switch007
“Past discussions” seems absolutely fine (and much more precise).

Maybe others need to consider if they’re overreacting to two very simple
words.

------
lulzx
It's a critical problem. There are three key areas to prioritize to continue
to deliver computing speed-ups:

- better software
- new algorithms
- more streamlined hardware

The performance benefits from miniaturization have been so great that, for
decades, programmers have been able to prioritize making the writing of code
easier rather than making the code itself run faster.

The inefficiency that this tendency introduces has been acceptable, because
faster computer chips have always been able to pick up the slack.

Now, if we want to harness the full potential of modern hardware, we must
change our approach to computing.

The researchers recommend techniques like parallelizing code. Multicore
technology has enabled complex tasks to be completed thousands of times faster
and in a much more energy-efficient way.
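
As a minimal sketch of what that looks like in practice (assuming a compiler
with OpenMP support, e.g. gcc -fopenmp), a serial reduction loop can be split
across all cores:

    #include <stdio.h>
    
    int main(void) {
        const long n = 100000000;
        double sum = 0.0;
        /* Each thread accumulates a private partial sum; OpenMP
           combines the partial sums when the loop finishes. */
    #pragma omp parallel for reduction(+:sum)
        for (long i = 0; i < n; i++)
            sum += (double)i;
        printf("%.0f\n", sum);
        return 0;
    }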

We will have to deliver performance the hard way.

For algorithms, the team suggests a three-pronged approach that includes
exploring new problem areas, addressing concerns about how algorithms scale,
and tailoring them to better take advantage of modern hardware.

Many others will need to take these issues seriously if they want to stay
competitive.

Performance growth will require new tools, programming languages, and hardware
to facilitate more and better performance engineering, and it will require
computer scientists to be better educated about how software, algorithms, and
hardware can work together, instead of putting them in different silos.

~~~
entha_saava
Am I the only one who dislikes all this parallelism hype? For data analysis
and batch jobs on large amounts of data it would be great, but other
consumer-oriented applications should run fine on a single core.

It hurts when people dismiss otherwise perfectly fine languages like OCaml by
saying "no multicore", as if the majority of tasks needed some kind of
parallelism. JS and Python don't do multicore well either.

------
smallpipe
I don't think the explanation for fill_array.c and fill_array_out_of_order.c
is correct. Unless you're running on a massive server, you're not getting
anywhere near 300MB of cache.

Modern CPUs have optimizations that bypass L1 and L2 cache allocation for a
continuous burst of writes without reads, so the result here is main memory
write speed, not cache allocation.

~~~
bananaface
Both examples read from the array before writing, no? So they have to read on
each iteration.

I don't have a super solid grasp on caching but it seems like his method of
out-of-order referencing will still be hitting a valid L1 cache most of the
time, so this understates the problem. Am I misunderstanding?

~~~
rjtobin
The cartoon picture is that the first example will read everything into cache
once, whereas the second example will read everything into cache twice.

Cache lines are typically 64 bytes, so to write a single character to main
memory the following things happen (again, a cartoon picture): first, read the
64-byte region that contains the byte of interest so that it is owned by my
cache (this is called an RFO, "read for ownership"). Second, update the byte
of interest. Third, at some point, write the cache line back to main memory.

In the sequential case, we just read one 64-byte cache line at a time, update
those 64 chars, then write the cache line back to main memory.

In the second example, we first update all the even-indexed characters, which
still forces us to read in every cache line. Then we loop around and do the
odd-indexed characters, at which point we have to read the cache lines all
over again (assuming the array is big enough that the whole thing can't fit in
cache at once).
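
A sketch of those two access patterns in C (this is the cartoon version
described above, not the article's actual code -- see below, where it turns
out to do something different):

    #include <stdio.h>
    #include <time.h>
    
    #define N (300 * 1000 * 1000)   /* ~300 MB, far larger than any cache */
    static char a[N];
    
    /* Sequential: each 64-byte line is read for ownership once, all of
       its bytes are updated, and it is written back once. */
    static void fill_sequential(void) {
        for (long i = 0; i < N; i++)
            a[i] = 1;
    }
    
    /* Two passes at stride 2: every line is pulled in for the even
       indexes, evicted, then pulled in again for the odd indexes --
       twice the traffic from main memory. */
    static void fill_even_then_odd(void) {
        for (long i = 0; i < N; i += 2)
            a[i] = 1;
        for (long i = 1; i < N; i += 2)
            a[i] = 1;
    }
    
    static double seconds(void (*f)(void)) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        f();
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }
    
    int main(void) {
        printf("sequential:    %.3fs\n", seconds(fill_sequential));
        printf("even then odd: %.3fs\n", seconds(fill_even_then_odd));
        return 0;
    }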

~~~
bananaface
Am I misreading the second example's algorithm? Isn't it accessing indexes
like this:

1, 2, 4 3, 8 7 6 5, 16 15 14 13 12 11 10 9, ...

And so on?

---

Also this part:

> Cache lines are typically 64 bytes

Right, but I thought when you access an index it caches quite a lot more than
64 bytes around the index. Doesn't it throw a larger chunk of the array onto
multiple lines? If that's the case then the first example is making very
efficient use of the cache. And if modern CPUs are smart enough to prefetch
backwards, and I understand the second example correctly, isn't the second
too?

~~~
bananaface
Ok turns out I was way off, it's actually completely broken. Just ran the
code, printed j at each index.

>>> main(20)

2 4 8 16 12 4 8 16 12 4 8 16 12 4 8 16 12 4 8 16

It's not even hitting odd indexes. Over half the array will be garbage at the
end. I guess that would count as out-of-order though.

~~~
rjtobin
Yep, I misread it also: I saw j = 2 * i (which would do evens and then odds
when NUMBER is odd, or evens then evens again if NUMBER is even).

For what it really is (powers of 2 mod NUMBER), when NUMBER is large most
reads should miss the cache. So the first example has to read from main
memory only every 64th index, and the second example has to read from main
memory on almost every read. I think this agrees with what you are saying.
This also explains why it is ~5x slower, which seemed too large under my
previous understanding.
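
From the printed output above, the article's out-of-order loop is presumably
something like this (a reconstruction, not the article's code verbatim):

    #include <stdio.h>
    
    #define NUMBER 20
    static char array[NUMBER];
    
    /* Reconstructed guess: j walks the powers of 2 mod NUMBER, so for
       NUMBER = 20 it visits 2 4 8 16 12 4 8 16 ... and never touches
       an odd index. */
    int main(void) {
        int j = 1;
        for (int i = 0; i < NUMBER; ++i) {
            j = (j * 2) % NUMBER;
            array[j] = 1;
            printf("%d ", j);
        }
        printf("\n");
        return 0;
    }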

~~~
bananaface
> So the first example has to read from main memory only every 64th index

Wouldn't it be 8 per cache line (an int is 64 bits, each cache line is 64
bytes)? I'm also assuming it caches a larger chunk of the array across
multiple lines. Is that not how it works?

But I think there's a more fundamental issue here, which is that the amount
measured, 68 million bytes in a second, is what, ~65 MB? Did he just reduce
the array size until it completed in a second? Because a very significant
chunk of that is going to fit in L3 cache (on an i7 it's 8 MB), so even with a
good random-access algorithm, it would understate the problem because the data
is still contiguous.

Which seems kinda dumb to me, since the real-world problem you're likely to
run into is when your data is stored non-contiguously because it's scattered
across multiple different structs/objects, making it impossible to utilise the
cache to a significant degree at all. In that situation (very common under OO
or interpreted languages) I'd expect a _way_ more dramatic slowdown.

------
osrec
Very interesting! I think every dev should have a sense of how long loops like
these take, if only to have a good starting point for where to optimise code
(only if needed; insert the usual quote on premature optimisation here).

------
bserge
Anyone interested in how much your brain can do in a second? I'm making a
mindmap/chart for that in my quest to better understand natural intelligence
and how it could be replicated as AI.

~~~
EricBurnett
The brain is always a fun comparison. It's this crazy super-scalar
architecture that can do thousands of trillions of primitive operations per
second (petaflops), yet with pretty terrible latencies: propagation through
synapses on the order of single-digit milliseconds, individual neurons firing
on the order of once per second, input-processing latencies in the hundreds of
milliseconds, and primitive "logical" algorithms like _"how many numbers can
you count in a second?"_ netting out to _"less than 10"_.

So we end up in this state where what the brain does well computers are still
(comparatively) terrible at, and what computers do well the brain is
(comparatively) terrible at. We're slowly bridging that divide, with e.g. TPUs
focussing on larger volumes of low precision operations happening in parallel,
but we've got a long way to go yet.

------
bane
Something to think about: back in the 80s, when computers were millions of
times slower than today, they were still considered _so_ fast that it was
worth sacrificing most of the performance by letting people work in an
interpreted language, BASIC -- and people were _still_ able to be productive
and do real work with them.

~~~
userbinator
Millions, no. Thousands, yes.

Keep in mind that those 80s interpreted languages still have less overhead
than the HLLs of today.

~~~
bane
No, a typical desktop computer today is millions of times faster than a
typical 1980s 8-bit computer.

A Commodore 64 ran a 6510 at ~1 MHz (depending on NTSC or PAL). It was a
single-core CPU with no pipelining or superscalar features, and it took
multiple cycles to complete a single operation, putting its performance in the
hundreds of thousands of operations (of any kind) per second.

The PlayStation 4, a consumer-grade entertainment device, provides around
1.8 TFLOPS of performance.

Just a CPU, like an AMD 3990X, is rated at 2.3 _million_ MIPS, while the 6502
at 1 MHz is rated at less than 0.430 MIPS.

------
adverbly
Grep bytes surprised me. Why is it exactly the same as writing to memory? Is
reading from memory much faster than writing or something? Even then, I'm not
sure how it wasn't CPU-bound to compare a 4-character string 2 billion times...

~~~
mehrdada
Grepping "blah" over a sequence of zeros is definitely going to be faster than
regular grepping.
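
One likely reason (an assumption about the internals, though GNU grep is known
to fast-path its search rather than examine every byte): searching all-zero
input reduces to scanning for a byte that never occurs, which can run at
memory-read speed. A rough sketch of that kind of fast path, not grep's actual
code:

    #include <stdio.h>
    #include <string.h>
    
    /* Scan for the pattern's first byte with memchr; a full match would
       only be attempted at each hit. On a buffer of zeros, 'b' never
       occurs, so the search never leaves memchr's tight scan loop. */
    static long count_candidates(const char *buf, size_t len, char first) {
        long hits = 0;
        const char *p = buf, *end = buf + len;
        while ((p = memchr(p, first, (size_t)(end - p))) != NULL) {
            hits++;
            p++;
        }
        return hits;
    }
    
    int main(void) {
        static char zeros[1 << 20];   /* 1 MB of zero bytes */
        printf("%ld\n", count_candidates(zeros, sizeof zeros, 'b'));  /* 0 */
        return 0;
    }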

~~~
adverbly
Ohhh it's a sequence of zeroes.. oops I missed that part. Makes more sense
now.

------
EdTsft
On mobile Firefox (at least for me) nothing appears below the intro message.
The last line I see is "Made for you by ...".

------
herdrick
How is the 'write a byte to disk' loop faster than the empty loop? Maybe
xrange() is dominating in the empty loop example.

~~~
krick
It's number of bytes, not loop iterations.

~~~
herdrick
Thanks; I should have done more than glance at it.

------
Sebb767
Great idea! However, I found the hash part a bit confusing; why not use hashes
per second or bytes per second for both of them?

------
webdva
How fast can a single-board computer such as the Raspberry Pi compute the
solution to a linear programming problem (i.e., a mathematical optimization
problem with constraints of the form Ax <= b) that has, literally, trillions
of constraints? Can it even be done at all, given the large pool of
constraints to navigate through? And how does such a prospect compare to the
computations done in the middle of the twentieth century, which used far
weaker computers?

------
jancsika
For the final one, I'd like to see a comparison showing how many elements of a
linked list can be set in one second.

------
carlsborg
> 342,000,000

> bytes written in one second

A high-end NVMe SSD can probably do 3x to 10x better than ~300 MB/second of
sequential writes.

~~~
slau
Indeed. I have 2TB of storage that is only 5 times slower than my 32GB of main
memory (4GB/s vs 20GB/s).

------
coronadisaster
"A newer computer won't make your code run 1000x faster :) "

unless it is quantum?

------
martincmartin

      for (s = i = 0; i < NUMBER; ++i) {
          s += i;
      }
    

I'm surprised gcc can't figure out that this is just NUMBER*(NUMBER-1)/2 and
eliminate the loop entirely.

~~~
zozbot234
It probably can, at high-enough optimization level.

~~~
martincmartin
They used -O2, which produces the straightforward loop. [0]

But with -O3, for NUMBER > 51, it uses some other odd loop. [1]

It seems to be O(n) no matter what, though.

[0]: [https://godbolt.org/z/czcaYf](https://godbolt.org/z/czcaYf) [1]:
[https://godbolt.org/z/sP194Y](https://godbolt.org/z/sP194Y)

~~~
saagarjha
Clang knows what’s up:
[https://godbolt.org/z/7zKffz](https://godbolt.org/z/7zKffz). GCC is adding
the numbers four at a time by stuffing 0, 1, 2, 3 in an XMM register, taking a
packed addition within the register, and then doing a pairwise packed addition
of the register with 4, 4, 4, 4 to get the next four numbers.
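
For reference, a sketch of the closed form clang derives (Gauss's formula,
overflow semantics aside):

    #include <stdio.h>
    
    /* 0 + 1 + ... + (n-1) == n*(n-1)/2; clang's induction-variable
       analysis reduces the loop to arithmetic like this instead of
       iterating. */
    static unsigned long long sum_below(unsigned long long n) {
        return n * (n - 1) / 2;
    }
    
    int main(void) {
        printf("%llu\n", sum_below(1000000));  /* 499999500000 */
        return 0;
    }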

------
NoahTheDuke
This is a lot of fun! Why the use of Python 2?

~~~
sillysaurusx
I’m still salty that Python 3 fragmented the community so much. Both sides of
the debate are equally shrill.

Hopefully they won’t merge that latest pattern matching proposal. Talk about
impossible to backport code...

~~~
saagarjha
At this point, I think the Python 2 side is much less shrill simply because
they've started dying out.

------
asifgdinio
> If we just run /bin/true, we can do 500 of them in a second, so it looks
> like running any program has about 1ms of overhead.

This is off by three orders of magnitude, at least on my machine.

    
    
        $ time for i in $(seq 1000000); do true; done
    
        real 0m1.049s
        user 0m1.037s
        sys  0m0.019s
    

This article would be better if it were about computers in general, not about
Python. I specifically avoid using Python for anything serious because I find
its performance impossible to reason about. Then again, perhaps it's a good
thing to have examples that require thinking rather than just reciting the
machine's specs like most of these lists I've seen.

~~~
mrob
Your "true" is probably a shell builtin. Try it with /bin/true instead.

