
Computers are fast (2014) - Tomte
http://jvns.ca/blog/2014/05/12/computers-are-fast/
======
seanp2k2
This is cool, and yet I’m continually unimpressed with how slow the actual
experience of using a computer is. Some of the simple text-based UIs from the
80s respond in <50ms, but it takes 10+ seconds from opening the Facebook app
on iOS 11 on a 6S+ before I can start typing into the post text area. We can
load a gigabyte in 2 seconds, but how long does it take to complete an actual
task? Basically, I think that we’re not solving the actual experience problems
with computers or in mobile. UI lag has gotten noticeably worse in the past
few years. Android was terrible at this when it first came out, but iOS 11 did
_something_ that causes the Home button to now skip a few beats before
responding. Excel was reasonably fast, but then everyone moved to GSheets, and
it’s back to being as slow as an original Pentium running Win95 doing the same
calculations (I’d love to actually benchmark this — “normal” $600-ish PCs from
every 5 years going back as far as I could running some spreadsheet
responsiveness tests).

Computers are fast, but software is slow. Software gets slower faster than
computers get faster. “What Andy giveth Bill taketh away” is still as true as
ever. I’d love to see the industry move to focusing on responsiveness and e.g.
timing on productivity tasks for power users (automate the tests) instead of
thinner bezels and flatter UIs.

~~~
Cacti
UI is incredibly slow just about everywhere. Electron is a good example.

The number of people developing desktop UIs using native toolkits (or libraries
with native wrappers) is disturbingly low. It seems everyone is using these
cross-platform behemoths that are just slow as hell.

~~~
DigitalJack
That's because UIs are about the most uninteresting, unfun thing to program.
If you have to do it, you sure don't want to do it for more than one platform.

~~~
martin_ky
I beg to differ. I find UI programming (and by extension any kind of graphics
programming) interesting and rewarding. I can see the results of my work
immediately. The visual confirmation after each code-compile-debug cycle is
satisfying. I believe many UI programmers share this view.

Personal taste aside, UIs tend to be slow for different reasons:

I think the major reason is the mindset of "oh, it's just UI", thinking that
UI is not really 'that' important [part of a bigger system] or that anyone can
do it. Not enough attention and expertise then goes to UI.

On the other hand, if UI gets enough attention, the effort tends to go to
design mostly. You end up with a design department producing beautiful artwork
and one poor overworked programmer putting it together. Management tends to
overlook the fact that UI is not just Photoshop or After Effects work, but
someone needs to actually write the code that uses those pretty graphics
assets. This programming part is often misjudged as a trivial step.

Don't even get me started on motion design. This whole discipline can be
summed up as "how can we use up more CPU/GPU cycles and make things less
responsive".

Then the market became saturated with UI frameworks built on top of web
browsers, which are themselves a thick, slow and bloated layer - a far cry from
native UI performance. Unfortunately this is becoming the norm due to obvious
commercial advantages - it's cross platform and it's easier to hire JS UI
programmers. I mean good luck finding a programmer with experience in several
native UI kits (say Cocoa, MFC and Android) at once. Even finding someone with
adequate experience in one of them is hard enough nowadays.

Doing UI properly is expensive.

------
kevindqc
>The movdqa instructions have to do with accessing memory, and it spends 32%
of its time on those instructions. So I think that means that it spends 32% of
its time accessing RAM, and the other 68% of its time doing calculations.

I think that's wrong? The first movdqa (15%) moves from RAM to a register, but
the second movdqa (17%) moves from one register to another?

17% seems a lot for something that simply moves from a register to another
register! It's even slower than the first movdqa, moving from RAM to register,
but registers are supposed to be something like 2 orders of magnitude faster
to access than RAM (don't remember exactly).

Maybe it's because there are data dependencies? The 3 instructions before the
2nd movdqa use the register xmm0, so the movdqa has to wait for these to
finish before it can execute (aka bubble or pipeline stall)

https://en.wikipedia.org/wiki/Bubble_(computing)

~~~
em3rgent0rdr
Also I'm not sure it makes sense to say the program spends X% time with memory
and (100-X)% time with computation. Really a CPU will overlap memory access
with compute.

------
CodesInChaos
Processing 1GB in 250ms seems rather slow for such a simple problem. That's
about 0.5-1 CPU cycles per byte. I would have expected well written SSE code
to be several times faster than that.

Something like accumulating to three separate SSE registers per iteration and
then combining them outside the loop.
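A minimal sketch of what that could look like, assuming byte-wide adds are acceptable because only the sum mod 256 is wanted. The function name `sum_bytes_3acc` is hypothetical, the choice of `_mm_add_epi8` is an assumption (the comment does not say which lane width to use), and a scalar fallback covers builds without SSE2. Whether this actually beats the article's version depends on memory bandwidth, so it would need measuring.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Sum all bytes of buf modulo 256.
   Hypothetical helper: name and signature are illustrative only. */
uint8_t sum_bytes_3acc(const uint8_t *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    uint8_t total = 0;
    size_t i = 0;

#if defined(__SSE2__)
    /* Three independent accumulators, so consecutive adds do not
       form a single dependency chain. */
    __m128i acc0 = _mm_setzero_si128();
    __m128i acc1 = _mm_setzero_si128();
    __m128i acc2 = _mm_setzero_si128();

    for (; i + 48 <= len; i += 48) {
        acc0 = _mm_add_epi8(acc0, _mm_loadu_si128((const __m128i *)(buf + i)));
        acc1 = _mm_add_epi8(acc1, _mm_loadu_si128((const __m128i *)(buf + i + 16)));
        acc2 = _mm_add_epi8(acc2, _mm_loadu_si128((const __m128i *)(buf + i + 32)));
    }

    /* Combine outside the loop, then fold the 16 byte lanes together. */
    __m128i acc = _mm_add_epi8(acc0, _mm_add_epi8(acc1, acc2));
    uint8_t lanes[16];
    _mm_storeu_si128((__m128i *)lanes, acc);
    for (int k = 0; k < 16; k++)
        total = (uint8_t)(total + lanes[k]);
#endif

    /* Scalar tail (and the whole input when SSE2 is unavailable). */
    for (; i < len; i++)
        total = (uint8_t)(total + buf[i]);
    return total;
}
```

Timing this against a single-accumulator loop on a buffer already in RAM would show whether the extra registers help or whether memory is the limit.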

~~~
holycrapwhodat
Maybe if you read the entire file into RAM first and then timed _just_ the
processing of the bytes.

But I'd wager it's probably limited by the SSD speed and filesystem caching
code at this point.

(Would be easy enough for an assembly guru to prove either of us right or
wrong!)

~~~
function_seven
The 250ms figure does not include loading the file into RAM. It's just memory
access and number crunching.

------
harry8
Julia Evans is inspiring. I really, really like her humility and tenacity. The
write up of her displaying these characteristics is also just superbly
executed.

Attitudes are contagious and I badly want to catch hers!

------
stabbles
Wouldn't you get the SIMD stuff for free when compiling with `-O3` and
`-march=native`?

My guess would be that SIMD wouldn't improve this piece of code, because all
data is touched only once and the bottleneck is memory.

~~~
Sharlin
Autovectorization is still largely an unsolved problem. In some cases, yes,
but you shouldn't count on that.

~~~
semi-extrinsic
SIMD is like people boarding an airplane. If you've gotten stuff lined up
correctly, it's blazing fast, but usually you have a random mix that gives
lots of idling.

------
wmu
I don't understand one thing: the original scalar version sums bytes, while
the SIMD version sums 32-bit values and returns the sum mod 256. The SIMD
version might be much simpler if just byte-wide operations were used. Did I
miss something important?
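To make the point concrete, here is a small scalar sketch (not the article's code; both function names are hypothetical) showing that a wide accumulator reduced mod 256 at the end and a byte-wide accumulator that wraps on every add should give the same answer, since 256 divides 2^32.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Wide accumulator, reduced mod 256 only at the end. */
uint8_t sum_wide_then_reduce(const uint8_t *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    uint32_t acc = 0;
    for (size_t i = 0; i < len; i++)
        acc += buf[i];            /* unsigned wraparound is well defined */
    return (uint8_t)(acc & 0xFF);
}

/* Byte-wide accumulator that wraps mod 256 on every add. */
uint8_t sum_bytewise(const uint8_t *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    uint8_t acc = 0;
    for (size_t i = 0; i < len; i++)
        acc = (uint8_t)(acc + buf[i]);
    return acc;
}
```

The same reasoning applies per lane in a SIMD register, which is why byte-wide vector adds would be enough here.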

~~~
booblik
You didn’t miss anything. This code is total crap. It should use paddb,
instead of unpacking into dwords. Also the link is from 2014, so AVX2 could do
that twice as fast. Bottom line: code could be at least 8 times faster. Kids,
don’t learn from this.

------
konceptz
"I’m used to writing in dynamic programming languages, which definitely do not
process 1GB files in 0.25 seconds. Fun!"

1GB in .25 seconds = 4GB in 1 second. Kind of a simplistic takeaway
considering the article's title.

------
tyingq
They try to make it faster, but I would guess the bottleneck in the first try
is the one-byte fread().

Reading into a reasonable buffer size would likely speed things up.
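A rough sketch of the buffered approach, assuming a 64 KiB chunk (an arbitrary choice, not a figure from the article) and a hypothetical function name `sum_stream_buffered`:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

enum { CHUNK_SIZE = 64 * 1024 };  /* assumed buffer size; tune by measuring */

/* Sum every byte remaining in the stream, modulo 256.
   Uses a static buffer, so this sketch is not reentrant. */
uint8_t sum_stream_buffered(FILE *f)
{
    assert(f != NULL);
    static unsigned char chunk[CHUNK_SIZE];
    uint8_t total = 0;
    size_t n;

    /* One fread call per chunk instead of one call per byte. */
    while ((n = fread(chunk, 1, sizeof chunk, f)) > 0) {
        for (size_t i = 0; i < n; i++)
            total = (uint8_t)(total + chunk[i]);
    }
    return total;
}
```

A caller would open the file with `fopen(path, "rb")`, pass the stream in, and close it afterwards. Error handling via `ferror` is left out to keep the sketch short.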

------
hateful
Fun!

~~~
stephc_int13
Prefetching at least one cache line in advance should speed up the whole
process.

Also, multithreading.

~~~
harry8
I doubt that. After the second sequential cache miss the hardware will be
prefetching for you.

Worth trying like all these things, but I'm not convinced you'd see a
difference. Go ahead and prove me wrong!
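For anyone who wants to run that experiment, a minimal sketch with an explicit software prefetch might look like the following. It assumes GCC or Clang (`__builtin_prefetch`), a 64-byte cache line, and a prefetch distance of four lines; all three are guesses to tune, and the function name is hypothetical. On a plain sequential scan the hardware prefetcher may well make this a no-op in practice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(p) __builtin_prefetch((p), 0, 0)  /* read, low temporal locality */
#else
#define PREFETCH(p) ((void)0)                      /* no-op on other compilers */
#endif

/* Assumed values, not measured ones. */
enum { CACHE_LINE = 64, PREFETCH_AHEAD = 4 * CACHE_LINE };

/* Sum all bytes modulo 256, hinting at data a few cache lines ahead. */
uint8_t sum_bytes_prefetch(const uint8_t *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    uint8_t total = 0;
    for (size_t base = 0; base < len; base += CACHE_LINE) {
        /* Only issue the hint while the target is still inside the buffer. */
        if (base + PREFETCH_AHEAD < len)
            PREFETCH(buf + base + PREFETCH_AHEAD);

        size_t end = base + CACHE_LINE < len ? base + CACHE_LINE : len;
        for (size_t i = base; i < end; i++)
            total = (uint8_t)(total + buf[i]);
    }
    return total;
}
```

Comparing its wall-clock time against the same loop without the `PREFETCH` line, on a buffer much larger than the last-level cache, would settle the question for a given machine.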

------
coolspot
TLDR: the author wrote a C program that sums all the bytes of a 1GB file, mod
256, into one byte.

It runs in 2.5 sec the first time, when it reads the file from the SSD, and in
just 0.6 sec the second time, when the file's contents are already in the OS
disk cache.

~~~
Sharlin
Eh, that's just the first step. There's also mmapping, SIMD vectorization,
cache locality, and using the perf tool.

~~~
tyingq
Mmap seems like a silly hammer when each byte is read just once. Just read
into a reasonable buffer.

~~~
Sharlin
Yeah, but it's clear she's just documenting her learning experience.

