
What's new in CPUs since the 80s and how does it affect programmers? - AndrewDucker
http://danluu.com/new-cpu-features/
======
fiatmoney
It's strange the extent to which programming interview questions reflect an
80s view of the cost of operations, particularly the overabundance of linked
list and binary tree questions. Cache misses ain't free and memory scans are
relatively cheap after you do the initial lookup.

~~~
trsohmers
Yep! The big thing people don't realize is that memory operations are the
biggest cost in terms of both time and energy.

While it takes ~100 picojoules to do a double precision floating point
operation on an Ivy Bridge Intel processor, it takes 4200 picojoules to move
the 64 bits from DRAM to your registers. Most people assume that the huge
power usage is because you need to move data from off the chip, but the
reality (and surprising fact to most people) is that over 60% (~2500
picojoules) of the energy usage of moving the data is consumed by the on chip
cache hierarchy. That doesn't mean the SRAM caches themselves, but all the
additional logic that makes it hardware managed (TLBs, etc) that give you
functionality like virtual memory translations, cache coherency, etc.

Getting rid of all of that cruft that has been added since the 80s to make
programmers lives easier would actually reduce power consumption and latency
significantly... My startup is working on that problem by removing all of that
additional logic from the hardware and instead having it managed at compile
time. The best thing, though, would be having programmers really think about
locality when writing their programs.

~~~
CyberDildonics
I hope you realize that managing memory latency has been the most fundamental
motivation throughout Intel for 25 years. It isn't a problem waiting for a
revelation about its existence.

I speed up programs all the time by reorganizing how memory is laid out and
accessed, many times by factors of 12x or more.

I would think that making SIMD, parallelism, and multiple simple loops
(instead of one bigger loop) much easier to program with would be more
realistic. Something like a fusion of ISPC, Rust, C++11, and Julia.

~~~
trsohmers
First off, I was referring primarily to memory latency between the core and L1 cache, which
has improved over the past 2.5 decades only through the combination of Moore's
law getting the wires shorter (which is going to end soon, at least for
silicon) and increasing clockspeed (which really ended with the breakdown of
Dennard scaling a little over 10 years ago). Intel's L1 cache latency has not
improved in almost 10 years, with it still at 4 cycle latency (at best). The
improvement has only been that there is more data you can access at L1, but
the time until data hits your registers has not improved at all.

Our scratchpad (the analogous term for software managed memory, in comparison
to a traditional hardware managed L1/L2/L3 cache system) for instance has
single cycle latency along with zero bus turnaround. Along with our ability to
guarantee memory latencies between any locations in memory, our whole goal is
to try to never have a wasted cycle.

------
ansible
What interested me in the design of the Mill CPU is how it throws out the
usual design of machine language.

I'm not talking about assembly language, and I know the difference, BTW.

In the name of software compatibility, we're still trying to program CPUs
using machine language that wouldn't be so strange to a programmer from the
1980's. Sure, there are more registers, and some fun new stuff, but it isn't
all that different.

Except that in the 1980's, the CPU actually implemented those instructions.
These days, it is all a lie, especially with regards to things like register
sets and aliasing. Yes, of course, logically, what the programmer wanted to
happen does, but today even programming at assembly level, you are far, far
removed from what the CPU is actually doing.

Edit: Here's the website:
[http://millcomputing.com/docs/](http://millcomputing.com/docs/)

~~~
gizmo686
It is worth noting that many of the Mill's features are also a "lie" (for
example, the belt is still just registers and register renaming). It's just
that the Mill uses lies designed for modern processor technology.

~~~
w0000t
False. The Mill does not do register renaming. (Or use "lies"; I failed to
comprehend your second sentence.)

Here is a talk from one of the designers explaining how it is done:
[https://www.youtube.com/watch?v=QGw-cy0ylCc&feature=youtu.be&t=24m38s](https://www.youtube.com/watch?v=QGw-cy0ylCc&feature=youtu.be&t=24m38s)

~~~
stdbrouw
"Lie" in this context means something like "abstraction or metaphor that
doesn't present itself as such."

------
mgrennan
This is a wonderful post that no-one will care about. This may be the only
post.

Today, programmers are more interested in the rate at which they can turn out
"Just Works" code. These kinds of details are far, far too deep in the weeds
for continuous development artists.

~~~
detaro
> _This is a wonderful post that no-one will care about. This may be the only
> post._

I think you are underestimating the crowd here. Last time it was posted it got
quite a few responses:
[https://news.ycombinator.com/item?id=8873250](https://news.ycombinator.com/item?id=8873250)
(already a while back, but might be interesting for reference/to bring topics
up again)

~~~
lfowles
There also seems to be a constant barrage of assembly/CPU posts (I counted 4
posts, including this one, on the front page earlier), so I'm not sure where
GP gets the idea that no one here cares.

------
annnnd
> A few years back, I used a Pentium 4 system...

I hate it when blog posts don't include the date. Judging by the linked
question this blog post must be at most a few months old, but there was
nothing on the page that would tell me that. One of the most important
questions is whether the information in the article is still applicable... In
this case it is, but it would be nice if readers knew it. Not to mention 10
years from now when somebody stumbles across this writeup. </rant>

EDIT: Nice article though. :)

~~~
stefantalpalaru
It's from January 11 2015, according to this page:
[http://danluu.com/blog/archives/](http://danluu.com/blog/archives/)

------
userbinator
"However, loads can be reordered with earlier stores. For example, if you
write

    
    
        mov 1, [%esp]
        mov [%ebx], %eax
    

it can be executed as if you wrote

    
    
        mov [%ebx], %eax
        mov 1, [%esp]"
    

The confusing mix of Intel and GAS/AT&T syntax aside, this is not possible
since it would give different results when ebx == esp.

~~~
caf
It's not a static decision though - the memory accesses can still be reordered
when %ebx != %esp, though of course this only ends up visible where there are
multiple CPUs involved.

For example, consider the case above and assume that the initial conditions
are:

    
    
      (%esp) == 0
      (%ebx) == 0
    

Now imagine we have a second CPU executing simultaneously, with the same %ebx
and %esp as the first CPU, but executing this:

    
    
      mov $1, (%ebx)
      mov (%esp), %eax
    

Now if there were no reordering, either one or both CPUs must end with %eax ==
1. However, the hoisting of loads before earlier stores means that you can
actually end up with both CPUs having %eax == 0 after this executes.

~~~
userbinator
You're correct about it being possible for other CPUs to see "crossed"
loads/stores, but _within one CPU/stream of instructions the programmer-visible
ordering is absolutely preserved_, because if it wasn't, a lot of existing
software would break. In your example, if ebx == esp, and both CPUs executed
those two instructions, then they must both see eax == 1. I think you had this
scenario in mind instead (where A and B are _different_ memory locations):

CPU 1:

    
    
      mov $1, (A)
      mov (B), %eax
    

CPU 2:

    
    
      mov $1, (B)
      mov (A), %eax
    

Where eax == 0 on both CPUs is definitely possible.

~~~
caf
We are of course in violent agreement.

The point to note is that the decision on whether or not the reordering can
occur, based on whether A and B are the same, is made dynamically at the
point of execution.

------
nickpsecurity
It's nice but let's be clear on the best feature: application-accelerators. I
brought up the Cavium Octeon 3 in a discussion on game systems:

[http://www.cavium.com/OCTEON-III_CN7XXX.html](http://www.cavium.com/OCTEON-III_CN7XXX.html)

Intel, IBM, mainframes, and embedded SOC's are all taking the same approach to
a degree of combining 1-N general-purpose cores with dedicated hardware for
performance-critical stuff or just stuff that shouldn't add overhead. The
Octeon line is an extreme example, with accelerators added until they number
around 500. The most modern variant is the "semi-custom" business of Intel
and AMD, which is making more of this happen for those with the money.

This is peripheral to an improvement in computers known as network on a chip.
This plus extra layers of functionality in silicon lets the companies easily
do stuff like that. The next step is incorporating FPGA logic in the
processors. We already see it in the embedded scene. Just wait till Intel uses
Altera technology in Xeons. SGI's Altix machines with FPGA's using NUMA were
already quite powerful. Imagine the same benefit of no remote-memory access
for the FPGA logic working side-by-side with CPU software. Will be badass.

------
Aoyagi
This is a question from someone rather ignorant, so please don't hit me: why
didn't Bulldozer affect programmers? It (or Piledriver) seems to be doing
quite well in applications with good threading.

~~~
Symmetry
In general all x86 chips are carefully designed to fulfill the abstraction
that is the x86 ISA, so to the programmer it shouldn't matter whether their
code runs on an Atom or a Bulldozer or an i7.

------
Tobu
I don't think "good at executing bad code" applies to embedded CPUs. Of
course, good code is still more the province of the compiler (with PGO, for
example).

------
hackuser
Who is Dan Luu? Does he have expertise in this area? Is he a leading expert?
(No offense to Dan if he is reading this; I just don't know.)

~~~
ajdlinux
You can start by looking at
[http://danluu.com/about/](http://danluu.com/about/). Assuming his CV is
accurate he presumably has a reasonable amount of knowledge in the area.

------
bitwize
It's a great article but this drove me nuts:

DON'T USE FUCKING AT&T ASSEMBLY SYNTAX.

Literally everyone uses Intel syntax, except in those situations where they
are forced to use AT&T syntax (inline assembly in C on Unix, somehow your box
doesn't have NASM). Using AT&T syntax for examples just confuses people. Write
assembler the right way. Destination, source. Come on.

~~~
chrisseaton
The default output format of HotSpot is AT&T. The default output format of GCC
is AT&T. The default output format of Clang is AT&T. Tools like the universal
compiler output viewer [https://gcc.godbolt.org](https://gcc.godbolt.org) use
AT&T by default.

Intel isn't the universally accepted format you think it is. I'm a
professional VM researcher and I use AT&T more often than Intel. In fact I
most often see Intel when reading Intel documentation.

You're shouting about nothing more empirical than tabs vs spaces, and even
then I think your side is actually in the minority.

