
In mobile, 62.7% of computation energy cost is spent on data movement [pdf] - bshanks
https://safari.ethz.ch/architecture/fall2019/lib/exe/fetch.php?media=onur-comparch-fall2019-lecture6b-in-memory-computation-i-afterlecture.pdf
======
bshanks
One way in which the brain differs from conventional computers is that neurons
appear to serve as both CPU and memory storage. Rather than having a few fast
CPUs and a large bank of dedicated memory connected by a few buses (leading to
"the von Neumann bottleneck" of data transfer between memory and CPU), the
brain has many slow neurons, each of which has many connections to other
neurons.

So, in-memory computing may be more brain-like.

The brain appears to be very good at massive concurrency with low energy
consumption, at the expense of slow serial computation and a high error rate.

~~~
mikorym
From a mathematical view of complexity, I don't see exactly what the
difference between memory and computation is. I wonder what suitable
definitions for the two concepts would be.

~~~
shele
You can make sense of the difference abstractly by considering the tape, the
state register and the instruction table of a Turing machine.

~~~
jacquesm
A Turing machine is just one of many ways in which you could build a computing
device. It is one of the easiest to mechanize and program for, which is why we
use them, but there is absolutely no reason to believe it is the only way of
doing things. GP was getting at a more basic form of computation than the
specific one performed by Turing machines. From a mathematical perspective the
tape, the registers and the instruction tables are all implementation details.

~~~
shele
> absolutely no reason to believe it is the only way of doing things

Well, I did not suggest that.

> From a mathematical perspective the tape, the registers and instruction
> tables are all implementation details.

No, they are used to define concepts such as space complexity for Turing
machines, and there is something to learn from Turing machines; that's why
people study them.

~~~
pdpi
You can think of the Church-Turing thesis as providing a formalism for the
sense in which the tape, registers and instruction tables are all
implementation details.

------
mojuba
A big part of the data movement could be eliminated by introducing persistent
RAM, i.e. the memristor that unfortunately never happened. When the missing
component is finally invented I think we'll have an opportunity to revisit
some of the core OS concepts, such as the file. Just imagine for a second you
have a persistent random-access memory device that doesn't require you to
serialize/deserialize things, or to move and resolve binary modules. Your
binary module/app would be executed right where it's stored, provided that it
was resolved and prepared when it was first copied to your file system.
Similarly, a JSON file would be stored as a memory structure rather than as
serialized JSON (converted to text as necessary, e.g. when viewed in a text
editor). Etc. etc.

There are some really interesting changes that can potentially happen when the
memristor type of memory becomes available, possibly with its own problems
too, but with the huge benefit of moving less data.

~~~
wongarsu
> Similarly, a JSON file will be stored as a memory structure rather than
> serialized JSON

You can do this in C++ right now: mmap a file to a memory region, and just
create structs in that region.
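
A minimal sketch of what that looks like (POSIX C++; error handling omitted,
and the Doc struct here is made up for illustration):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <cstdint>

    struct Doc {            // must stay trivially copyable: no pointers, no std::string
        uint32_t version;   // a version field mitigates the second problem below
        uint32_t count;
        double   values[1024];
    };

    int main() {
        int fd = open("doc.bin", O_RDWR | O_CREAT, 0644);
        ftruncate(fd, sizeof(Doc));        // reserve space; new bytes are zero-filled
        void* p = mmap(nullptr, sizeof(Doc), PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        Doc* doc = static_cast<Doc*>(p);   // overlay the struct on the file
        doc->version = 1;
        doc->values[doc->count++] = 42.0;  // "persisted" with no serialization step
        msync(p, sizeof(Doc), MS_SYNC);    // flush at a controlled point in time
        munmap(p, sizeof(Doc));
        close(fd);
    }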

This has two problems:

- normally you want to save at controlled points in time; otherwise you have
to worry about recovering from states where some function updated part of the
data and then crashed (phone ran out of power, etc.)

- just writing a bunch of internal data structures into a file used to be
moderately popular and has great performance, but it's a major headache
whenever you update them. You end up implementing a versioning scheme and
importers for migrating old files to your new application version. At that
point, a file format that is designed for data exchange is less of a headache.

In general mmap already offers you a way to treat your disk like memory with
decent performance (thanks to caching), and the number of good use cases
turned out to be somewhat limited. I doubt just making that faster with new
technology will change much.

~~~
jacquesm
There is a huge difference between what the GP proposes and your
interpretation: the GP talks about a structure that is not just a memory-
backed copy of the data in the file (which _still_ requires an in-memory
structure pointing to the bits and pieces to make sense of it, or a painful
parse step on every access and no way to write back to it).

So the whole structure would reside in memory in its native form, for
instance as a graph, rather than as the contents of the JSON file, and could
be operated on directly.

Hence the 'serialized' in the part that you quoted. Once you serialize it the
whole thing becomes hamburger and needs to be parsed again before you can
operate on it.

~~~
nostrademons
wongarsu is talking about actually mmaping the C structs in memory out to
disk. There's no serialization involved: the format on disk is exactly the
same as the bytes in RAM, and you rely on the OS to page them in on demand
(notably, parts of the file that are never touched by the CPU are never paged
in). You get around the invalidation of pointers by never using them: all data
is stored as flattened arrays, and if you need to reference another object,
you store an array index instead of a pointer.
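
A bare-bones illustration of that layout (the types are invented for the
example):

    // Every cross-reference is an index into a flattened array rather than a
    // pointer, so the bytes are position-independent and can be mmap'd anywhere.
    #include <cstdint>

    constexpr uint32_t kNone = UINT32_MAX;  // sentinel for "no node"

    struct Node {
        uint32_t value;
        uint32_t first_child;   // array index, not a pointer
        uint32_t next_sibling;  // array index, not a pointer
    };

    struct TreeFile {
        uint32_t version;       // bump this when the layout changes
        uint32_t node_count;
        Node     nodes[1];      // really node_count entries; root is nodes[0]
    };

    // Works identically whether t points into RAM or into a freshly mapped
    // file, because no absolute addresses are ever stored.
    uint32_t sum_children(const TreeFile* t, uint32_t n) {
        uint32_t sum = 0;
        for (uint32_t c = t->nodes[n].first_child; c != kNone;
             c = t->nodes[c].next_sibling)
            sum += t->nodes[c].value;
        return sum;
    }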

This is a more common technique than most people suspect. It's taught in most
operating system courses [1]. It's the basis for how SSTables (the primary
read-only file format at Google, and the basis for BigTable/LevelDB) work, as
well as for indexing shards. It was how the original version of MS Word's .doc
files worked, and was also why it was so difficult to write a .doc file parser
until Microsoft switched to a versioned serialized file format sometime in the
90s. I think it's how Postgres pages work (the DB allocates a disk page at a
time and then overlays a C structure on top of it to structure the bytes), but
I'm not familiar enough with that codebase to know for sure. It's how zero-
copy serialization formats like Cap'n Proto & FlatBuffers work, except they've
been specifically engineered to handle the backwards-compatibility aspects
transparently.

It has all the problems that wongarsu mentions, but also huge advantages in
speed and simplicity: you basically let the compiler and the OS do all the
work and frequently don't need to touch disk blocks at all.

[1] [https://www-users.cs.umn.edu/~kauffman/4061/lab06.html](https://www-users.cs.umn.edu/~kauffman/4061/lab06.html)

~~~
jacquesm
I think we already covered that:

[https://news.ycombinator.com/item?id=21647777](https://news.ycombinator.com/item?id=21647777)

------
pjc50
Including a copy of Maslow's hierarchy of needs and some greenfield-vs-
factory slides seems to be... overselling it a bit?

Fortunately it pulls back towards a "what are the easy wins" approach: in-
memory initialising (rather than streaming zeroes over the bus, ask the
memory to zero itself) and in-memory copy.

The three real problems are:

- this is only useful if the whole stack "knows" about it, or can be
transparently optimised to use it. Otherwise someone running a JavaScript for
loop to set values to zero will ruin it.

- this relies on the data being in DRAM-row-size chunks, and on moving it
around within the same physical chip. That may also require re-architecting
to make such cases common enough to be useful. Locality issues are the
downside of distributed systems; the whole CPU architecture is oriented
towards continuing to pretend that everything happens in a defined order in a
single place.

- possible security issues (rowhammer?): these may seem far off, but if you
don't think about them upfront they get very expensive later.

(Worth recapping: DRAM is fabricated on a fundamentally different silicon
process, such that implementing complex logic on it is slow and expensive.)

------
kstenerud
It's not just a problem with data movement inside the machine. Data movement
between processes and machines is also horribly wasteful.

Text encoding/decoding wastes massive amounts of energy, which is why I've
been developing a format over the past 2 years with the benefits of binary
(smaller, more efficient) as well as text (human readability/editability):
[https://github.com/kstenerud/concise-encoding](https://github.com/kstenerud/concise-encoding)

~~~
satanspastaroll
Waste can be very subjective.

Exchanging text makes the format easier to debug and accessible to humans,
which can be a purpose in itself.

~~~
9c8675a8
It should only be easy to debug and accessible to humans when needed IMO. I
really think it should be possible to load a "production message", log event
or whatever into a debug parser which only then spits out human readable
output.

~~~
satanspastaroll
Even switching between modes may introduce new kinds of undefined behavior,
which may not be visible on the text-only side.

~~~
kstenerud
Every extra step in the process adds another potential failure point. But
we're fast approaching a data and energy crunch that will push the industry
towards binary formats once more. This is my attempt to keep that shift sane,
and avoid the mess of the 80s and 90s.

The implementations are almost done now, and my first tool will be a command-
line utility that reads one format and spits out the other, so that you can
take a binary dump from your production system using tcpdump or wireshark or
whatever, and then convert it to a human-readable format to see what's going
on. I'll probably even put in a hex reader so that you can log the raw message
and then read it back:

    2019-10-02:15:00:32: Received message [01 76 85 6e 75 6b 65 73 88 6c 61 75 6e 63 68 65 64 79]

    $ ceconv --hex 01 76 85 6e 75 6b 65 73 88 6c 61 75 6e 63 68 65 64 79
    v1
    {
        nukes = launched
    }

~~~
satanspastaroll
I doubt there is any significant crunch coming from just encoding/decoding
the stream. Most of the time still goes to waiting for network resources and
evaluating megabytes of add-in JS. The same problem applies to JS, and the
counter-arguments of openness and usability are the same there too.
Compressed transmission and binary parallel transmission in HTTP/2 also help
with the size of the comms.

The project still seems cool, I'll have to have a deeper look into it soon

------
m0zg
The entire hardware-acceleration industry is based around this fact.
Basically every single accelerator tries to move memory closer to
significantly simpler (and wider) compute. At the extreme of this is compute-
on-DRAM: certainly not a new idea, but one that has yet to materialize.
Systolic architectures are also very efficient. GPUs are far less so; their
main advantage is relatively good tooling and extensive programmability, not
energy efficiency per se. And CPUs utterly and completely suck at high-
throughput workloads, power-efficiency-wise. They do often have enough
compute to do e.g. lightweight deep learning, though.

Years ago it cost 8pJ/mm to move a byte on-chip. It's probably closer to
5pJ/mm now, but ALUs consume a fraction of this energy to do something with
that byte. And once you leave the chip and hit the memory bus, things get
_really_ slow and expensive.

------
aaxa
I think this headline is very misleading. This is about computational energy
cost and does not represent what the headline implies (that it's 62.7% of
total energy consumption).

~~~
brohee
Yeah, on mobile the screen and radio dominate the compute part by quite a
bit. Not that it's not still worth optimizing.

~~~
lonelappde
Screen and radio are both data movement tools, so I don't understand the OP's
classification process.

~~~
andai
The next step after in-memory computing: in-screen computing!

~~~
rubinelli
Maybe not computing, but cathode ray tubes were used as RAM in the forties:
[https://www.radiomuseum.org/forum/williams_kilburn_williams_...](https://www.radiomuseum.org/forum/williams_kilburn_williams_kilburn_ram.html)

------
jedberg
The number one cost of any distributed system is data movement.

Mobile phones are essentially leaf nodes in a huge distributed system, where
they interact with "downstream dependencies", i.e. the servers that provide
all the functionality of their apps.

So it makes sense that moving data would use the most power.

------
aberforth123
A small part of the solution: Let's block ads! Proof:
[https://webtest.app/?url=https://www.wowhead.com](https://webtest.app/?url=https://www.wowhead.com)

~~~
Enginerrrd
Wow, this website is cool and makes for a fun and very interesting (and
telling) comparison between old reddit and the redesign:

[https://webtest.app/?url=https://www.reddit.com](https://webtest.app/?url=https://www.reddit.com)

vs.

[https://webtest.app/?url=https://old.reddit.com](https://webtest.app/?url=https://old.reddit.com)

An order of magnitude increase in page size and energy consumed, and it takes
5 times as long to load, with 5 times the ad requests.

~~~
MisterTea
Given the current environmental state, there should be a big push to shame
these websites for wasting energy. Perhaps if this happened on a larger scale
we'd see better designs produced that don't rely on highly wasteful and
inefficient virtual machines that are mistaken for web browsers.

~~~
aberforth123
Good idea. Where do we start this shaming? I don't have the resources.

------
localhost
Cerebras's "wafer scale engine" [1] takes some of these ideas and applies
them narrowly to deep-learning training: 400,000 cores, 18 GB of on-chip
memory, and 9.6 petabytes per second of memory bandwidth, with 1.2 trillion
transistors on a gigantic 46,225 mm^2 die.

[1] [https://www.cerebras.net/cerebras-wafer-scale-engine-why-we-...](https://www.cerebras.net/cerebras-wafer-scale-engine-why-we-need-big-chips-for-deep-learning/)

------
stokedmartin
"We observe that data movement between the main memory and conventional
computation units is a major contributor to the total system energy
consumption in consumer devices. On average, data movement accounts for 62.7%
of the total energy consumed by Google consumer workloads." [0]

[0] [https://people.inf.ethz.ch/omutlu/pub/Google-consumer-worklo...](https://people.inf.ethz.ch/omutlu/pub/Google-consumer-workloads-data-movement-and-PIM_asplos18.pdf)

------
Veedrac
A major improvement to the energy cost of data movement will come from closer
memories, like HBM and on-die stacked DRAM.

~~~
pdimitar
What do you mean by "closer memories"?

~~~
Veedrac
Putting memory closer to the CPU, through dense and short-distance
interconnects, allowing for much more efficient communication. Eg.
[https://www.youtube.com/watch?v=-besHp8HLxo](https://www.youtube.com/watch?v=-besHp8HLxo).

------
alecco
Slide 3. Paper: [https://people.inf.ethz.ch/omutlu/pub/Google-consumer-worklo...](https://people.inf.ethz.ch/omutlu/pub/Google-consumer-workloads-data-movement-and-PIM_asplos18.pdf)

Still, this doesn't sound right. Screen should be dominant.

~~~
willvarfar
I don't think they are talking about the phone's total energy; the charts in
that excellent paper you linked have costs for CPU, L1, L2, DRAM, etc., but
no screen, GPU, radio, etc.

A long time ago I was a technical product manager for, among other things,
this stuff. At the time, the screen completely dominated and the GPU was a
distant second, unless you were playing a 3D game, in which case it used
almost as much as the backlight. CPU and radio were a long way behind them.

My intuition is that this hasn't changed, as I can still get dramatically
better battery life on my new iPhone in iBooks by playing with the brightness
slider.

~~~
lonelappde
Per Carroll's results in "An Analysis of Power Consumption in a Smartphone":
the backlight is either the lowest-power local component (10%, 40 mW) or the
highest (50%, 400 mW), depending on how bright it is turned up.

The GSM radio is the highest-power component (about 400 mW) when it is
working continuously (during a phone call).

Related: a big contribution of smartphone OS optimizations and things like
Wi-Fi 6 is being smarter about turning the radio off for micro time slices
when it's not needed.

Laptops are similar, but it depends a lot on whether you are using a 15 W
ultrabook CPU with no GPU or a 65 W mobile-workstation CPU plus a GPU.

------
colechristensen
With all of that ad overhead it's reached a point where the actual costs
incurred by the user rival the ad revenue gained by the publisher.

------
Gravityloss
Most of the data is useless anyway. The bloat factor over mostly-text
content and an adequate-quality picture is huge, probably 10x.

------
euroPoor
ethz going _strong_ these days on HN

------
extropy
TLDR: let's put compute inside the memory. I'm leaning strongly towards
calling this bullshit.

Computing closer to memory is what we have been doing for the last 10 years,
and that's why we have ever-increasing hierarchical caches.

Yes, the ALUs are a tiny part of the power budget, because moving/syncing
data is the hard problem. If you want massive parallelism, use a GPU.

The in-memory clone and zero could make some sense, but generally you want to
start writing to that memory soon afterwards, meaning you still need to pull
it into your cache and the benefit is negated.

~~~
imtringued
PIM is already a reality it's just a matter of time until it sees broad
adoption. PIM does not suffer from the crippling limitations of GPUs which
only perform well with arithmetic bound problems, cannot run conventional non
SIMD code efficiently and need a relatively large batch size.

For instance try deserializing a million JSON strings with a GPU. The end
result is a graph like memory structure which GPUs usually struggle with. GPUs
cannot take advantage of sharing the instruction stream here because it will
hit branches for almost every single character and diverge quickly. And
finally if you only want parse one JSON object then the GPU is worthless.

A PIM based solution would not struggle with heterogeneous non-batched
workloads with arbitrary layout at all but still offer the same performance
advantages that GPUs enjoy compared to CPUs and reduce energy usage at the
same time.

