
Upmem Puts CPUs Inside Memory to Allow Applications to Run 20 Times Faster - rbanffy
https://www.hpcwire.com/off-the-wire/upmem-puts-cpus-inside-memory-to-allow-applications-to-run-20-times-faster/
======
bem94
This is really interesting, but note the power increase seen inside the DRAM
modules.

Anandtech have an excellent analysis:
[https://www.anandtech.com/show/14750/hot-chips-31-analysis-i...](https://www.anandtech.com/show/14750/hot-chips-31-analysis-inmemory-processing-by-upmem)

> The operation itself [inside the dram module] actually requires 2x the power
> (20 pJ compared to 10 pJ [inside the main CPU]), but the overall gain in
> power efficiency is 170 pJ vs 3010 pJ, or just under 20x

> One thing that this slide states that might be confusing is the server power
> consumption – the regular server is listed as only 300W, but the PIM
> solution is up to 700W. This is because the power-per-DRAM module would
> increase under UPMEM’s solution.

I'm assuming that those numbers represent "peak" power, and that when idle the
compute parts of the DRAM can be power/clock gated. The implication being that
if you are doing lots of in-memory analytics, you'll get a power saving, but
if you switch to "something else" and don't get good utilisation of the in-
memory compute, you'll probably ruin your power/energy efficiency.

I guess that means the future improvements will involve bringing the power
consumption of the modified DRAM modules back in line with their "normal"
cousins.

~~~
sp332
Watts are Joules / second. If you're doing operations that take 1/20 as much
power each but you're doing them 20x faster, it takes the same amount of power
in Watts. For a better comparison you'd need to measure Watts * seconds for a
given task.

~~~
skavi
*1/20 as much energy

~~~
jacksnipe
Joules measure energy, Watts measure power

~~~
sp332
skavi is right; in context, I should have said that each operation takes 1/20
as much energy. The amount of power depends on how often the operations are
done.
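To make the distinction concrete with the Anandtech numbers (170 pJ vs 3010 pJ per operation, end to end), here's a minimal back-of-the-envelope in Python; the 1e9 ops/s rate is an assumed figure purely for illustration:

```python
# Energy per operation (joules), end to end, from the Anandtech analysis:
# conventional CPU path vs UPMEM's processing-in-memory (PIM) path.
CPU_J_PER_OP = 3010e-12
PIM_J_PER_OP = 170e-12

# Energy-efficiency ratio: "just under 20x".
efficiency = CPU_J_PER_OP / PIM_J_PER_OP  # ~17.7

# Power (watts) = energy per op * ops per second.
# If the PIM side also runs 20x more ops per second (assumed rate),
# its power draw can exceed the CPU's despite the per-op energy win.
RATE = 1e9  # ops/s, assumed for illustration
cpu_watts = CPU_J_PER_OP * RATE           # ~3.0 W
pim_watts = PIM_J_PER_OP * (RATE * 20)    # ~3.4 W

print(round(efficiency, 1), round(cpu_watts, 2), round(pim_watts, 2))
```

So per task the energy drops ~18x, but sustained power can still go up, which matches the higher server wattage in the slides.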

------
imtringued
Venray did this in 2012 with their TOMI Borealis PIM, but it didn't catch on.

[https://hothardware.com/news/cpu-startup-combines-cpudramand...](https://hothardware.com/news/cpu-startup-combines-cpudramand-a-whole-bunch-of-crazy)

~~~
wmeddy
I'm glad to see that you mentioned this. I had the pleasure of working with
Russell Fish on a different project, and I immediately thought of TOMI when
looking at the parent article.

------
cmrdporcupine
Seems to me that a full CPU is overkill for this; a hybrid approach where you
just have some arithmetic on vectors, with no program counter or branching
really, could be effective without adding a bunch of complexity.

Also seems like concurrency and data consistency issues could arise pretty
drastically when each memory unit is potentially making changes independent of
the processor...

And what about CPU cache coherency?

~~~
flaviu1
It uses a custom CPU and ISA that's specialized for the application:
[https://www.anandtech.com/show/14750/hot-chips-31-analysis-i...](https://www.anandtech.com/show/14750/hot-chips-31-analysis-inmemory-processing-by-upmem/2)

> Internally the DPU uses an optimized 32-bit ISA with triadic instructions,
> with non-destructive operand compute. As mentioned, the optimized ISA
> contains a range of typical instructions that can easily be farmed out to
> in-memory compute, such as SHIFT+ADD/SHIFT+SUB, basic logic (NAND, NOR, ORN,
> ANDN, NXOR), shift and rotate instructions
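As a rough illustration (my own sketch, not UPMEM's actual encoding or semantics): a triadic, non-destructive instruction writes its result to a third register and leaves both source operands intact, and the fused/negated ops listed above might look like this:

```python
MASK32 = 0xFFFFFFFF  # the DPU is a 32-bit machine

def shift_add(ra, rb, shift):
    """rd = (ra << shift) + rb. Triadic: result goes to a third
    register; source operands ra/rb are untouched (non-destructive)."""
    return ((ra << shift) + rb) & MASK32

def nand(ra, rb):
    """NOT(ra AND rb), one of the basic negated-logic instructions."""
    return ~(ra & rb) & MASK32

def andn(ra, rb):
    """AND with a negated second operand, as in the ANDN instruction."""
    return (ra & ~rb) & MASK32

print(shift_add(3, 1, 2))  # (3 << 2) + 1 = 13
```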

Re. CPU cache coherency, they have a software library that automagically hides
that:
[https://images.anandtech.com/doci/14750/HC31.UPMEM.FabriceDe...](https://images.anandtech.com/doci/14750/HC31.UPMEM.FabriceDevaux.v2_1-page-018.jpg)

Not exactly sure how it works; I'm not super familiar with the internals of
CPUs.

------
jacknews
The concept of "smart memory" is good (though not new), but I think the devil
could be in the process-technology details; DRAM and logic are not that
compatible.

But let's hope they've figured it out.

~~~
Symmetry
You do have eDRAM, where big blocks of DRAM are put on board a CPU as a last-
level cache, like in the POWER7. It does need special process features, though.

------
mpweiher
First incarnations of this I heard of were SLAM, Scan Line Access Memory [1]

"A SLAM consists of a conventional dense semiconductor dynamic memory
augmented with highly parallel, but simple, on-chip processors designed
specifically for fast computer graphics rasterization. "

and PixelPlanes[2].

[1]
[https://dl.acm.org/citation.cfm?id=13358](https://dl.acm.org/citation.cfm?id=13358)

[2]
[https://pdfs.semanticscholar.org/0c1b/8ffdf7074a73883139da41...](https://pdfs.semanticscholar.org/0c1b/8ffdf7074a73883139da41882d1e05653123.pdf)

------
tempsolution
This sounds sketchy. Either the rest of the industry is incredibly incompetent
or they are deliberately omitting a huge drawback of this solution. And I
don't see lack of expandability as a drawback; just buy a bit more RAM to
start with.

Then there are NVIDIA GPUs and the Intel Xeon Phi (i.e. Knights Landing)... So
they are trying to sell us on the idea that neither AMD, NVIDIA, nor Intel has
thought of this "ingenious" way to increase efficiency by co-locating CPU and
memory? The fact that they deliberately avoid addressing this elephant in the
room makes me super skeptical.

~~~
sp332
It requires significant changes to the compiler, so this RAM would not make
any difference to existing binaries. This is a really big change in the way
computers are structured, so it's no wonder it hasn't taken off before. And
this isn't the first attempt; look up computational RAM, IRAM, RADram, smart
DRAM, processor-in-memory.

~~~
bigtrakzapzap
No pun intended, I remember IRAM. Then there was Tilera and Tabula
(MPPA/MIMD).
[https://en.wikipedia.org/wiki/Massively_parallel_processor_a...](https://en.wikipedia.org/wiki/Massively_parallel_processor_array).

------
m0zg
Micron has been pursuing something similar for a while. The "CPUs" that you
can put on the same die, however, are very limited in what they can do, owing
to the specific lithography used on memory chips, and the general lack of die
space. They also don't get uniform access to a huge memory range that you'd
expect from a "real" CPU, and require you to partition work to fit within the
constraints of the memory access pattern they can, in fact, support. The
instruction set is very limited, floating point can only be emulated (i.e.
slow AF, not that you actually need it for neural networks most of the time).
The upside is the unlimited memory bandwidth and very low pJ/byte, with a few
catches.

Don't know if this is similar, but if it is, it's going to be a hard sell,
especially in the era when 90% of programmers can't even understand what I
wrote above.

------
petra
More details: [https://www.anandtech.com/show/14750/hot-chips-31-analysis-i...](https://www.anandtech.com/show/14750/hot-chips-31-analysis-inmemory-processing-by-upmem)

------
teilo
1980s: RAM in the CPU to speed up the CPU. 2020s: CPU in the RAM to speed up
the RAM.

------
nickpsecurity
Venray tried this:

[https://www.wired.com/2012/01/russell-fish-venray/](https://www.wired.com/2012/01/russell-fish-venray/)

They didn't get huge that I can tell. Might be a tough market.

~~~
petra
DRAM is a commodity business. It works in huge batches.

Convincing DRAM manufacturers to change things for something at that small
scale is a huge challenge.

I don't think Venray managed to do that.

But UPMEM probably did.

~~~
rasz
Micron dabbled in this: the 2003 project Yukon (shut down in 2004), and again
ten years later with the 2013 “Automata Processor” (since erased from their
website).

~~~
jplayer01
Interesting. The last I can find on the Automata Processor is from 2014, when
it was supposedly being delivered to developers within weeks. And then
nothing. I wonder what happened to it. Doesn't seem like it ever turned into a
shipping product.

~~~
JackFaker
I only heard of the University of Virginia getting some to work with.
[https://engineering.virginia.edu/center-automata-processing-...](https://engineering.virginia.edu/center-automata-processing-cap)

~~~
jplayer01
Ah, cool. Seems like the latest paper involving the AP is from 2016.

~~~
rbanffy
[https://www.cs.virginia.edu/~skadron/Papers/wang_APoverview_...](https://www.cs.virginia.edu/~skadron/Papers/wang_APoverview_CODES16.pdf)
has more information

~~~
jplayer01
> This work was supported in part by the NSF (CCF-0954024, CCF-1116289,
> CDI-1124931, EF-1124931); Air Force (FA8750- 15-2-0075); Virginia
> Commonwealth Fellowship; Jefferson Scholars Foundation; the Virginia CIT
> CRCF program under grant no. MF14S-021-IT; by C-FAR, one of the six SRC
> STARnet Centers, sponsored by MARCO and DARPA; a grant from Micron
> Technology.

I guess the other comment wasn't too far off.

~~~
rbanffy
Not at all. The paper is still interesting though.

------
Diederich
This somewhat reminds me of the Blitter chip, which first came to my attention
in the Amiga 1000:

[https://en.wikipedia.org/wiki/Blitter](https://en.wikipedia.org/wiki/Blitter)

"A blitter is a circuit, sometimes as a coprocessor or a logic block on a
microprocessor, dedicated to the rapid movement and modification of data
within a computer's memory."

~~~
tempguy9999
IIRC the blitter was just a normal coprocessor. I'm pretty certain it wasn't
embedded in any way in the memory.

~~~
tejtm
As I recall, there was "chip" RAM and "fast" RAM. So maybe not embedded by all
definitions, but certainly processing chips with privileged access to a subset
of memory.

------
crb002
Here is a link to the assembler instruction set for the PIMs.
[https://sdk.upmem.com/2019.2.0/200_AsmSyntax.html#assembly-i...](https://sdk.upmem.com/2019.2.0/200_AsmSyntax.html#assembly-instructions)

------
crb002
You have to obey speed-of-light latency, and that means putting compute in RAM,
right next to the data. The key is to have a reusable fabric so all the TSMC
customers can pop their own chips into RAM, especially ASICs and micro-FPGAs.

------
Zenst
Interesting that they went with a 32-bit ISA, given the growth and demand for
ML/AI, which seems to lean towards lower precision (8-bit). I wonder if they
could have gone for that instead of something more jack-of-all-trades in
design.

Which makes me wonder: when will somebody do an ISA dedicated to ML/AI, and
will we ever see old 8-bit CPUs reborn as on-memory CPUs?

We have seen this approach of processing on memory before, and whilst it has
not gained traction as a product, this might. Though with chip design going
towards chiplets, RAM can and may well become another chiplet in the evolution
of that design process.

~~~
radarsat1
In what domains do people do 8 bit machine learning? Most example code I've
seen e.g. for TensorFlow uses float32 matrices.

~~~
jdietrich
It's quite common to train at float32 but quantize down to int8 for inference.
The loss of accuracy is negligible, but you make huge performance gains.

[https://medium.com/apache-mxnet/model-quantization-for-produ...](https://medium.com/apache-mxnet/model-quantization-for-production-level-neural-network-inference-f54462ebba05)
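A minimal sketch of symmetric int8 quantization in plain Python (illustrative only, not any particular framework's scheme):

```python
def quantize_int8(xs):
    """Map float weights to int8 with a single symmetric scale factor."""
    scale = max(abs(x) for x in xs) / 127.0
    q = [max(-127, min(127, round(x / scale))) for x in xs]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from the int8 representation."""
    return [v * scale for v in q]

weights = [0.5, -1.0, 0.25]
q, s = quantize_int8(weights)
# Inference runs on the small int8 values; dequantizing shows that the
# round-trip error is tiny relative to the original weights.
approx = dequantize(q, s)
```

Each weight now fits in one byte instead of four, and integer multiply-adds are much cheaper than float ones on most hardware.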

------
oysterfish
How can I get my hands on this?

~~~
seren
There is some information on the SDK here:

[https://www.upmem.com/developer/](https://www.upmem.com/developer/) or more
directly [https://sdk.upmem.com/2019.2.0/](https://sdk.upmem.com/2019.2.0/)

It seems you have to compile an executable for the in-memory processor, and
they have some sort of daemon/infrastructure to communicate from the main CPU
to the PIM.
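The programming model presumably looks like a host/device offload. Here's a toy Python simulation of that split; the names (`dpu_kernel`, `host_offload_sum`) are made up for illustration and are not the UPMEM SDK's API:

```python
def dpu_kernel(local_data):
    """What each in-memory processor would run over its own local bank."""
    return sum(local_data)

def host_offload_sum(data, n_dpus=4):
    # The host partitions the dataset into per-DPU working sets
    # (MRAM banks on the real hardware).
    chunks = [data[i::n_dpus] for i in range(n_dpus)]
    # Each DPU computes independently, touching only its local memory.
    partials = [dpu_kernel(c) for c in chunks]
    # The host gathers only the small per-DPU results and reduces them.
    return sum(partials)

print(host_offload_sum(list(range(100))))  # 4950
```

The point of the pattern is that only the tiny partial results cross the memory bus; the bulk of the data never leaves its DIMM.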

------
basicplus2
So what are the security implications?

One would really need to trust the manufacturer.. but how?

~~~
rbanffy
It can actually be useful for security: you can tell the DPUs to erase memory
the moment the OS frees it, reducing the risk of data being read or decrypted.
That's probably a very trivial use of the tech, but it's low-hanging fruit.

~~~
cestith
I suppose one could put more of an OS's work with memory into this level, too.
Zero it when it's freed, but also keep the free block list updated at this
level. Maybe use these processors for aggregating available contiguous blocks
when necessary, too. Maybe use these processors to update the virtual to
physical mappings and keep free space optimized. Then, when the application
actually needs to do something close to RAM it could be given access.
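A toy sketch of that idea, with zero-on-free and the free list maintained "near memory"; everything here (page size, class name, behavior) is hypothetical, purely to illustrate the division of labor:

```python
PAGE = 8  # toy page size in bytes

class NearMemoryAllocator:
    """Hypothetical allocator whose bookkeeping runs beside the DRAM,
    so freed pages are scrubbed without the main CPU touching them."""

    def __init__(self, n_pages):
        self.mem = [bytearray(PAGE) for _ in range(n_pages)]
        self.free_list = set(range(n_pages))

    def alloc(self):
        # Hand out the lowest free page to keep allocations contiguous.
        page = min(self.free_list)
        self.free_list.remove(page)
        return page

    def free(self, page):
        # Zero immediately on free, in memory, before relinking the page.
        self.mem[page] = bytearray(PAGE)
        self.free_list.add(page)

a = NearMemoryAllocator(4)
p = a.alloc()
a.mem[p][0] = 0xAA
a.free(p)
assert a.mem[p][0] == 0  # freed data is already scrubbed
```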

------
manav
I’m not much of a hardware person, but I was wondering: why not do the
opposite? Pair a massive amount of SRAM/DRAM with a normal CPU as an L2+
cache. Or is that essentially what they are doing?

~~~
vvanders
SRAM and DRAM are two totally different things.

An L2 cache is just a huge chunk of SRAM; it takes more die space and uses a
lot more power, with the benefit of much lower latency.

------
xiphias2
Where can I buy put options to this company?

20x speedup compared to a CPU looks worse than just using a stock GPU, and we
haven't talked about the AI inference chips that are optimized for NN
inferencing.

~~~
petra
GPUs have a limited amount of memory.

~~~
proverbialbunny
But they can prefetch data. This makes writing code for them a bit more
complex, but outside of that the RAM limitations are mostly moot.

This fails when data needs to be read and written randomly. However, even if
the entire dataset were in VRAM, random reads and writes would still be slow
because of the way a stream processor works. All reads and writes need to be
laid out in memory in order for a GPU to do its best work.

------
rbanffy
I wonder what having these in-memory processing units can do with
cryptocurrencies. Serial processing performance is rather low, but there are
thousands of DPUs in a memory DIMM.

------
cellular
This is a great idea! I'm surprised we haven't thought of this before. Add a
simple ALU in there, with some way to request NN calculations, like a
coprocessor!

~~~
sitkack
The concept is quite old.

[https://en.wikipedia.org/wiki/Berkeley_IRAM_project](https://en.wikipedia.org/wiki/Berkeley_IRAM_project)

[http://iram.cs.berkeley.edu/](http://iram.cs.berkeley.edu/)

------
fnord77
Sorry - I don't know much about chip design: Will this increase the heat
produced by DIMMs and will they require active cooling?

Does increased heat decrease the life of RAM?

------
abstract7
How does shared memory between processes on different chips work with
something like UPMEM?

------
karmakaze
Seems like MemPU would be a more fitting name/ordering of letters.

------
Pywarrior
computronium

