
ComputeDRAM: In-Memory Compute Using Off-the-Shelf DRAMs (2019) [pdf] - ingve
https://parallel.princeton.edu/papers/micro19-gao.pdf
======
userbinator
From the paper:

 _Our novel operations function at the circuit level by forcing the commodity
DRAM chip to open multiple rows simultaneously by activating them in rapid
succession._

That reminds me of this:

[https://www.linusakesson.net/scene/safevsp/](https://www.linusakesson.net/scene/safevsp/)

It's worth reading the long explanation there, and the HN discussion at
[https://news.ycombinator.com/item?id=11845770](https://news.ycombinator.com/item?id=11845770)
, but the critical part is this:

 _In short, one memory cell gets refreshed with the bit value of a different
memory cell._

What I find more amazing is that this effect has probably been noticed for as
long as DRAM existed and treated as a bug (like above), and it took several
decades for someone to think "that could be _useful_!"
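As I understand the paper's trick, here is a toy software model of the charge-sharing behavior (not their actual DDR command sequence; row width and constants are illustrative):

```python
# Toy model of the majority behavior ComputeDRAM exploits: activating
# three rows in rapid succession makes each bitline settle to the
# majority of the three cells in that column. Reserving one operand row
# as a constant turns majority into AND or OR.

ROW_MASK = 0xFF  # pretend rows are 8 bits wide (illustrative)

def majority(a, b, c):
    """Bitwise majority of three row values, column by column."""
    return (a & b) | (b & c) | (a & c)

def dram_and(a, b):
    return majority(a, b, 0x00)      # MAJ(a, b, all-zeros row) == a AND b

def dram_or(a, b):
    return majority(a, b, ROW_MASK)  # MAJ(a, b, all-ones row) == a OR b
```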

~~~
ChainOfFools
I seem to recall an unusual proof of work scheme (cuckoo iirc) which depended
on some similar exploit of off-label DRAM timing behavior, but just as a means
of assuring a standard measuring unit rather than to perform compute ops.

~~~
tromp
As the author of the Cuckoo Cycle PoW
[https://github.com/tromp/cuckoo](https://github.com/tromp/cuckoo) I don't
recall any such dependence. I don't see how the operations described in the
paper would benefit a Cuckoo Cycle solver, which in essence looks like
[https://github.com/tromp/cuckoo/blob/master/doc/simplesolve](https://github.com/tromp/cuckoo/blob/master/doc/simplesolve)

------
noipv4
Micron had implemented an Automata processor on DRAM.

[https://insidebigdata.com/2019/07/26/machine-learning-with-m...](https://insidebigdata.com/2019/07/26/machine-learning-with-microns-automata-processor/)

[https://semiwiki.com/forum/index.php?threads/survey-paper-on...](https://semiwiki.com/forum/index.php?threads/survey-paper-on-microns-automata-processor.11424/)

~~~
glangdale
This wasn't really in-DRAM compute in the same way the paper describes,
interesting though that project was. Sadly, it's more or less defunct. The
academic research at
UVa has largely come to an end and Micron spun out the project to a startup
that seems to be moribund
([http://www.naturalsemi.com/](http://www.naturalsemi.com/)).

I tracked these guys because we were doing automata in s/w (the
[https://github.com/intel/hyperscan](https://github.com/intel/hyperscan)
project at Intel). There were some aspects of their methodology I didn't like
(notably, they tended to run benchmarks that spewed matches, slowing Hyperscan
down, while turning off matches on the h/w card!) but I found them v.
interesting and they had some really thought-provoking stuff.

------
pulse7
In-Memory compute would be useful for garbage collection. Compare also with
this article:
[https://people.eecs.berkeley.edu/~krste/papers/maas-isca18-h...](https://people.eecs.berkeley.edu/~krste/papers/maas-isca18-hwgc.pdf)
(note: one of the article's authors is Krste Asanovic, who is chairman of
the board of the RISC-V Foundation).

------
dang
A single but excellent comment from last year:
[https://news.ycombinator.com/item?id=21992296](https://news.ycombinator.com/item?id=21992296)

~~~
Someone
That comment says

 _”The Elxsi 6400, built in the early 1980s could do logical operations to
memory using the off the shelf drams of the day. Designer was Harold (Mac)
McFarland, of PDP-11 fame.”_

I googled for more info, but didn’t find anything about that. There’s a Wikipedia
page
([https://en.wikipedia.org/wiki/Elxsi](https://en.wikipedia.org/wiki/Elxsi))
that confirms the McFarland part, and I found the system architecture
([https://amaus.net/static/S100/elxsi/systems/Elxsi%20System%2...](https://amaus.net/static/S100/elxsi/systems/Elxsi%20System%206400%20architecture.pdf)),
which is interesting (instructions for multi-precision _ascii_ addition and
subtraction, for example, I would guess for supporting COBOL), but nothing
about those DRAM tricks. Does anybody have more info?

------
etaioinshrdlu
It seems to me like the biggest limitation is that while multiple rows can be
operated on at once, there is apparently no operation that can mix the values
of different columns. So bits at index 0 within a number can only be dependent
on other bits at index 0.

Anyone see anything that contradicts this?
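
For concreteness, a toy sketch of what that restriction means (my
illustration, not anything from the paper):

```python
# With only row-wise AND/OR, output bit i depends solely on bit i of the
# operand rows: the operations are purely column-parallel.
a, b = 0b0110, 0b0011

and_row = a & b  # bit i of the result uses only bit i of a and b
or_row  = a | b

# By contrast, addition needs carries to ripple across columns, so it
# cannot be expressed as a single in-DRAM AND/OR step; the memory
# controller has to move bits between columns itself.
```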

------
DesiLurker
What I'd like is a programmable vector math unit with adjustable vector
length so I can do massive compute operations without having to pay the
memory access latency cost. For instance, it would be awesome if the CPU
could just issue an "alpha-blend these two frames" command to the memory
and avoid the whole fetch-decode-execute loop altogether.

On top of this it would be really awesome if this attached 'ram compute
block' were re-programmable like FPGAs, or a simpler higher-level
construct. Then we could essentially do all the 'parallel & dumb'
computations in RAM itself, so the CPU would mostly be for control and
branch code.

~~~
jiggawatts
With 5nm coming soon, I think this is inevitable. It's now possible to put
127M transistors in a single square millimetre, but despite billions of
transistors in a CPU, total performance is not hundreds of thousands of times
greater than primitive CPUs from decades ago.

It would cost next to nothing in terms of chip area to have a simple
"controller" CPU in each DRAM chip that can issue vector instructions without
needing to have its hand held.

The problem is the instruction set: where do you stop? Do you just have simple
logic operations, or a full set of floating point operations? Do you have flow
control? Stacks? Etc...

Code would either have to be written in the "full featured" language as well
as a cut-down language similar to CUDA, _or_ the embedded little chips would
have to be full CPUs.

The other issue is cache coherence. It's possible to have designs where either
the little DRAM CPUs participate or they don't. Both have big advantages and
big disadvantages, and neither is easy.

I suspect that this is going to start turning up in GPUs before CPUs, or
possibly for HPC applications before general purpose computers.

~~~
nine_k
Since designing and perfecting new, highly parallel programming methods is
hard, I can imagine stretching current approaches.

Spread ALUs and simple control units across the RAM cells; little else is
needed, because the RAM serves as its own registers. Some distant big
control unit would send instructions to the processing units and
orchestrate I/O. A bit like a GPU, but with a different set of constraints.
It could likely be made compatible with OpenCL or CUDA.

------
etaioinshrdlu
Interesting that they get around lack of NOT operations (and thus lack of
universal logic capability) by storing each value twice: once normally, and
once bitwise-negated.
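
A toy sketch of that dual-rail encoding (my own notation, not the paper's):

```python
# Store each word alongside its bitwise complement. NOT is then free (swap
# the two rows), and AND/OR keep both rails consistent via De Morgan.

MASK = 0xFF  # 8-bit rows for the sketch (illustrative)

def store(v):
    return (v & MASK, ~v & MASK)      # (value row, complement row)

def not_(x):
    v, nv = x
    return (nv, v)                    # NOT is just a row swap

def and_(x, y):
    # value rail: a & b; complement rail: ~(a & b) = ~a | ~b
    return (x[0] & y[0], x[1] | y[1])

def or_(x, y):
    # value rail: a | b; complement rail: ~(a | b) = ~a & ~b
    return (x[0] | y[0], x[1] & y[1])
```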

~~~
gowld
Not so much "getting around the lack of NOT" as "implementing NOT".

~~~
DonHopkins
I'm confused: Are you using the Bill & Ted Sarcastic Postfix NOT or not?

------
peter_d_sherman
Do these operations make the memory Turing-complete?

I don't have the expertise to know...

If they don't -- then what would be needed to be done to the memory controller
(in this case a custom FPGA memory controller) to make it Turing-complete?

Also... even if it isn't Turing-complete, then probably whatever functionality
is missing could be implemented by the FPGA -- although at the probable cost
of a memory round trip for those instructions, right?

In other words, you could probably use this for mixed-mode, hybrid, Turing-
completeness via additional FPGA instructions, even if the operations on RAM
aren't Turing-complete in and of themselves -- or am I missing something?

All of this sounds very promising!

Great concept, great paper -- hope you get well funded for your next round of
research!

~~~
jecel
It has the functionality of the first Connection Machine, except with no
network to exchange values between columns (as another comment pointed out).
You can do 64K logic operations, each on one bit, at a time. You still need an
external controller (which can be in the FPGA, as you suggested) to fetch
instructions, do loops and things like that.

The CM-1 could do conditional execution: you tell the 64K processors to add,
but only those with a given flag set to 1 will actually do it, while the
others execute a NOP. It is possible to simulate this on this design, but it
would be rather awkward, just like their 1 bit addition is awkward compared to
the one clock equivalent in the CM-1.
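
A toy sketch of that bit-serial style (my own illustration; nothing like
this appears in the paper): numbers live one bit per row, each column is a
lane, and a carry row ripples through a loop of column-wise AND/OR/NOT
steps.

```python
def bit_serial_add(a_rows, b_rows):
    """Add N lane values in parallel. a_rows/b_rows hold one bit-row per
    position, least significant first; bit i of each row belongs to lane i.
    Only column-wise AND, OR and NOT are used."""
    carry = 0
    out = []
    for a, b in zip(a_rows, b_rows):
        axb = (a | b) & ~(a & b)                     # XOR from AND/OR/NOT
        out.append((axb | carry) & ~(axb & carry))   # sum bit per lane
        carry = (a & b) | (b & carry) | (a & carry)  # majority = new carry
    out.append(carry)                                # final carry row
    return out
```

For example, with lane 0 holding 3 and lane 1 holding 1,
`bit_serial_add([0b11, 0b01], [0b01, 0b00])` returns the rows
`[0b10, 0b00, 0b01]`, i.e. 3+1=4 in lane 0 and 1+0=1 in lane 1.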

------
ColanR
So is there any example code? I would love to experiment with this on an extra
computer I have lying around.

~~~
magicalhippo
You need to modify the memory controller, so I'm guessing an FPGA is
required, though maybe a microcontroller could be used as well. Also note
that they only got the AND/OR operations to work on a couple of the eight
DRAM modules tested, so chances are you'll have to be lucky or source those
exact parts.

