
Making the Case for Feature-Rich Memory Systems (2016) [pdf] - ingve
https://www.cs.utah.edu/~rajeev/pubs/issc16.pdf
======
wpietri
Do folks have practical examples that would make development of alternative
computational models worth it?

The GPU is the only recent example where I've seen people really willing to
rethink how they work to get (very large) performance gains. But I'm not
seeing that here. For the examples I can think of, it seems like it would be
easier to install more servers with modest amounts of RAM, rather than having
a smaller number of servers with lots more RAM and in-memory processing.

Of course, maybe I'm just not thinking of the right examples.

~~~
deepnotderp
> Do folks have practical examples that would make development of alternative
> computational models worth it?

This is the question to ask people when they say "the von Neumann architecture
is a bottleneck".

That being said, I think a strong contender is the dataflow computing model.
GPUs already heavily punish control flow, and workloads are increasingly being
written to avoid it, hence the possibility of a real-world success for
dataflow machines.

~~~
adrianratnapala
Ok, I have seen dataflow as a model for writing programs. But such languages
still have conditional constructs -- and I always assumed that once it gets
down to the metal the cost of a conditional is going to be much the same.
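
To illustrate the difference (a minimal C sketch, not how any particular
dataflow machine implements it): the branchy form of a conditional pays for
branch misprediction in a pipelined CPU, while the dataflow-style form turns
the condition into an ordinary data value feeding a select, so both arms flow
through as data. GPU predication and dataflow select nodes work in essentially
this way.

    /* Control-flow version: a mispredicted branch flushes the pipeline. */
    int clamp_branch(int x, int lo) {
        if (x < lo)
            return lo;
        return x;
    }

    /* Dataflow-style version: the condition is just data feeding a
     * select; control flow never changes. Compilers often emit this
     * as a conditional-move/select instruction. */
    int clamp_select(int x, int lo) {
        int take_lo = x < lo;                  /* 0 or 1, plain data */
        return take_lo * lo + (1 - take_lo) * x;
    }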

------
deepnotderp
I find it annoying that most of these jump straight from standard DRAM with a
couple of tweaks straight to deep in-memory processing with memristors.

There are a lot of in-between options as well. For example, Harvard's Mallacc
(a hardware malloc accelerator), modified to reside inside the memory itself,
could be very useful:
[http://www.eecs.harvard.edu/~skanev/papers/asplos17mallacc.p...](http://www.eecs.harvard.edu/~skanev/papers/asplos17mallacc.pdf)

------
dom0
"The march toward specialized systems" is an interesting slogan (if you will),
since that's _exactly_ where we came from; we (as in: the industry) made a
huge point out of doing as much as possible using general-purpose components
(remember GPGPU?) and more or less open standards. It is of course obvious
that the general purpose approach has inherent inefficiencies, but we gladly
paid the price. I see some similarities here with the recent rise in
popularity of lower-level programming languages (modern C++, Rust, even Go),
after the move to VM-based highest-level languages (Python, Ruby, JavaScript,
countless others, perhaps even Java).

We see the gains in productivity and the reduction in cost – at least
according to some measures – but the approach has inherent inefficiencies:
simple applications using far more CPU and memory than they ought to.

This perhaps makes people think again about what would be possible if all the
abstraction (analogous to the _general-purposeness_ of hardware) were wiped
away, and about what was possible in the past using far fewer resources.

~~~
pjmlp
The trend of finally adopting AOT on OpenJDK and .NET (NGEN is just for faster
startup) kind of proves the point that they should have offered AOT since v1.0
instead of leaving it to third parties.

Another example is the ongoing effort to improve their support for value
types.

A 20-year delay to catch up with what Common Lisp, Eiffel, Modula-3 and
Oberon variants already offered in those days.

~~~
peoplewindow
Well, AOT even in OpenJDK is not necessarily about resource savings, just
startup time. Native code is much larger than the equivalent bytecode, so
compiling lots of cold code that's rarely used AOT to native can make things
more bloated and slower, rather than tighter and faster.

I suspect we'll see the same thing for value types: it's not going to be quite
the easy win it seems. Even in C++, it's easy to create accidental footprint
explosions with templates and lose performance to excessive copying with value
types. And C++ allows mutable values, which are out of fashion now, so Java
won't allow them ...

I was curious if the article would mention memory chips that knew how to do
bulk memmoves by themselves. Does anyone know if memory subsystems can already
do that? If you issue a memmove() for, say, 3kb of memory, does the CPU still
have to read it all into the cache and then immediately write it out again, or
is there some way the CPU can signal to the DRAM chips that they should do the
copy themselves? Fast copying of memory would be useful for GCs.

~~~
boznz
30 years ago we had blitter chips dedicated to moving memory around without
using the CPU; pretty sure most memory controllers would have something
similar these days.

~~~
dom0
memcpy is done by the CPU core, at least on x86. The IMCs (integrated memory
controllers) don't process data.

The CPU core has more bandwidth than the IMC anyway, so there would be no
speed-up from adding this complexity to the IMC (it would not only need to
perform the operation, but it would also need a way to maintain cache
coherence and communicate with the issuing CPU, none of which is a problem if
you just do it in the core). It might not even save power.
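
For the memmove question upthread: a minimal sketch (assuming x86 with SSE2,
16-byte-aligned non-overlapping buffers, and a size that is a multiple of 16)
of how the core itself can at least avoid dirtying the cache on the store
side, using non-temporal stores:

    #include <emmintrin.h>
    #include <stddef.h>

    /* Bulk copy with non-temporal stores. The loads still pull the
     * source through the cache, but _mm_stream_si128 write-combines
     * straight to memory without allocating destination cache lines.
     * A real memcpy would also handle alignment and tail bytes. */
    void stream_copy(void *dst, const void *src, size_t n) {
        __m128i *d = (__m128i *)dst;
        const __m128i *s = (const __m128i *)src;
        for (size_t i = 0; i < n / 16; i++) {
            __m128i v = _mm_load_si128(s + i);
            _mm_stream_si128(d + i, v);
        }
        _mm_sfence(); /* order streaming stores before later writes */
    }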

------
teddyh
Forward Error Correction in hardware would, to my mind, be a nice and
comparatively easy step up from simple ECC.

~~~
dbcurtis
What are you proposing? ECC (classic SECDED for instance) is a form of FEC.
What is it you want?
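
For concreteness, a toy sketch of why classic ECC is FEC: Hamming(7,4) plus an
overall parity bit gives an (8,4) SECDED code that corrects any single-bit
error and detects double-bit errors. (Illustrative only; real ECC DIMMs use
wider codes, e.g. (72,64), and lay out the check bits differently.)

    #include <stdint.h>

    /* Encode 4 data bits into an 8-bit SECDED codeword. Bits 1..7
     * hold the Hamming(7,4) code; bit 0 is overall (even) parity. */
    static uint8_t secded_encode(uint8_t d) {
        int d1 = d & 1, d2 = (d >> 1) & 1, d3 = (d >> 2) & 1, d4 = (d >> 3) & 1;
        int p1 = d1 ^ d2 ^ d4;   /* covers positions 1,3,5,7 */
        int p2 = d1 ^ d3 ^ d4;   /* covers positions 2,3,6,7 */
        int p3 = d2 ^ d3 ^ d4;   /* covers positions 4,5,6,7 */
        uint8_t w = (p1 << 1) | (p2 << 2) | (d1 << 3) |
                    (p3 << 4) | (d2 << 5) | (d3 << 6) | (d4 << 7);
        int par = 0;
        for (int i = 1; i < 8; i++) par ^= (w >> i) & 1;
        return w | par;          /* make total parity even */
    }

    /* Returns 0 (clean or corrected, data in *out) or -1 (double error). */
    static int secded_check(uint8_t w, uint8_t *out) {
        int s1 = ((w >> 1) ^ (w >> 3) ^ (w >> 5) ^ (w >> 7)) & 1;
        int s2 = ((w >> 2) ^ (w >> 3) ^ (w >> 6) ^ (w >> 7)) & 1;
        int s3 = ((w >> 4) ^ (w >> 5) ^ (w >> 6) ^ (w >> 7)) & 1;
        int syndrome = s1 | (s2 << 1) | (s3 << 2);  /* error position 1..7 */
        int par = 0;
        for (int i = 0; i < 8; i++) par ^= (w >> i) & 1;
        if (syndrome && par)       w ^= 1 << syndrome; /* single error: correct */
        else if (syndrome && !par) return -1;          /* double error: detect  */
        else if (!syndrome && par) w ^= 1;             /* parity bit itself flipped */
        *out = ((w >> 3) & 1) | (((w >> 5) & 1) << 1) |
               (((w >> 6) & 1) << 2) | (((w >> 7) & 1) << 3);
        return 0;
    }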

~~~
teddyh
I must admit to not being an expert in these matters, and I therefore defer to
this comment by JoshTriplett, where he talks about ECC as being inferior to
FEC:

[https://news.ycombinator.com/item?id=11604918](https://news.ycombinator.com/item?id=11604918)

~~~
dbcurtis
OK, so I read that, too. I'm still fuzzy on what is being asked for.
Extrapolating based on my 20 years of experience as a CPU logic designer, and
graduate work in error-correcting codes, I'm guessing that what the author
wants is to have uncorrectable memory read errors passed to user space
as an exception that the application can handle as it sees fit. (I wouldn't
call that FEC, but, /shrug ...)

The OS gets the exception. There are always (usually privileged) instructions
to read and set the memory check bits. It would be darn hard to write memory
diagnostics without them, or sometimes even to boot a machine that powers up
with random bits in the memory.

If I understand what the author wants, (and I have my doubts that the author
understands what they want), then they are asking for an OS feature, not a CPU
feature, and simply want the exception to bubble up to user space.
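
For what it's worth, Linux already exposes much of that OS feature: when the
kernel's hwpoison machinery sees an uncorrectable error in a page a process
touches, it can deliver SIGBUS with si_code BUS_MCEERR_AR ("action required")
and the failing address in si_addr. A minimal sketch of an application
handling it (illustrative; a real handler must stick to async-signal-safe
calls):

    #define _GNU_SOURCE
    #include <signal.h>
    #include <unistd.h>

    /* Handle uncorrectable memory errors surfaced by the kernel. */
    static void on_sigbus(int sig, siginfo_t *info, void *ctx) {
        (void)sig; (void)ctx;
        if (info->si_code == BUS_MCEERR_AR) {
            /* info->si_addr points into the poisoned page; a database
             * or GC could discard and rebuild that region instead. */
            _exit(42);
        }
        _exit(1);  /* some other bus error (alignment, etc.) */
    }

    int main(void) {
        struct sigaction sa = {0};
        sa.sa_sigaction = on_sigbus;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGBUS, &sa, NULL);
        /* ... normal application work ... */
        return 0;
    }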

------
sitkack
I can't believe it makes no mention of IRAM [0] or Computational RAM [1].

[0] [http://iram.cs.berkeley.edu/](http://iram.cs.berkeley.edu/)

[1]
[https://en.wikipedia.org/wiki/Computational_RAM](https://en.wikipedia.org/wiki/Computational_RAM)

