
Darpa Funds Development of New Type of Processor - farseer
http://www.eetimes.com/document.asp?doc_id=1331871&
======
dreamcompiler
Graph processing machines are not new. The signal-to-noise ratio of this
article is so low that I can't tell how this architecture differs from e.g.
the Cray Eldorado. Or the Connection Machine for that matter.

~~~
dmix
There is a real use case here that has become so widespread in the
intelligence and law-enforcement community that they saw utility in having a
processor optimized for graph analysis. They likely see some uses for it
outside of that market as well, which is why both Intel and Qualcomm are
interested in working on it. Even if it is a variation on 'Threadstorm', it's
still optimized for this particular type of data processing.

I read the whole article and I don't see it claiming anywhere that this is a
new idea. But there is certainly no other processor on the market like this
one, so it fits the category of "new type of processor", even if it's for a
dedicated use case - just like those ML-optimized processors.

There's a big difference between knowing something is theoretically possible
and the value you get from a real-world implementation with real users. That
sounds pretty newsworthy to me.

------
rayiner
What does the memory interface look like for this? None of the papers I can
find indicate how you can get so much parallelism out of the CPU <-> memory
interface.

An interesting approach to non-Von Neumann computing is to put ALUs in memory,
to take advantage of the fact that DRAMs have far more internal bandwidth than
what is exposed in traditional systems:
[http://researcher.ibm.com/researcher/files/us-leejinho/tvlsi...](http://researcher.ibm.com/researcher/files/us-leejinho/tvlsi_authorcp_20170329.pdf).
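
As a rough illustration (plain C, not from the paper or any particular PIM
API), here's the kind of memory-bound gather loop these designs target; on a
conventional system nearly all the time goes to waiting on DRAM rather than
doing the adds:

    /* Sketch of an irregular gather-and-sum over a large table.
     * On a conventional system each data[idx[i]] is effectively a
     * cache miss bounded by the narrow CPU<->DRAM interface; ALUs
     * inside the DRAM could do the adds next to the rows instead. */
    #include <stddef.h>
    #include <stdint.h>

    double gather_sum(const double *data, const uint32_t *idx, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++)
            sum += data[idx[i]];   /* random access into a huge array */
        return sum;
    }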

~~~
kevinnk
Manufacturing logic on the same wafer as DRAM is difficult since the processes
are so different. On the other hand, manufacturing on separate dies and
connecting with an interposer or TSVs gets you tremendous bandwidth and
relatively low latency; this is how some newer generation graphics memory is
implemented (see
[https://en.m.wikipedia.org/wiki/High_Bandwidth_Memory](https://en.m.wikipedia.org/wiki/High_Bandwidth_Memory)).

~~~
white-flame
CPU logic on DRAM process is apparently a solved problem:
[http://venraytechnology.com/Implementations.htm](http://venraytechnology.com/Implementations.htm)

~~~
kevinnk
You can make logic on a DRAM process, but either your logic will be slow or
you'll have to increase the cost and power consumption of your DRAM cells.
With separate chips connected with TSVs you can have your cake and eat it too
(each process can be specialized for its components), plus you get the added
benefit of increased yields (chips are smaller, no special processing steps).

~~~
white-flame
That's kind of the entire point of Venray's claims: the CPUs are both
plentiful and fast, with far lower power consumption than computationally
equivalent discrete CPU + RAM combos. They apparently add only a couple of
percent to the die size of the DRAMs.

However, their business model presumes "Wait until a DRAM manufacturer buys
us", which IMO is why nothing's moved forward. DRAM manufacture is low-margin
and not really the place to look for this kind of risky introduction to the
market. I'd love to see this form of parallelism, and their take on breaking
the memory bandwidth wall; it meshes great with the types of problems I work
on.

~~~
kevinnk
The point of my post was that integrated DRAM/logic has little upside that
TSVs don't, but plenty of downsides. Regardless of Venray's claims, there's a
reason modern high-performance parallel architectures go with TSVs and
interposers (Knights Landing, new GPUs, some deep learning platforms, etc.)
instead of logic in DRAM.

~~~
white-flame
Part of that is simply silicon design inertia, though.

Do you see interposer-style designs linking up terabytes of DRAM, at least in
the near future? All the chips you're talking about are pretty major dies, not
really suitable for packing many stacks of them into conventionally tight
DIMM-style arrays to reach such RAM sizes.

Of course, 3d chip advances might throw all current assumptions out the window
and change the layout of everything.

~~~
kevinnk
> Do you see interposer style designs as linking up terabytes of DRAM?

I don't think we're going to see a terabyte of DRAM on an interposer for a
while (4GB is about the max you can get commercially right now). I'm not sure
what you're trying to get at though; even with logic in DRAM you have to go
off chip to get to terabyte levels, so I don't see the advantage.

> All the chips you're talking about are pretty major dies, not really
> suitable for having many stacks of them in conventionally tightly spaced
> DIMM arrays to reach such RAM sizes.

The stacking happens in package (<1mm thick). Your DIMM array is going to have
to be pretty damn tight for that to matter.

> Of course, 3d chip advances might throw all current assumptions out the
> window and change the layout of everything.

TSVs are 3D (or "2.5D" depending on the configuration). You should have thrown
out those assumptions back in 2014.

~~~
white-flame
> _I'm not sure what you're trying to get at though; even with logic in DRAM
> you have to go off chip to get to terabyte levels, so I don't see the
> advantage._

Many-core processors with low-latency, wide-bus, on-chip random access also
need to scale horizontally. Focusing on large chips means you won't fit very
many on a single motherboard, where QPI/HyperTransport/memory-bus-style
communication can achieve faster and more user-transparent shared memory
access than off-board networking.

The "stacks" I was talking about are just the rows of DIMM slots stacked
together in tight proximity, compared to the number of CPUs/GPUs/etc per unit
area on a multi-socket motherboard to achieve the same memory footprint.
(easily apples and oranges in the current incarnations, admittedly, but
focusing on end-user expandability and configuration options)

In my opinion, this type of on-chip fast-RAM model in larger memory systems
would best take advantage of splitting up processing to where the memory is,
as opposed to a fatter node model, especially when it comes to physical size
and inter-chip communication of many chips.

However, if we soon have many-core chips with 32 parallel memory buses leading
to 256GB of in-package DRAM silicon, it becomes largely moot.

Yes, I know that 3D silicon stacking, HBM, etc. exist now. While they've had
some good speed and power advantages, they remain very limited in terms of
memory footprint. And of course the memory size is fixed per such a chip, and
there doesn't seem to be a path to many-chip expansion solutions for anything
but the top-end enterprise market. I think the Venray model has a simplicity
and expandability that keeps the most advantageous tradeoffs.

~~~
kevinnk
Okay, I think we're talking past each other at this point so let me be as
clear as possible: the original comparison was between TSVs and logic in DRAM.
Both of these are a way to get DRAM on chip and as physically close to the
core logic as possible. Logic in memory is on die, while TSVs are on package;
neither can be "extended" by an end-user without connecting off chip. Neither
changes the physical package size very much (TSVs are not intrinsically bigger
than logic in DRAM). Both have nothing to do with _off chip_ connections; as
soon as you start talking about things happening off package they behave
identically (grids of processor/DRAM combos can be done with either in exactly
the same way). Any chip you like can have TSVs (many core, single core, big,
small, whatever); there's no architecture that logic in DRAM can have that
TSVs can't. Both can be used to "split up processing to where the memory is".
Neither has to be a "fat node".

So with that out of the way, what exactly is the advantage of logic in memory?
Because so far nothing you have described is actually an intrinsic advantage.

~~~
white-flame
Right, it becomes less about individual chips' TSVs vs logic in DRAM, and more
about the scalability of the architecture. In the marketplace right now, the
trend for these TSV/interposer/multi-die sorts of devices is in "fat node"
designs, instead of more on-board distributed designs.

Logic on DRAM should be simpler and cheaper, which in the long run would lend
itself to more horizontal scaling (and horizontal scaling is currently
required to get large memory footprints economically), while more elaborate
and expensive designs end up as fat nodes. There's really no technical
difference when looking at many-chip architectures, since the chip package is
a black box at that level; the difference is more an economic one.

------
empath75
Is it just me, or does the illustration show totally the wrong kind of
'graphs'? I'm sure they don't mean bar charts.

~~~
angstrom
Yeah, they missed including a pie chart and a gantt chart for good measure.

~~~
mring33621
Most of the industry has moved on to Deep Pie Charts.

~~~
cat199
FYI, they're called 'chicago charts' for those in the know...

------
phonon
How is this different than FORTH processors, like

[http://www.greenarraychips.com/](http://www.greenarraychips.com/)

[http://www.forth.org/cores.html](http://www.forth.org/cores.html)

[http://www.ultratechnology.com/chips.htm](http://www.ultratechnology.com/chips.htm)

[https://www.complang.tuwien.ac.at/anton/euroforth2007/papers...](https://www.complang.tuwien.ac.at/anton/euroforth2007/papers/guzeman.pdf)

[http://wiki.c2.com/?ChuckMoore](http://wiki.c2.com/?ChuckMoore)

etc.

~~~
dnautics
Those do stack based computation?

~~~
kruhft
They are parallel chips with many cores (128 IIRC), each with a small amount
of 'stack' RAM, running Forth directly. Chuck designs them 'by hand' using
Color Forth, a transistor simulation model and... Forth words.

Similar to the machine found in TIS-100:

[http://www.zachtronics.com/tis-100/](http://www.zachtronics.com/tis-100/)

~~~
dnautics
exactly, the chips in the OP are not stack-based, are they?

~~~
kruhft
Probably.

------
szczepano
More about this processor here
[http://www.darpa.mil/attachments/BAA-16-52_HIVE_FAQ_20160831...](http://www.darpa.mil/attachments/BAA-16-52_HIVE_FAQ_20160831_posted5.pdf)

------
youdontknowtho
The way they describe sparse graph processing in memory sounds like the kind
of pointer chasing that makes run-time object-oriented memory access patterns
slow.

I wonder if that was a translation-to-PR artifact, or if there might be
something there to accelerate some of the Java or .NET memory access patterns
that we all use.

~~~
cvoss
There is a very direct sense in which pointer chasing _is_ the fundamental
operation of sparse graph processing.
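
As a rough illustration (plain C with a hypothetical node layout, nothing to
do with the actual HIVE design), this is what a sparse traversal does to
memory: every hop is a dependent load to an essentially random address, which
is exactly what caches and prefetchers handle worst:

    /* Hypothetical adjacency-list node; each edge hop is a dependent
     * pointer load, so the next address isn't known until the previous
     * load completes: no useful prefetching, mostly cache misses. */
    #include <stddef.h>

    struct node {
        int           visited;
        size_t        degree;
        struct node **neighbors;   /* pointers scattered across the heap */
    };

    /* Count nodes reachable from v by chasing pointers depth-first. */
    size_t count_reachable(struct node *v)
    {
        if (v == NULL || v->visited)
            return 0;
        v->visited = 1;
        size_t count = 1;
        for (size_t i = 0; i < v->degree; i++)
            count += count_reachable(v->neighbors[i]);
        return count;
    }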

On top of whatever access patterns developers tend to use, there's always the
JVM garbage collector (the bane of efficiency), which runs a basic graph
algorithm over the entire program's network of pointers. Although, I suspect
in many applications the graph in question is small (compared to big-data-
scale graphs) and throwing heavy machinery like this at it would be overkill.

Then again, maybe I'm not dreaming big enough, and this kind of processor will
make the need for cache line locality optimizations, careful instruction
scheduling around memory I/O, and half-second freezes for GC a thing of the
past?

------
Frenchgeek
"Darpa Funds Development of New Type of Processor: Worlds First Non-Von-
Neumann "

[https://en.wikipedia.org/wiki/Harvard_architecture](https://en.wikipedia.org/wiki/Harvard_architecture)
?

~~~
antoinealb
That was my thought as well: "What? AVRs are Harvard architecture and I can
buy them for $0.10." But after looking, apparently it is more about new
parallelism paradigms:

> "This non-von-Neumann approach allows one big map that can be accessed by
> many processors at the same time, each using its own local scratch-pad
> memory while simultaneously performing scatter-and-gather operations across
> global memory."

~~~
convolvatron
processors with cheap and limited local memory and costly and scalable
external memory are also not new....and that doesn't really make them non-von-
neuman except that it doesn't make sense to execute out of global memory

------
Cyph0n
I'm happy to see that a group from Georgia Tech is working on this. Actually,
my group just sent in a proposal for the DARPA SSITH[1] program. The high-
level goal of SSITH is to design low-level (firmware or hardware) protection
techniques that can guard against common software vulnerabilities that lead to
hardware exploitation.

[1]: [http://www.darpa.mil/news-events/2017-04-10](http://www.darpa.mil/news-events/2017-04-10)

------
atonse
My guess at what's driving this is the need for intelligence agencies to make
sense of and traverse the large, complicated graphs used to map real-world
relationships.

As they collect more data related to this, they'll need better ways to
traverse these graphs.

------
cardiffspaceman
I can't tell if the processing in TFA is the kind of graph processing that
machines like TIGRE or SKIM were proposed to do in the '80s. Or perhaps the
graph nodes are less specialized than in those machines?

------
Symmetry
Sounds a lot like the Cell in concept.

[https://en.wikipedia.org/wiki/Cell_(microprocessor)](https://en.wikipedia.org/wiki/Cell_\(microprocessor\))

~~~
vonmoltke
The SPUs on the Cell were closer in concept to DSPs than to what the article
describes. The Cell chip itself was essentially a general-purpose CPU joined
in silicon to several DSPs.

The SPUs are still designed for sequential processing of memory, just in
smaller, discrete blocks. The whole chip is orchestrated by a standard von
Neumann processor anyway, which acts as a bottleneck to keeping the SPUs busy.

------
wyldfire
What's the "community detection" benchmark referenced in the article?

------
AriaMinaei
From a layman: Would this necessitate a different programming paradigm?

~~~
grondilu
Yes, it's mentioned on page 2 of the article (you may have missed that it has
two pages).

~~~
AriaMinaei
Thanks! I did miss that page. But even though they mention "... calls for the
development of software tools to help programming the new architecture...",
I'm still unsure what programming for such a processor might look like.

~~~
benibela
Perhaps it will make trees faster than hashmaps? Linked lists better than
arrays?

------
nnfy
The civilian application proposed in the article, mapping the many-to-many
relationships between Amazon purchasers and items purchased, is unsettling.

~~~
jacquesm
That doesn't need anything special in terms of hardware, so it's a bad
example; current hardware is perfectly capable of making those connections.

------
bantunes
Can't wait for this publicly funded research to make some private corporation
billions!

~~~
jacquesm
You mean like the internet? Or self driving cars?

~~~
JackFr
The original ARPANET was a proof of concept of a network that could withstand
the failure of multiple simultaneous nodes, that is, survive a nuclear strike.
Whether knowing you have a robust command and control network makes you more
or less likely to use your own nuclear weapons in a first strike is arguable.
But it did guarantee a second strike would be possible (which, if you believe
in MAD, made the world safer). And robust second-strike capability, in the
end, is about killing people.

------
andreasgonewild
World's first, my ass. Journalism is dead; it has been for a long, long time;
we're living in the end times of the walking dead.

