
What Happens to OO When Processors Are Free? (2015) - mpweiher
http://blog.metaobject.com/2015/08/what-happens-to-oo-when-processors-are.html
======
Animats
The real question is "what do we do with a lot of CPUs without shared memory?"
Such hardware has been built many times - Thinking Machines, nCube, the PS3's
Cell - and has not been too useful for general-purpose computing.

One of the simplest cases for broad parallelism is search where most of the
searches fail. The required intercommunication bandwidth is low - you start up
all the machines and wait for somebody to report success. Bitcoin miners and
crypto key searchers are extreme examples. Problems which can be hammered into
that form parallelize well.
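
A toy Go sketch of that shape, with goroutines standing in for machines and a
made-up `found` predicate (a real miner or key searcher would hash or decrypt
here):

```go
package main

import (
	"fmt"
	"runtime"
)

// found checks one candidate; in a real miner or key search this would
// be a hash or decryption test. It is an invented predicate here.
func found(candidate int) bool { return candidate == 123_456 }

func main() {
	workers := runtime.NumCPU()
	result := make(chan int, workers) // buffered so late finders never block

	// Each worker scans a disjoint stripe of the search space; the only
	// communication is the single "success" message at the end.
	for w := 0; w < workers; w++ {
		go func(start int) {
			for c := start; c < 1_000_000; c += workers {
				if found(c) {
					result <- c
					return
				}
			}
		}(w)
	}

	fmt.Println("hit:", <-result)
}
```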

Some problems that distribute spatially parallelize well on such hardware.
Weather prediction, wind tunnel simulation, nuclear explosion simulation -
hydrodynamics problems like that can be handled in sections, with
communications only for the edge elements. That's what supercomputers are used
for.
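
A toy sketch of the idea in Go, assuming a 1-D diffusion-style stencil (the
coefficients are illustrative, not a real weather code):

```go
package main

import "fmt"

// step advances one subdomain of a 1-D diffusion grid. Only the two edge
// cells need values from the neighbouring subdomain (the "halo").
func step(u []float64, left, right float64) []float64 {
	get := func(i int) float64 {
		switch {
		case i < 0:
			return left
		case i >= len(u):
			return right
		default:
			return u[i]
		}
	}
	next := make([]float64, len(u))
	for i := range u {
		next[i] = 0.5*u[i] + 0.25*(get(i-1)+get(i+1))
	}
	return next
}

func main() {
	// Two "nodes", each owning half the grid. Per step they exchange one
	// edge value each way: communication is O(edge), compute is O(volume).
	a := []float64{1, 1, 1, 1}
	b := []float64{0, 0, 0, 0}
	for t := 0; t < 3; t++ {
		ea, eb := a[len(a)-1], b[0] // the halo exchange
		a, b = step(a, 0, eb), step(b, ea, 0)
	}
	fmt.Println(a, b)
}
```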

Neural nets and deep learning, though - finally, there's a broad use case for
such architectures. Google is using lots of GPU-type boards that way now. This
is a key concept. Massive parallelism on a single problem is really hard to
apply to the things people used to do on computers. Instead, it makes possible
new things to do.

An important question is how much memory each CPU should have. Thinking
Machines, nCube, and the Cell (256K each) clearly had too little. You need
enough memory per CPU that the intercommunication doesn't dominate the
problem.

The other big problem is how to organize software for this. We have two main
models of parallel software - shared memory, and clusters on a network. Those
are the two extremes for parallelism. Systems where the CPUs are more tightly
coupled than in a cluster but less tightly coupled than in a shared-memory
multiprocessor fall outside common programming paradigms. This is a very hard
problem, with no really good general-purpose solutions.

~~~
SwellJoe
Everything at web scale is embarrassingly parallel, at least on some fronts.
There's still often the bottleneck of maintaining state (in a database,
probably), but a heavily loaded server will have hundreds or thousands of
processes or threads or some kind of async (probably epoll or something
reasonably efficient on modern systems) to serve many clients. Which is why
Erlang (and Elixir) has some very enthusiastic proponents.

Serving the web is, I think, a very big part of the "general computing" space,
today.

~~~
dagss
Serving the web seems to split between a) IO-bound problems and b) databases.
It is certainly not embarrassingly parallel in the "would benefit from GPUs"
sense.

Those thousands of threads just sit there waiting for the database to respond
a lot of the time. The database is busy trying to keep things synchronized
across nodes. It is just that the hard stuff is pushed down to the database
layer.

~~~
SwellJoe
Both can be true. Database development, both in how databases are built and in
how people use them, has been focused on scaling across machines and storage
for most of the history of the web; it's possible it'll never quite catch up
with CPU, but it will continue to move that way. I don't know what the next
evolution of databases looks like (I'm definitely not a database expert), but
if Google and Facebook and Amazon and Microsoft and many others need it to
scale beyond what is currently possible to keep up with CPU parallelization,
they'll figure out how to make it scale.

~~~
k_lander
aren't databases in turn being held back by hardware limitations like IO
bandwidth?

------
mmstick
Nothing happens to object-oriented programming because it is entirely
unrelated to multicore processors, except for the fact that object-oriented
programming makes for terribly inefficient data structures that cannot handle
threading very well.

If you've written software before, you would know that most of your codepaths
require prior data to be processed, and thus having multiple cores won't make
software any faster than before. Additionally, there is a cost to starting
threads, so the benefit of making your software multi-threaded has to outweigh
the cost of launching a thread, sending data to that thread, and getting
results from that thread.

Pooling and channels are often used to mitigate the initial cost of creating
threads, but it still remains that it takes time to send data to a thread and
receive data from that thread, and there will always be that one main thread
that needs to handle the majority of the logic and keep results in sync.
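
A minimal Go sketch of that pattern: a channel-fed worker pool plus a single
gathering thread (the squaring "work" is a placeholder):

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	jobs := make(chan int)
	results := make(chan int)
	var wg sync.WaitGroup

	// A fixed pool amortises thread-start cost across many jobs; the
	// remaining overhead is the channel send/receive per job.
	for w := 0; w < 4; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range jobs {
				results <- j * j // stand-in for real work
			}
		}()
	}
	go func() { wg.Wait(); close(results) }()

	go func() {
		for i := 0; i < 10; i++ {
			jobs <- i
		}
		close(jobs)
	}()

	// The "one main thread": it gathers results and keeps them in sync.
	sum := 0
	for r := range results {
		sum += r
	}
	fmt.Println("sum of squares:", sum)
}
```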

Basically, we are more likely to see a few powerful cores sitting adjacent to
miniature parallel cores that can respond quickly to requests.

~~~
jinjin2
> object-oriented programming makes for terribly inefficient data structures
> that cannot handle threading very well.

The best approach I have seen to address this is the idea of transactional
objects, as done in Realm (yes, the mobile database).

The idea that objects can freely be accessed on any thread without locks, with
all changes protected by transactions, is so freeing, and the internal
representation that Realm uses for the objects is apparently far more
efficient than how they would natively be represented.

I really wonder why we are not seeing more of these kinds of transactional
data structures. Seems like a natural next step for object oriented
programming now that everything is turning multi-core.

~~~
louthy
You don't need transactions for this, just use immutable data structures and
you get _transaction-like_ behaviour for free.
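
A rough sketch of what I mean in Go, where "immutable value plus atomic
pointer swap" plays the role of a commit (uses Go 1.19's `atomic.Pointer`;
the `Account` type is invented for illustration):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Account is treated as immutable: writers never modify a published value;
// they build a new one and atomically swap the pointer. (Invented type.)
type Account struct{ Balance int }

func main() {
	var current atomic.Pointer[Account]
	current.Store(&Account{Balance: 100})

	// A reader on any thread gets a consistent snapshot with one load and
	// no locks, precisely because the value it points at can't mutate.
	snap := current.Load()

	// A "transaction": retry compare-and-swap against the version we read,
	// like an optimistic commit.
	for {
		old := current.Load()
		next := &Account{Balance: old.Balance + 50}
		if current.CompareAndSwap(old, next) {
			break
		}
	}

	fmt.Println(snap.Balance, current.Load().Balance) // 100 150
}
```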

~~~
taneq
It's not free. The cost is reallocation every time something changes, which in
many cases is _all the time_. Not all the world's a web page.

~~~
louthy
> Not all the world's a web page

Did you really need that patronising statement?

My point wasn't about memory usage, it was about development cost. The
transactional behaviour emerges by default from use of immutable structures.

Any transactional system will have memory performance characteristics that are
very similar to using immutable structures, if not worse, so your point was,
well, pointless.

------
sklivvz1971
This article relies on the assumption that OO was designed to work with
single-core or "few"-core processors.

It's a false assumption. OO was designed to model the world in the hope that
this would make expressing business logic easier.

It's not hard to imagine _what would happen_ with these many cores, because we
already have specialized systems with many cores: GPUs and CUDA.

And as with every time people talk about parallelism as the holy grail of
performance, the truth is obvious to whoever actually uses this hardware: not
many problems are highly parallelizable, and parallelism is always harder than
single-threaded code -- it's harder to _think about_, and this has nothing to
do with the programming paradigm.

~~~
mikekchar
I think you are incorrect in your last statement. All functors are inherently
parallelisable. This means that any map operation is parallelisable. Also, as
long as the operations are associative (which covers a surprisingly large
number of cases), your reduce operations will also be parallelisable. That
means that any computation that can be represented by a map/reduce is usually
fully parallelisable, and always parallelisable in the map operation. The only
operations that _aren't_ parallelisable are those that depend on previous
values (and even then you can make judicious use of partial function
application to cut down on the dependencies). Basically this means effects
where order matters, and recursive calls.
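
For example, a toy Go sketch of both halves (goroutine-per-element is
wasteful in practice, but it shows the dependency structure):

```go
package main

import (
	"fmt"
	"sync"
)

// parMap applies f to every element independently: there are no
// cross-element dependencies, so each index could go to its own core.
func parMap(xs []int, f func(int) int) []int {
	out := make([]int, len(xs))
	var wg sync.WaitGroup
	for i, x := range xs {
		wg.Add(1)
		go func(i, x int) {
			defer wg.Done()
			out[i] = f(x)
		}(i, x)
	}
	wg.Wait()
	return out
}

// reduce with an associative op can be evaluated as a tree: the two halves
// combine independently, then merge, giving log-depth parallelism.
func reduce(xs []int, op func(a, b int) int) int {
	if len(xs) == 1 {
		return xs[0]
	}
	mid := len(xs) / 2
	var left, right int
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); left = reduce(xs[:mid], op) }()
	go func() { defer wg.Done(); right = reduce(xs[mid:], op) }()
	wg.Wait()
	return op(left, right)
}

func main() {
	xs := []int{1, 2, 3, 4, 5, 6, 7, 8}
	squares := parMap(xs, func(x int) int { return x * x })
	fmt.Println(reduce(squares, func(a, b int) int { return a + b })) // 204
}
```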

I think the main reason we have difficulty imagining massively parallel
computing systems is that we hold onto mutable state and imperative syntax.
This pretty much forces us into a situation where order is significant. It
seems to me that abandoning that idea is the key to freeing ourselves up,
programming-wise. Personally, I don't find pure fp, or immutable OO that much
harder than imperative. Different, for sure, but both have advantages and
disadvantages.

~~~
dagss
No. As someone who has spent five years programming computing clusters, I have
seen many problems that are fundamentally hard to parallelize. If a problem
falls within map/reduce you are basically "done" no matter what language you
use (no need to argue FP vs. imperative), but very many interesting problems
don't.

I am a big believer in FP too, but for other reasons.

~~~
mikekchar
I can't edit my previous statement, but after spending more than a couple
minutes thinking about it, I can see I was wrong. It's quite convenient to
have downvotes to clue you in when you get things wrong, even if it is
embarrassing ;-)

------
cjensen
Having a dedicated processor for each object is a terrible idea in terms of
energy.

Consider current high-core count Intel processors on Windows. Windows shuts
down entire cores to save energy. Windows only powers enough cores to cover
your current workload plus some margin. I assume other OSs work the same.

If each object has its own processor, then either (1) you must keep the core
powered for the lifetime of the object or (2) the core must be energized each
time the object receives a message, and re-energizing is not free.

I guess if the core could auto-detect its load (as current Intel processors
do) that would help. But I think dynamically shuttling threads between cores
and shutting down as much as possible (which is exactly what Windows does) is
likely to be much more efficient.

So go ahead and have a thread-per-object for this hypothetical architecture,
but let the OS decide when and where to assign cores.

~~~
BuuQu9hu
You don't want this at the object level anyway. You want to have objects
clumped into _vats_, which are related object graphs that are isolated from
other vats but freely intermingle otherwise. Then, messages between vats
invoke _turns_ of delivery inside vats, independently of other vats, and we
can schedule one vat per processor.
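
A very rough Go sketch of the vat/turn idea: one goroutine per vat, a `func`
per message, so code inside a vat never needs locks (the counter state is
just for illustration):

```go
package main

import "fmt"

// A Vat owns a private object graph (here just an int) and runs turns
// one at a time, so code inside a vat never needs locks.
type Vat struct{ mailbox chan func(state *int) }

func NewVat() *Vat {
	v := &Vat{mailbox: make(chan func(*int), 16)}
	go func() {
		state := 0
		for turn := range v.mailbox { // one turn at a time
			turn(&state)
		}
	}()
	return v
}

// Send delivers a message; it may be called from any vat or thread.
func (v *Vat) Send(msg func(state *int)) { v.mailbox <- msg }

func main() {
	a, b := NewVat(), NewVat()
	done := make(chan int)

	a.Send(func(s *int) {
		*s += 41
		mine := *s
		// A message to another vat just enqueues a turn there.
		b.Send(func(t *int) { *t = mine + 1; done <- *t })
	})

	fmt.Println(<-done) // 42
}
```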

------
chris_st
Slightly related is the video game "TIS-100"[1] which has you programming
(incredibly) constrained processors that are arranged in a grid, with
communication between them. The programming puzzles are not very difficult (so
far! I'm about halfway through) and pretty entertaining to puzzle out. There's
a "meta game" of comments by "your uncle" who was trying to get it working
again.

I think it's closer to the GreenArrays, Inc chips, really.

[1] [http://www.zachtronics.com/tis-100/](http://www.zachtronics.com/tis-100/)

~~~
abrookewood
It's a lot harder when you realise that everyone is trying to optimise for
cycle count rather than instruction or node count.

------
readams
This sounds like a good idea until you realize that your computations will be
dominated by synchronization and/or communication overhead.

~~~
elihu
Setting aside for a moment the wisdom of using thousands of cores to enable
object oriented programming (which I'm dubious of), it seems to me that the
way modern processors handle communication and synchronization is mostly a
historical accident, and we could probably do a lot better than just letting
processes share memory willy-nilly and having the processor cache sort it out.
Like maybe we could give each core a set of registers dedicated explicitly for
communicating with its neighbors, bypassing the memory hierarchy entirely for
small messages.
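
Sketching that in Go, with a one-slot channel standing in for a hypothetical
neighbour-to-neighbour communication register:

```go
package main

import "fmt"

// Each "core" talks only to its right neighbour through a one-slot
// channel: a stand-in for a dedicated communication register that
// bypasses the shared-memory hierarchy.
func main() {
	const cores = 4
	links := make([]chan int, cores)
	for i := range links {
		links[i] = make(chan int, 1) // one register's worth of buffer
	}

	for id := 0; id < cores; id++ {
		go func(id int) {
			v := <-links[id]             // read the left neighbour's register
			links[(id+1)%cores] <- v + 1 // write the right neighbour's
		}(id)
	}

	links[0] <- 0           // inject a value at core 0
	fmt.Println(<-links[0]) // after one lap around the ring: 4
}
```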

~~~
gpderetta
That didn't work very well for Cell.

------
neolefty
What if we separate this into two questions:

(1) What if you have a fabric of "free" processors, and the real cost is
energy?

(2) How would <foo> programming paradigm work on this fabric?

------
notamy
Minor nitpick: (2015)

That aside, I wonder how hard proper software development would be for such a
machine. It seems like multi-threaded software is something that we struggle
to get right in our current low-core-count systems, so...

~~~
ianhowson
It's hard.

The author has reinvented the architecture used by Cell (PS3) and Intel IXP.
Both of these architectures are dead specifically because they are too damn
hard to program for compared with a multicore ARM/x86 chip.

GPUs would be the most successful modern implementation of this idea. There
are opportunities with FPGAs, but GPU silicon is so far ahead (and still
advancing fast) that you're usually better off designing for GPUs.

You could also consider Cavium parts (16-64 way ARM chips) which ship today in
high-end network hardware.

The common lessons across all of these are:

* Memory is slow compared with computation

* Put caches everywhere and have the machine decide what to cache

* Synchronisation is hard and puts tremendous load on your scarce memory resources

* It's much easier to do the same job on different data a million times than to do a million different jobs on the same data. In other words, high throughput is easier to achieve than low latency.

~~~
matt4077
I'm not sure if those architectures are comparable to the one discussed in the
article, except that both are highly parallel. GPUs and Cell are, as you
mention, data-parallel.

The article talks about a much more "anarchistic" parallelism where thousands
of different (in code and data) objects are each doing their thing, sending
messages to each other when necessary. I guess Erlang/Elixir's lightweight
processes are the closest thing currently, as mentioned in the article.

~~~
ianhowson
Cell's SPUs and IXP's microengines aren't data-parallel any more than a
regular CPU. They're minimal CPUs with local RAM and fast connectivity between
each other (usually a FIFO queue).

Every single one of those CPUs was independent and happy to run as many
branches and/or memory accesses as you want without significant performance
penalty, unlike modern GPUs.

So yeah, you could put different objects on different CPUs if you want. Except
that's not where the bottleneck in either energy or computation is. Remember
that the local RAM needs to be powered if it's to retain state (ignoring
FeRAM), so CPUs are no longer free; you have to commit objects back to main
DRAM before switching off the CPU. And so you've just reinvented caching and
might as well run on a fast few-core CPU anyway.

------
dorianm
Discussion at the time it was posted:
[https://news.ycombinator.com/item?id=10121974](https://news.ycombinator.com/item?id=10121974)
(linked by the author in 2015).

------
nowarninglabel
I guess to take this a bit further, the thought that came to mind for me is:
what if we served every user from their own VM, with its own core and memory?
Is anyone doing that with web sites today? It'd seemingly be horribly
inefficient, but it's an interesting thought experiment.

~~~
abrookewood
It's not that inefficient if you are not working with real 'VMs', but with
containers (or something similar). The unikernel projects mentioned by the
peer below (Ling, MirageOS) can spin up a container per request, respond and
then tear it down in milliseconds. They are pretty interesting from a security
perspective, especially when you couple them with a read-only image - I
imagine it would be pretty hard to attack something that only persists for the
time it takes to handle a single request. EDIT: Peer is below.

~~~
icebraining
_I imagine it would be pretty hard to attack something that only persists for
the time it takes to handle a single request._

I don't see why. If the request is the attack (as it usually is), then it'll
persist for just long enough to accomplish it. What kind of attacks do you see
it avoiding?

~~~
mtrpcic
I think the big benefit is that it avoids attacks that infect the server,
because in this case the server is "destroyed" when the request finishes. A
request that maliciously uploads "hackertools.php" would be useless, because
the host the file lands on is not a persistent web server but a container
that is torn down immediately.

~~~
abrookewood
Yes, this is what I meant. It doesn't make the server any less vulnerable to
an individual attack, but it makes it very hard to escalate it. Though there
was a really interesting video about a security guy breaking out of Lambda
recently and uncovering a persistent file system somewhere - will try to find
it. Edit: Found it:
[https://media.ccc.de/v/33c3-7865-gone_in_60_milliseconds](https://media.ccc.de/v/33c3-7865-gone_in_60_milliseconds)

------
dnautics
Communication overhead will not scale, and your parallelism will be bound by
Amdahl's law. You're better off being bound by Gustafson's law, using those
processors to run more copies of the single-threaded program with different
values.
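
For concreteness, the standard forms of the two bounds, with serial fraction
s and N processors:

```latex
S_{\mathrm{Amdahl}}(N) = \frac{1}{s + (1-s)/N} \;\le\; \frac{1}{s}
\qquad
S_{\mathrm{Gustafson}}(N) = s + (1-s)N
```

With s = 0.05, Amdahl caps the speedup at 20x no matter how many processors
you add, while Gustafson-style scaling (growing the problem with N) keeps
all N processors busy.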

------
gricardo99
Related: arrays of processors have in fact been fabricated, at least in
academia, for some time:
[http://vcl.ece.ucdavis.edu/pubs/2016.06.vlsi.symp.kiloCore/2016.vlsi.symp.kiloCore.pdf](http://vcl.ece.ucdavis.edu/pubs/2016.06.vlsi.symp.kiloCore/2016.vlsi.symp.kiloCore.pdf)

------
stcredzero
It's not the cost of processing that holds us back so much as the cost of
coordinating. Doing things in parallel and concurrently is where things get
tricky. So an opportunity to trade off software complexity for transistors is
interesting, of course!

------
crb002
Qt added signals/slots to C++.

FPGAs will get us most of the way there. We will want to express programs as
concrete circuits and message passing. Something like
[http://www.clash-lang.org](http://www.clash-lang.org). Stream-level
operations are preferred over "Object" messages.

There is a tradeoff between large bandwidth large area circuits, and smaller
circuits with less bandwidth. Inherently serial stuff goes on small circuits.

------
digi_owl
The whole thing brings to mind the hermetic expression, "as above, so below".

It feels like every layer of computational logic ends up mimicking both higher
and lower layers.

------
markhahn
confuses throughput with latency. has nothing to do with OO.

