
Occupy the cloud: distributed computing for the 99% - indogooner
https://blog.acolyer.org/2017/10/30/occupy-the-cloud-distributed-computing-for-the-99/
======
cmeiklejohn
This work seems to be in the same spirit as the work on function passing,
previously presented at Onward! 2016 (and also presented at Strange Loop
2014).

As the work presents it, the function passing model is a generalization of the
Spark/MapReduce model: data is stationary and immutable, and stateless closures
are passed around the network, providing type-safe, higher-order programming
with deferred evaluation. The contributions of the paper include an
implementation of the model in Scala, and re-implementations of systems like
Spark to demonstrate the applicability and generality of the model itself.
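To make the model concrete, here is a minimal single-process sketch of the
idea. Hedged heavily: the names (`Silo`, `apply`, `collect`) are made up for
illustration and only loosely echo the paper's terminology; the real
implementation is in Scala with spores for closure safety. Assumes cloudpickle
is installed.

```python
import cloudpickle

class Silo:
    """Toy stand-in for a stationary, immutable data container:
    the data never moves; serialized closures come to it."""
    def __init__(self, data):
        self._data = data      # immutable, never shipped over the wire
        self._lineage = []     # deferred transformations

    def apply(self, fn):
        # In the real model the closure is serialized and sent across the
        # network; here we just round-trip it through cloudpickle locally.
        self._lineage.append(cloudpickle.loads(cloudpickle.dumps(fn)))
        return self

    def collect(self):
        # Deferred evaluation: nothing runs until a terminal op forces it.
        out = self._data
        for fn in self._lineage:
            out = fn(out)
        return out

silo = Silo([1, 2, 3, 4])
print(silo.apply(lambda xs: [x * x for x in xs])
          .apply(lambda xs: [x for x in xs if x > 4])
          .collect())   # [9, 16]
```

The point of the cloudpickle round-trip is that only the closure crosses a
serialization boundary; the data never does.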

Function Passing: A Model for Typed, Distributed Functional Programming [1]
[https://2016.onward-conference.org/event/onward-2016-papers-function-passing-a-model-for-typed-distributed-functional-programming](https://2016.onward-conference.org/event/onward-2016-papers-function-passing-a-model-for-typed-distributed-functional-programming)

"Function Passing Style: Typed, Distributed Functional Programming" by Heather
Miller [2]
[https://www.youtube.com/watch?v=coX9RKH4rOs](https://www.youtube.com/watch?v=coX9RKH4rOs)

~~~
cat199
Other than the typing, the general concept is not novel and predates
Spark/Hadoop by 15-16 years or so (1995):

"We describe a distributed implementation of Scheme that permits efficient
transmission of higher-order objects such as closures and continuations. The
integration of distributed communication facilities within a higher-order
programming language engenders a number of new abstractions and paradigms for
distributed computing. Among these are user-specified load-balancing and
migration policies for threads, incrementally linked distributed computations,
and parameterized client-server applications. To our knowledge, this is the
first distributed dialect of Scheme (or a related language) that addresses
lightweight communication abstractions for higher-order objects."

Higher-order distributed objects. Henry Cejtin, Suresh Jagannathan, and
Richard Kelsey (NEC Research Institute, Princeton, NJ). ACM Transactions on
Programming Languages and Systems (TOPLAS), Volume 17, Issue 5, Sept. 1995,
pages 704-739.

[https://dl.acm.org/citation.cfm?id=213986](https://dl.acm.org/citation.cfm?id=213986)

~~~
cat199
don't mean to be crufty - the new work looks cool, but I also don't want the
previous works to go unnoticed.

------
jerf
Honest question for discussion: how much of the advantage of "serverless" is
the serverless aspect itself, and how much is the advantage of breaking an
application down into stateless functions?

While I'm confident that it's not "zero" in either case, it still sort of
feels to me that people are praising serverless for some things they could get
if they had the discipline to write stateless functions in the first place.
(And maybe keep the debugging functionality and the other infrastructure we've
built up for normal apps.) And as with many things when it comes to "cloud"
tech, it seems to me that people are a bit too bedazzled by the advantages to
come to a clear understanding of the disadvantages.
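To make the distinction concrete, a toy contrast (all names here are
hypothetical, not from any particular framework):

```python
def expensive_lookup(key):
    return key.upper()        # stand-in for a slow computation

class Store:
    """Stand-in for an external KV store (Redis, S3, ...)."""
    def __init__(self):
        self._d = {}
    def get(self, k):
        return self._d.get(k)
    def put(self, k, v):
        self._d[k] = v

# Stateful: correctness depends on process-local mutable state, so this
# only behaves as expected on the one box that holds `cache`.
cache = {}
def lookup_stateful(key):
    if key not in cache:
        cache[key] = expensive_lookup(key)
    return cache[key]

# Stateless: everything arrives through the arguments, so any instance
# on any machine (a Lambda, a container, a plain process) can run it.
def lookup_stateless(key, store):
    value = store.get(key)
    if value is None:
        value = expensive_lookup(key)
        store.put(key, value)
    return value

print(lookup_stateless("hi", Store()))  # "HI"
```

The second form is what "serverless" forces on you; the question is how much
of the benefit survives if you just adopt it voluntarily.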

~~~
tdb7893
I thought the main advantage of "serverless" was that you didn't have to
provision and maintain servers.

~~~
mrep
Also autoscaling, and it can be cheaper, since you utilize basically all of
your billed compute instead of overprovisioning server capacity.
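Rough arithmetic of that billing difference; every number below is a made-up
assumption, except the Lambda GB-second rate, which is approximately AWS's
published on-demand price:

```python
# All numbers are illustrative assumptions, not a real bill.
server_per_hour = 0.10           # always-on instance, paid 24/7
utilization = 0.15               # typical overprovisioned average

gb_s_price = 0.0000166667        # approx. Lambda on-demand $/GB-second
calls_per_hour = 5_000
mem_gb = 0.5
secs_per_call = 0.2

lambda_per_hour = calls_per_hour * mem_gb * secs_per_call * gb_s_price

print(f"server: ${server_per_hour:.4f}/h, ~{utilization:.0%} of it doing work")
print(f"lambda: ${lambda_per_hour:.4f}/h, billed only while running")
```

Crank `calls_per_hour` up a couple of orders of magnitude and the always-on
box wins instead, which is the lock-in tradeoff discussed further down the
thread.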

------
JDL-Amsterdam
This is something I ran into in academia. I made the following product as an
attempt to make scalable, distributed computing simple:

[https://www.crumpington.com/batcht/](https://www.crumpington.com/batcht/)

------
peterwwillis
These paradigms get thought up by people who don't run big systems, and then
get generalized far beyond where they are actually applicable.

Segregating compute from storage is untenable for many kinds of workloads. Not
to mention a single pipe to a single shared storage is a huge SPOF.

The best model for random devs who just want to scale up their jobs is to
model what the dev is using for dev. They usually use a laptop - fast local
disk, plenty of memory, fast CPU with speed bursts, fast interconnects, and no
network latency or interruption. Build your app for that, and then spawn a new
app for each compute node, and scale by nodes, and don't depend on the network
for your app to survive. You may recognize this paradigm: it's what we were
just starting to do 20 years ago. It's very resilient to failure, it isn't
fancy or complicated, and it scales very well.

~~~
stochastician
Original paper author here. I think we are in vehement agreement!

1a. We don't run big systems. Well, I don't, at least -- my coauthors wrote
Spark and similar systems, so they're certainly aware of some of the
challenges, but for us the major appeal here is that someone else (the cloud
provider) is running a tremendous amount of the system!

1b. I think you're totally right that these sorts of systems tend to be
generalized beyond their sweet spot. We have had quite passionate debates
internally as to the amount of complexity we should add to the underlying
system to support richer BSP efficiently. I just want to make sure the
"obvious" stuff works as easily and quickly as possible (that is, basic map)

2. Segregating compute from storage _is_ untenable for a lot of workloads!
But for a lot of the compute-heavy work we do, it's not. And we were really,
truly shocked at how big the bandwidth to S3 really is -- for our compute
applications, this degree of data disaggregation works surprisingly well.

3\. "Model what the dev is using for dev" is _exactly_ what we're trying to
accomplish. We even try and replicate the dev environment (thanks to anaconda,
cloudpickle, etc.) as transparently as possible. That's exactly what we're
trying to achieve, at the end of the day -- get back to where condor (another
bird-named job system that we love) had us at 20 years ago.
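For concreteness, the basic PyWren flow looks roughly like this (a sketch
patterned on the project's published examples; it assumes AWS credentials and
a configured PyWren deployment):

```python
import pywren

def addseven(x):
    return x + 7

pwex = pywren.default_executor()         # runs on AWS Lambda by default
futures = pwex.map(addseven, range(10))  # one serverless invocation per item
print([f.result() for f in futures])     # [7, 8, ..., 16]
```

The "transparent dev environment" claim is that `addseven` and its
dependencies get pickled and shipped, rather than you packaging a deployment
artifact by hand.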

~~~
peterwwillis
I re-read the article and now it makes more sense. The takeaway seems to be
"serialize your functions in bulk and de-serialize later in bulk": in essence,
treat your compute like your storage. But I don't see how this is distributed
computing for the 99%?
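The stripped-down mechanics of that "serialize in bulk" reading, as a sketch
(cloudpickle is what PyWren uses under the hood; staging the bytes in S3 is
described in the comment but not shown here):

```python
import cloudpickle

scale = 3.0
def job(x):
    return x * scale              # closure over local state

blob = cloudpickle.dumps(job)     # bytes you could stage in S3 in bulk

# ... later, inside a stateless worker:
fn = cloudpickle.loads(blob)
print(fn(2))                      # 6.0
```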

PyWren appears to be what some of the serverless platform providers were
missing, and also repeats a bit of history from the bad old days of parallel
processing. While it totally works for a lot of scientific and a few business
cases, it doesn't work for the general case. It's like saying MPI was the best
solution for the 99%. But I might be overlooking another detail.

The biggest problem with depending on one aspect of a system for performance
is that it becomes the bottleneck. We'll rely on low-latency, high-bandwidth
storage? Now if it becomes high-latency or low-bandwidth, our app is boned.
We'll rely on loads of cheap compute cycles? If compute nodes start choking,
our app is boned. We'll rely on RAM? Our data becomes bigger, and our app is
boned. We'll just scale horizontally? Our network gets partitioned, and our
app is boned.

It can work fine for systems that don't require performance guarantees. If
that's the 99%, then this seems like a good solution, though it also seems
like you're trading off learning a complex stack for learning HPC development
models.

------
indogooner
Anyone with experience migrating a large system to "serverless" who can
chime in about the benefits and costs? It seems like a good fit for event
handlers or operational tools. I feel two limitations, around "caching" and
"overall cost", should deter any large-scale system from moving.

~~~
imglorp
One obvious thing to keep in mind is the profit model.

AWS at least prices Lambda, API Gateway, and RDS so that it's very cheap and
attractive for projects with low resource needs. As soon as it gets popular,
though, they own your ass until you migrate off it. We looked at one
application on it and it worked out to billions monthly for high-traffic
loads, vs much less if we hosted on their regular boxen ourselves.

"operational tools" might be the right answer here - use it for something that
will have bounded scale and you can predict the expense.

------
ubik_
How about Apache Beam on the Google Cloud Dataflow runner? It's easy to use,
needs no real configuration, makes it easy to switch between streaming and
batch workloads, and performance is really fast.
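For reference, a minimal word-count-style pipeline in the Beam Python SDK; the
same code moves from the local runner to Cloud Dataflow via pipeline options
rather than code changes (a sketch, assuming the apache-beam package is
installed):

```python
import apache_beam as beam

# Runs on the local DirectRunner by default; pass
# --runner=DataflowRunner (plus project/region options) for Dataflow.
with beam.Pipeline() as p:
    (p
     | beam.Create(["to", "be", "or", "not", "to", "be"])
     | beam.Map(lambda w: (w, 1))
     | beam.CombinePerKey(sum)
     | beam.Map(print))
```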

