
The rise of embarrassingly parallel serverless compute - prostoalex
https://davidwells.io/blog/rise-of-embarrassingly-parallel-serverless-compute
======
Uptrenda
Except that none of these compute resources are free, and the cost of cloud-based supercomputing is a complete rip-off. It's actually thousands of dollars cheaper to build your own supercomputer from used rack-mount servers than it is to use AWS, Azure, or Google Cloud for lengthy computations.

As an example: you can buy a Dell PowerEdge M915 from eBay with 128 cores for ~$500 USD, and a rack costs around the same. Five of them gives you 640 cores for a total cost of just ~$3k USD. That's 640 cores that you now own forever. Guess how much it would cost to use that many cores for a month on AWS? Well over $10k... and next month it's another $10k... and so is the next month...

With this option you only pay for power and still have the resources for all future experiments. I think the M1000e rack can fit 8 blades in total, so you could upgrade to a max of 1024 cores in the future! The downside with this particular rack is that it's very, very loud. But I've run the numbers here and it's hard to beat high-core-count, used rack servers on a $/GHz basis.
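
Rough back-of-the-envelope if you want to play with the numbers yourself; the power draw, electricity price, and per-vCPU cloud rate below are assumptions, not quotes:

    # All inputs are rough assumptions; adjust to your own numbers.
    used_hw_cost = 5 * 500 + 500                  # five used blades + rack, USD
    power_kw = 5 * 0.8                            # assumed ~800 W per loaded blade
    power_cost_month = power_kw * 24 * 30 * 0.12  # at an assumed $0.12/kWh

    cloud_rate_per_vcpu_hour = 0.02               # assumed on-demand-ish rate
    cloud_cost_month = 640 * 24 * 30 * cloud_rate_per_vcpu_hour

    print(f"own hardware, first month:  ${used_hw_cost + power_cost_month:,.0f}")
    print(f"own hardware, later months: ${power_cost_month:,.0f}")
    print(f"cloud, every month:         ${cloud_cost_month:,.0f}")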

~~~
jayd16
The way I look at this is that you still need a minimum of 3 people on call to have around-the-clock ops for this rack. Three IT salaries is more than $10k a month. Eventually you'll find the balancing point, but it's higher than you think.

~~~
afwaller
Yes. This is the reason cloud is popular. You don’t have to pay for computer
janitors.

~~~
aea65
We have a title and we don't carry mops, my dear condescending developer who ostensibly doesn't respect the entire ecosystem of people holding together the Internet, the public cloud, and your entire CI/CD pipeline for whatever bullshit you're cooking in NPM. Do you have any idea how many "janitors," many of them basically volunteers and most not employed by Amazon, are involved in the pedestrian aspect of merely making amazon.com land at the right place?

Yes, this is a good future: who would ever want to touch the computers they
operate? Better to rent computers and position the core competency of your
business with a third party, right? Because you can shed some of those pesky
janitorial salaries? I’ll be waiting by my phone when you’re on the verge of
bankruptcy once the clouds have their teeth in your books, and suddenly I get
promoted from “janitor” to “computer operator who has my interests in mind,
even though I not-so-subtly malign his existence”.

All you’ve done with this mindset is drive the same janitors to work for the
clouds instead and contributed to the downfall of computing as a discipline
that any player has any semblance of agency within, as the people who have
actually touched datacenter equipment all work for them now or sit around in
horror watching as a generation sympathetic to FLOSS arguments willingly hands
over the reins of _owning a computer_ to massive corporations.

~~~
louwrentius
Reading the comments in this thread already shows how many of those 'software janitors'[1] don't understand what is required (and how little) to run on metal.

It's not that cloud is never the right answer. But people have started to forget that running your own metal is still an option. Or, with current prices and performance, an even more viable option than it was in the past, because you can do so much more with so much less.

[1]: cranking out useless features nobody asked for while looking down on those (dev)ops people.

~~~
logifail
> software janitors don't understand what is required (and how little) to run
> on metal

Quite. In the middle of lockdown a client needed to spin up some virtual
machine instances to demo a product to a potential client. Previous boss had
been pushing a cloud-only strategy using Azure and was itching to retire all
the physical servers.

Problem: can't spin up any kind of VM due to lack of Azure capacity.

me: "Well, we have that [physical] dev server which we still have, we could
spin up the demo stuff on that..." new boss: "Oh, cool. Great lateral
thinking!" me: <wtf?>

Demo done. Potential client happy. Boss happy.

~~~
louwrentius
Thanks for sharing this example. Years ago I got a call from Azure saying I could not get the 10 extra VMs of capacity at that time. Insane.

------
ChuckMcM
Hilariously, in 2008 Google was hugely resistant to this idea. DARPA had just put out an RFP for a new way of running computationally demanding tasks, which at the time ran on supercomputers, on a 'shared nothing' architecture (which is what Google ran at the time, and I believe they still do). I had done some research in that space when I was at NetApp, looking at decomposing network-attached storage services into a shared-nothing architecture, so I had some idea of the kinds of problems that were the "hard bits" in getting it right.

I recall pointing out to Platform's management that if Google could provide an infrastructure that solved these sorts of problems with massive parallelism, which at the time required specialized switching fabrics and massive memory sharing, we would have something very special. But at the time it was a non-starter; way too much money to be made in search ads to bother with building a system for something like the 200 customers in the world, total.

I didn't care one way or the other if Google did it, so after running into the wall of "under 2s" a couple of times I just said "fine, your loss."

------
fxtentacle
I strongly believe that the author never tried out his examples.

One time, I wanted to process a lot of images stored on Amazon S3. So I used 3 Xeon quad-core nodes from my render farm together with highly optimized C++ software. We peaked at 800 Mbit/s downstream before the entire S3 bucket went offline.

Similarly, when I was doing molecular dynamics simulations, the initial state was 20 GB in size, and so were the results.

The issue with these computing workloads is usually IO, not the raw compute
power. That's why Hadoop, for example, moves the calculations onto the storage
nodes, if possible.

~~~
papaf
_That's why Hadoop, for example, moves the calculations onto the storage nodes, if possible_

You make a good point about I/O and I actually wanted to comment something
along the lines of "why not Hadoop?" since the programming model looks very
similar but with less mature tooling.

However, now I think about it, the big win of serverless is that it is not
always on. With Hadoop, you build and administer a cluster which will only be
efficient if you constantly use it. This Serverless setup would suit jobs that
only run occasionally.

~~~
fxtentacle
In my experience, the cloud is so slow and expensive for these tasks that even
if your job only runs once per day, you're better off getting a few affordable
bare metal servers.

Plus, most tasks that only run occasionally tend not to be urgent, so instead of parallelizing to 3000 concurrent executions, like the article suggests, you could just wait an hour.

Serverless is only useful if you have high load spikes that are rare but super
urgent. In my opinion, that combination almost never happens.

~~~
qeternity
This is us exactly. We pay around $2k a month for one of our analytics clusters. Hundreds of cores, over 1 TB of RAM, many TB of NVMe. Some days, when the data science team is letting a model train (on a different GPU cluster) or doing something else, the cluster sits there with a load of zero. But it's still an order of magnitude cheaper than anything else we've spec'd out.

~~~
namibj
Are you potentially interested in renting out idle capacity for batch jobs? If
so, what kind of interconnect do you have? Feel free to contact me (info in my
profile).

~~~
qeternity
Sadly, given how cheap the infra is, it's not worth it to us to have to share with someone else. Let's say we could cost-share 50%, thus saving us $12k/yr... we would spend a lot more than $12k setting up such a system, plus all the headaches that arise from sharing the infra.

But thanks for the offer! The natural market forces will drive cloud computing
prices down the same way they've driven everything else down. But until then,
roll-your-own can save loads.

~~~
namibj
I figured it might be unreasonable, but thanks for responding.

Yeah, I was particularly curious because I was unable to find better public offers than AWS (with their homebrew 100 Gbit/s MPI that drops InfiniBand's hardware-guaranteed delivery to prevent statically-allocated-buffer issues in many-node setups, allowing them quite impressive scalability) or Azure (with their 200 Gbit/s InfiniBand clusters), at least for occasional batch jobs.

I wouldn't ask if I could DIY for less than using AWS, but owning RAM is expensive. And for development purposes it would be quite enticing to just co-locate storage with compute, and rent some space on those NVMe drives for the hours/days you're running e.g. individual iterations on a large dataset to do accurate profile-guided optimizations (by hand or by compiler). Iterations only take a few minutes each, but loading what's essentially a good fraction of the RAM (minus scratch space, and some compression is typically possible) over the network makes setup take quite a long time (compared to a single iteration).

------
shoo
there is a subset of embarrassingly parallel problems that are heavily data-intensive.

E.g. suppose you have 100 TB of data files and you want to run some kind of keyword search over the data. If the data can be broken into 1000 x 100 GB chunks, then you can do some map-reduce-ish thing where each 100 GB chunk is searched independently, and then the search results from the 1000 chunks are aggregated. 1000x speedup! serverless!

however, if you want to execute this across some fleet of rented "serverless" servers, the key factors that will influence cost and running time are: (1) where is the 100 TB of data right now, (2) how are you going to copy each 100 GB chunk of the data to each serverless server, and (3) how much time and money will that copy cost.

I.e. in examples like this, where the time required to read the data and send the data over the network is much larger than the time required to compute the data once the data is in memory, it is going to be more efficient to move the code and the compute to where the data already is, rather than moving the data and the code to some other physical compute device behind a bunch of abstraction layers and network pipes.
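
A minimal local sketch of that map/reduce split, purely to make the shape concrete; here each "chunk" is just a file on disk, whereas on a serverless platform the map step would be one function invocation per chunk:

    from concurrent.futures import ProcessPoolExecutor
    from pathlib import Path

    def search_chunk(path, keyword):
        """Map step: scan one chunk, return the matching lines."""
        hits = []
        with path.open(errors="ignore") as f:
            for line in f:
                if keyword in line:
                    hits.append(f"{path.name}: {line.rstrip()}")
        return hits

    def search_all(chunk_dir, keyword):
        """Fan out one task per chunk, then reduce by concatenating results."""
        chunks = sorted(Path(chunk_dir).glob("*.txt"))
        with ProcessPoolExecutor() as pool:
            per_chunk = list(pool.map(search_chunk, chunks, [keyword] * len(chunks)))
        return [hit for hits in per_chunk for hit in hits]

    if __name__ == "__main__":
        for hit in search_all("./chunks", "error"):
            print(hit)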

~~~
gopalv
> there is a subset of embarrassingly parallel problems that are heavily data-intensive.

There's an even smaller subset which is one-shot data access.

> in examples like this where the time required to read the data and send the
> data over the network is much larger than the time required to compute the
> data once the data is in memory

The annoying thing about lambda and other functional alternatives is that
data-access patterns tend to be repetitive in somewhat predictable ways &
there is no way to take advantage of that fact easily.

However, if you don't have that & say you were reading from S3 for every pass,
then lambda does look attractive because the container lifetime management is
outsourced - but if you do have even temporal stickiness of data, then it
helps to do your own container management & direct queries closer to previous
access, rather than to entirely cold instances.

If there's a thing that Hadoop missed out on building into itself, it was a distributed work queue with functions with slight side-effects (i.e. memoization).

~~~
adev_
> If there's a thing that Hadoop missed out on building into itself, it was a distributed work queue with functions with slight side-effects (i.e. memoization).

Isn't that called Spark? :)

------
aquamesnil
Stateless serverless platforms like Lambda force a data-shipping architecture, which hinders any workflow that requires state bigger than a webpage (like in the included tweet) or coordination between functions. The functions are short-lived and not reusable, have limited network bandwidth to S3, and lack P2P communication, which does not fit the efficient distributed programming models we know of today.

Emerging stateful serverless runtimes[1] have been shown to support even big
data applications whilst keeping the scalability and multi-tenancy benefits of
FaaS. Combined with scalable fast stores[2][3], I believe we have here the
stateful serverless platforms of tomorrow.

[1] [https://github.com/lsds/faasm](https://github.com/lsds/faasm) (can run on KNative, includes demos)
[2] [https://github.com/hydro-project/anna](https://github.com/hydro-project/anna) (KVS)
[3] [https://github.com/stanford-mast/pocket](https://github.com/stanford-mast/pocket) (multi-tiered storage, DRAM+Flash)

------
superjan
It reminds me of this quote: "a supercomputer is a device that turns a compute-bound problem into an I/O-bound problem" (Ken Batcher).

~~~
bob1029
This is a very apt quote here.

"Serverless" is basically equivalent to a supercomputer in that context, but
then it goes on to exhibit latency characteristics that would be considered a
non-starter for a supercomputer.

Latency is one of the most important aspects of IO and is the ultimate
resource underlying all of this. The lower your latency, the faster you can
get the work done. When you shard your work in a latency domain measured in
milliseconds-to-seconds, you have to operate with far different semantics than
when you are working in a domain where a direct method call can be expected to
return within nanoseconds-to-microseconds. We are talking 6 orders of
magnitude or more difference in latency between local execution and an AWS
Lambda. It is literally more than a million times faster to run a method that
lives in warm L1 than it is to politely ask Amazon's computer to run the same
method over the internet.

This stuff really matters and I feel like no one is paying attention to it
anymore. Your CPU can do an incredible amount of work if you stop abusing it
and treating it like some worthless thing that is incapable of handling any
sizeable work effort. Pay attention to the NUMA model and how cache works.
Even high level languages can leverage these aspects if you focus on them. You
can process tens of millions of client transactions per second on a single x86
thread if you are careful.

Furthermore, the various cloud vendors have done an exceptional job of making their vanilla compute facility seem like a piece of shit too. These days, a $200/month EC2 instance feels like a bag of sand compared to a very low-end Ryzen 3300G desktop I recently built for basic lab duty. I'm not quite sure how they
accomplished this, but something about cloud instances has always felt off to
me. I can see how others would develop a perception that simply hosting things
on one big EC2 instance would mean their application runs like shit. I am
unsurprised that everyone is reaching for other options now. On-prem might be
the best option if you have already optimized your stack and are now
struggling with the cloud vendors' various layers of hardware indirection.
Simply going from EC2 to on-prem could buy you an order of magnitude or more
in speedup just by virtue of having current gen bare metal 100% dedicated to
the task at hand. Obviously, this brings with it other operational and capital
costs which must be justified by the business.

~~~
vidarh
Even going to managed servers in a datacentre tends to have the effect of letting you spec machines far closer to what you need.

I've cut the number of servers by large factors on several occasions when moving off cloud to managed servers in a data centre, because I've been able to configure the right mix of RAM, NVMe and CPU for a given problem instead of picking the closest match, which often isn't very close.

------
ipnon
It's going to take a new kind of software engineer to build these fully distributed systems. You can imagine calls for "Senior Serverless Engineers". Will conventional serverful engineers be left in the dust, or will the serverless engineers just break away and pioneer apps at a new scale?

~~~
threeseed
Serverless has been around for many years now.

It doesn't require a new kind of software engineer. It's just another software architecture to go alongside microservices, containerisation, etc.

And it hasn't changed the world because (a) it's the ultimate form of vendor lock-in and (b) it makes even simple apps much more complex to reason about and manage.

~~~
arcturus17
> much more complex to reason about and manage

I really dislike the local dev experience and deployment story for serverless. But otherwise the model is pretty clear: a file is a function, data goes in, data comes out, just like any other server-side function. If one instance is busy, spin up a new one.

What’s hard to reason about?
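
A minimal sketch of that mental model, roughly in the shape of a Python Lambda-style handler; the event fields ("numbers") are made up for illustration:

    import json

    # "A file is a function": the platform calls handler(event, context) per
    # request and spins up more copies when the existing ones are busy.
    def handler(event, context):
        numbers = event.get("numbers", [])
        return {
            "statusCode": 200,
            "body": json.dumps({"sum": sum(numbers), "count": len(numbers)}),
        }

    # Local smoke test; once deployed, the platform makes this call for you.
    if __name__ == "__main__":
        print(handler({"numbers": [1, 2, 3]}, None))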

~~~
flak48
Yep. Inversion of control has always been the norm with whatever
server/framework you choose.

If you have a /myroute handler defined in Express or Flask, or even a form.php or form.cgi in Apache, you never had to write the code that makes user requests trigger your handler, even in the old days. That's the entire point of using a server instead of listening on a socket yourself.

With serverless the same thing still happens with someone else managing the
path from a request to a handler and back.

In fact, if you ever used a cPanel host with PHP like in the good old days (110mb.com, anyone?), you've already used 'serverless'. You just uploaded .php files to a directory and your website just 'magically' worked.

------
rahimnathwani
The scraping example seems poorly chosen. The original blog post describing
that example is no longer online, but archive.org has a copy:
[https://web.archive.org/web/20180822034920/https://blog.seanssmith.com/posts/pywren-web-scraping.html](https://web.archive.org/web/20180822034920/https://blog.seanssmith.com/posts/pywren-web-scraping.html)

If the author just wanted to fetch pages in parallel, they could have done
better than 8 hours even on their own laptop (you can run more than one
chromium process at a time). The real benefit they got from using AWS Lambda
is that the requests weren't throttled or ghosted by Redfin, probably because
the processes were running on enough different machines, with different IP
addresses.

------
alexchamberlain
> One of the challenging parts about this today is that most software is
> designed to run on single machines, and parallelization may be limited to
> the number of machine cores or threads available locally.

Depending on how you look at it, I don't think most software is designed to take advantage of multiple cores, let alone multiple machines.

~~~
efnx
Both points are true, but the author is talking about writing new software
specifically for serverless compute.

------
raverbashing
But how efficient is serverless?

Has anyone benchmarked the speed of running (let's say, on AWS) 1000x a Lambda function vs. running the same function on regular AWS instances?

What about all the overhead (for example, k8s overhead, both in CPU and disk, etc.)?

I'm afraid it would be very easy to get a repeat of this:
[https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html](https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html)

------
yangl1996
Such a scheme depends heavily on whether the cloud providers can efficiently multiplex their bare-metal machines to run these jobs concurrently. Ultimately, a particular computing job takes a fixed amount of CPU-hours, so there's definitely no saving in such a scheme in terms of energy consumption or CPU-hours. At the same time, overhead arises when a job can't be perfectly parallelized: e.g. the same memory content being replicated across all executing machines, synchronization, the cost of starting a ton of short-lived processes, etc. All of this overhead adds to the CPU-hours and energy consumption.

So, does serverless computing reduce job completion time? Yes, if the job is somewhat parallelizable. Does it save energy, money, etc.? Definitely not. The question is whether you want to make the tradeoff here: how much more energy are you willing to pay for, if you want to cut the job completion time in half? It's like batch processing vs. realtime operation. The former provides higher throughput, while the latter gives the user lower latency. Having better cloud infrastructure (VMs, schedulers, etc.) helps make this tradeoff more favorable, but the research community has only just started looking at this problem.
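
A toy model of that tradeoff, with all numbers purely illustrative: adding workers shrinks wall-clock time (down to the serial fraction) while total CPU-hours burned only grows:

    # Illustrative only: serial_fraction is the non-parallelizable part of the
    # job, overhead is the per-worker startup/synchronization cost.
    def job_cost(total_cpu_hours=100.0, serial_fraction=0.05,
                 overhead_hours_per_worker=0.02, workers=1):
        serial = total_cpu_hours * serial_fraction
        parallel = total_cpu_hours - serial
        wall_clock = serial + parallel / workers + overhead_hours_per_worker
        cpu_hours = total_cpu_hours + workers * overhead_hours_per_worker
        return wall_clock, cpu_hours

    for n in (1, 10, 100, 1000):
        wall, burned = job_cost(workers=n)
        print(f"{n:5d} workers: {wall:7.2f} h wall clock, {burned:6.1f} CPU-hours burned")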

------
qeternity
Lambda compute only seems to be on the rise to those who never used it before. We've been running huge amounts of "serverless"-style workloads on a Celery + RabbitMQ setup. Our workloads are fairly stable, so bursting in a public cloud has no real value for us, but we regularly do batch jobs that burst capacity, and we just spin up more workers for those.

The author seems to think the paradigm is new (it isn't) and claims that it hasn't taken off massively (it has) because he incorrectly points to a number of workloads that aren't embarrassingly parallel. On the other hand, in theory, having a common runtime for these operations from a public cloud provider should let them keep their utilization of resources extremely high, such that it would be cheaper for us to use AWS/GCP/etc. instead of rolling our own on OVH/Hetzner. But if anything, the per-unit compute cost of FaaS is higher than it is for other compute models, which means the economics really only work for small workloads where the fixed overhead of EC2 is larger than the variable overhead of Lambda.

------
tlarkworthy
Don't overcomplicate it. xargs and curl are often enough to drive big, ad-hoc jobs.

~~~
kristopolous
These things only have a handful of customers in the world.

Datasets that are tens of gigabytes, or maybe 100 million records or so... this really covers most things.

And for every 1 thing it doesn't, there are 20 more claimed that a single machine using simple tools could handle just fine.

Being able to detect when things have been processed, having a way to set dirty flags, prioritizing things, having regions of interest, supporting re-entrant processing, caching parts of the pipeline and having nuanced rules for invalidating it: these, in my mind, are kinda basic things here.

When they aren't done, sure, someone will need giant resources because they're
doing foolish things. But that's literally the only reason. Substituting money
for sense is an old hack.

~~~
necovek
Opening your mirrorless camera's SD card full of images, and the image thumbnailer takes... forever?

Doing a facial search on it?

Matching a rhythm picked up by the mic to your local music collection?

Hashing and/or encryption of data.

There's plenty of desktop-like use cases that would benefit from massively
parallel computation, but network (or even IO) bandwidth is currently going to
be the limiting factor.

IOW, we are not there yet!

Currently, we can parallelize tasks which are low on data and high on
computation.

So how can we expand the IO bandwidth for everyone, even desktop or mobile
users?

~~~
tlarkworthy
Why is your image and music collection not in the cloud? Definitely an 'xargs + curl to serverless' job in a modern setup (if auth were easier).

~~~
necovek
Privacy and control. My email is not in the cloud either.

Still, as you note yourself with "if auth was easier", we'd need custom
applications even for the cloud — it's just that you'd hope they provide
unbounded bandwidth for each user, but I am not even sure that's the case for
the biggest of players (dropbox, google drive...).

------
lachlan-sneff
There's no way to distribute really lightweight thunks of arbitrary code.
Maybe WASM can work here, especially if you shape the standard APIs in the
right way.

You'd also need built-in support in tooling and compilers, so that you can compile specific functions or modules into something that can run separately, without having to do that manually.

~~~
theamk
This depends on how lightweight those chunks are.

If your goal is <0.1 sec startup -- yeah, then you'll need WASM.

If you are OK with 1-5 seconds of startup, you have a _ton_ of options. Apache Spark uses JVM magic to send out the raw bytecode. You can start up a Docker container. If you are willing to rewrite stdio, you can exec machine code under seccomp/pledge.

There are even full-blown VM solutions -- Amazon Firecracker, which claims
that: "Firecracker initiates user space or application code in as little as
125 ms and supports microVM creation rates of up to 150 microVMs per second
per host."

~~~
threeseed
And for people who don't know: AWS Lambda uses Firecracker under the hood.

------
Lio
This seems to come up a lot in these discussions but Moore’s Law (actually
just an observation) says nothing about single thread performance.

It only tells you that the number of transistors on a silicon die will double every 18 months.

If we’re still able to add parallel threads of execution at the same rate then
Moore’s Law still holds.

------
vsskanth
Yeah, but how expensive will it be? Some numbers would be nice for something like hyperparameter tuning.

------
choeger
> Software compilation

Well, no. Software compilation does _not_ parallelize massively. Maybe parts of the optimization pipeline do, but compiling a 1000-unit program (assuming your language of choice even _has_ separate compilation) normally requires putting the units into a dependency graph (see OCaml, for instance), or puts most of the effort into inherently serial tasks like preprocessing and linking (C++).
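
A tiny illustration of the dependency-graph point, using a made-up module graph: each "level" can be compiled in parallel, but the chain of levels is inherently serial, so 1000 units rarely means 1000-way parallelism:

    from graphlib import TopologicalSorter  # Python 3.9+

    deps = {                      # module -> modules it depends on (illustrative)
        "main":    {"parser", "codegen"},
        "parser":  {"lexer", "ast"},
        "codegen": {"ast"},
        "lexer":   set(),
        "ast":     set(),
    }

    ts = TopologicalSorter(deps)
    ts.prepare()
    level = 0
    while ts.is_active():
        ready = list(ts.get_ready())  # everything whose dependencies are built
        level += 1
        print(f"level {level}: compile {sorted(ready)} in parallel")
        ts.done(*ready)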

------
cranekam
> Note: If you are comfortable with Kubernetes and scaling out clusters for
> big data jobs & the parallel workloads described below, godspeed!

Probably my pedantic side showing through, but I find reading text where an ampersand is used in place of "and" really jarring (same for capitalised regular nouns). It seems somewhat common now, so I guess I'll have to get used to it.

~~~
loopback_device
Kubernetes is a proper noun, and thus always to be capitalized in English,
while the "&" is a ligature of the Latin "et", the word for "and". Not sure
why that would be jarring, it's all fine and correct English.

~~~
cranekam
I wasn’t talking about Kubernetes being capitalised. I know that’s a proper
noun and deserves capitalisation. I didn’t intend to confuse on this point.

Although & _means_ “and” they are generally used differently. & is used in
places like company names where it’s part of a noun (e.g. B&Q, Smith &
Wesson). “And” joins parts of a sentence. I find it jarring to use & because
a) it looks like a punctuation mark and I naturally pause when reading, b) I
expect to have read a noun, not a join in a sentence, and it takes some
cognitive effort to re-parse the sentence using & in a way I didn’t first
expect. Reading, especially quickly, relies a lot on expectation and pattern
matching and I find this disrupts it. If you don’t, good for you.

Obviously in informal speech people write whatever they want and it’s true
that language evolves over time. But I’d argue that using & instead of and
isn’t “correct”, at least by current standards — if it was we’d see this used
in newspapers, books, and so on.

------
jonnypotty
We can't even design systems that properly take advantage of multiple CPU cores yet.

------
dstaley
I'm a huge fan of using Lambda to perform hundreds of thousands of discrete tasks in a fraction of the time it'd take to perform those same tasks locally. A while back I used Lambda and SQS to cross-check 168,000 addresses with my ISP's gigabit availability tool.[1] If I recall correctly, each check took about three seconds, but running all 168,000 checks on Lambda only took a handful of minutes. I believe the scraper was written in Python, so I shudder to think about how long it would have taken to run on a single machine.

[1] [https://dstaley.com/2017/04/30/gigabit-in-baton-rouge/](https://dstaley.com/2017/04/30/gigabit-in-baton-rouge/)

~~~
kortilla
> I believe the scraper was written in Python, so I shudder to think about how
> long it would have taken to run on a single machine.

Scraping is an embarrassingly perfect scenario for coroutines. Most
asynchronous frameworks even use scraping as one of the examples.

In short, it would probably be done in 15 minutes, assuming you don’t get
throttled quickly. If the tool wasn’t already async capable, another 15
minutes to wrap some scraping in gevent/eventlet.

~~~
nijave
Even without async, it's pretty easy to slap a concurrent.futures ThreadPoolExecutor on something normally single-threaded and get massive performance gains.
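
Something like this, for instance; the URLs below are placeholders, and a real job would also want retries and rate limiting:

    from concurrent.futures import ThreadPoolExecutor, as_completed
    from urllib.request import urlopen

    urls = [f"https://example.com/item/{i}" for i in range(100)]  # placeholders

    def fetch(url):
        with urlopen(url, timeout=10) as resp:
            return url, len(resp.read())

    # One thread pool, N concurrent fetches; most of the time is spent waiting
    # on the network, so threads help even with the GIL.
    with ThreadPoolExecutor(max_workers=32) as pool:
        futures = {pool.submit(fetch, u): u for u in urls}
        for fut in as_completed(futures):
            try:
                url, size = fut.result()
                print(f"{url}: {size} bytes")
            except Exception as exc:
                print(f"{futures[fut]}: failed ({exc})")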

------
mirimir
> Basically grep for the internet.

So is this a workaround for "censorship" by Google etc?

And where would the crawl archives come from?

Also, I wonder how this could be made usable and affordable for random
individuals.

------
mD5pPxMcS6fVWKE
But Azure Functions has existed for years; is AWS only now getting to that?

