
Serverless Map/Reduce - emilong
http://tothestars.io/blog/2016/11/2/serverless-mapreduce
======
ralusek
This is a screenshot of my google search from 2 days ago:

[http://i.imgur.com/BNAcSsn.png](http://i.imgur.com/BNAcSsn.png)

I've been using Lambda quite a bit, I think it's SO amazingly useful. Tasks
that are highly parallelized and CPU intensive can literally be infinitely
scaled out. I find it weird that their poster child use case is still always a
reactive event like watching S3 and formatting images. There are so many use
cases for invoking a lambda directly from your code.

Imagine a case where you had to parse a million documents with a relatively
expensive computation, let's say 250 ms per document. Maybe you have a solid
machine with a few cores that's running your server, but even then you can't
have the server cpu locked for so long, so naturally you'd need some sort of
worker server set up. With a good machine and multiple cores, maybe you get 8
running at once. With a lambda, you can forego the worker server altogether.
Just invoke a million lambdas directly from your application server,
completely parallelized.

Theoretically, you've taken something that would take 70 hours and had it run
in 250ms without having to set up any additional infrastructure.
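
A back-of-the-envelope sketch of that arithmetic, assuming perfectly parallel work and ignoring invocation overhead and any concurrency limits (the function name and worker counts are illustrative only):

```python
import math

def wall_clock_seconds(n_docs, seconds_per_doc, workers):
    """Idealized wall-clock time when docs are split evenly across workers."""
    batches = math.ceil(n_docs / workers)  # rounds of work each worker handles
    return batches * seconds_per_doc

N_DOCS = 1_000_000
SECONDS_PER_DOC = 0.25  # the 250 ms per document from the example above

serial = wall_clock_seconds(N_DOCS, SECONDS_PER_DOC, workers=1)
eight_workers = wall_clock_seconds(N_DOCS, SECONDS_PER_DOC, workers=8)
one_per_doc = wall_clock_seconds(N_DOCS, SECONDS_PER_DOC, workers=N_DOCS)

print(f"serial:     {serial / 3600:.1f} hours")
print(f"8 workers:  {eight_workers / 3600:.1f} hours")
print(f"1 per doc:  {one_per_doc:.2f} seconds")
```

The serial figure comes out just under 70 hours, and the 8-worker figure just under 9 hours, so the "70 hours vs 250ms" comparison is against a single core rather than the 8-way worker box.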

~~~
hueving
>Theoretically, you've taken something that would take 70 hours and had it run
in 250ms without having to set up any additional infrastructure.

And you've spent the cost of building out that 8 server infrastructure in one
batch.

~~~
etrain
One of the nice things about lambda is the billing is _super_ granular - you
get billed at 100ms intervals.

Assuming 70 hours at $0.000000834/100ms [1]

The whole job costs $2.10.

[1] -
[https://aws.amazon.com/lambda/pricing/](https://aws.amazon.com/lambda/pricing/)
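
A quick sketch of that calculation, using only the per-100ms price quoted above and assuming each invocation is rounded up to the next 100ms (it leaves out any per-request charge, and the function name is made up for illustration):

```python
import math

PRICE_PER_100MS = 0.000000834  # USD, the figure quoted above

def lambda_compute_cost(invocations, ms_per_invocation,
                        price_per_100ms=PRICE_PER_100MS):
    """Compute-time cost with each invocation billed in 100 ms units."""
    billed_units = math.ceil(ms_per_invocation / 100)  # round up per invocation
    return invocations * billed_units * price_per_100ms

# Treating the job as one flat block of 70 hours of compute.
seventy_hours_ms = 70 * 3600 * 1000
flat_cost = lambda_compute_cost(1, seventy_hours_ms)

# Treating it as a million separate 250 ms invocations instead.
per_doc_cost = lambda_compute_cost(1_000_000, 250)

print(f"flat 70 hours:        ${flat_cost:.2f}")
print(f"1M x 250 ms (as 300): ${per_doc_cost:.2f}")
```

Under that rounding assumption the per-document version is somewhat more expensive than the flat estimate, because each 250ms call is billed as 300ms.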

~~~
nostrademons
By comparison (since I was curious), 70 hours on an m3.medium spot instance
will run around $0.70. On an on-demand, it's about $5.39. EMR will cost you
about $7.00 on top of the EC2 costs.

If you can peg the CPU and don't mind getting interrupted, spot instances are
still a fair bit cheaper. But Lambda looks pretty attractive for any other
use-case where the statelessness of Lambda doesn't bite you.

~~~
etrain
Keep in mind that with EC2 you're billed hourly so the _fastest_ a 70 cpu-hour
job could finish on m3.medium for $0.70 is 1 hour, and that's ignoring setup
time, etc.

Meanwhile, on Lambda, you can actually run 1600 60s jobs (or 27 CPU-hours) in
3 minutes. This is inclusive of setup time, job submission, stragglers, etc.
[1]

Of course, if you've got sustained load, it's cheaper to go with spot
instances, but the "occasionally I need a buttload of compute," model is well-
served by Lambda.

[1] [http://ericjonas.com/pywren.html](http://ericjonas.com/pywren.html)
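
To make the hourly-billing point concrete, here is a rough sketch. The $0.01/hour price is only what the $0.70 for 70 hours figure above implies, not a quoted spot price, and the work is assumed to split evenly across instances:

```python
import math

IMPLIED_PRICE_PER_HOUR = 0.70 / 70  # assumption derived from the thread's numbers

def hourly_billed_cost(cpu_hours, instances, price_per_hour=IMPLIED_PRICE_PER_HOUR):
    """Return (cost, billed hours per instance) when billing is per whole hour."""
    hours_each = math.ceil(cpu_hours / instances)  # partial hours round up
    return instances * hours_each * price_per_hour, hours_each

for n in (1, 70, 140):
    cost, hours = hourly_billed_cost(70, n)
    print(f"{n:>3} instances: {hours:>2} billed hour(s) each, ${cost:.2f}")
```

Going from 1 to 70 instances keeps the cost flat while cutting wall time to an hour, but past that point each extra instance still pays for a full hour, so 140 instances cost twice as much without finishing inside a shorter billed window.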

~~~
IanCal
As a note for people, if your constraints are a bit different then these are
some services to check out:

Joyent Manta: [https://www.joyent.com/manta](https://www.joyent.com/manta)

Hyper: [http://hyper.sh](http://hyper.sh)

Possibly Joyent Triton:
[https://www.joyent.com/triton](https://www.joyent.com/triton)

I personally often want to run a bunch of things for ~1-15 minutes, and have
too much data or setup to fit neatly in a lambda function. However, I don't
need 1000 things running simultaneously, although manta would help still
there.

I'd love to see some more layers over the top of services like this, hopefully
someday getting us back to picloud. I miss that service.

~~~
marktangotango
How secure are docker hosts like hyper.sh? I've always been skeptical, the
multitenant docker security story hasn't been very encouraging, or has that
changed?

~~~
riobard
hyper.sh containers are kernel-isolated like virtual machines.

------
stochastician
Author of the figures used in the blog post here. We wrote
[https://github.com/ericmjonas/pywren](https://github.com/ericmjonas/pywren)
somewhat on a lark, because it seemed to fit well with our research goals and
it's fun to push systems to their limit. I'm now a total serverless convert!
I'd love more collaborators and feedback, the goal is to make these sorts of
computations as easy as possible for python developers, especially on the
scientific computing side of things.

------
danso
OT: I teach computational methods and even as much as I dislike
teaching/conflating it with web dev, I have included "let's build a web app"
because students like building and deploying a thing, and because Heroku has a
free tier.

I've considered the possibility of having students do things on AWS (beyond
web dev), including Lambda, and just expensing the costs. It seems feasible to
quickly set up every student with controlled access via IAM...but is there a
way to set up rate-limiting, ideally through a policy? That is, shut an IAM
down if a student accidentally invokes a million processes? Or, for that
matter, limiting the storage capacity of a S3 bucket?

~~~
autotune
I would just set up a new account for each student, have them use their own
billing info, have them use the free tier, teach them how to set up billing
alerts, and let them go to town. They're going to need to learn to take cost
into account when working at a real job with AWS so this is the best way to
teach them to take accountability.

~~~
PetahNZ
I'm not sure it's reasonable to expect your students to have a credit card.

~~~
autotune
Depends on the age group. Anyway most could get a secured credit card if
needed.

~~~
danso
Yeah that's the thing. Most students do have a credit card. But I've had a few
who are very much against it, for financial or privacy reasons. For me, it's
not worth compelling them to change their ways (which I'm highly sympathetic
to) for what could amount to as little as $5 of AWS costs.

------
stcredzero
I wonder if something like AWS Lambda could be applied to multiplayer games?
It seems like game-loop based games would be a good domain for such a
programming model. The entire game could be expressed as a function that turns
tick N into tick N+1. Such a function would be composed of many other
functions, of course. So for example, there would also be a function that took
as an argument the player at time N and gave the player at time N+1.
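
A minimal sketch of that shape, with made-up `Player` and `World` types and a trivial movement rule standing in for real game logic:

```python
from dataclasses import dataclass, replace
from typing import Tuple

@dataclass(frozen=True)
class Player:
    name: str
    x: float   # position
    vx: float  # velocity in units per tick

@dataclass(frozen=True)
class World:
    tick: int
    players: Tuple[Player, ...]

def step_player(player: Player) -> Player:
    """Player at tick N -> player at tick N+1, without mutating the input."""
    return replace(player, x=player.x + player.vx)

def step_world(world: World) -> World:
    """World at tick N -> world at tick N+1, composed from per-player steps."""
    return World(tick=world.tick + 1,
                 players=tuple(step_player(p) for p in world.players))

world0 = World(tick=0, players=(Player("a", 0.0, 1.5), Player("b", 10.0, -2.0)))
world1 = step_world(world0)
print(world1)
```

Because the inputs are frozen and every step returns a new value, `world0` is still intact after the call, which is what would let a platform run, retry, or distribute the per-player steps freely.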

Such a model would allow infrastructure developers to abstract away most of
the concerns around networking, collisions, security, etc., and let game
developers concentrate their efforts on simply making the game.

I currently have a game server cluster written in Golang, where the locations
are instantiated with an idempotent request operation. It doesn't matter if a
particular location-instance exists at a particular moment. It's sufficient
for the "master control" server to only approximately know the loads of the
different cluster server processes. My experience leads me to believe that
something like AWS Lambda, but optimized for implementing game loops would
work well, so long as game developers could get their heads around pure
functional programming and implement with soft real-time requirements in mind.
(John Carmack already advocates the use of pure functions, and game devs in
general already do the latter.)

[http://www.emergencevector.com](http://www.emergencevector.com)

~~~
z3t4
What is the latency overhead in AWS Lambda ?

~~~
stcredzero
Too large for me to want to make a game in it. But if you made a specialized
version that had an actual game loop underneath it, there's some potential
there.

------
lucd
How does it compare to Joyent's 3-year-old Manta? AFAIK it was designed
especially for this kind of purpose. The processing is done directly on the
servers storing the data.

~~~
jahewson
Manta is pretty similar to Elastic MapReduce, which also runs the computation
on the same node as the data. So it compares pretty much the same as EMR.

------
thinkloop
The article counts characters in documents stored on S3 - which makes sense
since S3 is great for storing documents and can handle unlimited concurrency,
priced per usage.
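
For anyone who hasn't read it, the shape of that job can be sketched locally. This is not the article's code: an in-memory dict stands in for the S3 bucket, a thread pool stands in for parallel Lambda invocations, and all names and data are made up:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for objects in an S3 bucket: key -> document text (made-up data).
DOCUMENTS = {
    "docs/a.txt": "hello",
    "docs/b.txt": "serverless",
    "docs/c.txt": "map reduce",
}

def map_count_chars(key):
    """Mapper: would be one function invocation per object in the real setup."""
    return len(DOCUMENTS[key])

def reduce_sum(counts):
    """Reducer: combine the per-document counts into one total."""
    return sum(counts)

# Fan out one mapper per document, then reduce the results.
with ThreadPoolExecutor(max_workers=3) as pool:
    per_doc_counts = list(pool.map(map_count_chars, DOCUMENTS))

total_chars = reduce_sum(per_doc_counts)
print(per_doc_counts, total_chars)
```

Each mapper only needs its own object, which is why blob storage with high read concurrency fits this pattern so well.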

But what's the solution for structured data? DynamoDB is the obvious main
candidate, but it's billed by hour and high concurrency is very expensive,
requiring complicated temporary increases and decreases of concurrency that
are hard to predict.

Is there a good solution for running massively parallel lambdas on structured
data?

~~~
ranman
If you're doing any sort of table scan op then DDB perf/cost will be less than
stellar. If you have an index / range key it works really (like really) well
-- even in massively parallel situations.

If you're dealing with a TON (5+ TB) of data I recommend heading in to RDS,
BQ, or redshift.

~~~
thinkloop
It's less the total size of the data I'm worried about and more the
concurrency. For example, say I had a process that retrieved 1000 tiny records
(using index query) and ran some cpu-intensive calculation on them, and I
wanted to run 1000 of those processes simultaneously to reduce into a final
result. This would require tuning dynamo to thousands of concurrent reads (and
maybe writes, depending on the process), then scaling it back down after the
operation because it is very costly and priced by hour. This makes it
complicated and expensive on dynamo.

It seems the only storage services compatible with variable unlimited bursts
of concurrency are S3 and SimpleDB. S3 comes with many problems for handling
structured data (no update of records only replace, locking, listing items is
slow/costly, etc.). SimpleDB is no longer being iterated, is limited to 10gb
per domain, and looks like it's being slowly phased out.

It seems like massively parallel lambdas depend on few fetches of large blobs
of data - which is basically batch-processing EMR-style, or better suited to
redshift. Not something that opens the door for novel use-cases.

I would have really liked for dynamodb to be more of a service than a vm. I
wish its concurrency was unlimited and you paid for usage rather than time.
Basically DynamoDB with SimpleDB pricing.

~~~
ec109685
Just use RDS and S3 for the blobs. RDS can do tens of thousands of index
lookups a second.

If you only need one index, then just name your s3 document by the compound
index value and call it a day.
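
A small sketch of what that naming could look like, assuming a hypothetical (customer_id, order_date) compound index whose values contain no "/" characters:

```python
def s3_key(customer_id: str, order_date: str) -> str:
    """Build an object key from a (customer_id, order_date) compound index."""
    return f"orders/{customer_id}/{order_date}.json"

def parse_s3_key(key: str):
    """Recover the index values from a key built by s3_key."""
    _prefix, customer_id, filename = key.split("/")
    return customer_id, filename[: -len(".json")]

key = s3_key("c42", "2016-11-02")
print(key)
print(parse_s3_key(key))
```

With the index baked into the key, a lookup is a single GET on a name you can compute, with no query step in between.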

Otherwise, just use RDS for everything.

~~~
thinkloop
From the RDS FAQ:

> In order to maximize your workload’s throughput on Amazon Aurora, we
> recommend building your applications to drive a large number of concurrent
> queries.

Perrrrfect. Thanks!

[https://aws.amazon.com/rds/aurora/faqs/](https://aws.amazon.com/rds/aurora/faqs/)

------
partycoder
I do not agree with term serverless. Amazon Lambda is a service, therefore
there is a server involved.

It's like saying deathless meat, because someone else killed the animal you
are consuming.

~~~
aikah
Fight the good fight, you're not the only one standing up against this
ridiculous moniker. Never back down. Keep calling out the bullshit every time
and everywhere this term is promoted. People ask "what's the big deal?" Well,
the big deal is that calling a service "serverless" is both a lie and
misleading for marketing purposes.

~~~
edblarney
It's not a lie, and it's not misleading.

The concept of 'server' loses all meaning in this given architecture.

You create 'Lambdas' - units of functionality - and they do what they are
supposed to do entirely independent of the underlying architecture.

In fact, using the concept of 'server' in a Lambda situation probably
obfuscates the situation and adds unnecessary complexity.

A 'server' is an implementation detail that concerns only those providing the
container/Lambda services. As long as the implementation lives up to the SLA
(i.e. performance, uptime, security, price, redundancy) that you have agreed
to - then it doesn't matter how it works.

~~~
Can_Not
All of that is running on a server, connecting to a server, etc. If this is
"serverless", then any random cpanel host has been doing "serverless"
php/mysql for 20 years.

So it is, in fact, both a lie and misleading.

------
plandis
I've always had one big question about Lambda. Is it really worth the cost you
pay for the convenience of it?

Is anyone using it in production that can comment?

~~~
edblarney
I think you can do the math yourself - the costs are published.

FYI - we did some experiments and the limiting factor was latency. 250-300ms
on average, you have to go through their API feature as well, and that's part
of the delay. But worse - Lambdas that have not been called for several
minutes (I'm assuming they are not 'hot') often take several seconds, up to
5s, to be called. So it creates a problem for intermittent traffic.

If that kind of latency is acceptable to you, it might work for you so long as
the cost equation is right.

I think some other people had issues with versioning, it's a problem we didn't
go far enough to observe.

~~~
twagner
Quick clarification: No need to go through API Gateway if you don't actually
need the https endpoint - all AWS SDKs can hit Lambda's REST APIs directly,
which also reduces p50 latency.
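
For Python, a minimal sketch of a direct call, assuming a client created with `boto3.client("lambda")` and a function that returns JSON (the helper name and the function name in the comment are hypothetical):

```python
import json

def invoke_directly(lambda_client, function_name, event):
    """Synchronously invoke a Lambda function through the SDK, no API Gateway.

    lambda_client is expected to behave like boto3.client("lambda").
    """
    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",  # wait for the function's result
        Payload=json.dumps(event).encode("utf-8"),
    )
    # The response payload is a stream; read it and decode the JSON body.
    return json.loads(response["Payload"].read())

# Usage, assuming boto3 is installed and AWS credentials are configured:
#   import boto3
#   result = invoke_directly(boto3.client("lambda"), "parse-document", {"id": 1})
```

Passing the client in rather than creating it inside the helper keeps the function easy to exercise with a stub when no AWS account is at hand.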

------
eistrati
Saw the presentation last week at ServerlessConf in London and it really looks
very promising. The cost behind this solution is what will really make me
check this out :)

P.S. Quoting the author: "As you can see for these queries, the reference
implementation performs reasonably well; it's nowhere near Redshift
performance for the same queries, but for the price it really can't be beat
today"

------
dnackoul
Does anyone have experience building mobile backends in Lambda? I was looking
at an API Gateway / Lambda / Amazon RDS stack for building a central data
store and was wondering what people's experience with that setup is?

~~~
cpollard0
Using a framework like Serverless or Chalice makes it incredibly easy for an
MVP

~~~
dnackoul
Thanks for the recommendations. Any experience with how these hold up as they
scale up?

------
c-smile
About the site: quite hard to read - almost white text on white background.

~~~
pjc50
Worse than that, doesn't appear for me at all until I use Readability Mode.

------
boulos
Note: the underlying comparison to other systems is from a 2014 blogpost [1]
which suggests they used the m2.4xlarge series of EC2 VMs (which were Nehalem
class parts from 2010). Nehalem vs Haswell or Broadwell (the likely parts
underlying Lambda) is a pretty big jump.

Disclosure: I work on Google Cloud, but I'm just pointing out a fact ;).

[1]
[https://amplab.cs.berkeley.edu/benchmark/](https://amplab.cs.berkeley.edu/benchmark/)

------
mallya16
Implementation guide for Serverless MapReduce:
[https://aws.amazon.com/blogs/compute/ad-hoc-big-data-
process...](https://aws.amazon.com/blogs/compute/ad-hoc-big-data-processing-
made-simple-with-serverless-mapreduce/)

------
willcodeforfoo
I wonder if Amazon will ever open Lambda up to any Docker image? (I know it's
possible to run binaries, but it's a bit of a pain to compile with the Amazon
AMI, etc.) Being able to have a bunch of `docker run` with any image would be
pretty powerful.

~~~
twagner
Yes. First step was [https://aws.amazon.com/blogs/aws/new-amazon-linux-
container-...](https://aws.amazon.com/blogs/aws/new-amazon-linux-container-
image-for-cloud-and-on-premises-workloads/). Layering Lambda's image on top of
that to assist people building and testing is definitely on our roadmap.

~~~
nolite
What about running lambda on custom AMIs ? Is that even remotely feasible one
day ?

------
frenchhacker
I guess the example assumes the data is already somehow in AWS. How is the
total cost affected if I wanted to run this setup on a 10TB dataset?

------
elcct
Is there any AWS Lambda equivalent that could be deployed on bare metal?

~~~
cpollard0
Docker

~~~
eudoxus
That's a horrible and flat out wrong answer. Docker simply provides a
transferable containerized execution environment for processes with a simple
push pull workflow.

Lambda is a fully elastic infrastructure you can build, deploy, and execute
functions.

Simply having docker-engine and docker-cli installed on my machine doesn't
give me anything close to lambda.

------
amelius
If it doesn't run on a server, then what does this plumbing-work run on?
Clickbait name?

~~~
luhn
Oh come on. Is it still not accepted that "serverless" is the colloquial name
for AWS Lambda and comparable services? Stop trying to make "FaaS" happen.
It's not going to happen.

~~~
CaptSpify
Just because you abstract something away doesn't mean it no longer exists. I
don't manually manage cache on my laptop's HD, but that doesn't mean it's
"cacheless".

~~~
luhn
I'm not saying "serverless" is a good term, I'm saying that it's _the_ term.
It's won. You can argue all you want that it's a terribly misleading/incorrect
term, but people aren't going to stop using it. So let's move on.

~~~
kuschku
> It's won.

That’s the linguistic descriptivist position.

We can also require all papers and journals and conferences to use "Function
as a Service" everywhere, and force all professors to teach "Function as a
Service", and require all official publications to use "Function as a
Service", by defining an authoritative dictionary, which gets its authority by
law.

Then wait a few months, and the term "serverless" will be gone.

Some countries handle their entire language that way – and have an official
institution tasked with updating the language every few years, and the updates
become mandatory for business communication, press releases, and schools.

Germany and France are some examples.

IMO, having grown up right after one of the largest such changes in recent
German history, it’s a better system than letting the mob decide how to call
things, or how to write words, because that leads to pure chaos.

~~~
detaro
Having a unified way of how to spell things is quite different than
prescribing what to call specific things. Which luckily outside of very
limited areas (legal terms, protected names and trademarks) doesn't exist in
your example countries either.

~~~
kuschku
Yes, it does.

We’ve renamed Camouflieren to Tarnen, and many other words like that.

There are hundreds of cases of words being entirely replaced.

A list of replacements of the past centuries, for example: Distanz → Abstand,
Liberey → Bücherei, Moment → Augenblick, Passion → Leidenschaft, Projekt →
Entwurf, Adresse → Anschrift, Korrespondenz → Briefwechsel, Komödie →
Lustspiel, Dialekt → Mundart, Orthographie → Rechtschreibung, Journal →
Tagebuch, Autor → Verfasser, Fundament → Grundlage, Antike → Altertum,
Parterre → Erdgeschoss, Universität → Hochschule, Terrorismus →
Schreckensherrschaft, Singular → Einzahl, Plural → Mehrzahl, poste restante →
postlagernd, Coupé → Abteil, Perron → Bahnsteig, Billet → Fahrkarte,
Retourbillet → Rückfahrkarte, download → herunterladen.

And even today there are large companies even funding groups working on
replacing parts of the language, be it to replace foreign words, or to
simplify words: [https://www.rossmann.de/unternehmen/soziale-
verantwortung/so...](https://www.rossmann.de/unternehmen/soziale-
verantwortung/soziale-projekte/verein-deutsche-sprache.html)

Obviously, the far-right takes it to an extreme level, even replacing Internet
with Weltnetz, but even in the left-wing there’s no opposition to replacing
words, or simplifying the language.

And even those opposing those changes (see criticism section here
[https://de.wikipedia.org/wiki/Reform_der_deutschen_Rechtschr...](https://de.wikipedia.org/wiki/Reform_der_deutschen_Rechtschreibung_von_1996)
) don’t oppose these concepts in general, just would rather like to see
different changes.

~~~
detaro
And most of these examples are in the common dictionaries, used in major
newspapers, will be accepted in your high-school German tests (as long as you
use them appropriately and spell them correctly) and are used or at least
easily understood by all native speakers. The others fell out of use over
time.

Recommendations by various authorities on what constitutes "good" use of
language change, and in Germany there might be more reliance on the big
dictionaries (but I honestly can't accurately gauge how this compares to
various parts of society in English-speaking countries), but actual language
use does not care all that much. The reforms and the dictionary are very
relevant for spelling and grammar, but have not much influence on _actual
selection of words_ , even less in specialist subjects.

People try to change language use with all kinds of motivations all the time,
but they can't do much more than suggest, individuals and organizations decide
what they agree to and what they don't. And this exists in other languages
just as much (trying to avoid offensive terms, trying to sound modern, trying
to remove foreign influences).

~~~
kuschku
Half of the words in above list were invented by one single person.

And that person has pushed all of those words into popular use by cooperating
with other authors, writers, newspapers at the time.

So, yes, that stuff is possible.

> but they can't do much more than suggest

Except, the authoritative dictionaries have authority in Germany because, per
definition, tests in schools have to be graded based on them, and official
communication of companies has to be written with them.

If a dictionary says "deprecated", these have to switch.

Which, in turn, has a direct effect only 13 years later (the maximum time it
takes someone to go through school).

~~~
detaro
Change through collaboration and use is exactly what you argued against: a new
term is coined, used first by "experts"/influential people, then goes into
widespread use and is codified in dictionaries. Right now we can watch a group
of people establishing "serverless" as a word for some kind of PaaS in
technical language, as stupid and confusing as we might think it is (I
personally hate the trend and would prefer the word be used for P2P or
client-side applications, but I think that ship has sailed). Documentation of expert
language will soon pick it up, if it hasn't already.

For purposes other than spelling and some grammar rules, dictionaries are nice
suggestions, but even in school (where the Rechtschreibreform actually has
legal "power", even if it doesn't anywhere else) a word not being in the Duden
didn't mean it didn't exist (and conversely, just because it is in there
doesn't mean it's good to use). Professional usage has its own conventions
(newspaper styleguides, common terms and ways of writing in scientific
disciplines, "PR speak"), even if they make for "worse" language, common usage
varies even more. And nothing actually _enforces_ language in all those areas,
which make up most of our language use. On the contrary, it's used as input
for new iterations of guidelines and dictionaries.

Spelling and grammar have been "designed by committee" and relatively
successfully legally enforced, the words used are not. They are influenced by
motivated groups, but that's part of the linguistic description model just as
well.

