
Show HN: Corral – A Serverless MapReduce Framework - bcongdon
https://github.com/bcongdon/corral
======
bcongdon
Hi HN, author here. Corral is my attempt at a performant, easy-to-deploy
MapReduce framework. Unlike traditional frameworks like Hadoop, it uses AWS
Lambda for execution and is “serverless” as a result. It was initially kicked
off by AWS adding Lambda support for Go, but it draws on experience I’ve had
using Lambda tools like Zappa[1] and Serverless[2] in the past.

I think there are a lot of interesting applications for using function-as-a-
service platforms as executors in data processing frameworks like this one.

If you’re interested more in the development/internals of this project, I
wrote a blog post with more details:
[https://benjamincongdon.me/blog/2018/05/02/Introducing-Corral-A-Serverless-MapReduce-Framework/](https://benjamincongdon.me/blog/2018/05/02/Introducing-Corral-A-Serverless-MapReduce-Framework/)

[1]: [https://github.com/Miserlou/Zappa](https://github.com/Miserlou/Zappa)
[2]: [https://serverless.com/](https://serverless.com/)

~~~
quickben
Hi,

For a small MapReduce load, say a terabyte (to replace a single MR node), how
much would you estimate the AWS cost would be?

~~~
bcongdon
Pricing depends a lot on how much memory your job requires[1] and how much
processing each record needs -- i.e. the pricing is more sensitive to usage
than to raw input size.

As a very rough estimate, for a light-to-medium load of 1 TB, the cost would
probably be in the ballpark of ~$0.50. AWS's own reference MR framework[2]
(which is mostly a tech demo) quotes prices in a similar order of magnitude.

Corral isn't great for processing-heavy MR jobs, as Lambda pricing rises
quickly if you need a lot of memory or take a lot of time with each record.
But, for small-ish low-overhead jobs, it can pretty easily beat the pricing
and hassle of using something like EMR.

[1]: [https://aws.amazon.com/lambda/pricing/#Lambda_pricing_details](https://aws.amazon.com/lambda/pricing/#Lambda_pricing_details)
[2]: [https://github.com/awslabs/lambda-refarch-mapreduce/](https://github.com/awslabs/lambda-refarch-mapreduce/)
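For a sense of where an estimate like that comes from, here is a
back-of-the-envelope sketch. The per-GB-second and per-request rates are
Lambda's published prices; the split size, invocation count, memory setting,
and per-invocation runtime are illustrative assumptions, not numbers measured
from corral:

```python
# Rough Lambda cost model: compute (GB-seconds) plus request charges.
GB_SECOND_PRICE = 0.00001667   # Lambda's price per GB-second
REQUEST_PRICE = 0.20 / 1_000_000  # $0.20 per million invocations

def lambda_cost(num_invocations, memory_gb, seconds_each):
    compute = num_invocations * memory_gb * seconds_each * GB_SECOND_PRICE
    requests = num_invocations * REQUEST_PRICE
    return compute + requests

# 1 TB split into ~64 MB map splits -> roughly 16,000 invocations; assume
# 1.5 GB of memory and ~1 s of light processing per invocation:
print(round(lambda_cost(16_000, 1.5, 1.0), 2))  # ~0.4 dollars
```

Doubling the memory or the per-record processing time doubles the compute
term, which is why heavy jobs get expensive quickly.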

~~~
shaunray
Hi OP, I am the person who made the most recent changes to the AWS Labs
refarch. I had been working on a Go version and wanted to clean up the
Python one. Sunil, the original author, used the AMPLab benchmark to calculate
the results table. I was planning on updating it with the 1- and 5-node tests.
Would be happy to include Corral as well.

------
cle
Nice!

The use of S3 ListObjects is an immediate deal breaker though; its eventual
consistency can cause silent data corruption. To avoid the List, you'd need to
write a file manifest somewhere that contains a list of all S3 objects. If it
were me, I'd use DynamoDB and append keys to a StringSet on a single item (if
you use S3 for the manifest, it needs to be a single object, which means you
need to aggregate the keys first, which sounds tricky with Lambda). You'll
eventually hit a scaling limit with DDB's item size limit; if you want to
avoid that, writing an item per mapper with the same hash key and a different
range key might be better, and then you'd do a strongly consistent query to
reconstruct the manifest.
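For illustration, a rough boto3 sketch of that last scheme, assuming a
hypothetical `corral-manifest` table keyed on `job_id` (hash) and `mapper_id`
(range). The table name, attribute names, and helper functions are made up for
this sketch, and Query pagination is ignored for brevity:

```python
def record_mapper_output(table, job_id, mapper_id, s3_keys):
    # One item per mapper sidesteps DynamoDB's 400 KB single-item limit.
    table.put_item(Item={"job_id": job_id, "mapper_id": mapper_id,
                         "s3_keys": s3_keys})

def load_manifest(table, job_id):
    # A strongly consistent Query sees every mapper's item, unlike
    # S3 ListObjects (eventually consistent at the time of this thread).
    from boto3.dynamodb.conditions import Key  # lazy: sketch loads without boto3
    resp = table.query(
        KeyConditionExpression=Key("job_id").eq(job_id),
        ConsistentRead=True,
    )
    return flatten_manifest(resp["Items"])

def flatten_manifest(items):
    # Pure helper: splice the per-mapper key lists back into one manifest.
    return [key for item in items for key in item["s3_keys"]]

# Usage (requires AWS credentials and an existing table):
#   table = boto3.resource("dynamodb").Table("corral-manifest")
#   record_mapper_output(table, "job-1", "mapper-0", ["part-0000"])
#   print(load_manifest(table, "job-1"))
```

Because every mapper writes its own item, no single item grows toward the
item-size limit, and the consistent Query reconstructs a complete manifest
without relying on ListObjects.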

------
jedimastert
I'm a complete novice when it comes to distributed computing, so I'm not too
sure I get the idea of a MapReduce framework. I'm gonna try to suss it out
here as I understand it, and if someone could correct me where I'm wrong,
that'd be awesome.

First off, when I see "map" and "reduce" I think of the functional
programming/data processing equivalents of mapping, meaning to apply a
function to every element in a set (like capitalizing strings or dividing
everything by two or something) and reducing, meaning to iterate over a set,
processing it and combining it with some accumulator (like taking a sum).

What a MapReduce framework seems to do is take these two functions and run
them in parallel, splitting the data to take advantage of the independent
nature of these two functions. Data can be split however is convenient because
the map function doesn't need to worry about any data other than its own, and
it can run in as many processes as you can manage. Any mapped data can be put
into parallel reduce processes, which can run in any order because the order
of the data shouldn't matter.

All of that I get (although if I'm wrong that might explain why I'm confused).
I guess my main confusion is why the reduce function doesn't really fit with
the idea that I just put forward. I would think that the reduce function would
need some sort of "accumulator" input, and that you'd only get one thing as an
output, as opposed to more files of data. Perhaps the idea is that the reduce
is actually just any function that can only work on post-mapped data, or
even the only one that's supposed to change state in some way?

Can anyone shed some light on my confusion? What is the reduce function
actually supposed to do, if not what I just laid out?

~~~
mayank
> I would think that the reduce function would need some sort of "accumulator"
> input, and that you'd only get one thing as an output, as opposed to more
> files of data.

It may be easier to think of the reduce step more like a SQL GROUP BY rather
than a function of a list. The map phase emits a bunch of (key, value) pairs,
and all values with the same key are processed by the same reducer function
(but each key gets a new reducer, modulo implementation details).

So in your paradigm, there are many reduce functions, each starting with a
null accumulated value, resulting in many outputs rather than a single one.
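To make the GROUP BY analogy concrete, here's a minimal single-process sketch
of the map / group / reduce flow. This is just the semantics described above,
not any particular framework's actual API; the function names are made up:

```python
from collections import defaultdict

def map_phase(records, mapper):
    # Each mapper call emits zero or more (key, value) pairs.
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def shuffle(pairs):
    # The "GROUP BY": gather all values that share a key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    # One reduce call per key, each with its own fresh accumulator,
    # so the output is many (key, result) pairs, not a single value.
    return {key: reducer(key, values) for key, values in groups.items()}

# Word count: map emits (word, 1); reduce sums each word's group.
lines = ["the quick fox", "the lazy dog"]
pairs = map_phase(lines, lambda line: [(w, 1) for w in line.split()])
counts = reduce_phase(shuffle(pairs), lambda word, ones: sum(ones))
print(counts["the"])  # -> 2
```

In a real framework the shuffle step is the distributed part: pairs are routed
across the network so that each reducer receives every value for its keys.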

~~~
jedimastert
Interesting. That actually makes a lot of sense. I've done a touch more
reading, and I actually only just now realized that the "word count" example
doesn't count all of the words in a document, but rather all _occurrences of
each word_. That clears a lot of questions up for me.

------
mjn
Very different technical construction, but this reminds me a bit of Joyent's
Manta, which also lets you run map-reduce jobs over an object store without
explicit server management:
[https://news.ycombinator.com/item?id=5939340](https://news.ycombinator.com/item?id=5939340)

Source code:
[https://github.com/joyent/manta](https://github.com/joyent/manta)

I believe the cloud version has since been renamed to "Converged Analytics",
so this is probably the same thing:
[https://www.joyent.com/triton/analytics](https://www.joyent.com/triton/analytics)

------
amelius
Why call it "serverless" if I need to provide stuff like:

AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION, AWS_TEST_BUCKET

~~~
Willson50
Because you don't need to manage any servers.

------
Annatar
Any time I read "serverless", I get a violent allergic reaction. There is no
such thing, as software requires hardware to run on, and the person or persons
who went "serverless" simply chose to stick their head(s) in the sand and punt
the OS engineering and hardware design and maintenance off to someone else,
hoping that it will just work. But it does not, and eventually there will be
an outage and lost money. One can punt this responsibility off to someone
else, but there will be consequences.

If you want reliable infrastructure, first you must become a master system,
database and network administrator, then you must apprentice with a mentor to
become a system engineer, and finally, after several decades of practicing
as one, you will have enough experience and insight to become a system
architect. There is no way around that; no punting will help.

~~~
scarface74
_Any time I read "serverless", I get a violent allergic reaction. There is no
such thing, as software requires hardware to run on,_

The meanings of words morph over time. When developers mention serverless,
everyone knows what it means in that context. Just like when someone says
there is a bug in their code, no one thinks that there are roaches running
around in their computer.

 _and the person or persons who went "serverless" simply chose to stick their
head(s) in the sand and punt the OS engineering and hardware design and
maintenance off to someone else,_

When I write a program, I'm not writing assembly language. I'm also "sticking
my head in the sand" about how assembly works. AWS has a whole team of
people who know how to do that stuff.

 _hoping that it will just work. But it does not, and eventually there will be
an outage and lost money. One can punt this responsibility off to someone
else, but there will be consequences._

AWS is probably more reliable than what you could do on prem or at a colo.

 _If you want reliable infrastructure, first you must become a master system,
database and network administrator, then you must apprentice with a mentor to
become a system engineer, and finally, after several decades of practicing as
one, you will have enough experience and insight to become a system
architect. There is no way around that; no punting will help._

Tell that to Netflix. They host everything on AWS. They purposefully moved
from an on-prem architecture to AWS because they realized where their core
competence was.

~~~
Annatar
Netflix has an army of professional system and kernel engineers who optimize
and engineer the OSes they run on; they just use AWS for virtual machine and
datacenter capacity.

~~~
scarface74
They use a lot of AWS services - not just virtual machines:

- First, they use Ubuntu Linux

- SES (email)

- ElasticSearch (AWS has their own managed version)

- SQS (queueing system)

- S3 (storage)

- Auto Scaling

Most of the optimizations they do are standard things at scale, where they are
tuning by measuring performance.

[https://youtu.be/89fYOo1V2pA](https://youtu.be/89fYOo1V2pA)

and the slide deck from the presentation.

[https://www.slideshare.net/brendangregg/how-netflix-tunes-ec2-instances-for-performance](https://www.slideshare.net/brendangregg/how-netflix-tunes-ec2-instances-for-performance)

But that's even more of a reason to choose AWS: Netflix has open sourced
dozens of tools specifically related to AWS. You get to take advantage of
their tools and knowledge.

~~~
Annatar
So if anything, they are not running “serverless”, far from it: they really
dig their hands into the OS and system engineering. Just recently they did
kernel engineering on FreeBSD to push the networking as close to 100 Gbps per
interface as possible. You picked a bad example to make your point, a point
which I do not, from experience of doing this for several decades, recognize
as valid anyway.

By the by, I’m that guy who runs a private datacenter in the basement, from
designing one’s own rack-mountable servers to crimping the network cables and
running the fiber. Infrastructure is not something one should entrust to
others, because these others cannot be trusted, as has been proven by AWS
outages time and time again. And yes, I system-engineer my own operating
system as well. Well, I did - it’s all been running automatically for years
now.

~~~
scarface74
You were making a point that people should run their own data centers.
Netflix is not running their own data centers. Can you run your own data
center in 90 areas around the world? Can you run your own globally distributed
CDN? Your own globally distributed DNS? Are you saying that you never had any
downtime? Can you stand up a replica of your object store, your caching and
your database in both Europe and Asia so it can be closer to your outsourced
developers and foreign customers? Does your data center in your basement allow
me to set up subnets that span multiple physical data centers (not AWS
“regions”, but the distinct facilities within one - “Availability Zones”)
just in case your basement floods?

What if I don't want a replica of data - what if I want to duplicate my entire
infrastructure in Asia so the developers there can have a clone of our
infrastructure: databases, storage, VMs, load balancers, multiple
availability zones, etc.? How long would it take you to do that? I could do it
with a JSON template and one command.

How fast can you provision a half dozen load balancers and five or ten
dedicated computers - not VMs, but “dedicated hosts”? I can do it by creating
a JSON file and running one command from my terminal.

And why should I trust you to set up a more reliable, redundant network than
AWS? Again, Netflix didn’t trust _themselves_ to create more reliable
infrastructure and decided to trust a competitor to do the “undifferentiated
heavy lifting”, and the guy who led the transition is now a VP at AWS, so I
think he knows something about infrastructure.

~~~
Annatar
_You were making a point about people should run their own data centers._

I'm making the point that one can never write high quality software if one
does not master system administration and then system engineering first. It's
impossible to write high quality software without understanding the substrate
on which software is built. That's my message.

~~~
scarface74
It could easily be said that one cannot write quality software if one doesn't
understand how compilers work, how to write assembly language, how processors
are fabbed, etc.

How far down the rabbit hole do you want to go?

It's not a matter of "not understanding" how it works, it's a matter of
focusing on your core competency and even if you know _how_ to do something,
it's about where you can add value and where is it best to outsource.

AWS has dozens of services across dozens of regions around the globe to handle
infrastructure. Why waste time doing the "undifferentiated heavy lifting" that
you can't do as well? No matter how good you think you are, you can't
efficiently set up infrastructure as fast and as reliably as AWS.

How quickly could you set up duplicate data centers on opposite sides of the
continent for disaster recovery and/or to reduce latency? I can do it by
running a CloudFormation template.

~~~
Annatar
_How far down the rabbit hole do you want to go?_

All the way down to the hardware.

