Show HN: Corral – A Serverless MapReduce Framework (github.com)
92 points by bcongdon 4 months ago | 36 comments



Hi HN, author here. Corral is my attempt at a performant, easy-to-deploy MapReduce. Unlike traditional frameworks like Hadoop, it uses AWS Lambda for execution and is “serverless” as a result. It was initially kicked off by AWS adding Lambda support for Go, but draws on experience I’ve had using Lambda tools like Zappa[1] and Serverless[2] in the past.

I think there are a lot of interesting applications for using function-as-a-service platforms as executors in data processing frameworks such as this.

If you’re interested more in the development/internals of this project, I wrote a blog post with more details: https://benjamincongdon.me/blog/2018/05/02/Introducing-Corra...

[1]: https://github.com/Miserlou/Zappa [2]: https://serverless.com/


I'd like to see your readme expanded with some figures:

* Processing speed -- that is, how long does it take to do that word count example on a nontrivial dataset? Something that takes hours on a local machine vs. minutes in MapReduce. Compare local to this to e.g. Hadoop or Google BigQuery or whatever viable alternative there is.

* Cost. I think that's probably the biggest factor here. I don't get the impression that Lambda was intended for big data or highly resource-, I/O-, or processing-intensive operations, but I'd love to be proven wrong.

* Actually, mostly just cost vs. performance.

I mean it's a neat idea but if the serverless benefit is outweighed by difficulty in setting up, cost, performance, etc compared to dedicated big data solutions it's going to stay a proof of concept.


Hi,

For a small map reduce load, say a terabyte (to replace a single MR node), how much would you estimate the aws cost would be?


Pricing depends a lot on how much memory your job requires[1] and how much processing each record requires -- i.e., the pricing is very sensitive to usage.

As a very rough estimate, for a light-to-medium load of 1 TB, the cost would probably be in the ballpark of ~$0.50. AWS's own reference MR framework[2] (which is mostly a tech demo) quotes prices of the same order of magnitude.
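For concreteness, here's the kind of back-of-the-envelope arithmetic behind a ballpark like that. The rates below are assumptions based on AWS's published Lambda pricing at the time (~$0.0000166667 per GB-second of compute, $0.20 per million requests), and the chunk size, duration, and memory figures are illustrative, not corral's actual defaults:

```go
package main

import "fmt"

// Assumed Lambda rates (check the pricing page for current numbers):
// compute is billed per GB-second of allocated memory, plus a per-request fee.
const (
	perGBSecond = 0.0000166667
	perRequest  = 0.20 / 1_000_000
)

// estimateCost returns a rough USD cost for a job split into `invocations`
// Lambda calls, each running for `seconds` at `memGB` of allocated memory.
func estimateCost(invocations int, seconds, memGB float64) float64 {
	compute := float64(invocations) * seconds * memGB * perGBSecond
	requests := float64(invocations) * perRequest
	return compute + requests
}

func main() {
	// 1 TB split into 64 MB chunks -> 16,384 map invocations.
	invocations := (1 << 40) / (64 << 20)
	// Suppose each chunk takes ~5 s at 512 MB of memory.
	fmt.Printf("%d invocations, ~$%.2f\n", invocations, estimateCost(invocations, 5, 0.5))
}
```

The takeaway is the shape of the formula: cost scales linearly with total GB-seconds, so doubling per-record processing time or memory doubles the bill, which is why processing-heavy jobs get expensive quickly.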

Corral isn't great for processing-heavy MR jobs, as Lambda pricing rises quickly if you need a lot of memory or take a lot of time with each record. But, for small-ish low-overhead jobs, it can pretty easily beat the pricing and hassle of using something like EMR.

[1]: https://aws.amazon.com/lambda/pricing/#Lambda_pricing_detail... [2]: https://github.com/awslabs/lambda-refarch-mapreduce/


Hi OP, I am the person that made the most recent changes to the AWS Labs refarch. I had been working on a Golang version and wanted to clean up the Python one. Sunil, the original author, used the AMPLab benchmark to calculate the results table. I was planning on updating it with the 1- and 5-node tests. Would be happy to include Corral as well.


Nice! Always thought this would be cool. A few questions, though: how do you get things like consistent hashing? Spark, for example, can shuffle data somewhat efficiently by sending data to the right node / getting data from the right node by the hash key, right? How, in a serverless, stateless world, do you call a specific serverless function instance? Assuming it can't be done, aren't you losing the performance gained by data locality? E.g., data has to be saved in a massive and efficient key-value store? Isn't that much slower than Spark's in-memory processing / data locality (bring the compute to the data and not vice versa)? Would love to see benchmarks on this. This is the future IMHO... well done.
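For reference, Hadoop-style frameworks answer "which reducer gets which key" with a deterministic partitioner rather than consistent hashing between live nodes: every mapper hashes each key the same way, writes its output into one file per partition, and reducer i reads every mapper's partition-i file out of the object store. That file exchange is the shuffle, and it's why locality is traded away. A toy sketch of the partitioner (illustrative only; corral's exact scheme may differ):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// partition assigns a key to one of n reduce partitions. Because every
// mapper uses the same deterministic hash, all values for a given key
// end up in the same reducer's input, with no node-to-node communication:
// in a serverless setting, mappers just write a file per partition to S3
// and reducer i reads every mapper's partition-i file.
func partition(key string, n int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(n))
}

func main() {
	for _, k := range []string{"apple", "banana", "apple"} {
		fmt.Printf("%s -> reducer %d\n", k, partition(k, 4))
	}
}
```

So there's no need to address a specific function instance; the "routing" is encoded in where the intermediate files land, at the cost of a round trip through storage instead of Spark's in-memory exchange.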


Nice! I've been thinking about doing this kind of thing for a while. I have long experimented with doing map reduce style work outside of things like Hadoop and gotten much better results due to being able to tune for different things much more quickly.


Hi.

How do you deal with the 5min (IIRC) execution time limit of Lambda ?


Yeah, max execution time (and max memory usage) are the main constraints of using Lambda.

Corral deals with this by splitting input data into small enough chunks that each chunk can be processed within the timeout -- I exposed options for setting the amount of data that each Lambda function has to process. However, if each data item requires more than 5 min of processing, then corral won't work for you.
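The splitting idea can be sketched like this (a simplified illustration, not corral's actual implementation, which also has to respect record boundaries):

```go
package main

import "fmt"

// chunk describes a byte range of the input that one Lambda invocation
// will process. Splitting by offset rather than copying data lets each
// function read just its slice of the input object.
type chunk struct{ offset, size int64 }

// splitInput divides totalSize bytes into chunks of at most maxChunk bytes,
// with maxChunk chosen so each chunk is processable well inside the timeout.
func splitInput(totalSize, maxChunk int64) []chunk {
	var chunks []chunk
	for off := int64(0); off < totalSize; off += maxChunk {
		size := maxChunk
		if off+size > totalSize {
			size = totalSize - off
		}
		chunks = append(chunks, chunk{off, size})
	}
	return chunks
}

func main() {
	// 1 GB of input at 64 MB per chunk yields 16 invocations.
	fmt.Println(len(splitInput(1<<30, 64<<20)))
}
```

Shrinking the chunk size trades more invocations (and per-request overhead) for a lower chance of any single invocation hitting the timeout.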

The "driver" that coordinates the Lambda functions runs locally (not in Lambda), so it doesn't have this constraint.


Nice!

The use of S3 ListObjects is an immediate deal breaker though: its eventual consistency can cause silent data corruption.

To avoid the List, you'd need to write a file manifest somewhere that contains a list of all S3 objects. If it were me, I'd use DynamoDB and append keys to a StringSet on a single item (if you use S3 for the manifest, it needs to be a single object, which means you need to aggregate the keys first, which sounds tricky with Lambda). You'll hit a scaling limit with DDB's item size limit; if you want to avoid that, perhaps writing an item per mapper with the same hash key and a different range key might be better. Then you'd do a strongly consistent query to reconstruct the manifest.


I'm a complete novice when it comes to distributed computing, so I'm not too sure I get the idea of a MapReduce framework. I'm gonna try to suss it out here as I understand it, and if someone could correct me where I'm wrong, that'd be awesome.

First off, when I see "map" and "reduce" I think of the functional programming/data processing equivalents of mapping, meaning to apply a function to every element in a set (like capitalizing strings or dividing everything by two or something) and reducing, meaning to iterate over a set, processing it and combining it with some accumulator (like taking a sum).

What a MapReduce framework seems to do is take these two functions and run them in parallel, splitting the data to take advantage of their independent nature. Data can be split however is convenient, because the map function doesn't need to worry about any data other than the element it's given, and can run in as many processes as you can manage. Any mapped data can be fed into parallel reduce processes, which can run in any order because the order of the data shouldn't matter.

All of that I get (although if I'm wrong, that might explain why I'm confused). I guess my main confusion is why the reduce function doesn't really fit with the idea that I just put forward. I would think that the reduce function would need some sort of "accumulator" input, and that you'd only get one thing as an output, as opposed to more files of data. Perhaps the idea is that the reduce is actually just any function that can only work on post-mapped data, or even the only one that's supposed to change state in some way?

Can anyone shed some light on my confusion? What is the reduce function actually supposed to do, if not what I just laid out?


> I would think that the reduce function would need some sort of "accumulator" input, and that you'd only get one thing as an output, as opposed to more files of data.

It may be easier to think of the reduce step more like a SQL GROUP BY rather than a function of a list. The map phase emits a bunch of (key, value) pairs, and all values with the same key are processed by the same reducer function (but each key gets a new reducer, modulo implementation details).

So in your paradigm, there are many reduce functions, each starting with a null accumulated value, resulting in many outputs rather than a single one.
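A single-process toy version of those semantics, using the word count example (illustrative Go with hypothetical names, not corral's API):

```go
package main

import (
	"fmt"
	"strings"
)

// kv is an emitted (key, value) pair.
type kv struct {
	key   string
	value int
}

// mapWords is the map function: for each word it emits the pair (word, 1).
func mapWords(line string) []kv {
	var out []kv
	for _, w := range strings.Fields(line) {
		out = append(out, kv{w, 1})
	}
	return out
}

// shuffle groups emitted values by key -- the "GROUP BY" step. Each key's
// slice of values is then handed to its own reduce call.
func shuffle(pairs []kv) map[string][]int {
	groups := make(map[string][]int)
	for _, p := range pairs {
		groups[p.key] = append(groups[p.key], p.value)
	}
	return groups
}

// reduceCount is the reduce function for one key: it folds that key's
// values with a fresh accumulator, so there is one output per key,
// not one output overall.
func reduceCount(values []int) int {
	sum := 0
	for _, v := range values {
		sum += v
	}
	return sum
}

func main() {
	for word, counts := range shuffle(mapWords("the cat and the dog")) {
		fmt.Println(word, reduceCount(counts))
	}
}
```

Here "the" reduces to 2 and every other word to 1, which is the per-key accumulator idea: many small reductions, each scoped to one key.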


Interesting. That actually makes a lot of sense. I've done a touch more reading, and I actually only just now realized that the "word count" example doesn't count all of the words in a document, but rather all occurrences of each word. That clears a lot of questions up for me.


Very different technical construction, but reminds me a bit of Joyent's Manta, which also lets you run map-reduce jobs over an object store without explicit server management: https://news.ycombinator.com/item?id=5939340

Source code: https://github.com/joyent/manta

I believe the cloud version has since been renamed to "Converged Analytics", so this is probably the same thing: https://www.joyent.com/triton/analytics


Why call it "serverless" if I need to provide stuff like:

AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION, AWS_TEST_BUCKET


Because you don't need to manage any servers.


Serverless, not configless.


Serverless doesn't mean there isn't a server—it just means you don't have to manage server instances. You can treat all of AWS as your server. Which means there must be a way to separate your work from everybody else's work who is sharing AWS with you. Thus the configuration.


buzzword compliance


Any time I read "serverless", I get a violent allergic reaction. There is no such thing, as software requires hardware to run on, and the person or persons who went "serverless" simply chose to stick their head(s) in the sand and punt the OS engineering and hardware design and maintenance off to someone else, hoping that it will just work. But it does not, and eventually there will be an outage and lost money. One can punt this responsibility off to someone else, but there will be consequences.

If you want reliable infrastructure, first you must become a master system, database and network administrator, then you must apprentice with a mentor to become a system engineer, and finally, after several decades of practicing as one, you will have enough experience and insight to become a system architect. There is no way around that, no punting will help.


> Any time I read "serverless", I get a violent allergic reaction. There is no such thing, as software requires hardware to run on,

The meanings of words morph over time. When developers mention serverless, everyone knows what it means in that context. Just like when someone says there is a bug in their code, no one thinks that there are roaches running around in their computer.

> and the person or persons who went "serverless" simply chose to stick their head(s) in the sand and punt the OS engineering and hardware design and maintenance off to someone else,

When I write a program, I'm not writing assembly language. I'm also "sticking my head in the sand" about how assembly works. AWS has a whole team of people that know how to do that stuff.

> hoping that it will just work. But it does not, and eventually there will be an outage and lost money. One can punt this responsibility off to someone else, but there will be consequences.

AWS is probably more reliable than what you could do on prem or at a colo.

> If you want reliable infrastructure, first you must become a master system, database and network administrator, then you must apprentice with a mentor to become a system engineer, and finally after several decades of practicing as one, you will have enough experience and insight to become a system architect. There is no way around that, no punting will help.

Tell that to Netflix. They host everything on AWS. They purposefully moved from an on prem architecture to AWS because they realized where their core competence was.


Netflix has an army of professional system and kernel engineers who optimize and engineer the OSes they run on; they just use AWS for virtual machine and datacenter capacity.


They use a lot of AWS services - not just virtual machines

- First they use Ubuntu Linux

- SES (Email)

- ElasticSearch (AWS has their own managed version)

- SQS (queueing system)

- S3 (storage)

- Auto Scaling

Most of the optimizations they do are standard things at scale, where they are tuning by measuring performance.

https://youtu.be/89fYOo1V2pA

and the slide deck from the presentation.

https://www.slideshare.net/brendangregg/how-netflix-tunes-ec...

But that's even more of a reason to choose AWS, Netflix has open sourced dozens of tools specifically related to AWS. You get to take advantage of their tools and knowledge.


So if anything, they are not running “serverless”, far from it: they really dig their hands into the OS and system engineering. Just recently they did kernel engineering on FreeBSD to push the networking as close to 100 Gbps per interface as possible. You picked a bad example to make your point, a point which I do not, from experience of doing this for several decades, recognize as valid anyway.

By the by, I’m that guy who runs a private datacenter in the basement, from designing one’s own rack mountable servers to crimping the network cables and running the fiber. Infrastructure is not something one should entrust to others, because these others cannot be trusted, as has been proven by AWS outages time and time again. And yes, I system engineer my own operating system as well. Well I did, it’s all been running automatically for years now.


You were making the point that people should run their own data centers. Netflix is not running their own data centers. Can you run your own data center in 90 areas around the world? Can you run your own globally distributed CDN? Your own globally distributed DNS? Are you saying that you never had any downtime? Can you stand up a replica of your object store, your caching and your database in both Europe and Asia so it can be closer to your outsourced developers and foreign customers? Does the data center in your basement allow me to set up subnets that span multiple locations (not whole AWS "Regions", but the separate physical locations within them, "Availability Zones") just in case your basement floods?

What if I don't want a replica of data, what if I want to duplicate my entire infrastructure in Asia so the developers there can have a clone of our infrastructure - databases, storage, VMs, load balancers, multiple availability zones, etc. How long would it take you to do that? I could do it with a JSON script and run a command.

How fast can you provision a half dozen load balancers and five or ten dedicated computers (not VMs, but "dedicated hosts")? I can do it by creating a JSON file and running one command from my terminal.

And why should I trust you to set up a more reliable, redundant network than AWS? Again, Netflix didn't trust themselves to create more reliable infrastructure and decided to trust a competitor to do the "undifferentiated heavy lifting", and the guy who led the transition is now a VP at AWS, so I think he knows something about infrastructure.


> You were making the point that people should run their own data centers.

I'm making the point that one can never write high quality software if one does not master system administration and then system engineering first. It's impossible to write high quality software without understanding the substrate on which software is built. That's my message.


It could easily be said that one cannot write quality software if they don't understand how compilers work, how to write assembly language, how processors are fabbed, etc.

How far down the rabbit hole do you want to go?

It's not a matter of "not understanding" how it works, it's a matter of focusing on your core competency and even if you know how to do something, it's about where you can add value and where is it best to outsource.

AWS has dozens of services across dozens of regions around the globe to handle infrastructure. Why waste time doing the "undifferentiated heavy lifting" that you can't do as well? No matter how good you think you are, you can't set up infrastructure as fast and as reliably as AWS.

How quickly could you set up duplicate data centers on opposite sides of the continent for disaster recovery and/or to reduce latency? I can do it by running a CloudFormation template.


> How far down the rabbit hole do you want to go?

All the way down to the hardware.


Can you please not do technical flamewars on HN? The value that should drive discussions here is intellectual curiosity (https://news.ycombinator.com/newsguidelines.html), not nitpicking or irritable fault-finding.


The topic of "serverless" is itself highly controversial; mentioning it here is akin to starting a flamewar. Consider the other side as well, not just one side.

My intent wasn't and isn't a flamewar. The topic is not one of "intellectual curiosity" either.


The OP certainly clears the bar for being intellectually interesting in HN's sense.

It's true that 'serverless' is a bit of a trigger word in technical discussions and people disagree about what it means, etc., but there are degrees of these things, and you stepped into several further degrees of flamewar. Please don't do that!


The term "serverless" means you don't need to manage the server resources. There is an "infinite" on-demand amount of resources that are available at request and they are available to run any process you want with a single command/hook/etc.

It's a misnomer, but it's no worse than "the cloud" or how "artificial intelligence" has come to mean anything to do with machine learning.


There is no such thing as infinite server resources.


>punt the OS engineering and hardware design and maintenance off to someone else

Isn't hiring other people who know better than you to do this kind of stuff kind of the point? Like, a lot of people's jobs are based on that idea, including almost everyone in the IT industry. I'm confused by your point. It almost looks like sarcasm. Getting some serious "Poe's Law" here.


One cannot design and write high quality software without thoroughly understanding the substrate underneath and how to engineer for it; from doing this for over 30 years, I’m telling you it’s impossible.


> Any time I read "serverless", I get a violent allergic reaction.

We're even, then, because I've got an allergy to this trivial, monotonous whinging about the by-now-well-understood meaning of the term "serverless". I'm not the only one.



