I think there are a lot of interesting applications for using function-as-a-service platforms as executors in data processing frameworks like this one.
If you’re interested more in the development/internals of this project, I wrote a blog post with more details: https://benjamincongdon.me/blog/2018/05/02/Introducing-Corra...
* Processing speed - that is, how long does the word-count example take on a nontrivial dataset? Something that takes hours on a local machine vs. minutes with MapReduce. Compare local execution to this, to e.g. Hadoop or Google BigQuery, or to whatever viable alternatives exist.
* Cost. I think that's probably the biggest factor here. I don't get the impression that Lambda was intended for big data or for highly resource-, I/O-, or processing-intensive operations, but I'd love to be proven wrong.
* Actually, mostly just cost vs performance.
I mean, it's a neat idea, but if the serverless benefit is outweighed by setup difficulty, cost, performance, etc. compared to dedicated big-data solutions, it's going to stay a proof of concept.
For a small MapReduce load, say a terabyte (to replace a single MR node), how much would you estimate the AWS cost would be?
As a very rough estimate, for a light-to-medium load of 1 TB, the cost would probably be in the ballpark of ~$0.50. AWS's own reference MR framework (which is mostly a tech demo) quotes prices of a similar order of magnitude.
Corral isn't great for processing-heavy MR jobs, as Lambda pricing rises quickly if you need a lot of memory or spend a lot of time on each record. But for small-ish, low-overhead jobs, it can pretty easily beat the pricing and hassle of something like EMR.
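As a sanity check on that ballpark figure, here's a back-of-the-envelope calculation. The chunk size, memory setting, and per-invocation runtime below are illustrative assumptions, not measurements from corral; the rates are Lambda's published pricing at the time:

```python
# Back-of-the-envelope Lambda cost for a 1 TB map phase.
# Assumed (illustrative, not measured): 1 GB of input per mapper,
# 512 MB of memory, ~60 s per invocation; Lambda's 2018 pricing.
GB_SECOND_PRICE = 0.00001667      # USD per GB-second of compute
REQUEST_PRICE = 0.20 / 1_000_000  # USD per invocation

invocations = 1000  # 1 TB split into 1 GB chunks
memory_gb = 0.5     # 512 MB allocated per function
duration_s = 60     # seconds per invocation

compute_cost = invocations * memory_gb * duration_s * GB_SECOND_PRICE
request_cost = invocations * REQUEST_PRICE
total = compute_cost + request_cost
print(f"${total:.2f}")  # roughly $0.50
```

Note how the per-request charge is negligible here; the bill is dominated by GB-seconds, which is why memory-hungry or slow per-record jobs blow the budget quickly.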
How do you deal with the 5-minute (IIRC) execution time limit of Lambda?
Corral deals with this by splitting the input data into chunks small enough that each can be processed within the timeout; I exposed options for setting how much data each Lambda function has to process. However, if a single data item requires more than 5 minutes of processing, corral won't work for you.
The "driver" that coordinates the Lambda functions runs locally (not in Lambda), so it doesn't have this constraint.
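The splitting idea can be sketched in a few lines. This is a hypothetical helper, not corral's actual API: given `(key, size)` pairs for the S3 input objects, it packs them into chunks no larger than a byte budget, so each Lambda invocation finishes well inside the timeout:

```python
# Sketch of size-based input splitting (hypothetical helper, not
# corral's real code): greedily pack S3 objects into chunks capped
# at max_chunk_bytes, one chunk per Lambda invocation.
def split_into_chunks(objects, max_chunk_bytes):
    chunks, current, current_size = [], [], 0
    for key, size in objects:
        if current and current_size + size > max_chunk_bytes:
            chunks.append(current)   # current chunk is full; start a new one
            current, current_size = [], 0
        current.append(key)
        current_size += size
    if current:
        chunks.append(current)       # flush the last partial chunk
    return chunks

objects = [("a.txt", 400), ("b.txt", 300), ("c.txt", 500), ("d.txt", 200)]
print(split_into_chunks(objects, 700))
# [['a.txt', 'b.txt'], ['c.txt', 'd.txt']]
```

A real implementation also has to split *within* large files (a single 10 GB object still can't go to one invocation), but the budget-per-invocation idea is the same.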
The use of S3 ListObjects is an immediate deal-breaker, though: its eventual consistency can cause silent data corruption. To avoid the List, you'd need to write a file manifest somewhere that contains a list of all the S3 objects. If it were me, I'd use DynamoDB and append keys to a StringSet on a single item (if you use S3 for the manifest, it needs to be a single object, which means you'd have to aggregate the keys first, which sounds tricky with Lambda). You'll eventually hit DynamoDB's item size limit; to avoid that, writing an item per mapper with the same hash key and a different range key might be better, and then you'd do a strongly consistent query to reconstruct the manifest.
First off, when I see "map" and "reduce" I think of the functional-programming/data-processing equivalents: mapping, meaning applying a function to every element in a set (like capitalizing strings or dividing everything by two), and reducing, meaning iterating over a set, processing each element and combining it with some accumulator (like taking a sum).
What a MapReduce framework seems to do is take these two functions and run them in parallel, splitting the data to take advantage of their independent nature. Data can be split however is convenient, because the map function doesn't need to worry about any data other than the element it's given, and it can run in as many processes as you can manage. Any mapped data can be fed into parallel reduce processes, which can run in any order because the order of the data shouldn't matter.
All of that I get (although if I'm wrong, that might explain why I'm confused). I guess my main confusion is why the reduce function doesn't really fit the idea I just put forward. I would think that the reduce function would need some sort of "accumulator" input, and that you'd only get one thing as output, as opposed to more files of data. Perhaps the idea is that reduce is actually just any function that can only work on post-mapped data, or even the only one that's supposed to change state in some way?
Can anyone shed some light on my confusion? What is the reduce function actually supposed to do, if not what I just laid out?
It may be easier to think of the reduce step more like a SQL GROUP BY rather than a function of a list. The map phase emits a bunch of (key, value) pairs, and all values with the same key are processed by the same reducer function (but each key gets a new reducer, modulo implementation details).
So in your paradigm, there are many reduce functions, each starting with a null accumulated value, resulting in many outputs rather than a single one.
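A minimal sketch of those semantics, using word count (the canonical example): map emits (key, value) pairs, the framework groups values by key (the "shuffle"), and a fresh reducer runs per key, each starting from its own empty accumulator.

```python
# Word count with GROUP-BY-style reduce semantics: one reducer
# invocation per key, each with its own accumulator starting at 0.
from collections import defaultdict

def mapper(line):
    for word in line.split():
        yield (word, 1)          # emit (key, value) pairs

def reducer(key, values):
    return (key, sum(values))    # fold just this key's values

lines = ["the quick fox", "the lazy dog"]

# "Shuffle" phase: route every value to its key's bucket.
groups = defaultdict(list)
for line in lines:
    for key, value in mapper(line):
        groups[key].append(value)

results = dict(reducer(k, vs) for k, vs in groups.items())
print(results)  # {'the': 2, 'quick': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

Because each key's reduce is independent, the per-key reducers are what the framework parallelizes, which is why the output is many (key, result) pairs rather than one accumulated value.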
Source code: https://github.com/joyent/manta
I believe the cloud version has since been renamed to "Converged Analytics", so this is probably the same thing: https://www.joyent.com/triton/analytics
AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION, AWS_TEST_BUCKET
If you want reliable infrastructure, first you must become a master system, database, and network administrator; then you must apprentice with a mentor to become a system engineer; and finally, after several decades of practicing as one, you will have enough experience and insight to become a system architect. There is no way around that; no punting will help.
The meanings of words morph over time. When developers mention serverless, everyone knows what it means in that context. Just like when someone says there is a bug in their code, no one thinks there are roaches running around in their computer.
and the person or persons who went "serverless" simply chose to stick their head(s) in the sand and punt the OS engineering and hardware design and maintenance off to someone else,
When I write a program, I'm not writing assembly language. I'm also "sticking my head in the sand" about how assembly works. AWS has a whole team of people who know how to do that stuff.
hoping that it will just work. But it does not, and eventually there will be an outage and lost money. One can punt this responsibility off to someone else, but there will be consequences.
AWS is probably more reliable than what you could do on-prem or at a colo.
Tell that to Netflix. They host everything on AWS. They purposefully moved from an on prem architecture to AWS because they realized where their core competence was.
- First they use Ubuntu Linux
- SES (Email)
- ElasticSearch (AWS has their own managed version)
- SQS (queueing system)
- S3 (storage)
Most of the optimizations they do are standard things at scale, where they tune by measuring performance.
and the slide deck from the presentation.
But that's even more of a reason to choose AWS: Netflix has open-sourced dozens of tools specifically related to AWS. You get to take advantage of their tools and knowledge.
By the by, I’m that guy who runs a private datacenter in the basement, from designing one’s own rack mountable servers to crimping the network cables and running the fiber. Infrastructure is not something one should entrust to others, because these others cannot be trusted, as has been proven by AWS outages time and time again. And yes, I system engineer my own operating system as well. Well I did, it’s all been running automatically for years now.
What if I don't want a replica of data, what if I want to duplicate my entire infrastructure in Asia so the developers there can have a clone of our infrastructure - databases, storage, VMs, load balancers, multiple availability zones, etc. How long would it take you to do that? I could do it with a JSON script and run a command.
How fast can you provision a half dozen load balancers and five or ten dedicated computers (not VMs, but "dedicated hosts")? I can do it by creating a JSON file and running one command from my terminal.
And why should I trust you to set up a more reliable, redundant network than AWS? Again, Netflix didn't trust themselves to create more reliable infrastructure and decided to trust a competitor to do the "undifferentiated heavy lifting", and the guy who led the transition is now a VP at AWS, so I think he knows something about infrastructure.
I'm making the point that one can never write high quality software if one does not master system administration and then system engineering first. It's impossible to write high quality software without understanding the substrate on which software is built. That's my message.
How far down the rabbit hole do you want to go?
It's not a matter of "not understanding" how it works; it's a matter of focusing on your core competency. Even if you know how to do something, it's about where you can add value and where it's best to outsource.
AWS has dozens of services across dozens of areas around the globe to handle infrastructure. Why waste time doing the "undifferentiated heavy lifting" that you can't do as well? No matter how good you think you are, you can't set up infrastructure as fast and as reliably as AWS.
How quickly could you setup duplicate data centers on opposite sides of the continent for disaster recovery and/or to reduce latency? I can do it by running a CloudFormation template.
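To make the "one template, one command" claim concrete, here's what a minimal template along those lines looks like; the resource names and parameter are illustrative, not a complete DR stack:

```json
{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Parameters": {
    "SubnetIds": { "Type": "List<AWS::EC2::Subnet::Id>" }
  },
  "Resources": {
    "AppBucket": { "Type": "AWS::S3::Bucket" },
    "AppLoadBalancer": {
      "Type": "AWS::ElasticLoadBalancingV2::LoadBalancer",
      "Properties": { "Subnets": { "Ref": "SubnetIds" } }
    }
  }
}
```

Deploying the same file to a second region is then a matter of re-running something like `aws cloudformation deploy --template-file stack.json --stack-name dr-site --region us-west-2` with that region's subnet IDs.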
All the way down to the hardware.
My intent wasn't and isn't a flamewar. The topic is not one of "intellectual curiosity" either.
It's true that 'serverless' is a bit of a trigger word in technical discussions and people disagree about what it means, etc., but there are degrees of these things, and you stepped into several further degrees of flamewar. Please don't do that!
It's a misnomer, but it's no worse than "the cloud" or how "artificial intelligence" has come to mean anything to do with machine learning.
Isn't hiring other people who know better than you to do this kind of stuff kind of the point? Like, a lot of people's jobs are based on that idea, including almost everyone in the IT industry. I'm confused by your point. It almost looks like sarcasm. Getting some serious "Poe's Law" here.
You're even, then, because I've got an allergy to this trivial, monotonous whinging about the by-now-well-understood meaning of the term "serverless". I'm not the only one.