This, a million times over. Map-reduce is not difficult to implement. Implementing a distributed, petabyte-scale filesystem to hold the data being accessed by thousands of workers is what's difficult.
It's just a shame that Hadoop (and HDFS) are what we're saddled with in the outside world. They're a total disaster area, from configuration difficulty to memory usage to speed to monitoring. But since HDFS is the only commonly used distributed FS, you're pretty much bound to Hadoop (and the rest of the horrible Apache ecosystem).
The world needs a good, stable, well-written distributed filesystem (ideally, one not written in a bloated language designed for remote controls and set-top boxes).
Also a system is a lot more than just the compute framework. It needs to deal with various inputs/outputs, do scheduling, etc.
Between the distributed storage and the distributed processing, I'm not sure it's easy to decide which one could be more difficult, either.
Saying that Hadoop is a disaster, is not far from saying we live in an awful world. Working with them for years doesn't make me the most objective person, however, given the huge adoption, I'd say they may not be that bad. More, using something like Cloudera Manager makes it trivial (which sometimes makes me wonder why the vanilla version hasn't been improved...) (BTW there's QFS and other distributed file systems).
I wonder why is the Apache ecosystem horrible?
Here's my heavily biased subjective opinion on this entirely hypothetical software:
I think we should do one or both of two things:
A) Do it in very clean, fast, simple C. Put an emphasis on speed and simplicity.
B) Do it in very reliable, secure, simple Haskell. Put an emphasis on correctness and simplicity.
With some effort, the C one could be correct and the Haskell one could be fast.
I mention these two languages because they compile to native code and have very good cross-platform support. You won't have any trouble running either of these on embedded devices (which I can't say for Java or Go. Go has some weird compiler bugs on ARM platforms, and the JVM is frequently too memory intensive for embedded). C has an advantage of allowing the absolute minimal implementation, and Haskell has an advantage of allowing a massively concurrent implementation. Yada yada yada
Of course, it could be that the question is completely irrelevant. Just define a spec for a DFS, and then let different implementations pop up in whatever language is best suited to that implementation's specific details.
Why is this important in this use-case? If the DFS is being used for data processing then presumably the nodes are reasonably capable machines.
There may well be a difference use-case for a DFS for embedded and resource-constrained devices. That's not what Google or Hadoop is doing though.
Also, the read/processing characteristics of compute nodes often means the CPU is underutilized while filesystem operations are ongoing.
Spark runs on the JVM, and much, much faster than Hadoop on similar workloads (Yes, I understand it isn't just doing Map/Reduce, but the point is that Java doesn't seem to be a performance limitation in itself).
Don't you mean GROUP BY?
having said that at BT we did have a distributed ISAM database where the core code was F77!
Here is my personal experience with these tools at Google and at Yahoo (lightly edited from an old comment)
I have had the opportunity to try out Google's implementation of mapreduce implemented in C++ way back in time (6 years ago). These would run on fairly impoverished processors, essentially laptop grade of that time. Have done stuff on Yahoo's Hadoop setup as well, these used high end multicore machines provisioned with oodles of RAM. If I were to be generous, Hadoop ran 4 times slower as measured by wall clock times. Not only that, Hadoop required about 4 times more memory for similar sized jobs. So you ended up requiring more RAM, running for longer and potentially burning more electricity. This is by no means a benchmark or anything like that, just an anecdote.
That Hadoop would require much more memory did not surprise me, that was expected. What was really surprising was that it was so much slower.
Four times might not seem like much, but I was being generous to Hadoop. It makes a big difference when you can make multiple run through the data in a single day and make changes to the code/model. Debugging and ironing out issues is a lot more efficient when your iteration loop is shorter.
I think Hadoop (by virtue of its comparative crappiness) gave Google a significant competitive edge over the rest, probably still does.
(Disclaimer: I used to work for Quantcast).
I thought erasure codes were a much better approach.
Does QFS run in addition to other file systems on the storage nodes or does it manage disks directly? You see, I was thinking that maybe ZFS + QFS might be a good idea and would like to know if it's possible. Also, is QFS available for FreeBSD and/or SmartOS storage nodes and clients? How about CoreOS and Debian, are storage nodes and clients available for those?
What is Disco?
Disco is an implementation of mapreduce for distributed computing. Disco supports parallel computations over large data sets, stored on an unreliable cluster of computers, as in the original framework created by Google. This makes it a perfect tool for analyzing and processing large data sets, without having to worry about difficult technicalities related to distribution such as communication protocols, load balancing, locking, job scheduling, and fault tolerance, which are handled by Disco.
Disco can be used for a variety data mining tasks: large-scale analytics, building probabilistic models, and full-text indexing the Web, just to name a few examples.
The Disco core is written in Erlang, a functional language that is designed for building robust fault-tolerant distributed applications. Users of Disco typically write jobs in Python, which makes it possible to express even complex algorithms with very little code.
all open source.
They started with S3 compatible API and are adding multi-data-center replication and NFS support.
It's hard to build a cost-effective, reliable and fast distributed, petabyte scale filesystem that's suitable for a wide range of workloads.
Consider that you need to minimise the amount of copies of data to keep costs reasonable, yet the fewer copies of data, the lower your IO capacity for accessing that data is (since readers/writers will content for the IO capacity of a small number of storage nodes), so you want to maximise the number of copies of data to maximise throughout. Yet the higher number of copies to maintain, the more IO it takes to spread each write out through your storage network. Soon enough you start running into "fun" problems such not being able to naively push writes out to each storage server it is meant to go to for data that needs to be replicated widely, because you'll be bandwidth constrained, but instead needing a fan-out even for simple writes.
You'll also want to minimise operational headaches; a disk going dead or an entire server failing needs to be handled transparently, as every additional disk or server you add increases the odds of a failure per any unit of tim.
(Compare with the naive approach for just a 1PB system: I can "easily" get about 200TB per off-the shelf storage server with hardware RAID. Lets say 150TB usable space; get about 14 of them to let you replicate stuff across two servers, and put GlusterFS on it. It'll work. It'll also be expensive, horribly slow for a number of workloads, and a regular disk replacement nightmare)
The routing table should be small enough to fit in RAM on every machine, and consulted for request. It would be updated when failover occurs. The table would consist of general rules with temporary exceptions for specific partition ranges that are being failed over.
You can store indexes in files, in a similar way. Just avoid joins and make like a graph database: first load documents from the index and then do mapreduce to get the related documents.
But besides that, I can see how maybe multi user concurrent access might necessitate eventual consistency algorithms for each app, but that's it.
So in a way, everything is happening in the storage nodes and they need to be much more than just a filesystem.
The original GFS and MapReduce papers (http://static.googleusercontent.com/media/research.google.co... and http://research.google.com/archive/mapreduce.html) go into detail.
* today's petabyte feels lighter than 10 years ago. It's not only disk space, it's bandwidth: from network to bus to ram to cpu.
* often a matter seems trivial when you have a proof that a given design actually works. But someone had to build it the first time and get it right; understand which of the many were important, invest and risk. You can find spectacular failures even with less unknowns.
* the devil is in the detail.
Oh cool, subjective language argument time!
I don't know what you'd have to be smoking to use Java in a remote control.
See eg: "Defcon 21 - The Secret Life of SIM Cards"
Java is a high-level language designed for heavy usage of dynamic allocation and related features (like GC). Remote controls, being very simple devices that benefit strongly from low power consumption, tend to be designed with low-resource microcontrollers. These devices are not generally capable of managing the entire java feature set. Therefore, to use java on a remote control, you must either use an unnecessarily complex microcontroller, or a crippled subset of Java, in which case you might as well use C or something.
How, exactly, would you port the entire java core language to a platform without dynamic allocation? There exist subsets of java that work without it, but they're essentially not the same language at that point.
As for "how you'd 'port' Java to such an environment", again, look at Javacard.
That's not my logic at all. The C core language doesn't require dynamic allocation. In fact, the core language has no concept of it. All dynamic allocation comes from library functions, not language features. This is why we use C for embedded programming.
>As for "how you'd 'port' Java to such an environment", again, look at Javacard.
"However, many Java language features are not supported by Java Card (in particular types char, double, float and long; the transient qualifier; enums; arrays of more than one dimension; finalization; object cloning; threads). Further, some common features of Java are not provided at runtime by many actual smart cards (in particular type int, which is the default type of a Java expression; and garbage collection of objects)."
Java Card is almost nothing like Java as most of us know it.
So I don't find your argument very compelling. You have to do better than to point at how different an experience it is to code in an environment without object cloning and threats. That argument is almost tautological! You have to show how Javacard Java is fundamentally dissimilar as a language to Java. But almost the entire list of language features you cited here are absent because they don't make sense in the Javacard programming environment, not because they've been replaced with some other alien language concept.
At any rate: you were wrong to begin with when you scoffed at the idea of Java being used for remote controls, given that small consumer electronics were the original problem domain for the language that became Java, and you're wrong today, given that there are relatively popular and very successful small-form-factor embedded environments based on Java.
Again, you're taking what I said and twisting the logic beyond recognition. No one uses Rails to do embedded programming. No one expects normal Rails and embedded Rails to be the same (because there is no embedded Rails).
Embedded C and "desktop" C are more or less exactly the same. They are the same language, in just about every way.
This is absolutely not true with standard Java and embedded Java subsets (like Java Card). There are huge differences in the language itself, like those I mentioned. Half the reason people use Java is the memory management features. Java minus these features is a fundamentally different language. Not to mention the lack of certain fundamental types (No floats, no multi-dimensional arrays, etc.) and other weird quirks of systems like Java Card.
> you were wrong to begin with when you scoffed at the idea of Java being used for remote controls, given that small consumer electronics were the original problem domain for the language that became Java,
Argumentum ad antiquitatem, or maybe argumentum ad auctoritatem (towards Sun). Just because Java was intended to be used for something does not mean it's any good at that thing.
>and you're wrong today, given that there are relatively popular and very successful small-form-factor embedded environments based on Java.
Argumentum ad populum. Just because a lot of people use some subset of Java for embedded programming doesn't mean it's a good idea. Lots of people use PHP too; it's not because it's a good thing to do; it's because PHP programmers are cheap. If I had to hazard a guess, that's the same reason people use Java in embedded environments.
> This morning, at their I/O Conference, Google revealed
> that they’re not using Map-Reduce to process data
> internally at all any more.
MapReduce doesn't work well for low-latency pipelines because it's got a high fixed overhead, but it's still the undisputed king of medium-latency and latency-insensitive workloads.
From today's I/O keynote video https://www.youtube.com/watch?v=wtLJPvx7-ys#t=9454
This is the exact quote:
"... and today even when you use map-reduce, which we invented over a decade ago, it's still cumbersome to write and maintain analytics pipelines, and if you want streaming analytics you are out of luck. And in most systems once you have more than a few petabytes they kind of break down. So we've done analytics at scale for awhile and we've learned a few things. FOR ONE, WE DON'T REALLY USE MAP-REDUCE ANYMORE. It's great for simple jobs but it gets too cumbersome as you build pipelines, and everything is an analytics pipeline."
Of course the word "really" in the middle of the sentence gives semantic wiggle room, but it's still a pretty big statement.
Since I would assume that any non-trivial service that Google provides is in that petabyte neighborhood it explains why he would say that Google isn't using MR anymore.
This guy may work for Google, but he's a clown.
Also, Urs Holzle is not a clown.
Back then 17x 750's would be roughly the same as one the 5k plus clusters that yahoo etal use.
We even sold the system to NZ telecom
(we had oxford street dug up for our 10MBs link)
> MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
> The "MapReduce System" (also called "infrastructure" or "framework") orchestrates by marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance.
> The name MapReduce originally referred to the proprietary Google technology but has since been genericized.
So it would be quite impossible to have a MapReduce system without distributed computing infrastructure; even if you were doing mapping and reducing, it wouldn't be MapReduce.
In the example GP gave, the data could possibly have been stored in a database queried using segmentation via consistent hashing (a basic way to distribute jobs across a known number of workers).
EDIT @supermatt Ah I see, we differ in the definition then, to me it isnt bigdata/largescale unless it churns through big amounts of stored data. Bitcoin mining is no where in the ball park of this, its an append only log of solutions computed in parallel.
'Big Data' means 'Big Data', not 'Big Storage'. They are completely different things.
You might be into HPC, but that's not what Sanjay and Jeff did. HPC and big data loads are quite different.
Big Data may incorporate data from various 3rd party, remote, local, or even random sources. For example, testing whether URLs in a search engines index are currently available. This may be a map/reduce job, it may utilize a local source of urls, but it will also incorporate a remote check of the url.
As I said a few links up: DFS is not a requirement for map/reduce.
I guess people just fixate on the terms map and reduce when the focus of MapReduce really was....shuffle.
The very start of the paper describes the term and it's methodology (which is what we are discussing), and then goes on to explain googles own implementation using GFS (which you seem to be getting hung up on.)
> Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages.
Inspired doesn't mean equivalent.
> Our use of a functional model with user specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.
They are using map and reduce as a tool to get something else.
> The major contributions of this work are a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs.
They are very specific about what the contribution is. All work that has claimed to be an implementation of MapReduce has followed their core tenants. Even if MPI has a reduce function, it is not MapReduce because it is based on other techniques.
I'm really tired of people who claim there is nothing new or even significant when there clearly was. Ya, everything is built on something these days, but so what? In the systems community, MapReduce has been a huge advance, and now we are moving on (at least for streaming).
But it was still useful: it was a good computing model for letting as many compute nodes as possible process data.
That might not be what Google was trying to achieve, but it's difficult to argue that it isn't MapReduce.
Our "Mappers" did quite a lot of work compared most modern map functions
We had to build a lot of the functionality that comes out of the box in more modern system like hadoop
Teradata's been doing map/reduce in their proprietary DBC 1012 AMP clusters since the 80's, providing analytical data warehousing for some of the world's largest companies. Walmart used them to globally optimize their inventory.
MPI systems have been supporting distributed map/reduce operations since the early 90's (see MPI_REDUCE).
I could run your workloads in Excel without breaking a sweat. But go on kidding yourself.
The point is that people are starting to use this term to describe something that it's not even technical anymore, let alone describe the actual amount of data: merely using data to drive decision making.
This is not a new thing , yet there is a clear trend that shows how this kind of dependency is shifting from being auxiliary to being generative; some of the reasons are:
1. cheaper computing and storage power
2. increased computing literacy among scientists and not.
3. increased availability of digitalised content in many areas that capture human behaviour.
When there's request, there's opportunity for business. One thing that is new and big about Big Data is the market. It should be called "Big Market (of data)".
It's an overloaded term. IMHO it's counterproductive to let the hype around Big Data as a business term pollute the discussion about what contribution Google and others have made in the field of computer science and data processing.
So what did Google really invent? Obviously the name and concept behind MapReduce wasn't new. Nor the fact that they did start to process large amounts of data.
Size and growth are two key factors here. Although it's possible that the NIH syndrome affected Google, it's possible that existing products just weren't able to solve those two requirements. It's difficult to tell exactly how large given that the Google is not very keen at releasing numbers, although it's possible to find some announcements like  "Google processed about 24 petabytes of data per day in 2009".
20P is 10000 times more that 200 T. Stop to think a moment what does 10000 mean.
It's enough to completely change the problem, almost any problem. A room full of people becomes a metropolis; an US annual low wage salary becomes 100 million dollars, more than the annual spending of Palau . Well, it's silly to make those comparison, but it's hard to think about anything that scaled by 10000 doesn't change profoundly.
Hell, this absurdly long post is well under 10k!
To stay in the realm of computer science, processor performance didn't increase by a factor of 10000 since PDP-11 from 1978 to Xeon from 2005 .
Working at that scale poses unique problems, and that's where real the contributions
to the advancement of the field made by the engineers and the engineering culture at Google are placed. If anything, just knowing it's possible
and having some accounts on what they focused on is inspiring.
This is the Big Data I care about. It's not about fanboyism. It's cool, it's real, it's rare. Arguing who invented the map reduce mechanics is like arguing that hierarchical filesystems where already there hence any progress made in that area by countless engineers is just trivial.
 Historical perspective: James Gleick , http://en.wikipedia.org/wiki/The_Information:_A_History,_a_T...
Hm, as a former Google engineer, that statement (from the journalist) is not accurate. Though the definition of map reduce is malleable, so it's hard to say what was meant in the first place.
Decoupled (compared to Hadoop) systems with distributed data and JIT processing?
To give you an example, data import speed has always been an issue with Hadoop - Facebook are quite proud that they can import 320TB of data into their Hadoop clusters in a day.
AFAIU, hadoop has already moved to support different workloads than MR when it introduced YARN
Giving this some thought, I think I see a parallell here to how most the rest of the nature works; the "components" that make up a creature appears to often be as simple as it can afford to be.
Also, why do you think developing for a map-reduce framework requires fewer engineering resources than developing for, e.g. a dataflow framework?