"and the real contribution from Google in this area was arguably GFS, not Map-Reduce"
This, a million times over. Map-reduce is not difficult to implement. Implementing a distributed, petabyte-scale filesystem to hold the data being accessed by thousands of workers is what's difficult.
It's just a shame that Hadoop (and HDFS) are what we're saddled with in the outside world. They're a total disaster area, from configuration difficulty to memory usage to speed to monitoring. But since HDFS is the only commonly used distributed FS, you're pretty much bound to Hadoop (and the rest of the horrible Apache ecosystem).
The world needs a good, stable, well-written distributed filesystem (ideally, one not written in a bloated language designed for remote controls and set-top boxes).
First, I don't believe that Map-Reduce is not used inside Google anymore.
Second, the map-reduce pattern is actually a very useful, basic, common computation. It's equivalent to the SQL JOIN and it's not something that you can really do without.
Perhaps the large chunk (batch) approach is not ideal for many of the use-cases Map-Reduce is being tried with (e.g. "interactive" querying with Pig or Hive). But that doesn't mean it's not useful. If you're optimizing for throughput you'll generally want to read/process in batches sized to match some underlying unit (it could be page size, it could be blocks, etc.).
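To make the JOIN comparison concrete, here is a minimal single-process sketch of a reduce-side inner join. It is only an illustration of the pattern, not a claim of strict equivalence, and the table contents and function names are made up for the example.

```python
from collections import defaultdict

def map_phase(users, orders):
    """Emit (join_key, tagged_record) pairs from both inputs."""
    for user_id, name in users:
        yield user_id, ("user", name)
    for user_id, item in orders:
        yield user_id, ("order", item)

def shuffle(pairs):
    """Group every emitted value under its key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """For each key, pair every user record with every order record."""
    for key in sorted(groups):
        names = [v for tag, v in groups[key] if tag == "user"]
        items = [v for tag, v in groups[key] if tag == "order"]
        for name in names:
            for item in items:
                yield key, name, item

# Hypothetical sample data: user 2 has no orders, order for user 3 has no user.
users = [(1, "ann"), (2, "bob")]
orders = [(1, "disk"), (1, "ram"), (3, "cpu")]
joined = list(reduce_phase(shuffle(map_phase(users, orders))))
```

Keys that appear on only one side produce no output rows, which is the inner-join behaviour.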
Also a system is a lot more than just the compute framework. It needs to deal with various inputs/outputs, do scheduling, etc.
Between the distributed storage and the distributed processing, I'm not sure it's easy to decide which one could be more difficult, either.
Saying that Hadoop is a disaster is not far from saying we live in an awful world. Working with them for years doesn't make me the most objective person; however, given the huge adoption, I'd say they may not be that bad. Moreover, using something like Cloudera Manager makes it trivial (which sometimes makes me wonder why the vanilla version hasn't been improved...) (BTW there's QFS and other distributed file systems).
I wonder, why is the Apache ecosystem horrible?
I get it that you don't like Java. Fair enough. What would be your language of choice for the next gen, stable distributed file system? Go, Rust, JavaScript?
>What would be your language of choice for the next gen, stable distributed file system?
Here's my heavily biased subjective opinion on this entirely hypothetical software:
I think we should do one or both of two things:
A) Do it in very clean, fast, simple C. Put an emphasis on speed and simplicity.
B) Do it in very reliable, secure, simple Haskell. Put an emphasis on correctness and simplicity.
With some effort, the C one could be correct and the Haskell one could be fast.
I mention these two languages because they compile to native code and have very good cross-platform support. You won't have any trouble running either of these on embedded devices (which I can't say for Java or Go. Go has some weird compiler bugs on ARM platforms, and the JVM is frequently too memory intensive for embedded). C has an advantage of allowing the absolute minimal implementation, and Haskell has an advantage of allowing a massively concurrent implementation. Yada yada yada
Of course, it could be that the question is completely irrelevant. Just define a spec for a DFS, and then let different implementations pop up in whatever language is best suited to that implementation's specific details.
> You won't have any trouble running either of these on embedded devices (which I can't say for Java or Go. Go has some weird compiler bugs on ARM platforms, and the JVM is frequently too memory intensive for embedded).
Why is this important in this use-case? If the DFS is being used for data processing then presumably the nodes are reasonably capable machines.
There may well be a different use-case for a DFS for embedded and resource-constrained devices. That's not what Google or Hadoop is doing though.
The biggest limiting factor even in our relatively low-density populated rack is heat and power. With off the shelf servers and relatively low density, I can trivially exceed the highest power allocations our colo provider will normally allow per rack. The more power you waste on inefficient CPU usage, the less you can devote to putting more drives in.
The OP's claim is that memory is the limiting factor in the case of Java. I don't entirely agree, but even if I did it would almost certainly be a fixed overhead per machine, and unlikely to be a problem on server class machines.
Also, the read/processing characteristics of compute nodes often mean the CPU is underutilized while filesystem operations are ongoing.
I will leave with an elliptical meta-comment: those whose competitive advantage lies in others not getting it right have little interest in correcting misconceptions. You might be interested in this anecdote: https://news.ycombinator.com/item?id=7948170
But how much of that is Java, and how much is Hadoop?
Spark runs on the JVM, and is much, much faster than Hadoop on similar workloads (Yes, I understand it isn't just doing Map/Reduce, but the point is that Java doesn't seem to be a performance limitation in itself).
Indeed, and as I said, it did surprise me that Hadoop was so much slower. But the buck really stops at resources consumed per dollar of usable results produced, and in that Java is going to consume a whole lot more. At large scales, running costs far exceed development costs. BTW my point was not only about Java but also about your assessment of the hardware.
CPU and memory resources spent on an inefficient filesystem implementation are just wasted resources, not available for your workload. Keep in mind that the inefficiencies are multiplied over all your cluster nodes.
I don't think large scale distributed file systems written in C are hypothetical. I'm pretty sure this is exactly what MapR has done - replace the Java-based HDFS with C, retaining the API. GlusterFS by Red Hat is another DFS.
As someone who is currently implementing a next-gen distributed file system, I can highlight one aspect: you have a lot of concurrency and asynchronous processing. Thus you need at least reference counting.
Can you really do Haskell on embedded? I thought its heavy abstraction away from memory as a concern made it pretty much a non-starter for the foreseeable future.
Embedded meaning "ARM running an OS", yes. Embedded meaning "OS-less microcontroller", not so much. You'd have to use an embedded programming DSL for that, which isn't really ARM anymore.
ATS will probably be an interesting best-of-both-worlds third option soon, though from what little I've seen of it, it is currently harder to write code in than either Haskell or C. But once you do put the work in to write your proofs etc., both correctness and speed should fall out naturally.
Heartily agreed! Or at least Hadoop does it poorly. It is really that bad. I say this with hesitation and a lot of reluctance: it's an open source project, I have not contributed anything towards it, so it is really unfair of me to complain.
Here is my personal experience with these tools at Google and at Yahoo (lightly edited from an old comment):
==
I have had the opportunity to try out Google's implementation of mapreduce implemented in C++ way back in time (6 years ago). These would run on fairly impoverished processors, essentially laptop grade of that time. Have done stuff on Yahoo's Hadoop setup as well, these used high end multicore machines provisioned with oodles of RAM. If I were to be generous, Hadoop ran 4 times slower as measured by wall clock times. Not only that, Hadoop required about 4 times more memory for similar sized jobs. So you ended up requiring more RAM, running for longer and potentially burning more electricity. This is by no means a benchmark or anything like that, just an anecdote.
That Hadoop would require much more memory did not surprise me, that was expected. What was really surprising was that it was so much slower.
Four times might not seem like much, but I was being generous to Hadoop. It makes a big difference when you can make multiple runs through the data in a single day and make changes to the code/model. Debugging and ironing out issues is a lot more efficient when your iteration loop is shorter.
I think Hadoop (by virtue of its comparative crappiness) gave Google a significant competitive edge over the rest, probably still does.
Let me suggest the Quantcast File System (QFS) [0]. It's much closer to GFS as described in the paper (crucially it uses Reed-Solomon encoding to reduce storage requirements), it's highly tunable to different workloads, and it's written in C++. Quantcast uses it to store petabytes of data and for running map reduce. Unfortunately it hasn't seen much uptake outside of Quantcast, despite being a clear improvement over HDFS.
Reed-Solomon for forward error correction, to provide redundancy? But isn't Reed-Solomon really geared towards single-bit errors, while in the real world our storage tends to fail with multiple missing blocks?
I thought erasure codes were a much better approach.
It's not used for tolerating disk errors (typically in a production context you have RAID for that, and failures tend to be for the entire disk). It's used to reduce storage requirements via striping. See the QFS paper (http://db.disi.unitn.eu/pages/VLDBProgram/pdf/industry/p808-...) for a good description of how this works. The basic idea is that with RS you can get fault tolerance on par with 3x replication by splitting the data into 6 pieces stored on different servers, plus three parity blocks. This requires 1.5x storage rather than 3x while still tolerating the loss of any three machines of the nine.
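The arithmetic behind the 1.5x figure can be checked with a few lines. This is just a back-of-the-envelope sketch; the function names are mine, and it relies only on the standard property that a Reed-Solomon layout with k data and m parity blocks survives the loss of any m blocks.

```python
def storage_overhead(data_blocks, parity_blocks):
    """Raw bytes stored per byte of user data."""
    return (data_blocks + parity_blocks) / data_blocks

def erasure_summary(data_blocks, parity_blocks):
    """Overhead and number of lost blocks an RS(k data, m parity) stripe survives."""
    return {
        "overhead": storage_overhead(data_blocks, parity_blocks),
        "tolerated_losses": parity_blocks,
    }

def replication_summary(copies):
    """Overhead and number of lost copies plain N-way replication survives."""
    return {"overhead": float(copies), "tolerated_losses": copies - 1}

rs = erasure_summary(6, 3)      # the 6+3 layout described above
rep = replication_summary(3)    # classic 3x replication

for label, summary in (("RS 6+3", rs), ("3x replication", rep)):
    print(label, summary)
```

So the 6+3 stripe stores half as many raw bytes as 3x replication while surviving one more lost machine.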
>[T]ypically in a production context you have RAID for that, and failures tend to be for the entire disk[.]
Does QFS run in addition to other file systems on the storage nodes or does it manage disks directly? You see, I was thinking that maybe ZFS + QFS might be a good idea and would like to know if it's possible. Also, is QFS available for FreeBSD and/or SmartOS storage nodes and clients? How about CoreOS and Debian, are storage nodes and clients available for those?
Disco is an implementation of mapreduce for distributed computing. Disco supports parallel computations over large data sets, stored on an unreliable cluster of computers, as in the original framework created by Google. This makes it a perfect tool for analyzing and processing large data sets, without having to worry about difficult technicalities related to distribution such as communication protocols, load balancing, locking, job scheduling, and fault tolerance, which are handled by Disco.
Disco can be used for a variety of data mining tasks: large-scale analytics, building probabilistic models, and full-text indexing the Web, just to name a few examples.
Batteries included
The Disco core is written in Erlang, a functional language that is designed for building robust fault-tolerant distributed applications. Users of Disco typically write jobs in Python, which makes it possible to express even complex algorithms with very little code.
Why is it difficult to build a distributed, petabyte scale filesystem? Isn't the search through indexes easily partitionable horizontally? Is it a problem of eventual consistency? I am not sure what the huge issue is and would like to learn.
It's "trivial" to build a distributed, petabyte scale filesystem.
It's hard to build a cost-effective, reliable and fast distributed, petabyte scale filesystem that's suitable for a wide range of workloads.
Consider that you need to minimise the number of copies of data to keep costs reasonable, yet the fewer copies of data, the lower your IO capacity for accessing that data is (since readers/writers will contend for the IO capacity of a small number of storage nodes), so you want to maximise the number of copies of data to maximise throughput. Yet the higher the number of copies to maintain, the more IO it takes to spread each write out through your storage network. Soon enough you start running into "fun" problems such as not being able to naively push writes out to each storage server they are meant to go to for data that needs to be replicated widely, because you'll be bandwidth constrained, and instead needing a fan-out even for simple writes.
You'll also want to minimise operational headaches; a disk going dead or an entire server failing needs to be handled transparently, as every additional disk or server you add increases the odds of a failure per any unit of time.
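To put rough numbers on "every additional disk increases the odds of a failure", here is a small sketch. It assumes disks fail independently and that each has the same probability of failing in the period you care about; the 2% figure is a made-up placeholder, not a measured rate.

```python
def p_any_failure(n_disks, p_single):
    """Probability that at least one of n_disks fails in the period,
    assuming independent failures with per-disk probability p_single."""
    return 1.0 - (1.0 - p_single) ** n_disks

# Hypothetical per-disk failure probability for the period: 2%.
odds = {n: p_any_failure(n, 0.02) for n in (1, 10, 100, 1000)}
for n, p in odds.items():
    print(f"{n:>5} disks: {p:.4f}")
```

Under those assumptions a failure somewhere in the cluster quickly becomes the normal case rather than the exception, which is why it has to be handled transparently.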
(Compare with the naive approach for just a 1PB system: I can "easily" get about 200TB per off-the-shelf storage server with hardware RAID. Let's say 150TB usable space; get about 14 of them to let you replicate stuff across two servers, and put GlusterFS on it. It'll work. It'll also be expensive, horribly slow for a number of workloads, and a regular disk replacement nightmare)
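The "about 14" comes straight out of the numbers in that parenthetical. A quick sketch of the sizing, assuming 1PB means 1000TB and taking the 200TB raw / 150TB usable figures at face value:

```python
import math

def servers_needed(target_tb, usable_tb_per_server, copies):
    """Servers required to hold `copies` full copies of target_tb of data."""
    return math.ceil(target_tb * copies / usable_tb_per_server)

TARGET_TB = 1000          # 1PB, decimal units assumed
RAW_TB_PER_SERVER = 200   # raw capacity per server, from the comment
USABLE_TB_PER_SERVER = 150
COPIES = 2                # replicate across two servers

n_servers = servers_needed(TARGET_TB, USABLE_TB_PER_SERVER, COPIES)
raw_tb = n_servers * RAW_TB_PER_SERVER
logical_tb = n_servers * USABLE_TB_PER_SERVER / COPIES

print(n_servers, raw_tb, logical_tb)
```

That is 2.8PB of raw disk to serve roughly 1PB of data, which is where the "expensive" part comes from.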
If you need to minimize the number of copies then yes, you need to have some "risk management" software to estimate which machines are more reliable, and which files are more important, and then assign those files to enough replicas to be able to statistically guarantee some SLA. Then you need failover where at least one of the replicas is always available.
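A toy version of that "risk management" step might look like the following. It assumes machine outages are independent (a big assumption in practice, since racks and power feeds fail together), and the machine names and unavailability figures are invented for the example.

```python
def pick_replicas(machine_unavailability, target_availability):
    """Choose the fewest machines, most reliable first, such that the
    probability of at least one replica being up meets the target.

    machine_unavailability: dict of machine name -> probability it is down.
    Assumes outages are independent of each other.
    """
    chosen = []
    p_all_down = 1.0
    ranked = sorted(machine_unavailability.items(), key=lambda kv: kv[1])
    for name, p_down in ranked:
        chosen.append(name)
        p_all_down *= p_down
        if 1.0 - p_all_down >= target_availability:
            return chosen
    raise ValueError("not enough machines to meet the target")

# Hypothetical fleet and per-machine unavailability estimates.
fleet = {"a": 0.05, "b": 0.01, "c": 0.02}
normal_file = pick_replicas(fleet, 0.999)
important_file = pick_replicas(fleet, 0.9999)
```

A more important file gets a stricter target and therefore more replicas; a real system would also have to account for correlated failures and placement constraints.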
The routing table should be small enough to fit in RAM on every machine, and consulted for each request. It would be updated when failover occurs. The table would consist of general rules with temporary exceptions for specific partition ranges that are being failed over.
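Roughly what I have in mind for that table, as a sketch. The class and method names are hypothetical, the general rule here is a plain modulo over the node list, and exceptions are half-open partition ranges that override it while a failover is in progress.

```python
import zlib

class RoutingTable:
    """General rule (partition modulo node count) plus temporary overrides."""

    def __init__(self, nodes, n_partitions=1024):
        self.nodes = list(nodes)
        self.n_partitions = n_partitions
        self.exceptions = []  # list of (start, end, node), end exclusive

    def partition_of(self, key):
        # crc32 is used only because it is stable across processes.
        return zlib.crc32(key.encode("utf-8")) % self.n_partitions

    def begin_failover(self, start, end, standby):
        """Route partitions in [start, end) to a standby node for now."""
        self.exceptions.append((start, end, standby))

    def end_failover(self, start, end):
        """Drop the override once the range is healthy again."""
        self.exceptions = [e for e in self.exceptions if (e[0], e[1]) != (start, end)]

    def lookup_partition(self, partition):
        for start, end, node in self.exceptions:
            if start <= partition < end:
                return node
        return self.nodes[partition % len(self.nodes)]

    def lookup(self, key):
        return self.lookup_partition(self.partition_of(key))

table = RoutingTable(["n0", "n1", "n2"], n_partitions=12)
before = table.lookup_partition(4)
table.begin_failover(3, 6, "spare")
during = table.lookup_partition(4)
table.end_failover(3, 6)
after = table.lookup_partition(4)
```

The table stays tiny because the exceptions list only holds ranges that are currently being failed over.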
You can store indexes in files, in a similar way. Just avoid joins and make like a graph database: first load documents from the index and then do mapreduce to get the related documents.
But besides that, I can see how maybe multi user concurrent access might necessitate eventual consistency algorithms for each app, but that's it.
Well, one key element of the original map-reduce paper is the way the data is spread around. Instead of building a giant NAS with specialized (expensive) systems, and then building a bunch of specialized (expensive) compute systems, and then shipping massive quantities of data around on fast (expensive) network, the map-reduce system is built on a bunch of well balanced systems in terms of CPU/ram vs. disk, and the job is designed in a way that it can be distributed to these systems and data transfer is minimized.
So in a way, everything is happening in the storage nodes and they need to be much more than just a filesystem.
* today's petabyte feels lighter than 10 years ago. It's not only disk space, it's bandwidth: from network to bus to ram to cpu.
* often a matter seems trivial when you have a proof that a given design actually works. But someone had to build it the first time and get it right; understand which of the many were important, invest and risk. You can find spectacular failures even with fewer unknowns.
Didn't think I needed to! Seemed self-evident enough. How's this:
Java is a high-level language designed for heavy usage of dynamic allocation and related features (like GC). Remote controls, being very simple devices that benefit strongly from low power consumption, tend to be designed with low-resource microcontrollers. These devices are not generally capable of managing the entire Java feature set. Therefore, to use Java on a remote control, you must either use an unnecessarily complex microcontroller, or a crippled subset of Java, in which case you might as well use C or something.
Java is not "designed for heavy usage of dynamic allocation". You're confusing the language, the runtime, and the reference JVM. The HotSpot-derived reference JVM is certainly not an embedded design, but the reference JVM looks nothing like (for instance) Javacard.
> Java is not "designed for heavy usage of dynamic allocation".
How, exactly, would you port the entire Java core language to a platform without dynamic allocation? There exist subsets of Java that work without it, but they're essentially not the same language at that point.
That's silly. A huge fraction of all C libraries rely on dynamic allocation, and so following your logic, C isn't an embedded programming language, despite being the lingua franca of embedded programming.
As for "how you'd 'port' Java to such an environment", again, look at Javacard.
>A huge fraction of all C libraries rely on dynamic allocation, and so following your logic, C isn't an embedded programming language
That's not my logic at all. The C core language doesn't require dynamic allocation. In fact, the core language has no concept of it. All dynamic allocation comes from library functions, not language features. This is why we use C for embedded programming.
>As for "how you'd 'port' Java to such an environment", again, look at Javacard.
Again:
"However, many Java language features are not supported by Java Card (in particular types char, double, float and long; the transient qualifier; enums; arrays of more than one dimension; finalization; object cloning; threads). Further, some common features of Java are not provided at runtime by many actual smart cards (in particular type int, which is the default type of a Java expression; and garbage collection of objects)."
Java Card is almost nothing like Java as most of us know it.
Embedded programming is nothing like programming as people on HN know it. The dev who writes a native-code Markdown gem for Rails is going to be surprised at how different the experience of writing a SPI bus driver is.
So I don't find your argument very compelling. You have to do better than to point at how different an experience it is to code in an environment without object cloning and threads. That argument is almost tautological! You have to show how Javacard Java is fundamentally dissimilar as a language to Java. But almost the entire list of language features you cited here is absent because those features don't make sense in the Javacard programming environment, not because they've been replaced with some other alien language concept.
At any rate: you were wrong to begin with when you scoffed at the idea of Java being used for remote controls, given that small consumer electronics were the original problem domain for the language that became Java, and you're wrong today, given that there are relatively popular and very successful small-form-factor embedded environments based on Java.
>The dev who writes a native-code Markdown gem for Rails is going to be surprised at how different the experience of writing a SPI bus driver is.
Again, you're taking what I said and twisting the logic beyond recognition. No one uses Rails to do embedded programming. No one expects normal Rails and embedded Rails to be the same (because there is no embedded Rails).
Embedded C and "desktop" C are more or less exactly the same. They are the same language, in just about every way.
This is absolutely not true with standard Java and embedded Java subsets (like Java Card). There are huge differences in the language itself, like those I mentioned. Half the reason people use Java is the memory management features. Java minus these features is a fundamentally different language. Not to mention the lack of certain fundamental types (No floats, no multi-dimensional arrays, etc.) and other weird quirks of systems like Java Card.
> you were wrong to begin with when you scoffed at the idea of Java being used for remote controls, given that small consumer electronics were the original problem domain for the language that became Java,
Argumentum ad antiquitatem, or maybe argumentum ad auctoritatem (towards Sun). Just because Java was intended to be used for something does not mean it's any good at that thing.
>and you're wrong today, given that there are relatively popular and very successful small-form-factor embedded environments based on Java.
Argumentum ad populum. Just because a lot of people use some subset of Java for embedded programming doesn't mean it's a good idea. Lots of people use PHP too; it's not because it's a good thing to do; it's because PHP programmers are cheap. If I had to hazard a guess, that's the same reason people use Java in embedded environments.
> This morning, at their I/O Conference, Google revealed
> that they’re not using Map-Reduce to process data
> internally at all any more.
This is incorrect, so monumentally so that I couldn't continue reading. It's as if the author had opened an article about climate change with "Now that advancing glaciers have rendered Algeria uninhabitable..."
MapReduce doesn't work well for low-latency pipelines because it's got a high fixed overhead, but it's still the undisputed king of medium-latency and latency-insensitive workloads.
"... and today even when you use map-reduce, which we invented over a decade ago, it's still cumbersome to write and maintain analytics pipelines, and if you want streaming analytics you are out of luck. And in most systems once you have more than a few petabytes they kind of break down. So we've done analytics at scale for awhile and we've learned a few things. FOR ONE, WE DON'T REALLY USE MAP-REDUCE ANYMORE. It's great for simple jobs but it gets too cumbersome as you build pipelines, and everything is an analytics pipeline."
emphasis mine
Of course the word "really" in the middle of the sentence gives semantic wiggle room, but it's still a pretty big statement.
But Urs also said, paraphrasing this time, that once you get into petabytes of information everything pretty much becomes streaming analytics.
Since I would assume that any non-trivial service that Google provides is in that petabyte neighborhood it explains why he would say that Google isn't using MR anymore.
How many big data jobs were being processed by MapReduce in the 70s, 80s, early 90s? Ya, that's right: none. Sanjay and Jeff were the first to apply the combination of map-shuffle-and-reduce as we know it today to big data processing.
British Telecom used map reduce in billing systems for the dialcom (telecom gold) platform in the 80's - that was on the largest (non black) prime minicomputer site in the UK.
Back then 17x 750's would be roughly the same as one of the 5k-plus clusters that Yahoo et al. use.
We used the normal file system (Prime's, probably descended from ITS) and had a load of JCL written in CPL (Prime JCL) to sync everything up over our Cambridge ring to two sites.
> MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
...
> The "MapReduce System" (also called "infrastructure" or "framework") orchestrates by marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance.
...
> The name MapReduce originally referred to the proprietary Google technology but has since been genericized.
So it would be quite impossible to have a MapReduce system without distributed computing infrastructure; even if you were doing mapping and reducing, it wouldn't be MapReduce.
How do you do distributed processing without a distributed filesystem? Do you mean you'd load the filesystem into memory and send it to the "processors"?
The data could be stored on a network device, such as a file server or database, for example. It could indeed be local, but it needn't be distributed.
In the example GP gave, the data could possibly have been stored in a database queried using segmentation via consistent hashing (a basic way to distribute jobs across a known number of workers).
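For anyone who hasn't seen it, a minimal consistent hashing ring for handing keys to workers looks something like this. It is a generic sketch, not a description of the system GP mentioned; the worker names, the use of MD5 and the number of virtual nodes are all arbitrary choices of mine.

```python
import bisect
import hashlib

def _point(label):
    """Map a string to a stable position on the ring."""
    digest = hashlib.md5(label.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

class HashRing:
    def __init__(self, workers, vnodes=64):
        # Each worker is placed on the ring at `vnodes` pseudo-random points.
        self._ring = sorted(
            (_point(f"{worker}#{i}"), worker)
            for worker in workers
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def worker_for(self, key):
        """The first worker point clockwise from the key owns the key."""
        idx = bisect.bisect_right(self._points, _point(key)) % len(self._ring)
        return self._ring[idx][1]

keys = [f"segment-{i}" for i in range(200)]
full = HashRing(["w0", "w1", "w2"])
shrunk = HashRing(["w0", "w1"])  # same ring with w2 removed

# Keys that did not belong to w2 keep their worker when w2 leaves.
moved = [k for k in keys
         if full.worker_for(k) != "w2" and full.worker_for(k) != shrunk.worker_for(k)]
```

The point of the ring over a plain `hash(key) % n` is that changing the worker count only reshuffles the keys owned by the worker that joined or left.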
...defeating the entire purpose of large scale parallelism on commodity machines. OTOH if you have a way of achieving order 500X parallelism with a centralized commodity server or database, I would love to hear.
EDIT @supermatt Ah I see, we differ in the definition then; to me it isn't bigdata/largescale unless it churns through big amounts of stored data. Bitcoin mining is nowhere in the ballpark of this; it's an append-only log of solutions computed in parallel.
How on earth do you think bitcoin mining pools work (as an extremely trivial example)? They coordinate ranges between a number of workers. The stored size of those ranges is minuscule in comparison to the data of the hashes between those ranges calculated on each 'miner'. These 'coordinators' absolutely work as a centralised 'commodity' storage server (or database) resource for 500x+ parallelism.
'Big Data' means 'Big Data', not 'Big Storage'. They are completely different things.
The bitcoin example may be a bit oversimplified, and may indeed lean more towards HPC. The example was intended to illustrate data locality (as per the parent question), not the actual computation.
Big Data may incorporate data from various 3rd party, remote, local, or even random sources. For example, testing whether URLs in a search engine's index are currently available. This may be a map/reduce job, it may utilize a local source of URLs, but it will also incorporate a remote check of the URL.
As I said a few links up: DFS is not a requirement for map/reduce.
All MapReduce frameworks I know about today are built on DFSs. There are definitely plenty of frameworks that support map and reduce that don't (e.g. MPI), but these aren't systems based on what was described in the OSDI 2004 paper where the word MapReduce was introduced.
I guess people just fixate on the terms map and reduce when the focus of MapReduce really was....shuffle.
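To show where the shuffle sits, here is a single-process word count with the partitioning step made explicit. It is a toy, not the paper's implementation: real systems do this across machines and sort on disk, and the function names and the crc32 partitioner are my own choices.

```python
import zlib
from collections import defaultdict

def map_words(line):
    """Map: emit (word, 1) for every word in a line."""
    for word in line.split():
        yield word.lower(), 1

def partition(key, n_reducers):
    """Decide which reducer owns a key; stable across processes."""
    return zlib.crc32(key.encode("utf-8")) % n_reducers

def shuffle(mapped_pairs, n_reducers):
    """Shuffle: route each pair to its reducer and group values by key."""
    buckets = [defaultdict(list) for _ in range(n_reducers)]
    for key, value in mapped_pairs:
        buckets[partition(key, n_reducers)][key].append(value)
    return buckets

def reduce_counts(bucket):
    """Reduce: sum the grouped values for each key in one reducer's bucket."""
    return {key: sum(values) for key, values in sorted(bucket.items())}

lines = ["the map the reduce", "the shuffle"]
pairs = [kv for line in lines for kv in map_words(line)]
buckets = shuffle(pairs, n_reducers=3)
results = [reduce_counts(bucket) for bucket in buckets]
totals = {key: count for part in results for key, count in part.items()}
```

Map and reduce are each a couple of lines; the routing and grouping in the middle is the part that turns into network and disk traffic at scale.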
I think the problem is that we are talking about two different things.
The very start of the paper describes the term and its methodology (which is what we are discussing), and then goes on to explain Google's own implementation using GFS (which you seem to be getting hung up on.)
Keep in mind that this whole thread is about "MapReduce", which Holzle was talking about, not the more generic map and reduce that has been around since the 1800s (and they will continue mapping and reducing in their new dataflow framework, they just won't be using MapReduce). Now for the paper:
> Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages.
Inspired doesn't mean equivalent.
> Our use of a functional model with user specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.
They are using map and reduce as a tool to get something else.
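The re-execution idea from that quote is easy to sketch: because a map task is a pure function of its input chunk, a failed attempt can simply be run again. The following is a toy illustration with a simulated flaky worker; the names and the retry limit are mine, not the paper's.

```python
def run_with_reexecution(task, chunk, max_attempts=3):
    """Run task(chunk), re-executing from scratch if an attempt crashes.

    Safe only because the task has no side effects: a retry sees the same
    input and produces the same output.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task(chunk), attempt
        except RuntimeError as err:
            last_error = err
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error

class FlakyWorker:
    """Hypothetical worker that crashes a fixed number of times, then works."""

    def __init__(self, failures):
        self.failures = failures

    def __call__(self, chunk):
        if self.failures > 0:
            self.failures -= 1
            raise RuntimeError("simulated worker crash")
        return sum(chunk)

result, attempts = run_with_reexecution(FlakyWorker(failures=2), [1, 2, 3])
```

No checkpointing or state recovery is needed, which is the "something else" the functional model buys.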
> The major contributions of this work are a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs.
They are very specific about what the contribution is. All work that has claimed to be an implementation of MapReduce has followed their core tenets. Even if MPI has a reduce function, it is not MapReduce because it is based on other techniques.
I'm really tired of people who claim there is nothing new or even significant when there clearly was. Ya, everything is built on something these days, but so what? In the systems community, MapReduce has been a huge advance, and now we are moving on (at least for streaming).
I'm still in the camp of there being nothing new here. Now gfs may be a different matter, but that was part of a different paper, and not a requirement of this one. Which is why I have kept stating that a dfs is not a requirement.
If that's what you believe, then you are going to miss out on the last 10 or so years of systems research and improvements. And when Google stops using MapReduce but the new thing still uses map and reduce, you are going to be kind of confused.
I've seen MapReduce done against fairly significant amounts of data stored (10s of TBs per run) on a SAN running over fibre. The compute nodes weren't particularly cheap either - I guess they were commodity machines, but quite a long way from the "cheapest possible" things Google uses.
But it was still useful: it was a good computing model for letting as many compute nodes as possible process data.
That might not be what Google was trying to achieve, but it's difficult to argue that it isn't MapReduce.
Databases? We should be so lucky :-) This was old school ISAM files updated with Fortran 77 and 4 different log files, all with multiple types of records.
Our "Mappers" did quite a lot of work compared to most modern map functions.
I don't know about Mr Holzle but you're wrong about map/reduce. I'm aware of two significant counterexamples. I'm sure there are others.
Teradata's been doing map/reduce in their proprietary DBC 1012 AMP clusters since the 80's, providing analytical data warehousing for some of the world's largest companies[1]. Walmart used them to globally optimize their inventory.
MPI systems have been supporting distributed map/reduce operations since the early 90's (see MPI_REDUCE[2]).
I see the crazies are out trying to redefine MapReduce as just being map and reduce and completely missing the point. But whatever, they've probably never seen big data loads and are definitely not involved in the industry.
There's certainly a hype around big data nowadays, often even up to the point of being ridiculous.
The point is that people are starting to use this term to describe something that's not even technical anymore, let alone describe the actual amount of data: merely using data to drive decision making.
This is not a new thing [0], yet there is a clear trend that shows how this kind of dependency is shifting from being auxiliary to being generative; some of the reasons are:
1. cheaper computing and storage power
2. increased computing literacy among scientists and non-scientists.
3. increased availability of digitalised content in many areas that capture human behaviour.
When there's request, there's opportunity for business. One thing that is new and big about Big Data is the market. It should be called "Big Market (of data)".
It's an overloaded term. IMHO it's counterproductive to let the hype around Big Data as a business term pollute the discussion about what contribution Google and others have made in the field of computer science and data processing.
So what did Google really invent? Obviously the name and concept behind MapReduce wasn't new. Nor the fact that they did start to process large amounts of data.
Size and growth are two key factors here. Although it's possible that NIH syndrome affected Google, it's also possible that existing products just weren't able to meet those two requirements. It's difficult to tell exactly how large, given that Google is not very keen on releasing numbers, although it's possible to find some announcements like [1] "Google processed about 24 petabytes of data per day in 2009".
20P is 10000 times more than 200 T. Stop and think for a moment about what 10000 means.
It's enough to completely change the problem, almost any problem. A room full of people becomes a metropolis; a US annual low-wage salary becomes 100 million dollars, more than the annual spending of Palau [2]. Well, it's silly to make those comparisons, but it's hard to think of anything that doesn't change profoundly when scaled by 10000.
Hell, this absurdly long post is well under 10k!
To stay in the realm of computer science, processor performance didn't increase by a factor of 10000 since PDP-11 from 1978 to Xeon from 2005 [3].
Working at that scale poses unique problems, and that's where the real contributions to the advancement of the field made by the engineers and the engineering culture at Google are placed. If anything, just knowing it's possible and having some accounts of what they focused on is inspiring.
This is the Big Data I care about. It's not about fanboyism. It's cool, it's real, it's rare. Arguing who invented the map reduce mechanics is like arguing that hierarchical filesystems were already there, hence any progress made in that area by countless engineers is just trivial.
Though it seems to be quoting Urs accurately enough. All I can guess is that he meant MR isn't used for some particular context. MR is most definitely still heavily used inside Google.
Projects like Apache Spark have demonstrated the power of a more complex DAG (Directed Acyclic Graph) approach that allows for more precise control over the data-processing flow, compared with the simpler execution model of M/R. All of the major Hadoop vendors are pivoting, and simultaneously adding support for Spark (which can work with the Hadoop stack, but is not part of it), while also supporting the development of one or more technologies that are trying to retrofit Hadoop into a more powerful model, such as Tez, from Hortonworks.
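As a rough picture of what "a more complex DAG" means compared with a fixed map-then-reduce pair, here is a tiny stage graph executed in dependency order. This has nothing to do with Spark's or Tez's actual APIs; the stage names and the `run_dag` helper are invented, and it uses the standard-library `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter  # Python 3.9+

def run_dag(stages):
    """Run stages in dependency order.

    stages: name -> (list of upstream stage names, function of their results).
    """
    graph = {name: set(deps) for name, (deps, _) in stages.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        deps, fn = stages[name]
        results[name] = fn(*(results[dep] for dep in deps))
    return results

# One input feeding two independent branches that are merged at the end,
# a shape a single map/reduce pass cannot express directly.
stages = {
    "load":   ([], lambda: [3, 1, 4, 1, 5]),
    "evens":  (["load"], lambda xs: [x for x in xs if x % 2 == 0]),
    "odds":   (["load"], lambda xs: [x for x in xs if x % 2 == 1]),
    "report": (["evens", "odds"],
               lambda e, o: {"even_sum": sum(e), "odd_sum": sum(o)}),
}
results = run_dag(stages)
```

With plain M/R you would chain several jobs and write intermediate results out between them; a DAG engine can see the whole graph and plan around it.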
'The company stopped using the system “years ago.”'
Hm, as a former Google engineer, that statement (from the journalist) is not accurate. Though the definition of map reduce is malleable, so it's hard to say what was meant in the first place.
The offered improvement is, I guess, to process Google-sized 'Big Data' without the significant issues of Hadoop-style clusters. Beyond that, there aren't enough details.
To give you an example, data import speed has always been an issue with Hadoop - Facebook are quite proud that they can import 320TB of data into their Hadoop clusters in a day.
So what's replacing it in the open source world? What are some real world problems Map Reduce is being used for now that some other solution is better suited for?
A lot of problems people are solving with Hadoop and friends are best solved by getting a big box with lots of ram and lots of SSDs. Current generation dual processor socket server boards go up to 1 TB of ram, a few years ago that kind of ram would require several servers; and the Haswell EP Xeons, expected towards the end of the year will support even more ram.
>[P]aradoxically, it’s easier to build robust, fault-tolerant systems from unreliable components[.]
Giving this some thought, I think I see a parallel here to how most of the rest of nature works; the "components" that make up a creature often appear to be as simple as they can afford to be.
So they're not using MapReduce, a particular implementation of the map/reduce concept. But are they still using the concept having just changed their implementation? I've read the article and this is still unclear to me; I think commentators are conflating the two.
Map-reduce is a brute force approach. It's obvious that intelligent, specific data handling strategies are orders of magnitude better for production systems if there's the engineering power to pull it off.
I'd say it teaches the current generation. IME the next generation tend to repeat the mistakes of their predecessors much more easily than our industry should be happy with.