*"and the real contribution from Google in this area was arguably GFS, not Map-R...

random3 · on June 26, 2014

First, I don't believe that Map-Reduce is not used inside Google anymore. Second, the map-reduce pattern is actually very useful basic, common computation. It's equivalent to the SQL JOIN and it's not something that you can really do without. Perhaps the large chunk (batch) approach is not ideal for many of the use-cases Map-Reduce is being tried with (e.g. like "interactive" querying with Pig or Hive). But that doesn't mean it's not useful. If you're optimizing for throughput you'll generally want to read/process in the batches optimized for some underlying sizes (it could be page size, it could be blocks, etc.).

Also a system is a lot more than just the compute framework. It needs to deal with various inputs/outputs, do scheduling, etc.

Between the distributed storage and the distributed processing, I'm not sure it's easy to decide which one could be more difficult, either.

Saying that Hadoop is a disaster, is not far from saying we live in an awful world. Working with them for years doesn't make me the most objective person, however, given the huge adoption, I'd say they may not be that bad. More, using something like Cloudera Manager makes it trivial (which sometimes makes me wonder why the vanilla version hasn't been improved...) (BTW there's QFS and other distributed file systems).

I wonder why is the Apache ecosystem horrible?

I get it that you don't like Java. Fair enough. What would be your language of choice for the next gen, stable distributed file system? Go, Rust, JavaScript?

wyager · on June 26, 2014

>What would be your language of choice for the next gen, stable distributed file system?

Here's my heavily biased subjective opinion on this entirely hypothetical software:

I think we should do one or both of two things:

A) Do it in very clean, fast, simple C. Put an emphasis on speed and simplicity.

B) Do it in very reliable, secure, simple Haskell. Put an emphasis on correctness and simplicity.

With some effort, the C one could be correct and the Haskell one could be fast.

I mention these two languages because they compile to native code and have very good cross-platform support. You won't have any trouble running either of these on embedded devices (which I can't say for Java or Go. Go has some weird compiler bugs on ARM platforms, and the JVM is frequently too memory intensive for embedded). C has an advantage of allowing the absolute minimal implementation, and Haskell has an advantage of allowing a massively concurrent implementation. Yada yada yada

Of course, it could be that the question is completely irrelevant. Just define a spec for a DFS, and then let different implementations pop up in whatever language is best suited to that implementation's specific details.

nl · on June 26, 2014

You won't have any trouble running either of these on embedded devices (which I can't say for Java or Go. Go has some weird compiler bugs on ARM platforms, and the JVM is frequently too memory intensive for embedded).

Why is this important in this use-case? If the DFS is being used for data processing then presumably the nodes are reasonably capable machines.

There may well be a difference use-case for a DFS for embedded and resource-constrained devices. That's not what Google or Hadoop is doing though.

vidarh · on June 26, 2014

The biggest limiting factor even in our relatively low-density populated rack is heat and power. With off the shelf servers and relatively low density, I can trivially exceed the highest power allocations our colo provider will normally allow per rack. The more power you waste on inefficient CPU usage, the less you can devote to putting more drives in.

nl · on June 26, 2014

The OP's claim is that memory is the limiting factor in the case of Java. I don't entirely agree, but even if I did it would almost certainly be a fixed overhead per machine, and unlikely to be a problem on server class machines.

Also, the read/processing characteristics of compute nodes often means the CPU is underutilized while filesystem operations are ongoing.

srean · on June 26, 2014

I will leave with an elliptical meta-comment, for those whose competitive advantage lies in others not getting it right, have little interest in correcting misconceptions. You might have interest in this anecdote https://news.ycombinator.com/item?id=7948170

nl · on June 26, 2014

But how much of that is Java, and how much is Hadoop?

Spark runs on the JVM, and much, much faster than Hadoop on similar workloads (Yes, I understand it isn't just doing Map/Reduce, but the point is that Java doesn't seem to be a performance limitation in itself).

srean · on June 26, 2014

Indeed and as I said it did surprise me that Hadoop was so much slower. But the buck really stops at resources consumed per dollar of usable results produced, and in that Java is going to consume a whole lot more. At large scales, running costs far exceeds development costs. BTW my point was not only about Java but also about your assessment of the hardware.

delian66 · on June 26, 2014

CPU and memory resources spend on an inefficient filesystem implementation are just wasted resources, not available for your workload. Keep in mind that the inefficiencies are multiplied over all your cluster nodes.

wyager · on June 26, 2014

There are many use cases for a DFS. Some of those cases involve relatively low-resource nodes.

nl · on June 26, 2014

I agree. But that's a difference use-case to what HDFS is designed for.

liquidcool · on June 26, 2014

I don't think large scale distributed file systems written in C are hypothetical. I'm pretty sure this is exactly what MapR has done - replace the Java-based HDFS with C, retaining the API. GlusterFS by Red Hat is another DFS.

fh973 · on June 26, 2014

As someone who is currently implementing a next-gen distributed file system, I can highlight one aspect: you have a lot of concurrency and asynchronous processing. Thus you need at least reference counting.

dasil003 · on June 26, 2014

Can you really do Haskell on embedded? I thought the far far abstraction away from memory as a concern made it pretty much a non-starter for the foreseeable future.

codygman · on June 26, 2014

According to Atom, you can even do "hard realtime embedded software":

http://hackage.haskell.org/package/atom

wyager · on June 26, 2014

Embedded meaning "ARM running an OS", yes. Embedded meaning "OS-less microcontroller", not so much. You'd have to use an embedded programming DSL for that, which isn't really ARM anymore.

zem · on June 26, 2014

ats will probably be an interesting best-of-both-worlds third option soon, though from what little I've seen of it it is currently harder to write code in than either haskell or c. but once you do put the work in to write your proofs etc. both correctness and speed should fall out naturally.

pedroo · on June 26, 2014

> It's equivalent to the SQL JOIN...

Don't you mean GROUP BY?

e12e · on June 26, 2014

I think parent meant equivalent in the sense of being "[a] very useful basic, common computation. (...) not something that you can really do without."

walshemj · on June 26, 2014

Fortran or PL/1G gets my vote writing Map Reduce in java is painful compared to PL/1G mappers I was writing back in the 80's on PR1ME mini computers.

shiven · on June 26, 2014

Wait, what? A distributed file system written in Fortran? Is it even possible in a language like that?

gtirloni · on June 26, 2014

Possibly yes, since it can interoperate with C. That's not what most people use it for (networking) so you would be swimming against the current.

I don't think it would attract a following among developers anyway... perhaps if you emphasize it compiles to Javascript, who knows ;)

walshemj · on June 26, 2014

no my bad thought that they where asking what would you use instead of java for he mapper/reducer code.

having said that at BT we did have a distributed ISAM database where the core code was F77!

dsymonds · on June 26, 2014

MapReduce is difficult to implement well. It's just ironic that the hard part is the part that's not in the name: the shuffle phase.

srean · on June 26, 2014

Heartily agreed ! Or at least Hadoop does it poorly. It is really that bad. I say this with hesitation and a lot of reluctance, its an open source project, I have not contributed anything towards it, so it is really unfair of me to complain.

Here is my personal experience with these tools at Google and at Yahoo (lightly edited from an old comment)

==

I have had the opportunity to try out Google's implementation of mapreduce implemented in C++ way back in time (6 years ago). These would run on fairly impoverished processors, essentially laptop grade of that time. Have done stuff on Yahoo's Hadoop setup as well, these used high end multicore machines provisioned with oodles of RAM. If I were to be generous, Hadoop ran 4 times slower as measured by wall clock times. Not only that, Hadoop required about 4 times more memory for similar sized jobs. So you ended up requiring more RAM, running for longer and potentially burning more electricity. This is by no means a benchmark or anything like that, just an anecdote.

That Hadoop would require much more memory did not surprise me, that was expected. What was really surprising was that it was so much slower.

Four times might not seem like much, but I was being generous to Hadoop. It makes a big difference when you can make multiple run through the data in a single day and make changes to the code/model. Debugging and ironing out issues is a lot more efficient when your iteration loop is shorter.

I think Hadoop (by virtue of its comparative crappiness) gave Google a significant competitive edge over the rest, probably still does.

timr · on June 26, 2014

A big part of implementing the shuffle correctly at a large scale is having a good distributed filesystem upon which to build.

dsymonds · on June 26, 2014

Sure. I didn't mean to imply a good distributed FS was unnecessary or easy. Just saying that MR isn't easy.

necubi · on June 26, 2014

Let me suggest the Quantcast File System (QFS) [0]. It's much closer to GFS as described in the paper (crucially it uses Reed-Solomon encoding to reduce storage requirements), it's highly tunable to different workloads, and it's written in C++. Quantcast uses it to store petabytes of data and for running map reduce. Unfortunately it hasn't seen much uptake outside of Quantcast, despite being a clear improvement over HDFS.

[0] http://quantcast.github.io/qfs/

(Disclaimer: I used to work for Quantcast).

jwr · on June 26, 2014

Reed-Solomon for forward error correction, to provide redundancy? But isn't Reed-Solomon really geared towards single-bit errors, while in the real world our storage tends to fail with multiple missing blocks?

I thought erasure codes were a much better approach.

necubi · on June 26, 2014

It's not used for tolerating disk errors (typically in a production context you have RAID for that, and failures tend to be for the entire disk). It's used to reduce storage requirements via striping. See the QFS paper (http://db.disi.unitn.eu/pages/VLDBProgram/pdf/industry/p808-...) for a good description of how this works. The basic idea is that with RS you can get 3x replication by splitting the data into 6 pieces stored on different servers, plus three parity blocks. This requires 1.5x storage rather than 3x while still tolerating the loss of any three machines of the nine.

erikano · on June 27, 2014

>[T]ypically in a production context you have RAID for that, and failures tend to be for the entire disk[.]

Does QFS run in addition to other file systems on the storage nodes or does it manage disks directly? You see, I was thinking that maybe ZFS + QFS might be a good idea and would like to know if it's possible. Also, is QFS available for FreeBSD and/or SmartOS storage nodes and clients? How about CoreOS and Debian, are storage nodes and clients available for those?

staunch · on June 26, 2014

I think MogileFS is closest to being what most people need. I'd love to see a Go version with all the lessons learned from that project.

https://code.google.com/p/mogilefs/

qohen · on June 26, 2014

What about Nokia's Disco project, which uses Erlang and Python?

http://discoproject.org/

From http://disco.readthedocs.org/en/develop/intro.html

What is Disco?

Disco is an implementation of mapreduce for distributed computing. Disco supports parallel computations over large data sets, stored on an unreliable cluster of computers, as in the original framework created by Google. This makes it a perfect tool for analyzing and processing large data sets, without having to worry about difficult technicalities related to distribution such as communication protocols, load balancing, locking, job scheduling, and fault tolerance, which are handled by Disco.

Disco can be used for a variety data mining tasks: large-scale analytics, building probabilistic models, and full-text indexing the Web, just to name a few examples.

Batteries included

The Disco core is written in Erlang, a functional language that is designed for building robust fault-tolerant distributed applications. Users of Disco typically write jobs in Python, which makes it possible to express even complex algorithms with very little code.

rdtsc · on June 26, 2014

I am more exited about a newer one --

LeoFS

http://leo-project.net/leofs/

all open source.

They started with S3 compatible API and are adding multi-data-center replication and NFS support.

EGreg · on June 26, 2014

Why s it difficult to build a distributed, petabyte scale filesystem? Isn't the search through indexes easily partitionable horizontally? Is it a problem of eventual consistency? I am not sure what the huge issue is and would like to learn.

vidarh · on June 26, 2014

It's "trivial" to build a distributed, petabyte scale filesystem.

It's hard to build a cost-effective, reliable and fast distributed, petabyte scale filesystem that's suitable for a wide range of workloads.

Consider that you need to minimise the amount of copies of data to keep costs reasonable, yet the fewer copies of data, the lower your IO capacity for accessing that data is (since readers/writers will content for the IO capacity of a small number of storage nodes), so you want to maximise the number of copies of data to maximise throughout. Yet the higher number of copies to maintain, the more IO it takes to spread each write out through your storage network. Soon enough you start running into "fun" problems such not being able to naively push writes out to each storage server it is meant to go to for data that needs to be replicated widely, because you'll be bandwidth constrained, but instead needing a fan-out even for simple writes.

You'll also want to minimise operational headaches; a disk going dead or an entire server failing needs to be handled transparently, as every additional disk or server you add increases the odds of a failure per any unit of tim.

(Compare with the naive approach for just a 1PB system: I can "easily" get about 200TB per off-the shelf storage server with hardware RAID. Lets say 150TB usable space; get about 14 of them to let you replicate stuff across two servers, and put GlusterFS on it. It'll work. It'll also be expensive, horribly slow for a number of workloads, and a regular disk replacement nightmare)

EGreg · on June 27, 2014

If you need to minimize the amount of copies then yes, you need to have some "risk management" software to estimate which machines are more reliable, and which files are more important, and then assign those files to enough replicas to be able to statistically guarantee some SLA. Then you need failover where at least one if the replicas is always available.

The routing table should be small enough to fit in RAM on every machine, and consulted for request. It would be updated when failover occurs. The table would consist of general rules with temporary exceptions for specific partition ranges that are being failed over.

You can store indexes in files, in a similar way. Just avoid joins and make like a graph database: first load documents from the index and then do mapreduce to get the related documents.

But besides that, I can see how maybe multi user concurrent access might necessitate eventual consistency algorithms for each app, but that's it.

gatehouse · on June 26, 2014

Well, one key element of the original map-reduce paper is the way the data is spread around. Instead of building a giant NAS with specialized (expensive) systems, and then building a bunch of specialized (expensive) compute systems, and then shipping massive quantities of data around on fast (expensive) network, the map-reduce system is built on a bunch of well balanced systems in terms of CPU/ram vs. disk, and the job is designed in a way that it can be distributed to these systems and data transfer is minimized.

So in a way, everything is happening in the storage nodes and they need to be much more than just a filesystem.

karamazov · on June 26, 2014

Dealing with frequent machine failure is one major issue, for example.

The original GFS and MapReduce papers (http://static.googleusercontent.com/media/research.google.co... and http://research.google.com/archive/mapreduce.html) go into detail.

ithkuil · on July 1, 2014

I just want to add to the other good replies:

* today's petabyte feels lighter than 10 years ago. It's not only disk space, it's bandwidth: from network to bus to ram to cpu.

* often a matter seems trivial when you have a proof that a given design actually works. But someone had to build it the first time and get it right; understand which of the many were important, invest and risk. You can find spectacular failures even with less unknowns.

* the devil is in the detail.

EdwardDiego · on June 26, 2014

> (ideally, one not written in a bloated language designed for remote controls and set-top boxes).

Oh cool, subjective language argument time!

mukundmr · on June 26, 2014

What about the Cassandra Distributed File System? It probably doesn't scale as much as HDFS though.

wyager · on June 26, 2014

>ideally, one not written in a bloated language designed for remote controls and set-top boxes

I don't know what you'd have to be smoking to use Java in a remote control.

riffraff · on June 26, 2014

java was designed for that

http://en.wikipedia.org/wiki/Oak_(programming_language)

e12e · on June 26, 2014

Indeed. And it runs today (on among other things) sim cards.

See eg: "Defcon 21 - The Secret Life of SIM Cards"

https://www.youtube.com/watch?v=31D94QOo2gY

wyager · on June 26, 2014

I know. My comment still stands.

tptacek · on June 27, 2014

It really doesn't, in that it doesn't actually build an argument of any sort.

wyager · on June 27, 2014

Didn't think I needed to! Seemed self-evident enough. How's this:

Java is a high-level language designed for heavy usage of dynamic allocation and related features (like GC). Remote controls, being very simple devices that benefit strongly from low power consumption, tend to be designed with low-resource microcontrollers. These devices are not generally capable of managing the entire java feature set. Therefore, to use java on a remote control, you must either use an unnecessarily complex microcontroller, or a crippled subset of Java, in which case you might as well use C or something.

tptacek · on June 27, 2014

Java is not "designed for heavy usage of dynamic allocation". You're confusing the language, the runtime, and the reference JVM. The HotSpot-derived reference JVM is certainly not an embedded design, but the reference JVM looks nothing like (for instance) Javacard.

wyager · on June 27, 2014

> Java is not "designed for heavy usage of dynamic allocation".

How, exactly, would you port the entire java core language to a platform without dynamic allocation? There exist subsets of java that work without it, but they're essentially not the same language at that point.

tptacek · on June 27, 2014

That's silly. A huge fraction of all C libraries rely on dynamic allocation, and so following your logic, C isn't an embedded programming language, despite being the lingua franca of embedded programming.

As for "how you'd 'port' Java to such an environment", again, look at Javacard.

wyager · on June 28, 2014

>A huge fraction of all C libraries rely on dynamic allocation, and so following your logic, C isn't an embedded programming language

That's not my logic at all. The C core language doesn't require dynamic allocation. In fact, the core language has no concept of it. All dynamic allocation comes from library functions, not language features. This is why we use C for embedded programming.

>As for "how you'd 'port' Java to such an environment", again, look at Javacard.

Again:

"However, many Java language features are not supported by Java Card (in particular types char, double, float and long; the transient qualifier; enums; arrays of more than one dimension; finalization; object cloning; threads). Further, some common features of Java are not provided at runtime by many actual smart cards (in particular type int, which is the default type of a Java expression; and garbage collection of objects)."

Java Card is almost nothing like Java as most of us know it.

tptacek · on June 28, 2014

Embedded programming is nothing like programming as people on HN know it. The dev who writes a native-code Markdown gem for Rails is going to be surprised at how different the experience of writing a SPI bus driver is.

So I don't find your argument very compelling. You have to do better than to point at how different an experience it is to code in an environment without object cloning and threats. That argument is almost tautological! You have to show how Javacard Java is fundamentally dissimilar as a language to Java. But almost the entire list of language features you cited here are absent because they don't make sense in the Javacard programming environment, not because they've been replaced with some other alien language concept.

At any rate: you were wrong to begin with when you scoffed at the idea of Java being used for remote controls, given that small consumer electronics were the original problem domain for the language that became Java, and you're wrong today, given that there are relatively popular and very successful small-form-factor embedded environments based on Java.

wyager · on June 29, 2014

>The dev who writes a native-code Markdown gem for Rails is going to be surprised at how different the experience of writing a SPI bus driver is.

Again, you're taking what I said and twisting the logic beyond recognition. No one uses Rails to do embedded programming. No one expects normal Rails and embedded Rails to be the same (because there is no embedded Rails).

Embedded C and "desktop" C are more or less exactly the same. They are the same language, in just about every way.

This is absolutely not true with standard Java and embedded Java subsets (like Java Card). There are huge differences in the language itself, like those I mentioned. Half the reason people use Java is the memory management features. Java minus these features is a fundamentally different language. Not to mention the lack of certain fundamental types (No floats, no multi-dimensional arrays, etc.) and other weird quirks of systems like Java Card.

> you were wrong to begin with when you scoffed at the idea of Java being used for remote controls, given that small consumer electronics were the original problem domain for the language that became Java,

Argumentum ad antiquitatem, or maybe argumentum ad auctoritatem (towards Sun). Just because Java was intended to be used for something does not mean it's any good at that thing.

>and you're wrong today, given that there are relatively popular and very successful small-form-factor embedded environments based on Java.

Argumentum ad populum. Just because a lot of people use some subset of Java for embedded programming doesn't mean it's a good idea. Lots of people use PHP too; it's not because it's a good thing to do; it's because PHP programmers are cheap. If I had to hazard a guess, that's the same reason people use Java in embedded environments.