Hacker News new | comments | show | ask | jobs | submit login

> Our feeling isn't that Java is bad it's that it shouldn't be the only option for big data processing. There are a lot of people that want to do this type of computing and a lot of them hate Java.

JVM != Java. It runs a wide variety of languages, and this is not just theory - you can do Spark in Python right now. "Every language I care about" was perhaps overly flippant, but between Groovy, Clojure and Scala there's something to suit most programming tastes.

Meanwhile I hate Linux, and Docker doesn't have a good answer for that; the suggestions I've seen for developing with Docker on other platforms boil down to "run Linux in a VM".

> Could you give an example of what could go wrong here? Docker prevents you from talking to the outside world so you can't call out to an outside service in the middle of a job or anything like that.

I'm more worried about forward compatibility, standardization - and partly attack surface I guess. I can run jars from 15 years ago, on today's JVM - and not because it bundles an old version; I can link today's programs against 15-year-old libraries and they'll just work. If a library I built against has disappeared from the internet, that's fine.

Meanwhile I can't run linux binaries from 15 years ago (I did try with some of the loki games - maybe it's possible but it's decidedly nontrivial). Even if I found the right .sos, if I wanted to use a library with the ABI of ten years ago in my program, I'd have to rebuild the whole world. I guess I'm meant to rebuild my docker image, but that doesn't work if the library source is no longer available, and I'm not at all confident that the tooling does fully reproducible builds to the same extent as e.g. maven (from what I can see plenty of dockerfiles use unpinned versions).

Finally, the spec for the JVM is in one place, and it's published; apache harmony shows that a ground-up reimplementation is possible (that the TCK isn't open is very unfortunate, but the patents will expire sooner or later). I have a reasonable level of confidence that we'll always be able to run the jars we're producing today, without having to emulate x86s or what have you. Linux doesn't have a spec like that (yes the C code is available), and I don't think a ground-up reimplementation would be feasible.

> In Pachyderm job scheduling happens outside the user submitted containers. The job scheduler is what actually spins up the containers and pushes data through them. I'm afraid I don't understand what it would even mean for one of those containers to then do job scheduling itself.

Job scheduling was a poor example; I guess I'm thinking more things like e.g. running your own database on the cluster nodes, instead of talking to one outside the cluster via a proper interface. (Admittedly that's possible on hadoop with derby or h2).

Thinking about it more I guess it's less the service part and more the interface part that I'm worried about. A docker image could implement critical functionality with shell scripts. I bet they will. I've already spent too much of my life trying to get people to move away from shell scripts into something safer and more maintainable.

> between Groovy, Clojure and Scala there's something to suit most programming tastes

> the spec for the JVM is in one place, and it's published

Between Groovy, Clojure and Scala, there's something to suit most tastes on the scale of how much language specification there is. Scala has a published spec (along with many white papers on the R.I. internals), Clojure has its published spec in the comments in the source code, and Groovy actively changes the spec between versions to prevent addons and alternative implementations being built.

Sure, and I'd prefer people used the more specified options. But even if you have a Groovy library that you can't compile under modern Groovy, the bytecode will still run on new JVMs - more than you can say in the Docker case.

PySpark does not run Python code on the JVM. It uses py4j and it is slow as molasses. This has the benefit of supporting native Python libraries like NumPy. If you're not using such libraries you'd be better off using a JVM language, including Jython.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact