Being more explicitly Linux-centric and dropping the JVM paves the way for major performance optimizations. In practice, virtually everyone runs these kinds of applications on Linux, so portability is not a significant concern. Data center efficiency is becoming a major concern, in terms of both CapEx and OpEx, and the basic Hadoop stack can be exceedingly suboptimal in this regard. Free software is not cost-effective if it requires 10-100x the hardware (a realistic figure) of a more optimal implementation.
However, I would use a different architecture generally rather than reimplementing the Hadoop one. The Achilles' heel of Hadoop (and Spark) is its modularity. Throughput for database-like workloads, which match a lot of modern big data applications, is completely dependent on tightly scheduling execution, network, and storage operations without any hand-offs to other subsystems. If the architecture requires network and storage I/O to be scheduled by other processes, performance falls off a cliff for well-understood reasons (see also: mmap()-ed databases). People know how to design I/O-intensive data processing engines; it was just outside the original architectural case for Hadoop. We can do better.
A properly designed server kernel running on cheap, modern hardware and a 10 GbE network can do the following concurrently on the same data model:
- true stream/event processing on the ingest path at wire speed (none of that "fast batch" nonsense)
- drive that ingest stream all the way through indexing and disk storage with the data fully online
- execute hundreds of concurrent ad hoc queries on that live data model
These are not intrinsically separate systems; we've just traditionally designed data platforms to be single-purpose. It does, however, require a single scheduler across the complete set of operations needed to deliver that workload.
From a software engineering standpoint it is inconvenient that good database kernels are essentially monolithic and incredibly dense, but that is unavoidable if performance matters.
Long term we don't want to mandate either of these technologies. We'd like to offer users a variety of job formats (docker and rocket come to mind; more domain-specific things like SQL are interesting as well) and a variety of storage options (zfs, overlayfs, aufs, and in-memory, to name a few). However, we had to be pragmatic early on and pick what we thought were the highest-leverage implementations so that we could ship something that worked.
We'll certainly be looking into getting rid of this mandate in the near future; we like giving people a variety of choices.
> Contrast this with Pachyderm MapReduce (PMR) in which a job is an HTTP server inside a Docker container (a microservice). You give Pachyderm a Docker image and it will automatically distribute it throughout the cluster next to your data. Data is POSTed to the container over HTTP and the results are stored back in the file system.
Now I'm not even sure anymore whether they are trying to troll people or if they are that out of touch with reality.
- No focus on trying to take on the real issues, but working around a perceived problem which would have been a non-issue after 5 minutes of research? Check.
- Trying to improve things by looking at bad ideas which weren't even state-of-the-art ten years ago? Check.
- Being completely clueless about the real issues in distributed data processing? Check.
- Whole thing reads like some giant buzzword-heavy bullshit bingo? Check.
- "A git-like distributed file system for a Dockerized world." ... Using Go? ... Please just kill me.
If I ever leave the software industry, it's because of shit like this.