Hacker News new | comments | show | ask | jobs | submit login

The article makes many good points but also misses on a few in my opinion. Hadoop was designed to solve a pretty narrow problem a long time ago; modern use cases tend to be far outside its intended design envelope.

Being more explicitly Linux-centric and dropping the JVM paves the way for some major performance optimizations. In practice, virtually everyone uses Linux for these kinds of applications so portability is not a significant concern. Efficiency in the data center is becoming a major concern, both in terms of CapEx and OpEx, and the basic Hadoop stack can be exceedingly suboptimal in this regard. Free software is not cost effective if you require 10-100x the hardware (realistic) of a more optimal implementation.

However, I would use a different architecture generally rather than reimplementing the Hadoop one. The Achille's Heel of Hadoop (and Spark) is its modularity. Throughput for database-like workloads, which matches a lot of modern big data applications, is completely dependent on tightly scheduling execution, network, and storage operations without any hand-offs to other subsystems. If the architecture requires network and storage I/O to be scheduled by other processes, performance falls off a cliff for well-understood reasons (see also: mmap()-ed databases). People know how to design I/O intensive data processing engines, it was just outside the original architectural case for Hadoop. We can do better.

A properly designed server kernel running on cheap, modern hardware and a 10 GbE network can do the following concurrently on the same data model:

- true stream/event processing on the ingest path at wire speed (none of that "fast batch" nonsense)

- drive that ingest stream all the way through indexing and disk storage with the data fully online

- execute hundreds of concurrent ad hoc queries on that live data model

These are not intrinsically separate systems, it is just that we've traditionally designed data platforms to be single purpose. However, it does require a single scheduler across the complete set of operations required in order to deliver that workload.

From a software engineering standpoint it is inconvenient that good database kernels are essentially monolithic and incredibly dense but it is unavoidable if performance matters.

Hi. Co-founder and lead Pachyderm dev here. This is a really interesting comment and raises some questions that I haven't fully considered. Building the infrastructure in this decoupled way has let us move quickly early on and made it really easy to scale out horizontally. We haven't paid as much attention to what you talk about here. However I think there might be some optimizations we can make within our current design that would help a lot. Probably the biggest source of slowness right now is that we post files in to jobs via HTTP, this is nice because it gives you a simple API for implementing new jobs but it's reaaally slow. A much more performant solution would be to give the data directly to the job as a shared volume (a feature Docker offers) this would lead to very performant i/o because the data could be read directly off disk by the container that's doing the processing.

Andrew knows this stuff better than I do but one way to start thinking about this is to look at what the sources of latency are, where your data gets copied, and what controls allocation of storage (disk/RAM/etc). For example mmaped disk IO cedes control of the lifetime of your data in RAM to the OS's paging code. JSON over HTTP over TCP is going to make a whole pile of copies of your data as the TCP payloads get reassembled, copied into user space, then probably again as that gets deserialized (and oh the allocations in most JSON decoders). As for latency you're going to have some context switches in there which is not helpful. One way you might be able to improve performance is to use shared memory to get data in and out of the workers and process it in-place as much as possible. A big ring buffer and one of the fancy in-place serialization formats (capnproto for example) could actually make it pretty pleasant to write clients in a variety of languages.

Thanks for the tips :D. These all seem like very likely places to look for low hanging fruit. We were actually early employees at RethinkDB so we've been around the block with low level optimizations like this. I'm really looking forward to get Pachyderm in a benchmark environment and tearing it to shreds like we did at Rethink.

So, I'm not so sure that building a product that mandates both btrfs and Docker really counts as "decoupled".

Hi, jd @ Pachyderm here.

Long term we don't want to mandate either of these technologies. We'd like to offer users a variety of job formats (docker and rocket come to mind more domain specific things like SQL are interesting as well) and a variety of storage options (zfs, overlayfs, aufs and in-memory to name a few). However we had to be pragmatic early on and pick what we thought were the highest leverage implementations so that we could ship something that worked.

We'll certainly be looking in to getting rid of this mandate in the near future, we like giving people a variety of choices.

A very performant way would be native RDMA verbs access over an Infiniband fabric, 40g to 56g throughput. At that rate remote storage is faster latency wise than local SATA. Many many HPC shops operate in this fashion and the 100g Ethernet around the corner only solidifies this.

What about other options like HPCC?

I hate Hadoop with a passion, but this reads like written by people even more clueless than Hadoop developers.

> Contrast this with Pachyderm MapReduce (PMR) in which a job is an HTTP server inside a Docker container (a microservice). You give Pachyderm a Docker image and it will automatically distribute it throughout the cluster next to your data. Data is POSTed to the container over HTTP and the results are stored back in the file system.

Now I'm not even sure anymore whether they are trying to troll people or if they are that out of touch with reality.

- No focus on trying to take on the real issues, but working around a perceived problem which would have been a non-issue after 5 minutes of research? Check.

- Trying to improve things by looking at bad ideas which weren't even state-of-the-art ten years ago? Check.

- Being completely clueless about the real issues in distributed data processing? Check.

- Whole thing reads like some giant buzzword-heavy bullshit bingo? Check.

- Some Silicon-Valley-kids-turned-JavaScript-HTML-"programmers"-turned-NodeJS-developers-turned-Golang-experts thinking that they can build a Git-like distributed filesystem from scratch? Check.

- "A git-like distributed file system for a Dockerized world." ... Using Go? ... Please just kill me.

If I ever leave the software industry, it's because of shit like this.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact