Zookeeper isn't Modular? No mention of HTCondor+Dagman? No mention of Spark?
I've written a general DAG processor/workflow engine/metascheduler/whatever you want to call it. It's used by various physics and astronomy experiments. I've interfaced it with the Grid/DIRAC, Amazon, LSF, Torque, PBS, SGE, and Crays. There's nothing in it that precludes Docker jobs from running. I've implemented a mixed-mode (Streaming+batch processing + more DAG) version of it which just uses job barriers to do the setup and ZeroMQ for inter-job communication. I think something like Spark's resilient distributed dataset would be nice here as well.
We don't use hadoop because, as was said, because it's narrow in scope and it's not a good general purpose DAG/Workflow Engine/Pipeline.
I think this is a small improvement, but I don't really see it being much better than hadoop, or HTCondor, or Spark.
HTCondor is pretty amazing. I think somebody should be building a modern HTCondor, not a modern Hadoop.
Our feeling is that there's a big difference between having projects like this in the ecosystem and having a canonical implementation that comes preloaded in the software. "Batteries included but removable" is the best phrasing of this we've found (stolen from Docker). I've seen tons of pipeline management tools in the Hadoop ecosystem (even contributed to some) it's not that they're bad it's just that they're not standardized. This winds up being a huge cost because if 2 companies use different pipeline management it gets a lot harder to reuse code.
I think etcd is a pretty good example of this. Etcd is basically a database (that's the way CoreOS describes it) by database standards it's pretty crappy. But it's simple and it's on every CoreOS installation so you can use it without making your software a ton harder to deploy.
Hope this clears some stuff up!
A bug-free, cross platform, rock-solid HTCondor would be great. Sadly the existing HTCondor is a buggy turd. I've fixed some stuff in it and a lot of the code is eye-roll worthy. I can practically guarantee you hours of wondering why something that by any reasonable expectation should be working isn't working, and not giving any meaningful error message.
But like you said, a really solid HTCondor rewrite with the Stork stuff baked in would pretty much be all anyone should need, I think. It's not like you'd even really need to dramatically re-architect the thing. The problems with HTCondor really are mostly just code quality and the user experience.
How big are the data sets? How does it compare with the MapReduce paradigm?
I see there is this Dagman project which sounds similar to some of the newer Big Data frameworks that use a DAG model. I will make an uneducated guess that it deals with lots of computation on smaller sized data (data that fits on a machine). Maybe you have 1 GB or 10 GB of data that you want to run through many (dozens?) of transformations. So you need many computers to do the computation, but not necessarily to store the data.
I would guess another major difference is that there is no shuffle (distributed sort)? That is really what distinguishes MapReduce from simple parallel batch processing -- i.e. the thing that connects the Map and Reduce steps (and one of the harder parts to engineer).
In MapReduce you often have data much bigger than a single machine, e.g. ~100 TB is common, but you are doing relatively little computation on it. Also MapReduce tolerates many node failures and thus scales well (5K+ machines for a single job), and tries to reign in stragglers to reduce completion time.
I am coming from the MapReduce paradigm... I know people were doing "scientific computing" when I was in college but I never got involved in it :-( I'm curious how the workloads differ.