A bug-free, cross-platform, rock-solid HTCondor would be great. Sadly, the existing HTCondor is a buggy turd. I've fixed some stuff in it, and a lot of the code is eye-roll-worthy. I can practically guarantee you hours of wondering why something that by any reasonable expectation should be working isn't, with no meaningful error message to go on.
But like you said, a really solid HTCondor rewrite with the Stork stuff baked in would pretty much be all anyone should need, I think. It's not as if you'd even need to dramatically re-architect the thing; the problems with HTCondor really are mostly code quality and the user experience.
How big are the data sets? How does it compare with the MapReduce paradigm?
I see there is this DAGMan project, which sounds similar to some of the newer Big Data frameworks that use a DAG model. I will make an uneducated guess that it deals with lots of computation on smaller data (data that fits on a single machine). Maybe you have 1 GB or 10 GB of data that you want to run through many (dozens?) of transformations, so you need many computers to do the computation, but not necessarily to store the data.
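To make that guess concrete, here's a toy single-process sketch of the DAG-of-transformations shape (not DAGMan's actual API; the `topo_run` helper and the pipeline names are made up for illustration). Each node is a pure function over data small enough to fit in memory, and independent branches are exactly the parts you could farm out to different machines:

```python
def topo_run(dag, inputs):
    """Run a DAG given as {node: (fn, [deps])}; deps not in the dag
    are looked up in the initial inputs dict."""
    results = dict(inputs)
    remaining = dict(dag)
    while remaining:
        for node, (fn, deps) in list(remaining.items()):
            # A node is runnable once all of its dependencies have results.
            if all(d in results for d in deps):
                results[node] = fn(*(results[d] for d in deps))
                del remaining[node]
    return results

# Toy pipeline: raw -> cleaned -> (total, normalized) -> report.
# "total" and "normalized" are independent, so they could run in parallel.
dag = {
    "cleaned":    (lambda xs: [x for x in xs if x is not None], ["raw"]),
    "total":      (lambda xs: sum(xs),                          ["cleaned"]),
    "normalized": (lambda xs: [x / max(xs) for x in xs],        ["cleaned"]),
    "report":     (lambda t, n: {"total": t, "points": len(n)}, ["total", "normalized"]),
}
out = topo_run(dag, {"raw": [3, None, 4, 5]})
```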
I would guess another major difference is that there is no shuffle (distributed sort)? That is really what distinguishes MapReduce from simple parallel batch processing -- i.e. the thing that connects the Map and Reduce steps (and one of the harder parts to engineer).
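For anyone who hasn't seen it, the Map -> shuffle -> Reduce shape boils down to something like the sketch below. In a real MapReduce the shuffle is a distributed sort/partition across the network (the hard engineering part mentioned above); here it's just an in-memory group-by-key, and `map_reduce` is a made-up helper, not any framework's API:

```python
from collections import defaultdict
from itertools import chain

def map_reduce(records, mapper, reducer):
    # Map: each input record emits zero or more (key, value) pairs.
    pairs = chain.from_iterable(mapper(r) for r in records)
    # Shuffle: bring all values for the same key together -- this is
    # the step that distinguishes MapReduce from plain parallel batch.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    # Reduce: one call per key over its grouped values.
    return {k: reducer(k, vs) for k, vs in groups.items()}

# Classic word count:
docs = ["to be or not to be"]
counts = map_reduce(
    docs,
    mapper=lambda line: [(w, 1) for w in line.split()],
    reducer=lambda _word, ones: sum(ones),
)
```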
In MapReduce you often have data much bigger than a single machine (e.g. ~100 TB is common), but you are doing relatively little computation on it. MapReduce also tolerates many node failures and thus scales well (5K+ machines for a single job), and tries to rein in stragglers to reduce completion time.
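The straggler trick is usually speculative execution: near the end of a job, launch a backup copy of any slow task and take whichever copy finishes first. A minimal sketch of that idea (the `run_speculatively` helper and the toy tasks are made up; real systems do this across machines, not threads):

```python
import concurrent.futures as cf
import time

def run_speculatively(task_variants):
    """Run redundant copies of a task; return the first result to finish."""
    with cf.ThreadPoolExecutor(max_workers=len(task_variants)) as pool:
        futures = [pool.submit(t) for t in task_variants]
        # Block only until ONE copy completes, then use its result.
        done, _ = cf.wait(futures, return_when=cf.FIRST_COMPLETED)
        return done.pop().result()

def straggler():          # a task stuck on a slow or flaky node
    time.sleep(2.0)
    return "slow copy"

def backup():             # a speculative backup copy on a healthy node
    time.sleep(0.05)
    return "fast copy"

result = run_speculatively([straggler, backup])
```

The job's completion time is then bounded by the faster copy rather than the slowest node, at the cost of some duplicated work.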
I am coming from the MapReduce paradigm... I know people were doing "scientific computing" when I was in college but I never got involved in it :-( I'm curious how the workloads differ.