Hacker News new | comments | ask | show | jobs | submit login

Interesting -- does HTCondor solve the same problem as Hadoop? I know it is for running batch jobs across a cluster.

How big are the data sets? How does it compare with the MapReduce paradigm?

I see there is this Dagman project which sounds similar to some of the newer Big Data frameworks that use a DAG model. I will make an uneducated guess that it deals with lots of computation on smaller sized data (data that fits on a machine). Maybe you have 1 GB or 10 GB of data that you want to run through many (dozens?) of transformations. So you need many computers to do the computation, but not necessarily to store the data.

I would guess another major difference is that there is no shuffle (distributed sort)? That is really what distinguishes MapReduce from simple parallel batch processing -- i.e. the thing that connects the Map and Reduce steps (and one of the harder parts to engineer).

In MapReduce you often have data much bigger than a single machine, e.g. ~100 TB is common, but you are doing relatively little computation on it. Also MapReduce tolerates many node failures and thus scales well (5K+ machines for a single job), and tries to reign in stragglers to reduce completion time.

I am coming from the MapReduce paradigm... I know people were doing "scientific computing" when I was in college but I never got involved in it :-( I'm curious how the workloads differ.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact