

What languages and tools do you use for cluster computing? - etal

I've recently come into possession of an account on a research computing cluster. This is for bioinformatics work at a university. For the past few years I've been interested in parallelism-friendly functional languages like Haskell, but in my actual work I haven't come across anything that C, Python and Bash can't handle easily. So this seems like a great opportunity to try the weird stuff.

Most scientists seem to get by using MPI with C, C++ or Fortran. Naturally I'll learn how to do that, too, but I strongly suspect that this traditional route tends to result in sprawling, complex code that's difficult to debug. There must be a better way. OCaml and Python have MPI interfaces, too; Scala seems to be designed for distributed computing on existing platforms.

And the platform is a limitation, too. The interactive node on this cluster shows a few stable packages to work with: GCC 3.4.6, Java 1.6, Python 2.3 (with a broken numpy), and of course the MPI suite. Nothing exotic; CVS and SVN for version control. I think this means Erlang and Haskell are ruled out, unless they're compiled to C or JVM bytecode first -- any VM must be available on all the nodes if the code's going to run. Right?

So, what do *you* use? Or, what would you use in this situation?
======
jjguy
I used OpenMosix and a modified version of John the Ripper to build a
distributed password cracker, back before rainbow tables.

OpenMosix is Linux with a modified kernel and some user-space tools; nodes
automatically join the cluster, and processes get migrated across the cluster
to equalize load across all the systems.

It's not as elegant as it could be, but it's quick and dirty.

------
noodle
I learned on C with MPI. It really isn't that difficult. The trick is just to
minimize the complexity and operations within the parallelized area.
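
A rough sketch of that idea, using Python's standard-library process pool as a
stand-in for MPI (this is an assumption for illustration, not MPI code). The
names `gc_fraction` and `mean_gc` are hypothetical. Only one small, pure
function runs in parallel; setup and aggregation stay serial.

```python
from concurrent.futures import ProcessPoolExecutor


def gc_fraction(sequence):
    """The only code that runs in parallel: small, pure, no shared state."""
    if not sequence:
        return 0.0
    return sum(base in "GC" for base in sequence) / len(sequence)


def mean_gc(sequences, workers=2):
    """Serial wrapper: fan the work out, then aggregate in one place."""
    # Parallel region: nothing but independent calls to gc_fraction.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        fractions = list(pool.map(gc_fraction, sequences))
    # Serial region: combining results is kept out of the workers.
    if not fractions:
        return 0.0
    return sum(fractions) / len(fractions)


if __name__ == "__main__":
    # Toy input, made up for this example.
    print(mean_gc(["GGCC", "ATAT", "GATC"]))
```

Because the worker has no side effects, it can be tested and debugged on its
own before any parallelism is involved.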

~~~
etal
Oh. Then, why the fuss about Occam, Fortress, Hadoop, et al? Is there another,
hairier area of HPC these languages and technologies are meant for, where
scientists rarely need to tread? Or are most scientists deliberately avoiding
those areas because of the hairiness?

~~~
bbgm
Depends on what you want to do as well. Hadoop is not the appropriate solution
for doing molecular dynamics simulations. You also have to look at the support
system. Something like LAPACK makes Fortran very attractive for numerical
programming.

What kinds of algorithms are you trying to implement?

~~~
etal
For research, right now I'm looking for better ways to visualize the results
of Chain analysis on a set of sequences. (This probably doesn't actually need
a cluster.) We're looking at evolutionary relationships between protein
families.

I just started a graduate program, so I'll have to do a range of things for
classes -- which is why I have some leeway to pick up a new language and
stagger through a few numerical programs now if it will help me get things
done quickly later, when it matters.

~~~
bbgm
It's probably worth your while to take a look at Hadoop, then. When you can
fundamentally map your problem into many small problems and then reduce the
results (true for most sequence alignment problems and the like), Hadoop is
very viable. It's not exactly common fare these days, but it's getting more
popular. Plus you don't really have to learn a new language, just a
distributed computing framework.
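
To show the shape of "map into many small problems, then reduce", here is a
minimal sketch in plain Python. It does not use Hadoop. k-mer counting is an
assumed toy stand-in for real sequence work, and the function names are
hypothetical.

```python
from collections import Counter
from functools import reduce


def map_kmers(sequence, k=3):
    """Map step: count the k-mers of one sequence, independently of the rest."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def count_kmers(sequences, k=3):
    """Run the map step over every sequence, then fold the partial results."""
    partials = (map_kmers(seq, k) for seq in sequences)
    return reduce(reduce_counts, partials, Counter())


if __name__ == "__main__":
    # Toy sequences, made up for this example.
    print(count_kmers(["MKVLA", "KVLAA"]).most_common(2))
```

Each map call only touches its own sequence, which is what lets a framework
spread the calls across nodes; the reduce step is the only place where results
meet.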

