Show HN: Cuneiform – A Functional Workflow Language (cuneiform-lang.org)
39 points by joergen7 on Feb 5, 2016 | 10 comments



Really like this work. Bioinformatics has not entered the Big Data age, but will do eventually - kicking and screaming.

Talk on it here: https://www.youtube.com/watch?v=2uWSbYWyLh8&index=2&list=PL5...


Honest question: What's wrong with how they do it? I'm coming from the "cloud / big data" side of things, I suppose.

Guess: Low utilization? I imagine the demand to run jobs is very spiky (e.g. right before a paper deadline), but if you manually/statically allocate machines, then machines will sit idle and your utilization will be kinda low.

And perhaps poor reproducibility, although I don't think "big data" style pipelines are all that much better.


Why many scientific communities haven't been racing to adopt Big Data technology is a tough question. Whether or not utilization is a major factor, I can't really say. For Cuneiform, we have assumed that the major obstacle to using platforms like Hadoop or Spark directly is that these systems require a library to have a Java or Scala interface. If your software is written in, say, Python, you need to either

- write a wrapper (an approach taken by Crossbow [1], SeqPig [2], or BioPig [3])

- reimplement (an approach taken by CloudBurst [4] or ADAM [5])

Reimplementing a whole pipeline represents a huge upfront cost. Writing wrappers is more feasible and, accordingly, a huge number of custom-tailored pipelines and ad-hoc workflow systems have emerged recently. So if it boils down to wrapping libraries and running them on Hadoop, then we need to make this as easy as possible. Cuneiform is a step in this direction.

[1] http://bowtie-bio.sourceforge.net/crossbow/index.shtml

[2] http://bioinformatics.oxfordjournals.org/content/30/1/119.lo...

[3] http://bioinformatics.oxfordjournals.org/content/early/2013/...

[4] http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2682523/

[5] http://bdgenomics.org/projects/adam/


The Java/JVM bias is a deficiency of open source Big Data, not bioinformatics!!!

FWIW, Google doesn't have a JVM bias in terms of its big data stack, but it still has language biases. Bioinformatics is built on Unix tools and Unix composition AFAICT, and that philosophy needs to make its way back to the big data stack IMO.

My question is what's wrong with what they're doing? If it's working, why not keep doing the same thing, rather than trying to move to Hadoop, which sounds like the wrong tool?


> The Java/JVM bias is a deficiency of open source Big Data, not bioinformatics!!!

I agree.

> Bioinformatics is built on Unix tools and Unix composition AFAICT

That is true for most bioinformatics tools, with a few important exceptions. CummeRbund, e.g., is a must for RNA-Seq and you need to drive it from R. It's hard to tell whether supporting only the command line would actually be enough to conveniently describe workflows. For Cuneiform, we assume that it isn't.

> what's wrong with what they're doing?

It's not that moving to Big Data stacks is the ultimate answer to everything. If your workflow runs fast enough on one machine, it's probably best to change nothing. Nevertheless, we're observing a dramatic drop in sequencing costs right now. And the longer this trend continues, the greater the demand for scalable infrastructures in Next Generation Sequencing will be.


I've read the paper, and I feel like I should be more informed than I am.

I take it this language is designed to allow you to build systems that take very large data sets and transform them in a sort of extremely parallel, map-reduce inspired way using black-box, possibly external services to do so.

What I don't understand is why this particular language is superior to a general purpose language (say Python, for argument's sake) for the task.

Speaking from a level of genuine irritation with my own ignorance here - would someone better informed be prepared to explain?


Automatic parallelization is usually not part of general purpose languages.

Programs written in a general purpose programming language (like Python) are neither automatically parallelized nor do they run on more than one machine by default. Of course, most general purpose languages offer facilities for parallelization and/or distribution. Python has the multiprocessing module. In R there is the parallelMap package. But responsibility for how parallelization is used in such a program still resides with the programmer.
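To make that responsibility concrete, here is a minimal sketch in Python. The names `gc_content` and `run_in_parallel` are hypothetical and only stand in for a long-running task. The sketch uses `ThreadPool` from the standard `multiprocessing.pool` module; `multiprocessing.Pool` offers the same `map` interface with worker processes instead of threads. Note that the programmer has to pick what runs in parallel, how many workers to use, and how results are collected.

```python
from multiprocessing.pool import ThreadPool


def gc_content(seq: str) -> float:
    """Fraction of G/C bases in a sequence (hypothetical stand-in for a real task)."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq.upper()) / len(seq)


def run_in_parallel(seqs, workers=4):
    """Apply gc_content to every sequence using an explicitly managed worker pool."""
    # Nothing here is automatic: the pool size and the unit of parallel work
    # are both chosen by the programmer.
    with ThreadPool(processes=workers) as pool:
        # map() preserves input order in the returned list.
        return pool.map(gc_content, seqs)


if __name__ == "__main__":
    print(run_in_parallel(["GGCC", "ATAT", "ATGC"]))
```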

This is different in Cuneiform. Everything is automatically parallelized here. Most task applications are relatively large computational pieces of work that run for seconds or minutes (in contrast to, e.g., an addition). In this scenario it makes sense to parallelize wherever possible.

Cuneiform allows you to drive any API.

Another thing that is difficult in most general purpose programming languages is the incorporation of libraries not written in that particular language. E.g., it is non-trivial to use a Python library in Java. Even calling command line applications requires a small wrapper (os.system() in Python). In Cuneiform this is the normal mode of operation. Writing a wrapper around a Bash script is just as hard as defining a function.
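For comparison, this is roughly what such a small wrapper around a command line tool looks like in Python. It is only a sketch: `count_lines` is a hypothetical name, and it assumes a POSIX system where `wc` is on the PATH. It uses `subprocess.run` rather than `os.system()` because `os.system()` only gives back the exit status, not the command's output.

```python
import subprocess


def count_lines(text: str) -> int:
    """Wrap the Unix `wc -l` command as an ordinary Python function."""
    result = subprocess.run(
        ["wc", "-l"],         # assumes `wc` is available (POSIX)
        input=text,           # feed the text to the command's stdin
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # work with str rather than bytes
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    # Some platforms pad the count with spaces, so strip before parsing.
    return int(result.stdout.strip())


if __name__ == "__main__":
    print(count_lines("a\nb\nc\n"))
```

Even in this small case the caller has to deal with argument passing, stdin/stdout plumbing, and output parsing by hand.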


> Programs written in a general purpose programming language (like Python) are [not] automatically parallelizable

Of course they are - your super-scalar, out-of-order processor does it for you at the very least.

And at a higher level, the same speculative techniques that can be used to automatically parallelise a C program can also be used for Python, Ruby, etc. It's just that nobody has implemented that yet.


Perhaps I was a bit imprecise by saying that Python programs are not automatically parallelizable.

What I was trying to say was that Python programs are not automatically parallelized by the standard Python interpreter.

Thanks for pointing it out.


This will solve many problems and help make things easier for scientists and programmers. I don't mean this negatively, just as constructive criticism, but the syntax of Cuneiform is quite ugly to look at. Many of the syntax design choices feel arbitrary, such as the \{ and \} encapsulating functions, the need to write deftask to define a function, and the really ugly function signature syntax. I mean, it is almost as ugly as Perl.

Please consider this issue.



