
Show HN: Cuneiform – A Functional Workflow Language - joergen7
http://www.cuneiform-lang.org/
======
jamesblonde
Really like this work. Bioinformatics has not entered the Big Data age, but
will do eventually - kicking and screaming.

Talk on it here:
[https://www.youtube.com/watch?v=2uWSbYWyLh8&index=2&list=PL5...](https://www.youtube.com/watch?v=2uWSbYWyLh8&index=2&list=PL5oElY7F-znCU_Ppb7YWJ8jifqbDttxOt)

~~~
chubot
Honest question: What's wrong with how they do it? I'm coming from the "cloud
/ big data" side of things, I suppose.

Guess: Low utilization? I imagine the demand to run jobs is very spiky
(i.e. right before a paper deadline), but if you manually/statically allocate
machines, then machines will sit idle and your utilization will be kinda low.

And perhaps poor reproducibility, although I don't think "big data" style
pipelines are all that much better.

~~~
joergen7
The question of why many scientific communities haven't been racing to adopt Big
Data technology is a tough one. Whether or not utilization is a major factor I
can't really say. For Cuneiform, we have assumed that the major obstacle for
using platforms like Hadoop or Spark directly is that these systems require a
library to have a Java or Scala interface. If your software is written in,
say, Python you need to either

- write a wrapper (an approach taken by Crossbow [1], SeqPig [2], or BioPig
[3])

- reimplement (an approach taken by CloudBurst [4] or ADAM [5])

Reimplementing a whole pipeline represents a huge upfront cost. Writing
wrappers is more feasible and, accordingly, a huge number of custom-tailored
pipelines and ad-hoc workflow systems have emerged recently. So if it boils
down to wrapping libraries and running them on Hadoop, then we need to make
this as easy as possible. Cuneiform is a step in this direction.

[1] [http://bowtie-bio.sourceforge.net/crossbow/index.shtml](http://bowtie-bio.sourceforge.net/crossbow/index.shtml)
[2] [http://bioinformatics.oxfordjournals.org/content/30/1/119.lo...](http://bioinformatics.oxfordjournals.org/content/30/1/119.long)
[3] [http://bioinformatics.oxfordjournals.org/content/early/2013/...](http://bioinformatics.oxfordjournals.org/content/early/2013/09/10/bioinformatics.btt528)
[4] [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2682523/](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2682523/)
[5] [http://bdgenomics.org/projects/adam/](http://bdgenomics.org/projects/adam/)

~~~
chubot
The Java/JVM bias is a deficiency of open source Big Data, not
bioinformatics!!!

FWIW, Google doesn't have a JVM bias in terms of its big data stack, but it
still has language biases. Bioinformatics is built on Unix tools and Unix
composition AFAICT, and that philosophy needs to make its way back to the big
data stack IMO.

My question is what's wrong with what they're doing? If it's working, why not
keep doing the same thing, rather than trying to move to Hadoop, which sounds
like the wrong tool?

~~~
joergen7
> The Java/JVM bias is a deficiency of open source Big Data, not
> bioinformatics!!!

I agree.

> Bioinformatics is built on Unix tools and Unix composition AFAICT

That is true for most bioinformatics tools with a few important exceptions.
CummeRbund, e.g., is a must for RNA-Seq and you need to drive it from R. It's
hard to tell whether supporting only command line would actually be enough to
conveniently describe workflows. For Cuneiform, we assume that it isn't.

> what's wrong with what they're doing?

It's not that moving to Big Data stacks is the ultimate answer to everything.
If your workflow runs fast enough on one machine, probably it's best to change
nothing. Nevertheless, we're observing a dramatic drop in sequencing costs
right now. And the more this trend continues, the greater the demand for
scalable infrastructures in Next Generation Sequencing will be.

------
gearhart
I've read the paper, and I feel like I should be more informed than I am.

I take it this language is designed to allow you to build systems that take
very large data sets and transform them in a sort of extremely parallel,
map-reduce inspired way using black-box, possibly external services to do so.

What I don't understand, is why this particular language is superior to a
general purpose language (say python, for argument's sake) for the task.

Speaking from a level of genuine irritation with my own ignorance here - would
someone better informed be prepared to explain?

~~~
joergen7
Automatic parallelization is usually not part of general purpose languages.

Programs written in a general purpose programming language (like Python) are
neither automatically parallelizable nor do they run on more than one machine.
Of course, most general purpose languages offer facilities for parallelization
and/or distribution. Python has the multiprocessing library. In R there is the
parallelMap library. But responsibility over how parallelization is used in
such a program still resides with the programmer.
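
To make that concrete, here is a minimal sketch of explicit parallelization with Python's multiprocessing library. The function `count_gc` and the sample reads are made up for illustration and stand in for real work; the point is only that the programmer decides where the pool goes and what gets mapped over it.

```python
from multiprocessing import Pool


def count_gc(read):
    """Count G and C bases in one read (hypothetical stand-in for real work)."""
    return sum(1 for base in read if base in "GC")


def gc_counts(reads, workers=2):
    # The programmer explicitly creates the pool and chooses what to map.
    # Nothing here runs in parallel unless it is written this way.
    with Pool(processes=workers) as pool:
        return pool.map(count_gc, reads)


if __name__ == "__main__":
    print(gc_counts(["GATTACA", "CCGG", "ATAT"]))  # prints [2, 4, 0]
```

Choosing the number of workers, the unit of work, and the place where results are collected are all decisions left to whoever writes the program.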

This is different in Cuneiform. Everything is automatically parallelized here.
Most task applications are relatively large computational pieces of work that
run for seconds or minutes (in contrast to, e.g., an addition). In this
scenario it makes sense to parallelize wherever possible.
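
As a rough illustration of the idea (and not a description of how Cuneiform is actually implemented), a scheduler can derive parallelism from data dependencies alone: any task whose inputs are available is started right away. The names `run_workflow`, `tasks`, and `deps` below are hypothetical, and threads are used only because the tasks are assumed to be coarse-grained external work.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_workflow(tasks, deps, max_workers=4):
    """Run each task as soon as all of its dependencies have results.

    tasks: name -> function taking the results of its dependencies, in order
    deps:  name -> list of task names it depends on (missing means none)
    """
    results = {}
    running = {}          # future -> task name
    pending = set(tasks)  # tasks not yet started
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while pending or running:
            # Start every pending task whose inputs have all been computed.
            ready = [name for name in pending
                     if all(d in results for d in deps.get(name, []))]
            for name in ready:
                args = [results[d] for d in deps.get(name, [])]
                running[pool.submit(tasks[name], *args)] = name
                pending.discard(name)
            if not running:
                raise ValueError("cyclic or unsatisfiable dependencies")
            # Block until at least one running task finishes.
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in done:
                results[running.pop(future)] = future.result()
    return results


if __name__ == "__main__":
    tasks = {
        "a": lambda: 2,            # independent of "b", so both can run at once
        "b": lambda: 3,
        "sum": lambda x, y: x + y,  # waits for "a" and "b"
    }
    print(run_workflow(tasks, {"sum": ["a", "b"]})["sum"])  # prints 5
```

The caller only states what depends on what; which tasks run side by side falls out of the dependency structure.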

Cuneiform allows you to drive any API.

Another thing that is difficult in most general purpose programming languages
is the incorporation of libraries not written in that particular language.
E.g., it is non-trivial to use a Python library in Java. Even calling command
line applications requires a small wrapper (os.system() in Python). In
Cuneiform this is the normal mode of operation. Writing a wrapper around a
Bash script is just as hard as defining a function.
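
For comparison, the kind of small wrapper meant here might look like the following in Python. This is only a sketch: it uses `subprocess.run` rather than `os.system()` so that the output can be captured, the helper name `run_tool` is made up, and the wrapped command (`sort`) is an arbitrary example assumed to be on the PATH.

```python
import subprocess


def run_tool(args, stdin_text=""):
    """Run a command-line tool and return its standard output as text."""
    result = subprocess.run(
        args,
        input=stdin_text,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


if __name__ == "__main__":
    # `sort` is just an example; any command-line tool could be wrapped this way.
    print(run_tool(["sort"], "b\na\nc\n"), end="")
```

Every external tool needs a helper like this, plus code to convert arguments and results, before it can be used like a native function.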

~~~
chrisseaton
> Programs written in a general purpose programming language (like Python) are
> [not] automatically parallelizable

Of course they are - your super-scalar, out-of-order processor does it for you
at the very least.

And at a higher level, the same speculative techniques that can be used to
automatically parallelise a C program can also be used for Python, Ruby, etc.
It's just that nobody has implemented that yet.

~~~
joergen7
Perhaps I was a bit imprecise by saying that Python programs are not
automatically parallelizable.

What I was trying to say was that Python programs are not automatically
parallelized by the standard Python interpreter.

Thanks for pointing it out.

------
theideasmith
This will solve many problems and make things easier for scientists and
programmers. I don't mean this negatively, just as constructive criticism, but
the syntax of Cuneiform is quite ugly to look at. Many of the syntax design
choices feel arbitrary, such as the *{ and }* encapsulating functions, the
need to write deftask to define a function, and the really ugly function
signature syntax. I mean it is almost as ugly as Perl.

Please consider this issue.

