
Thrill – Big Data Processing with C++ - brakmic
http://blog.brakmic.com/thrill-big-data-processing-with-c/
======
lorenzhs
The companion paper to Thrill, with more details on its architecture and some
benchmarks and comparisons to Spark and Flink:
[https://arxiv.org/abs/1608.05634](https://arxiv.org/abs/1608.05634)

~~~
75dvtwin
Thx for the link to the paper. It is a useful read in its own right.

It provides an overview of other processing frameworks, explains why C++ was
chosen, and describes various bottlenecks and their effects.

I had worked with k-means before, and was happy to see it included in the
benchmarking, as it seems to be one of the more widely used approaches for
unsupervised learning.

In my view, Thrill's objectives of composability and integration into
existing code are similar to those of Python's Dask.

------
codepie
There's also Blogel [0] which is a distributed graph processing framework in
C++ and it runs significantly faster than its counterpart in Java, Apache
Giraph [1].

I have started wondering whether big data developers really care about speed;
the advantages of this Java software start to fade when compared with its C++
counterparts.

[0] - [http://www.cse.cuhk.edu.hk/blogel/](http://www.cse.cuhk.edu.hk/blogel/)

[1] -
[http://www.cse.cuhk.edu.hk/blogel/papers/blogel.pdf](http://www.cse.cuhk.edu.hk/blogel/papers/blogel.pdf)

~~~
pjmlp
If you just measure milliseconds, yes.

If you measure project costs, including developer salaries and the number of
development days, then no.

This is the main reason there is such big pressure from trading folks on
Oracle to improve Java with regard to value types and FFI to native code.

~~~
lorenzhs
I think with Thrill, there are two different skill levels to be distinguished:

\- Using it to implement things should be fairly easy and doesn't require
advanced knowledge of C++. Basically you plug lambdas that do the processing
into the provided operations, similar to Spark, but using C++ _syntax_. It
might require some compiler-error-parsing skills, but altogether it shouldn't
be too different from using Spark with Java/Scala.

\- Extending Thrill requires familiarity with modern C++, possibly including
advanced template tricks.

Since there isn't a whole lot of advanced stuff available for Thrill (yet),
people with the latter skills would most likely be required at the moment.
But in a world where the same libraries available for Spark were also
available for Thrill or a similar C++ framework, that wouldn't be the case.
Note that Thrill is currently quite experimental.

I guess it's a trade-off, but dismissing the potential for 10x runtime gains
"because C++" seems too one-sided. That isn't to say that the C++ frameworks
don't have a long way to go before they can rival Spark etc. in ease of use
and tooling; they do! But at least they point out the inefficiencies and
potential for improvement in these existing systems.

------
adrianN
There is also STXXL [1] for times when your data is big but not "big". It
contains containers and algorithms optimized for external storage.

[1] - [http://stxxl.sourceforge.net/](http://stxxl.sourceforge.net/)

~~~
lorenzhs
Thrill and STXXL are both developed in the same group at KIT (I work there,
too, but I'm not directly involved). Thrill also reuses some parts of STXXL,
and does so completely transparently to the user: if memory doesn't suffice,
it'll use the disk.

------
pzh
Does anybody know how this is different from Spark? These Distributed
Immutable Arrays sound suspiciously similar to Spark's Resilient Distributed
Datasets. Is it just the choice of C++ as opposed to Scala that would make
this more efficient?

Also, I wonder if and how they implemented the concept of lineage (unless
these DIAs are not really very resilient)... I thought Spark relied on Scala's
delayed evaluation to do that, though I may be mistaken.

~~~
lorenzhs
DIAs are quite heavily inspired by RDDs. A lot of the performance increases
come from the C++ compiler's ability to fuse local operations etc.

Thrill doesn't implement any fault tolerance at the moment; it's closer to
prototype status than production readiness.

~~~
pzh
Do you have any plans regarding how you'd implement resilience and the
equivalent of Spark's concept of 'lineage', where you keep a history of how a
given RDD was computed, and then you can recompute it if it gets lost?

I haven't looked into Spark in depth, but I believe that 'lineage' relies
heavily on Scala's delayed evaluation and the underlying Java RMI facilities.
Doing something similar in C++ may require a lot more effort and a
significantly different set of tradeoffs regarding the performance model.

~~~
lorenzhs
I'm not that directly involved in Thrill, so I can't really speak with
authority. There aren't any _concrete_ plans on fault tolerance but it would
certainly be an interesting topic to work on, partially because the existing
solutions seem quite inefficient.

------
Mikeb85
Very cool. Will have to remember this, maybe write an R package that makes use
of it.

~~~
fnord123
Not using Thrill, but there is pbdR:

[https://rbigdata.github.io/](https://rbigdata.github.io/)

It's basically R on ScaLAPACK.

------
tmsldd
The Force is strong with this one

