
Pinterest open-sources Terrapin, a tool for serving data from Hadoop - gexos
http://venturebeat.com/2015/09/14/pinterest-open-sources-terrapin-a-tool-for-serving-data-from-hadoop/
======
varunsharma
This is Varun from Pinterest.

We did look at a few options before building this. ElephantDB seemed a bit
heavy-handed: we would have had to modify the ring configuration every time we
added or removed servers, and edit domain spec YAML files for newly added data
sets. It also did not let us easily change the number of shards across
different versions of the data - something our developers do often to make
their jobs run faster. Finally, it does not garbage-collect older versions,
and since our workflows write new versions every day, this was a problem.

We did look at Cassandra, but we did not want to operate another data store.
We also definitely wanted to load data fast, i.e. through simple file copy
operations. For that, Cassandra had issues similar to HBase's: major
compactions are needed to get rid of older data versions, and tweaking the
number of reduce shards was also harder.

With Terrapin, we essentially tried to build a serving system on top of HDFS,
given the recent improvements in HDFS performance when there is data locality.
We felt that HDFS was rock solid and the best storage system (in terms of
scalability and ease of operation) for immutable data sets. On top of that, we
built versioning, cheap garbage collection, extensible serving formats, etc.,
as mentioned in the blog post.
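Since versions are just immutable directories on HDFS, the garbage collection
really can be cheap - roughly a "keep the newest N version directories, delete
the rest" pass. A minimal sketch (the directory layout and keep-count here are
illustrative assumptions, not our actual code; local directories stand in for
HDFS paths):

```python
import os
import shutil

def gc_old_versions(fileset_dir, keep=2):
    """Delete all but the newest `keep` version directories of a fileset.

    Assumes version directories are named so that lexicographic order
    matches age (e.g. a zero-padded timestamp or sequence number).
    """
    versions = sorted(
        d for d in os.listdir(fileset_dir)
        if os.path.isdir(os.path.join(fileset_dir, d))
    )
    for stale in versions[:-keep]:
        # On HDFS this would be a single recursive-delete call; here we
        # model it on a local directory tree.
        shutil.rmtree(os.path.join(fileset_dir, stale))
    return versions[-keep:]
```

No compactions or tombstones are involved, which is the contrast with HBase
and Cassandra above: dropping an old version is a directory delete.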

As for Apache Drill, it is better suited to analyst queries, with latencies
ranging from hundreds of milliseconds up to seconds. That is not acceptable
for web-scale workloads, where latencies must be < 10ms for lower-level
serving systems like Terrapin.

~~~
ameyamk
Hi Varun,

Can you elaborate on how you read HFiles and serve them from the Terrapin
servers? Are you using functionality similar to HBase's? (With a
block-cache-like design? If yes, how do you keep the two in sync?)

Your blog post is missing this interesting detail.

~~~
varunsharma
That is correct, we use functionality similar to HBase's. We pull in the
HBase BlockCache library with some tweaks to make it work for our scenario.
Note that data is never overwritten and HFiles are immutable, so the cache
automatically gets populated and evicted as HFiles are opened and closed.
That said, there is the possibility of using more performant formats like
RocksDB in the future (the format is pluggable), or even keeping HFiles but
loading them into some kind of specialized in-memory data structure.
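To sketch the idea (illustrative Python, not our actual Java code): blocks
are cached under (file, offset) keys, and because files are immutable a
cached block can never go stale. Closing a file - e.g. when an old version is
swapped out - just drops its blocks, which is the open/close eviction
described above:

```python
from collections import OrderedDict

class FileBlockCache:
    """LRU cache of data blocks keyed by (file_id, block_offset).

    The underlying files are immutable, so a cached block is always
    valid while its file is open; no invalidation protocol is needed.
    """
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def get(self, file_id, offset, load_block):
        key = (file_id, offset)
        if key in self.blocks:
            self.blocks.move_to_end(key)      # LRU touch on a cache hit
            return self.blocks[key]
        block = load_block(file_id, offset)   # miss: read block from storage
        self.blocks[key] = block
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)   # evict least recently used
        return block

    def close_file(self, file_id):
        # Drop every cached block belonging to this file.
        for key in [k for k in self.blocks if k[0] == file_id]:
            del self.blocks[key]
```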

~~~
ameyamk
Cool. So in a way it's an immutable, read-only HBase (with guaranteed data
locality and no memstore or compaction overhead). Nice solution. I wonder if
this could be patched back into HBase as a "read only mode"?

~~~
varunsharma
I suspect it would be difficult. There are differences: new data is
completely independent of previous data, and the source of truth for region
distribution is HDFS, via the block locations. There are multiple replicas
per shard, while HBase has one region per shard. Across bulk loads, the
number of shards (regions) can change for the same fileset or table - not
possible with HBase. And the data sharding is not range based as in HBase but
mod based, as output by the Hadoop HashPartitioner. There are so many
differences that it's hard to fold this back into the HBase code.
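For illustration, mod-based assignment is essentially the Hadoop
HashPartitioner logic, `(key.hashCode() & Integer.MAX_VALUE) % numPartitions`.
A Python stand-in (the 31-multiplier hash mimics Java's `String.hashCode`
purely for illustration):

```python
def shard_for_key(key: bytes, num_shards: int) -> int:
    """Mod-based shard assignment in the spirit of Hadoop's HashPartitioner:
    (hash(key) & Integer.MAX_VALUE) % num_shards."""
    # Java-style 32-bit rolling hash over the key bytes.
    h = 0
    for b in key:
        h = (31 * h + b) & 0xFFFFFFFF
    # Mask to a non-negative value, then take the modulus.
    return (h & 0x7FFFFFFF) % num_shards
```

Because the mapping depends only on the key and the shard count, a job can
write the next version with a different `num_shards` and readers just
recompute the mapping - unlike HBase's range-based regions, where the split
points are part of the table's state.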

------
optimusclimb
Either more tools like this are going to pop up, or the existing ones will
mature, as more people adopt Lambda-style architectures, I'd imagine.

While building one out, we looked at VoldemortDB, SploutSQL, and ElephantDB to
serve bulk data coming out of Hadoop in batches. Voldemort turned out to be
much rougher around the edges than expected, ElephantDB looked very bleeding
edge, and SploutSQL wasn't as general purpose. In the end we turned to
Cassandra and this tool -
[https://github.com/spotify/hdfs2cass](https://github.com/spotify/hdfs2cass).

Good to see Pinterest open sourcing this.

------
kevinbowman
The article links through to
[https://engineering.pinterest.com/blog/open-sourcing-terrapin-serving-system-batch-generated-data-0](https://engineering.pinterest.com/blog/open-sourcing-terrapin-serving-system-batch-generated-data-0),
which gives more info.

------
ameyamk
Bulk uploads into KV stores are slow, so Terrapin allows KV access over
immutable HDFS files.

Very typical use case for recommendation systems etc. We face similar
problems with latencies on HBase (at Groupon).

So this solution seems interesting. It would be good to have a comparison
with the other solutions Pinterest tried before building this, e.g. loading
data into Cassandra instead of HBase.

In a nutshell: a very specific use case, but one that comes up very often.

~~~
sjg007
What do you think of Apache Drill on top of, say, HDFS-served files?

------
arthurcolle
I wonder if the authors are Maryland alumni!

