

Open-Sourcing Pinot: Scaling the Wall of Real-Time Analytics - onderkalaci
http://engineering.linkedin.com/pinot/open-sourcing-pinot-scaling-wall-real-time-analytics

======
eis
Would be interesting to see a comparison to Druid:
[http://druid.io/](http://druid.io/)

They both seem to cater to the same use case: scalable real-time analytics.

~~~
cheddar
I'm biased in that I had the honor of writing the first line of code of Druid,
but from looking at what I could find on Pinot it looks like architecture-
wise, we share a lot of similarities. At a minimum, the thought behind how to
scale and how to ingest data appears to be the exact same. Druid clusters are
presently ingesting tens of billions of events per day (with a 50b number from
Metamarkets in the venturebeat article) and housing trillions of rows. From a
high level architecture, I don't see a reason Pinot couldn't as well (when
they are the same, it's kinda hard to say that one can't handle what the other
can ;)).

On the query API-side, it appears like Pinot exposes "PQL" which is a SQL-like
query language. In Druid, for better or worse, we use JSON-based HTTP calls.
Once you are using a client for querying in a programmatic manner, the
difference doesn't really matter, but it is true that something like PQL makes
it simpler for humans to write their own queries.

For management of distributed state, it looks like Pinot depends on Apache
Helix, which is an interesting framework. This was the first time I had heard
about it, and I really would've liked to have something like that to build on
when we started Druid. We instead had to implement it all ourselves. Helix
sounds like it provides all the hooks to enable the various levels of
integration that we need in Druid, so I can only imagine that there is little
difference in capability there. My biggest concern is ZooKeeper: the majority
of the operational issues that we run into with Druid are around ZooKeeper
integration. Losing an entire ZK cluster is painful (and does happen), and
from my read of the Helix docs, that would likely cause a loss of all shards
of data as well (it looks like they are persistent znodes?).

In terms of storage format, Druid enforces bitmap indexing and the Pinot docs
appear to indicate that they have multiple types of indexes available, which
is pretty cool. It looks like Pinot doesn't have a way to plug in user-defined
column formats, though, so it would likely require a bit of extra work to
extend the system to be able to handle more complex types of aggregators like
sketches. And, really, enabling sketching extensions is one of the most
important features of any "scalable realtime processing" tool, in my opinion.
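
To make the bitmap indexing idea concrete, here is a toy sketch in Python. It
is not Druid's or Pinot's actual storage format; the class name, column names,
and rows are all made up for illustration. Each distinct column value gets one
bitmask, and an equality filter across columns becomes a bitwise AND.

```python
from collections import defaultdict


class BitmapIndex:
    """Toy bitmap index: one integer bitmask per distinct column value."""

    def __init__(self):
        # value -> int whose bit i is set when row i holds that value
        self.bitmaps = defaultdict(int)
        self.num_rows = 0

    def add(self, value):
        """Append one row holding `value` to the column."""
        self.bitmaps[value] |= 1 << self.num_rows
        self.num_rows += 1

    def rows_equal(self, value):
        """Return the bitmask of rows where the column equals `value`."""
        return self.bitmaps.get(value, 0)


def row_ids(bitmap):
    """Expand a bitmask into the sorted list of row numbers it covers."""
    ids = []
    i = 0
    while bitmap:
        if bitmap & 1:
            ids.append(i)
        bitmap >>= 1
        i += 1
    return ids


# Hypothetical data: (country, device) per row.
rows = [("US", "mobile"), ("DE", "desktop"), ("US", "desktop"), ("US", "mobile")]
country = BitmapIndex()
device = BitmapIndex()
for c, d in rows:
    country.add(c)
    device.add(d)

# WHERE country = 'US' AND device = 'mobile' becomes a bitwise AND.
matches = row_ids(country.rows_equal("US") & device.rows_equal("mobile"))
print(matches)
```

Real systems use compressed bitmaps rather than plain integers, but the
filtering logic has the same shape.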

On the ingestion side, it appears that Pinot reads data from Kafka in real
time and then restates from batch, which is the model we've employed in Druid
for the last 4 years, so we know it works. I'd love to know how Pinot
guarantees high availability and replication of the real-time stream. There
are many ways of doing that (over the years, we've implemented 3 different
ways in Druid and are still not particularly happy with any of them).
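
As a rough illustration of the "real-time first, then restate from batch"
model, here is a minimal Python sketch. It assumes per-hour counts keyed by an
hour string; the function name and the numbers are hypothetical and this is
not how either system implements it. Wherever a batch result exists for an
hour, it replaces the real-time one.

```python
def serve(hours, batch_counts, realtime_counts):
    """Answer a per-hour count query, preferring restated batch data."""
    result = {}
    for hour in hours:
        if hour in batch_counts:
            # Batch has restated this hour, so it is authoritative.
            result[hour] = batch_counts[hour]
        elif hour in realtime_counts:
            # Not restated yet: fall back to what the stream has seen.
            result[hour] = realtime_counts[hour]
    return result


# Made-up numbers: real-time undercounts slightly until batch restates.
realtime_counts = {"2015-06-10T00": 98, "2015-06-10T01": 101, "2015-06-10T02": 57}
batch_counts = {"2015-06-10T00": 100, "2015-06-10T01": 103}

answer = serve(
    ["2015-06-10T00", "2015-06-10T01", "2015-06-10T02"],
    batch_counts,
    realtime_counts,
)
print(answer)
```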

Anyway, I still haven't had a chance to play with Pinot, so this could be
completely off-base, but those are my initial reactions.

~~~
jfim
Both architectures are indeed similar. The distinction between PQL and JSON is
mostly a client issue, as you could have a client that converts a hypothetical
Druid query language to JSON.
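
A minimal sketch of what such a client-side conversion could look like, in
Python. The SQL-like input syntax, the function name, and the table and column
names are hypothetical, and it only handles one query shape. The output uses
the field names of Druid's native timeseries query as I understand them.

```python
import json
import re

# Accepts only: SELECT COUNT(*) FROM <table> WHERE <dimension> = '<value>'
_PATTERN = re.compile(
    r"^\s*SELECT\s+COUNT\(\*\)\s+FROM\s+(\w+)\s+WHERE\s+(\w+)\s*=\s*'([^']*)'\s*$",
    re.IGNORECASE,
)


def to_druid_timeseries(query, interval):
    """Translate one restricted SQL-like count query into a JSON-ready dict."""
    m = _PATTERN.match(query)
    if m is None:
        raise ValueError("unsupported query: " + query)
    table, dimension, value = m.groups()
    return {
        "queryType": "timeseries",
        "dataSource": table,
        "granularity": "all",
        # The time range is passed separately as an ISO-8601 interval.
        "intervals": [interval],
        "filter": {"type": "selector", "dimension": dimension, "value": value},
        "aggregations": [{"type": "count", "name": "count"}],
    }


body = to_druid_timeseries(
    "SELECT COUNT(*) FROM pageviews WHERE country = 'US'",
    "2015-06-01/2015-06-11",
)
print(json.dumps(body, indent=2))
```

The resulting JSON is what a client would POST to a Druid broker; a human
only ever sees the SQL-like string.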

Pinot does use Helix, as it's been used successfully inside (and outside)
LinkedIn to manage distributed state and coordination with a state model
that's easily understandable.

To recover from losing the entire ZK cluster (which would be quite a bad
operational failure, since Kafka and other services depending on ZK would
also break), you'd need to recreate your tables, repush your data from Hadoop and
start consuming from Kafka again. We only use ZK and Helix for coordination
and storing segment-level metadata.

There are some other differences with Druid. For example, when data is pushed
into Druid, it can be persisted in another deep storage system (S3, etc.),
which is something we don't support at this point in time (it wasn't necessary
internally at LinkedIn). We also don't have integration with R, nor
documentation that's as extensive as Druid's.

------
scootrous
Is the advantage of Pinot that it is not as strict on data format as Druid
(which requires JSON/delimited)...?

------
dang
URL changed from [http://venturebeat.com/2015/06/10/linkedin-open-sources-its-...](http://venturebeat.com/2015/06/10/linkedin-open-sources-its-pinot-real-time-analytics-software/), which points to this.

