
Creating a full-text search engine in Apache Pinot - caetris1
https://medium.com/apache-pinot-developer-blog/text-analytics-on-apache-pinot-cbf5c45d282c
======
memexy
> Apache Pinot is a real-time distributed OLAP datastore, built to deliver
> scalable real time analytics with low latency.

> In this post, we will discuss newly added support for text indexes in Pinot
> and how they can be used for efficient full-text search queries.
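As a rough illustration of what a text index buys you (the structure the linked post is about), here is a minimal inverted-index sketch in Python. This is a toy of my own, not Pinot code - Pinot's actual text index is built on Apache Lucene, and the class and method names below are made up for illustration:

```python
# Toy inverted index: maps each token to the set of document IDs
# containing it, so a term query becomes a postings-list lookup
# instead of a scan over every document's text.
from collections import defaultdict

class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # token -> set of doc ids
        self.docs = {}

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for token in text.lower().split():
            self.postings[token].add(doc_id)

    def search(self, query):
        # AND semantics: return docs containing every query token.
        tokens = query.lower().split()
        if not tokens:
            return set()
        result = self.postings[tokens[0]].copy()
        for t in tokens[1:]:
            result &= self.postings[t]
        return result

idx = InvertedIndex()
idx.add(1, "java lang RuntimeException stack trace")
idx.add(2, "connection timeout while reading socket")
idx.add(3, "java io IOException connection reset")
print(idx.search("java connection"))  # {3}
```

A real text index adds tokenization rules, phrase and regex queries, and compressed postings, but the core trade-off is the same: extra space at ingest time for sub-linear lookups at query time.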

~~~
contravariant
I'm not too familiar with either but isn't that the exact use-case for Apache
Druid as well? Does anyone know how they differ or why there are two efforts
by Apache towards the same goal?

~~~
memexy
Not sure. Apache always has several projects that do the same thing for one
reason or another. Usually they're projects from competing organizations like
Google and Facebook.

------
igrekel
I haven't checked Pinot in detail yet. I'd be interested to know how it
compares to the likes of ElasticSearch.

~~~
kishoreg
Pinot is built to answer OLAP queries at high throughput while maintaining low
latency. It powers many customer-facing analytics apps such as LinkedIn's "Who
Viewed My Profile", Publisher Analytics, etc. (50+ apps).

At LinkedIn, it serves 100k+ queries per sec with 10-1000 ms latency while
ingesting millions of events/sec from Kafka.

This is achieved by various indexing techniques - sorted index, bitmap index,
range index, star-tree index, bloom filter, partitioning, etc. - and a flexible
query execution planner that can dynamically pick the right plan based on the
query and data profile.
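To make one of those techniques concrete, here is a toy bitmap-index sketch in Python - my own simplified illustration of the general idea, not Pinot's implementation (which uses compressed roaring bitmaps):

```python
# Toy bitmap index on a low-cardinality column: one bitmask per
# distinct value, so an equality filter becomes a precomputed
# bitmap lookup, and combining filters (OR/AND) is a cheap
# bitwise operation instead of a row-by-row scan.

class BitmapIndex:
    def __init__(self, values):
        self.bitmaps = {}  # value -> int used as a bitmask over rows
        for row, v in enumerate(values):
            self.bitmaps[v] = self.bitmaps.get(v, 0) | (1 << row)

    def eq(self, value):
        # Bitmap of rows where column == value.
        return self.bitmaps.get(value, 0)

    @staticmethod
    def rows(bitmap):
        # Decode a bitmask back into matching row numbers.
        return [i for i in range(bitmap.bit_length()) if bitmap >> i & 1]

country = BitmapIndex(["US", "DE", "US", "IN", "DE", "US"])
# WHERE country = 'US' OR country = 'IN'
match = country.eq("US") | country.eq("IN")
print(BitmapIndex.rows(match))  # [0, 2, 3, 5]
```

The same shape explains why bitmap indexes pair well with the other techniques listed: a sorted or partitioned layout keeps the bitmaps dense, and filters from several columns can be ANDed together before any row data is touched.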

https://www.youtube.com/watch?v=luMLCDANxiU
should give you more info on why we built Pinot at LinkedIn.

Disclaimer: pinot committer

~~~
igrekel
Thanks for the link, I'll go through the presentation.

We need to upgrade the way we compute indicators and the backend for our
analytics and I was considering solutions like Druid and ElasticSearch and
Pinot seems like another good option. Getting better latency is really
interesting and I'm curious about how much we need to compromise on space
usage, etc.

Another big subject is how it handles time-based data, similar to time series.

