
Gorilla: A Fast, Scalable, In-Memory Time Series Database (2015) [pdf] - kiril-me
http://www.vldb.org/pvldb/vol8/p1816-teller.pdf
======
pixelmonkey
I read this paper a few months ago and summarized it for my team. Sharing the
notes here in case it's helpful for others:

- when storing time series keys, you can save a lot of space by encoding them
as a first timestamp followed by timestamp offsets (deltas)
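
A minimal sketch of that idea (function names are made up; the paper itself
goes further and stores delta-of-deltas in variable-width bit fields):

```python
def encode_timestamps(timestamps):
    """Store the first timestamp, then only the offsets between neighbours."""
    if not timestamps:
        return []
    return [timestamps[0]] + [b - a for a, b in zip(timestamps, timestamps[1:])]

def decode_timestamps(encoded):
    """Rebuild the original timestamps by accumulating the deltas."""
    out = []
    for delta in encoded:
        out.append(delta if not out else out[-1] + delta)
    return out
```

The deltas (e.g. 60, 60, 65 for roughly minute-spaced points) are small
numbers that take far fewer bits than full timestamps.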

- when storing time series values, you can save a lot of space by realizing
sequential data points tend not to be volatile... e.g. a "writes per minute"
series is more likely to be 100, 99, 101, etc. than it is to be 100, 999999,
1, etc. This means you can encode the current value by using XOR tricks with
prior values to save space
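
A rough sketch of the XOR trick (illustrative names, not the paper's actual
bit-packing, which also elides the leading and trailing zero bits):

```python
import struct

def as_bits(value):
    """Reinterpret a float64 as its raw 64-bit integer pattern."""
    return struct.unpack('>Q', struct.pack('>d', value))[0]

def xor_stream(values):
    """XOR each value's bit pattern with its predecessor's.

    Neighbouring values like 100, 99, 101 share the sign, exponent and
    most mantissa bits, so each XOR result is mostly zero bits; Gorilla
    then stores only the short run of meaningful bits in the middle.
    """
    bits = [as_bits(v) for v in values]
    return [bits[0]] + [a ^ b for a, b in zip(bits, bits[1:])]
```

An identical repeated value XORs to exactly 0 (a single bit in the real
encoding), and 100.0 vs 99.0 differ only in a few mantissa bits, leaving
long runs of leading and trailing zeros.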

- the authors suggest using in-memory caching for recent data, but doing
eventual persistence to HBase or a similar distributed store; thanks to
this, you can get real-time operational metrics that are fast and space
efficient, yet still have a system that scales out horizontally for
historical storage

The Morning Paper also did a good analysis here:
[https://blog.acolyer.org/2016/05/03/gorilla-a-fast-
scalable-...](https://blog.acolyer.org/2016/05/03/gorilla-a-fast-scalable-in-
memory-time-series-database/)

~~~
marknadal
Good tips. On that note, you really don't have to use HBase or some
always-on managed filesystem - you can actually use S3! I know this sounds
weird, as you'd rack up API costs way faster than storage costs, but we
implemented a simple batching approach for time series data that could
handle 100M+ messages for $10/day (over 100GB+ of data, including all costs:
processing, storage, backup). Did a 2-minute screencast here:
[https://www.youtube.com/watch?v=x_WqBuEA7s8](https://www.youtube.com/watch?v=x_WqBuEA7s8)
. Hopefully this can be a lifesaver on costs for some people, and it keeps
your system simpler overall.

~~~
StreamBright
This is neat! S3 is always my first choice when it comes to scalable, cheap
storage. The price and scalability beat almost everything in this category.
Reliability is unprecedented as well. I am wondering what your file format
was for storing the TS data.

------
gtrubetskoy
Not to bash Gorilla, because it looks like excellent stuff, but I do not
understand the general obsession with specialized TS stores.

If you're operating on FB-scale, then sure that's what you have to do, but in
most cases your database (especially Postgres) is a far superior option.

Time series is "unusual" in the sense that most people don't know the first
thing about it, even folks with degrees in math/statistics. I think this is
why there is a prevailing misconception that it requires a specialized
database.

Incidentally, I've just written a blog post about storing TS in PostgreSQL:
[https://grisha.org/blog/2016/12/16/storing-time-series-in-
po...](https://grisha.org/blog/2016/12/16/storing-time-series-in-postgresql-
part-ii/)

~~~
bbrazil
> If you're operating on FB-scale, then sure that's what you have to do, but
> in most cases your database (especially Postgres) is a far superior option.

> Time series is "unusual" in the sense that most people don't know first
> thing about it, event folks with degrees in math/statistics. I think this is
> why there is the prevailing misconception that it requires a specialized
> database.

I'm one of the developers of Prometheus, and while I'm going to point
appropriate use cases towards Postgres (and regularly do), even relatively
small monitoring loads require careful handling such that a traditional
database isn't suitable.

An example from a previous company: we had only ~50 machines in one
datacenter and yet were up to 30,000 samples per second. That is not
considered a large setup; a large one would be in the hundreds of thousands
of samples per second.

Making such loads practical requires careful buffering/batching design;
naively turning each new data point into a disk write (with fdatasync) will
not work out - even with SSDs.
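
A rough sketch of the batching idea (class and parameter names are made up,
not Prometheus's actual implementation): pay for one sync per batch of
samples instead of one per sample.

```python
import os
import struct

class BatchedWriter:
    """Buffer (timestamp, value) samples and flush them in one syscall.

    Illustrative sketch only: one write plus one fdatasync per batch
    amortises the milliseconds-scale sync cost over many samples.
    """
    def __init__(self, path, batch_size=4096):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.batch_size = batch_size
        self.buf = []

    def append(self, timestamp, value):
        # 16 bytes per sample: int64 timestamp, float64 value
        self.buf.append(struct.pack('>qd', timestamp, value))
        if len(self.buf) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buf:
            os.write(self.fd, b''.join(self.buf))
            # fdatasync where available (Linux); fall back to fsync
            getattr(os, 'fdatasync', os.fsync)(self.fd)
            self.buf.clear()

    def close(self):
        self.flush()
        os.close(self.fd)
```

A real TSDB layers compression and a write-ahead log on top of this, but
the core batching trade-off is the same.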

And that's just writes. You also need to be able to efficiently read back
and process the data when it is queried.

I'd suggest
[https://www.youtube.com/watch?v=HbnGSNEjhUc](https://www.youtube.com/watch?v=HbnGSNEjhUc)
to give a look into how Prometheus does it (which also covers our use of
Gorilla). There was also a good post on the InfluxDB blog about the
evolution of their design and why the problem is hard, which I can't seem to
find right now.

~~~
gtrubetskoy
I think that this is an argument over apples and oranges. I'm well familiar
with the write problem you're describing, and even wrote a blog post a
couple of years ago detailing Influx's storage code.

The "apples" use case is when you want, for lack of a better description, a
Grafana chart out of it (e.g. cluster CPU load, "we're getting X page views
per second", etc.).

The "oranges" case is when every data point matters, and I think this
particular case is not so much about time series as about logging, where a
log is technically a "time series". Here, of course, you're talking about
massive amounts of data that you want to do your best to optimize, and it's
a hard problem. But it's not a _time series_ problem, it's a storage
problem. Splunk (sorry, can't think of a better example ATM) doesn't
describe itself as a "time series database", and yet it addresses that very
problem of horizontally scalable, write-intensive storage. So do HBase,
Cassandra, Accumulo, BigQuery and all their friends.

In my experience the best solution is to use two separate systems, one for
each case. There is absolutely no value in a "disk used" measurement every
millisecond - aggregated to once a minute (or every few) it is perfectly
fine, retaining the original data is not needed, and you'd be a fool to
build a Cassandra cluster to handle it. On the other hand, if I'm recording
equity bids and asks, then I'd better record every point forever.
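
The first case can be sketched as plain bucketed aggregation (a minimal
sketch; function and parameter names are made up):

```python
from collections import defaultdict

def downsample(points, bucket_seconds=60):
    """Average raw (timestamp, value) samples into fixed-width buckets.

    Keeps one value per minute instead of every raw sample, which is
    all a "disk used" style metric needs.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_seconds].append(value)
    return {start: sum(vs) / len(vs) for start, vs in sorted(buckets.items())}
```

Rolling raw data up like this before long-term storage is what makes the
"apples" case cheap; the "oranges" case cannot do this, which is why it is
a harder storage problem.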

~~~
bbrazil
That's two separate dimensions.

One is metrics vs. event logging.

The other is consistency vs. availability.

There are many types of logs, and debug logs don't need the consistency that
billing-related logs require. Similarly, you may have billing-related
metrics (e.g. bandwidth usage) where consistency is important.

> It's not a _time series_ problem, it's a problem of storage.

Technically anything with a time dimension is a time series, so Splunk is a
time series database.

> In my experience the best solution is to use two separate systems, one for
> each case.

I agree completely. There are at least two general problems here that need
different approaches once you get beyond trivial scale.

------
tschellenbach
There still isn't a leading time series database, though I think the use
case definitely deserves one. I've seen so many projects spend months
building custom data rollup solutions; it's a waste of time. A good time
series database would be pretty amazing.

~~~
bbrazil
There are several, depending on use case. Each makes different tradeoffs, so
you have to decide what's important for you.

For example Prometheus (which I work on) is great at reliable monitoring and
powerful processing of metrics at high volumes, but it'd be unwise to use it
for event logging or customer billing.

If you're doing IoT or event logging then InfluxDB might be a good choice for
you, though if you're doing more text-based logging then Elasticsearch is
nearer to what you're looking for.

[https://docs.google.com/spreadsheets/d/1sMQe9oOKhMhIVw9WmuCE...](https://docs.google.com/spreadsheets/d/1sMQe9oOKhMhIVw9WmuCEWdPtAoccJ4a-IuZv4fXDHxM/edit#gid=0)
is one comparison of the various open source options.

~~~
lobster_johnson
We're using Elasticsearch for event logging -- where "event" means analytics
event, e.g. a page view -- and it's fantastic. The aggregation support is
superb.

We initially used Influx, but it could not perform well at the time (0.8). Our
events are also heavily label-based. Basically, we do ETL at the time of
write, collecting multiple documents into one mega-event, which is a complex,
nested JSON document. It may have perhaps 150-200 fields. A single event may
be something like "clicked button X". By storing the original document, we can
aggregate based on any field value, including text and scalar fields, without
having to think about a schema or about planning ahead of time what fields
should be indexed or not. ES handles the rest pretty well.

To do the same thing with Influx or Prometheus I suspect we'd have to reverse
this and store the document as the labels, along with a single count (1) as
the "metric". I don't know how well Influx etc. scale with number of unique
label values, though I'd love to find out. The last time I read about this, I
think they recommended not going overboard with them.

What's different with business analytics is that the end product is typically
multidimensional rollup reports over large time windows (number of page views
per customer per web property per month, comparing by 2015 vs 2016, for
example), and it's almost all "group by count", sometimes "count distinct" or
averages. Whereas "rate per second"-type metrics aren't used anywhere in our
app, for example.

~~~
pixelmonkey
I also use Elasticsearch for time series site analytics use cases. I gave a
talk at Elastic{on} about it last year.

Apologies for the email gateway for the video, but you can also see my slides
here: [https://www.elastic.co/elasticon/conf/2016/sf/web-content-
an...](https://www.elastic.co/elasticon/conf/2016/sf/web-content-analytics-at-
scale-with-parse-ly)

We found that as we scaled it up, we couldn't really keep the data in raw
form, so we had to build rollup documents that cover 5-minute and 1-day
buckets. Do you use the same trick, or is the number of pageview events for
you manageable enough that you just keep it all raw?

~~~
lobster_johnson
We haven't reached that stage yet, fortunately; though at some point someone
_will_ want to run a big multi-year aggregation report across all indexes
and still expect it to take no more than a few seconds.

My ideal solution would be one that rotated the dataset into historical
rollups on a daily basis, so that we only stored the raw data for today, and
gradually merged earlier entries at lower granularities. However, I haven't
thought much about how to do that with Elasticsearch. I can see a way of doing
it by embedding the value in the field label, and using the field value as a
count, but Elasticsearch _really_ doesn't like lots of unique fields; you
shouldn't be using more than a few hundred at most in a single installation
(across all indexes).

------
hefeweizen
VLDB '15. Previously discussed at:
[https://news.ycombinator.com/item?id=10207863](https://news.ycombinator.com/item?id=10207863)

~~~
kiril-me
Yes, and here is an open-source implementation of the ideas presented in this paper:
[https://github.com/facebookincubator/beringei](https://github.com/facebookincubator/beringei)

------
manigandham
Submitted this a while back. The Morning Paper did a good write-up of BTrDB,
another very fast TSDB written in Go, storing 2.1 trillion data points and
supporting 119M queries per second (53M inserts per second) on a four-node
cluster:

[https://blog.acolyer.org/2016/05/04/btrdb-optimizing-
storage...](https://blog.acolyer.org/2016/05/04/btrdb-optimizing-storage-
system-design-for-timeseries-processing/)

------
pauldix
This is a great paper. In InfluxDB we used a similar compression technique
for float64s in our new storage engine. A key difference is that we
separated timestamps from values, so we can use run-length encoding for
regular time series.
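
The payoff of separating the columns can be sketched like this (illustrative
code, not InfluxDB's actual encoder): a regular timestamp column collapses
to almost nothing under run-length encoding, while the value column gets a
float-oriented scheme like the paper's XOR compression.

```python
def rle_timestamp_deltas(timestamps):
    """Run-length encode the delta column of a timestamp series.

    A series sampled every 10s has deltas [10, 10, 10, ...], which
    collapses to a single (delta, count) run.
    """
    runs = []
    for a, b in zip(timestamps, timestamps[1:]):
        delta = b - a
        if runs and runs[-1][0] == delta:
            runs[-1][1] += 1
        else:
            runs.append([delta, 1])
    first = timestamps[0] if timestamps else None
    return first, runs
```

Interleaving timestamps and values would break these long runs, which is
why storing the two columns separately helps.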

------
crudbug
Does the architecture allow persistent storage plugins?

------
DonHopkins
"You will get a better Gorilla effect if you use as big a piece of paper as
possible." -Kunihiko Kasahara, Creative Origami.

