Beringei: A high-performance time series storage engine (facebook.com)
149 points by gricardo99 on Feb 4, 2017 | 54 comments



OK, so it's an in-memory product, sharded no less.

They speak about compressing the data before "storing" it.

I don't have a lot of experience with in-memory anything, but are we talking about retaining the compressed format in server memory here? I.e., RAM is your datastore.

Then, at some point, to serve requests/queries for the data, don't you have to get it "out of" RAM and decompress it, also an in-memory operation?

Did I get this right?


The paper on the algorithm is here: http://www.vldb.org/pvldb/vol8/p1816-teller.pdf

Somebody implemented the algorithm in Go, based on the paper, here: https://github.com/dgryski/go-tsz

A short answer that may work for your question: the bits that are set in RAM are XOR values relative to the previous values. To answer what a given value is, a series of read/XOR operations is performed.
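
To make the XOR part concrete, here's a rough Python sketch of just that step (the real format also run-length encodes the leading/trailing zero bits of each XOR, which this skips):

    import struct

    def float_to_bits(x):
        # reinterpret the 64-bit IEEE-754 pattern as an unsigned integer
        return struct.unpack('>Q', struct.pack('>d', x))[0]

    def bits_to_float(b):
        return struct.unpack('>d', struct.pack('>Q', b))[0]

    def xor_encode(values):
        # store the first value verbatim, then only the XOR against the previous value
        prev = float_to_bits(values[0])
        out = [prev]
        for v in values[1:]:
            cur = float_to_bits(v)
            out.append(cur ^ prev)   # identical values become 0
            prev = cur
        return out

    def xor_decode(encoded):
        # the "series of read/XOR operations"
        prev = encoded[0]
        vals = [bits_to_float(prev)]
        for x in encoded[1:]:
            prev ^= x
            vals.append(bits_to_float(prev))
        return vals

    print(xor_decode(xor_encode([22.0, 22.0, 22.5])))   # [22.0, 22.0, 22.5]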


They refer to it as "delta of delta", which implies the encoding handles things like acceleration and momentum. I'm guessing the healthcare data has much more of this sort of behavior than your typical server event time series.


Not quite. Delta of deltas means that periodic timestamps turn into a long run of zeros (since the difference between each point is constant), which is very easy to compress. In the Gorilla paper, 96% of timestamps compress down to zero, which is encoded by a single zero bit.

For series without periodic sampling, delta-of-deltas performs similarly to normal deltas.
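
A quick Python sketch of the timestamp side, in case it helps (the paper then variable-length encodes each delta-of-delta, spending a single '0' bit when it is zero):

    def delta_of_delta(timestamps):
        # first timestamp raw, second as a plain delta, the rest as delta-of-deltas
        first_delta = timestamps[1] - timestamps[0]
        out = [timestamps[0], first_delta]
        prev_delta = first_delta
        for a, b in zip(timestamps[1:], timestamps[2:]):
            delta = b - a
            out.append(delta - prev_delta)   # 0 whenever sampling is perfectly periodic
            prev_delta = delta
        return out

    # a 60-second scrape interval with one slightly late sample
    print(delta_of_delta([1000, 1060, 1120, 1181, 1240]))   # [1000, 60, 0, 1, -2]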


"delta of delta" is only used for timestamps, not for values. In any standard monitoring setup, the acceleration of timestamps will be 0 for a big portion of measurements, and very small for a big portion of the rest, so it makes compression extremely easy and efficient.


> I'm guessing the healthcare data has much more of this sort of behavior than your typical server event time series.

I thought they were talking about healthcare data too when I first read it, but seeing as this is Facebook, I realized they're talking about server health, not human health.


Haha, I thought jdonaldson was making a joke :P


General note on when companies open source non-trivial projects but squash all pre-release commits: not being able to dig through implementation history and change logs makes it difficult to jump in, despite the code being publicly visible.

Context: initial commit with 277,989 additions [0].

[0] - https://github.com/facebookincubator/beringei/commit/17a6c2d...


What would HN suggest to store about 1GB of data per day, mostly for archiving and offline analysis, with less than 10 columns including timestamp? We're currently writing everything to our Postgres DB and flushing the table to S3 every few days, but it's killing the app performance under high loads. I'm looking for something that is easy to set up and keep running with low to no maintenance.


CSV

Same way most web servers log traffic.


Strongly recommend this. Your disks are not going to struggle with 1GB/day; it's easy to set up and manage, easy to move data around, and easy to process later. Worst case, it's quick to try and see if it suits.
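
For what it's worth, the entire write path can be a few lines of Python (illustrative; the file name and columns are made up):

    import csv, time

    # append one row per event; rotate the file daily so each day's ~1GB
    # becomes its own object to ship to S3
    def log_row(path, name, value):
        with open(path, 'a', newline='') as f:
            csv.writer(f).writerow([time.time(), name, value])

    log_row('events-2017-02-04.csv', 'api_latency_ms', 12.7)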


Considering the use case ("archiving and offline analysis") this is likely the best choice. If you need some buzzwords or want to get fancy, try HDFS


Or perhaps SQLite, since it should perform similarly to writing to a CSV file, and you get querying out of the box afterwards.


SQLite is a major powerhouse for stuff like this, and if your analysis is on a single computer with a single disk, analysis is almost always faster than any sort of highly parallel big data setup. I've loaded, analyzed, and reported on 2TB of data using SQLite in less time than it took an 8-machine Spark cluster to load the data.
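
As a sketch of what that workflow can look like (file, table, and column names are made up):

    import csv, sqlite3

    conn = sqlite3.connect('archive.db')   # a single file on disk
    conn.execute('CREATE TABLE IF NOT EXISTS events (ts REAL, name TEXT, value REAL)')

    # bulk-load a day's CSV dump
    with open('events-2017-02-04.csv', newline='') as f:
        conn.executemany('INSERT INTO events VALUES (?, ?, ?)', csv.reader(f))
    conn.commit()

    # ad-hoc analysis with plain SQL, no cluster involved
    for row in conn.execute('SELECT name, avg(value) FROM events GROUP BY name'):
        print(row)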


Can you elaborate on what it is that slows down your app performance? Is it bottle-necking on writing to the database? Does the database have read activity? Does the application slow down when there is read activity?

With some basic assumptions on my part including that you can have a delay in writing data to the database (since it's archival and analysis), but you don't want the application to be delayed, putting a fast queue like Kafka or RabbitMQ in place could help. It'll buffer when the database is under load, isolating your application from that.
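
On the application side that can be as small as this (a sketch using pika for RabbitMQ; the queue name is made up):

    import json
    import pika  # RabbitMQ client

    conn = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
    channel = conn.channel()
    channel.queue_declare(queue='archive_rows', durable=True)

    def record(row):
        # the request path only pays for one publish; a separate consumer
        # drains the queue into the database at whatever pace it can handle
        channel.basic_publish(exchange='', routing_key='archive_rows',
                              body=json.dumps(row))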

Offline analysis might also skip the database entirely: columnar data formats (Parquet) or log-structured merge / sorted string tables (RocksDB) could work very well for archival or offline analysis in MR-type environments.


Yes, we only have one DB for everything and it is bottle-necking on writes. Queuing the writes is probably the easiest solution since we already have RabbitMQ set up... thanks for your answer.


There are a lot of good suggestions here, but if this is the route you're going down and you are still using the same database, do everything you can to batch your inserts. In your queue consumer, aggregate as much as you can tolerate in memory, and then insert them all and commit the inserts in a single transaction. This eliminates a huge amount of MVCC overhead.
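
A rough sketch of what that consumer can look like with psycopg2 (table/column names and the batch size are made up):

    import psycopg2
    from psycopg2.extras import execute_values

    conn = psycopg2.connect('dbname=archive')
    batch = []

    def flush():
        # one round trip, one transaction, one commit for the whole batch
        with conn, conn.cursor() as cur:
            execute_values(cur, 'INSERT INTO events (ts, name, value) VALUES %s', batch)
        batch.clear()

    def on_message(row):           # called for each message pulled off the queue
        batch.append(row)
        if len(batch) >= 5000:     # or flush on a timer, whichever comes first
            flush()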


Don't write it to your main transactional database, then.

Have a separate connection to a DB on another machine that is handling the load. Do that within the transaction/request from the client, or pop it into a message queue.

The queue will "persist" the data, but will also schedule it for when your other box/DB can ingest it. At the end of the day, if you're pushing too much data, you need to have a mechanism in place that lets you scale. To me, a simple solution would be a message queue server with plenty of redundant storage space. From there on, you keep adding more consumers.


If the message queue becomes a bottleneck, buffering/batching can also help - instead of sending individual messages to the queue, send them in batches (based on the number, time period, or both).


If you don't have strong guarantees on that table (usually with high volume come fewer requirements, i.e. you can lose a few rows here and there and it's not the end of the world), then you may want to try an unlogged table (new versions of Postgres support it). Also make sure you're using a BRIN index (if any) on that timestamp column.
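
Something along these lines, if you go that route (sketch; table and column names are made up):

    import psycopg2

    conn = psycopg2.connect('dbname=archive')
    with conn, conn.cursor() as cur:
        # unlogged: skips the WAL, so a crash can empty the table,
        # but writes get much cheaper
        cur.execute('CREATE UNLOGGED TABLE events '
                    '(ts timestamptz, name text, value double precision)')
        # BRIN: a tiny index that suits an append-only, timestamp-ordered table
        cur.execute('CREATE INDEX events_ts_brin ON events USING brin (ts)')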


I think the answer depends on what tool you're using for offline analysis. Reducing the number of intermediate copies and formats would be desirable, right? Start with that, then find some asynchronous way (to avoid blocking your app) to dump the data straight into that format.


We used InfluxDB along with a few other tools. Check out: https://www.influxdata.com/time-series-platform/


Try Axibase Time Series Database. We support SQL which should work well for analytics use cases.

SQL docs: https://github.com/axibase/atsd-docs/tree/master/api/sql

Analytics examples: https://github.com/axibase/atsd-use-cases


Given the "archiving and offline analysis" requirements, which are fairly specific characteristics:

The "hacked" way is to go CSV files, store them on s3. It is very limited but it is also very simple. (Not sure what software can run queries on CSV files directly).

The "doing things right" way is to go for AWS RedShift or Google BigQuery. There is a learning curve at the start but it's really REALLY good and it will pay off by many folds later.


SQLite can directly process CSV.


If you want low-to-no maintenance and to keep some of what you have, I'd suggest pushing it into Kinesis instead of Postgres and then dumping to S3 as you're already doing.


Kinesis looks pretty good, but it's yet another (proprietary) product to integrate and maintain... I'd rather avoid that if I can for now; we're a very small team and the fewer tools we have to use, the better.


Second this. This use case is exactly what Kinesis was built for in the first place.

Additionally, DynamoDB can be decent for this as long as you only need to sort on time.


> What would HN suggest to store about 1GB of data per day

1GB per day is not a lot - that's about 12KB/sec, which is trivial. Your Postgres must be badly misconfigured!


If you are already invested in AWS, I suggest using Firehose + S3 to store the data and Athena to query it.


Kinesis firehose to S3 and then query with Athena is pretty great. I've been very happy with the combo.
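
Roughly, in case a sketch helps (boto3; the stream, database, table, and bucket names are all made up):

    import boto3, json

    firehose = boto3.client('firehose')

    def record(row):
        # Firehose buffers records and lands them in S3 in batches on its own schedule
        firehose.put_record(
            DeliveryStreamName='archive-stream',
            Record={'Data': (json.dumps(row) + '\n').encode()})

    # later, query the data in place on S3 with Athena
    athena = boto3.client('athena')
    athena.start_query_execution(
        QueryString='SELECT name, avg(value) FROM events GROUP BY name',
        QueryExecutionContext={'Database': 'archive'},
        ResultConfiguration={'OutputLocation': 's3://my-athena-results/'})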


One idea might be to write to S3 directly and use Athena for your offline analysis


Thanks for Athena, I hadn't heard about it!


You could try InfluxDB.


TeaFiles


I am very interested in a comparison by the Prometheus developers. Brian Brazil recently wrote about optimizations to Prometheus' time series storage.



How would this stack up against InfluxDB?


We're getting good results from http://traildb.io/ and we don't have to grant Facebook a worldwide, royalty-free license to our patent pool in order to use it.


https://github.com/facebookincubator/beringei/blob/master/LI... doesn't even mention patents.

https://github.com/facebookincubator/beringei#license says "We also provide an additional patent grant."

None of this sounds to me like you have to grant Facebook any patents to use Beringei.



Great to hear that!

We would love to hear how you are using TrailDB and what could be improved for your use case. Feel free to open issues in GitHub or drop by at our Gitter channel, https://gitter.im/traildb/traildb


I don't think TrailDB is comparable to Beringei. TrailDB is built for behavioral analytics queries while Beringei is developed for time-series aggregated metrics.


> time-series aggregated metrics

That's exactly what we're using TrailDB for. Works great.


Does TrailDB pre-aggregate metrics? AFAIK it stores the raw data bucketed by an actor id (usually a server or visitor) and performs compression and some columnar storage optimizations on each actor's events.


I'm curious what kind of write speeds you're using TrailDB at?


We trace every server request with it, across all stages of the request—approx. 10-12 events/timestamps per request—across multiple processes (think Zipkin). The per-request event streams are independent "trails" per process, and then we merge them together later and compute aggregated metrics using hdr_histogram.

Individual events are typically hundreds to thousands of nanoseconds long, with about 20 nanoseconds of overhead to grab a timestamp using the rdtscp instruction.

Hope that helps!


If you don't mind my asking: how do you uniquely identify the requests? Is it a composite field made up of `client_ip:timestamp`, or a random UUID? What does a typical payload sent to the TSDB look like? Also, I'm assuming the services are written in a language like C or C++?


• random UUID

• we track function entry/exit/throw, offset (in nanoseconds) from the initial timestamp, and in the case of a throw, we also capture a stack trace (which is not stored in TrailDB)

• our server is written in C++


I'm not sure I follow everything you wrote. I'm interested in how many inserts you're doing per second to the DB. Is it in the 1000s/sec or millions/sec?


Millions, not thousands, of inserts per second. TrailDB is hella fast (that's why we use it).


So you have a batching mechanism for creating TrailDB files, right? Then you're able to query historical metrics by processing these TrailDB files.


Correct. We mostly use hdr_histogram right now for the post-processing, since stable latency is what we care about.

We also do some hdr_histogram processing for our dashboard in parallel to storing traces in TrailDB, and then we retain the individual traces to do longer term processing or when we're tracking down issues in production.


Thanks. I will check it out.



