

Catena: A time series storage engine written in Go - misframer
http://misfra.me/state-of-the-state-part-iii

======
misframer
Author here. There's a lot I left out of the post, especially since it's such
a young project, but I'd be happy to answer any questions.

Edit: I'm not sure why I'm being downvoted... I'm not home at the moment, so
I'm trying my best to answer using my phone.

Edit #2: Back home with a full-size QWERTY keyboard :).

~~~
tinco
Do you think it's good enough to replace Graphite's Carbon at this point? Not
as a drop-in for Graphite, but as the backend to a custom metrics system?

I know your project is young and probably has not seen much battle testing,
but your blog post indicates to me that you've put a lot of thought into it
being robust.

We are using Carbon for our metrics solution at the moment, and I've read its
source; it's not something I'd give a big 'ready for production' stamp, even
though I know many shops are using it in production.

Perhaps, if you feel like it and will entertain my cheap grab for information,
could you give a super small explanation of the performance differences between
your partition-style format and, for example, Whisper (like RRD; it's what
Carbon uses) and InfluxDB? As far as I understand, Whisper is simply a cyclic
buffer of fixed-time-distance points in one file per series, and InfluxDB is
simply a key-value store, I think.

Your solution lies somewhere in between those two, right?
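
By the way, the fixed-interval cyclic buffer I described is easy to sketch (my
own toy version, not Whisper's actual on-disk format):

```go
package main

import "fmt"

// whisperArchive is a rough sketch of a Whisper-style archive:
// a fixed-size ring of points at a fixed time step, one file per series.
type whisperArchive struct {
	step   int64     // seconds between points
	points []float64 // fixed-size circular buffer
}

// insert writes a value into the slot derived from the timestamp,
// silently overwriting whatever was written a full cycle ago.
func (a *whisperArchive) insert(ts int64, value float64) {
	slot := (ts / a.step) % int64(len(a.points))
	a.points[slot] = value
}

func main() {
	a := &whisperArchive{step: 60, points: make([]float64, 3)}
	a.insert(0, 1.0)   // slot 0
	a.insert(60, 2.0)  // slot 1
	a.insert(180, 4.0) // wraps around, overwriting slot 0
	fmt.Println(a.points) // [4 2 0]
}
```

Inserts just overwrite the slot from one full cycle ago, which is why the file
size stays fixed no matter how long the series runs.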

~~~
misframer
InfluxDB uses Bolt in the latest version. Yes, it's a key-value store. See my
other comment below about why I'm not using Bolt.

I'm not too familiar with Whisper, but since it uses fixed-size storage, it's
not a fit for my use case. I have a large number of time series (>10,000
metrics), and many of them end up being sparse, with only a few irregular
points. Partitioning by time instead of by metric means I don't have to deal
with thousands of files, especially ones with some large, fixed size.
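
To make the contrast concrete, here's a toy sketch of time-based partitioning
(simplified types, not Catena's actual implementation):

```go
package main

import "fmt"

// point is one observation for some source/metric pair.
type point struct {
	source, metric string
	timestamp      int64
	value          float64
}

const partitionSize = 86400 // timestamps per partition, e.g. one day of seconds

// partitionKey buckets a point by time, so sparse series don't each
// need their own (possibly huge, fixed-size) file.
func partitionKey(p point) int64 {
	return p.timestamp / partitionSize
}

func main() {
	partitions := map[int64][]point{}
	pts := []point{
		{"hostA", "cpu", 10, 0.5},
		{"hostB", "disk", 100000, 0.9}, // lands in the next day's partition
	}
	for _, p := range pts {
		k := partitionKey(p)
		partitions[k] = append(partitions[k], p)
	}
	fmt.Println(len(partitions)) // 2
}
```

With per-metric files, 10,000 sparse series would mean 10,000 fixed-size
files; here they all share a handful of time buckets.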

I'm very focused on making this "production ready." It's currently used for
another project of mine[0], so I haven't been looking at integrating it into
other systems.

[0]
[http://preetamjinka.github.io/cistern/](http://preetamjinka.github.io/cistern/)

------
carbocation
This is a pleasant, self-honest discussion of purpose-driven software as it is
being implemented. The language is irrelevant; it's fun to read this type of
piece.

~~~
misframer
Thank you! You're right; the language is irrelevant. I wrote this in Go
because the parent project is written in Go. I'm looking forward to writing
more of this blog series.

------
jrv
Hi, Prometheus[0] author here. Thanks for the interesting article!

Since I was curious how this compares to Prometheus's internal storage for
writes, I whipped up some (disclaimer: very naive and ad-hoc!) benchmarks[1]
to get a rough feeling for Catena's performance. I'm not getting much write
performance out of it yet, but maybe I'm doing something wrong or using it
inefficiently. Some questions to investigate: what's the best number of rows
to batch in one insert, and are timestamps in seconds, milliseconds, or
essentially user-interpreted? (I noticed the partitioning at least depends
heavily on the interval between timestamps.) So far I've only done a tiny bit
of fiddling, and the results haven't changed dramatically.

The benchmark parameters:

* writing 10000 samples x 10000 metrics (100 million data points)

* initial state: empty storage

* source names: constant "testsource" for all time series

* metric names: "testmetric_<i>" (0 <= i < 10000)

* values: the metric index <i> (constant integer value within each series)

* timestamps: starting at 0 and increasing by 15 seconds for every iteration

* GOMAXPROCS=4 (4-core "Core i5-4690K" machine, 3.5GHz)

* Disk: SSD

* Other machine load: SoundCloud playing music in the background
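
The write loop has roughly this shape (heavily simplified; the actual Catena
and Prometheus calls are in the gist linked below — the insert callback here
is just a stand-in):

```go
package main

import "fmt"

// runBenchmark generates the benchmark's rows: for each 15-second tick,
// one sample per metric, all under a single constant source name.
func runBenchmark(numMetrics, samplesPerMetric int,
	insert func(source, metric string, ts int64, v float64)) {
	var ts int64
	for i := 0; i < samplesPerMetric; i++ {
		for m := 0; m < numMetrics; m++ {
			// value is the metric index, constant within each series
			insert("testsource", fmt.Sprintf("testmetric_%d", m), ts, float64(m))
		}
		ts += 15 // timestamps advance 15 "seconds" per iteration
	}
}

func main() {
	count := 0
	runBenchmark(10, 10, func(source, metric string, ts int64, v float64) {
		count++
	})
	fmt.Println(count) // 100
}
```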

The benchmark results:

#### Prometheus ####

(GOMAXPROCS=4 go run prometheus_bench.go -num-metrics=10000 -samples-per-metric=10000)

Time: 1m26s, Space: 138MB

#### Catena ####

(GOMAXPROCS=4 go run catena_bench.go -num-metrics=10000 -samples-per-metric=10000)

Time: 1h25m, Space: 190MB

So in this particular benchmark Catena took 60x longer and used 1.4x more
space.

Please don't take this as discouragement or a statement on one being better
than the other. Obviously Catena is very new and also probably optimized for
slightly different use cases. And possibly I'm just doing something wrong
(please tell me!). I also haven't dug into possible performance bottlenecks
yet, but I saw it utilize 100% of all 4 CPU cores the entire time. In any
case, I'd be interested in a set of benchmarks optimized specifically for
Catena's use case.

Unfortunately we also haven't fully documented the internals of Prometheus's
storage yet, but a bit of background information can be found here:
[http://prometheus.io/docs/operating/storage/](http://prometheus.io/docs/operating/storage/)
Maybe that's worth a blog post sometime.

[0] [http://prometheus.io/](http://prometheus.io/)

[1] The code for the benchmarks is here:
[https://gist.github.com/juliusv/ce7c3b5368cd7adf8bc6](https://gist.github.com/juliusv/ce7c3b5368cd7adf8bc6)

~~~
misframer
Thanks for trying it out! I haven't had time to run any benchmarks, so I
really appreciate you taking the time to do this (especially since it took so
long!).

I'm not sure what the best batch size is at the moment. Timestamps are int64s,
and it's up to the user to interpret them however they wish. Partition sizes
are specified in timestamp units: if your timestamps correspond to seconds and
you want each partition to cover one day, you'd choose 86400. This isn't
configurable yet unless you modify the source.
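
In other words, partitioning is just integer division on the raw timestamp, so
the partition size has to match whatever unit you're writing (a sketch, not
Catena's actual code):

```go
package main

import "fmt"

// partitionFor maps a raw int64 timestamp to a partition index.
// Timestamps are opaque int64s; the "unit" is whatever the user writes,
// so the partition size must be expressed in that same unit.
func partitionFor(ts, partitionSize int64) int64 {
	return ts / partitionSize
}

func main() {
	// One-day partitions if you write seconds...
	fmt.Println(partitionFor(90000, 86400)) // 1
	// ...but the same instant written in milliseconds needs 86400000.
	fmt.Println(partitionFor(90000000, 86400000)) // 1
}
```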

I'm not surprised it's that slow. I'm not a storage engine expert (I'm still a
college student), and this contains my first lock-free list implementation :).
There is a lot of silliness going on with inserts: launching a goroutine for
each metric on inserts[0], using lock-free O(n) lists for sources and
metrics[1] when I could have used a map, a leftover lock that should be
removed[2], and a bunch of other things.
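
For what it's worth, the list-versus-map difference in miniature (toy types,
not the actual Catena structures, and ignoring the lock-free aspect entirely):

```go
package main

import "fmt"

// listNode is a singly linked list entry; finding a name in it is an
// O(n) scan, which every insert pays when resolving its source/metric.
type listNode struct {
	name string
	next *listNode
}

func listFind(head *listNode, name string) bool {
	for n := head; n != nil; n = n.next { // O(n) walk
		if n.name == name {
			return true
		}
	}
	return false
}

func main() {
	head := &listNode{name: "cpu", next: &listNode{name: "disk"}}
	fmt.Println(listFind(head, "disk")) // true, after a scan

	// A map gives amortized O(1) lookups for the same membership check.
	metrics := map[string]bool{"cpu": true, "disk": true}
	fmt.Println(metrics["disk"]) // true, no scan
}
```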

On an unrelated note, I see that someone from Prometheus will be at
Monitorama. I'll be there too, so I'd love to talk to you guys some more.

Thanks again!

[0]
[https://github.com/PreetamJinka/catena/blob/8e068b1b95ce1a10...](https://github.com/PreetamJinka/catena/blob/8e068b1b95ce1a107cfed75e2f9e33693ce4bff5/mem_partition.go#L163-L169)

[1]
[https://github.com/PreetamJinka/catena/blob/8e068b1b95ce1a10...](https://github.com/PreetamJinka/catena/blob/8e068b1b95ce1a107cfed75e2f9e33693ce4bff5/mem_partition.go#L191)

[2]
[https://github.com/PreetamJinka/catena/blob/8e068b1b95ce1a10...](https://github.com/PreetamJinka/catena/blob/8e068b1b95ce1a107cfed75e2f9e33693ce4bff5/mem_partition.go#L273-L274)

~~~
jrv
Ah ok, so some obvious things left to optimize then :) Thanks for the
explanations!

As far as I know, none of the Prometheus core developers will be at Monitorama
this year. But we'll be at SRECon Dublin, DevOpsCon Berlin, GopherCon Denver,
and hopefully also at one (or both) of the VelocityConfs later this year.

Indeed, it would be great to catch up in person if you're at any of these.

~~~
misframer
I see. I should be going to GopherCon too, so I hope to see you there.

------
bascule
"Catena" is also a password hashing function:
[http://eprint.iacr.org/2013/525.pdf](http://eprint.iacr.org/2013/525.pdf)

------
kylered
Good stuff!

