
Storage engine design - chaotic-good
http://akumuli.org/akumuli/2017/04/29/nbplustree/
======
lima
It looks like Yandex recently open-sourced the Graphite engine they built on
top of ClickHouse:

[https://github.com/yandex/graphouse](https://github.com/yandex/graphouse)

Looks really interesting for Graphite-like use cases.

------
eis
I don't think you want to measure your timeseries datastructure against LSM-
trees, because the latter are an inherently poor fit for timeseries (which are
mostly append-only), as a few projects painfully found out.

Anyway, I'm interested in timeseries, so I read the article and tried to
understand the datastructure, but to be honest it opens up more questions than
it answers. I do applaud people who try to describe the datastructures that
are core to their app, though. Thanks for that.

    
    
      1. What is the exact datastructure of a leaf? You mention a leaf node can hold 1000 datapoints.
      2. Why is using a WAL impossible? That should be possible for any datastructure.
      3. In your example if level 3 is full and everything gets merged into level 4, there are no levels 1-3. How does one query the datastructure? Does it maintain a pointer from the root to level 4?
      4. Related to the above: if I want to query all data from last month until now, which happens to be the start of level 4, it will first go to the root, then to the level 2 Leaf, from there to the level 2 SBlock, from there to the level 3 Leaf, then the level 3 SBlock, then the level 4 Leaf, then the level 4 SBlock, then the first level 4 leaf? That seems like a lot of random access. How many IOPS does a lookup need?
      5. SBlocks need to be kept in memory. If I have ten million timeseries (not uncommon, can be much more), each with 3 levels, then upon startup the app has to load 30M SBlocks into memory?
      6. You say that several trees can be interleaved in a single file, how does that not break the linear IO advantage of columnar layouts?
      7. How do you store information in the "inner nodes" (SBlocks?) to speed up aggregation? Do you store every possible aggregation in there? E.g. sum, avg, min, max, stddev, ...
      8. The storage format of an individual time series is only a part of what a TSDB needs to do; another part is how to find all the timeseries for a particular query. How does that work?
    

And in general I think you can't have all three:

    
    
      A) good write performance
      B) no write amplification
      C) low RAM usage
    

... because you have to either write data out immediately (and get either 2x+
write amplification or lots of random writes/reads) or buffer it in RAM to
batch linear writes.
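
To make this concrete, here is a minimal sketch of the "buffer in RAM" side of
the trade-off (a hypothetical BufferedWriter, not Akumuli's actual code): a
small batch keeps RAM usage low (C) but produces many small writes, while a
large batch buys big linear writes (A, B) at the cost of RAM.

    
    
    // Hypothetical sketch, not Akumuli's code: buffer datapoints in RAM
    // and flush them as one large sequential write.
    #include <cstdint>
    #include <cstdio>
    #include <vector>
    
    struct DataPoint { uint64_t ts; double value; };
    
    class BufferedWriter {
        std::vector<DataPoint> buf_;  // RAM cost grows with the batch size
        std::FILE* file_;
        size_t batch_;
    public:
        BufferedWriter(std::FILE* f, size_t batch) : file_(f), batch_(batch) {}
        void append(DataPoint p) {
            buf_.push_back(p);
            if (buf_.size() >= batch_) flush();
        }
        void flush() {
            // One large sequential write instead of many small random ones.
            std::fwrite(buf_.data(), sizeof(DataPoint), buf_.size(), file_);
            buf_.clear();
        }
    };
    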

I think there are some interesting ideas in this structure; it looks to me
more like a linked list of one-level-deep B+trees than one big overall tree.
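
For what it's worth, here is the shape I have in mind (my own sketch of that
reading, not the author's actual layout; the fan-out of 32 is a placeholder):

    
    
    // My reading of the structure, not the author's actual layout:
    // a chain of shallow trees rather than one deep tree.
    struct Leaf {
        // ~1000 compressed datapoints from a single series
    };
    struct SBlock {
        Leaf*   children[32];  // one level of fan-out over leaves
        SBlock* prev;          // link to the previously completed subtree
    };
    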

~~~
chaotic-good
Akumuli is designed for SSD and NVMe drives, so I chose to have a lot of
random reads and writes. My laptop's NVMe drive has a random write throughput
of around 400 MB/s (AFAIR), and my heaviest performance test wrote data at a
rate of about 70 MB/s (16M data points per second).

~~~
ChuckMcM
This is perhaps the most interesting aspect of it to me. When we relax the
constraint that 'mass storage access must be as linear and infrequent as
possible', what sort of possibilities does that open up in the design space
that were previously untenable?

Nice work and thank you for sharing it.

~~~
chaotic-good
It's not that easy, actually. The simplest method that can utilize the full
throughput of the drive is to use large writes (1MB or larger). This is the
fastest possible way to write data to the SSD, period. This method also
creates the simplest possible FTL mapping table.

Random reads and writes are significantly slower if you write everything from
one thread. To speed everything up you should write in parallel (for example
using Linux AIO + O_DIRECT, or libuv + O_DIRECT). OS-level buffering and many
OS threads will deliver good random write throughput as well.
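
As a rough illustration of the parallel approach, a minimal Linux AIO +
O_DIRECT sketch (the file name, queue depth, and block size are my own
placeholders; error handling is omitted):

    
    
    // Minimal sketch of parallel writes with Linux AIO + O_DIRECT.
    // Build with: g++ aio_sketch.cpp -laio
    #include <libaio.h>
    #include <fcntl.h>
    #include <cstdlib>
    #include <cstring>
    
    int main() {
        const size_t kBlock = 4096;  // O_DIRECT requires aligned I/O
        const int    kDepth = 8;     // number of writes kept in flight
        int fd = open("data.bin", O_WRONLY | O_CREAT | O_DIRECT, 0644);
        io_context_t ctx = 0;
        io_setup(kDepth, &ctx);
    
        iocb  cbs[kDepth];
        iocb* ptrs[kDepth];
        for (int i = 0; i < kDepth; i++) {
            void* buf;
            posix_memalign(&buf, kBlock, kBlock);  // aligned buffer
            memset(buf, 'x', kBlock);
            io_prep_pwrite(&cbs[i], fd, buf, kBlock, i * kBlock);
            ptrs[i] = &cbs[i];
        }
        io_submit(ctx, kDepth, ptrs);  // queue all writes at once
    
        io_event events[kDepth];
        io_getevents(ctx, kDepth, kDepth, events, nullptr);  // wait for all
        io_destroy(ctx);
        return 0;
    }
    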

There are other effects to consider, e.g. read-write interference.

~~~
ChuckMcM
I understand. I would expect that you will get an additional boost if you
target Intel's 'Optane' technology, which, by its design, allows for a much
faster channel turnaround and thus less interference. And in the fairly recent
past, other vendors like Texas Memory Systems developed strategies that were
all RAM plus a bit of cleverness to snapshot to HD when the power fails. The
point being that with enough money you could brute-force the solution, but now
the money required is decreasing, and so new strategies are opening up.

~~~
chaotic-good
If I understand this right, with Intel's Optane you will eventually need to
write everything to HDD, because data collection happens at a steady pace and
the cache size is limited.

~~~
ChuckMcM
Depends on the size of your data set. Intel's plan, according to their
website, is to replace SSDs (especially NVMe ones) with Optane-based solid
state memory. The roadmap has them shipping exabytes of the stuff eventually.

So as I see it, you'd be constrained by 32 GB Optane modules today, but they
will eventually (one, maybe two years) be 2 TB modules like the Samsung 960
Pro modules are today. And an M.2 port is really just a PCIe slot, so you're
looking at systems with maybe 32 TB of Optane storage on the high end within
the next 5 years.

------
menegattig
Good content for creating time-series database engines that was just posted
in another HN thread:

[https://medium.com/slicingdice-com-blog/want-to-build-a-new-time-series-database-start-here-d4c3fd1517b1](https://medium.com/slicingdice-com-blog/want-to-build-a-new-time-series-database-start-here-d4c3fd1517b1)

[https://news.ycombinator.com/item?id=14246189](https://news.ycombinator.com/item?id=14246189)

------
evdev
What I've found matters in this area is the mismatch in locality between
elements in read batches and elements in write batches. It'd be nice if the
emerging DBs that deal with these issues put at least a gloss on the
information model and write -> read lifecycle they're targeting.

Otherwise, a lot of these "actually you need X for time series!" claims just
talk past each other, because "time series" means any number of actual access
patterns.

~~~
chaotic-good
What do you mean by gloss?

~~~
evdev
FYI I found this a great article otherwise.

------
tianyicui
It seems the kind of query that hits N time series at a specific timestamp or
across a small span will be inherently slow because of the O(N) block reads?
Is there some way to support this kind of query efficiently?

------
stuff4ben
Fascinating stuff! How does this compare to other TSDBs like Influx (which
uses an LSM-tree) and Druid?

~~~
chaotic-good
Akumuli is quite different from InfluxDB. It focuses on single node
performance and operational simplicity. Essentially I'm trying to make it a
"fire and forget" kind of app. No idea about Druid.

~~~
ibotty
and prometheus?

~~~
chaotic-good
Prometheus is a monitoring system (pull-based), while Akumuli is a TSDB (push-
based). I believe one could use Akumuli as long-term storage for Prometheus.

~~~
bbrazil
Prometheus is fundamentally a TSDB, see
[https://fabxc.org/blog/2017-04-10-writing-a-tsdb/](https://fabxc.org/blog/2017-04-10-writing-a-tsdb/)
for the design of the next version.

~~~
chaotic-good
This is one more example of a design in which one file holds many series and
everything is chunked by time:

- "there is no longer a single file per series but instead a handful of files
holds chunks for many of them"

- "We partition our horizontal dimension, i.e. the time space, into non-
overlapping blocks. Each block acts as a fully independent database containing
all time series data for its time window."

I don't believe this will work out well, because it will introduce read
amplification at query time (compared to the file-per-series approach that
they're using now). And I'm really curious how they managed to get 20M writes
per second on a laptop. The article states that they're using the compression
algorithm from the Gorilla paper, and the Gorilla paper's authors claim that
they managed to get 1.5M on a single machine.

~~~
devdevdev1
It seems very much like the B+tree approach is just a mental model put on top
of the exact same idea that is being argued against. The initial list of "bad
things about LSM approaches" has almost exactly the same items on it as the
list of features the B+tree approach claims to achieve.

Maybe I'm getting this all wrong, but aren't the leaves also representing
chunked data, which is compressed?

The Prometheus solution also sequentially places compressed chunks for the
same series. The time slicing actually has a lot of benefits and can simply be
seen as the first level of the described B+ tree. An index of chunks for a
series can then be seen as the second level.

The potential read amplification here seems completely equivalent. Just from
my high-level view, all properties of the read and write path seem almost
identical.

~~~
chaotic-good
>> Maybe I'm getting this all wrong, but aren't the leaves also representing
chunked data, which is compressed?

Leaf nodes contain data from one series (data that should be read together),
while an SSTable with time-series data contains many series, and there is no
guarantee that all of those series will be used by the query.

>> The Prometheus solution also sequentially places compressed chunks for the
same series.

I'm not really that familiar with Prometheus internals, especially the
indexing part. As I understand it, it doesn't align writes, so there is a lot
of write amplification at the lower level, which translates to cell
degradation and non-optimal performance, but I could be wrong here.

------
batisteo
Yet another project with an Esperanto name

