
Why Not to Build a Time-Series Database (2018) - aberoham
http://davidgildeh.com/2018/11/06/why-not-to-build-a-time-series-database/
======
dang
[https://news.ycombinator.com/item?id=18402890](https://news.ycombinator.com/item?id=18402890)

------
kwillets
Most OLAP has that exponential falloff in data use with time. I once extracted
all the date strings from recent queries on a data warehouse and found the
same distribution.

Zynga once built a kind of time-series database with very similar metric
namespace issues: about 24M metrics/minute reducing to about 1M unique names
with heavy skew. They did almost everything wrong in implementing it; I was
considering blogging about it once but let it go.

It turned out that the basic aggregation (the metrics were in a hierarchy, so
they needed to roll up to each level with counts and uniques) could be done in
a few seconds with a string sort. But nothing could solve the problem of middle
management.
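
A sketch of that sort-then-scan rollup in Python (the metric names and the
exact count/unique semantics are my assumptions, not Zynga's):

```python
def rollup(names):
    """Roll dotted metric names up the hierarchy, producing per-level
    data-point counts and distinct-name counts. The string sort makes
    duplicate names adjacent, so uniques fall out of one linear scan."""
    totals, uniques = {}, {}
    prev = None
    for name in sorted(names):
        parts = name.split(".")
        is_new = parts != prev  # first occurrence of this full name
        for depth in range(1, len(parts) + 1):
            prefix = ".".join(parts[:depth])
            totals[prefix] = totals.get(prefix, 0) + 1
            if is_new:
                uniques[prefix] = uniques.get(prefix, 0) + 1
        prev = parts
    return totals, uniques

# Hypothetical sample: four data points, three distinct names.
metrics = ["game.poker.us.logins", "game.poker.us.logins",
           "game.poker.eu.logins", "game.slots.us.spins"]
totals, uniques = rollup(metrics)
# totals["game"] == 4, uniques["game"] == 3, uniques["game.poker"] == 2
```

At 24M data points per minute reducing to about 1M names, the sort dominates,
which is why it still finishes in seconds.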

------
rightbyte
I have this great idea for dealing with their scaling problems. Send each
client a binary blob and let the client execute it. The only thing your
service needs to do is act as a license server.

I call it "edge computing".

~~~
_pmf_
This is my favorite take on this idea:
[https://www.colinsteele.org/post/27929539434/60000-growth-in...](https://www.colinsteele.org/post/27929539434/60000-growth-in-7-months-using-clojure-and-aws)

Quoting: "Because the data set is small, we can “bake in” the entire content
database into a version of our software. Yep, you read that right. We build
our software with an embedded instance of Solr and we take the normalized,
cleansed, non-relational database of hotel inventory, and jam that in as well,
when we package up the application for deployment.

Egads, Colin! That’s wrong! Data is data and code is code!

We earn several benefits from this unorthodox choice. First, we eliminate a
significant point of failure - a mismatch between code and data. Any version
of software is absolutely, positively known to work, even fetched off of disk
years later, regardless of what godawful changes have been made to our content
database in the meantime. Deployment and configuration management for
differing environments becomes trivial.

Second, we achieve horizontal shared-nothing scalability in our user-facing
layer. That’s kinda huge. Really huge."
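
The "bake the data into the artifact" idea can be sketched in a few lines
(the file name and record shape are hypothetical): a build step freezes the
cleansed dataset next to the code, and every running instance reads its own
copy, which is what makes the user-facing layer shared-nothing.

```python
import json
import pathlib

# Code and data ship as one versioned unit; this path lives inside the
# deployable artifact, not on a shared database server.
DATA_FILE = pathlib.Path("hotels.json")

def build(records):
    """Build time: freeze the dataset into the artifact."""
    DATA_FILE.write_text(json.dumps(records))

def load():
    """Run time: each instance loads its own read-only copy, so there is
    nothing for user-facing nodes to coordinate or synchronize."""
    return json.loads(DATA_FILE.read_text())
```

Any build of the software is then self-sufficient: the version fetched off
disk years later still sees exactly the data it was built against.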

------
mluggy
Is there a reason why you can't have a deployment/set of pods per client? The
article keeps mentioning that every solution failed when the whole dataset hit
a certain limit.

~~~
rightbyte
Obviously you can parallelize this problem perfectly per customer, which would
remove the congestion, unless you are data mining across customers.

A TSDB is, in the end, a DB with a timestamp in each row and some convenience
functions out of the box.
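
A minimal illustration of that claim, using SQLite (the schema and the sample
points are made up): an ordinary table with a timestamp column, plus one
convenience function of the kind a real TSDB ships built-in.

```python
import sqlite3

# A "TSDB" as just rows with timestamps.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE points (metric TEXT, ts INTEGER, value REAL)")
db.executemany("INSERT INTO points VALUES (?, ?, ?)", [
    ("cpu", 0, 10.0), ("cpu", 30, 20.0), ("cpu", 70, 40.0),
])

def avg_by_bucket(metric, bucket_seconds):
    """The 'convenience function': downsample to the average value per
    fixed-width time bucket (integer division floors ts to the bucket)."""
    return db.execute(
        "SELECT ts / ? * ? AS bucket, AVG(value) FROM points "
        "WHERE metric = ? GROUP BY bucket ORDER BY bucket",
        (bucket_seconds, bucket_seconds, metric),
    ).fetchall()

# avg_by_bucket("cpu", 60) -> [(0, 15.0), (60, 40.0)]
```

Real TSDBs earn their keep on top of this with compression, retention, and
high-cardinality name handling, but the data model really is this simple.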

