
It’s About Time for Time Series Databases - okket
https://www.nextplatform.com/2018/01/25/time-time-series-databases/
======
kodablah
Sorry to step away from the technical details for a moment. But is anyone else
concerned about using a DB solely from a company built specifically around
that DB? After RethinkDB (sustainability issues), FoundationDB (bought and
shuttered/hidden), and Riak (admittedly I haven't kept up, but I saw [0]), I am
wary of using any DB that is not built by a large community or is not built as
a non-core project from a large tech company. Sorry TimescaleDB, I see you
have raised a decent amount of funding, but I have to choose my DBs with
trepidation these days.

0 -
[https://www.theregister.co.uk/2017/07/13/will_the_last_perso...](https://www.theregister.co.uk/2017/07/13/will_the_last_person_at_basho_get_the_lights_oh_too_late/)

~~~
lbruder
We use a combination of SQLite and a sharding frontend service. One SQLite
database file per device, one table per sensor, table contents are timestamp
and measured value. As simple as it gets, easy to scale, and damn fast.

But try telling people you're using SQLite to store critical data...

~~~
creeble
Simpler than a log file? What's the significant advantage of SQLite vs just
appending to a file?

~~~
eloff
Transactions? Crash data safety? SQL queries? Portability?

~~~
creeble
I don't see how any of those apply for a two-column 'db' of timestamp and
data.

~~~
rpedela
It is database per device and table per sensor, not just one table. With SQL
you can do joins, aggregations, etc. If you just want to log it, sure use a
log file. As soon as you want to use the data somehow, you need SQL or your
own log-file parsing and query code. Personally, I'd rather just use SQL.
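
For example, a 5-minute average is one query against a schema like the one
described above (a sketch; the table name is hypothetical), versus custom
parse-and-group code over a flat log file:

    import sqlite3

    conn = sqlite3.connect("device_a1.db")
    # Bucket timestamps into 300-second windows and average each bucket.
    rows = conn.execute("""
        SELECT CAST(ts / 300 AS INTEGER) * 300 AS bucket, AVG(value)
        FROM temperature
        GROUP BY bucket
        ORDER BY bucket
    """).fetchall()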

~~~
creeble
If you’re storing from multiple sensors with multiple write processes, you
will have concurrency problems if they’re in the same db with SQLite.

I’m not referring to manipulating or even querying the data after it’s stored.
I just don’t see how _saving_ the original data in SQLite buys you anything,
if it needs to be collected into a db for queries, etc, anyway.

------
lima
ClickHouse (the analytics DBMS by Yandex), while not explicitly designed as
such, is a fantastic time series database.

There's even a special backend, the GraphiteMergeTree, which does staggered
downsampling, something most TSDBs aren't able to do.

It's the most promising development in this space I've seen in a long time.

[https://clickhouse.yandex/](https://clickhouse.yandex/)

[https://clickhouse.yandex/docs/en/table_engines/graphitemerg...](https://clickhouse.yandex/docs/en/table_engines/graphitemergetree.html)

~~~
manigandham
I'd also recommend Druid, MemSQL, SnappyData, MapD and other column-oriented
databases. Any of them can partition on a time column with full SQL and
extremely fast aggregations and high compression that come from columnar
storage.
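
As a sketch of the idea (ClickHouse DDL shown; the table and column names are
hypothetical, and this assumes the clickhouse-driver package), you declare the
time column in the partition and sort keys and let partition pruning do the
rest:

    # A sketch, not a tuned schema.
    from clickhouse_driver import Client

    client = Client("localhost")
    client.execute("""
        CREATE TABLE IF NOT EXISTS metrics (
            ts    DateTime,
            name  String,
            value Float64
        ) ENGINE = MergeTree()
        PARTITION BY toYYYYMM(ts)
        ORDER BY (name, ts)
    """)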

~~~
cevian
Hi, you've posted this notion that partitioning a column store by time would
yield the same result as TimescaleDB a few times, so we thought we'd jump in
and clear things up. We fully agree that column stores have their place,
particularly if you have a massive number of metrics and all you care about is
roll-ups along single-column axes. Still, there are some major differences
between TimescaleDB and column stores. Namely, TimescaleDB supports a lot of
features that column stores in general do not:

\- Secondary Indexes.

\- Transactional semantics.

\- Can operate on data sets greater than available memory (doesn't have to be
all in-memory unlike memSQL and some others). Time-series data is voluminous.

\- A whole bunch of specialized time-based optimizations that optimize query
plans when working with time-based indexes and data.

\- Constraints - Including foreign keys.

\- Triggers

\- Joins with relational data.

\- Full SQL - allowing you to use complex queries and window functions

\- Compatible with data tools that use SQL - which gets you the richest
ecosystem of tools in the data world

\- The full gamut of Postgres datatypes including JSON/B and GIS location data

\- 20+ years of reliability, tested backups, live streaming replication, etc.

\- Geospatial support through best-in-class PostGIS

And of course, we're only getting started :)
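
To make a few of these concrete, here's a small sketch (names hypothetical):
a hypertable with a foreign key into ordinary relational metadata, queried
with a join and a window function. create_hypertable() is our documented
entry point; the rest is stock Postgres.

    import psycopg2

    conn = psycopg2.connect("dbname=tsdb")  # hypothetical database
    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE devices (id INT PRIMARY KEY, location TEXT);
            CREATE TABLE conditions (
                time        TIMESTAMPTZ NOT NULL,
                device_id   INT REFERENCES devices(id),
                temperature DOUBLE PRECISION
            );
            SELECT create_hypertable('conditions', 'time');
        """)
        # Rolling average per device, joined against relational metadata.
        cur.execute("""
            SELECT time, d.location,
                   AVG(temperature) OVER (PARTITION BY c.device_id
                                          ORDER BY time
                                          ROWS 11 PRECEDING) AS rolling_avg
            FROM conditions c JOIN devices d ON d.id = c.device_id
        """)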

~~~
manigandham
It depends on the queries but columnstores would yield a faster result. We're
not new to this and have used ClickHouse, MemSQL, SQL Server, and Druid
extensively.

Columnstores just store data by column; they do not have any inherent
limitations because of it. They all support SQL and compatible tools (although
Druid's SQL is experimental, via Apache Calcite). They all store columnstore
tables on disk (MemSQL uses rowstores in memory, SQL Server can optionally run
columnstores in-memory using its Hekaton engine, and they all use in-memory
buffers for rapid ingest). They can all do geospatial queries, support JSON
columns, and some can handle nested/repeated structures. Indexes are available
but unnecessary when you can prune partitions based on what's contained in
each segment, especially when using a primary sort key (like a timestamp
column in your case). SnappyData has a unique statistical engine to trade off
query precision for much faster results (like HLL+ algorithms applied to the
entire dataset). MemSQL will do OLTP access with full transactions across both
rowstore and columnstore data.

Congrats on the VC funding. I'm always happy to see new projects, and building
on Postgres does give you a solid base with triggers and foreign keys (which
come with their own scaling issues), and extending time-based functions will
be useful. However, my issue is the marketing spin where you claim to be
better than everything else. Columnstores are very fast, efficient, and
performant, and time as a dimension is not a new challenge. That's before
considering the BigQuery/Snowflake superscale options or specialized databases
like kdb+ which have served the financial industry for decades.

Approaching the field with a single-node automatic partitioning extension (as
of today) for a rowstore RDBMS and saying you're better than the rest on
features that they already have just strikes me as insincere. It would be
better to recognize the competition and focus on what you're good at instead.

~~~
mfreed
“my issue is the marketing spin where you claim to be better than everything
else”

I’m sorry that was your impression and it’s certainly not our intent to
mislead, although I’m not really sure why/where you think we claimed this.
Indeed, the quoted article even says that "Timescale is not trying to take on
Kx Systems directly in this core market.", and that such organizations have
different needs for different use cases.

Technology choices are all about trade-offs, and databases are no different.

In fact, our docs have a page describing exactly when _not_ to use TimescaleDB
compared to other options:
[http://docs.timescale.com/v0.8/introduction/timescaledb-vs-n...](http://docs.timescale.com/v0.8/introduction/timescaledb-vs-nosql)

Cheers :)

~~~
jnordwick
I think what he might be saying is that you are clearly trying to be a direct
competitor to the very products you say you don't compete with. Claiming not
to compete against the proprietary TSDB offerings, just so you can stack the
comparison deck in your favor by comparing yourself to the less-than-acceptable
FOSS offerings, is a little disingenuous.

Answering the following would definitely clear things up in my mind.

Why would I use Timescale over KDB or IQ or Vertica? Is it just a price thing,
that you are mostly cheaper (both licensing and finding talent)? If cost were
a minor issue, why choose Timescale? What advantage does it have over those
other TSDBs? That bullet list that has been repeated a couple of times seems
not to be unique to Timescale when compared to the other big columnar
databases.

Maybe it's that you have a good story on how you can do scalar operations
better than the others? Do you have a particular workload mix you are trying
to target?

Do you plan on doing any of the TPC benchmarks?

------
YCode
> Back in 2014...developers...could use relational databases with SQL
> interfaces, which are easy to use but they don’t scale well.

I'm skeptical.

~~~
g09980
Guessing they meant to write "... for time series aggregation"

------
stdbrouw
> nobody wants to have large grain snapshots of data for any dataset that is
> actually comprised of a continuous stream of data points

Except, of course, for those who realize that the precision of a statistic
only increases as sqrt(n), and that a biased dataset will remain biased
regardless of how much data you have. I'll take a large-grain dataset that I
can load on my computer and analyze in five minutes over a finer-grained
dataset where I need to set up a cluster before I can even get started. Enough
with the "let's store everything" fetishism already.
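
A quick illustration of the sqrt(n) point:

    # The standard error of a mean shrinks as 1/sqrt(n): 100x more data
    # buys only 10x more precision, and none of it removes bias.
    import math, random

    random.seed(0)
    sigma = 2.0
    for n in (100, 10_000, 1_000_000):
        xs = [random.gauss(10.0, sigma) for _ in range(n)]
        mean = sum(xs) / n
        print(f"n={n:>9,}  mean={mean:.4f}  stderr={sigma / math.sqrt(n):.5f}")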

(Somewhat tangential to the blog post, I realize.)

~~~
teej
The reason we need to store everything is less about needing perfect accuracy
of measurement (though I think we do want it) and more about the curse of
dimensionality[0]. We want to slice, pivot, and filter datasets more
aggressively than ever before, which helps drive aggressive data collection.

[0] -
[https://en.wikipedia.org/wiki/Curse_of_dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality)

~~~
ryanworl
You can use sampling to both store every dimension of a data point and to not
store an unwieldy amount of data.
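
One standard way to do that is reservoir sampling (Vitter's Algorithm R),
which keeps a fixed-size uniform sample of a stream while preserving every
dimension of the sampled points. A sketch:

    import random

    def reservoir_sample(stream, k):
        """Uniform sample of k items from a stream of unknown length."""
        sample = []
        for i, point in enumerate(stream):
            if i < k:
                sample.append(point)        # fill the reservoir first
            else:
                j = random.randint(0, i)    # replace with decaying odds
                if j < k:
                    sample[j] = point
        return sample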

~~~
teej
This simply doesn’t work when you have sparsely populated dimensions and/or
you don’t know what dimensions are important in advance. Both of these are
very common. That’s why you don’t see a higher prevalence of estimated
measurement.

~~~
stdbrouw
> you don’t know what dimensions are important in advance

But again this is a huge red flag. I've seen so many data science projects
that started with "well, let's just get started with collecting everything and
we will figure out what is important later on" and then spent so much time on
infrastructure that no useful insights were ever produced.

~~~
cjalmeida
There's a balance to that. You don't need anything fancier than a Hadoop
cluster to store everything. Nowadays you can get that packaged and working
out of the box from a number of vendors.

Of course, getting that data out into analytical datasets is a whole different
matter.

------
shangxiao
This is a bit easier to grok: [http://www.timescale.com/how-it-works](http://www.timescale.com/how-it-works)

(if you're new to this like me)

~~~
bitoneill
[https://en.wikipedia.org/wiki/Time_series_database](https://en.wikipedia.org/wiki/Time_series_database)

------
acomjean
I worked at a place that monitored power usage minute by minute across 1000s
of locations. We just used MySQL with a time column. Maybe I'm not the target
audience, but I'm failing to see what this gets me.

One problem: they say the data is immutable and stored sequentially. Our data
was immutable too, but with devices out on the net, it arrives in random order
when they inevitably have connection problems.

We always aggregated our data into larger time blocks. Storage was cheap and
doing comparative analysis across locations and time zones was always our pain
point.

I think using a Postgres database is wise.

~~~
kev009
Then your insert rate is effectively between 16 and 60 per second; sure, you
don't really need a sophisticated partitioning scheme or a log-structured DB.
Native static partitioning would give you a decent speedup without much
thought.

It's intro computer science: if you have a tree structure and keep filling it
up, you spend a lot of time in the worst corners of theta notation. Timescale
uses tightly integrated partitioning on the time axis to deal with write
performance and aging data out. Other popular TSDBs are plays on
log-structured merge trees, etc.
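
For reference, "native static partitioning" in stock Postgres 10+ looks like
this (declarative range partitioning on the time column; names hypothetical):

    import psycopg2

    conn = psycopg2.connect("dbname=power")  # hypothetical
    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE readings (
                ts    TIMESTAMPTZ NOT NULL,
                meter INT,
                watts DOUBLE PRECISION
            ) PARTITION BY RANGE (ts);

            -- one partition per month; inserts are routed automatically
            CREATE TABLE readings_2018_01 PARTITION OF readings
                FOR VALUES FROM ('2018-01-01') TO ('2018-02-01');
        """)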

~~~
postwait
Reducing "Other TSDBs" to log-structured-merge trees is misleading. Any large-
scale TSDB has something sophisticated underneath and LSM is often just one
tiny part of that. I would argue (as most do) that any TSDB "simply used an
LSM" it would be doomed at any scale over time.

~~~
kev009
There was no reduction; it was intended as a pointer to one data structure
some TSDBs use underneath. I would bold and highlight the "etc" present
there for you if the markup allowed it.

I hope a reader would become interested in what an LSM tree is (and, perhaps
as importantly, isn't).

------
Radim
Perhaps not as lofty a goal as time series, but a good, commercial database
for _vectors_ (a.k.a. the representational building block of modern machine
learning) is long overdue too.

A few open source options exist (Spotify's Annoy, Facebook's FAISS, NMSLIB)
but these are rather low-level, more hobby projects than enterprise-level
engines (index management, transactions, sharding…).

After building a few document similarity engines for our clients we took up
the gauntlet and created ScaleText,
[https://scaletext.ai](https://scaletext.ai). It's still early days but the
demand for a sane, scalable, cross-vertical and well-supported NLP technology
is encouraging.

------
amyjess
At my employer, we've recently (as of the middle of last year) been making a
considerable effort to use InfluxDB to track our KPIs. It's working out
wonderfully for us, and I'm expecting it'll get used more and more as the year
goes on.

What really floors me about Influx is how _fast_ it is. A query that used to
take hours in Oracle takes seconds in Influx. And the Influx query is
readable: rolling data up to various intervals produced nightmare queries in
Oracle but is short and sweet in Influx. Want to take data gathered every 5
minutes and produce a daily average? Yeah, good luck with that in Oracle. It's
doable, sure, but I've seen the code. And the speed. Both are ugly.
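
That 5-minutes-to-daily rollup is a one-liner in InfluxQL (measurement and
field names made up; this assumes the influxdb Python client):

    from influxdb import InfluxDBClient

    client = InfluxDBClient(host="localhost", database="kpis")  # hypothetical
    result = client.query(
        'SELECT MEAN("value") FROM "kpi_metric" '
        'WHERE time > now() - 30d GROUP BY time(1d)')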

The only problem I have with InfluxDB is it's still a little immature, and the
tooling isn't where it should be. This isn't entirely Influx's fault; most
third parties aren't aware it exists or don't care. Our reporting team uses
Crystal Reports, which can't talk to InfluxDB. So I end up having to write a
Python script that runs in cron every night to perform InfluxDB queries for
the previous day's data, doing all the rollups and average/min/max
calculations, and then inserting the results into Oracle, just so our
reporting team can get to the data. For some KPIs I'm working on right now, we
decided not to go
through the reporting team, and I'm writing a webapp in Python/Bottle to
display the report, and we're probably going to augment that with graphs from
Grafana.

Grafana is beautiful, by the way.

~~~
alistairbayley
How big is your database, and how long does it take to restart? We have what I
would describe as a fairly small influx database, and it takes a long time to
restart (20-30 mins). And the time to restart seems to be growing linearly
with the db size. Not cool if you do regular server patching with reboots.

We also had a problem where user queries that return too many columns cause
the DB server process to OOM. And then it restarts, so another 20 mins of
downtime. Also not cool.

We liked the tagging and rollup features, and automatic retention management,
but those first 2 problems really turned us off.

~~~
postwait
Try IRONdb... Relatively fast restarts, linear cluster scalability and no
downtime when a single node is malfunctioning.

------
olympus
The article recognizes that several time series databases already exist. They
also say, "we aren't trying to compete against kdb+." They explain how they
can handle time series data better than NoSQL databases that aren't time
series focused.

But what are they doing better than the existing time series databases? Surely
they must have some advantage or they wouldn't have raised 16.1 million
dollars.

~~~
busterarm
Every time I read about some new solution to storing time series data, I
always feel like I must be doing something wrong, but I've run into _zero_
problems so far.

Every time I have to store time series data, I never really need ACID
transactions. I definitely don't ever need to do updates or upserts. It's
always write-once-read-many. ElasticSearch has always been the obvious choice
to me and it has worked extremely well. Information retrieval is incredibly
robust, and for the times I'm worried about consistency, I use Postgres's JSON
capabilities and write there first. You can have your application sanity-
check between the two if you're worried about ElasticSearch not receiving or
losing data.

I find it really hard to beat.

~~~
pfranz
I had set up ElasticSearch for logging events and was thinking about using it
to store metrics as well. It seemed like it ticked all the boxes and would
work pretty well when I was messing around with it.

I ended up leaving that job and set up something specifically for metrics from
scratch. I didn't compare 1 to 1, but this was much faster to query and had
much lower requirements (disk space, memory, etc).

Both use cases treat it as a time-series database, and there's no reason
ElasticSearch couldn't work for both of them for many people. But using two
different backends (one for events and another for metrics) meant drawing a
line when creating/logging data, and two ways to query... which sucks for
training. You also have to maintain/archive two different data stores.

------
arbesfeld
We use Cassandra extensively at
[https://logrocket.com](https://logrocket.com). How does performance compare
vs Cassandra when you don't need the semantics or transactions of SQL? I'm
surprised the article only provides benchmarks against PostgreSQL.

~~~
RobAtticus
That is currently next in our pipeline for a benchmark blog post. Early
results look good on both the read and write side in terms of raw performance,
with the benefit that more complex queries are more easily expressible.

~~~
akulkarni
To add to this: We've found that the benefit of Cassandra's approach comes
when scaling out to 100s of nodes (which TimescaleDB currently does not
support). But we found Cassandra's per-node performance not that impressive.
We'll have more benchmarks vs Cassandra soon (@robatticus is literally the
person working on it), but suffice it to say that users have already replaced
multi-node Cassandra clusters (e.g., 10 nodes) with a single TimescaleDB
instance (optionally adding read-only replicas for hot standbys).

~~~
aksyn
I recommend you benchmark against ScyllaDB, not Cassandra.

------
tw1010
I have a hard time shaking the feeling that database innovation is being
forced not by necessity, but by a culture of blind relentless change.

~~~
codingdave
OK, but is that bad?

It might be bad for the specific organization pushing their new solutions, but
for the software industry as a whole, it seems like the good side of "throw
spaghetti against the wall and see what sticks". Many ideas will fail, but the
good ones will stick around. So I'd say by all means, let people feel free to
innovate. And those of us who consume the innovations just need to remember
the mantra of "leading edge, not bleeding edge."

~~~
tw1010
I didn't say it was bad (though the tone of my post definitely hinted at it).
I agree with you. Following the trail of necessity can easily lead to locally
optimal points, away from the global maximum.

------
marknadal
We had major success by simply batching incoming writes into ~15 second
chunks, writing each batch as a file to S3, and keeping an index that tracks
how the files are split/chunked so read performance stays decent.

This alone gave us an insanely scalable system (load tested at 100GB/day,
~100M records/day) for a grand total cost of $10/day for everything: server,
disk, and S3. [https://youtu.be/x_WqBuEA7s8](https://youtu.be/x_WqBuEA7s8)

Works great for time-series data: super scalable, simple, no devops or
database servers to manage.
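
A rough sketch of the write path (the bucket name and record format are
hypothetical; assumes boto3):

    import json, time
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-timeseries-bucket"   # hypothetical

    buffer, window_start = [], time.time()

    def ingest(record, index):
        """Buffer ~15s of records, then flush one chunk to S3 and index it."""
        global buffer, window_start
        buffer.append(record)
        if time.time() - window_start >= 15:
            now = time.time()
            key = f"chunks/{int(window_start)}-{int(now)}.json"
            s3.put_object(Bucket=BUCKET, Key=key,
                          Body=json.dumps(buffer).encode())
            # The index maps time ranges to chunk keys for reads.
            index.append({"key": key, "from": window_start, "to": now})
            buffer, window_start = [], now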

I believe the Discord guys also did something similar and had written some
good engineering articles on it, give it a Google as well.

~~~
postwait
Storing and retrieving data has never been all that hard. The challenge is
having user-interactive performance on complex queries against the data.
Comparing and correlating and deriving and integrating and ... (lots of other
analysis). For many "scaled" systems, 100M records/minute isn't uncommon...
and while that's very likely possible with your design the question of
economic feasibility enters. Solving these problems at scale with good
economics is the playground of TSDB vendors today.

------
jnordwick
I think I say this about every time, because I see it as the major flaw in
almost every approach to this problem. I say this from the perspective of
having used a wide range of TS databases, but mostly KDB, Oracle, and prop
solutions with Cassandra and Mongo products thrown in.

There are two big problems with relational SQL databases: the storage and the
query language. The two aren't separate concerns. While projects like this
might fix the storage issue, they will never be simple or fast without
changing the query language too.

SQL doesn't fit the column/array paradigm well. There are extensions that
attempt to close the gap, but they are slow and complex compared to just
fixing the query language. Operations like windowed queries and aggregates can
be painfully slow because of the mismatch between SQL and the ordered data.

I would absolutely love to work on a product that attempts to fix the query
issue at the same time as the storage and interface issues. You don't need an
obscure APL-like language, just something that brings the properties inherent
in time series to the front.
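
To illustrate the mismatch: on ordered, columnar data a moving average is a
couple of vector operations (the sort of thing q/kdb+ expresses natively),
with no window-frame machinery in the way. A numpy sketch:

    import numpy as np

    values = np.random.rand(1_000_000)

    def moving_avg(x, w):
        # prefix sums make this one O(n) pass over contiguous memory
        c = np.cumsum(np.insert(x, 0, 0.0))
        return (c[w:] - c[:-w]) / w

    ma = moving_avg(values, 12)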

Also, it is a bit weird to say you are not trying to compete with the likes of
KDB and then go on to say how good you are at its core competencies. As
always, eagerly awaiting the TPC-D benchmarks and comparisons to the
heavy-hitting time-series DBs out there, not the easily crushed FOSS products
(except maybe Aerospike, which I wish I had more experience with).

~~~
cobbzilla
PostgreSQL with jsonb columns can be quite powerful, the best of both worlds;
but yes, the syntax for querying within JSON is a bit arcane.
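
For the curious, the arcane-but-workable syntax looks like this (psycopg2;
table and key names hypothetical): ->> extracts a field as text, and @> tests
containment (and can use a GIN index):

    import json
    import psycopg2

    conn = psycopg2.connect("dbname=metrics")
    with conn, conn.cursor() as cur:
        cur.execute(
            "SELECT ts, payload->>'temp' FROM events "
            "WHERE payload @> %s::jsonb",
            (json.dumps({"sensor": "a1"}),))
        rows = cur.fetchall()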

------
dustingetz
How about one made by Rich Hickey —
[https://www.datomic.com/](https://www.datomic.com/)

~~~
dragandj
Sadly, Datomic lacks two (arguably) very important features in this space:

1\. It seems not to be optimized for speed (but it's difficult to say, since
the license forbids publishing benchmarks).

2\. It's not open source.

~~~
dustingetz
I empathetically agree with #1 but object to #2. Datomic, on an architectural
basis alone, is going to be strictly faster than say an RDBMS, because it can
parallelize almost all of the workload. Read-side is entirely elastic; index
and storage maintenance does not block the writer, and the single operation
that must be serial is the actual ACID write primitive, the conditional-put to
DynamoDB. In RDBMS, the ACID writer is competing for resources with all these
other operations, because they can't be parallelized without sacrificing
consistency.

~~~
dragandj
We are talking about time-series databases here. I guess they are more
optimized for the specific workload than a generic RDBMS.

------
cbcoutinho
We use a proprietary database system that uses a 'flat-file' format (no idea
what that means) and is primarily time series based, since we're logging
sensor data. Since it's primarily a backend, you can't access it outside of
their proprietary GUI. It's also accessible as a linked server via SQL Server,
but this is slow as hell for non-trivial queries. We use it within a power
plant setting where we heavily prioritize db-writes, which this software is
apparently very good at, while db-reads are less of a focus.

I'm not sure if moving to another db system would be beneficial, but I would
be very grateful if accessibility could be much less of a hack.

~~~
BillinghamJ
We have a similar setup for our telematics data (insurance company). It’s
stored in files which contain DEFLATE’d BSON. Similarly, we have a custom GUI
for viewing it, but do also have some terminal tools which can be used.

The solution to larger processing simply seems to be loading the files and
batching them through whatever processing we want and storing the output in a
proper database - which can then be queried instantly.

For a lot of time series data, this does seem like a pretty decent approach.

------
dx034
What I often miss with these kinds of databases is compression capability. I
currently use my own delta-of-delta encoding for time series data (stored in
Postgres), as that gave me much better compression than any of the time series
databases I tested.
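
A bare-bones version of the idea (the bit-packing/varint step a real codec
applies on top is omitted): with regularly spaced timestamps, the second
differences are mostly zeros, which compress extremely well.

    def delta_of_delta_encode(timestamps):
        out, prev, prev_delta = [], None, None
        for t in timestamps:
            if prev is None:
                out.append(t)                   # first value stored raw
            elif prev_delta is None:
                prev_delta = t - prev
                out.append(prev_delta)          # first delta
            else:
                delta = t - prev
                out.append(delta - prev_delta)  # ~0 for regular series
                prev_delta = delta
            prev = t
        return out

    print(delta_of_delta_encode([1000, 1060, 1120, 1180, 1241]))
    # -> [1000, 60, 0, 0, 1]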

It's not that storage isn't available; it's easy to store terabytes of data.
But really fast storage is still expensive and rare. Being able to compress
time series data to 3-5% of its raw size allows keeping most of it in memory,
speeding up reading, writing, and analysis.

InfluxDB is quite good at that, but unfortunately it struggles with storing
many small datasets.

------
augustl
Isn't it a bit weird to launch a "never delete anything" store today and not
mention anything about European GDPR requirements?

How would you go about deleting a user's data upon request?

~~~
manigandham
You delete it. It's still a database. What's being done here is just smart
partition management as an extension to Postgres, similar to CitusDB.

You can also use any column-oriented relational database with a time-based
partition key and do the same thing.
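
In other words, it's a plain DELETE (a sketch; table and column names are
hypothetical); the partition layout mostly determines how cheap the operation
is, e.g. whole time ranges can be dropped as partitions:

    import psycopg2

    conn = psycopg2.connect("dbname=tsdb")
    with conn, conn.cursor() as cur:
        # Erase one user's rows across all time partitions.
        cur.execute("DELETE FROM measurements WHERE user_id = %s", (42,))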

------
chrismccabe
What are the benefits of TimescaleDB or InfluxDB? Surely the enterprise
pricing is going to be similar to KDB, which has been around for years and is
a very mature product.

------
RegBarclay
How is TimescaleDB different from OSIsoft PI?

~~~
apohn
Not just OSISoft PI, but there are many other historians for time series data.

[https://en.wikipedia.org/wiki/Operational_historian](https://en.wikipedia.org/wiki/Operational_historian)

At least in my experience, historians are rarely recommended for complex or
ad-hoc queries. Typically you just pull the data (by tags) into another
application and do your data processing there. It looks like TimescaleDB lets
you execute complex queries in the database. Historians typically only let you
fetch data by tag, and you need a metadata (e.g. asset management) framework
on top to organize the data (e.g. give me avg temp every 5 minutes by sensor).
It looks like with TimescaleDB you can have strings/text as fields within the
time-series table, which removes the need (to some degree) to join the data
with a separate metadata database.

I've also never heard of anybody using these commercial historians for time
series data you'd see from non-industrial processes (e.g. stock data, price
tracking, GPS location of people or moving assets, time between clicks on a
website, etc.).

All that being said, OSISoft PI and AF have their warts, but OSISoft has been
around for a while and PI has been battle tested in various industries (e.g.
Oil & Gas, Manufacturing). It's closed source and you have to pay for it, so
it's probably not attractive to startups and smaller companies. But it does
come with a support organization if you need it and can pay for it. And IME
data retrieval from PI is extremely performant!

~~~
cevian
Yeah, to underscore some differences more completely, Timescale has:

\- Full indexing and secondary index support.

\- Support for transactional semantics

\- Support for living alongside relational data - including foreign keys to
the relational data.

\- Full trigger and constraint support

\- Support for all of Postgres's native data types including JSON(B) and GIS
location data.

~~~
amelius
Transactional support? I was under the impression that timeseries data is
considered immutable.

~~~
grzm
Being transactional is orthogonal to being immutable. For example, if you want
to ensure that a given set of writes is applied atomically, you would issue
them in a transaction.
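
The atomic-batch case in miniature (sqlite3 for brevity; the schema is
hypothetical): the data itself is append-only, but the batch either all lands
or none of it does.

    import sqlite3

    conn = sqlite3.connect("ts.db")
    conn.execute("CREATE TABLE IF NOT EXISTS points (ts REAL, v REAL)")
    batch = [(1.0, 10.0), (2.0, 11.5), (3.0, 12.1)]
    with conn:  # one transaction: commit on success, rollback on exception
        conn.executemany("INSERT INTO points VALUES (?, ?)", batch)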

~~~
mfreed
This is especially important if you are also dealing with metadata that you
don't want to always denormalize into your time-series tables. If you don't
insert it transactionally, you can also lose referential integrity between
your time-series and relational (metadata) tables.

------
anonu
I almost thought this was going to be an ad for Kx, since part of the title is
their slogan...

Glad to see open source tools in this space gaining traction.

