
TimescaleDB vs. InfluxDB: built differently for time-series data - mfreed
https://blog.timescale.com/timescaledb-vs-influxdb-for-time-series-data-timescale-influx-sql-nosql-36489299877
======
neslinesli93
Wonderful analysis, I was waiting for something like this to come out!

Recently, I've gone through this very same choice and ended up with vanilla
PostgreSQL (Timescale was not mature enough).

[Shameless self plug] You can read some of the details here:
[https://medium.com/@neslinesli93/how-to-efficiently-store-an...](https://medium.com/@neslinesli93/how-to-efficiently-store-and-query-time-series-data-90313ff0ec20)

~~~
stalller
One point of clarification for readers of @neslinesli93's post is that
Timescale does not create "heavy" indexes.

We do create some default indexes that PostgreSQL does not, but these defaults
can be turned off. We also allow you to create indexes after bulk loading
data, if you want to compare apples-to-apples.

But to be clear, the indexes Timescale creates are the same as, or often
cheaper than, PostgreSQL's (remember, TimescaleDB is implemented as a
PostgreSQL extension). We're always happy to help people work through proper
set up and any implementation details in our Slack community
(slack.timescale.com).

~~~
neslinesli93
Hi, thanks for the tips!

As I mentioned in the article, I tested last year's version of TimescaleDB
(July/August 2017), and that was my experience with it out of the box.

I am really impressed by all the progress you've made, and hopefully I'll
consider TimescaleDB as my first choice on the next iteration of the product
I'm working on.

Now, I'm skimming through the docs [1] and, as I understand it,
create_hypertable is called before the data is migrated, so all TimescaleDB
indexes are already present during the migration. What is the way to create
indexes after the data migration?

[1] [https://docs.timescale.com/v0.11/getting-started/migrating-d...](https://docs.timescale.com/v0.11/getting-started/migrating-data)

~~~
mfreed
Hi @neslinesli93, it's quite easy:

(1) Call create_hypertable with default indexes off (include an argument of
`create_default_indexes => FALSE`) [1]

(2) Then just use a standard CREATE INDEX command on the hypertable at any
time: B-tree, hash, GIN/GiST, single key, composite keys, etc. This DDL
command will propagate to any existing chunks (creating the indexes on them)
and will be remembered, so any future chunks that are automatically created
will also have these indexes [2]

[1]
[https://docs.timescale.com/latest/api#create_hypertable](https://docs.timescale.com/latest/api#create_hypertable)

[2] [https://docs.timescale.com/latest/using-timescaledb/schema-m...](https://docs.timescale.com/latest/using-timescaledb/schema-management#indexing)

------
kureikain
InfluxDB isn't as easy to operate as it sounds.

Anything built on top of Postgres benefits from years of accumulated knowledge
about tuning the db; there's not much of that for InfluxDB. You are on your own.

You cannot even easily upgrade InfluxDB, especially when you want to use some
new feature such as enabling the TSI.

When something goes wrong, again, you're on your own.

HA for InfluxDB isn't available either.

Yes, with all of those pain points, I'm still using InfluxDB. I even had to add
in [https://github.com/buger/goreplay](https://github.com/buger/goreplay) to
support replaying traffic to another instance during upgrades.

I had to write a tool to re-read old data and import it into the new instance
instead of using their own import/restore.

There are many gotchas with InfluxDB, and it's hard to explain to devs that
they shouldn't use high-cardinality tags, or too many tags. For example, people
get used to the `tag` concept from `fluentd` and put things like user IDs and
device IDs into tags...

I want to log slow query times, but I cannot use the whole query as a tag
because of its very high cardinality.

However, I keep using InfluxDB. I want to support it so we have something
better than Postgres. I'm (personally) sick of SQL queries, and I would like
Flux to be successful.

It's similar to MongoDB: it was bad years ago but is very good nowadays. I
guess the same thing will happen with InfluxDB. And indeed, they do improve
over the years.

It's similar to how we use Ruby vs. C: it's about productivity. And despite
the above pain points, I always find a way to solve them eventually. And the
tooling around InfluxDB is nice, especially Grafana.

------
NickBusey
As someone who is struggling with issues with InfluxDB in production
environments, this just moved my `Investigate replacing Influx with Timescale`
issue higher up my priority list. Many of the problems with InfluxDB pointed
out in the article are indeed real-world pain points for us.

~~~
AngeloR
What sort of issues are you running in to?

~~~
petetnt
After running InfluxDB in production for a bit more than a year, here are some
of the issues we've run into, off the top of my head:

# Common

\- All: Various documentation issues, including versioning in the documents

\- Influx/Kapacitor: Cannot join values in Influx, but you can in Kapacitor
(though not dynamically). Available in InfluxDB 1.6 now!

# InfluxDB

\- Influx: GroupTag not grouping
[https://github.com/influxdata/kapacitor/pull/1773](https://github.com/influxdata/kapacitor/pull/1773)

\- Influx: last() is really slow
[https://github.com/influxdata/influxdb/issues/8997](https://github.com/influxdata/influxdb/issues/8997)

\- Influx: Cannot update / edit tags
[https://github.com/influxdata/influxdb/issues/3904](https://github.com/influxdata/influxdb/issues/3904)

\- Apparently you can have tags and fields with the same names, which bumps up
query times by 1000x without your ever knowing what's wrong (fixable by adding
::tag to the value)

\- Cannot incrementally restore incrementally backed-up databases (we made a
script to do that): [https://github.com/motleyagency/influxdb-incremental-restore](https://github.com/motleyagency/influxdb-incremental-restore)

# Kapacitor

\- Kapacitor does not support subqueries

\- Kapacitor does not (properly) support queries from multiple measurements

\- Cannot have fields and tags with the same name

\- Cast syntax doesn't work either
([https://github.com/influxdata/influxdb/pull/6529](https://github.com/influxdata/influxdb/pull/6529))

# Telegraf

\- Telegraf: Telegraf HTTPJson plugin does not support custom timestamps

# Chronograf

\- The TickScript editor sometimes hangs for good and requires Chronograf
restart

\- Minor: No up-to-date syntax highlighting for TickScript in any common
editors

That said, we most likely would have run into similar issues with other time-
series databases, and I applaud their effort to keep InfluxDB open source.

------
NewsAware
These kinds of articles by one of the compared parties only become interesting
once the other party responds. So I'm waiting for Paul Dix to show up in this
thread, as usual.

------
rw
The TimescaleDB benchmark code is a fork of code I wrote, as an independent
consultant, for InfluxData in 2016 and 2017. The purpose of my project was to
rigorously compare InfluxDB and InfluxDB Enterprise to Cassandra,
Elasticsearch, MongoDB, and OpenTSDB. It's called influxdb-comparisons and is
an actively-maintained project on Github at [0]. I am no longer affiliated
with InfluxData, and these are my own opinions.

I designed and built the influxdb-comparisons benchmark suite to be easy to
understand for customers. From a technical perspective, it is simulation-
based, verifiable, fast, fair, and extensible. In particular, I created the
"use-case approach" so that, no matter how technical our benchmark reports
got, customers could say to themselves: "I understand this!". For example, in
the devops use-case, we generate data and queries from a realistic simulation
of telemetry collected from a server fleet. Doing it this way creates
benchmarking stories that appeal to a wide variety of both technical and
nontechnical customers.

This user-first design of a benchmarking suite was a novel innovation, and was
a large factor in the success of the project.

Another aspect of the project is that we tried to do right by the competition.
That means that we spoke with experts (sometimes, the creators of the
databases themselves) on how to best achieve our goals. In particular, I
worked hard to make the Cassandra, Elasticsearch, MongoDB, and OpenTSDB
benchmarks show their respective databases in the best light possible.
Concretely, each database was configured in a way that is 1) featureful, like
InfluxDB, 2) fast at writes, 3) fast at reads, and 4) efficient with disk
space.

As an example of my diligence in implementing this benchmark suite for
InfluxData, I included a mechanism by which the benchmark query results can be
verified for correctness across competing databases, to within floating point
tolerances. This is important because, when building adapters for drastically
different databases, it is easy to introduce bugs that could give a false
advantage to one side or the other (e.g. by accidentally throwing data away,
or by executing queries that don't range over the whole dataset).

I don't see that TimescaleDB is using the verification functionality I
created. I encourage TimescaleDB to run query verification, and write up their
benchmarking methods in detail, like I did here: [1].

I think it's great that TimescaleDB is taking these ideas and extending them.
At InfluxData, we made the code open-source so that others could build and
learn from our work. In that tradition, I hope that the ongoing discussion
about how to do excellent benchmarking of time-series databases keeps
evolving.

[0] [https://github.com/influxdata/influxdb-comparisons](https://github.com/influxdata/influxdb-comparisons) (Note that others maintain this project now.)

[1] [https://rwinslow.com/rwinslow-benchmark-tech-paper-influxdb-...](https://rwinslow.com/rwinslow-benchmark-tech-paper-influxdb-vs-elasticsearch-may-2016.pdf)

~~~
leehampton
Hey rw, one of the core contributors to TSBS here. First of all, thank you for
the work you did on influxdb-comparisons, it gave us a lot to work with and
helped us understand Timescale’s strengths and weaknesses against other
systems early on. We do appreciate the diligence and transparency that went
into the project. We outline some of the reasons for our eventual decision to
fork the project in our recent release post [1]. Most of the reasons boil down
to needing more flexibility in the data models/use cases we benchmark and
needing a more maintainable code design since we’re using this widely for a
lot of internal testing.

Verification of the correctness of the query results is obviously something we
take very seriously, otherwise running these benchmarks would be pretty
pointless. We carefully verified the correctness of all of the query
benchmarks we published. However, it’s a process we haven’t fully automated
yet. From what we can tell, the same is true of influxdb-comparisons — the
validation pretty prints full responses but each database has a significantly
different format, so one needs to manually parse the results or set up a
separate tool to do so. We have our own methods for doing that internally —
once we get the process more standardized and automated we will definitely be
adding it to TSBS. We encourage anyone with ideas around that (or anything
else) to take a look at the open source TSBS code and consider contributing
[2].

[1] [https://blog.timescale.com/time-series-database-benchmarks-t...](https://blog.timescale.com/time-series-database-benchmarks-timescaledb-influxdb-cassandra-mongodb-bc702b72927e)

[2] [https://github.com/timescale/tsbs](https://github.com/timescale/tsbs)

------
dominotw
> the focus of time-series databases has been narrowly on metrics and
> monitoring

I am curious whether people are using TSDBs for business predictions, machine
learning, exploratory visualisation, data science, and AI. I got curious after
seeing a Udacity course on TSDB predictions.

[https://www.udacity.com/course/time-series-forecasting--ud98...](https://www.udacity.com/course/time-series-forecasting--ud980)

~~~
testrun
Quite a few companies do. I was involved in creating a predictive maintenance
application using neural nets and using a TSDB as data source at my previous
employer.

------
evdev
Does anyone have thoughts on why Postgres shouldn't provide:

\- Automatic sharding of tables per-"shard key".

\- Automatic sharding of those keyed shards by the range of some primary
index.

Doesn't this get you 90%+ of the way there? (There's no "adaptive" time
bucketing, I guess.)

For the record, I am a veteran of a naive Postgres time series scheme that was
brought to its knees by seek times.

~~~
akulkarni
I think what you are asking is why something like TimescaleDB has to exist in
the first place; i.e., why doesn't PostgreSQL just naturally do this?

Here's why: There are scenarios with time-series data that rarely occur with
standard/vanilla PostgreSQL OLTP workloads. So PostgreSQL simply isn't
designed to handle these scenarios well on its own.

Having 100s-1,000s of partitions is one such example. We found that insert
rate on standard PostgreSQL drops quickly as the number of partitions
increases, because PostgreSQL decides to hold a lock on every partition on
insert, even though the insert may only touch one partition. [1]

When we asked the core PostgreSQL devs about this, they explained that they
did this because sorting out the appropriate locks was a hard problem, and
that they saw this scenario as so unlikely for OLTP that they instead directed
their resources to other more pressing problems.

But with time-series data this is a very common scenario, so we (TimescaleDB)
had to sort it out ourselves.

And this is just one example. There are quite a few query optimizations that
we also had to develop for working with time-based data more efficiently.

At a high-level, every project has to optimize for something. PostgreSQL
understandably optimizes for OLTP workloads. But the beauty of PostgreSQL is
that it allows extensions to optimize for other workloads, such as time-
series.

[1] [https://blog.timescale.com/time-series-data-postgresql-10-vs...](https://blog.timescale.com/time-series-data-postgresql-10-vs-timescaledb-816ee808bac5)

~~~
mslot
> When we asked the core PostgreSQL devs about this, they explained that they
> did this because sorting out the appropriate locks was a hard problem, and
> that they saw this scenario as so unlikely for OLTP that they instead
> directed their resources to other more pressing problems.

The locking on partitioned tables is a little clunky, but the overhead of
these locks is very low. The main performance problem in Postgres 10 was the
partitioning pruning, which used an exhaustive linear search. That has been
fixed in Postgres 11 (due in September) which uses binary search and
introduces various other partitioning improvements [1].

[1]
[https://www.postgresql.org/docs/11/static/release-11.html#id...](https://www.postgresql.org/docs/11/static/release-11.html#id-1.11.6.5.5)

~~~
cevian
I believe what akulkarni meant to talk about is relation accesses (and not
just locks). While the partition pruning certainly improved things, two
sources of inefficiency still remain in PG 11:

1) Fetching statistics for each table during queries (which happens by reading
from the data file off of disk). This happens /before/ pruning, even on PG 11.

2) The overhead of locking each table is still there. Although it's a smaller
issue than (1).

We at TimescaleDB found (1) to be the most significant overhead and in fact we
have significantly improved things there [1].

[1] [https://github.com/timescale/timescaledb/commit/b7257fc8f483...](https://github.com/timescale/timescaledb/commit/b7257fc8f483475382019cadcc7a75fae0b72f0a)

~~~
anarazel
You could also just have worked on lowering those overheads in PG, just
saying. It's easy to blame "PG devs", but most of us could get changes quicker
to our company's respective customers by just fixing everything in forks.

~~~
enordstr
Timescaler here. We're not blaming "PG devs". We have great respect for the
PostgreSQL developers and what they are doing; so much, in fact, that we chose
to base our product on PostgreSQL. And, TimescaleDB is not a fork--it is an
extension to PostgreSQL that can be loaded in existing PostgreSQL
installations.

We would be happy to contribute to PostgreSQL, but I think the issue here is
that, as a business that is focusing on a very particular use case, we are not
perfectly aligned with the PostgreSQL roadmap. We want to be able to move
quickly and adapt to customers' needs, focusing on the pain points and issues
they have. This simply isn't compatible with the more conservative development
pace that main PostgreSQL understandably has.

From another perspective, I think one strength of PostgreSQL is, in fact, its
support of extensions, enabling innovation alongside main PostgreSQL while the
core developers can focus on a rock solid and extensible foundation. So, from
where I am coming from, this is a feature and not a bug.

------
mintyc
Nice article, but is it deliberate that you don't mention getting data into
the database (line protocol or similar), or analysing the results and
displaying them?

Input = use Postgres. Output = Grafana has a Postgres data source, which I
assume works, and TimescaleDB's issues mention a Grafana query helper.

Also lack of analysis, consolidation (continuous queries) and retention
policies.

I am however intrigued, as it does seem to hit my sweet spot: 100s of servers,
each with 5 to 10 different series of 10 metrics each, every 10 seconds.

Database size might be an issue, as might the complexity of deployment (a big
win for Prometheus over InfluxDB, though).

Final thoughts: I can't help feeling this looks like a few input scripts
running on Postgres rather than a system solution for metrics and annotations.

~~~
enordstr
Fellow Timescaler here. Thanks for the feedback. While we do not directly
compare ingestion protocols and specific features, like continuous queries and
retention policies (something I guess we could add), we do compare ecosystems
and third-party tool support, including ingestion (e.g., Kafka, Hibernate)
and visualization (e.g., Tableau, Grafana). In fact, the developer behind the
Grafana PostgreSQL data source is also a Timescaler, and an upcoming version
of the data source will have an improved query builder and first-class
TimescaleDB support.

Finally, I can assure you that this is more than a few input scripts. In fact,
the project is thousands of lines of C code (if that matters) that implement
automatic partitioning, query optimizations, and much more. Our code is open
source here:
[https://github.com/timescale/timescaledb](https://github.com/timescale/timescaledb)

We have other blog posts
([http://blog.timescale.com](http://blog.timescale.com)) and system
documentation ([http://docs.timescale.com](http://docs.timescale.com)) that
explain what we're doing if you're interested in learning more about the
details.

------
pauldix
Creator & CTO of InfluxDB here. I won't respond too much to this write-up
specifically, but I felt I had to respond to the requests for me to comment. For
the benchmarks, I haven't looked at their fork of our original code and we
haven't had engineers attempt to reproduce their results. In truth, we
probably won't put much effort into that. We have every intention of putting
more effort into benchmarking, but I'll talk a bit more about that at the end
of this post.

Much of this comparison is the technology equivalent of an argument through
appeal to authority. The old "nobody ever got fired for buying Big Blue"
argument. It's true that Postgres has been around for much longer than
InfluxDB. Mike is actually overly generous when he pinpoints InfluxDB's first
release in September of 2013. First commit was September 26th, 2013 where I
added the MIT license and the empty README where I refer to the project as
ChronosDB. The first "release" wouldn't follow for another six weeks and I
would hardly qualify that as an official release (0.0.1). If you followed the
commit log, you'd see that InfluxDB is actually younger than that. We rewrote
the entire thing from November of 2014 onward. Ben Johnson gets the award for
the InfluxDB committer with the largest delete set in a single commit, from
when he ripped everything out as we started the 0.9 release line. The storage engine
didn't even start getting written until December of 2015 (although I wrote the
prototype of the concept over Labor Day Weekend in the beginning of September
2015). So in some sense you could say that we've been at this storage game for
less than three years.

However, I wouldn't discount a technology simply because it's new. We take
data loss very seriously and strive to create a storage engine that is safe
for data. The issues linked to in that post are either closed, apply to a
previous storage engine, or were recovered through tooling (in the case where
a corrupt TSM file was written). Yes, these things take time to get right and
there is always room for improvement. Proper infrastructure and planning
mitigates these risks. For example, in our cloud environment, we take hourly
snapshots of the EBS volumes that store data. We make sure that we're able to
recover from a catastrophic failure, even if it is one that was induced by
some software bug. That said, we haven't seen corrupt TSM files or corrupt WAL
files in our cloud environment.

The argument on community size is in a similar vein. Yes, the Postgres
community is larger than InfluxDB's. But InfluxDB has a large, vibrant and
growing community. PHP has a larger community than Go, but I'm not going to
write code in PHP because of that (no offense to the PHP devs). When I was a
Ruby programmer I didn't pick it because of maturity or community size. In
2005 barely anyone even knew about it. I picked Ruby (and Rails) at the time
because of what I could build with them. More importantly, I picked those
tools because of how quickly I could build with them. It also didn't hurt that
I connected with the Ruby community and felt like I had found my tribe. So
it's possible to have a community that you like and connect with regardless of
size.

Ultimately, we've chosen to create from scratch. We've also chosen to create a
new language rather than piggybacking on SQL. We've made these choices because
we want to optimize for this use case and optimize for developer productivity.
It's true that there are benefits to incremental improvement, but there are
also benefits to rethinking the status quo. I've heard many times from our
users that they liked the project because of how easy it was to get started
and have something up and running. We'll continue to optimize for that while
also optimizing performance, stability, the overall ecosystem and our
community. It means that we invest into tools outside the database to make
solving problems with time series data easier. It also means that we've
created a storage engine from scratch and we're creating a new language, Flux.
We've MIT licensed the language and the engine. This is because we're building
Flux to make it work with other databases and systems. Our goal is to build an
all new community and ecosystem around Flux, for programmers that are working
with data (time series or otherwise).

Finally, some thoughts on benchmarks. I hate benchmarks. There are lies, damn
lies, and benchmarks. Particularly in comparisons. You always have to look for
what was in, what was out, and if everything was done to favor one solution
over another. And yes, we're guilty of putting out the original performance
benchmark comparisons. So here's what I want to do for InfluxDB as an ongoing
effort. We should be benchmarking, but doing it with workloads that are as
close to what we see in real production systems as possible. No bulk loading a
bunch of data and then doing a bunch of basic queries while the DB isn't under
any other load. Further, I don't want to do comparisons to other solutions. I
don't want to do another vendor's work for them. I'd rather focus the
benchmarks on continuous improvements against our own builds. Benchmarks are
great when they lead to ongoing product improvement. They're also useful if we
make them public for the community and our customers so they can see over time
how things are shaping up.

We see time series as a massive market with many different offerings, which
often have different philosophies. And much of this is about APIs and
aesthetic, so for many questions, there isn't really a "correct" answer. Our
goal is to focus our product efforts on delivering the best experience for the
community and customers who are working with time series data and building
applications and solutions on top of our platform. At the same time, we want
to contribute as much of our from-scratch code back to the open source Go
community so that implementors who come after us can build on our shoulders.

~~~
steve19
> Our goal is to focus our product efforts on delivering the best experience
> for the community and customers

After making clustering closed source, after saying it would be part of the
open source version, I think it would be more accurate and far more honest to
just say "our customers", not "community and customers".

~~~
pauldix
The vast majority of InfluxDB users are using open source exclusively. There
are millions of servers all over the world running open source software built
by InfluxData. When I say we're building for our community, I mean exactly
that. We continue to put software into the open source ecosystem because it's
a core value for our company and as developers we like to share what we're
building with the world.

Yes, we build some features (like clustering) exclusively for paying
customers, but that is what subsidizes the open source that we continue to
build and make freely available. Last year I gave a talk and a related blog
post about the dynamics of building a business on open source software:
[https://www.influxdata.com/blog/the-open-source-database-bus...](https://www.influxdata.com/blog/the-open-source-database-business-model-is-under-siege/)

------
kev009
At the end of the day, Influx is going to be chugging along on the square
wheel that is garbage collection, with occasional wild memory and latency
gyrations. Couple that with a log-structured tree for more "fun".

C doesn't guarantee you won't do things like that, but Timescale is built in
such a way as to minimize most of this kind of extreme waste, with the full
Postgres user experience.

------
hellomichibye
Why not compare it with the market leader kdb+?
[https://kx.com/media/2018/06/KdbTransitive-Comparisons.pdf](https://kx.com/media/2018/06/KdbTransitive-Comparisons.pdf)

~~~
cevian
From kdb+ license agreement:

"1.4 Kdb+ Software Evaluations. End User shall not distribute or otherwise
make available to any third party any report regarding the performance of the
Kdb+ Software, Kdb+ Software benchmarks or any information from such a report
unless End User receives the express, prior written consent of Kx to
disseminate such report or information."

~~~
thinkx
It is unfortunate that Kdb+ likes to tout benchmarks against Influx etc, but
their license prevents anyone else from doing the same.

------
freediver
Would love a comparison to Market Store
[https://github.com/alpacahq/marketstore](https://github.com/alpacahq/marketstore)

------
mooneater
Given the Cambrian explosion in tools, I am always looking for safe ways to
cross a tool off my "must consider" list. This article was very helpful in
that regard!

------
damm
Sure, PostgreSQL can handle metrics; I've been doing it for 2 years without
any TimescaleDB.

I also use InfluxDB. Different tools for different purposes.

------
jonny_eh
timescale.com isn't HTTPS-enabled by default; influxdata.com is. Just my first
impression as a 2018 web developer.

------
superdimwit
Submission by "Professor of Computer Science, Princeton. Co-founder and CTO,
Timescale."

~~~
gbrown_
From the article.

> Yes, we are the developers of TimescaleDB, so you might quickly disregard our
> comparison as biased. But if you let the analysis speak for itself, you’ll find
> that it tries to stay objective — indeed, we report several scenarios below in
> which InfluxDB outperforms TimescaleDB.

~~~
a012
I've just read it, and 2/3 of this article shows TS has more advantages than
InfluxDB. I just wonder: why TS, instead of just using PostgreSQL directly?

~~~
akulkarni
(Timescaler here.) That's a common question, and one we address in this post:
[https://blog.timescale.com/timescaledb-vs-6a696248104e](https://blog.timescale.com/timescaledb-vs-6a696248104e)

tl;dr TimescaleDB vs. PostgreSQL: 20x higher inserts, 2000x faster deletes,
1.2x-14,000x faster queries, additional functions for time-series analytics
(e.g., first(), last(), time_bucket() [1])

[1]
[http://docs.timescale.com/v0.11/api#analytics](http://docs.timescale.com/v0.11/api#analytics)
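A rough sketch of what those analytics functions look like in a query (table and column names here are made up; `time_bucket`, `first`, and `last` are from the linked API docs):

```sql
-- Hypothetical query: 15-minute summaries per device.
SELECT time_bucket('15 minutes', time) AS fifteen_min,
       device_id,
       first(temperature, time) AS first_temp,  -- value at earliest time in bucket
       last(temperature, time)  AS last_temp    -- value at latest time in bucket
FROM conditions
GROUP BY fifteen_min, device_id
ORDER BY fifteen_min;
```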

