
Why DNSFilter replaced InfluxDB with TimescaleDB - LogicX
https://blog.dnsfilter.com/3-billion-time-series-data-points-dnsfilter-replaced-influxdb-with-timescaledb-d9f827702f8b
======
dr_faustus
I find it fascinating how almost every No/NewSQL database in almost every use
case niche (document, time series, key value, etc.) gets blown out of the
water by Postgres either from day one or a couple of Postgres releases later
(sometimes using a plugin). It just goes to show how great their technology
is. Plus, it's very mature, well understood, safe, secure and stable. And it's
much more likely to still be around (maintained) 10 years from now than the
current NoSQL flavor of the month.

After we had some pretty horrendous experiences with several NoSQL dbs (mostly
MongoDB, CouchDB) on client projects, we strongly urge all clients to just use
Postgres. The only exceptions are redis (for caching, queues, etc.) and
elasticsearch (for ... search), which are just very convenient complements to
Postgres. We have never found a single instance where the Postgres performance
was not at least as good as the NoSQL alternative's.

~~~
eikenberry
I'm curious what you do for Postgres to get HA, as the last time I worked
with Postgres it was a nightmare to get even a bare-bones HA setup working. I
swore I'd never touch it again until they had a real HA solution. I ask every
year or so and keep being told it still doesn't have one.

~~~
daurnimator
Use barman [http://www.pgbarman.org/](http://www.pgbarman.org/)

~~~
keypusher
Barman appears to be a solution for backups, not HA. If your master fails, you
need to restore from the backup, which means downtime.

~~~
daurnimator
Barman can keep a hot-standby for you.

You still need a mechanism to fail over to the standby when the primary goes
down. However, with libpq from Postgres 10 you can provide multiple servers in
the connection string. This isn't true HA, but for most people it fits the
requirements.
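
As a concrete illustration, a client built against libpq 10+ (psycopg2 here) can list several hosts and ask for a read-write session, so it skips a failed primary in favor of the promoted standby. A minimal sketch; the hostnames are placeholders:

```python
# Minimal sketch of libpq (Postgres 10+) multi-host failover via psycopg2.
# "pg-primary" and "pg-standby" are hypothetical hostnames.
import psycopg2

conn = psycopg2.connect(
    "host=pg-primary,pg-standby port=5432,5432 "
    "dbname=app user=app password=secret "
    "target_session_attrs=read-write"  # only accept a writable (primary) node
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT pg_is_in_recovery()")
    print(cur.fetchone())  # (False,) on the primary
```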

------
jimaek
We use influxdb as well at [http://www.dnsperf.com](http://www.dnsperf.com)

No issues so far, except of course the initial difficulty of correctly setting
up the tables and continuous queries.
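
For anyone who hasn't been through that setup, a continuous query is just a downsampling SELECT ... INTO ... GROUP BY time(...) registered with CREATE CONTINUOUS QUERY. A rough sketch via the Python client; the database, measurement, and field names are invented, not dnsperf's actual schema:

```python
# Rough sketch of an InfluxQL continuous query that downsamples raw points
# into 5-minute means. Names ("dnsperf", "latency", "value") are hypothetical.
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="dnsperf")
client.query("""
    CREATE CONTINUOUS QUERY "cq_latency_5m" ON "dnsperf"
    BEGIN
      SELECT mean("value") AS "mean_value"
      INTO "dnsperf"."autogen"."latency_5m"
      FROM "latency"
      GROUP BY time(5m), *
    END
""")
```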

The main problem for us is that clustering is available only in the enterprise
version, which is way too expensive for a small self-funded company like ours.
I wish they offered more options, e.g. without any support but at a big
discount.

~~~
pauldix
For smaller users or cases we generally suggest InfluxCloud, our offering in
AWS, which uses the clustering. Did you look into that?

~~~
hemantv
I loved InfluxDB early in 2014 (I used it at Postmates for internal metrics).

It's very bad that you guys killed the high availability feature, but I
understand the business requirements.

~~~
tyldum
For us the problem is that we have been talking to Influx sales for over 6
months to get a quote for those features; however, it is just a never-ending
interview process to determine how much we are willing to pay. We are looking
at other options at this point.

------
odiroot
I evaluated InfluxDB at my previous company. Settled on TimescaleDB in the
end due to the querying power of Postgres.

Influx had some real quirks with nested queries (wrong data being returned).
TimescaleDB is probably a bit slower and less compact but with our data (a few
GBs/day) it wasn't our biggest problem.

------
LogicX
I'm Mike Schroll, CTO of DNSFilter - Happy to answer any questions about our
experiences with InfluxDB or TimescaleDB over the last 2 years.

~~~
camel_gopher
When you talked about 150M queries per day, are those inserts or reads? That's
about 1,700 per second, which to be honest doesn't seem like a lot for time
series metrics ingest. I would expect a single node on most TSDBs to handle at
least ~100 times that.

Can you talk about the data you were ingesting? Was it numerics, text, or
something else?

(Disclosure, I work for TSDB provider IRONdb
[http://irondb.io](http://irondb.io))

~~~
LogicX
Hi there -- We are currently ingesting 150M/day -- though it's not evenly
distributed -- probably peaks around 3,000qps

Agreed that it's not too much yet. I think the element that kills us is
needing to maintain rollup tables, and then query against both the raw data
and the rollup data for our customer analytics dashboard.
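
A sketch of what that combined read path can look like on the Timescale side: union the pre-aggregated hourly table with an on-the-fly aggregation over raw rows that haven't been rolled up yet (table and column names below are hypothetical, not the actual schema):

```python
# Sketch: serve a dashboard range by combining hourly rollups with a live
# aggregation of recent raw rows. Names ("queries_rollup_1h", "queries_raw")
# are hypothetical.
import psycopg2

SQL = """
SELECT bucket, sum(query_count) AS query_count
FROM (
    SELECT bucket, query_count
    FROM queries_rollup_1h
    WHERE customer_id = %(cid)s AND bucket >= %(start)s AND bucket < %(cutoff)s
  UNION ALL
    SELECT date_trunc('hour', ts) AS bucket, count(*) AS query_count
    FROM queries_raw
    WHERE customer_id = %(cid)s AND ts >= %(cutoff)s
    GROUP BY 1
) combined
GROUP BY bucket
ORDER BY bucket;
"""

with psycopg2.connect("dbname=analytics") as conn, conn.cursor() as cur:
    cur.execute(SQL, {"cid": 42, "start": "2018-03-01", "cutoff": "2018-03-07"})
    rows = cur.fetchall()
```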

The data is a combination of date/time, strings, and ints. 22 fields.

~~~
ryanworl
I know I mentioned this in another comment, but you really should check out
Clickhouse. They have a table engine for exactly this purpose.

You create a table with the raw logs, then a materialized view (or another
actual table) which declaratively does the rollup for you in real time.

[https://clickhouse.yandex/docs/en/table_engines/aggregatingm...](https://clickhouse.yandex/docs/en/table_engines/aggregatingmergetree/)

A full example: [https://www.altinity.com/blog/2017/7/10/clickhouse-aggregatefunctions-and-aggregatestate](https://www.altinity.com/blog/2017/7/10/clickhouse-aggregatefunctions-and-aggregatestate)
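
For reference, the pattern is roughly: a raw table, a materialized view that stores partial aggregate states, and read queries that finalize them with *Merge combinators. A sketch with the clickhouse-driver Python client; all names are invented for illustration:

```python
# Sketch of ClickHouse's AggregatingMergeTree rollup pattern. A raw table feeds
# a materialized view of aggregate *State columns, which are finalized at read
# time with *Merge combinators. Table and column names are hypothetical.
from clickhouse_driver import Client

ch = Client("localhost")

ch.execute("""
    CREATE TABLE IF NOT EXISTS queries_raw (
        ts DateTime,
        customer_id UInt32,
        latency_ms Float64
    ) ENGINE = MergeTree() ORDER BY (customer_id, ts)
""")

ch.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS queries_1h
    ENGINE = AggregatingMergeTree() ORDER BY (customer_id, hour)
    AS SELECT
        customer_id,
        toStartOfHour(ts) AS hour,
        countState() AS requests,
        avgState(latency_ms) AS avg_latency
    FROM queries_raw
    GROUP BY customer_id, hour
""")

# Reading the rollup: the stored states must be merged and finalized.
rows = ch.execute("""
    SELECT customer_id, hour,
           countMerge(requests) AS requests,
           avgMerge(avg_latency) AS avg_latency
    FROM queries_1h
    GROUP BY customer_id, hour
    ORDER BY hour
""")
```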

------
pauldix
Influx founder and CTO here. I figure I should comment because it seems this
post is a bit dated in its information. That being said, I think this is one of
those cases where, based on where InfluxDB was at the time, they simply picked
the wrong tool for the job. We were looking at use cases with hundreds of
thousands of series. It just wasn't designed for what they were trying to do
(for whatever version they were running at the time). We have much higher
aspirations now, but I'll get to that in a bit when I address each point
raised in the post.

As I mentioned, it would be helpful to know which version of InfluxDB was
tested against, but I'll try to cover each of the points.

1 - Cardinality. High cardinality has long been a known problem for InfluxDB.
We started work on this almost 15 months ago to create a new feature called
TSI or Time Series Index. It was significantly harder for us to develop than
we thought it would be. It can be enabled on version 1.5, which was released
yesterday. We'll have a detailed whitepaper on how it works and what
techniques we used coming soon. It shows significantly lower memory usage for
high cardinality write workloads. The other side of this is query, which I'll
talk about in a bit.

2 - Rollup tables are a known issue for us. Continuous queries were originally
supposed to do this, but the problem is that they fall over at scale. There are
some other weird things about how they work, like not picking up lagged data
collection. It's a weak spot for us, Prometheus, OpenTSDB and a bunch of other
solutions. From what I can tell, Timescale doesn't offer this feature either.
See
[https://github.com/timescale/timescaledb/issues/350](https://github.com/timescale/timescaledb/issues/350).
It sounds like DNSFilter implemented this themselves using their Kafka ingest
pipeline, computing the rollups there and inserting them into Timescale tables
(which for now is exactly what we recommend with InfluxDB). If I'm guessing
wrong, I'd love to hear the detail about how this works in your solution.
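
For readers unfamiliar with that architecture, the ingest-side rollup amounts to a consumer that buckets incoming points in memory and periodically upserts the aggregates. A minimal sketch with kafka-python and psycopg2; the topic, table, and message format are assumptions, not DNSFilter's actual setup:

```python
# Minimal sketch of computing rollups in the ingest pipeline: consume raw
# events from Kafka, bucket per (customer, minute), and periodically upsert
# the counts into a rollup table. All names and the JSON shape are hypothetical.
import json
from collections import Counter

import psycopg2
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "dns-queries",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
conn = psycopg2.connect("dbname=analytics")
buckets = Counter()  # (customer_id, minute) -> count

for i, msg in enumerate(consumer, start=1):
    event = msg.value
    minute = event["ts"][:16]          # e.g. "2018-03-07T12:34"
    buckets[(event["customer_id"], minute)] += 1

    if i % 10000 == 0:                 # flush every 10k events
        with conn, conn.cursor() as cur:
            for (cid, minute), n in buckets.items():
                cur.execute(
                    """
                    INSERT INTO queries_rollup_1m (customer_id, bucket, query_count)
                    VALUES (%s, %s, %s)
                    ON CONFLICT (customer_id, bucket) DO UPDATE
                    SET query_count = queries_rollup_1m.query_count + EXCLUDED.query_count
                    """,
                    (cid, minute, n),
                )
        buckets.clear()
```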

3 - Ingestion performance vs. Query performance. It sounds like query load
causes ingestion performance to degrade. I really don't know without more
information. In the post Mike said they were working with InfluxDB in 2015 and
2016 and didn't continue upgrading. So they could have been running 0.11.0, a
version that is now over two years old. We've made significant improvements
since then. However, I will say that our current query engine is a known weak
spot. This is why we're investing heavily in IFQL (our new language and
engine). The new language also addresses the fact that InfluxQL doesn't
actually operate like SQL, which is something he mentions at the top of the
post. I agree that
can cause frustration, which is why we're designing our new language around
this use case. I think the functional paradigm makes more sense than SQL for
time series. Slides from a recent talk I gave about the engine:

Slides: [https://speakerdeck.com/pauldix/ifql-and-the-future-of-influxdata](https://speakerdeck.com/pauldix/ifql-and-the-future-of-influxdata)
Video: [https://www.youtube.com/watch?v=QuCIhTL2lQY&list=PLYt2jfZorkDqmNVloKpJGp-47KVeKcUPb](https://www.youtube.com/watch?v=QuCIhTL2lQY&list=PLYt2jfZorkDqmNVloKpJGp-47KVeKcUPb)

4 - Resource utilization. It uses RAM, particularly on high cardinality
workloads. Yes, we know this and have known it for a long time. Running tests
on 1.5 with TSI enabled, the RAM utilization is significantly lower. However,
TSI is only one side of the issue. The other side is querying high cardinality
data. Our current query engine will eat a lot of resources trying to do this. It's
something we're addressing with the IFQL engine. You can actually use the new
engine as a separate process against the InfluxDB 1.5 release. However, it's
still under heavy work and we haven't begun the performance optimizations.

I'm confused how Mike talks about their query processor and then
immediately dives into Kafka, which is only relevant in the ingestion
pipeline. In fact, our recommended configuration is to route writes through
Kafka before sending to InfluxDB if you're operating at scale. It's how we're
designing 2.0 from day 1 and that work has already started. In an analytics
pipeline, you want to separate your write pipeline from production
query/storage.

1 - Ease of change: schema changes in InfluxDB aren't easy. Postgres supports
alter table commands. How do these work on large tables? Has Timescale (or
Postgres) solved the problem of these kinds of operations taking the DB down
for a while? I think that's one of the strengths of having Kafka in their
ingestion pipeline to protect against it (or other DB issues). We still need
to do work here; I know it's a weak spot, and it's honestly just a very hard
problem (given how data is stored on disk for us). In the future, at the very
least we'll enable users to kick a schema change off and have it run in the
background while keeping the production infrastructure up. Depending on the
change it might be a long running background task that requires rewriting many
things.

2 - Performance - Impossible to address without knowing their data, their
queries, and really testing it on a new version. We're optimizing all the
time. I guarantee you that the query performance you see in InfluxDB today
won't be anything close to as good as what it'll be a year from now.

3 - No missing data - I don't know of any issue in the current release (or
many previous releases) of Influx that just drops writes. This might be
something to do with Continuous Queries since it sounded like it was about
rollups. It's a known issue with that feature and it's being completely
reworked in 2.0 to work in a guaranteed way, at scale. It's a long running
problem and one that is hard to get right. Again, it sounds like Mike and team
implemented it themselves through their ingestion pipeline and then attributed
the gain to Timescale. You can do the same thing with InfluxDB and it's our
recommended architecture. We tell customers to do their downsampling in the
ingestion pipeline before the data reaches the database.

I'm not sure what is going on with deletes. You can do deletes in Influx and
you can delete by a time range, which would give you a specific point. That
being said we've done significant work in the last three releases to improve
how deletes work and particularly how they work for deleting large amounts of
data.

For loads, I'd really like to see a comparison against the 1.5 release with
TSI enabled compared to Timescale. A release of InfluxDB that's 1.5 to 2 years
old is completely different from what is current, as many people who have been
watching the project over the last 4 years have seen.

Finally, Mike does make a case at the end for Timescale just being SQL and
Postgres so they found it easy to use because it's familiar. For better or
worse, we're making the bet that SQL isn't the one true API for working
with time series data. Some developers will prefer SQL and some will prefer
other approaches. For InfluxDB, we're making a bet that other approaches will
be better for developer productivity.

~~~
LogicX
Thanks for the thoughtful reply Paul.

I agree that in the end, InfluxDB ended up not being the right tool for the
job. We're using it more as a backend analytics system for our customer
dashboard, making it query-heavy.

Sorry, you may have read an earlier version of the post -- I'd since updated
it to specify that version 1.3 is the last version we used, a few months ago.

We currently do rollups in TimescaleDB via a cronjob that runs a stored
procedure. It references a rollup table for the bigid of the last data element
in our raw table. It is then rolled up with an upsert. In this way, we don't
miss any items while rolling up. They're working to make it more 'native' to
the TimescaleDB offering, but it's been working great for us.
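
Sketched in outline, that pattern looks something like the following (the real implementation is a stored procedure called from cron; the table and column names here are hypothetical):

```python
# Sketch of the watermark + upsert rollup described above: read the last
# rolled-up bigid, aggregate everything newer into hourly buckets, upsert,
# and advance the watermark, all in one transaction. Names are hypothetical.
import psycopg2

def run_rollup():
    with psycopg2.connect("dbname=analytics") as conn, conn.cursor() as cur:
        # Lock and read the watermark for the raw table.
        cur.execute("SELECT last_bigid FROM rollup_state "
                    "WHERE table_name = 'queries_raw' FOR UPDATE")
        (last_bigid,) = cur.fetchone()

        # Fix an upper bound so the aggregation and the new watermark agree.
        cur.execute("SELECT coalesce(max(bigid), %s) FROM queries_raw", (last_bigid,))
        (upper,) = cur.fetchone()

        # Upsert the new slice; reruns and partially filled buckets are safe.
        cur.execute(
            """
            INSERT INTO queries_rollup_1h (bucket, customer_id, query_count)
            SELECT date_trunc('hour', ts), customer_id, count(*)
            FROM queries_raw
            WHERE bigid > %s AND bigid <= %s
            GROUP BY 1, 2
            ON CONFLICT (bucket, customer_id) DO UPDATE
            SET query_count = queries_rollup_1h.query_count + EXCLUDED.query_count
            """,
            (last_bigid, upper),
        )

        cur.execute("UPDATE rollup_state SET last_bigid = %s "
                    "WHERE table_name = 'queries_raw'", (upper,))
```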

We use Kafka only as a queue to send data to multiple Timescale instances.
Kafka has allowed us to shift to a different Timescale instance when making
database changes, if an ALTER TABLE is blocking. Not all table or index
changes are blocking, though.
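
Two common examples of the non-blocking cases in stock Postgres (nothing Timescale-specific, and assuming roughly Postgres 9.6/10 behavior): adding a column without a default is a near-instant catalog change, and an index can be built without blocking writes using CREATE INDEX CONCURRENTLY, which has to run outside a transaction block. Table and column names below are hypothetical:

```python
# Sketch of two mostly non-blocking schema changes in plain Postgres.
# Not Timescale-specific; table/column/index names are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=analytics")

# Adding a column with no default only touches the catalog: a brief lock,
# no table rewrite.
with conn, conn.cursor() as cur:
    cur.execute("ALTER TABLE queries_raw ADD COLUMN response_code smallint")

# CREATE INDEX CONCURRENTLY avoids blocking writes, but it cannot run inside
# a transaction block, hence autocommit mode.
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("CREATE INDEX CONCURRENTLY idx_raw_response_code "
                "ON queries_raw (response_code)")
```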

~~~
pauldix
Thanks for the reply. Interesting that you're using crons to query and then
upsert. We're going to be doing the same thing in our 2.0 work. That paired
with our new query engine should make things much better on our end. So it's
one of those things we know is a weakness; we're working quickly to solve it
and make it a first-class citizen in the DB. For many workloads people have
been successful, but we keep expanding our focus as we go.

If you tested against 1.3, we might still have more work to do on concurrent
queries and writes. It may be that with 1.5 and TSI enabled that problem would
largely go away, but I'm not sure. Can you provide more detail about how many
QPS you were pushing?

For the different Timescale instances, are they shared-nothing, with your
write and query pipelines sharding based on customer? I didn't think there was
a clustering solution available.

Thanks again for providing more detail, much appreciated.

------
ryanworl
Did you evaluate Clickhouse?

~~~
pauldix
Yeah, I'd think about that for the analytics use case. It's interesting
technology and I always keep an eye out for what Influx can learn from these
other projects. This Cloudflare post is on my reading list to learn more about
it: [https://blog.cloudflare.com/http-analytics-for-6m-requests-per-second-using-clickhouse/](https://blog.cloudflare.com/http-analytics-for-6m-requests-per-second-using-clickhouse/)

------
jxub
I wonder if they also considered KDB+ for that.

------
marknadal
Dang, 3B records/day is very impressive - and I'm a competing distributed
systems database engineer!

Our tests were doing about 100M+ (100GB+) a day (this was 2 years ago),
although on significantly smaller hardware; I don't think it would've hit
3B even on the Intel Xeon you guys used. Stats:
[https://www.youtube.com/watch?v=x_WqBuEA7s8&t=1s](https://www.youtube.com/watch?v=x_WqBuEA7s8&t=1s)

Just wanted to say, really impressive work! Respect.

