It’s About Time for Time Series Databases (nextplatform.com)
356 points by okket on Jan 25, 2018 | 148 comments



Sorry to leave the technical detail part real quick. But is anyone else concerned about using a DB solely from a company built specifically around that DB? After Rethink DB (sustainability issue) and Foundation DB (bought and shuttered/hidden) and Riak (admittedly haven't kept up but I saw [0]), I am wary of using any DB that is not built by a large community or is not built as a non-core project from a large tech company. Sorry TimescaleDB, I see you have raised a decent amount of funding, but I have to choose my DBs w/ trepidation these days.

0 - https://www.theregister.co.uk/2017/07/13/will_the_last_perso...


We use a combination of SQLite and a sharding frontend service. One SQLite database file per device, one table per sensor, table contents are timestamp and measured value. As simple as it gets, easy to scale, and damn fast.

But try telling people you're using SQLite to store critical data...
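A minimal sketch of the layout described above in Python's built-in sqlite3 (names and schema are illustrative, not the poster's actual code):

```python
import sqlite3
import time

# Hypothetical sketch: one database file per device, one table per sensor,
# rows of (timestamp, value).

def open_device_db(device_id):
    # A sharding frontend would route each device's writes to its own file.
    return sqlite3.connect(f"device_{device_id}.db")

def record(conn, sensor, value, ts=None):
    # Table names can't be bound as parameters, so sensor names must be trusted.
    conn.execute(
        f'CREATE TABLE IF NOT EXISTS "{sensor}" (ts INTEGER NOT NULL, value REAL NOT NULL)'
    )
    conn.execute(
        f'INSERT INTO "{sensor}" VALUES (?, ?)',
        (int(time.time()) if ts is None else ts, value),
    )
    conn.commit()
```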


That's a good pattern for straight data retrieval.

Unfortunately, if you need to do aggregated queries across all of the SQLite tables, things may be challenging.

But if you could somehow connect Spark to a folder (on a distributed FS) of these SQLite files...

Edit: Also, SQLite has a limitation that only one process can write to it at a given time. For this particular use case, though, it shouldn't be a problem unless you have rewrites coming from various sources (which can happen when correcting data).


I mean, depends on the aggregation you need, imo. Shouldn't be too hard (tm) to rig up some distributed query pipeline (as long as you are OK with coding per query, instead of the convenience of SQL).
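As a rough illustration of how simple the scatter-gather can be without any cluster at all, here's a hedged sketch that merges one aggregate across a folder of per-device SQLite files (file layout and table name are assumptions):

```python
import sqlite3
from pathlib import Path

# Scatter-gather over a folder of per-device SQLite files: compute a global
# average without any distributed engine. Assumes each file holds a table of
# (ts, value) rows named after the sensor; layout and names are hypothetical.

def folder_average(folder, sensor, t0, t1):
    total, count = 0.0, 0
    for db_file in Path(folder).glob("*.db"):   # one file per device
        conn = sqlite3.connect(str(db_file))
        row = conn.execute(
            f'SELECT SUM(value), COUNT(*) FROM "{sensor}" WHERE ts BETWEEN ? AND ?',
            (t0, t1),
        ).fetchone()
        conn.close()
        if row[1]:                              # device had rows in range
            total += row[0]
            count += row[1]
    return total / count if count else float("nan")
```

The per-file queries are independent, so fanning them out across processes (or Spark tasks over a distributed FS) is the obvious next step.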


Wow, that's a pretty neat approach. I have a bunch of questions, if you don't mind..

Do you do filesystem replication or something else for HA? Is this running in one of the cloud IaaS providers or something else?

Why did you go with this approach? Eg: Was it a deliberate decision after evaluating other options, a "this is what we know" type situation, or did it grow organically from something smaller?

How long has it been running? What's the rough size of data you have (GB/TB, number of servers, number of DB instances)? Any reliability problems?

If you were building this service from scratch, would you do it this way again?


Don't mind at all. ATM we're using a simple VMWare server backing up the virtual HDD every night, with a dirty hack thrown in (no full backups on weekends as that's when the bigger jobs run, so I dump the current day to a CSV on a smaller virtual HDD that's still saved daily). We're running on our own systems as our customers are pretty paranoid about where their data is sent.

The history of this system is pretty organic, lots of experimentation. We started with Postgres, but that was too slow when a select statement returned tens of thousands of rows. MariaDB was way faster but became erratic once we passed about 2 billion records. Now with SQLite we are at about 180 GB, about three years of data so far, so not really "big data" territory yet.

Constantly looking for better alternatives. InfluxDB looks promising, but maybe we'll switch to a "hot/cold" approach instead, keeping whole "cold" days as JSON objects in JSONB fields in Postgres again while storing "hot" data in SQLite. We'll see...

So far the system has been rock solid. In the last few weeks we went live for real, now storing about 2k records a second. Not a problem so far.


Wait, so to be clear, you moved from PG > Maria > SQLite?


> But try telling people you're using SQLite to store critical data...

SQLite is solid and trusted by many, but for some unknown reason some people refuse to see it as a solution for their needs. I have seen many databases that could easily have been replaced by SQLite.


I looked into SQLite a while ago, and it wasn't so much its robustness that was the deal-breaker, but more the fact that it couldn't handle writes from parallel processes from multiple users (i.e. concurrency) like a normal SQL database could.

I see SQLite as a file-format with a SQL interface.
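For what it's worth, the single-writer pain can be softened (not eliminated) with WAL mode and a busy timeout; a sketch, assuming SQLite 3.7.0+ for WAL:

```python
import sqlite3

# SQLite still allows only one writer at a time, but WAL mode lets readers
# proceed concurrently with that writer, and a busy timeout makes contending
# writers queue briefly instead of failing immediately.

def open_for_concurrency(path):
    conn = sqlite3.connect(path, timeout=5.0)  # wait up to 5s on a locked db
    conn.execute("PRAGMA journal_mode=WAL")    # readers don't block the writer
    conn.execute("PRAGMA synchronous=NORMAL")  # common WAL pairing, slightly relaxed durability
    return conn
```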


> I see SQLite as a file-format with a SQL interface.

I love the description from SQLite's website on the subject:

> SQLite does not compete with client/server databases. SQLite competes with fopen().

https://www.sqlite.org/whentouse.html


Yep. Parallel performance is the biggest issue I've seen with SQLite.

SpiceWorks is the biggest example of a product that doesn't scale because they've elected to use SQLite (and only SQLite). There's no hard limit on the number of devices, but most people say not to exceed about 1,000 devices. We tested it out against 800 devices. We started it on Friday. On Monday, it was still running.

The only other issue I've seen with it is with the data typing. Type inference is much weaker than strict typing, and sometimes it doesn't quite store things the way you'd expect. Granted, the last time I saw problems with that was years ago, but I've seen it.


There are plenty of problems that don't need parallel performance.


And there are plenty that do. What's your point?


My point is that some won't consider SQLite even for those problems that don't need parallel performance.


There is Berkeley DB/SQL (now sadly owned by Oracle), which is essentially SQLite with BDB in place of its original B-tree implementation, and which supports multiple parallel writers. Also, Firebird can be used as an in-process embedded database and supports multiple writers (in fact, in server mode it uses the same file-locking-based synchronisation mechanisms as in embedded mode).

On the other hand, it is somewhat rare to see an application that really needs multiple writers and would not be better served by a full-fledged database server.


Berkeley DB is not as reliable as SQLite. Subversion, before switching to its own storage, was known for data corruption problems attributed to Berkeley DB.


I remember the old days when people would refer to berkdb as "Berserkley" because it tended to go berserk on your data. It's always had a bad reputation for reliability.


It can handle some degree of concurrency via threading.


Probably because of the word "Lite" in the name.


Just tell them SQLite is used by ALL Android / iOS phones (billions of them) to store users' critical data.

I have not heard of any cases of SQLite failure in phone, or anyone needing a DB admin to recover their phone contact data.


> I have not heard of any cases of SQLite failure in phone, or anyone needing a DB admin to recover their phone contact data.

For a long time (until API level 11), Android automatically silently deleted corrupted SQLite databases. https://android.stackexchange.com/questions/12388#33896 https://stackoverflow.com/questions/7764943

I learned the hard way to limit SMS storage and install Jan Berkel's SMS Backup+ https://github.com/jberkel/sms-backup-plus/#readme after losing my first Android SMS database (this may or may not have been SQLite's fault).

https://www.sqlite.org/howtocorrupt.html#_bugs_in_sqlite


Specifically (can't edit above post -- too slow):

https://issuetracker.google.com/issues/36911900#comment457

>the SQLite packaged with Android is being upgraded to version 3.6.22 in Android 2.2. I think that the SQLite database corruption bug linked in comment 451 has been fixed in Android 2.2. So _if_ that's what's causing the text message database corruption and deletion, then this bug will probably be fixed in Android 2.2

http://sqlite.1065341.n5.nabble.com/Android-database-corrupt...

>Statically link your application against SQLite 3.6.23 instead of using the SQLite 3.5.9 that is found on Android. The bug you are hitting was fixed in SQLite 3.6.2

Contrast https://web.archive.org/web/20110219041419/https://www.sqlit... (Feb 2011) vs. https://web.archive.org/web/20110729152833/https://www.sqlit... (July 2011), which could be a glitch in the Internet Archive (a partial copy of the oldest record?) but adds "Though SQLite is resistant to database corruption, it is not immune" and introduced an extensive history of known issues.

Sqlite is amazing and its development history is further testament to its current reliability, especially when used correctly!


SQLite was built as a solution to the question "How do we put a SQL database on a nuclear submarine stealthily moving away from base for months at a time in secret, without putting a DBA in there with it?"

It can go wrong, it does go wrong, but it's a pretty solid solution, especially if you're not needing insane concurrency throughput.

The problem is many people need concurrency. They're running web apps over multiple servers. It's there where SQLite is a problem.


Source for origin? Legit interested.


https://en.wikipedia.org/wiki/SQLite#History

>used aboard guided missile destroyers [...] operated without installing a database management system or requiring a database administrator

PS. Thanks to your question, today I re-discovered https://unqlite.org, which may fulfill a requirement I've been facing for some time. Awesome!


Hey hey, unqlite looks pretty interesting. I'm both glad you were able to rediscover it and glad that I can have fun reading about it!


What's your replication/HA story?


Yes, collecting/aggregating for querying seems like the sticking point.


What does collecting or aggregating have to do with replication/HA?


It was my improvisational approach to pointing in the right direction. https://en.wikipedia.org/wiki/Yes,_and....

I don't really see how replication/HA is an issue in this setup beyond normal file system replication/HA (which seems to be a solved "spend money on NAS" problem in the enterprise world). If your filesystem isn't reliable / delivering what's needed then you're gonna have a bad time no matter the storage approach; I doubt some unusual option is involved though it would be cool to be proven wrong.

The only time the filesystem could even potentially have read/write conflicts (assuming 1-to-1 correlation between devices writing to files) is when querying. I did wind up muddling the terminology in the interest of jumping to the actual problem.


Wouldn't be hard to rsync that data to rotating s3 buckets/nfs shares/ftp drops/whatever


Those are backups/DR. HA is for when a database instance fails so you can keep on trucking as though nothing happened.


Simpler than a log file? What's the significant advantage of SQLite vs just appending to a file?


Transactions? Crash data safety? SQL queries? Portability?


I don't see how any of those apply for a two-column 'db' of timestamp and data.


Except... compliance staff who want to use some, ahem, wonderful standards-compliant reporting tool.

There's just one solitary mention of Informix in the whole discussion, as I'm typing this. And that's a comment lumping Informix with everything else commercial and expensive.

I thought Informix owned the embedded DB space at a certain rollout complexity. Remember the backups... Informix speaks JSON too, including at some quite low-level primitives. Whether you buy a product like Informix Edge plus tools or not depends on what is sending the time series data.


It is database per device and table per sensor, not just one table. With SQL you can do joins, aggregations, etc. If you just want to log it, sure use a log file. As soon as you want to use the data somehow, you need SQL or your own log file parsing and query code. Personally, I rather just use SQL.
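For example, here is the kind of one-line SQL aggregation (hourly min/avg/max) that would otherwise need hand-rolled log-parsing code; the table name and schema are illustrative:

```python
import sqlite3

# Assumes a table of (ts, value) rows with ts in unix seconds.

def hourly_stats(conn, sensor):
    # SQLite's integer division buckets each timestamp into its hour.
    return conn.execute(
        f'SELECT ts / 3600 AS hour, MIN(value), AVG(value), MAX(value) '
        f'FROM "{sensor}" GROUP BY hour ORDER BY hour'
    ).fetchall()
```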


If you’re storing from multiple sensors with multiple write processes, you will have concurrency problems if they’re in the same DB with SQLite.

I’m not referring to manipulating or even querying the data after it’s stored. I just don’t see how saving the original data in SQLite buys you anything, if it needs to be collected into a db for queries, etc, anyway.


SQL query support?


Indexing.


On disk support?


There are tons of justifications you can come up with for not using something you don't want to use. The down side is that you aren't going to be getting the upsides you'd get if you used a new technology, possibly to the extent of being no longer competitive with people who do use that technology.

The important thing is balancing chasing the new hotness vs. always using the tried and true.

Last year I ran into a similar sort of situation at work: the funds to buy Ansible Tower required a battle I wasn't willing to fight. The argument was "We've spent so much money on tools that then get abandoned." We spent nearly a year with me being the only one in the company who could run Ansible playbooks because of this, until I finally found Rundeck. Now, we aren't paying for Rundeck; does that make it more likely to be abandoned? Was the risk to our operations if I suddenly became unavailable worth the saved licensing fees? Well, in this case it worked out OK, but it could have gone very badly.

As far as time series goes, there are huge advantages here. We used to use collectd/graphite/carbon, but a year or more ago I switched us over to InfluxDB/telegraf/grafana. It was a little rocky at the start; the first version of Influx I used had plenty of problems. At the time Prometheus was probably a better choice, but I couldn't come up with a way to make its pull-centric design fit into our environment. Then InfluxDB settled down and things have been great since.

We have way better metrics, with a way more usable dashboard, and significantly lower load. Literally, disabling collectd ended the complaints about performance on our dev/stg environments.


> The down side is that you aren't going to be getting the upsides you'd get if you used a new technology

True of course. I've found that at least "premature abstraction" is almost a requirement when using persistence. I'm not talking an ORM or anything, but put a service in front of it if you have multiple things needing it, or put an in-code abstraction if you don't. And make each abstraction's/service's operation very specific to its caller. So a simple "fetchMarketData(timeRange, tickerSymbols) -> StreamOfData" will save you so much in the future.
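A sketch of that kind of deliberately narrow, caller-specific facade (all names here, including fetch_market_data, are illustrative, and the in-memory backend is a stand-in for whatever persistence you actually use):

```python
# The caller sees one specific operation; the storage behind it can be
# swapped without touching call sites.

class MarketDataStore:
    def __init__(self, rows):
        # (timestamp, symbol, price) tuples; a stand-in for any real backend.
        self._rows = list(rows)

    def fetch_market_data(self, t0, t1, symbols):
        # One purpose-built query, not a generic query API.
        wanted = set(symbols)
        for ts, sym, price in self._rows:
            if t0 <= ts <= t1 and sym in wanted:
                yield (ts, sym, price)
```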


Riak the database survives Basho the company. All assets (IP) were bought by bet365 (A large Basho customer) and made open source. Development continues.

/former Basho employee.


> made open source

I hope things work out at least as well as they have for RethinkDB's resurrection/transition! The Riak Users mailing list seems like the best place to follow along: http://riak-users.197444.n3.nabble.com

https://github.com/basho/riak_kv/tree/develop-2.2.5 (November 2017)

http://bet365techblog.com/riak-workshop-summary (October 2017)

>the immediate goal of releasing a known good build as soon as possible [...] all code would be released under the Apache 2 license [...] contact the HLL developers to aid in the dependency and GPL removal

https://news.ycombinator.com/item?id=15182566#15183008 (September 2017)

>andrew_deane_: bet365 have signed the agreement to purchase all Basho IP [...] our intention is to open source all code, help rebuild the community and collaboratively take the development of RIAK forward


All IP has been made open by bet365 (including RIAK TS) and development is continuing, both in the community and within the company. bet365 are actively recruiting to support this effort, and are committed to the long term future of RIAK. Latest release is to be cut Jan / Feb with the help of the community.


And RethinkDB has moved under the Linux Foundation.


Is the concern mitigated by this being an open-source Postgres extension? One question would be whether they have attracted external OSS contributors.


Depends on your reliance I suppose. Appears most of TimescaleDB's vendor-specific pieces are in the DDL area, but there are plenty of DML-specific extensions. I'm not saying TimescaleDB might not get to a level of maturity where I'd use it, I personally just can't justify it today. I've found DBs and ops/orchestration engines to be the most painful to move off from if you need to.


They also provide the most bang for the feature-complexity buck. (A DB is something you don't want to roll on your own, and you want to use the purpose-built one for your use case. A columnar store for time series, a scatter-gather SSD + random read optimized engine for hot OLTP, a fancy multi level cache/tier BLOB store for photos/files, CockroachDB for geo consistency, etcd for local consistent config, and so on.)


If Timescale goes bust you still have all your data in Postgres, which isn’t going anywhere. The risk is much lower.


Data portability is certainly an important issue to consider here, given the points you mentioned. When adopting a vendor, the relevant question is "How much work is it to get data in?". And when planning for the contingency of vendor loss, the question is "How much work is it to get the data out?".

The first question is tied closely to the interfaces provided by the database, and those interfaces are usually commensurate with the associated tooling. In the case of time series databases, they include statsd, collectd, and good old JSON. Relational databases utilize SQL as expected. Take the example of migrating from Oracle to Postgres; you shouldn't have to spend an inordinate amount of time retooling your applications, because the data interfaces are similar.

Data egress is similar; SQL databases generally provide SQL/CSV exports, and most TSDBs (the good ones at least) allow you to get the data out via JSON through a REST interface.

Whether or not the DB you choose is open source, or from a small vendor, you should pick the right tool to solve your problem with data portability in mind. (disclosure, I work for http://irondb.io, and have used Postgres for 18 years)


Basho is in receivership, close to bankruptcy. https://www.theregister.co.uk/2017/07/31/end_of_the_road_for...


For anyone just seeing this, its assets were purchased and are planned to be open-sourced: https://news.ycombinator.com/item?id=15182566


What you say is another way to see the argument about OSS Project vs OSS licensed software which is actually a company and not a project.


Timescale is an open-source PostgreSQL extension, which makes it much safer than RethinkDB was.


There was a similar sub-discussion on the Datomic Cloud announcement last week: https://news.ycombinator.com/item?id=16168041#16168743

As an open source product it seems like TimescaleDB is less risky than Datomic. (Not that the comparison is even necessary -- I don't think they compete in the same category apart from both being non-relational DBs.)

[edit:link fixed, thanks! (and thanks for taking the time to correct my misunderstanding re:TimescaleDB as a relational DB)]


Hi, the link you posted seems broken. Also, just wanted to clarify that TimescaleDB is in fact a relational database, being built as a PostgreSQL extension. Looking at the nature of time-series workloads (namely appending data about recent events), we can architect the database differently for time-series workloads and applications powered by time-series data: scaling insert rates and storage capacity, increasing query performance, and simplifying operations, while retaining full SQL and the reliability of Postgres.


> But is anyone else concerned about using a DB solely from a company built specifically around that DB?

It worked out pretty well for Oracle. (SCNR)


I'd almost rather be locked in to a cheap, may-deprecate DB than a really expensive one. The former at least forces you to move, the latter just costs your business a lot especially when non-doers run the tech show and will sink money over re-impl every time (as opposed to making measured decisions).


FWIW, I found a table with prices for Oracle database licenses once, and it made Microsoft SQL Server look cheap. I am not sure if that means MSSQL is cheap, though. At least I am fairly confident that Microsoft is not going out of business anytime soon.

In my current job, I needed to get intimately familiar with the database our ERP software uses, because my boss needed lots of reports that the ERP software just did not do (unless you paid an expensive consultant to customize the software for you). And the idea of having to move all that data to a different database makes me shiver in horror. I can understand why a business would be very reluctant to move their data from one database (both schema and engine) to another.

For business-critical tasks, I prefer a boring, reliable solution. But it is good to know that there are still interesting developments going on in the database area, building tomorrow's boring, reliable solutions.


Honestly, I have found it is simply easier to master PostgreSQL and SQLite, and between the two (and with the right know-how) pretty much all use cases are covered.

Played with a lot of databases, SQL or no, and I always end up going back to PG.


There's also snowflakedb which was in the news today after getting a pile of funding. Though it seems to be cloud-only, not something you can run yourself.


Riak is not dead and still has a strong community of developers, users, and enthusiasts. bet365 bought all the assets, including the enterprise parts, and made them open source. You can join the postriak Slack channel and see with your own eyes that there is an active and vibrant community working on a lot of exciting features. So, no, the party is not over yet!


You mean like Oracle? Or Sybase? Or Informix? Or Teradata?

Well, I guess you can stick to DB2.


Having worked at places where they paid tons for options like those, and suffering from the inability to move and the difficulty of finding non-company tooling and community around them, I'd say they have the same problem. Point being: if your ecosystem is a single company, it's at the whims of that company's success and give-a-shitness. Non-DB companies and open source communities tend to gouge less and share more.


ClickHouse (the analytics DBMS by Yandex), while not explicitly designed as such, is a fantastic time series database.

There's even a special backend, the GraphiteMergeTree, which does staggered downsampling, something most TSDBs aren't able to do.

It's the most promising development in this space I've seen in a long time.

https://clickhouse.yandex/

https://clickhouse.yandex/docs/en/table_engines/graphitemerg...


I'd also recommend Druid, MemSQL, SnappyData, MapD and other column-oriented databases. Any of them can partition on a time column with full SQL and extremely fast aggregations and high compression that come from columnar storage.


Hi, you've posted this notion that partitioning a column store by time would yield the same result as TimescaleDB a few times, so we thought we'd jump in and clear things up. We fully agree that column stores have their place, particularly if you have a massive number of metrics and all you care about are roll-ups on single column axes. There are some major differences between TimescaleDB and column stores. Namely, TimescaleDB supports a lot of features that column stores in general do not:

- Secondary Indexes.

- Transactional semantics.

- Can operate on data sets greater than available memory (doesn't have to be all in-memory, unlike MemSQL and some others). Time-series data is voluminous.

- A whole bunch of specialized time-based optimizations that optimize query plans when working with time-based indexes and data.

- Constraints - Including foreign keys.

- Triggers

- Joins with relational data.

- Full SQL - allowing you to use complex queries and window functions

- Compatible with data tools that use SQL, which gets you the richest ecosystem of tools in the data world

- The full gamut of Postgres datatypes including JSON/B and GIS location data

- 20+ years of reliability, tested backups, live streaming replication, etc.

- Geospatial support through best-in-class PostGIS

And of course, we're only getting started :)


Point of clarification - MemSQL does not need to be all in-memory, there is also a columnstore that is on-disk and only leverages memory for indexes / column segment information.


It depends on the queries but columnstores would yield a faster result. We're not new to this and have used ClickHouse, MemSQL, SQL Server, and Druid extensively.

Columnstores just store data by column; they do not have any inherent limitations because of it. They all support SQL and compatible tools (although Druid's SQL is experimental, using Apache Calcite). They all store columnstore tables on disk (MemSQL uses rowstores in memory, SQL Server can optionally run columnstores in-memory using its Hekaton engine, and they all use in-memory buffers for rapid ingest). They can all do geospatial queries and support JSON columns, and some can handle nested/repeated structures. Indexes are available but unnecessary when you can prune partitions based on what's contained in each segment, especially when using a primary sort key (like a timestamp column in your case). SnappyData has a unique statistical engine to trade off query precision for much faster results (like HLL+ algorithms applied to the entire dataset). MemSQL will do OLTP access with full transactions across both rowstore and columnstore data.

Congrats on the VC funding. I'm always happy to see new projects, and building on Postgres does give you a solid base with triggers and foreign keys (which come with their own scaling issues), and extending time-based functions will be useful. However, my issue is the marketing spin where you claim to be better than everything else. Columnstores are very fast, efficient, and performant, and time as a dimension is not a new challenge. That's before considering the BigQuery/Snowflake superscale options or specialized databases like kdb+, which have served the financial industry for decades.

Approaching the field with a single-node automatic partitioning extension (as of today) for a rowstore RDBMS and saying you're better than the rest on features that they already have just strikes me as insincere. It would be better to recognize the competition and focus on what you're good at instead.


I couldn't agree more with manigandham. Column-store data warehouses have nearly all the features that cevian mentioned, and a column-store with a time column as the partition key will run analytical queries much faster than a row-store, even a row-store like TimescaleDB that's specialized for time-series data.


“my issue is the marketing spin where you claim to be better than everything else”

I’m sorry that was your impression and it’s certainly not our intent to mislead, although I’m not really sure why/where you think we claimed this. Indeed, the quoted article even says that "Timescale is not trying to take on Kx Systems directly in this core market.", and that such organizations have different needs for different use cases.

Technology choices are all about trade-offs, and databases are no different.

In fact, our docs have a page describing exactly when not to use TimescaleDB compared to other options: http://docs.timescale.com/v0.8/introduction/timescaledb-vs-n...

Cheers :)


I think what he might be saying is that you clearly are trying to be a direct competitor to those you say you are not competing with. Claiming not to compete against the proprietary TSDB offerings, just so you can stack the comparison deck in your favor by then comparing yourself to the less-than-acceptable FOSS offerings, is a little disingenuous.

This would definitely clear things up in my mind.

Why would I use Timescale over KDB or IQ or Vertica? Is it just a price thing, that you are mostly cheaper (both licensing and finding talent)? If cost were a minor issue, why choose Timescale? What advantage does it have over those other TSDBs? That bullet list that has been repeated a couple of times seems to not really be unique to Timescale when compared to the other big columnar databases.

Maybe it is that you have a good story on how you can do scalar operations better than the others? Do you have a particular workload mix you are trying to target?

Do you plan on doing any of the TPC benchmarks?


You keep using that word "column store" but I don't think you know what it means.

Since when did it imply all the limitations you are saying?


ClickHouse is very cool. But note that it does not support transactional and relational semantics and does not have real-time updates or deletes. Thus, it's meant for very different applications than TimescaleDB. I would classify ClickHouse more in the data-warehouse space...


Direct link how to use as a graphite whisper replacement https://github.com/yandex/graphouse

Telegram channel https://t.me/clickhouse_en


Completely unrelated question, but how on earth is 'clickhouse.yandex' a valid web address?


The same way https://domains.google is: you pay, you get your own TLD.


> Back in 2014...developers...could use relational databases with SQL interfaces, which are easy to use but they don’t scale well.

I'm skeptical.


Guessing they meant to write "... for time series aggregation"


> nobody wants to have large grain snapshots of data for any dataset that is actually comprised of a continuous stream of data points

Except, of course, for those who realize that the precision of a statistic only increases as sqrt(n) and that a biased dataset will remain biased regardless of how much data you have. I'll take a large-grain dataset that I can load on my computer and analyze in five minutes over a finer-grained dataset where I need to set up a cluster before I can even get started. Enough with the "let's store everything" fetishism already.

(Somewhat tangential to the blog post, I realize.)
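The sqrt(n) point, made concrete (this is the standard error of a mean under the usual i.i.d. assumption):

```python
import math

# The standard error of a mean is sigma / sqrt(n): quadrupling the data only
# halves the error bar, and 100x the data buys just a 10x tighter estimate.

def standard_error(sigma, n):
    return sigma / math.sqrt(n)
```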


The reason we need to store everything is less about needing perfect accuracy of measurement (though I think we do want it) and more about the curse of dimensionality[0]. We want to slice, pivot, and filter datasets more aggressively than ever before which helps drive aggressive data collection.

[0] - https://en.wikipedia.org/wiki/Curse_of_dimensionality


You can use sampling to both store every dimension of a data point and to not store an unwieldy amount of data.
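One standard way to do that is reservoir sampling, which keeps a fixed-size uniform sample of a stream while preserving every dimension of each kept point; a sketch:

```python
import random

# Reservoir sampling: a fixed-size uniform random sample of a stream of
# unknown length, storing full data points rather than pre-aggregates.

def reservoir_sample(stream, k, rng):
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)      # fill the reservoir first
        else:
            j = rng.randint(0, i)    # item survives with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample
```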


This simply doesn’t work when you have sparsely populated dimensions and/or you don’t know what dimensions are important in advance. Both of these are very common. That’s why you don’t see a higher prevalence of estimated measurement.


> you don’t know what dimensions are important in advance

But again this is a huge red flag. I've seen so many data science projects that started with "well, let's just get started with collecting everything and we will figure out what is important later on" and then spent so much time on infrastructure that no useful insights were ever produced.


There's a balance to that. You don't need anything fancier than a Hadoop cluster to store everything. Nowadays you can get that packaged and working out of the box from a number of vendors.

Of course, getting that data out into analytical datasets is a whole different matter.


Unfortunately there’s no way to know if useful insights won’t be produced unless you explore the full data set.


You’re definitely not wrong, but if you simply can’t operate a system to store everything for some reason, sampling is a lot better than just doing an aggregation on one dimension and throwing everything else away.

For anyone reading this who is interested in a practical application with sampled operational data, check out Facebook Scuba.

https://research.fb.com/publications/scuba-diving-into-data-...


For sensor data analytics, you are frequently using many orthogonal sensor data sources to measure the same thing, precisely so that you can remove source bias. And most non-trivial sensor analytics are not statistical aggregates but graph reconstructions, the latter being greatly helped by having as much data as you can get your hands on.

The "let's store everything" isn't being done for fun; it is rather expensive. For sophisticated sensor analytics though, it is essentially table stakes. There are data models where it is difficult to get reliable insights with less than a 100 trillion records. (Tangent: you can start to see the limit of 64-bit integers on the far horizon, same way it was with 32-bit integers decades ago.)


> There are data models where it is difficult to get reliable insights with less than a 100 trillion records.

Example(s)?


Some remote sensing data models. Many population behavior data models; you discover that these are mostly garbage if you actually ground truth them unless you have completely unreasonable quantities of data.


Re: setting up a cluster

With TimescaleDB we've focused on single node performance to try and reduce the need for clustering. We've found performance scales very well just by adding more disk space if needed and more cores. So some datasets may not be practical on your laptop, but a single instance on Azure/AWS/GCP is workable. No need for a cluster to get started :)

(Read scale out is available today and we are working on write scale out, hopefully later this year)


There are a lot of use cases (for example security) where it is not the case.


This is a bit easier to grok: http://www.timescale.com/how-it-works

(if you're new to this like me)



I worked at a place that monitored power usage minute by minute across 1000s of locations. We just used MySQL with a time column. Maybe I'm not the target audience but I'm failing to see what this gets me.

The problem is they say the data is immutable and stored sequentially. Although our data was immutable, with devices on the net the data comes in random order when these inevitably have connection problems.

We always aggregated our data into larger time blocks. Storage was cheap and doing comparative analysis across locations and time zones was always our pain point.

I think using a Postgres database is wise.


Then effectively your insert rate is between 16 and 60 per second; sure, you don't really need a sophisticated partitioning or log-structured DB. Native static partitioning would give you a decent speedup without much thought.

It's intro computer science: if you have a tree structure and fill it up, you spend a lot of time in the corners of theta notation. Timescale uses tightly integrated partitioning on the time axis to deal with write performance and aging data out. Other popular TSDBs are plays on log-structured merge trees, etc.


Reducing "other TSDBs" to log-structured merge trees is misleading. Any large-scale TSDB has something sophisticated underneath, and LSM is often just one tiny part of that. I would argue (as most do) that if any TSDB "simply used an LSM" it would be doomed at any scale over time.


There was no reduction, it was intended as a pointer to one data structure some TSDBs are using underneath. I would bold and highlight the "etc" present there for you if the markup allowed it.

I hope a reader would become interested in what an LSM tree is (and perhaps as importantly isn't)


Perhaps not as lofty a goal as time series, but a good, commercial database for vectors (a.k.a. the representation building block of modern machine learning) is long overdue too.

A few open source options exist (Spotify's Annoy, Facebook's FAISS, NMSLIB) but these are rather low-level, more hobby projects than enterprise level engines (index management, transactions, sharding…).

After building a few document similarity engines for our clients we took up the gauntlet and created ScaleText, https://scaletext.ai. It's still early days but the demand for a sane, scalable, cross-vertical and well-supported NLP technology is encouraging.


At my employer, we've recently (as of the middle of last year) been making a considerable effort to use InfluxDB to track our KPIs. It's working out wonderfully for us, and I'm expecting it'll get used more and more as the year goes on.

What really floors me about Influx is how fast it is. A query that used to take hours in Oracle takes seconds in Influx. And the influx query is readable: rolling data up to various intervals produced nightmare queries in Oracle but is short and sweet in Influx. Want to take data gathered every 5 minutes and give a daily average? Yeah, good luck with that in Oracle. It's doable, sure, but I've seen the code. And the speed. Both are ugly.
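(To make the rollup concrete — this is just a plain-Python sketch of what "daily average from 5-minute samples" computes, not Influx or Oracle syntax; the timestamps and values are made up:)

```python
from collections import defaultdict
from datetime import datetime, timedelta

def daily_averages(samples):
    """Roll (timestamp, value) samples up to per-day averages."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts.date()].append(value)
    return {day: sum(vals) / len(vals) for day, vals in sorted(buckets.items())}

# Two days of 5-minute samples, values cycling 0..11
start = datetime(2018, 1, 1)
samples = [(start + timedelta(minutes=5 * i), float(i % 12)) for i in range(576)]
averages = daily_averages(samples)
print(averages)  # one entry per day, each averaging to 5.5 here
```

The point of a TSDB (or a `GROUP BY time(...)`-style query) is that the engine does this bucketing for you, close to the storage layout.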

The only problem I have with InfluxDB is it's still a little immature, and the tooling isn't where it should be. This isn't entirely Influx's fault; most third parties aren't aware it exists or don't care. Our reporting team uses Crystal Reports, which can't talk to InfluxDB. So I end up having to write a Python script that runs in cron every night to perform InfluxDB queries for the previous day's data doing all the rolling and average/min/max calculations and then inserting the results into Oracle, just so our reporting team can get to the data. For some KPIs I'm working on right now, we decided to not go through the reporting team, and I'm writing a webapp in Python/Bottle to display the report, and we're probably going to augment that with graphs from Grafana.

Grafana is beautiful, by the way.


How big is your database, and how long does it take to restart? We have what I would describe as a fairly small influx database, and it takes a long time to restart (20-30 mins). And the time to restart seems to be growing linearly with the db size. Not cool if you do regular server patching with reboots.

We also had a problem where user queries that return too many columns cause the DB server process to OOM. And then it restarts, so another 20 mins of downtime. Also not cool.

We liked the tagging and rollup features, and automatic retention management, but those first 2 problems really turned us off.


Try IRONdb... Relatively fast restarts, linear cluster scalability and no downtime when a single node is malfunctioning.


I can attest to how great InfluxDB is. Have used it in a few applications that rely on time based data and I am constantly blown away by how fast the query times are.


I think you're saying a lot more about Oracle than about Influx :)

I also am using Influx at my company and the experience is not that great, mostly due to it being immature yes. For example, I currently have runaway disk usage and I have no way to know what table is using it. So I have to choose between losing data or continuously increasing storage.


One bit of immaturity I just remembered: InfluxDB does a terrible job of telling you when you've fat-fingered a query. You can select from a series that doesn't exist or reference a tag that doesn't exist inside a 'where' clause, and InfluxDB will happily return 0 results instead of giving an error.

I've spent some time trying to figure out why I'm not getting data, and it turns out that it was because I typoed the name of the series.

I still like Influx though, warts and all.


Apache Hive is also a great piece of software that I would recommend for these use cases.


The article recognizes that several time series databases already exist. They also say, "we aren't trying to compete against kdb+." They explain how they can handle time series data better than NoSQL databases that aren't time series focused.

But what are they doing better than the existing time series databases? Surely they must have some advantage or they wouldn't have raised 16.1 million dollars.


Every time I read about some new solution to storing time series data, I always feel like I must be doing something wrong, but I've run into _zero_ problems so far.

Every time I have to store time series data, I never really need ACID transactions. I definitely don't ever need to do updates or upserts. It's always write-once-read-many. ElasticSearch has always been the obvious choice to me and it has worked extremely well. Information retrieval is incredibly robust and for times that I'm worried about consistency, I use Postgres's JSON capabilities and write first to there. You can have your application sanity-check between the two if you're worried about ElasticSearch not receiving/losing data.

I find it really hard to beat.


I had set up ElasticSearch for logging events and was thinking about using it to store metrics as well. It seemed like it ticked all the boxes and would work pretty well when I was messing around with it.

I ended up leaving that job and set up something specifically for metrics from scratch. I didn't compare 1 to 1, but this was much faster to query and had much lower requirements (disk space, memory, etc).

Both use-cases are using it as a time-series database, and there's no reason ElasticSearch couldn't work for both of them for many people. Using two different backends (one for events and another for metrics) meant drawing a line when creating/logging data and having two ways to query... which sucks for training. You also have to maintain/archive two different data stores.


We've had some clients try this, find consistency issues between elastic nodes or postgres and elastic and start using TimescaleDB as a way to simplify the stack and application. But obviously YMMV and this is highly dependent on your query needs.


A few reasons why you may want to use TimescaleDB vs other time series DBs:

1. For some developers, just having a SQL interface to time-series data (while maintaining insert/query performance at scale) is good enough reason to use TimescaleDB. For example, when trying to express complex queries, or when trying to connect to SQL-based visualization tools (e.g., Tableau), or anything else in the PostgreSQL ecosystem.

2. For others, the difference is being able to fully utilize JOINs, adding context to time-series data via a relational table. Other time-series DBs would require you to denormalize your data, which adds unnecessary data bloat (if your metadata doesn't change too often), and data rigidity (when you do need to change your metadata).

To quote a user in our Slack Support forums[1]: "Retroactively evolving the data model is a huge pain, and tagging “everything” is often not viable. Keeping series metadata in a separate table will allow you to do all kinds of slicing and dicing in your queries while keeping it sane to modify the model of those metadata as you gain more knowledge about the product needs. At least that is something we have been successful with"

3. For others, it's some of the other benefits of a relational database: e.g., arbitrary secondary indexes.

One thing to think about with NoSQL systems (including every other time-series db) is that their indexing is much more limited (e.g., often no numerics) or you need to be very careful about what you collect, as costs grow with the cross-product of the cardinality of your label sets (strings). We have heard from multiple people that they explicitly didn't collect labels they actually wanted when using [another time series db], because it would have blown up memory requirements and basically OOM'd the DB. (Timescale doesn't have this issue with index sizes in memory, and you can create or drop indexes at any time.) You often hear about this as the "cardinality problem", which TimescaleDB does not have.

4. There's also the benefit of being able to inherit the 20+ years of work of reliability on Postgres.

There was no database that did all of this when we decided to build Timescale. (If one did exist, we would have used it).

You can read more about that story here: https://blog.timescale.com/when-boring-is-awesome-building-a...

And for more on our technical approach: https://blog.timescale.com/time-series-data-why-and-how-to-u...

[1] http://slack-login.timescale.com/
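(The metadata-join pattern from point 2 can be sketched with a toy sqlite3 stand-in — this is not TimescaleDB itself, and the table names are invented, but the shape of the query is the same in any SQL store:)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Narrow time-series table: no metadata denormalized onto every row.
cur.execute("CREATE TABLE readings (ts INTEGER, sensor_id INTEGER, value REAL)")
# Metadata lives in its own relational table and can evolve independently.
cur.execute("CREATE TABLE sensors (sensor_id INTEGER PRIMARY KEY, location TEXT)")
cur.executemany("INSERT INTO sensors VALUES (?, ?)", [(1, "roof"), (2, "basement")])
cur.executemany("INSERT INTO readings VALUES (?, ?, ?)",
                [(0, 1, 20.0), (0, 2, 18.0), (60, 1, 21.0), (60, 2, 17.0)])
# Slice by metadata at query time instead of tagging every point.
rows = cur.execute("""
    SELECT s.location, AVG(r.value)
    FROM readings r JOIN sensors s USING (sensor_id)
    GROUP BY s.location ORDER BY s.location
""").fetchall()
print(rows)  # [('basement', 17.5), ('roof', 20.5)]
```

Changing the metadata model is then an `ALTER TABLE` on the small table, not a rewrite of the time-series data.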


I hate to be "that one" but:

> There was no database that did all of this when we decided to build Timescale. (If one did exist, we would have used it).

If you remove the based-on-Postgres part: except for KDB, IQ, Vertica, and a few others. I can definitely see a price argument though (i.e., the same but cheaper), as those all tend to be expensive.


We use Cassandra extensively at https://logrocket.com. How does performance compare vs Cassandra when you don't need the semantics or transactions of SQL? I'm surprised the article only provides benchmarks against PostgreSQL.


That is currently next in our pipeline for a benchmark blog post. Early results look good on both the read and write side in terms of raw performance, with the benefit of more complex queries being more easily expressible.


To add to this: We've found that the benefit of Cassandra's approach is when scaling out to 100s of nodes (which TimescaleDB currently does not support). But we found Cassandra's per node usage not that impressive. We'll have more benchmarks vs Cassandra soon (@robatticus is literally the person working on it), but suffice to say that users have already replaced multi-node Cassandra clusters (e.g., 10 nodes) with a single TimescaleDB instance (optionally adding read-only replicas for hot standbys).


I recommend you benchmark against ScyllaDB, not Cassandra.


I have a hard time shaking the feeling that database innovation is being forced not by necessity, but by a culture of blind relentless change.


OK, but is that bad?

It might be bad for the specific organization pushing their new solutions, but for the software industry as a whole, it seems like the good side of "throw spaghetti against the wall and see what sticks". Many ideas will fail, but the good ones will stick around. So I'd say by all means, let people feel free to innovate. And those of us who consume the innovations just need to remember the mantra of "leading edge, not bleeding edge."


I didn't say it was bad (though the tone of my post definitely hinted at it). I agree with you. Following the trail of necessity can easily lead to locally optimal points, away from the global maximum.


It (and tech innovation in general) is also driven by people scratching their own itch. That's where we started. And we actually didn't start as a DB company, but as an IoT company storing lots of sensor data. We needed a certain kind of time-series database, tried several options, but nothing worked. So we built our own, and then realized other people needed it as well.

And we picked Postgres as our starting point exactly because it wasn't new and shiny but boring and it worked.

More on this here [1] if you're interested.

[1] https://blog.timescale.com/when-boring-is-awesome-building-a...


I have to say that a sentiment similar to this is why we built Timescale as a Postgres extension and not started a new database from scratch. We didn't want a new shiny thing just for kicks but rather a product focused on solving a particular problem.


Relentless change is most likely the motivator for these innovations. However, I don't think it is needless innovation or innovation for the sake of innovation. My personal opinion is that there isn't much innovation in what this article covers, but it is happening elsewhere.

Changes in telemetry production (IoT and others) have fundamentally changed the requirements... millions (and often billions) of data points per second are now happening. This is being driven by RSM and IoT. RSM is alluded to in my ACM Queue article: https://queue.acm.org/detail.cfm?id=3178371


Silicon Valley on reinventing the wheel: "Ours is rounder!"


We had major success by simply batching incoming writes into ~15 second chunks and writing that as a file to S3 and an index that tracks how the files are split / chunked to make read performance decent.

This alone gave us an insanely scalable system (load tested against 100GB/day, ~100M records/day) for a grand total cost of $10/day for everything: server, disk, and S3. https://youtu.be/x_WqBuEA7s8

Works great for timeseries data, super scalable, no devops or managing database servers, simple, and works.

I believe the Discord guys also did something similar and had written some good engineering articles on it, give it a Google as well.
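(A minimal sketch of the chunk-plus-index idea described above — a dict stands in for S3, and a real version would flush on a timer and call `put_object`/`get_object` via boto3:)

```python
import json

class ChunkedWriter:
    """Buffer incoming records, flush them as immutable chunk files, and keep
    a small index of the time range each chunk covers."""

    def __init__(self, store):
        self.store = store         # dict here; an S3 bucket in production
        self.buffer = []
        self.index = []            # [(start_ts, end_ts, key), ...]

    def write(self, ts, record):
        self.buffer.append((ts, record))

    def flush(self):
        if not self.buffer:
            return
        start, end = self.buffer[0][0], self.buffer[-1][0]
        key = "chunks/%d-%d.json" % (start, end)
        self.store[key] = json.dumps(self.buffer)
        self.index.append((start, end, key))
        self.buffer = []

    def read_range(self, t0, t1):
        # The index lets reads touch only the chunks overlapping the query range.
        out = []
        for start, end, key in self.index:
            if start <= t1 and end >= t0:
                out.extend(r for r in json.loads(self.store[key])
                           if t0 <= r[0] <= t1)
        return out

w = ChunkedWriter(store={})
for ts, rec in [(0, "a"), (5, "b"), (10, "c")]:
    w.write(ts, rec)
w.flush()                      # chunk covering ts 0-10
w.write(20, "d")
w.flush()                      # chunk covering ts 20-20
print(w.read_range(5, 20))    # [[5, 'b'], [10, 'c'], [20, 'd']]
```

Because time-series writes are append-mostly, the chunks are effectively immutable, which is what makes object storage a good fit.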


Storing and retrieving data has never been all that hard. The challenge is having user-interactive performance on complex queries against the data. Comparing and correlating and deriving and integrating and ... (lots of other analysis). For many "scaled" systems, 100M records/minute isn't uncommon... and while that's very likely possible with your design the question of economic feasibility enters. Solving these problems at scale with good economics is the playground of TSDB vendors today.


I think I say this about every time, because I see it as the major flaw in almost every approach to this problem. I say this from the perspective of having used a wide range of TS databases, but mostly KDB, Oracle, and prop solutions with Cassandra and Mongo products thrown in.

There are two big problems with relational SQL databases: the storage and the query language. The two aren't separate concerns. While projects like this might fix the storage issue, they will never be simple or fast without also changing the query language too.

SQL doesn't fit the column/array paradigm well. There are extensions that attempt to close the gap, but they are slow and complex compared to just fixing the query language. Operations like windowed queries and aggregates can be painfully slow because of the mismatch between SQL and the ordered data.

I would absolutely love to work on a product that attempts to fix the query issue at the same time as working on the storage and interface issues. You don't need an obscure APL-like language, just something that promotes using the properties inherent in time series to the front.

Also, it is a bit weird to say you are not trying to compete with the likes of KDB then go on to say how good you are at its core competencies. As always, eagerly awaiting the TPC-D benchmarks and comparisons to the heavy hitting time series dbs out there, not the easily crushed FOSS products (except for maybe AeroSpike that I wish I had more experience with).


PostgreSQL with jsonb columns can be quite powerful, best of both worlds; but yes the syntax for querying within json is a bit arcane.


How about one made by Rich Hickey — https://www.datomic.com/


Sadly, Datomic lacks two of (arguably) very important features in this space:

1. It seems to not be optimized for speed (but it's difficult to say since the license forbids publishing benchmarks).

2. It's not open source.


I emphatically agree with #1 but object to #2. Datomic, on an architectural basis alone, is going to be strictly faster than say an RDBMS, because it can parallelize almost all of the workload. Read-side is entirely elastic; index and storage maintenance does not block the writer, and the single operation that must be serial is the actual ACID write primitive, the conditional-put to DynamoDB. In RDBMS, the ACID writer is competing for resources with all these other operations, because they can't be parallelized without sacrificing consistency.


We are talking about time-series databases here. I guess they are more optimized for the specific workload than a generic RDBMS.


Is Datomic good for timeseries? AFAIWA it's good for rare-write, often-read applications


We use a proprietary database system that uses a 'flat-file' format (no idea what that means) and is primarily time series based due to the fact that we're logging sensor data. Since it's primarily a backend, you can't access it outside of their proprietary gui. It's also accessible as a linked server via SQL Server, but this is slow as hell for non-trivial queries. We use it within a power plant setting where we heavily prioritize db-writes, which this software is apparently very good in, and db-reads are less of a focus.

I'm not sure if moving to another DB system would be beneficial, but I would be very grateful if accessibility were much less of a hack


We have a similar setup for our telematics data (insurance company). It’s stored in files which contain DEFLATE’d BSON. Similarly, we have a custom GUI for viewing it, but do also have some terminal tools which can be used.

The solution to larger processing simply seems to be loading the files and batching them through whatever processing we want and storing the output in a proper database - which can then be queried instantly.

For a lot of time series data, this does seem like a pretty decent approach.


What I often miss with these kinds of databases is compression capabilities. I currently use my own delta-of-delta encoding for time series data (stored in postgres) as that gave me much better compression than any of the time series databases I tested.

It's not that storage isn't available, it's easy to store terabytes of data. But really fast storage is still expensive and rare. Being able to compress time series data to 3-5% of the raw size allows keeping most of it in memory, speeding up reading, writing, and analysis.
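(For readers unfamiliar with the scheme — a sketch of delta-of-delta encoding; a real implementation, like the Gorilla-style codecs most TSDBs use, would bit-pack the residuals rather than store plain integers:)

```python
def dod_encode(timestamps):
    """Store the first timestamp raw, then the first delta, then delta-of-deltas.
    Regular sampling intervals collapse to runs of zeros, which compress well."""
    out, prev, prev_delta = [], None, None
    for ts in timestamps:
        if prev is None:
            out.append(ts)
        else:
            delta = ts - prev
            out.append(delta if prev_delta is None else delta - prev_delta)
            prev_delta = delta
        prev = ts
    return out

def dod_decode(encoded):
    if not encoded:
        return []
    out, prev_delta = [encoded[0]], None
    for v in encoded[1:]:
        delta = v if prev_delta is None else prev_delta + v
        out.append(out[-1] + delta)
        prev_delta = delta
    return out

ts = [1000, 1060, 1120, 1180, 1241, 1300]   # mostly 60s apart, with jitter
enc = dod_encode(ts)
print(enc)                  # [1000, 60, 0, 0, 1, -2]
assert dod_decode(enc) == ts
```

With near-regular sampling, almost every residual is zero or near-zero, which is why a few bits per point is achievable.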

InfluxDB is quite good at that but unfortunately, they struggle storing many small datasets.


Isn't it a bit weird to launch a "never delete anything" store today and not mention anything about European GDPR requirements?

How would you go about deleting a users data upon request?


You delete it. It's still a database. What's being done here is just smart partition management as an extension to Postgres, similar to CitusDB.

You can also use any column-oriented relational database with a time-based partition key and do the same thing.


What are the benefits of TimescaleDB or InfluxDB? Surely the enterprise pricing is going to be similar to KDB, which has been around for years and is a very mature product.


How is TimescaleDB different from OSIsoft PI?


Not just OSISoft PI, but there are many other historians for time series data.

https://en.wikipedia.org/wiki/Operational_historian

At least in my experience, historians are rarely recommended for complex or ad-hoc queries. Typically you just pull the data (by tags) into another application and do your data processing there. It looks like TimescaleDB lets you execute complex queries in the database. Historians typically only let you fetch data by tag and you need a metadata (e.g. asset management) framework on top to organize the data (e.g. give me avg temp every 5 minutes by sensor). It looks like with TimescaleDB you can have strings/text as fields within the time-series table, which removes the need (to some degree) to join the data with a separate metadata database.

I've also never heard of anybody using these commercial historians for time series data you'd see from non-industrial processes (e.g. stock data, price tracking, GPS location of people or moving assets, time between clicks on a website, etc).

All that being said, OSISoft PI and AF have their warts, but OSISoft has been around for a while and PI has been battle tested in various industries (e.g. Oil & Gas, Manufacturing). It's closed source and you have to pay for it, so it's probably not attractive to startups and smaller companies. But it does come with a support organization if you need it and can pay for it. And IME data retrieval from PI is extremely performant!


Yeah, to underscore some differences more completely, timescale has:

- Full indexing and secondary index support.

- Support for transactional semantics

- Support for living along side relational data - including foreign keys to the relational data.

- Full trigger and constraint support

- Support for all of Postgres's native data types including JSON(B) and GIS location data.


Transactional support? I was under the impression that timeseries data is considered immutable.


Being transactional is orthogonal to being immutable. For example, if you want to ensure that a given set of writes is applied atomically, you would issue them in a transaction.
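(A toy illustration of why that matters — sqlite3 standing in for any transactional store; the table and constraint are made up. Either the whole batch of writes lands or none of it does:)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (ts INTEGER, value REAL CHECK (value >= 0))")

good = [(0, 1.0), (1, 2.0)]
bad = [(2, 3.0), (3, -1.0)]   # second row violates the constraint

with conn:                     # context manager commits on success
    conn.executemany("INSERT INTO readings VALUES (?, ?)", good)

try:
    with conn:                 # ...and rolls back the whole batch on failure
        conn.executemany("INSERT INTO readings VALUES (?, ?)", bad)
except sqlite3.IntegrityError:
    pass

count = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
print(count)  # 2 — the valid row from the failed batch was rolled back too
```

Without the transaction, the first row of the bad batch would have been committed alone, leaving a partial write.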


This is especially important if you are also dealing with metadata that you don't want to always denormalize into your time-series tables. If you don't insert it transactionally, you can also lose referential integrity between your time-series and relational (metadata) tables.


That's not always true. There are tons of cases where time-series data gets corrected later. It is true that time-series is INSERT-mostly but UPDATES do happen.


I almost thought this was going to be an ad for KX as part of the subject is their slogan...

Glad to see open source tools in this space gaining traction.



