To take a contrary position: whose infrastructure is so large that a specialized SQL and storage engine (like Influx) is necessary? Actually not that many companies...so why does infrastructure always end up with these special snowflake versions of things?
Granted this was an especially bad case because it was the shittiest real time database ever built.
And because Timescale is an extension in Postgres, it can actually sit inside (and coexist) with your standard Postgres database if you want. Or, it can be deployed as a separate instance that can transparently JOIN data against your main DB via a FDW.
My personal preference is to hire experienced teammates who have solved similar problems in the past, because at most levels of scale (other than, say, Google's or Facebook's), few problems are truly novel. So it's reasonably possible to predict where the business will be WRT technological requirements in, say, 1 to 3 years, sometimes 5 if you're just getting started, and plan accordingly.
In the case of this particular problem (time series data), there are some pretty easy and scalable answers out there -- Datadog, SignalFx, Wavefront, Prometheus, etc. to start with. So it's not like you have to waste a bunch of opportunity cost going out and building a solution.
We've seen something of a convergence here with ELK -- people push their syslog into the same system they use to provide general text processing for the application, instead of having a separate/weird storage engine for infrastructure logs.
Also, if you find yourself JOINing timeseries on more attributes than the timeline itself, you should question whether you really have a solid use case for a timeseries DB.
That being said, always good to see competition in the market, especially if it's built on such a rock solid product and community.
I really like that Postgres is "good enough" solution for almost every use case by now besides relational data, be it document storage, full text search, messaging - or time series now. Nothing wrong with having less stack to worry about, especially for prototypes and small scale projects!
FWIW, I think that's currently a good chunk away from being a significant bottleneck in postgres' query processing. But we're chipping away at the bigger ones, so we'll probably get there at some point.
If one goes to the heavy contenders in this space, e.g. Teradata, you may expect:
- DMA for data retrieval
- A suitable and linearly scalable network layer with Remote DMA
- Row, columnar and hybrid storage options
- Utilization of CPU vector options
The analytical database has become a commodity. I really like Postgres, but I would still do a very careful analysis of my business needs if I were to choose a DBMS when there is such a strong range of commercial options available.
Maybe Postgres needs pluggable storage engines .
I read Postgres 10 might have this, but looks like it will miss the deadline.
That said, we may have to get a little creative to support some specific use cases (e.g., bitemporal modeling). Still TBD.
Also, I agree that SQL isn't for everyone (the popularity of PromQL is evidence to that). But a lot of people have been using SQL for a while (personally, since 1999), and there is a rich ecosystem (clients, tools, visualization tools, etc) built around it.
And temporal and bitemporal databases (despite the name) are orthogonal to the aggregation and windowing issues that make time series difficult in SQL or a row-oriented database. The are just a couple of timestamps and where clauses to support point-in-time queries.
Maybe this is why so many time series databases fail. People making them often don't seem to fundamentally understand the issues, Very few, such as Kx and KDB, seem to understand them.
If the data is not small for that specific time period and, than you will hit some IO problems, if you create the appropriate indexes it may be efficient at first but then you may need to switch to columnar storage format if you really need it. If Timescale uses immutable shards which are basically Postgresql tables, you can easily re-write the historical shards periodically using the columnar storage format rather than default row-oriented table format.
That's usually how most of the time-series databases work. The hot data is in row-oriented format and the system uses buffering to write the efficient columnar tables periodically under the hood.
The usual thing that prevents indexes from scaling with large data is that inserts slow down as soon as the index BTrees can't fit into memory. TimescaleDB gets around this problem by auto-sizing of tables: we start new tables as old tables (and their indexes) grow so large that they can no longer fit in memory. This allows us to have tables that are not so big that inserts become slow but big enough that queries don't have to touch many tables (and are thus efficient).
However, as data sizes grow, you may want to convert data to column-stores to save disk-space though, as you allude to. We are looking at how best to do this and the best "archive" format to use. Stay tuned.
Your observations on cstore_fdw are accurate. I'm adding a few points about our learnings from cstore deployments in production.
From a use-case standpoint, we find that most cstore deployments use it for data archival and to a lesser extent data warehousing workloads on Postgres. Some of these deployments are fairly sizable (Fortune 1000) and they primarily benefit from compression that comes with cstore_fdw (3-10x).
Citus + cstore_fdw deployments exist, but they are fewer in number. This isn't because of an architectural issue related to cstore or Postgres FDW/extension APIs. Rather, making cstore work for real-time workloads require notable implementation work. In particular:
(1) Real-time ingest & update / delete capabilities: Columnar stores have a harder time providing these features. The most practical way to achieve this is by partitioning your data on time, using a row store for the most recent data, and then rotating that data into a columnar store.
Users who use cstore for real-time workloads, manually set up table rotation themselves. In fact, this type of integration between Citus (distributed db) and cstore_fdw (columnar storage) has been one of the most requested features: https://github.com/citusdata/citus/issues/183
(2) High-availability & disaster recovery: This requires that cstore integrates with PG's write-ahead logs. Fortunately, Postgres 9.6 makes this possible / pluggable and it's on cstore_fdw's immediate roadmap.
(3) Indexes for quick look-ups: cstore_fdw comes with built-in min/max indexes. This helps but doesn't provide quick look-ups across any field. We thought about integration with PG indexes -- one challenge there involves PG indexes' storage footprint and potential hit on compression ratios.
(4) Query planning and execution tuned for columnar store: If your workload is bottlenecked on CPU (rather than disk), these type of improvements can help a lot. For now, we're happy to see both recent and upcoming analytical query performance improvements that are coming up in Postgres 10 and beyond.
Also, putting cstore_fdw aside, I'm a huge fan of Timescale and the work they are doing! My favorite "feature" is that Timescale is a PostgreSQL extension. :)
This saves on space but also often improves query latency, provided your execution engine supports queries on compressed columns - less data means more of it fits into cache, meaning less cache misses (i.e. expensive disk reads).
edit: So what I mean is compressing the entire files or disk blocks with LZ might not take advantage of the same properties and might not get you the same deal - here often the in-memory data layout is also compressed, and queries execute efficiently on compressed data, versus with a blindly compressed filesystem you'd ostensibly have to always decompress first.
Sadly, PostgreSQL does not support that now, but hopefully it's not that far in the future. It's becoming a fairly regular topic in discussions both with customers and devs on conferences.
We've certainly discussed running on ZFS internally, but haven't evaluated yet. We're a bit uncomfortable dictating file system requirements to users, so looking to ultimately provide many of these things in the database instead.
Would welcome any testing/benchmarks, though! :)
I'm in the process of migrating from influxDB to a combination of Kafka and TimeScaleDB.
Has anyone beaten me to the lunch with testing if it works with [PipelineDB](https://www.pipelinedb.com/)? I can see that being a powerful combination, perhaps also tied to their forthcoming [Stride](https://www.stride.io/) analytics API.
Just so you're aware, upgrading between versions of TimescaleDB at this point is not fully supported. Meaning, we are still breaking some non-trivial things between release and don't have upgrade scripts. I don't want to discourage your usage, but I'm letting you know about the potential pitfalls.
We hope to start supporting smoother upgrade transitions in the near future.
We tried: so many continuous query rollups...
Eats into ability to write data... at only 1000/sec (artificially tripled due to splitting into multiple measurements to get around cardinality problem)
Dropping writes even on 24 core 500gb ram, SSD Dedicated.
Can't scale when also used to query against.
Basically: their claimed performance looks great; but real works doesn't match up.
They've pushed fixing cardinality two quarters.
No reply on my github issue with broken top()
Compression is good... is about all I can say lol
Also, github if you just want to see the code:
I've been working with TS databases for a long time now, and it never fails that every database vendor always has benchmarks showing they are the best (no put that more bluntly, when you come out with your own benchmark suite and you are the fastest/smallest/bestest I won't be surprised or believe you).
I don't expect you to be the fastest when having a row-oriented architecture, and it would be an unfair comparison against the non-free databases, but I would like realistic numbers.
Actually, if you came out 2nd to 3rd against competitors' workloads, I would be far more impresses.
(But really? A row-oriented TS database?)
Definitely agree that sometimes its hard to take numbers from a company at face value. As you said, we don't necessarily expect to show that we are the best at all types of queries (and will include query patterns where we may not be the best), but for those who need more SQL features and richer queries we expect Timescale to perform well. Additionally, working with people's existing stack and database is another key win.
I notice they only compare against free versions. I wonder if they have tried against any of the commercial vendors? The comercial vendors right now tend to lead the FSS pack by a very large gap.
To be clear: our tests are all using the standard PostgreSQL interface, not JSON-over-HTTP. And we don't exclude any SQL parsing time in the DB.
Give dataset, X, how fast does query Y return.
Would also love to learn more about your use case... If you feel inclined to share more, please shoot me an email firstname.lastname@example.org. Thanks again!
Full SQL interface
Scale-out distributed architecture
Auto partitions across space and time
Single-node and cluster optimizations
Complex query predicates
Native JOINs with relational data
Labels and metrics support
Geo-spatial query support
Fine-grained access control
Reliable (active replication, backups)
Automated data retention policies
I work a step upward in the stack (analyzing the sensor data and building models based off of it - usually using SQL) so I'm not an expert on underlying technology.
Over the last 15 years or so my work (industrial plant) has been big on historians. We've moved from in house bespoke stuff from the 90's which as I understood it was basically a custom written ring buffer that held 1-2 days of live data and would get down sampled into oracle tables as 1min/ 6min/hourly averaged data every hour or so.
Prior to this in the 80's we used a database technology called "IMS" but I'm not as familiar with how this system worked.
Nowadays in all our newer systems all I hear from the low level guys is "OPC" (which I gather is a messaging format) and historians. Time series databases sound like they should fit in somewhere but I'm not sure where - given other people use them for logging high frequency sensor data sounds like they could be a solution.
For sensor data collection, it's very hard to beat. Lay something like spark or filodb on top for more flexibility and it's virtually unmatched.
Why didn't you use Scylla, which is 3x to 10x faster than Cassandra?
We ran hundreds of machines (per cluster), and multiple clusters.
We don't use Scylla for dozens of reasons, including (but not limited to):
- The product was written before Scylla existed
- Our benchmarks didn't show Scylla as being 3x faster, let alone 10x faster
- Our machine count wasn't based on Cassandra limits, it was based on the amount of data we were comfortable storing on a given machine
- We don't throw shiny new tech into a critical stack just because someone says it's faster. I'd rather run code that's tested at scale. ONE of my Cassandra clusters there would have been 10x larger than any known Scylla deployment. At the same time, that deployment is at least 2 orders of magnitude smaller than other published Cassandra deployments (meaning largest known Cassandra deployments are 3 orders of magnitude larger than the largest known Scylla deployment)
The one thing I agree with the author: databases aren't where you want to have hipster tech. You want boring things that work. For me, Cassandra is the boring thing that works.
If there's a downside here, it's that this database is still a little too immature to have a great use-case.
But by all means, it's not my money being burned, so I'm not losing sleep over it. I'll just sit here and giggle as people talk about scaling Postgres and in my mind all I hear is single point of failure masters and vacuum to prevent wraparound in high write workflows
Does it waste a certain amount of effort in duplicated work and products? Sure. But we also get new projects that may try new approaches, gain different experiences, and train new developers which all helps the market expand. Ultimately it's better to have yet another version than nothing at all.
Also, regarding vacuuming, TimescaleDB does not suffer from it much under the normal workloads it targets because time-series data is largely immutable and deletions typically happen at the chunk level (i.e., dropping complete tables), avoiding any vacuuming during deletes and making wraparound vacuuming more efficient.
Wraparound vacuuming still sucks for high write workloads. I've been there. I've fought that problem in a high-write-throughput-no-delete-immutable-workload. I've seen it in person. You're still writing a lot of frozen txids to disk. Your slaves are still going to get the WAL command in a single-threaded WAL sender and fall behind in replication as that vacuum runs. You're still going to have pain trying to create an HA setup.
I'd also add the link to the post explaining the principle of how time series is actually stored which makes it capable of receiving 150K+ data points per second on relatively small hardware:
Fundamentally though, what timescale does is different in that they store (as far as I can tell) every incoming data point, which is more of a "big data" problem, whereas Tgres is all about what you can do with Postgres if you make the series regular (i.e. evenly spaced) on the fly and store time series as round-robin archives of fixed resolution and span, so I don't think comparing Tgres and Timescale makes a lot of sense.
The author mentioned write performance, but didn't touch much on read performance. Are there any benchmarks you can share in that direction?
Also the author didn't touch much on the space requirements or the node splitting expectations.
Those two areas would be useful to explain in greater detail.
Space requirements can admittedly be a bit higher than some other time-series databases or column stores. Since disk is cheaper than memory, and with easy ways of doing retention, we think its a worthwhile trade-off for people for now. But compression is something we will look to down the road. For our later benchmarks we will include memory/space comparisions to help people evaluate.
As for node splitting, our clustering solution (not yet released) is being worked on quite a bit as well. When its closer to being ready we'll include our thoughts on how to best manage your partitions.
This also plays nicely with configurable data retention policies if desired, e.g., only store raw data for X days, but more aggregated data for longer periods of time.
For DW needs (including time series), MemSQL is still the best option for distributed SQL unless you absolutely need postgres features. Proprietary w/mysql flavor but integrated columnstore + memory rowstore built for performance and fantastic ops management software makes it a great system. Also has interesting kafka and s3 ingest pipelines.
For an operational db, Citus is a great scale-out option for postgres. You can combine it with cstore to get columnar compression for DW use but at that point it's better to just use a more focused product. There's also CockroachDB which looks good (if it ever gets out of beta). I'd also recommend ScyllaDB + JanusGraph once both mature to get a scalable multi-master graph database if HA/reliability/replication are a big concern.
Timescale DB is interesting and it's always nice to see more options, but right now there are better data warehouse options already that work just fine for time series + SQL. A built-in columnstore would go a long way to making it competitive though.
Additionally, TimescaleDB comes with optimizations on the query side specifically for time-series data.
So you distribute the childtables to several nodes of a server cluster.
Is network latency a problem? I guess one should colocate the servers in one location rather than spread it out?
How good does it work when nodes die?
Do you use query parallelization (available since 9.6 in vanilla) on a single node and across different nodes?
Query parallelization works in single and multi-node cases.
This is also true at write time: You can insert a batch of rows in a single insert, then the query planner will split them into sub-batches, which it then will insert into the appropriate chunk. So, you don't need to "pre-partition" your data before writing to the system (even in the forthcoming distributed version).
Key quote of interest: "135K writes per second per node" this is very good! Quite impressed.
It can use postgresql as an index store.
I was investigating the same topic (PG based timeseries database) for a stock tick data project, would definitely give timescaledb a try.
Since financial data is mentioned in the blog, would be curious on how it performed / scaled in practical.
We're still preparing some benchmark numbers/performance numbers that we will hopefully share in the coming weeks. We do have some write performance numbers in there as you can see. There is also a lot of churn at the moment as we're still in beta and refining some key features, so I don't want to speculate too much on how performance looks until after we get a few more of our query optimizations tuned.
I quite like Postgresql (and deploy it all the time), and I'm no fan of nosql stuff, which just means you don't have to properly analyze your database structure before-hand, but with time-series it's different matter. The data you tend to send to generic time-series databases tends to be very unpredictable. I currently don't care what data is sent to Prometheus or Influx. This includes, but is not limited to ZFS stats of our storage, system load, network traffic, VMWare, nginx/haproxy, application usage and errors, ... I know that when I'll need it, I'll have it available and can try to correlate the data at any point in the future. In TimescaleDB it looks like I would have to pre-create tables fitting all those stats, which would make it an absolute pain in the ass.
As counter-intuitive as it sounds, to do it properly/flexible/usable in a real world, you should impose a fixed/limited database structure. The tgres project mentioned elsewhere in this thread seems a lot more useful for 'generic' cases. Maybe they can be combined in some way.
Also, some other important things in the time-series world are a REST api. Yes you might want advanced SQL to query data, but no you don't want that to insert it. Pretty much any application can do http(s) stuff without any additional libraries (except when you're talking about C/C++ things). Postgres on the other hand? Oh no I need a JDBC driver for this Java app, I need an extra Python library for X, ... Not to mention you suddenly have to manage database pools in the applications and have something that is a potential bottleneck.
Then, no Grafana support is a no-go. It is the de-facto standard visualization tool these days.
What about retention policies, you don't need ms-accurate data-points from 2 years ago, when talking about many data-points, you might want to aggregate this. Though since the data set in TimescaleDB will be limited due to specific data-sets, this could be less of a problem.
Alternatively if you do want pure blog storage, Postgres's JSONB datatype works as well, although with some performance trade-offs. This can work quite well if you have some structured data that lives alongside other unstructured data. 
Building or putting a REST front-end in front of PostgreSQL should not really be an issue (we built one for our hosted version), and there are already a fair number of PostgreSQL clients/libs. That said, for ease of use we are already thinking of adding an HTTP interface.
For Grafana: We were actually working on our own connector to support PostgreSQL, but found out that one is already in the works by the Grafana team (which will work out of the box with Timescale, because each of our nodes look like PostgreSQL to the outside world). This doesn't seem like it'll be an issue for very long.
We've mentioned retention policies elsewhere, but you do bring up a good point. Instead of dropping data, support for aggregating data in older chunks to a more coarse-grain resolution is something we are already looking into.
The main reason time-series databases are 'hot' these days is because ops jumped on them, and I think understanding how metrics are being used there and for what purpose is the key here. My first reaction when something generates 'events' or data - whatever that might be - is simply to push them into a time-series database without thinking about them. Is it data from snmp (a switch, firewall, ...), generic machine stats, database stats, application metrics, ping statistics, time until an ssl certificate expires, ... you name it - I don't care, maybe it'll be useful, maybe not. I don't even dare to estimate the amount of different metrics I'm currently tracking.
Real-world example: did I think I had to know what the sizes of my different caches on my ZFS storage units were? Not at all, but Telegraf pushed them anyway - so whatever. They actually ended up being very useful tracking slow-downs on our fileshares caused by some rogue process scanning all files on them, completely trashing the caches in the process. I had the data right there at my disposal, and I didn't really knew the details of what exactly I was tracking until I had to take a closer look.
These use-cases are something TimescaleDB's approach on it's own would be completely unsuitable for. As you mentioned, JSONB has performance trade-offs and blindly using that for everything sounds like a recipe for disaster. It also only addresses one problem, complexity to use is another, and everyone having to define their own structure and insert queries is yet another: there is no standard. While there are quite a few time-series out-there, most popular tools support the most important-ones, but I don't see how you can add support for TimescaleDB since the data-structure is completely undefined.
That's why I mentioned the tgres (which I had encountered before). Something like that, backed by a Postgres database with TimescaleDB's extension could be very interesting. Doing the 'generic' metrics through a simplified, well defined interface with some fixed table-structures, while still allowing way more powerful 'direct' metrics to address niche needs sounds very interesting to me.
Currently however, I only see TimescaleDB useful within one well-defined application, where it can be very valuable. In the grand scheme of 'ops' things however, it feels too limited and inflexible, and not really like a time-series database.
> For Grafana: We were actually working on our own connector to support PostgreSQL, but found out that one is already in the works by the Grafana team
Ah yeah, I forgot that the Postgres datasource type was added in Grafana, yes that would do.
Other thing is time-series-specific functions, but I'm not really familiar with all stuff Postgres offers, I know it's a lot - but I couldn't find stuff like regression analysis/derivative functions. Or maybe it's my google-foo is failing. Providing metrics-specific functions in the Postgres extension should be possible though.
(We also don't currently support joins, while TimescaleDB's joins sound pretty dope :))
- Both Aerospike and Honeycomb.io don't support full sql queries. Instead supporting their own custom (and more limited query format)
- Aerospike is not optimized for time-based queries and is more like a key-value store. You cannot get the same performance
- Honeycomb is a column store. Ours is built on Postgres and can work with your existing Postgres databases
As for performance, we are still working on gathering numbers that we hope to share soon comparing us against other solutions including Influx and similar.
You could use a timestamp as the key, and put data as json, but now you need to make sure that all data arrives at the same time so it can be inserted once, otherwise you will be doing series of updates.
For querying you are very limited, because you mainly want to query by the main key, which would be the time. You can get only one key at the time, you can't request range of data or data fitting specific criteria (unless you use some extra indices, which comes with own issues). If you have holes between the keys you will need to find some way to make them predictable, because listing all keys is a very expensive operation that only makes sense to use when testing.
Expiration of old keys... generally it is not pretty either, riak has bitcask backend which can expire old data, which could help, although the expiration is more of "time guaranteed that given key(s) is not purged". I'm not sure if DynamoDB has this as well, for example riak's leveldb backend doesn't allow expiration.
There's no dedicated time series database in AWS the closest thing to it would be using a relational one.
 BTW: don't use "TSDB" initials, because that's a name of another time series databases that runs on top of hadoop, it confused me a bit :)
Are you guys familiar with it and I am understanding this properly?
VoltBD’s focus on all in-memory processing and general OLTP workloads (so-called ”NewSQL”) is taking a different point in the design/problem space though.
- Automatic table-sizing. This includes dynamically creating new tables for new data, closing old tables as soon as they get too big, etc. (more challenging than meets the eye because of deadlocking issues, etc.).
- Related to above, optimized inserts to correct chunks. In Postgres 9.6, this is slow using rules/triggers. In upcoming Postgres 10 the situation is supposed to be better (waiting for finalization of 10 to test). But in Postgres 10, auto-closing and table creation are still not supported, and implementing them would probably require the slow path. In contrast, Our inserts are very fast.
- Hash-partitioning query pruning according to partition key
- (Time based pruning supported also, but that's standard Postgres)
- Time-based query optimizations. For example, we enable the planner to group by time-intervals using indexes on time-columns. Plain postgres can't support that because it does not know that truncating/rounding time is order-preserving. Other optimizations in active development.
- Our upcoming clustered version will include distributed query optimizations. In addition our scale-out design takes advantage of many workload features unique to time-series (see white-paper for more details)
I've actually used inheritance far more than partitioning, but I know of others that would love to see inheritance die.
Can you speak as to the stability of your system for production usage?
Right now we're looking for feedback as people use it on their own machines or experiments. Don't want to mislead people into using this for something critical and having it not work.
In the next few months, I'm going to be examining the feasibility of Postgres as a timeseries store, with an eye towards deployment onto RDS for production.
Might have some feedback then.
Best regards- this project looks like The Right Way Forward.
So, as we've mentioned in some other replies, this means that our partitions/chunks are typically based on two-dimensions (time and a primary partition key), chunks across nodes are automatically created and closed (based both on time intervals and sizes), and there are a bunch of time-based distributed query optimizations and time-series-specific functions that we'll provide out of the box.
More information in whitepaper: http://www.timescale.com/papers/timescaledb.pdf
Adding clustering on sounds like the hardest part.
Does Timescale have any baked in support bitemporal constructs?
Are temporal foreign keys supported?
We do not currently support bitemporal constructs.
What's an example of a temporal foreign key query that you need?
Basically, does the database have an opinion on time oriented data consistency?
We are not really built for that use case. Instead of focusing on fundamentally relational data such as people or jobs (which you are free to store alongside our stuff, using plain postgres functionality), we store event or metrics data. This data can be easily joined against the relational data (e.g., using a person_id stored in both event and relational tables).
That makes sense, thanks for answering ;)
One thing we'll note is that currently JOINs between two time-series tables (what we call 'hypertables') are not optimized, but we're working on it! :)
At a high-level, Citus is solving a very different problem: general analytics and transactions at scale, ideally for B2B applications. On the other hand, Timescale is laser focused on time-series workloads.
1. Time-series data (data collected over time, e.g., DevOps data, financial data, sensor data) is a growing (and lately, quite popular) type of "big data" thanks to a number of trends: more sources of data, cheaper storage, rise of a monitoring culture, IoT, etc.
2. One of the characteristics of time-series data is that it grows very, very quickly (e.g., when collecting data every few seconds), making it hard to use traditional relational databases for storage. In response, people have developed specialized time-series databases, which so far achieve scalability by sacrificing query power.
3. We just released a new time-series database that takes a different approach, achieving both scalability and query power. One of the benefits of our approach is that one can use normal SQL to run queries, something that was missing in prior time-series databases. Another benefit is that we are tightly integrated with PostgreSQL, one of the most popular, reliable, stalwarts of the database world. A lot of people find this approach useful.
If you collect any type of time-series data, then I welcome you to take a closer look. Happy to answer any other questions.
We expose a more traditional postgres schema, which is a different way of thinking about the data model if you come from a monitoring world.
So you can shove data into a postgres JSONB or array type if you want, but you might more commonly have all your data items stored as individual columns. And different/many metrics can be stored in a single row, so (1) you aren't paying for the label cost with each metric, (2) our query planner makes it efficient to support complex predicates on these metrics. Like:
SELECT date_trunc('minute', time) AS minute, COUNT(device_id)
WHERE cpu_level > 0.9 AND free_mem < 1024
AND time > now() - interval '24 hours'
GROUP BY minute
ORDER BY COUNT(device_id) DESC LIMIT 25;
Datadog, SignalFx, Prometheus, and Wavefront all have this ability. It's now standard functionality for any reasonably advanced time series monitoring implementation.
The fact that your implementation happens to be backed by Postgres is interesting, but it's merely an implementation detail as far as power users are concerned.
This kind of problem is really an OLAP cube problem, yet you're throwing an OLTP solution at it. I've seen it tried before at several sites and it's been abandoned each time for having crippling performance and management behaviors. What are you doing differently that will make this attempt successful?
We allow up to two dimensions of partitioning: we always partition by time, and we allow you to specify another key on which to partition. Allowing multiple keys to partition by is probably far down the road.
As for the number of partitions on that non-time key: I don't think we have a limit (or its something like 32,000).
Not sure if that answered your question.
How does this compare to Citus?
But at a high-level Citus is solving a very different problem (general analytics and transactions at scale, ideally for B2B applications).
On the other hand, Timescale is super focused on time-series workloads.
- I'm not your padawan
- Postgresql is cool, why do you say it is boring?!
- yes awesome some PhD People are vouching with there titles to a peace of software i might wanna use in a production environment? Nope. Tx for the efford and i will watch it but no.
I do like that you do that work and tx for this but why not wrting a post which goes into details instead of this hipster pseudo tl;dr text?