0 - https://www.theregister.co.uk/2017/07/13/will_the_last_perso...
But try telling people you're using SQLite to store critical data...
Unfortunately, if you need to do aggregated queries across all of the SQLite tables, things may be challenging.
But if you could somehow connect Spark to a folder (on a distributed FS) of these SQLite files...
Edit: Also, SQLite has a limitation that only one process can write to it at a time. For this particular use case, though, it shouldn't be a problem unless you have rewrites coming from various sources (which can happen when correcting data)
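For what it's worth, the single-writer limitation is usually manageable with WAL mode and a busy timeout, which let readers proceed while one writer is active. A minimal sketch (the file name and schema are made up for illustration):

```python
import sqlite3

# Hypothetical database file and schema, just to show the pragmas involved.
conn = sqlite3.connect("metrics.db", timeout=5.0)  # wait up to 5s on a locked DB

# WAL mode allows concurrent readers while a single writer is active,
# which covers most append-heavy logging/telemetry workloads.
conn.execute("PRAGMA journal_mode=WAL")

conn.execute(
    "CREATE TABLE IF NOT EXISTS readings (ts INTEGER, sensor TEXT, value REAL)"
)
conn.execute("INSERT INTO readings VALUES (?, ?, ?)", (1700000000, "temp", 21.5))
conn.commit()
```

A second process writing concurrently will simply block for up to the timeout instead of failing immediately, which is often all you need.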
Do you do filesystem replication or something else for HA? Is this running in one of the cloud IaaS providers or something else?
Why did you go with this approach? Eg: Was it a deliberate decision after evaluating other options, a "this is what we know" type situation, or did it grow organically from something smaller?
How long has it been running? What's the rough size of data you have (GB/TB, number of servers, number of DB instances)? Any reliability problems?
If you were building this service from scratch, would you do it this way again?
The history of this system is pretty organic, lots of experimentation. Started with Postgres, but that was too slow if a select statement returned tens of thousands of rows. MariaDB was way faster but became erratic once we passed about 2 billion records. Now with SQLite we are at about 180 GB, about three years of data so far, so not really "big data" territory yet.
Constantly looking for better alternatives. InfluxDB looks promising, but maybe we'll switch to a "hot/cold" approach instead, keeping whole "cold" days as JSON objects in JSONB fields in Postgres again while storing "hot" data in SQLite. We'll see...
So far the system has been rock solid. In the last few weeks we went live for real, now storing about 2k records a second. Not a problem so far.
SQLite is solid and trusted by many, but for some unknown reason some people refuse to see it as a solution for their needs. I have seen many databases that could easily have been replaced by SQLite.
I see SQLite as a file-format with a SQL interface.
I love the description from SQLite's website on the subject:
> SQLite does not compete with client/server databases. SQLite competes with fopen().
SpiceWorks is the biggest example of a product that doesn't scale because they've elected to use SQLite (and only SQLite). There's no hard limit on the number of devices, but most people say not to exceed about 1,000 devices. We tested it out against 800 devices. We started it on Friday. On Monday, it was still running.
The only other issue I've seen with it is with the data typing. Type inference is much weaker than strict typing, and sometimes it doesn't quite store things the way you'd expect. Granted, the last time I saw problems with that was years ago, but I've seen it.
On the other hand, it is somewhat rare to see an application that really needs multiple writers and would not be better served by a full-fledged database server.
I have not heard of any cases of SQLite failure in phone, or anyone needing a DB admin to recover their phone contact data.
For a long time (until API level 11), Android automatically silently deleted corrupted SQLite databases. https://android.stackexchange.com/questions/12388#33896 https://stackoverflow.com/questions/7764943
I learned the hard way to limit SMS storage and install Jan Berkel's SMS Backup+ https://github.com/jberkel/sms-backup-plus/#readme after losing my first Android SMS database (this may or may not have been SQLite's fault).
>the SQLite packaged with Android is being upgraded to version 3.6.22 in Android 2.2. I think that the SQLite database corruption bug linked in comment 451 has been fixed in Android 2.2. So _if_ that's what's causing the text message database corruption and deletion, then this bug will probably be fixed in
>Statically link your application against SQLite 3.6.23 instead of using the SQLite 3.5.9 that is found on Android. The bug you are hitting was fixed in SQLite 3.6.2
Contrast https://web.archive.org/web/20110219041419/https://www.sqlit... (Feb 2011) vs. https://web.archive.org/web/20110729152833/https://www.sqlit... (July 2011), which could be a glitch in the Internet Archive (a partial copy of the oldest record?) but adds "Though SQLite is resistant to database corruption, it is not immune" and introduced an extensive history of known issues.
SQLite is amazing, and its development history is further testament to its current reliability, especially when used correctly!
It can go wrong, it does go wrong, but it's a pretty solid solution, especially if you're not needing insane concurrency throughput.
The problem is that many people need concurrency: they're running web apps across multiple servers, and that's where SQLite is a problem.
>used aboard guided missile destroyers [...] operated without installing a database management system or requiring a database administrator
PS. Thanks to your question, today I re-discovered https://unqlite.org, which may fulfill a requirement I've been facing for some time. Awesome!
I don't really see how replication/HA is an issue in this setup beyond normal file system replication/HA (which seems to be a solved "spend money on NAS" problem in the enterprise world). If your filesystem isn't reliable / delivering what's needed then you're gonna have a bad time no matter the storage approach; I doubt some unusual option is involved though it would be cool to be proven wrong.
The only time the filesystem could even potentially have read/write conflicts (assuming 1-to-1 correlation between devices writing to files) is when querying. I did wind up muddling the terminology in the interest of jumping to the actual problem.
There's just one solitary mention of Informix in the whole discussion, as I'm typing this. And that's a comment lumping Informix with everything else commercial and expensive.
I thought Informix owned the embedded DB space at a certain rollout complexity. Remember the backups... Informix speaks JSON too, including via some quite low-level primitives. Whether you buy a product like Informix Edge plus tools or not depends on what is sending the time series data.
I’m not referring to manipulating or even querying the data after it’s stored. I just don’t see how saving the original data in SQLite buys you anything, if it needs to be collected into a db for queries, etc, anyway.
The important thing is balancing chasing the new hotness vs. always using the tried and true.
Last year I ran into a similar sort of situation at work: the funds to buy Ansible Tower required a battle I wasn't willing to fight. The argument was "We've spent so much money on tools that then get abandoned." We spent nearly a year with me being the only one in the company who could run Ansible playbooks because of this, until I finally found Rundeck. Now, we aren't paying for Rundeck; does that make it more likely to be abandoned? Was the risk to our operations if I suddenly became unavailable worth the saved licensing fees? Well, in this case it worked out ok, but it could have gone very badly.
As far as timeseries goes, there are huge advantages here. We used to use collectd/graphite/carbon, but a year or more ago I switched us over to InfluxDB/telegraf/grafana. It was a little rocky at the start; the first version of Influx I used had plenty of problems. At the time Prometheus was probably a better choice, but I couldn't come up with a way to make its pull-centric design fit into our environment. Then InfluxDB settled down and things have been great since.
We have way better metrics, with a way more usable dashboard, and significantly lower load. Literally, disabling collectd ended the complaints about performance on our dev/stg environments.
True of course. I've found that at least "premature abstraction" is almost a requirement when using persistence. I'm not talking an ORM or anything, but put a service in front of it if you have multiple things needing it, or put an in-code abstraction if you don't. And make each abstraction's/service's operation very specific to its caller. So a simple "fetchMarketData(timeRange, tickerSymbols) -> StreamOfData" will save you so much in the future.
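The caller-specific facade described above might look something like this; all the names (`MarketTick`, `fetch_market_data`, the in-memory store) are illustrative stand-ins, not any real API:

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Tuple


@dataclass
class MarketTick:
    ts: int        # epoch seconds
    symbol: str
    price: float


_STORE = []  # stand-in for the real backing store (a DB, a service, etc.)


def save_market_data(tick: MarketTick) -> None:
    _STORE.append(tick)


def fetch_market_data(
    time_range: Tuple[int, int], symbols: Iterable[str]
) -> Iterator[MarketTick]:
    """Stream ticks for the given symbols within [start, end).

    Callers depend only on this narrow operation, so the storage behind it
    can be swapped (SQLite -> Postgres -> a service) without touching them.
    """
    start, end = time_range
    wanted = set(symbols)
    for tick in _STORE:
        if start <= tick.ts < end and tick.symbol in wanted:
            yield tick
```

The point is the narrowness of the interface, not the implementation: each caller gets exactly the operation it needs and nothing leaks about how persistence works underneath.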
/former Basho employee.
I hope things work out at least as well as they have for RethinkDB's resurrection/transition! The Riak Users mailing list seems like the best place to follow along: http://riak-users.197444.n3.nabble.com
https://github.com/basho/riak_kv/tree/develop-2.2.5 (November 2017)
http://bet365techblog.com/riak-workshop-summary (October 2017)
>the immediate goal of releasing a known good build as soon as possible [...] all code would be released under the Apache 2 license [...] contact the HLL developers to aid in the dependency and GPL removal
https://news.ycombinator.com/item?id=15182566#15183008 (September 2017)
>andrew_deane_: bet365 have signed the agreement to purchase all Basho IP [...] our intention is to open source all code, help rebuild the community and collaboratively take the development of RIAK forward
The first question is tied closely to the interfaces provided by the database, and those interfaces are usually commensurate with the associated tooling. In the case of time series databases, they include statsd, collectd, and good old JSON. Relational databases utilize SQL as expected. Take the example of migrating from Oracle to Postgres; you shouldn't have to spend an inordinate amount of time retooling your applications because the data interfaces are similar.
Data egress is similar; SQL databases generally provide SQL/CSV exports, and most TSDBs (the good ones at least) allow you to get the data out via JSON through a REST interface.
Whether or not the DB you choose is open source, or from a small vendor, you should pick the right tool to solve your problem with data portability in mind. (disclosure, I work for http://irondb.io, and have used Postgres for 18 years)
As an open source product it seems like TimescaleDB is less risky than Datomic. (Not that the comparison is even necessary -- I don't think they compete in the same category apart from both being non-relational DBs.)
[edit:link fixed, thanks! (and thanks for taking the time to correct my misunderstanding re:TimescaleDB as a relational DB)]
It worked out pretty well for Oracle. (SCNR)
In my current job, I needed to get intimately familiar with the database our ERP software uses, because my boss needed lots of reports that the ERP software just did not do (unless you paid an expensive consultant to customize the software for you). And the idea of having to move all that data to a different database makes me shiver in horror. I can understand why a business would be very reluctant to move their data from one database (both schema and engine) to another.
For business-critical tasks, I prefer a boring, reliable solution. But it is good to know that there are still interesting developments going on in the database area, building tomorrow's boring, reliable solutions.
Played with a lot of databases, sql or no, and I always end up going back to PG
Well, I guess you can stick to DB2.
There's even a special backend, the GraphiteMergeTree, which does staggered downsampling, something most TSDBs aren't able to do.
It's the most promising development in this space I've seen in a long time.
- Secondary Indexes.
- Transactional semantics.
- Can operate on data sets greater than available memory (doesn't have to be all in-memory unlike memSQL and some others). Time-series data is voluminous.
- A whole bunch of specialized time-based optimizations that optimize query plans when working with time-based indexes and data.
- Constraints - Including foreign keys.
- Joins with relational data.
- Full SQL - allowing you to use complex queries and window functions
- Compatible with data tools that use SQL - which gets you the richest ecosystem of tools in the data world
- The full gamut of Postgres datatypes including JSON/B and GIS location data
- 20+ years of reliability, tested backups, live streaming replication, etc.
- Geospatial support through best-in-class PostGIS
And of course, we're only getting started :)
Columnstores just store data by column; they do not have any inherent limitations because of it. They all support SQL and compatible tools (although Druid's SQL is experimental, using Apache Calcite). They all store columnstore tables on disk (MemSQL uses rowstores in memory, SQL Server can optionally run columnstores in-memory using its Hekaton engine, and they all use in-memory buffers for rapid ingest). They can all do geospatial queries, support JSON columns, and some can handle nested/repeated structures. Indexes are available but unnecessary when you can prune partitions based on what's contained in each segment, especially when using a primary sort key (like a timestamp column in your case). SnappyData has a unique statistical engine to trade off query precision for much faster results (like HLL+ algorithms applied to the entire dataset). MemSQL will do OLTP access with full transactions across both rowstore and columnstore data.
Congrats on the VC funding, I'm always happy to see new projects and building on Postgres does give you a solid base with triggers and foreign keys (which come with their own scaling issues), and extending time-based functions will be useful -- however my issue is the marketing spin where you claim to be better than everything else. Columnstores are very fast, efficient, performant, and time as a dimension is not a new challenge. That's before considering the bigquery/snowflake superscale options or specialized databases like kdb+ which have served the financial industry for decades.
Approaching the field with a single-node automatic partitioning extension (as of today) for a rowstore RDBMS and saying you're better than the rest on features that they already have just strikes me as insincere. It would be better to recognize the competition and focus on what you're good at instead.
I’m sorry that was your impression and it’s certainly not our intent to mislead, although I’m not really sure why/where you think we claimed this. Indeed, the quoted article even says that "Timescale is not trying to take on Kx Systems directly in this core market.", and that such organizations have different needs for different use cases.
Technology choices are all about trade-offs, and databases are no different.
In fact, our docs have a page describing exactly when not to use TimescaleDB compared to other options: http://docs.timescale.com/v0.8/introduction/timescaledb-vs-n...
This would definitely clear things up in my mind.
Why would I use Timescale over KDB or IQ or Vertica? Is it just a price thing, that you are mostly cheaper (both licensing and finding talent)? If cost were a minor issue, why choose Timescale? What advantage does it have over those other TSDBs? That bullet list that has been repeated a couple of times seems to not really be unique to Timescale when compared to the other big columnar databases.
Maybe it is that you have a good story on how you can do scalar operations better than the others? Do you have a particular workload mix you are trying to target?
Do you plan on doing any of the TPC benchmarks?
Since when did it imply all the limitations you are saying?
Telegram channel https://t.me/clickhouse_en
Except, of course, for those who realize that the precision of a statistic only increases as sqrt(n) and that a biased dataset will remain biased regardless of how much data you have. I'll take a large-grain dataset that I can load on my computer and analyze in five minutes over a finer-grained dataset where I need to set up a cluster before I can even get started. Enough with the "let's store everything" fetishism already.
(Somewhat tangential to the blog post, I realize.)
 - https://en.wikipedia.org/wiki/Curse_of_dimensionality
But again this is a huge red flag. I've seen so many data science projects that started with "well, let's just get started with collecting everything and we will figure out what is important later on" and then spent so much time on infrastructure that no useful insights were ever produced.
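The sqrt(n) point is just the standard error of the mean, sigma/sqrt(n): quadrupling your data only halves your error bar. A two-line sketch makes the diminishing returns concrete:

```python
import math


def standard_error(sigma: float, n: int) -> float:
    """Standard error of a sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)


# Going from 1M to 4M rows (4x the storage and compute) only
# halves the uncertainty of the estimate:
se_1m = standard_error(1.0, 1_000_000)
se_4m = standard_error(1.0, 4_000_000)
assert abs(se_1m / se_4m - 2.0) < 1e-9
```

And no amount of n fixes a biased sample, since bias is a constant offset that doesn't shrink with sample size.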
Of course, getting that data out into analytical datasets is a whole different matter.
For anyone reading this who is interested in a practical application with sampled operational data, check out Facebook Scuba.
The "let's store everything" isn't being done for fun; it is rather expensive. For sophisticated sensor analytics though, it is essentially table stakes. There are data models where it is difficult to get reliable insights with less than a 100 trillion records. (Tangent: you can start to see the limit of 64-bit integers on the far horizon, same way it was with 32-bit integers decades ago.)
With TimescaleDB we've focused on single node performance to try and reduce the need for clustering. We've found performance scales very well just by adding more disk space if needed and more cores. So maybe some datasets are not practical for your laptop necessarily but a single instance on Azure/AWS/GCP is workable. No need for a cluster to get started :)
(Read scale out is available today and we are working on write scale out, hopefully later this year)
(if you're new to this like me)
The problem is they say the data is immutable and stored sequentially; although our data was immutable, with devices out on the net the data comes in random order when these inevitably have connection problems.
We always aggregated our data into larger time blocks. Storage was cheap and doing comparative analysis across locations and time zones was always our pain point.
I think using a Postgres database is wise.
It's intro computer science: if you have a tree structure and fill it up, you spend a lot of time in the corners of theta notation. Timescale uses tightly integrated partitioning on the time axis to deal with write performance and aging data out. Other popular TSDBs are plays on log-structured merge trees, etc.
I hope a reader would become interested in what an LSM tree is (and perhaps as importantly isn't)
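The time-axis partitioning idea can be sketched in a few lines; this is a toy model of the concept (bucket size and function names are made up), not how any particular TSDB implements it:

```python
from collections import defaultdict

BUCKET_SECONDS = 86_400  # one partition ("chunk") per day; illustrative size

partitions = defaultdict(list)  # partition start ts -> rows in that chunk


def insert(ts: int, value: float) -> None:
    # Route each row to its time bucket, so recent writes land in a small,
    # cache-friendly structure instead of rebalancing one giant index.
    partitions[ts - ts % BUCKET_SECONDS].append((ts, value))


def drop_before(cutoff_ts: int) -> None:
    # Aging data out becomes a cheap whole-partition drop,
    # not a row-by-row delete that churns the tree.
    for start in [s for s in partitions if s + BUCKET_SECONDS <= cutoff_ts]:
        del partitions[start]
```

LSM trees attack the same write-amplification problem differently: buffer writes in memory, then merge sorted runs to disk in the background.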
A few open source options exist (Spotify's Annoy, Facebook's FAISS, NMSLIB) but these are rather low-level, more hobby projects than enterprise level engines (index management, transactions, sharding…).
After building a few document similarity engines for our clients we took up the gauntlet and created ScaleText, https://scaletext.ai. It's still early days but the demand for a sane, scalable, cross-vertical and well-supported NLP technology is encouraging.
What really floors me about Influx is how fast it is. A query that used to take hours in Oracle takes seconds in Influx. And the influx query is readable: rolling data up to various intervals produced nightmare queries in Oracle but is short and sweet in Influx. Want to take data gathered every 5 minutes and give a daily average? Yeah, good luck with that in Oracle. It's doable, sure, but I've seen the code. And the speed. Both are ugly.
The only problem I have with InfluxDB is it's still a little immature, and the tooling isn't where it should be. This isn't entirely Influx's fault; most third parties aren't aware it exists or don't care. Our reporting team uses Crystal Reports, which can't talk to InfluxDB. So I end up having to write a Python script that runs in cron every night to perform InfluxDB queries for the previous day's data doing all the rolling and average/min/max calculations and then inserting the results into Oracle, just so our reporting team can get to the data. For some KPIs I'm working on right now, we decided to not go through the reporting team, and I'm writing a webapp in Python/Bottle to display the report, and we're probably going to augment that with graphs from Grafana.
Grafana is beautiful, by the way.
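The "5-minute samples to daily average" rollup mentioned above is worth seeing concretely. In InfluxQL it's essentially a `GROUP BY time(1d)`; here is the same computation in plain Python with synthetic data, just to show how small the logic is:

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Synthetic 5-minute samples covering two days (288 samples per day).
start = datetime(2018, 1, 1)
samples = [(start + timedelta(minutes=5 * i), float(i)) for i in range(576)]

# Group by calendar day, then average each bucket.
buckets = defaultdict(list)
for ts, value in samples:
    buckets[ts.date()].append(value)

daily_avg = {day: sum(v) / len(v) for day, v in buckets.items()}
```

The pain in Oracle came from expressing the time-bucketing in SQL over a generic schema; a TSDB makes the bucket a first-class query primitive.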
We also had a problem where user queries that return too many columns cause the DB server process to OOM. And then it restarts, so another 20 mins of downtime. Also not cool.
We liked the tagging and rollup features, and automatic retention management, but those first 2 problems really turned us off.
I also am using Influx at my company and the experience is not that great, mostly due to it being immature yes. For example, I currently have runaway disk usage and I have no way to know what table is using it. So I have to choose between losing data or continuously increasing storage.
I've spent some time trying to figure out why I'm not getting data, and it turns out that it was because I typoed the name of the series.
I still like Influx though, warts and all.
But what are they doing better than the existing time series databases? Surely they must have some advantage or they wouldn't have raised 16.1 million dollars.
Every time I have to store time series data, I never really need ACID transactions. I definitely don't ever need to do updates or upserts. It's always write-once-read-many. ElasticSearch has always been the obvious choice to me and it has worked extremely well. Information retrieval is incredibly robust and for times that I'm worried about consistency, I use Postgres's JSON capabilities and write first to there. You can have your application sanity-check between the two if you're worried about ElasticSearch not receiving/losing data.
I find it really hard to beat.
I ended up leaving that job and set up something specifically for metrics from scratch. I didn't compare 1 to 1, but this was much faster to query and had much lower requirements (disk space, memory, etc).
Both use-cases are using it as a time-series database, and there's no reason ElasticSearch couldn't work for both of those use-cases for many people. Using two different backends (one for events and another for metrics) meant drawing a line when creating/logging data and two ways to query, which sucks for training. You also would have to maintain/archive two different data stores.
1. For some developers, just having a SQL interface to time-series data (while maintaining insert/query performance at scale) is good enough reason to use TimescaleDB. For example, when trying to express complex queries, or when trying to connect to SQL-based visualization tools (e.g., Tableau), or anything else in the PostgreSQL ecosystem.
2. For others, the difference is being able to fully utilize JOINs, adding context to time-series data via a relational table. Other time-series DBs would require you to denormalize your data, which adds unnecessary data bloat (if your metadata doesn't change too often), and data rigidity (when you do need to change your metadata).
To quote a user in our Slack Support forums:
"Retroactively evolving the data model is a huge pain, and tagging “everything” is often not viable. Keeping series metadata in a separate table will allow you to do all kinds of slicing and dicing in your queries while keeping it sane to modify the model of those metadata as you gain more knowledge about the product needs. At least that is something we have been successful with"
3. For others, it's some of the other benefits of a relational database: e.g., arbitrary secondary indexes.
One thing to think about with NoSQL systems (including every other time-series DB) is that their indexing is much more limited (e.g., often no numerics), or you need to be very careful about what you collect, as costs grow with the cross-product of the cardinality of your label sets (strings). We have heard from multiple people that they explicitly didn't collect labels they actually wanted when using [another time series db], because it would have blown up memory requirements and basically OOM'd the DB. (Timescale doesn't have this issue with index sizes in memory, and you can create or drop indexes at any time.) You often hear about this as the "cardinality problem", which TimescaleDB does not have.
4. There's also the benefit of being able to inherit the 20+ years of work of reliability on Postgres.
There was no database that did all of this when we decided to build Timescale. (If one did exist, we would have used it).
You can read more about that story here:
And for more on our technical approach:
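Point 2 above (metadata in a separate relational table, joined at query time rather than denormalized into every row) is easy to demonstrate with any SQL engine; here is a self-contained sketch using SQLite with made-up table names:

```python
import sqlite3

# Narrow time-series table plus a separate metadata table, joined at query
# time. Schema and values are illustrative only.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE sensors  (id INTEGER PRIMARY KEY, location TEXT);
    CREATE TABLE readings (ts INTEGER, sensor_id INTEGER, temp REAL);
    INSERT INTO sensors  VALUES (1, 'roof'), (2, 'basement');
    INSERT INTO readings VALUES (100, 1, 21.0), (100, 2, 18.0), (200, 1, 22.0);
""")

# Slice by a metadata attribute without having stored it on every reading.
rows = db.execute("""
    SELECT s.location, AVG(r.temp)
    FROM readings r JOIN sensors s ON s.id = r.sensor_id
    GROUP BY s.location
""").fetchall()
```

If `location` later changes or grows new attributes, only the `sensors` table changes; billions of existing readings stay untouched, which is exactly the "evolving the data model" pain the quoted user describes.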
> There was no database that did all of this when we decided to build Timescale. (If one did exist, we would have used it).
If you removed the "based on Postgres" part, there were KDB, IQ, Vertica, and a few others. I can definitely see a price argument though (i.e., the same but cheaper), as those all tend to be expensive.
It might be bad for the specific organization pushing their new solutions, but for the software industry as a whole, it seems like the good side of "throw spaghetti against the wall and see what sticks". Many ideas will fail, but the good ones will stick around. So I'd say by all means, let people feel free to innovate. And those of us who consume the innovations just need to remember the mantra of "leading edge, not bleeding edge."
And we picked Postgres as our starting point exactly because it wasn't new and shiny but boring and it worked.
More on this here if you're interested.
Changes in telemetry production (IoT and others) have fundamentally changed the requirements... millions (and often billions) of data points per second are now happening. This is being driven by RSM and IoT. RSM is alluded to in my ACM Queue article: https://queue.acm.org/detail.cfm?id=3178371
This alone gave us an insanely scalable setup (load tested against 100 GB/day, ~100M records/day) for a grand total of a cheap $10/day for everything: server, disk, and S3. https://youtu.be/x_WqBuEA7s8
Works great for timeseries data, super scalable, no devops or managing database servers, simple, and works.
I believe the Discord guys also did something similar and had written some good engineering articles on it, give it a Google as well.
There are two big problems with relational SQL databases: the storage and the query language. The two aren't separate concerns. While projects like this might fix the storage issue, they will never be simple or fast without also changing the query language too.
SQL doesn't fit the column/array paradigm well. There are extensions that attempt to close the gap, but they are slow and complex compared to just fixing the query language. Operations like windowed queries and aggregates can be painfully slow because of the mismatch between SQL and the ordered data.
I would absolutely love to work on a product that attempts to fix the query issue at the same time as working on the storage and interface issues. You don't need an obscure APL-like language, just something that promotes using the properties inherent in time series to the front.
Also, it is a bit weird to say you are not trying to compete with the likes of KDB then go on to say how good you are at its core competencies. As always, eagerly awaiting the TPC-D benchmarks and comparisons to the heavy hitting time series dbs out there, not the easily crushed FOSS products (except for maybe AeroSpike that I wish I had more experience with).
1. It seems to not be optimized for speed (but it's difficult to say since the license forbids publishing benchmarks).
2. It's not open source.
I'm not sure if moving to another db system would be beneficial, but I would be very grateful if accessibility could be much less of a hack.
The solution to larger processing simply seems to be loading the files and batching them through whatever processing we want and storing the output in a proper database - which can then be queried instantly.
For a lot of time series data, this does seem like a pretty decent approach.
It's not that storage isn't available; it's easy to store terabytes of data. But really fast storage is still expensive and rare. Being able to compress time series data to 3-5% of its raw size allows keeping most of it in memory, speeding up reading, writing, and analysis.
InfluxDB is quite good at that but unfortunately, they struggle storing many small datasets.
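Those compression ratios mostly come from how regular time-series data is: evenly spaced timestamps delta-encode to a constant, which then run-length encodes to almost nothing. A minimal delta + RLE sketch (real engines like Gorilla/InfluxDB use more sophisticated bit-level schemes):

```python
def delta_rle(values):
    """Delta-encode a sequence, then run-length encode the deltas."""
    deltas = [values[0]] + [b - a for a, b in zip(values, values[1:])]
    runs, count = [], 1
    for prev, cur in zip(deltas, deltas[1:]):
        if cur == prev:
            count += 1
        else:
            runs.append((prev, count))
            count = 1
    runs.append((deltas[-1], count))
    return runs


# 1,000 timestamps sampled every 10 seconds...
timestamps = list(range(1_500_000_000, 1_500_000_000 + 10_000, 10))
encoded = delta_rle(timestamps)
# ...collapse to just two (delta, run-length) pairs.
```

Values compress less dramatically than timestamps, but sensor readings change slowly, so delta (or XOR-of-float) encoding still shrinks them enormously.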
How would you go about deleting a users data upon request?
You can also use any column-oriented relational database with a time-based partition key and do the same thing.
At least in my experience, historians are rarely recommended for complex or ad-hoc queries. Typically you just pull the data (by tags) into another application and do your data processing there. It looks like TimescaleDB lets you execute complex queries in the database. Historians typically only let you fetch data by tag, and you need a metadata (e.g. asset management) framework on top to organize the data (e.g. give me avg temp every 5 minutes by sensor). It looks like with TimescaleDB you can have strings/text as fields within the timeseries table, which removes the need (to some degree) to join the data with a separate metadata database.
I've also never heard of anybody using these commercial historians for time series data you'd see from non-industrial processes (e.g. stock data, price tracking, GPS location of people or moving assets, time between clicks on a website, etc.).
All that being said, OSISoft PI and AF have their warts, but OSISoft has been around for a while and PI has been battle tested in various industries (e.g. Oil & Gas, Manufacturing). It's closed source and you have to pay for it, so it's probably not attractive to startups and smaller companies. But it does come with a support organization if you need it and can pay for it. And IME data retrieval from PI is extremely performant!
- Full indexing and secondary index support.
- Support for transactional semantics
- Support for living along side relational data - including foreign keys to the relational data.
- Full trigger and constraint support
- Support for all of Postgres's native data types including JSON(B) and GIS location data.
Glad to see open source tools in this space gaining traction.