1. QuestDB – https://questdb.io/ – is a performance-focused, open-source time-series database that uses SQL. It makes heavy use of SIMD and vectorization for the performance end of things.
2. GridDB - https://griddb.net/en/ - is an in-memory NoSQL time-series database (there's a theme lately with these!) from Toshiba that recently boasted 5M writes per second and 60M reads per second on a 20-node cluster.
3. MeiliSearch - https://github.com/meilisearch/MeiliSearch – not exactly a database but basically an Elastic-esque search server written in Rust. Seems to have really taken off.
4. Dolt – https://github.com/liquidata-inc/dolt – bills itself as a 'Git for data'. It's relational, speaks SQL, but has version control on everything.
TerminusDB, KVRocks, and ImmuDB also get honorable mentions.
InfoWorld also had an article recently about 9 'offbeat' databases to check out if you want to go even further: https://www.infoworld.com/article/3533410/9-offbeat-database...
Exciting times in the database space!
We are focused on the revision control aspects of Terminus - trying to make the lives of data-intensive (ML etc.) teams a little easier. We use a delta-encoding approach to updates, like source control systems such as git, and provide the whole suite of revision control features: branch, merge, squash, rollback, blame, and time-travel. The idea is to provide continuous integration for the data layer.
Basically squash all the tools and processes mentioned here into a versioned graph: https://martinfowler.com/articles/cd4ml.html
I think this idea is really valuable, but I usually see it implemented as a time-series extension on top of Postgres or MySQL, like SQLAlchemy-Continuum or TimescaleDB. I.e. you can get most of the useful git-like time-travel semantics (modulo schema migrations) out of time-series data with a separate transaction history table.
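For context, a minimal sketch of that history-table pattern in Postgres (all table and column names are invented for illustration): a trigger copies every change into an append-only log, and "as of" reads become a filter on the transaction time.

```sql
-- Hypothetical example: an append-only history table maintained by a trigger.
CREATE TABLE accounts (
    id      bigint PRIMARY KEY,
    balance numeric NOT NULL
);

CREATE TABLE accounts_history (
    id      bigint      NOT NULL,
    balance numeric,
    op      text        NOT NULL,                 -- 'INSERT' | 'UPDATE' | 'DELETE'
    tx_time timestamptz NOT NULL DEFAULT now()
);

CREATE OR REPLACE FUNCTION record_accounts_change() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        INSERT INTO accounts_history (id, balance, op) VALUES (OLD.id, OLD.balance, TG_OP);
    ELSE
        INSERT INTO accounts_history (id, balance, op) VALUES (NEW.id, NEW.balance, TG_OP);
    END IF;
    RETURN NULL;  -- AFTER trigger: return value is ignored
END;
$$ LANGUAGE plpgsql;

-- PostgreSQL 11+; older versions use EXECUTE PROCEDURE.
CREATE TRIGGER accounts_audit
AFTER INSERT OR UPDATE OR DELETE ON accounts
FOR EACH ROW EXECUTE FUNCTION record_accounts_change();

-- "Time travel": the state of every account as of a past instant.
SELECT id, balance
FROM (
    SELECT DISTINCT ON (id) id, balance, op
    FROM accounts_history
    WHERE tx_time <= TIMESTAMPTZ '2020-04-01'
    ORDER BY id, tx_time DESC
) latest
WHERE op <> 'DELETE';
```

It doesn't give you branches or merges, but for plain "what did this row look like last Tuesday" queries it goes a long way.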
I'm curious what Dolt's performance profile looks like (i.e. how reads and writes scale vs "branch"-count, row-count, and how they handle indices across schema migrations), since the aforementioned solutions on Postgres are building on an already-very-performant core.
edit: TerminusDB also looks very cool, although it's approaching this problem from a graph-database angle rather than a relational one. Their architecture (prolog server on top of a rust persistence core) also seems super fascinating, I'd love to read more on how they chose it.
You are indeed correct, we are quite young. We are also focused at this stage on the data distribution use case (ie. use Dolt instead of putting a CSV or JSON file on GitHub). So, we haven't spent much time on query performance or scale. The current model is that the database and all of its history have to fit on the user's hard drive. The format is easily distributed (ie. put content addressed chunks on different servers, optimize for fewer network hops) but that's not how it works today.
That being said, we think architecturally we can eventually get pretty close to parity with other RDBMS on the read path. We will be slower on the write path given the need to build the Merkle DAG on writes.
Long way to go though; we only launched open source last August. On a suite of ~6M MySQL correctness benchmarks, we currently execute 3-4X more slowly. These aren't large data sets either, so we suspect we'll run into some non-linearities in performance. This is just normal SQL. We haven't really tested the limits of how many branches we can handle or how long branch creation or merge takes at scale - not because we don't want to, but because it's not the use case we're focused on.
If you are willing to take the journey with us, we'll work with you to get Dolt where it needs to be (if it can, reads should be fine, writes will be more expensive). But we're just at the beginning so there will be bumps along the road. If that's not for you, we completely understand.
i.e. "Select population.master, population.branch2, population.commit123.diff from state_populations"
1) `SELECT * FROM state_populations AS OF 'master'` to get the repo as it existed at a particular timestamp / ref name
2) `SELECT * FROM dolt_diff_state_populations WHERE ...` to examine to/from column value pairs between any two commits you choose.
There are a bunch of different supported system tables that provide interesting functionality: https://www.dolthub.com/docs/reference/sql/#dolt-system-tabl...
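For example, a sketch along those lines (illustrative column names; see the docs for the exact schema of the diff table, which pairs from_/to_ values with the commits they came from):

```sql
-- Sketch: what changed in state_populations between two refs.
SELECT from_commit, to_commit, from_population, to_population, diff_type
FROM dolt_diff_state_populations
WHERE from_commit = 'v1' AND to_commit = 'master'
  AND to_state = 'CA';
```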
What they all share is a) awful performance, and b) bad design.
Git semantics, ("branche", "merge", "commit") are not well suited for data, because merging dataframes and creating "branches" often leads to misunderstandings and delays. Time travel is very nice to have, but it's often the case where you would like to consume your input datasets at different point in time in the same repository (unless you do one dataset per repository, but then, what's the point ?).
Performance is bad because all updates need to go through some kind of coordination mechanism (etcd, ZooKeeper, or Raft directly). In a single-instance scenario, you often end up flooding it or needing additional memory to cope with the load. However, you could deliver high throughput and high availability by using proper sharding and distributing updates to specific masters (like you would do in any actor-based architecture).
As a replacement, we're now using a custom event-sourcing framework on top of AWS S3/Azure blob. It's faster, more reliable, and most importantly, better designed.
> Git semantics ("branch", "merge", "commit") are not well suited for data, because merging dataframes and creating "branches" often leads to misunderstandings and delays.
Can you give a concrete example of what you mean? I'm wondering if this is a failing of the tool you're using or the model itself?
> Time travel is very nice to have, but it's often the case where you would like to consume your input datasets at different point in time in the same repository
Dolt supports doing exactly this. See my reply to a child comment below.
> Awful performance
It's not obvious to me why this needs to be true in the general case, unless it's caused by b) bad design. Are you mostly talking about write performance, or are reads also a problem?
First, branching and merging. In git, branching allows you to make uncoordinated parallel progress for the price of a reconciliation step. In a datastore, you want the exact opposite: a single, consistent, available source of truth. Having different branches of the same dataset brings more confusion while solving zero problems.
Then, commits. In git, a commit represents a snapshot of the entire state of your repository. This is particularly attractive because it guarantees that your code will build no matter what kind of update follows (inconsequential: editing a readme; severely destructive: removing a src folder). In a datastore, this is nice but unnecessary. As I mentioned in this thread, datasets move at different speeds, and attaching a new hash to something that didn't change doesn't add value. However, I have to admit I failed to mention earlier that datasets are often unrelated and not relational. This would be worth reconsidering if that were the case, of course. Most of the time, a dataset is represented as a single dataframe (or a single collection of dataframes).
There are some points where git semantics make sense: immutability of commits, linearizability within branches. Both are extremely important if you want to enable reproducibility of your pipeline. These are traits coming from Event Sourcing.
Reproducibility is also claimed by DVC and Pachyderm, but their issue here is more a matter of trying to do too many things at once and not managing to do them right. Running code within Pachyderm pipelines was a recipe for disaster and the first thing we got rid of.
As for performance, the write side is where it matters, because it needs to be coordinated. Reads are almost never an issue with good caching. In any case, it should be robust enough to fill the gap between CSV files sent to S3 and a full Kafka cluster, e.g. not complaining at a few TB. To my knowledge, the only multi-leader datastore suitable for storing sharded datasets as a continuous log is Kafka.
I was reading through these source code branching patterns and you can easily imagine (if design and performance are right) how they would apply to data - but as the author says, 'branching is easy, merging is harder.'
Also, for TerminusDB we don't use a multimaster coordination mechanism - we actually use the same sort of git approach.
Back to the time-travel. One of the most evident architectures when dealing with AI/ML/optimisation is to design your application as a mesh of small, deterministic steps (or scripts) reading input data and outputting results. As you would expect, the output of one step is reusable by another one. Example:
Script A is reading Sales data from source S and Weather data from source W, writing its result to A. Script B is reading data from source A and Calendar data from C, writing its result to B. In this example, we want to be able to do two things: 1) run a previous version of A with S and W from 2 weeks ago and assert that the result it produces now is exactly identical to the one it produced at the time; 2) run a _newer_ version of A with S and W from 2 weeks ago and compare its result with the one it previously produced. Of course, in the real world, S, W, and C progress at different speeds: new sales could be inserted by the minute, but the weather data would likely change by the day. So, you need a system that allows you to read S@2fabbg and W@4c4490 while being in the same "repository". That's why git semantics are not a good fit: you need to have only one "branch" to ensure consistency and limit misunderstandings, but you want to "commit" datasets in the same repository at different paces. For that purpose, event sourcing is much better :) (BTW, git at its core is basically event-sourcing)
Kafka's architecture is actually the best solution.
`SELECT * FROM S AS OF '2fabbg';`
`SELECT * FROM W AS OF '4c4490';`
(Branch names work as well as commit hashes for the above).
You can even do joins between the tables as they existed at those revisions:
`SELECT * FROM S AS OF '2fabbg' JOIN W AS OF '4c4490' ON ...`
As long as your data is actually relational, it's a pretty good fit.
Very interesting - thanks for the additional detail. Will have to think about how we might best represent in Terminus. We did a bunch of work for retailers in exactly the situation you describe.
Can you give me some more detail about what you are doing? We have a a large road ahead, and I am genuinely curious about the path you've taken.
Well, not really. Dolt isn't just time travel. If all you want is time travel (or append-only data), you can do that with Postgres or MySQL pretty easily and get great performance. What Dolt brings to the table is actual Git semantics: a real commit graph you can branch, merge, fork, and clone. You can inspect the historical values of each cell and examine the author and commit message for each change. Two people can clone a database, both insert / update / delete rows, and then merge each other's changes. If they touched the same rows, they'll have a merge conflict they have to resolve. If you find an error in data somebody has published in the Dolt format, you can fork it, correct the data, and submit them a PR to merge back, just like you do with code on GitHub. It's a database built for collaboration from the ground up. Instead of giving each person you want to make edits specific grants on particular tables / subsets of the data, you just let them clone it, then submit a PR that you can examine and approve. Or if they disagree with your policies, they can fork the data and still merge in updates from your master source as necessary. Git demonstrated what a powerful model the commit graph is for code. Dolt brings it to data, with SQL built in.
To answer your question about indexes across schema migrations, indexes are versioned with the schema of the table. This means they can be efficiently synced during a pull operation, but it means that previous versions of the data don't have indexes on them. We're considering adding the ability to add an index to the data from the beginning of the commit history as a rebase operation, but haven't implemented that yet.
I guess in my imaginary perfect world, you don’t need to use commit, PR, merge semantics to make normal edits. You can use existing bitemporal/transaction history ideas in online updates, appends, and deletes, and then you have a higher-level abstraction for offline(-only?) branching and merging.
I guess what I’m saying is that I don’t totally buy the idea that you need git “all the way down”, especially if it gets in the way of performance. But maybe I’m just used to “the old way”, and I’ll cringe reading this in 20 years. :)
Major kudos for this project, it’s super impressive. I’m excited to see how it grows!
But putting that probably-wrong-because-I'm-a-dilettante wankery to one side: a serious advantage of bitemporalism is that most layfolk understand how clocks and calendars work. This is less true of git commits.
We've been able to train up "regular folk" on bitemporalism at our company. The distinction between valid and transaction history takes a little while, but it sticks. Git has two or three such conceptual sticking points (staging vs committing, branching & merging, remote vs local).
FWIW we are absolutely working on modes of using the product that "just work" like a normal database without any explicit commits / merges. In that case commits would happen automatically (maybe one for each write transaction?) and you'd get all the history "for free." We aren't sure how useful that is, though. Definitely looking for customers interested in this use case to guide us on their needs.
There is much business value to be extracted out of this idea, because most companies (1) do not have the ability to cheaply (i.e. ad-hoc) introspect this way (2) would gain a lot of value from this ability. My hunch is that most companies out there are interested in answering that set of questions (referred to as bitemporal data, in the literature).
This is the space that TimescaleDB seems to be competing in, although we don't use them (we use our own extension of Postgres, which is only a few hundred lines of PL/pgSQL, plus some helper libraries in Go/Python).
From my perspective, I think git-style semantics would be very powerful as a layer above an existing bitemporal system. We've implemented similar systems, where records get a semver-style version in their bitemporal key, so we can see how different producer logic versions minted different records over time. It'd be really cool to have branching versions, and to be able to rebase them -- this is something our system doesn't currently support.
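To sketch what I mean (a simplified, hypothetical version of our schema; table and column names are made up): each record carries a valid-time range, a transaction-time range, and the semver of the producer that minted it, so "time travel" is just a pair of range predicates.

```sql
-- Hypothetical bitemporal table: valid time (when the fact holds in the world),
-- transaction time (when we recorded it), plus the producer logic version.
CREATE TABLE prices (
    sku              text        NOT NULL,
    price            numeric     NOT NULL,
    valid_from       timestamptz NOT NULL,
    valid_to         timestamptz NOT NULL DEFAULT 'infinity',
    tx_from          timestamptz NOT NULL DEFAULT now(),
    tx_to            timestamptz NOT NULL DEFAULT 'infinity',
    producer_version text        NOT NULL          -- e.g. '1.4.2'
);

-- "What did we believe on 2020-03-01 about the price in effect on 2020-02-15?"
SELECT sku, price, producer_version
FROM prices
WHERE tx_from    <= TIMESTAMPTZ '2020-03-01' AND tx_to    > TIMESTAMPTZ '2020-03-01'
  AND valid_from <= TIMESTAMPTZ '2020-02-15' AND valid_to > TIMESTAMPTZ '2020-02-15';
```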
Anyways, hope our data point is useful for you all. Happy to share more if you have questions about anything.
This eliminates complex rebasing topologies, and Git not recognizing that old and rebased commits are the same. But I'm not sure if it works well in practice. And it doesn't extend to SQL data sets yet.
I don't know if you're interested in this or not, but I just wanted to mention it.
We like Prolog for querying, constraint checking, and user interaction. Rust is great for low-level data manipulation. Prolog is a superpower - very simple yet very powerful - but it's quite far removed from the hardware and uses abstractions not under our control, so it's not good at nitty-gritty bit manipulation. We like Rust as a low-level memory-safe language (and it has a great community).
Much more background on why we made those choices: https://news.ycombinator.com/item?id=22867767
List here: https://gist.github.com/manigandham/58320ddb24fed654b57b4ba2...
Why would Raft be part of a search module when it can be written (and/or used) as a separate library? Curious about the design-choice here.
Release notes: https://github.com/terminusdb/terminusdb-server/blob/master/...
Furthermore it exposes a privacy-as-code framework allowing you to version control and be very granular about how different entities interact with your datasets.
I'm on the core team - if you have any questions I would be more than happy to answer.
Looks awesome, added to my feeds!
I wrote a Python script to delete the stale records based on that time key. We have a separate vacuum service that cleans up the mess continuously.
Is this considered a time series database? Or is there something a dedicated time series database does differently?
In general, a time-series database makes architectural decisions and focuses on capabilities for time-series data that lead to orders of magnitude higher performance and a better developer experience.
For example, TimescaleDB:
- Auto-partitions data into 100s-10,000s of chunks, resulting in faster inserts and queries, yet provides the Hypertable abstraction that allows you to treat the data as if it lives in a single table (eg full SQL, secondary indexes, etc)
- Provides Continuous Aggregates, which automatically calculate the results of a query in the background and materialize the results 
- Supports native compression, using a variety of best-in-class algorithms based on datatype (eg delta-of-delta encoding, Simple-8b RLE, XOR-based compression (aka "Gorilla" compression), dictionary compression)
- Adds other capabilities like interpolation, LOCF, data retention policies, etc, that are necessary for time-series data
- Scales up on a single node, or out across multiple nodes
There's more, but I'll stop because I don't want this to sound like marketing.
You are welcome to implement this yourself (and our docs explain how if you'd like), but most find it easier to use the database we already provide for free. :-)
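To make the first two bullets above concrete, here's a rough sketch of the hypertable + continuous-aggregate workflow (table and column names are invented; check our docs for the exact syntax in your version):

```sql
-- Hypothetical example: a sensor table turned into a hypertable,
-- then rolled up hourly in the background by a continuous aggregate.
CREATE TABLE conditions (
    time        timestamptz NOT NULL,
    device_id   text        NOT NULL,
    temperature double precision
);

SELECT create_hypertable('conditions', 'time');

CREATE MATERIALIZED VIEW conditions_hourly
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', time) AS bucket,
       device_id,
       avg(temperature) AS avg_temp,
       max(temperature) AS max_temp
FROM conditions
GROUP BY bucket, device_id;
```

You keep writing raw rows to `conditions` and query `conditions_hourly` like any other table; the chunking and background materialization happen underneath.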
It sounds like a time-series database does in fact include some elements of ETL at the database layer. I can see how that would be helpful.
We also have a pretty active community on Slack (4,000+ members): https://slack.timescale.com/
An overly simplistic example being an API that captures upvotes. Instead of storing the individual upvotes, you hold incoming requests in a queue and only write out a count of upvotes over, say, 1 minute. That way, if you want to get a count over a larger period of time, you're setting a ceiling on the number of records involved in your operation. If you have 1-minute resolution and you're looking over a year of data, you're adding up at most 60 x 24 x 366 (= 527,040) records for each post. This is really handy for analytics data where you don't necessarily care about the individual records.
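A rough sketch of that rollup in SQL (all names invented): the writer flushes one counter row per post per minute, and reads sum the pre-aggregated rows instead of the raw events.

```sql
-- Hypothetical 1-minute rollup table; the API flushes one row per post per minute.
CREATE TABLE upvote_counts (
    post_id bigint      NOT NULL,
    minute  timestamptz NOT NULL,   -- truncated to the minute
    upvotes integer     NOT NULL,
    PRIMARY KEY (post_id, minute)
);

-- Flush from the in-memory queue (here, a staging table) once per minute.
INSERT INTO upvote_counts (post_id, minute, upvotes)
SELECT post_id, date_trunc('minute', created_at), count(*)
FROM upvote_events_staging
GROUP BY post_id, date_trunc('minute', created_at)
ON CONFLICT (post_id, minute) DO UPDATE
SET upvotes = upvote_counts.upvotes + EXCLUDED.upvotes;

-- A year of history is now at most ~527k rows per post instead of every raw event.
SELECT sum(upvotes)
FROM upvote_counts
WHERE post_id = 42 AND minute >= now() - interval '1 year';
```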
The project I was specifically referencing was one that captured data streams from sensors like accelerometers and thermometers, which is inherently time-series data, because it's literally just (timestamp, sensor-reading). But to make use of that, you need a library of tools that understand that the underlying data is a time-series to do things like smoothing data, or highlight key events. For example, a torque spike had a "signature" which involved looking at the difference over time periods at a particular resolution. But a temperature spike would look different. Etc.
Do time series databases store the raw underlying data for the aggregates or just enough to calculate the next window?
The "time-series" features come from the focus around time as a primary component of the data (or some kind of increasing numeric value). Features like automatic partitioning and sharding, better compression, aggregating into time buckets, extracting and managing date values, smoothing missing values in queries, etc.
You can do all that yourself on Redshift but they just offer more built-in functionality for it.
You can choose to not do this but then get ready to eat mediocre performance.
You don't need to do that with a relational columnstore like Redshift, BigQuery, MemSQL, Vertica, Kdb+, or others. They're designed for massive aggregations with rows stored in highly compressed column segments that support updates and joins. And they're much faster than any custom compression scheme on top of another slower data store.
There's also the in-between options like OpenTSDB and Pinalytics that use Hbase/Bigtable. That's a sorted key/value datastore but it still applies compression on the row segments so you can leverage that in scans without a custom compression scheme on top.
As usual, it depends on the situation and what's being measured. For example, if you are able to batch a lot, then write performance of a blob will be line rate. If you need an ACK on each event then that's not possible and indeed a column store will be better.
AFAICT, OpenTSDB compacts columns which is similar to what I've described.
This is proprietary so there's no visibility into what they do here.
FWIW, KairosDB is a time-series DB built on Cassandra that uses rows each spanning 1,814,400,000 milliseconds (three weeks) to side-step the data management overhead via bucketing. Not a binary blob, but certainly not the straightforward data layout that someone asking 'why can't I just store data in a normal database' might expect.
OpenTSDB compacts columns inside a column family. This is because HBase/Cassandra/Dynamo key/value stores support arbitrary columns for each row and store a verbose identifier for the column/cell/version matrix. It's a custom data format to save disk space, but the underlying compaction still refers to HBase LSM-tree compaction. The rows are still stored individually and compression is enabled on HDFS.
But yes, it's all highly dependent on the situation and it's the same fundamental mechanics of batching and compressing data. My point is that it's better to just use a database that is built with this architecture and can provide greater performance, usability and full querying flexibility instead of bolting it on top of a less functional data store.
I have never worked with something that bills itself as a "time-series" database. Does that involve pushing elements of the ETL layer down into the database? Or is it just optimizing for access by a time dimension?
Antti from aito.ai here (https://aito.ai)
What do you think about Aito and the predictive databases?
We believe that the predictive databases can be a thing both in the database and the ML space, which is quite exciting.
I also feel that 'smart databases' like Aito or BayesDB form an interesting database category, which will emerge as an everyday software component once the technology matures. According to our internal benchmarks, it is already mature accuracy- and performance-wise for many applications.
- Antti & Aito
Yep they sure say that. While opening a tool box with very few tools in it. Sometimes only one.
It would be interesting to compare notes, and see what Materialize does better.
The main differences you should expect to see are generality and performance.
Generality, in that there are fewer limitations on what you can express. Oracle (and most RDBMSs) build their Incremental View Maintenance on their existing execution infrastructure, and are limited by the queries whose update rules they can fit in that infrastructure. We don't have that limitation, and are able to build dataflows that update arbitrary SQL92 at the moment. Outer joins with correlated subqueries in the join constraint; fine.
Performance, in that we have the ability to specialize computation for incremental maintenance in a way that RDBMSs are less well equipped to do. For example, if you want to maintain a MIN or MAX query, it seems Oracle will do this quickly only for insert-only workloads; on retractions it re-evaluates the whole group. Materialize maintains a per-group aggregation tree, the sort of structure that previously led to a 10,000x throughput increase for TPC-H Query 15. Generally, we'll build and maintain a few more indexes for you (automatically), burning a bit more memory but ensuring low latencies.
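For illustration, the sort of view in question looks like this (names are made up; it assumes a `lineitems` source has already been defined), and it stays up to date incrementally even when rows are retracted:

```sql
-- Sketch: a per-group MIN that is maintained incrementally under inserts and deletes.
CREATE MATERIALIZED VIEW min_price_per_supplier AS
SELECT supplier_id, MIN(price) AS min_price
FROM lineitems
GROUP BY supplier_id;
```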
As far as I know, Timescale's materialized views are for join-free aggregates. Systems like Druid were join-free and are starting to introduce limited forms. ksqlDB has the same look and feel, but a. is only eventually consistent and b. round-trips everything through Kafka. Again, all AFAIK and could certainly change moment by moment.
Obviously we aren't allowed to benchmark against Oracle, but you can evaluate our stuff and let everyone know. So that's one difference.
I think the Continuous Computation Language (CCL) name captures the essence of these systems: data flows through the computation/query.
These systems have always had promise but none have found anything but niche adoption. The two most popular use cases seem to be ETL-like dataflows and OLAP style Window queries incrementally updated with streaming data (e.g. computations over stock tick data joined with multiple data sources).
If you want to maintain the results of a SQL query with a correlated subquery, StreamSQL in Aurora did not do that (full query decorrelation is relatively recent, afaiu). I have no idea what TIBCO's current implementation does.
If you want to maintain the results of a SQL query containing a WITH RECURSIVE fragment, you can do this in differential dataflow today (and in time, Materialize). I'm pretty sure you have no chance of doing this in StreamSQL or CCL or CQL or BEAM or ...
The important difference is that lots of people do actually want to maintain their SQL queries, and are not satisfied with "SQL inspired" languages that are insert-only (Aurora), or require windows on joins, or only cover the easy cases.
With all due respect, CREATE SINK and CREATE SOURCE are SQL-like. I would argue that the pipeline created from the set of SINKs and SOURCEs is the key concept to grasp for developers new to your platform. The purity of the PostgreSQL CREATE MATERIALIZED VIEW syntax and other PG/SQL constructs seems like a minor selling point, in my (very narrowly informed) opinion. I hope I'm wrong.
Our difference of opinion involves marketing language and perceived differentiators. There are some important use cases for continuous SQL queries over Kafka-like data streams that remain unaddressed (as far as I know). I hope Materialize gains traction where others have failed to do so. If PG/SQL compatibility was the only thing holding back this style of solution then kudos to you and your team for recognizing it. Good luck (honestly).
#1 -- Postgres is built in C, while Timely/Differential (which underpin Materialize) are built in Rust. Materialize could be a Postgres extension at some point, but for now we want to control the stack (to more easily do cross-dataflow optimization, etc)
#2 -- We are absolutely interested in exposing an API to enable new data sources (and data sinks!). It currently takes a bit of time to understand some basic Timely Dataflow concepts, but we intend on documenting this and opening things up. We're also trying to understand user requirements around things like durability guarantees and UPDATE/DELETEs (Feel free to email me or hit us up in our public Slack as well).
If anyone is interested in sending data via HTTP POST, we'd love to hear more: https://github.com/MaterializeInc/materialize/issues/1701.
Much more interesting than our speed, though, in my opinion, is the fact that you can use materialized as the place where you do joins _across_ databases and file formats. It's particularly interesting in a microservices environment, where you may have a postgres db that was set up by one team and a mysql db that was set up by another team and the only thing you care about is doing some join or aggregate across the two: with materialized (and debezium) you can just stream the data into materialized and have continuously up to date views across services. Combine this with a graphql or rest postgres api layer (like hasura) and a large amount of CRUD code -- entire CRUD services that I've worked on in the past -- just disappears.
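Roughly, that setup looks like the following (broker, topics, and schema-registry URL are placeholders, and the exact source syntax has shifted between Materialize versions, so treat this as a sketch): two CDC streams flow through Debezium/Kafka into Materialize and get joined in one continuously updated view.

```sql
-- Sketch: a Postgres 'users' table and a MySQL 'orders' table, each captured by
-- Debezium into Kafka, joined across services inside Materialize.
CREATE SOURCE users
FROM KAFKA BROKER 'kafka:9092' TOPIC 'pg.public.users'
FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://schema-registry:8081'
ENVELOPE DEBEZIUM;

CREATE SOURCE orders
FROM KAFKA BROKER 'kafka:9092' TOPIC 'mysql.shop.orders'
FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://schema-registry:8081'
ENVELOPE DEBEZIUM;

CREATE MATERIALIZED VIEW orders_per_user AS
SELECT u.id, u.email, count(*) AS order_count
FROM users u
JOIN orders o ON o.user_id = u.id
GROUP BY u.id, u.email;
```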
My understanding is that it means that the database doesn't update the view simply because a datum arrived, but rather only if the datum will change it. That avoids an enormous amount of churn and makes materialized views fantastically cheaper on average.
Edit: The original thread here on HN has lots of detailed answers from the Materialize folks, plus some gushing fandom from that jacques_chester dweeb: https://news.ycombinator.com/item?id=22359769
Oracle's docs have entire features in order to help you diagnose whether your views are eligible for fast or synchronous refresh. In Materialize, everything you can express in SQL is fast refreshed. No caveats.
*Almost, because the surface area is truly astonishing.
There’s also ksqlDB that can be used in a similar way?
The Go library is in beta, working on a server that's wire compatible with Redis.
I think in the future there’s going to be a sine-wave of smart-clients consuming S3 cleverly, and then smartness growing in the S3 interface so constraints and indices and things happen in the storage again, and back and forth...
I like your point about consuming S3 cleverly, it's often difficult to get good out of the box performance from S3 so abstracting that to the degree possible is good for end-users. The cloud vendors though are always one or two steps ahead of companies that build upon their services. AWS Redshift for instance already can pre-index objects stored on S3 to accelerate queries at the storage layer. It's difficult as a third party vendor to compete with that.
A "thick-client" also doesn't perform well unless that client is located on a node in the same region. I think as with everything it works well in some cases and not well in others.
I was referring more to the fact that the cloud vendors can co-design their infrastructure and software to support their database services.
The principles are (from my perspective): what would a database (API) look like if both data changes over time and structural flexibility were first-class concepts? The result is almost mindbogglingly simple.
Datalog is already such a powerful language in and of itself and I really wonder why it is still such a niche thing.
Don't get me wrong. SQL is great. It is there for a reason. And it is and should be "the default" for most use-cases. But Clojure Datalog DBs have a fairly clear and large advantage, especially when data is temporal and/or graph-like.
With that said, CouchDB 4.0 (on FDB) is going to be killer. Master-Master replication between clients and server with PouchDB is phenomenal when you remove the complicated eventual consistency on the server side.
And as a plug, I'm building a multi-tenant/multi-application database on top of it.
I always find some issue or caveat or problem and I decide in the end that Postgres gets most of the way there anyway and I return to Postgres.
Whenever I get tempted by a shiny new database I remind myself "don't bet against Postgres".
PG has mostly skipped the whole time series database trend until Timescale showed up. Still waiting waiting for the graph-db and git-db extensions!
I think there’s been so much progress at the lowest level, that new SQL-databases might be kind of foolhardy to not build on top of PG at this point, especially if their main claim to fame is essentially a data type and some fancy indexing.
I hope PG ends up more like Linux, and less like C++ or Java. :)
> making a database look like an rpc api
I'd recommend checking out PostgREST for this (if you're using Postgres). We used this approach in my previous startup quite successfully.
We also have plans at Supabase to make Postgres databases semi-ephemeral. You'll be able to spin them up/down using a restful API. The schemas/types will be accessible via a restful API: https://supabase.github.io/pg-api/. Not quite as simple as SQLite, but a good alternative.
Databases are cool.
This is precisely one of the features that make ClickHouse shine
It's basically a 200 line document database in Python that's backed by SQLite. I need to store a bunch of scraping data from a script but don't want a huge database or the hassle of making SQLite tables.
Goatfish is perfect because it stores free-form objects (JSON, basically), only it lets you index by arbitrary keys in them and get fast queries that way.
It's pretty simple, but a very nice compromise between the simplicity of in-memory dicts and the reliability of SQLite.
Isn't this basically what FoundationDB is?
SQL Server ala 10+ years ago enters the discussion.
Because they what?
Something this focused should have a few applications where bit level auditability matters, eg financial, chain of events, etc. Of course it comes with some tradeoffs vs a relational or kv db.
I wonder if there would be room for a self-hosted clone?
You can even detach and reattach the view from its backing table.
Great resources will be appreciated.
What's your perspective on predictive databases like https://aito.ai?
I'm one of the Aito.ai founders. If you would like to hear more, I'm happy to talk one-to-one.
Also, I'd like to add one database to the list (I've been working there for 3 weeks now): TriplyDB. It makes linked data easier.
Linked data is useful for when people of different organizations want a shared schema.
In many commercial applications one wouldn't want this, as data is the valuable part of a company. However, scientific communities, certain government agencies and other organizations -- that I don't yet know about -- do want this.
I think the coolest application of linked data is how the bio-informatics/biology community utilizes it [1, 2]. The reason I found out at all is because one person at Triply works to see if a similar thing can be achieved with psychology. It might make conducting meta-studies a bit easier.
I read the HN discussions on linked data and agree with both the naysayers (it's awkward and too idealistic) and the yaysayers (it's awesome). The thing is:
1. Linked data is open, open as in open source; the URI is baked into its design.
2. While the 'API'/triple/RDF format can be awkward, anyone can quite easily understand it. The cool thing is: this includes non-programmers.
3. It's geared towards collaboration. In fact, when reading between the lines, I'd argue it's really good for collaboration between a big heterogeneous group of people.
Disclaimer: this is my own opinion, Triply does not know I'm posting this and I don't care ;-) I simply think it's an interesting way of thinking about data.
A friend of mine once modeled some biochemistry part of C. elegans from linked data into Petri nets: https://www.researchgate.net/publication/263520722_Building_...
 https://www.google.com/search?client=safari&rls=en&q=linked+... -- I quickly vetted this search
 I still don't know the difference between a URI and URL.
 I think back in the day, linked data idealists would say that all data should be linked to interconnect all the knowledge. I'm more pragmatic and simply wonder: in which socio-technological context is linked data simply more useful than other formats? My current very tentative answer is those 3 points.
Edit: Also, to answer your question, the difference between a URL (Uniform Resource Locator) and a URI (Uniform Resource Identifier) is that the URL actually points to something, an object at a particular location, and you can paste it into your web browser to view that something. A URI just uses a URL-like scheme to represent identifiers, such that your domain and directory structure provide a kind of namespace that you control. But as long as it follows the format, your URI can contain literally anything, it doesn't have to be human-readable or resolve to anything in a web browser. It might as well be mycompany_1231241542345.
I'm not sure what's the point you're trying to make, that is exactly what RDF is for! It's not an implementation technology, it's purely an interchange format. You should not be using a fully-general triple store, unless you really have no idea what the RDF you work with is going to look like.
SPARQL is the same deal; it's good for exposing queryable endpoints to the outside world, but if your queries are going to a well-defined relational database with a fixed schema and the like, you should just translate the SPARQL to SQL queries and execute those.
It seems you're proving my point: you didn't need to collaborate outside of your company. So you picked the path of least resistance and that's totally what I would do too.
But what if you work together with a plethora of academic institutions and then decide that you want to keep options open such that other disciplines can connect to you and you can connect to them, automatically?
You could create a consortium and create a shared Postgres schema (among other things), since everyone knows it. Or you could just put your linked data online with a web page, no consortium needed. Anyone who wants to link to you, they can. And if they publish their data, then by no effort of your own, your data is enriched as well.
I view linked data as a DSL of some sorts. DSLs are amazing, except if you try to force fit them into something they shouldn't be force fitted into. You are giving an argument that one should not force fit it within one organization.
And I agree with that since that's not where linked data shines. Just like SQL doesn't shine at making a raytracing engine, but that doesn't prevent anyone  ;-)
That's my current view anyway (again, 3 weeks in, I mostly dealt with NodeJS and ReactJS issues at the moment).
Also, a lot of SPARQL looks like SQL (to my new eyes anyway). Here's a tutorial on it (for a basic feel, watch until episode 5 -- takes about an hour, or watch just the first episode to get a feel): https://www.youtube.com/watch?v=nbUYrs_wWto&list=PLaa8QYrMzX...
Using Python as an example language, TileDB-Py offers simple functions, like "from_numpy", for basic and simple models which we can write directly. For more advanced (and more common) use cases of complex models, it's easy to create an array to store the model and all associated metadata. TileDB's array metadata and time-travel capabilities even allow you to see how your model changes over time as you update it.
Disclosure: I'm a member of the TileDB team.
When we change titles, we always try to use representative language from the article. I replaced it with a phrase from the article where it says what it's about in a neutral way. I added a hyphen to "recently minted" because it seems to me that grammar requires it. However, if we worship different grammar gods I am happy to let yours hold sway. The hyphen is now toast.
This is bog-standard HN moderation. I dare say the reason you "really really hate it" is probably because you notice the cases you dislike and weight them much more heavily: https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que.... Meanwhile the ones you don't find objectionable pass over relatively unnoticed. That is bog-standard HN perception bias.
If we didn't meticulously try to keep titles linkbait-free and less-misleading (more hyphens, sorry!), HN's front page would be completely different, and since the front page is the most important thing here, the entire forum would soon be completely different. That requires editing titles all the time. petercooper, who posted elsewhere in this thread, has an HN title-change tracker (oops, I hyphened again!) which is quite nifty. It doesn't show all of them, though, and it can't tell the difference between moderator edits and submitter edits.
There is a power shortage in my neighborhood and my laptop battery is about to expire, so if you don't see a link momentarily, or if I don't respond anywhere on the site for a few hours, it's because shelter-in-place (uh-oh!) prevents me from finding somewhere else to plug in.
Edit: here it is: https://news.ycombinator.com/item?id=21617016. Note how we edited out the bloody pig mask.
Battery is at 2%...
If it goes on too long, and someone has a car nearby, you can use that as a power source. (Be sure to keep the engine running though, otherwise you'll end up with two objects that need a power source...)
I mean, I still had my phone. But moderating HN with a phone is like trying to do surgery with only your thumbs, through a two-inch hole. In the dark.
It's usually not correct to add a hyphen between an adjective and an adverb that modifies it. The case of "recently minted database technologies" matches this rule. Since "linkbait-free" and "title-change" are single adjectives made up of several words each, it's correct to use the hyphen.
However, in the case of "less-misleading", the hyphen is generally incorrect as "less" is an adverb. The exception would be if it precedes an uncountable noun, where "less" could conceivably modify the noun. Example: "less misleading critique" would be ambiguous, so using a hyphen would be appropriate.
Thank you for your attention to detail, dang!