Ask HN: What could a modern database do that PostgreSQL and MySQL can't
338 points by eatonphil 16 days ago | 321 comments
If there were a general-purpose OLTP SQL database that was developed today, what features or design decisions would it pick that PostgreSQL and MySQL cannot adapt for historic reasons?

Or put another way, what are some cutting edge OLTP database techniques/features/architectures that PostgreSQL and MySQL would have a hard time supporting?

Lots of people are citing cutting-edge bells and whistles for advanced query plans and data types, but after working at $BIGCO and seeing what exists there (and has existed for years), I get frustrated by the lack of ergonomic and operational essentials in the public-cloud options.

1. Horizontal scaling. PG can do sharding or replication, but that's not the same thing. Manual resharding is something that, once you've experienced it, you don't want to do it ever again, especially when things like Spanner/Cockroach exist. Database-level liquid sharding is such a dramatically superior solution that it makes any system that depends on an enumerated set of master instances seem completely obsolete.

2. Strong commit timestamps, again a la Spanner/Cockroach. Globally-ordered commit times and read snapshots aren't something you'll need in every schema, but when you do, they're magical - suddenly, you can use your OLTP system for strong lease assignment (conditioning other transactions on lease validity), you can construct arbitrary application-level transaction semantics (non-transactional read followed by transactional compare-and-swap is pretty powerful in some scenarios), and all sorts of other things.
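The lease trick in point 2 can be sketched with plain compare-and-swap writes. This is a minimal, illustrative sketch only: sqlite3 stands in for a store with strong commit ordering (Spanner/Cockroach give this globally), and every table, column, and function name here is invented.

```python
# Lease assignment via compare-and-swap, with later writes conditioned
# on lease validity -- all names are hypothetical; sqlite3 is a stand-in.
import sqlite3, time

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE leases (name TEXT PRIMARY KEY, owner TEXT, expires_at REAL)")
db.execute("CREATE TABLE work (id INTEGER PRIMARY KEY, payload TEXT, done_by TEXT)")
db.execute("INSERT INTO leases VALUES ('worker-lease', NULL, 0)")
db.execute("INSERT INTO work (payload) VALUES ('job-1')")
db.commit()

def acquire(owner, ttl=30.0):
    # Compare-and-swap: only take the lease if it is free or expired.
    cur = db.execute(
        "UPDATE leases SET owner = ?, expires_at = ? "
        "WHERE name = 'worker-lease' AND (owner IS NULL OR expires_at <= ?)",
        (owner, time.time() + ttl, time.time()))
    db.commit()
    return cur.rowcount == 1

def complete_job(owner, job_id):
    # The write only lands if the lease is still ours, in one transaction.
    cur = db.execute(
        "UPDATE work SET done_by = ? WHERE id = ? AND EXISTS ("
        " SELECT 1 FROM leases WHERE name = 'worker-lease'"
        " AND owner = ? AND expires_at > ?)",
        (owner, job_id, owner, time.time()))
    db.commit()
    return cur.rowcount == 1
```

With globally-ordered commits, the same conditional-write shape works across nodes, which is what makes the OLTP store usable for lease assignment.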

3. Interface abstraction. Databases that ship language-support drivers should also ship a fake in-memory implementation for testing purposes that supports the same interfaces and options. There's no reason why I should have to start a subprocess and emulate a network connection to tell whether a query functions properly or not, or whether I've set isolation properties correctly for a pipeline.

4. Declarative schemas, and a well-defined update process (that isn't human-written DDL), are essential at any sort of organizational scale. Every time I see an "ALTER TABLE" statement checked into a repository, I tear my hair out.

5. Queue support, where message publishing/consumption is transactional with other read/write operations. You can always build a queue with a time-keyed table, but it's kind of a pain in the ass to get all the edge cases right. It's much better as a first-class concept supporting automatic sharding, retries, batching, and bounded-out-of-order delivery with a global cursor.
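The hand-rolled time-keyed queue table that point 5 says is a pain to get right might look like the following single-consumer sketch. All names are made up and sqlite3 stands in for the OLTP store; a real multi-consumer Postgres version would claim rows with FOR UPDATE SKIP LOCKED instead of the SELECT-then-UPDATE below.

```python
# A minimal time-keyed queue table: visibility timestamps drive both
# ordering and retries. Single-consumer sketch; names are hypothetical.
import sqlite3, time

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE queue (
    id INTEGER PRIMARY KEY,
    visible_at REAL NOT NULL,   -- time-keyed: drives ordering and retries
    attempts INTEGER DEFAULT 0,
    body TEXT NOT NULL)""")

def publish(body):
    db.execute("INSERT INTO queue (visible_at, body) VALUES (?, ?)",
               (time.time(), body))
    db.commit()

def consume(retry_after=60.0):
    # Claim the oldest visible message by pushing its visibility into the
    # future; a crashed consumer's message reappears automatically.
    row = db.execute(
        "SELECT id, body FROM queue WHERE visible_at <= ? "
        "ORDER BY visible_at LIMIT 1", (time.time(),)).fetchone()
    if row is None:
        return None
    db.execute("UPDATE queue SET visible_at = ?, attempts = attempts + 1 "
               "WHERE id = ?", (time.time() + retry_after, row[0]))
    db.commit()
    return row

def ack(msg_id):
    db.execute("DELETE FROM queue WHERE id = ?", (msg_id,))
    db.commit()
```

Even this toy version shows the edge cases (retries, visibility, at-least-once delivery) that a first-class queue feature would absorb.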

> Horizontal scaling

I think the importance of horizontal scaling is overhyped. 99% of PostgreSQL applications are at a size where a single machine can easily handle the workload.

To enable horizontal scaling, you need to make so many tradeoffs that I don't think it's worth it for most applications.

I see people responding to this thread with examples of high-volume workflows that use a small number of machines for the database layer.

I think those examples are missing the point. One of the examples uses a Redis cluster, which is a sign that the DB alone can't support the workflow.

Another example is using a database for certificate issuance, which I suspect uses one table or a few that can be sharded, and doesn't suffer from lock contention or joins across multiple tables.

I wouldn't be brave enough to count the number of cases where horizontal scaling is needed, but I would say it is definitely not zero, especially for read replicas when high availability is needed.

I'd be willing to believe 99% of applications, yes - for the same reason that Wordpress can claim to drive some ridiculous percentage of websites - but since the larger use cases tend to come from larger organizations, 99% of developers don't work on systems where one instance suffices. There's a reasonably large number of companies for which that's not true, and there's not very many good "next step" options when you're too big for vertical scaling but too small to fund building your own engine.

Where did you get the 99% metric? Most companies, even with a single SaaS product, have an insane amount of data these days.

Not just application data; there is also a whole lot of analytical data collected at every step of the product usage cycle.

> Where did you get the 99% metric? Most companies, even with a single SaaS product, have an insane amount of data these days.

I will just leave this link here ....


And this one -- StackOverflow, one single live db:


From managing databases over the last two decades (starting with Ingres/mSQL), hardware has grown much faster (RAM/CPU/NVMe) than the data needs. I remember the first time we put 2 TB of RAM into a machine to handle most analytics in memory.

From my experience, TimescaleDB is fast and ingests a lot of data. If you're multi-tenant, it's usually easy to shard.

And we do data warehousing on BigQuery; no need to have your own machine and manage the database.

Of course there are people who need unlimited horizontal scaling.

TimescaleDB also offers a good number of supplementary functions on top of the PostgreSQL core product to help with time series data analysis... saves a lot of SQL acrobatics!

Generally you don't have your main application database also serve as your data warehouse for lots of reasons.

Generally you offload that to purpose-built systems that favor those aforementioned tradeoffs, or onto services that run them for you, e.g. BigQuery, Snowflake, etc.

Your main application database is unlikely to need sharding unless you really do have a phenomenal amount of customers or you need regional sharding to meet legal requirements about data sovereignty for example.

If you do SaaS, you are already in the 1%.

You are right, the metric should probably be 99.999999999999999%. 99% is way too conservative.

Two things stick out here that I don't fully understand.

> Declarative schemas, and a well-defined update process (that isn't human-written DDL), are essential at any sort of organizational scale.

Isn't this impossible because some schema changes require data migration? A data migration cannot be declaratively automated as far as I know.

> Queue support

Why not use a dedicated and feature-rich queue such as RabbitMQ or, if you want to get really fancy, Kafka?

> Isn't this impossible because some schema changes require data migration?

In the most general case, sure - although there are workarounds for some specific cases (e.g., including previously-known-as names in the declarative schema to allow automatically planning renames). But 99% of the time, you're adding and removing tables and columns in a way that's very well defined. This is one of those areas where the best solution is to legislate away the hard problems - a tool that covers 98% of schema changes automatically (the provably safe ones), and then fails/requires a human to approve the last 2% is still dramatically better than having humans manually write and sequence change plans all the time.
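A toy version of that split, with schemas as plain dicts and every name hypothetical: a diff that emits only the provably-safe changes and flags the rest for human review.

```python
# Sketch of a declarative schema differ. "declared" is the checked-in
# schema, "live" is what the database reports; both are {table: {col: type}}.
# This is illustrative only -- a real tool would handle indexes, constraints,
# renames, and emit complete DDL.
def plan_changes(declared, live):
    safe, needs_review = [], []
    for table, cols in declared.items():
        if table not in live:
            # New tables are always safe to create.
            safe.append(f"CREATE TABLE {table} (...)")
            continue
        for col, typ in cols.items():
            if col not in live[table]:
                safe.append(f"ALTER TABLE {table} ADD COLUMN {col} {typ}")
            elif live[table][col] != typ:
                # Type changes may need a backfill: not automatically safe.
                needs_review.append(f"{table}.{col}: {live[table][col]} -> {typ}")
        for col in live[table]:
            if col not in cols:
                # Drops are destructive; require human sign-off.
                needs_review.append(f"{table}.{col}: drop?")
    return safe, needs_review
```

The point is the shape of the tool, not the details: the 98% of mechanical changes go through automatically, and only the flagged remainder needs a human plan.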

Data migrations will require human effort, but you can sequence the changes and isolate the parts that need different kinds of work. If you're changing a string to an integer, for instance, you can make it clear in your change history that you (1) add the new column, (2) start dual writes, (3) backfill, (4) move readers to the new column, (5) stop the dual write, (6) drop the old column. You can do that with checked-in migration code, but think about what you end up with at the end - with imperative migrations, you have the clutter from all of that history; with declarative schema definitions, you just have the final state, just like how it works for code.

Declarative schemas also usually come with nice ancillary properties - for instance, they can give you automatic rollbacks, because you can pack up the whole schema definition as a static artifact.
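The string-to-integer sequence above, compressed to the steps the database actually runs (the dual-write steps live in application code). Table and column names are invented, and sqlite3 is the stand-in engine:

```python
# Steps (1), (3), and (6) of the column-migration sequence; (2), (4), (5)
# happen in application code. Hypothetical schema, sqlite3 as stand-in.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, qty_str TEXT)")
db.executemany("INSERT INTO orders (qty_str) VALUES (?)", [("3",), ("17",)])

# (1) Add the new column alongside the old one.
db.execute("ALTER TABLE orders ADD COLUMN qty INTEGER")
# (2) Application starts dual-writing qty_str and qty here.
# (3) Backfill existing rows.
db.execute("UPDATE orders SET qty = CAST(qty_str AS INTEGER) WHERE qty IS NULL")
db.commit()
# (4) Readers move to qty; (5) dual writes stop; (6) qty_str is dropped.
```

With a declarative tool, only the final schema (with `qty INTEGER`) stays checked in; the intermediate states live in the change history rather than as permanent migration scripts.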

> Why not use a dedicated and feature-rich queue such as RabbitMQ or, if you want to get really fancy, Kafka?

Atomicity. It's really, really powerful to be able to write a row and send or receive a message as a single exactly-once transaction. Imagine a queue of events. The writer writes the event data, and sends a message as a notification that it exists; the consumer will keep some sort of aggregate (count, filtered total, etc.). With a separate queueing system, you have to explicitly manage the timing and error cases bridging different atomicity domains, usually reimplementing some sort of idempotency log on your own. If it's all in one place, you just write a transaction that reads the old value, consumes the queue message, and writes a new value.

I'm doing a project perhaps relevant to this. Talk of declarative schemas (or what you imply they offer) is very interesting and I'd like to know more, but I can't find anything relevant (just Magento and SQLAlchemy). Indeed, searching for <<<"Declarative schemas" "sql">>> in Google gets this very thread on the first results page.

Any links to clear, actionable and reasonably comprehensive examples of these would be most helpful. Obviously abstract statements of the required semantics are also needed, but I also need to see what actual code would look like.


Magento (the new XML-based version, not the old create/update/revert-script version) gives a lot of these properties. Making it part of the database instead of a third-party tool would be better, though - it lets you cover more features, and with deeper integration you can get transactional updates and history logs.

> Why not use a dedicated and feature-rich queue such as RabbitMQ or, if you want to get really fancy, Kafka?

If your queue is in the same database as your data, you can do "ack message/mark job as complete" and "commit result of job to database" in the same transaction. That simplifies a lot of the harder parts of things like materialization pipelines or BI systems.
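A minimal sketch of that property, with sqlite3 standing in for the shared database and hypothetical job/result tables: the ack and the result land in one transaction, or not at all.

```python
# Acking a job and committing its result atomically. Because both rows
# live in the same database, there is no window where the job is acked
# without its result, or processed twice. Names are made up.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, input INTEGER)")
db.execute("CREATE TABLE results (job_id INTEGER PRIMARY KEY, output INTEGER)")
db.execute("INSERT INTO jobs VALUES (1, 21)")
db.commit()

def process_one():
    job = db.execute("SELECT id, input FROM jobs LIMIT 1").fetchone()
    if job is None:
        return False
    with db:  # one transaction: both statements commit or roll back together
        db.execute("INSERT INTO results VALUES (?, ?)", (job[0], job[1] * 2))
        db.execute("DELETE FROM jobs WHERE id = ?", (job[0],))
    return True
```

With a separate broker, bridging those two atomicity domains is exactly where the idempotency-log machinery the parent comment describes comes in.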

Wanted to ask you more about the experiences you wrote. Have some similar thoughts as your later points. Do you have a contact method like an email address or something?

> .. scaling

Did you try Stolon? https://github.com/sorintlab/stolon

Maybe I'm reading it wrong, but that looks like yet another multi-master high-availability system, which is also an important property for a system to have but more or less unrelated to what I was talking about.

As far as open-source add-ons for true horizontal scaling, I think [Citus](https://github.com/citusdata/citus) is the most well-known and sophisticated, but I don't have enough experience with it in production to have a particularly strong opinion yet. It might work quite well, but I still fundamentally doubt that it will ever be as good as a system that was designed to support horizontal sharding from the ground up.

> 1. Horizontal scaling. PG can do sharding or replication, but that's not the same thing. Manual resharding is something that, once you've experienced it, you don't want to do it ever again, especially when things like Spanner/Cockroach exist. Database-level liquid sharding is such a dramatically superior solution that it makes any system that depends on an enumerated set of master instances seem completely obsolete.

It feels like the industry has largely failed at this in regards to the traditional RDBMSes. Solutions like TiDB attempt to remedy this in a transparent way, but there are simply too many limitations for it to be a feasible alternative for many projects out there: https://docs.pingcap.com/tidb/stable/mysql-compatibility#uns...

I've only ever seen PostgreSQL/MySQL/MariaDB/Oracle be deployed successfully as single instances in prod that are scaled vertically, because attempts to do otherwise either create issues with performance and resiliency, or make the entire system only have eventual consistency, which is as problematic as it is inevitable when it comes to horizontally scaled and distributed systems.

I don't think that there is a good answer to this, honestly. You will get either horizontal scalability OR data consistency - pick one.

> 3. Interface abstraction. Databases that ship language-support drivers should also ship a fake in-memory implementation for testing purposes that supports the same interfaces and options. There's no reason why I should have to start a subprocess and emulate a network connection to tell whether a query functions properly or not, or whether I've set isolation properties correctly for a pipeline.

That's a good idea, but personally I've seen that almost all abstractions are leaky in one way or another. In the day of containers and highly customizable software that can scale from tens or hundreds of megabytes of memory with a fraction of a CPU core to an entire server with a hundred gigabytes of memory and dozens of CPU cores, it feels somewhat unnecessary to risk those leaky abstractions.

Just make sure that your process for initializing the DB schema and seeding it with data for testing is fully automated and reproducible, and then spin up a disposable database container as part of your CI process. Either that, or just don't test the actual DB implementation, which I've also seen done, with pretty horrendous results.

Edit: Perhaps my past experience gives me certain biases - I've also seen manually set up DB instances, which eventually leads to there being no good way to test N parallel feature branches, each of which needs a different schema, because that also necessitates N DB instances being present, half or more of which could in theory break. I don't see any good options apart from containers here, especially when you need extensions which these in-memory implementations wouldn't play nicely with.

It depends on what you're testing. Performance properties and version updates, sure, it's hard to find a substitute for containers. But that's not 99% of testing. If the database has a stable interface (for instance, its dialect of SQL+DDL), it can ship a fake implementation of that interface to use in unit tests or toy examples.
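Today the closest approximation to a vendor-shipped fake is an in-memory sqlite3 connection behind the same DB-API interface, which is roughly what's being proposed, except shipped by the database itself so the dialect matches. Dialect differences will leak for anything engine-specific; the repository class here is a made-up example.

```python
# Code under test takes any DB-API connection, so unit tests need no
# subprocess or network connection. A vendor-shipped fake would speak the
# real dialect; sqlite3 is only an approximation. Names are hypothetical.
import sqlite3

class UserRepo:
    def __init__(self, conn):
        self.conn = conn

    def add(self, name):
        self.conn.execute("INSERT INTO users (name) VALUES (?)", (name,))
        self.conn.commit()

    def count(self):
        return self.conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]

def make_test_repo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    return UserRepo(conn)
```

The same repository then takes a real connection in production; the fake only covers the stable-interface slice of testing, as the comment says.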

> If the database has a stable interface (for instance, its dialect of SQL+DDL), it can ship a fake implementation of that interface to use in unit tests or toy examples.

Sometimes the interface just has too large of a surface area for it to be simulated properly.

For example, suppose that you have an Oracle stored procedure, which, when given about 20 parameters, goes through a decision tree in a PL/SQL block that spans a few thousands of lines and along the way uses both DBMS_OUTPUT.PUT_LINE, as well as DBMS_APPLICATION_INFO.SET_SESSION_LONGOPS, any number of methods from DBMS_PREDICTIVE_ANALYTICS as well as any number of other packages that are included in the full release: https://docs.oracle.com/en/database/oracle/oracle-database/1... which may also include logic that makes network requests, sends e-mails or even triggers other pieces of code to be executed.

All of that would be almost impossible to fake well enough because of the scope of functionality that's used, especially if methods are not only called, but their output is also checked and the execution depends on this. Should systems be simpler and do less? Probably, but that's not the objective reality in many projects out there.

That said, I agree that it's a good thing to have if you treat your DB as a data store and not much more, which in my opinion is a pretty good and stress-free way to go about things overall. But somehow that's also like saying "Just use an ORM to have your system be DB agnostic!", which never actually works out in practice because of implementation details that leak through.

1. The high-performance read/write of Scylla/Cassandra with high availability[1]. It has some limitations for OLTP workloads and requires careful planning. Postgres without Citus is not really HA, and even then companies often invent a custom sharding/cluster management system.

2. Powerful graph query language like Cypher. It might not perform well in real life, but my personal experience left me amazed[2]. There is a number of issues with SQL that could be addressed[3], but current standard is way too much prevalent.

3. Zero impedance mismatch between database and application representation. In database like Smalltalk GemStone it is really seamless experience for developer to write database code and application code[4]. To some extent, MongoDB success can be attributed to this aspect.

4. Datomic's temporal capabilities[5]. It is hard to maintain temporal tables in Postgres. There are some use cases where you really want to query at a point in time. Not strictly an OLTP feature, but I can see this being useful in many scenarios.

  [1] https://www.youtube.com/watch?v=Fo1dPRqbF-Q
  [2] https://www.youtube.com/watch?v=pMjwgKqMzi8&t=726s
  [3] https://www.edgedb.com/blog/we-can-do-better-than-sql
  [4] https://www.youtube.com/watch?v=EyBkLbNlzbM
  [5] https://www.youtube.com/watch?v=7lm3K8zVOdY
Edited: independence -> impedance
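The temporal tables from point 4 can be maintained by hand in the painful way described: every change closes the old row's validity interval and opens a new one, so any past state stays queryable. A sketch with invented names, sqlite3 standing in for Postgres:

```python
# Hand-maintained temporal table: (item, price, valid_from, valid_to),
# where valid_to IS NULL marks the current row. Names are hypothetical;
# a Datomic-style database would give this "as of" querying for free.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE price_history (
    item TEXT, price INTEGER,
    valid_from INTEGER, valid_to INTEGER)""")

def set_price(item, price, at):
    # Close the currently-open row, then open a new one.
    db.execute("UPDATE price_history SET valid_to = ? "
               "WHERE item = ? AND valid_to IS NULL", (at, item))
    db.execute("INSERT INTO price_history VALUES (?, ?, ?, NULL)",
               (item, price, at))
    db.commit()

def price_as_of(item, t):
    # "AS OF" query: the row whose validity interval covers t.
    row = db.execute(
        "SELECT price FROM price_history WHERE item = ? AND valid_from <= ? "
        "AND (valid_to IS NULL OR valid_to > ?)", (item, t, t)).fetchone()
    return row and row[0]
```

The pain is that every write path must remember the close-then-open dance and every index must cover the interval columns, which is exactly what first-class temporal support would absorb.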

> Zero impedance mismatch between database and application representation

Does this include information hiding/encapsulation? (to prevent saved objects' internal representation from being exposed)

Traditional databases don't have an encapsulation mechanism AFAIK, which is one of the reasons for impedance mismatch.

This is important because it is good practice for client code to make no assumptions about the internal representation, accessing data only via a public interface.

If it happens to be exposed by the database, the clients can use it in their queries. If the internal representation changed later on, such clients would be broken.

Of course, this can be solved by only allowing data access via, say, well designed restful apis (that don't expose internal details), but this would still provide no guarantees.

How about another reason for impedance mismatch, that of storing objects that belong to a class hierarchy?

> Traditional databases don't have an encapsulation mechanism AFAIK, which is one of the reasons for impedance mismatch.

They actually do: views and functions.

The real problem with the impedance mismatch is that SQL is declarative (you say what you want and the database figures out how to get it) while most programming languages are imperative (you say what should be done).

The issue is that you have two very different languages. For one you have a powerful IDE with type checking, auto-completion, and refactoring capabilities; the SQL is often sent as a string and doesn't have these benefits. The various ORMs are attempts to use an imperative, object-oriented language to access relational objects through a declarative language.

I think JetBrains is addressing the problem the right way. They added DataGrip functionality to their IDEs, like PyCharm for example. If you connect the IDE to a database and let it download the schema, you get the same functionality for your data: it will detect SQL statements in strings and offer the same capabilities for them as you have with the primary language.

At that point the impedance mismatch no longer feels like a mismatch. You basically have two languages, one to obtain data you need and another to process/present it. You can get database to return exact fields you need for your projects and even the object mapping starts feeling unnecessary.

Why is data stored in a relational way? Because that's the most optimal way to store it, and it allows multiple applications to access the same data differently.

For example with NoSQL you need to know how the data will be used so you correctly plan how it will be stored. If application changes you might need to restructure the entire data.

Ultimately the data is the most important thing a business has, and it stays, while applications that use it come and go.

> For example with NoSQL you need to know how the data will be used so you correctly plan how it will be stored. If application changes you might need to restructure the entire data.

Honestly, SQL has this problem too, but it presents itself not in the way you store, but in the way you query. There are simple schemas and complex ones, and irrespective of that there are obvious query sets and unplanned ones (i.e. written at runtime as part of the data analysis process). SQL and its autoplanning is required only for complex+unplanned cases, in my opinion. In all other cases I know my data and I’d better walk through the indexes myself rather than writing 4-story queries to satisfy the planner. At the end of the day, nested loops through the indexes is what RDBMS does. There is no declarative magic at the fetch-and-iterate level.

In other words, it would be nice to have an “extql”: direct access to indexes and rows, in sort of the way EXPLAIN works, skipping SQL completely.

  function get_items(store_id) {
    for (var item in items.id) {         // walk the primary index directly
      var res = item.{name, code}
      var item_id = item.id
      res.price = prices.any({item_id, store_id})?.price
      if (!res.price) continue
      res.props = todict(props.all({item_id}).{name, value})
      yield res // or collect for a bigger picture
    }
  }
This query could be an equivalent of “select from items inner join prices on (store_id, item_id) left join props on (item_id)” but saving space for many props and being much more programmable. Also, it would be nice to have the same engine (sql+extql) at the “client” side, where the inverse problem exists – all your data is nosql, no chance to walk indexes or declare relations.

> Honestly, SQL has this problem too, but it presents itself not in the way you store, but in the way you query. There are simple schemas and complex ones, and irrespective of that there are obvious query sets and unplanned ones (i.e. written at runtime as part of the data analysis process). SQL and its autoplanning is required only for complex+unplanned cases, in my opinion. In all other cases I know my data and I’d better walk through the indexes myself rather than writing 4-story queries to satisfy the planner. At the end of the day, nested loops through the indexes is what RDBMS does. There is no declarative magic at the fetch-and-iterate level.

The thing is that what worked at a specific time can change. For example, say you have a simple join between two tables, A and B. You search by a column in table A to get a value from a column in table B. Now if both tables are large, then it makes sense to look up in A by an index, then use the foreign key and index to find the row in table B.

Now if A and B have few elements, even if there is an index on both of them, it is actually faster to just scan one or both tables.
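The planner's choice is observable rather than magical. A small sketch with sqlite3's EXPLAIN QUERY PLAN (standing in for Postgres EXPLAIN; table and index names are invented) shows the same query flipping from a full scan to an index search once an index exists:

```python
# Inspect the access path the planner picks for the same query, with and
# without an index. sqlite3 is the stand-in engine; names are made up.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE a (id INTEGER, b_id INTEGER)")
db.executemany("INSERT INTO a VALUES (?, ?)",
               [(i, i % 10) for i in range(1000)])

def plan(sql):
    # EXPLAIN QUERY PLAN rows carry a human-readable detail column.
    return " ".join(row[-1] for row in db.execute("EXPLAIN QUERY PLAN " + sql))

before = plan("SELECT * FROM a WHERE b_id = 3")   # no index: full scan
db.execute("CREATE INDEX a_b_id ON a (b_id)")
after = plan("SELECT * FROM a WHERE b_id = 3")    # index: search
```

This is the crux of the disagreement upthread: the planner re-derives the access path from current statistics, whereas hand-walked indexes freeze one path in code.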

It might actually be more beneficial to ensure that tables are properly analyzed, have the right indices, and that preferences in the query planner are tuned.

If you need to override the query planner, you don't have to write sophisticated queries; you can just use this[1] extension. Though if things aren't working right, it is either a lack of data, misconfiguration, or a bug.

[1] http://pghintplan.osdn.jp/pg_hint_plan.html

> For example with NoSQL you need to know how the data will be used so you correctly plan how it will be stored. If application changes you might need to restructure the entire data.

This point is very important, and well explained in Stonebraker's paper "What Goes Around Comes Around". What is most interesting is that he is actually talking about half a century old pre-relational IMS IBM databases, but they had exactly the same issue, hence the paper's title. Codd invented the relational model after watching how developers struggled with the very problem you mentioned.

Stonebraker famously quipped that "NoSQL really stands for not-yet-SQL".

He also addresses the impedance matching issue in the "OO databases" section; there is actually a lot more to it, and he gives it all an insider's historical perspective.

Your view of what impedance mismatch is doesn't sound accurate. It's not about declarative or imperative or syntax or strings, etc...

It's about data modeling, one models data using relations, the other models data in a hierarchical way (using maps, arrays, objects, etc...). They are two different ways to structure your data, hence the impedance mismatch.

Perhaps I was using the wrong word. I was referring to how, traditionally, things felt a bit off when working with SQL, because often it wasn't code, just strings that you were sending.

Because of that, developers started to abstract that with code and objects that were then populated with data.

With IDEs understanding the SQL that's no longer necessary. I can construct a specific SQL to get the exact structure my program needs. Even if it is hierarchical I can use various jsonb aggregation functions. That's a game changer to me.
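A sketch of that jsonb-style aggregation, using sqlite's JSON1 functions as a stand-in for Postgres's jsonb_build_object/jsonb_agg (the schema and names are made up): the query hands back the nested shape the program wants, so no object-mapping layer is needed.

```python
# One query returns a nested structure directly, instead of flat rows
# that an ORM would have to reassemble. Hypothetical schema; sqlite3's
# json_object/json_group_array stand in for the Postgres jsonb functions.
import json, sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("CREATE TABLE books (author_id INTEGER, title TEXT)")
db.execute("INSERT INTO authors VALUES (1, 'Codd')")
db.executemany("INSERT INTO books VALUES (1, ?)", [("A",), ("B",)])

row = db.execute("""
    SELECT json_object('name', a.name,
                       'books', json_group_array(b.title))
    FROM authors a JOIN books b ON b.author_id = a.id
    GROUP BY a.id""").fetchone()
author = json.loads(row[0])   # already the hierarchical shape we want
```

In Postgres the equivalent would be `jsonb_build_object('name', a.name, 'books', jsonb_agg(b.title))`.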

Not just JetBrains. PgTyped does this for Postgres and TypeScript, and QueryFirst for C# against SQL Server/Postgres/MySQL.

I did not know that, I first encountered that in PyCharm.

I'm glad there's more.

Edit: actually what you mentioned is slightly different. This is what I'm talking about: https://youtu.be/_FlpiNno088?t=2863

> good practice for client code to make no assumptions about the internal representation

If your internal representation and API start to differ, it adds complexity fast. It's far better to have as close to a 1-1 mapping between your backend and frontend data models as possible.

Postgres comes with the building blocks for both sharding and HA out of the box, and they're extensively discussed in the docs. You don't need proprietary addons other than as pure convenience.

Don't underestimate the importance of convenience. I'm convinced one of the reasons MySQL had so much more mindshare than Postgres back in the day was that it was far easier to get up and running, even if Postgres might have been easier to use once everything was set up correctly.

That's funny, I must have been an outlier then.

I've been using Postgres since 1998, and I tried getting MySQL up first. There was more documentation available for the latter, so it should have been simple. Failed. It just didn't work.

Out of frustration I then tried Postgres, because I just wanted a decent database for my project. It was surprisingly easy, I only had to learn about pg_hba.conf to get to a functional state. Everything else was in place out of the box.

I've been a happy user ever since. MySQL may have had the mindshare (thanks to prevalence of LAMP) but everything outside the magic happy path was confusing and fragile.

No one cares what it was like in 1998; that's the whole point of this discussion. Postgres devs kept burying their heads in the sand for many years, saying that high availability is somehow not the task of the database. In reality, only the datastores need to be HA/durable in an otherwise stateless architecture.

I don't even remember choosing MySQL when I started (15 years ago). It was just so dominant, we didn't question it.

Nowadays I would still use it because I assume it is the dumbest database system and that's exactly what I need for my 1-5 user app.

It still has some features over PostgreSQL that pushed me to choose it (actually MariaDB) for a new project about a year ago, namely multi-master replication. Yes, I know, terrible database and horrible feature, but it really helped in that particular domain.

I couldn't find anything decent for Postgres, while MariaDB/MySQL have that built-in, with some differences in implementation. There are commercial solutions for Postgres, but not for a customer who refuses to pay for his software.

Sharding is not the same as natural clustering, because eventually you’ll need to reshard and then you’ll be writing a lot more code.

Postgres has many HA solutions, and it's getting better all the time. Postgres has good performance, benchmarks need to show the various trade-offs systems make so that informed decisions can be made. Feel free to post a link.

The Graph model can be made available through extensions. See AGE: https://age.incubator.apache.org/ They plan to support OpenCypher.

The JSONB type allows for No-SQL like development if that's what you really want.

Just because you can use JSONB, it doesn't mean that it is as easy to work with as it is the case with MongoDB. So if you really need a JSON store, just use Mongo.

> independence mismatch

you probably mean impedance mismatch, right?


> Zero impedance mismatch

I see the problem here being that too many manual optimizations need to be done when implementing a schema.

You start with a logical schema (ERD diagram) and then implement it via a physical schema with denormalizations added for efficiency (usually because the relational model has scaling limits with number of joins, or because db migrations are too difficult, or handling unstructured data without introducing a ton of new tables). The db should do the denormalization automatically, allowing the user to interface with their logical schema directly.

Another reason is we can't use SQL in the browser - we have to go through many caching layers and API layers which complicate things.

+1 to the application/database mismatch. GemStone is really amazing. I never got to do commercial work with it, but damn is it a great environment. I've never seen anything like it.

CockroachDB is getting a lot of interest these days.

It has broad PGSQL language (and I think also wire) compatibility, yet has a clustered peer architecture well suited to running in a dynamic environment like cloud or k8s. Nodes can join dynamically, and it can survive them leaving dynamically as long as there's a quorum. Data is distributed across the nodes without an administrator needing to make any shard-rebalancing interventions.

PGSQL is designed for deployment as a single server with replica servers for HA. It's not really designed for horizontal scalability like Cockroach. You can do it - the foreign data wrappers feature and table partitioning can give you a poor man's scale-out. Or you can use Citus, which won itself a FOSS license earlier this year. And there are other FOSS and proprietary approaches too.

MySQL is similar - you can do it, like with their recent router feature, but it has been retrofitted, and it's not as fluid as Cockroach. IIRC MySQL router is similar in configuration to Galera - that is, a static config file containing a list of cluster members.

Listen, I'm sure the design approach of Cockroach could be retrofitted to PGSQL and MySQL, but I'm pretty sure that doing a good job of it would be a lot of work.

So in answer to your question, I'm not sure that there's all that much an RDBMS can't be made to do. Geospatial, graph, timeseries, GPU acceleration - Postgres has it all, and often the new stuff comes to Postgres first.

By the way I love MySQL and PostgreSQL, and the amazing extensions for PGSQL make it extra awesome. Many are super mature and make pgsql perfect for many many diverse use cases.

For XXL use cases though, CockroachDB is taking a very interesting new path and I think it's worth watching.

Not a database expert, but here are my thoughts after using CockroachDB for a mid-sized project:

- Single-cluster mode is really nice for local deployment.

- The WebUI is a cool idea, but I wish it had more info.

- The biggest problem is Postgres compatibility; there are some quirks that were annoying, for example: the default integer types having different sizes, and CockroachDB having a really nice IF NOT EXISTS clause for CREATE TYPE but no Postgres-compatible way of doing the same thing (I think this one is on Postgres; only being able to do it with PL/pgSQL scripts is cumbersome).

- The neutral one for me: IT HAS ZERO-DOWNTIME SCHEMA CHANGES. But if you are coming from Postgres and just using a regular migration tool, the serious limitations on transactions and schema changes, which could leave your database in an inconsistent state, are scary. (The docs really do document the behavior, but I would still expect a runtime warning for it.)

CDB was great until we started doing table creation every minute and hundreds of inserts on those tables. When we tried to drop tables on a schedule, CDB never could catch up and would generally just crash (3 nodes). I really don't like the magic you have to do with CDB for things that your commodity DBs can be expected to do.

CockroachDB dev here. We've gotten a good bit better in the last few versions in terms of schema change stability. We're still not very good at large schemas with more than 10s of thousands of tables, but we've got projects under way to fix that, which I expect will land in the release in the spring of '22.

I'd like to hear more about the magic.

To be fair, while I kind of agree the system should be able to handle it regardless, 10,000s of tables sounds outside the realm of 99.99% of all use cases.

You might be surprised to learn how common the "store the results of the user's query into a temporary table" pattern is.

Temporary tables in cockroach exist, but the implementation was done largely to fulfill compatibility rather than for serious use.

The implementation effectively just creates real tables that get cleaned up; they have all the same durability and distributed state despite not being accessible outside of the current session.

Getting something done here turned out to be a big deal in order to get ORM and driver tests to run, which is extremely high value.

A better implementation would just store the data locally and not involve any of the distributed infrastructure. If we did that, then temp tables wouldn't run into the other schema scalability bottlenecks I'm raising above.

Thanks for all of that information in those 2 posts.

Can you tell me more about the use case where you'd create a new table every minute?

Not OP, but for analytics use cases it is common to create new tables for each minute, hour, day, or whatever granularity you collect data in. This makes it easier to aggregate later; you don't end up with extremely big tables; you can drop a subset of the data without affecting the performance of the table currently being written to; etc.
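As a sketch of that pattern - the `events_*` naming scheme and granularities here are made up for illustration, not anyone's actual schema:

```python
from datetime import datetime, timezone

def table_name_for(ts: datetime, granularity: str = "hour") -> str:
    """Derive the per-period table that should hold events with this timestamp.

    Hypothetical naming scheme: events_YYYYMMDD, events_YYYYMMDD_HH, etc.
    Old periods can then be dropped as whole tables without touching the
    table currently being written to.
    """
    fmt = {"day": "%Y%m%d", "hour": "%Y%m%d_%H", "minute": "%Y%m%d_%H%M"}[granularity]
    return "events_" + ts.strftime(fmt)

ts = datetime(2021, 10, 5, 14, 30, tzinfo=timezone.utc)
assert table_name_for(ts, "hour") == "events_20211005_14"
assert table_name_for(ts, "minute") == "events_20211005_1430"
```

The sibling comment's point stands, though: native range partitioning gives you the same drop-a-whole-chunk behavior without thousands of physical tables.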

That sounds like what table row partitioning is for, I thought all the major databases supported that?

I agree, CockroachDB is overall very good!

A few things I'm missing from it though:

- Advisory Locks (only exist as a shim/no-op today)


- CTEs (Common Table Expressions)

- Introspection and management features (for example pg_locks, pg_stat_activity, pg_backend_pid(), pg_terminate_backend())

They are making good progress though, and more recently they've spent effort on making adapters for popular frameworks, e.g. their ActiveRecord adapter.

I think this is clever, since the deciding factor for many companies will simply be "Can I easily replace this with PostgreSQL in my Rails/Django/etc. project?"

> It has broad PGSQL language compatibility

Depends how you define broad. :)

Many key features are missing, including but not limited to:

    - UDFs and sprocs.
    - Useful datatypes such as TSTZRANGE
    - More limited constraints
I was recently looking at Cockroach as a PGSQL replacement because of its distributed nature. But the equivalence featureset is still lagging badly, unfortunately.

May I suggest looking at YugabyteDB, which is a distributed SQL database built on actual PostgreSQL 11.2 engine? It's way ahead of CockroachDB in terms of feature support.

YugabyteDB has UDFs, stored procedures, distributed transactions, the range types are working from what I can tell, at least the example from here: https://wiki.postgresql.org/wiki/Extract_days_from_range_typ... works right out of the box, just copy paste. Postgres extensions are of course working. Orafce, postgis (with extra work needed for gin indexes, afair...), custom extensions.

YugabyteDB == Postgres. The query planner, analyzer, and executor are all Postgres. Mind you, some features are not readily available because handling them properly in a distributed manner takes effort. Those unsupported features are disabled on the grammar level before being worked on. But - unsupported features will not corrupt your data. Missing features are enabled very fast.

For example, I have recently contributed foreign data wrapper support: https://github.com/yugabyte/yugabyte-db/pull/9650 (enables postgres_fdw at a usable level) and working on table inheritance now.

Yugabyte is an amazing bit of technology and more people should know about it. By the way - it's Apache 2 with no strings attached.

For clarification: I'm not associated with Yugabyte in any way other than very happy user and contributor. Yugabyte organizes the Distributed SQL Summit later this month: https://distributedsql.org/. Might be a good opportunity to learn more about it. Second clarification: I'll be speaking at the event.

Are you still using postgres? Thinking about a similar transition, would really appreciate any shared learnings :)

> Are you still using postgres?

At the moment yes.

The final decision has not been made yet, but I think the reality is we're getting tired of evaluating all these distributed databases that claim Postgres compatibility, only to find it's the usual clickbait marketing speak.

So the most likely outcome is we're going to stick with Postgres and give the emerging distributed tech a couple more years to pull their socks up. We're not going to go round changing all our Postgres code and schemas just to shoehorn into the limitations of a random definition of "postgres support".

I don't know if anybody from CockroachDB is reading this but their article[1] on serializable transactions is somewhat questionable, as it compares CockroachDB's Serializable to Postgres's Read Committed, which seems to imply CockroachDB is better. Of course, Postgres has serializable as well. The only novelty in the article is that Cockroach DB runs Serializable by default, so I am not sure what that comparison was doing.

[1] https://www.cockroachlabs.com/docs/stable/demo-serializable....

I don't get that implication from the article at all. What I do get is that the only transaction isolation level available under CockroachDB is serializable. And to show why serializable is valuable, they have to demonstrate using Postgres's default of read committed (because they would be unable to demonstrate this using CockroachDB, which doesn't allow any other transaction isolation levels).

Ah, fair reading, thanks. As Postgres supports that level of isolation as well, I was thrown off a bit. but your phrasing makes sense.

I don't think CockroachDB supports anything other than the serializable isolation level, so of course they are going to push this type of argument. It's an interesting philosophy, really. It would be one thing if the performance penalty of running all transactions at serializable isolation vs. weaker isolation levels were reasonably low, but it's not (see [1] - yes, I know the entire difference in performance here is not due to CDB using serializable transactions for everything, but I'm guessing a very large part of it is). It's kind of like saying the only cars in the world should be Ford Pintos because we don't trust folks to safely drive faster cars.

[1] https://www.scylladb.com/2021/01/21/cockroachdb-vs-scylla-be...

> Its kind of like saying the only cars in the world should be ford pintos because we don't trust folks to safely drive faster cars.

Heh... The Pinto, for safety? https://en.wikipedia.org/wiki/Ford_Pinto#Fuel_system_fires,_...

It always rubs me the wrong way - all those Paxos/Raft approaches (which are great, but...) simply elect a leader to order writes. In that sense there is no distribution of computation at all. It's still a single target that has to crunch through updates. Replication is just for reads. Are we going to have something better anytime soon? Like real distribution, where when you add more servers, writes distribute as well?

> Like real distribution, when you add more servers writes distribute as well?

There's a really interesting suggestion in The Mythical Man Month. He suggests that instead of hiring more programmers to work in parallel, maybe we should scale teams by keeping one person writing all the code but have a whole team supporting them.

I don't know how well that works for programming, but with databases I think it's a great idea. CPUs are obnoxiously fast; it's IO and memory that are slow. I could imagine making a single CPU core's job to be simply ordering all incoming writes relative to each other using strict serialization counters (in registers / L1 cache). If the stream of writes coming in (and going out) happened over DPDK or something, you could probably get performance on the order of 10m-100m counter updates per second.

Then a whole cluster of computers sit around that core. On one side you have computers feeding it with writes. (And handling the retry logic if the write was speculatively misordered.) And on the other side you have computers taking the firehose of writes, doing fan-out (kafka style), updating indexes, saving everything durably to disk, and so on.

If that would work, you would get full serializable database ordering with crazy fast speeds.

You would hit a hard limit based on the speed of the fastest CPU you can buy, but I can't think of much software on the planet which needs to handle writes at a rate faster than 100m per second. And doesn't have some natural sharding keys anyway. Facebook's analytics engine and CERN are the only two which come to mind.
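The core of the idea above can be sketched in a few lines - illustrative only, since a real version would live behind DPDK-style I/O on a dedicated core, not in Python. The hot path touches nothing but a counter; durability, fan-out, and indexing are someone else's job:

```python
import itertools
from queue import Queue

class Sequencer:
    """Toy version of the 'one fast core only orders writes' idea.

    The single owning thread does nothing but assign monotonically
    increasing sequence numbers. The surrounding cluster of machines
    would feed writes in and drain the ordered stream out; here the
    outbox is just a plain queue for illustration.
    """
    def __init__(self):
        self._next_seq = itertools.count(1)  # the only state the hot core touches
        self.outbox = Queue()                # stand-in for the kafka-style fan-out

    def submit(self, write):
        seq = next(self._next_seq)           # total order assigned here
        self.outbox.put((seq, write))        # downstream machines index/persist
        return seq

s = Sequencer()
assert s.submit({"key": "a"}) == 1
assert s.submit({"key": "b"}) == 2
```

The hard-limit point follows directly: the whole system's write throughput is whatever this one loop can do per second.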

> I could imagine making a single CPU core's job to be simply ordering all incoming writes relative to each other

Calvin is an interesting alternate design that puts "reach global consensus on transaction order" as its first priority, and derives pretty much everything else from that. Don't even need to bottleneck through a single CPU.


Like VoltDB, the one huge trade-off is that there's no `BEGIN ... COMMIT` interaction where the client gets to do arbitrary things in the middle. Which would be fine, if programming business logic in the database was saner than with Postgres.

What you suggest would work if you just want to order writes. However, it won't support an ACID transaction model, which a lot of applications expect, or even simpler models where you can read values before you write.

For instance: consider an application that looks at the current value of a counter and adds 1 if the counter is less than 100. You can execute this on the primary (resource bottleneck because you need to see current state and only the primary has the full picture) or do the operation on a local node and try to submit it to the primary for ordering (coordination required over the network, e.g., to ensure data are current).
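The second option is essentially optimistic concurrency: read locally, then submit a conditional write and retry on conflict. A minimal sketch (not any real database's API; the store and retry loop are made up for illustration):

```python
class VersionedStore:
    """Toy key-value store supporting optimistic compare-and-swap."""
    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        return self._data.get(key, (0, None))

    def cas(self, key, expected_version, new_value):
        version, _ = self._data.get(key, (0, None))
        if version != expected_version:
            return False                     # someone else wrote first: reject
        self._data[key] = (version + 1, new_value)
        return True

def increment_if_below(store, key, limit=100):
    """Read on a 'local node', then submit a conditional write for ordering."""
    while True:
        version, value = store.read(key)
        current = value or 0
        if current >= limit:
            return False
        if store.cas(key, version, current + 1):
            return True
        # CAS rejected: state moved underneath us, loop and retry with fresh read

store = VersionedStore()
for _ in range(150):
    increment_if_below(store, "counter")
assert store.read("counter") == (100, 100)   # capped at the limit
```

Note this doesn't escape the coordination cost the comment describes: every CAS still has to round-trip to wherever the authoritative version lives.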

There are other approaches but they generally result in the same kind of conflicts and consequent slow-downs. Or they limit the operations you can handle, for example by only allowing conflict-free operations. That's what CRDTs do. [1]

[1] https://hal.inria.fr/hal-00932836/file/CRDTs_SSS-2011.pdf

> However, it won't support an ACID transaction model

I think you could add ACID support while the ordering process still doesn't care about the data. You do something like this:

- Split the keyset into N buckets. Each bucket has an incrementing version number. (The first change is 1, then 2, then 3, and so on). The whole system has a vector clock with a known size (eg [3, 5, 1, 12, etc] with one version per bucket.)

- Each change specifies a validity vector clock - eg "this operation is valid if bucket 3 has version 100 and bucket 20 has version 330". This "validity clock" is configured on a replica, which is actually looking at the data itself. The change also specifies which buckets are updated if the txn is accepted.

- The primary machine only compares bucket IDs. Its job is just to receive validity vector clocks and make the decision of whether the corresponding write is accepted or rejected. If accepted, the set of "write buckets" have their versions incremented.

- If the change is rejected, the secondary waits to get the more recent bucket changes (something caused it to be rejected) and either retries the txn (if the keys actually didn't conflict) or fails the transaction back to the end user application.

So in your example:

> consider an application that looks at the current value of a counter and adds 1 if the counter is less than 100

So the write says "read key X, write key X". Key X is in bucket 12. The replica looks at its known bucket version and says "RW bucket 12 if bucket 12 has version 100". This change is sent to the primary, which compares bucket 12's version. In our case it rejects the txn because another replica had a concurrent write to another key in bucket 12. The replica receives the rejection message, checks if the concurrent change conflicts (it doesn't), then retries saying "RW bucket 12 if bucket 12 has version 101". This time the primary accepts the change, bumps its local counter and announces the change to all replicas (via fan-out).

The primary is just doing compare-and-set on a small known array of integers which fit in L1 cache, so it would be obscenely fast. The trick would be designing the rest of the system to keep replica retries down. And managing to merge the firehose of changes - but because atomicity is guaranteed you could shard pretty easily. And there's lots of ways to improve that anyway - like coalescing txns together on replicas, and so on.
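The primary's job in this scheme really is tiny; here is a sketch of the proposal above (a hypothetical design from this thread, not any shipping database):

```python
class BucketPrimary:
    """Sketch of the proposed primary: it compares and bumps bucket
    versions, and never looks at the data itself."""
    def __init__(self, num_buckets):
        self.versions = [0] * num_buckets    # the whole vector clock

    def try_commit(self, validity, write_buckets):
        """validity: {bucket_id: required_version} from the replica.
        write_buckets: bucket ids this txn writes if accepted."""
        if any(self.versions[b] != v for b, v in validity.items()):
            return False                     # a read was stale: reject
        for b in write_buckets:
            self.versions[b] += 1            # accepted: bump written buckets
        return True

p = BucketPrimary(32)
assert p.try_commit({12: 0}, [12]) is True   # first write to bucket 12
assert p.try_commit({12: 0}, [12]) is False  # concurrent writer: clock is stale
assert p.try_commit({12: 1}, [12]) is True   # replica retries with fresh version
```

The worked example in the comment maps directly onto the three calls at the bottom: accept, reject on the conflicting concurrent write, then accept on retry.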

I'm really sorry didn't see your comment earlier. I like your solution but am a little stuck on the limitations.

1.) It requires applications to specify the entire transaction in advance. ACID transactions allow you to begin a transaction, poke around, change something, and commit. You can derive the transaction by running it on a replica, then submitting to the primary, in which case this becomes an implementation of optimistic locking including transaction retries.

2.) As you pointed out, the trick is to keep the array small. However, this only works if you have few conflicts. Jim Gray, Pat Helland, and friends pointed this out in 1996. [1] That in turn seems to imply a very large number of buckets for any non-trivial system, which seems to contradict the conditions for high performance. In the limit you would have an ID for every key in the DBMS.

3.) Finally, what about failures? You'll still need a distributed log and leader election in case your fast machine dies. This implies coordination, once again slowing things down to the speed of establishing consensus on the network to commit log records.

Incidentally Galera uses a similar algorithm to what you propose. [2] There are definitely applications where this works.

[1] https://dsf.berkeley.edu/cs286/papers/dangers-sigmod1996.pdf

[2] https://galeracluster.com/library/documentation/certificatio...

>There's a really interesting suggestion in The Mythical Man Month. He suggests that instead of hiring more programmers to work in parallel, maybe we should scale teams by keeping one person writing all the code but have a whole team supporting them.

Sounds like mob programming!

There are two ways I see databases doing Paxos. The basic way, used by some databases like Percona, is basically a single Raft group across the whole database. It can help with high availability, but the database's scale is still constrained to the capability of a single writer.

What you really want is databases like CRDB/Yugabyte/TiDB, which are sharding+raft. Tables are sharded into 128MB chunks, and each chunk has their own raft. The database handles transactions, distributed queries, and auto-balancing transparently.

Definitely, YugabyteDB falls into the second category. Tables are split into so called tablets. This splitting can be controlled by choosing the hashing algorithm for the primary key (or transparent row key hash, if no primary key for a table exists). Each tablet is replicated and has its own raft group. Different tables can have different replication factors, the number of tablets to split the table into can be selected and modified.

YugabyteDB recently added support for table spaces on steroids where table spaces are allocated to physical nodes in the cluster. This enables geo-replication features where tables or rows can be placed within selected geo locations.

All the data shifting is done transparently by the database.

Doesn't CockroachDB work this way as well, with each "range" running a separate Raft? That is what I get from https://www.cockroachlabs.com/docs/stable/architecture/overv...

All those NoSQL or NewSQL databases have more than one leader per table: they have a distinct leader for each partition.

So if you have 6 server and 2 partition you could have 3 servers for partition number #1 and 3 different servers for partition number #2.

If you want extra performance you could make the servers simply store key->value mappings using quorum writes like Cassandra does, but to keep data consistency you still have two choices:

#1: Use optimistic concurrency (an app performing an update verifies that the data hasn't changed since the app last read it).

#2: Use some kind of lease: elect one machine to be the only one allowed to write to that partition for some time period.

Option #1 does not give faster transaction throughput, but could offer lower tail latency.

Option #2 brings you back to square one of having a leader, so you might as well just use Paxos/Raft.

You are asking for a miracle, or simply a breach of the laws of physics. The only way to horizontally scale write speed is to shard the data being written somehow. You can easily do that, but then your read queries have to go to multiple servers to assemble a single result. You can't scale both at the same time. There are some in-between solutions, like eventual read consistency, that rely on traffic easing at some point so that the conductor can actually synchronize servers. But if you have a steady stream of read/write requests, the only way you can really scale is to tell Amazon to sod off, buy yourself a big fat multicore / multi-CPU server with a nice SSD array (preferably of the Optane type), and watch your processing power suddenly shoot through the roof while the cost greatly decreases ;)

> There are some in between solutions like eventual read consistency that are relying on traffic at some point easing enough so that the conductor can actually synchronize servers.

You can use CRDT's to give a formally correct semantics to these "inconsistent" scenarios. And they might well be something that's best explored in a not-purely-relational model, since the way they work is pretty unique and hard to square with ordinary relational db's.

CRDT is just a fancy term for a strategy that still has to eventually merge data. In the case of CRDTs, the data is organized/designed in a way that makes merging easier. The keyword here is "merging", which by definition kills "infinite" scalability.

You can dance around all you want but you just can't beat laws of nature.

Sure you can always design something that works for your particular case but generic solution is not possible.

Can't you arrange merging into, e.g., a binary tree? In that setup you'd be collapsing merges into a single one at the root, and cumulative throughput at the leaf nodes could be exponentially higher.

data is also sharded in those databases

True, thanks for pointing that out, but this is a bit of cheating, isn't it? I.e. atomic transactions across shards flip over, right? It's basically the ergonomic equivalent of just using multiple databases?

Not 100% sure what you mean by flipping over (fail?), but at least in CRDB you can of course do cross-shard transactions. They won't be as fast as a transaction that can be completely handled by a single leader, but it works fine and is transparent to the client.

You move over into the world of distributed transactions, which can be really expensive.

Thankfully sharding works great for a large number of applications (or in other cases you can accept eventual consistency).

Cross-shard transactions will use some sort of two-phase commit protocol internally. They still work just fine (the extra complexity is transparent to the client), but with somewhat worse performance. Most of the difficulty/expertise involved in using such systems optimally is in designing your schema and access patterns such that you minimize the number of cross-shard transactions.
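A toy sketch of that two-phase commit flow - illustrative only; real implementations also persist votes, hold locks, and survive coordinator failure, none of which this does:

```python
class Shard:
    """Participant in a toy two-phase commit."""
    def __init__(self):
        self.data, self._staged = {}, {}

    def prepare(self, txn_id, writes):
        self._staged[txn_id] = writes        # stage writes; vote yes
        return True                          # (toy: participants never vote no)

    def commit(self, txn_id):
        self.data.update(self._staged.pop(txn_id))  # make staged writes visible

    def abort(self, txn_id):
        self._staged.pop(txn_id, None)       # discard staged writes

def two_phase_commit(txn_id, writes_by_shard):
    shards = list(writes_by_shard)
    # Phase 1: every shard must vote yes before anything becomes visible.
    if all(s.prepare(txn_id, w) for s, w in writes_by_shard.items()):
        for s in shards:
            s.commit(txn_id)                 # Phase 2: commit everywhere
        return True
    for s in shards:
        s.abort(txn_id)                      # any no-vote aborts everywhere
    return False

a, b = Shard(), Shard()
assert two_phase_commit("t1", {a: {"x": 1}, b: {"y": 2}}) is True
assert a.data == {"x": 1} and b.data == {"y": 2}
```

The extra round trips in phase 1 are exactly the "somewhat worse performance" the comment mentions, which is why schema design aims to keep transactions on a single shard.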

CAP theorem. Choose two, it's a fundamental limitation of scaling a database.

CAP theorem. You can't have guarantee for having all three all the time. But you can still have nice thing most of the time.

Even better, sometimes you can change which guarantee you need.

We can do better than "pick two".

Yes, but no. Because in reality what we really want is not true, pure AP or CP (or CA, which you can't have anyway).


(CAP is a very important result about a very strong assumption of consistency: linearizable events, but for that you can't lose any messages [if I remember it correctly], otherwise the system will become inconsistent)

Thanks for the video, very interesting.

I've always found CAP theorem to be better described as:

Given the need for Partitions, you must choose between prioritizing Consistency or Availability.

Google Spanner does all three.

> The purist answer is “no” because partitions can happen and in fact have happened at Google, and during some partitions, Spanner chooses C and forfeits A. It is technically a CP system.


Make sure you thoroughly test your multi-region deploys. Last time we tried the system was not stable when a region went down. Also beware of this anti pattern: https://www.cockroachlabs.com/docs/stable/topology-patterns....

Interesting. So you followed the advice of deploying to greater than two regions and your DB didn't survive when you lost a region or you only deployed to two and got bit by the anti-pattern?

Even with >2 regions, when one went down CDB performed poorly.

Sounds like Mnesia, which has existed for 20+ years.


Mnesia isn't SQL-compatible, can't scale horizontally (each node has a full copy of the DB + writes hit all nodes), needs external intervention to recover from netsplits, and expects a static cluster.

It's pretty much the opposite of what the parent described IMHO. It's meant more like a configuration store than a proper DB. That's how RabbitMQ uses it for instance, it stores metadata in Mnesia and messages in a separate store, but they're working on replacing Mnesia with their own more modern store based on Raft.

> Mnesia isn't SQL-compatible,

I think this is accurate. There are some references to a masters project to do SQL with mnesia, but afaik, it's not supported or used by anyone.

> can't scale horizontally (each node has a full copy of the DB + writes hit all nodes),

This isn't accurate, each table (or fragment, if you use mnesia_frag) can have a different node list; the schema table does need to be on all the nodes, but hopefully doesn't have a lot of updates (if it does need a lot of updates, probably figure something else out)

> needs external intervention to recover from netsplits,

This is true; but splits and joins are hookable, so you can do whatever you think is right. I think the OTP/Ericsson team uses mnesia in a pretty limited setting of redundant nodes in the same chassis, so a netsplit is pretty catastrophic and they don't handle it. Different teams and tables have different needs, but it would probably be nice to have good options to choose from. There's also no built-in concept of logging changes for a split node to replay later, which makes it harder to do the right thing, even if you only did quorum writes.

> and expects a static cluster.

It's certainly easier to use with a static cluster, especially since schema changes can be slow[1], but you can automate schema changes to match your cluster changes. Depending on how dynamic your cluster is, it might be appropriate to have a static config defined based on 'virtual' node names, so TableX is on db101, but db101 is a name that could be applied to different physical nodes depending on whatever (but hopefully only one node at a time or you'll have a lot of confusion).

[1] The schema changes themselves are generally not actually slow, especially if it's just saying which node gets a copy (ignoring the part where the node actually gets the copy, which could take a while for large enough tables or slow enough networks). It's that mnesia won't change the schema while any node is in the middle of a 'table dump', where up-to-date tables are written to disk so that on-disk table change logs can be cleared; those dumps can take a while to complete, and any pending schema operations need to wait.

That’s a huge oversimplification. Mnesia is distributed but that’s about it. There’s so much more to sql than selecting data. Transactions, user defined functions, stored procedures, custom types…

None of what you detail though is included in what the GP describes as a modern database.

I’m not trying to nitpick, but what GP describes aligns with what Mnesia provides.

The GP mentioned CockroachDB though. There’s a lot of implicit functionality behind that name. Even just behind mentioning Postgres.

The distributed part of modern sql is fairly recent.

When building YugabyteDB, we reuse the "upper half" of PostgreSQL just like Amazon Aurora PostgreSQL and hence support most of the functionality in PG (including advanced ones like triggers, stored procedures, full suite of indexes like partial/expression/function indexes, extensions, etc).

We think a lot about this exact question. Here are some of the things YugabyteDB can do as a "modern database" that a PostgreSQL/MySQL cannot (or will struggle to):

* High availability with resilience / zero data loss on failures and upgrades. This is because of the inherent architecture, whereas with traditional leader-follower replication you could lose data and with solutions like Patroni, you can lose availability / optimal utilization of the cluster resources.

* Scaling the database. This includes scaling transactions, connections and data sets *without* complicating the app (like having to read from replicas some times and from the primary other times depending on the query). Scaling connections is also important for lambdas/functions style apps in the cloud, as they could all try to connect to the DB in a short burst.

* Replicating data across regions. Use cases like geo-partitioning and multi-region sync replication to tolerate a region failure without compromising ACID properties. Some folks think this is far-fetched - it's not. Examples: the recent fire in an OVH datacenter and the Texas snowstorm both caused regional outages.

* Built-in async replication. Typically, async replication of data is "external" to DBs like PG and MySQL. In YugabyteDB, since replication is a first-class feature, it is supported out of the box.

* Follower reads / reads from nearest region with programmatic bounds. So read stale data for a particular query from the local region if the data is no more than x seconds old.

* We recently enhanced the JDBC driver to be cluster aware, eliminating the need to maintain an external load balancer because each node of the cluster is "aware" of the other nodes at all times - including node failures / add / remove / etc.

* Finally, we give users control over how data is distributed across nodes - for example, do you want to preserve ASC/DESC ordering of the PKs or use a HASH based distribution of data.
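The HASH vs. ASC/DESC trade-off in that last point can be illustrated with a toy sketch - the tablet count, boundaries, and md5-based function here are made up and are not YugabyteDB's actual implementation:

```python
import hashlib

NUM_TABLETS = 16

def hash_tablet(pk: str) -> int:
    """HASH distribution: spreads hot sequential keys evenly across
    tablets, at the cost of range-scan locality."""
    return int(hashlib.md5(pk.encode()).hexdigest(), 16) % NUM_TABLETS

def range_tablet(pk: str, boundaries) -> int:
    """ASC distribution: preserves key ordering (good for range scans),
    but sequential inserts all land on one tablet."""
    return sum(1 for b in boundaries if pk >= b)

boundaries = ["g", "n", "t"]                 # 4 range tablets: <g, g..n, n..t, >=t
assert range_tablet("apple", boundaries) == 0
assert range_tablet("mango", boundaries) == 1
assert range_tablet("zebra", boundaries) == 3
```

With range distribution, `apple` and `banana` share a tablet and can be scanned together; with hash distribution they usually won't, but neither tablet becomes a write hotspot.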

There are a few others, but this should give an idea.

(Disclosure: I am the cto/co-founder of Yugabyte)

This title kind of implies that PG or MySQL aren't modern, or modern enough, which I think is very wrong. Look what they bring in every update. I think they are quite modern!

Not sure if you knew this or not, but Phil Eaton (the OP) writes databases for fun[1]. This post isn't saying Postgres and MySQL aren't modern, but is more trying to get ideas for what to play around with next.

[1] https://notes.eatonphil.com/database-basics.html

Postgres is very modern and MySQL is also... em... also modern in a different sense. ;-)

I understood the question as:

> What can those newer database systems that are not beholden to RDBMS conventions do for us that Postgres and MySQL (and Oracle and SQL Server and DB2) can not?

And the answer is: Not all that much.

The title seems fair to me. Anything that's actually used has certain commitments it made earlier in its lifecycle from which it now can't deviate, even if later developments made the commitments problematic.

>Anything that's actually used has certain commitments it made earlier in its lifecycle from which it now can't deviate

Perhaps I'm misunderstanding your comment, but MySQL has definitely deprecated and removed features over the years. https://dev.mysql.com/doc/refman/8.0/en/mysql-nutshell.html

Sure. There are probably also things they wish they could deprecate, but realistically can't. User expectations they wish they could replace and can't. I think that's the charitable interpretation of things asked for.

I think it's unfair. "Modern" is in no way equivalent to what the thread is really asking, which is what could you do if you built a new database today with all of the knowledge but none of the existing commitments of a large mainstream database.

It’s fair because these databases made trade offs for older ideas presumably for good reasons. The question isn’t what could the have done differently up to now, but rather what new ideas might designers have included if they existed back then.

Modern software development practices and knowledge are orthogonal to feature set / compatibility guarantees. The title asks about the former and the text asks about the latter, and I don't agree it's correct to conflate the two

Although most of the responses have been for ideas that did exist back then, but have significant trade-offs, both then and now. They're mostly just different approaches to db, not something fundamentally "modern".

The question is fair if it applies to database systems that aren't actually used?

Right, just like how any new system starts off. It's asking about how one might greenfield differently with the OLTP knowledge we have today.

Postgres was more modern when it used QUEL. SQL was a big a step backwards technically (although beneficial from a marketing standpoint).

One thing PostgreSQL would likely not be able to adapt to, at least without significant effort, is dropping MVCC in favor of more traditional locking protocols.

While MVCC is fashionable nowadays, and more or less every platform offers it at least as an option, my experience, and also opinions I have heard from people using SQL Server and similar platforms professionally, is that for true OLTP at least, good ol’ locking-based protocols in practice outperform MVCC-based protocols (when transactions are well programmed).

The “inconvenient truth” [0] that maintaining multiple versions of records badly affects performance might in the future make MVCC less appealing. There’s ongoing research, such as [0], to improve things, but it’s not clear to me at this point that MVCC is a winning idea.

[0] https://dl.acm.org/doi/10.1145/3448016.3452783

I sort of want the opposite. Except for extremely high velocity mutable data, why do we ever drop an old version of any record? I want the whole database to look more like git commits - completely immutable, versionable, every change attributable to a specific commit, connection, client, user.

So much complexity and stress and work at the moment comes from the fear of data loss or corruption. Schema updates, migrations, backups, all the distributed computing stuff where every node has to assume every other node could have mutated the data .... And then there are countless applications full of "history" type tables to reinstate audit trails for the mutable data. It's kind of ridiculous when you think about it.

It all made sense when storage was super expensive but these days all the counter measures we have to implement to deal with mutable state are far more expensive than just using more disk space.
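As a sketch of what that could look like: an append-only table where every write is a new version and reads pick the latest (or any historical) one. This is a toy illustration in Python/SQLite with made-up table and column names, not a real engine design:

```python
import sqlite3

# Hypothetical append-only "accounts" table: rows are never updated in place.
# Each write appends a new version; the current state is the max version per key.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE accounts_log (
        id      INTEGER,
        balance INTEGER,
        version INTEGER,   -- monotonically increasing per id
        author  TEXT,      -- who made the change: an audit trail for free
        PRIMARY KEY (id, version)
    )
""")

def write(account_id, balance, author):
    # Append a new version instead of mutating the old row.
    (v,) = db.execute(
        "SELECT COALESCE(MAX(version), 0) + 1 FROM accounts_log WHERE id = ?",
        (account_id,)).fetchone()
    db.execute("INSERT INTO accounts_log VALUES (?, ?, ?, ?)",
               (account_id, balance, v, author))

def read(account_id, as_of_version=None):
    # "Time travel": read the latest version, or any historical one.
    row = db.execute(
        "SELECT balance FROM accounts_log WHERE id = ? "
        "AND (? IS NULL OR version <= ?) ORDER BY version DESC LIMIT 1",
        (account_id, as_of_version, as_of_version)).fetchone()
    return row[0] if row else None

write(1, 100, "alice")
write(1, 250, "bob")
print(read(1))                   # latest balance: 250
print(read(1, as_of_version=1))  # historical balance: 100
```

As the sibling comments note, a real system would also need a story for keeping old versions out of the hot heap pages and indexes.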

If the old versions of records stay where they are, they will start to dominate heap pages and lead to a kind of heap fragmentation. If the records are still indexed, then they will create an enormous index bloat. Both of these will make caches less effective and either require more RAM or IOPS, both of which are scarce in a relational db.

You probably need a drastically different strategy, like moving old records to separate cold storage instead (assuming you might occasionally want to query it; otherwise you can just retain your WAL files forever).

Absolutely ... it needs to be designed from the ground up - why I think it fits the question of something a modern database could do that Postgres would struggle with.

> moving old records to separate cold storage

FWIW this is available in ClickHouse (which is an analytics database, though)


You should check out dolt, does exactly what you're describing, and is a drop-in MySQL replacement:


The downside of "good ol' locking" is that you can end up with more fights and possibly deadlocks over who gets access and who has to wait.

With Postgres/Oracle MVCC model, readers don't block writers and writers don't block readers.

It's true that an awareness of the data concurrency model, whatever it is, is essential for developers to be able to write transactions that work as they intended.

It’s not that conflicts magically disappear if you use MVCC. In some cases, PostgreSQL has to roll back transactions where a 2PL-based system would have scheduled the same transactions just fine. Often those failures are reported as “serialization errors”, but the practical result is the same as if a deadlock had occurred. And Postgres deadlocks as well.
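When such a serialization failure does occur (Postgres reports SQLSTATE 40001), the usual remedy is an application-level retry loop. A minimal sketch, using a stand-in exception class rather than a real driver error:

```python
class SerializationFailure(Exception):
    """Stand-in for a driver error carrying SQLSTATE 40001."""

def run_with_retries(txn, max_attempts=5):
    # Under SERIALIZABLE isolation the database may abort a transaction
    # that a 2PL scheduler would simply have made wait; the client is
    # expected to retry the whole transaction from the top.
    for attempt in range(1, max_attempts + 1):
        try:
            return txn()
        except SerializationFailure:
            if attempt == max_attempts:
                raise
            # (a real client would back off with jitter here)

# Toy transaction that conflicts twice before committing.
attempts = {"n": 0}
def flaky_txn():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise SerializationFailure
    return "committed"

print(run_with_retries(flaky_txn))  # committed
```

Real drivers surface the error differently, but the retry structure is the same either way.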

This is already being developed. zheap [1] is based on the undo/redo log system that databases such as Oracle use, and will be one of the options once Postgres supports pluggable storage backends.

[1] https://www.cybertec-postgresql.com/en/postgresql-zheap-curr...

I don't know. Lock-based concurrency control has problems scaling up concurrent access among readers and writers for read consistency. Of course, Oracle with MVCC still beats everybody performance-wise.

Does "more traditional locking" mean that the clients request locks? Isn't it already offered by PG with table-level and row-level locks, and also in an extremely flexible way (letting the client developer define the semantics) by pg_advisory_lock?


Having time travel capabilities (eg. cockroachdb) is a really useful side effect of MVCC though. Postgres once had this capability. Need for garbage collection/vacuuming is a downside.

I think it all depends on the pattern of reads, writes, and the types of writes. MySQL's InnoDB is often faster than Postgres but under some usage patterns suffers from significant lock contention. (I have found it gets worse as you add more indexes.)

Personally I wish Postgres would add support for optimistic concurrency control (for both row-level and "predicate" locks), which can be a big win for workloads with high throughput and low contention.

> good ol’ locking-based protocols in practice outperform MVCC-based protocols

Doesn't make sense to me. Oracle supports MVCC. SQL Server doesn't scale as well as Oracle.

There is absolutely no reason at all that using a Model View Controller architecture should say anything about your persistence layer. Model View Controller is an abstraction for managing code that generates things that are put on a screen. It says that you should have code that represents underlying data separate from code that represents different ways to show the data to the user and also separate from code that represents different ways for the user to specify how they would like to view or modify the data.

The Model portion of MVC should entirely encapsulate whether you are using a relational database or a fucking abacus to store your state. Obviously serializing and deserializing to an abacus will negatively impact user experience, but theoretically it may be a more reliable data store than something named after a Czech writer famous for his portrayal of Hofbureaucralypse Now.

What's being discussed is MVCC, or multiversion concurrency control


Materialize (https://materialize.com/) is capable of performing SQL operations over a stream, using incremental calculation based on differential dataflow.

CockroachDB is a distributed SQL database, with strong consistency. I think that Yugabyte is similar.

Support for realtime changes, including queries over those changes, is better in other databases, like RethinkDB, OVSDB or Firebase.

Relational is not always the best way to store your data. Sometimes a graph model is better. Or for full text search something specialized is better. In that sense dgraph or elastic search may be better OLTP databases in some cases.

Columnar storage, as used in Vertica or BigQuery, has several advantages in processing and data compression. But implementing it in PG or MySQL would, I think, require almost a full rewrite.

I really want to do streaming queries over updates and inserts in a normal SQL database. Imagine writing

    SUBSCRIBE SELECT customers.id, customers.name 
    FROM EVENTS(customer_purchases) AS cpe
    LEFT OUTER JOIN customers ON cpe.row.customer_id = customers.id
    WHERE cpe.row.product_id = '123abc' AND cpe.type IN ('update', 'insert') 
in postgres itself and just getting every incoming purchase for a particular product.
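Lacking a native SUBSCRIBE, the closest common approximation is an append-only events table plus a polling cursor, which is essentially the polling approach mentioned elsewhere in this thread. A toy sketch in Python/SQLite (table and column names invented for illustration):

```python
import sqlite3

# Approximating SUBSCRIBE with an append-only events table plus polling:
# writers insert into customer_purchase_events; each subscriber remembers
# the last event_id it saw and asks only for newer rows.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customer_purchase_events (
        event_id    INTEGER PRIMARY KEY AUTOINCREMENT,
        event_type  TEXT,
        customer_id INTEGER,
        product_id  TEXT
    );
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
""")

def poll_subscription(last_seen, product_id):
    # One polling step of the "subscription": new events joined to customers.
    rows = db.execute("""
        SELECT e.event_id, c.id, c.name
        FROM customer_purchase_events AS e
        LEFT OUTER JOIN customers AS c ON e.customer_id = c.id
        WHERE e.product_id = ?
          AND e.event_type IN ('update', 'insert')
          AND e.event_id > ?
        ORDER BY e.event_id
    """, (product_id, last_seen)).fetchall()
    new_last = rows[-1][0] if rows else last_seen
    return rows, new_last

db.execute("INSERT INTO customer_purchase_events(event_type, customer_id, product_id) "
           "VALUES ('insert', 1, '123abc')")
events, cursor = poll_subscription(0, '123abc')
print(events)  # [(1, 1, 'Ada')]
```

A database-native SUBSCRIBE could push these rows instead of making every client poll.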

Someone with more in depth knowledge can tell me why this isn't a good idea, but I'm pretty sure you can get something very close to this with graphql subscription query against hasura backed with postgres

https://ksqldb.io/ is great for this, assuming you're already using Kafka.

I wonder if Materalize gets you close?

Materialize excites me, but probably can't be used for the current problems I'm working on, because of the operational overhead we'd be putting on our clients. Streaming SQL queries inside an existing well-understood database would be the killer feature.

1. Automatic backups to an S3 compatible storage, out of the box.

2. Progressive automatic scalability. As load increases or storage runs out, the DB should be able to scale automatically. NewSQL databases do this already.

3. Tiered storage.

4. Support streaming, stream processing, in memory data structures, etc. I feel like this is one of those weird things but I keep wishing this were possible when I work on side projects or startups. I don't want to have to spin up mysql/postgres, kafka/pulsar, flink/whatever/stream processing, redis, etc separately because that would be prohibitively expensive when you're just getting off the ground unless you have VC money. So I find myself wishing I could deploy something that would do all of those things, that could also scale somewhat until the need to break everything into their own infrastructure, and if it was wire compatible with popular projects then that would be perfect. Will it happen? I doubt it, but it would be lovely assuming it worked well.

> 3. Tiered storage.

Do you mean tablespaces? https://www.postgresql.org/docs/10/manage-ag-tablespaces.htm...

They'll allow you to have some parts on database on faster devices for instance.

That does seem neat, but I was actually thinking more like off-loading cold data to S3, or some such. Maybe even having distinctions between (ultra) hot, warm, cold data.

Very cool - I had no idea this existed. I wonder if this could be useful for using ramdisks when working with temp tables.

Very possible.

I learned about the possibility a good decade ago when working on Oracle. For high performance use we used to have two areas:

- one small SSD-backed area where new inserts happened (Server class SSD was new and relatively expensive back then.)

- one (or more) areas for permanent storage, based on ordinary disks with good storage/price

Then we had processes that processed data from the SSD Pool and wrote them to long term storage asynchronously.

Edit: I should say I was very happy to find the same feature on Postgres, Oracle is a real hassle in more than one way.

yes, ramdisks are great. =) I've used that approach to replace redis with a faster version on postgres. =)

> 4. Support streaming

I feel the same. Streaming is so lacking in most dbs today. RethinkDB's `.changes()` was really cool. I wonder if PSQL will eventually do it, or whether a new DB will take over. Everyone went to Mongo then ran back to PSQL, but maybe after lessons-learned there is room for a new db optimized for in-memory usage and with great first-class streaming support.

MongoDB has supported a streaming API since version 3.6


But not relational...

Great list. For #4, you should take a look at Erlang or Elixir - they have "just enough" concurrency with streaming primitives and a functional style that makes it easy to use. They make a lot of "bolt on" stuff you need for a regular startup (Redis, Celery, etc) superfluous.

Could you briefly elaborate on this? Are you suggesting the right structures within Elixir/Erlang are both concurrent and safe enough to negate the need for these things at some level? (And are you referring to things like OTP, or more general than that?)

EDIT: For context, I'm familiar with the concurrency model and with how fairly bulletproof "processes" are in their context, but had never considered putting these to use in lieu of a Redis cache or certain other datastore use-cases. (My brief foray into Elixir was, however, when looking to improve reliability of a high-volume messaging system and various task queues attached to it, so that use-case I am at least aware of)

I wouldn't say Elixir is highly concurrent in the sense that you'll get something like a CRDT out of the box. You'll still have to design all of your distribution logic.

I'm saying that the tools provided in that toolbox - like processes, OTP and ETS - let you build in-memory or disk backed structures that live inside the concern of the main application and are more responsive to your specific needs.

For instance, let's say you need a scoreboard that persists in between page refreshes.

With Node.js, you're left in a "hmm - not sure" space where you need to cobble together all the different pieces. At the end of the day you'll be facing cache consistency issues, problems saving state, abstraction leaks and communication overhead.

In Elixir, it's trivial to use pubsub to collect events, a GenServer to store that state or act as a broker, and ETS to perform simple queries against it. It's like you're still given just a box of tools, but all of the tools work together better for dynamic, live, complex applications.

Your mileage may vary! Let me know if I can help.

I am actually super familiar with Erlang and Elixir and I agree with your assessment, but with the caveat that if you go that route you end up with your own hand-rolled solution that won't be wire/API compatible with the most used OS projects. Or in the case of Erlang/Elixir you'll be running it in process! But it is great for that reason.

i mean it took me just a couple hours to write script to backup postgres db to s3. not a lot of work

And it's work that gets replicated again and again

VoltDB is a good example of rethinking relational databases for modern technology. Traditional databases assume that everything is stored on disk, with a bit of RAM available for caching. VoltDB assumes that everything is stored in RAM first, using checkpointing to disk and replication to get durability.


Have a look at Michael Stonebraker, he's a database genius, and keeps starting "NewSQL" companies which use the relational model, but specialized for different applications, e.g. column stores.


Neat. Storing everything in ram and checkpointing to disk is how a lot of game servers work, too. :)

Volt is indeed very cool. Michael Stonebraker wrote part of the original Ingres code, didn’t he?

I realise I'm straying a bit from core OLTP stuff but also I think removing the historical need for a separate OLAP database is something modern systems should address. Off the top of my head:

1) Incremental materialized view maintenance, à la Materialize (bonus points for supporting even gnarly bits of SQL like window functions).

2) Really ergonomic and scalable pub/sub of some sort, à la RethinkDB.

3) Fine tuned control over query plans if I want it.

4) Probably very deep Apache Arrow integration.

> 3) Fine tuned control over query plans if I want it.
I feel like when most people say this, what they really want is a better query planner.

The optimum query plan depends on a lot of dynamically changing factors: system load, free RAM, and of course the data in the tables themselves. Any hints we give the query planner are going to help at certain times and be somewhere between "suboptimal" and "disastrous" at most other times.

It's certainly true that the query planners in major RDBMS could be better or, at least, give insight into why they made the choices they did.

It would be cool if EXPLAIN ANALYZE also perhaps showed other query plans the planner considered but discarded, and why. Imagine if the planner tried multiple plans and adjusted itself accordingly.

For Postgres in particular, I think the user-specified costs in pg.conf like `seq_page_cost` and `random_page_cost` feel like one obvious area for improvement: why am I guessing at these values? Postgres should be determining and adjusting these costs on the fly. But, I could be wrong.

> The optimum query plan depends on a lot of dynamically changing factors

Except … I’m part of a large organization with a very thorough incident post-mortem process and I’ve read a lot of analyses that end up blamed on the dreaded “query plan flip.” This ends up being an unplanned change that has an unexpected perf cost (and no real “rollback”) and causes an outage. The lesson I’ve seen learned over and over is to have very good understanding (and constant re-evaluation) of your hot queries’ perf , and to lock the query plan so it can only change when you intend it to.

Yeah, we've had to add a weekly forced statistics recalculation on certain key tables just to prevent the large customers from going down when the planner suddenly stops wanting to use an index.

Hasn't happened often, but the few occasions have of course been at the worst possible time.

> It would be cool if EXPLAIN ANALYZE also perhaps showed other query plans the planner considered but discarded, and why. Imagine if the planner tried multiple plans and adjusted itself accordingly.

The interesting point is that in this regard MySQL is really better than PostgreSQL. MySQL can give you more of the execution plans it evaluated and tell you why it didn't use them (cost value is higher), and can even tell you why it didn't use a specific index. Combined with enforcing specific indexes, you can sometimes trick it into a more efficient query plan even if it believed that plan to be worse.

But to be honest, the PostgreSQL query planner is a lot more intelligent and does stupid things very seldom. And the ability to instruct PostgreSQL to collect more statistics on some attributes (https://www.postgresql.org/docs/13/sql-createstatistics.html) is a huge improvement for getting PostgreSQL to make more intelligent plans.

What I really miss in PostgreSQL is:

* Seeing the other query plans it discarded, as you described, to get a feeling for how to hint PostgreSQL in a direction when the query planner is doing stupid things

* MySQLs query profile showing the execution time of multiple subtasks of the query: https://dev.mysql.com/doc/refman/8.0/en/show-profile.html

* The ability to enforce specific query plans. The PostgreSQL devs have stated they are against this, but pg_hint_plan is really useful and I was able to drastically improve some very complex queries with it.

It's true a better query planner is what I want, i.e. one that always does what I want it to do. That not being the case, I will settle for being in control, footguns and all.

I agree the config is hard to wrangle though, you effectively find yourself doing grid search with a bunch of common workloads, it does feel like something the machine should be doing for me (the most common config variable we end up tweaking is "how much money we give Amazon").

Another interesting approach is HyPer[1]. HyPer uses many new techniques to combine OLTP and OLAP in one database. For example, to achieve good OLAP performance, a columnar storage layout is used, but the columns are chunked for locality to achieve good OLTP performance at the same time. OLTP queries are executed in memory, but cold data that is not used for OLTP is automatically compressed and moved to secondary storage for OLAP.

[1] https://hyper-db.de/

Not much one can do with Hyper as it sold the commercial license to Tableau and I can’t find any mention of an OSS version to play with anywhere on the site.

Subscriptions. Databases like Firebase will automatically push changes to query results down to clients. You can add this to Postgres with tools like Hasura, but it's poll based and not very efficient. It's a super-useful feature for keeping UIs in sync with database state.

Why can't you do that with triggers?

By the way, subscriptions seem like a good idea in theory; in practice polling is often the best solution. There aren't a lot of applications where you need the data in real time (I mean with a latency of less than a couple of seconds), and for those applications you should build something custom.

Subscriptions, to work, are based on WebSockets, which have their own problems and also require a constant connection to the server. The application then needs to handle the incoming data properly, another thing that is not entirely obvious.

Another thing is that most of the time you don't have a 1:1 mapping between the database schema and the API (and if you do, you shouldn't, since a change of the internal database representation would require a change of the API and break clients). In practice you have an application server in between. Well, it's that application server that can send subscriptions to clients. And it can do that without subscribing to updates on database tables (assuming that the tables are only changed by the application itself, which seems reasonable). The application, when it updates the data, can publish it to whatever subscription system it wants to notify clients.
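That application-level pattern - persist the write, then publish an API-shaped event yourself - can be sketched with a minimal in-process broker (all names here are invented for illustration):

```python
from collections import defaultdict

# Minimal in-process publish/subscribe broker: the application server writes
# to the datastore, then notifies subscribed clients itself, without the
# database knowing anything about subscriptions.
class Broker:
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, payload):
        for cb in self.subscribers[topic]:
            cb(payload)

broker = Broker()
database = {}  # stand-in for the real datastore

def update_order(order_id, status):
    # 1) Persist the change...
    database[order_id] = status
    # 2) ...then publish an API-level event, decoupled from the table schema.
    broker.publish("orders", {"id": order_id, "status": status})

received = []
broker.subscribe("orders", received.append)
update_order(42, "shipped")
print(received)  # [{'id': 42, 'status': 'shipped'}]
```

The event payload is shaped for the API, so the table schema can change without breaking subscribers.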

> and for that applications you should build something custom.

Given the spirit of the original post, I think avoiding "build something custom" is the point.

It should be a toggle in the database.

> you don't have a 1:1 mapping between the database schema and the API (and if you do, you shouldn't, since a change of the internal database representation will require a change of the API and will break clients)

I'd argue this is a huge problem. When DB and API get out of sync it creates so much complexity, especially when working in a relational model, and things become impossible to change, or make client caching really difficult.

> The application when updates the data can publish it to whatever subscription system it wants to notify clients.

This is so much complexity though, compared with the experience of Firebase. Good for keeping more devs employed though.

> This is so much complexity though, compared with the experience of Firebase. Good for keeping more devs employed though.

Google's solution is not less complex; rather, the complexity is hidden. That's fine when all goes well, but when things stop working as expected it means weeks spent debugging and trying to work around the issue. And we're not even talking about the possibility that Google changes the API, closes down the service entirely, or makes it more expensive - the so-called vendor lock-in.

On the other side, a custom solution takes more time to develop initially, but then it's entirely under your control: if something doesn't work you know how to fix it because you built it, you are not bound to a particular platform, and there's no possibility that Google decides to change the API in an incompatible way so that you have to do extra work just to make things work as they did before - you decide when and if to update the software.

We built a real-time server to solve this using Postgres Write Ahead Log: https://github.com/supabase/realtime. We also have some neat stuff coming with integrated Row Level Security.

Disclosure (Supabase cofounder)

Postgres has LISTEN and NOTIFY. People build DIY pub-sub with this.

Sure, but it's a very manual process. I want to be able to write an arbitrary SQL query (or subset of SQL query), and have the subscription aspect "just work".

Materialize.io does something like this, although in a separate system from Postgres, and I'm not sure it's possible to push live queries to a client.

Alas RethinkDB was supposed to do this but the project imploded. It’s been OSSed I think though.

I feel like Rethink would have done a lot better if it were started closer to today. Funnily enough, in 2009 it was originally an SSD-optimized storage engine for MySQL.

The dev experience and API were great. It was the "relational mongodb".

Though no one feels daring enough to part from PSQL as a backend these days. With all the cloud deployment options and optimizations in PSQL, it's hard to justify writing your own db engine, but for a great real-time experience, that's what needs to happen.

Yes, and as a happy early adopter with many projects running off it, I was very happy about this. However the pace of development since then has been absolutely glacial and seemingly without direction. It's truly a shame.

Have you tried Postgres NOTIFY? It's the native event subscription mechanism.


Can't you add something like this to MySQL using triggers or some similar system?

Hasura evaluated Postgres' listen/notify feature to power their subscriptions, but chose polling instead:


> Listen/Notify: Requires instrumenting all tables with triggers, events consumed by consumer (the web-server) might be dropped in case of the consumer restarting or a network disruption.

It was substantially non-trivial for them to implement subscriptions that were both robust and efficient using this approach.

Many applications need the extra step on top of listen/notify of relaying subscriptions to an untrusted client (e.g., a browser). I'd like to see more DBs bake that feature in, like RethinkDB did.

The Supabase (https://github.com/supabase/realtime) approach is really interesting. It listens to the logical replication stream. Makes a lot of sense to me. Unfortunately our postgres instances hosted on heroku don't expose this, so I've been unable to try it out.

Hasura discusses why they chose polling here (https://github.com/hasura/graphql-engine/blob/master/archite...).

> WAL: Reliable stream, but LR slots are expensive which makes horizontal scaling hard, and are often not available on managed database vendors. Heavy write loads can pollute the WAL and will need throttling at the application layer.

Would like to hear Supabase's response to this.

Hasura's approach is great and there are just tradeoffs between approaches.

Sure, slots (can be) expensive; GCP doesn't offer slots but all the others do as far as I know (and I see that as a GCP issue, not a Supabase issue); a lot of writes can definitely fill the WAL.

Stepping back, the main comment I can give is that we each approach problems with a slightly different philosophy. Hasura tackles problems using middleware and Supabase tackles problems using the database. Both are fine.

Our approach is important (to us) as we evolve the product. For example, we are adding Row Level Security to our Realtime instance. Because it all sits in the database it's essentially just a Postgres extension, minimizing chatter between a middleware server and the database. Also because it's at the "bottom of the stack", other vendors/integrations can make use of the functionality we provide.

(supabase cofounder)

I'm the de facto maintainer of the Meteor MySQL integration and the Node.js package mysql-live-select.

These implement pub/sub and reactive queries using the MySQL binary log as the event source:



At least speaking for Postgres, it _is_ modern in that it’s very actively developed with all kinds of innovations layered on top of the core system. It can be a timeseries db, a real-time db, horizontally sharded, a self-contained REST/graphql API server, a graph db, an interface to many APIs via foreign data wrappers, and much more. In itself it has many analytics functions, and most cloud OLAP dbs use its syntax over a columnar storage scheme. I’m afraid most would-be database projects would be cloned by a Postgres extension before they could make a dent in its share.

My opinion is Postgres is a jack of many trades, but master of none. I would be more afraid of a DB that tries to be everything. Every feature adds baggage.

It is a master of the plain old RDBMS, but able to accomplish more with specific extensions. Sure, it probably makes more sense to use Elastic than Postgres for full-text search, but there is a solid extension for it. In a lot of operational contexts it’s easier to host another Postgres instance with a specialized extension than to try to deploy an entirely new DB.

I agree with you when you say it's a master of RDBMS. But each extension brings its own baggage. Specialised databases are written for a reason. For the majority of people the extensions may fit, but they're not the best way once you want to optimise.

Postgresql can do everything I need, but not everything I want, and apologies to mysql fans, but I refuse to touch it.

And, I think over the years they have been such an awesome community that has helped me irregularly when I popped into IRC every year or two for a problem I was having. https://github.com/davidfetter ; https://github.com/RhodiumToad ; https://github.com/bmomjian - you are great peeps!

What I would like to see are more security usability enhancements over and above what you find in https://www.crunchydata.com/products/hardened-postgres/

Security Enhancements List:

Attack Resistance

Better protections for the DBMS itself in assume compromised scenarios, or malicious users

Customized implementations per compliance/security regime

Accelerators for various security configurations

Self-audit, self-secure

Functionality Enhancements List:

Continuous Sync Postgres to other serious DB (not a one-way sync, not a DB move)

One thing I find really interesting from Redis that I would love in a relational database is the concept of ‘blocking’ queries, which block until they get a result.

For example: https://redis.io/commands/BLPOP

If I could do something like:

    Select id, msg
    From job
    Where id > 1224
And have that query block until there actually was a job with an id > 1224, it would open up some interesting use cases.
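Absent native support, this can be emulated with a polling loop (Postgres users can do better by using LISTEN/NOTIFY to wake the waiter). A toy sketch in Python/SQLite of a blocking SELECT with a timeout:

```python
import sqlite3
import threading
import time

# Emulating a BLPOP-style blocking query by polling: block until a job row
# with id > 1224 exists, then return it. A native implementation could park
# the query instead of burning poll cycles.
db = sqlite3.connect(":memory:", check_same_thread=False)
db.execute("CREATE TABLE job (id INTEGER PRIMARY KEY, msg TEXT)")
lock = threading.Lock()  # serialize access to the shared connection

def blocking_select(min_id, poll_interval=0.01, timeout=5.0):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        with lock:
            row = db.execute(
                "SELECT id, msg FROM job WHERE id > ? ORDER BY id LIMIT 1",
                (min_id,)).fetchone()
        if row:
            return row
        time.sleep(poll_interval)
    raise TimeoutError("no matching job appeared")

def producer():
    time.sleep(0.05)  # some other client inserts a job a moment later
    with lock:
        db.execute("INSERT INTO job VALUES (1225, 'hello')")
        db.commit()

threading.Thread(target=producer).start()
print(blocking_select(1224))  # (1225, 'hello')
```

The appeal of a native version is exactly that it removes this poll/sleep machinery from every queue-like workload.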

I would love to be able to "nice" queries, a high "nice" attribute meaning, for the engine, "work on this whenever it does not slow down anything else", especially on the I/O side (à la Linux "ionice").

A "niced" query blocks (and its working set may even be "swapped out") as long as:

- the read buffer does not contain anything pertinent to it,

- there are "too many" pending I/O requests (<=> submitting an I/O request useful for this query would slow down other less-niced queries)

The "niceness" of a query may be dynamically modified by an authorized user.

Bonus:

- any role is associated with a minimal level of niceness (GRANT'able)

- the underlying logic interacts with the OS's corresponding logic in order to take into account all I/Os (this is especially useful if PG runs on a non-dedicated server)

Not exactly what you are asking for but worth mentioning anyway: https://wiki.postgresql.org/wiki/Priorities

It's always been insane to me that databases aren't better at migrations. There are so many blog posts out there on how to do zero downtime migrations, and I feel like a lot more of that could probably be abstracted away.

There are so many times I wish a blog post was an app that just had a nice GUI to guide me through the steps they suggest, instead of manually modifying config files and running bash scripts. SQL GUI tooling is lacking.

Write-expensive, read-cheap[1]: the exact opposite of the mentioned.

Once your developers have completed an iteration, your DB will see the same queries over and over again (if it doesn't, then it should be an OLAP aggregate). These databases optimize for writes and defer complexity to reads, which, considering that you could see millions more reads than writes, makes no sense whatsoever.

[1]: https://github.com/mit-pdos/noria
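The Noria idea in miniature: pay at write time to keep a precomputed result current, so the hot read path is a plain lookup rather than a scan. A vote-counting toy sketch (names invented):

```python
from collections import defaultdict

# Write-expensive, read-cheap: every write updates both the base data and
# an incrementally maintained "materialized view" of vote counts.
votes = []                       # base "table"
vote_counts = defaultdict(int)   # derived view, kept current on write

def insert_vote(story_id):
    # Write path does extra work: append to base data AND update the view.
    votes.append(story_id)
    vote_counts[story_id] += 1

def read_count(story_id):
    # Read path is a constant-time lookup, not a COUNT(*) over the base table.
    return vote_counts[story_id]

for s in [7, 7, 9, 7]:
    insert_vote(s)
print(read_count(7), read_count(9))  # 3 1
```

The hard part a real system must solve, and this sketch doesn't, is maintaining such views incrementally for arbitrary SQL, joins included.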

I think automatically maintaining materialized views isn’t well supported for a lot of reasons. But I also don’t think that it’s impossible for something like Postgres to support it more in the future.


- Blurring (safely!) the line between database and the app using it: transparently switch between bringing data to compute, or compute to data.

- Comprehensive auto tuning: automatic index creation, automatic schema tuning, dynamically switching between column/row-oriented, etc. User specifies SLOs, database does the rest.

- Deeply related to the previous two points: perfect horizontal scalability

- Configurable per-query ACID properties (e.g. delayed indexing)

- All of the above while maintaining complex SQL features like arbitrary joins, large transactions, triggers, etc.

Sure, some of these are in some form in some existing databases. But none offer all of them.

> Blurring (safely!) the line between database and the app using it

What do you mean by this?

> Comprehensive auto tuning...automatic schema tuning

You should be able to maintain a logical database schema, and then flip a toggle for things you want to have denormalized (and eventually have the db just do it automatically). Or maybe even take a blob of unstructured data and automatically normalize it.

> > Blurring (safely!) the line between database and the app using it

> What do you mean by this?

Consider this app pseudocode:

    resultset = db.query("SELECT ... WHERE ...")
    for row in resultset
      if some_complex_condition(row)
If some_complex_condition filters out a lot of rows then all the data movement has been wasted: it would be much more effective (if possible) if some_complex_condition was pushed down to the database. But if some_complex_condition is compute intensive then potentially pushing it down to the db may actually slow things down, as the db becomes the bottleneck... so the optimal solution, if it exists, 1) is likely to change over time because of changes in the workload and 2) is unlikely to be determinable before runtime.

At the same time, consider the case of the table being selected being almost read-only. In this case, and if it fits, it may make sense to move the data directly to the app, keep it in sync when it changes, and have the query processing (the `SELECT ... WHERE ...`) happen directly in the app.

This hints at what I meant by blurring the line: dynamically shifting where computation happens, and where the data lives, depending on the available resources, the workload, and the data.

This is partially doable today e.g. with Hazelcast, in that it allows app instances to be part of the "database" (or, as they call it, the "in-memory grid").
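The trade-off can be made concrete: below, the same logical query runs two ways - filtering in the application after moving every row, versus pushing the predicate into the database (SQLite even lets you register an application function as a UDF, a small taste of "moving compute to data"). Names and data are invented:

```python
import sqlite3

# The same logical query two ways: predicate evaluated in the application
# after moving all the rows, versus pushed down to the database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (id INTEGER, score REAL)")
db.executemany("INSERT INTO events VALUES (?, ?)",
               [(i, i * 0.1) for i in range(1000)])

def some_complex_condition(row):
    return row[1] > 99.05  # imagine something far more expensive

# 1) App-side filter: ships all 1000 rows, then discards most of them.
app_side = [r for r in db.execute("SELECT id, score FROM events")
            if some_complex_condition(r)]

# 2) Pushdown: the database evaluates the predicate via a registered UDF,
#    so only the 9 matching rows ever leave the engine.
db.create_function("complex_cond", 1, lambda s: s > 99.05)
pushed_down = db.execute(
    "SELECT id, score FROM events WHERE complex_cond(score)").fetchall()

print(len(app_side), len(pushed_down))  # 9 9
```

Which side wins depends on how expensive the predicate is relative to the data movement, which is exactly why a static choice is unsatisfying.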

It is interesting that we use declarative SQL to query from the DB, but then when we get it into the app, we are manually doing further filter, mapping and joining of data. And then denormalizing it into a cache in a separate data store. Losing some benefits of the declarative nature of our queries. And this also applies to the frontend as well as the backend.

My biggest problem with databases is always versioning, i.e. renaming a column will break old clients. If there were a way to have multiple schema versions, so you could upgrade the database first and clients later, it would be the best.

EDIT: yes, thanks for the comments. Views, API layers, and adding instead of subtracting all work, but I believe they're all workarounds for the underlying problem. Fixing versioning would make everyone's lives easier.

Consider using Postgres views! You can create a view into your data and use that for reading and writing. Then, when you need to change the underlying data model, you create a new view with the renamed column. Old clients will target the old view, new clients will target the new view. Then, when all old clients are removed, you can remove the old view safely.
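A minimal sketch of this view-per-version idea, using sqlite3 for illustration (the same approach works in Postgres); the table and view names are hypothetical:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users_t (id INTEGER, name TEXT)")
db.execute("INSERT INTO users_t VALUES (1, 'ada')")

# v1 clients see the original column name...
db.execute("CREATE VIEW users_v1 AS SELECT id, name FROM users_t")
# ...while v2 clients see the renamed column, backed by the same table.
db.execute("CREATE VIEW users_v2 AS SELECT id, name AS full_name FROM users_t")

assert db.execute("SELECT name FROM users_v1").fetchone() == ('ada',)
assert db.execute("SELECT full_name FROM users_v2").fetchone() == ('ada',)
```

One caveat: simple Postgres views are automatically updatable, so this covers writes too; SQLite views are read-only unless you add INSTEAD OF triggers.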

"Consider using Postgres views!"

The minute you create a view, modifying the underlying table usually gets blocked, in many cases even for changes that have zero relevance to the view. You have to drop the view(s) to get the ALTER TABLE to go through. Absolute nightmare. Oracle, for example, does a far better job of handling this type of evolution and dependency management.

> You have to drop the view(s) to get the ALTER TABLE to go through.

So? With Postgres you can use transactions in your DDL. So it's possible to seamlessly drop the view, alter the underlying table and recreate the view, all in one step once the transaction is committed.
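A sketch of that drop/alter/recreate dance inside one transaction; sqlite3 stands in for Postgres here, since both support transactional DDL (table and view names are made up):

```python
import sqlite3

# autocommit mode, so BEGIN/COMMIT are fully under our control
db = sqlite3.connect(":memory:", isolation_level=None)
db.execute("CREATE TABLE t (a INTEGER)")
db.execute("CREATE VIEW v AS SELECT a FROM t")

# One atomic step: concurrent readers never observe the view missing.
db.execute("BEGIN")
db.execute("DROP VIEW v")
db.execute("ALTER TABLE t ADD COLUMN b INTEGER")
db.execute("CREATE VIEW v AS SELECT a, b FROM t")
db.execute("COMMIT")

cols = [d[0] for d in db.execute("SELECT * FROM v").description]
assert cols == ['a', 'b']
```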

"it's possible to seamlessly drop the view, alter the underlying table and recreate it"

And if that were the only problem it wouldn't be too bad, except that other databases don't inflict this task, or at least limit the degree of the problem far better. Unfortunately the problem isn't limited to views. Materialized views, for instance, have to be dropped and recreated along with their indexes. That can involve a large I/O operation; one that risks failure.

Pretty soon an otherwise innocuous ALTER TABLE that conceptually has zero impact on dependencies becomes a major undertaking.

Does Postgres support a NO SCHEMA BINDING option like Redshift does for views? If so you get a view but it allows for the underlying table to change — this has advantages and disadvantages but it fixes the view/table coupling issue.

Yes! I would see this tied in with temporal support throughout the database, so you can query the schema and data as it was at 2021-09-03 11:53:28.

This avoids some big problems with temporal data but I can't see it ever being efficient to do.

This becomes too complex. You have to maintain a mapping of old columns to new columns, and when you start splitting tables it gets crazy.

Besides using an ORM that would hide the renaming or views, why not build an API layer ?

Because Postgres views enforce API versioning in ways that don’t break older clients. Pretty much any schema modification that would break an older client is forbidden by REPLACE VIEW. It is certainly possible to break older clients but you have to work at it.

Also… you don’t want to build an API layer because PostgREST builds one for you. One that is close to on par with GraphQL (arbitrary joins/filters).

The idea is not to validate your schema in a client at runtime. Just like you don’t care if a new key is present in a JSON blob, you should be able to add new features without breaking things.

1) Native Bi temporal capabilities without hurting (too much) performance.

2) Distributed DB, like Spanner and CRDB.

3) Real-time analysis of performance, adapting the DB config accordingly without requiring human intervention.

4) Multiple (joint?) storage types per table (Memory, Columnar or Rows, for example);

5) Native support for Copy On Write replicas (useful for ephemeral environments, Stage and development)

6) Serverless capabilities

Better consistency and transaction isolation models. The ANSI transaction isolation levels are ambiguous and confusing, and the different consistency models supported by NoSQL databases are often poorly-specified. A database with clearer abstractions around consistency and isolation would be a big win for applications where correctness and scalability are both high priorities.

I'm missing something as easy to deploy as SQLite, but without its performance problems (like the stop-the-world write lock). Basically, I'd love an embeddable Postgres with a single-file storage.

If you are using SQLite and care about performance, be sure to run in WAL (Write Ahead Log) Mode: `PRAGMA journal_mode=WAL;`[0]. WAL mode is significantly faster and does not have a "stop-the-world write lock" ("Reading and writing can proceed concurrently"). WAL mode is also more efficient with disk I/O, which makes it faster for single user applications as well.

[0]: https://sqlite.org/wal.html
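A minimal check of the pragma above, using Python's sqlite3. Note that WAL needs a real file; in-memory databases ignore it, so a temp file is used here:

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "app.db")
db = sqlite3.connect(path)

# The pragma returns the journal mode actually in effect.
mode = db.execute("PRAGMA journal_mode=WAL").fetchone()[0]
assert mode == "wal"  # readers no longer block the writer
```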

Check out DuckDB https://duckdb.org/

Support for vector indexes and vector similarity queries (dense, and sparse-dense).

I believe PostgreSQL and MySQL will support this within a few years.

In the meantime there’s Pinecone (https://www.pinecone.io) to put alongside your DB.

Also there is at least one plugin for Postgres that I know of, pgvector, but no idea how it performs.

Automatic indexes. Adding indexes is a guessing game and a bit of an abstraction leak. Imagine if a product engineer did not have to think about how the data is laid out on disk.

What? I wouldn’t want this. How would you even automate it? Creating the right indexes is the same as creating the right tables and columns in your data model: it depends on the business purpose and usage of the data.

> Creating the right indexes is the same as creating the right tables and columns in your data model: It depends on the business purpose and usage of the data.

Right, so the way this works is that the database collects instrumentation data and, over time, automatically applies strategies (indexing improvements being one of them) to improve its performance.


As another user wrote, some db systems can already provide suggestions for missing indexes based on stats. But automatically creating them? Should all new indexes just accumulate (that would fill the db pretty fast), or replace previous ones (what about users depending on those indexes)?

Business usage is observed over time instead of known up front though. In a way this is sort of like the query planner. Couldn't your usage be observed and used to determine appropriate indexes?

I’d be more interested in an automatic index “suggester” based on observation and slow-query analysis. There’s also the matter of new use cases, where you’d absolutely want to be able to create indexes manually.
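A toy version of such a suggester is already possible with SQLite's EXPLAIN QUERY PLAN: flag queries whose plan falls back to a full table scan. Real tools are far more sophisticated, but the sketch below (hypothetical schema) shows the basic signal:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER, email TEXT)")

def full_scans(query):
    # Each plan row's last column is a detail string like
    # "SCAN users" or "SEARCH users USING INDEX ...".
    plan = db.execute("EXPLAIN QUERY PLAN " + query).fetchall()
    return [row[-1] for row in plan if row[-1].startswith("SCAN")]

q = "SELECT * FROM users WHERE email = 'a@b.c'"
assert full_scans(q)             # no index yet: the plan is a full scan

db.execute("CREATE INDEX users_email ON users(email)")
assert not full_scans(q)         # now a SEARCH via the index
```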

It has been a million years since I've used it, but I think MS SQL Server has this.

Here's some documentation I just found (at https://docs.microsoft.com/en-us/sql/relational-databases/pe...):

> The Missing Indexes report shows potentially missing indexes that the Query Optimizer identified during query compilation. However, these recommendations should not be taken at face value. Microsoft recommends that indexes with a score greater than 100,000 should be evaluated for creation, as those have the highest anticipated improvement for user queries.

> Tip

> Always evaluate if a new index suggestion is comparable to an existing index in the same table, where the same practical results can be achieved simply by changing an existing index instead of creating a new index. For example, given a new suggested index on columns C1, C2 and C3, first evaluate if there is an existing index over columns C1 and C2. If so, then it may be preferable to simply add column C3 to the existing index (preserving the order of pre-existing columns) to avoid creating a new index.

Absolutely. I'm increasingly in favor of generated code you check into source control.

Just spitballing, it sounds plausible to infer the correct index from the set of common queries.

Edit: Which I guess is basically what the sibling commenters said.

I could see it as taking a typical day's worth of transactions, creating new test db from this log, and then letting a tool try out different indexes based on some common heuristics and re-run the queries and benchmark. Because often it requires experimentation to get the right indexes.

I would be interested in more sophisticated type definitions. That is, algebraic data types.

One can achieve this to a degree by storing the data as JSON, but it would be nice to be able to remove the chance of introducing errors when converting to/from JSON.
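Short of real sum types, a discriminator column plus CHECK constraints can emulate a tagged union without resorting to free-form JSON. A sketch in sqlite3 with a made-up `payment` schema:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE payment (
        id          INTEGER PRIMARY KEY,
        kind        TEXT NOT NULL CHECK (kind IN ('card', 'bank')),
        card_number TEXT,
        iban        TEXT,
        -- each variant must carry exactly its own fields
        CHECK ((kind = 'card') = (card_number IS NOT NULL)),
        CHECK ((kind = 'bank') = (iban IS NOT NULL))
    )
""")
db.execute("INSERT INTO payment VALUES (1, 'card', '4111', NULL)")
try:
    db.execute("INSERT INTO payment VALUES (2, 'card', NULL, 'DE89')")
except sqlite3.IntegrityError:
    pass  # rejected: a 'card' payment without a card_number
```

It catches malformed variants at write time, but it is clearly a workaround; the grandparent's point stands that first-class algebraic data types would remove this boilerplate.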


CouchDB:

- Very smart clustering/redundancy features

- Can be queried over HTTP

- Has a great API

- Counterpart project (PouchDB) can work on the client

- Queries can be cached (it's HTTP)

- Said to be 'schemaless' but can be structured

The project is always overlooked, but I still feel it's one of the better NoSQL datastores.

1. Streaming SQL

Real-time queries should be the default. Some kind of smart query invalidation (similar to incremental view maintenance) as a result of DML statements.

2. Can run in a web browser / mobile phone

Your client-side cache is usually a messy denormalization of your SQL database. For offline/local-first apps you essentially need to run your entire backend API in your browser. So you should be running an SQL database in the browser. For this to be seamless, you need your database to target JS at the design stage.

3. Syncing

To support offline/local/mobile use cases, your database needs to be able to sync changes efficiently.

4. Logical-first data schema

You should be able to give your db a logical schema instead of a physical one, and the database can take care of the necessary denormalizations for performance reasons automatically.

5. Lower-level api for query planner

Instead of having SQL as your only interface, there should be a way to interact with the query tree and the query plan directly. People are always creating new query languages hacked over SQL (like Google's Logica).

6. Graph-based queries

Support a query language like Cypher. Provide better solutions for tree-based data structures - the queries get too unwieldy today.
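The incremental view maintenance mentioned in point 1 can be approximated today with triggers that patch a summary table on each write instead of recomputing the aggregate. A toy sketch in sqlite3 (hypothetical schema):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (customer TEXT, total REAL);
    CREATE TABLE order_summary (customer TEXT PRIMARY KEY, spent REAL);
    -- Keep the "materialized view" in sync incrementally on every insert.
    CREATE TRIGGER orders_ins AFTER INSERT ON orders BEGIN
        INSERT OR IGNORE INTO order_summary VALUES (NEW.customer, 0);
        UPDATE order_summary SET spent = spent + NEW.total
         WHERE customer = NEW.customer;
    END;
""")
db.execute("INSERT INTO orders VALUES ('ada', 5.0)")
db.execute("INSERT INTO orders VALUES ('ada', 7.0)")
assert db.execute(
    "SELECT spent FROM order_summary WHERE customer = 'ada'"
).fetchone() == (12.0,)
```

Hand-written triggers only cover inserts here and get unwieldy fast, which is exactly why first-class streaming/IVM support would be valuable.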

A lot of the points you mention are really interesting to me, as I've been coming to similar conclusions recently. What are good choices that solve these particularly in the context of js/clientside apps?

absurd-sql[1] is the coolest thing I've seen in this space so far. sql.js brings SQLite to the browser via wasm, and then absurd-sql implements the storage layer in IndexedDB. So now we have an SQLite-compatible db in the browser to play with.

There is also alasql[3] which is implemented in js, and lovefield[2] from Google which seems like an experiment that is now abandoned.

First, you could implement a REST/GraphQL cache in SQL. This would require maintaining mappings of your API response fields to SQL tables.

Going further, you could implement your backend API on the client, and have requests to it go directly to your browser SQL db. The benefit of this is you write your API once, and you get full offline support. If you don't need "local-first" then it's just a matter of figuring out when a query becomes invalid on the server. You could show instant results from the local db, and then ask the server whether the data already in your cache can fulfill the request. This could be optimized on the server by listening to DML queries via the WAL or LISTEN/NOTIFY. WebSockets would be used to maintain real-time updates to only the data you are viewing.

You could also just use SQL directly as your UI API and then trigger these queries on the backend (respecting security of course).

What doesn't feel optimal, though, is subscribing to sql.js updates in the browser. This makes me feel like we need an SQL db written in JS.

Also, if our DB is running in the same execution environment as our frontend and has one consumer, we could store each row as a JS object, reference it directly in our views, and subscribe to updates on a row-by-row basis. So if you are rendering a large table and a column in a single row changes, then instead of re-rendering the entire table manually (or smartly detecting whether the query was invalidated), you just bind to updates on the row instance, which is what the db is actually storing. I think this would reduce huge amounts of code.

Local-first is a little more difficult. There is some good exploration here[4].

I think writing a db in JS on the backend is not such a bad idea either. The hot paths would be written as Rust addons, but the majority of stuff people want today is just implementing the right logic to automatically handle denormalizations and subscriptions which we already hack together in our application layer using JS et al.

    [1]: https://github.com/jlongster/absurd-sql
    [2]: https://github.com/google/lovefield
    [3]: https://github.com/agershun/alasql
    [4]: https://actualbudget.com/blog/

Thank you for the links, I've heard about some of these projects but will check out others. I've actually been working on a project that is running fully on front-end side and I'm leaning towards indexeddb+observability, but in my case there would be a lot of different data in one view and I wanted to see if I can materialize a view in worker and then push the deltas using messages. Also, while exploring it, I found out you could use service workers to create a virtual server [0] that would cache or produce views as needed, but that solution doesn't seem to be very popular.

[0] https://serviceworke.rs/virtual-server.html

Search - really the holy grail. Combine full-text search (with tf-idf or BM25 support, not the kind that Postgres/MySQL provide) with the regular relational engine.

This is a tricky engineering problem - having two kinds of indexes in the same database. But disk space is cheap; network is expensive (especially if you're on AWS).
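For what it's worth, SQLite's FTS5 extension already does BM25 ranking alongside regular tables. A small sketch with a made-up corpus (assuming your SQLite build includes FTS5, which is true of most Python distributions):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE docs USING fts5(body)")
db.executemany("INSERT INTO docs VALUES (?)", [
    ("databases store rows",),
    ("indexes speed up lookups",),
    ("replication copies data",),
    ("full text search ranks documents",),
    ("search engines make search fast",),
])

# bm25() returns lower scores for better matches, so ascending order
# puts the best match first: the doc mentioning 'search' twice.
hits = db.execute(
    "SELECT body FROM docs WHERE docs MATCH 'search' ORDER BY bm25(docs)"
).fetchall()
assert hits[0] == ("search engines make search fast",)
```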

Lots of talk of CockroachDB in this thread but no mention of vitess.io? Shame. While I don’t care for MySQL the folks at planetscaleDB (hosted vitess) are doing amazing work. So are the CRDB folks.

Because HN is predominantly a Postgres and anti-Oracle community, so the use of MySQL doesn't fit their ideology. Which is unfortunate, because I think Vitess and PlanetScale are quite nice.

re: Oracle

Oracle's super-heavy-handed approach to sales, and extracting every nickel from their customers, likely makes them a lot of enemies even though the Oracle enterprise DBs are technologically very capable.

This is pretty stupid, but I'd be happy to see a SQL dialect where a statement can start with a FROM clause. This is also pretty stupid, but there are also some cases where I wouldn't hate some sort of "JOIN AS CHILD" feature along with results that aren't 100% tabular.

You’re looking for a graph database.

What is modern? If we consider how databases run on modern hardware, the trend is toward plenty of RAM and fast disks (SSDs/in-memory storage) that don't have the disk-seek problem of the old days. Databases being developed today aim for those environments.

Automatic versioning of data and CDC (change data capture) broadcast of data deltas to external systems.

> Automatic versioning of data

More specifically, I would argue for the ability to run arbitrarily complex, non-locking, fully-consistent queries against all historic versions of the database, a.k.a. "the database as a value" (also "transaction time" or "system time" temporal queries)
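A crude approximation of such transaction-time history is possible today with a trigger that copies each update's old row into a history table. A sketch in sqlite3 (hypothetical schema); real SQL:2011 system-versioned tables automate all of this:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL);
    CREATE TABLE account_history (id INTEGER, balance REAL, changed_at TEXT);
    -- Preserve the pre-update state so past versions stay queryable.
    CREATE TRIGGER account_upd BEFORE UPDATE ON account BEGIN
        INSERT INTO account_history
        VALUES (OLD.id, OLD.balance, datetime('now'));
    END;
""")
db.execute("INSERT INTO account VALUES (1, 100.0)")
db.execute("UPDATE account SET balance = 75.0 WHERE id = 1")

assert db.execute("SELECT balance FROM account").fetchone() == (75.0,)
assert db.execute("SELECT balance FROM account_history").fetchone() == (100.0,)
```

This captures versions but gives you none of the non-locking, fully-consistent "AS OF" query machinery the parent is asking for; that's the part the database itself would need to provide.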

Isn't this part of the SQL standard already, to some extent?

Temporal queries have been part of the SQL standard for a while but are mostly not supported by the major databases. Some databases have partial support or extensions that add some temporal capabilities.

CDC is built in via the binlog; it’s the basis of most replication schemes but can be consumed by anything.

The problem with SQL databases is that the earliest were designed before the internet became widespread.

So they have synchronous clients that do not use HTTP or JSON. You need asynchronous clients that avoid thread context switches.

Also you always want to replicate all data in more than one location in real-time because 100% read uptime is really easy to provide, while 100% write uptime is a complex nightmare.

I made this database after using MySQL, Oracle and Postgres for 10 years (it's a 2000 line replacement for my needs that these databases filled, a very narrow subset of the features you'll find in legacy databases): http://root.rupy.se
