> I personally would argue with every single point this article makes, except scalability.
Maybe you could be more explicit about what you don't like about their ideas? I personally do like a lot of their ideas, such as the following:
> In the future, I’d expect to see a tighter coupling between the frameworks we’re using for reactive frontends – React, Vue, etc. – and the database, via hooks or otherwise.
This builds on the behavior that made MongoDB so phenomenally popular, as the article points out. Data management is pervasive in modern applications and anything that makes it easier for devs to implement is goodness.
Agreed. The recent splurge of VC money into dev tools startups is going to lead to a lot more articles like this one. I hope developers read it with an eye towards that bias.
I know posts with ThOuGhT LeaDeRshIp titles like this are usually annoying, but I thought it would be interesting to write down some of the lessons I've been gathering as I've spent more time covering and using specific databases. My background is in data science / analytics with a couple of years of more traditional full stack here and there.
Broadly we've seen this pattern with infrastructure in general – it's a lot easier to set up a server than it used to be, all things considered. Now obviously if you're a tiny startup, you're more comfortable outsourcing everything to Heroku, and if you're a hyperscale enterprise, you probably want more control on exactly what your database is doing.
The thesis here is that on the tail end (hyper scale), things are getting more under control and predictable, and developers there want the same "nice things" you get with platforms like Heroku. Elsewhere in the ecosystem, more and more parts of the stack are getting turned into "simple APIs" (Stripe for payments, Twilio for comms, etc.). And perhaps most interestingly, as serverless for compute seems kind of stuck (maybe?), it may be the case that serverless for databases – whatever that ends up meaning – is actually an easier paradigm for application developers to work with.
I think the article's thesis is a false and misleading dichotomy.
It's absolutely true that a low friction developer experience is necessary for a database product to be successful. But this in no way implies that database internals are being commoditized or relegated to minor importance.
Snowflake is a particularly bad example, as its clean-sheet, novel approach to internals is the very fulcrum that creates the easy developer experience.
Admittedly it's been a while since I looked at Vitess, but my recollection is that its cross-shard functionality is so limited as to make the claim that internals no longer matter a bit dubious.
The reason there's only a handful of spanner style systems is exactly because the internals both matter and are quite daunting to get right.
I agree. It is also amazing how different the database systems are that are competing against each other today:
Partitioning:
1) DynamoDB: Partitioning is explicit and one of the most important parts of schema design (see the sketch below)
2) Spanner, Cockroach: Database automatically partitions the key ranges.
3) Postgres: You will probably never reach the scale where you need to partition your dataset!
Transactions:
1) Spanner, Firestore: No stored procedures; client-side transactions are important
2) DynamoDB: No stored procedures, no client-side transactions, only transactions where all items involved are known by primary key in advance.
3) Fauna, Supabase: Stored procedures are the way to go! You do not need application code, access your database from the client!
4) Postgres: We have everything, use what fits your particular use-case!
If database internals did not matter, why are they all doing something different and are sometimes quite opinionated about it?
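To make the DynamoDB point concrete, here is a minimal sketch in Python with boto3 (the table and attribute names are made up, not from the article) showing how the partition key is declared as part of the schema itself, in contrast to Spanner/CockroachDB splitting key ranges for you:

    # Hypothetical "Orders" table: the partition (HASH) key is part of the schema;
    # choose it poorly (e.g. a low-cardinality attribute) and you get hot partitions.
    import boto3

    dynamodb = boto3.client("dynamodb", region_name="us-east-1")

    dynamodb.create_table(
        TableName="Orders",
        KeySchema=[
            {"AttributeName": "customer_id", "KeyType": "HASH"},   # partition key
            {"AttributeName": "order_id", "KeyType": "RANGE"},     # sort key
        ],
        AttributeDefinitions=[
            {"AttributeName": "customer_id", "AttributeType": "S"},
            {"AttributeName": "order_id", "AttributeType": "S"},
        ],
        BillingMode="PAY_PER_REQUEST",
    )

With Postgres, as point 3 says, you would just create the table and probably never think about partitioning at all.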
I think you bring up some good points, but it's factually incorrect to state that MongoDB is not ACID compliant (it has been since its 4.0 release). The way to win is not through hyperbole and lies; it's, to your point, by focusing on developer experience.
As a developer, I have to say that sqlite gives me the best experience.
Everything else pales in comparison.
Create a database?
sqlite3 mydata.db
Where is the database?
In the current directory
How is it structured on disk?
It's a single file
How do I backup the DB?
cp mydata.db /my/backups/mydata.db
Do I have to install a server?
No
Do I have to configure anything?
No
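The same zero-config experience carries over into application code. A minimal sketch using Python's built-in sqlite3 module (the file and table names are illustrative):

    import sqlite3

    conn = sqlite3.connect("mydata.db")   # creates the file if it doesn't exist
    conn.execute("CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, body TEXT)")
    conn.execute("INSERT INTO notes (body) VALUES (?)", ("hello",))
    conn.commit()

    for row in conn.execute("SELECT id, body FROM notes"):
        print(row)

    conn.close()

No server process, no users to create, and the backup really is just a file copy once no writer has the database open.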
During setup and deployment I usually dabble for a while with the whole GRANT *.* ON localhost IDENTIFIED BY PASSWORD or something. How do I do that with sqlite?
You don't even need to reach two nodes before SQLite becomes grossly inadequate. Even on a single node: SQLite's global-lock model leads to poor performance when multiple threads write to the same database.
You could be running on a single-node, 4-core, $5/month VPS and still hit this issue. SQLite requires exclusive access to the database to handle writes (meaning that while a write is in progress, no other connection can be reading it, at least in the default rollback-journal mode), and it bites especially hard once your transactions start to become complex.
In contrast, MySQL and PostgreSQL allow for simultaneous reads while writes are occurring.
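Here is a rough sketch of that contention using Python's sqlite3 module and the default rollback journal; timeout=0 just makes the failure visible instead of silently queueing, so with several writers racing some of them may print "database is locked":

    import sqlite3, threading

    init = sqlite3.connect("contention.db")
    init.execute("CREATE TABLE IF NOT EXISTS posts (id INTEGER PRIMARY KEY, body TEXT)")
    init.commit()
    init.close()

    def writer(n):
        # timeout=0: raise immediately instead of waiting for the lock to clear
        conn = sqlite3.connect("contention.db", timeout=0)
        try:
            with conn:  # the INSERT takes the single database-wide write lock
                conn.execute("INSERT INTO posts (body) VALUES (?)", (f"comment {n}",))
        except sqlite3.OperationalError as e:
            print(f"writer {n}: {e}")
        finally:
            conn.close()

    threads = [threading.Thread(target=writer, args=(i,)) for i in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()

Point the same four threads at a MySQL or PostgreSQL table and they simply proceed concurrently, blocking only on row-level locks.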
The question is if simultaneous operations really speed up your application.
It is not as if a 4-core machine can do four times the DB work just because you allow it to.
Memory access, disk access: they all have their own behaviour when you try to do things simultaneously. In the worst case, things will just get serialized on a lower level, even if multiple CPU cores send and/or request data simultaneously.
We don't have to talk about this in the abstract; you can benchmark these things or just look at the many benchmarks run by others. For example, some benchmarks around Postgres scaling improvements: https://wiki.postgresql.org/images/e/e8/FOSDEM2012-Multi-CPU....
If you don’t need anything Postgres offers, by all means stick with sqlite. But your pessimism about being stuck with operations “serialized on a lower level” is not empirically justified for most of the databases that support multi-threading.
> It is not as if a 4-core machine can do 4 times the DB work if you only allow it.
If Thread#1 goes "lock. write. unlock", then that means Thread#2, #3, and #4 all have to wait until Thread#1 is done with the write.
Even if Thread#2/#3/#4 have the data in their CPU-cache, they have to wait for Thread#1 to be complete.
--------
If Thread#1 is writing to a __hard drive__, that means Thread#2, #3, and #4 are waiting on a hard drive read, when they could have instead been reading from L3 cache.
SQLite's model scales extremely poorly in practice. You run into all sorts of problems like this.
I'm talking about relatively simple cases like spinning up a phpbb3 (web forum) instance. The minute you have multiple users writing comments at the same time from different apache/php instances is the minute your SQLite database starts to fall over.
Every write (aka: every comment post) is an exclusive write on SQLite. A full lock that prevents other processes from reading. (Ex: User#1 writes a comment, but User#2 hits refresh. These two items will be sequential if you use SQLite... when normally they'd be concurrent in most other databases)
------
SQLite is an exceptionally good database for many purposes. But it has its weaknesses. There's a reason why MySQL and PostgreSQL exist after all.
We've been using sqlite as our only datastore for the last 5+ years.
There are a lot of fun little tricks you learn along the way.
The most important one is to ignore all the naysayers who claim it won't scale. I can easily saturate any IO subsystem by way of SQLite.
At first you may think it's impossible, but then you start to go outside the lines a bit. You start by turning on WAL. That helps a lot... But then you want another 100x? One thing you can do is heavily abuse the fact that any SQLite db is "just a file" by opening as many as you need to in parallel. 9/10 times you can break down a typical business app into grains roughly approximating the current state of each user.
Might make regulatory compliance easier if each user is stored in their own database too. Makes removing someone's entire presence from your system 1 file deletion activity.
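A minimal sketch of that setup, assuming one SQLite file per user with WAL enabled (paths and schema are made up):

    import sqlite3
    from pathlib import Path

    DATA_DIR = Path("userdata")
    DATA_DIR.mkdir(exist_ok=True)

    def open_user_db(user_id):
        conn = sqlite3.connect(DATA_DIR / f"{user_id}.db")
        conn.execute("PRAGMA journal_mode=WAL")   # readers no longer block the writer
        conn.execute("CREATE TABLE IF NOT EXISTS events (ts REAL, payload TEXT)")
        return conn

    # Writes for different users land in different files, so they never contend:
    with open_user_db("alice") as a, open_user_db("bob") as b:
        a.execute("INSERT INTO events VALUES (1.0, 'login')")
        b.execute("INSERT INTO events VALUES (1.0, 'login')")

    # Removing a user's entire presence is one deletion (plus the -wal/-shm sidecars):
    #   (DATA_DIR / "alice.db").unlink()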
How do you aggregate the data and query among all those databases? Doesn't it incur huge costs? Does sqlite support this natively or are you basically treating it as text files of data per user and don't need complex queries in the first place?
We don't have any use cases for querying across these datastores right now. Hypothetically, if we did need something like this it would probably be an offline ETL analysis tool. I have written a few of these and they can be incredibly performant if everything lives on the same box.
Is there a wrapper that treats a database like it's an SQLite database (a file), to simplify things to the same level as SQLite? That would at least simplify some issues. It doesn't matter whether it's backed by a local database running in a container or by a remote one.
The biggest database unification effort tends to be focused around ODBC compatibility from what I've seen - and the operations laid out by ODBC are quite trivial and easy to comprehend I think.
I was thinking more of a CLI wrapper that enables simple file-like management of databases, with operations like copy and list, and without having to set up auth; you'd just provide the local path to the database. However, ODBC looks interesting too.
ODBC might enable the creation of such a CLI wrapper in a database-agnostic way. An authentication library that retrieves the correct credentials based on the local file path could make local use as seamless as with SQLite, getting the best of both worlds.
Yea I think that'd be an interesting approach - and ODBC takes care of all the annoying bits for you (providing you with a single unified API to interact with instead of tailoring the commands to each specific driver).
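For what it's worth, a very rough sketch of that wrapper idea in Python with pyodbc; pyodbc itself is real, but the sidecar file, its format, and the connection strings are entirely hypothetical:

    import json
    import pyodbc

    def connect_by_path(path):
        # Imagine a small sidecar file next to the "database path" that holds
        # the ODBC connection string (or a DSN name plus credentials).
        with open(path + ".odbc.json") as f:
            cfg = json.load(f)
        return pyodbc.connect(cfg["connection_string"])

    # Usage would then feel roughly file-like:
    #   conn = connect_by_path("./mydata")
    #   print(conn.execute("SELECT 1").fetchone())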
You can certainly host it on a network drive if the network filesystem has the right features and behaviour.
The same goes for a local filesystem. Sqlite has certain features it requires the filesystem to have. That is independent of how that filesystem stores the data physically.
SQLite actually works just fine on Windows file shares (and yes, with multiple clients, since Windows file shares do support file locks), but I wouldn't recommend it as a remote DB / multi-client solution.
This feels like kind of a silly comparison. SQLite can't do most of what makes modern databases difficult/valuable. As an example, if you replace your above example with a CSV file, you'd get the same output, right?
I remember that back in the day getting Postgres up and running was an enormous chore - SQLite was definitely the easiest, but even MySQL was a walk in the park compared to Postgres. But Postgres is simply amazing: it is entirely transparent about the metadata it stores to assist querying and has mature tools available to help you optimize queries. The only "big" one that I haven't used is Oracle, mostly because the price makes it insanely unreasonable for small companies. Out of MySQL, SQLite, MSSQL and Postgres I'd have to say my favourite is Postgres, for causing the fewest hair-pulling frustrations and having absolutely pristine documentation - followed closely by MSSQL, though MSSQL sometimes seems to go out of its way to have an incompatible dialect.
> NoSQL databases are maturing, for sure – we’re starting to see support for transactions (on some timeframe of consistency) and generally more stability. After years of working with “flexible” databases though, it has become clearer that rigidity up front (defining a schema) can end up meaning flexibility later on.
So funny to me that NoSQL boosters have only recently understood that designing sane schemas and knowing what order your data is inserted is important for data integrity. It's like an entire generation of highly paid software devs never learned fundamental computer science principles.
That being said: going back to 1970 to read the original "A Relational Model of Data for Large Shared Data Banks" by Codd (the paper which started the relational-database + normalization model) is incredibly useful.
But yeah, all of this debate about "how data should be modeled" was the same in 1970 as it is today.
-----
SQL doesn't quite fit 100% into the relational model, but it's certainly inspired by Codd's relational model and designed to work with the principles from that paper.
And strangely enough, legions of authors and teachers and courses do a worse job at explaining relational databases than Codd's original 11 page paper.
I had a specific class on relational algebra in uni and it is up there with algorithm design and analysis in the realm of classes that actually provided me the most long term value.
SQL is a lot easier once you start viewing it as relational algebra - a declarative expression of intent that can be manipulated and re-expressed like other purely mathematical statements. Then, when performance tuning becomes the watchword, you take that flexible expression and slice and dice it according to what the DBMS you're working with needs in order to perform well. You always want to think of your queries as complex summoning spells that draw in different necessary resources in particular patterns and then impose an expression form on that blob of data - then you'll skate through all things SQL.
My appreciation of relational algebra comes from my study of constraint programming / optimization (a field closely related to 3SAT solvers).
From this perspective, the study of relations is more about solving these NP-hard problems. For example, coloring a graph. You can solve things "locally", such as:
Texas | New Mexico | Oklahoma
-----------------------------
Red | Blue | Green
Red | Blue | Yellow
Red | Green | Blue
Red | Green | Yellow
Red | Yellow | Blue
Red | Yellow | Green
Blue | Red | Green
Blue | Red | Yellow
Blue | Green | Red
Blue | Green | Yellow
...
(etc. etc. for all other valid combinations)
And so on for each "locally valid graph coloring" (looking only at a limited number of states). You then combine all of these relations together to find a solution to the 4-coloring problem.
We can see that "solving" a 4-coloring problem is as simple as a cross-join over all these relations (!!!). That is: Texas_NewMexico_Oklahoma cross-join Texas_Oklahoma_Louisiana cross-join Louisiana_Mississippi_Arkansas cross-join ...
We can see immediately that "Texas_NewMexico_Oklahoma cross-join Texas_Oklahoma_Louisiana" will create a new relation, a "Texas_NewMexico_Oklahoma_Louisiana" table, though with the information from just two earlier tables, this new table is "incomplete" so to speak (you'll need to join it with many other Texas, New Mexico, Oklahoma, and Louisiana tables to ensure global consistency).
We can imagine a big 48-wide table consisting of the 48 states of USA as the final solution to a graph coloring problem. Successful calculation of this massive table will enumerate all possible solutions (!!) of the graph coloring problem.
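Here is a toy version of that construction in Python for a few of the regions named above; the "cross-join plus matching on shared columns" is written as a hand-rolled natural join:

    from itertools import product

    COLORS = ["Red", "Blue", "Green", "Yellow"]

    def local_relation(regions, edges):
        """All colorings of `regions` in which no adjacent pair shares a color."""
        rows = []
        for combo in product(COLORS, repeat=len(regions)):
            row = dict(zip(regions, combo))
            if all(row[a] != row[b] for a, b in edges):
                rows.append(row)
        return rows

    def natural_join(r1, r2):
        """Keep pairs of rows that agree on every shared column, then merge them."""
        out = []
        for a in r1:
            for b in r2:
                shared = set(a) & set(b)
                if all(a[c] == b[c] for c in shared):
                    out.append({**a, **b})
        return out

    tx_nm_ok = local_relation(["Texas", "New Mexico", "Oklahoma"],
                              [("Texas", "New Mexico"), ("Texas", "Oklahoma"),
                               ("New Mexico", "Oklahoma")])
    tx_ok_la = local_relation(["Texas", "Oklahoma", "Louisiana"],
                              [("Texas", "Oklahoma"), ("Texas", "Louisiana")])

    # Joining on the shared Texas/Oklahoma columns yields the consistent 4-region rows
    combined = natural_join(tx_nm_ok, tx_ok_la)
    print(len(tx_nm_ok), len(tx_ok_la), len(combined))   # 24 36 72

Joining in the rest of the per-region relations would enumerate every valid coloring of the whole map, which is also why this blows up: you are asking a join engine to solve an NP-hard problem.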
----------------
Somehow, I find it more obvious to understand relations from this perspective. If anything, learning about constraint programming has made my relational algebra better (and as a result, has made my database skills probably better too)
It's also a funny corner where, if you "study relations" hard enough, you eventually reach NP-complete problems. 3SAT is easily written in the form of database relations after all :-) (But using a database as a 3SAT solver is probably a bad idea, despite being mathematically connected.)
That's exactly what it is. "Self-taught coder" really isn't the right term for what many are, as it implies some form of intentional individual study. More like "self-learned to duct-tape shit together thanks to Stack Overflow", but we don't have a catchy term for that.
Understanding a limited amount of database internals has been very useful to me. There is one aspect of using databases that you simply cannot abstract away, and that is performance. If you ask your database a question in a way it is not suited to perform, or one that isn't supported by indexes, performance is not going to be good.
And these performance differences are not small once your database reaches a decent size. And if your tables are really large, it's not a question of fast or slow but of fast enough or so slow it's indistinguishable from the database being down.
Of course to some extent you can simply throw hardware or money at the problem. This certainly works for smaller inefficiencies, but sometimes knowing the database will give you orders of magnitude better performance. Hardware and money also don't scale indefinitely.
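A tiny illustration of how visible this is, even before the data gets big, using SQLite's EXPLAIN QUERY PLAN (table and index names are made up; every major engine has an equivalent EXPLAIN facility):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")

    query = "SELECT total FROM orders WHERE customer_id = ?"

    # Without an index, the planner has no choice but to scan the whole table.
    print(conn.execute("EXPLAIN QUERY PLAN " + query, (42,)).fetchall())
    # -> detail column reads something like: SCAN orders

    conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

    # With the index, the same question becomes a lookup.
    print(conn.execute("EXPLAIN QUERY PLAN " + query, (42,)).fetchall())
    # -> detail column reads something like: SEARCH orders USING INDEX idx_orders_customer (customer_id=?)

On a table with millions of rows, that is the difference between an index probe and a full scan on every query.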
At some point, DBaaS systems should be able to understand and make inferences about your use cases, to the point where indexes and other performance optimizations are automated whenever you register a new query or something. This would be the new era of database systems, and, as the article points out, it is increasingly true of all "infrastructure" concerns.
One of the problems with this is that there will always be trade-offs. It's hard to imagine a database understanding the appropriate compromise between read and write speeds for your specific application, for example.
Nothing to do with age. This is published by PlanetScale ("a MySQL compatible, serverless database platform powered by Vitess") -- they have a vested interest in promoting the complexities of DBs.
In my view the author has absolutely zero basis to make such a claim.
> MySQL, MongoDB, Firebase, Spanner; there has literally never been a better time to be a database user at any level of complexity or scale. But there’s still one common thread (ha!) among them – the focus is on infrastructure, not developer experience.
It was my impression that everyone picked (and still picks) MySQL, MongoDB, and Firebase _because_ they were the easiest to use. It seemed like developer experience was by far the most important thing to them (compared to sane behavior initially in the case of Mongo and MySQL, some of which has since evolved).
I can't imagine any scenario under which a reasonable person of at least median intelligence would perceive MongoDB as "easier to use" than Spanner, unless their entire experience with Mongo was they put one trivial JSON doc into an M0 cluster and got it back out later. Every practical aspect of MongoDB is a complete shitshow, from sharding to backup to failover.
With the understanding that Mongo _is_ popular I think you're making my point. Though I have no clue if most people who choose Mongo evaluate it against Spanner.
"Sharding" "Backups" and "Failovers" are NOT "practical" aspects of any database. They're theoretical. Most databases are not big enough to need sharding. Most backups go unused. Most failover happens automatically, totally managed by your hosting provider.
You know what is practical? Schema design. Query language. That's what made MongoDB super popular; no schemas to worry about. A query is just '{ firstName: "John" }'.
I cannot emphasize this enough: I cannot summon even a milliliter of desire to care about whether Mongo's way of doing these things is actually "better" or "worse". But it is what made it popular.
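For anyone who hasn't touched it, this is roughly what that experience looks like with pymongo (database and collection names are made up):

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    users = client["app"]["users"]

    # No schema to declare up front; just insert a document...
    users.insert_one({"firstName": "John", "lastName": "Doe", "karma": 3})

    # ...and the query really is just the document from the comment above:
    print(users.find_one({"firstName": "John"}))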
Backups are most definitely a necessary part of any persistent datastore that is accumulating value for your company. Sharding is something that becomes necessary at scale - having it in your toolbox is a really good idea - but it doesn't come up for most folks. Failovers are the least valuable of what you've mentioned IMO since downtime management is something that even very mature companies have a lot of flexibility with.
SQL is a pretty easy language to learn - you can get pretty much everything working with some SELECT, FROM, WHERE, GROUP BY and subqueries alone - the more arcane dialect components of SQL (like HAVING - gosh, I hate how easy HAVING is to misunderstand) are things you can grow into. But basic SQL works and has worked fine for decades - I strongly dislike tooling and languages that go on crusades to make a better SQL.
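Since HAVING came up: the thing that trips people is that WHERE filters rows before aggregation while HAVING filters the aggregated groups. A quick self-contained sketch against an in-memory SQLite database (the data is made up):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE sales (region TEXT, amount REAL);
        INSERT INTO sales VALUES
            ('north', 100), ('north', 250),
            ('south', 40),  ('south', 60),
            ('west', 500);
    """)

    rows = conn.execute("""
        SELECT region, SUM(amount) AS total
        FROM sales
        WHERE amount > 50          -- applied per row, before grouping
        GROUP BY region
        HAVING SUM(amount) > 200   -- applied per group, after aggregation
    """).fetchall()

    print(rows)   # [('north', 350.0), ('west', 500.0)] (order may vary)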
If the database itself can’t handle failover properly, there’s nothing your hosting provider can do except rewrite it (which is exactly what Amazon ended up doing with MongoDB). Also if you’ve spent any significant time being responsible for operating a database and came away with “backups aren’t practical” you’re insanely lucky (or I and everyone I know is insanely unlucky).
I work for a company that supports ClickHouse. Our focus is analytic systems, which tend to run large compared to OLTP systems.
* Sharding is part of schema design for any analytic app whose data exceeds the capacity of a single host. This is very common for use cases like web analytics, observability, network management, intrusion detection, to name just a few. Automatic resharding is one of the top asks in the ClickHouse community. (We're working on it.)
* "How do I back up ClickHouse?" is one of our top 3 support questions by frequency. I just taught a ClickHouse class yesterday--part of a regular series--and it was the first question out of the gate from the audience. It has come up at some point in almost every customer engagement I can think of.
In my experience, your comment is only correct for relatively small applications that are not critical to the business.
> Database internals will eventually just not matter
Of course you need to know the internals of your database. If you've ever come across a project where the team treated a key/value, or document database as a relational one (probably because the query syntax looks similar), then you will know just how important database internals are.
That does make some sense, especially for databases that have been around a while. Lots of internals were written with hard drives in mind rather than SSDs, much lower amounts of memory, and so on. On the other hand, it's nice when your database works well in even a very limited or old environment.