This architecture is roughly how HashiCorp's Nomad, Consul, and Vault are built (I'm one of the maintainers of Nomad). While it's definitely a "weird" architecture, the developer experience is really nice once you get the hang of it.
The in-memory state can be whatever you want, which means you can build up your own application-specific indexing and querying functions. You could just use sqlite with :memory: for the Raft FSM, but if you can build/find an in-memory transaction store (we use our own go-memdb), then reading from the state is just function calls. Protecting yourself from stale reads or write skew is trivial; every object you write has a Raft index so you can write APIs like "query a follower for object foo and wait till it's at least at index 123". It sweeps away a lot of "magic" that normally you'd shove into a RDBMS or other external store.
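To make the "wait till it's at least at index 123" idea concrete, here is a minimal Python sketch of a state store whose reads can block until a given index has been applied. The `StateStore` class and its method names are invented for illustration; this is not go-memdb or Nomad's actual API, and real Raft replication is left out entirely.

```python
import threading


class StateStore:
    """Toy in-memory store where every write carries a Raft-style index."""

    def __init__(self):
        self._cond = threading.Condition()
        self._objects = {}      # key -> (value, index of the write that set it)
        self._last_index = 0    # highest log index applied so far

    def apply(self, index, key, value):
        """Apply one committed log entry; indexes must arrive in increasing order."""
        with self._cond:
            if index <= self._last_index:
                raise ValueError("log entries must be applied in increasing order")
            self._objects[key] = (value, index)
            self._last_index = index
            self._cond.notify_all()

    def get(self, key, min_index=0, timeout=None):
        """Return (value, modify_index) once the store has applied min_index.

        Raises TimeoutError if the store does not catch up in time.
        """
        with self._cond:
            caught_up = self._cond.wait_for(
                lambda: self._last_index >= min_index, timeout=timeout)
            if not caught_up:
                raise TimeoutError(
                    f"store is at index {self._last_index}, wanted {min_index}")
            return self._objects.get(key)


# A follower that is one entry behind the leader.
follower = StateStore()
follower.apply(122, "foo", "old")

# Entry 123 arrives a little later, as replication would deliver it.
threading.Timer(0.05, follower.apply, args=(123, "foo", "new")).start()

# A client that wrote "foo" at index 123 asks for a read at least that fresh.
result = follower.get("foo", min_index=123, timeout=2.0)
```

The read blocks on a condition variable rather than polling, so a stale follower simply answers a little later instead of returning old data.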
That being said, I'd be hesitant to pick this kind of architecture for a new startup outside of the "infrastructure" space... you are effectively building your own database here though. You need to pick (or write) good primitives for things like your inter-node RPC, on-disk persistence, in-memory transactional state store, etc. Upgrades are especially challenging, because the new code can try to write entities to the Raft log that nodes still on the previous version don't understand (or worse, misunderstand because the way they're handled has changed!). There's no free lunch.
>You could just use sqlite with :memory: for the Raft FSM
That's the basic design that rqlite[1] had for its first ~7 years. :-) But rqlite moved to on-disk SQLite, since with WAL mode, and with 'PRAGMA synchronous=OFF' [2], it is about as fast as writing to RAM. Or at least close enough, and I avoid all the limitations that come with :memory: SQLite databases (max size of 2GB being one). I should have just used on-disk mode from the start, but only now know better.
(I'm guessing you may know some of this because rqlite uses the same Raft library [3] as Nomad.)
As for the upgrade issue you mention, yes, it's real. Do you find it in the field much with Nomad? I've introduced new Raft Entry types very infrequently during rqlite's 10 years of development, and only once did someone hit it in the field with rqlite. Of course, one way to deal with it is to first release a version of one's software that understands the new types but never writes them. Once that version is fully deployed, upgrade to the version that actually writes the new types too. I've never bothered to do this in practice however, and it requires discipline on the part of the end-users too.
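The "understand first, write later" rollout can be sketched as a simple gate. Everything here is hypothetical: the release numbers, the entry type names, and the `can_write` helper are made up to show the shape of the check, not taken from rqlite or Nomad.

```python
# Hypothetical entry types: TYPE_BULK_DELETE is the new one being rolled out.
TYPE_UPSERT = 1
TYPE_BULK_DELETE = 2

# Which entry types each (made-up) release can read and is allowed to write.
RELEASES = {
    "1.0": {"reads": {TYPE_UPSERT},
            "writes": {TYPE_UPSERT}},
    # Intermediate release: understands the new type but never writes it.
    "1.1": {"reads": {TYPE_UPSERT, TYPE_BULK_DELETE},
            "writes": {TYPE_UPSERT}},
    "1.2": {"reads": {TYPE_UPSERT, TYPE_BULK_DELETE},
            "writes": {TYPE_UPSERT, TYPE_BULK_DELETE}},
}


def can_write(entry_type, leader_version, cluster_versions):
    """True only if the leader's release writes this type and every node can read it."""
    if entry_type not in RELEASES[leader_version]["writes"]:
        return False
    return all(entry_type in RELEASES[v]["reads"] for v in cluster_versions)


# Jumping straight from 1.0 to 1.2 leaves a node that cannot read the new type.
unsafe = can_write(TYPE_BULK_DELETE, "1.2", ["1.2", "1.0", "1.0"])

# Going through 1.1 first means every node understands it before anyone writes it.
safe = can_write(TYPE_BULK_DELETE, "1.2", ["1.2", "1.1", "1.1"])
```

The discipline problem mentioned above is visible here: the gate only helps if operators actually deploy the intermediate release everywhere before moving on.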
[2] This might sound dangerous but in the current design of rqlite, the underlying SQLite database is completely rebuilt from the Raft log on startup (which is fsync'ed on every write). So any corruption of the SQLite database due to power loss, etc. is moot, since the SQLite database is not the authoritative store of data in rqlite.
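A rough Python illustration of why the SQLite file is disposable when the log is the source of truth. The log format here (a list of statement/parameter pairs) is an assumption for the sketch and is not rqlite's actual on-disk format; snapshots and log compaction are ignored.

```python
import sqlite3


def rebuild_from_log(log_entries):
    """Replay every logged SQL statement, in order, into a fresh database."""
    db = sqlite3.connect(":memory:")
    for statement, params in log_entries:
        db.execute(statement, params)
    db.commit()
    return db


# Stand-in for the durable, fsync'ed Raft log.
raft_log = [
    ("CREATE TABLE foo (id INTEGER PRIMARY KEY, name TEXT)", ()),
    ("INSERT INTO foo (name) VALUES (?)", ("fiona",)),
    ("INSERT INTO foo (name) VALUES (?)", ("declan",)),
    ("UPDATE foo SET name = ? WHERE id = ?", ("finn", 1)),
]

# Whatever state the old database file was in, a restart throws it away
# and replays the log from the beginning.
db = rebuild_from_log(raft_log)
rows = db.execute("SELECT id, name FROM foo ORDER BY id").fetchall()
```

Because the rebuild is deterministic given the log, losing or corrupting the database file costs startup time and nothing else.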
> I should have just used on-disk mode from the start, but only now know better.
Yeah, I saw the recent post about reducing rqlite disk space usage. Using the on-disk sqlite as both the FSM and the Raft snapshot makes a lot of sense here. I'm curious whether you've had concerns about write amplification though? Because we have only the periodic Raft snapshots and the FSM is in-memory, during high write volumes we're only really hammering disk with the Raft logs.
> Do you find it in the field much with Nomad? I've managed to introduce new Raft Entry types very infrequently during rqlite's 10-years of development, only once did someone hit it in the field with rqlite.
My understanding is that rqlite Raft entries are mostly SQL statements (is that right?). Where Nomad is somewhat different (and probably closer to the OP) is that the Raft entries are application-level entries. For entries that are commands like "stop this job"[0] upgrades are simple.
The tricky entries are where the entry is "upsert this large deeply-nested object that I've serialized", like the Job or Node (where the workloads run). The typical bug here is you've added a field way down in the guts of one of these objects that's a pointer to a new struct. When old versions deserialize the message they ignore the new field and that's easy to reason about. But if the leader is still on an old version and the new code deserializes the old object (or your new code is just reading in the Raft snapshot on startup), you need to make sure you're not missing any nil pointer checks. Without sum types enforced at compile time (i.e. Option/Maybe), we have to catch all these via code review and a lot of tedious upgrade testing.
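Here is a small Python sketch of that failure mode, with `Optional` standing in for the Go pointer. The `Job` and `Constraint` shapes are invented for the example and are much simpler than Nomad's real structs.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Constraint:
    """Hypothetical struct added in the new version."""
    max_parallel: int


@dataclass
class Job:
    name: str
    # Absent in entries and snapshots written by older versions.
    constraint: Optional[Constraint] = None


def decode_job(raw):
    """Decode a serialized job, tolerating entries that predate `constraint`."""
    data = json.loads(raw)
    c = data.get("constraint")
    return Job(name=data["name"],
               constraint=Constraint(**c) if c is not None else None)


def max_parallel(job, default=1):
    """Every reader of the new field needs an explicit 'missing' branch."""
    if job.constraint is None:
        return default
    return job.constraint.max_parallel


old_entry = '{"name": "web"}'                                    # written by old code
new_entry = '{"name": "web", "constraint": {"max_parallel": 4}}'  # written by new code

from_old = max_parallel(decode_job(old_entry))
from_new = max_parallel(decode_job(new_entry))
```

Nothing forces a caller to go through `max_parallel`; code that reads `job.constraint.max_parallel` directly works on every new entry and fails only on old ones, which is why this gets caught in review and upgrade testing rather than by the compiler.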
> it requires discipline on the part of the end-users too.
Oh for sure. Nomad runs into some commercial realities here around how much discipline we can demand from end-users. =)
>I'm curious whether you've had concerns about write amplification though?
I mean, yes, the more disk IO rqlite has to do, the more write performance will be affected. However, I believe the advantages of running with an on-disk SQLite database are worth it. In addition, rqlite supports storing the SQLite database file on a memory-backed file system if users really want that[1]. That can help squeeze more write throughput out of rqlite.
>My understanding is that rqlite Raft entries are mostly SQL statements (is that right?).
That's right, rqlite does statement-based replication, though I'm currently looking into extending it so it also does changeset[2] replication where it makes sense.
like you I'm more open to the idea of keeping data in memory than most of the responders here. when I got to the part of the article about how they are using common lisp with hot reloading, I was thinking, well you guys can do whatever you want, but not everybody is working on that team, ha.
Yes indeed! But this doesn't apply to a startup in the Explore phase, where you don't need replication, which is how we ran for a long time. This is the phase where this architecture is most useful for product iteration.
But you're right, once you start using replication in the Expand phase, there certainly are engineering challenges, but they're all solvable challenges. It might help that in Common Lisp we can hot-reload code, which makes some migrations a lot easier.
Decades ago, PG wrote that he didn't use a database for Viaweb, and that it seemed odd for web apps to be frontends to databases when desktop apps were not[0]. HN also doesn't use a database.
That's no longer true, with modern desktop and mobile apps often using a database (usually SQLite) because relational data storage and queries turn out to be pretty useful in a wide range of applications.
> Decades ago, PG wrote that he didn't use a database for Viaweb, and that it seemed odd for web apps to be frontends to databases when desktop apps were not[0].
After reading the link, I don't think that database means the same thing for everyone.
The vwfaq still mentions loading data from disk, and also mentions "start up a process to respond to an HTTP request." This suggests that by "database" they meant a separate server dedicated to persisting data, and having to communicate with another server to fetch that data.
Obviously, this leaves SQLite out of this definition of database. Also, if you're loading data from disk already, either you're using a database or you're implementing your own ad-hoc persistence layer. Would you still consider you're using a database if you load data from SQLite at app start?
The problem with this sort of mental model is that it ignores the fact that the whole point of a database is to persist and fetch data in a way that is convenient to you without having to bother about low-level details. Storing data in a database does not mean running a postgres instance somewhere and fetching data over the web. If you store all your data in-memory and have a process that saves snapshots to disk using a log-structured data structure... Congratulations, you just developed your own database.
I was certainly inspired by PG's writing (after all we do use Common Lisp, and it's hard to avoid PG in this space). But I don't think they did things like transaction logs like how bknr.datastore does, which makes the development process a lot more seamless.
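For readers who haven't seen the pattern, this is roughly what "in-memory state plus a transaction log plus snapshots" looks like, sketched in Python. It is only an illustration of the general technique: the `Store` class, file names, and JSON encoding are assumptions, and bknr.datastore itself is a Common Lisp library with a different API.

```python
import json
import os
import tempfile


class Store:
    """All state lives in a dict; durability comes from a log plus snapshots."""

    def __init__(self, directory):
        self.snapshot_path = os.path.join(directory, "snapshot.json")
        self.log_path = os.path.join(directory, "transactions.log")
        self.state = {}
        self._recover()
        self._log = open(self.log_path, "a", encoding="utf-8")

    def _recover(self):
        # Load the last snapshot, then replay every transaction logged after it.
        if os.path.exists(self.snapshot_path):
            with open(self.snapshot_path, encoding="utf-8") as f:
                self.state = json.load(f)
        if os.path.exists(self.log_path):
            with open(self.log_path, encoding="utf-8") as f:
                for line in f:
                    self._apply(json.loads(line))

    def _apply(self, txn):
        if txn["op"] == "set":
            self.state[txn["key"]] = txn["value"]
        elif txn["op"] == "delete":
            self.state.pop(txn["key"], None)

    def execute(self, txn):
        # Make the transaction durable before changing in-memory state.
        self._log.write(json.dumps(txn) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())
        self._apply(txn)

    def snapshot(self):
        # Write the snapshot atomically, then start a fresh log. Replaying an
        # old log over a newer snapshot is harmless here because set/delete
        # are idempotent.
        tmp = self.snapshot_path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump(self.state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.snapshot_path)
        self._log.close()
        self._log = open(self.log_path, "w", encoding="utf-8")

    def close(self):
        self._log.close()


workdir = tempfile.mkdtemp()
store = Store(workdir)
store.execute({"op": "set", "key": "a", "value": 1})
store.snapshot()
store.execute({"op": "set", "key": "b", "value": 2})
store.execute({"op": "delete", "key": "a"})
store.close()                      # simulate a process restart

recovered = Store(workdir).state   # snapshot + replayed log
```

Application code only ever touches `store.state` and `store.execute`, which is where the "seamless" development experience comes from; the cost is that recovery time grows with the log until the next snapshot.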
I think even SQLite itself wasn't as ubiquitous (edit: it didn't exist) when pg wrote Viaweb. If SQLite wasn't there and my options were basically key value stores, I could just as well use the filesystem in most cases.
Second, querying the RDBMS has been much simplified in past 20 years. We have all kind of ORMs and row mappers to reduce the boilerplate.
We also got advanced features like FTS which are useful for desktop and mobile apps.
Today it's a good choice to use RDBMS for desktop apps.
> If SQLite wasn't there and my options were basically key value stores
Well, there were "options" other than KV stores - MySQL launched a month before Viaweb (but flakey for a good long while.) Oracle was definitely around (but probably $$$$.) mSQL was being used on the web and reasonably popular by 1995 (cheap! cheerful! not terrible!)
(definitely understand making your own in-memory DB in 1995 though)
I think that is just written to disk as something like file41207393 when you click reply.
When the system needs an item it checks whether it's cached in memory and otherwise reads it from disk, and I think that is pretty much the whole memory system. Some other stuff, like user ids, works in the same sort of way.
It just persists its in-memory data structures to disk. Here's the source of an old version; note uses of `diskvar` and `disktable`. A "table" here is just a hashtable.
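As a loose Python analogy for "a hashtable that is written through to disk and loaded on demand": the `ItemTable` class and the one-JSON-file-per-item layout below are assumptions made for illustration, not a description of how the HN source actually implements `diskvar` or `disktable`.

```python
import json
import os
import tempfile


class ItemTable:
    """A dict-backed table: writes go to memory and disk, reads hit memory first."""

    def __init__(self, directory):
        self.directory = directory
        self.cache = {}

    def _path(self, item_id):
        return os.path.join(self.directory, f"{item_id}.json")

    def save(self, item_id, item):
        self.cache[item_id] = item
        with open(self._path(item_id), "w", encoding="utf-8") as f:
            json.dump(item, f)

    def load(self, item_id):
        # Serve from memory if possible, otherwise read the file and cache it.
        if item_id not in self.cache:
            with open(self._path(item_id), encoding="utf-8") as f:
                self.cache[item_id] = json.load(f)
        return self.cache[item_id]


workdir = tempfile.mkdtemp()
ItemTable(workdir).save(41207393, {"by": "someone", "text": "a reply"})

# A fresh process starts with an empty cache and pulls items in as needed.
table = ItemTable(workdir)
cached_before = 41207393 in table.cache
item = table.load(41207393)
cached_after = 41207393 in table.cache
```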
if pg is still stuck in the 90s lisp, I'd bet it's just a single process with the site in ram, using make-object-persistent and loading as needed (kinda like python pickle).
it was a different time. to my knowledge, viaweb was a series of common lisp instances. All states for a user session was held IN MEMORY on the individual machine. I remember reading somewhere that they would be on a call with a user on production and patch bugs in real time while they were on the phone.
The web has gotten bigger and a lot of these practices simply would not fly today. If I were pushing a live fix to our prod machine with the amount of testing that doing it live while the customer is on the phone entails, a good portion of you would be questioning my sanity.
An important reason that practice wasn't as reckless as it sounds is that early Viaweb was just a page builder. The actual web stores its customers were building were static HTML, so updating a customer's instance while talking to them on the phone only affected that one user's backend.
I get the desire to experiment with interesting things, but it seems like such a huge waste of time to avoid having to learn the most basic aspects of MySQL or postgres. You could "just" build on top of one and be done with it, especially if you're running in a public cloud provider. I don't buy the increased RTT or troubles with concurrency issues, the latter having simple solutions by basic tuning, or breaking out your noisy customers. There's another post on their blog mentioning the possibility of adding 10 million rows per day and the challenges of indexing that. That's... literally nothing and I don't think even 10x that justifies having to engineer a custom solution.
Worse is better until you absolutely need to be less worse, then you'll know for sure. At that point you'll know your pain points and can address them more wisely than building more up front.
> I get the desire to experiment with interesting things, but it seems like such a huge waste of time to avoid having to learn the most basic aspects of MySQL or postgres.
For server-based database engines you can still make an argument on shedding network calls. It's dubious, but you can.
What's baffling is that the blogger tries to justify not picking up SQLite claiming it might have features that they don't need, which is absurd and does not justify anything.
The blog post reads like a desperate attempt to start with a poor solution to a fictitious problem and proceed to come up with far-fetched arguments hoping to reject the obvious solution.
If you want to shed network calls, the easiest solution would be to just run Postgres or MySQL on the same server and connect to it via a Unix domain socket. So even if SQLite wasn't an option, network overhead isn't a good argument.
Here’s the thing that I wonder about: would their business be successful if they didn’t spend all this time reinventing the wheel? Just by building it out in the open and blogging about it, they popularize their product and show their technical prowess. If they’d use the boring technologies that one sticks together and all works, they’d have less to talk about—and thus less publicity?
Wondering if my thinking is flawed, or if going this—arguably unnecessary—extra mile is part of the product and being successful in the space.
Seems weird to start with “not talking about using something like SQLite where your data is still serialized”, then end up with a home grown transaction log that requires serialization and needs to be replicated, which is how databases are replicated anyway.
If your load fits entirely on one server, then just run the database on that damn server and forget about “special architectures to reduce round-trips to your database”. If your data fits entirely in RAM, then use a ramdisk for the database if you want, and replicate it to permanent storage with standard tools. Now that’s actually simple.
I do feel like this largely summarizes as "we built our own sqlite + raft replication", yeah. But without sqlite's battle-tested reliability or the ability to efficiently offload memory back to disk.
So, basically, https://litestream.io/ . But perhaps faster switching thanks to an explicit Raft setup? I'm not a litestream user so I'm not sure about the subtleties, but it sounds awfully similar.
That overly-simplified summary aside, I quite like the idea and I think the post does a pretty good job of selling the concept. For a lot of systems it'll scale more than well enough to handle most or all of your business even if you become abnormally successful, and the performance will be absurdly good compared to almost anything else.
They basically only save on serialization & deserialization at query time, which I would consider an infinitesimal saving in the vast majority of use cases. They claim to be able to build some magical index that's not possible with existing disk-based databases (I didn't read the linked blog post). They lose access to a nice query language and entire ecosystems of tools and domain knowledge.
I fail to see how this little bit of saving justifies all the complexity for run-of-the-mill web services that fit on one or a few servers as described in the article. The context isn't large scale services where 1ms/request saving translates to $$$, and the proposal doesn't (vertically) scale anyway.
You should probably RTFA before making broad assumptions on their solution and how it works. Most of what you wrote is both incorrect and addressed in the article.
Telling people to RTFA is against site guidelines. And I read the entire article before making this comment. If you think I’m wrong, you reply with what’s wrong, not some useless “you’re wrong, RTFA”.
The only thing in my comment that’s not directly based on the article is a handwavy 1ms/request saving estimate, and since they don’t provide any measurement, it’s anyone’s guess.
Is telling the people to RTFA against the guidelines?
The guideline specifically advises to do what GP did: Instead of commenting whether or not someone read the article, to tell them that article answers their questions.
- In GitHub readme you mention etcd / consul. Is rqlite suitable for transaction processing as well ?
- I am imagining a dirt simple load balancer over two web servers. They are a crud app backed onto a database. What is the disadvantages of putting rqlite on each server compared to say having a third backend database.
As for your second question, I don't think you'd benefit much from that, for two reasons:
- rqlite is a Raft-based system, with quorum requirements. Running a 2-node system doesn't make much sense. [1]
- Secondly, all writes go to the Raft leader (rqlite makes sure this happens transparently if you don't initially contact the Leader node [2]). A load balancer, in this case, isn't going to allow you to "spread load". What a load balancer is useful for when it comes to rqlite is making life simpler for clients -- they just hit the load balancer, and it will find some rqlite node to handle the request (redirecting to the Leader if needed).
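A toy Python model of the second point, showing why round-robin across nodes does not spread write load. The `Node` class is invented for the sketch; replication, elections, and rqlite's real HTTP redirect/forwarding behaviour are all left out.

```python
class Node:
    """A cluster member that forwards writes to the leader if it isn't one."""

    def __init__(self, name):
        self.name = name
        self.leader = None   # set once the cluster is assembled
        self.data = {}

    def write(self, key, value):
        """Apply the write on the leader and return the name of the node that did it."""
        if self.leader is not self:
            return self.leader.write(key, value)
        self.data[key] = value
        return self.name


nodes = [Node("n1"), Node("n2"), Node("n3")]
for node in nodes:
    node.leader = nodes[0]

# A round-robin "load balancer" picks a different node for each request...
handled_by = [nodes[i % len(nodes)].write(f"k{i}", i) for i in range(6)]
# ...but every write is still executed by the leader.
```

The client-side convenience is real (any node is a valid entry point), but write throughput stays bounded by the single leader.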
It depends on what kind of transaction support you want. If your transactions need to span rqlite API requests then no, rqlite doesn't support that (due to the stateless nature of HTTP requests). That sort of thing could be developed, but it's substantial work. I have some design ideas, it may arrive in the future.
If you need to ensure that a given API request (which can contain multiple SQL statements) is atomically processed (all SQL statements succeed or none do) that is supported however [1]. That's why I think of rqlite as closer to the kind of use cases that etcd and Consul support, rather than something like Postgres -- though some people have replaced their use of Postgres with rqlite! [2]
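The "all statements in one request succeed or none do" guarantee has the same semantics as a plain local transaction, which can be shown with Python's built-in `sqlite3` module. To be clear, this is local SQLite, not rqlite's HTTP API, and `execute_atomically` is a helper made up for the example.

```python
import sqlite3


def execute_atomically(db, statements):
    """Run all statements in one transaction; roll everything back if any fails."""
    try:
        # The connection context manager commits on success and
        # rolls back if an exception escapes the block.
        with db:
            for sql, params in statements:
                db.execute(sql, params)
        return True
    except sqlite3.Error:
        return False


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
db.commit()

# The second statement violates the primary key, so the first is undone too.
failed = execute_atomically(db, [
    ("INSERT INTO accounts VALUES (?, ?)", ("alice", 10)),
    ("INSERT INTO accounts VALUES (?, ?)", ("alice", 20)),
])
count_after_failure = db.execute("SELECT COUNT(*) FROM accounts").fetchone()[0]

succeeded = execute_atomically(db, [
    ("INSERT INTO accounts VALUES (?, ?)", ("alice", 10)),
    ("INSERT INTO accounts VALUES (?, ?)", ("bob", 20)),
])
count_after_success = db.execute("SELECT COUNT(*) FROM accounts").fetchone()[0]
```

What this does not give you is a transaction that stays open across several requests, which is the limitation described above.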
Thank you - so my takeaway is that rqlite is well suited for distributed “publishing” of data ala etcd, but it is possible to use it as a Postgres replacement - thank you I will give it a go
I'll throw in a "ehh... sorta" though rqlite is quite neat and very much worth considering.
The main caveat here is that rqlite is an out-of-process database, which you communicate with over http. That puts it on similar grounds as e.g. postgres, just significantly lighter weight, and somewhat biased in favor of running it locally on every machine that needs the data.
So minimum read latency is likely much lower than postgres, but it's still noticeable when compared to in-process stuff, and you lose other benefits of in-process sqlite, like trivial extensibility.
SQLite doesn't do Raft. There isn't any simple way to do replicated SQLite. (In fact, writing your own database is probably the simplest way currently, if SQLite+Raft is actually what you want.)
Also the OS will cache a lot of the reads even if your database isn’t sophisticated enough or tuned correctly. Still could be a fun exercise, as with all things on here.
I think it's important to understand that every startup goes through three phases: Explore, Expand, Extract. What's simple in one phase isn't simple in the other.
A transactional database is simple in Expand and Extract, but adds additional overhead during the Explore phase, because you're focusing on infrastructure issues rather than product. Data reliability isn't critical in the Explore phase either, because you just don't have customers, so you just don't have data.
Having everything in memory with bknr.datastore (without replication) is simple in the Explore phase, but once you get to Expand phase it adds operational overhead to make sure that data is consistent.
But by the time I've reached the Expand phase, I've already proven my product and I've already written a bunch of code. Rewriting it with a transactional database doesn't make sense, and it's easier to just add replication on top of it with Raft.
I'd assume in the beginning you do not want to spend time writing a bunch of highly difficult code until you've proven your idea/product. Then when you're big enough and have the money, start replacing things where it makes sense. It seems to be the strategy used by many companies.
Unless, of course, your startup is in the business of selling DBMSes.
Absolutely. By the way, if it wasn't clear from my blog post, in the Explore phase, I used an existing library to do this. It was only in the Expand phase that I put this existing library behind a Raft replication.
I’ve done tons of traditional application server + database on the same server projects. There’s zero infrastructure issue there. You keep implying that a not-in-process RDBMS has to be its own server and that’s super strange. Not to mention having a separate db server also doesn’t add much overhead at all in the early stage, even if you’re doing it for the very first time (been there, done that).
Having Explored with a transactional database: I really can't agree. Just change your database, migrations are easy and should be something you're comfortable doing at any time, or you'll get stuck working around it for 100x more effort in the future.
That was the biggest disconnect I had as well. SQL db have the _best_ data migration tooling and practices of any data system. It’s not addressed in the article how migrations are handled with this system but I’m assuming it’s a hand rolled set of code for each one.
I think sql db make the most sense during the explore phase and you switch off of them once you know you need an improvement somewhere (like latency or horizontal scalability).
And this comes to the difference between Explore phase and Expand phase.
In the Explore phase, data migration was just running code on the production server via a REPL. Some migrations such as adding/removing fields are just a matter of hot-reloading code in CL, so there weren't a lot of migrations that we had to manually run.
In the Expand phase, once you add replication, this does become hard and we did roll out our own migration framework. But by this point we already have a lot of code built out, so we weren't going to replace it with a transactional database.
Essentially, we optimized for the Explore phase, and "dealt with the consequences" in the Expand phase (but the consequences aren't as bad as you might think).
Agreed. Reinventing the WAL means reinventing (or ignoring) all the headaches that come with it. I got the impression it takes them a long time to recover from the logs, so they likely haven't even gotten as far as log checkpointing.
> Agreed. Reinventing the WAL means reinventing (or ignoring) all the headaches that come with it.
But if the blogger learned SQLite, how would they have a topic to blog about?
Also, no benchmarks. It's quite odd that an argument grounded on performance claims does not bother to put out any hard data comparing the output of this project. I'm talking about basic things like how does this contrived custom ad-hoc setup compare with a vanilla, out-of-the-box SQLite deployment? Which one performs worse and by how much? How does the performance difference reflect in request times and infrastructure cost? Does it actually pay off to replace the dozen lines of code of onboarding SQLite with a custom, in-development, ad-hoc setup? I mean, I get the weekend personal project vibe of this blog post, but if this is supposed to be a production-minded project then step zero would have been a performance test on the default solution. Where is it?
> It's quite odd that an argument grounded on performance claims
I probably did a bad job then, because everything in the blog post was meant to be developer productivity claims, not performance claims. (I come from a developer productivity background. I'm decent at performance stuff, but it's not what excites me, since for most companies my size performance is not critical as long as it scales.)
I used to work on a telecom platform (think something that runs 4G services), where every node was just part of an in-memory database that replicated using 2PC and just did periodic snapshot to avoid losing data. Basically processes were colocated with their data in the DB.
Very erlang/otp. Joe Armstrong used to rant to anyone who would listen that we used databases too often. If data was important, multiple nodes probably need a copy of it. If multiple nodes need a copy, you probably have plenty of durability.
Even if you weren't using erlang, his influence (and in general, ericsson) permeates the telecom industry.
I worked on a lottery / casino system that was similar. In memory database ( memory mapped files), with a WAL log for transaction replay / recovery. There was also a periodic snapshot capability. It was incredibly low latency on late 90's era hardware.
Setting up a single server with database replication and restore functionality is arguably more complex than setting this up.
There are libraries available to wrap your stuff with this algorithm, and the benefit is that you write your server like it would run on a single machine, and then when launching it in prod across multiple, everything just works.
I'm baffled at the arguments made in this article. This is supposed to be a simpler and faster way to build stateful applications?
The premises are weak and the claims absurd. The author uses overstatement of the difficulties of serialization just to make their weak claim stronger.
When I start a new project, the data structure usually is a "list of items with attributes". For example right now, I am writing a fitness app. The data consists of a list of exercises and each exercise has a title, a description, a video url and some other attributes.
I usually start by putting those items into YAML files in a "data" directory. Actually a custom YAML dialect without the quirks of the original. Each value is a string. No magic type conversions. Creating a new item is just "vim crunches.yaml" and putting the data in. Editing, deleting etc all is just wonderfully easy with this data structure.
Then when the project grows, I usually create a DB schema and move the items into MariaDB or SQLite.
This time, I think I will move the items (exercises) into a JSON column of an SQLite DB. All attributes of an item will be stored in a single JSON field. And then write a little DB explorer which lets me edit JSON fields as YAML. So I keep the convenience of editing human readable data.
Writing the DB explorer should be rather straight forward. A bit of ncurses to browse through tables, select one, browse through rows, insert and delete rows. And for editing a field, it will fire up Vim. And if the field is a JSON field, it converts it to YAML before it sends it to Vim and back to JSON when the user quits Vim.
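The storage half of that plan is only a few lines. Below is a Python sketch using the standard `sqlite3` and `json` modules; the table name, the helper functions, and the `edit` callback (a stand-in for the "open it in Vim as YAML" step, since YAML is not in the standard library) are all assumptions for illustration.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
# One row per item; every attribute lives inside a single JSON text column.
db.execute("CREATE TABLE exercises (id INTEGER PRIMARY KEY, attrs TEXT NOT NULL)")


def add_item(attrs):
    cur = db.execute("INSERT INTO exercises (attrs) VALUES (?)",
                     (json.dumps(attrs),))
    return cur.lastrowid


def get_item(item_id):
    row = db.execute("SELECT attrs FROM exercises WHERE id = ?",
                     (item_id,)).fetchone()
    return json.loads(row[0]) if row else None


def edit_item(item_id, edit):
    """Decode the field, hand it to `edit` (the editor session), store the result."""
    updated = edit(get_item(item_id))
    db.execute("UPDATE exercises SET attrs = ? WHERE id = ?",
               (json.dumps(updated), item_id))
    return updated


# Every value is a string, as in the YAML-file version.
crunches = add_item({"title": "Crunches", "description": "Lie on your back..."})
edit_item(crunches, lambda attrs: {**attrs, "video_url": "https://example.com/crunches"})
stored = get_item(crunches)
```

Adding a new attribute needs no schema change, which keeps the "just edit the file" feel of the original setup.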
What they described early on in the article was basically how NUMA machines worked (eg SGI Altix or UV). Also, their claimed benefit was being able to parallelize things with multithreading in low-latency, huge RAM. Clustering came as a low-cost alternative to $1+ million machines. There’s similarities to persistence in AS/400, too, where apps just wrote memory that gets transparently mapped to disk.
Now, with cheap hardware, they’re going back in time to the benefits of clustered, NUMA machines. They’ve improved on it along the way. I did enjoy the article.
Another trick from the past was eliminating TCP/IP stacks from within clusters to knock out their issues. Solutions like Active Messages were a thin layer on top of the hardware. There’s also designs for network routers that have strong consistency built into them. Quite a few things they could do.
If they get big, there’s hardware opportunities. On CPU side, SGI did two things. Their NUMA machines expanded the number of CPU’s and RAM for one system. They also allowed FPGA’s to plug directly into the memory bus to do custom accelerators. Finally, some CompSci papers modified processor ISA’s, networks on a chip, etc to remove or reduce bottlenecks in multithreading. Also, chips like OpenPiton increase core counts (eg 32) with open, customizable cores.
> Imagine all the wonderful things you could build if you never had to serialize data into SQL queries.
This exists in sufficiently mature Actor model[0] implementations, such as Akka Event Sourcing[1], which also addresses:
> But then comes the important part: how do you recover when your process crashes? It turns out that answer is easy, periodically just take a snapshot of everything in RAM.
Intrinsically and without having to create "a new architecture for web development". There are even open source efforts which explore the RAFT protocol using actors here[2] and here[3].
I have built some medium sized systems using Microsoft Orleans (Virtual Actors).
There was no transactional database involved, but everything was ordered and fully transactional.
If you choose say Cosmos DB, MongoDB or DynamoDB as your persistence provider you can even query the persisted state.
My first thought was, “oh, I used to do this when I wrote Common Lisp, it’s funny someone rediscovered that technique in <rust/typescript/java/whatever>”.
I think this has to be the number one misunderstanding for developers.
Yes, SSD throughput and IOPS have gone up by 100 to 10000x. vCPU performance per dollar has gone up by 20 - 50x. We went from 45/32nm to now 5nm/3nm, with much higher IPC.
But RAM prices haven't fallen anywhere near as much as CPU or SSD prices. RAM may have gotten a lot faster, and you can fit a lot more of it thanks to higher-density chips and a move from dual-channel to 8 or 12 channels. But if you look at the DRAM spot price from 2008 to 2022, you will see that the lowest DRAM price was the same, around $2.8/GB, on three separate occasions. The DRAM price moves in cycles, rising to $6 to $8 per GB in between during the same period. I.e., had you bought DRAM at one of its low points (or one of its high points) at any time during the past ~15 years, it would have cost roughly the same as at any other low (or high) point, plus or minus 10-20%, ignoring inflation.
It was not until mid-2022 that it finally broke through the $2.8/GB barrier and collapsed to close to $1/GB, before settling at ~$2/GB for DDR5.
Yes, you can now get 4TB of RAM in a server. But that doesn't mean DRAM is super cheap. Developers on average, or at least those in big tech, are now earning way more than they were in 2010, which makes them think RAM has gotten a lot more affordable. In reality, even at the lowest point over the past 15 years you get at best a slightly more than 2x reduction in DRAM price. And we will likely see DRAM prices shoot up again in a year or two.
An alternative interpretation is that the maximum RAM capacity for an individual node has drastically increased over the last couple of decades.
A simplistic example, if a given node was limited to 16GB of RAM 20 years ago, I would need 256 nodes to have 4TB of RAM for my system (not including overhead for each OS).
Compared to today, where a single node can have that entire 4TB all in one chassis.
The total cost of RAM chips themselves may not have changed, but the actual cost of using that RAM in a physical system has dropped dramatically.
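The arithmetic behind that example, spelled out in Python. The 16GB per-node limit is the hypothetical from the comment, and the $2.8/GB figure is the spot-price floor quoted upthread; neither is a measured number of mine.

```python
GB_PER_TB = 1024

target_gb = 4 * GB_PER_TB          # the 4TB system from the example
node_limit_gb_then = 16            # hypothetical per-node limit 20 years ago

nodes_then = target_gb // node_limit_gb_then   # machines needed back then
nodes_now = 1                                  # one chassis holds all 4TB today

# Raw DRAM cost at the quoted floor price is identical in both cases...
floor_price_per_gb = 2.8
dram_cost = target_gb * floor_price_per_gb
# ...so the saving is in the 255 extra chassis, CPUs, NICs, and OS installs.
extra_machines_then = nodes_then - nodes_now
```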
This is fascinating, thanks for the data! I agree with the other reply to this: I probably should've said that it's easy to get a machine with 100s of GB of RAM instead of saying it's "cheap".
I've got a handful of small Go applications where I just have a "go generate" command that generates the entire dataset as Go, so the data set ends up compiled into the binary. Works great.
I also have built a whole class of micro-services that pull their entire dataset from an API on start up, hold it resident and update on occasion. These have been amazing for speeding up certain classes of lookup for us where we don't always need entirely up to date data.
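A minimal Python sketch of that second pattern: load everything at startup, answer lookups from memory, and refetch once the copy is older than some threshold. The `ResidentDataset` class, the `fetch` callable, and the injectable clock are assumptions for the example; a real service would more likely refresh in the background and handle fetch failures.

```python
import time


class ResidentDataset:
    """Holds a full copy of a dataset in memory and refreshes it when stale."""

    def __init__(self, fetch, max_age_seconds, clock=time.monotonic):
        self._fetch = fetch              # callable returning the whole dataset as a dict
        self._max_age = max_age_seconds
        self._clock = clock
        self._data = fetch()             # pull everything once at startup
        self._loaded_at = clock()

    def lookup(self, key):
        # Refresh lazily: only when a lookup finds the copy too old.
        if self._clock() - self._loaded_at >= self._max_age:
            self._data = self._fetch()
            self._loaded_at = self._clock()
        return self._data.get(key)


# Fake upstream API and fake clock so the behaviour is easy to follow.
fetch_calls = []

def fake_fetch():
    fetch_calls.append(1)
    return {"sku-1": len(fetch_calls)}   # value changes on every fetch

now = [0.0]
dataset = ResidentDataset(fake_fetch, max_age_seconds=60, clock=lambda: now[0])

first = dataset.lookup("sku-1")     # served from the startup copy
now[0] = 30.0
second = dataset.lookup("sku-1")    # still fresh enough, no fetch
now[0] = 61.0
third = dataset.lookup("sku-1")     # stale, so the dataset is refetched
```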
Not sure I would call that setup simple, but it is interesting. I have honestly never heard of ‘Raft’ or the Raft Consensus Protocol or bknr.datastore, so always happy to learn something on a Friday night.
I agree, the infrastructure required to make this happen eventually gets quite complicated. But the developer experience is what's super simple. If somebody had to take all our infrastructure and just use it to build their next big app, they can get the simplicity without worrying about the internal plumbing.
Raft is fantastic and most modern systems with more than one node are built on Raft. It is actually proven to be equivalent to Paxos, but the semantics of it are closer to what you would prefer as a software writer and the implementation is much simpler.
I once saw a project in the wild where the "database" was implemented using filesystem directories as "tables" with JSON files inside as "rows".
When I asked people working on it if they considered Redis or Mongo or Postgres with jsonb columns, they just said they considered all of those things but decided to roll out their own db anyway because "they understood it better".
This article gives off the same energy. I really hope it works out for you, but IMO spending innovation tokens to build a database is nuts.
This isn't innovation though. You literally just write your server like you would for a single machine, then wrap it in any of the available Raft libraries.
AWS and other cloud providers are money printers because a lot of engineers are insanely tied into established patterns of doing things and can't think through things at a fundamental level. I've seen company backends where their entire AWS stacks could be replaced by two EC2 instances behind a load balancer with a domain name, without affecting business flow.
We did something similar to the work in the OP post at my work, we had a bunch of ECS tasks for a service, where the service did another call to an upstream service to fetch some intermediate results. We wanted to cache results for lower response latency. People were working to set up a Redis cluster. Except the TPS of the service was like 0.1.
Took me one day to code a /sync API endpoint, which was just a replica of the main endpoint. The only difference is that the main endpoint would spin off a thread to call the /sync endpoint, whereas the /sync endpoint didn't. Both endpoints ended with caching the results in memory before returning. Easy as day, no additional infra costs necessary.
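The /sync idea can be sketched in-process, with objects standing in for the ECS tasks. Everything here (`Node`, `upstream`, `wait_for_sync`) is a hypothetical name for illustration; a real service would make HTTP calls to its peers instead of method calls.

```python
import threading
from typing import Callable, Dict, List


class Node:
    """One service instance with an in-memory cache; peers stand in for the other tasks."""

    def __init__(self, upstream: Callable[[str], str]):
        self.upstream = upstream              # hypothetical call to the upstream service
        self.cache: Dict[str, str] = {}
        self.peers: List["Node"] = []
        self._pending: List[threading.Thread] = []

    def _lookup(self, key: str) -> str:
        # Shared by both endpoints: compute if missing, cache, return.
        if key not in self.cache:
            self.cache[key] = self.upstream(key)
        return self.cache[key]

    def sync_endpoint(self, key: str) -> str:
        # Same work as the main endpoint, but never fans out again.
        return self._lookup(key)

    def main_endpoint(self, key: str) -> str:
        # Warm every peer in the background, then answer the caller.
        for peer in self.peers:
            t = threading.Thread(target=peer.sync_endpoint, args=(key,))
            t.start()
            self._pending.append(t)
        return self._lookup(key)

    def wait_for_sync(self) -> None:
        # Only needed for deterministic demos and tests.
        for t in self._pending:
            t.join()
        self._pending.clear()
```

Because only the main endpoint fans out, a request never bounces between peers indefinitely.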
But overall, personally, I don't hate the "spending innovation tokens to build a database is nuts" sentiment too much, because it keeps me employed at high salary while doing minimal work, where things that really should be basic CS are considered innovation.
> then wrap it in any of the available Raft libraries.
Raft does consensus. Raft does not do persistence to disk, WAL, crash recovery, indexing, vacuuming (you're using tombstones for your deletes, right?), or any of the other necessary pieces of a database. That's not mentioning how such a system has no query engine, so every piece of data you're looking up in every place you need data is traversing your bespoke data structures.
What you described isn't a database. Keeping some disposable values cached isn't a database.
Raft does do persistence and crash recovery, at least of the transaction logs.
What you need from your side (and there are libraries that already do this):
a) A mechanism to snapshot all the data
b) An easy in-memory mechanism to create indexes on fields. Not strictly needed, but it definitely makes things a lot easier to work with.
Bespoke data structures are just simple classes, so if you're familiar with traversing simple objects in the language of your choice, you're all set. You might be over-estimating the benefits of a query engine (and I have worked at multiple places that used MySQL extensively, and used MySQL to build heavily scaled software in the past).
> Raft does do persistence and crash recovery, at least of the transaction logs.
It simply does not. The paper that definitionally is Raft doesn't tell you how to interact with durable storage. The raft protocol handles crash recovery in so far as it allows one or more nodes to rebuild state after a crash, but Raft doesn't talk about serialization or WAL or any of the other things you inevitably have to do for reliability and performance. It gives you a way to go from some existing state to the state of the leader (even if that means downloading the full state from scratch), but it doesn't give you a way to go from a pile of bits on a disk to that existing state.
If you have a library that implements Raft and gives you those things, that's not Raft giving you things. And that library could just be SQLite.
> You might be over-estimating the benefits of a query engine
No, I'm not. It's great to describe the data I want and get, say, an array of strings back without having to crawl some Btrees by hand.
> The paper that definitionally is Raft doesn't tell you how to interact with durable storage.
That's being a bit pedantic. Yeah, I did mean that any respectable library implementing Raft would handle all of this correctly.
> without having to crawl some Btrees by hand.
This is not how I query an index. First, we don't even use Btrees; most of the time it's just hash tables, and otherwise it's a simpler form of binary search tree. But in both cases, it's completely abstracted away in the library I'm using. So if I'm trying to search for companies with a given name, in my code it looks like '(company-with-name "foobar")'. If I'm looking for users that belong to a specific company, it'll look like '(users-for-company company)'.
So I still think you're overestimating the benefits of a query engine.
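For readers who don't use Lisp, here is a rough Python analogue of hash-table indexes that are kept up to date on every insert. It is not the API of the library mentioned above; `Store`, `company_with_name` and `users_for_company` are illustrative names that mirror the two lookups quoted in the comment.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Company:
    name: str


@dataclass(frozen=True)
class User:
    email: str
    company: Company


class Store:
    """In-memory objects plus two hash-table indexes maintained on insert."""

    def __init__(self) -> None:
        self._company_by_name: Dict[str, Company] = {}
        self._users_by_company: Dict[Company, List[User]] = defaultdict(list)

    def add_company(self, name: str) -> Company:
        company = Company(name)
        self._company_by_name[name] = company
        return company

    def add_user(self, email: str, company: Company) -> User:
        user = User(email, company)
        self._users_by_company[company].append(user)
        return user

    def company_with_name(self, name: str) -> Optional[Company]:
        return self._company_by_name.get(name)

    def users_for_company(self, company: Company) -> List[User]:
        # .get avoids creating an empty entry for unknown companies.
        return list(self._users_by_company.get(company, []))
```

Callers see two function calls; the index structure behind them can change without touching call sites.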
>persistence to disk, WAL, crash recovery, indexing, vacuuming (you're using tombstones for your deletes, right?),
The point of Raft is that you write your service as if it were a single instance, using SQLite or a non-relational equivalent, and then use Raft to run it as a distributed system with redundancy, all without additional infra. For the vast majority of use cases (i.e. some backend or web app service at a startup), this is more than enough, considering there is enough low-level stuff in drivers and kernels to make data reliability pretty high already.
I get your point and I don’t doubt the project you’re talking about was a mess, but the file system is a database, and can be a very good choice, depending on exactly what you’re doing.
> I once saw a project in the wild where the "database" was implemented using filesystem directories as "tables" with JSON files inside as "rows".
I did this sort of thing recently. I felt bad doing it, I still objectively hate it, because I do know enough to know that basically I'm re-implementing what years of hardworking O/S developers have done, piecemeal. But at least I'm going in with my eyes open which feels better.
The only real mitigating factor I have is that the application is largely 'never-read' and then when reading is done, it's sequential batches. Which is not normally something databases optimise for and works okay for file-storage.
(If someone does know a lightweight database architecture that performs like this, let me know).
There are benchmarks out there showing that for some use cases (i.e. many small updates), using SQLite is faster than using the filesystem. [1]
So not only do you get all of the benefits of a relational database, and literally centuries of engineering hours and bugfixes invested into SQLite, you might also get better performance (which is why I presume you even considered rolling your own).
Why not SQLite? Put the JSON in a single column, and maybe copy some parts of it or its metadata to another two or three columns. It should be faster than the filesystem for reading multiple rows.
Haha, I totally hear you. But we didn't really build the Raft consensus layer from scratch. We used an existing robust library for that: https://github.com/baidu/braft
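A minimal sketch of that suggestion using Python's standard `sqlite3` module. The table layout and the `kind` / `created_at` fields are assumptions chosen for illustration; the point is only that the whole document lives in one column while a couple of copied fields are indexed for sequential batch reads.

```python
import json
import sqlite3

# ":memory:" keeps the sketch self-contained; a file path would be used in practice.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE records (
           id INTEGER PRIMARY KEY,
           kind TEXT NOT NULL,         -- copied out of the JSON for filtering
           created_at TEXT NOT NULL,   -- copied out of the JSON for ordering
           body TEXT NOT NULL          -- the full JSON document
       )"""
)
conn.execute("CREATE INDEX records_kind_created ON records (kind, created_at)")


def insert_record(doc: dict) -> int:
    """Store one JSON document and return its row id."""
    cur = conn.execute(
        "INSERT INTO records (kind, created_at, body) VALUES (?, ?, ?)",
        (doc["kind"], doc["created_at"], json.dumps(doc)),
    )
    conn.commit()
    return cur.lastrowid


def read_batch(kind: str, since: str) -> list:
    """Read documents of one kind in creation order, starting at `since`."""
    rows = conn.execute(
        "SELECT body FROM records WHERE kind = ? AND created_at >= ? "
        "ORDER BY created_at, id",
        (kind, since),
    )
    return [json.loads(body) for (body,) in rows]
```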
Ah, sure: I did not consider Redis at all. My goal in the Explore phase was to keep the data in the same process as my code, and replacing MySQL with any other database doesn't really help here. This was a developer-productivity goal, not a performance goal.
Redis is best as an in-memory cache, not a database. Having used it in production for roughly a decade, I don't trust its on-disk capabilities (AOF/RDB etc.) as either solid or reliable (or even performant) in an emergency scenario, especially with DR or DB migration in mind.
FWIW, how I read the article, it was just an implementation of the original Redis, but with some other language (and thus types) than Tcl.
Redis is/was basically just Tcl-typed data persisted to disk using snapshots (Tcl commands) and append-only Tcl commands, with a network protocol for non-Tcl applications to talk to.
No I haven’t because it’s quite complicated. Databases are very much a solved problem. Unfortunately, this architecture is going to be nigh impossible to hire for and when it goes absolutely sideways recovery will be difficult.
Compared to installing, configuring and maintaining an installation of Redis, this absolutely is complicated. Do you think this is less complicated than using Redis?
In what way is setting up redis or writing a program yourself P hard? What’s the input that leads to polynomial time? And what kind of metric is that? If setting up redis takes me one day or I can write a software myself in a month, does it matter if both are P hard?
And if you have an hourly wage over $1, I am very sure that redis is cheaper at the end of the day than programming your own software and using that.
Polynomial time means that both are deterministic. The difference between the two only comes down to how much one has to type and copy and paste, provided that the person is aware and experienced enough to do both. And the total time for either is negligible, while Raft saves you more money long term in infra costs.
The argument that I'm fighting against is that when someone says it's more complex, what they mean is that they don't have experience in doing that. From a business perspective, this is something to consider when hiring from an average pool, since your point about salary is correct, but the assumption that every single engineer fits this criteria is not correct.
But why, when you can build things in an ordinary way with ordinary tech like Python/Java/C#/TypeScript and Postgres. Lots of developers know it, lots of answers to your questions online, the AI knows how to write it.
Reading posts like this makes me think the founders/CTO is mixing hobby programming with professional programming.
The overwhelming majority underestimates the beauty and effort as well as experience that goes into abstractions. There are some true geniuses at times doing fantastic work, to deliver syntactical sugar while the critics mock the maybe somewhat larger bundle size for “a couple of lines frequently used.” That’s why.
In the end, a good framework is more than just an abstraction. It guarantees consistency and accessibility.
Try to understand the source code if possible before reinventing the wheel is my advice.
What maybe starts out to be fun quickly becomes a burden. If there weren’t any edge cases or different conditions, you wouldn’t need an abstraction. Been there, done that.
Check out https://eclipsestore.io (previously named Microstream) if you're into Java and interested in some of the ideas presented in this article. You use regular objects, such as Records, and regular code, such as java.util.stream, for processing, and the library does snapshotting to disk.
I haven't tried it out, but just thinking of how many fewer organizational hoops I would have to jump through makes me want to try it:
- No ordering a database from database operations.
- No ordering a port opening from network operations.
- No ordering of certificates.
- The above times 3 for development, test and production.
- Not having to run database containers during development.
I think the sweet spot for me would be in services that I don't expect to grow beyond a single node and there is an acceptance for a small amount of downtime during service windows.
Hmm, but the problem with having in-memory objects rather than a db is that you end up having to replicate a lot of the features of a relational database to get a usable system. And adding all the extra features you want from those dbs ends up making a simple solution not very simple at all.
To some extent I think this is an "if all you have is a hammer..." situation. Relational DBs are often not a great fit for how contemporary software manages data in memory (hence the proliferation of ORMs, and adapter layers like graphql). I think it's often easier to write out one's relations in the data structures directly, rather than mapping them to queries and joins
To clarify, as I think some people have misunderstood: we used an existing library called bknr.datastore to handle the "database" part of the in-memory store, so we didn't have to invent too much. Our only innovation here was during the Expand phase, where we put that datastore behind a Raft replication.
I’m not from “start up world” but in the end, few things give me more comfort and lack of surprises down the line than just having a relational database with built in redundancy/transaction logs/back up/recovery. Sure there might always be edge cases (lack of money, regulations, specialist software offering) but in the vast majority of cases - just get a database.
It's interesting you say "backup/recovery" as a strong point of relational databases (servers), because backup and recovery on hot databases have always been a challenge.
With many enterprise databases these days, often "incremental" or other seemingly required backup modes are not included in the "community source" versions; perhaps because surely if you want your database to be backed up safely and then come back online safely, you certainly will fall into the "contact us for quote" enterprise customer demographic.
At least, with SQLite, copying even a hot (in-use) db file to a remote server will usually "just work", with the potential loss of a few transactions, but with most other database/servers, you definitely can't just backup the data directory occasionally and call it a day.
Like I mentioned, I don’t have experience working in a start up. My real world experience with backup/recovery of a live relational DB has been with Oracle using ZDLRA - and indeed its license probably costs dearly.
For stuff like MariaDB a quick search also finds options to perform snapshots, backups, restores etc.
And if you need to be super high available, set up a distributed DB like Cassandra - you lose the relational and transaction part, but at least you’re running a product with known failure modes and known ways to prevent/circumvent them.
I guess my bigger point is that besides “don’t roll your own crypto”, I’d also advise not to roll your own DB. There’s a lot of known stuff in the market, all built by people who made and fixed the mistakes you’re going to make a long time ago.
Sure you reduce deployment complexity, but what about maintaining your algorithm that implements data persistence and replication?
To assume that will never spectacularly bite you is naive. Tests also only go so far as you know what you are testing for, and while you don't know if your product will ever be used, you also don't know if it will explode in success and you will be hostage of your own decisions and technical debt.
These are HARD decisions. Hard decisions require solid solutions. You can surely try that with toy projects, but if I was in a position to build a software architecture for something that had a remote possibility of being used in production, I would oppose such designs adamantly.
Really got a kick out of this article. RAM is big, and cheap. And as we all know the database is the log, and everything else is just the cache. A few questions, comments!
1. I take it you've seen the LMAX talk [0], and were similarly inspired? :)
2. Are you familiar with the event sourcing approach? It's basically what you describe, except you don't flush to disk after editing every field, you batch your updates into a single "event". (you've come at it from the exact opposite end, but it looks like roughly the same thing).
I haven't seen either of these! But I have to say, my inspiration here came from existing libraries. (My only innovation here is taking an existing library that did the whole transaction log thing, and putting it behind a Raft cluster.)
The batching thing is something we can easily do with the library that I'm using. It allows us to define functions as arbitrary transactions. Within the transaction I can do anything that changes state, including changing multiple fields, so we don't have to keep flushing the log after every field.
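To illustrate the idea of a function acting as one transaction, here is a small sketch in Python. It is not the API of the library discussed above; the decorator, the in-memory `log` list (standing in for the persisted or replicated log) and `rename_and_activate` are all made up for this example.

```python
import functools
import json
from typing import Callable, Dict, List

TRANSACTIONS: Dict[str, Callable] = {}   # name -> function, used during replay
log: List[str] = []                      # stands in for the on-disk / replicated log
state: Dict[str, dict] = {}              # the in-memory data


def transaction(fn: Callable) -> Callable:
    """Register fn as a named transaction; each call appends exactly one log entry."""
    TRANSACTIONS[fn.__name__] = fn

    @functools.wraps(fn)
    def wrapper(*args):
        log.append(json.dumps({"tx": fn.__name__, "args": list(args)}))
        return fn(*args)

    return wrapper


@transaction
def rename_and_activate(user_id: str, new_name: str) -> None:
    # Two field changes, but only one log entry for the whole function.
    user = state.setdefault(user_id, {})
    user["name"] = new_name
    user["active"] = True


def replay(entries: List[str]) -> None:
    """Rebuild state by re-running the logged transactions without re-logging them."""
    for line in entries:
        entry = json.loads(line)
        TRANSACTIONS[entry["tx"]](*entry["args"])
```

Replay only works if each transaction function is deterministic given its arguments, which is why the log stores the function name and arguments rather than the resulting field values.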
There is so much wrong with this I don't know where to even start. You want to "keep things simple" and not stand up a separate instance of MySQL/Postgres/Redis/MongoDB/whatever else. So, you:
1. Create your own in-memory database.
2. Make sure every transaction in this DB can be serialized and is simultaneously written to disk.
3. Use some orchestration platform to make all web servers aware of each other.
4. Synchronize transaction logs between your web servers (by implementing the Raft protocol) and update the in-memory DB.
5. Write some kind of conflict resolution algorithm, because there's no way to implement locking or enforce consistency/isolation in your DB.
6. Shard your web servers by tenant and write another load balancing layer to make sure that requests are getting to the server their data is on.
Yeah and good luck when the CEO starts asking for reports and metrics (or anything else that databases have been optimized over the last 50 years to do very well).
We do use Preset for metrics and dashboards, and obviously Preset isn't going to talk to our in-memory database.
So we do have a separate MySQL database where we just push analytics data. (e.g. each time an event of interest happens.) We never read from this database, and the data is a schemaless JSON.
Preset then queries from this database for our metrics purposes.
I don’t want to go ad personam on the blog author, but checking his socials, he is not a really experienced person.
I don’t think we have anything to discuss here. He seems to just want to do cool stuff, and his dropping of databases seems to be because he just doesn’t know a lot of the stuff there is to know.
I applaud the attempt, and it might be that his needs will be covered by what he is doing.
But for everyone else, yes, pick boring technology if you want to do a startup, because technology shouldn’t be hard or something you worry about if you are making web applications.
Well that’s only 7y of working with people to learn from, it’s not nothing but it’s not enough credentials to make me go from “it’s a horrible idea” to “I must be missing something”
This is cool! I’m always excited by people trying simpler things, as a big fan of using Boring Technology.
But I have some bad news: you haven’t built a system without a database, you’ve just built your own database without transactions and with weak durability properties.
> Hold on, what if you’ve made changes since the last snapshot? And this is the clever bit: you ensure that every time you change parts of RAM, we write a transaction to disk.
This is actually not an easy thing to do. If your shutdowns are always clean SIGTERMs, yes, you can reliably flush writes to disk. But if you get a SIGKILL at the wrong time, or don’t handle an io error correctly, you’re probably going to lose data. (Postgres’ 20-year fsync issue was one of these: https://archive.fosdem.org/2019/schedule/event/postgresql_fs...)
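To make the quoted "log first, then change RAM" scheme concrete, here is a deliberately small sketch. `LoggedStore` and the JSON-lines log format are assumptions for illustration. It stops replay at the first unreadable record (a torn final write), but it does not handle fsync errors or truncate a damaged tail, which is exactly the hard part described above.

```python
import json
import os
import tempfile


class LoggedStore:
    """Key/value state where every change is logged and fsynced before it is applied."""

    def __init__(self, log_path: str):
        self.log_path = log_path
        self.fields = {}
        self._replay()
        self._log = open(log_path, "a", encoding="utf-8")

    def _replay(self) -> None:
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as f:
            for line in f:
                try:
                    entry = json.loads(line)
                except json.JSONDecodeError:
                    break  # torn write from a crash: ignore everything from here on
                self.fields[entry["key"]] = entry["value"]

    def set(self, key: str, value) -> None:
        self._log.write(json.dumps({"key": key, "value": value}) + "\n")
        self._log.flush()              # move Python's buffer to the OS
        os.fsync(self._log.fileno())   # ask the OS to write it to the device
        self.fields[key] = value       # only now change the in-memory state

    def close(self) -> None:
        self._log.close()


log_path = os.path.join(tempfile.mkdtemp(), "tx.log")
store = LoggedStore(log_path)
store.set("bar", 2)
store.close()
recovered = LoggedStore(log_path)  # simulates a process restart
print(recovered.fields)            # {'bar': 2}
recovered.close()
```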
The open secret in database land is that for all we talk about transactional guarantees and durability, the reality is that those properties only start to show up in the very, very, _very_ long tail of edge cases, many of which are easily remedied by some combination of humans getting paged and end users developing workarounds (eg double entry bookkeeping). This is why MySQL’s default isolation level can lose writes: there are usually enough safeguards in any given system that it doesn’t matter.
A lot of what you’re describing as “database issues” doesn’t sound to me like DB issues, so much as latency issues caused by not colocating your service with your DB. By hand-rolling a DB implementation using Raft, you’ve also colocated storage with your service.
> Screenshotbot runs on their CI, so we get API requests 100s of times for every single commit and Pull Request.
I’m sorry, but I don’t think this was as persuasive as you meant it to be. This is the type of workload that, to be snarky about it, I could run off my phone[0]
> This is actually not an easy thing to do. If your shutdowns are always clean SIGTERMs, yes, you can reliably flush writes to disk. But if you get a SIGKILL at the wrong time, or don’t handle an io error correctly, you’re probably going to lose data.
Thanks for the comment! This is handled correctly by Raft/Braft. With Raft, before a transaction is considered committed it must be committed by a majority of nodes. So if the transaction log gets corrupted, it will restore and get the latest transaction logs from the other nodes.
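The majority rule can be written down in a few lines. This is a simplified sketch, not braft's implementation: `match_indexes` is an assumed input holding, for every node including the leader, the last log index known to be stored on that node, and the additional Raft requirement that the entry belongs to the leader's current term is left out.

```python
from typing import List


def quorum_size(cluster_size: int) -> int:
    """Smallest number of nodes that forms a majority."""
    return cluster_size // 2 + 1


def commit_index(match_indexes: List[int]) -> int:
    """Highest log index that is stored on a majority of nodes."""
    ordered = sorted(match_indexes, reverse=True)
    # The quorum-th largest value is present on at least quorum_size nodes.
    return ordered[quorum_size(len(ordered)) - 1]


print(quorum_size(3))             # 2
print(commit_index([10, 7, 3]))   # 7
```

With three nodes the quorum is two, so a write acknowledged by the leader and one follower survives the loss of any single machine.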
> I’m sorry, but I don’t think this was as persuasive as you meant it to be.
I wasn't trying to be persuasive about this. :) I was trying to drive home the point that you don't need a massively distributed system to make a useful startup. I think some founders go the opposite direction and try to build something that scales to a billion users before they even get their first user.
Wait, so you’re blocking on a Raft round-trip to make forward progress? That’s the correct decision wrt durability, but…
I’m now completely lost as to why you believe this was a good idea over using something like MySQL/Postgres/Aurora. As I see it, you’ve added complexity in three different dimensions (novel DB API, novel infra/maintenance, and novel oncall/incident response) with minimal gain in availability and no gain in performance. What am I missing?
(FWIW, I worked on Bigtable/Megastore/Spanner/Firestore in a previous job. I’m pretty familiar with what goes into consensus, although it’s been a few years since I’ve had to debug Paxos.)
> I was trying to drive home the point that you don't need a massively distributed system to make a useful startup. I think some founders go the opposite direction and try to build something that scales to a billion users before they even get their first user.
This reads to me as exactly the opposite: overengineering for a problem that you don’t have.
For exactly the reasons you describe, I would argue the burden of proof is on you to demonstrate why Redis, MySQL, Postgres, SQLite, and other comparable options are insufficient for your use case.
To offer you an example: let’s say your Big Customer decides “hey, let’s split our repo into N micro repos!” and they now want you to create N copies of their instance so they can split things up. As implemented, you’ll now need to implement a ton of custom logic for the necessary data transforms. With Postgres, there’s a really good chance you could do all of that by manipulating the backups with a few lines of SQL.
> As implemented, you’ll now need to implement a ton of custom logic for the necessary data transforms. With Postgres, there’s a really good chance you could do all of that by manipulating the backups with a few lines of SQL.
Isn’t writing «a few lines of SQL» also custom logic? The difference is just the language.
It is also possible that the custom data store is more easily manipulated with other languages than SQL.
SQL really is great for manipulating data, but not all relational databases are easy to work with.
> Wait, so you’re blocking on a Raft round-trip to make forward progress? That’s the correct decision wrt durability, but…
Yeah. I hope it was clear in my post that the goal was developer productivity, not performance.
The round trip is only an issue on writes, and reads are super fast. At least in my app, this works out great. The writes also parallelize nicely with respect to the round trips, since the underlying Raft library just bundles multiple transactions together. Where it is a bottleneck is if you're writing multiple times sequentially on the same thread.
The solution there is you create a single named transaction that does the multiple writes. Then the only thing that needs to be replicated is that one transaction even though you might be writing multiple fields.
> it’s been a few years since I’ve had to debug Paxos
And this is why I wouldn't have recommended this with Paxos. Raft on the other hand is super easy for anyone to understand.
It’s great that people explore new ideas. However this does not seem like a good idea.
It claims to solve a bunch of problems by ignoring them. There are solid reasons why people distribute their applications across multiple machines. After reading this article I feel like we need to state a bunch of them.
Redundancy - what if one machine breaks, whether from a hardware failure, a software failure, or a network failure (a network partition where you can’t reach the machine or it can’t reach the internet)?
Scaling - what if you can’t serve all of your customers from one machine? Perhaps you have many customers and a small app, or perhaps your app can use a lot of resources (maybe it loads gigs of data).
Deployment - what happens when we want to change the code and not go down? If you are running multiple copies of your app, you get this for cheap.
There are tons of smaller benefits, such as right-sizing your architecture.
What if the one machine you chose is not big enough? You need to move to a new machine; with multiple machines you just increase the number of machines.
You also get to use a variety of machine sizes and can choose ones that fit your needs, so this flexibility allows you to choose cheaper machines.
I feel like the authors don’t know why people invented the standard way of doing things.
Because we don’t want everything to fall over when one machine goes down, we need at least 3 machines (for Raft). So if our traditional db would have 500 GB of data, we now need 3 machines with 500 GB of RAM running at all times. That is an epic waste of money. Millions per year to run? And you could store it in a db for a couple of dollars.
So all of this ram is being used and is only accessed sporadically if at all. This is not good.
Sounds like you could implement the entire thing on a micro db instance (redis or a regular db) with no raft or any other custom implementation or messing.
> Hold on, what if you’ve made changes since the last snapshot? And this is the clever bit: you ensure that every time you change parts of RAM, we write a transaction to disk. So if you have a line like foo.setBar(2), this will first write a transaction that says we’ve changed the bar field of foo to 2, and then actually set the field to 2. An operation like new Foo() writes a transaction to disk to say that a Foo object was created, and then returns the new object.
>
> And so, if your process crashes and restarts, it first reloads the snapshot, and replays the transaction logs to fully recover the state. (Notice that index changes don’t need to be part of the transaction log. For instance if there’s an index on field bar from Foo, then setBar should just update the index, which will get updated whether it’s read from a snapshot, or from a transaction.)
That’s a database. You even linked to the specific database you’re using [0], which describes itself as:
We wanted to simplify our architecture and not use a database, so instead we created our own version of everything databases already do for us. Super risky for a company. Hopefully you don’t spend all of your time maintaining, optimizing, and scaling this custom architecture.
Notice how the complexity of this grows suddenly when you start thinking about infrastructure failure and restarts due to deployments. I have seen this play out dozens of times in my professional career, where these systems start out very simple but eventually become a huge maintenance burden.
This is where high-level abstractions like Durable Execution are much more powerful for developers, since they have the potential to abstract away this level of detail. Basically, code up your application as if infrastructure failures do not exist and let an underlying Durable Execution platform like Temporal or something similar handle resiliency for you.
After reading countless negative comments, many written based on real experience, but almost all tinged with fear, and even a few ad hominem attacks ("...not really experienced" and "...just more lispers?" Really?), I'd like to offer a word of encouragement.
I'm thrilled to see someone try something different, and grateful that he wrote about his positive experiences with it. Perhaps it will turn out to have been the wrong decision, but his writing about it is the only way we'll ever really know. It's so easy to be lulled into a sense of security by doing things the conventional way, and to miss opportunities offered by huge improvements in hardware, well-written open-source libraries, and powerful programming languages.
I have an especially hard time with the idea that SQL is where we all should end up. I've worked at Oracle, and I worked on Google AdWords when it was built on MySQL and InnoDB. I understand SQL's power, but I also understand how constraining it is, not only on data representation, but also on querying. I want to read more posts about people trying to build something without it. Redis is one way, but I'm eager to hear about others.
I wish the author good luck, and encourage him to write another post once Screenshotbot reaches the next stage.
> periodically just take a snapshot of everything in RAM.
Sounds similar to `stop the world Garbage collection` in Java. Does your entire processing come to a halt when you do this? How frequently do you need to take snapshots? Or do you have a way to do this without halting everything?
Good catch! Snapshotting was certainly a bottleneck that I chose not to write about.
But we aren't really taking the snapshot of RAM, instead we're running some code asking each object to snapshot itself into a stream. If you do this naively, it will block writes on the server until the snapshot is done (reads will continue to work).
But Raft has a protocol for asynchronous snapshots. So in the first step we take an immutable fast snapshot of the state we care about which happens quickly, then writes can keep going while in the background we serialize the state to disk.
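The two-step snapshot described here (freeze quickly, serialize in the background) can be sketched as follows. This is an illustration, not the author's code: `SnapshottingStore` is a made-up class, and `copy.deepcopy` stands in for whatever cheap immutable copy a real system would use, since a deep copy of a large heap would not be fast.

```python
import copy
import json
import os
import tempfile
import threading


class SnapshottingStore:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._state = {}

    def write(self, key: str, value) -> None:
        with self._lock:
            self._state[key] = value

    def read(self, key: str):
        with self._lock:
            return self._state.get(key)

    def snapshot_async(self, path: str) -> threading.Thread:
        # Step 1: writes are blocked only while the frozen copy is taken.
        with self._lock:
            frozen = copy.deepcopy(self._state)

        # Step 2: serialization runs in the background; writes continue meanwhile.
        def serialize() -> None:
            tmp = path + ".tmp"
            with open(tmp, "w", encoding="utf-8") as f:
                json.dump(frozen, f)
                f.flush()
                os.fsync(f.fileno())
            os.replace(tmp, path)  # readers never see a half-written snapshot

        worker = threading.Thread(target=serialize)
        worker.start()
        return worker


store = SnapshottingStore()
store.write("a", 1)
snapshot_path = os.path.join(tempfile.mkdtemp(), "snapshot.json")
worker = store.snapshot_async(snapshot_path)
store.write("b", 2)  # not blocked by the background serialization
worker.join()
```

The snapshot file reflects the state at the moment of the copy, so the later write of `b` is not part of it and would be recovered from the transaction log instead.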
We used Redis with persistence to build our first prototype. It performed amazingly and development speed was awesome. We were a full year beyond break-even before adding MySQL to the stack for the few times we missed the ability to run SQL queries, for finance.
> Hold on, what if you’ve made changes since the last snapshot? And this is the clever bit: you ensure that every time you change parts of RAM, we write a transaction to disk.
Every single time… it’s always just the wheel being re-written.
The wheel has been reinvented in different shapes and for different purposes for a long time. It's not necessarily a bad thing. You don't want to keep using stone wheels, or wooden wheels. Have you ever seen modern sci-fi wheels developed for very specific purposes and with specific properties that make them better than just plain-old wheel?
We used an existing library called bknr.datastore to handle this part, so we didn't have to reinvent the wheel :) I mentioned that at the end of the blog post, but I wanted to build up the idea for people who have no prior knowledge about how such things work.
Please, someone explain how building your own in-memory database and snapshotting on top of Raft is simpler than just installing Postgres or SQLite with one of the modern durability tools. Seriously, if you genuinely believe writing concurrency code with mutexes and other primitives and hoping that's all correct is easier than just writing a little SQL, you've tragically lost your way.
[Author here] The transactions and snapshots are still logged to disk. So if the cluster goes down and comes back up, each one just reloads the state. Until at least two machines are back up, we won't be able to serve requests though.
Not sure what you mean by ephemeral things. If you mean things like file descriptors, they are not stored. Technically the snapshot is not a simple snapshot of RAM, it snapshots through all the objects in memory that are set up to be part of the datastore. (It's a bit more complicated and flexible than this, but that's the general idea.)
> Imagine all the wonderful things you could build if you never had to serialize data into SQL queries.
No transactions, no WAL, no relational schema to keep data design sane, no query planner doing all kinds of optimisations and memory layout things I don't have to think about?
You could say that transactions, for example, would be redundant if there is no external communication between the app server and the database. But that is far from the only thing they're useful for. Transactions are a great way of upholding important invariants about the data, just like a good strict database schema. You roll back a transaction if an internal error is thrown. You make sure that a transaction's data changes get serialised to disk all at once. You remove the possibility that statements from two simultaneous transactions access the same data in a random order (at least if you pick a proper transaction isolation level, which you usually should).
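To make the rollback point concrete, here is a small sketch using Python's built-in `sqlite3` module. The `accounts` table and the `transfer` helper are made up for the example; the point is only that a failed invariant undoes the whole transaction, including the statement that had already succeeded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"  # the invariant
)
with conn:
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )


def transfer(conn, src, dst, amount):
    # `with conn` commits if the block succeeds and rolls back if it raises.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, dst),
        )
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, src),
        )


try:
    transfer(conn, "alice", "bob", 500)  # would overdraw alice
except sqlite3.IntegrityError:
    pass  # the credit to bob was rolled back together with the failed debit

print(conn.execute("SELECT name, balance FROM accounts ORDER BY name").fetchall())
# [('alice', 100), ('bob', 50)]
```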
> You also won’t need special architectures to reduce round-trips to your database. In particular, you won’t need any of that Async-IO business, because your threads are no longer IO bound. Retrieving data is just a matter of reading RAM. Suddenly debugging code has become a lot easier too.
The database is far from the only other server I have to communicate with when I'm handling a user's HTTP request. As a web developer, I don't think I've worked on a single product in the last 4 years that didn't have some kind of server-to-server communication for integrations with other tools and social media sites.
> You don’t need crazy concurrency protocols, because most of your concurrency requirements can be satisfied with simple in-memory mutexes and condition variables.
Ah, mutexes. Something that programmers have never shot themselves in the foot with. Also, deadlocks don't exist.
> Hold on, what if you’ve made changes since the last snapshot? And this is the clever bit: you ensure that every time you change parts of RAM, we write a transaction to disk. So if you have a line like foo.setBar(2), this will first write a transaction that says we’ve changed the bar field of foo to 2, and then actually set the field to 2. An operation like new Foo() writes a transaction to disk to say that a Foo object was created, and then returns the new object.
A disk write latency is added to every RAM write. It has no performance cost and nobody notices this.
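For reference, the quoted mechanism could be sketched roughly like this. It is an illustration only, not the article's actual implementation: the `Logged` base class, the JSON record format, and the log file name are all assumptions. The `os.fsync` call is where each in-memory write pays for a disk write.

```python
import json
import os
import tempfile


class Logged:
    """Hypothetical base class: attribute writes hit the log before memory."""

    def __init__(self, obj_id, log_file):
        # Bypass __setattr__ for bookkeeping fields so they are not logged.
        object.__setattr__(self, "_id", obj_id)
        object.__setattr__(self, "_log", log_file)

    def __setattr__(self, name, value):
        record = {"id": self._id, "field": name, "value": value}
        self._log.write(json.dumps(record) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())  # the disk latency paid on every write
        object.__setattr__(self, name, value)  # only now change RAM


class Foo(Logged):
    pass


path = os.path.join(tempfile.mkdtemp(), "transactions.log")
with open(path, "a") as log:
    foo = Foo("foo-1", log)
    foo.bar = 2  # the equivalent of foo.setBar(2)

with open(path) as f:
    print(f.read().strip())  # {"id": "foo-1", "field": "bar", "value": 2}
```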
I apologise if this comes off as too snarky. Despite all of the above, I really like this idea, and I'm already thinking of implementing it in a hobby project, just to see how well it really works. I'm still not sure if it's practical, but I love the creative thinking behind this, and the fact that it actually helped them build a business.
I would add that the 'serialization' to an RDBMS schema, which the article cites as a negative, is actually a huge positive for most systems. Modeling your data relationally, often in 3NF, usually produces something different from the in-memory/code objects in all but the simplest ORM class=table projects. Thinking deeply about how to persist data in a way that keeps it flexible and useful as application needs change (i.e. the database outlives the application(s)) has value in itself; it is not just a pointless cost.
I like being able to draw a hard line between application data structures, often ephemeral and/or optimized for particular tasks -- and the persisted, domain data which has meaning beyond a specific application use case.
This is not good advice. In parts it's a hyperbolic and unbalanced view:
> Imagine all the wonderful things you could build if you never had to serialize data into SQL queries.
You can do all those "wonderful things" with an RDBMS too, it's just an additional step.
> First, you don’t need multiple front-end servers talking to a single DB, just get a bigger server with more RAM and more CPU if you need it.
You don't "need" that with a single DB either; you can also get a bigger machine. Also, you can use SQLite and Litestream.
> What about indices? Well, you can use in-memory indices, effectively just hash-tables to lookup objects. You don’t need clever indices like B-tree that are optimized for disk latency.
RDBMSs provide all kinds of indices. You don't need to build them in your code or re-invent clever solutions. They're all there, optimized and battle-tested over decades.
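A quick sketch of what "building them in your code" tends to look like. The `users` data and the helper name are made up. A hash table covers exact-match lookups, but as soon as you want a range query you need a second, ordered structure that you also have to keep in sync by hand.

```python
import bisect

users = [
    {"id": 1, "name": "ann", "age": 31},
    {"id": 2, "name": "bo", "age": 24},
    {"id": 3, "name": "cy", "age": 45},
    {"id": 4, "name": "di", "age": 31},
]

# Hash index: fast exact-match lookup by id, but no ordering at all.
by_id = {u["id"]: u for u in users}

# Ordered index: (age, id) pairs kept sorted so that range scans are possible.
by_age = sorted((u["age"], u["id"]) for u in users)


def ids_with_age_between(lo, hi):
    """Return ids of users with lo <= age <= hi via binary search."""
    start = bisect.bisect_left(by_age, (lo,))
    end = bisect.bisect_right(by_age, (hi, float("inf")))
    return [uid for _, uid in by_age[start:end]]


print(by_id[3]["name"])              # cy
print(ids_with_age_between(25, 40))  # [1, 4]
```

Every insert, update and delete now has to touch both `by_id` and `by_age`, which is exactly the bookkeeping a database index does for you.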
> You also won’t need special architectures to reduce round-trips to your database.
You don't need "special architectures" at all. With the simplest setup you get thousands of requests per second and sub-5 ms latency. With SQLite, even more. No need for async IO; threads scale well enough for most apps. Anyway, async is not a magical thing.
> You don’t need any services to run background jobs, because background jobs are just threads running in this large process.
How does this change when using an RDBMS?
> You don’t need crazy concurrency protocols, because most of your concurrency requirements can be satisfied with simple in-memory mutexes and condition variables.
I trust a proper proven implementation in SQLite or Postgres much more than "simple in-memory mutexes and condition variables". One reason why Rust is so popular is that it's an eye opener when the compiler shows you all your concurrency bugs you never thought you had in your code.
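As an illustration of how much care even "simple" mutexes need, here is a sketch with a hypothetical `Account` class. Two transfers running in opposite directions will deadlock if each grabs its own source lock first, so the code has to impose a global lock order that nothing in the language checks for you.

```python
import threading


class Account:
    def __init__(self, account_id, balance):
        self.id = account_id
        self.balance = balance
        self.lock = threading.Lock()


def transfer(src, dst, amount):
    if src is dst:
        return False  # locking the same non-reentrant Lock twice would hang
    # Always lock the account with the smaller id first. Without a fixed order,
    # transfer(a, b) and transfer(b, a) can each hold one lock while waiting
    # forever for the other.
    first, second = sorted((src, dst), key=lambda acct: acct.id)
    with first.lock, second.lock:
        if src.balance < amount:
            return False
        src.balance -= amount
        dst.balance += amount
        return True


a, b = Account(1, 1000), Account(2, 1000)


def worker(src, dst):
    for _ in range(1000):
        transfer(src, dst, 1)


threads = [
    threading.Thread(target=worker, args=(a, b)),
    threading.Thread(target=worker, args=(b, a)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(a.balance + b.balance)  # 2000: the total is conserved
```

Forget the `sorted` line in one code path and the program still passes most test runs, which is the kind of bug the comment above is worried about.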
---------------------
RDBMSs solve / support many important things the easy way:
- normalized data modelling by refs and joins
- querying, filtering and aggregating data
- concurrency
- storage
Re-inventing those is, most of the time, much harder, more error-prone and more expensive.
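A small sketch of the first two bullets, using Python's built-in `sqlite3` module with a made-up `customers`/`orders` schema. The reference, the join, the filter and the aggregation are each one line of SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        total_cents INTEGER NOT NULL
    );
    INSERT INTO customers VALUES (1, 'ann'), (2, 'bo');
    INSERT INTO orders VALUES (1, 1, 1200), (2, 1, 800), (3, 2, 500);
""")

# Join, aggregate per customer, and keep only those who spent at least 1000.
rows = conn.execute("""
    SELECT c.name, COUNT(*) AS order_count, SUM(o.total_cents) AS spent
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.id
    HAVING SUM(o.total_cents) >= 1000
    ORDER BY spent DESC
""").fetchall()

print(rows)  # [('ann', 2, 2000)]
```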
---------------------
The simplest, easiest and most proven way is still to use an RDBMS. Start with SQLite and Litestream if you don't want to manage Postgres, which is a substantial effort, I admit. Cost can be a factor too, although something like Neon / Supabase / ... at small scale is much, much cheaper than the development costs for all the stuff above.
Honestly, SQLite is just a great option. It's stored locally, so you have that fast disk access. It does great for small, medium, and even larger databases. And you just have a file.
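To show how little setup "just a file" involves, here is a sketch with Python's built-in `sqlite3` module. The file name and the `notes` table are made up, and WAL mode is just the journal mode mentioned elsewhere in this thread, not a requirement.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "app.db")

conn = sqlite3.connect(path)  # creates the file if it does not exist
# Ask for write-ahead logging; SQLite reports the mode it actually ended up in.
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
conn.execute(
    "CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, body TEXT NOT NULL)"
)
with conn:  # commit on success, roll back on error
    conn.execute("INSERT INTO notes (body) VALUES (?)", ("hello",))
conn.close()

# A "restart": reopen the same file and the data is still there.
conn = sqlite3.connect(path)
print(mode, conn.execute("SELECT body FROM notes").fetchall())
conn.close()
```

Backing up or moving the database is then a matter of copying `app.db` safely, which is the part a tool like Litestream automates.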