Actually, when done right this dramatically simplifies a backend architecture. Even a low-scale application commonly uses multiple databases (e.g. Postgres plus ElasticSearch) and queues+workers for background work.
Our Twitter-scale Mastodon implementation is a direct demonstration of this. It's literally 100x less code than Twitter had to write to build the equivalent feature-set at scale, and it's more than 40% less code than Mastodon's official implementation. This isn't because of being able to design things better with the same tooling the second time around – it's because it's built using fundamentally better abstractions.
Mastodon is extremely similar to the Twitter consumer product. The differences are minor. We compare with Twitter because, unlike Mastodon, Twitter runs at large scale.
Yes, we implemented the entirety of Mastodon from scratch.
Twitter is a simple product that only has value because of the eyeballs that look at it, not the depth of the product.
I fundamentally disagree with the premise of your blog post but not the premise of your company, so let me make an ask of you: write a much more practically useful application. Design a basic shopping cart using your system and compare and contrast it with a well-designed relational equivalent. A system that allows products to be purchased and fulfilled is a far closer match to what the majority of companies are using software for than writing a Twitter clone.
Here's my take -- at a certain point in scale and volume using a database, it actually does make sense to rewrite all of the following from scratch:
- query planning
- indexing (btree vs GIN, etc) and primary/foreign/unique key structure
- persistence layer
- locking
- enforcing ACID constraints
- schema migrations/DDL
- security, accounts and permissioning
- encryption primitives
But more crucially, my belief and experience is that most companies making lots of money from software products will NEVER remotely reach that scale -- and prematurely optimizing for that is not just the wrong decision but borderline professional malpractice. You can get very far with Postgres and JSONB if you really need it, and you'll spend more time focusing on your business logic than reinventing the wheel.
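To make that concrete, here is a rough sketch of the Postgres + JSONB route over plain JDBC - the table and column names are made up for illustration, not taken from any real system:

    import java.sql.*;

    // Sketch only: relational columns for the parts you constrain and join on,
    // a JSONB column for the long tail of attributes you don't want to model up front.
    public class JsonbSketch {
      public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
            "jdbc:postgresql://localhost/shop", "app", "secret")) {
          try (Statement st = conn.createStatement()) {
            st.execute("CREATE TABLE IF NOT EXISTS orders (" +
                       " id bigserial PRIMARY KEY," +
                       " customer_id bigint NOT NULL," +
                       " created_at timestamptz NOT NULL DEFAULT now()," +
                       " attrs jsonb NOT NULL DEFAULT '{}')");
            // GIN index keeps containment queries on the JSONB column cheap.
            st.execute("CREATE INDEX IF NOT EXISTS orders_attrs_idx" +
                       " ON orders USING gin (attrs jsonb_path_ops)");
          }
          // Find orders whose attrs contain {"gift_wrap": true}.
          try (PreparedStatement ps = conn.prepareStatement(
              "SELECT id, attrs->>'coupon' FROM orders WHERE attrs @> ?::jsonb")) {
            ps.setString(1, "{\"gift_wrap\": true}");
            try (ResultSet rs = ps.executeQuery()) {
              while (rs.next()) {
                System.out.println(rs.getLong(1) + " coupon=" + rs.getString(2));
              }
            }
          }
        }
      }
    }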
I'd like to be proven wrong. But I get a sinking feeling that I'm not wrong, and while your product is potentially valuable for a very specific use case, you're doing your own company a disservice by distorting reality so strongly to both yourselves and your prospective customers.
I'll round out this comment by linking another comment in this thread that covers the perils of event sourcing, when the juice isn't worth the squeeze, very well:
Different applications will be more relatable to different developers. And that is the path we are going down, of steadily building up more and more examples of applying Rama towards different use cases. Some developers will get their light-bulb moment on a compare/contrast to a shopping cart, others vs. a time-series analytics app, etc. It's completely different for different developers, so building up that library of examples will take time. At the moment, we're focused on our private beta users who are technically savvy enough to be able to understand Rama through the first principles on which it's based.
We started with a Twitter demonstration because: a) its implementation at scale is extremely difficult, b) I used to work there and am intimately familiar with what they went through on the technical end, and c) the product is composed of tons of use cases which work completely differently from each other – social graph, timelines, personalized follow suggestions, trends, search, etc. A single platform able to implement such diverse use cases with optimal performance, at scale, and in a comparatively tiny amount of code is simply unprecedented.
It's also a write-only workload that doesn't have to deal with the updates intrinsic to money movement, pricing, inventory, and other common workloads where atomicity is not a nice-to-have but a requirement. Twitter is trivial as a product compared to the most mundane shopping cart. You worked at Twitter? Okay great, I worked at AWS, which runs the entire internet. And it doesn't really matter, even though I am intimately familiar with where naive deployments of relational databases break down at scale, as well as the mitigating approaches (cellular architecture) that make it possible to keep scaling them.
You're not really addressing the substance of my comment so I am left to assume that your omission is because you cannot address it. Be that as it may, you'll hopefully at least take my final comment at its face value that you do your own product a disservice by overhyping what its use case is towards areas that it is objectively a poor fit. People have tried event sourcing many times and it's just not a good fit for many if not most workloads. You can't be everything to everyone. There's nothing wrong with that.
My advice to you is this: call out the elephant in the room and admit that, and focus on workloads where it is a good fit. That extra honesty will go a long way in helping you build a business with sustainable differentiation and product market fit.
They didn't claim authority, they disclaimed arguing by authority: "I worked at AWS [...] And it doesn't really matter. [please address] the substance of my comment".
Most of the comments here bias negative, but you shouldn't take that to heart. Hackernews is, in general, conservative when it comes to attempts to displace entrenched, battle-tested solutions (especially when they come wrapped with unfortunate, hyperbolic rhetoric).
It's a very impressive demo. You should be proud, keep your chin up, and keep us updated as Rama's value prop. continues to grow.
The thing is, the abstractions you're offering are sugar on top of a well-known, well-understood old architectural pattern. The Twitter / Mastodon comparison is unconvincing exactly because both Twitter and Mastodon seemingly totally ignored the well-known, well-trodden ground on this, for no good reason.
If you can convince more developers to apply what is effectively the canonical store -> workers -> partitioned denormalized materialized views pattern where it makes sense, then great.
But you can do that with just the tools people already have available. Heck, you can do that with just multiple postgres servers (as the depot, and for the "p-stores" and for the indexing functions), and then you don't need to ditch the languages people are familiar with for both specifying the materialization and the queries.
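To sketch that concretely (purely illustrative - the table names and the follower-count view are made up, not from the article): an append-only events table plays the role of the depot, and a worker folds new events into a denormalized view table.

    import java.sql.*;

    // Sketch of store -> worker -> materialized view with nothing but Postgres and JDBC.
    // follow_events is the append-only "depot"; follower_counts is the denormalized view;
    // view_cursor (a single pre-created row) records the last event id this view consumed.
    public class ViewWorker {
      public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
            "jdbc:postgresql://localhost/app", "app", "secret")) {
          conn.setAutoCommit(false);
          PreparedStatement upsert = conn.prepareStatement(
              "INSERT INTO follower_counts (user_id, cnt) VALUES (?, 1)" +
              " ON CONFLICT (user_id) DO UPDATE SET cnt = follower_counts.cnt + 1");
          PreparedStatement poll = conn.prepareStatement(
              "SELECT id, followed_user_id FROM follow_events WHERE id > ? ORDER BY id LIMIT 1000");
          while (true) {
            long cursor = readCursor(conn);
            poll.setLong(1, cursor);
            try (ResultSet rs = poll.executeQuery()) {
              while (rs.next()) {
                cursor = rs.getLong(1);
                upsert.setLong(1, rs.getLong(2));
                upsert.executeUpdate();
              }
            }
            writeCursor(conn, cursor);
            conn.commit();            // view rows and cursor advance atomically
            Thread.sleep(200);
          }
        }
      }

      static long readCursor(Connection c) throws SQLException {
        try (Statement st = c.createStatement();
             ResultSet rs = st.executeQuery("SELECT last_id FROM view_cursor")) {
          return rs.next() ? rs.getLong(1) : 0L;
        }
      }

      static void writeCursor(Connection c, long id) throws SQLException {
        try (PreparedStatement ps = c.prepareStatement("UPDATE view_cursor SET last_id = ?")) {
          ps.setLong(1, id);
          ps.executeUpdate();
        }
      }
    }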
Part of the reason we use a "hodgepodge of narrow tooling", however, tends to be that it allows us to pick and choose languages and APIs depending on what developers are familiar with and what suits us, and it lets people mix and match. Convincing people to give that up in favour of a fairly arcane-looking API restricted to the JVM is going to be a tough sell in a lot of places.
Look I don't have any reason to praise Twitter, but ...
This "Twitter-scale mastadon implementation" is when my red flags went up. It's meant to demonstrate a simpler and more performant architecture, but it actually demonstrates "things you should never do" #1: rewrite the code from scratch.
"The idea that new code is better than old is patently absurd. Old code has been used. It has been tested. Lots of bugs have been found, and they’ve been fixed."
The "1M lines of code" and "~200 person-years" of Twitter being trashed on in this article is the outcome of Twitter doing the most important thing that software should do: deliver value to people. Millions of people (real people, not 100M bots) suffered thru YEARS of the fail-whale because Twitter's software gave them value.
This software has only delivered some artificial numbers in a completely made-up false comparison. Okay it's built on "fundamentally better abstractions", but until it's running for people in the real world, that's all it is: abstract.
Please don't tout this as a demonstration of how to re-create all of Twitter with simpler and more performant back-end architecture.
This is a well-known essay that everyone should read, and yet nobody should ever cite it as a commandment for what other engineers can or cannot do.
Joel was talking about commercial desktop software in the extremely competitive landscape of the 90s, he wasn't talking about world-scale internet service infrastructure. The architecture that delivered a set of X features to the initial N users isn't always going to be enough for X+Y features to the eventual 1000*N users that you promised your investors.
Companies like Google are quite public about how much they rewrite internal software, and that's just what the public hears about. A particular service might have been written with all of the care and optimization that a world-class principal engineer can manage, serve perfectly for a particular workload for several years, and yet still need to be entirely replaced to keep up with a new workload for the following several years.
You wouldn't tell another engineer that they shouldn't rewrite a single function because software should never be rewritten, so it also doesn't make sense to tell them not to rewrite an entire project either. It's their call based on the requirements and constraints that they know much more about.
Nobody should be rushing out to rewrite Linux or LLVM from scratch, and yet we wouldn't even have Linux or LLVM if their developers didn't find reasons to create them even while other projects existed. In hindsight it's clear those projects needed to be created, but at the time people would have said you should never rewrite a kernel or compiler suite.
The Joel article is about not rewriting an existing application. TFA is not saying that Twitter should rewrite. It's saying that if you don't work for Twitter but you want to write something like Twitter maybe this could be a good place to start.
At some point in time the same argument was made for relational databases despite there being stable systems built without them based on ISAM. The newer relational systems took a lot less work to implement but that didn't imply that it made sense to rewrite the ISAM based systems.
By your logic, all testing is invalid except for usage by real users. Two things:
- We stress-tested the hell out of our implementation well beyond Twitter-scale, including while inducing chaos (e.g. random worker kills, network partitions).
- We ran it for real when we launched with 100M bots posting 3,500 times / second at 403 average fanout. It worked flawlessly with a very snappy UX.
The second-system effect is a real thing, but there's a difference when you're building on radically better tooling. All the complexity that Twitter had to deal with that led to so much code (e.g. needing to make multiple specialized datastores from scratch) just didn't exist in our implementation.
> By your logic, all testing is invalid except for usage by real users
That is exactly correct. All testing is make believe except for real case studies by real customers with an intent to pay, and barring that, real pilots with real utilization. Otherwise, you run the risk of building a product for a version of yourself that you are pretending is other people.
Complexity can't be destroyed, only moved around. For a great number of tasks this solution offers a bad set of trade-offs.
But, at a certain level of scale everything is a data engineering problem, and sometimes this is the (relatively) simple solution when viewed in the context of the entire system.
'Just use mySQL/SQLite/Postgres' is great advice until it isn't.
Every problem has an amount of essential complexity, which as you said can’t be done away with. Imperfect engineering often adds unnecessary (accidental) complexity.
I think it would be a mistake to approach a domain and assume nothing can be made simpler or more straightforward.
In all the companies I've worked at, "event sourcing plus materialized views" has only ever caused intense confusion and resulted in more bugs and more downtime. The simpler solutions of either MySQL or Postgres, or Redis/DynamoDB, have all worked better.
I genuinely believe we would all be better off if Martin Fowler's original article about Event Sourcing had never been written. IMO, it's a bad idea in 99% of cases.
I've worked with people who did successfully employ event sourcing architectures.
One thing to note: these were massive engineering departments in very large companies. Think of several dozen data processing teams, each very much their own small company, all working on different domains with event and stream consumers, and relying on the ability to restart their stream consumers from some point in the past. Impressive Kafka clusters ingesting thousands and thousands of events per second and keeping quite a backlog around to enable this refeeding, catching back up after outages, and such. At their scale and complexity, I can see benefits.
However, at that scale, Kafka is your database. And you end up with a different color of the same maintenance work you'd have with your central database. Data ends up not being written, data ends up being written incorrectly, incorrectly written data ends up causing transitive errors. At times, those guys ended up having to drop a significant timespan of data, filter out the true input events, pull the true events back in and start refeeding from there - except then there was a thundering herd effect, and then they had to start slowly ramping the load back up... great fun for the teams managing the persistence layer.
Note that I'm not necessarily saying e.g. Postgres is the solution to everything. However, a decently tuned and sized, competently run Postgres cluster allows you to delay larger architectural decisions for a long time with run-of-the-mill libraries and setup. Forever, in a nonzero number of projects.
Well, if your program domain requires X level of essential complexity, you will have to achieve that somehow. You can either let it live in someone else's code, that's already written and battle tested, or you can rewrite it yourself, which is time consuming and the results might be subpar. There is no free lunch.
Did I miss something, or does that post completely omit concepts like concurrency, isolation, constraints and such? And are they really suggesting "query topologies" (which seem very non-declarative and essentially making query planning/optimization responsibility of the person writing them) are a superior developer environment?
This stuff is all covered thoroughly in our docs. This is a post about the complexity aspect of backend development and how that complexity is addressed with Rama. It's not a thorough explanation of every aspect of Rama, since that would be extremely long. If you dig into Rama, you'll see that its properties and guarantees are very strong.
And yes, Rama's queries are a massively superior approach. The need for complex query planners is a result of limitations in how you index data, oftentimes from the tension between normalization and denormalization. With Rama, it's easy to robustly materialize multiple views that are already in the shape needed for queries.
I did check the docs. According to the search, there are like 5 references to "consistency", 4 of which are talking about how traditional databases do that poorly, and the 5th seems to suggest using a "depot partitioner", which seems very much like sharding with per-shard consistency. For "isolation" there are 2 references, for "transaction" 1, none of which explains anything.
And I'm sorry, you won't convince me something defined in Java is superior to declarative SQL. There's a lot of problems with SQL, no doubt about it.
In theory, there is no domain (or finite set of domains) that cannot be accurately modeled using tuples of things and their relations.
Practically speaking, the scope of a given database/schema is generally restricted to one business or problem area, but even this doesn't matter as long as the types aren't aliasing inappropriately. You could put a web retailer and an insurance company in the same schema and it would totally work if you are careful with naming things.
Putting everything into exactly one database is a superpower. The #1 reason I push for this is to avoid the need to conduct distributed transactions across multiple datastores. If all business happens in one transactional system, your semantics are dramatically simplified.
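A toy illustration of what that buys you (table names are made up): because inventory and orders live in the same database, reserving stock and recording the order are one ACID commit instead of a two-phase commit or a saga across separate datastores.

    import java.sql.*;

    // Illustrative: "reserve stock + record order" as a single transaction in one database,
    // rather than a distributed-transaction problem across two datastores.
    public class Checkout {
      public static void placeOrder(Connection conn, long productId, long customerId, int qty)
          throws SQLException {
        conn.setAutoCommit(false);
        try {
          try (PreparedStatement reserve = conn.prepareStatement(
              "UPDATE inventory SET on_hand = on_hand - ? WHERE product_id = ? AND on_hand >= ?")) {
            reserve.setInt(1, qty);
            reserve.setLong(2, productId);
            reserve.setInt(3, qty);
            if (reserve.executeUpdate() == 0) {
              throw new SQLException("insufficient stock");
            }
          }
          try (PreparedStatement order = conn.prepareStatement(
              "INSERT INTO orders (product_id, customer_id, qty) VALUES (?, ?, ?)")) {
            order.setLong(1, productId);
            order.setLong(2, customerId);
            order.setInt(3, qty);
            order.executeUpdate();
          }
          conn.commit();   // both changes become visible atomically, or neither does
        } catch (SQLException e) {
          conn.rollback();
          throw e;
        }
      }
    }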
> Putting everything into exactly one database is a superpower.
Especially this.
$1M big iron DB server is much cheaper than a redundant array of inexpensive databases when people come to actually use the data; be it developers, analysts or leadership, everybody saves time, except perhaps a few DBAs.
I've worked at two of them -- all data, sans pointers to content-addressed immutable object storage, is stored in a single database. It worked well for us; it's not rainbows and unicorns, wow-all-of-our-problems-disappeared, but we got what we needed with sufficient performance for an app that was not well optimized.
There were challenges, like the difficulty of working with a few huge tables that had existed since the company's inception and were practically infeasible to do any schema changes on, but that's more a modeling issue. It was also a single-point-of-failure service. If the db goes down, so too does everything, but aside from the constant sense of impending doom it was less of a problem than I thought it would be with NDB.
It is good to see people on HN supporting this single-db approach. And with developments such as OrioleDB, horizontal scaling would be solved as well. And all access to the data is native.
Sorry, I meant a single-point-of-failure service. I guess maybe the term would be large blast radius. The database implementation was HA to almost paranoid degrees, but if we lost the database it was a total outage, every service down for all customers. If the DB had issues there was really no sense of a partial outage.
Amazon's entire system ran on a single Oracle cluster up until about 2008 (they were still at the tail end of phasing that out when I was there in 2012).
In my time at Samsung, we were tracking most of the authoritative state for the factory in a single monster Oracle DB instance. This made troubleshooting systems integration problems fairly trivial once you had some business keys to look for.
There are so many crazy tricks out there. You would be amazed at how much abuse an old school RDBMS engine can take on modern hardware. In-memory OLTP on servers that can host terabytes of RAM completely changes the game. Disaggregated compute & storage provides seamless OLAP capabilities on the same SQL connection instance for the most modern providers.
Oracle (the software) isn't a database, though. It is a database management system. It seems highly unlikely that it kept all the data in one place. Different workloads generally need different database structures. And Amazon circa 2008 was complex enough that it no doubt had a gamut of different workload needs, even if that was all managed under one management system.
Really, as it pertains to this, it is a question of "where do I want to move the complexity to?", not "can I eliminate it?"
It's also entirely practical for everything that's not Twitter-like, i.e. almost everything. But the most noise comes from people with the (admittedly more interesting) use case requirements at the data store level.
So long as your application doesn't throw space and performance away at every opportunity, vertical scaling goes really far these days.
It can be, but I've discovered that in tech it's common to assume that your employees' time costs $0, for some reason. I couldn't even begin to count the number of times I've had to walk a manager back from deciding that we should spend a person-month of engineering time to develop an in-house solution that lets us avoid a $500/year licensing fee.
I see this a lot in RDBMSes. I've used plenty of 'em, and I have nothing against projects like PostgreSQL; they're amazing, especially for the price. But computers cost money, and maintenance time costs money, and development time costs money, and I know from experience that I can squeeze a lot more performance out of the same hardware using MSSQL than I can PostgreSQL, and I'll have an easier time doing it, too. I can even point to some of the specific design and feature differences that lead to the discrepancy.
It's to the point where, whenever I see an article talking about the limitations of relational databases, I automatically mentally insert a "that you can get for free" into the sentence.
As someone bootstrapping a business, my cost structure favors horizontally scaling databases with low/no fixed overhead vs vertically scaling databases with high overhead. And I don’t think this is exactly an uncommon situation - plenty of people, even within large well-funded organizations, would have a hard time justifying such a large expenditure if it were uncertain that it’d be fully utilized.
It’s also a lot easier for me to handle tenancy/isolation/different data “realms” this way. And I sleep better at night knowing a mistake (I have no dbas, and I don’t have time to invest in tons of safeguards as a solo dev when there are much higher EV tasks) won’t nuke absolutely everything.
For small companies, I've never started with horizontal scaling.
With vertical scaling and maybe 2-3 instances, I don't have to worry about hasty ORM code that generates suboptimal queries.
If I have horizontal scaling, sometimes even a reasonable query that's missing an explicit index can turn into a 30s query that I have to spend dev time debugging.
I wish I could upvote this more. 95+% of startups will never need to scale past one reasonably sized single-instance database. Maybe they'll want to have a couple of replicas running for a fast failover, but from a performance perspective, they should spend their time writing efficient queries instead of horizontally scaling.
It's good to think about the what-ifs along the way. Can we shard this setup if we land a huge customer, or one that's hyper-sensitive about having their data separated? If/when that time comes, you'll have ideas of what to do about it. But realistically, most companies never hit that point. Every hour spent worrying how to make their product FAANG-scale is an hour spent not making their product.
I think most even semi-experienced developers know this. But doing the right thing is against their career growth goals, so they overcomplicate the architecture on purpose.
This industry is way too much of a fashion show. And too many devs are consumers of new technology in a similar way.
After 20 years I am actually getting a bit skeptical behind the "you should constantly be learning" mantra. If you are constantly learning new, tech, you will never be an expert in anything. (That statement is very much a generalization and I am sure there are tons of people that will point out issues with it).
You also often have to fight for the right thing, then any problems that do happen are your fault.
When the dumbshit over engineered "cloud architecture" that was pushed on you (managers don't get promoted or switch companies for a big raise by reducing their headcount... no, they get promoted by building a big team for a successful [please don't check in on how it was doing two years later...] "cloud transformation") has all kinds of problems and the costs shoot into the stratosphere, while five servers on a rack would have granted greater uptime and performance in practice for your usage patterns, and lower costs... it's not your fault.
Have some bad luck and that simple solution you pushed for happens to experience an unlikely failure causing a significant outage in year 1, and your head's on the chopping block.
What I experienced is that it doesn't matter if I pushed for the right thing or inherited some overcomplicated shit. In case of any problems it's always my fault anyway. You just can't win.
If you want the latitude to effect meaningful change, then you are also going to have to accept all ownership surrounding that thing and likely far beyond. I know it gets some flak (for good reason), but concepts like extreme ownership are fundamentally how I am able to tolerate this "everything is my fault" experience.
If everything will be my fault in the end, then screw it - I will go ahead and take ownership/control of all of those things and make sure they don't come back around to bite me in the ass. If someone questions my authority, I always cast it in terms of "how many weekends with your family are you willing to sacrifice to support that idea?" Oftentimes this is all it takes to get those idle suggestions put in the bin so we can move forward.
Solving the database circus in a real business environment requires disengaging all of the safeties and getting very messy with your free time, stress levels and other humans. It's front-loaded suffering, though. If you actually know what you are doing and get through the schema refactor (and it sticks), everything is downhill from that point. Getting one schema to rule them all, and then having the team learn that schema, gives you a new common language by which you can communicate that did not exist prior. In our business, non-developers are just as aware of our tables, columns and relations as developers & DBAs are. In some cases, they have more detail about how a field is used than I do.
I wish it would work in practice like that. Because in reality you can't just start making unapproved changes unless you really want to get fired quickly. But you can't get approval either. But you are 'responsible' for keeping it up and running, too.
To elaborate, I’m using managed databases that abstract away scaling so that my UX as a developer is that they are scaling horizontally (in data, and cost).
While I’m pre-launch and doing testing, using a more traditional vertical model is money out of my pocket. And it’s not something I want to maintain after I launch either.
Also, I’m creating something (a freemium game) where as a solo dev it will only be worth it for me to continue working on the business if it sees explosive growth. If I’m lucky enough for that to materialize, there will be at least a few months before I can hire fulltime help for the product (while also being busy doing tons of other stuff) - it would be terrible to have to handle tons of db operations tasks or re-architect things to support growth during that time.
Basically, vertical scaling is optimizing for the wrong level of usage. It’s actually the same for many of the snarky “your startup doesn’t need horizontal scaling” comments here - if your states are {building, huge growth, failure} then horizontal scaling is perfect because it keeps your costs low while you’re building and handles the huge growth phase more gracefully. Yes, maybe you’ll never get to the huge growth phase, but the entire business is oriented around either getting there or failing.
I certainly didn't mean my comment to be snarky. However, I do want to be realistic. There's an exception for every guideline. Maybe yours is that exception. I worked at a company like you described, where we had a few tens of thousands of users, but were actively in talks with a partner that would have given us international press, and who would have taken us to about 20 million concurrent users a month later. (Imagine a beer company telling their customers to use our app to interact with the Super Bowl broadcast. That sort of thing.) We spent a lot of time talking about horizontal scaling because that was the whole point of the company. Maybe yours is like that, too.
But the vast majority of small companies are not like that. They acquire one customer, then another, and another, and grow N<=50% per year, with tons of headroom on default app configurations. For those companies, worrying about explosive, global scaling instead of, you know, building their business, is time they can't get back.
I worked for a company. The system was basically a crud web app, where users could upload medical imaging data, and send it to be processed in the background.
We had the web app set up on Kubernetes, yet the background jobs had to be manually managed by admins at the company. There were fewer than 300 users in total, and the only time Kubernetes scaled beyond one pod was when something was broken. When I joined, they had just implemented the Kubernetes stuff. I thought it was kind of overkill, but as the new guy I didn't feel it was my place to point it out. But hey, some people got some nice experience on their resumes.
I know this well. I've also had to argue 'even if we become a monopolist in our market, the current service without scaling will keep up (we can literally get a reasonable guess of the total number of potential users), so why are you implementing horizontal scaling?'
Do you ever get to a place where you actually have to scale it? Like, the PC from my teenage years would probably be more than fast enough for 80% of companies' data.
Also, are you sure your data is actually in a correct view across all these "realms"?
Basically I’m optimizing for viral growth, otherwise the business probably isn’t worth it. I haven’t launched yet but I estimate that at 100k DAU vertical scaling/using a single instance would become a nightmare because of throughput and latency rather than data size.
I’m admittedly using a strange architecture for my use case and I realize now commenting here opened too big a can of worms as explaining exactly what I’m doing would derail the thread. Suffice it to say, my db doesn’t contain just “Bob ordered 5 widgets to Foo Lane” data.
But yes, using a more horizontal database strategy makes it very easy to manage data across realms. That’s one of the main benefits. A single DB would be much harder as far as isolation and separating test/production traffic (assuming this is what you mean by views) than having multiple separable dbs that only talk to dev/staging. And I can easily wipe out and isolate dev and staging this way. I’m frankly shocked people would advocate a single db that doesn’t allow you to do this.
> Basically I’m optimizing for viral growth, otherwise the business probably isn’t worth it. I haven’t launched yet but
If you have not launched yet but are optimizing for facebook-scale, that's not the optimal approach.
I can't comment on your database experience since I don't know it, but the vast, vast majority of people underestimate by orders of magnitude what a database can handle.
If you're not a large public company we all know about (and you're not, if you haven't launched yet), you don't need all the horizontal scale you seem to be building.
I remember joining one company (still a startup, but a large one about to IPO). My day-1 briefing was about how this one database was in urgent need of replacement with a dozen+ node Cassandra cluster because it was about to exceed its capacity any second now. That was to be my highest priority project.
I took some measurements on usage and capacity and put that project on the backburner. The db was nowhere near capacity on the small machine it was running on. Company grew a lot, did an IPO, grew some more. Years later I left. That db was still handling everything with plenty of headroom left to grow more.
That's chump change size even for a medium EC2/RDS instance, which should be capable of tens of millions of queries a day without the CPU or disk complaining at you (unless all your queries are table scans or unindexed).
> my db doesn’t contain just “Bob ordered 5 widgets to Foo Lane” data
It doesn't matter, it's still just bytes. What will matter is your query pattern relative to the database's query planner efficacy, and how updates/deletes impact this.
> makes it very easy to manage data across realms
You can just as easily do this at first as separate databases/schemas on the same physical server, with different users and permissions to prevent cross-database/schema joins so that when you need to move them to different machines it's an easier process.
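Roughly like this, as a sketch (names are illustrative, and it assumes the tenant identifier comes from trusted config rather than user input):

    import java.sql.*;

    // Sketch: one Postgres instance, one schema + one role per tenant. Each app connects
    // as its tenant's role and can't see the other tenants' schemas, so peeling a tenant
    // off onto its own server later is mostly a dump/restore plus a connection-string change.
    public class TenantSetup {
      public static void createTenant(Connection admin, String tenant) throws SQLException {
        // tenant must be a trusted identifier, since DDL can't be parameterized.
        try (Statement st = admin.createStatement()) {
          st.execute("CREATE ROLE " + tenant + "_app LOGIN PASSWORD 'change-me'");
          st.execute("CREATE SCHEMA " + tenant + " AUTHORIZATION " + tenant + "_app");
          // Keep the role from wandering outside its own schema (default PUBLIC grants
          // would also need tightening in real life).
          st.execute("REVOKE ALL ON SCHEMA public FROM " + tenant + "_app");
          st.execute("ALTER ROLE " + tenant + "_app SET search_path = " + tenant);
        }
      }
    }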
Everyone I know that has tested isolated multi-tenancy that wasn't dependent on legal needs ended up abandoning this approach and consolidating into as little hardware as possible. Heap Analytics had a blog post a few years ago about this, but I can't seem to find it.
Regardless, I hope you find success in your endeavor and that you come back in a few months to prove us all wrong.
If it's a game, transactions are usually either very few per player per day (login, begin or finish playing a level, say something in a chat, spend money on a loot box, etc.) or easily sharded (e.g. 30 commands per second for each of 10 players in a multiplayer level that lasts 10 minutes, not for each player all the time).
Something operates the database. That thing is going to need a database.
Postgres and etcd are very different. SQLite is very different. Look at this microk8s thread - https://github.com/canonical/microk8s/issues/3227 - as a canonical example. Context is they use dqlite, which is multi-master SQLite with raft. Here people come to actually use the data and it’s a disaster. Just use etcd!
Then people will deploy Postgres on top of Kubernetes. Great. Maybe they will deploy it differently, and their database is a bunch of files in /etc/. Whatever. But you can’t run Kubernetes on Postgres, you wouldn’t want to anyway, the choice to use etcd is very thoughtful. This is an environment designed to simplify the layers of bootstrapping, Kubernetes’s superpower is that it runs as much of its own stuff on top of itself as safe as it is to do so.
It still needs a database that is going to be separate from your applications, and unfortunately, for all interesting applications, the line between deployment and business-logical concerns is blurred.
So IMO, you will have multiple databases, because logical and deployment concerns are intertwined; deployment concerns need databases; and thus to solve chicken and egg, you’ll have at least 2, hopefully the 1st is super easy to administer.
Perhaps you disagree. Maybe you should use dqlite as the only database. It’s so naive to think anything is CP as long as there’s consensus in front of it. And if there’s no strong opinion about the arcane issue I just said, worse if it’s meaningless gobbledygook; or if the strong opinion is “use Postgres,” the data is screwed and the users are screwed.
Big databases are not new. If that was all that was needed, people would have been doing just that for the past 40 years. Turns out it doesn't always work, and sometimes (often?) it's terrible.
Even if theoretically it was all you ever needed, the other constant problem is the implementation. Most developers, today anyway, are morons. They don't understand how databases work or how to use them, so they use them poorly. In order to get away from that fact, they invented new databases (like NoSQL) so they could use their brains less and just write more glue code. Turned out that was horrible as well.
Pretty soon the tides will turn yet again and "just having a big database" will again be seen as not in vogue and another paradigm of "simplicity" will come around (instead of "one big database" being simple it'll be "many small databases" like we had with microservices).
Those who don't understand history are doomed to repeat it.
Being able to model a use case with tuples and relations does not mean the database can meet the performance requirements of that use case. If it can't meet the performance requirements, then the use case is unsupported. It's the same way how no single data structure or combination of data structures can support all regular programming use cases. Sometimes you need a map, sometimes you need a list, sometimes you need a set, sometimes you need a combination, and sometimes you need something completely different.
Yep, you need Google-scale for your shitty startup from the get-go. Otherwise why bother? Especially now, when you have single servers with only a few terabytes of RAM at your disposal.
Many of the problems with databases that I outlined in that post are about how they create complexity, which is not necessarily related to performance or scale. Complexity kills developer productivity, which reduces iteration speed, which can be the difference between an application succeeding or failing.
I can imagine Codd saying the exact inverse: any sufficiently complex data model quickly becomes intractable for developers to assemble ideal indexes and algorithms together each time in response to new queries, which kills productivity and reduces iteration speed. Particularly as the scale and relative distributions of the data changes. The whole idea of declarative 4GLs / SQL is that a query engine with cost-based optimization can eliminate an entire class of such work for developers.
Undoubtedly the reality of widely available SQL systems today has not lived up to that original relational promise in the context of modern expectations for large-scale reactive applications - maybe (hopefully) that can change - but in the meantime it's good to see Rama here with a fresh take on what can be achieved with a modern 3GL approach.
In the experience of my department, event sourcing has brought complexity, not taken it away. Theoretically, when done right, by skilled and disciplined teams and to store the state of one system to build projections off of and not to publish state changes to other systems, I think it might work. But in most cases it's overkill. A big investment with an uncertain payoff.
Interesting - what if one creates multiple databases, but on the same instance (or even process group)? Can these conflicting concerns be resolved somehow? Is there a super-proxy which would only log transaction phases and otherwise offload jobs to database servers (and SQLite wrappers)?
Sometimes you just need a DBA to help you optimise your DB for your workload and future growth. In general, developers make shitty DBAs and vice versa.
As a DBA who has managed enough monolithic databases at mid and large sized organizations, there are enough safeguards these days - backups, read replicas, replication - to avoid the scenario you described without resorting to unnecessary distributed databases.
Every DBA I've worked with has had performance tuning as one of their top skills, both at the installation-level and query-level. Sometimes it's optimizing a query using some hard-earned knowledge of the RDBMS, and sometimes it's keeping logs and data on separate storage.
> Putting everything into exactly one database is a superpower.
Amen. Previous co. had a cargo-cult microservices setup including a separate DB for each app. This made things unnecessarily complicated and expensive for no business benefit (definite resume-padding benefit). Lesson: Don't complicate things until you're forced to.
Hard disagree. You don't need different physical databases (though it may become necessary at some point), but you should absolutely value data isolation when you're developing smaller services. Oh, you need to rename a column and 6 services read and write this table? You took a 1x solution and made it 5x more difficult to make this change and coordinate, etc. If instead 5 services asked the 1 holder-of-truth service about the data, all of 1 service needs to change when you update the database schema. You can then pick and choose when each dependent service is moved over to the new schema's representation without blocking all services from a simple upgrade path.
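A tiny sketch of that "holder of truth" shape (names invented for illustration) - callers depend on the owning service's interface, so a column rename stays an internal detail of one codebase:

    import java.sql.*;
    import javax.sql.DataSource;

    // Sketch: the users table is owned by one service; everyone else calls this interface.
    // Renaming the underlying column only touches UserDirectoryService, not six callers.
    public interface UserDirectory {
      String emailFor(long userId);
    }

    class UserDirectoryService implements UserDirectory {
      private final DataSource db;

      UserDirectoryService(DataSource db) { this.db = db; }

      @Override
      public String emailFor(long userId) {
        // Only this class changes if "email" ever becomes "primary_email".
        try (Connection conn = db.getConnection();
             PreparedStatement ps = conn.prepareStatement("SELECT email FROM users WHERE id = ?")) {
          ps.setLong(1, userId);
          try (ResultSet rs = ps.executeQuery()) {
            return rs.next() ? rs.getString(1) : null;
          }
        } catch (SQLException e) {
          throw new RuntimeException(e);
        }
      }
    }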
If on the other hand, you're fine yolo bringing down your entire tech stack for potentially hours force upgrading all services at once, then you probably shouldn't be using microservices to begin with.
(And yes there are strategies to mitigate needing a big bang for renames as well as long as writers are all consistent, I just wanted to describe why everyone touching the same data stores leads to terrible outcomes like people being unable to make changes because "it's hard", which nobody says out loud but it's definitely visible in the tech stack).
The thing with the microservice + db separation problem is that by the time you're forced to do it, the cost to migrate and split data out of a single source into many (potentially all at once) is extremely great. Last week at work, a trio of services which rely on the same physical db had a problem where they were all locked out because of a problem with the logic in one of them. This caused an outage that lasted 5 hours and resulted in a (possible) loss of several months' worth of their dev team's funding. At a certain scale you try to proactively lessen the impact of possible events like these because the math works out.
If you’re at that scale, then great, do that. But I also fully agree that cargo-culting that kind of decision out of the gate can be a massive waste of money.
It's a dream, a truly beautiful one, but it never works (does it? has anyone pulled it off?). Taking it to an extreme, the entire world needs one integrated database and user authorizations.
Obviously data structures vary, performance is required, and it becomes a bottleneck because it's so critical that mortals can't touch it and every change must be ultra-safe. And then there's security: what if someone finds a way from the development bug section to the HR or executive-only financials sections? Generally, anyone who has tried to implement universal ERM systems knows how difficult and painful integrated systems can be.
But those are the extremes. I'd be interested to know how far people have gotten, and how they did it, pursuing this ideal? I've never seen a business run on one. How about a personal knowledge-management system? Does everything fit? Do you still use spreadsheets for something 'quick', text files for free-form, etc.?
I once toyed with the idea of viewing all the available compute and storage in the world as one big decentralized computer, with multi-tenancy and auth baked into the OS.
> If all business happens in one transactional system, your semantics are dramatically simplified.
100% agreed. One of the biggest issues SQL databases have faced is that the scope & scale of "one transactional system" has evolved a lot more quickly than any off-the-shelf database architecture has been able to keep up with, resulting in an explosion of workaround technologies and approaches that really shouldn't need to exist.
We're now firmly in the cloud database era and can look at things like Spanner to judge how far away Postgres remains from satisfying modern availability & scaling expectations. But it will be great to see OSS close the gap (hopefully soon!).
"YugabyteDB Anywhere" is open core though right? Half the value & complexity is in the orchestration stack. It's definitely a step forward, but such licensing is likely still too restrictive to ever supplant Postgres itself (it terms of breadth of adoption).
> "YugabyteDB Anywhere" is open core though right?
No, it's all open source. YugabyteDB Anywhere is just automation & 24/7 core engineering support. All db features are open source. You can re-create Anywhere without needing to touch c/c++.
Note: All features have always been open source (besides incremental backup which will be soon).
Agreed. Putting data for all your use cases in a single database is a huge deal. Having a different database for each part of the application adds so much bloat so easily from a cost, complexity and skill set standpoint. It also makes it extremely hard to debug issues when you have to chart the path of data through 5 different tools.
There is the challenge of workload separation and scaling each component separately but that can be resolved by pulling out challenging workloads into their own "database" albeit on the same stack.
I work in ERP space, which is the ultimate uncool from typical HN perspective; but sometimes "uncool" has a Venn diagram overlap with "mature".
A modern top tier RDBMS is an incredibly powerful, fast, mature, stable tool - when wielded wisely by the knowledgeable wizard. To be fair, the developers must have a solid understanding of programming language, and business domain, and SQL / data dictionary / ERD - which I've always taken for granted but eventually realized is not the norm.
I also work in this space, and kind of see the value of this idea.
Having a log store makes sense, and having a super-VIEW replacement that can re-compute the data (so I can denormalize or build secondary tables/indexes that could or could not be stored) is a common pattern.
That is fine.
What is NOT fine, and that normal DBS avoid, is to add "eventual consistency".
I NEED to get exact counts, sums, and averages, RIGHT NOW, not a second later.
The more "eventual" the data become the worse things are, and that is the major way to add bugs and problems.
One of the reasons FKs are important is that they eliminate the chance of getting the relationships wrong, but also that you get them right NOW.
Practically? I've never had someone throw a problem at me I couldn't model in SQL. Not saying I can guarantee performance, but in terms of abstract modeling I've never encountered something that can't be done "clean".
I'd like to use the analogy of painting. What can't you paint on that canvas that isn't trivially-hindered by the available media? Can you actually describe that space of things?
In my estimation, premature optimization is why most people can't work in these terms. You wind up with a tuple like WorldObjects and immediately go off the deep end about a billion row scale before you are done naming things.
At least, if it can be modelled by some type of finite database stored on disk or in memory, then you can store the entire contents of that database as one binary blob.
Now you just need one tuple ("our_system", "the_db", the_binary_blob)
No. Many real domains like natural language are too complex and fuzzy to accurately model with tuples.
To be truly universal, your model needs to be computationally stronger - it needs to be Turing complete. For example, fractals could never be modeled by any number of tuples, but can be perfectly represented by very short computer programs.
I think there is some "Turing-completeness" for this category as well, that is quite trivial to achieve, though the practical usability may be negligible.
One hacky equivalence I can come up with is that you have a tuple with a single string value, and encode something in that string.
I've heard so many people claim this, but I've never seen this put into practice in production, and at scale.
I don't even think that that proves that this cannot be done... And I've seen A LOT of people try. But, things invariably start to break down, and at this point I just think that the statement above is more wishy washy than anything, though I want to believe...
Seems like a bunch of buzzwords and such. I've been working with databases for years for one of the largest companies in the world and no one has ever said "topology" before.
Any time I would save with this is wasted on learning java and this framework.
Our production-ready, Twitter-scale Mastodon implementation in 100x less code than Twitter wrote to build the equivalent feature-set (just the consumer product) begs to differ that it's "a bunch of buzzwords". https://github.com/redplanetlabs/twitter-scale-mastodon
Meh, it's better than nothing, but real-world traffic is often very different from simulated traffic. If it isn't actually working with real traffic then I am not impressed.
It's a backend development platform that can handle all the data ingestion, processing, indexing, and querying needs of an application, at any scale. Rather than construct your backend using a hodgepodge of databases, processing systems, queues, and schedulers, you can do everything within Rama, a single platform.
Rama runs as a cluster, and any number of applications (called "modules") are deployed onto that cluster. Deep and detailed telemetry is also built-in.
The programming model of Rama is event sourcing plus materialized views. When building a Rama application, you materialize as many indexes as you need in whatever shapes you need (different combinations of durable data structures). Indexes are materialized using a distributed dataflow API.
Since Rama is so different than anything that's existed before, that's about as good of a high-level explanation as I can do. The best resource for learning the basics is rama-demo-gallery, which contains short, end-to-end, thoroughly commented examples of applying Rama towards very different use cases (all completely scalable and fault-tolerant): https://github.com/redplanetlabs/rama-demo-gallery
What do you mean by "platform"? Is this open source? Can I run everything locally?
Is this basically an RBDMS and Kafka in one? Can I use SQL?
I understand the handwaving around programming semantics, but I'd like clearer explanations of what it actually is and how it works. Is this a big old Java app? Do you have ACID transactions? How do you handle fault tolerance?
It may be early, but I believe folks will be curious about benchmarks. And maybe, someday, Jepsen testing.
Can you please elaborate more on the open source aspect of this?
Will it be an industry revolutionizing, open-source project like containerd (Docker) that every little developer and garage-dev can built upon or will it be benefiting only the big tech corporate world that controls and benefits from power and might which will be able to pay for this?
Especially since you chose to use the name Rama, I am wondering whether this will be for the benefit of all, or only for the benefit of the few who already control more than a fair share of power(finances)?
I like this description. Most on-point one I've seen in the thread and your docs. So it's not really a tool to use, but more of a framework to follow. Wouldn't be the first framework to provide tools / set up processes and workflows with a better-than-ever tradeoff of features/complexity/skill floor/etc.
But yeah, quite a lot of hype and red flags. My favorite from the website: "Rama is programmed entirely with a Java API – no custom languages or DSLs."
And when you look at the example BankTransferModule.java:
> .ifTrue("isSuccess", Block.localTransform("$$funds", Path.key("toUserId").nullToVal(0).term(Ops.PLUS, "*amt")))
Yeah, it's probably fair to call that a DSL, even if it's entirely Java.
Anyway, hope to get the chance to work with event based systems one day and who knows, maybe it will be Rama.
I consider a DSL something that has its own lexer/parser, like SQL. Since Rama's dataflow API is just Java (there's also a Clojure API, btw), you never leave the realm of a general purpose programming language. So you can do higher-order things like generate dataflow code dynamically, factor reusable code into normal Java functions, and so on. And all of this is done without the complexity and risks of generating strings for a separate DSL, like you get when generating SQL (e.g. injection attacks).
Not quite, though it's also not anything which is widely known.
I have worked (a small bit) on very similar (proprietary, non-public, internal) systems ~5 years ago, and when doing so read blog posts about the experience some people had with similar (also proprietary, internal) systems which at that point were multiple years old ...
I guess what is new is that it's something you can "just use" ;=)
Yesn't - to some degree it's the round trip back to the "let's put a ton of application logic into our databases and then you mainly only need the database" times.
Just with a lot of modern technology around scaling, logging etc. which hopefully (I haven't used it yet) eliminates all the (many many) issues this approach had in the past.
By my reading, it's a variant of the "Kappa architecture" (aka "event sourcing").
You have a "Depot", which is an append-only log of events, and then build arbitrary views on top of it, which they call "P-States". The Rama software promises low-latency updates of these views. Applications built on this would query the views, and submit new events/commands to the Depot.
It seems like an event sourcing database. Basically, instead of writing directly to tables, you write a message, and then you can make read-only tables that update based on those messages. People do this today in certain domains, but it is definitely more complicated than traditional databases.
More complicated in what ways specifically? I think the relevant thing is whether building an app with Rama is more or less complicated. Rama may be more complicated than MySQL in implementation, but that doesn't affect me as a developer if it makes my job easier overall.
Discussing levels of complexity quickly gets pretty subjective. It is possible that Rama has found good abstractions that hide a lot of the complexity. It is also possible that taking on more complexity in this area saves you from other sorts of complexity you may encounter elsewhere in your application.
However, there is just more going on in an event sourcing model. Instead of saving data to a location and retrieving it from that location you save data to one location, read it from another location, and you need to implement some sort of linker between the two (or more).
This also comes down to my personal subjective experience. I actually really like event sourcing but I have worked on teams with these systems and I have found that the majority of people find them much harder to reason about than traditional databases.
There can be a lot of integration pain when implementing event sourcing and materialized views by combining individual tools together. However, these are all integrated in Rama, so there's nothing you have to glue together yourself as a developer. For example, using the Clojure API here's how you declare a depot (an event log):

    (declare-depot setup *my-events :random)
That's it, and you can make as many of those as you want. And here's how a topology (a streaming computation that materializes indexes based on depots) subscribes to that depot:
(source> my-events :> *data)
If you want to subscribe to more depots in the topology, then it's just another source> call.
That these are integrated and colocated also means the performance is excellent.
This is what has me excited about Rama. I was very into the idea of event sourcing until I realized how painful it would be to make all the tooling needed.
That's how things typically take off, not on the first attempt. Depends on what's different this time.
(Though NoSQL has outlived its usefulness as a concept IMO, it's just too loose to be useful beyond its early use for "something like CouchDB/Mongo", which this is clearly not)
How exactly is it different from No-SQL?
No schema? Check.
No consistency? Check. (eventually consistent)
Key-value store? Check. (because it's using ZooKeeper under the hood)
Promising amazing results and freeing you from the chains of the SQL? CHECK!
Everything you wrote here is false, with the exception of Rama not using SQL. Rama has strong schemas, is strongly consistent, and is not limited to key/value (PStates can be any data structure combination). Zookeeper is used only for cluster metadata and is not involved with user storage/processing in any way.
I did a year long project to build a flexible engine for materialized views onto 1-10TB live event datasets, and our architecture was roughly converging toward this idea of "ship the code to where the indexes are" before we moved onto a different project
I'm very compelled by Rama, but unfortunately won't adopt it due to JVM for totally irrational reasons (just don't like Java/JVM). Would love to see this architecture ported!
>The solution is to treat these two concepts separately. One subsystem should be used for representing the source of truth, and another should be used for materializing any number of indexed stores off of that source of truth. Once again, this is event sourcing plus materialized views.
At work we decouple the read model from the write model: the write model ("source of truth") consists of traditional relational domain models with invariants/constraints and all (which, I think, is not difficult to reason about for most devs who are already used to ORMs), and almost every command also produces an event which is published to the shared domain event queue(s). The read model(s) are constructed by workers consuming events and building views however they see fit (and they can be rebuilt, too). For example, we have a service which manages users (a "source of truth" service), and another service is just a view service (to show a complex UI) which builds its own read model/index based on the events of the user service (and other services). Without it, we'd have tons of joins or slow cross-service API calls.
Technically we can replay events (in fact, we accidentally once did it due to a bug in our platform code when we started replaying ALL events for the last 3 years) but I don't think we ever really needed it. Sometimes we need to rebuild views due to bugs, but we usually do it programmatically in an ad hoc manner (special scripts, or a SQL migration). I don't know how our architecture is properly called (I never heard anyone call it "event sourcing").
It's just good old MySQL + RabbitMQ and a bit of glue on top (although not super-trivial to do properly, I admit: things like transactional outboxes, at-least-once delivery guarantees, eventual consistency, maintaining correct event processing order, event data batching, DB management, what to do if an event handler crashes, etc.). So I wonder: what are we missing with this setup, and what problems does Rama solve and how (from the list above), given that we already have a battle-tested setup and it's language-agnostic (we have producers/consumers in both PHP and Go), while Rama seems to be more geared towards Java.
Sounds like you've engineered a great way to manage complexity while using an RDBMS. A few things that Rama provides above this:
* Rama's indexing is much more flexible. For example, if you need to have a nested set with 100M elements, that's trivial. An index like that is common for a social graph (user ID -> set of follower IDs). If you need a time-series index split by granularity, that's equally trivial (entity -> granularity -> time bucket -> stat); see the sketch after this list.
* There are no restrictions on data types stored in Rama.
* Rama queries are exceptionally powerful. Real-time, on-demand, distributed queries across any or all of your indexes are trivial.
* Rama has deep and detailed telemetry across all aspects of an application built-in. This doesn't need to be separately built/managed.
* Deployment is also built-in. With your approach, an application update may span multiple systems – e.g. worker code, schema migrations – and this can be a non-trivial engineering task especially if you want zero downtime. Since Rama integrates computation and storage end-to-end, application launches, updates, and scaling are all just one-liners at the terminal.
* Rama is much more scalable.
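To make those index shapes concrete, here's plain Java illustrating the two structures described above (this is just the shape of the data, not Rama's PState API, and the names are made up):

import java.util.*;

public class IndexShapes {
    public static void main(String[] args) {
        // Social graph: user ID -> set of follower IDs (a nested set per user)
        Map<Long, Set<Long>> followers = new HashMap<>();
        followers.computeIfAbsent(1L, k -> new HashSet<>()).add(2L);

        // Time-series: entity -> granularity -> time bucket -> stat
        Map<String, Map<String, SortedMap<Long, Long>>> pageViews = new HashMap<>();
        pageViews.computeIfAbsent("homepage", k -> new HashMap<>())
                 .computeIfAbsent("hourly", k -> new TreeMap<>())
                 .merge(1700000000L, 1L, Long::sum);

        System.out.println(followers);  // {1=[2]}
        System.out.println(pageViews);  // {homepage={hourly={1700000000=1}}}
    }
}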
This is looking at Rama from a feature point of view, and it's harder to express how much of a difference the lack of impedance mismatches makes when coding with Rama. That's something you learn through usage.
Rama is for the JVM, so any JVM language can be used with it. Currently we expose Java and Clojure APIs.
So does the "command" (say, update customer address) perform the SQL and then some RDBMS trigger sends the event onto RabbitMQ, or is it an ORM that sends the SQL and posts to RabbitMQ?
Plus where do you store events, in what format?
Tell me more please :-)
What you are missing is a cool name for the whole ecosystem
It can be anything: just a property change ("AddressChanged"), or something more abstract (i.e. part of business logic), such as "UserBanned" for example.
Internally, the dispatcher serializes the event as JSON (because of custom event-specific payloads) and stores it into a special SQL table in the same local DB of the service as the original model. Both the original model and the event are committed as part of the same unit of work (transaction) -- so the operation is atomic (we guarantee that both the model and the event are always stored together atomically, and in case of a failure everything is rolled back altogether). It's called "transactional outbox". This step is required because just pushing to RabbitMQ directly does not guarantee atomicity of the operation and previously resulted in various nasty data corruption bugs.
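To illustrate the write path, here's a sketch in Java/JDBC purely for illustration (our real code is in PHP/Go, and the table/column names here are made up):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class TransactionalOutboxSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:mysql://localhost/users_service")) {
            conn.setAutoCommit(false);
            try (PreparedStatement updateUser = conn.prepareStatement(
                     "UPDATE users SET address = ? WHERE id = ?");
                 PreparedStatement insertEvent = conn.prepareStatement(
                     "INSERT INTO outbox_events (event_type, payload) VALUES (?, ?)")) {
                // 1. mutate the model
                updateUser.setString(1, "1 New Street");
                updateUser.setLong(2, 42L);
                updateUser.executeUpdate();

                // 2. record the serialized event in the outbox table of the same local DB
                insertEvent.setString(1, "AddressChanged");
                insertEvent.setString(2, "{\"userId\":42,\"address\":\"1 New Street\"}");
                insertEvent.executeUpdate();

                // 3. one commit makes both visible atomically; a separate worker later reads
                //    outbox_events and publishes to RabbitMQ
                conn.commit();
            } catch (Exception e) {
                conn.rollback(); // neither the model change nor the event is persisted
                throw e;
            }
        }
    }
}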
Each service has a worker which finds new committed events in the local DB and publishes them to RabbitMQ (whose exchanges are globally visible to all services). Consumers (other services/apps) subscribe to integration events they are interested in and react to them. There are two types of handlers: some do actual business logic in response to events, others (like view services) just fill their view tables/indexes for faster retrieval from their own local DB, which solves some of the problems listed in the OP.
In our services, the order of processing events is very important (to avoid accumulation of errors), so in case an event handler crashes, it will be retried indefinitely until it succeeds (we can't skip a failed event because that would introduce inconsistency into the data, since later events may expect the data to be in a certain state). When a failed event is "stuck" in the queue (we keep trying to re-process the same event over and over again), that requires on-call engineers to apply hotfixes. Due to the retries, we also have an "at least once" delivery guarantee, so engineers must write handlers in an idempotent way (i.e. they must be safely retriable multiple times).
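As a sketch of what "idempotent" means here (the processed-events bookkeeping below is made up for illustration; in real code it lives in the consumer's local DB, ideally in the same transaction as the side effect):

import java.util.HashSet;
import java.util.Set;

public class IdempotentHandlerSketch {
    // stands in for a processed_events table in the consumer's local DB
    private final Set<String> processedEventIds = new HashSet<>();

    void handle(String eventId, Runnable businessLogic) {
        if (!processedEventIds.add(eventId)) {
            return;              // already applied on a previous delivery -> safe no-op
        }
        businessLogic.run();     // apply the side effect once per event ID
    }

    public static void main(String[] args) {
        IdempotentHandlerSketch handler = new IdempotentHandlerSketch();
        handler.handle("evt-123", () -> System.out.println("ban user"));
        handler.handle("evt-123", () -> System.out.println("ban user")); // redelivery: ignored
    }
}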
There's also an additional layer on top of RabbitMQ to support DB sharding and "fair" event dispatching. Our product is B2B, so we shard by company (each company has its own set of users). Some companies are large (the largest is a popular fast food chain with 100k employees), so they produce a volume of events that can dwarf what smaller companies (50-100 employees) produce. So we have a fair dispatcher which makes sure a large company which produces, say, 100k events doesn't take over the whole queue for itself (it splits events evenly between all accounts). This system also localizes the problem of stuck events to specific company accounts (the whole global queue is not affected).
So all in all, this is how it's done here, on top of MySQL/RabbitMQ and some glue code in Go.
This seems to have real legs - the materialised view is created first, and there are real constraints on how to express an event (i.e. you have to write the SQL for the event, meaning the business event must be expressible in terms of the RDBMS right there) - so 90% of event sourcing problems are faced up to right at the start, and it's "just" a layer on top of the RDBMS.
(What I am trying to say is that I have seen things like Event UserAccountRestartedForMarketingSpecial and there is some vague idea that 9 listeners will co-ordinate atomic transactions on 5 MQ channels and …)
But this forces people to say this event must be expressed in this SQL against this database - if not, then there is a mismatch between the SQL model and the business model, and you probably need to split things up. I suspect there is some concept like "normalisation boundaries" here - single events cannot transcend normalised boundaries.
Eh, materializing data upon mutation can bring you some gains if your product does basically one thing and needs to do it very fast. But as soon as you get complex transactions with things that need to be updated in an atomic write, or you want to add a new feature that needs data organized in a different way, then you're in trouble.
Also, I'm deeply unsatisfied with the "just slap an index on it" that was lightly thrown around in the part about building an application. The index is still global state; it was just moved one step further down the layer.
> it was just moved one step further down the layer
And therefore, crucially, you don't need to manage it yourself anymore. The only thing you need to do is tell the system what to index (which is code, not state/data).
You do manage the state of your database. When you do an UPDATE, you explicitly change the state.
The statefulness of an index, however, is merely an implementation detail. A database without indices would be in the same state, and behave the same (except much slower). This means that the state is not important for the semantics of the datastore, unlike with a regular database.
Even after reading this doc [1], I am not clear on who the target audience is and what you are trying to solve. It would be helpful to take a real-world example and show how easy/efficient it would be to do this via Rama.
The first example is our Twitter-scale Mastodon implementation, which is 100x less code than Twitter wrote to build the equivalent at scale (just the consumer product). It's also more than 40% less code than Mastodon's official implementation (which isn't scalable). https://github.com/redplanetlabs/twitter-scale-mastodon
The rama-demo-gallery repo also contains many short, self-contained, thoroughly commented examples of applying Rama towards very different use cases. These include user profile management, time-series analytics, atomic and fault-tolerant bank transfers, and more. https://github.com/redplanetlabs/rama-demo-gallery
Mastodon is extremely similar to the consumer Twitter product, and "Twitter-scale" is a known number that we tested well beyond. Among other things, we verified our timeline delivery latency was at least as good as Twitter's number and that it scaled linearly.
I don't see how you can claim this is proved by a "twitter scale mastodon client" unless you are actually running a 40m daily user website. Simulating a real environment, and the accompanying code and infra changes, real users, network usage, etc is impossible.
We do go in circles/cycles quite a lot as an industry. I wonder if the trend right now is back towards SQL - too many teams have been burned by Event Sourcing when they just needed a decent SQL DB? Just idle conjecture...
The comments here are needlessly pessimistic and dismissive of a new data flow paradigm. In fact, this looks like the best NoSQL experience there is. SQL, while a standard now, had to prove itself many times over, and was also the result of a massive push by a few big tech backers.
Rama still looks like it needs some starter examples - that is all.
From what I could gather reading the documentation over a few weeks... Rama is an engine supporting stored procedures over NoSQL systems. That point alone is worth a million bucks. I hope it lives up to the promise.
FYI, in case you're unaware the rama-demo-gallery repo has a bunch of short, self-contained, thoroughly commented examples of applying Rama towards different use cases https://github.com/redplanetlabs/rama-demo-gallery
Reminds me a lot of "Turning the Database Inside-Out"[1], but I think Red Planet Labs is overstating their point a little. TtDIO is a lot more careful with its argument, and it doesn't claim to have some sort of silver bullet to sell me.
I haven't read through all of the documentation and while I actually love Java, I'm surprised that there isn't some kind of declarative language (DDL but more than just the "data" in Data Description Language) even if that means relying on non-standard SQL objects/conventions.
CREATE OR REPLACE MODULE MY_MOD ...
CREATE OR REPLACE PSTATE MY_MOD.LOCATION_UPDATE (USER_ID NUMBER, LOC...
CREATE PACKAGE MY_PACKAGE USING MY_MOD
DEPLOY OR REDEPLOY MY_PACKAGE TASKS = 64 THREADS=16 ...
Perhaps the same could be said for DML (Data Manipulation Language). I can imagine most DML operations (insert/update/delete/merge) could be used, while event sourcing occurs behind the scenes with the caller being none the wiser. Might there be an expressive way to define the serialization of parts of the DML (columns) down to the underlying PState? After all, if the materialized version of the PStates is based on expressions over the underlying data, then surely the reverse expression would be enough to understand how to mutate said underlying data. Or at least a way for Rama to derive the respective event-sourcing processes and handle them behind the scenes? Serialization/deserialization could also be defined in SQL-like expressions as part of the schema/module.
I say all of this while being acutely aware that there is undoubtedly as many people out there that dislike SQL as there are that dislike Java, or maybe more.
I really like this:
> Every backend that’s ever been built has been an instance of this model, though not formulated explicitly like this. Usually different tools are used for the different components of this model: data, function(data), indexes, and function(indexes).
Every time I tried to use event sourcing I have regretted it, outside of some narrow and focused use cases.
In theory ES is brilliant and offers a lot of great functionality like replaying history to find bugs, going back to any arbitrary point in history, being able to restore just from the event log, diverse and use case tailored projections, scalability, ...
In practice it increases the complexity to the point where it's a pointless chore.
Problems:
* the need for events, aggregates and projections increases the boilerplate tremendously. You end up with lots of types and related code representing the same thing. Adding a single field can lead to a 200+ LOC diff
* a simple thing like having a unique index becomes a complex architectural decision and problem ... do you have an in-memory aggregate? That doesn't scale. Do you use a projection with an external database? Well, how do you keep that change ACID? etc.
* you need to keep support for old event versions forever, and either need code to cast older event versions into newer ones, or have an event log rewrite flow that removes old events before you can remove them from code
* if you have bugs, you can end up needing fixup events / event types that only exist to clean up, and as above, you have to keep those around for a long time
* similarly, bugs in projection code can mess up the target databases and require cumbersome cleanup / rebuilding the whole projection
* regulation like GDPR requires deleting user data, but often you can't / don't want to just delete everything, so you need an anonymizing rewrite flow. It can also become quite hard to figure out where the data actually is
* the majority of use cases will make little to no use of the actual benefits
A lot of the above could be fixed with proper tooling. A powerful ES database that handles event schemas, schema migrations, projections, indexes, etc, maybe with a declarative system that also allows providing custom code where necessary.
That's kind of the point. Model your data. Think about it. Don't (mis)treat your database as a "persistence layer" -- it's not. It's a knowledge base. The "restriction" in the relational model is making you think about knowledge, facts, data, and then structure them in a way that is then more universal and less restrictive for the future.
Relations are very expressive and, done right, far more flexible than the other models named there. That was Codd's entire point:
"Future users of large data banks must be protected from having to know how the data is organized in the machine (the internal representation) ..." and he then goes on to explain how the predicate-logic-based relational data model is a more universal and flexible model that protects users/developers from the static impositions of tree-structured/network-structured models.
All the other stuff in this article is getting stuck in the technical minutiae of how SQL RDBMSs are implemented (author seems obsessed with indexes). But that's somewhat beside the point. A purely relational database that jettisons SQL doesn't have to have the limitations the author is poking at.
It's so frustrating we're still going over this stuff decades later. This was a painful read. People developing databases should already be schooled in this stuff.
If Postgres was already horizontally scalable and supported incrementally maintained recursive CTEs (like Materialize can do) then I could see how Rama would be mostly uninteresting to a seasoned SQL developer, but as it is I think Rama is offering a pretty novel & valuable set of 3GL capabilities for developers who need to build scalable reactive applications as quickly as possible. Capabilities which other SQL databases will struggle to match without also dropping down to 3GL APIs.
> A purely relational database that jettisons SQL doesn't have to have the limitations the author is poking at.
Agreed. Relational databases can take us a lot further yet.
In their chosen example, and many others that involve external sources, it's not really "your" data, though. You have no control over the structure, and you then have the choice of deciding whether you want to throw away what you can't shoehorn into your current schema, or store it and transform it as best you can and be prepared to re-transform it as your system evolves.
Really, they're not avoiding modelling the data. They're modelling it in their "P-stores" whether they want to admit that's just another data model or not. It's clearly not data models they object to, but typical database DDL coupled with a desired to not throw away data that doesn't conform right now. Depending on your application domain, not throwing away data that doesn't conform to your current data model can be good or bad, but it's not an accident that they picked Twitter/Mastodon where it's easy to justify not rejecting non-conforming documents.
I agree with you that this doesn't require ditching relational models, though. For that matter not even SQL. You can even fairly easily build this kind of architecture without e.g. ever leaving Postgres (replicated pair for their "depots"; triggers + postgres foreign data wrappers to materialize views on index servers and/or to do requests across index servers).
Precisely this. Every “solution” I’ve seen approaches RDBMS as an obviously incorrect kludge that must be done away with in the name of DX.
RDBMS hasn’t stuck around this long only because it’s good enough. It’s an incredibly powerful way to model data, for now and the future, but yes, it does require you to think carefully and slow down. These are not things to be avoided.
Yes, the core of the problem with people "getting it" has always been that SQL has been so mediocre, and yet it's the only serious option people have.
And so people go looking for something with a modern syntax, that's properly composable, functional, etc. but because they haven't studied the foundations, they reinvent the wheel badly.
The Codd paper is so readable, and so persuasive. People need to start there, and then argue why their novel approach is better on first principles, not just because it smells better than SQL.
But that it's stuck around and done so well for so long tells you that the core foundation is sound.
At the beginning of the article, the "relational" database is described as a particular architecture ("map of maps, with secondary indexes being additional maps") without recognizing that the other given examples ("key/value", "document" and "column-oriented") are also relational and the differences are only, at worst, performance tradeoffs.
> It’s common to instead use adapter libraries that map a domain representation to a database representation, such as ORMs. However, such an abstraction frequently leaks and causes issues. ...
FWIW, I'm creating a tool (strategy) that is neither an ORM nor an abstraction layer (e.g. JOOQ) nor template-based (e.g. myBatis). Just type-safe adapters for normal SQL statements.
Will be announcing an alpha release "Any Week Now".
If anyone has an idea for how to monetize yet another database client library, I'm all ears. I just need to eat, pay rent, and buy dog kibble.
>If anyone has an idea for how to monetize yet another database client library, I'm all ears. I just need to eat, pay rent, and buy dog kibble
FWIW jOOQ's model worked great from a consumer/end-user point of view. I got to learn the whole library with no real strings attached, and then pay to adapt it/support it on Big Fucking Databases(TM) as projects required it. Their site and documentation is also quite pleasant.
In a way it feels like double-piggybacking: Microsoft, Oracle, et al. spend a lot on mindshare to make sure technical leads choose those respective stacks. Then the army of developers/consultants, required to get the promised ROI out of those stacks, inevitably have to go looking for tools. Be there to sell them the shovels. It helps if they are already familiar with it. (Free OSS adapters / "community" editions, etc., or even just very generous "evaluation" versions with no real effort to police use in a lab environment.)
Not sure how successful jOOQ has been financially, but considering they've been around for many years at this point, I have to imagine it's worked out well enough to pay for the lights and kibble?
The only other "dev tool" company I've enjoyed working with _more_ has been JetBrains. (The recent AI plugin not-withstanding.)
The domain knowledge embodied in JOOQ is kind of daunting. Truly huge.
Of the current SQL client API tools, mine is most like sqlc. Mine's a major improvement, IMHO of course; but the concept and workflow are comparable.
What do you think of paying to use such a tool for builds? E.g. deployed to a CI/CD pipeline.
Free to use for sql-fiddle, personal, desktop, FOSS, etc. But once a project starts to use a tool for "real work", that's proof of value. Right?
Part of me just wants to do shareware. I published shareware in the early 1990s. Since I have the biz acumen of a sea cucumber, I just did the work and let people decide. It was a great gig while it lasted.
I guess today's equivalent would be some kind of Patreon. My reluctance there is the PR/promotion/social media part. I have trouble even imagining what that'd look like.
> However, storing data normalized can increase the work to perform queries by requiring more joins. Oftentimes, that extra work is so much you’re forced to denormalize the database to improve performance.
Databases have materialized views, though; that solves this problem.
I was in favor of doubling the complexity by prefixing the RDB with event logs, for retrospective QA/analysis and prospective client segregation.
Databases now are a snapshot of the data modeling and usage at a particular point in the application lifecycle. We manage to migrate data as it evolves, but you can't go back in time.
Why go back? In our case, our interpretation of events (as we stuffed data into the DB) was hiding the data we actually needed to discover problems with our (bioinformatics and factory) workflow - the difference between expected and actual output that results from e.g., bad batches of reagent or a broken picker tip. We only stored e.g., the expected blend of reagents because that's all we needed for planning. That meant we had no way to recover the actions leading to that blend for purposes of retrospective quality analysis.
So my proposal was to log all actions, derive models (of plate state) as usual for purpose of present applications, but still be able to run data analysis on the log to do QA when results were problematic.
Ha ha! They said, but still :)
Event prefixing might also help in the now/later design trade-off. Typically we design around requirements now, and make some accommodation for later if it's not too costly. Using an event log up front might work for future-proofing. It also permits "incompatible" schema to co-exist for different clients, as legacy applications read the legacy downstream DB, while new ones read the upcoming DB.
For a bio service provider, old clients validate a given version, and they don't want the new model or software, while new clients want the new stuff you're building for them. You end up maintaining different DB models and infrastructure -- yuck! But with event sourcing, you can at least isolate the deltas, so e.g., HIPAA controls and auditing live in the event layer, and don't apply to the in-memory bits.
TBH, a pitch like Rama's would play better in concert with existing DB's, to incrementally migrate the workflows that would benefit from it. Managers are often happy to let IT entrepreneurs experiment if it keeps them happy and away from messing with business-critical functions.
This was a very similar conversation that the team at my last job had. Our data were customer-facing so it was a slightly different problem, but the question of "our application cares about what the present state is, but our analytics cares about trends and point-in-time" is a more universal problem.
We used Debezium and Change Data Capture + Outbox patterns in some cases, and solely outbox patterns in others. (See https://debezium.io/blog/2020/02/10/event-sourcing-vs-cdc/ ) - You then can use the change events to do the analysis. That solves your "log all actions" problem. It's still a lot of work to put together the analysis (a tool like DBT helps) but it is a solved problem with enough labor at that point and doesn't disrupt the actual application.
If I was starting from scratch on an application today that I thought might have analytics needs in the future, I would get this early on even if I didn't need it yet.
Did you roll out your own implementation of the Outbox pattern? It is an important pattern to address potential data inconsistencies, but I have not yet seen a good scalable implementation of it.
My biggest problem with databases is that they are very hard to evolve. They accumulate a history of decisions and end up in a suboptimal state. Legacy is widespread in enterprises. Oracle is still milking $50B+ annually, and the databases are the primary driver of why you need Oracle and why they can upsell you other products after a compliance audit.
Schema changes are hard (e.g. try to normalize/denormalize data), production is the only environment where things go wrong, in-place changes with untested revert options are the default, etc.
Clearly, there's some truth to this perspective. On the other hand, which alternative actually makes this easier? If your schema is non-enforced, and/or your data (and its schema) are in loosely coupled distributed silos, and/or, as in the article's case, indexes are essentially little programs... then broad-scale schema evolution goes from being difficult to often being effectively entirely infeasible.
It's possible that in turn forces users to adapt and split their changes into smaller chunks, but that's pretty speculative and they might simply fail entirely in the attempt.
Which data stores are more easily evolved than a relational DB with a fairly restrictive schema?
I agree with you that schema has tremendous value. "Let's store JSON everywhere and infer it on read" backfires in a big way. Even MongoDB added tooling to enforce a schema.
I just wish there was a better tooling to evolve schema. Blue/green deployments, schema as code (no manual up and down procedures), good backward compatibility story, shadow testing in production, etc.
The database still feels like a monolith and did not get the benefits of the microservice revolution.
I get weird looks when I tell people we ran for 3.5 years on an s3api in front of bucket storage. It scaled to meet our needs and was especially appropriate for our app’s storage profile. And now that the startup doesn’t exist I’m glad that I never wasted time messing with “real” DBs. There’s definitely an industry bias toward using DBs.
How about this: ACID RDBMS in many cases are sugar. That is, they provide very NICE features, but those features can be implemented in other ways. In the cloud world, the sugar may not be worth the costs.
I think the weak case is much stronger than the strong case - that is, you can refactor to remove RDBMS dependencies; but that moves the complexity elsewhere.
A few years ago I tried writing an application (something like Status Hero for internal use) with a non-traditional database. I used Badger, which is just a transactional k/v store, and stored each "row" as protobuf value and an ID number key. (message { int id = 1 }, query by type + ID, store anything with interface { GetId() int }.)
I had additional messages for indexes and per-message-type IDs. (I like auto-incrementing IDs, sue me.) A typical transaction would read indexes, retrieve rows, manipulate them, save the rows, save the indexes, and commit.
The purity in my mind before I wrote the application was impressive; this is all a relational database is doing under the hood (it has some bytes and some schema to tell it what the bytes mean, just like protos). But it was actually a ton of work that distracted me from writing the app. The code to handle all the machinery wasn't particularly large or anything, but the app also wasn't particularly large.
I would basically say, it wasn't worth it. I should have just used Postgres. The one ray of sunshine was how easy it is to ship a copy of the database to S3; the app just backed itself up every hour, which is a better experience than I've had with Postgres (where the cloud provider deletes your backups when you delete the instance... so you have to do your own crazy thing instead).
The article is on-point about managing the lifecycle of data. Database migrations are a stressful part of every deployment. The feature I want is to store a schema ID number in every row and teach the database how to run a v1 query against a v2 piece of data. Then you can migrate the data while the v1 app is running, then update the app to make v2 queries, then delete the v1 compatibility shim.
If you store blobs in a K/V store, you can do this yourself. If you use a relational model, it's harder. You basically take down the app that knows v1 of your schema, upgrade all the rows to v2, and deploy the app that understands v2. The "upgrade all the rows to v2" step results in your app being unavailable.
(The compromise I've seen, and used, which is horrible, is "just let the app fail certain requests while the database is being migrated, and then have a giant mess to clean up when the migration fails". Tests lower the risk of a giant mess, and selective queries result in fewer requests that can't be handled by the being-migrated database, so in general people don't realize what a giant risk they're taking. But it can all go very wrong and you should be horrified when you do this.)
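A sketch of the schema-ID idea in Java (the row format and the v1/v2 shapes below are made up for illustration; this isn't a feature of any particular database): each stored row carries its schema version, and a read-side shim upgrades old rows on the fly, so data can be migrated in the background while the app keeps serving.

import java.nio.charset.StandardCharsets;

public class VersionedReadShim {
    record StoredRow(int schemaVersion, byte[] payload) {}
    record UserV2(String firstName, String lastName) {}

    static UserV2 readUser(StoredRow row) {
        String text = new String(row.payload(), StandardCharsets.UTF_8);
        return switch (row.schemaVersion()) {
            case 1 -> {                 // v1 stored a single "name" field; split it on read
                String[] parts = text.split(" ", 2);
                yield new UserV2(parts[0], parts.length > 1 ? parts[1] : "");
            }
            case 2 -> {                 // v2 stores "first|last"
                String[] parts = text.split("\\|", 2);
                yield new UserV2(parts[0], parts[1]);
            }
            default -> throw new IllegalStateException("unknown schema version " + row.schemaVersion());
        };
    }

    public static void main(String[] args) {
        System.out.println(readUser(new StoredRow(1, "Ada Lovelace".getBytes(StandardCharsets.UTF_8))));
        System.out.println(readUser(new StoredRow(2, "Ada|Lovelace".getBytes(StandardCharsets.UTF_8))));
    }
}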
Is this Rama solution similar to the kind of thing you can get with Kafka with KTables?
If so, I'd be curious how they've made it in any way operationally less complex to manage than a database. It's been a few years since I've run Kafka, but it used to kind of be a beast.
Event sourcing (+ materialized views and indices) != abandon your RDBMS. You can have both. Though you might find that traditional RDBMSes don't optimize well enough in the event sourcing (+ materialized views and indices) model.
The atomic bank transfer is done as part of function(data). The data record contains fromAccountId, toAccountId, and amount. The function applies the transfer if it's valid (fromAccountId has at least that amount of funds), and no-ops otherwise.
This example uses microbatching for the processing, so the latency will be ~200 millis. You don't need to poll and could set this up with a reactive PState query to know when the transaction is done. 200 millis is an acceptable latency for a task that needs to have strong cross-partition atomicity.
Note that depot appends can be done with "acking" so as not to return success until all colocated stream ETLs have finished processing the record. Stream ETLs also take only a handful of millis to complete. This is how you would coordinate a front-end with many tasks like you typically do with databases today (e.g. registering an account, updating a profile, adding something to a shopping cart, making a friend request, etc.).
This example uses microbatching because getting cross-partition transactionality with streaming is quite a bit harder, as streaming has either at-least-once or at-most-once semantics depending on how you configure it. Microbatching always has exactly-once semantics regardless of any failures that happen during processing.
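In plain Java, the core transfer logic described above looks roughly like this (this is just the validation logic, not the Rama dataflow code from the demo; the names are made up):

import java.util.HashMap;
import java.util.Map;

public class TransferLogic {
    record Transfer(String fromAccountId, String toAccountId, long amount) {}

    static void apply(Map<String, Long> balances, Transfer t) {
        long from = balances.getOrDefault(t.fromAccountId(), 0L);
        if (from >= t.amount()) {                            // valid: move the funds
            balances.put(t.fromAccountId(), from - t.amount());
            balances.merge(t.toAccountId(), t.amount(), Long::sum);
        }                                                    // otherwise: no-op, as described
    }

    public static void main(String[] args) {
        Map<String, Long> balances = new HashMap<>(Map.of("alice", 100L));
        apply(balances, new Transfer("alice", "bob", 40L));
        apply(balances, new Transfer("alice", "bob", 500L)); // insufficient funds -> ignored
        System.out.println(balances);                        // alice=60, bob=40 (order may vary)
    }
}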
Reminds me of CouchDB + incremental map-reduce. Except that in CouchDB you can mutate state.
Idk, but doesn't keeping all the history take a lot of space?
This is marketing spiel masquerading as a bad take. Rama may or may not be cool tech, but the idea that they are anywhere close to being able to get rid of structured database systems for complex systems is absolutely laughable to the point that it makes me uninterested in learning more about the tech. Please tone down the hyperbole if you want serious attention.
Rama is a meta-database that contains the generalized set of primitives that can be arranged to represent Datomic in principle. Which is just one of many possible arrangements.
This seems like a classic bait-and-switch post selling a product called Rama.
The approach here seems drastically more complicated; for simple apps, you go for a well known master->slave setup. For complicated apps you scale (shard, cluster, etc).
It's a way of doing stream processing that sacrifices a little bit of update latency for higher throughput and exactly-once processing semantics. Regular streaming has single-digit milli update latency, while microbatching has at least a few hundred millis update latency.
By "exactly-once processing semantics", I mean that regardless of how many failures are on the cluster (e.g. nodes losing power, network partitions), the updates into all partitions of all indexes will be as if there were no failures at all. Pure streaming, on the other hand, has either "at-least once" or "at-most once" semantics.
Did anyone else think this was satire for the first few minutes of reading it?
Calling databases global state and arguing why they shouldn’t be used was ridiculous enough that I wanted to call Poe’s Law here.
But it does look like the author was sincere. Event Sourcing is one of those cool things that seem great in theory but in my experience I’ve never seen it actually help teams produce good software quickly or reliably.
Well, the databases are indeed mutable global state but even if you get rid of them, we would still continue to live in a single (i.e. global), mutable world of physical reality. So you have to bite that bullet somewhere, and DBMSs seem to be suited rather well for that.
"Global mutable state is harmful" - well... yes, that's totally correct. "The better approach [..] is event sourcing plus materialized views." .....errr... that's one approach. we probably shouldn't hitch all our ponies to one post.
"Data models are restrictive" - well, yes, but that's not necessarily a bad thing, it's just "a thing". "If you can specify your indexes in terms of the simpler primitive of data structures, then your datastore can express any data model. Additionally, it can express infinite more by composing data structures in different ways" - perhaps the reader can see where this is a bad idea? by allowing infinite data structures, we now have infinite complexity. great. so rather than 4 restrictive data models, we'll have 10,000.
"There’s a fundamental tension between being a source of truth versus being an indexed store that answers queries quickly. The traditional RDBMS architecture conflates these two concepts into the same datastore." - well, the problem with looking at it this way is, there is no truth. if you give any system enough time to operate, grow and change, eventually the information that was "the truth" eventually receives information back from something that was "indexing" the truth. "truth" is relative. "The solution is to treat these two concepts separately. One subsystem should be used for representing the source of truth, and another should be used for materializing any number of indexed stores off of that source of truth." this will fail eventually when your source of truth isn't as truthy as you'd like it to be.
"The restrictiveness of database schemas forces you to twist your application to fit the database in undesirable ways." - it's a tool. it's not going to do everything you want, exactly as you want. the tradeoff is that it does one thing really specifically and well.
"The a la carte model exists because the software industry has operated without a cohesive model for constructing end-to-end application backends." - but right there you're conceding that there has to be a "backend" and "frontend" to software design. your models are restrictive because your paradigms are. "When you use tooling that is built under a truly cohesive model, the complexities of the a la carte model melt away, the opportunity for abstraction, automation, and reuse skyrockets, and the cost of software development drastically decreases." - but actually it's the opposite: a "cohesive model" just means "really opinionated". a-la-carte is actually a significant improvement over cohesion when it is simple and loosely-coupled. there will always be necessary complexity, but it can be managed easier when individual components maintain their own cohesion, and outside of those components, maintain an extremely simple, easy interface. that is what makes for more composable systems that are easier to think about, not cohesion between all of the components!
"A cohesive model for building application backends" - some really good thoughts in the article, but ultimately "cohesion" between system components is not going to win out over individual components that maintain their cohesion and join via loosely-coupled interfaces. if you don't believe me, look at the whole Internet.
I have been working as a database consultant for a few years. I am, of course, in my bubble, but there are a few things I really don't enjoy reading.
> No single data model can support all use cases. This is a major reason why so many different databases exist with differing data models. So it’s common for companies to use multiple databases in order to handle their varying use cases.
I hate that this is a common way of communicating this nowadays. Relational has been the mother of all data models for decades. In my opinion, you need a good reason to use something different. And this is also not an XOR. In the relational world, you can do K/V tables, store and query documents, and use graph functions for some DBs. And relational has so many safety tools to enforce data quality (e.g. ref. integrity, constraints, transactions, and unique keys). Data quality is always important in the long run.
> Every programmer using relational databases eventually runs into the normalization versus denormalization problem. [...] Oftentimes, that extra work is so much you’re forced to denormalize the database to improve performance.
I was never forced to denormalize something. Almost always, poor SQL queries are a problem. I guess this can be true for web hyperscalers, but these are exceptions.
Completely agree. I remember when Mongo / NoSQL was peak hype cycle, and every new project "needed" to use it despite being a big step down in terms of features, ability and maturity. A few years later I ended up on a system with Mongo as the database, started when Mongo was at peak hype. It was every bit as bad as I expected.
I have never seen a Mongo-based system that didn't run off of a single server - which negates the one place where it actually had an advantage.
I was of the roughly same opinion, especially after getting familiar with relational theory. Despite all the horrors of SQL-the-language, the backing model is close to being universal for data storage and manipulation.
Having said that, we did encounter one good use case recently: storing tree-like documents complete with their change history. Mongo-like DBs are really good for this. While possible in relational DBs, it's super inconvenient.
> I remember when Mongo / NoSQL was peak hype cycle, and every new project "needed" to use it despite being a big step down in terms of features, ability and maturity.
It won for one reason and one reason only - performance over everything else. Any time I'd suggest using MySQL or PG during this period, the reactions were as if I'd suggested storing data in a cucumber.
That's wild to me. Payment data is like, the most relational. You need to learn 20 different data models and all their many interconnected relationships just to implement a custom API integration. Maybe that's why it sometimes feels like pulling teeth to get the data you want out of a particular Stripe flow.
I used a document db for our side of the billing integration and I really, really wish I hadn't.
I am trying to be constructive here: "Relational" in RDBMS does not refer to what you are implying that it refers to. It's a common misconception honestly.
Anyway many interconnected relationships become much simpler to model when you have the richness and flexibility of the document data model.
My mental model is to think of a relational database as an amazing Swiss Army Knife. It can do anything I might need to do, but it's not awesome at any one thing. I can cut a branch with a Swiss Army Knife, but it's much harder to open a can with a chainsaw. Unless you are 100% certain what problem you are solving, take the Swiss Army Knife. Spinning up multiple specialized tools, or trying to make them do things they weren't designed for, will just cost you time that you aren't spending on finding product-market fit, and then it won't matter if your solution won't scale to millions of users.
I think the point of "no single data model" is that you will have different call patterns for different parts of your data.
More, your aggregated data is also data. Do not think that you can just do the "bare truth" data model and run aggregate queries on demand whenever you want. Instead, consider having aggregate queries run off of what can be seen as a realized view over base data. And schedule this view so that it only has to rollup new data every $period of time. No need to recalculate aggregates that you knew yesterday.
The distinction is not between relational or not, but rather normalized or not. You can model relational data in a key-value store, but it isn't easy. However, it has the advantage of more predictable performance.
The modern relational database is a near-miracle of human engineering. In the software field, it's probably only surpassed by the operating system, and maybe (or maybe not) the web browser.
I think that there's a tendency to think a technology is "a toy" if it is in a sense too good or elegant.
In the case of databases, I can well imagine Oracle or even Postgres people thinking that SQLite must be a toy because otherwise all the faff they do to set up and admin (and pay obscene amounts for) the databases is actually pointless.
Stands to reason. It is something you want to play with, not a figurine to leave on the shelf to forget about like some of those other database solutions.
SQLite works great as the database engine for most low to medium traffic websites (which is to say, most websites). The amount of web traffic that SQLite can handle depends on how heavily the website uses its database. Generally speaking, any site that gets fewer than 100K hits/day should work fine with SQLite. The 100K hits/day figure is a conservative estimate, not a hard upper bound. SQLite has been demonstrated to work with 10 times that amount of traffic.
A toy that can serve the vast majority of DB use cases. I get it, you can't build a massive-scale project with SQLite, but that doesn't exactly make it a toy DB...
Don't waste your breath on these people, either ones writing this blogspam garbage, reading, or upvoting said garbage. It's fashionable to use terms like "I/O bound" or "denormalisation" to the point where I'm no longer sure the majority of commentators really _think_ these terms, or just _say_ them compulsively, almost mechanically, only to later fit/re-arrange the remaining sentence so as to accommodate them. I/O bound this, I/O bound that. Data access patterns, normalise here, denormalise there, best tool for the job!! No, please, you're not fooling anyone with your "measured" take that is claiming nuance all the while making no actual nuanced statements. When it comes to SQL, I'm not sure they even understand how it's not about performance, or "data access" for us, but rather a matter of sanity! I don't want to implement ETL routines for hours on end any time I'm tasked with data integration. Instead, I would probably just use Postgres foreign data wrappers, write some SQL queries, and be done with it. If I couldn't use temporary tables, or materialised views, I would probably go insane doing any kind of data work. (Not to mention a bazillion other things that make a modern, relational database.) I straight up wouldn't do my job because why, I would have to figure the tools out when I could really be doing analysis.
Oblivious as they are, the key takeaway from interacting with these people is; they're probably not doing real data work and don't really care about data, or analysis, for that matter. And this becomes even more obvious with all the talk about "audit tables" and such nonsense. No, please, no. We know how to manage data that is changing; there's Type 2 dimension, or whatever you want to call it, append-only, select distinct on. Downsample it all you want, compress it all you want. I digress. The best we can do is ignore these people. Let them write their shitty blogposts, let them get upvoted, and then when the time comes, simply make sure not to hire these people.
I say, never interrupt your enemy when he's making a mistake. One man's blogspam is another man's saving grace.
Yes, but performance does matter. It's just that the QL isn't the source of performance problems, so all the NoSQLs are a bit silly, and they tend to grow QLs eventually because of that.
You're looking at it from the database's point of view (makes sense, you consult on them), but there's a lot going on in the developer world. For instance, MongoDB exists and is popular because it doesn't center the database but rather the developer - and in particular the needs of the developer, like burning down a backlog without the database slowing that down.
Other databases focus on and optimize around a certain set of problems which yes, not as many people have as they think, but they aren’t just reserved for Google either.
And then there’s the world of analytics and data science etc where a host of databases that are not SQL become useful tools.
I do agree though that SQL should be a first consideration. But having worked up and down the stack from dev to dev-ops over the last 15 years, I've gone from skeptic to enthusiastic about the choices.
> Maybe graph databases are what some folks are looking for.
Beware. These sometimes sell themselves as great for everything ("Look, if you squint, all your data's a graph! Clearly RDBMS is the wrong fit, use our graph database for all of it [and pay us money]!" — the general tone of marketing for at least one major graph database vendor) but tend to have to make interesting performance trade-offs to achieve the kind of quick graph-operation performance they like to showcase.
Step outside the narrow set of things they're good at (sometimes not even including all the reasonable "graphy" sorts of things one might want to do!) and performance may drop to the Earth's core.
They also may be way behind your typical mature RDBMS at things like constraints and transactions, or support far fewer datatypes, which can lead to some real pain.
(this basic pattern actually seems to hold for most attempted RDBMS "killers"—they get hyped as a replacement, but actually their best fit in a sane stack is as a supplement to RDBMS for specific purposes for very particular workloads, if you really, really need them. See also: the wild "NoSQL"/MongoDB hype of some years back)
This is more an advertisement for a type of database than a statement that they are unnecessary.
From what I can tell in the article it seems their differentiator is Event Sourcing and having arbitrary complex index builders on top of the events. It seems similar to EventStoreDB[1].
I have always been interested by the concept of an event sourcing database with projections and I want to build one eventually so it is interesting to see how they have approached the problem.
Also they mention on their site:
> Rama is programmed entirely with a Java API – no custom languages or DSLs.
It makes sense why they have gone this route if they want a "Turing-complete dataflow API" but this can be a major barrier to adoption. This is a big challenge with implementing these databases in my opinion because you want to allow for any logic to build out your indexes/projections/views but then you are stuck between a new complicated DSL or using a particular language.
True, Rama does have a learning curve. It's easier to explain why its indexing capabilities are so powerful, as data structures vs. data models is fairly relatable to most devs. But I actually consider its dataflow API to be the bigger breakthrough. Our Twitter-scale Mastodon implementation and examples in rama-demo-gallery are demonstrations of this.
Also, Rama has a first-class Clojure API (I should probably update the website).
I get your point about global mutable state, but I do not see how you have done anything more than shuffle the complexity around.
Being dependent on software like the JVM makes me very reluctant to investigate further. The JVM is not involved in anything I use, and I do not think I want it.
A Java API doesn't mean you can only access it from Java. Such APIs can also be accessed from most languages. Obviously bytecode languages like Clojure, Kotlin, Scala, Groovy etc. But also via Truffle languages like JS, Python, Ruby. All without bindings.
You can also do C++ and Rust, but that does require a bit of binding work.
So it's not really limiting you to a single language, or these days, even a single runtime.
Assuming you're willing to run a JVM and possibly specific language implementations. That's fine for a lot of people, I'm sure, but it's not something I'd be willing to do for what is fairly little benefit over "just" whatever solution you're already familiar with to use as the depot and the "P-stores" with workers in between.
Good point. I don't use Red Planet Labs' product myself either, but in a world where it is hard to get people to even consider your product, the article did work.
> This is more an advertisement for a type of database than a statement that they are unnecessary.
I stopped reading the article once it was evident that the author was not making any case regarding databases being unnecessary, and instead was putting up a series of strawmen that sounded like desperate attempts to reject all alternatives to the thing they were selling.
The author's attempt to talk about data structures was particularly baffling, knowing that pretty standard RDBMS such as PostgreSQL already support JSON data types. Are we supposed to forget about that?
> This is more an advertisement for a type of database than a statement that they are unnecessary.
That was my reaction as well. The article's claimed argument against databases is that they are global mutable state, which is supposed to be bad. But none of what the article advocates for gets rid of global mutable state (which of course you can't do in any real world application, because the real world itself is global mutable state and anything that deals with the real world has to deal with that). It's just hawking a particular way of storing global mutable state.
T.b.h. when looking at things, global mutable state might not be bad, if it is principled?
I.e. I did like how Clojure had its atomically updating refs.
And you realistically won't get more principled than with a traditional RDBMS, with its support for transactions, rollbacks, etc.
The only alternative I have seen proposed to using DBs, while not relaxing the guarantees, is to double down on event sourcing and not have mutable state anymore. Everything could just be a fold over all of the incoming events, right? But I don't think I have seen a non-toy example of that anywhere I've worked.
That's not really what people mean by immutable though. What they mean is something halfway between the casual definitions of "immutable" and "persistent," in that your copy of the data at point X will always be valid, even if new data comes in later.
> That's not really what people mean by immutable though.
As far as the term "global mutable state" is concerned, which is what the article claims is bad, the event store is global mutable state. One can think of the individual event records in the store as immutable (they don't change once they're recorded), but the article is talking about the entire store, not an individual record in it.
> your copy of the data at point X will always be valid
If your copy of the event store is out of date (missing some events), it's not valid. A new event that your copy hasn't yet registered might change the results of queries on the store, meaning that your copy will be delivering out of date, hence invalid, query results. You can't handwave around that by gerrymandering the definition of "immutable". The simple fact is, as I said, that the real world is global mutable state, and if your app is dealing with the real world, it has to deal with that.
Append-only log-based data-stores are close enough in my book.
And conceptually, you can then push all of those "what if I am missing events, or they are in the wrong order" concerns into the fold over the data.
Because the real world is mutable data, but the history of the world at any particular point isn't. It can be incomplete, it can be a personal view; I like that it makes explicit things that otherwise wouldn't be. Nice abstraction.
Like, if you are splitting hairs, is even array-append in Haskell immutable? It does create new data somewhere in the memory. But I don't think that makes Okasaki's thesis and book any less valuable :)
> Append-only log-based data-stores are close enough in my book.
Close enough to what? I get that there are advantages to storing global mutable state this way. I just don't see the point of denying that what you are doing is storing global mutable state--every time you append a new log entry to the data store, you mutate the state.
> history of the world at any particular point isn't
Sure, I get that too, and I get that it's nice to have that stored explicitly instead of having to reconstruct it for the particular cases where you need it.
But if someone is running a query on your database, except in some particular cases, they don't want the history of the world at some particular point to dictate the response. They want the most up to date history of the world to dictate the response. And that changes every time you append a new log entry to the data store. Again, I don't see the point of denying that that's what is going on.
> is even array-append in Haskell immutable?
The point of "immutable" in Haskell is that every object is immutable. So array-append in Haskell does not take an existing array object and append an element to it (as the corresponding operation would for, say, a list in Python). It constructs a new array object that is the same as the old object with the new element appended. So the old array object is not mutated.
But the array-append operation of course does change the global state of the program, as you say. So there is no such thing as an immutable global state of a Haskell program. And of course the same would be true for a program in any language.
> I don't think that makes Okasaki's thesis and book any less valuable
Of course it doesn't. But it also doesn't mean the definition of "immutable" that he is using is the only possible or useful one. The article under discussion here is using a different definition, which for the topic it is discussing is more useful. Once more, I don't see the point of trying to twist words around to deny that.
The article calls out event-sourcing and materialized views as a solution.
I've reread the article twice now, and I don't see a conflicting definition that would contradict, e.g., an append-only log being immutable.
Unless your argument is that the author is in denial about the benefits of immutability, because it can't be achieved in this weirdly strict form you seem to propose?
But maybe I need to reread second paragraph of the article one more time?
> The article calls out event-sourcing and materialized views as a solution.
Yes, I know, and it may well be a good solution to various issues with conventional databases. I just don't see how it's a solution to global mutable state, since there still is global mutable state.
> I don't see a conflicting definition
The article never gives an explicit definition of "global mutable state". It also does not explicitly say that its proposed solution gets rid of global mutable state. The section "Global mutable state is harmful" strongly implies that its proposed solution will somehow get rid of global mutable state, but the article never actually says that it does. It just goes into a bunch of other benefits of event sourcing and materialized views, all of which might well be valid benefits, but none of which amounts to getting rid of global mutable state.
> Unless your argument is that the author is in-denial about benefits of immutability
I can't see into the author's mind, so I don't know if the aspects of the article that I described above were inadvertent, because the author simply hasn't really thought through what "global mutable state" actually means, or whether it's on purpose, to imply a claim about the article's proposed solution that sounds good but isn't actually true, as a marketing tactic.
As I said in my earlier response, the article doesn't explicitly define what "global mutable state" is, but it does say that conventional databases are global mutable state the same way that global variables in an ordinary program are.
By that definition, an append-only log as a "single point of truth" data store, which is the article's proposed "solution", is also global mutable state. The article does not acknowledge this, but it seems obvious to me.
Yes, there is. Appending a new log entry to the depot mutates the depot. It doesn't mutate earlier records in the depot, but the depot is the entire record set, not just individual records. The record set is different after the append than before it.
Well, it is an advertisement, but much of what it says is undoubtedly true. It's true that databases are effectively global mutable state in practice, with all the headaches that gives, and that event sourcing and materialized views is a damn appealing solution to it.
However, the jump from that to claiming you actually have it is a bit like the jump from talking about the messiah to saying he's sitting in the next room right now.
We investigated event sourcing and materialized views as the backbone of a (new) big business application five years ago. The problem was that there were so few products back then, and they were so untested (unless my mind is playing tricks on me, we did look at EventStoreDB, as well as Kafka Streams), and despite being open source they were also usually heavily tied to one commercial provider. No one had any real experience with it. I couldn't convince my colleagues to go for this, much less my bosses. Hell, I could barely convince myself.
Philosophically, I believe it's sound, but how much has changed in those years?
Yeah, the Java dependency makes me reject it instantly - the only advantage they could potentially have over "just" using any database with the ability to query remote databases is not having to roll your own, but the barrier to rolling my own is lower for me than introducing a Java dependency.
(A replicated set of databases for the depots; a set of workers creating the "P-stores" in ... another set of databases, and you've replicated the architecture; heck, you can even run this entirely off Postgres with replication + triggers to handle partitioning and materialization with no external workers if you like)
EDIT: Also, their API is fairly unambiguously a DSL in effect, just one without its own separate syntax and parser allowing access from outside the JVM.
I've had enough problems with JVM deployments over the years to just not want the hassle in order to use something which adds marginal value. If you don't, then by all means, use it - I'm not suggesting it's bad for everyone. If it provided some massive major advantage over the large number of other options, I'd consider it too, but I don't see anything that indicates that's the case here.
Do you reject Kafka because it runs on the JVM? This is a ludicrous position to take. You may not want to fund development of projects when you're forced to train devs to work on them, but rejecting technologies based on the programming language / framework is just fundamentally flawed. Don't use Python tools because I hate Python. Don't use Kubernetes because it's Golang. Don't use AWS because it has a bunch of Java services. S3? I'm not entirely sure, but I wouldn't be surprised if it were Java edge-service based as well.
> If it provided some massive major advantage over the large number of other options, I'd consider it
People reject using technologies because of the impact of having to support them all the time when the benefit does not justify needing to support one more technology. I don't have an issue with using JVM based services if I can have someone else manage them for me, but when I have to deal with the operational issues myself, or within my team, then how comfortable we are with managing a specific technology vs. the advantages it brings over competing alternatives absolutely is part of the consideration, and it'd be irresponsible for it not to be.
In this case the benefit seems marginal at best, and I'm not deploying a JVM based service for marginal benefits given how much hassle they've brought me over the years.
Nothing is outright rejected because it runs on the JVM, but it is a detracting quality, adding complexity. If Kafka is clearly the best solution available for your problem, you'll suck it up and use the JVM, but if there is another approach that will also serve your needs without a JVM dependency, then the choice is easy.
I need to use Java exactly zero times when interacting with those; they have dedicated clients for everything, something this stuff is apparently proud of not having.
Pure B...t.
The title is deceiving and should be instead something along the lines of:
How to architect an application at Mastodon scale without relying on databases.
Also, I would be very interested in seeing the actual technology rather than reading sensational claims about the unparalleled level of scalability it supports.
What does it provide in order to recover from failure and exceptions and to guarantee consistency of state?
Relational databases are and will always be necessary as they provide a convenient model for querying, aggregating, joining and reporting data.
Much of the value in a database lies in how well it supports extracting value from business information rather than in what extreme scalability features it offers.
Try to create a decent business report from events and then we can speak again.
Many, many years ago (mid-2000s), I worked for a company (a Flickr competitor) that had customised Gallery (an open-source PHP photo album). The version we forked didn't use a database; instead it kept a file at the root of each folder to track that gallery. So, technically, I told my boss it was infinitely scalable. However, it is really difficult to search and run reports on these database-less Gallery nodes. Databases, along with word processing and spreadsheets, were among the early "killer apps" that demonstrated how useful that type of computing is for humans.
I mean, ActivityPub is pretty much the ideal case for this architecture. ActivityPub/the underlying ActivityStreams consists of activities that manipulate entities. The activities are in principle immutable, even if the entities they may create are not, and you can largely naively serialize them into a log, and partition them naively by actor, or by server and actor, or any number of schemes, to put them into the initial "depot".
From there, materializing indexes is trivial, and replicating that is also fairly trivial. You "just" need a bunch of workers processing the depot logs into these "P-stores"
If your requirements fits that, then recovery from failures "just" requires having replicas of the depots, which is "easy" as you just stream the logs elsewhere, and archive them to, say, a blob store, combined with the ability to reset the "cursors" of the workers populating the P-stores.
It's an architecture that works very well when you want to index and query large sets of data where the order either doesn't matter (much) or you can order "well enough" from the data itself, so you can stream to multiple "depots" without worrying about assigning a global order. Such as ActivityPub. E.g. it doesn't really matter for Mastodon if a reply appears a second before the post it's a reply to, because the key linking them together is there, and you need to be able to gracefully handle the case where you never get the original or aren't even allowed to fetch it.
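Roughly, the naive version of that serialize-and-partition step could look like this (a toy sketch; the Activity shape and partition count are invented for illustration):

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Toy sketch: immutable ActivityPub-style activities appended to
// per-partition logs, partitioned naively by actor.
public class ActivityDepot {
    // An activity is immutable once created; entities it references may not be.
    record Activity(String actor, String type, String objectId, long timestamp) {}

    private final int partitions;
    private final ConcurrentHashMap<Integer, List<Activity>> logs = new ConcurrentHashMap<>();

    ActivityDepot(int partitions) { this.partitions = partitions; }

    int partitionFor(String actor) {
        return Math.floorMod(actor.hashCode(), partitions);
    }

    // Append-only: earlier entries are never rewritten, only new ones added.
    void append(Activity a) {
        logs.computeIfAbsent(partitionFor(a.actor()), p -> new CopyOnWriteArrayList<>())
            .add(a);
    }

    List<Activity> partition(int p) {
        return logs.getOrDefault(p, List.of());
    }

    public static void main(String[] args) {
        ActivityDepot depot = new ActivityDepot(4);
        depot.append(new Activity("alice@example.social", "Create", "note/1", 1700000000L));
        System.out.println(depot.partition(depot.partitionFor("alice@example.social")));
    }
}
```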
So I don't doubt they can achieve this, and a platform that makes that easy would be nice.
Their problem is that 1) very few people have data problems large enough that it isn't easier to "just" buy/rent a larger database for the depots, 2) even fewer have a dataset large enough that they'd need a secondary set of servers to handle the materialization of views (or need the views to actually be materialized in the first place), and 3) even if they do, running workers to populate a secondary set isn't hard. It's also trivial to use tools your devs are already familiar with for the depots and "P-stores". E.g. Postgres works just fine for both until you have an architecture far larger than most people ever need to deal with, but there are many other options too.
Once you start running into challenges with that, the obvious, well-known method is to stream into append-only logs, cut the logs regularly, run the indexers on the new log elements into indexed segments, and zipper merge the log elements and do streaming compression of the values (e.g. arithmetic coding is a common, trivial method) + a skip list. Since the point of these "P-stores" is to reduce the datasets to something where queries are low-complexity, you can generally assume you won't need to support much fancy re-ordering and grouping etc., mostly cheap filtering, coupled with streaming merges of query results from multiple P-store partitions. This is the "do it yourself" solution, which you can find "packaged up" in any number of search solutions, like Elasticsearch, Lucene, Sphinx etc., but it's also not all that hard to build from scratch for a specific use case.
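To give a feel for the shape of it, here's a very stripped-down sketch of the cut-and-index part (no compression, skip lists, or streaming merges; the document/term model is invented purely for illustration):

```java
import java.util.*;

// Stripped-down sketch: periodically "cut" an append-only log into a segment,
// build a small inverted index over that segment, and answer queries by
// merging postings across segments.
public class SegmentedIndex {
    record Doc(long id, String text) {}

    static Map<String, List<Long>> indexSegment(List<Doc> segment) {
        Map<String, List<Long>> index = new HashMap<>();
        for (Doc d : segment) {
            for (String term : d.text().toLowerCase().split("\\s+")) {
                index.computeIfAbsent(term, t -> new ArrayList<>()).add(d.id());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<Map<String, List<Long>>> segments = new ArrayList<>();
        // Two "cuts" of the log, indexed independently.
        segments.add(indexSegment(List.of(new Doc(1, "event sourcing basics"),
                                          new Doc(2, "append only log"))));
        segments.add(indexSegment(List.of(new Doc(3, "log structured storage"))));

        // Query: merge postings for "log" across all segments.
        List<Long> hits = new ArrayList<>();
        for (var seg : segments) hits.addAll(seg.getOrDefault("log", List.of()));
        System.out.println(hits); // [2, 3]
    }
}
```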
This method has been used for e.g. search engines "forever". The first time any system I worked on actually needed it was around 2006, and we first used Sphinx, then built a custom one (a single-developer effort, needed only because Sphinx at the time had only a fraction of the features it has today).
Basically, they're building a distributed database that just has very opinionated ideas about the type of problems you should be solving and how. If your problem fits in that it's very possible they can provide something "turn key" that is more cohesive and better packaged up than picking and choosing your own components for the depot and indexes and writing your own indexing workers, but as you say, to most people the extreme scalability isn't the issue and a regular database is enough.
Addressing the Twitter and Mastodon cases is kind of a warning here, because both Twitter and Mastodon (the software) started out with a tremendously naive architecture in this respect, so it's a low hanging fruit if you want to show off impressive-looking improvements.
In terms of reporting, this architecture isn't a problem, because you'd not be working from events, but from the "P-stores" and nothing stops you from ensuring that data is trivially queryable with something reasonable. E.g. either using something like Postgres for the P-stores themselves, or just streaming the data into whatever you want to do reporting from.
But again, it doesn't need to replace the database.
I didn't notice a mention of transactions in the article, nor of constraints. It's all fine to claim that you can compose arbitrary event-sourced domains together and query them, but IMHO the biggest strengths of an RDBMS are transactions and constraints for data integrity. Maybe Rama comes with amazing composability features that ensure cross-domain constraints, but I would be really surprised if they can maintain globally consistent real-time transactions.
I've worked on huge ETL pipelines with materialized views (Photon, Ubiq, Mesa) and the business logic in Ubiq to materialize the view updates for Mesa was immense. None of it was transactional; everything was for aggregate statistics and so it worked well. Ads-DB and Payments used Spanner for very good reasons.
Constraints are good, but personally, I consider transactions harmful.
Most of the need for transactions comes from either:
- The inability to do things atomically (e.g. computation bouncing back-and-forth between the backend and the database). I should be able to ship off a query which does everything I need atomically.
- The inability to merge states (e.g. you send a changeset, I send a changeset, and the two combine).
Transactions lead to transaction /failures/, which need to be dealt with properly. That introduces a very high cost to transactions, and is very rarely handled correctly. By the time you deal with failures correctly (which is hard), models without transactions become simpler.
Payment/accounting systems are the typical example of where transactions are critical. The sending and receiving accounts need to be updated in the same transaction that correctly fails if either account would become negative as a result of the operation.
I disagree. Transactions are the wrong way to handle this. The right way to do this is to send a query which says:
"If sending account has more than $10: Decrease sending account by $10. Increase receiving account by $10. Return what happened."
The database should be able to do the above (or any other query) atomically. Under normal circumstances (e.g. barring a network failure or similar), this should never fail. The reason a transaction is necessary is because the above would typically be handled by 2-3 separate queries:
- Start transaction
- Check sending balance
- [Client-side code]
- Increase one account
- [Client-side code]
- Decrease the other
- Finish transaction
Having this back-and-forth can cause this to fail, and network latency means that it often DOES fail.
A database could implement this internally with transactions and retries (in which case, the application programmer doesn't need to think about it and therefore have bugs there), but that's probably the wrong way to do it. There are many more reasonable ways, such as locks in rows + ordering operations, rearranging data, or otherwise.
From the perspective of the programmer, this SHOULD be atomic. There is no reason why everyone writing a financial system needs to be competent at managing these sorts of parallel consistency issues. The current model introduces potential for a lot of consistency bugs, performance issues, and other problems. Transactions also introduce a lot of unnecessary complexity to the store itself.
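To make that concrete, here is roughly what shipping the whole transfer as one atomic request could look like against Postgres (a sketch only, not Rama; the accounts(id, balance) table and connection details are assumed, and the Postgres JDBC driver must be on the classpath; the single statement runs atomically, so the client never orchestrates intermediate steps):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class AtomicTransfer {
    // Debit and credit in ONE statement; the WHERE clause enforces the
    // "sufficient funds" rule, so the client has nothing to retry beyond
    // resubmitting the whole request.
    static final String TRANSFER_SQL = """
        WITH debit AS (
            UPDATE accounts SET balance = balance - ?
            WHERE id = ? AND balance >= ?
            RETURNING id
        )
        UPDATE accounts SET balance = balance + ?
        WHERE id = ? AND EXISTS (SELECT 1 FROM debit)
        """;

    static boolean transfer(Connection conn, long from, long to, long amount) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(TRANSFER_SQL)) {
            ps.setLong(1, amount);
            ps.setLong(2, from);
            ps.setLong(3, amount);
            ps.setLong(4, amount);
            ps.setLong(5, to);
            return ps.executeUpdate() == 1; // one row credited => transfer happened
        }
    }

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/bank", "app", "secret")) {
            System.out.println(transfer(conn, 1, 2, 10) ? "transferred" : "insufficient funds");
        }
    }
}
```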
Their example doesn't require transactions at all, which presumably is one of the reasons why they picked it. It's certainly an example that is ideally suited to showcase this because it's trivial - most references between entities are "weak" in that you can't really be sure what will arrive or whether you'll be able to re-fetch it from its origin.
Put another way: The totally naive "homegrown" re-implementation of their Mastodon architecture would:
* 1) Log somewhere, to whatever.
* 2) Replicate those logs
* 3) Run "something" that auto-creates or migrates tables for any "P-stores" that have changed or are not present and streams updates into them (a toy sketch follows this list).
* 4) A query frontend that queries multiple P-store partitions with fairly restricted query syntax (because, if you need anything complex, in this architecture you should create a new "P-store") and merges the result.
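For instance, step 3 can be as small as this toy materializer (the follower-count example and names are invented; a real worker would persist its cursor somewhere durable, batch writes, and handle schema changes):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy version of step 3: a worker consumes log entries past its cursor and
// maintains a derived "P-store" (here, follower counts per user).
public class FollowerCountMaterializer {
    record FollowEvent(long offset, long followerId, long followeeId) {}

    private final Map<Long, Long> followerCounts = new HashMap<>(); // the "P-store"
    private long cursor = -1; // highest offset already materialized

    void consume(List<FollowEvent> newEvents) {
        for (FollowEvent e : newEvents) {
            if (e.offset() <= cursor) continue;    // replays are idempotent
            followerCounts.merge(e.followeeId(), 1L, Long::sum);
            cursor = e.offset();
        }
    }

    public static void main(String[] args) {
        var worker = new FollowerCountMaterializer();
        worker.consume(List.of(new FollowEvent(0, 10, 42), new FollowEvent(1, 11, 42)));
        System.out.println(worker.followerCounts); // {42=2}
    }
}
```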
Their use of Mastodon and Twitter as an example is very much because both the original Twitter architecture and the current Mastodon (the software, not the aggregate network) architecture punted on the above. Which is fine - you can migrate towards an architecture like that at any time by simply breaking out a "P-store" and letting a worker populate it from your canonical data store.
Their "problem" is that you need a large scale before this architecture needs something more than 1 database plus a few triggers and user-defined functions, and an even more massive scale before it needs something more than a few databases with replication and a few triggers and user-defined functions and foreign data wrappers, and most systems never get to that scale. And even at that scale there are then plenty of solutions both on the "depot" side and the "p-store" side for scaling further already and has been pretty much "forever".
So they badly need this to be pleasant enough for systems below the scale where people need something like this that people will pick it despite not needing the scale, or there'll be few incentives to use it.
Rama supports very strong transactions actually. Besides being able to do arbitrary atomic updates across multiple indexes on a single partition, you can also do distributed transactions. Our bank transfer example in rama-demo-gallery demonstrates this. https://github.com/redplanetlabs/rama-demo-gallery
If I understand that example correctly, it takes careful up-front design of event sources to achieve the desired constraints. E.g. if, for some reason, transactions between users needed to now be initiated from the receiving user (toUser asks for $amt from fromUser), the microbatching would need to be paused and switched to a new microbatch that feeds from the new event source. Does Rama natively support sequencing of these kinds of schema changes if, e.g., you had to replay all the events to restore the PStates from scratch? Traditionally, an event id or timestamp would be used as a demarcation of different aggregation semantics before and after a sentinel value, and I suppose each microbatch could check against the sentinel value before proceeding.
In short, though, I struggle to see lower complexity for transactional workloads in Rama vs. a traditional RDBMS.
Do you plan to provide a higher-level abstraction like SQL that internally compiles to correct Rama implementations of PState generation?
I think combining ETL and OLTP in a single system has a lot of merit, as several RDBMS systems have been adding materialized views to support ETL workflows for that reason. I'm not convinced that low-level index generation is the right solution. I could be convinced that SQL-style high-level schema declarations, parsed into malleable low-level operations that can be modified as necessary, with solid support for the semantics of point-in-time schema changes, could win over a traditional RDBMS. For most use cases the high-level definitions would be enough, but being able to reach under the hood in a reliable and ops-friendly way would avoid the need for an explosion of custom and mutually vendor-incompatible RDBMS extensions to SQL.
The transactionality comes from the fact that the depot append is of the entire transaction. So it atomically captures both the debit and credit that happen as a materialization of that record into PStates by the ETL.
If you need a different kind of transaction, then that can be done with a different depot or a different event type within the same depot. And the ETL code would be updated to handle that.
We won't be providing a SQL -> dataflow compiler for Rama ETLs anytime soon. SQL is way more limited than a general purpose programming language, even with extensions.
Sorry, but can you elaborate on what type of transaction isolation-levels your solution supports (i.e. read committed, repeatable read, serializable, read uncommitted)?
Those concepts are relevant to databases like RDBMS's where you have multiple writes happening concurrently to the same index. Rama doesn't work like that – you get parallelism by having multiple partitions, writes on a single partition are strictly ordered, and Rama gets performance by batching the flushing of writes.
For a foreign query in Rama, you only ever get back committed data. So that would be analogous to "read committed". For ETL code, an event has atomic access to all indexes on its task thread. Any data written to those indexes before the event would have "repeatable read" consistency. Any writes done by the event before it finishes have read-after-write consistency, which would be analogous to "read uncommitted". This is correct during ETL execution because no acknowledgement of completion has been given back to clients yet, and uncommitted writes are never visible to client queries.
Thank you for the clarification. I'm still a bit confused about how this replaces databases. Your write-up claims that this solution is a replacement for modern DB systems, but I've seen a need for all 4 transaction isolation levels among various systems that I've implemented for different clients. In essence, your solution provides a 'one size fits all' scenario with different isolation levels for different components, which I don't see replacing databases... I'd say your solution is more a great and extremely efficient implementation of event sourcing. Either way - good luck with your company and thanks once again for the clarification :)
You need all four in the context of databases and the architectures in which they live, where computation is separate from storage and concurrent updates are possible. In the context of Rama it's different, where computation is colocated with storage and you have tight control over the ordering of events.
It's probably easier to understand if you share a specific example you've dealt with in the past that needed a particular isolation level. Then I can explain how that would work with Rama.
I feel like I went into this from a position of genuine interest, I'm always on the lookout for significant developments in backend architecture.
But when I hit the sentence "This can completely correct any sort of human error," I actually laughed out loud. Either the author is overconfident or they have had surprisingly little exposure to humans. More concretely, it seems to completely disregard the possibility of invalid/improper/inconsistent events being introduced by the writer... the way that things go wrong. And I don't see any justification for disregarding this possibility; it's just sort of waved away. That means waving away most of the actual complexity I see in this design: having to construct your PState data models from your actual, problematic event history. Anyone who's worked with ETLs over a large volume of data will have spent many hours on this fight.
I think the concept is interesting, but the revolutionary zeal of this introduction seems unjustified. It's so confident in the superiority of Rama that I have a hard time believing any of the claims. I would like to see a much more balanced compare/contrast of Rama to a more conventional approach, and particularly I would like to see that for a much more complex application than a Twitter clone, which is probably just about the best possible case for demonstrating this architecture.
Indeed. The single most expensive human error I ever made in my life, I made in a system that was architected in exactly this way that the author claims eliminates the possibility of such errors. Yes, it is true that it was very easy to fix the problem and restore the system to working order. But, by the time we detected and resolved the defect, the system had already been operating out of spec for some time and the damage had already been done.
This article seems to be ignorant of its own context. If ensuring that the database is internally consistent were really all it takes, then we all would have stopped worrying about it back in 1983, when Härder and Reuter originally formalized ACID.
Everything about databases is terrible and there is no problem any of their restrictions solve. Everything about Rama’s model is revolutionary, dramatic, very this, extremely that.
The hyperbolic nature of every claim combined with the very thin evidence makes it hard to take any of it seriously.
I was curious too, and I'm sure Rama is amazing for some use-case, but ultimately I think suffers from being more of a puff piece to market the tech.
I don't want to be so dismissive right off the bat, though. The author of this post has bet heavily on Rama being the solution to building stateful services, but it shares the same pitfall: it's not one-size-fits-all.
It takes an expert to design a domain; something which is rarely invested in up-front, because in the startup world you might not actually know the domain up front, not totally. You'll inevitably end up with something that needs work in the future, whether you're managing an SQL schema or mapping out domain events, and whether you push data from the producer or fetch it from the consumer. In that sense, event sourcing is no better or worse than any other option, and state has to live somewhere. It's always going to be global in some sense; otherwise it's like a Haskell program with no side effects: you've got nothing.
Microsoft Access is/was a pretty nifty tool for building basic applications backed by an RDBMS. You focussed on designing your schema and then it was trivial to build a GUI on top of it in what we would now consider to be 'low code' or 'no code' fashion. We were taught how to do it in secondary school at the age of 11. The storage layer wasn't the problem. You learn 1NF, 2NF, 3NF and all that and then build a table that has 1:1, 1:many and many:many relationships, and then watch Access build a GUI for you that is ugly but basically just works. You could say they're actually skeuomorphic in that these databases, and spreadsheets, map quite closely to how you would file paper documents, index them, and cross-reference them, with the human being the querying interface.
In that sense I think the author has got it all the wrong way round. The complexity is all on the client these days, particularly when it comes to the web. Frontend development monopolises your time: it's 2023 and you still have to write HTML, React, CSS, JSX, what have you from scratch to construct a fancy layer of UI over what is fundamentally a glorified data-entry tool. SPAs are still enormously complex and brittle, and thousands of developers are out there building the exact same stuff for their UI, just ever so slightly differently each time - look at Reddit, whose mobile web UI presents you with a surprise bug every time you load the page.
Why still? MVC for a native GUI remains tried and true. The browser offers you that kind of tooling, and WebComponents certainly seem like a step in the right direction, but so much time is spent smashing together collections of React components and styled components and whatever else the million dependencies your project contains offer you.
And the concept of actually automating this through AI hilariously seems to be novel.
I work in a shop with about 6 years of event-sourcing experience (as in, our production has run on eventsourcing since 2017).
My view is that 'humans are not mature enough for eventsourcing'. For eventsourcing to work sanely, it must be used responsibly. The reality is that people make mistakes, and eventsourcing HURTS whenever your developers don't act maturely on the common history you have built.
For us, it has meant a bungee jump of "move ALL the things to eventsourcing", followed by a long slow painful 'move everything that doesn't NEED eventsourcing out of eventsourcing again, into relational database, and only keep the relevant eventsourcing parts in the actual eventsource db'.
The main consequence for us has been consuming a huge/expensive amount of resources to do what we already did earlier with vastly fewer resources, with the benefit of having some things easier to do, while a lot of other things suddenly became complex.
In particular, it was not a 'costless abstraction'; instead, it forced us to always consider the consequences for our eventsourcing.
I know of teams that spent over a year trying to build a generic import system on streams, at multiple companies, and in both cases they should have just stood up a simple service on boxes with a lot of RAM and autoscaling. Load the CSV into memory. Done.
I spent a lot of time reading this yesterday, and started looking at Rama's docs.
I think a database that encapsulates denormalization, so that derived views (caches, aggregations) are automatic, is a killer feature. But far too often awesome products and ideas fail for trivial reasons.
In this case, I just can't understand how Rama fits into an application. For example:
Every example is Java. Is Rama only for Java applications? Or, is there a way to expose my database as a REST API? (That doesn't require me to jump through a million hoops and become an expert in the Java ecosystem?)
Can I run Rama in Azure / AWS / Google cloud / Oracle cloud? Are there pre-built docker images I can use? Or is this a library that I have to suck into a Java application and use some kind of existing runtime? (The docs mention Zookeeper, but I have very little experience with it.)
I.e., it's not clear where the boundary between my application (Java or not) and Rama is. Are the examples analogous to sprocs (run in the DB) or to business logic (run in the application)?
The documentation is also very hard going. It appears the author has every concept in their head, because they know Rama inside and out, yet can't empathize with the reader and provide simpler bits of information that convey useful concepts. There's both "too much" (mixing the explanation of PStates and the depot) and "too little" (where do I host it, what is the boundary between Rama and my application?)
Another thing I didn't see mentioned is tooling: every SQL database has at least one general SQL client (MSSQL Studio, Azure Data Studio, etc.) that allows interacting with the database (viewing the schema, ad-hoc queries, etc.). Does Rama have this, or is every query a custom application?
Anyway, seems like a cool idea, but it probably needs some well-chosen customers who ask tough questions so the docs become mature.
this comment section has gotta be in the absolute upper echelons of non-RTFA i have seen on HN in a long time. even for HN. i acknowledge my own bias, though: i’ve been an admirer of nathan marz’s work from afar for years now, and basically trust him implicitly. but… wow. what fraction of the comments even engage with the substance of the article in any way? it’s not like they didn’t put their money where their mouth(s) is/are: they feel strongly enough about the problem that they built an entire goddamn “dont call it a database” (and business) around it.
i’ve always been pretty sympathetic to code-/application- driven indexing, storage, etc.— it just seems intuitively more correct to me, if done appropriately. the biggest “feature” of databases, afaict, is that most people dont trust themselves to do this appropriately xD. and they distrust themselves in this regard so thoroughly that they deny the mere possibility of it being useful. some weird form of learned helplessness. you can keep cramming all of your variously-shaped blocks into tuple-shaped holes if you want, but it seems awfully closed-minded to deny the possibility of a better model on principle. what principle? the lindy effect?
A lot of the commenters seem like database fans instinctively jumping to defend databases. The post is talking about contexts where you are dealing with petabytes of data. Building processing systems for petabytes has a separate set of problems from what most people have experienced. Having a single Postgres for your startup is probably fine, that's not the point here.
There is no option to just "put it all in a database". You need to compose a number of different systems. You use your individual databases as indexes, not as primary storage, and the primary storage is probably S3. The post is interesting and the author has been working on this stuff for a while. He wrote Apache Storm and used to promote some of these concepts as the "Lambda architecture" though I haven't seen that term in a while.
So what you're saying is that this article is irrelevant for 99.999% of developers. The instinctive jump to defend databases is completely understandable given that context.
> You use your individual databases as indexes, not as primary storage, and the primary storage is probably S3.
Which is a perfectly valid use for a database. Our company's document management system uses a big database for metadata and then, of course, stores the actual files on disk.
I think the complexity gets really crazy at high scale, but the complexity caused by databases is still significant at low scale as well. For example, needing to use an ORM and dealing with all the ways that can leak is pure complexity caused by not being able to index your domain model directly.
* The immutability, lambda architecture points I agree with. I think the separation of the immutable log from the views is important. Databases are frequently used in ways that go against these principles.
* I am not sold that being unable to express the domain model correctly is really a fair criticism of databases. Most businesses in my experience have a domain that is modeled pretty well in a relational DB. I haven't seen a better general solution yet, though I haven't checked out Rama.
At the low end of the scale, there are a lot of companies (or projects) for which the entire dataset fits in a single managed Postgres instance, without any DBA or scalability needs. They still suffer from complexity due to mutable state, but the architectural separation of source of truth vs "views" can be implemented inside the one database, using an append-only table and materialized views. There are some kinds of data that are poorly modeled this way (e.g. images), but many kinds that work well.
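For example, a minimal sketch of that separation inside a single Postgres instance (table and view names are invented; a real setup would refresh the view on a schedule or via triggers, and the Postgres JDBC driver is assumed to be on the classpath):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Sketch: source of truth as an append-only events table, with a materialized
// view as the derived "index", all inside one Postgres instance.
public class SingleDbEventSourcing {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/app", "app", "secret");
             Statement st = conn.createStatement()) {

            st.execute("""
                CREATE TABLE IF NOT EXISTS events (
                    id        bigserial PRIMARY KEY,   -- append-only log
                    user_id   bigint NOT NULL,
                    location  text   NOT NULL,
                    at        timestamptz NOT NULL DEFAULT now()
                )""");

            st.execute("""
                CREATE MATERIALIZED VIEW IF NOT EXISTS current_location AS
                SELECT DISTINCT ON (user_id) user_id, location
                FROM events
                ORDER BY user_id, at DESC
                """);

            // New facts are only ever appended...
            st.execute("INSERT INTO events (user_id, location) VALUES (1, 'Berlin')");
            // ...and the derived view is refreshed, never edited in place.
            st.execute("REFRESH MATERIALIZED VIEW current_location");
        }
    }
}
```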
So I don't really view the architectural ideas as repudiating databases in general, more as repudiating a mutable approach to data management.
Such a poorly written article doesn't encourage me to use a brand new and untrusted database; if you can't write a clear article, why would I trust your database code?
This is a thinly veiled ad for Rama, but the explanation for why it's so much "better" isn't clear and doesn't make much sense. I strongly urge the author to work with someone who is a clear and concise technical writer to help with articles such as these.
I can't wrap my head around the way this solves the global mutable state problem.
First, here's what I do understand about databases and global state: compared to programming variables, I don't think databases are shared, mutable global state. Instead, I see them as private variables that can be changed through set/get methods (e.g., with SQL statements, if it's that kind of DB).
So I agree shared, global state is dangerous (I'm not sure I'd call it harmful) and the reason I like databases is that I assume a DB, being specialized at managing data, will do a better job at protecting the integrity of that global state than I'd do myself from my program.
With luck, there may even be a jepsen test of the DB I'm using that lets me know how good the DB is at doing this job.
In this post there's an example of a question we'd ask Rama: “What is Alice’s current location?”
How's that answered without global state?
Because of the mention of event sourcing, I'd guess there's some component that knows where Alice was when the system was started, and keeps a record of events every time she changes her place. If Alice were the LOGO turtle, this component would keep a log with entries such as "Left 90 degrees" or "Forward 10 steps".
If I want to know where Alice is now, I just need to access this log and replay everything and that'd be my answer.
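Concretely, the naive replay I'm imagining looks something like this (event types invented purely for illustration):

```java
import java.util.List;

// Naive approach: recover "where is the turtle now?" by replaying its full event log.
public class TurtleReplay {
    interface Event {}
    record Left(int degrees) implements Event {}
    record Forward(int steps) implements Event {}

    record State(double x, double y, int heading) {
        State apply(Event e) {
            if (e instanceof Left l) {
                // turning left increases the heading (counterclockwise)
                return new State(x, y, Math.floorMod(heading + l.degrees(), 360));
            }
            if (e instanceof Forward f) {
                return new State(x + f.steps() * Math.cos(Math.toRadians(heading)),
                                 y + f.steps() * Math.sin(Math.toRadians(heading)),
                                 heading);
            }
            return this;
        }
    }

    public static void main(String[] args) {
        List<Event> log = List.of(new Forward(10), new Left(90), new Forward(5));
        State current = new State(0, 0, 0);
        for (Event e : log) current = current.apply(e); // replay on every read
        System.out.println(current);                    // current position/heading
    }
}
```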
Now, I'm certain my understanding here must be wrong, at least on the implementation side, because this wouldn't be able to scale to the mastodon demo mentioned in the post, which makes me very curious: how does Rama solve the problem of letting me know where Alice is without giving me access to her state?
I don't think there's much to wrap your head around. It doesn't seem to. As far as I can tell, this commits its own version of the classic multithreading 101 error of assuming you can make any data structure "thread-safe" by slapping some lock statements around all its getters and setters, or the error some relative newcomers to functional programming commit when they believe that all you need to eliminate concurrency problems is the `State` monad, and proceed to Greenspun Haskell into an imperative language by littering it all over their code.
The truth is, you can mitigate concurrency problems, but there's no way to eliminate them, and believing that you have is a great way to make them worse.
Rama PStates are globally readable but not globally writable. They are only writable from the topology that declares them. All code writing to PStates is thereby always in the exact same program.
Additionally, since PStates are not the source of truth – the depots (event logs) are – mistakes can be corrected via recompute from the source of truth.
"Alice's current location" would be done in Rama like this:
* Have a depot that receives new locations for people. The appended records would have three keys: userId, location, and timestamp. The depot is partitioned by userId.
* Have an ETL with a PState called $$currentLocation that's a map from userId to location.
* The ETL consumes the depot and updates the PState as new data comes in.
So with a depot being responsible for locations, and partitioned by user id, is it too far off to think of this as similar to the way CockroachDB does sharding, with ranges where there's a single Raft leader per range?
From your description of PStates, I reckon there's some inherent delay from when "an event happens" (bear with me, I know ...), the depot appends it to its log/stream, and a PState consumes it?
Let me try to make a concrete example: suppose I'm a consumer that will make a decision based on Alice's location (it could be a media streaming service that must offer a different catalog view depending on the region the user is, even for the same user). What's the Rama way to know how good my location knowledge of Alice is (e.g., "I can be sure I know where Alice was no longer than T time ago")?
When you use stream ETLs, the delay between a depot append and the corresponding PState updates becoming visible is in the single-digit millis range. With microbatching it's at least a few hundred millis.
Coordination of understanding whether your writes have propagated to PStates in consuming ETLs is provided at the depot append level. If you do depot appends with full acking enabled (which is the default), the append call doesn't complete until all colocated stream topologies have finished processing that record. So in your client code, you can do the depot append, wait for it to complete, and then query the PState knowing the corresponding updates are visible.
> I can't wrap my head around the way this solves the global mutable state problem.
That's because you can't get rid of global mutable state. The only thing you can do is try to spread it around to add write concurrency, but that only works for some transactional/concurrency models and schemas.
It looks from the other reply like the answer is that the data is sharded to effectively increase write concurrency, but it's still 1 writer per shard.
Pretty interesting once you read past the marketing push.
I mostly like the approach, but there are a lot of questions/issues that spring to mind (not that some of them don't already have answers, but I didn't read everything). I'll list some of them:
* I'm pretty sure restrictive schemas are a feature, not a bug, but I suppose you can add your own in your ETL "microbatch streaming" implementation (if I'm reading this right, this is where you transform the events/data that have been recorded into the indexed form your app wants to query). So you could, e.g., filter out any data with an invalid schema, and/or record an error about the invalid data, etc. A pain, though, for it to be a separate thing to implement.
* I'm not that excited to have my data source and objects/entities be Java.
* The Rama business model and sustainability story seem like big question marks that would have to have strong, long-lasting answers/guarantees before anyone should invest too much in this. This is pretty different and sits at a fundamental level of abstraction. If you built on this for years (or decades) and then something happened you could be in serious trouble.
* Hosting/deployment/resources-needed is unclear (to me, anyway)
* Quibble on "Data models are restrictive": common databases are pretty flexible these days, supporting different models well.
* I'm thinking a lot of apps won't get much value from keeping their events around forever, so that becomes a kind of anchor around the neck, a cost that apps using Rama have to pay whether they really want it or not. I have questions about how that scales over time. E.g., say my depot has 20B events and I want to add an index to a PState or a new value to an enum... do I need to ETL 20 billion events to do routine changes/additions? And obviously schema changes get a lot more complicated than that. I get that you could have granular PStates, but then I start worrying about the distributed nature of this. I guess you would generally do migrations by creating new PStates with the new structure, take as much time as you need to populate them, cut over as gradually as you need, and then retire the old PStates on whatever timeline you want... But that's a lot of work you want to avoid doing routinely, I'd think.
I'm starting to think of more things, but I better stop (my build finished long ago!)
* By "restrictive schemas" I mean being forced to represent your data storage in non-optimal ways – like not being able to have nested objects in a first-class way. Schemas themselves are extremely important, and they should be as tight as possible.
* Rama's JVM-based, so the entire ecosystem is available to you. You can represent data as primitive types, Java objects, Protobuf, Clojure records, etc.
* You deploy and manage your own Rama clusters. The number of nodes / instance types depends on the app, but Rama doesn't use more resources than traditional architectures combining multiple tools.
* Some databases support multiple very specific data models (e.g. Redis). I don't consider that flexible compared to Rama, which allows for arbitrary combinations of arbitrarily sized data structures of arbitrary partitioning.
* Depots (the "event sourcing" part of Rama) can be optionally trimmed. So you can configure it to only keep the last 2M entries per partition, for example. Some applications need this, while others don't.
* If you're adding a new PState, it's up to you how far back in the depot to start. For example, you could say "start from events appended after a specific timestamp" or "start from 10M records ago on each partition".
* We have a first-class PState migrations feature coming very soon. These migrations are lazy, so there's no downtime. Basically, you can specify a migration function at any level of your PStates, and the migration functions are run on read. In the background, it iterates over the PState to migrate every value on disk (throttled so as not to use too many resources).
A very simple thing about this (and many systems!) is that if your whole thing is "log writes and do the real work later", you lose read-your-writes and with it the idea that your app has a big persistent memory to play in.
This doesn't only matter if you're doing balance transfers or such; "user does a thing and sees the effects in a response" is a common wish. (Of course, if you're saving data for analytics or such and really don't care, that's fine too.)
When people use eventually-consistent systems in domains where they have to layer on hacks to hide some of the inconsistency, it's often because that's the best path they had out of a scaling pickle, not because that's the easiest way to build an app more generally.
I guess the other big thing is, if you're going to add asynchrony, it's not obvious this is where you want to add it. If you think of ETLs, event buses, and queues as tools, there are a lot more ways to deploy them--different units of work than just rows, different backends, different amounts of asynchrony for different things (including none), etc. Why lock yourself down when you might be able to assemble something better knowing the specifics of your situation?
This company's thing is riding the attention they get by making goofy claims, so I'm a bit sorry to add to that. I do wonder what happens once they're talking to actual or potential customers, where you can't bluff indefinitely.
Note that you can sometimes "fix" the issues you mention relatively easily with one or both of these. This method isn't always appropriate, but it often is sufficient:
* Have anything potentially updated by the changes users make auto-update (e.g. via websockets or server push). As long as the latency for an update to make it through the system isn't too high, this is often fine. E.g. let's say you're posting a reply, then putting a spinner in the UI to be replaced once the reply has made it "back" to the client can be fine - as long as it's quick enough.
* If it's not quick enough, and/or instead of doing that, it's often sufficient to "fake it": once it's been pushed to the server, just inject the same data back into your UI as if it came from the server. As long as what you submit to the server and what you get back from the server are similar enough, it's often fine. If there are fields you can't populate, then just use the first method to still have the data auto-update once it makes it through the whole chain. It does have the downside that if the user refreshes, the data might appear to disappear unless you e.g. put it in localStorage or similar, and then it does start getting hairy, so using it to paper over long delays is not a good idea. In those cases, there is the option of updating a server-side cache, but then you're starting to do a lot of work to avoid having consistent state, and you might want to consider whether you should instead just bite the bullet and ensure you do have consistent state. (A rough sketch of the "fake it, then reconcile" flow follows this list.)
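Something like this, in spirit (a toy Java sketch; the Post shape and reconciliation by client id are invented for illustration):

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;

// Toy optimistic update: show the user's post immediately, then reconcile
// once the server-confirmed version arrives through the normal feed path.
public class OptimisticFeed {
    record Post(String clientId, String body, boolean pending) {}

    private final List<Post> feed = new CopyOnWriteArrayList<>();

    void submit(String clientId, String body, CompletableFuture<Post> serverAck) {
        feed.add(new Post(clientId, body, true));           // "fake it" locally
        serverAck.thenAccept(confirmed ->                   // swap in the real one later
            feed.replaceAll(p -> p.clientId().equals(clientId) ? confirmed : p));
    }

    public static void main(String[] args) {
        OptimisticFeed ui = new OptimisticFeed();
        CompletableFuture<Post> ack = new CompletableFuture<>();
        ui.submit("tmp-1", "hello fediverse", ack);
        System.out.println(ui.feed);                                  // pending post shown immediately
        ack.complete(new Post("tmp-1", "hello fediverse", false));    // server copy arrives later
        System.out.println(ui.feed);                                  // reconciled
    }
}
```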
Of course this does not mean it's always appropriate, or sufficient, to use this style anyway, and I absolutely agree that there are plenty of instances where not being able to immediately read your writes is a problem, but sometimes these kinds of tweaks can be very helpful.
I agree the claims made are a bit goofy, though, and I think I'd frankly mostly prefer to roll my own over dealing with their API, but the architectural style can be quite useful if it's not overdone.
Yeah--as a user I've even seen apps where e.g. my own post immediately appears in the feed and thought "I see what you did there!"
Not as disagreement, but to help map the problem space out more, I think faking it is relatively easy for apps built around a few fundamental flows that are done a lot (like Mastodon) and trickiest when the app has many flows built to capture the many little wrinkles of some task/business/organization. If you're using an app to do the work of a school registrar or HR department or something you both probably expect your action to show up right when you hit Submit and also have lots of distinct actions you might do. It's then that it seems especially tricky to give users a reasonable-feeling view of things while building on an eventually-consistent foundation.
Oh, absolutely, this trick absolutely only works well for specific types of apps. You need it to be relatively self-contained, and not need to be reflected elsewhere if the user clicks around, or it's worse than just giving the user a "submitted. check back soon" message, in which case doing it "properly" is best.
Mastodon type apps are absolutely the best case scenario for this type of thing, because "everything" is async.
As a simple solopreneur full-stack dev who's never worked on an application serving more than a few thousand users, I can understand and relate to all of the problems written about here (some very compelling arguments), and I found myself nodding most of the way through, but I simply don't understand the proposed solution. Even the Hello World example from the docs flew over my head. And I've been programming apps in production for 15 years, and I like Java.
This needs a simple pluggable adaptor for some popular frameworks (Django, Laravel, or Ruby on Rails), and then I could begin to get an idea of how this would actually be used in my project.
Disk drives are also large global mutable states. So is RAM at the operating system level.
The article conflates the concept of data storage with best programming practices. Sure, you should not change the global state throughout your app because it becomes impossible to manage. The database is actually an answer to how to do it transactionally and centrally without messing up your data.
Right, so the solution is more complexity? Of course it is. Sigh