I wince every time I use application code to do what I know SQL can do faster and easier. ORMs are nice but it's often impossible to translate performant SQL into the ORM's API. SQL is just so concise.
Doing data retrieval and manipulation with application code instead of SQL is like doing web layout with JS instead of CSS or doing graphics with canvas instead of HTML/SVG.
Sure it's more flexible, but it's also far less simple and readable, and in many circumstances less performant.
We recently changed one of our suppliers. The previous one gave us full read access to their database (400 tables), but the new one insists that we use their SOAP web services.
So now when someone asks me for some data, I need to write code that manually joins objects, instead of crafting a quick SQL query and using the export-to-CSV function.
Dictionaries (the C# key-value collection) have fast become a staple diet, since LINQ is way too slow when you try to 'join' hundreds of thousands of objects together.
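(For illustration, here's the shape of that dictionary-join trick, sketched in Python rather than C# for brevity; the record layouts are made up.)

```python
# Hash-join two collections by key instead of a nested-loop 'join'.
# Field names (id, customer_id) are hypothetical.
orders = [{"id": 1, "customer_id": 10}, {"id": 2, "customer_id": 11}]
customers = [{"id": 10, "name": "Alice"}, {"id": 11, "name": "Bob"}]

# Build the dictionary (hash index) once: O(n).
customers_by_id = {c["id"]: c for c in customers}

# Probe it per order: O(1) per lookup instead of scanning the whole list.
joined = [
    {**o, "customer_name": customers_by_id[o["customer_id"]]["name"]}
    for o in orders
    if o["customer_id"] in customers_by_id
]
print(joined)
```

Which is, of course, exactly the hash join the database would have built for you.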
It feels backward (and I'm saying this as a dev, I'm certainly no DBA or SQL expert).
I thought about making a "shadow database" on a server, since my work PC only has 8 GB of RAM, but ultimately decided against it.
The first reason was GDPR compliance, and the second is that our architecture is already complex as it is (and there are only two of us to manage it), so I don't want to add one more brittle layer that will need debugging at the worst possible time.
About GDPR: is it possible that holding data in a non-storage/volatile medium could be legally distinct from non-volatile storage, especially if it's essentially syncing with an authoritative data source that's responsible for managing GDPR? Because if not, it seems to me every proxy and persistence layer carries legal risk...
No. GDPR covers any processing. Simply having the data pass through your machines / software makes you a part of the chain of processors / controllers.
Just writing SQL and maintaining and testing it like regular code is a perfectly reasonable thing to do.
Not to mention that having SQL be callable as an actual function in the programming language that the rest of your application is written in is just so much more comfortable than having small bits of SQL scattered about in strings.
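A minimal sketch of what that can look like, using Python's sqlite3 and a hypothetical orders table; the point is that the query gets a name, parameters, and one place to live:

```python
import sqlite3

# Hypothetical schema, just so the sketch runs end to end.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER, customer_id INTEGER, status TEXT, created_at TEXT)"
)

def overdue_orders(conn, days):
    """One named, testable function; the SQL lives here, not scattered in strings."""
    return conn.execute(
        "SELECT id, customer_id, created_at FROM orders "
        "WHERE status = 'open' AND created_at < date('now', ?)",
        (f"-{days} days",),
    ).fetchall()

print(overdue_orders(conn, 30))
```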
Haha ... I'll admit I used to say that, but rarely did an internal system (what I work on most) ever actually have that requirement. Systems that are sold to customers to run on their premises are another story, but don't worry about using (e.g.) PostgreSQL-only features to speed up both your processing and development times. I have, however, seen the horrors of 600-line stored procedures with calls to other stored procedures nested 15 deep. If you get to the point that you can't easily understand the business logic, you've gone too far.
There seem to be two ways that interacting with the database works out in practice: You can embrace the specific DBMS you have, get the most out of it, and end up tightly coupled to it. Or you can try to be database agnostic, increase your development costs and limit your performance and data expressivity in the service of that goal, and still end up tightly coupled to the DBMS.
My pet hypothesis is that "database agnostic" isn't really about being able to switch databases on a whim. It's more expressing a developer-centric anxiety. A dog whistle way of saying, "I don't want to have to RTFM of a new DBMS." Which I get, most DBMSes have awful manuals. Especially SQL databases. SQL dialects tend to mix syntax, semantics, and subtle implementation details all together into one big ball of mud and tangled hair. But this is one spot where I think it really is worthwhile to hold your nose and get through it.
Scenario 1: No plans to add additional functionality.
Sticking to the e-commerce example in the article: if I am designing a system for use by just one customer who wants to offer the discount for just that one scenario, and they don't have any future need to offer any other kind of discount, then put the logic in the DB. It's quick and easy.
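For instance, that one fixed rule can live entirely in the database as a view. A sketch with made-up table and column names (runnable against SQLite):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE order_lines (order_id INTEGER, price REAL, qty INTEGER);

-- The one business rule, expressed once, in the DB: 10% off orders over 100.
CREATE VIEW order_totals AS
SELECT order_id,
       SUM(price * qty) AS subtotal,
       CASE WHEN SUM(price * qty) > 100
            THEN SUM(price * qty) * 0.9
            ELSE SUM(price * qty)
       END AS total_due
FROM order_lines
GROUP BY order_id;
""")
```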
Scenario 2: Plans to expand to add additional functionality.
Sticking with the example: we have future plans to offer other types of discounts to our customers, which means we have to develop generic interfaces so it's easy to add functionality. Keep the business logic in the application layer.
Scenario 3: Plans to expand and add functionality, but also to let users configure and add functionality without IT intervention. Sticking with the e-commerce example: we want business users to create new offers and expose them to our customers. In this case, use a rule engine like Drools.
It really depends on what functionality you are delivering and how you see that evolving in the future.
The article was written at a time when "polyglot" wasn't common, IIRC, to the extent that knowing SQL in addition to a programming language wasn't common.
Today the thinking "keep all domain logic in code" would take you to crazy places, given the diversity of systems we need in order to build applications.
"A foolish consistency is the hobgoblin of little minds" and all that.
If anything it seems more sensible to centralize storage of all business rules in the DB.
One thing this leaves out is the danger of concurrency issues, which can sometimes be worsened by using an ORM. Writing correct, race-free SQL can be very hard (with READ COMMITTED transactions) or require complex retry logic (with SERIALIZABLE). Preventing deadlocks can also be a concern.
ORMs tend to hide the underlying SQL operations, making it even harder to verify whether operations are concurrency safe.
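The classic example is a read-modify-write that an ORM makes look innocent. A sketch against a hypothetical accounts table (SQLite here so it runs, but the race is about what happens under READ COMMITTED on any DB):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO accounts VALUES (1, 100)")

# Racy: under READ COMMITTED, two concurrent transactions can both read 100
# and both write 90, silently losing one of the debits.
(balance,) = conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()
conn.execute("UPDATE accounts SET balance = ? WHERE id = 1", (balance - 10,))

# Safe: push the read-modify-write into one atomic statement.
conn.execute("UPDATE accounts SET balance = balance - 10 WHERE id = 1")
```

An ORM that loads the object, mutates it, and saves it is quietly doing the racy version.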
I think that this is one of those spots where, if you want things to be ORM-friendly, you've got to bake it into the data model.
A good starting point is to use CR instead of CRUD. Append-only tables with soft deletes aren't necessarily the most performant, but they're naturally less susceptible to race conditions. The lack of destructive modification (under normal operation) also makes it easier to diagnose problems and keep audit trails.
One way I get around this is having an append-only version of the main table, and then a "most current record" version of the table. My append-only table is a running record of all the data I've loaded, unmodified. Then I run a CREATE TABLE AS SELECT statement against my append-only table to generate the new version of the final table. So in essence, I'm "deleting" the old version of the record by omitting it when I create the new version of the final table.
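A sketch of that pattern with made-up names: an append-only log, plus a "current" table rebuilt from it with CREATE TABLE AS SELECT:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Append-only log: every load adds rows; nothing is updated or deleted.
CREATE TABLE customers_log (id INTEGER, name TEXT, loaded_at TEXT);
INSERT INTO customers_log VALUES
    (1, 'Alice',  '2024-01-01'),
    (1, 'Alicia', '2024-02-01'),  -- newer version of the same record
    (2, 'Bob',    '2024-01-01');

-- Rebuild the "most current record" table: old versions are "deleted" by omission.
DROP TABLE IF EXISTS customers_current;
CREATE TABLE customers_current AS
SELECT id, name
FROM customers_log AS l
WHERE loaded_at = (SELECT MAX(loaded_at) FROM customers_log WHERE id = l.id);
""")
print(conn.execute("SELECT * FROM customers_current").fetchall())
```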
Not necessarily. You can insert a new row with a field named ‘deleted’ or similar set to ‘true’. This is how you would implement soft deletes in an append-only table.
It doesn’t have to be. You can write reads that don’t return entries if a separate deletion entry exists.
Can’t say I’ve worked that way in SQL, but I have seen that pattern in append-only data structures, where, for instance, an “order fulfilled” entry is essentially a delete operation on an outstanding order entry.
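A sketch of such a read, with a hypothetical schema: the latest entry per key wins, and a tombstone entry as the latest version hides the row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders_log (id INTEGER, item TEXT, deleted INTEGER, seq INTEGER);
INSERT INTO orders_log VALUES
    (1, 'widget', 0, 1),
    (2, 'gadget', 0, 2),
    (1, 'widget', 1, 3);  -- tombstone: e.g. "order fulfilled" acts as a delete
""")

# Reads skip any row whose latest entry is a tombstone.
rows = conn.execute("""
    SELECT id, item FROM orders_log AS o
    WHERE seq = (SELECT MAX(seq) FROM orders_log WHERE id = o.id)
      AND deleted = 0
""").fetchall()
print(rows)  # only order 2 is still visible
```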
READ COMMITTED and REPEATABLE READ benefit from retry logic as well, not just SERIALIZABLE.
For a long time, I sought to write deadlock-free code.
But that is very hard.
For example, in PostgreSQL, every UPDATE must be ordered and every DELETE must be ordered. [1]
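Since UPDATE itself has no ORDER BY in PostgreSQL, the usual workaround is to lock the rows in a consistent order first. A sketch assuming psycopg2 and a hypothetical accounts table:

```python
import psycopg2

conn = psycopg2.connect("dbname=app")  # hypothetical DSN
ids = [2, 1]
with conn, conn.cursor() as cur:
    # Lock rows in primary-key order so two transactions can't grab
    # them in opposite orders and deadlock each other.
    cur.execute(
        "SELECT id FROM accounts WHERE id = ANY(%s) ORDER BY id FOR UPDATE",
        (sorted(ids),),
    )
    cur.execute(
        "UPDATE accounts SET balance = balance - 10 WHERE id = ANY(%s)",
        (ids,),
    )
```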
Finally, I did myself a favor and created application-level retries.
This is 100% cool so long as (1) your deadlocks aren't so frequent as to become a performance problem, and (2) the action is "replayable" (e.g. no read-once streams). Fortunately, both are frequently true.
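For the record, a minimal version of such a retry wrapper, assuming psycopg2 against PostgreSQL (SQLSTATE 40001 is serialization_failure, 40P01 is deadlock_detected):

```python
import time
import psycopg2
from psycopg2 import errors

def run_with_retries(conn, fn, attempts=5):
    """Replay a transaction that loses a serialization or deadlock race.
    Only safe if fn is replayable: no read-once streams, no side effects."""
    for attempt in range(attempts):
        try:
            with conn:  # commits on success, rolls back on exception
                with conn.cursor() as cur:
                    return fn(cur)
        except (errors.SerializationFailure, errors.DeadlockDetected):
            if attempt == attempts - 1:
                raise
            time.sleep(0.05 * 2 ** attempt)  # exponential backoff
```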
It is a great tragedy of our day that so many software engineers are poor at SQL. Use tools like Jinja and DBT to modularize it, and Airflow to schedule it; but use of SQL should be maximized wherever possible.
One thing this article doesn't mention is that in many cases databases scale poorly, and only vertically. You can throw a bunch of domain logic in complex SQL queries and see significant performance benefits, but at some point you may find that your application has grown and your SQL server is running into IOPS and CPU limits that are difficult to fix. At that point your answer could be: pull logic out of those complex queries, write simpler queries that ask less of the DB, and run the logic on the application side so you can scale horizontally.
I don't think I've ever seen performance improvements from taking logic in a complex SQL query and re-implementing it at the application layer.
The bottleneck with databases is virtually never calculations on results (CPU), it's disk access and network latency and bandwidth.
And generally, if you do have crazy complex CPU-bound calculations you need to do on data (e.g. scientific stuff)... SQL doesn't provide the necessary functions/control anyways, so the database isn't even an option for that.
The only issue I can imagine you might have run into at some point is complex queries badly written (e.g. recursive subqueries without indices) that you could speed up with application logic -- but then the solution is to optimize the query.
And I've often seen things go dramatically in the opposite direction.
I/O is generally just about the slowest thing a computer can do. So, in general, the more you can reduce the data before sending it across the network, the better. And, heck, a well-crafted SQL query can often save you having to even read large chunks of data off of the disk in the first place, let alone pipe it across the network.
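A toy illustration of the difference, using SQLite (table and columns are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("eu", 10.0), ("eu", 5.0), ("us", 7.0)])

# Ships every matching row to the application, then reduces there.
total = sum(amount for (amount,) in
            conn.execute("SELECT amount FROM sales WHERE region = 'eu'"))

# Ships a single number; with an index on region, the DB may barely touch disk.
(total,) = conn.execute(
    "SELECT SUM(amount) FROM sales WHERE region = 'eu'").fetchone()
print(total)
```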
> (...) but at some point you may find that your application has grown and your SQL server is running into IOPS and CPU limits that are difficult to fix.
It's not unique to SQL/databases that at some point your user base may outgrow your current architecture and you may have to invest in something else.
It's dangerous to think you can skip this and just employ whatever architecture is "good" for bigger user bases from the start. There are different trade-offs in different systems and the trade-offs of most alternatives are not as well understood as the trade-offs of databases. I have seen more than one project implode when something which was simple with a database was far more complicated with the alternative.
Most (if not all) modern RDBMSs allow for reliable horizontal and vertical scaling. SQL Server (the db you mention) is no slouch when paired with additional CPUs (for vertical linear scaling) and allows for transactionally safe replication to distributed clusters of SQL Servers (for horizontal scaling) in geographically distributed servers. PostgreSQL has this and it's absolutely free.
Where is this meme that RDBMSs do not scale coming from? (I see lots of people saying this as though it is a given. Has there been any published data on this for me to read?).
> Where is this meme that RDBMSs do not scale coming from?
They don't scale to Google or Facebook operational sizes. Once you get to a billion customers or so the ol' RDBMS tends to struggle. Because everyone wants to be Google they imagine they have Google's problems. I've been in a meeting where the client was talking about their severe scaling issues for their "big data" which could only possibly be resolved by state of the art cloud solutions. I pressed them on the numbers - they had 400GB. You can buy an iPhone with 512GB.
> They don't scale to Google or Facebook operational sizes.
You may or may not know this, but the primary datastore used at both Google and FB is MySQL. Sure, they use replication and sharding, but I would strongly argue that MySQL with sharding scales better than some multi-master NoSQL thing like Cassandra.
Related, you should check out https://github.com/vitessio/vitess if you haven't seen it. It's what Youtube and others use for their primary data store in production.
I think another reason for this is that companies had a tendency to have one single running database to store everything, often even in the same schema. That ended up not scaling well in the organisation when so many unrelated topics/projects/features went through one and the same database. I don't know if it has anything to do with licensing and paying per instance, maybe, but I feel that a lot of those problems of a database that doesn't scale could have been solved by simply using a few databases for isolated domains instead of cramming everything into one.
> Where is this meme that RDBMSs do not scale coming from? (I see lots of people saying this as though it is a given. Has there been any published data on this for me to read?).
It mostly comes from people who don't know how and when to create an index.
I feel like you should understand how a relational database works before using one, especially at the scale where you're running into "CPU limits".
> Where is this meme that RDBMSs do not scale coming from?
In my days it was more like "You can't afford RDBMSs at that scale". Reach a certain point and Larry gets a new yacht. Cheaper/open source offerings have moved that goal post by quite a bit.
Although it sometimes scares me how them poor databases get treated when performance is dropping. Little Jimmy JOIN is the first one to be put down, often way before there's a need for it.
> SQL Server (the db you mention) is no slouch when paired with additional CPUs (for vertical linear scaling) and allows for transactionally safe replication to distributed clusters of SQL Servers
Additional CPUs are $7,000 USD per core, and replication is labor intensive. Transactional replication has a nasty habit of breaking as the source tables are changed, and Availability Groups have a ton of bugs (as evidenced by any recent Cumulative Update.)
Saying that SQL Server scales is like saying your wallet scales to hold any amount of money. Sure, it might, but it's up to you to put the coin in, and it's gonna cost a lot of coin, compared to scaling out app servers, which generally have near-zero license cost and whose code is synchronized at deploy time.
As to recommending you a place to read, I hate to say this, but you could start with my blog. Pretty much every week, I’ve got real life examples turned into abstract tutorials on there from companies who hit scaling walls and had to hire me for help. (Past client examples: Stack Overflow, Google.)
> Probably still cheaper than trying to impement scalable transactions in higher layers.
Transactions, yes, I totally agree. That's what databases are for: reliably storing and retrieving data. It's where you start doing domain logic that things get tougher, like (and I wish I was joking) calling cross-continent web services from the database and building HTML inside SQL.
It's certainly possible, but more complex than just putting up a bunch of stateless application servers all accessing the same database. With no local cache. Then saying the DB is slow ;-)
It's also the kind of thing many developers don't like doing: thinking about operational concerns.
As an aside, horizontal scaling of sorts can be achieved by using microservices, it's actually one of the few really valid reasons for this type of architecture. If the microservice databases are not independent, you're doing it wrong.
The situation goes both ways. Sometimes people put complex logic into the database and it hurts scalability. Other times people pull logic out of the database, but losing all the correlation seeking and filtering means that it now has to do a lot more IO than it did before, which hurts scalability too.
+1. Also, in my experience, SQL query optimizers can be unpredictable on complex queries (behavior changes with data size and statistics), which can cause them to suddenly switch to a slow execution plan when data grows or something is added, despite all the indexes needed for a good plan being in place. (For interactive apps, a good plan means avoiding table scans, always: they might be faster in some bad corner cases, but if the data doesn't fit in memory they can't be cached, and they slow everything else down.)
Follow up: yes, you're all correct, but sometimes things in an organisation aren't perfect, and I've been in many situations where the database was the bottleneck and I had little power to change it. Now that I think about it, it's mostly a symptom of the database frequently sitting under a different 'organisational area' than the application. This means we might have the ability to, for example, easily add more application servers, but the database falls under a different department, and any changes require a lot more coordination with other parts of the business. I realise this could happen for any resource, but in my experience it's very common with a database.
What's the limit of vertical SQL scaling and what kind of applications (I'm talking not only about the functionality, but also realistic level of popularity) can end up hitting it, in your experience?