Hacker News new | past | comments | ask | show | jobs | submit login
Operating on a minimal two-core Postgres instance: Query optimization insights (robertrode.com)
179 points by rob742 10 months ago | hide | past | favorite | 175 comments



The optimizations are no doubt impressive, but looking at it from a costs perspective – they run their company on two DB instances, each with two CPU cores and 4 GB of ram. The annual cost of running such a machine on AWS (with SSD included) is ~$4000. Even if their DB infra was only half as efficient as it is today, their costs would increase by a whopping $8000/yr.

How much do they pay an average software developer? How many hours have developers spent on query monitoring, optimization, removing joins, removing aggregations? What is the ongoing cost of learning and applying these techniques to all new code written at the company? Even with the most conservative estimates I'm confident that it costs them several orders of magnitude more in developer productivity than they are saving on hardware.

Optimization is good, but every company needs to recognize where they are throwing away dollars to save cents. There is an optimal level of inefficiency that companies want to have.


> How much do they pay an average software developer?

Let's say that you pay a lot. Like you expend $20.000 USD doing this.

But we forget the cost power of software.

Your monthly savings extend to the infinity.

Your dev is now freer to do other things that are precluded in a more complex scenario ("you save pennies here, but expend pounds elsewhere"), because complexity DEMANDS MORE ATTENTION.

This is something I have observed all my life. I have never regretted to optime]ize for operational costs even if this means I "waste" a lot of time for this. I work in a more cutthroat environment than most startups (ERPs), where my time to deploy to production something is measured in hours, and is an environment that is filled by the graveyard of my competitors that have drink all the kool-aids.

Complex operational environments ALWAYS have turned worse/more complex than do the simplest things.

P.D: I think is similar to spending time in refactoring, cleanups, testing, security audits, training, and even resting.

All of that sounds like "costs". But you are doing this stuff alongside, just need to budget a few hours a week on this and the profits come and come.

Never regret using time on this stuff.

NEED to be smart on this? Sure. But is pure muscle and after a while is great.


Carefully designing your application around an underpowered database is almost surely more complicated than just paying for a more powerful database.

We're not talking about K8s or sharding. We're talking about adding more cores and RAM to the single database node that they're already using.

Imagine how much easier the backend dev's life would be if they could just write SELECT FROM JOIN WHERE like everyone else, and not have to pore over query plans and design/implement/test a bunch of custom joining routines. This path sounds like a high-effort artisanal approach to something where it doesn't seem warranted. It's penny-wise and pound-foolish in exactly the way you (very nicely) described.


It's true, but its great experience for future things when the costs actually matter - I cut my teeth at a biz that refused to pay more for database hardware and now my performance skills pay the bills at much nicer companies.


Surely it's great for you as the developer to get paid extra to learn transferable skills.

That's not necessarily true of the business.


Absolutely, I did not feel it as a positive at the time and recommended not that much more money for significantly better utilization of human time, but you take what you can get with intractable CEOs.


Imagine how much easier the frontend dev's life would be if they could just demand whatever horrifying schema they want, and it would be delivered to them. Oh wait, that's GraphQL, and it worked as well as it sounds.

I am begging people to learn SQL; it is not a hard language, and it hasn't changed in decades.


any good book ?


I bought https://theartofpostgresql.com/ few years a go. It goes more into intermediate/advanced category and I have been quite happy with it.

I was fortunate to have a Database Basics course in the university that was 80% database schema design and 20% SQL queries and indexes. Reasoning for such a split was that fixing a bad database schema is much harder than fixing bad queries.


thanks too


For beginner learning? https://sqlbolt.com/ is a decent place to start.


Intermediate and advanced too, if you can, that would be lovely.


https://modern-sql.com/ has some info about more recent additions to SQL to make life easier/write higher level or more composable queries.

Once you have the basics of SQL down (like how to write a query), you want to understand what the database actually does with your queries, so you can predict what the query plan "ought to be"/what you want, and you can read the query plans to try to understand why you did or didn't get the plan you expect.

https://use-the-index-luke.com/

And then once you understand the fundamentals, your database's manual is the place to go. Especially different index types (including their internals!) and what kinds of queries they support. The above link goes into depth about b-trees, but there are other types, and you can use b-trees as a reference point to contrast with others.

https://www.postgresql.org/docs/current/index.html


thank's a lot


As someone who does DB optimization for a living - there is a tradeoff, I've seen far too many places where db instances run with default settings out of the box, queries missing basic indexes, no one even cared to run an explain analyze.

I've seen far too experienced developers with no basic understanding of how relational databases operate. And it's gotten worse over the years - as they can keep throwing hardware at the problem. Performance optimization has become a lost art. This is as much a problem of negligence as much as it is of education.

I won't even touch on proper design, which requires someone with significant database expertise from the get-go ( which most startups may not have).


> I've seen far too many places where db instances run with default settings out of the box, queries missing basic indexes, no one even cared to run an explain analyze.

Indeed! One place I worked at, there was something like 10x (forget exactly) gains to be had simply by editing the MySQL config to match the workload. Seems like nobody bothers anymore. There is so much money to be saved by reducing excess operating costs.


Yep. Also, this is what I specialize in if anyone wants a senior engineer that knows operating systems and databases! ^_^


> Performance optimization has become a lost art.

It's really not very hard, basically add indexes wherever lookups are done but don't go overboard, especially with tables that see a lot of inserts. The DB will tell you what it's doing if you ask, it's all easily investigated. If your joins resist optimisation your data structure is probably off.

> I won't even touch on proper design, which requires someone with significant database expertise from the get-go

In more extreme cases this is true, but a few basic principles and an understanding of databases will get you 95% of the way there and is quite intuitive. Something as basic as "don't copy data around, each individual entity exists in one place, and use its key to refer to it otherwise" will help enormously.


> The DB will tell you what it's doing if you ask, it's all easily investigated.

It will tell you _what_ it's doing, but not _why_ it's doing it. Unraveling the mysteries of the query planner is a dark art that is still firmly in the realm of graybeards.


Exactly and that's where it gets hairy and you need deep expertise.


> It's really not very hard, basically add indexes

It's really not but you'd be surprised how many people don't know the basics of how indexes work. I've had people tell me that their queries couldn't be optimized further because "they had already added an index for every column of the table".

It doesn't help that SQL, by design, hides the actual algorithm doing the data access from the users while simultaneously relying on them to add indexes to achieve performance, which is in my humble opinion the worst mistake of SQL.


That is not a SQL mistake, but an implementation choice. Most modern databases can determine where indexes should be added, and add these automatically. E.g.: https://www.oracle.com/news/connect/oracle-database-automati...


This is news to me. Can any other DBs do this?

(I only skimmed -- far too much "Joan explains that" to signal ratio.)



Not for postgres or mysql.


I love to ask this very simple question in interviews: if a database index makes queries faster, why not add an index for every column on a table?

I want them to say “the heck are you talking about, because that’s not how anything works?”… it does trip some people up though that just legitimately don’t understand what’s happening.


As a former DBA and optimization expert, the more appropriate response is NOT "add more indexes", but rather to profile and analyze your query plans first, then make decisions off those; e.g. a long running query might be causing deadlocks, another query can be optimize with different joins or ctes, etc.

Adding indexes can increase I/O time (more writes), increase DB file sizes (more stuff to store) etc, which are not inconsequential. Sometimes it's also as simple as moving indexes to a tablespace stored on an SSD.


Rounding out the end of 2023 where a 15 TB pcie gen 4 drive is $1800, it's hard for me to accept the fact that SSD storage for databases (or at least the indexes) isn't just a given.


And now your system can scale better, saving you enough money to justify the optimization work long-term.

The StackOverflow folks famously take their DB design and performance very seriously, and now they do not need an army of overpriced "architects" to plug the performance holes with more insane and expensive complexity.

https://stackexchange.com/performance

Coincidentally I just wrote up a post about the importance of minding your database skills:

https://renegadeotter.com/2023/11/12/your-database-skills-ar...


Stack Overflow is the poster child of over-optimizing DB engineering while neglecting every other product feature that end users actually want, so not sure if that should be your example.


Somehow they managed to become one of the most important websites for software developers despite neglecting every product feature you miss.


They have never made a dollar of profit in their 15 years of operation and just sold themselves to a private equity firm for pennies. "Most important website for software developers" doesn't really mean anything unless one can also be a sustainable company in the process.


Failure to monetize is usually the exact opposite of failure to please users.


Was it because of "neglecting every other product feature that end users actually want"?


There is a case to be made they took it to an extreme, but there is too much going in the other direction as well. It's not like most teams are working on the meat of the product anyway - they probably spend 40% of their time running a small project on Kubernetes and getting rid of daily security warnings from NPM.


Have you seen their code base? They did not over optimize the DB engineering by any means that I could tell.

Having a simple monolithic database is not optimizing your DB engineering by any means.

edit: and inventing a better ORM doesn't count either.


How did you arrive at $4000?

I used https://calculator.aws/#/estimate and my number is over an order of magnitude less than $8000 for 2 instances with SSDs, 4GB RAM, and 2 cores.


I've always thought there should be a meta "AWS Calculator" Calculator tool, which takes the output from the official AWS calculator and applies an algorithm to make it more realistic to factor in the things people forget about (mostly data transfer)


I/O, backups, and such. Just having the instances cost money. Using them costs even more!


And egress at AWS.


Do you expect a database to do much egress?


Depends on the queries and the application.

Even just the case of sitemap access alone can very well read half a database's tables and many of each table's columns in the act. That being done dozens of times a week between various search vendors and yeah your egress from just SEO can exceed the size of your database alone. Just depends on the nature of your data.

Then consider paging views, infinite scroll.

Then toss in something like Algolia or other third-party tools ingesting your data.

But it really boils down to your application. You might have NONE of these and egress is just crumbs.


Egress is defined as outbound traffic to the public internet.

Unless your database is publicly accessible (which it really should not be), its egress should be zero. What your /application/ returns to the public internet, is a different conversation.


The balancing act between one-time developer hours and the recurring costs associated with scaling up database resources can be challenging. If sufficient funds are available, scaling up may seem like an easy decision. However, investing in the development of expertise in efficient database operations is usually more cost-effective in the long run. Additionally, acquiring the skills to debug queries and optimize database configurations can ultimately save on developer hours.


> The optimizations are no doubt impressive

The sad thing is that the things they discuss like monitoring and optimizing slow queries have always, until recently, been baseline competence.

It's only recently that the universal answer to something being too slow is to simply spin up another dozen, or heck make it two dozen, cloud instances to compensate.

> Even if their DB infra was only half as efficient as it is today, their costs would increase by a whopping $8000/yr.

I think the interesting comparison is not so much the same team doing things slightly less efficiently, but comparing to the average status quo in the industry.

So many companies are spending $100K+/month (month not year) on AWS costs to run very tiny minimal loads due to the attitude that efficiency never matters, just spin up more instances. Pretty soon you're talking real money. And by the time the CFO starts screaming, it's very difficult to go back to a more sane infrastructure.


How much would they pay an average SRE to deal with random fires operating a poorly designed, poorly instrumented and poorly understood system?


This. Optimized system means the devs took the time to understand the system, know the (most frequent) usage patterns and will not be nagged by 'something is slow' problems every so often. Of course, as with everything, there is a danger of going overboard. (And industry is firmly headed in the 'just throw more people/hardware/AI at the problem' direction anyway).


The last line from the article: "We must not forget that 20 years ago only a fraction of the performance of current hardware resources was available, and despite these restrictions, performance database-based applications were developed."

I think you have a mental model in which these people are spending weeks and months optimizing a system they should just upgrade while the real tasks languish, but in my experience, probably all they are doing is just paying a bit of attention every so often and simply doing things that don't require 32 cores and terabytes of RAM. A lot of things don't need that power.

20 years ago you had a lot more real variance in power. If your 100MHz server couldn't handle it, maybe your 400MHz server could, and the gap between those two systems is probably even bigger than the numbers imply because that 400MHz was probably equipped with better everything else (RAM, disk, cache, better performance/cycle for the CPU) to go with it. Nowadays, with CPU speeds having stalled, it's easy to go wide and service lots of diverse queries at a time, but if you have a particular query that your lowest-end DB node is struggling with, there's a pretty decent chance that just pushing the button to go to a bigger node will have a distinctly sub-linear result on performance improvement. I've certainly seen this from a number of teams in a number of places. You can push a button to get more CPUs but you can't push a button to get much better CPUs. Some better, yes, but not like it used to be.

If you just try a little, modern commodity systems can do a lot. Even with only 4 cores and 4GB. Just try a little. Yes, there are absolutely tasks that that doesn't work with; if you've got 10,000 hours of video to re-encode that system isn't going to cut it. But a lot of developers are really bad at understanding how long things should take and don't realize that the thing that takes two minutes during CI really should take something more like 50ms because https://accidentallyquadratic.tumblr.com/ and somebody really ought to look into that rather than just bump the CI node size again hoping it solves the problem.

(And for Pete's sake, when bumping the instance size does happen to not solve the problem... turn it back down again! Despite what I just wrote I do understand there's a time and a place to spray money at the problem, but when spraying money at the problem doesn't even help, STOP.)


Reserved m6a.xlarge(4vcpu, 16 GB RAM) costs <$1000. Assuming most companies use reserved for things like DB. So yeah their savings is like less than $1000, at which point it feels more like a developer satisfaction project rather than having any ROI on money terms.


Link? ta


> How many hours have developers spent on query monitoring

Install Percona Monitoring & Management. Clickhouse + Grafana all in one, dashboards ready to go, and free. Not hard.

> optimization, removing joins, removing aggregations?

How is this not part of the job already? Too many people seem to have it in their head that the database is an inconvenience that gets in the way of their application. It is the heart and soul of every application, because all applications boil down to CRUD. You should learn SQL, full stop. Relation algebra is not that difficult. The time people spending learning the newest stupid JS framework could instead be spent learning a language that hasn't changed in decades. It will serve you well.


You can run this on Hetzner ARM hosts for about $30/year o_O


> The annual cost of running such a machine on AWS (with SSD included) is ~$4000

Are you sure about this estimation? Because Digital Ocean offers 2vCPU + 8GB machines for $25 / month, which is more than 10 times cheaper.


These optimizations should be done regardless of db power. You don’t need an extra eng to do them, you just do them. They are part of the job.


I don't think most people today realize how wasteful modern software engineering is. They think, it's fine to be inefficient, I'll just get a bigger X. And we are now blessed with technology and solutions that make that a possibility, when for the longest time it wasn't.

But you don't want to have to optimize when you don't expect to. It takes time away from your features and quality, and is risky. You want to predict when you'll need to grow and how much effort it might take, and control costs. That's why software development should be efficient from the start. By only using what you need, you don't run into situations where you have to optimize to survive.

Anyone who says to me "I'll just put everything in PostgreSQL and optimize later" sounds just as naive as the person who says "let's build for scalability before we have any customers". Both are giving themselves problems they don't need, and trying to justify it by thinking it's going to save them time. The former is giving themselves problems in the long term, the latter is giving themselves problems in the short term. Even if these tradeoffs are acceptable to you, they are often unnecessary!


> Anyone who says to me "I'll just put everything in PostgreSQL and optimize later" sounds just as naive as the person who says "let's build for scalability before we have any customers". Both are giving themselves problems they don't need

Both are giving themselves problems, but they are fundamentally different decisions. I will almost always take the "optimize later" approach as there's not always a "later".

For most new products, the biggest challenge is finding product market fit. Once you have that, the world is your oyster. I will happily pay a 10% tax after PMF to be 10% better at finding PMF now.

If I've found PMF, I will have more revenue, more traction, more budget, etc to fix the problems I created. If I don't have PMF, everything goes in the trash.


It's not 10% though, is it.


10% was just a smallish number. In my experience, it's actually less than 1%.

Code typically follows one of two paths:

* It's in an unused/underused part of the app. Performance doesn't really matter.

* It's in a highly used part of the app. Performance matters, but you'll never be able to predict the features that get layered on top of it. Fix performance as you build those new features.


This is my experience. Get it out there, learn what you don’t yet know about the problem/product space, then improve and optimize.

The hard parts, in my experience, are being sufficiently seasoned to know what parts are hard to do later and are worth facing up front, and making sure the decision makers buy into this so you’re not blindsided by the classic “there’s no time for tech debt.”


A new business already has a laundry list of "I need to ${thing} to survive". Being able to move one such problem a couple of years in the future is already an attractive value before considering most businesses won't last that long anyways (for reasons unrelated to how optimized the database was). I mean if you already have the know how, confidence, and experience to do it right in the same amount of time then duh. If not, is spending precious extra time that early on really going to help you survive long enough to reach where how optimized you infrastructure spend is seems like an important problem for your cash flow?

Sometimes I think this then gets conflated with more established businesses where continuing to start every project so heavily this way does tend to create more problems than it solves.


I once ran the daily trade reconciliation of an entire US primary exchange on my laptop, using a PERL script (in 2007). I wrote it in a couple of days, and each day's processing ran in minutes. Ten years later, I witnessed a team spending weeks spinning up spark clusters to perform a job that was significantly smaller. At the time I thought I was the person who was missing some important context. Now I'm beginning to realise that I was not in fact misinformed or insane.


Maybe you were missing something? You could have been a manager, commanding a team of Spark engineers and an even higher salary?


... then they see the cloud bill.

Cloud companies found a way to directly monetize developers' love of complexity and overengineering. It's genius.


Developers often love simplicity and underengineering, which can also lead to performance problems.

My personal definition of engineering is: choosing good compromises between competing needs.


> Cloud companies found a way to directly monetize developers' love of complexity and overengineering. It's genius.

The developers' tastes differ a lot. Consider the extreme counterexamples

* the "small code" scene

* people who love to port modern games to C64, Amiga, retro consoles, ...

The situation is in my opinion rather that quite some managers love complexity and "future-proofness", and thus hire developers who have a similar taste.


This comment will be taught in business schools one day.


1. Find pathology.

2. Attach toll booth.

3. Profit.


IME it's idiot managers that go for love of complexity and overengineering - though not all mgrs are idiots.


Especially in small/medium sized businesses, hardware is cheaper than payroll. It's most cost effective to pay people to build features (that drive revenue) while paying a bigger hardware bill than spend engineering hours optimizing hardware usage.

At some scale they'll be a break even/shift


> "I'll just put everything in PostgreSQL and optimize later"

Sounds good to me!


"I'll just drink and eat heavily in my 20s and worry about exercising later"

Or you could... not?


To what benefit? In my (early) twenties, I could do that without consequence. But what benefit is there to not doing it while you can? (I mean, assuming you're doing so at a non dangerous level)


> In my (early) twenties, I could do that without consequence

I mean, maybe for you, anecdotally. Anecdotally, I had a friend at 23 who almost died due to his unhealthy drinking and his body basically shutting down. Doctors told him he couldn't drink anymore. And studies have shown that any amount of binge drinking/eating is bad for health outcomes long-term.

You're actually proving my point... people just hand-waive away warnings about future problems, using cherry-picked "evidence" like "hey my friend binge-drank for 30 years and he's fine", or "my father smoked two packs a day for 50 years and he's still around".

Similarly, (ab)use of technology for the wrong thing sets you up for problems later on. Don't do that and you avoid the problems later on.

But nobody wants to listen to the doctor telling them to drink less or give up smoking. Just like nobody wants to listen to good advice from experienced technologists. Sigh.


I'd ask what's the benefit in drinking, but most people see brain destruction and increased risk of doing something stupid as "fun".


Alcohol also reduces the level of social awkwardness quite a bit that many people in the software engineering field exhibit. Which can be quite helpful in making new friends or finding a partner. I'd expect a significant percentage of people would not have been with their current partner without alcohol.


You know people drink and exercise at the same time right? :)


Agreed. Way easier to optimize this one thing than redesigning your application every few years...


> They think, it's fine to be inefficient, I'll just get a bigger X. And we are now blessed with technology and solutions that make that a possibility, when for the longest time it wasn't.

Unfortunately, the culture in my group is basically this. We never address things until they're dire inefficiencies; thus, I spend a nontrivial amount of my time addressing workloads that take 10+ hours but need to be faster (and often with minor changes are cut down to 2 hours).

Having a culture of "I'll just throw more hardware at it" / "If it takes an hour or ten, I still am going to run it and get the results tomorrow" is costing ridiculous amounts of money and lost time. It's such an uphill battle to try to change that culture.


In my fuzzy sentimental memory, software bloat seemed to always self-regulate to 5-10 lines per second of logs or 2-3 screen flickers per minute, whichever applicable. Things go slower and optimizations come about, go faster and another Midleware.js comes to be.

My working theory is this is the common denominator max comfortable flickering rate for developers. There's got to be sweet spot for UI, how do I say, fluidity. And that fluid-ness usually aren't controlled for dev tools, tools themselves must evolve to stay at comfortable refresh-rate, by natural selection. That shouldn't be too weird of an idea.


I don’t know if you’ve ever had to build an MVP at an early stage company


You have to love the revival of things like "Let's render the html on the server!", "Lets create the whole page with just one request!", "Let's use SQL, and perhaps just ONE server!", "Try using a server in your office to save on cloud bills", "Put all your microservices in one process so they are cheap to run and easy to deploy!".


> "Try using a server in your office to save on cloud bills"

and then company stops working when that server dies.


You could also say the same for the cloud when your internet is not working.


It is easy to setup fault tolerant, simple, low maintenance and cheap for small traffic and data infra in the cloud (e.g. AppEngine+CloudSQL) where provider will provide 99.99% availability.

Internet not working is usually can be mitigated through multiple providers, cellular included


You can setup redundant local servers too.


yes, and then you need to have engineer with $300k salary to support your app/postgresql cluster, load balancer, fallack, backups, security, etc with unknown quality.


So it seems I'm severly underpaid and underemployed. Seriously I'm wasting my time reading HN.


If you can run such infra with 99.99% availability, you probably should be actively sending resumes on multiple openings to increase your income.


Er. We did that years ago and it was perfectly normal. Some sort of load balancer appliance (I honestly kinda liked the F5s) on the front, your choice of compute cluster behind it running at least 2 copies of each service (usually on 2 separate hosts). I could be in a bubble but I was always under the impression that that was common, and it never seemed terribly hard.


Load balancer for app is easy part, harder part is to configure replication and fail over for postgres.


So you are saying when I am able to configure postgres I can earn a $300K salary? Sorry but this is utter bullshit.


I do find it fascinating that we are entering a time now where there are engineers who have _only_ ever used cloud services - and that they can't conceive of a era where _we used to have to do all that stuff ourselves on-prem_.

BTW this isn't a criticism - just an observation - cloud has been around now for that long. Still feels "new" to me :D


I worked in the shop which had all of these ops (app cluster, DB cluster) on premise, and they had 6 high paid people serving this, not just one, and I think there are many companies with their own sites like that.

So, speak for yourself.


> So you are saying when I am able to configure postgres I can earn a $300K salary? Sorry but this is utter bullshit.

did you ever try to configure postgres cluster with failover, and also backup, recovery, monitoring?

Also, I said it is hardest part, but you still need to know your load balancer, security, monitoring, updates, app deployment and configuration stuff.

Yes, this is big chunk of knowledge which in my understanding you don't have, and the topic is that this all worry free replaceable by AppEngine+CloudSql for $100/m if your traffic/data are not very large.


> yes, and then you need to have engineer with $300k salary to support your app/postgresql cluster, load balancer, fallack, backups, security, etc with unknown quality.

So? Cloud engineers are paid $300k/a to know all the 100s of little intricacies needed with putting your production compute nodes into a cloud.

You aren't simply going to be able to move your shit to cloud-only services and fire all the network/server/infra people without replacing them with cloud people.

Whether you're hosting locally, on the cloud, or somewhere in between, you need to pay a salary to someone to make sure that everything stays up.

The trade-off is that workloads and type of work differ between businesses: on one extreme a single technical person can move everything to cloud services and manage that f/time and on the other extreme a single technical person can move everything to local and manage that f/time.

Where the business is on that spectrum dictates what they should be doing to get the most bang for their buck.

Either way, no matter where you are on that spectrum, you are going to have at least a single technical person on staff. Asking for that crew to be cloud experts in addition to local-infra experts is going to be more expensive than hiring local-infra experts.

The TLDR is: unless you have actual concrete evidence that moving to (or starting with) a cloud solution is cheaper over a single financial year, it probably isn't.


> You aren't simply going to be able to move your shit to cloud-only services and fire all the network/server/infra people without replacing them with cloud people.

the point is that for simple app+db you can: AppEngine+CloudSql will provide full automation on few button clicks. You probably will need one "IT guy".


Some of the points mentioned are sensible, but others are horrible (shift joins to the application) or simply nonsensical, like this:

> If suboptimal methods (Nested Loop, Hash Join, Merge Join, etc.) are chosen for joins, this can lead to unnecessarily high resource consumption.

Pray tell, what join strategies are left to use? Also, none of those are suboptimal, given the right schema and settings. Nested loop is fine, provided one side of the join is small.

What isn’t mentioned (perhaps in a future post it will be) is how table and column statistics can affect performance. That’s how you get nested loops when a merge join would actually be more performant.


Just to push against the crowd here pooh poohing client-side joins: this is a standard technique for scaling when you are stuck (for the time being) with a RDB and is definitely not in general "horrible". It does not mean pulling entire tables to the client then joining there. It means splitting a query that in effect walks down a chain of pointers (FKs) between tables into separate client-initiated queries. There are several benefits to this: 1. makes the design "nosql-ready" for when you design out the RDB. 2. Makes all DB queries inherently indexed and quick, allowing easy perf analysis (any non-quick query must be a bug) and alleviating connection pool starvation issues. 3. Prevents pesky devs from adding expensive queries under the radar. 4. allows more ready deployment of caching solutions. 5. recognizes that client resources scale with the number of clients, and are paid for by the client while RDB resources generally scale with difficulty and $$$, and are paid for by you.


Maybe it means (trying to read charitably) that those are the methods, and if a suboptimal one of them (for a given query, schema, etc.) is chosen, then ...

Or it's just a continuation of the first point: any non-application join is suboptimal. (Erm..)

Edit: in fact, re-reading TFA, I think it is the latter or something like it - because the shift to application logic is mentioned as a 'countermeasure' for the 'suboptimal join methods'. I think it means deal with sometimes getting a bad plan by taking control and doing it yourself in the application. (I don't agree, just trying to understand.)


The article says "Shift logic to the application level to reduce database load."

(Emphasis mine.)

It may have been edited since you commented (perhaps because of your comment), but it seems like they aren't actually advocating to do joins at the application level.


I worked at a company with an extremely complex optimization application with more business rules than any human being could ever comprehend. The logic has to go somewhere. Either the database or the interface or the application itself I guess. There are pro/cons to all. Some of the logic is hard to follow in SQL compared to a general purpose language. The vendor for us of course did some kind of blend where a lot of SQL packages do certain validations and so on. The optimization problem reads a bunch of .csv files from the database and then has its own rules it uses to build the math problem for the mathematical solver.


I could’ve sworn it said joins, but I’m not positive.

Tbf this is what Prisma does, albeit it joins at its own layer, then passes results back to the caller. Still horrifying.


It just means PostgreSQL can choose the wrong one for a specific instance, not that those are universally wrong.

"PostgreSQL chosing a suboptimal join methods (a 'join method' is Nested Join, Hash Join, Merge Join, etc., each of which is optimized for a specific situation) leads to unnecessarily high resource consumption.

"[And unlike virtually every other DBMS, PostgreSQL doesn't support specifying which method is used for a particular join. So even if it happen relatively infrequently, it's very difficult to correct a situation where it does choose the wrong one.]"


> PostgreSQL doesn't support specifying which method is used for a particular join

Not directly, no. You can make some of them possible / impossible depending on your indexing and server settings, as well as the query itself. For example, if the smaller side of the join can't fix into `work_mem` – which defaults to 4 MiB – you won't get a hash join. You can either create smaller tables, more restrictive join conditions, or increase the size of `work_mem` (but be careful not to blow up the DB, since that's per connection).


And you can use the enable_* sledgehammers as well.


If you're joining in the application, why use a relational database?


Just because you're joining in the application for one code path doesn't mean all of the code paths are doing the same thing. There are certainly valid use-cases for joining in the application because application layer can often times be horizontally scaled more cheaply/easily than the persistence layer.

So in a scenario where you have plenty of network IO being under utilized for your database, it's not unimaginable where joining in the application layer would be preferable.


FWIW the article never says to join at the app level. GP read wrong or something. The article says to shift logic to the application.


Ah - fair enough. I knew I should've checked the article first before replying!


This may start to depend on exactly what you're willing to call relational, but because you still want the schema & foreign key constraints for example enforced in the database, you're just going to do the actual mechanics of joining the data in the application?

    join_users_and_widgets(
      run_sql('select * from users where account_id = :1', req.account_id),
      run_sql('select * from widgets'),
    )
for example, where the db still enforces that `widgets.owner_id` is a fk to `users.id` say. And `join_users_and_widgets` only has to know the current version of the schema, all rows have the same columns as it were.


This is so true. If you just want the data then use a different store (e.g. NoSQL) and everything will be quicker, lighter weight and slower to develop…


What does "lighter weight" mean?

Smaller install size? Smaller memory footprint? Schemaless?


Sometimes it's better to do things in the application instead of caking it in an sql query.

An example would be a subquery that yields a large temp table, only to be pruned down to a single row.

Your choice is to either have understanding of the query optimizer and formulate a query that forces the subquery with a smaller temp table first, or you can bring it back to the application. Either approach works, but in once case you need to have deep-ish understanding of the optimizer and SQL, and in the other not so much (at the cost of an RTT.)


Bringing more data to the app so that the app filters it instead of the DB server is always wrong.

An app with such solution is a sign of mutliple issues that either the DB schema is wrong, or that the devs don't know enough Sql, or both.

Sql has been there for decades, much less longer than many of the actual dev framework. PostgreSQL (and other DBMS) have also been there for decades, they have perfected the art of fetching data.


Nothing is ever "always wrong". It may be a sign that a little more thought can lead to a more maintainable/quick/efficient solution, but it may also be that under the circumstances an approach that would be the wrong one if applied widely actually is the correct tradeoff between those factors in this case.


> but in once case you need to have deep-ish understanding of the optimizer and SQL

I will never understand why devs think they need to understand their chosen application well, but not SQL. Your application relies on SQL, so why shouldn't you deeply understand it? Relational algebra isn't that complex compared to other things we take for granted as necessary.


One issue I frequently ponder is: What would happen if we gave a modern computer to a developer 30 or 40 years ago?

They'd be overwhelmed by the instruction set I suppose, but how large a business could they run on a modern desktop?


You know, some of us are still alive today :)


I agree that most DB instances are grossly underutilized (or abused performance wise).

But an "entire company" is a tad crude unit of measure when talking about optimization.

Good point though. I've seen companies create a completely new replicated DB instance for every little pet project they start. Crazy. Nice if you're a cloud provider though.


> Improve database schemas, e.g., through normalization, to promote efficient filtering.

Cries in ActiveRecord type columns.

In less sad news, I recently learned that Postgres can have serious problems with joining a lot of tables. I looked at a join of several tables (table A 80MM, table B 30MM, table C 1MM rows) and instead of using the WHERE parameter to reduce the result set of table A to a few thousands, it fetched all rows from table C first.

The applied heuristic in this case apparently being "join tables by order of lower row count".

Setting "join_collapse_limit" to 1 forces the query planner to join the table in exactly the order as they appear in the query and that solved my performance issues neatly.


> The applied heuristic in this case apparently being "join tables by order of lower row count".

Not really - it's all cost based analysis. There are a few main culprits that might be at work:

1) Selectivity estimates can get very off if you have cross-column correlations or your joins are over multiple aggregations / DISTINCT or such.

2) Join order planning is exponential in the number of joins. Once there are more than geqo_threshold (default = 12) joins, we don't explore the whole space anymore.

3) If you have more than join_collapse_limit / from_collapse_limit joins, subqueries in the from list, we'll not flatten them anymore, to prevent the join ordering problem to become even harder. It might be worth trying to just increase them with your query.


Interesting info. Thanks.

In my particular case it doesn't help that I only have limited control over the query and the query designer has a bias towards large, complicated db functions.


Now that CTEs are no longer optimization fences in recent versions of Postgres, is there a way to somehow manually declare an optimization fence around a certain region of the query?

I've had similar problems in Snowflake with the query planner taking a bad path, and no obvious way to hint otherwise.


Functions are hard barriers for optimization and I don't see that changing any time soon.

Which goes even so far that EXPLAIN doesn't work on them and you just get a black box of "Function call to ..." in the execution plan.

Using auto_explain is the only workaround known to me for that.


But that's not the case when functions are inlined.


Interesting. I've never seen this though. I guess that only happens for simple cases?


WITH MATERIALIZED restores the fence.


Whatever is spent optimizing this to two cores is hours wasted when you could expand. Developer time is still cheaper than Cloud Cores or Ram. Yes you should write SQL queries with basic best practices and have a schema that is just right between performance and normalized.

Not mentioned was table size. You can do a lot of odd things with < 100 million rows of data and get away with it.


Yes - I think the overall headline of this is correct (most people should just use PostgreSQL) but they should NOT go into five zoom levels of detail on optimization or anything other than really understanding and utilizing knowledge of how a relational database works (utilizing indexes etc.)


Nothing in the article was in the weeds, and you can do all of it in RDS as well if you’d like (pgstat is an available extension, I think enabled by default).


Not entirely clear from the article/blog - is this an RDS instance (equivalent) or EC2 instance (equivalent) in a cloud with storage attached?

I didn't think it was possible to edit postgres.conf on a RDS type managed database. But the author also talks about scaling up quickly as performance walls are hit while they work on optimizing stuff.


In RDS this would be possible via Parameter/Option groups


Nice.

I do something similar with low cost instances. I use Go and I get a great bang for my buck (performance to dx to costs).

In my case I also leverage the cheap access to gobs of SSD or nvme. I use super cheap VMs for the app servers and they come with 10-20gb disk. I use this as KV embedded cache. For a few bucks a month I’m able to pump out 7-10k rps at load (synthetic but still).

I have a low end pg instance that’s able to pump out 2k tps (synthetic) so with the little bit of caching I’m able to save some costs and also offer good performance. All under $100 a month.

I hope when my business does take off I will have a good base to expand that I can get far without spending tons of money on inefficient compute.


> Shift logic to the application level to reduce database load.

I’m not convinced by this. What type of logic are we talking about?


This makes no sense and is typically the opposite of what I'd build. The more business logic (various validation rules and processing) you have in the database (via stored procedures) the less problems you have, not least because all that processing becomes ACID.

The reducing load on the database claim is nonsense since the application would have to query the database anyway and less efficiently that the database could, avoiding numerous round trips between application and database.

If you implement business logic in front end applications, you're asking for trouble. Many places I've worked didn't allow software outside the database to do anything more than call a stored procedure.

Application servers did come into fashion at one point, with predictable results: subtle timing bugs, inconsistent processing with different results depending on which application server processed the data, a general difficulty in finding the source of bugs, security problems.

Everyone would be a lot less stressed if they did everything in the database.


That is the exact opposite of my experience. I try very hard to have the database as dumb as possible.

- Having mixed workloads on the database servers make it almost impossible to profile and tune. By definition, a database answers a lot of queries from a lot of clients. If these queries are hybrid data fetching / compute, then it's pretty much impossible to make sense of any system metrics.

- Having hybrid workloads usually lead to considering DB servers as black boxes (because you cannot make sense of their resource usage easily). This in turn leads to having hyper pumped up DB servers with both crazy RAM, crazy CPU, crazy caches, crazy SSDs and crazy HDDs.

- DB vendors of course understand the previous point, thus why Oracle switched to a "pay per core on the machine" model...

- The DB logic layer ecosystem is most of the time subpar compared to a more traditional programming language. If you depend on your DB server performing a lot of work, you will want to make these awesome queries accessible. That is, you will want stored procs, extensions, etc. Dealing with those is a huge pain to test, version, update...

- Overall all RDBMS have special SQL extensions all over the place. You're probably using tons of those without even knowing it. With time, vendor locking will become an immense burden and you'll be forced to pay millions to Oracle to keep your business running.


Oracle, DB2 or SQLServer?

>The reducing load on the database claim is nonsense since the application would have to query the database anyway and less efficiently that the database could, avoiding numerous round trips between application and database.

Ah, but by shifting the logic to the app, I can now query multiple databases (of disparate types even! vector, document, object, relational, etc.), each optimized for its particular data and query types and combine the result. That's very....not easy in stored procedures. I can also scale out different tiers of the application based on its needs (heavy compute, gpu-based, etc.). How do you do that when everything is a stored proc on a database?


Well depends on the requirements, does it? I'm not suggesting you try to stick all your processing of all things inside a database, I'm saying put all critical processing of data stored in said database within it. Put all related data in the same database where is makes sense. Whatever data doesn't belong in that database can be merely referenced from it.

> I can now query multiple databases

This is another thing I disagree with. Wherever possible related data should be centralised to improve ACID properties and to create a reliable single "source of truth". I know this is unfashionable. with everyone claiming they need massive scale (which does cause some issues with this design), but much like microservices, it ends up being mostly busy work for programmers and related professions.

Having said that, multiple databases, if largely orthogonal to each other present no real problems, and the same design concept is applied to each.


FWIW I think I agree with you if you consider "critical processing of data" data validations (IME via constraints, checks, triggers, etc...). I think where things can really go wrong is when the DB accepts anything the application layer thinks is valid.

My favorite ORM says the same thing: https://sequel.jeremyevans.net/rdoc/files/doc/validations_rd...

I wouldn't be surprised if more people agree with that as well, maybe just using different terms.


> I think where things can really go wrong is when the DB accepts anything the application layer thinks is valid.

No, my view is the opposite of that: the database doesn't allow anything invalid to enter the database.


Hmm well if I understand you correctly that's what I was saying as well:

Things go wrong when the DB trusts the application layer (rather than doing its own validation) -- e.g. the DB should control what is valid data, not the application layer.


The problem with having business logic in the DB is not a performance problem, it's a maintainability, testing, and documentation problem.


You've demonstrated the real issue here, but not the one you claim, it's that programmers have no idea how to use (SQL) databases or how they work.

The idea that software has those negative characteristics merely because it resides within a database, is laughable.


All of the PLSQL dialects are pretty much universally mediocre at best. They don’t (de-)compose well, they don’t version-control well, they don’t document well, they don’t debug well — they pretty much lack any of the learnings and tooling that occurred in application languages over the last 20 years. You’re basically dealing with all of the problems of a poorly designed DSL made by your favorite coworker.

You could do something more sane like running lang-extensions on the db so you have access to a “real” language to specify your business logic in, but then you haven’t really changed anything except to run your app logic on the DB server and doesn’t really have anything to do with the RDBMS at that point.

Moving business logic into the schema/queries can be fine, but you still have the issues that you can’t document it particularly well, can’t version control it well, can’t test it well, can’t trust your performance tuning well, etc. Which all derives from DBMS expecting to be treated as “living systems”, and has nothing to do with knowing/not knowing SQL.

And the problem with modifying living systems is that you really don’t want to be fucking around with it because anything could potentially send it reeling (see: surgeons). You really want it to be stable… so moving your unstable business logic out is almost a necessity


You're making my point here. RDBMSes are not any more difficult or mysterious than any other software except for the fact that you're more familiar with that software and language, thus your 'doesn't X well'.

What you will find though is any sufficiently complicated business application software implementing an ad hoc informally-specified bug-ridden slow implementation of half of what the RDBMS is giving you for free.


> except for the fact that you're more familiar with that software and language, thus your 'doesn't X well'.

That’s exactly not my point. My point is that, knowing SQL, knowing what RDBMS’s have and what they are, RDBMS’s are a bad place to stuff your business logic. You have a variety of options that don’t work well. Schemas/queries are the only relevant place where RDBMS’s excel, and because it’s a living system, you don’t really want something so unstable as business logic to permeate through your schema significantly.

It’s not a matter of familiarity. It’s the wrong tool for the job being discussed.


What's not a "living system"? You can update RDBMSes much more cleanly and reliably than almost all other software. Stored procedures (in a variety of languages) and the like do not necessitate changes to your data schema and the data that ends up in your schema belongs there, regardless of how its produced. That is a data design issue.

Being able to roll forward and back with code and data (including an updated schema) that is synchronised and works is a huge win. In most places where I've worked with serious RDBMS systems developers have no access whatsoever to production. The fact that this isn't the case universally illustrates the strength of these RDBMS approaches.


why aren't stored procedures on a DB not able to be version controlled well? People already version control database schemas and migrations, why would stored procedures not also be able to be version controlled just as well? You can test it by creating shadow databases (what Prisma does) to make sure all migrations can be done on an empty database just fine, add test data, call the procedure, and check the output. Genuinely please do enlighten me as to why this isn't possible or doable, I would love to learn more about this part of the field.


A lot of the stuff I see people writing application code for doesn't require any more complex SQL features than joins, aggregations and filters. i.e the stuff plain old SQL is really good at.


That’s just born of either psychosis or naivety; but even when one does know SQL, there’s more to consider. RDBMS’s are awkward systems to work with, and SQL knowledge isn’t the issue

moving business logic into schema/queries is usually fine, (and perhaps the only sane place for the DB to directly interact with business logic) though schema changes has all the issues described. Queries get to live with your application code, and so can be treated sanely, but you only really get to a good setup with a decent query builder library (ORM-haters smashing strings together like an ape is naturally full of issues, and ORM-lovers get all the fun of table/object model-mismatch and poor query generation). But then you run into the issue that every database library tries to be database-agnostic, and you lose access to all the fancy features your database provides


> but you only really get to a good setup with a decent query builder library (ORM-haters smashing strings together like an ape is naturally full of issues, and ORM-lovers get all the fun of table/object model-mismatch and poor query generation). But then you run into the issue that every database library tries to be database-agnostic, and you lose access to all the fancy features your database provides

I've found that smashing strings together works excellently so long as:

- You use something like JavaScript's template strings so that interpolated fragments are inline

- You have named placeholder support (so that you're not trying to count parameter placements which gets hugely complex for optional parameters).

- You mainly use it for SELECTS and have minimal query builder for INSERTs/UPDATEs/UPSERTs

And you get full access to all of the database's features.


If you’re maintaining a layer between construction and finalization (the query builder), then you’re not really in the string smashing group I was describing. The main thing I was trying to get at there is no one should make the mistake of “ORMs are shit, I can just write SQL strings directly” because that road is plagued with problems (because SQL is not a language that composes well..).

And it’s not just named parameters; with a sane query builder you don’t need to totally rewrite things when dealing with subqueries vs top-level, aggregations, etc.

As long as you don’t go so far as to introduce the horror of an ORM


> The main thing I was trying to get at there is no one should make the mistake of “ORMs are shit, I can just write SQL strings directly”

This has actually been the road I've been heading down on my personal projects. If I were to use this for work I would use something like Kysely so I can have type safety and make sure the application layer is updated to support every migration, but I've taken time to really practice SQL and write queries & migrations manually and avoid ORMs completely. I love writing SQL by hand, however I know that doesn't scale without some automated checking going on.


This is a common pattern with NoSQL databases, but really, the author of this article should just spend more than $10 a month on their Postgres instance.


Here is a scenario that I've encountered a number of times.

We needed to serve up some data for an API endpoint. Great! We'll write a SQL query for it. Just need a couple of joins and where-clause parameters -- it will be speedy, and easy to maintain.

Before long, we are asked to make our query more robust, including a couple of optional parameters. Some null-coalescing in our join expressions and where-clauses is introduced. Performance immediately suffers. (I've seen the same kind of impact when introducing unions or function calls within such expressions -- the query plan can get very messy when you try to scale in this manner.)

So what do we do? Restructuring our data (maybe with a materialized view or something) is one option.

A reasonable alternative is to break up the single query into a few more specialized queries, and then use application logic to inspect parameters and choose the best query to use for each request.

Plenty of trade-offs to consider in both cases.


I don't think there's any trade-offs in that situation. It doesn't make sense to have your queries handle optional parameters when you have that info already precent in the application layer and can choose the specialized query. I suppose if you were really rushed you could end up in that starting scenario.


How to operate on a tiny database server: shift all work from the database server to the application servers.

See also: webscale databases


The last place I worked had some JSON columns and doing lots of JSON operations tended to eat up all the Postgres CPU (the servers were generally 16-32 cores)


We had that too. Partial expression indexes mitigated most of the performance problems and CPU load.

We then created consistently-named functions that made the appropriate JSON query. Then an expression index on the function. If the key was missing (NULL result), it wasn't added to the index, and we stripped all JSON nulls before storing.

It wasn't necessarily pretty, but neither was the client's source data, which we had little to no control over. Best of a bad situation.

Jsonb columns required more steps afforded us more flexibility and timeliness when the client kept asking for "one more feature" at the last minute.


Native columns are better/faster.


Not everything fits into those.


Not with that attitude.


Easier to scale application CPU usage horizontally than a single instance DB.


Even easier (and much less error prone when it comes to ACID properties) would be to scale the single instance DB vertically instead.


well, no. This is because there's a cap (albeit very high these days) scaling vertically - but the bigger advantage is you can also variably scale the application. It's a much bigger pain to scale up and down your DB instance with traffic. application pods can just be shut down etc.

Also you can just wrap your queries in a session/transaction on the client, so no worries there. But yes, don't do this until you need to.


Some companies do large amounts of stored procedures for even routine tasks.


Ah I see. I thought it meant taking away some merge logic out of the database layer.


Oh I think they mean that, too. I am just trying to be charitable by calling out times where it can be a lift to yank something out of the DB.

This application side join stuff was real popular about 15 years ago with Mongo and similar stores that didnt support joins.

It is one of the many things that separated RethinkDB from the document db crowd at that time.


I'm picturing that meme with the guy sweating over pressing a button.

Add new features that generates revenue, or optimize the database.

Hmmm.

Of course the response is that you want to optimize the database so that when you do eventually add new features that generate revenue, you can scale to meet the demand.

Or is it...


This meme is a bit too misleading.

In reality tons of startups have a shitty product with zillions of bugs because "move fast break things". Then later on they don't survive because the only reason people even pay for them is because the other startups are spending silly money on it. When the well dries up, it all comes crumbling down.

Turns out that startup's software is too shitty and simply not good enough to survive a cost cutting round.

People love to pretend that just adding more features always beats everything else, but it's not as clear cut as people make it seem.


We're talking about premature optimization, not necessarily shitty/buggy software.


Does anyone have any experience with Pgtune, i.e has if made a difference in performance?

https://pgtune.leopard.in.ua/


If you want a data store, just use a NoSQL server.

If you want an RDBMS, learn how to use it.

At least with NoSQL you know you're trading ignorance for ease of use.


This gives me hope, my current idea I work on is osal planned for a two-core Postgres cloud instance from Hetzner.


> This gives me hope, my current idea I work on is osal planned for a two-core Postgres cloud instance from Hetzner.

Keep in mind that years ago having two cpus (single core cpus, with incredibly worse performance) was the pinnacle of technology, and still people managed to handle a decent amount of data.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: