Hacker News new | past | comments | ask | show | jobs | submit login
Ask HN: Server-side web apps – is the database still a bottleneck?
127 points by vanilla-almond 12 months ago | hide | past | web | favorite | 112 comments
I've read many times in comments here on Hacker News that the speed of your database or disk access is the biggest bottleneck for your server-side web app. Therefore, the speed of your server-side programming language is not important (or less important).

Does this still hold true? Given that SSDs are now commonly used by many hosting providers - is database/disk access still a bottleneck for server-side web apps?

In short, yes.

The database is the bottleneck because it's much harder to scale than applications.

The path of evolution in the industry:

1. Stateful application - usually only 1 server, not distributed at all. It's very hard to scale.

2. An obvious solution is to make application stateless and having a centralized state. Then the application is very easy to scale and operate. And databases slowly became the bottleneck because there are much more applications than databases.

3. Then not so obvious solution is to split your problem domain into solution contexts, where every context become an application and each application have a database to talk to. So the databases are still not really distributed but it's sort of distributed by your sub-domain of business. (I think it's the industry mainstream or becoming mainstream now)

4. Then the non-obvious solution is to have truly distributed states split across your application. Basically, the goal is you can distribute any object across a whole lot cluster, in a more or less transparency way which lets you treat the cluster as a whole without concern about individual machines. (There are some cutting edge stuff yet to become mainstream, like Orleans / Lasp-lang or just cluster/shared actor like Akka, or just distributed Erlang, etc)

It's an interesting example that short-term solution is in a totally different direction from the powerful solution. To address this in short-term is to extract and centralize the state, while to address this in the long-term is to truly distribute the state.

I want to quibble with you on one thing—the database is often easy to scale, by three orders of magnitude, by configuring it correctly, and then moving it onto dedicated hardware.

Similarly, very few applications need more performance than you can get by scaling vertically and having a hot spare replica database.

Although a distributed data store on a few cheaper VPSes/nodes is probably less expensive.

One example of #4 that is making headway in production systems is the Calvin protocol (a RAFT based write-ahead-log with lock-free deterministic transaction application). It's implemented by my employer FaunaDB. You can learn more here https://fauna.com/blog/consistency-without-clocks-faunadb-tr...

An easy and often effective solution is read replicas.

Operations that need ACID get ACID and operations that don't, don't.

And every production-grade DB make read replicas easy.

Yes, multi-master is the interesting and challenging thing to us, but single-master still has a lot of juice.

A variant of no 3 is used for some large scale telecom equipment supporting multi millions of attached users with serious uptime requirements. State is distributed and handled locally and replicated to a sibling node for fallback protection in case a node goes down. The replication is dynamic so new sibling relationships are established when nodes come and go. There is also some more persistent state (like billing information) that is used to recover in case of total node failure but a lot of the transient state can be rebuilt when users are forced to reattach

actually before 3 and 4 you can scale reads which often is the bottleneck and most major databases can scale reads pretty good these days.

For simple "web apps", typically its a poor understanding of SQL that is the bottle neck. Since most web frameworks include an ORM, many developers don't even know much SQL any more and treat the database as a simple 'object' that you can call methods on and iterate through. Case in point: you're building a dash board for a web based store to show details about the 5 most recent orders. But each order references 5 other items (as a simple example). And each item references an SKU# that can come from 5 manufacturers.

So an optimized query set from an ORM will end up querying the DB like this: First it selects the 5 rows from the order table, but since you left out any join clauses, it then runs 5 more selects on each of the rows returned from the order table to pull the related rows from the items table. So we are now up to 6 queries.

But each row of the item table references 5 manufacturers (again, as a simple schema example). So each of the 'item' 'objects' (5 per order) has to then make a query to fetch the sku manufacturer. So that is 25 more select statement queries on top of the original 6. So using the ORM properly and building out the right join clauses, you would have made 1 query. Instead you made 31. Now lets say that DB is not on the same machine, but a different machine on the network. Throw in 2ms - 10ms latency per query and now you are looking at 62 - 310ms of latency to fetch your data. And that is just for 5 'rows' of data. Do the math for a more realistic dataset, 500-1000 rows or so. Now you're looking at minutes of latency for your data to come from the DB.

The worse case I've seen recently was this exact case. To generate an inventory page to show an html table with about 300 rows, the DB was being queried over 300,000 times with a page load of about 7 minutes (and getting longer everyday). I added the proper join clauses to the ORM and one query was issued which would return the result set in about 700ms which put the page load time to about 1 second. A 1 second page load imo is still not very good. But it seemed to impress everyone that with 1 hour of work, a page load went from from 7 minutes to 1 second....

For what ever reason, most people don't understand this anymore. Instead of learning SQL, the line of thought is to upgrade the machine, move to a 'no-sql' DB, introduce some caching layer, and many number of other ideas. But "lets make sure we are querying the DB instead of iterating through it" is never discussed.

You've done a great job of explaining the type of DB interactions that most frequently cause performance issues at my workplace. I don't understand why developers are so unwilling to just learn SQL.

I am inclined to agree. I have searched for the perfect ORM and have decided that objects and relations do not map. And it's really not hard to write:

   db.Query("select foo, bar from baz")
   var baz Baz
   row.Scan(&baz.foo, &baz.bar)
to get an object.

The problem I have with having queries like this in my code is that your schema and your code can get out of sync and these are errors only occuring at runtime.

Therefore, looking for a safer solution, I've also looked at various options in my last Go project, and ended up with SQLBoiler. It supports fetching related documents in a single query and seems to be quite performant.

I apply migrations when the app starts, and have a policy that the code version n-1 should be compatible with schema version n, so that releases can be rolled back safely. (How do you handle replicas? The database gets globally locked and the first replica does the migrations. Yup, you have downtime. Don't change your SQL schema if you want high availability.)

I don't think that an ORM saves you from having to have a versioning policy, schema migrations, and tests for that. If your code expects a data structure that looks like map[string]string{"foo":"bar"} but the database column is now called "quux", you're not going to find what the code is looking for at runtime. The difference is you get an empty string instead of an explicit error.

Ultimately I find it very annoying. When I was at Google I used spanner, and everything in the database was just a protocol message. Want to rename something? Feel free. Only the code cares, the database still uses the tag number. Want to add a field? Code compiled against an older version of the protocol message will just silently ignore it.

In the end, I think bolting on relational semantics to an object database is a lot easier than bolting on object semantics to a relational database. But to do that, you have to undo 50 years of thinking about database storage.

>and everything in the database was just a protocol message.

This is interesting. Do you have any more information on it?

Stored procedures.

Blow them away and recreate them all from source after you run your schema and data migration scripts. Then you'll know at build time whether your queries still match the schema.

It helps to also not be afraid of the star when writing queries. That can keep you safe from missing new columns. You just need to take a bit of care and think about what you're doing.

To keep code+schema in sync, try using liquibase or flyway or similar tools to manage all SQL DDL, end ensure your build+deploy process always runs your liquibase/flyway scripts.

This reminds me of Django’s select_related.

You can get pretty far with select-related and prefetch_related alone, but sometimes there’s a need to break out SQL in order to get past some kind of performance barrier.

I started my career in Django without any SQL background and then moved to a financial company where everything was done in MSSQL.

If I never have to write or read a SQL query again, I would consider that a good thing. Although I had a few fun learning experiences.

If only more people used simple object structures to data mapping. Working on an overly normalized database, with in-database logging on all access via stored procedures for all access, and hitting bottlenecks on a few dozen users.

A "well structured" database can lead to performance issues. Every join and sub-query has a cost and they add up.

The funny thing is that on this project, I largely stepped back from anything to do with the Database because the target was only around 100 users or so on a deployment that could largely be dealt with by faster hardware. Now, stuck trying to scale to more users than that and stuck with layers of procedures/functions that don't perform well at all. I wanted a few loosely coupled tables with some destructured (json) data in them. Wouldn't be having half the issues today.

Also, not a fan of ORMs... it's usually easier to do simpler mapping directly or a simpler data mapper that isn't a more typical full on ORM.

I haven't much experience with stored procedures but I have seen views cause load on a DB.

Almost every app I have worked on had a database that were several tables for a single object and several for the where clause. As good as the table layout seemed to be in terms of logic it just led to slow queries like you said joins and sub-queries come at a cost. Then again, I have seen clients balk at the idea of fixing the database and insist on a more unreadable yet performant query. But its their time and money so I don't fight back.

Sometimes, a really good DBA is what a project needs because so few programmers are good at performant design.

Granted you can get a performance boost from denormalizing, and even maintain integrity, I still find the denormalization I want today won't be the denormalization I want tomorrow.

Moving from denormalization to denormalization can be hard, and you end up with a sui generis application, versus a normalized database where so much has been written about how to deal with problems up to a reasonably large scale, although the document store style does seem to be reasonably popular so maybe I'm missing something.

Depending on what I've needed in the past, I've simply duplicated all data from the RDBMS into denormalized structures for ElasticSearch or MongoDB. It's worked incredibly well for scenarios that are mostly-read where normalized queries are very expensive.

The reason the database is the bottleneck is that in traditional architectures everything can be duplicated and run in parallel except writing to the database which has to be done serially one write at a time. How fast you can write (and commit to "permanent" storage like disc and SSD) is constrained by hardware.

There are ways to alleviate this issue but they have different trade offs. One of the most used pattern is "eventual consistency".

You might want to read about the CAP theorem if you have not heard about it: https://en.wikipedia.org/wiki/CAP_theorem

I've fixed a lot of performance problems over my career.

This is not true, and not what we talk about when we say it's almost always the db.

The usual cause of database problems is a single big query that is causing a slow page, or a lot of medium size queries inside a poorly written loop. Whether it be bad joins, missing indexes, select statements in the select statements, operations in select statements that force the db to run the query line by line instead of as a set (the technical term suddenly escapes me), it's the queries that almost always cause db slowdowns.

It rarely has anything to do with writes, and then in my experience usually had to do with escalating locks, and I haven't seen that problem in ages as hardwares got faster.

There are an awful lot of low-hanging-fruit performance problems out there that are just a result of developers not using the DB correctly—usually doing too much in whatever slow scripting language they're using, rather than letting the DB do as much as it can. Sorting, counting, combining, a bunch of queries in a loop where one or two cleverer ones would do.

I think part of why low-feature, specific-usecase DBs like Mongo can get such traction as a general purpose datastore (when marketed as such, of course, to sell more licenses, no matter how bad an idea it is—looking at you, Neo4j) is because so many devs don't know what a half-decent SQL DB can (and should) do for them in the first place.

Is there an alternative to Neo4j you would recommend, or do you think graph dbs are bad in general?

I have a pet project that works with data that’s very graphy (timetabling app with lots of interdependent events); I tried using postges at first since I’m used to it, but found myself writing really ugly looking recursive queries. Neo4j seems to fit my use case a lot better, and their graphql plugin has been really useful.

N4J's a good option if you have a dense graph (lots of edges) and need to do one of the handful of things it's good at with said graph (shortest path, that sort of thing). I would not recommend using it as a general datastore, even if your data's a little "graphy", because it's not great at stuff you'd think it might be, like "find this subgraph, then return some data from nodes related to each node", performance-wise, and it lacks tons of features something like Postgres has, including simply providing lots of built-in data types. I'd recommend using it in combination with an authoritative datastore if you just need to do some "graphy" things, using it as a secondary datastore just for certain purposes—it'll be safer and likely perform better.

[EDIT] it's also not nearly as well-supported so you'll find a lot of supporting libraries with multi-datastore support don't support N4J yet, or do but only in some crippled or poorly-tested (not widely deployed) fashion.

[EDIT EDIT] it's also not a great fit if you have a lot of constraints on or structure to your graph. At least as of when I used it ~1 year ago it had essentially no support for expressing things like "this type of node should only permit two outbound edges". You can work around that but the solutions will be less safe and/or suck.

Cool, thanks for the info. I was planning on copying some of the data to postgres for certain reporting features that would work better with sql, but I’m not really sure how that will work out yet.

Constraints and hosted multi-datastore solutions aren’t really an issue for me, but the type of query you mentioned about subgraphs might be. One query I know I’ll need is one that can quickly identify nodes with lots of neighbors that have a specific property.

The data is definitely very graphy, so I really wanted a graph database as the primary. Dgraph seems pretty good (advertises itself as more reliable and faster), but it sounds like graph databases might just be kind of oversold in general, and that I might want to reconsider. One other issue is speed of development, which the graphql plugin really increases, so I’ll probably stick with Neo4j for now. Swapping dbs theoretically shouldn’t be THAT painful if I don’t care about migrating data and keep the graphql layer the same.

> This is not true, and not what we talk about when we say it's almost always the db.

It depends who's talking and what they are talking about.

If we're talking about fixing slow website, then sure, bad big ugly queries are one dominant issue, but not the only one. If we're talking about architecture of a greenfield project, then balancing between concurrent writes and consistency is going to be your ultimate problem, everything else can be "duplicated" (read-duplicated DB servers, caching strategies, multiple apps server, etc...).

And who ever needs to think about that stuff? A tiny number of companies.

Again, I just think you're not considering the vast majority of us here work in terms of thousands, 10s of thousands or millions of users, not the sort of scales where anything you're saying ever becomes relevant.

For most of us, if you spend any time thinking about that, you're utterly wasting it. It's pointless. Greenfield or not.

I have no idea where the mindset comes from that only large companies have large datasets.

In my experience, a dataset is considered large if its the largest dataset the company has ever seen. It could be 2GB and be considered "large".

Replying to mattmanser in the sibling comment since I cannot replying to it directly:

It comes in faster than you think. One of the company I work for was a small company, at the time I think we were 4 or 5 including the 2 founders, they sell some specific information as a service through APIs. They charge per call, so you have to update the customer funds in their account on each API call. Yes, people can want and make lots of APIs call very fast, and in a business which charges per API call, this is a good thing, but you have to find strategies and make business decisions on how to handle accepting and replying to new API calls while maintaining their account balance.

Load the remaining customer funds into RAM, and decrement them on each API. Periodically flush the remaining balance to DB at set intervals (say every hour). If the server goes down, well, your customers got free API calls for an hour to compensate them for their trouble. With any reasonable uptime you'll save more on DB traffic than you lose in lost revenues.

(BTW, the "can't reply" issue happens on HN when there's a long comment thread between just two people that's being rapidly updated. If you wait 10 minutes or so you'll be able to reply.)

> Load the remaining customer funds into RAM, and decrement them on each API. Periodically flush the remaining balance to DB at set intervals (say every hour).

So you only ever have one app server?

Sure, this specific issues could be handled other ways, but in general global state is hard.

If you have more than one app server load the customer funds into Redis and use atomic operations to decrement it. Serving out of RAM is going to be orders of magnitude faster than updating disk regardless of how much you spend on network overhead. If you have so many customers that they blow the RAM limit on a Redis box....well, you won't, because that's billions of customers.

You can get surprisingly far on one app server (or sometimes one app server with 2 mirrors for failover). See eg. the latest TechEmpower benchmarks [1], where common Java and C++ web frameworks can serve 300k+ reqs/sec off of bare-metal hardware. The OP indicated that these API requests are basically straight queries anyway, and only need to write to update the billing information. Reads can run incredibly fast (both because they can usually be served out of cache and because they can be mirrored), so if your only bottleneck is the write, take the write out of the critical path.

In general global state is hard. Don't solve general problems, solve specific ones. Atomic counters have a well-understood and highly performant solution.

[1] https://www.techempower.com/benchmarks/#section=data-r17&hw=...

Using a database like redis isn't quite the same as just "loading it into ram".

Also, I never said that you couldn't use a single app server because of performance; as you said, you can handle quite a bit of traffic on a single box, even with slower languages.

I find it hard to believe this actually caused a DB problem, updating an int column on an indexed table should have taken nanoseconds. Far faster than any HTTP calls should have been happening. Sounds like something was very wrong. Someone fucked up with the network causing some severe DB latency or they'd written some really shitty code.

Even if it did, as the other comment says, you could easily work round it. If that was the behaviour there was no good reason to have it utterly perfect.

Depends if autocommit is enabled. If it is (and on many DBs it's the default) then you're telling the DB that by the time the increment completes, it should be written to durable storage. That's a minimum of one seek, which on a hard disk is about 4ms, which = max 250 TPS. On a SSD you might get 0.1ms seek times = 10k TPS.

It's possible to kill your performance very quickly with databases if you request features that you don't actually need.

That's where decisions around consistency kick in.

You need to decide if you absolutely do not want to serve any API call unless you are sure they've been paid for, in which case you have to create a commit transaction on that account for each call. Or you decide how much you can let a would be rotten customer get away with, and uses queues which leads to eventual consistencies.

Nothing is slower than accessing third party databases you can't optimize. And a near 100% likelihood in any enterprise operation. We still work around this by caching tables where possible, but often we must access the single source of truth regardless of performance.

line by line: table scan?

Databases can handle multiple concurrent writes just fine and have since the 1970's.

Not a charitable interpretation of the parent comment. Even though databases allow concurrent transactions, commiting these transactions with complete sequential consistency is still an inherently serial process, at least insofar as it cannot be generally split across machines.

There are distributed databases that support serializable transactions across clusters of machines.


To the same table? To the same row?

Say you're an ecomm, how do you update inventory on an item "concurrently"? If you're going to say "locks" or "atomic commit", that's exactly what I mean by "you have to serialize your writes".

Just thinking out loud here but the particular case of inventory of an item should be possible to decrement/increment using a PN-Counter CRDT as long as you are fine with eventual consistency. Of course there's a risk of overcommitment which has to be considered but such risks usually exists in commerce any way due to or factors.

Yes, eventual consistency, which is exactly the alternative I mention in my original post.

If you can tolerate eventual consistencies (most problem can), than the database is no longer the slower piece.

How much contention are you expecting?

This is a business prediction/problem and exactly what defines your bottleneck.

I think parent is talking about things like MVCC[1] in contrast with simple read-write locks.

[1] https://en.wikipedia.org/wiki/Multiversion_concurrency_contr...

I think that parent was trying to state that maintaining sequential consistency is a bottleneck for traditional databases.

And that is true if you assume significant contention. Which is not a safe assumption.

If there is no contention, then what is the problem you cannot easily architect around in 2019?


let's put the entire database on a ram disk (or at least the write logs)

What happens when someone trips over the power cord?

That's why you would use non-volatile memory (e.g., NVMe) if you needed a persistent store with that degree of performance.

Redundancy, and snapshots + replay logs

Based on my experience, it's the most typically seen bottleneck. But bottlenecks change over time. You notice slowness somewhere in a website action and you dig into it. Sometimes, it's a new database table that isn't optimized with correct indexing. Sometimes, you end up doing an n^2 or n^3 operation on the backend (sometimes it's on the frontend). Sometimes, you're data is growing at a speed that your single monolith database isn't cutting it and you need to shard or consider a different type of storage.

Bottlenecks change all the time. You fix one, and now suddenly, the bottleneck lies somewhere else. There's almost always a bottleneck somewhere. A database bottleneck is relatively common, though.

From my experience, the bottleneck is mostly at some type of Big-O relevant scale in the code or database that using java vs ruby vs python doesn't make a huge difference.

However, if you're in a situation where speed is absolutely crucial - like high frequency trading (where milliseconds mean millions of dollars) - well, that opens a completely different can of worms.

Just think about the architecture of NYSE. They are supporting billions of trades per day from so many decades.

It is probably one of the more common together with loading too much JavaScript and image content.

However even with that said a modern SQL database can handle A LOT of traffic and I've had discussions about internal CRUD applications with a couple of thousand users at most where people say we must use a document database because relational databases does not scale. Insane.

Of course people can use document databases and similar where it fits but you get a lot of nice things with a traditional rdmbs that you might miss later.

I haven't had a discussion about performance in years, weirdly enough, but people did default to the scalable solutions - even before knowing whether they needed it. Think new applications built using microservices. They also spent nearly two years before releasing something with a fraction of the features of the system it was supposed to replace.

If someone mentions a need for a technology choice based on performance, they had better come up with real numbers. Until they do, clarity, correctness and consistency should be the primary metrics. Don't solve a performance problem before you have the luxury of having a performance problem.

Premature scaling is akin to premature optimization.

On one of my recent projects, the highest use data was stored in DynamoDB. The entire dataset could trivially fit in RAM. So much wasted effort.

The org also over used Spark, event streams, and elasticsearch.

One could argue (rationalize) that using DynamoDB, for instance, for the small stuff builds team competency applicable to the big stuff. Or that using one set of tools reduces dependencies. I'd have to see some case studies. Because from where I was sitting, 90% of our maintenance costs were from mitigating bad sacred cow design decisions.

I've worked on similar metastatic tech debt based on NoSQL. Much as I love Redis, it's not a duck (fly, swim, walk).

I can understand the competency stuff. I was really blown away with React when that arrived and used it early, probably too early. React itself has been remarkably stable and consistent but everything around was chaotic for a couple of years. Anyway, we went all in in the department and builds all our front ends with React. I would argue it is a great fit for around half our applications as they are more complex but the rest could easily have been built with very little javascript enhancement. However now people can move between projects more easily and building something with React when you are comfortable with it and has a nice base is not slower than something else.

So I understand that people want to try new things but damn I wonder if I sounded the same when I wanted to start using React.

- "We can just use couchdb because then we won't have any problems with scale and we will never have to have any downtime because schema updates".

....yeah but that comes at a cost. You need to handle those schema updates in code. You need to handle those conflicts that arise from eventual consistency in code.

Anyway, I like the idea where you have a limited innovation budget where you can go for a new type of db but then you select other technologies that you are comfortable with and are true and tested. Not "Yeah, we will build this in a completely new way and we have no experience with docker, kubernetes, nodejs, document databases, vue.js, Elastic stack, prometheus or kibana" but this modern way of building things means everything will go much faster. Sometimes you have to go that route but the learning time can be pretty rough.

That largely depends on the usage patterns of the app in question.

Write operations can be particularly expensive, read operations much less so, especially with application-level caching. Even as for write operations or notoriously expensive read operations such as full table scans modern RDBMS are optimised to a degree that for most applications database performance shouldn't be an issue.

While traditional hard drives competed with the network for the title of "system component with the lowest bandwidth" (see http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keyn...) accessing data on an SSD is way faster than accessing a resource via network.

So, today the technical bottleneck (as in "component with the lowest bandwidth") in a typical web app system architecture is the network.

I've spent my career answering this question at myriad companies. In a few cases, the database was actually a bottleneck. In the vast majority of cases, the database was just being used very poorly and was no longer a bottleneck once used correctly. The exceptions to this usually involve exotic data models, which isn't most web apps.

Of the few cases where the database was the bottleneck in practice, I cannot think of a single cases where it should have been the bottleneck in theory. The limitations were always a property of the specific database implementation being used, not intrinsic to databases as a class of software. For example, most open source databases have notoriously poor storage throughput as a product of their architecture; faster storage definitely helps but it often solves a problem that may not exist in other database implementations. In these cases the problem is an architectural mismatch.

Sometimes architectural mismatch is an engineering problem. A handful of workloads and data models are currently only possible with custom database engines e.g. high-throughput mixed workloads or data models that require continuous load shifting. As a practical matter, the exotic cases that require a custom database engine are also the cases where no company should try to build it themselves without a deep expert -- they are considered technically exotic for good reasons.

Never underestimate the power of a properly configured server with 1TB of RAM running MySQL off of SSD.

In my experience, people used to cloud services misjudge the power of correctly tuned modern hardware by two-three orders of magnitude.

Giving a general answer to your question is hard in the absence of further information.

Heh, most systems could fit their entire working dataset in 1TB and never worry about I/O.

Except for logs, of course. Those things are incredible disk space hogs. Also, serialising data to logs can surprisingly expensive. (Our ES logging cluster has tens to low hundreds of TB of storage, and it expires data after roughly 40 days.)

I went the opposite way, from an employer with an insane dedicated server to using cloud services, and I was appalled at how weak the cloud SQL servers are.

Of course, they don´t need hardware expenses, licensing is factored in, administration is orders of magnitude easier and cheaper, so the tradeoff is usually worth it.

Surely this depends entirely on the app.

Is the database a bottle neck for a CRM app that deals with millions of customer records? Probably because everything else is incredibly lightweight. Is it the bottleneck for an app that transcodes video on the server? Absolutely not.

And so on, for every imaginable app.

But the database is often the hardest thing to scale. Scaling up transcoding is childs play by comparison.

Nowadays, in most system architectures, app servers are stateless and can be scaled horizontally. But something needs to hold the state, and that thing will be the bottleneck. Yes, you can horizontally scale some databases, but distributed state is very tricky to get right and will have performance trade-off.

What are you doing? How many writes/s? How much processing are you doing in the app server? Do you have a queue you can offload work onto?

Your question is a bit open-ended.

Scaling your database is generally the hardest part. Any jerk can put 100 app servers behind a load balancer. Scaling data is significantly harder, especially when it comes to writes. There are options to get around this, but they come with compromises.

So, really, it depends on what you're doing. In general, use things you're comfortable with. Is Python slower than Rust? Sure. But do you need 2ms response times from your app server or is 10ms fine, especially when latency between the user and your first load balancer is likely 3-5x that number? Build it, test it, if it's too slow, cache it. If you can't cache it, break out the offending piece and optimize it or put it into a faster language.

Anything can be your bottleneck. Fix one bottleneck, find the next one. That said, databases are often the bottleneck. And with web apps, "the speed of your server-side programming language" is almost never the bottleneck - although the architectural limitations of your framework may be.

I've seen a lot of cases where "the speed of your server-side programming language" was, in fact, the problem—because the developer was doing stuff in it that they should have told their DB to do for them. Basically the speed of the server-side programming language doesn't matter only as long as you're very careful not to, you know, use it to do anything. Which is a realistic goal in many applications, by leaning on one or more DBs more heavily.

In general yes, database will be your bottleneck.

In reality, when database becomes a bottleneck it means you're probably making enough that you can just throw more money at it.

Let me know when this happens - I'll stand in front of it.

In my experience yes.

Example: We struggled with some read-queries taking a long time when aggregating over millions of rows. To solve this, we used triggers to pre-compute aggregate groups in real-time during write, which allowed us to easily achieve sub-second response times. However, this caused a lot of write amplification and now this is becoming a bottleneck. We could probably solve some of these problems by looking into column stores, but that carries it's own downsides (operational complexity, real-time data loading, etc.). Or we could buy faster disks, but that gets expensive quickly.

YMMV but I certainly still experience more database/disk related bottlenecks than application-level ones.

SQL in my experience is not designed for workloads where you want to compute together/return a ton of values from a ton of rows at once. Its good where you need to cherry-pick just a few rows belonging to one user from a databses of millions of row (e.g. CRUD operations).


simple CRUD operations on few rows scale fine

data-science/analytics stuff involving millions of in one query scale rather badly

SQL isn't the problem IMO. The problem is that the implementation we're using (PostgreSQL) is a row-store which is indeed more optimal for the kind of operations you mention.

However, SQL isn't limited to row-stores. There are column-store implementations that are quite amazing for aggregate queries, e.g. Clickhouse. Using one of those would very likely work for us, but my understanding is that loading data into them in real-time is problematic.

How close is real time? Seconds? Minutes?

Also, what type of aggregate?

Could you do something like computing the aggregate groups in background jobs instead, perhaps staggered, to take that out of the critical path?

Yup, that would reduce write IOPs but increase the latency for making the data available to our users.

In my experience with enterprise applications, we hit database bottlenecks only if you are doing aggregate operations over millions of records or firing 'like' queries over varchar(max) columns without full text indexes. Or generating reports with joins over multiple million record tables.

For most crud operations any production ready database will not bottle neck before you have to fine tune your application server and application itself. Non thread safe application code and libraries have caused more performance issues than large join queries.

Yes, the database is still a bottleneck, but that depends on the size of the data and what you're querying for. If your data fits comfortably in memory (and so do any operations you perform over it), then performance is not really an issue. For many businesses — probably the vast majority — all the data could just sit in memory.

From fastest to slowest: web server > database > synchronous database connection.

The bottleneck in my code is rarely the database but rather the interface that is used to it. I'm sure if you had thousands of users running high latency but low utilization queries you could reach the actual limit of the database.

99% of the performance problems I experience involve loading data from systems that don't let me access it directly and instead the data is simply dumped and my application has to load the dump. When you have to delete and reinsert 300000 rows every day and use the naive way of INSERTing the data then you will wait several minutes because the synchronous nature of SQL means that the client has to wait for a response on every statement. An easy way out is to use non-standard features like COPY or INSERT INTO x FROM unnest(...) in postgres. Except they aren't really pretty to use. Your 1 line Domain.save() suddenly turns into a 30 line query just for inserting some data. I still prefer the former for smaller datasets even if it means waiting 2 minutes on an initial import that only happens once because that can be quickly implemented and then you can simply throw the code away if you need something better.

Yes, any server-side language will do. But when you read "database is the most common bottleneck", you could frequently replace it with "database is the least well understood component in the stack". As soon as you stop using the database as a dumb store, you start to realise that your db can handle a lot more than you thought. One has to go beyond what the average ORM offers.

"Therefore, the speed of your server-side programming language is not important (or less important)."

I view this statement as a prediction, in the scientific sense: "If you re-implement a program in a faster language, but have the same database access, the user will not have any visible performance improvements."

Like almost anything else, the answer is, "it depends". It is often true.

However, I have falsified it in some specific instances. It is definitely possible to take a "scripting language", which ranges in the 20x-50x slower than C (a rough, but adequate, approximation of "optimal"), then layer it in a couple more layers of abstraction that may be really nifty and cool and awesome but also adds another 10x to the slowdown (or more... if you're not careful, this term can be unboundedly large), and get to the point that simply generating an HTML page takes a full second of a 3GHz's processor time. I have replaced things that did non-trivial database work with something in a compiled language and gone from "noticeably sluggish" to "essentially instant" without all that much effort.

In one of those cases, significant effort had been expended trying to figure out why the database was so slow, presumably because of the prevalence of this meme, when even the first glance at the output of a profiling tool said that was not where the problem was.

Performance is rarely the first thing to worry about, but I find it's something you do want to at least keep in mind while developing. If you do at least keep it in mind, I suspect the saying is generally true, and the raw speed of the underlying server language won't bother you much. But you are definitely much closer to the point where performance becomes problematic than you start at with a faster language, so you need to be that much more vigilant. People who are lulled into a false sense of security by the idea that the database is always the bottleneck can get blindsided by performance issues in their app.

I think the meme overall is damaging. It's not true enough to make it a core, unquestioned assumption that you can rely on, and if it can't be that, then you might as well simply use the correct answer, which is to use a real profiler and look at the real data. I pretty much always through at least a basic profile pass at any code that isn't just brute scripting.

Most web application code is IO bound. You will get better performance by using more productive language to iterate and create better architecture. Github(Ruby) is noticeably faster than Bitbucket(Java) because engineering time spent on what matters.

I think most slowness is coming from bloated apps trying to push too much code/data over the crappy network. Additionaly it is easy to misconfigure AWS infrastructure.

"You will get better performance by using more productive language to iterate and create better architecture."

And if you blindly follow that, you'll fall into exactly the trap I mentioned, where that "better architecture" involves layering another 10x+ slowdown on top of your already-slow scripting language and before you know it it's 100,000 cycles just to emit "<p>".

I think there's rather more compute-bound web-generation code out there in the world than people think, precisely because this meme makes it so people never even look. Of course you'll think it's all IO bound if you just assume it is and never check even a little bit whether that's true, which from what I can see, is the norm in the web programming world. The data can be screaming that you are in fact not IO bound, but if nobody's listening because "everybody knows" that can't be true... who's going to find out it's wrong?

It's hugely dependant on the app and type of data it's processing.

Our primary apps deal with trading data at 15/30 minute intervals over multiple years. It's not uncommon to have millions of rows involved in a process.

Yes, nearly all of our performance problems boil down to the database. The majority of the time spent on the server is on data access.

Basic CRUD on small data wouldn't be the same.

Assuming everything is designed to scale linearly to some point well above the current workload, a better way to phrase this is to ask what percentage of your hardware spend goes to compute, network and storage machines.

Note that a single postgres instance with one core and 1gb of ram on 1 ssd can “scale linearly” by my definition, since you can easily double, triple, etc all those specs.

On the other hand, a fully populated data center running scale out software at 100% power capacity can’t scale linearly anymore, because you can’t upgrade it anymore. For that data center, power or maybe space is the bottleneck.

Short answer to the question you asked: SSD’s are not going to be the bottleneck for software that’s migrating off disk, and the database probably won’t either. Hardware trends mean the storage and database just got 10-100x faster, while the business logic maybe doubled in speed.

If the system was well balanced before, that means no one spent then-unnecessary effort on now-necessary optimizations on the compute side.

Biggest bottleneck these days are poorly written sqls and badly optimized db's. Hardware isn't really an issue anymore.

Also, poorly written application logic.

Persistence is still slow, but is no longer a major bottleneck. SSDs are faster, but io and network is still slow.

What has changed it is much easier for the client to be continually in touch with the server so that by the time the next "page" is loaded all necessary data has been retrieved.

This way the database is no longer the same obvious bottleneck.

It really depends on the app, but if you're asking about a typical situation of a typical crud app, then yes, db is still likely the slowest part. SSD doesn't change this, because a typical crud app fits completely in memory and is 99% buffered reads anyway.

Is buffered read a database feature or the application?

Yes. Either. Depends on your application. It may be a feature of orm caching as well.

Even if your disk speed is infinitely fast, or your using an in-memory thing like Redis, hitting the DB is almost always going to be slow-ish due to network latency.

Redis is awesome badass rockstar tech, but I try and avoid it for any single-processe thing, instead opting for a tried-and-true global (thread-safe) hash table. Network packets aren't free, and no matter how cool your database is, I don't see you avoiding that. Not to mention that there is a cost to (de)serializing data, which you don't pay if you use something built into the language itself.

I'm not saying to go drop Redis (or any database/cache), just to keep this in mind.

The data is a bit old, but I don't think the relative order or rough magnitude has changed.


Often, yes... it will largely depend on what you are doing, how cacheable things are, and how large your data set is.

If you are largely static and most of your content is tethered to fixed keys that can map from URL patterns, then you could through a large redis cache in front and it wouldn't be the bottleneck... or enough ram to cache on-system.

There are more details to how much your disk or database are your bottleneck, but in almost every case, they are still much slower than your CPU or your network interface (that may be a bottleneck in some cases),

One important thing that has changed since I learned of the "databases are slow" is that they can now be "as-a-service", with services like Amazon Aurora; from the outside interface they're just SQL databases, but underneath they can scale up a lot with minimal effort or knowledge. Likewise, you can have your database of choice and put it on a server which you can scale up or down. Not sure how the migration for those works though.

The serverless database options still remain the bottleneck though. They are still often the slowest aspect of any logic path if not also the most expensive.

I thought this promising title was going to be about a new trend called "server-side web apps" and how they reduce DB round trips. Oh well :)

"One cool, super-advanced trick you can use to reduce latency for simple queries by 95%!"

Then you read the article and it's just running the app and DB on the same physical server, and communicating over local sockets rather than a TCP/IP stack through various virtualization layers and across several networks, real and virtual—so, partying like it's 1999.

It does feel weird pulling an entire database into a docker container... I don't know what the path looks like to make that a cluster.

Yes, for the most part, it still is. It's not always disk though. I can be CPU or memory but most of the time it is the database that falls short of it.

Yarp, it is.

This is a ridiculous conversation. Disks and RAM will always occupy their respective positions in the storage hierarchy, so databases (assuming stored on disks) will always be the bottleneck.

Memory access of any type is often the slowest. This might just be a general law of computing regardless of level of the stack at this point.

In the serverless world, memory accesses are the most expensive with writes being the most expensive of all (where expensive == slow AND literally expensive)

Do you have an example of that being the case? Memory is usually pretty quick compared to disk & network.

I mean memory, storage in a broad sense. Not a statement about in RAM or cache, though I see the confusion in retrospect. Having code access some sort of storage (better word than memory) is always the most expensive part.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact