The reason the database is the bottleneck is that in traditional architectures everything can be duplicated and run in parallel except writing to the database, which has to be done serially, one write at a time. How fast you can write (and commit to "permanent" storage like disk or SSD) is constrained by hardware.

There are ways to alleviate this issue, but they come with different trade-offs. One of the most common patterns is "eventual consistency".

You might want to read about the CAP theorem if you have not heard about it: https://en.wikipedia.org/wiki/CAP_theorem
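
To make the "everything except writes can be duplicated" point concrete, here is a minimal sketch of the usual read-scaling setup (the hostnames, database names, and the choice of psycopg2 are all assumptions for illustration): reads fan out across replicas, while every write still funnels through the single primary's commit path.

    import random
    import psycopg2

    # Hypothetical hosts: one writable primary, several read-only replicas.
    primary = psycopg2.connect("host=db-primary dbname=app")
    replicas = [psycopg2.connect(f"host=db-replica-{i} dbname=app") for i in (1, 2, 3)]

    def run_read(sql, params=()):
        # Reads can be duplicated: any replica will do (possibly slightly stale).
        cur = random.choice(replicas).cursor()
        cur.execute(sql, params)
        return cur.fetchall()

    def run_write(sql, params=()):
        # Writes cannot: they all serialize through the primary and its durable commit.
        cur = primary.cursor()
        cur.execute(sql, params)
        primary.commit()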




I've fixed a lot of performance problems over my career.

This is not true, and not what we talk about when we say it's almost always the db.

The usual cause of database problems is a single big query behind a slow page, or a lot of medium-sized queries inside a poorly written loop. Whether it's bad joins, missing indexes, subqueries nested inside select statements, or operations in select statements that force the db to run the query line by line instead of as a set (the technical term suddenly escapes me), it's the queries that almost always cause db slowdowns.

It rarely has anything to do with writes, and when it does, in my experience it's usually been lock escalation, and I haven't seen that problem in ages as hardware has gotten faster.
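
As a rough illustration of the "queries inside a poorly written loop" case (a sketch only; the tables, columns, and connection string are made up):

    import psycopg2

    conn = psycopg2.connect("dbname=shop")  # hypothetical DSN
    cur = conn.cursor()

    # The slow pattern: one query per customer inside a loop (N+1 round trips).
    cur.execute("SELECT id FROM customers")
    totals = {}
    for (customer_id,) in cur.fetchall():
        cur.execute("SELECT SUM(amount) FROM orders WHERE customer_id = %s", (customer_id,))
        totals[customer_id] = cur.fetchone()[0]

    # Letting the DB do the combining: one set-based query, one round trip.
    cur.execute("""
        SELECT c.id, COALESCE(SUM(o.amount), 0)
        FROM customers c
        LEFT JOIN orders o ON o.customer_id = c.id
        GROUP BY c.id
    """)
    totals = dict(cur.fetchall())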


There are an awful lot of low-hanging-fruit performance problems out there that are just a result of developers not using the DB correctly—usually doing too much in whatever slow scripting language they're using, rather than letting the DB do as much as it can. Sorting, counting, combining, a bunch of queries in a loop where one or two cleverer ones would do.

I think part of why low-feature, specific-use-case DBs like Mongo can get such traction as general-purpose datastores (when marketed as such, of course, to sell more licenses, no matter how bad an idea that is; looking at you, Neo4j) is that so many devs don't know what a half-decent SQL DB can (and should) do for them in the first place.


Is there an alternative to Neo4j you would recommend, or do you think graph dbs are bad in general?

I have a pet project that works with data that’s very graphy (a timetabling app with lots of interdependent events); I tried using Postgres at first since I’m used to it, but found myself writing really ugly-looking recursive queries. Neo4j seems to fit my use case a lot better, and their graphql plugin has been really useful.


N4J's a good option if you have a dense graph (lots of edges) and need to do one of the handful of things it's good at with said graph (shortest path, that sort of thing). I would not recommend using it as a general datastore, even if your data's a little "graphy", because, performance-wise, it's not great at stuff you'd think it might be, like "find this subgraph, then return some data from nodes related to each node", and it lacks tons of features something like Postgres has, including simply providing lots of built-in data types. I'd recommend pairing it with an authoritative datastore if you just need to do some "graphy" things, using it as a secondary datastore just for certain purposes; it'll be safer and likely perform better.

[EDIT] It's also not nearly as well supported, so you'll find a lot of supporting libraries with multi-datastore support don't support N4J yet, or do but only in some crippled or poorly-tested (not widely deployed) fashion.

[EDIT EDIT] it's also not a great fit if you have a lot of constraints on or structure to your graph. At least as of when I used it ~1 year ago it had essentially no support for expressing things like "this type of node should only permit two outbound edges". You can work around that but the solutions will be less safe and/or suck.
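
To show what such a workaround might look like (a sketch using the official neo4j Python driver; the Event label, NEXT relationship type, and two-edge limit are invented for illustration): the check and the create have to share a transaction, and even then the rule lives in the application rather than the database.

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    def add_edge(tx, src_id, dst_id, max_out=2):
        # Count existing outbound edges before creating a new one.
        count = tx.run(
            "MATCH (s:Event {id: $src})-[:NEXT]->() RETURN count(*) AS n",
            src=src_id,
        ).single()["n"]
        if count >= max_out:
            raise ValueError("would exceed the outbound-edge limit")
        tx.run(
            "MATCH (s:Event {id: $src}), (d:Event {id: $dst}) CREATE (s)-[:NEXT]->(d)",
            src=src_id, dst=dst_id,
        )

    with driver.session() as session:
        # One write transaction narrows, but does not eliminate, the race between
        # the check and the create -- this is the "less safe" part.
        session.write_transaction(add_edge, "evt-1", "evt-2")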


Cool, thanks for the info. I was planning on copying some of the data to postgres for certain reporting features that would work better with sql, but I’m not really sure how that will work out yet.

Constraints and hosted multi-datastore solutions aren’t really an issue for me, but the type of query you mentioned about subgraphs might be. One query I know I’ll need is one that can quickly identify nodes with lots of neighbors that have a specific property.

The data is definitely very graphy, so I really wanted a graph database as the primary. Dgraph seems pretty good (advertises itself as more reliable and faster), but it sounds like graph databases might just be kind of oversold in general, and that I might want to reconsider. One other issue is speed of development, which the graphql plugin really increases, so I’ll probably stick with Neo4j for now. Swapping dbs theoretically shouldn’t be THAT painful if I don’t care about migrating data and keep the graphql layer the same.
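
For what it's worth, the "nodes with lots of neighbours that have a specific property" query is the kind of thing Cypher handles fairly directly; a sketch (the labels, property names, and threshold are made up):

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    query = """
    MATCH (e:Event)--(n:Event {room: $room})
    WITH e, count(n) AS neighbours
    WHERE neighbours >= $min_neighbours
    RETURN e.id AS id, neighbours
    ORDER BY neighbours DESC
    """

    with driver.session() as session:
        for record in session.run(query, room="A101", min_neighbours=5):
            print(record["id"], record["neighbours"])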


> This is not true, and not what we talk about when we say it's almost always the db.

It depends on who's talking and what they're talking about.

If we're talking about fixing a slow website, then sure, big ugly queries are a dominant issue, but not the only one. If we're talking about the architecture of a greenfield project, then balancing concurrent writes against consistency is going to be your ultimate problem; everything else can be "duplicated" (read-replica DB servers, caching strategies, multiple app servers, etc.).


And who ever needs to think about that stuff? A tiny number of companies.

Again, I just think you're not considering that the vast majority of us here work in terms of thousands, tens of thousands, or millions of users, not the sort of scale where anything you're saying ever becomes relevant.

For most of us, if you spend any time thinking about that, you're utterly wasting it. It's pointless. Greenfield or not.


I have no idea where the mindset comes from that only large companies have large datasets.


In my experience, a dataset is considered large if it's the largest dataset the company has ever seen. It could be 2GB and be considered "large".


Replying to mattmanser in the sibling comment since I cannot reply to it directly:

It comes in faster than you think. One of the companies I worked for was a small one; at the time I think we were 4 or 5 people, including the 2 founders. They sell some specific information as a service through APIs. They charge per call, so you have to update the customer's funds in their account on each API call. Yes, people can want to and do make lots of API calls very fast, and in a business which charges per API call that is a good thing, but you have to find strategies and make business decisions on how to handle accepting and replying to new API calls while maintaining their account balance.


Load the remaining customer funds into RAM, and decrement them on each API call. Periodically flush the remaining balance to the DB at set intervals (say every hour). If the server goes down, well, your customers got free API calls for an hour to compensate them for their trouble. With any reasonable uptime you'll save more on DB traffic than you lose in lost revenue.
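
Roughly like this, as a sketch (the flush interval and the save_balances stand-in are made up):

    import threading
    import time

    balances = {}                  # customer_id -> remaining credits, loaded from the DB at startup
    balances_lock = threading.Lock()

    def charge(customer_id, price=1):
        with balances_lock:
            if balances.get(customer_id, 0) < price:
                return False       # out of funds, refuse the call
            balances[customer_id] -= price
            return True

    def save_balances(snapshot):
        # Stand-in for one batched UPDATE against the real DB.
        print("flushing", len(snapshot), "balances")

    def flush_forever(interval_seconds=3600):
        while True:
            time.sleep(interval_seconds)
            with balances_lock:
                snapshot = dict(balances)
            save_balances(snapshot)

    threading.Thread(target=flush_forever, daemon=True).start()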

(BTW, the "can't reply" issue happens on HN when there's a long comment thread between just two people that's being rapidly updated. If you wait 10 minutes or so you'll be able to reply.)


> Load the remaining customer funds into RAM, and decrement them on each API call. Periodically flush the remaining balance to the DB at set intervals (say every hour).

So you only ever have one app server?

Sure, this specific issue could be handled in other ways, but in general global state is hard.


If you have more than one app server load the customer funds into Redis and use atomic operations to decrement it. Serving out of RAM is going to be orders of magnitude faster than updating disk regardless of how much you spend on network overhead. If you have so many customers that they blow the RAM limit on a Redis box....well, you won't, because that's billions of customers.

You can get surprisingly far on one app server (or sometimes one app server with 2 mirrors for failover). See eg. the latest TechEmpower benchmarks [1], where common Java and C++ web frameworks can serve 300k+ reqs/sec off of bare-metal hardware. The OP indicated that these API requests are basically straight queries anyway, and only need to write to update the billing information. Reads can run incredibly fast (both because they can usually be served out of cache and because they can be mirrored), so if your only bottleneck is the write, take the write out of the critical path.

In general global state is hard. Don't solve general problems, solve specific ones. Atomic counters have a well-understood and highly performant solution.

[1] https://www.techempower.com/benchmarks/#section=data-r17&hw=...
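
A minimal sketch of that atomic-counter idea with redis-py (the key naming, per-call price, and refusal behaviour are assumptions):

    import redis

    r = redis.Redis(host="localhost", port=6379)

    def charge_for_call(customer_id, price=1):
        """Atomically deduct credits; refuse the call if the balance runs out."""
        key = f"balance:{customer_id}"
        remaining = r.decrby(key, price)   # one atomic round trip, served from RAM
        if remaining < 0:
            r.incrby(key, price)           # roll back the over-draw
            return False
        return True

    # A separate job can periodically persist balances to the authoritative DB,
    # keeping the durable write out of the request's critical path.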


Using a database like redis isn't quite the same as just "loading it into ram".

Also, I never said that you couldn't use a single app server because of performance; as you said, you can handle quite a bit of traffic on a single box, even with slower languages.


I find it hard to believe this actually caused a DB problem, updating an int column on an indexed table should have taken nanoseconds. Far faster than any HTTP calls should have been happening. Sounds like something was very wrong. Someone fucked up with the network causing some severe DB latency or they'd written some really shitty code.

Even if it did, as the other comment says, you could easily work around it. If that was the behaviour, there was no good reason to keep it utterly perfect.


Depends if autocommit is enabled. If it is (and on many DBs it's the default), then you're telling the DB that by the time the increment completes, it should be written to durable storage. That's a minimum of one seek per transaction, which on a hard disk is about 4 ms, i.e. a max of roughly 250 TPS. On an SSD you might get 0.1 ms seek times, i.e. around 10k TPS.

It's possible to kill your performance very quickly with databases if you request features that you don't actually need.
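
As an illustration of paying for durability once per batch instead of once per call (a sketch; the table, columns, and batching window are made up, and whether you can afford to batch at all is exactly the consistency decision discussed below):

    import psycopg2

    conn = psycopg2.connect("dbname=billing")  # hypothetical DSN; autocommit is off by default
    cur = conn.cursor()

    # Increments gathered over a short window instead of committed one by one.
    pending = [("cust-1", 1), ("cust-2", 1), ("cust-1", 1)]

    for customer_id, calls in pending:
        cur.execute(
            "UPDATE accounts SET balance = balance - %s WHERE customer_id = %s",
            (calls, customer_id),
        )

    conn.commit()   # one durable commit (one flush to disk) for the whole batch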


That's where decisions around consistency kick in.

You need to decide if you absolutely do not want to serve any API call unless you are sure it's been paid for, in which case you have to commit a transaction on that account for each call. Or you decide how much you can let a would-be rotten customer get away with, and use queues, which leads to eventual consistency.


Nothing is slower than accessing third-party databases you can't optimize, and that's a near-100% likelihood in any enterprise operation. We still work around this by caching tables where possible, but often we must access the single source of truth regardless of performance.


line by line: table scan?


Databases can handle multiple concurrent writes just fine and have since the 1970's.


Not a charitable interpretation of the parent comment. Even though databases allow concurrent transactions, committing those transactions with complete sequential consistency is still an inherently serial process, at least insofar as it cannot generally be split across machines.


There are distributed databases that support serializable transactions across clusters of machines.

https://www.cockroachlabs.com/blog/serializable-lockless-dis...


To the same table? To the same row?

Say you're an e-commerce site: how do you update inventory on an item "concurrently"? If you're going to say "locks" or "atomic commit", that's exactly what I mean by "you have to serialize your writes".


Just thinking out loud here, but the particular case of inventory of an item should be possible to decrement/increment using a PN-Counter CRDT, as long as you are fine with eventual consistency. Of course there's a risk of overcommitment which has to be considered, but such risks usually exist in commerce anyway due to other factors.
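
For reference, a PN-Counter is just two grow-only counters per replica, merged by taking element-wise maxima; a minimal sketch (replica names and the sync trigger are up to the application):

    class PNCounter:
        """Per-replica increment/decrement totals; replicas merge by element-wise max."""

        def __init__(self, node_id):
            self.node_id = node_id
            self.increments = {}   # node_id -> total increments observed from that node
            self.decrements = {}   # node_id -> total decrements observed from that node

        def increment(self, amount=1):
            self.increments[self.node_id] = self.increments.get(self.node_id, 0) + amount

        def decrement(self, amount=1):
            self.decrements[self.node_id] = self.decrements.get(self.node_id, 0) + amount

        def value(self):
            return sum(self.increments.values()) - sum(self.decrements.values())

        def merge(self, other):
            for node, count in other.increments.items():
                self.increments[node] = max(self.increments.get(node, 0), count)
            for node, count in other.decrements.items():
                self.decrements[node] = max(self.decrements.get(node, 0), count)

    # Two warehouses sell the same item independently, then sync; until they do,
    # each sees a stale value (the overcommitment risk mentioned above).
    a, b = PNCounter("warehouse-a"), PNCounter("warehouse-b")
    a.increment(100)                   # stock received at A
    a.decrement(3); b.decrement(5)     # sales recorded at each replica
    a.merge(b); b.merge(a)
    assert a.value() == b.value() == 92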


Yes, eventual consistency, which is exactly the alternative I mention in my original post.

If you can tolerate eventual consistency (most problems can), then the database is no longer the slowest piece.


How much contention are you expecting?


How much contention to expect is a business prediction, and it's exactly what defines your bottleneck.


I think parent is talking about things like MVCC[1] in contrast with simple read-write locks.

[1] https://en.wikipedia.org/wiki/Multiversion_concurrency_contr...


I think that parent was trying to state that maintaining sequential consistency is a bottleneck for traditional databases.


And that is true if you assume significant contention. Which is not a safe assumption.


If there is no contention, then what is the problem you cannot easily architect around in 2019?


Latency.


Let's put the entire database on a RAM disk (or at least the write logs).


What happens when someone trips over the power cord?


That's why you would use non-volatile memory (e.g., NVMe) if you needed a persistent store with that degree of performance.


Redundancy, and snapshots + replay logs



