Hacker News new | comments | show | ask | jobs | submit login

>Systems that guarantee consistency only experience a necessary reduction in availability in the event of a network partition. As networks become more redundant, partitions become an increasingly rare event. And even if there is a partition, it is still possible for the majority partition to be available

In my experience, yes network partitions are incredibly rare. However 99% of my distributed ststem partitions have little do with the network. When running databases on a cloud environment network partitions can occur for a variety of reasons that don’t actually include the network link between databases:

1. The host database is written in a GC’d language and experiences a pathological GC pause.

2. The Virtual machine is migrated and experiences a pathological pause

3. Google migrates your machine with local SSDs, fucks up that process and you lose all your data on that machine (you do have backups right?)

4. AWS retires your instance and you need to reboot your VM.

You may never see these issue if you are running a 3 or 5 cluster database. I began seeing issues like this semi regularly once the cluster grew to 30-40 machines (Cassandra). Now I will agree that none of the issues took down majority, but if your R=3, it really only takes an unlucky partition to fuck up an entire shard

I may be a bit of an old fart, but this is the exact reasoning behind my decision to never go with "distributed X" if there's a "single-machine X" where you can just vertically scale.

If you can afford 3-5 machines/VMs for a cluster you can almost certainly afford a single machine/VM with 2-4x the resources/CPU and chances are that it'll perform just as well (or better) because it doesn't have network latency to contend with.

Of course if you're around N >= 7 or N >= 9, then you should perhaps start considering "distributed X".

As long as your application is mostly built on very few assumptions about consistency it's usually pretty painless to actually go distributed.

Of course, there are legitimate cases where you want to go distributed even with N=3 or N=5, but IME they're very few and far between... but if your 3/5 machines are co-located then it's really likely that the "partitioning" problem is one of the scenarios where they actually all go down simultaneously or the availability goes to 0 because you can't actually reach any of the machines (which are likely to be on the 'same' network).

I think part of this is that most of the common knowledge about scaling is hard fought from the 90s/2000 era. eBay got bigger and bigger Sun boxes to run Oracle, until they couldn't get anything bigger -- then they had a problem and had to shard their listings into categories, etc. In the last few Intel cpu generations, computation performance has had small gains, but addressable memory has doubled about every other release, you can now get machines with 2TB of ram, and it's not that expensive.

You can fit a lot of data in an in memory database of 2TB.

My experience is that scaling up can lead to owning pets instead of cattle. It's never fun having a 1TB ATS instance that's acting funky but you're terrified of bouncing it. That's more a devil's advocate anecdote than an argument against scaling up.

I think you can (and clearly should) be ready for your big boxes to go down. And it is much more painful to bounce a large box than a small box. But that doesn't mean having say 4x [1] big boxes is strictly worse than having a whole bunch of smaller boxes.

[1] one pair in two locations, basically the minimum amount, unless you're really latency insensitive so you could have one in three locations.

Yup, Google clusters also originally had machines with 1 or 2 CPUs! SMP on Linux was a new thing!

Nowadays you easily have 32 cores on a machine, and each core is significantly faster than it was back then (probably at least 10x). That is a compute cluster by the definition of 1999.

So for certain "big data" tasks (really "medium data" but not everyone knows the difference), I just use a single machine and shell scripts / xargs -P instead of messing with cluster configs.

You can crunch a lot of data on those machines as long as you use all the cores. 32x is the difference between a day and less than an hour, so we should all make sure we are using it!

Common languages like node.js, Python, R, OCaml, etc. do NOT let you use all your cores "by default" (due to GILs / event loop concurrency). You have to do some extra work, and not everyone does.

> Common languages like node.js, Python, R, OCaml, etc. do NOT let you use all your cores "by default"

If the code is written in a distributed fashion from the start then it can be designed so it's one python/node/R process per core. And that makes the transition to distributed servers simpler anyway.

> If the code is written in a distributed fashion from the start then it can be designed so it's one python/node/R process per core.

That's a really glib dismissal of how hard the problem is. Python and node have pretty terrible support for building distributed systems. With Python, in practice most systems end up based on Celery, with huge long-running tasks. This configuration basically boils down to using Celery, and whatever queueing system it is running on, as a mainframe-style job control system.

The "shell scripts / xargs -P" mentioned by chubot is a better solution that is much easier to write, more efficient, requires no configuration, and has far fewer failure modes. That is because Unix shell scripting is really a job control language for running Unix processes in parallel and setting up I/O redirection between them.

Am I correct in assuming Elixir/Erlang does a much better job at this compared to Node/Python/etc., putting aside (what I understand to be) the rather big problem of their relative weakness for computation?

Erlang can be a good fit -- the concurrency primatives allow for execution on multiple cores, and the in memory database (ets) scales pretty well. Large (approaching 1TB per node) mnesia databases require a good bit of operational know how, and willingness to patch things, so try to work up to it. Mnesia is erlang's optionally persistent, optionally distribution database layer that's included in the OTP distribution. It's got features for consensus if you want that, but I've almost always run it in 'dirty' mode, and used other layers to arrange so all the application level writes for a key are sent to a single process which then writes to mnesia -- this establishes an ordering on the updates and (mostly) eliminates the need for consensus.

I believe with the combination of native functions ("NIFs") in rust and some work on the nif interface (to avoid making it so easy to take down the whole beam VM on errors) - you might get more of best of both worlds today - than you used to. As you say erlang itself is rather slow wrt compute.

Thankfully it's not an issue for me. Elixir/Erlang is pretty much perfect for most of my use-cases :). But I foresee a few projects where NIFs or perhaps using Elixir to 'orchestrate' NumPy stuff might be useful. Most of my work would remain on the Elixir side though.

Yes. Erlang is built around distributed message sending and serialization. Python does not have any such things; even some libraries like Celery punt on it by having you configure different messaging mechanisms ("backends" like RabbitMQ, Redis, etc.) and different serialization mechanisms (because built-in pickle sucks for different uses in different ways). Node.js does not come with distributed message sending and serialization either.

> With Python, in practice most systems end up based on Celery, with huge long-running tasks.

Oh dear... Yeah, that's a terrible distributed system. Interestingly, all the distributed systems I've worked on with Python haven't had Celery as any kind of core component. It's just poorly suited for the job, as it is more of a task queue. A task queue is really not a good spine for a distributed system.

There are a lot of python distributed systems built around in memory stores, like pycos or dask, or built around existing distributed systems like Akka, Casandra, even Redis.

> There are a lot of python distributed systems built around in memory stores, like pycos or dask, or built around existing distributed systems like Akka, Casandra, even Redis.

Cassandra and Redis just mean that you have a database-backed application. How do you schedule Python jobs? Either you build your own scheduler that pulls things out of the database, or you use an existing scheduler. I once worked on a Python system that scheduled tasks using Celery, used Redis for some synchronization flags, and Cassandra for the shared store (also for the main database). Building a custom scheduler for that system would have been a waste of time.

> Cassandra and Redis just mean that you have a database-backed application.

Oh there's a lot more to it than that. CRDT's... for example.

Well, Celery uses Rabbit-MQ and typically Redis underneath. Rabbit-MQ to pass messages, and Redis to store results.

You can scale up web servers to handle more requests, which then uses Celery to offload jobs to different clusters.

Yeah, but fundamentally Celery is a task queue. You don't build a distributed system around that.

I think the intention was that if you're gently coerced into working with a single thread, like with node, then you're also coerced into writing your code in a way that's independent from other threads. In theory, it's easier reasoning about doing parallel work when you start from this point - I've certainly noticed this effect before.

I don't think any reasonable developer would dismiss concurrency/parallelism as easy problems.

Or you could use a language that supports threads, like Go, Java, C, C++, C#... This gives you much more flexibility.

That's a really good observation and I agree. I sometimes have to pinch myself to make sure I'm not dreaming when I see that there are actually pretty affordable machines with these insane amounts of RAM. (I think the first machine I personally[0] owned had 512K or something like that.)

[0] Well, my dad bought it, but y'know. He wasn't particularly interested, but I think he recognized an interest in computers in me :).

My first computer had 38911 BASIC BYTES FREE.

Luxury! My first computer (Sinclair ZX Spectrum) had 48KB RAM total, and IIRC 8KB of BASIC RAM available.

Guess I win this round, unless there are older farts than me - Commodore VIC-20 with 3.5KB RAM which was plenty for me to do really cool things with in BASIC and 6502 machine language)

My first computer was a TRS-80 Model 1 with Level 1 BASIC. 4KB of RAM and 4KB of ROM.

It predated the VIC-20 by just under 3 years. You might have had 3.5KB free for BASIC programs but the VIC-20 had a comparatively spacious 5KB of RAM and 20KB of ROM.

I did start earlier than the VIC-20 (with TRS-80 Model 2 at school) but you definitely take the title :)

You can get commodity x86 server with 12TB RAM and 224 cores.

If you're willing to go back a generation on CPU (and merely 192 cores), you can get 24TB.


It looks like 12TB ram is available in eight socket Intel servers? Those are sort of commodity, but availability is limited, and NUMA becomes a much larger issue than in the more easily available dual socket configurations. Looks like Epyc can do 4TB in a dual socket configuration.

Anyone have experience with mainframes when it comes to this level of cost/performance? I suppose 500k USD is still too "cheap" to consider more special hw/systems - but where's the cut-off when getting a monster running db2 and Code in a Linux VM or something?


It depends what is meant by "commodity" I guess.

The largest server AWS offers is only 64 physical cores (128 logical) and less than 4 TB RAM.


Quad socket R1 (LGA 2011) supports Intel® Xeon® processor E7-8800 v4/v3, E7-4800 v4/v3 family (up to 24-Core)

Up to 12TB DDR4 (128GB 3DS LRDIMM); 96x DIMM slots (8x memory module boards: X10QBi-MEM2)

AFAIK the latest Skylake Xeons ("Scalable" - Platinum/Gold/Silver/Bronze) have regressed to 1.5TB support, see https://ark.intel.com/products/93794/Intel-Xeon-Processor-E7... http://www.colfax-intl.com/nd/downloads/Intel-Xeon-Scalable-... corroborated by https://ark.intel.com/products/120502/Intel-Xeon-Platinum-81...

Since a single 128GB stick costs about 2900 USD, 96 of them will run to ~280 000 USD plus the server so it's likely to be above 300 000K.

If you want to go with 6TB "only" then it's a lot, lot cheaper as 64GB RDIMM sticks can be had below 700 USD. The end result might cost closer to a third of the 12TB server than half of it.

RAM prices comparison: https://memory.net/memory-prices/

That's a previous-gen CPU system. Current is:


Server itself is might only be $25k but those 224 cores could add another $90k, so the total would be close to $400k.

The previous-gen version, SYS-7088B-TR4FT (link in my comment upthread), has 192 DIMM slots, so if you don't need CPU horsepower, you cn get the lower-density modules and still have 12TB (or the max of 24TB for the price premium!).

I edited my answer to indicate that current gen 3.06TB RAM supporting CPUs are not out (yet?)

Right, which is why the current generation is limited to 12TiB for 8S systems, compared to 24TiB for the generation you quoted.

Even previous gen, if 12TiB main memory is your goal (NUMA concerns aside), it's probably worth going for the 8S system instead of the 4S one, since that's a savings of $144k, and 8 slower/fewer-core CPUs might even be cheaper than 4 that have twice the performance.

Interesting, though I don't see the price or any other purchasing details (at least on that page).

Pretty sure once you're talking about 12TB of memory it goes to "call for pricing".

I am guessing that it would be in 250-270K range

That seems more like enterprise built and sold specially, and less like commodity hardware.

To me things available from grey box x86 vendors like supermicro or lenovo look like commodity hardware.

here's online configurator for Supermicro 8-way box https://www.rackmountpro.com/product/2570/7089P-TR4T.html

My experience is that it's unfortunately really hard to convince people who don't deeply understand distributed systems (but think they know) that just because a system is called 'HA' or can have an 'HA' mode turned on, that doing so has downsides. They freak out if you try to propose _not_ running the HA mode, because they don't (or aren't willing to) understand the potential downsides of dealing with split-brain, bad recovery, etc. They just see it as a "enable checkbox and forget it" kind of thing that should always be turned on. This is a big struggle of mine, any suggestions would be appreciated...

A rational way of explaining it is -- what is my service level objective? If my SLO is for two nines of uptime, then I can be down for 3.65 days a year, and I can get away with single-homing something, or going with a simple hot-failover replicated setup. I just need to be pretty confident that I can fix it if it breaks within a few hours.

If I need more nines of uptime than a single system could provide (and think about the MTBF numbers for the various parts in a computer, a single machine is not going to give you even three nines of uptime), then I'm literally forced to go distributed with N >= 3 and geographical redundancy (we stay up if any k of these machines fail, or if network to "us-east-4" goes down). Things get worse (and you need to be more paranoid) if your service level obligation turns into a service level agreement, because then it usually costs you money when you mess up.

Of course, the more distributed you are, the slower and more complicated your system becomes, and the more you expose yourself to the additional associated downtime risks ("oops, the lock service lost quorum"). They usually cost more to run, obviously. C'est la vie. There is no magic bullet in software engineering.

It used to be that you could spend 3X or more in engineering effort getting a distributed system up and running vs its singly-homed alternative. These days with cloud deployments (and Kubernetes for on-prem) you get a lot of labor-saving devices that make rolling out a distributed system a lot easier than it used to be. You still need to do a cost/benefit analysis!

> what is my service level objective?

There are environments where flat time distribution for SLO calculation is not acceptable. (cough betting exchange)

If your traffic patterns are extremely spiky, such as weekly peaks hitting 15-20x of your base load, and where a big chunk of your business can come from those peaks, then most normal calculations don't apply.

Let's say your main system that accepts writes is 10 minutes down in a month. That's easily good for >99.9% uptime, but if a failure + PR hit from an inconveniently timed 10-minute window can be responsible for nearly 10% of your monthly revenue, that's a major business problem.

So when setting SLOs, they should be set according to business needs. I may be a heretic saying this but not all downtime is equal.

Time based SLOs definitely have their limitations, but in this instance isn't it fairly easy to redefine the SLO in terms of requests rather than time?

This is one of the recommendations given in the Google SRE book: use request-level metrics for SLOs/SLIs where possible. As your systems grow larger the probability of total outage, which would be measured in time, becomes a smaller fraction of the probability of partial outage.

Since total outages are a special case of partial outages, use metrics that cleanly measure partial outages. That's request error metrics.

I wish it was that easy - our teams have their targets for p99 and p995 ratios but they cannot capture the overall user experience. For us it's not just the ratio of failed requests, but closer to a four-tuple of:

  * maximum number of users affected
  * maximum time of unavailability
  * maximum observed latency 
  * highest ratio of failed requests over a sequence of relatively tight measurement windows
Those are demanding constraints, but such is reality when peak trading activity can take place within just a few minutes. If users can not place their trades during those short windows, they will quickly lose confidence and take their business elsewhere.

So yes, request ratio is certainly a good part of the overall SLO but covers only a portion of the spectrum.

> If I need more nines of uptime than a single system could provide (and think about the MTBF numbers for the various parts in a computer, a single machine is not going to give you even three nines of uptime).

Three nines is 8 hours and 45 minutes. In my experience with quality hardware, at a facility with onsite spares, that gives you two to three hardware faults, assuming your software and network is perfect, which I think is a fair assumption :)

Definitely the hardest problem when you switch to a distributed database is you have to contemplate failure. When your SPOF database is down, it's easy to handle, nothing works. When your distributed database partially fails, chances are you have a weird UX problem.

Thank you very much, especially that first paragraph. I think you've given me some good discussion tools, and I really appreciate that.

It's a struggle for me, too. People sometimes rest on their assumptions even though new technologies and paradigms can invalidate old rules (I'm sure everyone has lots of examples).

I found that constant communication and scenarios help bridge the gap between how they think things work and how you do. It doesn't always work - a lot of time people don't care to discuss things in detail.

But sometimes it worked and the other person came to realize a new way of thinking. Sometimes the discussion made me realize I was the person using invalid assumptions :)

Thanks for your thoughts. And definitely yes re your last point! Like rubber duck debugging :)

> I may be a bit of an old fart, but this is the exact reasoning behind my decision to never go with "distributed X" if there's a "single-machine X" where you can just vertically scale.

I use this same reasoning to use an RDBMS until I find a reason it will not work. An ACID compliant datastore makes so many things easier until you hit a very large scale.

Also an "old fart", I would recommend postgresql first over any of these other databases. Solve the big data scaling problems when you actually have them. One database server with replication and failover is going to still solve 95-98% or more of the use cases on the web.

It unfortunately doesn’t adequately solve “tweak and update the database software in the middle of the day without requiring downtime” situation very well

We do rolling releases of software all the time but it’s pretty hard for us to do much optimisation of our DB setup without doing it in the middle of the night because of how all this stuff works.

Where I worked in the past they had a rule to only hire DBAs that could manage schema changes and such without downtime. Proper planning up front can mitigate the need to take a lot of those outages.

They still took a window a year for changes where there was no alternative. That is still seems preferable to me to the kinds of ongoing problems that distributed eventually consistent databases produce. You might have better availability numbers, but if customer service has to spend money to fix mistakes caused by data errors I don't know that is better for the business.

Exactly. And also things like rebooting the server, updating the database image, etc. This is why I use HA databases. I can do whatever I like and as long as most of them are still up, traffic continues to be served. If one node is having network issues, no problem. The database is still accessible. And so on.

I would assume you have a staging system to test fixes/updates on, and that rolling out updates would take the form of updating a read slave, then doing a cutover?

Or are you talking about other kinds of tweaks?

The issue we have is that the cutover is super hard to do in a zero-downtime way because we almost always have active connections and you can’t have multiple primaries. There’s tricks like read connections on the replicas but it’s still really hard to coordinate.

Given that Postgres has built in synchronous replication, I feel like it should also have some support for multiple primaries (during a cutover window) to allow better HA support

You are totally correct that most applications do not require more than a single host to handle all the load, especially given the hosts that are easily available today. However, the far more common reason to go distributed is redundancy.

This is especially true for storage systems, which are the topic of this post, and are by far some of the most complex distributed systems. Loosing a node can mean loosing data. Loosing a large, vertically scaled node, can mean loosing a lot of data.

You can mitigate those risks with backups, and cloud APIs of today give customers ability to spin up replacement host in seconds. But then the focus shifts to what will achieve higher availability - a complex distributed system, but one that always runs in a active/active mode. Or a simpler single node system, but one that relies on infrequently exercised and risky path during failures.

There's a substantial difference between High Availability (HA) clustering and Load Sharing (LS) clustering. In an HA cluster a production database can run on a single active node while replicating to a separate standby node. Network partitions will cause a lag in replication but will not present an external consistency issue as there's still only one active node at any time. When the partition is resolved the replication catches up and all is right with the world. Which is to say, you can scale vertically while having both reliability and availability.

Hard "distributed systems" problems in databases generally only creep up when you're trying to deploy a multi-active system.

Agreed that such setups can be simpler than multi-master replication. Yet, there's still a hard distributed systems challenge even in the active/passive setup like you describe. When a failure occurs, the clients need to agree which host is the master and which host is the slave. And in the presence of network partitions, it can be impossible to say if the old master is dead, or partitioned away, and hence still happily taking writes.

The most common way in which such systems failover is by fencing off the old master prior to failover. Which in itself can be a hard distributed problem.

You can also just configure the client to point to a single db node and have an applicative monitoring based update to the db connection string upon failover. Then you're not actually dependent on your client loadbanacing logic. It's very robust and if you follow CQRS also highly available.

Except propagating a configuration update to multiple clients, without a split brain is a hard distributed systems problem :). Especially in the presence of those pesky network partitions.

Does running on 3-5 machines give you the advantage of being able to apply kernel patches and restart machines without bringing down the whole system? That might matter to people with paying customers, even if in theory they could've replaced their 3 machines with one bigger one.

It may, but if your customer can't tolerate/afford 5-10 minutes of downtime you're already into "big league" territory. I could see a case if you're rolling out gradually, but with only 3-5 machines you're not really going to notice any subtle problems in a 1-vs-2 or 2-vs-3 situation.

I'm not sure anyone would ever actually have only one machine, without a failover option, and failing over had better not bring down the whole system.

But now we're back to a distributed system

I don't think that's true, in the usual sense of the term.

When a replica is used as a read slave, that introduces problems with (strong) consistency, that definitely starts to resemble any other distributed system.

Adding automated failover adds the need for some kind of consensus algorithm, another feature of distributed systems.

However, if failover is manual, for example, then 2 nodes can be nearly indistinguishable from 1 node, if no failover occurs.

It also strains the definition of "distributed" if both nodes are adjacent, possibly even with direct network and/or storage connections to each other.

I wouldn't call you an old fart... It's just that you understand that massively distributed databases are usually premature optimization.

Past a given point it's time to re-design and optimize for today rather than what was quick and easy with a small user-base. Unless you know when you're starting that such a huge user-base is the target the Quekid5's approach is proper.

Past that easy single / very small cluster case it's time to start asking more important design questions. As an example, which part(s) NEED to be ACID complaint and which can have eventual consistency (or do they even need that as long as the data was valid 'at some point'? So just 'atomic'.)

This is good and sound advice. The logical outgrowth of this is that the vast majority of organizations don’t need to use distributed databases. (Those that do should probably opt for hosted ones first, and then maybe consider running their own if they have the explicit need, and the substantial SRE budget for it.)

The one resource you won't get any more of from any cloud provider by scaling vertically, is network latency. (Because with most cloud providers, you've usually already got the fastest link you'll get for an instance, on their cheapest plan!)

Given the same provider, ten $50 instances can usually handle a much higher traffic load (in terms of sheer bulk of packets) than a single $500 instance can.

Alternately, you can switch over to using a mainframe architecture, where your $500 instance will actually have 10 IO-accelerated network cards, and so will be able to effectively make use of them all without saturating its however-many-core CPU's DMA channels.

Then you have a huge single point of failure. AP distributed systems assume you will lose nodes and force you to deal with the reality of them happening.

So you are either accepting the downtime or planting your head in the ground in denial.

Losing a majority due to various faults is not equivalent to a network partition. In practice for the systems you are working with it may have a similar effect (or a much worse one!), but in theory it is possible to recover from these faults in a short amount of time as long as there are no network disruptions.

In particular, if a node is completely down, that's likely a lot better than a network partition because it eliminates the possibility that it's going to be accepting read and write requests that could have stale data or create inconsistent data (respectively).

5. Your network isn't partitioned exactly, but someone bumped an ethernet cable and the packet loss has reduced the goodput on that link to a level too low to sustain the throughput you need. With most congestion control algorithms (basically, not-BBR), at 10Gb even 1% packet loss is devastating.

6. Well, basically the hundred other reasons the machine could brown-out enough that things start timing out even though it's sporadically online. Bad drive, rogue process, loose heatsink, etc.

Dead hosts are easy. Half-dead hosts suck.

These are partial node failures not network partitions.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact