In my experience, yes, network partitions are incredibly rare. However, 99% of my distributed system partitions have little to do with the network. When running databases in a cloud environment, network partitions can occur for a variety of reasons that don't actually involve the network link between databases:
1. The host database is written in a GC’d language and experiences a pathological GC pause.
2. The virtual machine is migrated and experiences a pathological pause.
3. Google migrates your machine with local SSDs, fucks up that process, and you lose all your data on that machine (you do have backups, right?)
4. AWS retires your instance and you need to reboot your VM.
You may never see these issues if you are running a 3- or 5-node database cluster. I began seeing issues like this semi-regularly once the cluster grew to 30-40 machines (Cassandra). Now, I will agree that none of these issues took down a majority, but if your replication factor is 3, it really only takes an unlucky partition to fuck up an entire shard.
If you can afford 3-5 machines/VMs for a cluster, you can almost certainly afford a single machine/VM with 2-4x the resources/CPU, and chances are that it'll perform just as well (or better) because it doesn't have network latency to contend with.
Of course if you're around N >= 7 or N >= 9, then you should perhaps start considering "distributed X".
As long as your application is mostly built on very few assumptions about consistency it's usually pretty painless to actually go distributed.
Of course, there are legitimate cases where you want to go distributed even with N=3 or N=5, but IME they're very few and far between. And if your 3/5 machines are co-located, it's really likely that the "partitioning" problem is one of the scenarios where they all go down simultaneously, or where availability drops to zero because you can't actually reach any of the machines (which are likely to be on the 'same' network).
You can fit a lot of data in a 2TB in-memory database.
One pair in two locations is basically the minimum, unless you're really latency-insensitive, in which case you could have one machine in each of three locations.
Nowadays you easily have 32 cores on a machine, and each core is significantly faster than it was back then (probably at least 10x). That is a compute cluster by the definition of 1999.
So for certain "big data" tasks (really "medium data" but not everyone knows the difference), I just use a single machine and shell scripts / xargs -P instead of messing with cluster configs.
You can crunch a lot of data on those machines as long as you use all the cores. 32x is the difference between a day and less than an hour, so we should all make sure we are using it!
Common languages like node.js, Python, R, OCaml, etc. do NOT let you use all your cores "by default" (due to GILs / event loop concurrency). You have to do some extra work, and not everyone does.
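In Python, for instance, that extra work is often just multiprocessing; a minimal sketch (the file names and the per-file work are placeholders):

    import multiprocessing as mp

    def crunch(path):
        # CPU-bound work on one input file (placeholder: count characters)
        with open(path) as f:
            return sum(len(line) for line in f)

    if __name__ == "__main__":
        paths = [f"data/part-{i:04d}.txt" for i in range(256)]  # hypothetical inputs
        with mp.Pool() as pool:  # defaults to one worker process per core
            totals = pool.map(crunch, paths)
        print(sum(totals))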
If the code is written in a distributed fashion from the start then it can be designed so it's one python/node/R process per core. And that makes the transition to distributed servers simpler anyway.
That's a really glib dismissal of how hard the problem is. Python and node have pretty terrible support for building distributed systems. With Python, in practice most systems end up based on Celery, with huge long-running tasks. This configuration basically boils down to using Celery, and whatever queueing system it is running on, as a mainframe-style job control system.
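For illustration, a minimal sketch of what that Celery-as-job-control pattern tends to look like (the broker URL and the task body are hypothetical):

    from celery import Celery

    # Celery used as a job-control system: a broker queues long-running tasks
    # and a pool of workers pulls them off.
    app = Celery("jobs", broker="redis://localhost:6379/0",
                 backend="redis://localhost:6379/1")

    @app.task
    def crunch(chunk_id):
        # stand-in for a huge long-running task
        return chunk_id

    # enqueue a batch of jobs; separate `celery worker` processes execute them
    results = [crunch.delay(i) for i in range(100)]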
The "shell scripts / xargs -P" mentioned by chubot is a better solution that is much easier to write, more efficient, requires no configuration, and has far fewer failure modes. That is because Unix shell scripting is really a job control language for running Unix processes in parallel and setting up I/O redirection between them.
Oh dear... Yeah, that's a terrible distributed system. Interestingly, all the distributed systems I've worked on with Python haven't had Celery as any kind of core component. It's just poorly suited for the job, as it is more of a task queue. A task queue is really not a good spine for a distributed system.
There are a lot of Python distributed systems built around in-memory stores, like pycos or dask, or built around existing distributed systems like Akka, Cassandra, even Redis.
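For example, a minimal dask.distributed sketch (the work function is a placeholder):

    from dask.distributed import Client

    def crunch(x):
        # placeholder for real per-item work
        return x * x

    if __name__ == "__main__":
        client = Client()  # starts a local scheduler, one worker per core by default
        futures = client.map(crunch, range(1000))
        print(sum(client.gather(futures)))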
Cassandra and Redis just mean that you have a database-backed application. How do you schedule Python jobs? Either you build your own scheduler that pulls things out of the database, or you use an existing scheduler. I once worked on a Python system that scheduled tasks using Celery, used Redis for some synchronization flags, and Cassandra for the shared store (also for the main database). Building a custom scheduler for that system would have been a waste of time.
Oh, there's a lot more to it than that. CRDTs, for example.
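A toy example of the idea, a grow-only counter CRDT sketched in Python (the node names are made up):

    class GCounter:
        # Grow-only counter CRDT: one slot per node, merge = element-wise max.
        def __init__(self, node_id, nodes):
            self.node_id = node_id
            self.counts = {n: 0 for n in nodes}

        def increment(self, amount=1):
            self.counts[self.node_id] += amount

        def value(self):
            return sum(self.counts.values())

        def merge(self, other):
            # Merge is commutative, associative, and idempotent,
            # so replicas converge regardless of the order of exchanges.
            for n, c in other.counts.items():
                self.counts[n] = max(self.counts.get(n, 0), c)

    a = GCounter("a", ["a", "b"])
    b = GCounter("b", ["a", "b"])
    a.increment(3)
    b.increment(2)
    a.merge(b)
    b.merge(a)
    assert a.value() == b.value() == 5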
You can scale up web servers to handle more requests, which then use Celery to offload jobs to different clusters.
I don't think any reasonable developer would dismiss concurrency/parallelism as easy problems.
 Well, my dad bought it, but y'know. He wasn't particularly interested, but I think he recognized an interest in computers in me :).
It predated the VIC-20 by just under 3 years. You might have had 3.5KB free for BASIC programs but the VIC-20 had a comparatively spacious 5KB of RAM and 20KB of ROM.
It depends what is meant by "commodity" I guess.
The largest server AWS offers is only 64 physical cores (128 logical) and less than 4 TB RAM.
Quad socket R1 (LGA 2011) supports Intel® Xeon® processor E7-8800 v4/v3, E7-4800 v4/v3 family (up to 24-core); up to 12TB DDR4 (128GB 3DS LRDIMM); 96x DIMM slots (8x memory module boards: X10QBi-MEM2).
AFAIK the latest Skylake Xeons ("Scalable" - Platinum/Gold/Silver/Bronze) have regressed to 1.5TB support, see https://ark.intel.com/products/93794/Intel-Xeon-Processor-E7... http://www.colfax-intl.com/nd/downloads/Intel-Xeon-Scalable-... corroborated by https://ark.intel.com/products/120502/Intel-Xeon-Platinum-81...
Since a single 128GB stick costs about 2,900 USD, 96 of them will run to ~280,000 USD; add the server and it's likely to be above 300,000 USD.
If you want to go with 6TB "only" then it's a lot, lot cheaper as 64GB RDIMM sticks can be had below 700 USD. The end result might cost closer to a third of the 12TB server than half of it.
RAM prices comparison: https://memory.net/memory-prices/
The server itself might only be $25k, but those 224 cores could add another $90k, so the total would be close to $400k.
The previous-gen version, SYS-7088B-TR4FT (link in my comment upthread), has 192 DIMM slots, so if you don't need CPU horsepower, you can get the lower-density modules and still have 12TB (or the max of 24TB for the price premium!).
Even previous gen, if 12TiB of main memory is your goal (NUMA concerns aside), it's probably worth going for the 8S system instead of the 4S one: filling 192 slots with 64GB sticks at ~$700 instead of 96 slots with 128GB sticks at ~$2,900 is a savings of roughly $144k, and 8 slower/fewer-core CPUs might even be cheaper than 4 that have twice the performance.
If I need more nines of uptime than a single system can provide (and think about the MTBF numbers for the various parts in a computer: a single machine is not going to give you even three nines of uptime), then I'm literally forced to go distributed with N >= 3 and geographical redundancy (we stay up if any k of these machines fail, or if the network to "us-east-4" goes down). Things get worse (and you need to be more paranoid) if your service level objective turns into a service level agreement, because then it usually costs you money when you mess up.
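As a back-of-envelope illustration (assuming independent failures and a simple majority quorum, which real systems only approximate):

    from math import comb

    def quorum_availability(n, a):
        # Probability that a majority of n nodes is up, with each node
        # independently available with probability a (an idealized model).
        k = n // 2 + 1
        return sum(comb(n, i) * a**i * (1 - a)**(n - i) for i in range(k, n + 1))

    for n in (1, 3, 5):
        print(n, quorum_availability(n, 0.999))
    # 1 node: 0.999; 3 nodes: ~0.999997; 5 nodes: ~0.99999999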
Of course, the more distributed you are, the slower and more complicated your system becomes, and the more you expose yourself to the additional associated downtime risks ("oops, the lock service lost quorum"). They usually cost more to run, obviously. C'est la vie. There is no magic bullet in software engineering.
It used to be that you could spend 3X or more in engineering effort getting a distributed system up and running vs its singly-homed alternative. These days with cloud deployments (and Kubernetes for on-prem) you get a lot of labor-saving devices that make rolling out a distributed system a lot easier than it used to be. You still need to do a cost/benefit analysis!
There are environments where weighting all time equally in the SLO calculation is not acceptable (cough, betting exchanges).
If your traffic patterns are extremely spiky, such as weekly peaks hitting 15-20x of your base load, and where a big chunk of your business can come from those peaks, then most normal calculations don't apply.
Let's say your main system that accepts writes is down for 10 minutes in a month. That's easily good for >99.9% uptime (10 minutes out of the ~43,200 in a month is about 0.02% downtime), but if a failure plus the PR hit from an inconveniently timed 10-minute window can be responsible for nearly 10% of your monthly revenue, that's a major business problem.
So SLOs should be set according to business needs. I may be a heretic for saying this, but not all downtime is equal.
Since total outages are a special case of partial outages, use metrics that cleanly measure partial outages. That's request error metrics.
* maximum number of users affected
* maximum time of unavailability
* maximum observed latency
* highest ratio of failed requests over a sequence of relatively tight measurement windows
So yes, the request error ratio is certainly a good part of the overall SLO, but it covers only a portion of the spectrum.
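For the last item in that list, a rough sketch of the calculation involved (the event format and the 60-second window are assumptions):

    def worst_window_error_ratio(events, window_seconds=60):
        # events: iterable of (unix_timestamp, ok) pairs.
        # Returns the highest failed-request ratio observed in any fixed window.
        buckets = {}
        for ts, ok in events:
            bucket = buckets.setdefault(int(ts) // window_seconds, [0, 0])
            bucket[0] += 1                 # total requests in this window
            bucket[1] += 0 if ok else 1    # failed requests in this window
        return max((fail / total for total, fail in buckets.values()), default=0.0)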
Three nines is 8 hours and 45 minutes of downtime a year. In my experience with quality hardware, at a facility with onsite spares, that gives you room for two to three hardware faults, assuming your software and network are perfect, which I think is a fair assumption :)
Definitely the hardest problem when you switch to a distributed database is that you have to contemplate failure. When your SPOF database is down, it's easy to handle: nothing works. When your distributed database partially fails, chances are you have a weird UX problem.
I found that constant communication and walking through scenarios help bridge the gap between how they think things work and how you do. It doesn't always work; a lot of the time people don't care to discuss things in detail.
But sometimes it worked and the other person came to realize a new way of thinking. Sometimes the discussion made me realize I was the person using invalid assumptions :)
I use this same reasoning to use an RDBMS until I find a reason it will not work. An ACID compliant datastore makes so many things easier until you hit a very large scale.
We do rolling releases of software all the time but it’s pretty hard for us to do much optimisation of our DB setup without doing it in the middle of the night because of how all this stuff works.
They still took a window a year for changes where there was no alternative. That still seems preferable to me compared to the kinds of ongoing problems that distributed, eventually consistent databases produce. You might have better availability numbers, but if customer service has to spend money to fix mistakes caused by data errors, I don't know that that's better for the business.
Or are you talking about other kinds of tweaks?
Given that Postgres has built-in synchronous replication, I feel like it should also have some support for multiple primaries (during a cutover window) to allow better HA.
This is especially true for storage systems, which are the topic of this post and are by far some of the most complex distributed systems. Losing a node can mean losing data. Losing a large, vertically scaled node can mean losing a lot of data.
You can mitigate those risks with backups, and today's cloud APIs give customers the ability to spin up a replacement host in seconds. But then the focus shifts to what will achieve higher availability: a complex distributed system that always runs in an active/active mode, or a simpler single-node system that relies on an infrequently exercised and risky path during failures.
Hard "distributed systems" problems in databases generally only creep up when you're trying to deploy a multi-active system.
The most common way such systems fail over is by fencing off the old master prior to failover, which in itself can be a hard distributed-systems problem.
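One common technique here is fencing tokens; a toy sketch of the idea (the class and method names are made up):

    class FencedStore:
        # The failover coordinator hands out monotonically increasing epochs;
        # the storage layer rejects writes carrying a stale one, so a
        # partitioned old master can't silently keep writing after failover.
        def __init__(self):
            self.highest_epoch = 0
            self.data = {}

        def write(self, epoch, key, value):
            if epoch < self.highest_epoch:
                raise RuntimeError("stale primary: write rejected")
            self.highest_epoch = epoch
            self.data[key] = value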
When a replica is used as a read slave, that introduces problems with (strong) consistency, and it definitely starts to resemble any other distributed system.
Adding automated failover adds the need for some kind of consensus algorithm, another feature of distributed systems.
However, if failover is manual, for example, then 2 nodes can be nearly indistinguishable from 1 node, if no failover occurs.
It also strains the definition of "distributed" if both nodes are adjacent, possibly even with direct network and/or storage connections to each other.
Past that easy single-node / very small cluster case, it's time to start asking more important design questions. As an example: which part(s) NEED to be ACID compliant, and which can have eventual consistency (or do they even need that, as long as the data was valid 'at some point'? So just 'atomic'.)
Given the same provider, ten $50 instances can usually handle a much higher traffic load (in terms of sheer bulk of packets) than a single $500 instance can.
Alternatively, you can switch over to using a mainframe architecture, where your $500 instance will actually have 10 IO-accelerated network cards and so will be able to make effective use of them all without saturating its however-many-core CPU's DMA channels.
So you are either accepting the downtime or burying your head in the sand in denial.
6. Well, basically the hundred other reasons the machine could brown-out enough that things start timing out even though it's sporadically online. Bad drive, rogue process, loose heatsink, etc.
Dead hosts are easy. Half-dead hosts suck.