NewSQL databases fail to guarantee consistency and I blame Spanner (dbmsmusings.blogspot.com)
510 points by evanweaver on Sept 21, 2018 | 301 comments

>Systems that guarantee consistency only experience a necessary reduction in availability in the event of a network partition. As networks become more redundant, partitions become an increasingly rare event. And even if there is a partition, it is still possible for the majority partition to be available

In my experience, yes, network partitions are incredibly rare. However, 99% of my distributed system partitions have little to do with the network. When running databases in a cloud environment, partitions can occur for a variety of reasons that don’t actually involve the network link between databases:

1. The host database is written in a GC’d language and experiences a pathological GC pause.

2. The Virtual machine is migrated and experiences a pathological pause

3. Google migrates your machine with local SSDs, fucks up that process and you lose all your data on that machine (you do have backups right?)

4. AWS retires your instance and you need to reboot your VM.

You may never see these issues if you are running a 3- or 5-node database cluster. I began seeing issues like this semi-regularly once the cluster grew to 30-40 machines (Cassandra). Now, I will agree that none of the issues took down a majority, but if your R=3, it really only takes an unlucky partition to fuck up an entire shard.
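To put a rough number on the "unlucky partition" with R=3, here is a minimal sketch. It assumes each shard's replicas sit on distinct machines picked uniformly at random and that `down` machines are unavailable at the same moment. The function name and the placement model are my own simplification, not Cassandra's actual ring or rack-aware placement.

```python
from math import comb

def p_shard_loses_quorum(n_nodes: int, down: int, replicas: int = 3) -> float:
    """Probability that a majority of one shard's replicas are on down nodes.

    Assumption: replicas are placed on distinct nodes uniformly at random.
    """
    majority = replicas // 2 + 1
    total = comb(n_nodes, replicas)
    # Hypergeometric tail: at least `majority` replicas fall among the down nodes.
    hits = sum(
        comb(down, k) * comb(n_nodes - down, replicas - k)
        for k in range(majority, replicas + 1)
    )
    return hits / total

if __name__ == "__main__":
    for n in (5, 30, 40):
        print(n, "nodes, 2 down:", round(p_shard_loses_quorum(n, down=2), 4))
```

Under this model the per-shard chance for a given pair of bad machines shrinks as the cluster grows, but a bigger cluster also holds more shards and sees pauses, migrations and retirements more often, which would be consistent with only noticing the problem at 30-40 machines.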

I may be a bit of an old fart, but this is the exact reasoning behind my decision to never go with "distributed X" if there's a "single-machine X" where you can just vertically scale.

If you can afford 3-5 machines/VMs for a cluster you can almost certainly afford a single machine/VM with 2-4x the resources/CPU and chances are that it'll perform just as well (or better) because it doesn't have network latency to contend with.

Of course if you're around N >= 7 or N >= 9, then you should perhaps start considering "distributed X".

As long as your application is mostly built on very few assumptions about consistency it's usually pretty painless to actually go distributed.

Of course, there are legitimate cases where you want to go distributed even with N=3 or N=5, but IME they're very few and far between... but if your 3/5 machines are co-located then it's really likely that the "partitioning" problem is one of the scenarios where they actually all go down simultaneously or the availability goes to 0 because you can't actually reach any of the machines (which are likely to be on the 'same' network).

I think part of this is that most of the common knowledge about scaling was hard-won in the 90s/2000s era. eBay got bigger and bigger Sun boxes to run Oracle until they couldn't get anything bigger; then they had a problem and had to shard their listings into categories, etc. In the last few Intel CPU generations, computation performance has had small gains, but addressable memory has doubled about every other release. You can now get machines with 2TB of RAM, and it's not that expensive.

You can fit a lot of data in an in memory database of 2TB.

My experience is that scaling up can lead to owning pets instead of cattle. It's never fun having a 1TB ATS instance that's acting funky but you're terrified of bouncing it. That's more a devil's advocate anecdote than an argument against scaling up.

I think you can (and clearly should) be ready for your big boxes to go down. And it is much more painful to bounce a large box than a small box. But that doesn't mean having say 4x [1] big boxes is strictly worse than having a whole bunch of smaller boxes.

[1] one pair in two locations, basically the minimum amount, unless you're really latency insensitive so you could have one in three locations.

Yup, Google clusters also originally had machines with 1 or 2 CPUs! SMP on Linux was a new thing!

Nowadays you easily have 32 cores on a machine, and each core is significantly faster than it was back then (probably at least 10x). That is a compute cluster by the definition of 1999.

So for certain "big data" tasks (really "medium data" but not everyone knows the difference), I just use a single machine and shell scripts / xargs -P instead of messing with cluster configs.

You can crunch a lot of data on those machines as long as you use all the cores. 32x is the difference between a day and less than an hour, so we should all make sure we are using it!

Common languages like node.js, Python, R, OCaml, etc. do NOT let you use all your cores "by default" (due to GILs / event loop concurrency). You have to do some extra work, and not everyone does.
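For Python, the "extra work" can be fairly small. A minimal sketch, assuming the job is embarrassingly parallel across input files (`count_lines` is just a hypothetical stand-in for the real per-file work): one worker process per core via the standard library's `concurrent.futures`, which sidesteps the GIL in roughly the way `xargs -P` does.

```python
import os
from concurrent.futures import ProcessPoolExecutor
from typing import Iterable, Optional

def count_lines(path: str) -> int:
    """Work for one input file; runs inside a worker process."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)

def count_lines_parallel(paths: Iterable[str], workers: Optional[int] = None) -> int:
    """Fan the files out over one process per core and sum the results."""
    workers = workers or os.cpu_count() or 1
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_lines, paths))

if __name__ == "__main__":
    import sys
    # Usage: python count.py file1 file2 ...
    print(count_lines_parallel(sys.argv[1:]))
```

The `if __name__ == "__main__"` guard matters on platforms that start workers by re-importing the script. For work that is mostly I/O, the process pool buys less than it does for CPU-bound work.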

> Common languages like node.js, Python, R, OCaml, etc. do NOT let you use all your cores "by default"

If the code is written in a distributed fashion from the start then it can be designed so it's one python/node/R process per core. And that makes the transition to distributed servers simpler anyway.

> If the code is written in a distributed fashion from the start then it can be designed so it's one python/node/R process per core.

That's a really glib dismissal of how hard the problem is. Python and node have pretty terrible support for building distributed systems. With Python, in practice most systems end up based on Celery, with huge long-running tasks. This configuration basically boils down to using Celery, and whatever queueing system it is running on, as a mainframe-style job control system.

The "shell scripts / xargs -P" mentioned by chubot is a better solution that is much easier to write, more efficient, requires no configuration, and has far fewer failure modes. That is because Unix shell scripting is really a job control language for running Unix processes in parallel and setting up I/O redirection between them.

Am I correct in assuming Elixir/Erlang does a much better job at this compared to Node/Python/etc., putting aside (what I understand to be) the rather big problem of their relative weakness for computation?

Erlang can be a good fit -- the concurrency primitives allow for execution on multiple cores, and the in-memory database (ets) scales pretty well. Large (approaching 1TB per node) mnesia databases require a good bit of operational know-how, and willingness to patch things, so try to work up to it. Mnesia is Erlang's optionally persistent, optionally distributed database layer that's included in the OTP distribution. It's got features for consensus if you want that, but I've almost always run it in 'dirty' mode, and used other layers to arrange so all the application-level writes for a key are sent to a single process which then writes to mnesia -- this establishes an ordering on the updates and (mostly) eliminates the need for consensus.
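The single-writer-per-key arrangement is not Erlang-specific. Here is a rough Python sketch of the idea, with threads and queues standing in for Erlang processes and mailboxes and a plain dict standing in for the mnesia table. The class and method names are hypothetical, and nothing here is distributed or persistent.

```python
import queue
import threading
import zlib

class SingleWriterStore:
    """Route every write for a key to one worker so updates apply in order."""

    def __init__(self, shards: int = 4):
        self.data = {}  # stand-in for the table being written
        self._queues = [queue.Queue() for _ in range(shards)]
        self._workers = [
            threading.Thread(target=self._run, args=(q,), daemon=True)
            for q in self._queues
        ]
        for w in self._workers:
            w.start()

    def _shard_for(self, key: str) -> int:
        # Stable hash, so the same key always lands on the same worker.
        return zlib.crc32(key.encode()) % len(self._queues)

    def write(self, key: str, update) -> None:
        """Enqueue `update`, a function old_value -> new_value."""
        self._queues[self._shard_for(key)].put((key, update))

    def _run(self, q: queue.Queue) -> None:
        while True:
            item = q.get()
            if item is None:  # shutdown sentinel
                return
            key, update = item
            # Only this worker ever updates this key, so read-modify-write
            # needs no lock and no agreement with other writers.
            self.data[key] = update(self.data.get(key))

    def close(self) -> None:
        """Drain all queues and stop the workers."""
        for q in self._queues:
            q.put(None)
        for w in self._workers:
            w.join()
```

A counter incremented through `write("counter", lambda old: (old or 0) + 1)` comes out right without a lock because every update to that key is serialized through one queue, which is the ordering property described above.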

I believe with the combination of native functions ("NIFs") in rust and some work on the nif interface (to avoid making it so easy to take down the whole beam VM on errors) - you might get more of best of both worlds today - than you used to. As you say erlang itself is rather slow wrt compute.

Thankfully it's not an issue for me. Elixir/Erlang is pretty much perfect for most of my use-cases :). But I foresee a few projects where NIFs or perhaps using Elixir to 'orchestrate' NumPy stuff might be useful. Most of my work would remain on the Elixir side though.

Yes. Erlang is built around distributed message sending and serialization. Python does not have any such things; even some libraries like Celery punt on it by having you configure different messaging mechanisms ("backends" like RabbitMQ, Redis, etc.) and different serialization mechanisms (because built-in pickle sucks for different uses in different ways). Node.js does not come with distributed message sending and serialization either.

> With Python, in practice most systems end up based on Celery, with huge long-running tasks.

Oh dear... Yeah, that's a terrible distributed system. Interestingly, all the distributed systems I've worked on with Python haven't had Celery as any kind of core component. It's just poorly suited for the job, as it is more of a task queue. A task queue is really not a good spine for a distributed system.

There are a lot of Python distributed systems built around in-memory stores, like pycos or dask, or built around existing distributed systems like Akka, Cassandra, even Redis.

> There are a lot of Python distributed systems built around in-memory stores, like pycos or dask, or built around existing distributed systems like Akka, Cassandra, even Redis.

Cassandra and Redis just mean that you have a database-backed application. How do you schedule Python jobs? Either you build your own scheduler that pulls things out of the database, or you use an existing scheduler. I once worked on a Python system that scheduled tasks using Celery, used Redis for some synchronization flags, and Cassandra for the shared store (also for the main database). Building a custom scheduler for that system would have been a waste of time.

> Cassandra and Redis just mean that you have a database-backed application.

Oh there's a lot more to it than that. CRDT's... for example.

Well, Celery uses RabbitMQ and typically Redis underneath. RabbitMQ to pass messages, and Redis to store results.

You can scale up web servers to handle more requests, which then use Celery to offload jobs to different clusters.

Yeah, but fundamentally Celery is a task queue. You don't build a distributed system around that.

I think the intention was that if you're gently coerced into working with a single thread, like with node, then you're also coerced into writing your code in a way that's independent from other threads. In theory, it's easier reasoning about doing parallel work when you start from this point - I've certainly noticed this effect before.

I don't think any reasonable developer would dismiss concurrency/parallelism as easy problems.

Or you could use a language that supports threads, like Go, Java, C, C++, C#... This gives you much more flexibility.

That's a really good observation and I agree. I sometimes have to pinch myself to make sure I'm not dreaming when I see that there are actually pretty affordable machines with these insane amounts of RAM. (I think the first machine I personally[0] owned had 512K or something like that.)

[0] Well, my dad bought it, but y'know. He wasn't particularly interested, but I think he recognized an interest in computers in me :).

My first computer had 38911 BASIC BYTES FREE.

Luxury! My first computer (Sinclair ZX Spectrum) had 48KB RAM total, and IIRC 8KB of BASIC RAM available.

Guess I win this round, unless there are older farts than me - Commodore VIC-20 with 3.5KB RAM, which was plenty for me to do really cool things with in BASIC and 6502 machine language.

My first computer was a TRS-80 Model 1 with Level 1 BASIC. 4KB of RAM and 4KB of ROM.

It predated the VIC-20 by just under 3 years. You might have had 3.5KB free for BASIC programs but the VIC-20 had a comparatively spacious 5KB of RAM and 20KB of ROM.

I did start earlier than the VIC-20 (with TRS-80 Model 2 at school) but you definitely take the title :)

You can get a commodity x86 server with 12TB RAM and 224 cores.

If you're willing to go back a generation on CPU (and merely 192 cores), you can get 24TB.


It looks like 12TB ram is available in eight socket Intel servers? Those are sort of commodity, but availability is limited, and NUMA becomes a much larger issue than in the more easily available dual socket configurations. Looks like Epyc can do 4TB in a dual socket configuration.

Anyone have experience with mainframes when it comes to this level of cost/performance? I suppose 500k USD is still too "cheap" to consider more special hw/systems - but where's the cut-off when getting a monster running db2 and Code in a Linux VM or something?


It depends what is meant by "commodity" I guess.

The largest server AWS offers is only 64 physical cores (128 logical) and less than 4 TB RAM.


Quad socket R1 (LGA 2011) supports Intel® Xeon® processor E7-8800 v4/v3, E7-4800 v4/v3 family (up to 24-Core)

Up to 12TB DDR4 (128GB 3DS LRDIMM); 96x DIMM slots (8x memory module boards: X10QBi-MEM2)

AFAIK the latest Skylake Xeons ("Scalable" - Platinum/Gold/Silver/Bronze) have regressed to 1.5TB support, see https://ark.intel.com/products/93794/Intel-Xeon-Processor-E7... http://www.colfax-intl.com/nd/downloads/Intel-Xeon-Scalable-... corroborated by https://ark.intel.com/products/120502/Intel-Xeon-Platinum-81...

Since a single 128GB stick costs about 2900 USD, 96 of them will run to ~280 000 USD plus the server, so it's likely to be above 300 000 USD.

If you want to go with 6TB "only" then it's a lot, lot cheaper as 64GB RDIMM sticks can be had below 700 USD. The end result might cost closer to a third of the 12TB server than half of it.
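The arithmetic behind those two estimates, as a quick sketch. The DIMM prices are the ones quoted above; the 25 000 USD chassis figure is an assumption for illustration only (not a quote), and CPUs are left out entirely.

```python
def memory_config(dimm_gb: int, dimm_usd: float, slots: int = 96):
    """Return (capacity in TB, DIMM cost in USD) for a fully populated box."""
    return dimm_gb * slots / 1024, dimm_usd * slots

big_tb, big_usd = memory_config(128, 2900)    # 128GB LRDIMMs at ~2900 USD each
small_tb, small_usd = memory_config(64, 700)  # 64GB RDIMMs at 700 USD or less

CHASSIS_USD = 25_000  # assumption: rough barebones server price, CPUs excluded

ratio = (small_usd + CHASSIS_USD) / (big_usd + CHASSIS_USD)
print(f"{big_tb:.0f}TB: {big_usd:,.0f} USD in DIMMs")
print(f"{small_tb:.0f}TB: {small_usd:,.0f} USD in DIMMs")
print(f"6TB box / 12TB box cost ratio: {ratio:.2f}")
```

With these inputs the smaller configuration lands near 30% of the larger one, which is where the "closer to a third than half" estimate comes from. Adding the same CPUs to both sides would push the ratio up.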

RAM prices comparison: https://memory.net/memory-prices/

That's a previous-gen CPU system. Current is:


The server itself might only be $25k, but those 224 cores could add another $90k, so the total would be close to $400k.

The previous-gen version, SYS-7088B-TR4FT (link in my comment upthread), has 192 DIMM slots, so if you don't need CPU horsepower, you can get the lower-density modules and still have 12TB (or the max of 24TB for the price premium!).

I edited my answer to indicate that current gen 3.06TB RAM supporting CPUs are not out (yet?)

Right, which is why the current generation is limited to 12TiB for 8S systems, compared to 24TiB for the generation you quoted.

Even previous gen, if 12TiB main memory is your goal (NUMA concerns aside), it's probably worth going for the 8S system instead of the 4S one, since that's a savings of $144k, and 8 slower/fewer-core CPUs might even be cheaper than 4 that have twice the performance.

Interesting, though I don't see the price or any other purchasing details (at least on that page).

Pretty sure once you're talking about 12TB of memory it goes to "call for pricing".

I am guessing that it would be in 250-270K range

That seems more like enterprise built and sold specially, and less like commodity hardware.

To me things available from grey box x86 vendors like supermicro or lenovo look like commodity hardware.

Here's an online configurator for a Supermicro 8-way box: https://www.rackmountpro.com/product/2570/7089P-TR4T.html

My experience is that it's unfortunately really hard to convince people who don't deeply understand distributed systems (but think they do) that just because a system is called 'HA' or can have an 'HA' mode turned on doesn't mean that turning it on is free of downsides. They freak out if you try to propose _not_ running the HA mode, because they don't (or aren't willing to) understand the potential downsides of dealing with split-brain, bad recovery, etc. They just see it as an "enable checkbox and forget it" kind of thing that should always be turned on. This is a big struggle of mine, any suggestions would be appreciated...

A rational way of explaining it is -- what is my service level objective? If my SLO is for two nines of uptime, then I can be down for 3.65 days a year, and I can get away with single-homing something, or going with a simple hot-failover replicated setup. I just need to be pretty confident that I can fix it if it breaks within a few hours.
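For the discussion itself it can help to have the downtime budget per "nine" on hand. A small sketch (the function name is mine) that assumes a flat 365-day year and treats every hour as equally important:

```python
def allowed_downtime_hours(nines: int, period_hours: float = 365 * 24) -> float:
    """Downtime budget for an availability of `nines` nines over the period."""
    unavailability = 10 ** -nines  # 2 nines -> 0.01, 3 nines -> 0.001
    return period_hours * unavailability

for n in range(1, 6):
    hours = allowed_downtime_hours(n)
    print(f"{n} nines: {hours:8.2f} h/year ({hours / 24:.2f} days)")
```

Two nines works out to the 3.65 days a year mentioned above, and each extra nine divides the budget by ten.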

If I need more nines of uptime than a single system could provide (and think about the MTBF numbers for the various parts in a computer, a single machine is not going to give you even three nines of uptime), then I'm literally forced to go distributed with N >= 3 and geographical redundancy (we stay up if any k of these machines fail, or if network to "us-east-4" goes down). Things get worse (and you need to be more paranoid) if your service level objective turns into a service level agreement, because then it usually costs you money when you mess up.

Of course, the more distributed you are, the slower and more complicated your system becomes, and the more you expose yourself to the additional associated downtime risks ("oops, the lock service lost quorum"). They usually cost more to run, obviously. C'est la vie. There is no magic bullet in software engineering.

It used to be that you could spend 3X or more in engineering effort getting a distributed system up and running vs its singly-homed alternative. These days with cloud deployments (and Kubernetes for on-prem) you get a lot of labor-saving devices that make rolling out a distributed system a lot easier than it used to be. You still need to do a cost/benefit analysis!

> what is my service level objective?

There are environments where flat time distribution for SLO calculation is not acceptable. (cough betting exchange)

If your traffic patterns are extremely spiky, such as weekly peaks hitting 15-20x of your base load, and where a big chunk of your business can come from those peaks, then most normal calculations don't apply.

Let's say your main system that accepts writes is 10 minutes down in a month. That's easily good for >99.9% uptime, but if a failure + PR hit from an inconveniently timed 10-minute window can be responsible for nearly 10% of your monthly revenue, that's a major business problem.

So when setting SLOs, they should be set according to business needs. I may be a heretic saying this but not all downtime is equal.

Time based SLOs definitely have their limitations, but in this instance isn't it fairly easy to redefine the SLO in terms of requests rather than time?

This is one of the recommendations given in the Google SRE book: use request-level metrics for SLOs/SLIs where possible. As your systems grow larger the probability of total outage, which would be measured in time, becomes a smaller fraction of the probability of partial outage.

Since total outages are a special case of partial outages, use metrics that cleanly measure partial outages. That's request error metrics.
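A toy comparison of the two ways of counting, using the spiky-traffic scenario from upthread. All the numbers are made up for illustration: a 30-day month at 1,000 requests/minute, one 60-minute peak at 20x load, and a 10-minute outage that falls inside the peak. It also assumes users keep sending requests at the normal rate during the outage.

```python
def time_availability(minutes) -> float:
    """Fraction of minutes in which the service was up."""
    return sum(1 for m in minutes if m["up"]) / len(minutes)

def request_availability(minutes) -> float:
    """Fraction of attempted requests that were served."""
    total = sum(m["requests"] for m in minutes)
    ok = sum(m["requests"] for m in minutes if m["up"])
    return ok / total

# Hypothetical month: base load everywhere except one 60-minute peak.
base = [{"requests": 1_000, "up": True} for _ in range(30 * 24 * 60 - 60)]
# The first 10 minutes of the peak are down.
peak = [{"requests": 20_000, "up": i >= 10} for i in range(60)]
minutes = base + peak

print(f"time-based:    {time_availability(minutes):.4%}")
print(f"request-based: {request_availability(minutes):.4%}")
```

The same outage passes a three-nines SLO when measured in time and misses it when measured in requests, because the failed requests are concentrated in the busiest minutes.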

I wish it was that easy - our teams have their targets for p99 and p995 ratios but they cannot capture the overall user experience. For us it's not just the ratio of failed requests, but closer to a four-tuple of:

  * maximum number of users affected
  * maximum time of unavailability
  * maximum observed latency
  * highest ratio of failed requests over a sequence of relatively tight measurement windows

Those are demanding constraints, but such is reality when peak trading activity can take place within just a few minutes. If users cannot place their trades during those short windows, they will quickly lose confidence and take their business elsewhere.
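The last item of that tuple is the easiest one to pin down in code. A minimal sketch, assuming you already have per-interval `(total_requests, failed_requests)` counts; the function name and the input shape are hypothetical, not anyone's production metric.

```python
def worst_window_error_ratio(samples, window: int) -> float:
    """Highest failed/total ratio over any `window` consecutive samples.

    `samples` is a list of (total_requests, failed_requests) per interval.
    """
    if window <= 0 or window > len(samples):
        raise ValueError("window must be between 1 and len(samples)")
    total = sum(t for t, _ in samples[:window])
    failed = sum(f for _, f in samples[:window])
    worst = failed / total if total else 0.0
    for i in range(window, len(samples)):
        # Slide the window: add the newest interval, drop the oldest one.
        total += samples[i][0] - samples[i - window][0]
        failed += samples[i][1] - samples[i - window][1]
        if total:
            worst = max(worst, failed / total)
    return worst
```

A short burst of failures that barely moves the ratio for the whole period dominates the tight-window figure, which is the difference between the two views described above.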

So yes, request ratio is certainly a good part of the overall SLO but covers only a portion of the spectrum.

> If I need more nines of uptime than a single system could provide (and think about the MTBF numbers for the various parts in a computer, a single machine is not going to give you even three nines of uptime).

Three nines is 8 hours and 45 minutes of downtime a year. In my experience with quality hardware, at a facility with onsite spares, that gives you two to three hardware faults, assuming your software and network are perfect, which I think is a fair assumption :)

Definitely the hardest problem when you switch to a distributed database is you have to contemplate failure. When your SPOF database is down, it's easy to handle, nothing works. When your distributed database partially fails, chances are you have a weird UX problem.

Thank you very much, especially that first paragraph. I think you've given me some good discussion tools, and I really appreciate that.

It's a struggle for me, too. People sometimes rest on their assumptions even though new technologies and paradigms can invalidate old rules (I'm sure everyone has lots of examples).

I found that constant communication and scenarios help bridge the gap between how they think things work and how you do. It doesn't always work - a lot of the time people don't care to discuss things in detail.

But sometimes it worked and the other person came to realize a new way of thinking. Sometimes the discussion made me realize I was the person using invalid assumptions :)

Thanks for your thoughts. And definitely yes re your last point! Like rubber duck debugging :)

> I may be a bit of an old fart, but this is the exact reasoning behind my decision to never go with "distributed X" if there's a "single-machine X" where you can just vertically scale.

I use this same reasoning to use an RDBMS until I find a reason it will not work. An ACID compliant datastore makes so many things easier until you hit a very large scale.

Also an "old fart", I would recommend postgresql first over any of these other databases. Solve the big data scaling problems when you actually have them. One database server with replication and failover is going to still solve 95-98% or more of the use cases on the web.

It unfortunately doesn’t adequately solve the “tweak and update the database software in the middle of the day without requiring downtime” situation.

We do rolling releases of software all the time but it’s pretty hard for us to do much optimisation of our DB setup without doing it in the middle of the night because of how all this stuff works.

Where I worked in the past they had a rule to only hire DBAs that could manage schema changes and such without downtime. Proper planning up front can mitigate the need to take a lot of those outages.

They still took a window a year for changes where there was no alternative. That still seems preferable to me to the kinds of ongoing problems that distributed eventually consistent databases produce. You might have better availability numbers, but if customer service has to spend money to fix mistakes caused by data errors, I don't know that that is better for the business.

Exactly. And also things like rebooting the server, updating the database image, etc. This is why I use HA databases. I can do whatever I like and as long as most of them are still up, traffic continues to be served. If one node is having network issues, no problem. The database is still accessible. And so on.

I would assume you have a staging system to test fixes/updates on, and that rolling out updates would take the form of updating a read slave, then doing a cutover?

Or are you talking about other kinds of tweaks?

The issue we have is that the cutover is super hard to do in a zero-downtime way because we almost always have active connections and you can’t have multiple primaries. There’s tricks like read connections on the replicas but it’s still really hard to coordinate.

Given that Postgres has built in synchronous replication, I feel like it should also have some support for multiple primaries (during a cutover window) to allow better HA support

You are totally correct that most applications do not require more than a single host to handle all the load, especially given the hosts that are easily available today. However, the far more common reason to go distributed is redundancy.

This is especially true for storage systems, which are the topic of this post, and are by far some of the most complex distributed systems. Losing a node can mean losing data. Losing a large, vertically scaled node can mean losing a lot of data.

You can mitigate those risks with backups, and cloud APIs of today give customers the ability to spin up a replacement host in seconds. But then the focus shifts to what will achieve higher availability - a complex distributed system, but one that always runs in an active/active mode. Or a simpler single node system, but one that relies on an infrequently exercised and risky path during failures.

There's a substantial difference between High Availability (HA) clustering and Load Sharing (LS) clustering. In an HA cluster a production database can run on a single active node while replicating to a separate standby node. Network partitions will cause a lag in replication but will not present an external consistency issue as there's still only one active node at any time. When the partition is resolved the replication catches up and all is right with the world. Which is to say, you can scale vertically while having both reliability and availability.

Hard "distributed systems" problems in databases generally only creep up when you're trying to deploy a multi-active system.

Agreed that such setups can be simpler than multi-master replication. Yet, there's still a hard distributed systems challenge even in the active/passive setup like you describe. When a failure occurs, the clients need to agree which host is the master and which host is the slave. And in the presence of network partitions, it can be impossible to say if the old master is dead, or partitioned away, and hence still happily taking writes.

The most common way in which such systems failover is by fencing off the old master prior to failover. Which in itself can be a hard distributed problem.
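One common flavour of fencing is a fencing token: every promotion gets a strictly increasing epoch, and the shared storage refuses writes that carry an older one. (Powering off the old master is another approach and is not shown here.) A toy sketch with hypothetical class names; the hard part in practice is that the coordinator handing out epochs has to be consistent itself.

```python
class FencedStorage:
    """Shared storage that rejects writes from a deposed primary."""

    def __init__(self):
        self.current_epoch = 0
        self.rows = {}

    def fence(self, epoch: int) -> None:
        """Called during failover, before the new primary takes writes."""
        self.current_epoch = max(self.current_epoch, epoch)

    def write(self, epoch: int, key: str, value) -> bool:
        # A write carrying an older epoch comes from a primary that has
        # been failed over (or partitioned away); refuse it.
        if epoch < self.current_epoch:
            return False
        self.rows[key] = value
        return True


class Coordinator:
    """Hypothetical failover manager that hands out increasing epochs."""

    def __init__(self, storage: FencedStorage):
        self.storage = storage
        self.epoch = 0

    def promote(self) -> int:
        self.epoch += 1
        self.storage.fence(self.epoch)  # fence first, then hand over
        return self.epoch
```

With this in place an old master that is alive on the wrong side of a partition can keep accepting client writes, but they bounce off the storage once it reconnects instead of silently diverging.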

You can also just configure the client to point to a single db node and have an applicative monitoring based update to the db connection string upon failover. Then you're not actually dependent on your client load-balancing logic. It's very robust and if you follow CQRS also highly available.

Except propagating a configuration update to multiple clients, without a split brain is a hard distributed systems problem :). Especially in the presence of those pesky network partitions.

Does running on 3-5 machines give you the advantage of being able to apply kernel patches and restart machines without bringing down the whole system? That might matter to people with paying customers, even if in theory they could've replaced their 3 machines with one bigger one.

It may, but if your customer can't tolerate/afford 5-10 minutes of downtime you're already into "big league" territory. I could see a case if you're rolling out gradually, but with only 3-5 machines you're not really going to notice any subtle problems in a 1-vs-2 or 2-vs-3 situation.

I'm not sure anyone would ever actually have only one machine, without a failover option, and failing over had better not bring down the whole system.

But now we're back to a distributed system

I don't think that's true, in the usual sense of the term.

When a replica is used as a read slave, that introduces problems with (strong) consistency, that definitely starts to resemble any other distributed system.

Adding automated failover adds the need for some kind of consensus algorithm, another feature of distributed systems.

However, if failover is manual, for example, then 2 nodes can be nearly indistinguishable from 1 node, if no failover occurs.

It also strains the definition of "distributed" if both nodes are adjacent, possibly even with direct network and/or storage connections to each other.

I wouldn't call you an old fart... It's just that you understand that massively distributed databases are usually premature optimization.

Past a given point it's time to re-design and optimize for today rather than what was quick and easy with a small user-base. Unless you know when you're starting that such a huge user-base is the target, Quekid5's approach is proper.

Past that easy single / very small cluster case it's time to start asking more important design questions. As an example, which part(s) NEED to be ACID compliant and which can have eventual consistency (or do they even need that as long as the data was valid 'at some point'? So just 'atomic'.)

This is good and sound advice. The logical outgrowth of this is that the vast majority of organizations don’t need to use distributed databases. (Those that do should probably opt for hosted ones first, and then maybe consider running their own if they have the explicit need, and the substantial SRE budget for it.)

The one resource you won't get any more of from any cloud provider by scaling vertically is network bandwidth. (Because with most cloud providers, you've usually already got the fastest link you'll get for an instance, on their cheapest plan!)

Given the same provider, ten $50 instances can usually handle a much higher traffic load (in terms of sheer bulk of packets) than a single $500 instance can.

Alternately, you can switch over to using a mainframe architecture, where your $500 instance will actually have 10 IO-accelerated network cards, and so will be able to effectively make use of them all without saturating its however-many-core CPU's DMA channels.

Then you have a huge single point of failure. AP distributed systems assume you will lose nodes and force you to deal with the reality of them happening.

So you are either accepting the downtime or planting your head in the ground in denial.

Losing a majority due to various faults is not equivalent to a network partition. In practice for the systems you are working with it may have a similar effect (or a much worse one!), but in theory it is possible to recover from these faults in a short amount of time as long as there are no network disruptions.

In particular, if a node is completely down, that's likely a lot better than a network partition because it eliminates the possibility that it's going to be accepting read and write requests that could have stale data or create inconsistent data (respectively).

5. Your network isn't partitioned exactly, but someone bumped an ethernet cable and the packet loss has reduced the goodput on that link to a level too low to sustain the throughput you need. With most congestion control algorithms (basically, not-BBR), at 10Gb even 1% packet loss is devastating.
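A back-of-envelope check on the 1% figure, using the Mathis et al. approximation for loss-based (Reno-style) TCP: rate <= (MSS / RTT) * (C / sqrt(p)), with C = sqrt(3/2). The MSS and RTT below are assumed values for a datacenter link, and the formula ignores timeouts and window limits, so treat the output as an order-of-magnitude estimate per flow.

```python
from math import sqrt

def mathis_throughput_bps(mss_bytes: int, rtt_s: float, loss: float) -> float:
    """Approximate steady-state ceiling for one loss-based TCP flow, in bit/s."""
    c = sqrt(3 / 2)
    return (mss_bytes * 8 / rtt_s) * (c / sqrt(loss))

# Assumed numbers for illustration: 1460-byte MSS, 1 ms round trip.
for loss in (0.0001, 0.001, 0.01):
    mbps = mathis_throughput_bps(1460, 0.001, loss) / 1e6
    print(f"loss {loss:.2%}: ~{mbps:,.0f} Mbit/s per flow")
```

At 1% loss a single flow under these assumptions tops out at a small fraction of a 10Gb link, and a longer RTT makes it proportionally worse.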

6. Well, basically the hundred other reasons the machine could brown-out enough that things start timing out even though it's sporadically online. Bad drive, rogue process, loose heatsink, etc.

Dead hosts are easy. Half-dead hosts suck.

These are partial node failures, not network partitions.

It's definitely true that putting the burden of consistency on developers (instead of on the DB) results in a lot more tricky work for developers. On my project, which started six years ago, we use Cloud Datastore, because Cloud Spanner hadn't come out yet. It results in complicated, painful code that would be completely unnecessary with stronger transactional guarantees. Some examples: https://github.com/google/nomulus/blob/master/java/google/re... https://github.com/google/nomulus/blob/master/java/google/re... https://github.com/google/nomulus/blob/master/java/google/re...

It's no surprise that we're currently running screaming to something with stronger transactional guarantees.

It's worth noting that Cloud Datastore's follow-on, Cloud Firestore, does provide strong consistency, and includes a "Datastore mode" that supports the Datastore API.

Firestore is currently in beta, but once it's GA we will be migrating all Datastore users to Firestore: https://cloud.google.com/datastore/docs/upgrade-to-firestore

Disclaimer: I work on Cloud Datastore/Firestore.

I always got the impression that Firestore is more aimed towards mobile apps rather than backend applications, and as such more or less a different kind of product. Is this not the case?

Datastore was initially designed for use from App Engine, i.e., an easy start, no management, automatic scaling environment ("serverless" to use the current in-vogue term).

I would view the Firestore API as a further extension (the "Datastore Mode" functionality was always an element of the design) of that paradigm, extending to the case where you have no trusted piece of code to mediate requests to the database, thus allowing direct use from, e.g., mobile apps (at which point other issues such as disconnected operation surface).

So not so much a "different kind of product" and more a product that supports a strict superset of use cases.

While Firestore has an Android/iOS/Web SDK, it also has great backend support (Python, Java, Node, Go, Ruby, C#, PHP) as well. The "realtime" features of Firestore are better suited for mobile IMO, but using Firestore as a scalable, consistent, document/nosql database for your backend is definitely a good use for it.

I actually think most of the server SDKs don't even expose many of the realtime APIs. Maybe they will in the future, but it shows that you can use Firestore like a normal database just fine.

(I work for GCP)

I’m guessing that Nomulus is your project?

I ran across it before and just wanted to say it’s really cool that this is open sourced.

I'm the tech lead of it. Glad to hear that you've heard of it before, and yeah, I think it's pretty cool it's open sourced too, which is why I made it happen! https://opensource.googleblog.com/2016/10/introducing-nomulu...

Any idea if any company is planning to create an offering dedicated to .BRAND new gTLDs?

Why does anybody need to provide consistency? Often you don't have complete consistency anyways. There are bugs, there are time delays, there is parallel processing. Why even have the requirement?

Let me expand on an example issue that comes up.

Nomulus is software that runs a domain name registry, including most notably the .app TLD. There are three fundamental objects at play here; the domains themselves, contacts (ownership information that goes into WHOIS), and hosts (nameserver information that feeds into DNS). There's a many-to-many relationship here, in that contacts and hosts can be reused over an arbitrarily large number of domains.

The problem is that you can't perform transactions over an arbitrarily large number of different objects in Cloud Datastore; you're limited to enlisting a maximum of 25 entity groups. This means that you can't perform operations that are strongly consistent when contacts or hosts are reused too often. This situation comes up a lot; registrars tend to reuse the same nameservers across many domains, as well as the contacts used for privacy/proxy services.

These problems don't arise in a relational SQL database, because you can simply JOIN the relevant tables together (provided you have the correct indexes set up) and then perform your operations in a strongly consistent manner. That trades off scalability for consistency though, whereas in Spanner you give up neither.

I don't see how this relates to the question whether consistency is needed. If two tables only have 80% of all the rows that you expect you can STILL do a join on them. It's just that the join, just like your original data, is not containing all data sets. The join itself will not raise an exception because of that.

Strong consistency is required because you cannot delete/rename a host or contact if it is in use at all. It's not good enough to say that you can go ahead with the operation because it's not used by at least 80% of domains; you need to know that it's not in use by any domains.
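
To make that requirement concrete, here is a minimal sketch of "delete the host only if no domain references it" done as a single transaction. It uses SQLite from the Python standard library purely for illustration; the `hosts` and `domain_hosts` tables and the function name are assumptions, not the actual Nomulus schema or code.

```python
import sqlite3

# Hypothetical schema for illustration only (not the Nomulus data model).
SCHEMA = """
CREATE TABLE hosts (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE domain_hosts (domain TEXT NOT NULL, host_id INTEGER NOT NULL);
INSERT INTO hosts VALUES (1, 'ns1.example.com'), (2, 'ns2.example.com');
INSERT INTO domain_hosts VALUES ('a.app', 1), ('b.app', 1);
"""


def delete_host_if_unused(conn, host_id):
    """Delete the host only if no domain references it; return True if allowed."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before checking
    try:
        (in_use,) = conn.execute(
            "SELECT EXISTS(SELECT 1 FROM domain_hosts WHERE host_id = ?)",
            (host_id,),
        ).fetchone()
        if not in_use:
            conn.execute("DELETE FROM hosts WHERE id = ?", (host_id,))
        conn.execute("COMMIT")
        return not in_use
    except Exception:
        conn.execute("ROLLBACK")
        raise


# isolation_level=None puts the connection in autocommit mode, so the
# BEGIN/COMMIT statements above control the transaction explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.executescript(SCHEMA)
print(delete_host_if_unused(conn, 1))  # host 1 is referenced by two domains
print(delete_host_if_unused(conn, 2))  # host 2 is unused
```

The check and the delete happen under one lock, so no domain can start using the host in between. That is the part that gets painful when a transaction can only span a limited number of entity groups.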

Well, it's pretty hard to even find applications that require strong global consistency and are willing to sacrifice latency for that. Typically apps don't need much consistency at all and can sacrifice some data instead, like with most RDBMS setups in the wild. Beyond that SEC (strong eventual consistency) covers pretty much all consistency needs there are.

> it's pretty hard to even find applications that require strong global consistency

Exactly my point, even a little generalized. The world isn't consistent, so why always impose unrealistic constraints on ourselves? If we change from "I really need to have all the datasets in their truest form" to "well, there is some data coming, better than nothing", the whole system might be even more reliable. Things in the middle won't die just because the world is imperfect.

The post talks about Spanner using "a specialized hardware solution that uses both GPS and atomic clocks to ensure a minimal clock skew across servers." But what it fails to mention is that this time is distributed using a software solution --- NTP, in fact.

Google makes its "leap-smeared NTP network" available via Google's Public NTP service. And it's not expensive for someone to buy their own GPS clock and run their own leap-smeared NTP service.

Yes, it means that someone who installs a NewSQL database will have to set up their own time infrastructure. But that's not hard! There are lots of things about hardware setup which are at a similar level of trickiness, such as UPS with notification so that servers can shut down gracefully when the batteries are exhausted, locating your servers in multiple on-prem data centers so that if a flood takes out one data center, you have continuity of service elsewhere, etc., etc.

Or of course, you can pay a cloud infrastructure provider (Google happens to be one, but Amazon and Azure provide similar services too) to take care of all of these details for you. Heck, if you use Google Cloud Platform, you can use the original Spanner service (accept no substitutes :-)

I agree. Attempting to remove the GPS and atomic clocks from the system seems like a premature optimization to me. They are cheap and the ability to order events according to an actual wall clock is a huge benefit.

Some basics for less-critical applications: https://github.com/jrockway/beaglebone-gps-clock/blob/master...

It makes me wonder if, as an industry, we shouldn't just include hardware clocks in our basic expectations for a data center.

That is, in order to be considered a satisfactory, non bare bones data center, you'd need to supply on-site, hardware, redundant, maintained and monitored clocks, verifiably meeting certain standards of accuracy and precision. Just like right now people expect backup power, cooling, network connectivity, physical security, access, etc.

It doesn't seem like the burden would be that large for data center operators. The hardware costs don't sound large, and they already have people being paid to monitor and maintain things.

> And it's not expensive for someone to buy their own GPS clock and run their own leap-smeared NTP service

This is a fun weekend hobby project that anyone can do at home if they are curious. Get a GPS chip with a PPS output signal, a plain old “hockey puck” antenna, and a computer to hook it up to (I have an RPi in my attic) and you can deliver fairly accurate time to your home network. Not a setup I’d rely on for a commercial, globally distributed deployment, but accurate enough for home use, and probably better than your current time source.

The problem with production usage is not clock accuracy, but accuracy guarantees. If a clock is within whopping 1ms, it is ok as long as I can prove it is definitely within the 1ms.

Which, as far as I see, boils down to hardware+software reliability.

I feel like the author is making a huge omission by not talking about the properties of multi-raft sharded DBs coupled with the TrueTime-like APIs from cloud providers in 2018. He links to a post from the CEO of CockroachDB written in early 2016, and Amazon launched their Time Sync Service in 2017.

It's likely that Spencer would have something different to say about the matter today.

> I feel like the author is making a huge omission by not talking about the properties of multi-raft sharded DBs coupled with the TrueTime-like APIs from cloud providers in 2018.

I mean this is all fine and dandy if it works (plus, how do you know that it actually works in all the edge cases?), but that's a HUGE amount of complexity. IMO you really need to have a very good reason to even consider anything this complex.

In the post and another comment here it is stated that Calvin can handle any real-world workload. However, according to my reading of the Calvin paper, one must understand the update keys before starting the transaction. I also experienced limitations when trying to use FaunaDB: it doesn't support ad-hoc queries and it only allows for indexed queries.

I really like the Calvin protocol, and it does seem perfectly suited to many application workloads, but it is odd to me to see Calvin presented as purely superior to all alternatives. It seems like more research and work needs to be done to create systems that can address its shortcomings around transactions that query for keys (3.2.1 Dependent transactions in the paper), including ad-hoc queries, and even interactive queries (with a transaction).

Disclaimer: I work on TiDB/TiKV. TiKV (distributed K/V store) uses the percolator model for transactions to ensure linearizability, but TiDB SQL (built on top) allows write skew.

Support for dependent reads is a subject that the Calvin paper only touches on briefly, describing one strategy of using "reconnaissance reads" to determine the key set.

In FaunaDB, this is formalized as an optimistic concurrency control mechanism that combines snapshot reads at the coordinator and read-write conflict detection within the transaction engine. By the time a transaction is ready to commit the entire key set is known. I wrote a blog post that goes into more detail here: https://fauna.com/blog/acid-transactions-in-a-globally-distr...
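
For readers who have not seen this pattern, here is a toy sketch of the general idea: read from a snapshot without locks, remember the versions you saw, and let the commit step reject the transaction if any of them changed. This is a single-process illustration with made-up names (`VersionedStore`, `snapshot_read`), not FaunaDB's or Calvin's actual implementation.

```python
class VersionedStore:
    """Toy key-value store; every key carries a version that bumps on write."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        return self._data.get(key, (0, None))

    def commit(self, read_versions, writes):
        """Apply writes only if nothing we read has changed since we read it."""
        for key, seen_version in read_versions.items():
            if self.read(key)[0] != seen_version:
                return False  # read-write conflict: caller should retry
        for key, value in writes.items():
            self._data[key] = (self.read(key)[0] + 1, value)
        return True


def snapshot_read(store, keys):
    """Reconnaissance phase: read without locks, remembering versions seen."""
    versions, values = {}, {}
    for key in keys:
        versions[key], values[key] = store.read(key)
    return versions, values


store = VersionedStore()
store.commit({}, {"counter": 1})

# Two transactions read the same snapshot, then both try to commit.
versions_a, values_a = snapshot_read(store, ["counter"])
versions_b, values_b = snapshot_read(store, ["counter"])
first = store.commit(versions_a, {"counter": values_a["counter"] + 1})
second = store.commit(versions_b, {"counter": values_b["counter"] + 1})
print(first, second, store.read("counter"))
```

The second commit is rejected because the version it read is stale, so the caller would re-read and retry. By commit time the full key set is known, which is the property the parent comment describes.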

Better analytics support, including ad-hoc queries is on our roadmap. That being said, requiring indexes was a design choice: First and foremost, we want FaunaDB to be the best operational database. It is a lot easier to write fast, predictable queries if they cannot fall back to a table scan!

While FaunaDB requires you to create indexes to support the views your workload requires, the reward is a clearer understanding of how queries perform. We've built indexes to be pretty low commitment, however. You can add or drop them on the fly based on how your workload evolves over time.

Thanks for the link to the explanation. It seems like FaunaDB is doing the hard work to show how Calvin can actually be used for all types of workloads.

You're welcome! I've enjoyed following TiKV/TiDB's development as well. It is an interesting time for databases. :-)

The CAP theorem has been truly disastrous for databases. The CAP theorem simply says that if you have a database on 2 servers and the connection between those servers goes down, then queries against one server don't see new updates from the other server, so you either have to give up consistency (serve stale data) or give up availability (one of the servers refuses to process further requests). That's all that CAP is, but somehow half the industry has been convinced that slapping a fancy name on this concept is a justification for giving up consistency (even when the network and all servers are fully functional) to retain availability. The A for availability in CAP means that ALL database servers are fully available, which is unnecessary in practice, because clients can switch to the other servers. Giving up consistency introduces big engineering challenges. You're getting something that most people don't need in return for a large cost.
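
The two-server scenario is small enough to act out in code. The sketch below is a toy model under simplifying assumptions (one value, synchronous replication while the link is up, and a designated primary in the "CP" case); the class and function names are invented for illustration.

```python
class Unavailable(Exception):
    pass


class Server:
    def __init__(self, name, mode, is_primary=False):
        self.name = name
        self.mode = mode  # "CP" or "AP"
        self.is_primary = is_primary
        self.value = "old"
        self.peer = None
        self.link_up = True

    def _check(self):
        # CP: during a partition only the primary keeps serving requests.
        if self.mode == "CP" and not self.link_up and not self.is_primary:
            raise Unavailable(f"{self.name} refuses requests while partitioned")

    def write(self, value):
        self._check()
        self.value = value
        if self.link_up:
            self.peer.value = value  # replicate while the link works

    def read(self):
        self._check()
        return self.value


def make_pair(mode):
    a, b = Server("s1", mode, is_primary=True), Server("s2", mode)
    a.peer, b.peer = b, a
    return a, b


def partition(a, b):
    a.link_up = b.link_up = False


for mode in ("AP", "CP"):
    s1, s2 = make_pair(mode)
    partition(s1, s2)
    s1.write("new")
    try:
        print(mode, "read from s2:", s2.read())
    except Unavailable as exc:
        print(mode, "read from s2 failed:", exc)
```

In "AP" mode the second server answers with stale data; in "CP" mode it refuses, and a client that can still reach the first server just switches to it.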

This is exactly right, and perhaps the most cogent explanation of the CAP theorem on the internet.

For a longer explanation of the same idea which includes a concrete example of how you can get "high availability" in a CP system, see: https://apple.github.io/foundationdb/cap-theorem.html

"The A for availability in CAP means that ALL database servers are fully available"

Is this true? I always thought it meant that clients could continue to read and write to "the database" which could include the client switching to another node. There is nothing in CAP theorem about latency, so switching, even if it adds high latency, is fine by CAP theorem.

This lack of accounting for latency is what makes CAP theorem less useful than a lot of people realize IMO.

From an implementation point of view networks are asynchronous and therefore are always partitioned. We don't actually have the luxury of arbitrary CAP interpretations, we can't know whether other nodes are available or not at any given moment. So instead we have to make requests to other nodes and always choose either to wait for responses (or more complex communications to achieve consensus) and get C or not wait for anything and get A, although each node can be a bit behind on updates from other nodes. Thus the CAP choices are pretty much about latency: waiting for globally visible updates vs not waiting and getting low latency. Both can be mixed in various proportions to get consistency with good latency, but still reasonable tolerance of byzantine failures.

It is true: if you allow clients to switch then you can have all three C,A,P.

No, you can't. Say you have two nodes, and a client has just sent a COMMIT to node 1. Then Node 1 gets partitioned away from node 2 and the committing client vanishes. Giving clients the ability to switch to node 2 doesn't help you determine whether node 2 does or does not have the data that was committed (consistency), so you have to choose between CP and AP.

You can trivially redirect all clients to node 1 whether or not there is a partition, and voila, you have a CAP system under that definition. If clients can be partitioned from servers then we're back in the original scenario in which switching is not allowed: just partition each client from all but one server (but a different server for each client).

Maybe there is a way to modify the CAP theorem so that it says something non-trivial (e.g. a theorem about the limit of how many node failures a Paxos-like algorithm can handle, but even this is probably trivial, namely half), but the CAP theorem as stated by the originators is trivial, and the proof is a dressed up version of what I stated above. I don't think the authors would disagree with this at all, since they explicitly state:

"The basic idea of the proof is to assume that all messages between G1 and G2 are lost. If a write occurs in G1 and later a read occurs in G2, then the read operation cannot return the results of the earlier write operation."

Isn't it interesting that such a paper has 1600 citations and has reshaped the database industry?

> The A for availability in CAP means that ALL database servers are fully available, which is unnecessary in practice, because clients can switch to the other servers.

This assumes that only the servers are partitioned from each other and clients are not partitioned from the majority quorum. This might be rare but it is not impossible at scale.

There is also a latency cost of strict serializability or linearizability which is hard to mitigate at geo-replicated scale.

Read latency doesn't have to be impacted by linearizability. Write latency is higher, but we have to do a fair comparison. If a non-consistent database acknowledges a write it doesn't have to have made that write visible, so comparing that latency to a linearizable DB write is apples to oranges. If we compare latency until the write is globally visible, the non-consistent DB can still do better, so it is definitely possible to imagine scenarios in which such a DB is better.

However, for what percentage of use cases does a reduction in write latency weigh up against the disadvantages? I think that's a very small percentage. Heck, the vast majority of companies that are using a hip highly available NoSQL database cluster would probably be fine running a DB on a single server with a hot standby.

You could imagine a system that gives you a bit of both. When you do a write, there are several points in time that you may care about, e.g. "the write has entered the system, and will eventually be visible to all clients" and "the write is visible to all clients". The database could communicate that information (asynchronously) to the client that's doing the write, so that this client can update the UI appropriately. When a client does a read, it could specify what level of guarantee it wants for the read: "give me the most recent data, regardless of whether the write has been committed" or "only give me data that is guaranteed to have been committed". Such a system could theoretically give you low latency or consistency on a case by case basis.
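
A rough sketch of what such an interface might look like, as a single-process toy. Everything here (`ReadLevel`, `Store`, the `on_visible` callback, `finish_replication`) is a hypothetical API made up for illustration, not any particular database's.

```python
from enum import Enum


class ReadLevel(Enum):
    LATEST = "latest"        # newest data, even if not yet visible everywhere
    COMMITTED = "committed"  # only data known to be visible to all clients


class Store:
    def __init__(self):
        self._visible = {}   # writes that have been replicated everywhere
        self._accepted = {}  # key -> (value, callback), still replicating

    def write(self, key, value, on_visible=None):
        """Accept a write; on_visible fires once it is globally visible."""
        self._accepted[key] = (value, on_visible)

    def finish_replication(self):
        """Stand-in for asynchronous replication completing."""
        for key, (value, on_visible) in list(self._accepted.items()):
            self._visible[key] = value
            del self._accepted[key]
            if on_visible:
                on_visible(key)

    def read(self, key, level):
        if level is ReadLevel.LATEST and key in self._accepted:
            return self._accepted[key][0]
        return self._visible.get(key)


events = []
store = Store()
store.write("title", "draft 2", on_visible=events.append)
latest = store.read("title", ReadLevel.LATEST)
committed = store.read("title", ReadLevel.COMMITTED)
store.finish_replication()
print(latest, committed, store.read("title", ReadLevel.COMMITTED), events)
```

The writer gets an early acknowledgement plus a later callback, and each reader picks its own point on the latency/consistency trade-off per request.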

You can't always "choose another server" while maintaining low latency operations.

> slapping a fancy name on this concept is a justification for giving up consistency (even when the network and all servers are fully functional) to retain availability

This is a misconception. AP databases are not supposed to give up consistency, just not wait for all nodes to see updates. That's it. Consistency is still there, nodes resync, users always see their own updates and all that.

Consistency in the CAP sense has a precise meaning: all reads see all completed writes. That is, if the database has told some client that their write succeeded, then it is not allowed to still serve other clients the old data.

Inconsistency in the CAP sense may also cause inconsistency in the sense that you mean, for instance if two transactions are simultaneously accepted, but each transaction violates the precondition of the other. In a consistent database one of the transactions will see the result of the other transaction and fail, whereas a DB without consistency may accept both transactions and end up in a semantically incorrect state.
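
A tiny illustration of that failure, using the classic "at least one doctor must stay on call" invariant. This is a sketch with invented names; the "stale snapshot" stands in for two replicas that each accept a transaction without seeing the other's write.

```python
def go_off_call(snapshot, state, doctor):
    """Take `doctor` off call if the snapshot shows someone else still on call."""
    others_on_call = [d for d, on in snapshot.items() if on and d != doctor]
    if not others_on_call:
        return False  # precondition fails: would leave nobody on call
    state[doctor] = False
    return True


# Without consistency: both transactions validate against the same old snapshot.
state = {"alice": True, "bob": True}
stale = dict(state)
ok_alice = go_off_call(stale, state, "alice")
ok_bob = go_off_call(stale, state, "bob")
print("inconsistent:", ok_alice, ok_bob, state)

# With consistency: the second transaction sees the first one's write.
state2 = {"alice": True, "bob": True}
ok_alice2 = go_off_call(dict(state2), state2, "alice")
ok_bob2 = go_off_call(dict(state2), state2, "bob")
print("consistent:  ", ok_alice2, ok_bob2, state2)
```

In the first run both transactions succeed and nobody is left on call, even though each one looked valid on its own. In the second run the later transaction fails its precondition.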

So then it’s CAP, not AP? Wasn’t CAP about not being able to have them all?

No, it's still AP, it's just that CAP consistency is a very specific thing and you can't generalize it into "giving up consistency". AP systems don't give up consistency, they just don't explicitly wait for it.

Then you are saying that all the trouble Spanner goes through to actually be consistent is not needed? And by saying an AP system can be “consistent” there’s no need for a product like Spanner. Consistency is obviously a timing problem. Something that is not consistent at some time will not be consistent since it will be inconsistent at that time.

CAP consistency is linearizability, so if what you mean is there are other models of consistency available to "AP systems" (kind of hate the framing that these are binary features of a distributed system rather than a huge space of design choices, but anyways), then yes that's true. But AP systems don't always guarantee things like read-your-own-write consistency models, and not all conflicts that occur during a partition can be resolved in many of these databases.

Sure. But it kind of is a huge space of design choices (https://jepsen.io/consistency), so it's not ok to claim you give up consistency with any of them.

The C in there is for "Strong Consistency"; CAP allows for lesser forms of consistency, like "Eventual Consistency", without giving up on Availability.

Well yes but that’s not what we are talking about here. And “eventual consistency” is not “consistency”. And I would argue that “eventual consistency” is not consistent since it can result in fake states since it’s not consistent.

I'm the author of that post. I'm happy to respond to comments on the post on this thread for the next several hours. You can also leave comments on the post itself, and I will respond there at any time.

The problem isn't a lack of very accurate clocks per se, right? What you need is an accurate bound on clock error, whatever it might be. It sounds to me that non-Googles are specifying bound parameters that, unlike Google, they cannot guarantee. Doesn't that mean the blame goes to whomever put inaccurate bound parameters into the software? Why blame the algorithm for garbage input?

> It sounds to me that non-Googles are specifying bound parameters that, unlike Google, they cannot guarantee.

I've been complaining about this for years and it's so nice to see others echo the sentiment. Everyone chases timestamps but in reality they're harder to get right than most people are willing to acknowledge.

Very few places get NTP right at large scale, or at least within the accuracies required for this class of consistency. I've never seen anyone seriously measure their SLO for clock drift, often because their observability stack is incapable of achieving the necessary resolutions. Most places hand-wave the issue entirely and just assume their clocks will be fine.

The paper linked within TFA suggests a hybrid clock which is better but still carries some complications. I'll continue to recommend vector clocks despite their shortcomings.
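
For anyone who has not used them, a minimal vector clock sketch follows. It orders events by causality instead of wall-clock time, so there is no clock bound to get wrong; the cost is that some pairs of events come out as "concurrent" and need separate conflict handling. Function and node names are illustrative.

```python
def tick(clock, node):
    """Return a copy of the clock with node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def merge(a, b):
    """Element-wise maximum, used when a node receives another node's clock."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}


def compare(a, b):
    """Return 'before', 'after', 'equal' or 'concurrent'."""
    nodes = a.keys() | b.keys()
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


w1 = tick({}, "n1")             # write on node n1
w2 = tick({}, "n2")             # independent write on node n2
w3 = tick(merge(w1, w2), "n1")  # n1 sees w2, then writes again
print(compare(w1, w2), compare(w1, w3), compare(w3, w2))
```

The clock grows with the number of nodes, which is one of the shortcomings alluded to above.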

The problem is that the bound on clock error directly affects your performance. So if you're willing to accept, say, a second in clock error, then all transactions will take a minimum of one second. That level of performance is going to be unacceptable in many situations.
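
Here is a sketch of why the bound shows up directly in latency, loosely modeled on the commit-wait rule from the Spanner paper: choose a commit timestamp at the latest possible "now", then wait until even the earliest possible "now" has passed it. The `commit_wait` function and `FakeClock` are illustrative, and the bound values are arbitrary.

```python
import time


def commit_wait(epsilon, now=time.monotonic, sleep=time.sleep):
    """Pick a commit timestamp, then block until it is certainly in the past.

    epsilon is the assumed bound on clock error, in seconds: true time is
    taken to lie within [now() - epsilon, now() + epsilon].
    """
    commit_ts = now() + epsilon          # latest possible true time
    while True:
        earliest = now() - epsilon       # earliest possible true time
        if earliest > commit_ts:
            return commit_ts
        sleep(commit_ts - earliest + 1e-6)  # small nudge past the boundary


class FakeClock:
    """Deterministic clock so the example runs instantly."""

    def __init__(self):
        self.t = 0.0

    def now(self):
        return self.t

    def sleep(self, seconds):
        self.t += seconds


for bound in (0.007, 0.25, 1.0):
    clock = FakeClock()
    commit_wait(bound, now=clock.now, sleep=clock.sleep)
    print(f"bound {bound}s -> waited about {clock.t:.3f}s")
```

In this sketch the wait comes out at about twice the bound. The exact factor depends on how the bound is defined, but the cost scales with it either way, which is why a loose bound is so expensive.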

The potential clock error on VMs without dedicated time-keeping hardware is so large that performance turns into absolute garbage.

I would understand if the complaint was that Spanner is too slow without expensively accurate clocks and synchronization. But the complaint is that Spanner fails to guarantee consistency, which doesn't make sense to me. The requirements clearly include giving a valid clock bound, so if you give an invalid clock bound, it's clearly your fault for getting incorrect results, not Spanner's!

Spanner does guarantee consistency, thanks to its use of hardware atomic clocks and GPS. It's alternatives like CockroachDB that don't have this dedicated hardware that can fail to guarantee consistency if clocks get out of sync (a problem that can't happen in Spanner).

Spanner is really fast and massively parallelizable.

We recently open-sourced https://github.com/rubrikinc/kronos for the exact same problem. Coincidentally I shared that on Show HN just today: https://news.ycombinator.com/item?id=18037609

Neat. The obvious question would be, how does kronos compare with ntpd? Do they work together, is kronos a replacement, do they solve different problems, etc.? Is it expected that all the servers in a cluster synced by kronos are already running ntpd, and that kronos provides an additional level of reduced skew on top of that?

It'd be great if you could address this somewhere in the top level README on GitHub.

I've added a section in the README for this. It works in conjunction with NTPD and the time provided by this library has some extra guarantees like monotonicity, immunity to large clock jumps etc (more info in the README).

Hi, you used DynamoDB as an example of a weakly consistent system in the opening paragraph, but it actually supports both modes [1]. The point of confusion might come from the fact that the service described in 2007 Dynamo paper was an inspiration for DynamoDB, rather than DynamoDB itself.

Disclaimer, I work for AWS, but not on DynamoDB team.

[1] https://docs.aws.amazon.com/amazondynamodb/latest/developerg...

Thanks. I added a parenthetical remark to the post to indicate that I was talking about DynamoDB's default settings.

It's not really the default settings, per se. You don't have to change any bit of configuration about your database to get consistency. The DynamoDB API gives you the GetItem API call and a boolean property to choose to make it a consistent read.
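
As a sketch of what that looks like from code: `get_item` and its `ConsistentRead` parameter are part of the real boto3 DynamoDB client, while the `read_item` wrapper, the `Orders` table, the key shape and the `RecordingClient` stub are made up here so the example runs offline.

```python
def read_item(client, table, key, strongly_consistent=False):
    """Fetch one item; ConsistentRead is chosen per request, not per table."""
    response = client.get_item(
        TableName=table,
        Key=key,
        ConsistentRead=strongly_consistent,
    )
    return response.get("Item")


class RecordingClient:
    """Stand-in for a boto3 DynamoDB client so the sketch runs offline."""

    def __init__(self):
        self.calls = []

    def get_item(self, **kwargs):
        self.calls.append(kwargs)
        return {"Item": {"order_id": {"S": "42"}}}


client = RecordingClient()
item = read_item(client, "Orders", {"order_id": {"S": "42"}}, strongly_consistent=True)
print(client.calls[0]["ConsistentRead"], item)

# With real credentials the client would come from boto3 instead:
#   import boto3
#   client = boto3.client("dynamodb")
```

Leaving `strongly_consistent` at its default sends `ConsistentRead=False`, which is the eventually consistent behavior the docs describe as the default.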

It's left as a very simple task for developers leveraging DynamoDB to make the appropriate trade offs on consistent or inconsistent read.

source: Used to work for AWS on a service that heavily leveraged DynamoDB. Not _once_ did we experience any problems with consistency or reliability, despite them and us going through numerous network partitions in that time. The only major issue came towards the end of my time there when DynamoDB had that complete service collapse for several hours.

On the sheer scale that DynamoDB operates at, it's more likely to be a question of "How many did we automatically handle this week?" than "How often do we have to deal with network partitions?"

From the GetItem docs[0]

"GetItem provides an eventually consistent read by default."

This seems to meet the definition of "DynamoDB's default settings"

[0] - https://docs.aws.amazon.com/amazondynamodb/latest/APIReferen...

It's enough of a gray area to make DynamoDB a poor example in this context, since if I were to claim that it was eventually consistent without additional configuration, then an informed person might reasonably assume I didn't know what I was talking about.

It would be better to state that both eventually consistent and fully consistent reads are available, and consistency can be enforced up front via configuration.

I recently came across CockroachDB and thought it's capabilities interesting, almost too good to be true. I also have been looking at Citus Data which shards and distributes transactions, are you aware of any consistency shortcomings with it?

CockroachDB's performance is dependent on time synchronization. If a node detects it's too far behind, it will commit suicide.

However, before it detects it, there is a possibility of stale reads.


Thanks for pointing that out - I have yet to rtfm and dive deep. I wonder how frequently time sync problems occur in virtual environments after ntp syncing - I've seen pretty erratic behavior on virtual active directory domain controllers even after syncing with hyper-v and vmware.

It’s been a long time since I messed with domain controllers but I believe Microsoft used to have explicit guides for handling time on virtual DCs. At that time we kept around a couple of hardware DCs to be safe, but I do remember that having the VMware agent correct the time could lead to some bad results. I think it was because it immediately fixed the time and didn’t slowly correct the drift, but it’s been a long time so my memory could be off.

Here’s their blog post on how they manage to live without atomic clocks. I’ve found at least one assumption of theirs that I don’t agree with, but notwithstanding that, it’s a good read.


Time shouldn't be a massive issue for AD no? It's a vector clock, not a UTC clock. The UTC clock is only used to solve conflicts no?

Why do you say that? I thought kerberos depended on timestamps +/- drift?


Or do you mean some other part of AD?

It defaults to 5 minutes. I perhaps wrongly assumed the issue was within 5 minutes.

My experience from test workloads on Cockroach is that single transaction performance is very bad compared to something like Postgres -- with 2.0 I was seeing easily 10x worse performance than Postgres with a three-node test cluster on Google Cloud. My impression is that it's worse for apps that have lots of small CRUD transactions on low numbers of rows, as is typical with web/mobile UIs.

Aggregate cluster performance seemed very good, though; i.e., adding a bunch more concurrent transactions did not slow down the other transactions noticeably.

Why didn't you disclose your connections in your post? It appears you've written much the same article for FaunaDB's official blog in the past (https://fauna.com/blog/distributed-consistency-at-scale-span...).

That post was actually quite different than this one. This one focuses on distributed vs. global consensus.

But I added a note at the end that clearly documents my connection to FaunaDB. As far as Calvin, the post itself clearly says that it came out of my research group.

The article specifically states that Calvin comes out of his research group and that FaunaDB was inspired by Calvin.

Nice writeup, but given the title, I was hoping you would also have a constructive test case that shows that these systems fail to meet their guarantees (like Jepsen did). :)

I wonder if any of the aforementioned systems (Calvin/Spanner/YugaByte) can opportunistically commit transactions, detect issues, and roll back + retry all within the scope of the RPC, so that they still conform to the linearisability requirement?

I've only ever worked on small projects so I'm not at all familiar with these very high-scale distributed databases but from the post it seems to indicate that Spanner is in a league of its own because it integrated hardware into the mix where the others are software only. What are the differences in scale between the two categories mentioned?

Yes, Spanner is quite unusual in the distributed database world in how hardware is a pretty important part of their solution. Other systems may claim important integrations with hardware, but for Spanner, the architecture really relies on particular hardware assumptions.

To answer your question about scale: there is no real practical difference in scalability between the two categories discussed in the post. Partitioned consensus has better theoretical scalability, but I am not aware of any real-world workload that can not be handled by global consensus with batching.

I think this "global consensus with batching does everything partitioned does" is a very much theory vs practice type of statement. As in, in theory there is no difference between theory and practice :-)

I've seen those batched consensus systems, and honestly, you're kidding yourself if you think they can handle a million qps. Just transmit time of the data on ethernet would become an issue alone! Even with 40 gig - transmit time never becomes free. So now you're stuffing 1/10th of a million qps worth of data via a single set of machines (3, 5, 7, 9? Some relatively small amount)
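
A back-of-the-envelope version of the transmit-time argument. The per-transaction sizes below are assumptions picked for illustration, and the calculation ignores protocol overhead and the extra copies needed to replicate each batch to the other consensus members.

```python
def required_gbps(qps, bytes_per_txn):
    """Raw payload bandwidth needed to push every transaction through one node."""
    return qps * bytes_per_txn * 8 / 1e9


LINK_GBPS = 40  # the 40 gig link mentioned above

for size in (100, 1_000, 10_000):
    gbps = required_gbps(1_000_000, size)
    share = gbps / LINK_GBPS
    print(f"{size:>6} B/txn -> {gbps:5.1f} Gbit/s ({share:.0%} of a {LINK_GBPS} Gbit/s link)")
```

With small transactions a single sequencer group has headroom on the wire; with larger ones the payload alone approaches or exceeds the link, before counting replication traffic.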

Am I misunderstanding you? Hopefully I am!

"Systems that guarantee consistency only experience a necessary reduction in availability in the event of a network partition."

Many of the distributed clusters I've maintained had crap infrastructure and no change control, and parts of the clusters were constantly going down from lack of storage, CPU and RAM, or bad changes. The only reason the applications kept working was that either (1) the not-broken vnodes continued operating as normal and only broken vnodes were temporarily unavailable, or (2) we shifted traffic to a working region and replication automatically caught up the bad cluster once it was fixed. Clients experienced increased error rates due primarily to these infrastructure problems, and very rarely from network partition.

Does your consistent model take this into account, or do you really assume that network partition will be the only problem?

It seems you have other problems (crap infrastructure and no change control) to deal with before the issues in this article become your biggest concern, but are not the cases you list themselves partition problems?

They cause partition, but their origin isn't the network. Nobody who runs a large system has perfectly behaving infrastructure. Infrastructure always works better in a lab than in the real world. Even if you imagine your infrastructure is rock-solid, people often make assumptions, like their quota is infinite, or their application will scale past the theoretical limits of individual network segments, i/o bounding, etc.

The point is, resources cause problems, and the network is just one of many resources needed by the system. Other resources actually have more constraints on them than the network does. If a resource is constrained, it will impact availability in a highly-consistent model.

The author states that simply adding network redundancy would reduce partitions, and infrastructure problems are proof that this is very short-sighted. "You have bigger problems" - no kidding! Hence the weak-consistency model!

Even if you maintain your infrastructure properly you run on x86 servers with disks and CPUs that need cooling, using network devices that have fascinating failure scenarios. I guess assuming that your infra is not reliable is a must for any database nowadays.

Do these concerns also apply in an HTAP or OLAP context e.g. systems like Cloudera's Kudu, which uses Hybrid Time? Or maybe Volt which you also worked on?

I've worked on a system that loaded data into Kudu in near real time and simultaneously ran queries on the data. Kudu has no transactions, consistency is eventual which was sufficient due to our near-real time constraint, however you do need a stable NTP source. We have lost data when the cluster could not get a reliable NTP connection, decided to shut down and tablet servers' data files became corrupted.

Vertica seems to include some of these observations, such as global consistency and group commit. I believe these are easier to achieve (lower overhead) in OLAP due to fewer, larger transactions.

OLAP systems tend to be read only (for analytics) so the question of transactions and consistency isn't really applicable.

They don't load themselves.

I question your statement that building apps on weakly-consistent systems is so difficult. I’ve worked on very large scale systems that you’ve definitely heard of and probably used that are built atop storage systems with very weak semantics and asynchronous replication. Aren’t such systems existence proofs, or do you think there’s just a huge difference between the abilities of engineers in various organizations?

He said it was difficult, not impossible.

I think it's a fairly noncontroversial statement. Dealing with eventual consistency is always going to be more difficult and require more careful thought and preparation than immediately consistent systems.

How many programmers do you think are out there that have only ever worked on systems that use a single RDBMS instance, and what would happen if they tried to apply their techniques to a distributed, eventually consistent environment?

Exactly. My dad was coding when DBMSes rose to prominence, and it was basically a way to take a bunch of things that were hard to think about and sweep them under the rug. People wrote plenty of good software before they existed, but if you wanted to write a piece of data, you had to think about which disk and where on the disk and exactly the record format. Most programmers just wanted a genie that they could give data to and later ask for it back.

It's the same today, but worse. Most programmers still want a simple abstraction that lets them just build things. But now it's not just which sector on which disk, but also which server in which data center on which continent, while withstanding the larger number of failure modes at that scale.

When necessary, people can explicitly address that complexity. But it has a big cost, a high cognitive load.

The biggest challenge, in my experience, is explaining the weak consistency guarantees to stakeholders, especially in a QA setting (e.g. product owner demos the product to colleagues, numbers are not immediately up-to-date).

Great read, thank you for sharing. Do you have any opinion on the design of Eris[0]? Consistency is achieved with extra hardware, but that hardware is a network-level sequencer.

[0]: https://syslab.cs.washington.edu/papers/eris-sosp17.pdf

AFAIK Eris (and previous similar work from the same group) assume network ordering guarantees that can be provided in a single datacenter but probably not in a WAN setting. This discussion is about distributed and geo-replicated databases.

I think it would be helpful to move the disclaimer to the top of the post, just for clarity's sake. Is there a forecasted date on the release of the independent Jepsen study? Who is performing it? Thanks!!

Is there a "best-of-both-worlds" approach that could work, or are these two approaches mutually exclusive? I have to imagine that time drift can eventually be reconciled with some kind of time delta.

The two approaches seem mutually exclusive to me.

Won't batching transactions increase the failure rate by a multiple of (batch size)? How should users reason about that tradeoff?

I’ve seen comments on HN over the years in which someone Dunning-Krugers their way into saying that TrueTime is easily replicated. I always wonder if they have sixteen senior SREs in their pocket, because that’s the level of production engineering Google applies to the problem. Time SRE has at various points had to take measures up to and including calling the USAF and telling them their satellites are fucked up. If you don’t have the staff for this, the easiest way to get access to TrueTime is probably to just use Cloud Spanner.

> Time SRE has at various points had to take measures up to and including calling the USAF and telling them their satellites are fucked up

It's another cute anecdote, but Google culture is full of these, always scant on details and always intended to show how big/smart/important/complex/indispensable their engineering is.

"Had to" is a strong term here, it's made to sound like USAF could not possibly have noticed some deviation they were likely to correct of their own accord as a matter of routine as they had been doing for the 20 years of the GPS project prior to Google being founded.

The reality is drift and bad clocks are and always have been a feature of GPS, one explicitly designed for, one an entire staff exists to cope with, and designs depending on the absolute accuracy of a single clock have never been correct

So yes, Google can be very impressed with Google. But I'm not sure that's the issue here.

Is it really surprising that people who have extremely precise time needs and a whole team devoted to solving them would notice issues that other people wouldn't? I think it's a very common pattern that a product has some set of trailblazer users who find issues before the people who make the product.

Also, I think you're over-interpreting. "Had to" here only means that they noticed and reported the issue first because their system depended on GPS time being right. It doesn't preclude the possibility that the USAF would notice and fix the issue eventually, just with a higher latency than Google wanted.

If some condition existed that exceeded GPS's intended design, you most certainly wouldn't learn of it first from some random anecdote on HN. More likely you'd see it on the front page of the BBC as the transportation system instantly collapses.

So the anecdote itself is noise, it's intended to show how seriously intractable a problem accurate time is, but it doesn't do that, instead it only demonstrates OP's lack of familiarity with GPS and willingness to regurgitate corporate old wives' tales

A single satellite mildly misbehaving on occasion won't necessarily cause catastrophe. You're normally connected to more than the requisite 3 satellites anyway, so you might notice less accuracy, but not anything terrible.

Most of these systems are designed to work if you lose GPS entirely, so they fail gracefully.

Planes won't actually fall out of the sky if GPS makes mistakes. That's y2k fearmongering.

Why is it hard to believe that a group using GPS for a unique purpose has unique needs and detects unique issues?

Sub-millisecond flaws in GPS would make the transportation system collapse? Why?

Here's an interesting article[1] about how relativity affects GPS satellites. The clock ticks in a GPS satellite need to be accurate to within 20-30 nanoseconds for accuracy, and they tick 38 microseconds/day faster to account for relativity.

[1] http://www.astronomy.ohio-state.edu/~pogge/Ast162/Unit5/gps....
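As a rough back-of-the-envelope check on those figures (a sketch of my own, not taken from the linked article; the function name is made up for illustration), a clock error can be converted into a ranging error by multiplying it by the speed of light:

```python
C = 299_792_458.0  # speed of light in m/s

def range_error_m(clock_error_s: float) -> float:
    """Distance light travels during the given clock error."""
    return C * clock_error_s

# The 38 microseconds/day relativistic offset, if left uncorrected,
# would add on the order of 11 km of ranging error per day.
uncorrected_per_day = range_error_m(38e-6)

# A 30 ns clock error corresponds to roughly 9 m of ranging error.
budget = range_error_m(30e-9)

print(f"{uncorrected_per_day / 1000:.1f} km/day, {budget:.1f} m")
```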

GPS is one of the few technologies that have to account for both Special Relativity and General Relativity. The level of engineering that went into the system is just amazing.

Fun fact, GPS satellites use rubidium clocks instead of cesium clocks, and only maintain their accuracy thanks to yet another incredible feat of engineering.

Triangulation of location is bounded by the accuracy of those clocks.

1 microsecond is 300 meters of error.

What if all the satellites are off by the same amount?

According to the link posted higher up in the thread, in early 2016 they were all off by 13 microseconds for 12 hours, with no apparent consequences for anything ordinary people use GPS for such as location finding.

To triangulate, I think you need to know (1) where the satellites are, and (2) how far you are from each satellite. I think either absolute distance or relative distance works.

Getting both of these depends on knowing the time. That time comes from the satellites. Let's say they are all off by 1 us. Your time is derived from satellite time. That would mean the time you use to look up/calculate their positions will be off by 1 us from the correct time so you would get the wrong position for the satellites.

A quick Googling says the satellites' orbital speed is 14000 km/hr, so using a time off by 1 us to look up/calculate satellite position would give you a position that is about 4 mm off.

The procedure for deriving the time from the satellites would get some extra error from this, but that should be limited to about the time it takes light to travel 4 mm, so we can ignore that. As a result your distance measurements between you and satellites would be off by about 4 mm or less.

The key here is that when all the satellites have the same error, the time you derive has the same error, so your distance calculations should still work, and so you only get an error of about how far satellites move in an interval equal to the time error.

In summary, if all the satellites are off by 1 us, your triangulation seems like it would be about 4 mm more uncertain.
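The arithmetic above can be sketched as follows. This only reproduces the comment's back-of-the-envelope reasoning, it is not a GPS solver; the 14000 km/hr figure is the one quoted above and the function names are hypothetical.

```python
C = 299_792_458.0             # speed of light, m/s
ORBITAL_SPEED = 14_000 / 3.6  # 14000 km/hr (figure from the comment), in m/s

def common_offset_error_m(offset_s: float) -> float:
    """Error when every satellite shares the same clock offset:
    roughly how far a satellite moves during that offset."""
    return ORBITAL_SPEED * offset_s

def single_offset_range_error_m(offset_s: float) -> float:
    """Range error when only one satellite's clock is off:
    how far light travels during that offset."""
    return C * offset_s

print(common_offset_error_m(1e-6))        # about 0.004 m, i.e. ~4 mm
print(single_offset_range_error_m(1e-6))  # about 300 m
```

The five orders of magnitude between the two results is the point of the comment: a shared offset mostly cancels out, while an offset on a single satellite does not.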

If only one satellite is off, it is going to depend on how the time algorithm works. If the algorithm is such that it ends up with a time much closer to the times of the correct satellites than to the off satellite, then if it calculates the distance from the triangulated position to the expected positions of the satellites, and compares that to the measured distance, it should find that one is off by something on the order of the distance light travels in 1 us, and the others are all pretty close to where they should be. It should then be able to figure out that it has one unreliable satellite, drop it, and then get the right location.

I have no idea if they actually take those kinds of precautions, though.

The case that would really screw it up would be if several satellites are off, but by different amounts. With enough observation it should be possible in many cases to even straighten that out, but it might be too complicated or too time consuming to be practical. (This is assuming that the error is that the satellites are simply set to the wrong time, but that wrong time is ticking at the right rate).

Precise geolocation relies on extreme time accuracy (the story always being that relativistic time dilation effects with the difference in gravity on the surface vs. in orbit must be accounted for), so yeah, it wouldn't surprise me one bit that the accuracy required is on the order of much less than a millisecond.

> Is it really surprising that people who have extremely precise time needs and a whole team devoted to solving them would notice issues that other people wouldn't

If GPS timing is bad, a lot of people will notice that their position on the map is incorrect, because that's the whole purpose of the GPS network.

A 1 microsecond error is 300 meters.

> A 1 microsecond error is 300 meters.

While the speed-of-light propagation is about 300 meters in a microsecond, isn't the final position error possibly much greater? For calculating position on Earth, you can think about a sphere expanding at the speed of light from each satellite. The 1 microsecond error here corresponds to a radius 300m bigger or smaller, which only corresponds to 300m horizontal distance on the ground if the satellite is on the horizon (assuming that Earth is locally a flat plane for simplicity here). For a satellite directly overhead, the 300m error is a vertical distance. Calculating the difference in horizontal position from this error is then finding the length of a leg of a right triangle with other leg length D and hypotenuse length D+300m, where D is the orbital distance from the satellite (according to Wikipedia, 20180km). The final horizontal distance error is then sqrt((D+300)^2 - D^2), or about 110km.

Of course, this is just the effect of a 1us error in a single satellite, I'm sure there's ways to detect and compensate for these errors.
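The right-triangle calculation above can be checked numerically. This is a minimal sketch under the comment's own simplifications (flat ground, satellite directly overhead, 20180 km distance); the function name is hypothetical.

```python
import math

D = 20_180_000.0  # distance to the satellite in metres (figure from the comment)

def ring_radius_m(range_error_m: float, distance_m: float = D) -> float:
    """Leg of a right triangle whose other leg is distance_m and whose
    hypotenuse is distance_m + range_error_m."""
    return math.sqrt((distance_m + range_error_m) ** 2 - distance_m ** 2)

print(ring_radius_m(300.0) / 1000)  # about 110 (km)
```

Note that the result grows with the square root of the range error, so it does not scale linearly with the clock error.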

Intuitively this seems wrong to me. If the satellite is overhead, the error would put you 300m into the ground so to speak. I'm not sure why you project that horizontally, and especially why you take the distance to the satellite into account.

As another sanity check, if the error for 1 us is 110 km, the error for 1 ns would be 110 m, and I suspect 1 ns error is not unusual for consumer electronics:

> To reduce this error level to the order of meters would require an atomic clock. However, not only is this impracticable for consumer GPS devices, the GPS satellites are only accurate to about 10 nano seconds (in which time a signal would travel 3m)


> If the satellite is overhead, the error would put you 300m into the ground so to speak.

Right, I was basically calculating where that signal would just be reaching the surface at the same time it was 300m under you. This is a circle around you with a radius of ~110km (again using the approximation of the ground as a flat plane). Thinking about it more, there's not much reason to do this (GPS isn't really tied to the surface of the Earth, it gives you 3-D coordinates). I guess my point was that the 300m of distance from 1us of light propagation should not be thought of as a horizontal distance.

That would be if it were straight overhead, intersecting tangentially with another sphere. Realistically they're not overhead, but if two satellites are 30 degrees apart, the line of intersection between their spheres will move twice the distance one of the spheres moves. The magnifying factor is 1/sin(angle between the satellites from the observer).
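A small sketch of that magnifying factor, assuming the simple two-satellite picture described above (the function name is mine):

```python
import math

def magnification(angle_deg: float) -> float:
    """1 / sin(angle between the two satellites as seen by the observer)."""
    return 1.0 / math.sin(math.radians(angle_deg))

# The factor is 1 at 90 degrees, 2 at 30 degrees, and keeps growing
# as the two satellites get closer together in the sky.
for angle in (90, 30, 10):
    print(angle, round(magnification(angle), 2))
```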

If I remember correctly, there was a bug a couple of years back which caused an incorrect time offset between GPS and UTC time to be uploaded to some of the satellites - off by a handful of microseconds. Didn't affect navigation but it did trip a bunch of alerts on systems that relied on precise time. I don't think Google was the one that alerted the USAF to that though, in fact they may not have had sufficiently accurate timekeeping back then.

> Despite the flawed data set, there were no impacts to GPS positioning and navigation. Furthermore, GPS time (tGPS) was unaffected. Only a subset of the functions that make use of the GPS-UTC offset were affected.


GPS is not typically used to confirm a position that is known accurately by other means, and that is not its purpose. Only in those cases where there is a manifest conflict with independent spatial information will the problem be evident.

>GPS is not typically used to confirm a position that is known accurately by other means

I am not so sure about that. The most common use of GPS is in satnav in cars. Satnavs typically show a map, and typically it is very easy to confirm your position on a map. Any inaccuracy by more than the usual few meters would be quickly noticed by the majority of GPS users.

People are going to notice a 300m deviation due to landmarks and their eyes.

Rarely, if you are navigating at sea or in the air or in the woods... and even on the road, it is not uncommon for my GPS device to be clearly off without justifying the conclusion that there is a fault in a satellite.

Here are some users that have a high chance of noticing visually and in aggregate would probably produce a lot of noise:

* Air and sea port operators and navigators

* Military personnel running supply lines

* Military personnel on foot in operations and training

* SpaceX


* River boats

* Fresh water fishermen

* Etc

Out of all the possible users who would notice a 300m deviation just based on visual reconciliation, I personally would not say it would be so rare that the USAF would not find out very quickly. Of course, this is ignoring the equipment that would likely detect the issue way before somebody in the Army started phoning the USAF.

Come on, in urban traffic a 300m error will easily place one in a parallel street.

Unless I'm woefully off base here, if the satellites were incorrect, you would basically be permanently 300m off, not just temporarily.

There's not so many GPS satellites out there that you're going to be bouncing around them all the time - even if only one is affected, it would be very noticeable for extended periods of time.

I specifically worded this to be about money not brains. Most readers here can probably imagine how to implement a bounded time service. Most readers here also cannot afford to operate one. That is the point. Operating software reliably at large scale happens to be very expensive. 24x7 coverage with a short time-to-repair costs at a minimum several million dollars per year.

  24x7 coverage with a short time-to-repair costs at
  a minimum several million dollars per year.
Interesting - what are the constituents of that cost?

What sort of challenges do you face? Do you use PTP grandmaster clocks, or something else? How many sites, and how many clocks per site? Are the support issues mostly hardware failures, configuration problems, or something else? Is 24/7 support needed because the equipment lacks failover support, or is the failover support unreliable or insufficient?

You generally need at least 4-5 SREs for a high availability large (big 5) scale subsystem in a multinational corp just to cover all of the timezones and make sure you're not frantically calling everyone when someone goes on vacation or has to pick up their kid from the nurse. The salary plus benefits and overhead on that is easily in the millions.

I think it was meant that Google has such high costs. I read somewhere that Google operates two atomic clocks in each of its data centers, but I can't find a source for it right now, just this: https://www.wired.com/2012/11/google-spanner-time/

Atomic clocks aren't all that expensive. You can get a decent rubidium one for US $5K.

I'm guessing that was a reference to the January 2016 event[1]?

Google wasn't the only company that noticed it, and I have no idea if they discovered it before the USAF, but I can believe that someone from Google would phone up Schriever and ask WTF is going on.

[1] http://ptfinc.com/gps-glitch-january-2016/

Google also reports software and hardware security vulnerabilities and infrastructure security issues to external companies, organizations, and stakeholders responsible for the design, operation, and maintenance of external systems. Google isn't the only company that does this - other organizations do too - but at this level expertise is a scarce resource, and we're all in the same boat so it behooves everyone when those capable can and do cooperate and participate in keeping a vigilant watch. This ethos is one of the reasons the West dominates.

> including calling the USAF and telling them their satellites are fucked up

Citation needed. There is a worldwide organization, led by the US Naval Observatory, that keeps constant watch on GPS satellite time performance. If Google noticed anything that USNO and the other participants didn't, that would need a paper or three published.

I’m sure someone has talked about this and proven or disproven it, but I’ve always been a little uneasy with the truetime protocol. I mean it’s faster than Raft, but for the use case maybe we are trying to fix the wrong problem?

If two events are independent, it matters very little what order we record them in the system of record.

My whole career we have been building cause and effect at transaction time but when we debug we stare at log files and time stamps like we are reading tea leaves, trying to figure out what situation A led to corrupted data in row B.

Maybe there’s a way to record this stuff instead of time stamps? Something DVCS style. Or maybe it’s provably intractable.

I think if 2 events are independent then you can use an eventually consistent DB like Cassandra and move on with your day. The importance of something like TrueTime arises when you have an example like in the article. Step 1: remove your parents from seeing your photo albums. Step 2: post your spring break photo album

That doesn't sound like a system you need "true time" for. :(

Of course, I think I just like thinking about how true time will start to fail once we get beyond the earth. Consider, what is the true time for events we are seeing in the stars right now?

Granted, I fully cede that being able to rely on a fully sequenced notion of time that everyone is a part of makes some reasoning much easier.

TrueTime is not a timestamp, but a time with a spread which reflects the maximum difference in time between two data centers.

If I understand correctly, if the size of this spread increases too much, it brings down the number of transactions per second that can be done, so keeping the spread small is key. Hence the atomic clocks etc.

Or just use global consensus with batching and avoid the need for clock synchronization altogether!

the only solution to this is to launch your own satellites :)

If we seal the surface of the ocean, then boil it, we can launch our own time satellites with completely sustainable propulsion.

I've always had an issue with systems assuming a universal time, because physics (Special Relativity) tells us there is no such thing. Two events in different places will be viewed as having a different order depending on your frame of reference. All that really matters on a physical level is causal connections.

I believe vector clocks capture this semantic. But they have other trade-offs.
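For readers who have not met them, here is a minimal sketch of the idea (not tied to any particular database; all names are illustrative). Each node keeps a counter per node, and two events are ordered only if one clock dominates the other; otherwise they are treated as concurrent rather than forced into a timestamp order.

```python
from typing import Dict

Clock = Dict[str, int]

def tick(clock: Clock, node: str) -> Clock:
    """Return a copy of the clock with this node's counter incremented."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new

def merge(local: Clock, received: Clock, node: str) -> Clock:
    """On receiving a message: take the element-wise max, then tick."""
    keys = set(local) | set(received)
    merged = {k: max(local.get(k, 0), received.get(k, 0)) for k in keys}
    return tick(merged, node)

def happened_before(a: Clock, b: Clock) -> bool:
    """True if a is causally before b (a <= b everywhere, < somewhere)."""
    keys = set(a) | set(b)
    le = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    lt = any(a.get(k, 0) < b.get(k, 0) for k in keys)
    return le and lt

def concurrent(a: Clock, b: Clock) -> bool:
    """Neither event causally precedes the other."""
    return not happened_before(a, b) and not happened_before(b, a)

a1 = tick({}, "A")       # event on node A
b1 = tick({}, "B")       # independent event on node B
b2 = merge(b1, a1, "B")  # B receives A's message
print(concurrent(a1, b1), happened_before(a1, b2))
```

One of the trade-offs is visible even here: the clock carries one entry per participating node, so it grows with the size of the system.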

Special relativity has no such problem. In SR you can easily define a “time” coordinate everywhere such that all events can be timestamped with that coordinate and will respect causality.

GR at least usually has this property as well. (It doesn’t in the presence of closed timelike curves. It does in weak gravity and in the FLRW metric in cosmology. I’m not sure about the general strong gravity case.)

Of course every inertial frame in SR has a well defined time coordinate, but that is not a universal time - other frames will disagree on which of two not-causally-connected events happened first. This is normally explained through the lack of a well defined "simultaneity" across different inertial frames.

You are correct that there is no truly capital-U-Universal time, but it doesn't matter. You control the whole system, so just choose one and call it "true time" and make everything participating in the system match it. Simultaneity in all inertial frames can be translated between one another, so if you go to a new place that has, for example, more time dilation due to different gravity, just note the parameters and translate it into your chosen "true time" and adjust the spread.

My point is not that it isn't possible, but that it is arbitrary and has no physical meaning. Writes thrown away in one frame because another concurrent write was "later" would in fact be kept in another frame. This is why it feels like a 'bug' to me conceptually - it's not how the universe works so why should a database need it.

To summarize, you're saying 'The universe doesn't need linear serializability, so why should we?'

We build abstractions because they're useful to us, not because they have some special meaning to the universe. Systems with simpler abstractions are easier to understand and therefore build on top of. Complex numbers, for example, have no direct physical meaning in the (non-quantum mechanical) universe but can still be extremely useful and are sometimes the only/best way to solve some classical problems.

If you want to use a database where a required step of querying it is specifying a reference frame that the ordering of events is relative to, feel free. "In fact, for any two spacelike separated events, it is possible to find a reference frame where you can reverse the order in which they happen."[1] Personally I'll take a hard pass on bug reports like 'Foreign key constraint fails in reference frame 0.992c at 37.2Mm vector towards Alpha Centauri' which reads like the climax of Dante's Inferno for Systems Programmers.

[1]: https://physics.stackexchange.com/a/75765/28368

No, I'm saying the less impedance mismatch we have with the real physical world the easier things will be to work and make scale beyond earth (we will need a network that can scale to mars, for example, soonish). Fundamentally, we are limited by the laws of physics, so it is beneficial to understand them.

We build abstractions because they are useful, but we also often build the wrong abstractions - this is the entire history of science - building better abstractions. Simpler abstractions are great, but they can limit you. We can build all kinds of wonderful machinery with just classical physics, but if you want the modern world with GPS etc. you need relativity to make it work. All the databases based on truetime are great and marvels of engineering, but they won't be able to scale to even a second planet (getting a GPS equivalent to work across two planets is orders of magnitude harder than the earth one, not to mention the latency).

I'm not condoning your idea of a database that explicitly uses frames, but rather something based on more physical foundations like cause-and-effect. As I mentioned, I think vector clocks satisfy this. But maybe there are other better alternatives.

I'm fully aware of the physics (I have an MPhys, and a DPhil in Particle Physics from Oxford).

Can't you solve that by just picking a reference frame though?

HTTP has had a reference frame since 1.0. It’s still one of my favorite features of just about any protocol ever.

Cache headers can set expiration times or offsets, but the client and server both send their own time stamp in the message. So when the server says “expire this at noon” but your laptop’s clock is five minutes slow or three time zones away, you can still figure out what the server meant by noon because the server tells you what time it thinks it is now. The accuracy is limited by network delays but for the problem space getting the right answer to the nearest second is still pretty damned good.
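A minimal sketch of that correction, assuming the response carries both a `Date` and an `Expires` header. The helper name is made up, and the sketch ignores network delay and other caching headers such as `Age`:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def local_expiry(date_header: str, expires_header: str,
                 local_now: datetime) -> datetime:
    """Translate the server's absolute expiry into the client's own clock.

    The server says what time it thinks it is (Date) and when the entry
    expires (Expires); only the difference between the two is trusted.
    """
    server_now = parsedate_to_datetime(date_header)
    server_expiry = parsedate_to_datetime(expires_header)
    lifetime = server_expiry - server_now
    return local_now + lifetime

# Server thinks it is 11:55 and says "expire at noon".
# The laptop's clock is five minutes slow and reads 11:50.
laptop_now = datetime(2018, 9, 21, 11, 50, tzinfo=timezone.utc)
expiry = local_expiry("Fri, 21 Sep 2018 11:55:00 GMT",
                      "Fri, 21 Sep 2018 12:00:00 GMT",
                      laptop_now)
print(expiry)  # 11:55 on the laptop's clock, i.e. five minutes from "now"
```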

Yes, and we all pick the Earth’s gravity well as our reference frame. Differences due to elevation and latitude are too small to measure for purposes of transaction clocks.

yeah, I’m pretty sure TrueTime is only defined on Earth, and GPS satellites do take relativity into account in their design.


You need a quite elaborate system of a grid of synchronized, stationary clocks and rods to correctly define a reference frame, which is worse than what's required to come up with a "good enough on earth to some time resolution" true time.

To be clear this is pretty much exactly what the system of GPS satellites does, but without physical rods, and it is very expensive to construct and maintain. This works fine to a certain accuracy as long as you restrict your application to staying on earth.

When infinite resources are available, organizations will somehow find a way to consume them. There are properly consistent not-SQL architectures that do not need clocks, Datomic is one, and it does not require coordination with any government agencies to operate. https://www.datomic.com/

Also, I believe all the major cloud providers provide a "TrueTime" API service. I forgot the name that AWS uses, but you can call it on your EC2 instances and make sure your hosts are all in sync. It's pretty cool.

AFAIK AWS is only offering NTP service with a GPS source (i.e. stratum 1). TrueTime appears to be a service offering that is a step above that in terms of the guarantees it provides.

Ah, I've never had a use case for anything more accurate than NTP

You may never have built a system that needed it, but LTE cell handoffs require higher precision than NTP can supply. Almost every cell phone in use today benefits from highly precise timing.

This is why PTP and high precision GPS devices are built to integrate with cell provider gear.

TrueTime is more of a client side library that utilizes a service like NTPd.

You can implement a toy version of TrueTime with about 60 lines of C which uses a single ntp_gettime call for each of the TrueTime api functions (now, before, after).

If AWS's NTPd service offers drift <= 200 us/s, you could use it for TrueTime.
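To make the shape of that API concrete, here is a toy sketch in Python rather than the C version described above. It does not call ntp_gettime. Instead it assumes a fixed base error (`base_error_s`, a made-up parameter) plus the 200 us/s drift bound accumulated since the last sync; a real implementation would take its uncertainty from the time daemon.

```python
import time
from dataclasses import dataclass

DRIFT_RATE = 200e-6  # assumed worst-case drift: 200 us per second

@dataclass(frozen=True)
class TTInterval:
    earliest: float
    latest: float

class ToyTrueTime:
    """Toy interval clock: now() returns bounds instead of a single timestamp."""

    def __init__(self, base_error_s: float = 0.001, clock=time.time):
        self._clock = clock
        self._base_error = base_error_s  # assumed error right after a sync
        self._last_sync = clock()

    def synced(self) -> None:
        """Call after each successful sync with the time source."""
        self._last_sync = self._clock()

    def _epsilon(self, t: float) -> float:
        # Uncertainty grows with the time elapsed since the last sync.
        return self._base_error + DRIFT_RATE * max(0.0, t - self._last_sync)

    def now(self) -> TTInterval:
        t = self._clock()
        eps = self._epsilon(t)
        return TTInterval(t - eps, t + eps)

    def after(self, ts: float) -> bool:
        """True only if ts has definitely passed."""
        return self.now().earliest > ts

    def before(self, ts: float) -> bool:
        """True only if ts has definitely not arrived yet."""
        return self.now().latest < ts

tt = ToyTrueTime()
interval = tt.now()
print(interval.latest - interval.earliest)  # width of the uncertainty window
```

The hard part is not this wrapper but keeping the error bound both small and trustworthy, which is what the operational discussion above is about.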

This article lacks any reference to FoundationDB, which offers external consistency and serializable distributed transactions without trusting clocks in any way. We designed it starting in 2009 and so its lineage is independent of either Calvin or Spanner.

FDB doesn't have an actively developed SQL layer at the moment, so I guess you could say it isn't a "NewSQL" database, but none of the properties under discussion have much to do with the query language.

As best as I could find, FoundationDB doesn't provide strict serializability (i.e. linearizability + serializability) and doesn't tackle multi-DC deployments. So I wouldn't put it in the same class as FaunaDB or Spanner.

There is a "Datacenter-aware mode": https://apple.github.io/foundationdb/configuration.html#data...

Here is some discussion about linearizabilty in fdb: https://news.ycombinator.com/item?id=16884882

Without knowing exactly how reads and writes are replicated, I am skeptical. I have gone through the technical documentation and haven't found many details. For reference, I've done work in storage and consensus algorithms and I can tell you for a fact that without using a consensus algorithm for either reconfiguration or request propagation, you will have consistency violations.

I would love to be proven wrong, as more systems with strong consistency guarantees is better, but for now, I don't believe that foundation db provides stronger guarantees than serializable reads and writes.

FoundationDB uses a consensus algorithm for reconfiguration, but not in the (happy path) transaction pipeline. It provides (by default) strict serializability (i.e. serializability and external consistency/linearizability) for arbitrary, ad hoc, interactive transactions, and it's expected to provide excellent performance when every single transaction is cross node (so e.g. indexes can be efficiently updated this way). It provides better fault tolerance than consensus replicated databases typically can because it needs only N+1 replicas instead of 2N+1 to survive N faults (it keeps 2N+1 replicas of tiny configuration for consensus, of course). It has the best testing story in the industry and is used at scale by, among others, the largest company in the world. Because it doesn't have consensus in the datapath it can have lower latencies than consensus replicated databases in multi region deployments, it also supports asynchronous replication, and an upcoming feature will provide a unique option for multi region failover with sub geographic write latencies, maintaining full transaction durability in failover, as if in synchronous replication, except for the exceptionally rare case where the failure of multiple datacenters in a region are exactly simultaneous.
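The N+1 versus 2N+1 arithmetic can be spelled out in a few lines. This is only the counting argument from the comment, with hypothetical function names, and says nothing about how either kind of system is actually implemented:

```python
def replicas_with_majority_quorum(faults: int) -> int:
    """A majority must survive, so tolerating N faults needs 2N + 1 replicas."""
    return 2 * faults + 1

def replicas_with_full_copies(faults: int) -> int:
    """If one surviving full copy is enough, N faults need N + 1 replicas."""
    return faults + 1

for n in (1, 2, 3):
    print(n, replicas_with_full_copies(n), replicas_with_majority_quorum(n))
```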

Besides its extensive documentation, you can read its source code and run its deterministic simulation tests yourself if you are interested (it's Apache licensed). Skepticism on these points was reasonable when we originally launched it in 2012 but is getting a little silly in 2018.

How does FoundationDB do distributed transactions without consensus?

> With 10ms batches, Calvin was able to achieve a throughput of over 500,000 transactions per second. For comparison, Amazon.com and NASDAQ likely process no more than 10,000 orders/trades per second even during peak workloads.

I haven’t worked with NASDAQ stream directly, but knowing how fast equities tick I find this “10,000 orders/sec” estimate quite low.

Not to mention that 10ms delay in confirming an order would be really terrible.

According to NASDAQ's website [1]:

> Able to consistently sustain an order rate of over 100,000 orders per second at sub-40 microsecond average latency

[1]: https://business.nasdaq.com/market-tech/marketplaces/trading

> We will trace the failure to guarantee consistency to a controversial design decision made by Spanner that has been tragically and imperfectly emulated in other systems.

This post doesn't establish any "controversy" about Spanner's design decision. It only says that it requires special hardware, which other systems attempt to emulate despite not having this specialized hardware.

To call this decision "controversial" I think one would need to show that it has some significant problem in the environment it was designed for.

By controversial, I mean that in the database research community (where I spend most of my time), there is significant disagreement about global consensus vs. Spanner's choice of partitioned consensus.

There may be disagreement about Spanner's choice, but my observation of it in a production environment is that the choices were reasonable tradeoffs that permitted a large number of products to be moved off BigTable with confidence about how the system would behave in a demanding environment. It has worked well enough to remain established, and I don't see any successor with a "better" design coming along and replacing it within a decade.

Disagreement about what though? Does Spanner's solution have an objective problem? Do you or others in your community have specific reasons to believe that it cannot deliver on its promises?

Spanner's approach requires help from hardware and several full-time employees maintaining and ensuring the uncertainty guarantees. This increases the cost of maintaining the system, which for Cloud Spanner is partially passed on to the end users. If you can build a system that doesn't require time synchronization, yet doesn't have any significant drawbacks relative to what Spanner provides, you'd be better off using this alternative system.

> If you can build a system that doesn't require time synchronization, yet doesn't have any significant drawbacks relative to what Spanner provides, you'd be better off using this alternative system.

But you describe exactly the drawbacks of giving up time synchronization:

> The main downside of the first category is scalability. A server can process a fixed number of messages per second. If every transaction in the system participates in the same consensus protocol, the same set of servers vote on every transaction. Since voting requires communication, the number of votes per second is limited by the number of messages each server can handle. This limits the total amount of transactions per second that the system can handle.

How is "worse scalability" not a significant drawback?
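
The quoted scalability argument can be put as a toy model. This is a sketch under loud assumptions: every number is hypothetical, the function name is invented, and it ignores batching and cross-partition transactions, both of which change the picture in real systems.

```python
def max_txn_rate(
    msgs_per_server_per_sec: int,
    msgs_per_txn_per_server: int,
    independent_groups: int = 1,
) -> int:
    """Upper bound on transactions per second in a toy voting model.

    With one global consensus group, the same servers vote on every
    transaction, so the bound does not grow as machines are added.
    With partitioned consensus, each group votes only on its own
    transactions, so the bound scales with the number of groups
    (assuming every transaction touches exactly one group).
    """
    if msgs_per_txn_per_server <= 0 or independent_groups <= 0:
        raise ValueError("message cost and group count must be positive")
    per_group = msgs_per_server_per_sec // msgs_per_txn_per_server
    return per_group * independent_groups


if __name__ == "__main__":
    # Hypothetical: 100,000 messages/sec per server, 4 messages per vote.
    print(max_txn_rate(100_000, 4))                         # global consensus
    print(max_txn_rate(100_000, 4, independent_groups=10))  # 10 partitions
```

In this model the global group tops out at 25,000 transactions per second however many servers join, while ten independent groups reach 250,000.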

This just sounds like an engineering tradeoff. I don't think engineering tradeoffs are the same as controversy. I get that your group's DB takes a different approach. But "blaming" Spanner for making a different trade-off doesn't come off well (I approached the article with an open mind).

You beat me to it. Also, per

> Calvin was able to achieve a throughput of over 500,000 transactions per second. For comparison, Amazon.com and NASDAQ likely process no more than 10,000 orders/trades per second even during peak workloads.

Maybe if you are only considering the small scope of just the order transactions, but they are actually writing far more data to their databases, such as logs, metrics, and current state for things like shopping carts. In my experience, teams split their data off into their own siloed databases to minimize scaling problems, but this becomes super painful when you want to join your data with others'. If Spanner can hold all of our data, scale, and handle joining across all of it, that sounds like a huge win.

You clearly have an agenda here. The blog post contains little information while spreading FUD. Never heard of Calvin and will now avoid it.

Calvin is a research database, described in a paper and also provided as an open source project, but last I checked, it was pure research, not usable for real work. The "production version" of Calvin is arguably FaunaDB, which is a hosted SaaS product.

Abadi is a very well known and, as far as I know, respected database researcher. No need to avoid his work.

That's because the author is pushing Calvin, which came out of his research group.
