In my experience, yes, true network partitions are incredibly rare. However, 99% of my distributed-system "partitions" have had little to do with the network. When running databases in a cloud environment, partitions can occur for a variety of reasons that don't actually involve the network link between databases:
1. The host database is written in a GC’d language and experiences a pathological GC pause.
2. The virtual machine is migrated and experiences a pathological pause.
3. Google migrates your machine with local SSDs, fucks up that process and you lose all your data on that machine (you do have backups right?)
4. AWS retires your instance and you need to reboot your VM.
You may never see these issues if you are running a 3- or 5-node database cluster. I began seeing issues like this semi-regularly once the cluster grew to 30-40 machines (Cassandra). Now, I will agree that none of the issues took down a majority, but if your replication factor is 3, it really only takes one unlucky partition to fuck up an entire shard.
If you can afford 3-5 machines/VMs for a cluster you can almost certainly afford a single machine/VM with 2-4x the resources/CPU and chances are that it'll perform just as well (or better) because it doesn't have network latency to contend with.
Of course if you're around N >= 7 or N >= 9, then you should perhaps start considering "distributed X".
As long as your application is built on very few assumptions about consistency, it's usually pretty painless to actually go distributed.
Of course, there are legitimate cases where you want to go distributed even with N=3 or N=5, but IME they're very few and far between... And if your 3/5 machines are co-located, then it's really likely that the "partitioning" problem is one of the scenarios where they actually all go down simultaneously, or where availability goes to 0 because you can't actually reach any of the machines (which are likely to be on the 'same' network).
You can fit a lot of data in a 2TB in-memory database.
One pair in two locations is basically the minimum, unless you're really latency-insensitive, in which case you could have one machine in each of three locations.
Nowadays you easily have 32 cores on a machine, and each core is significantly faster than it was back then (probably at least 10x). That is a compute cluster by the definition of 1999.
So for certain "big data" tasks (really "medium data" but not everyone knows the difference), I just use a single machine and shell scripts / xargs -P instead of messing with cluster configs.
You can crunch a lot of data on those machines as long as you use all the cores. 32x is the difference between a day and less than an hour, so we should all make sure we are using it!
Common languages like node.js, Python, R, OCaml, etc. do NOT let you use all your cores "by default" (due to GILs / event loop concurrency). You have to do some extra work, and not everyone does.
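In Python, for example, the "extra work" is often just fanning chunks out over a process pool instead of threads. A minimal sketch (the crunch function and the chunking are placeholders for whatever your workload actually is):

    from multiprocessing import Pool, cpu_count

    def crunch(chunk):
        # stand-in for whatever per-chunk processing you need
        return sum(x * x for x in chunk)

    if __name__ == "__main__":
        # one chunk per task; Pool spreads them over all cores
        chunks = [range(i, i + 1_000_000) for i in range(0, 32_000_000, 1_000_000)]
        with Pool(cpu_count()) as pool:
            results = pool.map(crunch, chunks)
        print(sum(results))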
If the code is written in a distributed fashion from the start then it can be designed so it's one python/node/R process per core. And that makes the transition to distributed servers simpler anyway.
That's a really glib dismissal of how hard the problem is. Python and node have pretty terrible support for building distributed systems. With Python, in practice most systems end up based on Celery, with huge long-running tasks. This configuration basically boils down to using Celery, and whatever queueing system it is running on, as a mainframe-style job control system.
The "shell scripts / xargs -P" mentioned by chubot is a better solution that is much easier to write, more efficient, requires no configuration, and has far fewer failure modes. That is because Unix shell scripting is really a job control language for running Unix processes in parallel and setting up I/O redirection between them.
Oh dear... Yeah, that's a terrible distributed system. Interestingly, all the distributed systems I've worked on with Python haven't had Celery as any kind of core component. It's just poorly suited for the job, as it is more of a task queue. A task queue is really not a good spine for a distributed system.
There are a lot of Python distributed systems built around in-memory stores, like pycos or dask, or built around existing distributed systems like Akka, Cassandra, even Redis.
Cassandra and Redis just mean that you have a database-backed application. How do you schedule Python jobs? Either you build your own scheduler that pulls things out of the database, or you use an existing scheduler. I once worked on a Python system that scheduled tasks using Celery, used Redis for some synchronization flags, and Cassandra for the shared store (also for the main database). Building a custom scheduler for that system would have been a waste of time.
Oh, there's a lot more to it than that. CRDTs, for example.
You can scale up web servers to handle more requests, which then use Celery to offload jobs to different clusters.
I don't think any reasonable developer would dismiss concurrency/parallelism as easy problems.
 Well, my dad bought it, but y'know. He wasn't particularly interested, but I think he recognized an interest in computers in me :).
It predated the VIC-20 by just under 3 years. You might have had 3.5KB free for BASIC programs but the VIC-20 had a comparatively spacious 5KB of RAM and 20KB of ROM.
It depends what is meant by "commodity" I guess.
The largest server AWS offers has only 64 physical cores (128 logical) and less than 4 TB of RAM.
    Quad socket R1 (LGA 2011) supports Intel® Xeon® processor E7-8800 v4/v3, E7-4800 v4/v3 family (up to 24-Core)
    Up to 12TB DDR4 (128GB 3DS LRDIMM); 96x DIMM slots (8x memory module boards: X10QBi-MEM2)
AFAIK the latest Skylake Xeons ("Scalable" - Platinum/Gold/Silver/Bronze) have regressed to 1.5TB support, see https://ark.intel.com/products/93794/Intel-Xeon-Processor-E7... http://www.colfax-intl.com/nd/downloads/Intel-Xeon-Scalable-... corroborated by https://ark.intel.com/products/120502/Intel-Xeon-Platinum-81...
Since a single 128GB stick costs about 2,900 USD, 96 of them will run to ~280,000 USD plus the server, so the total is likely to be above 300,000 USD.
If you want to go with 6TB "only" then it's a lot, lot cheaper as 64GB RDIMM sticks can be had below 700 USD. The end result might cost closer to a third of the 12TB server than half of it.
RAM prices comparison: https://memory.net/memory-prices/
The server itself might only be $25k, but those 224 cores could add another $90k, so the total would be close to $400k.
The previous-gen version, SYS-7088B-TR4FT (link in my comment upthread), has 192 DIMM slots, so if you don't need CPU horsepower, you can get the lower-density modules and still have 12TB (or the max of 24TB for a price premium!).
Even previous gen, if 12TiB main memory is your goal (NUMA concerns aside), it's probably worth going for the 8S system instead of the 4S one, since that's a savings of $144k, and 8 slower/fewer-core CPUs might even be cheaper than 4 that have twice the performance.
If I need more nines of uptime than a single system could provide (and think about the MTBF numbers for the various parts in a computer; a single machine is not going to give you even three nines of uptime), then I'm literally forced to go distributed with N >= 3 and geographical redundancy (we stay up if any k of these machines fail, or if the network to "us-east-4" goes down). Things get worse (and you need to be more paranoid) if your service level obligation turns into a service level agreement, because then it usually costs you money when you mess up.
Of course, the more distributed you are, the slower and more complicated your system becomes, and the more you expose yourself to the additional associated downtime risks ("oops, the lock service lost quorum"). They usually cost more to run, obviously. C'est la vie. There is no magic bullet in software engineering.
It used to be that you could spend 3X or more in engineering effort getting a distributed system up and running vs its singly-homed alternative. These days with cloud deployments (and Kubernetes for on-prem) you get a lot of labor-saving devices that make rolling out a distributed system a lot easier than it used to be. You still need to do a cost/benefit analysis!
There are environments where flat time distribution for SLO calculation is not acceptable. (cough betting exchange)
If your traffic patterns are extremely spiky, such as weekly peaks hitting 15-20x of your base load, and where a big chunk of your business can come from those peaks, then most normal calculations don't apply.
Let's say your main system that accepts writes is 10 minutes down in a month. That's easily good for >99.9% uptime, but if a failure + PR hit from an inconveniently timed 10-minute window can be responsible for nearly 10% of your monthly revenue, that's a major business problem.
So when setting SLOs, they should be set according to business needs. I may be a heretic saying this but not all downtime is equal.
Since total outages are a special case of partial outages, use metrics that cleanly measure partial outages. That means request error metrics.
* maximum number of users affected
* maximum time of unavailability
* maximum observed latency
* highest ratio of failed requests over a sequence of relatively tight measurement windows
So yes, the request error ratio is certainly a good part of the overall SLO, but it covers only a portion of the spectrum.
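To make the last bullet concrete, here's a toy sketch (the per-minute counters are made up) of why a tight-window error ratio catches what a flat ratio over the whole period hides:

    # hypothetical per-minute windows of (total_requests, failed_requests)
    windows = [(1200, 2), (1150, 3), (60, 30), (1300, 1)]

    worst = max(failed / total for total, failed in windows if total)
    print(f"worst 1-minute error ratio: {worst:.1%}")   # the 50% window dominates

    # a flat ratio over the whole period largely hides that same window
    total = sum(t for t, _ in windows)
    failed = sum(f for _, f in windows)
    print(f"flat error ratio: {failed / total:.1%}")    # ~1%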
Three nines is 8 hours and 45 minutes of downtime per year. In my experience with quality hardware, at a facility with onsite spares, that gives you two to three hardware faults, assuming your software and network are perfect, which I think is a fair assumption :)
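The arithmetic, for anyone who wants to check the nines:

    hours_per_year = 365 * 24            # 8760
    for uptime in (0.999, 0.9999):
        downtime_h = hours_per_year * (1 - uptime)
        print(f"{uptime:.2%} uptime allows {downtime_h:.2f} h of downtime per year")
    # 99.90% -> 8.76 h (about 8 h 45 min)
    # 99.99% -> 0.88 h (about 53 min)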
Definitely the hardest problem when you switch to a distributed database is that you have to contemplate partial failure. When your SPOF database is down, it's easy to handle: nothing works. When your distributed database partially fails, chances are you have a weird UX problem.
I found that constant communication and scenarios help bridge the gap between how they think things work and how you do. It doesn't always work - a lot of the time people don't care to discuss things in detail.
But sometimes it worked and the other person came to realize a new way of thinking. Sometimes the discussion made me realize I was the person using invalid assumptions :)
I use this same reasoning to use an RDBMS until I find a reason it will not work. An ACID compliant datastore makes so many things easier until you hit a very large scale.
We do rolling releases of software all the time but it’s pretty hard for us to do much optimisation of our DB setup without doing it in the middle of the night because of how all this stuff works.
They still took a window a year for changes where there was no alternative. That still seems preferable to me to the kinds of ongoing problems that distributed, eventually consistent databases produce. You might have better availability numbers, but if customer service has to spend money to fix mistakes caused by data errors, I don't know that that's better for the business.
Or are you talking about other kinds of tweaks?
Given that Postgres has built-in synchronous replication, I feel like it should also have some support for multiple primaries (during a cutover window) to allow better HA support.
This is especially true for storage systems, which are the topic of this post and are by far some of the most complex distributed systems. Losing a node can mean losing data. Losing a large, vertically scaled node can mean losing a lot of data.
You can mitigate those risks with backups, and today's cloud APIs give customers the ability to spin up a replacement host in seconds. But then the focus shifts to what will achieve higher availability: a complex distributed system, but one that always runs in active/active mode, or a simpler single-node system, but one that relies on an infrequently exercised and risky path during failures.
Hard "distributed systems" problems in databases generally only creep up when you're trying to deploy a multi-active system.
The most common way in which such systems fail over is by fencing off the old master prior to failover, which in itself can be a hard distributed-systems problem.
When a replica is used as a read slave, that introduces problems with (strong) consistency, and it definitely starts to resemble any other distributed system.
Adding automated failover adds the need for some kind of consensus algorithm, another feature of distributed systems.
However, if failover is manual, for example, then 2 nodes can be nearly indistinguishable from 1 node, if no failover occurs.
It also strains the definition of "distributed" if both nodes are adjacent, possibly even with direct network and/or storage connections to each other.
Past that easy single-node / very small cluster case, it's time to start asking more important design questions. As an example, which part(s) NEED to be ACID compliant, and which can have eventual consistency (or do they even need that, as long as the data was valid 'at some point'? So just 'atomic').
Given the same provider, ten $50 instances can usually handle a much higher traffic load (in terms of sheer bulk of packets) than a single $500 instance can.
Alternately, you can switch over to using a mainframe architecture, where your $500 instance will actually have 10 IO-accelerated network cards, and so will be able to effectively make use of them all without saturating its however-many-core CPU's DMA channels.
So you are either accepting the downtime or planting your head in the ground in denial.
6. Well, basically the hundred other reasons the machine could brown-out enough that things start timing out even though it's sporadically online. Bad drive, rogue process, loose heatsink, etc.
Dead hosts are easy. Half-dead hosts suck.
It's no surprise that we're currently running screaming to something with stronger transactional guarantees.
Firestore is currently in beta, but once it's GA we will be migrating all Datastore users to Firestore: https://cloud.google.com/datastore/docs/upgrade-to-firestore
Disclaimer: I work on Cloud Datastore/Firestore.
I would view the Firestore API as a further extension (the "Datastore Mode" functionality was always an element of the design) of that paradigm, extending to the case where you have no trusted piece of code to mediate requests to the database, thus allowing direct use from, e.g., mobile apps (at which point other issues such as disconnected operation surface).
So not so much a "different kind of product" and more a product that supports a strict superset of use cases.
I actually think most of the server SDKs don't even expose many of the realtime APIs. Maybe they will in the future, but it shows that you can use Firestore like a normal database just fine.
(I work for GCP)
I ran across it before and just wanted to say it’s really cool that this is open sourced.
Nomulus is software that runs a domain name registry, including most notably the .app TLD. There are three fundamental objects at play here: the domains themselves, contacts (ownership information that goes into WHOIS), and hosts (nameserver information that feeds into DNS). There's a many-to-many relationship here, in that contacts and hosts can be reused over an arbitrarily large number of domains.
The problem is that you can't perform transactions over an arbitrarily large number of different objects in Cloud Datastore; you're limited to enlisting a maximum of 25 entity groups. This means that you can't perform operations that are strongly consistent when contacts or hosts are reused too often. This situation comes up a lot; registrars tend to reuse the same nameservers across many domains, as well as the contacts used for privacy/proxy services.
These problems don't arise in a relational SQL database, because you can simply JOIN the relevant tables together (provided you have the correct indexes set up) and then perform your operations in a strongly consistent manner. That trades off scalability for consistency though, whereas in Spanner you give up neither.
Exactly my point, just a little generalized. The world isn't consistent, so why always put unrealistic constraints on ourselves? If we change from "I really need to have all the datasets in their truest form" and instead go with "well, there is some data coming, better than nothing", the whole system might be even more reliable. Things in the middle won't die just because the world is imperfect.
Google makes its "leap-smeared NTP network" available via Google's Public NTP service. And it's not expensive for someone to buy their own GPS clock and run their own leap-smared NTP service.
Yes, it means that someone who installs a NewSQL database will have to set up their own time infrastructure. But that's not hard! There are lots of things about hardware setup which are at a similar level of trickiness, such as a UPS with notification so that servers can shut down gracefully when the batteries are exhausted, locating your servers in on-prem data centers so that if a flood takes out one data center you have continuity of service elsewhere, etc., etc.
Or of course, you can pay a cloud infrastructure provider (Google happens to provide one, but Amazon and Azure also provide similar services) to take care of all these details for you. Heck, if you use Google Cloud Platform, you can use the original Spanner service (accept no substitutes :-)
Some basics for less-critical applications: https://github.com/jrockway/beaglebone-gps-clock/blob/master...
That is, in order to be considered a satisfactory, non-bare-bones data center, you'd need to supply on-site, redundant, maintained and monitored hardware clocks, verifiably meeting certain standards of accuracy and precision. Just like right now people expect backup power, cooling, network connectivity, physical security, access, etc.
It doesn't seem like the burden would be that large for data center operators. The hardware costs don't sound large, and they already have people being paid to monitor and maintain things.
This is a fun weekend hobby project that anyone can do at home if they are curious. Get a GPS chip with a PPS output signal, a plain old “hockey puck” antenna, and a computer to hook it up to (i have a rpi in my attic) and you can deliver fairly accurate time to your home network. Not a setup I’d rely on for a commercial, globally distributed deployment, but accurate enough for home use, and probably better than your current time source.
Which, as far as I see, boils down to hardware+software reliability.
It's likely that Spencer would have something different to say about the matter today.
I mean, this is all fine and dandy if it works (plus, how do you know that it actually works in all the edge cases?), but that's a HUGE amount of complexity. IMO you really need a very good reason to even consider anything this complex.
I really like the Calvin protocol, and it does seem perfectly suited to many application workloads, but it is odd to me to see Calvin presented as purely superior to all alternatives. It seems like more research and work needs to be done to create systems that can address its shortcomings around transactions that query for keys (3.2.1 Dependent transactions in the paper), including ad-hoc queries and even interactive queries (within a transaction).
Disclaimer: I work on TiDB/TiKV. TiKV (distributed K/V store) uses the percolator model for transactions to ensure linearizability, but TiDB SQL (built on top) allows write skew.
In FaunaDB, this is formalized as an optimistic concurrency control mechanism that combines snapshot reads at the coordinator and read-write conflict detection within the transaction engine. By the time a transaction is ready to commit the entire key set is known. I wrote a blog post that goes into more detail here: https://fauna.com/blog/acid-transactions-in-a-globally-distr...
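Not FaunaDB's actual engine, but the general shape of read-version tracking plus validate-at-commit can be sketched in a few lines (single-process toy; in a real system the snapshot reads happen at the coordinator and validation in the transaction engine):

    # key -> (version, value)
    store = {"x": (0, "a"), "y": (0, "b")}

    class Txn:
        def __init__(self):
            self.read_set = {}    # key -> version observed at snapshot time
            self.write_set = {}   # key -> new value

        def read(self, key):
            version, value = store[key]
            self.read_set[key] = version
            return value

        def write(self, key, value):
            self.write_set[key] = value

        def commit(self):
            # validate: abort if anything we read changed under us
            for key, seen in self.read_set.items():
                if store[key][0] != seen:
                    return False
            for key, value in self.write_set.items():
                version = store[key][0] if key in store else -1
                store[key] = (version + 1, value)
            return True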
Better analytics support, including ad-hoc queries, is on our roadmap. That being said, requiring indexes was a design choice: first and foremost, we want FaunaDB to be the best operational database. It is a lot easier to write fast, predictable queries if they cannot fall back to a table scan!
While FaunaDB requires you to create indexes to support the views your workload requires, the reward is a clearer understanding of how queries perform. We've built indexes to be pretty low commitment, however. You can add or drop them on the fly based on how your workload evolves over time.
For a longer explanation of the same idea which includes a concrete example of how you can get "high availability" in a CP system, see: https://apple.github.io/foundationdb/cap-theorem.html
Is this true? I always thought it meant that clients could continue to read and write to "the database" which could include the client switching to another node. There is nothing in CAP theorem about latency, so switching, even if it adds high latency, is fine by CAP theorem.
This lack of accounting for latency is what makes CAP theorem less useful than a lot of people realize IMO.
Maybe there is a way to modify the CAP theorem so that it says something non-trivial (e.g. a theorem about the limit of how many node failures a Paxos-like algorithm can handle, but even this is probably trivial, namely half), but the CAP theorem as stated by the originators is trivial, and the proof is a dressed up version of what I stated above. I don't think the authors would disagree with this at all, since they explicitly state:
"The basic idea of the proof is to assume that all messages between G1 and G2 are lost. If a write occurs in G1 and later a read occurs in G2, then the read operation cannot return the results of the earlier write operation."
Isn't it interesting that such a paper has 1600 citations and has reshaped the database industry?
This assumes that only the servers are partitioned from each other and clients are not partitioned from the majority quorum. This might be rare but it is not impossible at scale.
There is also a latency cost of strict serializability or linearizability which is hard to mitigate at geo-replicated scale.
However, for what percentage of use cases does a reduction in write latency outweigh the disadvantages? I think that's a very small percentage. Heck, the vast majority of companies that are using a hip, highly available NoSQL database cluster would probably be fine running a DB on a single server with a hot standby.
You could imagine a system that gives you a bit of both. When you do a write, there are several points in time that you may care about, e.g. "the write has entered the system and will eventually be visible to all clients" and "the write is visible to all clients". The database could communicate that information (asynchronously) to the client that's doing the write, so that this client can update the UI appropriately. When a client does a read, it could specify what level of guarantee it wants: "give me the most recent data, regardless of whether the write has been committed" or "only give me data that is guaranteed to have been committed". Such a system could theoretically give you low latency or consistency on a case-by-case basis.
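A toy, single-process simulation of that idea (the class names and the sleep-based "replication" are made up; it's only meant to show the API shape):

    import threading, time
    from enum import Enum

    class ReadLevel(Enum):
        LATEST = "latest"        # may return data not yet visible everywhere
        COMMITTED = "committed"  # only data guaranteed visible to all clients

    class ToyStore:
        def __init__(self):
            self.latest = {}      # writes that have entered the system
            self.committed = {}   # writes visible to all clients

        def write(self, key, value, on_accepted=None, on_committed=None):
            self.latest[key] = value
            if on_accepted:
                on_accepted(key)           # "the write has entered the system"
            def replicate():
                time.sleep(0.1)            # pretend replication lag
                self.committed[key] = value
                if on_committed:
                    on_committed(key)      # "the write is visible to all clients"
            threading.Thread(target=replicate, daemon=True).start()

        def read(self, key, level=ReadLevel.COMMITTED):
            src = self.latest if level is ReadLevel.LATEST else self.committed
            return src.get(key)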
This is a misconception. AP databases are not supposed to give up consistency, just not wait for all nodes to see updates. That's it. Consistency is still there, nodes resync, users always see their own updates and all that.
Inconsistency in the CAP sense may also cause inconsistency in the sense that you mean, for instance if two transactions are simultaneously accepted, but each transaction violates the precondition of the other. In a consistent database one of the transactions will see the result of the other transaction and fail, whereas a DB without consistency may accept both transactions and end up in a semantically incorrect state.
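A toy example of that failure mode (classic write skew; the account names and the invariant are made up):

    # invariant we want to preserve: checking + savings >= 0
    accounts = {"checking": 70, "savings": 80}
    snapshot = dict(accounts)   # both transactions read the same snapshot

    def withdraw(account, amount, view):
        # precondition checked against the stale snapshot, not the current state
        if sum(view.values()) - amount >= 0:
            accounts[account] -= amount
            return True
        return False

    withdraw("checking", 100, snapshot)       # sees total 150, allowed
    withdraw("savings", 100, snapshot)        # also sees total 150, allowed
    print(accounts, sum(accounts.values()))   # total is now -50: invariant broken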
I've been complaining about this for years and it's so nice to see others echo the sentiment. Everyone chases timestamps but in reality they're harder to get right than most people are willing to acknowledge.
Very few places get NTP right at large scale, or at least to within the accuracies required for this class of consistency. I've never seen anyone seriously measure their SLO for clock drift, often because their observability stack is incapable of achieving the necessary resolution. Most places hand-wave the issue entirely and just assume their clocks will be fine.
The paper linked within TFA suggests a hybrid clock which is better but still carries some complications. I'll continue to recommend vector clocks despite their shortcomings.
The potential clock error on VMs without dedicated time-keeping hardware is so large that performance turns into absolute garbage.
Spanner is really fast and massively parallelizable.
It'd be great if you could address this somewhere in the top level README on GitHub.
Disclaimer, I work for AWS, but not on DynamoDB team.
It's left as a very simple task for developers leveraging DynamoDB to make the appropriate trade-offs between consistent and inconsistent reads.
source: Used to work for AWS on a service that heavily leveraged DynamoDB. Not _once_ did we experience any problems with consistency or reliability, despite them and us going through numerous network partitions in that time. The only major issue came towards the end of my time there when DynamoDB had that complete service collapse for several hours.
On the sheer scale that DynamoDB operates at, it's more likely to be a question of "How many did we automatically handle this week?" than "How often do we have to deal with network partitions?"
"GetItem provides an eventually consistent read by default."
This seems to meet the definition of "DynamoDB's default settings"
 - https://docs.aws.amazon.com/amazondynamodb/latest/APIReferen...
It would be better to state that both eventually consistent and fully consistent reads are available, and consistency can be enforced up front via configuration.
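In boto3, for example, flipping to a strongly consistent read is a single per-request parameter (the table and key names here are hypothetical):

    import boto3

    table = boto3.resource("dynamodb").Table("Orders")

    # default: eventually consistent read
    stale_ok = table.get_item(Key={"order_id": "123"})

    # opt in to a strongly consistent read for this request
    fresh = table.get_item(Key={"order_id": "123"}, ConsistentRead=True)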
However, before it detects it, there is a possibility of stale reads.
Or do you mean some other part of AD?
Aggregate cluster performance seemed very good, though; i.e., adding a bunch more concurrent transactions did not slow down the other transactions noticeably.
But I added a note at the end that clearly documents my connection to FaunaDB. As far as Calvin, the post itself clearly says that it came out of my research group.
I wonder if any of the aforementioned systems (Calvin/Spanner/YugaByte) can opportunistically commit transactions, detect issues, and roll back + retry, all within the scope of the RPC, so they can still conform to the linearisability requirement?
To answer your question about scale: there is no real practical difference in scalability between the two categories discussed in the post. Partitioned consensus has better theoretical scalability, but I am not aware of any real-world workload that can not be handled by global consensus with batching.
I've seen those batched consensus systems, and honestly, you're kidding yourself if you think they can handle a million qps. Just the transmit time of the data on Ethernet would become an issue on its own! Even with 40 gig, transmit time never becomes free. So now you're stuffing 1/10th of a million qps worth of data through a single set of machines (3, 5, 7, 9? Some relatively small number).
Am I misunderstanding you? Hopefully I am!
Many of the distributed clusters I've maintained had crap infrastructure and no change control, and parts of the clusters were constantly going down from lack of storage, CPU and RAM, or bad changes. The only reason the applications kept working were either (1) the not-broken vnodes continued operating as normal and only broken vnodes were temporarily unavailable, or (2) we shifted traffic to a working region and replication automatically caught up the bad cluster once it was fixed. Clients experienced increased error rates due primarily to these infrastructure problems, and very rarely from network partition.
Does your consistent model take this into account, or do you really assume that network partition will be the only problem?
The point is, resources cause problems, and the network is just one of many resources needed by the system. Other resources actually have more constraints on them than the network does. If a resource is constrained, it will impact availability in a highly-consistent model.
The author states that simply adding network redundancy would reduce partitions, and infrastructure problems are proof that this is very short-sighted. "You have bigger problems" - no kidding! Hence the weak-consistency model!
I think it's a fairly noncontroversial statement. Dealing with eventual consistency is always going to be more difficult and require more careful thought and preparation than immediately consistent systems.
How many programmers do you think are out there that have only ever worked on systems that use a single RDBMS instance, and what would happen if they tried to apply their techniques to a distributed, eventually consistent environment?
It's the same today, but worse. Most programmers still want a simple abstraction that lets them just build things. But now it's not just which sector on which disk, but also which server in which data center on which continent, while withstanding the larger number of failure modes at that scale.
When necessary, people can explicitly address that complexity. But it has a big cost, a high cognitive load.
It's another cute anecdote, but Google culture is full of these, always scant on details and always intended to show how big/smart/important/complex/indispensable their engineering is.
"Had to" is a strong term here, it's made to sound like USAF could not possibly have noticed some deviation they were likely to correct of their own accord as a matter of routine as they had been doing for the 20 years of the GPS project prior to Google being founded.
The reality is that drift and bad clocks are, and always have been, a feature of GPS: one explicitly designed for, one an entire staff exists to cope with. Designs depending on the absolute accuracy of a single clock have never been correct.
Is it really surprising that people who have extremely precise time needs and a whole team devoted to solving them would notice issues that other people wouldn't? I think it's a very common pattern that a product has some set of trailblazer users who find issues before the people who make the product.
Also, I think you're over-interpreting. "Had to" here only means that they noticed and reported the issue first because their system depended on GPS time being right. It doesn't preclude the possibility that the USAF would notice and fix the issue eventually, just with higher latency than Google wanted.
So the anecdote itself is noise. It's intended to show how seriously intractable a problem accurate time is, but it doesn't do that; instead it only demonstrates the OP's lack of familiarity with GPS and willingness to regurgitate corporate old wives' tales.
Most of these systems are designed to work if you lose GPS entirely, so they fail gracefully.
Planes won't actually fall out of the sky if GPS makes mistakes. That's y2k fearmongering.
Why is it hard to believe that a group using GPS for a unique purpose has unique needs and detects unique issues?
Could this be it?
Fun fact, GPS satellites use rubidium clocks instead of cesium clocks, and only maintain their accuracy thanks to yet another incredible feat of engineering.
1 microsecond is 300 meters of error.
According to the link posted higher up in the thread, in early 2016 they were all off by 13 microseconds for 12 hours, with no apparent consequences for anything ordinary people use GPS for such as location finding.
To triangulate, I think you need to know (1) where the satellites are, and (2) how far you are from each satellite. I think either absolute distance or relative distance works.
Getting both of these depends on knowing the time. That time comes from the satellites. Let's say they are all off by 1 us. Your time is derived from satellite time. That would mean the time you use to look up/calculate their positions will be off by 1 us from the correct time so you would get the wrong position for the satellites.
A quick Googling says the satellites orbital speed is 14000 km/hr, so using a time off by 1 us to look up/calculate satellite position would give you a position that is about 4 mm off.
The procedure for deriving the time from the satellites would get some extra error from this, but that should be limited to about the time it takes light to travel 4 mm, so we can ignore that. As a result your distance measurements between you and the satellites would be off by about 4 mm or less.
The key here is that when all the satellites have the same error, the time you derive has the same error, so your distance calculations should still work, and so you only get an error of about how far satellites move in an interval equal to the time error.
In summary, if all the satellites are off by 1 us, your triangulation seems like it would be about 4 mm more uncertain.
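Sanity-checking those numbers:

    c = 299_792_458                    # m/s
    orbital_speed = 14_000_000 / 3600  # m/s, ~3.9 km/s from the 14,000 km/h figure
    dt = 1e-6                          # 1 microsecond

    print(c * dt)                      # ~300 m: how far light travels in 1 us
    print(orbital_speed * dt * 1000)   # ~3.9 mm: how far a satellite moves in 1 us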
If only one satellite is off, it is going to depend on how the time algorithm works. If the algorithm ends up with a time much closer to the times of the correct satellites than to the off satellite, then when it calculates the distance from the triangulated position to the expected positions of the satellites and compares that to the measured distance, it should find that one is off by something on the order of the distance light travels in 1 us, while the others are all pretty close to where they should be. It should then be able to figure out that it has one unreliable satellite, drop it, and then get the right location.
I have no idea if they actually take those kinds of precautions, though.
The case that would really screw it up would be if several satellites are off, but by different amounts. With enough observation it should be possible in many cases to even straighten that out, but it might be too complicated or too time consuming to be practical. (This is assuming that the error is that the satellites are simply set to the wrong time, but that wrong time is ticking at the right rate).
If GPS timing is bad, a lot of people will notice that their position on the map is incorrect, because that's the whole purpose of the GPS network.
A 1 microsecond error is 300 meters.
While the speed-of-light propagation is about 300 meters in a microsecond, isn't the final position error possibly much greater? For calculating position on Earth, you can think about a sphere expanding at the speed of light from each satellite. The 1 microsecond error here corresponds to a radius 300m bigger or smaller, which only corresponds to 300m horizontal distance on the ground if the satellite is on the horizon (assuming that Earth is locally a flat plane for simplicity here). For a satellite directly overhead, the 300m error is a vertical distance. Calculating the difference in horizontal position from this error is then finding the length of a leg of a right triangle with other leg length D and hypotenuse length D+300m, where D is the orbital distance from the satellite (according to Wikipedia, 20180km). The final horizontal distance error is then sqrt((D+300)^2 - D^2), or about 110km.
Of course, this is just the effect of a 1us error in a single satellite, I'm sure there's ways to detect and compensate for these errors.
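For what it's worth, the geometry above checks out numerically (same flat-ground approximation, single satellite directly overhead):

    import math

    D = 20_180_000    # m, distance to an overhead satellite (Wikipedia's 20,180 km)
    err = 300         # m, range error from a 1 us clock error

    print(math.sqrt((D + err) ** 2 - D ** 2) / 1000)   # ~110 km along the ground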
As another sanity check, if the error for 1 us is 110 km, the error for 1 ns would be 110 m, and I suspect 1 ns error is not unusual for consumer electronics:
> To reduce this error level to the order of meters would require an atomic clock. However, not only is this impracticable for consumer GPS devices, the GPS satellites are only accurate to about 10 nano seconds (in which time a signal would travel 3m)
Right, I was basically calculating where that signal would just be reaching the surface at the same time it was 300m under you. This is a circle around you with a radius of ~110km (again using the approximation of the ground as a flat plane). Thinking about it more, there's not much reason to do this (GPS isn't really tied to the surface of the Earth, it gives you 3-D coordinates). I guess my point was that the 300m of distance from 1us of light propagation should not be thought of as a horizontal distance.
I am not so sure about that. The most common use of GPS is in satnav in cars. Satnavs typically show a map, and typically it is very easy to confirm your position on a map. Any inaccuracy by more than the usual few meters would be quickly noticed by the majority of GPS users.
* Air and sea port operators and navigators
* Military personnel running supply lines
* Military personnel on foot in operations and training
* River boats
* Fresh water fishermen
Out of all the possible users who would notice a 300m deviation just based on visual reconciliation, I personally would not say it would be so rare that the USAF would not find out very quickly. Of course, this is ignoring the equipment that would likely detect the issue way before somebody in the Army started phoning the USAF.
There's not so many GPS satellites out there that you're going to be bouncing around them all the time - even if only one is affected, it would be very noticeable for extended periods of time.
    24x7 coverage with a short time-to-repair costs at a minimum several million dollars per year.
What sort of challenges do you face? Do you use PTP grandmaster clocks, or something else? How many sites, and how many clocks per site? Are the support issues mostly hardware failures, configuration problems, or something else? Is 24/7 support needed because the equipment lacks failover support, or is the failover support unreliable or insufficient?
Google wasn't the only company that noticed it, and I have no idea if they discovered it before the USAF, but I can believe that someone from Google would phone up Schriever and ask WTF is going on.
Citation needed. There is a worldwide organization, led by the US Naval Observatory, that keeps constant watch on GPS satellite time performance. If Google noticed anything that USNO and the other participants didn't, that would need a paper or three published.
If two events are independent, it matters very little what order we record them in the system of record.
My whole career we have been building cause and effect at transaction time but when we debug we stare at log files and time stamps like we are reading tea leaves, trying to figure out what situation A led to corrupted data in row B.
Maybe there’s a way to record this stuff instead of time stamps? Something DVCS style. Or maybe it’s provably intractable.
Of course, I think I just like thinking about how true time will start to fail once we get beyond the earth. Consider, what is the true time for events we are seeing in the stars right now?
Granted, I fully cede that being able to rely on a fully sequenced notion of time that everyone is a part of makes some reasoning much easier.
If I understand correctly, if the size of this interval increases too much, it brings down the number of transactions per second that can be done, so keeping the spread small is key. Hence the atomic clocks, etc.
I believe vector clocks capture this semantic. But they have other trade-offs.
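A minimal sketch of the vector clock idea, for anyone who hasn't run into it: each node keeps a counter per node, and comparing two clocks tells you "happened before", "happened after", or "concurrent" (node names here are made up):

    def bump(clock, node):
        clock = dict(clock)
        clock[node] = clock.get(node, 0) + 1
        return clock

    def merge(local, received, node):
        merged = {k: max(local.get(k, 0), received.get(k, 0))
                  for k in set(local) | set(received)}
        return bump(merged, node)   # receiving a message is itself an event

    def happened_before(a, b):
        return a != b and all(a.get(k, 0) <= b.get(k, 0) for k in set(a) | set(b))

    a = bump({}, "A")              # event on node A
    b = merge({}, a, "B")          # node B receives A's state
    print(happened_before(a, b))   # True: a causally precedes b
    print(happened_before(b, a))   # False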
GR at least usually has this property as well. (It doesn't in the presence of closed timelike curves. It does in weak gravity and in the FLRW metric in cosmology. I'm not sure about the general strong-gravity case.)
We build abstractions because they're useful to us, not because they have some special meaning to the universe. Systems with simpler abstractions are easier to understand and therefore build on top of. Complex numbers, for example, have no direct physical meaning in the (non-quantum mechanical) universe but can still be extremely useful and are sometimes the only/best way to solve some classical problems.
If you want to use a database where a required step of querying it is specifying a reference frame that the ordering of events is relative to, feel free. "In fact, for any two spacelike separated events, it is possible to find a reference frame where you can reverse the order in which they happen." Personally I'll take a hard pass on bug reports like 'Foreign key constraint fails in reference frame 0.992c at 37.2Mm vector towards Alpha Centauri' which reads like the climax of Dante's Inferno for Systems Programmers.
We build abstractions because they are useful, but we also often build the wrong abstractions - this is the entire history of science - building better abstractions. Simpler abstractions are great, but they can limit you. We can build all kinds of wonderful machinery with just classical physics, but if you want the modern world with GPS etc. you need relativity to make it work. All the databases based on truetime are great and marvels of engineering, but they won't be able to scale to even a second planet (getting a GPS equivalent to work across two planets is orders of magnitude harder than the earth one, not to mention the latency).
I'm not condoning your idea of a database that explicitly uses frames, but rather something based on more physical foundations like cause-and-effect. As I mentioned, I think vector clocks satisfy this. But maybe there are other better alternatives.
I'm fully aware of the physics (I have a MPhys, and DPhil in Particle Physics from Oxford).
Cache headers can set expiration times or offsets, but the client and server both send their own time stamp in the message. So when the server says “expire this at noon” but your laptop’s clock is five minutes slow or three time zones away, you can still figure out what the server meant by noon because the server tells you what time it thinks it is now. The accuracy is limited by network delays but for the problem space getting the right answer to the nearest second is still pretty damned good.
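A sketch of that correction using Python's standard library (the header values are just examples): use the server's Date header to translate its Expires header into your own clock.

    from email.utils import parsedate_to_datetime
    from datetime import datetime, timezone

    def local_expiry(date_header, expires_header):
        server_now = parsedate_to_datetime(date_header)
        expires_at = parsedate_to_datetime(expires_header)
        skew = datetime.now(timezone.utc) - server_now   # local clock minus server clock
        return expires_at + skew                         # "noon, as the server meant it"

    print(local_expiry("Tue, 01 May 2018 11:55:00 GMT",
                       "Tue, 01 May 2018 12:00:00 GMT"))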
This is why PTP and high precision GPS devices are built to integrate with cell provider gear.
You can implement a toy version of TrueTime with about 60 lines of C which uses a single ntp_gettime call for each of the TrueTime api functions (now, before, after).
If AWS's NTPd service offers drift <= 200 us/s, you could use it for TrueTime.
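A Python sketch of the same toy idea (the hard part is a trustworthy error bound; ntp_gettime's maxerror or the drift figure above would feed EPSILON, which here is just an assumed constant):

    import time

    EPSILON = 0.005   # assumed worst-case clock error, in seconds

    def tt_now():
        t = time.time()
        return (t - EPSILON, t + EPSILON)   # [earliest, latest]

    def tt_after(ts):
        earliest, _ = tt_now()
        return earliest > ts    # true only if ts is definitely in the past

    def tt_before(ts):
        _, latest = tt_now()
        return latest < ts      # true only if ts is definitely in the future

    # Spanner-style commit wait: don't expose a timestamp until it's safely past
    def commit_wait(commit_ts):
        while not tt_after(commit_ts):
            time.sleep(EPSILON / 2)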
FDB doesn't have an actively developed SQL layer at the moment, so I guess you could say it isn't a "NewSQL" database, but none of the properties under discussion have much to do with the query language.
Here is some discussion about linearizabilty in fdb: https://news.ycombinator.com/item?id=16884882
I would love to be proven wrong, as more systems with strong consistency guarantees is better, but for now, I don't believe that foundation db provides stronger guarantees than serializable reads and writes.
Besides its extensive documentation, you can read its source code and run its deterministic simulation tests yourself if you are interested (it's Apache licensed). Skepticism on these points was reasonable when we originally launched it in 2012 but is getting a little silly in 2018.
I haven’t worked with NASDAQ stream directly, but knowing how fast equities tick I find this “10,000 orders/sec” estimate quite low.
Not to mention that 10ms delay in confirming an order would be really terrible.
> Able to consistently sustain an order rate of over 100,000 orders per second at sub-40 microsecond average latency
This post doesn't establish any "controversy" about Spanner's design decision. It only says that it requires special hardware, which other systems attempt to emulate despite not having this specialized hardware.
To call this decision "controversial" I think one would need to show that it has some significant problem in the environment it was designed for.
But you describe exactly the drawbacks of giving up time synchronization:
> The main downside of the first category is scalability. A server can process a fixed number of messages per second. If every transaction in the system participates in the same consensus protocol, the same set of servers vote on every transaction. Since voting requires communication, the number of votes per second is limited by the number of messages each server can handle. This limits the total amount of transactions per second that the system can handle.
How is "worse scalability" not a significant drawback?
This just sounds like an engineering tradeoff. I don't think engineering tradeoffs are the same as controversy. I get that your group's DB takes a different approach. But "blaming" Spanner for making a different trade-off doesn't come off well (I approached the article with an open mind).
> Calvin was able to achieve a throughput of over 500,000 transactions per second. For comparison, Amazon.com and NASDAQ likely process no more than 10,000 orders/trades per second even during peak workloads.
Maybe if you are only considering the small scope of just the order transactions, but they are actually writing way more data to their databases, such as logs, metrics, and current state for things like shopping carts. In my experience, what teams have done is split their data off into their own siloed databases to minimize scaling problems, but this becomes super painful when you want to join your data with others'. If Spanner can hold all of our data, scale, and handle joining across all of it, that sounds like a huge win.
Abadi is a very well known and, as far as I know, respected database researcher. No need to avoid his work.