
In this event, the canary step correctly identified that the new configuration was unsafe. Crucially however, a second software bug in the management software did not propagate the canary step’s conclusion back to the push process, and thus the push system concluded that the new configuration was valid and began its progressive rollout.

Classic Two Generals. "No news is good news" generally isn't a good design philosophy for systems designed to detect trouble. How do we know that stealthy ninjas haven't assassinated our sentries? Well, we haven't heard anything wrong...

It may not be good design, but it might be necessary / practical design. If you have enough machines that some percentage of them are down or unreachable at any given time, you can't wait for a full go-ahead before proceeding; you'll never get a full go-ahead. So you're left with probabilistic solutions, and as T approaches infinity the probability of at least one false positive approaches 1.
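To put rough numbers on that last point, here is a minimal sketch. It assumes each check misfires independently with a fixed probability p, which is my simplification and not something the postmortem states; the function name and the value of p are made up for illustration.

```python
def prob_at_least_one_false_positive(p: float, trials: int) -> float:
    """Chance that at least one of `trials` independent checks misfires,
    assuming each misfires with the same probability p."""
    return 1.0 - (1.0 - p) ** trials


if __name__ == "__main__":
    # p = 1e-4 is an arbitrary illustrative rate, not a measured one.
    for trials in (1, 100, 10_000, 1_000_000):
        print(trials, prob_at_least_one_false_positive(1e-4, trials))
```

For any p greater than zero the result climbs toward 1 as the number of trials grows, so the design question is what happens when a check misfires, not whether it ever will.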

The whole point of the canary sub-population, though, is that 1) it's not your whole population, and 2) you want to find out empirically if something's wrong.

This was my exact thought. It would seem both feasible and reasonable to have a more active canary process, i.e.:

anycast "canary test in progress"

edge routers store new configs

anycast "canary test PASS"

edge routers activate new config

edge routers canary test new config (and pass or revert)

edge routers report home that all is well
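The steps above could be sketched roughly as follows. This is a toy in-process model, not how any real push system works: `EdgeRouter`, `Verdict`, `rollout`, and the `local_check` callback are all hypothetical names, and the anycast messages are reduced to plain method calls. The one property it tries to show is failing closed: a missing verdict is treated the same as FAIL.

```python
from enum import Enum
from typing import Callable, Dict, List, Optional


class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"


class EdgeRouter:
    """Hypothetical router that stages a config and activates it only on an explicit PASS."""

    def __init__(self, name: str, active_config: str,
                 local_check: Callable[[str], bool]) -> None:
        self.name = name
        self.active_config = active_config
        self.staged_config: Optional[str] = None
        self.local_check = local_check  # stand-in for the router's own canary test

    def on_canary_in_progress(self, new_config: str) -> None:
        # Store the new config but keep running the old one.
        self.staged_config = new_config

    def on_canary_verdict(self, verdict: Optional[Verdict]) -> bool:
        """Return True only if the router ends up healthy on the new config."""
        staged, self.staged_config = self.staged_config, None
        # Fail closed: no verdict (None) is handled exactly like FAIL.
        if verdict is not Verdict.PASS or staged is None:
            return False
        previous = self.active_config
        self.active_config = staged
        if not self.local_check(staged):
            self.active_config = previous  # local canary failed, so revert
            return False
        return True  # "all is well" report back home


def rollout(routers: List[EdgeRouter], new_config: str,
            canary_verdict: Optional[Verdict]) -> Dict[str, bool]:
    """Announce the test, deliver the verdict, and collect each router's report."""
    for router in routers:
        router.on_canary_in_progress(new_config)
    return {r.name: r.on_canary_verdict(canary_verdict) for r in routers}
```

With this shape, a lost or swallowed canary result leaves every router on its old config, which is the opposite of what the quoted postmortem describes.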
