I saw these guys talk at QCon. It was a fascinating talk, and an excellent example of SRE adaptability and unconventional innovation under unusual constraints.
Not speaking for them, just going from my memories of the talk and the following Q&A, but their reasons for this stack were primarily:
- They couldn't run in the cloud because connectivity to their sites is often terrible.
- They mostly ran IoT stuff from the k8s clusters--automated kitchen equipment like fryers and fridges, order tracking/status screens, building control systems, and metrics aggregation so they can see how businesses are doing.
- Because of the bad connectivity, a "fetch"/"push" (from the k8s clusters at the edge) model was needed for deployments/logging/administration/getting business data back up to the cloud.
- They explicitly did not process payments.
- k8s was used primarily for ease of deployment and providing a base layer of clustered reliability for pretty simple services. Since the boxes in the cluster were running in often-unventilated racks/closets full of junk in random restaurants, having that base layer was very important to them. Other solutions were evaluated and they chose k8s after consideration.
- Unlike typical IoT/automation setups here, they wanted to be able to experiment, monitor, and deploy software without the traditional industrial control practice of "take shit down, flash your controller (call a tech if you don't understand that), spin it up, and if it breaks you're down until we ship a new control unit or you manually fail over to a backup".
- However, they didn't want to fall into the IoT over-the-air update security pitfalls (it would really suck if someone hacked your fridge's temperature control system and gave a week's worth of customers salmonella). As a result they spent a ton of time making very good (and simultaneously very simple) deployment/update authorization and tracking tools. They chose the "pull" model and keying/security layers explicitly to avoid having to think about tons of open remote-access vectors and/or site hijacking.
- The k8s tooling (and some of their own) allowed easy, remote rollbacks to a "default/clean state" in case something went wrong. That was critical because downtime might compromise a restaurant, and having an automated "reset button" was important for ease-of-use by nontechnical, overworked site managers.
- The clustering allowed individual nodes to fail (which they will, because unreliable environments), and let people manually yank one with confidence.
- While, as some commenters pointed out, the leader (re)election system chosen might be unacceptably slow/randomized for, say, a cloud database, it is perfectly sufficient for failing over a control system in a restaurant. A few seconds of delay on an order tracking screen, or a system reboot/state-loss of in-flight orders, is vastly preferable to some split-brain situation making the restaurant accidentally cook 1.25x the correct number of sandwiches for hours, to go to waste.
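To make the "pull" model and the "reset button" from the list above a bit more concrete, here is a rough sketch of the idea. To be clear, this is not their implementation, and I don't know what their keying scheme looked like. Everything here (EdgeNode, sign, the manifest layout, the shared-key HMAC) is a made-up illustration. A real system would more likely use asymmetric signatures so the edge boxes never hold a signing key.

```python
import hashlib
import hmac
import json

# Hypothetical "default/clean state" an edge node can always fall back to.
DEFAULT_STATE = {"version": "default", "services": {}}


def sign(manifest: dict, key: bytes) -> str:
    """Sign a manifest with a shared key (an assumption for this sketch)."""
    payload = json.dumps(manifest, sort_keys=True).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


class EdgeNode:
    """Edge box that only ever pulls; it exposes no inbound deploy endpoint."""

    def __init__(self, key: bytes):
        self.key = key
        self.state = dict(DEFAULT_STATE)

    def pull(self, fetch) -> bool:
        """Try to fetch and apply a manifest. Returns True if it was applied."""
        try:
            manifest, signature = fetch()
        except ConnectionError:
            # Bad connectivity is the normal case: keep running what we have.
            return False
        if not hmac.compare_digest(sign(manifest, self.key), signature):
            # Unsigned or tampered manifest: ignore it, keep the current state.
            return False
        self.state = manifest
        return True

    def reset(self) -> None:
        """The 'reset button': go back to the known default state."""
        self.state = dict(DEFAULT_STATE)
```

The point of the shape is that the node initiates every exchange, so there is nothing listening at the restaurant for an attacker to push to, and a failed or rejected pull just leaves the site running whatever it already had.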
It's important to understand their use case: they needed to basically ship something with the reliability equivalent of a Comcast modem (totally nontechnical users unboxed it, plugged it in, turned it on, and their restaurant worked) to extremely poorly-provisioned spaces (not server rooms) in very unreliable network environments. For them, k8s is an (important) implementation detail. It lets them get close to the substrate-level reliability of a much more expensive industrial control system in their sites (with clustering/reset/making sure everything is containerized and therefore less likely to totally break a host), while also letting them deploy/iterate/manage/experiment with much more confidence and flexibility than such systems provide.
I think this is a great story of using new tools for a novel (or at least unusual) purpose, and getting big benefits from it.
Brian, Caleb: great talk, great writeup. Sorry HN is . . . being HN. Keep at it.
Edit: QCon talk summary is here: https://www.infoq.com/news/2017/07/iot-edge-compute-chick-fi.... If you have any employees/friends that went, they should have access to the video. It may be made public at some point, too.
I don't think HN was super vicious. They presented an out-of-the-box solution to a problem but they didn't define the problem fully. Based on what we saw, their solution seemed way overkill.
Glad to hear that there was a solid reason behind it, not just hype and recruiting buzz.