> As we are now several hours into this outage and do not have satisfactory timeline for resolution, we have begun the process of migrating our hosts into another deployment zone within GCE. We will have a baseline set of services migrated within the hour and evaluate our ability to operate in a split deployment. Should we need to pursue a complete migration of hosts across zones then we would expect another 4-5 hours to return to full operational capacity.
Wait, their service isn't set up to operate in a split environment out of the box? I think it's time SaaS companies start documenting their IaaS setup so purchasers can do a high-level audit before they decide to use it for potentially a core part of their own product/service.
I imagine if one were a customer of this SaaS, it's on the customer to ask what availability to expect.
Clearly this vendor thought that their savings on their IaaS bill outweighed any operational or reputational risk they'd suffer from an outage at a lower layer (pun unintended).
Blake from Layer here: We are forthright with all our customers about our current deployment configuration and the roadmap timelines for evolving into a deployment with higher availability characteristics. There is real complexity in operating a system such as ours in a widely distributed configuration and like any other company at our stage we regularly assess risks and make trade-offs. Sometimes we get things wrong.
We are very sorry to all our customers for the downstream impacts on their businesses. We came up short and are doing everything we can to make it right.
I agree. If you're going to use abstracted infrastructure but you don't understand basic distributed architecture you shouldn't really be blaming your cloud provider.
Agreed, and to reply in the context of a critical comment someone further up noted (about never wanting to deal with any SaaS provider that required talking to a human) ... these are the kinds of reasons -- asking for architecture details or regulatory/security audits -- you NEED to be able to talk to humans, especially if you're operating in a regulated industry yourself or you're trying to sign an enterprise agreement with far-reaching consequences.
Hey, Blake from Layer here: we regularly undergo architecture and deployment reviews with our customers. We are fully transparent with the current deployment configuration and timelines for revisions.
Last night we lost a race to evolve our architecture and deployment ahead of a zone-level issue that affected our total operations. We are working on it in earnest, but there is very real complexity in operating a widely distributed real-time system.
I like how Algolia does that, https://www.algolia.com/infra (their blogposts and presentations go into much more detail)
Currently thinking of creating a similar page for getstream.io; at the moment we always explain it during sales/onboarding calls. (We replicate our data to 3 different instances across multiple AZs.)
Thanks for the link. We are taking a look at what Algolia has done here and will likely put together a public infrastructure overview page for Layer as well.