The capacity is there, might as well use it.
That said, if you didn't have said rack, I'm not so sure it would be worth it to even make a purchase order. Sure, things outside of your control may break when you are using a cloud. But guess what: things outside your control will also break on-prem, particularly hardware and network connectivity. There is no way your networking can be better than, say, GCP's own networking, or that you can deploy redundant workloads across availability zones (or even regions!) yourself.
By the time a purchase order for a new server can arrive, we can have a production-ready system running, with redundancy across availability zones, automatic failover, CDNs, backups, the works.
Basically, I don't care if someone knocks out power in my block, if someone cuts a network cable, or even if a machine goes up in flames.
One thing I would say is: even if you are very happy with your current setup, if you have some time to automate a similar setup on the cloud (keyword: automate), then I would suggest doing just that, and offload backups to the cloud too. Even if only as a business continuity thing.
My business is mainly deep learning R&D. Current cloud GPU, networking, and storage pricing gives me ulcers given my compute needs and the size of my datasets.
I do run my website in the cloud, with redundancy and all that. I also use cloud storage for backups and for a K8s registry. If I were selling e.g. inference services, I'd be running them in the cloud too (passing the costs on to the clients). Most of my local workloads could easily be shifted right across to any decent K8s provider.
But the fact is, my lone rack has been humming along with zero unscheduled downtime for 3 years now. I can count several global outages in each of the three major cloud services in this timeframe (most of them during US work hours, BTW), so I'm inarguably better off with the setup I have now than I would be if I moved it all to the cloud. Not to mention it already paid for itself several times over even though I burn through several hundred dollars a month in electricity.
Also, if you're sourcing components yourself or from a third party that isn't shipping specific configurations that have been tested hundreds or even hundreds of thousands of times in other installations (i.e. not Dell), then be prepared for the possibility that you'll run into weird errors from the specific combination of mainboard, CPU, memory, NIC, RAID, etc. used in a system, which can be extremely painful and time-consuming to diagnose for RMA. I mean, if you're lucky, that will exhibit itself as a crash that happens early and is easy to reproduce.
At this point, I see the premium you pay to large name providers like Dell/HP/etc. not just as insurance that they will quickly accept and replace any possibly faulty hardware, but also as the fee they charge for ironing out all the kinks of a platform so you don't have to worry about them for the most part.
It's totally possible to build your own hardware for cheap, but I no longer have the will to do so. To me, cloud services are like that, just to a lesser degree. I can handle the hardware locally (even if not custom built), but I'm finding less of a benefit to doing so as I get better at dynamically handling load (which has benefits in areas besides cost as well).
I have just checked the pricing of what my approximate (GPUs are different in my case) setup would cost to run on GCP, and it totals up to $40K a month. Not a typo, I did not add any extra zeros there. That's not including storage or networking costs, and _including_ the sustained use discount. That's nuts.
> it totals up to $40K a month
Yeah, it sounds like your use case is bad for cloud (not that I'm an expert), especially if you're able to get it all in one rack. The cloud is a lot more useful if you have different layers that may need to grow at different speeds: for example, a front-end load-balancing set of systems, a group of web servers that aggregate back-end services for page display, different service clusters, and a set of databases (read-only replicas, master-master replication, etc.). Being able to scale some systems vertically (just allocate more RAM and/or CPU) and others horizontally (just add more servers to a load-balanced set) can be extremely useful, especially when the unexpected hits and you find your load is suddenly multiple times the normal amount and you need to provision now.
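To illustrate the horizontal half: this is roughly the proportional rule that Kubernetes' HPA applies when it scales out (a toy sketch, not the real controller):

    import math

    # desired = ceil(current * observed / target), clamped to at least 1
    def desired_replicas(current: int, observed_load: float, target_load: float) -> int:
        return max(1, math.ceil(current * observed_load / target_load))

    # e.g. 4 web servers seeing 300 req/s each against a 100 req/s target -> 12
    assert desired_replicas(4, 300.0, 100.0) == 12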
The more secure a system is designed to be, the more likely it is to treat unusual conditions as an attack and possibly perform some destructive action to thwart the assumed attacker. Think of phones configured to delete all data after X incorrect password attempts, HSMs with anti-tamper switches, etc.
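A toy version of that trade-off (the threshold and the wipe are made up for illustration, not any real device's logic):

    MAX_FAILURES = 10  # illustrative threshold

    class SelfDestructingStore:
        """Wipes its secret after too many bad unlock attempts."""

        def __init__(self, passcode: str, data: bytes):
            self._passcode = passcode
            self._data: bytes | None = data
            self._failures = 0

        def unlock(self, attempt: str) -> bytes | None:
            if self._data is not None and attempt == self._passcode:
                self._failures = 0
                return self._data
            self._failures += 1
            if self._failures >= MAX_FAILURES:
                # the "destructive action to thwart the assumed attacker",
                # which also locks out a legitimate-but-forgetful owner
                self._data = None
            return None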
I’ve always enjoyed this quote, but my problem with [the description of] this outage is the third-party dependency.
Packets can’t get from your cloud provider to downstream users of CenturyLink? That’s fair.
Your cloud provider can’t send packets to/from CenturyLink, so they nuke your database? I literally don’t understand.
Is the service described actually a third-party service that’s been white boxed? (I mean this in the most honest way possible. I do not understand the details, and I found the article surprising).
The root cause of nearly all of these screwups is that large, complex systems can't be fully understood or observed, and that a good chunk of knowledge about such systems is institutional, rather than explicit. So from time to time people _will_ make assumptions that don't match reality, and reality will punish them for it. Which is what, I strongly suspect, happened in this case.
It reminds me of a friend who wonders why his parents don't just install Ubuntu, because Windows is so awful.
Don't get me wrong, I use cloud (GCP, if you must know) too, and if my business grew massively, I'd probably use it more. But frankly I'm more satisfied with my own "on prem" solution. Single rack which basically pays for itself every 3 months or so in cloud costs, what's not to like?
Yes a major outage that’s no fault of my own could take down my site - and that isn’t great. On the other hand, wouldn’t you trust them more than yourself? Just like you trust a bank more than yourself to keep your money safe? I’m sure the average major cloud service has far better uptime than the average on prem solution. At what point does it become hubris to spin up your own on prem solution?
>> versus using a bank
If my bank charged 25% of my balance every month, I'd very quickly choose the mattress. :-) But I do recognize that my needs are pretty uncommon.
Google Cloud is fair competition – provided they have the service you need. AWS and Azure both beat them on number of services. If Google has it, it should behave as expected, and some of their services are downright impressive (GKE, and VM auto migration on GCE).
Azure is... infuriating. Inconsistent, unreliable APIs, surprising behavior everywhere (attach an internal load balancer, lose internet connectivity!?), lots of restrictions on which features can be used with which SKUs.
I see improvements, and it is difficult to beat them in the enterprise, but speaking as an engineer: man, Azure is infuriating.
Ah, yes, the "TCP and UDP egress work unless you define an _ingress_ load balancing rule for either protocol, at which point the other protocol breaks until you create a dummy rule for the other protocol that will trigger your security team to send you tickets every few months."
We're doing our best, but we're not going to suggest there's not more to do. Every major cloud provider has had issues at one point or another (I formerly worked at Amazon and Google), and I'll just say - we hear you, and we are fiercely committed to earning your trust.
First, I want to say: thank you for your service, and keep it up. There is a lot to be done, but I see progress.
Secondly, please get your teams together and start communicating. We encounter a lot of issues with things that should just work or should be much simpler. Sometimes we contact support and just get handed from team to team without anything actually getting resolved.
Third, please, oh please, get your SDKs (especially the Python one) fixed. It seems like every new build breaks something; sometimes even the same version behaves differently across installations, since there is a lot of variable versioning done under the hood...
Sometimes I get the feeling a lot of things are "leaking" towards the customer. Wanna change an instance type? You get an "instance not available in cluster" error, or something similarly undocumented. Wanna copy a snapshot between regions? Good luck with that, and hope you've got some retry logic and a hell of a timeout.
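For now we wrap everything in something like this (copy_fn and poll_fn are stand-ins for whatever SDK call is flaking that week, not real API names):

    import time

    def copy_with_retry(copy_fn, poll_fn, max_wait=3600, base_delay=5):
        # Kick off the operation, then poll with exponential backoff
        # under one generous overall deadline.
        deadline = time.monotonic() + max_wait
        delay = base_delay
        op = copy_fn()
        while time.monotonic() < deadline:
            status = poll_fn(op)
            if status == "DONE":
                return op
            if status == "FAILED":
                op = copy_fn()  # restart the copy from scratch
            time.sleep(delay)
            delay = min(delay * 2, 300)  # cap the backoff at 5 minutes
        raise TimeoutError("operation did not complete within max_wait")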
Keep pushin'! :)
It is a problem because a lot of the good documentation about overall usage or architecture is, for some reason, in this blog format, but I can see it just being written off because "it's just a blog".
They also leave a lot out of the true documentation, such as the Landing Zone stuff (in fact they hide this one entirely) and the Application Discovery Service (they make no mention of how the agents work). There are plenty of other minutiae that I have had to contact our rep. to get from the product team.
The most common or basic stuff has some great documentation. But once you stray into the "weird" it isn't uncommon to see 2 or 3 line documentation pages about something.
Wait, what? S3 is designed for 11 nines of durability. That's on the order of one lost object per year for every hundred billion objects stored.
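Back-of-the-envelope, assuming the usual reading of 11 nines as annual per-object durability:

    objects = 1e9                   # say you store a billion objects
    loss_rate = 1 - 0.99999999999   # 11 nines -> ~1e-11 per object per year
    print(objects * loss_rate)      # ~0.01 objects/year, i.e. one loss
                                    # every ~100 years at this scale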
It seems that once you add a load balancer, all traffic gets funneled through it, whether it was addressed to it or not. Which is unlike any other load balancer I have ever seen.
Coming from other clouds, this was a shock.
The only thing comparable is AWS's NLB. Because that load balancer is so transparent, clients appear to be connecting directly, with the original source IP. That caused issues when I wanted to deploy my own Elasticsearch and use an internal NLB for master discovery (whenever a request got routed back to the same machine, the kernel discarded the packets). But there you can at least switch to another load balancer.
There is an excellent Network Academy video on YouTube that covers some of the internal implementation of these layer-4 load balancers (in the context of HA for NVAs).
Interestingly, that's my experience on Azure. Seems like a lot of offerings were smashed together in order to try to form a coherent product.
Seems like that would apply to many, many web services and products in the last decade.
Azure's AKS is way better as a whole, but until recently it wasn't integrated with their own scale sets. How does that even happen?
To be fair to Azure, AKS is an example of how to roll out features. All the knobs are there (even more so than GKE; you can even set cluster IP ranges), and it just works, no silly surprises. At least, I haven't found surprising behavior yet.
To be more specific...
Disclaimer: I've built distributed infra on AWS, Azure, and GCP, as well as on premise. There's a reason I'm using GCP; I'm not stating that reason below. Instead, these are a few of the reasons I don't like it. In summary, I believe they focus on too many things with too little depth. That's not unique to GCP, but it is rather new to GCP, IMO.
Take Endpoints, for example. You get a nice feature that documents your API based on the code-level docs. To update it you also get an API, but it only works if you first do one manual sync from the console (https://cloud.google.com/endpoints/docs/grpc/dev-portal-sync...).
Then you discover that your documentation starts disappearing, and you debug for a while just to figure out that the last refactoring pushed the total length of your service name, RPC name, and parameters over 80 characters, and docs simply don't show up when that's the case.
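Until that's documented, you end up guarding for it yourself. A hypothetical pre-deploy check (the 80-character figure is just what I observed, nothing official):

    # Made-up guard for the undocumented length limit we ran into
    def doc_will_render(service: str, rpc: str, params: list[str]) -> bool:
        signature = f"{service}.{rpc}({', '.join(params)})"
        return len(signature) <= 80

    # e.g. flag offenders in CI before the docs silently vanish
    assert doc_will_render("inventory.v1.Items", "ListItems", ["page_size", "page_token"])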
Then you use the tracing capabilities, only to discover that the traces don't propagate across services; there's something that ESP (their nginx proxy) doesn't do. You take a look at it, try to build it, but discover it uses a Bazel version that is two years old.
Then you look at quotas/throttling. First you'll notice that the examples don't work; you just get errors (and apparently they also don't get fixed after sending feedback): https://cloud.google.com/endpoints/docs/grpc/quotas-configur.... You look at the example and notice it's been copy-paste-modified from some JSON, and the field names are incorrect.
Then you see that throttling only works with API keys, but they (and everyone else) recommend against API keys in favor of IAM service accounts. Except that a bunch of features won't work unless you use API keys. So you use API keys, and then discover you can't provision them, because there's no API; you have to, again, do it manually through the console. You talk to support and they'll recommend you use IAM service accounts because they are much better, although you won't be able to use the API-key-specific features (https://cloud.google.com/endpoints/docs/grpc/when-why-api-ke...).
If you take a look at service accounts and IAM, you discover new things... I won't go into details here, but let's just say figuring out whether a policy should work in a particular case is more art than science.
Then in GKE you want to enable TLS on your gRPC service. It should be easy: https://cloud.google.com/endpoints/docs/grpc/enabling-ssl. It's just that it doesn't work, or rather, different documents say both that it works and that it doesn't. Take a look at https://github.com/kubernetes/ingress-gce/issues/18. From certificate provisioning to a functional service, it took roughly 5 days involving tcpdump, reading ESP source code, all the logs, raising bugs on GitHub, etc.
Then see Service Catalog (https://cloud.google.com/kubernetes-engine/docs/how-to/add-o...). You try to download from the link they provide and it throws an error. You debug, maybe talk to support, and they'll tell you that there was a patch that fixes it. Except it hasn't been released since April 2018...
Then one day your GKE pod won't deploy and you see no errors. It takes a while to realize that if a port name is over a certain number of characters, it won't work.
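If it's the limit I think it is, Kubernetes caps named container ports at 15 characters (the IANA_SVC_NAME rule), and a toy pre-deploy lint would catch it:

    # Flags port names over Kubernetes' 15-char IANA_SVC_NAME limit
    def long_port_names(pod_spec: dict) -> list[str]:
        return [
            port["name"]
            for container in pod_spec.get("containers", [])
            for port in container.get("ports", [])
            if len(port.get("name", "")) > 15
        ]

    spec = {"containers": [{"ports": [{"name": "grpc-healthchecking", "containerPort": 8080}]}]}
    print(long_port_names(spec))  # ['grpc-healthchecking'] -- 19 chars, too long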
There are many of these and I could keep going :)
I quit and slithered off to work with SQL Server 2000 and I’m sure my soul slipped back in one night.
edit: Mostly I've heard it just generally is not great, plus you have to deal with typical Oracle badness.
I got a VM for 30 days with 16 gigabytes (!) of RAM and it didn't cost me a penny. In fact, the machine continued working for a few months after the trial ended, which I thought was very generous. They just blocked port 80 and 443 inbound.
They even have a high-touch sales process where a real person sends you emails and replies when you send them questions. Imagine that at AWS!
I was actually tempted to keep paying for the VM from Oracle, but the high-touch sales person literally could not tell me how much it was going to cost. She could point me towards the price calculator page, but the VM type I had been given in my free trial did not exist in their price calculator. I asked for just a simple number: how much is this going to cost me per month? She could not give me an answer. So I didn't buy it. I still have no idea whether it would have been a good deal or not.
You haven’t had the joy of having a business support plan from AWS. You can start a chat and get immediate help. They will do a screen share with you if necessary. I have had 100% success with them. This is everything from CloudFormation issues, S3 cross-account permissions with CodeBuild/CodePipeline, API Gateway/Route 53 setup issues, IAM permissions to read and write from S3 to Aurora/Redshift, a best-practices question about DynamoDB/DAX with Python, and a Route 53/security group/autoscaling group/load balancer issue.
All of these issues were because of my lack of knowledge of the platform.
They do that too once you are big enough.
I only hear bad things about AWS and Google Cloud, and I hear nothing much about Azure.
My anecdotal experience: I spent a couple of weeks (!) setting up our environment (Bitbucket, Django, Ubuntu, Dockerized) on Azure App Service and Azure Pipelines. Their documentation was incomplete, out-of-date and MS support staff struggled to help if you didn't have a Windows machine (their RDP software doesn't support Linux, Skype for Business doesn't support Linux and normal Skype for Linux doesn't support screen sharing).
Little things like trying to SSH into a machine so that you can execute commands on your Docker container (for, say, database migrations, or to check logs) are almost impossible. If it weren't for the help of a lot of people in #docker on Freenode, I would probably still be working on it.
I had to use Google Hangouts with a Microsoft support person's personal gmail account, while he was connected over VPN (since he was based in Shanghai), so I could show my issue. The support person was extremely pleasant to deal with and understanding, though, and he went above and beyond to help get my issue resolved even though it turned out to not be from his department.
However, after getting set up, I noticed I was getting 12-second (!) responses from an API I had written just to retrieve a logged-in user's first name, last name and email in JSON. This API resolves locally in 20ms, including layers of authentication.
This turned out to be a known issue when running a managed "Azure Database for PostgreSQL" service and was common on MS support forums.
After reaching out to Microsoft support for Azure Database for PostgreSQL, their response was this, copy-and-pasted:
> As you are currently using Basic Tier (2 vCores, 51200MB), the bad performance is expected.
> When comparing with the performance in your VM, the on-prem is supposed to be better than cloud even within the same hardware environment.
> Please give it a test in higher tier and configure it with a compatible settings compared with your VM. In the meanwhile, you can monitor the slow queries via Query Performance Insight to find out what queries were running at a long time when those API were called.
> Pricing tier information can be found at https://docs.microsoft.com/en-us/azure/postgresql/concepts-p... .
...they tried to upsell me on the higher tier database 3 times in that email chain, believing that this level of performance was acceptable for my database tier.
Of course the next tier up from the $60/month that I was on was $160/month, and since we only have maybe two concurrent users at most it didn't make sense to triple our costs just to avoid 12 second database calls.
I moved the entire service to AWS last week. The set up was painless and swift. Using equivalently priced services, the API now resolves in 50ms.
I don't think I'll ever go back. Not even for free.
-Rachel, from Azure Database for PostgreSQL
I don't believe your reasons because I had two Azure Database for PostgreSQL instances in two different locations (yes, I moved it across as part of troubleshooting) yet experienced the same levels of performance.
The product that you charge ~$60 a month for is fundamentally broken. It is unusable for any application.
If you are genuinely sorry, I'm happy to give you my company's contact details so you can refund or credit the last few months of what we paid for your service. Otherwise, it's this type of empty response and lack of responsibility (or of desire to correct the problems, or to at least correct your marketing material so it sets expectations about latency) that led me to instruct my management and ultimately lead the move over to AWS.
To be clear: it was ultimately only a 5-minute loss (and the fact that the DNS outage was simultaneous probably meant there wasn't much data being stored anyway) because they had a regular snapshot facility. So defense in depth saved them.
Still, yikes. That's a pretty disastrous bug.
By what logic is this NOT a terrible idea?
So that's why it's a great idea to implement data deletion as a two-phase sequence of synchronous key deletion, then asynchronous low-priority block scrubbing (or marking free for reclamation).
But not handling the case where your system is confused about whether the keys are deleted (versus just temporarily unavailable) is less of a great idea.
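A toy sketch of what I mean; all names are made up, and the point is the guard in the scrubber:

    import queue
    from enum import Enum

    class KeyState(Enum):
        LIVE = 1
        DELETED = 2  # key gone from the index; blocks awaiting scrub

    scrub_queue = queue.Queue()  # low-priority background reclamation

    def delete_object(key_store: dict, key: str) -> None:
        key_store[key] = KeyState.DELETED  # phase 1: synchronous key deletion
        scrub_queue.put(key)               # phase 2: deferred block scrubbing

    def scrub_once(key_store: dict, blocks: dict) -> None:
        key = scrub_queue.get()
        # In a real system this lookup can fail because the key store is
        # unreachable. The crucial guard: "can't confirm deleted" != "deleted".
        if key_store.get(key) is not KeyState.DELETED:
            scrub_queue.put(key)  # requeue; try again later
            return
        blocks.pop(key, None)     # only now is the data actually destroyed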
...at the heat death of the universe
Yeah, they come with a firewall, but still. Imagine competing with everyone else over a single namespace.
At least for S3 buckets it's justified, because those are meant to be publicly accessible, but for databases?
Not names: those are the limited, conflicting resource, and they aren't arbitrarily nested:
"The server can contain only lowercase letters, numbers, and the hyphen (-) character" - "The domain name postgres.database.azure.com is appended to the server name you provide."
It also means I cannot just create parallel environments on their own subnets with the same scripts; I have to enter and manage configuration points for all the names, either as DNS aliases or straight up in the application connection strings. And it makes it unnecessarily harder to just migrate the stuff around as needed to a failover node.
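The least-bad workaround I've found is to bake the environment into the (globally unique) server name and derive every connection string from it. A sketch, with made-up names:

    # Server names are global across Azure, so the environment lives in the name
    def pg_host(app: str, env: str) -> str:
        return f"{app}-{env}.postgres.database.azure.com"

    def pg_dsn(app: str, env: str, user: str, db: str) -> str:
        # Azure Postgres logins are user@servername, %40-encoded in a URL
        server = f"{app}-{env}"
        return f"postgresql://{user}%40{server}@{pg_host(app, env)}:5432/{db}"

    print(pg_dsn("myapp", "staging", "admin", "appdb"))
    # postgresql://admin%40myapp-staging@myapp-staging.postgres.database.azure.com:5432/appdb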
I.e., RDS's advantage was supposed to be zero-downtime upgrades too, and then they sent everyone a mail last week saying "we're going to have 5 minutes of downtime to upgrade your DB, whether it's multi-zone with replicas or not"?
You can't trust them with that; the cloud is just somebody else's computers. You need your own failover on top, and instead of changing one entry in the namespace, I have to redeploy all clients. Oh joy.
99.9%: 43m 49.7s of downtime per month
99.99%: 4m 23.0s per month
Sounds like they need to cough up some money for their four 9s customers...
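For reference, those figures are the monthly budgets (assuming a 30.44-day average month):

    minutes_per_month = 30.44 * 24 * 60        # ~43,834 minutes on average
    for availability in (0.999, 0.9999):
        budget = minutes_per_month * (1 - availability)
        print(f"{availability:.2%}: {budget:.1f} min")
    # 99.90%: 43.8 min
    # 99.99%: 4.4 min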