GCE has its quirks, for instance the inconsistency between the API and the UI, and it lacks the richness of the services offered by AWS, but everything GCE does offer is faster, more stable, and much more consistent.
One of the biggest problems with AWS is that once you outgrow the assigned limits, it becomes hell to get more resources from them. We're running on average around 25k servers a day, the majority of them preemptible (spot). AWS requires that you request the exact type and location for your instances; GCE only asks for a region and then overall resources (e.g. number of CPUs).
Also, the pricing is much less complicated: 1 core costs y, so 32 cores cost 32*y.
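That linearity is easy to sanity-check. A tiny sketch (the per-core rate below is a made-up placeholder, not an actual GCE price):

```python
# Linear per-core pricing: an n-core machine costs exactly n times one core.
# PRICE_PER_CORE_HOUR is a hypothetical placeholder, not a quoted GCE rate.
PRICE_PER_CORE_HOUR = 0.03  # dollars/hour, illustrative only

def hourly_cost(cores: int) -> float:
    """Cost of a machine with the given core count under linear pricing."""
    return cores * PRICE_PER_CORE_HOUR

# 32 cores cost exactly 32x one core -- no tier jumps or per-size pricing.
assert hourly_cost(32) == 32 * hourly_cost(1)
```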
Are you running 25k without using Spot Fleets? Spot Fleets let you specify a value per instance type, and then the total value you need ("value" could be CPU, memory, network or whatever). AWS will maintain the lowest-cost spot instances that fulfill your requirements.
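For anyone unfamiliar, a minimal sketch of what such a weighted request looks like. The instance types, weights, role ARN, and target are all illustrative; in practice a dict like this would be passed to boto3's `request_spot_fleet` as the `SpotFleetRequestConfig`:

```python
# Sketch of a Spot Fleet request where "value" is vCPUs: each launch spec
# declares how many capacity units one instance provides (WeightedCapacity),
# and AWS fills TargetCapacity with the cheapest mix. Values are illustrative.
import math

spot_fleet_config = {
    "TargetCapacity": 100,             # total units (here: vCPUs) we need
    "AllocationStrategy": "lowestPrice",
    "IamFleetRole": "arn:aws:iam::123456789012:role/fleet-role",  # placeholder
    "LaunchSpecifications": [
        {"InstanceType": "c4.xlarge",  "WeightedCapacity": 4.0},  # 4 vCPUs each
        {"InstanceType": "c4.2xlarge", "WeightedCapacity": 8.0},  # 8 vCPUs each
    ],
}

# If the fleet were filled entirely with c4.2xlarge, AWS would need
# ceil(100 / 8) = 13 instances (it rounds up so the target is met).
instances_needed = math.ceil(
    spot_fleet_config["TargetCapacity"]
    / spot_fleet_config["LaunchSpecifications"][1]["WeightedCapacity"]
)
print(instances_needed)  # 13
```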
(work at Google Cloud)
Is 25k the number of servers, or the cost of the servers?
To be more specific, we redeploy servers from images instead of replacing just the code. The majority of servers (99.9%) are preemptible (meaning they will be deleted every 24 hours), and thus we get a huge discount. Also, all our communication goes through pubsub, so the servers don't communicate directly with each other.
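For the curious, spinning up a preemptible worker from a prebaked image is a one-liner on GCE. A hedged sketch (the instance name, image family, project, zone, and machine type are placeholders, not the poster's actual setup):

```shell
# Create a preemptible VM from a prebaked image; GCE may reclaim it at any
# time and will terminate it within 24 hours, in exchange for a big discount.
gcloud compute instances create worker-0001 \
    --zone us-central1-b \
    --machine-type f1-micro \
    --image-family my-service-image \
    --image-project my-project \
    --preemptible
```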
Thanks to all these optimizations, we run only what we need at any given time, which keeps the cost low. It's still much larger than anything I've ever expected, but I'm still trying to convince myself that this is a sign of success.
- we're a very small team (13 people). We would never be able to manage this kind of infrastructure ourselves
- we're growing incredibly fast. Just 6 months ago we were running on ~300 servers. We grew so quickly that we are now working with GCE's infrastructure planners to plan our resource needs
- even though we're not suffering from massive bursts, we don't grow gradually but rather in massive jumps. On bare metal it would take us months to grow this quickly; this way we can grow in days
- it is way more complicated to host in many geographic locations, especially in Asia. We need to be present in many parts of the world, and with GCE/AWS/Azure it's very easy and convenient
- we did try to work with some bare metal hosters (like OVH), but they were not able to deliver the servers on the schedule we need. They require much longer lead times and, obviously, much longer commitments
- the pricing is actually not that different once you run on preemptible (spot) instances. Bare metal would give us some performance boost, but the freedom is worth every penny
As I've mentioned earlier, we're running an immutable infrastructure. That means once we need to change something we replace the whole server. Each server runs only one single service. That allows us to run smaller instances but in large quantities.
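One hedged sketch of what "replace the whole server" can look like on GCE: bake the new image into an instance template, then roll a managed instance group onto it. All names below are placeholders, not the poster's actual tooling:

```shell
# Build a new template pointing at the freshly baked image.
gcloud compute instance-templates create my-service-v2 \
    --image my-service-image-v2 \
    --image-project my-project \
    --machine-type f1-micro

# Replace every VM in the group with instances built from the new template --
# whole-server replacement rather than an in-place code deploy.
gcloud compute instance-groups managed rolling-action start-update my-service-group \
    --version template=my-service-v2 \
    --zone us-central1-b
```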
We actually did run on Softlayer. It was a nightmare. They consistently have some outage somewhere; we couldn't count on any instance to stay up. The performance was better, but you can't treat the infrastructure as cattle, and that was a huge limitation for us.
I would imagine that rolling a container out to thousands of hosts may take a while.
What kind of load and software do you run? In my experience, dramatically scaling out increases load and latency variance and causes all kinds of problems.
When was your experience with SoftLayer? Care to elaborate? I've got some interests in them for future projects. I'd rather hear about the issues now :D
This is the benefit of running on GCP. We don't have to trouble ourselves with the headache of scaling either images or containers, thanks to the internal tools offered by GCP.
We quit Softlayer around February. We were running bare metal, and they went down so often that we essentially ended up keeping one super large server for a very insignificant service. We never gave them too much of a chance, so I may be too harsh.
I stopped thinking about bare metal a while ago. I'm thinking about the comparison with AWS, and I see that your entire business relies on capabilities only available on Google Cloud (which, indirectly, is why I'm advising Google Cloud nowadays: it's easier for basic usage, and it unlocks a whole new set of extreme usage).
Things you couldn't have pulled off on AWS that easily: creating thousands of hosts, deploying images that fast, managed Kubernetes, pubsub, worldwide AMIs, multi-region subnets.
It's likely that bare metal is lower cost than spot - it is for me.
It's likely that your issue is allocation of capital - you'd rather burn dollar bills on AWS than pre-pay for 18 months of servers with capital.
I'm happy your company is doing great.
I agree GCP is a great competitor to AWS.
I think you're being closed-minded about building out a 20-60 cabinet data center to offload some of your workload.
It's hard to do a DC with over 20 racks and pay more than 1/2 of AWS's prices.
I understand why you would think so, because you don't know our setup. Just to give you an idea, we process 16PB of data every single month - and that's ingress traffic. If we had to pay for this traffic going out (egress), we would end up paying over $1M for the traffic alone. By keeping everything in a single spot, it costs us literally $0.
That said, I've tried. I reached out to many hosters, like OVH and others. They just don't have the capacity we need. 20 servers will not make a difference for us. We wanted to start with 500, but it would have taken them 9 months.
Your $1M estimate is 3X too high at 2c/GB.
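The 2c/GB check works out roughly like this (using decimal units, 1 PB = 1,000,000 GB):

```python
# Monthly egress cost for 16 PB of traffic at $0.02/GB.
PB_IN_GB = 1_000_000                 # decimal petabyte
traffic_gb = 16 * PB_IN_GB           # 16 PB/month
cost = traffic_gb * 0.02             # $0.02 per GB
print(f"${cost:,.0f}")               # $320,000 -- roughly a third of $1M
```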
In two hours I'll save you $25k per month. In 6 months $300k/month.
Either you are growing and should be investing in cost efficiency or you run a staid lifestyle business. Which is fine, but if you aren't growing, get off expensive cloud.
I'd say it's time to forget about the infrastructure. You should optimize the software and rewrite it in C++ :D
I'd worry about the time it takes to reconfigure these servers with ansible/salt, or the time it takes to kill/rebuild them from a system image.
On a side note, the 4th competitor is IBM SoftLayer. It's different enough from the top 3 (Amazon, Google, Azure) to be a thing of its own.
src: doh's HN profile.
Get 1-2 more people and move off the cloud. Maybe not for everything but to service your base traffic. You can manage it if you spend a few days learning how, in the same way you learned AWS/GCE.
I can't imagine why you guys would need 25,000 servers. Your Google bill at a minimum must be over 5 million a month. I could be wrong, but those numbers don't add up to me.
What instance type are you running?
As instance types go, we have almost everything, ranging from many thousands on the micro side to hundreds on the largest side.
25,000 * $200 = 5 million a month.
It truly differs minute by minute.
Last I checked, AWS has hourly billing and Google Cloud has sub-hour billing.
If they are really bursty, they are in trouble with AWS's pricing model.
> the number where it makes more economic sense to build out your own data center.
I would think that this number doesn't exist... unless you have already built data centers for yourself and you have major internal expertise in that.
I am not sure what he is doing at 25k servers but you can see real world math puts this in a serious money range. Data centers can be built for less.
I wonder if the down voters on the hardware comments actually run services at scale.
You didn't see that coming, did you? :D
That gives you 25k CPU + 83TB of memory. (Of course, that's just to give a figure, they probably don't use single core instances).
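The implied average spec is easy to back out from those figures (pure arithmetic, decimal units):

```python
# Average memory per instance implied by 25k instances and 83 TB of RAM.
instances = 25_000
total_memory_tb = 83
gb_per_instance = total_memory_tb * 1_000 / instances  # decimal TB -> GB
print(f"{gb_per_instance:.2f} GB/instance")  # 3.32 GB/instance,
# which is in the same ballpark as an n1-standard-1's 3.75 GB
```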
Forget everything you know about AWS pricing; it's overpriced. Google is half the cost of AWS on average (down to one quarter for special cases).
I'm not sure what you mean by building your own datacenters (really, buy some land and build from scratch?). Just having a bunch of dudes in a few locations worldwide is gonna be a multi-million dollar project. And I'm not even talking about the building, power, cooling, servers, storage, network. That's a hell of an undertaking.
We too are hockey sticking but not ready to go to metal .... yet.
Your reasons make sense. I would grow that team soon :)
The i2 comparisons are under "local SSD and scaling up". Google has local 400 GB SSDs that can be attached to any instance. That's a lot more flexible and a hell of a lot cheaper when you have specific needs.
I grew to envy you as I read. It looks like interesting work.