To be more specific, we redeploy servers from images instead of replacing just the code. The vast majority of our servers (99.9%) are preemptible (meaning they can be terminated at any time and will be within 24 hours), so we get a huge discount. Also, all inter-service communication goes through Pub/Sub, so the servers never talk to each other directly.
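To make the "huge discount" concrete, here's a back-of-envelope sketch. The ~70% discount is an assumption (GCE preemptible pricing has historically been roughly 70-80% off on-demand), and the fleet size and $0.05/hr on-demand rate are hypothetical round numbers, not figures from this thread:

```python
# Back-of-envelope: why a mostly-preemptible fleet cuts the bill so much.
# The 70% discount is an assumed figure; the hourly rate is a hypothetical
# round number, not an actual GCE price.

FLEET_SIZE = 25_000          # servers, as mentioned in the thread
ON_DEMAND_HOURLY = 0.05      # hypothetical per-instance rate, $/hr
PREEMPTIBLE_DISCOUNT = 0.70  # assumed discount vs. on-demand
HOURS_PER_MONTH = 730

def monthly_cost(hourly_rate: float) -> float:
    """Fleet-wide monthly cost at a given per-instance hourly rate."""
    return FLEET_SIZE * hourly_rate * HOURS_PER_MONTH

on_demand = monthly_cost(ON_DEMAND_HOURLY)
preemptible = monthly_cost(ON_DEMAND_HOURLY * (1 - PREEMPTIBLE_DISCOUNT))

print(f"on-demand:   ${on_demand:,.0f}/month")
print(f"preemptible: ${preemptible:,.0f}/month")
```

Under these assumptions the same fleet drops from roughly $900k to under $300k a month, which is why tolerating 24-hour instance lifetimes can be worth the architectural effort.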
Thanks to all these optimizations, we run only what we need at any given time, which keeps the cost low. It's still much larger than anything I ever expected, but I keep trying to convince myself that this is a sign of success.
- we're a very small team (13 people). We would never be able to manage this kind of infrastructure ourselves
- we're growing incredibly fast. Just 6 months ago we were running on ~300 servers. We grew so quickly that we are now working with GCE's infrastructure planners to plan our resource needs
- even though we're not suffering from massive bursts, we don't grow gradually but rather in massive jumps. Growing this quickly on bare metal would take us months; this way it takes days
- it is way more complicated to host in many geographic locations on your own, especially in Asia. We need to be present in many parts of the world, and with GCE/AWS/Azure it's very easy and convenient
- we did try to work with some bare metal hosters (like OVH), but they are not able to deliver servers on the schedule we need. They require much longer lead times and, obviously, much longer commitments
- the pricing is actually not that different once you run on preemptible (spot) instances. Bare metal would give us some performance boost, but the flexibility is worth every penny
As I've mentioned earlier, we're running an immutable infrastructure. That means when we need to change something, we replace the whole server. Each server runs only a single service, which allows us to run smaller instances, just in large quantities.
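A minimal sketch of what an immutable rollout looks like, with the fleet simulated in-process. In the real setup each object would be a GCE instance created from a baked machine image; the names and image tags here are hypothetical, not the poster's actual tooling:

```python
# Immutable-infrastructure rollout, simulated: servers are never
# mutated in place; a deploy deletes each server and recreates it
# from the newly baked image. Names and versions are hypothetical.

from dataclasses import dataclass

@dataclass
class Server:
    name: str
    image: str  # the baked machine image the server was created from

def roll_out(fleet: list[Server], new_image: str) -> list[Server]:
    """Replace every server outright instead of patching it."""
    replaced = []
    for server in fleet:
        # 1. delete the old server (no SSH, no config management run)
        # 2. create a fresh one from the new image under the same name
        replaced.append(Server(name=server.name, image=new_image))
    return replaced

fleet = [Server(f"worker-{i}", image="app-v41") for i in range(3)]
fleet = roll_out(fleet, new_image="app-v42")
print(sorted({s.image for s in fleet}))  # the whole fleet is on one image
```

The design point is that a server's configuration can never drift: whatever is in the image is what runs, so debugging "works on one host but not another" problems largely disappears.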
We actually did run on SoftLayer. It was a nightmare. There was consistently an outage somewhere; we couldn't count on any instance staying up. The performance was better, but you can't treat the infrastructure as cattle, and that was a huge limitation for us.
I would imagine that rolling a container out to thousands of hosts may take a while.
What kind of load and software do you run? In my experience, dramatically scaling out increases load and latency variance and causes all kinds of problems.
When was your experience with SoftLayer? Care to elaborate? I have some interest in them for future projects, and I'd rather hear about the issues now :D
This is the benefit of running on GCP. We don't have to deal with the headache of scaling either the images or the containers, thanks to the internal tooling GCP offers.
We quit SoftLayer around February. We were running bare metal, and they went down so often that we essentially ended up keeping one very large server for a rather insignificant service. We never gave them much of a chance, so I may be too harsh.
I stopped thinking about bare metal a while ago. Comparing with AWS, I see that your entire business relies on capabilities only available on Google Cloud (which, indirectly, is why I'm recommending Google Cloud nowadays: it's easier for basic usage, and it unlocks a whole new set of extreme usage).
Things you couldn't have pulled off that easily on AWS: creating thousands of hosts, deploying images that fast, managed Kubernetes, Pub/Sub, worldwide images (the AMI equivalent), multi-region subnets.
It's likely that bare metal is lower cost than spot instances; it is for me.
It's likely that your real issue is allocation of capital: you'd rather burn dollar bills on AWS month to month than tie up capital pre-paying for 18 months of servers.
I'm happy your company is doing great.
I agree GCP is a great competitor to AWS.
I think you're being closed-minded about building out a 20-60 cabinet data center to offload some of your workload.
It's hard to build out a DC with over 20 racks and still end up paying more than half of AWS's prices.
I understand why you would think so, because you don't know our setup. Just to give you an idea: we process 16 PB of data every single month, and all of it is ingress traffic. If we had to pay for that traffic going out (egress), we would end up paying over $1M for the traffic alone. By keeping everything in a single place, it costs us literally nothing.
That said, I've tried. I reached out to many hosters, OVH and others. They just don't have the capacity we need; 20 servers won't make a difference for us. We wanted to start with 500, but it would have taken them 9 months.
Your $1M estimate is 3X too high at 2c/GB.
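Both numbers in this exchange check out arithmetically. Here's the math, assuming a list-price egress rate of around $0.08/GB (GCP internet egress list pricing has been in the $0.08-$0.12/GB range; the exact rate is an assumption, while the 16 PB/month volume and the 2c/GB figure come from the thread itself):

```python
# Sanity-checking the thread's two egress figures: "over $1M/month"
# at list price vs. "3X too high at 2c/GB". The $0.08/GB list rate
# is an assumed ballpark, not a quoted price.

PB_IN_GB = 1_000_000           # decimal petabyte in gigabytes
monthly_gb = 16 * PB_IN_GB     # 16 PB of traffic per month

LIST_RATE = 0.08               # assumed list price, $/GB
NEGOTIATED_RATE = 0.02         # the 2c/GB figure from the reply

at_list = monthly_gb * LIST_RATE              # supports "over $1M"
at_negotiated = monthly_gb * NEGOTIATED_RATE  # roughly $1M / 3

print(f"at ~$0.08/GB list: ${at_list:,.0f}/month")
print(f"at 2c/GB:          ${at_negotiated:,.0f}/month")
```

So "over $1M" holds at list rates (~$1.28M), and at 2c/GB the bill would indeed be about a third of that (~$320k), which is where the "3X too high" comes from.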
Give me two hours and I'll save you $25k per month; give me 6 months, $300k/month.
Either you're growing and should be investing in cost efficiency, or you're running a staid lifestyle business. Which is fine, but if you aren't growing, get off the expensive cloud.
I'd say it's time to forget about the infrastructure and optimize the software instead. Rewrite it in C++ :D
I'd worry about the time it takes to reconfigure these servers with Ansible/Salt, or the time it takes to kill and rebuild them from a system image.
On a side note, the 4th competitor is IBM SoftLayer. It's different enough from the top 3 (Amazon, Google, Azure) to be a thing of its own.
src: doh's HN profile.
Get 1-2 more people and move off the cloud. Maybe not for everything, but enough to serve your base traffic. You can manage it if you spend a few days learning how, the same way you learned AWS/GCE.
I can't imagine why you guys would need 25,000 servers. Your Google bill must be at least $5 million a month. I could be wrong, but those numbers don't add up to me.
What instance type are you running?
As for instance types, we run almost everything, ranging from many thousands on the micro end to hundreds on the largest end.
25,000 * $200 = $5 million a month.
It truly differs minute by minute.