
You should have moved to bare metal months/years ago. Things like making your servers preemptible is like putting a bandaid on a gaping bullet hole.

There are many reasons why bare metal doesn't make much sense for us, at least at the moment. Here are a couple:

- we're a very small team (13 people). We would never be able to manage that kind of infrastructure

- we're growing incredibly fast. Just 6 months ago we were running on ~300 servers. We grew so quickly that we are now working with GCE's infrastructure planners to plan our resource needs

- even though we're not suffering from massive bursts, we don't grow gradually but rather in massive jumps. Growing this quickly on bare metal would take us months; on GCE it takes days

- it is way more complicated to host in many geographic locations, especially in Asia. We need to be present in many parts of the world, and with GCE/AWS/Azure it's very easy and convenient

- we did try to work with some bare metal hosters (like OVH), but they are not able to deliver servers on the schedule we need. They require much longer lead times and obviously much longer commitments

- the pricing is actually not that different once you run on preemptible (spot) instances. Bare metal would give us some performance boost, but the freedom is worth every penny.

@user5994461 can't reply to your comment for some reason, but you're correct. Ansible would never scale. We ditched it a long time ago. We're using Packer to create images/Docker containers and then distribute them automatically via Kubernetes or directly through custom images.

As I've mentioned earlier, we're running immutable infrastructure. That means once we need to change something, we replace the whole server. Each server runs only a single service, which allows us to run smaller instances, but in large quantities.
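For the curious, an immutable-image build along these lines might look roughly like the following Packer template using the googlecompute builder. This is a hedged sketch, not their actual config: the project ID, image names, and the baked-in docker pull are all hypothetical.

```json
{
  "builders": [{
    "type": "googlecompute",
    "project_id": "my-project",
    "source_image_family": "debian-11",
    "zone": "us-central1-a",
    "image_name": "myservice-{{timestamp}}",
    "ssh_username": "packer"
  }],
  "provisioners": [{
    "type": "shell",
    "inline": [
      "docker pull gcr.io/my-project/myservice:latest"
    ]
  }]
}
```

The idea is that every change produces a fresh timestamped image; to roll out, you replace servers with new ones booted from the new image rather than mutating running hosts.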

We actually did run on SoftLayer. It was a nightmare. They consistently have some outage somewhere; we couldn't count on any instance to stay up. The performance was better, but you can't treat the infrastructure as cattle, and that was a huge limitation for us.

The "reply" button sometimes goes away when the discussion is deep enough. Gotta click the comment to comment.

I would imagine that rolling a container out to thousands of hosts may take a while.

What kind of load and software do you run? In my experience, dramatically scaling out increases load and latency variance and causes all kinds of problems.

When was your experience with SoftLayer? Care to elaborate? I've got some interests in them for future projects. I'd rather hear about the issues now :D

You're still thinking about these servers as bare metal. We don't keep the servers running. We always create a server from scratch, either based on an image we prepared or with a Docker container that is automatically pulled from an internal repo once the OS boots.

This is the benefit of running on GCP. We don't have to trouble ourselves with the headache of scaling either the images or the containers, thanks to the internal tools offered by GCP.

We quit SoftLayer around February. We were running bare metal, and they went down so often that we essentially ended up keeping one super large server for a very insignificant service. We never gave them much of a chance, so I may be too harsh.

Pulling a 100 MB Docker image onto 1000 hosts would take an eternity.
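A back-of-the-envelope check on that claim. The registry uplink speed here is my assumption, not from the thread:

```python
# Rough estimate: 1000 hosts each pulling a 100 MB image from one registry.
hosts = 1000
image_mb = 100
total_gb = hosts * image_mb / 1000           # 100 GB of total transfer

# Assume the registry can serve a sustained 10 Gbit/s (1.25 GB/s).
registry_gbytes_per_s = 10 / 8
seconds = total_gb / registry_gbytes_per_s   # 80 s in the ideal case

print(f"{total_gb:.0f} GB total, ~{seconds:.0f} s at line rate")
```

Even the ideal case saturates the registry for over a minute, and in practice registry throttling, TLS handshakes, and layer extraction push it far higher, which is why people mirror registries per zone or bake the image into the machine image.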

I stopped thinking about bare metal a while ago. I'm thinking about the comparison with AWS, and I see that your entire business relies on capabilities only available on Google Cloud (which, indirectly, is why I'm recommending Google Cloud nowadays: it's easier for basic usage and it unlocks a whole new set of extreme usage).

Things you couldn't have pulled off that easily on AWS: creating thousands of hosts, deploying images that fast, managed Kubernetes, Pub/Sub, globally available images (vs. per-region AMIs), multi-region subnets.

Never tried it on AWS so I can't comment on that. It works very well on GCP, however. We send over 1M requests to their API every single day to rescale the stack (they had to raise our limits because it was always timing out on us).
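At that request volume, timeouts and throttling are routine, and the usual mitigation is retry with exponential backoff and jitter. A generic sketch, not GCP-specific; the function names here are mine:

```python
import random
import time

def call_with_backoff(fn, max_attempts=5, base_delay=0.5):
    """Retry fn() on transient failure with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except (TimeoutError, ConnectionError):
            if attempt == max_attempts - 1:
                raise
            # Sleep 0.5s, 1s, 2s, ... plus jitter to avoid a thundering herd.
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))

# Example: a simulated flaky API call that succeeds on the third attempt.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("simulated API timeout")
    return "ok"

print(call_with_backoff(flaky))  # ok
```

The jitter matters when thousands of workers hit the same rate-limited API: without it, all the retries land at the same instant and time out again.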

It's likely that a large percentage of your workload is static.

It's likely that bare metal is lower cost than spot - it is for me.

It's likely that your issue is allocation of capital - you'd rather burn money on AWS than tie up capital pre-paying for 18 months of servers.

I'm happy your company is doing great.

I agree GCP is a great competitor to AWS.

I think you're being closed-minded about building out a 20-60 cabinet data center to offload some of your workload.

It's hard to run a DC with over 20 racks and still pay more than 1/2 of AWS's prices.

You summed it up nicely, up until the last point.

I understand why you would think so, because you don't know our setup. Just to give you an idea, we process 16 PB of data every single month, and that's ingress traffic. If we had to pay for this traffic going out (egress), we would end up paying over $1M just for the traffic itself. By keeping everything in a single spot, it costs us literally nothing.
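The arithmetic behind those figures, using an assumed blended list price per GB (the rates here are my illustration, not from the comment):

```python
pb_per_month = 16
gb_per_month = pb_per_month * 1_000_000   # 16 PB = 16,000,000 GB (decimal)

list_rate = 0.065   # assumed blended cloud egress list price, $/GB
cheap_rate = 0.02   # the 2c/GB figure cited elsewhere in the thread

print(round(gb_per_month * list_rate))   # 1040000: "over $1M" at list price
print(round(gb_per_month * cheap_rate))  # 320000: about 1/3 at 2c/GB
```

This is also where the "3X too high" rebuttal comes from: at 2c/GB the same 16 PB works out to roughly $320k, not $1M.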

That said, I've tried. I reached out to many hosters, like OVH and others. They just don't have the capacity we need. 20 servers would not make a difference for us. We wanted to start with 500, but it would have taken them 9 months.

They are fools. I will build you 1k servers with 100g to AWS in 75 days.

Your $1M estimate is 3X too high at 2c/GB.

In two hours I'll save you $25k per month. In 6 months $300k/month.

Either you are growing and should be investing in cost efficiency, or you run a staid lifestyle business. Which is fine, but if you aren't growing, get off the expensive cloud.

There's good logic in this, but you don't have enough details about us to judge it properly. The gain would be much smaller than you think, and we would lose a lot of freedom, which would slow us down.

I don't think it's worth continuing to argue. It's obvious you've done your research and built a real system that works. You don't have to keep defending yourself against people who claim you can save 50% but don't understand that running an entire system means more than just minimizing the hardware cost.

> I think you're being closed-minded about building out a 20-60 cabinet data center to offload some of your workload.

I'd say it's time to forget about the infrastructure. You should optimize the software and rewrite it in C++ :D

We already have a huge portion of the system in C/C++ ;)

Would you mind giving more details about what you run with those 25k servers?

I'd worry about the time it takes to reconfigure these servers with Ansible/Salt, or the time it takes to kill/rebuild them from a system image.

On a side note, the 4th competitor is IBM SoftLayer. It's different enough from the top 3 (Amazon, Google, Azure) to be a thing of its own.

> Founder of Pex [pex.com], a video analytics & rights management platform able to find and track online video content anywhere.

src: doh's HN profile.

I agree but that's years out. We are barely scratching the surface.

Once the hockey stick flattens out, you will want to find permanent homes, assuming you have a viable long-term business model.

I'll be honest, it really feels like you haven't done a good job of scaling up first (at least based on your comments about cost and how they aren't provisioning fast enough). When you say you tried OVH and it didn't work, are you deploying dual-processor hexacores with 64 GB+ of RAM? Because if not, you're probably not doing it right. What software stack are you running?

Get 1-2 more people and move off the cloud. Maybe not for everything, but to serve your base traffic. You can manage it if you spend a few days learning how, the same way you learned AWS/GCE.

I may be missing something: why is bare metal better in this case (immutable/preemptible servers)?

Just the sheer number of instances the OP is using puts the usage FAR into the territory where the premiums spent on virtualized instances far outweigh any clever strategy one might use to make things cheaper in the cloud. It has nothing to do with what the infrastructure is, and everything to do with raw volume.

Spot instances are super cheap...about 80% less than regular instances.

Yeah, but it's still virtualized, slow molasses in the end. And still more expensive. My point is that at 25k servers, even a 5% efficiency gain is significant, and I'm willing to bet the margins are still higher.
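To put a rough number on that 5% (the per-instance price is an illustrative assumption, not from the thread):

```python
servers = 25_000
monthly_cost_per_server = 50   # assumed preemptible-instance price, $/month

fleet_cost = servers * monthly_cost_per_server   # $1.25M/month
efficiency_gain = 0.05

print(round(fleet_cost * efficiency_gain))  # 62500: a 5% gain is ~$62.5k/month
```

At fleet sizes like this, single-digit percentage gains are real money, which is the whole argument for squeezing out the virtualization overhead.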
