
Ask HN: Checklist for highly available bare metal deploy? - mstaoru
We're a small startup in Shanghai, China, currently in stealth mode. Our core
algorithm is relatively computationally intensive, and better adapted to CPU.

Cloud here is not cheap (comparatively), e.g. a measly on-demand 1 core 1G RAM
instance could run up to $23/month + traffic + storage. Prepaid is ~$10.

However, we're able to buy (own) bare metal 2 x 2893v3 (24 cores) + 256G ECC
RAM + 5 x 300G 15K drives with a battery-backed hardware RAID for about $1500.
Another $60/month for 1U colocation with unlimited power and a 10M unlimited
link (bandwidth is crazy expensive in China, e.g. a dedicated 100M link can
cost up to an additional $1500/month). I'd say 10 of these should be enough at
our scale.

The load is a pretty typical web & database flow with some heavy Numpy/Scipy
spikes from time to time.

Let's assume that the engineer labor cost is zero.

Could HN recommend some learning resources for minimizing pain with this
setup?
======
zzzcpan
Probably the biggest thing is to separate edge servers from backend servers,
running DNS and reverse HTTP(S) proxies (aka load balancers) on the edge
servers. This lets you achieve high availability at the internet-facing layer
and below. Use DNS routing and DNS failover to steer clients to specific edge
servers and away from unavailable ones. The more edge servers you have, from
more independent hosting providers, the higher availability you can get; they
can be cheap VPS servers. Failover between backends can then be handled on the
edge servers at the reverse-proxy layer. Backend servers won't even need
public IP addresses: they could tunnel to the edge servers and be located
anywhere. You would need to handle connectivity problems, though. You can
start with a fail-fast approach, using a very low connect timeout and low
request timeouts to quickly and seamlessly switch between backends. I don't
know of any off-the-shelf solution that handles connectivity problems well, so
at some point you will have to write your own tunneling proxy, but until then
you can survive on nginx: it can mark backends as failed and avoid sending
requests to them for a configurable amount of time. Start with at least 3
edge nodes from different hosting providers and 2 locations for backends; it
could be a single location if you can get 2 independent ISPs there.
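
The nginx behavior described above (fail fast, mark a backend as down for a
while, retry on the next one) maps onto the standard `max_fails`/`fail_timeout`
upstream parameters plus aggressive proxy timeouts. A minimal sketch, with
placeholder hostnames and ports:

```nginx
upstream app_backends {
    # After 2 failed attempts, skip this backend for 30 seconds.
    server backend1.internal:8080 max_fails=2 fail_timeout=30s;
    server backend2.internal:8080 max_fails=2 fail_timeout=30s;
    # Only receives traffic when all primary servers are marked down.
    server backup.internal:8080 backup;
}

server {
    listen 443 ssl;
    server_name example.com;

    location / {
        proxy_pass http://app_backends;
        # Fail fast: give up on an unresponsive backend quickly...
        proxy_connect_timeout 1s;
        proxy_read_timeout 5s;
        # ...and retry the request on the next backend in the pool.
        proxy_next_upstream error timeout http_502 http_503;
        proxy_next_upstream_tries 2;
    }
}
```

Note that `proxy_next_upstream` retries are only safe by default for
idempotent requests; the tight connect timeout assumes backends are on a
low-latency link to the edge.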

~~~
mstaoru
Good advice! I think Envoy is very well suited for this: it's configurable
over an API and has dynamic async DNS resolution with keepalives.
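
For reference, an Envoy cluster that does async DNS resolution with active
health checking might be sketched like this (field names follow Envoy's v3
config API; the hostname, port, and `/healthz` path are placeholders):

```yaml
clusters:
- name: app_backends
  type: STRICT_DNS            # re-resolve asynchronously, off the request path
  dns_refresh_rate: 5s
  connect_timeout: 1s         # fail fast on unreachable backends
  lb_policy: ROUND_ROBIN
  load_assignment:
    cluster_name: app_backends
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: backend.internal
              port_value: 8080
  health_checks:
  - timeout: 1s
    interval: 5s
    unhealthy_threshold: 2    # eject after 2 failed checks
    healthy_threshold: 1
    http_health_check:
      path: /healthz
```

Active health checks let Envoy stop routing to a dead backend before client
requests ever hit it, rather than discovering failures passively.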

------
quaquaqua1
Before sending anything to the cloud, I test it on a few different machines I
have sitting in my office.

Then I sandbox it on some free-tier cloud for a sanity check.

If you ever do send it to production in the cloud, be ready to pull the plug
if it goes haywire and starts running up a huge bill.

In China, electricity, parts, and even land can be cheap compared to the USA,
right? So I think that explains the lack of cloud providers.

Good luck!

~~~
mstaoru
Thanks! There are many cloud providers (Alibaba, Tencent, Huawei, Kingsoft,
etc.), but they are not a de facto "standard for everything" like AWS is in
the US. They are used more for CDN and cloud APIs, like recognition or mobile
storage (think Firebase). But for infra, many companies still run racks.

