
> why wouldn't you just handle it on your end and provide the public with a single domain?

That is what we did for our S3-compatible protocol. It adds cost, because it requires a load balancer.

The whole original storage design was based on the fact that in our original product line (Backblaze Personal Backup) we owned both ends of the protocol - our servers on the back end, and our client on the customer's laptop. We were able to eliminate all load balancers from our datacenter by making the client application a tiny bit more intelligent (maybe 50 lines of code). The client asks the central server where there is some free space. The server tells it. Then the client "hangs up" and calls the storage vault directly, no load balancer required! Then the client uploads as long as that storage vault does not fill up or crash. If the storage vault crashes, or is taken offline, or fills up all the spare space it has, the client is responsible for going back and asking the central server for a NEW location.

This fault tolerance step in the client ENTIRELY eliminates load balancers! Normally you need an array of servers and a load balancer to accept uploads, because what if one of the servers in the array crashed, had a bad power supply, or needed an OS update? The load balancer "fixes that" for you by balancing the load to another server. Pushing the intelligence down into the client saved us money. Nobody ever noticed or cared, because our programmers could write the extra 50 lines of code to save the $1 million worth of F5 load balancers (or whatever solution Amazon S3 uses).
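Concretely, the client-side flow looks roughly like this (a minimal sketch in Python; the endpoint name, JSON field names, and headers are hypothetical illustrations, not the actual B2 protocol):

    import requests  # plain HTTP client, purely illustrative

    # Hypothetical coordinator endpoint -- the real API calls differ.
    COORDINATOR_URL = "https://api.example.com/get_upload_target"

    def get_upload_target(session):
        """Ask the central server which storage vault currently has free space."""
        resp = session.get(COORDINATOR_URL, timeout=30)
        resp.raise_for_status()
        target = resp.json()
        return target["upload_url"], target["auth_token"]

    def upload_once(session, data, name):
        """One attempt: get a vault from the coordinator, then talk to that
        vault directly -- no load balancer in between."""
        upload_url, token = get_upload_target(session)
        resp = session.post(
            upload_url,
            headers={"Authorization": token, "X-File-Name": name},
            data=data,
            timeout=300,
        )
        # A full, offline, or crashed vault surfaces here as an error; the
        # caller is responsible for asking for a NEW vault and retrying.
        resp.raise_for_status()
        return resp.json()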

We based our original B2 API protocols on this cost savings and higher reliability, but it does push that 50 lines of logic down to the client. This caused a lot of developers extreme angst. They just couldn't imagine a world where their code had to handle upload failures and retries. They would ask us "how many retries should we try before we just fail the backup?" Should I try 2 retries, or 3, before the backup entirely fails and the customer loses data? Our client guys had a whole different approach: since it was a computer, we just went ahead and retried FOREVER. Never endingly, until the end of time, in an automated fashion. A couple of times a year one client gets unlucky and it requires several round trips before getting a vault to upload to, but who cares? It's a computer, it can retry forever. It never gets tired, never gives up.
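The "retry forever" part is just a loop around that single attempt (again a sketch, reusing the hypothetical upload_once from the block above):

    import random
    import time

    def upload_forever(session, data, name):
        """Retry until the upload succeeds: on any failure, go back to the
        central server for a NEW vault (inside upload_once) and try again."""
        attempt = 0
        while True:
            try:
                return upload_once(session, data, name)
            except Exception:
                attempt += 1
                # Small, capped sleep with jitter so a dead vault or a network
                # blip doesn't turn into a tight busy loop.
                time.sleep(min(60, 2 ** min(attempt, 6)) + random.random())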

But S3 never figured this out, and they require the single upload endpoint to have "high availability". That saves app developers about 50 lines of code and a lot of angst, but then we (Backblaze) have to purchase a big expensive load balancer, or build our own. We mostly built our own.



(I work at Google on Cloud Storage.)

Developers working with cloud storage APIs generally need to get used to the idea that not everything is going to work all of the time. Retries and proper status code/error handling are critical to making your application work properly in real-world conditions, and as "events" occur. Every major cloud storage provider has circumstances under which developers must retry to create reliable applications; Backblaze is no different. For GCS, we document truncated exponential backoff as the preferred strategy [1].
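As a rough illustration of truncated exponential backoff (a generic sketch; the GCS docs [1] describe the recommended parameters, and in practice you would only retry on retryable status codes such as 429 or 5xx):

    import random
    import time

    def with_backoff(operation, max_retries=8, max_backoff=64.0):
        """Retry `operation` using truncated exponential backoff with jitter.
        `operation` should raise an exception on retryable failures."""
        for attempt in range(max_retries):
            try:
                return operation()
            except Exception:
                if attempt == max_retries - 1:
                    raise
                # Wait min(2^attempt + random jitter, max_backoff) seconds.
                time.sleep(min(2 ** attempt + random.random(), max_backoff))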

Google has its Global Service Load Balancer (GSLB) [2], which handles...let's just say an enormous amount of traffic. GSLB is just part of the ecosystem at Google.

It's hard to design a storage system that's "all things to all people"! There are a number of tradeoffs that have to be made. Backblaze optimizes for keeping storage costs as low as possible for large objects. There are other dimensions that customers are willing to pay for.

[1] https://cloud.google.com/storage/docs/exponential-backoff

[2] https://landing.google.com/sre/sre-book/chapters/production-...



