Dropbox saved millions building a custom load balancer (betterstack.com)
58 points by cnctvfrc on Dec 11, 2024 | hide | past | favorite | 31 comments


We're well past 2016, but Stratechery had an opinion that Dropbox focused too much on infrastructure projects like this and would have had more success focusing on improving product/market fit.

"That's why I actually find this announcement really disappointing. Apparently Dropbox has been devoting significant resources for at least two years to a project that will no doubt have a positive impact on the bottom line but a minimal impact on the top line. It's all well-and-good (and honestly impressive) to announce 500 million registered users, but the reluctance to disclose both active users and especially the number (and size) of its business customers speaks even more loudly. How might have the product and company evolved if the company had continued to rely on AWS and devoted its resources to fixing its product-market fit problem?"

https://stratechery.com/2016/dropbox-leaves-aws-should-ups-a...


Perhaps the answer to this lies in the incentives for VCs. The current dropbox strategy produces a sustainable, lifestyle business for its employees and customers. They are happy with a product that meets their needs. It's not what VCs want at all; they want either total domination or acquisition. The middle-ground is uninteresting to them. So, had they stayed with AWS, they may have bought a 10% chance at 10x more VC return, and a 90% chance that they are bought up and absorbed into OneDrive.

I prefer the current outcome to a swing for the fences.


I don't think it would have been possible for them to stay with AWS, considering their storage volume. Once storage moved out of AWS, everything else followed.


I dunno, if you can’t provide enough value to adequately mark up bulk purchases of commodity Cloud storage, what exactly are you selling?


Dropbox's business IS storage, which means building on top of someone else's storage is always going to be a threat and cut into their margins. What incentive does AWS have to give Dropbox a really sweet S3 deal? They know Dropbox needs the storage. It's like why it's better for a business to own the building it's in: if you become successful, your landlord has an incentive to raise your rent. This isn't about whether AWS can provide a compelling bulk rate for S3; it's about whether your business lives or dies based on the next AWS deal renegotiation.


I guess that depends on whether you think cloud storage is a commodity.

Surely despite their business being storage, Dropbox would be foolish to design and manufacture their own hard disks?


No, I don't think that Dropbox should manufacture its own hard drives. The main reason is that switching hard drive manufacturers can be done piecemeal as you buy them. Getting data out of S3 when contract negotiations go bad can cost more than storing it. The economics, and the level of vulnerability, are just very different between the two.


Cloud storage before all the major cloud players were even a thing, for starters?


Sure, that was a great feature in 2007. (S3 existed when Dropbox was founded, FWIW.)

It eventually stopped being a differentiating source of value, and trying to out-commodity the CSPs on storage cost at scale seems like a bad strategy to bring value back to the product. At tremendous effort you make it possible to lower prices by 20% or whatever, in order to keep the same profit on an undifferentiated product. Who cares?


Dropbox is a company with thousands of engineers. They should be able to focus on both aspects.

It seems Dropbox has an issue with execution. It already has a set of customers; they should be able to upsell other things. They are trying with Dropbox Sign.

But other features like Paper and Photos don't seem to do well. Paper is deprecated, I think. Failing to expand into a docs-like SaaS is a very bad sign when your customers use Dropbox to store documents.


> Dropbox is a company with thousands of engineers. They should be able to focus on both aspects.

This highlights a big issue in online discourse: the false dichotomy is everywhere. "Why didn't they allocate resources to solving world hunger instead of Uber for Furbies?" Because they chose not to, not because it was an either-or.


> It isn't stated that Dropbox saved millions annually from this change. But, based on the cost and resource savings they mentioned from implementing Robinhood, as well as their size. It can be inferred that they saved a lot of money, most likely millions from this change.

So, clickbait.

The entire thing is a terrible summarization of Dropbox's own blog post about the topic (linked elsewhere in this thread). Better to read that instead.


HAProxy may be dynamically weighted via its Runtime API. That's how they load balance microservices dynamically...

You can use whatever you want as criteria and then make the changes. https://www.haproxy.com/documentation/haproxy-runtime-api/re...

They did something similar at Reincubate to dynamically scale according to load.
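A minimal sketch of driving that from Python (the socket path, backend name "web", and server name "web1" are all illustrative; `set server <backend>/<server> weight <w>` is the Runtime API command in question):

```python
import socket

def weight_command(backend: str, server: str, weight: int) -> str:
    # Build the Runtime API command; HAProxy weights are bounded 0-256.
    if not 0 <= weight <= 256:
        raise ValueError("HAProxy weights range from 0 to 256")
    return f"set server {backend}/{server} weight {weight}"

def send_command(sock_path: str, command: str) -> str:
    # Send one command over the stats socket and return HAProxy's reply.
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(sock_path)
        s.sendall((command + "\n").encode())
        return s.recv(4096).decode()

# e.g. send_command("/var/run/haproxy.sock", weight_command("web", "web1", 50))
```

A small control loop that reads your load metric and calls this periodically is most of what's needed.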


Earlier HN thread for the original article on dropbox.tech (48 comments): https://news.ycombinator.com/item?id=41968299


One thing you need to be careful of when load balancing like this is buggy servers. Let's say you have a bug that makes a server handle requests by immediately failing them. That machine will start to look like it's performing really well, and so you start sending more requests to it, and more fail, etc.,. This can turn even a tiny canary of a build into a complete outage.

I'm also surprised the article says it took days to equalize usage. It's unclear, but hopefully that's just because they were rolling it out slowly?
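A toy simulation of that feedback loop (my own sketch, not anything Dropbox does): weight servers by inverse observed latency. The buggy server "responds" in 1ms by failing, looks fast, and ends up soaking most of the traffic.

```python
import random

def simulate(rounds=1000, seed=0):
    rng = random.Random(seed)
    # Fixed "observed" response times: the buggy server fails instantly.
    latency = {"healthy": 0.100, "buggy": 0.001}
    counts = {"healthy": 0, "buggy": 0}
    for _ in range(rounds):
        # Naive policy: weight each server by inverse latency.
        weights = {s: 1.0 / latency[s] for s in latency}
        pick = rng.choices(list(weights), weights=list(weights.values()))[0]
        counts[pick] += 1
    return counts

# The buggy server ends up with roughly 99% of the requests.
```

Guarding against this usually means weighting on success rate (or error signatures) alongside latency, not latency alone.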


From the original article:

> A quick note on our migration strategy: It can be risky to switch load balancing strategies all at once. This is why we enable service owners to configure multiple load balancing strategies for a service in Robinhood. The LBS writes the weighted endpoints list generated by different strategies into different entries in the routing database. At Dropbox, we have a percentage-based feature gate, so we implement a mixed strategy where clients use the weighted sum of the weights generated by two load balancing strategies as the endpoint weight. For example, endpoint A might be weighted at 100 based on PID-based load balancing and 200 based on simple round-robin strategy. If we set the feature gate to 30% for PID-based load balancing, the weight of endpoint A becomes 100*0.3 + 200*0.7 = 170. This way, we can ensure that every client sees the same weight assignment for endpoints while gradually migrating to the new load balancing strategy.

https://dropbox.tech/infrastructure/robinhood-in-house-load-...
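The arithmetic from that example, as a quick sketch (the function name is mine):

```python
def mixed_weight(new_w: float, old_w: float, gate: float) -> float:
    # gate = fraction of the weighting governed by the new strategy (0.0-1.0).
    return new_w * gate + old_w * (1.0 - gate)

# Endpoint A: 100 under PID-based balancing, 200 under round-robin,
# feature gate at 30% for the new strategy -> blended weight of 170.
blended = mixed_weight(100, 200, 0.30)
```

At gate = 1.0 the old strategy's weight drops out entirely, completing the migration.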


^-- mind atm the markup here is catching the poster by surprise. The math should read: 100*0.3 + 200*0.7 = 170.

Not 1000.3 + 2000.7

(escape the *, or it bounds an italic section - *this* becomes this. \*this\* becomes *this*.)


oops. should be fixed now


“Let’s say you have a bug that makes a server handle requests by immediately failing them.”

Perhaps such designs should build progress indicators into the servers’ responses. If not visible (e.g. crypto), maybe the response size is a hint, where the error messages are always a specific, small size. The load balancer would then look for progress indicators or error signatures when assigning weights.


The buggy servers would still pass health checks though?


A little surprised that it isn't normal for load balancers to weight machines or consumers differently. I'm not super familiar with them, I guess. Even in this case, they add a lot of complexity with the secondary service that reads the load data.


SRE here; some load balancers do support weighting. However, most of the time, since we're on cloud machines or hardware refreshes happen in large cycles, the fleet running a given service is homogeneous, and standard round robin works well enough. It appears that Dropbox's fleet isn't, so they needed a solution like the one they built. One of those "at your scale, this makes sense; few companies are at this scale" situations.

EDIT: It probably wouldn't be hard to roll your own with HAProxy/Caddy APIs and Prometheus.
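A rough sketch of that idea (the mapping is hypothetical): turn a CPU-utilization sample — e.g. the result of a Prometheus query — into a weight you could then push to the balancer's admin API.

```python
def cpu_to_weight(cpu_utilization: float, max_weight: int = 100) -> int:
    # Idle servers get the full weight, busy ones proportionally less,
    # with a floor of 1 so no server is drained entirely.
    cpu = min(max(cpu_utilization, 0.0), 1.0)
    return max(1, round(max_weight * (1.0 - cpu)))

# e.g. feed per-host CPU samples and get per-host weights back:
weights = {host: cpu_to_weight(cpu)
           for host, cpu in {"web1": 0.20, "web2": 0.85}.items()}
```

Running that on a timer against Prometheus and pushing the results to HAProxy's Runtime API is roughly the home-rolled version of what the article describes.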


Nginx can do weighting. Nginx Plus can do a lot more.
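For reference, a minimal upstream block with static weights (addresses are illustrative):

```nginx
upstream backend {
    # Round-robin honors these weights: 10.0.0.1 receives ~3x the
    # traffic of each of the others.
    server 10.0.0.1 weight=3;
    server 10.0.0.2 weight=1;
    server 10.0.0.3 weight=1;
}
```

The catch, and the gap Dropbox's system fills, is that these weights are static: changing them means rewriting the config and reloading.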


This is very cool, but it seems like a lot to build to optimize the load balancer.

I'm curious why there isn't a discussion of "power of two choices" load balancing. It does a VERY good job of getting close to optimal load with relatively simple overhead, versus something like what Dropbox built, which is a pretty heavy system.

https://medium.com/the-intuition-project/load-balancing-the-...
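For reference, a toy power-of-two-choices sketch (using cumulative request counts as the load signal; real systems would typically use in-flight requests or latency):

```python
import random

def pick_p2c(loads, rng):
    # Sample two distinct servers and send the request to the less loaded.
    a, b = rng.sample(range(len(loads)), 2)
    return a if loads[a] <= loads[b] else b

def simulate(n_servers=10, n_requests=10000, seed=1):
    rng = random.Random(seed)
    loads = [0] * n_servers
    for _ in range(n_requests):
        loads[pick_p2c(loads, rng)] += 1
    return loads

# With only two random probes per request, the final loads stay within a
# few requests of perfectly even.
```

The appeal is that it needs no central view of the fleet — just two load probes per request — which is why it's a common default before building anything heavier.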


> It isn't stated that Dropbox saved millions annually from this change. But, based on the cost and resource savings they mentioned from implementing Robinhood, as well as their size.

> It can be inferred that they saved a lot of money, most likely millions from this change.

So the clickbaity title is just the author's guess? They haven't provided a range for these savings, so they're probably right at some point, but come on...


Link is someone's summary of an article on Dropbox blog.

Actual blog article is https://dropbox.tech/infrastructure/robinhood-in-house-load-...


Bad summary of a better article.


And I'll bet the devs who implemented this were compensated with at best a very small fraction of the savings.


tl;dr: instead of distributing requests round-robin style they give fewer requests to servers with high CPU load and more to servers with low CPU load

Plus some added complexity for scale, and for handling crashing load balancers


This sounds like it should be already standard practice


Implementing smart algorithms is easy on paper, but unwanted complexities always crop up, at every level.

It's a great tar pit of Murphy's law and Hofstadter's law combined.



