evanj · on Aug 8, 2017

Fair queuing is an interesting and complementary approach that I have not considered, thanks for the suggestion.

I usually work with "internal" systems that don't always have an obvious "user" identifier, but it would be fairly easy to add one of some sort (e.g. an application id or similar). However, for that to be useful, there still has to be some sort of limit. In your case, the limit seems to be 1 concurrent request per user, up to a maximum of 100. In the case of some internal system, like say a metrics aggregator, what is the correct limit? I'm not sure, and I think it would require tuning.

That said, given a reasonable limit, this would definitely ensure that the "misbehaving" application (e.g. the one flooding the metrics aggregator with packets) is the one that gets punished, rather than everyone else. I'll have to think about this more. Thanks!
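To make the "1 concurrent request per user, up to a maximum of 100" idea concrete, here is a minimal sketch in Python. It is admission control rather than a full fair queue (rejected requests are not queued), and the class name `PerClientLimiter`, its methods, and the use of an application id as `client_id` are all hypothetical names for illustration. The right limits would still need tuning.

```python
import threading


class PerClientLimiter:
    """Admit a request only if both the per-client and global limits allow it."""

    def __init__(self, per_client_limit=1, global_limit=100):
        self._per_client_limit = per_client_limit
        self._global_limit = global_limit
        self._in_flight = {}  # client id -> number of active requests
        self._total = 0
        self._lock = threading.Lock()

    def try_acquire(self, client_id):
        """Return True if admitted; False means the caller should reject."""
        with self._lock:
            if self._total >= self._global_limit:
                return False
            active = self._in_flight.get(client_id, 0)
            if active >= self._per_client_limit:
                # Only the client that is over its own limit gets rejected.
                return False
            self._in_flight[client_id] = active + 1
            self._total += 1
            return True

    def release(self, client_id):
        """Call exactly once after each admitted request finishes."""
        with self._lock:
            remaining = self._in_flight[client_id] - 1
            if remaining == 0:
                del self._in_flight[client_id]
            else:
                self._in_flight[client_id] = remaining
            self._total -= 1
```

A server would call `try_acquire(app_id)` before doing any work, return an overload error when it is `False`, and call `release(app_id)` in a `finally` block once the request completes.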

evanj · on Aug 8, 2017

I agree that a universal solution sounds difficult. However, it seems like it should be possible to have something that works "well enough" for some (broad) class of servers. My example is TCP, which does a "good enough" job of controlling the flow of packets over a wide range of networks.

evanj · on Aug 8, 2017

Exactly right. I just enabled billing on this project. Oops!

evanj · on Aug 8, 2017

This is a good point, although I don't think I said "connections", I said "requests". The definition of requests is going to vary significantly. Yes, very slow clients are yet another problem that extremely robust systems don't have to handle.

In my personal case, nearly all of my work has always been on "internal" services that are being used inside a single organization inside a data center, where this is rarely a problem. This is much more of an issue for services that are accessed over the public Internet, where you need to handle malicious attackers as well as users that have horrible connections.

In any case, I believe what I wrote still applies, but you do need to set appropriate timeouts (yet another parameter to tune; yuck!)

evanj · on Aug 8, 2017

I use App Engine's static file serving to serve my site. It has a 1 GB/day bandwidth limit. It turns out Hacker News still moves a lot of traffic, so this is the first time I've crossed it! Oops. Billing is now enabled on this project.

evanj · on Aug 7, 2017

Awesome, this is exactly the sort of system I had in mind when I wrote that article. I'll take a look, thanks!

sitkack · on Aug 8, 2017

Logged-in users should take priority. Depending on the number of page requests, all users will see an error at some point, breaking the site for everyone.

evanj · on Aug 7, 2017

Thanks for the link, the abstract seems relevant. I agree: a control algorithm seems like it should work. The challenge is figuring out the metrics and parameters to make something that works "well enough" for most applications, like TCP does for networks. I would really love for someone to figure that out, so we don't run into this very often.

ot · on Aug 8, 2017

Yeah, it's something that needs to be tuned per-service. It is quite easy if the service is X-bound for some local resource X (CPU, disk, flash, network card), but if it is bottlenecked on external service calls it can be quite challenging to define a representative stress metric.
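As a rough illustration of what a TCP-like control loop could look like, here is a sketch of an additive-increase/multiplicative-decrease (AIMD) concurrency limit in Python. Everything here is an assumption for illustration: the class name `AimdLimit`, the choice of request latency as the stress metric, and the default parameter values. As noted above, latency may not be a representative metric for a service bottlenecked on external calls, and stepping once per sample is a simplification of how TCP adjusts its window.

```python
class AimdLimit:
    """Adjust a concurrency limit from latency samples, AIMD style."""

    def __init__(self, initial=10.0, min_limit=1.0, max_limit=1000.0,
                 target_latency_s=0.1, backoff=0.5):
        self.limit = initial
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.target_latency_s = target_latency_s  # assumed stress threshold
        self.backoff = backoff  # multiplicative decrease factor

    def on_sample(self, latency_s, ok=True):
        """Feed one completed request; returns the updated limit."""
        if not ok or latency_s > self.target_latency_s:
            # Overload signal: cut the limit sharply, like TCP on packet loss.
            self.limit = max(self.min_limit, self.limit * self.backoff)
        else:
            # Healthy sample: probe for more capacity slowly.
            self.limit = min(self.max_limit, self.limit + 1.0)
        return self.limit
```

The server would reject new requests whenever the number in flight exceeds `int(limit)`. The parameters still need tuning per service, which is exactly the hard part.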

evanj · on Jan 30, 2017

You make an excellent point: probability says a 4-byte CRC32C provides much weaker guarantees as the length of the message gets longer. These CRCs are typically optimized for pretty short messages, and that is what I had in mind when I wrote that article (e.g. the kind of messages that might get exchanged as part of an online serving system).
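For readers who have not seen CRC32C in action, here is a small, slow, bit-at-a-time sketch in Python that checksums a short message and detects a single flipped bit. The function name `crc32c` is mine, and this is for illustration only: real systems would use a hardware-accelerated or table-driven implementation.

```python
def crc32c(data: bytes) -> int:
    """Bitwise CRC32C (Castagnoli polynomial, reflected form 0x82F63B78)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift out one bit; apply the polynomial when that bit was set.
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


message = b"123456789"
checksum = crc32c(message)

# Flip the lowest bit of the first byte to simulate corruption in transit.
corrupted = bytes([message[0] ^ 0x01]) + message[1:]
detected = crc32c(corrupted) != checksum
```

A CRC always catches a single-bit error like this one. The weaker guarantees for long messages come from the fact that only 2^32 checksum values exist, so the more data a checksum covers, the more room there is for a corruption pattern that happens to leave the checksum unchanged.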

For overkill discussions about these issues, see

"32-Bit Cyclic Redundancy Codes for Internet Applications" which talks about the kinds of errors that various CRCs are guaranteed to detect: https://users.ece.cmu.edu/~koopman/networks/dsn02/dsn02_koop...

evanj · on Jan 30, 2017

See the previous Hacker News discussion from October 2015: https://news.ycombinator.com/item?id=10360108

I'm glad people are still interested in this subject. :)

evanj · on Aug 17, 2016

This is possible, but I find these restrictions hard to follow. As soon as you need to call a function, you need to audit that function to verify that it only calls other async-signal-safe functions. When you come back to the code to fix a bug six months later, you need to remember these restrictions.

Do this when you must, but it is easy to screw up.