
Designing for concurrent requests is pretty much a given for web servers.



Designing software for concurrent requests (as in using poll/epoll/kqueue/IOCP, implementing worker pools, etc.) is a given, yes.

Designing infrastructure for concurrent requests is definitely not. I've worked on shared hosting systems with high concurrency requirements, and it was more complicated than just installing an Apache MPM. We had to think about balancing load across multiple servers; whether virtualizing bare-metal machines into multiple VMs was worthwhile (in our case it was, for a very site-specific reason); how many workers to run on each VM; how much memory to budget for OS caching vs. application memory; how to trade off concurrent access to the same page vs. concurrent access to different pages vs. cold starts of entirely new pages; and whether dependencies like auth backends or SQL databases could handle concurrency, and how much we needed to cache them. At the end of the day you have a finite number of CPUs, a finite network pipe, and a finite amount of RAM. You can throw more money at many of these problems (although often not a SQL database), but you generally have a finite amount of money too.
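To make the "how many workers per VM" question concrete, here's the back-of-envelope arithmetic I mean. All numbers are invented for illustration; the point is that RAM and CPU each impose their own ceiling and you take the minimum:

```python
# Back-of-envelope sizing for one VM (all numbers hypothetical).
ram_gb = 16
os_cache_gb = 4      # reserved for the OS page cache
per_worker_gb = 0.5  # resident memory per application worker
cpus = 8

workers_by_ram = int((ram_gb - os_cache_gb) / per_worker_gb)
workers_by_cpu = cpus * 2  # assume roughly half of each request is I/O wait
workers = min(workers_by_ram, workers_by_cpu)
```

With these made-up numbers you get 24 workers by RAM but only 16 by CPU, so 16 wins; change any input and the binding constraint can flip.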

I would be surprised if most people had the infrastructure to handle significantly increased concurrency at the same throughput as their current load. It's not a sensible thing to invest infrastructure budget into, most of the time.

(You can, of course, solve this by developing software to actively limit concurrency. That's not a given for exactly the reasons that developing for concurrency is a given, and it sounds like Lucidchart didn't have that software and determined that switching back to HTTP/1.1 was as good as writing that software.)


Sure, but in this article the problem was lack of software concurrency. This article really does boil down to "HTTP/2 exposed a fundamental flaw in our software".


Maybe I misread? My takeaway was that their frontend web server handled concurrency just fine and was happy to dispatch requests to the backend in parallel, but the backend couldn't keep up and the frontend returned timeouts. That's exactly what you get if you put a bunch of multithreaded web servers in front of a single-writer SQL database that needs to be hit on every request.

Yes, most such cases should be rearchitected to not go through a single choke point. But my claim is that this isn't automatic merely by developing for the web, and going through a CP database system is a pretty standard choice for good reason.


I have operated services where each web server had a single thread, because that provided a better user experience than more "optimal" configurations. It had certain bizarre scaling implications for single-core performance, and I wouldn't design it that way today, but there are times when this makes sense. (For example, when the web server only serves authenticated sessions, and each session requires a bound mainframe connection for requests, and parallelism is not only impermissible to the mainframe but would exceed the capacity available.)


I feel like the description we should be using for that class of machine is "former mainframe". (I'm sort of joking)

In seriousness, though, I'm both curious about, and a little bit skeptical of, the user experience benefit that architecture would give over a server-side request queue and a single worker against the queue. That would allow you to pay the cost of networking for the next request while the mainframe is working. You could even separate the submission of jobs from collecting the result, so that a disconnected client could resume waiting for a response. Anyway, I'm not saying you needed all that to have a well-functioning system, I'm just not convinced that a single-threaded architecture is ever actually good for the user unless it gives a marked reduction in overhead.
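To be concrete about what I mean by "queue plus single worker," here's a toy sketch (threads, a plain queue, invented names; the mainframe is simulated by `process`). Requests are accepted concurrently and handed an id, but the expensive resource is only ever touched by one thread:

```python
import queue
import threading
import uuid

jobs = queue.Queue()
results = {}

def process(payload):
    # Stand-in for the one-at-a-time mainframe call.
    return payload.upper()

def worker():
    # The single worker: the only code path that touches the "mainframe".
    while True:
        job_id, payload = jobs.get()
        results[job_id] = process(payload)
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

def submit(payload):
    # Accepting and parsing the next request overlaps with the worker.
    job_id = str(uuid.uuid4())
    jobs.put((job_id, payload))
    return job_id

def poll(job_id):
    # A disconnected client can come back later and ask by id.
    return results.get(job_id)  # None while still queued or running
```

This is exactly the extra moving part the next comment objects to, of course; the queue and the results store are two more things that can misbehave.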


Queues are operational complexity. Given the (worst-case-ish) choice between "architecture without a queue that sometimes has HTTP-level timeouts" and "architecture with a queue that reliably renders a spinner and sometimes has human-task-level timeouts," I'd probably favor the former unless management etc. really want the spinner and I'm confident we have tooling to figure out why requests are getting stuck in the queue. Without that tooling, debugging the single-threaded architecture is much easier.


Sure! But that is trading off user experience for technical simplicity (which you do often have to do at some point). However: the argument was that this system was better for user experience than a design that could accept requests in parallel, which is what I'm resisting/not yet understanding. In reality, I'm sure that the system was fine for the use cases they had, which is what I meant to admit with "I'm not saying you needed all that". I will say that the single-threaded no-queue design already carries a big risk of request A blocking request B.


My argument that this helps user experience is that, when a failure does happen, a simpler system makes it a lot easier to figure out why, tell the user that experienced it what happened and get them unblocked, and fix it for future users. The intended case is that failures should not happen. So if you expect your mainframe to process requests well within the TCP/HTTP timeouts, and you can do something client-side to make the user expect more than a couple hundred ms of latency (e.g., use JS to pop up a "Please wait," or better yet, drive the API call from an XHR instead of a top-level navigation and then do an entirely client-side spinner), you may as well not introduce more places where things could fail.

If you do expect the time to process requests to be multiple minutes in some cases, then you absolutely need a queue and some API for polling a request object to see if it's done yet. If you think that a request time over 30 seconds (including waiting for previous requests in flight) is a sign that something is broken, IMO user experience is improved if you spend engineering effort on making those things more resilient than building more distributed components that could themselves fail or at least make it harder to figure out where things broke.
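The "polling a request object" half of that could be as small as this client-side loop (a sketch with invented names, using the 30-second rule of thumb above as the default):

```python
import time

def wait_for_result(poll, job_id, timeout=30.0, interval=0.5):
    # Poll a hypothetical request object until it's done, or give up
    # and treat the request as broken per the 30-second rule of thumb.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = poll(job_id)
        if result is not None:
            return result
        time.sleep(interval)
    raise TimeoutError(f"request {job_id} still pending after {timeout}s")
```

Even this little loop illustrates the tradeoff: the timeout and interval are two more knobs that can be wrong, and the `TimeoutError` path is one more failure mode someone has to make debuggable.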



