The problem that's described here - "green" threads being CPU bound for too long and causing other requests to time out - is one that is common to anything that uses an event loop and is not unique to gevent. node.js also suffers from this.
Rachel says that a thread is only ever doing one thing at a time - it is handling one request, not many. But that's only true when you do CPU bound work. There is no way to write blocking-IO-style code at that kind of scale without some form of event loop underneath (gevent, async/await). You cannot spin up 100K native threads to handle 100K requests that are IO bound (which is very common in a microservice architecture, since requests will very quickly block on requests to other services). Or well, you can, but the native thread context switch overhead is very quickly going to grind the machine to a halt as you grow.
I'm a big fan of gevent, and while it does have these shortcomings - they are there because it's all on top of Python, a language which started out with the classic threading model (native threads), rather than this model.
Golang, on the other hand, doesn't suffer from them as it was designed from the get-go with this threading model in mind. So it allows you to write blocking-style code and get the benefits of an event loop (you never have to think about whether you need to await this operation). At the same time, goroutines can be preempted if they spend too long doing CPU work, just like normal threads.
> You cannot spin up 100K native threads to handle 100K requests that are IO bound (which is very common in a microservice architecture, since requests will very quickly block on requests to other services). Or well, you can, but the native thread context switch overhead is very quickly going to grind the machine to a halt as you grow.
Why? Bigger MySQL servers regularly use 10k native pthreads which spend most of their time waiting on IO.
Context switch overhead isn't linear with the number of threads! The Linux scheduler runs in O(log N). Moving from 10k MySQL threads to 1 thread per core will give you less than 10% extra throughput.
Doing IO means context switching. Goroutines don’t get a free pass here.
There are other costs to regular context switching as opposed to goroutines/greenlets (the green threads that gevent uses). I don't remember the details, but specific attention was paid to making context switching and other resource consumption cheaper for these green threads than for native threads, so I suggest reading about it in the greenlet/Golang docs :) You can also try searching for C10K, the term people used for the problem of handling 10K concurrent connections, which is often associated with cooperative threading.
For example, the cost of the context switch itself (storing all registers) is more significant with native threads.
Just try spinning up 100K threads that each print a line and then sleep for 10ms, and see how high your CPU usage gets.
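A quick way to try it yourself (a minimal sketch, not a rigorous benchmark; the thread count and sleep interval are arbitrary, and you'll likely need to raise ulimits and start with fewer threads):

    import threading
    import time

    def worker(i):
        print(f"thread {i} started")
        while True:
            time.sleep(0.01)   # sleep 10ms, wake up, repeat

    # Spawning this many native threads needs generous ulimits and RAM;
    # watch CPU usage in top/htop while it runs.
    threads = [threading.Thread(target=worker, args=(i,), daemon=True)
               for i in range(10_000)]
    for t in threads:
        t.start()
    time.sleep(60)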
Also, doing IO does not necessarily mean context switching - it means calling into the kernel (system calls). If you use an async IO operation (read/write from a socket) and then continue to the next thread, by the time you're done with all the ready threads, you're likely to have some sockets ready to read from again, so you might not context switch at all. Kernel developers are working on reducing even the need for syscalls with io_uring, which is designed to let you perform IO with hardly any system calls.
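The single-threaded, service-whatever-is-ready pattern being described looks roughly like this (a bare-bones echo server sketch using the stdlib selectors module, which sits on top of epoll on Linux; port and buffer size are arbitrary):

    import selectors
    import socket

    sel = selectors.DefaultSelector()          # epoll on Linux

    server = socket.socket()
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("0.0.0.0", 8080))
    server.listen()
    server.setblocking(False)
    sel.register(server, selectors.EVENT_READ)

    while True:
        # One syscall reports every socket that is ready right now; we then
        # service them back to back without any kernel thread switches.
        for key, _ in sel.select():
            sock = key.fileobj
            if sock is server:
                conn, _ = server.accept()
                conn.setblocking(False)
                sel.register(conn, selectors.EVENT_READ)
            else:
                data = sock.recv(4096)
                if data:
                    sock.send(data)            # naive echo; real code would buffer unsent bytes
                else:
                    sel.unregister(sock)
                    sock.close()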
Green threads are much cheaper to switch than pthreads, yes. In real applications the difference is far smaller than it was 20 years ago when C10k was challenging. In 2020 you can just open 10k threads and forget about it.
With 100k threads and 100k Goroutines, each doing nothing but waiting on a mutex: pthreads in C take ~20 microseconds per thread, and in Go it's about ~5 microseconds per goroutine.
This difference disappears really easily. Parse some JSON and it’ll be gone.
Entering kernel code is the expensive part of context switching so syscalls are very nearly as expensive. Reading from a socket still needs a syscall, even with green threads or asynchronous IO.
The more different bits of IO you do, like in a real web app, the less advantage there is to green threads. This is one reason Rust dropped its M:N threading implementation.
There shouldn't be any 100k hard limit for threads at least in Linux, though you need enough memory for 100k stacks of course. You need to increase some default limits for it though (https://stackoverflow.com/a/26190804)
Assuming a generous(?) 20 kB per thread in stack and other corresponding OS bookkeeping information, you could have 1k threads in 20 MB, or 1M threads in 20 GB.
Doing 100 Hz timer wakeups and IOs concurrently in 100k threads makes 10 M wakeups/second, that takes a chunk of CPU independent of green / native threads choice. Performance vs kernel threads will depend on the green threads implementation.
It's worth noting that the c10k writeup came out 20+ years ago, and those bottlenecks have been addressed both by fixing software bottlenecks and 20 years of semiconductor improvements.
To expand on this, by allowing preemption on function calls and simply not providing a loop mechanism, you can guarantee an upper bound on how long a process may need to wait until it can be scheduled.
Guarantee is a strong term. If you have an unbounded number of processes to run, there are no guarantees on how long it takes until any given one is scheduled.
Similarly, with cooperative multi-tasking as in Gevent, you can manipulate scheduling to try to provide better guarantees about wait times. It's just... you can't ignore the problem.
> You cannot spin up 100K native threads to handle 100K requests that are IO bound (which is very common in a microservice architecture, since requests will very quickly block on requests to other services)
The Varnish server does that all the time and it is not a problem.
Modern OSes handle a large number of threads much better than 10 years ago.
In the case of Python, however, you cannot do that due to the GIL and the fact that Python handles native threading terribly.
> We rarely recommend running with more than 5000 threads. If you seem to need more than 5000 threads, it’s very likely that there is something not quite right about your setup, and you should investigate elsewhere before you increase the maximum value.
Yes, and still 5000 threads will be more than enough to serve most production loads, even on high-traffic websites.
In Node.js and Python async/await you don't have to deal with deadlocks and mutexes and other ugly concurrency primitives/problems. The reduction in program complexity is huge, which is something Rachel completely discounts.
I'm not sure she discounts it. I mostly do firmware; in that environment you can use either preemptive or cooperative multitasking. The latter is much less likely to result in corrupted state and results in smaller code size, which is why I like it. But you pay with the latency and task starvation problems Rachel complains of.
So in another reply someone pulled off a deadlock in Node by creating their own external lock primitive outside of the Node stdlib, which itself doesn't contain a single lock primitive. Technically I'm wrong, it can happen, but only if you go out of your way to do it.
Yes, but you shouldn't just do that every time as calling into the event loop has its own processing cost. If you do that on each iteration of the loop, it will probably become much slower.
At a previous job, I wrote a wrapper that would check how long it had been since the event loop scheduler last ran, and if it was larger than some threshold, it would yield (by sleeping for 0ms). This was intended for use in loops of varying length, which might be long or short depending on some external factors, but where yielding on each loop iteration made them slower by an order of magnitude.
IIRC it was sending messages to RabbitMQ, which somewhere down the line is writing to a socket, but not necessarily blocking - if the socket manages to flush its buffer fast enough (faster than the processing code can send messages), writing to it may never block, resulting in a CPU bound loop (since the loop may perform work other than sending).
We didn't want people to have to think too hard about how their loop was going to behave (especially as it might mean reasoning about the internals of a third party library), and so the wrapper was born. If your loop was short enough, all it did was compare ints so the cost was negligible.
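Something along these lines, reconstructed as a minimal sketch (the original was gevent-based; the names and the 50ms threshold here are invented for illustration):

    import time
    import gevent

    _last_yield = time.monotonic()

    def maybe_yield(max_block_seconds=0.05):
        """Yield to the gevent hub only if we've been hogging it for too long."""
        global _last_yield
        now = time.monotonic()
        if now - _last_yield >= max_block_seconds:
            gevent.sleep(0)                 # 0ms sleep: let other greenlets run
            _last_yield = time.monotonic()

    # Called once per iteration of a loop of unpredictable length: short
    # iterations only pay for a float comparison, long runs hand control
    # back to the event loop at a bounded interval.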
Doesn't that mean you have to somehow know when it's been running too long and then yield control back? What if the slow operation is something atomic, like multiplying some huge numbers?
Well, the example given is decoding JSON. If that is happening in a long loop, you can yield once per iteration and be safe. Not all problems break apart neatly like that, but in those cases how much of a chance does the server have to not time out regardless, you know?
Note that once per iteration might be too often, but you can just measure how long an iteration typically takes, compare it to how soon you want to preempt the task, and then yield at the right interval.
Seems like abstractions will bite you there: most people will just do cool_library.unmarshall(request) in some variant, and those libraries will not have the same mechanism for yielding that you have.
Abstractions are meant to be broken! One could probably work around this problem by adding new functions to cool_library, or modifying existing ones, whose code would be copy-pasted from the library but with some asyncio.sleep(0) calls spliced in at strategic places :). For legacy projects, it may make more sense to cheat like this than to rewrite the whole project in a saner tech stack.
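A toy illustration of that cheat (cool_library and parse_one are hypothetical; the point is just the asyncio.sleep(0) spliced into the copied loop):

    import asyncio

    async def unmarshall_cooperatively(raw_items):
        # Body copy-pasted from cool_library.unmarshall, with yields spliced in.
        result = []
        for i, item in enumerate(raw_items):
            result.append(parse_one(item))     # hypothetical per-item parser
            if i % 1000 == 0:
                await asyncio.sleep(0)         # hand control back to the event loop
        return result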
Before Web Workers this was how you did things in the browser to avoid the "page is not responding" popup on computationally expensive operations: break each big operation into many small operations and step to the next phase using setTimeout(..., 1).
> The problem that's described here - "green" threads being CPU bound for too long and causing other requests to time out - is one that is common to anything that uses an event loop and is not unique to gevent. node.js also suffers from this.
I don't know, but I was actually discussing the case where the thread is performing pure CPU operations and not doing any IO. In this case the JS code does not yield to the event loop, so any other requests waiting will stall.
This is actually an important property of JavaScript that allows it to work without locks (as long as you do not await, your code runs completely synchronously and will not be interrupted). It's not a flaw, but definitely requires awareness.
Gevent code that doesn't spawn more than one native thread (the one that runs the event loop) also has this property - you don't need locks as long as you do not perform IO. In Python's case it can be more tricky, as you might end up yielding to the event loop by indirectly performing IO when logging or something similar.
In JavaScript, the only case, AFAIK, where this can happen is when you await. Nothing else will cause you to yield to the event loop.
The JS event loop is kind of funny, because tasks have no priority and no preemption and so they're executed first-in first-out. If you want to avoid blocking other code paths for more than a certain length of time you need to change the order you push tasks in[0] and you need to limit execution time in some way, which can get pretty complex to reason about.
[0] by "push tasks" I mean using setTimeout, new Promise(), await, requestAnimationFrame, or similar
Ok, so it is the opposite in nodejs? There's a single JS thread, but as soon as it performs any async IO (which is the only IO that one should be performing in node), that IO creates a thread servicing that IO request. In addition to that, if one is crazy enough to run long computational operations there's setTimeout(), setImmediate() and process.nextTick() and friends.
Though I would say doing any real compute work inline rather than farming it out to workers is a boneheaded idea regardless of the language used.
AFAIK there is only ever just one thread (that runs user code). setTimeout, etc are all also invoked by the event loop.
I can vaguely recall (but it's 1AM so I didn't verify - take with a grain of salt) that nodejs uses a thread pool for IO, but I guess that's only for IO for which there is no async API, otherwise that would be wasteful.
I imagine a quick search about the event loop and said thread pool would yield better researched answers :)
Nodejs is single process, single thread and single core. Chrome used to run different tabs as threads of a process but that ended since Spectre and Meltdown. Back to single process since then.
> async IO creates a thread servicing that IO request
Nope. The event loop is a single FIFO queue implemented with libuv. It differentiates between sync and async functions and basically goes around the queue, running synchronous functions to completion every time it finds one and async ones for a set time before switching to the next one.
There are Linux system calls in wide use that are blocking. Both node.js and Golang use native threads to make the system calls and "park and wait" for the response. The runtimes then take the information from the call and incorporate it back into the user code threads (the one user thread in node.js).
The implementation of libuv under nodejs (and the state of the art python async/await) actually uses a total of three threads. It's a complicated model and it's not even strictly correct to call it an "event loop." I don't completely understand it and very few people do.
From a high level you can think of it as a single thread that can only context switch during an IO call. Otherwise the function has to run and block all the way to termination. <---- This is what rachel is complaining about.
As long as my 100k threads are all waiting on io the kernel doesn't need to switch to them.
I get that once you reach full utilization, the best the kernel can do is to make it fair, and that's going to cost. But that's not the same as a general statement like "100k threads grinds the machine to a halt".
I have worked on a system that spins up 50k threads on a CPU with 72 threads and it works just fine: it's the Google web page fetching system. If that didn't work then Google wouldn't fetch web pages. At all.
> Go back and look. I said that it forks and then it imports your app. Your app (which is almost certainly the bulk of the code inside the Python interpreter) is not actually in memory when the fork happens. It gets loaded AFTER that point.
You can just pass `--preload` to have gunicorn load the application once. If you're using a standard framework like Django or Flask and not doing anything obviously insane then this works really well and without much effort. Yeah I'm sure some dumb libraries do some dumb things, but that's on them, and you for using those libraries. Same as any language.
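For reference, preloading can also be turned on from a config file rather than the command line; something like this (module path and worker count are just placeholders):

    # gunicorn.conf.py
    preload_app = True    # import the app once in the master, before forking workers
    workers = 4
    bind = "0.0.0.0:8000"

    # then: gunicorn -c gunicorn.conf.py myapp.wsgi:application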
If you want to stick your nose up at Python and state outright "I will not write a service in it" then that's up to you, it just comes across as your loss rather than a damning condemnation of the language and its ecosystem from Rachel By The Bay, an all-knowing and experienced higher power. I guess everyone else will keep quickly shipping value to customers with it while you worry about five processes waking up from a system call at once or an extra 150MB of memory usage.
> The code objects and other things that are immutable. (CPython refcounts those too?)
CPython refcounts all objects. Refcounting is not required because of mutability; it's required because the interpreter needs to know when an object's memory can be reclaimed for something else.
I don't know if code objects specifically would have their refcounts mutated a lot, since typically they're only referenced by one object, the function that they're the code for. But function objects will have their refcounts mutated every time the function is called, since that sets up a stack frame that grabs a reference to the function object and then releases it when the function returns.
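You can watch this happen with sys.getrefcount (a rough sketch; exact numbers vary by CPython version, but the count observed while the function is running is typically higher than at rest):

    import sys

    def f():
        # While f is executing, the call machinery (the caller's value stack,
        # and on newer CPythons the frame itself) holds extra references to f.
        return sys.getrefcount(f)

    at_rest = sys.getrefcount(f)
    while_running = f()
    print(at_rest, while_running)   # while_running is typically larger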
> If you're using a standard framework like Django or Flask then this works really well and without much effort.
I dug into it about a year ago: Django loads almost everything lazily, so a simple --preload did next to nothing. I had to write code to load the app for real at import time, the exact thing the article and common wisdom tell us not to do.
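The kind of "load it for real" code being described often ends up looking something like this (a sketch; the get_resolver() trick to force view imports is a common one, but what else needs touching depends on the project):

    # wsgi.py
    from django.core.wsgi import get_wsgi_application

    application = get_wsgi_application()

    # Force the URLconf (and therefore most view/serializer modules) to import
    # now, in the gunicorn master, so the memory is shared with forked workers.
    from django.urls import get_resolver
    get_resolver().url_patterns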
Django loads some things lazily but not almost everything. Nearly all your imports should be loaded by the preloaded application and shared across forks (COW semantics aside), and this usually takes up a non-trivial amount of memory. The things that are lazy are usually lazy for a reason - database connections, caching etc.
I believe the i18n system is also lazily loaded and depending on the languages you configure it can take up a fair bit of memory.
The whole discipline Rachel writes about is clearly intended for mature, scaled operations where outages and inefficiencies legitimately cost much more than the systems wizards needed to stop them. There's a time and a place for "move fast and break things" and if that's where you are, it's probably not for you.
I don't think you guys are saying different things here.
The article in this case is describing a bunch of common processes/optimisations/features that we have learnt to be critical for effective and efficient running of software. The author does this because the audience she writes for is, as the previous comment puts it “mature, scaled operations where outages...” etc etc
> You can just pass `--preload` to have gunicorn load the application once. If you're using a standard framework like Django or Flask and not doing anything obviously insane then this works really well and without much effort. Yeah I'm sure some dumb libraries do some dumb things, but that's on them, and you for using those libraries.
It's not always trivial to ensure none of your dependencies have import-time side effects. Sometimes the productivity/business benefit provided by the dependency outweighs the pain introduced by the side effects.
If spinning up a few more workers will solve a performance problem for you, it’s probably worth the time to throw the preload flag on there and see what it does to your test suite. Since you are already cost optimizing at this point you probably have the time.
> everyone else will keep quickly shipping things with it while you worry about five processes waking up from a system call at once or an extra 150mb of memory usage
With the current state of ecosystems, this quality-vs-quantity mutual exclusivity is much less pronounced. These days, you can fire up these services as quickly or quicker than in Python, with better performance and resource usage that is also more maintainable. Unless you speak of highly ecosystem-dependent libraries (e.g. ML), Python defenses that rely on time to market say more about the author's narrow comfort than general expediency.
Of course weird domain specific libraries are a major reason to use python. I can get my oddball service up in a few days vs weeks or months to replicate some bizarre library.
If all you're doing is basic web stuff then sure, you can do it in the language du jour, but even after a decade Go doesn't have a competent XML library, for example.
Time to market is killer. Python buys you time to determine whether your product is even worth building. Deployment sucks though.
e: I literally could not have written the services in my current role in a language other than Java or Python without replicating 100kloc libraries. Java would have required a bunch of work to integrate with the other services we had. So: python. If that costs me an extra $1k/mo for servers but gets us a customer paying $100k a month, was it wasteful?
Why was it easier to integrate with those other services from Python than from Java? Only because you already picked Python for those, right?
(I agree with you about time to market being the important thing. I don't think Python is winning that game any more though: its dependency management and deployment has fallen behind the rest of the industry, and newer languages have largely caught up with its conciseness without having to make the same compromises)
Not really. Setting up a scalable Java service is complicated even with tooling like Spring Boot, and Java has better libraries for our use case, but the feedback cycle and general code velocity would have been much slower since we'd be on Java 8, among other things. Plus learning Spring or a Java EE framework is tantamount to learning a whole new language; it wasn't worth the time then.
> Setting up a scalable Java service is complicated even with tooling like spring boot
How so? You can use much the same techniques you would in Python, or you can deploy a .war to a bunch of application servers and achieve what you'd do with docker/kubernetes/etc. in a much simpler way. You've also got a much better chance of scaling up with a single instance and not needing to scale horizontally.
> the feedback cycle and general code velocity was much slower since we'd be on java 8 among other things
What's keeping you on Java 8? Major JVM version upgrades are much easier and safer than even minor Python upgrades. I'm not doubting your situation, but old version of one language versus new version of another is not really a fair basis for comparison.
> Plus learning spring or a Java ee framework is tantamount to learning a whole new language it wasn't worth the time then.
Sure. I'm not saying it's wrong to choose to stick with the technology you're currently using - there's definitely a cost to switching or learning something new. But it's worth being conscious of whether your technology choices are being driven by legacy constraints and whether you'd want to make a different choice on a green-field project.
1. "Deploying a war to an application server" is a giant pain in the ass when you don't have people deeply familiar with EE app servers and tuning them. Python is not the best choice here but little can beat go's scp deploy
2. Tell that to every library that relied on Java EE being shipped with the JDK. A lot of the reason to use Java was these libraries that are, if you're lucky, in maintenance mode. They're still stuck on 8 since it's not worth putting in the time to migrate them to 11+ (I don't have that luxury, unfortunately)
3. > technology choices are being driven by legacy constraints
This project was greenfield but the problem domain is plagued by legacy constraints (all the way back to mainframes). "Rebuild from scratch" in a nicer, newer language would take years compared to a runway measured in months, so you do the math.
> Python is not the best choice here but little can beat go's scp deploy
Java is pretty damn close to that if you take the route of building a shaded jar (and embedding jetty if you need a web server). You need to install the JVM on the target server but that's all.
> Tell that to every library that relied on java EE being shipped with the JDK. A lot of the reason to use Java was these libraries that are, if you're lucky, in maintenance mode.
I don't think I ever saw a library like that? There's a huge, high-quality ecosystem of open-source Java libraries and I've never heard of any of them being reliant on Java EE.
> This project was greenfield but the problem domain is plagued by legacy constraints (all the way back to mainframes). "Rebuild from scratch" in a nicer, newer language would take years compared to a runway measured in months, so you do the math.
Sure, but you can't say this project is an example of when Python is a good technology choice if really the main reason you were using Python was because of legacy constraints.
Citation needed. You can write anything you want in any language you want, but if your team is experienced with Python then they will continue to ship value quickly. Sure, maybe if they were all well versed in brainfuck they could ship things quicker.
Narrowing in on the rather specific point about shipping things with Python and ignoring the larger argument that it doesn't freaking matter if things are not as efficient as they could possibly be is quite odd to be honest. I'm sure some of the arguments in the blog post would apply to whatever language you had in mind while writing your reply.
Yes after reading through the article it's not very clear to me what the actual problem is with using Python/Gunicorn/Gevent.
The author seems to be saying something about how if a worker is busy doing CPU intensive work (is decoding JSON really that intensive?) then other requests accepted by that worker have to wait for that work to complete before they can respond, and the client might timeout while waiting?
If that's the case:
1. Wouldn't this affect any language/framework that uses a cooperative concurrency model, including node.js and ASP.NET or even Python's async/await based frameworks? How is this problem specific to Python/Gunicorn/Gevent?
2. What would be a better alternative? The author says something about using actual OS-level threads, but I thought the whole point of green threads was that they are cheaper to switch than native threads?
1. Yes, it would affect other things. This is just an illustrative example.
2. Green threads have lower overhead, but it's a false economy if it causes you to needlessly redo work because of timeouts that could have been avoided.
Which it seems it must, because the kernel doesn't have the insight to know whether a green thread is doing that epoll because it's ACTUALLY idle, or because it's not idle but is willing to try to juggle a second (or third) thing while it has something on the back burner. So the kernel indiscriminately assigns work to threads without regard for whether they are juggling a lot or nothing.
Whereas with native threads, they never ask the kernel for more work while they're blocked on something else because they are literally blocked and thus won't be making that epoll system call.
(The article also mentions something about LIFO policy, which exacerbates the problem because it favors assigning work to the process which is likely to already have most of it.)
How come there's no work stealing? Green threads are supposed to be backed by some N:M thread pool, no?
Also, isn't the problem that JSON decoding (or whatever computation) simply block the thread and the other green threads cannot proceed at all, because there are simply no safepoints (yield points) inside these low level functions?
And in all these cases shouldn't the application estimate work (eg. in case of JSON if the string is longer than 100K), and if it's too big just put it on a dedicated heavy compute N:N thread pool?
For Python it's best practice anyway because of the GIL, no?
Python was always more about breadth than depth. (CPython is full of known inefficiencies, but it's been with us since 1989, and basically the core dev who worked most on performance - Victor Stinner - thinks the best way forward is to introduce subinterpreters - https://github.com/vstinner/talks/blob/master/2019-EuroPytho... )
Oh, that PDF is interesting, Python 3.8 has shared memory for multiprocessing, no more pipe objects between processes.
Furthermore, extensions and internal stuff have always had the ability to release the GIL and do their own thing (for example, on a threadpool, or using async/nonblocking I/O). But I have no idea about Gevent. I never liked it. (Just as with Twisted/Tornado, it was too much magic for too little benefit.)
I don't know what gevent is doing. But if you have a global interpreter lock then M:N might not be worth it, since only one thread will make progress (outside of the syscalls, which are non-blocking).
Surprised nobody mentioned asyncio.run_in_executor yet. It's designed to offload long-running CPU-bound tasks from the event loop by moving them to another thread pool (or process pool if you are afraid of the GIL). Eventually that pool will obviously also get starved given enough load, but at least you won't have CPU work blocking IO and vice versa. The tricky thing is knowing when an operation might grow to become too slow for the IO thread given dynamic inputs.
that's because `run_in_executor` doesn't spread CPU usage. All it does is wrap functions in threads so you can call them async. It doesn't create multiple processes so you're still limited to a single core in Python.
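For reference, passing an explicit ProcessPoolExecutor (rather than relying on the default thread pool) is how you get off a single core with run_in_executor; roughly:

    import asyncio
    import concurrent.futures
    import json

    pool = concurrent.futures.ProcessPoolExecutor()   # separate processes, so no GIL contention

    async def parse_big_payload(raw: str):
        loop = asyncio.get_running_loop()
        # The event loop keeps servicing other requests while a worker
        # process chews on the CPU-heavy decode.
        return await loop.run_in_executor(pool, json.loads, raw)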
Node.js will have that issue, and in fact the stdlib JSON encoding/decoding can't even be paused, so once you start processing something you're stuck until it's done. You could, however, write an incremental serializer/deserializer that could spread processing out across many event loop cycles to mitigate this.
Go, ASP.NET, and others not so much, depending, because the schedulers can pause and resume the tasks (on top of being threaded).
> 1. Wouldn't this affect any language/framework that uses a cooperative concurrency model, including node.js and ASP.NET or even Python's async/await based frameworks? How is this problem specific to Python/Gunicorn/Gevent?
I think she's against anything that has that problem. Not every green thread implementation has that problem. For example Go doesn't have that problem. Because there were 4 CPU threads (I think) and only 2 things needing to be done, with Go's M:N scheduling those 2 things would be sure to both be running.
> The author seems to be saying something about how if a worker is busy doing CPU intensive work (is decoding JSON really that intensive?) then other requests accepted by that worker have to wait for that work to complete before they can respond, and the client might timeout while waiting?
Yes. Decoding JSON with Python is CPU-intensive.
This is a very simple shell script around Python that is designed from the get-go to crash with an exception. However, it may not be the exception you expect:
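(The snippet itself isn't reproduced above; a minimal reconstruction of the kind of experiment being described is below - the nesting depth is arbitrary.)

    import json

    # Build an absurdly deeply nested document and hand it to the stdlib decoder.
    depth = 100_000
    payload = "[" * depth + "]" * depth
    json.loads(payload)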
Python's docs suggest you should arrive at one of two errors here: MemoryError (we are trying to parse something sizeable) or json.JSONDecodeError (the JSON is invalid).
You won't. You'll hit RecursionError.
Because, despite how badly Python deals with recursion, the JSON library depends on it extensively. Which means that there is a huge stack being built up every time you try to decode JSON in Python (dictionaries, function call overhead, etc.), only for it to be thrown away.
In Python, everything is generally CPU intensive compared to what it would be in compiled languages. Even though things like JSON decoding usually happen in a C library, Python programs that do close to nothing still use way more CPU than you would if you were running on the JVM, or in Go, C, whatever.
> Wouldn't this affect any language/framework that uses a cooperative concurrency model, including node.js and ASP.NET or even Python's async/await based frameworks? How is this problem specific to Python/Gunicorn/Gevent?
CPU bound-ness affects all of these platforms, yes. It affects Python and other interpreted languages the most, however, because these platforms become CPU-bound the most quickly. Also, applications that are written in scripting languages tend to have a lot of business logic going on in the first place; after all, if you just wanted to serve static pages you could use Apache with the event MPM, and if you wanted to proxy HTTP requests you'd use HAProxy; both event-based systems that are very much not CPU bound.
But yes, most importantly, Python's asyncio system is completely impacted by these same issues and I would have preferred she address that, as asyncio is part of the standard library now and is way more popular than gevent.
> What would be a better alternative? The author says something about using actual OS-level threads but I thought the whole point of green threads was that they are cheaper than thread switching?
I will grant she lost me a bit with the "use a real RPC system with <feature> <feature> <feature>" thing. Additionally, the "load the application in the child process" thing is pretty typical; a worker process should obviously have either threads or green threads in use so that each process can handle multiple concurrent requests, but only as many as you'd want handled effectively by one core, since the GIL is going to enforce that (another thing you wouldn't have to deal with in other languages, such as the compiled ones mentioned above). But it's typical that child processes are going to have a mostly original copy of things.
But as far as the "context switching" thing goes, I've yet to see benchmarks that show the overhead of OS-level context switching actually being more of a performance burden than the less frequent, but more work-intensive, context switching that user-space schemes like asyncio have to use. If you are writing a logic-heavy, or even logic-light, service that handles requests in Python, you will also have to worry about CPU-bound issues all the time. Using regular threads with processes, like what you get using something like mod_wsgi, will allow individual processes to attend to web requests more evenly. With mod_wsgi you can configure worker daemons that run multiple OS-level threads and you can also have multiple daemon processes.
I'm not sure if the multi-process model used by mod_wsgi has solved the accept() problem; however, in my experience the bigger problem is when a service configures itself to allow 1000 greenlets within each process while each process is realistically capable, from a CPU perspective, of handling maybe 5 or 10 concurrent requests. There's no mechanism that ensures each process gets an even balance of requests. That is, you might have all your requests waiting in one process, because you told them it can process 1000 at a time, while other processes are idle.
TL;DR I'm in the "event based programming is extremely overrated in Python" camp.
> Python programs that do close to nothing still use way more CPU than you would if you were running in the JVM, or Go, C, whatever.
Yeah but that's not exactly shocking news to anyone, is it? People generally choose Python for other reasons (productivity, library ecosystem, etc.) because the performance is "good enough" for most web apps, and if you reach a traffic level where performance becomes an issue that's a good problem that you can optimize for later. (Like Facebook did with PHP, Twitter did with Ruby, etc.)
> But yes, most importantly, Python's asyncio system is completely impacted by these same issues and I would have preferred she address that, as asyncio is part of the standard library now and is way more popular than gevent.
Right, the blog post gave me the impression she was calling out the combination of Python/Gunicorn/Gevent specifically for some reason. But if the underlying goal was to just point out that Python is slow then I am curious what people think the right solution is? Just switch out of Python and use Go or something else?
I love to work in Python and I came here just to point out that it's pretty obvious one should not use it if raw performance is a concern (which it is not in many situations). I remembered this Cal Henderson talk at Djangocon:
That said, I wrote a bit of Go recently and the experience was pleasant enough that I'd consider it for future works (should the needed libraries exist), as the extra performance and ease of deployment comes with very little effort from the developer.
The underlying goal was to show that python gets CPU bound very easily, this is not at all the same as saying "python is slow". If I want slow I'd use Ruby.
It's a shame this is not a top voted comment. Not many people understand both the intricacies and tradeoffs involved in all these approaches, and get burned, then blame the tools.
It seems to me that this submission is getting a lot of blowback in the comments for 1) the style and 2) the implication that wiring up Python services with HTTP is bad engineering. I don’t think this is productive.
On the first point, yeah Rachel’s posts are kinda snarky sometimes, but some of us find that entertaining particularly when they are highly detailed and thoroughly researched. I’ve worked with Rachel and she’s among the best “deep-dive” userspace-to-network driver problem solvers around. She knows her shit and we’re lucky she takes the time to put hard-earned lessons on the net for others to benefit from.
As for “microservices written in Python trading a bunch of sloppy JSON around via HTTP” is bad engineering: it is bad engineering, sometimes the flavor of the month is rancid (CORBA, multiple implementation inheritance, XSLT, I could go on). Introducing network boundaries where function calls would work is a bad idea, as anyone who’s dealt seriously with distributed systems for a living knows. JSON-over-HTTP for RPC is lazy, inefficient in machine time and engineering effort, and trivially obsolete in a world where Protocol Buffers/gRPC or Thrift and their ilk are so mature.
Now none of this is to say you should rewrite your system if it’s built that way, legacy stuff is a thing. But Rachel wrote a detailed piece on why you are asking for trouble if you build new stuff like this and people are, in my humble opinion, shooting the messenger.
"JSON-over-HTTP for RPC is lazy, inefficient in machine time and engineering effort, and trivially obsolete in a world where Protocol Buffers/gRPC or Thrift and their ilk are so mature."
Laziness is a virtue in our profession :) Huge "citation required" on those alternatives being more efficient in engineering effort, and most importantly JSON-over-HTTP is ubiquitous and interoperable with any language without having to rely on a gRPC or Thrift implementation and tooling for that language/platform.
Re machine-time efficiency: to the extent I understood what she was actually complaining about, simply none of those issues are attributable to JSON-over-HTTP being used.
On the digression of efficiency... treating efficiency as something you solve by throwing more machine resources at it is causing recognizable problems today.
And the calculus of cost doesn't even always work out the same.
That's a prime example of survivorship bias: projects that spent their runway optimizing for machine resources are not around to have their problems recognized.
I had more than one project where people were cheaper than machines, and going tight on budgets for machine resources led to the project surviving instead of dying.
Hell, a big example of that is Stack Overflow, which runs a very busy site on much less hardware than they would have needed otherwise, just by tackling high-level optimization questions up front.
I can spot, however, kernel-mode HTTP servers (including customized ones), and heavy use of a pretty advanced stack with good optimization capabilities. The choice of stack does make a big impact, something that they have mentioned several times, with the summary of "paying for Microsoft licenses paid off very well compared to using popular open-source stacks".
Remember, performance is also a feature: both a non-functional one (to reduce your costs) and a functional one (to have happier users).
> we’re lucky she takes the time to put hard-earned lessons on the net for others to benefit from.
I genuinely don't see much of a lesson to learn from this particular blogpost, and it appears neither did many others in HN. If there is one, beyond "don't use x", it's hard to find it.
I get the impression that this particular post is being upvoted to the top of HN because of who the author is, not necessarily because this post itself has value. This results in a whole bunch of others reading it, wondering why they're wasting their time with such a rambling post.
2. Remember that green threads tend to have problems with fairness of scheduling.
3. JSON decoding gobbles CPU.
4. Scheduling fairness problems increase response time variance.
4½. Green threads also increase it.
5. Don't forget to take retries of timed-out requests into account in protocol design; idempotence is the simplest solution when you can use it.
6. Wake-one semantics to avoid the thundering herd are important for performance when you have multiple threads, and Gunicorn has that thundering herd problem, so you probably don't want to be running it this way on a 64-core box with hyperthreading. (The problem is of course less severe than it was for Apache because the green threads don't thunder.)
7. Gevent uses epoll, not select, poll, or RT signals
8. EAGAIN and SIGPIPE if you didn't know about those. (Somebody is in today's lucky ten thousand.)
9. What kinds of mechanisms “tend to show up given time in a battle-tested [network server] system.”
10. Your systems don't have to be fragile pieces of shit.
I'm not sure whether the person I was replying to is The One to whom all these things are too obvious to be worth mentioning, or if these were too implicit for them to notice, or a combination. Either way, no, thank you for writing it.
The takeaway should be: don't do green threads/event loops for anything that involves any kind of non-trivial processing; or, even better, don't do that unless you really need to do such things (and "better performance" is not a valid reason).
One of the people who designed Protobuf has criticized it (Edit: to clarify, one of the authors of v2, see below), so that doesn't really inspire much confidence in it for me.[1] Your general point is correct though, there's much more to a well designed RPC system than what HTTP based systems can do, but protobuf/gRPC is very much lacking in ideas that are decades old at this point, like promise pipelining, etc.
Also, I feel like she (intentionally maybe) is conflating concurrency and parallelism. These "green thread" systems provide concurrency, but not parallelism. That should be something people are aware of when they use them.
> One of the people who designed Protobuf has said it's awful
Hmm, you seem to be citing me. To clarify:
1. I didn't design Protobuf. I just rewrote the implementation (created version 2) and open sourced it.
2. I don't think it's awful. In fact, I think it's best-of-breed for what it is, and certainly a much better choice than JSON-over-HTTP. Yes, in Cap'n Proto I've added a bunch of features which Protobuf doesn't have, like zero-copy and promise pipelining, and I obviously think these features make it better tech. But, to be fair, these ideas make Cap'n Proto more complicated, and whether that complexity is worth it is still very much unproven at the scale that Google uses Protobuf and gRPC/Stubby.
> These "green thread" systems provide concurrency, but not parallelism.
I'm not sure if these specific definitions of "concurrency" and "parallelism" are universal. I wasn't aware of them, at least.
Hi kentonv! I have a tangential question for you, if you don’t mind. I brought up your capnproto with a few friends at work, while chatting about the profiling data of our services (mostly CPU-bound, mostly on protobuf de-/encoding). After convincing ourselves that language-agnostic “zero-cost” requests weren’t completely magical, and that the whole promise thing is very useful, we got to wondering...
Do you think it’s possible, that gRPC/proto could evolve in a non-total-rewrite way to earn the benefits offered by capnproto? I figure you’d be best positioned to answer that kind of question, having worked so intimately on both! :)
We have enjoyed getting to know and using gRPC and proto, I also want to thank you for your work! capnproto is an inspiring solution to a prima facie unsolvable problem, I hope to see it succeed more universally, or at least inspire a proto4. :) Thank you again!
I think Protobuf fundamentally can't achieve zero-copy parsing without changing the underlying encoding. That said, zero-copy parsing only provides a significant real-world benefit in certain use cases. For the use case of RPC over a network -- especially over the internet -- zero-copy parsing has minimal benefit. The places where zero-copy parsing can be a big win are when it means you can mmap() a very large file, or for IPC in shared memory.
On the other hand, Promise Pipelining -- and, more generally, object-capability RPC -- could definitely be added to gRPC. In fact, the very first iteration of Cap'n Proto was a Protobuf-based RPC system that used the same service definition syntax that gRPC now uses. "Cap'n Proto" at the time meant "capabilities and protobuf". (That version of the project was short-lived and shares no code at all with the current Cap'n Proto.)
However, I don't think gRPC is likely to add ocaps unless and until the model proves itself by gaining wide popularity elsewhere. It doesn't make sense for gRPC to take the risk of adding a big, new, experimental feature which they'll be forced to support forever when the demand hasn't been proven yet. Ocaps are gaining a lot of popularity lately (with major new tech like Fuschia and WASI being capability-based) but I think there's still further to go before it would make sense for gRPC to adopt it.
I'm not entirely clear on whether this Github issue would bring feature-parity with Cap'n'proto but apparently Google already has a zero-copy API for Protocol Buffers internally: https://github.com/protocolbuffers/protobuf/issues/1896
I originally wrote the "zero copy" support in proto2, long before I created Cap'n Proto. What Protobuf means by "zero copy" is much more limited than what Cap'n Proto means. Protobuf's "zero copy" applies only to individual string (or "bytes") fields. The effect is that when you call the getter for that field, you get a pointer to the string inside the original message buffer, rather than a copy allocated on the heap. The overall message structure still needs to be parsed upfront and converted into an object tree on the heap (which I count as a copy).
Cap'n Proto is very different. Every Cap'n Proto object is actually a pointer into the original message buffer. Accessing one element of a large array is O(1) -- the previous (and subsequent) elements don't need to be examined at all. Similarly with structs, each field is located at a known fixed offset from the start of the struct, so can be accessed without examining other fields. Protobuf inherently cannot do this; there is no way to know where a field is located without first parsing all previous fields in the same message.
Thanks for the response, I was being a bit tongue in cheek. I don't think you actually think it's awful :) And thanks for clarifying that.
About concurrency vs. parallelism, I think it is fairly standard to think of them as two different concepts that overlap somewhat.
You can have concurrency with parallelism (e.g. pthreads, or M:N threading where you map "green threads" on to processes that can run in parallel). You can also have concurrency without parallelism. The difference between the two is that parallelism can be deterministic, whereas concurrency is always going to be non-deterministic.
> > These "green thread" systems provide concurrency, but not parallelism.
> I'm not sure if these specific definitions of "concurrency" and "parallelism" are universal. I wasn't aware of them, at least.
To be clear, since GP didn't define them: concurrency simulates parallelism through context switching. Context switching itself encompasses both cooperative multitasking (gevent does this) and preemptive multitasking (modern operating system threads when they're sharing a CPU).
AFAIK it is universal, but the distinction is close enough not to matter in most cases, so people get lazy with their words.
Those definitions have been coming into fashion in the last couple of decades. I think it's useful to have the distinction but I wish we had new words that didn't previously mean both things.
I think the main issue is it seems really one-sided and the intent was to be snarky, vs educational. I posted a comment here detailing some ways to work around some of the pitfalls. I think if she devoted more time in the article to solutions vs. complaining, her points would come across more productively.
This is a great review of what is going on "behind the scenes."
As the maintainer of about 5 little services with this structure I have vowed never to write another one. The memory overhead alone is a source of eternal irritation ("Surely there must be a better way....").
Echoing other commenters here, the real cost isn't actually discussed. Namely that there is a solution to some of these problems (re long running tasks?), but it carries with it a major increase in complexity. Its name is Celery and oh boy have fun with the ops overhead that that is going to induce.
A while back I did some unscientific benchmarking of the various worker classes for python3.6 and pypy3 (7.0 at the time I think?). Quoting my summary notes:
1. "pypy3 with sync worker has roughly the same performance, gevent is monstrously slow gthread is about 20 rps slower than sync (1s over 1k requests), sync can get up to ~150rps"
2. "pypy3 clearly faster with tornado than anything running 3.6"
3. "pypy3 is also about 4x faster when dumping nt straight from the database, peaking at about 80MBps to disk on the same computer while python3.6 hits ~20MBps"
I won't mention the workload because it was the same for both implementations and would only confuse the point, which is that there are better solutions out there in python land if you are stuck with one of these systems.
One thing I would love to hear from others is how other runtimes do this in a sane and performant way. What is the better solution left implicit in this post?
"I will not use web requests when the situation calls for RPCs"
I'm surprised how often devs treat this distinction as architecturally meaningful. Web requests are just RPCs with some of the parameters standardized and multiple surfaces for parameters and return values - query string, headers, body.
This is completely orthogonal to the strategy used to schedule IO, concurrency, etc.
I think what she's getting at is that RPC usually comes bundled with some kind of strictly typed serialization format and standard infrastructure (service discovery, dispatch, error handling, etc.). A lot of web frameworks just take one request and hand it to a function, and the rest, including decoding the loose JSON that might be in the body, is up to you.
RPC systems come in many different levels like programming languages. While there are low level RPC systems which are just a simple layer around web requests, there are others that do a lot more. They can do retries, host selection, and stateful operations like the article mentions.
These systems tend to take care of a bunch of easy-to-mess-up logic which tends to accumulate around any system that wants to send something and not mess it up. Any sufficiently old web request system tends to look identical to a high level RPC system designed badly.
So choosing an RPC system should give you all the features you’d eventually end up building around using web requests without spending your time rewriting it.
Thanks for posting this. I had the same reaction. I think making this distinction ends up muddling what an author means when they refer to RPC and ends up overloading the term.
REST constrains your architecture in ways that RPC doesn't. Of course you can use HTTP without using REST, but then you're paying a lot of the costs of REST without getting its benefits, and a simpler RPC protocol might be better.
Proxies, built-in retry semantics in many clients, code complexity for dealing with the mandatory flexibility in HTTP, the presumption that GET is safe for, say, prefetching or spidering, etc. I think REST is often worth the cost (I might be the only person on this thread who has written an HTTP server in assembly), but if you're not using REST, using HTTP will tend to cause you a lot of headaches you could have avoided, for little or no benefit.
GET requests are supposed to be idempotent and side-effect free. Sadly, too often, they are not. Internal state gets mutated, a query parameter gets stored and/or processed in a way that affects future reads, and so on.
But then again, nothing prevents you from writing RPC based service with those semantics either. It just might be that people who develop and maintain RPC services are better aware of the consequences of their actions.
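The anti-pattern in question is as mundane as something like this (Flask used purely for illustration; the route and names are made up):

    from flask import Flask

    app = Flask(__name__)
    state = {"promotions": 0}

    @app.route("/promote/<int:user_id>")      # GET by default
    def promote(user_id):
        state["promotions"] += 1              # side effect on a "read": a prefetcher,
        return {                              # spider, or retrying proxy now mutates data
            "user": user_id,
            "promotions": state["promotions"],
        }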
REST is conceptually easy, has human-readable on-the-wire payloads and operates on synchronous semantics.[×] The bar for entry is simply lower, allowing even an inexperienced developer to get going with very little friction.
The very fact that I can do this:
    import json, requests

    dada = json.loads(whatever)
    res = requests.post(URL, data=dada)
... means it's incredibly easy to get off the ground and shuffle data between services. Conversely, it's also easy to send the wrong data, in a wrong way.
×: webhooks are effectively callbacks, but instead of setting the code entry point for async return in your call, you expose a webhook route and treat it as any other request.
Your example code uses HTTP but not REST. This explains most of our apparent disagreement; what you are talking about when you say “REST” has very little to do with REST itself. REST is not conceptually easy; it's just that people frequently confuse it with HTTP.
I mean... you do have to parse text formats. HTTP parsing may be a solved problem, but that doesn't mean the overhead or complexity of doing so disappears.
Also, TLS is not really ideally lightweight for RPCs, but you should absolutely encrypt your RPC traffic (imo.) So I really think the whole stack is out.
(P.S.: If you are wondering what kinds of 'lightweight' replacements for TLS exist, I think my personal favorite attempt is CurveCP, although it is a bit dated nowadays. I wouldn't often recommend people roll their own, but you could certainly do something simple with NaCl/libsodium directly. Maybe QUIC also fits the bill?)
> that doesn't mean the overhead or complexity of doing so disappears
No, it doesn't. But there is no evidence this overhead actually mattered here. It usually doesn't because the CPUs easily outrun whatever bandwidth is available which is why JSON over HTTP is fine 99% of the time. There is absolutely nothing in this blog post that shows that's not also the case here. No rationale is provided as to how a strongly typed RPC mechanism would solve any actual problems the services is having.
So we're left with guesswork and the author's hang-ups about HTTP vs some as yet unnamed RPC solution.
Also, Gunicorn? Thundering herd? These are solved problems. Scrap the toy proxy and use something real like haproxy. At a minimum.
Finally, none of this griping about HTTP vs RPC actually addresses the _actual_ problem: the server can't process requests in a timely manner. That points to some deeper inefficiency or design issue that likely has nothing to do with Python or Gunicorn or Gevent at all. We're not given any insight as to what the hell the server is doing with all that CPU. Or why the client isn't using a protocol intended for long running processes; RPC schemes have timeouts too ya know....
HTTP is overly verbose, is a pain and slow to parse, and there are various interpretations of what the protocol spec allows and disallows. If machines are communicating, why should it be in a human-readable format? Binary is far quicker.
This is based on a number of obsolete premises. Contemporary HTTP techniques include[1] framed binary wire protocol, redundant header elimination, compressed header values and efficient cryptography. HTTP/2 + TLS 1.3+ are extremely efficient together and are difficult to improve upon. When compatibility, implementation quality and ubiquity[2] are considered they are effectively impossible to improve upon. Except...
If, in the unlikely case that your particular bit of brilliance is indeed hampered by the vestigial amount of overhead still present in contemporary HTTP, then you might do as Google and several other operations have done and dispense with traditional techniques (including TCP) altogether via QUIC.
Just because what you see in the Network tab of your browser's 'Developer tools' window looks like something from 2005 doesn't mean that's what is actually on the wire. It mostly isn't any longer as a share of global traffic.
But again, all of that is irrelevant; the post provided no evidence that replacing HTTP with some dubiously unnamed form of RPC would solve any actual problems. HTTP was tossed in the rant basket with a bunch of other things, few of which appeared relevant to the actual failure modes cited.
There is nothing that says that "RPC" can't do multiple request/response cycles over an existing open (and encrypted) connection rather than initiate a new one for every call, just like HTTP. Or even pipeline them like HTTP/2.0
Or even do RPC calls in both directions over one stream socket... essentially all the “inflexible” RPC protocols of the 90's can do that (and incidentally, the way this is usually implemented involves nested event loops)...
HTTP/1.1 only closes the connection if one party sends Connection: close. It was HTTP/1.0 that was one-shot. There was also pipelining support added, but apparently nobody bothered to implement it.
HTTP/2 is more reasonable. But by the time you get to HTTP/3 you're just doing HTTP/2 over QUIC. At which point, why not just send RPC payloads directly over QUIC?
That's circular. I can't argue anything from this point. "Why should I get an electric car when I have a good old gas car in the driveway?" I don't have an answer. I do have answers for why one is better than the other. APIs work over HTTP in spite of its limitations, not because of good synergy.

I think gRPC is the most reasonable implementation of such a thing (disclaimer: I work for Google, but not on gRPC, and I used gRPC before I worked at Google), but I still think it is overkill for many people. If you are using HTTP+REST+JSON and it works fine for what you are doing, then fine - there's an ecosystem already built around it.

But the kinds of things people do with lighter-weight and more efficient RPC layers literally aren't doable over standard HTTP/1.1 and REST. It enables stuff you wouldn't think of, when you can measure the absolute overhead in bytes. (As an example, I'm not aware of anyone actually doing this, but it would almost certainly be possible to forward low-level signals like USB or perhaps even PCI Express packets over a lightweight RPC layer, and get all of the encryption/access control/etc. you already have in your stack.)
Answers for why HTTP/1.1 is a poor fit:
- Text format requires text parsing. How long do you limit the header lines? What transport compression do you support? Text parsing is inefficient compared to binary formats.
- A lot of difficult to understand behavior. When do you send 100 Continue, what do you do when you receive it? What happens when you are on a keep-alive connection and there's no Content-length? (There's a whole flow chart for something simple like this.) etc.
- A lot of cruft. Chunked encoding is weird, for example. Trailers are also weird. What happens when a header is specified twice? (There's a small sketch of that last one below.)
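To make the duplicate-header point concrete, here's a tiny sketch using Python's stdlib HTTP header parser (the Content-Length values are made up for illustration):

    import io
    import http.client

    raw = io.BytesIO(b"Content-Length: 100\r\nContent-Length: 200\r\n\r\n")
    headers = http.client.parse_headers(raw)

    print(headers.get_all("Content-Length"))   # both values survive parsing
    print(headers["Content-Length"])           # plain lookup: which one? undefined
    # The spec says differing Content-Length values should be treated as an error;
    # plenty of stacks don't bother, which is where request-smuggling bugs live.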
Answers for why HTTP/2 is still a poor fit:
- What are the headers even for? You now have this entire section of your request that doesn't matter, with its own compression scheme called HPACK. Why?
- Server push. It's nice that you have bidirectional streams, but this is clearly designed for browser agents. gRPC repurposes this for bidirectional streaming as it should be, but...
- ...Oftentimes, hacks like that lead to the worst problem: you did all of this work to use HTTP as an RPC layer, and you can't even use it in a browser, because the sane things you do for your backend might not be compatible. In gRPC there's a special layer for handling this, but it's a lot of additional cruft.
HTTP/REST is great because there's a huge ecosystem, but that's not even a solid win, due to the complexity. As an example, years ago I ran into huge problems with Amazon ELB because it was buffering my entire request and response payloads, and imposing its own timeouts on top. All documented behavior, but you can't just plug in this HTTP thing and hope for it to work. Basically anything in the middle that also speaks HTTP has to be carefully configured. Again, leading to doubt over the whole point of using a protocol like HTTP. There are rules for what should be GET, PUT, POST, DELETE, and yet those all interact strangely: no payload in a GET body, some software gets weird about calls like DELETE, so sometimes you have to support POST for what should be a PUT, and so on.
And at the end of the day, all you really wanted was RPC payloads in both directions, and you have all of this crap around it, and it's largely just because web browsers exist, but none of this stuff even works well together.
It works OK if you don't really care much and just throw a software stack together, but that doesn't mean it will be efficient, doesn't mean you won't run into problems. I definitely prefer to go for simpler, and HTTP is not actually simpler. It just has the benefit of having an existing ecosystem.
I don't think we disagree about anything here: if you wanna optimize for maximal machine/network utilization, then optimize for that with gRPC or equivalent; if you wanna optimize for a lean stack and have to use HTTP anyway because you're on the web, then use (RPC over) HTTP - both can be considered more "efficient" depending on the setting and your constraints.
But the point was that contrasting web requests with RPC is a mistake of category and has little to do with various IO handling and concurrency models that the author was discussing.
Well, the thing is, I do agree with the author, though, on their point of not using web requests for RPCs. I think we must be interpreting the author's text differently.
Except she writes:
"
Then it does its weird userspace "thread" flip back to the original request's context and reads the response. It chews on this data, because, again, it's terrible JSON stuff and not anything reasonable like a strongly-typed, unambiguously, easily-deserialized message. The clock is ticking and ticking.
"
If she laments that it is bad design to do the deserialization on the IO thread, that is just as true for JSON as it is for protobuf or whatever "true RPC" format she considers worthy.
It is less true for formats that deserialize faster. I still don't see where she is confusing the two. At the very top, she explicitly notes them separately:
"I will not use web requests when the situation calls for RPCs. I will not use 'green' (userspace) 'threads' when there are actual OS-level threads and parallelization is necessary."
Sigh. Yes. I have been there and done that (more or less) and it sucks. The root problem is that data scientists really want to use Python for machine learning, but wrapping a Python model in a service that uses CPU and memory efficiently is really difficult.
Because of the GIL, you can't make predictions at the same time you're processing network IO, which means that you need multiple processes to respond to clients quickly and keep the CPU busy. But models use a lot of memory and so you can't run all THAT many processes.
I actually did get the load-then-fork, copy-on-write thing to work, but Python's garbage collections cause things to get moved around in memory and triggers copying and makes the processes gradually consume more and more memory as the model becomes less and less shared. Ok, so then you can terminate and re-fork the processes periodically, and avoid OOM errors, but there's still a lot of memory overhead and CPU usage is pretty low even when there are lots of clients waiting and...
You know I hear Julia is pretty mature these days and hey didn't Google release this nifty C++ library for ML and notebooks aren't THAT much easier. Between the GIL and the complete insanity that is python packaging, I think it's actually the worst possible language to use for ML.
She's talking about green threads, which are different from regular threading in Python. Under Node.js/Python-style green threads, only IO calls are concurrent with a single computation task. There is no parallelism under either style of threading, unless you count concurrent IO as parallel.
She is basically complaining about a pattern that was popularized by NodeJS and emulated in Python by older libraries like gevent, Twisted and Tornado. Currently Python 3 uses the async/await keywords as an API around the same concepts implemented in the older libraries.
In the case of the article, you are correct. I have a slightly different case where I'm wrapping scikit-learn model. We're NOT just calling another service and waiting for a response, we're doing computation, in Python. So the GIL is actually a problem.
> Because of the GIL, you can't make predictions at the same time you're processing network IO
Why not? If the model is a Python wrapper around some C/C++ library, then the GIL can be released, and this is actually the recommended approach used by almost every CPU-intensive Python library - https://docs.python.org/3/c-api/init.html#releasing-the-gil-... You can have parallel computations inside your wrapped C extension while the Python interpreter is processing IO.
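You can see the effect from pure Python with a rough sketch like this; hashlib is just a stand-in here for any C extension that drops the GIL while it works on a big input:

    import hashlib
    import time
    from concurrent.futures import ThreadPoolExecutor

    data = b"x" * (32 * 1024 * 1024)   # 32 MiB per hash; big enough that the
                                       # C code releases the GIL while hashing

    def cpu_task(_):
        return hashlib.sha256(data).hexdigest()

    for workers in (1, 4):
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=workers) as pool:
            list(pool.map(cpu_task, range(8)))
        print(workers, "thread(s):", round(time.perf_counter() - start, 2), "s")

With the GIL released inside sha256, the 4-thread run finishes markedly faster than the 1-thread run; pure-Python CPU work in the same threads would show no speedup at all.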
This is spot on. My one and only gripe is with this part:
> So how do you keep this kind of monster running? First, you make sure you never allow it to use too much of the CPU, because empirically, it'll mean that you're getting distracted too much and are timing out some requests while chasing down others. You set your system to "elastically scale up" at some pitiful utilization level, like 25-30% of the entire machine.
Letting a Python web service, written in your framework of choice, perform CPU-bound work is just bad design. A Python web service should essentially be a router for data, controlling authentication/authorization, I/O formatting, and not much else. CPU-intensive tasks should be submitted to a worker queue and handled out of process. Since this is Python, we don't have the luxury of using threads to perform CPU-bound work (because of the Global Interpreter Lock).
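A minimal sketch of the idea; a real setup would more likely use a proper job queue (Celery, RQ, etc.) than a stdlib process pool, and score() is a made-up stand-in for the CPU-heavy work:

    import asyncio
    from concurrent.futures import ProcessPoolExecutor

    def score(features):                 # stand-in for the CPU-heavy model call
        return sum(x * x for x in features)

    async def handle_request(pool, features):
        loop = asyncio.get_running_loop()
        # The web process only awaits the result; the CPU burn happens out of
        # process, so the event loop keeps serving other requests meanwhile.
        return await loop.run_in_executor(pool, score, features)

    async def main():
        with ProcessPoolExecutor(max_workers=4) as pool:   # sized to CPUs, not requests
            return await handle_request(pool, [1.0, 2.0, 3.0])

    if __name__ == "__main__":
        print(asyncio.run(main()))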
I like the author's articles most of the time. While this article contains some truths, I don't think it argues very persuasively for its conclusion. Okay, these parts of the Python ecosystem don't work well together, and it's a bad, unpolished experience. Fair, as with other criticisms of Python.
The question, however, is why one would use gevent at this point in Python's evolution. There's async await now, and things like FastAPI. If you want to use, say, the Django ecosystem, use Nginx and uWSGI and be done with it. Maybe you need to spend some more resources to deploy your Python. Okay. Is that a problem? Why are you using Python? Is it because it's quick to use and helps you solve problems faster with its gigantic, mature ecosystem that lets you focus on your business logic? Then this, while admittedly not great, is going to be a rounding error. Is it because you began using it in the aforementioned case and now you're boxed into an expensive corner and you need to figure out how to scale parts of your presumably useful production architecture serving a Very Useful Application?
Maybe you need to start splitting up your architecture into separate services, so that you can use Python for the things that it does well and use some other technology for the parts that aren't I/O bound and could benefit from that. But that's not what this article is about. This article is about someone making the wrong choices when better choices existed and then making a categorical decision against using Python for a service. I'd say that's what "we have to talk about", if you ask me.
I've been working on a legacy internal python system that suffers from most of the complaints here (and in the excellent COST paper Rachel links at the bottom).
The problems alluded to are, yes, solvable in python. But they also seem endemic in python systems.
When everyone who uses the tool uses it wrong, maybe it's not the user's fault.
(That said, I generally do think there's a time and place for Python systems or web apps. That time is generally when speed and maintainability are significantly less important than flexibility.)
> The problems alluded to are, yes, solvable in python. But they also seem endemic in python systems.
>
> When everyone who uses the tool uses it wrong, maybe it's not the user's fault.
Yes, though that doesn't mean it is necessarily the code's fault.
Honestly, I was very confused by this article, because I thought everyone understood what was going on, the trade-offs involved, and how that ought to impact your design decisions.
It's not that Gevent'd Gunicorn is intrinsically a bad thing. You're going for cooperative multi-tasking/concurrency, so no preemptive multi-tasking support. This creates potential challenges with fair scheduling if you have real-time constraints like timeouts... so you design accordingly.
One of the advantages of this model is you do indeed need less memory (and often a little less CPU) to handle high load levels. It's not like you are intrinsically better off if you use Python in a forking model. You can still end up so CPU bound that you timeout handling requests... the only difference is you'll get fairer splitting of the CPU's time across tasks. It can actually get worse if you get lost in an infinite series of context switches (yes, there are ways to mitigate this problem... although they can create fair scheduling problems... it's a natural tension), or worse still, start swapping.
If the notion that running out of CPU might mean you have timeouts hasn't occurred to you...
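One concrete way of "designing accordingly" under cooperative scheduling is to chunk the CPU work and yield back to the hub every so often. A rough sketch (assumes gevent is installed; expensive_step is a stand-in for the real work):

    import gevent

    def expensive_step(x):
        return x * x                     # stand-in for real CPU work

    def crunch(items):
        total = 0
        for i, item in enumerate(items):
            total += expensive_step(item)
            if i % 1000 == 0:
                gevent.sleep(0)          # cooperative yield: let other greenlets
                                         # (and their timeouts) get a turn
        return total

    print(gevent.spawn(crunch, range(100_000)).get())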
> When everyone who uses the tool uses it wrong, maybe it's not the user's fault.
I'm not the GP, but I guess that a tool that is
> quick to use and helps you solve problems faster with its gigantic, mature ecosystem that lets you focus on your business logic
can never cover all bases perfectly, and is generally great when starting out, but ultimately not built to be very forgiving when grown too much.
> now you're boxed into an expensive corner and you need to figure out how to scale
When you get to this point, and the requirements start to be more focused on performance, then it's time to start switching Python out. That does not devalue Python in the earlier stages of development and operation.
The point being that Python is the right tool for getting stuff working quickly, not for getting stuff executing quickly.
Agreed on both (a) I usually like the author’s articles and (b) think she’s missing the point on this one.
gevent and gunicorn were good attempts to remedy a bad situation. async/await is the solution that the Python community is coalescing around. Even with Django, there are active efforts to support ASGI. [1]
Gevent was doing it right, and the async syntax was a huge mistake that fractured community-contributed libraries into two incompatible camps, with lots of unnecessary cloning happening at the present moment.
In high-level languages with virtual machines and/or garbage collectors, the runtime system should be solely responsible for scheduling green threads around IO entry points, all without special syntactic markers. GHC has it right (https://www.aosabook.org/en/posa/warp.html), and Gevent was the right kind of development, with on-par async performance (https://gist.github.com/rfyiamcool/41d4004b7fd46516d0b4f34f6...) and a standard synchronous coding style. It could have been adopted into the core language and improved further without splitting the community.
I have run Python in its traditional synchronous form, using gevent, and with the more recent async/await syntax. I don’t hold this opinion strongly, but I do lean towards async/await syntax for the sake of explicit is better than implicit [1]. Node.js, which was asynchronous from the start, also separates async from sync explicitly with, for example, distinct fs.readFile() and fs.readFileSync() functions [2].
(Edit: Commenting only on clarity of syntax. Those performance metrics are interesting and I’ve admittedly never hit a scale where the difference has a practical impact.)
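For what it's worth, the contrast I have in mind looks roughly like the sketch below (two styles that would live in separate programs, shown together only for comparison; the fetch functions are made up):

    # gevent style: the code reads as synchronous; the yield to other greenlets
    # happens implicitly inside the blocking-looking socket calls.
    from gevent import socket as gsocket

    def head_request_gevent(host):
        s = gsocket.create_connection((host, 80))      # hidden cooperative yield
        s.sendall(b"HEAD / HTTP/1.0\r\nHost: " + host.encode() + b"\r\n\r\n")
        return s.recv(1024)

    # asyncio style: every potential suspension point is spelled out with await.
    import asyncio

    async def head_request_async(host):
        reader, writer = await asyncio.open_connection(host, 80)   # explicit
        writer.write(b"HEAD / HTTP/1.0\r\nHost: " + host.encode() + b"\r\n\r\n")
        await writer.drain()                                       # explicit
        return await reader.read(1024)                             # explicit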
> I don’t hold this opinion strongly, but do lean towards async/await syntax for the sake of explicit is better than implicit
I guess it's a question of where the line that defines "too implicit" should be drawn. I'm totally fine with implicit gevent yields, yet sometimes when I need to do heavy Python meta-programming, I wish things were more explicit around language semantics, namely everything around inheritance handling inside meta-classes (for instance, see the current implementation of enum.Enum).
What is the basis for that assertion? The "high-level VM with green threads" approach has been tried for a long time - most prominently, Java - and it just doesn't seem to stick.
For Python especially, it is problematic because it is a glue language more often than not, and VM-specific green threads are not good for cross-language interop. When you have promises and async/await around them, at ABI level it can all be mapped to a simple callback, which any language that has C FFI can handle. When you have green threads, every language in the picture has to be aware of them - and god forbid you have two different VMs with different notions of green threads interacting.
The fact that it's implemented in a runtime system I use nowadays.
> For Python especially, it is problematic
Shall we say it's a complex task instead of a problematic case?
> The "high-level VM with green threads" approach has been tried for a long time - most prominently, Java - and it just doesn't seem to stick.
AFAIK, it didn't stick because JNI with green threads needed to be scalable on SMP while the runtime implementation used a single thread, and a decision was then made to move to native threads. That doesn't necessarily indicate any inherent issues with VM-managed green threads (and CPython specifically cannot utilise SMP with its threads anyway). At least, this was mentioned in https://www.microsoft.com/en-us/research/publication/extendi... (Section 7).
> When you have promises and async/await around them, at ABI level it can all be mapped to a simple callback, which any language that has C FFI can handle.
Why wouldn't a VM be able to register those callbacks and bind them to a concrete OS thread when it knows that an FFI interop is going to happen? I don't see where explicit async/await is needed for that. It may require thread-safety markers (and that's what GHC's FFI interface has - https://wiki.haskell.org/GHC/Using_the_FFI#Improving_efficie...), but that's a different story from the invasive async syntax we have in contemporary Python.
It works nicely in Golang and Haskell. The main issue with Java and Python is that the core runtime developers, reasonably, do not wish to spend a lot of time developing equivalent systems.
I can't speak for Haskell, but inadequate performance of C FFI in Go is routinely mentioned as the reason why the community is so reluctant to wrap existing C libraries, rather than reimplementing them from scratch in Go.
To be completely honest, I don't know much about C interfaces or systems programming in general. Looking at benchmarks Go's FFI does indeed seem to perform pretty poorly. However, as a web dev, I find it works well for the concurrent programming tasks I find myself dealing with.
The ASGI spec came from the Django project as part of their Django Channels work. "There are active efforts to support ASGI in Django" is selling them a bit short, methinks.
I still use gevent, even for brand new projects. I work much faster with it than with async/await, and the performance appears to be comparable. I've tried getting used to async/await, but I find gevent much simpler to work with, in spite of the arguments made in places like https://glyph.twistedmatrix.com/2014/02/unyielding.html.
I wasn't aware of this particular inefficiency, but gevent is still fulfilling its purpose for me very well, and I see no reason to change. I like lightweight threads and thinking in terms of background jobs and dividing up work instead of remembering what things to annotate and when. I use locks if I need predictability. I like Python because I can develop quickly with it, and I can do so even faster with gevent while still getting more than enough performance.
Just because you don't understand the difference between gevent and asyncio, please don't post garbage laundry lists of your flavor of the month stack choices.
It's an amazing library and a very unique way to write cooperatively-scheduled applications. Best of all it works with existing libraries and doesn't require special "asyncio" implementations from top to bottom. It's not a silver bullet, but don't fool yourself that asyncio is because it's been blessed.
I think I understand the difference between gevent and asyncio pretty well. Moreover, it sounds like you understand the difference between community adoption and not, but you're fighting against community adoption with your own opinion of what is a "garbage laundry list" -- okay. You can say that. But, there's a reason the gevent approach is not what the community settled on.
What you call "special" asyncio implementations others would merely call obviously explicit code. Async/await is a powerful syntactic construct. I would never go back to gevent hell after using it.
It seems to be a complaint against doing process-per-CPU.
Let's say your server has 4 CPUs. The conservative option is to limit yourself to 4 requests at a time. But for most web applications, requests use tiny bursts of CPU in between longer spans of I/O, so your CPUs will be mostly idle.
Let's say we want to make better use of our CPUs and accept 40 requests at a time. Some environments (Java, Go, etc) allow any of the 40 requests to run on any of the CPUs. A request will have to wait only if 4+ of the 40 requests currently need to do CPU work.
Some environments (Node, Python, Ruby) allow a process to only use a single CPU at a time (roughly). You could run 40 processes, but that uses a lot of memory. The standard alternative is to do process-per-CPU; for this example we might run 4 processes and give each process 10 concurrent requests.
But now requests will have to wait if more than 1 of the 10 requests in its process needs to do CPU work. This has a higher probability of happening than "4+ out of 40". That's why this setup will result in higher latency.
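You can sanity-check that with a quick binomial back-of-the-envelope; the 5% figure below is an arbitrary assumption for how often an in-flight request needs the CPU at any given instant:

    from math import comb

    def p_more_than(k, n, p):
        # probability that more than k of n independent requests want CPU right now
        return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1, n + 1))

    p = 0.05   # assume each in-flight request needs the CPU 5% of the time
    print("shared pool, 4 CPUs, 40 reqs:", round(p_more_than(4, 40, p), 3))  # ~0.05
    print("per-process, 1 CPU, 10 reqs :", round(p_more_than(1, 10, p), 3))  # ~0.09

Under that made-up 5% load, a request in the 10-per-process setup is roughly twice as likely to be stuck behind someone else's CPU burst as one in the shared pool.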
And there's a bunch more to it. For example, it's slightly more expensive (for cache/NUMA reasons) for a request to switch from one CPU to another, so some high-performance frameworks intentionally pin requests to CPUs, e.g. Nginx, Seastar. A "work-stealing" scheduler tries to strike a balance: requests are pinned to CPUs, but if a CPU is idle it can "steal" a request from another CPU.
The starvation/timeout problem described in the post is strictly more likely to happen in process-per-CPU, sure. But for a ton of web app workloads, the odds of it happening are low, and there are things you can do to improve the situation.
Using an ASGI server that supports async/await, such as Uvicorn, instead of green threads, forking, etc, seems like a good idea these days. Also means you can use Starlette which has a much nicer design IMO than some of the old frameworks.
Any of the above (or Sanic) can do ~3K RPS on a single core on a Raspberry Pi (which is where I test things for portability, optimisation and a little fun), and the RAM overhead is generally not that bad (I just did a little "hello world" uvicorn/blacksheep app and I see 22MB resident/10MB shared per worker, while one of my Clojure servers takes up over four times that...)
Those below who complain about the complaints are missing the point.
We (computer programmers as a general class) have not learnt from history. We keep reinventing wheels and each time they are heavier and clunkier.
What we used to do in 40K of scripts now takes two gigabytes in python/django/whateverthehellelse. E.g. mail list servers. Mailman3 hang your head in shame!
> "Why in the hell would you fork then load, instead of load then fork?"
In Python it often seems to make little difference. The continual refcount incrementing and decrementing sooner or later touches most everything and causes the copy to happen whether you're mutating an object or not.
I've had some broad thoughts about how one would give cpython the ability to "turn off" gc and refcounting for some "forever" objects which you know you're never going to want to free, but it wouldn't be pretty as it would require segregating these objects into their own arenas to prevent neighbour writes dirtying the whole page anyway...
They took a step towards this with https://docs.python.org/3/library/gc.html#gc.freeze but it doesn't go as far as disabling refcount touching outright. I've experimented with doing that, both per-object and just globally, and the results really were promising if your forkserver can keep up with providing the necessarily much shorter-lived worker processes.
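The basic recipe looks roughly like this sketch (POSIX-only because of os.fork; load_model is a stand-in for whatever actually loads your model):

    import gc
    import os
    import time

    def load_model():
        return [float(i) for i in range(1_000_000)]   # stand-in for the real model

    model = load_model()    # load once, in the parent
    gc.disable()            # stop collection passes from rewriting object headers
    gc.freeze()             # everything alive right now goes into a "permanent"
                            # generation the collector never scans again

    for _ in range(4):
        if os.fork() == 0:
            # child worker: reads `model` through copy-on-write pages; refcount
            # updates still dirty some pages, so this is a mitigation, not a cure
            time.sleep(1)
            os._exit(0)

    for _ in range(4):
        os.wait()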
Thanks for this link - I had completely missed it (I think I was just expecting to disable gc entirely or perform some rudimentary surgery on its linked list)
This isn't quite the same thing, but it is one of the articles that spurred my thoughts on this subject. In cpython, gc != refcounting. Instagram were talking about disabling gc, which would have stopped objects which they weren't using from being falsely copied, but wouldn't have stopped objects that they were using (but not mutating) from being copied.
I think this conflates a poor implementation of a webserver with python/gunicorn/gevent being bad. There are a few (easy) things to do to avoid some of the pitfalls she encountered:
> A connection arrives on the socket. Linux runs a pass down the list of listeners doing the epoll thing -- all of them! -- and tells every single one of them that something's waiting out there. They each wake up, one after another, a few nanoseconds apart.
Linux is known to have poor fairness with multiple processes listening to the same socket. For most setups that require forking a process, you run a local loadbalancer on box, whether it's haproxy or something else, and have each process listen on its own port. This not only allows you to ensure fairness by whatever load balance policy you want, but also lets you have healthchecks, queueing, etc.
>Meanwhile, that original request is getting old. The request it made has since received a response, but since there's not been an opportunity to flip back to it, the new request is still cooking. Eventually, that new request's computations are done, and it sends back a reply: 200 HTTP/1.1 OK, blah blah blah.
This can happen whether it's an os threaded design or a userspace green-thread runtime. If a process is overloaded, clients can and will timeout on the request. The main difference is in a green-thread runtime it's about overloading the process vs. utilizing all threads. Can make this better by using a local load balancer on box and spreading load evenly. It's also best practice to minimize "blocking" in the application that causes these pauses to happen.
>That's why they fork-then-load. That's why it takes up so much memory, and that's why you can't just have a bunch of these stupid things hanging around, each handling one request at a time and not pulling a "SHINYTHING!" and ignoring one just because another came in. There's just not enough RAM on the machine to let you do this. So, num_cpus + 1 it is.
Delayed imports (because of cyclical dependencies) are a bad practice. That being said, forking N processes is standard for languages/runtimes that can only utilize a single core (Python, Ruby, JavaScript, etc.).
This is not to say that this solution is ideal -- just that with a small bit of work you can improve the scalability/reliability/behavior under load of these systems by quite a bit.
The problem being described here isn't Python, gunicorn, or gevent; it's bad programming. I'd be willing to bet there are systems out there written in C++, Java, and Ruby that do the same dumb things. The solution is to not do dumb things--to understand what your program is doing. It's perfectly possible to do that in Python, gunicorn, and gevent.
In the case of Java, the Selector API was introduced in Java 4 (2002) for this exact reason: to avoid having all the threads wait on, and be notified by, accept().
In this crap situation atm, can attest.
Currently maintaining a Python app for the delicate snowflakes whose years of math understanding somehow prevent them from being able to learn a language that isn’t Python.
I really like Rachel's blog and I think I understand the point she's making here. However I think she sees it from the point of view of very large scale services. In many cases you can have a solution ready more quickly with less developer time if you use these technologies, and at smaller scale this more than pays for the additional hardware you need to cope with the inefficiency. In such cases writing services in python is pragmatic and sensible.
Thanks for the pointer. I was messing around with --preload and --timeout flags and they seemed to work, although I think that isn't fixing the root problem.
I’m not sure what the main point of the article is. Telling us that event loops have problems? Sure, the lack of preemption can cause latency problems in some tasks. But native threads have other issues - that’s why people use event loops.
Is the message that epoll and co are not efficient enough? That’s also true. The API problems and the thundering herd are known, and not only limited to Python applications as users. IO-completion-based models (e.g. through io_uring) solve some of the issues.
Or is this mainly about Python and/or Gevent? If yes, then I don’t understand it, since the described issues can be found in the same way in libuv, node.js, Rust, Netty, etc.
I can relate to the writer; working with legacy sucks. That was my main take on the blog post; the others are just brilliant ways of rationalizing why other people suck and why there are other people than me.
Definitely, I am smarter than the guy who wrote this, because then I wouldn't have these problems (or he is smarter and I just didn't ask him about his rationale).
What I design wouldn't run into these BS problems that I have to fix; it just wouldn't run into problems generally. (Or it would have more problems than this one.)
I had these conversations with myself at least a thousand times, and then it was just the case in the parentheses.
In this very particular sense, almost anything else is better. Dynamic scripting languages that are intrinsically single threaded because they were single-threaded for the first 10-15 years of their lives, and it is virtually impossible to retrofit true threading all the way from their basic runtimes through all their libraries [1], are basically the pessimal case for this particular problem.
This is not the whole story of the value of those languages. As the article even alludes to, at small loads or with lots of care this can be made to "work". But it is something that an engineer should know about them before picking them up and using a tool for something it isn't really good at.
[1]: I add this caveat because I don't think there's anything about dynamic scripting languages that makes them intrinsically difficult to thread any moreso than any other category of language, it's just that by an accident of history, they all come to us from the 1990s personal computer world, and they all spent at least a decade cooking and setting and building libraries and communities and developer skillsets before a serious need for threading was even on the horizon.
It's a good caveat, because Lua, in particular, has fully-reentrant functions. You can run a bunch of lua_States cooperatively or on a threaded basis without problems. Everything the VM does, from the C side, receives a lua_State as the first argument.
It's intrinsically single-threaded, yes. But each instance is quite small and they stay out of each others way. Add coroutines and there's a lot you can safely do with Lua that's a real pain to accomplish with Python.
In the days of yore, I might've attempted to shoe-horn Tcl in such a situation. Decent event-loop for distributing tasks and (like Lua, but unlike Python/Ruby) more eager to escape to C for performance-sensitive tasks.
> I don't think there's anything about dynamic scripting languages that makes them intrinsically difficult to thread any moreso than any other category of language
I think there might be some intrinsic factors:
1. Languages like Python don't want to expose the program to undefined behavior. Defining some trivial class, and then manipulating it from two threads at the same time, is not supposed to crash the program or introduce horrible security issues.
2. Languages like Python have "property bag" objects. That means (someone who knows better should check me on this) that most writes in a typical program are hash table operations, rather than primitive stores of an int or whatever. Locking each table separately, or using a fully atomic implementation, can be a significant slowdown in single-threaded programs, compared to using a GIL.
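A tiny illustration of that second point:

    class Point:
        pass

    p = Point()
    p.x = 1                  # each attribute write is really a hash-table insert
    p.y = 2
    print(p.__dict__)        # {'x': 1, 'y': 2} -- the attributes live in a dict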
I think if you wrote a new dynamic scripting language from scratch with the intention that it be threaded that you could probably come up with something. Python is blocked from it not because it is impossible, but because they've been unwilling to orphan all their C extensions.
The problem is that while it may raise hackles when stated so bluntly, dynamic scripting languages are generally on their way out anyhow (although they have a ways to go before that is even generally recognized, and even longer before they're legacy; we're talking plural decades for the whole process here), and also, fighting really hard to get threading into a language that is also intrinsically slow is just not the sort of compelling end-point that would inspire an author to create it, and a community to back it up. Why would you want a "threaded scripting language" that literally takes a 32-core machine to catch up to what numerous languages would be capable of doing on a single core? (A new language will have a long ways to go before it can match a good JS VM. Look at Perl 6/Raku's performance history.) Especially since such a language would be racing things like Nim, Crystal, a D revitalization, and several other contenders who are getting 90% of the convenience of scripting languages while getting 90% (or more) of the performance of compiled ones.
You don't necessarily need one; it depends on what kind of application you have and what its bottleneck is. If your application is network I/O bound, event-driven asynchronous I/O, which is what this article is describing, works fine, and a single process/thread even in a dynamic language like Python can handle a large volume of requests.
The specific issue this article is describing is due to a particular poor implementation of event-driven asynchronous I/O, not a general problem with the entire concept.
If your application is CPU bound, then yes, you need to use threads (or multiple processes), and you shouldn't be trying to mix event-driven asynchronous I/O with that.
Anecdotally, seems like a lot of people are using Go, or maybe Elixir to keep the dynamic typed development experience but much more efficient hardware utilization.
I really think this should be solved at the OS level. Why is it so hard to implement kernel threads in an efficient way? Threading shouldn't need to be done in userspace.
It's not hard. The very start of the article says she trusts proper kernel threads more. But a bunch of these languages were designed and written in a time before threads were much of a thing. So they fake threads in user mode rather than fix the assumptions of the runtime.
Now, using kernel threads uncarefully will lead to different problems, which are famously tricky... But what I mean is, it is not "omg kernel threads are hard" that is the primary factor preventing direct access to them. It's that the language runtime has a lot of baggage.
It's the same thing. She and many other people don't know it, but she's complaining about the same model that was introduced and popularized by NodeJS.
Current Python "green threads" use the async/await keywords as an API. Underneath this API, the state-of-the-art implementations use libuv (wrapped in a Python library called uvloop), the exact SAME C library that powers nodejs.
I find this post to be unintelligible. Given that it's been upvoted to the top of HN though, can someone TL;DR of the intellectual value of this post? It seems to be stepping through the details of what is going on while also being rambling.
Gotta love those people who fail to understand how things are supposed to be used, fail miserably as a result, then throw the baby out with the bathwater in a fit of tantrum.
Yes, Python has a GIL. Yes, lightweight threads are mostly good for IO bound tasks. Yes it can still be used effectively if you design your app correctly.
Rachel's posts would be so much more useful if she would just say what she meant, instead of twisting everything into knots to find a way to say it backwards so she can be sarcastic and condescending while doing it.
I'm sure there's some useful information in here, but it's not worth digging through the patronization to find it.
Wow, I’m glad I’m not the only one that feels this way. The sheer amount of “everyone is stupid except for people like me” is astounding. I’d love to see an article on the same topic breaking down what is wrong (by showing the code) and then explaining the “right” way to do it, with code.
What are you talking about? Male? Is the author female? I honestly have no idea who the author is. But I don’t care if the author is male or female. Condescending is condescending.
I read this as a dev war story, and as a person venting.
I would have the same condescending sarcastic attitude while doing it if I was venting too.
I also know a lot of people who like my sarcasm when talking about topics like this, so yeah, as a guy who very much gets where the author is coming from, I agree this seems like a double standard.
The sarcastic and condescending tone is what makes it entertaining to read. I'm pretty sure you can find plenty of information on performance-tuning Python in IBM whitepapers, if that's what floats your boat.
I'd read it if it was a tutorial, but I'd read it when I hit performance issues in my Python webservice and it was Google's top result for my problem - not when it hit top 10 on HN.
Statistically speaking, maybe that's the same as me not reading it at all in 90% of universes.
It’s an interesting post to read. Have another go if you can, but I very much agree with you on the tone issue. Imagine if these were one's own notes that had to be read through the next time something like this happened. A more succinct, operational — dare I say: positive! — way of writing would really be welcome.
There is a lot of literature on the subject if you want more pragmatic notes. You read Rachel’s blog not only for the tech experience, but for her storytelling skills.
I particularly enjoy her blog.
The problem is that a solution for I/O bound workloads has become generalized as the solution for all concurrency needs when in reality, that’s just half the picture.
She mentioned a hell of a lot of googlable terms: epoll_wait, Apache thundering herd, EPOLLONESHOT, EAGAIN, idempotent requests, userspace threads, copy-on-write, queue depth determination, selective LIFO, strongly typed RPC, ...
Presumably “queue depth determination”, another new term for me, means determining how big the queue of pending requests for a service is allowed to get before further requests are refused (another load shedding measure) rather than being enqueued.
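Roughly, I take that to mean something like this sketch (the depth of 64 is a made-up number; in practice it would come from measurement):

    import queue

    MAX_QUEUE_DEPTH = 64            # the "determined" depth; a made-up number
    pending = queue.Queue(maxsize=MAX_QUEUE_DEPTH)

    def accept(request):
        try:
            pending.put_nowait(request)    # room left: enqueue as usual
            return "202 queued"
        except queue.Full:
            return "503 shed"              # refuse now rather than time out later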
I would counter-argue that dry, positive, informational writing is great for Wikipedia but can also be very boring. This blog has a lot of snark and that's what makes digesting the great information so much fun!
There's a middle ground, and you can avoid dryness with tones other than condescension. While I always read Rachel's posts whenever they come up because they're jam-packed with wisdom, I always find them a bit off-putting.
If I could put a point on it, it would be the implied entitlement and absence of gratitude. Sure, this architecture is not 100% efficient. But step back for a moment, take a breath, and consider the number of human-hours spent to get it where it is today. Consider how many people are busting their humps, many as volunteers, to keep improving it. We are not _owed_ any of this. Just the miracle of elastic server config and multicore processors... Buying into the pessimistic viewpoint is dangerous: when these issues get improved, will we feel grateful and adequate? Or will we find new flaws and get snarky about them?
Anyway, what I do really like about this post is that it shows the chain of technical details across the call chain. It connects together info from dozens of man pages, etc. I also appreciate how it points out that the inefficiency is quite convenient for service providers.
> Consider how many people are busting their humps, many volunteers, to keep improving it.
I think criticism about gratitude is strange when the author is pretty clearly coming from the standpoint that it was a bad idea to use this in the first place (and, to be fair, with regards to Python specifically, I tend towards that standpoint myself). From that standpoint, that labor begins to look like it's being set on fire. No Purple Hearts for self-inflicted wounds and all that.
For the downvoters - I also do not like Paul Graham and Sam Altman - they're the same as Rachel in every way. Little substance, lots of unsubstantiated filler material.
To extend this further, I also don't like the New Yorker for this reason alone - I don't have time for convoluted novel-like stories that have the important bit buried somewhere in the middle of 6 pages. If I want to read beautiful and creative prose, I need to be in that mindset. Not when discussing Python innards.
Her posts are entertaining. They aren't intended to be technical resources even though the topics are technical. If you aren't entertained by her style, move on.
She recently posted about which of her posts were most referenced by others, which caused all of those posts to be resubmitted despite being years old.
I would say it's a bad post because the conclusion is wrong.
According to the post, "the thing" with Python/Gunicorn/Gevent is it's less performant than one would like in some circumstances, and a lazy developer might tell you you need to 'set your system to "elastically scale up" at some pitiful utilization level, like 25-30% of the entire machine'.
That's probably true! But not that helpful if you don't say what those circumstances are. There are many circumstances where Python is appropriate for a web service, and many circumstances where green threads work just fine. Tell me when I need to consider using the more complicated solution, don't tell me the simpler solution is always useless and doomed from the start.
I'm really glad to see this as the top comment. I came back to comments after reading halfway to see if I was the only one struggling to extract any meaningful point from this.
I didn't find it patronising. Maybe those with ESL (English as a second language), used to more formal and more stratified usage, might not be as comfortable with it.
It actually gives a lot of background on why it doesn't work. Otherwise the post could just be:
tl;dr: gunicorn doesn't know how to multiplex listeners and green threads will ruin your request latency.
The thing is, that post isn't very useful or interesting. The point here is that the "simple" Python architecture doesn't scale well at all, so you might not want to use it if you're planning on scaling ever.
My feeling is that if you really wanted the author to improve, you would try to connect personally, establish trust and then talk to them privately about ways you feel they could improve their writing. And maybe you aren't comfortable reaching out privately because it's a woman, so that could go sideways (assuming you are male, which I don't actually know). Let me assure you that if you have good intentions, publicly dogging someone because you aren't comfortable reaching out privately is not a good substitute.
Seeing this comment at the top of the page on the highest ranked post on the HN front page for the only woman programmer that I am personally aware of who regularly makes the front page really feels like a kick in the gut and looks like sexist garbage. And I would like to think better of HN than that.
This is the Internet, where you take whatever feedback you get with the appropriate grain of salt and either choose to improve or not. Most people on the public venues on the Internet - forums, blogs, comments, essays - are not looking to build relationships or establish trust. (There are some exceptions - I've made some great friendships with Internet friends - but they're usually more private niche forums than blogs or other publications with a wide readership.) They're looking to get their opinion out there, build a readership, perhaps influence public discourse, and maybe get some feedback on their ideas.
I've seen similar comments leveled at PG [1] and Zed Shaw [2], so I don't think it's just sexism.
b) not the primary conversation around these two authors
Look to Linus Torvalds for a male example where delivery rather than content is often the primary conversation. That is how egregious the delivery must be for a male to get the tone police called on them
You may have something there. I went looking for the HN comments on Dabblers and Blowhards [1] (which, IMHO, is even more egregiously sanctimonious than Rachel's essay), but the top comment there was responding to content rather than tone. Only the 3 bottom-most comments remarked on tone.
I don't consider Linus Torvalds that vitriolic, BTW. Most of the time when he's angry he's trying to make a point. I think of Erik Naggum [2] or Poul-Henning Kamp when it comes to real vitriol on the Internet.
People have valid criticisms of Linus's delivery, but the content is often good. I tend to remember some of the technical arguments in those rants years after the fact, and cite them.
Keep in mind he did create the Linux kernel and git, so even if he delivers them inexpertly, even on a bad day, he has some technical insight.
All that said: I agree there is some gender bias showing up on this thread.
Oh, of course! If Linus wasn't special nobody would tolerate his style. Women have to be special for society to tolerate sarcasm from them. I'm unsure how old I'll be before a woman like Linus will be recognized rather than shoved aside.
Thank you for your many excellent comments in this discussion. You fought the good fight. You basically won in that this thread long ago ceased being at the top of the page.
Take your winnings and go home. Linus is not above social censure. His team reined him in not hugely long ago and the comment you are hissing at agrees with your larger point that there's some gender bias happening here and was uncommonly reasonable and evenhanded. I upvoted it.
I'm trying to be supportive. I'm trying to tell you "You've done enough. Relax. Take a break. Feel okay about how this went down."
I mean if your mom is dying of cancer or something and screaming at internet strangers is good distraction from more serious problems, cool. Don't let me stop you.
But if the point was "Doreen is right: this thread shouldn't be at the top of the page!" well, it's not anymore. Job well done. Have a cold brew or whatever and feel okay about it.
I'm not sure why my comment is interpreted as hissing/criticism. It was intended as elaboration and agreement. Oh well, people seem to have not liked it so I'll reconsider those types of posts in the future
I think it's an excellent point that a woman with equivalent chops as Linus is less likely to be recognized for it. So I am glad you replied. Thank you.
In part because of the larger context. In part because it sounds like sarcasm, not like you are genuinely agreeing that Linus actually deserves special treatment because of his stature.
I've defended Linus once or twice. I'm also glad he chose to take some time off and rethink things.
I can't think of any women we give similar accommodation to. That doesn't mean they don't exist. But the reality is that Linus is in a league all his own. It just sounds catty to make comparisons to him in that fashion.
I imagine if we genuinely had a "female Einstein," she would be pretty unique and would carve out her own unique relationship to the world at large.
Interesting. I suppose it can be read that way and I'll try to be more clear in the future.
My point is that we do have examples of female excellence, but almost invariably they are not uncouth. It seems more likely to me that the uncouth ones are silenced than that only male excellence can come in a brusque box
Janet Reno used to refer to herself as an awkward old maid to acknowledge her lack of smoothness and more or less dismiss such criticisms. Depending on your age, that might be before your time.
I'm short of sleep. I really don't desire to continue this discussion. I only spoke up because you seemed really frustrated and I wanted you to feel okay about how things went and that's apparently not your takeaway at all from my comment.
If you really want to shut me down and make me look like an absolute fool, you could list off the ten other women programmers who routinely hit the front page of HN that silly, pathetic little ole me completely missed.
I'm not getting into this argument about how it's not sexism because (bs example pretending men and women get treated exactly alike when everyone knows that's absolutely not true).
It's not about "wanting the author to improve". People are free to write on their own personal blog with as much snark as they like, in whatever style they prefer.
However, what seems to have happened here is that a bunch of folks are upvoting this link to the top of HN because of who the author is.
Meanwhile, other HN readers find this particular post to be a waste of time because, frankly, the content of the post itself is not particularly interesting or useful for most HN readers. Other posts[1][2] by this author have been much better suited for the top of HN, for example
I think that kinda hits the nail on the head. For a lot of people on HN the information being presented in the posts isn't new or particularly insightful. So to read the information presented in a tone where the author believes they are the only ones with the "true knowledge" can be very off-putting.
But of course there are other people that may get more out of it and not have a negative reaction to it.
I agree that the previous post was not constructive or effective, but there's nothing sexist about what they said. It's just someone broadcasting critical opinion.
Policing tone is far more prevalent when the speaker is a woman. Perhaps (I'm doubtful) this feedback would be given to a very well known male speaker, but it would not have been the top comment here.
Criticism is a staple of any human discussion/forum. It's not even feedback, it's just complaining about TFA. Sometimes the top comment on HN is someone complaining about the font-color of the blog post. Let's not lose our minds here.
Seems weird to get worked up over spotting a complaint on HN just because a woman wrote TFA. And btw, most complaints on HN are leveled at men, simply because men populate this forum and tech more than women. Does that mean this forum hates men? Why is it assumed men can handle it but women can't?
I have to wonder how many women are turned off by the idea that they need to be babied like this and can't take generic online criticism. Or the suggestion that criticism was only leveled at them because they are women. It sure reeks, to me.
Having different standards for different speakers is exactly what you're doing.
Criticizing how a message is delivered is standard HN criticism. Especially the sort of "everyone is stupid except for me" tone of TFA. I myself criticize commenters here for that, as it's something I can't stand either.
Why would you think it's something we only see leveled at women here? And, according to what? And, yes, you're then infantilizing women when OP does receive that criticism. I think your heart is in the right place, but you're doing exactly what you think you're condemning.
I don't think this type of criticism is only leveled at women. I think it is
a) much more likely to happen for much softer offenses
b) much more likely to become the primary conversation rather than an aside buried three levels deep in the comments
Perhaps in this instance Rachel's rhetoric was so off-putting that it really deserved top billing for conversation/criticism here. But that doesn't ring true for me, and I sincerely doubt the conversation/top post would be the same if instead written by e.g. Carmack
It's a good rant, but it's still a rant. Don't make it to something it's not, plenty of rants get harsher critiques and or don't receive that much up-votes.
There are few blog posts that make it to the front page of hacker news that don't draw sharp criticism in the comments and that criticism is quite often the top comment.
> “When a woman says it, it doesn’t sound as crazy,” said Maria Guadalupe, a professor at France’s INSEAD Business school and a co-creator with Joe Salvatore, clinical associate professor of educational theatre at New York University’s Steinhardt School, of the play.
Hmmm.
Is your conclusion based on actors reading off lines, or real life tone policing?
Maybe if it's a "natural experiment" it could be that women know they'll be held to a more tolerant standard (by most people) so they can get away with being a bit ruder. Or maybe they don't know the standard is more tolerant for women (they might even think they're being oppressed) but know where the line is where a crowd will turn against them (like most people do), and that line happens to allow them to be a little ruder.
Interesting experiment! I wasn't aware of it. I think it's difficult to extrapolate results, but I definitely have different takeaways than you.
1) The smiling aspect is explained (for me) by society pushing women to constantly smile, but not men. The amount we expect men and women to smile is different and when they violate those norms they're either a bitch (women for too little) or fake (men for too much).
I'm not sure how to interpret the tone aspect, and it's super interesting! It definitely flies in the face of multitudes of studies showing the reverse. I'm inclined to believe the studies which are really quite simple e.g. have people grade a short essay where the only difference between groups is the essay author's name.
It'd be interesting to get the sentiment on Torvalds's history of blog posts that make Hacker News. Willing to bet my paycheck that his sarcastic and ranty tone was loved.
What you're saying with this post and the one below is that criticizing a woman (even in a situation where women are underrepresented) is sexist. This is obviously not true and if you believe the criticism is unwarranted then you should make your own criticism based on those points.
That's not what I'm saying at all and it's dumbfounding to me that I am getting such a pile on to try to shut me up by probably all men trying to claim there's no sexism here.
My framing actually assumes positive intent gone wrong and suggests that if there is positive intent, this is not a best practice.
Entire audience hears "Some whiny bitch reading in sexism where there is none and that needs to be shut down cuz reasons."
And therein lies the problem.
But I promised myself I wasn't going to be dragged into some shitshow. I knew no matter how carefully I worded it, it was likely to get ugly pushback and not get good faith engagement.
DoreenMichele is actually much less ideological than most. Maybe it's not obvious from this thread, but those of us who have read her in that past know that her thoughts on these topics are actually unpredictable (and in particular, not at all anti-men). That's quite unusual. I'd give her the benefit of the doubt.
Much less ideological than most what? Most women who call attention to them being women on the Internet? Or just most people who post interesting technical stuff online?
Than most people who comment about gender issues on the internet. I find that once you have a few bits of information, you can nearly always predict where someone is going to come down on such matters. It's not common to run into someone who's less predictable that way.
Wholeheartedly agree with this. I expected to see the parent comment, but I'm really sad to see it at the top. Of course that's not the commenters fault per se; it's clearly a very common opinion people are happy/eager to communicate rather than be ashamed of (even if so mildly as to prefer not to have it in their upvote collection).
There's a huge amount of technical jargon and sarcasm that makes it hard to see her point.
Basically she's saying that Python async (whose current state-of-the-art implementation uses libuv, the same thing driving nodejs, and consequently suffers from the same "problems") doesn't have actual parallelism. Computations block, and concurrency only happens in one very specific case: IO. One computation can happen at a time, with several IO calls in flight, and a context switch can only happen when the computation makes an IO call.
She fails to see why this is good:
Python async and nodejs do not need concurrency primitives like locks. You cannot have a deadlock happen under this model period. (note I'm not talking about python threading, I'm talking about async/await)
This pattern was designed for simple pipeline programming for webapps where the webapp just does some minor translations and authentication then offloads the actual processing to an external computation engine (usually known as a database). This is where the real processing meat happens but most programmers just deal with this stuff through an API (usually called SQL). It's good to not have to deal with locks, mutexes, deadlocks and race conditions in the webapp. This is a huge benefit in terms of managing complexity which she completely discounts.
> Python async and nodejs do not need concurrency primitives like locks. You cannot have a deadlock happen under this model period.
This is dangerously wrong and I would suggest that you reconsider the steps that got you to this understanding because something really important has been lost. It is absolutely critical to understand that deadlocks are not why you have locks. Correctness during concurrent operation is why you have locks. Deadlocks are a failure state when you do not have correctness during concurrent operation. So are things like double-increment and double-create.
Parallelism does not imply deadlocking, concurrency implies deadlocking, and both NodeJS and Python are concurrent runtime environments. And I can guarantee you that, with a little skull sweat, you can write a deadlock in NodeJS or Python. It is very easy. If you need some help, here's a trivial example (and ordinarily I wouldn't use a link shortener here but this one is hefty, it just goes to the Typescript playground): https://bit.ly/2Tvjyze
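(Since the link above is shortened, here is a minimal sketch of the same class of bug, written with Python's asyncio rather than the TypeScript behind the link; the names are made up. Two tasks on a single thread acquire two locks in opposite order and wedge forever, no parallelism required.)

    import asyncio

    async def worker(first, second):
        async with first:
            await asyncio.sleep(0)   # yield, letting the other task grab its first lock
            async with second:       # now each task waits on the lock the other holds
                pass

    async def main():
        lock_a, lock_b = asyncio.Lock(), asyncio.Lock()
        try:
            await asyncio.wait_for(
                asyncio.gather(worker(lock_a, lock_b), worker(lock_b, lock_a)),
                timeout=1,
            )
        except asyncio.TimeoutError:
            print("deadlocked: one thread, no parallelism, still stuck")

    asyncio.run(main())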
Also, as a concrete, real-world, yes-it-happens-here example of where locking is important, consider that I've recently built a dependency injection framework in NodeJS--tried to use others' first, but my situation isn't covered by existing ones--and had to resort to a mutex to avoid double creation of objects within a single lifecycle. Creation of objects within this lifecycle happens asynchronously--it has to, as the act of creating the objects might itself rely on asynchronous operations. So, if I have a diamond dependency (A deps B and C, B and C dep D), I will non-deterministically, based on the creation times of B and C, create either one or two instances of D. I rely upon a mutex, keyed upon the dependency being created, to ensure that this does not happen.
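(The framework itself isn't shown here, but a rough asyncio sketch of the keyed-mutex idea being described looks like this: one lock per dependency key, with a re-check after acquiring, so concurrent resolutions of the same dependency build it exactly once.)

    import asyncio

    class Container:
        def __init__(self):
            self._instances = {}
            self._locks = {}                        # one asyncio.Lock per dependency key

        async def resolve(self, key, factory):
            if key in self._instances:              # fast path: already built
                return self._instances[key]
            lock = self._locks.setdefault(key, asyncio.Lock())
            async with lock:
                if key not in self._instances:      # re-check: another task may have won the race
                    self._instances[key] = await factory()
                return self._instances[key]

If B and C both resolve D while it is still being created, only one factory() call runs; without the lock, both would.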
I would also submit that perhaps you should adopt a principle of charity and think real hard about whether your priors are correct before you start talking about what she "fails" to see. Rachel is one of those people who has Been Around and while I also have Been Around, I understand that Rachel has Been Around More and I probably should be listening more than I should be smarming at her.
>Parallelism does not imply deadlocking, concurrency implies deadlocking, and both NodeJS and Python are concurrent runtime environments. And I can guarantee you that, with a little skull sweat, you can write a deadlock in NodeJS or Python. It is very easy. If you need some help, here's a trivial example: https://codesandbox.io/s/2wxvp
Ok technically you're right. I am completely wrong when I say it can NEVER happen.
But let's be real here, you introduced DEADLOCKING deliberately by introducing LOCKS and by doing context switching at weird places to make it happen. When nodejs came out one of the selling points was the lack of deadlocks and locks.
Case in point: there are no lock libraries in standard NodeJS.
Think about it, why is a LOCK needed here? Let's say you didn't have locks AT all. Wherever the heck you are the current Node task technically has what is equivalent to a LOCK on everything. Why? because all node instructions are atomic and single threaded. This is what replaces LOCKS in nodejs. Your code example is just strange. The only place where your example is relevant is if there was another process.
>But as a concrete example of where locking is important, consider that I've recently built a dependency injection framework in NodeJS--tried to use others' first, but my situation isn't covered by existing ones--
Probably because, again, nobody really programs using DI in node, let alone context switching and adding made-up locks in the middle of all these injections and constructions. Whatever you're doing is probably very unique or (maybe, I don't know your specific situation) a sign of over-engineering. DI is a very bad pattern and is one of the primary sources of technical debt in code (especially when it's over 2 layers deep and in a diamond configuration)... but that's another topic. Anyway...
What is the point of "acquiring" a lock if in node I have a "LOCK" on everything? It makes no sense; whatever it is you're doing, I am almost positive that there is a simpler way of doing it. Either way the dependency chain makes it obvious which needs to be created first and what can be created concurrently. The code below should produce what you're looking for without locks and with equivalent concurrency, which is one of the main selling points of single threaded async.
    const [b, c] = await Promise.all([B(), D().then(d => C(d))]);
    const a = await A(b, c);
B is evaluated with D async, C is kickstarted after D, with B still being evaluated async. All of this blocks until both B and C are complete then A evaluates. Whatever the heck you're doing with locks, things should happen in the same order as the dependency chain in both my code and one with locks. There's really no other order these things can be evaluated. I would even argue that my code is indeed the canonical way to handle your diamond problem in node, no lock code needed as expressed by the standard node library.
Think about it, node includes high level functions for http but none for locks which are an even lower level concept than http. It must mean you aren't supposed to use locks in Node.
I will say you are technically right in the fact that a deadlock CAN happen. I was wrong in saying it can NEVER happen, but you have to realize that I have a point here. Your example really goes very, very far out of its way to pull it off.
>I would also submit that perhaps you should adopt a principle of charity and think real hard about whether your priors are correct before you start talking about what she "fails" to see. Rachel is one of those people who has Been Around and while I also have Been Around, I understand that Rachel has Been Around More and I probably should be listening more than I should be smarming at her.
I was not smarming her, whatever smarming means, I am disagreeing with her just like you are disagreeing with me. There is NOTHING wrong with disagreeing with anybody. What is wrong is when you are proven wrong and you don't accept it. I accept that my statement of a deadlock NEVER happening in nodejs is categorically wrong.
"Being around" does not entitle you to anything. I hate it when people say this, nothing personal. Do you even know how long I've been around? Additionally, the overall main point of my post still stands, which you didn't even really address. I don't think Rachel gets the point of green threads. I think we can both agree I've made a strong point and maybe you should use your own charity principles on me.
NodeJS also doesn’t have a function to convert camel-case to PascalCase, should you not do that too because it’s not in the stdlib?
I’m going to be honest: you have not only not made a good point, you've gone out of your way to actively ignore that problems around concurrency regularly require one to use locks even in the absence of parallelism and have since long before multicore computers, and you're being weirdly hot-under-the-T-shirt besides.
Yeah well your post was rude and condescending. What did you expect with that attitude? Sure I'm angry, but there's nothing "weird" in my reaction given your rudeness.
>NodeJS also doesn’t have a function to convert camel-case to PascalCase, should you not do that too because it’s not in the stdlib?
This is entirely different from a highly concurrent framework not containing lock primitives. A critical primitive is missing. It's like a math library missing the addition operator.
>you've actively ignored that problems around concurrency regularly require one to use locks even in the absence of parallelism and have since long before multicore computers
We're not talking about multicore/singlecore stuff. We're talking about NodeJS and Python Async Await and standard usage patterns.
There are other patterns that need locks but those are typically reserved for programming things like databases... something that a typical web programmer who writes NodeJS or Python doesn't deal with as web servers follow a stateless pattern that considers the usage of global state as bad practice.
> We're not talking about multicore/singlecore stuff.
If you write a python or nodejs handler, stateless or not, that does two subsequent async operations involving changes on shared resources, such as a database table, you need locks, because another request may come in while the first is in wait.
Perhaps you're trying to say that this is irrelevant when you allow only one request at a time, but that's extremely limited and not the scenario under discussion.
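(A runnable sketch of exactly that failure mode; the "database" here is just a dict plus a simulated network delay, but the interleaving is the same one you get with a real driver.)

    import asyncio

    counts = {"page": 0}                   # stand-in for a shared table

    async def fetch_count(key):
        await asyncio.sleep(0.01)          # simulated round trip
        return counts[key]

    async def set_count(key, value):
        await asyncio.sleep(0.01)
        counts[key] = value

    async def handle_request(key):
        value = await fetch_count(key)     # another request can be scheduled during this await...
        await set_count(key, value + 1)    # ...so two handlers can both read 0 and both write 1

    async def main():
        await asyncio.gather(handle_request("page"), handle_request("page"))
        print(counts["page"])              # prints 1, not 2: one increment was lost

    asyncio.run(main())

Either a lock around the read-modify-write or pushing the arithmetic into the database itself (an atomic UPDATE) closes the window.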
Nah, I'm saying you shouldn't handle it in the web application itself, whether Python async/await or NodeJS.
You have to deal with deadlocks and race condition stuff in operations to the DB or any shared mutable state. That is obvious, but it is an external issue, because web developers deal with shared mutable state as an external service that lives outside of python or nodejs code.
I mean, if you count the transaction string or ORM as part of dealing with locks within your web framework, then sure, I guess you're right? The locks on the DB are DB primitives though, and not part of the web framework or language, so I would argue that it's different. I guess reordering updates to happen on primitives in the same order could count as a web application change, but that's kind of weak, as you're not really addressing the point:
NodeJS doesn't have locks, because you don't need locks in NodeJS and a deadlock should not occur unless you deliberately induce it.
The overall argument is that NodeJS and python async/await as frameworks are not designed for that kind of shared mutable state, hence the lack of locks in the nodejs std.
Also, never did I say a web developer doesn't need to understand or deal with concurrency. This is not true and I never claimed otherwise.
Additionally, thank you for being respectful. (Take note, eropple)
This is a really important point. Exporting your locks to Postgres means neither that they stop existing nor that you can't wedge yourself if the code isn't written by clever programmers who better understand concurrency.
Yes it is an important point. Postgresql does not stop deadlocks or race conditions from happening. You deal with those in Postgresql.
But this isn't the topic of the conversation, is it? The topic is locks and deadlocks in Python asyncio and NodeJS, so it's ultimately irrelevant to your initial example of the amateur diamond dependency injection, where normally no deadlock should be occurring regardless.
[I edited my post before his reply. Sorry for the confusion.]
Exporting your race conditions and washing your hands of them because the lock mechanism lives on the other end of a network socket rather than in your process space does not even rise to the level of “mere semantics”.
If you allow two NodeJS fibers to acquire remote locks—Redis redlocks, whatever—out of order by way of making asynchronous requests to it (noted only because you have a curious grip on that as being distinctive or meaningful here), you’ve still deadlocked and it is for all meaningful distinctions a deadlock of your processes (N >= 1). I state this only for completeness; there is no magic border at the edge of your process in which no, no, locks do not happen here. Locks control concurrency. When the problem set requires them, you use them.
I do not understand the spam of capital letters or the weird aggression. It’s like arguing about the coefficient of friction. The thing speaks for itself.
>I genuinely do not understand the spam of capital letters or the weird aggression about a trivial reality.
Then you better get with the program. Talking to people like the way you did won't make you any friends and will gain you many enemies. Don't worry though, I'm not that pissed off, just slightly miffed at your attitude. Also I like to use capital letters for emphasis. I guess you had a problem with that and decided to make it personal. Just a tip: don't act this way in real life, when you're older you'll understand.
>If you use Redlock to make a distributed lock over a Redis cluster and you allow two NodeJS fibers to acquire resources locks out of order by way of making asynchronous requests to it (noted only because you have a curious grip on that as being distinctive or meaningful here), you’ve still deadlocked and it is for all meaningful distinctions a deadlock of your processes (N >= 1).
Yeah because you're replacing your boolean in the earlier example with an isomorphic value. Either use a global js variable or a global value from redis. Same story. Nothing has changed from the locks you invented earlier.
Let me repeat my point. You shouldn't ever need to do the above in NodeJS, because the area where asyncio in python and NodeJS operates is stateless web applications. That's why NodeJS doesn't have locks. You have to go out of your way to make it happen.
>There is no magic border at the edge of your process in which no, no, locks do not happen here. Locks control concurrency. When the problem set requires them, you use them.
And your point is? I don't understand your point. Clearly nothing I said was to the contrary.
Let's say your problem set is writing a database. Then locks makes sense. Does NodeJS make sense for this problem set? No. Do Locks make sense for NodeJS? No. Global mutable state is offloaded to external services and that is where the locks live. This is the trivial reality.
Let's stay on topic with reality. In what universe does your diamond dependency need a dependency injection framework with locks in nodejs? If you need locks your fibers are sharing global state and you've built it wrong.
    const [b, c] = await Promise.all([B(), D().then(d => C(d))]);
    const a = await A(b, c);
That you conflate “async IO” and promises may be why you’re in this hole in the first place.
Async IO uses promises to abstract out its select (or moral equivalent), but promises are not async IO. A weirdly prescriptive attempt at dictating what these are "for" doesn't do much to obscure the thing; they speak for themselves.
I'm still really confused why a callback-hell topological sort and process would be somehow better than a cache, a lock, and a breadth-first search (not least because it's easier to follow and is also, anecdotally, faster), but clearly these mysteries are just plain beyond my pay grade.
>That you conflate “async IO” and promises may be why you’re in this hole in the first place.
It's all just single threaded cooperative concurrency with context switching at IO. The isomorphic APIs on top of this, whether callbacks, async/await or promises, are irrelevant to the topic at hand.
>I'm still really confused why a callback-hell topological sort and process would be somehow better than a cache, a lock, and a breadth-first search (not least because it's easier to follow and is also, anecdotally, faster), but clearly these mysteries are just plain beyond my pay grade.
I'm confused as to what the hell you're talking about. "Callback-hell topological sort and process"? Wtf is that? Where were callbacks used in my example? Where was sort used?
Do you not understand that the dependencies determine the order of construction? That's it; it doesn't matter what technique you use, the overall steps are the same. There is no bfs or callback hell going on. You manually instantiate the dependencies and choose what's async and what is sync. No need for locks.
Are you talking about something that takes a dependency graph and constructs the instance from that? If you want to do that, your algorithm is incorrect. You need post-order DFS, BFS won't work, but both BFS and DFS are O(N), so in terms of traversal over dependencies it's all the same.
    import asyncio
    from typing import Any, Awaitable, Callable, List, Optional

    class Node:
        def __init__(self, createAnObject: Callable[..., Awaitable[Any]], dependencies: List["Node"]):
            self.dependencies = dependencies
            self.constructor = createAnObject

    async def constructObjectFromDependencyTree(root: Optional[Node]) -> Any:
        if root is None:
            return None
        # Post-order: build all dependencies concurrently, then construct the root from them.
        instantiatedDeps = await asyncio.gather(
            *[constructObjectFromDependencyTree(node) for node in root.dependencies]
        )
        return await root.constructor(*[dep for dep in instantiatedDeps if dep is not None])
The algorithm is bounded by O(N), where N is the total number of dependencies.
If you want to construct an object with a total of N dependencies, then no matter how you do it, the operation will ALSO be bounded by O(N). In terms of speed it's all the same, but the above is how you're supposed to do it.
The above algorithm should give you what you want while providing concurrency and sequential execution exactly where needed. No callback hell, no promises, no sorting, no external shared state and no locks.
Regardless, if you're building Objects that necessitate such algorithms you are creating technical debt by creating things with long chains of dependencies. You should not be using your primitives to create large dependency trees; instead you should be composing your primitives into pipelines.
Additionally, relegating so much complexity to runtime is a code smell. If there aren't too many permutations bring it down to a manual construction with your code rather than an algorithm/framework.
Man, don’t do this here. You’re angry because I provided cites. You’re angry because you were overconfident, sweepingly general, wrong, and (worse) trivially proven wrong so you’re trying to well-actually out of it by being angry. I hasten to note that I am not being judgmental; I have been there. The best thing you can do is take the L and learn from it, dude. Everybody goofs sometimes, but you recall the best advice to take when you find yourself in a hole, yeah?
And... "typical web programmers"? I don't know how relevant that is except in the light that typical web programmers use the tools built by folks who understand concurrency well enough to build abstractions that make "you don't need to think about concurrency" mostly safe even though they're completely wrong. Somebody's gotta be doing that for you.
>Man, don’t do this here. You’re angry because I provided cites.
No I'm angry because of your attitude. What the hell is a cite?
I'll quote the drivel coming out of your mouth. Most of it is personal and doesn't even refer to the topic at hand:
>Quit while you're behind, my dude.
>This is dangerously wrong and I would suggest that you reconsider the steps that got you to this understanding because something really important has been lost.
>I would also submit that perhaps you should adopt a principle of charity and think real hard about whether your priors are correct before you start talking about what she "fails" to see. Rachel is one of those people who has Been Around and while I also have Been Around, I understand that Rachel has Been Around More and I probably should be listening more than I should be smarming at her.
>You’re angry because you were overconfident, sweepingly general, wrong, and (worse) trivially proven wrong so you’re trying to well-actually out of it by being angry.
>I hasten to note that I am not being judgmental; I have been there. The best thing you can do is take the L and learn from it, dude. Everybody goofs sometimes, but you recall the best advice to take when you find yourself in a hole, yeah?
Nothing I quoted was evidence that I'm wrong, but everything I quoted had a condescending attitude and was very personal. This is not how you engage in respectful conversation. It wasn't my plan to "do this here." I don't really give a shit, I'm just saying: if you mouth off with that garbage, of course the person on the other side is going to be a little pissed off. What the hell did you expect? My reaction isn't "weird" like you said earlier, it's a normal reaction to someone who is Rude. This was the small point I was trying to make, which you expanded on with a very personal remark.
That being said I'm not that angry, just a little, this is the internet after all. I don't care.
Also did you not see my first post? Did you not see me admit to being wrong on something? I have no issue with doing that. This is not a problem for me. Ever. If I'm still arguing with you it means I disagree with you. No actually, scratch that, a better way of saying it to you specifically is that it's not that I disagree with you, it's that you're flat out completely wrong and I'm right. See that? Same tack you have.
>And... "typical web programmers"? I don't know how relevant that is except in the light that typical web programmers use the tools built by folks who understand concurrency well enough to build abstractions that make "you don't need to think about concurrency" mostly safe even though they're completely wrong. Somebody's gotta be doing that for you.
Web programmers need to understand concurrency. In external services like databases you still need to deal with locks, race conditions and deadlocks. These don't disappear, and I never said they did. The topic is Python and NodeJS and async/await, and that is the context I am referring to... please stay on topic.
I never said “you don’t need to think about concurrency“ <--- this right here was made up and a total lie.
The rest of that paragraph is incoherent. Somebody is doing <what> for me?
On a side note, you haven't given me any concrete examples of when you need to realistically use your made up locks in nodejs. Your last diamond dependency scenario and code sample made no sense from a practical standpoint.
It depends on what kind of performance you need. For CPU intensive tasks I would agree with you. But for network I/O intensive tasks, even though Python is slow it's still more than fast enough to keep up with a large request volume since network I/O latency is so much longer than CPU/memory latency.
> It's around this time that you discover that people have been doing naughty, nasty things, like causing work to occur at "import time".
Is this something people actually have problems with in practice? I did lots of python and ran into it once. It was quickly fixed after a raised issue. I feel like non-toy development just doesn't experience it.
But maybe that's my environment bubble only. Do people who do serious python development actually have problem with this?
Python pretends to be a nice homogeneous "everything is at run time" language, but it is all a big lie and there aren't big flashing letters saying "you really probably shouldn't do this" when you start solving a problem in a certain way. For example, it is almost certainly best practice to _never_ call a function, class method, or static method inside a module that is going to be imported, and certainly never instantiate a class. However, there are certain patterns that almost necessitate it if you don't want to write loads of boiler plate or deal with the performance overhead of metaclasses. There are also a bunch of nice hacks like using `object()` at the top level as an instance distinct from everything else, but I'm sure there is a way that `MYTYPE = object()` will come back to absolutely ruin your day if you have to compare two `MYTYPE` instances in two different dicts derived from a parent process and a subprocess.
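(For anyone who hasn't seen it, the sentinel trick being referred to is roughly this; `lookup` is a hypothetical helper.)

    _MISSING = object()          # created once at import time, compared by identity

    def lookup(d, key):
        val = d.get(key, _MISSING)
        if val is _MISSING:
            raise KeyError(key)
        return val

It works because there is exactly one _MISSING object per process; as soon as that object crosses a process boundary via serialisation, the copy is a different object and every `is` check quietly fails.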
I have personally made this mistake on two or three occasions where I conflated file/module behavior with class behavior because I wanted a python file to act like it was a bit more declarative. Unfortunately this leads to a world of eternal pain. You can work around it, but you should have made everything a python class and pretended like the files/modules don't exist or at least have staggeringly different semantics hiding behind that innocent little `.` operator. Python simply cannot support the desire to solve a problem in a certain way because of the structure of the problem and forces you into using its happy path patterns if you want it to work in slightly different run time contexts.
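(And a minimal sketch of the import-time-work trap itself; the file and environment variable names here are made up.)

    # config.py
    import json, os

    # Runs the moment anything imports this module: in every worker process,
    # during test collection, in any tool that merely introspects it.
    SETTINGS = json.load(open(os.environ.get("APP_CONFIG", "config.json")))

    # Deferring the work keeps importing the module cheap and side-effect free.
    _settings = None

    def get_settings():
        global _settings
        if _settings is None:
            with open(os.environ.get("APP_CONFIG", "config.json")) as f:
                _settings = json.load(f)
        return _settings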
Two relevant posts from instagram engineering on the subject which suggest that best practices for avoiding these kinds of issues are non-obvious and easy to miss.
> but I'm sure there is a way that `MYTYPE = object()` will come back to absolutely ruin your day if you have to compare two `MYTYPE` instances in two different dicts derived from a parent process and a subprocess.
I don't see how this can be an issue with imports. You have two cases: you imported the module with MYTYPE before or after fork. Before: they will compare fine (unless IPC is involved). After: you transferred the dict with MYTYPE through some kind of serialisation or shmem, and you cannot compare identities; that's a property of IPC rather than something to do with python modules.
Anyway, what I meant in the previous comment was the risk matrix view. Sure, this can lead to bugs scoring various points in severity, but does it score high on likelihood?
Somewhat related to the RPC argument, but HTTP is a total joke, and therefore so is REST.
In adtech you send 204 responses a lot. The body is empty, just the headers. Headers like 'Server' and 'Date'. Apache won't let you turn Server off... 'security through obscurity' or some nonsense. Why do I need to tell an upstream server my time 50k times per second?
Zip it all up! Nope, that only applies to the body which is already empty.
Egressing traffic! A cloud provider's dream. I wonder what percentage of their revenue comes from clients sending the Date header.