I would never in a million years look at a performance problem and, given these two options:
1. Write the critical section in a faster language, in serial (i.e., rather than a dynamic interpreted script, maybe compiled bytecode, or maybe even native machine code).
2. Write the critical section multithreaded in the scripting language.
I would never think to use #2 first. I would always just move my CPU-bound code into a tiny C++ library and only worry about threading as a matter of last resort. You get so many huge, leaky problems from going multithreaded (even if you got rid of the GIL, you would still be looking at variable synchronization, atomics, and cache coherency) that it is never worth it over just writing the same code section in native code and calling it through a native-call API, even the default way of writing an extension against Python.h.
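For illustration, here is a minimal sketch of that approach using ctypes; the library name libhot.so and the function sum_squares are made-up stand-ins for whatever your real critical section happens to be:

    import ctypes

    # Hypothetical: assumes you have compiled a tiny shared library exposing
    #   double sum_squares(const double *values, size_t n);
    lib = ctypes.CDLL("./libhot.so")
    lib.sum_squares.argtypes = [ctypes.POINTER(ctypes.c_double), ctypes.c_size_t]
    lib.sum_squares.restype = ctypes.c_double

    def sum_squares(values):
        # Marshal a Python list into a C double array and hand it to native code.
        arr = (ctypes.c_double * len(values))(*values)
        return lib.sum_squares(arr, len(values))

The Python side stays serial and simple; only the hot loop lives in native code.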
Your argument is essentially that one can write slow things in Python and fast things in C, and that this solves the majority of problems (let's even be generous and say 98% of them fit neatly into those two categories). The trouble is that "the number of programs" is a large number, 2% of a large number is still large, and that leaves important classes of programs without a good solution.
One class of programs left out in the cold is the network server. Now a network server must respond to a large number of requests. From a basic software engineering perspective, a 16-core machine should be responding to AT LEAST 16 requests at once (many more if some of the requests are IO-bound). So the network server needs some kind of parallel processing (whether threads, subprocesses, or whatever you want to suggest). Under your philosophy, programs that need threading (thus all network servers) should not be written in Python; somebody should be stepping down to C instead. While it is probably true that a very small minority of network servers should not be written in Python, the broader claim is absurd: you should be able to write reasonably performing network servers in Python with relative ease. It is, after all, a server-side language; "writing a server" should be very high on the list of "things you can do".
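For a rough sense of what that looks like in pure Python (the port and the echo behavior are arbitrary placeholders), the standard library's socketserver already gives you a thread-per-connection server in a few lines; IO-bound requests overlap fine, it is only CPU-bound work that the GIL serializes:

    import socketserver

    class EchoHandler(socketserver.StreamRequestHandler):
        def handle(self):
            # Each connection gets its own thread: IO-bound requests overlap,
            # but CPU-bound work in the handlers still contends on the GIL.
            for line in self.rfile:
                self.wfile.write(line)

    if __name__ == "__main__":
        # 0.0.0.0:8000 is an arbitrary choice for the example.
        server = socketserver.ThreadingTCPServer(("0.0.0.0", 8000), EchoHandler)
        server.serve_forever()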
Now more broadly, the existence of greenlet, Twisted, and gevent, and their popularity (we're talking top-100 packages here), speak to the fact that there are a LOT of Python programmers who have threading-related requirements. Are they on crack? Now mix in the new standard-library stuff like asyncio (3.4), and concurrency is clearly an important enough issue to get major attention from the core committers. Are they on crack?
Now you might operate in a world where every time you need threads it is an isolated case and it's fairly simple to drop down to C. But there are a lot of people (in absolute terms; I don't know if they are in the majority) for whom, when they want threads, the right solution is to use threads.
The thing I hear from the core committers whenever the GIL comes up is "if we worked on the GIL, we would be taking lots of time away from more important things." But look at the things they work on instead--unicode, iterators, ordered dictionaries, argparse, etc.--plenty of people in this thread find those insufficiently motivating to even upgrade. Are ordered dictionaries really more important than GIL work? To me, the answer is clear: I would rather have some progress on the GIL problem than every single py3k feature combined.
So here's some perspective on the concurrency problem. I write network servers for a living - most are in C++ or Java, but I would love to be able to use Python.
There are a number of high-level approaches you can take to concurrency. Shared-nothing processes. Threads and locks. Callback-based events. Coroutines. Dependency graphs and data-flow programming.
They all suck, and they all suck in different ways. Processes have large context-switching overheads, take up a lot of memory, and require that you serialize any data you want to communicate between them. Threads and locks make it very easy to corrupt memory if you forget a lock, very easy to deadlock if you don't have a clear convention for what order to take locks in, and end up being non-composable when you have libraries written under different such conventions. Callbacks require that you give up the use of "semicolon" (or "newline") as a statement terminator; instead you have to break up your program into lots of little functions wherever you make a call that might block, and you have to manually manage state shared between those callbacks. Coroutines require explicit yield points in your code, and open up the possibility of a poorly-behaving coroutine monopolizing the CPU. Dependency graphs also require manual state management and lots of little functions, and often a lot of boilerplate to specify the graph.
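To make the callback point concrete, here is a sketch against a made-up fetch(url, on_done) API: one logical "fetch a, then fetch b, then report" operation gets chopped into little functions with manually threaded state:

    # Hypothetical callback-style API: fetch(url, on_done) calls on_done(body)
    # at some later point. The single logical operation becomes three little
    # functions plus hand-carried shared state.

    def fetch_both(fetch, report):
        state = {}

        def on_a(body):
            state["a"] = body
            fetch("http://example.com/b", on_b)

        def on_b(body):
            state["b"] = body
            report(state["a"] + state["b"])

        fetch("http://example.com/a", on_a)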
Python has a "There should be one - and only one - obvious way to do things" philosophy, and with asyncio, Guido seems to have decided that the obvious way for Python is going to be coroutines. It's an interesting choice, and he's not alone in that - I recall Knuth writing that coroutines were an under-studied and under-utilized language concept that had many desirable properties. Coroutines free you from having to worry about your global mutable state potentially changing on every single expression, and they also give you the state-management and composition benefits that explicit callbacks lack.
There are parts of them that suck - like having to explicitly add "yield from" at every blocking suspension point, and having to propagate that "yield from" through every caller once a previously synchronous call can suspend. But having written a bunch of threaded Java server and (desktop) GUI code, a lot of callback-based JavaScript, and a lot of C++ in both callback and dependency-graph style, all of those models suck a whole lot as well.
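For reference, this is roughly what that looks like in the 3.4-era asyncio style being discussed (the host, port, and record format are invented for the example); note how "yield from" has to appear at every level between the real suspension point and the event loop:

    import asyncio

    @asyncio.coroutine
    def read_record(reader):
        # The actual suspension point has to be marked with "yield from".
        line = yield from reader.readline()
        return line.decode().strip()

    @asyncio.coroutine
    def handle(reader, writer):
        # Because read_record can suspend, every caller on the way up must
        # itself become a coroutine and use "yield from" too.
        record = yield from read_record(reader)
        writer.write(("got: " + record + "\n").encode())
        yield from writer.drain()
        writer.close()

    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.start_server(handle, "127.0.0.1", 8001))
    loop.run_forever()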
While your approach works most of the time, there are many cases where there isn't such a thing as a "critical section" or a "hot loop". Performance-relevant sections can be spread across hundreds of places, and reimplementing all of them can be less efficient than a complete rewrite. Guessing the performance characteristics of complex programs (e.g. ones that are not purely computational in nature) is hard, and therefore so is the language choice.
Well sure, I wouldn't go with #2 either, but the GIL blocks you from doing fast processing in your C thread while the rest of the program continues, because it locks the entire process. So whether you write native code or not, you're stuck on one processor.
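A quick way to see that serializing effect at the pure-Python level (a rough benchmark sketch; exact numbers will vary by machine) is to time the same CPU-bound function run twice serially versus in two threads; under CPython's GIL the threaded version shows essentially no speedup:

    import threading
    import time

    def burn(n=5000000):
        # Pure-Python CPU-bound loop; it holds the GIL the whole time.
        total = 0
        for i in range(n):
            total += i * i
        return total

    start = time.perf_counter()
    burn(); burn()
    print("serial : %.2fs" % (time.perf_counter() - start))

    start = time.perf_counter()
    threads = [threading.Thread(target=burn) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("threads: %.2fs" % (time.perf_counter() - start))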