"Writing directly to the screen" (by which I assume you mean writing pixels one by one to the framebuffer) is a bad idea for modern graphics hardware. It was fine on the 486, but nowadays you need the ability to do global optimizations for good 2D (or 3D) graphics performance. Ironically, the Web stack is much better positioned to do this than, say, Win32, because of the declarative nature of CSS.
Besides, as some downthread have pointed out, you didn't "write directly to the screen" in Win32. You went through GDI.
What happens if you try to present an immediate mode API for UIs is the status quo with APIs like Skia-GL. You frequently end up switching shaders and issuing a new draw call every time you draw a rectangle, and you draw strictly in back to front order so you completely lose your Z-buffer.
Imagine if games worked like that: drawing in back to front order and switching shaders every time you drew a triangle. Your performance would be terrible. But that's the API that these '90s style UI libraries force you into. Nobody thought that state changes would be expensive or that Z-buffers could exist when Win32, GTK, etc. were designed. They strictly drew using the painter's algorithm, and they used highly specialized routines for every little widget piece because minimizing memory bandwidth was way more important than avoiding state changes. But the hardware landscape is different now. That requires a different approach instead of blindly copying what the "native" APIs did in 1995.
"What happens if you try to present an immediate mode API for UIs is the status quo with APIs like Skia-GL."
I don't know what Skia-GL is, but in games, the more experienced people tend to use immediate mode for UIs. (This trend has a name, "IMGUI". I say "more experienced people" because less experienced people will often just copy some API that already exists, and these tend to be retained-mode because that is how UIs are usually done.) UIs are tremendously less painful when done as IMGUI, and they are also faster; at least, this is my experience. [There is another case where people use retained-mode stuff: when they are using some system where content people build a UI in Flash or something and they want to repro that in the game engine, so the UI is fundamentally retained-mode in nature. I am not a super-big fan of this approach but it does happen.]
"and you draw strictly in back to front order so you completely lose your Z-buffer"
That sounds more like a limitation of the way the library is programmed than anything to do with retained or immediate mode. There may also be some confusion about causation here. (Keep in mind that Z buffers aren't useful in the regular way if translucency is happening, so if a UI system wants to support translucency in the general case, that alone is a reason why it might go painter's algorithm, regardless of whether it's retained or immediate).
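A common way to reconcile a depth buffer with translucency is to split the draw list: opaque items are drawn front to back with depth testing, then translucent items are blended back to front on top. A minimal Python sketch of just the ordering step, assuming each command is a hypothetical `(name, z, alpha)` tuple where larger `z` means closer to the viewer (these names are illustrative, not from any particular library):

```python
def order_for_depth_buffer(cmds):
    """Return cmds ordered for a depth-tested renderer.

    cmds: list of (name, z, alpha); larger z is closer to the viewer.
    """
    opaque = [c for c in cmds if c[2] >= 1.0]
    translucent = [c for c in cmds if c[2] < 1.0]
    # Opaque: front to back, so the depth test can reject hidden pixels early.
    opaque.sort(key=lambda c: -c[1])
    # Translucent: back to front (painter's algorithm), blended over the opaque pass.
    translucent.sort(key=lambda c: c[1])
    return opaque + translucent


cmds = [
    ("background", 0, 1.0),
    ("panel", 1, 1.0),
    ("shadow", 2, 0.5),
    ("button", 3, 1.0),
    ("tooltip", 4, 0.8),
]
order = [name for name, _, _ in order_for_depth_buffer(cmds)]
```

Only the translucent subset falls back to painter's-algorithm ordering here; whether that split pays off depends on how much of the UI is actually opaque.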
"But that's the API that these '90s style UI libraries force you into."
90s-style UI libraries are stuff like Motif and Xlib and MFC ... all retained mode!
I don't agree that an IMGUI style forces you into any more shader switches than you already would have. It just requires you to be motivated to avoid shader switches. You could say that it mildly or moderately encourages you to have more shader switches, and I would not necessarily disagree. That said, UI rendering is usually such a light workload compared to general game rendering that we don't worry too much about its efficiency, which is another reason why game people are so flabbergasted by the modern slowness of 2D applications: they are doing almost no work in principle.
Back to the retained versus IMGUI point ... If anything, there is great potential for the retained mode version to be slower, since it will usually be navigating a tree of cache-unfriendly heap-allocated nodes many times in order to draw stuff, whereas the IMGUI version is generating data as needed so it is much easier to avoid such CPU-bottlenecking operations.
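To make the IMGUI style concrete, here is a minimal Python sketch. The `Ui`, `Input`, and `frame` names are hypothetical and do not come from any real library. The point is that there is no widget tree: application code runs every frame, each widget call emits draw data and reports interaction, and which widgets exist is decided by ordinary control flow.

```python
from dataclasses import dataclass, field


@dataclass
class Input:
    mouse_x: int = 0
    mouse_y: int = 0
    clicked: bool = False


@dataclass
class Ui:
    inp: Input
    draw_list: list = field(default_factory=list)

    def button(self, label, x, y, w=80, h=20):
        # Hit-test against this frame's input and emit draw data in one call.
        hot = x <= self.inp.mouse_x < x + w and y <= self.inp.mouse_y < y + h
        self.draw_list.append(("button", label, x, y, w, h, hot))
        return hot and self.inp.clicked


def frame(ui, state):
    # The application owns all state; the UI is regenerated from it each frame.
    if ui.button("+1", 10, 10):
        state["count"] += 1
    # This button only exists while count > 0; nothing is created or destroyed.
    if state["count"] > 0 and ui.button("reset", 10, 40):
        state["count"] = 0


state = {"count": 0}
ui = Ui(Input(mouse_x=15, mouse_y=15, clicked=True))
frame(ui, state)
```

A retained-mode equivalent would construct button objects once, register callbacks on them, and add or remove the reset button from a tree when the count changes.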
Here is a (somewhat old) video explaining some of the motivations behind structuring things as IMGUI: https://www.youtube.com/watch?v=Z1qyvQsjK5Y
pcwalton seems to be presuming that part of the contract of an "immediate mode API" is that, like the old-school ones, it actually draws to the framebuffer by the end of each call.
Whereas you are talking about modern immediate mode APIs, where the calls just add things to an internal data structure that is all drawn at once, avoiding unnecessary shader switches etc. IIRC this is how Conrod (an immediate-mode-style GUI library for Rust) and https://github.com/ocornut/imgui work, although with varying levels of caching.
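A rough Python sketch of that record-then-flush idea follows. `FrameRecorder`, `DrawCmd`, and the shader names are made up for illustration; this is not how Conrod or Dear ImGui are actually implemented. Calls only append to a list, and the end-of-frame flush groups commands by shader so each shader is bound once.

```python
from dataclasses import dataclass, field
from itertools import groupby


@dataclass
class DrawCmd:
    shader: str   # hypothetical shader name
    rect: tuple   # (x, y, w, h)


@dataclass
class FrameRecorder:
    cmds: list = field(default_factory=list)

    def rect(self, shader, x, y, w, h):
        # "Immediate" call: records the command, draws nothing yet.
        self.cmds.append(DrawCmd(shader, (x, y, w, h)))

    def flush(self):
        # sorted() is stable, so submission order is kept within each shader.
        ordered = sorted(self.cmds, key=lambda c: c.shader)
        batches = [
            (shader, [c.rect for c in group])
            for shader, group in groupby(ordered, key=lambda c: c.shader)
        ]
        self.cmds.clear()
        return batches


def shader_switches(cmds):
    # Number of times the bound shader changes if cmds are drawn in order.
    return sum(1 for a, b in zip(cmds, cmds[1:]) if a.shader != b.shader)


ui = FrameRecorder()
for i in range(3):
    ui.rect("solid", 0, i * 20, 100, 18)    # row background
    ui.rect("text", 4, i * 20 + 2, 92, 14)  # row label
naive = shader_switches(ui.cmds)  # switches if drawn in submission order
batches = ui.flush()              # one batch per shader
```

Reordering like this is only safe when the result does not depend on draw order, for example opaque geometry with a depth test or non-overlapping rectangles. Overlapping translucent items still need their relative order preserved.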
One point in favor of retained mode GUIs: I remember reading an argument that immediate mode is great for visually simple UIs, such as those in video games, but isn't as good for larger-scale graphical applications and custom widgets. For example, when rendering a large text box, list or table, you don't want to recalculate the layout every frame, so you need some widget-specific data structure that sticks around between frames. That's what retained mode APIs like Qt do for their widgets.
Sure you can do the calculations yourself for exactly which rows of a table are currently in view and render those and the scrollbar with an immediate mode API, but the promise of toolkits like Qt is that you don't have to write calculations and data structures for every table.
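For the simplest case, the "which rows are in view" calculation is only a few lines. A sketch in Python, assuming every row has the same pixel height (the function and parameter names are illustrative):

```python
def visible_rows(scroll_y, viewport_h, row_h, n_rows):
    """Range of row indices that overlap the viewport, for fixed-height rows."""
    first = max(0, scroll_y // row_h)
    # Ceiling division: include a row that is only partly visible at the bottom.
    last = min(n_rows, -(-(scroll_y + viewport_h) // row_h))
    return range(first, last)


# One million rows of 20 px each, 100 px viewport, scrolled to y = 250.
rows = visible_rows(scroll_y=250, viewport_h=100, row_h=20, n_rows=1_000_000)
# Only these rows need to be submitted to the UI this frame.
```

The hard part of the argument is everything this sketch leaves out: variable row heights, sorting, and keeping the scrollbar consistent as estimates change.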
Immediate mode GUI systems are allowed to keep state around between frames and the most-featureful ones do. The "immediate mode" is just about the API between the library and the user, not about what the library is allowed to do behind the scenes. The argument that retained-mode systems are inherently better at this doesn't hold water; it is kind of an orthogonal issue.
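As an illustration of state kept behind an immediate-mode call, here is a small Python sketch. `CachingUi` and `text_box` are hypothetical names, and real libraries differ in how they key and invalidate such caches. The caller passes the text every frame, but the library only redoes the wrapping when the text or width actually changed.

```python
import textwrap


class CachingUi:
    def __init__(self):
        self._cache = {}       # widget id -> (text, width, wrapped lines)
        self.layout_runs = 0   # counts how often layout was recomputed

    def text_box(self, widget_id, text, width):
        cached = self._cache.get(widget_id)
        if cached is None or cached[0] != text or cached[1] != width:
            # Inputs changed since the last frame: redo the (expensive) layout.
            self.layout_runs += 1
            lines = textwrap.wrap(text, width)
            self._cache[widget_id] = (text, width, lines)
        return self._cache[widget_id][2]


ui = CachingUi()
text = "word " * 50
for _ in range(3):                       # three frames, same inputs
    lines = ui.text_box("log", text, 40)
runs_after_three_frames = ui.layout_runs
```

The caller's code stays immediate-mode in shape; the persistence is an implementation detail keyed by the widget id. Comparing the full text each frame is itself linear in the worst case, so a real implementation might compare a version counter or hash instead.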
This works just as well/quickly as a retained mode API in almost all cases. There are some cases, like extremely long tables with varying row heights and sortable columns, where you need an efficient diff of the table contents, since recalculating layout and sorting every frame is inefficient. Retained mode APIs do this with methods to add and delete rows. It's possible to do with an immediate mode API, but to quickly detect differences in the rows passed in you need to use a functional persistent map data structure with log(n) symmetric diff. Or you can just have an API that is mostly immediate mode but has some kind of "TableLayout" struct that persists between frames and is modified by add and remove functions.
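A minimal Python sketch of the "mostly immediate mode plus a persistent struct" option. The `TableLayout` name comes from the comment above; its methods and the `(sort_key, row_id)` representation are assumptions made for illustration. The sorted order is maintained incrementally by add and remove calls, and each frame only asks for the window of rows it needs.

```python
import bisect


class TableLayout:
    """Persists between frames; keeps rows sorted without re-sorting each frame."""

    def __init__(self):
        self._keys = []  # sorted list of (sort_key, row_id)

    def add(self, row_id, sort_key):
        bisect.insort(self._keys, (sort_key, row_id))

    def remove(self, row_id, sort_key):
        i = bisect.bisect_left(self._keys, (sort_key, row_id))
        if i < len(self._keys) and self._keys[i] == (sort_key, row_id):
            del self._keys[i]

    def window(self, first, count):
        # Row ids to draw this frame, in sorted order.
        return [row_id for _, row_id in self._keys[first:first + count]]


table = TableLayout()
for row_id, key in [("a", 3), ("b", 1), ("c", 2)]:
    table.add(row_id, key)
top_two = table.window(0, 2)
```

With a plain Python list the search is O(log n) but the insert or delete still shifts elements, so it is O(n) per change. A balanced tree or skip list would be needed to make updates logarithmic at very large sizes.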
I'm curious what API you would use for implementing a table with varying row heights (that you only know upon rendering but can guess beforehand), sortable columns and millions of rows. I implemented this in an immediate mode GUI API a few months ago, and I did it with persistent maps and incremental computation in OCaml, incrementally maintaining a splay tree and a sorted order by symmetric diff of the input maps. This isn't as nice of an API in languages like C++ so I'm wondering if there's a better way.
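For the row-height part alone, one mutable structure that ports easily to C++ is a Fenwick (binary indexed) tree over per-row heights. To be clear, this is a different technique from the splay tree and persistent maps described above, and it does not address sorting. The sketch below is in Python; `RowHeights` and its methods are hypothetical names. Rows start at an estimated height, get corrected when they are actually rendered, and both "where does row i start" and "which row is at scroll offset y" are O(log n).

```python
class RowHeights:
    """Fenwick tree of row heights; assumes every height is positive."""

    def __init__(self, n, estimate):
        self.n = n
        self.h = [estimate] * n
        self.tree = [0] * (n + 1)
        for i in range(1, n + 1):        # O(n) build
            self.tree[i] += self.h[i - 1]
            parent = i + (i & -i)
            if parent <= n:
                self.tree[parent] += self.tree[i]

    def set_height(self, row, height):
        # Called once the real height is known, e.g. after rendering the row.
        delta = height - self.h[row]
        self.h[row] = height
        i = row + 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def top_of(self, row):
        # Sum of the heights of all rows before `row`.
        total, i = 0, row
        while i > 0:
            total += self.tree[i]
            i -= i & -i
        return total

    def row_at(self, y):
        # Row containing vertical offset y, clamped to the last row.
        pos, step = 0, 1 << self.n.bit_length()
        while step:
            nxt = pos + step
            if nxt <= self.n and self.tree[nxt] <= y:
                pos = nxt
                y -= self.tree[nxt]
            step >>= 1
        return min(pos, self.n - 1)


heights = RowHeights(10, estimate=20)
heights.set_height(1, 50)   # row 1 turned out taller than the guess
```

After the correction, row 1 spans offsets 20 to 70, so `heights.row_at(65)` is 1 and `heights.top_of(3)` is 90. Inserting or deleting rows in the middle is not supported by this structure, which is where a balanced or splay tree earns its keep.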
In general my policy is that when things get really complicated or specialized, the application knows a lot more about its use case than some trying-to-be-general API does, so it makes sense for the application to do most of the work of dealing with the row heights or whatever. (It's hard for me to answer more concretely since it depends on exactly what is being implemented, which I don't know.)