* audio developers recognizing the parallels between the way audio and video hardware works, and taking as many lessons from the video world as possible. The result is that audio data flow (ignoring the monstrosity of USB audio) is now generally much better than it used to be, and in many cases is conceptually very clean. In addition, less and less processing is done by audio hardware, and more and more by the CPU.
* video developers not understanding much about audio, and failing to notice the parallels between the two data handling processes. Rather than take any lessons from the audio side of things, stuff has just become more and more complex and more and more distant from what is actually happening in hardware. In addition, more and more processing is done by video hardware, and less and less by the CPU.
(one important difference: if you don't provide a new frame of video data, most humans will not notice the result; if you don't provide a new frame of audio data, every human will hear the result).
I feel as if both these worlds would really benefit from somehow having a full exchange about the high and low level aspects of the problems they both face, how they have been solved to date, how they might be solved in the future, and how the two are both very much alike, and really quite different.
I think the key difference, though, is that consumer video needs are probably around two orders of magnitude more computationally complex than audio. On consumer machines, almost no interesting real-time audio happens. It's basically just playback with maybe a little mixing and EQ. Your average music listener is not running a real-time software synthesizer on their computer. Gamers are actually probably the consumers with the most complex audio pipelines, because you're mixing a lot of sound sources in real time with low latency, reverb, and other spatial effects.
The only people doing real heavyweight real-time audio are music producers, and for that market it's a viable strategy for audio programmers to expect users to upgrade to beefier hardware.
With video, almost every computer user is doing really complex real-time rendering and compositing. A graphics programmer can't easily ask their userbase to get better hardware when the userbase is millions and millions of people.
Also, of course, generating 1470 samples of audio per frame (44100 sample rate, stereo, 60 FPS) is a hell of a lot easier than 2,073,600 pixels of video per frame.
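To spell out the arithmetic behind those numbers (the pixel count corresponds to 1080p):

```python
# Audio: samples needed per video frame at 44.1 kHz stereo, 60 FPS.
sample_rate = 44100
channels = 2
fps = 60

audio_samples_per_frame = sample_rate * channels // fps
print(audio_samples_per_frame)  # 1470

# Video: pixels per frame at 1920x1080.
pixels_per_frame = 1920 * 1080
print(pixels_per_frame)  # 2073600

# The video workload is over a thousand times larger per frame, element for element.
print(pixels_per_frame // audio_samples_per_frame)  # 1410
```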
I agree that audio pipelines are a lot simpler, but I think that's largely a luxury coming from having a much easier computational problem to solve and a target userbase more willing to put money into hardware to solve it.
There's a GDC talk out there about the complexities of audio in Overwatch, a competitive game. Not only do you have invaluable audio cues coming from 12 players at the same time, but also all of their combined abilities and general sound effects, like countdown timers and such.
One of the bigger problems talked about was that in a competitive game you can't just scale the loudness of an opponent's footsteps by the direct distance between them and you. You need to take into account level geometry, and so the game has to raycast in order to determine how loud any single sound effect should sound to every active player in the game.
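A minimal sketch of that idea, with `raycast_blocked` as a hypothetical stand-in for the engine's geometry query — the falloff curve and occlusion penalty here are arbitrary assumptions, not Overwatch's actual model:

```python
import math

REF_DISTANCE = 1.0        # attenuation reference distance in metres (assumed)
OCCLUSION_PENALTY = 0.25  # extra gain cut when geometry blocks the path (assumed)

def raycast_blocked(src, dst):
    """Hypothetical stand-in for the engine's geometry raycast.
    Returns True if level geometry blocks the straight line src -> dst."""
    return False  # no geometry in this toy example

def footstep_gain(src, listener):
    """Gain for a sound emitted at `src` as heard at `listener` (3D points)."""
    d = math.dist(src, listener)
    gain = REF_DISTANCE / max(d, REF_DISTANCE)  # inverse-distance falloff
    if raycast_blocked(src, listener):
        gain *= OCCLUSION_PENALTY  # muffle sounds behind walls
    return gain

# A footstep 10 m away with a clear line of sight:
print(footstep_gain((0, 0, 0), (10, 0, 0)))  # 0.1
```

The real system would do this raycast per sound, per listener, every update tick, which is where the cost adds up.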
Edit (link): https://www.youtube.com/watch?v=zF_jcrTCMsA
The end effect is that on low-powered hardware your audio can end up buzzing. In addition, the high cost of Dolby Atmos processing on the CPU affects the rest of the game, such that turning off Dolby Atmos is a viable way to get higher FPS at the cost of lower-fidelity audio.
I wouldn't discount the number of people using voice assistants and relying on noise reduction/echo cancellation in their meetings, or the near future where we have SMPTE 2098 renderers running on HTPCs.
I really think we're only a few years away from seeing a lot more realtime DSP in consumer applications. Conferencing, gaming, and content consumption all benefit immensely, and it would be good for us all to start thinking about the latency penalty on audio-centric HCI like we do for video.
You generate N channels worth of 1470 samples per frame, and mix (add) them together. Make N large enough, and make the computation processes associated with generating those samples complex enough, and the audio workload is not so different from the video one.
Jacob Collier routinely uses 300-600 tracks in his mostly-vocal overdubs, and so for sections where there's something going on in all tracks (rare), it's more in the range of 400k-900k samples to be dealt with. This sort of track count is also typical in movie post-production scenarios. If you were actually synthesizing those samples rather than just reading them from disk, the workload could exceed the video workload.
And then there's the result of missing the audio buffer deadline (CLICK! on every speaker ever made) versus missing the video buffer deadline (some video nerds claiming they can spot a missing frame :)
Sure, but graphics pipelines don't only touch each pixel once either. :)
> Jacob Collier routinely uses 300-600 tracks in his mostly-vocal overdubs, and so for sections where there's something going on in all tracks (rare), it's more in the range of 400k-900k samples to be dealt with.
Sure, track counts in modern DAW productions are huge. But like you note, in practice most tracks are empty most of the time, and it's pretty easy to architect a mixer that can optimize for that. There's no reason to iterate over a list of 600 tracks and add 0.0 to the accumulated sample several hundred times.
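A toy sketch of that optimization: represent an empty region as `None` (or a per-track silence flag) and skip it entirely, so only active tracks cost anything. Names and structure are illustrative, not any particular DAW's mixer:

```python
def mix(tracks, frames=1470):
    """Mix a list of per-track sample blocks, skipping silent ones.
    Each track is either None (no clip under the playhead) or a list of floats."""
    out = [0.0] * frames
    for block in tracks:
        if block is None:          # empty region: nothing to add
            continue
        for i, s in enumerate(block):
            out[i] += s
    return out

# 600 tracks, only 2 of which have audio under the playhead:
tracks = [None] * 598 + [[0.5] * 4, [0.25] * 4]
print(mix(tracks, frames=4))  # [0.75, 0.75, 0.75, 0.75]
```

The inner loop only ever runs for the two live tracks; the other 598 cost one pointer check each.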
> If you were actually synthesizing those samples rather than just reading them from disk, the workload could exceed the video workload.
Yes, but my point is that you aren't. Consumers do almost no real-time synthesis, just a little mixing. And producers are quite comfortable freezing tracks when the CPU load gets too high.
I guess the interesting point to focus on is that with music production, most of it is not real-time and interactive. At any point in time, the producer's usually only tweaking, recording, or playing a single track or two, and it's fairly natural to freeze the other things to lighten the CPU load.
This is somewhat analogous to how game engines bake lighting into static background geometry. They partition the world into things that can change and things that can't, and use a different pipeline for each.
> And then there's the result of missing the audio buffer deadline (CLICK! on every speaker ever made)
Agreed, the failure mode is catastrophic with audio. With video, renderers will simply use as much CPU and GPU as they can and players will max everything out. With audio, you set aside a certain amount of spare CPU as headroom so you never get too close to the wall.
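One way to sketch that headroom idea: measure what fraction of each buffer period the DSP actually consumes, and treat some threshold (70% here, an arbitrary assumption) as the wall:

```python
import time

BUFFER_FRAMES = 256
SAMPLE_RATE = 48000
BUFFER_PERIOD = BUFFER_FRAMES / SAMPLE_RATE  # seconds available per callback
HEADROOM = 0.70                              # use at most 70% of it (assumed)

def process_block():
    """Stand-in for one buffer's worth of DSP work."""
    time.sleep(0.001)  # pretend the DSP took 1 ms

start = time.perf_counter()
process_block()
elapsed = time.perf_counter() - start

load = elapsed / BUFFER_PERIOD
print(f"DSP load: {load:.0%} of the {BUFFER_PERIOD * 1000:.1f} ms budget")
if load > HEADROOM:
    print("too close to the wall: freeze a track or raise the buffer size")
```

Real audio callbacks track this as a running average rather than a single measurement, but the budget arithmetic is the same.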
Not anymore at least. :)
In the early days of computer audio production, it was very common to rely on external PCI cards to offload the DSP (digital signal processing) because CPUs at the time couldn't handle it.
And, talking about learning lessons from other fields, there's no particular reason that Jacob has to render his audio at full fidelity in real time. Video editors usually do their interactive work with a relatively low-fi representation, and do the high fidelity rendering in a batch process that can take hours or days. As I'm sure you're aware.
They don't generally lower fidelity. That is needed for video simply because the data sizes for video footage are so huge that they are unwieldy to work with.
But DAWs do let users "freeze", "consolidate" or otherwise pre-render effects so that everything does not need to be calculated in real-time on the fly.
Film sound editors do not do what you're describing. They work with typically 600-1000 tracks of audio. They do not lower fidelity, they do not pre-render. Ten years ago, one of the biggest post-production studios in Hollywood used TEN ProTools systems to be able to function during this stage of the movie production process.
Typical DAW users are using a pro-sumer machine and bouncing tracks as necessary to fit within their CPU limits.
There's been a huge push recently to remove layers from graphics APIs and expose hardware more directly, in the form of Vulkan, DX12, and Metal. The problem that this has uncovered is that graphics hardware is not only very complex but quite varied, so exposing "what is actually happening in hardware" is incompatible with portability and simplicity. As a result you get APIs that are either not portable or hard to use.
> more and more processing is done by video hardware, and less and less by the CPU.
Unfortunately this will remain a reality because it comes with a 20-100x performance advantage.
Things are not totally cleared up, but the variation has been much reduced, and as a result the design of the driver API and then user-space APIs above that have become more straightforward and more consistent even across platforms.
>20-100x performance advantage.
Yes, for sure this is one aspect of the comparison that really breaks down. Sure, there are audio interfaces (e.g. those from UAD) that have a boatload of audio processing capabilities on board, and offloading that DSP to the hardware saves oodles of CPU cycles. But these are unusual, far from the norm, and not widely used. Audio interfaces have mostly converged on a massively less capable hardware design than they had 20 years ago.
Interestingly, for really advanced audio processing like sound propagation in 3D environments or physics-based sound synthesis, GPUs will probably be necessary.
so to restate something i said just below, the fact that "most audio processing is well within CPU capabilities" is only really saying "audio processing hasn't been pushed in the way that video has because there's been no motivating factor for it". if enough people were actually interested in large scale physically modelled synthesis, nobody would see "audio processing [as] well within CPU capabilities". why aren't there many people who want/do this? Well ... classic chicken and egg. The final market is no doubt smaller than the game market, so I'd probably blame the lack of hardware for preventing the growth of the user community.
ironically, sound propagation in 3D environments is more or less within current CPU capabilities.
From the standpoint of the people making the games, 3D graphics had another massive advantage: they were also cheaper than the alternative. When DOOM first appeared in December of 1993, the industry was facing a budgetary catch-22 with no obvious solution. ... Even major publishers like Sierra were beginning to post ugly losses on their bottom lines despite their increasing gross revenues.
3D graphics had the potential to fix all that, practically at a stroke. A 3D world is, almost by definition, a collection of interchangeable parts. ... Small wonder that, when the established industry was done marveling at DOOM‘s achievements in terms of gameplay, the thing they kept coming back to over and over was its astronomical profit margins. 3D graphics provided a way to make games make money again.
Not sure if and how this translates to audio.
Of course, today's game budgets are huge --- detailed 3d models are expensive, and FMV cutscenes are nearly mandatory. And games have expanded to 10s of gigabytes.
I also doubt that specialized audio hardware could provide a large benefit over a GPU for physically modeled synthesis. Ultimately what you need is FLOPS and GPUs these days are pretty much FLOPS maximizing machines with a few fixed function bits bolted on the side. Swapping the graphics fixed function bits for audio fixed function bits is not going to make a huge difference. Certainly nothing like the difference between CPU and GPU.
Re: physically modelled drums ... watch this (from 2008) !!!
Consider the sound made right before the 3:00 mark when Randy just wipes his hand over the "surface". You cannot do this with samples :)
It says something about the distance between my vision of the world and reality that even though I thought this was the most exciting thing I saw in 2008, there's almost no trace of it having any effect on music technology down here in 2020.
I would say that it's not impossible to have a low latency path from the GPU to the audio buffers on a modern system. It would take some attention to detail but especially with the new explicit graphics APIs you should be able to do it quite well. In some cases the audio output actually goes through the GPU so in theory you could skip readback, though you probably can't with current APIs.
Prioritization of time critical GPU work has also gotten some attention recently because VR compositors need it. In VR you do notice every missed frame, in the pit of your stomach...
The problem is that the workload for audio is significantly different. In a given signal flow, you may have processing stages implemented by the primary application, then the result of that is fed to a plugin which may run on the CPU even in GPU-for-audio world, then through more processing inside the app, then another plugin (which might even run on another device) and then ... and so on and so forth.
The GPU-centric workflow is completely focused on collapsing the entire rendering process onto the GPU, so you basically build up an intermediate representation of what is to be rendered, ship it to the GPU et voila! it appears on the monitor.
It is hard to imagine an audio processing workflow that would ever result in this sort of "ship it all to the GPU-for-audio and magic will happen" model. At least, not in a DAW that runs 3rd party plugins.
GPU development is harder and the debugging tools are atrocious, so it's not without downsides. But the performance would be unbeatable.
You still can't render an orchestral-size collection of physically modelled instruments on the CPU. A processor that could would be something analogous to the modern GPU, but for various reasons there has been little investment in chip design that would move further in this direction.
Games in particular created a large push for specialized GPU architectures, and there's really no equivalent motivation to do the same for audio, in part because there's no standard synthesis approach that things would converge on (the way they have in GPUs).
So, "perfectly well computed in software for audio" is more of a statement about the world as-it-is rather than one about any fundamental difference between the two media.
I don't have enough working knowledge or mathematics to fully comprehend the 3d pipeline but as I understand, it is a huge linear algebra/matrix manipulator and audio processing would fall into this. My reason for asking is that at least in the guitar world I've seen issues related to CPU processing and multiple VST plugins all wired together. Some guitarists use fairly complicated rigs to get their tone and when you have multiple amps, speaker cabinet simulation, and effects running you need a pretty powerful machine to do everything real-time without under-running buffers.
There are specific hardware solutions for this but I've always wondered if you could just create shaders or something and dump the audio stream to the GPU and get back the transformed audio stream with minimal CPU burden. Any insight here would be very helpful to me.
Put differently, the GPU is plenty powerful to do the processing you're thinking of, but it can't get the results back to the CPU in the required time.
For video, there is no roundtrip: we send a "program" (and maybe some data) from the CPU to the GPU; the GPU delivers it into the video framebuffer.
For audio, the GPU has no access to the audio buffer used by the audio interface, so the data has to come back from the GPU before it can be delivered to the audio interface.
I would be happy to hear that this has changed.
None of this prevents incredibly powerful offline processing of audio on a GPU, or using the GPU in scenarios where low latency doesn't matter. Your friends with guitars and pedal boards are not one of those scenarios, however.
That's how gamers do it when they want the lowest latency possible, anyway. Something like "find the lowest frame rate your game runs on, and cap it to 80% of that".
This thread fascinates me. A sibling comment addresses why it can't be done in realtime (barring some pedestrian-sounding hardware improvements), but the question of "how" fascinates me. I know a little bit about audio, I'm a classical musician with some experience with big halls, I know some physics, and I wrote a raytracer in JS back in the early aughts. Can we raytrace sound? It's a wave, it's got a spectrum, different materials reflect, refract, and absorb various parts of the spectrum differently... sounds pretty similar! The main differences I see are that we usually only have a handful of "single pixel cameras" (I expect to lose a bit of parallelism to that) and, more problematic, we can't treat the speed of sound as infinite. I can imagine a cool hack to preprocess rays for a static scene, but my imagined approach breaks down entirely if the mic moves erratically.
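The finite speed of sound is exactly what makes this an impulse-response problem rather than a pixel problem: each traced path deposits its energy at a sample offset given by its length divided by ~343 m/s, so a single impulse smears out in time instead of landing at one point. A toy sketch (the path lengths and energies are made up):

```python
SPEED_OF_SOUND = 343.0  # m/s at room temperature
SAMPLE_RATE = 44100

def impulse_response(paths, length=None):
    """Build an impulse response from traced ray paths.
    Each path is (total_path_length_m, remaining_energy)."""
    delays = [(round(d / SPEED_OF_SOUND * SAMPLE_RATE), e) for d, e in paths]
    n = length or max(i for i, _ in delays) + 1
    ir = [0.0] * n
    for i, e in delays:
        ir[i] += e  # unlike a pixel, each arrival lands at its own time offset
    return ir

# Direct path (10 m) plus one wall bounce (24 m path, half the energy absorbed):
ir = impulse_response([(10.0, 1.0), (24.0, 0.5)])
direct = round(10.0 / SPEED_OF_SOUND * SAMPLE_RATE)  # sample 1286
bounce = round(24.0 / SPEED_OF_SOUND * SAMPLE_RATE)  # sample 3086
print(direct, bounce, ir[direct], ir[bounce])
```

Convolving a dry signal with such an impulse response then gives the sound at the listener position, which is why movement of the mic means re-tracing and rebuilding the response.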
In addition, you may also consider physical modelling synthesis to be an example of this. Look up Pianoteq. It is a physical modelled piano that literally does all the math for the sound from a hammer of particular hardness striking a string of given size and tension at a particular speed, and then that sound generating resonances within the body of the piano and other strings, and then out into the room and interacting with microphone placement, lid opening etc. etc.
Most (not all) people agree that Pianoteq is the best available synthetic piano. It's all math, no samples.
There are similar physical modelling synth engines for many other kinds of sounds, notably drums. They are not particularly popular because they use more CPU than samples and can be complicated to tweak. They also really benefit from a more sophisticated control surface/instrument than most people have access to (i.e. playing a physically modelled drum from a MIDI keyboard doesn't really allow you to get into the nuances of the drum).
The Oculus Audio SDK contains an acoustic ray tracer for that. See https://www.oculus.com/blog/simulating-dynamic-soundscapes-a... for a short writeup on what is actually in there under the hood.
Disclaimer: I've worked on that ray tracing engine at FRL for a while.
I wasn't talking about codecs, but about how software interacts with the hardware. We're already dealing with the move away from OpenGL, which for a while had looked as if it might actually finally wrap most video in a consistent, portable model. Metal and Vulkan are moderately consistent with each other, but less so with the existing OpenGL-inspired codebase(s).
But there are even simpler levels than this: in audio, on every (desktop) platform, it is trivial for audio I/O to be tightly coupled to the hardware "refresh" cycle. WDM-WASAPI/CoreAudio/ALSA all make this easy, if you need it (and for low latency audio, you do). Doing this for graphics/video is an insanely complex task. I know of no major GUI toolkit that even exports the notion of sync-to-vblank, let alone does so in any actually portable way (Qt and GTK have both made a few steps in this direction, but if you dive in deep, you find that it's a fairly weak connection compared to the way things work in audio).
But I feel like we have a number of years to go until we can really get back to where we used to be with vintage gaming consoles and CRT displays.
Most of the examples I can think of are ones where the software slowdown has more than cancelled out the hardware improvements. Then there are some areas where hardware performance improvement was sufficient to overcome software slowdown. Software getting faster? Software getting faster than hardware??
Algorithms have been refined. Faster paths have been uncovered for matrix multiplication  (unsure if the latest improvements are leveraged) and other algorithms.
Use-cases that have been around for a while (say, h.264 encode/decode) are more optimized.
We now tend to be a lot better at managing concurrency too (see: rust, openmp and others), with the massively parallel architectures that come out nowadays.
However, I can't really agree that they are examples of software improvements being multiples of hardware improvements.
1. Compiler optimizations
See Proebsting's Law, which states that whereas Moore's Law provided a doubling of performance every 18-24 months, compiler optimizations provide a doubling every 18 years at best. More recent measurements indicate that this was optimistic.
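The gap those doubling periods imply compounds dramatically. Over an 18-year span, taking the optimistic 18-month figure for hardware:

```python
years = 18
hw_doublings = years * 12 / 18  # one doubling every 18 months (Moore's Law)
sw_doublings = years / 18       # one doubling every 18 years (Proebsting's Law)

print(2 ** hw_doublings)  # 4096.0 -- hardware speedup over the span
print(2 ** sw_doublings)  # 2.0    -- compiler speedup over the same span
```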
2. Compilers getting faster
Sorry, not seeing it. Swift, for example, can take a minute before giving up on a one line expression, and has been clocked at 16 lines/second for some codebases.
All the while producing slow code.
See also part of the motivation for Jonathan Blow's Jai programming language.
3. Matrix multiplication
No numbers given, so ¯\_(ツ)_/¯
The big improvements have come from moving them to hardware.
You want examples where software has sped up at a rate faster than hardware (meaning that new software on old hardware runs faster than old software on new hardware).
Can you give me an example of 2005-era swift running faster on newer hardware than today's compiler on yesterday's hardware? You can't, as this is a new language, with new semantics and possibilities. Parsing isn't as simple as it seems, you can't really compare two different languages.
These software improvements also tend to pile up along the stack. And comparing HW to SW is tricky: you can always cram more HW to gain more performance, while using more SW unfortunately tends to have the opposite effect. So you have to restrict yourself HW-wise: same price? same power requirements? I'd tend to go with the latter as HW has enjoyed economies of scale SW can't.
Concurrency might be hardware, but in keeping with the above point, more execution cores will be useless for a multithread-unaware program. Old software might not run better on new HW, but old HW didn't have these capabilities, so the opposite is probably true as well. Keep in mind that these new HW developments were enabled by SW developments.
> No numbers given, so ¯\_(ツ)_/¯
Big-O notation should speak for itself, I am not going to try and resurrect a BLAS package from the 80s to benchmark against on a PIC just for this argument ;)
Other noteworthy algorithms include the FFT . (I had another one in mind but lost it).
> The big improvements have come from moving them to hardware.
I'm talking specifically about SW implementations. Of course you can design an ASIC for most stuff. And most performance-critical applications probably had ASICs designed for them by now, helping prove your point. SW and HW are not isolated either, and an algorithm optimized for old HW might be extremely inefficient on new HW, and vice-versa.
And in any case, HW developments were in large part enabled by SW developments with logic synthesis, place and route, etc. HW development is SW development to a large extent today, though that was not your original point.
What can't be argued against, however, is that both SW and HW improvements have made it much easier to create both HW and SW. Whether SW or HW has been most instrumental with this, I am not sure. They are tightly coupled: it's much easier to write a complex program with a modern compiler, but would you wait for it to compile on an old machine? Likewise for logic synthesis tools and HW simulators. Low-effort development can get you further, and that shows. I guess that's what you are complaining about.
That wasn't my point, but the claim of the poster I was replying to, and it was exactly this claim that I think is unsupportable.
Has some software gotten faster? Sure. But mostly software has gotten slower, and the rarer cases of software getting faster have been outpaced significantly by HW.
> You want examples where software has sped up at a rate faster than hardware
that in some/many areas software has improved performance by several multiples of the improvement in hardware performance over the last 20 years
The original JITs were done in the late 80s and early 90s. And their practical impact is far less than the claimed impact.
As an example the Cog VM is a JIT for Squeak. They claim a 5x speedup in bytecodes/s. Nice. However the naive bytecode interpreter, in C, on commodity hardware in 1999 (Pentium/400) was 45 times faster than the one microcoded on a Xerox Dorado in 1984, which was a high-end, custom-built ECL machine costing many hundred thousands of dollars. (19m bytecodes/s vs. 400k bytecodes/s).
So 5x for software, at least 45x for hardware. And the hardware kept improving afterward, nowadays at least another 10x.
> [compilers] Parsing isn't as simple as it seems [..]
Parsing is not where the time goes.
> 2005-era swift running faster
Swift generally has not gotten faster at all. I refer you back to Proebsting's Law and the evidence gathered in the paper: optimizer (=software) improvements achieve in decades what hardware achieves/achieved in a year.
There are several researchers that say optimization has run out of steam.
(the difference between -O2 and -O3 is just noise)
> Big-O notation should speak for itself
It usually does not. Many if not most improvements in Big-O these days are purely theoretical findings that have no practical impact on the software people actually run. I remember when I was studying that "interior point methods" were making a big splash, because they were the first practical linear optimization algorithms with polynomial complexity, whereas the Simplex algorithm has exponential worst-case complexity. I don't know what the current state is, but at the time the reaction was a big shrug. Why? Although Simplex is exponential in the worst case, it typically runs in linear or close to linear time and is thus much, much faster than the interior point methods.
Similar for recent findings of slightly improved multiplication algorithms. The n required for the asymptotic complexity to overcome the overheads is so large that the results are theoretical.
The Wikipedia link you provided goes to algorithms from the 1960s and 1940s, so not sure how applicable that is to the question of "has software performance improvement in the last 20 years outpaced hardware improvement by multiples?".
Are you perchance answering a completely different question?
> [H264/H265] I'm talking specifically about SW implementations
Right, and the improvements in SW implementations don't begin to reach the improvement that comes from moving significant parts to dedicated hardware.
And yes, you have to modify the software to actually talk to the hardware, but you're not seriously trying to argue that this means this is a software improvement??
> Parsing is not where the time goes.
Not with the current algorithms.
But let's agree to put this argument to a rest. I generally agree with you that
1. Current software practices are wasteful, and it's getting worse
2. According to 1. most performance improvements can be attributed to HW gains.
I originally just wanted to point out that this was true in general, but that there were exceptions, and that hot paths are optimized. Other tendencies are at play, though, such as the end of Dennard scaling. I tend to agree with https://news.ycombinator.com/item?id=24515035 and to achieve future gains, we might need tighter coupling between HW and SW evolution, as general-purpose processors might not continue to improve as much. Feel free to disagree, this is conjecture.
> And yes, you have to modify the software to actually talk to the hardware, but you're not seriously trying to argue that this means this is a software improvement??
My point was more or less the same as the one made in the previously linked article: HW changes have made some SW faster, other comparatively slower. These two do not exist in isolated bubbles. I'm talking of off-the-shelf HW, obviously. HW gets to pick which algorithms are considered "efficient".
Recursive descent has been around forever, the Wikipedia page mentions a reference from 1975. What recent advances have there been in parsing performance?
> 1. Current software practices are wasteful, and it's getting worse
> 2. According to 1. most performance improvements can be attributed to HW gains.
3. Even when there were advances in software performance, they were outpaced by HW improvements, certainly typically and almost invariably.
But the White House advisory report cited research, including a study of progress over a 15-year span on a benchmark production-planning task. Over that time, the speed of completing the calculations improved by a factor of 43 million. Of the total, a factor of roughly 1,000 was attributable to faster processor speeds, according to the research by Martin Grotschel, a German scientist and mathematician. Yet a factor of 43,000 was due to improvements in the efficiency of software algorithms.
I actually took Professor Grötschel's Linear Optimization course at TU Berlin, and the practical optimization task/competition we did for that course very much illustrates the point made in the answer to the stackexchange question you posted.
Our team won the competition, beating the performance not just of the other student teams' programs, but also the program of the professor's assistants, by around an order of magnitude. How? By changing a single "<" (less than) to "<=" (less than or equal), which dramatically reduced the run-time of the dominant problem of the problem-set.
This really miffed the professor quite a bit, because we were just a bunch of dumb CS majors taking a class in the far superior math department, but he was a good sport about it and we got a nice little prize in addition to our grade.
It also helped that our program was still fastest without that one change, though now with only a tiny margin.
The point being that, as the post notes, this is a single problem in a single, very specialized discipline, and this example absolutely does not generalize.
In any area where computing performance has been critical - optimization, data management, simulation, physical modeling, statistical analysis, weather forecasting, genomics, computational medicine, imagery, etc. there have been many, many cases of software outpacing hardware in rate of improvement. Enough so that it is normal to expect it, and something to investigate for cause if it's not seen.
The entire IT market is ~ $3.5 trillion, HPC is ~ $35 billion. Now that's nothing to sneeze at, but just 1% of the total. I doubt that all the pieces I mentioned also account for just 1%. If so, what's the other 98%?
Second, there are actually many factors that contribute to software bloat and slowdown, what you mention is just one, and many other kinds of software are getting slower, including compilers.
Third, while I believe you that many of the HPC fields see some algorithmic performance improvements, I just don't buy your assertion that this is regularly more than the improvements gained by the massive increases in hardware capacity, and that one singular example just doesn't cut it.
Interesting. CRTs also had a fixed framerate. Let's make that 60 fps for the sake of the argument.
It really depends on what you are calling "vintage". Most later consoles (with GPUs) just composited images at a fixed frame-rate.
Earlier software renderers are quite interesting, though, in that they tend to "race the beam" and produce pixel data a few microseconds before it is displayed. Does that automatically transfer to low latency? I'm not sure. If the on-screen character is supposed to move by a few pixels with the last input, it really depends on whether you have drawn it already. Max latency is 16 ms, min is probably around 100 µs. That gives you 8 ms of expected latency. And I think you still get tearing, in some cases.
There is also no reason it couldn't be done with modern hardware, except the wire data format for HDMI/DP might need to be adjusted.
However, and I've said this for a long time, one key visible difference between CRTs and LCDs is persistence. Images on an LCD persist for a full frame, instead of letting your brain interpolate the images. The result is a distinctively blurry edge on moving objects. Some technologies such as backlight strobing (aka ULMB) aim to alleviate this (you likely need triple buffering to combine this with adaptive sync, which I haven't seen).
I wonder if rolling backlights could allow us to race the beam once again?
QLED/OLED displays could theoretically bring a better experience than CRTs if the display controller allowed it: every pixel emits its own light, so low persistence is achievable. You don't have a beam with fixed timings, so you could just update what's needed in time for displaying it.
This is usually called ghosting, and can be combated using "overdrive";
Edit: regarding my older comment, I thought that QLEDs were quantum dots mounted on individual LEDs. They are actually regular LCDs, with quantum dots providing the colour conversion. That makes more sense from an economic perspective, less so for performance. Maybe OLED and QLED could be combined to get the best of OLED for all colours?
luminosity
|   /\
|  |  \
|  |   \
|  |    `-._
+------------ time
The switching time is the time between when the signal starts (or rather when the monitor has finished working out what the signal means and has started trying to change the luminosity) and when it gets bright (or in the case of LCDs, to the right luminosity).
Ghosting comes from not switching strongly enough and overdrive (switching too hard) tends to lead to poor colour accuracy.
You are right. That directly contributes to latency. This effect is called pixel response time, and is usually measured "grey-to-grey". Nowadays, though, I think monitors usually have a short response time (<1 ms for "gamer" monitors).
> Persistence is the time it takes between when the luminosity goes up and when it goes down
Right. But that proves the two are somewhat linked, especially as pixel response isn't symmetric (0->1 and 1->0).
Reading a bit more into it, the industry terms for persistence seem to be both grey-to-grey (GtG) and Moving Picture Response Time (MPRT). The latter is a measurement method for perceived motion blur. It directly depends on the time each pixel remains lit with the same value ("persistence" strikes again), so a slow (>1 frame) or incomplete transition can create motion blur (contributes to persistence).
> low persistence is emulated with a high refresh rate and black frame insertion
It can also be achieved with backlight strobing on backlit displays: strobing for 1ms on a "full-persistence" (pixel always on) display gives a 1ms persistence regardless of the frame rate. A ~1000 FPS display would be necessary to get the same level of persistence with black frame insertion alone. I believe this is part of the reason Valve went with an LCD instead of an OLED screen to get the 0.33ms persistence on the Index HMD.
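The relationship is simple enough to write down; here's a small sketch of that arithmetic (the model that persistence equals the time a frame's image stays lit is the simplification used above):

```python
# Rough persistence model: persistence ~= how long each frame's image
# stays lit. A strobed backlight decouples persistence from refresh
# rate; black frame insertion alone cannot.

def persistence_ms_bfi(fps):
    """Full-persistence panel: the image is lit for one whole frame."""
    return 1000 / fps

def persistence_ms_strobe(strobe_ms):
    """Strobed backlight: the image is lit only while the strobe fires."""
    return strobe_ms

print(persistence_ms_bfi(1000))    # needs a ~1000 FPS panel to hit 1 ms
print(persistence_ms_strobe(1.0))  # 1 ms at any refresh rate
```

so a 1 ms strobe on a 60 Hz panel matches what black frame insertion would need a ~1000 FPS panel to achieve.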
 (from the above) https://hal.archives-ouvertes.fr/hal-00177263/document
 (also from blurbusters article) https://lcd.creol.ucf.edu/Publications/2017/JAP%20121-023108...
Something like that sitting where the scan-out engines exist today (with the multiple planes they composite today) would be absolutely killer if you could do hardware/software co-design to take advantage of it.
I've also thought that something like that would be great in VR/AR.
If you've only got fixed-function shaders and a relatively simple rasterizer, the raster-to-display approach works fine. The DS also only has a single contributor to the output. A game on a PS4 or a PC is just one of several contributors to a compositing window/display manager, which is actually doing final composition to the output device(s).
This guy has articles about most of the major consoles from the last 30 years, I find them fascinating to read even though much of it goes over my head.
I still think we get better compositor performance by decoupling "start of frame" / "end of frame" and the drawing API. The big thing we lack on the app side is timing information -- an application doesn't know its budget for how long it should take and when it should submit its frame, because the graphics APIs only expose vsync boundaries. If the app could take ~15ms to build a frame, submit it to the compositor, and the compositor takes the remaining ~1ms to do the composite (though likely much less; these are just easy numbers), the frame could be made to display in the current vsync cycle. We just don't have accurate timing feedback for this though.
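A minimal sketch of the timing feedback being asked for here, using the illustrative numbers from the comment (the safety margin is my own assumption, not part of any real API):

```python
# Sketch: a compositor tells the app its submission deadline, so the
# app can use the whole vsync cycle minus the compositor's cost and
# still land in the current cycle. All numbers are illustrative.

VSYNC_MS = 1000 / 60   # next scan-out, relative to the start of the cycle
COMPOSITE_MS = 1.0     # time the compositor needs after submission
SAFETY_MS = 0.5        # margin for scheduling jitter (assumption)

# Latest moment the app may submit its frame and still make this vsync:
submit_deadline_ms = VSYNC_MS - COMPOSITE_MS - SAFETY_MS

# Everything before the deadline is the app's render budget (~15 ms):
app_budget_ms = submit_deadline_ms

print(f"submit by {submit_deadline_ms:.1f} ms, budget {app_budget_ms:.1f} ms")
```

The point is that today's APIs expose only the vsync boundary, not `submit_deadline_ms`, so apps can't schedule against it.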
One of my favorite gamedev tricks was used on Donkey Kong Country Returns. There, the developers polled input far above the refresh rate, and rendered Donkey Kong's 3D model at the start of the frame into an offscreen buffer, and then, as the frame was being rendered, processed input and did physics. Only at the end of the frame, did they composite Donkey Kong into the updated physics. So they in fact cut the latency to be sub-frame through clever trickery, at the expense of small inaccuracies in animation. Imagine if windows get "super late composite" privileges, where it could submit its image just in the nick of time.
(Also, I should probably mention that my name is "Jasper St. Pierre". There's a few tiny inaccuracies in the history -- old-school Win95/X11 still provides process separation as the display server bounds the window's drawing to the window's clip list for the app, and Windows 2000 also had a limited compositor, known as "layered windows", where certain windows could be redirected offscreen, but these aren't central to your thesis)
Counting Intel, most (by number) desktop devices have pretty sophisticated hardware layer capabilities. Sky Lake has 3 display pipes, each of which has 3 display planes and a cursor. The multiple pipes are probably mostly used for multi-monitor configurations, but it's still a decent setup. From the hints I've picked up, I believe DirectFlip was largely engineered for this hardware.
There are lots of interesting latency-reducing tricks. Some of those might be good inspiration, but what I'm advocating is a general interface that lets applications reliably get good performance.
Thanks for the discussion!
DirectFlip can be supported on a lot of hardware, but is limited to borderless fullscreen unless you have overlays. https://youtu.be/E3wTajGZOsA?t=1531 has an explanation.
https://developers.google.com/web/updates/2019/05/desynchron... has half the story for getting extremely low latency inking on Chromebooks with Intel GPUs. The other half is ensuring the canvas is eligible for hardware overlay promotion, which the ChromeOS compositor will do.
One of these days I'll write that up.
I guess it's possible that their overlay support is limited and not fully equivalent to more modern overlays. It's tough to tell for sure from that description. But even if so, the price discrimination aspect may still have stopped them from wanting to implement a more capable feature and expose it in the GeForce drivers.
Edit: This documentation suggests it's limited to 16 bit color and 1 bit transparency. http://http.download.nvidia.com/XFree86/Linux-x86/100.14.19/...
If each step takes 4 msec (240 Hz) instead of 16.7 msec (60 Hz), and you're the same number of steps behind, the latency is reduced by (16.7 - 4) * nsteps; with three steps, that's 50 msec - 12 msec = 38 msec.
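That arithmetic, written out (the three-step pipeline is the assumption implied by the 50 msec figure above):

```python
# Pipeline latency if each stage (e.g. render, composite, scan-out)
# takes one full refresh interval. A 3-step pipeline is assumed here,
# matching the 50 msec figure in the comment.

def pipeline_latency_ms(refresh_hz, nsteps=3):
    return (1000 / refresh_hz) * nsteps

at_60 = pipeline_latency_ms(60)    # ~50 ms
at_240 = pipeline_latency_ms(240)  # ~12.5 ms
saved = at_60 - at_240             # ~37.5 ms shaved off by brute force

print(at_60, at_240, saved)
```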
Now that's assuming your system can actually keep up and produce frames and composite that fast, which is where the "brute force" comes into play.
Then we have "average" delay in ms rather than frames.
(Asking as someone who doesn't know this area well)
More than could, some games let you do this explicitly.
If you render "as fast as possible" (to a max of xyz Hz), you get better worst-case performance than if you delay compositing. A render time of 2 physical frames results in only 1 missed frame, instead of a render time of 1.26 physical frames resulting in 2 missed frames.
Of course nothing is simple in the real world, the flip side is rendering the extra frames creates waste heat that slows down modern processors with insufficient cooling.
All I ended up capturing was something approximating the experience I had back when Windows allowed me to disable the compositor, which wasn't vsynced. It still doesn't feel as smooth, just smoother. Nowhere near the smoothness of my CRT though.
Is this similar to the "dirty rectangles" technique? 
It seems difficult to implement some sort of scene system (for either an app, game or general GUI) that given a single key press can determine a minimum bounding box of changes on the screen no matter what the scene is, and is able to render just that bounding box.
If the single key occurs in a text area, potentially the whole text area could be "damaged", right? Edit: I was looking at this with the "Rendering" tab of Chrome dev tools, enabling "Paint flashing" and "Layout Shift Regions". It seems like the text area is its own layer and the space is partitioned pretty cleverly into things like paragraphs and lines, but from time to time the whole text area just flashes, which tells me the algorithm sometimes isn't sure what is dirty and just repaints the whole thing, but not always.
> It seems difficult to implement some sort of scene system (for either an app, game or general GUI) that given a single key press can determine a minimum bounding box of changes on the screen no matter what the scene is, and is able to render just that bounding box.
I don't think it's that bad. In the days before hardware graphics, basically every 2D sprite engine used in every computer game you liked had to do this logic.
Text editors are a harder case in some ways because of line wrapping, but the editor does need to figure out where everything goes spatially in order to render, so extending that to tell which things have not moved is, I think, not that difficult.
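The core of that sprite-engine logic is small; here's a minimal sketch of dirty-rectangle tracking (the rect layout and the keypress example are illustrative, not from any particular engine):

```python
# Minimal dirty-rectangle sketch: each widget reports the region it
# changed, and the renderer repaints only the union's bounding box.
# A rect is (x, y, w, h); all names and numbers here are illustrative.

def bounding_box(rects):
    """Smallest rect covering every dirty rect (None if nothing changed)."""
    if not rects:
        return None
    x0 = min(x for x, y, w, h in rects)
    y0 = min(y for x, y, w, h in rects)
    x1 = max(x + w for x, y, w, h in rects)
    y1 = max(y + h for x, y, w, h in rects)
    return (x0, y0, x1 - x0, y1 - y0)

# A keypress damages the caret cell and the glyph just typed:
dirty = [(100, 40, 2, 16), (102, 40, 9, 16)]
print(bounding_box(dirty))  # (100, 40, 11, 16)
```

Everything outside that box is untouched and needs no repaint; the hard part, as noted above, is the layout bookkeeping that tells you which rects to report in the first place.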
Variable refresh rate (VRR / G-Sync / FreeSync / Adaptive Sync) and high-refresh rate hardware is now starting to gain traction on the high-end, usually with both technologies combined. Apple's ProMotion is often touted as a "120Hz" feature, but it's also a variable refresh in that it can support much lower framerates. Without a static v-sync interval, you can display the frames as they become available, for both perfect tear-free visuals and lower latency. It's quite likely we'll have that on Macs as well.
I recently got a 144Hz FreeSync (G-Sync compatible) "business" monitor, so this tech is starting to filter down from the gamer side. It works great with a compositor on Linux! Ultra smooth mouse and input response at 144Hz, and completely tear-free. I would highly recommend it for developers as well.
I got a 165Hz monitor and while it is indeed smoother than a 60Hz monitor with a vsync'd compositor, it still isn't as smooth as a 60Hz monitor without a compositor.
This is an important point. There are many solutions out there (beyond compositing) that add incredible complexity, along with other costs, in order to solve a problem perceived by their proponents... not uncommonly, those same people pushing the solution are incapable of fairly judging the cost-benefit; sometimes the solutions are just not worth the cost.
There is currently an issue in chromium compositing with checkerboarding (blanking whole regions of the screen with white) when scrolling due to an optimization released to reduce jank:
You will notice this in recent releases of Chromium when you scroll any non-trivial page fast enough... or a simple page very fast by yanking the scroll bar... unless you have an extremely fast computer and graphics card. When reading through that thread you will notice how defensive they are; it's a complex and no doubt incredible bit of coding they have done. Unfortunately, the side effect is worse than the original problem, which most users don't even notice.
In the long run I think time will prove this problem and its solutions are purely transient... In the same way font anti-aliasing is becoming obsolete with DPIs higher than we can perceive, solving jank isn't necessary if you cannot perceive it (i.e. using frame rates > 60 Hz)... and the original problem really isn't as bad as all the people attempting to solve it think it is anyway.
Screens might exceed acuity of most users but not hyperacuity. Even on a 400ppi phone screen at typical viewing distances it is possible to tell whether a slanted line is anti-aliased or not. Font anti-aliasing is not becoming obsolete any time soon.
This post is focused on local compositing, but I think the same arguments apply as with the networked case: too many updates is actually worse than vsync, but the usual case of "just ship deltas" is amazing (for remote display you get the bandwidth and latency win, here you'd get latency/power/whatever).
I think a low-level "updates" API would make sense for some sophisticated applications, but I'm not convinced that this quote holds:
> I think this design can work effectively without much changing the compositor API, but if it really works as tiles under the hood, that opens up an intriguing possibility: the interface to applications might be redefined at a lower level to work with tiles.
Seems like if you can show Chrome and VLC both working fine, that'd be a great proof of concept!
I am curious if there is a solid data structure that could be used to collect updates together in such a way that updates could be coalesced into a collection of other updates efficiently.
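One simple candidate is a flat list of non-overlapping rects that coalesces transitively on insert; a real compositor might prefer band-based regions or a tree, so this is only a sketch (rect layout and names are illustrative):

```python
# Sketch of an update-coalescing structure: keep a small list of
# non-overlapping dirty rects, and fold each new rect into any rects
# it intersects, merging transitively. A rect is (x0, y0, x1, y1).

def intersects(a, b):
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def merge(a, b):
    """Bounding box of two rects."""
    return (min(a[0], b[0]), min(a[1], b[1]),
            max(a[2], b[2]), max(a[3], b[3]))

def add_update(region, rect):
    """Fold `rect` into `region`, coalescing with anything it overlaps."""
    for i, r in enumerate(region):
        if intersects(r, rect):
            region.pop(i)
            # The merged box may now overlap other rects; recurse.
            return add_update(region, merge(r, rect))
    region.append(rect)
    return region

region = []
for r in [(0, 0, 10, 10), (5, 5, 20, 20), (100, 100, 110, 110)]:
    add_update(region, r)
print(region)  # the two overlapping updates collapse into one rect
```

This is O(n) per insert for small n, which is usually fine because damage lists tend to stay short; the trade-off is that bounding-box merges can over-approximate the damaged area.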
Of course just slapping RT kernel in a modern system alone would not do much good. The whole system needs to be thought with RT in mind, starting from kernel and drivers through toolkits/frameworks and services up to the applications themselves. But ultimately the APIs for application developers should be nudging people to do the right things (whatever those would be)
The video systems in 8-bit home computers only worked because the entire system provided "hard realtime" guarantees. CPU instructions always were the same number of cycles and memory access happened at exactly defined clock cycles within an instruction. Interrupt timing was completely predictable down to the clock cycle, etc etc. Modern computers get most of their performace because they dumped those assumptions and as a result made timings entirely unpredictable (inside the CPU via caches, pipelining, branch prediction, etc etc...), and between system components (e.g. CPU and GPU don't run cycle-locked with each other).
But I don't want to underestimate the engineering challenges either. You'd have to make this stuff actually work.
Can someone enlighten me about the real benefits?
The first bit is that practically everything argued for here is in Android - and has been for a long time (~2013, last I worked on the topic in that context) - and it is still not enough: they will go (and are going) to higher refresh rates.
Even at the time there were crazy things like tuning the CPU governor to wake up when the touch screen detects incoming 'presence' (before you can get an accurate measurement of where and how hard the finger hit) - it's not like you are going to change your muscle intention mid-flight. Input events were specifically tagged so that the memory would be re-used rather than GCed and reallocated.
A sunny frame could go from motion to photon in 30ms. Then something happens and the next takes 110ms. That something is often garbage collectors and other runtime systems that think it is safe to do things as there is no shared systemic synchronization mechanism around for these things.
The second is that this is - ultimately - a systemic issue, and the judge, jury and executioner is the user's own perception. That's what you are optimising for. Treating it as a graphics-only thing is not putting the finger on something, it is picking your nose. The input needs to be in on it, audio needs to be in on it, and the producer needs to be cooperative. Incidentally, guitar-hero-style games and anything VR are good qualitative evaluation targets, but with brinelling-like incremental loads.
1. communicate deadlines to client so the renderer can judge if it is even worth doing anything.
2. communicate presentation time to client so animations and video frames arrive at the right time.
3. have a compositor scheduler that will optimize for throughput, latency, minimizing jitter or energy consumption.
4. inform the scheduler about the task that is in focus, and bias client unlock and event routing to distinguish between input focus and the thundering herd.
5. type annotate client resources so the scheduler knows what they are for.
6. coalesce / resample input and resize events.
7. align client audio buffers to video buffers.
8. clients with state store/restore capabilities can rollback, inject and fastforward (http://filthypants.blogspot.com/2016/03/using-rollback-to-hi...)
There is a bunch of more eccentric big stuff after that as well as the well known quality of life improvements (beyond the obvious gamedev lessons like don't just malloc in your renderloop and minimize the amount of syscall jitter in the entire pipeline), but baby steps.
Oh, and all of the above++ is already in Arcan and yet it is still nowhere near good enough.
It says a lot about how much garbage has been thrown into the Linux stack since 2007 that even effectively kernel-based, hardware-accelerated animation stutters today.
If done carefully, this would give you smoother animations at considerably lower power consumption than traditional triple buffering. It's not the way present APIs are designed today, though; one of my many sadnesses.
As I hinted, this is a good way to synchronize window resize as well. The window manager tells the application, "as of frame number 98765, your new size is 2345x1234." It also sends the window frame and shadows etc with the same frame number. These are treated as a transaction; both have to arrive before the deadline for the transaction to go through. This is not rocket science, but requires careful attention to detail.
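The transactional part can be sketched in a few lines; this is an illustrative model of the idea, not any real compositor's API (all names are made up):

```python
# Sketch of a frame-numbered resize transaction: the new client size and
# the new decorations (frame, shadows) are tagged with the same frame
# number, and neither is applied until both halves have arrived.
# Names are illustrative, not from any real compositor.

class ResizeTransaction:
    def __init__(self):
        self.pending = {}  # frame number -> parts received so far

    def submit(self, frame_no, part, payload):
        tx = self.pending.setdefault(frame_no, {})
        tx[part] = payload
        if "size" in tx and "decorations" in tx:
            del self.pending[frame_no]
            return ("commit", frame_no, tx)  # both halves arrived: apply
        return ("wait", frame_no, None)      # hold until the other half

tx = ResizeTransaction()
print(tx.submit(98765, "size", (2345, 1234)))           # waits
print(tx.submit(98765, "decorations", "frame+shadow"))  # commits
```

A real implementation also needs the deadline handling mentioned above: if one half misses the deadline, the whole transaction is deferred to a later frame rather than applied piecemeal.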
The full-screen electron window apparently displaced the window system compositor. So "compositing" specialized compositors can be useful.
Chromium special cased video, making it utterly trivial to do comfortable camera passthru, while a year+ later, mainstream stacks were still considering that hard to impossible. So having compositor architecture match the app is valuable.
That special case would collapse if the CPU needed to touch the camera frames for analysis. So carefully managing data movement between GPU and CPU is important. That's one design objective of google's mediapipe for instance. Perhaps future compositors should permit similar pipeline specifications?
My XR focus was software-dev "office" work, not games. So no game-dev "oh, the horror! an immersion-shattering visual artifact occurred!" - people don't think "Excel, it's ghastly... it leaves me aware I'm sitting in an office!". Similarly, with user balance based on nice video passthru, the rest of the rendering could be slow and jittery. Game-dev common wisdom was "VR means 90 fps, no jitters, no artifacts, or user-sick fail - we're GPU-tech limited", and I was "meh, 30/20/10 fps, whatever" on laptop integrated graphics. So... Games are hard - don't assume their constraints are yours without analysis. And different aspects of the rendered environment can have very different constraints.
My fuzzy recollection was chromium could be persuaded to do 120 Hz, though I didn't have a monitor to try that. Or higher? - I fuzzily recall some variable to uncap the frame rate.
I've used linux evdev (input pipeline step between kernel and libinput) directly from an electron renderer process. Latency wasn't the motivation, so I didn't measure it. But that might save an extra ms or few. At 120 Hz, that might mean whatever ms HID to OS, 0-8 ms wait for the next frame, 8 ms processing and render, plus whatever ms to light. On electron.js.
New interface tech like speech and gesture recognition may start guessing at what it's hearing/seeing many ms before it provides its final best guess at what happened. Here, low-latency responsiveness is perhaps more about app system architecture than pipeline tuning, with app state and UI supporting iterative speculative execution.
Eye tracking changes things. "Don't bother rendering anything for the next 50 ms, the user is saccading and thus blind." "After this saccade, the eye will be pointing at xy, so only that region will need a full resolution render" (foveated rendering). Patents... but eventually.
If you don't care that much about latency, there's a clear benefit in never having windows getting "damaged" just because other stuff has temporarily occluded them. You will never get "ghost" images because one app has locked up or is slow. No flurry of "WM_PAINT", "WM_NCPAINT" or whatever equivalent just because you are moving windows. No matter how fast your system is, any 'movement' is much more fluid (as position changes are almost free).
Essentially, having windows completely isolated from one another - except when they are "composited" back - simplifies a lot. There's a lot of cruft and corner cases that you don't have to care about.
There are features that can enhance usability that are trivial to do with compositors but difficult without - for instance, showing a window "preview" when you are switching apps or when you mouse over a window in the taskbar. While you can do this without a compositor, with one you get it very easily and almost for 'free'.
And, as you point out, you can post-process and apply whatever effects you want. But this is a minor detail and was used initially as a selling point.
Of course, for a tiling window manager none of this matters as they never overlap. But most users don't use tiling window managers.
Compositors do not leverage the GPU much beyond drawing a bunch of textured quads for the final image; those textures are assembled on the CPU (some toolkits have GPU backends, but even that goes back and forth to the CPU, because not everything is done through the GPU).
It would need a big rewrite of the GUI stack to actually take advantage of GPUs with whatever service/server handles it to have a full scene tree with all the contents of the desktop resident (and whenever possible, manipulated) in GPU memory.
As things are nowadays, however, at best you get a mix of the two (and that itself isn't without its drawbacks too).
Isn't that why X11 provided an (optional) backing store?
And TBH IMO the choice of window system affects the day-to-day operations of a desktop-oriented distribution way more than the init system ever did, so I expect slightly more friction about removing Xorg (as a side note, the server is Xorg, X11 is the protocol, and another server - most likely a fork of Xorg - can continue providing it if Xorg shuts down).
I’d really like to get 60 Hz window resizing as well, as in iPadOS’s split screen, but even compositing tiling window managers like PaperWM and sway struggle with this at the moment.
I do get vsync tearing though (skylake i915) and I found that running a compositor (compton I think) fixed it. Not that I use it haha.
The semantics of JS itself make certain optimizations hard. Even a simple VM can do arithmetic faster than no-JIT Node.js and JSC: https://godbolt.org/z/5GbhnK (an example I've been working on lately) (I'm interested in no-JIT perf because that's what you're allowed to do in your own runtime on iOS). LuaJIT is the best at this, and even there Lua's semantics prevent certain optimizations--https://news.ycombinator.com/item?id=11327201
I think you can do a lot better with a bytecode that is more 'typed', like in SPIR-V: https://www.khronos.org/registry/spir-v/specs/1.0/SPIRV.html...
And if you are gonna say that VSCode is fast, you have no idea what truly fast software feels like - if you come across a QNX 6 demo ISO, try it and be amazed at how fast your computer answers to literally everything.
Microsoft's React Native team has charts where it shows Electron has 300x slowdown versus native applications.
Also, my anecdotal experience writing some audio algorithms in pure JS is that it's ... not an order of magnitude, but not far away, from the performance of the same code written in C++ with std::vector<double>s. Likely there's some magic that could be done to improve it, but the C++ code is written naïvely and ends up extremely well optimized.
It sounds like you don't understand how a browser works. Performance issues are typically due to the browser runtime environment which includes a slow and complex DOM.
Counter to this narrative the DOM is very fast, and compared to many other platforms it's much faster at displaying a large number of views and then resizing and reflowing them.
ex. Android warns when there's more than 80 Views, meanwhile Gmail is 3500 DOM elements and the new "fast" Facebook is ~6000 DOM elements. Neither of those apps spends the majority of their time in the browser code on load, it's almost entirely JS. Facebook spends 3 seconds running script on page load on my $3k Macbook Pro. That's not because the DOM is slow, that's because Facebook runs a lot of JS on startup.
If you cut down to the metal browsers can be quite fast, ex. https://browserbench.org/MotionMark/
I'll bet half the coreutils are a minimum of 10x slower in their day-to-day use cases (i.e. small n) when re-implemented in any interpreted language like JS. Cold start overhead matters a lot for CLI programs.
asm.js is what you mean?
If so, you're no better off than hand-writing the assembly yourself.
People claiming that JS can be made as fast as C have no idea what they're saying. Even Java with a hand-tuned VM is easier to scale for such loads than a JS server reduced to 1k lines of vanilla JS using every performance trick Node.js provides.
My point was never to say that JS is in general as fast as C, only that a careful dev that is familiar with the VM's operation can get close for some tasks.
Well, I suppose that's way more feasible on RISC CPUs.
Damn I hate when Apple is right about something.