For those of you who are new to movie codecs, here is some info that might be useful:
Roughly: videos aren't streams of complete images shown one after the other (well, Motion JPEG is, but ignore that). Instead there are "key frames", i.e. full images, followed by sets of blocks plus motion vectors that shift those blocks around to produce a moving picture. That's why you'll see I, P, and B frames: each has a different role, either carrying a full image from which the following frames are reconstructed, or the block and vector data that makes up the in-between frames.
Another key piece is the discrete cosine transform.
Most image and video codecs (JPEG, MPEG, and AV1 too) use the DCT or a related technique. It's a very simple idea at its core: the algorithm looks for an equation that, when plotted, looks somewhat like the original bitmap. The set of possible equations is picked so that they can produce lots of different complex patterns while being compactly represented.
In more detail: a sum of sine waves can describe discrete data exactly, and vice versa. It's related to how a digital-to-analog converter can reconstruct the original analog waveform from discrete samples, exactly within the limits of the sampling rate and bit depth. Once the data is transformed into a sum of sine waves, you can drop the minor terms with relatively little effect on the output. The more terms you drop, the higher the compression, the more blobby and repetitive the pattern, and the blurrier and less detailed the result.
This technique is also used in lossy audio compression.
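To make that concrete, here's a tiny pure-Python sketch (not how real codecs are implemented, just the underlying math): take the DCT of a short signal, zero out the minor coefficients, and transform back.

```python
import math

def dct(x):
    """Orthonormal DCT-II: one coefficient per cosine frequency."""
    N = len(x)
    def scale(k):
        return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)
    return [scale(k) * sum(x[n] * math.cos(math.pi * (n + 0.5) * k / N)
                           for n in range(N))
            for k in range(N)]

def idct(X):
    """Inverse: sum the cosine waves back up, weighted by the coefficients."""
    N = len(X)
    def scale(k):
        return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)
    return [sum(scale(k) * X[k] * math.cos(math.pi * (n + 0.5) * k / N)
                for k in range(N))
            for n in range(N)]

signal = [3, 5, 8, 10, 9, 7, 4, 2]      # a smooth little bump
coeffs = dct(signal)

# Keep only the 3 largest-magnitude terms, drop the rest: lossy compression.
keep = sorted(range(len(coeffs)), key=lambda k: -abs(coeffs[k]))[:3]
lossy = [c if k in keep else 0.0 for k, c in enumerate(coeffs)]

restored = idct(lossy)
# The overall shape survives; only fine detail is lost.
print(max(abs(a - b) for a, b in zip(signal, restored)))
```

Keeping all eight coefficients reconstructs the signal exactly; dropping five of them still gets you within a fraction of a unit of the original.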
Imagine you’re trying to encode the topography of a hilly landscape faithfully.
If you use 8 bits to represent the height you can have 256 distinct levels, which is reasonably smooth — like a Minecraft map.
If you tried to cut this down to just half, 4 bits, you'd have only 16 levels! That looks very ugly: big blocky staircases instead of nice smooth hills.
So what to do to compress this data better?
One approach is to find the nearest sine waves to the shape of the hills. Use a big sine wave for the big hills and then add on small sine waves to represent the smaller bumps and ridges.
There is a way to do this so that you end up with roughly the same amount of data as the original 8-bit encoding and get essentially the same output back.
Now if you throw away half of this sine wave data, you still get smooth shapes, because sine waves are inherently smooth! Instead of turning blocky, the map becomes slightly incorrect: hills stay roughly the same, but they shift around and might lose some fine detail.
Essentially, humans are very sensitive to staircase artifacts (even small amounts are noticeable) but fairly insensitive to the sine-wave kind of error. We can exploit this to squeeze more bits out of the data before the compression becomes visible.
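A toy numerical version of the hill analogy (made-up terrain, pure Python): halving the bit depth produces staircase error, while dropping the small sine term produces an error of similar size that stays smooth.

```python
import math

# A hypothetical 1-D "terrain": one big hill plus small bumps, heights 0..255.
N = 64
terrain = [128 + 100 * math.sin(math.pi * n / N)   # the big hill
           + 6 * math.sin(8 * math.pi * n / N)     # small bumps on top
           for n in range(N)]

# Option A: halve the bit depth (8 bits -> 4 bits = 16 levels).
step = 256 / 16
staircase = [round(h / step) * step for h in terrain]

# Option B: drop the minor sine term, keep only the big hill's wave.
smooth = [128 + 100 * math.sin(math.pi * n / N) for n in range(N)]

err_a = max(abs(a - b) for a, b in zip(terrain, staircase))
err_b = max(abs(a - b) for a, b in zip(terrain, smooth))
# Both errors are single-digit, but A is jagged steps and B a gentle ripple.
print(err_a, err_b)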
Sine and cosine waves have a useful property: you can approximate a signal just by taking its dot product with these basis functions to get a list of coefficients, and then multiply those coefficients back onto the basis functions to get the original signal. Not every family of functions works as a basis like this; it works here because the cosine patterns are orthogonal to each other.
If you look at the standard chart of the 64 8x8 DCT basis patterns, the upper-left one is all white: that's the "DC" (Direct Current) basis, a constant. As you go right and down, the patterns increase in horizontal and vertical frequency.
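You can generate those patterns yourself. A pure-Python sketch (leaving out the normalization factors real codecs apply):

```python
import math

N = 8  # JPEG works on 8x8 blocks

def basis(u, v):
    """The (u, v) 2-D DCT basis pattern: an 8x8 grid of cosine products."""
    return [[math.cos((2 * x + 1) * u * math.pi / (2 * N)) *
             math.cos((2 * y + 1) * v * math.pi / (2 * N))
             for x in range(N)]
            for y in range(N)]

dc = basis(0, 0)
# The upper-left (DC) pattern is flat: every entry is 1.0 ("all white").
assert all(val == 1.0 for r in dc for val in r)

# The (1, 0) pattern varies left to right: one half-cycle of a cosine.
row = basis(1, 0)[0]
print([round(v, 2) for v in row])
```

Rendering each pattern as a tiny grayscale tile, with u and v from 0 to 7, reproduces that familiar 8x8 chart of wiggle patterns.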
So the encoder takes all the coefficients and quantizes them, most coarsely the high-frequency ones, to save bits. That's why JPEGs often have ringing / rippling artifacts: an edge stays sharp but has waves coming out on either side.
If you quantize the coefficients enough, then some of those bottom-right ones end up quantizing to zero. So JPEG encoders run a lossless compression step on the coefficients to squish all the zeroes and small values together. You can crunch a JPEG smaller by replacing this lossless compression with a newer algorithm.
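A rough sketch of that pipeline (the coefficient and table values here are made up, and a toy run-length code stands in for JPEG's real zig-zag + Huffman coding):

```python
# Hypothetical DCT coefficients for one 8x8 block, in zig-zag order:
# big values first (low frequency), tapering off toward high frequency.
coeffs = [620, -48, 31, -12, 9, -7, 5, 4, -3, 3, -2, 2, 2, -1, 1, 1] + [0] * 48

# A quantization table that scales up with frequency, so the small
# high-frequency coefficients divide down to zero.
quant = [16, 11, 12, 14, 14, 16, 20, 24, 28, 32, 40, 48, 56, 64, 72, 80] + [99] * 48

quantized = [round(c / q) for c, q in zip(coeffs, quant)]
print(quantized[:8])   # the big low-frequency values survive
print(quantized[8:])   # the rest collapse to zero -> cheap to store

# The lossless step then squishes the zeros, e.g. a simple run-length code:
def rle(values):
    runs, i = [], 0
    while i < len(values):
        j = i
        while j < len(values) and values[j] == values[i]:
            j += 1
        runs.append((values[i], j - i))
        i = j
    return runs

print(rle(quantized))  # one long (0, n) run replaces dozens of stored zeros
```

The "crunch a JPEG smaller" trick is exactly this last step: decode the quantized coefficients losslessly, then re-encode them with a better entropy coder than 1992-era Huffman tables.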
And the decoder just inflates those coefficients and multiplies them by the same basis functions to get the bitmap back.
There are details in the middle I don't understand, like loop filters and de-blocking filters to hide the 8x8 block artifacts, but the heart of it is just "take a dot product with these functions to encode, multiply those dots with the same functions to decode".
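That heart fits in a few lines. A sketch using a 4x4 block instead of JPEG's 8x8 (same math, just smaller), with properly normalized basis patterns so the roundtrip is exact:

```python
import math

N = 4  # a small block instead of JPEG's 8x8, to keep the sketch short

def scale(k):
    return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)

def basis(u, v):
    """Orthonormal 2-D cosine basis pattern (u, v), flattened to a vector."""
    return [scale(u) * scale(v)
            * math.cos((2 * x + 1) * u * math.pi / (2 * N))
            * math.cos((2 * y + 1) * v * math.pi / (2 * N))
            for y in range(N) for x in range(N)]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

block = [52, 55, 61, 66,
         70, 61, 64, 73,
         63, 59, 55, 90,
         67, 85, 69, 72]  # 16 pixel values

# Encode: one dot product per basis pattern.
coeffs = {(u, v): dot(block, basis(u, v)) for u in range(N) for v in range(N)}

# Decode: multiply each coefficient back onto its basis pattern and sum.
decoded = [sum(coeffs[(u, v)] * basis(u, v)[i]
               for u in range(N) for v in range(N))
           for i in range(N * N)]

print(max(abs(a - b) for a, b in zip(block, decoded)))  # ~0: exact roundtrip
```

With no quantization the roundtrip is lossless; the compression only comes from the quantize-and-entropy-code steps in between.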
Thank you so much! I always wondered about that. The relevant bit, since it's near the bottom:
> The final step was to bring it all together with Periodic Intra Refresh. Periodic Intra Refresh completely eliminates the concept of keyframes: instead of periodic keyframes, a column of intra blocks moves across the video from one side to the other, “refreshing” the image. In effect, instead of a big keyframe, the keyframe is “spread” over many frames. The video is still seekable: a special header, called the SEI Recovery Point, tells the decoder to “start here, decode X frames, and then start displaying the video”–this hides the “refresh” effect from the user while the frame loads. Motion vectors are restricted so that blocks on one side of the refresh column don’t reference blocks on the other side, effectively creating a demarcation line in each frame.
> Immediately the previous steps become relevant. Without keyframes, it’s feasible to make every frame capped to the same size. With each frame split into packet-sized slices and the image constantly being refreshed by the magic intra refresh column, packet loss resilience skyrocketed, with a videoconference being “watchable” at losses as absurd as 25%.
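A toy model of that sweeping column (nothing like x264's real implementation, just the bookkeeping idea):

```python
# Periodic Intra Refresh sketch: instead of one big keyframe, a column of
# intra-coded blocks sweeps across the frame, one column per frame.
WIDTH = 8                    # block columns in a hypothetical tiny frame
last_intra = [-1] * WIDTH    # frame number when each column was last refreshed

for frame in range(16):
    refresh_col = frame % WIDTH   # the column being intra-coded this frame
    last_intra[refresh_col] = frame
    # Motion vectors to the left of the column may only reference
    # already-refreshed data; vectors must not cross the demarcation line.

# After one full sweep, every column has been refreshed at least once,
# so a decoder that joined mid-stream now has a complete picture.
print(last_intra)
```

This is why the SEI Recovery Point says "decode X frames before displaying": X is one full sweep, after which no stale data remains on screen.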
Yeah. And that's why TV stores really like slow-motion shots or static landscapes to show off the TV. Any motion will cause "HDTV blur" as the encoder struggles to describe complex motion with the limited number of bits it's allowed to use.
Stuff like static, film grain, particles like snow or rain, those all suck up bits from the same encoding budget.
This could be a problem for video game streaming, and it could affect the artistic decisions a game studio makes: drawing a billion tiny particles on a local GPU will look crisp and cool, but asking a hardware encoder to encode those for consumer Internet (or phone Internet) might be too much. I think streamers have run into this problem already.
I think there are certain tune settings that work slightly better depending on what you're encoding. x264 has a "touhou" tune that should work slightly better for confetti and things like it.
Many streaming services sidestep this by generating grain on the client device rather than encoding it in the video, though that may just be to make screen recording more annoying.
That's also the case for software encoders. Hardware encoders do it faster, with the caveat that they can only encode in pre-determined ways, but whether hardware or software, what happens and what you get are fundamentally the same.
https://en.wikipedia.org/wiki/AV1