H.264 is magic (2016) (sidbala.com)
680 points by dpeck on May 24, 2019 | hide | past | favorite | 180 comments

The interesting thing about H.264 is that it achieves excellent compression rates not by a single "miracle trick", but by combining multiple small-gain techniques. Every technique lets you gain on the order of 1-3%, but when you combine them all, you get fantastic results.
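A back-of-the-envelope sketch (in Python; the numbers are purely illustrative) of how many small gains compound:

```python
# Illustrative only: small, independent bitrate savings compound
# multiplicatively. Twenty techniques saving 2% each cut the total
# bitrate by about a third, not just by 2%.
def combined_savings(per_technique_gain, n_techniques):
    remaining = (1.0 - per_technique_gain) ** n_techniques
    return 1.0 - remaining

print(f"{combined_savings(0.02, 20):.1%}")  # -> 33.2%
```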

Another tidbit of (potentially) interesting information: it took many years for H.264 to achieve its full potential, for two reasons:

1) Implementing a decoder is really, really complicated, especially if you want it to run fast, and if you think a miracle "hardware decoder" solves the problem, you need to remember that H.264 places heavy requirements on memory bandwidth and caching (B frames), so it's not like you can just feed a stream through a chip and get video frames on the other side.

2) Implementing an encoder is really, really complicated :-) — and there are many ways to encode a video stream. H.264 delivers the tools, but does not specify how you should use them. It took many years for decoders to get to a state where most of H.264 is used efficiently.

It's a tremendous effort simply to ensure standards compliance. There are so many knobs and dials on H.264 encoders, and so many features in each profile, that it's hard to appreciate the compatibility we have today across devices - especially considering how much is in silicon.

I remember sponsoring a feature in x264 (the best software encoder) and the level of complexity in that code is mind bending. The patch was to allow better CPU utilization when running with over 16 cores - something the authors hadn't run into at the time and had no systems to test on. I did a simple profile during encode, sent them the trace, in 2 days the patch was tested and working and it hit main in a week's time. Boom, all 48 cores used.

Made me realize just how much talent is behind these technologies, and I still shake my head at how much money is generated with this and how little of that makes its way back to the developers at the core of it.

> Made me realize just how much talent is behind these technologies, and I still shake my head at how much money is generated with this and how little of that makes its way back to the developers at the core of it.

If it's any consolation, not much more makes it to the talented developers in commercial products, either. It mainly ends up going to executives and sales.

I think in New York and California at least, developers are getting paid closer to what their work is worth. Salaries often hit the $300k ballpark for highly talented developers, and average pay is now around $150k.

I think it's primarily developers in other countries[1] that are getting severely underpaid. Hopefully that changes in the coming years.

[1] The most egregious example might be India. Around 7 years ago, I was on a plane from Kerala (a state in India), and the guy sitting next to me happened to run a software consulting company (in Kerala). His customers were from Japan and some other countries (can't remember), and his company wrote Linux drivers for proprietary hardware. He hired kids right out of college. I asked him how much he paid them, and he said without blinking, Rs. 15000 per month. Type in "INR 15000 to USD" to Google, and you're in for a shock. It's around $2600 per year, but probably closer to $3000 per year at the exchange rates back then.

American developers doing the same job could earn up to 100 times that, and at the very least 40 times that amount. I can't understand why on Earth salaries are 40 to 100 times higher for the exact same work, work that could be done by anyone anywhere on the globe, but I hope this massive difference in pay converges and disappears soon.

What kind of developers are you referring to? I’ve noticed a huge discrepancy between companies and industries.

Demand and supply right. How many of these developers started a profit sharing company? As a dev you sign up your life to work for another company where you trade time for money. You sign on that sheet. That's the contract. The company can make bazillions off you but that's the deal you signed.

I really, really hope that with the recent IPOs, all the devs that made good money end up doing startups that are more employee-friendly than VC-friendly.

But then, this is the ugly world we live in. The selfish gene yo!

x264 is really amazing because it has an amazing team of knowledgeable people who are personally interested in its success. x265 seems to be just some commercially run open source project and I'm not sure if it will ever make as good use of the spec as x264 does. For a time x265 was only about as good as x264 regarding subjective quality per bit.

>something the authors hadn't run into at the time and had no systems to test on.

Was it Dark Shikari? I remember a lot of threading optimisation being done, but that was... the ~2010 time frame.

I really hope he is still working in the video encoding scene.

Yes, it was Jason, in 2012 IIRC. The sliced-thread lookahead patch, circa May 2012. He was working at Gaikai at the time, doing amazing things with cloud game streaming.

I don't fully agree that an H.264 system is extremely hard to implement, because it has some nice properties:

- once you have a reference implementation, you can have exact input/output pairs, which makes development easy, especially if the reference implementation exposes intermediate results.

- because of the previous point, testing is easy.

- even if you don't have a reference implementation, you can start by building a slow version of an encoder/decoder, generate testing data (intermediate outputs), and step by step optimize the system, while ensuring the system produces the same outputs at every step.

Have you tried to implement one? :-) I worked with a team who did a full implementation from scratch (commercially), and I optimized some parts for certain architectures.

The problem is not in getting the output. The reference decoder decodes just fine. The difficult part is getting the output on time, i.e. getting your decoder to run fast.

The reference decoder has not been designed for speed, and getting things to run fast requires architecting everything very differently: it's not just a matter of applying micro-optimizations to reference code.

Also, the "intermediate results" might be different for each decoder architecture, so you can't really rely on the reference code for those.

So yes, while testing is indeed easy, writing a fast decoder is definitely not a "step by step optimization" process. I'd bet every performant decoder out there was rewritten almost completely at least twice.

Which are the fastest freely available decoders, not necessarily open source?

Is it really deterministic mapping from input to output, for given configuration options? I imagine you can build different algorithms that choose what data can be removed, with different tradeoffs between encoding time, decoding time, memory consumption, output file size and experience of quality.

The decoder is completely deterministic. In fact, the normative part of most MPEG standards is an implementation of the decoder in C-like pseudocode (which is horribly slow because it is not optimized in any way and mostly deals with discrete bits).

On the other hand, how the encoder works is not specified at all, apart from the obvious requirement that it has to produce something decodable by the above-mentioned decoder.

I run XBMC on an old laptop, and it struggles but manages to play X.264-encoded videos.

However, it doesn't handle X.265 videos very well - pauses every few frames.

How come? I'm assuming it's because 265 is much more CPU intensive.

What's the typical compression benefit of using x.265 vs x.264?


Update: noticed this comment below from @Causality1 "devices from over a decade ago have hardware h264 decoders. Extremely low powered modern devices can play very large 265 videos, if they have a hardware decoder."

x265 is not using HW for decoding so it's very heavy on the CPU.

Please don't confuse "x265" and "H.265 / HEVC"

At least on paper, some hardware H.265 decoding has been supported on Intel since Haswell (2013), on NVIDIA since Maxwell 2nd gen (2015), and on AMD since Polaris (2016).

Is it possible to (ab)use that hardware support to do other tasks by mapping them to a decoding task?

Dunno. Why would you want to? You can already compute lots of stuff on GPUs.

A couple of generations ago, AMD did it the other way: they used GPU shader ALUs for video encoding/decoding. They stopped doing that because the power efficiency is worse than the custom silicon that's now in all modern GPUs, even in mobile SoCs.

Where's the reference implementation? And is there actually more than one competing implementation out there?

> is there actually more than one competing implementation out there?

Yes, there are many software decoders out there, usually specialized for certain architectures (the team I worked with wrote a decoder for Texas Instruments DSPs, for example). Then there are hardware implementations, which are rarely just "hardware", it's usually hardware-assisted software that does the decoding.

Interesting introduction. Too bad it doesn't really get into some of the more complicated things an advanced H.264 encoder like x264 is doing, e.g. adaptive quantization methods. There's also a mistake or two, for example

> P-frames are frames that will encode a motion vector for each of the macro-blocks from the previous frame.

Actually, P-frames can use data from multiple previous frames, not just the last one.
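A minimal sketch of motion-compensated prediction with multiple reference frames (integer-pel only; the function and variable names are mine, not from the standard, and real H.264 also does quarter-pel interpolation):

```python
import numpy as np

def predict_block(references, ref_idx, mv, x, y, size=16):
    """Predict a macroblock by copying a block from a chosen reference
    frame, displaced by the motion vector (integer-pel only here)."""
    ref = references[ref_idx]          # P-frames may pick among several refs
    dy, dx = mv
    return ref[y + dy : y + dy + size, x + dx : x + dx + size]

# Two reference frames; the encoder signals which one each block uses.
refs = [np.zeros((64, 64), np.uint8), np.full((64, 64), 128, np.uint8)]
block = predict_block(refs, ref_idx=1, mv=(0, 0), x=16, y=16)
print(block.mean())  # 128.0 -- copied from the second reference
```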

I think it's worth pointing out, as well, that even though we've technically surpassed H.264 with newer codecs like H.265, VP9, and AV1, all of which are roughly 20%-50% more efficient, this has required tremendous increases in encoding complexity. H.264 is special - it seems to occupy a kind of inflection point on the complexity / efficiency curve. It's far more efficient than previous codecs like Xvid, WMV, and so on, but at the same time even a lot of underpowered devices from over a decade ago can easily play it. We're not likely to see tradeoffs that good again in the video codec space.

When you control for image quality VP9 outperforms H.264 in single-threaded decoding:


You get the same image quality at a lower bitrate and faster decoding.

VP9 encoding is much slower, but Intel's SVT-VP9 encoder is getting over 300 frames per second on the right hardware:


You're measuring the best VP9 encoder for image quality, and then pointing to SVT-VP9, the fastest encoder, which isn't optimised for quality.

The VP9 encoder also tends to smooth (clean) the video rather than working with the noise (not a problem for most short internet videos), something that, even today, x264 is very, very good at, and possibly the best.

I made two separate, unrelated statements. In any case, libvpx isn't the best VP9 encoder for quality. EVE is.

The point is that you can select the VP9 encoder that makes the most sense for your use case. You can choose from EVE, libvpx, Intel SVT-VP9, Intel Quick Sync, NGCodec, mobile device encoders, etc.

That's an interesting claim. I have to say it doesn't match my experience at all, e.g. when playing VP9 videos on Youtube. Several thoughts:

* I'm not sure what the "ffh264" decoder they're talking about is. My understanding is that pretty much everyone is using libx264 for both encoding and decoding. Maybe the latter is more performant?

* Using SSIM to pick "equal quality" files will probably not give very accurate results.

* SVT-VP9 is, if I'm not mistaken, a hardware encoder. One of the issues with these is that they require some significant tradeoffs to get better speed, so they'll have worse quality than a software encoder at the same bitrate. If you go with software you really pay the price when it comes to encoding time.

Some corrections:

ffh264 is FFmpeg's decoder. libx264 only does encoding, not decoding.

SVT-VP9 is a software encoder, specially optimized for Intel Xeon Scalable and Intel Xeon D processors.

libx264 isn't used as a decoder, it only encodes.

I thought SSIM was not a very good metric for measuring quality? My impression has always been that the VP9 encoders optimize for these simple “objective quality metrics”, therefore they do well on those metrics. But encoders like x264 don’t, so they do not do as well on those metrics.

This is true according to my observations. And really, nobody cares about "objective" quality (except for quick feedback during development, I guess). It has to look good.

And VP9 is actually usable on TI Sitara platforms, while H.264 and H.265 cause 100% CPU load and still stutter.

> H.264 is special - it seems to occupy a kind of inflection point on the complexity / efficiency curve.

15 years ago there were people who said exactly the same thing about MPEG4-ASP (e.g. DivX/Xvid). It all comes down to Moore's law and hardware acceleration.

Another thing about hardware acceleration is that on many platforms you really did not want to push 622 Mbit/s worth of raw chroma-subsampled YUV FullHD video across whatever interconnect to your GPU, so moving things like motion compensation into GPU hardware was the only way to actually play FullHD content. And this is also part of the reason why H.264 is somehow special today.
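The 622 Mbit/s figure checks out for 25 fps 4:2:0 content:

```python
# Raw 4:2:0 (chroma-subsampled) 1080p video at 25 fps:
width, height, fps = 1920, 1080, 25
bytes_per_pixel = 1.5                      # Y at full res + U,V at quarter res
bits_per_sec = width * height * bytes_per_pixel * 8 * fps
print(f"{bits_per_sec / 1e6:.0f} Mbit/s")  # -> 622 Mbit/s
```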

> 15 years ago there were people who said exactly the same thing about MPEG4-ASP (e.g. DivX/Xvid). It all comes down to Moore's law and hardware acceleration.

There are always format fanboys but it really wasn't anything like that because the MPEG4-ASP experience still had problems for the average user (it wasn't a big win on file size/quality, compatibility was not a given, people still had to pay for licensed players, etc.).

In contrast, I think H.264 hit the threshold where it's good enough for the average person and you basically never hit the case where a file doesn't work for the person you shared it with because even though it's license-restricted, most operating systems and browsers include support. Moving away from that is going to be hard because most people will not see much advantage — AV1 is clearly better on file-size but most people aren't hitting limits on a regular basis and the feedback loops are often missing (e.g. you hit play in Netflix on your phone — how many people would have any way to tell how much of their data plan that used?).

>you hit play in Netflix on your phone — how many people would have any way to tell how much of their data plan that used?

Instead of focusing on the client, what about the server? How happy would Netflix be to reduce their outbound traffic by 20%? I'm guessing it's quite happy.

Oh, don’t get me wrong: I love better codecs and I want them to happen. I’m just with the person upthread who was saying H.264 will stick around: Netflix and YouTube can upgrade rapidly because they control the clients and have robust transcoding pipelines: if it saves 10% on bandwidth costs they’ll recoup the work quickly. It’s upgrading everyone else that’ll take a while, especially since so much video is natively produced as H.264.

Netflix has already been using HEVC for a while.

Yes - that’s why I used them as an example because they’ve published details about the benefits and how they compare codecs.


Also the comparison to a PNG instead of a JPEG was a bit disingenuous since PNG is a lossless codec. I understand it was done to sound impressive but if you compare lossless PNG to a codec that just reuses the same frame and uses a lossy codec on top, it's not surprising the video will be smaller.

I'm not convinced.

For an article talking about the magic of compression, comparing to uncompressed png makes a lot more sense than, say, starting your comparisons to something that's already compressed like jpeg.

That PNG is a much closer pedagogical relation to the previously discussed "3D array of pixels, 2 in space and one in time" that he's building on.

PNG is compressed, just losslessly. A bitmap image is an uncompressed image format (which is why you don't see them around much anymore).

But given that the PNG comparison was shown before an extensive discussion of how lossy encoding works by discarding detail that is hard to perceive, it's entirely appropriate as a starting point for this article.

That's because even devices from over a decade ago have hardware h264 decoders. Extremely low powered modern devices can play very large 265 videos, if they have a hardware decoder.

That's exactly true. I still remember the time I managed to have the decade-old iPhone 3GS play an extraordinarily large H.264 encoded video (around 40Mbps) and it absolutely had no problem doing so!

Yeah, I remember the first time I ever got an h.264 video, on a somewhat old processor. Couldn't play at all, and it was not a high resolution file. Whether things have a hardware decoder is going to be the biggest factor for a while, and there's plenty of room to improve.

Yes, and the reason is that the decoder was cheap to implement. Transistors aren't getting cheaper like they were in the old days. The complexity of anything post-H.264 is much higher, which also means a higher transistor count, a higher budget, and higher cost.

>it seems to occupy a kind of inflection point on the complexity / efficiency curve.

I believe it is one reason why MPEG is working on EVC, a Royalty Free Video Codec based on H.264 with a target of 20% improvement.

Hopefully the industry as a whole, mainly Apple and Google can settle on something that is better than H.264 and JPEG.

Right now Google will not support H.265, and Apple isn't fully on board with AV1 yet.

Correct me if I'm wrong, but EVC targets 4x the complexity of AVC for a 20% improvement. So it has more to do with being royalty-free than with any inflection point on the complexity/efficiency curve of H.264.

> EVC, a Royalty Free Video Codec based on H.264 with a target of 20% improvement.

EVC won't deliver much value. VP9 is already royalty-free, already outperforms x264, and already has broad support and use.

Apple should add VP9 support now (like everyone else has) and work to add AV1 support in the future.

HEVC in practice is 3x the encoding complexity for 40% to 50% bitrate reduction.

Overall, this article is a pretty good overview of how lossy image/video compression works. However, what the author describes as "quantization" is actually a low pass filter. With quantization, you do not simply zero out the high frequency components. You "snap" them to specific intervals. For example, if you have some data varying between 0 and 100, like [7, 39, 97, 42, 13], quantizing that by a factor of 5 would give you [5, 40, 95, 40, 15]. This gives you an approximation of the fine details, rather than simply throwing them away.
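The difference between the two can be shown in a few lines of Python (a toy sketch, not how H.264 actually stores coefficients):

```python
def quantize(coeffs, step):
    # "Snap" each coefficient to the nearest multiple of the step size.
    return [round(c / step) * step for c in coeffs]

def low_pass(coeffs, keep):
    # Zero out everything past the first `keep` (low-frequency) coefficients.
    return coeffs[:keep] + [0] * (len(coeffs) - keep)

data = [7, 39, 97, 42, 13]
print(quantize(data, 5))  # [5, 40, 95, 40, 15] -- detail approximated
print(low_pass(data, 2))  # [7, 39, 0, 0, 0]    -- detail discarded
```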

> Let's say you've been playing a video on YouTube. You missed the last few seconds of dialog, so you scrub back a few seconds. Have you noticed that it doesn't instantly start playing from that timecode you just selected. It pauses for a few moments and then plays. It's already buffered those frames from the network, since you just played it, so why that pause? Because you've asked the decoder to jump to some arbitrary frame, the decoder has to redo all the calculations - starting from the nearest I-frames and adding up the motion vector deltas to the frame you're on - and this is computationally expensive, and hence the brief pause.
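A toy model of the mechanism the quoted passage describes, with "frames" reduced to single numbers and B-frames ignored:

```python
def decode_at(frames, target):
    """Decode the frame at `target` by starting from the nearest
    preceding I-frame and applying each delta (P-frame) in order."""
    # Find the last I-frame at or before the target position.
    start = max(i for i, f in enumerate(frames[: target + 1]) if f[0] == "I")
    image = frames[start][1]                  # I-frame: a complete picture
    for kind, delta in frames[start + 1 : target + 1]:
        image = image + delta                 # P-frame: apply the difference
    return image

# Toy GOP: one full frame, then deltas ("images" are just numbers here).
gop = [("I", 100), ("P", 2), ("P", 3), ("P", -1)]
print(decode_at(gop, 3))  # 104 -- three deltas summed onto the I-frame
```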

Uh, is that actually what's going on there? Because for some reason, seeking is basically instant when I'm playing locally-saved videos, including ones I downloaded directly from Youtube and didn't re-encode.

Actually, this is one of the reasons why I download videos before watching them, instead of using the normal Youtube player.

Yeah. VLC for example doesn't have a frame-by-frame backward seek, because they argue it would blow up the memory consumption to keep all those frames around just in case you want to jump backward. And since videos are mostly hardware-decoded on the GPU these days, you'd be wasting all that memory in your VRAM.

Here's one of many threads where Jean-Baptiste Kempf addresses the feature request. https://forum.videolan.org/viewtopic.php?p=390778&sid=1a571d...

The amount of pausing is going to vary a lot by video. If you have an I-frame followed by 250 P-frames, there will probably be a noticeable pause.

It has been a long time since I used VLC—I generally use mpv on Windows and QuickTime X on Mac because I like their minimal UIs—but I remember it being instant too!

Side note, using VRAM to make seeking instantaneous seems to me like a perfectly good use of resources, at least if my device has the VRAM to spare. On my desktop, I have a 1080 Ti, and it mostly sits idle when I'm watching a video...

For what it's worth, mpv lets you choose which style of seeking you want with the seekbarkeyframes OSC option. Leave it on (the default) and mpv will seek to the nearest keyframe when you click on the seekbar. Switch it off, and mpv will seek to the most recent prior keyframe and silently decode to the exact point you clicked on, so you should get that pause as mentioned in the article. Depending on how quickly your computer can decode video, it should still be pretty fast.

You may want to revisit VLC. It now has the option to run with a minimal UI or even with no UI at all (just a video frame), and if you are brave enough you can also design a custom UI for it.

The only thing I really don't like about VLC is the traffic cone icon, but I get that's their brand so I have to live with it.

I'm another mpv user and I switched from VLC to mpv just because mpv is plain better than VLC. VLC would give me some slowdowns and would hang sometimes, whereas mpv has always played everything perfectly.

But VLC has many many more options and possibilities, if you need it.

This is kind of why I don't want to use VLC.

I want my media player to do exactly one thing: open the video or audio files I tell it to open, and play them. Plus the option to pause, seek, enable subtitles, or change audio tracks. Anything else is extraneous.

I'm something of a software minimalist. I've spent a lot of time finding and hiding UI elements on my Jailbroken iPhone. If I don't need something I really want it out of sight.

Like what?

You can change the icon on your desktop and in your application bar, possibly on the window bar too, surely?

A counterexample: using VRAM on my desktop's Intel iGPU seems like a waste when it already barely manages to decode 1080p. It heavily depends on whether it's worth it, and given that most machines are on the lower end of performance, the current choice is sane.

Yeah, that's why it ought to depend on how many resources are available.

On the other hand, VLC serves as a gentle reminder (or introduction) of how video compression works, every time you seek. ;-)

If it's not used otherwise it's not "wasted".

Another way to handle "it doesn't instantly start playing from that timecode you just selected" is to jump to a nearby I-frame and just start playback from there, instead of going exactly to the selected timecode. (How well this works depends on how frequent the I-frames are in your downloaded file. In the 90s it wasn't uncommon for I-frames to be minutes apart, but that's less common now.)

I've noticed my video player does this, but only when using the "5 seconds forward/back" arrow key commands. Tap the left arrow and you end up skipping back to the same moment over and over, even if you wait a bit.

Interestingly, YouTube doesn't display this behavior. I find it kind of annoying, to be honest. Sometimes there are reasons I want to go exactly the same duration back/forward, every time.

You can even go forward and backward frame by frame on YouTube with the . and , keys.

It would also presumably have to load that full frame over the network, so maybe that's part of the difference.

This is what I've always assumed was happening—but it's annoying! If I'm seeking backwards that means the frame has already been played once. Don't throw it away so quickly!

What the article suggests is that it didn't throw away the frame, it simply never had the complete frame that you're seeking to because it was in-between full frames.

Which should also happen if the video is being played locally, which is why I was confused.

Either Youtube caches the video data, in which case seeking should be as fast as offline players, or Youtube discards the data, in which case it has to be downloaded again. It certainly seems to me like Youtube is discarding the data. I wish they didn't do that.

When you click on Youtube's timeline, it seeks to the nearest I frame, not to the exact frame, so it avoids the problem altogether.

It depends on the player/platform, but generally the YouTube player does not avoid the problem altogether. There's more that goes into this, but you can experimentally verify this by playing with the seek bar in the browser. For me (on Chrome), it does a pretty good job at seeking to the time I requested, not to the nearest IDR-frame (which are usually a few seconds apart).

It might also be because youtube uses DASH?

Yeah, I see a whole bunch of additional network activity, suggesting that they're redownloading something.

I mean, I only have 32 GB of RAM. Wouldn't want to waste any of that on letting videos rewind.

1 minute of uncompressed 8-bit 1080p60 video is ~20.86GB. Unless you also plan to recompress it frame by frame into JPEG or something, your 32GB of RAM won't last long.
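The arithmetic, for reference:

```python
# One minute of uncompressed 8-bit RGB 1080p60:
width, height, fps, seconds = 1920, 1080, 60, 60
size = width * height * 3 * fps * seconds  # 3 bytes per pixel (24-bit RGB)
print(f"{size / 2**30:.2f} GiB")           # -> 20.86 GiB
```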

The replacement for network activity is obviously the compressed format. So it's more like 50-100MB, and the vast majority of the time the disk can fit the playing video.

Chroma subsampling is a terrible idea in the digital era. You should just assign a different compression factor to the chroma channels instead. I don't understand why we keep using this system of discarding 75% of the color data BEFORE applying the lossy compression algorithm. (Well I do know of one argument - apparently it reduces the amount of CPU time required to compress and decompress the data, but I think that's a pretty lame reason)
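For concreteness, a toy 4:2:0 subsampler (box-filter averaging is just one possible downsampling choice; real pipelines vary). Note how half of all samples, i.e. 75% of the chroma samples, are gone before any encoder runs:

```python
import numpy as np

def subsample_420(y, u, v):
    """4:2:0 subsampling: keep luma at full resolution, average each
    2x2 block of the chroma planes down to a single sample."""
    down = lambda p: p.reshape(p.shape[0] // 2, 2, p.shape[1] // 2, 2).mean(axis=(1, 3))
    return y, down(u), down(v)

y = np.zeros((4, 4)); u = np.arange(16.0).reshape(4, 4); v = u.copy()
y2, u2, v2 = subsample_420(y, u, v)
total_before = y.size + u.size + v.size    # 48 samples
total_after = y2.size + u2.size + v2.size  # 24 samples: half the data is gone
print(total_before, total_after)
```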

> You should just assign a different compression factor to the chroma channels instead [of subsampling chroma].

The good news is that we don't need to change the H.264 standard for this.

Supposing that:

- most of the already deployed embedded H.264 decoders support 4:4:4 profiles

- non-chroma-downsampled input content is available

Then, it's only a matter of changing encoder implementations to do exactly as you say.

I'm pretty sure there is plenty of 444 content. Movies are likely filmed/edited at 444, they are provided to theaters at 444, but they are generally only made available to consumers at 420.

I believe the reason is that most hardware decoders sold to consumers do not support anything above 420. As Blu-rays and such are all currently 420, the chip makers don't have much incentive to support 444. It's a shame because your TV fully supports 444, and the difference can be huge on some content.

There is a physiological reason that you can't resolve the chroma at the same resolution as the luminance.

But even after you decide to allocate << 50% of your information bandwidth to chroma, there may still be ways to use that allocation than are more effective (perceptually) than 2x downsampling.

This. Subsampling is kind-of the most primitive method of "lossy compression" there is.

Chroma subsampling is a natural way to leverage the information content produced by Bayer filters in most cameras.


Chroma subsampling is nice because it lets you "delay" some of the upsampling to a later device in the pipeline: your GPU outputs YUV 4:2:0 and your monitor hardware does the chroma upsampling. If you don't want that, use 4:4:4. The feature you're talking about, where you assign a different compression factor to different color planes, is already implemented in x264:


Fully agree. Seeing all the color conversion errors it can lead to, too... In the future it would be simpler for everyone to always use 4:4:4.

x264 supports encoding 4:4:4 (i.e. no chroma subsampling) video. There's some very obvious cases where it produces better results, but it's typically just slower with no upside.

The GP is talking about using 4:4:4 but having lower bitrates for chroma channels than luminance

That is still slower than subsampling.

But it can produce a result which looks better at the same bitrate. It’s not all about speed.

"but it's typically just slower with no upside" was the original statement.

Good point. When compressing a full video to a size target, I've always felt the 1080p, squashed version was still better than the 720p, less-squashed version.

Subsampling actually is a form of quantization, which in turn is a form of lossy compression.

A bad form of it, given that you already have the ability to reduce the bitrate per channel on the encoder side. It leads to a lot of issues, like pixel format conversions and decoder incompatibilities, for example.

But it isn’t free, so it’s a pity it’s taking over the world. Chrome plays H.264/MP4 because they are licensed. Firefox counts on OS support. (The Cisco plugin only supports WebRTC calls.) So you can’t legally watch MP4 videos on Linux.

Webm seems to be as good and free.

Can someone comment on whether I am mistaken about this?

A bit tangential, but you are mistaken in that you are confusing container formats and codecs. MP4 is a container format, which can contain, among other codecs, h.264 video. WebM is another container format, which can contain, among other codecs, VP9 video. MKV and AVI are other container formats, and h.265, VP8, and AV1 are other codecs. There are lots of codecs compared to container formats, and the same codecs can often go in multiple different containers (you can even convert encoded video losslessly to a different container format).

Licensing issues are almost all around codecs, not container formats.

It's not, but it actually won't be too long until the patents expire for it (a quick google says 2027). The same used to be true of mp3. It was ubiquitous, but annoyingly not open, and now it's both.

It’s all capped at a max payment and not terribly expensive.

VP9 and its new versions will take over but it will be a while until it’s implemented in hw fully.

Non-free standards are bad for open source and innovation in general.

The good thing about standards is everyone can have their own.

It is pretty amazing that H.264 is not one algorithm that just popped out and changed the world; instead it is an accumulation of tricks developed over several decades.

I distinctly remember a few folks saying that you would need to violate the laws of physics to transmit 1080p @ 60 fps over WiFi. Now we are all doing it many times over, and billions of dollars are being made every year. This is one of those unsung achievements worthy of Nobel-prize-level awards, but no one seems to know who these people were.

I wonder if there is any detailed history of how various components of H.264 came together, who led this effort, how projects were funded for such a long time?

Previously discussed:

[HN 2016] - https://news.ycombinator.com/item?id=12871403

I wrote a comment there about frame differencing and wanting to try writing a video decoder after having written a JPEG one. Now over 2.5 years later, I can say that I did write a toy H.261 decoder, along with one for MPEG-1 and the beginnings of another for H.262/MPEG-2 in my spare time, and it wasn't all that difficult.

In fact, because of the need to not only decode an image for each frame, but to do it quickly enough to keep up with the framerate on the limited hardware of the time, these early video codecs are in some ways simpler than JPEG --- e.g. all the Huffman tables are static, the colourspace and subsampling are either fixed or have only a small number of variations, etc.

As evidence in support of this, my JPEG decoder is around 750 LoC (in C), while the H.261 one is slightly smaller at just under 700 LoC. The MPEG-1 decoder is a bit more complex, at close to 1kLoC. I haven't finished the H.262 one and it is more complex (interlacing is a pain...), but it's probably doable in less than 2kLoC total for a decoder that understands both MPEG-1 and 2/H.262 --- the latter is somewhat a superset of the former. I wasn't really optimising for size, nor did I have a good idea of how much code it'd take before I finished, so these should be quite "realistic" estimates of complexity.

I've seen and given recommendations on here to write a JPEG codec as a learning exercise, and a search of GitHub reveals plenty of others who've done it; but the same situation doesn't seem to hold for video. Nonetheless, having done it and realised it's not that much work (i.e. should be doable in a weekend or two), I now recommend trying to write codecs for H.261 and MPEG-1 too. There's not that much media now in those two codecs, but if you get to MPEG-2, you can experience watching DVDs using your own code.

The complexity in H.264 gets you later on, when you try to build a complete (e.g. fully compliant) decoder. There are some killer features which seemed like a good idea at the time, but which make the complexity staggering.

MBAFF is a good example (Macroblock-Adaptive Frame/Field encoding). What this means is that the encoder can choose, for every block, whether it wants to encode it as a single frame or two interlaced fields. Combined with B-frames (and remember that B-frames reference frames both in the past and in the future), this makes for really interesting memory access patterns, which then kill your cache locality and murder performance.

One under-appreciated advantage of H.264: it's the last video codec where the artifacts look obviously artificial. Modern codecs have artifacts that look too much like real objects, so they take more brain power to ignore. After watching low bitrate H.264 for a few minutes I stop noticing the artifacts, which isn't the case with modern codecs.

What codecs are you talking about? I'm interested to see these artifacts for myself.

Comparison videos from: http://video.1ko.ch/codec-comparison/



At 15 seconds, in the VP9 version, it looks like smoke is pouring off the top-left metal pipe of the machine. It immediately draws my attention because it's so out of place. But in the H.264 version, the whole image is noisy, so nothing distracts me.

I'm not really sure which is which but the webm version looks way way way better to my eye. Not even any comparison really.

Very interesting. Do you have links on this subject (or could you write an article) ?

It's just personal observation. I noticed the same thing in audio artifacts with the RNNoise demo from xiph.org:


General question about video codecs: why are codecs tied to resolutions? Why can't some codecs be used for resolutions larger than some maximum? This is probably the one thing I never understood about codecs.

Mostly it's just because the standards say so. There's nothing about how MPEG-2 works that inherently makes it impossible to encode 4K video with it, the standard just says that MPEG-2 video shall not exceed 80Mbit/s, 1920 horizontal pixels, 1152 vertical pixels, or 62,668,800 luminance samples per second.

The limits are just there for interoperability. As is, if you encode a video that complies with MPEG-2 High Level, then you can be pretty confident that any MPEG-2 High Level decoder will decode it.

Without the limit everything becomes a mess. You're writing a decoder, what max res should you support? 4K? 8K? The people writing decoders don't agree and then people trying to distribute these high resolution files find that they work in some decoders but not others.
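To make those numbers concrete, here's a toy conformance check using just the handful of High Level figures quoted above (the real spec has a full table of profiles and levels with more constraints than this):

```python
# Sketch of an MPEG-2 "High Level" conformance check, using only the
# limits quoted above -- illustrative, not a complete level table.
MAX_BITRATE = 80_000_000    # bits/sec
MAX_WIDTH = 1920            # luminance samples per line
MAX_HEIGHT = 1152           # lines per frame
MAX_LUMA_RATE = 62_668_800  # luminance samples per second

def fits_high_level(width, height, fps, bitrate):
    return (width <= MAX_WIDTH
            and height <= MAX_HEIGHT
            and width * height * fps <= MAX_LUMA_RATE
            and bitrate <= MAX_BITRATE)

print(fits_high_level(1920, 1080, 30, 20_000_000))  # 1080p30 fits
print(fits_high_level(3840, 2160, 30, 20_000_000))  # 4K does not
```

A decoder vendor only has to provision buffers and throughput for these worst-case numbers, which is exactly the interoperability point being made.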

Thanks! What do you mean by "not being able to decode" something though? Do you mean time-wise (not enough bandwidth to keep up) or do you mean the algorithm would fail in a different way? The latter doesn't make sense to me, and the former seems like it's entirely system-dependent? Like isn't it the exact same problem if I have an underpowered computer?

It means hardware decoders won't be able to decode the videos. (Some software decoders that are overly rigid about following the spec may refuse as well, but most won't, and none actually have to from my understanding.)

You can absolutely use h.264 to encode 8K@120fps if you want to.

Ohh hardware decoders! Makes sense.

Not a codec expert, but I can see two reasons. The first is that some fields in the encoding are not big enough; for instance, a 12-bit field for the resolution in some header can represent only up to 4095 pixels (or 4096 if "zero pixels" is not allowed).

The second is that all video codecs I know of subdivide the image into small blocks and compress each block individually (mostly; there's some prediction from nearby blocks, and some mixing of the blocks at the end to smooth the edges between them), and older codecs only have blocks which are too small for higher resolutions. (Imagine, for instance, an 8x8 block in a 320x240 video, and then in the same video at 1280x960; the block clearly covers much less of the latter video, so there's less variation for it to work with. Increase the resolution too much, and small blocks would start acting more like individual pixels, and you'd lose nearly all the compression.)
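A toy illustration of the first point (the 12-bit width here is hypothetical, not taken from any real codec header):

```python
# Toy illustration: a fixed-width header field caps the encodable
# resolution. The 12-bit width is hypothetical, not from a real spec.
FIELD_BITS = 12
MAX_DIMENSION = (1 << FIELD_BITS) - 1  # largest value a 12-bit field holds

print(MAX_DIMENSION)          # 4095: a 4096-pixel-wide frame cannot be signalled
print(3840 <= MAX_DIMENSION)  # UHD's 3840 width would still fit
```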

The first reason seems possible, I guess, but not likely to be the only one. The second reason, though, should only imply that you lose the compression, not that you're unable to produce the file in the given format in the first place? What I'm referring to is that I recall encoder tools outright rejecting high resolutions for older formats.

A looser tie to definition (which is what I interpret your use of resolution as) is computation. Representing an 1080i stream in MPEG2 takes more bits per second than representing that same stream in MPEG4 or MPEG5. So if you keep the bandwidth constant, each generation has a rough maximum definition. MPEG 4 is great for HD, but MPEG 5 is probably what you want if you want to do UHD because the bandwidth won't increase.

I don't dispute the idea that a 10-bit field for the horizontal pixel count would be a significant barrier, but if it were super hard to come up with more efficient compression, some sort of extension would have been defined to allow bigger horizontal pixel counts.

Chroma subsampling is a terrible idea for non-photographic images. People are not "terrible" at color perception. I can clearly see its harmful effects on screen captures, line art, anime, 2D video games, or even Super Mario 64's hat.

Youtube's chroma subsampling makes colors bleed, turns Mario's hat into chunky red blocks that only approximately fill its outline, leaves screen captures discolored and grainy around text, and turns sharp colored anti-aliased lines into a discolored mess.

(Mario's hat was pixelated on pannenkoek2012's emulated SM64 videos at native 640x480. Maybe newer video decoders antialias the chroma channel, so Mario's hat is not pixelated but "merely" blurry.)

I upload oscilloscope videos with colored lines on black backgrounds, and stopped using brightly colored lines partly because color was being blurred and discolored by Youtube chroma subsampling.
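The effect is easy to reproduce. Here's a minimal sketch (plain Python, BT.601 full-range constants) that horizontally halves the chroma of a scanline of alternating pure-red and black pixels, 4:2:2-style (YouTube actually uses 4:2:0, which subsamples vertically too), and shows the red desaturating and bleeding into the black:

```python
# Minimal sketch of horizontal (4:2:2-style) chroma subsampling on a
# scanline of alternating pure-red and black pixels, BT.601 full-range.

def rgb_to_ycbcr(r, g, b):
    y  = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr

def ycbcr_to_rgb(y, cb, cr):
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    return tuple(max(0.0, min(255.0, v)) for v in (r, g, b))

row = [(255, 0, 0), (0, 0, 0)] * 4          # sharp red/black edges
ycc = [rgb_to_ycbcr(*p) for p in row]

# Subsample: keep every luma sample, average each pair of chroma samples.
for i in range(0, len(ycc), 2):
    cb = (ycc[i][1] + ycc[i + 1][1]) / 2
    cr = (ycc[i][2] + ycc[i + 1][2]) / 2
    ycc[i]     = (ycc[i][0],     cb, cr)
    ycc[i + 1] = (ycc[i + 1][0], cb, cr)

out = [ycbcr_to_rgb(*p) for p in ycc]
print(row[0], '->', tuple(round(c) for c in out[0]))  # red desaturates
print(row[1], '->', tuple(round(c) for c in out[1]))  # red bleeds into black
```

Full-resolution luma keeps the brightness edge sharp, but the colour smears across it, which is exactly what colored lines on a black background run into.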

I bought a laptop camera module and attached it to a USB cable; it pegged my Raspberry Pi's CPU at 90%, and soon I started receiving temperature warnings through that thermometer icon on the screen.

I wonder why webcams don't directly use the onboard graphics card.

Even the cheap webcams all suffer from the very same problem.

The reason I bought the laptop camera module is that I thought it would be higher quality at a lower price, and good for monitoring my 3D prints.

I've not found any Raspberry Pi camera with autofocus and the IR filter intact (I need it for full-light use and better colours during the daytime).

Try the Microsoft LifeCam range. AutoFocus, Supported by Linux, Really good image.

I like this high-level summary. One thing I'd change though is that the photo loses details in the frequency domain only. In real view it actually gains unnecessary details/noise - it's not that the grill details disappeared, it's the whole image that gained an anti-grill layer. (You can now see it on the smooth part of the laptop)

> You wouldn't say HHHHHHHHH. You would just say "10 tosses, all heads" - bam! You've just compressed some data! Easy. I saved you hours of mindfuck lectures. This is obviously an oversimplification

Oversimplification and exaggeration. Understanding Huffman encoding and deflate algorithm are just two five minute articles away.

It's also not compressed at all - the English prose is 21 characters, over twice as long. Also, the English prose requires at least ASCII, while the string of H's can be encoded with a single bit each. 168 bits > 10 bits.
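The counting argument is easy to see with a toy run-length encoder:

```python
# Toy run-length encoder for coin-toss strings, in the spirit of the
# "10 tosses, all heads" example: collapse each run into (count, symbol).
def rle(s):
    runs, i = [], 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1                  # extend the current run
        runs.append((j - i, s[i]))
        i = j
    return runs

print(rle("HHHHHHHHHH"))  # [(10, 'H')] -- one run instead of ten symbols
print(rle("HHTTTH"))      # [(2, 'H'), (3, 'T'), (1, 'H')]
```

As with all compression, it only wins when the data has runs: a maximally alternating string comes out longer than it went in.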

To be fair, it is faster to say, out loud, because our verbal "character set" is so much richer. A unit of pronunciation is roughly a syllable. "10 tosses, all heads" has only 5 syllables, while "ach ach ach..." takes 10. But this insight is a bit much to ask of someone new to the concept of compression.

It's difficult to describe but I read it more like converting the raw data (a list of results) to a generator. Akin to the "compression" described in this xkcd: https://www.xkcd.com/1155/

The chroma subsampling is why 4K video played on a 1080p/1440p monitor looks so much better, clearer, and more detailed: typical 1080p native encoding has less chroma detail to save size.

I've seen this article before, and this part always stuck with me:

> I captured the screen of this home page and produced two files:

> PNG screenshot of the Apple homepage 1015KB

> 5 Second 60fps H.264 video of the same Apple homepage 175KB

> Eh. What? Those file sizes look switched.

> No, they're right. The H.264 video, 300 frames long is 175KB. A single frame of that video in PNG is 1015KB.

The article does go into how and why this is possible (tl;dr H.264 is lossy, PNG is not), but for the difference in human-detectable quality, H.264 is astonishingly more efficient.

It's a good example but the difference doesn't quite have to be as dramatic as the article makes out.

Screenshot tools rarely produce optimised PNGs. A quick test of the image in the article shows that optipng (lossless) can bring the size down to ~570KB, or pngquant (lossy) to ~240KB. Zopfli could further compress the images produced by optipng or pngquant, but it's a case of diminishing returns at that point.

To be clear the article makes a good comparison but I think it perhaps overstates the case to make a point.

I worked on this project where we needed to analyze the motion of objects from the camera in real time. We were trying to find an algorithm that was fast enough to run on the hardware, but to no avail.

One day, it dawned on me that since our hardware encodes the video in h264, we should be able to get some of that info from the video itself. So we tried extracting the motion vectors from the video and sure enough, it was good enough for our use and we got it basically for free.
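For anyone curious what those motion vectors represent: below is a toy exhaustive block-matching search, the kind of computation an encoder runs to produce them (we got ours for free from the bitstream instead of computing anything like this). Frame contents and block/search sizes here are made up for illustration:

```python
# Toy exhaustive block-matching search over plain 2-D lists of luma
# values. SAD = sum of absolute differences between the current block
# and a displaced block in the reference frame.

def sad(ref, cur, bx, by, dx, dy, bs):
    total = 0
    for y in range(bs):
        for x in range(bs):
            total += abs(cur[by + y][bx + x] - ref[by + y + dy][bx + x + dx])
    return total

def best_vector(ref, cur, bx, by, bs=4, search=3):
    h, w = len(ref), len(ref[0])
    best, best_cost = (0, 0), float('inf')
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # Only consider displacements that stay inside the frame.
            if 0 <= by + dy and by + dy + bs <= h and 0 <= bx + dx and bx + dx + bs <= w:
                cost = sad(ref, cur, bx, by, dx, dy, bs)
                if cost < best_cost:
                    best_cost, best = cost, (dx, dy)
    return best

# Reference frame with a bright 4x4 patch at (2, 2)...
ref = [[0] * 12 for _ in range(12)]
for y in range(2, 6):
    for x in range(2, 6):
        ref[y][x] = 200
# ...which has moved to (4, 3) in the current frame.
cur = [[0] * 12 for _ in range(12)]
for y in range(3, 7):
    for x in range(4, 8):
        cur[y][x] = 200

print(best_vector(ref, cur, bx=4, by=3))  # (-2, -1): points back to the patch
```

Since the encoder has already paid this (considerable) cost per macroblock, reading the vectors back out of the stream really is motion analysis for free.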

FFT is the real magic there and has been since the 90s.

Convince me otherwise.

The FFT is lossless, so there’s no compression as a result of using it.

The compression is gained by perceptual modeling and dumping parts, and by powerful entropy modeling, like CABAC and such.

Also more advanced things are usually used in modern compression: DCTs, sliding window stuff, wavelets, CABAC stuff, motion prediction schemes, etc., none of which were used in the 90s (except DCT and rudimentary prediction).

"FFT is lossless" - is this true without infinite precision values?

The FFT using real numbers is not lossless (except in certain numerically coincidental cases that are not of much practical use), but there are lossless approximations used routinely in compression to avoid error at this point.

For fixed input and output (say, 8-bit to 8-bit data), it's easy to make lossless FFTs, since the final truncation or rounding can be designed to never lose information. So in these cases, even an FFT based on, say, IEEE 754 doubles can be made lossless when using integral input and output.

A similar example is converting 8 bit RGB to 0-1 floating point by dividing the color channel by 255.0, which is lossy as real numbers since the resulting floating point is truncated (except precisely for inputs 0 and 255). However, multiplying the result by 255.0, and rounding appropriately, you can recover each input without loss. So this is an example of a lossless transformation that has lossy steps in between. This can be done in almost any case you need to do it.

FFTs in compression are not used to lose values, but to change signal domain, where perhaps interesting and lossy things can be done.

Google lossless FFT or reversible FFT and dig around.

At the end of the day, however, FFTs are not very useful in compression. They're only used for domain changes at best, and they've been made obsolete for the most part by other techniques.

Can you give some details re:

> The FFT using real numbers is not lossless

The DFT over the reals is invertible, right? How can it be invertible but not lossless?

> The FFT using real numbers is not lossless

FFTs use terms of the form e^(2 pi i k / n), and these terms cannot be represented exactly except in rare cases with finite precision floating point numbers (follows from Gelfand's Theorem).

Thus, as soon as you try to use or compute such a term, you've made an approximation, losing information.

The transform is approximately invertible using finite precision, and if your inputs are some fixed set, you can do error analysis to ensure those terms come back out via careful rounding.

But the FFT, and its inverse, involve infinitely precise real numbers that cannot be represented as floating point.

The DFT fails for the same reason.


As I explained above, if you have a limited set of inputs, say byte values 0-255, that get converted to floating point via these approximations, and then the inverse approximation is applied and the final values are appropriately rounded, you can make the DFT and FFT (and almost any approximation algorithm) lossless on this limited subset of inputs.

As a simple example, consider turning a byte color channel 0-255 into a 32-bit float in 0.0f - 1.0f by dividing by 255.0. Now every one of the values except 0.0f and 1.0f is an approximation, since the only exactly representable floats are dyadic (denominator a power of 2), and these denominators are 255 (not a power of 2).

So this is lossy. But the limited input set means there are only 256 distinct floats possible.

Now multiplying by 255.0, which is (in this case) not lossy on floats, puts each back close to an integer (but not all are exactly integers). Rounding to closest will restore the original 0-255 byte values.

So the roundtrip is lossless here. But the same transforms dividing by 255.0f then multiplying by 255.0f are not lossless for all floating point inputs.
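The claim is quick to check exhaustively:

```python
from fractions import Fraction

# Every intermediate float is an approximation of b/255 (255 is not a
# power of two, so the exact ratio is not representable in binary)...
assert Fraction(128 / 255.0) != Fraction(128, 255)

# ...yet rounding after the inverse step restores every byte exactly.
lossless = all(round((b / 255.0) * 255.0) == b for b in range(256))
print(lossless)  # True
```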

In each case, for any algorithm, one needs to design the inputs, outputs, and transforms carefully to ensure it behaves as desired.

This is the tip of the iceberg when dealing with floating point algorithms :)

The DFT over the reals is not lossless. Using floats, for example, and applying it to max_float overflows, so it cannot be inverted. It's not even bit-invertible for almost any set of inputs; it's only invertible on very limited, specialized sets of inputs.

I find it very confusing that you seem to use "real number" and "floating point number" interchangeably.

Like this:

> The DFT over the reals is not lossless. Using floats, for example, and applying it to max_float overflows, so cannot be inverted.

I don't see why you make statements about the reals based on what happens with floats. They aren't the same, and the DFT exists independently of actually computing it.

No, but it's close enough to be round-trippable which is all most applications need. That is, if you do an FFT with enough precision to use the same number of bits as the original representation, you will get values that are close enough to the correct values that if you do it again, you will get the same numbers back, to the limits of whatever precision you were using.

I don't see how this is relevant. Certainly related, but not immediately relevant. The Nyquist rate / frequency (depending on where you stand) determines whether the process of sampling is invertible, but the FFT exists independent of the process of sampling and is always invertible (in the maths). 'acchow's question is whether this holds in practical systems, like computing the FFT and IFFT with the finite precision + finite range numbers that we can use.


You mean DCT?

Please see my other comment under this post: H.264 has no "real magic", the magic comes from multiple small-gain tricks combined together.

DCT is basically a type of DFT (for functions with symmetries in the values). You can easily implement a fast DCT using the FFT at the expense of a constant performance penalty (because the more general code can’t skip doing the redundant work implied by the symmetries in the data). The techniques used to implement a fast DCT are pretty well the same ones used for an FFT.

The basic concept of the DCT as used in compression applications is that the FFT only works for periodic data, but the arbitrary data in a block of pixels are not periodic, so we pretend that it is part of a periodic function with double the period where we mirror the data back-to-front on the other half. If we didn’t have a DCT implementation per se we could explicitly mirror the data into a larger array and then run an FFT, and then just look at half of the coefficients (the others are zero).

This still creates some artifacts since the interval of our data is not actually periodic, and derivatives aren’t necessarily matching between the two ends. But it turns out to be more-or-less good enough.
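Here's a minimal sketch of the mirroring trick, using a naive O(n^2) DFT in place of a real FFT for clarity (same math, no speed); the block values are arbitrary sample pixels:

```python
import cmath
import math

# A DCT-II computed two ways: directly, and by mirroring the block and
# taking a plain DFT of the doubled sequence, as described above.

def naive_dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * t * k / n) for t in range(n))
            for k in range(n)]

def dct2_direct(x):
    n = len(x)
    return [sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
            for k in range(n)]

def dct2_via_dft(x):
    n = len(x)
    y = naive_dft(list(x) + list(reversed(x)))  # mirror, then DFT of length 2n
    # A half-sample phase shift aligns the DFT bins with the DCT basis.
    return [0.5 * (cmath.exp(-1j * math.pi * k / (2 * n)) * y[k]).real
            for k in range(n)]

block = [52, 55, 61, 66, 70, 61, 64, 73]  # one row of pixel values
a, b = dct2_direct(block), dct2_via_dft(block)
print(all(abs(u - v) < 1e-9 for u, v in zip(a, b)))  # the two agree
```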

An alternative approach to approximating a function sampled on a uniform grid over a non-periodic interval using the FFT is to first estimate the first few derivatives at the endpoints and then subtract off a polynomial matching that data. Then we can more safely pretend that the remainder is periodic. cf. http://zindajaved.com/data/publications/javedTrefethen2014b....

Very well. I would like to correct two misconceptions that you carry. Not only is the Fast Fourier Transform not magical, but it comes out of a straightforward elaboration of the question of how multiplication of two numbers might be done quickly. The technique was discovered before the (19)90s, too; Gauss discovered it over two centuries ago.

I think you are intentionally misunderstanding the OP's point. He is talking about FFT as applied to lossy video compression.

In direct response to the OP, I would suggest that motion estimation and entropy coding are up there on equal footing with FFT.

Still, what got into all MacBooks after mid-2015 is a hardware HEVC encoder (if I'm not mistaken). Anyone know why HEVC and not H.264?


NOTE: I found out HEVC is H.265 so my question might not make sense :)

Fairly sure hardware H.264 decoding has been included with every Mac, usually on the GPU, for years and years now. A couple years after the iPhone, I think.

Decoding, sure, but I was asking about encoding; my MacBook mid-2015 comes with "Intel Iris Pro 1536 MB" and takes 25% CPU to capture the screen with QuickTime, while newer MacBooks (>=2016) come with HEVC, which after some searching I found out is also called H.265 [1][2], and is not just in Macs [3]

[1] https://support.apple.com/en-us/HT208238

[2] https://www.macrumors.com/guide/hevc-video-macos-high-sierra...


EDIT: I see now the label 2016, so opening question didn't make sense :) sorry.

They also have encoding. The software might not be using it properly though. https://en.wikipedia.org/wiki/Intel_Quick_Sync_Video

I think screen recording is slow not due to the video encoding but since it defeats standard GPU acceleration of the GUI in order to be able to capture the contents

It really shouldn't. You just feed the framebuffer both into the DMA output, and in the H.26x encoder. If needed, use another framebuffer space if the encoder is slow, so you don't _have_ to recycle the buffer after this frame, and risk the encoder not being done yet.

Patents. H.265 is practically unusable until its patents expire, because there are multiple major patent pool holders. It's just impossible for most of us to use safely.

In the article near the top it says

“1080p @ 60 Hz = 1920x1080x60x3 => ~370 MB/sec of raw data.”

Can someone help me understand the 60 Hz part of this? Was it meant to be fps or does it really mean Hz? And why?

The average refresh rate of a computer display is 60Hz. In this context it's interchangeable with framerate but we think of it in terms of Hz because of the monitors.

The author means 60 frames per second, which has the units of Hertz. You sometimes see this unit used for screen refresh rates in your system settings.
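Working the quoted figure out (the article's "~370 MB/sec" sits between the decimal and binary readings):

```python
# The raw-bandwidth figure from the article, computed out:
# 1080p at 60 frames/sec, 3 bytes (24-bit RGB) per pixel.
width, height, fps, bytes_per_pixel = 1920, 1080, 60, 3
raw = width * height * fps * bytes_per_pixel
print(raw)               # 373248000 bytes/sec
print(round(raw / 1e6))  # ~373 MB/sec (decimal megabytes)
print(round(raw / 2**20))  # ~356 MiB/sec (binary mebibytes)
```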

Nice summary!

I just wish he hadn't used pounds for comparison... I mean, there are three countries left on this planet that don't officially use the metric system. Even scientists in the US use it. Reminds me of http://www.joeydevilla.com/wordpress/wp-content/uploads/2008...

But given how he managed to simplify such a complex technology to an easy-to-read article I don't really want to criticize him for using the wrong unit system.

I have to say I really love Sid's style of writing. Humorous and entertaining, yet brief and to the point.

Hope there's more like this to come.

Written in 2016, it is the only entry on the blog. Odds are slim, at least coming from Sid.

I’d be interested to see this same analysis for H.265, given that it's the way forward and the standard for many devices and media.

Thinking back on the different dead ends, I figure that H.264 is just a reasonable evolutionary stage.

Now, Pixelon...that was probably magic.

Somehow I missed this dot-com crash story. The number of things wrong with it is amazing: https://www.wired.com/2000/05/perilous-fall-of-pixelon/

If stories like this interest you, I highly recommend Nat Geo's Valley of the Boom. I don't even like "history shows" but it's a funny and self-aware take on SV in the 90s with solid acting and a pretty low budget.

Fascinating. I love these accessible dives.


Not sure why you were downvoted. It is literally in the article.

When someone is explaining what part of the article left the biggest impression on them and why, it's weird and unhelpful to go "that's literally in the article!"

Oh, maybe they edited their comment (or maybe I misread it). Originally, I believe it said that it _didn't_ explain why at all.

Same here. I'd be happy to delete my comment in either case, but it's too late now.

Neat! I'd love to read another post like this about H.265.

What a nice article! Love finding these on hn

Don’t use blockquotes instead of section headings… please. This makes my skin crawl, but more importantly, it messes up assistive technology such as rotor and text-to-speech.

I believe the "magic" is happening in the visual cortex.

We have the tech today for even wildly better compression, at least for natural looking images. Using neural nets. There is a big computational cost to them, but if history is any guide, that won't be an issue forever.

Got any links to that?

Its sort of obvious if you see stuff like this:


The neural network basically has a huge lookup table of "natural" looking stuff. Just point into that table and get what you want out of it: grass, trees, a crowd, etc and the NN will produce it to your specifications.

Judging from the comments, the article mostly resonates with people who already know how the codec works, but it doesn't seem to be very useful to engineers who, like me, want to understand it in a bit more technical detail. The author could have removed the casual conversation noise in the text (very annoying in technical articles!) and filled it with some more useful bits instead.

I enjoyed the read and think it is not a technical article but entertainment. Maybe that is wrong too and it is both. Why should every technical article by definition be written in boring language?

For technical details, use the terms and search for them in your favourite search engine. There are literally tons of articles about every aspect of H.264.

> For technical details use the terms and search them in your favourite search engine.

Thanks for lecturing me, but the article in question is neither very entertaining nor very informative. That was my point. Other than that, of course I can find better materials on the net.
