Another tidbit of (potentially) interesting information: it took many years for H.264 to achieve its full potential, for two reasons:
1) Implementing a decoder is really, really complicated, especially if you want it to run fast, and if you think a miracle "hardware decoder" solves the problem, you need to remember that H.264 places heavy requirements on memory bandwidth and caching (B frames), so it's not like you can just feed a stream through a chip and get video frames on the other side.
2) Implementing an encoder is really, really complicated :-) — and there are many ways to encode a video stream. H.264 delivers the tools, but does not specify how you should use them. It took many years for encoders to get to a state where most of H.264's toolset is used efficiently.
I remember sponsoring a feature in x264 (the best software encoder), and the level of complexity in that code is mind-bending. The patch was to allow better CPU utilization when running with over 16 cores - something the authors hadn't run into at the time and had no systems to test on. I did a simple profile during encode and sent them the trace; in 2 days the patch was tested and working, and it hit main in a week's time. Boom, all 48 cores used.
Made me realize just how much talent is behind these technologies, and I still shake my head at how much money is generated with this and how little of that makes its way back to the developers at the core of it.
If it's any consolation, not much more makes it to the talented developers in commercial products, either. It mainly ends up going to executives and sales.
I think it's primarily developers in other countries that are getting severely underpaid. Hopefully that changes in the coming years.
The most egregious example might be India. Around 7 years ago, I was on a plane from Kerala (a state in India), and the guy sitting next to me happened to run a software consulting company (in Kerala). His customers were from Japan and some other countries (can't remember), and his company wrote Linux drivers for proprietary hardware. He hired kids right out of college. I asked him how much he paid them, and he said without blinking, Rs. 15000 per month. Type "INR 15000 to USD" into Google, and you're in for a shock. It's around $2600 per year, though probably closer to $3000 per year at the exchange rates back then.
American developers doing the same job could earn up to 100 times that, and at the very least 40 times that amount. I can't understand why on Earth salaries are 40 to 100 times higher for the exact same work, work that could be done by anyone anywhere on the globe, but I hope this massive difference in pay converges and disappears soon.
I really, really hope that with the recent IPOs, all the devs who made good money end up doing startups that are more employee-friendly than VC-friendly.
But then, this is the ugly world we live in. The selfish gene yo!
Was it Dark Shikari? I remember there were lots of threading optimisations done, but that was... the ~2010 time frame.
I really hope he is still working in the video encoding scene.
- once you have a reference implementation, you can have exact input/output pairs, which makes development easy, especially if the reference implementation exposes intermediate results.
- because of the previous point, testing is easy.
- even if you don't have a reference implementation, you can start by building a slow version of an encoder/decoder, generate testing data (intermediate outputs), and step by step optimize the system, while ensuring it produces the same outputs at every step (a rough sketch of that check follows below).
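As a rough sketch of that golden-output check (the two decoder functions here are hypothetical placeholders for the trusted slow implementation and the optimized one):

```python
import numpy as np

def verify_against_reference(reference_decode, fast_decode, test_bitstreams):
    """Golden-output test: the optimized decoder must match the trusted one bit-for-bit."""
    for bitstream in test_bitstreams:
        expected = reference_decode(bitstream)  # slow but known-good frames
        actual = fast_decode(bitstream)         # the implementation being optimized
        for golden, candidate in zip(expected, actual):
            assert np.array_equal(golden, candidate), "optimized decoder diverged"
```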
The problem is not in getting the output. The reference decoder decodes just fine. The difficult part is getting the output on time, i.e. getting your decoder to run fast.
The reference decoder has not been designed for speed, and getting things to run fast requires architecting everything very differently: it's not just a matter of applying micro-optimizations to reference code.
Also, the "intermediate results" might be different for each decoder architecture, so you can't really rely on the reference code for those.
So yes, while testing is indeed easy, writing a fast decoder is definitely not a "step by step optimization" process. I'd bet every performant decoder out there was rewritten almost completely at least twice.
On the other hand, how an encoder works is not specified at all, apart from the obvious requirement that it has to produce something decodable by the above-mentioned decoder.
However, it doesn't handle H.265 videos very well - it pauses every few frames.
How come? I'm assuming it's because H.265 is much more CPU-intensive.
What's the typical compression benefit of using H.265 vs H.264?
Update: noticed this comment below from @Causality1
"devices from over a decade ago have hardware h264 decoders. Extremely low powered modern devices can play very large 265 videos, if they have a hardware decoder."
A couple of generations ago, AMD went the other way: they used GPU shader ALUs for video encoding/decoding. They stopped doing that because the power efficiency is worse than the custom silicon that's now in all modern GPUs, even in mobile SoCs.
Yes, there are many software decoders out there, usually specialized for certain architectures (the team I worked with wrote a decoder for Texas Instruments DSPs, for example). Then there are hardware implementations, which are rarely just "hardware", it's usually hardware-assisted software that does the decoding.
> P-frames are frames that will encode a motion vector for each of the macro-blocks from the previous frame.
Actually, P-frames can use data from multiple previous frames, not just the last one.
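To make the motion-vector idea concrete, here is a toy motion-compensation sketch (the block size, whole-pixel vectors, and single reference list are all simplifications; real codecs use sub-pixel vectors and code a residual on top of this):

```python
import numpy as np

B = 16  # macroblock size

def predict_block(reference_frames, ref_idx, mv, y, x):
    """Predict one 16x16 block by copying pixels from a chosen reference frame.

    reference_frames: previously decoded frames (a P-frame block may point into any of them)
    ref_idx:          which reference frame this block uses
    mv:               (dy, dx) motion vector in whole pixels
    (y, x):           top-left corner of the block in the current frame
    """
    ref = reference_frames[ref_idx]
    return ref[y + mv[0]:y + mv[0] + B, x + mv[1]:x + mv[1] + B]

# Toy demo: the "current" block is just part of the reference shifted by (2, 3),
# so the bitstream only needs to carry the vector, not the pixels.
reference = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
prediction = predict_block([reference], ref_idx=0, mv=(2, 3), y=16, x=16)
```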
I think it's worth pointing out, as well, that even though we've technically surpassed H.264 with newer codecs like H.265, VP9, and AV1, all of which are roughly 20%-50% more efficient, this has required tremendous increases in encoding complexity. H.264 is special - it seems to occupy a kind of inflection point on the complexity / efficiency curve. It's far more efficient than previous codecs like Xvid, WMV, and so on, but at the same time even a lot of underpowered devices from over a decade ago can easily play it. We're not likely to see tradeoffs that good again in the video codec space.
You get the same image quality at a lower bitrate and faster decoding.
VP9 encoding is much slower, but Intel's SVT-VP9 encoder is getting over 300 frames per second on the right hardware:
The VP9 encoder also tends to smooth (clean) the video rather than working with the noise (not a problem for most short internet video), which is something that, even today, x264 is very, very good at, and possibly the best.
The point is that you can select the VP9 encoder that makes the most sense for your use case. You can choose from EVE, libvpx, Intel SVT-VP9, Intel Quick Sync, NGCodec, mobile device encoders, etc.
* I'm not sure what the "ffh264" decoder they're talking about is. My understanding is that pretty much everyone is using libx264 for both encoding and decoding. Maybe the latter is more performant?
* Using SSIM to pick "equal quality" files will probably not give very accurate results.
* SVT-VP9 is, if I'm not mistaken, a hardware encoder. One of the issues with these is that they require some significant tradeoffs to get better speed, so they'll have worse quality than a software encoder at the same bitrate. If you go with software you really pay the price when it comes to encoding time.
ffh264 is FFmpeg's decoder. libx264 only does encoding, not decoding.
SVT-VP9 is a software encoder, specially optimized for Intel Xeon Scalable and Intel Xeon D processors.
Their subjective testing methodology:
15 years ago there were people who said exactly the same thing about MPEG4-ASP (e.g. DivX/Xvid). It all comes down to Moore's law and hardware acceleration.
Another thing about hardware acceleration is that on many platforms you really did not want to push 622 Mbit/s worth of raw chroma-subsampled YUV FullHD video across whatever interconnect to your GPU, so moving things like motion compensation into GPU hardware was the only way to actually play FullHD content. And this is also part of the reason why H.264 is still somehow special today.
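As a back-of-the-envelope check, that ~622 Mbit/s figure works out if you assume 1080p at 25 fps with 4:2:0 subsampling (the frame rate is my assumption; 1080p60 would be 2.4x higher):

```python
width, height, fps = 1920, 1080, 25   # assuming 25 fps; the comment above doesn't say
bytes_per_pixel = 1.5                 # 4:2:0: full-res luma plus quarter-res Cb and Cr
raw_mbit_per_s = width * height * fps * bytes_per_pixel * 8 / 1e6
print(raw_mbit_per_s)                 # ~622 Mbit/s of raw video
```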
There are always format fanboys but it really wasn't anything like that because the MPEG4-ASP experience still had problems for the average user (it wasn't a big win on file size/quality, compatibility was not a given, people still had to pay for licensed players, etc.).
In contrast, I think H.264 hit the threshold where it's good enough for the average person and you basically never hit the case where a file doesn't work for the person you shared it with because even though it's license-restricted, most operating systems and browsers include support. Moving away from that is going to be hard because most people will not see much advantage — AV1 is clearly better on file-size but most people aren't hitting limits on a regular basis and the feedback loops are often missing (e.g. you hit play in Netflix on your phone — how many people would have any way to tell how much of their data plan that used?).
Instead of focusing on the client, what about the server? How happy would Netflix be to reduce their outbound traffic by 20%? Quite happy, I'm guessing.
For an article talking about the magic of compression, comparing to uncompressed png makes a lot more sense than, say, starting your comparisons to something that's already compressed like jpeg.
That PNG is a much closer pedagogical relation to the previously discussed "3D array of pixels, 2 in space and one in time" that he's building on.
I believe it is one reason why MPEG is working on EVC, a royalty-free video codec based on H.264 with a target of 20% improvement.
Hopefully the industry as a whole, mainly Apple and Google can settle on something that is better than H.264 and JPEG.
Right now Google will not support H.265, and Apple isn't fully on board with AV1 yet.
EVC won't deliver much value. VP9 is already royalty-free, already outperforms x264, and already has broad support and use.
Apple should add VP9 support now (like everyone else has) and work to add AV1 support in the future.
Uh, is that actually what's going on there? Because for some reason, seeking is basically instant when I'm playing locally-saved videos, including ones I downloaded directly from Youtube and didn't re-encode.
Actually, this is one of the reasons why I download videos before watching them, instead of using the normal Youtube player.
Here's one of many threads where Jean-Baptiste Kempf addresses the feature request. https://forum.videolan.org/viewtopic.php?p=390778&sid=1a571d...
The amount of pausing is going to vary a lot by video. If you have an I-frame followed by 250 P-frames, there will probably be a noticeable pause.
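A rough sketch of why seeking can pause: the player has to jump back to the nearest preceding keyframe and decode forward from there, because every P-frame depends on the frames before it (the GOP layout below is made up for illustration):

```python
def frames_to_decode_for_seek(keyframe_positions, target_frame):
    """How many frames must be decoded before `target_frame` can be displayed."""
    last_keyframe = max(k for k in keyframe_positions if k <= target_frame)
    return target_frame - last_keyframe + 1

# One I-frame followed by 250 P-frames: seeking near the end of that group
# means decoding ~250 frames before anything shows up on screen.
print(frames_to_decode_for_seek(keyframe_positions=[0, 251], target_frame=249))  # -> 250
```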
Side note, using VRAM to make seeking instantaneous seems to me like a perfectly good use of resources, at least if my device has the VRAM to spare. On my desktop, I have a 1080 Ti, and it mostly sits idle when I'm watching a video...
The only thing I really don't like about VLC is the traffic cone icon, but I get that's their brand so I have to live with it.
I want my media player to do exactly one thing: open the video or audio files I tell it to open, and play them. Plus the option to pause, seek, enable subtitles, or change audio tracks. Anything else is extraneous.
I'm something of a software minimalist. I've spent a lot of time finding and hiding UI elements on my Jailbroken iPhone. If I don't need something I really want it out of sight.
Interestingly, YouTube doesn't display this behavior. I find it kind of annoying, to be honest. Sometimes there are reasons I want to go exactly the same duration back/forward, every time.
Either Youtube caches the video data, in which case seeking should be as fast as offline players, or Youtube discards the data, in which case it has to be downloaded again. It certainly seems to me like Youtube is discarding the data. I wish they didn't do that.
The good news is that we don't need to change the H.264 standard for this.
- most of the already deployed embedded H.264 decoders support 4:4:4 profiles
- non-chroma-downsampled input content is available
Then, it's only a matter of changing encoder implementations to do exactly as you say.
I believe the reason is that most hardware decoders sold to consumers do not support anything above 420. As Blu-rays and such are all currently 420, the chip makers don't have much incentive to support 444. It's a shame because your TV fully supports 444, and the difference can be huge on some content.
WebM seems to be just as good, and it's free.
Can someone comment if I am mistaken about this?
Licensing issues are almost all around codecs, not container formats.
VP9 and its successors will take over, but it will be a while until they're fully implemented in hardware.
I distinctly remember a few folks saying that you would need to violate the laws of physics to transmit 1080p @ 60 fps over wifi. Now we are all doing it many times over, and billions of dollars are being made every year. This is one of those unsung achievements worthy of Nobel-prize-level awards, but no one seems to know who these people were.
I wonder if there is any detailed history of how various components of H.264 came together, who led this effort, how projects were funded for such a long time?
[HN 2016] - https://news.ycombinator.com/item?id=12871403
I've seen and given recommendations on here to write a JPEG codec as a learning exercise, and a search of GitHub reveals plenty of others who've done it; but the same situation doesn't seem to hold for video. Nonetheless, having done it and realised it's not that much work (i.e. should be doable in a weekend or two), I now recommend trying to write codecs for H.261 and MPEG-1 too. There's not that much media now in those two codecs, but if you get to MPEG-2, you can experience watching DVDs using your own code.
MBAFF is a good example (Macroblock-Adaptive Frame/Field encoding). What this means is that the encoder can choose, for every pair of macroblocks, whether to encode them in frame mode or as two interlaced fields. Combined with B-frames (and remember that B-frames reference frames both in the past and in the future) this makes for really interesting memory access patterns, which then kills your cache locality and murders performance.
At 15 seconds, in the VP9 version, it looks like smoke is pouring off the top-left metal pipe of the machine. It immediately draws my attention because it's so out of place. But in the H.264 version, the whole image is noisy, so nothing distracts me.
The limits are just there for interoperability. As is, if you encode a video that complies with MPEG-2 High Level, then you can be pretty confident that any MPEG-2 High Level decoder will decode it.
Without the limit everything becomes a mess. You're writing a decoder, what max res should you support? 4K? 8K? The people writing decoders don't agree and then people trying to distribute these high resolution files find that they work in some decoders but not others.
You can absolutely use h.264 to encode 8K@120fps if you want to.
I don't dispute that a 10-bit field for the horizontal pixel count would be a significant barrier, but if it were really that hard to come up with more efficient compression, some sort of extension would have been defined to allow bigger horizontal pixel counts.
Youtube's chroma subsampling makes colors bleed: Mario's hat turns into chunky red blocks roughly filling the hat's outline, screen captures come out discolored and grainy around text, and sharp colored anti-aliased lines turn into a discolored mess.
(Mario's hat was pixelated on pannenkoek2012's emulated SM64 videos at native 640x480. Maybe newer video decoders antialias the chroma channel, so Mario's hat is not pixelated but "merely" blurry.)
I upload oscilloscope videos with colored lines on black backgrounds, and stopped using brightly colored lines partly because color was being blurred and discolored by Youtube chroma subsampling.
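A quick toy illustration of what 4:2:0 subsampling does to a one-pixel-wide colored line (this uses plain averaging and nearest-neighbour upscaling; real pipelines use better filters, but the color bleed is the same idea):

```python
import numpy as np

# A toy chroma plane (e.g. Cr): black background with one bright vertical line.
chroma = np.zeros((8, 8))
chroma[:, 3] = 255.0

# 4:2:0 keeps one chroma sample per 2x2 block: average down, then stretch back up.
downsampled = chroma.reshape(4, 2, 4, 2).mean(axis=(1, 3))
reconstructed = downsampled.repeat(2, axis=0).repeat(2, axis=1)

print(reconstructed[0])  # the crisp 255 line comes back as two columns of ~127.5: color bleed
```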
I wonder why webcams don't directly use the onboard graphics card.
Even the cheap webcams all suffer from the very same problem.
The reason I bought a laptop camera module is that I thought it would be higher quality and lower price, and good for monitoring my 3D prints.
I've not found any Raspberry Pi camera with autofocus and the IR filter intact (I need it for full-light use and better colours during the daytime).
Oversimplification and exaggeration. Understanding Huffman encoding and the deflate algorithm is just two five-minute articles away.
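For the curious, a toy Huffman code builder really does fit in a few lines (a from-scratch sketch, not how deflate implementations actually organise it):

```python
import heapq
from collections import Counter

def huffman_code(text):
    """Build a prefix code where frequent symbols get shorter bit strings."""
    heap = [[count, i, {sym: ""}] for i, (sym, count) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in lo[2].items()}
        merged.update({s: "1" + code for s, code in hi[2].items()})
        heapq.heappush(heap, [lo[0] + hi[0], next_id, merged])
        next_id += 1
    return heap[0][2]

codes = huffman_code("this sentence is the data to be compressed")
print(sorted(codes.items(), key=lambda kv: len(kv[1])))  # common symbols have short codes
```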
To be fair, it is faster to say out loud, because our verbal "character set" is so much richer. A unit of pronunciation is roughly a syllable. "10 tosses, all heads" has only 5 syllables, while "heads heads heads..." takes 10. But this insight is a bit much to ask of someone new to the concept of compression.
> I captured the screen of this home page and produced two files:
> PNG screenshot of the Apple homepage 1015KB
> 5 Second 60fps H.264 video of the same Apple homepage 175KB
> Eh. What? Those file sizes look switched.
> No, they're right. The H.264 video, 300 frames long is 175KB. A single frame of that video in PNG is 1015KB.
The article does go into how and why this is possible (tl;dr H.264 is lossy, PNG is not), but for the difference in human-detectable quality, H.264 is astonishingly more efficient.
Screenshot tools rarely produce optimised PNGs. A quick test of the image in the article shows that optipng (lossless) can bring the size down to ~570KB, or pngquant (lossy) to ~240KB. Zopfli could further compress the images produced by optipng or pngquant, but it's a case of diminishing returns at that point.
To be clear the article makes a good comparison but I think it perhaps overstates the case to make a point.
One day, it dawned on me that since our hardware encodes the video in h264, we should be able to get some of that info from the video itself. So we tried extracting the motion vectors from the video and sure enough, it was good enough for our use and we got it basically for free.
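If you want to poke at the same idea without custom hardware, ffmpeg can export the motion vectors its decoder recovers and draw them for you; a minimal sketch driving it from Python (the file names are placeholders, and it assumes an ffmpeg build with the codecview filter):

```python
import subprocess

# Decode input.mp4, export the motion vectors the decoder recovers, and render
# them as arrows on the output (pf/bf/bb = forward/backward MVs of P- and B-frames).
subprocess.run([
    "ffmpeg",
    "-flags2", "+export_mvs",
    "-i", "input.mp4",
    "-vf", "codecview=mv=pf+bf+bb",
    "motion_vectors.mp4",
], check=True)
```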
Convince me otherwise.
The compression is gained by perceptual modeling and discarding parts, and by powerful entropy modeling, like CABAC and such.
Also more advanced things are usually used in modern compression: DCTs, sliding window stuff, wavelets, CABAC stuff, motion prediction schemes, etc., none of which were used in the 90s (except DCT and rudimentary prediction).
For fixed input and output (say 8-bit to 8-bit data), it's easy to make lossless FFTs, since the final truncation or rounding can be designed to never lose information. So in these cases, even an FFT based on, say, IEEE 754 doubles can be made lossless when using integral input and output.
A similar example is converting 8 bit RGB to 0-1 floating point by dividing the color channel by 255.0, which is lossy as real numbers since the resulting floating point is truncated (except precisely for inputs 0 and 255). However, multiplying the result by 255.0, and rounding appropriately, you can recover each input without loss. So this is an example of a lossless transformation that has lossy steps in between. This can be done in almost any case you need to do it.
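That roundtrip is small enough to check exhaustively; a quick NumPy sanity check of the claim:

```python
import numpy as np

original = np.arange(256, dtype=np.uint8)
as_float = original / 255.0                     # lossy step: most quotients are only approximated
recovered = np.rint(as_float * 255.0).astype(np.uint8)

print(np.array_equal(original, recovered))      # True: the roundtrip is lossless
```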
FFTs in compression are not used to lose values, but to change signal domain, where perhaps interesting and lossy things can be done.
Google lossless FFT or reversible FFT and dig around.
At the end of the day, however, it's not FFTs that are very useful in compression. They're only used for changing the signal domain at best, and they've been made obsolete for the most part by other techniques.
> The FFT using real numbers is not lossless
The DFT over the reals is invertible, right? How can it be invertible but not lossless?
FFTs use terms of the form e^(2 pi i k / n), and these terms cannot be represented exactly except in rare cases with finite precision floating point numbers (follows from Gelfand's Theorem).
Thus, as soon as you try to use or compute such a term, you've made an approximation, losing information.
The transform is approximately invertible using finite precision, and if your inputs are some fixed set you can make sure and do error analysis to ensure those terms come back out via careful rounding.
But the FFT, and its inverse, involve infinitely precise real numbers that cannot be represented as floating point.
The DFT fails for the same reason.
As I explained above, if you have a limited set of inputs, say byte values 0-255, that get converted to floating point via these approximations, then the inverse approximation is applied, then the final values are appropriately rounded, you can make the DFT and FFT (and almost any approximation algorithm) lossless on this limited subset of inputs 0-255.
As a simple example, consider turning a byte color channel 0-255 into a 32-bit float in 0.0f - 1.0f by dividing by 255.0. Now every one of the values except 0.0f and 1.0f is an approximation, since the only exactly representable floats are dyadic (denominator a power of 2), and this denominator is 255 (not a power of 2).
So this is lossy. But the limited inputs mean there are only 256 different floats possible.
Now multiplying by 255.0, which is (in this case) not lossy on floats, puts each back close to an integer (but not all are integers). Rounding to the closest integer will restore the original 0-255 byte values.
So the roundtrip is lossless here. But the same transforms dividing by 255.0f then multiplying by 255.0f are not lossless for all floating point inputs.
In each case, for any algorithm, one needs to carefully design the inputs, outputs, and transforms to ensure it behaves as desired.
This is the tip of the iceberg when dealing with floating point algorithms :)
The DFT over the reals is not lossless. Using floats, for example, and applying it to max_float overflows, so it cannot be inverted. It's not even bit-invertible for almost any set of inputs. It's only invertible on very limited, specialized sets of inputs.
> The DFT over the reals is not lossless. Using floats, for example, and applying it to max_float overflows, so cannot be inverted.
I don't see why you make statements about the reals based on what happens with floats. They aren't the same, and the DFT exists independently of actually computing it.
Please see my other comment under this post: H.264 has no "real magic", the magic comes from multiple small-gain tricks combined together.
The basic concept of the DCT as used in compression applications is that the FFT only works for periodic data, but the arbitrary data in a block of pixels are not periodic, so we pretend that it is part of a periodic function with double the period where we mirror the data back-to-front on the other half. If we didn’t have a DCT implementation per se we could explicitly mirror the data into a larger array and then run an FFT, and then just look at half of the coefficients (the others are zero).
This still creates some artifacts since the interval of our data is not actually periodic, and derivatives aren’t necessarily matching between the two ends. But it turns out to be more-or-less good enough.
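Here's that mirror trick spelled out numerically, checked against SciPy's DCT-II (the half-sample phase factor is the only fiddly bit of bookkeeping):

```python
import numpy as np
from scipy.fft import fft, dct

x = np.random.rand(8)
N = len(x)

# Mirror the block so the periodic extension the FFT assumes has no jump at the ends.
mirrored = np.concatenate([x, x[::-1]])
Y = fft(mirrored)

# Undo the half-sample phase shift and keep the first N (now real) coefficients.
k = np.arange(N)
via_fft = np.real(np.exp(-1j * np.pi * k / (2 * N)) * Y[:N])

print(np.allclose(via_fft, dct(x, type=2)))  # True: same coefficients as the DCT
```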
An alternative approach to approximating a function sampled on a uniform grid over a non-periodic interval using the FFT is to first estimate the first few derivatives at the endpoints and then subtract off a polynomial matching that data. Then we can more safely pretend that the remainder is periodic. cf. http://zindajaved.com/data/publications/javedTrefethen2014b....
In direct response to the OP, I would suggest that motion estimation and entropy coding are up there on equal footing with FFT.
EDIT: *HEVC not HVEC
NOTE: I found out HEVC is H.265 so my question might not make sense :)
EDIT: I see now the label 2016, so opening question didn't make sense :) sorry.
I think screen recording is slow not due to the video encoding, but because it defeats the standard GPU acceleration of the GUI in order to capture the contents.
“1080p @ 60 Hz = 1920x1080x60x3 => ~370 MB/sec of raw data.”
Can someone help me understand the 60 Hz part of this? Was it meant to be fps or does it really mean Hz? And why?
I just wish he hadn't used pounds for comparison... I mean, there are 3 countries left on this planet that don't use the metric system officially. Even the scientists in the US use it. Reminds me of http://www.joeydevilla.com/wordpress/wp-content/uploads/2008...
But given how he managed to simplify such a complex technology to an easy-to-read article I don't really want to criticize him for using the wrong unit system.
Hope there's more like this to come.
Now, Pixelon...that was probably magic.
The neural network basically has a huge lookup table of "natural" looking stuff. Just point into that table and get what you want out of it: grass, trees, a crowd, etc and the NN will produce it to your specifications.
For technical details, take those terms and search for them in your favourite search engine. There are literally tons of articles about every aspect of H.265.
Thanks for lecturing me, but the article in question is neither that entertaining nor that informative. That was my point. Other than that, of course I can find better materials on the net.