'Suppose you have some strange coin - you've tossed it 10 times, and every time it lands on heads. How would you describe this information to someone? You wouldn't say HHHHHHHHH. You would just say "10 tosses, all heads" - bam! You've just compressed some data! Easy. I saved you hours of mindfuck lectures.'
This is a really great, simple way to explain what is otherwise a fairly complex concept to the average bear. Great work.
And of course the truth is you would just transmit "H_10" with the universally agreed upon knowledge that "H" is "Heads" and "_" is number of times.
Yes I get that the alphabet is already agreed upon.
But if I only transmit H or T (uncompressed) that's just one bit needed per symbol. So I can transmit HHHHHHHHH in ten bits. If I introduce this simplified compression to the system and add 0-9 to the alphabet, that now needs four bits per symbol, so the message H10 is 12 bits long (which is longer than uncompressed). And HTHTHTHTHT would be forty bits so if the message doesn't repeat simply it's now four times larger!
See what I mean? It's not successfully compressed anything.
The solution to this is easy and is Huffman coding, but it doesn't make sense to show it for a ten bit message as it won't work well, and in the trite explanation of compression of 'just the symbol and then the number of times it's repeated' this isn't mentioned, so it's only half the story and people will still be puzzled because they will see that your message contains MORE entropy after 'compression', not LESS!
"To another human" is the key phrase, and sometimes I wonder if HN is populated with humans or androids. No offense intended to androids with feelings.
> The solution to this is easy and is Huffman
Well, not. As you said, for 10 bits doesn't matter; and in general it will depend on the input; sometimes run length encoding performs better than Huffman; and also there are cases were Huffman won't capture high order entropy. Also, for zero order entropy arithmetic encoder is superior than Huffman. Unless you care about decompression speed...
Which bring me back to the fact that there is not such a thing as"the solution" in data compression. But more importantly: it was just an example to show an idea; and actually a pretty good one (run length encoding)
How does it work? The first bit is a sign bit. If it's zero, the next seven bits are raw coinflips, 0 for tails, 1 for heads. If it's one, the second bit signifies whether the next sequence consists of heads or tails (again, 0 for tails, 1 for heads), and the remaining six bits tell how long said sequence is.
This is a fairly simple encoding for the strategy described in the article, which I thought of off the top of my head in about five minutes, and I know nothing about compression. It's slightly broken (what if the sequence ends mid-byte?), but it does kind of work. Somebody who actually knows about compression could probably do this better.
What's really cool is that the simple explanation can be extended to explain things like why ciphertext doesn't compress well: because ciphertext has no patterns
In any case, an LZ decompressor fits in less than 100 bytes of machine instructions and a simple compressor can be implemented in not more than several hundred, all the while providing extremely high compression for its simplicity. It will easily outperform order-0 static or dynamic Huffman on practical files like English text, and would probably make a good assignment in an undergraduate-level beginning data structures/algorithms/programming course; yet it seems more popular to give an assignment on Huffman using trees, which is somewhat ironic since in the real world Huffman is implemented using bit operations and lookup tables, not actual tree data structures.
To give a trivial example, LZ will easily compress ABCDABCDABCDABCD while order-0 Huffman can't do much since each individual symbol has the same frequency.
My guess is that the "late" development of LZ is due to mainly two reasons:
i) At that moment the pattern matching algorithms were not so advanced. E.g. suffix tree was very recent, and in the next years lots of advances occurred in that area...
ii) Although LZ can appear easier or more intuitive than Huffman, I think it is much less intuitive to prove a good bound in the compression achieved by LZ. OTOH, Huffman is build in a way that shows that it achieves zeroth-order compression.
And Huffman isn't optimal unless you are lucky, unlike arithmetic coding.
To start your range is [0, 1). For each symbol you want to encode you take your range and split it up according to your probabilities. E.g. if your symbols are 25% A, 50% B and 25% C, then you split up that range in [0, 0.25) for A, [0.25, 0.75) for B and [0.75, 1) for C.
Encoding multiple symbols is just applying this recursively. So to encode the two symbols Bx we split up [0.25, 0.75) proportionally just like we did [0, 1) before to encode x (where x is A, B or C).
As an example, A is the range [0, 0.25), and AC is the range [0.1875, 0.25).
Now to actually turn these ranges into a string of bits we choose the shortest binary representation that fits within the range. If we look at a decimal number:
The beauty of arithmetic coding is that after encoding/decoding any symbol we can arbitrarily change how we split up the range, giving rise to adaptive coding. Arithmetic coding can perfectly represent any data that forms a discrete string of symbols, including changes to our knowledge of data as we decode.
A Huffman tree for digits might assign 0-5 to 3 bits and 6-9 to 4 bits. Encoding three digits will use on average slightly more than 10 bits. Using AC will let you give the same amount of space to each possibility, so that encoding three digits always uses less than 10 bits.
"0" = 0.0b = 0 falls in the range [0,0.25) so it's a valid encoding for "A"; but isn't it also a valid encoding for "AA", "AAA", etc.?
AA = [0,0.25) * [0, 0.25) = [0, 0.125), and so on.
It seems that adding "A"s to a string in general doesn't change its encoding.
It's the equivalent to pretending a Huffman stream never ends and is padded with infinite 0s.
Saying "10 tosses, all heads" reduces the chance of omitting a toss in data entry, which is all to the better.
On the HyperLogLog algorithm to count things.
- heads, heads, heads, tails, tails, heads
The key idea is to encode differences; even in an I-frame, macroblocks can be encoded as differences from previous macroblocks, and with various filterings applied: https://www.vcodex.com/h264avc-intra-precition/ This reduces the spatial redundancies within a frame, and motion compensation reduces the temporaral redundancies between frames.
You can sometimes see this when seeking through video that doesn't contain many I-frames, as all the decoder can do is try to decode and apply differences to the last full frame; if that isn't the actual preceding frame, you will see the blocks move around and change in odd ways to create sometimes rather amusing effects, until it reaches the next I-frame. The first example I found on the Internet shows this clearly, likely resulting from jumping immediately into the middle of a file: http://i.imgur.com/G4tbmTo.png That frame contains only the differences from the previous one.
As someone who has written a JPEG decoder just for fun and learning purposes, I'm probably going to try a video decoder next; although I think starting from something simpler like H.261 and working upwards from there would be much easier than starting immediately with H.264. The principles are not all that different, but the number of modes/configurations the newer standards have --- essentially for the purpose of eliminating more redundancies from the output --- can be overwhelming. H.261 only supports two frame sizes, no B-frames, and no intra-prediction. It's certainly a fascinating area to explore if you're interested in video and compression in general.
"essentially" makes it sound like it isn't precisely true. MJPEG is literally just a stream of JPEG images. The framing of the stream varies a bit, but many implementations are just literal JPEG images bundled one after the other into a MIME "multipart/x-mixed-replace" message.
But when seeking, why wouldn't any local media playback seek backwards and reconstruct the full frame? It's not like the partial frame after seeking is useful - I'd rather wait 2 seconds while it scrambles (i mean "hurries up") to show me a proper seek, wouldn't everyone?
What was your Internet search for finding that imgur frame? What is this effect called?
Most codecs/players do. VLC used to be criticized for being different in that regard. One possible advantage is istantaneous seeking, as there's no need to decode all the needed frames (which could amount to several seconds of video) between the nearest I-frames (the complete reference pictures) and the desired one.
: plural, because prediction can also be bidirectional in time
The use of incomplete video frame data for artistic purposes is called "datamoshing".
VLC is willing to let my entire screen look like a blob of grey alien shit for 10 seconds instead of just taking a moment to reconstruct frames.
And its hardware acceleration for newer codecs is balls. Sucks because otherwise, it's right up there with f2k for me.
* Sane defaults (encodings and fonts, scaletempo for audio)
* instantaneous play of next and previous videos
* navigation in random playlist actually works
* Easy always on top key binding
* Most mplayer key bindings work
I'll definitely keep on trying it for a while.
I think I can find some use for this in certain situations. Still lacks a good playlist building schema.
Yes, this is what I was talking about, and yes, specifically for VLC. Plus it's not like playback is so taxing that all cores are pegged at 100% during playback. When I seek, VLC should get off its ass and scramble to come up with the correct full frame then. I'll wait.
The example I used in the lecture where datamoshing came up was the music video for Charlift's "Evident Utensil"; I always thought this was a neat example.
Kanye West's "Welcome to Heartbreak" (https://www.youtube.com/watch?v=wMH0e8kIZtE)
A$AP Mob's "Yamborghini High" (https://www.youtube.com/watch?v=tt7gP_IW-1w)
For example if you replace H.264 with a much older technology like mpeg-1 (from 1993) every sentence stays correct, except this:
"It is the result of 30+ years of work" :)
I was hoping the author would write about H.264 specifically, for instance, how it was basically the "dumping ground" of all the little tweaks and improvements that were pulled out of MPEG-4 for one reason or another (usually because they were too computationally expensive), and why, as a result, it has thousands of different combinations of features that are extremely complicated to support, which is why it had to be grouped into "profiles" (e.g., Baseline, Main, High): http://blog.mediacoderhq.com/h264-profiles-and-levels/
I was also hoping that he would at least touch on the features that make H.264 unique from previous MPEG standards, like in-loop deblocking, CABAC Entropy Coding, etc..
Again, it's fine as an introduction to video encoding, but there's nothing in here specific to H.264.
Still, it's a good overview of generic video compression.
Did you miss the third paragraph?
As someone who knew nothing about it before, I found it lived up to it's goal.
> The only thing moving really is the ball. What if you could just have one static image of everything on the background, and then one moving image of just the ball. Wouldn't that save a lot of space? You see where I am going with this? Get it? See where I am going? Motion estimation?
Reusing the background isn't motion compensation -- you get that by encoding the differences between frames so unchanging parts are encoded very efficiently.
Motion compensation is when you have the camera follow the ball and the background moves. Rather than encoding the difference between frames itself, you figure out that most of the frame moved and you encode the different from one frame to a shifted version of the blocks from a previous frame.
Motion compensation won't work particularly well for a tennis ball because it's spinning rapidly (so the ball looks distinctly different in consecutive frames) but more importantly because the ball occupies a tiny fraction of the total space so it doesn't help that much.
Motion compensation should work much better for things like moving cars and moving people.
Along the same lines, it would be interesting to figure out an automated time-varying-feature detection algorithm to determine which kinds of transforms are the right ones to encode.
Do video encoders already do something like this? It seems like a pretty difficult problem since there are so many permutations of applicable transformations.
That's how Framefree worked. It segments the image into layers, computes a full morph, including movement of the boundary, between successive frames for each layer, and transmits the before and after for each morph. Any number of frames can be interpolated between keyframes, which allows for infinite slow motion without jerk. You can also upgrade existing content to higher frame rates.
This was developed back in 2006 by the Kerner Optical spinoff of Lucasfilm. It didn't catch on, partly because decompression and playback requires a reasonably good GPU, and partly because Kerner Optical went bust. The segment-into-layers technology was repurposed for making 3D movies out of 2D movies, and the compression product was dropped. There was a Windows application and a browser plug-in. The marketing was misdirected - somehow, it was targeted to digital signs with limited memory, a tiny niche.
It's an idea worth revisiting. Segmentation algorithms have improved since 2006. Everything down to midrange phones now has a GPU capable of warping a texture. And it provides a way to drive a 120FPS display from 24/30 FPS content.
The parties involved with Framefree were involved in fraud litigation around 2010. The case record shows various business units in the Cayman Islands and the Isle of Jersey, along with Monolith in Japan and Framefree in Delaware. No idea what the issues were. It looks like the aftermath of failed business deals.
The inventors listed on the patents are Nobuo Akiyoshi and Kozo Akiyoshi.
Daala and AV1's PVQ is an example of a predictor for contrast and brightness (in a very broad sense).
The previous codec MPEG4 part 2 ASP (aka DivX&XviD) had "global motion compensation" which could encode scales and rotation, but like most things in that codec it was broken in practice. Most very clever ideas in compression either take too many bits to describe or can't be done in hardware.
That's part of why video encoding can be very slow --- with motion compensation, to produce the best results the encoder should search through all the possible motion vectors and pick the one that gives the best match. To speed things up, at a slight cost in compression ratio, not all of them are searched, and there are heuristics on choosing a close-to-optimal one instead: https://en.wikipedia.org/wiki/Block-matching_algorithm
This is a great overview and the techniques are similar to those of h264.
I found it invaluable to get up to speed when I had to do some work on the screen content coding extensions of hevc in Argon Streams. They are a set of bit streams to verify hevc and vp9, take a look, it is a very innovative technique:
Don't know in photoshop, but in Gimp there's a plugin called "wavelet decomposer" that does that.
There was a question about retouching photos some while ago (http://photo.stackexchange.com/questions/48999/how-do-i-take...) that using wavelets was a good use of it.
I wanted to recreate this for the home page of my file manager . The best I could come up with was . This PNG is 900KB in size. The H.264 .mp4 I now have on the home page is only 200 KB in size (though admittedly in worse quality).
It's tough to beat a technology that has seen so much optimization!
Sadly, this is what makes video encoders designed for photographic content unsuitable for transferring text or computer graphics. Fine edges, especially red-black contrasts start to color-bleed due to subsampling.
While a 4:4:4 profile exists a lot of codecs either don't implement it or the software using them does not expose that option. This is especially bad when used for screencasting.
Another issue is banding, since h.264's main and high profiles only use 8bit precision, including for internal processing, and the rounding errors accumulate, resulting in banding artifacts in shallow gradients. High10 profile solves this, but again, support is lacking.
But if you have source/subsampled/interpolated comparisons that show 99% identical results i would be interested to see them.
Of course all that is useless if you don't have control over the output device. Just having the ability to record 4:4:4 makes the issue go away as long as the target can display it, no matter what interpolation they use.
Ehm, what?! The image on the right looks really bad and the missing holes was the first thing I noticed. No zooming needed.
And that's exactly my problem with the majority of online video (iTunes store, Netflix, HBO etc). Even when it's called "HD", there are compression artefacts and gradient banding everywhere.
I understand there must be compromises due to bandwidth, but I don't agree on how much that compromise currently is.
However if that MacBook Pro image was placed on the side of an article where the primary content was the text you were reading, you'd glance at the image and your brain would fill in the details for you. You probably wouldn't notice the difference in that context.
For most use cases, there likely is very little functional difference between the two images. At least, that was how I understood it.
Although to be fair, I suspect that a lot of times what I'm looking at are mpeg videos that have been recompressed a half dozen or more times with different encoders. Each encoder having prioritized different metrics. So, the the quality gets worse until it doesn't really matter how good the compression algorithm is. Each new re-compression is basically spending 3/4 of its bits maintaining the compression artifacts from the previous two passes.
Isn't the images above the text a zoomeed version?
>Here is a close-up of the original...
I mean for fast action scenes, I rarely notice the difference between 720p and 1080p at 10ft away... but different encoding and sources, not just size alone can make significant differences.
That's not really a weird thing on HN though. Video codecs are exactly the kind of thing that we get excited about.
Previously Daala was presented as a candidate for NETVC but apparently this didn't go anywhere? https://en.wikipedia.org/wiki/NETVC
How does DCT work:
The fair coin flip is also an example of a process that cannot be compressed well at all because (1) the probably of the same event happening in a row is not as high as for unfair coins (RLE is minimally effective) and (2) the uniform distribution has maximal entropy, so there is no advantage in using different code lengths to represent the events. (Since the process has a binary outcome, there is also nothing to gain in terms of code lengths for unfair coins.)
It would be really cool to further extend it showing actually how the various tiles are encoded and between frames, something along the lines of: http://jvns.ca/blog/2013/10/24/day-16-gzip-plus-poetry-equal...
H.265 can even do deltas between blocks in the same frame, IIRC, and is excellent for still image compression too.
A good way to develop an intuition for the fourier space is to look at simple images and their DFT transforms: http://web.cs.wpi.edu/~emmanuel/courses/cs545/S14/slides/lec... (3/4 of the way through the slide deck).
This analysis of a "bell pepper" image and its transform is also helpful: https://books.google.com/books?id=6TOUgytafmQC&pg=PA116&lpg=....
As for why you want to do this: throwing away bits in the spatial domain eliminates distinctions between similar intensities, making things look blocky. In the frequency domain, however, you can throw away high-frequency information, which tends to soften patterns like the speaker grills in the MBP image that the human eye isn't that sensitive to to begin with.
Or in this case, a real data set.
Along with the Wikipedia article and the obvious Internet search, there's a lot of good stuff that has been on HN: https://hn.algolia.com/?query=fourier%20transform&sort=byPop...
The JPEG article also has a very good, step by step example of the DCT, followed by quantization and entropy coding: https://en.wikipedia.org/wiki/JPEG
(Though for images it's in 2D, not 1D which is more commonly done)
I would expect also the edges in the image to become more blurred, as edges correspond to high-frequency content. However, this only seems to be slightly the case in the example images.
In this context, the edges of, say, the macbook are not "high frequency" content, since they only feature one change (low to high luminosity) in a given block rather than several (high-low-high-low-high) like for the grill.
Neural nets are always expensive to train. You'd better be getting something from them that you can't get some other way.
With video compression, I think most would agree that there might be better architectures/algorithms that we haven't stumbled upon yet. Whether specifically "neural networks" will be the shape of a better architecture, I don't know. But almost surely some meta-algorithm that can try out tons of different parameters/data-pipeline-topologies for something that vaguely resembles h.264 might find something better than h.264.
Neural nets are expensive to train. But so is designing h.264.
I was under the impression that the first 100,000 units are free, and then 20c per unit afterwards to a max of $25m.
H264 drops to 10c per unit after 5m units, to a max of $6.5m.
You need to be shipping 125 million units annually to hit the full $25m.
Yes it's more, but it's not quite ten times. And notably if the chip maker pays the royalties, then the content creators don't need to (though that was excepted indefinitely with H264).
Parts regurgitated from a quick google for reference 
So it depends how these royalty works in details. If only the chip manufacture are paying, Mediatek, Qualcomm, Samsung, Intel, AMD, Nvidia, Apple. That is at least 10 players paying maximum. And if you consider small players, the total contribution of Royalty fees to HEVC is 1 Billion / Year. ONE BILLION!! In the life time of a Video Codec that typically run at least a decade, these patents are 10 Billions.
Do you think that is a fair price, i think everyone should decide for their selves.
Said group's demands are basically the reason Netflix started considering VP9.
b) Leave codec packs in 2000 where they belong. They are a great malware vector and also good at messing with settings they shouldn't.
>KCP utilizes the following components:
MPC-HC - A robust DirectShow media player.
madVR - High quality gpu assisted video renderer. Included as an alternative to EVR-CP.
xy-vsfilter / XySubFilter(future) - Superior subtitle renderer.
LAV-Filters - A package with the fastest and most actively developed DirectShow Media Splitter and Decoders.
(Optional) ReClock - Addresses the problem of audio judder by adapting media for smooth playback OR utilized for bit perfect audio.
I'm actually using MPC-HC and AC3Filter to deal with some files where I couldn't hear the centre channel on VLC (on stereo speakers). Everything else isn't really needed.
Yes, I've dumbed it down in the article to a simple circle to illustrate the point.
So instead of the reported 916KiB we're looking at 584KiB.
This doesn't change the overall point, but details matter.
$ wget https://sidbala.com/content/images/2016/11/FramePNG.png
--2016-11-04 22:08:08-- https://sidbala.com/content/images/2016/11/FramePNG.png
Resolving sidbala.com (sidbala.com)... 18.104.22.168, 22.214.171.124, 2400:cb00:2048:1::6819:1112, ...
Connecting to sidbala.com (sidbala.com)|126.96.36.199|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [image/png]
Saving to: ‘FramePNG.png’
FramePNG.png [ <=> ] 622.34K --.-KB/s in 0.05s
2016-11-04 22:08:08 (12.1 MB/s) - ‘FramePNG.png’ saved 
$ pngout FramePNG.png
In: 637273 bytes FramePNG.png /c2 /f5
Out: 597850 bytes FramePNG.png /c2 /f5
Chg: -39423 bytes ( 93% of original)
A video on youtube led me to Joofa Mac Photoshop FFT/Inverse FFT plugins  which was worth a try. I was unable to register it, as have others. Then I came across ImageJ , which is a really great tool (with FFT/IFFT).
Edit: if anyone checks out ImageJ, there's a bundled app called Fiji  that makes installation easier and has all the plugins.
If anyone has other apps/plugins to consider, please comment.
BPG is an open source lossless format for images that uses HEVC under the hood, and is generally better than PNG across the board: http://bellard.org/bpg/
For a runner-up lossless image format unencumbered by H265 patents (completely libre), try http://flif.info/.
Anyway super interesting subject.
> Even at 2%, you don't notice the difference at this zoom level. 2%!
I'm not supposed to see that major streakiness? The 2% difference is extremely visible, even 11% leaves a noticably bad pattern on the keys (though I'd probably be okay with it in a moving video), only the 30% difference looks acceptable in a still image.
Simplistic as it is, it touches on all the main differences. The only problem with H.265 is the higher requirements and time needed for encoding and decoding.
The article discusses lossy compression in broad terms, but have we reaped all the low hanging fruit? Can we expect some sort of saturation just like we have with Moore's law where it gets harder and harder to optimize videos?
We can conclude that 64KiB demos are at least 48 times as magical as H.264.
> H.264 is protected by patents owned by various parties. A license covering most (but not all) patents essential to H.264 is administered by patent pool MPEG LA. Commercial use of patented H.264 technologies requires the payment of royalties to MPEG LA and other patent owners. MPEG LA has allowed the free use of H.264 technologies for streaming internet video that is free to end users, and Cisco Systems pays royalties to MPEG LA on behalf of the users of binaries for its open source H.264 encoder.
Edit: I emphasize this mainly because the terms have a specific meaning in standards jargon but also because it places the blame for software patent abuses on the wrong parties (the standards developers rather than the lawyers and legislators).
Uh, anyone familiar with the MPEG process will assure you that the companies involved love (let me restate that: PREFER) to bring in technology on which they own the patents so they get a good cut of the resulting patent pool.
Sometimes this is even done even though it technically makes no sense. Best example: hybrid filter-bank in MP3.
The process also provides no protection or discouragement from patents from semi-involved industry partners appearing later on, etc.
This difference in approach is a stark contrast to the IETF, which is why Opus work, and future AV1 work are happening under the IETF rather than the MPEG groups.
Of course, this also only applies in countries which enforce software patents.
The point you are making here is PRECISELY the point that the author was making in the article: that a lossy format can be far, far smaller. He then goes into the details (from a high-level point of view) of what kinds of losses H264 incurs.
"Okay, but what the freq are freqX and freqY?"
I apologize if this is trivial. What does 1920 in above equation represent?
Btw question is trivial but don't feel apologetic about asking questions. None of us know everything and in a field we don't know, our questions will be trivial.
In other words, why is H.264 in particular magical?
First of all, I think he meant "you would NOT even notice".
Second of all, that's the first thing I noticed. That PNG looks crystal clear. The video looks like overcompressed garbage.
What a terrible introduction of lossy compression. This would mean that if I empty the thrash bin on my desktop, it's lossy compression.
The concept of going through all compression ideas that are used is pretty neat though.
No he didn't explain "right after that." He rambled on and on, and even after all of that, he STILL doesn't bring up JPG.
It's an inherently stupid comparison to make. You can't polish a turd.
> This concept of throwing away bits you don't need to save space is called lossy compression. H.264 is a lossy codec - it throws away less important bits and only keeps the important bits.
> PNG is a lossless codec. It means that nothing is thrown away. Bit for bit, the original source image can be recovered from a PNG encoded image.