Regarding the comments about the complexity of existing formats - I agree, with the exception of PNG. In terms of complexity, QOI is very similar to PNG's filtering phase, i.e. what you get when you strip the primary compression phase (Huffman coding) from PNG.
Also, I admire how the author combined a lot of common and cool tricks into QOI:
- Block tags form a sort of a prefix code, making it easy to differentiate block types with differing sizes
- Look-back array is used (common trick from LZW family)
- Delta-coding is used when storing pixel diffs
- In the end, if nothing else works, full pixel value is written
Also, the fact that everything is byte aligned is very appreciated.
This is a breath of fresh air and really impressive how good performance (both processing and size reduction) this gets with so little (very readable!) C.
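To make the prefix-code point concrete, here is a rough sketch of what that kind of tag dispatch looks like in a decoder. The tag values and field widths below are made up for illustration, not the actual QOI bitstream:

#include <stddef.h>
#include <stdint.h>

typedef struct { uint8_t r, g, b, a; } px_t;

/* Hypothetical sketch of a prefix-coded tag dispatch in the spirit of QOI;
   the tag values and field widths are made up for illustration and are NOT
   the actual QOI bitstream. */
static size_t decode_one(const uint8_t *data, size_t pos,
                         px_t *px, px_t seen[64], int *run) {
    uint8_t b = data[pos++];
    if ((b & 0xC0) == 0x00) {            /* 00xxxxxx: index into the seen-pixel array */
        *px = seen[b & 0x3F];
    } else if ((b & 0xC0) == 0x40) {     /* 01xxxxxx: run of the previous pixel */
        *run = (b & 0x3F) + 1;
    } else if ((b & 0xC0) == 0x80) {     /* 10rrggbb: small per-channel delta */
        px->r += ((b >> 4) & 0x03) - 2;
        px->g += ((b >> 2) & 0x03) - 2;
        px->b += ( b       & 0x03) - 2;
    } else {                             /* 11xxxxxx: full pixel value follows */
        px->r = data[pos++];
        px->g = data[pos++];
        px->b = data[pos++];
    }
    return pos;
}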
PNG has its own story, as its #1 design goal was to avoid techniques that were patented at the time. So they combined a simple set of prepasses with zip/gzip compression. There were better techniques known at the time, but the idea was that combining simple, known, patent-free solutions was the safest way to achieve a free format.
(The GP-mentioned Bink is lossy! And for some reason he also uses a picture tending towards photo-realistic, which does NOT play well with lossless!)
However, if I may make a suggestion: If you do end up rolling your own container format for any reason, always include a way to store the colour space of the image!
Treating RGB images as arrays of bytes without further tags is like treating char* as "text" without specifying the code page. It could be UTF8. Or Latin-1. Or something else. You don't know. Other people don't know. There's often no way to tell.
Similarly with colour spaces: sRGB, Display P3, and Adobe RGB are just close enough to be easily confused, but end up looking subtly wrong if mixed up.
If you want all that, there are formats galore that support it! The point is more: where are the formats for people who don’t need all that? Here is one. Please don’t try to ruin it.
Just like e.g. some compilers don’t support a lot of source file encodings but their non-support consists of clearly stating that a source file is UTF-8.
Note that this requirement is only for an image exchange format. A raw-er image format used internally in e.g a game doesn’t need it as badly.
But in an exchange format every user needs to know whether the image data is sRGB or Adobe. Because the reader didn’t encode the data.
The GP's advice about recording color space is spot on. Why design a new container when there are tons of off-the-shelf ones? A collection of files inside a Zip is my go-to.
But this approach is a source of endless conflict with image processing tooling, because nobody except the game author knows what the channels mean. Game developers keep complaining about image processing tools doing anything other than bytes-in-bytes-out on that data (e.g. clearing "unused" alpha or treating RGB as human-perceptible colors), because it breaks the unstated assumptions.
People had basically exactly the epiphany you described. For example, there are several color models which try to be more lighting-invariant or (racially) color blind (this is a very similar problem, humans are all more or less "orange" hue-wise due to melanin's spectrum). TSL is one such space, there are others.
S3TC, BC7, and ASTC are all lossy compressions so they are really not comparable.
A lot of people don't want to play games that look bad - I'd assume that this is going to be a particular issue when a game goes for photo-realistic graphics, only to end with people with orange skin colour on lots of monitors...
- colorspace - arbitrary string, the colorspace of the color data on disk.
- rgb_primaries - arbitrary string, indicates the physical meaning of the linear RGB values.
- transfer_function - arbitrary string, indicates the function to use to linearise the values from disk into the RGB primaries.
- intent - one of display, texture, data, or an arbitrary string. Display images get tone-mapped, textures do not, data only gets linearised, even if it has rgb_primaries attached, other values only have internal meaning to the concrete art pipeline.
For the first three, rgb_primaries and transfer_function take precedence over colorspace. The reason I want them is because people do mix and match. You could say that you can put the same data in some string format in the colorspace tag, but then everyone will invent their own slightly different way of specifying it and interoperability will suffer. Best to just split them into separate fields and keep each value unstructured.
A lot of arbitrary strings, but that's how it is - there is really no standard list of colorspace names. These four fields can cover pretty much everything I have seen people do.
Though currently a lot of people just tag the filename with a colorspace suffix and call it a day. OpenColorIO even has a function to directly get a colorspace from a filename.
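For what it's worth, here is roughly how I would picture those four tags as a struct; the field names come from the list above, while the types and example values are just my own sketch:

/* One possible shape for these tags, sketched as a C struct. The field names
   follow the list above; the types and example values are my own guesses. */
typedef struct {
    const char *colorspace;        /* e.g. "sRGB", "ACEScg"; whole-space shorthand       */
    const char *rgb_primaries;     /* e.g. "BT.709", "P3-D65"; overrides colorspace      */
    const char *transfer_function; /* e.g. "sRGB", "PQ", "linear"; overrides colorspace  */
    const char *intent;            /* "display", "texture", "data", or pipeline-specific */
} image_color_tags;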
So what is “ground truth” exactly? We need to know what units we’re talking about; is it lumens, or “tristimulus” values (what we want our eyes to see), or LCD monitor voltages, or something else? First we have to agree on what ground truth means.
A big reason for common formats like PNG and JPEG to need color profiles is because 8 bits per channel is not enough color resolution unless you store the colors non-linearly. 8 bits is barely enough (or some would say almost enough) to represent low-dynamic-range images on TVs and monitors even when you have the colors gamma encoded or use a perceptual color space.
The problem with storing “ground truth” values for pixels in 8-bits-per-channel formats is that many experts would probably say this implies a linear color space. And a linear color space encoded in 8 bits leaves noticeable gaps between colors, such that you can see color banding if you try to render a smooth gradient. You lose up to 2 bits of color precision in some regions, notably the dark colors.
So, in order to not have color banding, we correct the colors before storing them by trying to make the colors have even steps between any 1-bit change in the spectrum. The color profile is just a record of what was done to correct the colors, so someone else can un-correct them as needed.
You might think we could do away with color profiles for high-dynamic-range images, but the non-linearity of our vision and our displays means that linear encodings are still inefficient with their bits even if you have 32 bits per channel.
By default this all falls back to sRGB, which is not the "ground truth pixel value" you may be assuming it is.
The 0.5 value is not half the brightness of 1.0 in sRGB. It merely means roughly half the signal level driven to your CRT.
If you want correct editing operations, you need to work in linear RGB.
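For reference, this is the standard sRGB decode step, and it shows what 0.5 actually means:

#include <math.h>

/* Standard sRGB decode: map an encoded value in [0,1] back to linear light,
   which is what you want before blending, resizing, gradients, etc. */
static double srgb_to_linear(double c) {
    return (c <= 0.04045) ? c / 12.92
                          : pow((c + 0.055) / 1.055, 2.4);
}

srgb_to_linear(0.5) comes out to roughly 0.21, i.e. an encoded 0.5 carries only about a fifth of the linear light of 1.0.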
It gets more fun. What is white? The color of your backlight? The color of white paper under your particular desk light?
And what is black? Is it pure darkness, or the black ink of your printer? Does your printer represent pure black, or dark gray?
We're stuck with images stored in the display profile of old CRTs by default, because that was the most practical option at the time.
The analogue in photography is that photographs lose color clarity and fade as they age. Why should I care if the same happens to digital images as display technology evolves, when this is already a problem with every physical medium?
> For correct editing operations, and correct display. Sadly, it's not pedantry, it's reality.
I've been designing images and graphics for the web and sometimes for print off and on as a regular part of my various job functions since ~2009 and I have yet to see a situation where a color profile on an image isn't a huge pain and something that needs to be stripped away to get seemingly correct results.
Back to my point, I still don't see the value of having a color profile applied to an image. Images exist in the ether as platonic ideals, and we try to approximate them with our various display technologies. Why complicate the matter by also having multiple "flavors" of the platonic ideal itself? When I say (255, 0, 0), I expect the display to show me its best approximation of red. When I say (255, 255, 255), I expect the display to show me something as close to white as possible (and at the brightest possible setting). When I say (0, 0, 0), I expect the display to show me something that looks as close to black as possible. It's up to the display technology to decide whether this means just turn off that pixel on the screen or disengage the backlight or do whatever, but at the end of the day it's just trying to approximate black.
This is complicated enough; why do I need an image that will look good on only one kind of display and have its greens look pink on other displays? Isn't having a color profile for the display enough?
Which red? There isn't a single, true "red". And different materials produce different reds (my work's Sony BVM-X300's reddest red is going to look way different than your monitor's red). Not all displays use only RGB color primaries, too. For example, at Dolby's office in San Francisco they have a projector in their theater that uses 5 (IIRC) color primaries, not 3. 6-primary projectors exist. And other non-RGB displays.
> When I say (255, 255, 255), I expect the display to show me something as close to white as possible
Which white? D50? D55? D60? D65? D67? Something else? And yes, these different white points (and many others) are actually used in practice.
> (and at the brightest possible setting).
100 nits looks way, way different than 4,000 nits. Some monitors can do 10,000 nits.
> When I say (0, 0, 0), I expect the display to show me something that looks as close to black as possible. It's up to the display technology to decide whether this means just turn off that pixel on the screen or disengage the backlight or do whatever, but at the end of the day it's just trying to approximate black.
Which black? This might sound dumb, because we can agree that there is an objective "absolute black" (i.e. zero photons). But when an artist creates an image, the monitor they use has some version of black. If you don't account for that, the image may be distorted. Blacks can be crushed, for example.
An absolute color space exists. It's called XYZ. We could use it. Some image formats support it.
XYZ is really annoying to work in though. ACES is a pragmatic solution here, quite literally: https://blog.frame.io/wp-content/uploads/2019/09/ACES-APO.jp...
Okay, fair enough, two out of three primaries are imaginary colors, but nobody said you have to use the whole color space when your processing pipeline is meant to be using 32 bit floats. Delivery formats might want to be using a more real color space.
By working space I mean a space in which a user manipulates the data in some way, e.g. image editing, grading, compositing or rendering.
What "red" and "green" are has changed quite dramatically with different display technologies. A display designed to meet Rec.2020 can show colors that other displays literally cannot produce and the deviation between the primaries is so big that everything looks like garbage if you don't do a color space conversion. Take some sRGB content and display it on a DCI P3 display. Looks like shit. Humans look like crabs.
> On monitors and displays, sure, those vary, but why allow people to specify a color profile on an image
The sole reason why we have device profiles defining device color spaces is so that we can convert from the image's color space to the device's color space. If images don't have a profile assigned to them, you don't need a device profile.
So you either have both image and device profile, or neither. Just one doesn't make sense.
If your green is going pink, then either your profiles are wrong, or your software is broken. Maybe it really is pink, and it just looks green for you, because you're ignoring color profiles.
But the fact is, most software is broken, and you should store images with the sRGB profile.
And also, you can actually calibrate consumer hardware, so that you can scan a photo, and reprint it. And the scan, display, and print will look exactly the same. (It's not the case by default, because consumer printers do what you do, stretch or fit the color space to the printer's. The Vivid and Natural profiles in the driver, respectively. This is a good default for documents, not for professional photography.)
We can continue believing that Santa exists, or we can accept that effectively nobody has correct color profiles, and that nobody cares either.
It's nice metadata you have there, would be a shame if I applied night mode to it at 6PM.
> also, you can actually calibrate consumer hardware
...with professional calibration hardware that costs more than the hardware you already have at hand.
Again, pretty much everyone will just tweak their monitor/printer settings until they get results that look OK.
>display and print will look exactly the same
Until you turn off the lights. Or until you turn on the lights (what color temperature is your light?). Or until the wind blows, moving that cloud over the sun, changing the light temperature.
Or — wait for it — actually, that's all, while you've been waiting the sun has set.
The point being, all we can shoot for is FF0000 being what would be "red" for most, 00FF00 being "green", and 0000FF being "blue" — and then accept the "fade" of the digital image from one screen to another as a fact of life.
So, learn how to stop worrying and love unspecified RGB colorspaces. Magenta only exists in your brain anyway.
Side note: did you ever think about how your recorded music isn't adjusted for the frequency response of the loudspeakers, and half the people will listen to it with the "BASS MEGABOOST" feature on anyway?
That's why the art of mastering exists. It's accepting the reality, and putting work into making it work with uncalibrated, crappy hardware — as well as hi-fi gear.
PS: have fun calibrating your vision to make sure that you don't see green where I see red
Even though, for some reason I don't know, it works much less well: if you try to take pictures with the wrong white balance setting, the picture will look like shit no matter how long you look at it.
As a side note, this article is pretty terrible, as its author mixes everything up.
Just imagine you're using an HDR display, where the brightest white is as bright as the sun.
That is precisely what attaching a profile to the image is supposed to avoid. It indicates what (255, 0, 0) in that image actually means. Then you convert that to the display profile to get the triplet that you must send to the display to actually get something that looks like it.
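Schematically the conversion looks something like this; the matrices and transfer curves come from the two profiles, and all the names here are placeholders rather than any real library's API:

/* Rough shape of a profile-to-profile conversion (per pixel, float RGB in [0,1]). */
typedef struct { double m[3][3]; } mat3;

static void convert_pixel(double rgb[3],
                          const mat3 *src_to_xyz, const mat3 *xyz_to_dst,
                          double (*src_decode)(double), double (*dst_encode)(double)) {
    double lin[3], xyz[3], out[3];
    for (int i = 0; i < 3; i++)                      /* 1. undo the source transfer curve */
        lin[i] = src_decode(rgb[i]);
    for (int i = 0; i < 3; i++) {                    /* 2. source primaries -> XYZ */
        xyz[i] = 0;
        for (int j = 0; j < 3; j++) xyz[i] += src_to_xyz->m[i][j] * lin[j];
    }
    for (int i = 0; i < 3; i++) {                    /* 3. XYZ -> display primaries */
        out[i] = 0;
        for (int j = 0; j < 3; j++) out[i] += xyz_to_dst->m[i][j] * xyz[j];
    }
    for (int i = 0; i < 3; i++)                      /* 4. apply the display transfer curve */
        rgb[i] = dst_encode(out[i]);
}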
> Isn't having a color profile for the display enough?
What would you do with just one profile?
Because with digital data we can do better?
When I say #ff0000, do I mean "sRGB red", as a TV would display in pure red phosphors back in the 1970s, or do I mean "1000K" / 700nm red (daylight is 6500K), which can only be reproduced by specialized displays?
Most people from the time before Apple made Display P3 wide color displays standard in all their products — so, most HN commenters — believe that #ff0000 just means "only red, no green or blue" — but all web browsers currently are left to invent their own answer for what #ff0000 means, and they do not all invent the same answer. Yet.
So it is with images.
In the beginning times, people made sixteen-color images for NTSC monitors, and then other people learned you could display synthetic colors that don't exist by mangling the image bitstream to the monitor, and those sixteen colors were hard-coded but varied by like +/-5% per device due to acceptable manufacturing tolerances, so they were internally consistent anyways.
And so we end up, today, trying to solve color profiles for file formats that specify colors as RGB hex values — which are, by definition, restricted to sRGB and thus wildly insufficient. But if we plan too far, we get file formats that are totally ridiculous - TIFF comes to mind - that can represent anything under the sun, at the cost of having a 1:1 mapping of "pixels" to "bytes" or worse.
You may also find the history of the EXR format relevant here, as it supports a functionally infinite amount of dynamic range in images, and there are definitely good niche use cases for it — but pragmatically, it's ridiculously overkill for normal everyday purposes. You could argue that everyone should use EXR in Lab, but then someone would suggest Oklab (since it makes better perceptual compromises) or HSLuv (what a great name), or you could try storing raw CIE 1931 chromaticity coordinates, which has been deprecated by CIE 170-2 to address serious flaws in the 1931 system.
Tying this back to engineering, there are often very good reasons for designing a processor or a byte store to be RISC or CISC, big or little endian, and we simply cannot predict what will work most pragmatically for all cases universally ahead of time, any more than the metric system was invented early enough in time to prevent miles per hour from being a unit of measure.
It's more like saying the range from 0..2 is complete, so don't use an encoding that caps out at 0.8 or 1.1.
The difference between all visible colors and a more constrained color space is only a factor of two or three. Less than half a bit per channel.
If you compare Rec.2020 with 10 bits per channel and sRGB at 9.5 bits per channel, both of them using the same peak brightness, you should see similar amounts of banding on both.
Increasing the maximum brightness can require a lot of extra bits to keep things smooth, but that's a separate thing from which colors are possible.
(I'd assume that this is especially relevant for backlit displays like the liquid crystal ones, since they have to use the same potential maximum source power for all of the channels.)
Rec.709 has 10 and 12 bit modes too, and it needs them, even though it has the same gamut as sRGB.
The reason 10 bits isn't enough for Rec.2020 is because 10 bits was never enough to fully represent any gamut. Once you add enough bits to properly support a gamut, then extending it to match human vision is a tiny cost.
And to be extra clear, when I initially said "The difference between all visible colors and a more constrained color space", I was talking about taking a color space and expanding its gamut without changing anything else about it.
I guess when I said "no color space is complete" I meant "no color space is faithfully bijective to nominative human visual sensory response." Because such a thing is basically impossible (or insanely nonlinear and unwieldy).
Also these color systems all or almost all assume a "norminal" (to steal a word from Everyday Astronaut, normal+nominal) average human vision system. If your system isn't norminal, average, human, or even visual (visible light), then your color space is an incomplete map of the territory.
So for any given array of pixel channel, there is either an implicit assumption or explicit color space of what those values "mean".
In order to know what an image should look like, a "perfect" or "complete" representation, at minimum the real colors must be injective to the color space (every value has at most 1 color). You could argue that the color space needn't be injective (values that have no color), which is probably fine if your image is static and you never manipulate it. As soon as you manipulate values of your image in your color space, you run the risk of falling outside the color domain.
But really to unpack this whole thread, my essential point is that color spaces are about engineering tradeoffs. sRGB can't reproduce many colors. CIE 1931 represents all color but has imaginary colors.
Ergo, I contend there is no "complete" color space.
> As soon as you manipulate values of your image in your color space, you run the risk of falling outside the color domain.
If your manipulation could go outside the domain, then your manipulation is incorrect. Like, if I have a number from 0 to 99 that I store in a byte, it's fine that a byte is not bijective to those numbers, and if I end up with a byte representing 105, then I did something wrong that corrupted my byte.
If I'm doing incorrect math, a bijective color space doesn't fix the problem, it just guarantees that I get an incorrect color out the other end instead of possibly getting an incorrect color or possibly getting an error value.
Good luck representing that with only 3 channels, without using "bytes representing 105" (aka imaginary colors), and trying to stay perceptually uniform!
It gets even worse: the real color space is non-Euclidean: it stretches out, parallel lines drift further apart, etc.
I think you misunderstood my point.
I'm saying it doesn't matter if certain byte sequences are invalid and don't correlate to real colors.
Imagine you're encoding two digit numbers into bytes. It's not bijective but that's fine, no correct two-digit math will give you an invalid byte.
Or imagine you're encoding numbers into bytes the way UTF-8 works. It's not bijective to all sequences of bytes but that's fine. No correct manipulation will give you a sequence that can't decode.
If you're trying to do incorrect math, you've already lost, and no color space will help. For example, overflowing your red channel in sRGB could turn your eggshell into a cyan.
> and trying to stay perceptually uniform
Where did that come from?
Zero encodings will be perceptually uniform, therefore being non-uniform isn't a disqualifier.
If you author an image on Display A and save its colours, then how does Display B know to render colours correctly that looks consistently with Display A?
You have two options:
1.) Tag Display A's colour space into the image, then convert colours to match Display B's colour space when rendering;
2.) Convert colour data from Display A colour space into a generic colour space (like sRGB, Display P3, ACEScg, etc) and embed that in to the image, then convert colours to match Display B's colour space when rendering.
I would absolutely recommend basing on some sort of container format (vs "raw" data structures), it makes it really easy to add functionality without breaking backwards compatibility.
Basically all container formats follow this idea of "collection of boxes" pattern. The IFF series are box-length based. Ogg uses capture patterns but no explicit box size (I think). The former means you have to know the box size ahead of time (or seek back and write it in), the latter is better for streaming, but harder to seek (without indexes).
In general, allow arbitrary key-value pairs to be stored.
I don't think that's a good idea. Most images, like photos and screenshots, don't have physical dimensions. Some image tools like to put in bogus default DPI values that are guaranteed to be wrong, and it can be a bit of a pain.
But I get your worries. As a compromise, DPI could be optional.
Color profiles however, yes, those are indeed important.
If you're worried about dumb users renaming it... I think most users know not to change things after a '.', so you could just put the colorspace code there, and then another '.QOI' to end the filename.
If you want to keep cruft out of the file itself, then putting it in the filename works.
also, changing the name of a file is not like, uncommon. if it is an image sequence, some people prefer an underscore before the frame number, some prefer a dot, and some even prefer a space. some places have strict file naming policies, and file names will be modified to comply.
anyone depending on crucial information being preserved in a filename is embarking on a Sisyphean effort of control.
That's a bit imprecise. The algorithm would still be O(n) even with a linear search through the "seen pixel array", as that array is bounded to 64 entries and is therefore a constant factor that gets eaten by the Big-O notation.
Because of this hash, the algorithm seems to do something other than what was first described, though.
Instead of "a running array of the 64 pixels it previously encountered", it's actually 64 previously encountered pixel _values_, which allows for a much bigger jumpback.
Because this array is computed deterministically during execution, it itself acts as a kind of compressed jumpback table, allowing approximated arbitrary jumpback with just 6 bits.
I think there might be an interesting tweak that could increase the effectiveness of the hashing technique used.
If one were to walk and store the pixels in a Z-curve order, the runtime would stay the same but the locality of the color value pool might be increased.
It's somewhere in between. Because these values are hashed into single slots, assuming that hash codes are random, the probability that a value is still in the table after k intervening values is pow(63/64, k).
That means the table is unlikely to contain all 64 values, but it will also retain some older values. Since older values are less likely to be useful in the future the encoding performance is likely somewhat worse than if the algorithm used strictly the last 64 distinct values.
This means that the algorithm is a bit sensitive to hash collisions, which seem particularly likely for primary colors (i.e. with each of the components at value 0 or 255):
- Slot 63: black, all shades of magenta (x ^ 0 ^ x ^ 255 = 255), yellow (etc.) and cyan (assuming opaque alpha).
- Slot 0: white (255 ^ 255 ^ 255 ^ 255 = 0), red (255 ^ 0 ^ 0 ^ 255 = 0), green (etc.), blue.
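For anyone who wants to check, a quick brute-force of the (r ^ g ^ b ^ a) % 64 slots for the fully saturated, opaque primaries:

#include <stdio.h>
#include <stdint.h>

/* Print which slot the (r ^ g ^ b ^ a) % 64 hash assigns to each fully
   saturated primary, assuming opaque alpha. */
int main(void) {
    const char *names[] = {"black", "red", "green", "blue",
                           "cyan", "magenta", "yellow", "white"};
    uint8_t rgb[][3] = {{0,0,0},{255,0,0},{0,255,0},{0,0,255},
                        {0,255,255},{255,0,255},{255,255,0},{255,255,255}};
    for (int i = 0; i < 8; i++) {
        uint8_t a = 255;
        int slot = (rgb[i][0] ^ rgb[i][1] ^ rgb[i][2] ^ a) % 64;
        printf("%-8s -> slot %d\n", names[i], slot);
    }
    return 0;
}

This prints slot 63 for black/cyan/magenta/yellow and slot 0 for white/red/green/blue, so the eight extreme colors pile into just two slots.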
With random byte->byte lookup tables those costs should be hidden by the cache.
so hash = HR[r] ^ HG[g] ^ HB[b] ^ HA[a].
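A sketch of what that would look like; the table contents are arbitrary as long as encoder and decoder agree on them, and the seeded rand() here is only for illustration:

#include <stdint.h>
#include <stdlib.h>

/* Four random byte->byte tables, XORed together to form the hash. */
static uint8_t HR[256], HG[256], HB[256], HA[256];

static void init_hash_tables(unsigned seed) {
    srand(seed);
    for (int i = 0; i < 256; i++) {
        HR[i] = (uint8_t)rand();
        HG[i] = (uint8_t)rand();
        HB[i] = (uint8_t)rand();
        HA[i] = (uint8_t)rand();
    }
}

static int color_hash(uint8_t r, uint8_t g, uint8_t b, uint8_t a) {
    return (HR[r] ^ HG[g] ^ HB[b] ^ HA[a]) % 64;
}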
Also, it's non-obvious that older values will be less useful. Going through the pixels in order like this doesn't preserve locality well, so older values may actually be closer to a given pixel than more recent ones.
...unless those images are dithered. Which may be a relatively common thing to do in small color spaces!
Hilbert curves have slightly better locality than Z-curves, but Z-Curves are trivial to produce.
If you have a N dimensional space over unsigned integers, then you can place each point on a one dimensional Z-Curve by interleaving their bits into a new integer.
So for any point (X, Y) in the X-Y coordinates of an image the Z-Curve coordinate is X_1,Y_1,X_2,Y_2...
Which can be achieved with a bunch of bit shifts and masks, (or a bunch of wires if you happen to use it in an FPGA ^^')
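The shift-and-mask version, for 16-bit coordinates (which coordinate lands in the even bits is just a convention):

#include <stdint.h>

/* Spread the 16 low bits of v so they occupy the even bit positions. */
static uint32_t part1by1(uint32_t v) {
    v &= 0x0000FFFF;
    v = (v | (v << 8)) & 0x00FF00FF;
    v = (v | (v << 4)) & 0x0F0F0F0F;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}

/* Interleave x and y into a 32-bit Z-curve (Morton) index. */
static uint32_t z_order(uint16_t x, uint16_t y) {
    return part1by1(y) | (part1by1(x) << 1);
}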
Locality is a good thing to exploit.
__m128i src = _mm_set_epi32(0, 0, x, y);            // low 64 bits now hold (x << 32) | y
__m128i res = _mm_clmulepi64_si128(src, src, 0);    // carry-less squaring spreads each bit of that qword to the even bit positions
uint64_t zorder = _mm_extract_epi32(res, 0)         // low 16 bits of y, spread to even positions
                | (_mm_extract_epi32(res, 2) << 1); // low 16 bits of x, spread and shifted to odd positions
If anybody else wants to know how this works:
The nice thing about this is that just by changing the tables, you can get things like z-curve, tiles, row-major, or column-major pixel orders (or even combinations), plus swizzled, interleaved, or planar layouts for the color channels.
* Swapping to z-order requires about 5 lines of changes if you ignore all the little edge cases like non power-of-two image sizes. Images used in testing may or may not be representative of anything.
Related to this method, it's sometimes used for texture mapping on GPUs.
I'm actually quite impressed how the resulting size is a little bit smaller than lz4hc (actually not even that far away from libpng), while the encoding speed is quite close to lz4, despite seemingly not having as many hacks and tweaks under the hood to get there. So the encoder is IMO doing amazingly well actually.
However, the decoding speed of liblz4 is off by roughly an order of magnitude. But again, liblz4 probably benefits from decades of tweaking & tuning to get there. It would be interesting to see how much performance could be gotten out of the qoi decoder.
Here are the totals I get for the "textures" benchmark samples:
          decode ms   encode ms   decode mpps   encode mpps   size kb
libpng:         2.2        32.4         58.67          3.94       160
stbi:           2.1        17.0         61.50          7.49       228
qoi:            0.7         0.7        191.50        170.54       181
lz4:            0.1         0.6       1226.06        206.40       258
lz4hc:          0.1        70.9       1029.26          1.80       200
          decode ms   encode ms   decode mpps   encode mpps   size kb
libpng:       131.9      2287.1         66.63          3.84      8648
stbi:         147.5      1177.1         59.55          7.46     12468
qoi:           56.3        56.5        156.13        155.60      9981
lz4:           14.3        53.1        614.53        165.50     18019
lz4hc:         13.9      1901.7        630.94          4.62     12699
Normal fast encoders use some kind of hash table or search to find matches, but encode the offset of the match in the source rather than in the table. QOI is encoding the offset into the table, which gives much shorter offsets but means the decoder has to maintain the table too.
(The slow PPM compressors etc do maintain the same table in both encoder and decoder, and have symmetrical encode/decode performance too. See http://www.mattmahoney.net/dc/text.html)
It's not really in the spirit of QOI to add the complexity, but I'd imagine allowing the encoder to specify how big the hash table is would be only a small tweak, and encoding literal blocks instead of pixel-by-pixel literals would improve handling of input that QOI can't compress.
I would be curious to know if planar helps or hinders. Perhaps even see what QOI makes of YUV etc. And I want to see 'heat maps' of encoded images, showing how cheap and expensive parts of them are, and which block type gets used. Yeah, going a bit beyond the spirit of QOI :D
And from looking at the code I'm a bit confused how 3-channel is handled for literals because the alpha still looks like it gets encoded. Would have to walk through the code to understand that a bit better. Is it going to deref off the end of the source RGB image? Etc.
(People interested in compression may be interested in the go-to forum for talking about it https://encode.su/threads/3753-QOI-(Quite-OK-Image-format)-l...)
Every pixel can use a different encoding; 95% of the encodings rely on the value of the previous pixel, or on the accumulated state of all previously processed pixels. The number of bytes per pixel and pixels per byte swing wildly.
SIMD optimization would basically require redesigning it from scratch.
Two other immediate avenues are multithreading, which I think could be quite effective for this algorithm, and GLSL, about which I have no opinion.
Decade, singular. LZ4 proper appeared around 2011. But the real reason it's faster is that it's not doing nearly as much work as QOI. LZ4 is just about as asymmetric as you can get.
u8 tag : 3; // b110
u8 dr : 5; // 5-bit red channel difference: -15..16
u8 dg : 4; // 4-bit green channel difference: -7.. 8
u8 db : 4; // 4-bit blue channel difference: -7.. 8
Though I wouldn't be myself if I didn't have a few nitpicks ;)
- You are mixing lossy and lossless compression in your rant. And you are mixing containers with formats. The thing about lossy compression is that it is by design stupidly slow and complex, in order to produce the smallest possible result with the best perceived quality. You compress once and trade time for both image quality and compression ratio. Lossless is not that bad, although your simple solution is, well, simple :)
- I feel like the benchmark suite is lacking. For a better overview you should probably include libpng results at both the max and the lowest compression levels. Lossless modes of AVIF and WEBP would be nice (you could also throw in a similar project to yours, like https://github.com/catid/Zpng). Not saying the benchmark is bad, but IMHO it doesn't paint the full picture. From a quick test I got significantly better compression out of libpng, of course at the expense of time, but you didn't configure libpng for speed either. So we have some results, but they are not really representative IMHO.
- I understand it is a pet project, but don't forget about the importance of image metadata, like colorspace and so on.
- Also keep in mind that compression is easy at the beginning; the tricky part is how to squeeze out more.
Here's example output:
image: 1502 x 1800 (10560 KB)

            encode(ms)   decode(ms)   size(KB)
qoi:              31.2         21.6       2452
qoi+zstd:         46.7         19.3       1832
zstd:             47.4          8.7       2742
lz4:              14.1          3.0       4108
png:              92.4         18.0       2795
zpng:             24.1         13.3       1394
jpg:               2.7          1.8        517   <-- lossy
webp:            199.6         19.1        227   <-- lossy
It is so interesting that if the image has few colors, method #2 alone (an index into a previously seen pixel) encodes the whole image as an indexed colour space (with the hash table becoming the palette).
I've already seen some comments recommending adding some vertical locality by tiling, Hilbert order, or Z order. I have yet another vertical locality suggestion to try:
While encoding/decoding, keep a ring buffer the size of one full line of the image, storing each encoded/decoded pixel there. Then add a variation of method #3 (the difference to the previous pixel) which instead encodes the difference to the pixel above the current one (the last pixel from the ring buffer). Maybe, due to the limited "tag" space, it would not be worth it to implement it for DIFF8, but only for DIFF16 (taking one bit from red for the additional tag bit) and DIFF24 (I'm not sure which bit I would steal there).
Edit: Also, when thinking about video encoding in the future you may want to keep a ring buffer fitting the full last frame of video. And then temporal locality is probably even more important, so then you may want to change the tags to encode more commands relating to the pixel from the previous frame.
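A sketch of the bookkeeping that ring buffer needs; the actual tag encoding is left out, this is just the "pixel above" lookup:

#include <stdint.h>

typedef struct { uint8_t r, g, b, a; } px_t;

/* Scanline ring buffer: the caller allocates row[width]; before the current
   pixel is written into its slot, that slot still holds the pixel one row
   above, which a hypothetical "diff to pixel above" tag could reference. */
typedef struct {
    px_t *row;    /* ring buffer holding the previous `width` pixels */
    int   width;
    int   pos;    /* current x position within the row */
} row_ring;

static px_t pixel_above(const row_ring *r) {
    return r->row[r->pos];   /* not yet overwritten: still the pixel one row up */
}

static void push_pixel(row_ring *r, px_t px) {
    r->row[r->pos] = px;
    r->pos = (r->pos + 1) % r->width;
}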
I quite like this approach of defining a few compression methods, seeing which one works best, and applying that one. I used a similar strategy for time series compression where I would take a chunk of data, apply a few different compression algorithms to it (RLE, delta RLE, etc.) and just keep whichever one was smallest. Of course here it is done on a per-pixel basis, and I used arbitrary-size chunks. The latter is trivial to parallelize. The same could be applied here though: split the image in half and compress both parts independently.
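The per-chunk "keep whichever is smallest" loop is roughly this much code; the encoder callbacks (RLE, delta-RLE, ...) are placeholders supplied by the caller, not real library functions:

#include <stddef.h>
#include <string.h>

typedef size_t (*encoder_fn)(const unsigned char *in, size_t in_len,
                             unsigned char *out, size_t out_cap);

static size_t encode_chunk(const unsigned char *in, size_t in_len,
                           unsigned char *out, unsigned char *scratch,
                           size_t cap, encoder_fn *methods, int n_methods,
                           int *chosen) {
    size_t best = (size_t)-1;
    for (int i = 0; i < n_methods; i++) {
        size_t sz = methods[i](in, in_len, scratch, cap);
        if (sz < best) {
            best = sz;
            *chosen = i;                 /* method id goes into the chunk header */
            memcpy(out, scratch, sz);
        }
    }
    return best;
}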
The interesting thing is that it comes close to PNG.
Perception is pretty logarithmic as usual, so for a 14-bit readout, half the values represent the brightest stop and a little bit more, while the other half holds the other 12+ stops.
There are a few formats that use a curve and less bits. They do become lossy and doing dithering on decompress is useful to avoid banding.
The Nikon one you mention was only used very early and is decoded by decode_8bit_wtable() in that file. It's just looking up the 8 bit value in the table and then adding some randomness to prevent the banding.
          decode ms   encode ms   decode mpps   encode mpps   size kb
libpng:       148.4      3995.5         55.88          2.08     12223
stbi:         161.0      1858.3         51.50          4.46     19199
qoi:           60.8        95.6        136.49         86.78     12868
In any case the results are impressive enough that this will 100% be used in projects I work on!
Many thanks to the author for their work <3
Interesting though that FLIF 0.4 was released 3 days ago. I haven't checked out that new release (and won't), but the previous one wasn't great code and was a pain to use (ended up using its CLI instead of the lib). We ended up going back to PNG because of interoperability and speed. I haven't looked at libjxl's API yet.
Edit: 8605KB in 11.5s with libjxl, so it's not actually better than FLIF 0.4 (7368KB in 23.9s)
pngcrush -brute 10355kb
cjxl -q 100 -e 9 8413kb
I have an honest question about the source code. It's been a long time since I coded anything meaningful in C, but I remember that header files did not have that much code in them, while here I see most of the code is in the header file. Why is this the case? Inline compilation? What are the advantages?
STB-style libraries skip the implementation already in the preprocessor and compile the implementation only once for the whole project.
Unlike C++ header-only libraries, C libraries like this still need their implementation included exactly once.
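Concretely, usage looks like this; QOI_IMPLEMENTATION is the macro qoi.h checks for, if I'm reading it right:

/* qoi_impl.c -- the one translation unit that compiles the implementation */
#define QOI_IMPLEMENTATION
#include "qoi.h"

/* everywhere else: include the header normally and get only the declarations */
#include "qoi.h"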
I understand the benefits of keeping the implementation in a single source file (simplifies integrating in existing build systems, interprocedural optimizations, etc.) but combining the source and header files too makes them awkward to use, with little benefit beyond being able to claim "our implementation is only 1 file!" which is about as impressive as saying "our implementation is only 1 line!" if you've just deleted all the whitespace.
C doesn't have a packaging format or dependency manager, so having to keep track of two files would be pretty cumbersome.
E.g. one benefit is that you can configure the implementation via preprocessor defines without the defines leaking into the build system. Just add the config defines in front of the implementation include.
It's also trivial to write "extension headers" which need to directly peek into the (otherwise private) implementation of the base library. Just include the extension header implementation after the base library implementation.
PS: but also see: https://github.com/nothings/stb#why-single-file-headers
Too vague. How is it better, concretely?
> E.g. one benefit is that you can configure the implementation via preprocessor defines without the defines leaking into the build system
That's not a benefit. How is this:

#define LIB_KNOB 123
#define LIB_IMPLEMENTATION
#include "lib.h"

Any better than the two-file version:

#define LIB_KNOB 123
#include "lib.c"
> It's also trivial to write "extension headers" which need to directly peek into the (otherwise private) implementation of the base library. Just include the extension header implementation after the base library implementation.
I'm not sure what this means. Can you give an example? Is this being used in the above library?
> PS: but also see: https://github.com/nothings/stb#why-single-file-headers
OK, that only offers this explanation:
> Why not two files, one a header and one an implementation? The difference between 10 files and 9 files is not a big deal, but the difference between 2 files and 1 file is a big deal. You don't need to zip or tar the files up, you don't have to remember to attach two files, etc.
This is kind of nonsense to me. The difference between 2 files and 1 file is barely relevant, and doesn't weigh against the awkwardness of having no clear distinction between declarations and definitions, and having to pull in definitions by setting a magic preprocessor define.
> Any better than the two-file version:
> #define LIB_KNOB 123
> #include "lib.c"
> Too vague. How is it better, concretely?
I can only assume from that question that you haven't had the "pleasure" to work much with C/C++ build systems yet, at least when it comes to integrating 3rd party libraries.
Re. extension headers:
> no clear distinction between declarations and definitions
STB-style libraries have this clear distinction, C++ style header-only libraries don't.
The third source file is what the library authors suggested; I'd just create a separate Makefile target for the implementation, which is then linked into the binary. (Yes, that's an extra target, but that costs very little and allows you to avoid unnecessary rebuilds.)
Why don't you give me a concrete counterexample where the single-header-setup makes things simpler? If it's so obviously beneficial it shouldn't be so hard to come up with a concrete example.
> #define IMPLEMENTATION
> #include "base_library.h"
> #include "extension_library.h"
Thanks for explaining, but that doesn't really motivate having a single file, since you could just as easily do:
> ... without having to add a separate 'private but actually public API'.
Personally, I'd say this is exactly the right way to model this, but even if you disagree: it sounds like this simplifies things mainly for the library/extension authors, and not the library users, because if the extension author forward-declares the implementation details they depend on, there is no additional file to manage for the users either.
To you. To me doing Rust stuff, it's been quite helpful to have single-file C programs; it makes it a lot easier to compile and link things.
See for example:
That seems unrelated to this idea that C code using it is equally reasonable. That's a developer choice.
When I was writing template SDKs, the headers were pretty much the entirety of the library.
I find that odd. How is that significantly better than dropping 2 files into a project and #including the header? A good build system should just compile all the C or C++ files in a given place, so nothing needs to be changed other than dropping the files in and #include where called from.
Whether or not a "good build system" should handle it, the fact that single-file libraries are much preferred these days should demonstrate most people don't have such a build system
* What a translation unit is, and how compiling a translation unit to an object file works
* What a linker is, and how the linker combines object files into a library/executable
* What make/autotools/cmake are, and how they work
* What FHS is, and how it standardizes the lib/bin/include paths that C/C++ build tools read from/install to
And so on. Any C/C++ course that goes beyond a “Hello world” program without explaining these concepts in detail does its students a disservice.
but also ask "why are the tables of content prefixing the contents of all my books?"
it's taken ages, but with this model, c lib authoring has finally taken something that was a core strength of pascal: single-file modularity.
though in theory there's no difference between theory and practice, in practice, there is. and this kind of packaging makes a lot of sense for a lib like this.
2) Most developers just want to compile to a single binary. Any developer who for some reason needs separately compiled objects should be able to quickly achieve that from a single-file library
I was blown away by its simplicity and power.
You can download almost any C/C++ library from GitHub then add two lines to your cmake file. It will then compile and link with your project.
Cmake can generate project files for Visual Studio, makefiles, and countless other formats. Better yet, you can open a cmake folder directly in Visual Studio. No need to generate a project :)
This comes as no surprise to anybody, I'm sure, but cmake is great.
It was tricky to find good information about it though. It has been around forever, but the nicer features were added recently it seems.
This bit made me laugh out loud :D
If something is marked as copyrighted, it may be trivial to remove the mark, but at least you had to remove the mark, and that's evidence of intent rather than ignorance.
Is that evidence of intent still?
Your toolchain should set the bit last thing before distribution, I'd expect.
If you're talking about a toolchain on the consumption side after distribution, well, consumers shouldn't be modifying content, is what I'd expect the position to be. No derivative works and all that.
Yeah, it's funny, until you actually go read the MPEG spec. documentation.
Then any and all inklings of laughter will very quickly evaporate.
I'm currently tinkering around with mp3s and wondered how many "useless" bits are in each frame.
I wonder if they are ever used to trace distributors of pirated music, similar to yellow dots in printers.
People think of technology as a straight line moving forward. But it's perhaps better to imagine like species in nature. It's just everything getting bigger. The smallest ant or bacteria evolves to better adapt to its niche, as the largest elephant or whale will develop adaptations (sometimes getting physically larger, depending on the environmental changes). The rule is that many scales have a role or niche. I think technology is similar.
But what shocks me is: how did absolutely no-one apply such a naive RLE in the 30 years of internet image formats?
Is this a case of "man reinvents wheel again because why do research" or is this exceptional?
we applied rle encoding to time-series and triac control systems 43 years ago after getting some guidance from a mainframe guy - but since RLE is so practical that we even use it to give directions to people on the street, it's probably so old that it's even found its way into the way our genes are encoded. :D
Off the top of my head, both PCX and TGA use similarly naive RLE as their compression method.
And while they also offer fancier compression methods, BMP, TIFF, and OpenEXR all offer RLE modes as well.
If each line were encoded independently they could be en- and decoded in parallel (once you add a skip instruction to tell where the next line starts, this does preclude streaming of intra-line output though). The hit to compression ratio should be small as long as images are wide enough for the color buffer reset to not matter.
This is neat. I wonder if the author would be willing to write a Kaitai Struct definition for it.
Something else interesting: QOI is 1, 1, 2 letters off from PNG. I'm quite certain this is an accident but it's interesting nonetheless.
> so instead of hashing, you could use direct lookup
However, I don't think that part is going to work. See the code using that table: https://github.com/phoboslab/qoi/blob/master/qoi.h#L324-L328
SIMD registers aren’t indexable (at least not on AMD64), the register needs to be known to the compiler.
Lanes within each register aren't indexable either. The insert and extract instructions encode the lane index as an immediate in the instruction. There are workarounds for this one, like abusing vpshufb or vpermd, but with that amount of overhead I doubt SIMD will deliver any profit at all.
Funny meta comment: when I read "offering a 20x-50x speedup in compression and 3x-4x speedup in decompression" I caught myself reacting with disappointment to the "only" 3x-4x speed up. This is funny because 3x improvement is enormous; it's just small compared to 30x. There's a lesson somewhere in there about sharing stats together to achieve an emotional effect.
One thing I did recognize after my last trip into the rabbit hole - JPEG can be the fastest format by orders of magnitude if you are using an appropriate encoder. The project mentions 20x-50x speedups over PNG which are commendable, but once you throw LibJpegTurbo into the mix (it does use SIMD), you will find a substantially different equation on your hands. I can do 1080p 4:4:4 JPEG encodes in ~3ms on my workstation today.
I am not saying you should always use jpeg (some people like to keep all their original pixels around), but if you want to go really really fast you should consider it. The latency is so low with this approach that I have been looking at the feasibility of building a streaming gaming platform around it. Current status of that effort is somewhere between "holy shit it actually works" and "maybe?"
u8 tag : 4; // b1110
u8 dr : 5; // 5-bit red channel difference: -15..16
u8 dg : 5; // 5-bit green channel difference: -15..16
u8 db : 5; // 5-bit blue channel difference: -15..16
u8 da : 5; // 5-bit alpha channel difference: -15..16
int intermediate = byte_or_word << N;            // N chosen to put the five bits in the highest positions of the int
int value = intermediate >> (bits_per_int - 5);  // shifting back down sign-extends the 5-bit field

Instead of masking, rely on the arithmetic shift right to keep the sign bit.
> SIMD acceleration for QOI would also be cool but (from my very limited knowledge about some SIMD instructions on ARM), the format doesn't seem to be well suited for it. Maybe someone with a bit more experience can shed some light?
I don’t know about SIMD encoding, but for decoding at least, the core problem is to make some unit of the image (pixels, scanlines, blocks, etc) independently addressable (readable). The current format is complicated due to pixels of different sizes; there’s no way to know where data for pixel 1000 starts without reading all preceding 999 pixels. You could build a prefix-sum array of pixel indices, but that will fully undermine your compression.
One question might be how well does QOI work on independent scan lines or blocks? Is the compression still pretty good if the look-back window is limited in size and restarts every once in a while?
To make a GPU capable format, the very first thing you would probably need to do is separate the tag bits from all the other data, so a thread could identify the pixel type by direct lookup into the tag buffer. It then needs to know where to go to find the remaining data, so you’d probably need at least one byte of data that could be used to reliably locate the rest of the data. I don’t see an easy way to make this happen for separate pixels, but it might not be bad if the image is encoded in blocks or scanlines.
One other thought:
To make it so that compression and decompression can be multithreaded, you might want to 'reset' the stream every N rows. (i.e. break any RLE runs, start from a colour literal). This would allow a bunch of threads to start at different places in the image, in parallel. There would be some small cost to compression ratio but might be worth it.
I'll probably investigate "resetting" the stream to allow for multithreaded en-/decode when I try to roll this into a video codec.
Modern lossless video codecs don't care that much about saving space, since you're burning gigabytes per minute anyway; the key is that the compression and decompression are as transparent as possible, to reduce the bandwidth to/from the storage medium. A good lossless codec can be used to scrub through and edit on a video editor's timeline, so decomp performance is what's most important.
Also, any such lossless video codec is going to need more than 8 bits per component; most real-world footage coming off of phones and cameras is 10-bit Rec.2020. If it is too difficult to position QOI for real-world sources, you can certainly keep it 8-bit and market it as an animation codec or similar; just keep it in mind.
It would be interesting to know which strategy is used for each pixel. Have you tried making maps of that with four different colours? Particularly interesting would be how much the cache is used and also how often it replaces a colour that it could have used later. Maybe there's a better colour hash. BTW you might want to change "(really just r^g^b^a)" to "(really just (r^g^b^a)%64)".
I was thinking this would combine well with progressive rendering (i.e. store pixels in mipmap-style order) so the cache is warmed up with spread of pixels across the image rather than just scanline order. That doesn’t make it parallelizable in the way tiling does, though, hmm.
Another tweak that would probably help is having the file specify a rotation (or even a full mapping table) for each component of the hash, so a clever encoder can pick values that minimise collisions. (A fast encoder can just use all-zeroes to get the same behaviour as before.)
00xxxxxx - copy (x+1)-th last EXPLICITLY ENCODED pixel (i.e. ignoring repeats)
010xxxxx - repeat the last pixel x+1 times
011xxxxx xxxxxxxx - repeat the last pixel x+33 times
10rrggbb - copy the last pixel and adjust RGB by (r-1, g-1, b-1)
110rrrrr ggggbbbb - copy the last pixel and adjust RGB by (r-15, g-7, b-7)
1110rrrr rgggggbb bbbaaaaa - copy the last pixel and adjust RGBA by (r-15, g-15, b-15, a-15)
1111RGBA [rrrrrrrr] [gggggggg] [bbbbbbbb] [aaaaaaaa] - copy the last pixel and replace each RGBA channel whose bit flag is set
E.g. some file formats may be row-major, others column-major, some block-based (like JPEG) or a fancy interleaving scheme (like Adam7). Some might store the channels interleaved, others separate the channels. If any of these choices doesn't match the desired output format, it breaks streaming.
So you can't do such a conversion in a streaming fashion, but there is no need to load all data in memory either.
Because the Hilbert/Z-order curves are defined recursively, algorithms for converting between the curve and row/column order are "cache-oblivious", in that they don't need to take an explicit block size. You write it once, and it performs well on any system regardless the CPU cache size and/or page size.
A fix would be to change:

110rrrrr ggggbbbb - copy the last pixel and adjust RGB by (r-15, g-7, b-7)

to:

1100rrrr ggggbbbb - copy the last pixel and adjust RGB by (r-7, g-7, b-7)
One idea would be to start a predicted color (calculate the delta of the previous two pixels, apply that, then specify a delta to that). Another would be to encode the delta in YUV or some other color space, and then experiment with the best balance of bits between those channels.
Perhaps it would be better to steal bits from RUN16 instead; I somewhat doubt its usefulness.
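A minimal sketch of the first idea above (extrapolate from the previous two pixels, then only store the correction), assuming deltas wrap mod 256 like QOI's do:

#include <stdint.h>

typedef struct { uint8_t r, g, b; } px_t;

/* Predict the next pixel by continuing the trend of the previous two. */
static px_t predict(px_t prev, px_t prev2) {
    px_t p;
    p.r = (uint8_t)(prev.r + (prev.r - prev2.r));   /* wraps mod 256 */
    p.g = (uint8_t)(prev.g + (prev.g - prev2.g));
    p.b = (uint8_t)(prev.b + (prev.b - prev2.b));
    return p;
}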
Not exactly. It's "copy the last pixel color for which (r^g^b^a)&63 == x".
For each scanline write one byte for the chosen filter method, then $width pixels. There are five filter types. In type 0, write the pixels unmodified. In type 1, write the difference to the previous pixel. In type 2, write the difference to the above pixel. In type 3, write the difference to the average of the above and left pixel. In type 4, write the difference to the Paeth predictor (a simple function calculated from the above, left and left above pixel). Then compress everything with deflate.
Obviously by modern standards that isn't performant, not least because deflate sucks. But it's simple, can be implemented in a couple of lines of C (if you pull in deflate as a dependency), and if you don't do exhaustive searches for the best compression it can certainly run in O(n).
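For reference, the only non-obvious piece in that list, the Paeth predictor, is this small:

#include <stdlib.h>

/* The Paeth predictor from the PNG spec: predict from left (a), above (b),
   and upper-left (c) by picking whichever is closest to a + b - c. */
static int paeth(int a, int b, int c) {
    int p  = a + b - c;
    int pa = abs(p - a), pb = abs(p - b), pc = abs(p - c);
    if (pa <= pb && pa <= pc) return a;
    if (pb <= pc) return b;
    return c;
}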
Did you try fuzzing your implementation (e.g. with AFL++ or libfuzzer)? Codecs and especially decoders written in C that handle untrusted binary data are quite worthwhile targets for fuzzing.
Currently Sony has two options. Uncompressed raw that is huge, or lossy compressed raw that is half the size. They probably chose those options because throughput when shooting quickly matters a lot, so something like PNG would slow down shooting too much.
I imagine a blazingly fast, halfway decent compression algorithm would be great. It might even speed up shooting by needing to write fewer bytes. But in general, people taking star pictures will love having compression without it being lossy.
For reference an uncompressed raw file for 44MP is about 90MB. That starts filling up drive space rather fast.
Also, this format was designed to be as simple as possible: it is byte-aligned and only handles 8-bit RGBA. You would need a different design for 12/14-bit RGB or RGBG (Bayer).
The code is interesting, in that it iterates over the output array, not the input. This helps prevent write buffer overflows. It has read buffer overflows, though. You can craft a compressed file which, when decompressed, will contain info from elsewhere in memory.
Amusingly, if you run out of input too soon, the last pixel is replicated to fill the buffer. So you don't get random memory contents in the output, which would be more exploitable.
One trap to be aware of when profiling image codecs is not knowing the provenance of the test images. Many images will have gone through one or more quantization steps. This can lead to impressive compression ratios, but if you compare with the results on raw, unquantized imagery, it's less of a slam dunk.
That said, PNG is a really low bar to clear, both in terms of speed and compression ratio. Even really image-naive compression algorithms tend to be within a factor of 2 in resulting compressed size. You tend to get diminishing returns. 10% vs 20% is twice as efficient, but starting with a 10MB file, 1 vs 2 MB file sizes isn't a huge delta. (yes I know there are domains where this matters, don't @ me).
My understanding is that this is one of the techniques that PNG uses (though not zstd) for its compression too, along with, I think, some different strides down the image and some other stuff, but I don't fully understand everything it tries. That's what optipng and friends play with to find the best parameter settings for decomposing the image for compression.
If you look at what JPEG and similar codecs do, they essentially take a DC offset out of the image, because that difference greatly reduces the range of values you have to encode, which means you need fewer bits to do so. Combine that with range or arithmetic coding and you can get decent efficiency for very little work. Then with LZW, zstd, etc. you can get lots of other patterns taken out and compressed. Lossy codecs try to take out the stuff we won't see with our eyes, so the set of values to be encoded needs fewer bits, and it also makes more patterns appear that can be exploited.
I think part of the reason png is so slow is that it empirically tests the filters to find the best one.
> Compression is further improved by choosing filter types adaptively on a line-by-line basis.
I'd be curious to unpack a png and see the statistics of how much each kind is used.
The normal argument for 'slightly compressed' images is that they can be decompressed with very few CPU cycles, but the reality is that with modern platforms lossless image en/decoding is done in hardware and therefore pretty much 'free', as long as you pick lossless h264, lossless webp or lossless hevc. All of those things will also give much smaller file sizes.
In the few places where hardware decode isn't a thing (eg. the memory-constrained 8 bit microcontroller decoding the boot screen for your IoT toaster), there usually isn't a need for losslessness.
It's useful because the size savings apply in memory too, since decompression is so cheap it can be done on the fly.
It has to be lossy because you want a guaranteed fixed-rate compression ratio (eg 4x) and there’s no way to do that losslessly.
Although the worst part is that the documentation is written in engineering English, so it's quite hard to understand. MPEG2 is actually not as bad as you'd imagine. MPEG4 is.