And you can still dither the image without completely destroying compression. You just have to use an ordered dither.
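For illustration, here is a minimal sketch of ordered dithering with a 4x4 Bayer matrix: a fixed, position-dependent bias is added to each pixel before the nearest-palette-color lookup, so the pattern is deterministic and tiles, which is what keeps compression intact. nearest_palette_index() is a hypothetical helper, not any particular library's API:

    #include <stdint.h>

    extern uint8_t nearest_palette_index(int r, int g, int b); /* assumed helper */

    static const int bayer4[4][4] = {
        {  0,  8,  2, 10 },
        { 12,  4, 14,  6 },
        {  3, 11,  1,  9 },
        { 15,  7, 13,  5 },
    };

    static int clamp255(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }

    void dither_ordered(const uint8_t *rgb, uint8_t *out_idx,
                        int w, int h, int spread) /* spread: e.g. 32 */
    {
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                /* same bias at the same (x,y) in every frame */
                int bias = bayer4[y & 3][x & 3] * spread / 16 - spread / 2;
                const uint8_t *p = rgb + 3 * (y * w + x);
                out_idx[y * w + x] = nearest_palette_index(
                    clamp255(p[0] + bias),
                    clamp255(p[1] + bias),
                    clamp255(p[2] + bias));
            }
        }
    }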
 Xiaolin Wu, "Color Quantization by Dynamic Programming and Principal Analysis", ACM Transactions on Graphics 11(4): 348-372, 1992. https://dl.acm.org/citation.cfm?id=146475
I've tested median-cut variants that can cut at arbitrary angles, and variants that cut out spheres, and it didn't make much difference.
The choice of which box to cut, and where to cut it, matters more. Wu's method is nice here because it exactly measures the variance of the sets after a split, whereas median cut only estimates it.
However, Wu's method needs lookup tables covering the whole color (hyper)cube, which requires either a large (and for RGBA a frankly ridiculous) amount of RAM or reduced precision, and losing bits of input is far worse than a suboptimal division.
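To put rough numbers on that trade-off (assuming something like the five moment tables of 8-byte entries used in Wu's reference implementation; the exact table count varies):

    256^3 cells x 5 tables x 8 bytes ≈ 0.7 GB   (full 8-bit RGB)
    256^4 cells x 5 tables x 8 bytes ≈ 172 GB   (full 8-bit RGBA)
     33^3 cells x 5 tables x 8 bytes ≈ 1.4 MB   (the usual 5-bit pre-quantization)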
The biggest wins I've found (and implemented in libimagequant) were:
• Voronoi iteration after generating the palette (like the final step of LBG quantization). This shifts the axis-aligned subdivisions toward a local optimum (I've tried applying it after every subdivision, but it didn't matter; once as a postprocessing step is enough!). See the sketch after this list.
• Adjusting weights/counts in the histogram based on quality feedback measured from the remapped image. If you're aiming to minimize MSE, that's not too expensive, and it's similar to gradient descent.
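A minimal sketch of one such Voronoi (Lloyd) iteration over a weighted histogram, with illustrative types rather than libimagequant's actual API: assign every histogram entry to its nearest palette color, then move each palette color to the weighted mean of its cluster. Repeating until the palette stops moving is exactly the k-means refinement loop:

    #include <string.h>

    typedef struct { double r, g, b, weight; } HistEntry;
    typedef struct { double r, g, b; } Color;

    static double dist2(Color c, HistEntry h) {
        double dr = c.r - h.r, dg = c.g - h.g, db = c.b - h.b;
        return dr*dr + dg*dg + db*db;
    }

    void voronoi_iterate(Color *pal, int ncolors,        /* ncolors <= 256 */
                         const HistEntry *hist, int nhist)
    {
        double sum[256][4];  /* r,g,b accumulators + total weight */
        memset(sum, 0, sizeof(sum));

        for (int i = 0; i < nhist; i++) {
            int best = 0;
            double bestd = dist2(pal[0], hist[i]);
            for (int j = 1; j < ncolors; j++) {
                double d = dist2(pal[j], hist[i]);
                if (d < bestd) { bestd = d; best = j; }
            }
            double w = hist[i].weight;
            sum[best][0] += hist[i].r * w;
            sum[best][1] += hist[i].g * w;
            sum[best][2] += hist[i].b * w;
            sum[best][3] += w;
        }
        for (int j = 0; j < ncolors; j++) {
            if (sum[j][3] > 0) {  /* leave empty clusters where they are */
                pal[j].r = sum[j][0] / sum[j][3];
                pal[j].g = sum[j][1] / sum[j][3];
                pal[j].b = sum[j][2] / sum[j][3];
            }
        }
    }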
You're talking about the 1991 method. The 1992 paper I cited explicitly claims, as one of its advantages, that it does not require lookup tables (and thus avoids the usual step of pre-quantizing down to 5-bit color to get a reasonably sized table).
Your two big wins sound an awful lot like k-means. I'd believe they are responsible for the bulk of the improvement (they're also the bulk of the computation), but they mean you no longer have an axis-aligned partitioning, which was my point.
The problem is that k-means is notoriously sensitive to the initial clustering. For unsupervised learning, I'd want to do as good a job there as possible, and applying PCA means you are picking the split direction that exactly minimizes variance in the orthogonal directions. Random restarts are also a nice way to escape local minima, but they're even more expensive computationally.
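For concreteness, a sketch of how a PCA-based split direction might be found: build the 3x3 covariance matrix of the colors in a box and extract its principal eigenvector by power iteration, then cut the box with a plane perpendicular to that axis. The types and names here are illustrative, not from any of the implementations discussed:

    #include <math.h>

    typedef struct { double r, g, b, weight; } HistEntry;

    /* assumes a non-empty box (total weight > 0) */
    void principal_axis(const HistEntry *hist, int n, double axis[3])
    {
        double mean[3] = {0}, cov[3][3] = {{0}}, wsum = 0;

        for (int i = 0; i < n; i++) {
            wsum += hist[i].weight;
            mean[0] += hist[i].r * hist[i].weight;
            mean[1] += hist[i].g * hist[i].weight;
            mean[2] += hist[i].b * hist[i].weight;
        }
        for (int k = 0; k < 3; k++) mean[k] /= wsum;

        for (int i = 0; i < n; i++) {
            double d[3] = { hist[i].r - mean[0],
                            hist[i].g - mean[1],
                            hist[i].b - mean[2] };
            for (int a = 0; a < 3; a++)
                for (int b = 0; b < 3; b++)
                    cov[a][b] += hist[i].weight * d[a] * d[b];
        }

        /* Power iteration: repeatedly multiplying a guess by the
           covariance matrix converges to the eigenvector of the largest
           eigenvalue, i.e. the direction of maximum variance. */
        axis[0] = 1; axis[1] = 1; axis[2] = 1;
        for (int it = 0; it < 20; it++) {
            double next[3] = {0};
            for (int a = 0; a < 3; a++)
                for (int b = 0; b < 3; b++)
                    next[a] += cov[a][b] * axis[b];
            double len = sqrt(next[0]*next[0] + next[1]*next[1] + next[2]*next[2]);
            if (len < 1e-12) break;  /* degenerate box: all one color */
            for (int k = 0; k < 3; k++) axis[k] = next[k] / len;
        }
    }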
EDIT: Is it this color quantization algorithm that is linked on Wu's own webpage? http://www.ece.mcmaster.ca/~xwu/cq.c
Concerning the dithering: you can disable it (dither=none). And indeed it kills the compression, mainly when using animation; that's exactly why I added the diff_mode option (see the end of the post).
pngquant's method is superior to octree quantization:
• it uses Voronoi iteration to reach a local minimum and minimize the effects of axis-aligned subdivisions
• it uses a sort of gradient descent to optimize the subdivisions
• edges and noise are, to some degree, factored into the histogram weights
• it does not need to throw away any bits of precision (most implementations naively discard 2-3 bits, a holdover from RAM limits in the '80s and '90s); see the sketch after this list
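One way to keep full precision without a dense table is to hash each packed RGBA value into an open-addressed table sized for the number of distinct colors actually present, which for real images is far below 2^32. A sketch, illustrative only and not libimagequant's actual data structure:

    #include <stdint.h>

    #define TABLE_SIZE (1 << 20)  /* must be a power of two */

    typedef struct { uint32_t rgba; uint32_t count; } Slot;

    /* table must be zero-initialized by the caller, e.g. with calloc */
    void hist_add(Slot *table, uint32_t rgba)
    {
        uint32_t h = rgba * 2654435761u;  /* Knuth multiplicative hash */
        for (uint32_t i = 0; i < TABLE_SIZE; i++) {
            Slot *s = &table[(h + i) & (TABLE_SIZE - 1)];
            if (s->count == 0 || s->rgba == rgba) {
                s->rgba = rgba;
                s->count++;
                return;
            }
        }
        /* table full: a real implementation would grow and rehash */
    }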
Might it also have something to do with earlier VGA DACs having 6 bits per color channel? I seem to remember that newer video cards required setting a separate register to enable eight-bit-per-channel palette entries.
Edit: never mind, the discussion of Wu's algorithm in another thread, specifically https://news.ycombinator.com/item?id=9216252, makes the RAM limitation clear.
Not as much as error diffusion, if you use custom-palette ordered dithering: since the dither pattern stays the same from frame to frame, the transparent-pixel optimization still applies, unlike with error-diffusion dithering. A sketch of that optimization follows below.
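A minimal sketch of the transparent-pixel optimization (presumably what the diff_mode option mentioned above is doing, though that's an assumption): any pixel whose palette index is unchanged from the previous frame is replaced by the transparent index, giving the encoder long runs of a single value. Ordered dithering keeps indices stable between frames, so far more pixels qualify than with error diffusion:

    #include <stdint.h>

    void diff_frame(const uint8_t *prev, const uint8_t *cur,
                    uint8_t *out, int npixels, uint8_t transparent_idx)
    {
        for (int i = 0; i < npixels; i++)
            out[i] = (cur[i] == prev[i]) ? transparent_idx : cur[i];
    }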
It's what Photoshop uses to create optimized GIFs. Their particular approach is patented, but alternative algorithms exist, e.g. http://bisqwit.iki.fi/story/howto/dither/jy/
this image is actually a pretty poor example for quantization; it's too easy. full-color gradients give a much better indicator of behavior, quality and performance.
if you really are interested in different algorithms, check this out: http://www.codeproject.com/Articles/66341/A-Simple-Yet-Quite...
error-diffusion dithering explained (and kernels): http://www.tannerhelland.com/4660/dithering-eleven-algorithm...
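For reference, the classic kernel from that list is Floyd-Steinberg. A minimal single-channel sketch, where nearest_palette_value() is a hypothetical helper:

    #include <stdint.h>

    extern int nearest_palette_value(int v);  /* assumed helper */

    /* Quantize each pixel, then push the rounding error onto unvisited
       neighbors with the 7/16, 3/16, 5/16, 1/16 kernel. */
    void fs_dither_gray(float *img, uint8_t *out, int w, int h)
    {
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                float old = img[y * w + x];
                int q = nearest_palette_value((int)(old + 0.5f));
                out[y * w + x] = (uint8_t)q;
                float err = old - q;
                if (x + 1 < w) img[y * w + x + 1] += err * 7 / 16;
                if (y + 1 < h) {
                    if (x > 0)     img[(y + 1) * w + x - 1] += err * 3 / 16;
                                   img[(y + 1) * w + x]     += err * 5 / 16;
                    if (x + 1 < w) img[(y + 1) * w + x + 1] += err * 1 / 16;
                }
            }
        }
    }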
I'm actually in the process of porting it to js using typed arrays and possibly WebGL/GLSL. You can compile it pretty easily to asm.js via Emscripten for max pure-js perf, but I don't know how clean its API/interface will be, or whether it will leak globals, given how the code is structured.