
Boosting zopfli performance - jsnell
http://roartindon.blogspot.com/2016/04/boosting-zopfli-performance.html
======
lifthrasiir
> Now that I have a 32-bit floating point representation to compare to, I can
> go one step further: Since mincostaddcostj is always greater than or equal
> to 0, and because non-negative floats are monotonically ascending when the
> bits are interpreted as an integer, I can use an _integer comparison_ to do
> the check!

Good points. It sounds like exactly the kind of case where we could benefit
from specialized "non-negative" or "non-NaN" types, implemented either in a
library (with limited operator overloading) or at the compiler level (via
static range analysis).

------
andrepd
Very nice work. I just wish that compilers did some of these things
automatically. Seems like even with modern optimising compilers it's still up
to the programmer to do a lot of the work manually.

~~~
jonstewart
That has been my experience. The conventional wisdom is often "don't worry
about optimization, the compiler will do a better job of it than you," and
while it is true that the compiler will do many things I wouldn't have thought
to do, my experience has been that one can still often improve on the
compiler's results through careful micro-optimization, as illustrated in TFA.

------
hannob
Very interesting. Unfortunately he doesn't say whether he has submitted these
improvements to the upstream developers so that other users of zopfli will
benefit from them.

~~~
jtlim
Hi, author of the article here.

Yes, talking to Lode (from Google) about getting the changes into the main
development branch.

------
adrianN
Is

    if (costs[j+k] - costs[j] <= mincost)

really the same as

    if (costs[j+k] <= mincost + costs[j])

even with floating point math?

~~~
robryk
My first guess is that it should be (because addition on FP gives the nearest
representable value), but I'd also guess that any small differences here are
irrelevant.

I wonder how many of those optimizations (e.g. maybe this exact one) would be
done by compilers with -ffast-math.

------
olegkikin
I keep a list of the most efficient lossless PNG optimizers, and Zopfli is
definitely one of the best.

[https://olegkikin.com/png_optimizers/](https://olegkikin.com/png_optimizers/)

~~~
masklinn
Unsurprisingly, PNGOut, ZopfliPNG and AdvanceCOMP all use non-zlib DEFLATE
implementations (respectively pkzip, zopfli, and 7zip or zopfli depending on
setting).

Interestingly, as far as I know OptiPNG uses the standard zlib DEFLATE and
still closes in on the lead optimisers for all but picture 3.

Although it would be good to include more information about your compression
parameters: while OptiPNG -o7 doesn't do anything to picture 3, a complete
exhaustive search (-o7 -zm1-9) does manage to compress it down to 8724 bytes
(8471 IDAT).

edit: in fact, despite the help text's claim, optipng shows a very slight
improvement on the first three pictures by using the strongest most exhaustive
search parameters (compared to those you list, using optipng 0.7.6):

* Picture 1 9787 bytes (51.80%) versus 9979 (50.85%)

* Picture 2 20513 bytes (10.63%) versus 20671 (9.94%)

* Picture 3 8724 bytes (10.09%) versus 9703 (0%)

Pictures 4 and 5 compress to the sizes you listed even using exhaustive
searching.
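
For reference, the exhaustive search described above would be invoked along
these lines (filename is a placeholder; flags as in optipng 0.7.6):

```shell
# -o7    : highest built-in optimization preset
# -zm1-9 : additionally try every zlib memory level (exhaustive search)
optipng -o7 -zm1-9 picture3.png
```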

------
JoshTriplett
For the change to reorder the additions to add int-to-int and float-to-float
to remove one conversion, I wonder if the compiler would do that itself if
allowed to by -ffast-math (included in -Ofast)? Normally, the compiler cannot
(even with -O3) perform changes that might change exact floating-point
results.

EDIT: No, looks like gcc can't do this even with -ffast-math or -Ofast. Given
the following C code:

    double test1(int lbits, double lsym, int dbits, double dsym)
    {
        return lsym + lbits + dsym + dbits;
    }

    double test2(int lbits, double lsym, int dbits, double dsym)
    {
        return lbits + dbits + lsym + dsym;
    }

GCC, with either -O3 or -Ofast, with or without -ffast-math, always produces
two cvtsi2sd instructions for test1 and one cvtsi2sd instruction for test2.

Filed as a GCC feature request:
[https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70731](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70731)

~~~
jtlim
Hi, author of article here.

I tested -ffast-math on clang and it won't reorder the additions to omit the
conversion here.

I'm actually VERY wary of -ffast-math due to a bad experience some years ago
where it totally messed up my calculations and I couldn't find a way to stop
it from doing so.

Even a 1 ulp change (not accounting for the round-down) in the final floating-
point comparison changed my compression results, so I'm not entirely confident
about using -ffast-math without a huge test corpus -- though turning it on
does improve the run time.

------
caf
I would be curious to know if profile-guided-optimisation would also have
helped.

