Yes, I have an AVX-512 double-precision exp implementation that does this, thanks to vpermi2pd.
This approach was also recommended by the Intel optimization manual -- a great resource.
I just went with straight math for single-precision, though.
> which could lead to efficient encoding of AVIF photos
Efficient in what sense? HW encoders usually explore only a fraction of the search space, and in the case of JPEG often produce 3-5 bits-per-pixel images.
The graph from Cloudinary uses libaom to do the encoding at speed preset 7 (aom s7), which is far from speed preset 0 and disables many AVIF coding tools. I do not know why this was chosen by the author, but it does not reflect AVIF performance. According to https://github.com/AOMediaCodec/libavif/issues/440#issuecomm... speed preset 8 loses 20-35% compression efficiency.
At a guess, probably because it matches, or is at least within an order of magnitude of, JPEG XL's encode speed? It starts to feel like a bait and switch if we say "look what we can do with >1000x slower encode" without noting that JPEG XL can also do a bit better with more encode time, and yet practical encoders, especially HW, use much higher speed settings.
Note: the 20-35% is BD-rate, which in this context likely(?) involves some form of PSNR, which has almost nothing to do with human perception and IMO should not be used to guide such decisions.
Encoding speed is not really a concern for web uses, where the image is decoded potentially millions or even billions of times more than it is encoded.
I agree, PSNR is a terrible measure of quality. The study "Benchmarking JPEG XL lossy/lossless image compression" (https://research.google/pubs/benchmrking-jpeg-xl-lossylossle...), which you are an author on, included a controlled subjective evaluation done by EPFL using an ITU-recommended methodology. The subjective results concluded:
"HEVC with SCC generally results in better rate/distortion performance when compared to JPEG XL at low bitrates (< 0.75), better preserving sharp lines and smooth areas"
It is known that AVIF performs better than HEVC. Can you say why it was not included in your 2020 subjective evaluation? It would be nice not to have to speculate on the relative quality of AVIF vs JPEG XL at web bitrates.
Glad we can agree on that. Unfortunately many evals use that or plain SSIM, which is still L2 at heart.
> Encoding speed is not really a concern for web uses
hm, in many discussions with Jon of Cloudinary, I did not get the impression that this is the case. Imagine their enthusiasm about 100x-ing their compute costs.
> Can you say why it was not included in your 2020 subjective evaluation?
The paper's comment on this is: "The selection of anchors is based on the general popularity of the codecs and the availability of the software implementations. We only intended to use codecs which are standardized internationally."
> using an ITU-recommended methodology
While this has some helpful guidance on viewing conditions, it is unfortunately still subjective (what parts are observers looking at? what counts as "annoying"?) and is more useful for detecting severe artifacts, which is less relevant in practice because that's hopefully not the quality range people are using.
Also, these results are quite old and both encoders have changed since then.
> need to speculate on the relative quality of AVIF v JPEG XL at web bitrates.
No need to speculate :) Just to first agree on what are actual web bitrates. From Chrome metrics, IIRC it was over 1bpp.
Here's some newer data: https://discuss.httparchive.org/t/what-compression-ratio-ran...
Even for AVIF, the median is 0.96 and q3 is 1.79(!).
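To make those bpp figures concrete, file size scales as bpp × pixel count / 8; a quick helper (name is mine) shows that the 0.96 bpp median over a 1000×1000 image already means 120 KB:

```c
/* bits-per-pixel -> file size in kilobytes (decimal KB).
 * size_bytes = bpp * width * height / 8 */
static double file_size_kb(double bpp, int width, int height) {
    return bpp * (double)width * (double)height / 8.0 / 1000.0;
}
```

So at the q3 value of 1.79 bpp, the same 1-megapixel image is roughly 224 KB.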
> The paper's comment on this is: "The selection of anchors is based on the general popularity of the codecs and the availability of the software implementations. We only intended to use codecs which are standardized internationally."
how very convenient, looks like politics outweigh any benchmarking
I have no skin in the game, but from the perspective of a large website provider, handling millions of images and processing thousands per day, I am glad not to have to deal with yet another format that would double our cache costs and force eternal support on the web. Not everybody has infinite Google money to afford supporting every image format on the planet, and Google can get only so much leeway after poisoning the web with WebP.
Disclosure: I worked on JPEG XL, opinions are my own.
The radio (network) on a phone can consume more power than the SoC (CPU). Thus smaller sizes can translate into energy savings.
As to the hypothetical advantage, we are talking about HW _potentially_ being used for image decode. AFAIK this does not happen in practice, despite WebP non-lossless being a VP8 frame and the hardware being plentifully available.
As evidence for the advantages of JPEG XL being real, consider the fact that it is increasingly being adopted in serious SW including ACR, Darktable, Krita, and Lightroom.
JPEG XL is significantly smaller? All the material I've seen has shown a slight edge at the same visual quality, but not enough that I'd expect it to offset hardware decode; although if software doesn't actually use hardware decode, then yeah, the argument falls apart.
I can 100% see JXL being adopted in production tools, where the motivations are different, I was mainly talking about the web adoption perspective for end users (which is the context of Google's supposed war on it, of course).
> it doesn't use intrinsics because they aren't actually easier to work with; they are not faster, not more portable
Well golly, I'll just have to disagree based on >20 years of experience, several of them in assembly. Assembly is only (maybe) faster for the code we actually manage to get written.
From where I sit, video codecs are a rare special case in that the format is standardized, changes only every few years, and has only a few but super-time-critical kernels.
For many many other use cases, the situation looks different and productivity matters more. Would you rather get a 10x speedup on 40% of the cycles, or 8x on 80%?
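That question is just Amdahl's law; a quick check (function name is mine):

```c
/* Amdahl's law: overall speedup when a fraction f of the
 * cycles gets a factor-s speedup. */
static double amdahl(double f, double s) {
    return 1.0 / ((1.0 - f) + f / s);
}
```

amdahl(0.4, 10) is about 1.56x overall, while amdahl(0.8, 8) is about 3.33x: the broader 8x win delivers more than twice the end-to-end speedup of the narrow 10x win.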
BTW the "not more portable" comment is a strawman because intrinsics themselves indeed aren't portable, but a wrapper library on top (such as our Highway) is.
> BTW the "not more portable" comment is a strawman because intrinsics themselves indeed aren't portable, but a wrapper library on top (such as our Highway) is.
That's not intrinsics then, it's different abstraction. You could write a wrapper library over inline assembly if you wanted to.
(And of course the intrinsics themselves could almost all be implemented as a header using inline assembly, too, since you're probably not relying on the compiler to optimize your intrinsic math. But optimization would be a bit worse because the compiler doesn't know the byte size of each instruction.)
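For illustration, one such header entry could look like this with GNU inline asm on x86-64 (hypothetical helper name; the "+x"/"x" constraints still let the compiler do register allocation, but it loses the semantic knowledge it has for the real intrinsic, so no constant folding or CSE across this boundary):

```c
#include <immintrin.h>

/* An "intrinsic" reimplemented as inline asm: packed 32-bit add.
 * AT&T syntax: paddd src, dst performs dst += src.
 * "+x" = read-write SIMD register operand, "x" = input. */
static __m128i paddd_asm(__m128i a, __m128i b) {
    __asm__("paddd %1, %0" : "+x"(a) : "x"(b));
    return a;
}
```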
The compiler can still do optimizations on intrinsics. Clang passes most of them through its regular optimizations, so you get things like loop unrolling and CSE (quite powerful if you have multiple invocations of the same SIMD thing, deduplicating constant loads and the like), plus some genuine improvements that reduce what you need to pay attention to: you don't need to manually merge into 'vpandn', 'vpand a,b,c; vptest a,a' becomes 'vptest b,c', shuffles sometimes improve, and negation gets hoisted out of a movmsk of a negated vpcmpeq. Though of course it can also make things worse, as the usual compiler tax.
An example of something that inline assembly would handle badly would be broadcast, which x86 pre-AVX-512 only can do with a value already in a SIMD register, or directly from memory, but the programmer almost always will want to provide it as a regular scalar variable, i.e. GPR.
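A portable sketch of that broadcast case using GCC/Clang vector extensions (type and function names are mine): the compiler decides how to get the scalar from a GPR or scalar register into all lanes, a move sequence that inline asm would have to spell out by hand for each target.

```c
/* 8-lane float vector via GCC/Clang vector extensions. */
typedef float v8sf __attribute__((vector_size(32)));

/* Broadcast a scalar to all lanes. The compiler picks the
 * instruction sequence (e.g. moving the value into a SIMD
 * register and then vbroadcastss on AVX2), which is exactly
 * the part that's awkward to express in raw inline asm. */
static v8sf broadcast8(float x) {
    return (v8sf){x, x, x, x, x, x, x, x};
}
```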
hm, sounds almost like macros wouldn't make it assembly anymore :)
I'm curious whether you know of any such inline asm wrapper? Seems that this gives the compiler less information than the intrinsics, which largely expand to builtins.
That "20 minute patch" will need to be maintained in FFmpeg for decades to come, potentially long after any standalone JPEG XL library has stopped being maintained; and archives like the Library of Congress are storing FFmpeg, so potentially for centuries. That's why it's done in assembly: so it's maintainable with the rest of the code.
This is dated 2009. Probably sound advice back then.
Compilers are much, much better with SIMD code than they were then. Today you'll have to work really hard to beat LLVM at optimizing basic SIMD code (edit: when given SIMD code as input, see comment below).
I happen to know because this "hacker news bro" has been dealing with SIMD code for longer than that.
Compared to what? Scalar loopy C code, sure. The auto-vectorization is not great.
But give LLVM some SIMD code as input, and it will be able to optimize it, and it does a great job with register allocation, spill code, instruction scheduling etc.
Instruction selection isn't as great and you still need to use intrinsics for specialized instructions.
And you get all of this for all CPU architectures and will deal with future microarchitecture changes for free. E.g. more execution ports added by Intel will get used with no code changes on your side.
With infinite time you can still do better by hand, but it gets expensive fast, especially if you have several CPU architectures to deal with.
Blog author (and dav1d/ffmpeg dev) here. My talk at VDD 2023 (https://www.youtube.com/watch?v=Z4DS3jiZhfo&t=9290s) did a comparison like the ones asked for above: I compared an intrinsics implementation of the AV1 inverse transform with the hand-written assembly one found in dav1d, analyzing the compiler-generated version (from intrinsics) versus the hand-written one in terms of instruction count (and cycle runtime, too) for the different types of things a compiler does (data loads, stack spills, constant loading, actual multiply/add math, addition of result to predictor, etc.).

Conclusion: modern compilers still can't do what we can do by hand, and the difference is up to 2x, which is a huge difference. It's partially because compilers are not as clever as everyone likes to think, but also because humans are more clever and can choose to violate ABI rules when it helps, which a compiler cannot do.

Is this hard? Yes. But at some scale it is worth it, and dav1d/FFmpeg are examples of such scale.
I am happily using compilers for just register allocation, spills, and scheduling in these use cases, but my impression is that compiler authors don't really consider compilers to be especially good at inputs like this where the user has already chosen the instructions and just wants register allocation, spills, and scheduling. The problem is known to be hard and the solutions we have are mostly heuristics tuned for very small numbers of live variables, which is the opposite of what you have if you want to do anything while you wait for your multiplies to be done.
Instruction selection, as you said, is mostly absent. Compilers will not substitute or for blend, or shift for shuffle, even in cases where they are trivially equivalent, so the programmer has to know which execution ports are available anyway =/
They were AVX2 SIMD intrinsics versus scalar assembly, but I doubt AVX2 assembly is going to substantially improve on the performance of my C++. The compiler did a decent job allocating the vector registers, and the assembly code is not too bad; there's not much to improve.
It’s interesting how close your 800% is to my 1000%. For this reason, I suspect you tested the opposite: naïve C or C++ versus SIMD assembly. Or maybe you tested automatically vectorized C or C++ code; automatic vectorizers often fail to deliver anything good.
So you took asm code that had no SIMD instructions in it, made your own version in C++ with intrinsics, and figured out that, yes, SIMD is faster? Really?
I think you're completely missing what we are talking about here.
No, not really. My point is, in modern compilers SSE and AVX intrinsics are usually pretty good, and assembly is not needed anymore even for very performance-sensitive use cases like video codecs or numerical HPC algorithms.
I think in the modern world it’s sufficient for developers to be able to read assembly, to understand what compilers are doing to their code. However, writing assembly is not the best idea anymore.
Assembly complicates builds because inline assembly is not available in all compilers, and for standalone assembly every project uses a different assembler: YASM, NASM, MASM, etc.
> and assembly is not needed anymore even for very performance-sensitive use cases like video codecs
People in this thread, who have spent years writing the video codecs you use daily, are telling you that, no, it’s a lot faster (10-20%), but you, who have done none of that, know better…
These people aren’t the only ones writing performance-critical SIMD code. I’ve been doing that for more than a decade now, even wrote articles on the subject like http://const.me/articles/simd/simd.pdf
> that you use daily
The video codecs I use daily are mostly implemented in hardware, not software.
> it’s a lot faster (10-20%)
Finally, believable numbers. Please note before this in this very thread you claimed “800% increase” which was totally incorrect.
BTW, it’s often possible to rework the source code and/or adjust compiler options to improve the machine code generated from SIMD intrinsics, shrinking that 10-20% to something like 1-2%.
Optimizations like that are obviously less reliable than using assembly, also relatively tricky to implement because compilers don’t expose enough knobs for low-level things like register allocation.
However, the approach still takes much less time than writing assembly, and it’s often good enough for many practical applications: Windows software shipped as binaries, and HPC or embedded, where you can rely not just on a specific compiler version but even on a specific processor revision and OS build.
> Finally, believable numbers. Please note before this in this very thread you claimed “800% increase” which was totally incorrect.
You cherry-pick my comments and can’t be bothered to read.
We’re talking about fully optimized, autovectorized C (with all the LLVM options) vs hand-written asm. And yes, 800% is likely.
The 20% is intrinsics vs hand-written asm.
> The video codecs I use daily are mostly implemented in hardware, not software.
Weirdly, I happen to know a bit more about the transcoding pipelines of the video industry than you do. And it’s far from hardware decoding and encoding over there…
You know nothing about the subject you are talking about.
You don’t need assembly to leverage AVX2 or AVX512 because on mainstream platforms, all modern compilers support SIMD intrinsics.
Based on the performance numbers, whoever wrote that test neglected to implement manual vectorization for the C version, which is the only reason why assembly is 23x faster there. If they reworked their C version with a focus on performance, i.e. using SIMD intrinsics, I’m pretty sure the performance difference between the C and assembly versions would be very unimpressive, like a couple percent.
The C version is in C because it needs to be portable and so there can be a baseline to find bugs in the other implementations.
The other ones aren't in asm merely because video codecs are "performance sensitive", it's because they're run in such specific contexts that optimizations work that can't be expressed portably in C+intrinsics across the supported compilers and OSes.
Yeah, it’s clear why you can’t have a single optimized C version.
However, can’t you have 5 different non-portable optimized C versions, just like you do with the assembly code?
SIMD intrinsics are generally portable across compilers and OSes, because their C API is defined by Intel, not by compiler or OS vendors. When I want software optimized for multiple targets like SSE, AVX1, AVX2, I sometimes do that in C++.