janwas's comments | Hacker News

With AVX-512 one can have a 128-byte table with one vector of lookups produced each cycle :)

Yes, I have an AVX-512 double-precision exp implementation that does this thanks to vpermi2pd. This approach was also recommended by the Intel optimization manual -- a great resource.

I just went with straight math for single-precision, though.
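
Roughly what that looks like (a sketch, not the actual library code; vpermi2pd selects one of 16 doubles per lane, so the whole 128-byte table lives in two ZMM registers):

    #include <immintrin.h>

    // idx holds per-lane indices 0..15; bit 3 picks tbl0 vs tbl1.
    static inline __m512d TableLookup16(__m512i idx, __m512d tbl0,
                                        __m512d tbl1) {
      return _mm512_permutex2var_pd(tbl0, idx, tbl1);  // vpermi2pd
    }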


> which could lead to efficient encoding of AVIF photos

Efficient in what sense? HW encoders usually explore only a fraction of the search space, and in the case of JPEG often result in 3..5 bits-per-pixel images.


I'm curious what the evidence is for AVIF outperforming below 1bpp?

Have you seen this more recent data that includes AVIF? https://cloudinary.com/labs/cid22


> Have you seen this more recent data that includes AVIF? https://cloudinary.com/labs/cid22

The graph from Cloudinary uses libaom to do the encoding at speed preset 7 (aom s7), which is far from speed preset 0 and disables many AVIF coding tools. I do not know why this was chosen by the author, but it does not reflect AVIF performance. According to https://github.com/AOMediaCodec/libavif/issues/440#issuecomm... speed preset 8 loses 20-35% compression efficiency.


At a guess, probably because it matches, or is at least within the same order of magnitude as, JPEG XL's encode speed? It starts to feel like bait-and-switch if we say "look what we can do with >1000x slower encode" without noting that JPEG XL can also do a bit better with more encode time, and yet practical encoders, especially HW, use much higher speed settings.

Note: the 20-35% is BDrate, which in this context likely(?) involves some form of PSNR, which has almost nothing to do with human perception and IMO should not be used to guide such decisions.


Encoding speed is not really a concern for web uses, where the image is decoded potentially millions or even billions of times more than it is encoded.

I agree, PSNR is a terrible measure of quality. The study "Benchmarking JPEG XL lossy/lossless image compression" (https://research.google/pubs/benchmrking-jpeg-xl-lossylossle...), on which you are an author, included a controlled subjective evaluation done by EPFL using an ITU-recommended methodology. The subjective results concluded:

"HEVC with SCC generally results in better rate/distortion performance when compared to JPEG XL at low bitrates (< 0.75), better preserving sharp lines and smooth areas"

It is known that AVIF performs better than HEVC. Can you say why it was not included in your 2020 subjective evaluation? It would be nice to not need to speculate on the relative quality of AVIF v JPEG XL at web bitrates.


> PSNR is a terrible measure of quality

Glad we can agree on that. Unfortunately many evals use that or plain SSIM, which is still L2 at heart.

> Encoding speed is not really a concern for web uses

hm, in many discussions with Jon of Cloudinary, I did not get the impression that this is the case. Imagine their enthusiasm about 100x-ing their compute costs.

> Can you say why it was not included in your 2020 subjective evaluation?

The paper's comment on this is: "The selection of anchors is based on the general popularity of the codecs and the availability of the software implementations. We only intended to use codecs which are standardized internationally."

> using an ITU-recommended methodology

While this has some helpful guidance on viewing conditions, it is unfortunately still subjective (what parts are observers looking at, what counts as "annoying") and is more useful for detecting severe artifacts, which is less relevant in practice because that's hopefully not the quality range people are using.

Also, these results are quite old and both encoders have changed since then.

> need to speculate on the relative quality of AVIF v JPEG XL at web bitrates.

No need to speculate :) We just first have to agree on what actual web bitrates are. From Chrome metrics, IIRC it was over 1bpp. Here's some newer data: https://discuss.httparchive.org/t/what-compression-ratio-ran... Even for AVIF, the median is 0.96 and q3 is 1.79(!).

Jon has written several articles on comparisons, including https://cloudinary.com/blog/contemplating-codec-comparisons and https://cloudinary.com/blog/jpeg-xl-and-the-pareto-front#med....


> The paper's comment on this is: "The selection of anchors is based on the general popularity of the codecs and the availability of the software implementations. We only intended to use codecs which are standardized internationally."

how very convenient, looks like politics outweigh any benchmarking

I have no skin in the game, but from a large website provider's perspective, handling millions of images and processing thousands per day, I am glad not to have to deal with yet another format that would double our cache costs and force eternal support on the web. Not everybody has infinite Google money to afford supporting every image format existing on the planet, and Google can get only so much leeway after poisoning the web with WebP.

j2c


small typo in the article https://research.google/pubs/benchmarking-jpeg-xl-lossylossl... (very interesting read!)


Disclosure: I worked on JPEG XL, opinions are my own.

The radio(network) on phones can consume more power than the SoC(CPU). Thus smaller size can translate into energy savings.

As to the hypothetical advantage, we are talking about HW _potentially_ being used for image decode. AFAIK this does not happen in practice, despite WebP non-lossless being a VP8 frame and the hardware being plentiful.

As evidence for the advantages of JPEG XL being real, consider the fact that it is increasingly being adopted in serious SW including ACR, Darktable, Krita, and Lightroom.


JPEG XL is significantly smaller? All the stuff I've seen has shown a slight edge for the same visual quality, but not enough that I'd suspect it'd offset hardware decode, although if software doesn't actually do hardware decode, then yeah, the argument falls apart.

I can 100% see JXL being adopted in production tools, where the motivations are different; I was mainly talking about the web adoption perspective for end users (which is the context of Google's supposed war on it, of course).


> it doesn't use intrinsics because they aren't actually easier to work with; they are not faster, not more portable

Well golly, I'll just have to disagree based on >20 years of experience, including several in assembly. asm is only (maybe) faster for the code we manage to get written.

From where I sit, video codecs are a rare special case in that the format is standardized, changes only every few years, and has only a few but super-time-critical kernels. For many many other use cases, the situation looks different and productivity matters more. Would you rather get a 10x speedup on 40% of the cycles, or 8x on 80%?
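
(Working it out, Amdahl-style: 10x on 40% of cycles leaves 0.6 + 0.4/10 = 0.64 of the original time, about 1.6x overall; 8x on 80% leaves 0.2 + 0.8/8 = 0.3, about 3.3x.)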

BTW the "not more portable" comment is a strawman because intrinsics themselves indeed aren't portable, but a wrapper library on top (such as our Highway) is.
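
For a flavor of what that looks like, here's a minimal sketch in Highway's static-dispatch style (simplified; assumes size is a multiple of the vector length). The same source compiles to SSE4, AVX2, AVX-512, NEON, SVE, etc.:

    #include <cstddef>
    #include "hwy/highway.h"

    namespace hn = hwy::HWY_NAMESPACE;

    void MulArrays(const float* HWY_RESTRICT a, const float* HWY_RESTRICT b,
                   float* HWY_RESTRICT out, size_t size) {
      const hn::ScalableTag<float> d;  // widest vector the target offers
      for (size_t i = 0; i < size; i += hn::Lanes(d)) {
        hn::Store(hn::Mul(hn::Load(d, a + i), hn::Load(d, b + i)), d, out + i);
      }
    }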


> BTW the "not more portable" comment is a strawman because intrinsics themselves indeed aren't portable, but a wrapper library on top (such as our Highway) is.

That's not intrinsics then, it's a different abstraction. You could write a wrapper library over inline assembly if you wanted to.

(And of course the intrinsics themselves could almost all be implemented as a header using inline assembly too. Since you're probably not relying on the compiler to optimize your intrinsic math. But optimization would be a bit worse because it doesn't know the byte size of each instruction.)


The compiler can still do optimizations on intrinsics - clang passes most of them through its regular optimizations, so you get things like loop unrolling and CSE (quite powerful if you have multiple invocations of the same SIMD thing, deduplicating constant loads and whatnot), plus some genuine improvements that reduce what you need to pay attention to: no need to manually merge to 'vpandn'; 'vpand a,b,c; vptest a,a' becomes 'vptest b,c'; it sometimes improves shuffles, or hoists a negation out of a movmsk of a negated vpcmpeq. Though it can of course make things worse too, as the regular compiler tax.
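
That vpand/vptest fold, concretely (my own example, compiled with AVX2 enabled):

    #include <immintrin.h>

    // True iff b and c share any set bits. Clang folds the explicit AND
    // into the test, emitting a single 'vptest b, c' instead of
    // 'vpand' followed by 'vptest a, a'.
    bool AnyCommonBits(__m256i b, __m256i c) {
      const __m256i a = _mm256_and_si256(b, c);
      return _mm256_testz_si256(a, a) == 0;
    }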

An example of something that inline assembly would handle badly is broadcast, which x86 pre-AVX-512 can only do from a value already in a SIMD register, or directly from memory, whereas the programmer almost always wants to provide it as a regular scalar variable, i.e. in a GPR.
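
For instance (a sketch; with AVX2 enabled this compiles to vmovd + vpbroadcastd, the compiler handling the GPR-to-XMM move itself):

    #include <immintrin.h>

    // Broadcast a scalar held in a general-purpose register to all lanes.
    __m256i Broadcast32(int x) {
      return _mm256_set1_epi32(x);
    }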


hm, sounds almost like macros wouldn't make it assembly anymore :)

I'm curious whether you know of any such inline asm wrapper? Seems that this gives the compiler less information than the intrinsics, which largely expand to builtins.


For anyone interested in a C++ implementation, our github.com/google/gemma.cpp now supports this model.


Fun fact -- gemma.cpp uses Highway, an amazing high-performance computation library originally developed in the JPEG XL effort.


For C++, also check out our https://github.com/google/gemma.cpp/blob/main/gemma.cc, which has direct calls to MatVec.


I think the previous code was using f32 dot products instead of bf16.


Nice result, congrats Justine! The bf16 dot instruction replaces 6 instructions: https://github.com/google/highway/blob/master/hwy/ops/x86_12... A 3-4x speedup vs SKX sounds quite plausible :)
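
In intrinsic form, that one instruction is (a sketch; requires AVX512-BF16, and the linked Highway code is the portable version):

    #include <immintrin.h>

    // vdpbf16ps: per f32 lane, accumulate the products of two pairs of
    // adjacent bf16 elements from a and b into acc.
    __m512 Bf16DotStep(__m512 acc, __m512bh a, __m512bh b) {
      return _mm512_dpbf16_ps(acc, a, b);
    }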


FWIW I have been writing SIMD for 20+ years and worked on JPEG XL, which also contains a good bit of vector code.

BTW, one anecdote: a colleague mentioned that what should have been a quick 20-minute patch to ffmpeg took a day because it was written in assembly.


That "20 minute patch" will need to be maintained for decades to come in FFmpeg, long after a standalone JPEG-XL library. Potentially centuries as archives like the Library of Congress are storing FFmpeg. So that's why it's done in assembly, so it's maintainable with the rest of the code.


Here is the old archived blog post where the x264 team answered in the comments why they do it that way. https://web.archive.org/web/20091223024333/http://x264dev.mu...


This is dated to 2009. Probably sound advice back then.

Compilers are much, much better with SIMD code than they were then. Today you'll have to work really hard to beat LLVM in optimizing basic SIMD code (edit: when given SIMD code as input, see comment below).

I happen to know because this "hacker news bro" has been dealing with SIMD code for longer than that.


FFmpeg code by definition is not "basic SIMD code". And it supports numerous other compilers other than LLVM.


> Today you'll have to work real hard to beat LLVM in optimizing basic SIMD code.

On dav1d, we see just an 800% increase… I know it’s negligible, but…


Compared to what? Scalar loopy C code, sure. The auto-vectorization is not great.

But give LLVM some SIMD code as input, and it will be able to optimize it, and it does a great job with register allocation, spill code, instruction scheduling etc.

Instruction selection isn't as great and you still need to use intrinsics for specialized instructions.

And you get all of this for all CPU architectures and will deal with future microarchitecture changes for free. E.g. more execution ports added by Intel will get used with no code changes on your side.

With infinite time you can still do better by hand, but it gets expensive fast, especially if you have several CPU architectures to deal with.


Blog author (and dav1d/ffmpeg dev) here. My talk at VDD 2023 (https://www.youtube.com/watch?v=Z4DS3jiZhfo&t=9290s) did a comparison like the ones asked above. I compared an intrinsics implementation of the AV1 inverse transform with the hand-written assembly one found in dav1d. I analyzed the compiler-generated version (from intrinsics) versus the hand-written one in terms of instruction count (and cycle runtime, too) for different types of things a compiler does (data loads, stack spills, constant loading, actual multiply/add math, addition of result to predictor, etc.). Conclusion: modern compilers still can't do what we can do by hand, the difference is up to 2x - this is a huge difference. It's partially because compilers are not as clever as everyone likes to think, but also because humans are more clever and can choose to violate ABI rules if it's helpful, which a compiler cannot do. Is this hard? Yes. But at some scale, this is worth it. dav1d/FFmpeg are examples of such scale.


I am happily using compilers for just register allocation, spills, and scheduling in these use cases, but my impression is that compiler authors don't really consider compilers to be especially good at inputs like this where the user has already chosen the instructions and just wants register allocation, spills, and scheduling. The problem is known to be hard and the solutions we have are mostly heuristics tuned for very small numbers of live variables, which is the opposite of what you have if you want to do anything while you wait for your multiplies to be done.

Instruction selection, as you said, is mostly absent. Compilers will not substitute 'or' for 'blend', or 'shift' for 'shuffle', even in cases where they are trivially equivalent, so the programmer has to know what execution ports are available anyway =/


> Scalar loopy C code, sure. The auto-vectorization is not great.

Stop treating people like idiots.

People do that because it’s a LOT faster, not just a bit.

If you are so able, please show us your results. Dav1d is fully open source, fully documented, and has quite simple C code.

Show your results.


> show us your results

Not GP but here’s an example where intrinsics outperformed assembly by an order of magnitude: https://news.ycombinator.com/item?id=36624240

They were AVX2 SIMD intrinsics versus scalar assembly, but I doubt AVX2 assembly is gonna substantially improve the performance of my C++. The compiler did a decent job allocating these vector registers, and the assembly code is not too bad; not much to improve.

It’s interesting how close your 800% is to my 1000%. For this reason, I have a suspicion you tested the opposite: naïve C or C++ versus SIMD assembly. Or maybe you tested automatically vectorized C or C++ code; automatic vectorizers often fail to deliver anything good.


So you took asm code that had no SIMD instructions in it, made your own version in C++ with intrinsics, and figured out that, yes, SIMD is faster? Really?

I think you're completely missing what we're talking about here.


> Really?

No, not really. My point is, in modern compilers SSE and AVX intrinsics are usually pretty good, and assembly is not needed anymore even for very performance-sensitive use cases like video codecs or numerical HPC algorithms.

I think in the modern world it’s sufficient for developers to be able to read assembly, to understand what compilers are doing to their code. However, writing assembly is not the best idea anymore.

Assembly is unreliable due to OS-specific shenanigans, resulting in bugs like this one: https://issues.chromium.org/issues/40185629

Assembly complicates builds because inline assembly is not available in all compilers, and for non-inline assembly every project uses a different assembler: YASM, NASM, MASM, etc.


> and assembly is not needed anymore even for very performance-sensitive use cases like video codecs

People in this thread, who have spent years writing the video codecs you use daily, tell you that, no, it’s a lot faster (10-20%), but you, who have done none of that, know better…


> writing video codecs for years

These people aren’t the only ones writing performance-critical SIMD code. I’ve been doing that for more than a decade now, even wrote articles on the subject like http://const.me/articles/simd/simd.pdf

> that you use daily

The video codecs I use daily are mostly implemented in hardware, not software.

> it’s a lot faster (10-20%)

Finally, believable numbers. Please note that earlier in this very thread you claimed an “800% increase”, which was totally incorrect.

BTW, it’s often possible to rework source code and/or adjust compiler options to improve performance of the machine code generated from SIMD intrinsics, diminishing these 10-20% to something like 1-2%.

Optimizations like that are obviously less reliable than using assembly, also relatively tricky to implement because compilers don’t expose enough knobs for low-level things like register allocation.

However, the approach still takes much less time than writing assembly. And it’s often good enough for many practical applications. Examples of these applications include Windows software shipped in binaries, and HPC or embedded where you can rely not just on a specific compiler version, but even on specific processor revision and OS build.


> Finally, believable numbers. Please note that earlier in this very thread you claimed an “800% increase”, which was totally incorrect.

You cherry-pick my comments and cannot be bothered to read.

We’re talking about fully optimized autovec-with-all-LLVM-options vs hand-written asm. And yes, 800% is likely.

The 20% is intrinsics vs hand-written.

> The video codecs I use daily are mostly implemented in hardware, not software.

Weirdly, I know a bit more about the transcoding pipelines of the video industry than you do. And it’s far from hardware decoding and encoding over there…

You know nothing about the subject you are talking about.


    v210_planar_pack_8_c:         2298.5
    v210_planar_pack_8_ssse3:      402.5
    v210_planar_pack_8_avx:        413.0
    v210_planar_pack_8_avx2:       206.0
    v210_planar_pack_8_avx512:     193.0
    v210_planar_pack_8_avx512icl:  100.0

23x speedup. The compiler isn't going to come up with some of the trickery to make this function 23x faster.

800% is nothing.


You don’t need assembly to leverage AVX2 or AVX512 because on mainstream platforms, all modern compilers support SIMD intrinsics.

Based on the performance numbers, whoever wrote that test neglected to implement manual vectorization for the C version, which is the only reason why assembly is 23x faster for that test. If they reworked their C version with a focus on performance, i.e. using SIMD intrinsics, pretty sure the performance difference between the C and assembly versions is gonna be very unimpressive, like a couple percent.


The C version is in C because it needs to be portable and so there can be a baseline to find bugs in the other implementations.

The other ones aren't in asm merely because video codecs are "performance sensitive", it's because they're run in such specific contexts that optimizations work that can't be expressed portably in C+intrinsics across the supported compilers and OSes.


Yeah, it’s clear why you can’t have a single optimized C version.

However, can’t you have 5 different non-portable optimized C versions, just like you do with the assembly code?

SIMD intrinsics are generally portable across compilers and OSes, because their C API is defined by Intel, not by compiler or OS vendors. When I want software optimized for multiple targets like SSE, AVX1, AVX2, I sometimes do that in C++.
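
A hypothetical sketch of that multi-target approach (my own example, not from any particular codebase; assumes GCC/Clang on x86-64 and n a multiple of the vector width):

    #include <immintrin.h>
    #include <cstddef>

    __attribute__((target("avx2")))
    static void ScaleAvx2(float* p, size_t n, float s) {
      const __m256 vs = _mm256_set1_ps(s);
      for (size_t i = 0; i < n; i += 8)
        _mm256_storeu_ps(p + i, _mm256_mul_ps(_mm256_loadu_ps(p + i), vs));
    }

    static void ScaleSse(float* p, size_t n, float s) {  // SSE baseline
      const __m128 vs = _mm_set1_ps(s);
      for (size_t i = 0; i < n; i += 4)
        _mm_storeu_ps(p + i, _mm_mul_ps(_mm_loadu_ps(p + i), vs));
    }

    void Scale(float* p, size_t n, float s) {
      // __builtin_cpu_supports is a GCC/Clang builtin; MSVC needs __cpuid.
      if (__builtin_cpu_supports("avx2")) ScaleAvx2(p, n, s);
      else ScaleSse(p, n, s);
    }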

