
It's funny reading the comments here: Hacker News bros really think they're smarter than the people who made ffmpeg/x264/x265, etc. Dunning–Kruger effect in action.



First you write in C. Then it's too slow, you write parts in asm. After a while you have a lot of asm. You lean harder on the macro processor. At some point you port to another processor and decide the macro processor can handle that. Bang, x86inc.asm.

That doesn't make the end point optimal. Nor does it mean it's what the authors would have done from a clean slate. At each step you take the sensible choice and after a long trek down the gradient you end up somewhere like this.

Given a desire to write something analogous to these codecs today, should you copy their development path? Should you try to copy the end result, maybe even using the same tools?

Your argument from authority amounts to "these guys are clever, you should imitate them". There are failure modes in that line of thinking which I hope the above makes clear.


We aren't stupid. We designed x86inc to be like that for good reasons, from a clean slate, and if we didn't like it we would have done something else.

You haven't tested the alternatives - they're slow and don't work in this situation, mostly because C is not actually that low level when it comes to memory aliasing.


Well that's much more interesting. Is there anything written publicly about the experience? Any tooling used to help get the implementation right beyond testing and writing it carefully?

I've found a little here https://ffmpeg.org/developer.html#SIMD_002fDSP-1

The context is I'm a compiler developer who really liked working side by side with old school assembly developers in a past role. I'm painfully aware that the tribal knowledge of building stuff out of asm is hard to find written down and always curious about the directions in which things like C can be extended to narrow the gap.


FWIW I have been writing SIMD for 20+ years and worked on JPEG XL, which also contains a good bit of vector code.

BTW one anecdote: a colleague mentioned that what should have been a quick 20 min patch to ffmpeg took a day because the code was written in assembly.


That "20 minute patch" will need to be maintained for decades to come in FFmpeg, long outliving a standalone JPEG-XL library. Potentially centuries, as archives like the Library of Congress are storing FFmpeg. That's why it's done in assembly: so it's maintainable with the rest of the code.


Here is the old archived blog post where x264 team answered in the comments why they do it that way. https://web.archive.org/web/20091223024333/http://x264dev.mu...


This is dated 2009. Probably sound advice back then.

Compilers are much much better with SIMD code than they were then. Today you'll have to work real hard to beat LLVM in optimizing basic SIMD code (edit: when given SIMD code as input, see comment below).

I happen to know because this "hacker news bro" has been dealing with SIMD code for longer than that.


FFmpeg code by definition is not "basic SIMD code". And it supports numerous compilers other than LLVM.


> Today you'll have to work real hard to beat LLVM in optimizing basic SIMD code.

On dav1d, we see just an 800% increase… I know it’s negligible, but…


Compared to what? Scalar, loopy C code, sure. The auto-vectorization is not great.

But give LLVM some SIMD code as input, and it will be able to optimize it, and it does a great job with register allocation, spill code, instruction scheduling etc.

Instruction selection isn't as great and you still need to use intrinsics for specialized instructions.

And you get all of this for all CPU architectures and will deal with future microarchitecture changes for free. E.g. more execution ports added by Intel will get used with no code changes on your side.

With infinite time you can still do better by hand, but it gets expensive fast, especially if you have several CPU architectures to deal with.


Blog author (and dav1d/FFmpeg dev) here. My talk at VDD 2023 (https://www.youtube.com/watch?v=Z4DS3jiZhfo&t=9290s) did a comparison like the ones asked for above. I compared an intrinsics implementation of the AV1 inverse transform with the hand-written assembly one found in dav1d, analyzing the compiler-generated version (from intrinsics) versus the hand-written one in terms of instruction count (and cycle runtime, too) for the different kinds of work a compiler does: data loads, stack spills, constant loading, the actual multiply/add math, addition of the result to the predictor, etc.

Conclusion: modern compilers still can't do what we can do by hand; the difference is up to 2x, which is a huge difference. It's partially because compilers are not as clever as everyone likes to think, but also because humans are more clever and can choose to violate ABI rules when that's helpful, which a compiler cannot do.

Is this hard? Yes. But at some scale it's worth it. dav1d/FFmpeg are examples of such scale.


I am happily using compilers for just register allocation, spills, and scheduling in these use cases, but my impression is that compiler authors don't really consider compilers to be especially good at inputs like this where the user has already chosen the instructions and just wants register allocation, spills, and scheduling. The problem is known to be hard and the solutions we have are mostly heuristics tuned for very small numbers of live variables, which is the opposite of what you have if you want to do anything while you wait for your multiplies to be done.

Instruction selection as you said is mostly absent. Compilers will not substitute or for blend or shift for shuffle even in cases where they are trivially equivalent, so the programmer has to know what execution ports are available anyway =/


> Scalar loopy C code sure. The auto vectorization is not great.

Stop considering people as idiots.

People do that because it’s a LOT faster, not just a bit.

If you are so able, please show us your results. Dav1d is full open source, fully documented, and with quite simple C code.

Show your results.


> show us your results

Not GP but here’s an example where intrinsics outperformed assembly by an order of magnitude: https://news.ycombinator.com/item?id=36624240

They were AVX2 SIMD intrinsics versus scalar assembly, but I doubt AVX2 assembly is going to substantially improve the performance of my C++. The compiler did a decent job allocating these vector registers, and the assembly code is not too bad, not much to improve.

It’s interesting how close your 800% is to my 1000%. For this reason, I suspect you tested the opposite: naïve C or C++ versus SIMD assembly. Or maybe you tested automatically vectorized C or C++ code; automatic vectorizers often fail to deliver anything good.


So you took asm code that had no SIMD instructions in it, made your own version in C++ with intrinsics, and figured out that, yes, SIMD is faster? Really?

I think you're completely missing what we're talking about here.


> Really?

No, not really. My point is, in modern compilers SSE and AVX intrinsics are usually pretty good, and assembly is not needed anymore even for very performance-sensitive use cases like video codecs or numerical HPC algorithms.

I think in the modern world it’s sufficient for developers to be able to read assembly, to understand what compilers are doing to their code. However, writing assembly is no longer the best idea.

Assembly is unreliable due to OS-specific shenanigans, resulting in bugs like this one: https://issues.chromium.org/issues/40185629

Assembly complicates builds because inline assembly is not available in all compilers, and for non-inline assembly every project uses a different assembler: YASM, NASM, MASM, etc.


> and assembly is not needed anymore even for very performance-sensitive use cases like video codecs

People in this thread, who have spent years writing the video codecs you use daily, are telling you that, no, it’s a lot faster (10-20%), but you, who have done none of that, know better…


> writing video codecs for years

These people aren’t the only ones writing performance-critical SIMD code. I’ve been doing that for more than a decade now, even wrote articles on the subject like http://const.me/articles/simd/simd.pdf

> that you use daily

The video codecs I use daily are mostly implemented in hardware, not software.

> it’s a lot faster (10-20%)

Finally, believable numbers. Please note that earlier in this very thread you claimed an “800% increase”, which was totally incorrect.

BTW, it’s often possible to rework source code and/or adjust compiler options to improve performance of the machine code generated from SIMD intrinsics, diminishing these 10-20% to something like 1-2%.

Optimizations like that are obviously less reliable than using assembly, also relatively tricky to implement because compilers don’t expose enough knobs for low-level things like register allocation.

However, the approach still takes much less time than writing assembly. And it’s often good enough for many practical applications. Examples of these applications include Windows software shipped in binaries, and HPC or embedded where you can rely not just on a specific compiler version, but even on specific processor revision and OS build.


> Finally, believable numbers. Please note that earlier in this very thread you claimed an “800% increase”, which was totally incorrect.

You cherry-pick my comments and cannot be bothered to read.

We’re talking about fully optimized, autovectorized C with all LLVM options enabled vs hand-written asm. And yes, 800% is likely.

The 20% is intrinsics vs hand written.

> The video codecs I use daily are mostly implemented in hardware, not software.

Weirdly, I know a bit more about the transcoding pipelines of the video industry than you do. And it’s far from hardware decoding and encoding over there…

You know nothing about the subject you are talking about.


v210_planar_pack_8_c: 2298.5

v210_planar_pack_8_ssse3: 402.5

v210_planar_pack_8_avx: 413.0

v210_planar_pack_8_avx2: 206.0

v210_planar_pack_8_avx512: 193.0

v210_planar_pack_8_avx512icl: 100.0

23x speedup. The compiler isn't going to come up with some of the trickery to make this function 23x faster.

800% is nothing.


You don’t need assembly to leverage AVX2 or AVX512 because on mainstream platforms, all modern compilers support SIMD intrinsics.

Based on the performance numbers, whoever wrote that test neglected to implement manual vectorization for the C version, which is the only reason assembly is 23x faster in that test. If they reworked their C version with a focus on performance, i.e. using SIMD intrinsics, pretty sure the performance difference between the C and assembly versions is going to be very unimpressive, like a couple percent.


The C version is in C because it needs to be portable and so there can be a baseline to find bugs in the other implementations.

The other ones aren't in asm merely because video codecs are "performance sensitive"; it's because they run in such specific contexts that optimizations become possible which can't be expressed portably in C+intrinsics across the supported compilers and OSes.


Yeah, it’s clear why you can’t have a single optimized C version.

However, can’t you have 5 different non-portable optimized C versions, just like you do with the assembly code?

SIMD intrinsics are generally portable across compilers and OSes, because their C API is defined by Intel, not by compiler or OS vendors. When I want software optimized for multiple targets like SSE, AVX1, AVX2, I sometimes do that in C++.



