Yes, we are leaving 2X on the table in terms of peak frequency compared to well-staffed chipzilla teams. Not ideal, but we have a big enough lead in terms of architecture that it kind of works.
The comment above said you couldn't release the info due to the EDA vendor. However, people like Jiri Gaisler have released their methodologies via papers that just describe them with artificial examples. Others use non-manufacturable processes and libraries (like NanGate) so the EDA vendors' feelings don't get hurt about results that don't apply to real-world processes. ;)
So, if you have a 16nm silicon compiler, I encourage you to pull a Gaisler with a presentation on how you do that with key details and synthetic examples designed to avoid issues with EDA vendors. Or just use Qflow if possible.
I'll pass for now... Gaisler is in the business of consulting; we survive by building products. I am happy to release sources, but it's completely up to the EDA company.
[edit: was thinking of the wrong Gaisler, still will pass]
Dammit. No promises, but would you consider putting it together if someone paid your company to do it under an academic grant or something? Quite a few academics are trying to do things like what you've done, and there's a small chance one of them might go for that.
That's concurrency, throughput, and load-balancing of web servers connected to pipes of certain bandwidth. It's not the same as parallel execution of CPU-bound code on a tiled processor. You could know a lot about one while knowing almost nothing about the other.
That seems analogous to human assembly optimization vs a compiler. But the time to market is greatly reduced, designs can be vetted and a 2.0 that is optimized for frequency can be shipped later.
IIRC, human assembly optimization is unlikely to be better than a modern compiler nowadays. Same thing could very well happen for this "automated flow" if it starts incorporating its own optimization techniques.
That is a myth. Most developers can't beat LLVM. LLVM can't beat the handcrafted assembly in libjpeg-turbo or x264 or openssl or luajit by compiling the generic C alternative.
In response to the other replies: I'm not sure about luajit, but the other two examples involved a programmer hand crafting algorithms around specific special purpose CPU instructions -- vector processing and video compression hardware, if I remember the details of x264 correctly. This is so specialized and architecture specific that it probably doesn't make sense to push it into the compiler.
Speaking from experience, even getting purpose-built compilers like ICC to apply "simple" optimizations like fused-multiply-add to matrix multiply is non-trivial.
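To make that concrete (a toy sketch of mine, not anything ICC-specific): whether an inner loop like the one below turns into FMA instructions typically depends on target and contraction flags (e.g. -mfma or -march=... plus -ffp-contract on gcc/clang), and even then the reduction pattern can get in the way.

    /* Hypothetical kernel: C = A*B for row-major n x n matrices.
     * The a*b accumulated into acc is the obvious FMA candidate, but
     * the compiler may only fuse it with the right flags, and may not
     * vectorize the reduction at all without further hints. */
    void matmul(int n, const float *a, const float *b, float *c)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                float acc = 0.0f;
                for (int k = 0; k < n; k++)
                    acc += a[i * n + k] * b[k * n + j];
                c[i * n + j] = acc;
            }
    }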
Taking jpeg decoding as a concrete example of why modern compilers fall over, you have two high-level choices: (1) the compiler automatically translates a generic program into one that can be vectorized using the instructions on the target platforms. This will probably involve reworking control flow, loops, heap memory layout, malloc calls, etc, and will require changing the compressed/decompressed images in ways imperceptible to humans (the vector instructions often have different precision/rounding properties than non-vector instructions). This is well beyond the state of the art.
(2) Find a programmer that deeply understands the capabilities of all the target architectures and compilers, who will then write in the subset of C/Java/etc that can be vectorized on each architecture.
I think you'll find there are many more assembler programmers than there are people with the expertise to pull off (2), and that using compiler intrinsics is actually more productive anyway.
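To give a feel for what (2) looks like with intrinsics (a minimal sketch of mine, not taken from libjpeg-turbo): you pick the instructions, and the compiler handles registers, scheduling, and addressing.

    #include <emmintrin.h>  /* SSE2 */
    #include <stddef.h>
    #include <stdint.h>

    /* Average two rows of 8-bit pixels, 16 at a time, with rounding.
     * Assumes n is a multiple of 16 and 16-byte aligned pointers,
     * just to keep the sketch short. */
    void avg_rows(const uint8_t *a, const uint8_t *b, uint8_t *out, size_t n)
    {
        for (size_t i = 0; i < n; i += 16) {
            __m128i va = _mm_load_si128((const __m128i *)(a + i));
            __m128i vb = _mm_load_si128((const __m128i *)(b + i));
            _mm_store_si128((__m128i *)(out + i), _mm_avg_epu8(va, vb));
        }
    }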
x264 does not use any video compression hardware. It uses only regular SIMD.
I don't agree that SIMD is so specialized. It is needed wherever you have operations over arrays of items of the same type, including memcmp, memcpy, strchr, unicode encoders/decoders/checkers, operations on pixels, radio or sound samples, accelerometer data, etc.
Compilers have latency and dependency models for specific CPU arch decoders/schedulers/pipelines. Compiler authors agree that compilers should learn to do good autovectorization. But it's hard. So people use assembly.
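A toy example of the "hard" part (mine, nothing authoritative): a strchr-style loop with a data-dependent early exit is exactly the kind of thing autovectorizers usually give up on, even though a hand-written SIMD version that tests 16 or 32 bytes per iteration is a classic win. Asking the compiler why (clang's -Rpass-missed=loop-vectorize, gcc's -fopt-info-vec-missed) will typically just confirm it bailed.

    #include <stddef.h>

    /* Return the index of the first byte equal to c, or -1.
     * The early exit inside the loop is what usually stops the
     * autovectorizer; SIMD versions scan a whole register's worth
     * of bytes and use a movemask-style trick to locate the hit. */
    ptrdiff_t find_byte(const char *s, size_t n, char c)
    {
        for (size_t i = 0; i < n; i++)
            if (s[i] == c)
                return (ptrdiff_t)i;
        return -1;
    }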
> human assembly optimization is unlikely to be better than a modern compiler
You said:
> Most developers can't beat LLVM
Then you pointed out some specific examples where a human can beat a compiler.
Seems like you two agree, but then you go and call what he is saying "a myth". I think I need some clarification.
Prior to this, my understanding was that if the developer gives the compiler good information (types, const, no pointer aliasing) and in general makes the code easy to optimize, the compiler can do much better than most humans most of the time; of course a domain expert with all the knowledge the compiler would have, and willing to expend a huge amount of time, can beat the compiler. It just seems that beating the compiler is rarely cost (time, money, people, etc.) efficient.
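For instance (a toy sketch of mine, not from the thread): `restrict` is one of the few ways in C to hand the compiler a no-aliasing guarantee, and it can be the difference between a loop vectorizing cleanly, vectorizing behind a runtime overlap check, or not at all.

    /* Without restrict the compiler must allow for dst and src
     * overlapping; with it, the loop is a straightforward
     * vectorization candidate. */
    void scale_add(float *restrict dst, const float *restrict src,
                   float k, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] += k * src[i];
    }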
Making C compilers for different architectures output great code from the same source is really hard. e.g. "const" is not used by optimizers because it can be cast away. Interpreters, compression routines, etc. can always be sped up using assembly.
If what your program does can be sped up using vector registers/instructions (e.g. DSP, image and video processing) then you want to do that because x4 and x8 speedups are common. Current autovectorisers are not very good. If it is not the most trivial example like "sum of contiguous array of floats", you'll want to write SIMD assembly or intrinsics or use something like Halide. In practice projects end up using nasm/yasm or creating a fancy macro assembler in a high level language.
The choice to use assembly is economics, and it's all a matter of degree. How much performance is left on the table by the compiler? How many C lines of code take up 50% of the cpu time in your program? How rare is the person who is able to write fast assembly/SIMD code? How long does it take to write correct and fast assembly/SIMD code for only the hot function on 4 different platforms (e.g. in-order ARM, Apple A10, AMD Jaguar, Haswell)?
If you think "25%, 100k LoC, very rare, man-years" then you conclude it's not worth it. If you think "x8, 20 lines, only as rare as any other good senior engineer, 50 hours" then you conclude it's stupid to not do the inner loop in assembly.
What are the numbers in practice? I don't know. In practice, all the products that have won in their market and can be sped up using SIMD have hand coded assembly or use something like Halide and none of them think the compiler is good enough.
> Making C compilers for different architectures output great code from the same source is really hard. e.g. "const" is not used by optimizers because it can be cast away.
Check out the cppcon 2016 presentation by Jason Turner and watch how eagerly the compiler optimizes away code when const is applied to values. Cool presentation too, and it uses Godbolt's tool:
https://www.youtube.com/watch?v=zBkNBP00wJE
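For what it's worth, I think the two claims are about different consts. Roughly (toy example of mine, not from the talk):

    /* const through a pointer: the pointee may be a non-const object
     * modified elsewhere (the cast-away case), so the qualifier alone
     * tells the optimizer little. */
    int via_pointer(const int *p) { return *p * 2; }

    /* const on the object itself: modifying it is undefined behaviour,
     * so the compiler is free to fold it. */
    int local_const(void)
    {
        const int x = 42;
        return x * 2;   /* typically just "return 84" */
    }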
If it's not at least able to match handcrafted assembly using intrinsics, you should file bugs against LLVM. There is no theoretical reason why compilers shouldn't be able to match or beat humans here: these problems are extremely well studied.
Sometimes consistency is desirable, as well as performance. Compilers are heuristic. They evolve and get better, but they can mess up, and it's not always a fun time to find out why the compiler suddenly made something performance-sensitive worse, intrinsics or not -- from things like a compiler upgrade, or the inlining heuristic changing because of some slight code change, or because it's Friday the 13th (especially when it's something horridly annoying like a solid 2-3% worse -- at least with 50% worse I can probably figure out where everything went horribly wrong without spending a whole afternoon on it). This is a point that's more general than intrinsics, but I think it's worth mentioning.
Sure, I can file bug reports in those cases, and I would attempt to if possible -- but it also doesn't meaningfully help any users who suddenly experience the problem. At some point I'd rather just write the core bit a few times and future-proof myself (and this has certainly happened for me a non-zero number of times -- but not many more than zero :)
"using intrinsics" is a cop out: you are essentially doing the more complicated part of translating that sequence of generic C code into a rough approximation of a sequence of machine instructions and leave the compiler to do the boring and simpler parts, like register allocation, code layout and ordering of independent instructions.
Compilers are smart at some things and not so smart at others. I can beat the compiler in tight inner loops almost every time, but it will also do insanely clever things that I'd never think of!
Slides without the talk are not my favorite; do you have a link to the talk?
The second paper is so biased it hurts. It hardly attempts to hide this bias; on the second page it starts referring to one group of people as "clueless" and never justifies it by describing what being clued in would look like.
The second paper also has a strong assumption that compilers should somehow maintain their current handling of undefined behavior going forward. It is almost as though the paper's author thinks a compiler can somehow divine what the programmer wants without referring to some pre-agreed-upon document, such as the standard for the language.
The second paper also talks only about performance and not about any other real world concern, like maintainability, reliability or portability.
The paper sets up straw men when it trots out code with bugs (that loop on page 4) and then complains when a pre-release version of the compiler does something unexpected. Of course non-conforming code breaks when compiled. Of course pre-release compilers are buggy.
The paper's author wants code to work the same on all systems even when the code conveys unclear semantics. That is unreasonable.
To give the paper's author some credit, that no-op is part of the SPEC benchmark suite, and the author feels that code in that benchmark is being treated as privileged by compiler authors.
Even though I disagree with the author I try to understand some of his perspective.
There's a gap between "humans can't write assembly better than the compilers" and "there's nothing humans can do to help the compiler write better code".
Depends. You won't beat LLVM if your code uses strictly intrinsics. Some things, like adding carry bits across 64-bit arrays, might need to be done by hand because of special knowledge about your data that is not generalizable.
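FWIW the carry-chain case can sometimes be written with intrinsics rather than raw asm; a sketch (assuming x86-64 and a compiler shipping <immintrin.h>, e.g. gcc/clang/MSVC):

    #include <immintrin.h>
    #include <stddef.h>

    /* a += b over n 64-bit limbs, propagating the carry across the
     * whole array; returns the final carry-out. Compilers have
     * historically had trouble recognizing this pattern from plain C,
     * which is one reason bignum libraries hand-write it. */
    unsigned char add_limbs(unsigned long long *a,
                            const unsigned long long *b, size_t n)
    {
        unsigned char carry = 0;
        for (size_t i = 0; i < n; i++)
            carry = _addcarry_u64(carry, a[i], b[i], &a[i]);
        return carry;
    }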