Yes. If I were to write IR for a real project, I'd probably keep a few pretty large functions to avoid overhead caused by calling conventions.
Very interesting link. Thank you! Though I find the arguments unconvincing. More data needed...
> typed assembly
Yes. But now _every_ mac user and plenty of other people have a typed assembly compiler on their machines! IR is reasonably popular and quite well documented. I'm not saying it's new, I'm saying it's cool :)
It's not just calling conventions. LLVM IR also bakes in various assumptions about the target platform, from endianness to structure alignment to miscellaneous facts like whether char is signed or unsigned.
For many of those there is no going back to the original representation; the lowering is one-way.
If you have IR that you can compile to various archs and have it work there, that is a lucky thing in that particular case. But it is not what LLVM IR was designed for nor should that work in general.
Say you have a struct type X with properties int, double, int. The offset of the last property depends on how alignment works on the target platform - it could be 12 or 16 on some common ones. LLVM IR can contain a read from offset 12, hardcoded. Whereas the C code contains X.propertyThree, which is more portable.
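For what it's worth, a minimal C sketch of that difference (hypothetical struct X; offsets assume typical System V ABIs):

    #include <stddef.h>
    #include <stdio.h>

    struct X { int a; double b; int c; };   /* int, double, int */

    int main(void) {
        /* Prints 12 on typical 32-bit x86 (double aligned to 4 inside structs)
           and 16 on x86-64 (double aligned to 8) - the kind of value a frontend
           can end up baking into the IR it emits. */
        printf("%zu\n", offsetof(struct X, c));
        return 0;
    }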
But that's not how LLVM works, at least when I worked with it a couple years ago. You would define a struct type in terms of primitive types (int64, ptr, etc), and then use getelementptr with the offset of the field path you wanted. Yes, it's a numeric offset, but it's a field offset within the struct, not a byte offset. LLVM handles packing, alignment, and pointer size issues for you automatically.
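Roughly what that looks like through the (pre-opaque-pointer) C API - builder and ptr here are assumed to already exist, with ptr pointing at a value of the struct type:

    /* Declare the struct in terms of primitive types... */
    LLVMTypeRef fields[] = { LLVMInt32Type(), LLVMDoubleType(), LLVMInt32Type() };
    LLVMTypeRef x_type = LLVMStructType(fields, 3, /* packed = */ 0);

    /* ...and access the third member by field index 2, not by byte offset;
       the backend picks 12 or 16 (or whatever) at codegen time. */
    LLVMValueRef third = LLVMBuildStructGEP(builder, ptr, 2, "propertyThree");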
Yes, you can define structs and use getelementptr to access values. But, frontends can also bake in offsets calculated from getelementptr. They can also bake in sizeofs of structs, for example. And LLVM optimizations end up baking in hardcoded values in more subtle ways too.
Your primitive types aren't LLVM's though, are they? I mean, I haven't looked at LLVM thoroughly (just enough to be familiar with it, a friend is writing a language he wanted some input on), but I would be surprised and disappointed if they had a C "int" type as opposed to "signed 32-bit integer" or whatever. At which point it's compatible with whatever else is throwing around a signed 32-bit integer.
But that is exactly the point - that LLVM IR is not platform independent.
The frontend must choose which specific integer type C's "int" maps to. At that point, the IR is no longer machine independent - if you pick 32-bit signed ints to represent C "int", your program will not match the C ABI on any platform using 16-bit ints as C "int", and you won't be able to directly make calls to libraries on that platform, for example.
This misses the point. The point is that if you pass a C program that uses "int" through a C-compiler that spits out LLVM IR, the resulting LLVM IR is not portable.
You might not be able to change the C program - it might be using "int" because the libraries it needs to interface to uses "int" in their signature (and handles it appropriately) on the platforms you care about.
Ah, I think I see....you mean I could write non-portable IR code by doing that, although LLVM would never produce code like that? I guess there must always be IR that the frontend will never produce then?
No, the implication is that the LLVM IR that the frontend produces changes depending on the ultimate target that the LLVM IR will be compiled to. In other words, the frontends aren't backend-agnostic.
That makes no sense at all. You've just said that every LLVM frontend has to produce code that every LLVM frontend won't produce! If you mean something different, then could you please be clearer, as I really don't understand what you're talking about.
[EDIT: caf's post made it clearer. I know what you meant now]
Obviously a trivial example, but it's illustrative: the front-end compiler knows a bunch of things about your target and bakes that information into the IL. If you took IL generated by the compiler with "-arch i386" and then compiled the IL using "-arch x86_64" it's quite possible to get a non-working executable.
It's possible to carefully write IL that will work on multiple platforms as done in this blog post, but I'm not sure how useful that really is. You're still giving up the exact control that assembly gives you, so I don't know how much better you'd get than clang-produced IR. In other words, if you want "portable assembly language" use C.
Still, it's an interesting blog post. It's good to show people how the compiler works behind the curtain.
This is interesting and pretty awesome. I can imagine having a lot of fun making a toy compiler that emits LLVM IR directly.
Practically speaking, is there a good reason to do this in non-toy software rather than the more-or-less standard practice of using C as an IR? That gives you, in addition to compilation via LLVM, the ability (in principle) to use gcc, icc, et al. (In practice, in many cases people end up using compiler-specific extensions in the generated C, which does impose limitations on portability to other compilers, but that's certainly avoidable.)
I suppose if you are dynamically generating things that need to be compiled as quickly as possible, eliminating an unnecessary compilation step could be a significant advantage. It doesn't seem like such a constraint would apply in most cases, though.
> I can imagine having a lot of fun making a toy compiler that emits LLVM IR directly. Practically speaking, is there a good reason to do this in non-toy software rather than the more-or-less standard practice of using C as an IR?
Well, what you're describing is the whole living purpose of LLVM as a compiler framework. In other words, clang, LDC, GHC, and all other LLVM front-end compilers are non-toy software that transform their respective language into IR and hand it to libLLVM* for optimization/codegen/etc. Targeting C puts a relatively large layer, and potentially a good deal of ambiguity, between your compiler and the machine code; it seems more fitting for experimental or "toy" software to use that in lieu of a real compiler or LLVM frontend.
C as a target language locks you into a lot of things, like the C calling convention. Might you prefer being able to do tail call optimization without trampolines to C's varargs?
It's also an issue WRT precise garbage collection, although that's currently about the same with LLVM, which discards the distinction between pointers and ints. But in theory you or someone might enhance LLVM someday.
The flip side of this is that using C gives you the C calling convention for free. LLVM IR doesn't abstract away all the details of the C calling convention, and requires IR producers to produce platform-specific code for it.
I have actually written code to produce LLVM IR to conform to a platform's native C ABI. It's quite a bit more work than "just a bit of register/stack allocation". For one thing, LLVM IR doesn't expose the concept of a stack pointer, so you can't just put things on the stack where they need to go. You need to put things in LLVM's abstractions in places where you happen to know the codegen will lower them to where they need to go. And it's different for every platform.
In the relative context of implementing a production runtime, that's "just a bit of register/stack allocation".
If you're hobby-hacking a project, then you may as well just target something that already exists, because micro-inefficiencies that matter on the scale of a deployed production platform probably don't matter for your human-resource-constrained hobby project.
If you're designing the next browser language runtime, or a language meant to be used outside of hobby deployments, there's no point in permanently hamstringing your implementation architecture to save a small amount of implementation time within what is already a large and complex problem space.
I am not sure what you mean by "standard practice of using C as an IR". I was a compiler engineer for a decade or so and I have only seen C used as an IR for quick and dirty hack projects which don't care about things like debugging or compile time. The commercial compilers I have worked on each used their own intermediate representation; none used C.
Apparently they include an interpreter as well which probably could be used for debugging. And hopefully interpreter behaviour matches compiled behaviour, then ;-)
I can imagine having a lot of fun making a toy compiler that emits LLVM IR directly.
Well, I'm still plugging away at the Deca compiler that outputs to LLVM IR directly. It's loads of fun!
Practically speaking, is there a good reason to do this in non-toy software rather than the more-or-less standard practice of using C as an IR?
I can tell you why I use it:
* Even through the C interface, I have an actual library API that outputs the IR code directly rather than having to transform my syntax trees into C syntax trees and then output through a C parser library.
* You recursively walk your syntax tree passing an LLVMInstructionBuilder* around to different instruction functions in the API. The result is that your code is a recursive tree-walk (rather than an ordered tree-walk), and the builder object serializes it all for you (including using one instruction's output as input to another) according to the natural order of evaluation in your higher-level language.
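A minimal sketch of that shape using the C API (the Node type and emit function are hypothetical; only the LLVM* calls are real):

    #include <stddef.h>
    #include <llvm-c/Core.h>

    typedef struct Node {
        enum { N_CONST, N_ADD } kind;
        int value;                /* N_CONST */
        struct Node *lhs, *rhs;   /* N_ADD   */
    } Node;

    /* Recursive tree-walk: each child's result feeds its parent's instruction,
       and the builder serializes everything in evaluation order. */
    static LLVMValueRef emit(LLVMBuilderRef b, Node *n) {
        switch (n->kind) {
        case N_CONST:
            return LLVMConstInt(LLVMInt32Type(), (unsigned long long)n->value, 0);
        case N_ADD:
            return LLVMBuildAdd(b, emit(b, n->lhs), emit(b, n->rhs), "sum");
        }
        return NULL;
    }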
Lennart Augustsson gave a talk [0] at last year's Haskell Exchange showing how to embed a DSL in Haskell and then use Haskell's LLVM bindings to generate LLVM IR for it. If you need a blazingly fast DSL, give it a try!
IR is a superset of C without the massive overhead; you really need a good portability reason to use C instead. And if LLVM doesn't support it, chances are you can't write very portable C, either.
C is a heavyweight intermediary language which brings with it a /lot/ of baggage. There's no valid direct technical reason other than financial resource constraints to justify compiling to C (... or JS, for that matter) instead of a more suitable and expressive IR/bytecode.
It's impossible to implement a variety of useful abstractions in portable C -- for example, function trampolines that do not permute register state and do not appear in the callstack after execution.
> C is a heavyweight intermediary language which brings with it a /lot/ of baggage. There's no valid direct technical reason other than financial resource constraints to justify compiling to C (... or JS, for that matter) instead of a more suitable and expressive IR/bytecode.
1. In theory. But what is that "more suitable and expressive IR/bytecode"? I would argue such a bytecode should be portable, but LLVM IR is not that.
2. There are lots of reasons to target C. C can be compiled by many compilers, while LLVM IR can only be compiled by LLVM to things LLVM can compile to. For example, game consoles, various new embedded platforms, etc. - they all have C compilers, but LLVM might not support them (and possibly cannot support them).
All of your arguments about going through C would go away if you had enough money to build your own ideal toolchain, including the development of your own IR for your language.
> 2. There are lots of reasons to target C. C can be compiled by many compilers, while LLVM IR can only be compiled by LLVM to things LLVM can compile to. For example, game consoles, various new embedded platforms, etc. - they all have C compilers, but LLVM might not support them (and possibly cannot support them).
Is the point to produce something optimal for users, or produce something optimal for developers? Most successful game and application platforms trend towards the former, whereas your work on web technologies trends towards the latter.
No, it is not. But what does it have to do with anything discussed here? I can't see IR being something which locks users or developers to a particular vendor.
LLVM IR is less portable than C. That's a valid direct technical reason. Generating C (or another high level language) also makes integration with tools in that language potentially trivial (depending on the semantics of the language you're compiling), and depending on your preference you may find it far easier to debug the output.
As for portability, while you're right that a number of things can't be implemented in portable C, you gain the benefit that C has been ported to far more systems than LLVM - there are C compilers even for the Commodore 64.
I agree with you that C is heavyweight, but LLVM is too. Personally I prefer to build code generators directly - the LLVM infrastructure may be great, but I much prefer to write compilers that are self-contained and bootstrapped in the language they are intended to compile. Then again I might just be difficult.
I have to disagree with that. There are lots of languages whose purpose is to produce JS. For example, it would be very silly for CoffeeScript to use LLVM IR. It would end up reïmplementing all of the things that it shares with JS (that is, most of JS) with no conceivable benefit. Maybe if/when PNaCl is everywhere, it would make sense for it to do that, but even then it's a hard sell as long as four giant companies are competing their asses off to optimize JS.
But I do agree with you if the language doesn't correspond well with JS. In that case, it's probably better to target LLVM and then compile to JS via emscripten.
There's no valid direct technical reason other than financial resource constraints to justify compiling to C (... or JS, for that matter) instead of a more suitable and expressive IR/bytecode.
How does one do call/cc in LLVM, or delimited continuations, for that matter?
You can't do that in portable C as-is. Watch what happens when you try to restore thread state on a different system thread using an OS that intrinsically ties system threads to user threads to thread-local variables to the stack.
And no, getcontext() and setcontext() aren't portable.
Why would I care if it's portable or not? I've never understood the mentality of "but it's not portable". If something isn't portable, you put #+ and #- on top of it. I'd rather go for "straight-forward and performant" than for some abstract code transformations to make it runnable on platforms that almost nobody uses, to the detriment of everyone else.
LLVM is the lowest common denominator. I'm really not fond of lowest common denominators. (Not to mention the fact that it's bloatware in the first place - it's about twice the size of my favourite compiler, and that's before you add the actual language you're trying to compile.)
Because you literally cannot implement it safely or correctly from the C level on Mac OS X, and any other platform with similar thread/stack/thread-state constraints.
Unless that's a platform you think "nobody uses" ...
The only indirect technical justification is in maintaining compatibility with existing browsers, but that falls over pretty fast when you're a browser maker (Mozilla) and refusing to work with another browser maker (Google) on jumping over that compatibility hurdle.
Replacing JS may not be technically difficult, but practically it's extremely difficult (IE6, anyone?). Plus JS has got so fast, it's becoming hard to see the point anyway.
Getting your bytecode working and shipping in all browsers has proven to be impossible; there's a reason people compile to JavaScript - it's the only practical choice.
Even if they did, that's not enough, you have to get it into all popular browsers, not just two. JavaScript works everywhere, you have to match that to have any hope of replacing it.
Ah, so it will lose because it will lose. So we can't implement it because we won't implement it.
That's ridiculous. Mozilla is willing to stick their neck out there implementing a mobile phone operating system, but trying to deploy a sane runtime environment to replace JavaScript is just too risky?
It's reality because Mozilla makes it a reality. See how this is a circular argument?
Meanwhile, native mobile and desktop keep growing, and growing, and the web as an application platform loses out, because none of the native platform vendors are quibbling about whether they can convince themselves to ever do something different than what they were doing before.
For my company, this is exactly what's happening. Like Facebook, we found HTML5 to be insufficient for our app, so our resources have shifted to iOS and Android development. Native gives much better and consistent performance and UX, and the idea of write once-run anywhere on the web is still a pipedream. Cross-browser compatibility is still terrible.
The problem with HTML5 is that it was designed by committee for displaying documents, not dynamic apps. While JS has come a long way toward closing the gap with native performance, it's the other HTML5 tech that's holding the web back. CSS is ill-suited to hardware acceleration, and the DOM is a performance-sucking hack that kills the UX on mobile platforms.
That didn't stop them from trying to launch a mobile phone platform in competition with Apple, Microsoft, and Google.
If they're willing to accept that risk, by comparison, how large of a risk is trying to push through a better standardized browser execution environment, with Google's cooperation?
God forbid they succeed, and we finally have a competitive application platform.
I see; you're the type who thinks he's smarter than everyone in the industry and just has all the answers and ignores anything that doesn't fit your idea. Never mind that no one has been able to replace JS, just listen to you and all those problems will evaporate. Goodbye.
> I see; you're the type who thinks he's smarter than everyone in the industry ...
You mean, like Google, who continually pushes to do exactly what I've described here, only to be stymied by:
- Apple, who has no reason to support the web as competitive to their native platform.
- Microsoft, same.
- Mozilla, who refuses to consider that there might be a world beyond HTML/CSS/JS because they believe those specific technologies are intrinsic qualities of the web, and thus central to their mission of supporting the web.
Looks like the only people I disagree with are the Mozilla camp. Microsoft and Apple have different priorities, and Google is continually frustrated by exactly what I've described here.
Every major browser vendor has pushed for some stuff and resisted other stuff. Only extremely rarely does anything new make it to 'cross browser compatible' status.
There are far far worse systems for evolving widely used platforms.
> There are far far worse systems for evolving widely used platforms.
That's fine, but perhaps it would behoove Mozilla to not participate in dooming the web as a competitive application platform simply due to a misguided belief that the web is defined by HTML/CSS/JS?
Yes, the reality is that proprietary application platforms are taking over the application market, and that the web is slowly losing one of its major market advantages: a huge brain trust of web-only engineers and web-only engineering organizations.
Neither NaCL, PNaCL, nor asm.js are 3rd-party browser plugins.
I agree that it's time to move on. It's time to treat the web as a real application platform, instead of as document model with a JavaScript scripting interface.
NaCL, PNaCL aren't plugins because they're features of one specific browser. I don't know of any 3rd party developers using them.
I think asm.js is brilliant and I hope it does turn out to be the way forward. But it's also a poster child for maintaining compatibility with existing browsers through standard Javascript.
Direct: Constraints/requirements directly related to implementing a maximally efficient runtime architecture.
Indirect: Market constraints that drive technical limitations.
When it comes to browsers, Mozilla both creates and exists within market constraints that drive indirect technical reasoning.
Lashing your industry to JS for another 10 years as a bytecode target is clearly not an optimal long-term solution as compared to the current state of the art, but it might make sense as a legacy support mechanism given the indirect technical constraints.
I'm just not convinced that something like CoffeeScript would benefit technically from compilation to a different target than JS. When the purpose of a language is simply to have all the features of another language, but with nicer syntax, what is to be gained from rebuilding all of those features from scratch?
The one quasi-counterexample I can think of in that subset of languages is Objective-C which essentially just adds sugar on top of C and can be transformed relatively simply to C (there are library functions you can use to build classes and objects and call methods from scratch, using no actual Objective-C syntax). My understanding is that Objective-C compilers don't go through C as an intermediate, but this is merely to reduce compiling time, not to improve efficiency at runtime. If Objective-C did target C instead, the end result would be identical.
So in other words, you are defining away most of what drives technology decisions.
That makes it totally uninteresting to try to meet your constraint.
We pretty much never try to implement a maximally efficient runtime architecture, cost or resources be damned.
By this argument, there's no direct technical reason why we should not spend 20 years hand writing machine code and proving the code sequence optimal for each target architecture.
Those examples are all very different. What is your goal - speed? Code size? Portability? Security? Those examples don't all achieve those to the same degree, and I would argue that modern JS VMs are competitive with them in many respects.
> What is your goal - speed? Code size? Portability? Security?
That depends on the target. I selected them because they represent different aspects of the state of the art.
> ... and I would argue that modern JS VMs are competitive with them in many respects
But not all, and in many places, not even close. JS VMs make trade-offs that we don't need if we abandon JS, and on top of it all you have Mozilla insisting against shared-state multi-threading supported and exposed by the VM (which discards enormously important optimization opportunities) -- and that's just the tip of the iceberg.
Get rid of JS, get rid of Mozilla's self-enforced constraints on the browser, and look very seriously at something like NaCL which blends native execution performance (including being able to drop to SIMD instructions) with security sandboxing.
Let language authors target the low-level machine, and target your high-level general purpose bytecode.
Once you've built a runtime in which I could implement a JS runtime that's as performant as yours, we will no longer be operating in a two-tier universe in which browser makers think that JS is good enough for everyone but themselves.
JS VMs make tradeoffs, yes. Regarding your specific examples:
1. Shared-state multithreading. This has been proposed by various people in the past, and is still being debated. There are people for it and against it in various places. It might happen, if there is consensus to standardize it.
2. SIMD: PNaCl (which I mention since you mention NaCl in that context) does not have SIMD. But of course it can add SIMD, just like Mono and Dart have, and there is a proposal for JS as well, hopefully that will get standardized.
So even those limitations are in principle resolvable. They do depend on standardization, of course, but so would any VM you want to run on the web.
> Let language authors target the low-level machine, and target your high-level general purpose bytecode.
asm.js is meant to get performance similar to a low-level machine. It's already very close to native on many benchmarks.
> Once you've built a runtime in which I could implement a JS runtime that's as performant as yours
I don't follow that. You can't make a fast JS runtime in your previous examples (the JVM, PNaCl, .NET, etc.).
> I don't follow that. You can't make a fast JS runtime in your previous examples (the JVM, PNaCl, .NET, etc.).
I included ones you could, such as NaCL. And I should probably have also included the native platforms (Apple, MS, Google), because they're the competition, even if their sandboxing or runtime environment isn't what one might call 'state of the art'.
At the end of the day, it's the two-tier universe that bothers me the most. You're content to foist the JS runtime on everyone but yourselves. Once you implement Firefox entirely as an application running in a JS VM, the argument that JS is good enough might carry some weight.
The point is that NaCl is not portable, just like a native platform such as Windows.
Nonportable native platforms are there, you can write apps for them. They are even the dominant platforms on mobile.
But the web's entire purpose for existing is to be portable. So you do need something like JS or PNaCl or the JVM. And all of those, due to being portable and secure, impose limitations, such as limiting the native operations that you perform. That's unavoidable. But again, if you don't like that, develop directly for a native platform.
> The point is that NaCl is not portable, just like a native platform such as Windows.
So provide fat binaries. Maximize performance on ARM and i386, with fallback to PNaCL for future platforms that are neither.
> They are even the dominant platforms on mobile.
For a good reason. Adopting their strengths while eschewing their weaknesses (proprietary and single-vendor) would benefit the entire industry greatly.
Fat binaries are limited, and people will simply provide the parts for the platforms they currently care about, without care for future platforms.
That's fine for most things, but web content is something that we do want to always be accessible.
It is hard to adopt the strengths of native execution using all the lowest-level tweaks specific to one platform, because that inherently limits portability by definition (and often also security).
I share your goals, but don't think there is an obvious better compromise than the one we are all already making on the web.
The teams at Mozilla and Google have done an amazing job of improving JavaScript performance. But, I also have to agree with PBS, that Mozilla has a knee-jerk reaction to many of Google’s promising technologies like WebP and PNaCl. Like many HTML5 technologies that are being shoehorned into areas they were not originally designed for, JS was never meant to be a bytecode and likely will never be able to achieve the performance possible with PNaCl.
If W3C and WHATWG were competent and delivered a viable web application platform, we wouldn't be hiring Android and iOS developers now. Mozilla needs to be more open to new technologies like PNaCl that enable the web to be competitive.
I don't think "knee-jerk" is a fair description. Consider the collaboration between Google and Mozilla on WebM, WebRTC, and countless others.
> JS was never meant to be a bytecode and likely will never be able to achieve the performance possible with PNaCl.
LLVM IR was also never meant to be a bytecode. But in both the case of JS and LLVM IR, the question is the final result, not the original intention, even if the two are often connected.
> Mozilla needs to be more open to new technologies like PNaCl that enable the web to be competitive.
PNaCl is not even shipped, so it is too early to evaluate it. But the last benchmarks I saw for compilation speed and performance were mixed.
(NaCl is much more proven, but NaCl is not portable.)
All you need to do is perform NaCL validation of pages before marking them executable; this is what NaCL already does.
You can either do this through a high-level "JIT API" that generates safe machine code from a validated IR (aka PNaCL), or through a fancier version of mprotect() that validates the actual machine code (aka NaCL).
In a wondrous hypothetical future where processors support a NaCL restricted operating mode, you wouldn't even need to validate; just set the processor state to thumb^Wnacl mode, and define a syscall instruction that switches to direct execution (without actually requiring a context switch to the kernel, just flip an execution flag).
This is why NaCL is so damn interesting, and JS/asm.js is not. NaCL has the possibility of turning our design of restricted execution environments on its head, in a very good (for performance) way.
You can already do that with NaCl - or even asm.js with eval, I bet it wouldn't be that much slower. But current JITs apparently gain a significant amount of performance with inline caching, which expects code to be able to be rewritten very quickly.
Edit to respond to your edit: although it would be cool to be able to have a "sandboxed mode" that can somehow be switched out of more cheaply than an interrupt, the whole thing seems like a massive hack to me. After all, NaCl does not take advantage of its pseudo-ability to do so: NaCl code runs in its own process and already incurs a context switch whenever it communicates with the browser, so there is no inherent hardware reason NaCl couldn't just run directly under the kernel and have the kernel provide the same level of sandboxing as the NaCl runtime currently does. It's just an issue of getting such an approach to work portably with existing kernels... hardware support might be able to make it easier to get that to work, but it's probably unnecessary.
> In a wondrous hypothetical future where processors support a NaCL restricted operating mode
If we speculate wildly, why not an "asm.js restricted operating mode"? Not saying that's a good idea, but I'm not sure why a NaCl one would be either. Both PNaCl and asm.js should reach pretty much native speed anyhow.
> If we speculate wildly, why not an "asm.js restricted operating mode"?
Because we already had Jazelle, and it sucked, and we learned our lesson. A Jazelle that operates on ASCII-encoded bytecode with traps for unsupported JS constructs? No thanks. :)
> Not saying that's a good idea, but I'm not sure why a NaCl one would be either. Both PNaCl and asm.js should reach pretty much native speed anyhow.
Pretty much native and actually native are very different things.
I've had to sit down and hand-optimize critical paths that would not have been viable otherwise, and would have meant discarding a feature, or introducing a significant impact on usability -- and that's on platforms where similar investments in runtime library optimization were already made by the platform vendor, too.
If we're going to throw away performance, it has to be for a good reason. As someone who doesn't spend all day on platform development (as interesting as it might be), my focus is in providing the best possible user experience.
I'd love to do that on the web, but the web needs to quit throwing away solid technology decisions because of the bad technology decisions made in the early 90s, when none of us knew what the hell we were doing.
> Btw, the L is not capitalized in NaCl.
Whoops. Thanks. Now I will look less stupid in front of software engineers AND my chemistry buddies. :)
> If we're going to throw away performance, it has to be for a good reason
I agree.
We are losing performance in return for: portability, security and standardization. The web runs everywhere, has a good outlook for continuing to do so (no fat binaries tied to current archs), and anyone can build a new web browser based on standards.
None of the other options proposed give us portability, security and standardization right now. Perhaps with more work they might, and perhaps JS VMs will get closer to their speed as well. It's good to try to improve things from both sides, and people are doing so.
I don't think anyone is overlooking some obvious better solution - that fits the requirements - that is before us. PNaCl (NaCl is not portable, so not relevant here) is interesting, just like the JVM and CLR, but they have major hurdles to pass if they want to be standardized, and that effort has not even begun in the case of PNaCl.
I'm intrigued by this idea, but it seems tricky to really use portably if your goal for writing asm is to carefully manage register usage and SIMD instructions. Sure, the syntax is portable, but to get the "right" result, i.e. one that maps directly onto the fast machine code you had in mind, you need to know something about the target architecture, e.g. how wide its SIMD instructions are, and how many registers it has. And if you want it to be fast on two architectures that differ considerably, you may have to write two versions of your LLVM IR, at which point you're losing some of the benefit of an IR. You do still get the advantage that it will at least compile to something on platforms you hadn't previously heard of, but most projects will just use a straight-C fallback function for those platforms.
Managing register allocation by hand is unlikely to provide significant speedups, considering that modern register allocators are "pretty good." However, if you are wanting to optimize for SIMD, this can be easily done with vector instructions in IR, without needing to worry about register allocation at all.
At least as of a few years ago, the x264 team claimed they didn't use C intrinsics for their inline asm because automatic register allocation wasn't very good: http://x264dev.multimedia.cx/archives/191#comment-1763 (exchange between comments 3 and 4).
Sure, you can write SIMD in LLVM IR, but to do it efficiently you need some knowledge of the target. Should you use i32, or rework your code to use more i16, or throw in some parallel i64 multiplications? Those aren't really portable choices, unless you have a really good auto-vectorizer, in which case the advantage of writing in asm at all disappears, since you can just rely on auto-vectorized C.
LuaJIT has a hand-written assembly language interpreter, which is one of the fastest dynamic language interpreters around. According to its author, register allocation is a significant part of why the hand-written assembly is faster: http://lua-users.org/lists/lua-l/2011-02/msg00742.html
That issue is pretty rare—a tight 'loop' where you don't know where you're coming from (aka the previous instruction) or where the next 'loop' iteration goes (aka the next instruction). Because this code block is bounded by indirect jumps, it's virtually impossible to register allocate without agreeing on a mini ABI for this scenario. Not surprisingly, this is unusual enough that it's inexpressible in C. However, there are things that provide a middle ground, like computed GOTOs, which both the OCaml and Python interpreters now use for bytecode evaluation.
Of course, I suspect that if you pass LuaJIT an unusual program (say, an unusual distribution of instructions), it would actually perform worse. Register allocation is NP-complete, so ultimately with modern programs it's dependent on really good heuristics, and C wasn't written to be a bytecode evaluator.
Register allocators use heuristics that are “pretty good” for most usage, but sometimes fail spectacularly in edge cases with very high pressure. Register allocation is, after all, NP-complete. Even when it doesn't fail completely, if you’re tuning a load-store-bound computation, a few extra register spill-fill pairs can easily halve the throughput that you achieve.
That said, for most programmers, most of the time: use intrinsics and let the compiler handle register allocation.
Don't be pedantic. Splitting the problem into parts and saying that the one part which happens to have the name that other people use to refer to the whole problem is easy doesn't make the whole problem easy.
What?
"Register allocators use heuristics that are “pretty good” for most usage, but sometimes fail spectacularly in edge cases with very high pressure. Register allocation is, after all, NP-complete. "
The specific problem he referred to was the heuristics failing spectacularly in high pressure.
In SSA form, they have no need to use heuristics, and if they don't want to use them, they don't. They can choose registers optimally.
This will not fail.
Yes, there are other parts to register allocation. The most common NP complete part referred to, and referred to by the parent (AFAICT), was register choosing. This is not NP complete to do optimally in compilers using SSA form. Period.
There may be other parts to the register allocation pass that use heuristics, and fall down, but most of those are not NP complete on SSA either.
SSA-based register allocation does copy-coalescing afterwards [0] or integrated [1]. This is the NP-hard part. Of course, there are fast heuristics [2] which are "pretty good".
I'm not super familiar with all the recent literature on SSA, but it always seemed to me that while you could optimally allocate registers for the SSA representation, the SSA representation often contains dead Φs, and fully eliminating the Φs may not be doable in polynomial time.
There are other algorithms that may coalesce more phis, and can be done in linear time, but do not have such a guarantee.
If your concern is dead phis, all dead phis (where dead is defined as unused, or only used by dead instructions) can be eliminated in linear time by performing a single linear time backwards-DCE pass on the graph.
This will get all phis and instructions that are dead, or only used by instructions that are themselves dead.
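If it helps, a sketch of that backwards pass (the Instr representation here is hypothetical, but the shape is the standard one):

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct Instr {
        struct Instr **operands;   /* the defs this instruction uses */
        int num_operands;
        bool is_root;              /* stores, calls, returns, branches, ... */
        bool live;
    } Instr;

    /* Mark everything reachable backwards from the roots along use->def edges. */
    static void mark_live(Instr *i) {
        if (i == NULL || i->live) return;
        i->live = true;
        for (int k = 0; k < i->num_operands; k++)
            mark_live(i->operands[k]);
    }

    /* O(instructions + operand edges): anything left unmarked - including phis
       used only by other dead instructions - is dead and gets deleted. */
    void dce(Instr **all, int n) {
        for (int k = 0; k < n; k++)
            if (all[k]->is_root)
                mark_live(all[k]);
        /* ...then sweep: delete every instruction with live == false. */
    }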
If your concern is useless phis and instructions (for lack of a better term - there are papers, and they use the word "dead" differently, so I'm defining a term here for you), where you define useless as "side-effects do not matter", such that in

    *a = 5
    c = a
    *a = 5
    c = a

the second store and assignment are useless.
This type of elimination is normally O(N^3), with unproven bounds on SSA (AFAIK).
What exactly is "this type of elimination"? Within our compiler [0] your concrete example is optimized by an O(1) mechanism and then there is a more complex load-store-optimization which is O(n^2) I think.
Looking at your code, libfirm's DCE is the standard SSA backwards DCE algorithm, without control dependence computation, to eliminate dead phis. It will be O(n).
The load store optimization looks like a modification of stuff I helped Thomas VanDrunen with. It is O(N^2), but will miss loads/stores depending entirely on the strength of your memory dependence calculation.
In fact, it will miss a lot, particularly around loop dependent store/load values that are equivalent.
It will catch my particular example, but GCC's testsuite has load/store examples (see gcc/testsuite/gcc.dg/tree-ssa) that it should fail at without help from other phase ordering/etc.
Thanks for the clarification. You are right, our load-store-opt cannot handle loops.
Dead code elimination technically [0] is O(n). However, it should be called "copying garbage collection" instead.
LibFirm's SSA construction is minimal [1]; it does not create dead phis. Of course, phis might die due to optimizations, in which case they become unreachable and get garbage collected at some point (see above). However, Firm does not model memory (global variables) as SSA, only registers (local variables).
Minor correction: LibFirm also generates pruned SSA, which is the property that guarantees lack of dead phis. Minimal SSA roughly means "no superfluous live phis".
Let me know if you still do this once you integrate a polyhedral code generator and try to do incremental SSA renaming on the nested loop outputs :)
Note that dead phis are actually useful in some contexts, particularly around loop updates
This is why things like loop-closed ssa (http://gcc.gnu.org/onlinedocs/gccint/LCSSA.html) exist, and phi nodes are generated even if the variable is never initially used outside the loop (makes sinking easier)
Thanks for that info; I have more reading to do. The actual work I get paid to do isn't on an SSA compiler, so keeping up-to-date on SSA is more of a hobby.
You can do dead code elimination before register allocation. Dead code elimination is doable in polynomial time. Alternatively, you can use this relatively unknown construction algorithm [0] originally invented by Cliff Click.
True.
Let me expand a bit, I guess, to prevent confusion.
Your time bounds and what you get depends on what you mean by "dead code", and how you perform it.
The vast majority of DCE algorithms, in books, in compilers, etc, on SSA form, have the same properties:
1. The "bad" ones are forward DCE algorithms, and do not eliminate everything they could because they process in the wrong order.
2. The "good" ones, used in most compilers, are backwards DCE algorithms, and eliminate all dead scalar code in a straight line.
Both are O(n).
The standard backwards DCE algorithm has a number of deficiencies, that cannot be eliminated sanely in O(n) time.
1. These DCE algorithms do not consider memory affecting statements dead unless they can prove they are non-side effecting. Proving that is going to usually require at least some N^2 analysis somewhere.
This means they miss all dead scalar code that results from a useless memory reload (second order effects), and generally will not eliminate any dead stores. Even though this is code, and it's dead, it is not eliminated.
2. These DCE algorithms that are O(n) don't deal with control altering statements. Jumps to empty blocks, as well as code contained in unreachable blocks, will be considered non-dead unless you include control dependence and some form of branch evaluation. You can compute control dependence in O(n) time. In practice, the constant sucks :) It is also not possible to statically evaluate all branches in O(1) time, so even in the absence of memory statements you cannot guarantee you eliminate all dead code in O(n), only some of it.
3. These DCE algorithms do not handle partially dead code (that is, code that is dead if moved to a different place in the program, but does not otherwise affect semantics).
The canonical example is:
    y = a + b
    if (...) {
      y = c + d
    } else {
      ...
    }
The first statement is partially dead (dead along some branches), and performing assignment sinking will make it fully dead (if you sink a copy into the else branch, the original is now fully dead).
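That is, after sinking it looks something like:

    if (...) {
      y = c + d
    } else {
      ...
      y = a + b   /* the sunk copy; the original y = a + b is now fully dead */
    }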
Performing this type of elimination is N^3 or worse; I haven't seen better time bounds on SSA.
(It is not, AFAIK, completely subsumed by something like SSUPRE)
GCC (and I think LLVM now) actually performs the most trivial case of this, which catches a very large number of cases once other passes are run.
Well, actually, looking again, it looks like over the years it's grown.
I guess nobody's noticed it's basically equivalent to Cliff Click's GCM algorithm at this point, without the nice guarantees.
It looks like it could be trivially modified into the GCM algorithm now.
In my opinion the "good" DCE is fine. The deficiencies you list should be fixed elsewhere. It largely depends on the definition of dead, though.
1. I would never consider a store dead in the sense of DCE. That side-effect analysis should be handled elsewhere.
2. DCE should not alter control-flow.
3. Partial dead code elimination is not about dead code or eliminating anything. Click understood it correctly: it is about code motion.
Effectively, I argue to push the problems somewhere else. This means that DCE is a very cheap optimization and can be run multiple times during compilation. This somewhat helps with the phase ordering problem.
Partial dead code elimination (and partial redundant code elimination) is still an open research question. There are multiple overlapping approaches, but none subsumes everything.
Sure, these are all matters of opinion, though i'll point out GCC started out thinking the same way WRT to keeping things cheap and running them multiple times.
This eventually led to horrendous compile times (that LLVM was much better at) because passes refused to clean up after themselves at all. After all, cheap DCE/cheap GVN/whatever would just get run later, who cared. Those passes became necessary, and even when cheap, if you run DCE 5 times, it starts to add up. Same with not touching control flow during DCE. Leave it to the CFG simplifier, which now gets run 5 or 6 times.
These are all tradeoffs, and especially in a production compiler, you have to make these tradeoffs sometimes, and do more than is academic-wise ideal to do in a single pass.
This is why LLVM's GVN prunes dead instructions, etc
haberman beat me to it, but register allocation algorithms fall down in very branchy control flow like dispatch loops, basically for two reasons: (1) to make compile times reasonable they tend to have to approximate in this case; (2) compilers don't really know the hot paths in such code like a human would, and will likely make suboptimal spilling decisions as a result.
1. This is only true in JITs and similar things that strongly depend on linear-time register allocation. LLVM happens to now use a greedy allocator that still tries to prioritize time over allocation optimality.
2. This is just a bad compiler then. Good static profiling is actually quite good, and quite accurate, in most cases. This includes value and other forms of profiling, rather than just simple static edge estimation based on heuristics. Usually, the thing that gets hurt is not spill placement, but bad inlining decisions.
For diamond shaped switch statement interpreter loops with simple bodies, the real issue is that most greedy/linear allocators are not great at live range splitting.
Compilers like LLVM (and to some degree, GCC), move all the variable allocations up to the beginning of the function to make life easy by removing scopes (otherwise you have really really crazy edge cases performing hoisting/sinking optimizations), and then for those that don't get mem2reg'd, can't prove they aren't live all at the same time during the switch due to the loop.
Then they make bad choices about which of these variables should stay in registers because their value profiling infrastructures are non-existent.
Proper region analysis, allocation regions, and better optimistic live range splitting would go a long way towards fixing this, but it's not worth it. There is little to no sense in optimizing LLVM for the very uncommon case of interpreter loops (particularly when one of the goals of LLVM is to ... replace interpreters).
So the basic answer is: It's not really a problem anyone cares to solve, not "it's a really hard problem to solve".
Nevertheless, register allocation is still an NP-complete problem and even in AOT compilation everything above O(n^3) is too slow (unless you just compile kernels for embedded software).
Actually, it is not the register allocation itself which is NP-complete [0]. Avoiding spills and copies is the hard part.
The fact that some subproblems people want to solve are NP complete is mostly irrelevant, and has been for many many years. Nobody cares about optimal copy coalescing, only "good enough". There are studies about optimal vs non-optimal, and the answer is basically "you could eliminate 20% more copies". That was against the best optimistic coalescers of the time, which have already been superseded by PBQP and other approaches.
Also remember that people build architectures knowing what compilers can and can't do, and so processors do a lot at runtime.
As I said, the real issue is that nobody has chosen to optimize for the interpreter case, because it's uncommon and doing so does not help anything else, but comes at great cost.
Choosing the one edge case everyone has said "we don't care about" and saying "see, they suck at everything!" does not seem quite right to me.
For example, it is completely irrelevant to whether register allocation is a problem for tight SIMD intrinsic loops, for example.
At least in GCC (which is what x264 was likely talking about), the issues are GCC's architecture, and not some fundamental "register allocation is hard" issue.
remember that people build architectures knowing what compilers can and can't do, and so processors do a lot at runtime
It sounds like you may be referring to MOV elimination by register renaming, which should make the extra moves no more costly than NOP's. I read that Intel post-Ivy Bridge does this, but haven't been able to find any real documentation. Do you know if this is something one can now rely on, or what the limits of this are (number per cycle, size differences, latency)?
Answering myself: Yes, this is documented and can be depended on for Ivy Bridge onward.
    3.5.1.13 Zero-Latency MOV Instructions

    In processors based on Intel microarchitecture code named Ivy Bridge, a subset of
    register-to-register move operations are executed in the front end (similar to
    zero-idioms, see Section 3.5.1.8). This conserves scheduling/execution resources
    in the out-of-order engine. Most forms of register-to-register MOV instructions
    can benefit from zero-latency MOV. Example 3-23 lists the details of those forms
    that qualify and a small set that do not.
So, PRE and other code motion/placement optimizations rely on a CFG for computation of dataflow information either implicitly (SSA form) or explicitly (non-SSA form and some SSA algorithms).
I'm going to stick to non-SSA algorithms because they are actually easier to understand the issues for.
Once you've done this, particularly the section on bitvector problems, realize that most of the properties being computed make no sense earlier than original scope of the variable, and what to do is not always apparent.
Take PRE for example, here's a simple, but crappily devised example:
    int main(int argc, char **argv)
    {
      int b;
      for (int i = 0; i < 50; i++)
      {
        int a;
        a = argc;
        b = a;
      }
      return b;
    }
(It's crappy because any value numbering/copy prop/etc would take care of this particular example)
A standard PRE is going to want to hoist a = argc out of the loop, not b = a;
But, well, it can't, because it's out of scope!
How does it know what it should do?
Should it move int a out of the loop?
It may end up inserting a stack allocation into a hot loop if it just hoists it one scope up!
Should it create a temporary?
If you allow it to create new outside-scope temporaries anyway, what's the point of keeping the scopes?
So now you have to do scope understanding and movement anyway.
Plus, in most cases, you need to construct accurate live ranges of variables anyway, and scopes do not always properly represent those.
You also have to compute what variables may share registers on your own too.
Non-renaming forms of SSA (i.e. forms that don't literally rename the temporaries) are worse here, because you can accidentally overlap live ranges of variables without too much trouble.
In any case, the basic answer is "Scopes buy you very little, and hurt a lot, because every single optimizer has to care a lot about them"
Thanks for the reply. I was being pretty inaccurate in my terminology: I said "branchy" when I meant "diamond-shaped" and I said "spilling" when I should have said "splitting". That's what I get for posting before being fully awake :)
The problem with dispatch loops is not that they are "very branchy". The problem is that the usual less-code-is-better heuristic falls down. For an interpreter you want the dispatch code repeated after every op handler, so the branch predictor can do its work. Unfortunately, compilers are very good at removing this code duplication.
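For what it's worth, a tiny computed-goto sketch (GCC extension, made-up opcodes) of the "dispatch after every handler" shape:

    enum { OP_INC, OP_DEC, OP_HALT };

    int run(const unsigned char *code) {
        static void *dispatch[] = { &&op_inc, &&op_dec, &&op_halt };
        int acc = 0;
        goto *dispatch[*code];                    /* dispatch the first op */
    op_inc:  acc++; goto *dispatch[*++code];      /* every handler re-dispatches, */
    op_dec:  acc--; goto *dispatch[*++code];      /* one indirect jump per handler */
    op_halt: return acc;
    }

e.g. running it on {OP_INC, OP_INC, OP_DEC, OP_HALT} returns 1. A plain switch-in-a-loop version tends to get folded back into a single indirect branch, which is exactly the duplication removal being complained about.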
Julia is a pretty cool platform for playing with this kind of thing in a REPL environment - you can dump LLVM IR and corresponding disassembly for any function, with source line annotation (some things are rearranged by the LLVM optimizer, but it is usually close). No inline IR support yet, but it's a great way to understand the relationship between high-level constructs and machine code.
Interesting, I didn't realize Julia had that. Fwiw, many Lisp systems also have this available: if you (disassemble 'foo) in SBCL, it'll show you the asm the function compiled to.
It would be nice if the substance of the article actually backed up the title... it doesn't really show this. Yes, sure, it's more portable than assembly - but plain C is way more portable still.
The places where assembly is justified these days tend to be SIMD inner loops where intrinsic use isn't effective (e.g. due to compilers being pretty lame at vector register allocation) and where a lot of gains can be had by bitops level micro-optimization that depend very precisely on the instruction behavior. I would be surprised if writing LLVM IR was actually better than using intrinsics in these cases.
If it were so it would be neat to see it— an example of taking some well optimized multimedia codec library and converting (some of) the SIMD asm into LLVM IR and showing equal performance would be impressive.
Perhaps I am mistaken, but I was under the impression that the LLVM IR was subject to change at a whim. That may be something to keep in mind if you intend to target it directly.
The C API is deemed stable. The C++ API changes every release. The textual IR has so far been forwards compatible. Bitcode is also supposed to be forwards compatible, but some bugs have prevented this in the past. Backwards compatibility is right out, which can be frustrating.
The author invokes the idea that writing assembly and writing compiler passes are similar enough activities that you can substitute one for another, but that is not the case.
Good thing, too - I don't even want to think about how buggy and slow compilers would become if random people started jamming passes into them based on nothing but expediency.
It should be noted that on the official LLVM IRC channel (#llvm@oftc.net) people do not like LLVM's IR being called a "programming language". Rather, it's intended to be a representation of LLVM's internal data structures. It may look similar to a programming language, for sure, but that's not the intent behind it. It's a representation that should help compiler engineers debug their LLVM-using frontends.
Also, if you've ever looked at clang's "-emit-llvm" output you will notice that there are some parts of the IR which are hard for humans to generate, e.g. the !dbg metadata referenced by sequential numbers.
To be fair to outsiders, http://llvm.org/docs/LangRef.html is linked from the main page where the human-readable IR format is referred to as "LLVM assembly language", and with the in-memory representation listed as an equivalent form but not with particular privilege.
The problem with writing LLVM IR code by hand is that it must be in SSA (static single assignment) form, which is a pain for humans and requires graph algorithms to insert phi nodes in the basic blocks. What we need is a non-SSA LLVM IR assembler that converts to SSA -- this would be nice for human authors.
I've worked with other (mostly Motorola-based or some sort of embedded) architectures, and while a bit cumbersome to work with, the assembly was clean and understandable.
Back in the day, assembly was my primary way of getting things done, in part because C compilers cost money back then and I barely had money left after buying a computer.
Once the "PC" (and thus Intel) had won the war and I set foot in X86-country, I started looking at the assembly. It took me less than a week to decide to give up assembly forever.
So yeah. It doesn't take much to be better than X86 assembly.
    v4si multiply_four(v4si a, v4si b) { return a * b; }
Much more readable, and just as portable. There's a typedef for v4si, of course, but you do that exactly once and then ignore it. The IR produced is identical, and therefore just as portable.
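(The typedef in question, for reference - the GCC/Clang vector extension:)

    typedef int v4si __attribute__((vector_size(16)));   /* four 32-bit ints in 16 bytes */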
On a related subject, Dan Bernstein of qmail fame created a portable assembly language a few years ago, called qhasm: http://cr.yp.to/qhasm.html The idea is that processors do pretty much the same things, but with different assembly language syntaxes. With a portable subset, one could then use translators to convert qhasm source into whatever dialect required by ARM, Intel, PowerPC, et cetera.
The latest source code dates from 2007, and it's only a prototype, but it's a clever idea.
In the examples given under "vectors", I don't see whether they are also multiplatform -- not every platform has a "process 4 floats at once this way" instruction. I also don't see where the alignment assumptions are encoded. Not every processor has the same ones, even when the "main" architecture is the same (even x86 generations differ).
I remain unconvinced that IR is the only representation you need if you do want to work at the assembly level. I welcome examples from somebody in the know.
Another topic is how often the IR is going to change.
> ... not every platform has "process 4 floats at once this way
Sure. The point is - you can write IR code that multiplies _any_ number of floats and the backend "should" generate reasonable machine code for any architecture.
> Also I don't see the encoding of assumptions of alignment
Yeah, no memory loads there, data passed in xmm0 and xmm1, plenty of simplifications.
> Another topic is how often the IR is going to change.
Fair question. I don't know but given the number of projects that currently use it I'm quite sure it'll be well maintained. Also, keep in mind that adapting IR to a future version (if one is created) is still IMO simpler than adapting assembler. Additionally - you can use IR to generate machine code once (for every architecture). That way you won't depend on your users having LLVM installed.
> you can write IR code that multiplies _any_ number of floats and the backend "should" generate reasonable machine code for any architecture.
Except that it doesn't. If you write your code with <2 x float> and codegen for SSE, you'll get <4 x float> code, with two elements going unused. It's functionally correct, but you're potentially missing out on half the throughput. If you write your code with <8 x float>, you'll get two registers for each value, but this can create extra register pressure without actually giving you any increased throughput in return.
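To make the width mismatch concrete, a hedged sketch in the same vector-extension style as the v4si example further up (the type names are mine):

typedef float v2f __attribute__((vector_size(8)));   /* like <2 x float> */
typedef float v4f __attribute__((vector_size(16)));  /* like <4 x float> */
typedef float v8f __attribute__((vector_size(32)));  /* like <8 x float> */

/* On a plain SSE target (no AVX): */
v2f mul2(v2f a, v2f b) { return a * b; }  /* one xmm register, half the lanes unused */
v4f mul4(v4f a, v4f b) { return a * b; }  /* one xmm register, the natural width */
v8f mul8(v8f a, v8f b) { return a * b; }  /* split across two xmm registers */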
I doubt it has any practical application outside llvm.git. Why would you want to write assembly by hand in the first place?
1. To bootstrap. There's various architecture-specific bootstrap code to load the kernel into memory in linux.git. The project largely depends on GNU as to assemble reliably; llvm-mc doesn't work half as well.
2. To try out new processor extensions. When new instruction mnemonics come out, assemblers have to catch up. I'm not sure why anyone would want to use LLVM IR here, because those instructions are architecture-specific anyway.
The exciting thing would be to see applications compiled to LLVM IR get JIT'ed. Finally C or C++ apps that could inline through function pointer or virtual function calls at runtime.
It's an interesting idea but the current LLVM JIT is pretty slow so it hurts my head to think about JIT'ing average size C++ programs. Leaving that aside, the cost of doing this for every codepath would probably outweigh the benefits except in very specific use cases (especially on a cpu with decent indirect prediction). This would be an ideal place for a tracing JIT, which LLVM does not have at the moment.
I write assembly most days (I write system libraries--largely floating-point and vector code--for a living). In general the assembly that I write is between 1.5x and 3x faster than what the best compilers currently produce from equivalent C code (obviously for completely trivial tasks, you usually can’t beat the compiler, but those generally don’t need to be abstracted into a library either; there are also outliers where the code I write is orders of magnitude faster than what the compiler produces). For most programmers most of the time, that’s not enough of a payoff to bother. For library functions that are used by thousands of applications and millions of users, even small performance wins can have huge impact.
Partially my advantage comes from the fact that I have more knowledge of microarchitecture than compilers do, but the real key is that I am hired to do a very different job than the compiler is: if your compiler took a day (or even an hour) to compile a small program, you simply wouldn’t use it, no matter how good the resulting code was. On the other hand, for a library function with significant impact on system performance, I can easily justify spending a week “compiling” it to be as efficient as possible.
> if your compiler took a day (or even an hour) to compile a small program, you simply wouldn’t use it, no matter how good the resulting code was
Some research has been moving in that direction, since there does seem to be a demand for it. After all, if someone is willing to wait for you to spend a week hand-tuning a function, maybe they'd also be willing to run a compiler in a "take 10 hours to crunch on this" mode. Example: http://blog.regehr.org/archives/923
That said, it’s also worth keeping in mind that the enormous differences in how computers and expert humans currently approach the problem closely parallel the differences in how computers and expert humans play chess: quickly evaluate billions of possible “moves” vs. quickly identify the few most promising “moves” and then slowly evaluate them to pick the best. I fully expect to be regularly beaten by the compiler “someday”, but (a) I believe that day is still several years off and (b) even then, I expect that expert human + compiler will beat compiler alone, just as in chess.
As a former chess engine author, and current compiler writer, with ideas in search-based compiler algorithms/unbounded compilation times, I hope to accelerate the arrival of that day :)
> For most programmers most of the time, that's not enough of a payoff to bother.
I would argue that most programmers can't beat the compiler; they probably don't have your skills and experience. I'm glad there are still people like you (I've probably used some of your code in embedded projects using VxWorks/PPC), but the truth is that many people like you are writing the compilers (or libraries, as you are). That, plus the blank stares you often get when you mention profiling and the knotted brows when you mention algorithmic complexity, leads me to believe that the great mass of programmers shouldn't try optimizing, prematurely or otherwise, especially in assembly. Play with it, learn about your whole stack, top to bottom, sure, but very few (such as yourself) can beat a good optimizing compiler.
I don’t think the tools really change; architecture reference manuals, your favorite text editor, and an assembler. Processor simulators are sometimes useful for resolving the most perplexing quandaries.
With the new AVX-512 announcement from Intel, I was trying to use the SDE to see what I could get away with in terms of reducing cycles for certain operations by loading larger values.
Have you used the Intel SDE? Do you know how helpful the -mix histogram is or isn't? Or just general docs on usage?
SDE has never been particularly interesting to me, but I can certainly see that it could be useful for some cases. In general, the simulator I want is one that completely models the CPU and generates detailed traces (like the late great SimG4).
> the simulator I want is one that completely models the CPU and generates detailed traces
I'd like one also. Lacking that, have you found any tools close enough to this to be useful? Intel's IACA is better than pen and paper, but rarely replaces it. Is there an AMD equivalent? Are PTLsim or MARSSx86 useful? (sorry for presuming x86 if not what you work on)
The most common modern cases I've run across are handwritten inner loops that make careful use of SIMD instructions, since auto-vectorization isn't quite there yet. For example, the x264 encoder has a lot of assembly in it. In x264's case they even wrote their own assembly IR, though targeted only at x86 variants: http://x264dev.multimedia.cx/archives/191
- Whoever makes JITs for the faster languages you use (Java, JavaScript, Lua) has to work at the assembly level. I've made and maintained similar things myself, for x86 and x64.
- Whoever wants to produce the fastest CPU-bound libraries or routines which are to be used in some native language (like the guys who produce the drivers for graphics cards). I've made some such routines.
Assembly is still the only way to reach the limits. If you don't have to worry about that stuff, good for you; but there are people who do, and I'm one of them.
It is at least a goal of LLVM to support this type of JITing, and they cite Lua on the home page (http://llvm.org/):
"A major strength of LLVM is its versatility, flexibility, and reusability, which is why it is being used for such a wide variety of different tasks: everything from doing light-weight JIT compiles of embedded languages like Lua to compiling Fortran code for massive super computers."
Do ask Mike Pall of LuaJIT (he's here on HN) whether he uses LLVM for LuaJIT; I think he doesn't, and that he has good reasons not to. Which of course doesn't mean that LLVM is bad in general, just that it's not a silver bullet.
When I wrote assembly code for x86, I didn't even use an assembler, just the compiler's inline assembly facility. Using a separate assembler adds one more dependency to your project, which is often undesirable. Regarding LuaJIT, there are other important aspects as well.
LLVM does have JIT capability, used by projects including Numba, Julia, Pure and a GLSL compiler (Gallium). Very useful in some contexts but definitely not a silver bullet.
Regarding LuaJIT, it compiles in about 10 seconds on a modern computer and does not use LLVM. A lot of it is written with an inline assembly preprocessor called DynASM:
> Regarding LuaJIT, it compiles in about 10 seconds on a modern computer and does not use LLVM. A lot of it is written with an inline assembly preprocessor called DynASM
I think he was trying to say "Mike Pall doesn't use LLVM-IR for LuaJIT's bytecode representation of Lua programs, nor does it just embed LLVM processor backends for JIT compilation, for good reason."
Very low-level hardware access, where you need to use architecture-specific opcodes that aren't directly accessible from a higher-level language.
Also, performance critical code where you feel you can do a better job than the compiler (good luck with that nowadays).
In the first case I feel LLVM's IR wouldn't offer any significant advantage over inline asm in some C; in the latter I'm quite perplexed. Writing very fast assembly usually means targeting a very specific architecture and using tips and tricks that make your code run faster than what the compiler might produce, and in that case I fail to see how this would work "portably" with an intermediate language.
So yeah, I don't really see what writing in this language offers over writing some plain old C. I can't really see it replace ASM for... well anything really.
The graphics and media codec people here at Mozilla regularly write assembly code to provide highly-optimized (often SIMD) code paths for different chips. Example:
Practical example: objc_msgSend (which is called for every message send in an Obj-C program) is written in assembly so it can jump to the target method implementation without disturbing any caller-save/callee-save register state: http://www.friday.com/bbum/2009/12/18/objc_msgsend-part-1-th... (and also presumably because C won't guarantee that a tail call is actually compiled as a jump).
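A hypothetical C sketch of why this needs assembly (not Apple's code; the fixed signature and explicit imp parameter are simplifications): a C trampoline must commit to one signature, and the compiler may compile the "tail call" as an ordinary call, adding a frame and touching registers it was supposed to pass through untouched.

typedef void *id;
typedef void *SEL;
typedef id (*IMP)(id self, SEL _cmd);

/* Only handles the (id, SEL) case; real message sends are signature-polymorphic. */
id c_msgSend0(id self, SEL _cmd, IMP imp) {
    return imp(self, _cmd);  /* in tail position, but a plain jmp is not guaranteed */
}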
Assembly is more relevant today than it has been in awhile, largely because Intel has gone pretty gung-ho with vector instructions that compilers mostly can't figure out how to emit on their own.
For example, say I want to convert an array of bytes into a bitmask such that every zero-valued byte is converted into a set bit. On a 64-bit system, you can handle 64 bytes at a time:
unsigned long bits = 0;
for (unsigned char i = 0; i < 64; ++i) {
    bits |= ((unsigned long)(bytes[i] == 0) << i);
}
This isn't a particularly efficient use of the CPU. If you've got a CPU that supports AVX2, you've got two very powerful instructions available to you: VPCMPEQB and VPMOVMSKB. VPCMPEQB compares two YMM registers for equality at byte granularity and, for every byte position where they are equal, sets that byte of the destination register to all ones. VPMOVMSKB takes the high bit of each byte of a YMM register and stores the result in a GPR. Since YMM registers are 256 bits (32 bytes), you can reduce the entire loop above to just a handful of instructions: two vector loads, two pairs of VPCMPEQB (against zero) and VPMOVMSKB, and an OR. Instead of a loop processing data a byte at a time, you have straight-line code processing data 32 bytes at a time.
A 4th generation Core CPU (Haswell) has a tremendous amount of bandwidth. It can do two 32-byte loads per clock cycle, two 32-wide byte comparisons per clock cycle, etc. If you're writing regular C code, dealing with 8-byte longs or 4-byte ints, you're leaving much of that bandwidth on the table.
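Here is a minimal intrinsics sketch of the sequence described above, assuming AVX2 and a 64-byte input (the function name is mine; build with something like -mavx2):

#include <immintrin.h>
#include <stdint.h>

uint64_t zero_byte_mask(const unsigned char *bytes) {
    __m256i zero = _mm256_setzero_si256();
    __m256i lo = _mm256_loadu_si256((const __m256i *)bytes);        /* bytes 0..31  */
    __m256i hi = _mm256_loadu_si256((const __m256i *)(bytes + 32)); /* bytes 32..63 */

    /* VPCMPEQB: 0xFF in every byte position that equals zero. */
    __m256i eq_lo = _mm256_cmpeq_epi8(lo, zero);
    __m256i eq_hi = _mm256_cmpeq_epi8(hi, zero);

    /* VPMOVMSKB: gather the high bit of each byte into a GPR. */
    uint32_t m_lo = (uint32_t)_mm256_movemask_epi8(eq_lo);
    uint32_t m_hi = (uint32_t)_mm256_movemask_epi8(eq_hi);

    return ((uint64_t)m_hi << 32) | m_lo;  /* bit i set <=> bytes[i] == 0 */
}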
FWIW, those particular instructions have been around since the Pentium 3. Most of the recent advances have been in width (64 -> 128 -> 256 -> ...) and execution units of that actual width.
It's good to know about SIMD instructions, but you can usually get at them with intrinsics (if you're writing in C) without dropping down to pure assembly.
Knuth, in The Art of Computer Programming, uses the (artificial) MIX assembly language to describe algorithms. It felt awkward to me at first, but it makes a lot of sense after the first few chapters.
Assembly has real-world users, but it also has a niche in the academic community.
Embedded system folks who target very small and cheap processors. (The last time I did this, I had a 2K code budget, and the CPU was under 20 cents).
People hacking at a very low level, e.g., for boot code, or TLB miss handlers.
Places where you need to be highly efficient, and where cycles matter.
Places where compilers don't operate well (e.g., specialized instructions for doing shared memory operations, or stack manipulations for doing things like coroutines).
I might go a year or so without touching assembly now, but not much more than that.
The last time I wrote assembly was for clock cycle accurate timing manipulating some IO pins at rates very near the limits of the processor. No time or tolerance for interrupt handlers and timers.
Before that, it was to program a pair of limited-range timers to run with slightly different periods so I could lazily read them from C; by examining their phase and values I could determine whether I had an uninterrupted pair of readings and, if so, how many times the timers had rolled over, thus implementing a single high-resolution, extended-range timer.
It is also used to exploit processor instructions which do not yet have compiler support.
People have told you about GPU optimization and such, and that's all valid.
Another very important one is security. In big companies and governments, people use tools like IDA Pro to check what really executes on the machine, in order to protect communications and the like from snooping by other companies or governments, for example.
I've written ARM assembly for iOS apps to optimize carefully for memory ordering constraints and memory access latency (eg, pipeline stalls), and made use of NEON SIMD for certain critical paths.
This has yielded (ballpark) 2x-5x improvements to runtime performance; some operations essentially become 'free' from the application perspective whereas they took a significant hit previously and could cause UI stuttering and/or significant CPU burn (which also directly correlates to battery life consumption).
Assembly is far from dead in desktop/mobile development.
For those wanting to play with LLVM as a library but who would rather not deal with C++, the llvm-general Haskell library gives you nearly all of LLVM's power (aside from link-time optimization and writing new optimization passes) in a nice high-level library that's already being used by a few interesting compilers out there.
This reminds me that I need to learn some assembler. I have a hard time finding resources on conventions, project layout, best practices, and tooling. If anybody's experienced with an assembler, I'd love some tips and pointers.
- LLVM at the IR level still has platform-specific information: calling conventions, vector types, pointer widths, etc.
- what was true then is still true now: http://lists.cs.uiuc.edu/pipermail/llvmdev/2011-October/0437...
- typed assembly languages have existed for a while, no one uses them