
v8 (before v5.9) used to only inline functions that were under 600 characters (and 196 AST nodes). That was another fun way to add fuel to the tab vs. spaces fire: identical functions that used spaces instead of tabs could run significantly slower because they weren't inlined.



> The kernel compiler, GCC, is unaware to the code size that will be generated by the inline assembly. It therefore tries to estimate its size based on newline characters and statement separators (';' on x86).

Seems like a poor decision to use whitespace as a factor in estimating code size. Some people like to space out their code with lots of comments even if it is only a few lines long. This is my naive view, though - does anyone out there have any insight as to why semicolons alone aren't sufficient to estimate code size?


This is measuring code size in inline assembly. The assumption underlying the heuristic is that code size is proportional to the number of assembly instructions and that a line of inline assembly is usually an instruction (although a separator may mean multiple instructions per line). It is not looking at the C code itself, only the inline assembly string.

That heuristic is actually usually a very good one. However, in this case, the inline assembly contains a lot of computation in directives, so it takes several lines to emit one small instruction.
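To make that concrete, here's a sketch of the kind of pattern involved (made up for illustration, not the actual kernel macros): one real instruction surrounded by directives that emit metadata into another section, so the string spans several lines but contains almost no code.

    /* One instruction, several lines: the directives only emit data into a
       separate section, yet the newline count makes this look "big". */
    static inline void marked_nop(void)
    {
        asm volatile("1: nop\n\t"
                     ".pushsection .discard.example, \"a\"\n\t"
                     ".long 1b - .\n\t"
                     ".popsection");
    }

Counting newlines suggests four statements' worth of code here, when the real cost at the call site is a single nop.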

To answer the question you actually asked rather than what you meant to ask (because the question you asked is interesting in and of itself): compilers throw away the actual source code very quickly. By the stage at which inlining is considered, the program consists of a sequence of instructions for something like the C abstract machine, rather than a more direct representation of the source. The simplest way to estimate a program's size is to simply count these instructions, but some instructions stand a good chance of being eliminated (say, a bitcast instruction that changes type without adjusting value), so there's usually some adjustment for "this is really expensive on this machine" or "this instruction is free."
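A rough sketch of that kind of estimate (invented IR and cost numbers, purely to illustrate the idea):

    /* Walk the callee's instructions and sum per-instruction costs, with
       fudge factors for things that are free or extra expensive. */
    enum ir_op { IR_BITCAST, IR_ADD, IR_DIV, IR_CALL };
    struct ir_insn { enum ir_op op; };

    static int estimated_size(const struct ir_insn *body, int n)
    {
        int cost = 0;
        for (int i = 0; i < n; i++) {
            switch (body[i].op) {
            case IR_BITCAST: break;            /* usually free: no code emitted */
            case IR_DIV:     cost += 4; break; /* "really expensive on this machine" */
            default:         cost += 1; break; /* one instruction, one unit */
            }
        }
        return cost;   /* compared against the inlining threshold */
    }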


Arguably inlining (or not) should be deferred to the link / output phase, where there would be enough information to decide whether to perform it while putting the puzzle of segments together.


The problem with doing this is that many other optimizations depend on inlining. So, if inlining is deferred until link time, those optimizations will be far less useful.

Take auto-vectorization, for example. If we have a loop that just calls a function which increments a value in an array, we may very well have a great candidate to vectorize. However, if the function isn't inlined, the vectorizer will just see a function call as opposed to an increment and won't be able to do anything with it.
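A toy example (hypothetical names):

    static void bump(int *p) { *p += 1; }

    void bump_all(int *a, int n)
    {
        /* If bump() is inlined, the body is just a[i] += 1 and the
           vectorizer can turn the loop into SIMD adds; if it stays an
           opaque call, the vectorizer has to give up. */
        for (int i = 0; i < n; i++)
            bump(&a[i]);
    }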


Why isn't optimization a repeated process? Apply some sequence of operations in a loop until diminishing gains are detected? So, for example, run inlining again after the code has been lowered. I'm guessing it's actually really complex to implement, and that some optimizations can't run once the code has been lowered?
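Roughly what I'm imagining (a made-up driver, not any real compiler's API):

    struct ir_module;                            /* opaque program IR */
    typedef int (*pass_fn)(struct ir_module *);  /* each pass returns how many changes it made */

    void optimize(struct ir_module *m, pass_fn *passes, int npasses)
    {
        int changed;
        do {
            changed = 0;
            for (int i = 0; i < npasses; i++)
                changed += passes[i](m);   /* e.g. inline, lower, inline again */
        } while (changed > 0);             /* or: while the gains exceed some cutoff */
    }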

Similarly, the article talks about how the LLVM inliner is clearly smarter here, since it seems likely they ask their built-in assembler to provide an estimate of the code size rather than using coarser heuristics. It seems like there's clearly a better path forward for GCC.


> Why isn't optimization a repeated process? Apply some sequence of operations in a loop until diminishing gains are detected?

Compile time. Compiler optimizations are a tradeoff between code size, running time, and compile time, complicated by the fact that sneezing at your code can change performance by 40% or more, and the fact that correctness comes with steep performance costs and often isn't actually desired (until it really is).

Of all the optimizations, inlining is probably the worst for the difficulty of balancing the triad of objectives and the black magic of its heuristics.


The part of this that I don't understand is: since we will always know more information later in time, why not defer such decisions until runtime?

HP Labs' Dynamo project (~20 years ago) was a JIT for PA-RISC on PA-RISC, and it made most programs run faster, in large part by being able to inline functions at runtime even across library boundaries.

Was there something impractical about this strategy that caused it to end up in the dustbin of computer science history?


> The part of this that I don't understand is: since we will always know more information later in time, why not defer such decisions until runtime?

Latency vs. throughput. And the cost of an optimization vs. the win from it.


> The part of this that I don't understand is: since we will always know more information later in time, why not defer such decisions until runtime?

I don't think I fully agree with this. The amount of information available to the JIT compared to an AoT compiler also depends on what input a JIT works on. In the case of Java, I'd imagine that the JVM has a strictly greater amount of information to work with than an AoT compiler because JVM bytecode carries basically all the information available in Java source code. However, a Rust JIT would have semantic information that a lower-level JIT like Dynamo wouldn't (assuming I'm understanding Dynamo right), like aliasing information. Dynamo could infer that a particular memory store probably doesn't affect a later memory load, and optimize based on that, but a Rust JIT could guarantee it, allowing for the same potential optimizations without having to insert deoptimization checks.
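A rough C analogue of that aliasing point (illustrative only), with restrict standing in for the guarantee a Rust front end could hand to a JIT:

    /* The compiler may assume stores through dst never change *scale, so
       *scale can stay in a register for the whole loop, with no runtime
       check or deoptimization guard needed: it's a guarantee from the
       source language. */
    void scale_all(float *restrict dst, const float *restrict src,
                   const float *restrict scale, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] = src[i] * *scale;
    }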

This is where profile-guided optimization comes into play, I think. Assuming a representative profile (and that's admittedly a big if), you can get much of the same information that a JIT gets while also getting to work with a potentially semantically-richer source.

> HP Labs' Dynamo project (~20 years ago) was a JIT for PA-RISC on PA-RISC, and it made most programs run faster, in large part by being able to inline functions at runtime even across library boundaries.

I'm curious what kind of code Dynamo was tested on, as the effectiveness of a JIT depends on the particular code patterns of the software being run. For example, JITs can help a lot with code with lots of virtual function calls or indirections that traditional compilers tend to stumble on, but I don't think a JIT would help as much for something like a cache-optimized computational kernel.

Admittedly, my point is weakened by the graph shown in an old Ars Technica article on Dynamo [0], where Dynamo showed significant improvements on the SPEC benchmark suite, which I assume doesn't rely on shared libraries. Now I'm curious whether similar differences can be shown given the nearly 20 years C/C++/Fortran compilers have had to improve.

> Was there something impractical about this strategy that caused it to end up in the dustbin of computer science history?

I'd hardly say that JITs ended up in the dustbin of computer science history; on the contrary, they're quite common. The JVM and the various Javascript JITs are probably the most well-known ones; you also have Pypy (and everything derived from RPython), LuaJIT, Julia, MATLAB, the new JITs showing up in Ruby and PostgreSQL, and the more recent push to develop new JITs (like Graal). The JVM got its JIT in 1998, and the JIT was made the default in 1.3 in 2000, so it's not like all these are recent developments.

Now, between all that, I don't know why a low-level JIT like Dynamo (if I understood the article I found [1] right) didn't become more popular. It certainly seemed like promising technology at the time. I'd love to know why it seemed to die out and whether it's worth pursuing today.

[0]: http://archive.arstechnica.com/reviews/1q00/dynamo/dynamo-3.... [1]: http://archive.arstechnica.com/reviews/1q00/dynamo/dynamo-1....


> Seems like a poor decision to use whitespace as a factor in estimating code size.

It's a heuristic. It's always going to be poorer in some respect than doing something precise. You could probably make it better in a million different ways, but it's supposed to be quick and simple.


Even with a simple heuristic, wouldn't you try to pick this magic number 600 so that it's about as likely to be too big as too small?

Or is the problem very asymmetric, that inlining too-large pieces has dramatic downsides that failing to inline small pieces does not? Or is it just that the magic number was chosen decades ago, and the tradeoffs (perhaps re memory) are different now?


> wouldn't you try to pick this magic number 600...

I guess ideally you'd run a sophisticated machine learning algorithm across a huge corpus of real-world code, on real-world devices to discover the right number... but in reality people just don't have time for that so they pick a number and move on. Can't really blame them.


I had a binary search in mind, if that counts as sophisticated :) Just surprised by orig GGP's assertion that more inlining would automatically be faster.


To inline, the parsing must be done first. Once parsing is done there are no newlines and semicolons, only AST nodes. Counting AST nodes is very quick and simple and provides a much better heuristic.


> the parsing must be done.

Therein lies the bulk of the problem: GCC doesn't parse the assembly code. The assembly is just copied with some minor string replacements for the register names, and it's up to the assembler to deal with it and parse it.
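A trivial illustration: the asm body below is just a string as far as GCC is concerned. It substitutes %0 with whatever register the allocator picked and hands the result to the assembler, without ever knowing how many instructions are inside.

    static inline int add_one(int x)
    {
        asm("add $1, %0" : "+r"(x));   /* "%0" becomes e.g. "%eax"; the rest is opaque text */
        return x;
    }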


> to inline the parsing must be done

Yes... that's the point. For your metric of counting AST nodes, you need to have parsed the method. The other heuristic works on unparsed source code, so you don't waste time parsing a method that you won't inline.

> Counting AST nodes is very quick and simple

But parsing is neither. And you need to have done that first. I think (there's a paper, I'm sure, but I can't find it) that parse time for JS applications is very substantial.


You don't want to inline functions that haven't been parsed. If it hasn't been called by the time you are JITing, then it probably isn't going to be on a hot path.


If it's been JITed then the AST may well have been discarded after the JIT completed! Doesn't V8 work like that today? You'll need to parse it again even if you have already parsed it!

It's not hard to think of worst cases. Think about an enormous function that you would not want to inline. It's been JITed, or it's running in a bytecode interpreter, and the AST has been discarded. In your scheme you'll have to parse it just to find that it's too big. That's a huge waste of time if you can see it's too large from the source code.


You can't throw away your interpreter bytecode, because the JIT code needs to be able to bail out if a guard fails (e.g. if you specialized a var as an int and you get a non-int).

So you can always use the size of your bytecode as an inlining heuristic (which is what Chakra does).

Of course that wouldn't work before v8 had an interpreter, but they could have used other heuristics. For example, they could have used the size of the AST as a heuristic and saved that even after the AST gets thrown away.


Read the article - LLVM does it correctly.


You don’t put semicolons between instructions in assembly. A semicolon starts a comment that runs to the end of the line. Instructions go one per line.


No, the GNU assembler (unlike some others) does use semicolons as an instruction separator. Comment markers include /* */ and // (like C), as well as potentially #, @, or others depending on the architecture.
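For example, on x86 with GCC and the GNU assembler:

    void two_nops(void)
    {
        /* One line of inline asm, but ';' separates two statements, so two
           nop instructions end up in the output. */
        asm volatile("nop; nop");
    }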

Edit: Actually, on some lesser-used architectures it does support using ; to start a comment, e.g. z80 [1], but not the most common ones.

[1] https://sourceware.org/binutils/docs-2.26/as/Z80_002dChars.h...



